AWS CloudFront can log activity on a distribution as numerous (often small) access log files, written from all edge locations into a single central S3 bucket (S3 stands for “Amazon Simple Storage Service”). The access logs are created and shipped with a slight delay, but typically arrive in the S3 bucket within minutes. The sheer number of distinct files makes the logs hard to analyze without tools or scripts.
This topic explains how to analyze CloudFront logs spread across thousands of separate log files using Invantive UniversalSQL by creating a table. This table can be stored in memory or in a traditional SQL database, or streamed to, for instance, Power BI or an Azure Data Factory pipeline.
CloudFront Log File Format
The CloudFront access logs are all compressed using gzip and stored with a .gz extension.
Each CloudFront log file consists of fields separated by the tab character. The first line contains the version number as a comment, and the second line contains the order of the fields as a comment. Each individual access to the distribution is stored as a separate line in the log file.
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end
2021-09-04 00:38:43 AMS54-C1 3245 99.999.210.253 GET dd598hveo3k9a.cloudfront.net / 200 - Mozilla/5.0%...Safari/537.36 - - Hit HL5_dvnGCs_hoMi2qsVDUD3dQ_faFygRD0zR8TGkpzluUnYZSL3XVw== eolclientredirect.invantive.com https 384 0.001 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Hit HTTP/1.1 - - 47113 0.001 Hit text/html - - -
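The format described above can be parsed without any special tooling. The following Python sketch reads one gzip-compressed log file, picks up the field names from the #Fields: comment line, and returns each request as a dictionary; the function name and return shape are illustrative choices, not part of CloudFront itself:

```python
import gzip
import io

def parse_cloudfront_log(data: bytes) -> list[dict]:
    """Parse one gzip-compressed CloudFront access log.

    Returns a list of dicts keyed by the field names taken from the
    '#Fields:' comment line. A minimal sketch of the format described
    above; production code may need extra error handling.
    """
    records = []
    fields = []
    with gzip.open(io.BytesIO(data), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("#Fields:"):
                # The field names follow the '#Fields:' marker.
                fields = line[len("#Fields:"):].split()
            elif line.startswith("#") or not line:
                continue  # '#Version: 1.0' comment or blank line
            else:
                # Data lines are tab-separated, one request per line.
                records.append(dict(zip(fields, line.split("\t"))))
    return records
```

Each record can then be inspected by field name, e.g. `record["sc-status"]` for the HTTP status code.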
Analyze AWS CloudFront Logs
The first step is to download the Amazon CloudFront log files from the AWS S3 bucket to a local folder, for instance using the aws command-line interface (aws s3 sync).
Then use the following SQL query to create a table containing all accesses from the logs:
local define FOLDER "C:\some\logs"
create or replace table httprequests@inmemorystorage
as
select cte.*
--
-- Scan for log files starting with "E", ending on ".gz".
--
from files@os('${FOLDER}', 'E*.*gz', true) fle
join read_file@os(fle.file_path) rfe
join csvtable
--
-- Unzip the file and then remove the comment lines.
--
( passing regexp_replace(to_char(ungzip(rfe.file_contents)), '^#.*$', '', 1, 1, 'm')
column delimiter chr(9)
columns date date not null position next
, time time not null position next
, x_edge_location varchar2 not null position next
, sc_bytes int64 not null position next
, c_ip varchar2 not null position next
, cs_method varchar2 not null position next
, cs_host varchar2 not null position next
, cs_uri_stem varchar2 not null position next
, sc_status number(3, 0) not null position next
, cs_referer varchar2 not null position next /* Referrer string. */
, cs_user_agent varchar2 not null position next
, cs_uri_query varchar2 not null position next
, cs_cookie varchar2 not null position next
, x_edge_result_type varchar2 not null position next
, x_edge_request_id varchar2 not null position next
, x_host_header varchar2 not null position next
, cs_protocol varchar2 not null position next
, cs_bytes int64 not null position next
, time_taken decimal not null position next
, x_forwarded_for varchar2 not null position next
, ssl_protocol varchar2 not null position next
, ssl_cipher varchar2 not null position next
, x_edge_response_result_type varchar2 not null position next
, cs_protocol_version varchar2 not null position next
, fle_status varchar2 not null position next
, fle_encrypted_fields varchar2 not null position next
, c_port uint16 not null position next
, time_to_first_byte decimal not null position next
, x_edge_detailed_result_type varchar2 not null position next
, sc_content_type varchar2 not null position next
, sc_content_len varchar2 not null position next
, sc_range_start varchar2 not null position next
, sc_range_end varchar2 not null position next
) cte
The log data is then stored in an in-memory table that can be queried.