Analyze CloudFront access logs using Invantive UniversalSQL

AWS CloudFront can log activity on a distribution by writing numerous (often small) access log files from all edge locations into a single central S3 bucket (S3 stands for “Amazon Simple Storage Service”). The access logs are created and shipped with a slight delay, but typically arrive in the S3 bucket within minutes. The sheer volume of distinct files makes them quite hard to analyze without tools or scripts.

This topic explains how to analyze CloudFront logs stored in thousands of separate log files using Invantive UniversalSQL by creating a table. This table can be stored in-memory or in a traditional SQL database, or streamed to, for instance, Power BI or an Azure Data Factory pipeline.

CloudFront Log File Format

The CloudFront access logs are all compressed using gzip, and stored with a .gz extension.

Each CloudFront log file consists of fields separated by the tab character. The first line contains the version number as a comment, and the second line contains the order of the fields as a comment. Each individual access to the distribution is stored as a separate line in the log file.

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end
2021-09-04	00:38:43	AMS54-C1	3245	99.999.210.253	GET	dd598hveo3k9a.cloudfront.net	/	200	-	Mozilla/5.0%...Safari/537.36	-	-	Hit	HL5_dvnGCs_hoMi2qsVDUD3dQ_faFygRD0zR8TGkpzluUnYZSL3XVw==	eolclientredirect.invantive.com	https	384	0.001	-	TLSv1.2	ECDHE-RSA-AES128-GCM-SHA256	Hit	HTTP/1.1	-	-	47113	0.001	Hit	text/html	-	-	-

Analyze AWS CloudFront Logs

The first step is to download the Amazon CloudFront log files from the AWS S3 bucket to a local folder, for instance using the aws command line.

Then use the following SQL query to create a table with all access records from the logs:

local define FOLDER "C:\some\logs"

create or replace table httprequests@inmemorystorage
as
select cte.*
--
-- Scan for log files starting with "E", ending in ".gz".
--
from   files@os('${FOLDER}', 'E*.*gz', true) fle
join   read_file@os(fle.file_path) rfe
join   csvtable
        --
        -- Unzip the file and then remove the comment lines.
        --
       ( passing regexp_replace(to_char(ungzip(rfe.file_contents)), '^#.*$', '', 1, 1, 'm')
         column  delimiter chr(9)
         columns date                        date         not null position next
         ,       time                        time         not null position next
         ,       x_edge_location             varchar2     not null position next
         ,       sc_bytes                    int64        not null position next
         ,       c_ip                        varchar2     not null position next
         ,       cs_method                   varchar2     not null position next
         ,       cs_host                     varchar2     not null position next
         ,       cs_uri_stem                 varchar2     not null position next
         ,       sc_status                   number(3, 0) not null position next
         ,       cs_referer                  varchar2     not null position next /* Referrer string. */
         ,       cs_user_agent               varchar2     not null position next
         ,       cs_uri_query                varchar2     not null position next
         ,       cs_cookie                   varchar2     not null position next
         ,       x_edge_result_type          varchar2     not null position next
         ,       x_edge_request_id           varchar2     not null position next
         ,       x_host_header               varchar2     not null position next
         ,       cs_protocol                 varchar2     not null position next
         ,       cs_bytes                    int64        not null position next
         ,       time_taken                  decimal      not null position next
         ,       x_forwarded_for             varchar2     not null position next
         ,       ssl_protocol                varchar2     not null position next
         ,       ssl_cipher                  varchar2     not null position next
         ,       x_edge_response_result_type varchar2     not null position next
         ,       cs_protocol_version         varchar2     not null position next
         ,       fle_status                  varchar2     not null position next
         ,       fle_encrypted_fields        varchar2     not null position next
         ,       c_port                      uint16       not null position next
         ,       time_to_first_byte          decimal      not null position next
         ,       x_edge_detailed_result_type varchar2     not null position next
         ,       sc_content_type             varchar2     not null position next
         ,       sc_content_len              varchar2     not null position next
         ,       sc_range_start              varchar2     not null position next
         ,       sc_range_end                varchar2     not null position next
       ) cte

The log data is now stored in an in-memory table that can be queried.
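For example, once the table has been created, the distribution of HTTP status codes across all requests can be inspected with a regular aggregation query. This is a sketch; the `cnt` alias and the positional `order by` are used here for illustration:

```sql
--
-- Count the number of requests per HTTP status code,
-- most frequent status first.
--
select sc_status
,      count(*) cnt
from   httprequests@inmemorystorage
group
by     sc_status
order
by     2 desc
```

Similar queries can break the traffic down by edge location (`x_edge_location`), requested path (`cs_uri_stem`) or cache result (`x_edge_result_type`).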