BigQuery is a cloud-based data warehousing tool that lets you store and analyze petabytes of data at lightning speed. For GA 360 users, Google provides you the option to get the daily data dump for the website’s sessions into BigQuery. You can use this data to overcome the limitations of Google Analytics.

I. How does it work?:

A. Every day Google Analytics stores its sessions’ data in BigQuery Servers. This session data is a table where each row is dedicated to a user visit while each column represents a different dimension or metric that can be repeated and nested. For a rough idea: Column A: visitor Id or cookie id, Column B: date of the session. The point where it gets complex is when the table stores all the hits(events), page views, and custom dimensions in one row. This is what makes BigQuery different from a flat table system.

Google Analytics Data Sample in BigQuery

B. This table acts as our input. First, we un-nest and un-stitch this complex nested data by flattening it. And then, we stitch it back according to our needs.

Flattened Data

C. Being a proper cloud based ETL tool, it provides us great transformation features and returns un-sampled data at great speed.

Random person trying to explore BigQuery Schema

In the following sections, you will get an in-depth knowledge about How Google Analytics stores and calculate all the reports. Expect some creative ideas to pop in your mind to find answers which Google Analytics cannot provide.

II. Prerequisites 

The following sections will expect basic knowledge of Google Analytics Metrics and some knowledge of SQL. Here is great article by Benjamin from ‘LovesData’ to revise Google Analytics Metrics Concepts. Another great tutorial for revising SQL concepts by my favorite tutorial site: W3 School.

III. Lets Start Querying

Here is the detailed schema of Google Analytics data stored in BigQuery. I will first start with the high level and basic metrics like Sessions, Users etc and then gradually move to more deep and complex metrics. There are two ways to calculate the high level metric: First way is to query the ‘totals’ record in the table; Second way is to query the flattened table (resolved complex data structure to a flat table) and apply relevant logic. I will take the second approach for all the metrics in this blog. We will be using this particular dataset for learning purpose


1. Users

The table contains a field named ‘Full Visitor Id’. This is nothing but the cookie ID that is unique for a browser on machine. So, if you can find the distinct number of these IDs, you can find the number of Users.

Select Count ( Distinct fullVisitorId) as Users from `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits

2. Sessions

Along with the field ‘Full Visitor Id’, the table contains fields such as ‘visitNumber’ that is the sequence number of a particular session for that user (Full Visitor id). Also, ‘visitStartTime’ denotes the time when the session was started. If we concatenate these terms and find the distinct count, we will get the number of sessions.

Select
Count ( Distinct
CASE
WHEN totals.visits=1 THEN
CONCAT( fullvisitorid,”-“,CAST(visitNumber AS string),”-“,CAST(visitStartTime AS string))
End)
as Sessions
from`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits

3. Pageviews

For calculating the number of pages viewed, we will use the “hit type” field by counting the number of times there was pageview hit/event in the session.

Select
SUM(
Case when hits.type=”PAGE” then 1 else 0 END
)
as Pageviews
from`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits

4. Unique PageViews

Unique Pageviews are calculated by ignoring the duplicate pageviews for a session. If a Page A has 2 pageviews in a sessions, the unique pageviews of A will only be 1.

Hence, we need a combination of Session Identifier and Page Identifier, and take a unique count of this combination.

Select
Count ( Distinct
CONCAT( fullvisitorid,”-“,CAST(visitNumber AS string),”-“,CAST(visitStartTime AS string),”-“,hits.page.pagePath)
)
as Unique_Pageviews
from`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits

5. Bounce Rate

Bounces are the sessions that had exactly one interaction event. To calculate this, we will calculate total bounces and divide it by the number of sessions.

SELECT
Page,
( ( bounces / sessions ) * 100 ) AS Bounce_Rate,
Sessions
FROM (
SELECT
hits.page.pagePath AS Page,
Count ( Distinct
CASE
WHEN totals.visits=1 THEN
CONCAT( fullvisitorid,”-“,CAST(visitNumber AS string),”-“,CAST(visitStartTime AS string))
End)
as Sessions,
SUM ( totals.bounces ) AS Bounces
from
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits
GROUP BY
Page )
ORDER BY
Sessions DESC

6. Entrances

Entrances are calculated by using a field called isEntrance. This field has the value “TRUE” if the hit is the first one of the session.

Select
hits.page.pagePath AS Page,
SUM(
CASE
WHEN hits.isEntrance = TRUE and hits.type=”PAGE” AND totals.visits=1 THEN 1
ELSE 0
END
) AS Entrances
from`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits
group by Page

7. Exits

Similarly, there is a field dedicated to exits as well. It is set to TRUE if the hit is the last hit of that session.

Select
hits.page.pagePath AS Page,
SUM(
CASE
WHEN hits.isExit = TRUE and hits.type=”PAGE” AND totals.visits=1 THEN 1
ELSE 0
END
) AS Exits
from`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits
group by Page

8. Average Session Duration

For calculating the engagement metric, Avg. Session Duration, we will first calculate the session duration of each session. This is done by find the hit time of the interactive hit in that session. This duration is then aggregated for a dimension such as Channel and divided by the number of sessions.

Select Channel, SUM(Total_Session_Duration)/Count(Distinct Session) as Avg_Session_Duration
from(
Select
Channel, Session,
MAX(hitTIme)as Total_Session_Duration
from(
Select
channelGrouping as Channel,
case when totals.visits=1 then CONCAT( fullvisitorid ,”-“,Cast(visitNumber as string),”-“,cast(visitStartTime as string)) end as Session,
Case when hits.IsInteraction=TRUE then hits.Time/1000 else 0 end as hitTime
from`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits)
group by channel, session)
group by Channel

9. Average Time on Page

Calculating average time on Page is similar to calculating avg. session duration. The major difference is that we aggregate the timestamps of last interactive hits of the particular page instead of sessions’.

select

Page,
SUM(TIMEOnPage) as TimeOnPage,
SUM(Exits) as Exits,
SUM(Pageviews) as Pageviews,
safe_divide(SUM(TIMEOnPage),(SUM(Pageviews)-Sum(Exits))) as Avg_Time_On_Page
from(
SELECT
Sessions, Page, Pageviews,
Case when exit =TRUE then LastInteraction-hitTime
else LEAD(hitTime) OVER (PARTITION BY Sessions ORDER BY hitseq) – hitTime end
as TimeOnPage,
Exits
FROM (
SELECT
CASE
WHEN totals.visits=1 THEN CONCAT( fullvisitorid,”-“,CAST(visitNumber AS string),”-“,CAST(visitStartTime AS string))
END
AS Sessions,
hits.Page.pagePath AS Page,
hits.IsExit AS exit,
Case when hits.Isexit =TRUE
then 1 else 0 end As Exits,
hits.hitNUmber as hitSeq,
hits.Type AS hitType,
hits.time/1000 AS hitTime,
CASE WHEN type=”PAGE” AND totals.visits=1 THEN 1
ELSE 0 END AS PageViews,
MAX(
IF (hits.isInteraction =TRUE,
hits.time / 1000,0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS LastInteraction
from
`dm-corp-marketing-001.137933647.ga_sessions_20200803` , UNNEST(hits) AS hits
order by Sessions,hitSeq)
WHERE
hitType=’PAGE’
)
group by Page order by Pageviews desc

10. Event Based Goal

If you want calculate the total completions of an event based goal, you need to count the number of sessions where that event occurred. The below example counts goal completion for Event Category: Lead Generation and Event Action: Brochure Download. I am using a regex filter instead of ‘equal to’ operator; you can use either in this case.

Select Count( distinct CASE WHEN REGEXP_CONTAINS(hits.eventInfo.eventAction,r’^Brochure Download$’) AND REGEXP_CONTAINS(hits.eventInfo.eventCategory,r’^Lead Generation’) THEN CONCAT( fullvisitorid,”-“, CAST(visitStartTime AS string) ) end) as Goal_Lead_Generation from`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits

11. Nth() Page Path

If you want to see the most common Nth() pages such as most common 1st page (Landing Page), most common 2nd Page (page after Landing Page) and so on, then this piece of code is for you. You manipulate this code to see different page flows up to Nth() level and Top Page Path levels, and also see these ‘data views’ for particular behavior like conversions, device types etc.

SELECT
second_page_path, count (distinct SessionIdentity) as Sessions
FROM (
SELECT
CASE WHEN totals.visits=1 THEN CONCAT( fullvisitorid,”-“,CAST(visitNumber AS string),”-“,CAST(visitStartTime AS string)) END
AS SessionIdentity,
CASE WHEN hits.isEntrance=TRUE THEN hits.page.pagePath END
AS Landing_Page,
CASE WHEN hits.isEntrance = TRUE THEN LEAD( hits.page.pagePath,1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.type)
ELSE NULL END AS second_page_path,
CASE WHEN hits.isEntrance = TRUE THEN LEAD( hits.page.pagePath,2) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.type)
ELSE NULL
END AS third_page_path,
CASE WHEN hits.isEntrance = TRUE THEN LEAD( hits.page.pagePath,3) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.type)
ELSE NULL END AS fourth_page_path,
CASE WHEN hits.isEntrance = TRUE THEN LEAD( hits.page.pagePath,4) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.type)
ELSE NULL END AS fifth_page_path
from
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` , UNNEST(hits) AS hits
ORDER BY SessionIdentity,Landing_Page)
WHERE SessionIdentity IS NOT NULL AND landing_page IS NOT NULL
GROUP BY second_page_path
ORDER BY Sessions Desc


I will conclude by saying BigQuery is a great tool to leverage with GA data. It provides you the freedom of seeing the data in ways that are not possible to see via Google Analytics.

Let me know in the comment section if you have any questions or suggestions!

This Post Has One Comment

Leave a Reply