Minhaz Kazi, Developer Advocate, Google Analytics – October 2022
Measuring exact distinct counts (i.e. cardinality) for large datasets requires significant memory and affects performance. Google Analytics 4 properties use HyperLogLog++ (HLL++) algorithm to estimate cardinality for most used metrics including Active Users and Sessions. For example, when viewing Active Users in the Google Analytics UI as well as through the Data API, the count will be an approximation. Using HLL++ for these metrics ensures better performance with higher estimation accuracy and lower error bounds. This blogpost provides details on how HLL++ is used in Google Analytics 4 properties and how you can replicate some of the estimations using BigQuery event export data.
HLL++ estimates cardinality while using less memory and improving performance.
HLL++ has augmentations made over the HyperLogLog algorithm and is based on
HyperLogLog in Practice: Algorithmic Engineering of a State of the Art
Cardinality Estimation Algorithm. Changes in the HLL++
trade off memory usage vs. accuracy of the computed stats. Increasing the
parameter value lowers the error, but increases the memory consumption.
Unique count implementation in BigQuery
COUNT(DISTINCT)to measure exact cardinality. This approach requires more memory and will take longer to run, especially for large datasets.
APPROX_COUNT_DISTINCTapproximates results with HLL++. However,
APPROX_COUNT_DISTINCTdoes not allow users to configure the approximation accuracy.
- To use custom
precisionvalues, use HyperLogLog++ functions. See HLL++ Sketches for permitted
precisionvalues and the confidence intervals for typical precisions.
sparse precisionis another parameter for HLL+. In BigQuery,
sparse precisionvalue is not user definable and is fixed at
precision + 5.
HLL++ Implementation in Google Analytics 4 properties
Google Analytics 4 uses the following configuration for measuring cardinality of the related metrics.
Using BigQuery HLL++ Functions with Google Analytics Event data
If your Google Analytics 4 property's event data is available in BigQuery, you can try to match the metrics from UI with BigQuery data. The following examples assume:
- Reporting identifier is set to
By device onlyfor the Google Analytics 4 property.
- Other potential sources of discrepancy, e.g. reporting timezone, are already addressed.
Exact count using
SELECT COUNT(DISTINCT user_pseudo_id) AS exact_total_user_count, FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
Approximate count using
SELECT APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_total_user_count, FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
You can replicate
APPROX_COUNT_DISTINCT using BigQuery HLL++ functions. This
will return the identical or very similar results as
SELECT HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 15)) AS approx_total_user_count, FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
And finally, to replicate the data in Google Analytics UI, use
precision = 14:
SELECT HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS total_user_count, FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
To calculate Active Users count from BigQuery event export table, you first have to filter the events for Active Users only. Implementing the Active User filter is out of scope for this article.
WITH ActiveUsers AS ( SELECT user_pseudo_id FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` WHERE <implement active user filter logic> ) SELECT HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS active_user_count, FROM ActiveUsers
ga_session_id event parameter identifies individual unique sessions for
each user. The combination of
ga_session_id will be
unique across your dataset for unique sessions. This is the standard method of
counting sessions for Google Analytics 4 properties. For sessions,
SELECT HLL_COUNT.EXTRACT( HLL_COUNT.INIT( CONCAT( user_pseudo_id, (SELECT `value` FROM UNNEST(event_params) WHERE key = 'ga_session_id' LIMIT 1).int_value), 12)) AS session_count, FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
You can use queries in this blogpost with the Basic Queries and Advanced Queries examples to gain additional insights from BigQuery event export data for your Google Analytics 4 properties. If your property generates significant amount of event data, you can also implement the HLL++ functions to estimate cardinality of other metrics that you commonly use in your BigQuery queries. To learn more about HLL++ and why cardinality computation is expensive for large datasets, read the in-depth blogpost on Cloud Blog: Using HLL++ to speed up count-distinct in massive datasets.