Unique count approximation in Google Analytics

Minhaz Kazi, Developer Advocate, Google Analytics – October 2022

Measuring exact distinct counts (i.e. cardinality) for large datasets requires significant memory and affects performance. Google Analytics 4 properties use HyperLogLog++ (HLL++) algorithm to estimate cardinality for most used metrics including Active Users and Sessions. For example, when viewing Active Users in the Google Analytics UI as well as through the Data API, the count will be an approximation. Using HLL++ for these metrics ensures better performance with higher estimation accuracy and lower error bounds. This blogpost provides details on how HLL++ is used in Google Analytics 4 properties and how you can replicate some of the estimations using BigQuery event export data.

About HLL++

HLL++ estimates cardinality while using less memory and improving performance. HLL++ has augmentations made over the HyperLogLog algorithm and is based on HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm. Changes in the HLL++ precision parameter trade off memory usage vs. accuracy of the computed stats. Increasing the parameter value lowers the error, but increases the memory consumption.

Unique count implementation in BigQuery

  • Use COUNT(DISTINCT) to measure exact cardinality. This approach requires more memory and will take longer to run, especially for large datasets.
  • APPROX_COUNT_DISTINCT approximates results with HLL++. However, APPROX_COUNT_DISTINCT does not allow users to configure the approximation accuracy.
  • To use custom precision values, use HyperLogLog++ functions. See HLL++ Sketches for permitted precision values and the confidence intervals for typical precisions.
  • sparse precision is another parameter for HLL+. In BigQuery, sparse precision value is not user definable and is fixed at precision + 5.

HLL++ Implementation in Google Analytics 4 properties

Google Analytics 4 uses the following configuration for measuring cardinality of the related metrics.

Metric precision sparse precision
Sessions 12 17
Active Users 14 25
Total Users 14 25

Using BigQuery HLL++ Functions with Google Analytics Event data

If your Google Analytics 4 property's event data is available in BigQuery, you can try to match the metrics from UI with BigQuery data. The following examples assume:

  • Reporting identifier is set to By device only for the Google Analytics 4 property.
  • Other potential sources of discrepancy, e.g. reporting timezone, are already addressed.

Total Users

Exact count using COUNT(DISTINCT):

SELECT
  COUNT(DISTINCT user_pseudo_id) AS exact_total_user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`

Approximate count using APPROX_COUNT_DISTINCT:

SELECT
  APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_total_user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`

You can replicate APPROX_COUNT_DISTINCT using BigQuery HLL++ functions. This will return the identical or very similar results as APPROX_COUNT_DISTINCT:

SELECT
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 15)) AS approx_total_user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`

And finally, to replicate the data in Google Analytics UI, use precision = 14:

SELECT
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS total_user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`

Active Users

To calculate Active Users count from BigQuery event export table, you first have to filter the events for Active Users only. Implementing the Active User filter is out of scope for this article.

WITH ActiveUsers AS
(
  SELECT
    user_pseudo_id
  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  WHERE
  <implement active user filter logic>
)
SELECT
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS active_user_count,
FROM ActiveUsers

Sessions

The ga_session_id event parameter identifies individual unique sessions for each user. The combination of user_pseudo_id and ga_session_id will be unique across your dataset for unique sessions. This is the standard method of counting sessions for Google Analytics 4 properties. For sessions, precision is 12.

SELECT
  HLL_COUNT.EXTRACT(
    HLL_COUNT.INIT(
      CONCAT(
        user_pseudo_id,
        (SELECT `value` FROM UNNEST(event_params) WHERE key = 'ga_session_id' LIMIT 1).int_value),
      12)) AS session_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`

You can use queries in this blogpost with the Basic Queries and Advanced Queries examples to gain additional insights from BigQuery event export data for your Google Analytics 4 properties. If your property generates significant amount of event data, you can also implement the HLL++ functions to estimate cardinality of other metrics that you commonly use in your BigQuery queries. To learn more about HLL++ and why cardinality computation is expensive for large datasets, read the in-depth blogpost on Cloud Blog: Using HLL++ to speed up count-distinct in massive datasets.