Noise injection

  • Noise injection protects user privacy by adding random noise to aggregated data in database queries, providing reasonably accurate results without traditional difference checks.

  • Using noise injection simplifies troubleshooting, requires no new query syntax, and provides clear result accuracy.

  • Noise injection does not rely on existing difference checks but outputs data based on approximate user thresholds (20+ for impressions, 10+ for clicks/conversions).

  • Ads Data Hub injects noise by clamping outlier contributions, aggregating clamped contributions, adding noise to each aggregate result, and eliminating rows with too few users.

  • You can choose between implicit clamping (automatic bounds) and explicit clamping (user-defined bounds) to limit outlier contributions, with explicit clamping generally resulting in less noise if reasonable bounds can be estimated.

Noise injection is a technique used to protect user privacy when querying a database. It works by adding random noise to an aggregating SELECT clause of a query. This noise protects user privacy while providing reasonably accurate results, eliminating the need for difference checks, and reducing the required aggregation threshold for output. Most existing queries can be executed in noise mode, with some limitations.

Learn the benefits of using noise injection

Difference checks do not apply: When running queries with noise injection, Ads Data Hub does not filter rows due to similarity to previous result sets. This means that you can still get a holistic view of the data while protecting user privacy.

Troubleshooting is simplified: Rows are only omitted due to aggregation requirements, making it simpler to troubleshoot and adapt queries.

There is no new syntax to learn: You don't need to learn any new query syntax or be versed in privacy concepts to use noise instead of difference checks.

Result accuracy is clear: A successful job shows the total percentage of data with the expected amount of noise.

Learn how noise impacts privacy requirements

Difference checks: Noise injection does not rely on existing difference checks in Ads Data Hub. When you use noise injection, difference checks are disabled.

Aggregation requirement: Noise injection outputs impression data represented by approximately 20 or more unique users, and click or conversion data represented by approximately 10 or more unique users.

Static checks: No impact.

Budgets and query limits: Queries executed using noise share the data access budget used with difference checks. As with difference checks, if you execute the same query on the same dataset many times, you might lose access to frequently queried dates in the dataset. This can happen if you run sliding window queries, or if you make the same request multiple times.

Noise mode imposes additional, stricter limits on recomputing the same aggregate results within or across queries. As with the data access budget, you can lose access to frequently queried dates in the dataset; but limitations due to recomputing the same aggregate results will only restrict queries in noise mode, not queries in difference check mode. For more information, see Repeated results.

Learn more about privacy checks.

Understand how noise injection affects results

Ads Data Hub injects noise to mitigate disclosure risk—the risk that someone can learn information about an individual user. It balances privacy against utility.

Noise injection in Ads Data Hub transforms the query results as follows:

  • It clamps outlier users' contributions in aggregate results. It sums each user's contribution in each aggregation and then caps each contribution with minimum and maximum clamping bounds.
  • It aggregates the clamped per-user contributions.
  • It adds noise to each aggregate result—the result of each aggregation function call in each row. The scale of this random noise is proportional to the clamped bounds.
  • It computes a noisy user count for each row and eliminates rows with too few users. This is similar to k-anonymity in difference check mode, but because of the noise, jobs running on the same dataset can drop different rows. Also, noise mode drops fewer rows because the aggregation requirement is lower (approximately 20 versus exactly 50).

The final result is a dataset where each row has noisy aggregate results and small groups have been eliminated. This masks an individual user's effect on returned results.

About aggregation clamping

Noise injection in Ads Data Hub uses implicit or explicit aggregation clamping to limit the contribution of outliers. You can choose which type of clamping to use, depending on your use case.

Implicit clamping

In implicit clamping, bounds are determined automatically. You don't need any special SQL syntax to use implicit clamping. If one row has a wider range of values than another, implicit bounding finds different bounds for these rows. This typically gives a lower margin of error for each result. On the other hand, each aggregation gets different clamping bounds and noise levels, which can make them hard to compare.

Implicit clamping can fail when an aggregation gets data from too few users—for example, a COUNTIF() call with a rare condition. These cases return NULL results. Also note that COUNT(DISTINCT user_id) automatically uses explicit clamping with bounds of 0 and 1.

Explicit clamping

Explicit clamping clamps the total contribution from each user to a specified range. Explicit bounds are uniformly applied to all rows and must be literal values. Even if some rows have a wider range of per-user contributions than others, the same bounds are applied to all of them. This makes results from different rows more comparable, though some rows get more noise than they might with implicit clamping.

Explicit clamping uses half as much noise as implicit clamping, for a given set of clamping bounds. Therefore, if you can estimate reasonable bounds, you will get better results by setting them explicitly.

To use explicit clamping, set the bounds for each supported aggregate function by adding integers representing the lower bound and the upper bound. For example:

SELECT
campaign_name,
-- Set lower and upper bounds to 0 and 1, respectively
ADH.ANON_COUNT(*, contribution_bounds_per_group => (0,1))
FROM data
GROUP BY 1

Run a query using noise injection

  1. Open a report.
  2. Click the Privacy noise settings toggle to the Use noise position.
  3. Run the query.
  4. Review the impact of the added noise.
  5. Optional: Adapt the query to reduce noise impact.

Review noise impact

Once job completes successfully, Ads Data Hub displays the reliability of the result in the privacy summary. Reliability is based on the percentage of cells in the output that are highly impacted by noise. A value in the result table is considered highly impacted if the scale of the added noise is greater than 5% of the result in the cell.

For impacted output data sets, the privacy summary lists the ten noisiest columns from highest to lowest impact and their corresponding contribution to noise. This is the breakdown of the expected amount of noise.

Results with >5% noise Indicator color Impact
<5% Green Low impact
5%-15% Yellow Medium impact
15%-25% Orange High impact
>25% Red Very high impact

You can also preview the privacy summary for recent report jobs on the Home page. To preview the privacy for a particular job, hold the pointer over the privacy tip icon privacy_tip in the job card under Recent activity.

Adapt queries

Aggregate results are more likely to have an unexpected amount of noise when few users contribute to those results. This can happen when rows have few users or when some users don't affect results—for example, when using the COUNTIF function. Based on the noise details, you may want to adjust your query to increase the percentage of data with the expected amount of noise.

The following are general guidelines:

  • Expand the date range.
  • Rewrite the query to reduce the granularity of the data, such as by grouping by fewer parameters or replacing COUNTIF with COUNT.
  • Remove noisy columns.
  • Use explicit clamping.

Supported aggregate functions

The following aggregate functions are supported with noise:

  • SUM(...)
  • COUNT(*)
  • COUNT(...)
  • COUNTIF(...)
  • COUNT(DISTINCT user_id)
  • APPROX_COUNT_DISTINCT(user_id)
  • AVG(...)

The DISTINCT keyword is only supported with the COUNT function, and only when used with a direct reference to the user_id column from an Ads Data Hub table or an expression that returns either user_id or NULL, such as COUNT(DISTINCT IF(..., user_id, NULL)).

Note these limitations only apply to aggregations with noise, which is the first level of cross-user aggregation. User-level aggregates and aggregates following noise injection are unrestricted.

Compute MIN and MAX

The MIN and MAX functions are not directly supported in noise aggregations, but there are often alternative methods to compute these results.

If you have a MIN or MAX of values that can be used as grouping keys, such as event date, you can first GROUP BY that value, then compute MIN/MAX afterward. This returns the minimum or maximum value that passes aggregation thresholding.

Example:

WITH campaign_date_ranges AS (
  SELECT campaign_id, MIN(event_date) AS min_date, MAX(event_date) AS max_date
  FROM (
    # Aggregation thresholding will be applied here
    SELECT DISTINCT
      campaign_id,
      DATE(query_id.time_usec, @time_zone) AS event_date
    FROM adh.google_ads_impressions
  )
)
SELECT campaign_id, num_impressions, min_date, max_date
FROM (
  # Noise and aggregation thresholding will be applied here
  SELECT campaign_id, COUNT(*) AS num_impressions
  FROM adh.google_ads_impressions
)
JOIN campaign_date_ranges USING(campaign_id)

Alternatively, if you have a MIN or MAX of granular values with known bounds, you can use PERCENTILE_CONT with explicit bounds for an approximate result.

Example:

SELECT
  campaign_id,
  COUNT(*) AS num_impressions,
  ADH.ANON_PERCENTILE_CONT(
    query_id.time_usec, 0,
    contribution_bounds_per_row => (@min_timestamp, @max_timestamp))
    AS min_timestamp,
  ADH.ANON_PERCENTILE_CONT(
    query_id.time_usec, 1,
    contribution_bounds_per_row => (@min_timestamp, @max_timestamp))
    AS max_timestamp
FROM adh.google_ads_impressions

About integer results

Although Ads Data Hub will automatically inject noise for these aggregate functions, the function signatures don't change. Because functions like COUNT or SUM of INT64 return INT64, any decimal part of the noised result is rounded. This is usually negligible relative to the size of the result and noise.

If you need the granularity of the decimal in your result, then avoid writing functions that return INT64–for example, by using SUM with its input cast to FLOAT64.

About negative results

In principle, noise with very small values can result in negative numbers, even when this should be semantically impossible for the query. To maintain expected behavior, all forms of COUNT and COUNTIF are automatically clamped at zero, so they never give negative results. If you want this same behavior with another function, such as SUM, then you can clamp results manually using GREATEST(0, SUM(...)).

This change is usually negligible, but it does introduce a small positive bias to overall results. If you need to avoid this, then consider using ADH.ANON_COUNT instead of COUNT, or consider using GROUP BY ROLLUP to compute total counts across rows.


Supported query patterns

Important: Most of Ads Data Hub's standard best practices still apply to queries that use noise injection. In particular, we recommend that you review the guidance on repeatedly querying the same data.

This section describes query patterns that are supported when running queries using noise injection.

User-level aggregates

Unrestricted user-level aggregates are supported in the same way that they are in difference check mode. Noise is only injected in aggregations that combine data across multiple users. Aggregations that explicitly group by user_id, or analytic functions that partition by user_id, don't receive any noise and any function is allowed. User-level aggregations that don't explicitly group by user_id–for example, GROUP BY impression_id, are treated as cross-user aggregations, so noise is added.

Grouping by external_cookie is not enough. While external_cookie can be used to join *_match tables with customer-owned tables, any single-user aggregations should explicitly group by user_id column, not just the external_cookie column.

Aggregate function example:

WITH user_paths AS (
  # Grouping by user_id, no noise needed, all functions allowed
  SELECT user_id, STRING_AGG(campaign_id, ">" ORDER BY query_id.time_usec) AS path
  FROM adh.google_ads_impressions
  GROUP BY 1
)
# Noise applied here to num_users
SELECT path, COUNT(*) AS num_users
FROM user_paths
GROUP BY 1;

Analytic function example:

WITH events AS (
  # Partitioning by user_id, no noise needed, all functions allowed
  SELECT
    campaign_id,
    ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY query_id.time_usec) AS index
  FROM adh.google_ads_impressions
)
# Noise applied here to first_impressions
SELECT campaign_id, COUNT(*) AS first_impressions
FROM events
WHERE index = 1
GROUP BY 1;

Parallel aggregates

Each cross-user aggregation receives noise independently. You can run multiple such aggregations in a single statement, combining results into one table using a JOIN or UNION.

Example:

WITH result_1 AS (
  # Noise applied here to num_impressions
  SELECT campaign_id, COUNT(*) AS num_impressions
  FROM adh.google_ads_impressions
  GROUP BY 1
), result_2 AS (
  # Noise applied here to num_clicks
  SELECT campaign_id, COUNT(*) AS num_clicks
  FROM adh.google_ads_creative_conversions
  GROUP BY 1
)
SELECT * FROM result_1 JOIN result_2 USING(campaign_id)

Note that this would be supported but should be avoided in difference check mode. This practice is not a problem with noise, as each parallel aggregate is noised and filtered independently.

Aggregated data joined with unaggregated data

Since Ads Data Hub only supports analytic windows that partition by user_id, it is a common workaround to aggregate these results separately and self-join them before aggregating again. These queries are supported in noise mode, and often perform better than they would with in difference check mode due to privacy requirements being resolved earlier.

Example:

WITH campaign_totals AS (
  # Noise applied here to campaign_imps
  SELECT campaign_id, COUNT(*) AS campaign_imps
  FROM adh.google_ads_impressions
  GROUP BY 1
)
# Noise applied here to imps
SELECT campaign_id, demographics, campaign_imps, COUNT(*) AS imps
FROM adh.google_ads_impressions JOIN campaign_totals USING(campaign_id)
GROUP BY 1,2,3

Noise mode discourages reaggregating aggregate results, such as AVG(campaign_imps).


Unsupported query patterns

This section describes query patterns that are not supported when running queries using noise injection.

Today-inclusive queries

Noise mode queries don't support querying the current day's data. (This is discouraged in difference check mode.) The current date is not selectable for queries that use noise injection.

Repeated results

In noise mode, Ads Data Hub limits how often you can repeat the same aggregation. If you reach these limits, your noise mode queries will lose access to frequently queried dates in the dataset. The following are examples of how this can occur.

Query repetition happens when the same query is run multiple times with the same parameters or highly similar parameters, such as overlapping date ranges. You can avoid this by using data that is already exported to your BigQuery project.

Note that if two jobs are querying overlapping date ranges, they might produce repetitions if performing the same computation on the same users. For example, the following query, executed on overlapping date ranges, creates repetitions because it partitions by date:

SELECT DATE(TIMESTAMP_MICROS(event.event_time)) AS date,
COUNT(*) AS cnt
FROM adh.cm_dt_clicks
GROUP BY 1

In this case, you should run the query on disjoint date segments.

Another example of a repetition happens when data is somewhat date independent. The following query produces repetitions when executed on overlapping dates, where both jobs cover the entire lifetime of a campaign:

SELECT campaign_id, COUNT(*) AS cnt
FROM adh.google_ads_impressions
GROUP BY 1

In this case, you should run this query only once since the result doesn't change.

Aggregation repetition happens when the same aggregation is repeated multiple times within a query:

SELECT COUNT(*) AS cnt1, COUNT(*) AS cnt2
FROM table

In this case, you should remove one of the repetitions.

Note that even if the aggregations are syntactically different but compute the same value, it would count as a repetition. In other words, if the values of condition1 and condition2 are the same for all users with some value of key, the following query would have a repetition:

SELECT key, COUNTIF(condition1) AS cnt1, COUNTIF(condition2) AS cnt2
FROM table
GROUP BY key

If you have conditions that are very similar for some groups of users, you might consider rewriting the query to have only one COUNT.

Row duplication happens when an Ads Data Hub table is joined with a BigQuery table in a way that each row from the Ads Data Hub table matches multiple rows in the BigQuery table. For example, the following query produces a repetition if there are multiple rows with the same campaign ID in bq_table:

SELECT r.campaign_id, COUNT(*) AS cnt
FROM adh_table
INNER JOIN bq_table ON l.campaign_id = r.campaign_id

In this case, you should restructure the query so that bq_table would have only one row per join key value (campaign_id, in this case).

Note that unnesting an array from the Ads Data Hub table could produce the same effect if most users have the same arrays of values:

SELECT in_market_id, COUNT(*)
FROM adh.dv360_youtube_impressions,
UNNEST(in_market) AS in_market_id
GROUP BY 1

Learn about other query best practices.

About Lookback Windows

Some query patterns generate reports over a large timeframe, periodically regenerating to include new results. These queries may need adjustments to work in noise mode because if they recompute previous results, they will be blocked. Instead, each job should only generate new results, then new results can be combined with results from previous jobs for a full report.

For example, if you are creating a report of metrics by date, refreshed daily:

SELECT
  campaign_id,
  DATE(TIMESTAMP_MICROS(query_id.time_usec), @time_zone) AS event_date,
  COUNT(*) AS impressions
FROM adh.google_ads_impressions
GROUP BY 1,2

You shouldn't run this with a large date range as this will recompute results of previous days. Instead, you should run each job only the latest day, which has new data, then combine with results from previous jobs.

If you do need to refresh a previous result (for example to account for late arriving data), then you should avoid recomputing any single result more than 1 or 2 times. Otherwise, you may get errors due to repeated query attempts.

Direct reaggregation

Noise is applied to the first layer of cross-user aggregation in the query. Queries with multiple layers of aggregation will combine noisy results, so final aggregates may have much higher noise. These queries receive a warning on validation:

WITH layer_1 AS (
  # Noise applied here to partial_result
  SELECT campaign_id, demographics, location, COUNT(*) AS partial_result
  FROM adh.google_ads_impressions
  GROUP BY 1,2,3
  HAVING partial_result > 5
)
# Reaggregation of partial_result with no user-level data, will be rejected
SELECT campaign_id, SUM(partial_result) AS final_result
FROM layer_1
GROUP BY 1

To get the best results from noise, compute all cross-user operations within a single aggregation. For example, take a SUM of events rather than a SUM of intermediate counts.

If multi-layer aggregation is unavoidable, you can resolve the warning by exporting results directly from the first layer instead. To do this within a single job without changing script results, create a temp table (or a table exported to your BigQuery project) with the OPTIONS(privacy_checked_export=true) syntax. For example:

CREATE TEMP TABLE layer_1 OPTIONS(privacy_checked_export=true) AS (
  # Noise applied here to partial_result
  SELECT campaign_id, demographics, location, COUNT(*) AS partial_result
  FROM adh.google_ads_impressions
  GROUP BY 1,2,3
  HAVING partial_result > 5
);
# Reaggregation of privacy checked data, no noise needed
SELECT campaign_id, SUM(partial_result) AS final_result
FROM layer_1
GROUP BY 1

Learn more about temp tables.

If the first layer of aggregation is too granular for privacy checks, consider rewriting the query with user-level aggregates. If this is not possible, then this query is not supported in noise mode.

Unjoined user IDs

Queries in noise mode must not combine data from separate users into a single row, except when performing an aggregation with noise. As a result, joins of unaggregated Ads Data Hub data should explicitly join on the user_id column.

This query does not explicitly join on the user_id column, which results in a validation warning:

SELECT 
FROM adh.google_ads_impressions
JOIN adh.google_ads_creative_conversions USING(impression_id)

Joins like this may not behave as expected because only rows with the same user_id value will match. This can be fixed by adjusting the USING clause to explicitly include user_id – for example, USING(impression_id, user_id).

Note that this limitation applies only to joins between Ads Data Hub tables (with the exception of dimension tables). It does not apply to customer-owned tables. For example, the following is allowed:

SELECT 
FROM adh.google_ads_impressions
JOIN bigquery_project.dataset.table USING(any_column)

Ads Data Hub-BigQuery right joins

Outer joins with customer-owned data can lead to rows with missing user identifiers, which prevents noise from working well.

Both of these queries result in validation warnings because they allow for unmatched rows with missing user identifiers on the Ads Data Hub side:

SELECT 
FROM adh.google_ads_impressions
RIGHT JOIN bigquery_project.dataset.table USING(column)
SELECT 
FROM bigquery_project.dataset.table
LEFT JOIN adh.google_ads_impressions USING(column)

Note that either join would work if the order of the tables was reversed. There is also an exception for RDID tables that join directly on device_id_md5. For example, the following query will work with no warnings:

SELECT 
FROM bigquery_project.dataset.table
LEFT JOIN adh.google_ads_impressions_rdid USING(device_id_md5)

Filtered Row Summary

The filtered row summary spec is not supported in noise mode. This feature is most often unnecessary with noise due to lower filtering rates and the lack of filtering from difference checks.

If you observe significant data filtering in a noise result, then increase the aggregated data. You may perform a parallel aggregation over the full dataset to compare an estimate of the total, for example:

SELECT campaign_name, COUNT(*)
FROM data
GROUP BY 1
UNION ALL
SELECT 'Total', COUNT(*)
FROM data
GROUP BY 1

Note that the total count is independently noised and total values may not add up, but the total count is often more accurate than taking the sum of noised rows.

Cross-mode created tables

Unexported tables in Ads Data Hub can only be used with the same privacy mode where they were created. You can't create a table in normal aggregation mode and use it in noise mode, or the other way around (unless that table is exported to BigQuery first).