Noise injection

Noise injection is a technique used to protect user privacy when querying a database. It works by adding random noise to an aggregating SELECT clause of a query. This noise protects user privacy while providing reasonably accurate results, eliminating the need for difference checks, and reducing the required aggregation threshold for output. Most existing queries can be executed in noise mode, with some limitations.

Learn the benefits of using noise injection

Difference checks do not apply: When running queries with noise injection, Ads Data Hub does not filter rows due to similarity to previous result sets. This means that you can still get a holistic view of the data while protecting user privacy.

Troubleshooting is simplified: Rows are only omitted due to aggregation requirements, making it simpler to troubleshoot and adapt queries.

There is no new syntax to learn: You don't need to learn any new query syntax or be versed in privacy concepts to use noise instead of difference checks.

Result accuracy is clear: A successful job shows the total percentage of data with the expected amount of noise.

Learn how noise impacts privacy requirements

Difference checks: Noise injection does not rely on existing difference checks in Ads Data Hub. When you use noise injection, difference checks are disabled.

Aggregation requirement: Noise injection outputs impression data represented by approximately 20 or more unique users, and click or conversion data represented by approximately 10 or more unique users.

Static checks: No impact.

Budgets and query limits: Queries executed using noise share the data access budget used with difference checks. As with difference checks, if you execute the same query on the same dataset many times, you might lose access to frequently queried dates in the dataset. This can happen if you run sliding window queries, or if you make the same request multiple times.

Noise mode imposes additional, stricter limits on recomputing the same aggregate results within or across queries. As with the data access budget, you can lose access to frequently queried dates in the dataset; but limitations due to recomputing the same aggregate results will only restrict queries in noise mode, not queries in difference check mode. For more information, see Repeated results.

Learn more about privacy checks.

Understand how noise injection affects results

Ads Data Hub injects noise to mitigate disclosure risk—the risk that someone can learn information about an individual user. It balances privacy against utility.

Noise injection in Ads Data Hub transforms the query results as follows:

  • It clamps outlier users' contributions in aggregate results. It sums each user's contribution in each aggregation and then caps each contribution with minimum and maximum clamping bounds.
  • It aggregates the clamped per-user contributions.
  • It adds noise to each aggregate result—the result of each aggregation function call in each row. The scale of this random noise is proportional to the clamped bounds.
  • It computes a noisy user count for each row and eliminates rows with too few users. This is similar to k-anonymity in difference check mode, but because of the noise, jobs running on the same dataset can drop different rows. Also, noise mode drops fewer rows because the aggregation requirement is lower (approximately 20 versus exactly 50).

The final result is a dataset where each row has noisy aggregate results and small groups have been eliminated. This masks an individual user's effect on returned results.

About aggregation clamping

Noise injection in Ads Data Hub uses implicit or explicit aggregation clamping to limit the contribution of outliers. You can choose which type of clamping to use, depending on your use case.

Implicit clamping

In implicit clamping, bounds are determined automatically. You don't need any special SQL syntax to use implicit clamping. If one row has a wider range of values than another, implicit bounding finds different bounds for these rows. This typically gives a lower margin of error for each result. On the other hand, each aggregation gets different clamping bounds and noise levels, which can make them hard to compare.

Implicit clamping can fail when an aggregation gets data from too few users—for example, a COUNTIF() call with a rare condition. These cases return NULL results. Also note that COUNT(DISTINCT user_id) automatically uses explicit clamping with bounds of 0 and 1.

Explicit clamping

Explicit clamping clamps the total contribution from each user to a specified range. Explicit bounds are uniformly applied to all rows and must be literal values. Even if some rows have a wider range of per-user contributions than others, the same bounds are applied to all of them. This makes results from different rows more comparable, though some rows get more noise than they might with implicit clamping.

Explicit clamping uses half as much noise as implicit clamping, for a given set of clamping bounds. Therefore, if you can estimate reasonable bounds, you will get better results by setting them explicitly.

To use explicit clamping, set the bounds for each supported aggregate function by adding integers representing the lower bound and the upper bound. For example:

SELECT
campaign_name,
-- Set lower and upper bounds to 0 and 1, respectively
ADH.ANON_COUNT(*, contribution_bounds_per_group => (0,1))
FROM data
GROUP BY 1

Run a query using noise injection

  1. Write a query or open an existing query. To see which aggregate functions can be used, see Supported functions.
  2. In the query editor, click Run enter details for a new job.
  3. Click the Privacy settings toggle to the Use noise position.
  4. Run the query.
  5. Review the added noise.
  6. Optional: Adapt the query to reduce noise impact.

Review noise impact

Once a query completes successfully, Ads Data Hub displays the reliability of the result, based on how many cells in the output have the expected amount of noise. A value in the result table is considered highly impacted if the scale of the added noise is greater than 5% of the result in the cell. See the impact ranges in the following table.

For impacted output data sets, the details tab lists the 10 noisiest columns from highest to lowest impact and their corresponding contribution to noise. This is the breakdown of the expected amount of noise.

Data with expected amount of noise Indicator color Impact
>95% Green Low impact
85%-95% Yellow Medium impact
75%-85% Orange High impact
<75% Red Very high impact

To see detailed information about noise impact:

  1. Click Reports.
  2. Select a report from the list. The privacy summary tooltip indicates the percentage of results that have the expected amount of noise, corresponding to the amount of noise added that is greater than 5% of the result.
  3. To see more information, click Jobs > Details
  4. View Privacy messages in the job details. The results fall into one of the listed categories.
  5. If needed, adjust your query to improve the result.

Adapt queries

Aggregate results are more likely to have an unexpected amount of noise when few users contribute to those results. This can happen when rows have few users or when some users don't affect results—for example, when using the COUNTIF function. Based on the noise details, you may want to adjust your query to increase the percentage of data with the expected amount of noise.

The following are general guidelines:

  • Expand the date range.
  • Rewrite the query to reduce the granularity of the data, such as by grouping by fewer parameters or replacing COUNTIF with COUNT.
  • Remove noisy columns.
  • Use explicit clamping.

Supported aggregate functions

The following aggregate functions are supported with noise:

  • SUM(...)
  • COUNT(*)
  • COUNT(...)
  • COUNTIF(...)
  • COUNT(DISTINCT user_id)
  • APPROX_COUNT_DISTINCT(user_id)
  • AVG(...)

The DISTINCT keyword is only supported with the COUNT function, and only when used with a direct reference to the user_id column from an Ads Data Hub table or an expression that returns either user_id or NULL, such as COUNT(DISTINCT IF(..., user_id, NULL)).

The following functions are not directly supported, but can be replaced with other aggregates with noise to get statistical results. Note that the numeric values are examples only:

  • LOGICAL_OR(...). Suggested replacement: COUNT(DISTINCT IF(..., user_id, NULL)) > 0
  • LOGICAL_AND(...). Suggested replacement: COUNT(DISTINCT IF(NOT ..., user_id, NULL)) <= 0

About integer results

Although Ads Data Hub will automatically inject noise for these aggregation functions, the function signatures do not change. Because functions like COUNT or SUM of INT64 return INT64, any decimal part of the noised result is rounded. This is usually negligible relative to the size of the result and noise.

If you need the granularity of the decimal in your result, then avoid writing functions that return INT64–for example, by using SUM with its input cast to FLOAT64.


Supported query patterns

Important: Most of Ads Data Hub's standard best practices still apply to queries that use noise injection. In particular, we recommend that you review the guidance on repeatedly querying the same data.

This section describes query patterns that are supported when running queries using noise injection.

User-level aggregates

Unrestricted user-level aggregates are supported in the same way that they are in difference check mode. Noise is only injected in aggregations that combine data across multiple users. Aggregations that explicitly group by user_id, or analytic functions that partition by user_id, don't receive any noise and any function is allowed. User-level aggregations that don't explicitly group by user_id–for example, GROUP BY impression_id, are treated as cross-user aggregations, so noise is added.

Grouping by external_cookie is not enough. While external_cookie can be used to join *_match tables with customer-owned tables, any single-user aggregations should explicitly group by user_id column, not just the external_cookie column.

Aggregate function example:

WITH user_paths AS (
  # Grouping by user_id, no noise needed, all functions allowed
  SELECT user_id, STRING_AGG(campaign_id, ">" ORDER BY query_id.time_usec) AS path
  FROM adh.google_ads_impressions
  GROUP BY 1
)
# Noise applied here to num_users
SELECT path, COUNT(*) AS num_users
FROM user_paths
GROUP BY 1;

Analytic function example:

WITH events AS (
  # Partitioning by user_id, no noise needed, all functions allowed
  SELECT
    campaign_id,
    ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY query_id.time_usec) AS index
  FROM adh.google_ads_impressions
)
# Noise applied here to first_impressions
SELECT campaign_id, COUNT(*) AS first_impressions
FROM events
WHERE index = 1
GROUP BY 1;

Parallel aggregates

Each cross-user aggregation receives noise independently. You can run multiple such aggregations in a single statement, combining results into one table using a JOIN or UNION.

Example:

WITH result_1 AS (
  # Noise applied here to num_impressions
  SELECT campaign_id, COUNT(*) AS num_impressions
  FROM adh.google_ads_impressions
  GROUP BY 1
), result_2 AS (
  # Noise applied here to num_clicks
  SELECT campaign_id, COUNT(*) AS num_clicks
  FROM adh.google_ads_clicks
  GROUP BY 1
)
SELECT * FROM result_1 JOIN result_2 USING(campaign_id)

Note that this would be supported but should be avoided in difference check mode. This practice is not a problem with noise, as each parallel aggregate is noised and filtered independently.

Aggregated data joined with unaggregated data

Since Ads Data Hub only supports analytic windows that partition by user_id, it is a common workaround to aggregate these results separately and self-join them before aggregating again. These queries are supported in noise mode, and often perform better than they would with in difference check mode due to privacy requirements being resolved earlier.

Example:

WITH campaign_totals AS (
  # Noise applied here to campaign_imps
  SELECT campaign_id, COUNT(*) AS campaign_imps
  FROM adh.google_ads_impressions
  GROUP BY 1
)
# Noise applied here to imps
SELECT campaign_id, demographics, campaign_imps, COUNT(*) AS imps
FROM adh.google_ads_impressions JOIN campaign_totals USING(campaign_id)
GROUP BY 1,2,3

Noise mode prohibits reaggregating aggregate results, such as AVG(campaign_imps).


Unsupported query patterns

This section describes query patterns that are not supported when running queries using noise injection.

Today-inclusive queries

Noise mode queries do not support querying the current day's data. (This is discouraged in difference check mode.) The current date is not selectable for queries that use noise injection.

Repeated results

In noise mode, Ads Data Hub limits how often you can repeat the same aggregation. If you reach these limits, your noise mode queries will lose access to frequently queried dates in the dataset. The following are examples of how this can occur.

Query repetition happens when the same query is run multiple times with the same parameters or highly similar parameters, such as overlapping date ranges. You can avoid this by using data that is already exported to your BigQuery project.

Note that if two jobs are querying overlapping date ranges, they might produce repetitions if performing the same computation on the same users. For example, the following query, executed on overlapping date ranges, creates repetitions because it partitions by date:

SELECT DATE(TIMESTAMP_MICROS(event.event_time)) AS date,
COUNT(*) AS cnt
FROM adh.cm_dt_clicks
GROUP BY 1

In this case, you should run the query on disjoint date segments.

Another example of a repetition happens when data is somewhat date independent. The following query produces repetitions when executed on overlapping dates, where both jobs cover the entire lifetime of a campaign:

SELECT campaign_id, COUNT(*) AS cnt
FROM adh.google_ads_impressions
GROUP BY 1

In this case, you should run this query only once since the result doesn't change.

Aggregation repetition happens when the same aggregation is repeated multiple times within a query:

SELECT COUNT(*) AS cnt1, COUNT(*) AS cnt2
FROM table

In this case, you should remove one of the repetitions.

Note that even if the aggregations are syntactically different but compute the same value, it would count as a repetition. In other words, if the values of condition1 and condition2 are the same for all users with some value of key, the following query would have a repetition:

SELECT key, COUNTIF(condition1) AS cnt1, COUNTIF(condition2) AS cnt2
FROM table
GROUP BY key

If you have conditions that are very similar for some groups of users, you might consider rewriting the query to have only one COUNT.

Row duplication happens when an Ads Data Hub table is joined with a BigQuery table in a way that each row from the Ads Data Hub table matches multiple rows in the BigQuery table. For example, the following query produces a repetition if there are multiple rows with the same campaign ID in bq_table:

SELECT r.campaign_id, COUNT(*) AS cnt
FROM adh_table
INNER JOIN bq_table ON l.campaign_id = r.campaign_id

In this case, you should restructure the query so that bq_table would have only one row per join key value (campaign_id, in this case).

Note that unnesting an array from the Ads Data Hub table could produce the same effect if most users have the same arrays of values:

SELECT in_market_id, COUNT(*)
FROM adh.dv360_youtube_impressions,
UNNEST(in_market) AS in_market_id
GROUP BY 1

Learn about other query best practices.

Direct reaggregation

Noise is applied to the first layer of cross-user aggregation in the query. Queries with multiple layers of aggregation are blocked:

WITH layer_1 AS (
  # Noise applied here to partial_result
  SELECT campaign_id, demographics, location, COUNT(*) AS partial_result
  FROM adh.google_ads_impressions
  GROUP BY 1,2,3
  HAVING partial_result > 5
)
# Reaggregation of partial_result with no user-level data, will be rejected
SELECT campaign_id, SUM(partial_result) AS final_result
FROM layer_1
GROUP BY 1

To get the best results from noise, compute all cross-user operations within a single aggregation. For example, take a SUM of events rather than a SUM of intermediate counts. It is possible to rewrite a query to reaggregate noised aggregates, but the final aggregates may have much higher noise.

If this is unavoidable, you can rewrite your query to export results directly from the first layer instead. To do this within a single job without changing script results, create a temp table (or a table exported to your BigQuery project) with the OPTIONS(privacy_checked_export=true) syntax. For example:

CREATE TEMP TABLE layer_1 OPTIONS(privacy_checked_export=true) AS (
  # Noise applied here to partial_result
  SELECT campaign_id, demographics, location, COUNT(*) AS partial_result
  FROM adh.google_ads_impressions
  GROUP BY 1,2,3
  HAVING partial_result > 5
);
# Reaggregation of privacy checked data, no noise needed
SELECT campaign_id, SUM(partial_result) AS final_result
FROM layer_1
GROUP BY 1

Learn more about temp tables.

If the first layer of aggregation is too granular for privacy checks, consider rewriting the query with user-level aggregates. If this is not possible, then this query is not supported in noise mode.

Unjoined user IDs

Queries in noise mode must not combine data from separate users into a single row, except when performing an aggregation with noise. As a result, joins of unaggregated Ads Data Hub data are required to explicitly join on the user_id column.

This query does not explicitly join on the user_id column, which results in a validation error:

SELECT …
FROM adh.google_ads_impressions
JOIN adh.google_ads_clicks USING(impression_id)

This can be fixed by adjusting the USING clause to explicitly include user_id–for example, USING(impression_id, user_id).

Note that this limitation applies only to joins between Ads Data Hub tables (with the exception of dimension tables). It does not apply to customer-owned tables. For example, the following is allowed:

SELECT …
FROM adh.google_ads_impressions
JOIN bigquery_project.dataset.table USING(any_column)

Ads Data Hub-BigQuery unions

Aggregations with noise require user identifiers to perform well. Customer-owned data in BigQuery has no user identifiers, so it cannot be unioned into a noise aggregation without joining to an Ads Data Hub table.

This query results in a validation error:

SELECT COUNT(*) FROM (
  SELECT 1 FROM adh.google_ads_impressions
  UNION ALL
  SELECT 1 FROM bigquery_project.dataset.table
)

To fix this, you should either join the BigQuery table to augment Ads Data Hub data rather than union, or separate the data to aggregate each source separately.

Note that it is fine to take a union between multiple Ads Data Hub tables with user data, or multiple customer-owned BigQuery tables, but you cannot mix the two.

Ads Data Hub-BigQuery right joins

Outer joins with customer-owned data can lead to rows with missing user identifiers, which prevents noise from working well.

Both of these queries result in validation errors because they allow for unmatched rows with missing user identifiers on the Ads Data Hub side:

SELECT …
FROM adh.google_ads_impressions
RIGHT JOIN bigquery_project.dataset.table USING(column)
SELECT …
FROM bigquery_project.dataset.table
LEFT JOIN adh.google_ads_impressions USING(column)

Note that either join would work if the order of the tables was reversed.

Filtered Row Summary

The filtered row summary spec is not supported in noise mode. This feature is most often unnecessary with noise due to lower filtering rates and the lack of filtering from difference checks.

If you observe significant data filtering in a noise result, then increase the aggregated data. You may perform a parallel aggregation over the full dataset to compare an estimate of the total, for example:

SELECT campaign_name, COUNT(*)
FROM data
GROUP BY 1
UNION ALL
SELECT 'Total', COUNT(*)
FROM data
GROUP BY 1

Note that the total count is independently noised and total values may not add up, but the total count is often more accurate than taking the sum of noised rows.

Cross-mode created tables

Unexported tables in Ads Data Hub can only be used with the same privacy mode where they were created. You can't create a table in normal aggregation mode and use it in noise mode, or the other way around (unless that table is exported to BigQuery first).