Regression modeling in Ads Data Hub

Linear and logistic regression are machine learning models that enable you to generate meaningful predictions from advertising data.

  • Linear regression creates a function that fits a line to data such that the distance between the data points and the line is minimized. This model can then be used to predict a numerical value based on inputs, such as predicting a user’s lifetime value based on purchases, prior interactions, etc.
  • Logistic regression is used for predictive classification problems. The model can be one of two types, depending on how many distinct values your label can take:
    • Binary logistic regression answers "yes/no" questions, such as how likely it is that a conversion event will occur.
    • Multiclass logistic regression is used to predict multiple possible values, such as determining whether a customer is "low-value", "medium-value", or "high-value".

Linear and logistic regression both learn from training data (in this case, your advertising data), giving you a predictive model to make advertising decisions. Generally, providing more data and ensuring that the data you provide is of high quality will improve the accuracy of your model. Both models perform better when given tightly clustered training data.

Differential privacy

Linear and logistic regression both use differential privacy, a different system of privacy checks than the one used by other operations in Ads Data Hub. Differential privacy ensures end-user privacy by injecting noise into your results during the training process. This level of noise remains low enough that the end results are still useful, but high enough that end users can’t be identified. Additionally, the level of noise is non-deterministic, so results have an inconsistent level of noise, further ensuring end-user privacy.

Query limiting

When using the default EPSILON_PER_MODEL, you are limited to 100 linear and logistic regression modeling queries per “data day”. Data day refers to events generated on a given day. These events correspond to the start and end dates you provide when running your query, and the start and end dates used when creating any tables your query uses, such as temp tables used for training. This means that data from 1 day can be used in no more than 100 models. If you specify a value of EPSILON_PER_MODEL greater than the default, you will be able to create fewer, higher-quality models. If you choose smaller values of EPSILON_PER_MODEL, you can train more models, but they will have lower quality.

How it works

The workflow follows these essential steps:

  1. Prepare training data.
  2. Create a model.
  3. Gather insights from the model.

Prepare training data

As mentioned above, using larger, high-quality datasets will generally lead to better outcomes. Additionally, since the input data is scaled using min-max scaling, sparsely clustered data or data with significant outliers can adversely affect the model by moving the mean.

By default, Ads Data Hub randomly selects 18% of your training data to use for validation. The percent of data used for validation can be controlled by the data_split_eval_fraction option.
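
As a minimal sketch of how that option is passed (the model and table names here are hypothetical, and the temp table is assumed to contain a column named label):

CREATE MODEL `my_linear_model`
OPTIONS
  (model_type='adh_linear_regression',
   data_split_eval_fraction=0.25)      /* hold out 25% for validation instead of the default 18% */
AS SELECT * FROM tmp.my_training_set;  /* hypothetical temp table containing a "label" column */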

Create a model

Specify parameters and input data for training your model.

Best Practices

One of the most important factors in the quality of a model is the size of the training set. However, the size/quality tradeoff will differ depending on the problem and the factors listed below.

  • We have seen accuracy > 0.70 for logistic regression models built from training sets with at least 100,000 users.
  • We have seen r-squared > 0.70 for linear regression models built from training sets with at least 800,000 users.

There are other factors that can reduce the quality of a model.

  • Logistic models where one class is much more heavily represented than the others. Specifically, when one label has few training-set users, having many training-set users with other labels probably won't help much with the model's accuracy on the small label. For example, a 20,000/1,000 split between two labels is worse than a 10,000/2,000 split.
  • Feature data that does not provide a strong signal for the labels.
  • Raw data that needs more extensive feature engineering. For example, some fields may have a very large number of possible values. One way to improve data of this type is to transform the values into a feature with a smaller number of categories or buckets, as sketched below.
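
For illustration, here is a hedged sketch of that kind of transformation, collapsing a hypothetical high-cardinality city field into a handful of region categories inside the training query (all table and column names are illustrative):

SELECT
  converted AS label,                 /* hypothetical conversion label */
  CASE                                /* collapse thousands of cities into a few buckets */
    WHEN city IN ('new_york', 'boston', 'philadelphia') THEN 'northeast'
    WHEN city IN ('san_francisco', 'seattle', 'portland') THEN 'west'
    ELSE 'other'
  END AS region,
  impressions
FROM tmp.user_features;               /* hypothetical temp table */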

Gather insights

You can call functions to evaluate your model’s performance on validation data, inspect features and information on training iterations (such as the underlying weights used by a model during prediction), and predict on unseen data.

CREATE MODEL statement

The CREATE MODEL statement creates a model with the name and dataset that you specify. If the model name already exists, CREATE MODEL replaces the existing model.

CREATE MODEL syntax

CREATE MODEL
model_name
OPTIONS
(
  // model_option_list:
  // Required parameter.
  MODEL_TYPE = { 'ADH_LINEAR_REGRESSION' | 'ADH_LOGISTIC_REGRESSION'}

  // Optional tuning parameters.
  [, L1_REG = float64_value ]
  [, L2_REG = float64_value ]
  [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
  [, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' |
                           'NORMAL_EQUATION' } ]
  [, MAX_ITERATIONS = int64_value ]
  [, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' } ]
  [, LEARN_RATE = float64_value ]
  [, EARLY_STOP = { TRUE | FALSE } ]
  [, MIN_REL_PROGRESS = float64_value ]
  [, LS_INIT_LEARN_RATE = float64_value ]
  [, EPSILON_PER_MODEL = float64_value ]
  [, AUTOMATIC_INPUT_SCALING = bool_value ]
  [, MIN_MAX_SCALED_COLS = [string_value, string_value... ] ]
  [, STANDARD_SCALED_COLS = [string_value, string_value... ] ]
  [, QUANTILE_BUCKETIZED_COLS = [
        STRUCT(string_value AS col_name, int64_value AS num_buckets),
        STRUCT(string_value AS col_name, int64_value AS num_buckets)... ] ]
)
AS query_statement

model_name

An alphanumeric name, including underscores and dashes. Can’t include dots. If the model name that you provide already exists, the older model will be overwritten.

query_statement

Specifies the standard SQL query that is used to generate the training data. If you’re creating multiple models using the same training data, create a temporary table with the training data and reference it here. This tactic avoids potential difference check errors due to wipeout or late spam.
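
For example, a hedged sketch of that pattern (all table, column, and model names here are illustrative):

/* Materialize the training data once, then train multiple models from it. */
CREATE TABLE tmp.training_set AS
SELECT label, impressions, clicks
FROM my_dataset.user_features;  /* hypothetical pre-aggregated source table */

CREATE MODEL `logistic_model_a`
OPTIONS (model_type='adh_logistic_regression')
AS SELECT * FROM tmp.training_set;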

model_option_list

model_type

(Required) The only required option. Either 'adh_linear_regression' or 'adh_logistic_regression'.

l1_reg

(Optional) The amount of L1 regularization applied. L1 regularization penalizes weights in proportion to the sum of the absolute values of the weights. May be any non-negative number; defaults to zero.

l2_reg

(Optional) The amount of L2 regularization applied. L2 regularization penalizes weights in proportion to the square root of the sum of the squares of the weights. May be any non-negative number; defaults to zero.

data_split_eval_fraction

(Optional) Must be between .01 and .99, with the default being .18. Determines the fraction of data that ends up in the evaluation set. A larger fraction reduces the number of rows used for training, which can reduce model accuracy, but it also increases the number of models you can run. Here’s a chart of that relationship, assuming every model on a given dataset uses the same fraction:

Validation fraction    Allowed queries
0.01                   7
0.1                    8
0.15                   8
0.18                   9
0.2                    9
0.3                    10
0.5                    14
0.9                    50

optimize_strategy

(Optional) The strategy to train linear regression models.

Arguments

'AUTO_STRATEGY' determines the training strategy as follows:

  • If l1_reg or warm_start is specified, the batch_gradient_descent strategy is used.
  • If the total cardinality of the training features is more than 10,000, the batch_gradient_descent strategy is used.
  • If there is an over-fitting issue (the number of training examples is less than 10 times the total cardinality), the batch_gradient_descent strategy is used.
  • The NORMAL_EQUATION strategy is used in all other cases.

'BATCH_GRADIENT_DESCENT' trains the model using the batch gradient descent method, which optimizes the loss function using the gradient function.

'NORMAL_EQUATION' (linear only) directly computes the least-squares solution of the linear regression problem with the analytical formula. The normal equation cannot be used in the following cases:

  • l1_reg is specified.
  • warm_start is specified.
  • The total cardinality of the training features is more than 10,000.

The default value is 'AUTO_STRATEGY'.

max_iterations

(Optional) The number of training iterations or steps. Must be an integer greater than 1. Defaults to 20.

learn_rate_strategy

(Optional, logistic only) The strategy for specifying the learning rate during training.

Arguments

'LINE_SEARCH' uses the line search method to calculate the learning rate. The line search initial learn rate is the value specified for LS_INIT_LEARN_RATE.

  • Line search slows down training and increases the number of bytes processed, but it generally converges even with a larger initial specified learning rate.

'CONSTANT' sets the learning rate to the value specified for LEARN_RATE.

The default value is 'LINE_SEARCH'.

learn_rate

(Optional, logistic only) The learn rate for gradient descent when LEARN_RATE_STRATEGY is set to CONSTANT. If LEARN_RATE_STRATEGY is set to 'LINE_SEARCH', specifying LEARN_RATE returns an error.

Arguments

float64_value can be any 64-bit floating point number. Defaults to 0.1 (10%).

early_stop

(Optional) Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.

Arguments

TRUE for “yes”, FALSE for “no”. Defaults to TRUE.

min_rel_progress

(Optional) The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

Arguments

float64_value can be any 64-bit floating point number. Defaults to 0.1 (10%).

ls_init_learn_rate

(Optional) Sets the initial learning rate that LEARN_RATE_STRATEGY='LINE_SEARCH' uses. This option can only be used if LINE_SEARCH is specified.

If the model’s LEARN_RATE appears to be doubling every iteration as indicated by ML.TRAINING_INFO, try setting LS_INIT_LEARN_RATE to the last doubled learn rate. The optimal initial learn rate is different for every model; a good initial learn rate for one model might not be a good initial learn rate for another. A quick way to inspect the per-iteration learn rate is sketched below.

Arguments

float64_value can be any 64-bit floating point number.
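
A small query along these lines can show whether the learn rate is doubling. The column names iteration and learning_rate are assumed from the output of ML.TRAINING_INFO:

SELECT
  iteration,
  learning_rate  /* watch for values that double each iteration */
FROM ML.TRAINING_INFO(MODEL `logistic_model_test`)
ORDER BY iteration;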

epsilon_per_model

(Optional) Specifies the amount of privacy budget to be used for training this model. Each Ads Data Hub customer is granted a privacy budget of 10.0 per data day. A successfully trained model will spend EPSILON_PER_MODEL of the budget for each data day in the date range specified when running the query. Using the default value of ln(3)/10 will allow the creation of around 100 models. If you use a higher value, you will be able to create fewer models, but they will be of higher quality. If you use a smaller value, you will be able to create more models of lower quality.

Arguments

float64_value can be any positive 64-bit floating point number less than ln(3), which is about 1.0986. Defaults to ln(3)/10.
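
As a sketch, the budget option is passed like any other tuning parameter (the model and table names here are hypothetical):

CREATE MODEL `high_quality_model`
OPTIONS
  (model_type='adh_logistic_regression',
   epsilon_per_model=0.5)             /* spends more budget per data day than the default ln(3)/10, so fewer but higher-quality models */
AS SELECT * FROM tmp.training_set;    /* hypothetical temp table with a "label" column */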

automatic_input_scaling

(Optional) When TRUE, all numerical feature columns will automatically have min_max_scaling applied, as if the column names were explicitly called out in the min_max_scaled_cols option, with the exception of any columns that have been explicitly called out in either the standard_scaled_cols or quantile_bucketized_cols options.

Arguments

bool_value is a BOOL, default value is TRUE.

min_max_scaled_cols

(Optional) Scales each of the specified numerical_expression feature columns in the range of 0 to 1, capped with MIN and MAX across all rows. The same MIN and MAX are automatically used in prediction. If the prediction data is outside the MIN, MAX range, it is capped to either 0 or 1.

Arguments

Array of string_value, where each string_value is a STRING representing the column name to be transformed.

standard_scaled_cols

(Optional) Standardizes the specified numerical_expression feature columns across all rows. The STDDEV and MEAN calculated to standardize the expression are automatically used in prediction.

Arguments

Array of string_value, where each string_value is a STRING representing the column name to be transformed.

quantile_bucketized_cols

(Optional) Bucketizes the specified continuous numerical feature columns into a STRING with the bucket name as the value, based on quantiles. The same quantiles are automatically used in prediction.

Arguments

Array of STRUCT(string_value AS col_name, int64_value AS num_buckets), where each string_value is a STRING representing the continuous numerical column name to be transformed and each int64_value is the number of buckets to split the numerical values into.
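
To show how the three column transform options fit together, here is a hedged sketch (the model, table, and column names are illustrative):

CREATE MODEL `scaled_model`
OPTIONS
  (model_type='adh_linear_regression',
   min_max_scaled_cols=['impressions'],   /* scaled into the 0 to 1 range */
   standard_scaled_cols=['clicks'],       /* standardized using MEAN and STDDEV */
   quantile_bucketized_cols=[STRUCT('spend' AS col_name, 10 AS num_buckets)])  /* 10 quantile buckets */
AS SELECT label, impressions, clicks, spend FROM tmp.training_set;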

Validation

  • Each data day in the date range specified for this query must have sufficient privacy budget, that is, more than EPSILON_PER_MODEL, or the query will fail.
  • The optional tuning parameters, if specified, will be validated for the ranges shown above.
  • Only the one required model_type parameter must be specified explicitly.
  • One column in the training set must be named "label". Multiple labels are not currently supported.
  • The label column cannot contain NULL values. If the label column contains NULL values, the query fails.
  • None of the feature columns may be derived from the user_id.
  • Each row must represent exactly one unique user. One row cannot represent data from more than one user. This can happen with certain joins, e.g. a CROSS JOIN.
  • No user can be in two separate rows (see the aggregation sketch after this list).
  • For privacy reasons, only the options described in the syntax section are usable. Other options that might be found in the BQML CREATE MODEL query documentation are not currently supported.
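
One hedged way to satisfy the one-row-per-user constraints is to aggregate by user before training (all table and column names here are illustrative); note that the user_id column itself must not be selected as a feature:

/* Aggregate events down to exactly one row per user. */
CREATE TABLE tmp.one_row_per_user AS
SELECT
  user_id,
  MAX(IF(event_type = 'conversion', 1, 0)) AS label,  /* hypothetical binary label */
  COUNT(*) AS num_events                              /* hypothetical feature */
FROM my_dataset.events
GROUP BY user_id;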

Evaluation functions

ML.EVALUATE

Use the ML.EVALUATE function to evaluate model metrics. The ML.EVALUATE function can be used with linear regression or logistic regression models.

SELECT
  *
FROM ML.EVALUATE(MODEL `linear_model_test`);

ML.ROC_CURVE

Use the ML.ROC_CURVE function to evaluate logistic regression-specific metrics. ML.ROC_CURVE only evaluates logistic regression models.

SELECT
  *
FROM ML.ROC_CURVE(MODEL `logistic_model_test`);

Prediction functions

ML.PREDICT

The ML.PREDICT function can be used to predict outcomes using the model. Results obtained using ML.PREDICT are subject to the same privacy checks as other results in Ads Data Hub. Learn more about privacy checks.

Linear regression

/* This example outputs the average value for labels that the model predicted */
SELECT
  AVG(predicted_label) AS average_predicted_label
FROM
  ML.PREDICT(MODEL `linear_model_test`, TABLE tmp.linear_training_set);

Logistic regression

/* This example outputs the model's prediction and probabilities for said prediction over individual users. It groups by label and prediction, counting the number of users in each prediction */
SELECT
  label,
  predicted_label, /* one of the two input labels, depending on which label has the higher predicted probability */
  COUNT(*) AS num /* a tally of users */
FROM
  ML.PREDICT(MODEL `logistic_model_test`, TABLE tmp.logistic_training_set)
GROUP BY 1, 2;

Model and feature inspection functions

ML.TRAINING_INFO

The ML.TRAINING_INFO function allows you to see information about the training iterations of a model.

SELECT
  *
FROM ML.TRAINING_INFO(MODEL `logistic_model_test`);

ML.FEATURE_INFO

The ML.FEATURE_INFO function allows you to see information about the input features used to train a model.

SELECT
  *
FROM ML.FEATURE_INFO(MODEL `logistic_model_test`);

ML.WEIGHTS

The ML.WEIGHTS function allows you to see the underlying weights used by a model during prediction.

SELECT
  *
FROM ML.WEIGHTS(MODEL `linear_model_test`);

Examples

Create model

The following examples all use the natality sample table to demonstrate how to create a model.

Training data in inner-select (linear)

The following example uses gender, gestation weeks, mother age, and mother race to predict the birth weight of a child.

CREATE MODEL `natality_model`
OPTIONS
  (model_type='adh_linear_regression') AS
SELECT
  weight_pounds as label,
  is_male,
  gestation_weeks,
  mother_age,
  CAST(mother_race AS string) AS mother_race
FROM
  `bigquery-public-data.samples.natality`
WHERE
  weight_pounds IS NOT NULL

Training data in inner-select (logistic)

The following example uses birth weight, gestation weeks, mother age, and mother race to predict the gender of a child.

CREATE MODEL `natality_model`
OPTIONS
  (model_type='adh_logistic_regression') AS
SELECT
  weight_pounds,
  is_male as label,
  gestation_weeks,
  mother_age,
  CAST(mother_race AS string) AS mother_race
FROM
  `bigquery-public-data.samples.natality`
WHERE
  weight_pounds IS NOT NULL