Linear and logistic regression are machine learning models that enable you to generate meaningful predictions from advertising data.

*Linear regression* creates a function that fits a line to data such that the distance between the data points and the line is minimized. This model can then be used to predict a numerical value based on inputs, such as predicting a user’s lifetime value based on purchases, prior interactions, and so on.

*Logistic regression* is used for predictive classification problems. The model can be one of two types, depending on how many label values you provide:

- *Binary logistic regression* answers "yes/no" questions, such as how likely it is that a conversion event will occur.
- *Multiclass logistic regression* is used to predict multiple possible values, such as determining whether a customer is "low-value", "medium-value", or "high-value".

Linear and logistic regression both learn from training data (in this case, your advertising data), giving you a predictive model to make advertising decisions. Generally, providing more data and ensuring that the data you provide is of high quality will improve the accuracy of your model. Both models perform better when given tightly clustered training data.

## Differential privacy

Linear and logistic regression both use differential privacy, which is a different system of privacy checks than is used by other operations in Ads Data Hub. Differential privacy ensures end-user privacy by injecting noise into your results during the training process. This level of noise remains low enough that the end results are still useful, but high enough that end users can’t be identified. Additionally, the level of noise is non-deterministic, so that results have an inconsistent level of noise, further ensuring end-user privacy.

### Query limiting

You are limited to 100 linear and logistic regression modeling queries per “data day” when using the default `EPSILON_PER_MODEL`. A data day refers to events generated on a given day. These events correspond to the start and end dates you provide when running your query, and the start and end dates used when creating any tables your query uses, such as temp tables used for training. This means that data from 1 day can be used in no more than 100 models. If you specify a value of `EPSILON_PER_MODEL` greater than the default, you can create fewer models, but they will have higher quality. If you specify a smaller value of `EPSILON_PER_MODEL`, you can train more models, but they will have lower quality.

## How it works

The workflow follows these essential steps:

- Prepare training data.
- Create a model.
- Gather insights from the model.

### Prepare training data

As mentioned above, using larger, high-quality datasets will generally lead to better outcomes. Additionally, since the input data is scaled using min-max scaling, sparsely clustered data or data with significant outliers can adversely affect the model by compressing most values into a narrow portion of the scaled range.

By default, Ads Data Hub randomly selects 18% of your training data to use for validation. The fraction of data used for validation can be controlled with the `data_split_eval_fraction` option.

### Create a model

Specify parameters and input data for training your model.

### Best practices

One of the most important factors in the quality of a model is the size of the training set. However, the size/quality tradeoff will be different depending on the problem and the factors listed below. Please let us know your experience.

- We have seen accuracy > 0.70 for logistic regression models built from training sets with at least 100,000 users.
- We have seen r-squared > 0.70 for linear regression models built from training sets with at least 800,000 users.

There are other factors that can reduce the quality of a model.

- Logistic models where one class is much more highly represented than the others. Specifically, when one label has few training set users, having a lot of training set users with other labels probably won't be of much help to the model's accuracy on the small label. For example, 20,000 & 1,000 training set users for two labels is worse than 10,000 & 2,000.
- Feature data that does not provide a strong signal for the labels.
- Raw data that needs more extensive feature engineering. For example, some fields may have a very large number of possible values. One way to improve data of this type is to transform the values into a feature with a smaller number of categories, or buckets.
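One way to sketch that kind of bucketing is with the `QUANTILE_BUCKETIZED_COLS` option described later on this page. In this hypothetical example, the model name, the table `tmp.training_set`, and its columns are placeholders:

```
CREATE MODEL `bucketized_model_test`
OPTIONS
  (
    model_type = 'adh_logistic_regression',
    /* Collapse a wide-ranging numerical field into 4 quantile buckets. */
    quantile_bucketized_cols = [STRUCT('past_purchase_value' AS col_name, 4 AS num_buckets)]
  ) AS
SELECT
  converted AS label,
  past_purchase_value,
  site_visits
FROM tmp.training_set;
```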

### Gather insights

You can call functions to evaluate your model’s performance on validation data, inspect features and information on training iterations (such as the underlying weights used by a model during prediction), and predict on unseen data.

## CREATE MODEL statement

The `CREATE MODEL` statement creates a model with the name and dataset that you specify. If the model name already exists, `CREATE MODEL` replaces the existing model.

### CREATE MODEL syntax

```
CREATE MODEL
model_name
OPTIONS
(
-- model_option_list:
-- Required parameter.
MODEL_TYPE = { 'ADH_LINEAR_REGRESSION' | 'ADH_LOGISTIC_REGRESSION' }
-- Optional tuning parameters.
[, L1_REG = float64_value ]
[, L2_REG = float64_value ]
[, DATA_SPLIT_EVAL_FRACTION = float64_value ]
[, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' |
'NORMAL_EQUATION' } ]
[, MAX_ITERATIONS = int64_value ]
[, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' } ]
[, LEARN_RATE = float64_value ]
[, EARLY_STOP = { TRUE | FALSE } ]
[, MIN_REL_PROGRESS = float64_value ]
[, LS_INIT_LEARN_RATE = float64_value ]
[, EPSILON_PER_MODEL = float64_value ]
[, AUTOMATIC_INPUT_SCALING = bool_value ]
[, MIN_MAX_SCALED_COLS = [string_value, string_value... ] ]
[, STANDARD_SCALED_COLS = [string_value, string_value... ] ]
[, QUANTILE_BUCKETIZED_COLS = [
STRUCT(string_value AS col_name, int64_value AS num_buckets),
STRUCT(string_value AS col_name, int64_value AS num_buckets)... ] ]
)
AS query_statement
```

### model_name

An alphanumeric name, including underscores and dashes. Can’t include dots. If the model name that you provide already exists, the older model will be overwritten.

### query_statement

Specifies the standard SQL query that is used to generate the training data. If you’re creating multiple models using the same training data, create a temporary table with the training data and reference it here. This tactic avoids potential difference check errors due to wipeout or late spam.
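As a sketch of that pattern, you might materialize the training data once and reference it from each model. The table and column names here are hypothetical:

```
/* Materialize the training data once... */
CREATE TABLE tmp.shared_training_set AS
SELECT
  converted AS label,
  site_visits,
  days_since_last_visit
FROM adh.events;

/* ...then reference it from each CREATE MODEL statement. */
CREATE MODEL `first_model_test`
OPTIONS (model_type = 'adh_logistic_regression') AS
SELECT * FROM tmp.shared_training_set;
```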

### model_option_list

#### model_type

(Required) The only required option. Can be '`adh_linear_regression`' or '`adh_logistic_regression`'.

#### l1_reg

(Optional) The amount of L1 regularization applied. L1 regularization penalizes weights in proportion to the sum of the absolute values of the weights. May be any non-negative number, defaults to zero.

#### l2_reg

(Optional) The amount of L2 regularization applied. L2 regularization penalizes weights in proportion to the square root of the sum of the squares of the weights. May be any non-negative number, defaults to zero.

#### data_split_eval_fraction

(Optional) Must be between .01 and .99, with the default being .18. Determines the fraction of the data that ends up in the evaluation set. Increasing the fraction reduces the number of rows used to train the model (which can reduce accuracy), but it also increases the number of models you can run. The following table shows that relationship, assuming every model on a given dataset uses the same fraction:

| Validation fraction | Allowed queries |
|---|---|
| 0.01 | 7 |
| 0.1 | 8 |
| 0.15 | 8 |
| 0.18 | 9 |
| 0.2 | 9 |
| 0.3 | 10 |
| 0.5 | 14 |
| 0.9 | 50 |
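As a sketch, a model that holds out a larger validation split might be created as follows. The table `tmp.training_set` and its columns are hypothetical:

```
CREATE MODEL `eval_split_model_test`
OPTIONS
  (
    model_type = 'adh_linear_regression',
    /* Hold out 30% of rows for evaluation instead of the default 18%. */
    data_split_eval_fraction = 0.30
  ) AS
SELECT
  lifetime_value AS label,
  site_visits,
  days_since_last_visit
FROM tmp.training_set;
```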

#### optimize_strategy

(Optional) The strategy to train linear regression models.

**Arguments**

'`AUTO_STRATEGY`' determines the training strategy as follows:

- If `l1_reg` or `warm_start` is specified, the `batch_gradient_descent` strategy is used.
- If the total cardinality of the training features is more than 10,000, the `batch_gradient_descent` strategy is used.
- If there is an over-fitting issue (the number of training examples is less than 10 times the total cardinality), the `batch_gradient_descent` strategy is used.
- The `NORMAL_EQUATION` strategy is used for all other cases.

'`BATCH_GRADIENT_DESCENT`' (logistic only) trains the model using the batch gradient descent method, which optimizes the loss function using the gradient function.

'`NORMAL_EQUATION`' (linear only) directly computes the least-squares solution of the linear regression problem with the analytical formula. The normal equation cannot be used in the following cases:

- `l1_reg` is specified.
- `warm_start` is specified.
- The total cardinality of the training features is more than 10,000.

The default value is '`AUTO_STRATEGY`'.

#### max_iterations

(Optional) The maximum number of training iterations or steps. Must be an integer greater than 1. Defaults to 20.

#### learn_rate_strategy

(Optional, logistic only) The strategy for specifying the learning rate during training.

**Arguments**

'`LINE_SEARCH`' uses the line search method to calculate the learning rate. The line search initial learn rate is the value specified for `LS_INIT_LEARN_RATE`. Line search slows down training and increases the number of bytes processed, but it generally converges even with a larger initial specified learning rate.

'`CONSTANT`' sets the learning rate to the value specified for `LEARN_RATE`.

The default value is '`LINE_SEARCH`'.

#### learn_rate

(Optional, logistic only) The learn rate for gradient descent when `LEARN_RATE_STRATEGY` is set to `CONSTANT`. If `LEARN_RATE_STRATEGY` is set to '`LINE_SEARCH`', an error is returned.

**Arguments**

`float64_value` can be any 64-bit floating point number. Defaults to 0.1 (10%).
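A minimal sketch pairing the two options; the training table and columns are hypothetical:

```
CREATE MODEL `constant_rate_model_test`
OPTIONS
  (
    model_type = 'adh_logistic_regression',
    /* A constant learning rate must be paired with an explicit LEARN_RATE. */
    learn_rate_strategy = 'CONSTANT',
    learn_rate = 0.05
  ) AS
SELECT
  converted AS label,
  site_visits,
  days_since_last_visit
FROM tmp.training_set;
```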

#### early_stop

(Optional) Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for `MIN_REL_PROGRESS`.

**Arguments**

`TRUE` for “yes”, `FALSE` for “no”. Defaults to `TRUE`.

#### min_rel_progress

(Optional) The minimum relative loss improvement that is necessary to continue training when `EARLY_STOP` is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

**Arguments**

`float64_value` can be any 64-bit floating point number. Defaults to 0.1 (10%).
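A sketch combining early stopping with a custom progress threshold; the training table and columns are hypothetical:

```
CREATE MODEL `early_stop_model_test`
OPTIONS
  (
    model_type = 'adh_linear_regression',
    max_iterations = 50,
    early_stop = TRUE,
    /* Stop once an iteration improves the loss by less than 0.5%. */
    min_rel_progress = 0.005
  ) AS
SELECT
  lifetime_value AS label,
  site_visits,
  days_since_last_visit
FROM tmp.training_set;
```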

#### ls_init_learn_rate

(Optional) Sets the initial learning rate that `LEARN_RATE_STRATEGY='LINE_SEARCH'` uses. This option can only be used if `LINE_SEARCH` is specified.

If the model’s `LEARN_RATE` appears to be doubling every iteration, as indicated by `ML.TRAINING_INFO`, try setting `LS_INIT_LEARN_RATE` to the last doubled learn rate. The optimal initial learn rate is different for every model; a good initial learn rate for one model might not be a good initial learn rate for another.

**Arguments**

`float64_value` can be any 64-bit floating point number.

#### epsilon_per_model

(Optional) Specifies the amount of privacy budget to be used for training this model. Each ads data customer is granted a privacy budget of 10.0 per data day. A successfully trained model will spend EPSILON_PER_MODEL of the budget for each data day in the date range specified when running the query. Using the default value of ln(3)/10 will allow the creation of around 100 models. If you use a higher value, you will be able to create fewer models, but they will be of higher quality. If you use a smaller value, you will be able to create more models of lower quality.

**Arguments**

`float64_value` can be any positive 64-bit floating point number less than ln(3), which is about 1.0986. Defaults to ln(3)/10.
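As a sketch, a model that spends more of the daily budget for higher quality might look like the following; the training table and columns are hypothetical:

```
CREATE MODEL `high_epsilon_model_test`
OPTIONS
  (
    model_type = 'adh_logistic_regression',
    /* Spend more privacy budget per data day for a higher-quality model,
       at the cost of training fewer models on the same data days. */
    epsilon_per_model = 0.5
  ) AS
SELECT
  converted AS label,
  site_visits,
  days_since_last_visit
FROM tmp.training_set;
```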

#### automatic_input_scaling

(Optional) When `TRUE`, all numerical feature columns automatically have `min_max_scaling` applied, as if the column names were explicitly called out in the `min_max_scaled_cols` option, with the exception of any columns that have been explicitly called out in either the `standard_scaled_cols` or `quantile_bucketized_cols` options.

**Arguments**

`bool_value` is a `BOOL`; the default value is `TRUE`.

#### min_max_scaled_cols

(Optional) Scales each of the specified `numerical_expression` feature columns to the range 0 to 1, capped with the `MIN` and `MAX` across all rows. The same `MIN` and `MAX` are automatically used in prediction. If the prediction data is outside the `MIN`, `MAX` range, it is capped to either 0 or 1.

**Arguments**

Array of `string_value`, where each `string_value` is a `STRING` representing the column name to be transformed.

#### standard_scaled_cols

(Optional) Standardizes the specified `numerical_expression` feature columns across all rows. The `STDDEV` and `MEAN` calculated to standardize the expression are automatically used in prediction.

**Arguments**

Array of `string_value`, where each `string_value` is a `STRING` representing the column name to be transformed.

#### quantile_bucketized_cols

(Optional) Bucketizes the specified continuous numerical feature columns into a `STRING` with the bucket name as the value, based on quantiles. The same quantiles are automatically used in prediction.

**Arguments**

Array of `STRUCT(string_value AS col_name, int64_value AS num_buckets)`, where each `string_value` is a `STRING` representing the continuous numerical column name to be transformed and each `int64_value` is the number of buckets to split the numerical values into.
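The three transformation options can be combined in a single statement, as in this sketch with hypothetical table and column names:

```
CREATE MODEL `scaled_model_test`
OPTIONS
  (
    model_type = 'adh_linear_regression',
    min_max_scaled_cols = ['site_visits'],
    standard_scaled_cols = ['days_since_last_visit'],
    quantile_bucketized_cols = [STRUCT('past_purchase_value' AS col_name, 4 AS num_buckets)]
  ) AS
SELECT
  lifetime_value AS label,
  site_visits,
  days_since_last_visit,
  past_purchase_value
FROM tmp.training_set;
```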

### Validation

- Each data day in the date range specified for this query must have sufficient privacy budget remaining (that is, more than `EPSILON_PER_MODEL`), or the query will fail.
- The optional tuning parameters, if specified, will be validated for the ranges shown above.
- Only the required `model_type` parameter must be specified explicitly.
- One column in the training set must be named "label". Multiple labels are not currently supported.
- The label column cannot contain NULL values. If the label column contains NULL values, the query fails.
- None of the feature columns may be derived from the user_id.
- Each row must represent exactly one unique user. One row cannot represent data from more than one user. This can happen with certain joins, e.g. a CROSS JOIN.
- No user can be in two separate rows.
- For privacy reasons, only the options described in the syntax section are usable. Other options that might be found in the BQML CREATE MODEL query documentation are not currently supported.
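One way to satisfy the one-row-per-user requirements above is to aggregate per user before training. In this sketch, the source table and columns are hypothetical:

```
/* Aggregate events so that each user contributes exactly one row. */
CREATE TABLE tmp.one_row_per_user AS
SELECT
  MAX(converted) AS label,            /* 1 if the user ever converted */
  COUNT(*) AS site_visits,            /* aggregated feature */
  MIN(days_since_last_visit) AS days_since_last_visit
FROM adh.events
GROUP BY user_id;
```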

## Evaluation functions

### ML.EVALUATE

Use the `ML.EVALUATE` function to evaluate model metrics. The `ML.EVALUATE` function can be used with both linear regression and logistic regression models.

```
SELECT
  *
FROM ML.EVALUATE(MODEL `linear_model_test`);
```

### ML.ROC_CURVE

Use the `ML.ROC_CURVE` function to evaluate logistic regression-specific metrics. `ML.ROC_CURVE` only evaluates logistic regression models.

```
SELECT
  *
FROM ML.ROC_CURVE(MODEL `logistic_model_test`);
```

## Prediction functions

### ML.PREDICT

The `ML.PREDICT` function can be used to predict outcomes using the model. Results obtained using `ML.PREDICT` are subject to the same privacy checks as other results in Ads Data Hub. Learn more about privacy checks.

**Linear regression**

```
/* This example outputs the average value of the labels that the model predicted. */
SELECT
  AVG(predicted_label) AS average_predicted_label
FROM
  ML.PREDICT(MODEL `linear_model_test`, TABLE tmp.linear_training_set);
```

**Logistic regression**

```
/* This example outputs the model's prediction for individual users.
   It groups by label and prediction, counting the number of users with
   each combination. */
SELECT
  label,
  predicted_label, /* one of the two input labels, depending on which has the higher predicted probability */
  COUNT(*) AS num  /* a tally of users */
FROM
  ML.PREDICT(MODEL `logistic_model_test`, TABLE tmp.logistic_training_set)
GROUP BY 1, 2;
```

## Model and feature inspection functions

### ML.TRAINING_INFO

The `ML.TRAINING_INFO` function allows you to see information about the training iterations of a model.

```
SELECT
  *
FROM ML.TRAINING_INFO(MODEL `logistic_model_test`);
```

### ML.FEATURE_INFO

The `ML.FEATURE_INFO` function allows you to see information about the input features used to train a model.

```
SELECT
  *
FROM ML.FEATURE_INFO(MODEL `logistic_model_test`);
```

### ML.WEIGHTS

The `ML.WEIGHTS` function allows you to see the underlying weights used by a model during prediction.

```
SELECT
  *
FROM ML.WEIGHTS(MODEL `linear_model_test`);
```

## Examples

### Create model

The following examples all use the `natality` sample table to demonstrate how to create a model.

**Training data in inner-select (linear)**

The following example uses gender, gestation weeks, mother age, and mother race to predict the birth weight of a child.

```
CREATE MODEL `natality_model`
OPTIONS
  (model_type='adh_linear_regression') AS
SELECT
  weight_pounds AS label,
  is_male,
  gestation_weeks,
  mother_age,
  CAST(mother_race AS STRING) AS mother_race
FROM
  `bigquery-public-data.samples.natality`
WHERE
  weight_pounds IS NOT NULL
```

**Training data in inner-select (logistic)**

The following example uses birth weight, gestation weeks, mother age, and mother race to predict the gender of a child.

```
CREATE MODEL `natality_model`
OPTIONS
  (model_type='adh_logistic_regression') AS
SELECT
  weight_pounds,
  is_male AS label,
  gestation_weeks,
  mother_age,
  CAST(mother_race AS STRING) AS mother_race
FROM
  `bigquery-public-data.samples.natality`
WHERE
  weight_pounds IS NOT NULL
```

### Audience activation

You can make predictions using an existing linear regression model with known weights, without using `ML.PREDICT` and even without access to the model itself. To learn how, see the Regression modeling for audience activation codelab.