You can make predictions using an existing linear or logistic regression model with known weights without using ML.PREDICT, even without access to the model itself. To do so, you need a workaround for using differentially private (DP) regression models inside audience activation queries in Ads Data Hub.
This step-by-step example shows how to perform simulated inference for actual linear and binary logistic regression models, then compares the results with those of ML.PREDICT to show the accuracy of the simulated results. It also demonstrates a practical example of how to create an audience list with a binary logistic model, as you would when applying a conversion model to audience activation.
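At its core, the technique re-implements the model's scoring function in plain SQL: for a linear model, the prediction is the intercept plus the dot product of the weights with the feature values; for a binary logistic model, that same quantity is passed through the sigmoid function and compared to a threshold. The following minimal sketch illustrates the idea; the table name, feature columns, weights (0.5, 1.2), and intercept (-0.3) are hypothetical placeholders, not values from a real model.
SELECT
-- Linear regression: intercept + dot product of weights and features.
-0.3 + 0.5 * feature_1 + 1.2 * feature_2 AS linear_prediction,
-- Binary logistic regression: sigmoid of the same quantity, then a 0.5 threshold.
IF(1 / (1 + EXP(-(-0.3 + 0.5 * feature_1 + 1.2 * feature_2))) < 0.5, 0, 1) AS logistic_prediction
FROM
`PROJECT_NAME.DATASET_NAME.FEATURE_TABLE`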
Overview of the example:
- Generate data
- Train the model
- Get the weights and intercept
- Simulate the prediction
- Compare results
Step-by-step example
1. Generate data
Create a table with simulated data for training the model. Mark a fraction of the rows for the holdback set.
Linear regression
CREATE OR REPLACE TABLE DATASET_NAME.LIN_REG_TRAINING_SET AS
WITH
A AS (
SELECT
*
FROM
UNNEST(GENERATE_ARRAY(1, 100000)) AS row_number),
B AS (
SELECT
row_number,
RAND() AS rand_label,
RAND() AS rand_feature_1,
RAND() AS rand_feature_2,
RAND() AS rand_feature_3,
RAND() AS rand_feature_4,
RAND() AS rand_feature_5,
RAND() AS rand_feature_6,
RAND() AS rand_feature_7,
RAND() AS rand_feature_8,
RAND() AS rand_feature_9,
RAND() AS rand_feature_10
FROM
A),
C AS (
SELECT
rand_label AS label,
*
FROM
B),
D AS (
SELECT
row_number,
CAST(round(10 * label) AS INT64) AS label,
(rand_label + rand_feature_1) / 2 AS feature_1,
(rand_label + rand_feature_2) / 2 AS feature_2,
(rand_label + rand_feature_3) / 2 AS feature_3,
(rand_label + rand_feature_4) / 2 AS feature_4,
(rand_label + rand_feature_5) / 2 AS feature_5,
(rand_label + rand_feature_6) / 2 AS feature_6,
(rand_label + rand_feature_7) / 2 AS feature_7,
(rand_label + rand_feature_8) / 2 AS feature_8,
(rand_label + rand_feature_9) / 2 AS feature_9,
(rand_label + rand_feature_10) / 2 AS feature_10
FROM
C)
SELECT
label,
feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10,
RAND() < 0.1 AS holdback -- Ten percent will be true.
FROM
D
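Optionally, sanity-check the generated table before training. The following query simply confirms the row count and the approximate ten percent holdback fraction:
SELECT
COUNT(*) AS total_rows,
COUNTIF(holdback) AS holdback_rows,
MIN(label) AS min_label,
MAX(label) AS max_label
FROM
`DATASET_NAME.LIN_REG_TRAINING_SET`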
Binary logistic regression
Create the binary training set from the linear regression training set by bucketing the label into two classes:
CREATE OR REPLACE TABLE `DATASET_NAME.BIN_LOG_REG_TRAINING_SET` AS
SELECT
CASE
WHEN label < 5 THEN 0
WHEN label >= 5 THEN 1
END
AS label,
* EXCEPT (label)
FROM
`DATASET_NAME.LIN_REG_TRAINING_SET`
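Optionally, check the class balance of the binary training set; you should see two classes of roughly similar size, each with about ten percent of its rows marked as holdback:
SELECT
label,
COUNT(*) AS rows_per_class,
COUNTIF(holdback) AS holdback_rows_per_class
FROM
`DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
GROUP BY
label
ORDER BY
label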
2. Train the model
Train a regression model from the training set.
Linear regression
CREATE OR REPLACE MODEL `DATASET_NAME.LIN_REG_MODEL` OPTIONS (model_type="linear_reg") AS
SELECT
* except (holdback)
FROM
`DATASET_NAME.LIN_REG_TRAINING_SET`
WHERE
NOT holdback
Note that we have added enough noise to the simulated data to get a model with R² = 0.9009. Sample results:
Measurement | Value |
---|---|
Mean absolute error | 0.7359 |
Mean squared error | 0.8432 |
Mean squared log error | 0.0810 |
Median absolute error | 0.6239 |
R squared | 0.9009 |
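You can compute evaluation metrics like these yourself by running ML.EVALUATE against the holdback rows; your exact values will differ slightly because the training data is random:
SELECT
*
FROM
ML.EVALUATE(MODEL `DATASET_NAME.LIN_REG_MODEL`,
(
SELECT
* EXCEPT (holdback)
FROM
`DATASET_NAME.LIN_REG_TRAINING_SET`
WHERE
holdback))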
Binary logistic regression
CREATE OR REPLACE MODEL `DATASET_NAME.BIN_LOG_REG_MODEL` OPTIONS (model_type="logistic_reg") AS
SELECT
* EXCEPT (holdback)
FROM
`DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
WHERE
NOT holdback
Sample results. Note the accuracy of 0.9260.
Measurement | Value |
---|---|
Positive class | 1 |
Negative class | 0 |
Precision | 0.0810 |
Recall | 0.9315 |
Accuracy | 0.9260 |
F1 score | 0.9328 |
The diagonal values in this confusion matrix show how often the model classified each label correctly, and the off-diagonal values show how often the model misclassified each label.
True label | Predicted label 1 | Predicted label 0 |
---|---|---|
1 | 93% | 7% |
0 | 8% | 92% |
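You can compute a confusion matrix yourself (as raw counts rather than percentages) with ML.CONFUSION_MATRIX on the holdback rows; your exact numbers will differ because the training data is random:
SELECT
*
FROM
ML.CONFUSION_MATRIX(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`,
(
SELECT
* EXCEPT (holdback)
FROM
`DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
WHERE
holdback))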
3. Get weights and intercept
Get the weights and the intercept for the model:
Linear regression
SELECT
*
FROM
ML.WEIGHTS(MODEL `DATASET_NAME.LIN_REG_MODEL`)
processed_input | weight |
---|---|
feature_1 | 1.8263055528635743 |
feature_2 | 1.8143804404490813 |
feature_3 | 1.8601204874033492 |
feature_4 | 1.8507603439031859 |
feature_5 | 1.7899764387123640 |
feature_6 | 1.8645246630251291 |
feature_7 | 1.8698005281925356 |
feature_8 | 1.7904637080330201 |
feature_9 | 1.8036887855406274 |
feature_10 | 1.8117115890624449 |
INTERCEPT | -4.1428754911504306 |
Binary logistic regression
SELECT
*
FROM
ML.WEIGHTS(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`)
processed_input | weight |
---|---|
feature_1 | 3.823533928 |
feature_2 | 3.734812819 |
feature_3 | 3.842239823 |
feature_4 | 3.785488823 |
feature_5 | 3.737386716 |
feature_6 | 3.567663961 |
feature_7 | 3.819643052 |
feature_8 | 3.734673763 |
feature_9 | 3.839301406 |
feature_10 | 3.787306994 |
INTERCEPT | -17.922169920 |
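If you would rather not copy the weights by hand into later queries, you could persist them in a table and reference it from there. A minimal sketch, with an example table name:
CREATE OR REPLACE TABLE `DATASET_NAME.BIN_LOG_REG_MODEL_WEIGHTS` AS
SELECT
processed_input,
weight
FROM
ML.WEIGHTS(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`)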
4. Simulate the prediction
Linear regression
Use the dot product of the feature values with the weights, and add the intercept, to make the prediction using standard SQL without using ML.PREDICT. The following query compares the predictions made with this technique to those made with ML.PREDICT. The subquery that computes sql_predicted_label performs the dot product of the row's feature values with the model weights, then adds the intercept.
WITH
T AS (
SELECT
label AS actual_label,
predicted_label AS ml_predicted_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[1.8263055528635743,
1.8143804404490813,
1.8601204874033492,
1.8507603439031859,
1.789976438712364,
1.8645246630251291,
1.8698005281925356,
1.7904637080330201,
1.8036887855406274,
1.8117115890624449] AS weights
FROM
ML.PREDICT(MODEL `DATASET_NAME.LIN_REG_MODEL`,
(
SELECT
*
FROM
`PROJECT_NAME.DATASET_NAME.LIN_REG_TRAINING_SET`))
WHERE
holdback),
P AS (
SELECT
actual_label,
ml_predicted_label,
(
SELECT
SUM(element1 * element2) - 4.1428754911504306
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) sql_predicted_label,
features,
weights
FROM
T)
SELECT
actual_label,
ml_predicted_label,
sql_predicted_label,
ABS(ml_predicted_label - sql_predicted_label) < 0.00000000001 AS diff_is_negligible
FROM
P
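To summarize the comparison over the entire holdback set instead of inspecting individual rows, you can replace the final SELECT of the query above with an aggregate, for example:
SELECT
COUNT(*) AS holdback_rows,
COUNTIF(ABS(ml_predicted_label - sql_predicted_label) < 0.00000000001) AS negligible_diff_rows
FROM
P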
Binary logistic regression
For binary logistic regression, the technique for simulating the predictions is very similar to linear regression, with the addition of applying the sigmoid function in the last step with the desired threshold.
Use the dot product of the feature values with the weights, and add the intercept, to make the prediction using standard SQL without using ML.PREDICT. Then apply the sigmoid function to the result and compare it to a threshold of 0.5 to predict either 0 or 1. The following query compares the predictions made with this technique to those made with ML.PREDICT.
WITH
T AS (
SELECT
label AS actual_label,
predicted_label AS ml_predicted_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[3.8235339279050287,
3.7348128191185244,
3.8422398227859471,
3.7854888232502479,
3.7373867156553713,
3.5676639605351026,
3.8196430517007811,
3.7346737628343032,
3.8393014063170749,
3.7873069939244743] AS weights
FROM
ML.PREDICT(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`,
(
SELECT
*
FROM
`PROJECT_NAME.DATASET_NAME.BIN_LOG_REG_TRAINING_SET`))
WHERE
holdback),
P AS (
SELECT
actual_label,
ml_predicted_label,
(
SELECT
IF((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) sql_predicted_label,
features,
weights
FROM
T)
SELECT
actual_label,
ml_predicted_label,
sql_predicted_label,
ml_predicted_label = sql_predicted_label AS simulation_is_accurate
FROM
P
The expression that computes sql_predicted_label in the query above does three things. First, it performs the dot product of the feature values for each row with the model's weights and adds the intercept to get the linear combination:
SUM(element1 * element2) - 17.922169920432161
Then it applies the sigmoid function Y = 1 / (1 + e^-z) to that value, using standard SQL:
1 / (1 + EXP(-(SUM(element1 * element2) - 17.922169920432161)))
Finally, the result of the sigmoid function is compared to the threshold value of 0.5 to arrive at the binary logistic regression prediction: 0 if it is less than 0.5, or 1 if it is not. Note that you can use any threshold value between 0 and 1.
IF((1 / (1 + EXP(-(SUM(element1 * element2) - 17.922169920432161)))) < 0.5, 0, 1)
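As in the linear case, you can replace the final SELECT of the comparison query above with an aggregate to count how many holdback rows agree:
SELECT
COUNT(*) AS holdback_rows,
COUNTIF(ml_predicted_label = sql_predicted_label) AS matching_rows
FROM
P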
This technique may also be extended to multiclass logistic regression. In that case, the model's weights form a matrix with one weight vector per class, rather than a single vector, and the intercept is a vector rather than a scalar. You would multiply the feature values vector by the weights matrix and add the intercept vector. The resulting vector has a score for each label, and you would choose the label with the highest score as your prediction. If you wanted to return a probability array, you would apply the softmax function to the score vector.
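For illustration only, here is a minimal multiclass sketch with two features and three classes. The table name, feature columns, class names, weights, and intercepts are hypothetical placeholders, not values from a real model:
WITH
scored AS (
SELECT
user_id,
-- One linear score per class: dot product of the class weights with the features, plus the class intercept.
0.8 * feature_1 + 1.2 * feature_2 + 0.1 AS score_class_a,
1.5 * feature_1 - 0.4 * feature_2 - 0.3 AS score_class_b,
-0.2 * feature_1 + 0.9 * feature_2 + 0.5 AS score_class_c
FROM
`PROJECT_NAME.DATASET_NAME.MULTICLASS_FEATURE_TABLE`)
SELECT
user_id,
-- Pick the label whose score is highest.
CASE GREATEST(score_class_a, score_class_b, score_class_c)
WHEN score_class_a THEN 'a'
WHEN score_class_b THEN 'b'
ELSE 'c'
END AS predicted_label
FROM
scored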
5. Compare results
Linear regression
The sample results are nearly identical, aside from tiny floating-point rounding differences.
actual_label | ml_predicted_label | sql_predicted_label | diff_is_negligible |
---|---|---|---|
6 | 5.2062349420751834 | 5.2062349420751826 | true |
0 | 0.40318472770048075 | 0.403184727700479 | true |
3 | 3.0703766078249597 | 3.0703766078249597 | true |
7 | 7.0588171538562 | 7.0588171538562 | true |
6 | 6.7802375930646 | 6.7802375930646 | true |
6 | 5.1088569571339368 | 5.1088569571339377 | true |
4 | 4.051839078116874 | 4.051839078116874 | true |
4 | 5.1810254680219243 | 5.1810254680219234 | true |
6 | 6.1440349466401223 | 6.1440349466401205 | true |
1 | 2.0842399472783519 | 2.0842399472783519 | true |
2 | 2.1911209811886847 | 2.1911209811886838 | true |
3 | 3.0236086790006622 | 3.0236086790006613 | true |
2 | 2.573083132964213 | 2.5730831329642125 | true |
7 | 5.68662973136732 | 5.6866297313673186 | true |
9 | 8.1860026312677938 | 8.1860026312677938 | true |
Binary logistic regression
The simulated inference matches the actual results of ML.PREDICT exactly: there is not a single contradiction in the roughly 10,000-row holdback set. There are a few rows where both ML.PREDICT and the simulated inference disagree with the actual label, which is expected because the model accuracy is about 93% and there are small but non-zero values in the off-diagonal cells of the confusion matrix.
actual_label | ml_predicted_label | sql_predicted_label | simulation_is_accurate |
---|---|---|---|
0 | 1 | 1 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 1 | 1 | true |
0 | 0 | 0 | true |
Create an audience activation list with ML
A typical use case would be to create a differentially private binary logistic regression model to predict conversions, then apply inference with this model when creating an audience list. Assume that the binary logistic model created in the example above is modeling conversions, and that each row in the training and evaluation sets represents a distinct user identified by a user_id column.
The following query shows how to create an audience list with those users that the model predicts will convert:
WITH
T AS (
SELECT
*,
label AS actual_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[3.8235339279050287,
3.7348128191185244,
3.8422398227859471,
3.7854888232502479,
3.7373867156553713,
3.5676639605351026,
3.8196430517007811,
3.7346737628343032,
3.8393014063170749,
3.7873069939244743] AS weights
FROM
`PROJECT_NAME.DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
WHERE
holdback),
P AS (
SELECT
*,
(
SELECT
IF
((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) predicted_label,
features,
weights
FROM
T)
SELECT
user_id
FROM
P
WHERE
predicted_label = 1;
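As a quick check on the output, you can count how many users the model selects. The sketch below assumes the results of the query above were saved to a hypothetical table named AUDIENCE_LIST:
SELECT
COUNT(DISTINCT user_id) AS audience_size
FROM
`PROJECT_NAME.DATASET_NAME.AUDIENCE_LIST`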