邏輯迴歸程式碼研究室

1. 簡介

本程式碼研究室將說明如何使用邏輯迴歸,瞭解性別、年齡層、曝光時間和瀏覽器類型等特徵,與使用者按下廣告的可能性有多大的關聯。

必要條件

如要完成這個程式碼研究室,您需要足夠的高品質廣告活動資料來建立模型。

2. 挑選廣告活動

首先,請選取含有大量高品質資料的舊廣告活動。要是不確定哪個廣告活動含有最佳品質的資料,不妨針對時間最早且可存取的整月資料執行以下查詢:

SELECT
  campaign_id,
  COUNT(DISTINCT user_id) AS user_count,
  COUNT(*) AS impression_count
FROM adh.google_ads_impressions

ORDER BY user_count DESC;

選取 12 到 13 個月的資料後,您就能針對即將從廣告資料中心移除的資料訓練及測試模型。如果這項資料受到模型訓練限制,資料刪除後,這些限制就會結束。

如果廣告活動特別活躍,一週的資料可能就足夠。最後,不重複使用者人數應為 100,000 人以上,特別是使用多項特徵進行訓練時。

3. 建立暫時資料表

找出要用於訓練模型的廣告活動後,請執行以下查詢。

CREATE TABLE
 binary_logistic_regression_example_data
AS(
 WITH all_data AS (
   SELECT
     imp.user_id as user_id,
     ROW_NUMBER() OVER(PARTITION BY imp.user_id) AS rowIdx,
     imp.browser as browser_name,
     gender_name as gender_name,
     age_group_name as age_group_name,
     DATETIME(TIMESTAMP_MICROS(
       imp.query_id.time_usec), "America/Los_Angeles") as impression_time,
     CASE # Binary classification of clicks simplifies model weight interpretation
        WHEN clk.click_id.time_usec IS NULL THEN 0
        ELSE 1
     END AS label
   FROM adh.google_ads_impressions imp
     LEFT JOIN adh.google_ads_clicks clk USING (impression_id)
     LEFT JOIN adh.gender ON demographics.gender = gender_id
     LEFT JOIN adh.age_group ON demographics.age_group = age_group_id
   WHERE
     campaign_id IN (YOUR_CID_HERE)
 )
 SELECT
   label,
   browser_name,
   gender_name,
   age_group_name,
   # Although BQML could divide impression_time into several useful variables on
   # its own, it may attempt to divide it into too many features. As a best
   # practice extract the variables that you think will be most helpful.
   # The output of impression_time is a number, but we care about it as a
   # category, so we cast it to a string.
   CAST(EXTRACT(DAYOFWEEK FROM impression_time) AS STRING) AS day_of_week,
   # Comment out the previous line if training on a single week of data
   CAST(EXTRACT(HOUR FROM impression_time) AS STRING) AS hour,
 FROM
   all_data
 WHERE
   rowIdx = 1 # This ensures that there's only 1 row per user.
   AND
   gender_name IS NOT NULL
   AND
   age_group_name IS NOT NULL
);

4. 建立及訓練模型

最佳做法是將資料表建立步驟與模型建立步驟分開。

請針對您在上一個步驟建立的暫存資料表,執行以下查詢。請放心,您不用提供開始和結束日期,系統會根據臨時資料表的資料推斷這兩項資訊。

CREATE OR REPLACE
MODEL `binary_logistic_example`
OPTIONS(
   model_type = 'adh_logistic_regression'
)
AS (
   SELECT *
   FROM
       tmp.binary_logistic_regression_example_data
);

SELECT * FROM ML.EVALUATE(MODEL `binary_logistic_example`)

5. 解讀結果

查詢執行完畢後,您會看到像下面這樣的資料表,但實際的廣告活動成效會有所不同。

資料列

precision

recall

accuracy

f1_score

log_loss

roc_auc

1

0.53083894341399718

0.28427804486705865

0.54530547622568992

0.370267971696336

0.68728232223722974

0.55236263736263735

查看權重

請執行下列查詢來查看權重,瞭解哪些特徵會影響模型預測點擊的可能性:

SELECT * FROM ML.WEIGHTS(MODEL `binary_logistic_example`)

這項查詢會產生類似下方的結果。請注意,BigQuery 會將指定標籤排序,並將「最小」設為 0,「最大」設為 1。在本例中,clicked 為 0,not_clicked 為 1。因此,較大的權重可以解讀為相關特徵促成點擊的可能性較低。此外,第 1 天對應週日。

processed_input

weight

category_weights.category

category_weights.weight

1

INTERCEPT

-0.0067900886484743364

2

browser_name

空值

不明 0.78205563068099249

Opera 0.097073700069504443

Dalvik -0.75233190448454246

Edge 0.026672464688442348

Silk -0.72539916969348706

其他 -0.10317444840919325

Samsung Browser 0.49861066525009368

Yandex 1.3322608977581121

IE -0.44170947381475295

Firefox -0.10372609461557714

Chrome 0.069115931084794066

Safari 0.10931362123676475

3

day_of_week

空值

7 0.051780350639992277

6 -0.098905011477176716

4 -0.092395178188358462

5 -0.010693625983554155

3 -0.047629987110766638

1 -0.0067030673140933122

2 0.061739400111810727

4

hour

空值

15 -0.12081420778273

16 -0.14670467657779182

1 0.036118460001355934

10 -0.022111985303061014

3 0.10146297241339688

8 0.00032334907570882464

12 -0.092819888101463813

19 -0.12158349523248162

2 0.27252001951689164

4 0.1389215333278028

18 -0.13202189122418825

5 0.030387010564142392

22 0.0085803647602565782

13 -0.070696534712732753

14 -0.0912853928925844

9 -0.017888651719350213

23 0.10216569641652029

11 -0.053494611827240059

20 -0.10800180853273429

21 -0.070702105471528345

0 0.011735200996326559

6 0.016581239381563598

17 -0.15602138949559918

7 0.024077394387953525

5

age_group_name

空值

45-54 -0.013192901125032637

65+ 0.035681341407469279

25-34 -0.044038102549733116

18-24 -0.041488170110836373

不明 0.025466344709472313

35-44 0.01582412778809188

55-64 -0.004832373590628946

6

gender_name

空值

男性 0.061475274448403977

不明 0.46660611583398443

女性 -0.13635601771194916