Control variables

Selecting control variables

The purpose of marketing mix modeling (MMM) is causal inference on media effects, not predictive accuracy. The primary purpose of control variables is therefore to improve inference on the causal effect of media execution on the KPI.

Controls are variables in the model that aren't media effects. There are several types of control variables, as follows:

  • Confounding variables have a causal effect on both media execution and the KPI. Including these variables debiases the causal estimates of media execution on the KPI.

  • Mediator variables are causally affected by media execution and have a causal effect on the KPI. Including these variables biases the causal estimates, so they must be excluded from the model.

  • Predictor variables have a causal effect on the KPI, but nothing else. Including these variables does nothing to debias the causal effect of media execution. However, strong predictors can reduce the variance of causal estimates. A canonical example of this is price.

We recommend that you only include strong predictors, but don't include too many variables with the sole purpose of optimizing predictive accuracy. Too many predictor variables can increase the risk of model misspecification bias.

This can be summarized in the following causal directed acyclic graph (DAG), with the goal of getting the causal effect of media on the KPI. In the names of the nodes, the number 1 denotes variable values at time period 1, the number 2 denotes variable values at time period 2, and so on. The figure only shows nodes for time periods 1 and 2, but assume it continues for \(T\) many time periods.

Figure: DAG of the causal effect of media on the KPI
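
To make the roles of confounders and mediators concrete, here is a minimal simulation sketch. It is not part of any modeling library: the data-generating process, the coefficient values, and the helper `ols_coefs` are all assumptions chosen for illustration. It compares ordinary least squares estimates of the media coefficient when a confounder is omitted, when it is included, and when a mediator is included as well.

```python
import numpy as np


def ols_coefs(y, *columns):
    """Return OLS slope coefficients of y on the given columns (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]


rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process; every coefficient here is made up.
confounder = rng.normal(size=n)                 # affects media and the KPI
media = 0.8 * confounder + rng.normal(size=n)
mediator = 0.5 * media + rng.normal(size=n)     # caused by media, affects the KPI
kpi = 1.0 * media + 2.0 * mediator + 1.5 * confounder + rng.normal(size=n)

# Total causal effect of media = direct effect + effect through the mediator.
true_total_effect = 1.0 + 2.0 * 0.5

omit_confounder = ols_coefs(kpi, media)[0]
with_confounder = ols_coefs(kpi, media, confounder)[0]
with_mediator = ols_coefs(kpi, media, confounder, mediator)[0]

print(f"true total effect:        {true_total_effect:.2f}")
print(f"confounder omitted:       {omit_confounder:.2f}")
print(f"confounder included:      {with_confounder:.2f}")
print(f"mediator also included:   {with_mediator:.2f}")
```

Under these assumed coefficients, the total effect of media is 2.0. Omitting the confounder pushes the estimate to about 2.7, including it recovers roughly 2.0, and adding the mediator leaves only the direct effect of about 1.0.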

When brainstorming possible confounding variables to include in the model, we recommend focusing primarily on gathering variables that explain media budget decisions instead of focusing on gathering variables that predict the KPI. Typically, there are a vast number of variables that are predictive of the KPI, but relatively fewer variables that are used to set advertising budgets. In principle, marketing managers can provide a list of all quantifiable information that was used to make budget decisions, though in reality it might be difficult to compile a complete list.

Basic questions to ask marketing managers include:

  • At the annual or quarterly level, how did they decide on the total media budget?
  • How did they decide the allocation across media channels?
  • Within each year, how did they decide the high and low budget weeks?
  • Are there any spikes in spend that correspond to certain events, such as holidays or product launches?
  • For each of the preceding questions, what data sources would most strongly correlate with the budget decisions? For example, the previous years' KPI values or economic variables?

Ultimately, we recommend that you:

  • Include confounder variables.
  • Exclude mediator variables.
  • Include strong predictors that can reduce variance of causal estimates. A canonical example of this is price.
  • Don't include too many variables with the sole purpose of optimizing predictive accuracy because it can increase the risk of model misspecification bias.

Including query volume as a control variable

As mentioned in Selecting control variables, including confounding variables is necessary for debiasing the causal effect of media on the KPI. Excluding mediator variables is also necessary for unbiased causal estimates. Query volume might be a mediator for some media channels, but a confounder for others. For example, query volume is a confounder for search ads, because a relevant query is often a prerequisite for a search ad. However, other forms of media can drive search behavior, so query volume is a mediator for those media channels.

Since you want to estimate the joint treatment effect of all media channels, you use a single model for inference. So, you must decide either to assume query volume is a confounder and include it in the model, or to assume query volume is a mediator and exclude it from the model. Base your selected assumption on the following considerations:

  • The channels that are more important to get unbiased estimates for
  • The assumed strengths of relationships between media channels, query volume, and the KPI
  • The assumed number of channels where query volume is a confounding variable instead of a mediator variable

We believe that assuming query volume is a confounder and including it in the model will more often be the right decision, due to the relative strength of the relationship between query volume and search media. However, the decision depends on the use case.

Using lagged variables

For certain control variables \(X\), it can make sense to include lagged values. For example, at each week \(t\), include \(X_{t-1},\dots ,X_{t-L}\) for some value of \(L\). We recommend only doing this if you think the lagged values \(X_{t-1},\dots ,X_{t-L}\) have a causal effect on the KPI at week \(t\).
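
As a rough sketch of what including lagged values means in practice, the following NumPy snippet builds the columns \(X_t, X_{t-1},\dots ,X_{t-L}\) from a single weekly series. The helper name `add_lags`, the NaN padding, and the sample values are assumptions for illustration. How the first \(L\) weeks (which lack full history) are handled, and how the resulting columns are passed to the model as separate controls, depends on your pipeline.

```python
import numpy as np


def add_lags(x, max_lag):
    """Return an array of shape (T, max_lag + 1).

    Column j holds the series lagged by j weeks, so row t contains
    x[t], x[t-1], ..., x[t-max_lag]. Entries without enough history are NaN.
    """
    x = np.asarray(x, dtype=float)
    out = np.full((len(x), max_lag + 1), np.nan)
    for lag in range(max_lag + 1):
        out[lag:, lag] = x[: len(x) - lag]
    return out


# Hypothetical weekly values of one control variable.
weekly_control = [10.0, 12.0, 11.0, 15.0, 14.0]
lagged = add_lags(weekly_control, max_lag=2)

# Row for the last week: current value, then lag 1, then lag 2.
print(lagged[4])  # [14. 15. 11.]
```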

When lagged controls aren't needed

The following diagram shows a causal directed acyclic graph (DAG) where media is assumed to have a lagged effect, but controls aren't. Assuming this DAG, lagged controls are not needed. In the names of the nodes, the number 1 denotes variable values at time period 1, the number 2 denotes variable values at time period 2, and so on. The figure only shows nodes for time periods 1 and 2, but assume it continues for \(T\) many time periods.

Using the backdoor criterion (Pearl, J. 2009), you can estimate the causal effect of media on week 2 KPI by fitting a regression model to estimate \(E\bigl( K2 \big| M2,M1,C2 \bigr) = E\bigl( K2^{(M2, M1)} \big| C2 \bigr)\). Previous controls (\(C1\)) aren't needed.

Figure: DAG where lagged controls aren't needed

When lagged controls are needed

The following diagram is a causal DAG where lagged controls are needed. Again, the numbers in the node names correspond to the time period. Here, \(L1\) and \(L2\) denote control variables that have a lagged effect on the KPI (not to be confused with the maximum lag \(L\)). To estimate the causal effect of media on week 2 KPI, you must condition on the week 1 controls \(L1\). Failing to do so leaves an unblocked backdoor path \(M1 \leftarrow L1 \rightarrow K2\). Using the backdoor criterion, you can fit a regression model to estimate \(E\bigl( K2 \big| M2,M1,C2,L2,L1 \bigr) = E\bigl( K2^{(M2,M1)} \big| C2,L2,L1 \bigr)\).

Figure: DAG where lagged controls are needed
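
The unblocked path \(M1 \leftarrow L1 \rightarrow K2\) can be illustrated with a small simulation. This is only a sketch under assumed conditions: the coefficients are made up, the contemporaneous-only controls \(C\) are left out for brevity, and the helper `ols_coefs` is hypothetical. The control `ctrl` plays the role of \(L\) in the diagram. It drives media in the same week and affects the KPI in both the same week and the following week.

```python
import numpy as np


def ols_coefs(y, *columns):
    """Return OLS slope coefficients of y on the given columns (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]


rng = np.random.default_rng(1)
T = 20_000

ctrl = rng.normal(size=T)                  # control with a lagged effect (L_t)
media = 0.7 * ctrl + rng.normal(size=T)    # L_t -> M_t

# KPI at week t depends on media and the control at weeks t and t-1.
kpi = (1.0 * media[1:] + 0.5 * media[:-1]
       + 1.0 * ctrl[1:] + 2.0 * ctrl[:-1]
       + rng.normal(size=T - 1))

# Regressors: M_t, M_{t-1}, L_t, and optionally L_{t-1}.
without_lag = ols_coefs(kpi, media[1:], media[:-1], ctrl[1:])
with_lag = ols_coefs(kpi, media[1:], media[:-1], ctrl[1:], ctrl[:-1])

print(f"lagged media effect, L_(t-1) omitted:  {without_lag[1]:.2f}")
print(f"lagged media effect, L_(t-1) included: {with_lag[1]:.2f}")
```

With these assumed coefficients, the true lagged media effect is 0.5. Omitting the lagged control inflates the estimate to roughly 1.4, while including it recovers about 0.5. The same-week media coefficient is close to 1.0 in both fits because, in this setup, the omitted lagged control is independent of same-week media.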

The previous diagram is a simplified 2-week DAG. In general, for each week \(t\), you should include controls from weeks \(t,t-1, \dots ,t-L\), where \(L\) is the longest lag at which controls are still thought to affect the KPI. The value of \(L\) can differ by control variable.

In practice, you can truncate \(L\) at a reasonable value to prevent inflating the model variance by adding too many variables. In many cases, it can be reasonable to ignore lagged controls altogether if the lagged effects are relatively weak. This type of model simplification can be viewed as a bias-variance trade-off.

Population scaling control variables

By default, the KPI and media execution are population scaled. Control variables aren't population scaled by default because some controls, such as temperature, shouldn't be population scaled. However, some control variables, such as query volume, should be population scaled to maximize the correlation between the population-scaled KPI and population-scaled media variables.
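
As a rough illustration of the idea, the sketch below divides a volume-type control by geo population and leaves a temperature-type control untouched. The array names, shapes, and values are hypothetical. It shows only the per-capita division, not any further centering or standardization a modeling library might apply, and it doesn't show how a particular library exposes this option.

```python
import numpy as np

# Hypothetical data for 3 geos over 4 weeks.
population = np.array([1_000_000.0, 250_000.0, 50_000.0])

query_volume = np.array([
    [20_000.0, 22_000.0, 21_000.0, 25_000.0],
    [5_000.0, 5_500.0, 5_250.0, 6_250.0],
    [1_000.0, 1_100.0, 1_050.0, 1_250.0],
])

temperature = np.array([
    [12.0, 14.0, 13.0, 15.0],
    [18.0, 19.0, 17.0, 20.0],
    [5.0, 6.0, 4.0, 7.0],
])

# Volume-type controls grow with geo size, so divide by population
# (broadcast across the week axis) to get a per-capita value.
query_volume_per_capita = query_volume / population[:, None]

# Temperature doesn't depend on geo size, so it is left as is.
controls = np.stack([query_volume_per_capita, temperature], axis=-1)

print(query_volume_per_capita[:, 0])  # [0.02 0.02 0.02]
print(controls.shape)                 # (3, 4, 2)
```

In this made-up example, raw query volume differs by a factor of 20 between the largest and smallest geo, but the per-capita values are identical across geos, which puts the control on the same footing as a population-scaled KPI.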