Getting MCMC convergence
Lack of convergence is typically due to one of the following causes:
- The model is poorly specified for the data. This problem can be in the likelihood (model specification) or in the prior.
- The `n_adapt + n_burnin` or `n_keep` arguments of `Meridian.sample_posterior` are not large enough.
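Convergence is usually judged with the R-hat (potential scale reduction) statistic on the sampled chains. As a minimal sketch (plain NumPy, not Meridian code), R-hat can be computed from raw draws for one parameter, assuming an array of shape `(n_chains, n_draws)`:

```python
import numpy as np

def rhat(chains):
    """Gelman-Rubin R-hat for one parameter; chains shaped (n_chains, n_draws)."""
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    b = n_draws * chain_means.var(ddof=1)        # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    var_hat = (n_draws - 1) / n_draws * w + b / n_draws
    return float(np.sqrt(var_hat / w))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))                      # four chains, same target
stuck = mixed + np.array([[0.0], [0.0], [0.0], [5.0]])  # one chain off target
```

Values near 1.0 indicate the chains agree; values well above 1.0 suggest increasing `n_adapt`, `n_burnin`, or `n_keep`, or revisiting the model as described below.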
To get your chains to converge, try the following recommendations in this order:
- Check for identifiability or weak identifiability using these questions:
  - Do you have highly multicollinear `media` or `controls` variables?
  - Is the variation in your `media` or `controls` variable so small that it is difficult to estimate its effect?
  - Is one of the `media` or `controls` variables highly correlated with time, or even perfectly collinear with time? For more information, see When you must use `knots < n_times`.
  - Is one of the `media` variables quite sparse? Sparsity can mean very little execution in a channel, too many geos with no execution whatsoever, or too many time periods with no media execution whatsoever (especially if the number of `knots` is close to `n_times`).
- Reassess the priors. Highly uninformative priors often make convergence difficult, but highly informative priors can also make convergence difficult in certain situations.
- If your KPI is revenue or if you have revenue per KPI data, consider the advice in ROI priors and calibration for paid media channels.
- If you don't have revenue data, consider the advice in Set custom priors when outcome is not revenue for paid media channels. Reducing the total media contribution prior mean and/or standard deviation may help to achieve a sufficient degree of regularization.
- Adjust the modeling options. In particular, try decreasing the `knots` argument of `ModelSpec`. Other modeling options to adjust include `unique_sigma_for_each_geo` or `media_effects_dist` of `ModelSpec`.
- Check for a data error, for example, whether the `population` order doesn't match the `media` order for geos.
- Meridian's model assumes a geo hierarchy in media and control effects. If this assumption does not match your data, regularize these parameters further by setting the priors on the parameters that measure hierarchical variance (`eta_m` and `xi_c`), for example, `HalfNormal(0.1)`. You can also turn off the geo hierarchy assumption with a `Deterministic(0)` prior.
- Consider whether you have enough data. For more information, see Amount of data needed.
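Several of the identifiability questions above can be checked programmatically before fitting. The sketch below (a hypothetical helper in plain NumPy, using a media array aggregated over geos) flags near-collinear channel pairs and sparse channels:

```python
import numpy as np

def identifiability_report(media, corr_threshold=0.9, zero_threshold=0.5):
    """media: array of shape (n_times, n_channels), aggregated over geos."""
    corr = np.corrcoef(media, rowvar=False)
    n = media.shape[1]
    # Channel pairs whose execution is highly correlated.
    pairs = [(i, j, corr[i, j]) for i in range(n) for j in range(i + 1, n)
             if abs(corr[i, j]) > corr_threshold]
    # Fraction of time periods with zero execution, per channel.
    zero_frac = (media == 0).mean(axis=0)
    sparse = np.flatnonzero(zero_frac > zero_threshold)
    return pairs, sparse

media = np.column_stack([
    [10, 12, 11, 13, 14, 15],   # channel 0
    [20, 24, 22, 26, 28, 30],   # channel 1: exactly 2x channel 0, collinear
    [0, 0, 0, 5, 0, 0],         # channel 2: sparse execution
])
pairs, sparse = identifiability_report(media)
```

Channels flagged here are candidates for merging, dropping, or more informative priors.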
When the posterior is the same as the prior
When the model is trying to estimate many variables, more data is needed to learn about any particular one. MMM typically makes inference on many variables without many data points, particularly in the case of a national model. This means that there will be instances where there is little information in the data for a particular media channel. This situation can be exacerbated when a particular channel has low spend, very low variance in the scaled media execution, or high correlation of scaled media execution with other channels. For more information about data amounts, see Amount of data needed. For more information about channels with low spend, see Channels with low spend.
You can make the prior and the posterior differ from each other by using increasingly uninformative priors. Recall that the prior represents an estimate of a parameter before the data has been taken into account, and the posterior is an estimate of that parameter after the data has been taken into account. When there is little information in the data, the before and after estimates will be similar. This is particularly true when the prior is relatively informative, where relative refers to the information in the prior compared with the information in the data. The data can always dominate the prior if you set an uninformative enough prior. However, if the prior is uninformative relative to data that also carries little information, then the posterior will be quite wide, representing a lot of uncertainty.
One way to simplify things is to think about the prior you are setting for parameters such as ROI. You don't have to worry too much about the relative informativeness of the prior if you just set reasonable priors that you believe in. If there is little or no information in the data, then it makes sense from a Bayesian perspective that the prior and the posterior are similar. If there is a lot of information in the data, then your prior will likely move based on this data.
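The prior-versus-data trade-off can be seen in the simplest conjugate setting. This sketch (a normal-normal model with hypothetical numbers, not Meridian's model) shows that with little data the posterior stays close to the prior, while with ample data it concentrates near the data mean:

```python
import numpy as np

def posterior_normal(prior_mean, prior_sd, data, noise_sd):
    """Conjugate normal-normal update for a single mean parameter."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = len(data) / noise_sd**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * np.mean(data))
    return post_mean, np.sqrt(post_var)

rng = np.random.default_rng(1)
little = rng.normal(3.0, 10.0, size=2)     # noisy data, only 2 points
lots = rng.normal(3.0, 10.0, size=5000)    # noisy data, 5000 points

mean_l, sd_l = posterior_normal(1.0, 1.0, little, noise_sd=10.0)
mean_b, sd_b = posterior_normal(1.0, 1.0, lots, noise_sd=10.0)
# sd_l stays near the prior sd of 1.0; sd_b shrinks and mean_b moves toward 3.0
```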
Channels with low spend
Channels with low spend are particularly susceptible to having an ROI posterior similar to the ROI prior. Each channel has a range of ROI values that fit the data reasonably well. If this range is wide and covers most of the prior probability mass, then the posterior tends to look like the prior. The range of reasonable ROI values for a small-spend channel tends to be much wider than that of a high-spend channel because small-spend channels need very large ROI to have much influence on the model fit, so a large range of ROI values is likely to fit the data reasonably well.
Media effects are modeled based on the media metric provided, such as impressions and clicks. Neither the scale of the media metric nor the spend level has any effect on the model fit or the range of incremental outcome that could reasonably be attributed to the channel. ROI is defined as incremental outcome divided by spend, so when the range of reasonable incremental outcome values is translated to an ROI range, a channel with larger spend will have a narrower range of ROI values that fit the data well.
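This translation can be illustrated with simple arithmetic. Suppose (hypothetically) the data supports the same range of incremental outcome for two channels that differ only in spend:

```python
# Hypothetical range of incremental outcome consistent with the data.
inc_low, inc_high = 1_000.0, 50_000.0

def roi_range(spend):
    # ROI = incremental outcome / spend, so the same outcome range
    # maps to a much wider ROI range when spend is small.
    return inc_low / spend, inc_high / spend

small = roi_range(1_000.0)     # wide ROI range for a low-spend channel
large = roi_range(100_000.0)   # narrow ROI range for a high-spend channel
```

The low-spend channel's ROI range spans 1 to 50, while the high-spend channel's spans only 0.01 to 0.5, so the low-spend posterior is far more prior-dominated.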
Note: In the case of ordinary least squares regression, the scale of the covariates has no effect on the fit. The scale can matter in a Bayesian regression setting when priors are applied to the coefficients; however, Meridian applies a scaling transformation to each media metric. Scaling a channel's impressions by a factor of 100, for example, does not affect the Meridian model fit.
When ROI results are widely different depending on the prior used
ROI results can be very different depending on whether ROI default priors are used or beta default priors are used.
The use of ROI default priors and beta default priors can affect ROI results for the following reasons:
- When default ROI priors are used, each media channel's posterior ROI is regularized towards the same distribution. This is a good thing because every channel is treated equitably.
- When default priors on the media coefficients (beta) are used, each media channel's posterior ROI is regularized towards a different distribution. This is because the scaling applied to the media data is not the same across channels, so the same beta value implies different ROIs for different channels. The default priors on media coefficients are also uninformative relative to the default ROI prior, to account for potentially large differences in the scaling of the media data across channels.
- When there is little information in the data, the prior and the posterior will be similar, as discussed in When the posterior is the same as the prior. When there is little information in the data and beta priors are used, posterior ROIs will be different across the media channels. However, this difference is only coming from the inequitable priors on the media channels and not the data. In summary, it is important to not interpret different ROI results across the channels as a result that is picking up signal from the data, when the difference is only driven by inequitable priors.
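The scaling point can be illustrated with a deliberately simplified linear effect (all numbers hypothetical; adstock and saturation are ignored): the same coefficient implies very different ROIs once media scale and spend differ across channels:

```python
import numpy as np

beta = 0.5                              # same coefficient for both channels
media_units = np.array([1e6, 1e3])      # hypothetical media scales
spend = np.array([50_000.0, 40_000.0])  # hypothetical spend levels

incremental = beta * media_units        # simplified linear media effect
roi = incremental / spend               # identical beta, very different ROIs
```

A shared prior on beta therefore pulls the two channels toward different ROI values, which is why equal posterior-ROI differences should not automatically be read as signal from the data.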
ResourceExhaustedError when running Meridian.sample_posterior
`Meridian.sample_posterior` calls `tfp.experimental.mcmc.windowed_adaptive_nuts`, which can be memory intensive on GPUs when sampling a large number of chains in parallel or when training with large datasets.
One way to reduce the peak GPU memory consumption is to sample chains serially. This capability is provided by passing a list of integers to `n_chains`. For example, `n_chains=[5, 5]` samples a total of 10 chains by calling `tfp.experimental.mcmc.windowed_adaptive_nuts` consecutively, each time with the argument `n_chains=5`.
Note that this does come with a runtime cost. Because this method reduces memory consumption through consecutive calls to the MCMC sampling method, the total runtime increases linearly with the length of the list passed to `n_chains`. For example, `n_chains=[5, 5]` can take up to 2 times as long to run as `n_chains=10`, and `n_chains=[4, 3, 3]` can take up to 3 times as long.
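The batching semantics can be sketched with a generic wrapper (a toy illustration, not Meridian's implementation): peak memory scales with the largest batch, while the number of sampler calls, and hence runtime, scales with the number of batches:

```python
def sample_serially(sample_fn, n_chains, **kwargs):
    """Run the sampler once per batch when n_chains is a list, then combine."""
    batches = n_chains if isinstance(n_chains, list) else [n_chains]
    results = []
    for batch in batches:
        results.extend(sample_fn(n_chains=batch, **kwargs))
    return results

calls = []

def toy_sampler(n_chains):
    calls.append(n_chains)                        # record each invocation
    return [[0.0] * 3 for _ in range(n_chains)]   # one draw list per chain

draws = sample_serially(toy_sampler, [5, 5])
# 10 chains total, produced by two sampler calls of 5 chains each
```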
Organic media contribution is too high
If the organic media contribution is higher than expected, the prior used may not be appropriate. Organic media has no defined ROI and, as such, uses the regression coefficient parameterization with the coefficient (`beta_om` or `beta_orf`) prior. If the contribution for organic media is observed to be higher than expected, revisit the priors used for the organic media channels. By default, the priors are relatively uninformative, but they do assume a positive effect, which can result in a high prior mean. When there is little information in the data, this can also lead to a high posterior mean. If this is an issue, consider using an alternative prior with more mass in the lower end of the distribution. Also note that when `media_effects_dist = 'log_normal'`, $\beta_i^{[OM]}$ is the prior mean of the log of the geo-level media effect, $\log(\beta_{g,i}^{[OM]})$. The default prior, `HalfNormal(5.0)`, may in this case place too much prior mass away from zero. This is exacerbated when exponentiating, and you may want to consider a prior with more mass near zero, such as a `HalfNormal(0.1)` prior. Note that although the variance is small, it still provides a wide range of possible values on the exponentiated scale. Alternatively, for more flexibility, you could consider a `Normal` prior that allows setting both the location and scale, for example, `Normal(0.0, 3.0)`. Similarly, when `media_effects_dist = 'normal'`, consider using a prior with a smaller scale than the default, such as `HalfNormal(1.0)`.
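The effect of the prior scale after exponentiation can be checked empirically by sampling (a quick sketch using the absolute value of a normal as a half-normal; not Meridian code):

```python
import numpy as np

def implied_effects(scale, n=100_000, seed=0):
    rng = np.random.default_rng(seed)
    half_normal = np.abs(rng.normal(0.0, scale, size=n))  # HalfNormal(scale)
    return np.exp(half_normal)  # implied geo-level effect under 'log_normal'

wide = implied_effects(5.0)    # median implied effect far above 1, huge tail
narrow = implied_effects(0.1)  # mass concentrated near an effect of 1
```

Comparing medians and upper quantiles of `wide` and `narrow` makes concrete how much prior mass `HalfNormal(5.0)` places on implausibly large effects once exponentiated.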
Error about controls that don't vary across groups or geos
This error means that you have a national-level variable that doesn't vary across geos and you have set `knots = n_times`. When `knots = n_times`, each time period is getting its own parameter. A national-level variable varies only across time, and not across geo. Therefore, the national-level variable is perfectly collinear with time and is redundant with a model that has a parameter for each time period. Redundant means that you can either keep the national-level variable or set `knots < n_times`. Which variable you choose depends on your interpretation goals.