## Stacking, pseudo-BMA, and AIC type weights for combining Bayesian predictive distributions

This post is by Aki.

We have often been asked in the Stan user forum how to do model combination for Stan models. Bayesian model averaging (BMA) by computing marginal likelihoods is challenging in theory and even more challenging in practice using only the MCMC samples obtained from the full model posteriors.

Some users have suggested using Akaike type weighting by exponentiating WAIC or LOO to compute weights to combine models.

We had doubts on this approach and started investigating this more so that we could give some recommendations.

Investigation led to the paper we (Yuling Yao, Aki Vehtari, Daniel Simpson and Andrew Gelman) have finally finished and which contains our recommendations:

• Ideally, we prefer to attack the Bayesian model combination problem via continuous model expansion, forming a larger bridging model that includes the separate models as special cases.
• In practice constructing such an expansion can require conceptual and computational effort, and so it sometimes makes sense to consider simpler tools that work with existing inferences from separately-fit models.
• Bayesian model averaging based on marginal likelihoods can fail badly in the M-open setting in which the true data-generating process is not one of the candidate models being fit.
• We propose and recommend a new log-score stacking for combining predictive distributions.
• Akaike-type weights computed using Bayesian cross-validation are closely related to the pseudo Bayes factor of Geisser and Eddy (1979), and thus we label model combination using such weights as pseudo-BMA.
• We propose a improved variant, which we call pseudo-BMA+, that is stabilized using the Bayesian bootstrap, to properly take into account the uncertainty of the future data distribution.
• Based on our theory, simulations, and examples, we recommend stacking (of predictive distributions) for the task of combining separately-fit Bayesian posterior predictive distributions. As an alternative, Pseudo-BMA+ is computationally cheaper and can serve as an initial guess for stacking.

The paper is also available in arXiv:1704.02030 (with minor typo correction appearing there tonight in the regular arXiv update), and the code is part of the loo package in R (currently in github https://github.com/stan-dev/loo/ and later in CRAN).

P.S. from Andrew: I really like this idea. It resolves a problem that’s been bugging me for many years.

1. “the M-open setting in which the true data-generating process is not one of the candidate models being fit.”

I am suspicious of this oft-used phrase, “the true data-generating process.” It sounds to me an awful lot like an instance of Jaynes’s mind-projection fallacy–mistaking the uncertainty in your mind, your lack of information, for some sort of physical process. If you’re doing Bayesian statistics, then you’re using epistemic probabilities, and what’s relevant is whether the models you are using make use of all the significant and relevant information you have. If they don’t — because you don’t know how to incorporate all of the information, or it would be too computationally difficult, or you can’t afford to spend ten years on model-building research before doing the analysis — then *that* is what I would call the M-open setting.

• Phil says:

I’m not sure what you’re saying here. Do you disagree with the idea that there is a “true data-generating process”?

The data were generated somehow. The process that generated the data is the “true data-generating process.” It may be of arbitrary complexity and may include true quantum-mechanical randomness, but it certainly exists.

The true data-generating process can never be modeled perfectly. In some settings it can be modeled very well: coin flips, spins of a roulette wheel, etc., can usually be modeled quite well. But in many settings your model of the process is going to differ from the true process in important ways, so if you have several candidate models you can be sure none of them are all that close to the data-generating process.

Presumably you disagree with some part of what I just said. Which part?

• A lot of the time referring to the true data generating process is making the mistake of treating the problem as if there were a correct choice of random number generator but you just don’t know what it is.

Saying that there is some real physics going on is less controversial but maybe less useful in social sci or such

• Wayne says:

Daniel: You’re talking about someone randomly sampling from a somewhat-justified distribution and reifying this into a generation process, am I understanding correctly? This also happens a lot in terms of the word “chance” (as in “determined by chance”), where vague chance becomes a cause.

• Yep. Like the measurement errors are normally distributed with zero mean and SD 1. Well no the measurement errors are equal to the time varying fluctuation in charge across a capacitor caused by thermal noise and local electromagnetic fields induced by radio stations and etc.

2. How does what you’re doing with all this stacking compare to just writing a joint mixture with the same structure?

Is the key feature that there’s some kind of BUGS-like “cut” between the process of (a) fitting the models, and (b) combining them?

• Jonah says:

Looks like this is discussed in Section 3.3 of the paper.

• They provide two motivations up front for avoiding mixture models—they’re costly to compute and can be hard to identify.

Then they use an example where data are generated as $latex y_n \sim \mathrm{Normal}(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, 1)$ with parameters generated from $latex \beta_k \sim \mathrm{Normal}(0, 1)$ priors, then show that model averaging the predictions of $latex \mathrm{Normal}(\beta_k x_k, 1), k \in \{ 1, 2, 3\}$ is better than building a mixture model of the three components.

Two questions immediately arise:

1 How does the true model fit this data? It’s the continuous model expansion they say they favor up front and it’d show how much we lose by model averaging or stacking.

2. What happens when the misspecification is reversed in the mixture case? That is, what if we generate data according to a mixture then try to average the components? This would be interesting to see the mixing-at-the-whole-dataset level, too, as well as mixing at the observation level.

Section 3.1 is sort of related to (2)—it generates data from a Gaussian, then uses a mixture composed of a spread of fixed Gaussians. This is another one of those cases where the obvious continuous model expansion would be the thing to do in practice.

Section 3.6 discusses an example where they seem to be saying that stacking is better than continuous model expansion, but I’m not 100% sure.

• Yuling Yao says:

To me it is like the same argument why simple additive model can work in many cases. I agree that stacking can be much worse than a mixture model is the true model is exactly a mixture model. In the extreme case: a mixture of gaussian with unknown location, stacking will estimate each sub-model to be identical, which makes no sense. However, the reason why researchers want to average a list of models is they believe each model is good enough to use. In real application, it is less likely that a user will average a list of gaussian; it is more likely that he wants to average a list of mixture models with different number of components!

3. Bob says:

Thank you, this is a fascinating paper.

4. shravan says:

On page 35 sigma has a gamma(.1,.1) prior. Is this a typo?

• Yuling Yao says:

It tries to give a relatively flat prior on gamma. The point is that BMA can be sensitive to priors, even if n goes large.

• shravan says:

Since sigma is already constrained to be greater than zero you might as well use a cauchy(0,2.5) prior or something?

• Aki Vehtari says:

gamma(.1,.1) is not a bad choice for noise sigma in y ~ normal(mu, sigma), but half-cauchy could be used, too. I guess you rememebered warnings that gamma(.1,.1) should not be used as a prior in hierarchical models for population prior scale.

5. Pierre Jacob says:

Fascinating stuff…

An alternative to your Pareto smoothed stuff: http://onlinelibrary.wiley.com/doi/10.1002/cjs.10045/full
(You don’t need to introduce bias in the importance weights to do cross-validation, you can just introduce intermediate steps… from 2010)

A lot of recent work on other scoring rules that address the sensitivity of marginal likelihoods from the prior
https://projecteuclid.org/euclid.ba/1423083641 (2015)
(also https://projecteuclid.org/euclid.aos/1336396184 (2012) in the discrete case)
(NOT every scoring rule is equivalent to the logarithmic scoring rule. This depends on your definition of “locality”. If you allow the scoring rule
to use derivatives of the predictive density at the observation, then there are other scoring rules, some of which are useful in the context of vague priors).

There’s obviously a vast literature on algorithms to estimate normalizing constants, but they’re not implemented in stan, so why bother mentioning them!?

• Andrew says:

Pierre:

Your first link above does not work. What is in this paper?

• Pierre Jacob says:

http://www.lukebornn.com/papers/bornn_cjs_2010.pdf
and another one
http://www.people.fas.harvard.edu/~bornn/Papers/2010CJS/Bornn2010b_pre.pdf

The idea is that, to do IS in the “wrong direction”, you just need to introduce enough intermediate steps, e.g. via tempering: essentially, you can just take SMC samplers off the shelf.

For instance, to do IS from Normal(0, sigma^2) to Normal(0, tau^2) with tau > sigma, a condition for the finite variance of the weights is tau^2 sigma^2, instead of doing IS in one step and fail, you can introduce a sequence of distributions, say with variances (eta * sigma^2), (eta^2 * sigma^2), (eta^3 * sigma^2), etc, with eta between 1 and 2. It’s not going to take many steps to go from any sigma to any tau. And it gives Monte Carlo estimates with finite variance.

• Aki Vehtari says:

Multi-step IS is not an alternative to PSIS, it’s an alternative to one-step IS. Pareto smoothing can be used with both of them. In our case, we want to first use the fastest possible option which is one-step IS, and if that fails we do something else.

The method in the linked paper has two things I’m worried: 1) need for Markov kernel with knowledge of the mixing properties of the given kernel – which is not easy in high dimensional cases, 2) the adaptive version uses effective sample size estimate based on variance estimate, which is invalid if the variance is infinite and noisy even if the variance is finite. 2) would be easily improved by use of Pareto shape estimate, and PSIS would reduce the number of needed steps, but still I would be worried how to choose the Markov kernel in high dimensional cases.

• Dan Simpson says:

I’m not sure we were looking to reduce the sensitivity of the marginal likelihood to the prior. I’m a touch obsessed with priors and broadly think that that sensitivity is a good thing. It makes model selection harder, but that’s why we’re not doing model selection.

6. Dave C. says:

Maybe a tangent: If the underlying models are from GLMs, using quasilikelihoods rather than true likelihoods, BMA seems to get a little more difficult. Can you use quasilikelihoods with this stacking approach?