## Using Stacking to Average Bayesian Predictive Distributions (with Discussion)

I’ve posted on this paper (by Yuling Yao, Aki Vehtari, Daniel Simpson, and myself) before, but now the final version has been published, along with a bunch of interesting discussions and our rejoinder.

This has been an important project for me, as it answers a question that’s been bugging me for over 20 years (since the time of the writing of the first edition of BDA), which is how to average inferences from several fitted Bayesian models. Traditional Bayesian model averaging doesn’t work (see chapter 7 of BDA3, or chapter 6 of the first two editions, for some discussion of why; as McAlinn, Aastveit, and West put it in their discussion of our new paper, “This is not a failure of Bayesian thinking, of course; the message remains follow-the-rules, but understand when assumptions fail to justify blind application of the rules.”), but it’s been frustrating to have no real alternative. Yes, I always say that continuous model expansion is the better approach, but sometimes you have a few models you can fit, and you want to do something.

Here, for example, is something I wrote on Bayesian model averaging a few years ago, back when I had no idea what to do.

I’m happy with our current solution, which is to consider the combination of inferences from several fitted models as a prediction problem, and use cross-validation to find optimal weights. This solution is not perfect—cross-validation can be noisy, and in any case we should be able to combine models more adaptively rather than estimating fixed weights—but I feel like it’s on the right track, and it gives us a practical method to get reasonable answers.

Here’s the abstract of our paper:

Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions. We extend the utility function to any proper scoring rule and use Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), Pseudo-BMA, and a variant of Pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with bootstrapped-Pseudo-BMA as an approximate alternative when computation cost is an issue.
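To make the optimization in the abstract concrete, here is a minimal sketch of stacking of predictive distributions: find simplex weights that maximize the summed log score of the weighted mixture of leave-one-out predictive densities. It assumes the LOO log densities have already been computed (the paper obtains them efficiently via Pareto smoothed importance sampling, which this sketch skips), and all names and numbers are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(loo_lpd):
    """Simplex weights maximizing sum_i log sum_k w_k p_k(y_i | y_{-i}).

    loo_lpd: (n_points, n_models) array of leave-one-out log predictive
    densities, log p_k(y_i | y_{-i}).
    """
    n, K = loo_lpd.shape

    def neg_log_score(w):
        # per-point log of the mixture density, via log-sum-exp for stability
        a = loo_lpd + np.log(w + 1e-12)
        m = a.max(axis=1, keepdims=True)
        lse = m[:, 0] + np.log(np.exp(a - m).sum(axis=1))
        return -lse.sum()

    # optimize over the simplex: bounds in [0, 1] plus a sum-to-one constraint
    cons = {"type": "eq", "fun": lambda w: w.sum() - 1.0}
    bounds = [(0.0, 1.0)] * K
    w0 = np.full(K, 1.0 / K)
    res = minimize(neg_log_score, w0, bounds=bounds, constraints=cons)
    return res.x

# toy example: model 2's LOO log densities are uniformly higher,
# so stacking should put nearly all the weight on it
rng = np.random.default_rng(0)
lpd = np.column_stack([rng.normal(-2.0, 0.1, 50), rng.normal(-1.0, 0.1, 50)])
w = stacking_weights(lpd)
```

In practice one would get `loo_lpd` from PSIS-LOO output (e.g., the `loo` package in R) rather than simulating it; the point here is only the shape of the weight optimization.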

And the journal, Bayesian Analysis, published it with discussions by:

Bertrand Clarke
Meng Li
Peter Grünwald and Rianne de Heide
A. Philip Dawid
William Weimin Yoo
Robert L. Winkler, Victor Richmond R. Jose, Kenneth C. Lichtendahl Jr., and Yael Grushka-Cockayne
Kenichiro McAlinn, Knut Are Aastveit, and Mike West
Minsuk Shin
Tianjian Zhou
Lennart Hoogerheide and Herman K. van Dijk
Haakon C. Bakka, Daniela Castro-Camilo, Maria Franco-Villoria, Anna Freni-Sterrantino, Raphaël Huser, Thomas Opitz, and Håvard Rue
Marco A. R. Ferreira
Luis Pericchi
Christopher T. Franck
Eduard Belitser and Nurzhan Nurushev
Matteo Iacopini and Stefano Tonellato

and a rejoinder by us.

1. Jeff Walker says:

I posed this question to Andrew and the other authors of the paper that Andrew posted, and Andrew suggested I post it here. The blog has already moved on with 3 new posts, so my question is buried!

I have a question about Figure 8 (the table of coefficients) in the paper, which shows estimates of the model coefficients using OLS and the various averaging methods discussed in the paper.

Since at least Hoeting et al. (1999), there has been the occasional offhand comment in the statistics literature that it makes no sense to average coefficients across nested submodels, since these “estimate different parameters” (the effect conditional on different sets of covariates) — see the quotes below.

I disagree. I would argue that, at least in “explanatory” modeling, or in predictive modeling where we hope to use the model to make interventions in the system, model averaging averages biased estimates of the same parameter, not consistent estimates of different parameters. Or a bit more formally: a regression coefficient for X_j estimates the causal effect (Pearl’s do, or Rubin’s potential outcome) of X_j on Y. While its estimate is conditional on the other covariates in the model, its interpretation or “meaning” is not. Its interpretation takes its meaning from “causal conditioning” and not “probabilistic conditioning,” sensu Shalizi’s text p. 505 or section 4.7 here https://plato.stanford.edu/entries/causal-models/

This isn’t an issue with averaging the prediction since the parameter estimated is agreed on by everyone. But Fig 8 is a table of coefficients that are functions of combining (averaging) information from different nested models. Is this apples and lawn tractors (estimates of different parameters) or is it combining coefficients that estimate the same thing?

Some quotes.

“(It is probably also worth mentioning regarding HMRV’s equations (1–3) that Δ needs to have the same meaning in all “models” for the equations to be straightforwardly interpretable; the coefficient of x1 in a regression of y on x1 is a different beast than the coefficient of x1 in a regression of y on x1 and x2.)” – Draper, comment in Hoeting et al. 1999, p. 405

“It is important to appreciate that model averaging makes sense only if the quantities being averaged have the same interpretation for all the models under consideration. Thus averaging of parameter values or estimates over different models is not usually useful, as parameters typically pertain to particular models, even if the same Greek letter is used.” – Candolo, Davison, Demétrio, 2003, p. 166

“What model averaging does not mean is averaging parameter estimates, because parameters in different models have different meanings and should not be averaged, unless you are sure you are in a special case in which it is safe to do so.” – McElreath, Statistical Rethinking, p. 196

“This is concerning because the interpretation of partial regression coefficients can depend on other variables that have been included in the model, so averaging regression coefficients across models may not be practically meaningful.” – Banner and Higgs 2017

“In general, the only case for which partial regression coefficients associated with a particular explanatory variable hold the same interpretation across models is when the explanatory variables are orthogonal.” – Banner and Higgs 2017
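The orthogonality point in the last two quotes is easy to check with a small simulation: when two covariates are correlated, the least-squares coefficient on x1 in the short model (y ~ x1) absorbs part of x2’s effect, so “the coefficient on x1” is a different quantity in the two nested models. A minimal sketch in Python, with all names and numbers illustrative:

```python
import numpy as np

# Simulate y = 1*x1 + 1*x2 + noise with correlated covariates, then compare
# the least-squares coefficient on x1 in the long model (y ~ x1 + x2)
# against the short model (y ~ x1).
rng = np.random.default_rng(1)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # corr(x1, x2) ≈ 0.8
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# long model: coefficient on x1 is the partial effect holding x2 fixed
X_long = np.column_stack([np.ones(n), x1, x2])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

# short model: x1's coefficient picks up the omitted x2's contribution,
# approaching 1.0 + 0.8 * 1.0 = 1.8 (classic omitted-variable bias)
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

print(b_long[1])   # ≈ 1.0
print(b_short[1])  # ≈ 1.8
```

Averaging the ≈1.0 and ≈1.8 coefficients produces a number that is the partial effect in neither model, which is exactly the concern the quotes raise; averaging the models’ predictions has no such problem.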

Banner, Katharine M., and Megan D. Higgs. “Considerations for Assessing Model Averaging of Regression Coefficients.” Ecological Applications 27, no. 1 (January 1, 2017): 78–93. https://doi.org/10.1002/eap.1419.

Candolo, C., A. C. Davison, and C. G. B. Demétrio. “A Note on Model Uncertainty in Linear Regression.” Journal of the Royal Statistical Society: Series D (The Statistician) 52, no. 2 (2003): 165–177.

Hoeting, Jennifer A., David Madigan, Adrian E. Raftery, and Chris T. Volinsky. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14, no. 4 (1999): 382–417.

McElreath, Richard. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, 2018.

• Yuling Yao says:

It also reminds me that at some point I was thinking about how to do model averaging in causal models / under causal assumptions. The potential outcome is still a predictive quantity, so I think it is still possible to define the utility based on predictions. The only change in a causal model is that you have to model the covariate shift (the “do” operator, as you mentioned).