“When are Bayesian model probabilities overconfident?” . . . and we’re still trying to get to meta-Bayes

Oscar Oelrich, Shutong Ding, Måns Magnusson, Aki Vehtari, and Mattias Villani write:

Bayesian model comparison is often based on the posterior distribution over the set of compared models. This distribution is often observed to concentrate on a single model even when other measures of model fit or forecasting ability indicate no strong preference. Furthermore, a moderate change in the data sample can easily shift the posterior model probabilities to concentrate on another model.

We document overconfidence in two high-profile applications in economics and neuroscience.

To shed more light on the sources of overconfidence we derive the sampling variance of the Bayes factor in univariate and multivariate linear regression. The results show that overconfidence is likely to happen when i) the compared models give very different approximations of the data-generating process, ii) the models are very flexible with large degrees of freedom that are not shared between the models, and iii) the models underestimate the true variability in the data.
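A small simulation can make the abstract's point concrete. The sketch below is not the authors' analysis; it is a toy setup with assumptions of my own: two single-predictor linear regressions with a Gaussian prior on the coefficient, a noise variance fixed at 1 in both models, and data generated with two equally important predictors and a true noise variance of 4. Neither model is right, the two are equally wrong by construction, and both underestimate the variability in the data. The function names (`log_marginal`, `model_probs`, `simulate`) and all numerical settings are hypothetical.

```python
import numpy as np

def log_marginal(X, y, sigma2=1.0, tau2=10.0):
    """Log marginal likelihood of y under y = X b + e,
    with b ~ N(0, tau2 * I) and e ~ N(0, sigma2 * I), sigma2 treated as known."""
    n = len(y)
    # Integrating out b gives y ~ N(0, sigma2 * I + tau2 * X X^T).
    cov = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def model_probs(log_marginals):
    """Posterior model probabilities, assuming equal prior model probabilities."""
    lm = np.asarray(log_marginals, dtype=float)
    w = np.exp(lm - lm.max())  # subtract the max for numerical stability
    return w / w.sum()

def simulate(n_reps=200, n=100, seed=0):
    """Posterior probability of model 1 across independent replicate datasets."""
    rng = np.random.default_rng(seed)
    p1 = np.empty(n_reps)
    for r in range(n_reps):
        x1, x2 = rng.normal(size=(2, n))
        # Assumed data-generating process: both predictors matter equally,
        # and the noise sd (2.0) is larger than the models assume (1.0).
        y = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=2.0, size=n)
        lm = [log_marginal(x[:, None], y) for x in (x1, x2)]
        p1[r] = model_probs(lm)[0]
    return p1

p1 = simulate()
frac_extreme = np.mean((p1 < 0.01) | (p1 > 0.99))
print(f"share of datasets with a model probability beyond 0.99: {frac_extreme:.2f}")
```

Because the two models are symmetric, any strong preference in a given replicate comes from sampling variation in the log Bayes factor rather than from one model being better. Changing `sigma2` to match the true noise variance is one way to see how much of the concentration comes from condition iii).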

This is related to our work on stacking:

[2018] Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis 13, 917–1003. (Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman)

[2022] Bayesian hierarchical stacking: Some models are (somewhere) useful. Bayesian Analysis 17, 1043–1071. (Yuling Yao, Gregor Pirš, Aki Vehtari, and Andrew Gelman)

[2022] Stacking for non-mixing Bayesian computations: The curse and blessing of multimodal posteriors. Journal of Machine Learning Research 23, 79. (Yuling Yao, Aki Vehtari, and Andrew Gelman)

Big open problems remain in this area. For choosing among or working with discrete models, stacking or other predictive model averaging techniques seem to work much better than Bayesian model averaging. On the other hand, for models with continuous parameters we’re usually happy with full Bayes. The difficulty here is that discrete models can be embedded in a continuous space, and continuous models can be discretized. What’s missing is some sort of meta-Bayes (to use Yuling’s term) that puts this all together.
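To illustrate the contrast between stacking and weights that behave like Bayesian model averaging, here is a minimal sketch on synthetic data. It makes several assumptions that are not from the papers above: the two "models" are fixed predictive densities N(-0.5, 1) and N(0.5, 1) evaluated on data drawn from N(0, 1), whereas in practice the pointwise log predictive densities would come from leave-one-out cross-validation; the stacking objective (the log score of the weighted mixture) is maximized with a simple EM-style fixed-point update rather than whatever optimizer a given package uses; and the comparison weights are proportional to the exponentiated summed log predictive density, a stand-in that concentrates the way posterior model probabilities do. All function names are hypothetical.

```python
import numpy as np

def normal_logpdf(y, mu):
    """Log density of N(mu, 1) at y."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2

def log_score(lpd, w):
    """Sum over data points of the log of the weighted mixture density."""
    m = lpd.max(axis=1)
    return np.sum(m + np.log(np.exp(lpd - m[:, None]) @ w))

def stacking_weights(lpd, n_iter=2000, tol=1e-10):
    """Simplex weights maximizing log_score, via EM-style fixed-point updates.
    lpd has shape (n_points, n_models)."""
    n, K = lpd.shape
    # Row-wise rescaling leaves the maximizer unchanged and avoids underflow.
    dens = np.exp(lpd - lpd.max(axis=1, keepdims=True))
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        resp = dens * w
        resp /= resp.sum(axis=1, keepdims=True)
        w_new = resp.mean(axis=0)
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w

def summed_lpd_weights(lpd):
    """Weights proportional to exp(sum of pointwise log predictive densities)."""
    total = lpd.sum(axis=0)
    w = np.exp(total - total.max())
    return w / w.sum()

rng = np.random.default_rng(1)
y = rng.normal(size=200)
lpd = np.column_stack([normal_logpdf(y, -0.5), normal_logpdf(y, 0.5)])

w_stack = stacking_weights(lpd)
w_sum = summed_lpd_weights(lpd)
print("stacking weights:  ", np.round(w_stack, 3))
print("summed-lpd weights:", np.round(w_sum, 3))
```

The two candidate densities are equally far from the truth, so a sensible combination should weight them roughly equally. The summed-log-density weights depend on the data only through an exponentiated total that grows with sample size, which is why they tend to pile onto one model.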

4 thoughts on ““When are Bayesian model probabilities overconfident?” . . . and we’re still trying to get to meta-Bayes”

  1. BayesBag is one potential solution to the meta-Bayesian problem, as it works well in both the discrete and continuous cases:

    J. H. Huggins & J. W. Miller (2024). Reproducible Parameter Inference Using Bagged Posteriors. Electronic Journal of Statistics 18(1): 1549–1585.

    J. H. Huggins & J. W. Miller (2023). Reproducible Model Selection Using Bagged Posteriors. Bayesian Analysis 18(1): 79–104.

    That said, there are many open questions. For example, how do we make it more computationally efficient? (Our recent work on subsampling MCMC algorithms like SGLD provides a partial solution: https://arxiv.org/abs/2207.12395.) And, how should we handle structured data like time-series, where bootstrapping is not appropriate? Block bootstrapping is one possibility, but perhaps there are more elegant solutions.

  2. I didn’t give the paper a deep reading, but it seems to describe situations where all the models are substantially deficient. It feels a little paradoxical to look at P(model A is correct | data) when we know that P(any model in the set is correct) = 0. Wouldn’t it make more sense to push the analysis forward into the substantive question of interest in data space, which, as you and the paper note, doesn’t usually come with the same degeneracies? I’m just not sure what I would do with P(model) except in cases like physics or pharmacokinetics where the model family is well motivated to begin with.

    • Somebody:

      I agree. For reasons discussed in this paper from 1995, I am almost never interested in Pr(model | data).

      Often, though, we have a set of existing flawed models, and in the meantime we want to use the models to make inferences and predictions, and for this purpose it can make sense to use some version of predictive model averaging. Stacking is one way to do this, Bayesian model averaging is another, and it makes sense to compare them and to think of how we can do better.

  3. Depends on what you mean by meta-Bayes, but if you mean a unifying Bayesian framework for meta-inference, we’ve already done that.

    Here it is applied to time series:
    https://www.sciencedirect.com/science/article/abs/pii/S0304407618302112
    spatial data:
    https://arxiv.org/abs/2203.05197
    causal inference:
    https://arxiv.org/abs/2304.07726
    decisions:
    https://academic.oup.com/jrsssb/article-abstract/86/2/340/7329254?redirectedFrom=fulltext

    This was, in fact, the main point of our commentary on your first stacking paper (that it’s a special case of our framework).
