Oscar Oelrich, Shutong Ding, Måns Magnusson, Aki Vehtari, and Mattias Villani write:

Bayesian model comparison is often based on the posterior distribution over the set of compared models. This distribution is often observed to concentrate on a single model even when other measures of model fit or forecasting ability indicate no strong preference. Furthermore, a moderate change in the data sample can easily shift the posterior model probabilities to concentrate on another model. We document overconfidence in two high-profile applications in economics and neuroscience. To shed more light on the sources of overconfidence we derive the sampling variance of the Bayes factor in univariate and multivariate linear regression. The results show that overconfidence is likely to happen when i) the compared models give very different approximations of the data-generating process, ii) the models are very flexible with large degrees of freedom that are not shared between the models, and iii) the models underestimate the true variability in the data.

Stacking is more stable and I think makes more sense. One of the problems with Bayes factors is that people have the erroneous attitude that they're the pure or correct thing to do. See our 1995 paper for some discussion of that point.

I think the above Oelrich et al. paper is valuable in making a new point—that the procedure of selecting a model using Bayes factors can be very noisy. This is similar to the problem of selecting effects in a fitted model by looking for the highest estimate or the smallest p-value: with just one dataset it is easy to take your inference too seriously. Bootstrapping or equivalent analytical work can be helpful in understanding this variation.
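To make the bootstrapping idea concrete, here is a minimal sketch (my own illustration, not from the paper): resample the rows of a dataset with replacement and recompute an approximate Bayes factor each time, using the BIC approximation to compare an intercept-only regression against a model with a slope. The data-generating process, sample size, and BIC approximation are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.3 * x + rng.normal(size=n)  # weak true effect, so selection is noisy

def bic(y, X):
    # BIC for a Gaussian linear regression with the MLE of the error variance
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = np.mean(resid ** 2)
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + X.shape[1] * np.log(len(y))

def select_slope_model(y, x):
    X0 = np.ones((len(y), 1))                   # intercept-only model
    X1 = np.column_stack([np.ones(len(y)), x])  # model with slope
    # BIC0 - BIC1 approximates 2*log(BF) for the slope model; positive favors it
    return int(bic(y, X0) - bic(y, X1) > 0)

picks = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    picks.append(select_slope_model(y[idx], x[idx]))

print(f"fraction of bootstrap samples selecting the slope model: {np.mean(picks):.2f}")
```

If the fraction is far from 0 or 1, the BIC-based selection is unstable under resampling—the kind of noise the paper documents for posterior model probabilities.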

Thanks for sharing this!

I like that this is essentially looking at the frequency properties of Bayes factors. I wonder if there is a way to think about this in a Bayesian way? In other words, knowing that the BF for a particular application will come from some distribution with a mean and variance, could we put a prior on that distribution and get an appropriately tuned BF that is not so overconfident?

Presumably this prior would be based on a consideration of the relative dimensionality of the models being compared, including the extent to which the models have correlated dimensions. The result would be that a high-magnitude BF between extremely different models should not shift our posterior BF distribution as much as a high-magnitude BF between very similar models.

Mike Evans argued for measuring the strength of the evidence in his book Measuring Statistical Evidence Using Relative Belief: https://www.routledge.com/Measuring-Statistical-Evidence-Using-Relative-Belief/Evans/p/book/9781482242799

If I remember correctly, it involved repeatedly simulating data from a particular point in the parameter space to assess how often the posterior becomes more (versus less) concentrated around that point than the prior.
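Here is a minimal sketch of that idea as I understand it (an illustrative toy model, not taken from Evans' book): for a normal mean with known variance and a conjugate normal prior, simulate data repeatedly at a fixed true mean mu0 and record how often the posterior density at mu0 exceeds the prior density there, i.e. how often the relative belief ratio is greater than 1.

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma = 0.5, 1.0  # true mean and known data sd (illustrative values)
tau = 2.0              # prior sd; prior is N(0, tau^2)
n = 50                 # sample size per simulated dataset

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

reps = 2000
evidence_in_favor = 0
for _ in range(reps):
    y = rng.normal(mu0, sigma, size=n)
    # conjugate posterior for the normal mean with known variance
    post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
    post_mean = post_var * (n * y.mean() / sigma**2)
    # relative belief ratio at mu0: posterior density over prior density
    rb = normal_pdf(mu0, post_mean, np.sqrt(post_var)) / normal_pdf(mu0, 0.0, tau)
    evidence_in_favor += rb > 1

print(f"frequency of relative belief > 1 at mu0: {evidence_in_favor / reps:.2f}")
```

With a moderate sample size the posterior concentrates near the true value, so the relative belief ratio at mu0 exceeds 1 in most replications; shrinking n makes that frequency drop, which is one way to see the advantage of larger samples.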

It made very clear the advantage of larger sample sizes over smaller ones, and, perhaps surprisingly, that the posterior distribution alone is not the full answer.