I’ve written this explanation on the board often enough that I thought I’d put it in a blog post.
Bayes factors
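Bayes factors compare the data density (sometimes called the “evidence”) of one model against another. Suppose we have two Bayesian models for data $y$, one model $p_1(\theta_1, y)$ with parameters $\theta_1$ and a second model $p_2(\theta_2, y)$ with parameters $\theta_2$.
The Bayes factor is defined to be the ratio of the marginal probability density of the data in the two models,
$$\textrm{BF}_{1,2} = \frac{p_1(y)}{p_2(y)},$$
where we have
$$p_1(y) = \int p_1(y \mid \theta_1) \, p_1(\theta_1) \, \textrm{d}\theta_1$$
and
$$p_2(y) = \int p_2(y \mid \theta_2) \, p_2(\theta_2) \, \textrm{d}\theta_2.$$
The distributions $p_1(y)$ and $p_2(y)$ are known as prior predictive distributions because they integrate the likelihood over the prior.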
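As a concrete illustration, here is a minimal sketch of a Bayes factor in a case where the prior predictive density has a closed form. The setup is my own assumption rather than anything from the models above: the data are $y$ successes in $N$ binomial trials, and the two models differ only in their beta priors on the success probability. The names `log_marginal` and `bayes_factor_12` are hypothetical.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(y, N, a, b):
    """log p(y) for y ~ binomial(N, theta) with prior theta ~ beta(a, b).

    Integrating the binomial likelihood over the beta prior gives
    p(y) = choose(N, y) * B(a + y, b + N - y) / B(a, b).
    """
    log_choose = lgamma(N + 1) - lgamma(y + 1) - lgamma(N - y + 1)
    return log_choose + log_beta(a + y, b + N - y) - log_beta(a, b)

# Assumed data: 7 successes in 20 trials.
y, N = 7, 20

log_p1 = log_marginal(y, N, 1.0, 1.0)    # model 1: uniform prior, beta(1, 1)
log_p2 = log_marginal(y, N, 10.0, 10.0)  # model 2: prior concentrated near 0.5

# Bayes factor of model 1 against model 2, computed on the log scale.
bayes_factor_12 = exp(log_p1 - log_p2)
print(f"BF_12 = {bayes_factor_12:.3f}")
```

Working on the log scale and exponentiating only at the end avoids underflow when the marginal densities are tiny, which they usually are for data sets of any size.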
There are ad hoc guidelines from Harold Jeffreys, of “uninformative” prior fame, classifying Bayes factor values as “decisive,” “very strong,” “strong,” “substantial,” “barely worth mentioning,” or “negative”; see the Wikipedia article on Bayes factors. These seem about as useful as a 5% threshold on p-values before declaring significance.
Held-out validation
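Held-out validation tries to evaluate prediction after model estimation (aka training). It works by dividing the data $y$ into two pieces, $y = (y', y'')$, and then training on $y'$ and testing on $y''$. The held-out validation values are
$$p_1(y'' \mid y') = \int p_1(y'' \mid \theta_1) \, p_1(\theta_1 \mid y') \, \textrm{d}\theta_1$$
and
$$p_2(y'' \mid y') = \int p_2(y'' \mid \theta_2) \, p_2(\theta_2 \mid y') \, \textrm{d}\theta_2.$$
The distributions $p_1(y'' \mid y')$ and $p_2(y'' \mid y')$ are known as posterior predictive distributions because they integrate the likelihood over the posterior from earlier training data.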
This can all be done on the log scale to compute either the log expected probability or the expected log probability (which are different because logarithms are not linear). We will use expected log probability in the next section.
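To make the difference between the two log-scale quantities concrete, here is a rough sketch under assumptions of my own: a model with $y \sim \textrm{normal}(\mu, 1)$ and prior $\mu \sim \textrm{normal}(0, 1)$ (second argument a standard deviation), a made-up training set, and a single held-out point. The posterior is available in closed form by conjugacy, so the code draws from it directly and estimates both quantities by Monte Carlo.

```python
import math
import random

def normal_pdf(y, mean, sd):
    """Density of normal(mean, sd) at y, with sd a standard deviation."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical split: training data y' and one held-out point y''.
y_train = [0.8, 1.9, 0.3, 1.2, 1.5, 0.1, 2.2, 0.9, 1.1, 1.4]
y_test = 1.7

# Conjugate posterior p(mu | y') for the assumed model.
post_prec = 1.0 + len(y_train)        # prior precision plus one per data point
post_mean = sum(y_train) / post_prec
post_sd = math.sqrt(1.0 / post_prec)

# Posterior draws of mu and the likelihood of y'' under each draw.
random.seed(0)
mu_draws = [random.gauss(post_mean, post_sd) for _ in range(20000)]
densities = [normal_pdf(y_test, mu, 1.0) for mu in mu_draws]

# Log expected probability: log of the posterior predictive density p(y'' | y').
log_expected = math.log(sum(densities) / len(densities))
# Expected log probability: posterior average of the log likelihood.
expected_log = sum(math.log(d) for d in densities) / len(densities)

# Exact log posterior predictive density for comparison.
exact = math.log(normal_pdf(y_test, post_mean, math.sqrt(1.0 + post_sd ** 2)))

print(f"log E[p]: {log_expected:.4f}  (exact {exact:.4f})")
print(f"E[log p]: {expected_log:.4f}")
```

The expected log probability can never exceed the log expected probability (Jensen's inequality), and only the latter is the log of the held-out validation value defined above.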
(Leave one out) cross validation
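Suppose our data is $y = y_1, \ldots, y_N$. Leave-one-out cross validation works by successively taking $y'' = y_n$ and $y' = y_{-n} = y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N$ and then averaging on the log scale,
$$\textrm{LOO}_1 = \frac{1}{N} \sum_{n=1}^N \log p_1(y_n \mid y_{-n})$$
and
$$\textrm{LOO}_2 = \frac{1}{N} \sum_{n=1}^N \log p_2(y_n \mid y_{-n}).$$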
Leave-one-out cross validation is interpretable as an estimate of the expected log predictive density (ELPD) for a new data item. Estimating ELPD is (part of) the motivation for various information criteria such as AIC, DIC, and WAIC.
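Here is a small sketch of exact leave-one-out cross validation in a model simple enough that no refitting is needed. The model is an assumption chosen for illustration: binary observations $y_n \sim \textrm{bernoulli}(\theta)$ with a $\textrm{beta}(a, b)$ prior, where the posterior predictive probability of a success given the other $N - 1$ points is $(a + s) / (a + b + N - 1)$, with $s$ the number of successes among those points. The function name `loo_bernoulli` is hypothetical.

```python
import math

def loo_bernoulli(ys, a, b):
    """Average log leave-one-out predictive probability.

    Assumes y_n ~ bernoulli(theta) with prior theta ~ beta(a, b).
    """
    N, successes = len(ys), sum(ys)
    total = 0.0
    for y in ys:
        # Successes among the other N - 1 points (y_n held out).
        s = successes - y
        # Posterior predictive Pr[y_n = 1 | y_{-n}].
        p_one = (a + s) / (a + b + N - 1)
        total += math.log(p_one if y == 1 else 1.0 - p_one)
    return total / N

# Assumed data: 7 successes and 13 failures.
ys = [1] * 7 + [0] * 13

loo_1 = loo_bernoulli(ys, 1.0, 1.0)    # model 1: uniform prior
loo_2 = loo_bernoulli(ys, 10.0, 10.0)  # model 2: prior concentrated near 0.5
print(f"LOO_1 = {loo_1:.4f}, LOO_2 = {loo_2:.4f}")
```

In models without closed-form posteriors, each term would require a posterior conditioned on $y_{-n}$, which is why approximations to leave-one-out are used in practice.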
Conclusion and a question
The main distinction between Bayes factors and cross validation is that the former uses prior predictive distributions whereas the latter uses posterior predictive distributions. This makes Bayes factors very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors.
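The sensitivity claim is easy to check numerically. The following is a sketch under my own assumptions: 200 simulated observations from $\textrm{normal}(1, 1)$, a likelihood $y_n \sim \textrm{normal}(\mu, 1)$ with known scale, and two models that differ only in the prior, $\mu \sim \textrm{normal}(0, 1)$ versus $\mu \sim \textrm{normal}(0, 100)$, reading the second argument as a standard deviation. Conjugacy makes every posterior predictive density exact, and the marginal density is computed with the chain rule $p(y) = \prod_n p(y_n \mid y_1, \ldots, y_{n-1})$, whose first factor is the prior predictive density of $y_1$.

```python
import math
import random

def log_normal_pdf(y, mean, sd):
    """Log density of normal(mean, sd) at y, with sd a standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - 0.5 * ((y - mean) / sd) ** 2

def log_post_pred(y_new, n, total, tau, sigma=1.0):
    """log p(y_new | y') given n training points that sum to `total`.

    Assumes y ~ normal(mu, sigma) and mu ~ normal(0, tau); with n = 0
    this reduces to the prior predictive density.
    """
    post_prec = 1.0 / tau ** 2 + n / sigma ** 2
    post_mean = (total / sigma ** 2) / post_prec
    pred_sd = math.sqrt(sigma ** 2 + 1.0 / post_prec)
    return log_normal_pdf(y_new, post_mean, pred_sd)

def log_marginal(ys, tau):
    """log p(y) via the chain rule over sequential predictive densities."""
    lp, total = 0.0, 0.0
    for n, y in enumerate(ys):
        lp += log_post_pred(y, n, total, tau)
        total += y
    return lp

def loo(ys, tau):
    """Average log leave-one-out posterior predictive density."""
    total = sum(ys)
    N = len(ys)
    return sum(log_post_pred(y, N - 1, total - y, tau) for y in ys) / N

# Simulated data with an assumed true mean of 1.
random.seed(1234)
ys = [random.gauss(1.0, 1.0) for _ in range(200)]

log_bf = log_marginal(ys, tau=1.0) - log_marginal(ys, tau=100.0)
loo_diff = loo(ys, tau=1.0) - loo(ys, tau=100.0)
print(f"log Bayes factor (tau=1 vs. tau=100): {log_bf:.3f}")
print(f"LOO difference   (tau=1 vs. tau=100): {loo_diff:.6f}")
```

The log Bayes factor should come out at several units in favor of the narrower prior, while the leave-one-out difference should be close to zero, because the two posteriors given 199 data points are nearly identical.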
This matters because pragmatic Bayesians like Andrew Gelman tend to use weakly informative priors that determine the rough magnitude, but not the value of parameters. You can’t get good Bayes factors this way. The best way to get a good Bayes factor is to push the prior toward the posterior, which you get for free with cross validation.
My question is whether the users of Bayes factors really believe so strongly in their priors. I’ve been told that’s true of the hardcore “subjective” Bayesians, who aim for strong priors, and also the hardcore “objective” Bayesians, who try to use “uninformative” priors, but I don’t think I’ve ever met anyone who claimed to follow either approach. It’s definitely not the perspective we’ve been pushing in our “pragmatic” Bayesian approach, for instance as described in the Bayesian workflow paper. We flat out encourage people to start with weakly informative priors and then add more information if the priors turn out to be too weak for either inference or computation.
Further reading
For more detail on these methods and further examples, see Gelman et al.’s Bayesian Data Analysis, 3rd Edition, which is available free online, particularly Section 7.2 (“Information criteria and cross-validation,” p. 175) and Section 7.4 (“Model comparison using Bayes factors,” p. 183). I’d also recommend Vehtari, Gelman, and Gabry’s paper, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.