Bayes factors evaluate priors, cross validations evaluate posteriors

I’ve written this explanation on the board often enough that I thought I’d put it in a blog post.

Bayes factors

Bayes factors compare the data density (sometimes called the “evidence”) of one model against another. Suppose we have two Bayesian models for data y, one model p_1(\theta_1, y) with parameters \theta_1 and a second model p_2(\theta_2, y) with parameters \theta_2.

The Bayes factor is defined to be the ratio of the marginal probability density of the data in the two models,

\textrm{BF}_{1,2} = p_1(y) \, / \, p_2(y),

where we have

p_1(y) = \mathbb{E}[p_1(y \mid \Theta_1)] \ = \ \int p_1(y \mid \theta_1) \cdot p_1(\theta_1) \, \textrm{d}\theta_1


p_2(y) = \mathbb{E}[p_2(y \mid \Theta_2)] \ = \ \int p_2(y \mid \theta_2) \cdot p_2(\theta_2) \, \textrm{d}\theta_2.

The distributions p_1(y) and p_2(y) are known as prior predictive distributions because they integrate the likelihood over the prior.
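To make the prior predictive integral concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration rather than something from the models above: the data (a Bernoulli sequence with 7 successes in 10 trials), the beta(1, 1) prior for model 1, the beta(20, 20) prior for model 2, and the function names. It estimates p_1(y) by averaging the likelihood over draws from the prior, which is what the expectation above says to do, and compares the result to the closed-form beta-Bernoulli answer.

```python
import math
import random


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_exact(k, n, a, b):
    """Exact p(y) for a Bernoulli sequence y with k successes in n trials
    under a beta(a, b) prior on the success probability theta."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


def marginal_mc(k, n, a, b, draws=200_000, seed=1234):
    """Monte Carlo estimate of p(y) = E[p(y | Theta)] with Theta ~ beta(a, b)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        theta = rng.betavariate(a, b)                 # draw from the prior
        total += theta**k * (1.0 - theta) ** (n - k)  # likelihood p(y | theta)
    return total / draws


k, n = 7, 10  # assumed data: 7 successes in 10 trials

p1_exact = marginal_exact(k, n, 1.0, 1.0)    # model 1: flat beta(1, 1) prior
p2_exact = marginal_exact(k, n, 20.0, 20.0)  # model 2: beta(20, 20), concentrated near 0.5
p1_mc = marginal_mc(k, n, 1.0, 1.0)

bf_12 = p1_exact / p2_exact  # Bayes factor BF_{1,2} = p_1(y) / p_2(y)

print(f"p_1(y): exact = {p1_exact:.6g}, Monte Carlo = {p1_mc:.6g}")
print(f"p_2(y): exact = {p2_exact:.6g}")
print(f"BF_12 = {bf_12:.3f}")
```

The two models in this sketch share a likelihood and differ only in their priors, so the Bayes factor it reports is entirely a statement about the priors.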

There are ad-hoc guidelines from Harold Jeffreys, of “uninformative” prior fame, classifying Bayes factor values as “decisive,” “very strong,” “strong,” “substantial,” “barely worth mentioning,” or “negative”; see the Wikipedia article on Bayes factors. These seem about as useful as a 5% threshold on p-values before declaring significance.

Held-out validation

Held-out validation tries to evaluate prediction after model estimation (aka training). It works by dividing the data y into two pieces, y = (y', y''), and then training on y' and testing on y''. The held-out validation values are

p_1(y'' \mid y') = \mathbb{E}[p_1(y'' \mid \Theta_1) \mid y'] = \int p_1(y'' \mid \theta_1) \cdot p_1(\theta_1 \mid y') \, \textrm{d}\theta_1


p_2(y'' \mid y') = \mathbb{E}[p_2(y'' \mid \Theta_2) \mid y'] = \int p_2(y'' \mid \theta_2) \cdot p_2(\theta_2 \mid y') \, \textrm{d}\theta_2.

The distributions p_1(y'' \mid y') and p_2(y'' \mid y') are known as posterior predictive distributions because they integrate the likelihood over the posterior from earlier training data.

This can all be done on the log scale to compute either the log expected probability or the expected log probability (which are different because logarithms are not linear). We will use expected log probability in the next section.
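The difference between the two log-scale quantities is easy to see numerically. The sketch below is only an illustration under assumed conditions that are not in the post: simulated data, a model y_n ~ normal(theta, 1) with known scale, a normal(0, 10) prior on theta, and made-up function names. It draws from the conjugate posterior given y', then computes both the log expected probability and the expected log probability of y''. Because this model is conjugate, the first quantity can also be checked against the exact log p(y'' | y').

```python
import math
import random


def normal_lpdf(x, mu, sigma):
    """Log density of normal(mu, sigma) at x, with sigma a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def posterior(y_train, prior_sd):
    """Conjugate posterior (mean, sd) for theta given y_n ~ normal(theta, 1)
    and prior theta ~ normal(0, prior_sd)."""
    precision = 1.0 / prior_sd**2 + len(y_train)
    return sum(y_train) / precision, math.sqrt(1.0 / precision)


def exact_log_predictive(y_train, y_test, prior_sd):
    """Exact log p(y'' | y') via the chain rule, one held-out point at a time."""
    seen, total = list(y_train), 0.0
    for v in y_test:
        mu, sd = posterior(seen, prior_sd)
        total += normal_lpdf(v, mu, math.sqrt(sd**2 + 1.0))
        seen.append(v)
    return total


rng = random.Random(42)
y = [rng.gauss(1.0, 1.0) for _ in range(20)]  # simulated data (an assumption)
y_train, y_test = y[:15], y[15:]              # y' and y''

post_mean, post_sd = posterior(y_train, prior_sd=10.0)
draws = [rng.gauss(post_mean, post_sd) for _ in range(20_000)]  # theta ~ p(theta | y')

# log p(y'' | theta) for each posterior draw
log_liks = [sum(normal_lpdf(v, theta, 1.0) for v in y_test) for theta in draws]

# Log expected probability, computed stably with log-sum-exp.
m = max(log_liks)
log_expected_prob = m + math.log(sum(math.exp(ll - m) for ll in log_liks) / len(log_liks))

# Expected log probability.
expected_log_prob = sum(log_liks) / len(log_liks)

exact = exact_log_predictive(y_train, y_test, prior_sd=10.0)

print(f"log expected probability: {log_expected_prob:.4f} (exact {exact:.4f})")
print(f"expected log probability: {expected_log_prob:.4f}")
```

By Jensen's inequality the expected log probability can never exceed the log expected probability, so the second printed value should not be larger than the first.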

(Leave-one-out) cross validation

Suppose our data is y_1, \ldots, y_N. Leave-one-out cross validation works by successively taking y'' = y_n and y' = y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N and then averaging on the log scale.

\frac{1}{N} \sum_{n=1}^N \log\left( \strut p_1(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right)


\frac{1}{N} \sum_{n=1}^N \log \left( \strut p_2(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right).

Leave-one-out cross validation is interpretable as an estimate of the expected log predictive density (ELPD) for a new data item. Estimating ELPD is (part of) the motivation for various information criteria such as AIC, DIC, and WAIC.
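The leave-one-out averages above can be computed exactly when the posterior predictive distribution has a closed form. The following sketch does this for an assumed pair of models that are not from the post: both take y_n ~ normal(theta, sigma) with a normal(0, 10) prior on theta, but model 1 fixes sigma = 1 and model 2 fixes sigma = 3. The data are simulated and the function names are hypothetical. For real models without closed forms, each term would need a refit or an approximation instead.

```python
import math
import random


def normal_lpdf(x, mu, sigma):
    """Log density of normal(mu, sigma) at x, with sigma a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def loo_elpd(y, prior_sd, sigma):
    """Mean leave-one-out log predictive density for the model
    y_n ~ normal(theta, sigma), theta ~ normal(0, prior_sd), sigma known."""
    total = 0.0
    for n in range(len(y)):
        rest = y[:n] + y[n + 1:]  # y' = all data except y_n
        precision = 1.0 / prior_sd**2 + len(rest) / sigma**2
        post_mean = (sum(rest) / sigma**2) / precision
        post_var = 1.0 / precision
        # Posterior predictive for y_n is normal(post_mean, sqrt(post_var + sigma^2)).
        total += normal_lpdf(y[n], post_mean, math.sqrt(post_var + sigma**2))
    return total / len(y)


rng = random.Random(7)
y = [rng.gauss(1.0, 1.0) for _ in range(50)]  # simulated data (an assumption)

elpd_1 = loo_elpd(y, prior_sd=10.0, sigma=1.0)  # model 1: scale matches the simulation
elpd_2 = loo_elpd(y, prior_sd=10.0, sigma=3.0)  # model 2: scale is too large

print(f"model 1 LOO mean log predictive density: {elpd_1:.4f}")
print(f"model 2 LOO mean log predictive density: {elpd_2:.4f}")
```

Since the data are simulated with scale 1, model 1 should come out with the higher value; the comparison depends on how well each fitted model predicts held-out points, not on how much prior mass sits near the data.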

Conclusion and a question

The main distinction between Bayes factors and cross validation is that the former uses prior predictive distributions whereas the latter uses posterior predictive distributions. This makes Bayes factors very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the posterior is nearly the same under a normal(0, 1) prior as under a normal(0, 100) prior if the true value is in the range (-3, 3), but the choice of prior can have a huge effect on Bayes factors.
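The prior-sensitivity claim can be checked with a small simulation. This is a sketch under stated assumptions, none of which come from the post: 200 simulated observations from normal(1, 1), a likelihood y_n ~ normal(theta, 1) with known scale, and the second argument of normal() read as a standard deviation, as in Stan. The function names are hypothetical. The two models differ only in the prior scale on theta. The log marginal likelihood is computed exactly using the chain rule p(y) = p(y_1) p(y_2 | y_1) ... p(y_N | y_1, ..., y_{N-1}).

```python
import math
import random


def normal_lpdf(x, mu, sigma):
    """Log density of normal(mu, sigma) at x, with sigma a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def posterior(y, prior_sd):
    """Conjugate posterior (mean, sd) for theta given y_n ~ normal(theta, 1)
    and prior theta ~ normal(0, prior_sd). With no data this returns the prior."""
    precision = 1.0 / prior_sd**2 + len(y)
    return sum(y) / precision, math.sqrt(1.0 / precision)


def log_marginal(y, prior_sd):
    """Exact log p(y), built up one sequential predictive density at a time."""
    total = 0.0
    for n in range(len(y)):
        mu, sd = posterior(y[:n], prior_sd)  # for n = 0 this is the prior itself
        total += normal_lpdf(y[n], mu, math.sqrt(sd**2 + 1.0))
    return total


rng = random.Random(2024)
y = [rng.gauss(1.0, 1.0) for _ in range(200)]  # simulated data (an assumption)

mean_a, sd_a = posterior(y, prior_sd=1.0)    # normal(0, 1) prior
mean_b, sd_b = posterior(y, prior_sd=100.0)  # normal(0, 100) prior

log_bf = log_marginal(y, 1.0) - log_marginal(y, 100.0)

print(f"posterior under normal(0, 1):   mean = {mean_a:.4f}, sd = {sd_a:.4f}")
print(f"posterior under normal(0, 100): mean = {mean_b:.4f}, sd = {sd_b:.4f}")
print(f"log Bayes factor = {log_bf:.3f}, Bayes factor = {math.exp(log_bf):.1f}")
```

Under these assumptions the two posteriors should be nearly identical, while the log Bayes factor should come out at several units in favor of the narrower prior, driven mostly by the ratio of the two prior scales.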

This matters because pragmatic Bayesians like Andrew Gelman tend to use weakly informative priors that determine the rough magnitude, but not the value, of parameters. You can’t get good Bayes factors this way. The best way to get a good Bayes factor is to push the prior toward the posterior, which you get for free with cross validation.

My question is whether the users of Bayes factors really believe so strongly in their priors. I’ve been told that’s true of the hardcore “subjective” Bayesians, who aim for strong priors, and also the hardcore “objective” Bayesians, who try to use “uninformative” priors, but I don’t think I’ve ever met anyone who claimed to follow either approach. It’s definitely not the perspective we’ve been pushing in our “pragmatic” Bayesian approach, for instance as described in the Bayesian workflow paper. We flat out encourage people to start with weakly informative priors and then add more information if the priors turn out to be too weak for either inference or computation.

Further reading

For more detail on these methods and further examples, see Gelman et al.’s Bayesian Data Analysis, 3rd Edition, which is available free online, particularly Section 7.2 (“Information criteria and cross-validation,” p. 175) and Section 7.4 (“Model comparison using Bayes factors,” p. 183). I’d also recommend Vehtari, Gelman, and Gabry’s paper, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

