Bayes factors evaluate priors, cross validations evaluate posteriors

I’ve written this explanation on the board often enough that I thought I’d put it in a blog post.

Bayes factors

Bayes factors compare the data density (sometimes called the “evidence”) of one model against another. Suppose we have two Bayesian models for data y, one model p_1(\theta_1, y) with parameters \theta_1 and a second model p_2(\theta_2, y) with parameters \theta_2.

The Bayes factor is defined to be the ratio of the marginal probability density of the data in the two models,

\textrm{BF}_{1,2} = p_1(y) \, / \, p_2(y),

where we have

p_1(y) = \mathbb{E}[p_1(y \mid \Theta_1)] \ = \ \int p_1(y \mid \theta_1) \cdot p_1(\theta_1) \, \textrm{d}\theta_1

and

p_2(y) = \mathbb{E}[p_2(y \mid \Theta_2)] \ = \ \int p_2(y \mid \theta_2) \cdot p_2(\theta_2) \, \textrm{d}\theta_2.

The distributions p_1(y) and p_2(y) are known as prior predictive distributions because they integrate the likelihood over the prior.
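
To make the definitions concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: two hypothetical models that share a unit-scale normal likelihood and differ only in the prior on the mean, fake data, and a plain Monte Carlo estimate of each prior predictive density (averaging the likelihood over prior draws), which is simple but noisy; in practice one would use something more stable such as bridge sampling.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# Made-up data: 50 observations from a normal with unknown mean and known sd = 1.
y = rng.normal(0.7, 1.0, size=50)

def log_evidence(prior_sd, num_draws=50_000):
    """Monte Carlo estimate of log p(y) = log E[p(y | Theta)], averaging the
    likelihood over draws of the mean from its normal(0, prior_sd) prior."""
    theta = rng.normal(0.0, prior_sd, size=num_draws)                 # theta ~ prior
    log_lik = stats.norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)   # log p(y | theta)
    return logsumexp(log_lik) - np.log(num_draws)                     # log of the average

log_p1 = log_evidence(prior_sd=2.0)    # model 1: normal(0, 2) prior on the mean
log_p2 = log_evidence(prior_sd=10.0)   # model 2: normal(0, 10) prior on the mean
print("log BF(1, 2) =", log_p1 - log_p2)
```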

There are ad hoc guidelines from Harold Jeffreys of “uninformative” prior fame, classifying Bayes factor values as “decisive,” “very strong,” “strong,” “substantial,” “barely worth mentioning,” or “negative”; see the Wikipedia page on Bayes factors. These seem about as useful as a 5% threshold on p-values before declaring significance.

Held-out validation

Held-out validation tries to evaluate prediction after model estimation (aka training). It works by dividing the data y into two pieces, y = (y', y''), then training on y' and testing on y''. The held-out validation values are

p_1(y'' \mid y') = \mathbb{E}[p_1(y'' \mid \Theta_1) \mid y'] = \int p_1(y'' \mid \theta_1) \cdot p_1(\theta_1 \mid y') \, \textrm{d}\theta_1

and

p_2(y'' \mid y') = \mathbb{E}[p_2(y'' \mid \Theta_2) \mid y'] = \int p_2(y'' \mid \theta_2) \cdot p_2(\theta_2 \mid y') \, \textrm{d}\theta_2.

The distributions p_1(y'' \mid y') and p_2(y'' \mid y') are known as posterior predictive distributions because they integrate the likelihood over the posterior from earlier training data.

This can all be done on the log scale to compute either the log expected probability or the expected log probability (which are different because logarithms are not linear). We will use expected log probability in the next section.
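
Here is a minimal sketch of the held-out computation, again with made-up data and a hypothetical conjugate normal model (known scale, normal prior on the mean) so that posterior draws are cheap to generate without MCMC. It evaluates the held-out data both ways, as the log of the expected probability and as the expected log probability.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(2)
sigma, tau = 1.0, 5.0                                 # known data sd, prior sd on the mean

# Made-up split of the data: train on y_prime, test on y_dprime.
y_prime = rng.normal(0.5, sigma, size=80)
y_dprime = rng.normal(0.5, sigma, size=20)

# Conjugate posterior for the mean given the training data.
post_prec = len(y_prime) / sigma**2 + 1 / tau**2
post_mean = (y_prime.sum() / sigma**2) / post_prec
post_sd = np.sqrt(1 / post_prec)

# Posterior draws stand in for MCMC output.
theta = rng.normal(post_mean, post_sd, size=10_000)

# Pointwise log likelihood of each held-out point under each posterior draw.
log_lik = stats.norm.logpdf(y_dprime[:, None], theta, sigma)     # shape (20, 10000)

# Log of the expected probability: log p(y'' | y'), summed over held-out points.
log_expected = (logsumexp(log_lik, axis=1) - np.log(theta.size)).sum()

# Expected log probability, the version used for ELPD-style comparisons below.
expected_log = log_lik.mean(axis=1).sum()

print(log_expected, expected_log)    # different numbers: the log is not linear
```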

Leave-one-out cross validation

Suppose our data is y_1, \ldots, y_N. Leave-one-out cross validation works by successively taking y'' = y_n and y' = y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N and then averaging on the log scale.

\frac{1}{N} \sum_{n=1}^N \log\left( \strut p_1(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right)

and

\frac{1}{N} \sum_{n=1}^N \log \left( \strut p_2(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right).

Leave-one-out cross validation is interpretable as the expected log predictive density (ELPD) for a new data item. Estimating ELPD is (part of) the motivation for various information criteria such as AIC, DIC, and WAIC.
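
To make ELPD tangible, here is a minimal sketch for an assumed Bernoulli model with a conjugate beta prior, where every leave-one-out posterior predictive probability has a closed form, so the LOO estimate can be computed exactly; the data and the two priors being compared are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.3, size=40)     # made-up Bernoulli data
N, s = len(y), y.sum()

def loo_elpd(a, b):
    """Exact leave-one-out log predictive density under a beta(a, b) prior.

    Leaving out y_n, the posterior for theta is beta(a + s - y_n, b + N - 1 - (s - y_n)),
    so p(y_n = 1 | y_{-n}) = (a + s - y_n) / (a + b + N - 1)."""
    p_one = (a + s - y) / (a + b + N - 1)      # leave-one-out predictive prob of a 1
    log_pred = np.where(y == 1, np.log(p_one), np.log1p(-p_one))
    return log_pred.mean()

print(loo_elpd(1, 1))    # uniform beta(1, 1) prior
print(loo_elpd(2, 2))    # slightly more concentrated prior
```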

Conclusion and a question

The main distinction between Bayes factors and cross validation is that the former uses prior predictive distributions whereas the latter uses posterior predictive distributions. This makes Bayes factors very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors.
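
A quick way to see this is the conjugate normal-normal case, where both the posterior and the marginal likelihood are available in closed form. The sketch below uses made-up data with a known unit-scale likelihood and compares normal(0, 1) and normal(0, 100) priors on the mean: the two posteriors come out nearly identical, but the log marginal likelihoods differ by roughly log(100), i.e., a Bayes factor on the order of 100 in favor of the narrower prior.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N, sigma = 200, 1.0
y = rng.normal(0.5, sigma, size=N)    # made-up data; true mean well inside (-3, 3)

def posterior_and_log_marginal(tau):
    # Conjugate posterior for the mean under a normal(0, tau) prior.
    prec = N / sigma**2 + 1 / tau**2
    post_mean, post_sd = (y.sum() / sigma**2) / prec, np.sqrt(1 / prec)
    # Marginally (integrating out the mean), y ~ normal(0, sigma^2 I + tau^2 11').
    cov = sigma**2 * np.eye(N) + tau**2 * np.ones((N, N))
    log_marginal = stats.multivariate_normal(np.zeros(N), cov).logpdf(y)
    return post_mean, post_sd, log_marginal

for tau in (1.0, 100.0):
    print(tau, posterior_and_log_marginal(tau))
```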

This matters because pragmatic Bayesians like Andrew Gelman tend to use weakly informative priors that determine the rough magnitude, but not the value of parameters. You can’t get good Bayes factors this way. The best way to get a good Bayes factor is to push the prior toward the posterior, which you get for free with cross validation.

My question is whether the users of Bayes factors really believe so strongly in their priors. I’ve been told that’s true of the hardcore “subjective” Bayesians, who aim for strong priors, and also the hardcore “objective” Bayesians, who try to use “uninformative” priors, but I don’t think I’ve ever met anyone who claimed to follow either approach. It’s definitely not the perspective we’ve been pushing in our “pragmatic” Bayesian approach, for instance as described in the Bayesian workflow paper. We flat out encourage people to start with weakly informative priors and then add more information if the priors turn out to be too weak for either inference or computation.

Further reading

For more detail on these methods and further examples, see Gelman et al.’s Bayesian Data Analysis, 3rd Edition, which is available free online through the link, particularly Section 7.2 (“Information criteria and cross-validation,” p. 175) and Section 7.4 (“Model comparison using Bayes factors,” p. 183). I’d also recommend Vehtari, Gelman, and Gabry’s paper, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Bayes factors measure prior predictive performance

I was having a discussion with a colleague after a talk focused on computing the evidence, and I mentioned that I don’t like Bayes factors because they measure prior predictive performance rather than posterior predictive performance. But even after filling up a board, I couldn’t convince my colleague that Bayes factors really measure prior predictive performance. So let me try in blog form, and maybe the discussion can help clarify what’s going on.

Prior predictive densities (aka, evidence)

If we have data y, parameters \theta, sampling density p(y \mid \theta), and prior density p(\theta), the prior predictive density is defined as

p(y) = \int p(y \mid \theta) \, p(\theta) \, \textrm{d}\theta.

The integral computes an average of the sampling density p(y \mid \theta) weighted by the prior p(\theta). That’s why we call it “prior predictive”.
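
As a concrete worked case, take a binomial sampling density for s successes in N trials with a beta(\alpha, \beta) prior; the integral then has a closed form,

p(s) = \int \binom{N}{s} \theta^s (1 - \theta)^{N - s} \cdot \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{\textrm{B}(\alpha, \beta)} \, \textrm{d}\theta = \binom{N}{s} \, \frac{\textrm{B}(\alpha + s, \beta + N - s)}{\textrm{B}(\alpha, \beta)},

which is the beta-binomial probability mass function.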

Bayes factors compare prior predictive densities

Let’s write p_{\mathcal{M}}(y) to indicate that the prior predictive density depends on the model \mathcal{M}. Then if we have two models, \mathcal{M}_1, \mathcal{M}_2, the Bayes factor for data y is defined to be

\textrm{BF}(y) = \frac{p_{\mathcal{M}_1}(y)}{p_{\mathcal{M}_2}(y)}.

What are Bayes factors measuring? Ratios of prior predictive densities. Usually this isn’t so interesting, because the difference between a weakly informative prior and one an order of magnitude wider makes little difference for posterior predictive inference. There’s more discussion of this, with examples, in Gelman et al.’s Bayesian Data Analysis.

Jeffreys set thresholds for Bayes factors ranging from “barely worth mentioning” (below \sqrt{10}) to “decisive” (above 100). But we don’t need to worry about that.

Posterior predictive distribution

Suppose we’ve already observed some data y^{\textrm{obs}}. The posterior predictive distribution is

p(y \mid y^{\textrm{obs}}) = \int p(y \mid \theta) \, p(\theta \mid y^{\textrm{obs}}) \, \textrm{d}\theta.

The key difference from the prior predictive distribution is that we average our sampling density p(y \mid \theta) over the posterior p(\theta \mid y^{\textrm{obs}}) rather than the prior p(\theta).
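
In the beta-binomial case, this average is also analytic: having observed s successes in N trials with a beta(\alpha, \beta) prior, the posterior is beta(\alpha + s, \beta + N - s), and the posterior predictive probability that the next observation is a success is its mean,

p(\tilde{y} = 1 \mid y^{\textrm{obs}}) = \int \theta \cdot p(\theta \mid y^{\textrm{obs}}) \, \textrm{d}\theta = \frac{\alpha + s}{\alpha + \beta + N}.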

Cross-validation

In the Bayesian workflow paper, we recommend using cross-validation to compare posterior predictive distributions, and we don’t even mention Bayes factors. The Stan team provides an R package, loo, for efficiently computing approximate leave-one-out cross-validation.

The path from prior predictive to posterior predictive

Introductions to Bayesian inference often start with a very simple beta-binomial model, which can be solved analytically online. That is, we can update the posterior by simple counting after each observation, and each posterior is also a beta distribution. We can do this in general, considering our data y = y_1, \ldots, y_N as arriving sequentially and updating the posterior each time:

p(y_1, \ldots, y_N) = p(y_1) \, p(y_2 \mid y_1) \, \cdots \, p(y_N \mid y_1, \ldots, y_{N-1}).

In this factorization, we predict y_1 based only on the prior, then y_2 based on y_1 and the prior, and so on, until the last point is modeled just as in leave-one-out cross-validation, as p(y_N \mid y_1, \ldots, y_{N-1}). We can do this in any order and the result will be the same. As N increases, the prior predictive density converges to the posterior predictive density on an average (per observation y_n) basis. But for finite amounts of data N \ll \infty, the two measures can be very different.
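
Here is a minimal sketch of that decomposition for the beta-binomial case, with made-up Bernoulli data and a beta(1, 1) prior. The sum of the one-step-ahead log predictive densities exactly matches the closed-form log prior predictive density of the whole sequence, and the later one-step-ahead predictions are conditioned on more and more data, so they behave like posterior predictive densities.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.3, size=50)     # made-up Bernoulli data
a, b = 1.0, 1.0                       # beta(1, 1) prior

# Sequential one-step-ahead predictives p(y_n | y_1, ..., y_{n-1}).
log_steps = []
succ, fail = a, b
for y_n in y:
    p_one = succ / (succ + fail)      # current posterior predictive prob of a 1
    log_steps.append(np.log(p_one) if y_n == 1 else np.log1p(-p_one))
    succ += y_n                       # conjugate update is just counting
    fail += 1 - y_n
log_steps = np.array(log_steps)

# Closed-form log prior predictive of the full sequence (ordered Bernoulli outcomes,
# so there is no binomial coefficient): B(a + s, b + N - s) / B(a, b).
s, N = y.sum(), len(y)
log_evidence = betaln(a + s, b + N - s) - betaln(a, b)

print(np.allclose(log_steps.sum(), log_evidence))    # True: the factorization is exact
print(log_steps[:5].mean(), log_steps[-5:].mean())   # early (prior-like) vs late (posterior-like)
```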