It’s all in chapter 6 of Bayesian Data Analysis. Anyway, Sam Gershman wrote to me:

My colleagues and I have been trying to jump through the ring of fire that is Bayesian model comparison. However, as bad Bayesians we have been fitting most of our models with maximum likelihood estimation, mostly due to computational considerations and uneasiness about choice of prior. We’re going to do real Bayesian inference any day now, but before that happens we have a question about ML-based asymptotic criteria for model comparison, namely the Bayesian Information Criterion. As we understand it, the BIC is derived from the Laplace approximation when you drop all the terms that don’t grow with the number of datapoints. One of these terms includes the Hessian of the likelihood function; this is in fact a quantity that we have from the gradient descent computations, and we were wondering what you thought about constructing a “fancy” BIC (as Zoubin Ghahramani called it) by keeping that term rather than dropping it. Our intuition was that this would give you an answer somewhere in between the asymptotic result and actually computing the marginal likelihood. This would be especially useful in the (very) finite data regime we are in, where the asymptotic assumptions of the BIC are unlikely to hold. As far as we can tell, no one has suggested this before, and so naturally we were scared of using it for fear of some unknown pitfall!
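Sam's proposal can be sketched concretely. In the toy example below (the normal model, the finite-difference Hessian, and all variable names are illustrative assumptions, not his actual setup), we fit by maximum likelihood, form the standard BIC, and then form the "fancy" variant that keeps the Hessian term from the Laplace approximation instead of replacing it with the asymptotic k*log(n):

```python
import numpy as np

def neg_loglik(theta, x):
    """Negative log-likelihood of N(mu, sigma^2); theta = (mu, log_sigma)."""
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

def num_hessian(f, theta, eps=1e-4):
    """Central finite-difference Hessian of f at theta."""
    p = theta.size
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.zeros(p); ei[i] = eps
            ej = np.zeros(p); ej[j] = eps
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * eps ** 2)
    return H

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100)
n, k = x.size, 2

# The MLE is available in closed form for this toy model
theta_hat = np.array([x.mean(), 0.5 * np.log(x.var())])
L = -neg_loglik(theta_hat, x)                # maximized log-likelihood

bic = -2 * L + k * np.log(n)                 # standard Schwarz BIC

# "Fancy" variant: keep the Hessian term from the Laplace approximation
# rather than dropping it in favor of k*log(n).
H = num_hessian(lambda t: neg_loglik(t, x), theta_hat)
sign, logdetH = np.linalg.slogdet(H)         # H should be positive definite at the MLE
fancy_neg_log_evidence = -L - 0.5 * k * np.log(2 * np.pi) + 0.5 * logdetH
```

In practice the Hessian would come from the optimizer's own computations, as Sam says, rather than from a separate finite-difference pass.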

My reply: The short answer is that BIC isn’t really an approximation to anything (in my opinion); it’s more of a convention than anything else. The difficulty is that changes in the prior that have no impact on the posterior can have a huge impact on the Bayes factor. We discuss this a bit in chapter 6 of BDA and also in my 1995 article with Rubin in Sociological Methodology.

Sam then wrote:

After reading your article with Rubin, I think part of the problem is that there are a whole bunch of assumptions embedded in the BIC, and one may only buy into some (or none) of them. By my count there are: (1) a Gaussian posterior with a mode at the MLE, (2) proper priors, and (3) infinite data. Is it not true that when these assumptions are satisfied, the difference of BICs for two models converges to −2 times the log Bayes factor?

In any case, I think we can all agree that the BIC is basically worthless in light of these assumptions virtually never being jointly satisfied. But assuming for the sake of argument that we have proper uninformative priors and a Gaussian posterior centered at the MLE, then doesn’t the “fancy” BIC give the same answer as the log marginal likelihood?

To which I replied: But my short answer is that (a) there is no such thing as infinite data, (b) no, the BIC is intended to work with improper priors, and (c) the Gaussian posterior is the least of your worries. I think there is a false belief that many people have, that the BIC is an approximation to some well-defined quantity. But no, I don’t think that’s right. That said, perhaps BIC is useful sometimes. I personally prefer AIC-type measures that are associated with prediction error. Getting back to your other question, when you can calculate the marginal likelihood, that’s fine. But it’s not really defined when priors are improper.

And then his colleague, Dylan Simon, wrote:

Since it wasn’t totally clear to me: what Sam is referring to as “fancy BIC” is really the negative log of the Laplace approximation to the marginal likelihood (using the Gaussian approximation to the posterior), without taking the prior probability into account:

-L - (k/2)*log(2*pi) + (1/2)*log(det(H))

Where L is the maximum log likelihood, k the number of parameters, and H the Hessian of -L (the observed information) with respect to the parameters. While I understand that this ends up looking a lot like the BIC when H is full rank, it still seems to have advantages over the BIC. With uniform priors on bounded, equal-size parameter ranges, it ends up giving the same answer as the Laplace approximation for purposes of model comparison (neglecting the fact that the approximation is clearly wrong when the priors, and hence the posteriors, have bounded support anyway).

Or is the point that it’s problematic to use any approach that tries to approximate the marginal likelihood when you don’t have proper priors?

My reply: Yes, to that last question.
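A small worked example may make the point concrete (a sketch under assumed conditions: a single mean parameter, normal likelihood with known unit variance). Here the likelihood is exactly Gaussian in the parameter, so the Laplace step is exact, and Dylan's expression coincides with the negative log of the integrated likelihood under a flat improper prior. That integral exists, but it is not the marginal likelihood of any proper Bayesian model, which is precisely what is at issue:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=40)   # sigma^2 = 1 assumed known; mu is the only parameter
n, k = x.size, 1

mu_hat = x.mean()
L = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu_hat) ** 2)  # max log-likelihood
H = np.array([[float(n)]])          # Hessian of -loglik w.r.t. mu is n/sigma^2 = n

sign, logdetH = np.linalg.slogdet(H)
fancy = -L - 0.5 * k * np.log(2 * np.pi) + 0.5 * logdetH

# Because the likelihood is exactly Gaussian in mu, the Laplace step is exact:
# "fancy" equals the negative log of the integral of p(x | mu) dmu
# under a flat (improper) prior on mu.
flat_prior_neg_log_evidence = -(L + 0.5 * np.log(2 * np.pi / n))
```

The two quantities agree to machine precision in this case; with a non-Gaussian likelihood they would differ by the error of the Laplace approximation.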

A commenter, Corey, responded:

"The short answer is that BIC isn't really an approximation to anything (in my opinion)… I think there is a false belief that many people have, that the BIC is an approximation to some well-defined quantity."

This assertion makes me nervous because it directly contradicts Kass and Wasserman, JASA, vol. 90, no. 431, pp. 928–934. I'm not saying that I endorse marginal likelihood methods, but as a matter of pure math, one of the following must be the case:

(i) I've misunderstood the referenced paper, or you, or both; or

(ii) your assertion is just wrong.

Corey,

My expression "well-defined quantity" is imprecise. BIC has a formula, and any formula is automatically a well-defined quantity. What I'm trying to say is that BIC does not correspond to any posterior distribution. The flaw in the Kass and Wasserman paper is their idea that "the amount of information in the prior be equal to the amount of information in one observation" (see end of page 929). First off, this won't work if your prior is something different; second, the concept of an "observation" isn't clearly defined (it's part of the likelihood, but it's not always clear how the likelihood can be divided into "observations").

Corey replied:

"What I'm trying to say is that BIC does not correspond to any posterior distribution."

That's what I took you to mean, and I'm still thinking you're wrong. The posterior distribution to which the BIC is an approximation is a distribution over an indicator variable for the truth of a reduced model versus the truth of a full model. The prior on the indicator variable is uniform over its two states. (I agree that an analysis focused on the "truth" of any particular model is fundamentally misguided in many important applications.)

On the question of "amount of information in one observation", your first flaw isn't a flaw — the object of the game is to figure out the implicit prior underlying the BIC. This lets a potential user know when the BIC is a meaningful approximation for her, to wit, precisely when her prior over the extra parameter is close to a unit-information prior. The second flaw is indeed a flaw, but in the limited context of nested linear models where the larger model has exactly one additional dimension and the errors are normal and independent, the unit-information prior is easy to define unambiguously. Hence, in this limited context, use of the BIC is an approximation to a specific fully Bayesian analysis. That's one of the take-home messages of the paper.
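Corey's take-home message can be illustrated numerically in the simplest nested setting (assumed for illustration: normal data with known unit variance, M0 fixing mu = 0 versus M1 with the unit-information prior mu ~ N(0, 1)). The exact log Bayes factor, here computed by brute-force quadrature, and the BIC-based approximation agree up to terms that shrink as n grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(0.0, 1.0, size=n)        # sigma^2 = 1 assumed known
xbar = x.mean()
S = np.sum((x - xbar) ** 2)

def loglik(mu):
    # log p(x | mu), written via the sufficient statistics (xbar, S)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * (S + n * (xbar - mu) ** 2)

# M0: mu = 0.  M1: mu ~ N(0, 1), a unit-information prior for this model.
log_p0 = loglik(0.0)

# log marginal likelihood of M1 by brute-force quadrature (max-shifted for stability)
mu = np.linspace(-1.0, 1.0, 200001)
log_integrand = loglik(mu) - 0.5 * np.log(2 * np.pi) - 0.5 * mu ** 2
m = log_integrand.max()
log_p1 = m + np.log(np.sum(np.exp(log_integrand - m)) * (mu[1] - mu[0]))

log_bf_exact = log_p1 - log_p0

# BIC-based approximation to the same log Bayes factor:
# (L1 - L0) - (1/2) log n, where L1 - L0 = n * xbar^2 / 2 for this model.
log_bf_bic = n * xbar ** 2 / 2 - 0.5 * np.log(n)
```

For moderate n the two numbers should agree closely; shrinking n widens the gap, which is one way to see why the BIC is only trustworthy when something like the unit-information prior actually matches your prior and the asymptotics have kicked in.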