It’s all in chapter 6 of Bayesian Data Analysis. Anyway, Sam Gershman wrote to me:
My colleagues and I have been trying to jump through the ring of fire that is Bayesian model comparison. However, as bad Bayesians we have been fitting most of our models with maximum likelihood estimation, mostly due to computational considerations and uneasiness about choice of prior. We’re going to do real Bayesian inference any day now, but before that happens we have a question about ML-based asymptotic criteria for model comparison, namely the Bayesian Information Criterion. As we understand it, the BIC is derived from the Laplace approximation when you drop all the terms that don’t grow with the number of datapoints. One of these terms includes the Hessian of the likelihood function; this is in fact a quantity that we have from the gradient descent computations, and we were wondering what you thought about constructing a “fancy” BIC (as Zoubin Ghahramani called it) by keeping that term rather than dropping it. Our intuition was that this would give you an answer somewhere in between the asymptotic result and actually computing the marginal likelihood. This would be especially useful in the (very) finite data regime we are in, where the asymptotic assumptions of the BIC are unlikely to hold. As far as we can tell, no one has suggested this before, and so naturally we were scared of using it for fear of some unknown pitfall!
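For concreteness, the standard BIC keeps only the terms of the Laplace approximation that grow with the number of datapoints: BIC = k*log(n) - 2*L, where L is the maximized log likelihood. Here is a minimal sketch of that baseline computation; the Gaussian-mean toy model is my own illustration, not something from the exchange:

```python
import numpy as np

def bic(max_log_lik, k, n):
    """Standard BIC convention: k*log(n) - 2*L (lower is preferred)."""
    return k * np.log(n) - 2.0 * max_log_lik

# Toy model (illustrative only): Gaussian with unknown mean, known sd = 1.
rng = np.random.default_rng(0)
y = rng.normal(loc=0.5, scale=1.0, size=50)
mu_hat = y.mean()  # the MLE of the mean
L = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu_hat) ** 2)
print(bic(L, k=1, n=len(y)))
```

The "fancy" variant Sam describes would add back the 2*pi and Hessian terms that this convention drops.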
My reply: The short answer is that BIC isn’t really an approximation to anything (in my opinion); it’s more of a convention than anything else. The difficulty is that changes in the prior that have no impact on the posterior can have a huge impact on the Bayes factor. We discuss this a bit in chapter 6 of BDA and also in my 1995 article with Rubin in Sociological Methodology.
Sam then wrote:
After reading your article with Rubin, I think part of the problem is that there are a whole bunch of assumptions embedded in the BIC and one may only buy into some (or none) of them. By my count there are: (1) Gaussian posterior with a mode at the MLE, (2) proper priors, (3) infinite data. Is it not true that when these assumptions are satisfied, the difference of BICs for two models converges to minus twice the log Bayes factor?
In any case, I think we can all agree that the BIC is basically worthless in light of these assumptions virtually never being jointly satisfied. But assuming for the sake of argument that we have proper uninformative priors and a Gaussian posterior centered at the MLE, then doesn’t the “fancy” BIC give the same answer as the log marginal likelihood?
To which I replied: But my short answer is that (a) there is no such thing as infinite data, (b) no, the BIC is intended to work with improper priors, and (c) the Gaussian posterior is the least of your worries. I think there is a false belief that many people have, that the BIC is an approximation to some well-defined quantity. But no, I don’t think that’s right. That said, perhaps BIC is useful sometimes. I personally prefer AIC-type measures that are associated with prediction error. Getting back to your other question, when you can calculate the marginal likelihood, that’s fine. But it’s not really defined when priors are improper.
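On the AIC-type measures mentioned in the reply: AIC = 2*k - 2*L targets out-of-sample predictive accuracy rather than the marginal likelihood. A hedged sketch comparing two nested Gaussian models (the models and data are my own toy example):

```python
import numpy as np

def aic(max_log_lik, k):
    """AIC: 2*k - 2*L, an estimate of relative out-of-sample predictive loss."""
    return 2 * k - 2.0 * max_log_lik

rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=1.0, size=40)

# Model 0: mean fixed at 0, sd = 1 (no free parameters, k = 0).
L0 = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * y ** 2)
# Model 1: mean estimated by maximum likelihood, sd = 1 (k = 1).
L1 = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - y.mean()) ** 2)

print(aic(L0, 0), aic(L1, 1))  # lower AIC is preferred
```

Unlike BIC, the penalty here does not grow with n, so AIC does not require a story about priors at all.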
And then his colleague, Dylan Simon, wrote:
Since it wasn’t totally clear to me, what Sam is referring to as “fancy BIC” is really the Laplace approximation to a Bayesian Gaussian posterior, without taking into account the prior probability:
-L - k*log(2*pi)/2 + log(det(H))/2
where L is the maximum log likelihood, k the number of parameters, and H the Hessian of -L (the negative log likelihood) with respect to the parameters, evaluated at the maximum. While I understand that this ends up looking a lot like BIC when H is full rank, it still seems to have advantages over BIC. If you had uniform priors on bounded, equal-size parameter ranges, it ends up giving the same answer as the Laplace approximation in terms of model comparison (neglecting the fact that the approximation is clearly wrong if your priors, and so your posteriors, have bounded support anyway).
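As a sketch of that formula, here is -L - k*log(2*pi)/2 + log(det(H))/2 computed for a one-parameter Gaussian-mean model, with H (the observed information, i.e. the Hessian of the negative log likelihood at the MLE) approximated by a central difference; the toy model and the finite-difference step are my own illustration:

```python
import numpy as np

def neg_log_lik(mu, y):
    """Negative log likelihood for Gaussian data with known sd = 1."""
    return np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (y - mu) ** 2)

def fancy_bic(y, mu_hat, eps=1e-4):
    """-L - (k/2)*log(2*pi) + (1/2)*log(det(H)) with k = 1 here, where H is
    the Hessian of the *negative* log likelihood at the MLE (the observed
    information), approximated by a central difference."""
    L = -neg_log_lik(mu_hat, y)
    H = (neg_log_lik(mu_hat + eps, y) - 2 * neg_log_lik(mu_hat, y)
         + neg_log_lik(mu_hat - eps, y)) / eps ** 2
    return -L - 0.5 * np.log(2 * np.pi) + 0.5 * np.log(H)

rng = np.random.default_rng(2)
y = rng.normal(loc=0.0, scale=1.0, size=30)
print(fancy_bic(y, y.mean()))
```

For this model H = n exactly, so the expression reduces to -L + log(n)/2 - log(2*pi)/2: half the standard BIC minus a constant that the BIC convention drops.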
Or is the point that it’s problematic to use any approach that tries to approximate the marginal likelihood when you don’t have proper priors?
My reply: Yes, to that last question.