Aki Vehtari and Janne Ojanen just published a long paper that begins:

To date, several methods exist in the statistical literature for model assessment, which purport themselves specifically as Bayesian predictive methods. The decision theoretic assumptions on which these methods are based are not always clearly stated in the original articles, however. The aim of this survey is to provide a unified review of Bayesian predictive model assessment and selection methods, and of methods closely related to them. We review the various assumptions that are made in this context and discuss the connections between different approaches, with an emphasis on how each method approximates the expected utility of using a Bayesian model for the purpose of predicting future data.

AIC (which Akaike called “An Information Criterion”) is the starting point for all these methods. More recently, Watanabe came up with WAIC (which he called the “Widely Applicable Information Criterion”). In between there was DIC, which has some Bayesian aspects but does not fully average over the posterior distribution.

I still dream of coming up with something with Vehtari and calling it the Very Good Information Criterion. But I don’t think it’s gonna happen. The tradition in this area has been to come up with a clever, computable formula and then hope it does everything we want. Vehtari and Ojanen do it slightly differently by asking more clearly what the goals are. If the goal is some sort of predictive error, then it turns out that there is no magic formula. In fact, it’s not even clear what the goal is: it’s easy to come up with examples where the relevant out-of-sample predictive error can be defined in different, incompatible ways. One valuable aspect of the Vehtari and Ojanen paper is that they explicitly discuss these different goals rather than assuming or implying that a single measure will tell the whole story.

I believe there is more that could be done. For example suppose you had two distributions (models) P1 and P2 for the future price of Apple’s stock. Imagine P2 was quite narrow and implied that the stock will be between $550 and $560. The other model (P1) implied the stock price was going to be between $500 and $600. It’s clear that P2’s implication is more likely to be wrong simply because it’s making a much sharper prediction and hence is consistent with fewer outcomes.

There is an immediate connection with information theory. If you compute the entropy of the two distributions, S = -∫ p log p, you will get, following Boltzmann, something like S = log W, where W is the width of the high-probability manifold. Thus S1 = log 100 and S2 = log 10. The distribution with the higher entropy (i.e., lower information) will tend to be more consistent with actual outcomes.
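The arithmetic here can be checked directly: for a uniform density p = 1/W on an interval of width W, the entropy integral reduces to exactly log W. A minimal sketch:

```python
import math

# For a uniform distribution on an interval of width W, the differential
# entropy S = -∫ p log p works out to exactly log W, matching the
# Boltzmann-style S = log W in the comment above.
def uniform_entropy(width):
    return math.log(width)

S1 = uniform_entropy(100)  # P1: $500 to $600, so W = 100
S2 = uniform_entropy(10)   # P2: $550 to $560, so W = 10
# S1 = log 100 ≈ 4.61 exceeds S2 = log 10 ≈ 2.30: the wider, less
# informative model has the higher entropy.
```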

In other words, more informative models contain more “information” and therefore make sharper predictions. This is great if you’re certain the “information” is true. In practical situations where the “information” is actually dubious modeling assumptions, there will be a tendency for less informative models to make predictions which are more consistent with what actually happens.

In particular, any modeling assumptions that shrink W below some limit suggested by the data would be especially suspect unless you had good reason to be confident in those assumptions.

Admittedly, in science you’re probably better off making strong modeling assumptions, even if they’re dubious, since they lead to more precise predictions. You’ll learn a lot more by testing those predictions than you will with weak models. But there are plenty of people outside of science who have to face the problem of how to make predictive models out of a bunch of questionable assumptions.

I have a question that’s been bothering me for some time.

A case in biology comes to mind: maximum likelihood estimation in phylogenetics. Often the data are aligned prior to analysis. The alignment creates an NxM matrix, whose M columns are treated as i.i.d. This is fine, but they aren’t _really_ independent observations; they are only treated that way. If we unaligned the data and added the alignment procedure to the model, how would we calculate the number of observations? It makes sense that it should be one observation per species. If we used AIC it wouldn’t matter, but AICc or BIC would be necessary, since n/k would end up being far less than 40, in fact less than 1.
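For concreteness, here is a sketch of the criteria being contrasted (loglik is the maximized log-likelihood, k the parameter count, n the observation count; all standard textbook forms). Note that the AICc correction term has n - k - 1 in its denominator, so the criterion breaks down once n ≤ k + 1, which is exactly the n/k < 1 regime described above:

```python
import math

def aic(loglik, k):
    # Akaike's criterion: fit penalized by parameter count
    return -2 * loglik + 2 * k

def aicc(loglik, k, n):
    # Small-sample correction to AIC. The denominator n - k - 1 hits zero
    # (then goes negative) as n falls toward k + 1, so the formula is
    # undefined when n/k is small, as in the alignment example above.
    return aic(loglik, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(loglik, k, n):
    # Penalty grows with log n, so it too needs a defensible value of n
    return -2 * loglik + k * math.log(n)
```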

Adding some factor reflecting the complexity of the algorithm/model seems necessary. I haven't delved too far into the other model selection criteria presented here; that will be the first thing in my new year. Thanks.

While I am sure the paper (which I haven’t read) is good, the phrase “the expected utility of using a Bayesian model for the purpose of predicting future data” is not even close to specified. The “utility” has to be a scalar (preferably in currency or exchangeable units) that is computed from the details of the prediction, where different kinds of residuals might be weighted in that scalar very differently (you might be willing to pay large dollars to avoid huge outliers, or you might only care about the best few points, or etc).
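A toy sketch of what pinning down such a scalar might look like; the loss functions and numbers below are purely illustrative (not from the paper), and serve only to show that different weightings of the residuals are different utilities:

```python
import numpy as np

# Two ways of collapsing prediction error into a single scalar "utility",
# weighting residuals very differently (illustrative choices only).
def neg_abs_loss(pred, outcome):
    # linear penalty: cares about typical error size
    return -abs(pred - outcome)

def outlier_averse_loss(pred, outcome):
    # quadratic penalty: pays heavily to avoid huge misses
    return -(pred - outcome) ** 2

def expected_utility(decision, outcome_draws, u):
    # average the scalar utility over draws of the uncertain outcome
    return float(np.mean([u(decision, y) for y in outcome_draws]))
```

The same decision can score very differently under the two utilities, which is the point: the criterion is underdetermined until the utility function is pinned down.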

In some ways, the fact that utility is so restricted (must be a scalar in USD units) but also so free (can be any arbitrarily complicated function of the outputs) is why none of these simple ICs are very useful in general. The “Haughtiest Information Criterion” is to optimize your posterior-expected Long-Term Future Discounted Free-Cash Flow. I like the HIC in part because it is literally impossible to apply, since the word “Long-Term” means (almost by definition of “Long”) that you can’t build any accurate model!

(I have been tweeting about this with the tag #LTFDFCF but somehow that hasn’t really taken off…)

The state of the model comparison literature reminds me how very unsettled the foundations of statistical inference are. It’s ironic that so much useful stuff can be done with GLMs etc, and yet there is so little consensus on the foundations of estimation and prediction (among professional statisticians at least).

For example, I was once accused of “nonsense” for stating in a talk that models have probabilities. I’m an evolutionary theorist first, and a statistician second. So the fact that foundational issues in statistical epistemology and application are still controversial feels very weird to me. I mean, we have our fights in evolutionary biology, but almost no one is debating fundamental issues anymore. Perhaps the deepest controversy is about epigenetics, but none of that has the character of the Bayes/no-Bayes fights or the grasping feel of the model comparison literature.

I’m not trying to be unfair to statistics. Really, I’m hoping others here will tell me I’ve got it all wrong, and that the fundamentals are really quite settled. But if not, it could just be the relative ages of the fields and how much empirical content helps to determine fundamentals.

I tried using WAIC for model comparison in a genetics project with MCMC output to estimate the required expectations. The short version is that the correction to the log-posterior-likelihood \sum_x log(E[p(x|theta)]) estimated by SW’s formula does a pretty good job of approximating the degrees of freedom in all the easy cases I tried, and something reasonable looking in the hard ones. The correspondence to expected CV error was also as claimed, but I only looked at that in a few linear model settings. The drawback was that the numerical error from MCMC was a bit too big for my application, although fixable with time for more iterations. Additionally, the theory is only really worked out (that I know of) if you have x independent given theta.

Ryan:

I like Waic too. The thing it took me a while to realize, but which I now sort of understand, is that no measure (Waic included) is perfect. There are simple examples in which any of these measures fails to give a reasonable answer. I don’t think anyone’s ever going to work out the theory to prove that Waic or whatever is the right thing to do, but I do expect that experience with more and more examples (including theoretical examples) will help us understand how best to use such measures.

Each measure answers its own question of course. I ended up needing to generate frequentist calibration of whatever measure I chose. I think that false-discovery-rate properties are probably the relevant ones for these big genetics projects. I suppose that I should mention that I was using the WAICV form instead of the WAICG form, although in retrospect the latter seems easier and more stable to calculate. This is definitely a handy survey and synthesis paper, and I’ll be going through it in more detail eventually.
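For readers following the WAICV-vs-WAICG distinction in these comments, here is a minimal sketch of computing both from MCMC output; the function name and the variance/mean labels are my own, and it assumes (as noted above) observations independent given theta:

```python
import numpy as np

def waic(log_lik):
    # log_lik: S x N array of pointwise log-likelihoods log p(y_i | theta_s)
    # over S posterior draws and N observations.
    S = log_lik.shape[0]
    # lppd_i = log E_theta[p(y_i | theta)], via log-sum-exp for stability
    lppd = np.logaddexp.reduce(log_lik, axis=0) - np.log(S)
    # variance form of the effective-parameter correction ("WAICV" above)
    p_v = np.var(log_lik, axis=0, ddof=1)
    # mean/"Gibbs" form of the correction ("WAICG" above)
    p_g = 2 * (lppd - log_lik.mean(axis=0))
    # both returned on the deviance scale, -2 * (lppd - penalty)
    return -2 * (lppd.sum() - p_v.sum()), -2 * (lppd.sum() - p_g.sum())
```

With well-mixed chains the two forms typically agree closely; the Monte Carlo noise the comment mentions shows up directly in the pointwise variance term.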

[…] An important new survey of Bayesian model selection – Andrew Gelman […]

[…] Andrew Gelman, I just discovered the existence of the journal “Statistics Survey”, which has as one of its […]

[…] A very comprehensive paper by Vehtari and Ojanen discusses various aspects of model assessment and c…. I was mentioning to a friend the other day that I don’t think model assessment is taught very well in first year statistics classes. I recognise that you can’t teach everything in a first year class, but I think we should be going beyond R2 as a summary of model performance. Even the AIC would be a good start, as it incorporates a trade-off between goodness of fit and model complexity. I am yet to fully read the paper but it’s something I should definitely get my head around. […]