Understanding predictive information criteria for Bayesian models

Jessy, Aki, and I write:

We review the Akaike, deviance, and Watanabe-Akaike information criteria from a Bayesian perspective, where the goal is to estimate expected out-of-sample prediction error using a bias-corrected adjustment of within-sample error. We focus on the choices involved in setting up these measures, and we compare them in three simple examples, one theoretical and two applied. The contribution of this review is to put all these information criteria into a Bayesian predictive context and to better understand, through small examples, how these methods can apply in practice.

I like this paper. It came about as a result of preparing Chapter 7 for the new BDA. I had difficulty understanding AIC, DIC, WAIC, etc., but I recognized that these methods served a need. My first plan was to just apply DIC and WAIC on a couple of simple examples (a linear regression and the 8 schools) and leave it at that. But when I did the calculations, I couldn’t understand the results. Hence more effort working all these out in some simple examples, and further thought into the ultimate motivations for all these methods.
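
For concreteness, here is a minimal sketch in Python of the WAIC computation as described in the paper, assuming you have already saved an S x n matrix of pointwise log-likelihood values evaluated at S posterior simulation draws (the function and variable names are mine, for illustration only):

    import numpy as np
    from scipy.special import logsumexp

    def waic(log_lik):
        # log_lik: S x n matrix of log p(y_i | theta^s) over S posterior draws and n data points
        S = log_lik.shape[0]
        # log pointwise predictive density: log of the posterior-averaged likelihood of each point
        lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))
        # effective number of parameters: posterior variance of the pointwise log likelihood, summed
        p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
        # report on the deviance scale
        return -2 * (lppd - p_waic)

In practice the pointwise log-likelihood matrix is the kind of thing you can compute alongside the posterior simulation (for example, in Stan's generated quantities block) and then feed to a helper like this afterward.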

P.S. When introducing AIC, Akaike called it An Information Criterion. When introducing WAIC, Watanabe called it the Widely Applicable Information Criterion. Aki and I are hoping to come up with something called the Very Good Information Criterion.

26 thoughts on “Understanding predictive information criteria for Bayesian models”

  1. Generally, I think if you’re having to judge models based on information criteria then you’re probably in trouble. But if you’ve got nothing else to go on, it does make sense to me that models can be judged by such criteria. If you’re trying to predict the closing price of Apple’s stock today, Y, using two different models P_1(Y) and P_2(Y), then you may notice the following:

    P_1 predicts “Apple’s stock price will be between $450 and $460”
    P_2 predicts “Apple’s stock price will be between $100 and $900”

    Clearly we can judge that P_2’s prediction is more likely to be true without having a single shred of data.

    The connection to information theory is the following: P_2’s prediction is more likely because the interval is wider. The length of the interval is essentially a measure of the size of the high-probability region W of P_i(Y). But the size of this high-probability region is related to the entropy via S ~ ln |W|.
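
    To put rough numbers on that (my own back-of-the-envelope arithmetic, treating each prediction as roughly uniform over its stated interval): the first interval has width |W_1| = 10 and the second |W_2| = 800, so S_1 ~ ln 10 ≈ 2.3 while S_2 ~ ln 800 ≈ 6.7. The wider, higher-entropy prediction is the one far more likely to contain the actual closing price.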

    • We can define science as “doing things that we don’t yet know how to do”. In this sense, if we’re not “in trouble”, we’re not doing science.

    • Here’s something I’ve never bothered to actually figure out, but have always found confusing. In using the “evidence” as a measure of goodness of fit, people talk about the evidence E as the normalizing constant for the posterior distribution p… but typically we only have the posterior defined up to this constant. And the way we write down the model will change this constant.

      p1 = exp(-(L+P+c1)), where L is the likelihood, P is the prior, and c1 is the arbitrary constant incorporated in how we wrote this down
      p2 = exp(-(L+P+c2)), where L and P are the same but c2 is some other constant we get because of writing the posterior in some other way (such as automatically generating it using some model description language like Stan’s modeling language).

      Now suppose that c1 = 0 and c2 = log(2): won’t the normalizing constant for p2 then be half as big as p1’s? Won’t this suggest that there’s a different amount of evidence for p2, even though the models are exactly the same?

      OK, well, you might argue that you could create a “normalized form” where we force the constant c = 0, and then this solves the problem of comparing models using the evidence… but no one typically discusses this. I’m a little perplexed, but I haven’t thought hard about it, so maybe the answer is just obvious once you work through an example? Any thoughts?

      Also, I recognize that your point is more about entropy but you seem to have thought about these issues so I figured you might have something to say on this topic.

      • Daniel, I understand the question, although I assume you mean “exp(-L) is the likelihood and exp(-P) is the prior”. I believe the answer to your question is that the normalization constant is a probability distribution in its own right. Recall that P(A|B) = P(B|A)P(A)/P(B).

        So the normalization constant is P(B). I believe the requirement that it be normalized to 1 over B removes the ambiguity you’re referring to. It’s the normalized P(B) that you want to use.

        • Yes, I mean L is the portion (of the negative log posterior) that comes from the likelihood and P the portion from the prior. I guess you’re right; it makes sense, but then the statement that the evidence is just the normalizing constant is actually false. It’s the normalizing constant for P(B|A)P(A), where each of those is individually normalized, not for some other proportional form.
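
          To see the arbitrary-constant ambiguity and its resolution numerically, here is a small sketch of my own (not from the thread) using a toy conjugate model, y ~ Normal(theta, 1) with theta ~ Normal(0, 1), where everything can be computed by brute force on a grid:

            import numpy as np
            from scipy.stats import norm

            y = 1.0                                  # a single hypothetical observation
            theta = np.linspace(-10.0, 10.0, 20001)  # grid over the parameter
            dtheta = theta[1] - theta[0]

            # Evidence from the individually normalized likelihood and prior: unambiguous.
            evidence = np.sum(norm.pdf(y, theta, 1.0) * norm.pdf(theta, 0.0, 1.0)) * dtheta

            # The same posterior written with arbitrary constants c1 = 0 and c2 = log(2):
            log_post_1 = norm.logpdf(y, theta, 1.0) + norm.logpdf(theta, 0.0, 1.0)
            log_post_2 = log_post_1 - np.log(2.0)
            Z1 = np.sum(np.exp(log_post_1)) * dtheta  # equals the evidence
            Z2 = np.sum(np.exp(log_post_2)) * dtheta  # half the evidence: the ambiguity above

            print(evidence, Z1, Z2)

          The point is that Z1 and Z2 differ by the arbitrary factor exp(-c2), while the evidence defined from P(B|A)P(A), with each piece individually normalized, does not depend on how the unnormalized posterior happened to be written down.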

      • Incidentally, Daniel, that Bayesian model comparison stuff is identical to Gibbs’s method for analyzing phase transitions from, say, liquid to gas. The similarity is especially noticeable if you think of distributions this way: http://www.entsophy.net/blog/?p=50

        Comparing the models using the normalization constant is like comparing the partition functions to get the fraction of liquid versus gas. You might even be able to think of it as giving the odds that a molecule is in the liquid versus the gas state, which over the n ~ 10^23 particles translates into the percentage of material in the liquid or gas phase.

        All of which hints that there are some powerful results along these lines which haven’t been discovered yet. Statisticians like Gelman, or frequentists, who are dismissive of the Bayesian model comparison stuff, obviously aren’t going to discover them, but people with more of a physics/Bayesian background might be able to make some headway.

        • You know, the more I think about Gibbs’s work, the more it reminds me how warped most statisticians’ experience with statistics really is.

          From a few data points (just a couple of measurements really) Gibbs was able to infer/predict amazingly useful and successful results about phase transitions. That’s so different from the “collect masses of data and make some crappy inferences/predictions” version of statistics that most people have become accustomed to when applying probability distributions to the real world.

        • The physical problem has an extremely well understood micro-mechanical model, not so much for things like say inventory analysis of Amazon.com warehouses or election forecasting. It’s really not fair to compare models where the micro-states are extremely well understood to models where it’s not even clear what a micro-state is.

        • That is the conventional wisdom; however, the fact is that Maxwell, Boltzmann, Gibbs, and initially Planck and Einstein got those impressive results using a simple classical model of the atomic realm which could not have been more wrong.

          Whatever the source of their success was, it’s definitely not because they understood atoms better than we understand inventory control and elections.

        • I think “could not have been more wrong” is overstating things a little, don’t you think? Especially with regard to the difference in how well we understand the foundations of physical systems vs human systems.

        • No, I don’t think it’s overstating it. Check out Maxwell’s, Boltzmann’s, or Gibbs’s actual papers/books. Their “atoms” were point masses subject to simple mechanical forces. No mention of E&M, quantum mechanics, relativity, electrons, protons, neutrons, or anything else.

          The 19th century was full of such models of “atoms”, all of which are considered hopelessly wrong by today’s standards, and 99.99% of which have been totally forgotten except by historians of science.

          My point was that it’s easy, but mistaken, to dismiss their successes this way. It’s a version of the old claim that physicists are more successful than social scientists because they had all the easy problems. I really don’t think that’s true at all.

          For example, for most of human existence, thinkers found human psychological motivations far more accessible and understandable than, say, the patterns of waves on a beach. We only have the reverse judgement today because physicists were so successful.

        • “physicists are more successful than social scientists because they had all the easy problems.”

          My naive guess: (1) physicists were more numerous and smarter people, in general; (2) social scientists often sacrifice well-posed problems in favor of more ambitious ones.

        • I don’t think there’s much doubt that the average physicist is smarter than the average social scientist, but my personal experience has been that the top people in pretty much any field are similar to physicists (i.e. there are just fewer bottom feeders in physics).

          Maybe other people’s experiences are different, but I knew Mills of Yang-Mills fame before he died at Ohio State:
          https://en.wikipedia.org/wiki/Millennium_Prize_Problems
          https://en.wikipedia.org/wiki/Yang%E2%80%93Mills_theory

          (I was even going to help him write a graduate E&M book, of all things! Big waste of time.) He didn’t seem to possess some kind of superhuman cleverness that Gelman, Wasserman, or any of hundreds of other statisticians I’ve seen didn’t have. He just seemed to be a regular run-of-the-mill genius like the rest of us slobs.

        • I don’t know; atoms can be described fairly well by momentum and position. Sure, there’s all the quantum uncertainty stuff, but it comes out in the wash when you’re talking stat mech of gases, and you can get some simple liquid models without too much trouble that way. Classical MD has been fairly damn successful in predicting physical properties of complex molecules, for example.

          On the other hand, what’s the microstate for an inventory control system? It’s certainly a lot more complicated to describe than a big set of 6-real-number momentum+position vectors. You’ve got maybe 300M people in the US, maybe 250M of whom could order something from Amazon at any given time. They fall into pretty broad categories of life stages and so forth: some need a one-off item like a new sun hat for their trip through the Southwest, others are starting nursing school and need a stethoscope and some highly specialized textbooks, others need their latest order of diapers for the baby, or a new phone because they dropped their last one in the toilet…

          Although there are far fewer people than the 10^23 molecules in a sample of liquid, each one has a state space that’s pretty vastly high-dimensional. I think it’s this high dimensionality of the space of each actor which has until recently led to little mechanistic modeling of these types of situations. I think part of the “big data” fanaticism is that there have been some recent successes in mechanistic modeling like this: Amazon’s recommendations, and that thing about predicting pregnancy at Target that was on Kaiser Fung’s blog a year or so ago, for example. People have focused on just a few important dimensions of the human state space and gotten somewhere predicting with them.

    • @Entsophy

      I’m naive about the statistical theory, but in the Apple analogy you give, wouldn’t most people intuitively use some variant of norm(Y_pred - Y_actual)?

      Is there something that would make you compare P1 & P2 the way you did?

      • Generally, the goal seems to be to judge or compare models based purely on their structure, without using data or actual values. Usually, though, people seem to have the narrower goal of picking what to include or exclude from the model in order to avoid over-fitting.

        Over-fitting models isn’t inherently bad. It’s only bad when doing so causes the model to focus tightly around features which are unlikely to be repeated in future trials, making the model’s predictions not very robust. My point was merely that it is in fact possible to compare the predictive robustness of two models without any data and without having any actual values.

        The key quantity to focus on is what statisticians call the Kullback-Leibler divergence, but which appeared explicitly, with proofs of its key properties, in the work of physicists far earlier. It’s really more accurate to think of the Kullback-Leibler divergence as just “the entropy”. It has the form -\sum_{Y} P_2(Y) \ln [P_2(Y)/P_1(Y)].

        This quantity will capture both the fact that the high-probability manifolds of P_1 and P_2 overlap and that the high-probability manifold of P_2 is bigger than P_1’s. Those are precisely the qualities that make P_2’s prediction inherently more robust than P_1’s.
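
        As a rough numerical illustration (my own sketch, not something from the thread): treat the two Apple predictions as normal distributions centered at $455, with scales chosen so that the stated intervals are roughly their high-probability regions, and evaluate the sum above on a grid:

          import numpy as np
          from scipy.stats import norm

          # Hypothetical stand-ins for the two predictive distributions of the closing price
          y = np.linspace(-800.0, 1800.0, 260001)
          dy = y[1] - y[0]
          log_p1 = norm.logpdf(y, 455.0, 2.5)    # narrow: roughly the $450-$460 prediction
          log_p2 = norm.logpdf(y, 455.0, 200.0)  # wide: roughly the $100-$900 prediction

          # -sum_Y P_2(Y) ln[P_2(Y)/P_1(Y)], the quantity written above
          # (the negative of the usual Kullback-Leibler divergence D(P_2 || P_1))
          p2 = np.exp(log_p2)
          entropy_like = -np.sum(p2 * (log_p2 - log_p1)) * dy
          print(entropy_like)

        No observed stock price appears anywhere in the calculation, which is the sense in which the comparison needs no data.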

          • “it is in fact possible to compare the predictive robustness of two models without any data and without having any actual values.”

          But this predicted robustness may or may not hold true once you have actual values? i.e. your metric is the best we can do in the absence of actual data to validate against?

  2. Re: p.s.

    When I was working with a SAMSI group and thinking of possible names, the group was concerned with model fit, understanding, criticism and then keeping (choosing) the better models. Thinking that criticism might come before understanding would give http://en.wikipedia.org/wiki/FCUK but I got voted down.
