## Deviance, DIC, AIC, cross-validation, etc.

The deviance information criterion (or DIC) is an idea of Brad Carlin and others for comparing the fits of models estimated using Bayesian simulation (for more information, see this article by Angelika van der Linde).

I don’t really ever know what to make of DIC. On one hand, it seems sensible, it handles uncertainty in inferences within each model, and it does not depend on aspects of the models that don’t affect inferences within each model (unlike Bayes factors; see discussion here). On the other hand, I don’t really have any idea what I would do with DIC in any real example. In our book we included an example of DIC–people use it and we don’t have any great alternatives–but I had to be pretty careful that the example made sense. Unlike the usual setting where we use a method and that gives us insight into a problem, here we used our insight into the problem to make sure that in this particular case the method gave a reasonable answer.

One of my practical problems with DIC is that it seems to take a lot of simulations to calculate it precisely. Long after we’ve achieved good mixing of the chains and good inference for parameters of interest and we’re ready to go on, it turns out that DIC is still unstable. In the example in our book we ran for a zillion iterations to make sure the DIC was ok.
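To make the Monte Carlo issue concrete, here is a minimal sketch, not the example from the book. It assumes a toy model (y_i ~ Normal(theta, 1) with a flat prior, so the posterior is known exactly) and uses independent posterior draws as a stand-in for MCMC output. The function names and the simulated data are hypothetical. It recomputes DIC several times at each simulation size and reports the spread; with real, autocorrelated chains the spread would presumably be larger for the same number of iterations.

```python
import math
import random
import statistics


def deviance(theta, y):
    """D(theta) = -2 log p(y | theta) for y_i ~ Normal(theta, 1)."""
    n = len(y)
    return n * math.log(2 * math.pi) + sum((yi - theta) ** 2 for yi in y)


def dic_from_draws(draws, y):
    """Return (DIC, pD), with pD = mean deviance - deviance at the posterior mean."""
    d_bar = statistics.fmean(deviance(t, y) for t in draws)
    d_hat = deviance(statistics.fmean(draws), y)
    p_d = d_bar - d_hat
    return d_hat + 2 * p_d, p_d


def posterior_draws(y, n_draws, rng):
    """Independent draws from the exact posterior Normal(ybar, 1/n) (flat prior assumed)."""
    n = len(y)
    return [rng.gauss(statistics.fmean(y), 1 / math.sqrt(n)) for _ in range(n_draws)]


rng = random.Random(1)
y = [rng.gauss(0.3, 1.0) for _ in range(20)]  # hypothetical simulated data

# Recompute DIC with fresh draws to see how much it moves between runs.
for n_draws in (100, 1000, 5000):
    dics = [dic_from_draws(posterior_draws(y, n_draws, rng), y)[0] for _ in range(10)]
    print(n_draws, round(statistics.fmean(dics), 2), round(statistics.stdev(dics), 3))
```

In this toy model pD should come out near 1, the number of parameters, and the run-to-run standard deviation of DIC should shrink as the number of draws grows.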

DIC’s slow convergence is not something I’ve seen written about anywhere (a student and I worked on a project a few years ago to quantify the DIC convergence issue, but we never finished it). Still, by a simple application of the folk theorem of statistical computing (when computation is a struggle, the model or method itself is often the problem), it suggests that the measure might have some more fundamental problems.

I’ve long thought of DIC, like AIC, as being a theory-assisted estimate of out-of-sample predictive error. But I’ve always been stuck on the details, maybe because I’ve never really used either measure in any applied problem.
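For reference, the standard definitions behind that comparison can be written side by side. The notation here is an assumption of this sketch rather than something from the post: $k$ is the number of fitted parameters, $\hat\theta_{\mathrm{MLE}}$ the maximum likelihood estimate, and $\bar\theta$ the posterior mean.

```latex
\begin{align*}
D(\theta) &= -2 \log p(y \mid \theta), \\
\mathrm{AIC} &= D(\hat\theta_{\mathrm{MLE}}) + 2k, \\
\bar{D} &= \mathrm{E}_{\theta \mid y}\!\left[ D(\theta) \right], \qquad
p_D = \bar{D} - D(\bar\theta), \\
\mathrm{DIC} &= D(\bar\theta) + 2 p_D = \bar{D} + p_D .
\end{align*}
```

Both criteria are a plug-in deviance plus twice a complexity term; DIC replaces the parameter count $k$ with the estimated effective number of parameters $p_D$.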

While writing this blog post, I came across an article by Martyn Plummer that gives a sense of the current thinking on DIC and its strengths and limitations. Plummer’s paper begins:

> The deviance information criterion (DIC) is widely used for Bayesian model comparison, despite the lack of a clear theoretical foundation. DIC is shown to be an approximation to a penalized loss function based on the deviance, with a penalty derived from a cross-validation argument. This approximation is valid only when the effective number of parameters in the model is much smaller than the number of independent observations. In disease mapping, a typical application of DIC, this assumption does not hold and DIC under-penalizes more complex models. Another deviance-based loss function, derived from the same decision-theoretic framework, is applied to mixture models, which have previously been considered an unsuitable application for DIC.

Again, I’m not trying to knock DIC, I’m just trying to express my current understanding of it.

P.S. More here.

1. Martyn says:

Thanks for highlighting my paper. DIC has been around for 10 years now and despite being immensely popular with applied statisticians it has generated very little theoretical interest. In fact, the silence has been deafening. I hope my paper added some clarity.

As you say, DIC is (an approximation to) a theoretical out-of-sample predictive error. When I finished the paper I was a little embarrassed to see that I had almost perfectly reconstructed the justification of AIC as an approximate cross-validation measure by Stone (1977), with a Bayesian spin of course.

But even this insight leaves a lot of choices open. You need to choose the right loss function and also which level of the model you want to replicate from. David Spiegelhalter and colleagues called this the "focus". In practice the focus is limited to the lowest level of the model. You generally can't calculate the log likelihood (the default loss function) for higher level parameters. But this narrow choice might not correspond to the interests of the analyst. For example, in disease mapping DIC answers the question "Which model yields the disease map that best captures the features of the observed incidence data during this period?" But people are often asking more fundamental questions about their models, like "Is there spatial aggregation in disease X?" There is quite a big gap between these questions.

Regarding the slow convergence of DIC, you might want to try an alternative definition of the effective number of parameters pD that I came up with in 2002, in the discussion of Spiegelhalter et al. It is non-negative and coordinate free. It can be calculated from 2 or more parallel chains, so its sample variance can be estimated using standard MCMC diagnostics. I finally justified it in my 2008 paper and implemented it in JAGS. The steps are (or should be):

- Compile a model with at least 2 parallel chains.
- Set a trace monitor for "pD".
- Output with the coda command.

If you are only interested in the sample mean, not the variance, the dic.samples function from the rjags package will give you this in a nice R object wrapper.
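The idea of a pD computed from parallel chains can be sketched on a toy problem. This is one reading of that definition, not JAGS's implementation: pD is taken here as the expected Kullback-Leibler divergence between the sampling distributions at two independent posterior draws, one from each chain. The model (y_i ~ Normal(theta, 1), flat prior), the closed-form KL, the data summary, and the use of independent draws in place of real chains are all assumptions of the sketch; check the 2002 discussion and the JAGS manual for the exact form.

```python
import math
import random
import statistics


def kl_normal_unit_var(theta1, theta2, n_obs):
    """KL divergence between n_obs iid Normal(theta1, 1) and Normal(theta2, 1) observations."""
    return n_obs * (theta1 - theta2) ** 2 / 2


def pd_trace(chain_a, chain_b, n_obs):
    """Per-iteration pD values, pairing iteration t of one chain with iteration t of the other."""
    return [kl_normal_unit_var(a, b, n_obs) for a, b in zip(chain_a, chain_b)]


rng = random.Random(2)
n_obs, ybar = 20, 0.3  # hypothetical data summary
sd_post = 1 / math.sqrt(n_obs)

# Stand-ins for two parallel chains: independent draws from the exact posterior.
chain_a = [rng.gauss(ybar, sd_post) for _ in range(4000)]
chain_b = [rng.gauss(ybar, sd_post) for _ in range(4000)]

trace = pd_trace(chain_a, chain_b, n_obs)
p_d = statistics.fmean(trace)
# Naive standard error; only appropriate because these draws are independent.
mc_se = statistics.stdev(trace) / math.sqrt(len(trace))
print(round(p_d, 3), round(mc_se, 3))
```

Each term in the trace is a KL divergence, so it cannot be negative, and because pD is an ordinary average over iterations its Monte Carlo error can be assessed like that of any other monitored quantity. With real, autocorrelated chains an effective-sample-size correction would replace the naive standard error.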

2. Aki Vehtari says:

Check out the WAIC by Sumio Watanabe (2010)
http://jmlr.csail.mit.edu/papers/v11/watanabe10a…. (and references therein). WAIC is the Bayesian criterion that DIC wanted to be. The focus is on the predictive distribution, avoiding the many problems of the DIC. Interestingly, Sylvia Richardson proposed this version in the discussion of the original DIC paper and Celeux et al (2006) included this in their DIC for missing data models comparison, but there was no theoretical justification for it until Watanabe's recent work. WAIC can be computed with similar effort as DIC. Convergence when using MCMC may be a problem, although I haven't noticed that when using DIC, and nowadays we mostly use Laplace and expectation propagation approximations which avoid the MCMC convergence problems. WAIC is not yet much used and thus there are not yet many empirical results about its performance, but I'm quite positive that it will get more popular.
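To show what "similar effort as DIC" looks like, here is a minimal sketch of WAIC computed from posterior draws. It uses the common form: log pointwise predictive density minus the summed posterior variance of the pointwise log likelihood, on the deviance scale. The toy model (y_i ~ Normal(theta, 1), flat prior), the helper names, and the data are assumptions of the sketch, not anything from the comment or from Watanabe's paper.

```python
import math
import random
import statistics


def log_lik(y_i, theta):
    """log p(y_i | theta) for y_i ~ Normal(theta, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y_i - theta) ** 2


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(statistics.fmean(math.exp(v - m) for v in values))


def waic(y, draws):
    """Return (WAIC, lppd, p_waic) on the deviance scale."""
    lppd = 0.0
    p_waic = 0.0
    for y_i in y:
        ll = [log_lik(y_i, t) for t in draws]
        lppd += log_mean_exp(ll)           # log of the posterior-averaged density
        p_waic += statistics.variance(ll)  # posterior variance of the log density
    return -2 * (lppd - p_waic), lppd, p_waic


def posterior_draws(y, n_draws, rng):
    """Independent draws from the exact posterior Normal(ybar, 1/n) (flat prior assumed)."""
    return [rng.gauss(statistics.fmean(y), 1 / math.sqrt(len(y))) for _ in range(n_draws)]


rng = random.Random(3)
y = [rng.gauss(0.3, 1.0) for _ in range(20)]  # hypothetical simulated data
print([round(v, 3) for v in waic(y, posterior_draws(y, 2000, rng))])
```

Unlike DIC, nothing here depends on a point estimate of theta. Everything is an average over draws of the pointwise predictive density, which is the sense in which the focus is on the predictive distribution.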

3. K? O'Rourke says:

I recall finding some interesting stuff on David Draper's web site.

But not my area of expertise and I can't remember exactly which talk of his I looked at.

K?