Aki and I write:

The Watanabe-Akaike information criterion (WAIC) and cross-validation are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model. WAIC is based on the series expansion of leave-one-out cross-validation (LOO), and asymptotically they are equal. With finite data, WAIC and cross-validation address different predictive questions and thus it is useful to be able to compute both. WAIC and an importance-sampling approximated LOO can be estimated directly using the log-likelihood evaluated at the posterior simulations of the parameter values. We show how to compute WAIC, IS-LOO, K-fold cross-validation, and related diagnostic quantities in the Bayesian inference package Stan as called from R.
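To make that last sentence concrete, here is a minimal sketch of computing WAIC from an S x n matrix of pointwise log-likelihood values evaluated at S posterior simulations (Python rather than the paper's R code, and all names are my own, not the paper's):

```python
import math

def waic(log_lik):
    """Compute WAIC from an S x n matrix of pointwise log-likelihood
    values (S posterior draws, n data points). Illustrative sketch of
    the standard formulas:
      lppd   = sum_i log( mean_s exp(log_lik[s][i]) )
      p_waic = sum_i var_s( log_lik[s][i] )   # sample variance over draws
      waic   = -2 * (lppd - p_waic)           # deviance scale, as in BDA3
    """
    S = len(log_lik)
    n = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n):
        col = [log_lik[s][i] for s in range(S)]
        # log-mean-exp over draws, stabilized by the column maximum
        m = max(col)
        lppd += m + math.log(sum(math.exp(v - m) for v in col) / S)
        mean = sum(col) / S
        p_waic += sum((v - mean) ** 2 for v in col) / (S - 1)
    return -2.0 * (lppd - p_waic)
```

Note the -2 (deviance) scale is a convention choice; Watanabe's own definition is on a different scale.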

This is important, I think. One reason the deviance information criterion (DIC) has been so popular is its implementation in Bugs. We think WAIC and cross-validation make more sense than DIC, especially from a Bayesian perspective in which inference comes as a posterior distribution rather than a point estimate, and we hope that this and future Stan implementations will allow users to become more familiar with these tools.

In addition to the implementation, the paper discusses some challenges of interpretation with hierarchical models, demonstrating with the canonical 8 schools example.

Interesting. In a nutshell, WAIC is better suited to reflect estimates based on a new observation from existing data and LOO is better suited to reflect estimates based on a new observation from new data? Based on BDA3 and the 8/2013 paper on understanding predictive models, I’m switching to WAIC. However, based on these results, it seems that IS-LOO is more appropriate if trying to select the best model to predict the future based on past results, or is that incorrect if the assumption is that the future follows the past?

Also, if this gets rolled into RStan, there are a few tweaks for optimization that could be made. Most importantly, it would probably pay to make the log-likelihood available as a method compatible with the 'logLik' class, to allow further integration with existing R packages and functions.

Thanks.

> In a nutshell, WAIC is better suited to reflect estimates based on a new observation from existing data and LOO is better suited to reflect estimates based on a new observation from new data?

You might have the right idea, but what you write is wrong.

For hierarchical models, WAIC is useful for estimating predictive performance for a new observation from an existing group (group j with parameter alpha_j), and cross-validation is better suited to reflect estimates based on a new observation from a new group (group J+1 with parameter alpha_{J+1}). If each group has a single observation, which is common, for example, when the group is indexed by continuous covariate values (alpha|x), then LOO is sensible. If each group has several observations, then leave-one-group-out cross-validation should be used to estimate the performance in a new group. For example, in cognitive brain research we have several observations per trial and several trials per subject, and we want to know whether we can generalize the findings to new subjects, so leave-one-subject-out cross-validation is used. Importance sampling does not usually work well if we want to approximate the effect of leaving several observations out, but as there are usually not many subjects in these studies, we can easily compute each leave-one-subject-out posterior directly.
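To make "leave-one-group-out" concrete, here is a minimal sketch of the mechanics (Python; the simple normal maximum-likelihood "fit" is only a stand-in for actually refitting the Bayesian model without the held-out group, and all names are mine):

```python
import math
from collections import defaultdict

def normal_logpdf(y, mu, sigma):
    # Log density of N(mu, sigma^2) evaluated at y
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def fit_normal(ys):
    # Stand-in "refit": mean and sample standard deviation
    n = len(ys)
    mu = sum(ys) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in ys) / (n - 1))
    return mu, sigma

def leave_one_group_out(y, group):
    """Sum of held-out log predictive densities: for each group,
    refit on all other groups and score that group's observations."""
    by_group = defaultdict(list)
    for yi, g in zip(y, group):
        by_group[g].append(yi)
    total = 0.0
    for g, held_out in by_group.items():
        train = [yi for yi, gi in zip(y, group) if gi != g]
        mu, sigma = fit_normal(train)  # refit without group g
        total += sum(normal_logpdf(yi, mu, sigma) for yi in held_out)
    return total

# Toy example: two "subjects" with two observations each
elpd_logo = leave_one_group_out([0.0, 1.0, 10.0, 11.0], ["a", "a", "b", "b"])
```

In a real analysis the refit would be a full MCMC run per held-out group, which is feasible when the number of subjects is small, as noted above.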

I forgot to mention that IS-LOO behavior is somewhere between real LOO and WAIC: with a finite number of samples from the full posterior, the samples do not cover the tails well, and thus the result is biased towards the posterior (as can be seen in Fig 2 of our paper).
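For concreteness, a minimal sketch of the raw importance-sampling LOO estimate from the same kind of log-likelihood matrix (Python, names mine); the paper truncates the importance weights, which this sketch deliberately omits, so this raw version shows the tail instability described above:

```python
import math

def is_loo(log_lik):
    """Raw IS-LOO from an S x n log-likelihood matrix.

    With raw weights w_i^s = 1 / p(y_i | theta^s), the self-normalized
    importance-sampling estimate of each pointwise elpd collapses to the
    log of the harmonic mean of the likelihoods over posterior draws:
      elpd_i = -log( mean_s exp(-log_lik[s][i]) )
    """
    S = len(log_lik)
    n = len(log_lik[0])
    elpd = 0.0
    for i in range(n):
        neg = [-log_lik[s][i] for s in range(S)]
        # stabilized log-mean-exp of the negated column
        m = max(neg)
        elpd -= m + math.log(sum(math.exp(v - m) for v in neg) / S)
    return elpd
```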

Thank you, although now forgive me in that I am more confused. In a very simple example, I have 10 observations and I want to compare models fitting them to two or three different distributional families (e.g. lognormal vs. gamma vs. Weibull). The 10 observations are annual observations of the same random variable. Given a Bayesian model (built in Stan, let’s say), which statistic should I be using if I want to use the model to predict the value of the random variable in year 11?

In this case, the parameters of the distribution are hierarchical (be they mu/sigma of the lognormal, alpha/theta of the gamma, or even hyperparameters on those parameters) but there aren’t separate observations of different classes, just the annual observations of the variables themselves.

In this case, which statistic would be preferable? Or, since there isn’t a complicated hierarchical structure with subsetted observations, does it not matter?

In this example x is the number of the year. x is deterministic and your goal is to predict the value of the random variable in year 11 before observing it.

LOO assumes both x and y to be random (or future (x,y) pair to be unknown), so it is not exactly what you want but if your model is simple it is likely that it is adequate for the model comparison.

WAIC assumes x to be fixed (e.g. if you would want to predict new observations in years 1 to 10), so it is not exactly what you want but if your model is simple it is likely that it is adequate for the model comparison.

A sequential approach would be closer to your prediction task: predict each observed year given the previous years.

The sequential approach wastes information, and there are ways to improve on that, but they are a bit more complex to explain.
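A minimal sketch of that one-step-ahead scheme (Python; the normal maximum-likelihood fit is a placeholder for refitting the actual model on the years seen so far, and all names are mine):

```python
import math

def sequential_log_score(y, warmup=2):
    """One-step-ahead (prequential) log score: for each year t, fit a
    simple normal model to the years before t and score the observed
    y[t]. The normal MLE fit is a stand-in for refitting the Bayesian
    model on the data seen so far; the first `warmup` years are skipped
    so the fit is defined (and assumes they are not all identical)."""
    total = 0.0
    for t in range(warmup, len(y)):
        past = y[:t]
        mu = sum(past) / t
        sigma = math.sqrt(sum((v - mu) ** 2 for v in past) / (t - 1))
        total += (-0.5 * math.log(2 * math.pi) - math.log(sigma)
                  - 0.5 * ((y[t] - mu) / sigma) ** 2)
    return total
```

The "wasted information" mentioned above is visible here: early years are only ever used for fitting, never scored.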

If you are comparing (generalised) linear models with a number of observations much larger than the number of covariates, then it is likely that it does not matter whether you use LOO or WAIC.

“The lpd of observed data y is an underestimate of the elpd for future data (1).”

Shouldn’t it be an overestimate? As I understand it seems to say that the new dataset is expected to fit better than the data we already used.

D’oh! OK, I just fixed it and reposted. Thanks for noticing.

It’s too easy to make this kind of mistake, because we are sometimes using the log density and sometimes the deviance…

Why Akaike decided to multiply the log-likelihood by -2 is probably an artifact of linear regression and the chi-square value, but for better or for worse, most information criteria that have followed (BIC, HQIC, QAIC, TIC, CAIC, CAICF, ICOMP, DIC, WAIC) are on the same scale.

WAIC as defined by Watanabe is not on the same scale! Watanabe multiplies by -1/n.

You’re correct; he leaves it on that scale in his paper. What can I say, I’m unfairly biased by BDA3 8-)

For what it’s worth, I think nesting colVars (the seq_along version) inside waic (in the R-code appendix) is neater and more elegant, and dispenses with the need for footnote 2.

I have a question:

You mentioned survival data; can these criteria be used under censoring?

Yes they can. I use cross-validation for GP survival models, but WAIC could be used, too.

I put some code here to compute WAIC from Stan outputs: https://github.com/JohannesBuchner/ic4stan

Contributions are welcome.