Practical Bayesian model evaluation in Stan and rstanarm using leave-one-out cross-validation

Our (Aki, Andrew and Jonah) paper “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC” was recently published in Statistics and Computing. In the paper we show:

  • why it’s better to use LOO instead of WAIC for model evaluation
  • how to compute LOO quickly and reliably using the full posterior sample
  • how Pareto smoothed importance sampling (PSIS) reduces the variance of the LOO estimate
  • how Pareto shape diagnostics can be used to indicate when PSIS-LOO fails
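To make the second bullet concrete, here is a minimal sketch of plain importance-sampling LOO computed from the full posterior sample. It deliberately omits the Pareto smoothing step, so it is not PSIS-LOO itself, only the raw estimator that PSIS stabilizes. The function names and the `log_lik[s][i]` layout (draw `s`, observation `i`) are assumptions made for this illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def is_loo_elpd(log_lik):
    """Raw importance-sampling LOO (no Pareto smoothing).

    log_lik[s][i] is log p(y_i | theta_s) for posterior draw s and
    observation i. Returns the pointwise estimates of log p(y_i | y_-i).
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    pointwise = []
    for i in range(n_obs):
        # Raw importance ratios are r_s = 1 / p(y_i | theta_s),
        # so log r_s = -log_lik[s][i].
        log_ratios = [-log_lik[s][i] for s in range(n_draws)]
        # p(y_i | y_-i) is approximated by 1 / mean_s(r_s).
        pointwise.append(math.log(n_draws) - log_sum_exp(log_ratios))
    return pointwise

# Two draws, one observation, with likelihoods 0.5 and 0.25.
elpd_i = is_loo_elpd([[math.log(0.5)], [math.log(0.25)]])
elpd_loo = sum(elpd_i)
```

The raw ratios can have very heavy tails, which is why the estimate is smoothed and diagnosed with the Pareto shape parameter rather than used as is.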

PSIS-LOO makes automated LOO practical in rstanarm, which provides a flexible way to use pre-compiled Stan regression models. Estimation by sampling obtains draws from the full posterior, and these same draws are used to compute the PSIS-LOO estimate at negligible additional computational cost. PSIS-LOO can fail, but possible failure is reliably detected by the Pareto shape diagnostics. If there are high estimated Pareto shape values, a summary of them is reported to the user with suggestions on what to do next. In the initial modeling phase the user can ignore the warnings (and still get more reliable results than with WAIC or DIC).

If only a few estimated Pareto shape values are high, rstanarm offers to rerun the inference only for the problematic leave-one-out folds (in the paper we named this approach PSIS-LOO+). If there are many high values, rstanarm offers to run k-fold-CV. This way a fast predictive performance estimate is always provided, and the user can decide how much additional computation time to spend on more accurate results. In the future we will add other utility and cost functions, such as explained variance, MAE and classification accuracy, to make the predictive performance easier to interpret.
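The decision logic described above can be sketched roughly as follows. This is only an illustration, not rstanarm's actual code: the function name, the returned labels, the threshold `k_threshold=0.7` and the cutoff `max_refits=10` are all assumptions chosen for the example, since the post does not state exact values.

```python
def suggest_next_step(pareto_k, k_threshold=0.7, max_refits=10):
    """Suggest what to do given the estimated Pareto shape values.

    pareto_k: one estimated shape value per observation.
    k_threshold and max_refits are illustrative assumptions.
    Returns (suggestion, indices_of_flagged_observations).
    """
    flagged = [i for i, k in enumerate(pareto_k) if k > k_threshold]
    if not flagged:
        # No high values: the PSIS-LOO estimate can be used as is.
        return ("use_psis_loo", flagged)
    if len(flagged) <= max_refits:
        # A few high values: refit only the problematic folds (PSIS-LOO+).
        return ("refit_flagged_folds", flagged)
    # Many high values: k-fold-CV costs less than many single refits.
    return ("run_kfold_cv", flagged)

print(suggest_next_step([0.1, 0.9, 0.2]))
```

Either way the fast estimate is available immediately; the suggestion only tells the user where extra computation would buy more accuracy.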

The above approach can also be used with Stan interfaces other than rstanarm, although the user then needs to add a few lines to the usual Stan code. After this, PSIS-LOO and the diagnostics are easily computed using the available packages for R, Python, and Matlab.
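The extra lines typically store the log-likelihood of each observation for each posterior draw. As a hedged illustration of that quantity (written in Python rather than Stan), the sketch below builds the draws-by-observations matrix for a simple normal model. The model, the function names and the `(mu, sigma)` layout of the draws are assumptions made for this example.

```python
import math

def normal_log_pdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def pointwise_log_lik(y, draws):
    """Build log_lik[s][i] = log p(y_i | mu_s, sigma_s).

    y: list of observations.
    draws: list of (mu, sigma) posterior draws (hypothetical layout).
    """
    return [[normal_log_pdf(y_i, mu, sigma) for y_i in y]
            for (mu, sigma) in draws]

# Three observations and two hypothetical posterior draws.
log_lik = pointwise_log_lik([0.0, 1.0, -0.5], [(0.0, 1.0), (0.2, 1.1)])
```

A matrix of this shape is what the LOO packages take as input.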


  1. Mike Lawrence says:

    Neat! I understand how one can use LOO for model comparison, but the paper notes that it can be useful as a posterior predictive check as well. It would be great to see an example of this latter usage. Would you be looking at the distribution of pointwise LOO values? Or maybe adding code in generated quantities that samples new observations given the model and creates a log_lik2 for these simulated samples, permitting you to loo::compare(loo::loo(log_lik), loo::loo(log_lik2))?

  2. Gmcirco says:

    Interesting. Richard McElreath is a big proponent of WAIC in his book “Statistical Rethinking”. I’m curious to see how these compare.

    • Aki Vehtari says:

      I was also a big proponent of WAIC before doing the research which led to this paper. WAIC is a significant improvement over DIC, and Watanabe’s papers are important for Bayesian LOO, but PSIS-LOO is more reliable and easier to diagnose for potential failure. See also the results in Vehtari et al. (2016) “Bayesian Leave-One-Out Cross-Validation Approximations for Gaussian Latent Variable Models”.

  3. Dear Professor Aki Vehtari, Pareto Smoothed Importance Sampling Cross-Validation (PSISCV) is a very interesting method for approximating Bayesian cross-validation (BCV). Although WAIC is asymptotically equivalent to BCV, it is not a tool for approximating BCV but an estimator of the generalization error (GE). I would recommend comparing cross-validation methods and information criteria as statistical estimators of the generalization error. A simple experiment, shown on my web page, gives a case where E|PSISCV-GE| > E|WAIC-GE|. I have heard from statisticians that any estimator should be studied in terms of its bias and variance.
