loo R package 10 years!

This post is by Aki.

The loo R package has its 10 year anniversary today! Jonah Gabry made the first loo package release (v0.1.0) 10 years ago on June 26.

  • loo has been downloaded more than 4 million times from the RStudio CRAN mirror (there are more than 80 mirrors, but the RStudio mirror is likely to be one of the most popular ones)
  • R-universe counts 304 other R packages using loo
  • based on R-universe scores, loo is in top 100 among 26,819 packages

Here’s a blog post about the background, advances during the years, and a bit about the future.

Stan

When I (Aki) got involved with Stan project, there was a need for a model selection criterion that would be fast, robust, and easy to compute. I had used cross-validation (CV) a lot, but it did require some expertise to know which computational approach to use in which case, and how to diagnose the reliability. Brute-force leave-one-out cross-validation with repeated model inference is slow. Gelfand et al. (1992) had proposed importance sampling leave-one-out (LOO) CV, but 1) that estimate may have infinite variance (e.g. Peruggia, 1997), 2) due to skewed distribution the estimate would be over-optimistic with high probability, and 3) there was no good diagnostic for reliability.

DIC

Andrew had been using DIC (Spiegelhalter et al., 2002), which was simple and fast, but 1) it assumes the predictions are made using posterior mean of the parameters (and not by integrating over the posterior), 2) it’s not invariant to parameter transformations, and 3) it was known to fail for multimodal posteriors and flexible models.

WAIC

The Widely-applicable information criterion (WAIC; Watanabe, 2010) seemed promising at first, 1) being simple and fast, 2) assuming predictions using posterior predictive distribution, 3) being invariant to parameter transformations, and 4) it works with multimodal and singular posteriors. Alas, more testing of WAIC revealed it also fails with more flexible models without a warning.

Diagnostics

It was now clear that there was a pressing need for a diagnostic for the model comparison criterion computation.

I started to investigate a diagnostic for WAIC. WAIC can be presented as a truncated Taylor series approximation. In difficult cases, the higher order functional cumulant terms are not small. Can we diagnose whether MCMC estimates of higher order terms have finite variance?

Koopman et al. (2009) had proposed a diagnostic for importance sampling: 1) fit generalized Pareto distribution (GPD) to the tail of the importance ratio distribution, 2) use shape parameter k to diagnose whether the variance is finite (k <= 1/2), and 3) reject the estimate if the variance is not finite. The problem was that there was no advice what to do if the estimate is rejected and this seemed to happen a lot with WAIC and importance sampling LOO. Furthermore, finite but high variance is also problematic.

I then realized that if we assume the tail of the ratio distribution is close to a generalized Pareto distribution, and we fit the generalized Pareto distribution, we can use that as a model for the tail. To make the computation practical, this model is used to replace the raw ratios with modelled (smoothed) ratios, which are then used in further computations. In theory, modeling should reduce variability (and if the model is good, the bias can be negligible) and this was observed in practice; smoothed importance sampling LOO performed better than WAIC (or had similar performance for simple boring models).

Jonah implemented the method in loo package while we were making the experiments and writing the paper, and loo v0.1.0 was released on Github on June 26 before the papers were public.

Main papers

As theory and experiments did take a lot of pages, we decided to split the paper into “Very Good Importance Sampling” (arXived July 9, the name inspired by WAIC) and “Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models” (arXived July 16). Eventually the paper names were changed to “Pareto smoothed importance sampling” (PSIS) and “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC”. The latter did get published in Statistics and Computing in less than 14 months, while the PSIS paper (final version 58 pages) took more than 9 years to get published eventually in JMLR.

While the PSIS algorithm stayed practically the same the whole time, the theoretical justifications and diagnostics did improve over the years. Dan Simpson and Yuling Yao provided help with the theory and joined as co-authors.

The PSIS estimator always has finite variance with a cost of some bias. If Pareto-k<0.7, the bias and variance are small (see details in the PSIS paper; Vehtari et al., 2024). Although 9 years felt too long time and we had a nasty case of reviewer 2, I’m happy about the improved theoretical understanding of the pre-asymptotic behaviour.

Outliers and influential observations

The loo package did gain additional practical advice on how examining the number of parameters, the effective number of parameters (p_loo), and the number of observations can provide information on whether the high Pareto-k values are likely due to a) badly misspecified model with outliers or b) well specified but flexible model.

As we get Pareto-k diagnostic for each LOO-fold, that can be used as identifying influential or problematic observations, but also to focus the additional computation only for the specific LOO-folds.

Additional computation for high Pareto-k cases

The simplest approach is to just re-run MCMC for the folds with high Pareto-k. rstanarm and brms packages know enough about the data and model, so that they can provide automated approach for this.

To speed-up computation, we developed moment matching approach (Paananen et al., 2021) that can adjust the posterior draws faster than what re-running MCMC would take, to better match the proposal and target. The needed functionality and a vignette was added to loo. brms makes it easy to use moment matching LOO with one option.

The loo package also has a vignette and support for K-fold-CV which is robust and relatively fast if, e.g. K=10.

Large data

Even though PSIS-LOO is generally fast, with big enough data it can be slow. We developed a sub-sampling LOO approach (Magnusson et al., 2019, 2022) with a vignette in the loo package. In the sub-sampling approach we use a faster but biased estimate for all LOO-folds and a slower but (almost) unbiased estimate for a subset of LOO-folds. The biggest benefits of this approach can be seen in the projpred package (github version), where PSIS-LOO given the full data search path is fast, but doing the search for each LOO-fold is slow. Using subsampling and the difference-estimator we can get more than a 10-fold speedup as demonstrated in one of the case studies.

Predictive checking

Cross-validation can be used to improve predictive checking. Posterior predictive checking can fail with flexible models as the same data are used for fitting and checking. We added useful functions to loo package to support LOO predictive checks and LOO probability integral transformation (LOO-PIT) calibration checks to bayesplot (Gabry et al., 2019; Säilynoja et al., 2022, 2025).

Beyond LOO-CV

The package is named loo as it started as an implementation of the PSIS-LOO algorithm (and we had only US and Finnish people thinking about the name). But it was natural to extend it beyond LOO.

Leave-one-group-out (LOGO) is useful if we want to predict for new groups. LOGO is challenging for importance sampling as the posterior for group specific parameters changes a lot if we leave out all the group specific observations. We can use K-fold-CV (loo vignette) or integrate out the group specific parameters (Roaches case study).

While LOO is valid for analysing the observation model in time series models, we may sometimes prefer leave-future-out cross-validation (LFO-CV), as it has smaller bias if the prediction task is in the future. The loo package has a vignette demonstrating how PSIS and occasional re-fits can be used for fast LFO-CV (Bürkner, Gabry and Vehtari, 2020).

The downside of LFO-CV is that it uses only a small part of the data for fitting the model and making predictions for the future, and thus has high variance. If the focus is model comparison, it is better to use K-fold-CV or hv-block-CV with joint log score as these have smaller variance, which leads to higher model selection efficiency for time series models (Cooper et al., 2025a).

Leaving out more than one observation and using joint log score improves the model selection efficiency also in the case of spatial models (Cooper et al., 2025b).

Often temporal and spatial models are presented as non-factorized normal (or t) models, in which case we need to use properties of multivariate normal (or t) to compute LOO (Bürkner, Gabry and Vehtari, 2020). The loo package has a vignette and brms uses this approach for non-factorized models.

Comparing models

Instead of just using point estimates of the predictive performance, we can also quantify the related uncertainty, which is especially useful when doing model comparison. From the beginning, the loo package was reporting the log score (elpd) difference and the related standard error based on the recommendation by Vehtari and Lampinen (2002). We (Sivula et al., 2025) have investigated in more detail the conditions when the standard error and related normal approximation are accurate. The normal approximation can be used to estimate probability that model B has better predictive performance than model A.

Model averaging

The LOO computed log score (elpd_loo) has a connection to information criteria (see, e.g. Vehtari and Ojanen, 2012). Inspired by information criteria based model weights and stacking, we developed Bayesian stacking (Yao et al., 2018), which is also implemented in the loo package. Furthermore, we extended Bayesian stacking to Bayesian hierarchical stacking (Yao et al., 2022a) and stacking for non-mixing computations (Yao et al., 2022b). There is a loo package vignette for stacking.

Other scores and metrics

The loo package supports also (S)CRPS, MAE, RMSE, MSE, ACC, and BACC, although not as nicely as log score (see below for future plans).

ArviZ

loo is an R package, but PyStan and PyMC users needed fast cross-validation, too. The Python and Julia ArviZ libraries were subsequently developed to include most of the methods that are in the loo R package,

Future

Because PSIS is useful in many other cases beyond just LOO, PSIS functionality has now been implemented in the posterior package, so that packages that would like to use just the Pareto smoothing and diagnostics do not need to depend on the loo package. We’re also in the progress of changing loo to use more of these functions from posterior.

We’re also refactoring loo to improve modularity and usability (as part of Google Summer of Code) focusing on:

  • easier use of different scores and metrics (e.g. RMSE, R^2, CRPS)
  • easier use of different cross-validation variants
  • easier use of joint log score
  • more diagnostic information provided to the user, e.g. for the uncertainty normal approximation

The ArviZ team is also working to update ArviZ.

CV-FAQ

The loo package users have been asking many questions, and eventually I wrote CV-FAQ answering the most frequently asked questions (and setting straight some common misconceptions).

Thanks

Very big thanks to Jonah Gabry for writing the loo package (see also other contributors) and to the ArviZ team for implementing the methods in Python and Julia! All the methods developed in the papers would not be widely used without these packages (The Practical LOO-CV paper has been cited more than 5900 times). While the methods in the papers made certain things possible, it is the software that makes using the methods easy!

5 thoughts on “loo R package 10 years!

  1. Aki;

    This is a great story. I wanted to just add one thing. As you say, I’d been working with DIC but had been unhappy with its theoretical justification, and I’d been working on coming up with a fully-Bayesian replacement. You told me about WAIC and we were trying to understand it on some examples such as the 8 schools. We also had a time constraint because we were finishing the third edition of Bayesian Data Analysis and we wanted to write something there on model evaluation that we could stand behind. We didn’t want to just refer to a bunch of methods without giving our own derivation and recommendation.

    One approach you had to understanding WAIC was to frame it as an approximation to LOO. As I recall it, your idea was that, instead of considering WAIC directly as an estimate of out-of-sample predictive error, you considered WAIC as an estimate of LOO, which itself estimates the out-of-sample predictive error. And then you pointed out that, in this case, why not just work with LOO directly. And this led us to certain interesting Bayesian ideas, for example that any of these measures, when considered Bayesianly, has a posterior distribution (e.g., LOO error depends on the unknown parameters theta in the model, thus you can consider LOO as a function of theta, which induces a posterior distribution for LOO).

    While doing all this I realized that the consensus perspective (which I had held too) on “information criteria” was kind of a dead end. The idea of AIC and its various followups was that you’d start with within-sample prediction error and then correct for overfitting. The result would be an “information criterion” which could be used in model comparison. Based on our conversations, we realized that the problem could be more cleanly framed as simply having the goal of estimating out-of-sample prediction error. In retrospect this seems obvious, but if you go back to the literature of DIC etc., it really seems that there was this idea (which I also had at the beginning of this project) that there was some sort of magic expression, an “information criterion,” that would solve the problem. Once we stepped back and think of this as a statistical problem to estimating out-of-sample prediction error, we could move away from a Dirac-like approach of tweaking the formulas and toward a more brute-force approach which has culminated in LOO and its variants.

    Understanding the problem in terms of predictive accuracy for individual future data points also helped me understand what had been a puzzling aspect of WAIC, which is that it required a factorization of the predictive model for the data, unlike AIC and DIC which only used the combined likelihood. I decided that the necessary reliance on the factorization was a feature, not a bug, and indeed the need for this factorization becomes clear when we consider generalizations of LOO for time series, spatial, and multilevel structures.

    We wrote this up in a paper with Jessica Hwang, Understanding predictive information criteria for Bayesian models, published in 2013.

    Regarding the challenges of evaluating and comparing models using predictive accuracy on individual data points, I recommend my 2015 paper with Wei Wang, Difficulty of selecting among multilevel models using predictive accuracy.

    It’s interesting that most of your post is about computation, and without PSIS-LOO a lot of this would not be possible. One of the virtues of AIC and DIC is that they are easy to compute given the posterior simulations that are already available. Brute-force LOO is slow and PSIS-LOO is fast and reliable. Computation is important!

  2. Thank you for a very interesting topic. The fact that the LOO package is used around the world is a remarkable achievement. I would like to add a few points that I consider important from an applied perspective:

    (1) When data is independent, the average of the cross-validation error and the prediction error are the same. However, their behavior as random variables cannot be derived from definitions alone. The theory behind information criteria shows that the cross-validation error and the prediction error are asymptotically negatively correlated as random variables (Note 1). This asymptotic negative correlation is a point that requires significant attention in practice. From the viewpoint of estimating prediction errors, the cross-validation error tends to have low bias but higher variance compared to information criteria.

    (2) In regression problems, using the cross-validation error requires both {(Yi|Xi)} and {Xi} to be independent. In contrast, the information criteria can be used if {(Yi|Xi)} is independent. For example, when {Xi} is fixed, the information criteria can be used, but the cross-validation error cannot. Especially when a leverage sample point is included, the values of the cross-validation error and the information criteria may differ significantly.

    (3) PSIS (Pareto-smoothed importance sampling) is a wonderful method for estimating the cross-validation error with low computational cost, however, it needs an assumptiton that the posterior sample errors follow a Pareto distribution.

    For these reasons, I propose that both the cross-validation error and the information criteria are important in applied statistics. Thank you.

    (Note 1) It is easy to check this negative correlation by computer simulation. For the proof, see Theorem 2 in Watanabe, Journal of Machine Learning Research, Vol. 11, pp. 3571–3594, 2010.

    • (1) “the cross-validation error tends to have low bias but higher variance compared to information criteria.”

      The experiments (e.g. in the Practical… paper) show that when we include the variance from computation, WAIC has higher variance except for very simple boring models where WAIC nor cross-validation is not even needed.

      It’s also good to not be too focused on bias. In the blog post, I mention a couple papers that use approaches that have higher bias, but the so much smaller variance that the model selection efficiency is improved.

      (1) “The theory behind information criteria shows that the cross-validation error and the prediction error are asymptotically negatively correlated as random variables (Note 1). This asymptotic negative correlation is a point that requires significant attention in practice.”

      However, we can get useful uncertainty quantification which is well calibrated (see Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison).

      “(Note 1) It is easy to check this negative correlation by computer simulation. ”

      The simulation results in the above-mentioned paper show that the negative correlation appears more likely in a simple case (e.g. linear normal model with known residual variance) but not in general at least pre-asymptotically which is what is relevant for most modellers.

      (2) “In regression problems, using the cross-validation error requires both {(Yi|Xi)} and {Xi} to be independent. In contrast, the information criteria can be used if {(Yi|Xi)} is independent. For example, when {Xi} is fixed, the information criteria can be used, but the cross-validation error cannot.”

      I disagree. Furthermore, independence is not needed even for {(Yi|Xi)}, and exchangeability is sufficient. I wish I could edit one of our papers which has one sentence which may imply that I think something else.

      (3) “PSIS (Pareto-smoothed importance sampling) is a wonderful method for estimating the cross-validation error with low computational cost, however, it needs an assumption that the posterior sample errors follow a Pareto distribution.”

      To be more accurate, the generalized Pareto assumption is for the distribution of the density ratios. We have demonstrated that there can be quite big discrepancy from this before Pareto-k-diagnostic and Pareto-smoothing fail, and these cases are such that WAIC would fail, too.

      “For these reasons, I propose that both the cross-validation error and the information criteria are important in applied statistics. Thank you.”

      Your theoretical work on WAIC and Bayesian cross-validation has had a big influence on my work and has been extremely useful for me. Thank you!

      • I would like to thank you very much for your answers.

        (1) In the linear regression problem from X to Y, when the dimension of X is close to the sample size, I think that the variance of the cross-validation error increases, and the bias of the information criterion becomes larger.

        (2) As you mentioned, it is possible to generalize independence to exchangeability. However, the situation differs depending on whether {Xi} is fixed or exchangeable. When there is an extreme leverage point, the cross-validation error tends to increase, while the information criterion tends to decrease. This is because a leverage point contains much more information. Such differences in characteristics indicate that cross-validation error and information criteria each have distinct significance in applications.

        (3) Your development of a diagnostic method to ensure the reliability of cross-validation error is a highly important research achievement. As you pointed out, in cases such as (1) and (2) above, statistical inference itself is inherently difficult, so caution is warranted.

        Thank you.

Leave a Reply

Your email address will not be published. Required fields are marked *