This post is by Aki.

Tuomas Sivula, Måns Magnusson, and I (Aki) have a new preprint paper that analyzes one of the limitations of cross-validation: Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison.

The normal distribution has been used to represent the uncertainty in cross-validation for a single model and in model comparison at least since the 1980s (Breiman et al., 1984, Ch. 11). Surprisingly, there hasn’t been much theoretical analysis of the validity of the normal approximation (or it’s very well hidden).

We (Vehtari and Lampinen, 2002; Vehtari, Gelman, and Gabry, 2017) have also recommended using the normal approximation, and our loo package reports elpd_loo SE and elpd_diff SE, but we have been cautious about making strong claims about their accuracy.
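For concreteness, the arithmetic behind these SEs is simple: the SE of a sum of N pointwise elpd values is sqrt(N) times their standard deviation, and for a model comparison the SE is computed from the paired pointwise differences, not by combining the two models’ individual SEs. A minimal sketch (the pointwise values below are simulated stand-ins, not real LOO output):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical pointwise elpd values for two models; stand-ins for the
# pointwise log predictive densities that real LOO-CV would produce.
elpd_a = rng.normal(-1.0, 0.8, size=n)
elpd_b = elpd_a + rng.normal(0.1, 0.3, size=n)

def elpd_with_se(pointwise):
    """Total elpd and its normal-approximation SE: sqrt(N * var(pointwise))."""
    n = len(pointwise)
    return pointwise.sum(), np.sqrt(n * np.var(pointwise, ddof=1))

# For comparison, the SE comes from the pointwise *differences* (paired).
elpd_diff, diff_se = elpd_with_se(elpd_b - elpd_a)
```

Because the pointwise differences are paired, correlated predictions from the two models shrink the SE relative to combining the individual SEs.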

The starting points for writing this paper were:

- Shao (1993) showed that in the (non-Bayesian) context of model selection with nested linear models, if the true model is included, model selection based on LOO-CV with squared prediction error as the cost function is asymptotically inconsistent. In other words, if we compare models A and B, where B has an additional predictor whose true coefficient is zero, then the difference in predictive performance and the associated uncertainty have similar magnitude asymptotically.
- Bengio and Grandvalet (2004) showed that there is no generally unbiased estimator for the variance used in the normal approximation, and that the variance tends to be underestimated. This is due to the dependency between cross-validation folds, as each observation is used once for testing and K−1 times (where for LOO, K=N) for training (or for conditioning the posterior in the Bayesian approach).
- Bengio and Grandvalet also demonstrated that in the case of small N or bad model misspecification the estimate tends to be worse. Varoquaux et al. (2017) and Varoquaux (2018) provide additional demonstrations that the variance is underestimated when N is small.
- The normal approximation is based on the central limit theorem, but in finite cases the distribution of the individual predictive utilities/losses can be very skewed, which can make the normal approximation badly calibrated. Vehtari and Lampinen (2002) had proposed using the Bayesian bootstrap to take the skewness into account, but they didn’t provide a thorough analysis of whether it actually works.
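The Bayesian bootstrap mentioned in the last point can be sketched in a few lines: instead of resampling observations, draw Dirichlet(1, …, 1) weights over the observations and form the weighted total for each draw, which gives a distribution for the total elpd that reflects the skewness of the pointwise values. The pointwise values here are simulated (a Gumbel distribution, chosen just to make them skewed), purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, draws = 100, 4000
# Hypothetical pointwise elpd differences between two models;
# Gumbel noise is used only to make the distribution deliberately skewed.
pointwise = rng.gumbel(0.05, 0.3, size=n)

# Bayesian bootstrap: each Dirichlet(1,...,1) weight vector gives one
# plausible value of the total, n * (weighted mean of pointwise values).
w = rng.dirichlet(np.ones(n), size=draws)      # shape (draws, n)
bb_total = n * (w @ pointwise)                 # shape (draws,)

lo, hi = np.percentile(bb_total, [2.5, 97.5])  # 95% interval for the total
```

Unlike the normal approximation, the resulting interval can be asymmetric when the pointwise values are skewed; as discussed below, this alone turns out not to fix the calibration problem.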

What we knew before we started writing this paper:

- Shao’s inconsistency result is not that worrying, as asymptotically models A and B are indistinguishable: the posterior of the extra coefficient concentrates on zero, and the predictions from the two models become indistinguishable. Shao’s result, however, hints that if the variance (relative to the difference) doesn’t go to zero, then the central limit theorem is not kicking in as usual, and the distribution is not necessarily asymptotically normal. We wanted to learn more about this.
- Bengio and Grandvalet focused on variance estimators and didn’t consider skewness; moreover, when demonstrating with outliers, they missed looking at the possible finite-case bias and the asymptotic behavior. We wanted to learn more about this.
- We wanted to learn more about what counts as a small N in the case of well-specified models. Comparing the small-N case and the outlier case, the outliers can be seen as dominating the sum, so a case with a small number of outliers behaves like a small-N well-specified case, except that we can also get significant bias. We wanted to learn more about this.
- Vehtari and Lampinen proposed using the Bayesian bootstrap, but in later experiments there didn’t seem to be much benefit compared to the normal approximation. We wanted to learn more about this.

There were many papers discussing predictive performance estimates for single models, but it turned out that the uncertainty in model comparison behaves quite differently.

Thanks to hard work by Tuomas and Måns, we learned that the uncertainty estimates in model comparison can perform badly, namely when:

- the models make very similar predictions,
- the models are misspecified with outliers in the data, and
- the number of observations is small.

We also learned that the problematic skewness of the distribution of the error of the approximation occurs with models that make similar predictions, and it is possible that the skewness does not fade away as N grows. We show that considering the skewness of the sampling distribution is not sufficient to improve the uncertainty estimate, as it has only a weak connection to the skewness of the distribution of the estimator’s error. This explains why the Bayesian bootstrap can’t improve calibration much compared to the normal approximation.

On Twitter someone considered our results pessimistic, as we mention misspecified models and in real life we can assume that none of the models is the true data-generating mechanism. By a misspecified model we mean the opposite of a well-specified model, which doesn’t need to be the true data-generating mechanism, and naturally the amount of misspecification matters. The discussion about well-specified and misspecified models holds for any modeling approach and is not unique to cross-validation. Bengio and Grandvalet used just the term outlier, but we wanted to emphasize that an outlier is not necessarily a property of the data-generating mechanism, but rather something that is not well modeled by a given model.

We are happy that we now know better than ever before when we can trust CV uncertainty estimates. The consequences of the above points are

- The bad calibration when models are very similar makes LOO-CV less useful for separating very small effect sizes from zero effect sizes. When the models make similar predictions, there is not much difference in predictive performance, and thus for making predictions it doesn’t matter which model we choose; the bad calibration of the uncertainty estimate doesn’t matter, as the possible error is small anyway. Separating very small effect sizes from zero effect sizes is a very difficult problem in any case, and whatever approach is used probably needs very well-specified, well-identifiable models (e.g., posterior probabilities of models also suffer from overconfidence) and large N.
- Model misspecification in model comparison should be avoided by proper model checking and expansion before using LOO-CV. But this is something we should do anyway (and posterior probabilities of models also suffer from overconfidence in the case of model misspecification).
- Small differences in predictive performance cannot be reliably detected by LOO-CV if the number of observations is small. What is small? We write in the paper “small data (say less than 100 observations)”, but of course that is not a change point in the behavior; the calibration improves gradually as N gets larger.

Cross-validation is often advocated for the M-open case, where we assume that none of the compared models represents the true data-generating mechanism. Point 2 doesn’t invalidate the M-open case. If the model misspecification is bad and N is not very big, then the calibration in the comparison gets worse, but cross-validation is still useful for detecting big differences; only when trying to detect small differences do we need well-behaving models. This is true for any modeling approach.

We don’t have the following in the paper, so you can consider this my personal opinion based on what we learned. Based on the paper, we could add to the loo package documentation that:

- If
  - the compared models are well specified,
  - N is not too small (say > 100),
  - and elpd_diff > 4,

  then elpd_diff SE is likely to be a good representation of the related uncertainty.

- If
  - the compared models are well specified,
  - N is not too small (say > 100),
  - elpd_diff < 4,
  - and elpd_diff SE < 4,

  then elpd_diff SE is not a good representation of the related uncertainty, but the error is likely to be small. Stacking can provide additional insight, as it takes into account not only the average difference but also the shape of the predictive distributions, and a combination of models can perform better than a single model.

- If
  - the compared models are not well specified,

  then elpd_diff and the related SE can be useful, but you should improve your models anyway.

- If
  - N is small (say < 100),

  then proceed with caution and think harder, as with any statistical modeling approach in the case of small data (or get more observations). (There can’t be an exact rule for when N is too small to make inference, as sometimes just N=1 can be sufficient to say that what was observed is possible, etc.)

All this is supported by plenty of proofs and experiments (a 24-page article and a 64-page appendix), but naturally we can’t claim that these recommendations are bulletproof, and we would be happy to see counterexamples.

In the paper we intentionally avoid dichotomous testing and focus on what we can say about the error distribution, as it carries more information than a yes/no answer.

**Extra**

We also have another new (much shorter, just 22 pages) paper, Unbiased estimator for the variance of the leave-one-out cross-validation estimator for a Bayesian normal model with fixed variance, showing that although there is no generally unbiased estimator (as shown by Bengio and Grandvalet, 2004), there can be an unbiased estimator for a specific model. Unbiasedness is not the goal in itself, but this paper shows that it may be possible to derive model-specific estimators with better calibration and smaller error than the naive estimator discussed above.

People don’t just do one cross-validation run. They do it and see the predictive skill, go adjust/change the model, do another CV, etc., until they stop seeing improvements. In this way the model is informed by the “left out” data and you get overfitting.

So a lot of these theorems about how it behaves don’t apply to how it’s actually used.

This is a highly relevant point. See my answers (I need to update them to refer to these new papers) to questions 4 (How is cross-validation related to overfitting?) and 5 (How to use cross-validation for model selection?) in the Cross-validation FAQ. tl;dr: overfitting can be negligible, and these theorems are useful for recognizing when the overfitting is negligible.

I just love this line in the post: “We have also another new (much shorter, just 22 pages) paper . . .”

That’s soooo Aki, for the 22-page paper to be the short one!

The most worrying part of this post is that loo isn’t super helpful when the difference in predictive performance between models is small. If the difference between models is large, it seems like most model comparison techniques will reach the same conclusion anyway. Plus, most effect sizes in the social sciences – a focus of this blog – tend to be small. So, should we be pessimistic about loo for model comparisons in social science applications?

Anon:

When differences are large, LOO should give similar answers to WAIC etc. But BIC or so-called Bayesian model averaging (based on integrating over the prior) can give much worse answers. The method that is used can make a difference.

That’s good to hear, but what about the issue of using LOO in social science applications when effect sizes tend to be small? Should social scientists (or researchers that are expecting modest effect sizes) look to other methods?

Anon:

LOO is what it is. There’s a limit to what can be learned from average predictive accuracy, as we discussed in this article, Difficulty of selecting among multilevel models using predictive accuracy. In building models in social science it helps to use subject-matter understanding.

To piggyback on this point:

One important role for subject-matter understanding is knowing what kind of generalization is important for your model. Often LOO is employed to test predictive accuracy for the same type of data produced by the same participants/mechanism.

But especially in social science and psychology, this is not really an interesting or important test. We are often more interested in how well a model can predict data in an entirely new setting, e.g., the same participant in a different set of conditions or even a new participant in a new experiment. After all, especially in social science, it is not often possible to repeat the original data collection process anyway, so why bother trying to “predict” it?

Understanding and modeling the “scope” of generalization that is important for your application depends on your goals and knowledge.

Two relevant papers that tackle this issue:

Navarro, “Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection”: https://link.springer.com/content/pdf/10.1007/s42113-018-0019-z.pdf

Busemeyer & Wang, “Model Comparisons and Model Selections Based on Generalization Criterion Methodology”: https://pdfs.semanticscholar.org/5dfb/143282ac3d19341164368354e535c11b21f6.pdf

PS: To be clear, being able to “predict” the original data (as measured by LOOCV) is still a necessary component of a good model, I just don’t think it is sufficient.

> To be clear, being able to “predict” the original data (as measured by LOOCV) is still a necessary component of a good model, I just don’t think it is sufficient.

I completely agree, as can be seen in my other writings, and as briefly mentioned in this paper and blog post. I discuss the scope issue more in the CV-FAQ and in some of my talks. Navarro’s paper is great and goes even further in discussing the role of thinking.

Thanks for the links! And for another excellent paper!

Just to be clear, my remarks above were directed more at how users of CV don’t often think deeply about what it means or why they are using it. I think your work does a great service by making clear exactly what sorts of things we can do with these methods and where they can trip up.

The Navarro paper is fantastic—very thought-provoking.

As often, it’s difficult to present all the subtleties in an 88-page paper and a blog post.

> The most worrying part of this post is that loo isn’t super helpful when the difference in predictive performance between models is small.

LOO is helpful for telling you that the difference in predictive performance is small. That is not worrying.

> If the difference between models is large, it seems like most model comparison techniques will reach the same conclusion anyways.

If the difference is small, all techniques have problems. If the difference is large, most techniques don’t have problems. What’s interesting is knowing what small or large means for different techniques. From a paper that says something about LOO, you can’t infer how it relates to other techniques. I don’t know which other model comparison techniques you refer to, but I’ll mention a couple: WAIC on good days has the same behavior as described in the paper for LOO, but fails more often than exact LOO or than PSIS-LOO, which has about the same computational complexity as WAIC. DIC/AIC/etc. are worse than WAIC. Posterior probabilities have problems, too (see here and here).

> Plus, most effect sizes in the social sciences – a focus of this blog – tend to be small. So, should we be pessimistic about loo for model comparisons in social science applications?

You should be pessimistic about all model comparison for small effect sizes. If LOO says the difference in predictive performance is small, it means it’s likely that the difference in predictive performance is small; that is, even if you knew that the effect size is non-zero, you couldn’t make any useful predictions or decisions based on those predictions. However, if you don’t care about predictions and you have a simplified data collection, you may encounter situations where you can infer small effect sizes more accurately by looking at the posterior than by looking at the predictive performance, as demonstrated in this case study.

I know I should write more about all this, and maybe someday there will be a book with more on the topic.

Thanks for the comments, they help to see which things to emphasize in writing.

Thank you for this additional information. It provides helpful context for some of the concerns I expressed above.

Not specifically related to LOO-CV, but… “Model misspecification in model comparison should be avoided by proper model checking and expansion before using LOO-CV.” Ultimately this can’t be done, as there are always models compatible with the data that are quite different and make all inference and prediction behave differently from the model you use. All models are always misspecified. We can avoid some of the worst problems, but we’ll never be in a position where model misspecification is no longer a problem.

I guess I should always write “badly misspecified” instead of “misspecified”. I did write “By a misspecified model we mean the opposite of a well-specified model, which doesn’t need to be the true data-generating mechanism, and naturally the amount of misspecification matters.” With well-specified non-true models we can infer and predict useful things, and LOO-CV also works nicely (for the task it’s meant for).