Rob Tibshirani writes:

About 9 years ago I emailed you about our new significance result for the lasso. You wrote about it on your blog. For some reason I never saw that full blog post until now. I do remember the Stanford-Berkeley Seminar in 1994 where I first presented the lasso and you asked that question. Anyway, thanks for admitting that the lasso did turn out to be useful!

That paper, “A significance test for the lasso,” did not turn out to be directly useful, but it did help spawn the post-selection inference area. Whether that area is useful remains to be seen.

Yesterday we released a paper with Stephen Bates and Trevor Hastie that I think will be important, because it concerns a tool data scientists use every day: cross-validation:

“Cross-validation: what does it estimate and how well does it do it?”

We do two things: (a) we establish, for the first time, what exactly CV is estimating, and (b) we show that the SEs from CV are often WAY too small, and we show how to fix them. The fix is a nested CV, and there will be software for this procedure. We also show similar properties to those in (a) for the bootstrap, Cp, AIC, and data splitting. I am super excited!
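To make the “SEs from CV” concrete, here is a minimal numpy sketch (my construction, not the authors’ code or their nested-CV correction) of the naive K-fold CV estimate and the usual standard error that the paper argues is too small. It assumes squared-error loss and ordinary least squares on synthetic data; the naive SE simply treats the n held-out errors as i.i.d.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 100, 5, 10
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# Random K-fold split of the n observations
folds = np.array_split(rng.permutation(n), K)
errors = np.empty(n)  # pointwise squared errors, filled fold by fold
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    # OLS fit on the K-1 training folds
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    errors[test_idx] = (y[test_idx] - X[test_idx] @ coef) ** 2

cv_estimate = errors.mean()
# Naive SE: pretends the n held-out errors are independent draws.
# Bates, Hastie, and Tibshirani argue this is often far too small.
naive_se = errors.std(ddof=1) / np.sqrt(n)
print(cv_estimate, naive_se)
```

The held-out errors are correlated across folds (each pair of errors shares most of its training data), which is the source of the miscalibration the paper addresses.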

I forwarded this to Yuling Yao and Aki Vehtari.

Yuling wrote:

The more cross validation papers I read, the more I want to write a paper with the title “Cross validation is fundamentally unsound,” analogous to O’Hagan’s “Monte Carlo is fundamentally unsound”:

1. (sampling variation) Even with iid data and exact LOO [leave-one-out cross validation], we rely on a pseudo Monte Carlo method, which itself ignores sampling variation.

2. (outcome variation) Even worse than the unsound Monte Carlo: now we have not only sampling variation on x, but also variation on the outcome y, especially if y is binary or discrete.

3. (pointwise model performance) Sometimes we use pointwise cv error to understand local model fit, no matter how large its variance could be. The solution to 1–3 is a better model of cv errors than the sample average. We have already applied hierarchical modeling to stacking, and we should apply it to cv, too.
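Point 2 above is easy to see in a toy example. The following sketch (my own illustration, not Yuling’s) runs exact LOO with 0-1 loss on a binary outcome and no covariates, using a majority-class predictor: each pointwise LOO error is exactly 0 or 1, so the pointwise signal is maximally coarse and the plain sample average inherits all of that outcome variation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
y = rng.integers(0, 2, size=n)  # binary outcome, no covariates

loo_errors = np.empty(n)
for i in range(n):
    rest = np.delete(y, i)                 # leave observation i out
    pred = 1 if rest.mean() > 0.5 else 0   # majority-class "model"
    loo_errors[i] = float(pred != y[i])    # 0-1 loss on the held-out point

# Every pointwise error is exactly 0 or 1: using a single one of these
# to judge local fit, as in point 3, is very noisy.
print(loo_errors.mean())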

4. (experiment design) Cross validation is a controlled experiment, which gives us the privilege of designing which unit(s) receive the treatment (left out) or control (not left out). Leave-one-cell-out is a type of block design. But there is no general guidance linking the rich literature on experimental design to the context of cv.

Just to clarify here: Yuling is saying this from the perspective of someone who thinks about cross validation all the time. His paper, “Using stacking to average Bayesian predictive distributions” (with Aki, Dan Simpson, and me) is based on leave-one-out cross validation. So when he writes, “Cross validation is fundamentally unsound,” he’s not saying that cross validation is a bad idea; he’s saying that it has room for improvement.

With that in mind, Aki wrote:

Monte Carlo is still useful even if you can reduce the estimation error by adding prior information. Cross validation is still useful even if you can reduce the estimation error by adding prior information. I think a better title would be something like “How to beat cross validation by including more information.”

Aki also responded to Rob’s original message, as follows:

Tuomas Sivula, Måns Magnusson, and I [Aki] have recently examined the frequency properties of leave-one-out cross validation in the Bayesian context, and specifically for model comparison, as that brings an additional twist to the behavior of CV.

– Tuomas Sivula, Måns Magnusson, and Aki Vehtari (2020). Uncertainty in Bayesian leave-one-out cross-validation based model comparison.

– Tuomas Sivula, Måns Magnusson, and Aki Vehtari (2020). Unbiased estimator for the variance of the leave-one-out cross-validation estimator for a Bayesian normal model with fixed variance.

Although Bayesian LOO-CV is slightly different, there are certainly the same issues of folds not being independent and the naive SE estimator being biased. However, it seems this is mostly problematic for small n, very similar models, or badly misspecified models (1st paper), and in addition there is a way to reduce the bias (2nd paper). I have shared your [Rob’s] paper with Tuomas and Måns and we will look carefully to see if there is something we should take into account, too.
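For readers unfamiliar with the model-comparison setting Aki mentions, here is a toy numpy sketch (my own construction, not from the Sivula et al. papers) of the naive SE in a LOO comparison of two Gaussian location models: pointwise LOO log densities are computed for each model, differenced, and summarized with the usual sqrt(n) × sd standard error whose calibration the papers examine.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
y = rng.normal(loc=0.3, scale=1.0, size=n)

def loo_logpdf(y, estimate_mu):
    """Pointwise LOO log density under N(mu, 1); mu fit on the held-in data."""
    lp = np.empty(len(y))
    for i in range(len(y)):
        rest = np.delete(y, i)
        # Model A estimates the location; model B fixes mu = 0
        mu = rest.mean() if estimate_mu else 0.0
        lp[i] = -0.5 * np.log(2 * np.pi) - 0.5 * (y[i] - mu) ** 2
    return lp

# Pointwise differences in LOO log predictive density (model A minus model B)
diff = loo_logpdf(y, True) - loo_logpdf(y, False)
elpd_diff = diff.sum()
# Naive SE of the summed difference, treating the n differences as i.i.d.
naive_se = np.sqrt(len(diff)) * diff.std(ddof=1)
print(elpd_diff, naive_se)
```

The pointwise differences are not independent (each shares most of its training data with the others), which is one source of the SE miscalibration for small n or very similar models.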

Lots of this conversation is just whipping over my head but I thought I’d share it with you. I’ve learned so much about cross validation from working with Aki and Yuling, and of course all of statistics has benefited from Rob’s methods.

**P.S.** Zad sends in this picture demonstrating careful contemplation of the leave-one-out cross validation principle. I just hope nobody slams me in the Supplemental Materials section of a paper in the European Journal of Clinical Investigation for using a cat picture on social media. I’ve heard that’s a sign of sloppy science!


I thought the goal of cross-validation was to estimate the performance on the whole data set. Throwing out part of the data seems like a steep price to pay for confidence interval calibration. Is the idea that your interval on the subsampled data might exclude zero and thus be good enough? Isn’t that also what’s going on with Austern and Zhou, Asymptotics of Cross-Validation? I think they’re estimating the expected fold accuracy, not the accuracy on the full data set.

I was also confused by this claim when reading the abstract. After reading the full paper, my understanding of this claim is that it refers to a single train/test split strategy, not to K-fold CV or the nested CV method they describe.