Ryan Socha writes:
Typically in statistical modeling, we validate or test models by checking how well they perform on a holdout dataset. However, conceptually, it seems possible that we could do essentially the same job by instead keeping the training dataset fixed and using one or more “holdout evaluation functions” that differ from the training objective.
Finding such functions is probably more difficult than finding extra data in most cases, but I am curious if there are any works that come to mind when you think of “holdout evaluation functions” as a means of assessing model performance in ways that can’t be Goodharted.
Beyond this, there’s a more general question – to what extent are holdout datasets fungible with holdout evaluation functions? Are there cases where having access to additional data is inherently superior to any alternate way of evaluating a model’s performance on the given data? Or, are there cases where no amount of additional holdout data can detect there’s some flaw in what the model’s doing?
I am particularly curious whether there might be a procedure that can convert from holdout datasets to holdout evaluation functions, or the other way around. Some caution is probably required to make sure that any alternate evaluation functions constructed to replace a holdout dataset do not “secretly contain” copies of the holdout data – but maybe that kind of smuggling is required for any such procedure to be possible in the first place?
A natural case where this kind of thing might be useful: suppose we want a model to generalize to a certain distribution that’s out of distribution from the training data, but no actual instances of that distribution yet exist. In cases like this, it seems like our only option for validating an approach is to make sure that the way we’re evaluating it is well-suited for the high-level properties we expect the new distribution to have. Although this seems to lose some of the flavor of the evaluation functions needing to be holdout functions, so perhaps it is not the best example after all and this is only of theoretical interest.
This is far beyond my math chops, all thoughts or comments appreciated. Feel free to put this and any response you might have on your blog if you think it would be of interest. As always, even just telling me you think this is a bad line of thought would be a welcome reply.
My reply: Over the years we’ve thought a lot about cross validation and external validation. Two key papers are:
[2014] Understanding predictive information criteria for Bayesian models. Statistics and Computing 24, 997–1016. (Andrew Gelman, Jessica Hwang, and Aki Vehtari)
[2017] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27, 1413–1432. (Aki Vehtari, Andrew Gelman, and Jonah Gabry)
In writing these, we followed the general trend in machine learning, also expressed in your question, moving away from looking at “information criteria” as ways to evaluate or compare models and toward a more direct interpretation of leave-one-out cross validation as an estimate of what would be obtained under external validation using new cases.
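For concreteness, the target of that estimate can be written down. What follows is a sketch in the notation of the 2017 paper, where p_t stands for the unknown true distribution of a new observation, y is the observed data, and y_{-i} is the data with case i removed. The first expression is the expected log predictive density for new cases; the second is its leave-one-out estimate.

```latex
\mathrm{elpd} = \sum_{i=1}^{n} \int p_t(\tilde{y}_i)\, \log p(\tilde{y}_i \mid y)\, d\tilde{y}_i,
\qquad
\mathrm{elpd}_{\mathrm{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}),
\qquad
p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta .
```

The held-out case y_i plays the role of the new case in the first expression, which is why leave-one-out can be read as a stand-in for external validation rather than as a penalty term.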
Some interesting issues arise:
1. As noted in your question, predictive performance on new data can depend on the scenario. The further the new data are in predictor space from the training data, the more uncertain and the less accurate you will expect the predictions to be. This suggests, first, that any evaluation of a model’s predictive performance should require some specification of where the evaluation will be done, and second, that cross validation can be studied by looking not just at average predictive performance but also at how the predictive accuracy depends on predictors in the model, as in this graph from our 2017 paper:

2. The choice of where to evaluate predictions, and how to average over these to get a summary measure of predictive accuracy, reminds me a lot of . . . poststratification! When doing this, it’s important to include predictors that capture the differences between the training and test data, and it can make sense to use multilevel modeling to facilitate predictions for new scenarios. Prediction goals can inform both design and analysis. For example, if you’re doing a study now with the goal of making inferences for future effects, then time could be a factor, and it would make sense to spread your study (your training data) across some period of time, which will then give you some leverage to estimate time trends when fitting your model. Realistically, though, inferences might still strongly depend on priors, for example if you have only one week to conduct your experiment but you want to estimate effects for a year going forward.
3. The other thing this reminds me of is the technique we’ve been using a lot lately, of poststratifying a survey on itself. That is, taking the data, fitting a model, and then creating an aggregate inference by averaging the model predictions over a hypothetical new population that’s exactly the same in its predictors as the data used to fit the model. This might seem to be a very circular thing to do, but it can be useful for two reasons:
The first reason that poststratifying a survey on itself is not an empty identity mapping is that the process of fitting the model can be thought of as a smoothing of the data, so MRP where poststrat is on the sample itself can be thought of as a three step procedure: (i) transform the data, (ii) apply a smoother on the transformed space (which is kinda what the multilevel model or Bayesian inference is doing, where the transformed space is the space of parameters and thus step (i) of inference can be seen as an inversion of the assumed data-generating process), (iii) reverse-transform back to data space. Even if steps (i) and (iii) are inverses of each other, you get something from step (ii). Just like how you can flip two pairs of edges on a Rubik’s cube by first twisting to line up the edges in the right place, then applying the operator that does what you want, then reversing your original set of twists.
The second advantage of poststratifying a survey on itself is that you can use this as a diagnostic tool to understand what’s happening in an MRP situation. MRP does two things: it adjusts for different distributions of predictors in sample and population, and it does smoothing for small-area estimation. By poststratifying a survey on itself, you can isolate these two things and just see what the smoothing is doing, then you can poststratify on the population of interest and see the predictive effect of imbalance in predictor distributions between sample and population. We used this technique recently in an example of a survey with measurement error where we were getting results we didn’t understand. The mysterious patterns happened even when we were poststratifying the survey on itself, so we knew that it was not a problem with an unrepresentative sample; it was our Bayesian model that was to blame.
4. A few years ago we discussed the pervasive twoishness of statistics: the way that classical statistical inference, Bayesian statistics, and bootstrapping all have an incoherence in which two models coexist, with no requirement that the two models be part of a single consistent system–and how that’s actually a good thing. Cross validation, or external validation, has the same property: there’s the model used to fit the data, and the assumed population. Mathematically we can write these as p(y|theta,x) and p(x_new).
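Points 3 and 4 can be illustrated together with a small simulation. This is only a sketch under strong assumptions: the cells, counts, and variances are made up, and the normal-normal shrinkage with known sigma_y and tau is a simplified stand-in for a fitted multilevel model, not the MRP procedure used in any of the papers above. The names partial_pool and poststratify are hypothetical helpers. The fitted cell estimates play the role of p(y|theta,x), and the cell counts used for averaging play the role of p(x_new).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: J cells (say, demographic groups) with true cell means.
J = 8
true_mean = rng.normal(0.0, 1.0, size=J)
n_cell = np.array([2, 3, 5, 8, 12, 20, 40, 60])               # sample counts per cell
N_cell = np.array([900, 800, 700, 600, 500, 400, 300, 200])   # assumed population counts
sigma_y = 2.0                                                 # assumed known data sd

# Simulate the survey and compute the raw mean in each cell.
y = [rng.normal(true_mean[j], sigma_y, size=n_cell[j]) for j in range(J)]
ybar = np.array([yj.mean() for yj in y])


def partial_pool(ybar, n, sigma_y, tau):
    """Shrink raw cell means toward a precision-weighted grand mean.

    Normal-normal model with known sigma_y and tau; a stand-in for the
    smoothing done by a multilevel model.
    """
    w = 1.0 / (sigma_y**2 / n + tau**2)
    mu = np.sum(w * ybar) / np.sum(w)
    shrink = (n / sigma_y**2) / (n / sigma_y**2 + 1.0 / tau**2)
    return mu + shrink * (ybar - mu)


def poststratify(cell_est, counts):
    """Average cell-level estimates using the given cell counts as weights."""
    return np.sum(cell_est * counts) / np.sum(counts)


theta_hat = partial_pool(ybar, n_cell, sigma_y, tau=1.0)

raw_sample_mean = poststratify(ybar, n_cell)   # no smoothing: the plain sample mean
self_ps = poststratify(theta_hat, n_cell)      # smoothing only (survey on itself)
pop_ps = poststratify(theta_hat, N_cell)       # smoothing plus reweighting to population

print("raw sample mean:        ", raw_sample_mean)
print("poststratified on self: ", self_ps)
print("poststratified on pop:  ", pop_ps)
```

The gap between the first two numbers shows what the smoothing alone is doing, and the gap between the last two shows the effect of the imbalance between sample and population cell counts. Swapping N_cell for another set of counts changes p(x_new) without touching the fitted model.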
Is “poststratifying a survey on itself” another way of saying “perform a posterior predictive check on your MrP model”?
Thanks for the response! Would you say that you see the move away from information criteria and towards cross validation as almost always a good thing? Naively, just changing the criterion should be just as reasonable a strategy as changing the data for preventing overfitting.
I’m trying to understand why evaluating the model’s behavior on the same piece of data a hundred different ways is not nearly as good a strategy for model testing as evaluating a hundred new pieces of data in the same way every time.
I think it comes down to it being hard to obtain the “right” set of ways to evaluate the model, but I’m having trouble articulating why constructing good information criteria should be so difficult.
Ryan—where do you hang out that you find, “Typically in statistical modeling, we validate or test models by checking how well they perform on a holdout dataset.” In my experience, only people in ML and a small subset of Bayesians ever think about cross-validation. The vast majority of statistical modeling applications I see are driven by trying to find small p-values or at most satisfy posterior predictive checks.
Andrew—what fraction of the last 50 applied papers you co-authored involved cross-validation? We didn’t use it in our joint Covid post stratification paper, but that was probably more than 50 applied papers ago for you being in 2020 :-).
I have most experience in NLP on the applied side. There we know language drifts over time in syntax, sounds, meanings of words and phrases, and in the frequency with which things are mentioned. (There’s some really cool work by Dave Blei and John Lafferty on topic models over time which highlights this quite well in the scientific domain, where over 100 years, papers went from talking about steam engines and microscopes to x-ray crystallography and lasers.) But I’ve never seen anyone trying to use this knowledge explicitly to fit a model. Cross-validation, on the other hand, was the go-to method for fitting the smoothing/priors as is often the case in ML. I guess you could do a bit of this by holding the future out rather than strictly cross-validating. We can’t actually predict the name of the next social media fad, pop star, or movie star, or even the next brand name for drugs, but we know all these things will happen. There are also a lot of unknown unknowns. Mostly what you do in these cases is just regularize so that you fit the actual data less well.
I’m an ML guy, guilty.
Bob:
Very clever of you to ask, “what fraction of the last 50 applied papers you co-authored . . .”! If you’d just said, “When have you ever used cross validation in applied work,” I could’ve pointed you to this paper with Phil on radon modeling from 1996. We did this a long time ago but it’s influenced my thinking.
Many consumers of statistical models are interested in model comparison. People are always asking me about AIC or deviance or whatever. Through Aki’s influence, I think of all these as approximations to cross validation, which is itself an estimate of out-of-sample prediction error. So when someone’s comparing AIC’s, I tell them to use leave-one-out cross validation (loo) instead. And our paper on loo has been cited a lot, so I guess many people do care about these methods. For me, though, yeah, I don’t use cross validation so much in my applied work.
Isn’t that always the way to ask survey questions? Make them specific? I think I may have learned that from you.
That was also a bit of humor choosing “50” because you write so many papers.
I feel like the entire enterprise requires a heavy dose of wishful thinking. The solution is of course to augment the training data set. Anything else is just a mathematical way of saying nicely that we have low confidence in predicting cases which we have not seen much of.
I like the idea of capturing how the uncertainty varies with the predictor values – the confidence is a function of sample size so again we are essentially expressing the lack of overlap between the training and target populations.
On Bob’s point, if we take the ML perspective, we’d need a holdout dataset… and the hypothetical holdout dataset has known distributional variance on the predictors, while it has unknown outcomes. So anything we can do is more or less a wild guess.
Thank you for this interesting topic. Might I add one thing?
When investigating the predictive accuracy of a regression problem, LOO requires the independence of {(Xi,Yi)}, but the information criterion requires only the independence of {(Yi|Xi)} and does not require the independence of {Xi}. When there is a high-leverage sample point in {Xi}, the difference between LOO and the information criterion becomes large. Thus LOO and the information criterion have different meanings, and both are important.
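The leverage point can be seen in a toy least-squares example. This is a rough sketch with simulated data, and it makes a loud substitution: instead of WAIC or another Bayesian information criterion, it uses a Mallows' Cp-style penalized in-sample error as a simple analog of "an information criterion", and the exact leave-one-out shortcut e_i / (1 - h_ii) that holds for ordinary least squares. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression data with one point far from the rest in predictor space.
n = 30
x = rng.normal(0.0, 1.0, size=n)
x[-1] = 8.0                                   # the high-leverage point
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Ordinary least squares fit, residuals, and leverages (diagonal of the hat matrix).
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Exact leave-one-out residuals for least squares, no refitting needed.
loo_resid = resid / (1.0 - h)


def loo_resid_refit(i):
    """Brute-force check: refit without case i and predict it."""
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return y[i] - X[i] @ b


brute = np.array([loo_resid_refit(i) for i in range(n)])

# LOO estimate of squared prediction error.
loo_mse = np.mean(loo_resid**2)

# Cp-style estimate: in-sample error plus a parameter-count penalty.
# It conditions on the observed x values, so leverage enters only through p.
sigma2_hat = resid @ resid / (n - p)
cp_mse = np.mean(resid**2) + 2.0 * p * sigma2_hat / n

print("largest leverage:", h.max(), "at index", int(np.argmax(h)))
print("LOO squared error:     ", loo_mse)
print("Cp-style squared error:", cp_mse)
print("share of LOO error from the leverage point:",
      loo_resid[-1]**2 / np.sum(loo_resid**2))
```

The penalty in the Cp-style criterion depends on the leverages only through their sum, which equals p, while LOO inflates each residual by its own 1 / (1 - h_ii). So the two can diverge when a single h_ii is close to 1, which is one way to read the comment above.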
Please no. The holdout also needs to be data collected *after* the training data. And we all know about the “peeking”. So really it should be after the model is “published” in some way.
That is old school tech, and it works. Shortcuts aren’t smart or cool, they’re cutting corners.
Of course if you just want to publish papers about (eg, curing cancer) the repeated cross validation data leakage trick works great.
In one sense you are of course right. The only true ‘out of sample’ is a legit replication. But we also have to make most efficient use of our current data to develop models for many applications including prediction, in which case these methods are important!
“This suggests, first that any evaluation of a model’s predictive performance should require some specification of where the evaluation will be done …”
I think there is a more general point to make here, and one that I think should take a more central role in methods training: that any generalization implies a decision about what you want to generalize to, and that good science depends on making that decision explicit and sensible. Any unconditional inferential statements, i.e. about ‘the causal effect’, ‘the parameter’, ‘the population’, ‘the dgp’ etc. should invite questions about the implied conditions, i.e. ‘what causal effect?’, ‘what parameter?’, ‘what population?’, ‘what dgp?’.