In a post entitled “A subtle way to over-fit,” John Cook writes:

If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So in machine learning and statistics, people split their data into two parts. They train the model on one half, and see how well it fits on the other half. This is called cross validation, and it helps prevent over-fitting, fitting a model too closely to the peculiarities of a data set.

For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.

Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. . . .

With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.

I was glad that Cook wrote this because it does seem to me that people often think of a cross-validated estimate as being correct in some sense, not just an estimate but the right answer.

Here were my reactions to Cook’s post:

1. I’d prefer he didn’t use a polynomial example, but rather something more “realistic” such as y = A*exp(-a*t) + B*exp(-b*t). I just hate how in certain fields such as physics and economics, polynomials are the default model, even though we just about never see anything that is usefully modeled by a polynomial of degree higher than 2.

2. Cross-validation is a funny thing. When people tune their models using cross-validation they sometimes think that because it’s an optimum that it’s the best. Two things I like to say, in an attempt to shake people out of this attitude:

(a) The cross-validation estimate is itself a statistic, i.e. it is a function of data, it has a standard error etc.

(b) We have a sample and we’re interested in a population. Cross-validation tells us what performs best on the sample, or maybe on the hold-out sample, but our goal is to use what works best on the population. A cross-validation estimate might have good statistical properties for the goal of prediction for the population, or maybe it won’t.

Just cos it’s “cross-validation,” that doesn’t necessarily make it a good estimate. An estimate is an estimate, and it can and should be evaluated based on its statistical properties. We can accept cross-validation as a useful heuristic for estimation (just as Bayes is another useful heuristic) without buying into it as necessarily best.
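Point (a) is easy to demonstrate: re-run K-fold cross-validation with different random splits, and the "CV estimate" moves around. A quick sketch (sklearn assumed; the ridge model and simulated data are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)

estimates = []
for seed in range(50):
    # same data, same model -- only the random fold assignment changes
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    estimates.append(-scores.mean())

# the "CV estimate" is itself a random quantity with a spread
print(np.mean(estimates), np.std(estimates))
```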

**P.S.** Also you might be interested in this article with Aki and Jessy on cross-validation and information criteria. (Aki and I are working on a new paper on this topic but it’s not quite finished.)

What if you estimate your confidence level by evaluating the selected model on a data partition that wasn’t used for either model fitting or model selection?

I am under the impression that using cross validation (or at least LOOCV) to pick a model is in fact equivalent to maximizing an adjusted goodness of fit measure. The idea (as I understand it) is that by using held out data to pick the best model, we are in fact using the information in the entire sample, just as we do when we use goodness of fit tests. Of course, if this logic is correct it suggests that cross validation is not really an improvement at all over goodness of fit testing (with an appropriate penalty for adding variables). But if this is right, I am not sure why it has become so common to use hold out samples in machine learning. Your thoughts appreciated.

Eric: perhaps you mean Stone’s (1977) result, that AIC and cross-validation give the same model choice asymptotically. In this case, yes, neither approach is better. But there are some reasons to prefer one over the other elsewhere; AIC may be easier to implement, while cross-validation may be easier to generalize to complex situations, and is perhaps more transparent about what it provides.

Less compellingly, there’s also inertia, that people stick with the version they saw first.
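For least squares, the leave-one-out errors have a closed form, which makes the LOOCV/AIC comparison easy to sketch (the data-generating process and candidate degrees here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.0 + 0.5 * x + rng.normal(size=n)  # the true model is degree 1

def loocv_and_aic(degree):
    X = np.vander(x, degree + 1)           # polynomial design matrix
    H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
    resid = y - H @ y
    # closed-form leave-one-out residuals for least squares
    loocv = np.mean((resid / (1 - np.diag(H))) ** 2)
    aic = n * np.log(np.sum(resid**2) / n) + 2 * (degree + 1)
    return loocv, aic

results = {d: loocv_and_aic(d) for d in range(1, 6)}
print(min(results, key=lambda d: results[d][0]))  # degree chosen by LOOCV
print(min(results, key=lambda d: results[d][1]))  # degree chosen by AIC
```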

The problem I’ve always struggled with is deciding which parameters to “fit” using the training data, and which to “select” using the validation data. In other words, what makes a parameter a hyperparameter? If we treat none of the parameters as hyperparameters, so we optimize everything for the training sample, we perform terribly in the test sample. If we treat all of the parameters as hyperparameters, we have the same problem, since we effectively turn the validation sample into the training sample. The optimum is presumably somewhere in between, but I’ve never been sure how to find it other than using rules of thumb. E.g. estimate coefficients using training data, but select which predictors to include using the validation data. Seems arbitrary to me.

Cross-validation is mostly about regularization when you’re doing some kind of optimization. If you’re minimizing, say, squared error, on your training set, you’ll end up with no regularization, because the model will fit the data better on the training set without it.
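A minimal illustration of that point (sklearn assumed; the penalty grid and simulated data are arbitrary): training error always favors the weakest regularization on the grid, while cross-validation typically does not.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 30))       # few observations, many predictors
y = X[:, 0] + rng.normal(size=40)   # only the first predictor matters

alphas = [1e-6, 0.1, 1.0, 10.0]
train_err, cv_err = [], []
for a in alphas:
    model = Ridge(alpha=a).fit(X, y)
    train_err.append(np.mean((y - model.predict(X)) ** 2))
    cv_err.append(-cross_val_score(Ridge(alpha=a), X, y, cv=5,
                                   scoring="neg_mean_squared_error").mean())

print(alphas[int(np.argmin(train_err))])  # training error: smallest penalty wins
print(alphas[int(np.argmin(cv_err))])     # CV: real regularization wins
```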

When I was in grad school we were taught that if you split the sample in half and tested your first model fit on the second half, all of the data issues and systematic problems of measurement would be present in the test set as well, and hence this was not actually useful for many of the purposes people thought it would be useful for.

Saw this http://www.forbes.com/sites/teradata/2015/05/05/2015-the-year-big-data-becomes-agile/ the other day, particularly the section “the challenge of repeatable results”

David Draper has some thoughts on this dilemma. See http://www.ma.utexas.edu/blogs/mks/2014/04/28/david-draper-on-bayesian-model-specification-toward-a-theory-of-applied-statistics/ for a summary of and a link to a talk by Draper on the question.

Thanks for the link. Those slides read like an article!

Andrew: I chose the cubic example for pedagogy rather than verisimilitude. But I agree with your point that polynomial models are overused. You see that in numerical analysis texts too, e.g. an emphasis on methods that integrate polynomial-like functions well. As one of the better numerical analysis books put it, no polynomial ever had a horizontal asymptote and none ever will.

Ram: You’re right. The distinction between fitting and selecting is arbitrary, or at least fuzzy.

I think the cubic example was great for driving the point home for me.

The following article fits this post perfectly!

Juho Piironen, Aki Vehtari (2015) Comparison of Bayesian predictive methods for model selection

http://arxiv.org/abs/1503.08650

It shows examples of using cross-validation with more than a million models, describes why it’s not enough to add another layer of cross-validation (as suggested by Rahul), and also shows how it’s possible to do better than plain cross-validation in model selection (no magic involved). Stan+R code available later this week!

V-fold cross-validation has been shown to have a fairly remarkable oracle property if one is willing to assume i.i.d. data (as is common, though it of course has drawbacks). In particular, V-fold CV yields a cross-validated risk which performs as well as that of the best algorithm under consideration (up to a logarithmic term if that algorithm happens to be a correctly specified parametric model). For quadratic loss the inequality looks like

(cross-validated risk of CV selector) ≤ (1 + 2·delta) × (cross-validated risk of oracle selector) + C(delta) × V × (1 + log(number of candidates)) / n, for every delta > 0, where C(delta) is a constant. The V in the remainder is the number of folds in the V-fold CV procedure.

The first term on the right is the risk of the oracle selector which chose the candidate which minimizes cross-validated risk, i.e. given the fits on the training folds, it chooses the fit which performs best in terms of mean-squared error under the true distribution, averaged across folds.

For quadratic loss the only requirement is that the loss is uniformly bounded, which occurs if the outcome of interest is bounded. The number of candidate algorithms the CV selector considers only appears in one place in the above inequality, namely in a logarithm in the final remainder term. Thus we can choose a fairly large polynomial in sample size for the number of candidates and still have a log(n)/n remainder.

The above inequality is a finite sample result, though note that it relies on V growing slowly with n. In practice one might fix V (e.g. V=10), noting that optimality is then defined as the risk averaged over these 10 training folds. See

– M J van der Laan and S Dudoit. “Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples.” (2003).

– A W van der Vaart, S Dudoit, and M J van der Laan. “Oracle inequalities for multi-fold cross validation.” Statistics & Decisions 24.3 (2006): 351-371.

for a precise statement of the remainder term.

Of course this is only to say that CV is (nearly) optimal in a certain sense, namely in terms of the cross-validated risk associated with a certain loss function. But if one wants a prediction algorithm to be used down the line then this certainly seems to be a reasonable criterion.

Alex:

Cross-validation is fine. The point is that the resulting estimates are still data-based and are noisy. It’s the same with maximum likelihood, or Bayes, or whatever. Even an optimal approach, given data, is still given data and should be taken as an estimate, not as truth. But it’s my impression that many users of cross-validation think of it as giving the correct value, which is an error comparable to thinking that the MLE or Bayes estimate is equal to the true value of the parameter. Which is not the case.

Certainly, that makes perfect sense, thanks for clarifying. I guess what I was trying to say is that cross-validation is optimal in a certain sense — but of course is still noisy and only returning an estimate of whatever the object of interest is. I think the remarkable part about the optimality result is that it is very general and applies to many situations where cross-validation is (or could be) used in practice. Thanks again!

Although the other parts of this interview with Chris Wiggins of NYT would suggest otherwise http://simplystatistics.org/2015/06/01/interview-with-chris-wiggins-chief-data-scientist-at-the-new-york-times/

He does say that he can know if he is wrong at 9:34–9:45 (not quite as bad as knowing if he is right).

Quite sure many listening to this will be led to take cross-validation as truth.

Though I do think there is an under-appreciation of the noise coming through in the interview.

The problem of packing n equal circles into a unit square naturally produces high-degree polynomials (n=11 -> polynomial degree 18, and n=13 -> polynomial degree 40; see p. 17 of http://www.inf.u-szeged.hu/~pszabo/Pub/45survey.pdf).

Longer discussion and some more pointers https://mathoverflow.net/questions/27324/what-are-some-naturally-occurring-high-degree-polynomials

Polynomials are the default model in economics? That’s news to me. Taking logs might be the default before polynomials are the default. In the case where you’re estimating something like a Cobb-Douglas function like Y=A*(L^alpha)*(K^beta), the logged model becomes linear. Polynomials are used in some labor economics regressions when you are trying to account for years of experience or age or education. Even here though, it’s empirically motivated (and could be replaced with dummy variables for groups of years).

John Hall:

When Andrew says “economics” he means “economics according to Josh Angrist”. You have to make a bias adjustment whenever he makes a reference to economics.

I usually find it embarrassing when Andrew makes a reference to economics, because he mostly doesn’t know what he’s talking about

So yes, good idea to make a bias adjustment: just ignore it

It’s a bit of a pity because I like his common sense attitude to applied statistics

Also, I don’t associate Angrist’s work with polynomials either, but then again, what do I know? I’m only an economist…

Ed:

If you want to “just ignore” the work of David Lee on polynomial models for regression discontinuity, that’s fine with me. But it *is* in the economics literature and it *does* have practical consequences.

When you call polynomial models the “default models” for the whole field of economics, do you have just David Lee in mind?

i.e. It may be in the econ. literature, & it may indeed have consequences but is it widespread enough to characterize it as the “default” for the entire field?

Rahul:

I’ve seen polynomial regressions presented without comment in other settings in economics and elsewhere. Of course the default is linear regression with no interactions. But when a nonlinear form is introduced, it seems standard to start with polynomials. I understand this from a teaching perspective, as students are familiar with polynomials, but I do think it can lead to the impression that it’s the default. I think of the David Lee papers as an example of how a respected researcher can use and recommend polynomials without really thinking about it, just cos they’re there.

Regression discontinuity is a pretty rarely used model in economics. And even then, when we used Angrist and Pischke’s book, and presumably Pischke’s curriculum (as it was a methodology department course at the LSE), they cautioned profoundly against using any polynomial in a RD model.

But outside of the uncommon causal inference guys, I’ve not seen much serious polynomial usage outside of some specific debt pricing models, and I work at the Fed. I suspect your exposure to economics research is very biased towards the causal inference subfield, as opposed to the macro/micro/finance stuff, that makes up the majority of the field.

Anyway, it’s a quibble. Doesn’t really matter. But I also was a little confused when I read that you said it was standard in econ.

By “economics according to Josh Angrist” I was taking a shot at Josh Angrist’s willingness to claim, as though it were in some way related to reality, that his narrow subfield is the entirety of real economics. For instance, this recent statement: “The ascendance of the five core econometric tools – experiments, matching and regression methods, instrumental variables, differences-in-differences and regression discontinuity designs – marks a paradigm shift in empirical economics.”

Andrew has responded below that he has in mind David Lee, someone who does that kind of work. Hopefully I can communicate that Angrist’s world describes a minority of empirical microeconomic research.

A ‘paradigm shift’?! Good lord… Angrist is brilliant, and his textbook is wonderful. But sometimes I think he misunderstands his own value. These ‘five core econometric tools’ aren’t a paradigm shift because of their brilliant statistical properties. But rather because he is insisting they be used within a robust philosophy of science and design. Whereas economists frequently just use models without any sort of causal inference underpinning giving their work a deeper level of validity.

So would there be a point in building a (Bayesian) model to gauge the uncertainty in the cross-validation estimate? That is, treat the model that you are fitting as a black box and fit another model to the cross-validation errors in order to get an estimated predictive error with uncertainty. Is this something that people do?

Yes.

I was quite surprised how susceptible some kernel learning methods are to over-fitting the model selection criterion (whether cross-validation or evidence maximisation) and wrote up some illustrative experiments here (using kernel ridge regression / least-squares support vector machine):

G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010. (http://jmlr.csail.mit.edu/papers/v11/cawley10a.html)

At the very least it is a good idea to use something like nested cross-validation to get a performance estimate for the whole model-fitting procedure (including optimising hyper-parameters, feature selection, etc).
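A sketch of what nested cross-validation looks like in practice (sklearn assumed; the estimator and hyper-parameter grid are placeholders): the inner loop tunes the hyper-parameter, the outer loop estimates the performance of the whole tuning-plus-fitting procedure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=1)
outer = KFold(n_splits=5, shuffle=True, random_state=2)

# GridSearchCV does the inner CV over alpha; cross_val_score wraps it in the outer CV
tuner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=inner,
                     scoring="neg_mean_squared_error")
outer_scores = cross_val_score(tuner, X, y, cv=outer,
                               scoring="neg_mean_squared_error")
print(-outer_scores.mean())  # honest estimate for the whole procedure
```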

I am not sure anybody thinks CV leads to the true value, just that it is a better way of choosing a model than AIC. It is. Well, I’d be interested to see if there is a better model score.

We teach our machine learning students that if the fitting process uses any information from the “independent” test data, then the test data are no longer independent and will not give an unbiased estimate of the loss. In true cross-validation, as opposed to the simple holdout strategy described in the post, there are no independent test data. QED.

But this is unavoidable and any other alternative would have the same problem. You use the data that you have instead of the data that you *will* have in the future. It would be best to just take the future data, label it and then evaluate the algorithm. But if you could do that you wouldn’t need machine learning in the first place. You can of course work in the Bayesian framework and specify priors for certain models (kernel functions, hyperparameters) but it’s unclear what your prior should be among different kernels etc.

We have nothing else than:

– prior assumptions (Bayesian prior distributions)

– evidence/data

We have no choice but to use these things. If the algorithm overfits to the data, you have to make your priors stronger. There is no Platonic “intelligence” out there that we could tap into and magically make our classifier/regression perform better. You either improve the data or specify your priors more carefully. Even if you don’t work in an explicitly Bayesian framework, you still only have your assumptions and the data. “Intelligence” as such doesn’t exist. The no free lunch theorems are pretty clear about this.

Istvan:

I refer you to Aki’s paper for discussions about cross-validation and alternatives. In any case, when I say that cross-validation is not magic, I’m not saying there is an alternative that *is* magic; I’m just reacting to a common practice I’ve seen, which is to take a cross-validated decision or estimate and take it as correct.

This is an obvious point; no need to refer to free-lunch theorems to see it. But there’s something about the framing of cross-validation that leads people to forget that a cross-validated estimate is just an estimate, a function of data, nothing magic.

What do these people think about the cross validation error? That it is exactly equal to the error that they would get on a newly collected very large dataset? I don’t think serious machine learning people think that. As I see it, the situation is fully equivalent to picking a random number from a normal distribution with unknown mean and variance (over which we have priors) and then trying to estimate the mean. The random number is the cross validation error, and it gives us evidence about the real error. But if you have strong priors (because you strongly believe that the particular method is really poor) then you will explain the result as just chance. The question is, what evaluation measure gives you the most evidence (in expectation). Unfortunately this also depends on your priors. There is no magical best method that we can prove to be generally better than another. Any claim that one particular algorithm is better than another is necessarily a claim about the kinds of tasks that we are likely to encounter. It is not a claim about anything purely general about “intelligence”.

They take the tuning parameter, as estimated by cross-validation, as if it is the correct value, rather than just an estimate, possibly noisy, from data.

I think you should maybe add a PS clarifying what you mean by ‘cross validation estimate’ in the main article. Until reading this comment, I (along with many other commenters, it seems) had assumed you were referring to the estimate of out of sample error for a given model with given hyperparameter settings. While this estimate is of course just an estimate, it is often pretty stable and I didn’t understand what the fuss was about. Now I see that by ‘cross validation estimate’ you mean the value of the hyperparameter that minimizes estimated out of sample loss. This is certainly a less stable quantity. But, in machine learning at least, people tend not to be interested in a ‘true value’ of a hyperparameter. They just want to find a value that leads to good predictions.

+1 I have almost always seen cross-validation used in settings where people were interested in the predictions rather than the value of a parameter.

The problem is that the CV estimate often doesn’t give good predictions, as the optimisation of the hyperparameters has over-fitted the CV-based model selection criterion. In other words, as the model selection criterion is minimised, initially generalisation performance improves, but there comes a point where minimising the model selection criterion further starts making generalisation performance worse, rather than better. This happens because the CV error is only an estimate of performance, and so has a finite variance, and it is possible to minimise the CV criterion in such a way that it exploits the random variation due to the finite sample.

Unless you use some external validation (e.g. nested cross-validation) it will not be apparent that performance is being lost to this form of “over-fitting in model selection”.

Yeah, this is what Cook was talking about, but Andrew’s statements didn’t seem limited to this phenomenon.

I don’t really see anything else in Andrew’s comments, but perhaps that is just me.

“2. Cross-validation is a funny thing. When people tune their models using cross-validation they sometimes think that because it’s an optimum that it’s the best. Two things I like to say, in an attempt to shake people out of this attitude:”

seems pretty clear that the tuning of models is the key problem. Unfortunately performance evaluation methods that ignore these problems with model selection are quite common in machine learning (e.g. using default hyper-parameter settings at one end of the spectrum, through to reporting the cross-validation error used to tune the models as a performance estimate at the other). Performing rigorous model selection and performance evaluation is computationally expensive, but unfortunately very important.

Naive question:

Does nested cross-validation have advantages over having an entirely pristine partition of the dataset, a “holdout” sample, used only for judging performance?

It is a trade-off between having enough pristine data so that the performance estimate has a low variance, whilst at the same time having enough data to build a good model. The larger the pristine sample, the lower the variance of the performance estimate, but the higher the bias in estimating the performance of a model built on all of the available data; there will also be increased variance in the model itself, as it is constructed from a smaller dataset.

The advantage of nested cross-validation is that all of the available data goes into reducing the variance of the performance estimate, whilst at the same time making as much data available as possible for model selection/model fitting.

Basically it is the same advantage as in using cross-validation rather than just a single test-train split, but with the same argument applied hierarchically.

Thanks! Very nicely explained.

Whenever you maximize something that is noisy, you unavoidably maximize the noise part as well. If you choose the fastest runner in the Olympics, his result will be better than his actual ability. This is also known as regression to the mean. The same thing happens here: you choose the best parametrization of your algorithm, and by doing so you unavoidably maximize the sum of the real performance and the noise due to sampling, not the pure performance alone.

But if you only use cross validation for evaluating your final model that you tuned on a disjoint set then your estimate is unbiased (provided that your samples are really a random sample of the data that will be used in the actual real life task).
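This winner's-curse effect can be simulated directly: score many models that are pure coin flips on the same labels and pick the best one (all the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_models = 100, 1000
y = rng.integers(0, 2, n)                  # labels with no signal at all
preds = rng.integers(0, 2, (n_models, n))  # 1000 models that are pure coin flips

scores = (preds == y).mean(axis=1)         # "validation" accuracy of each model
best = scores.argmax()
print(scores[best])                        # the winner looks well above chance

y_new = rng.integers(0, 2, n)              # fresh labels
print((preds[best] == y_new).mean())       # back near 0.5: the gain was noise
```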

Istvan:

I think we’re in agreement here. It’s just that often I’ve seen people take a cross-validated estimate as if it is correct, without realizing that it’s just an estimate. That’s all.

I’m with you 100%, but I don’t know where you got that “…in economics polynomials are the default…”

I’m an economist and I can only think of a couple papers where polynomials were used. Maybe they are popular in very specific sub-fields I know little about.

Jack:

I’m thinking of the notorious regression discontinuity literature.

It isn’t just cross-validation (and evidence); there is a similar misunderstanding regarding performance bounds (such as the radius-margin bound), which again are calculated from a finite sample of data and AFAICS are invalidated by their optimisation.

I think polynomials get a bit of a bad rap. It’s mostly lack of mathematical sophistication and lack of thought put into the modeling that leads to the worst issues with polynomials. For example:

horizontal asymptotes: if f(x) has a horizontal asymptote then create g(x) which maps x: (-inf, inf) into g: (-1, 1) (such as g(x) = 4*atan(x)/pi) and then fit a polynomial p in g(x), so that f(x) ~ p(g(x))

The point is really, you have to **think about the issue** whereas if you just try to fit f(x) ~ p(x) you’re sunk because you missed the point.

err, sorry: 2*atan(x)/pi since tan(pi/2) -> inf
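The transformation idea can be sketched as follows, using the corrected g(x) = 2*atan(x)/pi and tanh as a stand-in for a function with horizontal asymptotes (all choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-50, 50, 300)
f = np.tanh(x)                # stand-in target with horizontal asymptotes
g = 2 * np.arctan(x) / np.pi  # maps (-inf, inf) into (-1, 1)

coef_x = np.polyfit(x, f, 5)  # degree-5 polynomial directly in x
coef_g = np.polyfit(g, f, 5)  # degree-5 polynomial in the transformed variable

x_new = rng.uniform(-50, 50, 300)
err_x = np.mean((np.tanh(x_new) - np.polyval(coef_x, x_new)) ** 2)
err_g = np.mean((np.tanh(x_new)
                 - np.polyval(coef_g, 2 * np.arctan(x_new) / np.pi)) ** 2)
print(err_x, err_g)           # the fit in g(x) should do much better
```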

Generate a dataset consisting of the volume, length, width, height, color, and angle of the lower corner relative to a surface (as if some boxes are being thrown). The more extraneous variables and correlations between them the better (if we checked all boxes in the world for volume and color do you really think there would be zero correlation?). Add in some blatantly incorrect measurements, as well as some non-normal error (the more exotic the better) on the others. From this information, someone else, not knowing how the data was generated, derives the equation for the volume of a box.

I would like to see a comparison of algorithms on this. How much data is needed, what assumptions, etc. I do not mean to denigrate efforts to solve ideal problems at all, that is very important. But most of the examples I see are not like real life. And most real life examples I never see compared to new data.

Does this exist?

Physicists like low-degree polynomials because they’re usually only interested in behavior near a point. So they feel justified in using a truncated Taylor series approximation.

I didn’t see it mentioned explicitly, but another issue with CV is when people use CV instead of nested CV (or a final holdout). Not only is there the variability you talk about, but they will be overly optimistic as well.

Isn’t the core issue that hold-out test data are nearly always simple random samples, and not in any way structurally different from the general data to be fit later? SRS test data don’t tell us much, and with large numbers of observations the results are not that different from the training results (when fitted well).

A much better test data set is one that mimics expected differences in the production data set. For example, if geography is not part of the model, then hold out a state or two in the test data. Similarly, withhold other non-model structural subsets to test for biased model building.