Yuling writes, “Bayes is guaranteed to overfit, for any model, any prior, and every data point.” This statement is not literally true: it does not hold for degenerate examples of models with no unknown parameters (for which p(theta|y) is the same as p(theta), as these are both delta functions at the true, known value of theta), nor does it hold for degenerate examples of data whose distribution does not depend on unknown parameters. It’s easy to come up with examples of this sort. For example, consider the model y_i ~ normal(theta*x_i, 1), independent for data i=1,…,n. In this case, for any data point with x_i = 0, the predictive distribution of y_i is fixed, so there’s no “overfitting.”
So, yeah, the statement is not true as written—but I know what Yuling is saying, which is that when there’s uncertainty in predictions, Bayesian fitting will pull the prediction from the prior toward the observed data, so that, in Yuling’s words, the in-sample error is smaller than the out-of-sample error, which is what he is calling “overfitting.”
This should not be surprising: from Akaike (1973) onward, there’s been a whole subfield of statistics devoted to correcting for the difference between within-sample and out-of-sample prediction error.
There’s something interesting here, though, and that is how much does the Bayesian inference “overfit,” in Yuling’s terms?
Example 1: Normal distribution with flat prior.
Let me explain in the simple case of one data point, y|a ~ normal(a, 1), with a flat prior for a. Because we’re also interested in out-of-sample prediction error, we’ll also define y_rep|a ~ normal(a, 1), independent of y|a.
Let’s start with the maximum likelihood estimate, â_mle = y, which gives the prediction y_pred = y. The within-sample prediction squared error is (y_pred – y)^2 = 0. Meanwhile, the expected out-of-sample prediction squared error is E((y_rep – â_mle)^2) = 2. So that’s how much overfitting the mle has: its within-sample squared prediction error underestimates the expected out-of-sample prediction error by 2. This difference corresponds to the “2” in the AIC formula.
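Here is a quick simulation check of those two numbers, as a minimal sketch. The function name mle_errors, the true value a = 0, the number of draws, and the seed are arbitrary choices for illustration; they are not part of the calculation above.

```python
import random

def mle_errors(a=0.0, n_sims=200_000, seed=1):
    """Monte Carlo estimate of within-sample and out-of-sample squared
    prediction error for the estimate a_mle = y (illustrative only)."""
    rng = random.Random(seed)
    within = 0.0
    out = 0.0
    for _ in range(n_sims):
        y = rng.gauss(a, 1.0)        # observed data point, y | a ~ normal(a, 1)
        y_rep = rng.gauss(a, 1.0)    # independent replicate, y_rep | a ~ normal(a, 1)
        a_mle = y                    # maximum likelihood estimate
        within += (a_mle - y) ** 2   # always 0 by construction
        out += (a_mle - y_rep) ** 2
    return within / n_sims, out / n_sims

in_err, out_err = mle_errors()
print(in_err, out_err)
```

The first number is exactly 0 and the second should land close to 2, whatever value of a is assumed.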
In this case with a flat prior, the Bayesian posterior mean is â_bayes = y, so same point prediction with same within-sample prediction squared error of 0.
But Bayes doesn’t just give a point estimate; it gives a posterior distribution, p(a|y). Thus the relevant predictive summary here is the expected within-sample prediction squared error, averaging over the posterior distribution for a; that is, E((a – y)^2 | y). In this very simple example, the posterior distribution is just a ~ normal(y, 1), so the expected within-sample prediction error of the Bayesian inference is 1.
What about the expected squared out-of-sample prediction error? This is E((â_bayes – y_rep)^2 | y) = 2. So the overfitting is 1. Averaging over the posterior distribution has reduced the overfitting by half.
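The two conditional expectations in Example 1 can be checked by simulating from the posterior, as a rough sketch. The observed value y = 0.7, the function name flat_prior_errors, and the simulation settings are assumptions made up for illustration.

```python
import random

def flat_prior_errors(y=0.7, n_sims=200_000, seed=2):
    """Conditional on one observed y, estimate E((a - y)^2 | y) and
    E((a_bayes - y_rep)^2 | y) under a flat prior (illustrative only)."""
    rng = random.Random(seed)
    a_bayes = y                      # posterior mean under the flat prior
    within = 0.0
    out = 0.0
    for _ in range(n_sims):
        a = rng.gauss(y, 1.0)        # posterior draw, a | y ~ normal(y, 1)
        y_rep = rng.gauss(a, 1.0)    # replicate, y_rep | a ~ normal(a, 1)
        within += (a - y) ** 2       # within-sample error averaged over the posterior
        out += (a_bayes - y_rep) ** 2
    return within / n_sims, out / n_sims

within_err, out_err_bayes = flat_prior_errors()
print(within_err, out_err_bayes, out_err_bayes - within_err)
```

The printed values should be near 1, 2, and 1, and they do not depend on the particular y that was assumed.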
Example 2: Normal distribution with informative prior.
Now let’s consider the next most complicated example. Same as above but now our prior is a ~ normal(0, s), and our posterior is a|y ~ normal((s^2/(1 + s^2))*y, s/sqrt(1 + s^2)).
The Bayes posterior mean is now â_bayes = r*y, where r = s^2/(1 + s^2), which gives a within-sample squared prediction error of y^2/(1 + s^2)^2. This depends on y, so we can compute its expectation averaging over the prior predictive distribution (i.e., the marginal distribution of y), which here is y ~ normal(0, sqrt(1 + s^2)), hence the expected within-sample squared prediction error of the posterior mean is 1/(1 + s^2).
What about the expected within-sample squared prediction error of the posterior inference? For this, we have to add the posterior variance, s^2/(1 + s^2), and we get 1/(1 + s^2) + s^2/(1 + s^2) = 1.
Finally, the expected squared out-of-sample prediction error comes to (1 + 2s^2)/(1 + s^2).
We can now take differences. Suppose the goal is to estimate the expected squared out-of-sample prediction error, and you can do that using the squared within-sample prediction error of the Bayes point estimate, or the squared within-sample prediction error of the Bayes posterior distribution. It turns out that both are too optimistic. The Bayes point estimate underestimates the out-of-sample prediction error by 2s^2/(1 + s^2), and the Bayes posterior distribution underestimates the out-of-sample prediction error by s^2/(1 + s^2).
Thus, in this normal-normal example, when we use the full Bayesian posterior instead of the Bayesian point estimate, it reduces the overfitting by a factor of 2.
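For readers who want to check the Example 2 algebra numerically, here is a minimal sketch that averages over the joint distribution of a, y, and y_rep. The prior scale s = 2, the function name normal_normal_errors, and the simulation settings are illustrative assumptions.

```python
import math
import random

def normal_normal_errors(s=2.0, n_sims=200_000, seed=3):
    """Monte Carlo estimates, averaging over the prior predictive, of:
    within-sample error of the posterior mean, within-sample error of
    posterior draws, and out-of-sample error of the posterior mean."""
    rng = random.Random(seed)
    r = s ** 2 / (1.0 + s ** 2)              # shrinkage factor
    post_sd = s / math.sqrt(1.0 + s ** 2)    # posterior standard deviation
    point_in = post_in = out = 0.0
    for _ in range(n_sims):
        a = rng.gauss(0.0, s)                # a ~ normal(0, s)
        y = rng.gauss(a, 1.0)                # y | a ~ normal(a, 1)
        y_rep = rng.gauss(a, 1.0)            # independent replicate
        a_bayes = r * y                      # posterior mean
        a_draw = rng.gauss(a_bayes, post_sd) # one posterior draw of a
        point_in += (a_bayes - y) ** 2
        post_in += (a_draw - y) ** 2
        out += (a_bayes - y_rep) ** 2
    return point_in / n_sims, post_in / n_sims, out / n_sims

point_in, post_in, out = normal_normal_errors()
print(point_in, post_in, out)
print(out - point_in, out - post_in)
```

With s = 2 the three averages should come out near 1/5, 1, and 9/5, so the two gaps are about 1.6 and 0.8, matching 2s^2/(1 + s^2) and s^2/(1 + s^2).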
That factor of 2
Example 2 above is not just a special case; it’s of general interest for well-behaved problems where the asymptotic (large-sample) limit kicks in and the likelihood follows an approximate normal curve.
It’s also the basis for information criterion calculations such as AIC, DIC, and WAIC, as discussed in our 2014 article and chapter 7 of BDA3 (see also our followup paper from 2017 focusing on leave-one-out cross validation). The “effective number of parameters” in the model is associated with the difference of within-sample and out-of-sample prediction accuracy.
None of this is new. The goal of this post is to clarify what Yuling wrote about Bayes overfitting. As Yuling said, any fitting procedure will “overfit” in the sense that it is adapting to the data.
– The Bayesian point estimate overfits (on average).
– The Bayesian posterior distribution accounts for uncertainty and reduces the overfitting by a factor of 2 (in the normal-normal case).
– This is not a problem with Bayesian inference; it’s a recognition that any fitting is, in a sense, over fitting; or, to put it another way, fitting in parameter space inevitably leads to overfitting in the space of mean squared error or log predictive density.
P.S. As Aki and Phil point out, the above definition of “overfitting” can be misleading in that, under that definition, all fitting is overfitting; see further discussion here in comments.
Hello, Andrew. Did you mean 1960, when you referred to the Akaike paper?
Akaike, H. On a limiting process which asymptotically produces f^(-2) spectral density. Ann Inst Stat Math 12, 7–11 (1960). https://doi.org/10.1007/BF01577661
Sorry, Andrew. You probably are referring to this 1973 paper?
Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov, B.N. and Csaki, F., Eds., International Symposium on Information Theory, 267-281.
Yes.
> But Bayes doesn’t just give a point estimate; it gives a posterior distribution, p(a|y). Thus the relevant predictive summary here is the expected within-sample prediction squared error, averaging over the posterior distribution for a; that is, E((a – y)^2 | y).
How do you define the “expected within-sample prediction squared error”? Do you mean that – conditional on the observation y0 – instead of predicting y0 you would produce a random prediction distributed as a standard normal around y0?
> In this very simple example, the posterior distribution is just a ~ normal(y, 1), so the expected within-sample prediction error of the Bayesian inference is 1.
If we observe y0 and the “prediction” is a random value pred ~ normal(y0, 1) the variance of (pred-y0) is indeed 1. So far so good.
> What about the expected squared out-of-sample prediction error? This is E((a – y_rep)^2 | y) = 2.
This is where I’m lost.
Conditional on the value of a, the out-of-sample observation has a normal distribution y_rep ~ normal(a, 1). As you said, conditional on y0, a ~ normal(y0, 1). Therefore the distribution for the out-of-sample observation has variance 2, y_rep ~ normal(y0, sqrt(2)).
We said above that the “prediction” is a random value pred ~ normal(y0, 1) so the variance of (pred-y_rep) is 3, not 2. If the “prediction” now is something different, why?
Carlos:
Sorry, I meant E((â_bayes – y_rep)^2 | y), which is 2. I just went in and fixed it above. In our 2014 paper we do these calculations more carefully using expected log predictive density. Here I was trying to go quick-and-dirty and use mean squared error.
The factor of 2 thing comes up with the careful calculations, so here I knew where I was trying to get, and I was sloppy in getting there.
> E((â_bayes – y_rep)^2 | y)
Where â_bayes is the point estimate. It’s not clear why the following reasoning wouldn’t apply here:
> But Bayes doesn’t just give a point estimate; it gives a posterior distribution, p(a|y). Thus the relevant predictive summary here is…
Carlos:
Yes, the clean way to do this is using expected log predictive density, as we do in our 2014 paper. Here I was trying to supply some quick intuition to get the factor of 2.
Thanks, I’ll take a look.
For others who may be interested: there is a free access version at https://www.stat.columbia.edu/~gelman/research/published/waic_understand3.pdf
(By the way, I found the original blog post very confusing and if “any fitting is, in a sense, over fitting” that sense doesn’t make a lot of sense…)
Is there a definition of “overfitting”? Nothing I see here seems to capture what I think of as overfitting. I’m sure there is a formal definition, perhaps more than one, and if so then perhaps I’ve used the term wrong (or at least imprecisely) for many years.
In normal, imprecise conversation with other people who work with data, we all seem to think of “overfitting” as a phenomenon whereby the point estimate of a parameter is influenced too strongly by the data. There is a relationship between this concept and the difference between in-sample and out-of-sample error for point predictions, but that is not among the metrics that leap to mind when I think of how to quantify overfitting. Indeed it seems like a weird way to think about it.
You start with a prior. You collect some data and (for a continuous parameter or parameters) your posterior estimate moves towards the parameter values that best fit the data. In general, if they move too far then you’re overfitting, and if they don’t move far enough then you’re underfitting. I do not think it’s true that Bayesian methods always overfit in this sense.
What am I missing?
Phil:
I don’t think the concept of overfitting has any general definition. Yuling is defining overfitting as when the within-sample prediction error is smaller than the out-of-sample prediction error in expectation, but then, as Aki says, if that’s your definition, then all fitting is overfitting. The factor of 2 thing is interesting to me, because one might hope that, after accounting for posterior uncertainty, the expected prediction errors would line up, but they don’t: an additional correction is needed.
I agree with you that not all Bayesian methods overfit. A helpful example, perhaps, is to think of a noisy time series that is being fit by a spline. “Overfitting” would correspond to a super-wiggly line that tracks the data too closely, and we could tell it overfits because of its poor cross-validation performance (assuming we have enough data so the cross-validation estimate is not itself too noisy). You could also fit a curve that isn’t wiggly enough, and it would underfit.
So maybe this is a useful step forward, if we can define underfitting as well as overfitting. I’m sure there’s a literature on this based on complexity of the model. It’s actually kinda related to the first statistics article I ever published.
Please please let’s not define “overfitting” in a way that is contrary to conventional conversational usage. Statistics made that mistake with “significant” and look at all the harm that has done.
OK, I clarified above.
I like to think about underfitting and overfitting of a model in terms of bias-variance decomposition, which states that the mean squared error of predicting a new point using a model trained on a random training set can be broken down into the sum of squared bias, variance, and irreducible error. In this case, the model is underfitting if the squared bias is large compared to the irreducible error, and overfitting if the variance is large compared to the irreducible error. Naturally, the definition of what “large” means remains up to the researcher, but in any case, the emphasis is on relative rather than absolute values of bias and variance.
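To make the decomposition concrete in the one-data-point setting of the post, here is a minimal sketch. It assumes a fixed true value a = 1.5 and a shrinkage estimate r*y with r = 0.8; those numbers and the function name bias_variance_check are hypothetical choices for illustration.

```python
import random

def bias_variance_check(a=1.5, r=0.8, n_sims=200_000, seed=4):
    """Compare squared bias + variance + irreducible error with a Monte
    Carlo estimate of the mean squared error of predicting a new point
    using the shrinkage estimate r*y (illustrative setup only)."""
    rng = random.Random(seed)
    sq_bias = ((r - 1.0) * a) ** 2   # (E[r*y] - a)^2
    variance = r ** 2                # Var(r*y) when y ~ normal(a, 1)
    irreducible = 1.0                # Var(y_new | a)
    total = 0.0
    for _ in range(n_sims):
        y = rng.gauss(a, 1.0)        # random "training set" of size one
        y_new = rng.gauss(a, 1.0)    # new point to predict
        total += (r * y - y_new) ** 2
    return sq_bias, variance, irreducible, total / n_sims

sq_bias, variance, irreducible, mse_sim = bias_variance_check()
print(sq_bias + variance + irreducible, mse_sim)
```

Both printed numbers should be close to 1.73. Moving r toward 0 raises the squared bias and lowers the variance, and moving r toward 1 does the reverse.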
I’m still having trouble with the idea of “in sample prediction error”. There’s nothing to “predict” in sample. You know y0 and if asked to predict it you say “it’s y0” and you have zero error.
You could ask for the expected error in the parameter a after seeing y0 maybe.
Out of sample prediction, like prediction of y1 will have two components. First there’s uncertainty in the parameter a, and then there’s uncertainty in the value that you get even if you know a exactly.
The prediction error for y1, the next as yet unseen sample, is a perfectly reasonable quantity. To say that it’s always too small (which is what I think “always overfit” means) seems to be wrong.
> I’m still having trouble with the idea of “in sample prediction error”. There’s nothing to “predict” in sample. You know y0 and if asked to predict it you say “it’s y0” and you have zero error.
In the context of statistical models “predict” usually means “according to the model”. The difference between the observation and the prediction is often called residual.
In example 1 here the prediction is equal to the observation (and it’s hard to understand what Andrew means when he says that “Bayes doesn’t just give a point estimate; it gives a posterior distribution, p(a|y). Thus the relevant predictive summary here is the expected within-sample prediction squared error”.)
However, if you had two unequal observations the prediction would be accurate for neither data point.
Carlos. Suppose we have a regression, y = f(x, q) + error
where x is a covariate, q is a parameter (vector) and y is an outcome
and we have some observations y_i, x_i
Now, we have some prior over q, p(q) and we have some posterior:
p(q | {(x1,y1),(x2,y2)…(xn,yn)})
You could ask either of the following questions:
what is the posterior “prediction” for y3?
what is the posterior prediction for yn+1 given that xn+1 = x3?
These are *two different things*. The first one is hardly a “prediction” it’s just “tell me the number in your dataset for y3”. The second is a prediction, it asks for what you think the new unobserved yn+1 value would be, given that it has the same x value as the (x3,y3) pair.
I call the second one a prediction, and the first one not a prediction.
If you ask “what is the difference between what you’d predict for yn+1 given that its x value was x3, compared to the observed y3 value” then yes, you’ll have less uncertainty about this difference than you’d have about the difference between yn+1 (actual, as yet unobserved) and your prediction for yn+1, because yn+1 is as yet unobserved, whereas y3 was actually observed and so there’s no uncertainty about its value.
This hardly seems like “overfitting”: the reduced uncertainty is because right there in your dataset you have a lot more information about y3 than you do about yn+1.
Andrew: Thanks for making these illustrations and writing this post.
The reason that I somehow used “overfitting” to mean a positive generalization gap is that I started in the context of model averaging, in which a positive generalization gap makes model averaging prone to overfitting the in-sample performance.
So a question from the non-math inclined… and really i should probably just be quiet and go read up some more until i actually understand what you are writing… but here goes…
it seems that the problem arises because non-mechanistic models are used for essentially thresholded or stochastic data. Is this correct?
So isn’t the problem one of all non-mechanistic or statistical or regressions that do not lead to mechanistic models. No matter how a curve is fit, if it doesn’t try to use the actual drivers in the data as factors, it just will not fit well outside of a phenomenon that presents very smooth data?