Thinking about statistical modeling, overfitting, and generalization error

Allister Bernard writes:

I recently came across some research on generalization error and deep learning (references below). These papers explore how generalization error in deep neural networks improves as model capacity increases, which is contrary to what one would expect from the bias-variance tradeoff. I assumed this improvement in such overparameterized models was the effect of regularization (implicit and/or explicit). However, Zhang et al. show that regularization is highly unlikely to be the source of these gains.

References:
Zhang et al. https://cacm.acm.org/magazines/2021/3/250713-understanding-deep-learning-still-requires-rethinking-generalization/fulltext
Nakkiran et al. https://openai.com/blog/deep-double-descent/
Belkin et al. https://www.pnas.org/content/pnas/116/32/15849.full.pdf

Your note on the most important statistical ideas of the past 50 years highlights the gains achieved with overparameterized models (and regularization). It has worried me that all the hype around deep learning seems to gloss over how overparameterized these models have become. This is not to diminish the gains these models have made in a number of fields, especially image recognition and NLP; those achievements are truly wonderful.

Here are my two questions:

1. I am curious whether there is any work from the statistics community on why we see this improvement in generalization error. Most of the research I have seen is from the ML/CS community. Belkin et al. point out that this behavior is observed in other types of overparameterized models, like random forests.
Another possible explanation is that these improvements may be dependent on the problem domain. Feldman et al. propose a possible reason behind this phenomenon (https://arxiv.org/pdf/2008.03703.pdf).

2. Your blog has highlighted the dangers of the garden of forking paths, and I am curious whether we may have another similar phenomenon here that is not well understood.

From a practical perspective, I wonder if a lot of these tools will get applied to domains where they are not appropriate and end up having effects in the real world (as opposed to the theoretical world). There is currently no reason not to do so, since we don’t understand where these ideas will or will not work. Besides, it is now very easy to use these tools via off-the-shelf packages.

My reply:

I took a look at the first paper linked above, and I don’t quite get what they are doing. In particular, they say, “Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice,” but then they define regularization as follows: “When the complexity of a model is very high, regularization introduces algorithmic tweaks intended to reward models of lower complexity. Regularization is a popular technique to make optimization problems ‘well posed’: when an infinite number of solutions agree with the data, regularization breaks ties in favor of the solution with lowest complexity.” This is not the regularization that I do at all! When I talk about regularization, I’m thinking about partial pooling, or more generally approaches to get more stable predictions. From a Bayesian perspective, regularization is not “algorithmic tweaks”; it’s just inference under a model. Also, regularization is not just about “breaking ties,” which implies some sort of lexicographic decision rule, nor does regularization necessarily lead to estimates with less complexity. It leads to estimates that are less variable, but that’s something different. For example, hierarchical modeling is not less complex (let alone “with lowest complexity”) than least squares, but it gives more stable predictions.
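
To make the contrast concrete, here is a minimal sketch of regularization as partial pooling (the group sizes, means, and variance parameters below are invented, and the within/between variances are treated as known): group means are shrunk toward the grand mean in proportion to how noisy they are. Nothing is made "less complex" and no ties are broken; the small, noisy groups simply get stabilized more than the large ones.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, tau = 5.0, 2.0                         # within-group sd and between-group sd, treated as known
    true_means = rng.normal(50.0, tau, size=8)    # eight group means
    sizes = np.array([3, 5, 2, 40, 4, 6, 2, 30])  # unbalanced group sizes

    groups = [rng.normal(mu, sigma, size=m) for mu, m in zip(true_means, sizes)]
    ybar = np.array([g.mean() for g in groups])   # no-pooling (least-squares) estimates
    grand = np.concatenate(groups).mean()

    # precision-weighted shrinkage toward the grand mean
    w = (sizes / sigma**2) / (sizes / sigma**2 + 1 / tau**2)
    partial = w * ybar + (1 - w) * grand          # partially pooled estimates

    for j in range(len(sizes)):
        print(f"group {j}: n={int(sizes[j]):2d}  raw={ybar[j]:6.2f}  pooled={partial[j]:6.2f}  true={true_means[j]:6.2f}")
    # Small, noisy groups move a lot toward the grand mean; large groups barely move.
    # The model is not "simpler" than separate group means, but its predictions are more stable.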

That said, my above comment is expressed in general terms, and I’m no expert on deep learning or various other machine learning techniques. I’m sympathetic with the general idea of comparing success with training and test data, and I also recognize the challenge of these evaluations, given that cross-validation tests are themselves a function of the available data.

One thing I’ve been thinking about a lot in recent years is poststratification: the idea that you’re fitting a model on data set A and then using it to make predictions in scenario B. The most important concern here might not be overfitting to the data, so much as appropriately modeling the differences between A and B.
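
A toy sketch of that A-to-B step (the cell labels, cell-level predictions, and population shares below are all invented): the cell-level predictions fitted on sample A stay the same, but they are reweighted by scenario B's cell shares rather than the sample's.

    # "Model" fitted on sample A: cell-level predicted outcomes (invented numbers)
    yhat = {"young_urban": 0.62, "young_rural": 0.48, "old_urban": 0.55, "old_rural": 0.35}

    # Cell shares in sample A versus in the target scenario B (invented numbers)
    share_A = {"young_urban": 0.40, "young_rural": 0.10, "old_urban": 0.35, "old_rural": 0.15}
    share_B = {"young_urban": 0.20, "young_rural": 0.25, "old_urban": 0.25, "old_rural": 0.30}

    est_A = sum(yhat[c] * share_A[c] for c in yhat)   # naive: assumes B is composed like A
    est_B = sum(yhat[c] * share_B[c] for c in yhat)   # poststratified to B's composition

    print(f"prediction if B looked like A:   {est_A:.3f}")
    print(f"poststratified prediction for B: {est_B:.3f}")
    # The gap between the two comes from the compositional difference between A and B,
    # not from overfitting within A.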

18 thoughts on “Thinking about statistical modeling, overfitting, and generalization error”

  1. I’m surprised early stopping isn’t considered, since it’s a common regularization technique. They mention double descent in the number of epochs, but I think if you keep running the optimization you generally see another rise in generalization error.

     This is an interesting difference in modeling approaches: with most models we want them to be “fully” fit, whereas with neural networks a fully fit (overparameterized) model is often a bad model.
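
     A bare-bones sketch of early stopping as a regularizer (plain gradient descent on a noisy, overparameterized linear problem; the sizes, noise level, and learning rate are invented): training error keeps falling, and the weights you keep are the ones at the validation-loss minimum.

       import numpy as np

       rng = np.random.default_rng(1)
       n, p = 50, 200                      # fewer observations than parameters
       w_true = np.zeros(p)
       w_true[:5] = rng.normal(0, 2, 5)    # a sparse "true" signal
       X = rng.normal(size=(n, p))
       Xval = rng.normal(size=(n, p))
       y = X @ w_true + rng.normal(0, 1.0, n)
       yval = Xval @ w_true + rng.normal(0, 1.0, n)

       w = np.zeros(p)
       lr = 1e-3
       best_val, best_w, best_step = np.inf, w.copy(), 0
       for step in range(1, 5001):
           w -= lr * (X.T @ (X @ w - y) / n)       # gradient step on training MSE
           val = np.mean((Xval @ w - yval) ** 2)
           if val < best_val:                      # remember the validation-loss minimum
               best_val, best_w, best_step = val, w.copy(), step

       print(f"validation minimum near step {best_step}: val MSE {best_val:.3f}")
       print(f"after all 5000 steps: val MSE {np.mean((Xval @ w - yval) ** 2):.3f}")
       # The stopping time acts as a tuning knob, i.e. a regularizer; how much the
       # fully trained run degrades depends on the problem and the noise level.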

  2. Your blog has highlighted the dangers of the garden of forking paths, and I am curious whether we may have another similar phenomenon here that is not well understood.

    I don’t think forking paths is an issue if you only care about predictive skill (and efficiency, of course).

    The problem arises when you try to give meaning to the coefficients/parameters/weights of your model. Indeed, the root problem with the forking paths (multiverse) is that two models can have essentially equivalent predictive skill but very different specification.

  3. I believe section 10.8 of ISLR2 is relevant to the conversation:

    https://hastie.su.domains/ISLR2/ISLRv2_website.pdf

     The authors claim that the improvement in generalization error of deep neural networks with increased model capacity is NOT contrary to the bias-variance tradeoff. Rather, the “model capacity” measure is not equivalent to “model flexibility” (or any monotonic transformation thereof). As a result, one does not see what one would expect based on the bias-variance tradeoff; for example, you see the “double descent phenomenon” instead of a (roughly) U-shaped test error curve as model capacity increases.

    • The two lower panels of Figure 10.21 show the minimum-norm natural spline fits with d = 42 and d = 80 degrees of freedom. Incredibly, $\hat{f}_{42}(X)$ is quite a bit less wild than $\hat{f}_{20}(X)$, even though it makes use of more degrees of freedom. And $\hat{f}_{80}(X)$ is not much different. How can this be? Essentially, $\hat{f}_{20}(X)$ is very wild because there is just a single way to interpolate n = 20 observations using d = 20 basis functions, and that single way results in a somewhat extreme fitted function. By contrast, there are an infinite number of ways to interpolate n = 20 observations using d = 42 or d = 80 basis functions, and the smoothest of them — that is, the minimum norm solution — is much less wild than $\hat{f}_{20}(X)$!

      This makes sense for interpolation, but when I look at Figure 10.21, the 8-degrees-of-freedom model looks like it will be far superior to the 80-degrees-of-freedom model when the test data (x) may lie outside the range of −5 to 5 found in the training data. (A small numerical version of the minimum-norm point is sketched at the end of this thread.)

      • True. But I’m not sure it’s especially important re: the point that the wrong measure of flexibility gives a graph that departs from the signature U-shape implied by the bias-variance tradeoff. And even if you expected the shape after the second descent to be a U, 80 degrees is to the right of the U’s bottom (see Figure 10.20).

        • I like the bias-variance tradeoff concept, but never looked into the derivation. I would bet it doesn’t assume the test data must come from a range of predictor values totally covered by the training data though.

          My intuition is it is likely another proof along the lines of “one layer is sufficient to approximate any function” (given infinite time and resources).
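
     To see the minimum-norm point from the quoted ISLR passage numerically, here is a small sketch with a Gaussian-bump basis standing in for the book's natural splines (the data, basis width, and values of d are invented): when d > n there are infinitely many coefficient vectors that interpolate the training data, and np.linalg.lstsq returns the minimum-norm one, which tends to be the tamest.

       import numpy as np

       rng = np.random.default_rng(2)
       n = 20
       x = np.sort(rng.uniform(-5, 5, n))
       y = np.sin(x) + rng.normal(0, 0.3, n)
       grid = np.linspace(-5, 5, 400)

       def basis(pts, d, width=1.0):
           # d Gaussian bumps with centers spread over [-5, 5]
           centers = np.linspace(-5, 5, d)
           return np.exp(-(pts[:, None] - centers[None, :]) ** 2 / (2 * width**2))

       for d in (20, 42, 80):
           B = basis(x, d)
           coef, *_ = np.linalg.lstsq(B, y, rcond=None)    # minimum-norm solution when d > n
           resid = np.max(np.abs(B @ coef - y))            # how close to exact interpolation
           wiggle = np.max(np.abs(basis(grid, d) @ coef))  # how extreme the fit gets between points
           print(f"d={d:2d}  max train residual={resid:.1e}  coef norm={np.linalg.norm(coef):9.2f}  max|f|={wiggle:9.2f}")

       # For d > n, adding any null-space component gives another exact interpolant,
       # but with a larger coefficient norm -- the "infinite number of ways to interpolate":
       B = basis(x, 80)
       coef_min, *_ = np.linalg.lstsq(B, y, rcond=None)
       coef_alt = coef_min + (np.eye(80) - np.linalg.pinv(B) @ B) @ rng.normal(size=80)
       print(np.max(np.abs(B @ coef_alt - y)), np.linalg.norm(coef_alt), ">", np.linalg.norm(coef_min))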

  4. Suppose we’re trying to estimate the conditional expectation function of a scalar random variable (y) given another scalar random variable (x). Suppose that the true CEF is globally quadratic but locally sinusoidal, and that our model is the space of polynomial functions of x. (Also assume we have practically unlimited training data available.) If we start with a linear model and expand to a quadratic model, we will see a big improvement (since now the model can match the globally quadratic pattern), but if we expand further to cubic and quartic, we will see degradation (because these models are too flexible relative to the global quadratic, but not nearly flexible enough relative to the local sinusoid). However if we expand to very high order polynomials, we will at some point be in a position to match both the global and local patterns, and so will see renewed improvement. Plotting performance against polynomial order will show a double descent pattern.

     This is obviously not a useful framework for predicting which deep learning strategies will be effective, since that depends on the nature of the underlying function being estimated and the relationship of the model’s function space to it. But I think it’s a helpful just-so story to explain how this type of observation is possible. There may be some gross features of language that simple networks can mimic, and more complex networks don’t improve on (but do add noise). Yet there may also be more detailed features of language which require substantial model complexity to be able to mimic at all. Hence we will see a double descent pattern in the performance v. complexity diagram. I have no idea what these gross v. detailed features are, but that’s not really a statistical question in the end as much as a question about the nature of language. (A small simulation sketch of this setup appears at the end of this thread.)

    • I don’t feel this example holds. The cubic model must do at least as well fitting the training data. To have worse generalization, it would have to fit idiosyncrasies of the training data that don’t generalize. With a near-infinite representative training set, though, it’s all interpolation. In the worst case, the model would just set the cubic term to 0.
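
     For anyone who wants to poke at the parent comment's thought experiment, here is a minimal simulation sketch (the CEF, noise level, sample sizes, and degree grid are all invented). Whether the intermediate degradation actually appears depends on how much training data you use, which is exactly the point of the reply just above; with a very large training set the higher-order coefficients can simply stay near zero.

       import numpy as np
       from numpy.polynomial import chebyshev as C

       rng = np.random.default_rng(3)

       def cef(x):
           # globally quadratic plus a local sinusoidal wiggle
           return 2.0 * x**2 + np.sin(8 * x)

       def sample(n):
           x = rng.uniform(-1, 1, n)
           return x, cef(x) + rng.normal(0, 0.5, n)

       x_tr, y_tr = sample(200)      # try 200 vs. 20000 to see the reply's point about data size
       x_te, y_te = sample(5000)

       for deg in (1, 2, 3, 4, 6, 10, 16, 24, 32):
           coef = C.chebfit(x_tr, y_tr, deg)               # polynomial fit in a Chebyshev basis
           mse = np.mean((C.chebval(x_te, coef) - y_te) ** 2)
           print(f"degree {deg:2d}: test MSE {mse:.3f}")
       # The noise floor is 0.5**2 = 0.25; degrees too low to capture sin(8x) sit above it.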

  5. I wrote about the surprisingly good generalization of deep learning for CACM last summer here. In particular, Belkin observes in some cases a phase transition as the number of parameters is increased, beyond which the prediction error goes down again. Deep learning systems are often ridiculously overparameterized, but the phenomenon probably applies to other methods as well. Some folks found that implicit regularization was probably not the cause.

  6. I feel like a lot of the miscommunication stems from using proxy language instead of what people actually mean. Model complexity means what exactly? Number of parameters? Number of degrees of freedom? Propensity to overfit? Something else?

     I feel like with ridge regression, data scientists define complexity as degrees of freedom so they can say ridge is less complex than OLS (and hence overfits less). But with NNs, many data scientists switch to defining complexity as the number of weights, or the number of layers, or something else.

    Proxy terms often lead to terrible intuitions, as we’ve seen with things like the fat arms or women being worse mentors papers.

     Personally, I view a model’s “complexity” as some ill-defined amount of structure captured. So a multilevel model is more complex than a fixed-effects model, since it adds some group-level structure to the model. Is a fixed-effects model for a dataset with 8 groups less complex than one for an expanded dataset with 10 groups? Not really; it’s still just a fixed-effects model.
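
     To make the ridge half of that concrete, here is a small sketch (invented design matrix) of the standard effective-degrees-of-freedom calculation for ridge regression, df(lambda) = sum_j d_j^2 / (d_j^2 + lambda), where the d_j are the singular values of X: the model keeps all p coefficients at every lambda, but its effective degrees of freedom shrink as lambda grows.

       import numpy as np

       rng = np.random.default_rng(4)
       n, p = 100, 10
       X = rng.normal(size=(n, p))
       d = np.linalg.svd(X, compute_uv=False)    # singular values of the design matrix

       for lam in (0.0, 1.0, 10.0, 100.0, 1000.0):
           # trace of the ridge hat matrix X (X'X + lam I)^{-1} X' = sum_j d_j^2 / (d_j^2 + lam)
           df = np.sum(d**2 / (d**2 + lam))
           print(f"lambda = {lam:7.1f}:  {p} coefficients, effective df = {df:5.2f}")
       # lambda = 0 recovers OLS with df = p; larger lambda keeps all p weights
       # but uses fewer effective degrees of freedom -- one precise sense of "less complex".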

  7. Thank you for posting on an interesting issue. I would like to add my personal viewpoint. It may be important to note that a function defined by a neural network has many singularities in parameter space, caused by the hierarchical structure, so its effective complexity depends strongly on both the data-generating distribution and the sample size. In Bayesian inference, the quantitative complexity that determines the generalization error can be measured by an algebro-geometric index, the real log canonical threshold (RLCT). It is far smaller than half the dimension of the parameter space, which may be a mathematical reason for an intrinsic or implicit regularization. If steepest descent contains random noise, its stochastic process converges to the Bayesian posterior distribution with a uniform prior, so the generalization performance of steepest descent may also be studied from the Bayesian point of view.
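
     For reference, the standard asymptotic statement behind this comment (from Watanabe's singular learning theory, in the realizable case) is roughly

       \[
         \mathbb{E}[G_n] \;=\; \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
         \qquad 0 < \lambda \le \frac{d}{2},
       \]

     where $G_n$ is the Bayes generalization error, $d$ is the dimension of the parameter space, and $\lambda$ is the real log canonical threshold. For regular models $\lambda = d/2$, which recovers the familiar $d/(2n)$; for singular models such as neural networks $\lambda$ can be far smaller, which is one precise sense in which the effective complexity sits below the parameter count.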

    • Anything involving algebro-geometric measures is probably beyond my expertise, but I think what you’re saying is consistent with my personal hypothesis. Specifically, that the overfitted model is making use of information about randomness, either in addition to or instead of any Fisher information in the data. More information means less variance. Put another way, generalization across samples is distinct from inference from samples to populations. Estimating unobservable parameters can only take you so far if what you really care about predicting is other sample distributions drawn from the same population.

      My question for you is this: Can your approach be generalized to models that aren’t overfit? The extreme case being a single sample, with small n, that is well-described by a single parameter. Such a model would not be singular, but the sample would still have observable random data features, and those manifest features should repeat over samples with similar random structures.

  8. I have a paper (currently under review) on what seems to be a relevant topic: the approximation of, and correction for, sample-specific accuracy in an effect size estimate computed on an isolated sample (i.e., without covariates or prior information). The method improves an estimate’s precision by conditioning on a non-probabilistic measure of an observable, random, sample-specific feature of the data. Doing so has the effect of shrinking the estimate’s conditional expected value toward the unconditional, unobservable parameter value.

    To be clear, we’re talking about two distinct information types–one probabilistic, the other non-probabilistic (specifically conceptualized as either combinatorial or algorithmic information). Conventionally, we reduce variance by increasing the Fisher information in the estimate, say, by increasing sample size or by accounting for a covariate. My approach does so by directly decreasing the size of the random discrepancy between estimate and parameter. (Obviously, many caveats and assumptions apply.)

    The point is, there is information in data, not currently used in statistical models, which may be used to change the distribution of MSE over plausible values of the parameter. Indeed, while I’ve focused on an isolated estimate of Pearson’s r–the opposite of an overparameterized model–conditioning r on my non-probabilistic measure redistributes MSE as a flat line over values of rho. So one might hypothesize these overparameterized ML models are just using more information than (the conventional interpretation of) estimation theory would suggest.
