Overfitting

Ilya Esteban writes:

In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best.

While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models) before coming up with a “final” model. When applying Bayesian modeling to real-world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as with non-Bayesian models? If you try several different priors/hyperpriors with your model and then simply choose the values and distributions that produce the best “fit” for your data, you are probably just choosing the most optimistic, rather than the most accurate, model.

My reply: I have no magic answer, but in general I prefer continuous model expansion to discrete model choice or averaging. Also, I think many of the worst problems with overfitting arise when using least squares or other flat-prior methods. A structured prior should help. More generally, this is a big open problem in statistics and AI: how to make one’s model general enough to let the data speak but structured enough to hear what the data have to say.
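A minimal simulation sketch of that flat-prior point, with ridge regression standing in for a normal prior (the sample sizes, effect sizes, and penalty below are made up for illustration):

```python
# Least squares (flat prior) vs. a normal prior (ridge) on a regression with
# many weak predictors; values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 40                          # few observations, many predictors
beta = rng.normal(0, 0.1, size=p)      # many small true effects
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_train = X_train @ beta + rng.normal(0, 1, size=n)
y_test = X_test @ beta + rng.normal(0, 1, size=n)

def fit(X, y, lam):
    # lam = 0 is ordinary least squares (flat prior); lam > 0 is the posterior
    # mean under a zero-centered normal prior on the coefficients.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 10.0]:
    b = fit(X_train, y_train, lam)
    in_err = np.mean((y_train - X_train @ b) ** 2)
    out_err = np.mean((y_test - X_test @ b) ** 2)
    print(f"lam={lam:5.1f}  in-sample MSE={in_err:.2f}  out-of-sample MSE={out_err:.2f}")
```

With these made-up numbers, the flat-prior fit looks better in sample and worse out of sample, which is the overfitting pattern described above.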

To which Esteban writes:

I agree with the idea of continuous model expansion. However, it still leaves open questions such as “do I give the variance prior a uniform, a half-normal, or a half-Cauchy distribution?” These questions have to be answered discretely, rather than by continuous expansion, which in turn calls for an objective criterion for selecting among the alternate model formulations.

My reply: In theory, this can still be done using model expansion (for example, the half-t family includes uniform, half-normal, and half-Cauchy as special cases) but in practice, yes, choices must be made. But I don’t know that I need an objective criterion for choosing; I think it could be enough to have objective measures of prediction error and to use these along with general understanding to pick a model.
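A quick numerical check of that half-t claim, using scipy with arbitrary test points (the uniform arises in the limit of infinite scale, an improper limit not checked here):

```python
# Check that the half-t family contains the half-Cauchy (nu = 1) and, in the
# limit, the half-normal (nu -> infinity) as special cases.
import numpy as np
from scipy import stats

def half_t_pdf(x, nu, scale=1.0):
    # density of |T| where T ~ Student-t(nu) with the given scale
    return 2.0 * stats.t.pdf(x, df=nu, scale=scale)

x = np.linspace(0.0, 5.0, 6)
print(np.allclose(half_t_pdf(x, nu=1), stats.halfcauchy.pdf(x)))              # True
print(np.allclose(half_t_pdf(x, nu=1e6), stats.halfnorm.pdf(x), atol=1e-4))   # True (approximately)
```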

11 thoughts on “Overfitting”

  1. Stupid question – what is the definition of “continuous model expansion”?

    One thing I like about hierarchical models is that the parameters are all constrained and coupled through distributional relationships, although that doesn’t solve everything. On a related note, are there any good references that discuss the effective number of degrees of freedom of a model whose parameters are tied together in a hierarchical model?

    I understand the argument about subjective model choices; it’s certainly reasonable in cases where the goal is to obtain descriptive characterizations or predictions. However, I suspect it’s not going to satisfy objectivist purists in the context of confirmatory studies, who probably consider investigator degrees of freedom an enabler of confirmation bias.

  2. I think “continuous model expansion” means that instead of choosing between some discrete set of possibilities for your model, you create a parameter which continuously interpolates between the various choices. Andrew gives the example in the post of the half-t family, where the parameter is the degrees of freedom: with 1 degree of freedom you have the half-Cauchy, with infinite degrees of freedom you have the half-normal, and anything in between is something between these two extremes. The degrees of freedom need not be an integer (that would be discrete model expansion); you can simply let it be a continuous variable and get a smooth transition between the different shapes.
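A small sketch of that interpolation, with illustrative (including non-integer) values of nu: the upper tail of the half-t moves smoothly from the half-Cauchy toward the half-normal.

```python
# 99th percentile of the half-t as the (possibly non-integer) degrees of
# freedom vary; nu values are arbitrary.
from scipy import stats

for nu in [1.0, 2.5, 7.0, 30.0, 1e6]:
    # P(|T| <= q) = 0.99 corresponds to P(T <= q) = 0.995 by symmetry
    q99 = stats.t.ppf(0.995, df=nu)
    print(f"nu={nu:>9}: 99th percentile of the half-t = {q99:6.2f}")
```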

  3. This makes me curious. Does slight forgetting over time work to any degree?

    I ask since Bayesian models are what genomes use when they learn from their environment, with the prior alleles as hypotheses about the environment that get transformed into posteriors by differential reproduction. They don’t overtrain (if that is meaningful in this context), nor are they even guaranteed to hit their local optima, but this relies on there being environmental change, though not too much of it.

    What genomes do experience is that unfixed alleles get trashed over time by variation, meaning that the genome gradually forgets about earlier environments, a.k.a. training sets.

    • It isn’t so much that “genomes use Bayesian models” as that Bayesian updating is a special case of replicator dynamics: http://arxiv.org/pdf/0901.1342.pdf. Aside from the unnecessary anthropomorphizing, it gets the goodwill-by-association backwards, I think: evolution isn’t clever because it uses Bayes; Bayes is pretty sweet in that it somewhat mimics a pretty sweet natural “learning” mechanism. So to answer your question: there are replicator “strategies” that do things Bayes cannot, and I think, as a matter of definition, forgetting is one of these (doctrinaire Bayesians proudly condition on all information).

      • R:

        Bayesians can “forget” by embedding their problems into a hierarchical model in which there is variation in the parameters between “yesterday” and “tomorrow.” The more variation, the more forgetting. This is very much related to the recent discussion on this blog of the use of hierarchical models for generalization from data A to predictions B. (A small numerical sketch of this idea appears after this thread.)

        • Andrew:

          Sure, and this is what I took Cosma’s paper to be about (in fact, he talks about consistency of non-parametric Bayes in terms of forgetting in the blog summary of his paper http://cscs.umich.edu/~crshalizi/weblog/601.html). But, to pick a simpler example, one could use a contamination model which is a mixture of a normal distribution and a t-distribution, say, to deal with outliers. Or we could Winsorize our sample. Both tactics try to deal with the same problem and have the same conceptual strategy, but I’d still resist saying that fitting a Bayesian latent-variable contamination model with fat tails is equivalent to Winsorizing, not least because one is so much more demanding to implement than the other! (A rough numerical sketch of both tactics appears after this thread.) My concern is that to suggest too strongly otherwise is to encourage the misperception that Bayesian methods are some kind of magic bullet.

        • Yup, no magic bullets here. Just because a method can be interpreted as Bayesian, it doesn’t mean that any particular Bayesian solution will do what we want.
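A minimal numerical sketch of the “forgetting via variation between yesterday and tomorrow” idea from the thread above, using a one-dimensional local-level model; the data, innovation scales, and the shift point are all made up for illustration.

```python
# Local-level model: theta_t = theta_{t-1} + N(0, tau^2), y_t = theta_t + N(0, 1).
# tau = 0 weights all past data equally; larger tau discounts older data ("forgets").
import numpy as np

def filter_mean(y, tau, sigma=1.0, m0=0.0, v0=1e6):
    # One-dimensional Kalman filter for the local-level model above.
    m, v = m0, v0
    for yt in y:
        v = v + tau**2                # drift step: inflate uncertainty ("forget")
        k = v / (v + sigma**2)        # gain on the new observation
        m = m + k * (yt - m)          # updated mean after seeing y_t
        v = (1 - k) * v               # updated variance
    return m

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])  # level shifts halfway
for tau in [0.0, 0.3, 1.0]:
    print(f"tau={tau:.1f}  final estimate={filter_mean(y, tau):.2f}  (recent true level = 3)")
```

With tau = 0 the estimate is dragged toward the average of all the data; with larger tau it tracks the recent level, which is the kind of forgetting described above.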
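And a rough sketch of the Winsorizing-versus-contamination-model comparison from the same thread; the contamination fraction, mixture weight, scales, and degrees of freedom are fixed, made-up values, and a grid search over the location stands in for a full Bayesian fit.

```python
# Compare a plain mean, a Winsorized mean, and a location estimate under a
# normal + t contamination mixture, on data with a few gross outliers.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(5, 1, 95), rng.normal(30, 5, 5)])  # 5% gross outliers

def mixture_loglik(mu, y, w=0.9, df=3, wide_scale=10.0):
    # log p(y | mu) under w * Normal(mu, 1) + (1 - w) * scaled t_df centered at mu
    dens = w * stats.norm.pdf(y, loc=mu, scale=1.0) \
        + (1 - w) * stats.t.pdf(y, df=df, loc=mu, scale=wide_scale)
    return np.sum(np.log(dens))

grid = np.linspace(0, 10, 1001)
mu_mix = grid[np.argmax([mixture_loglik(m, y) for m in grid])]
y_wins = np.asarray(winsorize(y, limits=[0.05, 0.05]))

print("plain mean:       ", round(float(np.mean(y)), 2))
print("Winsorized mean:  ", round(float(np.mean(y_wins)), 2))
print("mixture estimate: ", round(float(mu_mix), 2))
```

Both robust tactics pull the estimate back toward the bulk of the data, but, as the comment says, they are not the same procedure and differ a lot in implementation effort.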

  4. The earlier statement that “Just because a method can be interpreted as Bayesian, it doesn’t mean that any particular Bayesian solution will do what we want” touches on the core of what I raised in my original question. If, during the course of modeling the data, the researcher goes through multiple Bayesian model definitions and picks the one that seems to work best, that process is very much a classical multiple-comparisons problem, with all the associated issues of overfitting and bias-variance tradeoffs, something Bayesians don’t generally think they need to worry about.

    • I don’t think that Bayesians argue that multiple comparisons are never a problem; the argument is that one can mitigate the problem substantially by coupling a large set of parameter estimates through domain-specific distributional constraints. Compared to the common multiple-comparison adjustments, which lose power because of data-free assumptions about the null structure, the empirical Bayes approach to reducing the degrees of freedom is usually more efficient, since you’re using data to inform relationships between the parameters (I think even if there is not a strong relationship between parameters, Stein’s phenomenon will still help; a small simulation sketch of this shrinkage effect appears after this thread). Overfitting can still be a problem and should be checked, but the hierarchical constraints are also helping to reduce the degrees of freedom. These advantages can probably be realized in a frequentist framework which utilizes distributional constraints and shrinkage, but Bayesian data analysis is simply a straightforward approach to doing it.

      You can certainly still run into problems with poor modeling assumptions. For example, you can still have a Bayesian model that does no pooling and is effectively as bad as separate hypothesis tests, or you can assume distributions that underestimate the variance.

      However, there’s simply no way of doing science under any framework that does what we _really_ want – which is to guarantee that a model or conclusion is correct. But then subsequent falsification of a previously believed hypothesis or model isn’t a bad thing; it’s scientific progress.
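A small simulation sketch of the shrinkage point in the reply above, with James-Stein shrinkage toward the grand mean standing in for a hierarchical or empirical-Bayes estimate; the number of groups, spread of true effects, and noise level are made up.

```python
# Many noisy group-level estimates: raw estimates vs. shrinkage toward the grand mean.
import numpy as np

rng = np.random.default_rng(3)
k, sigma = 50, 1.0
theta = rng.normal(0, 0.5, size=k)           # true group effects, modest spread
y = theta + rng.normal(0, sigma, size=k)     # one noisy estimate per group

ybar = y.mean()
s2 = np.sum((y - ybar) ** 2)
shrink = max(0.0, 1.0 - (k - 3) * sigma**2 / s2)   # positive-part James-Stein factor
theta_js = ybar + shrink * (y - ybar)

print("total squared error, raw estimates:   ", round(float(np.sum((y - theta) ** 2)), 1))
print("total squared error, shrunk estimates:", round(float(np.sum((theta_js - theta) ** 2)), 1))
```

With these made-up numbers the shrunk estimates have substantially lower total error than the raw ones.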

