Maximum likelihood gives the beat fit to the training data but in general overfits, yielding overly-noisy parameter estimates that don’t perform so well when predicting new data. A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data.
This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins.
In any case, the trick is to stop at the right point–not so soon that you’re ignoring the data but not so late that you end up with something too noisy. Here’s an example of what you might want:
The trouble is, you don’t actually know the true value so you can’t directly use this sort of plot to make a stopping decision.
Everybody knows that early stopping of an iterative maximum likelihood estimate is approximately equivalent to maximum penalized likelihood estimation (that is, the posterior mode) under some prior distribution centered at the starting point. By stopping early, you’re compromising between the prior and the data estimates.
Early stopping is a simple implementation but I prefer thinking about the posterior mode because I can better understand an algorithm if I can interpret it as optimizing some objective function.
A key question, though, is where to stop? Different rules correspond to different solutions. For example, one appealing rule for maximum likelihood is to stop when the chi-squared discrepancy between data and fitted model is below some preset level such as its unconditional expected value. This would stop some of the more extreme varieties of overfitting. Such a procedure, however, is not the same as penalized maximum likelihood with a fixed prior, as it represents a different way of setting the tuning parameter.
I discussed this in my very first statistics paper, “Constrained maximum entropy methods in an image reconstruction problem” (follow the link above). The topic of early stopping came up in conversation not long ago and so I think this might be worth posting. The particular models and methods discussed in this article are not really of interest any more, but I think the general principles are still relevant.
The key ideas of the article appear in section 5 on page 433.
Non-statistical content Continue reading →