Early stopping and penalized likelihood

Maximum likelihood gives the best fit to the training data but in general overfits, yielding overly noisy parameter estimates that don't perform so well when predicting new data. A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data.

This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins.
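
For concreteness, here is a minimal sketch of iterative proportional fitting for a two-way table (my own toy example, not taken from Deming and Stephan): starting from the observed table, alternately rescale rows and columns until the margins match the known targets.

```python
import numpy as np

def ipf(table, row_targets, col_targets, n_iter=1000, tol=1e-10):
    """Iterative proportional fitting: rescale a two-way table so that
    its row and column sums match the given target margins."""
    fitted = table.astype(float)
    for _ in range(n_iter):
        # Rescale each row to match its target margin
        fitted *= (row_targets / fitted.sum(axis=1))[:, None]
        # Rescale each column to match its target margin
        fitted *= (col_targets / fitted.sum(axis=0))[None, :]
        if np.allclose(fitted.sum(axis=1), row_targets, atol=tol):
            break  # rows still match after the column step: converged
    return fitted

# Made-up data: a 2x3 table of counts and known margins (which must agree in total)
seed = np.array([[10., 20., 30.],
                 [40., 50., 60.]])
fitted = ipf(seed,
             row_targets=np.array([100., 110.]),
             col_targets=np.array([60., 70., 80.]))
print(fitted.sum(axis=1), fitted.sum(axis=0))  # margins now match the targets
```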

In any case, the trick is to stop at the right point: not so soon that you're ignoring the data, but not so late that you end up with something too noisy. Here's an example of what you might want:

[Plot omitted.]

The trouble is, you don’t actually know the true value so you can’t directly use this sort of plot to make a stopping decision.

Everybody knows that early stopping of an iterative maximum likelihood estimate is approximately equivalent to maximum penalized likelihood estimation (that is, the posterior mode) under some prior distribution centered at the starting point. By stopping early, you're compromising between the prior (the starting point) and the data (the maximum likelihood estimate).
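
As a toy illustration of this equivalence (my own made-up example): for the normal-means model y_i ~ N(theta_i, 1), gradient ascent on the log-likelihood started at theta = 0 and stopped after t steps gives exactly the posterior-mode estimate under a normal prior centered at 0, with the prior precision determined by the step size and by t.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=5)          # toy data: y_i ~ N(theta_i, 1)

step, t = 0.1, 20                         # step size and early-stopping iteration

# Early stopping: gradient ascent on the log-likelihood, starting at theta = 0
theta = np.zeros_like(y)
for _ in range(t):
    theta += step * (y - theta)           # gradient of sum_i log N(y_i | theta_i, 1)

# Matching penalized likelihood: maximize loglik - (lam/2)*||theta||^2,
# i.e. the posterior mode under a normal prior centered at the starting point
shrink = 1 - (1 - step) ** t              # shrinkage factor implied by stopping at t
lam = (1 - shrink) / shrink               # corresponding prior precision
theta_pen = y / (1 + lam)

print(np.max(np.abs(theta - theta_pen)))  # ~0: the two estimates coincide
```

Stopping later (larger t) corresponds to a weaker penalty, and in the limit you recover plain maximum likelihood.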

Early stopping is simple to implement, but I prefer thinking about the posterior mode because I can better understand an algorithm if I can interpret it as optimizing some objective function.

A key question, though, is where to stop. Different rules correspond to different solutions. For example, one appealing rule for maximum likelihood is to stop when the chi-squared discrepancy between data and fitted model falls below some preset level, such as its unconditional expected value. This would head off some of the more extreme varieties of overfitting. Such a procedure, however, is not the same as penalized maximum likelihood with a fixed prior; it represents a different way of setting the tuning parameter.
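
Here is a rough sketch of that kind of rule, again in a made-up normal-means setting (an illustration of the general idea, not of any particular published procedure): keep iterating until the chi-squared discrepancy between the data and the current fit drops to its expected value, which here is just the number of observations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
theta_true = rng.normal(0.0, 1.0, size=n)
y = rng.normal(theta_true, 1.0)          # y_i ~ N(theta_i, 1)

step = 0.05
theta = np.zeros(n)                      # start at the prior guess theta = 0
for it in range(1000):
    chi2 = np.sum((y - theta) ** 2)      # chi-squared discrepancy, expected value n
    if chi2 <= n:
        break                            # stop: the fit is "close enough" to the data
    theta += step * (y - theta)          # otherwise take one more iterative step

print(it, chi2)                          # iteration at which the rule said to stop
```

Because the data determine when the rule triggers, the implied amount of shrinkage varies from dataset to dataset, which is why this is not equivalent to a fixed prior.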

I discussed this in my very first statistics paper, “Constrained maximum entropy methods in an image reconstruction problem” (follow the link above). The topic of early stopping came up in conversation not long ago and so I think this might be worth posting. The particular models and methods discussed in this article are not really of interest any more, but I think the general principles are still relevant.

The key ideas of the article appear in section 5 on page 433.

Non-statistical content Continue reading

Philosophy and the practice of Bayesian statistics

Here’s an article that I believe is flat-out entertaining to read. It’s about philosophy, so it’s supposed to be entertaining, in any case.

Here’s the abstract:

A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science.

Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.

Here’s the background: Continue reading

“Standard textbooks on chemistry do not discuss subjectivity in their introductions, and so statistical textbooks need not to do that either”

Kevin Spacey famously said that the greatest trick the Devil ever pulled was convincing the world he didn’t exist. When it comes to The Search for Certainty, a new book on the philosophy of statistics by mathematician Krzysztof Burdzy, the greatest trick involved was getting a copy into the hands of Christian Robert, who trashed it on his blog and then passed it on to me.

The flavor of the book is given by this quotation from the back cover: “Similarly, the ‘Bayesian statistics’ shares nothing in common with the ‘subjective philosophy of probability.’” We actually go on and on in our book about how Bayesian data analysis does not rely on subjective probability, but . . . “nothing in common,” huh? That’s pretty strong.

Rather than attempt to address the book’s arguments in general, I will simply do two things. First, I will do a “Washington read” (as Yair calls it) and see what Burdzy says about my own writings. Second, I will address the question of whether Burdzy’s arguments will have any effect on statistical practice. If the answer to the latter question is no, we can safely leave the book under review to the mathematicians and philosophers, secure in the belief that it will do little mischief. Continue reading

New book on Bayesian nonparametrics

Nils Hjort, Chris Holmes, Peter Muller, and Stephen Walker have come out with a new book on Bayesian Nonparametrics. It’s great stuff, and it makes me realize how ignorant I am of this important area of statistics. Here are the chapters:

0. An invitation to Bayesian nonparametrics (Hjort, Holmes, Muller, and Walker)

1. Bayesian nonparametric methods: motivation and ideas (Walker)

2. The Dirichlet process, related priors and posterior asymptotics (Subhashis Ghosal)

3. Models beyond the Dirichlet process (Antonio Lijoi and Igor Prunster)

4. Further models and applications (Hjort)

5. Hierarchical Bayesian nonparametric models with applications (Yee Whye Teh and Michael I. Jordan)

6. Computational issues arising in Bayesian nonparametric hierarchical models (Jim Griffin and Chris Holmes)

7. Nonparametric Bayes applications to biostatistics (David Dunson)

8. More nonparametric Bayesian models for biostatistics (Muller and Fernando Quintana)

I have a bunch of comments, mostly addressed at some offhand remarks about Bayesian analysis made in chapters 0 and 1. But first I’ll talk a little bit about what’s in the book. Continue reading

Bayes, Jeffreys, prior distributions, and the philosophy of statistics

Christian Robert, Nicolas Chopin, and Judith Rousseau wrote this article that will appear in Statistical Science with various discussions, including mine.

I hope those of you who are interested in the foundations of statistics will read this. Sometimes I feel like banging my head against a wall, in my frustration at trying to communicate with Bayesians who insist on framing problems in terms of the probability that theta=0 or other point hypotheses. I really feel that these people are trapped in a bad paradigm and, if they would just think things through based on first principles, they could make some progress. Anyway, here’s what I wrote:

I actually own a copy of Harold Jeffreys’s Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng, and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin, and Rousseau as a platform for further discussion of foundational issues.

In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys’s principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys’s preference for simplicity; and (3) a key generalization of Jeffreys’s ideas is to explicitly include model checking in the process of data analysis.

Continue reading

Richard Berk’s book on regression analysis

I just finished reading Dick Berk’s book, “Regression analysis: a constructive critique” (2004). It was a pleasure to read, and I’m glad to be able to refer to it in our forthcoming book. Berk’s book has a conversational format and talks about the various assumptions required for statistical and causal inference from regression models. I was disappointed that the book used fake data: Berk discussed a lot of interesting examples but then didn’t follow up with the details. For example, Section 2.1.1 brought up the Donohue and Levitt (2001) example of abortion and crime, and I was looking forward to Berk’s more detailed analysis of the problem, but he never returned to the example later in the book. I would have learned more about Berk’s perspective on regression and causal inference if he had applied it in detail to some real-data examples. (Perhaps in the second edition?)

I also had some miscellaneous comments: Continue reading

One more time on Bayes, Popper, and Kuhn

There was a lot of fascinating discussion on this entry from a few days ago. I feel privileged to be able to get feedback from scientists with different perspectives than my own. Anyway, I’d like to comment on some things that Dan Navarro wrote in this discussion, not to pick on Dan but because I think his comments, and my responses, may highlight some different views about what is meant by “Bayesian inference” (or, as I would prefer to say, “Bayesian data analysis,” to include model building and model checking as well as inference).

So here goes . . . Continue reading