Jim Hammitt (director of the Harvard Center for Risk Analysis) had a question/comment about my paper, Estimating the probability of events that have never occurred: when is your vote decisive? (written with Gary King and John Boscardin, published in the Journal of the American Statistical Association).

The paper focused on the problem of estimating the probability that your single vote could be decisive in a Presidential election. There have been only 50-some elections, so this probability can’t simply be estimated empirically. On the other hand, political scientists and economists had a history of estimating this sort of probability purely theoretically, using models such as the binomial distribution. These theoretical models didn’t give sensible answers either.

In our paper we recommended a hybrid approach, using a theoretical model to structure the problem but using empirical data to estimate some of the key components. We suggested that this is potentially a general approach: estimate the probability of very rare events by empirically estimating the probability of more common “precursor” events, and then using a model to go from the probability of the precursor to the probability of the event in question.
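To make the precursor idea concrete, here is a minimal sketch with made-up numbers (the counts and the 10,000-vote threshold below are hypothetical, not from the paper): estimate the precursor probability empirically, then use a simple model for the conditional step.

```python
# Hybrid "precursor" approach, sketched with hypothetical numbers.
# Instead of estimating P(exact tie) directly from zero observed ties,
# estimate the probability of a more common precursor event empirically,
# then model the step from precursor to rare event.

# Precursor: a state's margin falls within 10,000 votes of a tie.
# Suppose (hypothetically) 3 of 200 observed state-level races were that close.
n_races = 200
n_close = 3
p_precursor = n_close / n_races  # empirical estimate: 0.015

# Model step: given a margin within +/-10,000 votes and no further
# information, treat the exact margin as roughly uniform over the
# 20,001 possible values, so P(exact tie | precursor) is about 1/20001.
p_tie_given_precursor = 1 / 20_001

# Combine the empirical and model-based pieces.
p_tie = p_precursor * p_tie_given_precursor
print(f"Estimated P(exact tie) = {p_tie:.2e}")
```

The design point is that the data enter where observations are plentiful (the precursor frequency), and the model carries the load only for the final, unobservable step.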

But Jim is skeptical. He writes:

This paper had a number of provocative comments which I’m not sure I agree with. I was especially interested in your (that’s a plural you) comments about the merits of statistical models vs. logical (data-free?) models. I guess the discussion was really in the context of the problem you address there, where it appears that the logical models that had been used weren’t really that logical or well thought out. In a broader context, I’m interested in use of data vs. theory and logic, especially when data are limited (e.g., we can’t observe very low probabilities of harm in a modest sized sample). If you are just saying that some data beat no data, that doesn’t seem very earth-shaking. Another comment I wondered about was the implication that it is hard to estimate the probability of an event that hasn’t happened (I think that was in the title, even). If we have a well-understood process (like multiple Bernoulli trials with unknown probability of success), then zero successes is conceptually no different than some positive number of successes – i.e., we can estimate confidence intervals for p, construct a posterior given a prior, etc.

My answer is that I’d definitely prefer a method that allows data to enter somewhere. If the number of counts is zero, then one can’t really get a good confidence interval without some prior information. For example, zero ties out of 50 Presidential elections: should the simple Bayes estimate really be 1/52? In this case, I’d rather put in other information by modeling precursors (e.g., the probability that a state is within 10,000 votes of a tie) than treat it as some sort of binomial model with a prior distribution.
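For the record, that 1/52 is Laplace’s rule of succession: a uniform Beta(1, 1) prior updated with zero successes in 50 trials. A two-line check:

```python
# Laplace's rule of succession: a uniform Beta(1,1) prior on p, updated
# with s = 0 ties in n = 50 elections, gives a Beta(1, 51) posterior
# whose mean is (s + 1) / (n + 2) = 1/52.
n, s = 50, 0
posterior_mean = (s + 1) / (n + 2)
print(posterior_mean)  # 0.019230769230769232, i.e. 1/52
```

The point of the objection is not that this number is incomputable but that it leans entirely on the arbitrary uniform prior, with no real information about elections entering anywhere.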

Similarly with environmental hazards: I’d assume that it would be possible to get empirical estimates of the probabilities of various “near-miss” events, and then you’d have to resort to mathematical modeling to extrapolate to the probabilities of the really rare events.
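One way such an extrapolation might look, in a deliberately simplified sketch (all thresholds and counts below are hypothetical): estimate exceedance frequencies at severity levels that are actually observed, fit a simple exponential tail, and read off the implied frequency at a severity far beyond the data.

```python
# Hypothetical sketch: extrapolate from observable "near-miss" frequencies
# to a rare severe event, assuming an exponential tail.
import math

# Hypothetical counts of events exceeding each severity threshold,
# out of 1,000 monitored periods.
thresholds = [1.0, 2.0, 3.0]
counts = [200, 40, 8]
n_periods = 1000
rates = [c / n_periods for c in counts]

# Fit log-frequency as a linear function of severity (a two-point fit
# through the endpoints; with these numbers each unit of severity cuts
# the rate by a factor of 5, so the slope is -log 5).
slope = (math.log(rates[-1]) - math.log(rates[0])) / (thresholds[-1] - thresholds[0])
intercept = math.log(rates[0]) - slope * thresholds[0]

# Extrapolate to a severe event at threshold 6.0, well beyond the data.
p_rare = math.exp(intercept + slope * 6.0)
print(f"Extrapolated P(severity > 6) = {p_rare:.1e}")
```

Everything up to the fitted line is empirical; the last step is pure model, which is exactly the division of labor being proposed, and also where the estimate is most fragile.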

Does this make sense?