Alan Agresti has written some papers motivating the (y+1)/(n+2) estimate, instead of the raw y/n estimate, for probabilities. (Here we’re assuming n independent tries with y successes.)

The obvious problem with y/n is that it gives deterministic estimates (p=0 or 1) when y=0 or y=n. It’s also tricky to compute standard errors at these extremes, since sqrt(p(1-p)/n) gives zero, which can’t in general be right. The (y+1)/(n+2) formula is much cleaner. Agresti and his collaborators did lots of computations and simulations to show that, for a wide range of true probabilities, (y+1)/(n+2) is a better estimate, and the confidence intervals using this estimate have good coverage properties (generally better than the so-called exact test; see Section 3.3 of this paper for my fulminations against those misnamed “exact tests”).
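To make the boundary problem concrete, here is a minimal Python sketch comparing the two estimates and their plug-in standard errors at y=0, y=n/2, and y=n. The function names are made up for illustration, and pairing the shifted estimate with n+2 in the standard-error denominator is an assumption of this sketch, not something taken from Agresti's papers.

```python
import math

def raw_estimate(y, n):
    """Raw proportion y/n."""
    return y / n

def shifted_estimate(y, n):
    """The (y+1)/(n+2) estimate."""
    return (y + 1) / (n + 2)

def plug_in_se(p, n):
    """Plug-in standard error sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

n = 10
for y in (0, 5, 10):
    p_raw = raw_estimate(y, n)
    p_adj = shifted_estimate(y, n)
    # Assumption: use n + 2 as the effective sample size for the shifted estimate.
    print(f"y={y:2d}  y/n={p_raw:.3f} (se {plug_in_se(p_raw, n):.3f})  "
          f"(y+1)/(n+2)={p_adj:.3f} (se {plug_in_se(p_adj, n + 2):.3f})")
```

At y=0 and y=n the raw standard error collapses to zero, while the shifted version stays strictly positive.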

The only worry is . . .

The only place where (y+1)/(n+2) will go wrong is if n is small and the true probability is very close to 0 or 1. For example, if n=10 and p is 1 in a million, then y will almost certainly be zero, and an estimate of 1/12 is much worse than the simple 0/10.
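The n=10, p = 1 in a million case can be checked exactly rather than by simulation. The sketch below sums over the binomial distribution to get the root-mean-squared error of each estimate. Using RMSE as the yardstick is an assumption here (other loss functions would give different comparisons), and the helper names are hypothetical.

```python
from math import comb, sqrt

def binom_pmf(y, n, p):
    """Binomial probability of y successes in n independent tries."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

def rmse(estimator, n, p):
    """Exact root-mean-squared error of estimator(y, n) when y ~ Binomial(n, p)."""
    mse = sum(binom_pmf(y, n, p) * (estimator(y, n) - p) ** 2
              for y in range(n + 1))
    return sqrt(mse)

def raw(y, n):
    return y / n

def shifted(y, n):
    return (y + 1) / (n + 2)

n = 10
for p in (1e-6, 0.3, 0.5):
    print(f"p={p:g}  rmse(y/n)={rmse(raw, n, p):.5f}  "
          f"rmse((y+1)/(n+2))={rmse(shifted, n, p):.5f}")
```

For p = 1e-6 the error of (y+1)/(n+2) is dominated by the bias of roughly 1/12, which is the point of the example above; other values of p can be substituted to see where the ordering flips.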

However, I doubt that would happen much: if p might be 1 in a million, you’re not going to estimate it with an n=10 experiment. For example, I’m not going to try ten 100-foot golf putts, miss all of them, and then estimate my probability of success as 1/12.

Conclusion

Yes, (y+1)/(n+2) is a better default estimate than y/n.

1. derek says:

Just as statisticians settle on a 95% confidence level as "good enough" by convention, can statisticians agree on a convention that n is too small to justify using the estimate (y+1)/(n+2) if y=0 yields (y+1)/(n+2) greater than x? If so, what value would be good for x? Less than 1/12, you say for example; could you take that lower?

2. Andrew says:

Derek,

I'm not sure. But things do get more complicated when trying to come up with good prior distributions for multiway contingency tables. For example, if you have 32 cells, setting a prior of 1 unit per cell may be overkill. There I think I'd rather set it up as a logistic regression, but I'd have to think harder in the context of a particular example. I know that Alan Zaslavsky did some work in this area a while ago for Census adjustment.

3. Bill says:

Any links/references to the Agresti papers?

4. Prakash says:

(a) I would have thought that (y+1/2)/(n+1/2) would be a good estimator, since adding 1/2 is what we normally do when we have zeros in 2×2 tables (e.g. Haldane, Gart)
(b) Does (y+1)/(n+2) here have anything to do with Laplace's law of succession? Intriguingly, it corresponds exactly to the posterior probability of success given y successes in n trials, and assuming a uniform prior for the probability of success.
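The correspondence in (b) can be verified with exact arithmetic. This is a small sketch, assuming a uniform prior on the success probability: the posterior predictive probability of a success on the next try is the ratio of two Beta integrals, which for integer exponents reduce to factorials. The function names are invented for the example.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Integral of p^a (1-p)^b over [0, 1] for non-negative integers a, b.

    Equals a! b! / (a + b + 1)!.
    """
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def succession_probability(y, n):
    """P(next try succeeds | y successes in n tries), uniform prior on p."""
    return beta_integral(y + 1, n - y) / beta_integral(y, n - y)

# Check the identity against (y+1)/(n+2) for all small cases.
for n in range(0, 8):
    for y in range(0, n + 1):
        assert succession_probability(y, n) == Fraction(y + 1, n + 2)

print(succession_probability(0, 10))  # 1/12
```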

5. Anonymous says:

Borkowf's approach (2006, Statistics in Medicine) augments the observed binomial data with an imaginary failure to compute the lower bound and an imaginary success to compute the upper bound. This seems to work even better than Agresti's approach, as it treats the lower and upper limits differently (in a parallel manner).

6. Joshua says:

I thought Agresti's paper called for (y+2)/(n+4). No, wait, was that Wilson? Either way, I usually hear about "plus four" intervals.
for example.

Agresti comments on the Wilson and other intervals, and suggests appropriate modifications for y near 0 or near n, in a JSTOR paper …
http://www.jstor.org/view/08834237/di020229/02p01

Looks like a good review paper:
http://projecteuclid.org/DPubS?service=UI&version

7. Aleks says:

I view the (y+1)/(n+2) estimate as the posterior predictive probability under the Beta(1,1) prior, which assumes all probabilities to be equally likely a priori. Agresti is essentially selling his prior. It's a good prior, but sometimes it's inappropriate. Instead of debating estimates, why not debate priors?

This particular prior is used a lot in machine learning. Cestnik generalized it to conditional probabilities in 1990 (he was in turn inspired by I. J. Good's 1965 "The Estimation of Probabilities: An Essay on Modern Bayesian Methods"), but as the original publication isn't online, a good starting point is this paper by Cussens: Bayes and Pseudo-Bayes Estimates of Conditional Probabilities and Their Reliability.

8. I tend to associate the various estimates for a binomial proportion with different choices of "noninformative" prior for the proportion.

Beta(1,1)     uniform prior     (y+1)/(n+2)
Beta(.5,.5)   Jeffreys prior    (y+.5)/(n+1)
Beta(0,0)     MLE               y/n

The uniform prior was advocated by Laplace, and Berger cites a Bayesian justification for the improper prior that yields the MLE. The Jeffreys prior is equivalent to the practice of adding 1/2 to the cells in the contingency table (Prakash forgot to add 1/2 to both cells in the denominator above).

Note that Dempster (1966, 1968) analyzes this problem using a vacuous prior and gets a random interval for the unknown proportion. The lower and upper expectations for this interval are y/(n+1) and (y+1)/(n+1), respectively. There is an interesting correspondence with the Borkowf approach cited above. I discuss this in my book Graphical Belief Modeling, but it is spread across several chapters.
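The prior-to-estimate correspondence listed in this comment follows one pattern: with a Beta(a,b) prior, the posterior mean after y successes in n tries is (y+a)/(n+a+b). Here is a minimal sketch using exact fractions. The function names are hypothetical, and the Beta(0,0) row is treated simply as the a=b=0 limit of the formula, since that prior is improper.

```python
from fractions import Fraction

def posterior_mean(y, n, a, b):
    """Posterior mean (y + a) / (n + a + b) under a Beta(a, b) prior."""
    return (Fraction(y) + a) / (Fraction(n) + a + b)

def dempster_expectations(y, n):
    """Lower and upper expectations y/(n+1), (y+1)/(n+1) quoted in the comment."""
    return Fraction(y, n + 1), Fraction(y + 1, n + 1)

priors = {
    "Beta(1,1) uniform": (Fraction(1), Fraction(1)),
    "Beta(1/2,1/2) Jeffreys": (Fraction(1, 2), Fraction(1, 2)),
    "Beta(0,0) limit (MLE)": (Fraction(0), Fraction(0)),
}

y, n = 3, 10
for name, (a, b) in priors.items():
    print(f"{name:24s} {posterior_mean(y, n, a, b)}")

lower, upper = dempster_expectations(y, n)
print(f"{'Dempster lower, upper':24s} {lower}, {upper}")
```

For y=3, n=10 all three posterior means fall between the two Dempster expectations, which is one way to see how the choices relate.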

9. Dani says:

It's very funny to notice how, at the end of the day, "the frequentists" come up with a Bayesian idea (and this is not the first time it has happened), so that putting a prior on data is good.

10. Charles says:

Whoever said theoretical mathematics was unarguably consistent was out of his/her mind. But the world is mad in any case so here goes:

Frequentist statistics are supposedly long-term predictors based on a large number of trials with asymptotic results. It can be seen that y/n, (y+1)/(n+2) [the Laplace estimator], (y+2)/(n+4) [the Agresti-Coull "plus four" estimator], and all the other variants approach the same proportion as the sample size grows and the ratio y/n remains the same. So which one to use? Shouldn't the decision be based on what question is being asked, which drives the theoretical model, as well as which 'fit' gives the 'best' results? Theoretical arguments without a grounding in reality are specious, and purely 'fit'-based arguments are wishful thinking. For my money, when trying to estimate the proportion of an unknown population from a single set of trials, I use an estimator derived from a uniform prior, i.e. the Laplace estimator. Those who are looking at continuous processes and can cheaply investigate multiple trials can use estimators that answer the questions they want to pose. There is no such thing as a "best" estimator, just one that answers the question. As the wise old saying goes: "Be careful what you ask for, you might just get the answer for it."
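The convergence claim is easy to see numerically. The sketch below holds y/n fixed at 0.2 (an arbitrary choice for illustration) and prints the three estimates as n grows; the function name is invented for the example.

```python
def estimates(y, n):
    """Return the three point estimates discussed in this comment."""
    return {
        "y/n": y / n,
        "(y+1)/(n+2)": (y + 1) / (n + 2),
        "(y+2)/(n+4)": (y + 2) / (n + 4),
    }

spreads = []
for n in (10, 100, 1000, 10000):
    y = n // 5  # keeps y/n fixed at 0.2
    est = estimates(y, n)
    spread = max(est.values()) - min(est.values())
    spreads.append(spread)
    row = "  ".join(f"{name}={value:.4f}" for name, value in est.items())
    print(f"n={n:5d}  {row}  spread={spread:.4f}")
```

The gap between the largest and smallest estimate shrinks roughly in proportion to 1/n, so the choice among them matters mainly for small samples.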