## Statistical methods that work in some settings but not others

David Hogg pointed me to this post by Larry Wasserman:

1. The Horvitz-Thompson estimator ${\hat \psi}$ satisfies the following condition: for every ${\epsilon>0}$,

$\displaystyle \sup_{\theta\in\Theta}\mathbb{P}(|\hat \psi - \psi| > \epsilon) \leq 2 \exp\left(- 2 n \epsilon^2 \delta^2\right) \ \ \ \ \ (1)$

where ${\Theta}$, the parameter space, is the set of all functions ${\theta: [0,1]^d \rightarrow [0,1]}$. (There are practical improvements to the Horvitz-Thompson estimator that we discussed in our earlier posts but we won’t revisit those here.)

2. A Bayes estimator requires a prior ${W(\theta)}$ for ${\theta}$. In general, if ${W(\theta)}$ is not a function of ${\pi}$ then (1) will not hold. . . .

3. If you let ${W}$ be a function of ${\pi}$, (1) still, in general, does not hold.

4. If you make ${W}$ a function of ${\pi}$ in just the right way, then (1) will hold. . . . There is nothing wrong with doing this, but in our opinion this is not in the spirit of Bayesian inference. . . .

7. This example is only meant to show that Bayesian estimators do not necessarily have good frequentist properties. This should not be surprising. There is no reason why we should in general expect a Bayesian method to have a frequentist property like (1).
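To see what condition (1) is claiming, here is a minimal simulation sketch. It assumes the usual setup for this example, which the excerpt above does not spell out: each unit has covariates $X$, an outcome $Y \sim \mathrm{Bernoulli}(\theta(X))$ that is observed only when $R=1$, a known sampling probability ${\pi(X) \geq \delta}$, and target ${\psi = E[\theta(X)]}$. The particular `theta` and `pi` functions, the choice $d=1$, and all numerical settings are hypothetical choices made for illustration only.

```python
import numpy as np


def horvitz_thompson(y, r, pi):
    """Average of R * Y / pi(X) across each row; cases with R == 0 contribute zero."""
    return np.mean(r * y / pi, axis=-1)


def check_bound(n=1000, delta=0.2, eps=0.2, n_reps=500, seed=0):
    """Compare the simulated tail frequency of |psi_hat - psi| > eps with bound (1)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n_reps, n))            # d = 1 for simplicity (assumption)
    theta = 0.5 + 0.4 * np.sin(8 * np.pi * x)    # hypothetical theta(x), values in [0.1, 0.9]
    pi = delta + (1 - delta) * x                 # hypothetical known pi(x) >= delta
    y = rng.binomial(1, theta)                   # outcome Y ~ Bernoulli(theta(X))
    r = rng.binomial(1, pi)                      # R = 1 means Y is observed
    psi = 0.5                                    # integral of this theta over [0, 1]
    errors = horvitz_thompson(y, r, pi) - psi
    tail_freq = float(np.mean(np.abs(errors) > eps))
    bound = float(2 * np.exp(-2 * n * eps**2 * delta**2))
    return tail_freq, bound, float(errors.mean())


tail_freq, bound, mean_error = check_bound()
print(f"simulated tail frequency: {tail_freq:.4f}")
print(f"bound from (1):           {bound:.4f}")
print(f"mean error:               {mean_error:+.4f}")
```

Under these assumptions each term $R Y / \pi(X)$ lies in $[0, 1/\delta]$, so (1) is a Hoeffding-type bound. It holds whatever $\theta$ is, but it is often loose, so the simulated tail frequency will typically sit well below it.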

Larry follows up with a sociological comment:

We are surprised by how defensive Bayesians are when we present this example. Consider the following (true) story.

One day, professor X showed LW an example where maximum likelihood does not do well. LW’s response was to shrug his shoulders and say: “that’s interesting. I won’t use maximum likelihood for that example.”

Professor X was surprised. He felt that by showing one example where maximum likelihood fails, he had discredited maximum likelihood. This is absurd. We use maximum likelihood when it works well and we don’t use maximum likelihood when it doesn’t work well.

When Bayesians see the Robins-Ritov example (or other similar examples) why don’t they just shrug their shoulders and say: “that’s interesting. I won’t use Bayesian inference for that example.” Some do. But some feel that if Bayes fails in one example then their whole world comes crashing down. This seems to us to be an over-reaction.

Here are my reactions to this story:

1. I don’t understand the mystique of the Horvitz-Thompson estimator. Like all statistical procedures, sometimes it works well and sometimes it doesn’t. I agree with Larry that not every method works on every problem.

2. It’s fine that Larry’s favorite methods are used in biostatistics and at Google and Yahoo. I’ve heard that biostatisticians and software companies also use Bayesian methods, maximum likelihood, chi-squared tests, etc. Lots of methods are useful. The fact that somebody somewhere uses a method doesn’t mean it’s optimal or even a good thing to do in general, but it provides some positive evidence.

3. I agree that there are cases where existing Bayesian methods have problems. Larry writes, “But some feel that if Bayes fails in one example then their whole world comes crashing down. This seems to us to be an over-reaction.” I would rephrase this to say: “Some feel that if Bayes fails in one example then it would be good to understand what aspects of the model are causing problems.” Sometimes the problem is that the frequentist criterion being used is not of applied relevance. Consider a simple problem such as estimating a proportion p, given y successes out of n trials, where n=100 and y=0. The best estimate of p will be different if I tell you that p is the probability of a rare disease, compared to if I tell you that p is the proportion of African Americans who plan to vote for Mitt Romney.

4. For some problems, Bayesians will give up. For example, Bayesians don’t want unbiased estimates, they don’t care about various minimax properties, etc. And for some problems, classical statisticians give up (for example, giving a confidence interval but not a point estimate in the y/n case). Some problems are essentially ill-posed (for example, the problem of estimating a ratio whose denominator could be either positive or negative, sometimes called the Fieller-Creasey problem).
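The y=0, n=100 example in point 3 can be made concrete with a small sketch. It assumes a conjugate Beta(a, b) prior on p, which the text does not specify, and the two informative priors below are hypothetical stand-ins for "rare disease" and "vote share" prior knowledge, not values taken from any data.

```python
def posterior_mean(y, n, a, b):
    """Posterior mean of p given y successes in n trials under a Beta(a, b) prior."""
    return (a + y) / (a + b + n)


y, n = 0, 100

# Hypothetical priors, chosen only to illustrate how context changes the estimate.
priors = {
    "uniform, Beta(1, 1)": (1.0, 1.0),
    "rare disease, Beta(1, 999)": (1.0, 999.0),
    "vote share, Beta(5, 95)": (5.0, 95.0),
}

for label, (a, b) in priors.items():
    print(f"{label:28s} posterior mean = {posterior_mean(y, n, a, b):.5f}")
```

The data are identical in all three cases; only the prior information differs, and so does the estimate.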

To say this again: Bayesians give up on some things but not others. We’ll give up on theoretical principles (such as Larry’s (1) above) but we don’t like to give up on getting inferences for any quantity of interest. In contrast, non-Bayesians often feel strongly about principles such as unbiasedness and confidence coverage but are willing to give up on producing an estimate if a parameter is nonidentified.

So, I think Larry has identified a real difference in attitudes, but I don’t agree with his characterization that Bayesians think “their whole world comes crashing down.” We’re just more bothered by not being able to come up with an estimate of a probability, rather than not being able to satisfy a minimax property.

1. Larry Wasserman says:

Hi Andrew

I didn’t say that Bayesians think “their world comes crashing down”.
I said that SOME Bayesians think their world comes crashing down.
I certainly was not including you in that remark.
I was referring to dogmatic statisticians who think that one tool
has to solve all problems.
In particular, I was referring to Bayesians who seem troubled by
the mathematical fact
that, IN SOME CASES, Bayes procedures can have poor frequentist properties (poor coverage).

Larry

• Andrew says:

Larry:

Yes, I agree entirely that no method or class of methods will solve all problems. I like this way of distinguishing different types of statisticians by what problems bother people. I think it’s actually true that, in some settings, I am not bothered by not being able to satisfy a minimax property, and in some settings you are not bothered by not being able to come up with a probability. If I felt it was important to satisfy a minimax property in a problem such as you describe above, I expect I would abandon Bayes; just as, if you felt it was important to come up with a probability in a setting where such a number was classically identified, I expect you would use Bayes. In either case, by using a method in a particular problem we would not be committing to some overarching philosophy; we would just be trying to use the right tool for the job.

• Larry Wasserman says:

agreed

• Entsophy says:

“IN SOME CASES, Bayes procedures can have poor frequentist properties (poor coverage)”

Indeed, there was an example of this worth retelling. I cannot vouch for whether this really happened, so please take it with a grain of salt. Scientists discovered some samples of a rare element (I believe it was called unobtainium). It was so rare that there were only 1,000,000,000,001 samples of the element in the entire universe, and no possibility of ever creating new ones. 1,000,000,000,000 of these samples were colored red and 1 was colored blue. They were all carefully collected and stored in a central facility.

Unfortunately, there was an industrial accident in which all but one of the samples were destroyed, while the last sample was buried by the rubble of the facility. None of the survivors of the accident saw which sample it was. Extensive efforts were made to dig it up since it was the last of its kind.

While the recovery was ongoing, a Bayesian was asked to predict the color of the surviving sample. He responded, “Based on what I know, odds are a trillion to one that it’s red, so without knowing anything else I would predict the last sample is red.”

The Bayesian’s procedure surely had awful Frequentist characteristics, since there was zero possibility of ever conducting a repeated trial and the industrial accident itself was not a “random data generation mechanism”. But I wonder how many Frequentists are so sure that their philosophy is correct (and Bayesians are guilty of epistemological nonsense), that they would be willing to bet their life savings against the Bayesian’s life savings that the last remaining sample is blue?

As Harold Jeffreys wrote: “The essence of the present theory is that no probability, direct, prior, or posterior is simply a frequency”. I’ll just note that Harold Jeffreys had no problem reasoning, philosophically or mathematically, about the frequencies in repeated trials when such was the subject of his investigations.

2. C Ryan King says:

Worth remembering that this is a problem both in 1) causal inference – selection bias is built in, and 2) infinite dimensional targets of inference – coverage in function spaces. I think we’re still in the “something interesting is happening” stage. There may still be a Bayesian explanation with good properties.

3. Paul says:

Are the domain and range continuous or discrete?

5. Surely I agree that no method or approach is going to work well everywhere. I think the reasons deal with violated assumptions at really basic levels (e.g., various ways i.i.d. can be violated, non-stationarity, even non-ergodic, and where empirical distributions are essential).

But to the point of why HT (or HHT if Hurwitz is to be acknowledged) estimation is seen as such a breath of fresh air, I offer that it fixes the long disorganized state of survey sampling in natural settings, such as wildlife and tree surveys. That opinion is offered by Overton and Stehman in TAS, 49(3), 1995 (“The Horvitz-Thompson Theorem as a Unifying Perspective for Probability Sampling: With Examples from Natural Resource Sampling”), and is championed and nicely explained in the text by Sarndal, Swensson, and Wretman, MODEL ASSISTED SURVEY SAMPLING.

There are non-frequentist critiques, to be sure, as Little offers in his presentation, “The calibrated Bayes approach to survey sampling inference” where he addresses HHT.

I’m not sure the connection has been made formally, but I think of HHT and its applications as a kind of importance sampling. As such, its approach can be made even more general, and the ties to estimation of posteriors stronger.

Still, there’s a lot of good which can come by promoting the methods of Sarndal, Swensson, and Wretman — and of course Little — in areas unfamiliar with these approaches such as, I daresay, engineering and Internet measurements sampling.

• Andrew says:

Jan:

I think much depends on what sorts of applications one focuses on. I do a lot of work on public opinion surveys, where probabilities of inclusion in the sample are unknown, survey weighting is a mess, and connections to inverse-probability weighting can be more misleading than helpful. In other areas of applications, those ideas might be more useful.

• zbicyclist says:

This is even more of a problem in applied panel work (panel in the statistical sense of multiple observations on the same respondents). There are nonresponse and dropouts at each stage, along with the need to replenish these respondents. Nonresponse is not random in my contexts. Trying to determine the actual probability of selection for an observation joining in wave 1200 (an observation that was not even in the sampling frame until, say, wave 800) is an interesting exercise.

The HT estimator as a formula can still be useful, but you’re using not actual inverse probabilities but something you hope will have similar properties, e.g., the ratio of what you have in the stratum to the stratum total.

To me, this is another example that in applied work you have to adjust the tools to the problem, because there’s only so much you can do to adjust the problem to the tools.

• Thanks for the reply, Andrew.

I imagine it does depend, and I am lucky in that I often work in an engineering environment where, although there are cost and other constraints as in every actual problem-solving situation, time can be spent on asking about and implementing what measurements SHOULD be implemented if we could. Censoring is often a problem. Inability to calibrate measurements is another problem. (Does 10 hits on Web site A mean anything like 10 hits on Web site B, when you don’t really know the code behind either?) HHT has proved useful for estimating traffic across individual, big backbone routers.

6. Jonathan (a different one) says:

Your world comes crashing down if you take the perspective that Bayes rule is the only coherent way to think. (I know you don’t feel that way, Andrew.) It is troubling to think that incoherent thinking ever trumps coherent thinking if you think coherence is an essential aspect of rationality. Such Bayesians (or at least my caricature of them; I have no idea whether or not such an extreme Bayesian actually exists) don’t think of Bayes rule as an item in the toolbox for solving problems, but as a Platonic ideal of rationality. When Platonic ideals are less than ideal, people are going to get disturbed.

7. Neil says:

Jonathan: being Bayesian is about treating parameters as random variables, not about using Bayes rule. Use of Bayes rule is no more ‘Bayesian’ than use of the product and sum rules.

• Rafael says:

Regarding that, I recommend Good’s paper “46656 Varieties of Bayesians” fitelson.org/probability/good_bayes.pdf

• Neil says:

Great paper, thanks Rafael!

• guest says:

“unknown variables” not “random variables”

• Jonathan (a different one) says:

Neil: Check out the definition of “metonymy” and then get back to me.

• Neil says:

Metonymy is fine, but not when it confuses my undergraduates. Many people think being Bayesian means using Bayes’ rule. If you aren’t one of them then I suggest you don’t give the impression that you are.

• Jonathan (a different one) says:

• bk says:

As a graduate student in statistics, I find metonymy incredibly frustrating. It may be fine in certain instances, and maybe this discussion is one of them, but seeing it defended when someone is confronted about it is quite saddening. Although this is an informal setting, I would imagine the intent of this blog is to convey ideas. Statistics is a science, and as curators of this science, we have a duty to convey these ideas in an exact manner. I teach this idea to my intro students from the first day of class until the last. Fortunately, I have found the majority of them are open and accepting of this viewpoint.

8. Lukas says:

Regarding “…but are willing to give up on producing an estimate if a parameter is nonidentified”. Well, not always. Sometimes it is fruitful to study partially identified parameters.

• Andrew says:

Lukas:

I was thinking of a case such as binomial data with y=0 and n=100. My impression is that classical statisticians will be willing to supply a 95% interval but will not supply a point estimate. But a Bayesian has to give a point estimate (i.e., a posterior mean) because the Bayesian must be willing to assign predictive probability for the next random event.

On the other hand, a Bayesian can live his entire life and never care about any minimax properties.

• fred says:

Why must a Bayesian give a point estimate? I can see that a Bayesian must use the posterior distribution (subject to posterior predictive checks, perhaps) as the source of all inferences and/or predictions, but I’m not aware of arguments that one must give a point estimate.

There’s certainly no reason that Bayesian point estimates have to be posterior means, which you imply above.

• Andrew says:

Fred:

A Bayesian must be willing to give a predicted probability for any 0/1 outcome, which corresponds to a point estimate of the probability (or, if you prefer, the posterior mean). Giving a confidence interval of the probability, or saying “we can be confident the probability is between 0 and 0.03,” is not enough.

• fred says:

Apologies in advance if I’m missing something. I do see that, in this case – y=0, n=100, data is Binomial(n,p) – i) the posterior for p would be something close to Beta(0,100) depending on the prior, and that we can easily motivate computing the posterior mean; and ii) as p is the probability of success for a new outcome, estimation of p and prediction about new outcomes are essentially the same.

My concern instead is that there’s no requirement to use the posterior mean; depending on how bad different estimates/predictions are when wrong, one might reasonably use other functions of the posterior, with different results. And while it’s unconventional, these functions could return intervals, not point estimates.

• Andrew says:

Fred:

If you are predicting the next event, the probability of success is p. If p is unknown, the Bayesian posterior probability of success is E(p|y), that is, the posterior mean.
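As a worked step, assuming a conjugate Beta(a, b) prior (the thread does not fix a prior) and writing $\tilde y$ for the next 0/1 outcome, the predictive probability is the posterior mean:

$\displaystyle \Pr(\tilde y = 1 \mid y) = \int_0^1 p \, f(p \mid y) \, dp = E(p \mid y) = \frac{a + y}{a + b + n}$

where ${f(p \mid y)}$ is the posterior density. For example, with ${a = b = 1}$, ${y = 0}$, and ${n = 100}$ this gives ${1/102 \approx 0.0098}$.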

9. Andrew: I am surprised you didn’t take Larry to task for the ad hominem argument, made via an anonymized story. That is an odd argument for anything, let alone frequentism. I hope we choose methods because of their performance in real circumstances, not peevish stories told without the courage of naming names!

• Andrew says:

David:

My take on it was that Larry was recounting a relevant story but didn’t want to embarrass the subject of the story (or get into a big fight with him). The same way in which, when I tell a story about a Berkeley professor who advised me not to work on Bayesian Data Analysis because it would be bad for my tenure, I don’t say it was Peter Bickel. The story is fine as it is, no need to call out Bickel (who was, after all, just giving me reasonable advice) or get into a fight with him. Similarly, if I tell the story of my senior Berkeley colleague who, when told of my work on monitoring convergence of iterative simulations, said it would be better if I had several papers on the topic and not just one, there’s no purpose in saying it was Chuck Stone. What would be the point of that?

10. Longhai Li says:

A few years ago, I read a simplified version of this problem in Wasserman’s textbook “All of Statistics”. For the simplified version, I compared the HT estimator with a simple Bayes estimator (the sample mean) and found that the HT estimator isn’t as good as the sample mean. I wrote a short report about the comparison: http://www.informaworld.com/smpp/ftinterface%7Edb=all%7Econtent=a919418073%7Efulltext=713240930. However, I haven’t had time to explore the original version of the problem (perhaps I will soon), so I am not sure whether the point of the above paper will apply to the original version too.

Basu 1971
The circus owner is planning to ship his 50 adult elephants and so he needs a rough estimate of the total weight of the elephants. As weighing an elephant is a cumbersome process, the owner wants to estimate the total weight by weighing just one elephant. Which elephant should he weigh? So the owner looks back on his records and discovers a list of the elephants’ weights taken 3 years ago. He finds that 3 years ago Sambo the middle-sized elephant was the average (in weight) elephant in his herd. He checks with the elephant trainer who reassures him (the owner) that Sambo may still be considered to be the average elephant in the herd. Therefore, the owner plans to weigh Sambo and take 50y (where y is the present weight of Sambo) as an estimate of the total weight Y = Y1 + Y2 + . . . + Y50 of the 50 elephants. But the circus statistician is horrified when he learns of the owner’s purposive sampling plan. “How can you get an unbiased estimate of Y this way?” protests the statistician. So, together they work out a compromise sampling plan. With the help of a table of random numbers they devise a plan that allots a selection probability of 99/100 to Sambo and equal selection probabilities 1/4900 to each of the other 49 elephants. Naturally, Sambo is selected and the owner is happy. “How are you going to estimate Y?”, asks the statistician. “Why? The estimate ought to be 50y of course,” says the owner. “Oh! No! That cannot possibly be right,” says the statistician, “I recently read an article in the Annals of Mathematical Statistics where it is proved that the Horvitz-Thompson estimator is the unique hyperadmissible estimator in the class of all generalized polynomial unbiased estimators.” “What is the Horvitz-Thompson estimate in this case?” asks the owner, duly impressed. “Since the selection probability for Sambo in our plan was 99/100,” says the statistician, “the proper estimate of Y is 100y/99 and not 50y.” “And, how would you have estimated Y,” inquires the incredulous owner, “if our sampling plan made us select, say, the big elephant Jumbo?” “According to what I understand of the Horvitz-Thompson estimation method,” says the unhappy statistician, “the proper estimate of Y would then have been 4900y, where y is Jumbo’s weight.” That is how the statistician lost his circus job (and perhaps became a teacher of statistics!).
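The arithmetic in the story can be checked with a short sketch. The elephant weights below are hypothetical (the story gives none); only the selection probabilities 99/100 and 1/4900 come from the text. With a single elephant drawn, the Horvitz-Thompson estimate of the total is that elephant’s weight divided by its selection probability.

```python
import numpy as np

# Hypothetical weights in kg: Sambo is exactly average, Jumbo is the heaviest.
sambo = 4000.0
others = np.linspace(3000.0, 5000.0, 49)      # the 49 other elephants, mean 4000
weights = np.concatenate(([sambo], others))
true_total = weights.sum()

# Selection probabilities from the story: 99/100 for Sambo, 1/4900 for each other.
probs = np.concatenate(([99 / 100], np.full(49, 1 / 4900)))

# With one elephant drawn, the Horvitz-Thompson estimate of the total is y / p.
ht_estimates = weights / probs
expected_ht = np.sum(probs * ht_estimates)    # average over the sampling design

print(f"true total:            {true_total:,.0f}")
print(f"owner's 50y (Sambo):   {50 * sambo:,.0f}")
print(f"HT if Sambo is drawn:  {ht_estimates[0]:,.0f}")
print(f"HT if Jumbo is drawn:  {ht_estimates[-1]:,.0f}")
print(f"design expectation:    {expected_ht:,.0f}")
```

Under these assumed weights the estimator is unbiased over the design, since its expectation equals the true total, yet every estimate it can actually produce is far from that total. That gap is the point of Basu’s story.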

• EJ says:

Awesome. Lindley (1986), in a comment to Efron, had this to say: “It is surprising to find Efron defending Fisherian ideas when they have been so carefully investigated and found inadequate by Basu (1975, 1977, 1978). Of course, sampling theorists do not read this brilliant, lucid writer. His results discomfit them.”

Cheers,
E.J.
