Did Neyman really say of Fisher’s work, “It’s easy to get the right answer if you never define what the question is,” and did Fisher really describe Neyman as “a theorem-proving poseur who wouldn’t recognize real data if it bit him in the ass”?

To answer the question in the title of this post: Of course not. Fisher is English. They say arse, not ass.

But here’s a quote that is floating around. Joseph Wilson quotes science reporter Regina Nuzzo:

Neyman called some of Fisher’s work mathematically “worse than useless”; Fisher called Neyman’s approach “childish” and “horrifying [for] intellectual freedom in the west”

and psychologist Gerd Gigerenzer:

Neyman, also with reference to the issue of power, called some of Fisher’s testing methods “worse than useless” in a mathematically specifiable sense

The latter quote seems to derive from this line of philosopher Ian Hacking from 1965:

The likelihood test in the situation just described is a sort of which, in the literature, has been called worse than useless. Its power is less than its size.

But, as Wilson notes, the “worse than useless” idea was floating around for quite a while before then:

[Erich] Lehmann, who received his doctorial thesis from Neyman in 1946, casually called likelihood ratio tests whose power was less than its size “worse than useless” in 1948.

I liked Erich Lehmann. He was one of the few people in the Berkeley stat dept who were nice to me when I was there in the 1990s. Perhaps surprisingly given his theoretical reputation, he seemed to get the idea that applied Bayesian statistics could be both useful and nontrivial. He didn’t share the attitude of his colleagues that probability modeling is suspect the moment it comes in contact with the data.

Did Neyman really call some of Fisher’s work mathematically “worse than useless”?

I passed this question over to statistician and historian Steve Stigler, who replied:

Yes, Neyman and Fisher took small verbal potshots at each other after 1935 on several occasions, and as such this remark would not be a big deal. But is it accurate? A quick look tells me:

1) In a 1950 Annals of Math Stat paper, “Some principles of the theory of testing hypotheses,” Erich Lehmann twice uses the phrase, as in, “Cases exist, in which the likelihood ratio test is not only unsatisfactory but worse than useless, and hence the likelihood ratio principle is not reliable.” Of course this was not a comment directed at Fisher specifically or even at a corpus of his work, only a reflection of the view that LRs themselves were not always to be trusted, and Fisher would no doubt have disowned the uses in the specific cases given. I gather Lehmann wrote the same thing earlier as well.

2) Hacking I think had exactly this statement in mind when he wrote, “The likelihood test in the situation just described is a sort of which, in the literature, has been called worse than useless. Its power is less than its size.”

3) But it is certainly not aimed at Fisher’s work in any general sense, and it is not to be attributed to Neyman without citation, even though I suspect Neyman would endorse it, and may even have said or written it in the sense used by Lehmann. It is addressed at the idea that LR alone solves all problems, which is not something Fisher would have said.

So there you have it. In any case, I think Nuzzo’s statement gets the point across. Even if Neyman did not make this particular statement, he and Fisher insulted each other’s work in enough other places that I think the general sense of Nuzzo’s statement is correct.
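For readers who want the technical sense of “worse than useless” pinned down, here is a rough sketch in standard testing notation. The symbols are my own choice for illustration, not taken from Lehmann or Hacking. Write $\Theta_0$ for the null parameter values and $\Theta_1$ for the alternatives:

```latex
\alpha = \sup_{\theta \in \Theta_0} P_\theta(\text{reject } H_0)
  \quad \text{(size)},
\qquad
\beta(\theta_1) = P_{\theta_1}(\text{reject } H_0), \;\; \theta_1 \in \Theta_1
  \quad \text{(power)}.
% "Worse than useless": for some alternative \theta_1,
\beta(\theta_1) < \alpha .
```

The comparison point is a test that ignores the data entirely and rejects with probability $\alpha$: its power is exactly $\alpha$ at every alternative. A test whose power falls below its size at some alternative does worse than that, which is the narrow sense in which the phrase is used in the quotes above.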

18 thoughts on “Did Neyman really say of Fisher’s work, “It’s easy to get the right answer if you never define what the question is,” and did Fisher really describe Neyman as “a theorem-proving poseur who wouldn’t recognize real data if it bit him in the ass”?

  1. More on point here. This kind of rivalry has been quite active since Tartaglia and Cardan, and long before (that dispute is just one of the more interesting mud slings that I recall). I don’t see why recent days would be much different (though for example the legal terms may have changed a bit ;)). Neyman and Fisher were as much rivals as the staunch Newtonian school of physics and the challenge of general relativity.

  2. The technical work Neyman did to back this up seems to have been done by the mid-1920s if I’m reading some of Neyman’s early papers right (so there’s quite a gap between that and Lehmann’s casual use in 1948).

    I do think there’s a real possibility Nuzzo’s quote is misleading though, at least to some readers. A casual read of the quote gave me the impression of a general condemnation of Fisher, while the limited original papers I have in my library suggest “making fun of a specific result”.

  3. Neither of them ever adequately addressed the use of strawman hypotheses (please correct me if I have missed it). Fisher even seems to advise it in some places. They were both alive long enough to be capable of seeing that use of statistics begin to take hold, so there was opportunity. I would say significance/hypothesis testing using a strawman null/H0 is worse than useless in the epistemological sense.

    • Question, this is me correcting you. Fisher most certainly did deal with the straw man aspect of the null hypothesis. The likelihood function approach that he discussed in Statistical Methods and Scientific Inference does so quite well. (It is a shame, as Edwards points out, that Fisher spent more effort on defending fiducial inference than he did on developing likelihood-based inference.)

      The following is a snippet from a paper in which I explore the direct connection between significance test P-values and likelihood functions:

      It is reassuring to find myself in agreement with both E.T. Jaynes and R.A. Fisher. Jaynes, a leading proponent of Bayesian approaches and no friend of significance testing, said that

      the distinction between significance testing and estimation is artificial and of doubtful value in statistics (Jaynes 1980, p. 629)

      And Fisher said:

      It may be added that in the theory of estimation we consider a continuum of hypotheses each eligible as null hypothesis, and it is the aggregate of frequencies calculated from each possibility in turn as true […] which supply the likelihood function (Fisher 1955, p. 73)

      (My paper has been rejected several times, and is under revision yet again, but an earlier version is available from arXiv: http://arxiv.org/abs/1311.0081)

      • Michael,

        I have come across your paper before and even referenced it earlier in posts on this blog. I found it very insightful. In fact, your paper describes the only justification for using p values I have ever found palatable. It is unfortunate that peer-reviewed journals have failed to publish this, but I am happy that arXiv exists to allow you to do so. To me this is just more evidence that peer review only functions to perpetuate social norms and the interests of people established in whatever field.

        The use of the null hypothesis as “a landmark in parameter space” is not something Fisher ever managed to convey effectively (possibly because he did not understand it himself). I think your way of interpreting p values is proper but it is not useful in the context of “significance” levels as are commonly used and advocated by Fisher.

      • Michael:

        My problem with the so-called Fisher null hypothesis is that, as I’ve always seen it, it assumes the treatment effect is constant for all items. To me, it makes more sense that, if the treatment effect is nonzero, it can vary. (I’ve written a couple of papers on this, one published in 2004 and one in 2008.)

        I’ve had some arguments with Rubin about this, because he’s always thought of this constant-effect null hypothesis as a great idea. So, what can I say, you’re in good company if you agree with R. A. Fisher and Don Rubin. Nonetheless, I part ways with them on this one.

        • Andrew, I’m not sure that I understand what you mean by ‘items’. Does it relate to a fixed effects ANOVA type of model? If so then I’m not sure that your concern is really relevant to a simple significance test. If you mean, instead, that the effect size varies for each observation within a sample then it must make the conventional parametric statistics meaningless.

          I guess I don’t get it.

          • Michael:

            To use potential-outcome notation, the conventional model of constant treatment effect is y_i^T – y_i^C = theta, and confidence intervals can be obtained by inverting the set of null hypotheses corresponding to different thetas. I am more interested in a model y_i^T – y_i^C = theta + eta_i, where eta_i vary. By “items,” I mean whatever is being experimented on, for example, in the most natural sort of example, each i is a different person.
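            As a rough illustration of what “inverting the set of null hypotheses” means under the constant-effect model, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy data, the difference-in-means test statistic, the completely randomized design, and the Monte Carlo approximation to the randomization distribution. It is not taken from the papers mentioned in this thread.

```python
import random


def randomization_p_value(treated, control, theta, n_draws=1000, seed=0):
    """Two-sided Monte Carlo randomization p-value for the sharp null
    y_i^T - y_i^C = theta for every item i (constant treatment effect)."""
    rng = random.Random(seed)
    # Under this null, subtracting theta from each treated outcome
    # recovers that item's control outcome, so labels are exchangeable.
    adjusted = [y - theta for y in treated]
    pooled = adjusted + list(control)
    n_t = len(adjusted)
    n_c = len(control)
    observed = abs(sum(adjusted) / n_t - sum(control) / n_c)
    extreme = 0
    for _ in range(n_draws):
        rng.shuffle(pooled)  # re-randomize treatment labels
        stat = abs(sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c)
        if stat >= observed:
            extreme += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (extreme + 1) / (n_draws + 1)


def invert_tests(treated, control, thetas, alpha=0.05):
    """Keep every candidate theta that is not rejected at level alpha and
    report the range of the kept values as an approximate interval."""
    kept = [t for t in thetas
            if randomization_p_value(treated, control, t) > alpha]
    return (min(kept), max(kept)) if kept else None


# Made-up outcomes for six treated and six control items.
treated = [5.1, 6.0, 5.6, 6.3, 5.8, 6.1]
control = [4.0, 4.4, 3.9, 4.6, 4.2, 4.1]
grid = [i / 5 for i in range(0, 16)]  # candidate thetas 0.0, 0.2, ..., 3.0
interval = invert_tests(treated, control, grid)
print(interval)
```

            Note that every test in the grid assumes the same theta for all items. The model with theta + eta_i is exactly what this construction does not cover, since the adjustment step would no longer recover each item’s control outcome.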

            For some examples, see this paper and this paper and these slides.

            • Andrew, I’ll look at the papers later, but my immediate responses are these. (i) The null is usually a hypothesis of no effect, so how is it a problem that every observation might be coming from an ‘item’ that may have a different effect size? (ii) Is the variable effect size for every item not accomplished by allowing a non-zero variance for the notional population from which we can assume the sample was drawn?

              • Michael:

                Yes, for a null hypothesis of zero effect, I have no problem with zero variation. My problem is with inverting a set of hypothesis tests corresponding to different constant but nonzero effects.

                Also, I recognize that my term “item” is nonstandard. The usual term in “statistics” is “unit” but I find that confusing, especially when talking with non-statisticians, because of confusion with the concept of units of measurement (kilograms, etc.). Hence I use “item.”

              • Andrew:

                Doesn’t “item” usually refer to questions asked in observational studies, especially in the context of item batteries meant to catch a latent variable? Keywords like “Item Response Theory” spring to mind. Hence, it might be even more confusing to use the term “item”, not only for statisticians but probably also for most social scientists.

  4. It is interesting to note that G.U. Yule, in his 1912 paper titled “On the Methods of Measuring Association Between Two Attributes”, discussed and criticized Karl Pearson’s tetrachoric correlation coefficient, but agreed that there are some cases in which Pearson’s assumptions are “less unreasonable”.

  5. I was doubtless the one to underscore Nuzzo’s make-believe quote on Neyman. It isn’t merely that the precise words were never said, it’s that in using it she conveys a misleading statement, as noted by Entsophy. She doesn’t convey, and wasn’t aware of, the idiosyncratic, technical meaning of the expression. Here’s what I wrote in a footnote to a post:

    “In a recent Nature article by Regina Nuzzo, we hear that N-P statistics “was spearheaded in the late 1920s by Fisher’s bitter rivals”. Nonsense. It was Neyman and Pearson who came to Fisher’s defense against the old guard. See for example Aris Spanos’ post here. According to Nuzzo, “Neyman called some of Fisher’s work mathematically ‘worse than useless’”. It never happened. Nor does she reveal, if she is aware of, the purely technical notion being referred to. Nuzzo’s article doesn’t give the source of the quote; I’m guessing it’s from Gigerenzer quoting Hacking, or Goodman (whom she is clearly following and cites) quoting Gigerenzer quoting Hacking, but that’s a big jumble.
    N-P did provide a theory of testing that could avoid the purely technical problem that can emerge without considering alternatives or discrepancies from a null. As for Fisher’s charge against an extreme behavioristic, acceptance sampling approach, there’s something to this, but as Neyman’s response shows, Fisher, in practice, was more inclined toward a dichotomous “thumbs up or down” use of tests than Neyman. Recall Neyman’s “inferential” use of power in my last post. If Neyman really had altered the tests to such an extreme, it wouldn’t have required Barnard to point it out to Fisher many years later. Yet suddenly, according to Fisher, we’re in the grips of Russian 5-year plans or U.S. robotic widget assembly lines! I’m not defending either side in these fractious disputes, but alerting the reader to what’s behind a lot of writing on tests (see my anger management post). I can understand how Nuzzo’s remark could arise from a quote of a quote, doubly out of context. But I think science writers on statistical controversies have an obligation to try to avoid being misled by whomever they’re listening to at the moment. There are really only a small handful of howlers to take note of. It’s fine to sign on with one side, but not to state controversial points as beyond debate. I’ll have more to say about her article in a later post (and thanks to the many of you who have sent it to me).”

    • Mayo, I share your distaste for the Nuzzo paper and for the imprecision of the pseudo-quotation. However, I wonder whether the often repeated idea that Fisher’s approach to P-values was to dichotomise them is another example of where counter-factual ideas have grown from misinterpretations.

      When reading how Fisher describes the results of significance tests, it needs to be borne in mind that the determination of exact P-values was very laborious, and that he had at his disposal tables of test statistic critical values corresponding to several specific values of P. Thus he tended to say whether the P-value was greater or less than values like 0.05 and 0.02, and very often whether it was in between, rather than providing exact values. Further, Fisher clearly used the word ‘significant’ in the sense of ‘interesting’ or ‘worthy of attention or follow-up’ rather than in the sense of accept/reject the null.

      With those things in mind the many statements by Fisher that are generally thought to indicate a dichotomous approach to inference can be seen to be more Fisherian than Neymanian. For example, on page 53 of his little book Fisher, Neyman and the Creation of Classical Statistics (very highly recommended) Lehmann provides eight quoted statements of results from Fisher that are supposed to support the contention of a dichotomous approach. Only one of them does so clearly. The others can all be interpreted in light of my caveats as being non-dichotomous to the extent that was convenient in the absence of a click-the-button generator of exact P-values.
