“The frequentist case against the significance test”

Richard Morey writes:

I suspect that like me, many people didn’t get a whole lot of detail about Neyman’s objections to the significance test in their statistical education besides “Neyman thought power is important”. Given the recent debate about significance testing, I have gone back to Neyman’s papers and tried to summarize, for the modern user of statistics, his main objections to the significance test, including the nice demonstration from his 1952 book that without consideration of an alternative, you can match the distribution of all of your statistics *under the null* and get whatever answer you like. I thought it might be of interest for your readers.

“The frequentist case against the significance test” is in two parts:
Part 1 (epistemology)
Part 2 (statistics).

I haven’t read this in detail but I’m sympathetic to the general point. I think that frequentist analysis (that is, figuring out the average properties of statistical methods, averaging over some model of the data and underlying parameters) can be valuable, but I also think that the classical framework of hypothesis testing and confidence intervals doesn’t really work—I think these ideas represent too crude a way to jump into inference and decision making.
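For readers who want to see the demonstration Morey mentions in miniature, here is a toy version (a sketch of the general point, not Neyman’s original example). If the null is all we consider, then any rejection region with null probability 0.05 counts as a “valid” 5% test, including useless ones:

```python
import numpy as np

# Sketch of Neyman's 1952 point: with only a null hypothesis in view,
# any rejection region of null probability 0.05 is a "valid" 5% test.
# Null model: z ~ Normal(0, 1), e.g. a standardized sample mean.
rng = np.random.default_rng(0)

def test_tails(z):
    # Reject in the two tails: P(|Z| > 1.96) = 0.05 under the null.
    return np.abs(z) > 1.96

def test_center(z):
    # Reject in a central sliver: P(|Z| < 0.0627) = 0.05 under the null.
    return np.abs(z) < 0.0627

z_null = rng.standard_normal(1_000_000)
print(test_tails(z_null).mean(), test_center(z_null).mean())  # both ~0.05

# Under a real effect (true mean 2), the two tests disagree completely:
z_alt = 2.0 + rng.standard_normal(1_000_000)
print(test_tails(z_alt).mean())   # ~0.52: reasonable power
print(test_center(z_alt).mean())  # ~0.007: rejects *less* often than under the null
```

Both tests match the advertised size exactly under the null; only by asking what should happen under an alternative can we say that the second test is absurd.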

45 thoughts on ““The frequentist case against the significance test””

  1. I should note that Deborah Mayo objected on the grounds that Neyman wrote (much later, in the 70s) that what Fisher and Neyman meant by significance testing was the same thing (and, indeed, one *could* look at it as if Neyman were putting the significance test on a solid footing). By “significance testing” I meant what Fisher meant in his 1955 paper – that is, a test without consideration of Type II errors. I’m sympathetic to Neyman, who seems to have approached the issue in a much more principled manner than Fisher did.

    Regardless of the semantics of “significance test”, Neyman made this point several times and I think his line of argument in the 1952 book is interesting, which is why I wrote the blog post.

    • Morey: you still miss the serious point, which has nothing to do with names. Neyman’s criticisms of Fisherian tests at the very start of the NP collaboration are the points I mentioned in my earlier comment: without alternatives, one has too much leeway in the choice of test (Senn, by the way, will argue that choosing a test statistic, as Fisher did, is actually more primitive than choosing an alternative); and, to avoid the fallacy committed daily by social scientists, we should be clear that the null and alternative exhaust the parameter space, because the error probabilities will not hold for inferring a substantive (e.g., causal) claim on the basis of statistical significance. The NP formulation of tests provided a rigorous route to the tests Fisher favored.

      The other serious point which you distort is whether tests are illicit because they don’t deductively warrant posterior probabilities. Answer: they are not. Neyman put them forward deliberately as holding regardless of prior. If you and other psych folks are “sympathetic to Neyman” (as am I, to this extent), you would use NP testing and stop using the illicit NHST animal (which Fisher would never have endorsed either). Then you could take negative results and non-replications as evidence for spurious effects, among much else. Ironically, the leeway Neyman is pointing out in simple Fisherian tests is akin to the leeway permitted in analysis by Bayes factors, which Morey endorses: you can support the null or not by convenient choices of alternatives. Error control goes out the window, but at least the frequentist can assess the error probabilities (even in Fisherian tests).
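      As a toy illustration of that leeway (invented numbers, not anyone’s actual analysis): a point-null Bayes factor for H0: mu = 0 against H1: mu ~ Normal(0, tau^2) flips its verdict with the prior scale tau chosen for the alternative:

      ```python
      import numpy as np
      from scipy import stats

      # Toy illustration (invented numbers): a point-null Bayes factor for
      # H0: mu = 0 versus H1: mu ~ Normal(0, tau^2), with an estimate ybar
      # observed 2 standard errors away from zero.
      ybar, se = 2.0, 1.0

      for tau in [0.5, 1.0, 10.0, 100.0]:
          m0 = stats.norm.pdf(ybar, loc=0.0, scale=se)  # marginal likelihood under H0
          m1 = stats.norm.pdf(ybar, loc=0.0, scale=np.sqrt(se**2 + tau**2))  # under H1
          print(f"tau = {tau:6.1f}   BF01 = {m0 / m1:5.2f}")

      # tau = 0.5 gives BF01 ~ 0.8 (mild evidence against H0);
      # tau = 100 gives BF01 ~ 13 (the same data now "support" H0).
      ```

      With a narrow alternative the data count against the null; with a very diffuse one the same data support it (the Jeffreys–Lindley phenomenon).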

      • > “The other serious point which you distort is whether tests are illicit because they don’t deductively warrant posterior probabilities.”

        I never said that tests are illicit because they don’t deductively warrant posterior probabilities. In fact, I said this: “That the logic of the significance test is not deductively valid is not news; certainly if the logic were deductively valid, Fisher and Neyman would have made use of that fact, since deductive logic plays a major role in both Fisher and Neyman’s theories. As it turns out, it isn’t valid; but this isn’t necessarily a problem if another justification for the logic can be found.”

        Echoing Corey’s question below: “Where in the two posts does Morey assume Bayesian reasoning?” Please answer this question.

  2. It would be hard to find a greater distortion of Neyman’s position on significance tests than Morey’s, even on a quick read. For starters, it makes it appear that Neyman is against statistical significance tests, when in fact he called his own tests statistical significance tests. Morey is alluding to Neyman’s arguments for why we need to consider an alternative and power in order to justify the tests that Fisher himself chose on intuition. It is not an indictment of the tests. Neyman also objects to the fact that, if one isn’t as careful as Fisher, one might think one can jump from a statistically significant effect to the research hypothesis (as psychologists do). Neyman’s criticisms of fiducial inference are a distinct issue.
    As for the argument that significance tests are not deductively valid, we see the same mistaken argument that has led to recent bannings of P-values. The formal statistical significance test, like the NP test, is entirely deductive. The error probabilities follow deductively from the given premises of the test and model. Only certain test abusers imagine that the tests should or claim to warrant a posterior probability assignment to the conclusion. What Morey takes as Neyman criticizing significance tests is just Neyman explaining why he and Pearson added more constraints to Fisher’s test with its single null. They were striving to wind up with the tests Fisher sanctioned on informal grounds, and they did. Both Fisher and Neyman couldn’t be clearer in rejecting the Bayesian reasoning Morey assumes (unless there’s a frequentist prior, in which case it’s all purely deductive). There is an inductive inference to a claim about what’s warranted and not warranted on the basis of the deductive tests, but that appeals to either decision criteria (in behavioristic contexts) or the severity principle, or something like it (in scientific contexts). I’m disappointed to see Gelman appear to rubber-stamp a confused and misleading argument like this, especially as he will correctly employ significance-test reasoning in his own work.

    • > Neyman’s criticisms of fiducial inference are a distinct issue.

      Is one of the Neyman quotes in the posts actually about fiducial inference, not significance tests? If so, that would be pretty bad, but if not, then I can’t think of a reason you would bring up the subject.

      > As for the argument that significance tests are not deductively valid, we see the same mistaken argument that has led to recent bannings of P-values. The formal statistical significance test, like the NP test, is entirely deductive. The error probabilities follow deductively from the given premises of the test and model.

      Morey doesn’t say that the error probabilities don’t follow deductively; he says that a positive claim of fact doesn’t follow deductively.

      > Both Fisher and Neyman couldn’t be clearer in rejecting the Bayesian reasoning Morey assumes

      Where in the two posts does Morey assume Bayesian reasoning?

    • It would also be hard to find a greater distortion of Fisher’s position on significance tests than Morey’s. Fisher said that a small p-value indicated that (paraphrasing) “either the null hypothesis is false, or a very unlikely event occurred.” Clearly deductive. Further, Fisher suggested that a single test should not be taken as definitive evidence.

      • That is not an inference, that is a disjunction. The inference comes *after* that. Why do you think Fisher called his reasoning inductive and not deductive? Do you think he couldn’t tell the difference?

        • I was referring to your mischaracterization of the logic of significance testing as deductively invalid. This disjunction is a (Fisher’s) deductively valid conclusion given the premises. Further, it counteracts your silly contradiction that supposedly arises when we know the null to be true (this happens all the time, by the way, when baseline characteristics are (ill-advisedly) compared in randomized trials).

          The proper inference following from such a significance test should be along the lines of “there very well might be something to this.” If you’re after a more certain, positivistic, conclusion, then I’m afraid that you’re asking too much from statistical methods.

          Fisher certainly did characterize his (conjectural) inference as inductive. But, then again, so did Peirce. Do you think that Peirce couldn’t tell the difference? Or is it possible that different people use the same word in somewhat different ways? In all of my reading of the two, I don’t recall ever seeing Fisher refer to Popper nor Popper to Fisher. However, I have found very little between the two (with the *possible* exception of Fisher’s fiducial approach to estimating mathematical parameters) to make me think that they would have disagreed when it came to the growth of scientific knowledge.

        • Mark:

          Let me jump in here . . . I disagree with your claim that the deductive logic of the significance test is valid. There are two problems with this logic:
          1. The evidence from a significance test is probabilistic.
          2. The negation of the null hypothesis is typically not the hypothesis of interest.

          The statement, “p is less than 0.05, therefore there is strong evidence against the null hypothesis, therefore there well might be something to the alternative hypothesis,” is in many cases not a good deduction.

          Take, for example, the notorious claim by Kanazawa of a correlation between parental beauty and child sex ratio. He found p less than 0.05, but, in light of everything that is known about sex ratios, his data provide essentially zero evidence regarding any such correlation in the population.

          One way to see the problem is that the standard rules of logic are deterministic. When we move to probabilistic evidence, certain intuitions about logic turn out not to work.
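          A quick simulation makes this concrete (a sketch with invented numbers in the spirit of the sex-ratio example, not Kanazawa’s actual data):

          ```python
          import numpy as np

          # Sketch: when the true effect is tiny relative to the standard error,
          # "p < 0.05" says almost nothing about the sign or size of the effect.
          # Invented numbers in the spirit of the sex-ratio example.
          rng = np.random.default_rng(1)

          true_effect = 0.3   # percentage points, roughly the scale biology allows
          se = 3.0            # standard error of a small, noisy study

          est = rng.normal(true_effect, se, size=1_000_000)  # replicated estimates
          signif = np.abs(est) > 1.96 * se                   # two-sided p < 0.05

          print(signif.mean())             # ~0.05: barely above the false-positive rate
          print((est[signif] < 0).mean())  # ~0.39: "significant" sign is wrong ~40% of the time
          print(np.abs(est[signif]).mean() / true_effect)  # ~23: estimates exaggerated ~20x
          ```

          Conditional on statistical significance, the estimate is exaggerated by an order of magnitude and its sign is close to a coin flip; that is the sense in which such data carry essentially zero evidence.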

        • Mark:

          It was when you wrote, “This disjunction is a (Fisher’s) deductively valid conclusion given the premises.” Even if the premises are correct (that is, setting aside my objection #2), there’s still the problem that the inference is probabilistic, hence the statement “therefore there well might be something to the alternative hypothesis,” is not necessarily a reasonable inference given the p-value.

        • Yes, you are correct, sorry that I didn’t qualify further. I never meant to suggest that every application of significance testing would lead to equally reliable inference (and definitely didn’t intend to imply that all reported p-values were of equivalent value or even useful).

        • Andrew:
          Don’t you use ‘testing without an alternative’ (Fisherian-style) in your posterior predictive model checking? Putting aside whether it’s graphical or numerical.

          I see this as a useful distinction:
          – Fisherian-style tests (without an alternative) are typically best used for model adequacy checks because there we have a model we like as the null
          – NP-style tests are basically for estimation and include a range of hypotheses, some of which we like

          This is, of course, putting aside the (incorrect) NHST procedure based on strawman nulls. There is a legitimate case against that procedure, of course, but it is not a frequentist case; it’s a ‘not-being-stupid’ case.

          A BDA-Bayesian is thus more likely to favour keeping Fisherian tests in some form and replacing NP tests with Bayesian or likelihood estimation.

        • Also, the ‘not strictly deductive’ falsification applies equally to your posterior predictive checks and to your adoption of a Popperian philosophy.

        • Hjk:

          Yes, I agree that I do these checks myself, I’m not saying they’re always bad, I was just saying that their logic is not as clear as I felt that Mark was implying in his comment.

        • So the modern Bayesian is then much closer to Fisher – model (adequacy) tests without an alternative combined with likelihood-based estimation. Which is pretty much the opposite ‘alternative’ to what one might infer from this post…

        • hjk: a huge difference is that posterior predictive checks are only used to decide a model is a poor fit. P-values are almost never used in practice _only_ to declare that the null hypothesis is a poor fit.

        • Anon:

          Good point. Indeed, I’d love it if researchers used p-values to shoot down their own models. But the usual practice is the opposite, where the p-value is used to shoot down the straw-man null hypothesis. I prefer model-checking as self-criticism, rather than model-checking as an indirect means of confirmation.
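          As a minimal sketch of what that self-critical use looks like (a toy conjugate-normal model, not anyone’s real analysis; the p-value is used only to flag misfit):

          ```python
          import numpy as np

          # Toy posterior predictive check: fit a normal(mu, 1) model (flat prior
          # on mu) to data that are secretly heavy-tailed, then ask whether
          # replicated data from the posterior can reproduce the observed extremes.
          rng = np.random.default_rng(2)

          y = rng.standard_t(df=3, size=100)  # data: t with 3 df, heavier-tailed than the model
          n, ybar = len(y), y.mean()

          def T(data):
              return np.max(np.abs(data))     # tail-sensitive test statistic

          t_rep = []
          for _ in range(4000):
              mu = rng.normal(ybar, 1 / np.sqrt(n))       # draw mu from its posterior
              t_rep.append(T(rng.normal(mu, 1, size=n)))  # replicate data, recompute T

          ppp = np.mean(np.array(t_rep) >= T(y))  # posterior predictive p-value
          print(ppp)  # near 0 in most runs: the model cannot reproduce the observed tails
          ```

          The small p-value here is used only to shoot down the model we fit, never to confirm anything.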

        • Also, Andrew:

          Your Type M/S error formulation of testing appears (to me, and at first blush) to be much closer to an appropriate formalisation of Fisher’s intuitions than the attempt by Neyman/Pearson in terms of power and Type I/II errors.

          Here the choice of appropriate test statistics – yes, based on external information! – is more fundamental than, e.g., power.

          As Mayo says below (and paraphrasing Senn):

          > Senn will argue that choosing a sensible test statistic is more primitive than choosing an alternative, and by this route, he avoids arbitrariness. What power does for NP, “sensitivity” does for Fisher.

          To me these are important points to emphasise. By resorting to rather throwaway ‘I don’t like testing/NHST’ comments, the ‘Frequentists’ you reach out to via concepts like Type M/S errors might infer that you agree with NP’s criticism of Fisher and hence adopt a Type I/II framework.

          In actual fact, everything you do seems much closer to Fisher’s approach (besides, you know, the rants against Bayesians).

        • Hjk:

          I agree that Fisher did great stuff. The trouble with his non-Bayesian stance is that it puts too much burden on the mathematical framing of a problem. Maximum likelihood etc. have big problems when the (arbitrarily defined) space of possible parameter values includes regions that make no sense.

          But Neyman did have some good criticisms of Fisher. It’s the usual trouble with hypothesis tests, that you’re rejecting the entirety of a model, including various unrealistic assumptions that are just there for convenience. This is one place I disagree with Rubin: he likes the so-called Fisher null hypothesis model, and I think it makes no sense.

          The final part of the picture is to separate inference and model checking. I really really hate the classical framework in which estimation and testing are mathematically equivalent, the idea of confidence intervals which are inversions of hypothesis tests. I think this is a really bad attitude, and one thing that frustrates me is when people have been trained in that way and don’t even realize there are other ways of thinking about statistical inference.

        • > The final part of the picture is to separate inference and model checking. I really really hate the classical framework in which estimation and testing are mathematically equivalent, the idea of confidence intervals which are inversions of hypothesis tests. I think this is a really bad attitude, and one thing that frustrates me is when people have been trained in that way and don’t even realize there are other ways of thinking about statistical inference.

          But that is (one part of) what I am trying to say – Fisher did this (though he’s hard to pin down precisely). Estimation within a model based on a likelihood function plus carefully defined tests for specific departures of interest. Add priors for additional regularisation and you’ve got the BDA approach, no?

          Following NP makes one more likely to want to invert hypothesis tests to form confidence intervals. Implicitly endorsing their arguments seems to be a misreading of Fisher and ignoring his similarities with BDA Bayesians. Again, that’s my point.
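          For concreteness, here is the inversion being discussed, in a toy normal-mean sketch: the 95% confidence interval is exactly the set of null values a two-sided 5% test fails to reject.

          ```python
          import numpy as np
          from scipy import stats

          # Sketch of the test/CI equivalence under discussion: for a normal mean
          # with known sd, the 95% CI is exactly the set of null values mu0 that a
          # two-sided 5% z-test does NOT reject. Toy data.
          rng = np.random.default_rng(4)
          y = rng.normal(1.0, 1.0, size=25)
          n, ybar = len(y), y.mean()
          se = 1.0 / np.sqrt(n)   # known sd = 1

          ci = (ybar - 1.96 * se, ybar + 1.96 * se)   # direct 95% interval

          # Inversion: scan candidate nulls, keep those with p > 0.05.
          grid = np.linspace(ybar - 1.0, ybar + 1.0, 20001)
          pvals = 2 * stats.norm.sf(np.abs(ybar - grid) / se)
          not_rejected = grid[pvals > 0.05]

          print(ci)
          print(not_rejected.min(), not_rejected.max())  # matches the CI endpoints
          #                                 (up to the grid spacing and 1.96 rounding)
          ```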

        • The frustrated Bayesian: there are legitimate, interesting reasons for using priors, here’s how you could think about them as a frequentist. Response: priors are subjective.

          The frustrated Fisherian: there are legitimate, interesting reasons for using Fisherian tests. Response: NHST sux.

        • I am in complete agreement with Andrew that 1) p values can be helpful sometimes in informal model checks, and 2) that their logic is far from easy. My own sense is that p values are useful in the creative stage of a research program, when various theoretical models are being fleshed out and there is not yet anything to test. Patterns of weaknesses of models can be identified (e.g., “this class of models consistently seems to fail here”), but this use of p values is informal. I am a bit of an anarchist when it comes to *this* sort of research.

        • I’m also fine with this. At the risk of being a broken record, I believe this would probably be Fisher’s view too. P-values *were* informal for him, and he placed much more emphasis on e.g. the likelihood function.

          It probably wouldn’t be the NP view, however, as they were the ones wanting to formalise testing. And that, I thought, is essentially the subject under discussion: NP’s case against Fisher.

        • Mark:

          Some of us suspect Fisher read Peirce but there is no documentation.

          On the other hand Popper explicitly acknowledged Peirce as the person whose work he drew on the most.

  3. Mayo said: “Both Fisher and Neyman couldn’t be clearer in rejecting the Bayesian reasoning Morey assumes…”

    Fisher, in Statistical Methods and Scientific Inference, said that “…the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to and verifiable by other rational minds. The level of significance in such cases fulfils the conditions of a measure of the rational grounds for the disbelief it engenders…”

    Fisher *did* adopt the logic that Neyman is critiquing. I don’t see how, on reading the two authors, one can come to the conclusion that they agreed on “significance testing”.

    • Don’t you have the parenthetical description of the null hypothesis wrong in the first point of the numbered list in your first post? It should be that the drug has no effect.

      • I think you refer to this:
        “Develop a null hypothesis that is to be (possibly) rejected (e.g., the drug has an effect)”

        That is incorrect, and so is your version (“the drug has no effect”). The null hypothesis actually used in these cases is that “all groups are samples from the same hypothetical infinitely large population”. That is all the math applies to. That null hypothesis is not equivalent to “the drug has no effect”; you need additional assumptions to get there. This sounds like pedantry, but it is not; it is exactly the problem.

        • The null hypothesis is not *equivalent* to “the drug has no effect”, this is true. It cannot be. One is a substantive hypothesis, and the other is a statistical hypothesis/model. The analyst has to make a choice about what model/hypothesis to stand in place of the substantive hypothesis. This is the way all statistical inference works, of course; there’s nothing unique about a significance test here. We could add this disclaimer to all discussions of statistics, but that would get tedious. For my own part, I emphasize this point in my teaching and writing (e.g., here).

  4. Richard: You continue to flagrantly distort the players. Fisher threw around claims like “disinclination to believe the effect is genuine,” but these were not posterior probabilities. We have evidence against the result being a genuine experimental effect–but the probability is assigned to the observed effect. Also, that comment comes in the midst of the Fisher-Neyman dispute (any time after 1935, but that book in particular). The more Fisher talked “rational disinclination” (which is vague enough to be construed as one of Neyman’s “actions” actually), and the more he spoke of “fiducial” inference, the more Neyman turned behavioristic and separated himself from Fisherian verbiage. Fisher never wavered from rejecting Bayesian inference as irrelevant for science, and always held it was a mistake to suppose probability enters into “uncertain reasoning” in just one way. So if you’re trying to pin that on him, you are encouraging serious distortions.

      > Fisher never wavered from rejecting Bayesian inference as irrelevant for science, and always held it was a mistake to suppose probability enters into “uncertain reasoning” in just one way. So if you’re trying to pin that on him, you are encouraging serious distortions.

      Do you see anything in the two posts that can be construed as Morey trying to pin these positions on Fisher? (I don’t.)

      • Corey: If the title of his paper had been something like, “why Neyman and Pearson introduced the alternative and the power function in order to justify the tests Fisher developed on intuitive grounds”, or “why we should be using NP tests and power, rather than pure significance tests,” and the discussion formulated as how Neyman answered those questions, it would not be so misleading. What Morey concludes is:

        “To recap both posts, Neyman makes clear why significance testing, as commonly deployed in the scientific literature, does not offer a good theory of inference: it fails epistemically by allowing arbitrary “rational” beliefs, and it fails on statistical grounds by allowing arbitrary results.”

        One can only regard Fisherian tests as “failing epistemically”, as Morey defines it, if succeeding epistemically meant being able to infer from a low significance level that “Ho is probably false”. But neither Fisher nor Neyman thought this. Nevertheless, an inferential (and thus an epistemic) assessment is warranted from both types of tests (although NP and Fisher were allergic to the idea that there was a single rational way to draw inferences).

        Second, pure Fisherian significance tests allow arbitrary results only if one incorrectly uses them. I concur as to the importance of considering at least a directional alternative, and I recommend NP tests, properly interpreted (even though there are important applications for pure significance tests—spelled out in Cox’s taxonomy), but Fisher, and wise Fisherians, are not led into arbitrary results. Senn will argue that choosing a sensible test statistic is more primitive than choosing an alternative, and by this route, he avoids arbitrariness. What power does for NP, “sensitivity” does for Fisher.

        But the real reason that I find the discussion misleading is that, rather than explain up front that these are arguments Neyman gave for introducing constraints into the Fisherian test––in order to ensure sensible frequentist tests––it is billed as “the frequentist case against the significance test” as “commonly deployed in the scientific literature”. The reader takes this as yet more grist for the mills of those piling on against tests––as in Gelman’s remark––when it might more correctly be seen as “frequentist grounds for NP tests”. Still, I think Fisherians like Cox and Senn can make the case that they accomplish what Neyman does without NP statistics. Arguments against Fisher’s fiducial interpretation are distinct.

        All that said, I agree with Morey that Neyman’s papers should be required reading.

        • The only part of the above that actually addresses the question I asked is,

          > One can only regard Fisherian tests as “failing epistemically”, as Morey defines it, if succeeding epistemically meant being able to infer from a low significance level that “Ho is probably false”. But neither Fisher nor Neyman thought this.

          I think you’re attributing to Morey a position that he never actually takes. He quotes Fisher directly: “The level of significance in such cases fulfills the conditions of a measure of the rational grounds for the disbelief [in the null hypothesis] it engenders.” The most charitable reading (and the correct one, I think) is that Morey is discussing how Neyman thinks this claim of Fisher’s (howsoever construed or cashed out) fails.

          But I would like to address this briefly:

          > Second, pure Fisherian significance tests allow arbitrary results only if one incorrectly uses them. …Fisher, and wise Fisherians, are not led into arbitrary results.

          You once asked me how Bayesians could safely cope with model expansion in the face of model inadequacy, and my reply was basically that wise Bayesians can avoid being led into arbitrary results. I could tell by the expression on your face that you found this inadequate; so you’ll understand when others find your reliance on the wisdom of Fisherians in the defense of pure Fisherian significance tests equally weak sauce.

    • > Mayo said: “The more Fisher talked “rational disinclination” (which is vague enough to be construed as one of Neyman’s “actions” actually)…”

      That doesn’t seem (to me) like a plausible reading of Neyman. The “disinclination” for Fisher was specifically a disinclination to *believe*, and Neyman disconnects/distinguishes the result of a hypothesis test from beliefs, which he says are personal. Neyman does say that the decision can be to adopt a “particular attitude”; however, 1) he notably didn’t use “belief” and chose the word “attitude”, and he is a very careful writer, and 2) “particular” implies something discrete (e.g., “I will no longer consider this hypothesis”) rather than something continuous, like Fisher’s “disinclination”.

  5. I found Morey’s depiction of Fisherian and Neymanian thinking to be sufficiently consistent with my own readings to satisfy me, without being identical. The changing nature of the dialog between Fisher and Neyman, alluded to tangentially by Mayo, makes a brief exposition of their differences and similarities pretty much impossible, so we should not be looking for such an account. Further, we should not be looking to either of them for a complete and sensible account of statistical or scientific inference, because neither of them seems to have clearly understood the distinctions that have to be drawn among the various types of response to the information, distinctions that Royall enunciated clearly:
    (i) what do the data say?;
    (ii) what should I believe?;
    (iii) what should I do?

    The Fisherian P-value is an index of (i) but the Neymanian approach deals with (iii). Bayesian approaches are (iii).

    With those distinctions in mind it can be seen that there is little point in trying to reconcile the various approaches in a way that makes them become interchangeable or agree. Neither should we be looking for ways to say that one is superior to another for all purposes.

    • “The Fisherian P-value is an index of (i) but the Neymanian approach deals with (iii). Bayesian approaches are (iii).”
      hi michael, i just wanted to check. is there a typo? you have 3 items in your numbered list, but only 2 of them seem to be mapped to 3 approaches?
      thanks!

      • Whoops. Yes, a typo. P-values give (i) [I note in passing that likelihood functions tell that story much better], Bayesian approaches mostly help with (ii), and Neyman-Pearsonian tests provide decisions, (iii). It is very unfair on P-values to criticise them for not being good for (ii) and (iii).

  6. > Mayo said: ‘Fisher threw around claims like “disinclination to believe the effect is genuine,” but these were not posterior probabilities.’

    So? I never said they were. It doesn’t matter; the logic fails anyway. Neyman and Pearson (1928) say: “It is indeed obvious, upon a little consideration, that the mere fact that a particular sample may be expected to occur very rarely in sampling from [the null] would not in itself justify the rejection of the hypothesis that it had been so drawn, if there were no other more probable hypotheses conceivable.” They aren’t invoking any Bayesian posterior probabilities here.

    > Mayo said: “But the real reason that I find the discussion misleading is that, rather than explain up front that these are arguments Neyman gave for introducing constraints into the Fisherian test––in order to ensure sensible frequentist tests––it is billed as ‘the frequentist case against the significance test’ as ‘commonly deployed in the scientific literature’.”

    It is deployed this way in the literature! Are you denying this?

    > Mayo: “Fisher, and wise Fisherians, are not led into arbitrary results.”

    I think you’re right that Fisher would not be led to arbitrary results. Fisher had better intuitions than probably any statistician, ever. People who read Fisher’s writings seriously, though, *could* be led into arbitrary results.

  7. My own thoughts on tests:

    http://models.street-artists.org/?p=2309

    Testing to make decisions is very very NON Bayesian, but testing basically provides a connection between a sampling process and the distribution function you’re sampling from. It connects two totally different kinds of probability:

    1) Bayesian probability: a method for calculating with uncertainty about real-world facts

    2) Mathematical Probability: a connection between certain functions of an abstract space and sequences of vectors from that space.

    My problem with Frequentist statistics is that it replaces 1 with 2. This implicitly treats the world as a random number generator. It’s no surprise that physicists are some of the most vehement Bayesians. They don’t want to replace physical reality with random number generators.

    BUT, sometimes as a mathematical modeling technique, it DOES make sense to replace whatever’s going on with an RNG. Like for example if you want to find plausible initial conditions for a weather prediction simulation given a small set of measurements, or you want to find out how much it costs to fix a large number of houses after random-sampling a few of them and figuring out what was wrong in the sampled locations.

    In that special case, a Bayesian would be interested in whether the mathematical form of their model fits the data well enough, and the notion of a test makes sense. This is more or less Andrew’s “posterior predictive checks,” I think.
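    As a toy sketch of the house-repair case above (invented numbers): here the randomness is literally a random number generator, because we chose the sample ourselves, so the frequentist calculation answers exactly the right question.

    ```python
    import numpy as np

    # Toy version of the house-repair example: estimate the total repair
    # cost for N houses from a simple random sample. The only randomness
    # is the sampling we did ourselves, so an RNG model is exactly right.
    rng = np.random.default_rng(3)

    N = 10_000
    pop_costs = rng.gamma(2.0, 5_000.0, size=N)   # unknown true costs (made up)

    sample = pop_costs[rng.choice(N, size=200, replace=False)]

    est_total = N * sample.mean()
    se_total = N * sample.std(ddof=1) / np.sqrt(len(sample))
    # (ignoring the finite-population correction for simplicity)

    print(f"estimate: {est_total:,.0f} +/- {1.96 * se_total:,.0f}")
    print(f"truth:    {pop_costs.sum():,.0f}")
    ```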

  8. IMO the logic of significance tests works over time and is overall good for science, but only if the underlying experimental designs are good. See for example, http://www.statisticool.com/cis.jpg

    So the points are: (1) don’t rely on just one result (Fisher said as much), and (2) have good experimental design.
    (1) and (2) are probably more important than whether p-values, confidence intervals, Bayes factors, or anything else are used to analyze the resulting data.

    Justin
    http://www.statisticool.com
