Hypothesis Testing is a Bad Idea (my talk at Warwick, England, 2pm Thurs 15 Sept)

This is the conference, and here’s my talk (will do Google hangout, just as with my recent talks in Bern, Strasbourg, etc):

Hypothesis Testing is a Bad Idea

Through a series of examples, we consider problems with classical hypothesis testing, whether performed using classical p-values or confidence intervals, Bayes factors, or Bayesian inference using noninformative priors. We locate the problem not in the use of any particular statistical method but rather with larger problems of deterministic thinking and a misguided version of Popperianism in which the rejection of a straw-man null hypothesis is taken as confirmation of a preferred alternative. We suggest solutions involving multilevel modeling and informative Bayesian inference.

P.S. More here from 2014.

33 thoughts on “Hypothesis Testing is a Bad Idea (my talk at Warwick, England, 2pm Thurs 15 Sept)”

    • Yes! In the clinical trials world I seem to have this conversation with doctors and statisticians all the time. They defend hypothesis testing on the grounds that you have to make a decision about using a treatment – so a “significant” result is a suitable criterion for them to believe that it “works”. I say, yes, you have to decide at some point, but there are a whole load of things other than statistical significance that should come into the decision. Plus p < 0.05 is a really crap criterion.

  1. Oh thank goodness Andrew. I have just seen too much of “Bayes is the solution,” where it just supports continuing the same old thing using Bayesian methods. You’re not clear on your exact proposal, but I’ve been saying for ages to people who come to me, “just say something reasonable and useful about your data. Your hypothesis test doesn’t provide this.” (And it never does all on its own.)

  2. If I understand the problem correctly, one egregious example is the study “People Who Choose Time Over Money Are Happier” (by Hal E. Hershfield and Cassie Mogilner Holmes, who recently wrote a NYT op-ed about their findings). I took a look at the actual study (in the September 2016 issue of SPPS), and it’s swarming with problems related to hypothesis testing. The researchers seem bent on proving that people who choose time over money are happier than those who choose money over time. Philosophically it’s a compelling idea (requiring careful definitions of time, happiness, and money); the study, though, is so problem-ridden that I question the vigilance of the publishers (of both the study and the op-ed).

    Seven surveys make up the study. Five of them were conducted on Mechanical Turk, another through Qualtrics, and another in a train station. Mechanical Turk is problematic enough in itself–but if you’re asking specifically about money, happiness, and time, how can you expect to get any kind of reliable results there? There’s probably a big divide between people who complete surveys to make a living and those who do it just to kill a few minutes here and there. (Come to think of it, the latter group might be relatively content with their lives but sorely in need of time, since they keep spending it on surveys and such.)

    They perform all sorts of hierarchical regressions that obscure rather than clarify things. For example, they claim to control for financial strain and time strain by asking participants about their income and work hours. I see far too many problems with that.

    I would have taken the study as a satire, except that it lacks a satire’s flair.

  3. Andrew interacts with some poor uses of tests, but there are also stringent uses of tests, and even improvements on Popper. If you don’t test in any sense, then you can’t find flaws, much less falsify. So if you’re a falsificationist, you need severe tests. https://errorstatistics.com/2016/08/09/if-you-think-its-a-scandal-to-be-without-statistical-falsification-you-will-need-statistical-tests/

    The fact that it was so difficult to replicate low p-values with preregistration proves that even the lowly p-value has merit: the blame lies not with the p-value but with those who (1) commit fallacies of rejection (inferring discrepancies larger than warranted, or moving from a statistical to a scientific conclusion without stringently ruling out error) or (2) employ spurious p-values (those invalidated by biasing selection effects and model violations).

    Well, I’ve said all these things too many times over at my blog, errorstatistics.com.

      • So do it right.

        I wonder if you really mean to say “deterministic thinking” as opposed to a misguided drive for certainty? I don’t see that determinism enters.

        ” a misguided version of Popperianism in which the rejection of a straw-man null hypothesis is taken as confirmation of a preferred alternative.”

        But everyone and his brother knows this is a fallacy. It’s really not a warranted argument to move:

        From: Affirming the consequent is invalid.
        To: deductive logic is a “bad idea”.

        Likewise for inferences from “fallacious uses of tests” to “tests are bad.”

        ” We suggest solutions involving multilevel modeling and informative Bayesian inference”

        I hope to read the paper to see how it cures the problems of inferential fallacies. I think it’s a mistake to suppose that scientific reasoning and methods for testing are somehow “owned” by some quirky abusers, or are limited to what you might find in a text, and that, since the methods being badly used “belong” to the abusers, no one should use them correctly. I refuse to give away good methods to those who are clueless or irresponsible. Your new methods may differ, but the requirements for stringent testing will remain for them.

        • Mayo:

          You write, “I hope to read the paper to see how it cures the problems of inferential fallacies.” There’s no paper and I’ve cured no general problems. But for specific examples you can see my books and dozens or hundreds of applied research articles.

          As for the attitude that I state, “a misguided version of Popperianism in which the rejection of a straw-man null hypothesis is taken as confirmation of a preferred alternative,” this is unfortunately not what’s done by some small group of “quirky abusers”; rather, it’s standard practice in Psychological Science, PPNAS, Science, Nature, and a big chunk of what is publicized on NPR, TED talks, and the like. I’d say it’s the dominant philosophy of science, at least for that part of science associated with statistical analysis and high levels of variation.

        • Andrew: I didn’t say it was a small group, but much of what’s in Psych Sci is bad science, and I don’t see how simply doing something completely different fixes what are known to be fallacious arguments. If the name of the game is fame, they’ll do it, perhaps even more readily, with a different methodology. Better stat can’t fix bad science. Your list might be right, but until rather recently Nature and Science weren’t this way. (Are they that bad?)

        • Mayo:

          You write, “better stat can’t fix bad science.” Indeed. But bad stat (in particular, the idea that a scientific study is a success if the p-value is less than .05 or the Bayes factor is more than 20 or whatever) can motivate bad science. And clarifying the problems of bad stat can perhaps motivate researchers to do better science.

          One thing I’ve said a lot on the blog when bad studies come up is that I don’t recommend a preregistered replication because, on statistical grounds, I can be pretty sure it will be a waste of time. Careful statistics can help us avoid tarot-card science.
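
          As a rough illustration of those statistical grounds (a minimal simulation with invented numbers, not the actual calculation behind the quote): if the plausible true effect is tiny relative to the standard error the original design can deliver, a preregistered replication rarely reaches significance, and the estimates it produces are dominated by noise.

```python
# Minimal simulation with invented numbers: a hypothetical replication whose
# design yields a standard error of 0.20 sd while the plausible true effect is
# only 0.05 sd.
import numpy as np

rng = np.random.default_rng(0)
true_effect, se, n_sims = 0.05, 0.20, 100_000

estimates = rng.normal(true_effect, se, size=n_sims)  # simulated replication estimates
significant = np.abs(estimates) > 1.96 * se           # the usual p < 0.05 threshold

print(f"Chance the replication reaches 'significance': {significant.mean():.2f}")
print(f"Mean |estimate| when it does: {np.abs(estimates[significant]).mean():.2f} "
      f"(vs. a true effect of {true_effect})")
```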

        • Andrew:

          I agree that “clarifying the problems of bad stat can perhaps motivate researchers to do better science.” But clarifying problems isn’t quite the same as saying all tests are bad, nor as getting on board a p-bashing train that often includes some confused and terrible arguments. I’m not saying you do, but the bad arguments don’t help.

          “I don’t recommend a preregistered replication because, on statistical grounds, I can be pretty sure it will be a waste of time.”

          Sure, but that doesn’t belie my point that the nonreplications show the problem wasn’t p-values but their misuse, coupled with questionable measurements.
          I’m curious to hear your criticisms of Bayes factors sometime.

        • Mayo:

          My technical criticisms of Bayes factors appear in chapter 8 of BDA3 (chapter 7 of earlier editions of the book), and they are completely different from my technical criticisms of p-values for null hypothesis significance testing.

          But my practical criticisms of Bayes factors are pretty much the same as my practical criticisms of p-values for null hypothesis significance testing, which is that it’s generally a mistake to try to demonstrate the truth or plausibility of favored hypothesis B by rejecting straw-man null hypothesis A.

        • As has been pointed out on your blog before, the fallacious use of these tests goes straight back to Neyman and Student. The fallacious use is not an aberration due to “clueless or irresponsible” people untrained in statistics:

          “The application of the chosen test left little doubt that the lymphocytes from household contact persons are, by and large, better breast cancer cell killers than those from the controls.”
          https://errorstatistics.com/2015/08/05/neyman-distinguishing-tests-of-statistical-hypotheses-and-tests-of-significance-might-have-been-a-lapse-of-someones-pen-2/#comment-129224

          “From the table, the probability is 0.9985, or the odds are about 666 to one that 2 is the better soporific.” (Student, 1908)
          https://errorstatistics.com/2015/03/16/stephen-senn-the-pathetic-p-value-guest-post/#comment-120537
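
          A minimal sketch of the arithmetic behind Student’s figure, assuming the ten paired observations as distributed in R’s sleep dataset: the 0.9985 is the tail area of his test read as the probability that drug 2 is the better soporific, and 0.9985/0.0015 is where “about 666 to one” comes from. (The modern paired t-test below gives a slightly different tail area than Student’s tabled z, so the implied odds land somewhat above 666.)

```python
# Reproducing the arithmetic behind Student's "odds of about 666 to one",
# assuming the 1908 sleep data (extra hours of sleep under two soporifics,
# ten patients, as shipped in R's `sleep` dataset).
from scipy import stats

drug1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

# Paired comparison; one-sided test of "drug 2 gives no more extra sleep than drug 1".
t_stat, p_two_sided = stats.ttest_rel(drug2, drug1)
p_one_sided = p_two_sided / 2  # t_stat > 0, so halving gives the upper-tail p-value

prob = 1 - p_one_sided  # the tail area Student read as "probability that 2 is better"
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.4f}")
print(f"'Probability that 2 is the better soporific' = {prob:.4f}")
print(f"Implied odds: about {prob / (1 - prob):.0f} to 1")
```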

        • Andrew: There’s nothing illicit in Neyman’s phrase; he hasn’t been “caught” in any way, whether or not the particular study is good (which I can’t check now). I replied to Anon on this.

        • Mayo:

          I’m not knocking Neyman. He made great contributions to statistics. But nobody’s perfect, and in this case Neyman was following the logic of demonstrating the truth or plausibility of favored hypothesis B by rejecting straw-man null hypothesis A. That reasoning is far from “illicit”; it’s standard in the world of null hypothesis significance testing. Indeed, that’s typically what null hypothesis significance testing is all about.

          To get back to your framework of severe testing: I’m a big fan of testing and finding flaws with models that I am using. I’m not a big fan of testing straw-man nulls or of interpreting p-values as posterior probabilities.

        • I disagree that Neyman is misinterpreting tests, quite aside from the specifics of this case, which I don’t know. It’s not erroneous to take statistical results from adequate studies as relevant evidence for claims about which treatments perform better or the like. To suppose it was would be crazy. What are randomized controlled studies all about? The trouble comes when errors in moving from the statistical to the substantive aren’t scotched, when the experiment is poor, or when a single statistically significant result, possibly accompanied by cherry-picking and interpretive latitude, is taken as evidence for a causal claim. All the worse when there’s little scrutiny that the “observed effects” are even caused by the “treatment”, and the subjects are college students required to participate. Student’s remark has nothing to do with Neyman.

        • “It’s not erroneous to take statistical results from adequate studies as relevant evidence for claims about which treatments perform better or the like.”

          In Anonuoid’s example it appears that *measurement* A was consistently in the favorable direction compared to *measurement* B but that this was very likely entirely due to differences in the measurement process, and so concluding that *treatment* A was better (for people) than *treatment* B was totally unwarranted.

          The issue is resolved when you stop ignoring all other explanations for how the measurements might be different (which is exactly what you do when you simply test the null hypothesis that “the average values of the RNG that generated the data are the same”).

          Including all these different plausible explanations is exactly what you do when you put priors over explanatory models and then do Bayesian analysis of the posteriors. In that case you’d conclude, for example, “either measurement A consistently overestimates the goodness compared to measurement B, or, equally probably, treatment A is better for people than treatment B.” That is, in a Bayesian analysis you ask the question “which of these plausible scientific models of the way the data might have come about is consistent with the data?”

          The corresponding NHST question is “is the RNG that generated data A the same as the RNG that generated data B?” which is clearly not a scientific question at all.

          It begs the question of whether the data arise from an RNG in the first place (they don’t), and concluding that A *is better* than B on that basis begs the question of whether there is a causal mechanism by which the results, considered as output of an RNG, could come out “better” for A even though A is in fact worse (as there was in this case).
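
          To make the contrast concrete, here is a toy sketch with invented numbers (the bias and effect priors below are purely hypothetical): two candidate explanations for an observed A-minus-B difference, a measurement-artifact model and a real-treatment-effect model, each given prior weight, with the posterior probability of each explanation computed from its marginal likelihood.

```python
# Toy sketch with invented numbers: priors over two competing explanations for
# an observed difference between measurement A and measurement B.
import numpy as np
from scipy.stats import norm

d_obs, se = 2.0, 0.5  # hypothetical observed A-minus-B difference and its standard error

# Explanation 1: measurement artifact. Assay A reads high by bias ~ Normal(2, 1),
#                no real treatment effect; marginal for d_obs is Normal(2, sqrt(1^2 + se^2)).
# Explanation 2: real treatment effect ~ Normal(0, 2), no measurement bias;
#                marginal for d_obs is Normal(0, sqrt(2^2 + se^2)).
marginal = np.array([
    norm.pdf(d_obs, loc=2.0, scale=np.hypot(1.0, se)),
    norm.pdf(d_obs, loc=0.0, scale=np.hypot(2.0, se)),
])

prior = np.array([0.5, 0.5])       # equal prior weight on the two explanations
posterior = prior * marginal
posterior /= posterior.sum()

print(f"P(measurement artifact | data)  = {posterior[0]:.2f}")
print(f"P(real treatment effect | data) = {posterior[1]:.2f}")
```

          The NHST analogue reports only that “no difference” is rejected and says nothing about which explanation produced the difference.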

        • That’s what Mayo’s talking about when she writes about scotching “errors in moving from statistical to substantive”. But it’s puzzling that she’s claiming to defend Neyman without actually pointing to some instance of Neyman being aware of the need to scotch such errors — her use of the word “adequate” is doing a lot of heavy lifting.

        • Corey: This may provide some insight on that.

          “No, causality is far too speculative in nonrandomized settings.” He [Neyman] repeated something like this quote from his biography, “. . .Without randomization an experiment has little value irrespective of the subsequent treatment.” https://arxiv.org/pdf/1404.1789.pdf

          The problem is that, when randomization does not work (does not block systematic errors), those who have not thought carefully about nonrandomized settings have no scotches in the cupboard to reach for ;-)

          (Unfortunately, almost all randomized studies in patients have systematic errors that are unblocked, and some scotch is therefore essential.)

        • >”I disagree that Neyman is misinterpreting tests”

          A difference in chromium-51 levels does not equate to “better breast cancer cell killers”; there are various explanations for an observation like that. The test does not answer the question he wanted to ask. It really is very simple: for it to be science you need to rule out other explanations for the observations before concluding anything, and anyone who understands that will see his error is quite blatant.

  4. Dear Andrew,

    Am writing from Downunder, and would love to attend the Warwick meeting on Thursday, but unfortunately Scottie won’t be able to beam me up :(
    There have been lots of queries about “sharing” some talks from the meeting. If that’s not doable, I’m wondering whether there might be any chance of a global webinar/conference thing – this topic is of wide import, particularly to those of us straddling the bridge between practice & methods.

    Thanks also for addressing one of the elephants that often sits in the statistical “lab” (Popper), and I wondered whether you have examined other philosophers of science, post-Popper. That is, if you are not aligned with Popper, is there another philosopher with whom you resonate? For instance, the social scientists here at Griffith University (which has a strong humanist studies bent, and hence fecund ground for Bayesian thinking) wade in and out of Whitehead’s work via Latour and Stengers.

    From the other side,
    Sama
