Why “statistical significance” doesn’t work: An example.

Reading some of the back-and-forth in this thread, it struck me that some of the discussion was about data, some was about models, some was about underlying reality, but none of the discussion was driven by statements that this or that pattern in data was “statistically significant.”

Here’s the problem with “statistical significance” as I typically see it used. I see statistical significance used in 4 ways, all of them problematic:

1. Researcher has certain goals in mind, uses forking paths to find a statistically significant result consistent with a pre-existing story.

2. Researcher finds a non-significant result and identifies it as zero.

3. Researcher has a pile of results and agnostically uses statistical significance to decide what is real and what is not.

4. A community of researchers uses p-values to distinguish between different theories.

All these are bad. Approaches 1 and 2 are obviously bad, in that statistical significance is being used to imply empirical support beyond what can really be learned from the data. Approaches 3 and 4 are bad in a different way, in that they are taking whatever process of scientific discussion and learning is happening, and sprinkling it with noise.

54 thoughts on “Why “statistical significance” doesn’t work: An example.”

  1. I have a question: what about this use of p-values? A research funder uses statistical significance as one of the criteria (but not the only one) to decide which research hypotheses are worth funding. Maybe that use of p-values will create incentives to do your 1 and 2, but we could in theory take care of that by having the rule “Publish everything.” Is there still a problem?

    My question is this: a research funder has to make bets on which research will pay off (in finding true results). What should his or her decision rule be?

    • Anon:

      There are two questions I see here.

      First, can it make sense for funders (or journal editors, or agencies that have to decide whether to approve drugs, or business executives who have to decide whether to develop a product) to use quantitative benchmarks of success? Here my answer is: maybe yes, as long as you can mitigate the incentives to cheat and distort. If you can ensure honest reporting, or if manipulation is kept to a minimum, then, yes, I can see the advantage of pre-set decision rules.

      Second, should these decision rules depend on p-values, Bayes factors, or other null hypothesis significance tests? My answer is No, I don’t think so. I think the decision rule should depend on some decision analysis. Yes, just about any decision analysis will be approximate, but the point is that it should be set up to be targeted to the decision at hand. To use a Jeopardy! framing: It’s hard for me to think of any question for which “p less than 0.05” (or “p less than 0.005,” or “Bayes factor more than 20”) is the answer.
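
      To make that concrete, here is a rough sketch of the kind of decision analysis I mean, with every number invented purely for illustration: compare the expected net benefit of each action, averaging over uncertainty about the effect, rather than asking whether a significance threshold is crossed.

      ```python
      # Toy decision analysis (all numbers invented for illustration).
      # Instead of asking "is p < 0.05?", compare the expected net benefit
      # of the available actions under uncertainty about the effect size.
      import numpy as np

      rng = np.random.default_rng(0)

      # Suppose a study estimates a treatment effect of 0.3 with standard error 0.2.
      # Approximate the uncertainty about the effect by a normal distribution
      # (a flat-prior shortcut, just for this sketch).
      effect_draws = rng.normal(loc=0.3, scale=0.2, size=100_000)

      cost_of_adoption = 0.1  # invented cost, on the same scale as the effect

      expected_gain_adopt = effect_draws.mean() - cost_of_adoption
      expected_gain_status_quo = 0.0

      decision = "adopt" if expected_gain_adopt > expected_gain_status_quo else "stick with status quo"
      print(f"expected net gain if adopted: {expected_gain_adopt:.3f} -> {decision}")
      ```

      In this made-up example the estimate wouldn’t even reach p less than 0.05, yet the expected net benefit of adopting is positive; the decision turns on costs and benefits, not on which side of a threshold the p-value falls.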

      • I agree with Andrew’s response to his second question. Decision rules depending on p-values, Bayes factors, or other null hypothesis significance tests are cop-outs: they substitute an easy decision rule for the thinking that is needed to make a really sound decision.

        • … and also substitute an easy decision rule for the thinking that is needed to evaluate whether or not an argument for a decision is sound.

        • Yes! Any decision should be argued rationally and scientifically, not merely supported by a single statistical value. What has not yet been mentioned (and is not often mentioned in these discussions) is that any decision should be made with reference to loss functions.

      • I am trying to bait you into an interesting question. You say that quantitative benchmarks could be useful. When and where and why? I know it is a hard question; I just want to see someone smart attempt to answer it. We need to know when we can rely on something more than gut feeling.

  2. To summarize: Researchers tend to forget that statistical significance is one component of an argument in favor of a conclusion, not equivalent to the conclusion itself, and that many different tools interact holistically to lend support for or against a model of reality.

  3. Cancer of the pancreas is a dread disease. When I was still working, there were numerous clinical trials for this disease, all minor variations on the few meds with any possible activity. Almost all of these trials, like most trials in oncology, showed no improvement. I do remember sitting in one of our national meetings when the presenter proudly announced that their trial had produced a “statistically significant” result. There was nothing novel about this trial; there were about a dozen very similar trials without that p-value. The presenter was extremely voluble in praising the outcome.

  4. Re: All these are bad. Approaches 1 and 2 are obviously bad, in that statistical significance is being used to imply empirical support beyond what can really be learned from the data. Approaches 3 and 4 are bad in a different way, in that they are taking whatever process of scientific discussion and learning is happening, and sprinkling it with noise.

    ——
    It makes sense to withhold characterizing any part of the discussion as scientific in order to control for the anchoring effect.

  5. 2a. Researcher reads Deborah Mayo, finds a nonsignificant result, and pats self on back, saying, “I severely tested it!”

    Okay, I’m a little tongue-in-cheek, but what do you think of Mayo’s argument?

  6. I think that in the discussion linked, there was no need to discuss statistical significance because the difference between the pre- and post-intervention periods was so large that we can see it with the naked eye, and the different timing of the intervention across regions lends some credence. We could do a bf-test, but why bother, since it’s definitely going to conclude inequality of variance. The issue arises when the effect size is not large enough to be eyeballed. Then how can we judge that something makes a difference at all? At the end of the day, we are interested in what causes what and how big the influence is, i.e., how the world works.

    While p-values and statistical significance create a lot of problems, many of the problems would still be there regardless. For instance, suppose p-values did not exist and a researcher wants to weave a story about a positive effect. The researcher can “remove outliers” and slice and dice the data until it shows a positive effect.

    The discussion should center more on experimental design and causal diagrams, but these can coexist with significance testing. People might be more receptive to “collect data in a better way” than to “stop using p-values.” A lot about p-values seems off (e.g., binary decisions at p = 0.051 vs. p = 0.049), but they can be useful to support the analysis as long as the study is well designed. Running a t-test on an RCT with a sufficiently large sample size seems reasonable, for instance. But fishing for p-values in an n = 20 study with a lot of noise is not reasonable.
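
    As a small, hedged illustration (the effect size, noise level, and sample size below are all invented), here is what that reasonable case might look like: a simulated two-arm RCT analyzed with a t-test, where the estimate and its uncertainty are reported alongside the p-value instead of a bare significant/non-significant verdict.

    ```python
    # Simulated two-arm RCT (invented effect and noise), analyzed with a t-test.
    # The point: report the estimate and its uncertainty, not just "significant or not".
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 200                                 # per arm
    control = rng.normal(0.0, 1.0, size=n)
    treated = rng.normal(0.2, 1.0, size=n)  # invented true effect of 0.2 sd

    t_stat, p_value = stats.ttest_ind(treated, control)

    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
    ci = (diff - 1.96 * se, diff + 1.96 * se)  # approximate 95% interval

    print(f"estimated difference = {diff:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), p = {p_value:.3f}")
    ```

    Whether the p-value comes out at 0.049 or 0.051 barely changes the estimate or the interval; the dichotomy lives in the reporting convention, not in the data.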

  7. You seem to have described the misuse of p-values, but this doesn’t mean that “statistical significance” doesn’t work. There are situations in which it does work. In order to use p-values correctly you need to understand them, which requires some training. I have always found it odd/amusing that statistical methods are used by people with no training and little understanding or interest. This doesn’t happen with law and structural engineering and a host of other professional vocations. So, it’s hardly surprising that things tend to go wrong.

    • Peter: > doesn’t mean that “statistical significance” doesn’t work. There are situations in which it does work.
      Can you explain how and give an example?

      I believe the argument here is that declaring “statistical significance” whenever the p-value is less than, say, a default of .05 _is_ a misuse of p-values.

      It is not saying p-values should not exist or not be used, but rather that they should not be used simplistically, in generic dichotomous yes/no ways.

      • “I believe the argument here is that declaring “statistical significance” whenever the p-value is less than, say, a default of .05 _is_ a misuse of p-values.”

        That isn’t wrong; it is exactly right if your p-value is less than alpha = .05 and the alpha makes sense for your study. Now whether that is useful, or practically significant, is another matter altogether.

        Justin

        • > and the alpha makes sense for your study
          So the probability the default makes sense in your particular study is not exactly zero?

          So then, since the probability that gambling the trust funds at the local casino would increase their value is not exactly zero, that would not be a misuse of trust funds?

    • I concur with Keith. The issue isn’t so much the calculation of the p-value but the declaration that it is or is not significant. It’s this rubbish, arbitrary threshold, which can only really be misused, that is the issue here. If p-values are presented, they ought to be simply stated and the reader left to decide how much support they provide for the conclusions being made.

      P.S. As a former structural engineer I can assure you that you shouldn’t be using it as an exemplar of a profession that doesn’t have unqualified (not necessarily the same thing as uneducated) people performing the calculations. Statistics is not as unique as you might think in that regard.

      • As someone with a PhD in Civil Engineering (not a licensed engineer) I can concur with Allan about what I saw of the licensing process… the main thing that keeps buildings from falling down seems to be the fact that many of the calculations are sort of rote and tabulated, and required safety factors are mandated in the codes and are large enough to keep people out of trouble when they are occasionally mistaken.

        Doing typical structural design office calculations isn’t quite like preparing your taxes, but it’s closer to preparing your taxes than computational fluid mechanics is. I’ve had licensed structural engineers come to me to ask me why the calculations work (an engineer calculating the stiffness of a support had an intuitive idea that they shouldn’t just add together two quantities they’d calculated, but didn’t know why).

  8. 1 and 2 are using theory (pre-existing knowledge) to decide what p values are ok. 3 and 4 are using p values to decide what theories are ok. But all are bad? How do we please you?

    • Adede:

      The point of statistical analysis is not to please me, the point is to make sense of data and learn about the world beyond the particular data at hand. My problem with all of 1, 2, 3, and 4 is that these methods are not doing that. For examples of analyses that I believe do make sense of data and learn about the world, I can point you to my books and published articles. (Also of course lots and lots of stuff not written by me, but it’s easiest as a starting point for me to point you to my own work.)

        • This is a dichotomous approximation to decision theory, and I don’t think it’s a good idea. I mean, if a drug extends your life 5 years, it’s much more valuable than one that extends your life 5 months, and yet we treat “extends your life by a positive amount” as a single thing, with the alternative being “shortens it by 0 or more”… meh.

  9. I understand these criticisms of significance, but I think they overstate how statistical significance is used in academic practice, at least in my field. My qualified defense of significance is that editors and reviewers mostly use it as ‘an important part of a balanced breakfast’. A paper is worth publishing if the question is interesting and likely answerable with the tools at hand, the theory is compelling, the authors did a good job of reducing noise and the impact of nuisance factors (those not relevant to the theory), the impact is large enough to be important, the statistical analyses seem appropriate to the design and realization of the data, and the claimed findings are reasonably reliable. Only the last judgment is based on significance. Sure, there are studies that only seem reliable because of forking-paths issues, but that concern tends to be a well-understood focus of the review process. It’s hardly perfect, but the whole package is far from a mindless decision rule based on asterisks.

      • Accounting. I actually did a pretty thorough examination of attitudes in our field toward author discretion (e.g., forking paths, HARKing, etc.), and the net benefits of registration: https://ssrn.com/abstract=3118687. The title is:

        No System is Perfect: Understanding How Registration-Based Editorial Processes Affect Reproducibility and Investment in Research Quality

        • Interesting. Thanks.
          It would also be interesting to see your experiment tried in other fields, to see if results are similar or vary from field to field.

        • I would expect views to be pretty similar across other business disciplines, and across the social sciences whose methods are most similar (psych, econ), partly because accountants are a pretty eclectic, interdisciplinary bunch: we draw most of our methods and theories from related fields and apply them to our small corner of the world.

          Some differences might arise because our topic makes us very aware of the benefits of giving discretion to those who report information they have that others don’t. Sure, they might use discretion to misrepresent, but they can also use that discretion to communicate more effectively, so mandatory rules governing what you say and how you say it can hinder good communication. So we might be more inclined to support author discretion.

  10. I wonder if we could move from the analogy of racing to medical practice, with which I have more experience!

    My first task is to assess the reliability of medical findings. Secondly, based on the findings, I assess the probability of diagnoses/hypotheses. Thirdly, on the basis of the various diagnoses and their probabilities, I make decisions. As more information comes along, the cycle repeats (i.e., back to findings, diagnoses, and decisions).

    The first task involves (1) assessing the probability of replication due to random variation, maybe based on a random sampling model, by assuming that the methodology is impeccably consistent, but also (2) assessing the methodology for such consistency by going through a checklist (severe testing?). If either (1) or (2) is hopeless, then I may discard the ‘finding’. If (1) is promising (i.e., it passes some test of preliminary significance), I do (2) carefully and reassess the probability of replication by combining (1) and (2). I may also look for an independent observation (e.g., by another doctor) of the same finding and, after assessing it as in (1) and (2), combine the independent observations of the same finding.

    I tend to think of assessing scientific findings, hypotheses and decisions in an analogous way. Does this correspond to how discussants or readers of this blog think?

    • Huw:

      I agree that it would be useful to move in this direction. I think that one problem is that people are trying to use a single method to answer many different questions.

      Here are a few scenarios:
      – A decision needs to be made, for example use procedure A or procedure B to treat some disease.
      – Someone has an idea for a new treatment, and there’s a desire to test this new idea.
      – Some data arise suggesting some unexpected pattern—this could arise from a study of existing data or as a byproduct of newly gathered data—and you want to decide how much to believe that this pattern is real, whether to follow it up with further study, whether to implement the new idea on patients right away, etc.
      – An idea has been studied by many research teams and you want to do a meta-analysis with the goal of making recommendations regarding treatments, further studies, etc.
      – You suspect that a certain treatment doesn’t work, or doesn’t work in a consistent way, and you have observational or experimental data to address this question.

      My first statement is that no single method will give good answers in all these scenarios. My second statement is that I don’t think that statistical significance gives good answers in any of these scenarios. The next step is to provide alternatives. I do think that I and others have good alternative answers, but the starting point is that the alternatives will be different for these different questions.

      • I am tempted to go into the details of comparing your scenarios to analogous medical conundrums! The ‘preliminary test of significance’ that I refer to in medicine is really for beginners and the inexperienced. One medical analogy is thresholds for population screening test results and diagnoses (often based on two standard deviations of the test result) that are very misleading and can lead to over-diagnosis and over-treatment. The situation in statistical and scientific hypothesis testing seems equally damaging.

      • I agree that no single method can be applied to all situations and that the processes need expertise. It would be good to have a sensible train of thought that links them, though. I also agree that tests of significance are not appropriate (I was one of the signatories to the petition).

        There is an analogous problem in medicine in terms of over-simplified and inappropriate probability thresholds. This applies to screening test results, diagnostic criteria and treatment thresholds. These may be useful for beginners and the inexperienced but are unfortunately in widespread use in an analogous way to statistical significance tests. This leads to damaging over-investigation, over-diagnosis and over-treatment that are directly analogous to the problems in science.

        • Re: I agree that no single method can be applied to all situations and that the processes need expertise.

          I can’t see how a single method could apply to all of these situations either.

      • Andrew

        Before turning in, I will provide a rough outline of why I agree with your point that people are trying to use a single method to answer many different questions.

        I would like to do things differently by first assessing the reliability of the data by examining the normalized likelihood distribution of the study data alone (as opposed to assessing statistical significance). I would also perform ‘severe testing’ by applying a checklist and estimating the probability of impeccable methodology using another theorem derived from the extended form of Bayes’ rule (described in the Oxford Handbook of Clinical Diagnosis) that allows ‘abductive reasoning’. I would then apply a form of sensitivity analysis to assess how these likelihood distributions are affected by taking into account the probability of impeccable methodological consistency. I would also assess how the probability of replication within a sensible range would be affected with or without other data or prior probabilities by using Bayesian analyses. I would then consider how they might be used in diagnostic classification and treatment decisions in a decision analysis. In response to your 5 scenarios:

        1. I would set out the probabilities of benefit, harm, and cost for treatments A and B, and the utilities of these outcomes, taking into account the effect on the probabilities of the severity of the disease and of other factors such as the age and gender of the patient. I would then perform a decision analysis based on the various patient features.
        2. I would perform an RCT comparing the new treatment to placebo or an existing treatment.
        3. I would try to model the expected result of an RCT (e.g., using decision analysis techniques) to see if the unexpected pattern was as promising as first suggested and, if so, perform an RCT comparing the new treatment to placebo or an existing treatment.
        4. I would perform a meta-analysis and then perform a decision analysis as in (1).
        5. I would plot the likelihood probability distribution based on the observational data and, if the findings were important and merited further investigation, design an RCT to obtain a fresh likelihood distribution of the difference.

        I would suggest that there are other, easier exploratory controlled studies that can be performed, based on principles of diagnosis and treatment selection, that examine efficacy in a provisional way without randomization. Also, in diagnostic and scientific hypothesis testing it becomes necessary again to use the theorem derived from the extended form of Bayes’ rule to examine, by hypothetico-deductive ‘severe testing’, alternative hypotheses to real efficacy (e.g., spurious results due to bias from poor study design).
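
        Just to make the first step above a bit more tangible, here is a minimal sketch of a normalized likelihood for a treatment effect, under the simplifying assumption that the study data reduce to an estimate and a standard error (both numbers invented); it covers only the likelihood step, not the methodological checklist, the severity assessment, or the decision analysis.

        ```python
        # Minimal sketch of a normalized likelihood for a treatment effect,
        # assuming the study data reduce to an estimate and a standard error.
        # Numbers are invented; this is only the first step described above.
        import numpy as np

        estimate, se = 0.30, 0.15            # invented summary of the study data
        theta = np.linspace(-0.5, 1.0, 601)  # candidate effect sizes
        dx = theta[1] - theta[0]

        # Normal likelihood of each candidate effect given the observed estimate,
        # normalized so that it integrates to 1 over the grid.
        likelihood = np.exp(-0.5 * ((estimate - theta) / se) ** 2)
        likelihood /= likelihood.sum() * dx

        # Probability (under the data alone) that the effect exceeds some
        # clinically sensible threshold, here arbitrarily 0.1:
        p_above = likelihood[theta > 0.1].sum() * dx
        print(f"P(effect > 0.1 | data alone) = {p_above:.2f}")
        ```

        The same normalized likelihood could then be combined with prior information or weighed against the methodological assessment, as in the outline above.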

  11. Apologies for the similar posts: I thought that the first submission had failed, so I sent a second.

  12. I was recently doing a literature search of an area with little quantitative testing. There was a study that included an empirical test of what I’m interested in, albeit with a small sample size. Unfortunately I get close to zero information because the test was just noted as “non-significant.” Maybe I would be statistically illiterate to do so, but I’d like to see the estimate, p-value, etc., and incorporate that into my prior about the likelihood that there is an effect.
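
    For what it’s worth, if the study had reported an estimate and a standard error, even a simple normal-normal update would let me fold it into a prior. A minimal sketch, with every number invented as a placeholder:

    ```python
    # Conjugate normal-normal update: fold a reported estimate and standard error
    # into a normal prior for the effect. All numbers are invented placeholders.
    prior_mean, prior_sd = 0.0, 0.5         # my prior about the effect
    reported_est, reported_se = 0.25, 0.20  # what the paper could have reported

    prior_prec = 1 / prior_sd**2
    data_prec = 1 / reported_se**2

    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * reported_est) / post_prec
    post_sd = post_prec ** -0.5

    print(f"posterior for the effect: mean {post_mean:.2f}, sd {post_sd:.2f}")
    ```

    None of that is possible when all the paper reports is “non-significant.”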

    • Just noting the test as “non-significant” is sloppy reporting, so you have a legitimate complaint there. (But it may be the case that there was also more sloppiness in performing the study, so it might be a totally unreliable source of information.)

  13. There is so much work to be done here.

    For instance, this from a draft of a new statistics text written in bookdown: https://moderndive.com/10-hypothesis-testing.html#statistical-significance

    “If data at least as extreme would be very unlikely if the null hypothesis were true, we say the data are *statistically significant*. Statistically significant data provide *convincing evidence* against the null hypothesis in *favor of the alternative*, and allow us to *generalize* our sample results to the claim about the population.”
