The inevitable problems with statistical significance and 95% intervals

I’m thinking more and more that we have to get rid of statistical significance, 95% intervals, and all the rest, and just come to a more fundamental acceptance of uncertainty.

In practice, I think we use confidence intervals and hypothesis tests as a way to avoid acknowledging uncertainty. We set up some rules and then act as if we know what is real and what is not. Even in my own applied work, I’ve often enough presented 95% intervals and gone on from there. But maybe that’s just not right.

I was thinking about this after receiving the following email from a psychology student:

I [the student] am trying to conceptualize the lessons in your paper with Stern on comparing treatment effects across studies. When trying to understand if a certain intervention works, we must look at what the literature says. However, this can be complicated if the literature has divergent results. There are four situations I am thinking of. For each of these situations, assume the studies are randomized control designs with the same treatment and outcome measures, and each situation refers to a different treatment. It is easiest for me to put it into a table. In each of these situations only 1 of 2 published studies is found to be statistically significant.




Situation     Studies    Conclusion
Situation 1   A, B       Treatment is effective
Situation 2   C, D       Unclear, needs more replications
Situation 3   E, F       Unclear, needs more replications
Situation 4   G, H       Null/needs more replications

[The per-study values and the "Sig in diff" column are missing.]



Here, Situation 1 refers to 2 studies that have similar effects in magnitude, though the larger of the 2 studies (smaller se) is the only sig one. Since the difference between the two effects is itself not statistically significant, we should conclude treatment in situation 1 is effective (this seems to be in line with your paper).
In situation 2 there are 2 equally sized experiments that differ in treatment effect and significance. Since the difference between the estimates is statistically significant, one concludes the paradigm needs more replications.
In situation 3 the 2 studies have 2 effects, one is statistically significant while the other is not. However in this situation study F is neither statistically nor substantively significant. Unlike situation 1 it would seem unwise to conclude Treatment in situation 3 is effective and we need more replications.
Situation 4 is just some result I came across in a research synthesis, where a smaller study (larger se) had a statistically sig effect, but a larger one did not. It would seem in this situation the true effect is null and the stat sig effect is a type 1 error. However, the difference between studies is not stat sig; would this matter?
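
The comparison described in situation 1 can be sketched numerically. This is only an illustration: the estimates and standard errors below are hypothetical (they are not the student's numbers), the function names are made up for this sketch, and it assumes two independent studies whose estimates are approximately normal.

```python
import math
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p-value for a z-test of `estimate` against zero."""
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def difference_between_studies(est_a, se_a, est_b, se_b):
    """Difference of two independent estimates, its standard error, and its p-value."""
    diff = est_a - est_b
    se_diff = math.hypot(se_a, se_b)  # sqrt(se_a**2 + se_b**2)
    return diff, se_diff, two_sided_p(diff, se_diff)


# Hypothetical studies: one clears the 5% threshold, the other does not.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

p_a = two_sided_p(est_a, se_a)
p_b = two_sided_p(est_b, se_b)
diff, se_diff, p_diff = difference_between_studies(est_a, se_a, est_b, se_b)

print(f"study A: p = {p_a:.3f}")
print(f"study B: p = {p_b:.3f}")
print(f"difference: {diff:.1f} (se {se_diff:.1f}), p = {p_diff:.3f}")
```

With these made-up inputs, one study is "significant" and the other is not, yet the difference between them is well within what sampling variability alone would produce.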

I replied that my quick reaction is that it would be better if there were data from more studies. With only two studies, your inference will necessarily depend on your prior information about effectiveness and variation of the treatments.

The student then wrote:

That is my reaction as well. Unfortunately sometimes the only data we have is from a small number of studies, and not enough to necessarily run a meta-analysis on. In addition, the hypothetical situations I sent you are sometimes all we know about the effectiveness and variation in treatments, because it is all the evidence we have. What I am trying to better understand is if your paper is addressing situation 1 ONLY, or if it is making inferences or statements about the evidence in the other situations I presented.

To which I replied that I don’t know that our paper gives any real recommendations. In a decision problem, I think ultimately it’s necessary to bite the bullet and decide what prior information you have on effectiveness rather than relying on statistical significance.

This is a problem under classical or Bayesian methods. Either way, it’s standard practice to summarize uncertainty in a way that encourages deterministic thinking.

38 thoughts on “The inevitable problems with statistical significance and 95% intervals”

  1. Andrew what is the history of the 5% significance level? Do you have any old posts on it? That might be helpful in framing a good argument. Also can you expand this to 95% confidence intervals. Did we settle for these levels, or is there some basis? Thanks.

    • Jonathan:

      I don’t know the history. I’ve usually heard the 5% level attributed to R. A. Fisher, and I’ve always told people that 5% might be just perfect for the sorts of problems that were encountered in agricultural trials in the 1920s but might not be so appropriate today.

      • Completely agree, and you have the academic track record that can help steer the conversation away from those original conclusions!

      • The 5% level does go back to Fisher, but there might be more than tradition operating here. When I introduce hypothesis testing to my intro stats students I do the following: I tell them that I am going to split them into control and treatment groups at random by drawing cards from a deck (red=control, black=trt). I shuffle the deck and draw a red card. I replace the card, shuffle again, and get another red card. This pattern continues. (Of course, I am cheating and only pretend that the process is random). Eventually the class starts to laugh at the absurdity of the string of red cards. No one laughs after 3 in a row red, but at k=5 red many in the class are suspicious and at k=6 they all are (or so it seems). k=5 means p-value = .5^4 = .06 (two-sided — there is nothing suspicious about the color of the first card, only the matching colors thereafter); k=6 means p-value = .03. It is at about a 5% level that people say to themselves “I don’t believe that what I’m seeing is simply due to chance.”
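
The arithmetic in the card example can be checked in a few lines. This is a minimal sketch, assuming a fair deck with replacement so that each draw matches the first card's color with probability 1/2; the function name is invented for illustration.

```python
def p_value_all_same_color(k):
    """Probability that draws 2..k all match the color of the first draw.

    Assumes independent draws with replacement from a fair deck, so each
    later draw matches with probability 0.5. The first card is free,
    which is why the exponent is k - 1.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return 0.5 ** (k - 1)


for k in range(3, 7):
    print(k, p_value_all_same_color(k))
```

For k=5 this gives 0.0625 and for k=6 it gives 0.03125, matching the rounded .06 and .03 quoted above.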

        • Jeff:

          I think it depends on the example. I always say that when forecasting presidential elections I don’t want a 95% interval, because a 1-in-20 probability corresponds to once-in-80-years, and I wouldn’t expect any model to be valid for such a long period in U.S. politics.

        • Joseph: I don’t know. At one point Larry Wasserman wrote:

          The particle physicists have left a trail of such confidence intervals in their wake. Many of these parameters will eventually be known (that is, measured to great precision). Someday we can count how many of their intervals trapped the true parameter values and assess the coverage. The 95 percent frequentist intervals will live up to their advertised coverage claims.

          But I’m skeptical. I won’t believe it until I actually see this trail.

        • I agree that for different situations we want different levels of confidence. My point was only that the thinking of “Hey, that seems strange” kicks in at around 5% weirdness for most people. In the card shuffling setting I do _not_ say “Let’s do a hypothesis test.” I haven’t even mentioned the concept at that point in the class. The students simply experience something that is not consistent with their expectations. They reject an implied null hypothesis when they laugh at what they are seeing, and most of them do this when the p-value (they have never heard that term, at least not from me) is in the neighborhood of 5%. This doesn’t mean that 95% confidence/5% significance is the right level for detecting that someone is cheating with a deck of cards, let alone for something that matters, such as FDA approval of a drug. I’m just observing that 5% is more than just an arbitrary threshold; it seems to correspond to something about human judgment. Or maybe I have strange students every year ;-)

        • Andrew: wow, I hadn’t seen that Wasserman post before. I just can’t fathom that there are still so many phenomenal, yet simple to correct, misconceptions on this topic.

          First, as a scientist I don’t want to be right 95% of the time. I want to be right 100% of the time. So if I’m estimating a parameter using a probability distribution for the measurement errors P(error) then to be correct using a Bayesian analysis all I need is that the actual errors in the data set I have before me lie in the high probability manifold of P(error). This is a realistic and achievable goal. In fact it can be achieved every time by spreading P(error) out enough to guarantee it’s true. The only issue is whether the resulting Bayesian interval will be too large to be useful.

          The Frequentist 95% coverage however requires a far stronger assumption which is basically never true. Namely that the errors measured over a very long sequence explore the probability distribution P(error) in just the right way so that the histogram of the errors looks like P(error).

          But this goal is a phantom. The repeated measurements in question are often not possible even in principle, or rarely ever done. And the few instances in which they are, the histograms don’t look right and the 95% confidence intervals don’t have 95% coverage.

          The 95% confidence intervals are the right answer to the wrong question. Bayesians keep saying “it’s the wrong question” and Frequentists keep saying “it’s the right answer”.

        • This is also the example I use in class. As for the .05 mark, Fisher actually wrote that it would be reasonable to look under it for significant matter, but he also implied that it would not be a sufficient level of confidence for most purposes. I’m citing from memory, but this has been brought up before; the quote is probably retrievable.

        • (This is a reply to your quotation of Larry Wasserman.)

          It looks like you’re right to be skeptical that 95% frequentist intervals will live up to their advertised coverage claims:

          “… reported uncertainties have a consistent bias towards underestimating uncertainty …”

          Within the paper, the researchers report that 98% confidence intervals have been subsequently ‘surprised’ 20% to 40% of the time.

      • Is another good way of getting at this to compare the results at a 5% significance level with those at a 6 percent or 4 percent significance level, and then gradually do the same for each subsequent level of significance?

        I wonder at this point whether we can’t just take massive amounts of studies and run the p-value to determine new significance levels for (a) different fields and (b) different types of studies (Experiments versus IVs). Obviously this idea has not been fully fleshed out, but it could be.

        Is this a bad idea, or been suggested before? It seems using 5% as a catch-all is a terrible idea (there are no absolutes in this world).

    • But I guess Fisher did not say whether you can reject or accept the null using 0.05. That was from Neyman and Pearson. I feel Fisher just used the p-value to flag situations for further investigation.

  2. Thank you! If peer reviewed drug trials can’t be reproduced 65% of the time then imagine how bad things are out here in the statistical wilds; far from peer reviews, planned experiments and clean data. Almost every analysis I see in my applied work (military related) that mentions the words “confidence interval” or “hypothesis test” has a serious flaw of some kind directly related to the use of these methods. Please for the love of God just stop teaching it (at least to non-statistics graduate students). Things might not improve much, but they can’t get any worse than 100% wrong.

  3. As a community we really need to sort this problem out. I agree with how you pose the problem but I come on the other side in terms of solution. I have always thought that the greatest benefit of the Fisher 5%, 10%, etc. is to establish standard protocols that people can agree on in order to arrive at a consensus deterministic solution. (Note that this is not to refute the criticism of arbitrariness, and so on.)
    My working assumption is that any conclusion drawn from samples of data has uncertainty, and that the data analysis is done in order to provide the foundations for making a deterministic decision. The decision may be: should a new drug be approved for sale, or do the recent accidents indicate an underlying safety problem that needs a proactive response? In all cases, the decision has to be made in the face of uncertainty.
    I also have a third assumption, which is that if the decision problem is “interesting”, it involves multiple parties with conflicting objectives. What follows from this is that each party has a favored outcome. What follows from that is each party is prone to believing a certain story, which conditions how they want to interpret the data (either consciously or subconsciously).
    If these parties accept the “common practice”, they can at least come to a conclusion, hopefully sensible in most cases. Perhaps we should have a set of standard decision criteria instead of one standard.
    Not summarizing uncertainty, say by providing the posterior probability distribution, is an option but I find it hard to use. Then the conflicting parties will argue over their risk tolerance – this is basically arguing over alpha and beta but if I have a desired outcome, I can find alpha/beta to justify my conclusion. I’m afraid that the end result is then determined by who speaks loudest or who has the authority, and the data becomes a side show.

  4. The above begs for way too much clarification for me to tackle just now (dashing off), but I wanted to just leave a marker. Also, if you check my blog and publications you’ll get some insights into how I argue we should properly interpret and use CI’s and tests.

  5. I can’t find the paper I read that mentioned this, but apparently something similar to the 95% threshold goes back to Gauss. The motivation at the time supposedly was that in a mixture of two normals of the same variance and different means, the bimodality in the sample data becomes visible to the naked eye at about two standard deviations of separation.

    It’s pretty hard to believe that anybody noticed this back then, working with hand-gathered data, and given that Pearson is credited with introducing the histogram well after that time, so I wish I could track down the reference and take another look.

  6. “…my quick reaction is that it would be better if there were data from more studies.”

    For god’s sake. We all go to war with the data we have, not the data we want.

    • Phil:

      Please read the sentence that follows my sentence that you quoted. The follow-up is, “With only two studies, your inference will necessarily depend on your prior information about effectiveness and variation of the treatments.”

      • I’m not sure what you’re saying. I’m saying it’s rarely helpful to say it would be better if there were more data. Of course it would! It certainly wouldn’t be better if you had less!

        Remember our problem with mapping U.S. indoor radon concentrations based on 30,000 data points, with predictors, and Rick saying we don’t have enough data?

        There’s almost never enough data, or if there are enough the errors are too big, or the data aren’t representative enough, or the data were collected according to several different protocols so they aren’t exactly comparable, or….

  7. One other comment: we shouldn’t throw the baby out with the bathwater. Statistical significance is supposed to quantify sampling variability. It addresses one of the biggest logical fallacies out there, which is the law of small numbers.

  8. If my quick mental calculations are correct, in all four situations, just pooling the results in the obvious way (weighting by precision) gives the result that the treatment is effective at the 5% significance level, except for situation 3, where it is just slightly short of being significant. And that seems the right thing to do, UNLESS you have doubts about the validity of some of these studies. Of course, in practice, you always have such doubts, to some degree or other, and that’s why you might look at whether there seems to be a contradiction between the results of one study and the results of another. But exactly what you should be doing in this respect must depend on the details of what sorts of flaws these studies might have, and how likely they are.
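
Pooling “in the obvious way (weighting by precision)” is usually read as a fixed-effect, inverse-variance weighted average. Here is a minimal sketch under that reading. It assumes independent studies estimating one common effect with roughly normal errors; the inputs are hypothetical numbers rather than the values from the student's table, and the function name is invented for illustration.

```python
import math
from statistics import NormalDist


def pool_by_precision(estimates, standard_errors):
    """Fixed-effect pooled estimate, its standard error, and a two-sided p-value.

    Each study is weighted by its precision, 1 / se**2.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return pooled, pooled_se, p


# Hypothetical pair of studies: one "significant" alone, one not.
pooled, pooled_se, p = pool_by_precision([25.0, 10.0], [10.0, 10.0])
print(f"pooled estimate {pooled:.1f} (se {pooled_se:.2f}), p = {p:.3f}")
```

This sketch takes both studies at face value. It does nothing about the doubts over validity raised above, which is where the comparison between studies would come in.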

  9. I regard a 95% confidence interval as a measure of spread, one calibrated (more or less) to a probability, unlike the standard deviation which is not calibrated to anything. As such, I find it a more useful descriptor of spread in many situations than a standard deviation or similar number.

    • It’s not a measure of the spread of the data.

      Perhaps you could think of it as a measure of the spread of a hypothetical distribution of means collected under the same conditions. But it’s definitely not about the spread of the data. Perhaps a simple way to think about it is how well you’ve estimated the mean. If it’s narrow you’ve estimated the mean well and if it’s broad you haven’t got a very stable estimate.

  10. My previous comments notwithstanding, I am in agreement with what I take to be Andrew’s main point. “We set up some rules and then act as if we know what is real and what is not.” My clients want to know if a hypothesis is true or false. The trite response is to quote George Box in saying that all models are wrong but some models are useful, so I know before the client even walks through the door that her null hypothesis (beta1=0, say) is false. Any interesting dataset can be modeled in many ways, with each sensible model giving somewhat different predictions. A model that drops a predictor because a coefficient (beta1) is taken to be zero might give better or worse predictions than an alternative model. The challenge is to quantify and then live with the uncertainty, whatever model is used. (It doesn’t help matters that I live in a mathematics department, where people detest uncertainty and think of true/false as black/white; I deal with shades of green.)

  11. What is the point of doing a significance test?

    Is it to decide whether to take some action? If so, then you should be doing decision theory, not significance tests.

    Is it to get your paper published? (See the Gigerenzer paper referred to above.) Ok, do what you wish, but don’t expect me to pay much attention. You are just doing a ritual.

    But, even if you are just trying to get your paper published, the fact is that there is still a decision to be made…it is, should I submit my paper, and risk getting egg on my face, maybe not getting tenure, when it is shown to be wrong in subsequent research by others? Or should I not submit, and risk having someone else scoop me on something important. Even that is a decision.

    So here’s another paper that we see here from time to time. It’s important.

    • I’m with Steve. A colleague of mine tried to change it to “statistically discernible,” which I tried to publicize, but gave up after ASA and RSS named their joint magazine “Significance.”

      • I actually like the “significance” issue because it allows me to send multiple warnings in almost every class about confusing statistical and substantive significance. It’s an easy hook for issuing reminders about theory-building, models and so on.

  12. > I think the bigger problem is with the phrase “statistically significant,” which tends to be highly misleading to the public.

    To the _public_? You don’t go remotely far enough. Even among those who even know the meaning of “statistically significant”, and I am _including_ professional statisticians, I suspect there are very few indeed who don’t find it convenient on a regular basis to elide the difference between “statistically significant” and “significant” [i.e., the latter being the word in English as understood by native English speakers]. For professional statisticians and no few academic researchers, the elision is not so much an upfront lie as an educated complacency, as in “I said ‘_statistically_ significant’, they may be misinterpreting it, that’s their fault not mine so the paper stands”. (The fact their audience makes this mistake over and over and over again, but they gain NYTimes articles about their (cough) (statistically) significant findings and then get tenure, somehow does not invite introspection sufficient to lead to errata.)

    The central con in statistics throughout the last century is to proactively claim positive and emotive words (significant, confident, unbiased, etc) as technical terms (“hey, that’s just how we define it, don’t blame me, you are reading too much into it, we could have called it property ‘X7’”) and yet stand by quietly to reap the benefit of the associations people have with the natural language word they have co-opted. Statistics cannot do everything that people want of it (be efficient, objective, decisive, simple, deterministic, etc) – but instead of honesty about inherent limitations the field as a whole has “decided” to “lie” (or more fairly, mislead by linguistic evolution) about this rather than just be upfront about restrictions.

    IMO “statistical significance” is the acid test on one’s attitude towards this linguistic con. Can someone present or point me at ANY plausible argument as to why the term “statistically discernible” isn’t on the whole MUCH more accurate and honest in every possible respect? Yes, it’s still not perfect (it’s only two words, how good can we get?) but IMO it is such an utterly dominant suggestion in every valid dimension relative to “statistically significant”. But somehow the proposal has effectively died. Love to hear a professional statistician say why! Is it really (on the whole?) worse? Or … why not? I see that it is status-damaging (harder to confuse “discernible” with “significant”, and your audience wants to hear the latter) but what are the non-self-interested arguments?

    My expectation is that no one is going to argue against “statistical discernibility” as an absolutely better phrase than “statistical significance”, but somehow no one will find it convenient from a career perspective [in fact, may realize it is damaging to them] and somehow the world will continue as is.

  13. I am reminded of a remark in “The Adventures of Tom Sawyer”:

    Often, the less there is to justify a traditional custom, the harder it is to get rid of it.

  14. I am not sure why we use the difference between the pair to decide whether the treatment is effective or not. We can apply standard meta-analysis techniques to the 2 studies to find whether the combined evidence supports the effectiveness or not.

    On the other hand, ‘assume the studies are randomized control designs with the same treatment and outcome measures’ does not guarantee that the 2 studies are testing the same treatment effect. Yes, the drug may be the same and hence the pharmaceutical effect does not change between the pair. But the treatment effect as observed in an RCT is not simply a pharmaceutical effect: the 2 trials may use different populations (changes to the enrollment criteria), or the comparator may change. In these cases, a significant difference between treatment effects may just tell us that different treatment effects were tested. In these situations, a meta-analysis is still meaningful as usual: it answers the question whether the treatment is generally effective.

  15. Wasn’t significance a Gosset versus Fisher debate which Fisher won? Isn’t the fundamental problem that statistical significance ignores the actual payoff matrix, be that economic, mortality etc.?

    A good analogy is Risk = Probability * [ expected loss ].

    What blind adherence to significance does is ignore the “expected loss” term.
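
To make the role of the loss term concrete, here is a small sketch of choosing between two actions by expected loss. Everything in it is hypothetical: the probability that the treatment is effective, the loss figures, and the function name are invented for illustration, and it stands in for a full decision analysis rather than being one.

```python
def expected_loss(p_effective, loss_if_effective, loss_if_ineffective):
    """Expected loss of an action, averaging over whether the treatment works."""
    return (p_effective * loss_if_effective
            + (1 - p_effective) * loss_if_ineffective)


# Hypothetical inputs: the treatment is judged 70% likely to be effective.
p_effective = 0.7

# Adopting costs 100 units if the treatment turns out not to work.
adopt = expected_loss(p_effective, loss_if_effective=0.0, loss_if_ineffective=100.0)

# Skipping forgoes 40 units of benefit if the treatment does work.
skip = expected_loss(p_effective, loss_if_effective=40.0, loss_if_ineffective=0.0)

best = "adopt" if adopt < skip else "skip"
print(f"adopt: {adopt:.1f}, skip: {skip:.1f}, choose: {best}")
```

With these made-up losses the better action is to skip, even though the treatment is more likely than not to work; halving the cost of a wasted adoption reverses the choice. A fixed significance threshold cannot reflect that kind of dependence on the stakes.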
