“Not statistically significantly different from zero” != “zero”

I’m always yammering on about the difference between significant and non-significant, etc. But the other day I heard a talk where somebody made an even more basic error: He showed a pattern that was not statistically significantly different from zero and he said it was zero. I raised my hand and said something like: It’s not _really_ zero, right? The data you show are consistent with zero but they’re consistent with all sorts of other patterns too. He replied, no, it really is zero: look at the confidence interval.

Grrrrrrr.
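
To make the point concrete, here is a toy calculation with made-up numbers: a 95% interval that covers zero also covers plenty of nonzero effects.

```python
# Hypothetical estimate and standard error; the interval is consistent with zero,
# but it is just as consistent with several nonzero effects.
import numpy as np
from scipy import stats

estimate, se = 0.12, 0.10
ci = estimate + np.array([-1, 1]) * stats.norm.ppf(0.975) * se
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")  # roughly (-0.08, 0.32)

for candidate in [0.0, 0.1, 0.2, 0.3]:
    verdict = "consistent" if ci[0] <= candidate <= ci[1] else "not consistent"
    print(f"effect = {candidate:.1f}: {verdict} with the data")
# Zero is inside the interval, but so are 0.1, 0.2, and 0.3; the data do not pin the effect to zero.
```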

21 thoughts on ““Not statistically significantly different from zero” != “zero””

  1. This is related to (or essentially just the same as) the mistake of treating ns as "proof" of "no effect" for some experimental treatment or for an independent variable in an observational study. Do you have recommendations about the right sort of testing strategy to use for "no effect" hypotheses? What do you think of using power analysis to calculate the probability of finding an effect of a specified size (based on some justifiable theory of how big the effect *should* be if there is a meaningful one) given the sample size? If that is the right approach, where would you set that probability threshold as a matter of convention? If one treats p-values of .05 as the right threshold for null-hypothesis significance testing, should we expect researchers to report "no effect" findings only when a study has enough power to ensure a 95% chance that an effect would have been found if it exists?
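
    As a rough sketch of that power calculation (the effect size, group size, two-sample t-test setup, and use of statsmodels are all assumptions here):

    ```python
    # Given an effect size the theory says should exist, how likely was the study
    # to detect it, and how large would it need to be for 95% power?
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    d = 0.3  # assumed standardized effect size (Cohen's d)

    achieved = analysis.power(effect_size=d, nobs1=50, alpha=0.05)          # power of a 50-per-group study
    n_needed = analysis.solve_power(effect_size=d, alpha=0.05, power=0.95)  # n per group for 95% power

    print(f"power with n = 50 per group: {achieved:.2f}")
    print(f"n per group needed for 95% power: {n_needed:.0f}")
    ```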

  2. This one always bothers me when testing the constant term in a regression model. Jarad's point-mass prior might be really handy for working with models of physical phenomena, when the constant SHOULD be zero.

  3. Drives me mad when I'm reviewing papers (and occasionally with co-authors):

    "Found that there was no effect" and "Did not find an effect" aren't the same thing.

    Make them read Popper!

  4. The problem doesn't just exist with the null; even when researchers "show that there is an effect" by showing significance, of course we all know that they are only making a statement about the probability of all test statistics more extreme than theirs under the null. The jump from this to "there is an effect" is kind of strange, but we accept it by convention.

  5. A study concluded that there was insufficient evidence that the treatment led to positive response. A confidence interval was shown that included zero difference between treatment and control. The client stated that since a larger proportion of the confidence interval sat above zero, there was better than even odds that the difference was positive.

    This is a real story.

  6. Kaiser, from the fact that you say "this is a real story" it seems like you think it's unreasonable. But in most cases that come to mind, the client's statement is entirely reasonable. For almost anything you would bother to test, the answer won't be "there is literally no difference whatsoever, the treatment has no effect to a zillion decimal places and beyond." (It might have a negative effect rather than a positive one, but it won't be exactly zero). If the (classical) estimated effect is 0.1 +/- 10, there is indeed a better than even chance that the effect is positive. Even if your prior pulls strongly towards zero, the posterior estimate will still be on the positive side of zero!
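
    Under a flat prior, the "0.1 +/- 10" claim above is a one-line normal tail calculation (a sketch, using those hypothetical numbers):

    ```python
    from scipy import stats

    estimate, se = 0.1, 10.0  # the hypothetical estimate and standard error
    p_positive = 1 - stats.norm.cdf(0, loc=estimate, scale=se)
    print(f"P(effect > 0) = {p_positive:.3f}")  # about 0.504, barely better than even
    ```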

  7. Phil, when the client says something like that, it usually is followed by "therefore, we can conclude that the treatment works". This logic is essentially the same as throwing out the interval estimate and using the average treatment effect as the decision maker. That's why I feel uncomfortable with it.

    Like Andrew said in the entry, I prefer to think of the values inside the confidence interval as all consistent with data at the given confidence level but would not want to assign probabilities to these values being the "true" value.

  8. Kaiser, it's misleading to "think of the values inside the confidence interval as all consistent with data at the given confidence level" if what one means by that is that they are all *equally* consistent. The values proximate to the point estimate are, of course, more likely than ones at the tails of a normal distribution. Accordingly, an effect that is ns at the .05 or even the .10 significance level can still more likely than not be positive if zero is very close to the lower bound, and that can justifiably matter to a practical "client" who is as concerned with type II as with type I errors (e.g., a clinician considering the possible efficacy of a low-cost medical intervention for a fatal disease). Isn't *that* the point (or at least one of the points) of saying "ns = 0" is incorrect?

  9. Suppose my hypothesis is that there was a structural change and the effect of X on Y got weaker in the latter period. I run the same regression separately for two periods. In the first period the coefficient of X is negative and statistically significant. In the second period the coefficient of X is not statistically significant. Is my hypothesis supported? Does the coefficient's sign in the second period matter for the support of my hypothesis if the effect ceases to be statistically significant?

  10. Kerim: If your hypothesis is a structural change, surely life would be easier if you gave a single estimate measuring that change directly – and didn't try to infer anything based on patching two other estimates together? (I don't think being pro- or anti- hypothesis testing affects this)

    DK: I agree with you, but to avoid confusing those who've previously met classical likelihood, I find it helpful not to say that values are "likely" or otherwise – and instead say they have strong or weak "support", or a term like that.
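
    One way to estimate the change directly, as suggested to Kerim above, is to interact X with a period dummy so the interaction coefficient is the structural change itself; a sketch on simulated data (the true coefficients here are made up):

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 200
    period = rng.integers(0, 2, size=n)                   # 0 = early period, 1 = late period
    x = rng.normal(size=n)
    y = -1.0 * x + 0.6 * period * x + rng.normal(size=n)  # effect weakens from -1.0 to -0.4

    data = pd.DataFrame({"y": y, "x": x, "period": period})
    fit = smf.ols("y ~ x * period", data=data).fit()
    # The x:period coefficient measures the change in X's effect, with its own standard error,
    # rather than a verdict patched together from two separate significance tests.
    print(fit.params["x:period"], fit.bse["x:period"])
    ```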

  11. DK, once we set the significance level, an effect is either significant or not significant. By accepting any other conclusion, we are changing the significance level of the test *after* we see the result. If the client is willing to suffer a false positive half the time, we set alpha = 0.5. But I shudder at the thought.
    This is why, in practice, the often-criticized Fisherian thresholds of 0.05, 0.01, etc. are extremely useful. They can be described as the industry standard. Without such a standard, one can be coerced into picking the right alpha to justify any conclusion the client wants.

  12. Kaiser, wouldn't it make more sense, on the contrary, not to have a fixed threshold but rather to report the p-value at which your observation would indicate an effect?

    So rather than simply saying that it's not significant at p=0.05, isn't it more interesting to know that at p=0.08 it is? If the client is ok with being 92% confident that there is an effect instead of an arbitrary 95%, why not?

  13. Kaiser, you say 'when the client says something like [the odds are better than even], it usually is followed by "therefore, we can conclude that the treatment works".' I agree with you that if the client thinks "better than 50-50 proves that it works," they're wrong and should be corrected. But the erroneous part is what comes _after_ the "therefore," whereas what you were criticizing is the part _before_ the "therefore," and the client had that part right.

  14. I was at a political science job talk this year in which a candidate claimed the main variable of interest was "significant at the 70% confidence interval."

    I'm not a fan of fetishizing .05, but that is extreme even for me.

  15. Phil and Jens: I think we agree that it's all about the p-value and, to some extent, how we guide the client. I find it frustrating to have to flip the statistical conclusion because the client claims that they are willing to take "even odds" after finding that the test result is not to their liking. Maybe there is a better way to handle that. How does one avoid the slippery slope of setting the significance level just right to show that the treatment works? (Note I speak from the perspective of an in-house expert, not an external consultant.)

  16. > But the erroneous part is what comes _after_ the "therefore,".

    Indeed, but since statistical evidence is often used in marketing, my "null hypothesis" about what comes after the "therefore" is quite probably "it's worth paying for the treatment" rather than "*the client* would pay for the treatment" (to which, indeed, there can be little objection from the statistician, as they have presented all the evidence).

    A.

  17. As a statistical editor of a medical journal I see this daily.

    Plenty of people seem to think you can do a test with super low power and then declare two groups equivalent. Perhaps they know better and just don't care; perhaps they don't know better.

    That's why I wrote a paper for our journal called "The Value of a P-valueless Paper". At least if authors show the interval, we can quantify just how absurd their claim is.
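
    For illustration, with simulated numbers rather than any real paper: a low-powered comparison can fail to reject while its interval still spans differences most readers would call important.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    treat = rng.normal(0.4, 1, 15)    # small groups; the true difference is 0.4
    control = rng.normal(0.0, 1, 15)

    diff = treat.mean() - control.mean()
    se = np.sqrt(treat.var(ddof=1) / 15 + control.var(ddof=1) / 15)
    ci = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df=28) * se
    print(f"difference: {diff:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
    # Whatever the p-value says, an interval this wide cannot support a claim
    # that the two groups are equivalent within any sensible margin.
    ```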

  18. DK, back to your original question about how to test for "no effect."

    In all the theory classes I took, I only really saw a couple ways of tackling this concept. One is just the trivial way of seeing if the p-value hovers around .5, but that's no fun.

    One case is when you have a simple vs simple hypothesis setup. Then, I think your logic works. Set up the test such that the power would be equal to .95 if the alternative is true. If you don't reject, you can safely accept the null with 95% confidence.

    In a simple vs composite setup, I think you'd have to go Bayesian and define some kind of prior distribution for the parameter (good luck justifying the one you pick). You'd need to adjust the sample size such that the expected value of the power curve over all values of the parameter in the alternative hypothesis is .95. This is just something I came up with right now, so I'm not completely sure it would work, but I don't see any flaw in the logic. However, my intuition is that n will be so high (infinity, maybe?) for any continuous prior distribution as to make this idea infeasible.
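
    The "expected power over a prior" idea can at least be sketched by Monte Carlo; the prior, the one-sample z-test framing, and all the numbers below are assumptions, not anything established above.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha = 0.05
    effects = rng.normal(0.0, 0.2, size=10_000)  # assumed prior on the effect, outcome sd = 1

    def expected_power(n):
        """Average two-sided z-test power over the draws from the prior."""
        z_crit = stats.norm.ppf(1 - alpha / 2)
        shift = np.abs(effects) * np.sqrt(n)
        power = stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)
        return power.mean()

    for n in [100, 1_000, 10_000, 100_000]:
        print(n, round(expected_power(n), 3))
    # Because the prior puts real mass near zero, the averaged power climbs only slowly with n,
    # which is the worry above that the required sample size may be impractically large.
    ```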
