Don’t let your standard errors drive your research agenda

Alexis Le Nestour writes:

How do you test for no effect? I attended a seminar where the person assumed that a non significant difference between groups implied an absence of effect. In that case, the researcher needed to show that two groups were similar before being hit by a shock conditional on some observable variables. The assumption was that the two groups were similar and that the shock was random. What would be the good way to set up a test in that case?

I know you’ve been through that before (http://statmodeling.stat.columbia.edu/2009/02/not_statistical/) and there are interesting comments but I wanted to have your opinion on that.

My reply: I think you have to get quantitative here. How similar is similar? Don’t let your standard errors drive your research agenda. Or, to put it another way, what would you do if you had all the data? If your sample size were 1 zillion, then everything would be statistically distinguishable from everything else. And then you’d have to think about what you really care about.

12 Comments

  1. […] Or, Andrew Gelman says it another way – “think about what you really care about.” […]

  2. Christian Hennig says:

    Alexis: You could specify a threshold on the difference that in terms of interpretation means “no substantially meaningful difference” – for example |\mu_1-\mu_2|=\epsilon against the alternative “smaller” (this can often be done with a minor modification of standard tests). Significance then gives you evidence *in favour* of “no (big) difference”.

    Depending on \epsilon, this may require quite large samples, though.

  3. Christian Hennig says:

    Somehow a bit of my posting was eaten. Here is the complete one.

    Alexis: You could specify a threshold on the difference that in terms of interpretation means “no substantially meaningful difference” – for example |\mu_1-\mu_2|=\epsilon against the alternative “smaller” (this can often be done with a minor modification of standard tests). Significance then gives you evidence *in favour* of “no (big) difference”.

    Depending on \epsilon, this may require quite large samples, though.

  4. I have a paper that readers might find interesting at http://crain.co/nme

  5. Christian Hennig says:

    Again! Apparently the system gets confused with “larger”/”smaller signs? I make a third attempt. Sorry! The first two can go.

    Alexis: You could specify a threshold on the difference that in terms of interpretation means “no substantially meaningful difference” – for example |\mu_1-\mu_2| “smaller” \epsilon, and then you could test |\mu_1-\mu_2| “larger or equal” \epsilon against the alternative “smaller” (this can often be done with a minor modification of standard tests). Significance then gives you evidence *in favour* of “no (big) difference”.

    Depending on \epsilon, this may require quite large samples, though.

    • Corey says:

      The system thinks that < is the start of an html tag, so everything after it looks like a syntax error and gets wiped. To make a < glyph, type “&lt;”.

    • Rahul says:

      That triple-post was funny! :) \< /<

    • Georgette says:

      This is called equivalence testing. It is common practice, especially in biological/medical research where there is a boundary of interest to this. For example most generic drug comparisons are set up as a comparison for the log of the ratio to be within log(.8),log(1.25). There is an extensive literature on this but it doesn’t seem to get out of its biopharma niche.

      I think Andrew would agree that equivalence is one of those frequentist techniques that takes on Bayesian properties such as using prior knowledge of the boundary. Also there is Bayesian equivalence testing.
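    The procedure Christian and Georgette describe is the two one-sided tests (TOST) approach to equivalence: fix an equivalence margin \epsilon, then test H0: |\mu_1 − \mu_2| ≥ \epsilon against H1: |\mu_1 − \mu_2| < \epsilon by running one-sided t-tests against each boundary. A minimal sketch (the function name, margin, and simulated data are illustrative, not from the discussion above):

    ```python
    # Sketch of the two one-sided tests (TOST) equivalence procedure.
    # Equivalence is declared when BOTH one-sided tests reject, i.e. when
    # the larger of the two one-sided p-values is below the alpha level.
    import numpy as np
    from scipy import stats

    def tost(x, y, eps):
        """TOST p-value for H0: |mean(x) - mean(y)| >= eps (pooled-variance t)."""
        nx, ny = len(x), len(y)
        diff = np.mean(x) - np.mean(y)
        # pooled standard error, as in the equal-variance two-sample t-test
        sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        se = np.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
        df = nx + ny - 2
        p_lower = stats.t.sf((diff + eps) / se, df)   # H0: diff <= -eps
        p_upper = stats.t.cdf((diff - eps) / se, df)  # H0: diff >= +eps
        return max(p_lower, p_upper)  # small => evidence of "no (big) difference"

    rng = np.random.default_rng(0)
    x = rng.normal(0.00, 1.0, 200)  # hypothetical group 1
    y = rng.normal(0.05, 1.0, 200)  # hypothetical group 2, trivially different
    p = tost(x, y, eps=0.5)
    print(p)
    ```

    As Christian notes, small \epsilon demands large samples: the one-sided tests only reject when the confidence interval for the difference fits entirely inside (−\epsilon, +\epsilon). For the bioequivalence setting Georgette mentions, the same logic is applied to the log of the ratio with the (log 0.8, log 1.25) margins.
    
    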

  6. Martha Smith says:

    Yes, Yes, Yes: “you have to get quantitative here. … Don’t let your standard errors drive your research agenda… you’d have to think about what you really care about,” — except I’d change
    “you’d have to” to “you have to,” no matter what your sample size.

    It seems to be the rule rather than the exception that papers draw conclusions on the basis of “statistical significance” (often even at the 0.1 level, often with multiple testing involved but its effect ignored) with either no discussion of power or something like, “a power calculation showed that this sample size will give a medium effect size,” (presumably referring to something like Cohen’s d, not a raw effect size — and with no discussion of the effect of multiple testing on power.) So rarely do I see discussion of what raw effect size is practically significant. There is no indication of thinking, just turning a crank and magically expecting a meaningful (“significant” in a magical way) result.

  7. Alexis says:

    Thank you Andrew for sharing it. And thank you for the thoughts in the comment section.