I’ve talked about this a bit but it’s never had its own blog entry (until now).

Statistically significant findings tend to overestimate the magnitude of effects. This holds in general (because E(|x|) > |E(x)|) but even more so if you restrict to statistically significant results.

Here’s an example. Suppose a true effect of theta is unbiasedly estimated by y ~ N(theta, 1). Further suppose that we will only consider statistically significant results, that is, cases in which |y| > 2.

The estimate “|y| conditional on |y|>2” is clearly an overestimate of |theta|. First off, if |theta|<2, the estimate |y| conditional on statistical significance is not only too high in expectation, it's *always* too high. This is a problem, given that |theta| in reality is probably less than 2. (The low-hanging fruit have already been picked, remember?)

But even if |theta|>2, the estimate |y| conditional on statistical significance will still be too high in expectation.
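To put rough numbers on this, here is a minimal sketch that evaluates E(|y| given |y| > 2) in closed form for y ~ N(theta, 1), using the standard partial-expectation formulas for the normal distribution. The function name `expected_abs_given_significant` and the particular values of theta are my own illustrative choices, not anything from the original analysis.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def expected_abs_given_significant(theta, cutoff=2.0):
    """E(|y| given |y| > cutoff) for y ~ N(theta, 1), in closed form."""
    upper = 1.0 - STD_NORMAL.cdf(cutoff - theta)  # P(y > cutoff)
    lower = STD_NORMAL.cdf(-cutoff - theta)       # P(y < -cutoff)
    # Partial expectations of |y| over the two "significant" tails.
    upper_part = theta * upper + STD_NORMAL.pdf(cutoff - theta)
    lower_part = -theta * lower + STD_NORMAL.pdf(cutoff + theta)
    return (upper_part + lower_part) / (upper + lower)


if __name__ == "__main__":
    # Illustrative true effects (assumed values, in standard-error units).
    for theta in (0.0, 0.5, 1.0, 2.0, 3.0, 5.0):
        est = expected_abs_given_significant(theta)
        print(f"theta = {theta:3.1f}   E(|y| given |y| > 2) = {est:.2f}")
```

The conditional expectation exceeds |theta| at every value tried, and the gap is largest when the true effect is small relative to the standard error.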

For a discussion of the statistical significance filter in the context of a dramatic example, see this article or the first part of this presentation.

I call it the *statistical significance filter* because when you select only the statistically significant results, your “type M” (magnitude) errors become worse.

And classical multiple comparisons procedures, which select at an even higher threshold, make the type M problem worse still (even if these corrections solve other problems). This is one of the troubles with using multiple comparisons to attempt to adjust for spurious correlations in neuroscience. Whatever happens to exceed the threshold is almost certainly an overestimate. This might not be a concern in some problems (for example, in identifying candidate genes in a gene-association study) but it arises in any analysis (including just about anything in social or environmental science) where the magnitude of the effect is important.
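As a rough illustration of how a higher threshold worsens the exaggeration, the sketch below uses a Bonferroni correction as one example of a classical multiple comparisons procedure (my choice, as is the assumed true effect of theta = 1 standard error and the helper names). It computes the ratio of E(|y| given significance) to |theta| as the number of comparisons grows.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def bonferroni_cutoff(n_comparisons, alpha=0.05):
    """Two-sided z cutoff when alpha is split evenly over n_comparisons."""
    return STD_NORMAL.inv_cdf(1.0 - alpha / (2.0 * n_comparisons))


def expected_abs_given_significant(theta, cutoff):
    """E(|y| given |y| > cutoff) for y ~ N(theta, 1), in closed form."""
    upper = 1.0 - STD_NORMAL.cdf(cutoff - theta)  # P(y > cutoff)
    lower = STD_NORMAL.cdf(-cutoff - theta)       # P(y < -cutoff)
    upper_part = theta * upper + STD_NORMAL.pdf(cutoff - theta)
    lower_part = -theta * lower + STD_NORMAL.pdf(cutoff + theta)
    return (upper_part + lower_part) / (upper + lower)


def exaggeration_ratio(theta, n_comparisons):
    """Expected significant |estimate| divided by the true |effect|."""
    cutoff = bonferroni_cutoff(n_comparisons)
    return expected_abs_given_significant(theta, cutoff) / abs(theta)


if __name__ == "__main__":
    theta = 1.0  # assumed true effect, in standard-error units
    for m in (1, 10, 100, 1000):
        print(f"comparisons = {m:4d}   cutoff = {bonferroni_cutoff(m):.2f}   "
              f"exaggeration ratio = {exaggeration_ratio(theta, m):.2f}")
```

Because a significant estimate must exceed the cutoff in absolute value, raising the cutoff drags the conditional expectation up with it while theta stays where it is.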

[This is part of a series of posts analyzing the properties of statistical procedures as they are actually done rather than as they might be described in theory. Earlier I wrote about the problems of inverting a family of hypothesis tests to get a confidence interval and how this falls apart given the way that empty intervals are treated in practice. Here I consider the statistical properties of an estimate conditional on it being statistically significant, in contrast to the usual unconditional analysis.]

Perhaps this explains why many published academic articles cannot be replicated in an industrial setting? I couldn’t help but think about the “Of Beauty, Sex, and Power” article mentioned above when I read the following post by Tyler Cowen. While academics have a strong incentive to find a significant effect, industry researchers have an incentive to get the best estimate.

http://marginalrevolution.com/marginalrevolution/2011/09/how-good-is-published-academic-research.html

John Ioannidis has a nice article on this problem.

http://dcscience.net/ioannidis-associations-2008.pdf

Would you care to relate these points to GWAS (genome-wide association studies)? Or perhaps you can provide a link to a relevant article or two.

Here’s one. I first linked to it in response to a previous post on Type M errors.

In GWAS (where testing takes priority over estimation) the regression to the mean problem is well known; researchers call it the “Winners’ Curse”, a term borrowed from economics and game theory.

Of course, estimation is still useful in GWAS, so bias-correction methods have been developed; e.g.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796696/

http://www.sph.umich.edu/csg/boehnke/pdf/ge33-453.pdf

@numeric In GWAS studies, the effect magnitude is usually secondary to making a binary decision on whether a gene is significant or not (effect sizes are usually pitifully small anyway). These issues are definitely there, but then again GWAS studies have all kinds of interpretation problems that are probably higher on the list than over-estimating effect sizes.

In my opinion, many of the assumptions underlying GWAS are long overdue for re-evaluation – “common disease common variant”, looking for individual SNPs with large effect sizes, the confirmatory/causal, rather than exploratory, style of analysis, presentation and interpretation…

This problem occurs frequently in astronomy, when there is multi-exposure imaging. If a source is only “detected significantly” in a subset of the exposures, the mean of the brightness measurements of the significant detections is an over-estimate of the source brightness. This is obvious when you think about it for this simple case (where the significance cut happens right before the measurement), but if the “significance cut” happened long ago in the data analysis chain, maybe even by another investigator who passed on the results in “reduced” form, it is hard to see it and note its possible effect.
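The multi-exposure case can be sketched with a small simulation. Everything here is hypothetical: the true flux, the noise level, the detection threshold, and the function name `mean_of_detections` are made-up values for illustration, with Gaussian noise assumed.

```python
import random


def mean_of_detections(true_flux, noise_sd, threshold, n_exposures, seed=0):
    """Simulate noisy exposures; return (mean of all, mean of detected, count)."""
    rng = random.Random(seed)
    measurements = [rng.gauss(true_flux, noise_sd) for _ in range(n_exposures)]
    # Keep only exposures in which the source counts as "detected".
    detected = [m for m in measurements if m > threshold]
    mean_all = sum(measurements) / len(measurements)
    mean_detected = sum(detected) / len(detected) if detected else float("nan")
    return mean_all, mean_detected, len(detected)


if __name__ == "__main__":
    # Hypothetical faint source: true flux of 2 noise units, 3-sigma detection cut.
    mean_all, mean_detected, n_detected = mean_of_detections(
        true_flux=2.0, noise_sd=1.0, threshold=3.0, n_exposures=100_000
    )
    print(f"mean over all exposures:       {mean_all:.2f}")
    print(f"mean over detected exposures:  {mean_detected:.2f} (n = {n_detected})")
```

Averaging all exposures recovers the true flux, while averaging only the detections is forced above the threshold and so overstates the brightness.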
