Gur Huberman asks what I think of this magazine article by Jonah Lehrer (see also here).
My reply is that it reminds me a bit of what I wrote here. Or see here for the quick PowerPoint version: The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount. I know that Dave Krantz has thought about this issue for a while; it came up when Francis Tuerlinckx and I wrote our paper on Type S errors, ten years ago.
My current thinking is that most (almost all?) research studies of the sort described by Lehrer should be accompanied by retrospective power analyses or informative Bayesian inferences. Either approach (whether classical or Bayesian, the key is that it incorporates real prior information, just as is done in a classical prospective power analysis) would, I think, moderate the tendency to overestimate the magnitude of effects.
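To give a sense of what I mean, here's a rough sketch (in Python, with hypothetical numbers) of the sort of retrospective calculation I'm thinking of: take a plausible effect size from real prior information, not from the data at hand, plug in the standard error the study actually achieved, and look at the implied power, the chance of getting the sign wrong, and the exaggeration among the estimates that happen to reach significance:

```python
import numpy as np
from scipy import stats

def retrospective_design_summary(plausible_effect, se, alpha=0.05, sims=200_000):
    """Power, sign-error rate, and exaggeration implied by a design with
    standard error `se`, if the true effect is `plausible_effect`."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    z = plausible_effect / se
    power = stats.norm.cdf(-z_crit - z) + 1 - stats.norm.cdf(z_crit - z)

    # Simulate estimates to summarize what happens conditional on
    # statistical significance.
    est = np.random.default_rng(1).normal(plausible_effect, se, sims)
    sig = np.abs(est) > z_crit * se
    type_s = np.mean(np.sign(est[sig]) != np.sign(plausible_effect))
    exaggeration = np.mean(np.abs(est[sig])) / abs(plausible_effect)
    return power, type_s, exaggeration

# A small true effect measured with a noisy design: low power, and the
# statistically significant estimates are several times too large.
print(retrospective_design_summary(plausible_effect=0.1, se=0.2))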
My answer to the question posed by the title of Lehrer's article is yes: there is something wrong with the scientific method, if that method is defined as running experiments and doing data analysis in a patternless way and then reporting, as true, results that pass a statistical significance threshold.
And corrections for multiple comparisons will not solve the problem: such adjustments merely shift the significance threshold without addressing the overestimation of small effects.
I do see this huge overestimation problem for small effects with small samples in my research on cancer genomics. Running a separate analysis for each factor generates too many significant genes. Fitting a complicated hierarchical or mixture model dramatically reduces the number of significant results. Basically, this is consistent with Andrew's point.
Andrew: would you mind elaborating on your call for retrospective power calculations, perhaps explaining why they should be done and what form they should take?
I've previously been asked by referees for such calculations but I see no rationale for them: if 0 is a plausible value of the parameter, then usually you have much uncertainty about the true parameter value, so plugging in a point estimate and calculating the power as if that were the true value seems unduly confident.
A couple of papers that dismiss retrospective power calculations with arguments that I found convincing are
Hoenig and Heisey (2001) The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat 55:1–6
Goodman and Berlin (1994) The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 121:200–6
But now you sow doubt in my mind: is there some argument in favour of retrospective power calculations that I've missed?
Jiyang: Have you written this up? I think it's an important point that's not fully understood.
Alex: I'll have to read these references and then write something more formal on the topic.
For some sense of Goodman's current thinking on retrospective power see the TV show here
http://www.stat.columbia.edu/~cook/movabletype/ar…
not so much in his talk itself, but in the after-talk comments, where he nicely points out that any sense of power requires priors of some sort, and that the confusion comes from trying to overlook that.
K?
Hi Andrew,
Thanks for asking. The paper is in preparation and will come out soon. You are right that this point is not fully understood. I am seeking theoretical evidence to fully convince my boss. I was glad to see your paper and slides on this. I will keep you posted. Thanks.
I recently completed what I believe is a retrospective power analysis. The intent was to communicate the effect-size threshold my final sample size was able to detect, and the sample size needed to bring "marginal" effects (i.e., effects with p-values less than .10) below the 5% threshold.
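Something like this standard normal-approximation calculation for a two-group comparison gives the flavor (the numbers are placeholders, not my actual figures, and of course plugging in an observed effect treats a noisy estimate as if it were the truth):

```python
from scipy import stats

def min_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized mean difference detectable with the given
    power in a two-group comparison (normal approximation)."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return (z_a + z_b) * (2 / n_per_group) ** 0.5

def n_per_group_for_effect(d, alpha=0.05, power=0.80):
    """Per-group sample size needed to detect a standardized difference d."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return 2 * ((z_a + z_b) / d) ** 2

print(min_detectable_effect(n_per_group=50))   # about 0.56
print(n_per_group_for_effect(d=0.30))          # about 175 per group
```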
Jiyang: just to clarify, are you seeing too many significant genes because the p-values being produced are badly calibrated? That isn't the same as the over-estimation point, which happens even when the analysis works perfectly… although the two problems will tend to appear together.
There are some (simple-but-nice) corrections for over-estimation available; Zhong and Prentice did one, and Xiao and Boehnke did something similar.
Fortunately, most people do not define the scientific method that way.
Lord Rutherford: "If your experiment requires statistics, you ought to have done a better experiment."
Hi, Anon
Thanks for the comment, but it's not a problem of p-value calibration; it's a matter of the scientific method, or the modeling. I use exactly the same procedure to estimate the effects and calibrate the p-values; the only difference is the model. I can't give too much detail about this project since it's unpublished, and it's not in the context of GWAS. The reason we see too many significant results with simple separate models is that many small effects have been over-estimated. Fitting a larger model can fix this and reduce the false positive rate.
Hello Andrew, Alex
Looks like someone already took a stab at criticizing these papers.
http://jas.fass.org/cgi/content/full/87/6/1854
Look forward to hearing your thoughts on this.
Best,
Bob
Bob:
I think all these authors are missing the point. I am currently writing an article to explain how I think retrospective power analysis should be done.
Bob:
I love the terse style of the paper you link to.
My argument against post-hoc power calculations is two-fold:
First, in observational studies you already know that all null hypotheses are false, as everything is associated with something associated with the outcome. So if H0 stands after the analysis, all you learn is that you don't know which direction the effect goes, plus a little about its magnitude (not a lot, though, since the CI/posterior straddles 0). So if a post-hoc power calculation purports to tell you that you didn't reject H0 not because your sample was too small but because H0 is true, that clashes with the prior knowledge you get from the study design.
Second, although Leventhal argues for a few a priori effect sizes to use, and these may be sensible, the more common alternative is to plug in the "observed" effect size (or, rather, a point estimate based on the data). But conditioning on not having enough evidence to be sure of the sign of the effect means plugging in a value that may be very far from the true value, so the "observed" power can be quite far from the true power, and is surely biased.
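Here's a quick sketch of that second point with made-up numbers: if the true standardized effect is 0.2 with standard error 0.1, the true power is about 50%, but the "observed" power computed from each dataset's point estimate is scattered widely, and it is systematically low exactly in the datasets where H0 was not rejected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect, se, alpha = 0.2, 0.1, 0.05    # true power is about 50% here
z_crit = stats.norm.ppf(1 - alpha / 2)

est = rng.normal(true_effect, se, 100_000)
observed_power = (stats.norm.cdf(-z_crit - est / se)
                  + 1 - stats.norm.cdf(z_crit - est / se))
true_power = (stats.norm.cdf(-z_crit - true_effect / se)
              + 1 - stats.norm.cdf(z_crit - true_effect / se))
rejected = np.abs(est) > z_crit * se

print("true power:", round(true_power, 2))
print("mean 'observed' power, H0 rejected:",
      round(observed_power[rejected].mean(), 2))
print("mean 'observed' power, H0 not rejected:",
      round(observed_power[~rejected].mean(), 2))
```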
I look forward to reading Andrew's solution.
I looked more carefully at the Leventhal paper (linked to by Bob above) and I don't like it.
Leventhal is commenting on a paper by Lenth. At one point, Leventhal writes:
Now let's look at what Lenth wrote:
Lenth wrote that power depends on several things. He is completely clear that it's a conditional probability.
Partly following, I think, from a blog entry by Andrew, I've heard a number of people say something along the lines of what Alex Cook writes: "all null hypotheses are false." But it seems to me this is incorrect. If my null hypothesis is that there is no relationship between the outcome of a randomization device on my computer and the weather, then it shouldn't matter how large a dataset I collect: I will never reject the null with probability greater than alpha (assuming, of course, there is no causal relationship between the two).
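A toy check of that claim, with everything simulated so the null really is true:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
for n in (100, 10_000, 1_000_000):
    rejections = 0
    for _ in range(100):
        x = rng.normal(size=n)   # "randomization device"
        y = rng.normal(size=n)   # "weather", genuinely unrelated to x
        _, p = stats.pearsonr(x, y)
        rejections += p < 0.05
    print(n, rejections / 100)   # stays near alpha = 0.05 at every n
```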
In the real world many factors are causally entangled with each other, but even then the effect sizes are probably very small. So it is my claim that many (most) causal effects are effectively 0 or are in fact 0, and therefore there are many null hypotheses that are true (or are for all practical purposes true).
Any thoughts on this? (Thanks Andrew and commentators for a great blog)
John:
Yes, I agree that many effects are small, even if not zero. I am not particularly interested in statistical methods (whether based on hypothesis tests, Bayes factors, cross-validation, or whatever) that attempt to determine whether certain parameters are exactly zero. I don't like, for example, methods that try to "learn" conditional independence structures in a graph. But I do think it's a good idea for one's model to include the possibility that a parameter can be very close to zero, and to include such information in prior distributions or power analyses. See here for an example.
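As a minimal sketch of that last point (just a conjugate normal-normal calculation with hypothetical numbers, not the model in the linked example): a prior that says the effect is probably close to zero pulls a noisy "significant" estimate most of the way back toward zero.

```python
# Hypothetical numbers: a noisy "significant" estimate combined with a
# prior concentrated near zero (conjugate normal-normal calculation).
prior_mean, prior_sd = 0.0, 0.1   # prior: effect is probably close to zero
estimate, se = 0.45, 0.20         # "significant" but noisy point estimate

post_precision = 1 / prior_sd**2 + 1 / se**2
post_mean = (prior_mean / prior_sd**2 + estimate / se**2) / post_precision
post_sd = post_precision ** -0.5

print("posterior mean:", round(post_mean, 3))  # about 0.09, shrunk from 0.45
print("posterior sd:", round(post_sd, 3))      # about 0.09
```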
It's true that a big problem is that data analysis with no theory turns up lots of spurious correlations. But that's the easy problem. The hard one is the data analysis with a good but wrong theory that turns up spurious correlations. The Symmetry-and-Sexual-Attractiveness publications story sounds like that. But the scientific process still works even with publication bias if the journal editors are biased only towards interesting results, as they should be, and not towards defending particular positions. Here's what I posted over at Wired:
Interesting results are more likely to get published, introducing a selection bias.
The first stage is that everybody either believes Not-X or Uncertain-about-X. One person submits a study showing Uncertain-about-X or Not-X, and it is rejected as boring. Another person submits a paper showing X, and it is accepted because it is interesting.
The second stage is that lots of other people start studying X. Those who find X get published, and those who find Uncertain or Not-X get rejected.
The third stage is that X becomes the conventional wisdom.
The fourth stage can go either of two ways. Either way, the bias reverses. Papers that prove X now start getting rejected, as being old-hat. If X is true, interest in the area dies out or moves on to more advanced subtopics. If X is false, papers that show Not-X now start getting accepted, and in a fifth stage, Not-X becomes the conventional wisdom.
ps. Did Lord Rutherford really say something as foolish as "If your experiment requires statistics, you ought to have done a better experiment."?
Having taught meta-analysis, published a dozen of my own, and devised a few refinements (e.g., Bayesian variance estimation), I don't think the problem really lies in conducting these small studies but in their interpretation. Each study contains some information, though often far less than the original researchers imagine. Sampling error by far accounts for most of the variation among studies, but few appreciate the extent of this (i.e., think 50-70%).
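To put a toy number on that (the values here are purely illustrative): simulate a batch of studies whose true effects differ only modestly, and most of the spread in the observed effects turns out to be sampling error.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 200                                    # number of studies
true_effects = rng.normal(0.20, 0.10, k)   # modest real heterogeneity
n_per_arm = 100
sampling_var = 2 / n_per_arm               # approx. variance of a standardized difference
observed = rng.normal(true_effects, sampling_var ** 0.5)

share = sampling_var / observed.var(ddof=1)
print("share of between-study variance due to sampling error:", round(share, 2))
```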
Really, the issue is that people over-interpret what is largely sampling error accompanied by a little bit of signal. If everyone could just wait until there were enough replications to meta-analyse a topic, everything would fall into place. However, what do you think the chance of that happening is?
Piers: Missed this, tried to encourage such waiting in
Meta-analysis in Medical Research: Strong encouragement for higher quality in individual research efforts. Journal of Clin. Epi. 42(10):1021-1024, 1989.
Not that much progress yet though
K?