Seth points to this article by Edward Vul, Christine Harris, Piotr Winkielman, and Harold Pashler, which begins:
The newly emerging field of Social Neuroscience has drawn much attention in recent years, with high-profile studies frequently reporting extremely high (e.g., >.8) correlations between behavioral and self-report measures of personality or emotion and measures of brain activation obtained using fMRI. We show that these correlations often exceed what is statistically possible assuming the (evidently rather limited) reliability of both fMRI and personality/emotion measures. The implausibly high correlations are all the more puzzling because social-neuroscience method sections rarely contain sufficient detail to ascertain how these correlations were obtained. We surveyed authors of 54 articles that reported findings of this kind to determine the details of their analyses. More than half acknowledged using a strategy that computes separate correlations for individual voxels, and reports means of just the subset of voxels exceeding chosen thresholds. We show how this non-independent analysis grossly inflates correlations, while yielding reassuring-looking scattergrams. This analysis technique was used to obtain the vast majority of the implausibly high correlations in our survey sample. In addition, we argue that other analysis problems likely created entirely spurious correlations in some cases.
This is cool statistical detective work. I love this sort of thing. I also appreciate that the article has graphs but no tables. I have only two very minor comments:
1. As Seth points out, the authors write that many of the mistakes appear in “such prominent journals as Science, Nature, and Nature Neuroscience.” My impression is that these hypercompetitive journals have a pretty random reviewing process, at least for articles outside of their core competence of laboratory biology. Publication in such journals is taken as much more of a seal of approval than it should be, I think. The authors of this article are doing a useful service by pointing this out.
2. I think it’s a little tacky to use “voodoo” in the title of the article.
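The selection effect the abstract describes is easy to reproduce in a toy simulation. Here is a minimal sketch (the subject count, voxel count, and threshold are invented for illustration, and the data are pure noise, not anything from the paper): correlate a behavioral measure with each "voxel" separately, keep only voxels whose correlation clears a threshold, and report the average correlation of the survivors.

```python
import numpy as np

rng = np.random.default_rng(0)

n_subjects = 20      # hypothetical number of subjects scanned
n_voxels = 10_000    # hypothetical number of voxels tested
threshold = 0.5      # |r| cutoff for calling a voxel "significant"

# Null world: behavior and every voxel's activation are independent noise,
# so the true behavior-activation correlation is exactly zero everywhere.
behavior = rng.standard_normal(n_subjects)
activation = rng.standard_normal((n_voxels, n_subjects))

# Per-voxel Pearson correlation with the behavioral measure,
# computed by standardizing both sides and taking the mean product.
b = (behavior - behavior.mean()) / behavior.std()
a = (activation - activation.mean(axis=1, keepdims=True)) \
    / activation.std(axis=1, keepdims=True)
r = (a @ b) / n_subjects

# Non-independent analysis: select voxels by their observed correlation,
# then report the mean correlation of just that subset.
selected = np.abs(r) > threshold
print(f"voxels selected: {selected.sum()}")
print(f"mean |r| among selected voxels: {np.abs(r[selected]).mean():.2f}")
```

By construction the reported mean must exceed the threshold used to select it, so the "finding" looks strong even though no signal exists, which is the inflation the paper is describing.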
I am a bit stunned that no one noticed this before, as multiple testing isn't a new issue for statisticians.
Also, "voodoo" is only tacky if it sticks. ;-)
An excellent paper. Andrew — care to take back your claim that multiple comparisons isn't a problem?
Eastwood: Perhaps no one looked into this before because no one really believed these studies in the first place. To put it another way, we get external measures of uncertainty from replications of studies, so if we have enough replications, the internal measures aren't so important anyway. Just as, for example, if we have a time series of opinion polls, we don't need to worry about the standard error of each poll.
Jfalk: It takes more than data to get me to retract a claim. . . . Seriously, though, we say that multiple comparisons isn't a problem if an analysis is done well, but we certainly agree (and give some examples to illustrate) that multiple comparisons is a problem for non-hierarchical models.
Excellent paper; thanks for the reference. I am dismayed that this type of practice has persisted in the field for this long, but I am not overly surprised. I've seen this kind of endogenous selection many, many times in different fields, and it is rarely appreciated how much it can affect the results of any analysis. I am very glad to see Vul et al. addressing this problem, but I wonder if there is a better way to bring statisticians into the review process, increasing the credibility of top journals and cutting down on such errors.
Andrew: I'm not sure I agree; if no one believed these studies, they should never have been published in the first place. At the very least, the potential for bias should have been acknowledged.
I'm all for external validation, but here we don't need validation to know something is wrong. — Don't get me wrong though, this is an excellent post and topic.
Alex: I spoke with two medical residents recently who not only recognized this problem in their methods, but also correctly worked out the distribution of a maximum in order to demonstrate the problem to their colleagues. There is hope! :-)
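The residents' demonstration can be sketched with a quick simulation. Under the null, any single z-score rarely exceeds 2, but the maximum of n independent null z-scores has CDF Φ(x)^n, which pushes the typical maximum well past conventional cutoffs. The test counts and replication count below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tests = 1000   # hypothetical number of simultaneous null tests
n_reps = 5000    # simulation replications

# Each row is one "study": n_tests independent standard-normal statistics
# with no real effect anywhere. We record the largest statistic per study.
draws = rng.standard_normal((n_reps, n_tests))
maxima = draws.max(axis=1)

print(f"median maximum of {n_tests} null z-scores: {np.median(maxima):.2f}")
print(f"share of studies whose maximum exceeds 2: {(maxima > 2).mean():.2f}")
```

The typical maximum lands above 3, so picking out the largest of many statistics and interpreting it as if it were a single pre-specified test will look significant essentially every time.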
> external measures of uncertainty from
> replications of studies
These are easier for others to appreciate as real problems rather than theoretical considerations that are misunderstood and discounted
> if an analysis is done well
this presupposes the selection rules for what was reported are known or fairly simple – which is seldom the case in much of the literature
And on my pet subject – we should encourage and insist that all the literature on a topic be processed "together" to catch and overcome poor research practices
Lack of statistical reviewing is a problem through much of the medical literature, especially journals that are more specialised. Epidemiology in non-epidemiology journals is often appalling. Wrong statistics, wrong analysis and wrong interpretation.
I do wonder if many of the medical researchers actually care, and it is simply about publications and citations rather than doing good science.
It seems the only way to fix it is a form of accreditation, where journals are audited and found to comply with a minimum requirement of statistical reviewing and confirmed by a random selection of articles.
Ken: Prevention is preferable to cure – but statistical reviewing is not a very effective prevention.
As discussed before on this blog, the reviewer only gets a description of the tip of the iceberg – no access to the data and very little disclosure about how the data were generated/collected.
Ever been brought the data along with a comment "we would not have bothered you with this data but the p_value was so small"? (i.e. the other dozen studies we did before on this useless treatment had non-significant p_values so we didn't trouble you with them)
Also you have to worry about reviewing standards – a statistician I was recently helping out told me they had to back off on multiplicity issues as the other statisticians at their university were not as strict and their clients were starting to feel "hard done by" (and there are differences of opinion amongst statisticians).
> is simply about publications and citations
> rather than doing good science.
This often can be the case, but given the career pressures on people it should not be unexpected!
The cure does seem to be to have separate groups do essentially the same studies and to compare and contrast the results (preferably with access to as much of their data as ethically possible).
Now the percentage of statisticians that are trained and experienced at locating, extracting, contrasting and combining study results is not as low as it was say 20 years ago – but it probably is still too low.
Ken: Further to my comment, it might have been interesting to check what percentage of the original papers had a "qualified" statistician as a co-author. Unfortunately their survey did not have that question on it and it might take a bit of work to determine that.
Keith, agreed that statistical reviewing won't fix all the problems. It is impossible to detect that the choice of covariates is based on achieving statistical significance of the effect of interest, or that the result is based on an outlier. A requirement for review should be the supply of a full statistical report that could be made available online after publication.
I expect that eventually publication of epidemiology in the major medical journals will require a protocol and prespecified analysis prior to data collection, the same as for clinical trials.
The main problem is the need to publish and, in order to publish, the need for significant results. Publication of well-designed negative studies would help researchers resist the temptation to be creative.
Please see http://www.bcn-nic.nl/replyVul.pdf for a reply by some of the authors that are criticized.
For those interested, you can find our response to this reply here:
Here is our invited reply
Thanks, all, for the updates. I'll post something about this on the blog soon.
Okay, firstly this here page says:
"Sign in to comment, or comment anonymously.
Remember personal info?"
Next, assuming that it's true that many of the unexpectedly strong experimental results have involved researchers fiddling the figures, where's your own control? Have you examined papers where results were not very impressive, and found that researchers in those cases were more honest? Granted, it's difficult to get such papers published, especially if the guys in the laboratory across town are making up much more exciting figures. But if you only look at the suspiciously successful papers, aren't you cherry-picking the evidence too? Maybe the scientists with less impressive results to report aren't any more honest than the high-fliers, just less imaginative.
Having said that, other research published recently indicates that overall, around 2 per cent of professional scientists have fiddled figures for experiments (I assume high school lab work, for instance, doesn't count), and 14 per cent know someone who did, perhaps a co-author. But we certainly can't assume that brain scientists have the same just-fake-it rate as all scientists. There are all sorts of variables. The chance of being found out when someone tries to replicate the work, the temptation of bribery from pharmaceutical companies…