Benedict Carey writes a follow-up article on ESP studies and Bayesian statistics. (See here for my previous thoughts on the topic.) Everything Carey writes is fine, and he even uses an example I recommended:
The statistical approach that has dominated the social sciences for almost a century is called significance testing. The idea is straightforward. A finding from any well-designed study — say, a correlation between a personality trait and the risk of depression — is considered “significant” if its probability of occurring by chance is less than 5 percent.
This arbitrary cutoff makes sense when the effect being studied is a large one — for example, when measuring the so-called Stroop effect. This effect predicts that naming the color of a word is faster and more accurate when the word and color match (“red” in red letters) than when they do not (“red” in blue letters), and is very strong in almost everyone.
“But if the true effect of what you are measuring is small,” said Andrew Gelman, a professor of statistics and political science at Columbia University, “then by necessity anything you discover is going to be an overestimate” of that effect.
The above description of classical hypothesis testing isn’t bad. Strictly speaking, one would follow “is less than 5 percent” above with “if the null hypothesis of zero effect were actually true,” but they have serious space limitations, and I doubt many readers would get much out of that elaboration, so I’m happy with what Carey put there.
One subtlety that he didn’t quite catch was the way that researchers mix the Neyman-Pearson and Fisher approaches to inference. The 5% cutoff (associated with Neyman and Pearson) is indeed standard, and it is indeed subject to all the problems we know about, most simply that statistical significance occurs at least 5% of the time, so if you do a lot of experiments you’re gonna have a lot of chances to find statistical significance. But p-values are also used as a measure of evidence: that’s Fisher’s approach and it leads to its own problems (as discussed in the news article as well).
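The point about doing lots of experiments can be seen in a quick simulation (my own illustration, not from the article or the post): if the null hypothesis of zero effect is true, a 5% cutoff will flag roughly 5% of experiments as "significant" by chance alone, so running many experiments guarantees plenty of spurious findings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
n_per_group = 50

false_alarms = 0
for _ in range(n_experiments):
    # Both groups drawn from the same distribution: the null is true.
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_alarms += 1

print(false_alarms / n_experiments)  # roughly 0.05
```

Run 10,000 null experiments and about 500 of them will cross the 5% threshold, which is the simple mechanism behind the "do a lot of experiments, find a lot of significance" problem.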
The other problem, which is not so well known, comes up in my quote: when you’re studying small effects and you use statistical significance as a filter and don’t do any partial pooling, whatever you have that’s left standing that survives the filtering process will overestimate the true effect. And classical corrections for “multiple comparisons” do not solve the problem: they merely create a more rigorous statistical significance filter, but anything that survives that filter will be even more of an overestimate.
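The overestimation problem is easy to demonstrate by simulation (again my own sketch, with made-up numbers: a true effect of 0.1 measured with unit noise in studies of 50 participants). Keep only the estimates that reach p < 0.05 and the survivors, on average, exaggerate the true effect severely:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.1   # small true effect
n = 50              # small study, so power is low
sims = 20_000

kept = []
for _ in range(sims):
    y = rng.normal(true_effect, 1, n)
    _, p = stats.ttest_1samp(y, 0)
    if p < 0.05:
        # This estimate "survives" the significance filter.
        kept.append(abs(y.mean()))

print(np.mean(kept))  # far larger than the true effect of 0.1
```

With a standard error of about 0.14, only estimates above roughly 0.28 in magnitude can reach significance, so every surviving estimate is at least double the truth. Tightening the cutoff (a multiple-comparisons correction) only raises that bar and makes the survivors worse.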
If classical hypothesis testing is so horrible, how is it that it could be so popular? In particular, what was going on when a well-respected researcher like this ESP guy used inappropriate statistical methods?
My answer to Carey was to give a sort of sociological story, which went as follows.
Psychologists have experience studying large effects, the sort of study in which data from 24 participants is enough to estimate a main effect and 50 will be enough to estimate interactions of interest. I gave the example of the Stroop effect (they have a nice one of those on display right now at the Natural History Museum) as an example of a large effect where classical statistics will do just fine.
My point was, if you’ve gone your whole career studying large effects with methods that work, then it’s natural to think you have great methods. You might not realize that your methods, which appear quite general, actually fall apart when applied to small effects. Such as ESP or human sex ratios.
The ESP dude was a victim of his own success: His past accomplishments studying large effects gave him an unwarranted feeling of confidence that his methods would work on small effects.
This sort of thing comes up a lot, and in my recent discussion of Efron’s article, I list it as my second meta-principle of statistics, the “methodological attribution problem,” which is that people think that methods that work in one sort of problem will work in others.
The other thing that Carey didn’t have the space to include was that Bayes is not just about estimating the weight of evidence in favor of a hypothesis. The other key part of Bayesian inference–the more important part, I’d argue–is “shrinkage” or “partial pooling,” in which estimates get pooled toward zero (or, more generally, toward their estimates based on external information).
Shrinkage is key, because if all you use is a statistical significance filter–or even a Bayes factor filter–when all is said and done, you’ll still be left with overestimates. Whatever filter you use–whatever rule you use to decide whether something is worth publishing–I still want to see some modeling and shrinkage (or, at least, some retrospective power analysis) to handle the overestimation problem. This is something Martin and I discussed in our discussion of the “voodoo correlations” paper of Vul et al.
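In the simplest normal-normal case, partial pooling can be sketched in a few lines (my own illustration, with made-up estimates, standard errors, and an assumed prior spread tau): each raw estimate is pulled toward zero by a factor that depends on how noisy it is.

```python
# Hypothetical per-study estimates and their standard errors.
raw_estimates = [0.90, 0.30, -0.45]
std_errors    = [0.40, 0.10, 0.35]
tau = 0.2  # assumed sd of true effects around zero (prior information)

for est, se in zip(raw_estimates, std_errors):
    # Posterior mean under a Normal(0, tau^2) prior: a precision-weighted
    # compromise between the raw estimate and the prior mean of zero.
    shrink = tau**2 / (tau**2 + se**2)
    pooled = shrink * est
    print(f"{est:+.2f} -> {pooled:+.2f} (shrinkage factor {shrink:.2f})")
```

The noisy 0.90 estimate gets shrunk hard toward zero, while the precisely measured 0.30 barely moves, which is exactly the behavior that a significance filter alone cannot give you.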
Should the paper have been published in a top psychology journal?
Real-life psychology researcher Tal Yarkoni adds some good thoughts but then he makes the ridiculous (to me) juxtaposition of the following two claims: (1) The ESP study didn’t find anything real, there’s no such thing as ESP, and the study suffered many methodological flaws, and (2) The journal was right to publish the paper.
If you start with (1), I don’t see how you get to (2). I mean, sure, Yarkoni gives his reasons (basically, the claim that the ESP paper, while somewhat crappy, is no crappier than most papers that are published in top psychology journals), but I don’t buy it. If the effect is there, why not have them demonstrate it for real? I mean, how hard would it be for the experimenters to gather more data, do some sifting, find out which subjects are good at ESP, etc.? There’s no rush, right? No need to publish preliminary, barely-statistically-significant findings. I don’t see what’s wrong with the journal asking for better evidence. It’s not like a study of the democratic or capitalistic peace, where you have a fixed amount of data and you have to learn what you can. In experimental psychology, once you have the experiment set up, it’s practically free to gather more data.
P.S. One thing that saddens me is that, instead of using the sex-ratio example (which I think would’ve been perfect for this article), Carey uses the following completely fake example:
Consider the following experiment. Suppose there was reason to believe that a coin was slightly weighted toward heads. In a test, the coin comes up heads 527 times out of 1,000.
And then he goes on to write about coin flipping. But, as I showed in my article with Deb, there is no such thing as a coin weighted to have a probability p (different from 1/2) of heads.
OK, I know about fake examples. I’m writing an intro textbook, and I know that fake examples can be great. But not this one!
P.P.S. I’m also disappointed he didn’t use the famous dead-fish example, where Bennett, Baird, Miller, and Wolford found statistically significant correlations in an fMRI scan of a dead salmon. The correlations were not only statistically significant, they were large and newsworthy!
P.P.P.S. The Times does this weird thing with its articles where it puts auto-links on Duke University, Columbia University, and the University of Missouri. I find this a bit distracting and unprofessional.