This one is like shooting fish in a barrel, but sometimes the job just has to be done. . . .
The paper is by Daryl Bem, Patrizio Tressoldi, Thomas Rabeyron, and Michael Duggan; it’s called “Feeling the Future: A Meta-Analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events,” and it begins like this:
In 2011, the Journal of Personality and Social Psychology published a report of nine experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded (Bem, 2011). To encourage exact replications of the experiments, all materials needed to conduct them were made available on request. We can now report a meta-analysis of 90 experiments from 33 laboratories in 14 different countries which yielded an overall positive effect in excess of 6 sigma . . . A Bayesian analysis yielded a Bayes Factor of 7.4 × 10^-9 . . . An analysis of p values across experiments implies that the results were not a product of “p-hacking” . . .
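For calibration, here’s what a “6 sigma” effect means in p-value terms. A minimal sketch in Python, with the round number 6.0 standing in for the quote’s “in excess of 6 sigma”:

```python
from scipy.stats import norm

# Convert a z-score ("sigma" level) to a one-sided p-value.
z = 6.0
p = norm.sf(z)  # survival function: P(Z > z)
print(f"{z} sigma corresponds to a one-sided p of about {p:.1e}")
# -> 6.0 sigma corresponds to a one-sided p of about 9.9e-10
```

Numbers like that sound unanswerable, which is exactly why the selection issues below matter.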
There is a lot of selection going on here. For example, they report that 57% (or, as they quaintly put it, “56.6%”) of the experiments had been published in peer-reviewed journals or conference proceedings. Think of all the unsuccessful, unpublished replications that didn’t get caught in the net. But of course almost any result that happened to be statistically significant would get published, hence a big bias. Second, they go back and forth, sometimes considering all replications and other times ruling some out as not following protocol. At one point they criticize internet experiments, which is fine, but again it’s more selection: if the results from the internet experiments had looked good, I don’t think we’d be seeing that criticism. Similarly, we get statements like, “If we exclude the 3 experiments that were not designed to be replications of Bem’s original protocol . . .” This would all be a lot more convincing if they’d defined their protocols clearly ahead of time.
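To see how much a file drawer can do on its own, here’s a little simulation. The numbers are my own made-up assumptions, not anything from the paper: a pool of attempted studies with a true effect of exactly zero, where every significant result gets published but only 20% of the null results ever surface:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

n_attempted = 1000                     # hypothetical pool of attempted studies
z = rng.standard_normal(n_attempted)   # true effect is exactly zero

# Everything significant at one-sided p < .05 gets published;
# only 20% of the null results ever surface (an assumed file-drawer rate).
significant = z > norm.isf(0.05)
published = significant | (rng.random(n_attempted) < 0.20)

# Pool the published studies with Stouffer's method: sum(z_i) / sqrt(k).
pooled_z = z[published].sum() / np.sqrt(published.sum())
print(f"{published.sum()} studies published, pooled effect = {pooled_z:.1f} sigma")
```

With these assumptions the pooled estimate typically lands somewhere around 5 sigma, from noise alone. The exact number isn’t the point; the point is that selective publication plus pooling manufactures sigmas.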
I question the authors’ claims that various replications are “exact.” Bem’s paper was published in 2011, so how can experiments performed as early as 2003 be exact replications? That makes no sense. Just to get an idea of what was going on, I tried to find one of the earlier studies that was stated to be an exact replication. I looked up the paper by Savva et al. (2005), “Further testing of the precognitive habituation effect using spider stimuli.” I could not find that one, but I found a related paper, also on spider stimuli. In what sense is this an “exact replication” of Bem? I looked at the Bem (2011) paper, searched on “spider,” and all I could find was a reference to Savva et al.’s 2004 work.
This baffled me, so I went to the paper linked above and searched on “exact replication” to see how they defined the term. Here’s what I found:
“To qualify as an exact replication, the experiment had to use Bem’s software without any procedural modifications other than translating on-screen instructions and stimulus words into a language other than English if needed.”
I’m sorry, but, no. Using the same software is not enough to qualify as an “exact replication.”
This issue is central to the paper at hand. For example, there is a discussion on page 18 of “the importance of exact replications”: “When a replication succeeds, it logically implies that every step in the replication ‘worked’ . . .”
Beyond this, the individual experiments have multiple-comparisons issues, just as the Bem (2011) paper did. We see very few actual preregistrations, and my impression is that, even when something counts as a successful replication, there is still a lot of wiggle room regarding data-inclusion rules, which interactions to study, and so on.
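Here’s a toy simulation of that wiggle room, with purely hypothetical analysis choices. Every dataset is pure noise, but the analyst gets to look at the full sample, each of two subgroups, and an outlier-excluded version, and counts a hit if any of them comes out significant:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)

def any_significant(n=100, alpha=0.05):
    """One null experiment, analyzed several 'reasonable' ways."""
    scores = rng.standard_normal(n)      # no true effect at all
    group = rng.integers(0, 2, n)        # e.g., a demographic split
    analyses = [
        scores,                          # everyone
        scores[group == 0],              # subgroup A only
        scores[group == 1],              # subgroup B only
        scores[np.abs(scores) < 2],      # "exclude outliers"
    ]
    return any(ttest_1samp(a, 0).pvalue < alpha for a in analyses)

hits = sum(any_significant() for _ in range(2000))
print(f"false-positive rate: {hits / 2000:.2f}")  # well above the nominal .05
```

And four analysis choices is a conservative count; real papers have many more forks than that.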
The ESP context makes this all look like a big joke, but the general problem of researchers creating findings out of nothing seems to be a big issue in social psychology and other research areas involving noisy measurements. So I think it’s worth holding a firm line on this sort of thing. I have a feeling that the authors of this paper think that if you have a p-value or Bayes factor of 10^-9, then your evidence is pretty definitive, even if some nitpickers can argue around the edges about this or that. But it doesn’t work that way. The garden of forking paths is multiplicative, and with enough options it’s not so hard to multiply up to factors of 10^-9 or whatever. And it’s not like you have to be trying to cheat; you just keep making reasonable choices given the data you see, and you can get there, no problem. Selecting ten-year-old papers and calling them “exact replications” is one way to do it.
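Here’s the back-of-the-envelope version of that multiplication, again using Stouffer’s method for pooling z-scores (my choice for illustration, not necessarily the paper’s):

```python
import numpy as np
from scipy.stats import norm

# Stouffer's method pools k z-scores as sum(z_i) / sqrt(k).
# Suppose each of 90 studies carries only a small bias: an average
# z of 0.67, i.e. a one-sided p of about .25, which nobody would
# call significant on its own.
k, per_study_z = 90, 0.67
pooled_z = per_study_z * np.sqrt(k)
print(f"pooled effect: {pooled_z:.1f} sigma")          # ~6.4
print(f"pooled one-sided p: {norm.sf(pooled_z):.0e}")  # ~1e-10
```

A per-study bias that small is easy to pick up from the selection issues above, and it multiplies right up to the paper’s headline numbers.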