The replication crisis in social psychology (and science more generally) will not be solved by better statistics or by preregistered replications. It can only be solved by better measurement.
Let me say this more carefully. I think that improved statistics and preregistered replications will have very little direct effect on improving psychological science, but they could have important indirect effects.
Why no big direct effects? Because if you’re studying a phenomenon that’s tiny, or so variable that any main effects will be swamped by interactions, with the interaction changing in each new scenario where it’s studied, then better statistics and preregistered replications will just reveal what we already know: that existing experimental results say almost nothing about the size, direction, and structure of these effects.
I’m thinking of various papers we’ve discussed on this blog over the years: the studies of political moderation and shades of gray, or power pose, or fat arms and political attitudes, or ovulation and vote preference, or ovulation and clothing, or beauty and sex ratios, or elderly-related words and walking speed, or subliminal smiley faces and attitudes toward immigration, or ESP in college students, or baseball players with K in their names being more likely to strike out, or brain scans and political orientation, or the Bible Code, or . . .
Let me put it another way. Lots of the studies that we criticize don’t just have conceptual problems, they have very specific statistical errors—for example, the miscalculated test statistics in Amy Cuddy’s papers, where p-values got shifted below the .05 threshold—and they disappear under attempted replications. But this doesn’t mean that, if these researchers did better statistics or routinely replicated, they’d be getting stronger conclusions. Rather, they’d just have to give up their lines of research, or think much harder about what they’re studying and what they’re measuring.
It could be, however, that improved statistical analysis and preregistered replications could have a positive indirect effect on such work: If these researchers knew ahead of time that their data would be analyzed correctly, and that outside teams would be preparing replications, they might be less willing to stake their reputations on shaky findings.
Think about Marc Hauser: had he been expected ahead of time to make all his monkey videotapes available for the world to see, he would’ve had much less motivation to code them the way he did.
So, yes, I think the prospect of reanalysis of existing data, and replication of studies, concentrates the mind wonderfully.
But . . . all the analysis and replication in the world won’t save you, if what you’re studying just isn’t there, or if any effects are swamped by variation.
That’s why, in my long blog conversations with the ovulation-and-clothing researchers, I never suggested they do a preregistered replication. If they or anyone else wants to do such a replication, fine—so far, I know of two such replications, neither of which found the pattern claimed in the original study but each of which reported a statistically significant comparison on something new, i.e., par for the course—but it’s not something I’d recommend because then I’d be recommending they waste their time. It’s the same reason I didn’t recommend that the beauty-and-sex-ratio guy gather more samples of size 3000. When your power is 6%, or 5.1%, or 5.01%, or whatever, gathering more data and looking for statistical significance is at best a waste of time and at worst a way to confuse yourself with noise.
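Here’s what that low-power regime looks like in numbers. Below is a minimal simulation sketch (mine, with an invented true effect and standard error, not numbers from any of these studies) of the type M (magnitude) and type S (sign) error calculation we’ve discussed on this blog: when power is barely above 5%, the estimates that cross the significance threshold are wildly exaggerated and often point the wrong way.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # hypothetical tiny true effect (invented for illustration)
se = 1.0            # hypothetical standard error of the estimate
n_sims = 1_000_000

# Simulate a million noisy estimates of the same tiny effect
estimates = rng.normal(true_effect, se, size=n_sims)
significant = np.abs(estimates) > 1.96 * se   # two-sided p < .05

power = significant.mean()
exaggeration = np.abs(estimates[significant]).mean() / true_effect
wrong_sign = (estimates[significant] < 0).mean()

print(f"power: {power:.3f}")                              # about 0.05, barely above chance
print(f"exaggeration ratio: {exaggeration:.0f}x")         # significant estimates ~20x too large
print(f"wrong sign among significant: {wrong_sign:.2f}")  # roughly 40%
```

In this regime, “gathering more data and looking for significance” just gives you more chances to publish a number that is twenty times too big and has nearly even odds of the wrong sign.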
So . . . as I wrote a few months ago, doing better statistics is fine, but we really need to be doing better psychological measurement and designing studies to make the best use of these measurements:
Performing more replicable studies is not just a matter of being more careful in your data analysis (although that can’t hurt) or increasing your sample size (although that, too, should only help); it’s also about putting real effort into design and measurement. All too often I feel like I’m seeing the attitude that statistical significance is a win or a proof of correctness, and I think this pushes researchers in the direction of going the cheap route, rolling the dice, and hoping for a low p-value that can be published. But when measurements are biased, noisy, and poorly controlled, even if you happen to get that p less than .05, it won’t really be telling you anything.
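To make the biased-measurement point concrete, here’s a toy simulation (the bias, the sample size, and the unblinded-coder story are all invented for the example): the true effect is exactly zero, yet the t-test delivers p less than .05 about half the time, detecting nothing but the bias.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n = 200          # per group (invented)
bias = 0.2       # systematic measurement bias in the treatment condition (invented)
n_sims = 2000

hits = 0
for _ in range(n_sims):
    control = rng.normal(0, 1, n)            # true treatment effect is exactly zero
    treatment = rng.normal(0, 1, n) + bias   # the bias, not any effect, separates the groups
    _, p = stats.ttest_ind(treatment, control)
    hits += p < 0.05

print(f"rejection rate with zero true effect: {hits / n_sims:.2f}")  # around 0.5
```

No amount of additional data fixes this; a bigger sample just detects the bias more reliably.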
With this in mind, let me speak specifically of the controversial studies in social priming and evolutionary psychology. One feature of many such studies is that the manipulations are small, sometimes literally imperceptible. Researchers often seem to go to a lot of trouble to do tiny things that won’t be noticed by the participants in the experiments. For example, flashing a smiley face on a computer screen for 39 milliseconds, or burying a few key words in a sham experiment. In other cases, manipulations are hypothesized to have a seemingly unlimited number of interactions with attitudes, relationship status, outdoor temperature, parents’ socioeconomic status, etc. Either way, you’re miles away from the large, stable effects you’d want to be studying if you want to see statistical regularity.
If effects are small, surrounded by variability, but important, then, sure, research them in large, controlled studies. Or go the other way and try to isolate large effects from big treatments. Swing some sledgehammers and see what happens. But a lot of this research has been going in the other direction, studying tiny interventions on small samples.
The work often “succeeds” (in the sense of getting statistical significance, publication in top journals, TED talks, NPR appearances, etc.), but we know that can happen even when nothing is there, what with the garden of forking paths and more.
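The forking-paths arithmetic is easy to check with another rough sketch of my own (the particular forks here, two outcomes crossed with three subgroup choices, are invented and far fewer than a real analysis offers): analyze pure noise a handful of defensible ways and “at least one p less than .05” turns up far more often than 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n = 100        # participants (invented)
n_sims = 2000
any_hit = 0

for _ in range(n_sims):
    group = rng.integers(0, 2, n)     # "treatment" assignment: pure noise design
    single = rng.integers(0, 2, n)    # a covariate available for post-hoc subgrouping
    y1 = rng.normal(0, 1, n)          # two plausible outcome measures,
    y2 = rng.normal(0, 1, n)          # both unrelated to everything

    pvals = []
    for y in (y1, y2):                                             # fork: choice of outcome
        for mask in (np.ones(n, bool), single == 1, single == 0):  # fork: choice of subgroup
            a = y[mask & (group == 1)]
            b = y[mask & (group == 0)]
            if len(a) > 1 and len(b) > 1:
                pvals.append(stats.ttest_ind(a, b).pvalue)
    any_hit += min(pvals) < 0.05

print(f"chance of at least one p < .05 from pure noise: {any_hit / n_sims:.2f}")
# roughly 0.2 with just six forks; real analyses have many more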
So, again, in my opinion, the solution to the “replication crisis” is not to replicate everything or to demand that every study be replicated. Rather, the solution is more careful measurement. Improved statistical analysis and replication should help indirectly in reducing the motivation for people to perform analyses that are sloppy or worse, and reducing the motivation for people to think of empirical research as a sort of gambling game where you gather some data and then hope to get statistical significance. Reanalysis of data and replication of studies should reduce the benefit of sloppy science and thus shift the cost-benefit equation in the right direction.
Piss-poor omnicausal social science
One of my favorite blogged phrases comes from political scientist Daniel Drezner, when he decried “piss-poor monocausal social science.”
By analogy, I would characterize a lot of these unreplicable studies in social and evolutionary psychology as “piss-poor omnicausal social science.” Piss-poor because of all the statistical problems mentioned above—which arise from the toxic combination of open-ended theories, noisy data, and huge incentives to obtain “p less than .05,” over and over again. Omnicausal because of the purportedly huge effects of, well, just about everything. During some times of the month you’re three times more likely to wear red or pink—depending on the weather. You’re 20 percentage points more likely to vote Republican during those days—unless you’re single, in which case you’re that much more likely to vote for a Democrat. If you’re a man, your political attitudes are determined in large part by the circumference of your arms. An intervention when you’re 4 years old will increase your earnings by 40%, twenty years down the road. The sex of your baby depends on your attractiveness, on your occupation, on how big and tall you are. How you vote in November is decided by a college football game at the end of October. A few words buried in a long list will change how fast you walk—or not, depending on some other factors. Put this together, and every moment of your life you’re being buffeted by irrelevant stimuli that have huge effects on decisions ranging from how you dress, to how you vote, to where you choose to live, your career, even your success at that career (if you happen to be a baseball player). It’s an omnicausal world in which there are thousands of butterflies flapping their wings in your neighborhood, and each one is capable of changing you profoundly. A world that, if it truly existed, would be much different from the world we live in.
A reporter asked me if I found the replication rate of various studies in psychology to be “disappointingly low.” I responded that yes, it’s low, but is it disappointing? Maybe not. I would not like to live in a world in which all those studies are true, a world in which the way women vote depends on their time of the month, a world in which men’s political attitudes are determined by how fat their arms are, a world in which subliminal messages can cause large changes in attitudes and behavior, a world in which there are large ESP effects just waiting to be discovered. I’m glad that this fad in social psychology may be coming to an end, so in that sense, it’s encouraging, not disappointing, that the replication rate is low. If the replication rate were high, then that would be cause to worry, because it would imply that much of what we think we know about the world would be wrong. Meanwhile, statistical analysis (of the sort done by Simonsohn and others) and lots of real-world examples (as discussed on this blog and elsewhere) have shown us how it is that researchers could continue to find “p less than .05” over and over again, even in the absence of any real and persistent effects.
The time-reversal heuristic
A couple more papers on psychology replication came in the other day. They were embargoed until 2pm today, which is when this post is scheduled to appear.
I don’t really have much to say about the two papers (one by Gilbert et al., one by Nosek et al.). There’s some discussion about how bad the replication crisis in psychology research is (and, by extension, in many other fields of science), and my view is that it depends on what is being studied. The Stroop effect replicates. Elderly-related-words priming, no. Power pose, no. ESP, no. Etc. The replication rate we see in a study-of-studies will depend on the mix of things being studied.
Having read the two papers, I pretty much agree with Nosek et al. (see Sanjay Srivastava for more on this point), and the only thing I’d like to add is to remind you of the time-reversal heuristic for thinking about a published paper followed by an unsuccessful replication:
One helpful (I think) way to think about such an episode is to turn things around. Suppose the attempted replication experiment, with its null finding, had come first. A large study finding no effect. And then someone else runs a replication under slightly different conditions with a much smaller sample size and finds statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don’t think so. At the very least, we’d have to conclude that any such phenomenon is fragile.
From this point of view, what the original claim has going for it is that (a) statistical significance was obtained in an uncontrolled setting, (b) it was published in a peer-reviewed journal, and (c) this paper came before, rather than after, the attempted replication. I don’t find these pieces of evidence very persuasive. (a) Statistical significance doesn’t mean much in the absence of preregistration or something like it, (b) lots of mistakes get published in peer-reviewed journals, to the extent that the phrase “Psychological Science” has become a bit of a punch line, and (c) I don’t see why we should take the apparently successful result as the starting point in our discussion, just because it was published first.
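One way to see why publication order shouldn’t matter: a toy Bayes calculation (all the numbers here are invented for illustration) gives the same posterior whichever study you hear about first, and it also shows how little a barely powered “significant” result moves the needle.

```python
# Toy Bayes calculation for the time-reversal heuristic (numbers invented).
prior_real = 0.5  # assumed prior probability that the effect is real

# Assumed likelihoods: P(observed result | hypothesis).
# The small study found p < .05; the large preregistered study found null.
p_small_sig = {"real": 0.06, "null": 0.05}   # power barely above the false positive rate
p_large_null = {"real": 0.10, "null": 0.95}  # a well-powered study rarely misses a real effect

def posterior(order):
    """Update the prior odds with each study's likelihood ratio, in the given order."""
    odds = prior_real / (1 - prior_real)
    for lik in order:
        odds *= lik["real"] / lik["null"]
    return odds / (1 + odds)

print(posterior([p_small_sig, p_large_null]))  # original study first:  ~0.11
print(posterior([p_large_null, p_small_sig]))  # replication first:     identical
```

The order of multiplication can’t change the product, so “it was published first” adds nothing; and the underpowered significant result contributes a likelihood ratio of only 0.06/0.05, barely distinguishable from noise.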
P.S. More here: “Replication crisis crisis: Why I continue in my ‘pessimistic conclusions about reproducibility’”