“A strong anvil need not fear the hammer”

Wagenmakers et al. write:

A single experiment cannot overturn a large body of work. . . . An empirical debate is best organized around a series of preregistered replications, and perhaps the authors whose work we did not replicate will feel inspired to conduct their own preregistered studies. In our opinion, science is best served by ruthless theoretical and empirical critique, such that the surviving ideas can be relied upon as the basis for future endeavors. A strong anvil need not fear the hammer, and accordingly we hope that preregistered replications will soon become accepted as a vital component of a psychological science that is both thought-provoking and reproducible.

I don’t feel quite so strongly as E.J. regarding preregistered replications, but I agree strongly with his anvil/hammer quote, which comes at the end of a recent paper, “Turning the hands of time again: a purely confirmatory replication study and a Bayesian analysis,” by Eric-Jan Wagenmakers, Titia Beek, Mark Rotteveel, Alex Gierholz, Dora Matzke, Helen Steingroever, Alexander Ly, Josine Verhagen, Ravi Selker, Adam Sasiadek, Quentin Gronau, Jonathon Love, and Yair Pinto, which begins:

In a series of four experiments, Topolinski and Sparenberg (2012) found support for the conjecture that clockwise movements induce psychological states of temporal progression and an orientation toward the future and novelty.

OK, before we go on, let’s just see where we stand here. This is a Psychological Science or PPNAS-style result: it’s kinda cool, it’s worth a headline, and it could be true. Just as it could be that college men with fat arms have different political attitudes, or that your time of the month could affect how you vote or how you dress, or that being primed with elderly-related words could make you walk slower. Or just as any of these effects could exist but go in the opposite direction. Or, as the authors of those notorious earlier papers claimed, such effects could exist but only in the presence of interactions with socioeconomic class, relationship status, outdoor temperature, and attitudes toward the elderly. Or just as any of these could exist, interacted with any number of other possible moderators such as age, education, religiosity, number of older siblings, number of younger siblings, etc etc etc.

Topolinski and Sparenberg (2012) wandered through the garden of forking paths and picked some pretty flowers.
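
Just to illustrate how little protection p < .05 offers here, consider a toy simulation (in Python; the sample size and “moderators” are invented for illustration, nothing is taken from the actual study): give yourself twenty subgroup comparisons on pure-noise data, and some “significant” result shows up most of the time.

    # Hypothetical sketch: many unplanned subgroup comparisons on pure noise.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_subjects, n_moderators, n_sims = 100, 10, 2000
    hits = 0
    for _ in range(n_sims):
        outcome = rng.normal(size=n_subjects)       # no true effect anywhere
        condition = rng.integers(0, 2, n_subjects)  # clockwise vs. counterclockwise
        found = False
        for _ in range(n_moderators):
            moderator = rng.integers(0, 2, n_subjects)  # e.g., a median split on age
            for g in (0, 1):  # test the condition effect within each subgroup
                a = outcome[(moderator == g) & (condition == 1)]
                b = outcome[(moderator == g) & (condition == 0)]
                if stats.ttest_ind(a, b).pvalue < 0.05:
                    found = True
        hits += found

    print(f"Datasets with at least one p < .05: {hits / n_sims:.0%}")
    # With 20 looks at the data, this lands well above the nominal 5%.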

What happened when Wagenmakers et al. tried to replicate?

Here we report the results of a preregistered replication attempt of Experiment 2 from Topolinski and Sparenberg (2012). Participants turned kitchen rolls either clockwise or counterclockwise while answering items from a questionnaire assessing openness to experience. Data from 102 participants showed that the effect went slightly in the direction opposite to that predicted by Topolinski and Sparenberg (2012) . . .

No surprise. If the original study is basically pure noise, a replication could go in any direction.
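
A sketch with standardized estimates makes the point (toy numbers, nothing from the actual studies): if the true effect is zero, the sign of the replication estimate is a coin flip, no matter what the original study found.

    # Under a zero true effect, the replication's direction is 50/50,
    # even conditioning on the original having been "significant."
    import numpy as np

    rng = np.random.default_rng(1)
    original = rng.normal(0, 1, 100_000)     # noisy z-scores, true effect = 0
    replication = rng.normal(0, 1, 100_000)
    sig = np.abs(original) > 1.96            # originals that reached p < .05
    match = np.mean(np.sign(replication[sig]) == np.sign(original[sig]))
    print(f"Replications matching the original's sign: {match:.1%}")  # ~50%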

Wagenmakers et al. also report a Bayes factor, but I hate that sort of thing so I won’t spend any more time discussing it here. Perhaps I’ll cover it in a separate post but for now I want to focus on the psychology experiments.

And the point I want to make is how routine this now is:

1. A study is published somewhere, it has p less than .05, but we now know that this tells us next to nothing.

2. The statistically significant p-value comes with a story, but through long experience we know that these sorts of just-so stories can go in either direction.

3. Someone goes to the trouble of replicating. The result does not replicate.

Let’s just hope that we can bypass the next step:

4. The original authors start spinnin’ and splainin’.

And instead we can move to the end of this story:

5. All parties agree that any effect or interaction will be so small that it can’t be detected with this sort of crude experimental setup.

And, ultimately, to a realization that noisy studies and forking paths are not a great way to learn about the world.

Let me clarify just one thing about preregistered studies. Preregistration is fine, but it helps to have a realistic sense of what might happen. That’s one reason I did not recommend that those ovulation-and-clothing researchers do a preregistered replication. Sure, they could, but given their noise level, it’s doomed to fail (indeed, they did do a replication and it did fail in the sense of not reproducing their original result, and then they salvaged it by discovering an interaction with outdoor temperature). Instead, I usually recommend people work on reliability and validity, that is, on reducing the variance and bias of their measurements. It seems kinda mean to suggest someone do a preregistered replication, if I think they’re probably gonna fail. And, if they do succeed, it’s likely to be a type S error, which is its own sort of bummer.

I guess what I’m saying is:

– Short-term, a preregistered replication is a clean way to shoot down a lot of forking-paths-type studies.

– Medium-term, I’m hoping (and maybe EJ and his collaborators are, too) that the prospect of preregistered replication will cause researchers to moderate their claims and think twice about publishing and promoting the exciting statistically-significant patterns that show up.

– Long term, maybe people will do more careful experiments in the first place. Or, when people do want to trawl through data to find interesting patterns (not that there’s anything wrong with that, I do it all the time), that they will use multilevel models and do partial pooling to get more conservative, less excitable inference.
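
As a tiny illustration of that last point, here’s a sketch of the partial-pooling idea with made-up numbers (a real analysis would fit a multilevel model, say in Stan, rather than this crude empirical-Bayes shortcut):

    # Shrink noisy per-group estimates toward the grand mean, in proportion
    # to how noisy each one is.
    import numpy as np

    estimates = np.array([2.8, -1.9, 0.4, 3.1, -0.2])  # raw per-group estimates
    std_errors = np.array([1.5, 1.5, 0.5, 2.0, 0.4])   # their standard errors

    grand_mean = np.average(estimates, weights=1 / std_errors**2)
    # Rough between-group variance; a full model would estimate this jointly.
    tau2 = max(np.var(estimates) - np.mean(std_errors**2), 0.0)

    shrink = tau2 / (tau2 + std_errors**2)  # noisier estimates get pulled harder
    pooled = grand_mean + shrink * (estimates - grand_mean)
    print(np.round(pooled, 2))  # the extreme, noisy estimates move the most

The big, noisy estimates get pulled way in, which is exactly the conservative, less excitable behavior I’m talking about.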

28 thoughts on “A strong anvil need not fear the hammer”

  1. Apologies for asking a totally N00B question (but, hey, that’s where I am sometimes!), but what exactly is “preregistration” and how does it work (aside from artillery fire)? TIA.

      • Thanks again for the link, Andrew!

        I can’t help but notice your statement, “His flagship example, an observational study of congressional positions and election outcomes, includes three tables with numbers presented to four decimal places ….” Four decimal places! Holy cow! That reminded me of my own grad school days in chemistry, when I was a TA for the undergrad physical chemistry class. A grad student in biology was taking the class, and I noticed that for each homework problem having a numerical answer he would provide a very long string of digits after the decimal (often more than 10!). I quickly realized that he was copying the entire display from his calculator and had no idea of significant figures or even just rounding for the sake of the grader’s eyes. :-) He soon dropped the class.

    • Pre-registration is basically just writing down and making public a plan for carrying out the experiment and analyzing the data before you run the experiment or collect the data. The notion is that even if people are searching for “statistical significance,” if only informally, they can’t perform that kind of search once they’ve nailed down the specifics ahead of time.

      If your experiment is supposed to be a TEST OF THEORY X and THEORY X is a poorly defined thing, then it’s not really possible to say whether or not you’ve tested it. If THEORY X is pretty well described ahead of time, then it’s much easier to determine whether you’ve tested it.

      The problem is, testing if THEORY X is true is a poor way to go about doing research in the first place. I’m not necessarily a fan of preregistration. It’s kind of a band-aid on a gaping wound.

      If THEORY X holds a prominent position with lots of strong evidence that accords with it, like maybe relativity, then it’s pretty clear whether some experiment is a test of THEORY X. But if THEORY X is a poorly defined thing, like how fat arms relate somehow to political attitudes, then there’s no reason to put THEORY X up on a pedestal in the first place. Better, I think, to simply collect good measurements and re-analyze with an open mind; if you consistently find that THEORY X fits well, then you start to believe it and formalize it.

      The problem is researchers are reporting findings from a single noisy study as if their p < 0.05 proves some theory fairly conclusively. We're erecting pedestals like some kind of marble carving business at a trade convention.

  2. If the ovulation-and-clothing researchers had pre-registered, then they wouldn’t have been able to salvage the 2nd study by using that post-hoc excuse of discovering an interaction with outdoor temperature, right?

    They would have had to admit the plain truth that the first result was spurious?

    • Rahul:

      Had those researchers preregistered their second study, nothing would’ve stopped them from noticing the interaction with outdoor temperature and reporting that as a speculative, or non-preregistered, finding. I don’t think they would’ve had to come to the conclusion that the first result was spurious. They could continue to believe that all these things they saw were real (that is, that they generalized to some larger population of interest).

  3. Towel rolling’s economic importance pales in comparison with how a store is laid out. In his book Priceless: The Myth of Fair Value (and How to Take Advantage of It), William Poundstone claims:

    Shoppers open their wallets wider when moving through a store in a counterclockwise direction. On average, these shoppers spend $2 more a trip than clockwise shoppers…[Because] North Americans see shopping carts as “cars” to be driven to the right…By this theory, the right-handed majority finds it easier to make impulse purchases when the wall or shelf is to the right…[resulting in] markets putting their main entrance on the right of the store’s layout to encourage counterclockwise shopping.

    Presumably, because Brits drive on the left, they would prefer clockwise over anticlockwise (British for counterclockwise). I do most of the shopping in my household, and my local, limited, nonrandom experience is that there is no clear winner.
    When it comes to toilet paper dispensing, there is the vexing question of over the top or from the back:

    https://answers.yahoo.com/question/index;_ylt=A0LEV7.qwnRVsgsAUUInnIlQ;_ylu=X3oDMTBydWNmY2MwBGNvbG8DYmYxBHBvcwM0BHZ0aWQDBHNlYwNzcg–?qid=20071012034000AAJrkoO&p=toilet%20paper%20dispensing

    The following purports to be results from actual data:

    Most people prefer toilet paper be dispensed over the top. The Scott Paper Company conducted a poll and found older folks (over 50 years old) clearly preferred to have their toilet paper dispensed over the top by more than 4 to 1.

  4. I’ve said before that I really think there are a lot of “old school” researchers wondering what the fuss is all about. In their day, they’d gather lots of evidence, replicate many times over with slight tweaks to confirm and explore the phenomenon. Publication rates were orders of magnitude lower with larger empirical content per pub, and I think there were fewer hyped up claims. I could rattle off the names of lots of careful, thoughtful empirical researchers who are kind of puzzled by this whole replication mess. I’m not saying there weren’t fads and mirages “back then”, but maybe a better signal/noise ratio?

    There’s a retro aspect to your medium- and long-term views. I like it – let’s go back to a time when equity was held for at least minutes, and when someone actually stood behind a mortgage. I do worry that all-out pre-registration would be a bit stifling in an era when we are armed to the teeth with data and computational power and the ability to share and collaborate in real time. Obviously I see the purpose and the value, but perhaps a more restrained but flexible approach to exploration, discovery and confirmation could work too.

  5. > I did not recommend that those ovulation-and-clothing researchers do a preregistered replication. Sure, they could, but given their noise level, it’s doomed to fail

    How would one know in advance for such studies that this is gonna happen?

    • Rob:

      The problem is the signal-to-noise ratio. When the underlying effect is small and highly variable, and is measured with a lot of bias and variance, then even if you do get lucky and legitimately get p less than .05, you’re still just chasing noise. It’s one of these games where, even if you win, you lose. See here for a picture.

      • It makes sense that if you assume some particular level of signal relative to noise, you can make calculations like that. How are you getting these assumed signal and noise levels? Do they come directly from calculations the original authors made, or from prior information?

        What statistical procedures should I be doing if I want to avoid chasing noise in my own work?

        • Rob:

          I recommend estimating variation in signal and noise from data, using Bayesian methods (as discussed in my books) to integrate information from multiple data sources. When using published data summaries you have to watch out for selection effects (not just the so-called file-drawer effect but the more general problem that researchers will be able to dig into their data and find statistically significant comparisons which they will then highlight). If you want to start non-Bayesianly, I recommend the methods discussed in my 2014 paper with Carlin: http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf
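
          For concreteness, here is a rough sketch in Python of the sort of design analysis in that paper (the effect size and standard error below are invented for illustration; this is not the paper’s own code):

            # Power, type S error rate, and exaggeration ratio (type M), given a
            # hypothesized true effect size and the study's standard error.
            import numpy as np
            from scipy.stats import norm

            def design_analysis(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
                z = norm.ppf(1 - alpha / 2)
                p_hi = 1 - norm.cdf(z - true_effect / se)  # significantly positive
                p_lo = norm.cdf(-z - true_effect / se)     # significantly negative
                power = p_hi + p_lo
                type_s = p_lo / power  # wrong-sign share of significant results
                est = np.random.default_rng(seed).normal(true_effect, se, n_sims)
                sig = np.abs(est) > z * se
                exaggeration = np.abs(est[sig]).mean() / true_effect  # type M
                return power, type_s, exaggeration

            # A small true effect measured with lots of noise:
            print(design_analysis(true_effect=0.1, se=1.0))
            # Power barely above .05, type S error near 40%, and "significant"
            # estimates exaggerating the true effect by an order of magnitude.

          That’s the “even if you win, you lose” situation: the rare significant result has a good chance of being in the wrong direction, and in this example any significant estimate must exceed 1.96, roughly twenty times the assumed true effect.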

  6. Hi Andrew,

    Thanks for discussing the paper. This replication was actually fun to do (we saw the experimental setup and it was love at first sight). Topolinski was helpful and gracious throughout the entire process.

    About the preregistration: it is clear that without preregistration (and *with* significance seeking, or data-inspired garden-of-forking-path style analyses), those p-values do not “enjoy” their regular interpretation. Adriaan de Groot (who founded my department and wrote a thesis “Thought and Choice in Chess” with world chess champions as test subjects) compared the situation to a multiple testing problem with the number of comparisons unknown (http://www.ejwagenmakers.com/inpress/DeGroot1956_TA.pdf).

    Unfortunately, the problem extends to credible intervals, in the following way. Consider a hierarchical solution to the multiplicity problem. Cherry-picking the results is like reporting only a few extreme results, and omitting all the other effects that, when honestly reported and analyzed, would shrink the effects towards zero. Of course if one is completely honest then all of the results are reported and analyzed. However, there is human nature, there is tenure, and hindsight bias, and confirmation bias. So I like preregistration as a means of protection against those influences.

    I accept that in some fields the problem is less severe. In my own field (mathematical psychology), for example, nobody is biased and we are all completely honest. (By the way, this is called “blindsight bias”: the bias of thinking that you’re not vulnerable to bias :-))

    I would also like to stress that preregistration does not prevent or rule out exploratory analyses in any way. It may not be appropriate for all work (especially when the purpose is more descriptive) but when a specific empirical claim is at stake, I feel much safer with preregistration.

    Cheers,
    E.J.

    • EJ:

      I liked the graphs with the plain language evidence interpretation on the right axis (anecdotal, moderate, etc…).

      Most consumers of research store findings in a coarsened, pixelated, Lego-like discrete world! Yet by drawing the curve you also keep the underlying value.

      Finally, we can have our cake and eat it.

  7. I’m becoming a fan of pre-registration. In my collaborations in the past, it was pretty predictable that when the results didn’t turn out as hoped, the clinical investigators would demand that I look at endless subset analyses to find a subgroup of the population for whom “it works.” It was very difficult to “apply the brakes” to that process, and some of my collaborations broke up over the issue. And in some cases, I ultimately caved to the pressure.

    Now, the major medical journals will only publish a randomized controlled trial if it was pre-registered. It makes my life so much easier on the back end. Yes, we have to be a bit more thoughtful ex ante–which takes time, but ultimately is a good thing, no? And at the back end there is no squabbling about p-hacking: it’s clear that we have to publish the analysis we committed to before we had the data in hand. Yes, we can do additional subset analyses if we want, but there is no possibility to deceptively present them as if they were anything other than a lucky fork in the garden path. Maybe it’s just because I’m conflict-avoidant, but pre-registration has really improved my life.

  8. I think a real danger of preregistration is that we move from the regime where 1 out of 20 results are false by definition (even setting aside every concern about the quality of hypotheses), but where no one really believes them anyway (sure, people will use them in arguments, but not base their decisions on them), to the regime where 1 out of 400 results are false by definition (0.05² = 1/400, if you do one preregistered replication, again not counting all the other problems), but where results manage to generate significant belief changes, because they come from a movement that puts scientific rigor on a pedestal. And there are about 2 million papers per year.

    • I do not know what field you work in, but in my field of study life changing decisions are made based on nonsensical research results every day of every year.

    • Ibn:

      You write, “the regime where 1 out of 20 results are false by definition.” No! Even in a world where all the statistical models are correct and in which true effects are zero or not, there’s no reason to think that 5% of the statistically significant comparisons correspond to zero effects. You’re making the classic error of flipping the conditional probability.
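
      A minimal numerical sketch of the flipped conditional (the base rate and power here are invented for illustration):

        # Why P(effect is zero | significant) is not alpha. Invented numbers:
        # 90% of studied effects are null, alpha = .05, power = 0.5 otherwise.
        p_null, alpha, power = 0.9, 0.05, 0.5
        false_pos = p_null * alpha       # null effects reaching significance
        true_pos = (1 - p_null) * power  # real effects reaching significance
        print(false_pos / (false_pos + true_pos))  # ~0.47, not 0.05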

        • Yeah, it’s sloppy; maybe it’s better to say that 1 out of 20 comparisons will yield a significance star just because of the shape of the likelihood (the distribution of observation noise), even when assumptions about the underlying distribution are correct (worse otherwise), regardless of whether any effect is actually going on. I hope this is correct.

        So what I wanted to say is that requiring one preregistered replication only lowers the number of nonsensical findings by (let’s say) an order of magnitude, which is still a lot given the number of studies, while heavily reinforcing belief in whatever passes through.
