Some thoughts on another failed replication in psychology

Joe Simmons and Leif Nelson write:

We report our attempt to replicate a study in a recently published Journal of Marketing Research (JMR) article entitled, “Having Control Over and Above Situations: The Influence of Elevated Viewpoints on Risk Taking”. The article’s abstract summarizes the key result: “consumers’ views of scenery from a high physical elevation induce an illusory source of control, which in turn intensifies risk taking.”

Lots of details, of which the most important is that their replication is very close to the original study, but with three times the sample size.

And now the findings, first the exciting statistically significant results from the original published study, then the blah, noisy results from the preregistered replication:

Yup, the usual story. It’s 50 shades of gray all over again. Or embodied cognition. Or power pose. Or a zillion other examples pushed by the happy-talk crowd.

The above graphs pretty much tell the whole story, but I have one point I’d like to pick up on.

But the top graph looked like such strong evidence! Let’s be very very aware and very very afraid of this. It’s soooo easy to get fooled by graphs such as Figure 1 that just seem to slam a point home.

So let’s say this again: Just cos you have statistical significance and a graph that shows a clear and consistent pattern, it doesn’t mean this pattern is real, in the sense of generalizing beyond your sample. This is a big deal.
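To make that concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not taken from the original study or the replication: a small true effect of 0.1 standard deviations, 50 people per group, a known standard deviation of 1, and the hypothetical function names `simulate_study` and `significance_filter`. It runs many such studies and keeps only the ones that reach p < 0.05.

```python
import math
import random
import statistics


def simulate_study(true_d, n_per_group, rng):
    """Return (estimated effect, two-sided p-value) for one two-group study."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    # Standard error of a difference in means, treating sd = 1 as known
    # (a simplifying assumption; a real analysis would estimate it).
    se = math.sqrt(2.0 / n_per_group)
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return diff, p


def significance_filter(true_d=0.1, n_per_group=50, n_studies=2000, seed=1):
    """Simulate many studies and summarize only the 'significant' ones."""
    rng = random.Random(seed)
    results = [simulate_study(true_d, n_per_group, rng) for _ in range(n_studies)]
    significant = [d for d, p in results if p < 0.05]
    return {
        "share_significant": len(significant) / n_studies,
        "mean_abs_estimate_if_significant": statistics.fmean(
            [abs(d) for d in significant]
        ),
        "share_wrong_sign_if_significant": sum(d < 0 for d in significant)
        / len(significant),
    }


if __name__ == "__main__":
    print(significance_filter())
```

Under these assumptions the standard error is 0.2, so an estimate has to be larger than about 0.39 in absolute value to be significant. Every study that passes the filter therefore overstates the assumed true effect of 0.1 several times over, and some of them get the sign wrong.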

P.S. I wrote this post last year but it’s appearing now, so I’ll add this special message just for coronavirus studies:

Just cos you have statistical significance and a graph that shows a clear and consistent pattern, it doesn’t mean this pattern is real, in the sense of generalizing beyond your sample. This is a big deal.

Also, thanks to Zad for the above foto which he captions, “When you get arrested by the feds for not social distancing.”

41 thoughts on “Some thoughts on another failed replication in psychology”

  1. Clearly, I am not fit to be a psychologist. I fail to see how my elevation and its supposed relationship with my willingness to buy a heated butter knife say anything about my risk-taking behavior. And, that’s without worrying about MTurk selection problems, statistical noise, forking paths, or p values. I just don’t see how these supposed connections could tell us anything useful.

    As a frequent hiker/climber I can well believe that many of my attitudes change when I am standing on a high ridge or mountaintop. And, my willingness to take risks when in such places is quite important to my own safety. So, those relationships might be worth studying. But, even under the best of circumstances, I can’t see that my willingness to purchase a heated butter knife tells me anything about, well, anything.

    • Clearly, the high-elevation perspective encourages one to believe that they are “above the fray” and occupy an accordingly “higher” place in society that insulates one from the risks involved in wasting money on a heated butter knife.

      But on the other hand, a low-elevation perspective reinforces how much farther one needs to go in order to achieve one’s dreams, encouraging you to take more risks—including that of looking like an idiot for wasting money on a heated butter knife—because what have you got to lose except your dignity and money?

      • I love your speculative stories. Just as with NHST, if we run the experiment, whichever way the results turn out will be used to support either your first story or your second. However, the experiment does not really test either story at all – it merely gives results “consistent with” one story or the other, as well as consistent with an infinite number of alternative stories.

        • Thanks, and I appreciate you letting me use your comment as an excuse to indulge in some creative storytelling!

          And I suppose it’s no shock that I agree with your overall message, which is that even if those stories were meant to be taken as serious psychological theories, the experiment itself would provide very little information about them. As you say, the experiment certainly doesn’t subject those theories to any kind of strong test, nor would it allow one to estimate boundary conditions for the presence/absence of the predicted effect.

          Just to keep being speculative, if I were interested in the ways that framing a picture might influence purchasing decisions—which probably does matter!—what I’d try to do is present people with a large battery of candidate products that varied along many dimensions to see whether the effect of picture angle varied systematically along any of those dimensions. It might not, of course, but at least that would be a way to establish the basic outlines of any such relationship that would serve as a foundation for subsequent experiments and theory.

          Instead, the kinds of experiments people are actually doing really just seem like excuses to tell the story they want to tell. So as you say, the problem isn’t that it doesn’t replicate, it’s that it doesn’t even answer a useful question.

        • If anyone is to be blamed for this high-elevation hypothesis and the resulting creative story-telling, it’s G. K. Chesterton.

          In one of his stories (spoiler alert incoming), a colonel is found brutally murdered in the middle of the street, his skull smashed. Suspicion naturally falls upon the blacksmith, the only man strong enough to wield a blow that devastating. Father Brown, however, figures out the murderer was actually the diminutive curate of the nearby church, the brother of the victim, who dropped a hammer on his head from the top of the church.

          Father Brown, while confronting the curate, talks about how from such heights, human beings look insignificant in size. He speculates how this might create a God complex in man, leading a good man to commit a brutal murder.

          “I think there is something rather dangerous about standing on these high places even to pray…Heights were made to be
          looked at, not to be looked from.”

          The story made a deep impression on me. I wonder if the author of the study was also unconsciously influenced by it. If so, he should have stayed true to the original design. No self-respecting curate is going to use a heated butter knife to kill his brother.

        • But the throwing of the hammer was blind rage. The hammer hitting the victim was one in a million. He was an instrument of God. It might as well have been a heated butter knife. Well, a cordless heated butterknife.

          Nevertheless, still a homicide, in need of confession. Being an instrument of God doesn’t absolve you. And you should be working within the system, you literally damned Protestants!

        • Agreed. But from a psychology perspective, the way you deal with this would generally be to run several studies with different designs that would all reveal an effect of altitude on risk-taking if there was one, but NOT all be subject to these same confounds. Of course, this strategy is not without its potential pitfalls:

          http://datacolada.org/31

    • Well, it’s in the Journal of Marketing Research, so one would suspect “willingness to buy a heated butter knife” is the exact sort of behaviour they were interested in investigating, and the label “risk taking behaviour” was applied after the fact.

      • Zhou:

        I think it’s often the opposite: People who write for the Journal of Marketing Research are professors who want to do science research, and the marketing angle is just a way for them to get this research funded and published. I don’t know, but in this case I’m guessing that the researchers are interested in risk taking behavior, and the butter knife is just an example to them.

        • I looked at the author (fortunately there were not seventeen here) and it seems that his interest in how you make people buy things they never felt a strong need for is real: “Ata Jami’s research focuses on consumers’ judgement and decision-making specifically examining how contextual cues of consumption settings and consumers’ sensory system affect their perceptions, judgments, and behaviors.”

          “When touring the Empire State Building as a graduate student, he felt compelled—despite his frugal nature—to splurge at the tourist attraction’s overpriced gift store. “I’m personally very risk-averse, so it got me thinking about how the environment can affect a person’s thinking and behavior,” he recalls.”

          https://insight.kellogg.northwestern.edu/article/customers-risk-taking-elevation-visuals

        • Let’s just say that we should be relieved that the butter knife lobby did not succeed in finding another way to unlock our wallets.

        • The Butter Knife Producers and Marketers Association isn’t the least bit interested in “unlocking” your wallet! We only want you to have the high quality of life that our amazing products can bring! We’re not just producers and marketers. We’re butter knife *OWNERS* and we’ve experienced the astounding social magnetism that comes with hot butter knife ownership! The hi tech hot butter knife isn’t just about butter. It tells the right people: you’re one of them! Attractive ladies, well connected lawyers, leading politicians all over America know a butter knife owner just from their unflappable presence and the sheer casualness of their confidence. The culinary value is almost incidental!

          We’re confident when you understand just how much the butter knife does for you, you won’t hesitate to get one!

  2. Another effect that doesn’t necessarily disappear, but is much weaker in replication, can be found in a recent Data Colada post: http://datacolada.org/87

    “In the fifth installment of Data Replicada, we report our attempt to replicate a recently published Journal of Consumer Research (JCR) article entitled, “The Influence of Product Anthropomorphism on Comparative Choice”… The article begins with the premise that “people usually form an integrated impression of [an] entire person rather than seeing this person as consisting of separate traits.” The authors then propose that human-like products will also be judged in this way: Consumers will be more likely to use a holistic process (vs. an attribute-by-attribute comparison) to judge anthropomorphized products.”

    In the two larger replications, there is a much smaller, directionally similar effect which does not reach statistical significance.

    “the effect sizes that we observed – d = .09 in the MTurk replication and d = .10 in the Prolific replication – were much smaller than the effect size reported in the original study (d = .45).”
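A rough way to see what those numbers imply is to ask how many participants per group a simple two-group comparison would need to detect each effect size. The sketch below uses the standard normal-approximation formula for a two-sided test at alpha = 0.05 with 80% power. The design, the alpha and power targets, and the helper name `n_per_group` are assumptions for illustration, not the design or analysis used in the original study or the replications.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2,
    where d is the standardized mean difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Effect sizes quoted above: the original study and the two replications.
for d in (0.45, 0.10, 0.09):
    print(f"d = {d:.2f}: about {n_per_group(d)} participants per group")
```

Because the required sample size scales with 1/d², an effect of d = .10 needs roughly twenty times as many participants as d = .45 for the same power.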

  3. Changing the elevation, lighting, temperature, humidity, color, air flow, sound, etc. etc. combined with any outcome metric that could be remotely meaningful, no wonder we need so many “scientists”.

  4. When I first saw this I thought it was going to be about buying behavior at low vs. high altitudes relating it to something about lower oxygen levels. The actual experiment was a bit of a let down after my initial expectations.

  5. Maybe clutching at straws here, but one possible explanation could be that the original study was done in May, whereas the replication was done in September. If what is going on is that the “high elevation” pictures are more relaxing and evocative of going on holiday, there would be a difference in response if people were doing this study while looking forwards to their summer vacations, vs having just returned.

    • Do they report the temperatures on those days in May and September? If not, then it is clearly not a good replication. What about the latest unemployment figures before the data was collected? Were there any shark attacks? What color clothing were the respondents wearing? Need to know.

  6. After reading all this, I couldn’t resist the urge to look in my kitchen drawer to survey my butter knife holdings. I found one of each type: The small type, that goes on the individual bread-and-butter plate at each place setting, and the large variety that goes with the butter dish and is used for cutting off or picking up a pat of butter to put on one’s individual bread-and-butter plate. (I’m quite sure both knives were gifts from my mother.) But this raises the question: Which type of butter knife was used in the experiment?
    (I also found the grapefruit knife I inherited from my Great Uncle.)

    • Nah, you’re harkening back to a time when aesthetics mattered, and society cared about social graces. This heated butter knife thing is powered by a V8 engine with no muffler, weighs 1800 lbs, and has an enormous phallic handle/seat. You don’t so much cut butter with it as crank up the heating element until it glows red hot, put on a helmet with a skeleton face, and ride it down the street to joust other heated butterknife owners during the post apocalyptic period.

      • Martha, Daniel:

        I feel like with all these cat pictures, we should really be talking about an electric can opener, not an electric butterknife.

        So, in the grand tradition of mathematics: Assume a can opener.

        • I don’t think Daniel’s version of a heated butter knife would have been considered appropriate in my undergraduate dorm, which was dedicated to “gracious living” (Although the housemother did concede to our using a real Bohr’s head to accompany singing of The Boar’s Head Carol — but she did reprimand us for not having excused ourselves properly from our tables before the entertainment.)

        • Oops — Boar’s head (not Bohr’s head – with apologies to Niels; we didn’t really think of decapitating him).

        • Well, obviously you must have gone to school before the advent of the 1979 post-apocalyptic film “Mad Flax” or the sequel “The Butterwarrior”. By the time Beyond Appliancedome hit the screen the descent of society into a state of waiting for apocalypse was complete. Revisiting the franchise with Butty Road in 2015 was actually a trial-balloon for the current engineered pandemic that was financed by the Dairy Farmers of America.

  7. Imagine a world where the original study, its methods and analyses, were preregistered. Where the preregistered experimental hypothesis proposes that the relationship is not zero, but then goes on to predict a relatively-narrow range of values on one side of zero or the other. Where the prediction is based on a theory detailed enough to, a priori, predict individual scores based on participant characteristics. Sound unrealistic? Read any IES proposal that makes it past the first round of reviews. They require a power analysis to reject the null, but also an expected effect size with a sound justification, and (usually) predictions about the effect within specific subgroups. Unfortunately, they don’t require grant recipients to acknowledge any of that during dissemination.

  8. So, assuming that the authors of the initial paper did their analysis correctly… after this replication, what do we/should we expect of them? Should they retract? Should they restate their conclusions less strongly?

    I guess the bigger question is this: we know that science is not easy and that genuine, skilled researchers can find conclusions in their data that don’t replicate (or the effect in the replication is not as strong). What sort of behavior do we expect from them? Surely retraction in every instance is too strong of a response.

    • Kenneth:

      Assuming their original experiment was described accurately, it would be inappropriate for them to retract their methods and results sections. But I guess they could re-think their title, abstract, and conclusions.

      More generally, next time I think that researchers should limit their title and main conclusions to what they have clear evidence for. If they then want to also make speculations, that’s fine, but the speculations should be clearly labeled as such.

    • The beauty of statistics is that, when done correctly and transparently, the statistical results in a paper are never wrong. That is, the statistical results convey information about the parameter of interest that, absent shenanigans, only enriches the literature. The fact that the information may be overwhelmed by noise is the very reason we also estimate the parameter’s error. That we may have made innocent mistakes is why we state our assumptions and present any evidence that they were or were not sufficiently met. A paper meeting all these criteria has earned its place in the literature.

      The wise scientist–or journalist–understands that experimental conclusions, unlike statistical conclusions, are largely opinions, for which the author may argue persuasively, or not. Andrew’s right that these beliefs and arguments should be reconsidered in light of new evidence. Hopefully, a lesson in good science and good writing will also be learned: an argument characterized by overconfidence and overstatement, that does not draw attention to its own weaknesses, risks making the author look like a fool.

  9. My question is: since the replication had a sample size 3 times that of the original study, does that mean we have to, or that it is better to, collect 2-3 times more samples than our power calculations say?

    • Anoop:

      There’s more to design than sample size. Even if you have a statistically significant difference in some particular experiment, it’s not clear how it will generalize to other settings.
