“What can recent replication failures tell us about the theoretical commitments of psychology?”

Psychology/philosophy professor Stan Klein was motivated by our power pose discussion to send along this article, which seems to me a worthy entry in what I’ve lately been calling “the literature of exasperation,” following in the tradition of Meehl and others.

I offer one minor correction. Klein writes, “I have no doubt that the complete reinstatement of experimental conditions will ensure a successful replication of a task’s outcome.” I think this statement is too optimistic. Redo the same experiment on the same people but re-randomize, and anything can happen. If the underlying effect is near zero (as I’d guess is the case, for example, in the power pose example), then there’s no reason to expect success even in an exact replication.
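
To make that concrete, here is a quick simulation sketch (the two-group design and sample size are made up for illustration; the point is the rate): when the underlying effect is essentially zero, an exact replication comes up statistically significant at the 5% level only about 5% of the time, which is just the false-positive rate of the test.

    # Sketch: exact replication of a two-group experiment when the true effect is ~0.
    # (Hypothetical design: 50 per group, outcomes on a standardized scale.)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_per_group, n_sims = 50, 10_000

    successes = 0
    for _ in range(n_sims):
        treated = rng.normal(0.0, 1, n_per_group)   # true effect set to zero
        control = rng.normal(0.0, 1, n_per_group)
        successes += stats.ttest_ind(treated, control).pvalue < 0.05

    print(successes / n_sims)  # about 0.05: no reason to expect a "successful" replication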

More to the point is Klein’s discussion of the nature of theorizing in psychology research. Near the end of his article he discusses the materialist doctrine “that reality, in its entirety, must be composed of quantifiable, material substances.”

That reminds me of one of the most ridiculous of many ridiculous hyped studies in the past few decades, a randomized experiment purporting to demonstrate the effectiveness of intercessory prayer (p=.04 after performing 3 hypothesis tests, not that anyone’s counting; Deb and I mention it in our Teaching Statistics book). What amazed me about this study—beyond the philosophically untenable (to me) idea that God is unable to interfere with the randomization but will go to the trouble of improving the health of the prayed-for people by some small amount, just enough to assure publication—was the effort the researchers put in to diminish any possible treatment effect.
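
Regarding that parenthetical about the three hypothesis tests: the multiplicity arithmetic is simple enough to do on the back of an envelope (this sketch assumes three independent tests at the nominal .05 level, which the study’s tests presumably were not exactly):

    # Back-of-the-envelope: chance that at least one of 3 independent tests at the
    # .05 level comes up "significant" when there is nothing going on at all.
    alpha, k = 0.05, 3
    print(1 - (1 - alpha) ** k)  # about 0.14, well above the nominal 0.05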

It’s reasonable to think that prayer could help people in many ways: for example, it is comforting to know that your friends and family care enough about your health to pray for it. But in this study they chose people to pray who had no connection to the people prayed for—and the latter group were not even told of the intervention. The experiment was explicitly designed to remove all but supernatural effects, somewhat in the manner that a magician elaborately demonstrates that there are no hidden wires, nothing hidden in the sleeves, etc. Similarly with Bargh’s embodied cognition study: the elderly words were slipped into the study so unobtrusively as to almost remove any chance they could have an effect.

I suppose if you tell participants to think about elderly people and then they walk slower, this is boring; it only reaches the status of noteworthy research if the treatment is imperceptible. Similarly for other bank-shot ideas such as the correlation between menstrual cycle and political attitudes. There seems to be something that pushes researchers to attenuate their treatments to zero, at which point they pull out the usual bag of tricks to attain statistical significance. It’s as if they were taking ESP research as a model. See discussion here on “piss-poor omnicausal social science.”
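
To see how little it takes, here is a minimal sketch of the simplest trick in the bag (a hypothetical setup, not any particular study): give the analyst five outcome measures, report whichever comparison looks best, and even with a true effect of exactly zero the apparent “significance” rate climbs well above 5%.

    # Sketch of the simplest trick in the bag: pick the best-looking of several
    # outcomes. True effect is zero for every outcome. (Hypothetical setup.)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_per_group, n_outcomes, n_sims = 50, 5, 10_000

    wins = 0
    for _ in range(n_sims):
        p_values = [
            stats.ttest_ind(rng.normal(0, 1, n_per_group),
                            rng.normal(0, 1, n_per_group)).pvalue
            for _ in range(n_outcomes)
        ]
        wins += min(p_values) < 0.05

    print(wins / n_sims)  # roughly 0.2, even though nothing is there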

Klein’s paper, “The Unplanned Obsolescence of Psychological Science and an Argument for Its Revival,” is relevant to this discussion.

14 thoughts on ““What can recent replication failures tell us about the theoretical commitments of psychology?””

  1. As I read the linked article, I realized that an experiment that is supposed to test one thing – say the effect of “power posing” – actually tests a large number of things at once. For example, 1) the exact set of people selected to be subjects AND 2) the order in which they are tested AND 3) the physical circumstances of their testing (e.g., in a cubicle or not, as in one of the article’s examples) AND 4) the ambient temperature AND 5) the knowledge of the test administrator of what is being tested AND 6) the season of the year AND 7) AND 8) AND so on, for a near-infinite number of experimental conditions.

    What is established is that a certain result was obtained during a *conjunction* (all those ANDs) of the circumstances. The reported conclusion is valid for other circumstances only if the effect of all those other items is either small or predictable (and hence can be allowed for). But for so many of these questionable experiments in the social/psychological area, there is no theory or body of established practice that allows one to do that kind of pruning.

    Here’s a little thought experiment. Imagine you are in the early days of mechanics – that is, in Newton’s time – and you have just heard about the first law of motion: a body maintains its state of motion if not acted upon by outside forces. How can you test that? In fact, every attempt ends in failure – a wheel slows down and stops rolling; a puck slides more and more slowly; etc. Experimentally, it looks like every test disproves the first law. So how can it be correct? And how can you demonstrate it?

    The context is Newton’s second law, F=ma. To test the first law, we have to remove all forces. We haven’t figured out how to do that, but we can test the second law pretty well by learning how to measure forces on an object. We then find that F=ma holds pretty closely, and the better we know or measure our forces, the closer our results come to the predictions of the equation.

    So now, finally, we can have good confidence that the first law is an extreme case of the second law (i.e., the case of no applied forces). Even though we can’t literally test the first law directly, we can have confidence in it because we have tested the second law so well. The context of the larger theory lets us evaluate the first law even though all direct tests are found to fail.

    OK, now where is the equivalent work in the social/psychological arena? Notice that we need to have a network of tests and theories that build on each other in a principled and verifiable way so that we can have confidence in pruning away those unwanted experimental circumstances. This is the very opposite of having an effect that evaporates under small changes in experimental setup.

    A nice example of how it should work is Andrew’s recent post on the underlying cause of a post-convention polling “bounce”. Now that this effect has been identified, it can be allowed for in the future. Thus the art of understanding polling results has improved, and future analyses can be more robust.

    • Motte’s translation of Newton’s Principia into English has Newton stating, immediately after setting forth the first law:

      Projectiles persevere in their motions, so far as they are not retarded
      by the resistance of the air, or impelled downwards by the force of gravity.
      A top, whose parts by their cohesion are perpetually drawn aside from
      rectilinear motions, does not cease its rotation, otherwise than as it is retarded
      by the air. The greater bodies of the planets and comets, meeting
      with less resistance in more free spaces, preserve their motions both progressive
      and circular for a much longer time.

      The orbits of the comets and the planets, together with the theory of gravity, give strong evidence that the first law is a good model of the world.

      Bob.

  2. Andrew’s negative assessment of the “p=.04 after performing 3 hypothesis tests” notwithstanding, intercessory prayer was quite the rage a few years ago when the John Templeton Foundation was offering financial support for studies that “proved” the existence of (presumably a Christian) God. This (hilarious) ultimate critique of intercessory prayer may be found at

    http://www.bmj.com/content/323/7327/1450.full

    However, Leonard Leibovici’s BMJ spoof overlooks the fact that many millions in the U.S., while deluded on this subject, are convinced and committed believers in the power of prayer. So much so that several commentators took his satire at face value:

    “I [Leibovici] felt embarrassed because some people quoted it as a sign of God’s powers, and I didn’t intend it as a parody on honest belief.”

    “Disappointingly, satire, reductio ad absurdum, are bad rhetorical tools when addressing large audiences.”

    • Paul:

      It does not seem unreasonable to me that prayer could work, and I’m even willing to consider that it could work by supernatural means (not that I believe this myself, but given that it seems to be a majority view in this country if not the world, it makes sense to me that lots of people would want to study it). The part that seems ridiculous to me is the idea of that sort of p-value cargo-cult scientific testing of the supernatural. If you’re interested in prayer, why not study it where people care about it, not in a meaningless clinical setting where the patients don’t even know they’re being prayed for? That’s just silly. If prayer works, God is involved. Don’t treat it like spoon bending or tarot card reading.

      • Leibovici’s BMJ contribution was especially incisive because he was studying RETROACTIVE intercessory prayer:

        “Given a ‘study’ that looks methodologically correct, but tests something that is completely out of our frame (or model) of the physical world (e.g., retroactive intervention or badly distilled water for asthma) would you believe in it?”

        “To deny from the beginning that empirical methods can be applied to questions that are completely outside our scientific model of the physical world. Or in a more formal way, if the pre-trial probability is infinitesimally low, the results of the trial will not really change it, and the trial should not be performed. This is what, to my mind, turns the BMJ piece into a ‘non-study’ although the details provided in the publication (randomization done only once, statement of a wish, analysis, etc.) are correct.”

        Andrew’s criticism misses the point: believers in prayer would indeed be comforted if some scientific mumbo-jumbo – as Andrew put it, “p-value cargo-cult” – could be enlisted. Rather similar to those believers in the flood of Noah who are desperately searching the mountains of Ararat for pieces of the Ark, or to the search for scientific authentication of the Shroud of Turin.

    • I recall recently hearing something from a religious relative to the effect of
      “I prayed to God to make certain changes; instead, He changed me so that I made the changes.” This view suggests a limit to either God’s powers, or to his willingness to exercise those powers. The latter belief would be compatible with the idea that God is *unwilling* “to interfere with the randomization but will go to the trouble of improving the health of the prayed-for”.

      • Martha:

        Yah, but God would also need to know about p-hacking and the .05 threshold. Really the simpler explanation is that the authors of the study prayed for a statistically significant result and God obliged them by allowing them to p-hack.

        • Ah, but isn’t God supposed to work in mysterious ways? If so, then whether or not an explanation is simpler is irrelevant. And isn’t God supposed to be omnipotent? If so, then he/she/it knows about p-hacking and the .05 threshold.
