Michael Lacour vs John Bargh and Amy Cuddy

In our discussion of the Bargh, Chen, and Burrows priming-with-elderly-related-words-makes-people-walk-slowly paper (the study that famously failed in a preregistered replication), commenter Lois wrote:

Curious as to what people think of this comment on the Bargh et al. (1996) paper from Pubpeer: (see below).

In Experiment 3, the experimenter rated participants on irritability, hostility, anger, and uncooperativeness on 10-point scales. Independently, two blind coders rated the same participants’ facial expressions (from video-tapes) on a 5-point scale of hostility.
Figure 3 in the paper displays the mean hostility ratings from the experimenter and the blind coders for participants in the two conditions (Caucasian prime and African-American prime). The means from the experimenter and from the blind coders appear almost identical (they are indistinguishable to my eye).
How likely is it that the experimenter and the blind coders provided identical ratings, using different scales, in both conditions, for ratings of something as subjective as expressed hostility?
Are the bars shown in Figure 3 an error?

Did Bargh et al. cheat? Did their research assistant cheat? Did they make up data? Did they make an Excel error? Did they do everything correctly and we have nothing here but a suspicious PubPeer commenter?

I don’t know. But from a scientific perspective, it doesn’t really matter.

In our discussion of the Carney, Cuddy, and Yap paper on power pose (another one with a failed replication), commenters here and here noticed some irregularities in this and an earlier paper of Cuddy: test statistics were miscalculated, often in ways that moved a result from p greater than .05 to p less than .05.

Did Cuddy et al. cheat? Did they enter the wrong values in their calculator? Were the calculations performed by an unscrupulous research assistant who was paid by the star?

I don’t know. But from a scientific perspective, it doesn’t really matter.

Why doesn’t it matter? Because these studies are, from a quantitative perspective, nothing but the manipulation of noise. Any effects are too variable to be detected by these experiments. Indeed, this variability is implicitly recognized in the literature, where every new study reports the discovery of a new interaction.

So, whether these p-values less than .05 come from cheating, sloppiness, the garden of forking paths, or some combination of the above (for example, opportunistic rounding might fall somewhere on the borderline between cheating and sloppiness, and Daryl Bem’s thicket of forking paths is so tangled that in my opinion it counts as sloppiness at the very least), it just doesn’t matter. These quantitative results are useless anyway.
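
To make the "manipulation of noise" point concrete, here is a minimal simulation sketch (purely illustrative: the sample size, the number of arbitrary subgroups, and the use of t-tests are my own assumptions, not a reanalysis of any of these studies). It generates data with no true effect at all, then scans a handful of subgroups for a treatment effect and keeps the smallest p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def smallest_p(n=100, n_covariates=6):
    """One null dataset, analyzed with a little forking-paths enthusiasm."""
    treat = rng.integers(0, 2, n)                 # random "treatment" labels
    y = rng.normal(size=n)                        # outcome is pure noise
    covs = rng.integers(0, 2, (n_covariates, n))  # arbitrary binary covariates
    pvals = [stats.ttest_ind(y[treat == 1], y[treat == 0]).pvalue]  # main effect
    for g in covs:                                # ...and within every subgroup
        for level in (0, 1):
            mask = g == level
            a, b = y[mask & (treat == 1)], y[mask & (treat == 0)]
            if len(a) > 1 and len(b) > 1:
                pvals.append(stats.ttest_ind(a, b).pvalue)
    return min(pvals)

hits = np.mean([smallest_p() < 0.05 for _ in range(1000)])
print(f"share of pure-noise datasets with at least one p < .05: {hits:.2f}")
# Far above the nominal 5%; and that is before any data exclusion.
```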

Enter Lacour

In contrast, consider the fraud conducted by Michael Lacour in his paper with Don Green on voter persuasion. This one, if his numbers had been true, really would’ve represented strong evidence in favor of his hypothesis. He had a clean study, a large sample, and a large and persistent effect. Indeed, when I first saw that paper, my first thought was forking paths, but upon a quick read it was clear that there was something real there (conditional on the data being true, of course). No way these results could’ve been manufactured simply by excluding some awkward data points or playing around with predictors in the regression.

So in the Lacour case, the revelations of fraud were necessary. The forensics were a key part of the story. For Cuddy, etc., sure, maybe they engaged in unethical research practices, or maybe they were just sloppy, but it’s not really worth our while to try to find out. Cos even if they did everything they did with pure heart, their numbers are not offering real support for their claims.

One might say that Lacour was too clever by half. He could’ve got that paper in Science without any faking of data at all! Just do the survey, gather the data, then use the garden of forking paths to get a few statistically significant p-values. Here are some of the steps he could’ve taken:

1. If he gathers real data, lots of the data won’t fit his story: there will be people who got the treatment who remained hostile to gay rights and people who got the control who still end up supporting gay marriage. Start out by cleaning your data, removing as many of these aberrant cases as you can get away with. It’s not so hard: just go through, case by case, and see what you can find to disqualify these people. Maybe some of their responses to other survey questions were inconsistent, maybe the interview was conducted outside the official window for the study (which you can of course alter at will, and it looks so innocuous in the paper: “We conducted X interviews between May 1 and June 30, 2011” or whatever). If the aggregate results for any canvasser look particularly bad, you could figure out a way to discard that whole batch of data—for example, maybe this was the one canvasser who was over 40 years old, or the one canvasser who did not live in the L.A. area, or whatever. If there’s a particular precinct where the results did not turn out as they should, find a way to exclude it: perhaps the vote in the previous election there was more than 2 standard deviations away from the mean. Whatever.

2. Now comes the analysis stage. Subsets, comparisons, interactions. Run regressions including whatever you want. If you find no main effect, do a path analysis—that was enough to convince a renowned political scientist that a subliminally flashing smiley face could cause a large change in people’s attitudes toward immigration. The smiley-face conclusion was a stretch (to say the least); by comparison, Lacour and Green’s claims, while large, were somewhat plausible.

3. The writeup. Obviously you report any statistically significant thing you happen to find. But you can do more. You can report any non-significant comparisons as evidence against alternative explanations of your data. And if you’re really ballsy you can report p-values such as .08 as evidence for weak effects. What you’re doing is spinning a story using all these p-values as raw material. This is important: don’t think of statistical significance as the culmination of your study; feel confident that, using all your researcher degrees of freedom, you can get as many statistically significant p-values as you want, and look ahead to the next step of telling your tale. You already have statistical significance; no question about that. To get the win—the publication in Nature, Science, or PPNAS—you need that pow! that connects your empirical findings to your theory.

4. Finally, once you’ve created your statistical significance, you can use that as backing to make as many graphs as you want. So Lacour still gets to use that skill of his.

Isn’t it funny? He could’ve got all the success he wanted—NPR coverage, a TED talk, a tenure-track job, a series of papers in top journals—without faking the data at all!

For decades the statistical community completely missed the point. We obsessed about stopping rules and publication bias, pretty much ignoring that the real action is in data exclusion, subsetting, and degrees of freedom in the analysis.

P.S. I wrote this post months ago; it just happened to come up today, in the midst of all the discussion of the replication crisis.

41 thoughts on “Michael Lacour vs John Bargh and Amy Cuddy”

  1. There are also some worrisome problems in a 2008 paper by John Bargh in Science.
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2737341/

    In the first study, the primary effect is described as significant even though the exact p-value associated with the reported F-value (F(1,39)=4.08) is greater than .05 (a quick check is sketched below).
    In the second study, the authors report an N of 53 and then claim that these participants were randomly assigned into two groups such that 75% of one group made a certain decision and 46% of the other group made that same decision. However, there is no way that such a breakdown is possible from an N of 53.
    Of course both studies are incredibly underpowered.
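
    For reference, here is that quick check of the first point; it just computes the upper-tail probability of the reported F(1, 39) = 4.08 with scipy:

```python
from scipy.stats import f

# Tail probability of the reported F(1, 39) = 4.08
p = f.sf(4.08, dfn=1, dfd=39)
print(f"p = {p:.4f}")  # ~0.0503, i.e. just above .05
```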

    • By my calculations, the first split is possible, with rounding to integer percentages:
      40/53 = .7547 -> 75%
      (And by extension, (53-40=13)/53 = .2453 -> 25%)

      However, the second is not:
      24/53 = .4528 -> 45%
      25/53 = .4717 -> 47%
      (And by extension, 54% isn’t possible either).

      Andrew, I disagree that this doesn’t matter. Sure, the end result is noise in both cases, but until very recently (and I’m still far from convinced that things are about to actually change in the 10,000-publications-a-year world of psychology out there), p < .05 was good enough, no matter how you got there. In fact, I would argue that in such a world, aiming for LaCour-scale effects is stupid. The smart way to fabricate is to go for ps of around .02 and a good-looking story.

      You also missed a much simpler way to get the result you want: run the experiment, collect real data, include all the data points, and just reassign the control/intervention group variable. That way, all of your data will resist any checks on their credibility, because they are indistinguishable from what would happen if there was a real effect. Just move enough participants with high individual effects to the intervention group, and vice versa, until you hit whatever p value you want. This technique also has the advantage of likely being undetectable by your co-authors, because it was a *well-run* experiment and everyone was blind to condition. All you need is five or ten minutes alone with the data before anyone else makes a copy, and a minimum amount of variance in the sample.

      • Nick,
        The total N of 53 first needs to be split into two subgroups. The first clearly needs to be a multiple of 4 given the 75%-25% split. So probably N=24 (18-6 split) or N=28 (21-7 split). The second group would then be N=29 or N=25 but 54% of these is either 15.66 or 13.5. The four plausible splits are therefore 15/29, 16/29, 13/25 or 14/25. None of these values get you 54%.

        • It seems the most likely scenario may be 22 out of 29 and 13 out of 24 for the two groups. The first one is rounded inappropriately (as it is .7586), and the second one does round to .54 (i.e., 54%). If this is just a rounding error (truncating instead of rounding), then that distribution would work.

        • Oops sorry, Mark. I misread the numbers from the original report, it should be 22 out of 29 choosing for the self in the cold condition (which rounds to 76% but could be reported as 75% if someone truncated rather than rounded) versus 11 out of 24 choosing for the self in the hot condition (which does round to 46%, but would be truncated to 45%). I think that is the most likely scenario and does lead to a Chi-Square with p = .0262 comparing the two proportions.
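
          For what it’s worth, a quick scipy check of that table (22/29 vs. 11/24, no continuity correction) lands in the same neighborhood; the exact figure depends on which variant of the chi-square test is used:

```python
import numpy as np
from scipy.stats import chi2_contingency

# 22 of 29 "self" in the cold condition vs. 11 of 24 "self" in the hot condition
table = np.array([[22, 7],
                  [11, 13]])

chi2, p, _, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # ~5.04, p ~ .025
```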

        • Yeah, that would be a rounding error that works against their hypothesis. A bit sloppy, but it sure isn’t creating an effect. I am not saying that the scenario I gave is actually what happened, but I think it is a possible and maybe even plausible account and in my view would be a bit sloppy but totally benign in its impact on inferences.

        • Another possibility is that the split is 18-6 in one group and 14-12 in the other. That reproduces the percentages almost perfectly but only gives a total sample size of 50. Not sure why they may have dropped 3 participants in such a simple experiment, because there is no way that anyone could be an outlier. Of course, in the Carney, Cuddy, Yap paper one participant was dropped for “smiling too much”. Perhaps there was some excessive mirth in this sample too.

          • mark said, “Of course, in the Carney, Cuddy, Yap paper one participant was dropped for ‘smiling too much’.”

          !!! Was any attempt at justification (or even rationalization) given?

        • Andrew, maybe the motivational instructor didn’t read the paper and thought smiling would help the power pose, whereas actually everyone knows that smiling lowers testosterone.

        • Martha,
          No justification given. In addition to the statistical outliers that were discussed in the paper there were four other participants that were excluded from the analyses. The reasons given for each participant number in the dataset are: 38=experiment messed guy up, 45=acted awkward smiled whole time, 48=gave incorr payout info, 56= fire alarm.

        • @mark:
          Thanks. I guess “fire alarm” is a good reason for excluding a data point (assuming it means that the fire alarm went off when the subject was being run), but I have no idea what “experiment messed guy up” is saying.

        • Martha,
          Perhaps the power pose caused a catastrophic spike in testosterone and they excluded this participant so that they would not arrive at an overestimate of the effect size for power poses ;-)

          • Yesterday I wrote to Lawrence Williams (the first and corresponding author of the 2008 Bargh article referenced above) inquiring about the cell frequencies for the 2 x 2 contingency table. I’ll let you know if I receive a response.

      • Hi Nick,

        But it is unlikely that the two groups would have ns of 40 and 13. The ns might not be equal, but surely they would not be so different. The ns might be 28 and 25, for example, for the first split, so then 75% (21/28) and 25% (7/28) are possible.

        But the second split doesn’t seem possible.

        • Sorry, I misread. I was splitting 53 into two parts in each case.

          So, the numbers in each group are about 26-27, plus or minus a few. I started with ways to split a number in this range into 46%/54%. There are three candidates: 24 (11+13), 26 (12+14), and 28 (13+15). The next possible numbers are 13 and 35, which are too far away.

          That means that the numbers in the other group are 29, 27, or 25. And indeed, there is no way to split those 25%/75%.

          All of this is covered in excruciating detail in our GRIM article, currently in press.
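
          In the same spirit, here is a small from-scratch consistency check (a sketch, not the GRIM code itself): enumerate every way to split N = 53 into two groups and ask which group sizes can produce the reported 75% and 46% under ordinary rounding to whole percentages.

```python
# Which splits of N = 53 are consistent with reported rates of 75% and 46%?
N = 53

def counts_matching(n, pct):
    """Counts k out of n whose percentage rounds to the reported whole percent."""
    return [k for k in range(n + 1) if round(100 * k / n) == pct]

for n1 in range(1, N):
    n2 = N - n1
    ok_75, ok_46 = counts_matching(n1, 75), counts_matching(n2, 46)
    if ok_75 and ok_46:
        print(f"n1 = {n1} (75% from {ok_75}), n2 = {n2} (46% from {ok_46})")
# Only badly lopsided splits (12/41, 16/37, 40/13) come out; no roughly even
# split of 53 can produce both reported percentages, matching the discussion above.
```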

        • I already describe your method to my graduate class. Lots of horrified looks when I show them how many articles fail the test.

  2. This is true re: Lacour, but he also would have had to actually collect data, which he didn’t do. The kind of data he wanted (repeated measures survey panel) is expensive to collect; there is increasing pressure on graduate students to somehow create brilliant studies without the backing of a research lab. Given all that pressure, there is a temptation for any graduate student to fish around for statistical significance using whatever they can find.

    What Lacour did is reprehensible, but the incentives in the field promote what he did to “get the job done”.

  3. Andrew, on that “miscalculated” test statistic… If you are referring to the same one I think you are, it was not so much miscalculated as massaged. As I pointed out (http://statmodeling.stat.columbia.edu/2016/02/01/peer-review-make-no-damn-sense/#comment-261841), they are correctly calculating the chi-squared statistic of the g-test of independence. One reasonable hypothesis is that they used the g-test rather than Pearson’s chi-squared because it yielded a p-value below .05.
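
    To see how that choice can matter at the margin, here is a sketch with scipy on a made-up 2x2 table (chosen purely for illustration; it is not data from the papers under discussion): on a borderline table, Pearson’s chi-squared and the g-test (log-likelihood ratio) can land on opposite sides of .05.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table, chosen so the two tests straddle .05
table = np.array([[3, 10],
                  [10, 7]])

chi2, p_pearson, _, _ = chi2_contingency(table, correction=False)
g, p_g, _, _ = chi2_contingency(table, correction=False, lambda_="log-likelihood")

print(f"Pearson: chi2 = {chi2:.2f}, p = {p_pearson:.3f}")  # ~3.83, p ~ .050 (just above)
print(f"G-test:  G    = {g:.2f}, p = {p_g:.3f}")           # ~3.97, p ~ .046 (just below)
```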

  4. But then he would have to have actually gotten the $800,000 grant (as a graduate student!) that he would have needed to run the massive survey he claimed he ran. That was the real reason he couldn’t just use real data and fudge on the margins with that data like you suggest — he didn’t have the money to even do the data collection to begin with, so there was never going to be data to p-hack.

    One of the most perplexing things about all of this is how a PhD student could walk into an adviser’s office and say “Hey, I won an $800,000 grant for my dissertation research” and the adviser doesn’t look into it, or find it unusual enough to discover that it’s a fictitious grant from an imaginary institution. How detached from your students (and from how the world works) do you have to be to say “Great work, Michael!” when you hear about this grant? It’s completely off the top end of the scale of what is remotely plausible, even in anyone’s wildest dreams, for a poli sci graduate student to have landed, and yet he seems to have gotten everyone to believe the grant was real for years before he was caught.

        • How much of the Human Brain Project is devoted to trying to simulate the human brain — what most of its press seems to be about — and how much to other neuroscience? About the simulation goal: Is there a good writeup somewhere that makes a case for why this isn’t totally ridiculous? I’m not a neuroscientist, but I am a biophysicist, and this seems on the face of it so absurd that I can’t help but think it’s a $1b scam. I really would like to be persuaded otherwise!

        • I would say that most of the HBP is about simulating _parts_ of the brain (which is not the same thing as simulating the entire human brain). Over the first couple of years, the HBP focused on developing platforms that would allow researchers to do new kinds of simulations and bring together different types of information. There is a lot of neuroscientific data that gets published but is never put to use, and oftentimes different data sets seem to contradict each other. These problems make it very difficult for modelers to resolve anything about how the brain functions. If neuroscience is ever going to succeed (maybe eventually to simulate the entire human brain), there has to be a way to pull together various data sets, evaluate them, and quantify the findings in a way to support model building and testing. That’s the overarching goal of the HBP.

          A reasonable introduction to what has been done can be found at the Platform release web page:

          https://www.humanbrainproject.eu/platform-release1

          If I were the leader of the HBP, I would not have told people that we were going to simulate the entire human brain (but no one is asking me to lead a $1 billion project!). I think it sets up the project for misunderstandings and ridicule; but what is actually being done is much more interesting. The HBP might fail for lots of reasons, but I think it is a good faith effort to pull together different types of information, interpret them with quantitative models, and thereby guide future empirical studies.

        • What’s most significant about the HBP is that it demonstrates, in a very dramatic and expensive manner, that the peer-review system, at least where funding decisions are concerned, has no meaning. I would love to know the names of the scientists who approved this project given its promises.

        • One implication of what you write, Greg, is that it’s OK that this guy won the money for his ridiculous project, because research got done that made sense (I haven’t looked into that, I don’t know if it does). So I guess it’s OK to con agencies into giving you money, as long as you use it sensibly?

        • Hi Greg, it seems that some 800 scientists wrote an open letter complaining about the HBP. See: http://www.neurofuture.eu/

          For example, they say: “We strongly question whether the goals and implementation of the HBP are adequate to form the nucleus of the collaborative effort in Europe that will further our understanding of the brain.”

          The HBP was apparently reconstituted after this criticism, and this guy was, as far as I understood, removed from his leadership role (is this right?). So maybe the HBP is doing sensible things *now*. I am talking about when and how it was funded. I’m curious as to how it happened that the original proposal won 1.6 billion Euros.
