My final post on this Tony Blair thing

Gur Huberman writes on the recent fraud in experiments in polisci:

This comment is a reaction to the little of the discussion which I [Gur] followed, mostly in the NYTimes.
What I didn’t see anybody say is that the system actually worked.

First, there’s a peer-reviewed report in Science.
Then other people deem the results (or, rather, the methodology) sufficiently powerful that they imitate it to answer a related question of their own.
These other people’s apparently same method fails to produce a similar response rate.
These other people inform the senior author of the original study that their response rate was far lower than the one he had reported in Science.
The senior author requests an explanation from his partner who actually was in touch with the data collecting firm and was in possession of the raw data.
The senior author fails to receive an adequate and timely explanation from his partner.
The senior author requests that Science retract the article.

Only a few months elapsed between publication and retraction.

The interesting challenge to a statistician: How often does the system fail, i.e., make type 1 errors by accepting fraudulent results into a discipline’s mainstream? It’s worse when these results imply consequential actions.

My first reaction was: Hey, to say “the system worked” here is like saying that if someone robs a bank, then gets caught 6 months later, then the bank security system worked. No it didn’t!

But then I thought more, and it’s not so clear. I don’t think “the system worked” but the story is a bit more complicated.

It goes like this:

The first point is the rarity of the offense. As I posted earlier, it’s been nearly 20 years since the last time there was a high-profile report of a social science survey which turned out to be undocumented. I’m referring to the case of John Lott, who said he did a survey on gun use in 1997, but, in the words of Wikipedia, “was unable to produce the data, or any records showing that the survey had been undertaken.” Lott, like LaCour nearly two decades later, mounted an aggressive, if not particularly convincing, defense.

Anyway, the point is that this just about never happens. And so, if the only concern were faked data in social science, I’d agree with Gur that the system is working fine: only 2 high-profile cases that we know about in 20 years, and both were caught. Sure, there’s selection here, there must be other fraudulent surveys that have never been detected—but it’s hard for me to imagine there are a lot of these out there. So, fine, all’s ok.

But faked data and other forms of outright fraud are not the only, or even the most important, concern here. The real problem is all the shoddy studies that nobody would ever think to retract, because the researchers aren’t violating any rules, they’re just doing useless work. I’m thinking of the ESP study and the beauty-and-sex ratio study and the ovulation and voting study and the himmicanes and hurricanes study and the air pollution in China study, and all the rest.

This is what happens in all these cases:

1. An exciting, counterintuitive claim is published in a good journal, sometimes a top journal, supported by what appears to be strong statistical evidence (one or more “p less than .05” comparisons).

2. The finding is publicized, often in leading news outlets, and often uncritically.

3. Skeptics note problems with the study.

4. The authors dig in and refuse to admit anything was wrong.

The result is a mess. And this even happens outside the scientific literature, when high-profile columnists such as David Brooks post made-up statistics and refuse to issue corrections.

Now, let me be clear here: I’m not suggesting that all these papers be retracted by the journals that published them. Retraction is a crude tool, and it’s just too difficult to do at scale. Post-publication review would be better.

The real point, though, is that there’s not much reason to trust the social science papers that come out in the tabloids (Science, Nature, PPNAS) or that get featured by the New York Times or NPR or the British Psychological Society or other generally-respected outlets. And that’s a problem. I think one reason for all the attention received by this Tony Blair study was that it’s the most extreme case of a general problem of claims being published without real evidence. Those ovulation-and-clothing researchers and the fat-arms-and-voting researchers didn’t make up their data, but they were making strong claims without good evidence, even while thinking they had good evidence. That’s the way of a lot of today’s published science, and publicized science. And I think that this was one reason the story of “Bruno” LaCour resonated so strongly.


  1. Anon says:

    Is there not only one P in PNAS?

  2. D.O. says:

    And why Tony Blair? Because he made up a story of Nigerian yellowcake?

  3. Steve Sailer says:

    The bigger problem is weak interpretation of pretty good data. For example, Harvard superstar economist Raj Chetty got his hands on millions of IRS records of your income tax returns to discover what parts of America have the right policies and cultures to boost income mobility. It’s an amazing dataset and the New York Times has been promoting his findings heavily since 2013. But Chetty has struggled to find anything that NYT subscribers would be happy to read about. He’s come up with “sprawl” and “segregation” but those are pretty tendentious interpretations. And he’s overlooked all the methodological problems, such as local booms and busts that heavily influence his results.

    I wrote an in-depth analysis of what he’s doing right and how he could improve the many things he’s doing wrong for Taki’s Magazine:

  4. Jeremy says:

    Not sure where the boundaries between disciplines are, but does Stapel not count?

  5. Anon says:

    I don’t think throwing Greenstone et al’s China air pollution study into the mix is fair here:

    “they’re just doing useless work. I’m thinking of […] the air pollution in China study,”

    • Andrew says:


      Air pollution in China is important. But that particular study seems pretty useless to me, except as a demonstration that the available data are not sufficient to draw any useful conclusions. See my paper with Zelizer for further discussion of that example.
