
It’s ok to criticize

I got a little bit of pushback on my recent post, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant: Education edition”—some commenters felt I was being too hard on the research paper I was discussing, because the research wasn’t all that bad, and the conclusions weren’t clearly wrong, and the authors didn’t hype up their claims.

What I’d like to say is that it is OK to criticize a paper, even if it isn’t horrible.

We’ve talked on this blog about some papers that are just terrible (himmicanes) or that are well-intentioned but obviously wrong (air pollution in China) or with analyses that are so bad as to be uninterpretable (air rage) or which have sample sizes too small and data too noisy to possibly support any useful conclusions (beautiful parents, ovulation and voting) or which are hyped out of proportion to whatever they might be finding (gay gene tabloid hype) or which are nothing but a garden of forking paths (Bible Code, ESP). And it’s fine to blow these papers out of the water.

But it’s also fine to present measured criticisms of research papers that have some value but also have some flaws. And that’s what I did in that earlier post. As I wrote at the time:

Just to be clear, I’m not trying to “shoot down” this research article nor am I trying to “debunk” the news report. I think it’s great for people to do this sort of study, and to report on it. It’s because I care about the topic that I’m particularly bothered when they start overinterpreting the data and drawing strong conclusions from noise.

If my goal were to make a series of airtight cases, destroying published paper after published paper, then, yes, it would make sense to concentrate my fire on the worst of the worst. But that’s not what it’s all about. “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant” is a real error, it’s an important error, and it’s a frequent error—and I think it’s valuable to point out this error in the context of a paper that’s not trash, reported on in a newspaper that’s not prone to sensationalism.

So, you commenters who told me I was being too harsh: What you’re really saying is that methods criticisms should be reserved for papers that are terrible. But I disagree. I say it can be helpful to criticize some of the reasoning of a paper on methodological grounds, even while other aspects of the paper are fine.

Criticism and review are all about advancing science, not about making airtight cases against particular papers or particular bodies of work.


  1. Garnett says:

    I see these critiques as case-studies in critical thinking. I am personally less interested in the specifics of the papers, and see them more as entertaining platforms for critical thinking. The value of these critiques (and the comments!) is developing better ways of evaluating my own work.

    • leoboiko says:

      That’s exactly the reason why I follow this blog.

      • Rahul says:


        I hardly care about power poses & himmicanes. But the exercise in logic & modelling transfers well to other areas.

        I’ve found that commenting on (good) blogs is great training in making good arguments. People will just rip apart flaws.

        It’d be such an awesome world if only the average academic paper communicated with the clarity, brevity and sharpness of the typical comment on Andrew’s blog!

  2. Erikson says:

    Whenever people complain about criticism of research papers, I remember a quote from Popper’s ‘Logic of Scientific Discovery’:

    “Whenever we propose a solution to a problem, we ought to try as hard as we can to overthrow our solution, rather than defend it. Few of us, unfortunately, practise this precept; but other people, fortunately, will supply the criticism for us if we fail to supply it ourselves.”

  3. Ibn says:

    Actually, a critique of a half-good paper is typically more useful than the shooting down of a really bad one. The latter type doesn’t often lead non-clown people astray, but the former certainly does.

  4. Thank you for this post. I am reminded of Richard Hofstadter’s words on the nature of the intellectual life (on p. 30 of “Anti-Intellectualism in American Life”):

    “Whatever the intellectual is too certain of, if he is healthily playful, he begins to find unsatisfactory. The meaning of his intellectual life lies not in the possession of truth but in the quest for new uncertainties.”

    He goes on to explain what he means by play; it is not opposed to seriousness or even practicality.

    I suspect that just about every paper has uncertainties, which provide material for questioning, criticism, and play, which in turn help illuminate the topic at hand.

    • Martha (Smith) says:

      “I suspect that just about every paper has uncertainties”

      I’m fond of saying, “If it involves statistical inference, it involves uncertainty.” It’s admittedly a tautology, but one that a lot of people who use statistical inference don’t realize is a tautology.

  5. Bill Harris says:

    When reading this and the comments, I’m reminded of Guenter Grass’ Aus dem Tagebuch einer Schnecke (From the Diary of a Snail), which has a thread of doubt and skepticism running throughout (a main character is nicknamed “Doubt”). As I recall, Grass uses the snail as a metaphor for progress: slow, persistent, and leaving behind a trail.

    It also reminds me of John Sterman’s “A Skeptic’s Guide to Computer Models.” Sterman’s article seems a bit like an expository form of the statistical lexicon, focused on simulation models rather than regression models.

  6. Chris Auld says:

    The paper says:

    “Table 6 explores whether treatment effects vary by subgroups by conditioning the sample
    based on gender, race, baseline GPA, ACT scores, and predicted scores. Although differential
    treatment effects by subgroup are generally not statistically distinguishable, Table 6 does reveal
    some interesting differences… There is also modest evidence that permitting computers is
    most harmful to students with relatively strong baseline academic performance.”

    And after further discussion of this aspect of their analysis, the authors note:

    “Still, the point estimates in all three columns of panels B, C, and D are statistically indistinguishable, so these could be
    chance findings.”

    It seems disingenuous to write a blog post condescendingly explaining to the authors that their interaction effect is not statistically significant when the authors repeatedly and explicitly note exactly that in the paper.

    • Andrew says:


      On the specifics, two things. First, when I wrote, “Nonononononono . . . The difference between “significant” and “not significant” is not itself statistically significant,” I was responding to a remark in the news article, not in the research paper. Second, when I did address the research article, I argued that they are overinterpreting noise. The authors pointed to two numbers that were “nearly identical” and called this “particularly surprising,” and I pointed out that such a pattern is not much of a surprise at all—you can get two numbers very close to each other, just by chance.

      Finally, “condescending” is in the eye of the beholder. I’m a professional statistician who’s published several papers on the interpretation of statistical results, statistical significance, etc., so I don’t see why anyone needs to feel offended or condescended to if I offer some advice.

  7. Chris Auld says:

    Andrew, if your criticism is of the manner in which the Washington Post blogger chose to discuss the results rather than the actual paper, you should make that clear.

    On the point in the actual article: I would interpret the authors’ description of the estimated effects across the two treatment arms as “particularly surprising” as follows: if we hypothetically assume that at the individual level the effects of the two treatments are identical and that there are no “spillover” effects, then effects in classrooms in which 80% of students were treated should on average be twice as large as in classrooms in which 40% were treated. In the event, we get estimates of about -0.18 for either treatment, with standard errors of about 0.07. One way of formally interpreting the informal discussion point is to calculate the t-statistic against the null that one effect is actually twice as large as the other, which yields a t-statistic of about 1.8. Now, is a t-statistic of 1.8 “particularly surprising” if we hypothetically expected the null to be true? That may be overly strong language, but it seems a relatively minor lapse in wording to spend two entire blog posts jumping all over.
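    The arithmetic in the comment above can be reconstructed as follows. The estimates (-0.18), standard errors (0.07), and the null that one effect is twice the other come from the comment; the independence assumption and the exact formula are my guesses, since the comment doesn't spell them out, so this is one plausible reading, not a definitive one:

    ```python
    import math

    # Point estimates and standard errors quoted in the comment, one per treatment arm.
    b_40, se_40 = -0.18, 0.07   # effect when 40% of the classroom is treated
    b_80, se_80 = -0.18, 0.07   # effect when 80% of the classroom is treated

    # Null hypothesis from the comment: no spillovers and identical individual-level
    # effects imply the 80% effect should be twice the 40% effect, H0: b_80 - 2*b_40 = 0.
    diff = b_80 - 2 * b_40      # 0.18

    # Properly scaled test (assuming independent estimates): the factor of 2 on b_40
    # multiplies its variance by 4.
    t_scaled = diff / math.sqrt(se_80**2 + (2 * se_40)**2)   # about 1.15

    # A cruder version that combines the two standard errors without the scaling
    # factor reproduces the "about 1.8" figure reported in the comment.
    t_simple = diff / math.sqrt(se_80**2 + se_40**2)         # about 1.82
    ```

    Either way, a t-statistic in the 1.1–1.8 range is well within what one could see by chance, which is the substance of the disagreement being argued here.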

    • Andrew says:


      I think it was pretty clear that I was criticizing both the news article and the research article. In my post, I tagged the quotes and excerpts with the phrases, “In a news article,” “Let’s read on some more,” “Here’s the article . . . here’s the relevant table,” “Now back to the news article,” and “The research article also had this finding.” No ambiguity at all. Which is perhaps why nobody other than you found this at all confusing.

      On the details, my issue is not that anyone had a “lapse in wording”; rather, my concern is that the news reporter, and to a lesser extent the authors of the research article, were overinterpreting their data. I think they were doing what lots and lots of researchers do, which is to not appropriately account for variation (thus, for example, being surprised by results being “nearly identical” without recognizing that surprising similarity, as well as surprising difference, can be explained by noise).

      As I wrote in my post: “I’m not trying to ‘shoot down’ this research article nor am I trying to ‘debunk’ the news report.”

      I really really really really really don’t like the attitude by which presenting a statistical criticism is considered “jumping all over.” I’m not jumping on anybody. I’m sharing a criticism as part of my mission of public education. (My job contains teaching, research, and service components; this falls into the “service” category, also teaching; it also has some research value, in that it is through many such observations and discussions that my colleagues and I come up with new methods such as here.)

      I think the authors of this paper were overinterpreting chance fluctuations; you don’t seem to think so. This comments section is a good place to air such disagreements. But this is not about lapses in wording, it’s about the real steps that people take when trying to understand and interpret their data. This is a big deal, and as I wrote in my post above, I think it is valuable to have such discussions in the context of papers which are not terrible.

      These are serious researchers studying real problems to help real people. We’re not talking Daryl Bem or power pose here. But they’re still using statistics, and statistics is hard, and we can help them in various ways, including helping them get more out of their data and helping them avoid overinterpreting their data. For me to want to help here—and not just these researchers but the thousands of others who read this blog—is not “condescending” or “disingenuous” or “jumping all over.” It’s just scientific discussion.

  8. Ken says:

    If things that are wrong aren’t criticised, how will they change? I’ve seen similar examples to this. If they had used a regression model rather than three groups, it might have shown more.

    This analysis is similar to a fallacy that has hopefully been eradicated, although occasionally students will still have trouble with it (hopefully not ones planning on becoming statisticians). Take treatments A and B and test each for change from baseline: A shows no significant change but B does, so one incorrectly concludes that A and B are significantly different.
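    The fallacy Ken describes is easy to demonstrate with a small simulation (a hypothetical sketch; the sample size, true effect, and noise level are made up for illustration and have nothing to do with the paper under discussion). Two treatments have the same true change from baseline, yet sampling noise routinely pushes one arm over the p < 0.05 line while the other falls short, even though the direct comparison between arms shows nothing:

    ```python
    import math
    import random

    random.seed(1)

    def mean_se(xs):
        """Sample mean and its standard error."""
        n = len(xs)
        m = sum(xs) / n
        var = sum((x - m) ** 2 for x in xs) / (n - 1)
        return m, math.sqrt(var / n)

    def one_trial(n=25, true_effect=0.3, sd=1.0):
        """Simulate changes from baseline for treatments A and B (same true effect)."""
        a = [random.gauss(true_effect, sd) for _ in range(n)]
        b = [random.gauss(true_effect, sd) for _ in range(n)]
        ma, sea = mean_se(a)
        mb, seb = mean_se(b)
        sig_a = abs(ma / sea) > 1.96                              # "A changed from baseline"
        sig_b = abs(mb / seb) > 1.96                              # "B changed from baseline"
        sig_diff = abs((ma - mb) / math.sqrt(sea**2 + seb**2)) > 1.96  # the test that matters
        return sig_a, sig_b, sig_diff

    trials = 10_000
    discordant = 0  # one arm "significant", the other not, yet no real A-vs-B difference shown
    for _ in range(trials):
        sig_a, sig_b, sig_diff = one_trial()
        if sig_a != sig_b and not sig_diff:
            discordant += 1

    print(f"{discordant / trials:.0%} of trials invite the fallacy")
    ```

    With these (arbitrary) settings, something like four trials in ten produce exactly the misleading pattern, which is why the direct comparison between arms, not the pair of separate baseline tests, is the right analysis.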
