
The connection between varying treatment effects and the well-known optimism of published research findings

Jacob Hartog writes:

I thought this article [by Hunt Allcott and Sendhil Mullainathan], although already a couple of years old, fits very well into the themes of your blog—in particular the idea that the “true” treatment effect is likely to vary a lot depending on all kinds of factors that we can and cannot observe, and that especially large estimated effects are likely telling us as much about the sample as about the Secrets of the Social World.

They find that sites that choose to participate in randomized controlled trials are selected on characteristics correlated with the estimated treatment effect, and they have some ideas about “suggestive tests of external validity.”

I’d be curious about where you agree and disagree with their approach.

I pointed this to Avi, who wrote:

I’m actually a big fan of this paper (and of Hunt and Sendhil). Rather than look at the original NBER paper, however, I’d point you to Hunt’s recent revision, which looks at 111 experiments (!!) rather than the 14 experiments analyzed in the first study.

In particular, Hunt uses the first 10 experiments they conducted to predict the results of the next 101, finding that the predicted effect is significantly larger than the observed effect in those remaining trials.

Good stuff. I haven’t had the time to look at any of this, actually, but it does all seem both relevant to our discussions and important more generally. It’s good to see economists getting into this game. The questions they’re looking at are similar to issues of Type S and Type M error that are being discussed in psychology research, and I feel that, more broadly, we’re seeing a unification of models of the scientific process, going beyond the traditional “p less than .05” model of discovery. I’m feeling really good about all this.
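To make the site-selection point concrete, here is a minimal simulation sketch. It is not Hunt's analysis and uses none of his data. Everything in it is an assumption for illustration: the "readiness" score, the slope linking readiness to the treatment effect, the noise level, and the function names. The only numbers taken from the discussion above are the 10 early sites and the 111 total. The idea is simply that if the sites that sign up first are the ones where the program works best, then the average from the early sites will overpredict what you see later.

```python
import random
from statistics import mean


def simulate_site_effects(n_sites=111, slope=1.0, noise_sd=0.5, seed=0):
    """Return hypothetical true site effects, ordered by when each site joins.

    Assumption (not from the paper): sites with a higher 'readiness' score
    adopt earlier, and readiness is positively related to the site's effect.
    """
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        readiness = rng.gauss(0.0, 1.0)
        effect = 1.0 + slope * readiness + rng.gauss(0.0, noise_sd)
        sites.append((readiness, effect))
    # The most-ready sites run their experiments first.
    sites.sort(key=lambda site: site[0], reverse=True)
    return [effect for _, effect in sites]


def extrapolation_gap(effects, n_early=10):
    """Mean effect in the first n_early sites minus the mean in the rest."""
    if not 0 < n_early < len(effects):
        raise ValueError("need at least one early site and one later site")
    return mean(effects[:n_early]) - mean(effects[n_early:])


if __name__ == "__main__":
    effects = simulate_site_effects()
    print("mean effect, first 10 sites:", round(mean(effects[:10]), 2))
    print("mean effect, remaining sites:", round(mean(effects[10:]), 2))
    print("gap (early minus later):", round(extrapolation_gap(effects), 2))
```

Setting slope=0.0 removes the link between readiness and effect, so early sites are no longer systematically better than later ones and the gap should hover around zero. That is the comparison to keep in mind when reading a big early estimate.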


  1. numeric says:

    You know, I think you’re going to have to start putting trigger warnings on your posts–I mean, they’re offensive to psychologists, plagiarists, frequentists, and Bayesians alike. I find them much more upsetting than Ovid.

    • Christoph N. says:

      I’m a psychologist. Why would they be upsetting to me?

      • Christoph N. says:

        Offensive, I mean.

      • numeric says:

        Gee, I dunno. Maybe because every third post is about publications in psychology journals that explain bogus statistical techniques and the failure of psychology journals to run critiques. You sound as if you can compartmentalize your own individual career from that of your profession, which sounds to me as if you are suffering from lack of integration. Of course, if you’re a Jungian, individuation might be seen as a positive (though it’s usually applied to differentiation from the unconscious, not from a profession).

  2. Elin says:

    Wowy. I thought the original paper was neat, but the newer one is really something. To me it’s not so much about the optimism of published studies, though of course that’s important, but it’s really the explanation for why scale-up always either disappoints or fails completely. Usually we might have a story about how the early sites had so much more attention from the researchers and implementers or that there was real excitement about the program or some other vague narrative. But this is saying you can model it and it’s predictable (maybe) that early sites will be the ones where success is most likely to happen.
    Does that mean the programs are “really” failures and the local estimates are “really” too high, or is it horses for courses? This seems to say the latter. So I think the conclusion that we should randomly select sites is good in part because there may be meaningful effects in some kinds of contexts and figuring out if the contextual effects are real is hard. That is, for example, maybe it makes sense for some big schools to be broken down into small schools, but that doesn’t necessarily mean it makes sense for all big schools to be broken into small schools. On the other hand, maybe it’s better to take the early wins from the biased sites and, as a policy implementer rather than a data analyst, scale up to the most similar places and not go into places where failure is likely without small-scale testing first. It’s a hierarchical world.

  3. zbicyclist says:

    I have not YET read the 82 page paper, although it’s clearly worth reading.

    I have two possibly relevant comments.

    (1) In test marketing, the old rule of thumb is that performance when the product is rolled out is likely to be 15% lower than in the test market [substantial variability around that, of course], even after all covariates have been adjusted. I’ll have to see how close Hunt is to that number, although it may be hard to make the translation.

    (2) There’s a field of forecasting called Reference Class Forecasting. The idea here is to look at how much bias there was in similar estimates: e.g., on road projects, what’s the typical cost overrun; on bridge projects, what’s the typical overrun; etc. The name most closely associated with this is Flyvbjerg. Note this also is a bias issue (those doing the estimation want the project to be approved), with a bit of black swan thrown in.

    Of course, what’s hard here is to avoid this problem. It’s all very well to say that we’ll test in ‘typical’ areas, but you also need to execute the test, which is difficult to do unless the ‘typical’ areas have at least some enthusiasm for the intervention.

    And as Hunt notes on page 2, “RCTs often require highly-capable implementing partners, [so] the set of actual RCT partners may” [i.e. almost certainly do] “have more effective programs than the average potential partner,” because the less capable have to be thrown out of the program, creating bias even if the less capable are thrown out of both test and control.

    • Elin says:

      I think this last part is probably very true. Even more so when you look at a lot of programs where there has to be a grant application to fund the initial program and evaluation: of course an agency is going to propose a program that makes sense for them.
