“She also observed that results from smaller studies conducted by NGOs – often pilot studies – would often look promising. But when governments tried to implement scaled-up versions of those programs, their performance would drop considerably.”

Posted on November 22, 2018 9:52 AM by Andrew

Robert Wiblin writes:

If we have a study on the impact of a social program in a particular place and time, how confident can we be that we’ll get a similar result if we study the same program again somewhere else?

Dr Eva Vivalt . . . compiled a huge database of impact evaluations in global development – including 15,024 estimates from 635 papers across 20 types of intervention – to help answer this question.

Her finding: not confident at all.

The typical study result differs from the average effect found in similar studies so far by almost 100%. That is to say, if all existing studies of an education program find that it improves test scores by 0.5 standard deviations – the next result is as likely to be negative or greater than 1 standard deviation, as it is to be between 0-1 standard deviations.

She also observed that results from smaller studies conducted by NGOs – often pilot studies – would often look promising. But when governments tried to implement scaled-up versions of those programs, their performance would drop considerably.
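To make the arithmetic in that 0.5-standard-deviation example concrete, here is a minimal sketch. It assumes that study results are normally distributed around the running average, which is an assumption made for illustration and not something the quoted passage states. The function names are hypothetical.

```python
from statistics import NormalDist

def sd_for_median_abs_dev(median_abs_dev):
    """Between-study sd such that the median |result - mean| equals median_abs_dev.

    Assumes a normal distribution, where median |X - mean| = z_0.75 * sd.
    """
    return median_abs_dev / NormalDist().inv_cdf(0.75)

def prob_between(mean, sd, lo, hi):
    """Probability that a normal(mean, sd) draw lands in [lo, hi]."""
    dist = NormalDist(mean, sd)
    return dist.cdf(hi) - dist.cdf(lo)

mean_effect = 0.5                         # average effect in existing studies
sd = sd_for_median_abs_dev(mean_effect)   # typical deviation = 100% of the mean
p_inside = prob_between(mean_effect, sd, 0.0, 1.0)
p_negative = NormalDist(mean_effect, sd).cdf(0.0)

print(f"implied between-study sd: {sd:.2f}")
print(f"P(0 <= next result <= 1): {p_inside:.2f}")
print(f"P(next result < 0):       {p_negative:.2f}")
```

Under the normality assumption, a 50/50 split between "inside 0 to 1" and "outside" implies a between-study standard deviation of roughly 0.74, which is larger than the average effect itself.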

Wiblin continues:

For researchers hoping to figure out what works and then take those programs global, these failures of generalizability and ‘external validity’ should be disconcerting.

Is ‘evidence-based development’ writing a cheque its methodology can’t cash?

Should we invest more in collecting evidence to try to get reliable results?

Or, as some critics say, is interest in impact evaluation distracting us from more important issues, like national economic reforms that can’t be tested in randomised controlled trials?

Wiblin also points to this article by Mary Ann Bates and Rachel Glennerster, who argue that “rigorous impact evaluations tell us a lot about the world, not just the particular contexts in which they are conducted” and write:

If researchers and policy makers continue to view results of impact evaluations as a black box and fail to focus on mechanisms, the movement toward evidence-based policy making will fall far short of its potential for improving people’s lives.

I agree with this quote from Bates and Glennerster, and I think the whole push-a-button, take-a-pill, black-box attitude toward causal inference has been a disastrous mistake. I feel particularly bad about this, given that econometrics and statistics textbooks, including my own, have been pushing this view for decades.

Stepping back a bit, I agree with Vivalt that, if we want to get a sense of what policies to enact, it can be a mistake to base these decisions on the results of little experiments. There’s nothing wrong with trying to learn from demonstration studies (as here), but generally I think realism is more important than randomization. And, when effects are highly variable and measurements are noisy, you can’t learn much even from clean experiments.
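Here is a rough simulation of how variable effects and noisy measurements could produce the pilot-versus-scale-up drop described in the title. It is a sketch under made-up assumptions, not an analysis of Vivalt's data: every number (average effect, between-site variation, noise level, and the "promising" threshold) is hypothetical, and the rule that only pilots above the threshold get scaled up is a simplification.

```python
import random
from statistics import mean

AVG_EFFECT = 0.1   # hypothetical average true effect across all sites
EFFECT_SD = 0.2    # hypothetical variation in true effects between sites
NOISE_SD = 0.3     # hypothetical measurement noise in a small pilot
THRESHOLD = 0.5    # hypothetical cutoff for a pilot to "look promising"

def simulate_pilots(n_sites=20000, seed=1):
    """Run many small pilots and keep only those that look promising."""
    rng = random.Random(seed)
    estimates, truths = [], []
    for _ in range(n_sites):
        true_effect = rng.gauss(AVG_EFFECT, EFFECT_SD)
        estimate = true_effect + rng.gauss(0.0, NOISE_SD)
        if estimate > THRESHOLD:          # selection on the noisy estimate
            estimates.append(estimate)
            truths.append(true_effect)
    return mean(estimates), mean(truths), len(estimates)

pilot_mean, truth_selected, n_selected = simulate_pilots()

print(f"promising pilots:                   {n_selected}")
print(f"average estimate in those pilots:   {pilot_mean:.2f}")
print(f"average true effect at those sites: {truth_selected:.2f}")
print(f"average true effect at all sites:   {AVG_EFFECT:.2f}")
```

In this setup the promising pilots overstate even their own sites' true effects, and a program scaled up to typical sites would be expected to deliver something closer to the overall average. No one has to do anything wrong for the drop to appear; selection on noisy estimates is enough.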

7 thoughts on ““She also observed that results from smaller studies conducted by NGOs – often pilot studies – would often look promising. But when governments tried to implement scaled-up versions of those programs, their performance would drop considerably.””

Martha (Smith) on November 22, 2018 at 11:07 pm said:

This seems related to the discussion here: http://statmodeling.stat.columbia.edu/2018/09/06/gaps-1-2-3-just-large/ (similar design problems, different field of application).

- Martha (Smith) on November 22, 2018 at 11:22 pm said:
  
  See also https://journals.sagepub.com/stoken/rbtfl/hixxiPxVRpaxg/full for questions of measurement that may be relevant.
  
Richard McElreath on November 23, 2018 at 4:58 am said:

Sounds a lot like Angus Deaton’s 2010 “Instruments, randomization, and learning about development”. Deaton seems to have said some uncareful things about randomization at times, but I think his point about focus on mechanism is a solid one.

https://scholar.princeton.edu/deaton/publications/instruments-randomization-and-learning-about-development

Rahul on November 23, 2018 at 11:42 am said:

Isn’t that a version of the regression to the mean effect combined with the file drawer effect? That is, when NGOs conduct smaller studies, some (naturally) encounter pockets of high performance, whether by chance or by cherry-picking the site, context, or cohort.

Other smaller studies that don’t discover great performance never get popularized (or shift the goalposts till they do).

Now, try scaling all this up to a large study & the performance just regresses to the mean.

yyw on November 23, 2018 at 12:16 pm said:

Incompetence of researchers, biases, potential differences between motivated NGO staff and more indifferent government workers, etc. probably all contribute to this phenomenon. Policy makers need to be a lot more skeptical. Analyze the mechanism, refine, validate, and scale up gradually after careful cost-benefit analysis.

- Angus Reynolds on November 26, 2018 at 8:47 pm said:
  
  I remember trying to talk to my sister-in-law, who works for an NGO doing research. She immediately leapt to the opposite claim: that it was the bloody policy makers not implementing the recommendations properly.
  
ndemi on November 28, 2018 at 5:12 am said:

Studies of programs will always differ from place to place because of cultural beliefs, among other factors.


Statistical Modeling, Causal Inference, and Social Science
