The Millennium Villages Project: a retrospective, observational, endline evaluation

Shira Mitchell et al. write (preprint version here if that link doesn’t work):

The Millennium Villages Project (MVP) was a 10 year, multisector, rural development project, initiated in 2005, operating across ten sites in ten sub-Saharan African countries to achieve the Millennium Development Goals (MDGs). . . .

In this endline evaluation of the MVP, we retrospectively selected comparison villages that best matched the project villages on possible confounding variables. . . . we estimated project impacts as differences in outcomes between the project and comparison villages; target attainment as differences between project outcomes and prespecified targets; and on-site spending as expenditures reported by communities, donors, governments, and the project. . . .

Averaged across the ten project sites, we found that impact estimates for 30 of 40 outcomes were significant (95% uncertainty intervals [UIs] for these outcomes excluded zero) and favoured the project villages. In particular, substantial effects were seen in agriculture and health, in which some outcomes were roughly one SD better in the project villages than in the comparison villages. The project was estimated to have no significant impact on the consumption-based measures of poverty, but a significant favourable impact on an index of asset ownership. Impacts on nutrition and education outcomes were often inconclusive (95% UIs included zero). Averaging across outcomes within categories, the project had significant favourable impacts on agriculture, nutrition, education, child health, maternal health, HIV and malaria, and water and sanitation. A third of the targets were met in the project sites. . . .

It took us three years to do this retrospective evaluation, from designing sampling plans and gathering background data to designing the comparisons and performing the statistical analysis.

At the very beginning of the project, we made it clear that our goal was not to find “statistically significant” effects, that we’d do our best and report what we found. Unfortunately, some of the results in the paper are summarized by statistical significance. You can’t fight City Hall. But we tried our best to minimize such statements.

In the design stage we did lots and lots of fake-data simulation to get a sense of what we might expect to see. We consciously tried to avoid the usual plan of gathering data, flying blind, and hoping for good results.
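For readers who haven't done this before, here is a minimal sketch of the general idea of fake-data simulation at the design stage. It is not the simulation used for the paper: the matched-pair setup, the function names, and every number below (effect size, number of pairs, noise levels) are hypothetical choices made only for illustration. The point is to pick a plausible true effect, generate fake data under the planned design, run the planned analysis, and see how the estimates behave before any real data come in.

```python
import random
import statistics


def simulate_once(true_effect, n_pairs, pair_sd, noise_sd, rng):
    """Generate one fake dataset of matched village pairs and estimate the effect.

    All quantities are in hypothetical standardized (SD) units.
    """
    diffs = []
    for _ in range(n_pairs):
        baseline = rng.gauss(0.0, pair_sd)  # level shared by both villages in a pair
        comparison = baseline + rng.gauss(0.0, noise_sd)
        project = baseline + true_effect + rng.gauss(0.0, noise_sd)
        diffs.append(project - comparison)
    estimate = statistics.mean(diffs)
    std_error = statistics.stdev(diffs) / n_pairs ** 0.5
    return estimate, std_error


def simulate_design(true_effect, n_pairs=10, pair_sd=1.0, noise_sd=0.5,
                    n_sims=2000, seed=1):
    """Repeat the fake-data experiment and summarize how the estimates behave."""
    rng = random.Random(seed)
    estimates = []
    covered = 0
    for _ in range(n_sims):
        est, se = simulate_once(true_effect, n_pairs, pair_sd, noise_sd, rng)
        estimates.append(est)
        # Rough interval: estimate +/- 2 standard errors (an approximation,
        # not the uncertainty intervals reported in the paper).
        if abs(est - true_effect) <= 2 * se:
            covered += 1
    return {
        "mean_estimate": statistics.mean(estimates),
        "sd_estimate": statistics.stdev(estimates),
        "coverage": covered / n_sims,
    }


if __name__ == "__main__":
    # Try a few assumed true effects to see what the design could plausibly detect.
    for effect in (0.0, 0.25, 0.5):
        print(effect, simulate_design(effect))
```

Comparing the spread of the simulated estimates with the effect sizes you consider plausible gives a sense, in advance, of what the study can and cannot tell you.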

You can read the article for the full story. Also, published in the same issue of the journal:

The perspective of Jeff Sachs, leader of the Millennium Villages Project,

An outside evaluation of our evaluation, from Eran Bendavid.


  1. jrc says:

    Hey look at that – from pre-analysis review (of post-intervention analysis planning) all the way to publication!!!!1!

    For those of you interested in the background, there’s this (sure, Nature is a tabloid, but you can probably believe this one):

    “In a paper published online in The Lancet last month, the [MVP] project claimed a significant milestone. It reported that after three years of interventions, child mortality was decreasing three times faster in the project’s villages than in the host nations in general. But the analysis was criticized for underestimating nationwide improvements in child mortality, and overestimating those in the Millennium Villages…

    The MVP’s founder, Jeffrey Sachs, head of the Earth Institute at Columbia and a co-author of the partially retracted paper, says that the MVP research teams were too autonomous, and he regrets not having brought in external advisers earlier.”

    …hehehe “too autonomous”. Anyway… even if research design trumps statistics every time in the world of causal inference, it is nice that we are finally getting some useful information out of all that money spent on these projects. Just think how much we could’ve learned with a little bit more effort towards evaluation design on the front end.

    Now that my snarking is done, I’ll go read this thing….

  2. Eric says:

    Andrew, nice work given the limitations. Nice plots as well.

    You note that “some of the results in the paper are summarized by statistical significance”. You’ve explained this before, but I am still a bit confused: by showing 95% uncertainty intervals and a line of no effect, isn’t every plot in this paper summarized by significance? The plots don’t use the term significance, but Table 3 does: every UI in Table 3 that excludes 0 has a star; every UI that includes zero does not. Why don’t the plots have the same “significance” interpretation as Table 3 depending on whether the UI crosses zero?

    • Andrew says:


      We did what we had to do, and you can give the plots a “significance” interpretation if you want, but we don’t see them that way. We see them as a data summary, and we don’t think it’s appropriate to select out the estimates that happen to exceed some threshold.

      • Eric says:

        It’s an interesting case to use in class. I’m struggling to teach students about statistical significance, and at the same time, to tell them not to get hung up on it. When they look at plots like these, they tend to do the same thing that happens in the paper and count that 30 out of 40 do not include zero. But I want them to think more in summary terms. Just interesting to see that they will never really escape the pressure to think in terms of significance.

        • jrc says:

          Would it help if they marked values other than 0 on the X-axis?

          I thought they did a good job of (re)presenting the results (pun intended). And even if the vertical line draws the eye towards 0, I think that is at least something of a necessary evil, if only to give some visual reference for a long (tall) figure. But maybe they could’ve added lines at each 0.25sd or 0.5sd or something. That might de-emphasize the testing aspect and re-emphasize the comparative effect sizes and uncertainties aspect.

          Or is the problem just with putting out confidence intervals at all, since they don’t tell us what we wish they did? Should they have graphed out the density of realizations of the posterior distribution, like with color gradients or something? I’m not sure how to represent the estimates in a way that doesn’t seem like testing if adding a few extra vertical lines doesn’t do the trick (and yes, of course, this is all very nitpicky – I think Figure 3 is wonderful. I’m kinda jealous.).

          • Eric says:

            The plots are great. So is the writing and the underlying effort given the limitations. My first comment was really a question about whether there is a practical distinction between this type of visual summary and significance.

            I think I have the same misunderstanding as Jason here: Andrew had a similar reply to Jason about removing references to statistical significance in the upcoming edition of his book. I’m trying to understand if the point is just to move away from binary thinking and de-emphasize significance, or whether there is a fundamental distinction. It seems like a bit of both, but I’m confused about what CIs do and don’t represent.

  3. Pablo says:

    Broken links to the articles.