GiveWell’s Change Our Mind contest, cost-effectiveness, and water quality interventions

Some time ago I wrote about a new meta-analysis pre-print in which we estimated that providing safe drinking water led to a 30% mean reduction in deaths in children under 5, based on data from 15 RCTs. Today I want to write about water again, but from the perspective of cost-effectiveness analysis (CEA).

A few months ago GiveWell (GW), a major effective altruism charity evaluator, hosted a Change Our Mind contest. Its purpose was to critique and improve GW’s process and recommendations for allocating funding. This type of contest is obviously a fantastic idea (if you’re distributing tens of millions of dollars to charitable causes, even a fraction of a percent improvement in the efficiency of your giving is worth paying good money for), and GW also provided pretty generous rewards for the top entries. There were two winners, and I think both of them are worth blogging about:

1. Noah Haber’s “GiveWell’s Uncertainty Problem”
2. An examination of cost-effectiveness of water quality interventions by Matthew Romer and Paul Romer Present (MRPRP henceforth)

I will post separately on the uncertainty analysis by Haber sometime soon, but today I want to write a bit on MRPRP’s analysis.

As I wrote last time, back in April 2022 GW recommended a grant of $65 million for clean water, in a “major update” to their earlier assessment. The decision was based on a pretty comprehensive analysis by GW, which estimated the cost-benefit of specific interventions aimed at improving water quality in specific countries.[1] (Scroll down for footnotes. Also, I’m flattered to say that they cited our meta-analysis as a motivation for updating their assessment.) MRPRP redo GW’s analysis and find effects that are 10-20% smaller in some cases. This is still highly cost-effective, but (per the logic I already mentioned) even small differences in cost-effectiveness will have large real-world implications for funding, given that the funding gap for provision of safe drinking water runs into hundreds of millions of dollars.

However, my intention is not to argue about what the right number should be. I’m just wondering about one question this kind of cost-effectiveness analysis raises, which is how to combine different sources of evidence.

When trying to estimate how much clean water reduces mortality in children, we can look either at direct experimental evidence (as in our meta-analysis) or go an indirect route: first look at estimates of reductions in disease (diarrhea episodes), then at evidence on how disease links to mortality. The direct approach is the ideal (mortality is the ultimate outcome we care about, and it is objectively measured and clearly defined, unlike diarrhea), but deaths are rare. That is why researchers studying water RCTs historically focused on reductions in diarrhea and often chose not to capture or report deaths. So we have many more studies of diarrhea.

Let’s say you go the indirect evidence route. To obtain an estimate, we need to know or make assumptions about (1) the extent of self-reporting bias (e.g. “courtesy” bias), (2) how many diseases can be affected by clean water, and (3) the potentially larger effect of clean water on severe cases (those leading to death) than on “any” diarrhea. Each of these is obviously hard. The direct evidence model (a meta-analysis of deaths) doesn’t require any of these steps. A toy sketch of the chain of adjustments is below.
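To make the indirect route concrete, here is a toy sketch in R. All of the numbers are invented for illustration; they are not GW’s or MRPRP’s inputs.

```r
# Toy sketch of the indirect-evidence chain (all numbers invented):
# start from a measured reduction in self-reported diarrhea and
# walk it through the three assumptions above to get to mortality.
diarrhea_reduction  <- 0.25  # e.g. 25% fewer self-reported diarrhea episodes
reporting_bias_adj  <- 0.90  # (1) assume a tenth of the effect is "courtesy" bias
share_deaths_water  <- 0.50  # (2) share of under-5 deaths water can affect
severity_multiplier <- 1.20  # (3) assume a larger effect on severe (fatal) episodes

diarrhea_reduction * reporting_bias_adj * share_deaths_water * severity_multiplier
# -> 0.135, i.e. a 13.5% mortality reduction. Every factor is an assumption;
# the direct model (meta-analysis of deaths) needs none of them.
```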

And once we have the two estimates (indirect and direct), then what? I describe GW’s process in the footnotes (I personally think it’s not great, but I want to keep this snappy).[2] Suffice it to say that they use the indirect evidence to derive a “plausibility cap”, the maximum size of the effect they are willing to admit into the CEA. MRPRP do it differently, by putting distributions on the parameters of the direct and indirect models and then running both in Stan to arrive at a combined, inverse-variance-weighted estimate.[3] For example, for point (2) above (which diseases are affected by clean water), they look at a range of scenarios and put a Gaussian distribution with its mean at the most probable scenario and the most optimistic scenario 2 SDs away. MRPRP acknowledge that this is an arbitrary choice.
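A minimal sketch of that prior construction in R, again with invented numbers (the real scenario values are in MRPRP’s analysis):

```r
# Encoding a scenario range as a Gaussian: mean at the most probable
# scenario, most optimistic scenario 2 SDs away (numbers invented).
most_probable   <- 0.50  # e.g. 50% of under-5 deaths affectable by clean water
most_optimistic <- 0.80
prior_sd <- (most_optimistic - most_probable) / 2  # = 0.15

# Implied prior: Normal(0.50, 0.15). Note the symmetry it forces:
# the pessimistic 2-SD scenario is then 0.20, plausible or not.
qnorm(c(0.025, 0.975), mean = most_probable, sd = prior_sd)  # ~0.21 to ~0.79
```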

A priori, a model-averaging approach seems obviously better than taking a single model and imposing an arbitrary truncation (as in GW’s old analysis). However, depending on how you weight the direct vs indirect evidence models, you can now get anywhere from a ~50% reduction to a ~40% increase in the estimated benefits compared to GW’s previous analysis; a more extensive numerical example is in the footnotes.[4] So you want to be very careful in how you weight! E.g. for one of the programs MRPRP’s estimate of benefits is ~20% lower than GW’s, because in their model 3/4 of the weight is put on the (lower-variance) indirect evidence model, which dominates the result.

In the long term the answer is to collect more data on mortality. In the short term, probabilistically combining several models makes sense. However, putting 75% weight on a model of indirect evidence rather than on the one with a directly measured outcome strikes me as a very strong assumption, and the opposite of my intuition. (Maybe I’m biased?) Similarly, why would you use Gaussians as a default model for encoding beliefs (e.g. in the share of deaths averted)? I had a look at using different families of distributions in Stan and got quite different results. (If you want to follow the details, my notes are here.)
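To illustrate why the family matters, here is a sketch in R with the same invented numbers as above (not MRPRP’s inputs):

```r
# A Gaussian is a convenient default, but not an innocent one.
# Say the parameter is a share in [0, 1], centered at 0.5:
pnorm(c(0, 1), mean = 0.5, sd = 0.15)  # puts (a little) mass outside [0, 1]
# ...and it forces symmetry: pessimistic scenarios get exactly
# the same weight as optimistic ones.

# A Beta with the same mean keeps support in [0, 1], and its shape
# changes the tails substantially:
pbeta(0.8, shape1 = 5, shape2 = 5, lower.tail = FALSE)  # ~0.02, close to the Gaussian
pbeta(0.8, shape1 = 2, shape2 = 2, lower.tail = FALSE)  # ~0.10, a much fatter tail
# The chosen family changes the model's variance, and with it the
# inverse-variance weight that model gets in the averaging.
```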

More generally, when averaging over two models that are hard to compare, how should we think about model uncertainty? In principle I think it would be a good idea to penalise both models, because there are many unknown unknowns in water interventions; both models are overconfident! But how do we make this penalty “fair” across two different types of models, when they vary in complexity and assumptions?

I’ll stop here for now, because this post is already a bit long. Perhaps this will be of interest to some of you.

Footnotes:

[1] There are many benefits of clean water interventions that a decision maker should consider (and the GW/MRPRP analyses do): in addition to reductions in deaths, there are also medical costs, developmental effects, and reductions in disease. For this post I am only concerned with how to model reductions in deaths.

[2] GW’s process is, roughly, as follows: (1) Meta-analyse data from mortality studies, take a point estimate, and adjust it for internal and external validity to make it specific to the contexts where they want to consider their program (e.g. baseline mortality, predicted take-up etc.). (2) Using indirect evidence, hypothesise the maximum impact on mortality (the “plausibility cap”). (3) If the benefits from direct evidence exceed the cap, set benefits to the cap’s value; otherwise use the direct evidence.

[3] By the way, as far as I could see, neither model accounts for the fact that some of our evidence on mortality and diarrhea comes from the same sources. This is obviously a problem, but I ignore it here because it’s not related to the core argument.

[4] To illustrate with numbers, I will use GW’s analysis of Kenya Dispensers for Safe Water (a particular method of chlorination at the water source), one of several programs they consider. (The impact of the MRPRP approach on the other programs analysed by GiveWell is much smaller.) In GW’s analysis, the direct evidence model gave a 6.1% mortality reduction, but the plausibility cap was 5.6%, so they set it to 5.6%. Under the MRPRP model, the direct evidence suggests about an 8% reduction, compared to 3.5% in the indirect evidence model. The unweighted mean of the two would be 5.75%, but because of the higher uncertainty on the direct effect, the final (inverse-variance weighted) estimate is a 4.6% reduction. That corresponds to putting 3/4 of the weight on the indirect evidence. If we applied the “plausibility cap” logic to the MRPRP estimates, rather than weighting the two models, the estimated reduction in mortality for the Kenya DSW program would be 8% rather than 4.6%, a whopping 40% increase on GW’s original estimate.
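To check the weighting arithmetic in R (the point estimates are as quoted above; the standard errors are my own guesses, chosen so the weights come out near the 1/4 vs 3/4 split):

```r
# GW's original cap logic (numbers as quoted above):
direct_gw <- 0.061; cap <- 0.056
min(direct_gw, cap)  # -> 0.056: the cap binds

# MRPRP-style inverse-variance weighting (SEs are my guesses):
direct_est   <- 0.080; direct_se   <- 0.040
indirect_est <- 0.035; indirect_se <- 0.023
w <- c(1 / direct_se^2, 1 / indirect_se^2)
w <- w / sum(w)                        # ~0.25 and ~0.75
sum(w * c(direct_est, indirect_est))   # ~0.046, i.e. a 4.6% reduction
```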

Water Treatment and Child Mortality: A Meta-analysis and Cost-effectiveness Analysis

This post is from Witold.

I thought some of you may find this pre-print (of which I am a co-author) interesting. It’s a meta-analysis of interventions improving water quality in low- and middle-income countries. We estimated that these reduced the odds of child mortality by 30%, based on 15 RCTs. That’s obviously a lot! If true, this would have very large real-world implications, but there are of course statistical considerations of power, publication bias etc. So I thought that maybe some readers will have methodological comments, while others may be interested in the public health aspect of it. It also ties into a couple of follow-up posts I’d like to write here on effective altruism and finding cost-effective interventions.

First, a word on why this is an important topic. Globally, for every thousand births, 37 children will die before the age of 5. Thankfully, this is already half of what it was in 2000. But it’s still about 5 million deaths per year. One of the leading causes of death in children is diarrhea, caused by waterborne pathogens. While chlorinating [1, scroll down for footnotes] water is easy, inexpensive, and proven to remove pathogens from water, there are many countries where most people still don’t have access to clean water (the oft-cited statistic is that 2 billion people don’t have access to safe drinking water).

What is the magnitude of the impact of clean water on mortality? There is a lot of experimental evidence for reductions in diarrhea, but making the link between clean water and mortality requires either an additional, “indirect” model connecting disease to deaths, which is hard [2], or directly measuring deaths, which are rare (and hence also hard) [3].

In our pre-print [4], together with my colleagues Michael Kremer, Steve Luby, Ricardo Maertens, and Brandon Tan, we identify 53 RCTs of water quality treatments. Contacting the authors of each study resulted in 15 estimates that could be meta-analysed, covering about 25,000 children. (Why only 15 out of 53? Apparently, because the studies were not powered for mortality, with each one contributing just a handful of deaths, in some cases the authors decided not to collect, retain, or report deaths.) As far as we are aware, this is the first attempt to meta-analyse experimental evidence on mortality and water quality.

We conduct a Bayesian meta-analysis of these 15 studies using a logit model and find a 30% reduction in the odds of all-cause mortality (OR = 0.70, with a 95% interval of 0.49 to 0.93), albeit with high (and uncertain) heterogeneity across studies, which means the predictive distribution for a new study has a much wider interval and a slightly higher mean (OR = 0.75, 95% interval 0.29 to 1.50). This heterogeneity is to be expected, because we compare different types of interventions in different populations across a few decades.[5] (Typically we would want to address this with a meta-regression, but that is hard with such a small sample.)
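For readers curious about the mechanics: in a normal approximation on the log-OR scale, the predictive distribution for a new study adds the between-study heterogeneity on top of the uncertainty in the mean. A back-of-envelope reconstruction in R, with numbers read off the intervals quoted above (so only approximate):

```r
# Why the predictive interval is wider (normal approximation on log OR;
# all numbers are rough reconstructions from the quoted intervals).
mu    <- log(0.70)                             # posterior mean log OR
se_mu <- (log(0.93) - log(0.49)) / (2 * 1.96)  # ~0.16
tau   <- 0.39                                  # heterogeneity sd (rough guess)

# A new study's effect: theta_new ~ Normal(mu, sqrt(se_mu^2 + tau^2))
sd_new <- sqrt(se_mu^2 + tau^2)                # ~0.42
exp(mu + c(-1.96, 1.96) * sd_new)              # ~0.31 to ~1.60 on the OR scale
# Close to (not exactly) the quoted 0.29-1.50: the real posterior is
# asymmetric, and tau itself is uncertain, which this shortcut ignores.
```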

The whole analysis is implemented in baggr, an R package that provides a meta-analysis interface to Stan. There are some interesting methodological questions related to the modeling of rare events, but repeating this analysis using frequentist methods (a random-effects model on Peto’s ORs has a mean OR of 0.72), as well as the various sensitivity analyses we could think of, all lead to similar results. We also think that publication bias is unlikely. Still, perhaps there are things we missed.
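For those who want to try the frequentist check themselves, a minimal sketch with the metafor package (the data frame below is invented; the real study data are in the paper):

```r
# Sketch of the frequentist check: Peto log ORs per study,
# pooled with a random-effects model. Data invented for illustration.
library(metafor)

d <- data.frame(
  deaths_t = c(3, 1, 0, 5),            # deaths in treatment arms (invented)
  n_t      = c(800, 500, 400, 1200),
  deaths_c = c(6, 2, 2, 8),            # deaths in control arms (invented)
  n_c      = c(790, 510, 390, 1180)
)

# Peto's method is a common choice for rare events, as it avoids
# per-study continuity corrections:
es  <- escalc(measure = "PETO", ai = deaths_t, n1i = n_t,
              ci = deaths_c, n2i = n_c, data = d)
fit <- rma(yi, vi, data = es, method = "REML")   # random-effects pooling
exp(coef(fit))                                   # pooled OR
```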

Based on this we calculate a cost of about $3,000 per child death averted, or under $40 per DALY. It’s hard to convey how extremely cost-effective this is (a typical cost-effectiveness threshold is the equivalent of one year’s GDP per DALY; that threshold would be reached even at a 0.6% reduction in mortality), but basically it is on par with the most cost-effective child health interventions, such as vaccinations.
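The back-of-envelope version of that arithmetic in R (the ~80 DALYs per under-5 death averted is my round number for illustration, not the paper’s exact input):

```r
# Rough cost-effectiveness arithmetic (~80 DALYs per under-5 death
# averted is a round illustrative number, not the paper's exact input).
cost_per_death  <- 3000
dalys_per_death <- 80                 # a child death ~ decades of healthy life
cost_per_daly   <- cost_per_death / dalys_per_death  # ~$37.5, i.e. "under $40"

# Cost scales inversely with the effect: if the true reduction were
# 0.6% instead of 30%, cost per DALY would rise 50-fold:
cost_per_daly * (0.30 / 0.006)  # ~$1,875 per DALY, around one year's GDP
                                # per capita in many relevant countries
```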

Since the cost-effectiveness is potentially so high, there are obviously big real-world implications. Some funders have already been reacting to the new evidence. For example, some months ago GiveWell, an effective altruism non-profit that many readers will already be familiar with, conducted their own analysis of water quality interventions and, in a “major update” of their assessment, recommended a grant of $65 million toward a particular chlorination implementation [6]. (GiveWell’s assessment is an interesting topic for a blog post of its own, so I hope to write about it separately in the next few days.)

Of course, in the longer term more RCTs will contribute to the precision of this estimate (several are being worked on already), but generating evidence is a slow and costly process. In the short term, funding decisions will be driven by the existing evidence (and our paper is still a pre-print), so it would be fantastic to hear whether readers have comments on the methods and their real-world implications.

 

Footnotes:

[1] For simplicity I just say “chlorination”, but this may refer to chlorinating at home, at the point from which water is drawn, or even using a device in the pipe, if households have piped water which may be contaminated. Each of these has different effectiveness (driven primarily by how convenient it is to use) and different costs, so differentiating between them is very important for a policy maker. But in this post I group them all together to keep things simple. There are also other methods of improving quality, e.g. filtration. If you’re interested, this is covered in more detail in the meta-analyses that I link to.

[2] Why is extrapolating from evidence on diarrhea to mortality hard? First, it is possible that the reduction in severe disease is larger (in the same way that a vaccine may not protect you from infection, but will almost certainly protect you from dying). Second, clean water also has lots of other benefits: it likely makes children less susceptible to other infections and nutritional deficiencies, and also makes their mothers healthier (which could in turn lead to fewer deaths during birth). These are just hypotheses, though, so it’s hard to say a priori how a reduction in diarrhea would translate into a reduction in mortality.

[3] If you’re aiming for 80% power to detect a 10% reduction in mortality, you will need RCT data on tens of thousands of children. The exact number of course depends on the baseline mortality rate in the studies; a quick calculation is below.
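A quick illustration with base R, assuming a baseline mortality of 5% (my illustrative choice), so that a 10% relative reduction takes it to 4.5%:

```r
# Sample size for 80% power to detect a 10% relative reduction
# from an assumed 5% baseline mortality (illustrative baseline):
power.prop.test(p1 = 0.050, p2 = 0.045, power = 0.80, sig.level = 0.05)
# -> roughly 28,000 children per arm, i.e. well over 50,000 in total.
# A lower baseline mortality rate pushes the required sample higher still.
```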

[4] Or, to be precise, an update to a version of this pre-print which we released in February 2022. If you happened to read the previous version of the paper, the main methods and results are unchanged, but we added extra publication bias checks and a characterization of the sample, and rewrote most of the paper for clarity.

[5] That last aspect of heterogeneity seems important, because some have argued that the impact of clean water may diminish over time. There is a trace of that in our data (see the supplement), but with 15 studies the power to test for this time trend is very low (which I show using a simulation approach).

[6] GiveWell’s analysis included their own meta-analysis and led to more conservative estimates of mortality reductions. As I mention at the end of this post, this is something I will try to blog about separately. Their grant will fund Dispensers for Safe Water, an intervention which gives people access to chlorine at the water source. GW’s analysis also suggested a much larger funding gap in water quality interventions, of about $350 million per year.