GiveWell’s Change Our Mind contest, cost-effectiveness, and water quality interventions

Some time ago I wrote about a new meta-analysis pre-print in which we estimated that providing safe drinking water led to a 30% mean reduction in deaths in children under 5, based on data from 15 RCTs. Today I want to write about water again, but from the perspective of cost-effectiveness analysis (CEA).

A few months ago GiveWell (GW), a major effective altruism charity evaluator and grantmaker, hosted the Change Our Mind contest. Its purpose was to solicit critiques of, and improvements to, GW’s process and recommendations for allocating funding. This type of contest is obviously a fantastic idea (if you’re distributing tens of millions of dollars to charitable causes, even a fraction of a percent improvement in the efficiency of your giving is worth paying good money for), and GW also provided pretty generous rewards for the top entries. There were two winners, and I think both of them are worth blogging about:

1. Noah Haber’s “GiveWell’s Uncertainty Problem”
2. An examination of the cost-effectiveness of water quality interventions by Matthew Romer and Paul Romer Present (MRPRP henceforth)

I will post separately on Haber’s uncertainty analysis sometime soon, but today I want to write a bit about MRPRP’s analysis.

As I wrote last time, back in April 2022 GW recommended a grant of $65 million for clean water, in a “major update” to their earlier assessment. The decision was based on a pretty comprehensive analysis by GW, which estimated the costs and benefits of specific interventions aimed at improving water quality in specific countries.[1] (Scroll down for footnotes. Also, I’m flattered to say that they cited our meta-analysis as a motivation for updating their assessment.) MRPRP redo GW’s analysis and find effects that are 10-20% smaller in some cases. This is still highly cost-effective, but (per the logic I already mentioned) even small differences in cost-effectiveness will have large real-world implications for funding, given that the funding gap for provision of safe drinking water is measured in hundreds of millions of dollars.

However, my intention is not to argue about what the right number should be. I just want to dwell on one question these kinds of cost-effectiveness analyses raise: how to combine different sources of evidence.

When trying to estimate how much clean water reduces mortality in children, we can either look at direct experimental evidence (e.g. our meta-analysis) or go the indirect route: first look at estimates of reductions in disease (diarrhea episodes), then at evidence on how diarrhea links to mortality. The direct approach is the ideal (mortality is the ultimate outcome we care about, and it is objectively measured and clearly defined, unlike diarrhea), but deaths are rare. That is why researchers studying water RCTs have historically focused on reductions in diarrhea and often chose not to capture or report deaths. So we have many more studies of diarrhea.

Let’s say you go the indirect evidence route. To obtain an estimate, we need to know or make assumptions about (1) the extent of self-reporting bias (e.g. “courtesy” bias), (2) how many diseases can be affected by clean water, and (3) the potentially larger effect of clean water on severe cases (those leading to death) than on “any” diarrhea. Each of these is obviously hard. The direct evidence model (a meta-analysis of deaths) doesn’t require any of these steps.
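To make concrete how many moving parts the indirect route has, here is a stylized sketch in Python. The chain of multiplications and all of the numbers are my own illustrative placeholders, not GW’s or MRPRP’s actual model.

```python
# A stylized indirect estimate of mortality reduction (illustrative only;
# every number below is a made-up placeholder, not a GW/MRPRP input).

reported_diarrhea_reduction = 0.25  # pooled reduction in self-reported diarrhea
reporting_bias_adjustment = 0.90    # (1) discount for "courtesy"/self-reporting bias
share_mortality_affected = 0.50     # (2) share of under-5 deaths from diseases clean water could affect
severity_multiplier = 1.2           # (3) relative effect on severe (fatal) vs. "any" diarrhea

indirect_mortality_reduction = (
    reported_diarrhea_reduction
    * reporting_bias_adjustment
    * share_mortality_affected
    * severity_multiplier
)
print(f"Indirect estimate of mortality reduction: {indirect_mortality_reduction:.1%}")
# -> 13.5% with these placeholders; each factor is an assumption that the
#    direct route (meta-analysis of deaths) avoids.
```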

And once we have the two estimates (indirect and direct), then what? I describe GW’s process in the footnotes (I personally think it’s not great, but I want to keep this snappy).[2] Suffice it to say that they use the indirect evidence to derive a “plausibility cap”, the maximum size of the effect they are willing to admit into the CEA. MRPRP do it differently: they put distributions on the parameters of the direct and indirect models and then run both in Stan to arrive at a combined, inverse-variance-weighted estimate.[3] For example, for point (2) above (which diseases are affected by clean water), they look at a range of scenarios and use a Gaussian distribution with its mean at the most probable scenario and the most optimistic scenario 2 SDs away. MRPRP acknowledge that this is an arbitrary choice.
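To make the weighting mechanics concrete, here is a minimal Python sketch of a precision-weighted (inverse-variance) combination of two estimates. The point estimates echo the Kenya example in footnote [4], but the standard errors are made-up placeholders, and this fixed-effect shortcut is much cruder than the full Bayesian model MRPRP fit in Stan.

```python
import numpy as np

def inverse_variance_combine(estimates, std_errors):
    """Precision-weighted average of independent estimates (a textbook
    fixed-effect combination, standing in for the full Bayesian model)."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    precision = 1 / se**2
    weights = precision / precision.sum()
    combined = (weights * est).sum()
    combined_se = np.sqrt(1 / precision.sum())
    return combined, combined_se, weights

# Point estimates as in footnote [4]; standard errors are hypothetical.
combined, se, weights = inverse_variance_combine(
    estimates=[0.08, 0.035],   # direct (deaths), indirect (via diarrhea)
    std_errors=[0.04, 0.02],
)
print(combined, se, weights)
# The lower-variance indirect model gets ~80% of the weight here,
# pulling the combined estimate down towards 4.4%.
```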

A priori, a model-averaging approach seems obviously better than taking one model and imposing an arbitrary truncation (as in GW’s old analysis). However, depending on how you weigh the direct vs. indirect evidence models, you can now get anything from a ~50% reduction to a ~40% increase in the estimated benefits compared to GW’s previous analysis; a more extensive numerical example is in the footnotes.[4] So you want to be very careful about how you weigh! E.g. for one of the programs, MRPRP’s estimate of benefits is ~20% lower than GW’s, because in their model 3/4 of the weight is put on the (lower-variance) indirect evidence model, which dominates the result.

In the long term the answer is to collect more data on mortality. In the short term, probabilistically combining several models makes sense. However, putting 75% of the weight on a model of indirect evidence rather than on the one with a directly measured outcome strikes me as a very strong assumption, and the opposite of my intuition. (Maybe I’m biased?) Similarly, why would you use Gaussians as the default model for encoding beliefs (e.g. about the share of deaths averted)? I had a look at using different families of distributions in Stan and got quite different results. (If you want to follow the details, my notes are here.)
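To illustrate why the choice of family matters, here is a toy comparison in Python (the numbers are hypothetical, not MRPRP’s actual parameters): a Gaussian and a moment-matched lognormal with the same mean and SD put quite different probability mass on the optimistic scenario.

```python
import numpy as np
from scipy import stats

# Hypothetical belief about the share of under-5 deaths clean water could avert.
mean, optimistic = 0.50, 0.80      # most probable vs. most optimistic scenario
sd = (optimistic - mean) / 2       # "optimistic scenario is 2 SDs away"

# Gaussian encoding of the belief
p_normal = stats.norm(mean, sd).sf(optimistic)

# Lognormal with the same mean and SD (moment-matched)
sigma2 = np.log(1 + (sd / mean) ** 2)
mu = np.log(mean) - sigma2 / 2
p_lognormal = stats.lognorm(s=np.sqrt(sigma2), scale=np.exp(mu)).sf(optimistic)

print(f"P(beyond optimistic scenario): normal {p_normal:.3f}, lognormal {p_lognormal:.3f}")
# The lognormal puts nearly twice as much mass above the optimistic scenario
# in this toy example, so downstream estimates can shift noticeably.
```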

More generally, when averaging over two models that are somewhat hard to compare, how should we think about model uncertainty? In principle I think it would be a good idea to penalise both models, because there are many unknown unknowns in water interventions. So they’re both overconfident! But how do we make this penalty “fair” across two different types of models, when they vary in complexity and assumptions?

I’ll stop here for now, because this post is already a bit long. Perhaps this will be of interest to some of you.

Footnotes:

[1] There are many benefits of clean water interventions that a decision maker should consider (and the GW/MRPRP analyses do): in addition to reductions in deaths, there are also medical costs, developmental effects, and reductions in disease. For this post I am only concerned with how to model reductions in deaths.

[2] GW’s process is, roughly, as follows: (1) Meta-analyse data from mortality studies, take a point estimate, and adjust it for internal and external validity to make it specific to the contexts where they want to consider the program (e.g. baseline mortality, predicted take-up, etc.). (2) Using indirect evidence, hypothesise the maximum plausible impact on mortality (the “plausibility cap”). (3) If the benefits from direct evidence exceed the cap, set benefits to the cap’s value; otherwise use the direct evidence.
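In code, the cap rule itself is trivial; a minimal sketch, using the Kenya numbers from footnote [4]:

```python
def apply_plausibility_cap(direct_estimate: float, cap: float) -> float:
    """Use the (adjusted) direct estimate unless it exceeds the cap
    derived from indirect evidence."""
    return min(direct_estimate, cap)

# Footnote [4]'s Kenya example: a 6.1% direct estimate against a 5.6% cap.
print(apply_plausibility_cap(0.061, 0.056))  # -> 0.056
```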

[3] By the way, as far as I can see, neither model accounts for the fact that some of our evidence on mortality and diarrhea comes from the same sources. This is obviously a problem, but I ignore it here because it’s not related to the core argument.

[4] To illustrate with numbers, I will use GW’s analysis of Kenya Dispensers for Safe Water (a particular method of chlorination at the water source), one of several programs they consider. (The impact of using the MRPRP approach on the other programs analysed by GiveWell is much smaller.) In GW’s analysis, the direct evidence model gave a 6.1% mortality reduction, but the plausibility cap was 5.6%, so they set it to 5.6%. Under the MRPRP model, the direct evidence suggests about an 8% reduction, compared to 3.5% in the indirect evidence model. The unweighted mean of the two would be 5.75%, but because of the higher uncertainty on the direct effect, the final (inverse-variance weighted) estimate is a 4.6% reduction. That corresponds to putting 3/4 of the weight on the indirect evidence. If we applied the “plausibility cap” logic to the MRPRP estimates, rather than weighing the two models, the estimated reduction in mortality for the Kenya DSW program would be 8% rather than 4.6%, a whopping 40% increase on GW’s original estimate.
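For the curious, here is a quick Python check of how the implied weight falls out of the numbers quoted above:

```python
# Numbers quoted in this footnote (Kenya DSW, MRPRP model)
direct, indirect, combined = 0.08, 0.035, 0.046

unweighted_mean = (direct + indirect) / 2                # 5.75%
w_indirect = (direct - combined) / (direct - indirect)   # weight implied by the combined estimate

print(f"Unweighted mean: {unweighted_mean:.2%}")
print(f"Implied weight on indirect evidence: {w_indirect:.0%}")  # ~76%, i.e. roughly 3/4
```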

16 thoughts on “GiveWell’s Change Our Mind contest, cost-effectiveness, and water quality interventions”

  1. I think this is an excellent example of “mathiness”. We telescope several mathematical analyses, each rigorous in itself, but with analyst degrees of freedom.

    At the end you get a number, and given the rigorous process (after all, there are error bounds as well!) there’s a temptation to believe in the sanctity of the estimate more than we should.

    • I am sympathetic to this type of analysis – I’ve done many similar exercises, albeit using different methodologies. Simulations and sensitivity analyses can handle a wide variety of uncertainties and, with limited evidence, there really isn’t an alternative to attempts like these (the alternative to bad analysis is no analysis, which is usually worse). Decisions need to be made and uncertainty is unavoidable, so it is best to explore as much of the uncertain space as possible and be transparent about what assumptions are being made and what difference alternative assumptions might make.

      The danger I see is when people believe that these exercises will somehow eliminate the uncertainties – that is where the mathiness can get in the way. Ultimately, I’d like to see uncertainty ranges about the final impacts that are relevant to the decisions. If the decision is how much to spend on a particular program, what is the probabilistic range of impacts (cost-benefit, cost effectiveness, etc.) and how sensitive are these to key assumptions? I think that information is useful to the ultimate decision makers. Overconfidence in the resulting analysis is always a danger with quantification of uncertainty, but not a reason to avoid the analysis.

      Rahul – it wasn’t clear to me whether you were suggesting the analysis was a bad idea or just pointing out a danger with such “rigorous” analysis. If the latter, I agree. My comment is meant to support doing these analyses despite the inherent dangers of overconfidence.

      • “the alternative to bad analysis is no analysis, which is usually worse”

        I guess I’m not sure why you assume this. In this case, people have poor water supplies. After the cost of doing the analysis – I mean here we’re already looking at data collection for a bunch of original analyses and then meta-analysis on top of that – both in $$ and lost time, the analysis still has high uncertainty. So how effective will it be compared to just eyeballing which communities need water purification the most and implementing it?

        So this is where I take issue with many things discussed here. The analyses are interesting in themselves as experimental science. There’s nothing wrong with doing them. However, it’s almost always questionable whether they are effective or beneficial decision making tools after all the problems with the data and the unknowns are taken into account. Bad data isn’t necessarily better than no data and contrary to your point of view, I would say it’s usually worse.

        So it’s not surprising that stats people promote the idea of human incompetence and frequently seek to downgrade human perception – it’s in their interest to do so, just like it’s in the interest of environmental orgs and researchers to promote environmental doom and in the interests of resource producers to resist it. That’s not to say that people are doing so intentionally, cackling loudly while they count their dollars. They’re doing so because they – like all humans – have a *cognitive bias* to promote what’s in their own interest.

        • Your last paragraph is quite a leap. I agree that the data can be bad enough that “any” analysis may not be any good at all. But simple heuristics (such as eyeballing which communities need it the most) are a form of analysis. The leap I see in your last paragraph is a generalization that human perception is often better than analysis – when we are talking statistical matters, I think human perception cannot be relied upon. Our senses are very limited in what we see personally. And, to the extent that we rely on media to “inform” us, we know how unreliable that has become. So, I think your skeptical last paragraph becomes an excuse to forgo analysis in favor of gut instinct. But gut instinct that worked for protecting humans from predators in the jungle doesn’t work so well at deciding where to best direct philanthropy. I am reminded of a former college president of mine who said “I don’t have the data, but if I did, I know what it would say.”

        • chipmunk:

          You really seem to underestimate the real-world complexity of water quality interventions. Decisions based on gut feelings can kill people. One famous example would be tube wells in Bangladesh. In Bangladesh water is not scarce, but surface water is usually contaminated. The groundwater was presumed to be clean, so from the 70s on they installed tube wells, which now supply water for more than half of the population. Unfortunately, aquifers are often contaminated with arsenic, and as it turned out this became a severe problem in large parts of Bangladesh. And… this was only realized in the 90s.

          Now, risk evaluation is rather tricky. Chronic arsenic poisoning is highly undesirable, but surface water will similarly kill people. A first step was to identify problematic wells; unfortunately, the most populated areas of Bangladesh are areas with extremely high arsenic levels. It’s much better in the rural areas, but even there you have to check each individual well, as arsenic levels might differ even for wells in the same village.

          So for interventions it’s better to first, in your words, lose a lot of time and $$ on a bunch of original analyses or meta-analyses than to just eyeball it and kill tens of millions of people in the process.

        • chipmunk –

          > Bad data isn’t necessarily better than no data and contrary to your point of view, I would say it’s usually worse.

          Data are “bad” or “good,” depending on how you analyze them or act upon them.

        • @joshua :

          That motif is the source of a lot of problems.

          The thinking seems to be that data, no matter how bad#, if only we squeeze it hard enough, long enough, will yield actionable insight.

          There’s this hope, that if only we had the right techniques or better algorithms we could make the same “bad” data sing.

          # replace “bad” with some version of noisy, biased, inaccurate, sparse etc.

        • Rahul
          The motif that bad data renders analysis worthless is equally troubling. I agree that it is dangerous to overlook the many shortcomings in the data by appeal to ever more elaborate analyses, but it is too easy (and dangerous) to dismiss analysis because the data has issues – sparse, inaccurate, noisy, bad, etc. I’d rather see the data used, despite its flaws, but with appropriate humility.

        • Rahul –

          > # replace “bad” with some version of noisy, biased, inaccurate, sparse etc.

          With that, I’m fine. That was essentially my point.

          You learn something useful, imo, by describing the data, and thus can assess the parameters of how it is informative or not along particular dimensions.

          Data connected to useful descriptors…is that bad data?

          What’s the difference between “bad” data and “good” data?

          Don’t all data, basically, have limitations?

        • Thomas said: “The groundwater was presumed to be clean”

          You see the problematic word, right? In this case it’s associated with only one feature: the safety of the ground water. In a data analysis context there are usually several and sometimes dozens of variables that have the word “presumed” attached to them in some form.

          However in the case we’re talking about here, we’re talking about providing chlorination to water that’s already being used. The presumption I make – one that I think is sound – is that properly applied this doesn’t have a downside risk.

        • “The leap I see in your last paragraph is a generalization that human perception is often better than analysis”

          That’s exactly the leap I make!

          Here’s the standard I propose for statistical or numerical analysis: it needs to provide a much higher level of certainty than human perception to be useful, since it often requires many assumptions.

          I don’t suggest that statistical or numerical analysis be replaced with “gut instinct”. I suggest specifically that in complex problems *human perception* is as reliable as complex statistical analysis and modeling. That doesn’t mean “shoot from the hip”; it means use tools of high certainty in combination with common reasoning to make decisions.

          In some cases statistical analysis yields a clear result. But those cases are rare. That’s why we have sophisticated procedures for things like testing the efficacy of drugs. And in the case of drug efficacy, the statistical standard is pretty high.

          Your reference to “predators in the jungle” is interesting. IMO it’s the origin of the mistaken idea that human intelligence has low capability.

          First, protection from predators or obtaining prey are not unique functions of human intelligence. Virtually all animals perform these functions. Second, these functions are much more sophisticated than you might imagine. They require continuous reassessment of conditions based on many different inputs – and, unlike statistics, have been refined over roughly half a billion years. Third, while many animals can to a small degree manipulate their environment and test the outcome of manipulations, humans are far superior to most other animals in that respect (in fact that’s what I’m doing right now – suggesting that a system that doesn’t work be relegated to experimentation, rather than applications). So, in addition to their naturally honed senses, they can add knowledge and experience from manipulation much more directly than most animals. That has given humans the ability to quickly notice and act on patterns. Fourth, as we see with the hot hand and many other scientific statistical failures, humans can identify patterns that statisticians cannot identify or have great difficulty identifying.

        • Thomas:

          BTW, the assumption that groundwater is safe is a far better assumption than many assumptions that are routinely made in various forms of statistical analysis. In the case you point out, it turned out wrong, and that’s a sad thing – especially because, unlike many assumptions made in statistical analysis, it’s cheap and easy to test with absolute certainty.

          So the moral of the story is not “use statistical analysis”. The moral of the story is “test the assumptions”. If you follow that advice, a lot of statistical analysis won’t look so good, because the assumptions aren’t easy, or sometimes even possible, to test.

        • chipmunk
          I disagree with most of what you have said. I do agree that human ability to perceive associations is quite sophisticated and I don’t underestimate the complexity of jungle flee or fight decisions, nor the fact that humans are superior to many animals in such situations. But what does that have to do with human ability to judge whether a cancer cluster is random or caused by some factor? Yes, you’ve made a leap and one that you will have to make alone (at least without me).

        • chipmunk:

          >However in the case we’re talking about here, we’re talking about providing chlorination to water that’s already being used. The presumption I make – one that I think is sound – is that properly applied this doesn’t have a downside risk.

          I think nobody ever doubted that chlorination can be beneficial for drinking water quality. What they question is whether it’s a cost-effective intervention from the perspective of a charity organisation.

          Charity organizations almost always lack funds, and wrong decisions mean lost opportunities. Let’s consider Bangladesh again. The arsenic concentration in rural areas is often still high but acceptable compared to the alternatives. While the water quality could be further improved by purification, for a charity an investment might be more worthwhile in a different part of the country, or elsewhere entirely.

          >it’s cheap and easy to test with absolute certainty.

          Assumptions can only be tested if one knows they are assumptions to begin with. One can only test for compounds one expects; this is a general problem in analytics.

          And even then it’s not cheap. A generic test for an arbitrary inorganic compound will set you back at least 10 bucks in materials and half an hour for each sample. Of course, today there are cheap test kits for arsenic but they had to be developed first.

  2. Thanks for your comments, Witold. I’m the primary architect of GiveWell’s chlorination cost-effectiveness analysis, and I’d like to clarify how our model works, why we chose this approach, and what directions we are considering for the future.

    To start with the big-picture view, at GiveWell nearly all of our funding decisions are made in the context of substantial uncertainty, which involves judgment calls and a lot of researcher degrees of freedom. Because of that, getting feedback on how we make those decisions is critical. That was the motivation behind the Change Our Mind contest and why we’re grateful for your thoughtful engagement here.

    Typically, when GiveWell engages with research, we reduce initial impact estimates to account for factors like our priors, publication bias, internal validity, and external validity. In this case, the effects on all-cause mortality reported in the Kremer et al. meta-analysis are large relative to our expectations, and relative to the expectations of experts we have spoken with. We think overestimation is plausible due to the unavoidable limitations of the mortality data that are available. This concern was echoed by two external experts we hired to review the meta-analysis, including a water trial expert and a statistics expert. Reasons that overestimation may be a concern include the fact that the confidence intervals of the pooled result are fairly wide, publication bias remains possible despite helpful efforts to measure and constrain it in the paper, and bias may arise from differences in the intensity of researcher-subject interactions between intervention and control groups (particularly given that the underlying trials were not designed around measuring mortality). As a result, we feel it is important to sense-check those findings by estimating the maximum effect on mortality we could plausibly expect from chlorination. The “plausibility cap” calculation is this sense-check.

    Our current estimates of the impact of chlorination on mortality in specific settings are below our plausibility caps, so the caps don’t currently impact our cost-effectiveness analysis output. Our cost-effectiveness analysis uses a subset of the trials included in the Kremer et al. meta-analysis and adjusts the estimate downward in other ways to yield a closer approximation of the impact of a chlorination intervention in the specific locations we have investigated. For example, downward adjustments account for the fact that some of the trials included additional interventions that are not present in the simple chlorination interventions we are evaluating, such as flocculation, safe storage containers, and hygiene interventions; we adjust the effect on mortality downward because we think those probably contributed to the effect sizes observed in trials. This smaller effect size estimate (about one third to half the Kremer et al. estimate) typically passes our plausibility sense-check, after adjustments.

    I’d also like to clarify that the method we used to create the plausibility cap is not quite the “indirect method” that was described in this post. Specifically, the plausibility cap multiplies (A) the reduction in diarrhea morbidity caused by chlorination interventions by (B) the amount of mortality caused by all conditions that could plausibly be impacted by chlorination in under-5s, under generous assumptions (basically, all infectious diseases). For example, if chlorination reduces diarrhea prevalence by 25%, and infectious diseases account for 65% of mortality in under-5s, the maximum plausible reduction of mortality is 16% (25% x 65%). We consider this a “cap” because it represents the upper bound of what we believe the impact of chlorination on mortality could reasonably be.

    Our calculation of the plausibility cap involves uncertain assumptions, most notably (1) that the impact of chlorination on disease-specific mortality is proportional to the impact of chlorination on diarrhea morbidity, and (2) that all infectious diseases are impacted by chlorination to the same degree as diarrhea. Based on our research, we think these assumptions represent a generous version of the most plausible mechanism by which chlorination could reduce mortality by much more than expected, which is that it substantially impacts the risk of dying from diseases other than diarrhea. We have explored other mechanisms by which chlorination could reduce mortality more than expected, both via desk research and expert conversations, and have not found other mechanisms that are both supported by evidence and quantitatively large. That said, we welcome feedback on this approach since it involves a lot of uncertainty in interpreting the research literature and in modeling choices, and it impacts our cost-effectiveness estimates and ultimately our funding allocations.

    The Kremer et al. meta-analysis was an important update to our work: previously we had been extrapolating the impact of chlorination on mortality using the “indirect method” you describe in your post, and we now believe our initial model was probably underestimating the effect size.

    We remain concerned about the uncertainty of the evidence on the impact of chlorination on mortality, and we agree that collecting more data is a key way to reduce this uncertainty. To that end, this year we have recommended funding to conduct at least one further trial examining the link between water chlorination and all-cause mortality in children.

    As an additional step to reduce uncertainty, we are carefully considering feedback from the Change Our Mind contest, and it will likely result in updates to our chlorination model. As part of this, we are considering updating how we calculate effect sizes and incorporate different sources of evidence.

    Thank you again for your engagement on this. Our research on water treatment is still evolving, and we appreciate external feedback and will continue to take it seriously as we refine our models.
