How different are causal estimation and decision-making?

Decision theory plays a prominent role in many texts and courses on theoretical statistics. However, the “decisions” being made are often as simple as using a particular estimator and then producing a point estimate — the point estimate, say, of an average treatment effect (ATE) of some intervention is the “decision”. That is, these decisions are often substantially removed from the kinds of actions that policymakers, managers, doctors, and others actually have to make. (This is a point frequently made in decision theory texts, which often bemoan the use of default loss functions; here I think of Berger’s Statistical Decision Theory and Bayesian Analysis and Robert’s The Bayesian Choice.)

These decision-makers are often doing things like allocating units to two or more different treatments: for a given unit, they have to put it in treatment or control, or perhaps in one of a much higher-dimensional space of treatments. When possible, there can be substantial benefits from incorporating knowledge of this actual decision problem into data collection and analysis.

In a new review paper, Carlos Fernandez-Loria and Foster Provost explore how this kind of decision-making importantly differs from estimation of causal effects, highlighting that even highly confounded observational data can be useful for learning policies for targeting treatments. Much of the argument is that the objective functions in decision-making are different and that this has important consequences (e.g., if a biased estimate still yields the same treat-or-not decision, no loss is incurred). The paper is worth reading, and it points to a bunch of relevant recent — and less recent — literature. (For example, it made me aware that the expression of policy learning as a cost-sensitive classification problem originated with Bianca Zadrozny in her dissertation and some related papers.)
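
As a concrete illustration of that reduction, here is a minimal sketch (my own toy example with a known randomization probability, not code from Zadrozny or from the review paper): label each unit by the sign of an inverse-probability-weighted score, weight it by the score’s magnitude, and let the fitted classifier’s prediction be the treatment rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=(n, 2))                 # covariates
e = 0.5                                     # known randomization probability
w = rng.binomial(1, e, size=n)              # randomized treatment assignment
tau = x[:, 0]                               # true CATE (unknown to the analyst)
y = x.sum(axis=1) + w * tau + rng.normal(size=n)

# Inverse-probability-weighted score whose conditional expectation given x is the CATE.
gamma = y * (w - e) / (e * (1 - e))

# Cost-sensitive reduction: label = sign of the score, weight = its magnitude.
# The fitted classifier's prediction is then the treatment rule.
labels = (gamma > 0).astype(int)
rule = LogisticRegression().fit(x, labels, sample_weight=np.abs(gamma))
treat = rule.predict(x).astype(bool)

# Sanity check: the learned rule should mostly treat units with positive true CATE.
print(f"mean true CATE among units the rule treats:     {tau[treat].mean():.2f}")
print(f"mean true CATE among units it leaves untreated: {tau[~treat].mean():.2f}")
```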

Here I want to spell out related but distinct reasons underlying their contrast between causal estimation and decision-making. These are multiple uses of estimates, bias–variance tradeoffs, and the loss function.

First, a lot of causal inference is done with multiple uses in mind. The same estimates and confidence intervals might be used to test a theory, inform a directly related decision (e.g., whether to expand an experimental program), inform less closely related decisions (e.g., about a different implementation in a different country or market), and serve as inputs to a meta-analysis conducted years later. That is, these are often “multipurpose” estimates and models. So sometimes the choice of analysis (and how it is reported, such as reporting point estimates in tables) can be partially justified by the fact that the authors want to make their work reusable in multiple ways. This can also be true in industry — not just academic work — such as when an A/B test is used both to make an immediate decision (launch the treatment or not?) and to inform resource allocation (should we assign more engineers to this general area?).

To link this “multipurpose” property more explicitly to loss functions: meta-analysis (or less formal reviews of the literature) can be one reason to value reporting (perhaps not as the only analysis) estimates that are (nominally) unbiased. Aronow & Middleton (2013) write:

Unbiasedness may not be the statistical property that analysts are most interested in. For example, analysts may choose an estimator with lower root mean squared error (RMSE) over one that is unbiased. However, in the realm of randomized experiments, where many small experiments may be performed over time, unbiasedness is particularly important. Results from unbiased but relatively inefficient estimators may be preferable when researchers seek to aggregate knowledge from many studies, as reported estimates may be systematically biased in one direction.

So the fact that others are going to use the estimates in some not-entirely-anticipated ways can motivate preferring unbiasedness (even at the cost of higher variance and thus higher squared error). [Update: Andrew points out in the comments that, of course, conditional on seeing some results from an experiment, the estimates are not typically unbiased! I think this is true in many settings, though there can be exceptions, such as when all experiments are run through a single common system or process.]
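
As a rough sketch of the Aronow & Middleton point (with made-up bias and noise levels of my own choosing): per-study RMSE can favor a biased, low-variance estimator, but a simple pooled average across many small studies favors the unbiased one, because averaging shrinks noise but not bias.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ate = 0.10
n_experiments = 200

# Estimator A: unbiased but noisy (e.g., a simple difference in means in a small experiment).
unbiased = rng.normal(loc=true_ate, scale=0.30, size=n_experiments)

# Estimator B: lower variance but systematically biased in one direction
# (a hypothetical bias of +0.05).
biased = rng.normal(loc=true_ate + 0.05, scale=0.10, size=n_experiments)

rmse = lambda estimates: np.sqrt(np.mean((estimates - true_ate) ** 2))

# Per-study RMSE favors the biased estimator...
print(f"per-study RMSE: unbiased {rmse(unbiased):.3f}, biased {rmse(biased):.3f}")

# ...but a simple meta-analytic average favors the unbiased one,
# because averaging shrinks the noise but not the bias.
print(f"error of pooled mean: unbiased {abs(unbiased.mean() - true_ate):.3f}, "
      f"biased {abs(biased.mean() - true_ate):.3f}")
```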

Preferring unbiasedness thus sidesteps the bias–variance tradeoff by at least pretending to have a lexicographic preference for one over the other. But as long as our loss function is something like MSE, we will often want to use potentially confounded observational data to improve our estimates. In some cases, it might even be inadmissible to neglect such big, “bad” data.
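
Here is a minimal sketch of that MSE logic (toy numbers and a naive fixed weighting, not a recommended estimator): shrinking a noisy but unbiased experimental estimate toward a precise but confounded observational estimate can reduce MSE, even though the combined estimate is no longer unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ate = 0.10
n_sims = 50_000

# Noisy but unbiased experimental estimate vs. precise but confounded
# observational estimate (a hypothetical bias of +0.08).
experimental = rng.normal(true_ate, 0.20, n_sims)
observational = rng.normal(true_ate + 0.08, 0.02, n_sims)

mse = lambda estimates: np.mean((estimates - true_ate) ** 2)

for weight in (1.0, 0.5, 0.2, 0.0):  # weight placed on the experimental estimate
    combined = weight * experimental + (1 - weight) * observational
    print(f"weight on experiment = {weight:.1f}: MSE = {mse(combined):.4f}")
```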

Lastly, as Fernandez-Loria and Provost note, the loss function (or objective function, as they put it) can substantially change a problem. If our only decision is whether or not to launch the treatment to everyone, then error and uncertainty that don’t result in us making the wrong (binary) decision incur no loss. The same is not true for MSE. We make a related point in our paper on learning targeting policies using surrogate outcomes, where we use historical observational data to fit a model that imputes long-term outcomes using only short-run surrogates. If one is trying to impute long-term outcomes to estimate an ATE or conditional ATEs (CATEs for various subgroups), then it is pretty easy for violations of the assumptions to result in error. However, if one is only making binary decisions, then these violations have to be large enough to flip some signs before you incur loss. So using a surrogacy model can be justified under weaker assumptions if one is “just” doing causal decision-making.

However, when does this make a difference? If many true CATEs are near zero (the decision boundary), then just a little error in estimating them (perhaps due to the surrogacy assumptions being violated) will still result in loss. So how important this difference in loss function is may depend importantly on the true distribution of treatment effects.
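
To make that last point concrete, here is a small simulation sketch (my own made-up parameters, not from our paper): holding estimation error fixed, the decision loss from flipped signs is a small fraction of the attainable value when true CATEs are spread far from zero, and a large fraction when they cluster near the decision boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, error_sd = 200_000, 0.10   # same estimation error in both scenarios

for scenario, cate_sd in [("CATEs spread far from zero", 0.50),
                          ("CATEs clustered near zero ", 0.05)]:
    true_cate = rng.normal(0.0, cate_sd, n_units)
    estimate = true_cate + rng.normal(0.0, error_sd, n_units)  # e.g., surrogacy error

    # Policy: treat iff estimated CATE > 0. Loss accrues only when the
    # estimate flips the sign of the true effect.
    wrong_sign = (estimate > 0) != (true_cate > 0)
    regret = np.mean(np.abs(true_cate) * wrong_sign)
    best_possible = np.mean(np.maximum(true_cate, 0.0))  # value of the oracle policy

    mse = np.mean((estimate - true_cate) ** 2)
    print(f"{scenario}: MSE = {mse:.3f}, regret = {regret:.4f} "
          f"({regret / best_possible:.0%} of the attainable value)")
```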

Thus, I agree that causal decision-making is often different from causal estimation and inference. However, some of this is because of particular, contingent choices (e.g., to value unbiasedness over reducing MSE) that make a lot of sense when estimates are reused, but may not make sense in some applied settings. So I perhaps wouldn’t attribute so much of the difference to the often binary or categorical nature of decisions to assign units to treatments; instead, I would pin it on “single-purpose” vs. “multipurpose” differences between what we typically think of as decision-making and estimation.

I’d be interested to hear from readers about how much this all matches your experiences in academic research or in applied work in industry, government, etc.

[This post is by Dean Eckles.]

Update: I adapted this post into a commentary on Fernandez-Loria and Provost’s paper. Both are now published in the INFORMS Journal on Data Science.

17 thoughts on “How different are causal estimation and decision-making?”

  1. Professor Gelman,

    Cool post! And probably a cool paper too (I haven’t read it yet). What about using a loss function that has some form of causal content? For example, the MSE with respect to, possibly held-out, experimental (interventional) data? That probably gets the best of both worlds, in the sense that decisions are made based on both the quality of the estimates and causal content. I would like to know what you think about that.

  2. Dean:

    I disagree with this sentence: “However, in the realm of randomized experiments, where many small experiments may be performed over time, unbiasedness is particularly important. Results from unbiased but relatively inefficient estimators may be preferable when researchers seek to aggregate knowledge from many studies, as reported estimates may be systematically biased in one direction.”

    Real life experimental results are not unbiased. Biases can be huge, especially in small experiments. This has come up many times on this blog, and here’s a published article making the point.

    Even beyond the huge problems of selection biases, the idea that experimental estimates are unbiased is wrong because treatment effects vary. At best, your experimental estimate is an unbiased estimate of the average treatment effect among the people who happen to be in the experiment in the particular scenario being studied.

    See here for more on the topic.

    • I anticipated you might object to the privileging of unbiasedness! And I particularly agree that often there is selection bias in which estimates we hear about and which get included in any subsequent analysis. (Some of that, though, is mitigated in settings where there is a record of the results of all experiments conducted, such as in some parts of some tech companies.)

      But also I must have messed up the formatting (now fixed), as that particular statement is supposed to be a quote from Aronow & Middleton!

      I actually had missed your paper, so I will incorporate that into further writing on this. I completely agree that a focus on unbiasedness (note also that I modified that with “nominally” at one point) often doesn’t make sense. But I do think computing and storing unbiased estimates arising from randomized experiments is a good idea — even if they need not be given top billing in resulting dashboards, reports, or papers.

  3. I often encounter (sometimes in myself) the attitude that “it’s hard enough to estimate the ATE, let’s not attempt the optimal treatment rule”. The thought process seems to basically be that since the ATE is just one number it’s easier to get right than an entire function (treatment rule). Among other things, this post is a nice reminder that that’s not necessarily true.

  4. Great post.

    In portfolio management, the decision is to invest a proportion of your money in a number of asset classes (or securities). The typical Bayesian approach to portfolio management is to get the posterior predictive distribution of what you may invest in and calculate the mean and covariance of that. From these Bayesian estimates, you can perform mean-variance optimization, as you would classically (the objective function is w’ * mu – 0.5 * lambda * w’ * sigma * w, where w is an Nx1 vector and lambda is a scalar).
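
    For concreteness, a minimal sketch of that step (assuming unconstrained weights, which is a simplification of real portfolio problems): with objective w’ * mu – 0.5 * lambda * w’ * sigma * w, the unconstrained optimum is w* = (lambda * sigma)^(-1) * mu, with mu and sigma computed from posterior predictive draws.

```python
import numpy as np

# Hypothetical posterior predictive draws of returns for N = 3 assets
# (in practice these would come from your Bayesian model, not a fixed MVN).
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(
    mean=[0.05, 0.03, 0.07],
    cov=[[0.04, 0.01, 0.00],
         [0.01, 0.02, 0.00],
         [0.00, 0.00, 0.09]],
    size=50_000,
)

mu = draws.mean(axis=0)              # posterior predictive mean returns
sigma = np.cov(draws, rowvar=False)  # posterior predictive covariance
lam = 4.0                            # risk-aversion scalar

# Unconstrained maximizer of w' mu - 0.5 * lam * w' sigma w.
w_star = np.linalg.solve(lam * sigma, mu)
print("optimal weights:", np.round(w_star, 3))
```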

  5. In my experience with industry A/B testing, almost all that matters to business people is the decision-making portion, but A/B testing tools are built purely with causal estimation in mind. This leads to situations where something being tested has a positive but non-significant point estimate with uncertainty, and there is then confusion about what action to take. There are exceptions, of course. Optimizing marketing spend is one, because the effect size can imply a level of investment.

    • Agreed to some degree. But one addendum is that often the results of one experiment are used to make other decisions besides launch-or-not, such as resource allocation decisions (should we put more engineers on that part of the product? should we produce more new marketing creatives? etc.)

      • Fair point. A large chunk of resource allocation decisions do come before the A/B test has been run, in order to build out the product feature, and changes in resource allocation come when estimated effects are larger than expected.

  6. I’ll offer an extreme example.

    In medical research, the discovery pipeline is often (0) observational data analysis, (1) cells in a petri dish, (2) mice, (3) clinical trials. The goal is to figure out whether a new therapy would benefit patients on average (i.e. positive ATE).

    Yet at every step in the pipeline, we are actually computing effect sizes on entirely different populations which may include entirely different organisms treated with different doses under different conditions with potentially only approximately the same disease as the target population!

    That this pipeline works at all is amazing from a statistical perspective. But fundamentally the reason it (sometimes) works is because all these confounding factors make it impossible to estimate the ATE we care about but not impossible to estimate a decision rule. At each step we get some noisy evidence of whether this drug will work. Sometimes the confounding at one stage is strong enough to flip the signal (e.g. all the COVID research done on vero cells because they express one receptor of interest but not the alternative receptors that human lung cells express). When that happens, we have some intermediate cost because we move to the next stage of the pipeline and inevitably fail. But by trying this in progressively more expensive and higher fidelity systems, we hope to reduce the overall cost to discover the correct rule. There’s probably a lot there to be explored from a theory perspective, but it’s messy and complicated and there aren’t a ton of easy application datasets for stats papers to benchmark on.

    • Great example. Reminds me of some discussion of what the costs of errors are in a GWAS: a Type I error wastes a year of a grad student’s life (and associated funding).

  7. Many of these considerations arose in meta-analysis, primarily because of the way people conducted and reported on results, often selectively.

    But the obvious solution is just to make the raw data available along with adequate descriptions of what was planned and what happened in the study. Perhaps with privacy restrictions if necessary. Sander Greenland and I have been arguing for that since 2000 if not the 1980s. Now if these studies are all being done within one firm, it’s just silly that this is not being done. The cult of summaries?

    In particular, my DPhil thesis dealt with the lack of closed-form likelihoods when authors chose to report non-sufficient summaries, either because the data were far from Normally distributed or because the authors thought certain summaries were inappropriate. If group means and sds were reported, approximate Normality usually minimized the first problem, but unfortunately some authors (likely based on their intro stats course) refused to report means and sds when the data looked skewed. In the thesis, I realized it would be impossible to decide on a best summary based on the current study alone. Later repeated studies might clearly show that the assumptions made in the first study were mistaken.

    • “Now if these studies are all being done within one firm, it’s just silly that this is not being done. The cult of summaries?”
      When each experimental condition involves, say, 2e7 units, each with multiple outcomes, that is a lot of data to store and a substantial amount of computation to (re)do for a meta-analysis. And there are privacy regulation reasons to not even retain the raw data beyond some period.

      • Always contingencies of reality to deal with, but I would not be using the usual summary statistics. For instance, for the largest number of distinct subgroups, keep, say, 7 to 9 order statistics, so that a closed-form likelihood will be available for most distributions while losing the least information.

        The main point is to keep as high a level of granularity as is feasible …

        • I think many tech firms keep things like mean(x), mean(winsorize(x, p)), mean(x > 0), mean(log(x+1)), and associated SDs / bootstrap SEs.
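
          For what it’s worth, a tiny sketch of computing summaries of that kind for one metric in one experiment arm (the winsorization quantile here is an arbitrary choice, and the bootstrap SEs are omitted):

```python
import numpy as np

def arm_summaries(x, p=0.99):
    """Summaries of the kind listed above, for one metric in one arm."""
    x = np.asarray(x, dtype=float)
    x_wins = np.minimum(x, np.quantile(x, p))  # one-sided winsorization at quantile p
    return {
        "mean": x.mean(),
        "sd": x.std(ddof=1),
        "mean_winsorized": x_wins.mean(),
        "sd_winsorized": x_wins.std(ddof=1),
        "frac_positive": (x > 0).mean(),
        "mean_log1p": np.log1p(x).mean(),      # assumes a nonnegative metric
    }

# Example: heavy-tailed, mostly-zero outcomes, as is typical of per-user metrics.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.1, 1_000_000) * rng.lognormal(3.0, 1.5, 1_000_000)
print(arm_summaries(x))
```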
