Causal and predictive inference in policy research

Todd Rogers pointed me to a paper by Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer that begins:

Empirical policy research often focuses on causal inference. Since policy choices seem to depend on understanding the counterfactual—what happens with and without a policy—this tight link of causality and policy seems natural. While this link holds in many cases, we argue that there are also many policy applications where causal inference is not central, or even necessary.

Kleinberg et al. start off with the example of weather forecasting, which indeed makes their point well: Even if we have no ability to alter the weather or even to understand what makes it do what it does, if we’re able to forecast the weather, this can still help us make better decisions. Indeed, we can feed a probabilistic forecast directly into decision analyses.
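To make that concrete, here’s a toy expected-loss calculation with made-up probabilities and costs (a sketch in Python, not anything from the paper):

    # Toy decision analysis: the forecast probability is an input, not something we model.
    p_rain = 0.3            # forecast probability of rain
    cost_umbrella = 1.0     # made-up nuisance cost of carrying an umbrella
    loss_soaked = 10.0      # made-up cost of getting caught in the rain without one

    expected_loss_carry = cost_umbrella
    expected_loss_skip = p_rain * loss_soaked
    decision = "carry the umbrella" if expected_loss_carry < expected_loss_skip else "leave it home"
    print(decision)  # "carry the umbrella" here, since 1.0 < 3.0

The forecast’s quality matters, but nothing in the decision step requires knowing why it rains.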

On the other hand, if you want to make accurate forecasts, you probably will want a causal model of the weather. Consider real-world weather forecasts, which use a lot of weather-related information and various causal models of atmospheric dynamics. So it’s not that causal identification is required to make weather decisions—but in practice we are using causal reasoning to get our descriptively accurate forecasts.

Beyond the point that we can make decisions based on non-causal forecasts, Kleinberg et al. also discuss some recent ideas from machine learning, which they apply to the problem of predicting the effects of hip and knee replacement surgery. I don’t really know enough to comment on this application, but it seems reasonable. As with the weather example, you’ll want to use prior knowledge and causal reasoning to gather a good set of predictors and combine them well if you want to make the best possible forecasts.

One thing I do object to in this paper, though, is the attribution to machine learning of ideas that have been known in statistics for a long time. For example, Kleinberg et al. write:

Standard empirical techniques are not optimized for prediction problems because they focus on unbiasedness. . . . Machine learning techniques were developed specifically to maximize prediction performance by providing an empirical way to make this bias-variance trade-off . . . A key insight of machine learning is that this price λ [a tuning parameter or hyperparameter of the model] can be chosen using the data itself. . . .

This is all fine . . . but it’s nothing new! It’s what we do in Bayesian inference every day. And it’s a fundamental characteristic of hierarchical Bayesian modeling that the hyperparameters (which govern how much partial pooling is done, or the relative weights assigned to different sorts of information, or the tradeoff between bias and variance, or whatever you want to call it) are inferred from the data.
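To illustrate, here’s a minimal sketch of that idea in Python, using the familiar eight-schools numbers and, for brevity, an empirical-Bayes point estimate of the pooling scale rather than the full Bayes analysis:

    import numpy as np

    # Eight-schools-style setup: J group estimates y_j with known standard errors s_j.
    y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
    s = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

    def log_marginal(tau):
        # Marginal log-likelihood of the data given the pooling scale tau,
        # with the group-level mean profiled out by precision weighting.
        v = s ** 2 + tau ** 2
        mu = np.sum(y / v) / np.sum(1 / v)
        return -0.5 * np.sum(np.log(v) + (y - mu) ** 2 / v)

    # "Choosing the tuning parameter from the data": scan a grid of pooling scales.
    taus = np.linspace(0.01, 30, 300)
    tau_hat = taus[np.argmax([log_marginal(t) for t in taus])]

    # Partial pooling: each estimate is shrunk toward the common mean by a
    # data-determined amount (noisier groups and smaller tau_hat mean more shrinkage).
    v = s ** 2 + tau_hat ** 2
    mu_hat = np.sum(y / v) / np.sum(1 / v)
    shrinkage = s ** 2 / (s ** 2 + tau_hat ** 2)
    theta_hat = shrinkage * mu_hat + (1 - shrinkage) * y

Full Bayes would put a prior on tau and average over it rather than plugging in a point estimate, but the basic point stands: the amount of pooling, that is, the bias-variance knob, is set by the data.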

It’s great for economists and other applied researchers to become aware of new techniques in data analysis. It’s also good for them to realize that certain ideas, such as the use of predictive models for decision making, have been around in statistics for a long time.

For example, here’s a decision analysis on home radon measurement and remediation that my colleagues and I published nearly twenty years ago. The ideas weren’t new then either, but I think we did a good job at integrating decision making with hierarchical modeling (that is, with tuning parameters chosen using the data). I link to my own work here not to claim priority but just to convey that these ideas are not new.

Again, nothing wrong with some economists writing a review article drawing on well known ideas from statistics and machine learning. I’m just trying to place this work in some context.

34 thoughts on “Causal and predictive inference in policy research”

  1. “…it’s a fundamental characteristic of hierarchical Bayesian modeling that the hyperparameters (which govern how much partial pooling is done, or the relative weights assigned to different sorts of information, or the tradeoff between bias and variance, or whatever you want to call it) are inferred from the data.”

    Just to make sure I understand this, did you mean that the hyperpriors on the hyperparameters govern how much partial pooling is done, etc?

    • It also can be the value of a hyperparameter. Suppose, for example, that you have a hyperparameter W that governs the width of a distribution describing uncertainty about hyperparameters Q_i, which in turn govern predictions for different groups…

      after seeing the data, the posterior distribution for W could be quite tight around, say, 10… which means the Q_i are distributed according to some common distribution with width of about 10.

      It isn’t necessarily the *prior* on W that governs the spread of the Q_i; it’s the posterior distribution of W, which reflects both the prior and the data.

      • This is exactly the situation I had in mind, thanks. I was thinking that if I have a prior on W, say

        W ~ Normal(0,100) I[0,] ## truncated

        vs another possible prior

        W ~ Unif(0,200)

        vs another

        W ~ Gamma(0.001,0.001)

        I could end up with very different posteriors for the Q_i (perhaps if I don’t have a huge amount of data and the prior is influential in determining the posterior; sparse data but a lot of prior knowledge is one case where Bayesian methods outperform anything else, and therefore the most interesting case). A small grid sketch comparing these hyperpriors appears at the end of this thread.

        • In a hierarchical model, the sample size associated with the number of _centers_ is usually small, so any between-center modelling is poorly informed by the data and would benefit from priors that are not so uninformative.

          We discussed the bias-variance trade-off very explicitly (“formal bias-variance trade-off methods such as hierarchical (random-coefficient) meta-regression”) in this Greenland and O’Rourke paper: http://biostatistics.oxfordjournals.org/content/2/4/463.full.pdf
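        To make the comparison concrete, here’s a minimal grid sketch (Python; made-up group estimates and standard errors, the group-level mean fixed at zero, and the Normal(0,100) read as having standard deviation 100) of how the posterior for W can differ under the three hyperpriors above when the data are sparse:

          import numpy as np
          from scipy import stats

          # Hypothetical data: a handful of group estimates q_i with known standard errors.
          q = np.array([3.0, -1.5, 8.0, 2.0])
          se = np.array([4.0, 5.0, 4.5, 6.0])

          W_grid = np.linspace(0.01, 200, 4000)

          # Marginal log-likelihood of the data given the width W (group-level mean fixed at 0).
          v = se[:, None] ** 2 + W_grid[None, :] ** 2
          log_lik = -0.5 * np.sum(np.log(v) + q[:, None] ** 2 / v, axis=0)

          hyperpriors = {
              'Normal(0,100) truncated at 0': stats.halfnorm(scale=100).logpdf(W_grid),
              'Uniform(0,200)':               stats.uniform(0, 200).logpdf(W_grid),
              'Gamma(0.001,0.001)':           stats.gamma(0.001, scale=1000).logpdf(W_grid),  # shape 0.001, rate 0.001
          }

          for name, log_prior in hyperpriors.items():
              w = np.exp(log_prior + log_lik - np.max(log_prior + log_lik))
              w /= w.sum()                              # normalize over the (uniform) grid
              print(name, '-> posterior mean of W ~', round(float((W_grid * w).sum()), 1))

        With only a handful of groups the likelihood for W is fairly flat, so the three hyperpriors give noticeably different posteriors; with many groups the data would largely overwhelm the choice.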

  2. Machine Learning is dominated by Bayesian methods. See “Machine Learning, A Probabilistic Perspective”, (Kevin P. Murphy, MIT Press 2012), for example. So I imagine a fair number of researchers are first exposed to Bayes in the context of Machine Learning — conflating the two — and don’t see the context that long-time Bayesian practitioners (warriors?) can provide.

    • Bayesian Warriors hunh? Can we get a secret handshake and an engraved pocket knife, and maybe adopt some kind of special haircut like the samurai topknot? maybe a 5 inch ponytail with a special golden paracord wrapping…

  3. Why are you considering statistics or Bayesian methods as disjoint from machine learning?

    ML seems to draw on a lot of techniques, definitely including those from statistics and Bayesian methods.

    • Rahul:

      It seems funny to say, “A key insight of machine learning is that this price λ [a tuning parameter or hyperparameter of the model] can be chosen using the data itself” when this insight was well known in statistics for decades before this was being done in the field called “machine learning.” Statisticians were writing about full Bayes estimation of hyperparameters from data back in 1965, and I think the animal-breeding people were doing point estimation of hyperparameters back in the 1950s. OK, sure, just about every technique has a history that can be traced back forever. It just seems silly to call this “a key insight of machine learning.” If you don’t want to call it “a key insight of statistics,” why not “a key insight of data analysis” or just “a key insight”? I’m guessing that the authors just don’t realize that applied researchers and theoretical statisticians have been talking about this stuff for over half a century.

  4. The field of machine learning certainly did not invent the tuning parameter / penalized optimization / hyperparameter or whatever you want to call it depending on your subfield.

    But what I see as the major shift is the focus on prediction rather than inference about effects. We statisticians somewhat reflexively dislike the garden-of-forking-paths issue when trying to estimate covariate effects (I don’t think I need to argue on this blog that that’s not a terrible reflex in that context). The ML crowd correctly recognizes that in prediction the forking-paths issues are not as worrisome and in fact should be embraced, even taken to extremes, as long as you have some method of ensuring that your out-of-sample prediction is not hurting as a result; a bare-bones sketch of that logic appears at the end of this comment. This completely changes how they approach problems, and justifies running models in a more automated manner.

    Where they often go way off the deep end is in using vanilla ML methods for causal inference; I can’t tell you how many ML people I’ve talked to who said “I was tasked with finding which features are the most influential in *changing* behavior, so I built a vanilla ML model and picked off the effects with the greatest predictive power”. After about 1 week in a causal inference class, I’m sure they would recognize how bad an idea this is.

    As such, I’m very curious to see upcoming work on weaving ML methods together with causal inference. It seems quite non-trivial to me, and I would be cautious about snake-oil sales, but then again I’m sure there are people much smarter than me, with much better ideas than I have, working on the problem right now.
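    Here’s a bare-bones sketch of that out-of-sample logic (Python; simulated data and plain ridge regression, purely as an illustration): the penalty is picked by cross-validated error alone, with no inferential justification.

      import numpy as np

      rng = np.random.default_rng(1)
      n, p = 200, 50
      X = rng.normal(size=(n, p))
      beta = np.concatenate([rng.normal(size=5), np.zeros(p - 5)])   # only 5 of 50 features matter
      y = X @ beta + rng.normal(size=n)

      folds = np.array_split(rng.permutation(n), 5)                  # one fixed 5-fold split

      def ridge_coef(X_tr, y_tr, lam):
          # Closed-form ridge solution (no intercept; the simulated data are centered).
          return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]), X_tr.T @ y_tr)

      def cv_mse(lam):
          # Average squared prediction error on held-out folds for a given penalty.
          errs = []
          for held in folds:
              train = np.setdiff1d(np.arange(n), held)
              b = ridge_coef(X[train], y[train], lam)
              errs.append(np.mean((y[held] - X[held] @ b) ** 2))
          return np.mean(errs)

      lams = np.logspace(-2, 3, 30)
      best_lam = lams[np.argmin([cv_mse(l) for l in lams])]
      print("penalty chosen by held-out error:", best_lam)

    Nothing here protects the individual coefficients from overinterpretation; the only thing being guarded is prediction error on held-out data.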

    • I agree, and I recognize the same issues. Kleinberg et al. are merely pointing out that in principle we don’t need causal models to answer non-causal questions, although I’m not sure why in 2016 there are still people who need to be reminded of that (in any case, as Andrew points out, causal knowledge may help with the modeling; I’d add that this is particularly handy when we want to extrapolate to populations other than the one where we collected the data). But in recent years I’ve seen a modest but resilient stream of embarrassing machine learning papers that try to do “prediction” instead of “causal inference” to answer causal questions.

    • True, but the paper was published in the AER Papers & Proceedings issue (the non-refereed issue of the best, or at least best-connected, conference papers presented at the annual AEA meetings), and there is at least one economist in the group (Mullainathan), so Prof. Gelman is basically right, although it would be more accurate to talk about “researchers in computer science & economics.”

  5. There’s a fundamental desire to describe any technique as new, or new in a field.

    You might like this example: About 2009 one of our division presidents asked our legal department to trademark the term “propensity scores” and start the process to patent the method we were using to generate them.

    Luckily, they checked with us first.

  6. > On the other hand, if you want to make accurate forecasts, you probably will want a causal model of the weather. Consider real-world weather forecasts, which use a lot of weather-related information and various causal models of atmospheric dynamics. So it’s not that causal identification is required to make weather decisions—but in practice we are using causal reasoning to get our descriptively accurate forecasts.

    I think this is where some machine learners would disagree. No model can hope to predict well if it doesn’t end up approximating some (reduced form) of the true, causal process that operates in reality. But they’d disagree that you can only arrive at such an approximation through causal reasoning. Remarkably often, taking a large data set and unleashing an algorithm that automatically finds patterns that extrapolate within that data set seems to work well. The selling point of machine learning seems to be that you can make accurate predictions without a thorough causal understanding of what’s going on.

    Of course, scientists usually want to understand what’s actually going on under the hood for the phenomena they study, and, as you say, because reality is causal, the best model will ultimately be causal too.

    • Anon:

      Fair enough: it can sometimes work to fit a big black-box model and then interpret it causally, perhaps at that point adding some structural information to help the model work better. But real-world weather forecasts do use a lot of weather-related information and various causal models of atmospheric dynamics, no?

    • My problem is with scientists who claim to have a “causal” model without bothering much about validating the model and its predictive content on out-of-sample data.

      That your model has some sort of structure derived from fundamentals means little until you show me that the model does a good job of predicting.

      I’d rather take a good-predictive-power black-box model than some intricate causal model with crappy predictive power.

      • I’m all for predictive checks, but if the question is causal, that means you need to intervene on the system to get the test data, which might not be available. Even in that case, we can and should still check how well the model behaves on observational predictions. But the best model ever for weather forecasting will not be enough to make claims about global warming.

    • “No model can hope to predict well if it doesn’t end up approximating some (reduced form) of the true, causal process that operates in reality.”

      This statement is simply not true. One could easily be assessing correlates rather than variables that are actually causally related to each other and erroneously conclude causality. What then happens is we make decisions based on this, label entities based on it, and enshrine this belief in our shared understanding as if it is causal.

      I simply do not understand this belief that simply assuming causality will produce a causal relationship from associative methods.

      I agree that purely associative methods can be useful. But I also believe that they can be dangerous if they are not interpreted literally as correlative rather than figuratively as causal.

        • This is a really good point. A friend of mine, a very good machine learner, was contracted to do some price-sensitivity work for an FMCG firm. He built an excellent predictive model of sales given prices and product/location/time characteristics. A big black-box thing; it cross-validated very nicely. He then perturbed prices and generated new predictions to get the change in predicted sales from a change in price. Surprisingly to him (but probably not to a structuralist), the relationship was positive: an increase in prices drove up predicted sales. (A toy simulation of how this can happen appears below.)

          Somehow I think that the message “slow down, think about the generative process deeply” hasn’t caught on across big parts of the predictive modeling community.
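          A toy simulation of how that sign flip can arise (made-up numbers, with a plain regression standing in for the black box; price tracks unobserved demand, so the predictive model answers the interventional question backwards):

            import numpy as np

            rng = np.random.default_rng(0)
            n = 100_000

            # Hypothetical generative story: sellers raise prices when underlying demand is high.
            demand = rng.normal(0, 1, n)                                   # unobserved by the modeler
            price = 2 + 1.5 * demand + rng.normal(0, 0.5, n)
            sales = 10 + 3 * demand - 1.0 * price + rng.normal(0, 1, n)    # true causal price effect: -1

            def slope(X, y):
                # Least-squares coefficients, with an intercept column added and then dropped.
                X = np.column_stack([np.ones(len(y)), X])
                return np.linalg.lstsq(X, y, rcond=None)[0][1:]

            print(slope(price[:, None], sales))                    # ~ +0.8: higher price "predicts" higher sales
            print(slope(np.column_stack([price, demand]), sales))  # ~ [-1.0, 3.0]: the causal story, once demand is included

          The price-only model predicts observational sales just fine; it only falls apart when you ask it an interventional question.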

      • Curious,

        My statement wasn’t meant to express anything contrary to what you said. I agree with you.

        All I really meant was that estimating P(Y|X) from data on Y and X can end up predicting well only if P(Y|X) continues to be a good characterisation of how Y and X are related in the future, or in different settings. In time series, for example, that’s often just not the case. You might write down P(Y|X) and claim your data are draws from this distribution, but the relationship is actually just shifting around and changing over time, and what you’re estimating is an average of apples, oranges, and pears. In other settings it’s invariant through time, and then you really can predict well, without necessarily having an understanding of why Y and X are related this way.

        I’m not at all saying that from a discovered relation between Y and X that seems to be enduring, we can conclude that one causes the other or that we can predict what happens if we interfere in the system and change the variables. But we can conclude that either one causes the other or some sort of hidden causal structure, like a set of common causes with possibly complex internal structure, makes it so; scientists would and should want to figure that out, whereas those merely concerned with prediction could stop there.

  7. “On the other hand, if you want to make accurate forecasts, you probably will want a causal model of the weather. Consider real-world weather forecasts, which use a lot of weather-related information and various causal models of atmospheric dynamics. So it’s not that causal identification is required to make weather decisions—but in practice we are using causal reasoning to get our descriptively accurate forecasts.”

    Is causation even a concept in the physical sciences? I’ve only ever come across it being used in relation to (hypothetical) interventions, which is fairly rare (e.g., anthropogenic causes of climate change). I think what you call causal models, they would call process-based (or mechanistic) models.

  8. Weather forecasting seems like a strange example to use here. Weather and climate forecasting is usually achieved by using theory-based models that directly model the causal processes producing changes in the weather (or climate). They’re rarely based on statistical or correlative methods.
