Causal inference in AI: Expressing potential outcomes in a graphical-modeling framework that can be fit using Stan

David Rohde writes:

We have been working on an idea that attempts to combine ideas from Bayesian approaches to causality developed by you and your collaborators with Pearl’s do calculus. The core idea is simple but, we think, powerful, and it allows some problems that previously had known solutions only within the do calculus (in particular the front door rule) to be solved in the Bayesian framework.

In order to make the idea accessible we have produced a blog post (featuring animations), an online talk and technical reports. All the material can be found here.

Currently we focus on examples and intuition; we are still working on proofs. Although we don’t emphasise it, our idea is quite compatible with probabilistic programming languages like Stan, where the probability of different outcomes under different counterfactual actions can be computed in the generated quantities block.
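
Here is a minimal sketch of what that might look like, assuming a binary treatment t, a binary outcome y, and a single measured confounder z adjusted for via the back door (the model and variable names are illustrative, not taken from our reports):

data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> z;  // measured confounder
  array[N] int<lower=0, upper=1> t;  // observed treatment
  array[N] int<lower=0, upper=1> y;  // observed outcome
}
parameters {
  real a;    // outcome-model intercept
  real b_t;  // treatment coefficient (logit scale)
  real b_z;  // confounder coefficient
}
model {
  a ~ normal(0, 2);
  b_t ~ normal(0, 2);
  b_z ~ normal(0, 2);
  for (n in 1:N)
    y[n] ~ bernoulli_logit(a + b_t * t[n] + b_z * z[n]);
}
generated quantities {
  // Predictive probability of y* = 1 under each counterfactual
  // treatment t*, averaging over the empirical distribution of z.
  real p_y_t0 = 0;
  real p_y_t1 = 0;
  for (n in 1:N) {
    p_y_t0 += inv_logit(a + b_z * z[n]) / N;
    p_y_t1 += inv_logit(a + b_t + b_z * z[n]) / N;
  }
}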

I took a quick look. What they’re saying about causal inference being an example of Bayesian inference for latent variables makes sense; I think this is basically the perspective of Rubin (1974). I think this is a helpful way of thinking, so I’m glad to see it being expressed in a different language. I’d recommend adding Rubin (1974) to your list of references. This is also the way we discuss causal inference in our BDA book (in all editions, starting with the first edition in 1995), where we take some of Rubin’s notation and explicitly integrate it into a Bayesian framework. But the causal analyses we do in BDA are pretty simple; it seems like a great idea to express these general ideas in a more computing-friendly framework.

Regarding causal inference in Stan: I think that various groups have been implementing latent-variable and instrumental-variables models, following the ideas of Angrist, Imbens, and Rubin, but generalizing to allow prior information and varying treatment effects. It’s been a while since I’ve looked at the Bayesian instrumental-variables and latent-variables literature, but my recollection is that I thought things could be improved using stronger priors: a lot of the pathological results that arise with weak instruments can be traced to bad behavior in the limit under weak priors. These are the sorts of examples where a full Bayesian inference can be worse than some sort of maximum likelihood or marginal maximum likelihood because of problems of integrating over the distribution of a ratio whose denominator is of uncertain sign.
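
To see where that ratio comes from: in the simplest linear instrumental-variables setup, with instrument z, treatment t, and outcome y, the estimand is

\beta_{IV} = \frac{\mathrm{Cov}(z, y)}{\mathrm{Cov}(z, t)},

and with a weak instrument the posterior for the denominator \mathrm{Cov}(z, t) straddles zero, so integrating over the ratio produces heavy tails and sign flips unless the prior constrains it.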

I guess what I’m saying is that I think there are some important open problems of statistical modeling here. Improvements in conceptualization and computation (such as may be demonstrated by the above-linked work) could be valuable in motivating researchers to push forward on the modeling as well.

36 Comments

  1. Roy says:

    Along the same lines is a recent JBES paper that shows the equivalence of the back-door criterion and Friedman’s partial dependence plots; see:

    Qingyuan Zhao & Trevor Hastie (2019), “Causal Interpretations of Black-Box Models,” Journal of Business & Economic Statistics. https://doi.org/10.1080/07350015.2019.1624293

    The more these connections are made, the greater the understanding that will arise, and perhaps increased applicability. I have read a good deal of Pearl’s work; as I have said before on this blog, I find his work interesting but frustrating, at least for the work I do. I have multivariate dynamic data, correlated in time and space, with feedback between the variables (real physical systems are like that), or at least the data at the time scale we can obtain have feedback. So it has never been clear to me how you do a DAG in this situation; references to toy models don’t solve that, and while I have run across several articles that are getting close to these types of situations, none of the ones I have been referred to to date really deal with this problem.

    But most of the models used to describe these systems can be put into a latent-variable format (that is, a state space), so perhaps the increased connections with latent-variable models can also provide ways of using these methods when there is dynamic feedback.

  2. I saw this paper at the causality workshop at NeurIPS. But, as I told the author then, this has been tried before:

    Winn, John. “Causality with gates.” Artificial Intelligence and Statistics. 2012.
    Link to article: http://proceedings.mlr.press/v22/winn12/winn12.pdf

    I personally find Lattimore and Rohde’s approach harder to grasp than Winn’s, but that’s just me.

    • David Rohde says:

      Thanks for reminding me of this; it is indeed a relevant paper. Winn isn’t writing down a full joint distribution over the observations, but you are right that this paper deserves more attention from us.

  3. David Rohde says:

    Thanks so much for posting!  I was expecting a long delay.

    With this work we can be in the difficult position of trying to sell the virtues of the Pearl approach to the Bayesian community and the Bayesian approach to the causal graphical model (CGM) community.

    FWIW, our argument that you can produce the same results as the do-calculus using plain probability theory is controversial in the CGM community, and not everybody accepts it.  We think this is the first time that the solution to the front door rule has been stated using pure probability theory.  Our argument that you can in general use probability instead of the do-calculus is unproven, but several examples suggest it is possible.  If anybody from that community has any objections, especially specific ones, it could be very illuminating for us.

    Something I think the Bayesian community can learn from Pearl: the do-calculus can be viewed as a set of rules for removing latent variables from a model – in the same way you can represent a mixture model both with and without a discrete latent variable.  Viewed this way, the do-calculus is not only about causality but is a set of useful results that can be applied to probabilistic modelling in general.  Doing this systematically is still an open question; the strangeness of the do-calculus to probabilistic eyes makes achieving this in a common language difficult.
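
    In the standard mixture example, the two equivalent representations are

    p(y \mid \pi, \mu) = \sum_{k} \pi_k \, \mathcal{N}(y \mid \mu_k, \sigma^2)
    \quad \text{vs.} \quad
    z \sim \mathrm{Categorical}(\pi), \; y \mid z \sim \mathcal{N}(\mu_z, \sigma^2),

    where the first form has marginalized the discrete latent z out of the second.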

    Removing latent variables from a model can have huge practical impacts.  In some of Pearl’s examples, like the front door rule or M-bias, the joint distribution over the unobserved latent variables can have multiple isolated modes and can be very difficult to approximate with a method like MCMC.  The fact that in these simple cases it isn’t necessary to do these nasty integrals, as shown by the do-calculus, is really remarkable, and underappreciated in the Bayesian community.
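
    For reference, the front door adjustment in the usual notation, with treatment x, mediator z, and outcome y, is

    p(y \mid \mathrm{do}(x)) = \sum_{z} p(z \mid x) \sum_{x'} p(y \mid x', z) \, p(x'),

    which involves only observed conditionals: the latent confounder never has to be integrated over explicitly.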

    On the other hand it isn’t always possible to remove latent variables.  The fact that Bayesian machinery can be applied to a causal problem that isn’t identifiable with the do-calculus is also noteworthy.  Of course prior distributions will have impacts in this situation even in the large data limit, but having a formal method to combine data with assumptions seems like progress.  The CGM community have seemed a little hostile to the idea that priors over latent variables (unobserved confounders) can help solve these sorts of problems.  Of course the assumptions matter and in a causal setting often their impacts do not reduce in the large data limit, we don’t dismiss this.

    • David:

      Curious if you have looked at SWIGs – http://people.tuebingen.mpg.de/p/causality-perspect/slides/Thomas_swig.pdf that try to bridge the two solitudes?

      • Andrew says:

        Keith:

        I took a quick look at these slides and just want to comment on a line there that says that Neyman (1923) was “popularized by Rubin (1974).” I don’t think that’s an accurate description. Rubin’s 1974 paper was original and important work on its own, in particular in the application of this idea to observational studies. For more background on this, see Rubin’s discussion of the translation of Neyman’s paper from 1990, in particular the last full paragraph on page 476. To label Rubin’s work in this area as a popularization is insulting and unnecessary. Popularization is great, and we deserve credit for it when it happens, but we should also get credit for our original contributions.

      • David Rohde says:

        Thanks Keith, that also looks relevant (similar but not the same).

        I don’t know it, although Finn might, yet another paper to read.

    • Ricardo Silva says:

      We can’t use plain probability theory in causality, and I don’t think this is controversial at all. There is no way of expressing the notion of intervention using the axioms of probability. An intervention is not a random variable in a causal graphical model. Interventions can appear as extra nodes, but these nodes themselves are not random variables just because they are added to a graph. In the potential outcome framework, we have indices indicating the (non-random) intervention level, which are constructions that again cannot be reduced to probability. The merit of causal graphical models/potential outcomes is to integrate the notion of intervention within probability as closely as possible, but not closer (e.g., we can overload the notion of independence with respect to intervention nodes in a graphical model, but this is not the same as probabilistic independence among random variables. For instance, we don’t test “X is independent of Y” in the same way if X is an intervention indicator as when X is a random variable).

      Incidentally, I’m partially “responsible” for the John Winn paper above getting accepted to AISTATS. It was accepted on the grounds of it providing a nice trick of expressing graphical independence in a more granular way. I had second thoughts about it because of the (flawed) argument that causality can be reduced to probability. In the end, the other (expert) reviewers were generous enough to overlook it in the light of the positive contributions. I even met John in the airport on the way to the conference and explained the above. Not sure if he was totally convinced, but at least he seemed to recognise there was more to it than drawing diagrams, which are just drawings if there is no model semantics (independence models, types of nodes, meaning of measurement etc.) behind them.

      • >We can’t use plain probability theory in causality, and I don’t think this is controversial at all.

        Replace “probability theory” with “Boolean logic”: would it be controversial? It should be, because it’s obviously false that Boolean logic plays no role in causality.

        So, what does your statement even mean? Perhaps you can elaborate.

        • Ben says:

          > There is no way of expressing the notion of intervention using the axioms of probability.

          Yes, the second sentence too.

        • Ricardo Silva says:

          I see how the word “use” was definitely ambiguous above. Surely we can use probability in causality, just like my post explicitly acknowledged. It would be crazy to say otherwise.

          What I intended to say is that we can’t reduce causality to probability. We can define what a random variable is from the axioms of probability. We can’t do the same with intervention. There is no lack of historical attempts at that: can we define what a mean is purely from probability? Sure. Can we define what a causal effect is purely from non-causal concepts? Many people have tried; zero succeeded. If you have a solution to that, please publish it.

          • We can’t reduce computer programming to probability either, we can’t reduce calculus to probability, we can’t reduce physics to probability… there’s more to life than probability.

            What we can say is that if your model y = f(a,b,c…) is a causal model of y, then probabilistic estimates of f(a2,b2,c2,…) − f(a1,b1,c1,…) are estimates of the causal effect of changing a1,b1,c1 to a2,b2,c2

            Where these models come from is the hard part. Saying that you can’t get f from pure probability theory ought to be something we don’t have to say; no one should be arguing for that, just as you can’t get Newton’s laws from pure probability theory.

            • Ricardo Silva says:

              Excellent, we agree. You had to resort to intuition, aka making your own axioms. But it’s a question worth asking, as many philosophers of science have raised it for a variety of reasons. After all, it is worthwhile knowing what causality is *not*, and I found the point above fairly useless in this regard.

              • Ricardo Silva says:

                Anyway, probability was just an example. The main point was about trying to define causality in non-causal terms. And my comment about identifiability was really what I wanted to emphasise…

      • “We can’t use plain probability theory in causality”

        I find this statement puzzling, and especially Pearl’s endorsement of this statement, when he himself has a section in his book Causality that discusses how to do exactly that: reduce a causal graph to a graphical (probability) model by introducing variables that indicate intervention. The end result is that causal graphs and the do-calculus are conveniences, not necessities.

        • Anonymous says:

          Most claims made by the statistarazzi are expressions of what they hope is true, not what they’ve demonstrated true.

        • David Rohde says:

          In all the examples we have worked through, you can obtain the same answer using an appropriate modelling setup and conditional probability as you can with the do-calculus, but we haven’t managed to show that this is generally the case.

          Pearl has requested that solutions to toy problems be worked out in a Bayesian framework; we think we have made a first step in this direction.

          I think it would be valuable to have a proof of the equivalence of probability theory and the do-calculus. It seems very likely that this is the case, but it is difficult to establish what the equivalent concepts are in the two frameworks.

          • To reiterate, Pearl himself shows how to reduce causal graphs and do actions to plain old probability theory in Section 3.2.2, “Interventions as Variables”, of _Causality_, Second Edition. This is done in terms of what amounts to a non-parametric structural equation model. He writes about “[interpreting] the causal reading of a DAG in terms of function, rather than probabilistic, relationships,” but of course, a functional relationship is just a degenerate conditional probability distribution.

            • Andrew says:

              Kevin:

              I discussed some of this in my review essay from a few years ago. The problem with the nonparametric structural equation approach is that it relies on identifying patterns of conditional independence, but in the problems I work on in social and environmental sciences, there are no true zeroes. So I’m skeptical of the throw-lots-of-data-into-the-computer-and-learn-causal-structure attitude.

        • Ricardo Silva says:

          I totally understand where you are coming from, since the whole point of something like the do-calculus (or Robins’s G-computation, or other identifiability results starting with Rubin’s ignorability + consistency conditions) is to reduce estimands involving interventions + random variables to something with random variables only. This is a type of reduction, but not one that starts from scratch. It starts from a causal graph (or related assumptions – I agree you don’t “need” a graph, in the same way we don’t need syntactic sugar to express independences between interventions and random variables. But it does help, doesn’t it? The machinery was right there, from the graphical model literature.). So this is not the same as saying “we reduced causality to probability”. In fact, Pearl has an entire section in his book about how all notions of “probabilistic causality” (causality defined from non-causal terms, particularly probability) led to failure. This is a point repeated many times right from the beginning.

          • See above comment about Pearl himself reducing causality to probability in Section 3.2.2 of his book Causality. The problem with previous attempts to reduce causality to probability was a modeling issue: you need to explicitly model interventions as variables in your model. If xi is the intervention variable for variable x, then conditioning on xi = “set x to v” has the same effect as do(x = v) has on the corresponding causal model.
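
            Concretely, the augmented network attaches an intervention node F_x to each variable x, with

            P(x \mid \mathrm{pa}(x), F_x = \mathrm{idle}) = P(x \mid \mathrm{pa}(x)), \qquad
            P(x \mid \mathrm{pa}(x), F_x = \mathrm{do}(x')) = \mathbb{1}\{x = x'\},

            so conditioning on F_x reproduces the effect of the do operator within ordinary probability calculus.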

            For a brief, informal treatment of this idea, see this slide deck from a presentation I gave in 2012, Fully Bayesian Causality. You probably want to just jump to slide 20.

    • Ricardo Silva says:

      Also, I’m all for Bayesian inference, but I need to add: identifiability matters. An informed prior may provide a valid answer, assuming the prior is indeed informative and not a prior of convenience. I believe this is rare, but it may be the case in some applications. Otherwise, we are just spitting out arbitrary numbers from a set of empirically indistinguishable possibilities. For instance, see this example I prepared a while ago: Figure 3 of Silva and Evans, “Causal inference through a witness protection program”, Journal of Machine Learning Research, 2016.

    • More Anonymous says:

      Hi David — Thanks for sharing your fascinating work. I’ve looked through your arxiv papers and am trying to understand them better. Could you expand a bit on the difference between your approach to causal inference and the current approach that constructs twin networks and then applies ordinary Bayesian inference to them?

      The originating reference to that is

      Balke and Pearl (1994), “Probabilistic Evaluation of Counterfactual Queries” (https://www.aaai.org/Papers/AAAI/1994/AAAI94-035.pdf)

      A short and clear summary of the originating reference is Section 2.2 of

      Graham et al. “Copy, paste, infer: A robust analysis of twin networks for counterfactual inference” (https://cpb-us-w2.wpmucdn.com/sites.coecis.cornell.edu/dist/a/238/files/2019/12/Id_65_final.pdf)

      Thanks! I’m delighted to see studies of the priors for latent variables in unidentified causal models.

      • David Rohde says:

        Thanks for the question. There are superficial similarities but also lots of differences. Pearl typically considers a stochastic system and asks what-if questions about how that system would change if you modify it. In this case a list of probabilities is provided for Alice, Bob, etc. going to the party. The twin graph produced is a mechanism for computing these modifications to the system. My understanding is that it is an alternative to the do-calculus (feel free to clarify if you see it differently).

        I believe Pearl’s argument that probability theory needs to be extended is on the basis that there is a need for rules to transform one stochastic system into another to answer causal questions. This seems fine, but views the world in a frequentist sense as an external stochastic system.

        In contrast we model a joint distribution of the collected data and the outcome we care about for each possible treatment, i.e. P(y*,data|t*), where y* is a future outcome, t* is a future treatment, and data contains a list of historical treatments, outcomes, and other covariates. Importantly, there is only one joint distribution for the whole system. We then obtain the causal effect by standard conditioning: P(y*|t*,data). As Andrew says, this is similar to how the Rubin Causal Model works; we differ in that we compute the predictive for y* for each t* as a separate conditioning operation [we have a distinct P(y*,data|t*) for every treatment t*], whereas the RCM uses joint distributions on counterfactuals. Our framework maps more closely to Pearl’s and allows us to represent non-standard scenarios such as the front door rule and M-bias. Like Dawid, we don’t use historical counterfactuals.
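
        Schematically, for each candidate treatment t* this is the ordinary posterior-predictive computation

        P(y^* \mid t^*, \mathrm{data}) = \int P(y^* \mid t^*, \theta) \, P(\theta \mid \mathrm{data}) \, d\theta,

        carried out once per t*, with no joint distribution over counterfactual outcomes required.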

        Also like Rubin, we consider only a single intervention; Pearl likes to have the flexibility to intervene on any node. Flexibility is good, but in practice you can often intervene in only a single place.

        A notable practical difference is that all the “parameters” in this setup are discrete (their first example). We use continuous parameters in a modelling framework that is much more recognisable to a working statistician.

        Andrew has quite often commented about a modelling preference for continuous vs. discrete. In this setting I am sympathetic to the need for more flexible models; discrete latent variables seem rather limited here (although to be fair, that work predates Stan making such inference easy).

        On the other hand, “link presence or absence” can in some cases have very strong impacts in terms of partial exchangeability. Being able to assume P(y,t|theta,beta) = P(y|t,beta)P(t|theta) makes a world of difference in a causal setting, but it is indeed a strong assumption about the absence of a link – allowing even a weak interaction makes things considerably harder, and priors will now have impact even in the presence of large datasets. Ricado’s reservations about identifiability and prior impact make some sense here, but if these are the appropriate assumptions for the problem, what else can you do?

        If you have further comments or questions I would be interested. I can see a need for a more extensive discussion of twin networks.

        • David Rohde says:

          sorry Ricardo not Ricado.

        • Carlos Ungil says:

          Hi David,

          I wouldn’t say that the similarities are superficial. If what Balke and Pearl present is “an alternative to the do-calculus”, then so is your proposal, I think, as it creates the same augmented network. Their “response functions” – random variables that take as many values as there are deterministic functions between the parents of a node and the node – are directly related to your P(V|parents(V)).

          For example, you write that “Parameterizing the conditional distribution P(t|z) requires two, ϕ0 and ϕ1 to represent P(t|Z=0) and P(t|Z=1) respectively”. My understanding is that their proposal for a functional specification (equations 2-5, where a and b stand for z and t) contains your model. If we call the probabilities of the four mappings p1, p2, p3, and p4 (they sum to one, so there are effectively three parameters), the parameters in your model can be recovered: ϕ0 = p3+p4 and ϕ1 = p2+p4
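
          Spelling out the four deterministic mappings from binary z to binary t, in the ordering implied by those equations:

          f_1(z) = 0, \quad f_2(z) = z, \quad f_3(z) = 1 - z, \quad f_4(z) = 1,

          so P(t = 1 \mid z = 0) = p_3 + p_4 = \phi_0 and P(t = 1 \mid z = 1) = p_2 + p_4 = \phi_1.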

          Their “party” example assumes that we are supplied with the model, but in principle it would be estimated from the data, as you do. Note that there is no data in the example. When they introduce the notation they say that “As part of the complete specification of a counterfactual query, there are real-world observations that make up the background context.” When discussing the response-function variables they say that “The prior probabilities on these response functions P(r_b) in conjunction with f_b(a, r_b) fully parameterizes the model.”

          From their conclusion: “World knowledge is represented in the language of modified causal networks, whose root nodes are unobserved, and correspond to possible functional mechanisms operating among families of observables. The prior probabilities of these root nodes are updated by the factual information transmitted with the query, and remain fixed thereafter. […] At this time the algorithm has not been implemented but, given a subjective prior distribution over the response variables, there are no new computational tasks introduced by this formalism and the inference process follows the standard techniques for computing beliefs in Bayesian networks. If prior distributions over the relevant response-function variables cannot be assessed, we have developed methods of using the standard conditional-probability specification of Bayesian networks to compute upper and lower bounds of counterfactual probabilities.”

          • Carlos Ungil says:

            Forget what I said. Looking again at Balke and Pearl, I now understand the prior probability on the response functions as a set of parameters, instead of hyperparameters with their own prior distribution as I imagined (influenced surely by your model).

        • Ricardo Silva says:

          Thanks for the further comments, David. I agree that if a prior *is* an appropriate assumption, then we should go for it. (In the paper I mentioned, I describe a study by Greenland on using priors about smoking, which was a latent confounder for a separate dataset concerning the effect of occupation on lung cancer development. He had a prior linking smoking and the occupation of the worker. The prior came from postulating exactly what the hidden variable was meant to be and using separate sources of information about this selection bias. This is totally fair game, although even there this may not suffice if we have other “unknown unknowns”, latent common causes we have no idea exist or what they might be.)

          Concerning what else we can do: well, I think the answer is conceptually simple. Just admit what you don’t know. If the data just can’t tell the difference between two different causal effects compatible with it (and I’m not talking about statistical variability only), *report everything*. If it’s too uninformative, well, tough. Maybe it motivates performing different measurements. Maybe (with some caveat emptor) you can try to elicit additional believable assumptions to provide alternative and more precise estimates with the data you already have (while still reporting what the weaker assumptions can’t tell you).

          And it’s still possible to be a full card-carrying Bayesian there. Just construct your likelihood to reflect what the data can tell about the parameters. In the paper I mentioned, a Bayesian approach is used. The likelihood is not a latent variable model: what is the point, if we don’t know what the latent variable is, in order to draw informative priors from a magical hat? The likelihood is the marginal among the observables from whatever the latent variable model might be, as long as it agrees with information I can actually assess from data.

          I’m less concerned about the point I made above about the irreducibility of causality to non-causal terms: even when people refuse to believe this, it looks like many do modelling as if they agree with it anyway (you wouldn’t flip those edges in the causal graph of your examples even if no observational data could distinguish among them, would you?). But the identifiability issue is serious. The folk knowledge of “identifiability doesn’t matter for Bayesians”, I’m afraid to say, is pseudo-science in this context. It’s one thing to say “look at my massive Bayesian neural net making awesome predictions even if its likelihood function is supercrazy”. In this context, identifiability really doesn’t matter. But if we have an extrapolation problem, like predicting effects of interventions in the data where interventions didn’t take place, now that’s a different game.

          On an unrelated note, I would look into the problems of performing more than one intervention in a single system. People like James Robins have been doing that since the 80s (with real applications, not toy ones), and only recently people are coming to terms that much of what he was doing (and related work by Pearl and others) is directly relevant to off-policy reinforcement learning.

        • More Anonymous says:

          Hi David — Thanks very much for your helpful response! I considered it and looked through your papers and website more.

          In short, you have a project and some claims about your project. As I see it, your project is Bayesian causal inference with a twin-network-like approach and special attention to latent variables. Your project seems great! I feel very positively about it. You also have major claims about the project, for example that it shows causal inference is possible in pure probability theory. I disagree with the claims.

          Let’s start with the claimed demonstrations of causal inference in pure probability theory. Most causal inference researchers would say your demonstrations already use an ingredient that is external to pure probability theory – namely, the semantic association of causation with the arrows in your probabilistic graphical models (PGMs), and the particular mutilation of the PGMs to examine effects of actions. From this perspective, your demonstrations are already extra-probabilistic in nature. Therefore, they are incapable of showing that causal inference is possible in pure probability theory.

          To support my position, I refer you to “Probabilistic Graphical Models” by Koller and Friedman. The authors spend the first 20 chapters of their book developing PGMs in pure probability theory, without causality. Then in chapter 21 causality is added through the short definition that a causal model is a Bayesian network which, in addition to answering probabilistic queries, can answer do() queries through mutilation.
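
          In that definition, a do() query is answered by the truncated factorization of the mutilated network:

          p(x_1, \ldots, x_n \mid \mathrm{do}(x_j = x_j')) = \prod_{i \neq j} p(x_i \mid \mathrm{pa}_i) \Big|_{x_j = x_j'},

          i.e., the factor for the intervened variable is deleted and its value is fixed.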

          Seeing this definition of a causal model, you may think it adds only a modicum beyond pure probability theory, and you may therefore think Pearl is making a mountain out of a molehill in his distinction between causal inference and pure probability theory. But either way, that’s a matter of opinion separate from the topic at hand.

          I do think it would be good to add more on twin networks in your articles. To my eye, the “CausalBayesConstruct” algorithm is essentially the same as the procedure for constructing twin networks, which appears in many papers. You may have reinvented twin networks (that’s impressive!), but they should be acknowledged. You state that the resemblance between your approach and twin networks is superficial, but maybe you haven’t had enough time to look through the literature on them. With all the blog comments to get through, that’s understandable.

          You also state

          A notable practical difference is that all the “parameters” in this setup are discrete (their first example). We use continuous parameters…

          Twin networks apply to both discrete and continuous variables. There may be confusion because the twin network approach is often paired with response variables / canonical partitions / principal strata, which take discrete values. Actually, response variables might simplify your computation problems. Node merging might also help.

          You write

          The fact that Bayesian machinery can be applied to a causal problem that isn’t identifiable with the do-calculus is also noteworthy. … The CGM community have seemed a little hostile to the idea that priors over latent variables (unobserved confounders) can help solve these sorts of problems.

          I’m not sure why you are encountering hostility. For Bayesian CGM work on priors over latent variables in unidentified models, see Pearl’s “Causality” section 8.5 — which covers Chickering and Pearl (1997) — and the Koller and Friedman book. I also thought full prior distributions were used in the Balke and Pearl article that proposed twin networks, but I was wrong. Thanks to Carlos for pointing out my mistake.

          Finally, if you’ve reinvented twin networks, then you may be in an excellent place to greatly advance their study and use. Reinventing something can confer a depth of understanding books and classes just don’t give. If I were you, I’d capitalize on it!!

  4. Carlos Ungil says:

    In the “current system/system after intervention” diagram for case 1 shouldn’t phi be labeled “Parameter representing P(T|Z)” rather than P(T)?

    Unless I’m misunderstanding everything, for case 2 the opposite mistake is present (“Parameter for P(T|Z)” should be “Parameter for P(Z)”) and the label for gamma is also wrong (should be “Parameter representing P(Z|T)”, not P(Z) like in case 1).
