Causal inference in AI: Expressing potential outcomes in a graphical-modeling framework that can be fit using Stan

Posted on January 27, 2020 8:40 PM by Andrew

David Rohde writes:

We have been working on an idea that attempts to combine ideas from Bayesian approaches to causality developed by you and your collaborators with Pearl’s do calculus. The core idea is simple, but we think powerful and allows some problems previously that only had known solutions with the do calculus to be solved in the Bayesian framework (in particular the front door rule).

In order to make the idea accessible we have produced a blog post (featuring animations), an online talk and technical reports. All the material can be found here.

Currently we focus on examples and intuition, we are still working on proofs. Although we don’t emphasise it, our idea is quite compatible with probabilistic programming languages like Stan where the probability of different outcomes for different counterfactual actions can be computed in the generated quantities block.

I took a quick look. What they’re saying about causal inference being an example of Bayesian inference for latent variables makes sense; I think this is basically the perspective of Rubin (1974). I think this is a helpful way of thinking, so I’m glad to see it being expressed in a different language. I’d recommend adding Rubin (1974) to your list of references. This is also the way we discuss causal inference in our BDA book (in all editions, starting with the first edition in 1995), where we take some of Rubin’s notation and explicitly integrate them into a Bayesian framework. But the causal analyses we do in BDA are pretty simple; it seems like a great idea to express these general ideas in more computing-friendly framework.

Regarding causal inference in Stan: I think that various groups been implementing latent-variable and instrumental-variables models, following the ideas of Angrist, Imbens, and Rubin, but generalizing to allow prior information and varying treatment effects. It’s been awhile since I’ve looked at the Bayesian instrumental and latent-variables literature, but it’s my recollection that I thought things could be improved using stronger priors: a lot of the pathological results that arise with weak instruments can be traced to bad things in the limits with weak priors. These are the sorts of examples where a full Bayesian inference can be worse than some sort of maximum likelihood or marginal maximum likelihood because of problems of integrating over the distribution of a ratio whose denominator is of uncertain sign.

I guess what I’m saying is that I think there are some important open problems of statistical modeling here. Improvements in conceptualization and computation (such as may be demonstrated by the above-linked work) could be valuable in motivating researchers to push forward on the modeling as well.

74 thoughts on “Causal inference in AI: Expressing potential outcomes in a graphical-modeling framework that can be fit using Stan”

Roy on January 28, 2020 12:05 AM at 12:05 am said:

Along the same lines is a recent JBES paper that shows the equivalence of the back-door criteria and Friedman’s partial predictive plots, see:

Qingyuan Zhao & Trevor Hastie (2019): Causal Interpretations of Black-Box
Models, Journal of Business & Economic Statistics, DOI: 10.1080/07350015.2019.1624293
Link to Article https://doi.org/10.1080/07350015.2019.1624293

The more these connections are made, the greater understanding that will arise, and perhaps increased applicability.. I have read a good deal of Pearl’s work, as I have said before on this blog I find his work interesting, but frustrating at least for the work I do. I have multivariate dynamic data correlated in time and space with feedback between the variables (real physical system are like that), or at least data at the time-scale we can obtain have feedback. So it is never been clear to me how you do a DAG in this situation, references to toy models don’t solve that, and while I have run across several articles that are getting close to these types of situations, none of the ones I have been referred to to date really deal with this problem.

But most of the models used to describe these systems can be put into a latent variable format (that is a state-space), so perhaps the increased connections with latent models can also provide ways of using the methods when there is dynamic feedback.

Reply ↓
Sergio Garrido on January 28, 2020 4:47 AM at 4:47 am said:

I saw this paper at the causality workshop in NeurIPS. But, as I told to the author then, this has been tried before:

Winn, John. “Causality with gates.” Artificial Intelligence and Statistics. 2012.
Link to article: https://proceedings.mlr.press/v22/winn12/winn12.pdf

I personally find Lattimore and Rhode’s approach harder to grasp than that of Winn, but that’s just me.

Reply ↓
- David Rohde on January 28, 2020 5:35 AM at 5:35 am said:
  
  Thanks for reminding me of this, it is indeed a relevant paper. Winn isn’t writing a full joint of observations, but you are right this paper does deserve more attention from us.
  
  Reply ↓
David Rohde on January 28, 2020 5:02 AM at 5:02 am said:

Thanks so much for posting! I was expecting a long delay.

With this work we can be in the difficult position of trying to sell the virtues of the Pearl approach to the Bayesian community and the Bayesian approach to the causal graphical model community (CGM).

FWIW our argument that you can produce the same results as the do-calculus using plain probability theory is controversial in the CGM community and not everybody accepts it. We think this is the first time that the solution to the front door rule has been stated using pure probability theory. Our general argument that you can in general use probability instead of the do calculus is unproven but several examples suggest it is possible. If anybody from that community has any objections, especially if they are specific, it could be very illuminating for us.

Something that I think the Bayesian community can learn from Pearl is, I believe, the do-calculus can be viewed as a set of rules for removing latent variables from a model – in the same way you can represent a mixture model both with and without a discrete latent variable. Viewed in this way the do-calculus is not only about causality but is a set of useful results that can be applied to probabilistic modelling in general. Doing this systematically is still an open question, the strangeness of the do-calculus to probabilistic eyes makes achieving this in a common language difficult.

Removing latent variables from a model can have huge practical impacts. In some of Pearl’s examples like “the front door rule” or M-bias the joint between the unobserved latent variables can have multiple isolated modes and can be very difficult to approximate with a method like MCMC. The fact that in these simple cases it isn’t necessary to do these nasty integrals as shown by the do-calculus is really remarkable, and under appreciated in the Bayesian community.

On the other hand it isn’t always possible to remove latent variables. The fact that Bayesian machinery can be applied to a causal problem that isn’t identifiable with the do-calculus is also noteworthy. Of course prior distributions will have impacts in this situation even in the large data limit, but having a formal method to combine data with assumptions seems like progress. The CGM community have seemed a little hostile to the idea that priors over latent variables (unobserved confounders) can help solve these sorts of problems. Of course the assumptions matter and in a causal setting often their impacts do not reduce in the large data limit, we don’t dismiss this.

Reply ↓
- Keith O'Rourke on January 28, 2020 9:16 AM at 9:16 am said:
  
  David:
  
  Curious if you have looked at SWIGs – https://people.tuebingen.mpg.de/p/causality-perspect/slides/Thomas_swig.pdf that try to bridge the two solitudes?
  
  Reply ↓
  - Andrew on January 28, 2020 9:59 AM at 9:59 am said:
    
    Keith:
    
    I took a quick look at these slides and just want to comment on a line there that says that Neyman (1923) was “popularized by Rubin (1974).” I don’t think that’s an accurate description. Rubin’s 1974 paper was original and important work on its own, in particular in the application of this idea to observational studies. For more background on this, see Rubin’s discussion of the translation of Neyman’s paper from 1990, in particular the last full paragraph on page 476. To label Rubin’s work in this area as a popularization is insulting and unnecessary. Popularization is great, and we deserve credit for it when it happens, but we should also get credit for our original contributions.
    
    Reply ↓
    - Keith O'Rourke on January 29, 2020 7:40 AM at 7:40 am said:
      
      Yea, actually remember a conversation with Rubin in Toronto in 1997 about that. Somethings appear much easier after they are done.
  - David Rohde on January 28, 2020 12:31 PM at 12:31 pm said:
    
    Thanks Keith, that also looks relevant (similar but not the same).
    
    I don’t know it, although Finn might, yet another paper to read.
    
    Reply ↓
- Ricardo Silva on January 28, 2020 10:14 AM at 10:14 am said:
  
  We can’t use plain probability theory in causality, and I don’t think this is controversial at all. There is no way of expressing the notion of intervention using the axioms of probability. An intervention is not a random variable In a causal graphical model. Interventions can appear as extra nodes, but these nodes themselves are not random variables just because they are added to a graph. In the potential outcome framework, we have indices indicating the (non-random) intervention level, which are constructions that again cannot be reduced to probability. The merit of causal graphical models/potential outcomes is to integrate the notion of intervention within probability as close as possible, but not closer (e.g., we can overload the notion of independence with respect to intervention nodes in a graphical model, but this is not the same as probabilistic independence among random variables. For instance, we don’t test “X is independent of Y” in the same way if X is a intervention indicator compared to when X is a random variable).
  
  Incidentally, I’m partially “responsible” for the John Winn paper above getting accepted to AISTATS. It was accepted on the grounds of it providing a nice trick of expressing graphical independence in a more granular way. I had second thoughts about it because of the (flawed) argument that causality can be reduced to probability. In the end, the other (expert) reviewers were generous enough to overlook it in the light of the positive contributions. I even met John in the airport on the way to the conference and explained the above. Not sure if he was totally convinced, but at least he seemed to recognise there was more to it than drawing diagrams, which are just drawings if there is no model semantics (independence models, types of nodes, meaning of measurement etc.) behind them.
  
  Reply ↓
  - Daniel Lakeland on January 28, 2020 4:12 PM at 4:12 pm said:
    
    >We can’t use plain probability theory in causality, and I don’t think this is controversial at all.
    
    replace probability theory with “Boolean logic”, would it be controversial? it should be, because it’s obviously false that Boolean logic plays no role in causality.
    
    So, what does your statement even mean? perhaps you can elaborate
    
    Reply ↓
    - Ben on January 28, 2020 6:58 PM at 6:58 pm said:
      
      > There is no way of expressing the notion of intervention using the axioms of probability.
      
      Yes, the second sentence too.
    - Ricardo Silva on January 29, 2020 4:26 AM at 4:26 am said:
      
      I see how the word “use” was definitely ambiguous above. Surely we can use probability in causality, just like my post explicitly acknowledged. It would be crazy to say otherwise.
      
      What I intended to say is that we can’t reduce causality to probability. We can define what a random variable from the axioms of probability. We can’t do the same with intervention. There is not lack of historical attempts at that: can we define what a mean is purely from probability? Sure. Can we define what a causal effect is purely from non-causal concepts? Many people have tried, zero succeed. If you have a solution to that, please publish it.
    - Daniel Lakeland on January 29, 2020 12:08 PM at 12:08 pm said:
      
      We can’t reduce computer programming to probability either, we can’t reduce calculus to probability, we can’t reduce physics to probability… there’s more to life than probability.
      
      What we can say, is that if your model y = f(a,b,c…) is a causal model of y then probabilistic estimates of f(a2,b2,c2…) – f(a1,b1,c1) are estimates of the causal effect of changing a1,b1,c1 to a2,b2,c2
      
      where these models come from is the hard part. Saying that you can’t get f from pure probability theory ought to be something we don’t have to say, no one should be arguing for that, just like you can’t get Newton’s laws from pure probability theory.
    - Andrew on January 29, 2020 1:53 PM at 1:53 pm said:
      
      +1
    - Ricardo Silva on January 29, 2020 2:32 PM at 2:32 pm said:
      
      Excellent, we agree. You had to resort to intuition aka making your own axioms. But it’s a question worth asking, as many philosophers of science have raised for a variety of reasons. After all, it is worthwhile knowing what causality is *not*, and I found the point above fairly useless in this regard.
    - Ricardo Silva on January 29, 2020 2:34 PM at 2:34 pm said:
      
      Anyway, probability was just an example. The main point was to trying to define causality in noncausal terms. And my comment about identifiability was really what I wanted to emphasise…
  - Kevin S Van Horn on January 29, 2020 2:48 PM at 2:48 pm said:
    
    “We can’t use plain probability theory in causality”
    
    I find this statement puzzling, and especially Pearl’s endorsement of this statement, when he himself has a section in his book Causality that discusses how to do exactly that: reduce a causal graph to a graphical (probability) model by introducing variables that indicate intervention. The end result is that causal graphs and the do-calculus are conveniences, not necessities.
    
    Reply ↓
    - Anonymous on January 29, 2020 3:59 PM at 3:59 pm said:
      
      Most claims made by the statistarazzi are expressions of what they hope is true, not what they’ve demonstrated true.
    - David Rohde on January 30, 2020 5:33 AM at 5:33 am said:
      
      In all the examples we have worked through you can obtain the same answer using an appropriate modelling setup and conditional probability and the do-calculus but we haven’t managed to show that this is generally the case.
      
      Pearl has requested that solutions to toy problems are made in a Bayesian framework, we think we have made a first step in this direction.
      
      I think it would be valuable to have a proof on the equivalence of probability theory and the do-calclulus. It seems very likely that this is the case, but it is difficult to establish what are the equivalent concepts in the two frameworks.
    - Kevin S Van Horn on February 1, 2020 3:28 PM at 3:28 pm said:
      
      To reiterate, Pearl himself shows how to reduce causal graphs and do actions to plain old probability theory in Section 3.2.2, “Interventions as Variables”, of _Causality_, Second Edition. This is done in terms of what amounts to a non-parametric structural equation model. He writes about “[interpreting] the causal reading of a DAG in terms of function, rather than probabilistic, relationships,” but of course, a functional relationship is just a degenerate conditional probability distribution.
    - Andrew on February 1, 2020 7:42 PM at 7:42 pm said:
      
      Kevin:
      
      I discussed some of this in my review essay from a few years ago. The problem with the nonparametric structural equation approach is that it relies on identifying patterns of conditional independence, but in the problems I work on in social and environmental sciences, there are no true zeroes. So I’m skeptical of the throw-lots-of-data-into-the-computer-and-learn-causal-structure attitude.
    - Kevin S Van Horn on February 3, 2020 12:41 PM at 12:41 pm said:
      
      My comments here just address the question of whether there is something to causality that can’t be modeled within a purely probabilistic framework. The claim made by Pearl is that probability distributions aren’t enough; I’m pointing out that Pearl himself shows how to reduce everything he does to a purely probabilistic model.
      
      I’m not talking about trying to learn causal structure, which I agree is very difficult. I mention “non-parametric structural equation models” simply because that’s the general framework for Pearl’s causal graphs — he’s not assuming any particular parametric form. I don’t think that commits one to actually doing everything nonparametrically. BTW, my impression is that Pearl’s emphasis is not on trying to infer causal structure, but on making causal assumptions explicit and exploring their consequences in a rigorous fashion.
    - Ricardo Silva on January 31, 2020 2:05 PM at 2:05 pm said:
      
      I totally understand where you are coming from, since the whole point of something like the do-calculus (or Robin’s G-computation, or other identifiability results starting with Rubin’s ignorability + consistency conditions) is to express estimands with interventions + random variables to something with random variables only. This is a type of reduction, but not one that starts from scratch. It starts from a causal graph (or related assumptions – I agree you don’t “need” a graph, in the same way we don’t need syntactic sugar to express independences between interventions and random variables. But it does help, doesn’t it? The machinery was right there, from the graphical model literature.). So this is not the same as saying “we reduced causality to probability”. In fact, Pearl has an entire section in his book about how all notions of “probabilistic causality” (causality defined from non-causal terms, particularly probability) led to failure. This is a point repeated many times right from the beginning.
    - Kevin S Van Horn on February 1, 2020 4:04 PM at 4:04 pm said:
      
      See above comment about Pearl himself reducing causality to probability in Section 3.2.2 of his book Causality. The problem with previous attempts to reduce causality to probability was a modeling issue: you need to explicitly model interventions as variables in your model. If xi is the intervention variable for variable x, then conditioning on xi = “set x to v” has the same effect as do(x = v) has on the corresponding causal model.
      
      For a brief, informal treatment of this idea, see this slide deck from a presentation I gave in 2012, Fully Bayesian Causality. You probably want to just jump to slide 20.
- Ricardo Silva on January 28, 2020 10:37 AM at 10:37 am said:
  
  Also, I’m all for Bayesian inference, but I need to add: identifiability matters. An informed prior may provide a valid answer, assuming the prior is indeed informative and not a prior of convenience. I believe this is rare, but it may be the case in some applications. Otherwise, we are just spitting arbitrary numbers from a set of empirically undistinguishable possibilities. For instance, this example I prepared a while ago: Figure 3 of Silva and Evans, “Causal inference through a witness program”, Journal of Machine Learning Research, 2016.
  
  Reply ↓
- More Anonymous on January 28, 2020 4:30 PM at 4:30 pm said:
  
  Hi David — Thanks for sharing your fascinating work. I’ve looked through your arxiv papers and am trying to understand them better. Could you expand a bit on the difference between your approach to causal inference and the current approach that constructs twin networks and then applies ordinary Bayesian inference to them?
  
  The originating reference to that is
  
  Balke and Pe (1994) “Proabilistic Evaluation of Counterfactual Queries” (https://www.aaai.org/Papers/AAAI/1994/AAAI94-035.pdf)
  
  A short and clear summary of the originating reference is Section 2.2 of
  
  Graham et al. “Copy, paste, infer: A robust analysis of twin networks for counterfactual inference” (https://cpb-us-w2.wpmucdn.com/sites.coecis.cornell.edu/dist/a/238/files/2019/12/Id_65_final.pdf)
  
  Thanks! I’m delighted to see studies of the priors for latent variables in unidentified causal models.
  
  Reply ↓
  - David Rohde on January 30, 2020 5:26 AM at 5:26 am said:
    
    Thanks for the question. There are superficial similarities but also lots of differences. Pearl typically considers a stochastic system and asks what-if type questions of how that system would change if you modify it. In this case a list of probabilities are provided Alice, Bob etc going to the party. The Twin graph produced is a mechanism for computing these modifications to the system. My understanding is it is an alternative to the do-calculus (feel free to clarify if you see it differently).
    
    I believe Pearl’s argument that probability theory needs to be extended is on the basis that there is a need for rules to transform one stochastic system into another to answer causal questions. This seems fine, but views the world in a frequentist sense as an external stochastic system.
    
    In contrast we model a joint of the collected data and the outcome we care about for each possible treatment i.e. P(y*,data|t*) where y* is a future outcome, t* is a future treatment and data contains a list of historical treatments, outcomes and other covariates. Important there is only one joint distribution for the whole system. We then obtain the causal effect by standard conditioning P(y*|t*,data). As Andrew says this is similar to how the Rubin Causal Model works, we differ in that we compute the predictive for y* for each t* as a separate conditioning operation [we have a distinct P(y*,data|t*) for every treatment t*], instead the RCM uses joint distributions on counterfactuals. Our framework maps closer to Pearl and allows us to represent non-standard scenarios such as the front door rule and M-bias. Like Dawid we don’t use historical counterfactuals.
    
    Also like Rubin we consider only a single intervention, Pearl likes to have the flexibility to intervene on any node. Flexibility is good but often you practically can only intervene in a single place.
    
    A notable practical difference is that all the “parameters” in this setup are discrete (their first example). We use continuous parameters in a modelling framework that is much more recognisable to a working statistician.
    
    Andrew has quite often commented about a modelling preference for continuous vs discrete. I am in this setting sympathetic to the need for more flexible models, discrete latent variables seem rather limited here (although to be fair this is before Stan made such inference easy).
    
    On the other hand “link presence or absence” can in some cases have very strong impacts in terms of partial exchangeability. Being able to assume P(y,t|theta,beta)=p(y|t,beta)P(t|theta) makes a world of difference in a causal setting but is indeed a strong assumption about the absence of a link – allowing even a weak interaction makes things considerably harder, priors will now have impact even in the presence of large datasets. Ricado’s reservations about identifiability and prior impact make some sense here, but if these are the appropriate assumptions for the problem what else can you do?
    
    If you have further comments or questions I would be interested. I can see a need for a more extensive discussion of twin networks.
    
    Reply ↓
    - David Rohde on January 30, 2020 5:35 AM at 5:35 am said:
      
      sorry Ricardo not Ricado.
    - Carlos Ungil on January 30, 2020 3:04 PM at 3:04 pm said:
      
      Hi David,
      
      I wouldn’t say that the similarities are superficial. If what Balke and Pearl present is “an alternative to the do calculus” so is your proposal, I think, as it creates the same augmented network. Their “response functions”, random variables that take as many values as there are deterministic functions between the parents of a node and the node, are directly related to your P(V|parents(V)).
      
      For example, you write that “Parameterizing the conditional distribution P(t|z) requires two, ϕ0 and ϕ1 to represent P(t|Z=0) and P(t|Z=1) respectively”. My understanding is that their proposal for a functional specification (equations 2-5, where a and b stand for z and t) contains your model. If we call p1, p2, p3 and p4 to the probability of the four mappings (the sum is one, so there are effectively three parameters) the parameters in your model can be recovered: ϕ0 = p3+p4 and ϕ1 = p2+p4
      
      Their “party” example assumes that we are supplied with the model but in principle it would estimated from the data as you do. Note that there is no data in the example. When they introduce the notation they say that “As part of the complete specification of a counterfactual query, there are real-world observations that make up the background context.” When discussing the response-function variables they say that “The prior probabilities on these response functions P(r_b) in conjunction with f_b(a, r_b) fully parameterizes the model.”
      
      From their conclusion: “World knowledge is represented in the language of modified causal networks, whose root nodes are unobserved, and correspond to possible functional mechanisms operating among families of observables. The prior probabilities of these root nodes are updated by the factual information transmitted with the query, and remain fixed thereafter. […] At this time the algorithm has not been implemented but, given a subjective prior distribution over the response variables, there are no new computational tasks introduced by this formalism and the inference process follows the standard techniques for computing beliefs in Bayesian networks. If prior distributions over the relevant response-function variables cannot be assessed, we have developed methods of using the standard conditional-probability specification of Bayesian networks to compute upper and lower bounds of counterfactual probabilities.”
    - Carlos Ungil on January 30, 2020 3:46 PM at 3:46 pm said:
      
      Forget what I said. Looking again at Balke and Pearl now I understand the prior probability on the response functions as a set of parameters, instead of hyperparameters with their own prior distribution as I imagened (influenced surely by your model).
    - Ricardo Silva on January 31, 2020 2:37 PM at 2:37 pm said:
      
      Thanks for the further comments, David. I agree that if a prior *is* an appropriate assumption, then we should go for it (in the paper I mentioned, I describe a study by Greenland on using priors about smoking, which was a latent confounder for a separate dataset concerning the effect of occupation on lung cancer development. He had a prior linking smoking and the occupation of the worker. The prior came from postulating exactly what the hidden variable was meant to be and use separated sources of information about this selection bias. This is totally fair game, although even there this may not suffice if we have other “unknown unknowns”, latent common causes we have no idea exist or what they might be).
      
      Concerning what else we can do: well, I think the answer is conceptually simple. Just admit what you don’t know. If the data just can’t tell the difference between two different causal effects compatible with it (and I’m not talking about statistical variability only), *report everything*. If it’s too uninformative, well, tough. Maybe it motivates collecting performing different measurements. Maybe (with some caveat emptor) you can try to elicit additional believable assumptions to provide alternative and more precise estimates with the data you already have (while still reporting what weaker assumptions can’t tell).
      
      And it’s still possible to be a full carrying card Bayesian there. Just construct your likelihood to reflect what the data call tell about the parameters. In the paper I mentioned, a Bayesian approach is used. The likelihood is not a latent variable model: what is the point, if we don’t know what the latent variable is to in order to draw informative priors from a magical hat? The likelihood is the marginal among the observables from whatever the latent variable model might be, as long as it agrees with information I can actually assess from data.
      
      I’m less concerned about the point I made above about the irreducibility of causality to non-causal terms: even when people refuse to believe this, it looks like many do modelling as if they agree with it anyway (you wouldn’t flip those edges in the causal graph of your examples even if no observational data could distinguish among them, would you?). But the identifiability issue is serious. The folk knowledge of “identifiability doesn’t matter for Bayesians”, I’m afraid to say, is pseudo-science in this context. It’s one thing to say “look at my massive Bayesian neural net making awesome predictions even if its likelihood function is supercrazy”. In this context, identifiability really doesn’t matter. But if we have an extrapolation problem, like predicting effects of interventions in the data where interventions didn’t take place, now that’s a different game.
      
      On an unrelated note, I would look into the problems of performing more than one intervention in a single system. People like James Robins have been doing that since the 80s (with real applications, not toy ones), and only recently people are coming to terms that much of what he was doing (and related work by Pearl and others) is directly relevant to off-policy reinforcement learning.
    - More Anonymous on February 2, 2020 2:22 PM at 2:22 pm said:
      
      Hi David — Thanks very much for your helpful response! I considered it and looked through your papers and website more.
      
      In short, you have a project and some claims about your project. As I see it, your project is Bayesian causal inference with a twin-network-like approach and special attention to latent variables. Your project seems great! I feel very positively about it. You also have major claims about the project, for example that it shows causal inference is possible in pure probability theory. I disagree with the claims.
      
      Let’s start with the claimed demonstrations of causal inference in pure probability theory. Most causal inference researchers would say your demonstrations already use an ingredient that is external to pure probability theory — namely, the semantic association of causation with the arrows in your probabilistic graphical models (PGMs), and the particular mutilation of the PGMs to examine effects of actions. From this perspective, your demonstrations are already extraprobablistic in nature. Therefore, they are incapable of showing that causal inference is possible in pure probability theory.
      
      To support my position, I recommend you to “Probabilistic Graphical Models” by Koller and Friedman. The authors spend the first 20 chapters of their book developing PGMs in pure probability theory, without causality. Then in chapter 21 causality is added through the short definition that a causal model is a Bayesian network which, in addition to answering probabilistic queries, can answer do() queries through mutilation.
      
      Seeing this definition of a causal model, you may think it adds only a modicrum beyond pure probability theory, and you may therefore think Pearl is making a mountain out of a molehill in his distinction between causal inference and pure probability theory. But either way, that’s a matter of opinion separate from the topic at hand.
      
      I do think it would be good to add more on twin networks in your articles. To my eye, the “CausalBayesConstruct” algorithm is essentially the same as the procedure for constructing twin networks, which appears in many papers. You may have reinvented twin networks (that’s impressive!), but they should be acknowleged. You state that the resemblance between your approach and twin networks is superficial, but maybe you haven’t had enough time to look through the literature on them. With all the blog comments to get through, that’s understandable.
      
      You also state
      
      A notable practical difference is that all the “parameters” in this setup are discrete (their first example). We use continuous parameters…
      
      Twin networks apply to both discrete and continuous variables. There may be confusion becuase the twin network approach is often paired with response varaibles / canonical partitions / principal strata, which take discrete values. Actually, response variables might simplify your computation problems. Node merging might also help.
      
      You write
      
      The fact that Bayesian machinery can be applied to a causal problem that isn’t identifiable with the do-calculus is also noteworthy. … The CGM community have seemed a little hostile to the idea that priors over latent variables (unobserved confounders) can help solve these sorts of problems.
      
      I’m not sure why you are encountering hostility. For Bayesian CGM work on priors over latent variables in unidentified models, see Pearl’s “Causality” section 8.5 — which covers Chickering and Pearl (1997) — and the Koller and Friedman book. I also thought full prior distributions were used in the Balke and Pearl article that proposed twin networks, but I was wrong. Thanks to Carlos for pointing out my mistake.
      
      Finally, if you’ve reinvented twin networks, then you may be in an excellent place to greatly advance their study and use. Reinventing something can confer a depth of understanding books and classes just don’t give. If I were you, I’d capitalize on it!!
    - Anonymous on February 3, 2020 5:45 AM at 5:45 am said:
      
      Thanks for your kind words!
      
      I think we are perhaps claiming less than that.
      
      To paraphrase Daniel Lakeland, I don’t think you can even do statistics with pure probability theory. You need some additional framework like Bayesian decision theory. So is Bayesian decision theory as applied to statistical problems enough to solve causal problems? Even there you could argue either way. You need at least one additional ingredient that isn’t (typically) in Bayesian decision theoretic literature i.e. you need a random variable let’s call it y* which will change on a treatment let’s call it t*. (Is this separate from standard Bayesian decision theory.. it is hard to say and seems like a semantic point – but it should be emphasised more in the BDT literature).
      
      You also need to be more careful about your assumptions in a causal setting. A really useful one if it applies is that you can factorise the “model”: p(y1..n,x1..n,y*|t*) = \int \prod_n P(y_n|x_n,beta)p(x_n|theta) P(beta)P(theta) P(y*|t*,beta). This is useful because under such an assumption you can regress y on t to get the causal effect. It can be viewed as an assumption of partially exchangeability where the decision t* has a partially exchangeability relationship with the observed t_1..t_n. You can swap y_{n1} with y_{n2} if t_{n1}=t_{n2} and this partial exchangeability relationship must also extend to y* t.
      
      Is this using some causal concept beyond probability theory? I am not sure this is an important question. It is pretty much standard Bayesian decision theory with special attention to exchangeability relationships that apply in a causal setting. Depending on how you define Bayesian decision theory we are perhaps using some mild extensions. For what it is worth we don’t use arrow direction for causality, but it turns out when you apply the de Finetti representation (the plate in a PGM) some arrow directions become fixed for practical purposes (draws are no longer identical if you try to reverse arrows).
      
      Pearl has I think more than anybody studied the types of partial exchangeability that permit identifiability, and there is a lot we can learn from the do-calculus.
      
      I appreciate all the pointers. I am a bit under pressure at the moment, but I will read them all.
      
      I agree with you that a thorough discussion of twin networks is missing from the article. I also agreed I oversimplified them in my previous comment.
      
      I still see big differences in intent of twin networks and our proposed method. We have one joint distribution that models experimental data and future outcomes dependent on decisions we make now. Twin networks like the do-calculus modify a (already estimated) probability distribution to answer causal questions (historical counterfactual questions in the article). The concept of mutilation again is based on modifying a probability – not one full joint. The notion that reality is a stochastic process and if you apply a treatment you modify the graph and you change the stochastic process is at its heart frequentist. A big difference is we consider a single realisation of the system and place one joint over it. I can see if frequentist probability is used there is a clear need for the do-calculus or something similar.
      
      Thanks again for your interest, encouragement and references.
      
      It would be good to get to the bottom of if this disagreement is more than semantic or not.
    - Andrew on February 3, 2020 3:26 PM at 3:26 pm said:
      
      Anon:
      
      Yes, to apply probability theory to X, you need more than probability theory. And this is true for every X.
    - Carlos Ungil on February 3, 2020 4:35 PM at 4:35 pm said:
      
      > I still see big differences in intent of twin networks and our proposed method.
      
      Let’s focus for a moment on the similarities. You have the same networks that they do. If you’re allowed to mutilate the graph without the need for “do-calculus or something similar” then so are they.
      
      The left side of the networks represents the real world, the right side the counterfactual world. In their paper the link is provided by exogenous disturbances ϵ which cannot be influenced by the forcing of any endogenous variable in the model. This allows evidence from the real-world network to propagate to the counterfactual network.
      
      As you say, estimation splits naturally into two parts. In both cases, the calculation in the “causal projection” phase comes down to things like P(Y*=1|T*=0) = P(Y*=1|T*=1,Z*=0)P(Z*=0) + P(Y*=1|T*=0,Z*=1)P(Z*=1), where the different conditional probabilities are a direct consequence of the particular model that results from the “abduction” phase (specified by the posterior distributions of the parameters or the probability distribution of response functions).
      
      There is a difference in the kind of model that you implement. Yours is better fleshed and more Bayesian. It may also have its own problems, I don’t know. The point is that both approaches qualify to the same extent regarding the “using plain probability” claim.
    - Anonymous on February 4, 2020 3:25 AM at 3:25 am said:
      
      Thanks Carlos,
      
      I find your response quite reasonable and am almost inclined to agree with it. After all if all we are arguing about is if the idea is original we are not arguing about much.
      
      However, if twin networks are able to solve these problems (e.g. front door rule, M-bias), it has to the best of my knowledge not been demonstrated and in recent discussions forgotten by Pearl.
      
      In the original statistics and medicine debate between Pearl and Rubin a key point was conditioning. Rubin wanted to do the Bayesian thing and condition on everything. Pearl warned about conditioning on the wrong variable and discussed M-bias where conditioning appeared to be a problem.
      
      In our work we can condition on everything as Rubin and Bayes suggests and get the same answer as Pearl. Maybe Pearl knew this could be solved in a Bayesian way by conditioning on everything( in a twin network) – but he does not mention it. Instead he asks how Bayes solves the “bell problem”, which I took to be a sincere (and reasonable) question.
      
      .
    - More Anonymous on February 3, 2020 6:27 PM at 6:27 pm said:
      
      David, Thanks again for your response.
      
      For your discussion about ingedients beyond probability theory, please see Carlos’s comment about the twin networks approach and your work:
      
      The point is that both approaches qualify to the same extent regarding the “using plain probability” claim.
      
      Also, you write
      
      For what it is worth we don’t use arrow direction for causality…
      
      Try switching arrow directions in your twin networks and re-running your analyses. To my understanding, sometimes you will get the right answer and other times not. The way to guarentee that you get the right answer is to use arrow direction for causality. Therefore, there is real substance to the semantic association of arrows with causality. It’s consequential.
      
      PS, note I’m using “semantic” as in “semantic vs. syntactic” not as in “semantic = superficial”.
    - David Rohde on February 4, 2020 3:38 AM at 3:38 am said:
      
      Because we are using only probability theory you can indeed reverse arrow directions and get the same result. To be clear you need to use the correct model but you can re-factorize it at will using Bayes rule.
      
      However as we apply the de Finetti representation (when we use a plate) this fixes some arrow directions in a practical sense. The model has a compact form involving a product of conditionally identical densities. You may loose this form if you try to re-factorize it.
      
      I think we mostly agree, the arrows presence and direction is closely tied to the partial exchangeability relationships and these are indeed closely related to causality. However it is possible to consider partial exchangeability without causality.
    - More Anonymous on February 5, 2020 11:11 PM at 11:11 pm said:
      
      David, you write
      
      Because we are using only probability theory you can indeed reverse arrow directions and get the same result.
      
      So, given observational data on A, B, and C, your model gives the same effect of A on C for graph A -> B -> C and graph A C?
      
      If so, how do you deal with the problem that the effect should be (generally) nonzero in the first graph and zero in the second?
      
      If not, how are you choosing the directions of the arrows without invoking causality?
      
      We seem to be getting much farther away from agreement as this discussion proceeds.
    - More Anonymous on February 5, 2020 11:13 PM at 11:13 pm said:
      
      My above comment is malformed… trying again:
      
      David, you write, “Because we are using only probability theory you can indeed reverse arrow directions and get the same result.”
      
      So, given observational data on A, B, and C, your model gives the same effect of A on C for graph A — > B — > C and graph A C?
      
      If so, how do you deal with the problem that the effect should be (generally) nonzero in the first graph and zero in the second?
      
      If not, how are you choosing the directions of the arrows without invoking causality?
      
      We seem to be getting much farther away from agreement as this discussion proceeds.
    - More Anonymous on February 5, 2020 11:16 PM at 11:16 pm said:
      
      The graphs are still malformed… the first graph is A right-pointed-arrow B right-pointed-arrow C and the second graph should be A left-pointed-arrow B right-pointed-arrow C.
    - David Rohde on February 3, 2020 12:09 PM at 12:09 pm said:
      
      Thanks for the comments! I wrote a long response – hopefully it is caught in a spam filter and will appear..
    - David Rohde on February 6, 2020 3:44 PM at 3:44 pm said:
      
      I would use a different joint distribution in each of the two cases you describe:
      
      P(A*,A1..N,B1..N,C1..N|C*)
      
      but regardless of the case I would condition to compute:
      
      P(A*|C*,A1..N,B1..N,C1..N)
      
      Because it is just conditioning you can re-factorize at will e.g.
      P(A*,A1..N,B1..N,C1..N|C*) = P(A*,A1..N|C*,B1..N,C1..N)P(B1..N,C1..N|C*)
      or any other way you like.
      
      I would only do the re-factorizing after applying the BayesCausalConstruct. It would be reasonable to say that the arrow directions in the CGM imply a different joint distribution in the PGM.
    - More Anonymous on February 7, 2020 11:16 PM at 11:16 pm said:
      
      David, Thanks for staying with this discussion! It’s appreciated.
      
      With observational data, the distribution of (A, B, C) in general does not allow one to distinguish between the two graphs that I mentioned (A right-arrow B right-arrow C and A left-arrow B right-arrow C). This is because the graphs imply the same conditional independencies (A independent of C given B, and no other independencies).
      
      Therefore, when you are estimating the effect of A on C from observational data on A, B, C, there is no purely probabilistic way to decide whether you apply CausalbayesConstruct to A right-arrow B right-arrow C or to A left-arrow B right-arrow C. Extra-probabilistic information is needed to decide this, and is supplied by our assumptions about the causal relationships between the variables, as expressed in the DAG.
      
      In this way, extra-probabilistic information contributes to CausalBayesConstruct.
    - David Rohde on February 8, 2020 9:45 AM at 9:45 am said:
      
      More Anonymous (responding above to your question below) Thanks for your continuing interest!
      
      You correctly point out that if you have many draws from P(A,B,C) you could predict missing entries in your observational study or predict A if you knew C of a missing entry A etc but you can’t know how to extend to a new situation where you set C, without further assumptions. You point to an example where you might regress A on C and another where you might just look at the marginal distribution of A. I agree.
      
      If we are to isolate our disagreement, I think probability theory is perfectly capable of distinguishing between the two scenarios you mention when you consider the joint between the observational data and the new case where you intervene. The two scenarios have the same marginal distribution over the observational data. The two scenarios differ only in an assumption of partial exchangeability between the observed data and the data relating to the intervention, i.e. what is the effect of exchanging A in the observed data with the response A* assuming that C=C*.
      
      The fact that you can’t look at the joint of the observational P(A,B,C) and work out the causal effect of setting C does not mean you need to exit probability theory. You can simply model the full joint of the observations and the new system with the intervention.
      
      I also agree that this assumption is not something that can be tested by looking at the historical data (if you only have access to data without interventions). AFAICT our only disagreement is if the assumption can be expressed with only probability theory.
    - Ricardo Silva on February 8, 2020 11:29 AM at 11:29 am said:
      
      Thanks for the continued discussion, David. I don’t want to be too repetitive, so I’ll make this my final message in this thread. What More Anonymous (I think) and I point out is: you still haven’t defined what intervention is. A very mainstream view is that it is worthwhile to provide a mathematical definition of it just like it is with probability (this guy here agrees: https://statmodeling.stat.columbia.edu/2018/12/26/what-is-probability/). Lots of informal talk above boils down to “so-what-I-don’t-need-to-define-formally-everything-in-what-I-do-etc etc.”. If causal inference has any meaning as part of Statistics, nobody should buy that: to paraphrase, “none of [the above] is the foundation of [causality]; rather, [causality] is a mathematical concept which applies to various problems.” I surely don’t agree with Pearl’s vision that statisticians don’t know how to talk about causality (well, at least not all of them), but to go in this direction of avoiding a mathematical treatment is basically to admit that.
      
      Before going out before saying that: I genuinely appreciate the views in your papers of expressing a model under intervention by explicitly likelihoods, and at the very least is a way of bringing more people to understand causal graphical models. It is not entirely novel, as it relates to truncated factorisation you can see in some causal inference tutorials and books, but having this more explicitly linked to the representation of parameters in the Bayesian case is helpful and will help several readers to understand the concepts better.
    - David Rohde on February 9, 2020 3:27 PM at 3:27 pm said:
      
      Thanks for sharing your thoughts Ricardo. It has been a good discussion, and the tone of it has been really constructive.
      
      I must say I am struggling to see the point you are making here. Our approach sits on top of axioms of rational decision making, so it has that mathematical basis; something the do-cacluclus lacks. It is also as an applied approach is distinct from mathematics e.g. prior elicitation is required. Perhaps you regard this as “informal”, I see it as a standard part of probabilistic modelling, it is a prominent part of Bayesian decision theory foundations.
      
      An application of the do-calculus is in contrast piecemeal. Observational data of A,B,C is collected. This is treated as an external stochastic process and an estimator applied. It could be a Bayesian estimator but such a step is fundamentally frequentist. You then read off the causal graphical model in order to modify the estimated joint of P(A,B,C) to give the causal quantities.
      
      In terms of being a formal theory I don’t think the do calculus has the ambition of Bayesian decision theory. It can’t handle statistical estimation. After the estimation is performed it advocates violating the conditionality principle, by dropping variables. By its nature the do calculus is frequentist as it operates with external stochastic processes.
      
      By applying the principles of Bayesian decision theory in a causal model we can condition on all available information, we can combine statistical estimation and determining causal effects in one step. We automatically have an axiom system. I just don’t understand how this can be “informal” compared to the do-calculus.
    - Carlos Ungil on February 9, 2020 8:38 PM at 8:38 pm said:
      
      David, I have one question regarding the reversal of arrows.
      
      Say that for the Front Door Rule example in “Replacing the do-calculus with Bayes rule” you reverse the arrow from W to Y (and from W* to Y*). Y depends only on U; an intervention on T* should have no effect on Y*.
      
      If the PGM is essentially the same, shouldn’t your derivation remain valid?
      
      I have the impression that you’re introducing the causal structure when you write down your parameterization because you force that the marginal probabilities will be equal in both sides of the network. That gives the right answer in the original problem and the wrong answer when those arrows are reversed and the causality flow changes.
    - David Rohde on February 10, 2020 3:04 AM at 3:04 am said:
      
      Hi Carlos,
      
      The arrow directions in the CGM does indeed affect the joint on the PGM. Only once you have the PGM can you re-facotorize or reverse arrows. That “causal structure” can be seen through a probabilistic lens of partial exchangeability.
      
      So I think your impression is correct.
    - Carlos Ungil on February 10, 2020 1:47 PM at 1:47 pm said:
      
      > Only once you have the PGM can you re-factorize or reverse arrows.
      
      Given that the model consists of the PGM produced by the CausalBayesConstruct algorithm _plus_ the additional constraints P(V|parents(V)) = P(V* |parents(V*)), what would be the point of re-factorizing or reversing arrows if the model still has to satisify the constraints defined by the parent-child relationships in the extended CGM?
    - Ricardo Silva on February 10, 2020 3:07 PM at 3:07 pm said:
      
      (Maybe I should allow myself one more post, since I like the discussion :-) )
      
      Hi, David
      
      I think one problem here is the conflation of causal modelling and do-calculus. The former is a language to express assumptions of the type “A causes B” from a primitive notion of intervention. Pearl’s Structural Causal Models and the Rubin Causal Model are examples of this. The latter, do-calculus, is a computational technique to derive identifiability results, built on top of SCM. SCMs are defined and applicable without any requirement for the do-calculus. SCM/RCM exist because “A causes B” has no counterpart in probability, and causal inference is about mathematising such claims. You may say this is not important, but I think this is not a defensible claim. Those who work with stochastic differential equations and Bayesian nonparametrics surely are happy that a mathematicisation of probability exists, even if there will always be people who think (wrongly) that is just a mathematician’s game to keep academics busy (heck, what does it even mean to condition on the outcome of a continuous random variable, which is an event of probability zero?). Similar claims apply to people doing causal inference: it is good to have a formal language to express assumptions that have no “obvious” consequences and where it may be hard to even put them together in a coherent way.
      
      Concerning the do-calculus, it doesn’t make sense to say that the do-calculus lacks “axioms of rational decision”. I don’t even know what this means, since I don’t know which axioms you are referring to (Cox’s? SCM/RCM encompasses probability, adding the concept of intervention, so it comes for free). The do-calculus is all about reducing a causal query to a probabilistic query, if at all possible, from nonparametric assumptions of conditional independence between random variables and interventions (it does *not* reduce causal assumptions to non-causal assumptions. Instead, *given* causal assumptions, it reduces expressions with random variables and interventions to expressions without interventions, if they exist). We don’t need it to do causal inference. Really, nobody ever claimed that. But the alternative is to use further assumptions, including priors about latent common causes. Hence we have (informative) latent variable models, instrumental variables, regression discontinuity, synthetic controls, difference-in-differences, etc., all of which people do resort to since conditional independence statements only may provide too little information. But If we don’t want to pay the price for these extra assumptions, the do-calculus is the most general tool to solve the nonparametric problem, and nothing more general exists. This is not a matter of opinion.
      
      At the same time, nobody has ever said that writing a likelihood function from an intervened graphical model and calculating the conditional probability is fundamentally wrong (if the assumptions come from a formal language like SCM or RCM). Such ideas were there from the beginning. But this is just not a panacea given that it’s not clear which assumptions about hidden variables were unnecessary or which made the solution highly sensitive to them. And nobody wants to estimate loads of unnecessary nuisance parameters. In your example with three variables X,Y,Z where all you need is the bivariate distribution of two of them, it seems a strange advice to estimate the full joint and then marginalise the useless variables…
    - More Anonymous on February 10, 2020 3:36 PM at 3:36 pm said:
      
      David,
      
      In your CausalBayesConstruct algorithm, you write that the algorithm’s input and output are
      
      Input: Causal graph G and intervention do(T = t)
      Output: Probablistic graphical model representing this intervention.
      
      So, the input already includes a special ingredient you don’t get from pure probability theory — the causal graph. In my example of the two DAGs with different arrow directions, this is point I was aiming to make.
      
      To quote Nancy Cartwright, “No causality in, no causality out”
      
      Overall, the way causality enters your algorithm is similar to how it enters standard twin networks, which is expected because your approach is very similar to standard twin networks.
      
      Also, I have a suggestion that might foster better reactions to your work from causal inference researchers: In your comments, sometimes it looks like you apply the term “do calculus” to other concepts, like DAG-based causal inference as a whole or the structural interpretation of counterfactuals. At least personally, I have a difficult time following your arguments when this happens. So I suggest trying to avoid that kind of conflation.
      
      …Actually, Ricardo has just now posted a comment that makes this point too, so it’s probably good advice.
      
      I think I’ll stop participating in this conversation here, but it has been a good one. Thank you and good luck.
- Anon on February 25, 2020 6:26 AM at 6:26 am said:
  
  > The fact that Bayesian machinery can be applied to a causal problem that isn’t identifiable with the do-calculus is also noteworthy.
  
  I’m struggling to figure out how this can be true, given the completeness of do-calculus? I assume “be applied to” effectively means this: “It is possible to obtain informative posteriors in cases where the query is non-identifiable via the do calculus.” from the blog post “Causal Inference with Bayes Rule”?
  
  But with the assumptions needed for obtaining informative posteriors in non-identifiable cases (non-identifiable via the do-calculus –> non-identifiable, as per the completeness result), can’t information about the causal effect be obtained with the do-calculus as well?
  
  Does there exist a causal query and a set of assumptions that result in Bayesian machinery providing us more information concerning the (causal) answer than do-calculus would?
  
  How about in cases of partial identification and estimating bounds for the causal effect? (E.g. the IDA algorithm, which estimates a set of possible total causal effects instead of the unique total causal effect.)
  
  Reply ↓
  - David Rohde on February 25, 2020 11:25 AM at 11:25 am said:
    
    Hi Anon,
    
    “Informative posterior” in this case is a posterior that is tighter than the prior but might not concentrate on a point (even in the large data limit).
    
    In difficult cases the posterior will tighten in some aspects but reflect the full uncertainty of the prior in other aspects.
    
    Reply ↓
Carlos Ungil on January 28, 2020 2:03 PM at 2:03 pm said:

In the “current system/system after intervention” diagram for case 1 shouldn’t phi be labeled “Parameter representing P(T|Z)” rather than P(T)?

Unless I’m misunderstanding everything, for case 2 the opposite mistake is present (“Parameter for P(T|Z)” should be “Parameter for P(Z)”) and the label for gamma is also wrong (should be “Parameter representing P(Z|T)”, not P(Z) like in case 1).

Reply ↓
- Carlos Ungil on January 28, 2020 2:04 PM at 2:04 pm said:
  
  I meant of course that “Parameter for P(T|Z)” should be “Parameter for P(T)”.
  
  Reply ↓
  - David Rohde on January 29, 2020 11:47 AM at 11:47 am said:
    
    Thanks for reading past the abstract Carlos. Can you rephrase your question, I don’t understand – and I can’t work out which bit of the text you are referring to.
    
    Reply ↓
    - Carlos Ungil on January 29, 2020 7:00 PM at 7:00 pm said:
      
      I must confess I didn’t read past the abstract… I didn’t read the paper(s) at all, my comment is about the images displayed at the link: https://gradientinstitute.org/blog/6/algorithm1.mp4 and https://gradientinstitute.org/blog/6/algorithm1_case2.mp4
    - David Rohde on January 30, 2020 4:22 AM at 4:22 am said:
      
      Thanks Carlos, yes that does need to be corrected. Good spot.
Finn on February 10, 2020 6:14 PM at 6:14 pm said:

Sorry to come late to such a great discussion – I got swamped by the ICML deadline.

Firstly thanks to everyone for such constructive suggestions and comments. I think most key points have already been addressed – but just to add my thoughts on a few things.

1) We are not at all trying to suggest that causal inference (inferring the impact of an intervention from observational data) is possible without assumptions. We are indeed using the truncated product formula to determine how the structure of the data generating process is changed by intervention. The only difference between what we are doing and Pearl’s CGMs is the explicit representation of both the system before and after intervention within a single Bayesian network, with distributional assumptions represented by hyperparameters. We obviously need to make this clearer.

2) Our work may not be particularly novel. All we have really done is construct an alternative, explicit representation of the assumptions encoded in a CGM. The fact that you can do this is pretty obvious. However, even if this is an idea that has been around since the start, we could not find it clearly written down anywhere, and we found it to be useful in terms of bridging the gap in terminology & approach to thinking about modelling between causal graphical modellers & bayesians. Using this representation allows you to resolve disputed questions like; CG modeller:”you can’t just condition on all observed variables, that can lead to biased estimates (eg M-graph)”, Bayesian: “but you should always condition on all the information we have”. Although our models do resemble twin networks, they are not representing the same thing. From a philosophical point of view, this representation puts CGMs within the broader approach of; define you model (assumptions), define a prior, collect some data, compute your posterior.

3) From an applied standpoint our work is indeed no panecea. Marginalising out the nuisance parameters is likely to be intractable in many cases. If the problem is identifiable, then we can construct (using the same conditional independence relations underlying the do-calculus) a re-parameterisation that doesn’t require marginalisation over latent variables in the observational data – but then the whole thing is equivalent to just using CGMs & the do-calculus but with different notation. Our approach is more interesting for problems that are not identifiable – particularly those that are ‘almost identifiable’ for example non-linear instrumental variables. In these cases, the posterior will contain some sensitivity to the prior but you may still get useful bounds on causal effects.

Reply ↓
- More Anonymous on February 11, 2020 2:39 AM at 2:39 am said:
  
  Finn, Those claims are much more defensible, but also they’re not quite what your arxiv articles say or what your coauthor has said. Do you plan to update your articles and website accordingly?
  
  May I also suggest that your work could draw less of the controversy David mentions if you discuss relevant precursors, like Imbens and Rubin (Ann Statis, 1997), Chickering and Pearl (Comp Sci Stat, 1997), and Balke and Pearl (AAAI, 1994).
  
  One thing I’ve wondered about is how causal identifiability and partial identifiability relate to functional forms of the equations at the nodes of causal graphs. In cases from many fields (chemistry, physics, online advertising… anything really), it makes sense to assume that the equation governing one or more nodes in a causal graph has a specific form, such as linear or polynomial, or even is a noisy version of a specific equation like Michaelis-Menten or Newton’s universal law of gravitation.
  
  Once you assume a functional form, this assumption can affect what is identifiable and partly identifiable. For example, there’s a nice paper by Kuroki and Pearl (Biometrika, 2014) with discussion of how some causal effects become identifiable even in the presence of measurement bias as long as certain nodes are governed by linear equations. However, it doesn’t seem to me that there will be a useful, general theory of how the identifiability of causal graphical models relates to functional forms because the variety of functional forms that need to be considered is too great.
  
  Instead of a general theory, I think that a promising alternative could be to take the specific causal inference problem that is of interest, set up its causal graph and functional form requirement in a Bayesian framework, and then examine the sensitivity of the effect estimate to the priors over the latent variables. Your twin-network-like approach could be a natural home for this!
  
  That would be a boon to causal inference, especially the ability to include specific equations that are scientifically known, like those from physics, microeconomics, etc.
  
  If I were working on your project, I would prioritize this and skip past questions of whether causal inference can be done in pure probability theory. Or… maybe I’m misunderstanding something and my suggestion is bad. Also very possible!
  
  Reply ↓
Ricardo SIlva on February 11, 2020 12:04 PM at 12:04 pm said:

Thanks, Finn. This helps, and I’m looking forward to see its developments! Concerning good references, I second the papers of Balke and Pearl mentioned above, which are explicit likelihood-based approaches for bounding causal effects and remain very relevant (Balke’s thesis is itself an interesting source). You may also be interested in Chckering and Pearl’s early work on putting priors on latent variables, https://maxchickering.com/publications/aaai96.pdf (I stress that I only believe this makes sense if the priors are meant to be encoding informative knowledge, as results are very sensitive to them. I’d rather have posteriors on the bounds like in my work with Robin Evans, or Ramsahai and Lauritzen’s work on confidence intervals, https://academic.oup.com/biomet/article-abstract/98/4/987/234310?redirectedFrom=fulltext. We don’t get bounds from using a latent variable model directly, let alone posteriors on bounds).

Reply ↓
David Rohde on February 12, 2020 3:04 PM at 3:04 pm said:

Thanks everyone for the thoughtful and considered discussion. I sometimes think I am being asked to defend a position I haven’t taken. The position that probability theory is insufficient for causal inference has been in many places by Pearl, although as Kevin mentioned possibly inconsistently with other statements he has himself made. It is possible that nobody here holds Pearls’ position.

Just to highlight some examples, Pearl requested a simple problem to be solved with probability theory (apparently believing it outside probability theory):
https://statmodeling.stat.columbia.edu/2009/07/05/disputes_about/#comment-49482

Incidentally, Andrew Gelman apparently viewed it as outside the Rubin Causal Model:
https://statmodeling.stat.columbia.edu/2009/07/07/more_on_pearls/

Or in Bayesianism versus Dogmatism Pearl states:
“While the Bayesian paradigm teaches us indeed that one should not ignore the prior knowledge in our possession and the variables that we can observe, it does not license us to blindly condition all probabilities on those observations. Instead, it instructs us to think carefully if conditioning would advance us towards the quantity we wish estimated, or away from that quantity”

He chides Rubin for saying: “To avoid conditioning on some observed covariates,… is neither Bayesian nor scientifically sound but rather it is distinctly frequentist and ’nonscientific ad hockery.”

Rubin is perfectly correct in his a Bayesian must always condition on everything (although to call this non-scientific could be considered overreach). Pearl is however correct that conditioning using the model Rubin mentions can result in incorrect causal effect estimates (Pearl cites the M-bias example, Rubin uses a regression model).

So it isn’t precisely true to say “nobody has ever said that writing a likelihood function from an intervened graphical model and calculating the conditional probability is fundamentally wrong” – although I don’t know widely held Pearls’ view actually is.

Our work, I think, helps clarify these issues. We condition on all data as Rubin advises yet reach the same conclusions as Pearl. While you can get the correct answer using Pearls procedures you can using probability theory too. Our worked examples show how to solve problems like the bell problem and the front door rule in the Bayesian paradigm. If you don’t share these views with Pearl, then we don’t appear to have anything to argue about.

Ricardo :
The axiom system that I mentioned is usually called the Ramsey-de Finetti-Savage theory of statistics, although if interested I would look at text books like those of Lad, Bernado and Smith or Kadane (his “Principles of uncertainty” is online). Pragmatic Bayesians prefer to concentrate on applications and best practice so there is very little of it in Gelman’s BDA. The Ramsey-de Finetti-Savage theory is characterized by its use of probability as a decision theoretic primitive i.e. subjective probability using the de Finetti representation to apply that probability specification to large spaces. Of course causal problems involve decision making under uncertainty and are therefore covered by the theory.

I was puzzled by your comment about mathematical basis as it seems to me (a) our approach can be justified by an axiom system (and the do calculus cannot be, and in fact violates conditionality) and (b) that causality is an applied area like statistics not a part of mathematics like probability. If you really believe causality is a part of mathematics that surely has led to us talking past each other.

Your point about it being easier to estimate a marginal than joint distribution is reasonable – in fact we make the same point ourselves as an advantage of the do calculus- but it will violate conditionality. There are arguments about if conditionality is sensible in practice e.g. see the Robbins-Ritrov-Wasserman-Simms debate as well as Berger and Wolpert’s book.

More Anonymous :

The CausalBayesConstruct shows that you can convert the assumptions in a CGM to a PGM. You could however go directly to the PGM. The factorization behind regression P(y,t|beta,theta)=P(y|t,beta)P(t|theta) is the same one used for back door adjustment. The factorization if it applies makes semi-supervised learning impossible and can be studied and applied without a CGM. In short these assumptions are routinely made outside causal settings. I therefore don’t consider mapping the assumptions from the CGM into the full joint in the PGM to be exiting probability theory (although I have little enthusiasm to argue this small point – and it is perhaps our only real disagreement).

I didn’t intend to say you can infer causal quantities without causal assumptions – that would be a contradiction. Only that no mathematics outside probability theory is required.

I fully agree that using the method in applied settings is an excellent future avenue for research.

Everybody seems to be dropping out. I sincerely thank you all for your attention to our work and your patience in presenting differing viewpoints (and Prof Gelman for posting!).

Reply ↓
- Carlos Ungil on February 12, 2020 3:47 PM at 3:47 pm said:
  
  David (and Finn), thank you for the discussion. Despite “Casual Inference” being mentioned in the header of the blog we don’t get enough of it :-)
  
  > The CausalBayesConstruct shows that you can convert the assumptions in a CGM to a PGM.
  
  The CausalBayesConstruct algorithm doesn’t mention that the original and “twin” nodes satisfy P(V|parents(V)) = P(V*|parents(V*)).
  
  Should we understand that setting those constraints is part of the algorithm and that they are part of the resulting PGM?
  
  Reply ↓
  - David Rohde on February 14, 2020 7:20 AM at 7:20 am said:
    
    Can you rephrase the question Carlos? I don’t follow although maybe Finn can help.
    
    Reply ↓
    - David Rohde on February 14, 2020 8:32 AM at 8:32 am said:
      
      Carlos: Maybe you mean a conditionally independent and _identical_ assumption is made if theta->x_n, yes this is assumed.
    - Carlos Ungil on February 14, 2020 3:43 PM at 3:43 pm said:
      
      Thank you David, it was not completely clear to me that the output of the CausalBayesConstruct algorithm was the PGM defined by the DAG and the set of conditional probability distributions and that the step “connect the two graphs linking theta_V to the corresponding variable V* in the post interventional graph, for each V excluding T” implied that the parametrization is shared.
- Daniel Lakeland on February 12, 2020 5:24 PM at 5:24 pm said:
  
  I honestly felt like I wanted to follow this, but didn’t have the mental energy and time given a bunch of other constraints over the last few weeks. However I too would like to thank you for the effort you put in here.
  
  I found it very frustrating to talk with Pearl regarding these issues (there was a long exchange between us on this blog about 3 or 4 years back), because I came to the conclusion just as you have that his understanding of what is probability theory and statistics is entirely frequentist… and my understanding was Bayesian… and so we talked past each other… He even acknowledged knowing about the development of the Cox/Jaynes theory of probability as extended logic, but seemed to gloss over any actual understanding of it.
  
  For example he would insist on a toy problem, and that I calculate some number like the probability for X to happen, and would give me some data… And I would insist that we need a model of the causality in order to calculate this probability, because the probability for X to happen is not a quantity you can calculate from data alone because it represents how much credibility your model of the causality of X occurring gives to each possible outcome…. and so without a model there is no probability. He would essentially insist that I should plug in the frequency in the data and seemed to refuse to acknowledge the idea that probability depends on the specifics of the model.
  
  Suffice it to say that I see p(X | model, data) is *different for each possible structure that the free variable “model” could take on*. If you leave that variable free… then you have an unevaluated purely symbolic quantity… and he insisted I should plug in a number.
  
  In any case, thanks again for putting the time and effort in here.
  
  Reply ↓
  - Ricardo Silva on February 13, 2020 4:31 AM at 4:31 am said:
    
    Yes, that exchange between Rubin and Pearl was unfortunate, as it was clear they were talking past each other. Larry Wasserman even pointed this out. I still repeat that it is true that nobody has fundamental issues against turning the crank on probabilistic conditioning to calculate a causal effect (module issues of identifiability and nuisances). Pearl has done precisely that in his work with Balke and Chickering in the 90s (Bayesian posterior analysis of a unidentified causal effect is even a theme in a chapter of his book, and joint diagrams of exchangeable DAGs for different units sharing a single parameter node appear in Chickering’s papers). Conditioning on the data has always been totally fine here, because the data are different units from the one where a spurious collider might be conditioned on. That is, conditioned on the model, data points are independent, so this has no implications on choice of adjustment. Pearl didn’t get that Rubin was talking about conditioning on past data, and Rubin didn’t get that Pearl was talking about a choice of estimand (so, the model was being “conditioned on” already. Nothing is stopping us from having this model being the result of a Bayesian estimate). It’s again the conflation of estimand and estimator that plagues discussion on causal inference. Your papers help to clarify the distinction (thanks!), but I also hope it is clearer now where Pearl was coming from.
    
    Reply ↓
    - David Rohde on February 14, 2020 8:55 AM at 8:55 am said:
      
      Ricardo,
      
      It is certainly useful to get your perspective, and I will follow through on your references.
  - David Rohde on February 14, 2020 8:57 AM at 8:57 am said:
    
    Thanks Daniel, yes I think Pearl views probability as an external stochastic process and an intervention as modifying this stochastic process – and on this basis argues that probability theory is insufficient for causal inference. If you place a Bayesian joint on the whole system you reach a different conclusion.
    
    Reply ↓
David Rohde on February 14, 2020 7:23 AM at 7:23 am said:

> Given that the model consists of the PGM produced by the CausalBayesConstruct algorithm _plus_ the additional constraints P(V|parents(V)) = P(V* |parents(V*)), what > would be the point of re-factorizing or reversing arrows if the model still has to satisify the constraints defined by the parent-child relationships in the extended CGM?

Hi Carlos,

There is no point, it is just to demonstrate that standard probability is now being used and the arrow direction has no causal meaning.

(sorry for the delay in responding)

Reply ↓
Corey on February 25, 2020 10:22 AM at 10:22 am said:

Finally catching up on this. I’m really excited to see this work; I’ve anticipated it for literally a decade.

https://statmodeling.stat.columbia.edu/2009/07/05/disputes_about/#comment-49475

Reply ↓
judea pearl on February 25, 2020 1:18 PM at 1:18 pm said:

All,
I am gratified to find the discussions on this blog indicating a much greater familiarity with
my work, compared with the discussions I found on my last visit here.
I believe the most effective contribution I can make to the current discussion is in the form of
three links.
1. First, to an unpublished report that complement my exchange with Rubin
“Myth, Confusion, and Science in Causal Analysis” — https://ucla.in/2EihVyD

2. Second, to a paper titled “Bayesianism and Causality, or, Why I am Only a
Half-Bayesian” — https://ucla.in/2nZN7IH
in which I explain the disconnect between Bayesianism and causality

3. Finally, to Section 11.1.1 in Causality (2009), titled
Is the Causal-Statistical Dichotomy Necessary?
https://ucla.in/2NnfGPQ#page=331

There is comfort, I admit, for researchers to dress causal inference in traditional probabilistic
vocabulary; familiar words evoke familiar tools and a sense of safe passage. From logical viewpoint,
however, causality and statistics do not mix, unless one extends the meaning of “statistics”
to cover the entire sphere of scientific thought. (including of course speculations about
Cinderella’s hair color, which can be decorated with Bayes priors.)
But if the comfort of traditional vocabulary increases researchers ability to solve causal problems
(like front door, external validity, mediation and missing data) so be it — I am all for it.
Judea Pearl

Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Causal inference in AI: Expressing potential outcomes in a graphical-modeling framework that can be fit using Stan

74 thoughts on “Causal inference in AI: Expressing potential outcomes in a graphical-modeling framework that can be fit using Stan”

Leave a Reply Cancel reply