Judea Pearl overview on causal inference, and more general thoughts on the reexpression of existing methods by considering their implicit assumptions

This material should be familiar to many of you but could be helpful to newcomers. Pearl writes:

ALL causal conclusions in nonexperimental settings must be based on untested, judgmental assumptions that investigators are prepared to defend on scientific grounds. . . .

To understand what the world should be like for a given procedure to work is of no lesser scientific value than seeking evidence for how the world works . . .

Assumptions are self-destructive in their honesty. The more explicit the assumption, the more criticism it invites . . . causal diagrams invite the harshest criticism because they make assumptions more explicit and more transparent than other representation schemes.

As regular readers know (for example, search this blog for “Pearl”), I have not gotten much out of the causal-diagrams approach myself, but in general I think that when there are multiple, mathematically equivalent methods of getting the same answer, we tend to go with the framework we are used to. Thus, my unfamiliarity or discomfort with Pearl’s causal diagrams does not represent an anti-endorsement but rather just an open statement about my own experiences. (I do have disagreements with some explicators of Pearl; for example, I think Steven Sloman’s book, which I reviewed a few years ago, contains some fundamental misconceptions, but that’s another story. I can’t fault a method because it can lead to errors if used without full understanding, any more than I would slam Bayesian inference in general just because it can give bad results when people inappropriately assume flat priors.)

In any case, though, I resonate with Pearl’s general point that making strong assumptions can be good: strong assumptions give many handles for model checking (see chapter 6 of BDA) and ultimately for model improvement.

P.S. At the conclusion of his post, Pearl comments on the difficulties of working with the categories from Rubin’s 1976 paper on inference and missing data, with the three categories being Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).

I think I can provide some useful background here. Many years ago I asked Rubin about this point, that these particular definitions seem like an odd way to divide the world into three parts. After all, in practice we just about never see “missing completely at random,” and “missing at random” is a strange concept in another sense, in that it allows missingness to depend on observed data which, under a different realization, might not have been observed, so that the classification of the missingness mechanism is itself a random variable. Also, the names themselves are confusing.

Rubin’s reply, when I asked him this, was that he used this awkward partition with these awkward names to be consistent with the existing statistical literature. What was happening was that researchers were already using “missing at random” and similar terms but in a sloppy way, without any mathematical definition or clear statistical justification. So his 1976 paper was, to a large extent, an effort at rationalizing existing terms and practices, taking what people were already doing and uncovering the implicit models underlying these methods. (Just as Rubin did in a different example in a 1991 comment (JASA vol. 86, no. 413, pp. 22-24) on Efron and Feldman’s paper, Compliance as an Explanatory Variable in Clinical Trials.)

So, although Pearl seems to think of work based on Rubin’s 1976 paper as somewhat unscientific, I think he should consider the history behind this: Rubin’s definitions clarify the assumptions underlying the methods that people were already happy to use.

I consider this sort of activity—taking a proposed or existing method and considering what underlying model it corresponds to—to be an excellent thing to do, and very much in the Bayesian tradition. Here are two early examples of my own such efforts, from 1990 on a paper by Silverman et al., and from 1992 on a paper by Donoho et al. In both cases I don’t think the authors of the original papers really saw the point of my Bayesian reinterpretations, but I found it very helpful to take a method and consider what it meant as a model.

29 thoughts on “Judea Pearl overview on causal inference, and more general thoughts on the reexpression of existing methods by considering their implicit assumptions”

    • Ruben:

      Yes, I agree. When Uri sent me that post, here is what I wrote to him:

      Thanks for quoting me! I do find it frustrating that some Bayesians have this attitude of not wanting to study the data model. Regarding your larger point, I pretty much agree with you. But I do think that in certain settings, a hierarchical Bayesian analysis allows one to avoid multiple comparisons problems. See this paper. The key is that, in the Bayesian analysis we recommend, all the comparisons are studied. So no selection is involved.
      See you
      A

      P.S. This recent paper addresses some of the connections between Bayes and p-hacking.

  1. I found the last bit about the history of MCAR, MAR, and MNAR interesting. No doubt Rubin made a major contribution that generated a lot of useful research. Yet I believe the time has come to move beyond these approaches.

    Specifically, I agree with Judea that a modern approach ought to analyze missingness on the basis of the underlying causal structure. For example, using this approach I found that:

    Attrition is the Achilles’ Heel of the randomized experiment: It is fairly common, and it can completely unravel the benefits of randomization. Using the structural language of causal diagrams I demonstrate that attrition is problematic for identification of the average treatment effect (ATE) if — and only if — it is a common effect of the treatment and the outcome (or a cause of the outcome other than the treatment). I also demonstrate that whether the ATE is identified and estimable for all units in the experiment, or only for those units with observed outcomes, depends on two d-separation conditions. One of these is testable ex-post under standard experimental assumptions. The other is testable ex-ante so long as adequate measurement protocols are adopted. Missing at Random (MAR) assumptions are neither necessary nor sufficient for identification of the ATE.

    For more details see http://ssrn.com/abstract=2302735

      • Andrew:

        I’ll have to read it more closely but, being v. short on time, I would say this:

        1. We don’t have to take the latent ignorability assumption in section 6.3 on faith. I provide _proofs_ for tests that can tell us whether the ATE is identified and, if so, whether it is estimable for all units or only for those units with observed outcomes.

        2. The _intuition_ in the second paragraph of 6.4 is just that, an intuition. For example, in my manuscript I show that the ATE may be identified and estimable, at least for units with observed outcomes, even if all units have completely different observed covariate missing data patterns, contrary to the intuition. But to _prove_ that conclusion one needs to work with causal explanations for missing data.

        3. I would have liked to see a causal diagram of the assumptions being made, thus providing an immediate sense of what is going on. Those with some basic training in causal diagrams may find such a diagram much friendlier than Table 3 and its attendant elaborations. (In general I find graphs friendlier than long tables of counterfactuals.)

  2. Just so I understand: in my field a ‘strong assumption’ is an assumption for which there is little supporting evidence, while a ‘weak assumption’ is an assumption for which there is good supporting evidence. For example, an analyst might prefer permutation tests, arguing that they make ‘weaker assumptions’ about the functional form of the estimator. Is this the sense in which you are using ‘strong assumption’? Or do you mean that the assumptions should be as explicit as possible (which is what I take from the Pearl quote — he feels his method gets unfairly pinged just because it brings the assumptions to the surface)?

    • Nick:

      By “strong assumptions” I’m typically talking about full probability models, that is, a model for the underlying process and also for the data-generation mechanism, in contrast to many standard statistical assumptions that try to get away with not modeling various aspects of the problem.

    • “For example, an analyst might prefer permutation tests arguing that they make ‘weaker assumptions’ about the functional form of the estimator.”

      You may not be making simplifying approximations with the estimator, but permutation tests always make pretty big assumptions about the correlation structure of the null model, usually with little or no supporting evidence.
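      To see how big those assumptions can be, here is a minimal simulation sketch (the AR(1) process, the block-structured labels, and the simulation sizes are all illustrative assumptions, not anything claimed in the thread): with autocorrelated data, a naive permutation test rejects a true null far more often than the nominal 5%.

```python
# A sketch of the hidden assumption: a permutation test treats labels as
# exchangeable under the null. With AR(1) data and block-structured
# labels, exchangeability fails and the test over-rejects. All settings
# here (rho, sizes, nominal level) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)

def ar1(n, rho=0.9):
    """Generate an AR(1) series: y[t] = rho * y[t-1] + noise."""
    e = rng.normal(size=n)
    y = np.empty(n)
    y[0] = e[0]
    for t in range(1, n):
        y[t] = rho * y[t - 1] + e[t]
    return y

n, n_sims, n_perm = 100, 500, 200
labels = np.repeat([0, 1], n // 2)      # "treatment" = second half of the series
rejections = 0
for _ in range(n_sims):
    y = ar1(n)                          # the null is true: labels have no effect
    stat = y[labels == 1].mean() - y[labels == 0].mean()
    perm_stats = []
    for _ in range(n_perm):
        p = rng.permutation(labels)     # shuffling ignores the autocorrelation
        perm_stats.append(y[p == 1].mean() - y[p == 0].mean())
    pval = (np.abs(perm_stats) >= abs(stat)).mean()
    if pval < 0.05:
        rejections += 1

print(rejections / n_sims)              # far above the nominal 0.05
```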

  3. Just a statement that I enjoy causal diagrams and causal-diagram analysis because they, to me at least, highlight the inherent issue of density versus point, carried through to multiple levels as the MAR discussion and comments show. (Read for that, density function and point function or mass function or whatever you find appropriate.) To wax philosophical, we need to reduce to decisions even as those moments flit through our fingers – “It eluded us then, but that’s no matter — to-morrow we will run faster, stretch out our arms farther. . . . And one fine morning — So we beat on, boats against the current, borne back ceaselessly into the past.”

    I suspect that some of these issues will be better addressed by description/categorization of the underlying geometries’ shapes and thus how they may/can fit through multiple levels given both a density and a point approach. It seems as though we’re at last making progress in that area.

  4. I think this is one area where methods are racing ahead on steroids. Most applied papers I come across never seem to use anything close to this in sophistication. Case in point: causal diagrams.

    Am I mistaken?

  5. Dear Andrew,

    Your account of how Don Rubin came up with the three famous categories (MCAR, MAR, MNAR) is fascinating and instructive: “Rubin’s definitions clarify the assumptions underlying the methods that people were already happy to use.” This is how I would have guessed the partition had evolved, except I would use the word “formalize” rather than “clarify.” Because, as shown in my posts http://www.mii.ucla.edu/causality/ and http://ftp.cs.ucla.edu/pub/stat_ser/r417.pdf, very few people, and even fewer users, understand these definitions. Even authors of missing-data textbooks invariably copy the definitions from Rubin, bemoan their incomprehensibility, then replace them with different definitions that one can explain to readers.

    So, isn’t it time to replace this classification with one that a typical user can comprehend? After all, it is the user, not the analyst, who has to determine whether a given problem falls under MAR or not.

    I was happy to see that the movement to humanize missing-data classification has already begun, as in the paper by Seaman et al. (Stat Sci 2013) that was introduced here by O’Rourke (thanks!) and the one by Potthoff et al. (cited there). Causal graphs are another step in this humanization process, and I hope you get a chance to examine them, because you will not be the same person, I promise.

    How do I know? I glanced at the paper by Barnard et al. on “broken randomized experiments” that you suggested to Fernando, and I would like other readers of this blog to examine it as well, and ask themselves whether they understand what assumptions the authors made, whether the assumptions seem reasonable given the context, and which of the assumptions lend themselves to statistical tests. Now compare this paper to the one posted by Fernando (http://ssrn.com/abstract=2302735) on a related topic, and try to answer the same questions. I do not think anyone would wish to go back to the old style of doing science.

    This means that, even if one has not gotten much out of the causal-diagrams approach in the past, I find it hard to imagine that any missing-data analyst can afford to shun causal diagrams in the future, if for no other reason than to explain to students in class what MAR is, when a given problem permits consistent estimates, and when the model is testable.

    This is what I tried to convey in my post, which is aimed at giving rebuttal points to causal analysts who are still encountering skeptical reviewers asking (some naively): And where does the diagram come from?

    • Judea:

      Thanks for the kind words. As you know there is no doubt in my mind that the time for causal diagrams has come. But this ought to be good for everyone involved.

      For example, I really liked the last sentence in Section 6.4 of Barnard et al. which points towards a compromise: (i) Use causal diagrams to quickly prototype structural models, check identifiability, derive tests etc., and (ii) then use hierarchical models or other parametric approaches to “assist” with the estimation (e.g. overcome sparsity in strata and so on). Each technique and language has a comparative advantage.

      Unfortunately, instead of mastering two languages specifically adapted to different purposes, what some researchers have done is adapt statistics to causality by introducing potential outcomes into conditional probability statements. I don’t think this is a good solution. Moreover, I find odd a model of causality where experimental units are just sets of two or more boxes hiding potential outcomes — and all an RCT does is reveal the contents of these boxes at random — versus a functional model of causality where RCTs bring about controlled _changes_ in the world we observe. Causal diagrams make it easy to work with functional relations. There is no need for additional abstractions.

      • Fernando:

        I think you’re slightly confused about the role of potential-outcome notation. The Bayesian potential-outcome model is just a regression model: y (the potential outcome) depends on the treatment variable T and various pre-treatment variables X. You could write y = g(T,X) + noise (or, even more generally, y = g(T,X,noise) or y ~ g(T,X)). The point is that, in the absence of any experimentation, y is this stochastic function of T and X. Of course, we don’t know the function—that’s the point of the statistical inference—so we say it depends on some unknown parameter vector theta; thus y = g(T,X,theta,noise) or y ~ g(T,X,theta). You write of “sets of two or many more boxes” but that’s just a particular representation of the dependence of y on T. In probability theory, it is standard to speak of a distribution as being represented by the drawing of balls out of urns, or in this case maybe the random selection of a box. (I suppose the analogy of a box arises because you can’t see what’s inside the box until you’ve chosen it.)

        What I think is confusing you about the potential-outcome notation is its separation of the acts of treatment and measurement. You would like “a functional model of causality where RCTs bring about controlled _changes_ in the world we observe.” That’s fine, but the changes are coming from the treatment, not from the randomization. In the above notation, the controlled changes correspond to the dependence on T in the model y ~ g(T,X,theta). The randomization comes from the data model, that is, the probability of observing T=0 or T=1. The “science” (as Rubin puts it) determines the model g(T,X,theta); the randomization (or, more generally, the design of data collection) determines a probability distribution for T (or, more generally, a distribution for T given X, y, and some possibly unknown parameters phi). It can sometimes seem confusing to have two statistical models, one for the underlying process and one for the data collection, but this comes in handy when considering designs more complicated than simple random sampling or completely randomized experimentation.

        I’m not saying the potential outcomes notation is the only way that causal inference can be modeled; I’m just laying out the motivations. It’s not quite as odd as you seem to think, it’s just a separation of the uncertainty into two different processes, which can make a lot of sense in some examples.
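        To make the two-model setup concrete, here is a minimal simulation sketch; the linear form of g and all the numbers are illustrative assumptions, not anything from the discussion above.

```python
# A minimal sketch of the two models described above, with an illustrative
# linear g: the "science" y ~ g(T, X, theta) plus a data-collection model
# for T (here, a completely randomized design with p(T=1) = 0.5).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

X = rng.normal(size=n)                      # pre-treatment covariate
T = rng.binomial(1, 0.5, size=n)            # data-collection model for T

theta = {"intercept": 1.0, "effect": 2.0, "slope": 0.5}

def g(T, X, theta):
    """The 'science': expected outcome given treatment and covariate."""
    return theta["intercept"] + theta["effect"] * T + theta["slope"] * X

y = g(T, X, theta) + rng.normal(size=n)     # y = g(T, X, theta) + noise

# Because T is randomized, a simple difference in means recovers the
# average of g(1, X, theta) - g(0, X, theta), which is 2 by construction.
print(y[T == 1].mean() - y[T == 0].mean())  # approx 2.0
```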

        • Andrew,

          I agree that potential outcomes should not be confusing, because they are nothing but abbreviations of structural equations (structural, not regression, as you stated). Readers may wish to watch a video where I explain this point at great length and through vivid examples: http://idse.columbia.edu/seminarvideo_judeapearl. It was given at Columbia, your home university, so it should not be too hard to follow. I hope you agree with all my heretical statements and that potential outcomes can be demystified.

        • Andrew:

          I can see how the potential outcome set up might be useful for Bayesian estimation. But IMHO that is the source of the trouble. An approach to causality derived from the needs of Bayesian modeling and estimation, when causality can be given a completely non-parametric treatment on the basis of structural equations and ceteris paribus conditions, adds more pieces than are needed.

          This is why I like to distinguish structural causal models and causal identification from estimation. My sense is Rubin is working in the other direction, from estimation to causality, mixing intervention with measurement.

          PS I don’t think I am confused (though I may be wrong!). I think what is going on is that I may look confused from the perspective of a Bayesian. From my perspective I’d grant you that I find some Bayesian approaches to causality if not confused, at least confusing. But maybe that was your point.

        • Fernando:

          I don’t see the y ~ g(T,X,theta) approach as being “derived from the needs of Bayesian modeling and estimation.” It just seems natural to me to model uncertain outcomes via a random function. This is not particularly Bayesian; it’s basically mainstream statistics as of the past 125 years or so. You might be right that other approaches could work better, but it’s really just the standard approach (hence the references to Neyman (1923), which itself builds on the work of Pearson and others on modeling unknowns and predictive quantities using distributions).

        • Andrew:

          y ~ g(T,X,theta) lacks directionality. I could re-arrange that to x ~ g(T,Y,theta). The latter would not really make sense as a structural equation if x were a cause of y but not vice versa. That is why Judea in his comment made the distinction between regression and structural equations.

          In the same way that Arabic numerals carry two bits of information, position and magnitude, so causal diagrams carry two meanings: functional relations and causal direction (hence the arrows). These make an equation structural.

          I am no expert, but it seems to me (some) statisticians have insisted for over 100 years on using a language not well suited to causality, so we end up with ambiguous things like MAR, etc. My limited understanding is that Pearson might have wanted to ban the word “causality” from the face of the earth….

        • PS the point about directionality is not nitpicking. If you read my manuscript on attrition, or Judea’s work on missing data, you’ll see that it is the underlying structural model, or causal diagram, that determines whether attrition or missing data are problematic, how to diagnose it, what to do about it etc. IMHO MAR is ambiguous bc it tackles a difficult problem with limited vocabulary.
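          To see the directionality point in action, here is a minimal simulation sketch (the linear structural equation and the coefficients are illustrative assumptions): both regressions describe the joint distribution equally well, but only the structural direction predicts what happens under intervention.

```python
# A sketch of directionality: with x -> y structurally, the regressions
# of y on x and of x on y both describe p(x, y), but only the structural
# direction predicts the result of an intervention. Numbers illustrative.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)         # structural equation: x causes y

b_yx = np.cov(x, y)[0, 1] / np.var(x)    # approx 2.0
b_xy = np.cov(x, y)[0, 1] / np.var(y)    # approx 0.4; fits just as "well"

# Intervene on x: set x = 1 and regenerate y from its structural equation.
y_do = 2.0 * 1.0 + rng.normal(size=n)
print(b_yx, y_do.mean())                 # the y-on-x slope predicts this shift

# Intervening on y instead leaves x untouched (x's equation does not
# involve y), whereas the reversed regression b_xy would wrongly predict
# a shift in x.
```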

        • Fernando:

          Indeed, the general model is p(y,T,X) (or, one might say, p(y,T,X|theta,phi)). The framing as a regression problem p(y|T,X,theta) in general throws away some information but can be convenient. We discuss this in the first section of the regression chapter in BDA.

          In the potential-outcomes framework, the problems of causal inference and missing data are separated. You use statistical methods to impute the missing data; then, once these have been imputed, you compute causal inferences as desired (for example, g(T=1,X,theta) − g(T=0,X,theta)).
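          As a minimal sketch of this impute-then-contrast logic (the linear outcome model and all numbers are illustrative assumptions, not anything endorsed above):

```python
# A minimal sketch of "impute the missing potential outcomes, then take
# the causal contrast," assuming an illustrative linear outcome model.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

X = rng.normal(size=n)
T = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * T + 0.5 * X + rng.normal(size=n)    # observed outcomes only

# Fit a stand-in for g(T, X, theta) by least squares on (1, T, X).
A = np.column_stack([np.ones(n), T, X])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

def g_hat(t, X):
    """Imputed potential outcome under treatment t."""
    return theta_hat[0] + theta_hat[1] * t + theta_hat[2] * X

# Impute both potential outcomes for every unit, then average the contrasts.
print(np.mean(g_hat(1.0, X) - g_hat(0.0, X)))       # approx 2.0
```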

        • Andrew:

          I don’t think separating the missing data and causal inference problem is the best approach. In particular, whether and how you can impute the missing data _is_ a causal question, for it is the underlying causal mechanism that renders it problematic or not.

          In my work I use the fact that (i) causality implies correlation (though not the opposite); and (ii) all we need to do is classify the unknown underlying causal models generating the missing data into problematic or not.

          From this I work out what _all_ problematic models have in common, and what observable manifestations, in terms of observed correlations, distinguish them from non-problematic ones. We can then use these insights to derive diagnostic tests to infer whether the unknown underlying structural model is problematic or not on the basis of observed correlations. Note we do not need to know the specifics of the model, only whether it has some problematic properties.

          This might not be the ideal solution, but the present alternative is to assume MAR, impute, perform some diagnostic tests that may be sensible but are not proven, and pray to God. Instead we now have a complete explanation of problematic attrition, diagnostic tests, and substantiated procedures that are clearer and more flexible than MAR.

        • Andrew:
          For some of us it is more convenient to model sources of systematic errors (I’m not talking about random errors) in causal inference using structural equation models. I believe that was Fernando’s point. I will give a simple example:
          We know from the laws of probability that P(A,B) = P(A|B)*P(B) = P(B|A)*P(A).

          But if we have the causal knowledge that B cannot cause A, the two factorizations no longer carry the same meaning: only P(B|A)*P(A) mirrors the structural model, and intervening on B leaves A alone, so P(A|do(B)) reduces to P(A) even though P(A|B) generally does not. You can’t figure this out based on data alone.

          After incorporating the causal assumptions we can then switch to statistical models. Whether it is a challenge to determine which causal assumptions are accurate (e.g., “where the graph comes from”) is a whole different question.
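          In symbols (the first identity below additionally assumes no common cause of A and B; that extra assumption is mine, not the commenter’s):

```latex
% The two factorizations are algebraically identical:
%   P(a,b) = P(a|b) P(b) = P(b|a) P(a),
% but if B cannot cause A (say the structure is A -> B), interventions
% break the symmetry:
\[
  P(b \mid \mathrm{do}(a)) = P(b \mid a)
  \quad \text{(assuming no common cause of $A$ and $B$),}
\]
\[
  P(a \mid \mathrm{do}(b)) = P(a), \quad
  \text{which differs from } P(a \mid b) \text{ in general.}
\]
```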

        • Fernando and Andrew,

          Rubin’s mantra “causal inference is a missing data problem” had lots of merit at the time it was pronounced (1980s?). The rationale was: here is a new beast, called “cause,” about which we know almost nothing; if we reduce it to a missing-data problem, about which we know so much, we will be way ahead. So the trick was to introduce unobserved counterfactual variables, like Y_x, Y_1, Y(0), . . . , assume that they behave like any other variables, reduce our policy question P(y|do(x)) = ? to the purely statistical problem of finding E[Y_x] = ?, which is simply the problem of estimating the mean of a variable Y_x whose values are missing whenever X is not equal to x, and we are done.
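          As a toy rendering of this reduction in code (all numbers are illustrative, and the counterfactual columns are of course never jointly observed in real data):

```python
# A toy rendering of the reduction: each unit carries two "boxes" Y_0 and
# Y_1, and observing X = x reveals exactly one of them, so E[Y_x] becomes
# the mean of a variable with missing values. All numbers illustrative.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

Y0 = rng.normal(0.0, 1.0, size=n)        # counterfactual outcome under X = 0
Y1 = Y0 + 2.0                            # counterfactual outcome under X = 1
X = rng.binomial(1, 0.5, size=n)         # here, a randomized "policy"

Y1_table = np.where(X == 1, Y1, np.nan)  # Y_1 is missing whenever X != 1
print(np.round(Y1_table[:8], 2))         # NaNs mark the unopened boxes
print(np.nanmean(Y1_table))              # estimate of E[Y_1], approx 2.0

# With X randomized, dropping the missing rows is harmless; with
# observational X, justifying that step is the entire causal problem.
```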

          A brilliant move, which clicked immediately with the conceptual framework of most statisticians at the time. No new operators (do(x)), no need to think about causal assumptions, just ordinary statistical estimation combined with ordinary statistical techniques for missing data. How can you beat it?

          Well, at the JSM-13 conference there was a whole session dedicated to the idea that “causal inference is a missing data problem.” And, lo and behold, the first talk in that session (by Meng) was titled “Which problem is NOT a missing data problem?”, which reminded me of the boy in H. C. Andersen’s story who dared say “The Emperor has no clothes.” In other words, do we really know enough about missing data to rejoice in the reduction from a causal to a statistical conceptualization of the problem? It turns out we know very little about missing data, and the little we know is predicated on the MAR assumption, which amounts to knowing the very distribution we are trying to estimate from the missing data.

          In the meantime, advances in graphs and counterfactual logic came into being that made the original policy question P(y|do(x)) = ? perfectly transparent, rendered the meaning of those counterfactual variables Y_x, Y(1) discernible from structural equations, and made our handling of missing-data problems so much more powerful when based on causal diagrams, that the question should return to public discussion: should we continue to rejoice in the reduction from a causal to a statistical conceptualization of the problem?

          Here is a provocative proposal to Andrew and Fernando. How about you and I propose a session for JSM-14 on the topic “Missing data from a causal inference perspective”? Would you go for it? I know that every causal-inference researcher would welcome it; would any of the missing-data people? Do I hear a Yes?

        • Judea:

          That would be a lot of fun! Here is one proposal, a play on your blog post title: “Who cares where the DAG came from?”

          The idea is as follows: even if we don’t know the true model – not even a guess – causal diagrams can be useful. Specifically, we can use them to prototype models fast and – here is the key – identify “equivalence classes” of models (e.g., models where X causes Y, no matter how: directly, indirectly, through one or a million mediators). This opens up the possibility of analyzing an infinite number of models if we can reduce them to a finite set of equivalence classes.

          How is this useful? To borrow Plato’s analogy, in some applications we can reduce everything outside the cave – the space of true models – to a finite equivalence-class partition. Next we figure out what shadows these partitions would cast inside the cave if that were the state of the world. Finally we go back into the cave – the realm of scientists – to read the shadows and infer the state outside. This way we can diagnose problems without knowing or positing a true model.

          This is what a very powerful causal language, like causal diagrams, allows us to do. The stuff on attrition is an application.

  6. Fernando,

    I read the last sentence of Barnard et al.’s Section 6.4, but I could not find there the redeeming elements that you mentioned. The compromise that you wrote about, i.e., structural models, identifiability, derived tests, ending with parametric estimation, is not really a “compromise” but a necessity, yet you would not find it in the vocabulary of the old school.

    In my recent post on the subject, I conjectured that people’s reluctance to use graphs stems from their reluctance to expose their assumptions and the habit of keeping assumptions under the rug. I now have another theory: people think that graphs are merely a neat way of communicating causal assumptions, and no more. What they do not realize is that graphs are hidden calculators, or inference engines, that compute for us ALL the logical implications of the assumptions, including, in particular, the sum total of their testable implications. I arrived at this theory in the past few days after reading a paper by the economists Heckman and Pinto, where they attempt to derive conditional-independence relations using the graphoid axioms, not realizing that we can get conditional independencies for free using d-separation in the graph.
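    As a concrete taste of “conditional independencies for free,” here is a minimal sketch that reads an independence off a three-node chain via d-separation. It assumes the networkx package is available; the routine’s name changed across networkx releases, which the sketch hedges against.

```python
# A minimal sketch: reading a conditional independence off a DAG via
# d-separation, using networkx (the function is nx.d_separated in older
# releases and nx.is_d_separator in newer ones).
import networkx as nx

G = nx.DiGraph([("X", "M"), ("M", "Y")])   # chain: X -> M -> Y

d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated

print(d_sep(G, {"X"}, {"Y"}, set()))   # False: X and Y are marginally dependent
print(d_sep(G, {"X"}, {"Y"}, {"M"}))   # True: X _||_ Y given M
```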

    The reason missing-data problems make graphical models so crucial is that all theories of missing data are built around the notion of conditional independence, and one can easily get lost without an inference engine (a graph) when navigating many such independencies. But, before accepting one theory or another, we need to hear from more people like Andrew, who have tried causal diagrams and not gotten much out of them.

    • Judea:

      I agree with you that the causal diagram is the inference engine, and that attrition, or missing data more generally, are all about conditional independence statements.

      What I mean by compromise, not specific to missing data but more generally, is that just because we can do a lot of inference non-parametrically, that does not mean we should always do so. As you correctly point out, in all cases the causal diagram is necessary. But it may not be sufficient in all applications, and non-parametric estimation is not always ideal. Put differently, all I am trying to say is that you can buy into causal diagrams without having to go non-parametric in everything you do. I am sure you agree, but in my experience at seminars people get defensive bc perhaps they think I am telling them to abandon all they currently do.

      Trivially the diagram is not sufficient when a causal effect is not non-parametrically identified but where we have additional parametric information that may make it identifiable. Another situation arises in the study of heterogeneity and interactions. These are functional properties of causal processes not captured very well by causal diagrams, where heterogeneity is implicitly pervasive (diagrams are useful to distinguish moderation, modification, etc…). If effects vary by strata of a covariate we can compute these non-parametrically stratum by stratum, or do partial pooling with a hierarchical model. The latter have been shown time and again to have very good properties. But to be clear, this is in addition to a diagram.
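      As a minimal empirical-Bayes sketch of that partial pooling (the per-stratum estimates and standard errors below are invented for illustration; a full hierarchical Bayesian fit would instead put priors on the pooled mean and variance):

```python
# A minimal empirical-Bayes sketch of partial pooling: stratum-specific
# effect estimates shrunk toward the pooled mean, with shrinkage set by
# each stratum's sampling variance. All numbers are illustrative.
import numpy as np

est = np.array([2.8, 0.8, -0.3, 1.8, 0.7, 4.1])   # per-stratum estimates
se = np.array([0.8, 0.5, 1.2, 0.6, 0.4, 1.5])     # their standard errors

mu = np.average(est, weights=1 / se**2)           # precision-weighted pooled mean

# Method-of-moments estimate of the between-stratum variance tau^2.
tau2 = max(0.0, np.var(est, ddof=1) - np.mean(se**2))

# Normal-normal shrinkage: pull each estimate toward mu by se^2/(se^2+tau^2).
shrunk = mu + (tau2 / (tau2 + se**2)) * (est - mu)
print(np.round(shrunk, 2))                        # noisier strata shrink more
```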

      • Fernando:
        Regarding your last paragraph, you might be interested in reading Richardson and Robins’s work on SWIGs (which are simply DAGs, no matter what they call them).

