Judea Pearl and I briefly discuss extrapolation, causal inference, and hierarchical modeling

OK, I guess it looks like the Buzzfeed-style headlines are officially over.

Anyway, Judea Pearl writes:

I missed the discussion you had here about Econometrics: Instrument locally, extrapolate globally, which also touched on my work with Elias Bareinboim. So, please allow me to start a new discussion about extrapolation and external validity.

First, two recent results may be of interest to your readers:

1. In the paper we show that the three problems: extrapolation, data fusion and selection bias (you call it “non-representative samples”) can be solved using one unified method.

2. In a subsequent paper I lay before readers two fundamental limitations of ignorability-based languages which hamper their ability to handle generalization problems.

Second, the two papers you posted here, by Dehejia et al., should be evaluated in light of the limitations reported in (2). I believe Julian alluded to these limitations when he wrote: “They [the Dehejia et al. papers] should be special cases of the Pearl/Bareinboim theory.” There is nothing wrong with solving special cases, as long as we do not imply that the solution is the best that can be done. It turns out that we now know the limit of what can be done and, to reach this limit, one needs to go beyond the boundaries of ignorability languages.

Lastly, you answered Julian’s question by pointing to an old discussion we had on your blog and suggesting that “Bareinboim integrate hierarchical modeling in their framework”. This suggestion may well be worth pursuing, but it does not indicate acceptance of the fact that we now have a general solution to the problem of generalizing experimental findings across populations, and that this general solution is not necessarily hierarchical.

To further clarify my point, let us address the role of hierarchical models in this discussion. Granted that hierarchical models are more general than non-hierarchical models, one would conclude that problems like those solved in Bareinboim’s papers can also be solved by hierarchical methods. Do you believe this to be the case? I, for one, am not aware of any way of solving those problems without the use of do-calculus. So, I am curious: can it be done?

I am not asking for your endorsement of methods which you have not tried, but I somehow feel the discussion of extrapolation will not be complete without my mentioning that methods that solve those problems do exist (and they use do-calculus).

I replied:

Thanks for the note. What I’m saying is that, whatever causal inference framework is being used, I think when extrapolating it is appropriate to use hierarchical models to partially pool. I don’t think of hierarchical models as a competitor or alternative to your causal inference methods; I see hierarchical modeling as an approach that can be used under any framework, whether it be yours or Rubin’s or some other causal framework used in epidemiology, or whatever.

I think hierarchical modeling can help with external validity to the extent that it allows researchers to go beyond simple yes-or-no include-or-don’t-include decisions on extrapolation. Some colleagues and I have a paper on this which I will post once we get permission from the company involved.

So I’m happy to see Pearl’s new papers, and hope that people working within his framework will try out hierarchical modeling (that is, meta-analysis) in their extrapolations.


  1. judea pearl says:

    Dear Andrew,
    Thanks for posting our conversation.
    I found a few glitches in the links provided (my fault). Here are corrections:
    1. In the paper we show that the three problems: extrapolation, data fusion and selection bias (you call it “non-representative samples”) can be solved using one unified method.

    2. In a subsequent paper I lay before readers two fundamental limitations of ignorability-based languages which hamper their ability to handle generalization problems.

    Plus, our Statistical Science paper is now published, and the correct link is


  2. Rahul says:

    How can hierarchical modeling help with external validity? That part wasn’t clear to me.

    • Andrew says:


      The idea is as follows. A study is done in situation A, and we want to generalize to situation B. The hierarchical modeling allows one to generalize, but with an error term that acknowledges unmodeled differences between the two situations.

      • Rahul says:


        But this generalization, as well as the concept of an error term for un-modeled differences; is that unique to hierarchical models?

        i.e. Cannot a non-hierarchical model do these things as well?

        • Andrew says:


          Any model can be expressed in different ways. To me, if you’re doing partial pooling, you’re using a hierarchical model. But you could give it different words and call it an implicit regression model or something else.

    • judea pearl says:


      I think Andrew has a more comprehensive view of how hierarchical modeling meshes with external validity.
      My views are shaped by concrete examples, and my favorite example is transporting experimental
      findings from Los Angeles to New York, so as to account for the disparity in the age distribution between the
      two populations. Here there is no natural hierarchy
      that I can think of and, still, we can show that it can be done by re-weighting.
      In other examples, and with the same data, we can show that reweighting is bad.
      I do not know how it can be done with data-driven methods, and no causal model,
      but I am here to learn, not to teach.

      I can speculate that, if the Los Angeles data were obtained by pooling samples from
      different districts, and if some districts were under-represented in the sample, one would
      wish to leverage the natural hierarchy of districts to pool data in a smart way, so as to minimize sampling variability.

      I am inclined to believe that external validity deals with inherent disparities in the
      characters of several populations while hierarchical modeling deals with
      disparities in sample sizes. The former is an asymptotic challenge (i.e., problematic even when
      we have infinite sample sizes) and the latter is a small-sample challenge (i.e., trivialized when
      sample size goes to infinity).

      If this is right, then external validity relates to hierarchical modeling
      like identification relates to estimation.

      But Andrew may have a different perspective on this question. I wish only that he explains how we move from
      LA to NYC with hierarchical modeling, and how we can discover that
      reweighting is unbiased in one story and biased in another (same data!),
      without modeling the story.
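
The contrast drawn above between an asymptotic challenge and a small-sample challenge can be illustrated with the standard normal-normal model. This is a minimal sketch, not anything from the thread: the function name and the values of `sigma` (within-group standard deviation) and `tau` (between-group standard deviation) are illustrative assumptions.

```python
def pooling_weight(n, sigma=1.0, tau=0.5):
    """Weight that partial pooling puts on the pooled mean for one group.

    Assumed model (illustrative): the group's sample mean has sampling
    variance sigma**2 / n, and true group effects vary across groups
    with standard deviation tau.
    """
    sampling_var = sigma ** 2 / n
    return sampling_var / (sampling_var + tau ** 2)


# The weight on the pooled mean shrinks toward zero as n grows.
for n in (1, 10, 100, 10_000):
    print(n, round(pooling_weight(n), 4))
```

Under these assumptions the pull toward the pooled mean vanishes as the sample size grows, whereas the question of whether a reweighting formula is valid for the target population does not depend on `n` at all.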


      • Andrew says:


        My views also are shaped by concrete examples. I have no concrete examples of experimental studies that were done in Los Angeles that are being applied to New York. I have no doubt that such studies exist but I have not worked on them myself. For examples of studies in which partial pooling can be applied to generalize to new cases, I refer you to many many examples in my books and research articles.

        You write, “hierarchical modeling deals with disparities in sample sizes.” Indeed, hierarchical modeling can deal with disparities in sample sizes, but it can also be useful in settings where sample sizes are equal. See, for example, the 8 schools model in chapter 5 of BDA.

        My point is that, if you are interested in extrapolation to new scenarios, I recommend hierarchical modeling as a way to compromise between no pooling and complete pooling. I don’t think this is, or should be, a controversial position. It’s a very mild statement! So I feel like you’re trying to start a debate where none is needed. You and others can feel free to work within your causal inference framework. When, within that framework, you want to generalize from one city to another (or, to take an example with which I’m more familiar, to generalize from one pharmacological experiment to another), I recommend using hierarchical models to partially pool.

        Hierarchical modeling does not resolve identification issues; it’s a way to perform more efficient inference conditional on whatever identification you have.

        • judea pearl says:

          We are not in a debate, because I am trying to learn how hierarchical modeling works, and everything
          I say represents an honest attempt to understand.

          Let us take the example that you cited, where we wish to “generalize from one pharmacological experiment to another”.

          Structurally, I do not see a difference between this task and the one where we go from LA to NYC. So, here
          are a few clarifying questions:

          1. Why do we want to “generalize”? Why not just discard the second experiment and stick to whatever
          the first experiment tells?

          2. Is it because the sample size is small and we wish to capitalize on the combined samples?

          3. Is it that the two populations appear to differ in some relevant characteristics?

          4. Do we need to know on what characteristics they differ, or just suspect that such characteristics may exist?

          5. If the former, do we need to know how the differing characteristic relates to treatment and outcome?
          If so, how do we specify this relationship?

          6. What do we know about the target population (i.e., the end users), and how it differs from the two
          experimental populations?

          7. I assume that the answers to some of these questions are: “not necessarily”, or “We may or may not”
          but, just to help me understand how hierarchical modeling works, let us choose ONE scenario and point out where
          hierarchical modeling enters, what the elements of the hierarchy are, and what it gives us that non-hierarchical
          generalization (using only three variables, treatment, outcome and a covariate) does not?

          8. Am I right to assume that if the target population is well represented in experiment-1 and vastly different from experiment-2
          (say men and mice) that we should discard the second and use no pooling?


          • Andrew says:


            You ask for one scenario. As I wrote in my above post, I have a paper on this which I will post once we get permission from the company involved. But while we’re waiting for this, you could take a look at the 8 schools example from chapter 5 of BDA.

            Finally, the quick answer to your questions is that, to the extent that the two scenarios are more similar to each other, there will be more pooling. It is not all or none. Of course in practice some information may be weak enough that it’s not worth the trouble to model it, but from a conceptual point of view you’d want to include all of it. This issue was discussed, for example, by Lindley and Smith in their classic 1972 paper.

            One of the advantages of hierarchical Bayesian modeling, or graphical modeling, or whatever you’d like to call it, is that it allows the combination of information from different sources (including men and mice—or, for that matter, including men and women) using partial pooling. I discuss this in my 1996 paper with Bois and Jiang.
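
The statement that pooling is "not all or none" can be made concrete with the 8 schools data from chapter 5 of BDA. The sketch below conditions on a fixed between-school standard deviation `tau`, which is a simplifying assumption made here for illustration; in BDA `tau` is itself estimated. The function names are hypothetical.

```python
# Estimated effects and standard errors for the 8 schools (BDA, chapter 5).
Y = [28, 8, -3, 7, -1, 1, 18, 12]
SE = [15, 10, 16, 11, 9, 11, 10, 18]


def pooled_mean(y, se, tau):
    """Precision-weighted estimate of the common mean, given tau."""
    weights = [1.0 / (s ** 2 + tau ** 2) for s in se]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)


def partial_pool(y, se, tau):
    """Conditional posterior means of the school effects, given tau.

    tau = 0 gives complete pooling; a very large tau approaches no pooling.
    """
    mu = pooled_mean(y, se, tau)
    if tau == 0:
        return [mu] * len(y)
    estimates = []
    for yi, s in zip(y, se):
        prec_data = 1.0 / s ** 2     # precision of the school's own estimate
        prec_group = 1.0 / tau ** 2  # precision of the population distribution
        estimates.append((prec_data * yi + prec_group * mu) / (prec_data + prec_group))
    return estimates


for tau in (0.0, 5.0, 10.0, 1e6):
    print(tau, [round(t, 1) for t in partial_pool(Y, SE, tau)])
```

Smaller `tau` (more similar scenarios) pulls every school toward the common mean, and larger `tau` leaves each estimate close to its own data, which is the sense in which similarity governs the amount of pooling.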

          • Keith O'Rourke says:

            Perhaps some different words.

            Meta-analysis (historically) looked for a common parameter between different experiments that could be jointly learned about.

            For instance, in randomized clinical studies with two groups (treatment and control) one might abduct (hope given some reasonable expectations) that the proportions of success in treatment and control groups are different _but_ say the relative risk is the same or common. If so, you can audit whether the relative risks (given the data) were the same or not (transportability?).

            Very soon afterwards (1800s) some abducted (hoped given some reasonable expectations) that given different individual study parameters, no function of the parameters (e.g. relative risk) was actually common but it was worthwhile to hope (abduct) that they could be purposefully represented (rather than literally) as being _drawn_ from a common distribution that had a common parameter (now we have a hierarchical model – which Andrew refers to insightfully as partial pooling.)

            These are harder to grasp (purposefully) but they are _still_ about a generating model (where doing versus observing can be very important) only differing on how the parameters are specified – common versus common in distribution.
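
The "common relative risk" idea described above can be sketched as an inverse-variance pooling of log relative risks, with Cochran's Q as one way to audit whether the studies look consistent with a single common value. The study counts below are made up for illustration, and the function names are hypothetical.

```python
import math


def log_rr(events_t, n_t, events_c, n_c):
    """Log relative risk and its approximate variance for one two-arm study."""
    estimate = math.log((events_t / n_t) / (events_c / n_c))
    variance = 1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c
    return estimate, variance


def common_effect(studies):
    """Inverse-variance pooled log relative risk and Cochran's Q statistic."""
    results = [log_rr(*study) for study in studies]
    weights = [1 / var for _, var in results]
    pooled = sum(w * est for w, (est, _) in zip(weights, results)) / sum(weights)
    q = sum(w * (est - pooled) ** 2 for w, (est, _) in zip(weights, results))
    return pooled, q


# Hypothetical studies: (events_treated, n_treated, events_control, n_control)
STUDIES = [(30, 100, 20, 100), (45, 200, 40, 200), (12, 50, 6, 50)]
pooled, q = common_effect(STUDIES)
print("pooled relative risk:", round(math.exp(pooled), 3), "Q:", round(q, 3))
```

Under the common-parameter assumption, Q is approximately chi-squared with one fewer degree of freedom than the number of studies; a large Q is one signal that the "common in distribution" (hierarchical) formulation may be the more reasonable one.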

  3. jrc says:

    Am I wrong to think that Judea’s do-calculus and Andrew’s hierarchical models** are not just completely compatible but basically trying to do completely different things? The way I see it:

    Judea’s work (here at least) is essentially an epistemology that derives from a metaphysics where the world is model-able by do-calculus represented via DAGs. The contribution is to develop a symbolic logic for statistical inference. This metaphysics and the associated epistemology are (at least in theory) totally unrelated to any particular empirical situation, estimation technique, or identifying variation, regardless of whether or not there are particular techniques that are usually applied to estimate the models that this thinking generates.

    Andrew’s work is on statistical methods for incorporating information from disparate sources into a single estimation framework. There are no claims about how the world is or how knowledge of that world is (or should be) generated – those claims are external to the statistical model. This estimation framework is absolutely silent on the epistemology of causal inference in the sense of grounding how we generate “truth” (epistemology) given some way the world is (metaphysics) – these claims come from outside the model in the same way that estimating any particular parameter is outside of Judea’s symbolic logic.

    Now of course this leads you both to questions about external validity and the generalizability of results. But in this case the two contributions (at least in terms of DAGs and Hierarchical Models) are even more clearly distinct: Judea contributes epistemological rigor to our thinking about variation in effects; Andrew contributes to our ability to estimate that variation using information from multiple sources.

    So what am I missing that makes this conversation seem like a disagreement? It is quite possible I’m just completely off here, since I was never taught either of these things and most of my knowledge comes from clicking links on this blog and trying to figure out what the two of you are doing.

    • Andrew says:


      Yes, exactly. I think the problems of identification and of hierarchical modeling are essentially orthogonal, that’s why I think that people using Judea’s framework should consider hierarchical modeling or some sort of partial pooling when generalizing to new situations.

      Conversely, I would completely understand if Judea were to say to me that if I want to use hierarchical models, fine, but when doing causal inference I should express my hierarchical models in his graphical modeling framework.

      I don’t understand why Judea seems to feel the ideas are in competition at all!

      • judea pearl says:

        I agree that the problems of external validity and HM are orthogonal.
        What I am trying to understand, so far unsuccessfully, is why you think that people dealing with external validity
        “should consider hierarchical modeling or some sort of partial pooling when generalizing to new situations.”
        More importantly, HOW should we do it, now that we are all charged up, ready to learn and do it.

        I am posing the simplest of all external validity problems that I could imagine:
        We run ONE pharmaceutical experiment on students in Columbia
        University, and we are wondering whether the result can be generalized
        to the entire population of Manhattan, given the obvious age disparity.

        Imagine that we did our identification exercise already and we found that we can overcome the age disparity
        problem by re-weighting. Now we appeal to your advice, that people dealing with external validity
        “should consider hierarchical modeling or some sort of partial pooling when generalizing to new situations.”
        We ask: Please teach us how?

        What partial pooling should we engage in?

        What hierarchical modeling should we consider?

        Haven’t we done “partial pooling” already when we used re-weighting?

        Isn’t every transport formula “some sort of partial pooling,” by virtue of borrowing
        information from two or more populations?

        If so, what opportunities would we be missing if we never heard of hierarchical modeling, and were
        just doing ordinary re-weighting?

        Whatever the answer, please try to stay with this simple example.
        I am trying to learn, not to debate. No competition in sight, just a quest for clarity.


        • Andrew says:


          Sorry, but this just isn’t so interesting to me. I have real examples and you keep making up fake examples involving Los Angeles and New York, or Columbia University and whatever. You ask: Please teach us how? I refer you to my books and research articles. I’m not saying it’s easy—if it were easy, I wouldn’t have had to write all these books and articles on the topic. If you want to start somewhere, I’d recommend chapter 5 of BDA and chapters 11-12 of ARM. In your particular hypothetical example, the point of hierarchical modeling is to account for differences in the populations, beyond what is accounted for in your predictor variables. We discuss this idea in detail in ARM; it makes sense to adjust for available predictors and also to include error terms to allow for unexplained variation. But, really, there’s not much I can say in a blog comment compared to what I’ve said in 2 books and dozens of articles!

          • judea pearl says:

            You seem impatient with my quest for understanding.
            But I find it hard to believe that a celebrated technique such as hierarchical modeling becomes harder and harder to explain
            the simpler the problem gets. I am sure that all I am missing is one magic idea. So, I will try again.

            When you say “this isn’t so interesting to me” do you mean the problem is too simple to benefit from hierarchical
            modeling? Can I conclude then that, in such simple problems hierarchical modeling need not be considered?
            Yes _____, No______

            To make sure I read you correctly, I am re-examining your example-related answer:
            You wrote: “In your particular hypothetical example, the point of hierarchical modeling is to account for differences
            in the populations, beyond what is accounted for in your predictor variables.”
            I take it that, in our example, “age” stands for “predictor variables” and that the investigator suspects that other differences
            exist between Columbia’s students and the Manhattan population.
            Can we conclude then that, in case we verify that no other differences exist in the two distributions,
            hierarchical modeling need not be considered?
            Yes _____, No______

            More importantly, in case we do find other differences, say in income, can we take it that hierarchical modeling can account for
            these other differences and deliver an estimate that is superior to what we would get by just re-weighting on AGE?
            Yes _____, No______

            If Yes, do we need to assume something about those other
            differences, or is it enough to just estimate them?
            Yes _____, No______

            I must apologize to Corey and Daniel for not addressing the examples they brought up. I am laboring
            to simplify the examples as much as possible, to minimize the number of variables and eliminate uncertainties and ambiguities.
            Every time we shift to a new example we face the danger of leaving out a few things which interfere
            with the road towards understanding. So, please bear with me
            on this simple example: two populations, one disparity (AGE), large sample, nothing ambiguous, nothing left to guesswork.
            Can we get at least one item checked on this list?
            Do we really need 2 books and dozens of articles to answer Yes or No ?

            Let’s hope,

            • ojm says:

              Dumb question – in your example, what are you looking to predict and what are you looking to predict with?

              • judea pearl says:

                No question is dumb.
                In my example the goal is to predict the average effect size (of treatment on outcome) in the population of Manhattan,
                given the age-specific effect size among Columbia students, and given the (unequal) age distributions in the
                two populations.
                Duck soup
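
The prediction task just described (age-specific effects from Columbia students, averaged over Manhattan's age distribution) amounts to a reweighting sum. The sketch below uses made-up numbers and assumes, as the example stipulates, that age-specific effects carry over unchanged to the target population; whether that assumption holds is exactly what the causal model has to settle.

```python
def transport_effect(effect_by_age, target_age_dist):
    """Average effect in the target population: sum over ages of effect(age) * P_target(age)."""
    total = sum(target_age_dist.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("target age distribution must sum to 1")
    return sum(effect_by_age[age] * p for age, p in target_age_dist.items())


# Hypothetical numbers for illustration only.
effect_columbia = {"18-29": 0.30, "30-49": 0.20, "50+": 0.10}  # age-specific effects
age_columbia = {"18-29": 0.90, "30-49": 0.08, "50+": 0.02}     # study population
age_manhattan = {"18-29": 0.25, "30-49": 0.40, "50+": 0.35}    # target population

print("study average: ", transport_effect(effect_columbia, age_columbia))
print("target average:", transport_effect(effect_columbia, age_manhattan))
```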

              • ojm says:

                OK sure. Maybe my answer will be dumb instead.

                Initial thoughts (y is outcome, a is age, T is treatment including city):

                p(y|T) = ∫p(y|a,T)p(a|T)da

                If age is ‘causal’ then further assume p(y,a,T) = p(y|a).

                So p(y|T) = ∫p(y|a)p(a|T)da

                Model is invariant at highest level, age distribution varies by city (again, included in T. Could split further but I’m lazy). Estimate, test, predict, expand etc?

              • ojm says:

                Meant p(y,a|T) = p(y|a) obviously.

              • ojm says:

                Ugh. p(y|a,T) = p(y|a). Writing on a phone.
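
Putting ojm's two corrections together, the intended derivation appears to be the following (y is outcome, a is age, T is treatment including city):

```latex
\begin{align}
p(y \mid T) &= \int p(y \mid a, T)\, p(a \mid T)\, da \\
            &= \int p(y \mid a)\, p(a \mid T)\, da
            \qquad \text{assuming } p(y \mid a, T) = p(y \mid a).
\end{align}
```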

            • I take it you want to do something like

              Effect Of Treatment = f(age,some_modeling_parameters) + zero_mean_noise

              where you have data on a small number of ages based on Columbia students, and you now want to extrapolate to other ages and other regions of the country, and you’re asking how will hierarchical modeling help?

              And the answer I have is… the problem is too simple to meaningfully have any hierarchy. Corey’s point that you need to do two experiments on the two populations seems logical to me. The biggest point of hierarchical modeling here is that the information from the two experiments could tell you about consistent differences between the populations which could be represented in the model by different values for the set of parameters for the two groups, and the hierarchy would come in when specifying how the set of parameters could vary from group to group.

              Our uncertainty about a third unobserved population could then be represented by sampling from the distribution of the parameters and predicting for that third population. The information from the second experiment is needed if you want to have any data-driven information about the parameter distribution, otherwise you’re just sampling from the posterior for the one group you have, which doesn’t represent in any way the range of possibilities.

              • In all of this, I suspect that the contribution from Judea is in formalizing the way in which f can be generated from assumptions about some mechanisms of causality, and the contribution from HM is to allow you to take into account variability from group to group in the modeling_parameters and better estimate their values for observed groups, and also better estimate the range of possibilities for unobserved groups.

                The contribution from HM to extrapolation or “external validity” is in estimating the ranges over which the modeling_parameters might vary when moving from a variety of observed groups to another unobserved group.

                HM isn’t going to give you a “single value” for the parameters for the unobserved group, but it is going to give you a credible data-and-prior-driven range over which they might vary, so that credible ranges of outcomes for the unobserved groups can be estimated.
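
The idea of a credible range for an unobserved group, rather than a single value, can be sketched in the simplest normal setting. The function below conditions on a between-group standard deviation `tau` that is taken as given, which is an assumption made here to keep the example short; a full analysis would also carry uncertainty about `tau`. The name `new_group_interval` and the example numbers are hypothetical.

```python
import math


def new_group_interval(estimates, std_errors, tau, z=1.96):
    """Approximate interval for the true effect in a new, unobserved group.

    Assumed model: group effects are drawn from N(mu, tau**2) and each
    estimate has its own known standard error; mu has a flat prior.
    """
    weights = [1.0 / (s ** 2 + tau ** 2) for s in std_errors]
    mu_hat = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    var_mu = 1.0 / sum(weights)           # uncertainty about the common mean
    sd_new = math.sqrt(var_mu + tau ** 2)  # plus group-to-group variation
    return mu_hat - z * sd_new, mu_hat + z * sd_new


# Two hypothetical observed groups; compare small and large group-level variation.
print(new_group_interval([1.0, 3.0], [1.0, 1.0], tau=0.5))
print(new_group_interval([1.0, 3.0], [1.0, 1.0], tau=5.0))
```

The interval for the new group widens with `tau` even when the observed groups are measured precisely, which is the error term for unmodeled differences between situations.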

              • Rahul says:

                Would hierarchical models be a good approach to tackle a purely predictive accuracy contest? Say something like the Netflix prize etc.

                I mean, is that sort of prediction contest, with a hold out set, a good test of external validity or not really? Especially the ones where the hold out set is not a random sample but a set of observations isolated by site, time period etc.

              • Hierarchical models are a technique, like fly-fishing; whether it’s a good technique depends at least a little on how good you are at using it and how much you know about the problem.

                Two people can go fly-fishing in the same pond and one will come back with their bag-limit and another come back with a puny single trout…. is the failure of the second one because fly fishing is an inadequate technique, or because they didn’t do the right things for that particular set of fish? The success of the first person suggests the technique can be adequate.

                In a Bayesian model you model not only the process that produces the individual data points, but also, if necessary, the sampling method that produces the data-set. If someone tells you they’ve got a “random hold out set” then you can model that random hold-out, if they tell you that they have a non-random hold-out which is specifically associated with some particular type of sub-group, then you SHOULD model that sampling process when predicting the hold-out values.

              • Rahul says:


                All of what you write makes sense & is perfectly reasonable. I didn’t intend my question to be of the sort “Is generic method X suitable for all problems?”

                My question was particularly motivated by an anecdotal observation (perhaps mistaken) that in the forums / mailing lists etc. specifically devoted to predictive modelling contests I didn’t recall much of a mention of Hierarchical Models. As opposed to (say) logistic regression, naive bayes, random forests, bagging / boosting, ensemble models etc.

                Is there a reason for this? Or are they known by another name? Subsumed under another rubric?

                My observation itself could be wrong, of course & reflect my ignorance of the prediction contest communities.

              • I think those prediction contests tend to attract computer-sciencey people. The “machine learning” groups are kind of hands-off, that is, they don’t tend to be too interested in models that require you to know anything much about the problem. Similarly, I think the statistical Bayesian modeling community is primarily a group of people who have a real desire to incorporate their assumptions and knowledge about how things work (i.e., physicists, economists, ecologists, biologists etc).

                So, at least a big part of this difference I think is one of culture and attraction. For example, if I remember correctly the Netflix contest was already “rigged” against Bayesians in that the metric being used (sum of squared error in ratings) was not what a Bayesian would necessarily want to model. A Bayesian would want to put a joint probability distribution over the possible ratings of each movie

                p(stars[i][j] = n | predictors)

                for all movies i and people j and possible ratings n in {1,2,3,4,5}
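
The relation between a full distribution over ratings and a squared-error contest metric can be shown in a few lines: given p(stars = n), the single number that minimizes expected squared error is the mean of that distribution. The probabilities below are made up, and the function names are hypothetical.

```python
def expected_squared_error(guess, probs):
    """Expected squared error of a point guess under a rating distribution."""
    return sum(p * (n - guess) ** 2 for n, p in probs.items())


def best_point_prediction(probs):
    """The mean of the distribution, which minimizes expected squared error."""
    return sum(n * p for n, p in probs.items())


# Hypothetical p(stars = n | predictors) for one movie and one person.
probs = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}

best = best_point_prediction(probs)
print("posterior mean:", best)
print("error at mean:", expected_squared_error(best, probs))
print("error at mode:", expected_squared_error(4, probs))
```

So a modeler who builds the joint distribution can still submit a point prediction for such a contest; the metric just discards the rest of the distribution.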

              • Rahul says:


                Interesting. Shouldn’t incorporating domain knowledge lead to better predictions, though?

                How come the community of computer-science-ey types is not getting penalized in predictive accuracy by sticking to their hands-off, domain-ignorant models? Can’t someone “add” domain knowledge to the best models and make them somewhat better?

                Alternatively, why are we finding that the gains in domain knowledge don’t let us predict the world any better (& in fact far worse if you assume the contest-fans are right in using their hands-off approaches as the ones that really work well)?

              • But, how much domain knowledge is there out there about “what makes people prefer ‘Dude where’s my car’ to ‘Dumb and Dumber’”?

                Also, even if a domain expert might have such information, I imagine the predictors they’d need are things like some kind of results of a Myers-Briggs test and some information about the cultural background and early childhood experiences of the subjects, what sort of jobs they do, and how much education they have from what kinds of schools in what kinds of majors etc… information that isn’t available.

                So, what would rapidly become interesting to the subject matter expert would be to start with the movie ratings and try to infer where the person grew up, what their parents’ income was, and what industry they work in, or whatever.

                The purely mechanical brute force machine learning approach can probably identify fairly obscure proxies for some of the information you’d need as a domain expert. But, by hand the domain experts are not necessarily going to find those same features.

              • Rahul says:

                Maybe true about the Netflix prize. But there are prediction contests on a wide range of topics from hospital admissions to bike rentals to financial markets.

                The fact that, like you said, most top entries tend to be hands-off & data agnostic (& none using HM), does that extend your “But, how much domain knowledge is there out there” argument across these diverse applications?

        • Corey says:

          Try this: we run two pharmaceutical experiments, one on students in Columbia and one on recruited subjects in Spokane, Washington. Now we want to generalize to, say, the population of Manhattan. We do our re-weighting exercise between our two experimental populations and find that there remains some unaccounted-for disparity. To apply a hierarchical modeling approach we treat all such unaccounted-for disparities between various populations (and in particular, the disparity between the Manhattan population and the experimental populations) as exchangeable random variables.

          (Why two pharmaceutical experiments? Two experiments is the bare minimum number we need to have data-driven dirt on the dispersion of the disparity distribution.)
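
Why two experiments is the bare minimum can be seen from a simple moment estimate of the dispersion: the spread of the experiment-level estimates, minus what sampling error alone would produce. This is a sketch under the assumption of independent estimates with known standard errors; the function name is hypothetical, and with only two experiments the result is extremely noisy.

```python
def dispersion_estimate(estimates, std_errors):
    """Moment estimate of the between-population variance (tau squared).

    Uses E[sample variance of estimates] = tau**2 + mean(std_error**2),
    truncated at zero.
    """
    k = len(estimates)
    if k < 2:
        raise ValueError("need at least two experiments to estimate dispersion")
    mean = sum(estimates) / k
    sample_var = sum((y - mean) ** 2 for y in estimates) / (k - 1)
    mean_sampling_var = sum(s ** 2 for s in std_errors) / k
    return max(0.0, sample_var - mean_sampling_var)


# Hypothetical residual disparities from two experiments, after re-weighting.
print(dispersion_estimate([0.0, 4.0], [1.0, 1.0]))
print(dispersion_estimate([1.0, 1.2], [1.0, 1.0]))
```

With a single experiment the sample variance is undefined, so any statement about the dispersion would have to come entirely from prior information.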

        • Let’s model the response of the symptoms as f(age, dose/weight, initial-health-conditions, internal-biological-state)

          We can generally measure age, dose, weight with little error. initial-health-conditions is a vector of things which we might be able to measure somewhat less precisely, and internal-biological-state is some set of things about which we have no measurements whatsoever, but we could in principle measure them (things like say the mass of each kidney, the fraction of the damaged portion of the kidney, the number of copies of some set of genes in the genome, the concentration of some proteins in some tissue… whatever). Perhaps we don’t even know what these variables are, but there are “effective parameters” which more or less are determined by some combinations of these functions that tell us how well our drug would work…

          In modeling this situation, we will need to define parameters to tell us things about the initial-health-conditions, and the internal-biological-state for each person.

          We don’t know the initial-health-conditions, but, because all of the patients exhibit some kind of symptoms to some degree or another, we can guess that the range of values for the initial health conditions which are plausible is some range that is more or less “narrow”.

          Because we know even less about the “internal-biological-state” we can guess that the plausible ranges of values are more or less “wide”.

          However, we are not sure how “narrow” our “narrow” should be…. so we create our model of initial-health-conditions specifying that it is

          N(center, width)

          But we don’t know what the width should be, so we specify

          width ~ exponential(1/Width_order_of_magnitude)

          Finally, we specify a likelihood

          p(data | age, dose/weight, initial-health-conditions,internal-biological-state) ~ N(f(age,….), measurement_and_modeling_error_scale)

          How is this “hierarchical”? Because it specifies our information about internal-biological-state in terms of a width which is itself uncertain and modeled in terms of a common order of magnitude.

          This offers us partial pooling between the data points because to the extent that the individual values of the “initial-health-conditions” variable are less spread out, it will increase the posterior probability of the model (because the width parameter can be smaller which is more probable under the exponential model). However, the values for the individual patients will not collapse to the same single value, because that would in most cases, decrease the likelihood of the data.

          I think this is the main concept behind hierarchical modeling, to specify what you can about the individual values of the parameters in terms of a common distribution that the ensemble of them have and then to iterate on that concept.
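A minimal numerical sketch of the partial pooling described above, under assumptions that are not in the comment: `center` and the measurement error `err_sd` are treated as known, `width` is read as a standard deviation, f() is taken to be the identity, and the posterior over `width` is computed on a simple grid rather than with a sampler. The function name and the numbers are hypothetical.

```python
import numpy as np

def partial_pooling(y, center, err_sd, width_scale, grid_size=400):
    """Grid posterior over `width`, then shrunken per-patient estimates.

    Assumed model (a simplification of the one in the comment):
        value_i ~ N(center, width)
        width   ~ exponential with mean `width_scale`
        y_i     ~ N(value_i, err_sd)
    """
    y = np.asarray(y, dtype=float)
    widths = np.linspace(1e-3, 10.0 * width_scale, grid_size)
    # Log prior for width (exponential, up to a constant).
    log_prior = -widths / width_scale
    # Marginal likelihood after integrating out value_i:
    # y_i ~ N(center, sqrt(width^2 + err_sd^2)).
    total_var = widths[:, None] ** 2 + err_sd ** 2
    log_lik = (-0.5 * np.log(2 * np.pi * total_var)
               - 0.5 * (y[None, :] - center) ** 2 / total_var).sum(axis=1)
    log_post = log_prior + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Given width, E[value_i | y_i] = center + k * (y_i - center),
    # with k = width^2 / (width^2 + err_sd^2); average k over the posterior.
    shrink = widths ** 2 / (widths ** 2 + err_sd ** 2)
    k = float((post * shrink).sum())
    return center + k * (y - center), k

observed = [1.0, 3.0, -2.0]  # hypothetical measurements
estimates, k = partial_pooling(observed, center=0.0, err_sd=1.0, width_scale=2.0)
print(estimates, k)
```

The shrinkage factor `k` lands strictly between 0 (everyone collapses to the common center) and 1 (no sharing at all), which is the sense in which the individual values are pulled together without collapsing to a single value.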

        • Ricardo Silva says:


          X = treatment (e.g. vaccine is taken)
          Y = outcome (e.g person stays healthy within a year)
          D = location (e.g. value of “0” is observed in the experimental study, where “0” = Columbia students etc.)
          U = unobserved confounders of X and Y
          W = covariates (e.g, age) that will imply a causal DAG as the following structure (G)

          {X, W, U} -> Y
          {D, U} -> X
          W -> D (direction here is not relevant, or even whether there are unmeasured confounders etc. What matters is the assumption D _||_ Y | {W, do(X)}).

          We want estimand (E) defined as P(Y = y | do(X = x), D = d) for all sorts of values other than d = 0. (G) entails the following about (E):

          (E) = sum_w P(Y = y | do(X = x), D = d, W = w)P(W = w | D = d, do(X = x))
          = sum_w P(Y = y | do(X = x), D = 0, W = w)P(W = w | D = d)

          So far, no estimation: this is a way of getting estimand (E) as entailed by model assumptions (G) assuming 0 < P(D = 0 | W = w) < 1 for all w. This is essentially the setup in the second paper linked at (
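To make the reweighting in (E) concrete, here is a small sketch with made-up numbers. The stratum labels, the probabilities, and the function name `transport_effect` are all hypothetical; W is assumed discrete, and both factors are treated as known rather than estimated.

```python
def transport_effect(p_y_given_do_x_w_source, p_w_target):
    """Return sum_w P(Y=y | do(X=x), D=0, W=w) * P(W=w | D=d)."""
    if abs(sum(p_w_target.values()) - 1.0) > 1e-9:
        raise ValueError("target covariate distribution must sum to 1")
    missing = set(p_w_target) - set(p_y_given_do_x_w_source)
    if missing:
        # Mirrors the overlap assumption: every stratum present in the
        # target must also appear in the experimental population.
        raise ValueError(f"no experimental data for strata: {sorted(missing)}")
    return sum(p_y_given_do_x_w_source[w] * p for w, p in p_w_target.items())

# Hypothetical numbers: P(stay healthy | do(vaccine), D=0, W=w) from the experiment.
p_healthy_experiment = {"young": 0.8, "old": 0.5}
# Hypothetical age distributions in the experimental and target locations.
p_age_source = {"young": 0.9, "old": 0.1}
p_age_target = {"young": 0.3, "old": 0.7}

effect_source = transport_effect(p_healthy_experiment, p_age_source)
effect_target = transport_effect(p_healthy_experiment, p_age_target)
print(effect_source, effect_target)
```

With these numbers the effect is about 0.77 in the experimental population and about 0.59 in the target: the stratum-specific effects are carried over unchanged and only the weights P(W = w | D = d) differ.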

          Now, estimation. At D = 0, we assume we have randomized data. Then the first factor can be estimated from the experimental data. The second factor will change with d. If W is age and D is location, it will be sensible to estimate that with a hierarchical model (say, W is Gaussian with mean mu_d, variance v_d, put priors on {mu_d, v_d}, maybe make hyperparameters depend on covariates of location which are not relevant to the causal graph, such as longitude/latitude of locations.) Andrew, myself, and many others would say hierarchical models are an excellent way of estimating this distribution.

          Others of course can use their method of choice for the estimation step, but the point is we can use causal graphs to derive the estimand from model assumptions, then hierarchical models for the estimator.

          • judea pearl says:

            You made my day!!! I finally got it!!

            I will summarize what I understood from your post, and please check if I got it right.

            1. When we tackle a cross-population problem, we first derive the pooling estimand (your (E)),
            which tells us what probability relationship to estimate from what source. Then, to estimate the needed relationships,
            we appeal to hierarchical modeling to get those estimates from the appropriate data.
            Finally, we combine them in accordance with the pooling estimand.
            Duck soup.

            Thanks for illuminating me!!
            Where have you been hiding when I was begging for a simple explanation on a simple example?

            This confirms my earlier suspicion that “external validity relates to hierarchical modeling like
            identification relates to estimation.”
            So, why was I unsure about this interpretation?
            I guess because Andrew kept on insisting that HM is doing pooling.
            “I recommend using HM as a way to compromise between no
            pooling and complete pooling.” or “I recommend using HM to partially pool.” and more.

            Note that in our Columbia example, HM is not doing any cross population pooling.
            It is always applied to one factor in the estimand, namely, it estimates one relationship at a time from data
            originating from one population at a time.

            So, what makes Andrew say things like: “to the extent that
            the two scenarios are similar to each other there will be more pooling” ?
            Even if we have k heterogeneous experiments and one target population, each factor in the pooling estimand
            would come from one and only one population.

            I guess the answer is that sometimes it is useful to estimate each factor from several populations.
            For example, the age distribution in Manhattan is best estimated by a smart pooling of data from various
            districts of Manhattan. But this lies outside the main problem of pooling together information
            from Columbia and Manhattan.

            Still, who am I to out-guess Andrew? He may have a totally different interpretation of what pooling is.

            Thanks for the enlightenment,

            • ojm says:

              Tho’ nice and explicit, as far as I can tell this is essentially the same explanation as I gave above.

              Clearly, during estimation, if you did the estimation first for one city, then this can inform your model’s structural parameters in p(y|a) for the other city (since this part is assumed to be [approximately] invariant). Of course p(a|T) will differ since the age distributions vary by city.

              I am unclear on what is unclear, though it may be my blind spot.

            • Ricardo Silva says:

              Glad to help, you got it correct.

              My only comment is that the word “pooling” is a bit of an overloaded term. The way I use it (which I think agrees with the way Andrew and others use it) is to mean the “sharing of statistical strength”, a smoothing device basically. So we “pool” locations d and d’ by saying that mu_d and mu_d’ are dependent a priori: in that way, data from d’ is also used (in a “discounted” way that follows from Bayes’ rule) to give us information about mu_d. If we have many locations, then we can also learn about the prior dependence, since if age distributions were empirically close to each other in geographically close locations, then we could adjust the prior to strengthen dependencies accordingly. Otherwise, if they vary abruptly this would swing the prior dependence towards a set of nearly independent {mu_d}. This gives an inferential way to vary from complete (mu_d = mu_d’) to no-pooling (mu_d independent a priori of mu_d’).
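A rough sketch of that range from no pooling to complete pooling, under assumptions chosen for convenience rather than taken from the comment: a bivariate normal prior on (mu_d, mu_d’) with a fixed correlation `rho`, and known sampling variances for the two observed sample means. The names and numbers are hypothetical.

```python
import numpy as np

def posterior_means(ybar, samp_var, prior_mean, tau, rho):
    """Posterior means of (mu_d, mu_d') under a correlated normal prior.

    Prior:  (mu_d, mu_d') ~ N(prior_mean, tau^2 * [[1, rho], [rho, 1]])
    Data:   ybar[i] ~ N(mu_i, samp_var[i]), independent given the means.
    """
    ybar = np.asarray(ybar, dtype=float)
    V = np.diag(np.asarray(samp_var, dtype=float))
    Sigma = tau ** 2 * np.array([[1.0, rho], [rho, 1.0]])
    # Standard Gaussian conditioning; this form also works when rho = 1,
    # where Sigma itself is singular.
    return prior_mean + Sigma @ np.linalg.solve(Sigma + V, ybar - prior_mean)

ybar = [30.0, 40.0]      # hypothetical mean ages observed at d and d'
samp_var = [4.0, 1.0]    # hypothetical sampling variances of those means
no_pool = posterior_means(ybar, samp_var, prior_mean=35.0, tau=5.0, rho=0.0)
partial = posterior_means(ybar, samp_var, prior_mean=35.0, tau=5.0, rho=0.5)
complete = posterior_means(ybar, samp_var, prior_mean=35.0, tau=5.0, rho=1.0)
print(no_pool, partial, complete)
```

With `rho = 0` the estimate for d ignores the data from d’; with `rho = 1` the two posterior means coincide; intermediate values discount the other location’s data. In a full hierarchical model the dependence would itself be learned from many locations rather than fixed by hand.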

              The Columbia/Manhattan example would be a bit more complicated, as one group is a subset of the other. But prior information could be used in different ways, e.g., mu_manhattan > mu_columbia.

              • Keith O'Rourke says:

                Another way to put it would be to say mu_d and mu_d’ are represented as draws from a common distribution say N(mu,sigma) or N(mu, (location.d – location.0)^2 + sigma) or whatever and as long as they are being drawn from a common distribution with a common parameter – there will be a combination (“sharing of statistical strength”) simply by likelihood multiplication.

                (Gory details of this way of putting it are in my thesis here )

                Note, here mu_d and mu_d’ are not observed but predicted by a distribution (some argue estimation is the wrong term).

                But I would agree the literature seems like a confusing mess.

            • I don’t know what Ricardo means, but evidently you don’t “get it” because there IS “pooling” in a hierarchical model. Perhaps a better term is sharing of information across groups, because perhaps you have some kind of special meaning for the word “pooling”. Ricardo explains some of that below (above? I’m not sure where my comment is going to fall)

              Here is the simplest scenario I can imagine for “external validity” via hierarchical modeling; it’s really Corey’s point:

              There are three groups G1, G2, and G3. We assume we have random draws from G1 and G2 and measure a quantity about the individual samples. We assume we have a known variation within all G1, G2, G3 just for pedagogical reasons…

              We want to find out about the mean of a third group G3 where we have no observations.

              We assume we know something about the degree to which scientific facts constrain the range of mean values for groups… For example, if we are measuring the height of people, we know that none of the people will be taller than the Sequoia “General Sherman”, which is thought to be the largest living tree by volume. But probably we have more information than that.

              We claim that each of the groups has a mean and a known standard deviation so that the individual samples

              G1[i] ~ normal(G1_mean,known_sd)


              G2[i] ~ normal(G2_mean,known_sd)

              But, we also claim that our scientific knowledge gives us some prior information about the G1_mean, G2_mean, G3_mean… namely that they all tend to fall near some common mean and have some common range (this is the culmination of our “scientific knowledge” like about the General Sherman example above, I use “normal” distributions here just for pedagogy).

              G1_mean ~ normal(overall_mean,overall_sd_of_means)
              G2_mean ~ normal(overall_mean,overall_sd_of_means)
              G3_mean ~ normal(overall_mean,overall_sd_of_means)

              We give our information about overall_mean and overall_sd_of_means as distributions (the “bottom” of the hierarchy).

              overall_mean ~ some_distribution_goes_here
              overall_sd_of_means ~ some_distribution2_goes_here

              Now, we run Bayes’ theorem in the form of a Stan program or some other computation… and we get posterior distributions for everything…

              In particular, we get a posterior distribution for “overall_mean” and “overall_sd_of_means”. Clearly these distributions are informed by the data from both G1 and G2 (this is “partial pooling” partial because we do not assume a single “overall_mean” which we estimate by assuming G1 and G2 are samples from a common group).

              Taking a sample of overall_mean and overall_sd_of_means from the posterior, we then generate various normal deviates for “G3_mean” and this summarizes our knowledge about group G3 in the absence of any data on G3.

              I can not think of anything very much simpler than this for which a hierarchical model makes much sense.
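The three-group example can be sketched without Stan by putting the two hyperparameters on a grid. Everything numerical here is an assumption made for illustration: heights in cm, a normal hyperprior on `overall_mean`, a half-normal hyperprior on `overall_sd_of_means`, and the grid ranges. The function name `g3_mean_draws` is hypothetical.

```python
import numpy as np

def g3_mean_draws(g1, g2, known_sd, n_draws=5000, seed=0):
    """Posterior draws of G3_mean given samples from G1 and G2 only."""
    rng = np.random.default_rng(seed)
    ybar = np.array([np.mean(g1), np.mean(g2)])
    se2 = known_sd ** 2 / np.array([len(g1), len(g2)])
    # Grids for overall_mean and overall_sd_of_means (assumed ranges, in cm).
    means = np.linspace(100.0, 250.0, 151)
    sds = np.linspace(0.5, 40.0, 80)
    M, S = np.meshgrid(means, sds, indexing="ij")
    # Assumed hyperpriors: overall_mean ~ N(170, 30), overall_sd ~ half-normal(15).
    log_prior = -0.5 * ((M - 170.0) / 30.0) ** 2 - 0.5 * (S / 15.0) ** 2
    # Group means integrated out: ybar_g ~ N(overall_mean, sqrt(S^2 + se2_g)).
    var = S[..., None] ** 2 + se2
    log_lik = (-0.5 * np.log(2 * np.pi * var)
               - 0.5 * (ybar - M[..., None]) ** 2 / var).sum(axis=-1)
    log_post = log_prior + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Sample hyperparameters from the grid posterior, then draw G3_mean.
    idx = rng.choice(post.size, size=n_draws, p=post.ravel())
    return rng.normal(M.ravel()[idx], S.ravel()[idx])

g1 = [165.0] * 20   # hypothetical sample from G1
g2 = [175.0] * 20   # hypothetical sample from G2
draws = g3_mean_draws(g1, g2, known_sd=7.0)
print(draws.mean(), draws.std())
```

The draws summarize what the model says about G3 with no G3 data: they are centered between the two observed groups and are much more spread out than either observed group mean, because with only two groups the between-group spread is poorly determined.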

              Yes, I easily concede that it makes sense to distinguish between observational and experimental designs and that the “do calculus” might be a good way to do so (I don’t know), though I typically don’t formalize my models via do calculus.

              • Ricardo Silva says:

                Yes, I wasn’t clear. Judea’s message was correct modulo his interpretation of the word pooling, which I took to be the conditioning/marginalization operations that are typically used to express causal estimands using observational data (not sure what is a good name for that. I don’t like the term “adjustment”, feels a lot like “fitting”, which is a statistical term, not a term that I would use for the procedure of deriving logical operations to get an expression from an estimand from given model assumptions).

                By the more common use of the word “pooling” (at least in the context of Bayesian inference), then most definitely hierarchical modeling is doing pooling, and it can definitely play a role in the estimation of P(W | D) in my example, if one wishes to do so.

              • CK says:

                What if there are systematic differences in G_means? Is HM pooling appropriate when such differences exist? Shouldn’t we do the correction (if possible) before we proceed with pooling?

              • CK,

                If there are systematic differences in G_means, we should model them if we feel that the improvement in the overall model will justify the effort, and in the model for them we can use HM. Eventually, at some level in the model, we will only know that there are differences, not what the systematic description is; at that level we can just specify a prior distribution based on whatever information we have.

              • CK says:

                Thank you Daniel. From my understanding, transportability guides how to correct for systematic differences across studies but leaves the choice of a statistical model (be it HM or not) to the data analyst. So in my view the two approaches play different (but complementary) roles in the analysis.

              • judea pearl says:

                CK , Daniel, et al,
                I would join you in your conclusion:
                ———- transportability guides how to correct for systematic differences across studies
                ———–but leaves the choice of a statistical model (be it HM or not) to the data analyst.

                if and only if, by “systematic difference” we mean any difference that would disappear whenever the number of samples
                increases indefinitely in each of the studies.
                Do I have the right interpretation of “systematic”?

              • CK says:

                You have the right interpretation of what I meant by “systematic”.

              • judea pearl says:

                Great, so we have finally reached a resolution of this long discussion (73 comments), which started with
                an innocent comment on Dehejia et al. and ended with a full understanding of what HM does: it simply estimates the
                distributions that are taken as given in transportability.
                I just posted a note on do-calculus to ojm, which somehow did not get posted. I think it is relevant to the entire discussion:
                I do not understand why you feel so threatened by the do-calculus.
                I think if you try to solve any of the data-fusion examples in
                you will appreciate its power. Try Fig. 5 (b) or (c).

                Same if you try to use logic, set theory, category theory, statistics, or string
                theory to resolve any of the Simpson’s paradox examples:

                To say that we can do everything with standard mathematics
                is to severely underestimate the capability of our generation
                to innovate new tools.

                Where does this defeatism come from?


              • ojm says:

                I don’t feel threatened by it – I feel disappointed by it. I was quite excited at first because it was very close to what I was looking for when I came across it.

                Unfortunately I came to feel that it was not quite what I was looking for, after all.

                I will give it another go at some point, I’m sure. Perhaps I am wrong about it.

                And of course I do not wish to “severely underestimate the capability of our generation to innovate new tools”. I just think that there are more exciting tools out there.

              • Rahul says:

                +1 to ojm’s sentiments.

                My experience with trying out DAGs is almost identical. First time I came across DAGs I was super excited: here’s a novel approach that’s actually trying to do something revolutionary about deducing causality.

                But the more I tried the more I was lost & in spite of multiple attempts I’ve never been able to use it on practical problems.

                So far from being threatened by do-calculus I’m disappointed by it, or rather disappointed by my own lack of ability to “get it”. But I do hope these tools will catch on & others can use them to figure out causal relations on real important problems.

              • judea pearl says:

                Ojm and Rahul,
                So, you are disappointed with DAGs.
                Ojm does not say why, but Rahul does:
                “I was super excited here’s a novel approach that’s actually trying to do something about deducing

                It seems that your expectations were set too high, because
                “deducing causality” from data alone, with no assumptions, is
                a mathematical impossibility (that, btw, can be proven by DAGs.)

                Aha!!! Someone invariably says, “if you allow assumptions,
                what have you accomplished?, everything can be deduced
                from strong assumptions!!”

                Aha!!! said Euler’s contemporaries “What good are
                differential equations? You can’t get the trajectories
                without assumptions about the laws of motion. We are

                DAGs are indeed like differential equations. You plug
                in assumptions about which you have strong conviction,
                (say that the force is inversely proportional to the square of the distance)
                because they cohere with your experience and the way
                you perceive the world, and out come answers to questions
                about which you have not got the slightest idea,
                (like that the trajectory of the star is an ellipse or,
                more pertinent,) that certain sets of assumptions
                do not permit transportability (no matter how
                smart your statisticians), while others do.

                Are you still disappointed?
                Just look where astronomy is today, and how slowly causal
                inference has advanced after two centuries of statistical
                meddling with the problem, having no mathematical tools to solve
                its differential equations.

                I would be extremely excited, if I were you, seeing what
                has been accomplished in just 20 years of
                causal calculus – identification, confounding, mediation, model testing and more.
                This assumes, of course, that you are interested in causal questions, not in just
                another survey.


              • Rahul says:


                I never said anything about “with no assumptions”. Not sure why you thought I meant that.

              • ojm says:

                Quick description of my dissatisfaction:

                Do calculus appears to me to suffer from attempting to be too general and too specific at the same time.

                I can agree with general ideas like
                – define mechanisms through structural invariance
                – asymmetry/causal ordering can be introduced through boundary/initial conditions and the nature of macroscopic measurement and/or other symmetry breaking mechanisms
                – graphical notations and graph theoretic proofs can be very illuminating

                I just feel like applied mathematicians and physicists have already been using these ideas and developing related formalism for years and I find them more useful and better adapted to problems I’m interested in.

                Again, if you could give a nice demonstration on the ideal gas or two slit experiment or some other undergrad physics example then I might be more open to convincing.

              • judea pearl says:

                You write:
                “I never said anything about “with no assumptions”. Not sure why you thought I meant that.”
                I did not realize you allowed for causal assumptions, sorry. I guess I could not reconcile your
                “disappointment with DAGs ” with “allowing for assumptions”. I never met a person who recognizes
                the need for causal assumptions, willing to express them in some language, trying DAGs, and ending up
                How else would one express the assumption that symptoms do not cause diseases?
                I hope you would not advise me to use conditional ignorability before you try it.

                Anyhow, sorry for misreading your message and, if you need to express assumptions, recall:
                I never met a person who was willing to express them in some language, trying DAGs, and ending up

              • judea pearl says:

                The ideal gas and two-slit experiment are textbook phenomena,
                I would rather spend my time on uncharted territories, like external validity.
                Join me if curious.

              • ojm says:


                RE: ‘The ideal gas and two-slit experiment are textbook phenomena, I would rather spend my time on uncharted territories, like external validity. Join me if curious.’

                What happened to
                ‘I am laboring to simplify the examples as much as possible, to minimize the number of variables and eliminate uncertainties and ambiguities’


                ‘Only the mighty Gods believe they can handle “actual, messy, applied dataset” without understanding simple, pedagogic examples. I am just a humble mortal, enjoying the miracle of understanding.’


                I’m just trying to see this methodology applied to two canonical – yet non-trivial – examples of mechanistic understanding.

                These sorts of examples are a key test for me. I know (roughly) how I would handle them and similar examples with hierarchical modelling and other frameworks.

                I have my doubts, however, as to whether do-calculus is a) adequate and/or b) helpful for these sorts of models.

              • while our estimates of the individual values would go to the correct value in the limit of infinite samples, that need not mean we are informed *about how the systematic part works*.

                A bayesian model in which you actually specify some model for the systematic nature of the differences can then be used to estimate important unknown quantities in this model.

                In general I think the issue I have with “do” calculus is that it appears to attempt to formalize the “modeling process” and thereby gain some extra “tricks” that become formal symbolic manipulations (such as determining if something is identifiable or whatnot).

                I just rarely have that issue. Like OJM I think seeing a simple but mechanistic model such as “the ideal gas” or the “two slit” or even show me why I might be interested in do calculus if I work on say figuring out why certain kinds of building assemblies fail in earthquakes while others don’t… it might help me. Every one of OJM’s complaints makes some sense to me.

              • Ricardo Silva says:


                (I hope this reply falls in the right thread)

                The ideal gas example is a bit complicated, because it is a continuous time equilibrium system. It is not that straightforward to explain it in a blog reply. See some discussion here: and Thomas Richardson’s thesis . Dealing with continuous time requires some effort. There is on-going research, I can point out to and

                Concerning the two-slit problem, I know little of physics but do even physicists really understand it? I thought it was “shut up and calculate” all turtles down. There is an experimental setup to it, but it is hardly illuminating to describe the system of putting together the laboratory setup, firing up the measurements, etc. and any application of do-calculus to that would be trivial.

                If you like to think of causal models in terms of equations, there is a direct connection between them and causal graphs under the name of “structural equations”. The term is meant to emphasize the difference with respect to algebraic equations. Because for an invertible f, cause = f(effect) and effect = f^{-1}(cause) are equivalent algebraic equations, but not equivalent structural equations.
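A toy simulation may help show the asymmetry. The linear mechanism, the noise scales, and the function name `simulate` are made up for illustration; the only point is that an intervention replaces one structural equation and leaves the others alone.

```python
import random

def simulate(n, do_x=None, do_y=None, seed=0):
    """Draw n samples from the structural model X := U_x, Y := 2*X + U_y.

    Passing do_x or do_y replaces the corresponding structural equation
    with a constant, which is how an intervention is represented.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0) if do_x is None else do_x
        if do_y is None:
            y = 2.0 * x + rng.gauss(0.0, 0.1)
        else:
            y = do_y  # Y is set from outside; the equation for X is untouched
        xs.append(x)
        ys.append(y)
    return xs, ys

def mean(values):
    return sum(values) / len(values)

# Intervening on the cause moves the effect ...
_, ys = simulate(2000, do_x=3.0)
y_under_do_x = mean(ys)
# ... but intervening on the effect does not move the cause,
# even though algebraically x = y / 2 would suggest x = 3.
xs, _ = simulate(2000, do_y=6.0)
x_under_do_y = mean(xs)
print(y_under_do_x, x_under_do_y)
```

Setting X to 3 puts the average of Y near 6, while setting Y to 6 leaves the average of X near 0, which is the sense in which the two algebraically equivalent equations are not equivalent structural equations.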

                For a real-world application (real, as in a live system that is actually used) of structural equation modeling with do-calculus, see Another example of an application of (partially) mechanistic models with causal graphs is

              • Ricardo Silva says:

                Incidentally, here are some interesting papers on Bell’s inequality and causal models:


                and the relation between that type of reasoning and instrumental variables is given quite some air time in Pearl’s book.

              • ojm says:

                Thanks Ricardo. This confirms to me that do calculus is interesting but still at the stage where it struggles with textbook physics problems. I’ll keep an eye on future developments.

                PS I think the ‘issue’ of invertible cause-effect functions is a bit of a red herring – it’s not that hard to get irreversible macroscopic dynamics from reversible microscopic dynamics, for example. I should leave the comments for now, finally!

                Thanks again.

              • Ricardo Silva says:


                Again, I’m happy to help but I think the message I wanted to convey was the opposite of your conclusion! Reversible mechanical systems are really, really awful examples of interesting applications that can be done with causal models, and we shouldn’t let this stop us from looking into applications of such models in other domains, and realizing that most people don’t care about reversibility by assuming cause-effect is not a symmetric relationship. This is akin to dismissing statistics because it is not of much help in mechanics. I’m glad nobody took seriously that (apocryphal?) quote from Rutherford, that “if you need statistics then you are doing the wrong experiment”, and Fisher with his “silly” lady tasting tea contributed much more to science than anyone waiting to solve practical applications only if good ol’ 19th century methodology could be applied first…

                In any case, I agree this conversation has been very interesting and (to me at least) productive, so I was happy to be involved but it is time to move on.

              • judea pearl says:

                Now that we bring this discussion to an end, I want to thank you for
                your insightful comments, through which I learned what “hierarchical modeling”
                and “pooling” are about. Although I have originally hoped to talk about extrapolation, not hierarchical
                modeling, I nevertheless benefitted from the discussion and, if I can find the time, I will one day
                summarize what I learned on my blog, in a different
                language of course, so that my students can get it too.

                I do not presume the hierarchical modeling folks have learned much from me, about extrapolation,
                or about modern causal analysis, but this takes time and some willingness to learn a new language.
                Such willingness, I have learned, does not come out of linguistic curiosity; it comes out of
                necessity to solve problems that your old language cannot handle, and most researchers
                tend to only ask questions that the old language permits them to ask.

                On a more hopeful note, I believe our new book “Causal Inference in Statistics – A Primer”
                that I, Glymour and Jewell have co-authored, may do a lot to bridge language barriers and get
                statisticians to ask causal questions, like policy analysis, mediation and extrapolations.
                It will be out by March 2016 , see
                and I hope it proves that anyone can teach causality in Stat 101.

              • Keith O'Rourke says:

                Judea (mostly):

                The remaining challenge with this way of putting it “Once we get the pooling estimand correctly, we call in the estimation experts, be they Bayesians or non-Bayesians, HM or ML, deep learning or shallow learning” or “A hierarchical model would give a complete joint probability distribution for the data and parameters, and transportability analysis leave this completely unspecified, trusting that the hierarchical modeling experts would do a good job at that”

                is that the choice of specifying random parameters drawn from a distribution with a common parameter (exchangeability assumptions or ojm’s higher level invariance) is the implicit assumption that representing things that vary apparently haphazardly as random will do more good than harm. It can be put as a variance bias trade-off but a potentially serious one.

              • judea pearl says:

                I fail to see the challenge you are referring to, partly because your sentence is very long.
                I for one see nothing wrong with this summary:
                “A hierarchical model would give a complete joint probability distribution for the data and parameters,
                and transportability analysis leave this completely unspecified, trusting that the hierarchical modeling experts would do a good job at that”

                But, since you are conversant in both languages, perhaps you can convey to the hierarchical modeling experts, in their language, what they are missing if they continue to believe that their current methods can accomplish extrapolation of experimental findings across populations. I have a strange feeling
                they still believe this to be feasible.

          • Ricardo’s example shows the way the estimation proceeds. The same steps apply even if there are additional complications such as missing data.

            I would summarize the procedure for the estimation of causal effects as follows:

            1. Specify the causal model.

            2. Check the identifiability of the causal effect in the causal model. If the effect can be identified, use the rules of causal calculus to express the causal effect in terms of observed probabilities.

            3. Specify the process of data collection: study design, selection and missing data. Causal models with design can be used for this. Selection-backdoor criterion and other tools of external validity should be considered here.

            4. Form a statistical model according to the causal model with design. Hierarchical models should be considered here.

            5. Estimate the parameters needed to calculate the causal effect as derived in Step 2.

            An example where the procedure is applied to a case-control design is given in a paper recently published in Scandinavian Journal of Statistics.

        • Judea, I think one of the biggest issues here is that you may be talking right past a typical Bayesian and vice versa in terms of how probability is used.

          Remember, in Bayesian statistics, there is no ONE value of the parameters that we get out of a single experiment in Columbia. This is typically in contrast to a frequentist/classical analysis which comes up with one value such as the maximum likelihood value.

          Given that there is no one value, there is no ONE reweighting either. For every value of the parameter, there is one reweighted prediction so for every experiment, there is a distribution of parameters and from that distribution of parameters, a predictive distribution of outcomes. And for every transport from one population of ages to another, there is a reweighted *distribution* of parameters and of predicted outcomes.
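One way to picture “a reweighted distribution” is the sketch below. It assumes a binary outcome, two age strata, independent Beta(1, 1) priors on the stratum-specific response rates, and a target age distribution treated as known; the counts, weights, and the function name `transported_posterior` are hypothetical.

```python
import numpy as np

def transported_posterior(successes, trials, target_weights, n_draws=4000, seed=0):
    """Posterior draws of the transported response rate.

    Each posterior draw of the stratum rates is reweighted by the target
    population's stratum proportions, giving one reweighted prediction
    per draw instead of a single point value.
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(successes)
    n = np.asarray(trials)
    w = np.asarray(target_weights, dtype=float)
    # Beta(1, 1) prior per stratum -> Beta(1 + s, 1 + n - s) posterior.
    theta = rng.beta(1 + s, 1 + n - s, size=(n_draws, len(s)))
    return theta @ w

# Hypothetical experimental counts for (young, old) and target age mix.
draws = transported_posterior(successes=[40, 10], trials=[50, 20],
                              target_weights=[0.3, 0.7])
print(draws.mean(), np.percentile(draws, [2.5, 97.5]))
```

The output is a whole distribution for the transported quantity. In a fuller analysis the target weights would carry their own posterior uncertainty as well, which is where a hierarchical model for the age distributions could enter.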

          • Keith O'Rourke says:


            Probability is used in the same way for the usually important part: choosing to represent group variation as random (aleatory) variation rather than as fixed unknown constants to be estimated. It is used differently for hyperparameters, where Bayesians represent these with probability models (reflecting epistemological uncertainty) and non-Bayesians dogmatically _refuse to no matter what_.

            The main issue here, I believe, is that first choice: representing group variation as random aleatory variation rather than as fixed unknown constants to be estimated.

            This might be of interest –

          • judea pearl says:

            Daniel, Corey, Rahul, DB, Keith, jrc, et al.,

            I truly appreciate your patience and your
            honest effort to explain to me the role of
            hierarchical modeling in cross-population problems.

            You might wonder what made me resonate so readily with Ricardo’s
            explanation as opposed to other explanations that
            you have so graciously tried to educate me with.
            The answer is that Ricardo started with an
            ESTIMAND (E), namely the quantity that we wish to estimate.
            He then listed the data that is available to us,
            one coming from the Columbia experiment, the other
            from survey data in Manhattan.
            Finally, he shows that we can express E in terms
            of what we have (making one assumption). This
            for me was an immediate affirmation that we are talking
            about the same problem, and that we can continue
            to discuss hierarchical modeling.
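Ricardo's derivation is not reproduced in this part of the thread. As a sketch of what an estimand of this kind typically looks like, assume Z is age group, P is the distribution in the experimental population (Columbia), P* is the distribution in the target population (Manhattan), and age is not affected by treatment. The one substantive assumption is then age-specific invariance of the effect:

```latex
\begin{align*}
E &= P^{*}(y \mid do(x)) \\
  &= \sum_{z} P^{*}(y \mid do(x), z)\, P^{*}(z \mid do(x)) \\
  &= \sum_{z} P^{*}(y \mid do(x), z)\, P^{*}(z)
     && \text{(age is not affected by treatment)} \\
  &= \sum_{z} P(y \mid do(x), z)\, P^{*}(z)
     && \text{(age-specific invariance)}
\end{align*}
```

The first factor in the last line is estimable from the experiment and the second from the survey, which is the sense in which E is expressed "in terms of what we have".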

            Without trying to impose this style
            of communication on others, I find it
            to be very effective and extremely time saving.

            In all honesty, I did not anticipate this
            conversation to generate 51 responses.
            My intention was to discuss approaches to external
            validity but, somehow, we got sidetracked into hierarchical modeling.

            Going back to external validity, Ricardo’s post
            illustrates the main points I was trying to make:
            1. There is no need to complicate external validity
            problems with procedural details or estimation details.
            2. We now have simple mathematical notation for expressing
            (a) what we want, (b) what we have, (c) what we assume and,
            from there, it is all a matter of algebra.
            3. Once we get the pooling estimand right, we
            call in the estimation experts, be they Bayesians or
            non-Bayesians, HM or ML, deep learning or shallow learning,
            targeted or un-targeted estimation. The tent is wide and open.

            Thanks again, and apologies for the slow learning.
            I hope you are satisfied with the solution to the external validity problem.


            • ojm says:

              What do you dislike about the notation that I used above? I find it suggestive for some things that I don’t find your notation very suggestive for and vice versa.

              Note I equally made no assumptions on how estimation is to be carried out, I just factorised some probability distributions to indicate conditional independence/invariance.

              • judea pearl says:

                It is not that I dislike it; I just could not be sure we have the same problem in mind.
                1. Problems about causal effects cannot be articulated in the language of conditional probabilities.
                We need a do-operator to distinguish experimental from observational studies.
                We need a do-operator to express the assumption of age-specific invariance.
                2. I could not see where the HM comes in and who is doing the pooling.

                Are you familiar with the do-X notation?
                Try it; it would make life so much easier, for the reasons above and for the reasons
                demonstrated in the Stat Sci. paper.
                Best, Judea

              • Rahul says:


                I think it was a trick question. Unless the explanation included do-X notation you cannot get full credit on this exam. :)

              • judea pearl says:

                This was not an exam. This was an example through which I hoped to learn how HM deals with data pooling.
                Plus, there is a way to avoid the do-operator, by calling the experimental study P_1 and the observational P_2
                and then state verbally the assumption of invariance, then point to which data is obtained from P_1 and which from P_2.

                As I used to tell Andrew and Imbens: you do not need to learn multiplication, you can always add a number to itself n times.

                But, seriously, this was my attempt at learning HM, and I did not see from your answer how HM is used. I am still not
                sure that Andrew would buy my interpretation.

              • ojm says:

                Fair enough.

                Yes I am (somewhat) familiar with the do operator. I have also learned (I hope!) valuable things from reading your (prize winning!) work.

                I just disagree that the do operator is a notation that is particularly helpful to me. I also (respectfully) disagree with a number of your claims. I may still change my mind one day.

                One thing that might help is if you presented an analysis of either or both of
                – the ideal gas
                – the two slit experiment
                using your notation and whatever other context you think helpful.

                This may be simplistic but – if you include the delta function as a valid conditional probability (used within integrals of course) and allow proper quantifiers then you appear to deny that functional equations can express causal assumptions?

            • Rahul says:


              Simple, pedagogic examples are one thing. But do you have any examples where you’ve actually applied your formalism for external validity in the context of an actual, messy, applied dataset?

              i.e. Any pointers to a good applied paper using the framework for external validity you are advocating? Not a methods paper please.

              • judea pearl says:

                Only the mighty Gods believe they can handle “actual, messy, applied dataset” without understanding
                simple, pedagogic examples. I am just a humble mortal, enjoying the miracle of understanding.

              • Ricardo Silva says:


                The Dehejia et al. paper Andrew originally linked to (and which I re-linked above) follows this formalism very closely, even if they use a different notation. I agree that it is by showing their value with real applications that these methods can take off and make the most impact. But to my understanding, the core method in Dehejia et al. is nothing but a special case of Bareinboim and Pearl (but showing how it works in a time-series context is interesting in itself, in any case).

                On a side rant not really related to your perfectly valid request, I would like to add that being pedagogic is no small thing. I never stop being surprised by how often causal inference method papers discussing “messy data applications” have at their core very simple ideas that just get obfuscated (and I have to say, it irritates me to no end when in discussion papers I see comments from authors that it is beneath them to explain something in “toy” examples. I wasted much of my time as a student decoding pages of “deep” insights into much simpler equivalent models).

                I really liked Dehejia et al., those are good papers that don’t obfuscate their core ideas, and many thanks to Andrew for bringing them to my attention via his blog. But their most useful bits (to me, since I’m not an applied researcher in the area discussed in that paper) were the methodological ones, and those can be explained very effectively with made-up examples without losing anything. With that I feel more confident on how to apply those ideas to my own problems, or to look at the more general cases figured out by Bareinboim and Pearl.

              • Rahul says:


                Is the Dehejia et al. paper using do-calculus? In my initial read I couldn’t find those parts.

                Also, I’ve nothing against pedagogic examples. But if a methodology needs successful applied adoption, at some point applied practitioners have got to start to use it on actual problems. Perhaps it is only my ignorance, but I’ve felt that for do-calculus / DAG based methods this transition has been rather slow and bumpy.

                This slowness could very well be only the inertia / ignorance / limits of practitioners but whatever it is I hope it will change. I’d love to start regularly bumping into papers that use DAGs / do-calculus on important practical problems.

                Personally, I’ll readily admit that I’ve never really grasped DAGs sufficiently (and I have tried more than once!) to be able to use them on a new causal problem. But that, I’m sure, results from my own cognitive limits.

              • Ricardo Silva says:

                Hi, Rahul

                The Dehejia paper uses potential outcomes, which can be “translated” into the do-calculus. The “translation” is the one I’ve posted in my initial comment. Judea, Rubin and others may be more strongly opinionated on which type of notation to use, and I myself feel far more comfortable with DAGs, but by the end of the day you should look into whatever makes you feel comfortable as long as you are coherent in your notation, distinguishing observational and interventional regimes.

                Good textbooks that you might find interesting, which mix DAGs and counterfactuals, are Morgan and Winship, “Counterfactuals and Causal Inference”, and Hernán MA, Robins JM (2016). “Causal Inference”. The latter has a draft available on-line. Both have nice examples with real data, with Morgan and Winship focusing on social sciences.


              • judea pearl says:

                You wrote
                “you should look into whatever makes you feel comfortable as long as you are coherent in your notation, distinguishing observational and interventional regimes.”
                I agree that there is some choice of style and comfort here, but it is not all a matter of style and comfort.

                The reason I got into this discussion was to assure people interested in external validity that getting into this line of
                research does not require agonizing over the problems shown in Dehejia’s papers, which are of two kinds:
                1. Not understanding the assumptions (conditional ignorability) thus giving up on defending them.
                2. Giving up on handling post-treatment covariates.
                These are hard science limitations, not merely a matter of convenience.
                Well, thanks for giving me a stage to make my point.

  4. judea pearl says:

    Andrew, Keith, etal,
    There must be something I do wrong in our communication, because I ask simple questions, and I get
    answers that make me feel like I will never understand what hierarchical modeling (HM) is all about.
    What I am laboring to understand is ONE scenario in the context of Andrew’s problem of “generalizing from one pharmacological experiment to another”.

    If you would allow me to go back to this simple context of two pharmacological experiments, I will try to clarify my first
    three questions:

    1. Why do we want to “generalize”? Why not just discard the second experiment and stick to whatever
    the first experiment tells?

    2. Is it because the sample size is small and we wish to capitalize on the combined samples?

    3. Is it that the two populations appear to differ in some relevant characteristics?

    These basic questions may sound naive, but for me they are essential for understanding the purpose, and hence the philosophy, of HM.
    Can any HM expert come down to my level of ignorance and assist in answering these three motivational questions?

    Specifically, would we still be doing HM w/o (2), namely, if we had enough samples in both experiments (very low sampling variability)?
    Similarly, would we still be doing HM w/o (3), namely, if the two populations are the same and the two experimental conditions are the same?

    • 1) Why run another experiment at all if you are going to throw it away and stick with the first one? Obviously, the second experiment adds to the information you have available unless it’s done very very poorly.

      2) Not only are sample sizes small, but also, conditions inevitably differ, we may wish to actually learn about the variability between otherwise “identical” experiments for example.

      3) Yes, or even that they do differ in an unknown characteristic even though they appear to be the same in all aspects which were recorded.

      But, 1,2,3 are true whether we model things hierarchically or not aren’t they? I mean, if we test a drug on 100 patients in CA with certain age and sex mix and with certain conditions… why ever do another experiment?

      Suppose that in our 100 patients, every single one of them had EXACTLY the same favorable outcomes, so that the posterior distribution of the “fraction of successful cure” parameter was very close to 1. Unless you have some information which tells you that there are conditions under which such an outcome could vary, you WOULDN’T do another experiment. Practically though, this never happens. There is both variation in outcome, and well known variations in how well things in general work under alternative conditions.

    • Andrew says:


      I also wrote a paper called, Multilevel (Hierarchical) Modeling: What It Can and Cannot Do, so maybe that would help. Or you can wait for our new paper to be available to be shared; it won’t be long, I hope!

    • D.O. says:

      Let me try.
      As I understand Prof. Gelman, he prefers not to work with very well constrained averages, but increases the number of covariates until his models break. If, for example, you have a population with age as the only parameter and you happen to have very solid data for every age group such that it does not make sense to consider differences between different ages as, in substantial part, due to random error, then you may just go ahead and reweight the age groups to predict what happens in another population. From Prof. Gelman’s point of view (if I understand it correctly) you will be just leaving some information on the table. It is always possible to think about some other covariates, dilute your groups until variances in each group become comparable with between-the-groups differences, partially pool averages for these groups, and then match the target population using all these extra covariates.

      • Andrew says:


        To be fair, I don’t always use a hierarchical model. In this recent paper, for example, we just did simple age adjustment (and I remain baffled as to why the authors of the paper we discussed did not do so themselves, but I do have some speculations related to possible misunderstanding on their part of different orders of bias correction).

        So, sometimes I think a simple non-hierarchical adjustment can work fine. My point was that if a researcher is concerned about problems with generalizing from one group to another, or one scenario to another, then I recommend hierarchical modeling as an alternative to complete pooling or no pooling. If a researcher is already happy with one of these simpler alternatives, fine; in many settings complete or no pooling will be a reasonable approximation given data at hand. But if a researcher is feeling uneasy about this choice, hierarchical modeling seems like the natural way to go.

        • Keith O'Rourke says:

          > But if a researcher is feeling uneasy about this choice, hierarchical modeling seems like the natural way to go.
          This is perhaps what some folks miss – the absence of an easy data model choice that only requires _fixed_ parameters (which is likely to be the case “if a researcher is [thoughtfully] concerned about problems with generalizing from one group to another”.)

          Moving to _random parameters_ (e.g. exchangeable groups) to _evade_ the uneasy choice when it is assessed as required – is conceptually very different than moving between models with fixed parameters. The randomness is no longer being taken as literal but rather just a conceptual device to get better estimates and intervals. I used to use the term _inexplicable_ heterogeneity that is usual when investigating study replication and then the question becomes when is it advantageous to treat inexplicable variation as being random (or assume studies are exchangeable.)

          Corey and Daniel were pointing to this aspect as well.

          If you have a simple problem where fixed parameters are adequate – then there is no reason to consider more complicated models with random parameters.

  5. Z says:

    I think Judea is concerned with the question, “Given that I know all relevant confounders and effect modifiers X, when and how can I consistently estimate the average effect of treatment T on outcome Y in population B given data on X,T,Y from population A and X from population B?”

    I think Andrew is mostly concerned with the question, “Given that average effects tend to vary between populations for mysterious reasons, given data on covariates X’, treatment T, and outcome Y from populations A through G, what is my best guess about the effect of T on Y in population H given the distribution of X’ in H?”

    In my interpretation of Andrew’s formulation, we don’t necessarily know exactly what the relevant effect modifiers and confounders are (so we use X’ instead of X) but we still need to try to predict the treatment effect in a new population. In this scenario, under various distributional assumptions, partial pooling is a good idea to minimize risk. If we do know the relevant X, then hierarchical modeling should of course be used to ‘model the story’ as Judea put it. Andrew would probably say that in the applications he’s worked on he’s rarely known ‘the story’.

  6. ojm says:

    Both Judea’s causal models and Andrew’s hierarchical models rely on conditional independence assumptions to encode model structure. My impression is Judea cares more about formal methods for making these conditional independence (structural) assumptions and exploring their consequences and interpretations.

    I don’t really see any contradictions – Pearl’s causal assumptions imply a hierarchical (conditional) structure and vice versa, no?

    Re: different cities. The model structure is invariant at some ‘top level’ but underlying parameters differ, no? The transformation between underlying parameters provides another layer in the hierarchy and may help constrain one parameter set given another.

    • judea pearl says:

      Gee, if causal assumptions imply a hierarchical structure, then I understand HM !!! Finally!! Thanks.
      But then Andrew will not encourage people doing external validity to consider HM, they are already considering
      and executing HM. What remains to be considered?
      I am still missing something.

      • ojm says:

        A given causal model implies a hierarchical *structure*.

        Many things are structurally isomorphic. What remains is the details.

        PS Why might Andrew recommend HM? Ask Stan for one example.

      • Jared says:

        A hierarchical model would give a complete joint probability distribution for the data and parameters, and at least some of the papers from the “transportability crowd” (for lack of a better label) leave this completely unspecified. Maybe they all do, but I can’t claim to have read them all. Of course the mathematics of transportability don’t require this explicit, completely specified probability model. And in many of these papers there is a complete lack of actual data so it would be pointless to introduce a full probability model, hierarchical or not.

        As far as I can tell your results are about *what* you want/need to compute (and when it is safe to compute it). They don’t seem to have anything to say about how to estimate the relevant probability distributions and expectations. I think Andrew’s point is that hierarchical modeling would be a useful tool when it comes time to actually start computing estimates of these from data. This is especially true when you’re trying to think carefully about all sources of variation. Actually doing this probability modeling sensibly — estimating all the relevant probability distributions — is highly nontrivial! There remains a lot to be considered. But it’s sort of moot in the absence of real problems and data.

        • judea pearl says:

          I think you are absolutely right, this is how I understood things today.
          A hierarchical model would give a complete joint probability distribution for the data and parameters, and transportability analysis
          leaves this completely unspecified, trusting that the hierarchical modeling experts would do a good job at that. Instead, what transportability
          analysis tells us is how to put all those joint distributions together, something HM does not tell us (if my understanding is correct).
          I think we are lucky to be able to separate the task so neatly, both subtasks are highly nontrivial.

          • ojm says:


            Constructing a hierarchical model requires specifying how you wish to factorise your general, overall distribution into a more specialised combination of distributions. This factorisation is not implied by the overall distribution. What gives the factorisation is mechanistic/causal assumptions, including boundary conditions.

            (BTW people realised – and formalised the fact – that mathematical models aren’t in general well-posed without boundary conditions long before do(X=x) was invented.)

            Hierarchical modelling is thus useful for
            – specifying the model structure and boundary conditions (mechanistic/causal assumptions) by assuming specific factorisations that don’t hold purely as a matter of probability theory
            – allowing useful estimation, model checking (etc) tools to be used naturally, given a specific factorisation

            Possible explanations for you not appearing to understand this are

            – I misunderstand you/I am wrong or missing something
            – your notation is obscuring the issues to the extent that you cannot recognise an isomorphic (to good approximation, anyway) mathematical modelling approach
            – you have some motivation for deliberately misunderstanding
            – something else

            I have no horse in this race, so I just want to end my comments on this post by saying
            a) you have made valuable contributions to thinking about causality

            b) so have many physicists, mathematicians and statisticians who have been expressing causal assumptions using standard mathematics for many years (logic with quantifiers, set theory, category theory etc etc).

            To say that people not using do-calculus (or whatever) who try to do causal reasoning are like people doing mathematics without using a multiplication symbol is, to me, to severely underestimate the expressive power of existing mathematics.

            • judea pearl says:

              I do not understand why you feel so threatened by the do-calculus.
              I think if you try to solve any of the data-fusion examples in
              you will appreciate its power. Try Fig. 5 (b) or (c).

              Same if you try to use logic, set theory, category theory, statistics, or string
              theory to resolve any of the Simpson’s paradox examples:

              To say that we can do everything with standard mathematics
              is to severely underestimate the capability of our generation
              to innovate new tools.

              Where does this defeatism come from?


            • Keith O'Rourke says:


              Part of it could just be limitations of blogging.

              > factorise your general, overall distribution into a more specialised combination of distributions
              I think this is what I meant by pooling happening through parameter specifications and likelihood multiplication.

              The joint data generating model (likelihood) of all the studies can be factorized by study and within each study the parameters specified as common over all studies (complete pooling), arbitrarily different by study (no pooling) and (conceptually) as random draws from a distribution that has some common parameters for all studies (partial pooling often justified by exchangeability). This way to get partial pooling might seem more natural as Bayesian, but it can be (and has been) defined as a type of likelihood.
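The three parameter specifications described above can be sketched numerically. This is only an illustration under stated assumptions: a normal model for study-level effect estimates with known standard errors, a between-study standard deviation `tau` that is fixed by hand rather than estimated, made-up numbers, and hypothetical function names.

```python
def complete_pooling(estimates, std_errors):
    """One common effect: the precision-weighted mean of the study estimates."""
    weights = [1.0 / se**2 for se in std_errors]
    return sum(w * y for w, y in zip(weights, estimates)) / sum(weights)


def partial_pooling(estimates, std_errors, tau):
    """Study effects treated as draws from Normal(mu, tau^2), with tau fixed.

    Returns the estimate of mu and the shrunken study-level estimates.
    """
    weights = [1.0 / (se**2 + tau**2) for se in std_errors]
    mu = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    shrunken = []
    for y, se in zip(estimates, std_errors):
        # Weight on the study's own estimate; the remainder goes to mu.
        own = (1.0 / se**2) / (1.0 / se**2 + 1.0 / tau**2)
        shrunken.append(own * y + (1.0 - own) * mu)
    return mu, shrunken


# Hypothetical study-level effect estimates and standard errors.
estimates = [0.30, 0.10, 0.55]
std_errors = [0.10, 0.10, 0.20]

no_pooling = list(estimates)  # each study keeps its own estimate
common = complete_pooling(estimates, std_errors)
mu, partial = partial_pooling(estimates, std_errors, tau=0.15)

print("no pooling:      ", [round(v, 3) for v in no_pooling])
print("complete pooling:", round(common, 3))
print("partial pooling: ", [round(v, 3) for v in partial])
```

In this sketch, letting `tau` grow large recovers the no-pooling estimates and letting it shrink toward zero recovers complete pooling, so the two simpler choices appear as limiting cases of the partial-pooling specification.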

              Now Ricardo might mean in his comment that getting estimand (E) as entailed by model assumptions (G) does not pin down whether various parameters should be specified as common or random draws from a distribution with common parameters.

  7. enchilada says:



