## Keli Liu and Xiao-Li Meng on Simpson’s paradox

XL sent me this paper, “A Fruitful Resolution to Simpson’s Paradox via Multi-Resolution Inference.”

I told Keli and Xiao-Li that I wasn’t sure I fully understood the paper—as usual, XL is subtle and sophisticated, also I only get about half of his jokes—but I sent along these thoughts:

1. I do not think counterfactuals or potential outcomes are necessary for Simpson’s paradox. I say this because one can set up Simpson’s paradox with variables that cannot be manipulated, or for which manipulations are not directly of interest.

2. Simpson’s paradox is part of the more general issue that regression coefficients change when you add more predictors; the flipping of sign is not really necessary.

Here’s an example that I use in my teaching that illustrates both points:

I can run a regression predicting income from sex and height. I find that the coef of sex is $10,000 (i.e., comparing a man and woman of the same height, on average the man will make $10,000 more) and the coefficient of height is $500 (i.e., comparing two men or two women of different heights, on average the taller person will make $500 more per inch of height).

How can I interpret these coefs? I feel that the coef of height is easy to interpret (it’s easy to imagine comparing two people of the same sex with different heights), indeed it would seem somehow “wrong” to regress on height _without_ controlling for sex, as much of the raw difference between short and tall people can be “explained” by being differences between men and women. But the coef of sex in the above model seems very difficult to interpret: why compare a man and a woman who are both 66 inches tall, for example? That would be a comparison of a short man with a tall woman. All this reasoning seems vaguely causal but I don’t think it makes sense to think about it using potential outcomes.
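Both points can be made concrete with a small simulation (all numbers are hypothetical, chosen to roughly match the teaching example): the coefficient on height changes substantially, without any sign flip, depending on whether sex is in the model.

```python
# Hypothetical simulation, loosely matching the teaching example above.
# None of these numbers come from real data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
male = rng.integers(0, 2, n)                      # 0 = woman, 1 = man
height = 64 + 5 * male + rng.normal(0, 2.5, n)    # inches; men taller on average
income = 20_000 + 10_000 * male + 500 * height + rng.normal(0, 5_000, n)

def ols(predictors, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_height_only = ols([height], income)
b_both = ols([male, height], income)

print(f"height coef, sex omitted:  {b_height_only[1]:7.0f}")  # inflated, ~1500
print(f"height coef, sex included: {b_both[2]:7.0f}")         # near the true 500
print(f"sex coef, height included: {b_both[1]:7.0f}")         # near the true 10000
```

Omitting sex loads the male/female income gap onto height (since men are taller on average), so the height coefficient roughly triples; adding sex brings it back to the value built into the simulation. The sign never flips, yet the coefficient's meaning changes completely.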

Keli replied:

There are variables which cannot be manipulated by a human, but it is still possible to imagine hypothetical manipulations. For example, we discuss how the color of a certain species of plant is not really manipulable (it comes with the species). But one can imagine some sort of “proto”-plant, which is that plant before it had color attached. We can now talk about manipulation of color for the proto-plant. The key, however, is that our unit of analysis has changed: when we think about manipulation of color, our fundamental unit is no longer the plant but rather the proto-plant. So we can always imagine hypothetical manipulations, but we need to keep careful account of what our unit of analysis is.

Now to the second part of the question: why should we be interested in these hypothetical manipulations in the first place if we cannot perform the actual manipulation? As you say, Simpson’s Paradox is really encompassed by the more general problem of which predictors to include in our regression. Imagine the ideal scenario where we can gather data from as many individuals as we want and for each individual we have an incredibly rich set of predictors. Now suppose we want to study the effect of height on income. What predictors should we include in our regression? Our point is that to answer this question, you need to think about manipulation of height, even if it is only a hypothetical manipulation.

In particular, when we think about the hypothetical manipulation of height, our unit of analysis is no longer an individual from the dataset (since these individuals come with height already). Rather the unit of analysis is some “proto”-individual. These “proto”-individuals have attributes but height is not one of these attributes—hence why we can think about manipulation of height. We should include in the regression all predictors which are attributes of this “proto”-individual, but leave out predictors which are not attributes of this “proto”-individual.

When you say “indeed it would seem somehow ‘wrong’ to regress on height _without_ controlling for sex”, what you are implicitly doing in your mind is conceiving of this “proto”-individual and realizing that sex is in fact an attribute of this “proto”-individual. Similarly, when you say, “why compare a man and a woman who are both 66 inches tall, for example?” what you have done is to imagine a hypothetical manipulation of gender. In your thought experiment, you have made the judgment call that the type of proto-individual which allows a hypothetical manipulation of gender does not possess the attribute of height (if we are thinking about God making humans on an assembly line, height is added only after gender) [No, I’m not thinking about God making humans on an assembly line — AG]. Hence we should not include height in the regression if our goal is to learn about the effect of gender.

To me, the whole proto-individual idea just adds more complexity and leaves me even more confused! And I also think I can interpret those regressions without having to think about manipulation of height or of sex—to me, these are between-person comparisons, not requiring within-person manipulations. But I’ll put this all out there for the rest of you to chew on.

1. Sherman Dorn says:

Quick note: the URL for the paper is http://www.stat.columbia.edu/~gelman/stuff_for_blog/LiuMengTASv2.pdf (you forgot the ~gelman).

2. Shira says:

Is this related to the point on p.167 (intro to Chapter 9) of ARM about the difference between predictive inference and causal inference?

Interpreting regression coefficients as comparisons between people, not the manipulation (causal) interpretation, also means avoiding language like “with variable x held constant” and instead saying “among those people with the same level of variable x”, right?

3. Fernando says:

Surprised they don’t cite these:

Simpson’s Paradox: An Anatomy. Judea Pearl.

4. derek says:

“Why compare a tall person and a short person who are both women, for example? That would be a comparison of a short person who was “girly” and a tall person who was “mannish”.”

If you think that sentence looks strange, that’s how your assertion that a man and a woman of the same height are both “tall” and “short”, depending on whether they’re a man or a woman, looks to me. Short is what women are–or else if you’re calling a 66 inch woman tall, then women aren’t short after all. Can’t have it both ways.

If you don’t control for things because it makes you feel oogy that you’re comparing things that are different, then don’t control for anything. Ending up with things in the same part of the distribution, that started out in different parts of the distribution, is what controlling for things is *supposed* to do, isn’t it?

• Andrew says:

Derek:

I don’t understand when you write: “Short is what women are—or else if you’re calling a 66 inch woman tall, then women aren’t short after all.” Women are, on the average, shorter than men, but they’re not all short. And 66 inches is tall for a woman (in the U.S.).

5. Keli says:

Hi Andrew,

Thanks for your comments. I also usually do not think about God making humans on an assembly line, but I think I had just had a conversation with someone about automobile manufacturing before I wrote that email :)

I think our disagreement is probably semantic. I really like your response: “to me, these are between-person comparisons, not requiring within-person manipulations”. I think it captures the essence of the problem: what are we trying to compare? In my terminology, we need to determine the unit of analysis (or comparison). Your choice of unit, as implied by “between-person”, is a person. I feel that this is somewhat ambiguous. What exactly is a person?

Here’s my suggestion for how to rigorously define “person” for the purpose of learning about the effect of height (obviously this definition changes depending on our goal): a person is that collection of attributes invariant to height. Incidentally, a professor just pointed out to me this week that this definition was used by Arthur Goldberger to characterize “structural parameters” in econometrics (too bad we weren’t able to get this reference into the paper, as it may have helped a lot of economists understand what we were trying to say). Once we have this definition, we know precisely what to condition on. I call this person a “proto-individual” (since it may not have some attributes, not invariant to height, which we might normally associate with humans); you call it a “person”; but I think we’re talking about the same thing (let me know if I’m misinterpreting you).

One of the motivations for the article was, can we get people to think about Simpson’s paradox and causality rigorously without much of the jargon? I think in the article, we try really hard to convince people that it all boils down to asking “How do I avoid comparing apples and oranges”–which is something everyone can grasp.

• I don’t know, you social scientists seem a little mixed up when it comes to defining what sounds to me like “experimentally controllable” variables. The fact is, we can’t take a person and knock off a couple of inches leaving “all else equal” (because such a person becomes a hospital patient).

However, I would like to point out that any analysis we do should ultimately be independent of the units of measurement, and because of this “tallness” should be expressed as a ratio of actual height of a person to some reference height. There is no reason to believe the reference height MUST be invariant to sex. So I’m guessing H/(72 inches) for males and H/(63 inches) for females might produce coefficients which are more or less comparable independent of sex, and I think this is what’s wrong with the discussion so far.

• Andrew says:

Dan:

Nope, if you do the analysis on the log scale you get the same basic pattern. The trouble is that, if you want to think of it in that way, the baseline for height isn’t really 0; it’s more like 4 feet or something like that. (Also, the average heights of American men and women are more like 70 and 64.5 inches, not 72 and 63.) In any case, for whatever reason, the log scale or relative scale doesn’t do much for you when analyzing heights of adults.
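Dan’s rescaling proposal is easy to check by simulation. In the hypothetical generative model below, income is linear in height with a common slope of $500 per inch for both sexes, yet the fitted slopes on H/H0 (with sex-specific reference heights H0) still differ by sex, consistent with Andrew’s point that the relative scale doesn’t do much here.

```python
# A quick numerical check of the sex-specific rescaling idea. The generative
# model is hypothetical: income = 20000 + 10000*male + 500*height + noise,
# with mean heights of 70 in (men) and 64.5 in (women).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
male = rng.integers(0, 2, n)
height = np.where(male == 1, 70.0, 64.5) + rng.normal(0, 2.5, n)
income = 20_000 + 10_000 * male + 500 * height + rng.normal(0, 5_000, n)

def slope(x, y):
    """OLS slope of y on x, intercept included."""
    beta = np.linalg.lstsq(np.column_stack([np.ones(len(x)), x]), y, rcond=None)[0]
    return beta[1]

H0 = {0: 64.5, 1: 70.0}   # sex-specific reference heights
slopes = {s: slope(height[male == s] / H0[s], income[male == s]) for s in (0, 1)}
print(slopes)  # each slope is roughly 500*H0, so they still differ by sex
```

With a common per-inch slope b, the slope on H/H0 is b*H0, so dividing by the larger male reference makes the male coefficient larger, not equal; the normalization alone does not make the coefficients comparable.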

• I’m not sure if we agree or disagree actually. In any case, in every mathematical model of physical systems we are free to choose certain units of measurement, and yet our results should be (but often because people don’t construct models in this manner, are not!) independent of the units of measurement. This is just a basic symmetry property of the universe. If your result is dependent on you measuring in some special units, then the result is actually a property of the special units, not of the universe.

Anyway I think if I put it more clearly you might agree or disagree.

Suppose I am trying to predict $ from H and S. I can write the most general form of this as $ = f(H) + g(S) + q(H,S). Each term must be in dollars. And sex can be considered a dimensionless variable as well (it’s really just an indicator, and can take on the values 0 or 1, for example). However, notice that unless H is dimensionless, when we change units of measurement, the function f must also change in order to keep the relationship constant. To eliminate this property we should express H relative to a reference H0:

$ = f(H/H0) + g(S) + q(H/H0, S)

Now, we could in fact split this into two equations:

$_m = fm(H/H0m)
$_f = ff(H/H0f)

where now we use S to set H0 in the two cases, and we allow f to potentially be two different functions. Now the question is: is the general situation such that fm and ff are actually the same function, provided that H0m and H0f are chosen correctly? In this case, males and females are not different, they’re just scaled versions of each other. We could also allow for different baselines… so we might look for f' such that

$_m = f'(H/H0m - rm)
$_f = f'(H/H0f - rf)

in which case they are shifted and rescaled versions of each other… This is an interesting symmetry property if it is true. Or we might find that we can’t do it… we need different functional forms for fm and ff regardless of these rescaling and shifting operations… in which case something different is happening: there is something about being male or female which is different from the opposite sex in a way that isn’t simply a change of size or location… But if we don’t express things in this manner, we get tied up in nonessential aspects of the problem… namely, the units in which something is measured.

• To expound further… often f might be expressible as a linear term plus some other stuff… when that’s the case, we are free to FORCE one of the linear terms to have coefficient 1 by choice of H0… so there aren’t two coefficients, one for males and one for females; there is only one, since the other can always be forced to 1 by choice of reference size.

• Another way of thinking about this analysis is that we actually have two parameters that we are using to build our model: H/H0m and H0f/H0m (or if you like you can use f as the baseline, but since H0m is bigger than H0f it’s convenient to have the second parameter normalized by the larger value, and then it’s a parameter less than 1).

• D.O. says:

Suppose you are designing caves for an amusement park. If the cave entrance is too large people wouldn’t be interested, and if it’s too small they wouldn’t like to crawl to get in. How will normalizing height by sex help you?

• Keli says:

Dan,

Completely agree with you that “we can’t take a person and knock off a couple of inches leaving ‘all else equal’” in practice. But before we even worry about how to conduct the physical experiment, we need to know if we can formulate “the effect of height on income” in a well defined way.
If one can at least think of a conceptual manipulation in which you “knock off a couple of inches leaving ‘all else equal’”, then at least our question is well defined (of course you still don’t know how to go about answering it in practice yet). Once you have such a conceptual manipulation in mind, and in particular once you know the experimental units that this conceptual manipulation would act on, you can ask, “Is there any way I can get my current data to look like data that would have been generated under this conceptual experiment?” This question is the intuition behind methods like propensity score matching. Even though talking about such conceptual manipulations may seem like philosophical quibbling, it has practical value in clarifying for us what our ideal data and our ideal analysis would be. Once we know the ideal, we at least know what we are trying to approximate and can be aware of how we are failing in our approximations.

6. D.O. says:

I am not sure it helps to solve the general Simpson’s paradox problem, but can’t we use data to resolve the question? We can estimate how much the target variable (income) changes with a one-standard-deviation change in each of the predictors, look at the sample correlation between the predictors, and then construct a criterion for how much sense it makes to talk about variation in sex keeping height constant. Something like: if such-and-such a percentage of the variation in income due to height can be explained through its covariation with sex, then it is not meaningful to think about the influence of height on income without sex as a covariate. I suspect that this is what the paper really is doing in a much more precise and sophisticated way, but then why indeed do we need the counterfactuals?

7.
judea pearl says:

Andrew,

At the risk of sounding (again) like a flashlight salesman, let me point your readers to this paper, “Understanding Simpson’s Paradox” http://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf which will appear this month (February) in The American Statistician. The main paragraph reads:

“Any claim to a resolution of a paradox, especially one that has resisted a century of attempted resolution, must meet certain criteria. First and foremost, the solution must explain why people consider the phenomenon surprising or unbelievable. Second, the solution must identify the class of scenarios in which the paradox may surface, and distinguish it from scenarios where it will surely not surface. Finally, in those scenarios where the paradox leads to indecision, we must identify the correct answer (i.e., consult the aggregated table or the disaggregated table), explain the features of the scenario that lead to that choice, and prove mathematically that the answer chosen is indeed correct. The next three subsections will describe how these three requirements are met in the case of Simpson’s paradox and, naturally, will proceed to convince readers that the paradox deserves the title ‘resolved.’”

I think this paper will convince you that:
1. No data in the world can answer Simpson’s question.
2. Any attempt to resolve the paradox without invoking causality is doomed to failure.
3. Those who equate causality with potential outcomes would have a truly hard time thinking through this paradox (though Larry Wasserman managed to do it in a simple problem).

• Keli says:

Thanks for the link, looking forward to reading it. Before getting to your three points, I’d like to point out something unique about our discussion of Simpson’s paradox, in particular the reason we use the word “multi-resolution” in the title. You say, “We must identify the correct answer, (i.e., consult the aggregated table or the disaggregated table)”.
One of our arguments in the paper is that in many cases where Simpson’s paradox arises, we have more choices than just the aggregated table or the disaggregated table. The set of possible decisions is not dichotomous. The multi-resolution framework is a way to see all the choices available to us (which correspond to different “resolutions” of inference) and also to choose between them in a mathematically rigorous way. As we discuss, the choice is quite similar to the bias-variance tradeoff.

For the three points that you make:

1. “No data in the world can answer Simpson’s question.” Agree. To answer the question one way or the other, we need to make assumptions that most times cannot be checked by the data.

2. “Any attempt to resolve the paradox without invoking causality is doomed to failure.” Agree in spirit. Xiao-Li and I feel that the role of “causality” in Simpson’s paradox is to inform us about what constitutes a relevant comparison: if we want to compare apples to apples rather than apples to oranges, what is our “apple”? This is pointed out in the wonderful reply by Michael below.

3. “Those who equate causality with potential outcomes would have a truly hard time thinking through this paradox (though Larry Wasserman managed to do it in a simple problem).” I’d have to disagree on this one. I feel that potential outcomes are notationally rich enough to explain the paradox. This obviously doesn’t mean that there aren’t other means of expressing what’s going on in the paradox.

• CK says:

Keli, I’m wondering if you had a chance to try to represent Figure 1(c) of Judea’s paper using a potential-outcome language.

• Keli says:

Ok, just took a quick look at 1(c) (this is in the paper that Judea just linked, right?). So first, let’s try to represent Judea’s diagram using the following potential-outcome notation. From the diagram, we have Y = Y(x, L2), i.e., Y is a function of the treatment assigned, x, and L2.
In the paper, we emphasize that to figure out what to compare to what, you need to formulate the domain of the outcome space as a product space \Omega x X, where \Omega is a population containing the units of analysis and X is the space of possible treatments. The product-space structure ensures that characteristics of the units in our population, i.e., functions of \omega in \Omega, are invariant to treatment. It is immediately obvious from the equation that the product space in this case is L2 x X. So, ideally, we would like to compare individuals with the same L2 value (this defines the appropriate units for comparison).

But, in Judea’s problem, L2 is unobserved, so what should we condition on? We notice that Z is observed. Does this mean that we should condition on Z? We should definitely use the information in Z; after all, Z is a function of L1 and L2, so knowing Z tells us something about L2. BUT to use this information does not mean we directly condition on Z (i.e., compare people with the same Z value) when making treatment comparisons. If one were a Bayesian, one would build a full Bayesian model that allows one to impute L2 given the information in Z, X, and Y. If you’re a frequentist, you can still build a good predictive model for L2. The key is that even though directly conditioning on Z is wrong in this case, as Judea points out, this does not mean we should ignore Z. To be efficient, we need to use Z to help impute L2.

Of course, the above paragraph assumes that even if L2 is unobserved, we still know it exists (the paper looks at examples where L2 is known to exist but is missing). When you draw out a diagram like 1(c), you are modelling variables, like L2, which are unobserved. All we’re saying is that if you already have that model, then you should use Z efficiently to help impute L2, rather than ignoring Z altogether. Again, using Z to impute L2 does not mean direct conditioning on Z.

• CK says:

Keli: Yeah, that’s the paper I was referring to.
We actually can safely ignore L1, L2, and Z in Figure 1(c) if our interest is to assess whether x causes y. As Judea put it, “In model (c) the correct answer is provided by the aggregated data.” Why the need to impute L2?

• Keli says:

More precision, of course. Let’s say you have a randomized experiment as well as the pre-treatment variable gender, and gender affects the outcome. Since the experiment is completely randomized, you can get a valid estimate of the treatment effect without conditioning on gender, but you get more precision if you do condition on gender. In this case, you don’t have to condition on L2. But there is some information about L2 in Z, so if you use that information to impute L2, you get more precision for your treatment estimate.

• Keli says:

Just to clarify, I think Judea says that the “correct” answer is the aggregated data because he is thinking of choosing between the aggregate data and the data disaggregated by Z. But as I wrote in my response to Judea, these are not the only two choices; the “multi-resolution” framework in the paper helps us to see this. If L2 were observed, we would ideally like to disaggregate the data by L2. L2 is not observed, so instead we use Z to help impute L2 (obviously, to rigorously account for uncertainty in L2, you could either use a fully Bayesian model or multiple imputation of some sort).

• CK says:

Yes, the aggregate analysis is not the only solution (if you know L1 and L2), but it is more economical (i.e., there is no need to involve more covariates unnecessarily). Regression adjustments for the sets {L1}, {L2}, {L1, L2}, {Z, L2}, {Z, L1}, or {L1, L2, Z} will still give you valid results, but at the expense of reducing the precision of your regression parameters. It takes only a few seconds to figure this out with graphical models, but it can be quite involved with the potential-outcome language.

• Keli says:

CK, regression adjustment for L2 will increase (NOT reduce) the precision of your regression parameters.
That’s the whole reason why we want to adjust for L2. An aggregate analysis may be simple, but it is not fully efficient in many situations.

• CK says:

Keli: That’s news to me, can you give a numeric example?

• Keli says:

First, obviously the following is a toy example with no attempt at realism. Suppose that L2 is gender (0 or 1), X is treatment (either 0 or 1), and Y is some continuous outcome. For each individual, L1 is a uniform random variable which affects treatment assignment as follows: X = 1{L1 < 1/2}, where 1{} is the indicator function. So treatment is randomly assigned. Z = L2 + L1. Finally, Y = 0.5 + 4*X + 2*L2 + norm(0,1), where norm(0,1) is a standard normal random variable (you can use any reasonable coefficient values, as long as the coefficient on L2 is non-zero).

Use lm in R and you’ll see that the standard error for the coefficient on X is smaller if we include L2 in the regression. You would agree with this, right? If you do not include L2 in the regression, the regression estimate for the coefficient on X is still consistent, but you’ll just get bigger standard errors (less precision). So I think we can all agree that if we have L2, we would definitely use L2 in the regression.

Now consider the case where we do not observe L2 but we observe Z. The intuition remains that we should use Z to impute L2. Here I made the numbers really simple, so that if Z is less than 1, impute L2 = 0, and if Z is greater than 1, impute L2 = 1. Since we figure out L2 from Z, we can then condition on L2. Obviously in practice the inference for L2 is much harder (first off, we may not know that Z = L2 + L1, and there will be other complications); in most cases L2 is not perfectly identified in practice, but even partial information about L2 can sharpen inferences for the treatment effect.
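Keli’s example uses lm in R; here is a Python sketch of the same data-generating process (his coefficients, illustrative implementation). It shows the standard error of the X coefficient shrinking when the regression includes L2, whether L2 is observed directly or imputed from Z with his threshold rule.

```python
# Python sketch of Keli's R toy example: his data-generating process and
# coefficients, with a hand-rolled OLS to get the standard errors.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
L1 = rng.uniform(0, 1, n)            # drives treatment assignment
L2 = rng.integers(0, 2, n)           # "gender"; affects the outcome
X = (L1 < 0.5).astype(float)         # randomized treatment
Z = L1 + L2                          # observed proxy for the unobserved L2
Y = 0.5 + 4 * X + 2 * L2 + rng.normal(0, 1, n)

def coef_se(predictors, y):
    """OLS coefficient standard errors, intercept included."""
    Xd = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (len(y) - Xd.shape[1])
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))

L2_hat = (Z > 1).astype(float)       # Keli's imputation rule

se_x_alone = coef_se([X], Y)[1]            # SE on X, L2 omitted
se_x_true_l2 = coef_se([X, L2], Y)[1]      # SE on X, true L2 included
se_x_imputed = coef_se([X, L2_hat], Y)[1]  # SE on X, imputed L2 included

print(se_x_alone, se_x_true_l2, se_x_imputed)  # the first is the largest
```

In this toy setup the imputation happens to be perfect (Z > 1 exactly when L2 = 1, since L1 lies between 0 and 1), so the imputed and true-L2 regressions coincide; with a noisier proxy the imputed version would land somewhere in between.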
• Keli says:

The example above conforms to Figure 1(c), but in case anyone is unhappy with the fact that we cannot simultaneously have, say, Z = 1.6 and X = 1 (this actually just goes to show why direct conditioning on Z is a bad idea even though one can still use Z to impute L2), consider the following modified version. Everything stays the same except that Z equals L2 + L1 with probability 0.999 and Z equals L2 - L1 + 1 with probability 0.001. We can keep the same imputation rule: impute L2 = 1 if Z is greater than 1 and impute L2 = 0 if Z is less than 1.

• CK says:

Where does norm(0,1) come from? Your example will be clearer if you can clarify this.

• Keli says:

I just added norm(0,1) because people usually add an error term for a regression, but you can remove it if you want; the example still stands. So just let Y = 0.5 + 4*X + 2*L2 and everything I said still follows.

• judea pearl says:

Keli,
My comments will follow yours, marked with KL-KL and JP-JP.

KL writes: One of our arguments in the paper is that in many cases where Simpson’s paradox arises, we have more choices than just the aggregated table or the disaggregated table. The set of possible decisions is not dichotomous. The multi-resolution framework is a way to see all the choices available to us (which correspond to different “resolutions” of inference) and also to choose between them in a mathematically rigorous way. As we discuss, the choice is quite similar to the bias-variance tradeoff.

JP answers: I assume by “multi-resolution” you mean multiple ways of estimating the treatment effect, and choosing the treatment of maximum efficacy. If so, is it different from using the backdoor criterion and finding multiple sufficient sets of covariates and, if none exists, using the extended graphical criterion for treatment-effect identification? I assume that by “different resolution” you mean different subsets of covariates that would give us the correct effect, if controlled for.
If so, I doubt you can “see all the choices available to us” using the potential-outcome formalism. And my doubt is based on the way you answered CK’s question about Fig. 1(c) (in my paper). The simple answer is “do not adjust for Z,” since E(Y_x) = E(Y|X=x), while your answer invokes imputing the unobservables L1 and L2, whose dimensionality is unknown, and other complications, like encoding the structure of the scenario in the potential-outcome language, deciding if it is testable, etc., etc., which is hardly what one would expect from a formalism where one “sees all the choices available to us.”

Regarding “choose between them in a mathematically rigorous way,” I hope you agree that the potential-outcome language is not more rigorous than the graphical language; the two are proven to be logically equivalent. So your preference for the former is a matter of culture, not of rigor, or, as you say in your paper, graphical methods carry an “overhead” which is “familiarity with causal diagrams.” True, but isn’t the overhead justified? Arithmetic carries the overhead of memorizing the multiplication table; isn’t the overhead justified when we compare it to the alternative of adding a number to itself n times? A flashlight costs $0.95 (battery included), which is an “overhead” too, but doesn’t it justify the investment when we consider the alternative of dancing around the lamppost all night looking for a wallet lost in the dark? (Yes, I am talking about the flashlight of causal diagrams.)

KL writes:
1. “No data in the world can answer Simpson’s question.”
Agree. To answer the question one way or the other, we need to make assumptions that most times cannot be checked by the data.

KL writes:
2. “Any attempt to resolve the paradox without invoking causality is doomed to failure.”
Agree in spirit. Xiao-Li and I feel that the role of “causality” in Simpson’s Paradox is to inform us about what constitutes a relevant comparison–if we want to compare apples to apples rather than apples to oranges, what is our “apple”?
JP answers: I do not see it that way. For me, the role of causality is not to inform us about familiar statistical concepts such as “relevant comparisons” but to answer causal questions, such as, “Which treatment is more effective?” The whole acrobatics of “relevant comparisons,” “apples and oranges,” “exchangeability,” and more, comes from trying to fit simple causal questions into the language of traditional statistics, which is quite clumsy in dealing with causal questions. Recall, today we have multiplication tables; we do not need to add a number to itself n times. Join the 21st century.

KL writes:
3. “Those who equate causality with potential outcomes, would have truly hard time thinking through this paradox,
(though Larry Wasserman managed to do it in a simple problem)”
I’d have to disagree on this one. I feel that potential outcomes are notationally rich enough to explain the paradox.

JP answers: It is surely “rich enough,” but observe the acrobatics needed to tame this richness. Observe the number of new concepts you had to define in your paper with Prof. Meng. Or observe the acrobatics needed to formulate Andrew’s example of Sex —> Height —> Income. In the causal world you simply think about an employer who decides salaries based on two attributes, Sex and Height (plus the fact that Height is affected by Sex, not the other way around), and we are done; the story is mathematically encoded in its entirety. We need not concern ourselves with hypothetical manipulation of a person’s height, nor with a “proto-individual” that has all the characteristics of the real individual except height, nor with conditional ignorability that no mortal can assess, etc., etc. Now, once the story is formalized, all questions about causation can be answered mechanically, by a simple graphical criterion that takes 3.5 minutes to master. (This is your $0.95 flashlight, battery included.)

But science has its inertia, and the scientific world is eager to see how long it will take for the statistical community to catch up with the 21st century. It is said that what Galileo saw through the telescope in 1608 was so disturbing for some officials of the Church that they refused to look through it; they reasoned that the Devil was capable of making anything appear in the telescope, so it was best not to look through it. Same goes with our $0.95 flashlight (battery included); please try to look through it.

In 3.5 minutes you will be able to tell:
1. why people consider Simpson’s phenomenon surprising or unbelievable,
2. which scenarios would support Simpson’s reversal and which would not, and
3. which scenario places the correct answer with the combined table and which
with the Z-specific table (or, if you prefer, which scenario lends itself to
consistent estimation of the treatment effect).

This is not bad for a $0.95 flashlight (battery included),
because it is precisely what’s needed to proclaim Simpson’s paradox “resolved”.
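Pearl’s three-item promise can be checked against data. The sketch below (plain Python; the counts are the well-known kidney-stone numbers, used here purely as an illustration, not anything from the paper under discussion) reproduces the reversal: treatment A wins in each Z-stratum yet loses in the combined table.

```python
# Counts exhibiting Simpson's reversal (the well-known kidney-stone
# numbers, used purely as an illustration).
# (treatment, stratum of Z): (successes, trials)
table = {
    ("A", "small"): (81, 87),
    ("B", "small"): (234, 270),
    ("A", "large"): (192, 263),
    ("B", "large"): (55, 80),
}

def rate(successes, trials):
    return successes / trials

# Z-specific tables: A beats B inside every stratum.
for z in ("small", "large"):
    a, b = rate(*table[("A", z)]), rate(*table[("B", z)])
    print(f"stratum {z}: A={a:.2f}  B={b:.2f}")   # A is higher in both

# Combined table: summing over Z flips the comparison.
a_tot = [sum(v) for v in zip(table[("A", "small")], table[("A", "large")])]
b_tot = [sum(v) for v in zip(table[("B", "small")], table[("B", "large")])]
print(f"combined:  A={rate(*a_tot):.2f}  B={rate(*b_tot):.2f}")  # B is higher
```

Which table to act on is exactly what the graphical criterion settles: it depends on whether Z is a confounder (adjust, trust the Z-specific tables) or a mediator (do not).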

• Keli says:

Thanks for the thoughtful response, Judea. I’ll follow your format for responding.

JP writes: “I assume by “multi-resolution” you mean multiple ways of estimating the treatment effect, and choosing the treatment of maximum efficacy. If so, is it different from using the backdoor criterion and finding multiple sufficient
sets of covariates and, if none exists, using the extended graphical criterion for treatment-effect identification?”

KL writes: We assume a situation that tries to replicate what individualized medicine might look like in the future. A patient goes to the doctor. The doctor has way more measurements on characteristics of this particular patient than we might have for patients in a clinical trial. Ideally, you would find the treatment effect conditional on all the characteristics of this patient. But as we mentioned, the trouble is that many of these characteristics were not recorded (are missing) for patients in the clinical trial (on which we base our decision). For example, L2 in your figure is unobserved. The question becomes: can we obtain partial identification for some of these characteristics, can we impute them for patients in the clinical trial in any meaningful way? And is it worth our effort to do so?

As you said, the backdoor criterion helps us to answer what characteristics we would ideally like to condition on. If you want just an average treatment effect for the whole population, most times you need to condition on fewer characteristics than if you wanted an average for a smaller (more relevant for the patient) subpopulation. Suppose that the characteristics you need to condition on for the subpopulation treatment effect are only partially identified (in the clinical trial patients) while those you need to condition on for the population treatment effect are fully observed. Do you try to make assumptions and estimate the subpopulation treatment effect, or do you settle for the population treatment effect? The latter can be estimated without bias but is less relevant for our patient. We will incur bias in estimating the former, since any assumptions we make are probably wrong, but the interesting question is: how wrong do they need to be before we should give up? The multi-resolution framework is a useful way for understanding this tradeoff.

So you see, I think we are answering slightly different questions. We use the Simpson’s Paradox example because it is well known and is a canonical situation where the characteristics of interest may be only partially observed on clinical trial patients.

JP writes: “And my doubt is based on the way you answered CK’s question, which
invokes imputing the unobservables L1 and L2, whose dimensionality is unknown,
and other complications…”

JP writes: “I hope you agree that the potential outcome language is not more rigorous than the graphical language…So, your preference of the former is a matter of culture…The whole acrobatics of ‘relevant comparisons’ and ‘apples and oranges’, ‘exchangeability’, and more, comes from trying to fit simple causal questions into the language of traditional statistics.”

KL answers: I made my statement about rigor in the context of the robustness-efficiency tradeoff I just mentioned and my point was that most times we do not formalize that tradeoff. Sorry, if it seemed like I was implying that potential outcomes are more rigorous than graphical language. I think graphical language is great. We are not trying to replace graphical language, simply to say that here is another way of trying to understand what is going on.

There are multiple representations (notations) we can use to express our meaning. Each of them has advantages in certain areas; they lend themselves to different ways of thinking. I’m sure that, as mathematicians, we would agree that a proof that one is superior in general is impossible? Given this, my view has always been to become as comfortable as possible with as many representations as possible, to try to gain new insight into old problems by seeing things using a slightly different representation. It may seem like Xiao-Li and I are doing acrobatics when we try to connect to traditional concepts like “relevant comparisons”, but this representation actually helped us to realize a beautiful connection between Simpson’s Paradox and Fiducial inference. My point is that by having many of these representations in your mind, you become more capable of finding unity between seemingly disparate ideas whose underlying representations are the same.

It was never my intention to debate the merits of potential outcomes vs graphical language. As you said, most times it is a matter of culture. But there is something to be said for multi-culturalism, so I’ll end with one of my favorite quotes from Marvin Minsky:

“In the 1960s and 1970s, students frequently asked, ‘Which kind of representation is best?’ and I usually replied that we’d need more research before answering that. But now I would give a different reply: To solve really hard problems, we’ll have to use several different representations.”

• judea pearl says:

Keli,
Thanks for clarifying matters. I see indeed that you are attacking a set of problems
that are different from those usually asked in the context of Simpson’s paradox,
(e.g., to adjust or not to adjust), and that your problems seem to fall more closely into
the issues of generalizability and transportability (e.g., inferring causal effect in a population
that DIFFERS from the one under study). (You might be interested in our humble perspective of the
problem: http://ftp.cs.ucla.edu/pub/stat_ser/r404-reprint.pdf).

But I would like to comment on your last point of MULTI-CULTURALISM, and Minsky’s beautiful quote:
“To solve really hard problems, we’ll have to use several different representations.”
I see a lot of multi-culturalism in the graphical modelling literature (they call it “symbiosis”), and none from the Potential Outcome
literature; is that my perception playing tricks on me, or a real phenomenon that is hurting investigators
in the latter camp? To witness, I do not find any graph, nor a structural equation, in your paper, nor in any paper
authored in the potential outcome camp. Are investigators there better off in some way by refraining from using graphs? Or have they learned other tools to compensate for lacking the graphical perspective?
For example, how would people in the potential outcome camp represent the scenario of Fig. 1(c) ? How would they
decide if it has any testable implications, or if it can exhibit Simpson’s reversal? How would they come to the conclusion that L2 can be
adjusted for (which you have noted) or that Z need not be adjusted for? It seems to me that to answer
these sorts of questions without the advent of graphs would be horrendously complicated (albeit doable), hence
that these questions are simply AVOIDED, or left to guesswork. Am I wrong?

• Keli says:

Thanks for that reference, Judea. It is definitely highly relevant for our work. As issues of external validity gain greater prominence, we will need to go beyond modelling the observed data. We’ll have to try and capture the overall structure of a population. How to do so most effectively–being able to make a meaningful statement with minimal assumptions on the population structure–is of course a topic of research, but as your article shows, graphs are definitely a good candidate.

Onto your point that there isn’t enough “symbiosis” going on in the Potential Outcome literature. Obviously, I can only speak from personal experience and this mainly comprises my undergrad experience. The first graphical model paper I read was a discussion by Steffen Lauritzen. It was actually Don who told me to read it. From what I remember, Don really liked the discussion because Lauritzen was painstaking in laying out all the assumptions behind the graph that he drew. I think what Don wanted me to learn was that it is really easy to draw a graph (everyone can draw circles and connect them) but many people do not stop and fully appreciate the information they’ve encoded into their graph. This is human nature in some sense: we abuse power (and graphs are powerful) unless something stops us. Let me state the point another way. A graph encodes a set of conditional independence assumptions. The visual nature of the encoding is incredibly efficient, especially compared to enumerating the list of conditional independence assumptions. However, the inefficiency of enumeration can itself be a bonus, because it makes us stop and think about whether we really want to make a certain assumption. In some sense, this is the same as the statistical tradeoff between efficiency and robustness.

From my own learning experience (others will probably be different), I’ve found that it really helped to start my education with just potential outcomes. As mentioned, this meant that I would have to write out every conditional independence assumption I made. As a beginning stat student, the explicitness of this act helped me to really see where I was getting all my information. As I’ve gotten more comfortable with the meaning of conditional independence, I find that sometimes when I’m discussing a problem with a friend, I’m too lazy to write out all the statements, so I just draw the DAG. But I fully understand that DAG now because at some time in the past, I had to write everything out. I’m sure someone could have had the reverse learning experience, starting from DAGs, and it would’ve been equally effective.

Finally, I just want to say that we in fact had a causal graph seminar in the department last year and a friend who took that class is actually doing some really cool work with them. As for why more people aren’t publishing stuff involving graphs, I can only talk about myself. My personal goal is that I should have a good enough intuitive grasp of what I’m doing to explain it to my mother. My mother does not understand graphs or conditional independence, and she is not willing to learn about either. She does understand the idea of fair comparisons though, so that’s why we adopted the apples to oranges theme in the paper. For me, it’s mainly a point about communication.

• jimmy says:

hi keli,

could you provide a reference for the steffen lauritzen paper you mention here?

• Keli says:

The article is here: http://www.stats.ox.ac.uk/~steffen/papers/isi03sll.pdf

Don wanted me to pay particular attention to the comparison of the two graphs in the section on Principal Surrogates. In the left graph, there is simply a variable U, while the right graph uses the mapping variables sigma and eta. As Lauritzen points out, the mapping variables completely characterize the action of U. I think (I may be wrong about this) that Don wanted me to realize that the graph on the left is in some sense an abbreviated notation which hides the more primitive structures: the mapping variables. To truly understand the graph, you need to understand it on the level of the mapping variables, which many people do not. As Lauritzen also points out, if you try to define surrogacy using the left graph, you end up with strong surrogacy, but if you look “deeper”, i.e. at the mapping variables rather than U, you can weaken this definition. You may prefer one definition or the other; that will depend on the particular application. But you should understand the graph well enough to know how to change the definition if you wanted to. So now when I look at graphs, I’m always careful to ask myself whether I’m really understanding the structure and implied assumptions at its most primitive level, and I have Don to thank for this valuable lesson.

• K? O'Rourke says:

Keli:
To put it in Peircean terms of first, second and third aspects of inquiry – intuitively put as might, must and shoulds:

a specific DAG _might_ be the resolution to obtain the best causal assessment; this _must_ accord with both the implications of that DAG (all the assumptions involved, math) and mesh well with brute-force reality (be adequately assessed as reflecting reality by assessing its fit and robustness to other DAGs, empirical); and _should_ we be convinced to continue thinking it is the least wrong resolution we can get.

So I think you are pointing to a possible lack of attention to all the assumptions involved, and I to assessing fit and robustness. (I think there likely are good examples of these not being neglected when using DAGs; I am just interested in looking at them – perhaps some by Ian Shrier?)

But there is also that important economy of research question – the question of which approach to use for what for whom. The one your co-author Meng put as, if it involved a treatment that you or a loved one needed (and you had to hire someone else to do the analysis) which approach would you ask them to follow. To have them explain what is involved to others, I would have them draw the DAG. To have them convince themselves to continue thinking it is the least wrong resolution they can get?

• judea Pearl says:

Keli,
This seems to be one of the last postings in the fading
discussion on Simpson’s paradox but, since you
have given us a frank and informative description of the status
of “multi-culturalism” and ” symbiosis” in the potential
outcome camp, I owe you an equally frank and informative answer
from my perspective.
It will be a bit lengthy, because I think many readers of this
blog are interested in understanding this puzzling phenomenon.

You wrote:
“onto your point that there isn’t enough “symbiosis”
going on in the Potential Outcome literature”
My point was stronger than “not enough”: I was talking about
a total absence, namely, about the community
ritual of total abstinence from useful tools of modern
causal analysis.

From your bringing up Minsky’s quote on the virtues of
multi-culturalism I infer that your own research style is different,
and that you are genuinely interested in comparing and
understanding the merits of the various approaches
to causal inference. I will therefore go slowly over
your reasons for mistrusting graphs. I find these
reasons to be extremely valuable because, through them,
I am learning what the sources of miscommunication have been.

You wrote:
“it is really easy to draw a graph (everyone can draw circles
and connect them) but many people do not stop and fully
appreciate the information they have encoded into their graph.”

JP replies:
True, there have been many misinterpretations of graphs, as
there have been misinterpretations of calculus, quantum
mechanics, statistics, as well as most mathematical tools
that science has developed.
But the question is: do you, Keli, understand and fully
appreciate the information that a graph conveys, and what
a graph can do for you?
Your post shows that you do, at least partially.

Keli writes:
“This is human nature in some sense: we abuse power (and graphs
are powerful) unless something stops us.
Let me state the point another way. A graph encodes a set of
conditional independence assumptions. The visual nature of
the encoding is incredibly efficient, especially compared to
enumerating the list of conditional independence assumptions.
However, the inefficiency of enumeration can itself be a bonus,
because it makes us stop and think about whether we really
want to make a certain assumption.”

First, a tiny correction. A graph encodes more than
a set of conditional independence assumptions. It encodes
in fact ALL causal, counterfactual and probabilistic assumptions that
you find in the potential outcome literature, including
for example, ignorability, conditional ignorability, and
their variants. Moreover, assumptions of this sort
are not encoded explicitly by the researcher, but are
derivable, upon demand, from the structure of the graph.
The graph itself is constructed so as to mirror what
researchers believe about how processes operate in the world
(e.g., that height does not cause gender and that income
does not cause height);
no conditional independence nor conditional ignorability need
enter the mind of the graph builder during the construction
phase.
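Pearl’s “derivable, upon demand” point can be made concrete. The hedged sketch below implements d-separation via a Bayes-ball-style reachability traversal (my choice of method; nothing here comes from the paper under discussion) and queries a DAG for its implied conditional independences, including in Andrew’s Sex/Height/Income story:

```python
from collections import deque

def d_separated(dag, xs, ys, zs):
    """True iff xs and ys are d-separated given zs in `dag`
    (a dict mapping every node to its list of children).
    Bayes-ball-style reachability over active trails."""
    parents = {v: set() for v in dag}
    for u, children in dag.items():
        for c in children:
            parents[c].add(u)
    # Ancestors of the conditioning set (these are what open colliders).
    anc, frontier = set(), list(zs)
    while frontier:
        v = frontier.pop()
        if v not in anc:
            anc.add(v)
            frontier.extend(parents[v])
    # States are (direction of arrival, node): "up" = from a child (or start).
    reachable, visited = set(), set()
    queue = deque(("up", x) for x in xs)
    while queue:
        d, v = queue.popleft()
        if (d, v) in visited:
            continue
        visited.add((d, v))
        if v not in zs:
            reachable.add(v)
        if d == "up" and v not in zs:
            queue.extend(("up", p) for p in parents[v])      # chain / common cause
            queue.extend(("down", c) for c in dag[v])
        elif d == "down":
            if v not in zs:
                queue.extend(("down", c) for c in dag[v])    # chain continues
            if v in anc:
                queue.extend(("up", p) for p in parents[v])  # collider opened
    return not (reachable & set(ys))

# Andrew's story: Sex -> Height, Sex -> Income, Height -> Income.
g = {"Sex": ["Height", "Income"], "Height": ["Income"], "Income": []}
print(d_separated(g, {"Sex"}, {"Income"}, {"Height"}))  # False: direct edge remains

# A collider: conditioning on the common effect *creates* dependence.
g2 = {"Talent": ["Admit"], "Luck": ["Admit"], "Admit": []}
print(d_separated(g2, {"Talent"}, {"Luck"}, set()))      # True
print(d_separated(g2, {"Talent"}, {"Luck"}, {"Admit"}))  # False
```

The graph builder never writes a conditional independence statement; every such statement (and its ignorability counterparts) can be read off mechanically afterwards.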

Second, your preference for making the process of
writing assumptions as tortuous as possible may have
some merits; after all, the more time we spend on writing
down an assumption, the more we think about it, and the
more we think about it, the better it would reflect what
we truly believe about the world. However, this torture-seeking
argument assumes that the increase in writing time is spent
on thinking about the truth of the assumption, as opposed to
laboring on (1) understanding what we write, (2) translating
the assumption from its mathematical syntax to the way knowledge
is stored in our mind, (3) deciding if the assumption
we contemplate writing does not follow from those we
wrote already, (4) deciding if the assumption we contemplate
writing does not conflict with those we wrote already, (5) deciding
if the set of assumptions we wrote down constitutes the sum
total of what we need to assume (i.e., that we have not forgotten
an important assumption), (6) deciding whether the assumptions we
made have any testable implications, and more…

In my personal journey through causal reasoning, I have collected
ample evidence showing that, in the case of potential outcomes,
the increased writing time is wasted on
(1)-(6), rather than on thinking whether a given assumption is
plausible. The veracity of an assumption is proportional
to the transparency of the language in which it is represented,
not to the time required to write it down, and the
non-transparency of potential outcome assumptions is NOT
a matter of taste or habit, but an established fact.

There are many ways to communicate this evidence.
First, we can verify to ourselves that no mortal can
make conditional ignorability assumptions with less than
50% errors (unaided by graphs). Thus, given that such
assumptions are required in almost every potential outcome
task, it is clear that the extra hardship of
writing assumptions in potential outcome language
does not contribute to their veracity.

Second, we can run an experiment on a simple example
where we know what the correct set of assumptions ought to be.
We can then compare two sets of assumptions, one articulated
by potential outcome researchers and one articulated by
graph-trained researchers, both IN THE LANGUAGE OF POTENTIAL
OUTCOME, and judge which set is more error prone (relative to the
correct set).

I was thinking of suggesting Andrew’s example of
height-sex-income for this purpose, but then I recalled
that I have done the comparison already, in one of my lectures.
http://idse.columbia.edu/seminarvideo_judeapearl
about 1:15 hours into the lecture, starting with a slide
titled: “Formulating Assumptions: Three Languages”.
The example is about smoking and cancer, but can easily be
translated to Andrew’s sex-height-income.
How many mortals do you know who can examine the
set of potential outcome assumptions written on that slide and
determine whether it is complete, i.e., that I have not dropped
one on purpose? (I know about 1,000, but they are all
graph-trained.)

I would encourage you to go through this exercise because, from
the way you write, I can tell that you have the curiosity and
open-mindedness to look beyond the potential outcome curtain.
And, of course, I encourage all readers of this blog to do the same,
compare the two methods on an example for which we know the
answer — the rest will follow.

Let us move back now to Simpson’s paradox.
Assume that the data exhibit Simpson’s reversal in
the Smoking-Tar-Cancer story (or in Andrew’s story of
Sex-Height-Income) and we ask people: “What action is
correct, the one based on the combined table, or the
one based on the Z-specific tables?” Which group of
researchers do you think would be more likely to give
the correct answer?
This is not to say that methods built on “apples and
oranges” (also called “ceteris paribus”, i.e., “all else
being equal”, or “exchangeability”, or “fair comparison”)
are wrong. No. The reason graph-trained
researchers would do better is simply that all “apples
and oranges” considerations are encoded in the graph
and, regardless of how complex the scenario,
all these thousands of “apple and orange” considerations
are summarized by the back-door criterion.
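The back-door criterion’s output is the adjustment formula P(y | do(x)) = Σ_z P(y | x, z) P(z). A hedged sketch with made-up numbers (a single binary confounder Z; nothing here is taken from the paper) shows the formula in action, and shows how it can disagree in sign with the naive conditional, which is Simpson’s reversal in distributional form:

```python
# Binary Z confounds binary X and Y.  Model pieces (illustrative numbers):
p_z = {0: 0.5, 1: 0.5}                          # P(z)
p_x1_given_z = {0: 0.8, 1: 0.2}                 # P(X=1 | z)
p_y1_given_xz = {(0, 0): 0.9, (1, 0): 0.8,      # P(Y=1 | x, z)
                 (0, 1): 0.6, (1, 1): 0.5}

def p_y1_given_x(x):
    """Naive conditional P(Y=1 | x): weights strata by P(z | x)."""
    num = den = 0.0
    for z in (0, 1):
        px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
        num += p_y1_given_xz[(x, z)] * px * p_z[z]
        den += px * p_z[z]
    return num / den

def p_y1_do_x(x):
    """Back-door adjustment: sum_z P(Y=1 | x, z) P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in (0, 1))

print(p_y1_given_x(1) - p_y1_given_x(0))  # naive contrast: positive
print(p_y1_do_x(1) - p_y1_do_x(0))        # adjusted contrast: negative
```

The only judgmental input is the graph (Z causes both X and Y); the weighting by P(z) rather than P(z | x) is then automatic.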

Miracle, isn’t it?
(I knew you would appreciate it).
And I would like to convey this miracle to your
mother, too, about whom you wrote:
“My personal goal is that I should have a good enough intuitive
grasp of what I’m doing to explain it to my mother. My mother
does not understand graphs or conditional independence, and she
is not willing to learn about either. She does understand the
idea of fair comparisons though, so that’s
why we adopted the apples to oranges theme in the paper.
For me, it’s mainly a point about communication.”

First thing you have to assure your mother is that,
if she is opting to use graphs, she will never have
to deal with conditional independence judgments. She will have
only to express her understanding of the Sex-Height-Income
story in terms of “what causes what” relations.
Second, her entire intuition about “fair comparisons”
(i.e., apples and oranges) will be honored in every
step of the process. However, given that judgments
about fair comparisons are sometimes not straightforward,
especially when dealing with more than three variables
(see Fig 1(c)), it is only reasonable that we
delegate these judgments to a reliable mechanical procedure,
not unlike that used for solving algebraic equations.
(Recall, even the 3-variable Sex-Height-Income
example generated disagreements on this blog.)

I am sure your mother would rejoice at the idea that
all her thoughts about “fair comparisons”, no matter
how many variables in the story, can be faithfully represented and
properly combined by a pocket calculator called a graph.

It is a miracle, and very very different from the
picture normally painted in communities that religiously
abstain from graphs because they are “too seductive”
(Rubin, 1995).

• Andrew says:

Judea, Keli:

Let me just pick up on the topic of external validity and going beyond the observed data. Whatever approach to causal inference you are using, I think that hierarchical models are a useful tool for inferring about a population that differs from the one under the study. In the usual normal models (which seem reasonable to me as a starting point), the key parameter is the between-population variance, along with the variances of any interactions. (See our books for lots of examples.)

I think there is or should be some methodological orthogonality here, in the sense that hierarchical modeling can be relevant, somewhat independent of the approach used to model causality.

In this case, we’re talking about the “1-schools problem” or the “2-schools problem” rather than 8 schools, and so the prior distribution (or, more generally, the model) for these key variance parameters will be very important in the analysis. But that’s ok, that’s just the way it is: to the extent we want to generalize out of sample (or, more generally, out of the population being sampled from), we need to make some assumptions.

• Judea pearl says:

Andrew, Keli,
My computer is down; I will be back soon.
On the meaning of mapping variables, see the Causality book, page 264.
On external validity, what are the “variance parameters”? And what is the
story in the “2 schools problem”, i.e., what is given and what needs to be
estimated?

• Andrew says:

Judea:

You’ll have to read our books! But the quick story is that the number of “schools” in the “N schools problem” is the number of groups or scenarios in a multilevel model, and the variance parameter is the variance of the distribution of the parameters of interest as they vary among the groups. If you have data from one scenario or location and want to learn about another, then you have the “1 schools problem.” Of course for practical purposes there will have to be some similarity between your scenarios, or else you won’t be able to do much generalization. This similarity can sometimes be captured by a regression model using group-level predictors. Again, both of my books (really, three of them, if you count Red State Blue State) have lots of examples and discussion of this sort of reasoning. But we don’t do any one-school or two-school examples; we probably should. Jennifer and I are dividing our book into two, and I plan to add a discussion of the 1-schools and 2-schools problem to the new multilevel modeling book.
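Andrew’s N-schools description boils down, in the simplest normal-normal case, to a shrinkage computation: each group estimate is pulled toward the grand mean, with the between-group variance controlling how far. A minimal sketch, with tau treated as known purely for illustration (in a real hierarchical model it gets a prior and is estimated); the numbers echo the first of the classic eight schools but are otherwise arbitrary:

```python
def partial_pool(y, sigma, mu, tau):
    """Shrinkage estimate for one group: precision-weighted average of the
    group's raw estimate y (sampling sd sigma) and the grand mean mu,
    where tau is the between-group sd."""
    if tau == 0:                    # complete pooling: groups assumed identical
        return mu
    w = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)
    return w * y + (1 - w) * mu     # w -> 1 as tau grows: no pooling

y_j, sigma_j, mu = 28.0, 15.0, 8.0  # one "school": estimate, se, grand mean
for tau in (0.0, 5.0, 50.0):
    print(f"tau={tau:5.1f}  estimate={partial_pool(y_j, sigma_j, mu, tau):.1f}")
# small tau -> estimate near mu; large tau -> estimate near the raw y_j
```

This makes Andrew’s point concrete: the between-group variance parameter is what interpolates between “all groups the same” and “every group on its own”, which is why the prior on it matters so much in the 1-schools and 2-schools problems.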

Anyway, in my comment just above, I was simply making the point that questions of partial pooling (or transportability or external validity or extrapolation or whatever) arise in causal and non-causal contexts, and I think that statistical solutions to these problems can be mostly separated from questions of how to model causality.

• Keli says:

Andrew, your point gets at a question that I’ve really wanted to ask and, with both you and Judea here, perhaps I’ll finally hear the answer!

So I don’t believe in parameters (at least not what they usually refer to). There are real-world estimands, which means that if I gave you the whole of a population, the estimand is a function of the complete data. Parameters are useful modelling tools that help us to get at these real-world estimands. But in terms of problem setup, I’d like to define my real-world estimand first, and then ask: how do I set up my hierarchical model to best get at this estimand through the parameters in the model? (Don’t worry, I’ll get to the question soon…)

As in the paper a real world estimand might be the health outcome for a particular patient under a certain treatment. After determining my real world estimand, I start making some modelling assumptions. At first, I want to make only “structural assumptions” (I know this is a vague term, but Larry Wasserman had a nice paper describing these that I’ll try to link to when I get the chance). These structural assumptions have the flavor of a set of conditional independence assumptions as might be captured by a DAG. There’s no functional form or distribution assumptions yet. With only structural assumptions, you have a qualitative model of the world. This is still a non-parametric model.

Of course, most times, we do not have enough data to estimate the non-parametric model we’ve just setup. As Xiao-Li would say, I’m interested in Cubans over 60, but what do I do if I only have data on Cubans under 60 and people from Haiti over 60? This is of course a question of external validity–extrapolating outside of our data range. As you suggested, Andrew, a hierarchical model is a powerful tool for addressing this. Of course, any hierarchical model we put down will be misspecified. Without further constraining the question, it would probably be too hard to try and assess the effect of the misspecification in a meaningful way.

So, let me constrain the question further. Suppose that our structural assumptions are correct, i.e. the DAG we’ve drawn correctly captures the world qualitatively. Given this supposition, what is the “variance parameter” in the misspecified hierarchical model in terms of the characteristics of our non-parametric structural model?

For example, we understand what model misspecification does to the MLE. Can we pin down in a meaningful way what model misspecification does to the “variance parameter”? I realize that my question isn’t well formulated yet but hopefully you understand what I’m trying to ask, so feel free to modify it.

• Keli says:

Let me add one little clarification to this question. Andrew, I think you would say that in the real world, in most situations, there is an arrow going from every node to every other node in a DAG. Feel free to make your set of structural assumptions as rich or as sparse as you want. My only request is that you not include any “parameters” as actual variables at the stage of structural modelling (though you are of course free to model parameters as variables in your hierarchical model after you’ve specified a structural model). By this I mean that variables in my hierarchical model should only be things I can directly measure e.g. height. They are “real” in some sense.

• CK says:

Andrew:
Thanks for the correction. One more thing is still unclear. Let me quote a section in your book on Objections to exchangeable models (chapter 5, page 124, BDA 2nd edition).

“….the rat tumor experiments were performed at different times, on different rats, and presumably in different laboratories. Such information does not however invalidate exchangeability. That experiments differ implies that thetas differ, but it might be perfectly acceptable to consider them as if drawn from a common distribution..”

Can you clarify what you meant by a common distribution. I incorrectly assumed that you meant a common population.

• Andrew says:

Ck:

The “common population” represents the population of parameter values that the schools are drawn from in a hypothetical sampling model. It is just like in classical regression, where the regression line corresponds to the expected value of y given x under a hypothetical sampling distribution that would either represent resampling of the errors from the error model or resampling the observations from a hypothetical superpopulation.

I was reacting to your remark, “the transportability approach extends to situations where populations are dissimilar hence individual data can’t be simply averaged. On the other hand the hierarchical approach assumes that all the data are sampled from the same population (isn’t this why we are pooling?)” My point was that hierarchical modeling (and, for that matter, regression) indeed “extends to situations where populations are dissimilar hence individual data can’t be simply averaged.” Partial pooling is not simple averaging.

• CK says:

From what I know, the transportability approach extends to situations where populations are dissimilar, hence individual data can’t be simply averaged. On the other hand the hierarchical approach assumes that all the data are sampled from the same population (isn’t this why we are pooling?), and the population is assumed to be either homogeneous or heterogeneous; the latter case includes random effects.

I stand to be corrected but I get a sense that the assumptions behind the two approaches are entirely different.

Andrew: Is there an equivalent to transportability in your hierarchical approach? (i.e., no same-population assumption) I have read your BDA book but was unable to find an answer to this question. Which chapter should I read? I have your 2nd edition.

Judea: The transportability approach is great but I think it is really hard to apply it in real life. The information needed is usually unavailable. Any suggestion?

• Andrew says:

Ck:

You write, “the hierarchical approach assumes that all the data are sampled from the same population (isn’t this why we are pooling?).” This is not correct. The hierarchical model allows for differences between the populations. That is why it is only partial pooling, not complete pooling. We discuss this in chapters 5 and 8 of BDA3 (or chapters 5 and 7 of BDA2). There was much confusion about this in the old-time statistical literature, with, for example, anti-Bayesians such as Kempthorne thinking that “exchangeability” was a strong assumption of similarity between groups, not realizing that the groups could be very different, in which case the group-level variance is large.

• judea pearl says:

CK
I don’t know about hierarchical models (still laboring to understand them) but I can say something about
transportability. The information needed is very simple: experimental results from one
population and observational data from another (potentially dissimilar) population. Nothing else is needed.
The theory merely tells you what assumptions are needed to do what you seek to do, namely to estimate
causal effects in the non-experimental population.

The research problem is crisply stated, no ambiguities, no ifs no buts. I wish I could find such crisp description in
the Hierarchical modeling literature.

Think about it not as a method of data analysis but as a warning sign, prior to analysis, saying: beware, beware, if you cannot vouch for a certain set of assumptions, no approach in the world (be it hierarchical or spiral or super-spiral modeling) can give you the estimates you need. Nothing. Conversely, once you are willing to vouch for a certain set of assumptions, you need not think any more; the algorithm will tell you what you need to measure (and estimate) in population-1, what you need to measure (and estimate) in population-2, and how to combine all those estimates properly to get what you want.

Whether these capabilities are available in hierarchical models I do not know (still waiting to be educated). But the fact that I have not heard hierarchical modelers say “Sorry, this task is undoable” leads me to believe that the two approaches are neither identical nor orthogonal. So what are they? Perhaps Andrew knows.

• Andrew says:

Judea:

I don’t know what spiral or super-spiral modeling is, so I can’t help you there. But your comment illustrates the (conceptual) orthogonality of hierarchical modeling and causal frameworks. You write that your theory “merely tells you what assumptions are needed to do what you seek to do, namely to estimate causal effects in the non-experimental population.” In contrast, hierarchical models are based on clear mathematical assumptions too (the exact distributions are in chapter 5 of our book, it’s completely rigorous and standard stuff; the basic model was worked out by Tiao and others in 1965), but we don’t say anything about causal effects. It’s a model that could work in causal or non-causal situations. But for causal inference, additional assumptions are required, which is fine.

That’s why I say the two approaches are orthogonal: graphical modeling is a way to express the independence assumptions required for causal inference; hierarchical modeling is a way to partially pool information from different sources but without reference to any causal interpretation. To do a good job of generalizing causal inferences, one needs a mathematical framework for causal inference and a mathematical framework for partial pooling. The two approaches are not competing; they can go together.

It’s a similar story in sample surveys, where multilevel regression and poststratification (MRP) uses design-based information in a modeling context. To perform hard problems of generalization from sample to population, it’s good to use design information and modeling information together.

• judea pearl says:

Keli,
I will comment on your message above, related to parameters in structural equations. My comments are in *****
Keli writes
There are real world estimands, which means that if I gave you the whole of a population, the estimand is a function of the complete data.
*****Your description of “estimand” does not match your example below (a causal effect): we cannot estimate causal effects even if we have an infinite sample from the whole population; we need a model. So, what you call a “real world estimand” is really a property of the world (not of the complete data) and can be computed from the structural model of the world.

Keli writes:
I’d like to define my real world estimand first, and then ask, how do I set up my hierarchical model to best get at this estimand through the parameters in the model. (Don’t worry I’ll get to the question soon…)
****** You have hit the nail on the head; this is exactly what I am missing from the hierarchical modeling literature: the “real world estimand” or, in my language, the QUERY, i.e., the purpose of the study, the thing you want estimated, or the thing that would make you happy if you could estimate it. Many researchers can tell us chapter and verse of HOW they estimate things, but very few tell us WHAT they want estimated (and I am talking as a reviewer who labors hours upon hours on such submitted papers). The language of counterfactuals is rich enough for people to describe their queries.
Keli writes:
As in the paper a real world estimand might be the health outcome for a particular patient under a certain treatment.
After determining my real world estimand, I start making some modelling assumptions.
At first, I want to make only “structural assumptions” (I know this is a vague term, but Larry Wasserman had a nice paper describing these
*******These are not vague at all, because they can be expressed in the language of science, read: counterfactuals.
Keli writes:
that I’ll try to link to when I get the chance). These structural assumptions have the flavor of a set of conditional independence assumptions as might be captured by a DAG. There’s no functional form or distribution assumptions yet. With only structural assumptions, you have a qualitative model of the world. This is still a non-parametric model.
*******Correct, except that there are other assumptions too: the missing links, also called “exclusion assumptions”.
Keli writes:
Of course, most times, we do not have enough data to estimate the non-parametric model we’ve just setup.
****** First, data will not help us a bit. Second, we do not want to estimate the model; we need to estimate the QUERY of interest, your “real world estimand,” or “the health outcome for a particular patient under a certain treatment.” Let’s not forget the QUERY, as most researchers do.
Keli writes:
As Xiao-Li would say, I’m interested in Cubans over 60, but what do I do if I only have data on Cubans under 60 and people from Haiti over 60? This is of course a question of external validity–extrapolating outside of our data range. As you suggested, Andrew, a hierarchical model is a powerful tool for addressing this.
****** Is this what hierarchical modeling does? Wow, I finally understand it: extrapolation outside our data range. This is something transportability does not handle, because we are trying to protect ourselves against the possibility that all Cubans under 60 are allergic to the treatment, and to impose smoothness assumptions amounts to meddling with parameters. The only extrapolation we allow is across regimes (e.g., from experimental to nonexperimental), but not from one range of a variable to another. The structural assumptions tell us nothing about smoothness; therefore they cannot support such extrapolation.

Keli writes:
So, let me constrain the question further. Suppose that our structural assumptions are correct, i.e. the DAG we’ve drawn correctly captures the world qualitatively. Given this supposition, what is the “variance parameter” in the misspecified hierarchical model in terms of the characteristics of our non-parametric structural model?
******** I believe a “variance parameter” cannot be defined in a non-parametric structural model, because non-parametric models do not impose “smoothness” properties, and you need “smoothness.”

• judea pearl says:

Andrew,
I think we are converging on some understanding of the two extrapolation schemes. But let me re-confirm:
1. I said: “Transportability theory merely tells you what assumptions are needed to do what you seek to do, namely to estimate causal effects in the non-experimental population.”
2. You wrote: “In contrast, hierarchical models…” and you lost me by not continuing the contrast. May I complete it as: “In contrast, hierarchical modeling tells you what assumptions are needed to estimate statistical parameters from data in which the values of some variables (or combinations thereof) are missing”?

• Andrew says:

Judea:

You write, “Is this what hierarchical modeling does? Wow, I finally understand it: Extrapolation outside our data range.” Sorry, but that’s just wrong. Jennifer and I wrote a whole book about hierarchical modeling; if you don’t want to read that, there’s a book by Kreft and De Leeuw, a book by Bryk and Raudenbush, and a short article I published a few years ago, “Multilevel modeling: what it can and can’t do” (“multilevel modeling” is another term for “hierarchical modeling”). There’s really a lot of stuff out there.

Until you get around to reading about the topic, I suggest you think of hierarchical modeling as being a lot like linear regression. It can be used for interpolation or extrapolation, and in its standard form it requires a set of mathematical assumptions that specify a probability model for the data.

You also write that hierarchical models need “smoothness.” This is not correct. We can do hierarchical models with discrete predictors, indeed we do it all the time. See, for example, this paper with Yair recently published in the American Journal of Political Science.

You also write, “hierarchical modeling tells you what assumptions are needed to estimate statistical parameters from data in which the values of some variables (or combinations thereof) are missing.” Nope, that’s wrong too! I again recommend you read up on the topic if you are interested.

• judea pearl says:

Andrew,
I am not trying to re-define hierarchical modeling (HM) nor to belittle its capabilities or universality. I am honestly trying to understand it, albeit in my way, that is, in terms of what problems it solves rather than how it solves them. This is my limitation, I admit, but I am a slow reader, and an even slower understander. So, I am asking for a single sentence, if such exists, to complete the contrast that you have started:

1. I said: “Transportability theory merely tells you what assumptions are needed to do what you seek to do, namely to estimate causal effects in the non-experimental population.”
2. And you wrote: “In contrast, hierarchical models…” Please continue; I will take whatever you think I should know. I have no investment nor ego in HM, except curiosity. Can we complete this contrast??

In the meantime, I will follow your advice and will “think of hierarchical modeling as being a lot like linear regression. It can be used for interpolation or extrapolation, and in its standard form it requires a set of mathematical assumptions that specify a probability model for the data. ” Got it. I know something about regression, and I will try to make use of the little I know.

I know, for example, that regression has nothing to do with causation, truly nothing, so may I assume the same holds for HM? Some people ask me: “Why don’t you use HM for causal analysis?” Can I answer them: because HM has nothing to do with causation?

Again, I have no ax to grind, just thirst for learning.

• Keli says:

Judea, thanks for clarifying my question and pointing out my ambiguities.

JP: “So, what you call ‘real world estimand’ is really a property of the world (not of the complete data) and can be computed from the structural model of the world.”

What I mean by “complete” data is the set of all the potential outcomes for some population. I see how my use of the word complete is ambiguous, thanks for pointing that out. So my question really is, is the “variance parameter” some meaningful function of the set of all potential outcomes?

Now, onto the fascinating discussion about smoothness:

JP: “We are trying to protect ourselves against the possibility that all Cubans under 60 are alergic to the treatment and to impose smoothness assumptions amounts to meddle with parameters…The structural assumptions tells us nothing about smoothness, therefore they cannot.”

Andrew says that hierarchical modelling isn’t about smoothness. I think what he means is that it isn’t about mathematical smoothness, which requires, for example, a metric space. There is a statistical generalization of the mathematical notion of smoothness, and this is what’s captured by hierarchical models. Let me argue for this, then I’ll return to your point.

Suppose we have a function f(x) over x \in R^p. We observe a set of points f(x_0)…f(x_n) from this function and we want to ask, “How smooth is f?” A statistical way to rephrase this question might be, “How well am I able to predict f(x*) for an unobserved x* after I see f(x_0)…f(x_n)?” This of course depends on how far x* is from x_0,…,x_n and on the mathematical smoothness of f. What a spatial statistician would do is assume that f comes from a stochastic process (e.g. a Gaussian process) and estimate the “auto-correlation” parameters of that process. The “auto-correlation” tells us how much we can expect f to change when we move x around. In the language of hierarchical models, it tells us to what extent we can “borrow information” about f(x*) from f(x_0)…f(x_n).

In the above example, we still assumed a metric space, R^p. What happens when we remove this assumption, e.g., when we instead have a set of discrete indicators with no notion of distance? Note that it is not discreteness itself that is important but rather the notion of distance. We can have a discrete time series (so time is restricted to a grid), but the notion of auto-correlation still exists, because we can define distance between time points. Once you remove the notion of distance, “auto-correlation” becomes correlation. In particular, the correlation between “random effects” in a hierarchical model is a statistical measure of smoothness.

When we use hierarchical models, I think we assert that we can get a decent estimate of statistical smoothness. For example, if the real-world process is not smooth, we should estimate a correlation parameter of 0 (a variance parameter of infinity). I use the word “decent” because, for practical applications, notions like quantitative consistency are not as useful here as the notion of qualitative consistency (Xiao-Li coined this distinction). Even if we get the smoothness slightly wrong, it may still be correct enough to help us for a particular application. So now, going back to your point, Judea: when can we hope to estimate the smoothness in a qualitatively consistent way? Is there non-parametric identification? (Probably not unless we make more assumptions, but what are those assumptions?)
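[Editor’s note: Keli’s idea that the variance parameter measures “statistical smoothness” can be sketched numerically. The crude moment estimator below is a stand-in for a full hierarchical fit; the two simulated “worlds” and all constants are my own construction.]

```python
import numpy as np

rng = np.random.default_rng(1)

def borrow_weight(y, sigma):
    """Moment estimate of the between-group variance tau^2, and the
    implied weight on the grand mean (1 = full borrowing, 0 = none)."""
    ybar = y.mean(axis=1)
    se2 = sigma**2 / y.shape[1]
    tau2_hat = max(ybar.var(ddof=1) - se2, 0.0)  # crude moment estimator
    return se2 / (se2 + tau2_hat)

J, n, sigma = 20, 10, 1.0

# "Smooth" world: group means nearly identical, so borrowing is safe.
y_smooth = rng.normal(0.0, 0.1, size=J)[:, None] + rng.normal(0.0, sigma, (J, n))
# "Rough" world: group means wildly different, so borrowing should shut off.
y_rough = rng.normal(0.0, 10.0, size=J)[:, None] + rng.normal(0.0, sigma, (J, n))

w_smooth = borrow_weight(y_smooth, sigma)
w_rough = borrow_weight(y_rough, sigma)
print(w_smooth, w_rough)  # w_smooth large, w_rough near 0
```

A non-smooth world yields a large estimated variance parameter and hence almost no pooling, matching Keli’s claim that the variance parameter is the hierarchical model’s estimate of how much information can be borrowed.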

Returning to the earlier point, I would still like to link the “variance parameter” (our estimate of statistical smoothness defined through a hierarchical model, which may be misspecified) to a function of the complete data (the set of all potential outcomes for a world). This would really help me to understand. Is this doable?

• judea pearl says:

Keli,
You write: “So my question really is, is the ‘variance parameter’ some meaningful function of the set of all potential outcomes?”

I happen to know what “the set of all potential outcomes” is: it is simply the “structural causal model,” because from such a model one can obtain the set of all potential outcomes. But I wish I knew what a “variance parameter” is, and I wish it were “some meaningful function of the set of all potential outcomes.” Why? Because if it were, then one would be able to define it as a counterfactual expression, and even a novice like me would be able to understand what it is. Andrew defines it verbally as “the variance of the distribution of the parameters of interest as they vary among the groups.” Suppose my parameter of interest is the counterfactual Y_x. Taking Andrew’s description literally, I get Var_i(Y_x), i = 1, 2, …, where i is the group index. Now, where is the “variance parameter”? And why is it of such importance?

On smoothness.
In nonparametric analysis we normally assume that all functions are arbitrarily rough, and we ask: is the parameter of interest identifiable? (I call the “parameter of interest” the QUERY. Why? Because our interest may not be in a parameter but in an answer to a question that might require lengthy computations, e.g., “Is there a winning move for White?” or “Would Joe be alive?” The word “parameter” connotes parametric models, and we are dealing with nonparametric models; all functions y = f(x,z,eps) are arbitrary.)

Assuming that I can use the word QUERY: in nonparametric analysis we first ask, is the QUERY Q identifiable? If it is, then we get an estimand ES(Q), which is a functional of the observed distribution P. [For example, Q may stand for P(Y_1 = 0) and, under no confounding, Q is identifiable and ES(Q) turns out to be ES(Q) = P(Y = 0 | X = 1), which is a functional of P(X,Y).] Now, to estimate ES(Q) from finite samples, we resort to parametric smoothing, as is done in propensity score analysis. If, however, Q is nonidentifiable (nonparametrically), we ask whether smoothing assumptions on the functions y = f(x,z,eps) could render Q identifiable. Typical smoothing assumptions are monotonicity, linearity, or pseudo-linearity.
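[Editor’s note: Pearl’s bracketed example can be checked by simulation. The structural model below is a hypothetical one of my own choosing; the key assumption in his example, no confounding, is enforced by randomizing X.]

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# A structural model with NO confounding: the treatment X is randomized,
# hence independent of the background factor eps.
eps = rng.uniform(size=n)
x = rng.integers(0, 2, size=n)       # randomized treatment
y1 = (eps < 0.3).astype(int)         # potential outcome Y_1 = f(1, eps)
y0 = (eps < 0.6).astype(int)         # potential outcome Y_0 = f(0, eps)
y = np.where(x == 1, y1, y0)         # observed outcome (consistency)

query = np.mean(y1 == 0)             # Q = P(Y_1 = 0): needs the model
estimand = np.mean(y[x == 1] == 0)   # ES(Q) = P(Y = 0 | X = 1): observable
print(query, estimand)               # agree up to sampling error
```

The counterfactual query, computable only from the full set of potential outcomes, matches the observable functional P(Y = 0 | X = 1), illustrating what identification buys: a quantity defined on the structural model becomes estimable from P(X,Y) alone.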

Speaking more concretely: if we do not have any data on Cubans older than 60, we distinguish between the case where the missing data is an accident of the sampling process, in which case the parametric estimation phase would extrapolate from Cubans below 60 to Cubans above 60, and the case where we have reasons to believe the missing data is systematic, e.g., Cubans over 60 may be arbitrarily different from other Cubans but were prevented from entering the hospital. If our Q is about Cubans above 60, such a Q is proclaimed non-admissible. To be admissible, a query Q must be computable from the model when the number of samples goes to infinity.

• Andrew says:

Judea:

You write, “I am not trying to re-define hierarchical modeling (HM) nor to belittle its capabilities or universality. I am honestly trying to understand it, albeit in my way, that is, in terms of what problems it solves, rather than how it does it. This is my limitation, I admit, but I am a slow reader, and even a slower understander.”

I recommend you read this paper, which I wrote precisely to explain to people what multilevel (hierarchical) modeling cannot do. The paper is only 3 pages long!

• judea Pearl says:

Andrew,
I will tell you frankly where I got stuck on page 1 of your 3-page paper on multilevel modeling.

“…our goal in analyzing these data was to estimate the distribution of radon level in each of the 3,000 US counties, so that home owners could make decisions about measuring or remediating the radon in their houses based on the best available knowledge of local conditions. For the purpose of this analysis, the data were structured hierarchically, houses within counties.”

I was hoping to hear what is known to a home owner before the decision. I presume he/she knows what county he/she resides in. So the target of analysis is E(R|C), where R is the radon level and C is the county name. Alternatively, the target could be the entire distribution P(R|C), but this is not where I got stuck.

What remained unclear is whether the home owner knows whether his house was chosen for measurement (or how far his house is from one that was measured), in which case the target of analysis would be E(R|C,H), not E(R|C).

But the main obstacle in my reading was the sentence: “…the data were structured hierarchically, houses within counties.” I am not used to mixing in information about how the problem was solved before the problem description is finished. At this point of the introduction, readers with my cognitive deficiency expect to find what data is available to the analyst, and to the home owner, not how it was organized for analysis. In other words, do we have data on radon levels in randomly chosen houses within each county? Is the target of analysis E(R|C,H,d(H,H′)), where d(H,H′) is the distance between the house owned (H′) and the closest house (H) where a measurement was taken?

The next paragraph begins to tell us what information is available:

“In performing the analysis, we had an important predictor: whether the measurement was taken in a basement. We also had an important county-level predictor: a measurement of soil uranium that was available at the county level.”

But, alas, the problem description stops abruptly and we are back again to the specific method of analysis:

“We fit a model of the form:

y_ij ~ N(alpha_j + beta * x_ij, sigma_y^2)
alpha_j ~ N(gamma_0 + gamma_1 * u_j, sigma_alpha^2)

One paragraph later you continue saying:

Equivalently, the model can be written as a single-level regression with correlated errors:

y ~ N(gamma_0 1 + gamma_1 G u + beta x, sigma_y^2 I + sigma_alpha^2 G G^T),

where G is the n × J matrix of county indicators.

Here I am really stuck. First, I am used to calling such equations (with correlated errors) “structural.” Second, it is not clear whether the single-level regression option hides valuable information that is revealed in the hierarchical option, or is truly equivalent to it.

This sort of question could have been avoided if we were to complete the problem description before going on to the method of analysis: what is given, what we are trying to estimate, and what is missing from ordinary regression, which is the crowned workhorse for doing “best predictions.”
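[Editor’s note: the equivalence Pearl questions can be probed numerically. The sketch below is my own construction, with the regression terms gamma and beta set to zero so that only the covariance structure is visible; it draws repeatedly from the hierarchical form and compares the empirical covariance with the single-level form.]

```python
import numpy as np

rng = np.random.default_rng(3)

sigma_y, sigma_alpha = 1.0, 2.0
# Tiny setup: 4 houses in J = 2 counties; G is the matrix of county indicators.
G = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

# Draw from the hierarchical form many times...
S = 100_000
alpha = rng.normal(0.0, sigma_alpha, size=(S, 2))        # county effects
y = alpha @ G.T + rng.normal(0.0, sigma_y, size=(S, 4))  # house measurements

# ...and compare the empirical covariance with the single-level form.
emp = np.cov(y, rowvar=False)
target = sigma_y**2 * np.eye(4) + sigma_alpha**2 * G @ G.T
print(np.abs(emp - target).max())  # small: the two formulations agree
```

The two formulations describe the same marginal distribution of y, so the single-level option hides no information about y; what the hierarchical option adds is that it exposes the county effects alpha_j explicitly.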

• Rahul says:

Andrew:

An aside, but is the first sub-graph of Fig. 1 from your “Hierarchical modeling” paper (data for Lac Qui Parle county) actually showing a regression line fit to two points?

• Rahul says:

Again re. the Hierarchical Modelling Paper:

I found this approach intuitively unsatisfying: aren’t we throwing away all the location data? Counties are artificial boundaries; which is more relevant to my radon risk: the one other measurement from my county that is 40 miles away, or one from an adjacent county but only a mile away?

Wouldn’t some sort of geospatial model that takes distances from neighboring readings into account be more useful?

Also, say you have only readings but no GPS coordinates (say it’s sanitized data). Even then, can’t one take a county-adjacency matrix into account? I’d intuitively think that the parameters of the county closest to my house might have a big effect.

Just curious whether you’ve tried such models? Is the dataset posted anywhere to play with?

• Andrew says:

Judea:

You asked for something short, I gave you something short. Now you complain that I don’t give enough details. That’s what happens in a 3-page paper! If you want more on the radon example, you can read this much longer paper that has much more context.

Regarding your question, “What remained unclear is whether the home owner knows whether his house was chosen for measurement (or how far his house is from one that was measured),” the answer is that measurement in this particular dataset was done by random sampling. The decision problem is primarily for the millions of houses that were not in the sample. And, we also did a separate analysis to add precision for people who had neighbors’ measurements, but that’s not discussed in these papers. This last analysis helps a little but is irrelevant to the questions you originally asked regarding hierarchical modeling.

Beyond this, let me emphasize that I do not present or view multilevel modeling as a general framework to solve all statistical problems, or all causal inference problems. Hence the title of the paper, Multilevel (Hierarchical) Modeling: What It Can and Cannot Do. Nor of course do I believe that our radon analysis was perfect, or do I think it would be possible for me to explore all the subtleties in a 3 page paper or even a 30 page paper. Multilevel modeling is a statistical tool that allows us to do predictive inference in a way that makes sense to me (and many others); for the resulting estimates to be interpreted as causal inferences, more assumptions are needed, that’s where identification comes in, and that’s the area you’re more interested in.

• Andrew says:

Rahul:

1. Yes, the line plotted for Lac Qui Parle is fit to the 2 data points in that county. As I recall, the “data alone” line comes from a varying-intercept, constant-slope model. So the intercept is the best fit for those 2 points in that county, but the slope is estimated from all the counties together. We could’ve done it in other ways; the point was just that there’s a lot of noise with only 2 points.

2. Yes, of course we fit spatial models of counties and also of measurements within counties. It turns out that, once you include county-level uranium levels and county-level geologic predictors, there’s not much information in the counties’ geographic location. Alternatively, you could say that by including such predictors, we were including geographic information. At the house level, we did not have locations at anything more precise than the county. But we did do a separate analysis of another dataset with exact house locations in order to estimate the spatial autocorrelation function and allow us to give recommendations for people who had measurements from neighbors’ houses.

• Rahul says:

@Andrew:

Thanks. Can you post the data-set online?

Naive question: So is the multi-level approach always predictively superior to using a simple three-term least squares regression on (1) basement-or-not, (2) county, and (3) soil-uranium-level?

• Andrew says:

Rahul:

1. The data are posted at the website for my book with Jennifer, just click on Data and software, and go from there.

2. Speaking generally, there are cases where multilevel regression can make worse predictions—it’s based on a mathematical model, so if the model is wrong you can get bad answers. For example, if the underlying phenomenon is bimodal, you’ll want to be partially pooling toward local modes, not toward the center of the distribution. Pooling toward local modes can be done using a mixture model, as we did in this 1990 paper. Or using appropriate predictors, as we did in this follow-up effort from 1994.

• Rahul says:

@judea:

I don’t think it is merely scientific inertia. I’m no trained expert, but I’ve tried reading through the Causal Diagrams framework a couple of times. And it always seems tantalizing & beautiful: Here’s finally an answer to all these conundrums! But enlightenment never comes. I mean it is all conceptually neat & awesome but I’ve never really managed to apply it in a fruitful practical way.

Perhaps, that only reflects my stupidity. But perhaps there’s a deeper reason as to why I don’t see so much enthusiastic, widespread adoption by the applied world. Clearly, it isn’t so simple as “3.5 minutes to master” makes it sound. Nor did it answer all my questions for me.

• judea pearl says:

Rahul,
You are the first person I found who actually tried to use the flashlight and
did not find his wallet. I truly appreciate your input, really. Please tell me
which question you were NOT able to answer with the graphical approach.
This may jolt me to either fine-tune the method or improve the way I present it.
Please, help me with one such question.

• Rahul says:

@judea

In hindsight, I really think maybe I ought to put more effort into understanding this.

All I can say is I’ve put in my figurative 3.5 minutes worth and come out quite confused & frustrated. It really doesn’t seem a “simple” concept. My current state of mind is best summed up by one of your old comments:

“You are probably wondering whether this derivation solves the smoking-cancer debate. The answer is NO. Even if we could get the data on tar deposits, the model above is quite simplistic, as it is based on certain assumptions which both parties might not agree to. For instance, that there is no direct link between smoking and lung cancer, unmediated by tar deposits. The model would need to be refined then, and we might end up with a graph containing 20 variables or more.”

I see a very elegant framework, but the empirical data needed to resolve most of its connections etc. is so sparse that I don’t see the practical progress. What’s a good applied paper that you think has fruitfully used your ideas? I’d love to read that. Maybe that’ll help.

• judea pearl says:

Rahul,
Good applied papers using DAGs appear quite frequently in the epidemiology and health science literature, and to a somewhat lesser degree in social science and psychology.
Here is one such paper, brought to my attention yesterday by Emil Coman:
http://dx.doi.org/10.1016/j.jclinepi.2012.01.002

Your concerns about sparse data are justified, but what is the alternative? Acting like the world is totally chaotic? Ignoring what we know about a problem? Using potential outcomes? There is a theorem that says that, asymptotically, no one can do better. Is that a solution? Theoretically, yes; practically, no. But knowing what needs to be done if data were abundant offers some guidance, because alternative methodologies do not even get this right.

• K? O'Rourke says:

Judea – with all due respect – http://dx.doi.org/10.1016/j.jclinepi.2012.01.002 – really?!

There are only fictional examples, and it’s in an application area where there are limited sets of covariates and usually only ecological data (group summaries).

And the primary challenge is missing direct comparisons or how to combine direct and indirect comparisons, and an apparent blindness to (in effect) an aggressive single imputation of the _missing_ group’s data.

This seems more Keli’s focus “can we impute them for patients [a missing treatment group] in the clinical trial in any meaningful way?” and “about rigor in the context of the robustness-efficiency tradeoff I just mentioned and my point was that most times we do not formalize that tradeoff” especially the most times we do not formalize that trade off.

• CK says:

http://www.ncbi.nlm.nih.gov/pubmed/16931543

• K? O'Rourke says:

CK, with just a cursory read:

“control for potential confounders (maternal age, gravidity, education, marital status, race/ethnicity, and prenatal care). Since adjustment for these potential confounders had virtually no effect, we present only the unadjusted rate ratios below.”

So what I believe Rahul was asking for is skipped over in the paper and indicated as not being challenging this time.

Not complaining about the judgmental choice of just those (maternal age, gravidity, education, marital status, race/ethnicity, and prenatal care), but rather the omission of credible arguments for that choice and of some empirical _testing_ of it, either within the data set or from another data set. (Might just be the editor’s fault!)

• CK says:

Rahul:
The birth paradox comes from incorrect adjustment for a mediator when the interest is in estimating the total effect.

You are right, if the causal knowledge is lacking, DAGs will not help you. But what if you have this knowledge and your analytic approach conflicts with this knowledge (e.g. not adjusting for a potential confounder)? Don’t you agree that you would need to revise your analytic strategies?

You asked, “If you told this paradox to a 1950 epidemiologist with no clue about DAGs what do you think he would do?” He/she wouldn’t understand you, the same way Karl Pearson wouldn’t understand if you asked him to run a simulation study in R. Both would need to be trained to use the proper tools for their specific problems.

You also asked “is it correct to focus on identification & treat testing as trivial?”. I believe they are equally important, and I think it is a mistake to ignore the former.

• Rahul says:

@CK

This discussion is helpful to crystallize my thoughts. I guess what I’d really love to see is a paper that distinguishes between multiple fairly non-trivial DAGs (all plausible a priori) on the basis of actual data; say, something having 6-8 nodes at least.

I think you underestimate epidemiologists if you think that the Birth Paradox would leave them entirely stumped merely because DAGs weren’t around. I think that, if presented with a non-intuitive observation like “smoking improves mortality in LBW babies,” starting to look for a parallel, deadlier cause of LBW & mortality is a natural response that trained researchers would have without even knowing about DAGs. You think not?

• CK says:

Rahul:
If I understood your point, you are suggesting that for each pair of nodes connected by an arrow you should gather all the data needed to assess a causal link. This means that an additional DAG would have to be appended to the original DAG at each node pair to identify other sources of bias. What would happen is that your graph would end up expanding indefinitely. You know that we can’t do this; it is practically impossible. However, there is a practical solution:

A feasible way of doing this is to incorporate in your graph what is already known. For example, most people would agree that smoking causes lung cancer, so instead of proving or refuting this hypothesis in your analysis (which is not your primary interest), you can simply learn from other studies (or prior experience) and incorporate that information in your graph. If somehow in the future we refute the smoking-lung cancer hypothesis, the graph will have to be revised, and that’s how science grows.

I never said that the Birth Paradox would leave the 1950s epidemiologists entirely stumped (in addressing other important problems); my response was focused solely on the paradox. If the tool had existed in the ’50s, we wouldn’t have been puzzled by the paradox for all these years.

The aim of sending you the link was to show you that a good number of epidemiologists (including myself) use DAGs in their work, and we think that they are quite useful.

• CK says:

Keith:
With the aid of a DAG (used as a tool to encode causal knowledge), the authors were able to identify the proper covariates for adjustment. The “birth paradox” (which is a variant of Simpson’s paradox) stemmed from improper adjustment. You can try the potential-outcome approach on the same problem, but you will see how challenging that would be.

What I have described is the identification phase. In the testing phase you can analyse your dataset and perhaps exclude some of the variables that you think have a negligible impact on your estimates of interest (but you have to be mindful that you could be introducing new sources of bias). This is what the authors did.

The question of testing is trivial (e.g., providing proof from other datasets (did you mean studies?) that maternal age is a potential confounder), which is possibly why the authors didn’t go into such details.

Perhaps you and Rahul can share with us alternative ways of addressing this paradox.

• Rahul says:

Interesting paper. What I got from it is that LBW comes from either smoking or some other high-mortality cause (or causes), say U. Fine, but until you identify what this U actually is (say, malnutrition) & further prove that the mortality from U alone is higher than that from smoking, you haven’t really resolved the paradox. Have you?

Do I read this correctly? Or is it correct to focus on identification & treat testing as trivial?

Further, wouldn’t an epidemiologist search for such a U (some LBW cause that’s worse than smoking for mortality) in a pre-DAG world also? Or not? If you told this paradox to a 1950 epidemiologist with no clue about DAGs, what do you think he would do?

Maybe I’m missing something crucial?

8. Michael says:

There is nothing paradoxical without the “vaguely causal” interpretation of estimates, and the paradoxes are resolved by ceasing to be vague. That is the value of the manipulation and potential outcome metaphors.

You feel (vaguely) that sex causes height and both jointly cause income. That’s why it is wrong to control on height to estimate the causal effect of sex, but right to control on sex for the causal effect of height. Because if you were able to slip into the womb and change someone’s sex before birth, the height would change too, and the impact on income should not be purged of its height-mediated component. But height might be changed without altering the determination procedure; and the causal impact of height is naturally thought of as the impact of that isolated change.

If your interest is purely non-causal, there really seems to be no question here. All the regressions you might consider give you perfectly good partial-correlation information. Which one is best depends entirely on what questions you ask. We seem able to discuss the issue without specifying the question only because we have in mind that some questions are usually interesting. And those are causal questions. If I want to know how much it will cost me to get the civil-rights lawyers off my back by hiring at least one woman for a job where everyone needs to be six foot two, I want the results from a regression with both variables — plus location, age, favorite color…. For prediction, more is better except for the degrees-of-freedom question, yes? But for causal interpretation, no.

• Andrew says:

Michael:

As I wrote above, one can set up Simpson’s paradox with variables that cannot be manipulated, or for which manipulations are not directly of interest. Perhaps to you there is nothing paradoxical about these examples, but they do seem to confuse many people.

• Zach says:

I think there’s a difference between something that deserves to be called a paradox and something that ‘confuses many people’. Let’s say there are two groups of people. Group 1: People for whom Simpson’s paradox is simply the confusing mathematical fact that regression coefficients can change depending on the other covariates included in the regression. Group 2: People for whom Simpson’s Paradox is the difficulty of causally interpreting regression coefficients from a particular regression given their sensitivity to the other covariates included. For people in Group 1, Simpson’s paradox can certainly be cleared up through a careful explanation of how regressions describe comparisons between people (or units). For people in Group 2, you need to appeal to causal language (Pearl’s or Keli’s).

• Andrew says:

Fair enough. I’m in group 1.

• judea pearl says:

Andrew,
I don’t think you really meant it when you classified yourself in Group 1 (i.e., people for whom Simpson’s paradox is simply the confusing mathematical fact that regression coefficients can change depending on the other covariates included in the regression).

As I wrote to Zach above, how would you explain why an innocent reversal of association came to be regarded as “paradoxical,” and why it has captured the fascination of statisticians, mathematicians, and philosophers for over a century? Some of these are solid statisticians, like Karl Pearson and Dennis Lindley; they cannot be dismissed lightly as “confused by not understanding facts about regression.” These scholars knew regression very well and still came to the conclusion that counterfactuals or potential outcomes (or graphs, or exchangeability, or “structural parameters,” or “proto-individuals”) are necessary for understanding Simpson’s paradox.

The age of regression-first is over. And there is a 95-cent flashlight available to the curious, battery included.

• Andrew says:

Judea:

1. I was not dismissing Pearson and Lindley. When I say that the mathematics of regression coefficients is confusing, perhaps it would be better to use the word “counterintuitive.” When something is counterintuitive, it can be a good investment of time to develop scaffolding that will help make it become intuitive. For a simpler example, consider the famous probability problem of the three two-sided cards. (One is black on both sides, one is white on both sides, one is black on one side and white on the other. You pick a card and a side at random and it is black. What is the probability that the other side is black? The answer is of course 2/3, but this is counterintuitive to a lot of very smart people.) The mathematics of counting up all the probabilities (some like to use a table here, I prefer the tree) makes it very clear. Anyway, to me, the changing of the regression coefficients can be counterintuitive (hence motivating my work on the Red State Blue State problem), and I continue to seek out methods such as the BK plot to help clarify this purely mathematical issue, no causality involved at all.
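[The card problem is small enough to check by exhaustive enumeration rather than intuition; here is a minimal sketch in Python, with the card labels my own invention:]

```python
from fractions import Fraction

# The three two-sided cards: black/black, white/white, black/white.
cards = [("B", "B"), ("W", "W"), ("B", "W")]

# Enumerate the six equally likely (card, side) outcomes.
black_up = other_black = 0
for card in cards:
    for side in (0, 1):
        if card[side] == "B":          # the face we see is black
            black_up += 1
            other_black += card[1 - side] == "B"

# P(other side black | this side black): 2 of the 3 black faces
# belong to the black/black card.
print(Fraction(other_black, black_up))  # prints 2/3
```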

2. That said, I think you’re right that causality is a key part of what makes Simpson’s paradox interesting. So I was being too glib when I implied otherwise. Instead let me return to my original statement, in the context of my example, that “I can interpret those regressions without having to think about manipulation of height or of sex—to me, these are between-person comparisons, not requiring within-person manipulations.” Yes, I’m thinking causally, but to me the relevant treatments are not manipulations of height or sex. I think of this more as a reverse causal reasoning setting, where the existence of an interesting pattern suggests that there are important effects out there that can be framed as potential outcomes, or do-operators, or whatever you want to call them.

• K? O'Rourke says:

> “the existence of an interesting pattern suggests that there are important effects out there that can” _not be_ framed as _manipulations_ even if that is just a purposeful representation?

• Andrew says:

Keith:

What I was trying to say was that Judea and Keli and XL (and Don and Jennifer) are right that in my example I ultimately do care about potential outcomes, or manipulations, or do-operators. But in this case the variables that I imagine being manipulated are not sex or height. I think the place where many people go wrong is in the implicit assumption that the variables given in the problem description are the only variables we should care about. One point of my paper with Guido is that a reverse causal question raises the interesting possibility that there are relevant causal variables that have not been included in the problem as stated.

• Keli says:

Thanks for making the distinction, Zach.

In the paper, we make clear that the reason we need the causal language is that we want to make the optimal treatment decision for a patient. In the context of treatment choice, we need causal interpretations. Otherwise, as you point out, Simpson’s Paradox is a simple algebraic fact.

• CK says:

Zach:
Suppose you are interested in examining whether x affects y, and you have a third variable z. Can data alone tell you whether z is a common cause of x and y (and so should be adjusted for) or a common effect (and so shouldn’t be adjusted for)? How can a careful examination of regressions address this phenomenon without prior causal knowledge?
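[CK’s point can be checked numerically. In the sketch below, all relationships are made up for illustration: x and y are truly independent, and z is constructed as their common effect (a collider). The data cannot warn you of this, but adjusting for z manufactures an x-y association out of nothing:]

```python
import random

random.seed(0)
n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]             # independent of x
z = [xi + yi + random.gauss(0, 1) for xi, yi in zip(x, y)]  # collider

def cov(a, b):
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

sxx, sxz, szz = cov(x, x), cov(x, z), cov(z, z)
cxy, czy = cov(x, y), cov(z, y)

# Coefficient of x when regressing y on x alone:
b_unadj = cxy / sxx

# Coefficient of x when regressing y on x and z (2x2 normal equations):
det = sxx * szz - sxz * sxz
b_adj = (szz * cxy - sxz * czy) / det

print(f"{b_unadj:+.2f}")  # close to 0: no association without adjustment
print(f"{b_adj:+.2f}")    # close to -0.5: adjusting for z creates one
```

Run the same code with z as a common cause instead (generate z first, then x and y from it) and the lesson flips: now the unadjusted coefficient is the misleading one. The regressions themselves look the same either way; only the causal story tells you which to trust.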

• judea pearl says:

Zach,
I don’t think many people would be “confused” by the mathematical fact that regression coefficients can change depending on other covariates; why would they? Conditioning on covariates simply means that we are examining a subpopulation, so why would it be confusing or surprising that associations change in going from one subpopulation to another?

I think your Group 1 is close to empty. Even Andrew, who classifies himself in Group 1 (see his comment below), would have difficulty explaining why an innocent reversal of association came to be regarded as “paradoxical,” and why it has captured the fascination of statisticians, mathematicians, and philosophers for over a century. Some of these are solid statisticians, like Karl Pearson and Dennis Lindley; they cannot be dismissed lightly as “confused by not understanding facts about regression.”

Furthermore, you say that “For people in Group 1, Simpson’s paradox can certainly be cleared up through a careful explanation of how regressions describe comparisons between people (or units).” No way. As CK points out (below), we can examine regression coefficients till dawn and will not be able to tell whether treatment should be administered (as recommended by the combined table) or not administered (as recommended by the sex-specific table).

So, given that Group 1 is empty, Group 2 (most people) need to appeal to causal language. Why the reluctance? Lindley had a reason to be reluctant — the language did not exist in his time, but we live in the 21st century, and you can buy a 95-cent flashlight (battery included) to walk you out of the darkness of regression.
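[The combined-table-versus-subgroup-table reversal described here can be seen in the oft-cited kidney-stone data (Charig et al. 1986), a standard textbook example rather than data from this thread; the arithmetic is a one-liner per cell:]

```python
# Success rates (successes / patients) for two treatments,
# stratified by kidney-stone size (Charig et al. 1986 figures).
a_small, a_large = 81 / 87, 192 / 263      # treatment A by stratum
b_small, b_large = 234 / 270, 55 / 80      # treatment B by stratum
a_all = (81 + 192) / (87 + 263)            # A, strata combined
b_all = (234 + 55) / (270 + 80)            # B, strata combined

print(a_small > b_small)  # True:  A wins among small stones
print(a_large > b_large)  # True:  A wins among large stones
print(a_all > b_all)      # False: yet the combined table prefers B
```

The reversal arises because treatment A was given mostly to the harder (large-stone) cases; nothing in the two tables themselves says which one to act on. That verdict requires the causal story about how patients were assigned.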

• Keli says:

Michael,

Your second paragraph is precisely what Xiao-Li and I try to capture using the term “proto-individual.” If we are interested in the effect of height, the relevant proto-individual does have the attribute sex, so we should condition on it. If we are interested in the effect of sex, the relevant proto-individual does not have the attribute height yet because, as you mentioned, intuitively sex helps to determine height, so we should not condition on it.

9. Abhimanyu Arora says:

Wrt the nice example provided by Andrew in the original post, I just had one question for a better understanding of the matter: Suppose I include an interaction term between height and gender; can I then say that the difficulty of interpretation (as alluded to) is resolved?
Thanks

• CK says:

The answer to your question is no. Adding an interaction term will tell you (statistically) whether there are subgroup differences, but you’ll need to go a few steps further to show that the effect within subgroups differs from that in the aggregate population.

• Abhimanyu Arora says:

Thanks, CK, for clarifying. Coming back to the example:
Up to the part on the interpretation of gender, it all seems fine. But I do not see the problem with comparing a short man and a tall woman. Here’s my point: in fact it makes even greater sense (given the assumption that men are taller on average, or that the distribution of their heights is shifted to the right). You cannot compare a ‘short’ man and a ‘short’ woman for exactly the same reason: the difference in income would then be explained in some part by the difference in heights. The same goes for tall men and tall women, and even more so for short women and tall men (which is further out of the question, so to speak). So basically the effect of height is eliminated only when short men and tall women are compared (and this helps in interpreting the coefficient on gender). Doesn’t this recall the similar underlying concepts behind the popular matching models?

The idea behind Simpson’s paradox is that the population effect might differ from the subgroup-wise effect(s). This is in some sense related to his point 2 and to the fact that Andrew acknowledges that height should not be controlled for alone (what we call omitted-variable bias). But I could not relate this to the controversy over the coefficient on gender (unless the sole purpose of that is to show that thinking about potential outcomes is not necessary).
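[Andrew’s point 2, a coefficient shifting when a covariate is added without any sign flip, can be simulated with numbers patterned on his income example; everything below (the 5-inch height gap, the $10,000 and $500-per-inch coefficients, the noise scales) is invented for illustration:]

```python
import random

random.seed(42)
n = 50_000
male   = [random.random() < 0.5 for _ in range(n)]
height = [64 + 5 * m + random.gauss(0, 3) for m in male]  # men ~5 in taller
income = [10_000 * m + 500 * h + random.gauss(0, 5_000)
          for m, h in zip(male, height)]

def slope(y, x):
    """OLS slope of y on a single predictor x."""
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Height coefficient pooling men and women: inflated well above 500,
# because sex confounds the height-income comparison.
raw = slope(income, height)

# Height coefficient among men only: close to the "true" 500 per inch.
within = slope([y for y, m in zip(income, male) if m],
               [h for h, m in zip(height, male) if m])

print(round(raw), round(within))
```

No sign reversal occurs here, which is exactly the point: the paradox is a boundary case of the general fact that adding a predictor changes the comparison being made.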

• Abhimanyu Arora says:

Small addendum: in the part where I refer to matching models, I mean those for causal inference. Thanks

10. David P says:

Germane to some of the earlier discussion, although not to Simpson’s paradox itself, this is from the Washington Post a few days ago. If you’re looking for a manipulation of sex, here it is:

‘In 2011, the NCAA, after consultation with scientific experts and bodies like the National Center for Lesbian Rights, determined that male-to-female transgender athletes should sit out a year while undergoing testosterone-suppression treatment before competing on women’s teams. That guideline fits well with the experiences of transgender athletes such as Joanna Harper, a 57-year-old medical physicist and 2012 U.S. national cross-country champion for the 55-to-59 age group.

‘Harper was born male but started hormone therapy in August 2004 to suppress her body’s testosterone and physically transition to female. Like any good scientist, she recorded data, and she found herself getting slower by the end of the first month. “I felt the same when I ran,” she says. “I just couldn’t go as fast.”

‘Harper’s time in the Helvetia Half Marathon in Portland, Ore., was about 50 seconds per mile slower in 2005 than it was in 2003, just before the transition. But age- and sex-graded performance standards indicate that Harper is precisely as competitive now as a female as she was as a male. And data she has for a half-dozen other athletes with similar histories follow the same pattern.’

So we have marathon time = f(age,sex) + g(other factors). The indication is that Harper switched from f(age,male) to f(age,female) but g(other factors) was unaffected, at least as an ordering.

11. K? O'Rourke says:

Judea (and anyone else who can still wind their way through this fascinating post):

Folks seem very comfortable in regression with assuming every individual has exactly the same slope parameter (and, with interactions, everyone with the same value of the interacting variable), so that all the individual slope estimates are fully pooled (I used to bore people by turning regression into meta-analyses of many one-sample studies). But the idea that the parameters are not exactly the same, but instead are drawn from a distribution whose own parameters are exactly the same, causes lots of angst.

(I did try to beat this point to death in my thesis: “When there are common-in-distribution parameters, unobserved random parameters differ by study but are related by being drawn from the same ‘common’ distribution.” http://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf )

In terms of transportability, a simple example would be a number of studies with an (assumed) common treatment effect (all exactly the same) and different control rates. The treatment effect is fully pooled, and one would _transport_ to a new population perhaps using a control-rate estimate just for that population. (Judea has better, more formal formulations of this.) But what if you know that the treatment effect also varies haphazardly from study to study? Do you give up and say every study is an island on its own, or do you purposely treat it as random and assume the treatment effects are drawn from a common distribution? Note that most folks are much more comfortable getting over every observation in a regression problem being an island on its own, by assuming slopes and variances are common.
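[One standard, concrete version of “assume the treatment effects are drawn from a common distribution” is random-effects meta-analysis. A minimal DerSimonian-Laird sketch, with invented study estimates and standard errors:]

```python
# Hypothetical study-level treatment effects and standard errors.
effects = [0.30, 0.10, 0.45, 0.20, 0.60]
ses     = [0.10, 0.12, 0.15, 0.08, 0.20]

# Fixed-effect pooling: inverse-variance weights, full pooling.
w = [1 / s**2 for s in ses]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# DerSimonian-Laird estimate of the between-study variance tau^2.
Q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
df = len(effects) - 1
C = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (Q - df) / C)

# Random-effects pooling: widening each weight by tau^2 partially pools,
# treating the study effects as draws from a common distribution.
w_re = [1 / (s**2 + tau2) for s in ses]
random_eff = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)

print(round(fixed, 3), round(tau2, 4), round(random_eff, 3))
```

With tau^2 = 0 the two estimates coincide (every study shares one effect, full pooling); as tau^2 grows, the random-effects estimate moves toward the unweighted study mean (every study an island). The in-between cases are exactly the haphazard variation described above.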

• CK says:

Keith:
That last paragraph of yours is similar to the question I had in mind. I think Judea answered this very well: use the transportability formula as a warning tool. With no sufficient information available, we can proceed with all kinds of analytic techniques we know, but we should be mindful of the implications. In other words, we shouldn’t overstate the capabilities of our methods.

12. K? O'Rourke says:

CK

Yes, but my comment was not about providing a warning about the abyss – just clarifying what that abyss seems to be (using regression concepts).

Extrapolation or transportation on the basis of exactly common (subcomponents of) parameters is, I believe, just overly hopeful, though very convenient.

13. […] on Simpson’s paradox broke out again last month on Andrew Gelman’s blog (95 comments), http://statmodeling.stat.columbia.edu/2014/02/09/keli-liu-xiao-li-meng-simpsons-paradox/ triggered by four papers on the subject published in The American Statistician (February, […]