In a link to our back-and-forth on causal inference and the use of hierarchical models to bridge between different inferential settings, Elias Bareinboim (a computer scientist who is working with Judea Pearl) writes:

In the past week, I have been engaged in a discussion with Andrew Gelman and his blog readers regarding causal inference, selection bias, confounding, and generalizability. I was trying to understand how his method which he calls “hierarchical modeling” would handle these issues and what guarantees it provides. . . . If anyone understands how “hierarchical modeling” can solve a simple toy problem (e.g., M-bias, control of confounding, mediation, generalizability), please share with us.

In his post, Bareinboim raises a direct question about hierarchical modeling and also indirectly brings up larger questions about what is convincing evidence when evaluating a statistical method. As I wrote earlier, Bareinboim believes that “The only way investigators can decide whether ‘hierarchical modeling is the way to go’ is for someone to demonstrate the method on a toy example,” whereas I am more convinced by live applications. Other people are convinced by theorems, while there is yet another set of researchers who are most convinced by performance on benchmark problems.

I will address the larger question of evidence in a later post. For now, let me answer Bareinboim’s immediate question about hierarchical modeling and inference.

First off, let me emphasize that “hierarchical modeling” (equivalently, “multilevel modeling”) is a standard term in statistics. It’s not something I invented or even named!

Second, Bareinboim writes:

Unfortunately, I [Bareinboim] could not reach an understanding of Gelman’s method (probably because no examples were provided).

I did not supply examples in that blog post but many many examples of hierarchical models appear in three of my four books and in many of my research articles. In lots of these descriptions, I spend some time discussing issues of generalizing from sample to population. Here’s the basic theoretical idea (not new to me, it’s just coming from a bunch of papers from about 1970 to 1980 by Lindley, Novick, Smith, Dempster, Rubin, and others):

Suppose you are applying a procedure to J cases and want to predict case J+1 (in the problem under discussion, the cases are buildings and J=52). Let the parameters be theta_1,…,theta_{J+1}, with data y_1,…,y_{J+1}, and case-level predictors X_1,…,X_{J+1}. The question is how to generalize from (theta_1,…,theta_J) to theta_{J+1}. This can be framed in a hierarchical model in which the J cases in your training set are a sample from population 1 and your new case is drawn from population 2. Now you need to model how much the thetas can vary from one population to another, but this should be possible. They’re all buildings, after all. And, as with hierarchical models in general, the more information you have in the observed X’s, the less variation you would hope to have in the thetas.
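One way to write this setup down, as a sketch only: the normal distributions, the linear form X_j beta, and the symbols beta, tau, sigma_j, delta, and sigma_delta are assumptions introduced here for illustration and are not taken from the post.

```latex
\begin{aligned}
y_j \mid \theta_j &\sim \mathrm{N}(\theta_j, \sigma_j^2), & j &= 1, \dots, J+1, \\
\theta_j &\sim \mathrm{N}(X_j \beta, \tau^2), & j &= 1, \dots, J \quad \text{(population 1)}, \\
\theta_{J+1} &\sim \mathrm{N}(X_{J+1} \beta + \delta, \tau^2) & & \text{(population 2)}, \\
\delta &\sim \mathrm{N}(0, \sigma_\delta^2).
\end{aligned}
```

Under these assumptions, sigma_delta is the variance component that says how far population 2 can drift from population 1, and tau shrinks as the X's explain more of the variation in the thetas.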

Another term for this is “meta-analysis.” The point is that you fit a model in one setting and want to apply it in another, knowing that these settings differ. My recommended approach is to build a hierarchical model in which one component of variance represents this difference. People don’t always think of hierarchical modeling here because in this version of the problem it might seem that J (the number of groups) is only 2. But in many settings (such as the buildings example above), I think existing data has enough multiplicity that a researcher can learn about this variance component. Even if not, even if J really is only 2, I like the idea of doing hierarchical modeling using a reasonable guess of the key variance parameter.
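To illustrate what "a reasonable guess of the key variance parameter" might look like in practice, here is a minimal Python sketch. It assumes a normal model with no predictors; the function name `predict_new_case` and the parameters `tau` and `sigma_delta` are hypothetical names chosen for this example, not anything from the post.

```python
import math

def predict_new_case(estimates, std_errors, tau, sigma_delta):
    """Predictive mean and sd for the new case's parameter theta_{J+1}.

    estimates, std_errors: y_j and their standard errors for the J observed cases.
    tau: assumed sd of the thetas within a population.
    sigma_delta: guessed sd of the shift between population 1 and population 2.
    """
    # Precision-weighted estimate of the population-1 mean.
    weights = [1.0 / (se**2 + tau**2) for se in std_errors]
    mu_hat = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    var_mu = 1.0 / sum(weights)
    # Three sources of uncertainty: the estimated mean, case-to-case
    # variation, and the guessed population-to-population shift.
    pred_sd = math.sqrt(var_mu + tau**2 + sigma_delta**2)
    return mu_hat, pred_sd
```

Setting `sigma_delta` to zero treats the new case as coming from the same population; a larger guess widens the predictive interval, which makes the dependence of the conclusions on that guess explicit.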

OK, back to examples. As I said, lots and lots, including our model for evaluating electoral systems and redistricting plans, our model for population toxicokinetics, missing data in multiple surveys, home radon, and income and voting.

Bareinboim asks “what guarantees” are provided by my methods. My answer: my method provides no guarantees. But that’s ok, there are no guarantees. When estimating effects of redistricting, or low-dose metabolism of perchloroethylene, or missing survey responses, or radon levels, or voting patterns, there are no guarantees, we just have to do our best. Different statistical methods focus on different aspects of a problem. One difficulty I’ve had with the causal-graph approach is that it focuses on conditional independence, and, in the problems I work on, there is little to no conditional independence. The multilevel modeling approach focuses on quantifying sources of variation, which is just what I’m looking for in the sorts of generalizations I want to make.

P.S. I now realize that there is some disagreement about what constitutes a “guarantee.” In one of his comments, Bareinboim writes, “the assurance we have that the result must hold as long as the assumptions in the model are correct should be regarded as a guarantee.” In that sense, yes, we have guarantees! It is fundamental to Bayesian inference that the result must hold if the assumptions in the model are correct. We have lots of that in Bayesian Data Analysis (particularly in the first four chapters but implicitly elsewhere as well), and this is also covered in the classic books by Lindley, Jaynes, and others. This sort of guarantee is indeed pleasant, and there is a long history of Bayesians studying it in theory and in toy problems. Arguably, many of the examples in Bayesian Data Analysis (for example, the 8 schools example in chapter 5) can be seen as toy problems. As I wrote earlier, I don’t think theoretical proofs or toy problems are useless, I just find applied examples to be more convincing. Theory and toys can be helpful in giving us a clearer understanding of our methods.

Naive question: How do you use live applications as evidence when evaluating a statistical method? It seems that in order to evaluate the performance of a method on a live application, you would have to know the very state of affairs that you want to use the method to learn about.

Greg:

I like to see a method give reasonable answers in a setting where existing methods don’t work so well. For examples, see the links above (as well as many of the examples in my books).

My worry is that you can only assess the reasonableness of a method’s answers insofar as you know by some other means what kinds of results a good method would yield on the problem in question; but the method is only informative insofar as it gives information beyond what you already know about what kinds of results a good method would yield on that problem. Thus, you can only assess the reasonableness of a method’s answers insofar as those answers are uninformative. For that reason, you need things like theorems and simulations as well as checks for reasonableness to warrant a method. Does that make sense? Am I missing something?

I think it makes sense Greg,

I am glad that you too are craving for some sort of guarantee in the process of performing (causal) inference.

How does multilevel modeling (which I’ve always thought was just a way of structuring/clustering the errors) in any way whatsoever do anything about selection bias or omitted variables? I’m not following that at all.

Stuart:

See this link or, for more details, chapter 5 of Bayesian Data Analysis. In short, multilevel modeling is not just a way of structuring/clustering the errors, it’s a way of sharing information across different experiments or different scenarios.

But does that address selection bias or omitted variables?

Stuart:

The usual approach is to include enough group-level predictors so that selection bias and omitted variables are not such a concern, or else to model the selection explicitly. Sometimes you just can’t do much. If you have a psychology study on a bunch of college sophomores, you need to make strong assumptions to generalize to the population as a whole. Other times you have data in diverse enough groups that you can make a generalization with more robustness.

Maybe we’re talking about different things, or maybe I’m not understanding the terms here.

A concrete example: studies of charter school performance that aren’t randomized experiments. One problem with all such studies is that students are selecting into charter schools, and hence might have higher levels of motivation, family commitment, etc., than other students who otherwise look identical on every data point that you can realistically measure. So education scholars worry about selection effects quite a bit there.

Hierarchical modeling, as far as I’ve seen it used in education, simply comes in if you want to take into account the way students’ test performance is nested within [classrooms and/or schools]. But that has absolutely nothing to do with the selection effect problem described above. Nor are there any “group-level predictors” that can remedy the selection effect problem. Bringing in hierarchical modeling here doesn’t seem to me to have anything whatsoever to do with the selection effect issue. But again, maybe I’m missing something.

Stuart:

I think you are retracing some of the arguments about hierarchical modeling that were made in statistics during the 1970s and 1980s. At first there was much concern about the assumption of exchangeability (see the comments by Kempthorne and response by Lindley and Smith reproduced on pages 13-14 here) but then there was a gradual folding-in of hierarchical modeling with general ideas about regression adjustment, including individual-level predictors, group-level predictors, and their interactions.

To consider your example of school comparisons: researchers include individual and group-level predictors and then can adjust for selection beyond this. The point of the multilevel model is that there can be group-level predictors (in this case, groups are schools) such as the location of the school, its funding, and aggregate characteristics of the families that send their children to the school. There can be teacher effects and school effects. You might want to look at the work of researchers such as Jennifer Hill, J. R. Lockwood, and Jonah Rockoff, all of whom have thought seriously about your concerns.

To speak more generally: this entire discussion began when Phil posted a question about his research in modeling indoor air flow in buildings. Phil wanted to generalize findings from one set of buildings to a new set. In classical (non-multilevel) statistics, this is commonly handled in one of three ways:

1. Pool all the data together, applying one regression model to new and old buildings; or

2. Allow no connection between old and new data, just giving up on the idea of using data from population 1 to learn about population 2; or

3. Come up with some hack such as a fractional weight to apply population 1 in some downgraded way to learn about population 2.

Multilevel modeling is a general framework for bridging between 1, 2, and 3 (or, one might say, a way to do 3 in a systematic way). The idea is to acknowledge differences between the groups and to model them. If you’re working on a difficult problem and can’t directly estimate the unexplained group-level information, then I’d recommend setting these hyperparameters based on what prior information you have, recognizing the possible dependence of your conclusions on these assumptions.
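The way partial pooling sits between options 1 and 2 can be sketched in a few lines of Python. This is a simplified plug-in version under an assumed normal model, with the between-group sd `tau` set by hand as the hyperparameter; the function name `partial_pool` is hypothetical, and a full multilevel analysis would also carry uncertainty in the overall mean.

```python
def partial_pool(group_means, group_ses, tau):
    """Shrink each group mean toward the precision-weighted overall mean.

    tau is the assumed between-group sd (the hyperparameter).
    tau = 0 reproduces complete pooling; a very large tau approaches no pooling.
    """
    weights = [1.0 / (se**2 + tau**2) for se in group_ses]
    mu_hat = sum(w * y for w, y in zip(weights, group_means)) / sum(weights)
    pooled = []
    for y, se in zip(group_means, group_ses):
        # Fraction of weight given to the group's own data.
        keep = tau**2 / (tau**2 + se**2)
        pooled.append(keep * y + (1.0 - keep) * mu_hat)
    return pooled
```

With two groups at 0 and 10, each with standard error 1, `tau = 0` returns 5 for both, a huge `tau` returns essentially 0 and 10, and `tau = 1` returns 2.5 and 7.5. Rerunning with several plausible values of `tau` is one simple way to see how much the conclusions depend on that assumption.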

Multilevel modeling is not magic; it is a method for generalizing from one population to another. But this is the case with statistical methods in general.

We are often trained to think of statistical methods as if they were computer programs or subroutines: send in the data and out comes the p-value or whatever. But a better analogy to statistical methods might be to programming languages. A statistical method such as multilevel modeling is a framework that allows the user (such as myself) to explicitly model variation between groups, rather than being stuck with either complete pooling or no pooling. This modeling of variation is, to my mind, central to the problem of generalization or transportability or prediction.

We are talking about different things.

Hierarchical modeling is chiefly concerned with obtaining better estimates assuming the model is specified, while DAGs are concerned with determining a specification assuming an underlying DAG. The aspect of generalizability that hierarchical modeling addresses is that it can model variation across experiments. This is a coarser-grained approach to thinking about generalizability, in that experiment-specific sources of bias which are not explicitly modeled will at least be modeled as the estimation of the variance across experiments.

Hierarchical modeling doesn’t say anything about what variables should be included in the specification to estimate a causal effect. The DAG literature, on the other hand, isn’t much concerned about how to estimate the model parameters. However, a causal inference problem isn’t “solved” until one is able to estimate the parameters in the model. Most practitioners will probably cringe at the caveat that “the proposed method relies on a sample size approaching infinity, which is difficult to obtain in practice”.

Maybe some of the debate here stems from whether specification or estimation is the overriding difficulty in one’s domain. I think the methods could easily complement one another, although that doesn’t seem to happen often.

Maybe this is all because I’m familiar with frequentist models (I studied program evaluation with your old friend Pat Wolf, by the way) and you’re coming from a Bayesian perspective. When I hear “hierarchical modeling,” I think of education studies that involve a model like this: First level: Yit [student test scores] = A0 [intercept] + B1*Yit-1 [lagged test scores] + B2*Xi [student demographics] + e; second level: A0 [from the first level] = a0 + B1*Zj [school level characteristics] + u. So if you substitute the second level into the first level, you now just basically have structured the error term taking into account the fact that students are clustered in schools.
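Written out, the substitution described above looks roughly like this. The school subscript j on the intercept and error terms is added here for clarity, and the second-level coefficient is renamed gamma_1 to avoid the clash with the first-level B1; both are notational assumptions, not the commenter’s own notation.

```latex
\begin{aligned}
\text{First level:}\quad  Y_{it} &= A_{0j} + B_1 Y_{i,t-1} + B_2 X_i + e_{it}, \\
\text{Second level:}\quad A_{0j} &= a_0 + \gamma_1 Z_j + u_j, \\
\text{Combined:}\quad     Y_{it} &= a_0 + \gamma_1 Z_j + B_1 Y_{i,t-1} + B_2 X_i + (u_j + e_{it}).
\end{aligned}
```

The composite error u_j + e_{it} is shared by all students in school j through u_j, which is the sense in which the combined equation structures the error term by school.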

Anyway, the specific problem with selection effects mentioned above (with observational studies of charter schools) isn’t something that I can imagine being able to address with any kind of modeling, not really, and I don’t recall Jonah Rockoff (whom I know) claiming to have done so. Neither do I quite follow your point about Phil’s problem — generalizing from the buildings in your dataset to any other buildings is one thing, but if there are specific selection effects at work whereby you by definition can only measure buildings whose owners choose to behave in particular ways, then that would be the problem that I’m worried about, and I again can’t imagine how anyone would be able to “model” such differences when the whole problem is that you have no information about the very thing that would need to be modeled.

Stuart:

Using data from situation A to learn about situation B can involve various difficulties. One challenge is selection, for which I believe the ideal solution is to model the selection process, treating selection variables as missing data as necessary. Another challenge is that the model for situation A can differ from the model for situation B. In that case I think multilevel modeling is a good framework for doing the appropriate amount of generalization and partial pooling.

Coming at this from an education evaluation perspective, I’m just not sure what it means to be told to “model” something on which there is no data (i.e., students’ levels of motivation). That’s why education scholars are always hoping to find some sort of exogenous force — ideally randomization or possibly a strong instrument — that affects selection into schools.

Stuart, you can typically model the residual variance for a random effect to at least quantify the uncertainty that cannot be explained with recorded data.

I’m still not sure what that means.

Put it in the concrete context I mentioned: the effectiveness of charter schools (outside of randomized experiments). Students who select into charter schools may be very different in unobservable ways, even if you control for everything you can think of, including the previous year’s test scores — they could be more motivated to seek academic success, they could be on the verge of having academic trouble, we just don’t know.

So if their scores improve in a charter school, what does that tell us about the policy question of whether to increase the number of charter schools — i.e., how would other students who so far haven’t made the marginal choice to switch do if put in a charter school instead?

The problem is: 1) we (program evaluators) have no way of measuring internal motivation; 2) we don’t know how much motivation is fixed in the population or how much the charter schools could be raising motivation; 3) we don’t know how much motivation (even if we could precisely measure it) actually drives test scores as opposed to the school itself.

So how exactly am I supposed to analyze charter students’ scores and then come up with a parameter that estimates how charter schools might affect anyone else? THAT’s what I’d love to figure out. But when I see Andrew (above) seem to suggest that hierarchical modeling might be an answer to selection bias (which is the problem I’m describing here), I get the impression we’re talking about completely different subjects. I don’t see how any model or cute little graph (sorry, I like Pearl and don’t mean to be dismissive) has any answer whatsoever to the fact that in the charter school problem above, we don’t have any way of parameterization whatsoever, not that I can see.

@Greg: We are doing applied statistics in the real world. If we’re modeling things people care about, like votes in the next election, the disease status of patients, the financial solvency of loan applicants, the diffusion of a gas, the chance of a tornado in the midwest, etc., the world itself gives us test cases.

Thanks for the response, Bob. I think you might be getting at something that addresses my worry. Could you explain how test cases can help warrant methods even when we’re using them to learn things we don’t already know?

Greg:

Consider, for example, our toxicology paper. Estimates based on complete pooling or no pooling did not give sensible answers, but multilevel modeling allowed us to estimate a population distribution. Or, for another example, our models for home radon exposure. There we actually went to the trouble of cross-validation to demonstrate the effectiveness of our methods, but really we hardly needed to: it was clear that multilevel modeling was giving sensible answers.

@Greg if I understand you correctly, I think some of the theoretical justifications you’re looking for regarding partial pooling being a preferred approach to parameter estimation can be found if you look into the relationship between hierarchical models and James-Stein estimation. The problem with test cases and simulations is that the perceived superiority of an approach will be highly dependent on the characteristics of the data and the sample size.
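For readers who have not seen it, a minimal Python sketch of the positive-part James-Stein estimator follows. It assumes independent observations with a known common variance `sigma2` and shrinks toward zero; the function name is hypothetical, and shrinking toward a grand mean (closer to what a hierarchical model does) would need a small modification.

```python
def james_stein(y, sigma2=1.0):
    """Positive-part James-Stein estimate, shrinking toward zero.

    y: list of p >= 3 independent observations, y_i ~ N(theta_i, sigma2).
    """
    p = len(y)
    if p < 3:
        raise ValueError("James-Stein shrinkage needs at least 3 components")
    norm2 = sum(v * v for v in y)
    if norm2 == 0.0:
        return [0.0] * p
    # Common shrinkage factor, truncated at zero so estimates never flip sign.
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / norm2)
    return [shrink * v for v in y]
```

The shrinkage factor plays the same role as the pooling weight in a hierarchical model, except that here it is computed from the data rather than from an explicit model of the between-group variance.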

However, that’s a more general issue than the current discussion regarding estimation of causal inference, confounding, and generalizability.

Part of the confusion arises from Elias’s claim to have “solved” the external validity problem, and to provide “guarantees” about claims of external validity.

Much hinges on how you define the problem. A common definition is the need to make predictions out of sample. In Manski’s view this is fundamentally an identification problem. We are effectively going beyond the data, and the only way we can do so is by making assumptions (e.g. invariance, monotonicity, linearity, etc…). Assumptions “solve” the problem.

Given those assumptions, what I think Elias proposes is a method that “guarantees” inferences follow logically from the assumptions just made. I.e. he seeks to prevent those more ham-fisted among us from making predictions not consistent with our own assumptions. But if all your assumptions are embodied in an empirical model, hierarchical or otherwise, and you use the model to make predictions out of sample, then presumably you are being consistent with your own assumptions.

Two notes:

(1) Most of the assumptions needed for predicting out of sample are parametric in nature (see above). These sorts of assumptions are hard to code in DAGs, which are by nature non-parametric, hence the need for additional notation. (E.g. graph X -> Y <- Z, has non-parametric representation Y=f(X,Z), which can specialize to any of Y=X+Z or Y=X+Z+X*Z or Y=X^Z, etc. )

(2) I hear Andrew's criticism of conditional independence. If the goal is predicting out of sample one does not need a causal model. There is a difference between prediction and understanding. But perhaps the emphasis on causality comes from the belief that causal models are more robust. Presumably they are the right model, the sort of model you want an AI computer to have.
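Note (1) can be made concrete with a small Python sketch. The three functions are the specializations listed in that note; the names `SPECIALIZATIONS` and `predictions` are hypothetical and exist only for this illustration.

```python
def additive(x, z):
    return x + z

def interaction(x, z):
    return x + z + x * z

def power(x, z):
    return x ** z

# All three are special cases of Y = f(X, Z), so the graph
# X -> Y <- Z alone cannot tell them apart.
SPECIALIZATIONS = {"additive": additive, "interaction": interaction, "power": power}

def predictions(x, z):
    """Predicted Y at (x, z) under each parametric form."""
    return {name: f(x, z) for name, f in SPECIALIZATIONS.items()}
```

At x = 2, z = 3 the three forms predict 5, 11, and 8, so an out-of-sample prediction depends on a parametric choice that the graph itself does not record.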

To be more precise, the problem is not so much predicting out of sample as out of the population from which the (random) sample was taken. That is, off the support of the estimated density.

P.S. Following my note (1) above, one could make the argument that, if anything, Pearl and Bareinboim’s article highlights the notational deficiency of DAGs when it comes to dealing with external validity issues. Hence the need for more notational bells and whistles.

>The idea is to acknowledge differences between the groups and to model them

Agree, but all done _in order_ to get something _common_ that is of interest (e.g. target parameter) or at least common in distribution (i.e. exchangeable/randomly drawn parameter from a hyper-target parameter).

I believe Elias is trying to formalize and elaborate this process of acknowledging differences to get a common target parameter from multiple studies or units of analysis – if you assume these are the (only) differences and this is your target then this is how you know if and how to get common parameters from that group of studies (perhaps just a subset of them).

I believe Andrew is more focused on the variation that comes from failing to actually get common parameters by treating the units as exchangeable – modeling parameters as not common but common in distribution or random. This might be put, as I earlier did, that non-transportability [after best attempts to get transportability] is [all that is] replicating (and one can make some sense of the variation of treatment effects that were hoped to be common).

But maybe a more thorough discursive analysis of the comments on the numerous blog posts should be done to confirm this, along with actual reading of the various papers ;-)

Maybe fortunately for me I read the _wrong paper_ http://ftp.cs.ucla.edu/pub/stat_ser/r400.pdf where this was the concluding paragraph:

“Of course, our entire analysis is based on the assumption that the analyst is in possession of sufficient background knowledge to determine, at least qualitatively, where two populations may differ from one another. In practice, such knowledge may only be partially available and, as is the case in every mathematical exercise, the benefit of the analysis lies primarily in understanding what knowledge is needed for the task to succeed and how sensitive conclusions are to knowledge that we do not possess.”

Keith:

Yup. But, just to clarify, the purpose of the above post was that Elias said he knew of no examples of hierarchical modeling for various inferential problems (this all started with a discussion of how to generalize from data A to problem B). So I gave a bunch of links, as well as references to my book. I think it’s great that different statistical methods are out there for people to use. I don’t think it’s so great when people are unaware of the power of hierarchical modeling, after I’ve written 3 books and about 100 articles on the topic! Not liking what I do is one thing; not knowing it exists, that really bothers me, and I’d like to do my best to clarify the situation.

Well at a minimum this thread will be a great resource for students!

A general suggestion and then a few _last_ two cents from me – just for the blog.

My old director (who now produces Broadway plays) told us when we get a review of our work (after taking a day or so to be upset) to pretend, no matter how hard that is, that the reviewer is trying to be helpful. Perhaps when people read unfamiliar literature, it would be a good idea to pretend that literature better understands the literature you are more familiar with (or at least the substance it addresses). Again pretend so you might better understand how it is truly deficient in some sense.

Elias’s point “To articulate causal assumptions, one needs to resort to one of three notational systems”

That is just the defunct Sapir–Whorf hypothesis – that without a certain language or notation your thinking is limited – language and notation can be helpful but semiosis (representation) is extremely fluid and adaptive. I do not think it is a good strategy to look for one’s notation in unfamiliar literature…

Andrew’s point about “study this variation” is perhaps more subtle than it might appear.

And to “allow this variation to determine the amount of partial pooling.” formally one _must_ assume a particular random distribution for the variation and define a target of inference given that or at least means and variances are of interest.

From some of my past email on this (not my statement but one I agree with)

“… there is no real alternative to supposing that the interaction (instability in treatment effect) is generated by a stochastic mechanism and that an average over an ensemble of such repetitions is of interest. That is, it is the interaction [non-transportability] that is being replicated not the centres themselves.”

And an old paper (with an online preview) Meta-Analysis: Conceptual Issues of Addressing Apparent Failure of Individual Study Replication or “Inexplicable” Heterogeneity http://www.springerlink.com/content/g952tnj818482780/

Dear Keith,

I take issue with your position on language and thought, but I really liked your paper on MA which is the only one I read that questions the wisdom of treating variations among studies as random noise.

I do not want however to steer the conversation into this avenue while we are so close to resolving the Gelman-Manski dilemma — should identification precede estimation.

eb

For your interest, Andrew often reports this quote from Rubin as a useful starting point in any analysis:

“What would you do if you had all the data?”

Doesn’t this quote relate to the identification problem?

Dear yop,

Glad Andrew reports this quote from Rubin.

It is surely related to the identification problem and strengthens Manski’s doctrine. When the target quantity is not identified then all the data in the world will not help you estimate the quantity that you want to estimate.

Excellent quote.

eb

One thing is prediction out of sample, quite another *doing* something out of sample (e.g. distributing a vaccine to a new population)

A wet roof predicts it’s raining, but wetting my roof does not cause it to rain.
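The roof example can be simulated directly. The Python sketch below is purely illustrative: the probabilities (rain on 30% of days, a 5% chance that a dry-weather roof is wet for some other reason) and the function name `simulate` are made-up assumptions.

```python
import random

def simulate(n, wet_the_roof=False, seed=0):
    """Return (P(rain), P(rain | roof wet)) estimated from n simulated days.

    Assumed mechanism: rain occurs with probability 0.3 and always wets
    the roof; without rain the roof is wet with probability 0.05.
    wet_the_roof=True is the intervention: the roof is wet regardless of rain.
    """
    rng = random.Random(seed)
    rain_days = wet_days = rain_and_wet = 0
    for _ in range(n):
        rain = rng.random() < 0.3
        other = rng.random() < 0.05
        wet = True if wet_the_roof else (rain or other)
        rain_days += rain
        wet_days += wet
        rain_and_wet += rain and wet
    return rain_days / n, rain_and_wet / wet_days
```

Observationally, a wet roof raises the probability of rain from about 0.3 to about 0.9 under these assumptions. Under the intervention the conditional probability collapses back to the marginal rate, because wetting the roof carries no information about the weather.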

One nice thing about DAGs is they help lay out the causal structure needed to make predictions about the effects of interventions.

A multilevel model has no way, in the notation, to distinguish explicitly between variables that are causal, proxies, or simply good predictors. That is the point Elias tries to make when he uses the example of a proxy variable.

When I read a DAG I know exactly the causal structure of the problem. When I read a multilevel model equation I cannot, without further information, tell whether an included variable is a good predictor, a proxy, or a cause.

The combination of (selection) DAGs with hierarchical modeling might be more fruitful.

Fernando:

Some of the examples I linked to above have direct causal or decision-making interpretations. In our analysis of districting plans, we estimated the (causal) effect of redistricting. In our analysis of radon exposures, we estimated the potential effects of different plans for measurement and remediation. In our toxicology example, we predicted concentrations of the toxin given externally specified initial conditions. It may well be that all those problems could be solved just as well (or even better) using other approaches, but in the meantime we did solve these real causal problems using multilevel models.

I read your book with Prof. Hill in class, parts of BDA, and some papers on multiple imputation. I fully agree partial pooling is great!

My point is that, for causal problems, the equations (without further context) can be ambiguous. This may sound like splitting hairs but, for better or worse, much of the debate about DAGs, etc. is all about notation.

Just as we can debate the best computer language for doing statistics, so much of the debate Pearl and others are having is about the best language to store our causal knowledge with. This is more than semantics, though. Arabic numerals are better for accounting than Roman ones.

I agree with Pearl that DAGs are a great way to store, communicate, and work with our causal knowledge. DAGs are particularly useful for identification purposes. Beyond that, how one does their estimation is up to them (so long as it is consistent with our assumptions).

P.S. To give a specific example, in this paper http://sekhon.berkeley.edu/papers/SekhonTitiunik.pdf the authors find problems with causal identification in other researchers’ work.

My claim is that, had those other researchers written down their causal knowledge in a DAG, as opposed to a mixture of parametric equations and text, they would have avoided the identification problems, and made better inferences.

Dear Fernando,

it seems that we are aligned in this discussion, I agree that notation is more important than people give credit for.

eb

Dear all,

I am back from Canada and would like to thank Andrew for taking the time to post examples, to help us understand how hierarchical models transport experimental findings among diverse populations.

I have encountered two obstacles in my attempt to understand the method proposed.

1.

The problem of deciding whether causal effects are transportable across populations requires causal assumptions about the target population, in which no experiments can be performed. We all know that causal assumptions must be articulated by the analyst, since they cannot be deduced from data or statistical assumptions. Yet, Andrew’s examples are not accompanied with causal assumptions, which leads me to conclude, from first principles, that the problem he deals with is different than the one I proposed.

To articulate causal assumptions, one needs to resort to one of three notational systems: Potential outcome, structural equations, or graphs. Since I cannot find any of these notations used, I am led to the conclusion that either the examples are not about causal transportability or that the assumptions are implicit, and have not been encoded mathematically.

2.

Andrew sent me to look up “bias” or “unbiased” in the index to Bayesian Data Analysis to see why Bayesians have problems with the concept of bias, since it is defined relative to the “true” parameter of interest, a concept that does not always make sense to Bayesians. I would be the last to force upon Bayesians that which they find hard to swallow, especially such a notion as “true causal effect”. But what I would like any methodology to provide its practitioners is some notion of GUARANTEE. By “guarantee” I do not mean assurance of finding the “correct” answer, but assurance that the method is better, in some minimal sense, than, say, consulting an astrology table.

Andrew’s description of the way conclusions are established in hierarchical models does not give me that assurance. Quite the opposite, the method of “we use what we call the secret weapon, and plot several estimates on a single graph so the partial pooling can be done by eye” makes me wonder whether this is really what we call science.

Don’t we have even weak guarantees, say of the form: If the prior has property P1 and the two populations have properties P2 and P3 then, when the number of samples goes to infinity, the posterior is guaranteed to have property P4?

I am disappointed. If I were a hierarchical Bayesian I would not rest until I establish such weak guarantees.

Remark.

Fernando raised the question whether mathematics can ever “solve” problems or merely “assumes them away” (given that all formal methods are based on assumptions). I believe the reason we prefer to call mathematical results “solutions”, as opposed to “assumptions”, is that the assumptions are usually much, much easier to defend or criticize than the conclusions.

For instance, it is much easier for an engineer to judge whether a given triangle is right-angled than to assess whether the square built on one of the sides has the same area as the sum of the squares on the other two sides. This is where the strength of formal methods lies, and this is why we dare call our transportability results “solutions” to, rather than “assuming away”, the problem of generalizability.

Fernando noted correctly that the external validity problem lies outside the realm of “predicting out of sample as out of the population from which the (random) sample was taken …” , as suggested by Gelman’s examples.

Overall, the two barriers remaining between me and an understanding are:

1. “causal assumptions” — where are they encoded?

2. “guarantees” — where are they expressed?

Best regards,

Elias

Elias:

1. If you think my methods are no better than “consulting an astrology table,” I invite you to use an astrology table to estimate the radon level in your home, or to impute missing data in a sample survey, or to estimate patterns of income and voting in different states, or . . .

2. I don’t know why you have such a problem with people doing partial pooling by eye. Scientists make conclusions by eyeballing data all the time. In fact, one of the standard criticisms that working scientists make of formal statistics is that they don’t trust an effect until they can see it in the data. In hierarchical settings, it’s not at all unusual to perform statistical analyses within groups and then to be able to see the between-groups pattern by eye.

3. Your idea of a guarantee involves sample sizes going to infinity. I only work with finite samples.

I think it’s fine for you and your colleagues to work on your own methods and to express skepticism about my methods and others’, but you might want to reflect a bit on the fact that we are not fools, that we may have different goals than you have. You are interested in infinite sample sizes, I’m not. You’re interested in toy problems, I’m interested in them a little (as in my paper about the boxer, the wrestler, and the coin flip) but not as much as you are. I recognize that much can be learned from analyses of infinite samples and from toy problems. I encourage you to recognize that much else can be learned from finite samples and from real problems of the sort listed above.

Dear Andrew,

My reference to astrology tables was not meant to imply that the method has no scientific basis; it was meant to express my eagerness to understand what that basis is, and to understand it in mild scientific terms. Here is what I wrote:

But what I would like any methodology to provide its practitioners is some notion of GUARANTEE. By “guarantee” I do not mean assurance of finding the “correct” answer, but assurance that the method is better, in some minimal sense than, say, consulting an astrology table.

I have no doubt that your method does better than astrology tables and, therefore, that its superiority can be captured in a form of performance guarantee. Unfortunately, I have not seen it expressed that way thus far.

I am not wedded to asymptotic guarantees, at infinite samples. My offer to phrase the guarantee in terms of asymptotic behavior was intended to make it easier to prove but, by all means, if the guarantee can apply to studies with finite size data, so much the better.

I imagine myself standing in front of a class, trying to show them the benefits of doing hierarchical models, and one of the students asks: If I follow this method, what confidence do I have that I got the answer to the question that we posed, or close to it? What do I tell this student?

eb

Elias:

If you are teaching hierarchical models, and one of the students asks: “If I follow this method, what confidence do I have that I got the answer to the question that we posed? or close to it?”, I recommend that you honestly reply that you don’t know, that you would like to have that confidence but you don’t have it. If the student is talking about an applied example (such as radon modeling) you can do various demonstrations (for example, we use cross-validation in our radon paper from 1996) and you can also honestly say that no other statistical method offers any guarantee either. If the student is talking about a toy example, you can say that you know of no guarantees but that Bayesian practitioners claim to have much success with applied examples.

When I teach Bayesian data analysis, I demonstrate its utility through examples such as those given in the post above, and also through others’ applied research, but I prefer to use my own examples since I can give the stories behind them. When first writing the BDA book, I mostly used examples of others.

I emphasize that Bayes is not the only way to go, that I have found the Bayesian modeling and model-checking framework useful in a wide range of problems, but that others have had success using other methods. I offer no guarantees to my students. I’m just not interested in guarantees.

If you find yourself teaching Bayesian data analysis, I recommend you explain to the students that this is a commonly used approach to statistics and for that reason it is a good idea for them to learn and understand it. You can tell the students that you are personally skeptical of Bayesian methods but that they are a good thing to know about, if for no other reason than to understand much of the statistical work being done nowadays.

Consider a slightly different example. I teach sample surveys. In that class, I describe methods such as classical survey weighting which I do not love, but which I recognize are popular, important, and serve useful purposes. If I had a tool that could completely replace classical survey weighting in practice, I’d apply it and I wouldn’t look back. But I don’t. I like my methods but I realize they have difficulties too.

Similarly, if you, Elias, teach Bayesian data analysis, you can talk about ways in which you prefer Pearl’s notation and you can use that notation as much as possible. But you might want to consider examples such as evaluating electoral systems and redistricting plans, population toxicokinetics, missing data in multiple surveys, home radon, and income and voting, where hierarchical Bayes has done well. After all, if the students are taking a class on hierarchical models, they might want to see them in action.

Dear Andrew,

I begin to understand the philosophy behind the Hierarchical Models methodology applied to causal inference. While I will find it hard to tell my students that Bayesian practitioners “are not interested in guarantees” and at the same time “claim to have much success with applied examples.” (the two are contradictory, because there is no “success” unless one knows the true answer), what I will be able to tell them is that with the help of new mathematical tools and a sharp distinction between causal and statistical concepts, Bayesians too will one day be able to understand why their methods work most of the time, and why they sometimes don’t.

Another thing that I will be able to tell them is that other methods happened to be blessed with notions of theoretical guarantees and, by virtue of being theoretical, those guarantees scale up from toy problems to practical problems of any size.

If you do not find anything heretical in my state of understanding, I believe we have reached a stable equilibrium that will last till the next breakthrough.

Hoping,

eb

Elias,

Your methods give no guarantees either, it’s just that you’re doing everyone the disservice of pretending they do. You pretended to answer Fernando’s point about solving the problem by assuming it away, but you just talked around the issue. You assume away any confusion with an a priori DAG, when one can defensibly assume one knows the DAG, or at least a DAG that doesn’t assume away the vast majority of plausible DAGs, in approximately 0% of real problems that require current investigation. Okay, 0% is a slight exaggeration, but the point remains.

And “no guarantees” and “have had much success” are just unbelievably obviously not contradictory. Maybe as a computer scientist you just don’t actually understand statistics, probability, randomness, distributions, variation, cross-validation, prediction error, or any of the rest of it, at all!?

If you’re only willing to think inside your little bubble and pretend it’s the whole world, then how about this. If you’re actually trying to investigate whether there is a causal effect from X -> Y, that means you don’t understand the relationship between X and Y. That means there is a key relationship in the setting of the problem you’re investigating that you have to admit you don’t understand a priori. Yet you claim 100% confidence that you know exactly what the relationships are between every single other variable? In what situation, other than new drug studies, does that seem at all reasonable? Your guarantee is just that if I blindly accept that you have the ear of God, with the single exception that s/he mumbled when talking about this single edge in the graph, then you can guarantee that the results of this analysis gives you the indisputable truth about that edge. Toy problem? Great. Logic puzzle? Great. Vast majority of real problems? What a joke. In real problems it just means that usually you’re deceiving everyone, pretending you can offer guarantees while hiding the vast and never guaranteeable assumptions you made a priori.

“Bayesians too will one day be able to understand why their methods work most of the time, and why they sometimes don’t”

What a condescending, revealing, and unbelievably uninformed joke of a statement. We do already. You just don’t understand them at all. We admit that we don’t know all the true relationships or true sources of variation, but we can observe in the data and from experience with similar problems that, for instance, there are hierarchical groupings of subjects where there seem to be sources of variation common across the group, so we choose to try to model the total variation by modeling those different sources of variation. Sometimes we have chosen a model and/or have measured variables at the different group levels that allow us to model the various sources of variation approximately well, so we can make good predictions for new observations from similar circumstances. Sometimes we haven’t so we can’t. If we knew why we hadn’t built a good model, we would have built a better model. In that situation where we failed, if you tried to analyze that data and offered a guarantee that your result was sound, you’d just be deceiving yourself and others. Andrew’s just being more honest and actually understands randomness, variation, and unknowability.

Elias:

You write, “there is no ‘success’ unless one knows the true answer.” I disagree completely. Darwin had success even though he did not know the true answer (that waited until the synthesis of the 1930s). Newton had success even though he did not know the true answer. Holland (the guy who built the tunnel to New Jersey) had success even though he did not know the true answer. And I’d like to think that in my much smaller examples in public health and political science, that I have had success without knowing the true answer.

You write, “guarantees scale up from toy problems to practical problems of any size.” I disagree completely. The guarantee is conditional on the model being correct, or on some assumptions. The typical challenge for practical problems is that we know in advance that our assumptions aren’t correct and the textbook conditions are not met, but we need to proceed anyway.

I think it’s great that you have confidence in your methods. I recommend that you consider the possibility that you don’t understand other people’s methods as well as you think you do!

Matt,

I think you are under-estimating the ability of graphical models to represent ignorance about DAG structures, and to provide guarantees even when one faces a multitude of plausible DAGs. First, graphical representations of classes of DAGs are well established in the literature, for example, the class of all Markov-equivalent DAGs. Second, suppose we know nothing about the world, except that one causal link is missing (e.g., skin color does not affect intellectual capacity). That one piece of knowledge can be represented formally in the model and harnessed to help in inference tasks that depend on the relationship between skin color and intelligence.

If the inference task is successful, the method issues a pair of outputs. For example, “Dear investigator: Based on the data provided, your desired effect size is estimated at alpha, with confidence interval beta, guaranteed to hold true as long as the assumption that ‘skin color does not affect intellectual capacity’ can be defended.”

Thus, we are not “assuming the problem away” as you described; we are actually “solving the problem” subject to a transparent set of assumptions, and then we are delivering the strongest possible guarantee under the circumstances.

If you think one can do better, please let us know how. Or, if you do not think a better method exists, at least tell us where, in this mild, cautious and honest methodology, we have been “deceiving everyone” or “hiding the vast and never guaranteeable assumptions we made a priori”.

Let’s be concrete. Where?

eb

Dear Andrew,

Newton and Darwin had THEORIES, the very notion of which you seem to reject when it comes to “real life problems”. Holland had a “true answer” or “measured success” — people managed to cross from New York to New Jersey, something they could not do before.

You write that “in public health and political science . . . I [Gelman] had success without knowing the true answer”. I believe you, and I would like to share that conviction with you, but beg only to understand the basis on which you feel confident that your methods were successful. In prediction problems, success is measured by the accuracy of the predictions; in causal inference, success is measured by proximity to results of randomized experiments. Which of the two (perhaps a third?) serves as your measure of success?

You strongly disagree with my claim that “guarantees scale up from toy problems to practical problems of any size” because (in your words) “The guarantee is conditional on the model being correct,”. My point is that this condition (model correctness), rather than invalidating the guarantee, constitutes the very essence of the guarantee. In other words, the assurance we have that the result must hold as long as the assumptions in the model are correct should be regarded as a guarantee, not as violation of a guarantee.

You write, and I agree, that “The typical challenge for practical problems is that we know in advance that our assumptions are not correct and the textbook conditions are not met, but we need to proceed anyway.” I do not stop at that. I ask: What happens when we proceed? What guides our thoughts, and our choice of alternative courses of exploration? Modern theories of “heuristic reasoning” teach us that explorative guidance comes from consulting simplified models of the problem. In other words, by dismissing solutions to simplified problems we deprive ourselves of guidance when problems get complicated.

Lastly, I am puzzled by your last recommendation that I consider the possibility that I don’t understand other people’s methods as well as I think I do. My whole discussion at this forum was a persistent plea to understand other people’s methods better than I do. If you trace back my postings you will find a continuous stream of questions: How do you handle this, and what about that, what guarantees do you get here, and what assumptions do you make there. While the answers I received thus far were less than satisfactory, it would be hard to argue that I have not tried.

eb

Matt,

On reading your post again, I believe I understand the reason for our miscommunication, at least part of it, so let me try to unravel it.

Your depiction of graphical modeling goes as follows (I quote): If you’re actually trying to investigate whether there is a causal effect from X -> Y, that means you don’t understand the relationship between X and Y. That means there is a key relationship in the setting of the problem you’re investigating that you have to admit you don’t understand a priori. Yet you claim 100% confidence that you know exactly what the relationships are between every single other variable?

The logic of graphical modeling actually goes as follows:

1.

I am indeed trying to investigate whether there is a causal effect from X -> Y; that means I suspect that such a relationship may exist, though I don’t know its type, size or even its sign.

2.

That means indeed that there is a key relationship in the setting of the problem that I admit to be unable to assess a priori — I confess.

3.

Now, where on earth did you get the idea that anyone claims 100% confidence to “know exactly what the relationships are between every single other variable?” God forbid!!! All we do at this point is ask ourselves: Is there any variable, say Z, for which I can be fairly confident that it does NOT affect X directly? If there is, I do not draw an arrow from Z to X, but if I suspect otherwise, I do draw an arrow from Z to X, and continue. By the time I finish going over all pairs of variables that bear on the problem, some measurable and some not, I have a tentative structure where arrows mean SUSPICION that a relation exists and lack of arrows means good reason to believe that a direct causal relation is absent. From this fuzzy and qualitative piece of knowledge I now ask a quantitative question: Can I estimate the size of the average effect of X on Y from the available data, assuming that my assumptions about the missing arrows are correct? (The graph with all arrows present carries no assumptions.)

I think this careful modeling approach is far from the grotesque picture you described: “... you claim 100% confidence that you know exactly what the relationships are between every single other variable?” “Your guarantee is just that if I blindly accept that you have the ear of God, with the single exception that s/he mumbled when talking about this single edge in the graph, then you can guarantee that the results of this analysis gives you the indisputable truth about that edge.”

Not only is the method careful and logical, it is also the best method I know for harnessing qualitative knowledge, combining it with data, and answering quantitative causal questions.

This constructive way of encoding partial knowledge in a theory before proceeding to estimation, is not unique to graphical modeling. It is done routinely in structural equation models, in potential outcome approaches to time-varying treatments, and in every application where one needs to carefully select potential confounders for control. (See for example, Pearl and Robins’ (1995) method of identifying effects of sequential treatments [Causality, pages 118]).

But, as always, I am willing to learn new methods. Can you take me step by step through your favorite model building approach?

Listening.

eb

I get the impression Elias is no fan of hierarchical models.

Dear Elias,

You emphasize the importance of guarantees, but only mention guarantees that hold when the model is “correct”. Surely _any_ probabilistic model yields correct inference when model assumptions hold, provided you have not made mathematical errors or misstated results (e.g. by ignoring uncertainty in point estimates)?

A more useful guarantee would quantify robustness against model misspecification – i.e. how does the correctness of your inference degrade as the degree of truth of your stated assumption decreases?

Dear Konrad,

Robustness, or protection against misspecification, also requires that we encode hypothetical assumptions, draw their logical consequences, and compare them against our initial conclusion. It is the same sort of inference exercise but now conducted with a variety of (hypothetical) assumptions, as opposed to one (most believable) set of assumptions. Pearl (2004) has a paper “on the robustness of causal claims” in which the minimal set(s) of assumptions necessary for a given claim are deduced from the graph and used to quantify the degree of robustness of that claim.

Messing up a bit with the thread (perhaps we can discuss more offline), but I am kinda excited with one of my current projects in which the theory gives hints to the researcher as to which kind of easy thing (e.g. smaller experiments) can be done in order to obtain a more definitive solution for the bigger problem; alternatively, how to reuse knowledge from a somewhat related population. Still, all of these tools to strengthen guarantees require that we have an engine that receives assumptions, combines them with data, and draws their logical consequences. That is why I was surprised to hear that some methodologies can get by without the engine.

eb

I don’t think it is the case that hierarchical models “get by without the engine”. They are based on strong assumptions and give results that would be reliable if the assumptions were true (i.e. they are the logical consequences of assumptions and data). I think that when Andrew says there are no guarantees he means there are no guarantees when the assumptions are _false_ (as we know a priori is the case; we just don’t know in _which way_ the assumptions are false). He is not interested in guarantees for which the preconditions are not met.

Dear Konrad,

You wrote: “They [hierarchical models] are based on strong assumptions and give results that would be reliable if the assumptions were true.”

This is not sufficient to turn (conditional) assumptions into guarantees. To qualify as a guarantee, an assumption must meet two conditions:

1. It has to be explicit (as opposed to hidden behind estimation routines);

2. It has to be meaningful, so that investigators can judge its plausibility.

If anyone thinks that the assumptions invoked in hierarchical models satisfy these two conditions, please bring one to our attention, preferably one that supports causal conclusions.

I am eager to see one,

eb

In my understanding, the assumption in any probabilistic model (including a hierarchical one) is that the data constitute a sample drawn from a specified distribution.

This assumption meets both of your criteria: 1) it is made explicit by providing an equation for the distribution; 2) it is easily interpretable and its plausibility is easy to judge: first, the assumption is obviously false because we know a priori that the observations were not really obtained by sampling from a distribution (i.e. the question is not _whether_ it is false, but _how_ false it is); second, given a set of rival assumptions we can judge their relative plausibilities (i.e. decide which is less wrong) using standard model comparison and/or hypothesis testing techniques; third, we can investigate the plausibility of an assumption in isolation using techniques for evaluating model fit (such as posterior predictive checking); fourth, we know which variables in the problem domain are being taken into account and in which ways, so if (say) an independence assumption is being made we know about it and can try to evaluate its plausibility in terms of the problem domain (strictly speaking, all independence assumptions are false, but luckily many/most dependencies are negligible).

An example of a hierarchical model from the application area I work in (not causal I’m afraid – I’m not aware of any applications of causal models in this area) would be the gamma model of rate variation in genetic sequence evolution. In this model we have a number of sites in a DNA sequence that are assumed to evolve independently under a continuous time Markov process, but with rate parameters that are correlated. To model the correlation between rate parameters, we assume them to be independent draws from a gamma distribution. Traditionally, the model is described as a random effects model, which is just the frequentist name for a hierarchical model (i.e. no priors are specified for the parameters of the gamma distribution and they are estimated via maximum likelihood), although of course the Bayesian approach has also been used.

The point of the example is that this model was constructed as an improvement over earlier non-hierarchical models that ignored the dependence between sites and instead modeled them as iid. The improvement was suggested by the empirically known fact that different sites evolve at different rates, which means that the constant rate (or iid) assumption was always known to be false. What was not always known was whether replacing the iid assumption with the hierarchical gamma assumption would lead to an improved model fit. But in 1993 it was demonstrated via statistical model comparison that, for a large number of real data sets, the hierarchical model is a closer approximation to reality.
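A small simulation may make the contrast between the constant-rate and gamma-rate assumptions concrete. This is only a sketch under stated assumptions, not the model comparison from the literature: rates are drawn from a mean-one gamma distribution (shape `alpha`, scale `1/alpha`), and the continuous-time Markov process is replaced by a simple Poisson count of substitution events per site. The function name `simulate_site_counts` and all numeric values are hypothetical.

```python
import numpy as np

def simulate_site_counts(n_sites, alpha, branch_length, rng):
    """Simulate per-site substitution counts (illustrative assumption:
    counts are Poisson with mean rate * branch_length).

    alpha=None gives the constant-rate (iid) model; otherwise site rates
    are independent draws from a gamma distribution with mean 1.
    """
    if alpha is None:
        rates = np.ones(n_sites)
    else:
        rates = rng.gamma(shape=alpha, scale=1.0 / alpha, size=n_sites)
    counts = rng.poisson(rates * branch_length)
    return rates, counts

rng = np.random.default_rng(0)
_, iid_counts = simulate_site_counts(20000, None, 2.0, rng)
_, gamma_counts = simulate_site_counts(20000, 0.5, 2.0, rng)

# Both models have the same mean count per site, but the gamma-rate
# model produces much more between-site variation.
print("constant rate: mean, variance =", iid_counts.mean(), iid_counts.var())
print("gamma rates:   mean, variance =", gamma_counts.mean(), gamma_counts.var())
```

Under these assumptions the two models agree on the average count but differ in spread, which is the kind of discrepancy a model comparison on real sequence data can detect.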

This is the usual pattern of progress in statistical modelling in all of the application areas I have worked in: we start with a simple model that is wrong (in some known ways and also in some unknown ways), and make progress by moving through a succession of less and less wrong models (discarding failed attempts along the way). The one thing that is constant is that, at any stage of the process, the currently best model is known to be wrong. This is why many people in this thread are only interested in guarantees that apply when model assumptions are acknowledged to be wrong.

Dear Konrad,

I learned a lot from your example — thanks.

You described clearly how predictions can be improved by choosing a more elaborate statistical model, one that accounts for dependencies that were neglected before. Indeed, in this case, the improved prediction in itself gives you a guarantee that is stronger than the plausibility of the dependence assumptions.

What surprised me, however, was Andrew’s assertion that he prefers to use hierarchical models to handle the problem of transportability, that is, transporting experimental findings among diverse populations. I was surprised because hierarchical models are statistical (relying on statistical assumptions such as independencies) and transportability is a causal problem (i.e., requiring causal assumptions that do not show up in the data and cannot be expressed in the language of probabilities).

I assumed therefore that these models were some causal variants of hierarchical models, and I was curious to see in what language causal assumptions were encoded.

I am still curious,

Thanks for the example.

eb

Dear Elias,

I think it’s just a matter of explicit causal assumptions not coming into play. Causal assumptions are inherently stronger than correlation assumptions, and this can be seen as an argument against invoking them when they are not strictly required. When representing a hierarchical model as a directed probabilistic graphical model you have a structure which will contain a parent variable with some child variables (in my example the child variables are the site-specific parameters and the parent variable is the parameter vector for the gamma distribution), and of course we typically think of the parent variable as being a common cause affecting the child variables (in my example, the common cause is some combination of mutation and selection effects, but in this simple model these are not represented explicitly). But in many cases, all of the questions of interest can be answered without actually making this causal assumption – this makes the results more broadly applicable: they apply even if the causal structure is different from the one we had in mind when constructing the model.

So I think Andrew’s approach is to use (non-causal) directed PGMs which are constructed based on causal considerations but stop short of actually making causal assumptions (because such assumptions are not needed to answer the questions under investigation). Of course I could be wrong here, not having studied Andrew’s specific examples.

Konrad:

I think that’s about right. Another way of putting it is that as researchers we have only a certain amount of effort to spend, and different methods allow us to spend that effort in different places. In hierarchical modeling we typically use fairly simple causal assumptions but are serious about the problem of generalizing from one group to another. From what I’ve seen of Pearl’s methods, he and his collaborators are highly interested in causal assumptions but don’t spend much time modeling the subtleties of generalizing. That’s one reason why I think that hierarchical models are particularly well suited to questions of transportability: there’s a lot of space between complete pooling and no pooling. Of course I’d be happy to see models that combine the Pearl and hierarchical modeling approaches. So far, the examples I’ve seen have focused on estimating whether or not a link between two variables exists, rather than on the sort of partial pooling that I find helpful in thinking about generalizing. Maybe some readers of this thread will find a way forward. . . .
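To illustrate the space between complete pooling and no pooling, here is a minimal sketch. It is not taken from any of the applications mentioned in this thread: it assumes a normal hierarchical model, y_ij ~ N(theta_j, sigma^2) with theta_j ~ N(mu, tau^2), and treats sigma and tau as known. The function name `partial_pool` and the numbers are hypothetical.

```python
import numpy as np

def partial_pool(group_means, group_sizes, sigma, tau):
    """Partially pooled estimates of group effects theta_j.

    Assumes y_ij ~ N(theta_j, sigma^2) and theta_j ~ N(mu, tau^2), with
    sigma and tau known; mu is replaced by its precision-weighted estimate.
    """
    group_means = np.asarray(group_means, dtype=float)
    group_sizes = np.asarray(group_sizes, dtype=float)
    # Sampling variance of each group's raw mean.
    se2 = sigma**2 / group_sizes
    # Precision-weighted estimate of the population mean mu.
    w = 1.0 / (se2 + tau**2)
    mu_hat = np.sum(w * group_means) / np.sum(w)
    # Weight each group puts on its own data (0 = complete pooling, 1 = none).
    weight_own = tau**2 / (tau**2 + se2)
    return weight_own * group_means + (1.0 - weight_own) * mu_hat

means, sizes = [1.0, 3.0], [10, 10]
complete = partial_pool(means, sizes, sigma=1.0, tau=0.0)   # complete pooling
partial = partial_pool(means, sizes, sigma=1.0, tau=1.0)    # in between
separate = partial_pool(means, sizes, sigma=1.0, tau=1e6)   # essentially no pooling
print(complete, partial, separate)
```

Setting tau to zero reproduces complete pooling and letting tau grow reproduces no pooling; in a full hierarchical analysis tau would itself be estimated from the data rather than fixed as it is here.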

Dear Andrew,

You wrote: “… he [Pearl] and his collaborators are highly interested in causal assumptions but don’t spend much time modeling the subtleties of generalizing.” I see it differently, if not the opposite, because the main research question that we ask is when and how experimental findings can be generalized across populations.

The causal assumptions that we make are not the focus but the conditions that we found to be necessary to enable the generalization. And here is why I say “necessary”.

In the first part of our paper on transportability, we present three different causal stories (i.e., sets of assumptions), all compatible with the same observable data, yet we show that each entails a different transport formula (i.e., a different way of pooling data from the two populations).

In other words, in the whole exercise of answering a question of generalization of causal quantities, the causal assumptions are the tools, not the goal — I do not understand why you describe this effort as one that “does not spend much time modeling the subtleties of generalizing”, when every sentence in our paper deals with those subtleties.

I am providing a link again here (http://ftp.cs.ucla.edu/pub/stat_ser/r400.pdf), so that you and others can see how those subtleties are modeled explicitly, and painfully.

eb

Elias:

Your paper is fine for what it is, but I don’t see it helping me evaluate electoral systems and redistricting plans, learn about population toxicokinetics, impute missing data in multiple surveys, estimate home radon exposures, estimate the relation between income and voting, or solve the many other problems I’ve worked on over the years, almost all of which involve generalizing to new populations. But that’s fine: it’s not necessary that your method (or mine) solve all problems. The world is complicated and there is room for many methods. Also see footnote 1 on page 68 of this article. The world is a big place.

Dear Konrad,

You are right. The hierarchical models are famous for generalizing statistical properties across groups, and what we are trying to do is to understand how we can use their power in causal, not statistical, inference. Please, see my response to Andrew on this issue.

Dear Elias,

I have now had a look at your paper. Your approach is more general than hierarchical models in many respects, but it does not incorporate the key idea from which hierarchical models draw their power – that of partial pooling. This is a good thing, because it means the two approaches are orthogonal and combining them may lead to further progress. (You also restrict yourself to the case of two populations, which is unnecessary.)

It is not hard to describe hierarchical models in your framework: the DAG for each population is identical and all non-ancestral nodes are transportable. Ancestral nodes (by which I mean nodes that have no parents) are typically non-transportable – so complete pooling of these nodes (assuming that they are identically distributed) is not an option. However, this does not rule out partial pooling – instead of assuming that the distributions of the ancestral nodes are independent (no pooling), we assume that they (the distributions) are draws from a shared distribution. This shared distribution-of-distributions can be specified a priori or, when there are sufficiently many populations, estimated from data.

To draw the DAG for this, first (if your DAG is not already in this form) consider the parameter vector describing the distribution of each ancestral node as a variable X and add it to the DAG as a parent of the node whose distribution it governs. Next, make N copies (one per population) of the DAG. Finally, for each parameter vector X that was added in step 1, add a parent node that has X and its copies as children. Instead of modeling the copies of X as draws from different distributions, they are now independent draws from the same distribution, which is conditioned on the new parent node. You now have a single DAG modeling both populations simultaneously.

To incorporate this in your framework, one would need a different notion of transportability (perhaps called “partial transportability”). Instead of asking whether two (or N) distributions are identical, the question would be whether they (the distributions or, equivalently, their parameter vectors) can be modelled as iid draws from a family of distributions.
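The construction described above can be sketched as a small forward simulation. This is only an illustration under assumptions that are not in the comment: the parameter X is taken to be a single city-level mean, the shared parent distribution is taken to be normal, and the names `simulate_hierarchy`, `hyper_mean`, `hyper_sd`, and `noise_sd` are hypothetical.

```python
import numpy as np

def simulate_hierarchy(n_cities, n_per_city, hyper_mean, hyper_sd, noise_sd, rng):
    """Forward-simulate a two-level model (illustrative assumptions only).

    Parent node: the shared distribution Normal(hyper_mean, hyper_sd).
    X_i:         one parameter per city, drawn iid from the parent.
    Data:        observations in city i, drawn around X_i.
    """
    # One draw of X per population, all from the same parent distribution.
    city_params = rng.normal(hyper_mean, hyper_sd, size=n_cities)
    # Within each city, observations scatter around that city's own parameter.
    data = rng.normal(city_params[:, None], noise_sd, size=(n_cities, n_per_city))
    return city_params, data

rng = np.random.default_rng(1)
city_params, data = simulate_hierarchy(
    n_cities=10, n_per_city=50, hyper_mean=0.3, hyper_sd=0.1, noise_sd=0.2, rng=rng
)
```

Setting `hyper_sd` to zero forces every city to share one parameter (complete pooling), while letting it grow without bound makes the cities uninformative about each other (no pooling); values in between correspond to partial pooling.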

Dear Konrad,

You made my day. I am delighted that someone familiar with Hierarchical modeling (HM) took the time to look at our example and compare the two approaches. I am doubly delighted that you actually think that transportability problems can benefit from HM and that combining the two may lead to progress.

I met a few obstacles in trying to follow your synthesis. For example, you speak of “ancestral nodes”, “parameter vector”, and “partial pooling”, which I cannot locate in the transportability problem. Perhaps you can walk me step by step through a simple example, so that I (and many others) can better understand how HM approaches the problem of generalizing across populations.

Here is the example:

We conduct a huge randomized trial in Los Angeles (LA) and estimate the causal effect of treatment X on outcome Y for every age group Z = z. (For the sake of simplicity, assume that the trial had 1 million participants, and that we also measured the age distribution, P(Z=z) in LA.)

The mayor of New York City (NYC), Mr. Bloomberg, comes to us and asks whether it is possible to generalize these results to the population of NYC, with one caveat: for budgetary reasons, no experiments can be funded in NYC. Still, Mr. Bloomberg’s staff gives us access to a huge census data set in NYC, P*(X, Y, Z).

The data immediately alerts us to the fact that age distribution in LA is significantly different from that of NYC. Moreover, we see that, when given the choice (as in NYC), old people are more likely to take the treatment and less likely to benefit from it (as seen in LA).

Our dilemma is whether we can pool information from the two cities to gain a more accurate assessment of the causal effect of the treatment in NYC, without running experiments there.

(We are concerned, of course, that there may be other factors, beside age, that affect both choice of treatment and treatment effectiveness.)
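If one is willing to assume that the age-specific effects measured in LA also hold in NYC (exactly the assumption that the parenthetical concern above calls into question), pooling reduces to reweighting the LA effects by the NYC age distribution: P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z). Below is a minimal sketch of that arithmetic. All numbers are made up for illustration, and the names `effect_la`, `p_z_la`, `p_z_nyc`, and `reweight` are hypothetical.

```python
# Hypothetical numbers, for illustration only (not from any real study).
# Age-specific effects from the LA trial: P(Y=1 | do(X=1), Z=z).
effect_la = {"young": 0.60, "middle": 0.45, "old": 0.20}
# Age distributions: P(Z=z) in LA and P*(Z=z) in NYC.
p_z_la = {"young": 0.50, "middle": 0.30, "old": 0.20}
p_z_nyc = {"young": 0.20, "middle": 0.30, "old": 0.50}

def reweight(effect_by_age, age_dist):
    """Average the age-specific effects over a given age distribution."""
    assert abs(sum(age_dist.values()) - 1.0) < 1e-9, "age distribution must sum to 1"
    return sum(effect_by_age[z] * age_dist[z] for z in effect_by_age)

la_effect = reweight(effect_la, p_z_la)    # population effect in LA
nyc_effect = reweight(effect_la, p_z_nyc)  # LA effects transported to NYC ages
print(f"LA: {la_effect:.3f}, NYC (transported): {nyc_effect:.3f}")
```

With these made-up inputs the older NYC population pulls the transported effect below the LA figure. The calculation says nothing about whether the invariance assumption itself is reasonable.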

—-x—-

I am anxious to learn what the first step should be in the HM analysis. What would you write down? What do we need to assume? How would you proceed to answer Mayor Bloomberg’s question? Can we tell him anything at this point, before seeing the actual data?

I truly appreciate your openness and willingness to broaden our understanding of HM.

eb

1) First a caveat – my description was a bit hasty and not entirely correct – e.g. it is not necessarily true that all ancestral nodes are transportable.

2) Ancestral nodes: I stated that by this I meant nodes that have no parents (is there a standard term for this in graph theory? I guess “parentless” would be better), but of course you need to know what DAG I am talking about. For this, see chapters 8 and 9 of “Pattern recognition and machine learning” by Christopher Bishop – Ch 8 is an account of (non-causal) PGMs (D-separation etc) which you will already be familiar with, but there may be differences in notation and the way the DAGs are set up. I particularly recommend chapter 9, which presents mixture models in the PGM framework. Hierarchical models are equivalent to mixture models, though strangely I have not seen this equivalence pointed out anywhere (it just occurred to me one day that the equations are identical) – I guess everyone just considered it too obvious to state :-) This means that the presentation of mixture models is really also a presentation of HM (but with different motivation and interpretation).

3) With my first step (writing the parameter vectors as ancestral nodes) I had in mind setting up the DAG as in Figure 9.6 of Bishop, where model parameters are treated as (unobservable) variables in the model. Once the model is in this form, I think my claim that all non-ancestral variables are treated as transportable is correct. I think Bishop makes it clear what I mean by parameter vectors.

4) Partial pooling: this is the key concept of HM, and refers to the idea that (a) different populations are described by models that are identical except in that the parameter values are different; (b) instead of treating these parameters as completely independent, they are treated as draws from a shared distribution.

5) Re your example – this is not a great example for HM, because you only have data for two cities (i.e. only two populations). So if you assume a shared distribution for the city-specific parameters, you will not be able to estimate that distribution sensibly (you could only use HM if you were willing to specify it a priori, perhaps after looking at other data sets that provide information about age distributions in cities). If you expanded the example to include many cities, HM would become a more attractive approach. For instance, you might try a model where all city-specific distributions other than age distribution are identical, while age distribution is modeled hierarchically (i.e. city-specific ages are treated as a draw from a distribution for which you specify a parameterization and then estimate the parameters using data from all cities). Whether this is a good model is an empirical question to be evaluated by comparing its fit to data with that of other candidate models, or by techniques such as posterior predictive checking that do not involve other models. At present there is no consensus on exactly how to do this model evaluation given that the potential set of candidate models is endless – this is one area where I think your approach might be able to help.

6) In answer to your final question – we shouldn’t even be _thinking_ of telling him anything before seeing the data. What we end up telling him should be (a) our set of assumptions A seems reasonably well supported by the data; (b) to the extent that A is a reasonable approximation of reality, we conclude B.

Dear Konrad,

it’s kind of you to put effort into explaining to us the workings of hierarchical modeling (HM). The first part of your message clarifies nomenclature (we are thoroughly familiar with PGMs), and the second deals with the specific example of Mr. Bloomberg. Let me start with the second part because it has a potential of clarifying the first.

You wrote that the example brought is not ideal for unleashing the full power of HM. I am interested in the reasons for this incompatibility. Is it because:

1. HM excels in estimating parameters of distributions, and in our case, the distribution is practically known given the one million samples available (both in LA and NYC), which makes HM superfluous.

2. HM excels in estimating common parameters of distributions, and our example demands causal, not statistical, generalizability, which makes HM inapplicable.

3. None of the above, HM can still contribute to the given exercise (even in this simple, two-city setting).

These answers will help us understand under which conditions one can harness the power of HM to assist problems of generalization across populations.

Remark.

Your last comment was that “there is nothing that we can tell Mayor Bloomberg until we see the data”. I believe we can tell him something very important: “yes, Mr. Bloomberg, your problem can be solved, we can get an unbiased estimate of the causal effect in NYC”. I believe this piece of information is important, albeit qualitative, if you compare it with the alternative, e.g., “Sorry, Mr. Bloomberg, we forgot to measure age in LA, hence, you cannot answer your question without bias”.

For newcomers, I repeat Mr. Bloomberg’s dilemma:

We conduct a huge randomized trial in Los Angeles (LA) and estimate the causal effect of treatment X on outcome Y for every age group Z = z. (For the sake of simplicity, assume that the trial had 1 million participants, and that we also measured the age distribution, P(Z=z) in LA.)

The mayor of New York City (NYC), Mr. Bloomberg, comes to us and asks whether it is possible to generalize these results to the population of NYC, with one caveat: for budgetary reasons, no experiments can be funded in NYC. Still, Mr. Bloomberg’s staff gives us access to a huge census data set in NYC, P*(X, Y, Z).

The data immediately alerts us to the fact that age distribution in LA is significantly different from that of NYC. Moreover, we see that, when given the choice (as in NYC), old people are more likely to take the treatment and less likely to benefit from it (as seen in LA).

Our dilemma is whether we can pool information from the two cities to gain a more accurate assessment of the causal effect of the treatment in NYC, without running experiments there. (We are concerned, of course, that there may be other factors, beside age, that affect both choice of treatment and treatment effectiveness.)

—-x—-

I appreciate your help,

eb

Elias:

You write: “I believe we can tell him something very important: ‘yes, Mr. Bloomberg, your problem can be solved, we can get an unbiased estimate of the causal effect in NYC’.”

There’s no way! I was (peripherally) involved in a huge data collection and analysis effort involving the NYC public school system. Real data involve measurement problems, people coming in and out of the sample, gaming the system, etc etc etc. You can forget about unbiased anything.

More generally, I don’t think it’s useful to ask “whether we can pool information from the two cities.” You can do some pooling but cities are different. That is why we do partial pooling. I discussed this issue in general terms in the post here that started this set of threads. The whole idea is to get away from all-or-nothing thinking. Instead of asking whether the two cities are the same or if they can be pooled or if they can be thought of as coming from a common distribution, we accept that cities vary, study this variation, and allow this variation to determine the amount of partial pooling.

For another case, see the meta-analysis example near the end of chapter 5 of Bayesian Data Analysis.

Dear Andrew,

Thank you for joining the discussion. As a matter of fact, we have done exactly what you suggested.

“Instead of asking whether the two cities are the same or if they can be pooled or if they can be thought of as coming from a common distribution, we accept that cities vary, study this variation, “

We have done so. We have accepted from the start that cities vary and, moreover, we have studied this variation thoroughly, and we have even pinned down its source, “age difference”, and quantified it to the last digit by estimating the age distributions in the two cities.

Now we wish to follow your next step:

“allow this variation to determine the amount of partial pooling”,

and asked:

“what should I write down in the first step in the analysis? What do I need to assume? How should I proceed…?”

Please, do not send us to a different example in a book or an article when we are so close to solving this one.

True, real-life data is noisy.

It would be unthinkable to conclude that HM excels when measurements are noisy but helpless when measurements are clean.

I want to continue with our next step:

“allow this variation to determine the amount of partial pooling.”,

how to proceed?

eb

Elias: No.

1. No, there is no unbiased estimation going on in a realistic study of New York or Los Angeles.

2. No, you have not pinned down the source of variation between the two cities, nor is it possible to quantify it to the last digit, nor is there a last digit. The cities differ even after whatever you have controlled for, hence the value of partial pooling. This is discussed in the literature on meta-analysis: it is good to have individual-level and group-level predictors, but then the model should still account for unexplained variation at the individual and group levels.

Dear Andrew,

I gather from your answer that questions of “bias”, “identification”, and “unbiased estimates” are of no interest to HM practitioners, nor are they considered legitimate questions in theoretical HM circles.

I believe the following paragraphs from Manski’s book “Identification Problems in the Social Sciences” (Harvard Univ. Press, 1995) would explain why HM researchers should not dismiss these issues lightly. Specifically revealing is the last sentence: “Negative identification findings imply that statistical inference is fruitless: it makes no sense to try to use a sample of finite size to infer something that could not be learned even if a sample of infinite size were available.”

We happen to have some negative identification findings in our theory. If HM methods fail on any of the negative examples, we would know why, but would HM practitioners? Or HM theorists?

Here is Manski’s passage:

For over a century, methodological research in the social sciences has made productive use of statistical theory (footnote ignored). One supposes that the empirical problem is to infer some feature of a population described by a probability distribution and that the available data are observations extracted from the population by some sampling process. One combines the data with assumptions about the population and the sampling process to draw statistical conclusions about the population feature of interest.

Working within this framework, it is useful to separate the inferential problem into statistical and identification components. Studies of identification seek to characterize the conclusions that could be drawn if one could use the sampling process to obtain an unlimited number of observations. Studies of statistical inference seek to characterize the generally weaker conclusions that can be drawn from a finite number of observations.

Statistical and identification problems limit in distinct ways the conclusions that may be drawn in empirical research. Statistical problems may be severe in small samples but diminish in importance as the sampling process generates more observations. Identification problems cannot be solved by gathering more of the same kind of data. These inferential difficulties can be alleviated only by invoking assumptions or by initiating new sampling processes that yield different kinds of data.

To illustrate, (skipping example…)

Extrapolation is a particularly common and familiar identification problem. (…)

Empirical research must, of course, contend with statistical issues as well as with identification problems. Nevertheless, the two types of inferential difficulties are sufficiently distinct for it to be fruitful to study them separately. The study of identification logically comes first. Negative identification findings imply that statistical inference is fruitless: it makes no sense to try to use a sample of finite size to infer something that could not be learned even if a sample of infinite size were available.

Elias:

I think bias can matter. I just think it’s ridiculous that you think you have an unbiased estimate in your example. To repeat what I wrote before: There’s no way! I was (peripherally) involved in a huge data collection and analysis effort involving the NYC public school system. Real data involve measurement problems, people coming in and out of the sample, gaming the system, etc etc etc. You can forget about unbiased anything.

Dear Andrew,

I don’t see how you can reconcile your previous statement “Bias matters” with your recent one: “I just think it’s ridiculous that you think you have an unbiased estimate in your example.”; “you can forget about unbiased anything”.

Bias analysis invariably demands that we make idealized assumptions, e.g., infinite sample, i.i.d., no measurement error, people not coming in and out of the sample, people not gaming the system, etc etc etc. Therefore, saying: “You can forget about unbiased anything” amounts to saying “Bias matters, but do not analyze it”.

I don’t know where statistics would be today had every student who asked “Does an unbiased estimator exist?” been warned by her/his professor with: “I just think it’s ridiculous that you think you have an unbiased estimate in your example.”; “You can forget about unbiased anything.”

Thanks for giving me the opportunity to interact with you and your bloggers, but for the time being I am more inclined to stick with Manski’s words:

“The study of identification logically comes first….: it makes no sense to try to use a sample of finite size to infer something that could not be learned even if a sample of infinite size were available.”

(I believe that we have reached a good point of semi-understanding after more than 65 messages. I will be more than happy to discuss this or additional issues by email or in the coffee shop. I will be at the UAI Conference next week on Catalina Island.)

eb

Elias:

You write, “Therefore, saying: ‘You can forget about unbiased anything’ amounts to saying ‘Bias matters, but do not analyze it’.”

Uh, no. I am saying that bias matters, it is a good idea to analyze it, and it is not a good idea to tell the mayor or anyone else that you can get an unbiased estimate, because you can’t. Go talk to people at the U.S. Census. They don’t go around saying that they have unbiased estimates. They recognize that they have bias and they try to reduce it.

In our radon project, we dealt with 80,000 biased measurements. We did our best to correct the bias. But I am not so foolish as to claim that our bias is zero.

Dear Andrew,

I am glad you agree that “bias matters, it is a good idea to analyze it”.

I am not sure however if you agree with Manski on the logical priority of bias analysis: He says “The study of identification logically comes first…. it makes no sense to try to use a sample of finite size to infer something that could not be learned even if a sample of infinite size were available.”.

It seems to me that, if you agree with this priority then, in every causal inference task, HM researchers should be waiting for the results of identification analysis before starting the estimation phase. And, in such a case, they would be curious to find out what bias-analysis says about generalizability before applying any multi-level estimation.

On the other hand, if you don’t agree with Manski’s priority, the question arises: WHEN, in your opinion, should bias be analyzed? Should it be after the estimation phase? Perhaps before estimation, but after glancing at the data? And, if it is analyzed, how? With simplifying assumptions or without? And which assumptions would you permit? Which would you forbid?

Finally, I will add an argument in favor of “identification first — estimation second” which also covers the questions raised by Konrad. Identification analysis not only answers the question “is an unbiased estimate possible”, but also provides us with an “estimand”, which should serve as the target in the estimation procedure. Without the proper estimand, statistical estimation can be chasing after the wrong parameter without ever answering the research question at hand (in our example we need the causal effect of X on Y in NYC).

We heard many complaints here about assumptions making the problem “easier”, “essentially solved”, “invalid”, “toy-like”, etc etc etc. Now, suppose our “easy” analysis yields a negative answer, i.e., non-transportable. Would including more realistic assumptions (e.g., “measurement problems, people coming in and out of the sample, gaming the system, etc etc etc.”) make estimation feasible?

A negative result from any of our examples means that no statistical method, no matter how sophisticated or revered, can estimate what needs to be estimated. A negative result actually guarantees us that whatever you are trying to estimate is the wrong thing to estimate — should a researcher engage in estimation before checking the possibility that such negative verdict would be issued by the analysis of bias?

Thus, the priority “identification first — estimation second” is not a matter of convenience or personal preferences; it is a matter of technical necessity. I wonder whether you see this priority reflected in the practice of HM or, if it’s not happening now, whether you think it should be encouraged in the future.

I am really curious about your take on these issues.

Still, if you feel these questions would require too much of your time, I would be quite satisfied with a quick, yes-no answer on whether you agree with Manski’s priority.

Thank you for all your patience,

eb

Dear Elias,

Sorry, I forgot that you are measuring Z directly, so you have no need to estimate the way it varies over cities. So given your assumptions, option 1 is correct: the problem is already solved, and if you really believe your assumptions adding anything to your solution would be superfluous. But in practice I’d go for option 3 because I don’t believe your assumptions.

The key question which I don’t see you addressing is whether the transportability assumption (by which I mean the decision not to add a selection variable to every node of the model) is good. I can think of two approaches for this:

1) You could start with the most general model, which means that every node should have a selection variable. In an example such as yours where we do not know the mechanisms underlying the causal effects, this is the only model that is defensible a priori. You could then use data to decide whether simplifying assumptions (e.g. removing selection variables – i.e. pooling) are empirically justified – in other words you “discover” variables that can be pooled. If complete pooling is not justified, partial pooling might still be – so HM could be useful here. Unfortunately this approach usually has practical problems, e.g. you will have a large number of inestimable parameters.

2) A more common approach in practice is to start with an oversimplified model, knowing that the model is _not_ defensible a priori and expecting the conclusions to be biased. You then proceed by exploring more complex models, hoping to achieve better realism – this is the approach I described before, and which is championed by Andrew. In this framework, you “discover” variables that can _not_ be pooled, in the sense that unpooling them leads to significant improvement in model fit. However, for many variables we can do better by doing partial rather than no pooling: they are different but correlated (probably due to a shared causal effect acting on these variables). This is where HM is very useful.

In your example, you make a strong transportability assumption that does not appear defensible a priori. Both X and Y are strongly affected by causes not represented in the model, and we have no reason to believe that these causes act identically in different cities. So it seems that the 2nd approach applies, and one should investigate whether relaxing transportability assumptions will lead to improved model fit.

HM offers one way of doing this. For instance, suppose you want to investigate whether P(y|do(x),z) should be made city-dependent. I don’t know what modeling approach you have in mind for this quantity, but in practice you might use a parametric form – let’s assume this to keep the discussion simple. So we can pick a parameter W (or a set of parameters) which we think might vary over cities. To put this in a causal framework, we can think of W as a previously unmodeled and not directly observable cause – if we base our choice of W on something concrete that makes sense in the particular example, we are more likely to have success – e.g. maybe W represents the city-wide population frequency of a genotype that affects treatment efficacy and has different frequency in different race groups, which in turn are unequally represented in different cities. Now we can compare complete pooling (W is assumed the same in all cities and estimated once from the pooled data – this is a strong assumption which may be badly wrong) with no pooling (W is estimated separately for every city – this is based on an independence assumption which may result in discarding relevant information). But we can also try the HM approach of partial pooling: pick a shared parametric distribution for W and estimate _its_ parameters from the pooled data: this may be a better model than either complete or no pooling. After all, if we know W in 10 cities we may feel that we have a ball park idea of what to expect in the 11th – the first 10 estimates are not uninformative.

For HM to be applicable as above, you need: 1) W should not be directly observable; 2) either enough cities to make the distribution of W estimable, or an independent way of estimating the distribution of W.
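The three options in the paragraph on W (complete pooling, no pooling, partial pooling) can be compared in a few lines. The sketch below makes assumptions that go beyond the comment: each city supplies a noisy estimate of W with a known standard error, the shared distribution of W is normal, and its spread is estimated by a simple method-of-moments (DerSimonian-Laird) formula rather than a full Bayesian fit. The data and the name `partial_pool` are hypothetical.

```python
import numpy as np

def partial_pool(y, se):
    """Empirical-Bayes partial pooling for a normal-normal model (a sketch).

    y  : per-city estimates of W (the no-pooling answers)
    se : their standard errors, treated as known
    """
    y = np.asarray(y, dtype=float)
    se = np.asarray(se, dtype=float)
    w = 1.0 / se**2
    pooled = np.sum(w * y) / np.sum(w)           # complete pooling: one shared W
    # Method-of-moments estimate of the between-city variance tau^2.
    k = len(y)
    q = np.sum(w * (y - pooled) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Mean of the shared distribution, weighting by total variance.
    w_star = 1.0 / (se**2 + tau2)
    mu = np.sum(w_star * y) / np.sum(w_star)
    # Each city is pulled toward mu; noisier cities are pulled harder.
    weight_on_own = tau2 / (tau2 + se**2)
    partial = weight_on_own * y + (1.0 - weight_on_own) * mu
    return pooled, tau2, mu, partial

# Made-up estimates of W for ten cities, for illustration only.
y = np.array([0.10, 0.35, 0.22, 0.50, 0.05, 0.30, 0.41, 0.18, 0.27, 0.33])
se = np.array([0.05, 0.10, 0.04, 0.15, 0.06, 0.05, 0.12, 0.04, 0.08, 0.05])
pooled, tau2, mu, partial = partial_pool(y, se)
print("complete pooling:", round(pooled, 3))
print("no pooling:      ", y)
print("partial pooling: ", np.round(partial, 3))
```

When the estimated between-city variance is zero the partial-pooling estimates collapse to the complete-pooling value, and as it grows they approach the separate per-city estimates, so the data decide how much pooling is done.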

Re your remark – I see two problems:

1) You are assuming you have an unbiased estimate for LA, which is probably false.

2) You are making an untested transportability assumption: whether this is a good assumption is an empirical question that can only be evaluated using NYC data (or alternatively, data from many cities that need not include NYC). A priori, we should assume the assumption is false, but the data may show that it is reasonable in this case.

In summary:

1) Transportability assumptions should be investigated empirically.

2) When relaxing transportability assumptions, HM offers a way of keeping “weak transportability”.

Dear Konrad,

You wrote: “…given your assumptions, option 1 is correct: the problem is already solved, ”. To refresh our memories, option 1 read:

1. Hierarchical Modeling (HM) excels in estimating parameters of distributions, and in our case, the distribution is practically known given the one million samples available (both in LA and NYC), which makes HM superfluous.

Do I understand you to say that HM cannot contribute to the problem and, therefore, that our solution cannot benefit from HM?

Or, that the problem is already solved and there is no need for our solution?

eb

Elias wrote:

“It would be unthinkable to conclude that HM excels when measurements are noisy but helpless when measurements are clean.”

“that the problem is already solved and there is no need for our solution”.

Your comments get at the heart of the matter here. Elias, my sense of your approach is that you are attempting to take thought experiments addressing cause-and-effect relationships out of the ether of hypotheticals and into the real world (1 million samples from a randomized experiment, e.g.). Of course, possession of perfect information (certain parameter distributions) removes the need for statistical analysis along the lines of HM. Perhaps this is even possible in areas that examine systems entirely of artificial origin, synthetic biology e.g., but research into human societies and ecology will continue to need approaches like HM to address issues of transportability, and cause and effect more generally. In my estimation, a synthetic causal analysis framework that incorporates insights from both HM and your approach will bear the greatest causal fruit.

Dear JSB,

thanks for your thoughts, but see my last response to Andrew. Identification comes first, estimation second, and all these warnings about finite data, messy measurements, etc, must be postponed in the first stage.

So why is everybody advising me to look at HM at this stage? Can’t it wait for its logical turn? I agree with your estimation that a “synthetic causal analysis framework that incorporates insights from both HM and [my] approach will bear the greatest causal fruit.”

But, given that people immersed in HM have zero interest in the logical first step, I don’t know what to expect – it should be of some interest at least to the theoreticians of HM.

I suspect it is because, on the whole, people here do _not_ agree with Manski. Here are some possible reasons:

1) In real problems, it is important for modelers to have a good knowledge and intuition of the application at hand, otherwise they are likely to construct poor models. An unidentifiable model is an extreme example of a _very_ poor model – one hopes that most modelers have a sufficient idea of what they’re doing that they are not at much risk of constructing unidentifiable models.

2) Demonstrating identifiability in a context of infinite data, clean measurements, etc tells us nothing of relevance to the (finite, messy) problem at hand. It doesn’t rule out any of the _likely_ problems, such as model misspecification or insufficiently informative data.

3) While a theoretical demonstration of unidentifiability can, in principle, save a lot of wasted effort, it is not strictly necessary – if the model doesn’t work (due to unidentifiability, insufficient data, insufficiently informative data, or whatever reason) the problem becomes clear soon enough. There is an advantage to learning this the “hard” way – in the process, the modeler gets a much better understanding of the application area and of how his/her intuition was at fault. In most cases, this will lead to better models being proposed in future.

4) In Bayesian methodology, unidentifiability need not always be a show-stopper. So what if your posterior distribution has a ridge, when you’re not trying to optimize?

Dear Elias,

You write as if you think HM is a method of analysis – it is not. Rather, it is an approach for constructing more sophisticated models. Or, to put it differently, for investigating (and usually relaxing) model assumptions.

As I understand your paper, you are including these hard assumptions as part of the problem statement:

1) causal structure in both LA and NYC are known, and known to be identical

2) quantities like P(y|do(x),z) etc are _known_

3) the selection variable governing Z is the _only_ one, therefore all other quantities are transportable

If we interpret these assumptions as part of the problem statement, it is already solved (using your method, of which the validity is not being questioned). So we have a solved problem, and there is nothing more to do. Including crisp assumptions in the problem statement makes it a toy problem with a crisp solution.

However, if the assumptions are not part of the problem statement but part of your proposed solution the problem turns into a real world problem and the situation is different. In this case the problem can never be considered solved because we are never sure that we have found the best set of assumptions. Certainly the above assumptions can only be considered as a starting point, and the real-world problem is to build a better set of assumptions from there. I agree with others here that HM can contribute to that process.

Re identification and bias:

Those are indeed legitimate questions, but they are not the questions we were discussing here. Unidentifiability is relevant whenever it can be demonstrated. The bias-variance tradeoff is relevant whenever one works with point estimates, but considering bias on its own without also considering variance can be very misleading. An unbiased estimator with high variance may be much worse than a biased estimator with low variance: consider the textbook example of the biased ML estimator of the variance of the normal distribution vs the unbiased version – I don’t know of any practical situations where the unbiased estimator is preferable.
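The textbook example can be checked directly. For n iid normal observations with variance σ², the ML estimator (divide by n) has mean squared error (2n − 1)σ⁴/n², while the unbiased estimator (divide by n − 1) has mean squared error 2σ⁴/(n − 1). The sketch below evaluates both formulas and confirms them by simulation; the function names and the choices of n, number of replications, and seed are arbitrary.

```python
import numpy as np

def exact_mse(n, sigma2=1.0):
    """Closed-form MSE of the two variance estimators for normal data."""
    mse_ml = (2 * n - 1) * sigma2**2 / n**2   # divide by n: biased, lower variance
    mse_unbiased = 2 * sigma2**2 / (n - 1)    # divide by n - 1: unbiased
    return mse_ml, mse_unbiased

def simulated_mse(n, sigma2=1.0, reps=200_000, seed=0):
    """Monte Carlo check of the same two quantities."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    ml = np.var(x, axis=1, ddof=0)        # ML estimator
    unbiased = np.var(x, axis=1, ddof=1)  # unbiased estimator
    return np.mean((ml - sigma2) ** 2), np.mean((unbiased - sigma2) ** 2)

for n in (2, 5, 20, 100):
    ml, unb = exact_mse(n)
    print(f"n={n:3d}  MSE(ML)={ml:.4f}  MSE(unbiased)={unb:.4f}")
print("simulation, n=5:", simulated_mse(5))
```

The ML estimator has the smaller mean squared error for every n ≥ 2, and the gap shrinks as n grows.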
