More on Pearl’s and Rubin’s frameworks for causal inference

To follow up on yesterday’s discussion, I wanted to go through a bunch of different issues involving graphical modeling and causal inference.

Contents:
– A practical issue: poststratification
– 3 kinds of graphs
– Minimal Pearl and Minimal Rubin
– Getting the most out of Minimal Pearl and Minimal Rubin
– Conceptual differences between Pearl’s and Rubin’s models
– Controlling for intermediate outcomes
– Statistical models are based on assumptions
– In defense of taste
– Argument from authority?
– How could these issues be resolved?
– Holes everywhere
– What I can contribute

Judea Pearl and I briefly discuss extrapolation, causal inference, and hierarchical modeling

OK, I guess it looks like the Buzzfeed-style headlines are officially over. Anyway, Judea Pearl writes: I missed the discussion you had here about Econometrics: Instrument locally, extrapolate globally, which also touched on my work with Elias Bareinboim. So, please …

Judea Pearl overview on causal inference, and more general thoughts on the reexpression of existing methods by considering their implicit assumptions

This material should be familiar to many of you but could be helpful to newcomers. Pearl writes: ALL causal conclusions in nonexperimental settings must be based on untested, judgmental assumptions that investigators are prepared to defend on scientific grounds. …

Causality and Statistical Learning

[The following is a review essay invited by the American Journal of Sociology. Details and acknowledgments appear at the end.]

In social science we are sometimes in the position of studying descriptive questions (for example: In what places do working-class whites vote for Republicans? In what eras has social mobility been higher in the United States than in Europe? In what social settings are different sorts of people more likely to act strategically?). Answering descriptive questions is not easy and involves issues of data collection, data analysis, and measurement (how should one define concepts such as “working-class whites,” “social mobility,” and “strategic”?), but is uncontroversial from a statistical standpoint.

All becomes more difficult when we shift our focus from What to What-if and Why.

Thinking about causal inference

Consider two broad classes of inferential questions:

1. Forward causal inference. What might happen if we do X? What are the effects of smoking on health, the effects of schooling on knowledge, the effect of campaigns on election outcomes, and so forth?

2. Reverse causal inference. What causes Y? Why do more attractive people earn more money, why do many poor people vote for Republicans and rich people vote for Democrats, why did the economy collapse?

In forward reasoning, the potential treatments under study are chosen ahead of time, whereas, in reverse reasoning, the research goal is to find and assess the importance of the causes. The distinction between forward and reverse reasoning (also called “the effects of causes” and the “causes of effects”) was made by Mill (1843). Forward causation is a pretty clearly defined problem, and there is a consensus that it can be modeled using the counterfactual or potential-outcome notation associated with Neyman (1923) and Rubin (1974) and expressed using graphical models by Pearl (2009): the causal effect of a treatment T on an outcome Y for an individual person (say) is a comparison between the value of Y that would’ve been observed had the person received the treatment and the value that would’ve been observed under the control; in many contexts, the treatment effect for person i is defined as the difference, Yi(T=1) – Yi(T=0). Many common techniques, such as differences in differences, linear regression, and instrumental variables, can be viewed as estimating average causal effects under this definition.
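
As a toy illustration of this definition, here is a minimal simulation sketch (all numbers invented, not drawn from any of the cited papers): each unit has two potential outcomes, only one of which can ever be observed, and randomization is what makes the simple difference in means an unbiased estimate of the average of Yi(T=1) – Yi(T=0).

```python
# A toy simulation of the potential-outcome definition of a causal effect.
# All numbers are invented for illustration: each unit i has two potential
# outcomes, y0[i] and y1[i], and we observe only the one corresponding to
# its (randomized) treatment assignment.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
y0 = rng.normal(0.0, 1.0, n)           # potential outcome under control, Yi(T=0)
y1 = y0 + rng.normal(0.3, 0.2, n)      # potential outcome under treatment, Yi(T=1)

true_ate = np.mean(y1 - y0)            # knowable only because this is a simulation

t = rng.integers(0, 2, n)              # randomized treatment assignment
y_obs = np.where(t == 1, y1, y0)       # only one potential outcome is observed per unit

est_ate = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(f"true average effect: {true_ate:.3f}, difference-in-means estimate: {est_ate:.3f}")
```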

In the social sciences, where it is generally not possible to try more than one treatment on the same unit (and, even when this is possible, there is the possibility of contamination from past exposure and changes in the unit or the treatment over time), questions of forward causation are most directly studied using randomization or so-called natural experiments (see Angrist and Pischke, 2008, for discussion and many examples). In some settings, crossover designs can be used to estimate individual causal effects, if one accepts certain assumptions about treatment effects being bounded in time. Heckman (2006), pointing to the difficulty of generalizing from experimental to real-world settings, argues that randomization is not any sort of “gold standard” of causal inference, but this is a minority position: I believe that most social scientists and policy analysts would be thrilled to have randomized experiments for their forward-causal questions, even while recognizing that subject-matter models are needed to make useful inferences from any experimental or observational study.

Reverse causal inference is another story. As has long been realized, the effects of action X flow naturally forward in time, while the causes of outcome Y cannot be so clearly traced backward. Did the North Vietnamese win the American War because of the Tet Offensive, or because of American public opinion, or because of the skills of General Giap, or because of the political skills of Ho Chi Minh, or because of the conflicted motivations of Henry Kissinger, or because of Vietnam’s rough terrain, or . . .? To ask such a question is to reveal the impossibility of answering it. On the other hand, questions such as “Why do whites do better than blacks in school?”, while difficult, do not seem inherently unanswerable or meaningless.

We can imagine going backward in the causal chain, accounting for more and more factors until the difference under study disappears–that is, until it is “explained” by the causal predictors. Such an activity can be tricky–hence the motivation for statistical procedures for studying causal paths–and it is often ultimately formulated in terms of forward causal questions: causal effects that add up to explain the Why question that was originally asked. Reverse causal questions are often more interesting and motivate much, perhaps most, social science research; forward causal research is more limited and less generalizable but is more doable. So we all end up going back and forth on this.

We see three difficult problems in causal inference: …

How to think about how to think about causality

In suggesting “a socially responsible method of announcing associations,” AT points out that, as much as we try to be rigorous about causal inference, assumptions slip in through our language:

The trouble is, causal claims have an order to them (like “aliens cause cancer”), and so do most if not all human sentences (“I like ice cream”). It’s all too tempting to read a non-directional association claim as if it were so — my (least) favourite was a radio blowhard who said that in teens, cellphone use was linked with sexual activity, and without skipping a beat angrily proclaimed that giving kids a cell phone was tantamount to exposing them to STDs. . . . So here’s a modest proposal: when possible, beat back the causal assumption by presenting an associational idea in the order least likely to be given a causal interpretation by a layperson or radio host.

Here’s AT’s example:

A random Google News headline reads: “Prolonged Use of Pacifier Linked to Speech Problems” and strongly implies a cause and effect relationship, despite the (weak) disclaimer from the quoted authors. Reverse that and you’ve got “Speech Problems linked to Prolonged Use of Pacifier” which is less insinuating.

It’s an interesting idea, and it reminds me of something that really bugs me.

Bayesian statistics then and now

The following is a discussion of articles by Brad Efron and Rob Kass, to appear in the journal Statistical Science. I don’t really have permission to upload their articles, but I think (hope?) this discussion will be of general interest and will motivate some of you to read the articles themselves when they come out. (And thanks to Jimmy and others for pointing out typos in my original version!)

It is always a pleasure to hear Brad Efron’s thoughts on the next century of statistics, especially considering the huge influence he’s had on the field’s present state and future directions, both in model-based and nonparametric inference.

Three meta-principles of statistics

Before going on, I’d like to state three meta-principles of statistics which I think are relevant to the current discussion.

First, the information principle, which is that the key to a good statistical method is not its underlying philosophy or mathematical reasoning, but rather what information the method allows us to use. Good methods make use of more information. This can happen in different ways: in my own experience (following the lead of Efron and Morris, 1973, among others), hierarchical Bayes allows us to combine different data sources and weight them appropriately using partial pooling. Other statisticians find parametric Bayes too restrictive: in practice, parametric modeling typically comes down to conventional models such as the normal and gamma distributions, and the resulting inference does not take advantage of distributional information beyond the first two moments of the data. Such problems motivate more elaborate models, which raise new concerns about overfitting, and so on.
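
As a minimal sketch of what partial pooling does mechanically, here is the standard normal-model shrinkage calculation, assuming for simplicity that the group-level standard deviation tau is known (in a full hierarchical Bayes analysis it would be estimated from the data); the estimates and standard errors are invented.

```python
# A minimal sketch of partial pooling across data sources in a normal
# hierarchical model, with the between-group sd tau treated as known.
# All numbers are invented for illustration.
import numpy as np

def partial_pool(y, se, tau):
    """Shrink noisy estimates y (with standard errors se) toward their
    precision-weighted grand mean, using the usual normal-normal formula."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    w = 1.0 / (se**2 + tau**2)              # precision of each estimate
    mu = np.sum(w * y) / np.sum(w)          # estimated grand mean
    shrink = se**2 / (se**2 + tau**2)       # how far each estimate moves toward mu
    return (1 - shrink) * y + shrink * mu

# Three data sources: the two precisely measured ones barely move, while the
# noisy third one is pulled strongly toward the grand mean.
print(partial_pool(y=[0.2, 1.5, -0.8], se=[0.1, 0.1, 1.0], tau=0.5))
```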

As in many areas of mathematics, theory and practice leapfrog each other: as Efron notes, empirical Bayes methods have made great practical advances but “have yet to form into a coherent theory.” In the past few decades, however, with the work of Lindley and Smith (1972) and many others, empirical Bayes has been folded into hierarchical Bayes, which is part of a coherent theory that includes inference, model checking, and data collection (at least in my own view, as represented in chapters 6 and 7 of Gelman et al., 2003). Other times, theoretical and even computational advances lead to practical breakthroughs, as Efron illustrates in his discussion of the progress made in genetic analysis following the Benjamini and Hochberg paper on false discovery rates.

My second meta-principle of statistics is the methodological attribution problem, which is that the many useful contributions of a good statistical consultant, or collaborator, will often be attributed to the statistician’s methods or philosophy rather than to the artful efforts of the statistician himself or herself. Don Rubin has told me that scientists are fundamentally Bayesian (even if they don’t realize it), in that they interpret uncertainty intervals Bayesianly. Brad Efron has talked vividly about how his scientific collaborators find permutation tests and p-values to be the most convincing form of evidence. Judea Pearl assures me that graphical models describe how people really think about causality. And so on. I’m sure that all these accomplished researchers, and many more, are describing their experiences accurately. Rubin wielding a posterior distribution is a powerful thing, as is Efron with a permutation test or Pearl with a graphical model, and I believe that (a) all three can be helping people solve real scientific problems, and (b) it is natural for their collaborators to attribute some of these researchers’ creativity to their methods.

The result is that each of us tends to come away from a collaboration or consulting experience with the warm feeling that our methods really work, and that they represent how scientists really think. In stating this, I’m not trying to espouse some sort of empty pluralism–the claim that, for example, we’d be doing just as well if we were all using fuzzy sets, or correspondence analysis, or some other obscure statistical method. There’s certainly a reason that methodological advances are made, and this reason is typically that existing methods have their failings. Nonetheless, I think we all have to be careful about inferring too much from our collaborators’ and clients’ satisfaction with our methods.

My third meta-principle is that different applications demand different philosophies. This principle comes up for me in Efron’s discussion of hypothesis testing and the so-called false discovery rate, which I label as “so-called” for the following reason. In Efron’s formulation (which follows the classical multiple comparisons literature), a “false discovery” is a zero effect that is identified as nonzero, whereas, in my own work, I never study zero effects. The effects I study are sometimes small but it would be silly, for example, to suppose that the difference in voting patterns of men and women (after controlling for some other variables) could be exactly zero. My problems with the “false discovery” formulation are partly a matter of taste, I’m sure, but I believe they also arise from the difference between problems in genetics (in which some genes really have essentially zero effects on some traits, so that the classical hypothesis-testing model is plausible) and in social science and environmental health (where essentially everything is connected to everything else, and effect sizes follow a continuous distribution rather than a mix of large effects and near-exact zeroes).

To me, the false discovery rate is the latest flavor-of-the-month attempt to make the Bayesian omelette without breaking the eggs. As such, it can work fine: if the implicit prior is reasonable, it can be a great method. But I really don’t like it as an underlying principle, as it is formally based on a hypothesis-testing framework that, to me, is more trouble than it’s worth. In thinking about multiple comparisons in my own research, I prefer to discuss errors of Type S and Type M rather than Type 1 and Type 2 (Gelman and Tuerlinckx, 2000; Gelman and Weakliem, 2009; Gelman, Hill, and Yajima, 2009). My point here, though, is simply that any given statistical concept will make more sense in some settings than others.
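
For readers who haven’t seen these terms, here is a minimal simulation sketch, with invented numbers, of a Type S (sign) error and a Type M (magnitude, or exaggeration) error for a single noisy comparison: condition on statistical significance and ask how often the estimate has the wrong sign, and by how much it overstates the true effect on average.

```python
# A minimal sketch of Type S and Type M error rates for a normally
# distributed estimate; the true effect and standard error are invented.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 0.1, 0.5                       # small true effect, noisy estimate
est = rng.normal(theta, sigma, size=1_000_000)

signif = np.abs(est) > 1.96 * sigma                         # "statistically significant"
type_s = np.mean(np.sign(est[signif]) != np.sign(theta))    # wrong sign, given significance
type_m = np.mean(np.abs(est[signif])) / abs(theta)          # exaggeration ratio, given significance

print(f"P(significant) = {signif.mean():.3f}")
print(f"Type S error rate = {type_s:.3f}")
print(f"Type M exaggeration factor = {type_m:.1f}")
```

In a setting like this one, with a small true effect and a noisy estimate, the estimates that clear the significance threshold are frequently of the wrong sign and, on average, greatly exaggerated in magnitude.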

For another example of how different areas of application merit different sorts of statistical thinking, consider Rob Kass’s remark: “I tell my students in neurobiology that in claiming statistical significance I get nervous unless the p-value is much smaller than .01.” In political science, we’re typically not aiming for that level of uncertainty. (Just to get a sense of the scale of things, there have been barely 100 national elections in all of U.S. history, and political scientists studying the modern era typically start in 1946.)

Progress in parametric Bayesian inference

I also think that Efron is doing parametric Bayesian inference a disservice by focusing on a fun little baseball example that he and Morris worked on 35 years ago. If he would look at what’s being done now, he’d see all the good statistical practice that, in his section 10, he naively (I think) attributes to “frequentism.” Figure 1 illustrates with a grid of maps of public opinion by state, estimated from national survey data. Fitting this model took a lot of effort, which was made possible by working within a hierarchical regression framework–“a good set of work rules,” to use Efron’s expression. Similar models have been used recently to study opinion trends in other areas, such as gay rights, in which policy is made at the state level, and so we want to understand opinions by state as well (Lax and Phillips, 2009).

I also completely disagree with Efron’s claim that frequentism (whatever that is) is “fundamentally conservative.” One thing that “frequentism” absolutely encourages is for people to use horrible, noisy estimates out of a fear of “bias.” More generally, as discussed by Gelman and Jakulin (2007), Bayesian inference is conservative in that it goes with what is already known, unless the new data force a change. In contrast, unbiased estimates and other unregularized classical procedures are noisy and get jerked around by whatever data happen to come by–not really a conservative thing at all. To make this argument more formal, consider the multiple comparisons problem. Classical unbiased comparisons are noisy and must be adjusted to avoid overinterpretation; in contrast, hierarchical Bayes estimates of comparisons are conservative (when two parameters are pulled toward a common mean, their difference is pulled toward zero) and less likely to appear to be statistically significant (Gelman and Tuerlinckx, 2000).
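
To spell out the algebra behind the parenthetical claim (a small worked example, assuming for simplicity a common shrinkage factor s for the two estimates being compared): partial pooling gives

$$\hat{\theta}_j = (1 - s)\, y_j + s\, \bar{y}, \qquad 0 \le s \le 1,$$

so the estimated comparison is

$$\hat{\theta}_1 - \hat{\theta}_2 = (1 - s)\,(y_1 - y_2),$$

that is, the raw comparison scaled down by the factor 1 - s. More pooling (larger s) pulls the estimated difference further toward zero, which is the sense in which these comparisons are conservative.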

Another way to understand this is to consider the “machine learning” problem of estimating the probability of an event on which we have very little direct data. The most conservative stance is to assign a probability of ½; the next-most-conservative approach might be to use some highly smoothed estimate based on averaging a large amount of data; and the unbiased estimate based on the local data is hardly conservative at all! Figure 1 illustrates our conservative estimate of public opinion on school vouchers. We prefer this to a noisy, implausible map of unbiased estimates.
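
A minimal numerical sketch of those three stances, with invented counts (a handful of local observations plus a much larger pool of related data standing in for the smoothing):

```python
# Three estimates of an event probability from very little local data.
# All counts are invented for illustration.
y_local, n_local = 2, 3          # 2 events in 3 local trials
y_pool,  n_pool  = 300, 1000     # a much larger pool of related data

p_conservative = 0.5                       # ignore the data entirely
p_unbiased     = y_local / n_local         # raw local estimate: very noisy
# Heavily smoothed estimate: treat the pooled rate as a strong Beta(a, b) prior
# (prior "sample size" n_pool) and update it with the local data.
a, b = y_pool, n_pool - y_pool
p_smoothed = (a + y_local) / (a + b + n_local)

print(p_conservative, round(p_unbiased, 3), round(p_smoothed, 3))
# The smoothed estimate barely moves from the pooled rate of 0.30, while the
# "unbiased" local estimate jumps to 2/3 on the basis of three observations.
```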

Of course, frequentism is a big tent and can be interpreted to include all sorts of estimates, up to and including whatever Bayesian thing I happen to be doing this week–to make any estimate “frequentist,” one just needs to do whatever combination of theory and simulation is necessary to get a sense of the method’s performance under repeated sampling. So maybe Efron and I are in agreement in practice: any method is worth considering if it works, but it might take some work to see whether a given method really does work.
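
For instance, here is a minimal sketch (with invented numbers) of that kind of repeated-sampling check, applied to a simple Bayesian shrinkage estimate: simulate many replications at a fixed true value and look at the estimator’s bias and the frequency coverage of its posterior intervals.

```python
# A minimal repeated-sampling check of a Bayesian shrinkage estimator in the
# simplest normal model: y ~ N(theta, sigma^2) with a N(0, tau^2) prior on theta.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, tau = 0.3, 1.0, 0.5      # true effect, data sd, prior sd
n_reps = 100_000

y = rng.normal(theta, sigma, n_reps)                  # one observation per replication
shrink = tau**2 / (tau**2 + sigma**2)
est = shrink * y                                      # posterior mean of theta given y
post_sd = np.sqrt(shrink) * sigma                     # posterior sd of theta given y
coverage = np.mean(np.abs(est - theta) < 1.96 * post_sd)

print(f"repeated-sampling bias of the posterior mean: {est.mean() - theta:.3f}")
print(f"frequency coverage of the central 95% posterior interval: {coverage:.3f}")
```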

Comments on Kass’s comments

Before writing this discussion, I also had the opportunity to read Rob Kass’s comments on Efron’s article.

I pretty much agree with Kass’s points, except for his claim that most of Bayes is essentially maximum likelihood estimation. Multilevel modeling is only approximately maximum likelihood if you follow Efron and Morris’s empirical Bayesian formulation in which you average over intermediate parameters and maximize over hyperparameters, as I gather Kass has in mind. But then this makes “maximum likelihood” a matter of judgment: what exactly is a hyperparameter? Things get tricky with mixture models and the like. I guess what I’m saying is that maximum likelihood, like many classical methods, works pretty well in practice only because practitioners interpret the methods flexibly and don’t do the really stupid versions (such as joint maximization of parameters and hyperparameters) that are allowed by the theory.
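
To make the distinction concrete, here is a minimal sketch (with invented data, not the example Kass or Efron has in mind) in the simplest normal hierarchical model: the empirical-Bayes route maximizes the marginal likelihood of the hyperparameters (mu, tau) after averaging over the group-level parameters theta_j, whereas naive joint maximization over parameters and hyperparameters is degenerate, collapsing to zero group-level variance.

```python
# Marginal (empirical-Bayes) maximum likelihood vs. naive joint maximization in
# the model y_j ~ N(theta_j, sigma_j^2), theta_j ~ N(mu, tau^2).
# The group estimates and standard errors below are invented.
import numpy as np

y     = np.array([12., -5., 20., 3., -8., 15.])   # group estimates (invented)
sigma = np.array([4., 4., 5., 4., 5., 4.])        # their (assumed known) standard errors

def profile_marginal_loglik(tau):
    """Marginal log-likelihood of tau, with mu profiled out: y_j ~ N(mu, sigma_j^2 + tau^2)."""
    v = sigma**2 + tau**2
    mu_hat = np.sum(y / v) / np.sum(1.0 / v)      # precision-weighted estimate of mu
    return -0.5 * np.sum(np.log(v) + (y - mu_hat)**2 / v)

taus = np.linspace(0.01, 30.0, 3000)
tau_hat = taus[np.argmax([profile_marginal_loglik(t) for t in taus])]
print("empirical-Bayes (marginal ML) estimate of tau:", round(tau_hat, 1))

def joint_logdensity(tau):
    """Joint log-density of (y, theta) evaluated at the feasible point
    theta_j = mu = mean(y): the group-level term contributes -n*log(tau),
    which diverges as tau -> 0, so joint maximization collapses to tau = 0."""
    mu = y.mean()
    return -0.5 * np.sum((y - mu)**2 / sigma**2) - len(y) * np.log(tau)

for t in (1.0, 0.1, 0.001):
    print(f"joint log-density at tau = {t}: {joint_logdensity(t):.1f}")
```

The point of the sketch is just that “maximum likelihood” here has to mean marginal maximum likelihood; the joint version that the theory nominally allows is exactly the sort of really stupid version mentioned above.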

Regarding the difficulties of combining evidence across species (in Kass’s discussion of the DuMouchel and Harris paper), one point here is that this works best when the parameters have a real-world meaning. This is a point that became clear to me in my work in toxicology (Gelman, Bois, and Jiang, 1996): when you have a model whose parameters have merely numerical interpretations (“mean,” “scale,” “curvature,” and so forth), it can be hard to get useful priors for them, but when the parameters have substantive interpretations (“blood flow,” “equilibrium concentration,” etc.), then this opens the door for real prior information. And, in a hierarchical context, “real prior information” doesn’t have to mean a specific, pre-assigned prior; rather, it can refer to a model in which the parameters have a group-level distribution. The more real-worldy the parameters are, the more likely it is that this group-level distribution can be modeled accurately. And the smaller the group-level error, the more partial pooling you’ll get and the more effective your Bayesian inference is. To me, this is the real connection between scientific modeling and the mechanics of Bayesian smoothing, and Kass alludes to some of this in the final paragraph of his comment.

Hal Stern once said that the big divide in statistics is not between Bayesians and non-Bayesians but rather between modelers and non-modelers. And, indeed, in many of my Bayesian applications, the big benefit has come from the likelihood. But sometimes that is because we are careful in deciding what part of the model is “the likelihood.” Nowadays, this is starting to have real practical consequences even in Bayesian inference, with methods such as DIC, Bayes factors, and posterior predictive checks, all of whose definitions depend crucially on how the model is partitioned into likelihood, prior, and hyperprior distributions.

On one hand, I’m impressed by modern machine-learning methods that process huge datasets with minimal assumptions; on the other hand, I agree with Kass’s concluding remarks, which emphasize that statistical methods are most powerful when they are connected to the particular substantive question being studied. I agree that statistical theory is far from settled, and I agree with Kass that developments in Bayesian modeling are a promising way to move forward.