A counterexample to the potential-outcomes model for causal inference

Something came up where I realized I was wrong. It wasn’t a mathematical error; it was a statistical model that was misleading when I tried to align it with reality. And this made me realize that there was something I was misunderstanding about potential outcomes and causal inference. And then I thought: If I’m confused, maybe some of you will be confused about this too, so maybe it would be helpful for me to work this through out loud.

Here’s the story.

It started with a discussion about the effectiveness of an experimental coronavirus treatment. The doctor wanted to design the experiment to detect an effect of 0.25—that is, he hypothesized that survival rate would be 25 percentage points higher in the treatment group than among the controls. For example, maybe the survival rate would be 20% in the control group and 45% in the treated group. Or 40% and 65%, or whatever.

I was trying to figure out what this meant, and I framed it in terms of potential outcomes. Consider four types of patients:
1. The people who would survive under the treatment and would survive under the control,
2. The people who would survive under the treatment but would die under the control,
3. The people who would die under the treatment but would survive under the control,
4. The people who would die either way.
Then the assumption is that p2 – p3 = 0.25, where pk is the proportion of type-k patients in the study.

For simplicity, suppose that p3=0, so that the treatment can save your life but it can’t kill you. Then the effect of the treatment in the study is simply the proportion of type 2 people in the experiment. That’s it. Nothing more and nothing less. To “win” in designing this sort of experiment, you want to lasso as many type 2 people as possible into your study and minimize the number of type 1 and type 4 people. (And you really want to minimize the number of type 3 people, but here we’re assuming they don’t exist.)

I liked this way of thinking about the problem, partly because it connected analysis back to design and partly because it made it super-clear that there is no Platonic “treatment effect” here. The treatment effect depends entirely on who’s in the study, full stop. The treatment is what it is, but its effect relies on the experiment including enough type 2 people. This also makes it clear that there’s nothing special about the hypothesized effect of 25 percentage points, as the exact same treatment would have an effect of 10 percentage points, say, in a study that’s diluted with type 1 and type 4 people (those who don’t need the treatment, or those for whom the treatment wouldn’t help anyway).
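To make this concrete, here’s a minimal simulation sketch (the type proportions are made up) of a randomized trial under the four-type model: with p3 = 0, the estimated risk difference is just the share of type 2 people in the study.

# Minimal simulation of the four-type model above; type proportions are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# proportions of types 1-4: survive either way, saved by treatment,
# killed by treatment, die either way
p = [0.30, 0.25, 0.00, 0.45]
types = rng.choice([1, 2, 3, 4], size=n, p=p)

# deterministic potential outcomes implied by each type
y_control = np.isin(types, [1, 3]).astype(int)   # survive under control
y_treated = np.isin(types, [1, 2]).astype(int)   # survive under treatment

z = rng.integers(0, 2, size=n)                   # randomized assignment
y_obs = np.where(z == 1, y_treated, y_control)

est = y_obs[z == 1].mean() - y_obs[z == 0].mean()
print(f"estimated risk difference: {est:.3f} (true p2 - p3 = {p[1] - p[2]:.2f})")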

I shared the above story in a talk at Memorial Sloan Kettering Cancer Center, and afterward Andrew Vickers sent me a note:

You gave a COVID story arguing that the effect size is a function of the population in a trial and how there are four types of patients (always die, always survive, die unless they take drug, die only if they take drug). This idea has recently been termed “heterogeneity of treatment effect” and I was part of a PCORI panel that wrote two papers in the Annals of Internal Medicine on this subject (here and here).

In brief, you characterized the issue as one of interaction (drug will work for patient A due to nature of their disease or genetics, won’t work for patient B because their disease or genetics are different). The PCORI panel focused instead on the idea of baseline risk: if a drug halves your risk of death, that means a greater absolute risk difference for someone with a high baseline risk (e.g. high viral load, pre-existing pulmonary disease, limited metabolic reserve) than for a patient at low baseline risk (e.g. a young, healthy person).

We followed up with a long email exchange. Rather than just jump to the tl;dr end, I’m gonna give all of it—because only through this detailed back-and-forth will you see the depth of my confusion.

Here was my initial response to Vickers:

My example was even simpler because if the outcome is binary you can just think in terms of potential outcomes. It’s impossible to have a constant treatment effect if the outcome is discrete!

Vickers disagreed:

I don’t follow that at all. Let’s assume we are treating patients with high blood pressure, trying to prevent a cardiovascular event. Now it turns out that the risk of an event given your blood pressure is as follows:

Systolic Blood Pressure  Risk
                    140    2%
                    150    4%
                    160    8%
                    170   10%

Assume that a drug reduces blood pressure by 10mm Hg. Then no matter what your blood pressure is to start with, taking the drug will halve your risk. The relative risk in the study – 50% – is independent (give or take) of the trial population; the absolute risk difference depends absolutely on the distribution of blood pressure in the trial: it could vary from 1% if most patients had blood pressure of 140 to 5% if most patients had 170.

I then responded:

What I’m saying is that a person has 2 potential outcomes, y=1 (live) or 0 (die). Then the effect of the treatment on the individual is either 0 (if you’d live under either treatment or control, or if you’d die under either treatment or control), +1 (if you’d live under the treatment but die under the control), or -1 (if the reverse). Call these people types A, B, and C. We can never observe a person’s type, but we can consider it as a latent variable. By definition, the treatment effect varies. If the treatment has a positive average effect, that implies there are more people of type B than people of type C in the population. The effectiveness of a treatment in a study will depend on how many people of each type are in the study.

Vickers:

You are ignoring the stochastic process here. Imagine that everyone has two coins, and if they throw at least one head, they are executed. My “intervention” is to take away one coin. That lowers your probability of death from 75% to 50%. But there aren’t “latent types” of people who will respond or not respond to the coin removal intervention. In medicine, many processes are stochastic (like a heart attack) but you can raise or lower probabilities.

Me:

Hmm, I’ll have to think about this. It seems to me that we can always define the potential outcomes, but I agree with you that in some examples, such as the coin example, the potential outcome framework isn’t so useful. In the coin example there are still these 3 types of people, but whether you are one of these people is itself random and completely unpredictable.

Vickers:

I don’t like the idea of types defined by something that happens in the future. We wouldn’t say “there are six types of people in the world. Those that if you gave them a die to throw would throw a 1, those that would throw a 2, those who’d throw a 3 etc. etc.”

There is a fairly exact analogy between the coin example and adjuvant chemotherapy for cancer. Getting chemotherapy reduces your burden of cancer cells. These cells may randomly mutate and then start growing and spreading. So getting chemotherapy isn’t much different from having a coin removed from your pile where you have to throw zero heads to escape execution.

Me:

I guess it’s gotta depend on context. For example, suppose that the effectiveness of the treatment depended on a real but unobserved aspect of the patient. Then I think it would make sense to talk about the 3 kinds of people. The chemotherapy example is an interesting one. I’m thinking the coronavirus example is different, if for no other reason than that most coronavirus patients aren’t dying. If you work with a population in which only 2% are dying, then the absolute maximum possible effectiveness of the treatment is 2%. So, at the very least, if that researcher was saying he was anticipating a 25% effectiveness, he was implicitly assuming that at least 25% of the people in his study would die in the absence of the treatment. That’s a big assumption already.

I guess the right way to think about it would be to allow some of the variation to be due to real characteristics of the patients and for some of it to be random.

Vickers:

No question that your investigator’s estimates were way out of line.

As regards “I guess the right way to think about it would be to allow some of the variation to be due to real characteristics of the patients and for some of it to be random”, I guess I like to think in terms of mechanisms. In the case of adjuvant chemotherapy, or cardiovascular prevention, an event (cancer recurrence, a heart attack) occurs at the end of a long chain of random processes (blood pressure only damages a vessel in the heart because there is a slight weakness in that vessel, a cancer cell not removed during surgery mutates). We can think of treatments as having a relatively constant risk reduction, so the absolute risk reduction observed in any study depends on the distribution of baseline risk in the study cohort. In other cases such as an antimicrobial or a targeted agent for cancer, you’ll have some patients that will respond (e.g. the microbe is sensitive to the particular drug, the patient’s cancer expresses the protein that is the target) and some that won’t. The absolute risk reduction depends on the distribution of the types of patient.

This discussion was very helpful in clarifying my thoughts. I was taught causal inference under the potential-outcome framework and I hadn’t fully thought through these issues.

And watching me get convinced in real time . . . it’s like the hot hand all over again!

P.S. In comments, several people make the point that the two frameworks discussed above are mathematically equivalent, and there is no observable difference between them. That’s right, and I think that’s one reason why Rubin prefers his model of deterministic potential outcomes as, in some sense, the purest way to write things. Also relevant is this paper from 2012 by Tyler Vanderweele and James Robins on stochastic potential outcomes that was pointed out by commenter Z.

46 thoughts on “A counterexample to the potential-outcomes model for causal inference”

  1. This debate sounds similar to Laplace’s demon, who can predict the future given all particles’ current locations and movements.

    Is there ‘true’ randomness? Or do events appear random to us because we just can’t observe the necessary information?

    But from our perspective they’re similar.

  2. I don’t see how this is a counterexample. I’m used to thinking about these issues in the game theory context. Think about a game like poker. One way to model the game has “chance nodes” where nature makes a random decision scattered throughout the game tree: every time a card is dealt there is randomness. The other way uses a single chance node at the root of the tree: We shuffle the deck and after that what cards come out is deterministic.

    It seems to me that the entire discussion is about which of these two is the “right” model to use. And I don’t think that question has a single answer in general; just use whichever one is more convenient for the analysis you want to do.

    • Anon:

      I agree with your last paragraph. The point of my post is that I was trained to think that the model with fixed potential outcomes was always the right way to do it. For example, that’s how Jennifer and I wrote up the general framework for causal inference in our books. But after the conversation detailed in the above post, I changed my mind, and now I think that in many settings it can make sense to model the potential outcome as a random variable. Specifically: in our books we follow Rubin and define the individual causal effect as y^T – y^C, the difference between the two potential outcomes. But now I think there are settings where it makes sense to define the causal effect as E(y^T) – E(y^C), the difference between the two expected potential outcomes. This doesn’t change the definition of the population average treatment effect, but for reasons discussed in the above post and in many other places on this blog over the years, I’m not only interested in the average treatment effect.

      So, again, I’m agreeing with you: This is all about what notation is most convenient for the problem at hand. And the choice of notation can make a big difference in what analyses we might do.

    • Yeah, I agree with you that it doesn’t really seem like there’s a paradox here.

      Another equivalent way to think about it is in terms of a generative model. Everyone gets assigned $C^1_i, C^2_i \sim \text{Bern}(1/2)$ (or whatever the appropriate probability is), where $C^j_i$ is binary, i.e., a coin flip. Then you can imagine setting $Y(0) = C^1_i * C^2_i$ and $Y(1) = C^1_i$.

      Now, for any individual, you can put them in Group 1, Group 2, Group 3, or Group 4 based on $Y(0)$ and $Y(1)$. Of course, since the POs are random variables, the assignment is also a random variable. (It seems to me that this is actually the standard way of talking about POs?)
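      Here’s a minimal sketch of that generative model (coin probabilities as assumed above), just showing the group shares and the implied average effect:

# Sketch of the generative model described above: each person gets two
# Bern(1/2) coin flips C1, C2 (1 = "safe"); Y(0) = C1*C2, Y(1) = C1.
# The four potential-outcome groups are then themselves random.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
c1 = rng.integers(0, 2, size=n)
c2 = rng.integers(0, 2, size=n)
y0 = c1 * c2        # survive under control: both coins must be safe
y1 = c1             # survive under treatment: only the remaining coin matters

for g, (a, b) in enumerate([(1, 1), (1, 0), (0, 1), (0, 0)], start=1):
    share = np.mean((y1 == a) & (y0 == b))
    print(f"group {g}: Y(1)={a}, Y(0)={b}: {share:.3f}")
# Expected shares: 0.25 (always survive), 0.25 (saved by treatment),
# 0.00 (harmed), 0.50 (always die); E[Y(1)] - E[Y(0)] = 0.50 - 0.25 = 0.25.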

      So, now, the only question is, when do you draw from your random variables? You can imagine World A where you get your draws *before* the experiment, but that the results of the draws are hidden from you until you observe it in the experiment. (This seems to be closer to what Andrew Gelman is saying is the default way of thinking about these things.) You can also imagine World B where you get your draws *after* the experiment. (This seems to be close to how Andrew Vickers is saying we should think about, e.g., cancer research.)

      The point that is confusing to me, at least, is that it seems like a distinction without a difference—from the perspective of the experimenter, there’s no way to distinguish between World A and World B! And if that’s the case, is it really meaningful to worry about whether we live in World A or World B? (You could even go further and say that the difference *isn’t* meaningful *because* it’s not verifiable.)

      I see Andrew’s point above that the difference in notation might make a difference to the analyses you would do. I’m not sure what a realistic example of that is, though. Which is not to say there aren’t examples—I just can’t think of any! But the fact that there’s no way to tell the difference between World A and World B anyway is making me a bit skeptical.

      • Anon:

        The difference between the two perspectives comes when you want to start modeling varying treatment effects, a topic we’ve been talking about for a long time and that continues to be a concern. With outcomes defined as being binary, treatment effects necessarily vary: all treatment effects are -1, 0, or 1. With Vickers’s stochastic framework, you can start with a constant treatment effect on the logistic scale or a near-constant treatment effect on the linear scale, and then you can model variation from that. This makes sense to me.

  3. I found a paper and talk by Stephen Senn on these topics very helpful (talk, paper), especially Table I and Table II of the paper.

    I took from them that in some contexts, “a unit’s potential outcomes” correspond to many different sources of variation which are not always stable properties of the “unit” in question.

    It’s easiest to illustrate with a continuous outcome like blood pressure, where there’s not just person-to-person variability, but also variability due to time of day, temperature, whether the person has just exercised, etc., and all of these can interact with treatment. A repeated period cross-over trial can identify more of these components than a parallel group trial.

    It seems to me that a similar logic applies with binary outcomes. You can set up a model where each unit has fixed “live” or “die” potential outcomes, but these will be defined relative to a combination of stable individual-level attributes and many other variables. If the between-individual variation is most important this might be fine, but if the within-individual/between-occasion variation dominates, individual-level averages of potential outcomes (i.e. probabilities) might be more scientifically relevant than the potential outcomes themselves.

  4. I think a lot of this can be resolved by the fact that we can never identify particular individuals as type-1, type-2, type-3, or type-4, since both potential outcomes can never be observed for the same person. The potential outcome typology (btw very similar to the always-taker/never-taker/complier/defier types in local average treatment effect analysis and its generalizations) is useful for characterizing systematic patterns by those types, such as the average characteristics or related quantities of type-2 individuals vs type-4 individuals. It’s true that whether someone belongs in a given type bucket is not knowable “ex ante”, but we can still make ex ante statements about what defines them on average.

    In the coin example, everyone has two coins so there is only one “type” of person defined by the potential outcomes with/without one coin removed. This means that while it is unknowable beforehand whether someone will die in the experiment or not, we can still say that everyone is “ex ante” helped by it.

    The broader critique seems to be that potential outcomes “collapse down” a lot of interesting structural mechanisms that we might care about, for example because we can address them with policy or generalize to other contexts. That’s totally fair and related to the long and often contentious debate over “reduced form” vs “structural” methods in economics (and in other fields). But I don’t think the upshot from that debate is that there’s anything wrong with collapsing things down with POs; just that we might want to bring institutional knowledge or structural assumptions in to push the analysis further. 

    • To clarify, by “everyone is helped ex ante by the experiment” I mean any summary statistic computed for this group is representative of the population

  5. This isn’t how I think of it at all. Where’s the impact of random assignment? Or is this a non-random assignment design? Suppose your four types of people were based on eye color (brown, blue, green, gray) instead. Then, after random assignment, and in the long run, they would be equally represented in each group, rendering the trait ignorable. In realistic (finite) samples, it’s never perfectly equal, but we can reweight or reassign by eye color to do better. Returning to your typology, we can’t do a randomization check on that because we can’t measure that trait, but that’s true of almost all potential covariates. That’s the point of random assignment: we can ignore what we can’t measure, but only in the long run. So our conclusion is about long-run outcomes, which is why it’s causal “inference” and not causal “observation.”

    Though now I realize your post refers only to causal inference, not causal inference in random experiments, so maybe my thought process goes outside of the problem as you define it?

    • Michael:

      The example we were discussing was a randomized experiment. I’d been taking the position that the potential outcomes were death or survival (0 or 1), so there were three possible treatment effects: -1, 0, and 1. Vickers argued that it made sense to think of the outcome as Pr(survival), so that the individual treatment effect can be any real number between -1 and 1. Using the continuous formulation doesn’t solve any problems—you still have to draw the line of where you introduce the randomness—but that just becomes part of the model.

      • I’m not surprised you were confused–this is very confusing! But my point has nothing to do with discrete or continuous, only with random assignment. “Where you introduce randomness” is assignment to treatment. That’s it. That’s the only place you introduce randomness. If treatment is randomized, then everything that’s subject to randomization, including prior random events or traits, is ignorable. *Must* be ignored if you’re to infer causality. RCTs facilitate that inference through the act of intentional randomization–once you’ve randomized, you’ve destroyed all information about prior states. Which I think is the larger point: if you follow the information through the model, only the information about treatment effect survives randomization.

        Yes, you could theoretically observe causes at the atomic level and watch people go from untreated to treated, and identify the stochastic event in that process, but that’s not an experiment: you perform no manipulation and you don’t need to make any inferences. Explaining the causal mechanism and any stochastic components is obviously something we all want to do, but it occurs in parallel with determining causation, not in sequence with it.

        • Sorry, should’ve said “only the information about treatment survives for making inferences to a long-run population.” It’s obvious the information survives for describing the sample.

    • Z:

      Thanks. I had not been aware of this particular article or this terminology. I think that this idea of stochastic potential outcomes is what Vickers was recommending, and until having that conversation with him I’d just thought that potential outcomes had to be deterministic. As a teacher and writer of textbooks, I understand the appeal of deterministic potential outcomes, and the deterministic potential outcome framework has been helpful to me in many applications, starting here. As noted above, any stochastic potential outcome model can be formulated mathematically as a deterministic potential outcomes model, but there are problems for which the stochastic potential outcomes model just seems to be a lot more natural.

      • Andrew: As an instructor like you, I found deterministic models provide simple, intuitive results that often generalize straightforwardly to all models; but it seems Vickers was getting at how some mechanistic models are much better captured by stochastic potential outcomes. Thus early on I began using stochastic potential outcomes for general methodologic points (e.g., Greenland S, 1987. “Interpretation and choice of effect measures in epidemiologic analysis”, American Journal of Epidemiology, 125, 761-768) and specific applications as in oncology (e.g., Beyea and Greenland 1999, “The importance of specifying the underlying biologic model in estimating the probability of causation”, Health Physics, 76, 269-274).

        In light of my experiences (and the current episode you document) I have to conclude that adequate instruction in causal models must progress from the deterministic to the stochastic case. This is needed even when it is possible to construct the stochastic model from an underlying latent deterministic model. And (as in quantum mechanics) it is not always possible to get everything easily out of deterministic models or generalize all results from them; for example, it became clear early on that while some central results from the usual deterministic potential-outcomes model generalized to the stochastic case (e.g., results on noncollapsibility of effect measures), others did not (e.g., some effect bounds in the causal modeling literature don’t extend to stochastic outcomes). And when dealing with the issues of causal attribution and causation probabilities, Robins and I ended up having to present 2 separate papers for technical details, one for the deterministic and one for the stochastic case (Robins, Greenland 1989. “Estimability and estimation of excess and etiologic fractions”, Statistics in Medicine, 8, 845-859; and “The probability of causation under a stochastic model for individual risks”, Biometrics, 46, 1125-1138, erratum: 1991, 48, 824).

        • Sander, have you seen (or written) a discussion of under what circumstances each type of causal model (stochastic vs deterministic) is appropriate? As examples where stochastic counterfactuals can be required, you mention probability of causation and certain problems in quantum mechanics. Below in the comments Andrew gives the example of free throws as a case where stochastic counterfactuals would be desirable if not necessary (and I agree with him). Is there some systematic characterization of when they’d be necessary and when they’d be desirable but not necessary (which I know is not a particularly well defined question)?

        • On the necessary part, outside of quantum mechanics (where the issue remains debated to this day but local determinism appears to be knocked out) I’m not aware of cases in which logical necessity of the stochastic model can be claimed: Given a stochastic model one can always reverse engineer a deterministic system that delivers the same empirical consequences. But the complexity of those deterministic models can make them computationally impractical and practically pointless; dice-throw examples show how that can occur even with an incredibly simple system. I myself don’t know of any hard and fast categorical rules that distinguish when that happens, but there are examples in which the natural simplifications that make those deterministic models computationally practical lead to results that do not follow from their stochastic counterparts (at least without further assumptions or “principles”).

          As an example consider the causal justification for conditioning on all margins of a 2×2 table from a randomized experiment. Under a deterministic model it’s obvious that the outcome margin is fixed under Fisher’s sharp causal null hypothesis (no effect on anyone); the only causal types are then 1 and 4, and the outcome margin is merely the counts of the two types, and unaffected by treatment assignment. I found that showing this derivation to students removes their bafflement at why their texts and software offer Fisher’s exact test (with its fixed margins) as if a “gold standard”. Not that I think it should be: Switch to a stochastic causal model or a pure descriptive survey-sampling context (no causal model connecting the two variables) and the potential-outcomes rationale for fixing the margin evaporates. Justifications for outcome conditioning then devolve to conditionality principles, which have been the source of endless controversy (for more details and earlier citations see Greenland “On the logical justification of conditional tests for two-by-two contingency tables” and the ensuing exchange in The American Statistician 1991;45:248–251 and the letters 1992;46:163).
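          A small illustration of that derivation (made-up numbers): fix each unit’s outcome as its type dictates, re-randomize assignment a few times, and the cell counts move while the outcome margin stays put.

# Under Fisher's sharp null (no effect on anyone) each unit is type 1 or
# type 4, so its outcome is the same under either assignment, and the outcome
# margin of the 2x2 table cannot vary across randomizations even though the
# cell counts do. Numbers below are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
n = 40
outcome = (rng.random(n) < 0.3).astype(int)          # fixed by type, unaffected by treatment

for _ in range(3):
    z = rng.permutation(np.repeat([1, 0], n // 2))   # a fresh random assignment
    cells = [[np.sum((z == t) & (outcome == y)) for y in (1, 0)] for t in (1, 0)]
    print("cells:", cells, "outcome margin:", np.sum(cells, axis=0))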

          On the desirable part, such examples never stopped me or others from employing the deterministic model as a simple way to illuminate general problems in applying conventional statistics to causal problems (e.g., as in Greenland “Randomization, statistics, and causal inference”, Epidemiology 1990;1:421–429). But they do show some care is needed in going from the highly idealized deterministic case to the more general stochastic case (the deterministic case being the extreme limit of the stochastic case in which all unit-specific outcome probabilities are 0 or 1).

  6. Actually, while I think the coin analogy is very clever, I don’t think it’s a causal model at all. Consider: Everyone has two untossed coins prior to assignment. You can say that a trait of each of the coins is “will come up heads next time tossed” without changing anything meaningful about the scenario. Then you can say each coin also has the trait “will be removed by experimenter as part of treatment.” But now you’ve reduced all causal effects to mere traits, and eliminated the possibility of manipulation.

    Ascribing a future, unique study outcome to a coin (or participant) as a trait is a violation of the very idea of causality. That would mean all outcomes are pre-determined for each coin, and for each participant, and so a participant’s execution isn’t “caused” but fated. Likewise, if “will get better with chemo” is a trait of the treated/untreated participant, then so is “will get chemo” and nothing caused the person to get better–they were always going to get better by their traits alone.

    In the coin analogy, there is no cause and there is no effect, only unconnected events that happen to be separated by time. There is no causality to infer, therefore it’s no longer a model of causal inference. Hence, we cannot suppose experimental outcomes are traits of individuals in a causal model.

  7. This is a distinction I think about a lot when evaluating machine learning models. In practice it’s very common to use metrics that condition on the outcome. For situations like computer vision and natural language processing where deep neural networks perform very well, the data generating processes are generally deterministic, and these metrics arguably make sense.

    On the other hand, when we build predictive models of nondeterministic processes, conditioning on the outcome can lead to nonsensical conclusions. If there are subgroups with different base rate risks, then even a model with perfect knowledge will have differing false-positive and false-negative rates on the subgroups. This isn’t an error with the model, it’s an inevitable consequence of the way the data generating process works. Using the wrong metrics here can lead to society rejecting the models, even if they could play a role in solving the underlying inequities that lead to the differing base rate risks.

  8. Frank Harrell has blogged a bunch about this, really helpfully (to me at least)
    For instance, https://www.fharrell.com/post/varyor/, in which he writes “In the multitude of forest plots present in journal articles depicting RCT results, the constancy of ORs over patient types is impressive.”

    A lot of treatment effect heterogeneity goes away when you move from the linear scale to the logit scale.

    The problem is that it is hard to think about effects on the logit scale in terms of fixed potential outcomes, as you wrote

    • The meaning and applicability of Harrell’s observation has been sharply debated – see the 215-entry page at his blog here
      https://discourse.datamethods.org/t/should-one-derive-risk-difference-from-the-odds-ratio/4403/
      which also meanders into related causal-modeling controversies, including questionable claims about measures other than odds ratios.
      Among many problems with claims that odds ratios tend to be constant, most obvious is the 2nd-order version of the familiar null P-value fallacy: People do a test of homogeneity, get p>0.05 and then interpret that as evidence the odds ratio is homogeneous, when in fact the power of the test is very small – studies are almost never powered to detect realistic heterogeneity levels and so would end up with p>0.05 most of the time even if homogeneity was always false.

  9. This does not seem like a counterexample at all. It seems like a confusion because you are discussing different outcomes. You do not even need to use stochastic POs, in my humble opinion.

    Let’s suppose for one moment you have X (taking the medication), Y (blood pressure – high or low), Z (death). Here I chose to use Y as a binary rather than a continuous variable.

    If you only consider X and Z, a canonical transformation (Balke and Pearl, 1997) of the problem will generate 4 different response types: always-takers, never-takers, compliers, and defiers.

    Things start to get a little bit different if you assume a mediator, and this is a good opportunity to show the importance of considering DAGs.

    Let’s suppose we have X -> Y -> Z. Then for Y we will have the canonical transformation: a) Y(X=0)=0, b) Y(X=0) = 1; c) Y(X=1)=0; d) Y(X=1)=1. It will be the same thing for Z, a) Z(Y=0)=0, b) Z(Y=0) = 1; c) Z(Y=1)=0; d) Z(Y=1)=1.

    However, things change a little bit if you consider a DAG such that X -> Y -> Z, and X -> Z. In this case, instead of only four groups for the canonical transformation of Z, we will have 16 (2^(2^2)); a) Always 1; b) Always X; c) Always Y; d) X and Y; e) X or Y; f) X xor Y; and so on.
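    A quick sketch of that count, just enumerating the deterministic response functions for Z given binary parents X and Y:

# With binary parents X and Y, a deterministic response type for Z is a
# function f: {0,1}^2 -> {0,1}, so there are 2^(2^2) = 16 of them.
from itertools import product

response_types = list(product([0, 1], repeat=4))   # f evaluated at (0,0), (0,1), (1,0), (1,1)
print(len(response_types))                         # 16

# e.g. the "X or Y" type responds with Z = 1 whenever X = 1 or Y = 1:
f_or = {(x, y): int(x or y) for x, y in product([0, 1], repeat=2)}
print(f_or)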

    But anyway, one does not need to reason with stochastic POs in this case.

    • In the canonical transformation for the groups,
      I should have written “a)Y(X=0)=0 and Y(X=1)=0 [never-takers], b) Y(X=1) = 1 and Y(X=0) = 0 [compliers]; c) Y(X=1)=0 and Y(X=0)=1 [defiers]; d) Y(X=0)=1 and Y(X=1)=1 [always-takers]”. But anyway, the reasoning goes the same way.

  10. Examples given by Mr. Vickers can be easily translated into the well-known language of the (deterministic) PO model, but the treatment effect could be/is very heterogeneous. In other words, it is a discussion about moderation/interaction of patients’ characteristics and the various causal states (treatment levels) they have been assigned to.

    • Roman:

      Yes, I think the point is that we want the heterogeneity in the model to correspond to real heterogeneity, not to pure randomness. For a simple example, imagine you are shooting baskets with independent outcomes, and consider a treatment that raises your Pr(success) from 0.3 to 0.4. I wouldn’t want to model this on the level of the potential outcome for a single shot (in which case there’s a 0.3*0.4 + 0.7*0.6 chance of a treatment effect of 0, a 0.3*0.6 chance of a treatment effect of -1, and a 0.7*0.4 chance of a treatment effect of +1). Another way to understand this example is to consider the number of shots taken by the player to be a design decision in the experiment. In that case, under the probabilistic parameterization the treatment effect on this person is 0.1, full stop, but under the deterministic parameterization the possible treatment effects depend on the number of shots, approaching the constant 0.1 only in the limit as the number of shots per person approaches infinity.
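      Here is that single-shot arithmetic written out (assuming, as in the numbers above, that the two potential shot outcomes are independent):

# Single-shot potential outcomes: success probability 0.3 under control,
# 0.4 under treatment, with the two potential outcomes independent.
p_c, p_t = 0.3, 0.4
print("P(effect =  0) =", p_t * p_c + (1 - p_t) * (1 - p_c))   # 0.12 + 0.42 = 0.54
print("P(effect = -1) =", p_c * (1 - p_t))                     # 0.3 * 0.6  = 0.18
print("P(effect = +1) =", (1 - p_c) * p_t)                     # 0.7 * 0.4  = 0.28
print("average effect =", (1 - p_c) * p_t - p_c * (1 - p_t))   # 0.28 - 0.18 = 0.10
# With many shots per player, the per-player average of these per-shot effects
# concentrates around 0.1, the effect under the probabilistic parameterization.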

      • I’m still having trouble understanding the difference between the probabilistic and the deterministic parameterizations, but maybe this is because I’m too attached to my understanding of the frequentist interpretation of probability, in which the last two statements are equivalent.

        In any case, it seems as though our best estimate of individual-level treatment effects, when we only have one observation per individual, would come from inferences based on population or sub-population level averages.

        • William:

          Yes, it’s better to have multiple measurements on each individual so that it is possible to estimate individual-level treatment effects.

  11. Two statistics professors start shouting. “You’ve missed the stochastic process!” Then even more professors show up, one of whom has been practicing since around when I was born. This thread is golden. Stochastic outcomes just captured my imagination.

  12. If you replace binary outcomes with survival times, or some other continuous response, then heterogeneous treatment effects seem more compelling; this is the natural environment of the Lehmann-Doksum quantile treatment effect.

  13. I’m trying to understand the difference between the stochastic and deterministic versions of potential outcomes. I think it might help if I knew how to simulate the stochastic version in R.

    First I imagine two distributions.

    Y_1 ~ X + T + e_1
    Y_0 ~ X + e_0

    X is a fixed individual component, T is the treatment effect, and e_1, e_0 are random components.

    For the deterministic version, I envision drawing one iteration of n samples and forming a two-dimensional array where the n rows represent individuals and the columns represent the two potential outcomes.

    For the stochastic version, I envision drawing m iterations and forming a three-dimensional array with dimensions m×n×2, where m is theoretically infinite but could be set at some reasonable level for convergence.

    The difference between the two is simply that in the deterministic version each individual’s treatment effect is Y_1 – Y_0, whereas in the stochastic version it is E(Y_1) – E(Y_0), with the expectations taken over the m draws.

    Is this correct? Or would you need to add a random component to T that is separate from e as well, possibly allowing the random component of T to be correlated with e? In other words, there would be a random component to the treatment effect and a random component to the individual baseline (e.g. blood pressure), which may be correlated with the treatment effect.
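    Here is one way that sketch could look in code (in Python rather than R, with made-up parameters; the setup just follows the description above, so everything here is for illustration only):

# Deterministic version: one noise draw per person, individual effect = Y1 - Y0.
# Stochastic version: average over many noise draws, individual effect = E[Y1] - E[Y0].
import numpy as np

rng = np.random.default_rng(3)
n, m, T = 5, 10_000, 2.0
x = rng.normal(size=n)                           # fixed individual component X

# deterministic version: a single n x 2 array of potential outcomes
e0, e1 = rng.normal(size=n), rng.normal(size=n)
det_effects = (x + T + e1) - (x + e0)            # varies across individuals

# stochastic version: m draws per person, then average over the draws
E0 = rng.normal(size=(m, n))
E1 = rng.normal(size=(m, n))
sto_effects = (x + T + E1).mean(axis=0) - (x + E0).mean(axis=0)   # all close to T

print("deterministic:", np.round(det_effects, 2))
print("stochastic:   ", np.round(sto_effects, 2))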

    • WS: I think getting parametric obscures the distinction and messes up a lot of causal reasoning by blurring the distinction between general nonparametric and specific parametric results (classical Gaussian path analysis providing examples). The general potential-outcomes model can be expressed nonparametrically as saying treatment T picks (or indexes) the Y-outcome distribution F(y;t). “Deterministic” is just the special case in which all those distributions are degenerate (place probability 1 on one value y of Y and 0 on all the rest), which to understate is a grossly unrealistic model for medical treatments that make it into trials (which are conducted precisely because there is so much uncertainty about what outcomes will be, no matter what the treatment), and which can produce results that do not extend to the general stochastic (nondegenerate) case. Please see my responses to Z above for other details and some citations.

  14. It’s been a long time since I looked at these issues, but doesn’t it go back to Neyman, who introduced fixed potential outcomes as a mathematically equivalent shortcut for the treatment effect calculations?

    The fixed potential outcomes in “Rubin Causal Model” have always been a stumbling block for me.

  15. It sounds to me like the argument here is not even about causal inference but about probability: is there a deterministic latent variable (type 1-4 here) behind the random variable (binary death)? Typically we can explain some outcome randomness by collecting more input predictors (patient genes, habits, etc), until we are really entering quantum territory. In the same way that a model being true or false depends on what input information you have collected, here whether a treatment effect is “individual” or “average” depends on what input we would like to condition on.

    • Yuling:

      Sure, it’s just a question of what’s a convenient probability model for this example. The causal inference comes in because I was getting confused by the usual approach in which the causal effect is defined based on the latent potential outcomes.

  16. I think another interesting discussion of the stochastic vs deterministic potential outcomes debate is in the replies/comments attached to Dawid’s “Causal Inference without Counterfactuals” in JASA.

  17. I think random assignment guarantees just “internal validity”: it randomizes only the assignment. It is random sampling that randomizes which individuals are selected, and random sampling guarantees just “external validity”.

    If your data will be sampled randomly from your target population, the observation for the i-th person becomes random even if the values of each individual are fixed. You cannot distinguish this randomness generated by random sampling from the randomness generated by each individual’s stochastic behavior. I mean, if your data are a random sample from deterministic fixed values like Y_me = 1, Y_Jack = 0, Y_Yoko = 1, …, you cannot distinguish their randomness from an i.i.d. stochastic process for each individual like Y_me ~ Bern(0.8), Y_Jack ~ Bern(0.8), Y_Yoko ~ Bern(0.8), …
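    A tiny sketch of that indistinguishability point (hypothetical numbers):

# Sampling individuals at random from a fixed population of 0/1 values with
# mean 0.8 is observationally equivalent to i.i.d. Bern(0.8) draws.
import numpy as np

rng = np.random.default_rng(4)
pop = np.repeat([1, 0], [8000, 2000])               # fixed individual values, 80% ones
fixed_sample = rng.choice(pop, size=100_000)        # random sampling of individuals
bern_sample = rng.binomial(1, 0.8, size=100_000)    # stochastic individual behavior

print(fixed_sample.mean(), bern_sample.mean())      # both around 0.8; the data can't tell them apart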

    If you can assume both random allocation and random sampling (or random individual behavior) for your data generating process, I think it is natural to model both sources of randomness.

  18. Dear all,

    I know I’m late for the discussion, but I want to clarify that Vickers’s coin example can indeed be formulated in terms of deterministic potential outcomes, and that there is thus not necessarily a conflict between Gelman’s and Vickers’s examples.

    The outcome of a die depends on numerous microscopic and macroscopic factors such as the velocity and angle of the hand that throws the die, the direction and force of the wind (if any) in the room, the structure of the surface on which the die lands, etc. The deterministic potential outcome model assumes that there is a (possibly extremely large!) set U of such factors such that, given these, the outcome of the die is deterministic. In the deterministic potential outcome model a “person/subject” is synonymous with this set U. That is, a “person/subject” is not just the macroscopic features that we normally ascribe to an individual, but by definition ALL factors that influence the outcome. These factors may of course change over time. For instance, in this model I’m not necessarily the same person now as I was when I started writing this text, since I’m now somewhat more tired than when I started, and may thus throw the die with a different velocity now than I would have done 5 min ago.

    With this deterministic model we can, just like Gelman did for his example, divide subjects into deterministic response types (or principal strata). However, since there is only one level of intervention (throwing two coins) there are only two response types: those who will throw at least one head, and therefore will be executed, and those who will not.

    See also

    J. Pearl, “The logic of counterfactuals in causal inference (Discussion of ‘Causal inference without counterfactuals’ by A.P. Dawid),” Journal of the American Statistical Association, Vol. 95, No. 450, 428–435, June 2000.

    for more details and (probably better) explanations.

    best/arvid

    • Arvid:

      See above when I wrote, “It wasn’t a mathematical error; it was a statistical model that was misleading when I tried to align it with reality.” I agree that the example can be fit into the deterministic potential outcome framework: my original formulation of causal effect as being -1, 0, or +1 is mathematically valid. But I agree with Vickers that this deterministic model, while mathematically OK (as you say in your comment), is not so helpful for the sort of example that he was discussing. By giving each person a latent characteristic based on a die roll that has yet to happen, you’re creating a model where the latent variable does not generalize from one situation to the next, and that doesn’t make a lot of sense. We want our parameters and latent characteristics to generalize.

      So, again, you’re right (as I was originally) to say that the coin example can indeed be formulated in terms of deterministic potential outcomes. I just agree with Vickers that such a formulation is not so useful for statistical modeling.

      • Dear Andrew,

        You say:

        “By giving each person a latent characteristic based on a die roll that has yet to happen, you’re creating a model where the latent variable does not generalize from one situation to the next, and that doesn’t make a lot of sense.”

        It seems like I could replace “die roll” in the quote with any other outcome, say corona survival. Does this mean that you no longer think that the (deterministic) potential outcome model makes sense for any outcome? If, on the contrary, you still think that the (deterministic) potential outcome model makes sense for, say, corona survival, what is then the difference between corona survival and the toss of a die/coin?

        best/arvid

        • Arvid:

          It depends on the example. That was the point of my discussion with Vickers. The deterministic latent variable model is always mathematically correct, but it will be more convenient to use in some settings than in others. I think the deterministic latent variable model makes more sense in settings where the latent variable represents some stable underlying characteristic of the person in the experiment, and less sense in settings where it is a pure random variable. In the coronavirus example, I think it’s somewhere in between: the outcome is some combination of a latent characteristic and a random outcome. One way to disentangle these would be to consider additional pre-treatment measurements on the people in the experiment: these additional measurements could be informative about latent characteristics but they can’t be informative about pure randomness. Further discussion of this issue is here. I’d thought about this problem often but only recently saw the connection to potential outcomes.

          Again, there’s never anything mathematically incorrect about using deterministic latent variables in any of these cases; I just think that it can be easier to do statistics using a stochastic model, for reasons explained to me by Vickers.

  19. Hi, it’s me, Andrew Vickers, the guy that started all this in the first place… A lot of this is getting very deep philosophically, and into some arguments about causation in epidemiology that are out of my area of expertise, and which I wasn’t really trying to comment on at all.

    My major concern is how to think about heterogeneity of treatment effect. In Andrew Gelman’s original formulation, the heterogeneity of treatment effect is in the four types of patient: 1. always survive, 2. survive only on treatment, 3. survive only on control, 4. always die. Treatment has no effect on patients who are type 1 or 4, helps those who are type 2 and harms those who are type 3, hence heterogeneity of treatment effect. Because we only want to give treatment to those who benefit, our job would be to work out how to distinguish the four types of patient, and we might imagine some sort of statistical analysis in which we report on interaction terms in a model such as logit Pr(death) = b1.treatment + b2.marker + b3.marker x treatment + c. In this sort of statistical model, we are testing whether the relative risk (or, more precisely, the odds ratio) of the treatment varies between different types of patient.

    But there is another way of thinking about heterogeneity of treatment effect, which is about whether the absolute risk reduction of a treatment varies between patients, even where the relative risk is relatively constant. Let’s think about the model above, logit Pr(death) = b1.treatment + b2.marker + b3.marker x treatment + c, where the marker is binary, and the coefficients are -0.7 (i.e. odds ratio of 0.5), 0.7, 0 and -2 for b1, b2, b3 and c respectively. In patients who are positive for the marker, risk is 21.4% vs. 11.9% in the control and treatment groups respectively. For patients who are negative for the marker, risk is 11.9% and 6.3%. The odds ratio is the same irrespective of marker status, but the absolute risk difference is greater in the marker-positive group. Now let’s imagine that, for the sake of argument, the treatment is associated with a 5% absolute risk of a competing event. In that case, we would give the treatment to patients who are positive for the marker but not to those who are marker negative, even though the odds ratio is the same for both. For statistical analysis of this type of heterogeneity of treatment effect, we are interested in markers of absolute risk, rather than looking for interactions between treatment and a marker.
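    For concreteness, here is a quick check of those numbers (a sketch using only the coefficients stated above):

# logit Pr(death) = b1*treatment + b2*marker + b3*marker*treatment + c,
# with b1 = -0.7, b2 = 0.7, b3 = 0, c = -2, as in the example above.
import math

def risk(treatment, marker, b1=-0.7, b2=0.7, b3=0.0, c=-2.0):
    logit = b1 * treatment + b2 * marker + b3 * marker * treatment + c
    return 1 / (1 + math.exp(-logit))

for marker in (1, 0):
    p0, p1 = risk(0, marker), risk(1, marker)
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    print(f"marker={marker}: control {p0:.1%}, treatment {p1:.1%}, "
          f"OR {odds_ratio:.2f}, absolute risk difference {p0 - p1:.1%}")
# marker=1: control 21.4%, treatment 11.9%, OR 0.50, risk difference 9.5%
# marker=0: control 11.9%, treatment  6.3%, OR 0.50, risk difference 5.6%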

    We could, of course, use similar logic about the harm of an exposure rather than the benefit of a treatment.

    So here is the big main point: we know that the second type of heterogeneity of treatment effect happens all the time and has a huge effect size. For instance (see https://www.mdcalc.com/framingham-risk-score-hard-coronary-heart-disease), risk of a heart attack amongst those with high blood pressure can vary enormously between patients. A 50 year old woman non-smoker with a systolic of 150, cholesterol of 160 and HDL of 70 has a 0.4% risk of a heart attack within 10 years. A 70-year old male smoker with cholesterol of 200 and HDL of 30, but who has the same blood pressure, has a 30% risk of heart attack. For a drug with the same relative risk reduction, there is a nearly 10-fold difference in absolute benefit of treatment.

    As a result, I would recommend research that focuses on differences in baseline risk of the event, rather than our current obsession with differences in relative risk of a treatment.
