
Treatment interactions can be hard to estimate from data.

Brendan Nyhan writes:

Per #3 here, just want to make sure you saw the Coppock Leeper Mullinix paper indicating treatment effect heterogeneity is rare.

My reply:

I guess it depends on what is being studied. In the world of evolutionary psychology etc., interactions are typically claimed to be larger than main effects (for example, that claim about fat arms and redistribution). It is possible that in the real world, interactions are not so large.

To step back a moment, I don’t think it’s quite right to say that treatment effect heterogeneity is “rare.” All treatment effects vary. So the question is not, Is there treatment effect heterogeneity?, but rather, How large is the treatment effect heterogeneity? In practice, heterogeneity can be hard to estimate, so often all we can say is that, whatever variation there is in the treatment effects, we can’t estimate it well from the data alone.

In real life, when people design treatments, they need to figure out all sorts of details. Presumably the details matter. These details are treatment interactions, and they’re typically designed entirely qualitatively, which makes sense given the difficulty of estimating their effects from data.


  1. Brent Hutto says:

    When a “treatment” intervention is difficult and expensive to implement, there is always a limit on how many participants (or other units of randomization) can be enrolled without the required budget becoming totally unrealistic. Even assuming homogeneous treatment effects and adequate randomization, there may be large uncertainty in assessing treatment effect magnitudes.

    So if we think that, in reality, several potential sources of treatment heterogeneity must be modeled as treatment interactions, then I’d suspect in many cases that’s tantamount to saying we should not study a certain treatment. The required number of units could easily be several times greater than under the homogeneous-treatment-effect (within each assignment arm) assumption.

    I’m not saying I disagree that these issues are important. Just saying the choice in certain fields may come down to attempting to evaluate an intervention with possibly unmet assumptions versus not even trying to evaluate it. Someone commented in another blog post thread about wondering whether there was even any future to statistics or epidemiology; thinking about treatment heterogeneity modeling causes me to worry about that as well.

  2. Z says:

    I’d add that meaningful differences in average effects within subgroups defined by commonly measured baseline covariates (e.g. age, sex, race, income, …) may actually be rare. This would not mean that effect heterogeneity (defined as variance in treatment response across individuals) is rare. It would just mean that we often don’t measure the variables that actually matter to the treatment effect.

    • Daniel says:

      Amen, Z.

      Likewise, the claim at the end of the paper, “we should be more suspect of extant claims of effect moderation,” is a bit too strong and pertains better to post-hoc subgroup moderators lacking theoretical justification (they do refer to these). Those just happened to differ across groups A and B in this sample, and we wouldn’t really expect to see it again because there’s no reason to suspect that group factor changes how the treatment is received…replicate, replicate, replicate. On the other hand, variables that actually matter to the treatment effect should still be expected to introduce systematic variance in treatment effects, and we should be looking for those moderators, even if sometimes they may be challenging to detect for power reasons.

      I’ll offer an example with really small N just to save on typing. Imagine you’re studying how images of naked men (vs. images of clothed men) affect arousal. You have treatment effects (naked=more arousal) of -1, 0, 2, 3, 5, 5 (main effect of 2.33). Obviously there is some heterogeneity in the effects, and it looks like maybe even some systematic heterogeneity. Luckily, you also measured income (low, high) and gender (male, female) of your participants.

      So you see if income moderates (-1, 2, 5, for an average effect of 2 in low, and 0, 3, 5 for an average effect of 2.67 in high), and it doesn’t. No surprise, why should income even matter here? It’s not “no” heterogeneity, but low, hard-to-notice heterogeneity across income, and we’d actually expect that anyway. Perhaps in subsequent studies, we’ll generate precise estimates that income has virtually no effect on the treatment effect.

      But, we actually would expect different treatment effects for men and women (the majority of whom are heterosexual), and so we can see if that moderates (-1, 0, 2 for an average effect of .33 for men and 3, 5, 5 for an average effect of 4.33 for women)- and it sure does!

      There are a lot of factors that we would expect our treatment effects to not systematically vary across, so if we’re just testing all possible interactions, moderation *should* be relatively rare. On the other hand, if we are particular in the moderators we go hunting for based on what could plausibly exert a systematic influence on treatment effects, moderation may be quite abundant (see psych).
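      For anyone who wants to check the arithmetic, the subgroup averages in the toy example above can be recomputed directly (the assignment of the six effects to income and gender groups follows the text):

```python
# Recomputing the toy example's subgroup averages: same six treatment
# effects, grouped by income and gender as described above.
import numpy as np

effects = np.array([-1.0, 0.0, 2.0, 3.0, 5.0, 5.0])
income = np.array(["low", "high", "low", "high", "low", "high"])
gender = np.array(["m", "m", "m", "f", "f", "f"])

print(effects.mean())                      # 2.33: the overall main effect
print(effects[income == "low"].mean(),     # 2.0
      effects[income == "high"].mean())    # 2.67: barely any moderation
print(effects[gender == "m"].mean(),       # 0.33
      effects[gender == "f"].mean())       # 4.33: strong moderation
```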

  3. Garnett says:

    “All treatment effects vary.”

    Can someone explain why we should take this as canonically true?
    Not that I disagree, but I’m looking for ways to justify this to others.

    • Anoneuoid says:

      Because everything is correlated with everything else. That principle is kind of a restatement of this ancient philosophy: “As above, so below.”

      • Garnett says:

        Thanks for your comment.
        Unfortunately, I’m going to need a little more help understanding this idea….

        • Anoneuoid says:

          It means exactly what it says. Literally, there are no zero correlations (perhaps with the exception of non-overlapping light cones).

          These are real correlations, not sampling error or something like that. They may be negligibly small, of no interest, and require impractically large sample size or precision of measurement to detect… but they still exist.

          Yet another thing noted by Paul Meehl:

          These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan [1] and Nunnally [8]. Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother’s education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or non-monotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10^-6).

          Note that this principle is in direct contradiction to the one that motivates NHST. If you are doing NHST hoping to find “differences” (a type of correlation), you are assuming correlations are rare and exceptional.

        • Anoneuoid says:

          I guess I need to connect the two.

          A second principle is: “Every situation is different from every other situation”

          This seems like common sense, but the point is those differences (however slight they may be) will correlate with the treatment effect. Therefore, treatment effects always vary.

        • I remember reading that it’s been calculated that a 1 kg mass one light year away, wiggling back and forth, provides enough gravitational perturbation to the molecular trajectories of the gas in a balloon that, if you ignore it while simulating those trajectories, you will diverge from the correct trajectories within a relatively short time (less than a second, I think).

          In other words, literally everything affects everything in some way.

          • Garnett says:

            Thanks to both of you for your input.

            Is this along the lines of the butterfly flapping its wings affecting everything? Wasn’t there a discussion about this on here a while back? Unfortunately, it still sounds like something you either believe or you don’t….

            (not trying AT ALL to be contrary, just I get these questions a lot and haven’t found a generally satisfying answer).

            • Yes it is. Another way to say this is that there does not exist a true zero effect of anything. The best you can do is ask which things have magnitude smaller than some particular threshold below which you don’t care. Then, to get your bound, you could do a hypothesis test that the magnitude is greater than the threshold of interest, and when the p value is small (I’d tend to suggest something more like 0.0001), you can say “with near certainty, the effect is closer to zero than our threshold of interest.”

              there are no “real” zeros.
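              A minimal sketch of that bound test (two one-sided tests, assuming an approximately normal estimate with a known standard error; the helper name is made up, not any package’s API):

```python
# Sketch of the bound test described above: test H0: |true effect| >= delta
# with two one-sided tests. Assumes an approximately normal estimate with
# known standard error. `p_bounded` is a hypothetical helper, not a library call.
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def p_bounded(estimate, se, delta):
    """p-value for H0: |true effect| >= delta; a small p means the effect
    is, with near certainty, closer to zero than delta."""
    p_hi = phi((estimate - delta) / se)   # one-sided p for H0: effect >= +delta
    p_lo = phi(-(estimate + delta) / se)  # one-sided p for H0: effect <= -delta
    return max(p_hi, p_lo)

# A precise estimate near zero clears a 0.1 threshold easily...
print(p_bounded(0.02, 0.01, 0.1))  # far below 0.0001
# ...but a noisy estimate near zero cannot be bounded at all.
print(p_bounded(0.0, 1.0, 0.1))    # near 0.5
```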

            • Anoneuoid says:

              it still sounds like something you either believe or you don’t

              There is extrapolation involved, but same for anything. Collect enough data and you always see a significant correlation. That is what we observe about the universe.

              That is why we see people play the game of setting stricter and stricter significance thresholds as their data gets “bigger”.

              Once you understand this, the entire modern academic exercise revolving around chasing significant differences really looks ridiculous. Anything actually of interest gets produced incidentally to the main goals of the researchers.

          • Here’s where I first read that

            Apparently Emile Borel showed the motion of a *gram* (not kg) mass a few light years away perturbs the trajectories so that you can’t track them more than a few seconds…

            E. Borel, Le Hasard, 1914; see Raghuveer’s post for further details

          • Ethan Bolker says:

            Not to argue with the point this argument supports, nor with Poincaré, who knew that initial conditions varying by just a bit mattered. But this argument assumes gravity acts instantaneously. LIGO showed gravity waves move at light speed, so you’d have to wait a year to see that perturbation.

            • No, you’d just have to calculate the motion of the gram mass that you’d be able to see today if you had an ultra-telescope.

            • Anoneuoid says:

              Could also be:

              LIGO showed gravity waves move at light speed

              Have they been able to make a real prediction yet? I.e. detect something and tell telescopes where they will see a supernova or similar later?

              • Since the assumption is you have gravity and light traveling at the same speed, you’d probably have to settle for detect something and then tell a sky survey where to look for an event that just happened.

                Since they have only one detector, it’s hard for me to imagine how they’d get directionality, so probably it’s “look for something big that happened somewhere in the sky at exactly this time”

                But I’m just guessing.

              • Anoneuoid says:

                Since the assumption is you have gravity and light traveling at the same speed, you’d probably have to settle for detect something and then tell a sky survey where to look for an event that just happened.

                There was the one where from the news it seemed like they alerted gamma ray observatories to search a given region in the sky after observing a neutron star collision.

                However, on closer inspection, that “signal” was first vetoed by their (fast) “online” procedures as an artifact, and LIGO was alerted by the gamma-ray observatories before the (slower) “offline” procedures were complete. So that one doesn’t count as a true prediction in my book; there is still room for shenanigans.

                This one:

                On 2017 August 17 12:41:06 UTC the Fermi Gamma-ray Burst Monitor (GBM; Meegan et al. 2009) onboard flight software triggered on, classified, and localized a GRB. A Gamma-ray Coordinates Network (GCN) Notice (Fermi-GBM 2017) was issued at 12:41:20 UTC announcing the detection of the GRB, which was later designated GRB 170817A (von Kienlin et al. 2017). Approximately 6 minutes later, a gravitational-wave candidate (later designated GW170817) was registered in low latency (Cannon et al. 2012; Messick et al. 2017) based on a single-detector analysis of the Laser Interferometer Gravitational-wave Observatory (LIGO) Hanford data. The signal was consistent with a BNS coalescence with merger time, tc, 12:41:04 UTC, less than 2 s before GRB 170817A. A GCN Notice was issued at 13:08:16 UTC. Single-detector gravitational-wave triggers had never been disseminated before in low latency. Given the temporal coincidence with the Fermi-GBM GRB, however, a GCN Circular was issued at 13:21:42 UTC (LIGO Scientific Collaboration & Virgo Collaboration et al. 2017a) reporting that a highly significant candidate event consistent with a BNS coalescence was associated with the time of the GRB.


                Also, LIGO has two detectors, and there is a third in Europe (Virgo).

                So they can do some triangulation based on the relative timing of the signals.

        • Martha (Smith) says:


          Perhaps this will help: Correlations that appear surprising are extremely common. Sometimes there is a plausible explanation (e.g., consumption of ice cream and death by drowning are plausibly correlated, since both increase in summer months). But sometimes there is no obvious connection; the online collections of spurious correlations have lots of examples.

  4. Pophealth says:

    I agree that it might be too hasty to say heterogeneity of treatment effect (HTE) is rare. The argument generally goes that you don’t see a lot of variation across treatments, but there are a few things to consider:

    1. Generally, studies are under-powered to detect HTE (whether you subscribe to the 4x or the 16x bigger sample size needed)
    2. With the statistical significance filter, you tend to get noisy estimates that are much less reproducible
    3. You have the file drawer problem, where “non-significant” interactions are not reported

    The three conditions above give a sense that you cannot trust interaction estimates and “real” interactions are few and far between.
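    On point 1, the arithmetic behind the 16x figure can be sketched with made-up effect sizes: in a balanced two-arm study with a 50/50 binary moderator, the interaction’s standard error is twice the main effect’s, and if the interaction is also half the size, the required sample grows by (2 x 2)^2 = 16.

```python
# Back-of-envelope for point 1 (made-up effect sizes): ~80% power at
# alpha = 0.05 corresponds to roughly effect / se = 2.8.
import math

def n_per_arm(effect, sigma, z=2.8):
    """n per arm to detect `effect` in a two-arm comparison of means
    (outcome sd sigma): solve effect / (sigma * sqrt(2/n)) = z."""
    return math.ceil(z**2 * 2 * sigma**2 / effect**2)

def n_per_arm_interaction(effect, sigma, z=2.8):
    """Same, for a difference-in-differences with a 50/50 binary
    moderator: the se is twice as large, so the 2 becomes an 8."""
    return math.ceil(z**2 * 8 * sigma**2 / effect**2)

n_main = n_per_arm(0.5, 1.0)              # main effect of 0.5 sd
n_int = n_per_arm_interaction(0.25, 1.0)  # interaction half that size
print(n_main, n_int, n_int / n_main)      # the ratio is about 16
```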

    Given these problems, I am starting to come around to Gelman’s idea of building your model with multiple interaction terms and presenting all of the results. Sure, you will in all likelihood get noisy HTE estimates, but as long as you summarize and report them correctly, I don’t see a big problem (embrace uncertainty). If you don’t have enough data, say so and let the meta-analyst determine in a systematic review whether there is HTE, with adequate precision.

    The only drawback that I foresee is that, in the absence of HTE (or with very small HTE), your model with lots of interactions is less efficient than your marginal model. I don’t think you would want to make your design so inefficient that you no longer have good precision for your average treatment effect (which is probably your primary objective). A trade-off is required.

    Am I thinking about this correctly?
