Beth Tipton, Chris Bryan, and David Yeager write:

The increasing influence of behavioral science in policy has been a hallmark of the past decade, but so has a crisis of confidence in the replicability of behavioral science findings. In this essay, we describe a nascent paradigm shift in behavioral intervention research—a heterogeneity revolution—that we believe these two historical trends have already set in motion. The emerging paradigm recognizes that the unscientific samples that currently dominate behavioral intervention research cannot produce reliable estimates of an intervention’s real-world impact. Similarly, unqualified references to an intervention’s “true effect” are rarely warranted. Rather, the variation in effect estimates across studies that defines the current replication crisis is to be expected, even in the absence of false positives, as long as heterogeneous effects are studied without a systematic approach to sampling.

I agree! I’ve been ranting about this for a long time—hey, here’s a post from 2005, not long after we started this blog, and here’s another from 2009 . . . I guess there’s a division of labor on this one: I rant and Tipton et al. do something about it.

From one standpoint, the idea of varying treatment effects is obvious. But, when you look at what people do, this sort of variation is typically ignored. When I had my PhD training in the 1980s, we were taught all about causal inference. We learned randomization inference, we learned Bayesian inference, but it was always a model with constant treatment effect. Statistics textbooks—including my own!—always start with the model of constant treatment effect, only including interactions as an option.

And the problem’s not just with statisticians. Behavioral scientists have also been stunningly unreflective regarding the relevance of varying treatment effects to their experimental study. For example, here’s an email I received a few years ago from a prominent psychology researcher: not someone I know personally, but a prominent, very well connected professor at a leading East Coast private university that’s not Cornell. In response to a criticism I gave regarding a paper that relied entirely on data from a self-selected sample of 100 women from the Internet, and 24 undergraduates, the prominent professor wrote:

Complaining that subjects in an experiment were not randomly sampled is what freshmen do before they take their first psychology class. I really *hope* you [see] why that is an absurd criticism – especially of authors who never claimed that their study generalized to all humans.

The paper in question did not attempt to generalize to “all humans,” just to women of childbearing age. The title and abstract to the paper simply refer to “women” with no qualifications, and there is no doubt in my mind that the authors (and anyone else who found this study to be worth noting) are interested in some generalization to a larger population.

The point is that this leading psychology researcher who wrote me that email was so deep into the constant-treatment-effect mindset that he didn’t just think that particular study was OK, he also thought it was “absurd” to be concerned about the non-representativeness of a sample in a psychology experiment.

So that was a long digression. The point is that the message sent by Tipton, Bryan, and Yeager, while commonsensical and clear, is not so apparent. For whatever reason, it’s taken people a while to come to this point.

Why? For one thing, interactions are hard to estimate. Remember 16: you need about 16 times the sample size to estimate an interaction as to estimate a main effect. So, for a long time we’ve had this attitude that, since interactions are hard (sometimes essentially impossible) to identify from data, we might as well just pretend they don’t exist. It’s a kind of Pascal’s wager or bet-on-sparsity principle.
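Where does the 16 come from? Here is a minimal sketch of the arithmetic, assuming a balanced two-arm experiment, one binary covariate that splits the sample in half, a common residual standard deviation `sigma`, and an interaction that is half the size of the main effect. The function names are made up for illustration, and the "half the size" assumption is doing a lot of the work.

```python
import math

def se_main_effect(sigma: float, n: int) -> float:
    """Standard error of a difference in means between two arms of n/2 each."""
    return sigma * math.sqrt(1 / (n / 2) + 1 / (n / 2))  # equals 2*sigma/sqrt(n)

def se_interaction(sigma: float, n: int) -> float:
    """Standard error of a difference in differences across four cells of n/4 each."""
    return sigma * math.sqrt(4 * (1 / (n / 4)))  # equals 4*sigma/sqrt(n)

def sample_size_multiplier(relative_size: float) -> float:
    """Factor by which n must grow for the interaction to reach the main effect's z-score.

    relative_size is (interaction / main effect); 0.5 is the assumption made above.
    """
    se_ratio = se_interaction(1.0, 1000) / se_main_effect(1.0, 1000)  # 2 for any n
    # z-score ratio is relative_size / se_ratio, and z grows like sqrt(n).
    return (se_ratio / relative_size) ** 2

print(sample_size_multiplier(0.5))  # approximately 16
```

If the interaction were as large as the main effect, the same function gives a factor of about 4, so the number depends heavily on how big you think the interaction is.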

More recently, though, I’ve been thinking we need to swallow our pride and routinely model these interactions, structuring our models so that the interactions we estimate make sense. Some of this structuring can be done using informative priors, some of it can be done using careful choices of functional forms and transformations (as in my effects-of-survey-incentives paper with Lauren). But, even if we can’t accurately estimate these interactions or even reliably identify their signs, it can be a mistake to just exclude them, which is equivalent to assuming they’re zero.

Also, let’s move from the overplayed topic of analysis to the still-fertile topic of design. If certain interactions or aspects of varying treatment effects are important, let’s design studies to specifically estimate these!

To put it another way: **We’re already considering treatment interactions, all the time.**

Why do I say that? Consider the following two pieces of advice we always give to researchers seeking to test out a new intervention:

1. Make the intervention as effective as possible. In statistics terms, multiplying the effect size by X is equivalent to multiplying the sample size by X^2. So it makes sense to do what you can to increase that effect size.

2. Apply the intervention to people who will be most receptive to the treatment, and in settings where the treatment will be most effective.

OK, fine. So how do you do 1 and 2? You can only do these if you have some sense of how the treatment effect can vary based on manipulable conditions (that’s item 1) and based on observed settings (that’s item 2). It’s a Serenity Prayer kind of thing.

So, yeah, understanding interactions is crucial, not just for interpreting experimental results, but for designing effective experiments that can yield conclusive findings.
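The effect-size arithmetic in item 1 above can be checked in a few lines. This is only a sketch for a simple difference in means with equal-sized arms; the numbers (an effect of 0.1, `sigma` of 1, 400 people, X = 3) are arbitrary choices for illustration.

```python
import math

def z_score(effect: float, sigma: float, n: int) -> float:
    """z-score of a difference in means with n/2 people per arm."""
    standard_error = 2 * sigma / math.sqrt(n)
    return effect / standard_error

x = 3
baseline = z_score(effect=0.1, sigma=1.0, n=400)           # z is about 1
stronger = z_score(effect=0.1 * x, sigma=1.0, n=400)       # effect multiplied by X
bigger = z_score(effect=0.1, sigma=1.0, n=400 * x ** 2)    # sample multiplied by X^2

print(baseline, stronger, bigger)
```

Tripling the effect and collecting nine times as much data land on essentially the same z-score, which is why it pays to know where and for whom the effect is largest.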

**Big changes coming**

In our recent discussion of growth mindset interventions, Diana Senechal wrote:

We not only have a mixture of mindsets but actually benefit from the mixture—that we need a sense of limitation as well as of possibility. It is fine to know that one is better at certain things than at others. This allows for focus. Yes, it’s important to know that one can improve in areas of weakness. And one’s talents also contain weaknesses, so it’s helpful, overall, to know how to improve and to believe that it can happen. But it does not have to be an all-encompassing ideology, nor does it have to replace all belief in fixity or limitation. One day, someone will write a “revelatory” book about how the great geniuses actually knew they were bad at certain things–and how this knowledge allowed them to focus. That will then turn into some “big idea” and go to extremes of its own.

I agree. Just speaking qualitatively, as a student, teacher, sibling, and parent, I’d say the following:

– When I first heard about growth mindset as an idea, 20 or 30 years ago, it was a bit of a revelation to me: one of these ideas that is obvious and that we knew all along (yes, you can progress more if you don’t think of your abilities as fixed) but where hearing the idea stated in this way could change how we think.

– It seems clear that growth mindset can help some kids, but not all or even most, as these have to be kids who (a) haven’t already internalized growth mindset, and (b) are open and receptive to the idea. This is an issue in learning and persuasion and change more generally: For anything, the only people who will change are those who have not already changed and are willing to change. Hence a key to any intervention is to target the right people.

– If growth mindset becomes a dominant ideology, then fixed-mindset interventions could be helpful to some students. Indeed, maybe this is already the case.

The interesting thing is how much these above principles would seem to apply to so many psychological and social interventions. But when we talk about causal inference, we typically focus on the average treatment effect, and we often simply fit regression models in which the treatment effect is constant.

This suggests, in a God-is-in-every-leaf-of-every-tree way, that we’ve been thinking about everything all wrong for all these decades, focusing on causal identification and estimating “the treatment effect” rather than on these issues of receptivity to treatment.

Or to put it in very simple terms, Vitamin B supplements will probably help someone who has a serious deficiency of Vitamin B, but for everyone else the effect will be small, zero, or possibly slightly negative.
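To make the Vitamin B point concrete, here is a small simulated experiment. Everything in it is hypothetical: 10% of people are deficient, the supplement helps them by one standard deviation of the outcome, and it does exactly nothing for everyone else (the "slightly negative" case is left out).

```python
import random
from statistics import mean

def simulate(n: int = 20000, share_deficient: float = 0.1, seed: int = 1):
    """Randomized experiment in which only deficient people benefit (made-up numbers)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        deficient = rng.random() < share_deficient
        treated = rng.random() < 0.5
        effect = 1.0 if deficient else 0.0  # assumed individual effects, in sd units
        outcome = rng.gauss(0.0, 1.0) + (effect if treated else 0.0)
        rows.append((deficient, treated, outcome))
    return rows

def diff_in_means(rows) -> float:
    """Treated-minus-control difference in mean outcome."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)

rows = simulate()
ate = diff_in_means(rows)
effect_deficient = diff_in_means([r for r in rows if r[0]])
effect_others = diff_in_means([r for r in rows if not r[0]])
print(f"average: {ate:.2f}, deficient: {effect_deficient:.2f}, others: {effect_others:.2f}")
```

The average effect should come out near 0.1, which describes neither group: it is roughly ten times too small for the deficient and too large for everyone else.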

Indeed… it’s not really the varying treatment effects. It’s the mechanistic modeling. Knowing that something varies but not why leads you to average treatment effects… but knowing that something varies and having ideas of why leads you to mechanistic modeling of the why, and now we’re doing science.

Agree with the post and the comments. One observation:

I work in medicine, often trying to model treatment outcomes for patients. As you might imagine (or, maybe not, if you aren’t familiar with how medicine “really” works) the heterogeneity in response is GIGANTIC for most treatments.

I often say that the easiest way for us to improve outcomes is to stop treating patients who won’t get better. While this seems obvious (and almost trivial/circular in reasoning), it is shockingly difficult to implement. Even if we can identify the “poor” candidates (which, we often can) it is nearly impossible to get the MDs *AND* the patients to go along with this approach.

Some of this may be societal – lots of folks just want a pill/surgery/device to make the problem go away, and are willing to try anything, even if it has an extremely low likelihood of working. Some of this is economic – Hospitals/Clinics/MDs get paid to treat people, not counsel them on the fact that there is no treatment that will work.

In any case, I am 100% convinced that there are incredibly useful lessons from the Social Sciences modeling – where RCTs and scientifically proven mechanisms are rare – that are directly applicable to the clinical medicine world. The reality is that only a tiny fraction of conditions have solid evidence (or even crappy NHST type evidence ;)) supporting specific treatments for individual patients.

I think your point about mechanistic modeling is important. A statistical model can detect “effects”, but ultimately it is just a fancy way of redescribing the data. Too often (as in a lot of branches of social science and medical research), the statistical model is treated as the end goal, when really it is just the starting point.

+1

I’m reminded of Tamiflu. If you take it early enough after you get the flu, the disease does not develop. If you are even a few hours late, the drug is useless and you get sick. But because drugs are scored by average treatment effect, and we don’t know how long folks were sick before they took it, the efficacy of Tamiflu is described as “shortens the duration of illness by 12 hours” or something like that. The odds that your illness will be reduced by 12 hours are close to nil.
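The arithmetic of that stylized all-or-nothing story is easy to write down. The numbers below are invented purely so that the average comes out to 12 hours; they are not estimates of anything about the actual drug.

```python
from statistics import mean

# Hypothetical: 10% of patients take the drug early enough to avoid a 120-hour
# illness entirely; the other 90% take it too late and get no benefit.
hours_saved = [120.0] * 10 + [0.0] * 90

average_benefit = mean(hours_saved)

# Share of patients whose own benefit is within 6 hours of the average.
near_average = [1.0 if abs(h - average_benefit) <= 6 else 0.0 for h in hours_saved]
share_near_average = mean(near_average)

print(average_benefit)     # 12.0
print(share_near_average)  # 0.0
```

Under these assumptions the reported 12 hours is a correct average that no individual patient experiences.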

Nice example to illustrate the point.

I think you forgot the link to Tipton et al’s paper at the top.

Link added; thanks.

A strong yes to this post. FWIW, this is what I wrote six years ago: http://econospeak.blogspot.com/2014/05/regression-analysis-and-tyranny-of.html

Peter:

I agree with what you wrote. I think one thing that you could add is that it’s difficult to estimate variability in treatment effects (recall the magic number 16), and in statistics we’re often trained to think that if something can’t be measured or estimated precisely, it can be safely ignored.

Another possible hesitancy to model interactions: sometimes the interactions would be between a varying intercept group and a continuous (a/k/a “fixed”) effect (amount of Vitamin B, etc.).

If you want to model that sort of interaction, one option could be to discretize the continuous variable into a few meaningful bins that could serve as potential slopes to vary against the modeled group (e.g., deficient, average, too much instead of a raw measurement range). But many have been drilled to believe that “binning” continuous variables is per se bad and results in loss of information, not gain. In truth, especially with the rise of penalized smoothing splines, discretization can often produce more accurate results, particularly if it is capturing non-linearities or allowing an interaction that would otherwise be missed. But I suspect it goes against the training of many analysts to even try it.

It would be helpful for a paper (or blog post!) to confront this mentality directly and show when it turns out to be good advice and when it actually stands in the way of a better model.

If you’re going with a smoothing spline, it seems like you might as well fit the smoothing spline to the unbinned data. I think binning is mostly useful as a way for people who don’t know how to specify nonlinear models well to nevertheless fit nonlinear models.

Well, some software does this automatically and some does not. My point was that people don’t know the “right” way (if there is a consistently best one) to interact a continuous variable with varying intercepts and are probably reluctant to discretize toward that end.

Can you give me an example problem, something more concrete? like sticking with your vitamin B example. Perhaps I could better understand what you mean.

Fair enough. Let’s take the Vitamin B example. We have repeated measurements of 100 individuals for whether they have some illness, which we model as an IID varying intercept. We also have estimates of Vitamin B levels, a continuous measurement which is of primary interest for the study. We suspect that Vitamin B interacts differently with different individuals, which ordinarily might be handled through a varying slope, but here, it would require discretizing the Vitamin B variable in some way to create the slopes.

I’m still not getting it… is “whether they have some illness” the outcome or a covariate?

I could see the outcome being say hospitalization, as a function of illness and vit B…

H ~ f(I,B)

or the illness itself as a function of vit B

I ~ g(B)

Either way, I don’t see why B would need to be discretized

I think the point he’s making is that the interaction can be nonlinear. One way to estimate that nonlinearity would be to bin B.

So if there’s an interaction I’ll assume for example that both I and B affect H in a nonlinear way… and so it’s perfectly possible to simply fit a nonlinear 2D function. The discretization seems like kind of a poor-man’s technique when your software won’t allow you to fit anything other than linear functions. By discretizing say B you can then fit linear functions to “low, medium, and high” which allows you to basically fit a kind of piecewise function to the behavior… but you’re better off just thinking about f(I,B) as a nonlinear function and fitting it directly.
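A toy comparison of the two approaches, for B alone, might look like the sketch below. It is rigged in one important way: the simulated truth is exactly linear in log(B) and the continuous fit is allowed to know that, so it shows the best case for fitting the function directly rather than a general result. The curve, noise level, sample size, and bin edges are all assumptions made for illustration.

```python
import math
import random
from statistics import mean

rng = random.Random(0)

def truth(b: float) -> float:
    """Assumed dose-response curve: steep at low B, flattening at high B."""
    return 1.0 + 2.0 * math.log(b)

# Simulated data: B uniform on [0.5, 5], outcome = truth + unit-variance noise.
b_values = [rng.uniform(0.5, 5.0) for _ in range(400)]
y_values = [truth(b) + rng.gauss(0.0, 1.0) for b in b_values]

def bin_index(b: float) -> int:
    """Three equal-width bins on [0.5, 5]: low, medium, high."""
    return min(int((b - 0.5) / 1.5), 2)

# Option 1: discretize B and predict with the mean outcome in each bin.
bin_means = [
    mean([y for b, y in zip(b_values, y_values) if bin_index(b) == k])
    for k in range(3)
]

# Option 2: keep B continuous and fit y = a + c*log(B) by ordinary least squares.
x_values = [math.log(b) for b in b_values]
x_bar, y_bar = mean(x_values), mean(y_values)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values)) / sum(
    (x - x_bar) ** 2 for x in x_values
)
intercept = y_bar - slope * x_bar

# Compare both fits with the known truth on an even grid of B values.
grid = [0.5 + 4.5 * i / 200 for i in range(201)]

def rmse(predict) -> float:
    return math.sqrt(mean([(predict(b) - truth(b)) ** 2 for b in grid]))

rmse_binned = rmse(lambda b: bin_means[bin_index(b)])
rmse_continuous = rmse(lambda b: intercept + slope * math.log(b))
print(f"binned: {rmse_binned:.2f}, continuous: {rmse_continuous:.2f}")
```

With the functional form guessed correctly, the continuous fit should track the truth much more closely than three bin means. With a wrong guess or very little data the gap can shrink or reverse, which is the practical case for binning made in the next comment.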

Yea in theory I don’t disagree that you should think about estimating f(I,B) directly. I myself built a package to estimate HETE effects using nets: https://github.com/Ibotta/mr_uplift.

But binning can be effective in practice for a few reasons.

1) Smaller parameter space. Fitting a spline might be better in theory, but if you have limited data it might be more efficient to bin the data first.

2) Interpretability / SQL predictions. It might be easier to explain to people what the model is doing with if logic than with 2.3x^5+-12x^4+…

In general I prefer plotting the function to reporting coefficients. When I work with nonlinear functions I often use radial basis expansions, because they’re so flexible. Rather than emphasizing the formula I’m using, which is really just a generic thing, I’d tend to draw spaghetti plots of posterior draws of the function. In a 2D context, I might do small multiple plots of the surface, or draw curves at discrete levels of say the B value.

So, you’re estimating f(I,B) as a continuous function, but for display purposes you might plot f(I1,B), f(I2,B) each as functions of B alone.

To me it makes sense to start with the idea that you want to estimate a 2D function f(I,B) and then maybe make the conscious choice of doing some simplification. It shouldn’t be the case that we start with the simplified ideas and then treat them as essentially “the way it’s done”.

I think we need a second folk theorem of statistics:

If you aren’t struggling to fit your model, it isn’t complex enough to meaningfully describe your real-world problem.

+1