5 different reasons why it’s important to include pre-treatment variables when designing and analyzing a randomized experiment (or doing any causal study)

In presenting causal inference and randomized experiments, we start with the basic framework in which there are pre-treatment predictors x, treatment z, and outcome y, with potential outcomes y(z). Here it is in Regression and Other Stories:

Our presentation is different than many other textbooks which start with z and y, only later including x.

So then the question arises: Why is it such a good idea to include x? Why is the pre-treatment predictor (or predictors) so important, both in practice and for our understanding of causal inference?

Here are five reasons for including pre-treatment predictors:

1. Adjust for bias in a non-randomized design
2. Adjust for random imbalances in a randomized design (and for nonrandom imbalances because of imperfect randomization, dropout, etc.)
3. Reduce the standard error of the estimated effect
4. Check for imbalance and lack of overlap between treatment and control groups
5. Generalize to a population with a different distribution of x

We explain further in chapters 19 and 20 of Regression and Other Stories, but I’m not claiming any originality here. This is all common knowledge among statisticians who work on these sorts of problems. But sometimes people hear about randomization or some other identification strategy, and they don’t realize that:
– Even if you have identification, adjusting for pre-treatment variables can give you statistical efficiency (item 3 above) and generalization (item 5);
– If your identification is imperfect, adjusting for pre-treatment variables can let you check that (item 4) and adjust for problems (item 2);
– In real life usually your identification isn’t everything you think it is, so it’s important to adjust anyway (item 1).
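
To make item 3 concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration (the sample size, the true effect of 1.0, the slope of 2.0 on x, the function name one_experiment); it is not taken from the book. It repeats a completely randomized experiment many times and compares the spread of the plain difference in means with the spread of the coefficient on z from a least-squares regression of y on z and x.

```python
import random
import statistics
from statistics import fmean

def centered_sums(xs, ys):
    """Return (sum of cross-products, sum of squares of x) about the group means."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy, sxx

def one_experiment(rng, n=200, effect=1.0, slope=2.0):
    """Simulate one randomized experiment; return (unadjusted, adjusted) estimates."""
    x = [rng.gauss(0, 1) for _ in range(n)]        # pre-treatment predictor
    z = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(z)                                  # complete randomization
    y = [effect * zi + slope * xi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

    x1 = [xi for xi, zi in zip(x, z) if zi == 1]
    x0 = [xi for xi, zi in zip(x, z) if zi == 0]
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]

    # Unadjusted estimate: difference in mean outcomes.
    unadjusted = fmean(y1) - fmean(y0)

    # Adjusted estimate: the coefficient on z in least squares of y on (1, z, x).
    # With one covariate this equals the difference in means minus the pooled
    # within-group slope times the difference in mean x.
    sxy1, sxx1 = centered_sums(x1, y1)
    sxy0, sxx0 = centered_sums(x0, y0)
    b = (sxy1 + sxy0) / (sxx1 + sxx0)
    adjusted = unadjusted - b * (fmean(x1) - fmean(x0))
    return unadjusted, adjusted

rng = random.Random(2024)
draws = [one_experiment(rng) for _ in range(2000)]
unadj = [d[0] for d in draws]
adj = [d[1] for d in draws]
sd_unadj = statistics.stdev(unadj)   # spread of the difference in means
sd_adj = statistics.stdev(adj)       # spread of the regression-adjusted estimate

print(f"mean unadjusted: {fmean(unadj):.2f}, sd: {sd_unadj:.2f}")
print(f"mean adjusted:   {fmean(adj):.2f}, sd: {sd_adj:.2f}")
```

Both estimators are centered near the true effect in this setup, because of the randomization, but the adjusted one varies much less across replications when x is strongly predictive of y. If you set slope to 0 the advantage disappears, which is the caveat discussed in the comments.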

There are lots of ways to do this adjustment: linear regression, logistic regression, nonparametric models, etc. In her classic 2011 paper, Jennifer Hill uses a nonparametric model to simultaneously adjust for differences between treatment and control groups and to generalize to the population, and in recent years much more has been done in this area, for example this 2018 paper by Athey and Wager. Again, though, in many settings you can get pretty far with simple linear and logistic regression, as we did in 1990 when estimating incumbency advantage (although we did later return to the problem and do better using a probabilistic selection model).

Similarly, if you have an identification method such as regression discontinuity that already includes one pre-treatment predictor, you should include others. For regression discontinuity in particular, the variable that drives the discontinuity is not always a good predictor of the outcome, and you can do better by also including pre-test scores or whatever.

Again, the general theme is x, z, y. The treatment z affects the outcome y, and you want to model this behavior conditional on pre-treatment characteristics x.
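
Here is a small sketch of what modeling conditional on x buys you for item 5, generalizing to a population with a different distribution of x. The setup is hypothetical: a binary x, true effects of 1.0 and 3.0 in the two strata, an experiment in which 20% of participants have x = 1, and a target population in which 70% do. The code estimates the effect within each stratum and then reweights (poststratifies) to the population shares.

```python
import random
from statistics import fmean

rng = random.Random(1)

N = 4000
SAMPLE_SHARE_X1 = 0.2          # assumed share of x = 1 among experiment participants
POP_SHARE_X1 = 0.7             # assumed share of x = 1 in the target population
TRUE_EFFECT = {0: 1.0, 1: 3.0} # assumed treatment effect within each stratum of x

# Simulate the experiment: rows of (x, z, y) with z randomized 50/50.
rows = []
for _ in range(N):
    x = 1 if rng.random() < SAMPLE_SHARE_X1 else 0
    z = 1 if rng.random() < 0.5 else 0
    y = 0.5 * x + TRUE_EFFECT[x] * z + rng.gauss(0, 1)
    rows.append((x, z, y))

def diff_in_means(data):
    """Treated-minus-control difference in mean outcome."""
    treated = [y for _, z, y in data if z == 1]
    control = [y for _, z, y in data if z == 0]
    return fmean(treated) - fmean(control)

# Effect averaged over the people who happened to be in the experiment.
sample_estimate = diff_in_means(rows)

# Effect within each stratum of x, then reweighted to the population shares.
stratum_estimate = {x: diff_in_means([r for r in rows if r[0] == x]) for x in (0, 1)}
population_estimate = ((1 - POP_SHARE_X1) * stratum_estimate[0]
                       + POP_SHARE_X1 * stratum_estimate[1])

print(f"sample average effect estimate:     {sample_estimate:.2f}")
print(f"population average effect estimate: {population_estimate:.2f}")
```

In this simulation the true average effect among participants is 0.8(1.0) + 0.2(3.0) = 1.4, while the true average effect in the target population is 0.3(1.0) + 0.7(3.0) = 2.4. Without x there would be no way to get from the first number to the second.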

Every once in a while we come across a study in which there are no pre-treatment variables. Typically this reflects a failure of data collection, where the researchers were overconfident from the purported causal identification in their design. That’s one reason why it’s important to think about x in the design stage, before collecting and analyzing the data.

Sometimes there really isn’t any useful pre-treatment information available. Bummer! Even there, though, I think it’s useful to think about pre-treatment variables and what you would do with them—in the same way that, even if you can’t do random assignment, it’s typically a helpful thought experiment to consider a hypothetical, even if infeasible, randomized design (“force some people to smoke and force others to abstain,” etc.), as this can provide more insight into the process you are trying to model, the effect you’re trying to estimate, and the population of scenarios to which this effect might apply.

18 thoughts on “5 different reasons why it’s important to include pre-treatment variables when designing and analyzing a randomized experiment (or doing any causal study)”

  1. To get off on an innocuous wrong foot [regarding what should follow the word, “different”]

    “Our presentation is different than many other textbooks which start with z and y, only later including x” see the delightful

    https://www.merriam-webster.com/grammar/different-from-or-different-than

    “A considerable amount of ink and pixels have been shed over the past several hundred years, in a valiant attempt to force the English-speaking people to choose the correct word to use immediately after different. From is the word most of the usage guides want you to use, especially in the US, so if that’s all you wanted to know you can leave this article now, untroubled by information on semi-literate 18th-century grammarians and the mysterious mating habits of the comparative adjective. But to the rest of you … let’s go.”

    • Have you ever read the British linguist Prof. David Crystal’s response to zero-tolerance grammarians? It’s called “The Fight for English: How language pundits ate, shot, and left”; I strongly recommend it to you given your recent posts here.

  2. This gets a bit tricky with non-collapsibility (e.g., logistic regression), where you don’t reduce the standard error. You do still increase the signal:noise ratio, but it’s by dilating the target parameter. There’s a plausible argument that you should instead be getting a better estimate of the same parameter using the covariates for standardisation. This reduces to regression adjustment in the linear case.

    • Thomas:

      Yes, good point. In the above post, I’m using terms like “bias,” “estimate,” and “standard error” in a casual (or, one might say, sloppy) way. In real life I would prefer Bayesian inference, not point estimation; I think the classical concept of “bias” is not so relevant; and I’m interested both in uncertainty and variation, not just a standard error.

    • But perhaps that’s mainly a reason not to focus on an odds ratio! So even if using logistic regression, you would just be using that to compute some collapsible quantity.
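
    For readers who have not run into non-collapsibility, here is a small numerical sketch. The coefficients and the 50/50 binary x are assumptions picked for illustration. With z independent of x (as under randomization), the odds ratio conditional on x is exp(b) in both strata, yet the marginal odds ratio obtained by averaging the risks over x (standardisation) is closer to 1. The risk difference, by contrast, is collapsible: the average of the stratum-specific risk differences equals the marginal one.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def odds(p):
    return p / (1.0 - p)

# Assumed logistic model: logit P(y = 1 | z, x) = a + b*z + c*x.
a, b, c = -1.0, 1.0, 2.0
p_x1 = 0.5  # assumed P(x = 1); x is independent of z, as under randomization

def risk(z, x):
    return expit(a + b * z + c * x)

# Conditional odds ratios: the same in each stratum of x, equal to exp(b).
conditional_or = {x: odds(risk(1, x)) / odds(risk(0, x)) for x in (0, 1)}

# Standardisation: average the risks over the distribution of x, then compare.
marginal_risk = {z: (1 - p_x1) * risk(z, 0) + p_x1 * risk(z, 1) for z in (0, 1)}
marginal_or = odds(marginal_risk[1]) / odds(marginal_risk[0])

# Risk differences: averaging the conditional ones reproduces the marginal one.
conditional_rd = {x: risk(1, x) - risk(0, x) for x in (0, 1)}
averaged_rd = (1 - p_x1) * conditional_rd[0] + p_x1 * conditional_rd[1]
marginal_rd = marginal_risk[1] - marginal_risk[0]

print("conditional odds ratios:", conditional_or)
print("marginal odds ratio:    ", marginal_or)
print("averaged vs marginal risk difference:", averaged_rd, marginal_rd)
```

    So adjusting for x in a logistic regression changes which odds ratio you are estimating, whereas standardising to a risk difference (or another collapsible summary) keeps the target fixed while still using x.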

  3. In my experience, most researchers in the social sciences (including and perhaps especially in economics) do not really know or care about anything but bias. They will happily choose estimators that are hopelessly inefficient in their given use cases because “bias=bad”. The idea that it might sometimes be preferable to choose a slightly biased but efficient approach over a less biased but incredibly inefficient one is just completely alien to them. Hence all the cookie-cutter unit-fixed-effects panel regressions (‘What do you mean, my effective sample size is now <10?’ ‘Ok, so the theory I’m trying to test is really about cross-sectional differences between people/organizations/countries, but “unobserved heterogeneity,” so xtreg y x, fe goes brrrrt’) and quasi-experimental studies that are based on almost no residual variation for estimating the treatment effects.

    • Alex:

      Yeah, I discussed this point in my 2013 post, Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses, with related issues discussed in my 2011 post, The bias-variance tradeoff.

      Also there was this talk I gave in 2014, “Unbiasedness”: You keep using that word. I do not think it means what you think it means. Great title, huh?

      I guess I should write this up formally sometime. The quick answer is:
      1. When researchers restrict themselves to noisy, so-called unbiased estimates, they compensate by aggregating data, so that they are, at best, getting unbiased estimates of averages that are of no direct interest.
      2. Even setting aside the above point, so-called unbiased estimators aren’t actually unbiased because in real life there is selection on statistical significance.

      • Thanks for the reply, Andrew.

        Yup, your posts here have done a lot to shape my thinking as an applied researcher on this issue – which I’m very grateful for.

        It would be great to be able to point people to a formal, easy to follow write-up!

        One ‘didactic’ trick I often use to sensitize people to this issue (I might have stolen it from this blog):
        Showing a really stylized graphical representation of the implied sampling distributions of a) an estimator with some bias but great efficiency (so, narrow distribution with a peak at slightly different value than the true value of the parameter of interest), and b) an estimator with no bias but super low efficiency (so, super-wide distribution but with the peak exactly on the true value of the parameter of interest)*. Then I ask, ok, you have just one observed sample out of this distribution, which estimate is likely going to be closer to the true value?

        *Typically, I’ll use some variation of two hypothetical “wrong” ways of estimating a mean from a normal distribution, e.g. one where the sum of the observations is divided by n minus some arbitrary number, e.g. 5 (introducing upwards bias**), and one where we divide by n but instead only use a random selection of the original sample to calculate the sample mean, e.g. only 10 percent of the observations. The trick is to make the example as simple as possible, and I have had some success with this one even with undergraduate social science students.

        **Also works very well to explain consistency! Most undergraduate students are able to figure that one out by just asking them ‘ok, so this is biased – but what would happen to the distribution if I increase the number of observations per sample?’
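
        The classroom example described above can be sketched as a short simulation. The true mean of 10, standard deviation of 10, and n = 100 are assumptions made here for illustration; the two "wrong" estimators are the ones from the comment: the sum divided by n - 5 (biased upward for a positive mean, but low variance) and the mean of a random 10 percent of the observations (unbiased, but noisy).

```python
import random
from statistics import fmean

MU, SIGMA, N = 10.0, 10.0, 100   # assumed true mean, sd, and sample size

def two_estimates(rng):
    """Return (biased-but-precise, unbiased-but-noisy) estimates of the mean."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    biased = sum(sample) / (N - 5)            # divides by n - 5 instead of n
    subsample = rng.sample(sample, N // 10)   # keeps only 10% of the observations
    noisy = fmean(subsample)
    return biased, noisy

rng = random.Random(7)
draws = [two_estimates(rng) for _ in range(5000)]

mean_biased = fmean(b for b, _ in draws)
mean_noisy = fmean(u for _, u in draws)
mse_biased = fmean((b - MU) ** 2 for b, _ in draws)
mse_noisy = fmean((u - MU) ** 2 for _, u in draws)

# How often is the biased estimate closer to the truth in a single sample?
share_biased_closer = fmean(
    1.0 if abs(b - MU) < abs(u - MU) else 0.0 for b, u in draws
)

print(f"biased:   mean {mean_biased:.2f}, MSE {mse_biased:.2f}")
print(f"unbiased: mean {mean_noisy:.2f}, MSE {mse_noisy:.2f}")
print(f"share of samples where the biased estimate is closer: {share_biased_closer:.2f}")
```

        For the consistency question, note that the expected value of the first estimator is MU * N / (N - 5), so its bias shrinks toward zero as N grows, while the second estimator's variance stays ten times larger than the full-sample mean's at any N.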

  4. I don’t quite understand the third reason you give.
    At least from a frequentist perspective, it is NOT true that adjustment for pre-treatment variables necessarily gives you more efficient estimates, in the sense of a lower mean squared error compared to unadjusted estimates. The misspecification bias from a bad model may outweigh the reduction in variance. See the two seminal papers on this issue:

    Freedman, David A. “On regression adjustments to experimental data.” Advances in Applied Mathematics 40.2 (2008): 180-193.
    Lin, Winston. “Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique.” The Annals of Applied Statistics 7.1 (2013).

    Maybe Bayesians think about this differently – but what exactly do you then mean by “statistical efficiency”?

    • Jonas:

      I agree that any specific adjustment for pre-treatment variables does not necessarily give you more efficient estimates. There’s nothing you can do that necessarily increases efficiency. For example, suppose you increase the sample size of a study? You might think that’s gotta increase efficiency? But, no, you can increase your sample size and be drawing from a higher-variance part of data space, so that increasing N makes efficiency go down.

      Speaking more generally, claims of increased efficiency—or, for that matter, claims of decreased bias—are model-based, and there will always be cases where the model is so far off that certain decisions don’t make sense anymore.

      To get back to regression adjustment: Everybody knows that it’s a bad idea to adjust for pre-treatment variables that have no connection to the outcome. At the extreme, you would not do a regression adjustment on a purely random x. More generally, if the predictive power of x is low, an adjustment can make things worse. More interestingly, to the extent that the relation of E(y) to x is nonlinear, and you have enough data to estimate this, it will make sense to fit a nonlinear regression—otherwise, you’ll be estimating some average adjustment that won’t necessarily answer your questions of interest. For that matter, not adjusting for x when it is predictive of y will also yield an estimate of some average quantity of interest that won’t necessarily answer your questions of interest. That’s why we use flexible models and are concerned with how to fit such models in small samples. As I wrote above, “there are lots of ways to do this adjustment: linear regression, logistic regression, nonparametric models, etc. ”

      Regarding my above post, what I was giving was five reasons for including pre-treatment predictors. I wasn’t saying that all these reasons are operative all the time, just that they’re good reasons. All five of these, like just about any statistical recommendations, don’t apply 100% of the time.

      But we’re statisticians! We’re comfortable with uncertainty, and we don’t demand that procedures work 100% of the time. Rather, we try to understand the limitations of the methods that we use, and apply our methods judiciously.

      The point about the above post is that adjusting for pre-treatment variables is important for so many reasons, and students don’t always realize that.

    • But Winston Lin’s paper indeed provides good reason to think you’ll weakly decrease asymptotic variance. If you’re set on using linear regression, just interact the centered covariate with treatment. Or maybe you mean something further?

      • Indeed, if done right (i.e. in the way Lin suggests), there is arguably no downside to adjusting for pre-treatment variables. You still need to use Huber-White standard errors or bootstrapping though.
        It’s just that I don’t see people doing this; usually it’s just a basic linear regression without any interaction terms, under the assumption that this will improve the estimates – which is not always true.

        • I see the inclusion of interaction terms as becoming more popular. Also, it really only matters if, with a binary treatment, the treatment probability isn’t close to 1/2.
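
          A rough sketch of the comparison discussed in this thread, under assumptions chosen for illustration: 200 units with only 40 treated (so the treatment probability is far from 1/2), a treatment effect that varies with x, and a true average effect of 1.0. It compares the difference in means, the additive regression adjustment (no interaction), and the interacted adjustment with a centered covariate. With a single covariate, the interacted estimate equals the difference between the two arm-specific regression lines evaluated at the overall mean of x, which is how it is computed here. Standard errors (Huber-White or bootstrap) are not computed; the code only looks at the spread of the point estimates across simulated experiments.

```python
import random
import statistics
from statistics import fmean

def centered_sums(xs, ys):
    """Return (sum of cross-products, sum of squares of x) about the group means."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy, sxx

def one_experiment(rng, n=200, n_treated=40):
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [1] * n_treated + [0] * (n - n_treated)
    rng.shuffle(z)  # complete randomization with unequal arm sizes
    # Assumed outcomes: control y = x + noise, treated y = 1 + 3x + noise,
    # so the effect is 1 + 2x and averages to 1.0 over x ~ N(0, 1).
    y = [(1 + 3 * xi if zi else xi) + rng.gauss(0, 1) for xi, zi in zip(x, z)]

    x1 = [xi for xi, zi in zip(x, z) if zi == 1]
    x0 = [xi for xi, zi in zip(x, z) if zi == 0]
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    sxy1, sxx1 = centered_sums(x1, y1)
    sxy0, sxx0 = centered_sums(x0, y0)

    unadjusted = fmean(y1) - fmean(y0)

    # Additive adjustment: one pooled within-group slope for both arms.
    pooled_slope = (sxy1 + sxy0) / (sxx1 + sxx0)
    additive = unadjusted - pooled_slope * (fmean(x1) - fmean(x0))

    # Interacted adjustment: separate slope per arm, compared at the overall mean of x.
    x_bar = fmean(x)
    fit1 = fmean(y1) + (sxy1 / sxx1) * (x_bar - fmean(x1))
    fit0 = fmean(y0) + (sxy0 / sxx0) * (x_bar - fmean(x0))
    interacted = fit1 - fit0
    return unadjusted, additive, interacted

rng = random.Random(11)
draws = [one_experiment(rng) for _ in range(2000)]
sd_unadjusted = statistics.stdev(d[0] for d in draws)
sd_additive = statistics.stdev(d[1] for d in draws)
sd_interacted = statistics.stdev(d[2] for d in draws)
mean_interacted = fmean(d[2] for d in draws)

print(f"sd unadjusted: {sd_unadjusted:.3f}")
print(f"sd additive:   {sd_additive:.3f}")
print(f"sd interacted: {sd_interacted:.3f} (mean {mean_interacted:.2f})")
```

          This is one simulated scenario, not a proof; the general asymptotic result is in Lin's paper cited above. Changing n_treated to 100 is a quick way to see the point that the interaction matters less when the treatment probability is close to 1/2.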

  5. In practice, I find that this comes naturally because in almost every instance, the practitioner knows of pre-treatment variables that are either associated with treatment selection or with the outcome and thus, either the random assignment is stratified or the sample sizing may be done to allow for analyses at levels of certain pre-treatment variables. The latter is related to your point #5 but going the other direction of segment-level analyses.

    Nice post!

    • Kaiser:

      Yes, in some ways this is very obvious because in real life the “before” measurement is right there. The problem comes in the textbook presentation of causal inference, where it seems to be standard practice to start simple and not consider pre-treatment variables x, but then practitioners go and think that not including x is a good thing, or that it’s safer or more rigorous to not consider x, or some other such foolishness.

      I have no problem with the commenter in the above thread who points out that including x can make things worse—anything can make things worse! I just think it’s best to start with x in your model, and then if x is really noisy or expensive to gather, or if for some reason you’re constrained to use a really bad model for x, then you can consider finding a different x, or not adjusting altogether, which would be unfortunate, but better to be in that bad place as a choice rather than by default.

      • To add to your second point, even if the adjustment ended up being ineffective or counter productive, the analyst can then make a judgment call as to the form of the model… but having that pre-treatment data around is better than scrambling to find it later (which sometimes just isn’t feasible).

    • This was what I thought, although a social scientist then pointed out on Twitter that such researchers may collect info on hundreds of variables via a questionnaire or whatever, just because it’s more efficient and they might use them for something else later. In this situation I assume something like causal graphs have to enter into the decision making?
