Adjusting for differences between treatment and control groups: “statistical significance” and “multiple testing” have nothing to do with it

Jonathan Falk points us to this post by Scott Alexander entitled “Two Unexpected Multiple Hypothesis Testing Problems.” The important questions, though, have nothing to do with multiple hypothesis testing or with hypothesis testing at all. As is often the case, certain free-floating scientific ideas get in the way of thinking about the real problem.

Alexander tells the story of a clinical trial of a coronavirus treatment, following up on a discussion by Lior Pachter. Here’s Alexander:

The people who got the Vitamin D seemed to do much better than those who didn’t. But there was some controversy over the randomization . . . they checked for fifteen important ways that the groups could be different, and found they were only significantly different on one – blood pressure.

The table doesn’t show continuous blood pressure measurements, but it says that, among the 50 people in the treatment group, 11 had previous high blood pressure, whereas of the 26 people in the control group, 15 had high blood pressure. That’s 22% with previous high blood pressure in the treatment group, compared to 58% in the control group.

Just as an aside, I can’t quite figure out what’s going on in the table, where the two proportions are reported as 24.19% and 57.69%. I understand where the 57.69% comes from: it’s 15/26 to four decimal places. But I don’t know how they got 24.19% from 11/50. I played around with some other proportions (12/49, etc.) but couldn’t quite come up with 24.19%. I guess it doesn’t really matter, though.

Anyway, Alexander continues:

Two scientists who support this study say that [the imbalance in blood pressure comparing the two groups] shouldn’t bother us too much. They point out that because of multiple testing (we checked fifteen hypotheses), we need a higher significance threshold before we care about significance in any of them, and once we apply this correction, the blood pressure result stops being significant.

Alexander then writes, and I agree with him:

Come on! We found that there was actually a big difference between these groups! You can play around with statistics and show that ignoring this difference meets certain formal criteria for statistical good practice. But the difference is still there and it’s real. For all we know it could be driving the Vitamin D results.

Or to put it another way – perhaps correcting for multiple comparisons proves that nobody screwed up the randomization of this study; there wasn’t malfeasance involved. But that’s only of interest to the Cordoba Hospital HR department when deciding whether to fire the investigators. If you care about whether Vitamin D treats COVID-19, it matters a lot that the competently randomized, non-screwed up study still coincidentally happened to end up with a big difference between the two groups. It could have caused the difference in outcome.

Well put.

The thing that Alexander doesn’t seem to fully realize is that there is an accepted method in statistics to handle this. What you do is fit a regression model on the outcome of interest, adjusting for important pre-treatment predictors. Such an analysis is often described as “controlling” for the predictors, but I prefer to reserve “control” for when the variables are actually being controlled and to use “adjust” for adjustments in the analysis.

Alexander does allude to such an analysis, writing:

Although the pre-existing group difference in blood pressure was dramatic, their results were several orders of magnitude more dramatic. The paper Pachter is criticizing does a regression to determine whether the results are still significant even controlling for blood pressure, and finds that they are. I can’t see any problem with their math, but it should be remembered that this is a pretty desperate attempt to wring significance out of a small study, and it shouldn’t move our needle by very much either way.

I disagree! Adjusting for pre-treatment differences is not a “desperate” strategy. It’s standard statistics (covered, for example, in chapter 19 of Regression and Other Stories; it’s an old, old method, we didn’t come up with it, and I’m just referring to our book as a textbook presentation of this standard approach), nothing desperate at all. Also, there’s no need to “wring significance” out of anything. The point is to summarize the evidence in the study. The adjusted analysis should indeed “move our needle” to the extent that it resolves concerns about imbalance. In this case the data are simple enough that you could just show a table of outcomes for each combination of treatment or control and high or low blood pressure. I guess I’d prefer to use blood pressure as a continuous predictor, but that’s probably not such a big deal here.
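
To make this concrete, here’s a minimal sketch of what such an adjusted analysis could look like in code. Only summary statistics are public, so the within-arm cross-tabulation of ICU admission by blood-pressure status below is hypothetical, chosen just to match the published margins; the point is the structure of the analysis, not the numbers.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical individual-level data consistent with the published margins:
# treated arm: 50 patients, 1 ICU admission, 11 with prior high blood pressure;
# control arm: 26 patients, 13 ICU admissions, 15 with prior high blood pressure.
rows = []
def add(n, treated, high_bp, icu):
    rows.extend([{"treated": treated, "high_bp": high_bp, "icu": icu}] * n)

add(1, 1, 1, 1); add(10, 1, 1, 0); add(0, 1, 0, 1); add(39, 1, 0, 0)  # treated arm (made-up split)
add(9, 0, 1, 1); add(6, 0, 1, 0); add(4, 0, 0, 1); add(7, 0, 0, 0)    # control arm (made-up split)
df = pd.DataFrame(rows)

# The simple table of outcomes by treatment and blood-pressure status...
print(df.groupby(["treated", "high_bp"])["icu"].agg(["sum", "count"]))

# ...and the adjusted analysis: logistic regression of ICU admission on
# treatment, adjusting for prior high blood pressure.
print(smf.logit("icu ~ treated + high_bp", data=df).fit().summary())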

Multiple comparisons and statistical significance never came up. The other thing is that you shouldn’t just adjust for blood pressure. Indeed, it would be better to combine the pre-treatment indicators in some reasonable way and adjust for all of them. There’s a big literature on all of this, and not always clear agreement on what to do, so I’m not saying it’s easy. As Alexander notes, in a randomized trial any such adjustment will be more important when the sample size is small. That’s just the way it goes. I read Pachter’s linked post, and there he says that the experiment was originally designed to be a pilot study, but then the results were so stunning that the researchers decided to share them right away, which seems fair enough.
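
Here’s one hedged sketch of what combining the pre-treatment indicators into a single score and then adjusting for it might look like, roughly in the spirit of a prognostic score. The data are simulated and the variable names are placeholders; this is not what the study’s authors did, just an illustration of the general strategy.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 76  # same total size as the trial, but everything below is simulated
df = pd.DataFrame({
    "treated": rng.binomial(1, 50 / 76, n),
    "age": rng.normal(55, 12, n),
    "high_bp": rng.binomial(1, 0.35, n),
    "diabetes": rng.binomial(1, 0.15, n),
})
true_logit = -4 + 0.05 * df.age + 1.0 * df.high_bp + 0.8 * df.diabetes - 1.5 * df.treated
df["icu"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Step 1: collapse the baseline variables into a single prognostic score.
# A linear probability model fit in the control arm keeps this small-sample
# sketch simple; the score is then predicted for everyone.
risk = smf.ols("icu ~ age + high_bp + diabetes", data=df[df.treated == 0]).fit()
df["prog_score"] = risk.predict(df)

# Step 2: estimate the treatment effect adjusting for that one combined score.
adjusted = smf.logit("icu ~ treated + prog_score", data=df).fit(disp=0)
print(adjusted.params)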

Pachter summarizes as follows:

As for Vitamin D administration to hospitalized COVID-19 patients reducing ICU admission, the best one can say about the Córdoba study is that nothing can be learned from it.

And here’s his argument:

Unfortunately, the poor study design, small sample size, availability of only summary statistics for the comorbidities, and imbalanced comorbidities among treated and untreated patients render the data useless. While it may be true that calcifediol administration to hospital patients reduces subsequent ICU admission, it may also not be true.

I see his point regarding small sample size and data availability, but the concern about imbalanced comorbidities among treated and untreated patients . . . that can be adjusted for. I can see him saying the evidence isn’t as clear as is claimed, and there are always possible holes in any study, but is it really true that nothing can be learned from the study? The headline result, “only 1/50 of the treated patients was admitted to the ICU, whereas 13/26 of the untreated patients were admitted,” seems pretty strong.
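
For a rough sense of how strong that crude signal is, here’s a quick back-of-the-envelope calculation, putting independent uniform Beta(1, 1) priors on the two ICU-admission probabilities. This deliberately ignores the imbalance issue; it’s only the unadjusted comparison.

import numpy as np

rng = np.random.default_rng(1)
draws = 100_000
p_treated = rng.beta(1 + 1, 1 + 49, draws)    # 1 of 50 treated patients admitted to the ICU
p_control = rng.beta(1 + 13, 1 + 13, draws)   # 13 of 26 untreated patients admitted
diff = p_control - p_treated
print("posterior mean risk difference:", round(float(diff.mean()), 2))
print("central 95% interval:", np.round(np.percentile(diff, [2.5, 97.5]), 2))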

I didn’t quite follow Pachter’s argument regarding poor study design. He says that admission to the intensive care unit could have been based in part on patients’ pre-treatment conditions. It could be that more careful adjustment would change the result, so it does seem like, as always, it would be better if the data could be made public in some form. Or maybe there’s an issue of information leakage, so that the ICU assignment was made with some knowledge of who got the treatment? In any case, lots more will be learned from larger studies to come.

“Statistical significance” and “multiple testing” have nothing to do with it

But here’s the point. All this discussion of p-values and multiple comparisons adjustments is irrelevant. As Alexander says, to the extent that imbalance between treatment and control groups is a problem, it’s a problem whether or not this imbalance is “statistically significant,” however that is defined. The relevant criticisms of the study would be if the adjustment is done poorly, or if the outcome measure in the study is irrelevant to ultimate outcomes, or if there was information leakage, or if Vitamin D creates other risks so the net benefits could be negative (all of these are points I took from the above-linked posts). The discussions of statistical significance and testing and p-values and all the rest have nothing to do with any of this. So it’s frustrating to me that so much of Pachter’s and Alexander’s discussions focus on these tangential issues. Reading their posts, I keep seeing them drift toward the interesting questions and then spring back to these angels-on-a-pin probability calculations. Really the point of this post is to say: Hey, focus on the real stuff. The point of statistics is to allow non-statisticians to focus on science and decision making, not to draw them into a vortex of statistics thinking!

P.S. I just noticed that Pachter’s post that Alexander is reacting to is from Nov 2020. Pachter links to this clinical trial with 2700 patients scheduled to be completed at the end of June 2021, so I guess then we’ll know more.

30 thoughts on “Adjusting for differences between treatment and control groups: “statistical significance” and “multiple testing” have nothing to do with it”

  1. Andrew, you might want to fix, “As Alexander notes, in a randomized trial the important of any such adjustment will be more important when sample size is small.”

  2. “As Alexander says, to the extent that imbalance between treatment and control groups is a problem, it’s a problem whether or not this imbalance is “statistically significant,” however that is defined.”

    Yes, it seems that many researchers (perhaps due to the heavy focus on NHST in statistics education) get really hung up on whether baseline differences observed in-sample are “really real” or not, as measured by some p-value. The thought seems to be that only “true” differences should matter. In fact, it’s crucial that confounding NOT depend on whatever is going on out in the real world. If it did, then we would be screwed when it comes to accounting for confounding. The fact that it’s only the in-sample differences that matter is exactly what makes confounding fixable using statistical adjustment.

  3. I have seen many studies which detect (or fail to detect) differences due to some characteristic after using linear regression to account for the fact that groups of people with and without this characteristic are also different in many other ways. I am always left wondering whether linear regression is accurately modelling the effect of the characteristics it is supposed to be accounting for, and whether, if it is not, this could be influencing the results. Can anybody point me at a good reference for this question? (Matching approaches apparently look more robust, but I have not seen them used nearly as much.)

      • +1.

        A point that I’ve read him make is that a straightforward ANCOVA adjustment can be thought of as actually adjusting for a *single* covariate, namely the bit of the linear predictor minus the treatment effect. (Compare with the logic of propensity scores: for a two-group comparison, there’s only one dimension’s worth of confounding that matters.)

        ANCOVA also has some (nonparametric, asymptotic) robustness to it. With enough data, both your estimate and standard error should be about right:

        B. Wang, E. L. Ogburn, and M. Rosenblum, “Analysis of covariance in randomized trials: More precision and valid confidence intervals, without model assumptions,” Biometrics, vol. 75, no. 4, pp. 1391–1400, 2019, doi: https://doi.org/10.1111/biom.13062.
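
        For concreteness, here is a minimal sketch of the kind of estimator that result covers: OLS of the outcome on treatment plus baseline covariates, reported with heteroskedasticity-robust standard errors. The data and names below are simulated placeholders.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["z"] = rng.binomial(1, 0.5, n)  # randomized treatment assignment
df["y"] = 1 + 0.5 * df.x1 - 0.3 * df.x2 + 0.4 * df.z + rng.normal(size=n)

# ANCOVA: regress the outcome on treatment and the baseline covariates;
# HC-type standard errors are what the asymptotic, model-free guarantees refer to.
fit = smf.ols("y ~ z + x1 + x2", data=df).fit(cov_type="HC3")
print(fit.params["z"], fit.bse["z"])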

    • Ag:

      Linear regression won’t be perfect. I think that linear regression, if set up in a reasonable way and with coefficients regularized in the estimation, should give a better estimate than not adjusting at all.

  4. In economics, where everybody cares a lot about unobservables, we would take the large imbalance as just a warning sign that randomization might not have worked — and that treated and controls might also be imbalanced w.r.t. unobserved confounders, and one of those could be relevant, no? We can control for what we see, but not for what we don’t see.

    I agree about p-values being irrelevant here, but we still want something to inform our decision about what to control for and what not. One could use standardized differences to describe balance between treated and controls. But those would be above 10% for other covariates as well.

    Finally, what are those p-values anyway? With such small samples and two binary variables, would one not do Fisher’s exact test?

    • Hendrik:

      It would be best to adjust for everything. Not adjusting for a variable is equivalent to adjusting for it using a linear regression whose coefficient is fixed at zero. That’s extreme regularization. In practice we won’t adjust for all variables, but I recommend adjusting for variables that might be important. As noted in the above post, another strategy is to combine a bunch of predictors into some kind of total score and then adjust for that. And Jennifer Hill and others have written about the use of nonparametric adjustment schemes.

      Regarding the p-values, no, I don’t think Fisher’s so-called exact test makes sense; see discussion here for example.

    • In economics, where everybody cares a lot about unobservables, we would take the large imbalance as just a warning sign that randomization might not have worked — and that treated and controls might also be imbalanced w.r.t. unobserved confounders, and one of those could be relevant, no? We can control for what we see, but not for what we don’t see.

      This doesn’t seem right to me.

      If an unobserved confounder is strongly associated with the observed ones, adjusting for the observed ones will also adjust for the unobserved ones. So in that respect we’re in a *better* situation.

      On the other hand, if an unobserved confounder is not at all associated with the observed ones, then imbalance in the observed is uninformative about imbalance in the unobserved.

  5. There is a huge measurement problem with all vit. D studies. I read a rather lengthy explanation by some molecular biologist (or similar) arguing that what we measure as a vit. D level is just a proxy for some other compounds and does not translate to any meaningful dose of what is really circulating in your blood. Just because we gave one group ‘vit. D’ doesn’t mean it created the desired changes, compared to the control group. Apparently, it’s way more complicated to measure, which by itself is enough to render studies of this nature useless.

    Pure observation shows how Africa didn’t get hit too hard. They are close to the equator, but they are young on average too. OTOH, Brazil is full of mass graves and it’s not really the lack of sunshine vitamin they suffer from.

    Regarding minimizing baseline differences, it is almost impossible when it comes to COVID, as we are clueless as to what the key variables really are. The cliche of obese, old, with pre-existing conditions is getting old. Plenty of countries are seeing deaths among the very fit/healthy/young, and they die quite fast too.

    I believe a more informative approach is some post-hoc analysis where all those who recovered faster in both arms would be hyper-analyzed to find a common thread. Probably some complicated genetic predisposition. We’ll never find out and that’s the beauty of it all.

    • Seems to me another problem is the assumption that supplements can be used to reach the same composition of vitamin D in the body as what manifests from exposure to sunshine and from diet.

      • A well-known clinical example is this:

        Certain cancer patients are given enormous doses of Vitamin D, once a week.
        That is because it is observed that these patients’ Vitamin D levels are anomalously low.
        This is the case because the cancers spur a florid endocrine derangement.
        But the arrow cannot be turned around to argue that enormous doses of Vitamin D are protective against cancer.

    • The intervention in the study in question was not plain ‘vitamin D’ but calcifediol, an already hydroxylated metabolite (25-hydroxyvitamin D), so your concern is probably only tangentially relevant here, even if it might be important in other contexts.

  6. This post drove me crazy (Alexander’s, I mean), so I’m happy to see some discussion of it. I’m not sure I agree that the interpretation of p-values has no bearing on this question, though. Isn’t one of the cardinal rules of NHST that you cannot take a high p-value as evidence of no effect? It seems like both the original authors and Alexander are basically saying “if the p-values are greater than 0.05, then the two groups are the same,” which is just an error of interpretation even if there were no other problems.

    I agree with all the other things in this (Andrew’s) post too, especially about thinking through the actual problem. I very much doubt, for instance, that someone would look at two groups with 6% vs 20% rates of diabetes and consider them the same… but that’s what the p-value says! (or, to my point above, actually doesn’t say)
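
    For whatever it’s worth, a quick calculation shows how easily a gap that large can fail to clear 0.05 at these sample sizes. The counts below are hypothetical, chosen only to give roughly 6% and 20% with the trial’s group sizes:

import numpy as np
from scipy.stats import norm

x1, n1 = 3, 50   # hypothetical: about 6% diabetes in the treated group
x2, n2 = 5, 26   # hypothetical: about 19% diabetes in the control group
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 2 * norm.sf(abs(z))
print(round(z, 2), round(p_value, 3))  # roughly z = 1.8, p = 0.07 with these made-up counts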

  7. This showed up in the wrong place:

    The burden of proof is the reverse of what it should be.

    Why is it ok to ignore 100 years of science saying vitamin deficiencies are bad, just because it’s covid?

    If someone tells me in 2021 they need to run a bunch of statistical tests before we know whether low oil levels are ok for my car, then I will think they are a crackpot unless they can explain why exactly this situation is different. It is the exact same thing.

    Also:

    P.S. I just noticed that Pachter’s post that Alexander is reacting to is from Nov 2020. Pachter links to this clinical trial with 2700 patients scheduled to be completed at the end of June 2021, so I guess then we’ll know more.

    If you look at the details of this trial, it amounts to giving an arbitrary amount of vitamin d3 to people who recently tested positive for covid. There is no plan to guide the dosing according to vitamin d levels, so they will end up giving it to many people who don’t need it and not giving enough to people who need more.

    Would a mechanic add the same amount of oil to every car? Or a fireman use the same amount of water on every fire? Or a painter the same amount of paint on every car?

    Studies like this don’t make much sense.

  8. Thanks Andrew for the comments, and I agree on the usefulness of regression adjustments in contexts like these. I’d like to raise a couple of questions about the table Alexander shows right up front, which lists the fifteen risk factors along with columns for the observed n for the two groups, the 95% confidence interval, and the p-value. I personally find the p-value column here useful as a scalar statistic, and have no problems with summary statements such as “Among the fifteen factors, only blood pressure exhibits statistically significant differences”. For those who dislike p-values, how would you have preferred this table be constructed? For those who feel the phrase “statistically significant” should be abandoned, how would you replace the preceding quote?

    • Russ:

      1. I get the idea of a p-value being a scalar summary, but (a) if you want a scalar summary, I’d much prefer the z-score to the p-value, as the z-score maps directly into the amount of bias that would be adjusted for, whereas the p-value is a strongly nonlinear transformation with no clear connection to the inferential goal; (b) I think it’s much better to see things on the scale of the data; and (c) rather than the notorious “Table 1”-style display, I’d much rather see a balance plot as for example shown here.

      2. To answer your final question, you could replace “Among the fifteen factors, only blood pressure exhibits statistically significant differences” with “Figure 1 shows the average differences between treatment and control groups in the data. Several of the differences are large, most notably that the people in the treatment group are on average ** years older than the people in the control group; the treated people have a 35 percentage point lower rate of high blood pressure. The aggregate prognostic scores of the two groups are not so different on average. After adjusting for the prognostic score, the resulting estimate of treatment effect changes from ** to **, a small change, suggesting that imbalance between the groups was not driving the results. Concerns remain, however, about ** and **.” Statistical significance never came up!
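
      For what it’s worth, here’s a minimal sketch of the kind of balance display I have in mind: standardized differences in means between the two groups for each baseline variable, all on one axis. The data and covariate names are simulated placeholders.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
names = ["age", "high_bp", "diabetes", "copd", "prior_cvd"]  # placeholder covariates
treated = rng.normal(size=(50, len(names)))
control = rng.normal(size=(26, len(names)))

# Standardized mean differences: difference in group means divided by a pooled SD.
pooled_sd = np.sqrt((treated.std(axis=0, ddof=1) ** 2 + control.std(axis=0, ddof=1) ** 2) / 2)
smd = (treated.mean(axis=0) - control.mean(axis=0)) / pooled_sd

plt.axvline(0, color="gray")
plt.scatter(smd, range(len(names)))
plt.yticks(range(len(names)), names)
plt.xlabel("standardized difference in means (treated minus control)")
plt.tight_layout()
plt.show()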

      • P.S. Let me emphasize that it’s not like I think I have all the answers here. There are lots of open questions regarding display, analysis, and decision making even in clean randomized trials. I do think that discussions like this can move the ball forward.

        • With you on moving the ball forward; recommendations and discussions like the ones you have suggested appear to be the best way. I really appreciate blog forums like this as a chance to do that, and hope the Bayesian sharks are not yet circling.

          The choice between a figure or a table is often not trivial to me, and I also tend to prefer well-formed graphs, as you indicated. Putting “balance plot” into an equivalence class with “forest plot”, I’ve happily noticed increased usage of them in traditionally table-heavy pharmaceutical regulatory contexts.

          Checking randomization as here strikes me as a case where p-values are reasonable. We can meaningfully ask, “Under random treatment assignment, what is the probability of observing z-scores as large or larger than what we have observed?”. Ensuing multiple testing questions then become relevant. Many small multiplicity-adjusted p-values would be a red flag that something is amiss with the design.

          Side point: If demographic and lab variables such as the fifteen here are available a priori, then it seems desirable to use more advanced design-of-experiment methods that balance the covariates among the two treatment arms.

          In general, maybe we should push more for z-scores (aka signal-to-noise ratios) as the common-ground statistical scalars everyone should use for basic analysis and communication. They have easy integer-based rules of thumb and strong traditions in engineering (six sigma certification programs) and science fields like physics (five sigma rules). I almost always work with -log10(p-values) in large-scale analyses like genomics, and they are typically linear with the associated z-scores (a small conversion sketch follows at the end of this comment).

          I’ve always taken usage of “statistically significant” as a friendly, shorthand, verbal indication to the reader that a standard rule-of-thumb (not a bright line) has been applied to flag a test statistic as larger than we would expect if there were no true effect under the modeling assumptions made. There’s an underlying concept here that is worth expressing with one or two words. My second question was targeting a desire to concisely say that blood pressure stands out as the only covariate among the fifteen that exhibits differences larger than expected under randomized treatment assignment and standard distributional assumptions.
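
          As a footnote to the z-score and -log10(p-value) point above, here’s the two-sided normal conversion, so the mapping is explicit:

import numpy as np
from scipy.stats import norm

for z in [1, 2, 3, 4, 5, 6]:
    neg_log10_p = -np.log10(2 * norm.sf(z))  # two-sided p-value on the -log10 scale
    print(f"z = {z}:  -log10(p) = {neg_log10_p:.1f}")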

  9. One often hears the notion that Table 1 is there to verify whether the randomization “worked” (not so much on this blog, but in real life), as if randomization were an equalizing procedure. It is not: there are always differences between the groups if one looks at enough variables. Since we only worry about the variables that may be linked to the outcome, why not adjust for those a priori, regardless of Table 1?
    With small samples, I’d worry that a data-driven multiple-adjustment model may occasionally be incorrect (e.g., age might come out associated with a lower risk of death). Wouldn’t it be safer to use a propensity score, and only adjust the treatment effect for that one covariate?

    • Any analysis approach poses risks of errors and I believe the overall objectives here are to try and minimize them. Propensity scoring is certainly a good way to go, but it involves building the propensity score predictive model and this brings responsibility to make sure this additional model is sound. By design, multiple testing procedures should be incorrect at the specified error rate. Before all of this, creating an experimental design that balances covariates (assuming they are available) is likely preferable to simple random assignment with fingers crossed that you don’t use a random number seed that badly imbalances them.

  10. I’m still stuck on why we can’t trust the randomization. Normally, randomization of treatment makes causal inference possible because it balances characteristics (whether observed or not) between the treatment groups. When I see small p-values in Table 1 of an RCT, I disregard them as chance.

    Is this result of randomization asymptotic and so it doesn’t apply (or applies less) to small randomized studies? If so, when can we trust that randomization worked (balanced the characteristics between treatment groups)?

    Frank Harrell and others have advocated for adjusting the treatment effect using other, known predictors of the outcome, but this is to reduce the model’s variance, not to reduce the bias of the estimated treatment effect.

    • I’m surprised we don’t do something like pairwise matching and randomization (slice all participants into twin pairs as identical as possible based on known confounders, then assign treatment and control randomly between twins).
      Can someone give reasons against this (besides the additional effort and needing a 50/50 split between groups)?

    • Yes, the convergence of all traits towards balance is asymptotic. It depends on what you’re concerned about, but, for example, if you’re concerned about average age, then the difference in averages shrinks like 1/sqrt(N); to get average age to match within, say, 10% of the standard deviation, you’d need 100 people in each group or so.
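
      A quick simulation of that point, using a made-up standard-normal covariate (think of it as standardized age) and equal group sizes: the typical treated-versus-control difference in means shrinks roughly like 1/sqrt(N).

import numpy as np

rng = np.random.default_rng(4)
for n_per_group in [25, 100, 400, 1600]:
    # Many random splits; look at the typical imbalance in the covariate.
    diffs = [rng.normal(size=n_per_group).mean() - rng.normal(size=n_per_group).mean()
             for _ in range(2000)]
    print(n_per_group, round(float(np.std(diffs)), 3))  # roughly sqrt(2 / n_per_group)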
