Yes, the convergence of all traits toward balance is asymptotic. It depends on what you’re concerned about, but, for example, if you’re concerned about average age, then the average converges like 1/sqrt(N), and to get the average age to match within, say, 10% of the standard deviation, you’d need about 100 people in each group.
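That 1/sqrt(N) scaling is easy to check by simulation. Here is a quick sketch of my own (the Normal(50, 10) age population is invented for illustration): randomly split a sample in two and measure the typical gap between the group mean ages.

```python
import math
import random
import statistics

def rms_mean_age_gap(n_per_group, reps=2000, seed=0):
    """RMS difference in mean age between two randomly assigned groups,
    drawing ages from a made-up Normal(50, 10) population each repetition."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(reps):
        ages = [rng.gauss(50, 10) for _ in range(2 * n_per_group)]
        rng.shuffle(ages)
        treat_mean = statistics.mean(ages[:n_per_group])
        control_mean = statistics.mean(ages[n_per_group:])
        gaps.append(treat_mean - control_mean)
    return math.sqrt(statistics.mean(g * g for g in gaps))

# Quadrupling the group size should roughly halve the typical gap (1/sqrt(N)):
gap_100 = rms_mean_age_gap(100)
gap_400 = rms_mean_age_gap(400)
```

With 100 per group, the standard error of each group’s mean age is sigma/sqrt(100) = 1 year here, i.e. 10% of the 10-year standard deviation, matching the back-of-the-envelope number above.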

]]>I’m surprised we don’t do something like pairwise matching and randomization (slice all participants into twin pairs as identical as possible based on known confounders, then assign treatment and control randomly between twins).

Can someone give reasons against this (besides the additional effort and the need for a 50/50 split between groups)?
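For what it’s worth, the mechanics of pair-matched randomization are simple. A toy sketch of my own, matching on a single covariate (the field names and the age-matching key are invented for illustration; real matching would use a distance over several confounders):

```python
import random

def paired_randomize(participants, key, seed=0):
    """Sort by a matching key, pair adjacent participants, and randomly
    assign one of each pair to treatment and the other to control."""
    rng = random.Random(seed)
    ordered = sorted(participants, key=key)
    treatment, control = [], []
    for i in range(0, len(ordered) - 1, 2):
        a, b = ordered[i], ordered[i + 1]
        if rng.random() < 0.5:       # coin flip within each matched pair
            a, b = b, a
        treatment.append(a)
        control.append(b)
    return treatment, control

# Example: match on age; the arms are balanced on age by construction.
people = [{"id": i, "age": a}
          for i, a in enumerate([23, 57, 41, 39, 60, 25, 56, 58])]
t, c = paired_randomize(people, key=lambda p: p["age"])
```

Because assignment is random within each pair, the design is still a valid randomization; it just rules out the unlucky splits that simple randomization occasionally produces.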

I guess the attitude towards randomization should be similar to foreign policy: trust but verify.

]]>Any analysis approach poses risks of errors, and I believe the overall objective here is to try to minimize them. Propensity scoring is certainly a good way to go, but it involves building the propensity-score predictive model, and this brings responsibility to make sure this additional model is sound. By design, multiple testing procedures will make errors at the specified error rate. Before all of this, creating an experimental design that balances covariates (assuming they are available) is likely preferable to simple random assignment with fingers crossed that you don’t use a random number seed that badly imbalances them.

]]>With you on moving the ball forward, and recommendations and discussions as you have suggested appear to be the best way. Really appreciate blog forums like this as a chance to do this and hope the Bayesian sharks are not yet circling.

The choice between a figure and a table is often not trivial to me, and I also tend to prefer well-formed graphs, as you indicated. Putting “balance plot” into an equivalence class with “forest plot”, I’ve happily noticed increased usage of them in traditionally table-heavy pharmaceutical regulatory contexts.

Checking randomization as here strikes me as a case where p-values are reasonable. We can meaningfully ask, “Under random treatment assignment, what is the probability of observing z-scores as large or larger than what we have observed?”. Ensuing multiple testing questions then become relevant. Many small multiplicity-adjusted p-values would be a red flag that something is amiss with the design.
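That question can be answered quite literally with a re-randomization (permutation) test. A minimal sketch, assuming a single covariate and a mean-difference statistic (my own illustration, not from the thread):

```python
import random
import statistics

def balance_pvalue(covariate, is_treated, reps=5000, seed=0):
    """Re-randomization p-value: under random reassignment of the same
    treatment labels, how often is the covariate's mean difference at
    least as large as the one actually observed?"""
    rng = random.Random(seed)
    n_treat = sum(is_treated)
    treated = [x for x, t in zip(covariate, is_treated) if t]
    control = [x for x, t in zip(covariate, is_treated) if not t]
    observed = abs(statistics.mean(treated) - statistics.mean(control))
    vals = list(covariate)
    hits = 0
    for _ in range(reps):
        rng.shuffle(vals)   # simulate a fresh random assignment
        diff = abs(statistics.mean(vals[:n_treat])
                   - statistics.mean(vals[n_treat:]))
        if diff >= observed:
            hits += 1
    return hits / reps
```

This makes no distributional assumption beyond the randomization itself, so it sidesteps the normal-theory caveats; the multiplicity question across many covariates remains, as noted above.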

Side point: If demographic and lab variables such as the fifteen here are available a priori, then it seems desirable to use more advanced design-of-experiment methods that balance the covariates among the two treatment arms.

In general, maybe we should push more for z-scores (a.k.a. signal-to-noise ratios) as the common-ground statistical scalars everyone should use for basic analysis and communication. They have easy integer-based rules of thumb and strong traditions in engineering (Six Sigma certification programs) and science fields like physics (five-sigma rules). I almost always work with -log10(p-values) in large-scale analyses like genomics, and they closely track the associated z-scores.
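For reference, the z-score to -log10(p) conversion is a one-liner via the complementary error function (two-sided normal p-values assumed here):

```python
import math

def neg_log10_p(z):
    """-log10 of the two-sided normal p-value for a z-score z."""
    return -math.log10(math.erfc(abs(z) / math.sqrt(2)))

# Rules of thumb in p-value terms: z = 5 ("five sigma") corresponds
# to -log10(p) of about 6.2; z = 2 to about 1.3 (p ~ 0.046).
```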

I’ve always taken usage of “statistically significant” as a friendly, shorthand, verbal indication to the reader that a standard rule-of-thumb (not a bright line) has been applied to flag a test statistic as larger than we would expect if there were no true effect under the modeling assumptions made. There’s an underlying concept here that is worth expressing with one or two words. My second question was targeting a desire to concisely say that blood pressure stands out as the only covariate among the fifteen that exhibits differences larger than expected under randomized treatment assignment and standard distributional assumptions.

]]>Is this result of randomization asymptotic and so it doesn’t apply (or applies less) to small randomized studies? If so, when can we trust that randomization worked (balanced the characteristics between treatment groups)?

Frank Harrell and others have advocated adjusting the treatment effect using other, known predictors of the outcome, but this is to reduce the model’s variance, not to reduce the bias of the estimated treatment effect.

]]>With small samples, I’d worry that the data-driven multiple-adjustment model may occasionally be incorrect (e.g., age might be associated with a lower risk of death). Wouldn’t it be safer to use a propensity score, and only adjust the treatment effect for that one covariate?

P.S. Let me emphasize that it’s not like I think I have all the answers here. There are lots of open questions regarding display, analysis, and decision making even in clean randomized trials. I do think that discussions like this can move the ball forward.

]]>Russ:

1. I get the idea of a p-value being a scalar summary, but (a) if you want a scalar summary, I’d much prefer the z-score to the p-value, as the z-score maps directly into the amount of bias that would be adjusted for, whereas the p-value is a strongly nonlinear transformation with no clear connection to the inferential goal; (b) I think it’s much better to see things on the scale of the data; and (c) rather than the notorious “Table 1”-style display, I’d much rather see a balance plot, as for example shown here.

2. To answer your final question, you could replace “Among the fifteen factors, only blood pressure exhibits statistically significant differences” with “Figure 1 shows the average differences between treatment and control groups in the data. Several of the differences are large, most notably that the people in the treatment group are on average ** years older than the people in the control group; the treated people have a 35 percentage point lower rate of high blood pressure. The aggregate prognostic scores of the two groups are not so different on average. After adjusting for the prognostic score, the resulting estimate of treatment effect changes from ** to **, a small change, suggesting that imbalance between the groups was not driving the results. Concerns remain, however, about ** and **.” Statistical significance never came up!

]]>The burden of proof is reversed from what it should be.

Why is it ok to ignore 100 years of science saying vitamin deficiencies are bad because of covid?

If someone tells me in 2021 they need to run a bunch of statistical tests before we know whether low oil levels are ok for my car, then I will think they are a crackpot unless they can explain why exactly this situation is different. It is the exact same thing.

Also:

P.S. I just noticed that Pachter’s post that Alexander is reacting to is from Nov 2020. Pachter links to this clinical trial with 2700 patients scheduled to be completed at the end of June 2021, so I guess then we’ll know more.

If you look at the details of this trial, it amounts to giving an arbitrary amount of vitamin d3 to people who recently tested positive for covid. There is no plan to guide the dosing according to vitamin d levels, so they will end up giving it to many people who don’t need it and not giving enough to people who need more.

Would a mechanic add the same amount of oil to every car? Or a fireman use the same amount of water on every fire? Or a painter the same amount of paint on every car?

Studies like this don’t make much sense.

]]>The intervention in the study in question was not ‘vitamin D’ but the active form, calcifediol, so your concern is probably only tangentially relevant here, even if it might be important in other contexts.

]]>A well-known clinical example is this:

Certain cancer patients are given enormous doses of Vitamin D, once a week.

That is because it is observed that these patients’ Vitamin D levels are anomalously low.

This is the case because the cancers spur a florid endocrine derangement.

But the arrow cannot be turned around to argue that enormous doses of Vitamin D are protective against cancer.

+1.

A point that I’ve read him make is that a straightforward ANCOVA adjustment can be thought of as actually adjusting for a *single* covariate, namely the bit of the linear predictor minus the treatment effect. (Compare with the logic of propensity scores: for a two-group comparison, there’s only one dimension’s worth of confounding that matters.)

ANCOVA also has some (nonparametric, asymptotic) robustness to it. With enough data, both your estimate and standard error should be about right:

B. Wang, E. L. Ogburn, and M. Rosenblum, “Analysis of covariance in randomized trials: More precision and valid confidence intervals, without model assumptions,” Biometrics, vol. 75, no. 4, pp. 1391–1400, 2019, doi: https://doi.org/10.1111/biom.13062.
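A minimal illustration of the ANCOVA adjustment itself, via the Frisch-Waugh-Lovell residualization trick (a toy simulation of my own; one covariate, all numbers invented):

```python
import random
import statistics

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def ancova_effect(y, treat, x):
    """ANCOVA treatment-effect estimate: residualize both the outcome and
    the treatment indicator on the covariate, then regress residual on
    residual (Frisch-Waugh-Lovell)."""
    mx, my, mt = statistics.mean(x), statistics.mean(y), statistics.mean(treat)
    by, bt = slope(x, y), slope(x, treat)
    ry = [yi - my - by * (xi - mx) for yi, xi in zip(y, x)]
    rt = [ti - mt - bt * (xi - mx) for ti, xi in zip(treat, x)]
    return slope(rt, ry)

# Toy randomized trial: true effect 2, covariate strongly prognostic.
rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
treat = [rng.randint(0, 1) for _ in range(n)]
y = [2 * t + 3 * xi + rng.gauss(0, 1) for t, xi in zip(treat, x)]
```

The residualization view makes the point in the comment above concrete: after partialing out the covariates, what is left is effectively a single-variable comparison.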

]]>In economics, where everybody cares a lot about unobservables, we would take the large imbalance as a warning sign that randomization might not have worked, and that treated and controls might well be imbalanced w.r.t. unobserved confounders, one of which could be relevant, no? We can control for what we see, but not for what we don’t see.

This doesn’t seem right to me.

If an unobserved confounder is strongly associated with the observed ones, adjusting for the observed ones will also adjust for the unobserved ones. So in that respect we’re in a *better* situation.

On the other hand, if an unobserved confounder is not at all associated with the observed ones, then imbalance in the observed is uninformative about imbalance in the unobserved.
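A toy simulation of the first case (all numbers invented): an unobserved confounder U drives both assignment and outcome, and we only observe a noisy proxy X of it. Adjusting for X removes most, though not all, of the bias.

```python
import random
import statistics

def ols_effect(y, treat, x):
    """Treatment coefficient from regressing y on treat and x,
    computed by residualizing both y and treat on x (Frisch-Waugh)."""
    def slope(a, b):
        ma, mb = statistics.mean(a), statistics.mean(b)
        return (sum((u - ma) * (v - mb) for u, v in zip(a, b))
                / sum((u - ma) ** 2 for u in a))
    mx, my, mt = statistics.mean(x), statistics.mean(y), statistics.mean(treat)
    by, bt = slope(x, y), slope(x, treat)
    ry = [yi - my - by * (xi - mx) for yi, xi in zip(y, x)]
    rt = [ti - mt - bt * (xi - mx) for ti, xi in zip(treat, x)]
    return slope(rt, ry)

rng = random.Random(1)
n = 5000
u = [rng.gauss(0, 1) for _ in range(n)]                       # unobserved confounder
treat = [1 if ui + rng.gauss(0, 1) > 0 else 0 for ui in u]    # confounded assignment
x = [ui + rng.gauss(0, 0.3) for ui in u]                      # observed proxy of U
y = [1 * t + 2 * ui + rng.gauss(0, 1) for t, ui in zip(treat, u)]  # true effect = 1

t1 = [yi for yi, ti in zip(y, treat) if ti]
t0 = [yi for yi, ti in zip(y, treat) if not ti]
unadjusted = statistics.mean(t1) - statistics.mean(t0)  # badly biased upward
adjusted = ols_effect(y, treat, x)                      # much closer to 1
```

The closer X tracks U, the more of the confounding the adjustment soaks up; with an uncorrelated proxy it would do nothing, matching the second case.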

]]>Seems to me another problem is the assumption that supplements can be used to reach the same composition of vitamin D in the body as what manifests from exposure to sunshine and from diet.

]]>I agree with all the other things in this (Andrew’s) post too, especially about thinking through the actual problem. I very much doubt, for instance, that someone would look at two groups with 6% vs 20% rates of diabetes and consider them the same… but that’s what the p-value says! (or, to my point above, actually doesn’t say)

]]>Pure observation shows how Africa didn’t get hit too hard. They are close to the equator, but they are young on average too. OTOH, Brazil is full of mass graves and it’s not really the lack of sunshine vitamin they suffer from.

Regarding minimizing baseline differences, it is almost impossible when it comes to COVID, as we are clueless about what the key variables really are. The cliché of the obese, old patient with pre-existing conditions is getting old. Plenty of countries are seeing deaths among the very fit/healthy/young, and they die quite fast too.

I believe a more informative approach is some post-hoc analysis where all those who recovered faster in both arms would be hyper-analyzed to find a common thread. Probably some complicated genetic predisposition. We’ll never find out and that’s the beauty of it all.

]]>Good reference; thanks.

]]>When confounds are unmeasured:

]]>There’s even a name for this way of thinking: “The Table 1 Fallacy”

]]>Good catch!

]]>Hendrik:

It would be best to adjust for everything. Not adjusting for a variable is equivalent to adjusting for it using a linear regression whose coefficient is fixed at zero. That’s extreme regularization. In practice we won’t adjust for all variables, but I recommend adjusting for variables that might be important. As noted in the above post, another strategy is to combine a bunch of predictors into some kind of total score and then adjust for that. And Jennifer Hill and others have written about the use of nonparametric adjustment schemes.
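To make the “coefficient fixed at zero is extreme regularization” point concrete, here is a toy sketch of my own in which a ridge penalty on the adjustment coefficient interpolates between full least-squares adjustment (lam = 0) and no adjustment at all (lam huge):

```python
import random
import statistics

def shrunk_effect(y, treat, x, lam):
    """Difference in means after subtracting a ridge-shrunk covariate fit.
    lam = 0 is the usual least-squares adjustment; a huge lam forces the
    covariate coefficient to ~0, i.e. the plain unadjusted comparison."""
    mx, my = statistics.mean(x), statistics.mean(y)
    coef = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / (sum((xi - mx) ** 2 for xi in x) + lam))
    adj = [yi - coef * (xi - mx) for yi, xi in zip(y, x)]
    t1 = [a for a, t in zip(adj, treat) if t]
    t0 = [a for a, t in zip(adj, treat) if not t]
    return statistics.mean(t1) - statistics.mean(t0)

# Toy randomized trial: true effect 1, covariate strongly prognostic.
rng = random.Random(2)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
treat = [rng.randint(0, 1) for _ in range(n)]
y = [1 * t + 3 * xi + rng.gauss(0, 1) for t, xi in zip(treat, x)]
```

Both extremes are unbiased under randomization; the adjusted estimate is just far less variable when the covariate is prognostic, which is the usual argument for adjusting.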

Regarding the p-values, no, I don’t think Fisher’s so-called exact test makes sense; see discussion here for example.

]]>Ag:

Linear regression won’t be perfect. I think that linear regression, if set up in a reasonable way and with coefficients regularized in the estimation, should give a better estimate than not adjusting at all.

]]>I would google for stuff by Stephen Senn on understanding randomization and especially suggested imbalances.

]]>I agree about p-values being irrelevant here, but somehow we want to inform our decision what to control for and what not. One could use standardized differences to describe balance between treated and controls. But they would be outside 10% in other covariates as well.

Finally, what are those p-values anyway? With such small samples and two binary variables, would one not do Fisher’s exact test?

]]>Yes, it seems that many researchers (perhaps due to the heavy focus on NHST in statistics education) get really hung up on whether baseline differences observed in-sample are “really real” or not, as measured by some p-value. The thought seems to be that only “true” differences should matter. In fact, it’s crucial that confounding NOT somehow depend on whatever is going on out in the real world; if it did, we would be screwed when it comes to accounting for confounding. The fact that it’s only the in-sample differences that matter is exactly what makes confounding fixable using statistical adjustment.
