Linda Seebach points to this post by Scott Alexander and writes:

A recent paper on increased risk of death from all causes (huge sample size) found none; it controlled for some 300 confounders. Much previous research, also with large (though much smaller) sample sizes, found very large increased risks, but used under 20 confounders.

This somehow reminds me of what happens with polygenic scores. Each additional confounder reduces the effect, and by the time you get to 300+ it’s all gone.

Anyway, it looked like something you might find worth commenting on.

From Alexander’s post:

For years, we’ve been warning patients that their sleeping pills could kill them. How? In every way possible. People taking sleeping pills not only have higher all-cause mortality. They have higher mortality from every individual cause studied. . . . Even if you take sleeping pills only a few nights per year, your chances of dying double or triple. . . .

When these studies first came out, doctors were understandably skeptical. . . . The natural explanation was that the studies were confounded. People who have lots of problems in their lives are more stressed. Stress makes it harder to sleep at night. People who can’t sleep at night get sleeping pills. Therefore, sleeping pill users have more problems, for every kind of problem you can think of. When problems get bad enough, they kill you. This is why sleeping pill users are more likely to die of everything.

He continues:

This is a reasonable and reassuring explanation. But people tried to do studies to test it, and the studies kept finding that sleeping pills increased mortality even when adjusted for confounders. . . .

And he goes through a few large studies from 2012 onward that estimate that sleeping pills are associated with large increased risks of illness and death, after adjusting for age, sex, and various physical and mental health risk factors (which I’ll assume were measured before the study began, hence the “pre-treatment” in the title above).

And then Alexander comes to a relatively new study, from 2017:

They do the same kind of analysis as the other studies, using a New Jersey Medicare database to follow 4,182,305 benzodiazepine users and 35,626,849 non-users for nine years. But unlike the other studies, they find minimal to zero difference in mortality risk between users and non-users. Why the difference? . . .

They adjusted for three hundred confounders.

This [says Alexander] is a totally unreasonable number of confounders to adjust for. I’ve never seen any other study do anything even close. Most other papers in this area have adjusted for ten or twenty confounders. Kripke’s study adjusted for age, sex, ethnicity, marital status, BMI, alcohol use, smoking, and twelve diseases. Adjusting for nineteen things is impressive. It’s the sort of thing you do when you really want to cover your bases. Adjusting for 300 different confounders is totally above and beyond what anyone would normally consider.

It’s funny that Alexander says this. I’ve never adjusted for hundreds of confounders myself, nor have I done the equivalent in sample surveys and adjusted for hundreds of poststratification variables.

But I think Jennifer has adjusted for hundreds of confounders in some real problems, so maybe she’d have some comments on this.

You certainly *should* be able to adjust for 300 pre-treatment variables in a causal analysis. After all, if you only adjust for the first 10 variables on your list, you’re also adjusting for the other 290, just in a super-regularized way, setting all those adjustments to zero. More generally, we can use regularized regression / machine learning to adjust for lots and lots of predictors and their interactions. I think Jennifer is going to put this in Advanced Regression and Multilevel Models (the successor to Regression and Other Stories).
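To make the “super-regularized adjustment” idea concrete, here is a minimal sketch on simulated data (all numbers hypothetical, plain NumPy, a closed-form ridge penalty standing in for whatever regularized method one would actually use): include the treatment plus all 300 pre-treatment covariates and let the penalty do the shrinking.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 300

# Simulated pre-treatment confounders; only the first 10 matter strongly.
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:10] = 1.0

# Treatment uptake depends on the confounders; the true treatment effect is zero.
t = (X @ beta / np.sqrt(10) + rng.normal(size=n) > 0).astype(float)
y = X @ beta + rng.normal(size=n)

def ridge_treatment_effect(X, t, y, lam):
    """Closed-form ridge fit of y on [t, X]; returns the treatment coefficient."""
    Z = np.column_stack([t, X])
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    coef = np.linalg.solve(A, Z.T @ y)
    return coef[0]

naive = np.mean(y[t == 1]) - np.mean(y[t == 0])   # unadjusted difference: large
adj = ridge_treatment_effect(X, t, y, lam=10.0)   # adjusting for all 300: near zero
print(f"naive: {naive:.2f}, ridge-adjusted: {adj:.2f}")
```

The point is only that including hundreds of covariates is not inherently a problem when the fit is regularized; whether those covariates are legitimate confounders is a separate question.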

The results of any particular analysis will be sensitive to model specification, so I’ll make no general claim on what to believe in one of these studies. Maybe someone who works in this area can take a look.

**P.S.** I really like that Alexander said “adjust” rather than “control.” I think he understands a lot more about statistics than I do about psychiatry.

“The results of any particular analysis will be sensitive to model specification…”

To me, that’s the crux of the problem. Every variable you adjust for requires a specification. If the variable is categorical, there isn’t much of an issue. But many variables are interval- or ratio-level, and there the functional relationship specified in the modeling may be quite important. Mis-specification of a confounder can easily create spurious relationships or mask important ones.

In medical research in general (I have not read the specific paper, so I don’t know whether this applies here), it is very common to simply model all non-discrete confounders as linearly related to the outcome. While this might be a reasonable approximation when the range of variation of the confounder is narrow, and may even hold globally for an occasional relationship, the risk of misspecification under this practice seems high.
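A small simulation (hypothetical data, not from the paper under discussion) shows how adjusting linearly for a confounder whose true effect is nonlinear can manufacture a treatment effect out of nothing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# One confounder with a U-shaped effect on the outcome.
x = rng.normal(size=n)
t = (np.abs(x) > 1).astype(float)       # treatment uptake driven by extreme x
y = x**2 + rng.normal(size=n)           # true treatment effect is zero

def ols_treatment_coef(design, y):
    """OLS fit; returns the coefficient on the treatment column (index 1)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

ones = np.ones(n)
spurious = ols_treatment_coef(np.column_stack([ones, t, x]), y)        # linear in x
correct = ols_treatment_coef(np.column_stack([ones, t, x, x**2]), y)   # quadratic term added
print(f"linear adjustment: {spurious:.2f}, quadratic adjustment: {correct:.2f}")
```

With the linear adjustment the treatment coefficient comes out large even though the true effect is zero; adding the quadratic term removes it.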

So, if you’re adjusting for 300 confounders, probably an appreciable number of those will be non-categorical variables, so the probability that some of them are mis-specified to an important extent becomes high. While I wouldn’t stipulate any particular maximum number of non-categorical variables, I agree with Scott Alexander’s instinct that the more things you adjust for, the higher the risk of a serious mis-specification, and the more skeptical you should be of the results.

Even adjusting for categorical variables isn’t as straightforward as you imply. Categorical variables can come in combinations: if you have 20 of them with 2 categories each, there are 2^20 ≈ 1 million combinations. In a database of 4 million pill users…

if, out of the 300 variables, you have say 100 that are categorical with 2 categories, there are 2^100 ≈ 1.3×10^30 possibilities.
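The arithmetic behind these counts, using the study’s reported user total:

```python
# Number of distinct cells when fully crossing binary covariates.
cells_20 = 2 ** 20                      # 20 binary variables
cells_100 = 2 ** 100                    # 100 binary variables
users_per_cell = 4_182_305 / cells_20   # the study's users spread over 2^20 cells
print(cells_20)              # 1048576
print(f"{cells_100:.2e}")    # 1.27e+30
print(f"{users_per_cell:.1f}")
```

So even exact matching on just 20 binaries already leaves only about four users per cell on average, and 100 binaries is hopeless.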

If you do a principal component analysis, you can build up components which are linear combinations of variables that are each either 1 or 0… something like a*x1 + b*x2 + c*x3 + …

If you start with 100 binary variables, you could project this onto a space of say 3 to 5 principal components. But now the principal components are weighted sums of binary variables, and by the central limit theorem you’d expect something like a normal distribution; so even tens of binary variables produce something you could treat as a continuous variable for modeling purposes.
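A quick simulation supports the point (illustrative only; random weights standing in for the loadings a principal component would assign): a weighted sum of even a few dozen independent binary variables is already close to normal.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100000, 50

# 50 independent binary covariates with random weights (a stand-in for
# the loadings a principal component would assign).
B = rng.integers(0, 2, size=(n, k))
w = rng.normal(size=k)
score = B @ w

# Standardize and check skewness and excess kurtosis: both near 0 for a normal.
z = (score - score.mean()) / score.std()
print(f"skewness: {np.mean(z**3):.3f}, excess kurtosis: {np.mean(z**4) - 3:.3f}")
```

So the mis-specification worries about continuous confounders apply here too: the components are effectively continuous.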

So all the problems you mention apply as well to categorical variables, unless you have just a couple of them.

When I read this post I thought it was a regression model with 300 independent variables. But it’s actually a propensity score model and then comparisons where there is overlap.

I had interpreted it to mean there were 300 independent variables as well. So, if those variables are somehow collapsed into a propensity score (and there is reference to some innovative algorithm that was used, which I did not try to understand), then the multicollinearity I was concerned about probably isn’t relevant. Still, my understanding is that the vast majority of prescriptions for sleeping pills are given to people who have other medical issues, so if all these conditions are included in the model, it might not be possible to distinguish the effects of the conditions from the effects of the sleeping pills.

They use a couple of different types of PSM (1:1 and “high dimensional”). Either way, King and Nielsen would probably disagree:

https://gking.harvard.edu/files/gking/files/psnot.pdf

I haven’t read the paper carefully, but I did look at the list of variables they adjusted for. These include many different medical conditions and many different treatments. Couldn’t their nonsignificant results merely reflect multicollinearity problems? So many of these variables are related to each other that I would expect the standard errors to grow quite large when all are included. Can someone comment on why this is unlikely?

One possible method to deal with this issue would be to throw all those variables into a principal component analysis, and then just pick out the top N principal components. But this will leave you with something where you have relatively little intuition about what the components mean.
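Here’s what that workflow looks like in miniature (simulated correlated confounders, PCA via SVD in plain NumPy; the dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 5000, 300, 5

# Correlated confounders generated from a low-dimensional latent structure.
latent = rng.normal(size=(n, k))
load = rng.normal(size=(k, p))
X = latent @ load + 0.5 * rng.normal(size=(n, p))

# PCA via SVD: keep the top k principal components as the adjustment set.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt[:k].T        # n-by-k score matrix to adjust for

var_explained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"top {k} components explain {var_explained:.0%} of the variance")
```

When the confounders really do share a low-dimensional structure, a handful of components captures most of the variance; the interpretability cost is that each component is an unnamed blend of hundreds of variables.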

Belloni-Chernozhukov-Hansen “post-double-selection” and related machine-learning approaches (theory-based methods for reducing the dimensionality of controls using ML, etc.) are the way to go. (But I’m biased, having coded and written about it.) Hundreds of confounders are not a problem … if in principle they are legitimate confounders. GIGO is always lurking around the corner!
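For readers unfamiliar with post-double-selection, a toy sketch (simulated data; a hand-rolled ISTA lasso solver, not any published implementation): run a lasso of the outcome on the controls, a lasso of the treatment on the controls, then OLS of the outcome on the treatment plus the union of selected controls.

```python
import numpy as np

def lasso_ista(X, y, lam, iters=500):
    """Minimal ISTA solver for lasso: min ||y - Xb||^2 / (2n) + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    for _ in range(iters):
        g = X.T @ (X @ b - y) / n              # gradient of the smooth part
        b = np.sign(b - g / L) * np.maximum(np.abs(b - g / L) - lam / L, 0)
    return b

rng = np.random.default_rng(4)
n, p = 1000, 200
X = rng.normal(size=(n, p))
t = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)       # treatment model
y = 2.0 * t + X[:, 0] + X[:, 2] + rng.normal(size=n)   # true effect = 2

# Double selection: keep covariates predictive of y OR of t.
sel_y = np.flatnonzero(lasso_ista(X, y, lam=0.1) != 0)
sel_t = np.flatnonzero(lasso_ista(X, t, lam=0.1) != 0)
keep = np.union1d(sel_y, sel_t)

# Final OLS of y on t plus the union of selected controls.
Z = np.column_stack([np.ones(n), t, X[:, keep]])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(f"estimated treatment effect: {coef[1]:.2f}")    # close to the true value of 2
```

The union step is the key design choice: selecting on the outcome model alone can drop controls that matter mainly through the treatment equation, biasing the estimate.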

So, there are two different aspects of “multicollinearity”. One aspect is that the pre-treatment predictors are correlated with each other. I believe that’s what you are primarily referring to. Yes, this will cause issues in terms of our estimates of the coefficients for those variables, but that actually isn’t that much of a problem in their setup, because they are doing propensity matching, and the main thing that matters in propensity matching is the standard error of the propensity estimates, which is much less sensitive to multicollinearity.

The second aspect is that the treatment (sleeping pills) is probably highly correlated with the pre-treatment variables. This will actually cause a lot of problems, as you start getting minimal overlap.
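The overlap problem can be made concrete with a simulation (hypothetical single-confounder setup, known true propensities): as the confounder’s influence on treatment grows, inverse-propensity weights explode and the effective sample size collapses.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50000
x = rng.normal(size=n)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

ess = {}
for strength in (0.5, 3.0):
    p = sigmoid(strength * x)                # true propensity score
    t = rng.random(n) < p                    # treatment assignment
    w = np.where(t, 1 / p, 1 / (1 - p))      # inverse-propensity weights
    ess[strength] = w.sum() ** 2 / (w ** 2).sum()   # Kish effective sample size
    print(f"strength={strength}: effective sample size {ess[strength]:.0f} of {n}")
```

With weak confounding nearly the whole sample is usable; with strong confounding most of the nominal sample contributes almost nothing, which is the “minimal overlap” problem in numbers.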

I see two general problems that could be affecting these results, both of which tend to be ignored in clinical studies. Both stem from treating the control model as independent of the regressor of interest and failing to think seriously about the causal model being tested. (Yes, correlation is not causality, but relationships that are believed to be causal are routinely tested in correlation studies.)

One is the problem raised by Ethan, where introducing collinear control variables decreases the precision of the estimates. I’ve seen cases where all the coefficients are driven toward zero. I’ve seen other cases where two correlated variables wind up with opposite signs, large and statistically significant, and great efforts are made in published papers to explain the unexpected sign, when, had the model been run with only one of the correlated variables, and exploratory analysis of the correlations on the right hand side been conducted, the unexpected-sign problem would have gone away and been understood.

The second problem is when some of the control variables are actually mediators of the effect of the regressor of interest. In reduced form without the controls, the effect is strong; as mediators are introduced, the estimate is the marginal effect of the unmediated pathways. Throw in enough mediators and the effect could go to zero. I recall a paper about factors influencing health outcomes in South Africa that estimated a close-to-zero association with race. In the control model were variables for education, income, housing location, and a slew of other things that were all influenced by race in South Africa. Most of the mediated paths were in the control model.
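The mediator problem is easy to reproduce in a simulation (hypothetical data; one exposure, one mediator): the total effect is strong, and “adjusting” for the mediator leaves only the direct pathway.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10000

# x causes m, and both affect y; m is a mediator, not a confounder.
x = rng.normal(size=n)
m = x + rng.normal(size=n)                    # mediator on the causal path
y = 0.5 * x + 1.0 * m + rng.normal(size=n)    # total effect of x = 0.5 + 1.0 = 1.5

def ols_coef(design, outcome, idx):
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[idx]

ones = np.ones(n)
total = ols_coef(np.column_stack([ones, x]), y, 1)       # recovers ~1.5
direct = ols_coef(np.column_stack([ones, x, m]), y, 1)   # shrinks to ~0.5
print(f"total effect: {total:.2f}, after adjusting for the mediator: {direct:.2f}")
```

Neither regression is “wrong” as a regression; they simply answer different causal questions, which is why the conceptual model matters.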

There is a tendency to do research by rote. Pick an outcome, pick a regressor of interest, throw everything you can into the control model without thinking about how it is related to either the outcome or the regressor of interest, and report the statistical significance, sign and maybe the magnitude of observed association of the outcome and regressor of interest without reflection on the underlying model. It’s a terrible way to do research, and often gets the answer wrong. Andrew often speaks of constructing and testing models with the statistical analysis. It is critical that a researcher have a conceptual model of the relationships on the right hand side and the paths by which they influence one another and the outcome, to both structure the analysis and interpret results.

Jack said,

“There is a tendency to do research by rote. Pick an outcome, pick a regressor of interest, throw everything you can into the control model without thinking about how it is related to either the outcome or the regressor of interest, and report the statistical significance, sign and maybe the magnitude of observed association of the outcome and regressor of interest without reflection on the underlying model. It’s a terrible way to do research, and often gets the answer wrong. Andrew often speaks of constructing and testing models with the statistical analysis. It is critical that a researcher have a conceptual model of the relationships on the right hand side and the paths by which they influence one another and the outcome, to both structure the analysis and interpret results.”

Yup. — “Gimme a procedure to use; don’t ask me to think.”

‘Yup. — “Gimme a procedure to use; don’t ask me to think.” ‘

I think – ha ha – it’s more a matter of what people want to think about, rather than whether they want to think at all. People want to think about how to apply the results once they get them, not how to get them in the first place.

After all, the modern scientist is here to save the world. Doing the actual science is just the rote work in service of that goal and, in the end, while there are certain perils that are more thrilling to contemplate (death by soup overconsumption, no doubt driven by Big Food’s relentless advertising, no doubt dreamed up by men with fat arms who don’t believe in wealth redistribution), any peril could conceivably lead to amazing self promotion opportunities.

“More generally, we can used regularized regression / machine learning to adjust for lots and lots of predictors and their interactions.”

Like using the Finnish horseshoe prior?

Back in grad school, if you wanted to shut down someone’s argument, all you had to do was shrug and say “It’s an empirical question.”

Here, it really is an empirical question–why debate the theoretical basis for the analysis when you can just test it directly? If you can get a hold of the pertinent covariance matrix, or create reasonable approximations thereof, just run simulations for various parameter assumptions. Granted, it may not be that easy if the details of their PSM methods are unclear from the paper, but you can just use standard approaches to PSM as a proxy. Anyone looking for a master’s thesis?

Alternatively, if you have the actual data and want to be really empirical, run the analysis with a known killer (like smoking) as the DV and see what effect size estimate you get after adjusting for 300 predictors (taking out smoking, obviously, but now adjusting for sleeping pills). Your sample will shrink (most people who smoke begin smoking earlier in life than they begin taking sleeping pills, so you have to drop those smoking already at pretest) as will your pool of predictors (some IVs will not be measurable that early), but you get the general idea. There are probably several known killers among the predictors in this study and you could do this with all of them, which should also give you a hint as to how big an effect has to be to survive being adjusted for hundreds of pretest predictors.

They could be adjusting for a collider (as well as mediators). So, I wouldn’t buy an estimate with so many variables without an explicit causal model (DAG) that takes into account the possibility of collider bias.
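Collider bias is also easy to demonstrate by simulation (hypothetical data; treatment and outcome truly independent, with a common effect): conditioning on the collider manufactures an association out of nothing.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100000

# t and y are independent; c is a collider caused by both.
t = rng.normal(size=n)
y = rng.normal(size=n)                 # true effect of t on y is zero
c = t + y + rng.normal(size=n)         # common effect (collider)

def ols_coef(design, outcome, idx):
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[idx]

ones = np.ones(n)
unadjusted = ols_coef(np.column_stack([ones, t]), y, 1)      # near zero, correctly
adjusted = ols_coef(np.column_stack([ones, t, c]), y, 1)     # spuriously negative
print(f"unadjusted: {unadjusted:.2f}, collider-adjusted: {adjusted:.2f}")
```

This is the mirror image of the confounding problem: here it is the *adjustment* that creates the bias, which is why a DAG, rather than a longer covariate list, is what settles whether a variable belongs in the model.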

Agree. My first thought was: are all 300 variables true confounders? Did they use a DAG?