Agreed: my first thought was, are all 300 variables true confounders? Did they use a DAG?

Here, it really is an empirical question: why debate the theoretical basis for the analysis when you can just test it directly? If you can get hold of the pertinent covariance matrix, or construct reasonable approximations of it, just run simulations under various parameter assumptions. Granted, it may not be that easy if the details of their PSM methods are unclear from the paper, but you can use standard approaches to PSM as a proxy. Anyone looking for a master’s thesis?

Alternatively, if you have the actual data and want to be really empirical, run the analysis with a known killer (like smoking) as the DV and see what effect-size estimate you get after adjusting for 300 predictors (taking out smoking, obviously, but now adjusting for sleeping pills). Your sample will shrink (most people who smoke begin smoking earlier in life than they begin taking sleeping pills, so you have to drop those already smoking at pretest), as will your pool of predictors (some IVs will not be measurable that early), but you get the general idea. There are probably several known killers among the predictors in this study, and you could do this with all of them, which should also give you a hint as to how big an effect has to be to survive adjustment for hundreds of pretest predictors.

Like using the Finnish horseshoe prior?

“Yup. — ‘Gimme a procedure to use; don’t ask me to think.’”

I think – ha ha – it’s more a matter of what people want to think about, rather than whether they want to think at all. People want to think about how to apply the results once they get them, not how to get them in the first place.

After all, the modern scientist is here to save the world. Doing the actual science is just the rote work in service of that goal and, in the end, while there are certain perils that are more thrilling to contemplate (death by soup overconsumption, no doubt driven by Big Food’s relentless advertising, no doubt dreamed up by men with fat arms who don’t believe in wealth redistribution), any peril could conceivably lead to amazing self-promotion opportunities.

“There is a tendency to do research by rote. Pick an outcome, pick a regressor of interest, throw everything you can into the control model without thinking about how it is related to either the outcome or the regressor of interest, and report the statistical significance, sign and maybe the magnitude of observed association of the outcome and regressor of interest without reflection on the underlying model. It’s a terrible way to do research, and often gets the answer wrong. Andrew often speaks of constructing and testing models with the statistical analysis. It is critical that a researcher have a conceptual model of the relationships on the right hand side and the paths by which they influence one another and the outcome, to both structure the analysis and interpret results.”

Yup. — “Gimme a procedure to use; don’t ask me to think.”

One is the problem raised by Ethan, where introducing collinear control variables decreases the precision of the estimates. I’ve seen some cases where all the coefficients are driven toward zero. I’ve seen other cases where two correlated variables wind up with large, statistically significant coefficients of opposite signs, and, in published papers, great efforts made to explain the unexpected sign, when, had the model been run with only one of the correlated variables and exploratory analysis of the correlations on the right-hand side been conducted, the unexpected-sign problem would have gone away and been understood.
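To make the opposite-signs phenomenon concrete, here is a toy numpy simulation (my own setup, not anything from the paper): two nearly collinear predictors, only one of which actually matters. The individual coefficients are poorly identified, while their sum, and the single-variable regression, are well behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
y = x1 + rng.normal(size=n)              # truth: only x1 matters

# regression with both collinear variables: individual coefficients unstable
X_both = np.column_stack([np.ones(n), x1, x2])
b_both, *_ = np.linalg.lstsq(X_both, y, rcond=None)

# regression with x1 alone: coefficient well estimated near 1
X_one = np.column_stack([np.ones(n), x1])
b_one, *_ = np.linalg.lstsq(X_one, y, rcond=None)

# variance inflation factor for x1 given x2
r2 = np.corrcoef(x1, x2)[0, 1] ** 2
vif = 1.0 / (1.0 - r2)
```

With a correlation around 0.999 between x1 and x2, the VIF is in the hundreds, so the two coefficients can individually wander off (sometimes to opposite signs) even though their sum stays close to the true value of 1.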

The second problem is when some of the control variables are actually mediators of the effect of the regressor of interest. In reduced form without the controls, the effect is strong; as mediators are introduced, the estimate is the marginal effect of the unmediated pathways. Throw in enough mediators and the effect could go to zero. I recall a paper about factors influencing health outcomes in South Africa that estimated a close to zero association of race. In the control model were variables for education, income, housing location, and a slew of other things that were all influenced by race in South Africa. Most of the mediated paths were in the control model.
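The mediator point is easy to demonstrate with simulated data (a toy example; the setup and variable names are mine): the treatment affects the outcome only through the mediator, so the reduced-form coefficient is large, and conditioning on the mediator drives it to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
d = rng.binomial(1, 0.5, size=n).astype(float)   # "treatment" of interest
m = 2.0 * d + rng.normal(size=n)                  # mediator caused by d
y = 1.5 * m + rng.normal(size=n)                  # d affects y only through m

def coef_on_d(controls):
    # OLS coefficient on d, with the given list of control columns
    Z = np.column_stack([np.ones(n), d] + controls)
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

b_reduced = coef_on_d([])      # total effect of d, about 3.0
b_controlled = coef_on_d([m])  # "effect" after controlling the mediator: near 0
```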

There is a tendency to do research by rote. Pick an outcome, pick a regressor of interest, throw everything you can into the control model without thinking about how it is related to either the outcome or the regressor of interest, and report the statistical significance, sign and maybe the magnitude of observed association of the outcome and regressor of interest without reflection on the underlying model. It’s a terrible way to do research, and often gets the answer wrong. Andrew often speaks of constructing and testing models with the statistical analysis. It is critical that a researcher have a conceptual model of the relationships on the right hand side and the paths by which they influence one another and the outcome, to both structure the analysis and interpret results.

They use a couple of different types of PSM (1:1 and “high dimensional”). Either way, King and Nielsen would probably disagree:

So, there are two different aspects of “multicollinearity”. One aspect is that the pre-treatment predictors are correlated with each other. I believe that’s what you are primarily referring to. Yes, this will cause issues with our estimates of the coefficients for those variables, but that actually isn’t much of a problem in their setup, because they are doing propensity matching, and the main thing that matters in propensity matching is the standard error of the propensity estimates, which is much less sensitive to multicollinearity.

The second aspect is that the treatment (sleeping pills) is probably highly correlated with the pre-treatment variables. This will actually cause a lot of problems, as you start getting minimal overlap.
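A quick simulated illustration of that overlap problem (the single strong predictor and the logistic assignment rule are my assumptions): when treatment is strongly driven by pre-treatment variables, a large share of the treated have propensities so extreme that there are essentially no comparable untreated units to match them to.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x = rng.normal(size=n)                       # a strong pre-treatment predictor
p_treat = 1.0 / (1.0 + np.exp(-3.0 * x))     # treatment strongly driven by x
d = rng.binomial(1, p_treat)

# fraction of treated units whose propensity is so extreme (> 0.95)
# that almost no comparable untreated units exist
extreme = np.mean(p_treat[d == 1] > 0.95)
```

In this setup roughly a third of the treated sit above a 0.95 propensity; matching can only discard them or accept terrible matches.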

Belloni–Chernozhukov–Hansen “post-double-selection” and related machine-learning approaches (theory-based methods for reducing the dimensionality of controls using ML, etc.) are the way to go. (But I’m biased, having coded and written about it.) Hundreds of confounders are not a problem … if, in principle, they are legitimate confounders. GIGO is always lurking around the corner!
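For anyone curious, the post-double-selection recipe is short enough to sketch: lasso the outcome on the controls, lasso the treatment on the controls, then run OLS of the outcome on the treatment plus the union of the two selected sets. Here is a toy version on simulated data (using scikit-learn’s LassoCV; the data-generating setup and all names are mine, not Belloni et al.’s code):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 500, 100
X = rng.normal(size=(n, p))
# only the first three of the 100 controls are true confounders
d = X[:, :3] @ np.ones(3) + rng.normal(size=n)
y = d + X[:, :3] @ np.array([2.0, -1.0, 1.5]) + rng.normal(size=n)

# naive regression of y on d alone is badly confounded
naive = np.linalg.lstsq(np.column_stack([np.ones(n), d]), y, rcond=None)[0][1]

# step 1: lasso-select controls that predict y
sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
# step 2: lasso-select controls that predict d
sel_d = np.flatnonzero(LassoCV(cv=5).fit(X, d).coef_)
# step 3: OLS of y on d plus the union of both selected sets
keep = np.union1d(sel_y, sel_d)
Z = np.column_stack([np.ones(n), d, X[:, keep]])
theta = np.linalg.lstsq(Z, y, rcond=None)[0][1]  # estimate of the true effect, 1.0
```

Note the reduced form of y has a zero coefficient on the second confounder, so the y-lasso alone can miss it; the d-lasso picks it up, which is exactly why the union of the two selected sets matters.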

I had interpreted it to mean there were 300 independent variables as well. So, if those variables are somehow collapsed into a propensity score (there is reference to some innovative algorithm, which I did not try to understand), then the multicollinearity I was concerned about might not be relevant. Still, my understanding is that the vast majority of prescriptions for sleeping pills are given to people who have other medical issues; if all those conditions are included in the model, it might not be possible to distinguish between the effects of the conditions and the effects of the sleeping pills.

One possible way to deal with this issue would be to throw all those variables into a principal component analysis and then just pick out the top N principal components. But this leaves you with something where you have relatively little intuition about what the components mean.
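A minimal numpy sketch of that approach (simulated data; the shared factor across the first 50 columns is my assumption, there just to give the top components something to find):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 300
X = rng.normal(size=(n, p))
X[:, :50] += rng.normal(size=(n, 1))   # shared factor so a few components dominate

# principal components via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
scores = Xc @ Vt[:k].T                 # top-k component scores for each subject
explained = s**2 / np.sum(s**2)        # variance share of each component
```

You then carry `scores` forward as your adjustment set, with the interpretability cost noted above: each component is an opaque weighted mix of all 300 original variables.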

Even adjusting for categorical variables isn’t as straightforward as you imply. Categorical variables come in combinations: if you have 20 of them with 2 categories each, there are 2^20 ≈ 1 million combinations. In a database of 4 million pill users…

And if, out of the 300 variables, you have say 100 that are categorical with 2 categories each, there are 2^100 ≈ 1.3×10^30 possibilities.

If you do a principal component analysis, you can build up components that are linear combinations of variables that are either 1 or 0… something like a*x1 + b*x2 + c*x3 ….

If you start with 100 binary variables, you could project them onto a space of, say, 3 to 5 principal components… but now the principal components are weighted sums of binary variables. By the central limit theorem you’d expect something like a normal distribution, so even tens of binary variables produce something you could treat as a continuous variable for modeling purposes.

So all the problems you mention apply to categorical variables as well, unless you have just a couple of them.
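Here is a small simulation of that normality claim (my toy setup): a weighted sum of a few dozen centered binary variables is already close to normal, as the near-zero skewness and excess kurtosis show.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 5000, 40
B = rng.binomial(1, 0.5, size=(n, k)).astype(float)   # 40 binary variables
w = rng.uniform(0.5, 1.5, size=k)                      # arbitrary positive loadings
score = (B - 0.5) @ w                                  # centered weighted sum

# standardized scores should look close to normal (central limit theorem)
z = (score - score.mean()) / score.std()
skew = np.mean(z**3)
kurt_excess = np.mean(z**4) - 3.0
```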

To me, that’s the crux of the problem. Every variable you adjust for requires a specification. If the variable is categorical, there isn’t an issue here. But many variables are interval or ratio level information, and there the functional relationship specified in the modeling may be quite important. A mis-specification of a confounder can easily create spurious relationships or mask important ones.

In medical research in general, (I have not read the specific paper, so I don’t know if this applies here or not) it is very common to simply model all non-discrete confounders as being linearly related to the outcome. While this might be a reasonable approximation when the range of variation of the confounder is narrow, and may even be true large-scale for an occasional relationship, it seems very likely that the risk of misspecification using this practice is high.

So, if you’re adjusting for 300 confounders, probably an appreciable number of those will be non-categorical variables, which implies that the probability that some of them will be mis-specified to an important extent gets high. While I wouldn’t stipulate any particular maximum number of non-categorical variables, I agree with Scott Alexander’s instinct that the more things you adjust for, the higher the risk of a serious mis-specification, and the more skeptical you should be of the results.
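A toy simulation of how a linearly mis-specified confounder manufactures a spurious effect (the setup is entirely mine): the outcome depends on the square of the confounder, treatment is assigned above a cutoff on the confounder, and the true treatment effect is exactly zero. Linear adjustment leaves a large “effect”; including the quadratic term removes it.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
x = rng.normal(size=n)                    # the confounder
d = (x > 0.5).astype(float)               # treatment driven by the confounder
y = x**2 + rng.normal(size=n)             # true treatment effect is exactly zero

def coef_on_d(controls):
    # OLS coefficient on d, given a list of control columns
    Z = np.column_stack([np.ones(n), d] + controls)
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

linear = coef_on_d([x])          # linear adjustment: large spurious "effect"
flexible = coef_on_d([x, x**2])  # correct specification: effect near zero
```

With 300 adjusters, you only need a handful of cases like `linear` above for the headline estimate to be junk, which is the point of the skepticism.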
