“Too much data”?

Chris Hane writes:

I am a scientist needing to model a treatment effect on a population of ~500 people. The dependent variable in the model is the difference in a person’s pre-treatment 12 month total medical cost versus post-treatment cost. So there is large variation in costs, but not so much by using the difference between the pre and post treatment costs. The issue I’d like some advice on is that the treatment has already occurred so there is no possibility of creating a fully randomized control now. I do have a very large population of people to use as possible controls via propensity scoring or exact matching.

If I had a few thousand people to possibly match, then I would use standard techniques. However, I have a potential population of over a hundred thousand people. An exact match of the possible controls to age, gender and region of the country still leaves a population of 10,000 controls. Even if I use propensity scores to weight the 10,000 observations (understanding the problems that poses) I am concerned there are too many controls to see the effect of the treatment.

Would you suggest using narrower matching criteria to get the “best” matches, would weighting the observations be enough, or should I also consider creating many models by sampling from both treatment and control and averaging their results? If you could point me to some papers that tackle similar issues that would be great.

My reply: Others know more about this than me, but my quick reaction is . . . what’s wrong with having 10,000 controls? I don’t see why this would be a problem at all. In a regression analysis, having more controls shouldn’t create any problems. But, sure, match on lots of variables. Don’t just control for age, sex, and region; control for as many relevant pre-treatment variables as you can get.
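
To put rough numbers on why extra controls do not hurt, here is a minimal sketch. It assumes a plain difference in means (not a matched or regression-adjusted analysis) and a common standard deviation in both groups; `SIGMA` and the function name `se_difference` are illustrative choices, not anything from the question. Only the 500 treated people come from the original message.

```python
import math


def se_difference(sigma, n_treated, n_control):
    """Standard error of a difference in means, assuming a common SD sigma."""
    return sigma * math.sqrt(1.0 / n_treated + 1.0 / n_control)


SIGMA = 1.0       # assumed common SD of the pre/post cost difference (arbitrary units)
N_TREATED = 500   # size of the treated group, from the question

# The SE shrinks as controls are added and flattens out near the floor
# set by the treated group alone, SIGMA / sqrt(N_TREATED).
floor = SIGMA / math.sqrt(N_TREATED)
for n_control in (500, 1_500, 10_000, 100_000):
    se = se_difference(SIGMA, N_TREATED, n_control)
    print(f"{n_control:>7} controls: SE = {se:.4f} (floor {floor:.4f})")
```

Under these assumptions the standard error never increases with more controls; it only stops improving much once the treated group's own sampling variability dominates.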

5 thoughts on ““Too much data”?”

  1. Almost surely better, although the cost/benefit of additional controls does decline, and in some cases where measurements on the controls are expensive, people have chosen not to exceed 3 controls for each case.

    This, I believe, is the best paper to start reading about this (and read it more than once)


    and also follow Rubin’s advice to always both match and adjust (within the matching)


  2. I agree. I have actually done database studies with very few exposed cases and thousands of controls. It never caused any trouble with the actual estimation.

    If you think the treated population is very different from the untreated population and, because it is a database, there may be confounding by indication, then I have seen a lot of success with tight propensity score matching (or stratification). I like Kurth AJE 2006 ("Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect") as a nice starting point for this type of analysis. James Robins was the statistician on that paper and he gives a very good explanation for the rather extreme results seen in that paper.

  3. Agree that this provides opportunity for a richer set of covariates on which to match and that having many controls poses no problems with respect to inference. The one thing I would add is to see if within these data, there are subjects who were exposed to some kind of shock that determined their treatment status in a manner outside their control. Coupling that kind of shock with matching makes the identifying assumption—no unmeasured confounders—more credible because you can at least believe that self-selection bias is small.

  4. i believe that judea pearl would caution against "control for as many relevant pre-treatment variables as you can get." in particular, i believe he would argue that conditioning on certain variables, depending on the assumed structure of the model, can induce spurious correlations. i think morgan and winship explain it best in their book entitled "counterfactuals and causal inference", section 3.1.2 and figure 3.4. if the model is both 'a' and 'b' cause 'c', but not one another and not vice versa, then conditioning on c can induce the appearance of correlations between a and b, even when they do not exist. for example, assume 'a' corresponds to SAT scores, and 'b' corresponds to motivation, and P[a,b]=P[a]P[b]. now, let 'c' correspond to college admittance. those that are accepted to college tend to have a+b > t, where t is some threshold. so, P[a,b|c] != P[a|c]P[b|c], as 'a' and 'b' will be negatively correlated upon conditioning on 'c'. i think it is hard to convey with just words, but pg. 67 of the above mentioned book has a clarifying figure, methinks. so, if i understand the question properly, and the issue of "collider variables" properly, then the advice would be: do not condition on any variable that you believe to be a collider, any other variable can be safely conditioned on. if that is correct, then this comment is the first time that any of the causal inference stuff has ever been useful to me in any way!

  5. Joshua – that’s true if the assumed structure of the models is true (which it never quite is, as all models are false)

    This has been discussed on this blog a couple of times before, and to me it comes down to a judgement about the credibility of the model: is it not so wrong that following its (the model's) implications does more good than harm?

    Having said that, I do believe it often does more good than harm to be explicit about models – especially if you can resist injudiciously following their implications

    p.s. I currently am guessing that some are very negative about "formal causal models" given a concern that they will be taken far too seriously in _empirical_ research
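
The SAT and motivation example in comment 4 can be checked with a short simulation. This is a minimal sketch under assumptions that are not in the comment: both traits are drawn as independent standard normals, admission is exactly a + b > t with t = 1, and the names `sat`, `motivation`, and `THRESHOLD` are illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)   # fixed seed so the run is repeatable
N = 50_000
THRESHOLD = 1.0          # assumed admission cutoff t

# 'a' and 'b' are independent by construction: P[a,b] = P[a]P[b].
sat = [rng.gauss(0, 1) for _ in range(N)]
motivation = [rng.gauss(0, 1) for _ in range(N)]

# Condition on the collider 'c': keep only those with a + b > t.
admitted = [(a, b) for a, b in zip(sat, motivation) if a + b > THRESHOLD]

r_all = pearson(sat, motivation)
r_admitted = pearson([a for a, _ in admitted], [b for _, b in admitted])

print(f"correlation, everyone:      {r_all:+.3f}")
print(f"correlation, admitted only: {r_admitted:+.3f}")
```

In the full sample the correlation should sit near zero, while among the admitted it should come out clearly negative, which is the spurious association that conditioning on a collider induces.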
