Chris Hane writes:
I am a scientist needing to model a treatment effect on a population of ~500 people. The dependent variable in the model is the difference between a person’s pre-treatment 12-month total medical cost and their post-treatment cost. There is large variation in costs across people, but much less in the difference between pre- and post-treatment costs. The issue I’d like some advice on is that the treatment has already occurred, so there is no possibility of creating a fully randomized control group now. I do have a very large population of people to use as possible controls via propensity scoring or exact matching.
If I had a few thousand people to possibly match, then I would use standard techniques. However, I have a potential population of over a hundred thousand people. An exact match of the possible controls on age, gender, and region of the country still leaves a population of 10,000 controls. Even if I use propensity scores to weight the 10,000 observations (understanding the problems that poses), I am concerned there are too many controls to see the effect of the treatment.
Would you suggest using narrower matching criteria to get the “best” matches, would weighting the observations be enough, or should I also consider creating many models by sampling from both treatment and control and averaging their results? If you could point me to some papers that tackle similar issues that would be great.
My reply: Others know more about this than me, but my quick reaction is . . . what’s wrong with having 10,000 controls? I don’t see why this would be a problem at all. In a regression analysis, having more control observations shouldn’t create any problems. But, sure, match on lots of variables. Don’t just control for age, sex, and region; control for as many relevant pre-treatment variables as you can get.
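To make the point about 10,000 controls concrete, here is a minimal simulation sketch, not an analysis of the actual data. Everything in it is a made-up assumption: the function name `simulate_se`, the three pre-treatment covariates, the noise level, and the true effect of -500. Treatment is also assigned independently of the covariates, so it says nothing about confounding; it only shows what happens to the precision of a regression estimate when 500 treated people are compared with 500 versus 10,000 controls.

```python
import numpy as np

def simulate_se(n_treated, n_control, effect=-500.0, seed=0):
    """Simulate cost changes, regress on treatment plus pre-treatment
    covariates, and return (estimated effect, classical OLS std. error).
    All data-generating values here are hypothetical."""
    rng = np.random.default_rng(seed)
    n = n_treated + n_control
    treat = np.concatenate([np.ones(n_treated), np.zeros(n_control)])

    # Hypothetical pre-treatment covariates.
    age = rng.normal(50.0, 10.0, n)          # years
    sex = rng.integers(0, 2, n).astype(float)
    prior = rng.gamma(2.0, 2.0, n)           # prior-year cost, in $1000s

    # Outcome: change in 12-month cost (post minus pre), in dollars.
    noise = rng.normal(0.0, 3000.0, n)
    y = effect * treat + 20.0 * (age - 50.0) + 300.0 * sex - 100.0 * prior + noise

    # Ordinary least squares with an intercept.
    X = np.column_stack([np.ones(n), treat, age - 50.0, sex, prior])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], float(np.sqrt(cov[1, 1]))

for n_control in (500, 10_000):
    est, se = simulate_se(500, n_control)
    print(f"controls={n_control:>6}: estimate={est:8.1f}, se={se:6.1f}")
```

Under these assumptions the standard error of the treatment coefficient scales roughly like sqrt(1/500 + 1/n_control), so adding controls shrinks it toward the floor set by the 500 treated people rather than hiding the effect.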