Jonas Cederlöf writes:

I’m a PhD student in economics at Stockholm University and a frequent reader of your blog. I have for a long time followed your efforts to bring attention to p-hacking and multiple-comparison problems in research. I’m now faced with this problem myself, and I want at the very least to avoid picking (or being subject to the critique of having picked) a control group that merely gives me fancy results. The setting is the following:

I run a difference-in-differences (DD) model between occupations, where people working in occupation X are treated at year T. There are 354 other occupations, and for at least 20-30 of them I could make up a “credible” story about why they would be a natural control group. One could of course run the DD estimation on the treated group vs. the entire labor market, but claiming causality between the reform and the outcome hinges not only on the parallel-trends assumption but also on the absence of group-specific shocks. Hence one might want to find a control group subject to the same type of shocks as the treated occupation X, so one might be better off picking specific occupations from the remaining 354 categories. Some of these might show parallel trends and others might not, but wouldn’t choosing groups this way, based on parallel trends, be p-hacking? The reader has no guarantee that I as a researcher haven’t picked control groups that give me the results that will get me published.

So in summary: When one has 1 treated group and 354 potential control groups, how does one go about choosing among these?

My response: rather than picking one analysis (either ahead of time or after seeing the data), I suggest you do all 354 analyses and put them together using a hierarchical model as discussed in this paper. Really, this is not doing 354 analyses, it’s doing one analysis that includes all these comparisons.
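To make the partial-pooling idea concrete, here is a minimal sketch (with made-up numbers) of combining many per-control DD estimates under a normal hierarchical model, using a simple method-of-moments (DerSimonian-Laird) estimate of the between-comparison variance. The estimates, standard errors, and the number of controls are all hypothetical; in practice you would feed in the DD estimate and standard error from each treated-vs-control comparison.

```python
import numpy as np

# Hypothetical per-control DD estimates and standard errors
# (one DD effect per candidate control occupation).
rng = np.random.default_rng(0)
K = 25                                   # e.g. the 20-30 credible controls
true_effect = 0.5
se = rng.uniform(0.1, 0.3, K)            # sampling s.e. of each comparison
est = true_effect + rng.normal(0, 0.2, K) + rng.normal(0, 1, K) * se

# DerSimonian-Laird estimate of the between-comparison variance tau^2.
w = 1 / se**2
mu_fe = np.sum(w * est) / np.sum(w)      # fixed-effect (fully pooled) mean
Q = np.sum(w * (est - mu_fe)**2)
tau2 = max(0.0, (Q - (K - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

# Partial pooling: overall mean weights each comparison by total precision,
# and each comparison's estimate shrinks toward that mean.
w_re = 1 / (se**2 + tau2)
mu_re = np.sum(w_re * est) / np.sum(w_re)
se_mu = np.sqrt(1 / np.sum(w_re))
shrunk = ((est / se**2 + mu_re / tau2) / (1 / se**2 + 1 / tau2)
          if tau2 > 0 else np.full(K, mu_re))
```

The point is that this is one analysis: all comparisons are reported, and noisy ones get pulled toward the common mean instead of being cherry-picked.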

But you don’t want to lump controls that likely satisfy the assumptions together with controls that don’t under the same model. I would only include the 20-30 for which Jonas said he could make up a credible story.

Thanks for sharing!

@Jonas: Do you have enough pre-treatment observations (time periods)? If so, I think you might be interested in the synthetic control method. It constructs the counterfactual as a weighted average of other occupations, where the weights are chosen so the counterfactual most closely resembles the treated occupation before the intervention. You could then argue for the absence of group-specific shocks only for the subset of occupations with nonzero weights. See David McKenzie on the Development Impact blog for a great illustration: http://blogs.worldbank.org/impactevaluations/evaluating-regulatory-reforms-using-the-synthetic-control-method
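The weight-construction step of a vanilla synthetic control can be sketched as a small constrained least-squares problem: nonnegative weights summing to one, chosen to reproduce the treated occupation's pre-treatment path. The data below are simulated placeholders (the true donors are occupations 0 and 1 by construction); a real application would use the actual pre-treatment outcome series.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical pre-treatment outcomes: T0 periods for J candidate
# control occupations (columns) and a treated occupation built, by
# construction, from two of them plus noise.
rng = np.random.default_rng(1)
T0, J = 12, 10
Y_controls = rng.normal(0, 1, (T0, J)).cumsum(axis=0)   # control paths
w_true = np.array([0.6, 0.4] + [0.0] * (J - 2))
Y_treated = Y_controls @ w_true + rng.normal(0, 0.05, T0)

# Choose nonnegative weights summing to one that best match the treated
# occupation's pre-treatment path (the basic synthetic control fit).
def loss(w):
    return np.sum((Y_treated - Y_controls @ w) ** 2)

res = minimize(loss, np.full(J, 1 / J),
               bounds=[(0, 1)] * J,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1},
               method="SLSQP")
weights = res.x

# Controls with non-negligible weight are the ones whose "no
# group-specific shocks" story you then need to defend.
donors = np.where(weights > 0.01)[0]
```

This also makes the argument transparent to the reader: the weights are pinned down by pre-treatment fit, not by which comparison produces the nicest post-treatment gap.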

Perhaps stop trying to apply the “treated vs control” paradigm to situations where it makes no sense? You should probably be looking for commonalities rather than differences in your data anyway.

*I know, you already put checking for a difference from control in your dissertation aims; that is your committee’s fault. Blame them, and don’t let them drag you down deeper into the hole they’ve dug for themselves.

Perhaps you can use synthetic control methods: https://arxiv.org/pdf/1610.07748.pdf

+1

Another option is to consider modeling the shocks directly. In other words:

Outcome(x,t) = Trend(t,Treatment(x)) + Shocks(x,t)

And then hierarchically let the Shocks(x,t) terms be correlated across related occupations indexed by x, and inform the Shocks(x,t) functions with external data on shocks.
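A tiny generative sketch of this decomposition (all numbers hypothetical): a common trend, a treatment effect after year T for the treated occupation, and shocks shared within clusters of related occupations. It illustrates why differencing against related controls helps: the shared cluster shock cancels out of the difference.

```python
import numpy as np

# Outcome(x,t) = Trend(t, Treatment(x)) + Shocks(x,t), simulated.
rng = np.random.default_rng(2)
n_occ, n_t, T = 6, 10, 5
cluster = np.array([0, 0, 0, 1, 1, 1])   # hypothetical relatedness groups
treated = np.zeros(n_occ, dtype=bool)
treated[0] = True                         # occupation 0 is treated

trend = 0.3 * np.arange(n_t)                      # common Trend(t, .)
effect = 1.0 * (np.arange(n_t) >= T)              # lift after year T
cluster_shock = rng.normal(0, 0.5, (2, n_t))      # shocks shared in-cluster
idio = rng.normal(0, 0.1, (n_occ, n_t))           # idiosyncratic noise
shocks = cluster_shock[cluster] + idio            # Shocks(x, t)

outcome = trend + np.outer(treated, effect) + shocks   # Outcome(x, t)

# Differencing the treated occupation against same-cluster controls
# removes the shared cluster shock, leaving the effect plus small noise.
dd = outcome[0] - outcome[1:3].mean(axis=0)
```

Under this setup the pre-period differences hover near zero and the post-period differences near the treatment effect, which is exactly the behavior the shock model is meant to buy you.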