Piranhas in the rain: Why instrumental variables are not as clean as you might have thought

Woke up in my clothes again this morning
I don’t know exactly where I am
And I should heed my doctor’s warning
He does the best with me he can
He claims I suffer from delusion
But I’m so confident I’m sane
It can’t be a statistical illusion
So how can you explain
Piranhas in the rain
And if you see us on the corner
We’re just dancing in the rain
I tell my friends there when I see them
Outside my window pane
Piranhas in the rain.
— Sting (almost)

Gaurav Sood points us to this article by Jonathan Mellon, “Rain, Rain, Go away: 137 potential exclusion-restriction violations for studies using weather as an instrumental variable,” which begins:

Instrumental variable (IV) analysis assumes that the instrument only affects the dependent variable via its relationship with the independent variable. Other possible causal routes from the IV to the dependent variable are exclusion-restriction violations and make the instrument invalid. Weather has been widely used as an instrumental variable in social science to predict many different variables. The use of weather to instrument different independent variables represents strong prima facie evidence of exclusion violations for all studies using weather as an IV. A review of 185 social science studies (including 111 IV studies) reveals 137 variables which have been linked to weather, all of which represent potential exclusion violations. I conclude with practical steps for systematically reviewing existing literature to identify possible exclusion violations when using IV designs.

That sounds about right.
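To see what an exclusion-restriction violation does to an IV estimate, here is a minimal simulation with made-up numbers (the variable names and effect sizes are hypothetical, not from the paper): rain shifts turnout (the first stage), but also leaks into the outcome directly, say via mood. The Wald/IV estimator then absorbs the direct effect divided by the first-stage coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process.
rain = rng.binomial(1, 0.3, n)                  # instrument
u = rng.normal(size=n)                          # unobserved confounder
turnout = 1.0 * rain + u + rng.normal(size=n)   # first stage: rain shifts turnout

def iv_estimate(direct):
    """Wald/IV estimate of the turnout effect (true value 0.5), where
    `direct` is the strength of the exclusion violation (rain -> outcome)."""
    y = 0.5 * turnout + u + direct * rain + rng.normal(size=n)
    return np.cov(y, rain)[0, 1] / np.cov(turnout, rain)[0, 1]

print(iv_estimate(0.0))  # exclusion holds: recovers ~0.5
print(iv_estimate(0.3))  # violation: biased to ~0.5 + 0.3/1.0 = 0.8
```

The bias formula (true effect plus direct effect over first-stage coefficient) is why even a "small" secondary channel matters: a weak first stage amplifies it.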

This story reminds me of when we were looking at the notorious ovulation-and-voting study and we realized that the evolutionary psychology and social priming literatures are just loaded with potential confounders:

But the papers on ovulation and voting, shark attacks and voting, college football and voting, etc., don’t just say that voters, or some voters, are superficial and fickle. No, these papers claim that seemingly trivial or irrelevant factors have large and consistent effects, and that I don’t believe. I do believe that individual voters can be influenced by these silly things, but I don’t buy the claim that these effects are predictable in that way. The problem is interactions. For example, the effect on my vote of the local college football team losing could depend crucially on whether there’s been a shark attack lately, or on what’s up with my hormones on election day. Or the effect could be positive in an election with a female candidate and negative in an election with a male candidate. Or the effect could interact with parents’ socioeconomic status, or whether your child is a boy or a girl, or the latest campaign ad, etc.

This is also related to the piranha problem. If you take these applied literatures seriously, you’re led to the conclusion that there are dozens of large effects floating around, all bumping against each other.

Or, to put it another way, the only way you can believe in any one of these studies is if you don’t believe in any of the others.

It’s like religion. I can believe in my god, but only if I think that none of your gods exist.

The nudgelords won’t be happy about this latest paper, as it raises the concern that any nudge they happen to be studying right now is uncomfortably interacting with dozens of other nudges unleashed upon the world by other policy entrepreneurs.

Maybe they could just label this new article as Stasi or terrorism and move on to their next NPR appearance?

P.S. Gaurav adds:

I [Gaurav] need to think more about your points around the “piranha problem” but here are some initial thoughts:

a. How would we describe Thaler’s auto-enrollment in retirement savings? Big or small, and why? (And the flip side of it—the pain of choosing a plan and enrolling.)

b. I remember reading a similar point (your first point—you have two) as a critique of theists and miracle cures. The person wrote something like the following: many people who believe in God believe that there are cheap “hacks” for success, as if God has left some cheat codes in the game of life. For instance, if you wear a ring or pray or fast or what have you, you will get a bunch of material rewards.

c. There is too much stable structure to expect consistent large effects of ephemeral, ad hoc things.

d. Chemistry has a well-established literature on “catalysts,” but I think everywhere else we apply that logic, it is a misapplication.

Re. weather as an instrument, I had a few small points (and I doubt that they are new):

1. The first stage of IV is ripe for specification search. There is likely plenty of p-hacking there. More generally, where there is no clear answer on how weather should be measured (and there isn’t), people probably pick the formulation that gives the largest F-stat. To give you an example of the issues with measuring “weather”: what does “rain” even mean? Is it the duration of rain, the amount of rain, etc., over every small geographic unit? There are tons of researcher degrees of freedom and some knobby measurement issues, because we cannot measure a bunch of it super precisely.

2. Jon correctly divides studies that use weather into studies that exploit small/local (in time and space) fluctuations in ‘weather’ and those that use much larger periods (e.g., climate change/droughts). The former seem in some ways less suspect than the latter.

3. This is again from Jon. He points out that the validity of the first stage is hard to establish because of other specification issues. For instance, studies that estimate the effect of rain on voting on election day may control for long-term weather but not “medium-term” weather. “However, even short-term studies will be vulnerable to other mechanisms acting at time periods not controlled for. For instance, many turnout IV studies control for the average weather on that day of the year over the previous decade. However, this does not account for the fact that the weather on election day will be correlated with the weather over the past week or month in that area. This means that medium-term weather effects will still potentially confound short-term studies.”

4. “I think the studies showing weather -> mood (classic survey research finding) are complicated also. They don’t fully account for selective response—with bad weather, more people are likely to be home, etc. And, funnily, the effect of mood (on subsequent variables) relates to your piranha point. In some cases, it seems to me that mood is pretty paramount. For instance, I think there is a ton of variation in how optimistic, etc. people feel—when key things in their lives are unchanged—based on mood alone.”
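Gaurav’s point 1, that the first stage invites specification search, can be illustrated with a quick simulation (a purely hypothetical setup, not any real study): generate many candidate “weather” measures that are all pure noise with respect to the treatment, then report the one with the largest first-stage F-statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 100  # observations, candidate weather formulations

x = rng.normal(size=n)             # treatment, unrelated to any weather measure
weather = rng.normal(size=(k, n))  # rain duration, rain amount, hours of sun, ...

def first_stage_f(z, x):
    """F-statistic for regressing x on a single candidate instrument z."""
    r = np.corrcoef(z, x)[0, 1]
    return (len(x) - 2) * r**2 / (1 - r**2)

f_stats = [first_stage_f(z, x) for z in weather]
print(f"median F: {np.median(f_stats):.2f}, best F: {max(f_stats):.2f}")
# The median F sits near its null value (~0.5), but the hand-picked "best"
# formulation can look like a respectable first stage by chance alone.
```

The point is not that any single formulation is dishonest, just that with this many free knobs the reported F-stat overstates the strength of the instrument.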

5 thoughts on “Piranhas in the rain: Why instrumental variables are not as clean as you might have thought”

  1. The idea of all these large “true” causal effects existing might still be reasonable if it is recognized that many of these studies are at best able to identify such effects for only a very specific subset of observations, i.e., the local average treatment effect (the effect for only those observations that are actually influenced by the instrument). Sure, everything will have an effect on some observations in any given population. But such effects will average out once the entire relevant population is considered. It boils down to a problem of generalizability. In that sense, speaking of LATEs is just another way of saying selection bias (but of course, that won’t get you published).

  2. For great evidence on nudges in the real-world policy domain, see here: https://eml.berkeley.edu/~sdellavi/wp/NudgeToScale2020-05-09.pdf

    From the abstract:
    In this paper, we assemble a unique data set including all trials run by two of the largest Nudge Units in the United States, including 126 RCTs covering over 23 million individuals. We compare these trials to a separate sample of nudge trials published in academic journals from two recent meta-analyses. In papers published in academic journals, the average impact of a nudge is very large – an 8.7 percentage point take-up increase over the control. In the Nudge Unit trials, the average impact is still sizable and highly statistically significant, but smaller at 1.4 percentage points. We show that a large share of the gap is accounted for by publication bias, exacerbated by low statistical power, in the sample of published papers; in contrast, the Nudge Unit studies are well-powered, a hallmark of “at scale” interventions.

  3. “It’s like religion. I can believe in my god, but only if I think that none of your gods exist.”

    But religion doesn’t have to be like that. Some believe that all the gods of the big established religions are actually the same, and one can also hold that god exists through belief and that if different people believe in different gods, they exist for them in the same way.

    • Christian:

      Yes, and I was actually thinking about that when writing that sentence, but I didn’t want to get into it there. A religion can allow multiple gods but then it has to include interactions; it can’t be a simple monocausal model of one god doing all the work. Similarly, in social science, effects interact. The point of that criticism of instrumental variables estimates is not that the underlying effects are zero; it’s that even when the effects are there, the IV estimate is not as simple as people have been taught.

  4. I recently developed a new test to determine whether regressors are exogenous, and I applied it to instruments in 2SLS regressions.
    Marvell, Thomas B., Testing for Reverse Causation and Omitted Variable Bias in Regressions (October 17, 2020).
    SSRN.com
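The LATE point in the first comment can be made concrete with a toy simulation (all numbers hypothetical): suppose only 10% of the population are compliers, those whose treatment status the instrument actually moves, and only they have a large effect. The IV estimate recovers the complier effect, not the much smaller population average.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical setup: 10% compliers with effect 2.0, zero effect for the rest.
complier = rng.random(n) < 0.1
z = rng.binomial(1, 0.5, n)                         # instrument
x = np.where(complier, z, rng.binomial(1, 0.5, n))  # only compliers follow z
effect = np.where(complier, 2.0, 0.0)               # heterogeneous effects
y = effect * x + rng.normal(size=n)

ate = effect.mean()                             # population average: ~0.2
late = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]  # IV recovers ~2.0
print(f"ATE ~ {ate:.2f}, IV (LATE) ~ {late:.2f}")
```

So a literature full of large IV estimates can coexist with small population-level effects; the estimates simply describe different, instrument-specific subpopulations.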
