Does regression discontinuity (or, more generally, causal identification + statistical significance) make you gullible?

Yes basically.

This one’s pretty much a perfect example of overfitting, of finding a discontinuity in noise: if you just draw a smooth line through each graph, it actually looks better than the discontinuous version. We see this a lot. There’s no discontinuity in the data, but it’s possible to make one appear by fitting enough things to the two sides of the border.

Oh, but they report robustness tests! Neumann (or Simonsohn) would be amused. I don’t know why they didn’t just go all-in and fit a global 80th-degree polynomial or something like that.
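
To see how easily this can happen, here’s a minimal simulation sketch (nothing from the paper; the cutoff, sample size, and polynomial degree are all invented): fit separate polynomials to pure noise on each side of an arbitrary cutoff and read off the implied jump.

```python
# Sketch: a spurious "discontinuity" from fitting separate polynomials to pure noise.
# Illustrative only; the cutoff, sample size, and polynomial degree are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, cutoff, degree = 200, 0.0, 5

x = np.sort(rng.uniform(-1, 1, n))
y = rng.normal(0, 1, n)  # pure noise: the true jump at the cutoff is exactly zero

left, right = x < cutoff, x >= cutoff
fit_left = np.polynomial.Polynomial.fit(x[left], y[left], degree)
fit_right = np.polynomial.Polynomial.fit(x[right], y[right], degree)

# The two fits are noisiest right at the boundary, which is exactly where the
# "jump" is evaluated, so the estimate can look large even though the truth is 0.
jump = fit_right(cutoff) - fit_left(cutoff)
print(f"estimated jump at the cutoff: {jump:.2f}")

# A single smooth fit through all the data has no jump by construction.
smooth = np.polynomial.Polynomial.fit(x, y, 3)
```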

To get serious for a moment: The problem is not, I think, with these researchers who are doing their best, using the methods that were taught to them. The problem is not even with their teachers and the writers of their textbooks, who I’m sure were doing their best given the understanding they had when they wrote the books. (My own understanding of overfitting, forking paths, causal identification, etc., has changed a lot over the past ten years, so it’s fair enough to expect that of others too.)

I do think there’s a problem with the economics establishment for accepting such silly analyses unquestioningly, just as there was a problem a decade ago with the psychology establishment accepting anything with a randomized experiment and statistical significance as a scientific discovery.

This is not a message that many people want to hear. Or, to say it more carefully, it is a message that many people do accept; it’s precisely the people who don’t accept it who don’t want to hear it.

To put it another way: If you want to make a big claim and convince me that you have evidence for it, I need that trail of breadcrumbs connecting data, model, and theory. Had the above-linked paper not had the discontinuity graph, I’d have no reason to trust its claims at all. But then if you show the trail of breadcrumbs, and it shows no pattern . . . that tells you something too!

At this point you might say that the authors of the above-linked paper are damned if they do, damned if they don’t: If they don’t give a graph, I won’t trust their claims, but then they do give a graph, and I use that graph as evidence not to trust their claims.

But that’s fine! They’d be damned if they didn’t, but they’re not damned if they do. They had what seemed to be a promising result; they then made a graph which reveals how bad their model is (there’s no reason in any world to expect that a polynomial for y given x makes sense here, and it’s super-clear that the discontinuities are an artifact of overfitting) . . . that’s a win! They can simply report that the data are too noisy to learn anything here, and they can move on. For a scientist or applied researcher, that’s a much better result than making a strong claim that doesn’t make sense, is not supported by the data, and is unlikely to replicate.

39 thoughts on “Does regression discontinuity (or, more generally, causal identification + statistical significance) make you gullible?”

  1. So the regression discontinuity theme comes up here a lot. Since there is a real need to examine these discontinuities for policy reasons, I would love to hear about some sensible approaches.

    There is also a lot of technical literature on the subject (including in statistics and economics). I don’t know for sure, but if one searched for “breakpoint analysis”, a monograph might even pop up.

    So: when I look at the second plot, I do clearly see a shift in the mean. What gives?

    • Jukka:

      Regarding the second plot: It is what it is. The data are consistent with a discontinuity and they’re also consistent with no discontinuity. To start with, I recommend looking at it without the distracting curve, which has a big meaningless noisy drop on the left side of the discontinuity. This noisy drop, which makes the discontinuity look so big, is no accident: It’s the product of selection bias. Regressions are typically reported only when statistically significant, and so when it comes to regression discontinuity, we are more likely to see a result when it has been juiced up by some random noise that happened to go in the right direction. We’ve seen this over and over again, most notoriously in the example discussed in this article.
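
      Here’s a minimal sketch of that selection mechanism, with every number invented for illustration: simulate many noisy estimates of an effect that is truly zero, keep only the “statistically significant” ones, and look at what survives.

      ```python
      # Sketch of the significance filter: the true effect is zero, but the estimates
      # we end up seeing (the "significant" ones) are large, in whichever direction
      # the noise happened to push them. All numbers are invented for illustration.
      import numpy as np

      rng = np.random.default_rng(1)
      true_effect, se, n_studies = 0.0, 1.0, 10_000

      estimates = rng.normal(true_effect, se, n_studies)
      significant = np.abs(estimates) > 1.96 * se  # naive two-sided z-test

      print("share significant:", significant.mean())  # around 0.05
      print("mean |estimate| among the significant ones:",
            np.abs(estimates[significant]).mean())  # well above the true value of 0
      ```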

      Regarding the larger question of how to analyze such data: I think the sensible approach is to treat the problem like an observational study. The “running variable” with its discontinuity is only one of many possible pre-treatment variables to adjust for. A big problem with regression discontinuity in textbooks and in practice is that people obsess on how to model the outcome conditional on the running variable, without recognizing the importance of adjusting for other pre-treatment differences between treatment and control groups. It’s an observational study! There are some settings where it’s enough to adjust for just one pre-treatment variable, for example a pre-test in an education experiment. But in the settings where things go wrong, the problem is often that there are other differences not captured by this one variable. And a search for a better functional form won’t solve that problem. To step back and think more generally: yeah, observational studies are hard; we discussed that here, for example. I think that “identification strategies” such as regression discontinuity can cause problems by triggering the credulity of economists who should really know better.
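
      To make the “treat it like an observational study” point concrete, here’s a sketch with simulated data and made-up variable names, not anything from the paper: fit the usual outcome-on-running-variable model, and then the same model with another pre-treatment covariate added.

      ```python
      # Sketch: an RD-style fit (outcome vs. running variable only) next to an
      # observational-study-style fit that also adjusts for another pre-treatment
      # covariate. The data and variable names are simulated and hypothetical.
      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(2)
      n = 1_000
      running = rng.uniform(-1, 1, n)   # e.g., age relative to an eligibility cutoff
      pretest = rng.normal(0, 1, n)     # some other pre-treatment variable
      treated = (running >= 0).astype(int)
      # In this simulation the true treatment effect is zero.
      outcome = 0.5 * running + 1.5 * pretest + rng.normal(0, 1, n)

      df = pd.DataFrame(dict(outcome=outcome, running=running,
                             pretest=pretest, treated=treated))

      rd_only = smf.ols("outcome ~ treated + running + treated:running", data=df).fit()
      adjusted = smf.ols("outcome ~ treated + running + treated:running + pretest",
                         data=df).fit()

      # Here the adjustment mainly tightens the estimate; with real data, pre-treatment
      # imbalance between the two sides can also bias the unadjusted fit.
      print(rd_only.params["treated"], adjusted.params["treated"])
      ```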

      • The original reason for discontinuity analysis was to correct for selection effects, right? For whatever reason, selection changes sharply at a threshold that is supposedly otherwise irrelevant. That’s all well and good, but trying to measure the effect at a sharp boundary is really problematic unless there is a ton of data very close to the dividing line on each side. Where there isn’t, these polynomial interpolations meant to sharpen up the selection boundary are indeed silly.

        But in this case, the data, unadjusted for sample selection, tell a simple story (at least to my eye). In the first graph, health care use begins rising when you’re nearly old enough to vote and keeps rising for those who have been eligible for some time. There is an easy story to tell there which doesn’t depend on a discontinuity at the election itself. In the second graph, the slope clearly drops to near zero for those who can vote. Their discontinuity, eligibility to vote, is intended (I guess) to suggest that the act of voting (rather than worrying about the election) is the source of stress. OK… that’s unconvincing. But the raw, non-RD results tell a story in which there isn’t really any sample selection to worry about.

      • Andrew:

        You’ve criticized regression discontinuities frequently and, IMO, justifiably. But is what your readers see subject to selection bias? I.e., are there cases where regression discontinuity analysis has clearly been successful?

        • What about Lee, David S. (2008), “Randomized experiments from non-random selection in U.S. House elections,” Journal of Econometrics, cited in Angrist and Pischke, estimating incumbency effects?

        • Jim:

          We have an example of regression discontinuity in our book!

          The short answer is that regression discontinuity analyses can work well when the running variable is a strong predictor of the outcome, as with, for example, a pre-test and post-test.
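
          Here’s a small simulated sketch of that favorable case (everything below is invented, not from any real study): the running variable is a pre-test that strongly predicts the post-test, and a local linear fit around the cutoff recovers a known treatment effect reasonably well.

          ```python
          # Sketch of a favorable RD setting: the running variable (a pre-test) strongly
          # predicts the outcome (a post-test). Simulated data with arbitrary numbers.
          import numpy as np

          rng = np.random.default_rng(3)
          n, cutoff, true_effect = 2_000, 0.0, 0.5

          pretest = rng.normal(0, 1, n)
          treated = (pretest >= cutoff).astype(float)
          posttest = 0.9 * pretest + true_effect * treated + rng.normal(0, 0.5, n)

          # Local linear fit within a bandwidth around the cutoff
          h = 0.5
          w = np.abs(pretest - cutoff) < h
          X = np.column_stack([np.ones(w.sum()), treated[w],
                               pretest[w] - cutoff, (pretest[w] - cutoff) * treated[w]])
          beta, *_ = np.linalg.lstsq(X, posttest[w], rcond=None)
          print(f"estimated jump: {beta[1]:.2f} (true effect is {true_effect})")
          ```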

    • When I look at the second plot, I see a logarithm, which makes me curious: the caption says “Log of the …”, so is there any good explanation for why the log was taken? The log ranges from about 4.4 to about 4.7, and it sure wasn’t taken in order to straighten the data.
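
      For scale, a quick back-of-the-envelope check of what that range means on the raw scale (just arithmetic, nothing from the paper):

      ```python
      # A log range of 4.4 to 4.7 corresponds to a ratio of exp(0.3), i.e. roughly a
      # 35% spread in the raw outcome across the whole plot.
      import math
      print(math.exp(4.7) / math.exp(4.4))  # about 1.35
      ```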

  2. > For a scientist or applied researcher, that’s a much better result than making a strong claim that doesn’t make sense, is not supported by the data, and is unlikely to replicate.

    Depends on what you mean by “better”. If you want an academic job, it is much “better” to publish wrong results than to make true discoveries (such as “the data are too noisy to learn anything here”) that will not be accepted for publication.

  3. Fair points. The overfitting with the polynomials is indeed the classic case, as nicely demonstrated in the linked paper.

    But I think the problem is not only about data and functional forms. In industry there are all kinds of monitoring systems, often watching just a single time series. Things like CUSUM and Markov switching models are common there. But in these applications, I guess a “false positive” (i.e., declaring a discontinuity where there is none) does not matter that much, because there is a human in the loop after an alert.
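
    For readers who haven’t seen it, here is a minimal one-sided CUSUM sketch for flagging an upward shift in the mean of a monitored series; the reference value and threshold are arbitrary illustrations, not recommendations.

    ```python
    # Minimal one-sided (upper) tabular CUSUM for detecting an upward mean shift.
    # The reference value k and threshold h are arbitrary; real monitoring systems
    # tune them to a target false-alarm rate.
    import numpy as np

    def cusum_upper(x, mu0, k, h):
        """Return the first index where the upper CUSUM statistic exceeds h, else None."""
        s = 0.0
        for t, xt in enumerate(x):
            s = max(0.0, s + (xt - mu0 - k))
            if s > h:
                return t
        return None

    rng = np.random.default_rng(4)
    series = np.concatenate([rng.normal(0, 1, 200),   # in-control period
                             rng.normal(1, 1, 50)])   # mean shifts up at t = 200
    print("alarm at index:", cusum_upper(series, mu0=0.0, k=0.5, h=5.0))
    ```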

  4. Seems worth noting that this paper isn’t published. I’m not sure what conclusion we should draw from the existence of bad unpublished papers. What is this evidence of, the proposition that there are economists who continue to do bad work? Is there any field where that wouldn’t be true?

    • Ian:

      My problem is not so much with the bad work being done as with Tyler Cowen linking to it entirely unskeptically. Cowen is an influential economist: not influential in the academic sense, but influential on social media. He also often has interesting things to say. So when he makes this mistake without even seeming to realize he’s made it, that’s a problem.

      • Do you think Tyler Cowen gave it his blind promotion because it was a regression discontinuity approach? There are so many people with a mouthpiece who will point to the fact that some paper has one citation, or that some Nobel economist spent three minutes talking about something, and then, completely unironically and with blind faith, trust the institution of science. These are also the loudest people. They’re the “cool” kids translating everything us nerds do, and they don’t give a thought to our own propensity for confusion, mistakes, errors, and nuance. It’s not an economist problem; it’s something else.

        • Reed, Adam:

          I don’t know what Cowen’s reasoning was. I’m guessing that the paper looked legit to him so he believed it. He might not have gotten into the detail on the regression discontinuity; the point is that academic economics, like academic psychology, has certain tricks that are used to elicit credibility. In psychology it’s experimentation: academic psychologists and their customers believe all sorts of ridiculous things that are attached to an experiment. In economics it’s identification strategies. Cowen’s a busy guy and I can’t expect him to carefully read every paper he posts on; the problem is with the field of economics (and other fields too), not specifically with one person inside that field.

      • I agree with that. I find it rather unfortunate that Cowen is so influential among certain people, given how often he does stuff like this. He’s very sloppy. As an actual active research economist, though, I can say that he has no influence on my area and I rarely if ever hear anybody discussing anything he has to say as though he’s some kind of authority.

        • Ian:

          OK, let me rephrase and say that the issue is not so much that Cowen is influential (although he does often have interesting things to say and I enjoy reading his blog) but rather that I think he’s representative of a large cohort of academic economists who take this sort of bad analysis on faith.

    • It’s not a “published” paper, but having the NBER stamp on it provides more legitimacy than just posting to SSRN. This is an issue when people in the profession cite NBER working papers in published research. Case in point, the paper under discussion has already been cited in at least one published paper: https://link.springer.com/article/10.1007/s11127-020-00829-y. And this is not just an issue with this paper. I see NBER working papers cited frequently in published work.

      As an econ grad student who graduated in 2016, I can attest to bad research being done and taught to students. Regarding the topic of what to research, a top labor economist and prominent minimum-wage scholar once told my labor economics class that we should find a data set that had a good instrument. The advice was not to tackle important questions, or questions that interest the researcher, but to find a data set on which one could use various causal identification strategies. Yes, you can use these techniques in appropriate settings, but they seem to have become hammers searching for something to hit.

  5. The thing is, it does look like there is a discontinuity in the data. It’s just not at the chosen cutoff. I think the x axis is months (those are some terribly labeled graphs) before and after legal voting age (which is 20 in Taiwan). Something seems to change 12-18 months before that.

  6. I try to employ techniques for causal identification in many economic contexts, and I think the misused techniques featured on this blog are a biased selection that makes the problem look worse than it truly is. This, sadly, is an anecdotal claim that I don’t have data to back up. My experience with colleagues, with my advisors, and in reading the literature I work in is that many people start by thinking about a data-generating process rather than blindly applying a toolbox of procedures like black-box models. However, the desire to be first and to have a strong story definitely generates a lot of buzzy headlines that bug me (and my advisor, through my whiny emails to him).
    It might be the circle I’m in, but I don’t think the problem is that bad in some economic fields.

    • Xysname:

      Yeah, that’s a bad one too! Good to be reminded that it’s not just economists who do this sort of thing. As with the graphs in the above post, it’s a good thing that they made the plot, and it’s a bad thing that, having made the plot, they didn’t see its obvious problems. I guess I don’t see this CDC example as quite so bad, though, because they’re focusing on the trend within each band rather than on the discontinuity, which is particularly sensitive to artifacts in the curve fitting.

      • Aside from the figure, doesn’t the report also have an identification problem? I mean, there’s no reason to believe the difference is caused by the masks alone.

        • Xysname:

          Sure, but if you look at the report they’re pretty clear that their analysis is descriptive. The title says “Trends,” not “Effects,” and, yes, in the summary they have some causal talk but they’re careful: “Countywide mask mandates appear to have contributed to the mitigation of COVID-19 transmission in mandated counties.” As usual, though, you can’t really talk about the effect of masks, given that other behavioral changes can be happening at the same time.

  7. I have a hard time understanding what the issue is, especially with the second graph. Assuming no other shenanigans, it seems reasonably unlikely to happen by chance. Could you explain the reason to be skeptical?

    • Jonatan:

      These are data from the real world. Nothing is happening by chance, and I have no particular interest in rejecting the hypothesis that the data were produced by a particular random number generator. I do have a big problem with the jump in the fitted line at the discontinuity point, as it appears to be entirely driven by the overfit tail of the fitted curve just to the left of the discontinuity.

      • I agree that there is no jump. But the slope is different on the two sides of the vertical line. Should I not look at that? (I am not very familiar with regression discontinuity designs, and haven’t read this paper.)

        • Jonatan:

          I agree that the curve seems to flatten out. There is an increase and then the increase stops. But I don’t think this fits in with the claims of the article. They’re claiming a jump (see the discontinuity in the fitted curve). A trend stopping is not a jump.
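
          One way to make that distinction explicit, sketched here with simulated data rather than anything from the paper: fit a piecewise-linear model that has both an intercept-jump term and a slope-change term, and see which one the data actually support.

          ```python
          # Sketch: separating a level jump at the cutoff from a change in slope.
          # Simulated data in which the trend flattens at the cutoff but there is no jump.
          import numpy as np

          rng = np.random.default_rng(5)
          n, cutoff = 500, 0.0
          x = rng.uniform(-1, 1, n)
          d = (x >= cutoff).astype(float)
          # True model: slope 1 before the cutoff, slope 0 after, zero jump at the cutoff
          y = 1.0 * x * (1 - d) + rng.normal(0, 0.3, n)

          # y = b0 + b1*(x - c) + b2*d + b3*d*(x - c): b2 is the jump, b3 the slope change
          X = np.column_stack([np.ones(n), x - cutoff, d, d * (x - cutoff)])
          b, *_ = np.linalg.lstsq(X, y, rcond=None)
          print(f"estimated jump b2 = {b[2]:.2f}, slope change b3 = {b[3]:.2f}")
          # Expect b2 near 0 and b3 near -1 here: the trend stops, but there is no jump.
          ```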

  8. Andrew:

    I have been reading your posts on the pitfalls associated with RDD. They are valid points, but I am confused about whether you think an appropriate study would simply acknowledge non-significance (given the convention that statistical significance is a prerequisite for publication), or whether the model fit here is just plain bad.

    If the fit is just bad, what published studies have you read that don’t fall into this trap?

    • Jean:

      I think these models are just plain bad. As I wrote above, I think there are some good discontinuity studies; the key is for the running variable to be a strong predictor of the outcome, for example pre-test and post-test in an education study.

  9. The “damned if you do / don’t” thing reminds me of a student in a class I TA’d. On a test with fill-in-the-blank problems, he filled in the blanks with random words. He was upset that I gave him zeros for all of those problems. He had put a word in the blank! All that effort and no partial credit!
