How to get out of the credulity rut (regression discontinuity edition): Getting beyond whack-a-mole

This one’s buggin me.

We’re in a situation now with forking paths in applied-statistics-being-done-by-economists where we were, about ten years ago, in applied-statistics-being-done-by-psychologists. (I was going to use the terms “econometrics” and “psychometrics” here, but that’s not quite right, because I think these mistakes are mostly being made by applied researchers in economics and psychology, but not so much by actual econometricians and psychometricians.)

It goes like this. There’s a natural experiment, where some people get the treatment or exposure and some people don’t. At this point, you can do an observational study: start by comparing the average outcomes in the treated and control group, then do statistical adjustment for pre-treatment differences between groups. This is all fine. Resulting inferences will be model-dependent, but there’s no way around it. You report your results, recognize your uncertainty, and go forward.

That’s what should happen. Instead, what often happens is that researchers push that big button on their computer labeled REGRESSION DISCONTINUITY ANALYSIS, which does two bad things: First, it points them toward an analysis that focuses obsessively on adjusting for just one pre-treatment variable, often a relatively unimportant variable, while insufficiently adjusting for other differences between treatment and control groups. Second, it leads to an overconfidence borne from the slogan, “causal identification,” which leads researchers, reviewers, and outsiders to think that the analysis has some special truth value.

What we typically have is a noisy, untrustworthy estimate of a causal effect, presented with little to no sense of the statistical challenges of observational research. And, for the usual “garden of forking paths” reason, the result will typically be “statistically significant,” and, for the usual “statistical significance filter” reason, the resulting estimate will be large and newsworthy.

Then the result appears in the news media, often reported entirely uncritically or with minimal caveats (“while it’s too hasty to draw sweeping conclusions on the basis of one study,” etc.).

And then someone points me with alarm to the news report, and I read the study, and sometimes it’s just fine but often it has the major problems listed above. And then I post something on the study, and sometime between then and six months in the future there is a discussion, where most of the commenters agree with me (selection bias!) and some commenters ask some questions such as, But doesn’t the paper have a robustness study? (Yes, but this doesn’t address the real issues because all the studies in the robustness analysis are flawed in a similar way as the original study) and, But regression discontinuity analysis is OK, right? (Sometimes, but ultimately you have to think of such problems as observational studies, and all the RD in the world won’t solve your problem if there are systematic differences between treatment and control groups that are not explained by the forcing variable) and, But didn’t they do a placebo control analysis that found no effect? (Yes, but this doesn’t address the concern that the statistically-significant main finding arose from forking paths, and there are forking paths in the choice of placebo study too, also the difference between statistically significant and non-significant is not itself . . . ok, I guess you know where I’m heading here), and so on.

These questions are ok. I mean, it’s a little exhausting seeing them every time, but it’s good practice for me to give the answers.

No, the problem I see is outside this blog, where journalists and, unfortunately, many economists, have the inclination to accept these analyses as correct by default.

It’s whack-a-mole. What’s happening is that researchers are using a fundamentally flawed statistical approach, and if you look carefully you’ll find the problems, but the specific problem can look different in each case.

With the air-pollution-in-China example, the warning signs were the fifth-degree polynomial (obviously ridiculous from a numerical analysis perspective—Runge is spinning in his grave!—but it took us a few years to explain this to the economics profession) and the city with the 91-year life expectancy (which apparently would’ve been 96 years had it been in the control group). With the air-filters-in-schools example, the warning sign was that there was apparently no difference between treatment and control groups in the raw data; the only way that any result could be obtained was through some questionable analysis. With the unions-and-stock-prices example, uh, yeah, just about everything there was bad, but it got some publicity nonetheless because it told a political story that people wanted to hear. Other examples show other problems. But one problem with whack-a-mole is that the mole keeps popping up in different places. For example, if example #1 teaches you to avoid high-degree polynomials, you might think that example #2 is OK because it uses a straight-line adjustment. But it’s not.

So what’s happening is that, first, we get lost in the details and, second, you get default-credulous economists and economics journalists needing to be convinced, each time, of the problems in each particular robustness study, placebo check, etc.

One thing that all those examples have in common is that if you just look at the RD plot straight, removing all econometric ideology, it’s pretty clear that overfitting is going on:

[RD plots reproduced from the three studies discussed above.]

In every case, the discontinuity jumps out only because it’s been set against an artifactual trend going the other direction. In short: an observed difference close to zero that is magnified into something big by means of a spurious adjustment. It can go the other way too—an overfitted adjustment used to knock out a real difference—but I guess we’d be less likely to see that, as researchers are motivated to find large and statistically significant effects. Again, all things are possible, but it is striking that if you just look at the raw data you don’t see anything: this particular statistical analysis is required to make the gap appear.

And, the true sign of ideological blinders: the authors put these graphs in their own articles without seeing the problems.
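To make the mechanism concrete, here is a small, purely illustrative simulation (my own toy example, not the data from any of the studies above): the outcome is pure noise with no true discontinuity, the raw comparison near the cutoff is near zero, and yet separate fifth-degree polynomial fits on each side can manufacture a sizable jump at the cutoff.

```python
# Illustrative only: no true effect, yet high-degree polynomial "adjustment"
# on each side of the cutoff can produce a large apparent discontinuity.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)          # forcing variable, cutoff at 0
y = rng.normal(0, 1, n)            # outcome: pure noise, true jump = 0

left, right = x < 0, x >= 0

# Raw comparison: difference in means within a narrow window around the cutoff.
window = 0.25
raw_diff = y[(x >= 0) & (x < window)].mean() - y[(x < 0) & (x > -window)].mean()

# "RD" estimate: separate fifth-degree polynomials on each side, evaluated at 0.
poly_left = np.polynomial.Polynomial.fit(x[left], y[left], deg=5)
poly_right = np.polynomial.Polynomial.fit(x[right], y[right], deg=5)
rd_jump = poly_right(0.0) - poly_left(0.0)

print(f"raw difference near the cutoff:          {raw_diff:+.2f}")
print(f"fifth-degree-polynomial 'discontinuity': {rd_jump:+.2f}")
# Across repeated runs the polynomial jump is typically much noisier than the
# raw comparison, so large spurious "effects" appear even with no effect present.
```

Run it with a few different seeds and you will see the polynomial estimate bouncing around far more than the raw comparison does; that extra variability is exactly what forking paths and the statistical significance filter then select on.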

Good design, bad estimate

Let me be clear here. There’s good and bad.

The good is “regression discontinuity,” in the sense of a natural experiment that allows comparison of exposed and control groups, where there is a sharp rule for who gets exposed and who gets the control: That’s great. It gives you causal identification in the sense of not having to worry about selection bias: you know the treatment assignment rule.

The bad is “regression discontinuity,” in the sense of a statistical analysis that focuses on modeling of the forcing variable with no serious struggle with the underlying observational study problem.

So, yes, it’s reasonable that economists, policy analysts, and journalists like to analyze and write about natural experiments: this really can be a good way of learning about the world. But this learning is not automatic. It requires adjustment for systematic differences between exposed and control groups—which cannot in general be done by monkeying with the forcing variable. Monkeying with the forcing variable can, however, facilitate the task of coming up with a statistically significant coefficient on the discontinuity, so there’s that.

But there’s hope

But there’s hope. Why do I say this? Because where we are now in applied economics—well-meaning researchers performing fatally flawed studies, well-meaning economists and journalists amplifying these claims and promoting quick-fix solutions, skeptics needing to do the unpaid work of point-by-point rebuttals and being characterized as “vehement statistics nerds”—this is exactly where psychology was, five or ten years ago.

Remember that ESP study? When it came out, various psychologists popped out to tell us that it was conducted just fine, that it was solid science. It took us years to realize how bad that study was. (And, no, this is not a moral statement, I’m not saying the researcher who did the study was a bad person. I don’t really know anything about him beyond what I’ve read in the press. I’m saying that he is a person who was doing bad science, following the bad-science norms in his field.) Similarly with beauty-and-sex ratio, power pose, that dude who claimed he could predict divorces with 90% accuracy, etc.: each study had its own problems, which had to be patiently explained, over and over again, to scientists as well as to influential figures in the news media. (Indeed, I don’t think the Freakonomics team ever retracted their endorsement of the beauty-and-sex-ratio claim, which was statistically and scientifically ridiculous but fit in well with a popular gender-essentialist view of the world.)

But things are improving. Sure, the himmicanes claim will always be with us—that combination of media exposure, PNAS endorsement, and researcher chutzpah can go a long way—but, if you step away from some narrow but influential precincts such as the Harvard and Princeton psychology departments, NPR, and Ted World HQ, you’ll see something approaching skepticism. More and more researchers and journalists are realizing that randomized experiment plus statistical significance does not necessarily equal scientific discovery, that, in fact, “randomized experiment” can motivate researchers to turn off their brains, “statistical significance” occurs all by itself with forking paths, and the paradigm of routine “scientific discovery” can mislead.

And it’s an encouraging sign that if you criticize a study that happens to have been performed by a psychologist, psychologists and journalists on the web do not immediately pop up with, But what about the robustness study?, or Don’t you know that they have causal identification?, etc. Sure, there are some diehards who will call you a Stasi terrorist because you’re threatening the status quo of backscratching comfort, but it’s my impression that the mainstream of academic psychology recognizes that randomized experiment plus statistical significance does not necessarily equal scientific discovery. They’re no longer taking a published claim as default truth.

My message to economists

Savvy psychologists have realized that just because a paper has a bunch of experiments, each with a statistically significant result, it doesn’t mean we should trust any of the claims in the paper. It took psychologists (and statisticians such as myself) a long time to grasp this. But now we have.

So, to you economists: Make that transition that savvy psychologists have already made. In your case, my advice is, no longer accept a claim by default just because it contains an identification strategy, statistical significance, and robustness checks. Don’t think that a claim should stand, just cos nobody’s pointed out any obvious flaws. And when non-economists do come along and point out some flaws, don’t immediately jump to the defense.

Psychologists have made the conceptual leap: so can you.

My message to journalists

I’ll repeat this from before:

When you see a report of an interesting study, contact the authors and push them with hard questions: not just “Can you elaborate on the importance of this result?” but also “How might this result be criticized?”, “What’s the shakiest thing you’re claiming?”, “Who are the people who won’t be convinced by this paper?”, etc. Ask these questions in a polite way, not in any attempt to shoot the study down—your job, after all, is to promote this sort of work—but rather in the spirit of fuller understanding of the study.

Science journalists have made the conceptual leap: so can you.

P.S. You (an economist, or a journalist, or a general reader) might read all the above and say, Sure, I get your point, robustness studies aren’t what they’re claimed to be, forking paths are a thing, you can’t believe a lot of these claims, etc., BUT . . . air pollution is important! evolutionary psychology is important! power pose could help people! And, if it doesn’t help, at least it won’t hurt much. Same with air filters: who could be against air filters?? To which I reply: Sure, that’s fine. I got no problem with air filters or power pose or whatever (I guess I do have a problem with those beauty-and-sex-ratio claims as they reinforce sexist attitudes, but that’s another story, to be taken up with Freakonomics, not with Vox): If you want to write a news story promoting air filters in schools, or evolutionary psychology, or whatever, go for it: just don’t overstate the evidence you have. In the case of the regression discontinuity analyses, I see the overstatement of evidence as coming from a culture of credulity within academia and journalism, a combination of methodological credulity within academic social science (the idea that identification strategy + statistical significance = discovery until it’s been proved otherwise) and credulity in science reporting (the scientist-as-hero narrative).

P.P.S. I’m not trying to pick on econ here, or on Vox. Economists are like psychologists, and Vox reporters are like science reporters in general: they all care about the truth, they all want to use the best science, and they all want to help people. I sincerely think that if psychologists and science reporters can realize what’s been going on and do better, so can economists and Vox reporters. I know it’s taken me a while (see here and here) to move away from default credulity. It’s not easy, and I respect that.

P.P.P.S. Yes, I know that more important things are going on in the world right now. I just have to make my contributions where I can.

It’s not about the blame

Just to elaborate in one more direction: I’m not saying that all or even most economists and policy journalists make the mistake of considering this sort of study correct by default. I’m saying that enough make this mistake that they keep the bad-science feedback loop going.

To put it another way:

I don’t mind that researchers study the effects of natural experiments. Indeed, I think it’s a good thing that they do this (see section 4 here for more on this point).

And I don’t mind that researchers perform poor statistical analyses. I mean, sure, I wish they didn’t do it, but statistics is hard, and the price we pay for good analyses is bad analyses. To put it another way, I’ve published poor statistical analyses myself. Every time we do an analysis, we should try our best, and that means that sometimes we’re gonna do a bad job. That’s just the way it goes.

What’s supposed to happen next is that if you do a bad analysis, and it’s on a topic that people care about, someone will notice the problems with the analysis, and we can go from there.

That’s the self-correcting nature of science.

But some things get in the way. In particular, if enough people consider published and publicized results as correct by default, then that slows the self-correcting process.

I wrote the above post in an attempt to push the process in a better direction. It’s not that 100% or 50% or even 25% of economists and policy journalists act as if identification strategy + statistical significance = discovery. It’s that X% act this way, and I’d like to reduce X. I have some reason for optimism because it’s my impression that X went down a lot in psychology in the past ten years, so I’m hoping it could now decline in economics. Not all the way down to zero (I fear that PNAS, NPR, and Ted will always be with us), but low enough to end the sustainability of the hype cycle as it currently exists.

The funny thing is, economists are often very skeptical! They just sometimes turn off that skepticism when an identification strategy is in the room.

P.P.P.P.S. To clarify after some discussion in comments: I’m not saying that all or most or even a quarter of RD publications or preprints in economics are bad. I’ve not done any kind of survey. What I think is that economists and policy journalists should avoid the trap of reflexive credulity.

54 thoughts on “How to get out of the credulity rut (regression discontinuity edition): Getting beyond whack-a-mole”

  1. “The bad is “regression discontinuity,” in the sense of a statistical analysis that focuses on modeling of the forcing variable with no serious struggle with the underlying observational study problem.”

    This is not quite right. If the regression discontinuity assumptions are correct *and* the trend modeling on either side of the discontinuity is reasonable, then you really don’t need to “struggle” with confounding. This is a legitimate advantage of the regression discontinuity design. The problems in all these examples arise because there are very few observations right at the discontinuity on either side and the attempt to leverage further-away observations to increase confidence about what’s happening close to the discontinuity depends on strong statistical modeling assumptions and sensitive models. We *could* have a graph for one of these studies where the points did not look like a blob with a vertical line through it at the point of the discontinuity and overfit regression lines overlaid on either side. If we had a graph with a really tight trend on one side of the discontinuity and another really tight trend on the other and the two trends hit the discontinuity line at different points, this could represent truly strong evidence of an effect without any “struggle” with confounding. The principle is closely related to instrumental variables, which also allow you to ignore confounding when the assumptions hold and which I know you also don’t believe in, Andrew.
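    For reference, the identifying assumption being invoked here is the standard continuity condition at the cutoff; in the usual potential-outcomes notation (a textbook formulation, not anything specific to the studies under discussion):

```latex
% Sharp RD: treatment D_i = 1\{X_i \ge c\} for forcing variable X_i and cutoff c.
% If E[Y_i(0) \mid X_i = x] and E[Y_i(1) \mid X_i = x] are continuous at x = c, then
\tau_{\mathrm{RD}}
  = \lim_{x \downarrow c} E[\,Y_i \mid X_i = x\,]
  - \lim_{x \uparrow c} E[\,Y_i \mid X_i = x\,]
  = E[\,Y_i(1) - Y_i(0) \mid X_i = c\,].
```

    Estimating those two one-sided limits from finite data is where the trend modeling, and the trouble in the examples above, comes in.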

    My broader complaint is that criticism of wide misuse of a method should not be conflated with criticism of the method itself.

    • Z:

      When N=31, as it was in this schools study, you definitely need to struggle with confounding. And, in this case, the struggle created the problem! A simple analysis comparing the exposed and control groups shows essentially no difference, but after lots of knobs are turned, something shows up.

      But I do agree with your point that a regression discontinuity analysis is not necessarily bad. I think (hope) we’re in agreement on the larger point here, which is that economists and journalists should not, by default, assume such an analysis was done well. We should be with RD in econ where we are now with experiments in psychology: Yes, experiments (and natural experiments) are great; yes, we can get solid, replicable findings from statistical analyses of experiments (and natural experiments); no, just cos someone writes a paper with an experiment (natural experiment) in it, that doesn’t mean that by default we should assume it’s correct.

    • > If we had a graph with a really tight trend on one side of the discontinuity and another really tight trend on the other and the two trends hit the discontinuity line at different points, this could represent truly strong evidence of an effect without any “struggle” with confounding.

      This is true, as you say, “If the regression discontinuity assumptions are correct”. But there are lots of plausible situations where there are confounders which also correspond with the discontinuity point. Suppose we’re trying to assess the impact of American vs Mexican federal emissions standards controlling for smooth geographic trends. One might do an RDD on degrees north of the Rio Grande, and you might see something very stark. But you still have to think about confounders a little bit or you might miss the fact that there’s a ton of differences that are discontinuous at 0, and so the effect of emissions standards specifically is not well identified.

      • Somebody:

        Exactly. There can be lots of pre-treatment differences between exposed and control groups, and the forcing variable is just one of many, maybe not a particularly important variable. Focus on the forcing variable creates two problems: (a) encouraging researchers not to think seriously about other potential confounders, and (b) allowing degrees of freedom that facilitate finding patterns from noise.

        The point here is not that RD analyses necessarily have these problems, or that RD analyses are necessarily flawed, but rather that they can and often do have these problems, hence it’s a mistake for people to think that just cos a paper does RD, that we should start from a position of believing it. It’s just like with experiments in psychology, where people have started to move away from this default trust.

        • As an economist working at a mid-tier school, it appears to me that the claim of an empirical strategy is examined very closely if you are not at a top school or in the NBER. However, if you are at a top school then you can often get a pass. There is no real anonymous submission in economics, all of the working papers are circulated and you know who the authors are. If I submitted this paper to a good journal it would certainly be rejected. However, if someone from Harvard did, it would have a chance at being accepted. Of course, the Harvard people are much better on average, but they also get a pass.

          I think you are diagnosing a major problem wrong. In economics, it’s very difficult to get published unless you have a natural experiment. This continues to be true even though so many natural experiments have turned out to be bunk in time (see quarter of birth or weather IV studies). If top-flight economics journals allowed the publication of very good observational studies without having to have a natural experiment then we would benefit by having a good alternate source of information and by not forcing people to include a bogus natural experiment or IV to get published.

          The other problem is the non-acceptance of zero results. There is now a literature arguing for cognitive effects of air quality on IQ. This paper should be an important null finding, and they shouldn’t have to do anything but publish the means comparison to get it in a good environmental economics journal. But that would be rejected out of hand because there are no fancy methods and because it’s a null result.

        • Anon:

          What you say all makes sense. Just to clarify: I think that natural experiments are great, and we should take advantage of them when we can. But I agree that we should not limit ourselves to only studying natural experiments. Indeed, many of the questions we want to answer can best be studied using observational studies, and if we only look at data from natural experiments, we’ll only be learning about weird special cases. Just for example, if all our understanding of the effects of money on behavior came from lottery winners, we’d be learning about an unusual group of people (gamblers, with a big oversample of compulsive gamblers) under unusual conditions.

          So, yeah, don’t let my praise of natural experiments be taken as a statement that they have some special status.

          With regard to your last point: yes, this new research is a null finding of the effects of air quality on IQ, but I’d hardly call it an important null finding: It’s a study of something like 30 schools at one point in time in one city. It’s a null result from a low-power, hard-to-generalize study. So, sure, it should be published (as a null finding) somewhere, but I would not call it an important finding.

          The first step, I think, is for the econ establishment and policy journalists to not puff this sort of thing up, as at the very least it just encourages more of the same. I don’t fault the author of the article: I’m sure he’s doing his best, and I guess he didn’t have anyone around to explain what was wrong with his analysis. After all the hype, it becomes harder to step back from the claim. The whole situation is unfortunate, and I just hope things can get a bit more sane in the future; hence my above post.

        • The regression discontinuity design does not assume that there are not other differences between exposed and control groups apart from the forcing variable. If it did, it would actually be nearly useless.

        • > The regression discontinuity design does not assume that there are not other differences between exposed and control groups apart from the forcing variable.

          I know. It assumes that the differences between exposed and control groups vary according to some continuous function of the forcing variable except for the causal effect of interest at the discontinuity, or equivalently that there are no differences between exposed and control groups in some small neighborhood of the discontinuity in the forcing variable. The point of this example is that the discontinuity at the political boundary of the Rio Grande isn’t really just the causal variable of interest (the emissions standards between the two countries) but all differences between U.S. and Mexican governance.

        • And GDP, and industries in which people are employed, and education levels, and water resources, and transportation networks, and patterns of consumption and preferences, and tolerance for tradeoffs between pollution and access to goods or services, and plenty of other things.

          Where RD makes more sense is when someone *imposes* a difference on an otherwise fairly homogeneous group. For example when a policy suddenly applies to one county but not the neighboring county which is closely matched in most variables.

      • Right, that country border example is one where the RD assumptions fail because stuff other than the treatment changes suddenly at the discontinuity. You of course need to think about confounders in this way when assessing whether the RD assumptions hold. But once you determine they hold you wouldn’t need to “struggle” with them any further under the scenario I described. Take the school gas leak example. I think it’s reasonable to assume that there’s nothing special about that circle 5 miles around the gas leak, so the RD assumptions should be pretty solid. But if your sample of schools really close to the circle is small (check) and/or the association with distance from the gas leak is weak (check) then you’re in trouble.

  2. I’m curious to hear what you think of this discussion on ‘econ twitter’ which argues that just looking at the RD graph is an overly strict and inaccurate means of assessing the RD. The kickoff tweet includes a gif illustration that unfortunately I can’t embed here: “Hey RDDers, it turns out that it is very difficult to see an effect visually that is significant at the 5 percent level. Perhaps we should stop using RD plots to make statistical inference….. I made a gif (based on simulated data) to illustrate this point.” https://twitter.com/KiraboJackson/status/1074062192037847040

    • Anon:

      I think the tweet is right that, with enough data, it’s possible for real patterns to show up that might not appear in a poorly-drawn scatterplot. But I do think that plots of data and fitted models are important, both to help us understand the models we’ve fit, and to reveal problems in the fitted models. For example, all three of the plots reproduced in the above post reveal serious problems with the fitted models, as in all cases the apparent discontinuity is an artifact that arose from overfitting. Something being “significant at the 5 percent level” is irrelevant. So I disagree with that aspect of the tweet, the seeming implication that we should believe that a data pattern represents a real “effect” just because there’s some statistical test with p less than 0.05.

      • Andrew said:

        “I think the tweet is right that, with enough data, it’s possible for real patterns to show up that might not appear in a poorly-drawn scatterplot. ”

        Perhaps a clarification would be useful? (caps mine):
        “…with enough data, it’s possible for real patterns to OCCUR that might not BE VISIBLE in a poorly-drawn scatterplot. ”

        I agree with Andrew +10: Regardless of the supposed visibility of the discontinuity, the data should always be plotted and the plot should be thoroughly labelled. That includes the physical or real-world meaning of the discontinuity, the equations of the slope on either side of the discontinuity, and the difference in the slope if it’s barely visible.

  3. In my mind that tweet is misguided. The issue isn’t whether SS at the 5% level has been obtained; it is whether achieving SS at the 5% level is meaningful at all, given the monumental number of researcher degrees of freedom (not least of which is the infinite number of locations one could usually select for the discontinuity).

    • Well, usually the discontinuity is created at a certain place by some kind of policy, so it’s not a free variable in the analysis. At least there’s that.

      But, yes, the statistical significance is probably meaningless. Furthermore, if you want to test significance you should test it vs a proper generating process. For example here I generate 20 graphs using a t5 distributed random number generator, and fit lines to each side of the midpoint of the interval. There is *nothing* going on, but the coefficients are *almost always* very different.

      http://models.street-artists.org/2020/01/09/nothing-to-see-here-move-along-regression-discontinuity-edition/

      If you just look at the intercept here backing it out from extending the lines… page 8 seems to have an intercept difference of 5 or so… page 9 is -6 maybe?

      So the 5% significant difference in intercept would be around a difference of 5 or 6 in absolute value.

      Slope wise, I’m guessing a similar thing, maybe a difference in slope of 10 or something.

      My guess is that the p values people usually calculate are based on a very different generating process than the one they need to check.
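      For anyone who wants to try this at home, here is a minimal sketch in the same spirit (my own assumptions: 100 points of pure Student-t noise with 5 degrees of freedom, straight-line fits on each side of the midpoint; it is not the exact code behind the linked graphs):

```python
# Null simulation: pure t(5) noise, separate straight-line fits on each side
# of an arbitrary midpoint. There is no true discontinuity, but the two
# fitted lines essentially never meet at the midpoint.
import numpy as np

rng = np.random.default_rng(1)

def fake_rd_jump(n=100):
    x = np.linspace(0, 1, n)
    y = rng.standard_t(df=5, size=n)            # nothing going on
    left, right = x < 0.5, x >= 0.5
    bl = np.polyfit(x[left], y[left], deg=1)    # left-side line
    br = np.polyfit(x[right], y[right], deg=1)  # right-side line
    return np.polyval(br, 0.5) - np.polyval(bl, 0.5)  # implied jump at 0.5

jumps = np.array([fake_rd_jump() for _ in range(1000)])
print("sd of spurious jumps:", jumps.std().round(2))
print("share of runs with |jump| > 0.5:", (np.abs(jumps) > 0.5).mean().round(2))
```

      The point is that any significance test for the jump has to be calibrated against this kind of null generating process, not just against the default standard errors.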

      • It’s not entirely free but it’s not exactly fixed either. Even with a geographically defined event there are choices where to place the discontinuity and what to look at. Although, for a good RD design you might argue that the correspondence of the discontinuity to nature makes sense on some well-defined theoretical ground and thus is more or less fixed. Maybe, but not in some of the designs Andrew has reviewed of late.

        Thank you for posting those graphs. They are quite fun to look at! I would not have guessed that they were constructed from pure noise.

        To be fair to others though, Runge’s phenomenon was only taught (at least to me) in numerical analysis. Although I had long forgotten about it, so very much thanks for bringing it up. Is numerical analysis a standard component of a typical stats / econ degree? My initial (largely uninformed) impression is I think no?

  4. I’m an economist who thinks you’re being too kind to economists. You assume that they’re actually trying to uncover truth when what they’re actually trying to do is demonstrate mastery of the latest shiny tool embedded in an intellectual construct. Applied work has moved from one methodological fad to another over the last 40 years (which covers my time as an economist.) The goal of academic economists is, by and large, not to advance economic understanding, but to convince other economists that you’re a smart economist in possession of the most up-to-date tools. For more junior economists, this makes some sense, of course. The applicability of RD methodologies has to be defended after a fashion, of course, but hand waving and some robustness analyses will generally do.

    RD and other natural experiment methodologies are only the latest in this long line of methods which have come and gone: game theory as panacea (which you discussed earlier this week), dynamic stochastic general equilibrium models (don’t ask), Berry-Levinsohn-Pakes models of market structure, and many more besides — they start as genuine innovations, but morph to ways by which a PhD student can get a quantitative novelty to complete a dissertation and are then mistaken for the accumulation of knowledge.

    Like you, I am not describing all economists, but it is z% and my z% is higher than your x%.

    • For my fellow economists who think I’m being too harsh or cynical, ask yourselves about a genuine economic insight demonstrated with a solid OLS analysis or a supposition pulled from the literature combined with an RD regression. Which is more likely to pass muster as a dissertation? To get published? It isn’t close.

        • My own take is that you are not too harsh or cynical (I assure you, in those respects, I am extreme). However, I do think the blame is somewhat misplaced. I think academia has become quite dysfunctional – economists are only the canary (and serve that function well since most – not all – economists have personal characteristics that make them prone to the worst offenses). I am awed by much published economics work these days (the technical sophistication exceeds even the fairly extreme training I had in my program long ago), while at the same time horrified by how difficult it is to use any of that work. Reading a serious paper in economics now would take several weeks of time – and that is even without having the actual data to look at. Yes, the world is complicated, but if it is that complicated then what is the purpose of any research if you want to impact the world? Even a relatively simple question, such as the effects of minimum wages on unemployment, has no simple answer despite an incredible amount of complex research having been thrown at it.

        I fault the over-specialization that academia has bred. We have constructed edifices that follow their own rules (get the right pedigree, publish the right kind of research in the right journals, and then you might get a tenure track job – one of the 10% of openings left – in a respectable school – tenure is a different story even then). For all of that sophistication in the research, whole areas of knowledge are dismissed or ignored without any need to understand or be exposed to them. Economists routinely dismiss most of sociology as irrelevant and inferior. The same with psychology – unless it is the currently popular behavioral thread. And ethics – don’t even think about it.

        But I suspect economists are not really that different than other disciplines. We have all holed up in our separate silos and only answer to our own power structures. Of course there are exceptions – but they can’t keep up with the tides sweeping towards even more specialization and technical sophistication for its own sake (I told you, my skepticism is extreme).

        I had quite a technical economics education, but my work has evolved towards more basic visual quantitative evidence. As Andrew has pointed out many times – understanding measurement, producing a few straightforward graphs of the data, conducting some basic preliminary analysis (absent the derived bells and whistles that get you published in top journals) – these are increasingly the tools I use. Perhaps it is because of my experiences as an expert witness, where I need to make my work comprehensible to non-economists. We can lament the level of understanding of basic concepts in the population, but I think it is healthy to need to explain your work to people that don’t have that shared educational background. I think writing aimed at such understanding would have a hard time getting published. It has to look and sound complicated or it isn’t taken seriously. That isn’t unique to economics, though they may be the worst offenders.

        Where did we go wrong? Why isn’t the priority to say something meaningful and useful? Yes, the world is complex and I am not advocating simplifying it, at the expense of seriously investigating a complex problem. But we do seem to have lost our way. The focus is not on the results, but on the academic game. How else can we ignore the fact that at least 80% of academic position openings have become non tenure track positions? How else can we ignore the fact that most institutions only pay lip service to teaching (yes, they look at course evaluation numbers – not comments – and yes, they check all the boxes that accreditors are looking for, but they don’t take seriously looking at actual course materials, observing teaching more than the obligatory once per year, etc.)?

        • sigh…

          Where are the subversive intellectual conversations though? Does academia just have a lock on methodology of thought in the modern world? Shouldn’t there be places people could do good work for its own sake, or is it just too expensive and time consuming, and we’re all barely scraping by as it is?

        • I believe the fact is that academic (over)specialization has advanced the state of knowledge more rapidly than it would have otherwise – but this only works for the elite. The masses are left behind (as with the majority of students being taught by overworked, underqualified, and underpaid instructors). I see this as intimately tied to the growing inequities in all aspects of life. There is little hope that the average college student, let alone the average citizen, can make any sense out of the vast majority of academic work done in any discipline. We have all but given up on them. And if we focus on the elites, well they are doing better than ever. To a great extent we all buy into this – after all, I want medical advances to proceed as rapidly as possible. But part of that package means that the masses will be relegated to following the algorithm created by one of the elites. So, it is a double-edged (at least) sword. But it is a subject worthy of attention – and one not likely to be solved by statisticians or economists alone (or, perhaps, not with their help at all).

        • I don’t necessarily see it. I mean, it helps those elites sure, they get jobs promoting power pose, stents, selling ineffective Tamiflu by the billion dollar boatload, promoting policy on the basis of Excel errors and other forms of weak sauce…

          Do we really have a lot of progress in the biomedical field? How about in economic policy? Arguably economic policy is an area where literally billions of people suffer needlessly each year. People die of dysentery or opioid overdose because of crappy policy. Is it OK that Economics is not about insight but fancy calculated numbers? (harkening back to Richard Hamming’s opposing view that computation is about insight, not numbers)

        • Yes, we do have a lot of progress in the biomedical field. I am about to get 2 kidney stones removed using technology that would likely not have been developed without the overspecialization I am referring to. At the very same time, we have millions dying from readily preventable diseases that would cost less to address than the development of the technology that will be used on me. I think you are proving my point. These two things are tied together. What we would really like is both the technological advances and the public health policies – both are possible to some extent, although some tradeoffs will likely be required. Solving that problem will involve medicine, economics, and other disciplines.

        • The last thing academia has ever been is subversive or innovative.

          If you want academia to be innovative, people have to be able to get money independent of the approval of their peers.

    • Jonathan:

      You could be right, but my take on these problems is that these researchers think they’re doing the Lord’s work, but they’re immersed in three ideologies that are harming their ability to think clearly as individuals and to act productively as a group. The three ideologies:

      1. The ideology of causal identification: From the recognition that causal identification is a problem, to the development of methods that resolve the problem in certain narrow settings, to the thoughtless and blind use of these methods (sometimes nearly literally blind, as when a researcher doesn’t realize the problem with a method even after displaying a plot that pretty much invalidates the conclusions of the analysis, as in the above examples).

      2. The ideology of statistical significance: Same thing but with random error. From the recognition that noise is a problem, to the development of methods that resolve the problem in certain narrow settings, to the thoughtless and blind use of these methods.

      3. The ideology of not listening to outsiders. Lots of stuff out there about economists thinking they’re special.

      As the saying goes, Notalleconomists. Just enough to keep the merry-go-round turning.

      But, again, psychologists and science journalists made the exact same errors but in the past ten years it seems that they, as a group, have escaped from this trap. So I’m hoping economists and policy journalists can do so too. The only tough thing might be item #3 above, as the don’t-listen-to-outsider attitude seems stronger among economists than psychologists—or at least that’s my view as an outsider to both fields.

  5. This post is so negative it leaves the reader feeling RD should just be taken out behind the barn and killed with an axe.

    Caveats in the post suggest there are some good RDs, but how can we tell when an RD is good? There are so many ways for them to be wrong, we can’t just go down a list to show we haven’t committed any of the myriad possible errors.

    Should the analysis be flipped? Is there a small set of guidelines we can use to identify a good RD? (Like ticking off the five assumptions of OLS.)

    • Terry:

      I do feel we’d all be better off if RD had never been born. That said, RD exists and we have to deal with it; indeed, we have a section on RD in Regression and Other Stories.

      Here’s my take on RD:

      1. We can learn from natural experiments. In a natural experiment, the exposure is assigned externally so we don’t have to worry about selection bias.

      2. Many natural experiments have the form of a discontinuity. If so, there’s a potentially serious concern with lack of overlap between the exposed and control groups: inference will be sensitive to our model for the outcome given the forcing variable, so we should take that aspect of the model seriously.

      3. The forcing variable is not the only game in town. The exposed and control groups can differ in other pre-treatment variables as well. What you have is an observational study, and you should do your best to adjust for differences between exposed and control groups.

      4. As with all datasets (including controlled experiments, natural experiments, and pure observational studies), extrapolation is a challenge. If you just have data from one city in one year, or just from people at a certain threshold, or whatever, then assumptions are needed to draw inferences for the general population. That’s fine–it’s nothing special to RD studies or natural experiments–but we should not let causal identification and internal validity distract us from ever-present questions of external validity.

      5. As with all analyses, we have to be concerned about forking paths and plausible effect sizes. Don’t take statistical significance seriously in an uncontrolled design with unlimited forks. Again, this is nothing special to RD: the same issues arise in controlled experiments and observational studies. And, again, forking paths are a particular issue when data are highly variable and sample size is low.

      6. Again, as with causal inferences in general, I recommend performing the simple comparison and then seeing how that changes as you add adjustments. If the simple comparison shows no difference, and differences only appear when you throw in adjustments . . . that doesn’t imply that your analysis is wrong, necessarily, but it does suggest that you should explain to the readers what these adjustments are doing and why you think they’re a good idea. Saying that you followed the rules is not enough. (A toy sketch of this simple-comparison-first workflow appears at the end of this comment.)

      When does RD work particularly well?

      a. When the forcing variable has a strong logical and empirical connection to the outcome, for example pre-test and post-test scores in an education study, or vote for the Republican party predicting conservative votes in congress. In these sorts of examples, if you adjust for the forcing variable, you don’t need to be so concerned about imbalance or lack of overlap on other pre-treatment variables.

      b. When the threshold is itself random and has no external meaning. This didn’t happen, for example, with that air pollution in China study, where there could be systematic differences between cities north and south of the river.

      c. When you have a large sample size with lots of data near the discontinuity, and the pattern shows up clearly in the data.

      There are also formal assumptions for RD, but I haven’t found them so useful in understanding or explaining the method. I think it should be possible to translate my above recommendations into the formal language. Textbook examples of RD tend to be very clean, but then users sometimes seem to think it’s a general approach, a button they can push to get kosher causal inferences. I’ve heard that various people who’ve done obviously ridiculous RD analyses are genuinely puzzled when they hear that people like Guido Imbens and me don’t believe their results. Even our papers on the topic don’t convince them. They’ve been brainwashed into thinking that causal identification + statistical significance = discovery.

      I guess this little comment should be its own post. Even though (especially though?) I fear it will annoy many economists.
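      Here is a minimal sketch of what I mean by recommendation 6 (simulated data with a known effect, my own toy setup, nothing to do with any particular study): report the simple comparison first, then show how the estimate moves as adjustments are added.

```python
# Recommendation 6 in miniature: simple comparison first, then adjustments.
# Toy data: linear trend in the forcing variable plus a true jump of 0.5.
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(-1, 1, n)                     # forcing variable, cutoff at 0
d = (x >= 0).astype(float)                    # exposure indicator
y = 1.0 * x + 0.5 * d + rng.normal(0, 1, n)   # true effect at the cutoff = 0.5

# Step 1: simple comparison of exposed and control means, no adjustment.
simple = y[d == 1].mean() - y[d == 0].mean()

# Step 2: the same comparison restricted to a window near the cutoff.
w = 0.25
near = np.abs(x) < w
windowed = y[near & (d == 1)].mean() - y[near & (d == 0)].mean()

# Step 3: linear adjustment for the forcing variable, separate slopes per side.
X = np.column_stack([np.ones(n), d, x, d * x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted = coef[1]                            # coefficient on the exposure

print(f"unadjusted difference:        {simple:+.2f}")
print(f"difference within |x| < {w}:  {windowed:+.2f}")
print(f"linear-adjustment estimate:   {adjusted:+.2f}")
# If these numbers tell very different stories, the burden is on the analyst
# to explain what the adjustment is doing and why it should be believed.
```

      The same display works in reverse: if the simple comparison shows nothing and an elaborate adjustment produces a big jump, that is the moment to slow down, not to declare victory.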

      • While you’re at it, how about a list of when RD’s identification fraternal twin, Difference-in-Difference analysis, works well? For every bad RD analysis, I’d guess there are about 3 bad D-in-D analyses. After that, maybe discuss their cousin, propensity score matching. (Note that their big brother, weakly identified IV regression, has already been relegated to the attic.)

        • Jonathan:

          Difference-in-difference is a special case of regression adjustment where the regression coefficient on the “before” variable is fixed at 1. In general I think it’s better to estimate that coefficient, to avoid well known problems of noise and regression to the mean.
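          In toy form (simulated data, my own notation, just to illustrate the equivalence): the classic two-period difference-in-differences number is what you get by regressing the change score on treatment, which is the same as regressing “after” on treatment with the coefficient on “before” fixed at 1; the alternative is to estimate that coefficient.

```python
# Difference-in-differences vs. regression adjustment on toy data.
# DiD fixes the coefficient on "before" at 1; the adjustment estimates it.
import numpy as np

rng = np.random.default_rng(3)
n = 400
treated = rng.integers(0, 2, n)                   # randomized exposure
before = rng.normal(0, 1, n)
# "after" regresses toward the mean (coefficient 0.5, not 1); true effect 0.3.
after = 0.5 * before + 0.3 * treated + rng.normal(0, 1, n)

# DiD: compare change scores across groups (coefficient on "before" fixed at 1).
change = after - before
did = change[treated == 1].mean() - change[treated == 0].mean()

# Regression adjustment: estimate the coefficient on "before" instead.
X = np.column_stack([np.ones(n), treated, before])
coef, *_ = np.linalg.lstsq(X, after, rcond=None)

print(f"difference-in-differences estimate: {did:+.2f}")
print(f"regression-adjustment estimate:     {coef[1]:+.2f}")
# Both are unbiased here because treatment is randomized, but the constrained
# (DiD) version is noisier whenever the true coefficient on "before" is not 1.
```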

      • I guess this little comment should be its own post.

        Sounds like a statistics specialist who writes books about statistics should put an expanded version of this comment in one of their books.

  6. Hi Andrew, greetings from Peru.

    If you have problems with Harvard and Princeton psychology research, you need to come to Peru and cry aloud with me.

    The main problem with statistics is that even the experts don’t have a clue regarding the use of statistics to promote good science.

    But how can you promote good use of statistics for research? Even the most basic concepts are used to produce BS.

    What could you do? I think popular statistics campaigns are an option. If you use the via negativa, like when the government ran campaigns against Zika (don’t leave uncovered water in the street, etc.), I think it’s more effective.

    If you have a reference guide to statistics 101 that could be really helpful.

  7. > Monkeying with the forcing variable can, however, facilitate the task of coming up with a statistically significant coefficient on the discontinuity, so there’s that.

    If you folks think things are bad in “social sciences” economics, you have no idea how bad they are in the “business school” offshoots, like finance and accounting. In the past year, a top accounting journal has retracted an RD paper, while a top finance journal has all-but-retracted another RD paper from the same authors.

    https://retractionwatch.com/2019/01/15/after-more-than-a-year-of-back-and-forth-an-accounting-journal-retracts-a-paper-on-tax-avoidance/

    https://retractionwatch.com/2019/11/13/the-methodology-does-not-generate-the-results-journal-corrects-accounting-study-with-flawed-methods/

    The editors and referees apparently believed these papers were super rigorous because of RD. But the RD was completely wrong: instead of treatment being determined by the forcing variable, the “forcing” variable was determined by treatment.

    Now that’s just a conceptual error, which everybody makes. What was beyond the pale was that the authors had seemingly been notified of this previously and responded by changing the textual description of their methodology to be “correct” after the papers’ acceptance, while leaving the results in the tables unchanged.

    At the accounting journal, the authors admitted that they did not change their methodology, but when asked to provide their data files, they claimed that their laptop had crashed, and the files were no longer available. Star Wars p<0.05: The Gremlins Strike Back!

  8. Andrew,

    Your post is an inaccurate caricature of regression discontinuity designs and their application. I encourage you to add some nuance.

    On the theory: The claim that the forcing variable is “often a relatively unimportant variable” is misleading. In a sharp RD, the treatment is a deterministic function of the forcing variable, so the forcing variable is the only possible source of omitted variable bias in treatment/control comparisons. In a fuzzy RD there are other determinants of treatment, but a similar argument applies: the forcing variable is the only possible source of omitted variables bias in an instrumental variables analysis using the threshold as an instrument for treatment. This is the traditional justification for focusing on adjusting for the forcing variable.

    On applications: The econometrics literature understands the hazards of overfitting in the RD context and has made great progress on this issue. See in particular the work of Cattaneo and coauthors, which provides principled data-driven methods for model selection. Their approach has become the standard in applied research using RD designs. It is of course possible to do RD badly (as with any research design), but it is unfair to use poorly executed examples to smear the entire literature.

    • Simon:

      1. I’m not “smearing the whole literature,” not at all. Any method can be done well or done poorly. Some people have built bridges that collapse; that doesn’t mean that bridge-building is a bad idea. As I wrote above, the problem is not so much that this or that particular analysis is done poorly, but rather that many economists and policy journalists will by default assume that an RD analysis was done well, or will jump to its defense. Raising barriers to criticism makes it easier for bad work to stay afloat.

      2. Regarding the forcing variable: I think there’s been some miscommunication here. When I said that the forcing variable is often relatively unimportant, I’m talking about unimportant as a predictor of the outcome variable. I agree with you 100% that the forcing variable is extremely important as a predictor of the treatment variable. If you have an RD design, you should definitely account for the forcing variable in your analysis, otherwise you can potentially have huge errors: I agree with you. The problems are that (a) there can be imbalance on other variables not explained by the forcing variable (this seems to be the case in the air-pollution-in-China example), and (b) inappropriate adjustment for the forcing variable can introduce artifactual discontinuities (this seems to have occurred in the three examples shown in the above post, and it’s also the topic of my two published papers on RD).

      3. I agree that much of the econometrics literature understands the hazards of overfitting in the RD context and has made great progress on this issue. Unfortunately, this news has not trickled down to the policy journalists at Vox—and I suspect that one reason the policy journalists are not aware of these subtleties is because so many professional economists act as if a published RD is correct by default. Again, it’s my impression that the psychologists have learned better: they’ve learned that bad analyses of experiments are done by psychologists all the time, so they don’t start from that position of belief.

      • >As I wrote above, the problem is not so much that this or that particular analysis is done poorly, but rather that many economists and policy journalists will by default assume that an RD analysis was done well, or will jump to its defense. Raising barriers to criticism makes it easier for bad work to stay afloat.

        Andrew, this is ridiculous. As you’ve highlighted often on this blog, journalists will often by default assume that *any* analysis was done well no matter what the design is, or even if there is a design. RD is not especially unique in this, and I’m not sure why you assert it is. (Except perhaps that the stuff making RD work is somewhat easier for a layman to grasp?)

        I agree with Simon. This post comes off as ignorant and extrapolating from a small sample of poor studies.

        • ???:

          Lots of journalists are savvy enough to be skeptical of a simple observational claim but can be fooled by the implicit authority conveyed by a statistical method. A reporter can know to be skeptical of a simple comparison of treated and untreated schools (after all, the two groups could be systematically different in some way) but then set aside that skepticism if some statistical method such as RD comes up. The RD analysis gets the implicit endorsement of the economics establishment. That’s why I think it’s important for economists to not reflexively come to the defense of these sorts of studies.

          I can’t tell from your comment if you think that the schools study was bad. I think it was a bad study. (This is not a moral comment on the integrity of the study’s author. I assume he was doing his best and following the rules as he saw them. As I noted above, I’ve published some bad analyses myself; it happens.)

          If you don’t think that the schools analysis was bad, then I think you’re confused on the statistics, as it seems to me to be a pretty clear case of overfitting and forking paths, and indeed this is pretty clear from the RD graph in the paper itself.

          I think, though, that you do agree with me that the schools study and the other studies discussed above are bad (I say this because you refer to “a small sample of poor studies”). If so, then I guess you’re disagreeing with me regarding emphasis. I have not done any sort of survey of the economics literature and did not make any attempt to imply that all or even most RD studies in economics are of such low quality. I thought I’d been clear on that, but maybe I wasn’t, so I can add a PPPPS to the above post to clarify.

          Also, I agree with you 100% that RD is not unique in being used to draw poor inferences. My problem is that many economists and journalists will see an RD study and by default accept it. See this comment and this one by economists on this thread for further perspective on this issue.

        • >I have not done any sort of survey of the economics literature and did not make any attempt to imply that all or even most RD studies in economics are of such low quality.

          Then I’m not sure what you mean when you’ve repeatedly put forth the idea in this thread (and this very comment!!) that economists accept RD results uncritically. I completely disagree with the posts you link: The paper Financial Economist references has a history that’s not just “reviewers believed a bad RD,” and Dale Lehman’s comment is complete nonsense. It’s disappointing to see you endorse such simplistic accounts so readily.

          If anything, as Simon points out, there has been a concerted effort to think about how to “honestly” tune the researcher degrees of freedom in RD. (Including your own paper with Guido Imbens!) RD is established enough that reviewers know what to ask for: Balance tests, density tests, and a clear explanation for why your RD doesn’t fail exclusion restrictions.

          I’m like you: I like good analysis, and dislike bad analysis. The difference is that people take you a lot more seriously than a junior scholar like me, and when you engage in hyperbole and armchair psychology the way you do in this thread, people actually believe you. And they will probably read your “RD should never have been born” moreso than your “PPPPS: And some, I assume, are good studies.”

        • ???:

          From the above post:

          It’s not that 100% or 50% or even 25% of economists and policy journalists act as if identification strategy + statistical significance = discovery. It’s that X% act this way, and I’d like to reduce X. I have some reason for optimism because it’s my impression that X went down a lot in psychology in the past ten years, so I’m hoping it could now decline in economics. Not all the way down to zero (I fear that PNAS, NPR, and Ted will always be with us), but low enough to end the sustainability of the hype cycle as it currently exists.

          I’d like to reduce X. And, yes, it’s not just RD. If I felt that statisticians were automatically accepting an analysis just cos it had Bayes in the name, I’d feel the same way!

        • And I actually do wish that RD had never been born. I say this because, in the absence of something called “RD,” I think people would just think of these problems as observational studies and follow standard principles of adjusting for relevant pre-treatment predictors. I’ve seen enough bad RDs to get a sense that researchers can obsess over the forcing variable and not think about the larger observational-study context, in the same way that people focus on the p-value and don’t think about forking paths.

        • OK, I don’t feel like this is getting anywhere, so let me sum my point up succinctly: That paragraph is at odds with the rest of the post, as well as with your subsequent comments on it. The tone of your post and comments is incredibly condescending and reads as you saying that economists are too stupid to know that studies should be done well rather than badly, and will believe a single RD graph. Your post does not “push the process in a better direction,” since your advice is just “believe good studies and don’t believe bad studies.”

        • ???:

          Before going on, I want to thank you for commenting. It’s valuable to get different perspectives. At the very least, different perspectives motivate me to express myself more clearly. Also, of course I make mistakes, so I want this comment section to be a welcoming environment where people can explain where they think I’m wrong.

          Now, to get to your comment:

          I certainly don’t believe economists are too stupid etc. I do think that X% of economists and policy journalists act as if identification strategy + statistical significance = discovery.

          Of course 100% of economists “know that you should do studies well, and not badly.” It’s just that, if a study has an identification strategy and statistical significance, they often seem to default to the assumption that it was done well.

          Think comparatively here! Psychologists aren’t stupid either, but until recently a big chunk of the academic psychology profession seemed to think that if a paper had a bunch of randomized experiments with statistical significance, it was high-quality research. Consider the episode with Bem’s ESP paper. Many academic psychologists had the frame of mind that randomization + statistical significance earned a study the default assumption that it was done well. These psychologists were wrong, and I think enough of them realized it that psychology has changed. This doesn’t mean that they were “too stupid”; they were just trapped in an erroneous pattern of thinking. It happens.

          To say that people make mistakes, sometimes following consistent patterns, is not “condescending.” Rather, I think what would be really condescending would be if I refrained from pointing out these mistakes. I have enough respect for psychologists, and for economists, and for statisticians, and for journalists, to point these things out when I see them.

        • > If I felt that statisticians were automatically accepting an analysis just cos it had Bayes in the name, I’d feel the same way!
          Well X% do (I’ve met some), but then no one has a good sense of what X is :-(

          (I believe part of the problem with Bayes is that there is no widely accepted assessment of the _strength_ of a posterior, so the posterior can naively be taken as a posterior is a posterior is a posterior. Similar to how a prior used to be taken as just a prior is a prior is a prior.)

  9. I often wonder about how varying study quality interacts with publication bias. All of the bad studies discussed on this blog produced “positive” findings. Where are all the poorly done studies with negative results? Are researchers sitting on them? Are they being rejected from journals? Are we simply not seeing them as they don’t attract attention?

    • Ethan said, “Where are all the poorly done studies with negative results? Are researchers sitting on them? Are they being rejected from journals? Are we simply not seeing them as they don’t attract attention?”

      Maybe they’re in the same place as many well done studies with negative results — in the “file drawer” (which maybe gets purged after a few years, so they end up in oblivion?)

    • Ethan:

      Relatedly, I often find it frustrating to discuss these issues with the journalists who write these news articles. The journalist is always thinking about the next story. Just about the only time we ever get good discussions about the problems with reporting is when a journalist is specifically writing about the replication crisis. And this introduces its own bias.

    • Ethan: > varying study quality interacts with publication bias.

      It’s complicated:

      “Results from better quality studies should in some sense be more valid or more accurate than results from other studies, and as a consequence should tend to be distributed differently from results of other studies. To date, however, quality scores have been poor predictors of study results. We discuss possible reasons and remedies for this problem. It appears that ‘quality’ (whatever leads to more valid results) is of fairly high dimension and possibly non‐additive and nonlinear, and that quality dimensions are highly application‐specific and hard to measure from published information.” – https://doi.org/10.1093/biostatistics/2.4.463

  10. Great post!

    Maybe this is stated in your text in a different way (sorry if I missed it): isn’t the problem with the air filters analysis that a linear term was used in the ‘change in scores vs distance’ plot? I see no motivation for why scores should depend on the distance from a gas leak when they claim the gas leak is irrelevant (baseline levels before air filter installation were low). This is a binary treatment state, with or without air filters. I’m just reading about RDD, and the more classic examples, like expenditures post-intervention plotted against an income scale, of course make sense.

    So, to check myself, would you agree the analysis would be completely fine if just a 0th-order (constant) term were used to fit their change-in-test-score data for both the with-air-filter and without-air-filter groups? Of course we can see by eye that the results would not be statistically significant. (A toy version of this constant-vs-linear comparison is sketched after the reply below.)

    • Especially as this linear term allows plenty of schools in the ‘treated’ zone (distance < 4.5 miles) to have negative changes (worse test scores), yet the article claims the treatment has a surprisingly positive effect on test scores! Oof, it hurts a little to think about how much auxiliary work went into supporting this paper and analysis…
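
    To make the constant-versus-linear question concrete, here is a minimal sketch with fabricated data (not the numbers from the air-filter paper): fit each side of a hypothetical 4.5-mile cutoff with either a constant or a linear term and compare the implied jump. The point is only that, with noisy data and a true effect of zero, the two specifications can give noticeably different answers at the cutoff.

    ```python
    # Toy comparison of a constant ("0th order") fit vs. a linear-in-distance fit
    # on each side of a cutoff, using fabricated data with a true effect of zero.
    import numpy as np

    rng = np.random.default_rng(1)

    cutoff = 4.5                                # hypothetical cutoff in miles
    distance = rng.uniform(0, 9, 120)           # forcing variable
    treated = distance < cutoff                 # "treated" schools inside the zone
    change_in_scores = rng.normal(0, 0.2, distance.size)  # pure noise, no effect

    def jump_at_cutoff(degree):
        """Fit a polynomial of the given degree separately on each side and
        return the implied discontinuity at the cutoff."""
        left = np.polyfit(distance[treated], change_in_scores[treated], degree)
        right = np.polyfit(distance[~treated], change_in_scores[~treated], degree)
        return np.polyval(left, cutoff) - np.polyval(right, cutoff)

    print("estimated jump, constant fit:", round(jump_at_cutoff(0), 3))
    print("estimated jump, linear fit:  ", round(jump_at_cutoff(1), 3))
    ```

    Neither estimate is “right”; the exercise just shows how extrapolating linear fits to the cutoff can produce an apparent jump even when nothing is going on.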
