Just another day at the sausage factory . . . It’s just funny how regression discontinuity analyses routinely produce these ridiculous graphs and the authors and journals don’t even seem to notice.

Ethan Steinberg sends in these delightful regression discontinuity graphs:

He reports that the graphs come from a study from the University of Pennsylvania trying to estimate the effect of a care management intervention for high risk patients.

I’m skeptical for the usual reasons. They also make the classic error of comparing the statistical significance of different comparisons: the difference between “significant” and “not significant” is not itself statistically significant.

Nothing special here, just another day at the sausage factory. People are taught this is what to do, it spits out publishable results, everybody’s happy.

P.S. Just to explain some more: Sometimes people frame this as a problem of trying to figure out the correct specification for the running variable: should it be linear, or polynomial, or locally linear, whatever? But I don’t think this is the right way to think of things. I’d say that the original sin of the “regression discontinuity” framing is the idea that there’s some sort of purity of the natural experiment so that the analysis should be performed only conditioning on the running variable. Actually, these are observational studies, and there can be all sorts of differences between exposed and unexposed cases. It’s poor statistical practice to take the existence of a discontinuity and use this to not adjust for other pre-treatment predictors. Once you frame the problem as an observational study, it should be clear that the running variable is just one of many potential adjustment factors. Yes, it can be an important factor because of the lack of overlap between exposed and control groups in that variable. But it’s not the only factor, and it’s a weird circumstance of the way that certain statistical methods have been developed and taught that researchers so often seem to act as if it is.
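
To make that concrete, here’s a minimal simulation sketch. Everything in it is invented for the demo (the covariate w, the effect size tau, the noise levels); it’s not the paper’s analysis. The point is just that including another pre-treatment predictor alongside the running variable doesn’t break anything, and it typically sharpens the estimate:

```python
# Minimal sketch: a regression discontinuity treated as an observational
# study. All names and numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1, 1, n)           # running variable
z = (x > 0).astype(float)           # exposure assigned at the cutoff
w = rng.normal(size=n)              # another pre-treatment predictor
tau = 0.2                           # true effect (assumed for the demo)
y = 0.5 * x + 1.0 * w + tau * z + rng.normal(scale=0.5, size=n)

def ols_se(X, y):
    """OLS coefficients and conventional standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

ones = np.ones(n)
# Adjust only for the running variable:
b1, se1 = ols_se(np.column_stack([ones, x, z]), y)
# Adjust for the running variable AND the other pre-treatment predictor:
b2, se2 = ols_se(np.column_stack([ones, x, w, z]), y)
print(f"tau-hat, x only:  {b1[2]:.3f} (se {se1[2]:.3f})")
print(f"tau-hat, x and w: {b2[3]:.3f} (se {se2[3]:.3f})")
```

Both regressions respect the design (the running variable is in the model); the second one just also uses the other pre-treatment information, the same way you’d adjust for baseline covariates in a randomized experiment.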

P.P.S. A student asked for more detail regarding my concerns with certain regression discontinuity analyses. If you want more of my thinking on the topic, you can google *statmodeling “regression discontinuity”* to see some posts, for example here and here, and you can look at my articles with Imbens and Zelizer. We also discuss the topic in a less critical and more constructive, how-to-do-it perspective in section 21.3 of Regression and Other Stories.

P.P.P.S. Commenter Sam looked at the above-linked article more carefully and reports that their main analysis adjusts for other pre-treatment predictors and also includes alternative specifications. I still think it’s nuts for them to use this quadratic model, and even more nuts to include different curves on the two sides of the boundary—this just seems like noise mining to me—but they did also do analyses that adjusted for other pre-treatment variables. I don’t find those analyses convincing either, but that’s another story. Maybe the general point here is that it takes a lot for this sort of statistical analysis to be convincing, especially when no pattern is apparent in the raw data.

43 thoughts on “Just another day at the sausage factory . . . It’s just funny how regression discontinuity analyses routinely produce these ridiculous graphs and the authors and journals don’t even seem to notice.”

  1. It’s funny that this is listed under the tag Economics, but quite appropriate, I have to admit. People like me, an economist, are interested in posts like these and need the warning they provide.

  2. I know you are not so much into model comparisons/Bayes factors (if I understand correctly), but I’ve always thought these silly curves should be validated out of sample, in some way, against a simpler model without the discontinuity. I have a hard time seeing any curve being justified by these scatterplots, let alone two different curves on either side of the discontinuity.

    I’ve been in talks before where people present higher order polynomial curves as robustness checks for RDD, and that is going in the opposite direction of what I would consider robust.
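
    One rough way to operationalize the commenter’s suggestion, sketched on invented null data (the data-generating process, seed, and function names below are all mine): score one quadratic fit to everything against separate quadratics per side, by cross-validated error. When there is no discontinuity, the two-curve model’s extra flexibility mostly chases noise, so it should usually lose:

    ```python
    # Sketch: out-of-sample comparison of one curve vs. two side-specific
    # curves, on data with NO discontinuity. All choices here are invented.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x = rng.uniform(-1, 1, n)
    y = 0.5 * x + rng.normal(scale=0.5, size=n)   # smooth truth, no jump

    def cv_mse(fit_predict, x, y, k=10):
        """k-fold cross-validated mean squared error."""
        idx = rng.permutation(len(x))
        folds = np.array_split(idx, k)
        errs = []
        for test in folds:
            train = np.setdiff1d(idx, test)
            yhat = fit_predict(x[train], y[train], x[test])
            errs.append(np.mean((y[test] - yhat) ** 2))
        return np.mean(errs)

    def one_curve(xtr, ytr, xte):
        return np.polyval(np.polyfit(xtr, ytr, 2), xte)

    def two_curves(xtr, ytr, xte):
        """Separate quadratics on each side of a cutoff at zero."""
        yhat = np.empty_like(xte)
        left, right = xtr < 0, xtr >= 0
        yhat[xte < 0] = np.polyval(np.polyfit(xtr[left], ytr[left], 2), xte[xte < 0])
        yhat[xte >= 0] = np.polyval(np.polyfit(xtr[right], ytr[right], 2), xte[xte >= 0])
        return yhat

    print("CV MSE, one curve: ", cv_mse(one_curve, x, y))
    print("CV MSE, two curves:", cv_mse(two_curves, x, y))
    ```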

  3. The right way to fit these is with LOESS and a smallish bandwidth. If something is there, it will show up.
    http://models.street-artists.org/2020/07/05/regression-discontinuity-fails-again/

    Fitting two separate curves is totally wack; it basically assumes a difference, which is then guaranteed to show up.

    This blog post has code to generate datasets where nothing is happening and then fit two curves to them, with a PDF of the output:
    http://models.street-artists.org/2020/01/09/nothing-to-see-here-move-along-regression-discontinuity-edition/

    Doing this, you find a difference just about EVERY time.
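
    For anyone who wants the flavor of those posts without clicking through, here is a minimal stand-in for the simulation (my own sketch, not Daniel’s code): pure noise, a quadratic fit on each side, and the “jump” at the cutoff that the procedure manufactures out of nothing:

    ```python
    # Null data, two quadratics, spurious gaps at the cutoff. Everything
    # here (sample sizes, seed) is invented for illustration.
    import numpy as np

    rng = np.random.default_rng(2)
    n_sims, n = 1000, 100
    gaps = []
    for _ in range(n_sims):
        x = rng.uniform(-1, 1, n)
        y = rng.normal(size=n)                 # nothing is happening
        left, right = x < 0, x >= 0
        fit_l = np.polyfit(x[left], y[left], 2)
        fit_r = np.polyfit(x[right], y[right], 2)
        gaps.append(np.polyval(fit_r, 0.0) - np.polyval(fit_l, 0.0))

    gaps = np.abs(gaps)
    print(f"median |gap| at the cutoff: {np.median(gaps):.2f}")
    ```

    The gap comes entirely from extrapolating each quadratic to the boundary, which is exactly where polynomial fits are noisiest.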

    • Daniel,

      There’s a literature on fitting these with local linear fits (of which loess is one approach, not necessarily the best), but as stated in the P.S., I think the big problem is in not trying to adjust for other pre-treatment variables. Yes, it’s absolutely necessary to adjust for the running variable, but there can be lots of other differences between exposed and unexposed groups.

      And, yeah, fitting two separate curves is totally wack . . . except that (a) many respected authorities do that, and (b) a method that’s more likely to spit out a big difference is a method that’s more likely to get you a publication . . . maybe even a Ted talk if you play your cards right! In all seriousness, I expect that very few of the people who do bad regression discontinuity are Pizzagate-style scammers (and, for that matter, I expect that the actual Pizzagate guy was sincere and really did think he was churning out important discoveries), but sincerity doesn’t make the work any better. Honesty and transparency etc.

      What’s stunning to me is not so much that people do bad analyses or that such analyses get published and hyped, as that nobody in the pipeline sees graphs like those reposted above and says, “Hey! What’s going on here???” In that sense, statistical methods serve as a sort of ideology that blinds people to what’s in front of their faces. Which makes me sad and frustrated.

      • Agreed, you should definitely do more adjustments and maybe local estimation with something other than LOESS, but first… just plot the damn thing with the loess curve, and if that doesn’t show anything, think hard about whether you’re just fooling yourself by going farther.
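
        Something like this, say (a sketch on made-up data, using lowess from statsmodels as the smoother; the frac setting and everything else here are just my choices):

        ```python
        # "Just plot the damn thing": scatterplot plus ONE smooth through
        # all the data, ignoring the cutoff. Data below are invented.
        import numpy as np
        import matplotlib.pyplot as plt
        from statsmodels.nonparametric.smoothers_lowess import lowess

        rng = np.random.default_rng(7)
        x = rng.uniform(-1, 1, 300)
        y = 0.5 * x + rng.normal(scale=0.5, size=x.size)  # stand-in data

        smooth = lowess(y, x, frac=0.3)      # smallish bandwidth
        plt.scatter(x, y, s=8, alpha=0.5)
        plt.plot(smooth[:, 0], smooth[:, 1], lw=2)
        plt.axvline(0, ls="--")              # the supposed cutoff
        plt.show()
        ```

        If no jump is visible in a plot like that, that’s the moment to stop and think.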

        And yes, statistics as “discovery factory” is rampant and very troublesome.

        It’s quite disturbing to read Reddit on say r/AskStatistics or r/statistics and see question after question like “Which test can I use to prove my research question is significant?”

        example: https://www.reddit.com/r/statistics/comments/qxw7gc/q_which_test_establishes_causation_between_one/

        I think of this as insight into the current crop of graduate students’ view of what statistics is… it’s not pretty.

        • Faculty and grad students at a local university, discussing possible stats education: what methods should we learn to maximize the probability of getting our papers published?

          Of course the answer is: the approach that will likely fool the most journal reviewers ;-)

        • Antony:

          Yes, it happens with all kinds of graphics, but the regression discontinuity graph is such a clear case because here there is a direct connection between the graph and the model. Many graphs are data graphs and the connection to any fitted model is not so clear.

  4. I have a treasured dataset of Gaussian noise added to a concave-upward quadratic. If you take only the central third of the points and fit a quadratic to them, you get a concave-downward quadratic. If you fit to all the points, you get a concave-upward curve that is close to the true underlying quadratic.

    I didn’t even have to regenerate the dataset very many times to get this behavior.

    In a way, this kind of discontinuity “analysis” could be considered an example of the “law of small numbers” Andrew wrote about a few posts ago. Each side of the cutoff is erroneously treated as a good representative of its own subset of the data.
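
    A hedged reconstruction of this experiment (the commenter’s actual dataset is unknown, so every number below is invented):

    ```python
    # Concave-up quadratic plus noise; fit the central third vs. everything.
    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(-3, 3, 90)
    y = 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)  # concave-up truth

    mid = np.abs(x) < 1                        # central third of the range
    full_fit = np.polyfit(x, y, 2)
    mid_fit = np.polyfit(x[mid], y[mid], 2)
    print(f"quadratic coefficient, all points:    {full_fit[0]:+.3f}")
    print(f"quadratic coefficient, central third: {mid_fit[0]:+.3f}")
    ```

    Rerun it with a few different seeds and the central-third coefficient flips sign fairly often: over a narrow range, the curvature of the signal is small relative to the noise.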

  5. It seems wrong to say that the forcing variable is “just another” pretreatment variable we can control for. It assigns treatment randomly for compliers near the threshold; that is the whole idea. If the sample is small, of course you get ridiculous noisy results like those above, but the underlying premise relies on the forcing variable. Adjusting for covariates is nice, but with a large sample and a discontinuity that “makes sense” (i.e., is a correct application of the method), the covariates by definition should balance across the threshold.

    Maybe the way economists need to hear it is that their results are very asymptotics-driven, and in a semi-parametric case those results are less reliable unless you have a lot of data (more than in a standard parametric case).

    “I’d say that the original sin of the “regression discontinuity” framing is the idea that there’s some sort of purity of the natural experiment so that the analysis should be performed only conditioning on the running variable.”

    Yes, when the model’s assumptions are met this will be true. Applying RDD to seventeen cities near a river? Not so much.

    • Yes, that was my previous understanding as well. I thought that as long as there are no discontinuities in the potential outcomes at the threshold, the estimated difference at the threshold should be unbiased, although controlling for other covariates may increase precision.

      I realize that there is rarely enough data near the threshold, which I thought was the reason for using local linear regression to estimate the left and right limits as you approach the threshold. Imbens and Kalyanaraman (2012) developed an MSE-optimal method for selecting the bandwidth that allows some bias in order to gain precision.

      There are also some “robustness” checks that I’ve seen recommended, including varying the bandwidth and testing pseudo-thresholds at points other than the actual threshold.
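
      Here’s a minimal sketch of both ideas on simulated data, with the assumptions flagged: the bandwidth h is hand-picked (not the Imbens-Kalyanaraman choice, which takes more machinery), the cutoff is known, and the data-generating process is invented. Local linear fits with a triangular kernel on each side give the jump estimate; pseudo-thresholds give a crude placebo check:

      ```python
      # Local linear RD estimate plus placebo cutoffs. All numbers invented.
      import numpy as np

      def side_limit(x, y, c, h, right):
          """Weighted linear fit on one side of c; returns its value at c."""
          m = (x >= c) if right else (x < c)
          u = x[m] - c
          w = np.clip(1 - np.abs(u) / h, 0, None)   # triangular kernel
          X = np.column_stack([np.ones(u.size), u])
          beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y[m]))
          return beta[0]                             # intercept = limit at c

      def jump(x, y, c, h):
          return side_limit(x, y, c, h, True) - side_limit(x, y, c, h, False)

      rng = np.random.default_rng(4)
      x = rng.uniform(-1, 1, 500)
      y = 0.4 * x + 0.3 * (x >= 0) + rng.normal(scale=0.5, size=x.size)

      h = 0.3
      print(f"jump at the real cutoff 0: {jump(x, y, 0.0, h):+.2f} (truth: 0.30)")
      for c in (-0.6, -0.4, 0.4, 0.6):               # pseudo-thresholds
          print(f"jump at fake cutoff {c:+.1f}:  {jump(x, y, c, h):+.2f}")
      ```

      The placebo estimates should hover around zero; if fake cutoffs produce “jumps” as big as the real one, that is a bad sign.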

      I’ve read so many articles on regression discontinuity but I’m still really confused about how (and if) to use one in practice. I’ve read the relevant chapters in Gelman and Hill 2007, but I’ll have to check out the new book as well.

  6. Andrew, of your two obsessions (RDD and Hoover), this is your healthiest. Easily my favorite posts.

    Trained in economics here. The incentive structure to generate this nonsense in the research context is very powerful. If I produced this at my job, where the results interact with the real world (and not the fake world of “get me a TED talk” or a podcast appearance), I would get laughed out of the room. This sort of obviously bad work could even be enough to affect an annual review.

    Other academics really need to step it up. I feel the same way about the argument that “there are only a few bad apples” RE: American police. Prove it! It strikes me that tolerating hacks makes you a hack.

    At least something good comes out of these silly “studies” — these blog posts.

  7. Even if this were truly a randomized experiment, shouldn’t you still control for pre-treatment differences?

    I guess I’m asking whether you advocate a general rule of adjusting for differences between the comparison populations, with the study design just determining how likely and how large those differences may be?

  8. A common factor in implausible RDs is that there’s a dip in the regression function to the left of the discontinuity that looks quite artificial and is likely a result of nonparametric regressions going hog-wild.
    Seems like bounds on the 2nd derivative [ https://arxiv.org/abs/1606.04086 , https://arxiv.org/abs/1705.01677 ] or something even more stringent like monotonicity [ https://arxiv.org/abs/2011.14216 ] would be useful here [@AG – curious about your thoughts on these].
    Not sure why these methods aren’t getting picked up by researchers. I know the answer is that they’re not plug-and-play in Stata, but the problem is sufficiently serious that somebody should figure out how to solve the conic programming problem in Imbens and Wager, for example.
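
    Monotonicity is the easiest of those constraints to demo. A toy stand-in (isotonic regression from scikit-learn; the cited papers do something more sophisticated with honest confidence intervals, and the data below are invented):

    ```python
    # Monotone fits on each side of the cutoff: a shape-constrained fit
    # cannot manufacture an artificial dip just left of the discontinuity
    # the way an unconstrained nonparametric fit can.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(5)
    x = np.sort(rng.uniform(-1, 1, 300))
    y = x + 0.25 * (x >= 0) + rng.normal(scale=0.3, size=x.size)

    left, right = x < 0, x >= 0
    iso = IsotonicRegression(increasing=True)
    fit_left = iso.fit_transform(x[left], y[left])     # monotone, left side
    fit_right = iso.fit_transform(x[right], y[right])  # monotone, right side
    print(f"jump at cutoff, roughly: {fit_right[0] - fit_left[-1]:.2f}")
    ```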

  9. Is there a method to determine the “significance” of a discontinuity?

    I’m envisioning something like fitting a single curve over all the data and scoring its fit, then fitting two curves, one on each side of the hypothetical discontinuity, and scoring their fit, and then comparing these scores to figure out whether the discontinuity is more than an artifact of random chance.

    The hope is that, with a method like that, a lot of these discontinuity findings could be invalidated with little effort.
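
    That comparison can be run directly. A minimal sketch on invented data with no true jump, scoring the one-curve model against the two-curve model with the standard F-test for nested regressions (the one-curve model is a special case of the two-curve one):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n = 200
    x = rng.uniform(-1, 1, n)
    y = 0.5 * x + rng.normal(scale=0.5, size=n)   # no true discontinuity

    def rss(X, y):
        """Residual sum of squares from an OLS fit."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return r @ r

    one = np.column_stack([np.ones(n), x, x**2])        # one quadratic
    z = (x >= 0).astype(float)
    two = np.column_stack([one, z, z * x, z * x**2])    # per-side terms

    rss1, rss2 = rss(one, y), rss(two, y)
    df1, df2 = two.shape[1] - one.shape[1], n - two.shape[1]
    F = ((rss1 - rss2) / df1) / (rss2 / df2)
    print(f"F = {F:.2f}, p = {stats.f.sf(F, df1, df2):.3f}")  # null-ish here
    ```

    With real data you’d still worry about misspecification of both models, but at least this scores whether the two-curve fit earns its extra parameters, which is more than the published graphs seem to have been asked to do.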

  10. I am sorry for taking your time, but we had a discussion on Twitter about the origin of “another day at the sausage factory” and found several plausible sources. Could you clarify where this phrase comes from?

    With best wishes.

    • German:

      I have no idea where the phrase comes from! I didn’t even know it was a phrase, at least I don’t think so. It derives from the saying that you don’t want to see sausage or legislation being made . . . ummm, let’s google *sausage legislation* . . . here’s Quote Investigator, which is always my favorite source for this sort of thing. They cite Fred Shapiro, who dug up the earliest known version: “The Daily Cleveland Herald, March 29, 1869, quoted lawyer-poet John Godfrey Saxe that ‘Laws, like sausages, cease to inspire respect in proportion as we know how they are made,’ and this may be the true origin of the saying.”

      As to the exact phrase, “Just another day at the sausage factory”: maybe I read it somewhere and it lodged in my unconscious? A quick google turns it up in various places, for example this news article by Steve Lopez in the Los Angeles Times. So my guess is that it’s just a natural formulation that has been independently coined many times, derived from the well known saying about sausage and legislation.

      • The 1869 citation is interesting, but later than I would have guessed. Real or feigned disgust at the contents of sausage has always been with us, as it is with the content of my personal favorite, scrapple. I always thought the “sausage factory” referred to Upton Sinclair’s (1906) The Jungle.

  11. I’ll just leave my usual comment on this genre of post. Sometimes regression discontinuity assumptions are justified, sometimes they’re not. Sometimes instrumental variables assumptions are justified, sometimes they’re not. Arguing that they never are because they’re observational studies (and because they’re often abused) seems to betray a lack of understanding.

    • Anon:

      I’ll just leave my usual response to this genre of comment and remind you that I never said or argued that regression discontinuity assumptions and instrumental variables assumptions are never justified. Indeed, we discuss both these methods in chapter 21 of Regression and Other Stories.

      • “I’d say that the original sin of the “regression discontinuity” framing is the idea that there’s some sort of purity of the natural experiment so that the analysis should be performed only conditioning on the running variable. Actually, these are observational studies, and there can be all sorts of differences between exposed and unexposed cases.”

        Actually, if the assumptions of the design are met, then in the neighborhood of the discontinuity pretreatment variables do not vary substantially. Your statement logically implies that the assumptions don’t matter because it’s an observational study. Many other statements from past posts (about both RDD and IV) have the same implication. The plots you show have nothing to do with violations of RDD assumptions, though. Even if the assumptions were always meaningfully violated, these plots would only illustrate an orthogonal issue, which is overinterpretation of noise in the trend of the outcome as the forcing variable approaches the discontinuity. I do grant that the plots are effective propaganda against identification assumptions writ large because they are so evidently silly even to laymen.

      • >Indeed, we discuss both these methods in chapter 21 of Regression and Other Stories.

        I suspect Jennifer Hill cringes at these posts, but of course I can’t be certain.

    • I take it back, I don’t think there’s a lack of understanding, just a deep and seemingly willful lack of caring about rigor on these topics. I wonder if this is somehow based on an impression that econometrics types position themselves as like gatekeepers of rigor but then so much junk research gets produced with their ‘rigorous’ methods so screw rigor when talking about their methods? (You could say “it’s just a blog, I don’t need to be rigorous”, but rigor is orthogonal to formality.) And now I’ve typed ‘rigor’ so many times it has lost all meaning.

      • Anon:

        None of the statistical models we use are exact. They’re all approximations. I use normal distributions and logistic regressions all the time even though they’re not true. Similarly, sure, people use regression discontinuity analysis and all sorts of other methods even though their assumptions are never really true—and, even if they were, other assumptions of these models would be violated. I work with survey data even though nonresponse rates are over 90%. In data analysis, rigor can play an important role in guiding the development and interpretation of methods, and in providing a lower bound on uncertainty, as we can see in the simple case of the sqrt(p(1-p)/n) standard error for a proportion, which is valid in the case of a simple random sample and is typically a lower bound in the realistic case of nonsampling errors.
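
        (To make that lower bound concrete: with p-hat = 0.5 and n = 1000, sqrt(0.5 × 0.5 / 1000) ≈ 0.016, so about ±1.6 percentage points of uncertainty even before any nonsampling error enters.)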

        To get back to regression discontinuity: In these sorts of problems, as usual with causal inference, it is helpful to adjust for differences between treatment and control groups (or, it is sometimes helpful to say, exposed and unexposed groups). The groups can typically differ in many ways, not just the running variable. I agree with you that under certain conditions you can perform an analysis on a narrow band of cases that are very close to the discontinuity cutoff, but this is rarely what we see, for the simple reason that there won’t be tons of data very close to the cutoff. So what we’re really saying is that there are some assumptions that are kinda close to being valid and we can fit a model full of assumptions that are kinda reasonable, then we can do some theory to get robust standard errors or whatever . . . you can call this rigorous if you want, but to me it’s no more rigorous than setting up a predictive model using all pre-treatment information and going from there.

        I don’t know if this helps, but . . . even in a true randomized experiment, I’ll still want to adjust for pre-treatment variables, as this can give me more accurate inferences, especially for subpopulations of interest. This is the same reason we would perform regression adjustment, MRP, etc., with surveys even if we were not concerned with differential nonresponse.

        To put it another way: rigor is fine, and there’s nothing unrigorous about adjusting for other pre-treatment variables in a discontinuity study. There’s no rule of rigor that disallows this adjustment, and no rule of rigor stating that an adjusted analysis would be worse than adjusting only for the running variable. Meanwhile, in the real world, people are doing really bad analyses, and I think part of the reason for this is that they have it in their heads that when there’s a discontinuity, it’s the right thing to only adjust for the running variable.

        • Under the assumptions of an RDD (or IV), you’re not just adjusting for the running variable (or IV). You’re also adjusting for confounding by all the other variables (observed or unobserved) that differ between treated and untreated groups. My understanding is that theory is lacking for RDD analyses that adjust for observed confounders that may jump at the cutoff while retaining the desirable property of adjusting for unobserved confounders that do not jump substantially at the cutoff. But I would think that if observed confounders did jump at the cutoff, you’d also worry that unobserved confounders jump at the cutoff and not do an RDD analysis. Typically, the reason a good researcher would do an RDD is that they think there’s unobserved confounding but the running variable cutoff is totally arbitrary (usually part of some human designed decision process totally disconnected from the mechanisms that determine the outcome) so no confounders (observed or unobserved) would jump at it. Again, the reason to do an RDD is not to adjust for the running variable, it’s to adjust for all variables, even ones you don’t see. Your preferred method of throwing observed covariates and treatment into a regression model can’t do this.

        • Anon:

          I’m suggesting including the running variable in the model as well as other predictors. Just as when I have a completely randomized experiment, I’ll still adjust for pre-treatment variables. In either case, including other predictors does not cause the estimate to lose the good properties that arise from the design.

        • Actually, if you do a regression where the running variable is just one of the predictors, as opposed to an actual RDD analysis, you do lose the good properties of the design. The key idea is that only right near the discontinuity might it be reasonable to believe (again, assuming your discontinuity isn’t something absurd like a river) that unobserved confounders are similar between treated and untreated. If you just do your regression, you’re comparing outcomes far from the discontinuity, and those comparisons are likely confounded (even though you threw what *observed* confounders you had into the regression). Hope this clears things up.

        • Anon:

          No, I’m saying that whatever analysis you were going to do, you also include other key pre-treatment variables. So if you were planning to do an analysis right near the discontinuity, I think you should also include these other variables. Or if you were planning to use more data and fit a locally linear monotonic curve or whatever, then do that and also include other pre-treatment predictors. The act of adding other predictors does not destroy the properties that arise from the design. Again, though, this discussion is kinda theoretical, given that in practice you typically won’t have tons of data “right near the discontinuity” (however that is defined). It’s modeling assumptions all the way down, as discussed in my comment above.

        • I’ve already said that theory for adjusting for observed covariates in a RDD is lacking to my knowledge. I agree that it would be a nice little improvement similar to adjusting for baseline covariates in an RCT, as you say. If the assumptions of the design are reasonable, either the adjusted or unadjusted version provides good evidence, but the adjusted would be nice to have. Glad you seem to agree that the lack of this extra bit of adjustment is not an “original sin” (as you put it earlier) invalidating the method, just as, using your analogy, failing to adjust for observed pre-baseline covariates is not an original sin invalidating RCTs.

        • Anon:

          In my post, I wrote, “the original sin of the ‘regression discontinuity’ framing is the idea that there’s some sort of purity of the natural experiment so that the analysis should be performed only conditioning on the running variable.”

          I guess I could simplify to: “the original sin of the ‘regression discontinuity’ framing is the idea that there’s some sort of purity of the natural experiment so that the user can forget the basic principles of observational studies.” If, for the reasons you discuss, you’re in a setting where you have good reason to think there should be balance between treated and control groups, then that can be fine: you are following the basic principles of observational studies.

        • >If, for the reasons you discuss, you’re in a setting where you have good reason to think there should be balance between treated and control groups, then that can be fine: you are following the basic principles of observational studies.

          I’ll take it!

  12. It seems to me that the first rule of data analysis should be “if you can’t see it when you squint at the graph, it’s not there.” Does that sound like a reasonable rule of thumb?

    My reasoning is that, even if you have so much data that the discontinuity or correlation or whatever can’t possibly be “random chance,” if you can’t see it just by looking at the data then there’s so much unaccounted-for and uncontrolled variation that could _also_ explain the stats that you can’t say anything meaningful anyway.

  13. what would a good RD plot look like? i don’t think there is anything prima facie wrong with these plots. without reading the paper, it’s not clear whether they adjusted for other covariates or whether the result is robust across many specifications

    • I just checked and the paper

      1. It does control for all baseline observables in its main estimates:

      “Xit is a vector of control variables that includes all demographics displayed in Table 1 (age, gender, enrolled month, plan choice (HMO vs. PPO), and whether the plan has pharmacy coverage) as well as the risk score for each member i at wave t.”

      2. It checks that results are robust to various other functional forms for the RDD:

      “We address functional form concerns by estimating local linear regressions and third degree (cubic) local-polynomial regressions. We also estimate regressions that drop the triangular kernel weight, which in essence decreases the contribution of beneficiaries close to the threshold relative to those farther away. We also conduct several robustness checks that increase the sample size, by including individuals farther away from the cutoff, and consider three alternative bandwidth choices – 150, 200, and 250 ranks from the threshold on both sides. Lastly, we estimate regressions utilizing the bandwidth selection procedure by Imbens and Kalyanaraman (2012), and Calonico et al. (2014). In all cases, the results do not change the conclusions from our main results. We provide estimates for all robustness checks in Appendix Tables 1–14 in Supplementary material.”

      so it sounds like, contrary to the tone of this post and the comments (“lol @ economists for producing a funny-looking graph”), the actual regression analysis was done along the lines you suggest

      i wonder what your suggestion is. should people stop reporting raw RD plots, and instead include Frisch-Waugh-style plots of residuals against the running variable? my sense is that people like seeing the “raw” plot to see that the discontinuity is there without any controls. is that wrong? is there some statistical principle by which a “goofy” raw plot without controls tells us something is wrong, even if the analysis used controls and is robust to functional form?

      • Sam:

        You quote from the paper: “In all cases, the results do not change the conclusions from our main results. We provide estimates for all robustness checks in Appendix Tables 1–14 in Supplementary material.”

        I looked at Appendix Tables 1 and 2 (results from local linear regression discontinuity) and the coefficients were all over the map. Some were positive, some were negative, approximately 1 in 20 were statistically significant at the 5% level . . . basically no evidence of anything. That does seem consistent with the graphs reproduced above, but not with the claims in the abstract of the paper. So I disagree with your statement that their analysis “is robust to functional form.”

        • Ha! I should have checked the appendix instead of taking their word for it. I guess there’s a lesson in there. Thanks for checking :)

    • Sam:

      You write, “what would a good RD plot look like? i don’t think there is anything prima facie wrong with these plots.”

      Those plots are great! They reveal how silly the discontinuity analysis is. The problem is not the plots, it’s with the authors, reviewers, and editors of the paper who saw these plots and didn’t notice the problem.
