We had an interesting discussion the other day regarding a regression discontinuity disaster.

In my post I shone a light on this fitted model:

Most of the commenters seemed to understand the concern with these graphs: the upward slopes in the curves directly contribute to the estimated negative value at the discontinuity, leading to a model that doesn’t seem to make sense. But I did get an interesting push-back that is worth discussing further. Commenter Sam wrote:

You criticize the authors for using polynomials. Here is something you yourself wrote with Guido Imbens on the topic of using polynomials in RD designs:

“We argue that estimators for causal effects based on such methods can be misleading, and we recommend researchers do not use them, and instead use estimators based on local linear or quadratic polynomials or other smooth functions.”

From p.15 of the paper:

“We implement the RDD using two approaches: the global polynomial regression and the local linear regression”

They show that their results are similar in either specification.

The commenter made the seemingly reasonable point that, since the authors actually did use the model that Guido and I recommended, and it gave the same results as what they found under the controversial model, what was my problem?

**What if?**

To put it another way, what if the authors had done the exact same analyses but reported them differently, as follows:

– Instead of presenting the piecewise quadratic model as the main result and the local linear model as a side study, they could’ve reversed the order and presented the local linear model as their main result.

– Instead of graphing the fitted discontinuity curve, which looks so bad (see graphs above), they could’ve just presented their fitted model in tabular form. After all, if the method is solid, who needs the graph?

Here’s my reply.

First, I do think the local linear model is a better choice in this example than the global piecewise quadratic. There are cases where a global model makes a lot of sense (for example in pre/post-test situations such as predicting election outcomes given previous election outcomes), but not in this case, when there’s no clear connection at all between percentage vote for a union and some complicated measures of stock prices. So, yeah, I’d say ditch the global piecewise quadratic model, don’t even include it in a robustness check unless the damn referees make you do it and you don’t feel like struggling with the journal review process.

Second, had the researchers simply fit the local linear model without the graph, *I wouldn’t have trusted their results*.

Not showing the graph doesn’t make the problem go away, it just hides the problem. It would be like turning off the oil light on your car so that there’s one less thing for you to be concerned about.

This is a point that the commenter didn’t seem to realize: The graph is not just a pleasant illustration of the fitted model, not just some sort of convention in displaying regression discontinuities. The graph is central to the modeling process.

One challenge with regression discontinuity modeling (indeed, applied statistical modeling more generally) as it is commonly practiced is that it is unregularized (with coefficients estimated using some variant of least squares) and uncontrolled (lots of researcher degrees of freedom in fitting the model). In a setting where there’s no compelling theoretical or empirical reason to trust the model, it’s *absolutely essential* to plot the fitted model against the data and see if it makes sense.
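To make this concrete, here is a minimal sketch of what "plot the fitted model against the data" could look like in code. It is not the analysis from the paper under discussion: the data are simulated, the function names are made up for illustration, and the local linear fit uses a plain rectangular window with a hand-picked bandwidth (an assumption; real applications often use a triangular kernel and a data-driven bandwidth).

```python
import numpy as np


def rd_jump_global_quadratic(x, y, cutoff):
    """Jump at the cutoff from separate quadratics fit to ALL data on each side."""
    left, right = x < cutoff, x >= cutoff
    fit_left = np.polyfit(x[left] - cutoff, y[left], deg=2)
    fit_right = np.polyfit(x[right] - cutoff, y[right], deg=2)
    # x is centered at the cutoff, so the constant term is the fitted value there.
    return fit_right[-1] - fit_left[-1]


def rd_jump_local_linear(x, y, cutoff, bandwidth):
    """Jump at the cutoff from straight lines fit only within `bandwidth` of it."""
    left = (x < cutoff) & (x >= cutoff - bandwidth)
    right = (x >= cutoff) & (x <= cutoff + bandwidth)
    fit_left = np.polyfit(x[left] - cutoff, y[left], deg=1)
    fit_right = np.polyfit(x[right] - cutoff, y[right], deg=1)
    return fit_right[-1] - fit_left[-1]


def binned_means(x, y, n_bins=20):
    """Average y within equal-width bins of x, to overlay on the raw scatter."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.array(
        [y[idx == b].mean() if np.any(idx == b) else np.nan for b in range(n_bins)]
    )
    return centers, means


def plot_rd(x, y, cutoff, bandwidth):
    """Draw raw data, binned means, and the local linear fit on each side."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.scatter(x, y, s=5, alpha=0.3, color="gray")
    centers, means = binned_means(x, y)
    ax.plot(centers, means, "o", color="black")
    for side in (x < cutoff, x >= cutoff):
        window = side & (np.abs(x - cutoff) <= bandwidth)
        slope, intercept = np.polyfit(x[window] - cutoff, y[window], deg=1)
        grid = np.linspace(x[window].min(), x[window].max(), 50)
        ax.plot(grid, intercept + slope * (grid - cutoff), color="red")
    ax.axvline(cutoff, linestyle="--")
    return fig


# Simulated example (hypothetical numbers): a noisy outcome with a true jump of 0.3.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = 0.3 * (x >= 0.5) + rng.normal(0, 0.5, 500)
print("global quadratic jump:", rd_jump_global_quadratic(x, y, 0.5))
print("local linear jump:   ", rd_jump_local_linear(x, y, 0.5, bandwidth=0.15))
```

Calling `plot_rd(x, y, 0.5, 0.15)` puts the two estimates in context: the question is whether the fitted lines look like a reasonable summary of the points near the cutoff, not just whether the two numbers agree.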

I have no idea what the data and fitted local linear model would look like, and that’s part of the problem here. (The research article in question has other problems, notably regarding data coding and exclusion, choice of outcome to study, and a lack of clarity regarding the theoretical model and its connection to the statistical model, but here we’re focusing on the particular issue of the regression being fit. These concerns do go together, though: if the data were cleaner and the theoretical structure were stronger, this would inspire more trust in a fitted statistical model.)

**Taking the blame**

Examples in statistics and econometrics textbooks (my own included) are too clean. The data come in, already tidy, and then the model is fit, and it works as expected, and some strong and clear conclusion comes out. You learn research methods in this way, and you can come to expect this to happen in real life, with some estimate or hypothesis test lining up with some substantive question, and all the statistical modeling just being a way to make that connection. And you can acquire the attitude that the methods just simply work. In the above example, you can have the impression that if you do a local linear regression and a bunch of robustness tests, you’ll get the right answer.

Does following the statistical rules assure you (probabilistically) that you will get the right answer? Yes, in some very simple settings such as clean random sampling and clean randomized experiments, where effects are large and the things being measured are exactly what you want to know. More generally, no. More generally, there are lots of steps connecting data, measurement, substantive theory, and statistical model, and no statistical procedure blindly applied (even with robustness checks!) will be enough on its own. It’s necessary to directly engage with data, measurement, and substantive theory. Graphing the data and fitted model is one part of this engagement, often a necessary part.

Andrew: “It’s necessary to directly engage with data, measurement, and substantive theory. Graphing the data and fitted model is one part of this engagement, often a necessary part.”

To underline this, as William Cleveland wrote in “The Elements of Graphing Data”:

“Data Display is critical to data analysis. Graphs allow us to explore data to see overall patterns and to see detailed behavior; no other approach can compete in revealing the structure of data so thoroughly. Graphs allow us to view complex mathematical models fitted to data, and they allow us to assess the validity of such models.”

Tom:

Yes. Cleveland is one of my heroes.

Tom and Andrew,

I understand the validity of this point and I have clearly followed similar principles in presenting data from my own research from early on.

BUT:

My hunch is that graphing data is fine as long as you have only 2 variables (resulting in a 2-D display) and perhaps even 3 variables (with 2 variables having main or interaction effects on a third, dependent variable; illustrated by, for instance, regression hyperplanes in 3-D). But even with 3-D models confined to the 2-D surface of your computer screen or a fancy graph in a printed article, it can become difficult to gauge the appropriateness of a model just through visual inspection. And in my opinion the validity of this approach is even more compromised once you have more than 2 predictors that might interact (or not) in complex ways. At that stage you either have to break things down by splitting one variable and thereby obscuring to a large extent that variable’s properties or by some other way of reducing the complexity. And that comes at a cost, usually.

Perhaps I am just ignorant of tried & tested approaches to depicting more complex models in graphs. But my sense is that we need more, better, and more clever ways to depict data from complex models and thereby to check the adequacy of our models in light of our data. I am not ready to accept that we should be limited in our ability to understand and correctly model complexity in behavioral and other kinds of data simply because anything that goes beyond 2 or 3 variables involved cannot easily be visualized.

Does anyone have suggestions?

Oliver:

Graphing is more of a challenge when there are multiple continuous predictors.

One thing that we sometimes do is to discretize some predictors. For example, in a regression of y on x1, x2, x3, x4 if you discretize x2, x3, and x4 to have two, three, and four levels, respectively, then you can display the full model using a 3-by-4 grid of plots indexed by x3 and x4, with each plot showing y vs. x1 using different colors for the different values of x2.
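The discretizing idea can be sketched as follows. This is only an illustration under assumptions: the predictors are cut at quantiles so the groups are roughly equal in size (the text does not say how to choose the cut points), and `discretize`, `panel_layout`, and `plot_panels` are hypothetical helper names.

```python
import numpy as np


def discretize(values, n_levels):
    """Cut a continuous predictor into n_levels groups of roughly equal size."""
    # Interior quantile cut points, e.g. just the median when n_levels is 2.
    cuts = np.quantile(values, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(values, cuts)  # integer labels 0 .. n_levels - 1


def panel_layout(x1, x2, x3, x4, y):
    """Group (x1, y) points into a 3-by-4 grid of panels indexed by x3 and x4.

    Within each panel the points are further split by the two levels of x2,
    so that each level can be drawn in its own color.
    """
    g2, g3, g4 = discretize(x2, 2), discretize(x3, 3), discretize(x4, 4)
    panels = {}
    for row in range(3):
        for col in range(4):
            in_panel = (g3 == row) & (g4 == col)
            panels[(row, col)] = {
                level: (x1[in_panel & (g2 == level)], y[in_panel & (g2 == level)])
                for level in range(2)
            }
    return panels


def plot_panels(panels):
    """Draw y vs. x1 in each panel, one color per x2 level."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, axes = plt.subplots(3, 4, sharex=True, sharey=True)
    for (row, col), by_level in panels.items():
        for level, (px, py) in by_level.items():
            axes[row, col].scatter(px, py, s=8, label=f"x2 level {level}")
    axes[0, 0].legend()
    return fig
```

Fitted curves from the model can then be overlaid panel by panel, evaluated at a representative value of x2, x3, and x4 for each cell.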

In other settings, we can make an omnibus continuous predictor by using the linear predictor from a fitted model. For example, in a regression of y on x1, x2, x3, x4, …, where x1 is a discrete predictor of interest (for example, a treatment indicator), we can fit the model y = Xb, then create the omnibus predictor z = b2x2 + b3x3 + b4x4 + …, and then plot y vs. z with different colors for the different values of x1.
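A rough sketch of the omnibus-predictor construction, assuming ordinary least squares with an intercept (the intercept and the function name `omnibus_predictor` are additions for illustration, not something specified above):

```python
import numpy as np


def omnibus_predictor(treatment, covariates, y):
    """Fit y = Xb by least squares and return z = b2*x2 + b3*x3 + ...

    `treatment` plays the role of x1 and is left out of z, so that y can be
    plotted against z with a different color for each treatment value.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), treatment, covariates])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    # coefs[0] is the intercept and coefs[1] the treatment coefficient;
    # the remaining entries are the covariate coefficients b2, b3, ...
    z = covariates @ coefs[2:]
    return z, coefs


# Simulated example (hypothetical numbers).
rng = np.random.default_rng(0)
treatment = rng.integers(0, 2, 300).astype(float)
covariates = rng.normal(size=(300, 3))
y = 0.5 * treatment + covariates @ np.array([1.0, -0.5, 0.2]) + rng.normal(0, 1, 300)
z, coefs = omnibus_predictor(treatment, covariates, y)
```

Plotting `y` against `z`, colored by `treatment`, then collapses all the other predictors onto a single horizontal axis.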

For your basic regression discontinuity problem, it’s more clear what to graph, as there’s a single forcing variable to use on the x-axis. So there’s no excuse not to plot the data and fitted model, and that’s good news, because such a plot can reveal problems, as in the example above.

Unfortunately, some RDD fans seem to draw the opposite conclusion: they’d rather trust a mechanical p-value than their lyin’ eyes (https://twitter.com/kirabojackson/status/1074110061025419268).

Daniel:

That’s too bad. Again, I think part of the blame goes to statistics and econometrics textbooks, where we tend to give clean examples where the model fits the theory (for example, predicting post-test scores from pre-test scores in a model where pre-test is used as a discontinuity threshold), so you’d expect a strong and persistent relation between the forcing variable and the outcome.

Another way to put it is that we train people to have too much faith in the statistical properties of these canned procedures to provide insight about the real world.

Sometimes it seems that we’re actively encouraging people to set aside their common sense. The advice to look at the p-value and ignore the graph is a pretty stunning example!

Perhaps a maxim somewhat like the following needs to be adopted in writing statistics textbooks and teaching statistics:

Each technique needs to be illustrated by (at least) two examples: One where it works well, and (at least) one where it doesn’t.

I think that tweet is making a subtly different point. You’re saying that the plot can reveal obvious problems where the assumptions of the test are not met, especially when you’re not sure what to look for, e.g. an incorrect model specification like the polynomial above.

He’s saying that a noisy but well behaved plot without obvious trends doesn’t necessarily imply that the effect isn’t there. The example where there definitely is an effect since the data is simulated, but the plot is “uncompelling” makes sense to me.

Though maybe language like “don’t use RD plots to make statistical inference” and describing plots as merely “a powerful way to display an effect” incorrectly minimizes the value of plots.

Never mind, misread the post above

That doesn’t mean your point is without some validity, though. Leaving the t-stat argument aside, the simulation does show that sometimes a plot will be unconvincing even if there’s actually an effect.
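For what it’s worth, the phenomenon is easy to reproduce. The following is a made-up simulation, not the one from the linked tweet: the regression function is flat on each side of the cutoff, the true jump is small relative to the noise, and all the numbers are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, jump, noise_sd, cutoff = 8000, 0.2, 1.0, 0.5

# Flat regression function on each side, with a true jump at the cutoff.
x = rng.uniform(0, 1, n)
y = jump * (x >= cutoff) + rng.normal(0, noise_sd, n)


def difference_in_means(x, y, cutoff):
    """Estimate the jump as the difference in mean outcome across the cutoff.

    This is only appropriate here because the simulated curve is flat on
    each side; it is not a general-purpose RD estimator.
    """
    left, right = y[x < cutoff], y[x >= cutoff]
    estimate = right.mean() - left.mean()
    std_error = np.sqrt(left.var(ddof=1) / len(left) + right.var(ddof=1) / len(right))
    return estimate, std_error


estimate, std_error = difference_in_means(x, y, cutoff)
t_stat = estimate / std_error

# Share of the variance in y explained by which side of the cutoff a point is on.
r_squared = np.corrcoef(y, (x >= cutoff).astype(float))[0, 1] ** 2

print("estimate:", estimate, "t-statistic:", t_stat, "R^2:", r_squared)
```

With this much data the t-statistic is large even though the side of the cutoff explains only a tiny share of the variance, so a raw scatterplot would show two heavily overlapping clouds. That is a statement about this clean simulation, where the model is known to be right; it does not say what to conclude when the model itself is in doubt.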

Ultimately, then, the call becomes a question of what kinds of errors you are willing to make. And, unlike in a simulation, one must take into account the fact that one can deviate from the assumptions of the RD procedure used, forking paths, whether the estimated effect is all that relevant (perhaps there is in fact a discontinuity at the threshold, but the unconvincing plot could suggest that it has little practical relevance – subjects a little bit away from it may not get such different outcomes after all, despite not being comparable to each other), etc.

I think that the analysis should start by asking “How likely is it that there is a discontinuity in the data set?” And follow that with “If there is a discontinuity, where is it most likely to be?” Assuming a priori that the answers are “Yes” and “Here” is putting the cart way before the horse.

In this particular case, I think the authors are on safe ground there. They’re using union vote share, so one would really expect there to be a discontinuity at 50%, where just below the companies don’t have unions and just above they do. I think this is a case where the null hypothesis is definitely false, but the dataset probably doesn’t provide strong evidence against it or allow one to come up with a good estimate of effect size.

How would one decide (or even estimate) how likely it is that there is a discontinuity in the data set?

Also, my understanding is that the purpose of using an RDD is usually not to investigate whether or not there is some discontinuity, but rather to investigate whether or not there is a discontinuity at a point where there has been some particular event that might cause a discontinuity.

Good question. I’m no statistician, but there should be a way of doing so. If one can see it when the data are plotted, then one should be able to devise a test to determine its location. And as for it being associated with a particular event, there’s often a delay between its implementation and it showing up in the data, at least for programs I’m familiar with (e.g., new crime prevention programs), so determining when the discontinuity occurs after the intervention is a relevant factor.

Right, the idea is that there is an actual discontinuity (or at least rapid change) in some intermediate outcome and we want to see if that intermediate outcome causes a difference in a final outcome.

In this case I assume that if the vote is more than 50% the firm unionized, and then they want to see if unionizing caused changes in volatility, basically by comparing firms that had votes like 49% to those with, say, 51%.

The discontinuity doesn’t bother me as much as the quadratic assumption. (As D. Lakeland points out, there is reason to believe there is a discontinuity at 50%.)

But how on earth do you justify the shape of the curves on each side of 50%? The concavity is flipped from one side to the other. And the slope flips signs in the middle of each side. Why are the extrema in the middle of the two intervals? I would never have guessed that, I would have guessed the extrema to be at the far ends or at 50%. Why on earth do NCSKEW and DUVOL have a local max at .75? And why is there almost no trend up or down in the two intervals? We are confident that there is important information in the second derivatives when there is nothing of interest in the first derivatives? (Not a rigorous argument, I’ll grant you, but my spidey-sense is tingling on this last point.)

You might (might!) be able to tell a story where extreme values of the vote are associated with extreme values of each variable, so you need a nonlinear model, but the nonlinearity shown is just plain weird. The nonlinearity is a jumble of concavities and extrema … what you would expect to get if your model was the result of arbitrary cherry picking.

Terry:

The piecewise-quadratic curve is indeed horrible, both in general terms (estimating what’s going on at the discontinuity using this global model) and in the actual example as shown in the graph. Using such curves is just poor practice.

The story is slightly more interesting when we move to a local linear model, which is better in that it eliminates or reduces the global dependence of the fit. But still there’s the problem of what happens with the actual data: if the local linear fit also looks like a mess, then we still can and should be concerned that the fitted model is a data artifact and does nothing for the causal inference of interest except add noise (or bias, depending on how you look at it).

Also forking paths, which makes it hard for me to take the reported statistical significance seriously.

And the larger issue of disconnect between theory, model, and measurement.

And the meta issue of why people wanted so much to believe and then defend this result. I don’t see where all this trust is coming from.

To Andrew’s last point: I’m not sure that people do want so much to believe and defend this *result*. I think they may want to believe and defend the method, because it’s standard, they use it themselves, and thinking more seriously about the problems would make it harder to publish papers.

Sadly, your speculation may indeed apply to many people.

I’ve only skimmed these threads — I can’t get over how depressingly awful the data + “model” are — but this reminds me that the question of where in a noisy time series a “step” occurs comes up in fields that may not be familiar to readers. For example, biophysical experiments in which one watches trajectories of molecules or cells, which may take discrete steps. There’s some neat analysis work inspired by this, for example: https://www.ncbi.nlm.nih.gov/pubmed/26200870 .