
Another Regression Discontinuity Disaster and what can we learn from it

As the above image from Diana Senechal illustrates, a lot can happen near a discontinuity boundary.

Here’s a more disturbing picture, which comes from a recent research article, “The Bright Side of Unionization: The Case of Stock Price Crash Risk,” by Jeong-Bon Kim, Eliza Xia Zhang, and Kai Zhong:

which I learned about from the following email:

On Jun 18, 2019, at 11:29 AM, ** wrote:

Hi Professor Gelman,

This paper is making the rounds on social media:

Look at the RDD in Figure 3 [the above two graphs]. It strikes me as pretty weak and reminds me a lot of your earlier posts on the China air pollution paper. Might be worth blogging about?

If you do, please don’t cite this email or my email address in your blog post, as I would prefer to remain anonymous.

Thank you,
**

This anonymity thing comes up pretty often—it seems that there’s a lot of fear regarding the consequences of criticizing published research.

Anyway, yeah this is bad news. The discontinuity at the boundary looks big and negative, in large part because the fitted curves have a large positive slope in that region, which in turn seems to be driven by action on the boundary of the graph which is essentially irrelevant to the causal question being asked.

It’s indeed reminiscent of this notorious example from a few years ago:

[Screenshot: the discontinuity graph from the China air-pollution paper]

And, as before, it’s stunning not just that the researchers made this mistake—after all, statistics is hard, and we all make mistakes—but that they could put a graph like the ones above directly into their paper and not realize the problem.

This is not a case of the chef burning the steak and burying it in a thick sauce. It’s more like the chef taking the burnt slab of meat and serving it with pride—not noticing its inedibility because . . . the recipe was faithfully applied!

What happened?

Bertrand Russell has this great quote, “This is one of those views which are so absurd that only very learned men could possibly adopt them.” On the other hand, there’s this from George Orwell: “To see what is in front of one’s nose needs a constant struggle.”

The point is that the above graphs are obviously ridiculous—but all these researchers and journal editors didn’t see the problem. They’d been trained to think that if they followed certain statistical methods blindly, all would be fine. It’s that all-too-common attitude that causal identification plus statistical significance equals discovery and truth. Not realizing that both causal identification and statistical significance rely on lots of assumptions.

The estimates above are bad. They can be labeled either as noisy (because the discontinuity of interest is perturbed by this super-noisy curvy function) or as biased (because, for these particular data, the fitted curves augment the discontinuity by a lot). At a technical level, these estimates yield confidence intervals that are too narrow (see this paper with Zelizer and this one with Imbens), but you hardly need all that theory and simulation to see the problem—just look at the above graphs without any ideological lenses.
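
To see how those fitted curves can manufacture a discontinuity, here is a minimal simulation sketch (this is not the authors' data or code; the sample size of 687 and the 50% cutoff are taken from the paper, everything else is invented). It fits separate quadratics on each side of the cutoff to pure noise and compares the implied jump at the cutoff with a plain difference of means between the two sides:

```python
import numpy as np

rng = np.random.default_rng(0)

def jump_sds(n=687, sims=1000):
    """Simulate discontinuity estimates when the true effect is zero.

    Returns the standard deviation, across simulations, of two
    estimators of the jump at the 50% cutoff: side-by-side quadratics
    evaluated at the cutoff, and a plain difference of means."""
    quad, flat = [], []
    for _ in range(sims):
        x = rng.uniform(0.0, 1.0, n)   # vote share
        y = rng.normal(0.0, 1.0, n)    # outcome: pure noise, true jump = 0
        left, right = x < 0.5, x >= 0.5
        # quadratic fit on each side, evaluated at the cutoff
        ql = np.polyval(np.polyfit(x[left], y[left], 2), 0.5)
        qr = np.polyval(np.polyfit(x[right], y[right], 2), 0.5)
        quad.append(qr - ql)
        # same data, summarized with a flat mean on each side
        flat.append(y[right].mean() - y[left].mean())
    return np.std(quad), np.std(flat)

quad_sd, flat_sd = jump_sds()
print(quad_sd / flat_sd)  # roughly 3: the quadratic estimate is much noisier
```

The same data, summarized with flat means on each side, give a far more stable estimate; the swooping quadratics inflate the noise at exactly the point you care about, and that extra noise plus a significance filter is how pure noise gets "discovered."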

Ideology—statistical ideology—is important here, I think. Researchers have this idea that regression discontinuity gives rigorous causal inference, and that statistical significance gives effective certainty, and that the rest is commentary. These attitudes are ridiculous, but we have to recognize that they’re there.

The authors do present some caveats but these are a bit weak for my taste:

Finally, we acknowledge the limitations of the RDD and alert readers to be cautious when generalizing our inferences in different contexts. The RDD exploits the local variation in unionization generated by union elections and compares crash risk between the two distinct samples of firms with the close-win and close-loss elections. Thus, it can have strong local validity, but weak external validity. In other words, the negative impact of unionization on crash risk may be only applicable to firms with vote shares falling in the close vicinity of the threshold. It should be noted, however, that in the presence of heterogeneous treatment effect, the RDD estimate can be interpreted as a weighted average treatment effect across all individuals, where the weights are proportional to the ex ante likelihood that the realized assignment variable will be near the threshold (Lee and Lemieux 2010). We therefore reiterate the point that “it remains the case that the treatment effect estimated using a RD design is averaged over a larger population than one would have anticipated from a purely ‘cutoff’ interpretation” (Lee and Lemieux 2010, 298).

I agree that generalization is a problem, but I’m not at all convinced that what they’ve found applies even to their data. Again, a big part of their negative discontinuity estimate is coming from that steep up-sloping curve which seems like nothing more than an artifact. To say it another way: including that quadratic curve fit adds a boost to the discontinuity which then pulls it over the threshold to statistical significance. It’s a demonstration of how bias and coverage problems work together (again, see my paper with Guido for more on this).

This is not to say that the substantive conclusions of the article are wrong. I have no idea. All I’m saying is that the evidence is not as strong as is claimed. And also I’m open to the possibility that the substantive truth is the opposite of what is claimed in the article. Also don’t forget that, even had the discontinuity analysis not had this problem—even if there was a clear pattern in the data that didn’t need to be pulled out by adding that upward-sloping curve—we’d still only be learning about these two particular measures that are labeled as stock price crash risk.

How to better analyze these data?

To start with, I’d like to see a scatterplot. According to the descriptive statistics there are 687 data points, so the above graph must be showing binned averages or something like that. Show me the data!

Next, accept that this is an observational study, comparing companies that did or did not have unions. These two groups of companies differ in many ways, one of which is the voter share in the union election. But there are other differences too. Throwing them all in a regression will not necessarily do a good job of adjusting for all these variables.

The other thing I don’t really follow is their measures of stock price crash risk. These seem like pretty convoluted definitions; there must be lots of ways to measure this, at many time scales. This is a problem with the black-box approach to causal inference, but I’m not sure how this aspect of the problem could be handled better. The trouble is that stock prices are notoriously noisy, so it’s not like you could have a direct model of unionization affecting the prices—even beyond the obvious point that unionization, or the lack thereof, will have different effects in different companies. But if you go black-box and look at some measure of stock prices as an outcome, then the results could be sensitive to how and when you look at them. These particular measurement issues are not our first concern here—as the above graphs demonstrate, the estimation procedure being used here is a disaster—but if you want to study the problem more seriously, I’m not at all clear that looking at stock prices in this way will be helpful.

Larger lessons

Again, I’d draw a more general lesson from this episode, and others like it, that when doing science we should be aware of our ideologies. We’ve seen so many high-profile research articles in the past few years that have had such clear and serious flaws. On one hand it’s a social failure: not enough eyes on each article, nobody noticing or pointing out the obvious problems.

But, again, I also blame the reliance on canned research methods. And I blame pseudo-rigor, the idea that some researchers have that their proposed approach is automatically correct. And, yes, I’ve seen that attitude among Bayesians too. Rigor and proof and guarantee are fine, and they all come with assumptions. If you want the rigor, you need to take on the assumptions. Can’t have one without the other.

Finally, in case there’s a question that I’m being too harsh on an unpublished paper: If the topic is important enough to talk about, it’s important enough to criticize. I’m happy to get criticisms of my papers, published and unpublished. Better to have mistakes noticed sooner rather than later. And, sure, I understand that the authors may well have followed the rules as they understood them, and it’s too bad that resulted in bad work. Kind of like if I was driving along a pleasant country road at the speed limit of 30 mph and then I turned a corner and slammed into a brick wall. It’s really not my fault, it’s whoever put up the damn 30 mph sign. But my car will still be totaled. In the above post, I’m blaming the people who put up the speed limit sign (including me, in that in our textbooks our colleagues and I aren’t always so clear on how our methods can go wrong).

P.S. The person who sent the email to me adds some comments on the paper:

I wonder if those weird response variables DUVOL and NCSKEW are themselves “researcher degrees of freedom”. Imagine all the other things they could have studied – stock price growth after the union vote, revenue, price/earnings ratio… these could just as plausibly be related to unionization as the particular crash risk formulas, Equations (1-3), used by the authors.

A few more suspicious aspects:

1. The functional form is purely empirical. They tried polynomials of degrees 1-4 and selected quadratic because it had the best AIC (Footnote 9).

2. Tons and tons of barely significant results, 0.01 < p < 0.05 it looks like based on the tables. You can't just blindly go with an "approved" methodology - you have to at least (1) sanity check your RDD plots, (2) check whether the fitted lines in the RDD make sense theoretically, right? There's no economic reason for those curves to look the way they do.

In a sane world, perhaps this article would have received very little attention, or maybe its problems would’ve been corrected in the review process, or maybe it would’ve appeared in an obscure journal and then not been taken seriously. But it came to a strong conclusion on a politically charged topic.

Science communication is changing. On one hand, we have post-publication review, so there are places to point out when claims are pushed based on questionable evidence. On the other hand, the claims get out there faster.

P.P.S. I’m also reminded of something I wrote last month:

I am concerned that all our focus on causal identification, important as it is, can lead researchers, journalists, and members of the general public to overconfidence in theories as a result of isolated studies, without always the recognition that real life is more complicated.

P.P.P.S. More here.

41 Comments

  1. Ram says:

    I agree that this looks to be spurious, but I’m curious what’s going wrong here. It looks like they also did a LOESS using some optimal plug-in bandwidth and got (substantively) similar results. They also tried increasing and decreasing the bandwidth, and tried multiple kernels, and the results remained essentially the same. They even did various placebo tests where they redid the analysis using other cutoffs besides 50%, and the estimates there were consistently smaller and not statistically significant. I realize that any statistical method has error rates, but usually such errors reflect differences between sample and population, not differences between analysis and sample, which is what you’re getting at here. I understand that the results of all these “robustness” checks are highly correlated given that they’re all slight variations on the same underlying analysis of the same data, but it’s still not obvious what the misapplication is here. The only thing that looks especially suspicious is that they binned the data before doing the regressions, but the scatterplot reflects the binned data and the problem is visible right there. I guess what I’m asking is how is it that this method gives results that are obviously wrong given the data used to fit it, when this method isn’t obviously making any assumptions that the data explicitly reject?

    • Andrew says:

      Ram:

      To answer the question in your final sentence: One way to say this is that they’re making a mistake to use an unregularized regression, as that big-ass overfit quadratic curve is driving the result; another way to say it is that the points far from the threshold are driving the fit. The lowess fit could well be using a huge chunk of the data and have similar issues as the quadratic (but maybe not so bad). The problem with the robustness tests is that they’re trying to find null effects, which shouldn’t be hard given all the noise in the data. It’s hard to say more without actually seeing the data.

      From a statistical standpoint, this is a lot like those cargo-cult psychology papers that we’ve discussed over the years. The paper seems impressive because it presents lots of empirical results that seem to be supportive of its main conclusion, but when you go through the empirical results one at a time, problems crop up. Researcher degrees of freedom and the garden of forking paths are a thing, and reporting lots of analyses doesn’t resolve this issue. In the paper under discussion, the problems are particularly clear because the authors presented those graphs. Usually we’ll just see a regression table and it’s a lot harder to figure out what’s going on.

      Another way of saying this is that it’s tempting to believe that you can’t get all of this just by chance, but you can. That’s the point of my papers with Zelizer and Imbens linked above. Throwing in that polynomial adds noise, which gives the researchers a way to apparently win by getting an effect size large enough to reach that significance threshold.

      • Ethan Bolker says:

        I think lots of what’s discussed on this blog and a cause of common lay errors in probability comes down to

        It’s tempting to believe that you can’t get all of this just by chance, but you can.

        I wonder if there’s a way to get that word out beyond the preaching to the choir here.

        • Andrew says:

          Ethan:

          Lots of people have heard of “p-hacking” and “forking paths” so that’s a start. But I think one thing that people don’t always realize is how all these different problems go together. For example, the choice of whether to include that quadratic curve is a forking path, but also it introduces noise to the estimate which in turn introduces bias through the statistical significance filter. And the arbitrariness of the outcome measures is another forking path but it is also relevant to questions of generalizability, in that the substantive message of the paper is, from the abstract, a “beneficial impact of unionization and the role that organized labor plays in influencing extreme downside risk in the equity market.” That’s a lot of conclusion to draw from these particular measures.

          When someone sees a critique of this sort of paper, it can be tempting to think: Hey, there are so many different results, robustness checks, etc., so what if there are one or two potential weaknesses in the analysis? There’s such an apparent pile of evidence, that any specific criticisms seem kinda picky. And people don’t realize that the criticisms reinforce each other.

    • Anoneuoid says:

      I’m curious what’s going wrong here. It looks like they also did a LOESS using some optimal plug-in bandwidth and got (substantively) similar results. They also tried increasing and decreasing the bandwidth, and tried multiple kernels, and the results remained essentially the same. They even did various placebo tests where they redid the analysis using other cutoffs besides 50%, and the estimates there were consistently smaller and not statistically significant.
      […]
      it’s still not obvious what the misapplication is here

      This is called overfitting… when I read your description I give pretty much 0% chance anything useful would result from that process. At the very least (and I really mean that as a bare minimum requirement) they should have had a hold-out set.

      • Andrew says:

        Anon:

        A hold-out set is fine, but if they’d had a clear discontinuity just using data near the boundary, that could be convincing. The model as a whole doesn’t look very careful to me, though, just kind of a blind regression. It’s good they made the graphs as this reveals the overfitting problem, and now maybe they or others will be able to do better in the future.

        • Anoneuoid says:

          Pretty sure I can always get something that looks like a discontinuity if I get to choose the boundary, exclusion criteria, variables, etc.

          Eg, in this paper they used only 687 out of 5,342 elections. So 87% of the data is dropped. If I have free rein to come up with excuses to drop 87% of the data I will discover all sorts of stuff for you. I’ve found there is always a legitimate excuse to drop any given data point.

          Then here is eg the dependent variable:

          To construct our first measure of crash risk, NCSKEW, we take the negative of the third moment of firm-specific weekly returns for each fiscal year and divide it by the standard deviation of firm-specific weekly returns raised to the third power.

          Plenty of room to mess around with that too.
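
          Taken literally, that definition is just negative skewness. Here is a rough sketch of my reading of the quoted sentence (not the authors' code; the paper's exact estimator may include sample-size adjustments and a particular way of constructing "firm-specific" returns):

```python
import numpy as np

def ncskew(weekly_returns):
    """Negative coefficient of skewness, per the quoted definition:
    minus the third central moment of the returns, divided by the
    standard deviation raised to the third power."""
    r = np.asarray(weekly_returns, dtype=float)
    dev = r - r.mean()
    m3 = np.mean(dev ** 3)
    sd = np.std(r)  # population SD; the paper may use a sample version
    return -m3 / sd ** 3

# A return series with occasional large drops is left-skewed, so its
# NCSKEW comes out positive, i.e. higher measured "crash risk."
rng = np.random.default_rng(1)
r = rng.normal(0.002, 0.02, 52)   # 52 simulated weekly returns
r[::13] -= 0.10                   # inject a few crashes
print(ncskew(r))                  # positive
```

          Every choice in there (population vs. sample moments, weekly vs. daily returns, how the firm-specific returns are residualized) is a place where a different reasonable decision gives a different number.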

          You need to check the model on other data, ideally that data not even collected when the model is devised, but a hold-out is far better than nothing.

      • Ram says:

        I would need to read the paper more carefully to figure out precisely what they did, but usually the idea is to set the bandwidth so as to minimize the cross-validated MSE or something similar. A plug-in bandwidth is just a closed-form approximation of this. So if that’s what they did, they did choose the bandwidth to optimize performance on hold out sets. Of course CV is noisy, so they may have chosen too large of a scale for the kernel, but they tried bandwidths bigger and smaller than this and didn’t see any appreciable change in the results. So I don’t think “overfitting” is a useful way to think about what’s going wrong here. If they just cherry picked an arbitrary bandwidth to get their result, I’d agree that this was part of the problem.

        Another possibility is that they’re using large sample normal theory inferences, and perhaps that is not doing such a good job here due to skewness in the data slowing convergence to normality. That would mean that more accurate confidence bands around the LOESS curves would be essentially overlapping, making the apparent discontinuity in the point estimates just come out looking like noise.

        • Anoneuoid says:

          CVs always overfit in practice because the person runs them a bunch of times and adjusts the model as they go (“data leakage”).

          That is why standard practice is to run all your CVs to tune the model, then assess the skill of the final model on your hold out.

        • Andrew says:

          Ram:

          They’re overfitting to the data in the sense that they’re getting this big swooping curve that makes no sense but is driving the discontinuity estimate.

          • Ram says:

            Right, I agree about the quadratic fit, which is awful. I’m talking specifically about the LOESS analysis, where they seem to have used a CV or CV-like procedure to decide how much to regularize the curve fit. This can overfit too, since we’re using the test folds to estimate the smoothing parameter, but it’s less obvious that this would give something as silly as the quadratic fit, and their sensitivity analyses show that the results are qualitatively invariant to the smoothing parameter in a nontrivial neighborhood of the CV-like estimate, meaning that CV-noisiness may not be a huge problem here. Which is what makes me think the problem is less the curve fit and more the confidence bands being too narrow. The estimated discontinuity may be a reasonable description of the data if it’s paired with a suitably wide confidence interval. The trouble seems to be that it isn’t, which is where the statistical significance is coming from.

            • Andrew says:

              Ram,

              Could be. I guess I’d like to see the fitted loess curves. Ultimately I think the issue is that this is an observational study, and this curve will not in general adjust for pre-treatment differences between the groups.

              Also, as various commenters have pointed out, there are lots of seemingly arbitrary choices in the analysis, including, most notably, the outcome measures and the data-exclusion rules (going from 5342 elections down to 687).

              • Ram says:

                Agree on the second point. On the first point, there is an appeal to discontinuity designs, in that you know the one thing determining treatment assignment (which side of the cutoff you’re on), and you can use this to control for the determining factor, but only at the cutoff. I see the value in that, rather than trying to control for all the high dimensional and mostly unobserved average group differences. But as always it’s important that you use estimators and inference procedures with good performance characteristics, and I guess I want to know why the usual RDD ones (CV-regularized nonparametric regression with asymptotic normal theory inferences) seem to be sputtering here.

              • Andrew says:

                Ram:

                Yeah, I’m not sure. As the saying goes, Bad cases make bad law. In this particular example, I’d think the first step would be to look at a richer and less noisy set of outcome measures and to keep all 5,342 data points, or as many as are reasonably possible to keep. Also of course to move away from the goal of attaining statistical significance.

              • Daniel Weissman says:

                Ram is making some great points in this thread!

  2. Soso says:

    I also stumbled across the paper. But I was wondering about the 600ish observations. Given that RDD is more of a local method, they must have very few observations.

    • Andrew says:

      Soso:

      If you follow the link to the paper, you can deduce this information from Figure 1. It appears that approximately a third of the observations are between 40% and 60% of the vote—that would be over 200 data points, so enough to just do an analysis right there at the boundary. Bad news from the researchers’ perspective is that such a result might not be statistically significant. The right way to think about that would be to accept that there’s a lot of variation and no clear pattern in the data—but people don’t always want to hear that!
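
      A sketch of that boundary-only comparison (the 40–60% window, variable names, and simulated data here are hypothetical, not from the paper):

```python
import numpy as np

def boundary_comparison(vote_share, outcome, lo=0.40, hi=0.60):
    """Difference in mean outcome between close winners and close
    losers, using only elections with vote share inside [lo, hi]."""
    v = np.asarray(vote_share, dtype=float)
    y = np.asarray(outcome, dtype=float)
    win = (v >= 0.5) & (v <= hi)
    lose = (v >= lo) & (v < 0.5)
    diff = y[win].mean() - y[lose].mean()
    se = np.sqrt(y[win].var(ddof=1) / win.sum() +
                 y[lose].var(ddof=1) / lose.sum())
    return diff, se

# With noisy outcomes, even a couple hundred boundary elections leave a
# wide interval: the "lots of variation, no clear pattern" answer.
rng = np.random.default_rng(2)
v = rng.uniform(0.0, 1.0, 687)   # simulated vote shares
y = rng.normal(0.0, 1.0, 687)    # simulated outcome, no true effect
diff, se = boundary_comparison(v, y)
print(diff, se)
```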

  3. jd says:

    I’m having trouble making out what the y-axis is on the crash risk plots, but for all of the plots above, if I remove the regression and discontinuity lines in my mind’s eye, I can’t see any trends whatsoever. It seems like that would be a good first check on all these examples? Plot the data with no fancy lines drawn.

    • Brent Hutto says:

      I was about to post the same thing. The dots suggest flattish trend lines with lots of noise and a few oddball points. Nothing about the raw data suggests any sort of curve and/or discontinuity to my eyes.

      Separate question. What is the purpose of “binning” this sort of data rather than fitting whatever you’re going to fit to the raw data? Assuming I understood correctly that the dots on the plots are bins.

      And for a potentially clueless question, why can’t I see any confidence intervals on those quadratics (as stated in the caption)?

      • jd says:

        Not sure about the confidence intervals for Figure 3 above. In the paper, Figure 2 contains confidence intervals as lighter gray lines around the fitted line. I would think something like that would be shown for Figure 3. Maybe I am overlooking something obvious, though

    • Koray says:

      I too was thinking that even individual halves of each of the 2 plots don’t look like good fits at all. There’s just too much noise.

  4. Sam says:

    Hi Andrew,

    This is an incredibly irresponsible post. You should read the paper and revise out of respect to the researchers.

    Just wanted to point out with regards to this part

    “This anonymity thing comes up pretty often—it seems that there’s a lot of fear regarding the consequences of criticizing published research.”

    that

    (a) This paper is not published, it is a working paper.
    (b) The authors are not famous, and none of them are from a high-ranked economics department, and the paper was politely criticized by informed and uninformed researchers on Twitter.
    (c) You are in fact the most famous and influential person in this conversation, and you are using this influence recklessly. Needlessly nasty comments by senior researchers chill discourse and are the worst part of academic culture, in my opinion. I sincerely hope the pugilistic quality of the older generation of Causal Inference/Stats research does not trickle down to the younger generation.

    • Andrew says:

      Sam:

      1. I did read the paper.

      2. I have nothing above to revise. If there’s a specific thing I got wrong, please let me know.

      3. The person who sent me that email had no problem with me reproducing the email but did not want the name shared. So I preserved anonymity. I think that’s the right thing to do.

      4. The paper was published online in the same way that other papers on SSRN, arXiv, webpages, etc. are published. It was not published in a peer-reviewed journal, but it was posted and shared for all to read.

      5. Non-famous people’s work can be criticized too. Labor unions and corporate performance are important topics, which is why this paper got attention in the first place. I think it’s a terrible mistake to let unsupported claims slide, just because they were made by non-famous people.

      6. I don’t see why criticizing poor statistical work “chills discourse.” I’ve done poor work—it happens all the time! I like when people criticize it, as it gives me a chance to correct and improve.

      7. See the last paragraph of my post above, which addresses your concerns b and c.

      Anyway, I guess the real point is that if you think there’s anything specific I got wrong in the above post, please explain in the comments. There’s a tone discussion that we can have, but there’s also a substantive question of economics, and I think it would be a mistake for people to think that the above-discussed paper provides the evidence that it claims. Also, there’s future work to think about, and it would be good for future researchers to realize the problems with the analysis discussed above.

      P.S. I appreciate your taking the trouble to comment. I’d rather have these reactions out in the open (also useful to other blog readers who might agree with you!) rather than just simmering. Better to discuss these issues and explore our disagreement than to be in separate silos.

      • jim says:

        Andrew, what a great response.

        The thing we should all be about is finding the right answer. That’s the purpose of not just science, but any endeavor. That’s how our efforts benefit humanity.

        To that end, when we make mistakes, whatever we’re doing, in science, in a company, in an NGO or university admin, or whatever – as I certainly have done – we should be able to take credit for them, correct them and move forward. And we should also be able to engage in constructive tactful criticism of others’ work to improve the overall effort.

      • Sam says:

        Andrew:

        You criticize the authors for using polynomials. Here is something you yourself wrote with Guido Imbens on the topic of using polynomials in RD designs:

        “We argue that estimators for causal effects based on such methods can be misleading, and we recommend researchers do not use them, and instead use estimators based on local linear or quadratic polynomials or other smooth functions.”

        From p.15 of the paper:

        “We implement the RDD using two approaches: the global polynomial regression and the local linear regression”

        They show that their results are similar in either specification.

        When I assumed you had not read the paper I was being charitable; I did not want to presume that you knew that they had used local linear regression as well but still spent an entire post ridiculing them for using a polynomial in one specification.

        Being mean to Heckman and picking on him for his goofy high-order polynomial RDD is one thing, given that you’re both high status and have a well-known historical animosity. However, the language in this post is in my opinion not appropriate given the rank of the researchers and actual content of the paper.

        This is not the best RD paper ever. It is not even the best of the many union RD papers since DiNardo and Lee (2004). If it is submitted to a good journal, editors will probably suggest they use Cattaneo et al’s robust RD, currently the state of the art in econometrics for RD. I just don’t see the point ridiculing it as a “disaster” and calling the authors “cargo-cult” ideologues.

        • Andrew says:

          Sam:

          1. The double-quadratic regression is the only fit displayed in the paper and it’s the main analysis. The local linear regression is presented as an alternative analysis and with very little detail: for example, they never say what their value of h is, nor do they give that estimated slope, nor do they explain why that model makes sense, nor do they say where their seemingly arbitrary outcome measures come from, nor do they discuss the consequences of throwing away most of their data. One advantage of the double-quadratic regression is that they do display the model fit and we can see its problems. Seeing the problems of the model that they do display does not give me confidence that all’s ok with the models and data exclusion rules that they don’t display.

          2. The earlier paper on air pollution in China that I criticize is not by Heckman.

          3. In any case, I don’t think I’m being “mean” to authors or “picking on them” by pointing out flaws in their published work. And I don’t think people are being mean to me by pointing out problems in my own work. Pointing out flaws in published work is one of the essential steps of science.

          4. Considering the people whose work I’ve criticized or praised: in most cases I’ve not met these people and I have no personal connection with them, “historical animosity” or otherwise.

          5. You seem to be making everything so personal, and all this about “respect” and “status.” This is science, not a playground. We put our work out there and it gets criticized. That’s fine. “Status” has nothing to do with it. Criticism is a plus. As scientists, we want to learn about the world, and we can do better when people point out how we are doing this wrong.

          6. You write, “This is not the best RD paper ever. It is not even the best of the many union RD papers since DiNardo and Lee (2004).” I don’t care about the rating of this paper. I’m not giving out awards. My problem with the paper is not that it is not “the best.” I’ve written lots of papers that are not “the best.” I just want competence. The problem is that people can take this paper and draw unwarranted conclusions from it, indeed the authors positively encourage such unwarranted conclusions, for example by saying things such as, “We further find that the impact of unionization on crash risk is significant for firms located in states without right-to-work law, but insignificant for firms in states with such a law. This is because unions have a stronger influence in states without right-to-work law,” and “Within a narrow window of the cutoff point, crash risk drops significantly once union vote shares exceed the 50% threshold, suggesting that unionization has a negative impact on crash risk,” and “the coefficients on Unionization are negative and statistically significant at the 5% level. These results indicate that unionization has a negative influence on stock price crash risk.” Yes, I know that lots of researchers are trained to believe that causal identification + statistical significance = discovery. But that conclusion is a mistake. As discussed a zillion times now, including various places in this comment thread, it’s all too easy to get statistically significant results in such settings from pure noise. I do think that reliance on statistical significance in uncontrolled settings is cargo-cult science. The good news about this particular article under discussion is that perhaps the obvious problems with that discontinuity graph will make people more aware of the general problems with this sort of analysis.

          7. Maybe you’re right that the paper is not a “disaster.” It’s a run-of-the-mill paper, using opaque data and flawed statistical analysis, along with a misplaced trust in identification strategies and statistical significance, to make strong conclusions, not supported by data, about a real-world issue. From one standpoint this is a disaster; from another, it’s just one of a million papers out there that happens to be the one someone pointed to, addressing a topic of current concern (which is how it got noticed in the first place). Writing about and discussing such papers can help us better understand the problems in science, hence the above post.

          • Martha (Smith) says:

            “You seem to be making everything so personal, and all this about “respect” and “status.” This is science, not a playground. We put our work out there and it gets criticized. That’s fine. “Status” has nothing to do with it. Criticism is a plus. As scientists, we want to learn about the world, and we can do better when people point out how we are doing this wrong.”

            +1

    • Mike says:

      Sam:
      I don’t find this post to be reckless/disrespectful/nasty/pugilistic. I’m not a statistician or political scientist or economist, and I’m not a Stan user (I can barely use R). I come here for the discussion of general problems in data analysis and inference, and I’ve learned a lot from posts exactly like this one that illustrate an important and general class of mistakes underlying a data analysis that appears to the untrained eye to be sophisticated and sound but is in fact neither. For readers like me a specific example is a necessary part of learning what the error is, so focusing on a specific published article and its errors is important and useful to me (and as Andrew noted it could be important and useful to the authors). Maybe this paper was discussed in a different way on other media, but media like Twitter are useless to people like me for such a discussion because the posts are too short and because my social networking is so limited (I would never see this paper discussed). This blog aggregates a lot of related discussions of a very diverse set of papers that I would otherwise never be able to find and read about and learn from. I think it’s a positive resource, and not a chill on discourse.

    • DC says:

      Sam, as one of those in this ‘younger generation’, and as someone who has had my work criticized publicly & sometimes unfairly, I find your comment incredibly off base. Maybe if you read more of the blog (ironic, given that you’re telling Andrew to read a paper), you’d better understand the nature of the criticism, where it’s coming from, and why. Why should it matter if the paper is published or not? If it was posted for others to read, it is available for critique. Furthermore, as far as I know, most folks don’t complain if someone speaks positively about their unpublished manuscript, only if the comments are negative.

      • Martha (Smith) says:

        “Why should it matter if the paper is published or not? If it was posted for others to read, it is available for critique. Furthermore, as far as I know, most folks don’t complain if someone speaks positively about their unpublished manuscript, only if the comments are negative.”

        +1

  5. Jonathan says:

    I skimmed the paper and my first thought was that 1980 to basically now is almost 2 generations, and that a whole lot has changed since 1980 in the structure of the union movement, in the response of the markets to issues regarding unionization, etc. I can’t see how one can say there’s ‘an’ effect because now is different from then unless you demonstrate that then and now are the same. I didn’t see an attempt to show that then is now. So let’s say there was an effect in the 1980’s and it ran one way, and then the effect changed in 1995 and then it changed again but became nothing much. I don’t know.

    I believe you noted – in the reference to time frames – that it’s hard to tell what they mean by stock crash risk. Who really cares about short term changes other than short term traders? What would be more meaningful to me is long term price declines: if unionization means stock price retains value over 10 years versus losing value, then maybe that’s worth something to investors. But that gets at categorization of data, which I didn’t get into – and didn’t see much of beyond eliminating codes for financial institutions and regulated businesses – because sectors decline or rise. Most investors treat sector effects as more important in the short term and even mid term than the quality of business management: you can be a genius manager in a declining industry and go bankrupt or a crap manager in a rising industry and make a fortune. The concept followed seemed divorced from the realities of business and investing. I wanted to see a table that compared each firm’s stock price with its sector prices over the relevant periods, over time, etc. That kind of stuff can be collected from about a gazillion online sources of stock prices. And that’s not getting into individual adjustments: there’s been so much change of ownership, change of concentration, etc. in the 35 years of data that you really need to examine each case, like you would in a medical study that looked at mortality and adverse events. I didn’t see how they ‘binned’ the results: was this done just by votes, with results from 1982 included in a bin with stuff from 2006?

    In 1980, very few companies had moved to Mexico and America manufactured a lot more stuff. I moved to Boston in 1985 and there was still a car assembly plant in Framingham. When I look at the Fed’s monthly and longer term employment numbers, you see ridiculous erosion in manufacturing employment. (And I lived before that in Detroit, which …) Financial rigging of companies had barely begun, meaning Wall Street then focused on debt and stock issuing and not the buying of companies to rip them apart, not the buying of companies and then piling on debt to pull out cash, etc. That’s all kind of important history. I’m reminded of Buffett’s comment when asked by someone at the Buffalo newspaper why they weren’t receiving raises when the stock price was so high, and he said something like they weren’t responsible for the stock rises. That was true: the sector went up because people were (stupidly in hindsight) pouring money into papers.

    My growing up in union town USA taught me that the reasons for union votes are complicated. So for example, a company under stress in a sector – or in a declining sector – may face unionization because the workers are worried about the future. I’ve seen many of those elections and they tend to be close because one group believes the union can protect their jobs and the other group believes the union will cause them to lose jobs faster. I tend to think companies under stress for whatever reason would have more stock price risk. Some of that is regularly quantified in alpha, etc. tracking, but you can see it in sectors without looking at individual companies.

    My assumption when looking at the weird graphs is that the discontinuity was related to the binning of data and the varying ages of the data, meaning a then and now issue plus whatever else.

    • Martha (Smith) says:

      “(And I lived before that in Detroit, which …) “

      From another person who lived in Detroit, had a grandfather on one side of the family who was a union man working in an auto factory; an uncle on the other side who was an attorney for a company that had an “employee’s association” that was or was not a company union depending on who you asked; and a father who had to destroy machinery (when the company he worked for closed down a plant and moved operations to another state) so that the company wouldn’t have to pay property taxes on idle machinery — well, there are a lot of other factors besides union votes that can come into play.

  6. This isn’t one quadratic, it’s two separate quadratics. There are essentially 6 degrees of freedom for the quadratics. This comes down to poor representation of a function whose behavior changes. This seems to be a consistent problem: people don’t understand the theory of function approximation.

    • Andrew says:

      Daniel:

      Setting aside the details—it may be that there are fewer than 6 df here, depending on how the model is specified in terms of main effects and interactions in the regression formulation—I think the bigger problem here is that there’s no underlying function here to approximate. To put it another way, this is not a function approximation problem, it’s an adjustment-for-differences-between-treatment-and-control-group problem. And it’s tough for people because math is hard. From high school math, you can get the impression that quadratic is the next logical step beyond linear. From statistics and econometrics, you can get the impression that it’s ok to just throw one more term into the regression model.

      • Martha (Smith) says:

        “From high school math, you can get the impression that quadratic is the next logical step beyond linear. From statistics and econometrics, you can get the impression that it’s ok to just throw one more term into the regression model.”

        +1

      • Ignore all the extra complications and imagine there is a real function to be estimated here and it’s polluted by noise; furthermore, assume that the function behaves differently on each side of the 50% point. Should you represent this as two unrelated functions? If you do that, how often do you get the right-hand half to be different from the left-hand half? They are always different unless every coefficient on the right-hand side equals the corresponding coefficient on the left-hand side. If nothing is going on and you have noise, then with probability 1 you will get a discontinuity, a different curvature, a different slope, etc. You are primed to “find a difference”.

        Now, suppose you model this as a single third-order Chebyshev polynomial plus a logistic-type sigmoid. If there is strong evidence of a discontinuity, you will get a large-magnitude, fast change; the parameters directly represent the kind of thing you are expecting. This is a kind of regularization, a kind of prior information, and it makes much more sense.
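        A quick sanity check of the first point above — this is a minimal sketch of my own (the sample size, cutoff, and noise level are arbitrary assumptions, not anything from the paper): fit two unrelated quadratics to pure noise on either side of a cutoff and look at the gap between them at the boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

def boundary_jump(n=200, cutoff=0.5):
    # Pure noise: there is no true function and no true discontinuity.
    x = rng.uniform(0, 1, n)
    y = rng.normal(0, 1, n)
    left, right = x < cutoff, x >= cutoff
    coef_left = np.polyfit(x[left], y[left], 2)
    coef_right = np.polyfit(x[right], y[right], 2)
    # The "treatment effect" such an RDD would report: the gap between
    # the two independent quadratic fits evaluated at the cutoff.
    return np.polyval(coef_right, cutoff) - np.polyval(coef_left, cutoff)

jumps = np.array([boundary_jump() for _ in range(1000)])
print((jumps != 0).mean())   # 1.0: the two fits never agree exactly
print(np.abs(jumps).mean())  # typical size of the purely spurious jump
```

        With probability 1 the two independently fitted curves disagree at the cutoff, so an estimated “discontinuity” appears in every single trial even though the data contain no signal at all.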

  7. Graduated says:

    In general I think RDD is really neat as a method, but in this case, we are expected to accept the following strange pair of claims:

    1. Companies with a union vote of 49% are very similar to companies with a union vote of 51%, except that they happened to form a union, and this explains the difference in NCSKEW and DUVOL.
    2. Companies with a union vote of 40% have basically the same average NCSKEW and DUVOL as 60% vote companies, but that doesn’t matter because 40% and 60% companies are too different from each other to be comparable.

  8. W.D. says:

    Andrew,

    I ran some Chow tests on random noise with an RDD on a quadratic model. Out of 1,000 trials, 40 passed the Chow test, whereas 0 passed for the linear model. This seems to confirm your hypothesis: https://ryxcommar.com/2019/06/26/chow-tests-with-quadratic-terms-on-random-noise/
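    For readers who want to reproduce something like this, here is a minimal sketch of one such simulation — my own setup, not necessarily the one in the linked post (sample size, cutoff, and noise distribution are assumptions): a Chow F-test comparing a pooled polynomial fit against separate fits on each side of the cutoff, applied to pure noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def chow_pvalue(x, y, cutoff, degree):
    """Chow test p-value: does splitting the polynomial fit at the
    cutoff significantly reduce the residual sum of squares?"""
    def ssr(xs, ys):
        coef = np.polyfit(xs, ys, degree)
        return np.sum((ys - np.polyval(coef, xs)) ** 2)
    left, right = x < cutoff, x >= cutoff
    k = degree + 1                        # parameters per segment
    dof = len(x) - 2 * k
    ssr_pooled = ssr(x, y)
    ssr_split = ssr(x[left], y[left]) + ssr(x[right], y[right])
    f = ((ssr_pooled - ssr_split) / k) / (ssr_split / dof)
    return stats.f.sf(f, k, dof)

def rejection_rate(degree, trials=1000, n=200, alpha=0.05):
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = rng.normal(0, 1, n)           # random noise, no real break
        hits += chow_pvalue(x, y, 0.5, degree) < alpha
    return hits / trials

rate_linear = rejection_rate(degree=1)
rate_quadratic = rejection_rate(degree=2)
print(rate_linear, rate_quadratic)
```

    Note that in this particular setup the Chow statistic is an exact F-test under iid normal noise, so both rejection rates should land near the nominal 5% level; differences between the linear and quadratic cases in other setups come from the details of the data-generating process and test specification.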

Leave a Reply