Statistical controversy on estimating racial bias in the criminal justice system

1. Background

A bunch of people have asked me to comment on these two research articles:

Administrative Records Mask Racially Biased Policing, by Dean Knox, Will Lowe, and Jonathan Mummolo:

Researchers often lack the necessary data to credibly estimate racial discrimination in policing. In particular, police administrative records lack information on civilians police observe but do not investigate. In this article, we show that if police racially discriminate when choosing whom to investigate, analyses using administrative records to estimate racial discrimination in police behavior are statistically biased, and many quantities of interest are unidentified—even among investigated individuals—absent strong and untestable assumptions. Using principal stratification in a causal mediation framework, we derive the exact form of the statistical bias that results from traditional estimation. We develop a bias-correction procedure and nonparametric sharp bounds for race effects, replicate published findings, and show the traditional estimator can severely underestimate levels of racially biased policing or mask discrimination entirely. We conclude by outlining a general and feasible design for future studies that is robust to this inferential snare.

Deconstructing Claims of Post-Treatment Bias in Observational Studies of Discrimination, by Johann Gaebler, William Cai, Guillaume Basse, Ravi Shroff, Sharad Goel, and Jennifer Hill:

In studies of discrimination, researchers often seek to estimate a causal effect of race or gender on outcomes. For example, in the criminal justice context, one might ask whether arrested individuals would have been subsequently charged or convicted had they been a different race. It has long been known that such counterfactual questions face measurement challenges related to omitted-variable bias, and conceptual challenges related to the definition of causal estimands for largely immutable characteristics. Another concern, raised most recently in Knox et al. [2020], is post-treatment bias. The authors argue that many studies of discrimination condition on intermediate outcomes, like being arrested, which themselves may be the product of discrimination, corrupting statistical estimates. Here we show that the Knox et al. critique is itself flawed, suffering from a mathematical error. Within their causal framework, we prove that a primary quantity of interest in discrimination studies is nonparametrically identifiable under a standard ignorability condition that is common in causal inference with observational data. More generally, though, we argue that it is often problematic to conceptualize discrimination in terms of a causal effect of protected attributes on decisions. We present an alternative perspective that avoids the common statistical difficulties, and which closely relates to long-standing legal and economic theories of disparate impact. We illustrate these ideas both with synthetic data and by analyzing the charging decisions of a prosecutor’s office in a large city in the United States.

I heard about these papers a couple years ago but didn’t think too hard about them. Then recently a bunch of different people contacted me and asked for my opinion. Apparently there’s been some discussion on twitter. The Knox et al. paper was officially published in the journal so maybe that was what set off the latest round of controversy.

Also, racially biased policing is in the news. Not so many people are defending the police—even the people saying the police aren’t racially biased are making that claim based on evidence that police mistreat white people too—but I suppose that much of the current debate revolves around the idea that the police enforce inequality. So whatever the statistics are regarding police mistreatment of suspects, these are part of a larger issue of the accountability of police use of force. We’ve been hearing a lot about police unions, which raises other political questions, such as what is it like for police officers who disagree with the positions taken by their leadership.

Anyway, that’s all part of the background for how this current academic dispute has become such a topic of discussion.

2. Disclaimer

I have no financial interest in this issue, nor do I have any direct academic stake. Some colleagues and I did some research twenty years ago on racial disparities in police stops, but neither of the new articles at hand has any criticism of what we did, I guess in part because our analysis was more descriptive than causal.

I do have a personal interest, though, as Sharad Goel and Jennifer Hill are friends and collaborators of mine. So my take on these papers is kind of overdetermined: Jennifer is my go-to expert on causal inference (although we have never fought more than over the causal inference chapters in our book), and I have a lot of respect for Sharad too as a careful social science researcher. I’ll give you my own read on these articles in a moment, but I thought I should let you know where I’m coming from.

Just to be clear: I’m not saying that I expect to agree with Gaebler et al. because Sharad and Jennifer are my friends—I disagree with my friends all the time!—; rather, I’m saying that, coming into this one, my expectation is that, when it comes to causal inference, Jennifer knows what she’s talking about and Sharad has a clear sense of what’s going on in this particular application area.

3. My read of the two articles

Suppose you’re studying racial discrimination, and you compare outcomes for whites to outcomes for blacks. You’ll want to adjust for other variables: mere differences between the two groups do not demonstrate discrimination. In addition, you won’t be working with the entire population; you’ll be working with the subset who are in some situation. For example, Knox et al. look at the results of police-civilian encounters: these are restricted to the subset of civilians who are in this encounter in the first place.

Knox et al. make the argument that analyses using administrative data are implicitly conditioning on a post-treatment variable because they subset on whether you were stopped or not. To use the words in their title, adjusting for or conditioning on administrative records can mask racially biased policing. Knox et al. point out that this bias can be viewed as an example of conditioning on intermediate outcomes. It’s a well known principle in causal inference that you can’t give regression coefficients a direct causal interpretation if you’re conditioning on intermediate outcomes; see, for example, section 19.6, “Do not adjust for post-treatment variables,” of Regression and Other Stories, but this is standard advice—my point in citing ourselves here is not to claim any priority but rather to say that this is standard stuff.

Knox et al. are right that conditioning on post-treatment variables is a common mistake. R. A. Fisher made that mistake once! Causal inference is subtle. In general, there’s a tension between the principles of adjusting for more things and the principle of not adjusting away the effect that you’re interested in studying. And, as Knox et al. point out, there’s an additional challenge for criminal justice research because of missing data: “police administrative records lack information on civilians police observe but do not investigate.” This is an important point, and we should always be aware of the limitations of our data.
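To make the selection concern concrete, here is a minimal sketch in Python. It is not the model from either paper: the unobserved "suspicion" score, the stop thresholds, and the force probabilities are all made-up assumptions, chosen only to show how a comparison among stopped civilians can understate a real effect when the decision to stop is itself discriminatory. Expected rates are computed on a fine grid rather than by sampling, so there is no simulation noise.

```python
# Toy model; every number here is an illustrative assumption, not an estimate.
TRUE_EFFECT = 0.10                  # extra probability of force for minority civilians once stopped
STOP_THRESHOLD = {0: 0.7, 1: 0.3}   # officers stop minorities (1) at lower suspicion than whites (0)

def force_prob(suspicion, minority):
    """Probability of force given unobserved suspicion in [0, 1] and race."""
    return 0.1 + 0.4 * suspicion + TRUE_EFFECT * minority

def mean_force_among_stopped(minority, grid=10_000):
    """Average force probability over the civilians who are actually stopped."""
    suspicions = [(i + 0.5) / grid for i in range(grid)]  # suspicion uniform on [0, 1]
    stopped = [s for s in suspicions if s > STOP_THRESHOLD[minority]]
    return sum(force_prob(s, minority) for s in stopped) / len(stopped)

# What the administrative records show: a difference in means among stopped civilians.
naive_gap = mean_force_among_stopped(1) - mean_force_among_stopped(0)
print(f"naive gap among stopped: {naive_gap:.3f}; true effect once stopped: {TRUE_EFFECT:.3f}")
```

In this toy setup the stopped whites are, on average, more suspicious than the stopped minorities, so the naive gap comes out near 0.02 even though the effect built into the model is 0.10. The records contain no row for anyone who was observed but not stopped, which is why the data alone cannot reveal the difference in thresholds.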

From a substantive point of view, the message that I take from Knox et al. is to be careful with what might seem to be kosher regression analyses purporting to estimate the extent of discrimination. I would not want to read Knox et al. as saying that regression methods can’t work here. You have to be aware of what you are conditioning on, but if you interpret the results carefully, you should be able to estimate a causal effect of one stage in the process. The concern is that it can be easy for the most important effects to be hidden in the data setup.

Gaebler et al. discuss similar issues. However, they disagree with Knox et al. on technical grounds. Gaebler et al. argue that there exist situations in which you may be able to estimate causal effects of discrimination or perception of race just fine, even when conditioning on variables that are affected by race. It’s just that these won’t capture all aspects of discrimination in the system; they will capture the discrimination inherent solely at that point in the process (for example, when a prosecutor makes a decision about charging).

For a simple example, suppose you have stone-cold evidence of racial bias in sentencing. That’s still conditional on who gets arrested, charged, and prosecuted. So it doesn’t capture all potential sources of racial bias, not by a long shot. Or, maybe you find no racial bias in sentencing. That doesn’t mean that total racial bias is zero; it just means that you don’t find anything at that stage in the process.

I see Gaebler et al. as being in agreement with Knox et al. on this key substantive point. The distinction is that Knox et al. are saying that in principle you can’t estimate effects of discrimination using the basic causal regression, whereas Gaebler et al. are saying that it can be, and often is, possible to estimate these effects, even though these effects are not the whole story.

Gaebler et al. give an example where the estimand corresponds to what one would measure in a randomized controlled trial where the stated race of arrested individuals was randomly masked on the police narratives that prosecutors use when making their decisions. With this setup, it is possible to estimate causal racial bias in a particular part of the system.
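Here is a rough sketch of that thought experiment in Python. It is a hypothetical simulation, not Gaebler et al.'s analysis: the case-strength distribution, the charging rule, and the size of the bias are assumptions for illustration. The point is only that, when masking is randomized among arrested individuals, a simple difference in charge rates recovers the effect of the prosecutor seeing race at this one stage, whatever process produced the arrests.

```python
import random

random.seed(0)
BIAS = 0.08  # assumed extra charge probability when the prosecutor sees that the arrestee is Black

def charge_prob(strength, sees_black):
    """Hypothetical charging rule: case strength plus a bias term if race is visible."""
    return 0.2 + 0.5 * strength + (BIAS if sees_black else 0.0)

shown, masked = [], []
for _ in range(200_000):                # Black arrestees only
    strength = random.random() ** 2     # skewed toward weak cases, e.g. after biased arrests
    is_masked = random.random() < 0.5   # race randomly hidden on the police narrative
    charged = random.random() < charge_prob(strength, sees_black=not is_masked)
    (masked if is_masked else shown).append(charged)

# Difference in charge rates between narratives with race shown and race masked.
estimate = sum(shown) / len(shown) - sum(masked) / len(masked)
print(f"estimated stage-specific effect: {estimate:.3f} (built-in bias: {BIAS})")
```

The estimate lands close to the built-in bias, but it says nothing about discrimination in who was arrested: changing the `strength` distribution changes who is in the sample without changing what this comparison measures.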

For another example, suppose you did a study of racial discrimination and you included, as a predictor, the neighborhood where people were living. And you found some amount of discrimination X, a difference between what happens to whites and to blacks, conditional on neighborhood. That could be a causal effect, but, even if X = 0, that would not mean that there is no racial discrimination. If there is discrimination by neighborhood, this can have disparate impact.
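A small worked example in Python shows the arithmetic. The two neighborhoods, the stop rates, and the residential shares are invented numbers, used only to show how a race gap of exactly zero within each neighborhood can coexist with a large overall disparity.

```python
# All numbers are made up for illustration.
STOP_RATE = {"A": 0.05, "B": 0.20}       # stop rate depends on neighborhood only
SHARE_IN = {                             # where each group lives (segregated)
    "black": {"A": 0.2, "B": 0.8},
    "white": {"A": 0.8, "B": 0.2},
}

def stop_rate(group, neighborhood):
    """Within a neighborhood, police treat both groups identically."""
    return STOP_RATE[neighborhood]

def overall_rate(group):
    """Population stop rate for a group, averaging over where its members live."""
    return sum(SHARE_IN[group][n] * stop_rate(group, n) for n in STOP_RATE)

# X: the black-white gap conditional on neighborhood (zero by construction).
within_gap = {n: stop_rate("black", n) - stop_rate("white", n) for n in STOP_RATE}
# The unconditional disparity that residents actually experience.
overall_gap = overall_rate("black") - overall_rate("white")
print(within_gap, round(overall_gap, 3))
```

Here X = 0 in both neighborhoods, yet the overall stop rate is 17% for black residents and 8% for white residents, a gap of 9 percentage points that arises entirely through neighborhood.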

4. From the study of discrimination to the study of disparate outcomes

One thing I especially like about the Gaebler et al. article is that they move beyond the question of racial discrimination to the more relevant, I think, issue of disparate outcomes: “much of the literature has framed discrimination in terms of causal effects of race on behavior, but other conceptions of discrimination, such as disparate impact, are equally important for assessing and reforming practices.”

Again, suppose that all discrimination were explainable not by race but by what neighborhood you live in. People are segregated geographically, so discrimination by neighborhood really would be a form of racial discrimination, but this could be set up so that the causal effect of race itself (for example, in police or prosecutor decisions) would be zero. But from the Knox et al. perspective, you could say that this control variable (the ethnic and racial composition of your neighborhood) is an intermediate outcome. I prefer Gaebler et al.’s framing, but ultimately I think both articles are coming to the same point here.

And here’s how Gaebler et al. end it:

The conclusions of discrimination studies are generally limited to specific decisions that happen within a long chain of potentially discriminatory actions. Quantifying discrimination at any one point (e.g., charging decisions) does not reflect specific or cumulative discrimination at other stages—for example, arrests. Looking forward, we hope our work offers a path toward quantifying disparities, and provokes further interest in the subtle conceptual and methodological issues at the heart of discrimination studies.

I agree. Lots more can be said on this topic, and I recommend Gaebler et al. as a starting point.

5. So why the controversy?

Given that, from my perspective, the two papers have such similar messages, why the controversy? Why the twitter war?

I guess I can see where the authors of both papers are coming from. From the perspective of Gaebler et al., the distinction is clear. Knox et al. have made a mathematically false statement leading to a confusion about the identifiability of causal effects within the justice system. Knox et al. don’t seem to agree regarding the mathematical error, but in any case they take the position that their result is correct in practice, labeling Gaebler et al.’s results as “silly logic puzzles about knife-edge scenarios” that “aren’t useful to applied researchers.”

The concern I have about Knox et al. is that they make a claim that’s too strong. They say that they “show that when there is any racial discrimination in the decision to detain civilians . . . then estimates of the effect of civilian race on subsequent police behavior are biased absent additional data and/or strong and untestable assumptions,” that “the observed difference in means fails to recover any known causal quantity without additional, and highly implausible, assumptions,” and that this estimation strategy “[cannot] be rehabilitated by simply redefining the quantity of interest” or by adding additional covariates.

In some sense, sure, you can’t make any causal inference without strong assumptions. Even in a textbook randomized controlled experiment, all you’re doing is estimating the average causal effect for the people in the study (or a larger population for which your participants can be considered a random sample). When your data are observational, you need even more assumptions. And don’t get me started on instrumental variables, regression discontinuity, etc.: all these methods can be useful, but they don’t come without lots of modeling.

But that can’t be what Knox et al. are trying to say. The statement, “estimates of the effect of X on Y are biased absent additional data and/or strong and untestable assumptions,” is true for any X and Y. It’s a good point to make, quite possibly worth a paper in the APSR, but nothing in particular to do with criminal justice. The relevant point to make, and hence the point I will extract from Knox et al., is that, in this particular example of arrests, the problem of selection is, in the words of Morrissey, really serious. But you can use standard methods of causal inference to estimate causal effects here, as long as you’re careful in interpreting the results and don’t take the estimate of discrimination in one part of the system as representing the entirety of racial bias in the whole process.

My own resolution to this is to take Knox et al.’s message of the difficulty of causal inference under selection, as applied to the important problem of estimating disparities in the criminal justice system, as a caution to be careful in your causal thinking: Define your causal effects carefully, and recognize their limitations (as in the above example where the causal effect of race in arrest or sentencing decision, conditional on neighborhood, might be of interest, but at the same time we realize this particular causal effect would only capture one of many sources of racial discrimination). I like Gaebler et al.’s framing of the problem in terms of the specificity of their definition of the treatment variable. I think the authors of both papers would agree that overinterpretation of naive regressions has been a problem, hence the value of this work.

6. tl;dr summary

From both papers I draw the same substantive conclusion, which is that simple, or even not-so-simple, regressions of outcome on race and pre-treatment predictors can give misleading results if you’re trying to understand the possibility of racial discrimination in the criminal justice system without thinking carefully about these issues.

There are also some technical disputes. It’s my impression that Gaebler et al. are correct on these issues, but, now that I have a sense of the statistical issues here, I’m not so interested in the theorem, or the explanation of error in the theorem, or the error in the explanation of the error in the theorem, or the corrected explanation of the error in the theorem. My read of this particular dispute is that Knox et al. were trying to prove something that is not quite correct. The correct statement is that, even when standard regression-based inferences allow you to estimate a causal effect of race at some stage in the criminal justice process, this causal effect is conditional on everything that came before, and so a focus on any particular causal effect will not catch other biases in the system. The incorrect statement is that you can’t estimate causal effects in standard regression-based inferences. These local causal effects don’t tell the full story, but they’re still causal effects, and that’s the technical point that Gaebler et al. are making.

My tl;dr is different from that of political scientist Ethan Bueno de Mesquita, who wrote this:

I disagree with the “dreck” comment.

Don’t get me wrong. I have no general problem with labeling papers as dreck; I’ve many times called published papers “crappy,” and this blog is no stranger to pottymouth. It’s just that I like the Gaebler et al. paper. I disagree with the above-quoted assessment of its quality. I can’t really say more unless I hear Bueno de Mesquita’s reasons for giving it the “dreck” label. He and others should feel free to respond in the comments.


  1. ZB says:

    Andrew, as a long-time reader, I’m taken aback that you would endorse a paper that is, in no uncertain terms, saying “if you assume away selection problems, you can estimate causal effects.” That’s both trivially true and monumentally bad research advice! You’ve castigated papers for saying less egregious things.

    • Jake T says:

      Your comparison seems incomplete. In fact, I could propose a pithy summarization of Gaebler et al. as, “It is possible to carefully measure causal effects at any given stage of a multi-stage process.” KLM, in that framing, argue that such careful analysis is impossible, and your conclusions will always be false. Since the universe is turtles all the way down, the absurd extension is that causal inference is impossible!

      In particular, KLM are _not_ arguing about omitted variables or selection effects at the second stage. That would have been a valid issue, though well understood. They are arguing that one must consider omitted variable bias at the first stage, even while studying and making claims only about the second. They certainly haven’t provided sufficient evidence to justify this extraordinary claim – as far as I can tell, they haven’t even corrected their invalid proof from the paper.

      • Sam Bailey says:

        Say minorities are arrested at random and whites are arrested only if the officer has video evidence of them committing a crime. Just comparing Pr(prosecution|minority=0, arrested) and Pr(prosecution|minority=1, arrested) seems like it would not give us numbers we’re interested in. Even if prosecutors use race in deciding whether to charge people, the sets of people are selected to be different enough to make comparisons hard. Is this what Mummolo means when he talks about collider bias?

        Of course, we wouldn’t actually compare Pr(prosecution|minority=0, arrested) and Pr(prosecution|minority=1, arrested), we’d compare Pr(prosecution|minority=0, arrested + OTHER STUFF) and Pr(prosecution|minority=1, arrested + OTHER STUFF). Is Gaebler et al.’s point that we can actually do a pretty good job capturing the “OTHER STUFF,” that there’s not much the prosecutor sees that the researcher does not?

      • Ricardo Silva says:

        I won’t comment on the application, but I have a couple of thoughts on the framing of the methods.

        I didn’t read any of the papers but I browsed the Twitter discussion (I know, measurement error and noisy channels and all that…) But the discussion of the so-called ‘mathematical error’ seems to hinge on each group of authors having a different interpretation of what ‘nonparametric identification’ is. Gaebler et al. discuss nonparametric constraints on top of those implied by the causal structure, a reasoning not entirely dissimilar to adopting monotonicity in some types of (statistical) nonparametric IV analysis. KLM seem to follow the notion of “no other constraints other than those entailed by the causal structure”, which is more like the use of the word by Pearl and others. Both have a point, but it seems to boil down to a game of semantics rather than anything substantively wrong on KLM’s ‘necessary’ conditions.

        I have to say though, by just staring at that formal definition of subset ignorability in Gaebler et al.: it is so convoluted and opaque it almost reads like a parody of a hypothetical researcher’s bending-over-backwards-checklist to achieve identifiability at all costs (don’t take this comment too seriously, as I haven’t read the paper in detail). KLM’s core point seems to be a very standard “go for partial identifiability if we can’t really justify getting rid of the impact of those unmeasured confounders”.

        • Jake T says:

          I don’t think a casual dismissal of this discussion as one of semantics is appropriate. If KLM’s proof of their necessary conditions is correct, it invalidates important existing research on discrimination in prosecution decisions. That is a profound implication for practitioners and policy makers. There are charitable interpretations of true claims KLM might have been trying to make, and some are discussed in this post. At the end of the day, it is up to KLM to publish and publicize what they mean. That is not a task which should be left to the reader.

          The style or real world believability of the counter-example in your eyes is not particularly relevant. Gaebler set out to invalidate KLM’s proof and succeeded (unless I am missing something). We are now venturing into opinion, and mine is that KLM should offer a revised claim in light of the counter-example and/or retract their work until they are in a position to do so.

          And while we are entertaining my opinions, I think the discussion about the issue as linked on Twitter has been unfortunately dismissive and acrimonious. Private apologies are likely in order.

          • Carlos Ungil says:

            Where does the following summary fail? (I may be misunderstanding the controversy and I’d like to understand how.)

            KLM: if there is an unobserved difference, apart from race, that affects both the first stage outcome and the second stage outcome, [issues].

            Counter-example: if there are no unobserved differences that affect both the first stage outcome and the second stage outcome, [no issues].

            • Carlos Ungil says:

              I had not seen the “corrected” link, which is more relevant. They show an example where the estimator is unbiased despite unobserved confounding. But what they are showing, it seems, is that the estimator is biased in that model but some specific probability distributions for race and confounder can be found that make the bias zero. I’m not sure that constitutes really a counter-example.

              • Ricardo Silva says:

                Indeed, it seems to be the case that subset ignorability is a ‘measure-zero’ constraint under the model structure proposed by Knox et al., that is, if we put a continuous prior on parameters of the model with unmeasured confounding, there is probability zero that subset ignorability will hold (I’m happy to learn otherwise if this is not the case). Make of that what you will. It’s reasonable to argue that if KLM wants to make a precise mathematical point, they should be more precise on how their result relies on excluding these cases, and whether claims of necessity are that interesting outside of the cases implied by the graph. But it looks like there is also something to be said about whether subset ignorability is plausible under the original paper’s structural assumptions

          • Ricardo Silva says:

            Perhaps the following parallel is helpful. The usual instrumental variable assumptions (exogenous instrument, exclusion-restriction) are not enough to ‘nonparametrically identify’ the average treatment effect in the sense that, without knowing anything else about the model but the joint distribution of the observed variables, we cannot provide a unique answer (bounds on the effect are still possible in some cases like discrete distributions).

            But we can still add ‘nonparametric constraints’ to the mix to get identifiability. For instance, shape constraints (like monotonicity) on some response functions: that’s still nonparametric, but arguably against the spirit of the original ‘definition’ of nonparametric identification in terms of the causal graph, which is a useful distinction. For instance, assuming that the contribution of the confounders to the outcome variable is additive can also be interpreted as a nonparametric constraint (infinite number of parameters still allowed) that gets you identifiability, but it’s not ‘nonparametric’ in the graphical sense.

            The entire premise that it’s useful to separate graphical and non-graphical conditions for identifiability lies precisely on the fact that we can build a continuum of more and more convoluted non-graphical conditions to achieve identifiability. I find it useful to draw a line somewhere saying ‘this is what I mean by a useful sandbox of defensible assumptions’ because we can build generic methods for those, like bounding methods or do-calculus that don’t exploit particular shape assumptions. We can’t seriously expect people to be willing to draw stronger and stronger assumptions for the sake of always getting a unique answer. When we build a method, we are implicitly saying ‘and that’s what you get without further assumptions’. I agree it would be helpful for KLM to make that more explicit considering the general interest and heterogeneous audience of their paper.

            I also agree that the tone of the Twitter discussion was unnecessarily aggressive, hence I was trying to understand what the clash was given that both papers have something relevant to say. I found it useful to draw this parallel; hopefully other people will too.

    • Andrew says:


      Gaebler et al. aren’t suggesting to assume away selection problems. They’re saying that you can do causal inference on a selected subset of the population, as long as you are clear that the inference is conditional. This is what is done in medical experiments all the time, when they assign treatments to volunteers.

      • Carlos Ungil says:

        “In our recurring example, subset ignorability means that among arrested individuals, after conditioning on available covariates, race (as perceived by the prosecutor) is independent of the potential outcomes for the charging decision.”

        Assuming that the potential outcomes are independent of the actual treatment (conditional on the observed covariates) is not an issue in the medical experiment example where treatments are assigned randomly.

        When the perception of race by the prosecutor is not random, the plausibility of the ignorability assumption may be questioned. But it’s clear that, if it holds, omitted-variable bias is not a problem.

  2. ZB says:

    >But that can’t be what Knox et al. are trying to say. The statement, “estimates of the effect of X on Y are biased absent additional data and/or strong and untestable assumptions,” is true for any X and Y. It’s a good point to make, quite possibly worth a paper in the APSR, but nothing in particular to do with criminal justice.

    Also, one quick point on this: I agree (as you imply, this is trivial!). On the other hand, this is a literature that is *very* public-facing in a way other topics are not, and perhaps we should think about our threshold for trading off potential bias vs. tractability. We might worry that poorly-identified papers may negatively affect discourse to a magnitude much greater than anything about, say, himmicanes.

  3. David says:

    Wouldn’t one problem with the Gaebler et al. approach be that you end up with a very local treatment effect for downstream decisions? Because of the disparity in stops there is potentially very little (or no) overlap in the populations that end up in court so any estimate of bias by the prosecutors is almost meaningless. For the purpose of making policy recommendations, having one study that says there is bias in stops and one study that says there’s no bias in prosecutorial decisions conditional on stops gives us very little information on the counterfactual where we somehow eliminate the bias in stops.
    We definitely shouldn’t conclude that if we could fix the bias in stops, we don’t need to do anything to fix prosecutorial decisions because of a (potentially perfectly identified) lack of effect conditional on stops in current administrative data. We may be able to estimate “a causal effect of one stage in the process” but we are not estimating THE causal effect of that stage, only an extremely local one.

    This may all be obvious to many social science researchers (although I don’t think it is except to very statistically competent ones like you (Gelman) and Gaebler/Hill/Goel/Knox/Mummolo and all their coauthors) but it certainly isn’t to policy makers who get told that there is no estimated bias at the level of prosecutors. My sense is that that is what Knox et al. mean when they say these estimates aren’t useful to applied researchers, especially those who work with policy-makers.

    • Jake T says:

      There is lots of work showing bias at the prosecution step, including by Goel, so your hypothetical concern about “nothing to fix here” implications for policy is not substantiated. There is, in fact, plenty to improve at every step. In addition, it suggests “we definitely shouldn’t conclude that if we could fix the bias in stops, we don’t need to do anything to fix prosecutorial decisions” is a straw man.

      • This isn’t necessarily true in other areas of research, and the papers aren’t really about Justice reform, they’re about the fundamental problem of causal inference with examples from justice reform.

        Another example might be, say, medical treatment. Maybe individual doctors don’t make decisions about surgery based on socioeconomic status, but there are many, many things that keep poor/minority people from even getting to the doctor in the first place, for example.

        • Jake T says:

          This is a good point. I think my response would also generalize: what are cases where all the relevant factors are perfectly identified between stages such that all the relevant variables in the 2nd stage perfectly obscure any bias? To what extent is this concern primarily hypothetical? In addition, there are other tools you can use to understand whether this is happening in your analysis to confirm the robustness of your conclusion.

          I would, separately, like to emphasize that though this is an interesting question, it is not the policymaking point that either KLM or Gaebler made.

  4. Ariel says:

    This issue is extremely important for social sciences, and has come up in the Harvard Discrimination case as well, with DiNardo vs. Card. I think that an important vantage point may be what we do observe in cases where there is widespread agreement of no or very little bias. For example, the overrepresentation of males in jail (relative to population share), or of Asians/Jews in STEM, or of African Americans in sports. Estimate “Causal” Regressions there, and get a sense of the effect sizes and statistical significance of disparate outcomes without discrimination.

  5. Justin Pickett says:

    Andrew, you wrote:

    “Gaebler et al. argue that there exist situations in which you may be able to estimate causal effects of discrimination or perception of race just fine, even when conditioning on variables that are affected by race. It’s just that these won’t capture all aspects of discrimination in the system; they will capture the discrimination inherent solely at that point in the process (for example, when a prosecutor makes a decision about charging). For a simple example, suppose you have stone-cold evidence of racial bias in sentencing. That’s still conditional on who gets arrested, charged, and prosecuted. So it doesn’t capture all potential sources of racial bias, not by a long shot. Or, maybe you find no racial bias in sentencing. That doesn’t mean that total racial bias is zero; it just means that you don’t find anything at that stage in the process.”

    But the issue is not just adjusting away from the total effect (via controls for intervening variables), right? It is also that many variables are colliders, so sampling on them (just looking at the effect of race at sentencing) risks collider bias if race and other things (e.g., neighborhood conditions) both affect arrest, prosecution, etc. That is, conditioning on arrest opens a backdoor path to sentencing, inducing spuriousness? That has been one of the key issues in sentencing research — it is difficult to study bias even just at one stage:

    • Andrew says:


      I agree that you can’t easily interpret regression coefficients causally if the regression is conditioning on post-treatment predictors. As Gaebler et al. discuss, conditioning on who is arrested is pre-treatment with respect to the causal effect of perception of race on later decisions.

      To put it another way, Knox et al. are correctly pointing out that, if you’re trying to estimate the entire effect of race on the sentencing system, for example, you have to consider the effects of race on everything that came before. And Gaebler et al. are right that regression can be used to estimate causal effects of race in the intermediate steps. You can study bias at just one stage, but that inference is only relevant for the subset of cases that reach that stage.

      • Carlos Ungil says:

        > And Gaebler et al. are right that regression can be used to estimate causal effects of race in the intermediate steps. You can study bias at just one stage, but that inference is only relevant for the subset of cases that reach that stage.

        The existence of bias depends on the model. For example, a prosecutor that decides to charge people by flipping a coin would not be discriminating if the model doesn’t include any covariates apart from race but may be discriminating within a more complex model (which cared about evidence, for example). Gaebler et al. show that you can study bias at just one stage in some models, when the unknown bias introduced in the previous stage is indeed irrelevant because you assume it can be ignored.

  6. Jonathan says:

    I only read the 2nd paper, which I enjoyed very much, both for the subject matter and the ignorability discussion. The subject matter interests me because I’ve sort of done that job; while in school, one part of the learning experience was sitting with the charging prosecutors in a very large, very violent city. I read arrest reports and went through the lists of offenses and their elements. In practice, the model operated very much as described in the paper. There were differences of course. The gist of running a large, busy prosecutor’s office is to force pleas because you can’t run that many trials. That city, like many, could run years-long backlogs of cases. And like most cities seem to do, they’d bring in old judges to dispose of old cases, meaning a lot of rapid dismissals because you can’t easily prove rather ordinary criminal events that long after the fact. So they’d get 90+% pleas, with maybe 6% or so going to trial, winning about half of those.

    But that is an extension of the model because most of the cases where you’d expect a plea were people getting caught in the act. That might mean fewer charges because you had the guy on what you needed and you wanted to keep that clean. Other times, you might look at the arrest report and criminal record and you’d charge everything because that case might come to trial, meaning the pre-trial prosecutor maybe could use the extra leverage in negotiations. To be clear, we had charging prosecutors and then pre-trial negotiation and then trial, though the latter two could overlap. The people doing trials also tended to do the preliminary exam hearings, which were way, way more relaxed, being only probable cause hearings (unless it was a big money drug case and then it was a battle). As a note, I wonder how cellphones have changed the dynamic. Back then, you prayed the officers and witnesses would show. You’d go out into the hallways hunting for your witnesses. A constant problem in preliminary exams was getting in touch with the police supervisors, the sergeants, etc. who would testify about the reports but who wouldn’t even have read the file. Mobiles would really have helped.

    BTW, the area where I thought the most racial difference could occur was probation. They were radically over-worked, trying to do social work in a criminal setting, and the lever was no longer as binary. That is, the criminal justice process does chain into a series of decisions that can be bounded and turned into yes/no as it processes toward an end of identifying those who are guilty of a specified crime. But after? You have no reasonable doubt requirement. In the system, that doubt requirement runs from seeing the person as a suspect, where reasonable doubt enters, through every step until conviction. It disappears on conviction and then probation can much more easily put you away. The grouping characteristic changes, so ‘doubt’ is replaced by ‘no doubt’. That presumptive no is a different standard. It’s not as loose a fit as deciding which person to look at, but it’s much looser than the need to prove ‘no doubt’.

    • Ben says:

      > The gist of running a large, busy prosecutor’s office is to force pleas because you can’t run that many trials.

      > That city, like many, could run years-long backlogs of cases. And like most cities seem to do, they’d bring in old judges to dispose of old cases, meaning a lot of rapid dismissals because you can’t easily prove rather ordinary criminal events that long after the fact.

      Was the advantage here that pleas are expedient? It sounds like there’s another thing here too — that old cases just become unprosecutable, and so pushing for a plea avoids this?

      > you prayed the officers and witnesses would show

      Someone told me this was a difficulty in prosecuting police, though I don’t know if it’s the only one. The argument was that police could stop pressing other charges if they were attacked, and then the DA wouldn’t be able to do prosecutions or something? Did you see a police vs. DA dynamic?

  7. Roy says:

    I know very little about any of this, but Jonathan Mummolo gives a lengthy response if anyone is interested at:

    Was linked to awhile back by Richard McElreath.

  8. Dean Eckles says:

    “I heard about these papers a couple years ago but didn’t think too hard about them. Then recently a bunch of different people contacted me and asked for my opinion. Apparently there’s been some discussion on twitter. The Knox et al. paper was officially published in the journal so maybe that was what set off the latest round of controversy.”
    It sounds like you may have received a privately circulated version of the Gaebler et al. paper (or are thinking of a different paper), since I believe the first public version is dated June 2020. So I think Gaebler et al. posting that paper is very much what set off the current controversy.

  9. Julian says:

    “Define your causal effects carefully, and recognize their limitations”

    I think this is the key takeaway. To me the disagreements stem from differences in the definition of the estimand of interest. Of course you can estimate *some* causal effect under certain assumptions so Gaebler et al. are not wrong, but that seems to be missing the point, really. In my opinion, the estimand of interest in Knox et al. is the far more interesting and substantively important.

    • Carlos Ungil says:

      > The arrest decision is post-treatment relative to the officer’s perception of race but, importantly, it is pre-treatment relative to the prosecutor’s perception of race.

      The race of the defendants is not going to change between the time of their arrest and the time they’re processed. What does pre-treatment mean?

    • Adrian says:

      I would say both are important. It is definitely informative if the bias is seemingly not present at the second stage but only in the first one. If anything that’s exactly how we can learn about how to address the substantive issue at hand.

      • Julian says:

        I didn’t say one was unimportant.

        > It is definitely informative if the bias is seemingly not present at the second stage but only in the first one. If anything that’s exactly how we can learn about how to address the substantive issue at hand.

        Agreed, but if bias is present in the first stage, how do we assess whether bias is present or not in the second stage?

        As Carlos wrote in another comment, “Gaebler et al. show that you can study bias at just one stage in some models, when the unknown bias introduced in the previous stage is indeed irrelevant because you assume it can be ignored.”

  10. Carlos Ungil says:

    > This paper has generated considerable online discussion between the authors of this manuscript and Knox et al., summarized in the Twitter threads referenced below.

    This is not the future we were promised. (I just hope the next thing doesn’t make academic discussions on twitter look good.)

  11. Guido Biele says:

    Does the following explain some of the disagreement:
    One is interested in the effect of racism R on sentencing S, but one can only observe people in front of judges J. We also know that R influences if one will be in front of a judge, and that there are observed confounders C, which influence both sentencing S and if a person stands in front of a judge (J).
    This scenario can be sketched as follows:
    R -> J <- C -> S
    and it is possible to obtain an unbiased estimate of racism R on sentencing S by adjusting for C.

    However, one can argue that it is unreasonable to assume that one has data about all common causes of J and S. Such a scenario, writing U for the unobserved common causes, can be sketched as follows:
    R -> J <- C -> S
    J <- U -> S
    and now it is no longer possible to obtain an unbiased estimate of racism on sentencing, because one cannot adjust for unobserved confounders.

    In a way that seems too simple an explanation for the disagreement. But a discussion of the papers should maybe start with an explicit description of their key assumptions about the data generating process. If we are lucky they even make contrasting predictions of implied conditional independencies, which would allow an empirical examination of the assumptions.
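    The two scenarios above can be illustrated with a small simulation. This is a minimal sketch, not taken from either paper: the effect sizes `TAU` and `BETA`, the selection probabilities, and the function name `estimate` are all assumptions chosen for illustration. A single common cause `x` of J and S plays the role of the observed confounder when we stratify on it, and the role of an unobserved confounder when we do not.

```python
import random

TAU = 1.0    # assumed true effect of R on S (illustrative value)
BETA = 2.0   # assumed effect of the common cause on S (illustrative value)

def estimate(adjust, n=200_000, seed=0):
    """Estimate the effect of R on S using only people who reach a judge (J = 1)."""
    rng = random.Random(seed)
    # cells[(r, x)] = [sum of S, count] among people with J = 1
    cells = {(r, x): [0.0, 0] for r in (0, 1) for x in (0, 1)}
    for _ in range(n):
        r = int(rng.random() < 0.5)
        x = int(rng.random() < 0.5)  # common cause of J and S
        # Hypothetical selection rule: R = 1 reaches a judge 80% of the time
        # regardless of x; R = 0 does so 80% of the time if x = 1, else 20%.
        p_judge = 0.8 if (r == 1 or x == 1) else 0.2
        if rng.random() >= p_judge:
            continue  # never observed by the researcher
        s = TAU * r + BETA * x + rng.gauss(0.0, 1.0)
        cells[(r, x)][0] += s
        cells[(r, x)][1] += 1

    def mean(r, xs):
        total = sum(cells[(r, x)][0] for x in xs)
        count = sum(cells[(r, x)][1] for x in xs)
        return total / count

    if not adjust:
        # Common cause treated as unobserved: pooled difference in means.
        return mean(1, (0, 1)) - mean(0, (0, 1))
    # Common cause observed: stratify on x and weight by stratum size.
    n_judged = sum(c[1] for c in cells.values())
    est = 0.0
    for x in (0, 1):
        weight = (cells[(0, x)][1] + cells[(1, x)][1]) / n_judged
        est += weight * (mean(1, (x,)) - mean(0, (x,)))
    return est

adjusted = estimate(adjust=True)
unadjusted = estimate(adjust=False)
print(f"adjusting for the common cause: {adjusted:.2f}")
print(f"common cause unobserved:        {unadjusted:.2f}")
```

    Under these assumed numbers the stratified estimate lands close to the true effect, while the pooled estimate is pulled well below it, because selection into J = 1 makes R and the common cause negatively associated among the people who are observed.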

  12. Carlos Ungil says:

    The stylized example in Gaebler et al. seems a bit too stylized:

    “Our goal is to estimate the cdeOb given the prosecutor’s records on arrested individuals, namely their criminal history and race. Intuitively, subset ignorability holds in this simple scenario because the prosecutor’s dataset contains all factors used in the charging decision, even though the prosecutor does not know all the factors that led to an arrest, a decision that may itself have been discriminatory.“

    One would expect the decision to charge a suspect to depend somehow on the evidence of an actual offense and not just on their race and criminal record. In this example, a prosecutor that charges people if, and only if, they have a criminal record wouldn’t be discriminating even if black people are systematically arrested (99% are innocent) while white people are arrested only when they are obviously guilty (1% are innocent).
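    The arithmetic behind this example can be written out as a short sketch. The 99% and 1% innocence rates come from the comment; the 1,000 arrests per group, the 30% share with a criminal record, and the assumption that having a record is independent of guilt are hypothetical choices made only to get concrete numbers.

```python
def summarize(n_arrested, pct_innocent, pct_with_record):
    """Outcomes under the rule 'charge if and only if there is a criminal record'."""
    with_record = n_arrested * pct_with_record // 100
    without_record = n_arrested - with_record
    charged_with_record = with_record   # everyone with a record is charged
    charged_without_record = 0          # nobody without a record is charged
    # Assumption: having a record is independent of guilt in the current case.
    innocent_charged = with_record * pct_innocent // 100
    return {
        "rate_with_record": charged_with_record / with_record,
        "rate_without_record": charged_without_record / without_record,
        "innocent_charged": innocent_charged,
    }

black = summarize(n_arrested=1000, pct_innocent=99, pct_with_record=30)
white = summarize(n_arrested=1000, pct_innocent=1, pct_with_record=30)
print(black)
print(white)
```

    Conditional on criminal record, the charge rates are identical across the two groups, so a model with only race and record finds no race effect at the charging stage, even though the number of innocent people charged differs sharply between the groups.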

  13. Derek says:

    For the Gaebler paper, they seem to be saying “we don’t have good enough data to do an analysis like this.”
    So, why don’t they propose solutions to fix this? Our nation is in crisis over this issue, and leading academics seem to throw up their hands on the issue (at least in this paper).

  14. somebody says:

    I’m interested in seeing people’s opinions on this paper

    The issue with generalizing from one racist police force to all of the United States is obvious, and I’m sure you lot think they should have done a general hierarchical lagged regression with Stan rather than use a fixed effects difference-in-differences estimation procedure from the Stata box (so do I). That aside…

    Probably the biggest threat to the validity of the results I can see is how they get randomization. A lot hinges on how the officer assignment procedure supposedly works as described by the department to the authors actually being how it works. I find that in bureaucratic city government organizations, the handbook is often a wishlist. Granting that bit of honesty though, the study seems pretty good. On the face of it, the randomization strategy seems sound and most of their analysis decisions seem sane. On the other hand, the effect size is so large that it makes me feel like some mistake is being made, even with my priors leaning towards the existence of substantive police bias. Five times as likely? Sixty percent more likely to use force? Twice as likely to use their gun? If this is true, the officer-generating-process must be much worse than I imagined.

  15. cs student says:

    Knox et al has DAGs, Gaebler et al has potential outcomes; is this part of the Pearl vs. Rubin feud that has been going on for so long? as a total outsider the level of anger people are expressing seems weird for the actual argument. but if there are longstanding grievances in play here, that would explain it. it’s something Knox brings up in the Twitter thread, too.

    • Andrew says:


      Both these papers use potential outcomes. Knox et al. use the principal stratification framework of Rubin. The two papers are working within the same framework; they’re just defining the causal effect differently in this application.

      I don’t think the anger comes from longstanding grievances. Anger happens. The authors of the ovulation-and-clothing paper seemed to get pretty angry at me, and there was no longstanding grievance. I just wrote that their paper was wrong, and they didn’t like it. I’m angry at David Brooks because he keeps getting things wrong and not correcting it. Brooks may be angry at me for continuing to bring these things up, or maybe he just doesn’t care.

      A little bit of anger is appropriate sometimes! In this case, I wish the anger level were lower, because criminal justice is an important topic, and I think the anger is a distraction. But I understand where the anger is coming from.

  16. Michael Nelson says:

    Causality is complicated. It is plausible that some police in certain districts use coded language to activate racial bias in some prosecutors. Or an officer’s bias may lead to writing up the offense in a way that seems worse, or even to include false (but sincere) memories of events. In either case, the officer’s bias wouldn’t just impact the probability of being charged via biased decisions made prior to the prosecutor’s involvement (i.e., arrest). Police bias would also directly impact the probability of being charged–bias in arrest is partially confounded with bias in charging. That’s not to mention if the prosecutor knows, or knows of, the arresting officer, in which case the prosecutor might make inferences based on that relationship. I haven’t thought out the modeling implications–could be the effect of racism is overestimated, for all I know–but the point is that causation in a human system is leaky, so claiming that your model has isolated effects of bias at a particular point in the process–the “value added” of racism, if you will–is at least optimistic, if not misleading. Claiming to estimate the effects of bias in the justice system up to and including the charging decision is more reasonable, but that is a starkly different parameter than both papers seem to want to get at.

  17. Ethan Bueno de Mesquita says:

    A couple comments.

    First, as I said publicly on twitter, I regret referring to the paper as “dreck” and I apologize to the authors. It was rude and uncalled for. I’ll do better in the future.

    Second, my basic take on what is going on in this literature remains sympathetic to Knox et al. It seems to me that Knox et al have pointed out a serious flaw in much of the empirical literature on bias in policing. Gaebler et al then point out a (in my view, heroic) condition under which this issue turns out not to matter. On your reading, this is a hashing out of identifying assumptions by two groups of scholars that just got a little heated, perhaps on its way to scientific consensus. I hope that is right. On my original reading of the Gaebler et al paper, I had a different reaction. It seemed to me to be making a mathematically true argument that nonetheless obscured the key substantive point about identification in this setting by focusing on an empirically implausible, knife-edge condition under which the standard approach is identified and writing about that assumption in a way that might make a reader think it innocuous (e.g., referring to it as “standard”). One place you and I may differ, which may partially explain our different takes, is that my sense is you think essentially all identifying assumptions are implausible in observational data, whereas I am somewhat more optimistic (you may be the only person I’m more optimistic than on this front). In any event, for these reasons, I worried and worry that the paper will result in the literature continuing down an unfruitful path on what is a question of genuine scientific and policy import.

    I’ll finish where I started, though. Those concerns are no excuse for being rude, as I was in my tweet. They are a reason to continue the civil scientific debate, which I appreciate your doing here.

    • Andrew says:


      In observational studies, causal estimates just about always rely on “empirically implausible, knife-edge conditions.” Regular old regression, matching, instrumental variables, difference in differences, etc etc—all these methods assume exact conditions that are actually false. Similarly, sampling inferences from just about all real-world surveys rely on “empirically implausible, knife-edge conditions.” Regression, weighting, MRP, etc., they’re all assuming some things in practice that we know are false.

      Are these assumptions “innocuous”? No. We use them all the time, but we should be aware of their problems. Anyway, I don’t consider it any kind of damning criticism of Gaebler et al. that they use “empirically implausible, knife-edge conditions.” This is what we all do all the time in our empirical work, if we’re not lucky enough to work with perfectly random samples, perfectly unbiased measurements, zero drop-out, etc.

      • Ethan Bueno de Mesquita says:


        As predicted in my earlier reply, this is precisely where we disagree. I think there is a meaningful distinction for the credibility and transparency of the social science between, say, the assumption of continuity of potential outcomes at the threshold in a careful and transparent RD paper and, say, a selection on observables assumption in a “regular old regression”. The differences have to do with how well we understand and can think about both the sources of identifying variation and the potential sources of bias, and with the transparency of the (as you say, always literally false) identifying assumptions.

      • WJ says:

        In principle you’re right, in practice this is dangerous. Within the context of criminal justice issues, this means producing research that is technically correct from a mathematical point of view, but essentially worthless in terms of policy because it’s very poorly identified wrt discrimination.

        Also, you’re wading into the waters of false equivalences. It makes no logical sense at all to lump all causal estimates in observational studies together as being knife-edge. There are better research designs and worse ones. More plausible and responsible observational designs and far less so. Lumping everything together as you are muddies these waters. And that’s dangerous, especially within the context of research on the criminal justice system. We should do the best research we can and avoid poorly conducted research even if it is mathematically correct.

    • dl says:

      Interesting discussion, but even taking Andrew’s point to heart, I’m with Ethan. Sure, there are always unobserved confounders, so ignorability assumptions are always incorrect. But there’s incorrect and there’s wildly implausible. The regression equivalent of the Gaebler counter-example is regressing outcome Y on treatment Z without including any pre-treatment covariates X, since it’s technically possible that the omitted variables exactly cancel each other’s biases out, identifying the treatment effect. Well yes, but come on.

      On another note, I think many commenters here are missing the most concerning bias in this application. It is that police bias against minorities (=arresting a minority when a similarly situated white would be freed) gives prosecutors a set of cases where everything (measurable by researchers) equal, whites are more blameworthy than minorities. I don’t see how researchers can measure and adjust for strength of evidence in a case, and if I’m right about that, any local treatment effect estimate is going to be worthless.

      • Sam Bailey says:

        Gaebler et al. recognize this bias and do some sensitivity analysis while admitting they can’t actually be sure what they’re missing. They include this line which I think is correct, though I can’t speak to the results themselves:

        “Thus, as in many applied statistical problems, one must ultimately rely in large part on domain expertise and intuition to form reasonable conclusions.”

        Messy statistical analysis may still be helpful if we have “domain expertise and intuition” to frame it. Researchers could probably also take any sufficiently messy or noisy result to confirm what they already believe, with the newfound and dangerous confidence that those beliefs are now “statistically supported.” Maybe that’s what Knox et al. are worried about.

        • dl says:

          Right, I’m sure serious researchers in the field get this problem; I was referring to several comments above that seemed to miss what the most likely/serious bias at play is.

  18. JFA says:

    A lot of the comments are suggesting that this is just about the difficulty of causal analysis with omitted variables. Yet all the discussion in the papers is how that issue applies to the criminal justice system. These papers are about the specific processes in the criminal justice system that may or may not lead to racial bias and disparate impacts. Given that these papers are actually about criminal justice (and not just the statistical issues that arise in causal analysis), I am surprised that neither paper uses information from crime victimization surveys on offender characteristics to address weaknesses in the administrative data.

  19. There is no need to consider statistics when you can watch someone being murdered on video :-(

    • AllanC says:

      I think what you’re saying is that sometimes new evidence is so compelling that it dwarfs prior beliefs about a particular supposition. This is undoubtedly correct but the focus of these two papers is not about discerning whether or not the postulate that racism exists is true – the video you refer to would certainly confirm its existence in a deeply troubling way – but rather what issues exist with the quantification / estimation of its effect on certain life events.

  20. I think a (highly improbable) thought experiment is helpful to parse the assumptions in KLM and Gaebler et al.

    Let’s suppose that there are Black and white civilians who are being arrested and then sent to the prosecutor’s office for charges. Further, let’s suppose that we can randomly assign an individual’s race at the start of the process. And let’s suppose that we can change the individual’s “race” in the second stage, before the prosecutor makes their charge decision. This leads to four possible scenarios:
    1. White while arrested, still white once at the prosecutor (w, w)
    2. White while arrested, but magically turned “Black” once at the prosecutor (again, this is the highly improbable part, but it is worthwhile to grant this for the argument) (w, b)
    3. Black while arrested, still Black once at the prosecutor (b, b)
    4. Black while arrested, but white once at the prosecutor (b, w)

    So the Gaebler et al assumption is that there are no differences between the average charges for the group of people who are (w, w) and those that are (b, w) (and also that there are no differences for the (b,b) and (w, b) ). It is a kind of “markovian race” assumption—we’re making an assumption about the “flow” of cases into the prosecutor’s office.

    How could this be? Well, one sufficient assumption is that police are just randomly deciding which civilians to arrest (even if they are randomly sampling Black civilians at a higher rate, or that there are characteristics of the civilians that increase (decrease) the rate of arrests). Or, we can suppose that things cancel out just right in the next stage.

    But, if police impose even slightly different decision rules for whites or Blacks, then this equality in the next stage doesn’t hold. So, for example, you could see Blacks receiving on average lower charges than whites because cops are implementing rules in primarily Black neighborhoods that lead to Black civilians being arrested for much lower offenses, even within a particular kind of offense.

    These selection problems are particularly tricky to adjust for, which leads to the push from KLM for straightforward assumptions that enable bounding of the effects.

    It is also straightforward to see that (1) this is a problem regardless of whether you’re analyzing the whole system and (2) it isn’t clear that re-randomizing race at any stage would address selection at an earlier stage. This is a tough problem, which I think speaks to the strength of the assumption in the Gaebler et al paper.

    In other words, this simple example suggests it is unwise to assume that you’re in a standard selection on observables setting at any one stage in this process, because prior decisions affect the characteristics of who remains in the system.

    (I’m doubly conflicted on this. Jonathan Mummolo was my student, I co-author with Dean, and count Will Lowe as a valued mentor. But I work at Stanford with Sharad and obviously value him as a colleague.)

    • Andrew says:


      I think the thing you’re missing here is that, for the prosecutor’s decision, Gaebler et al. are considering the “race” treatment as information about race as perceived by the prosecutor. So, on one hand, Knox et al. are correct that this is only part of the racial discrimination story, as it does not capture all the discrimination that came before, and, on the other hand, Gaebler et al. are correct that it’s possible to estimate the effect of information about race on the subset of cases that come to the prosecutor.

      Now, at this point, you might say that “information provided to the prosecutor about race” is not the treatment that you care about, and that’s fair enough, and at this point I’ll point you to the other argument of Gaebler et al. to study racial discrimination by looking at disparate impacts rather than by looking for an overall causal effect of race.

      • Thanks for the response (particularly when it is late!) I’m happy to email if it is easier to hash things out there.

        I don’t think “race as perception” really addresses the key issues that I raised. The key issue is that there could be systematic and unmeasured differences in how civilians who are perceived to be Black and perceived to be white arrive at the prosecutor stage. Regardless of how you operationalize the treatment that is still a problem.

        Again, it is a problem because there can be systematic differences across the two perceived groups. To repeat the example, it might be that civilians who are perceived Black were arrested for much more minor infractions than civilians perceived white. These differences are really hard to address (I think) with a set of observables. In other words, simply subsetting to one stage does not allow you to estimate bias in that stage without additional strong assumptions about what came before.

        • WJ says:

          This really captures the crux of the issue well. I am glad JG has taken the time to write this out.

          I am curious about how Andrew will respond to it.

          Sadly, I fear, that this has indeed turned into a game, the likes of which KLM warned us all about. Gaebler et al’s assumptions are so implausible as to be absolutely misguided in the case of race, discrimination, and anything to do with the criminal justice system.

          There is an important qualitative difference between technically correct from a mathematical point of view and absolutely useless from a practical point of view. Unfortunately, Gaebler et al. seem to be missing the forest for the trees. We should be careful to distinguish mathiness from causal inference and logical practicality in applying math to understand the social world.

        • dl says:

          Right, this is the point I was trying to make above. If you look at it as an abstract statistical problem, then sure, assuming (“as standard”) that researchers can measure everything that is relevant to the prosecutor’s decision, the estimated treatment effect on the prosecutor’s decision is valid. But that assumption seems to be particularly implausible in this substantive context.

        • Carlos Ungil says:

          I agree that considering a counterfactual treatment where the (perceived) race could be different at some point doesn’t make a difference regarding the issues discussed in KLM.

          It’s possible to estimate the effect of (perceived) race on the subset of cases that come to the prosecutor but only to the extent that the model includes enough information to account for the differences between the two distinct groups arriving at the prosecutor’s desk: (previously and presently perceived as) blacks and (previously and presently perceived as) whites.

          If the outcome in the second stage depends on unmeasured covariates that are selected in the first stage there is a problem. What really solves the problem is not considering race a treatment assigned at the time of perception, it is the assumption that there are no unmeasured covariates to invalidate a naive comparison between “identical” subjects (that differ only on being perceived as black or white as they go through their lives, so to know what would happen now to the black person if he were white you can just look at outcome for the white person and vice versa).

          In your example you talk about the kind of offense. If the data is available, and good enough, the adjustment may be possible. One can compare those arrested for minor offenses in both groups and see if the charging decisions are different (conditional on the variables considered in the model). But the problem is that there may be still unmeasured covariates biasing the estimate. If the threshold of detention, conditional on the “kind of offense” which is recorded and included in the model, is different (say a white driving a stolen car is detained while all the black occupants of a stolen car are detained) charges will be dropped more often for blacks. Maybe the prosecutor really just looks at the facts and doesn’t even get to perceive the race, but in that case he will appear to have a bias against whites. Or maybe he’s more severe with blacks (everything being equal) but seems more severe with whites (every measured thing being equal).
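          The threshold story in this comment can be sketched as a simulation. This is a minimal illustration under stated assumptions, not an analysis from either paper: evidence strength is taken to be uniform on [0, 1] with the same distribution in both groups, the arrest and charging cutoffs are made-up numbers, and the researcher is assumed to see only race and the charging outcome among those arrested.

```python
import random

# Hypothetical arrest cutoffs: police arrest black civilians on weaker evidence.
ARREST_CUT = {"black": 0.2, "white": 0.5}

def charge_rates(charge_cut, arrest_cut=ARREST_CUT, n=200_000, seed=1):
    """Charge rate among arrested people, by group, when evidence is unmeasured."""
    rng = random.Random(seed)
    arrested = {"black": 0, "white": 0}
    charged = {"black": 0, "white": 0}
    for _ in range(n):
        group = "black" if rng.random() < 0.5 else "white"
        evidence = rng.random()  # same distribution in both groups (assumed)
        if evidence <= arrest_cut[group]:
            continue  # not arrested, so absent from the prosecutor's records
        arrested[group] += 1
        if evidence > charge_cut[group]:
            charged[group] += 1
    return {g: charged[g] / arrested[g] for g in arrested}

# Prosecutor who looks only at the evidence (same cutoff for everyone).
race_blind = charge_rates({"black": 0.6, "white": 0.6})
# Prosecutor who is more severe with black defendants (lower cutoff).
harsher_on_black = charge_rates({"black": 0.5, "white": 0.6})
print("race-blind prosecutor:      ", race_blind)
print("harsher on black defendants:", harsher_on_black)
```

          With these assumed cutoffs, the race-blind prosecutor charges a smaller share of arrested black civilians than arrested white civilians, and so appears biased against whites. The prosecutor who is in fact harsher on black defendants still shows a lower charge rate for them, which is the sign reversal the comment describes when the strength of evidence is not measured.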

        • Andrew says:


          There will definitely be systematic and unmeasured differences in how civilians who are perceived to be of different races arrive at the prosecutor stage. As Gaebler et al. say, it can be possible to design studies to estimate the effect of perceived race at the prosecutor stage, but this should not be taken to imply that this estimate represents some sort of total causal effect of race.


          I think that both groups of researchers (Knox et al. and Gaebler et al.) see the forest, not just the trees. Again, I endorse the recommendation by Gaebler et al. to study racial discrimination by looking at disparate impacts rather than by looking for an overall causal effect of race.

  21. Fafa says:

    My problem with the Knox et al. paper is that it wrongly, yes wrongly, claims that the only way to estimate racial discrimination effects using policing data is by bounding, using their demonstrated method. They are essentially claiming that no one will ever be able to devise an identification strategy that addresses what they call “mediator-outcome” confounding in policing data. I suspect the reason for the high temperature of the subsequent arguments is that researchers tend to be more optimistic. (Not helping is that the authors have responded with a “how dare they, don’t they know how important our work is” tone.) Mummolo’s tweet thread is revealing because it shows that he understands very well that he is wrong! The obvious counterexample, “what if there was an RCT where part of the chain (e.g. prosecutor decisions) was randomized,” seems to make him very angry and causes him to double down.

    But it should also be obvious that an RCT is not the only possible way to solve this “unsolvable” problem. When some enterprising grad student finds a record of a day when officers in a particular city were ordered to pull over only green cars, should they not try to use that to look at racial disparities in what happens during those encounters? Is Mummolo going to tell them: “Well, you see, I’ve proven that you will never be able to identify the effect you are looking at, so don’t bother”?

    I agree with Andrew that the peer review process is no panacea, but I’ve found it is often good for correcting problems of overclaiming. My guess is that had this been submitted to an Econ journal, a referee would have ensured that they were limiting their critique appropriately, and Gaebler et al. would not have the space to write what Mummolo calls a ‘gotcha’ paper. But they did not.

    As Andrew’s post alludes to, there is nothing in the paper that is conceptually new or has not been raised in causal inference a million times, but it is definitely valuable to show how generic problems manifest in specific fields, and the exposition using DAGs was nice. But the authors seem very churlish about the possibility that their critique is only wide-ranging, instead of universal.

  22. SC says:

    “Other conceptions of discrimination, such as disparate impact, are equally important for assessing and reforming practices.”

    What exactly do Gaebler et al. mean here? A fair criminal justice system will have a greater impact on people who commit more crime. A fair economic system will reward productive people more than less productive people.

    They seem to be shifting the focus to racial equality in outcomes (impact) instead of racial equality in treatment (causal effect of race), which seems to be the opposite of what we want?

  23. Andrew says:


    There are many dimensions of fairness. You give your own definitions of fairness, which is fine, but they will not be the same as other people’s definitions. And fairness is not the only goal.

  24. Christopher says:

    Knox, Lowe, and Mummolo wrote a followup paper that formally analyzes the Gaebler et al proposal. It should clarify any confusion.

    • Carlos Ungil says:

      Thanks. In summary, “analysts often fail to distinguish between (i) assuming a condition holds, which is easy; and (ii) satisfying a condition and carefully justifying it, which is hard” and “rather than developing improved research designs or deriving better estimation techniques, GCBSGHa advocates assuming that even with imperfect controls, biases from multiple sources will happen to perfectly offset one another.”
