
Update on social science debate about measurement of discrimination

Dean Knox writes:

Following up on our earlier conversation, we write to share a new, detailed examination of the article, Deconstructing Claims of Post-Treatment Bias in Observational Studies of Discrimination, by Johann Gaebler, William Cai, Guillaume Basse, Ravi Shroff, Sharad Goel, and Jennifer Hill (GCBSGH).

Here’s our new paper, Using Data Contaminated by Post-Treatment Selection?, with Will Lowe and Jonathan Mummolo. This is an important debate and the methods discussed are being used to study serious policy issues. We think these new derivations are valuable for those producing and consuming discrimination research.

Examining GCBSGH’s proposed approach was a very useful exercise for us. In the paper, we clear up some confusion about estimands, in particular showing that given post-treatment selection, analysts do not even get the controlled direct effect (CDE) among the people included in the experiment, unless observations are exactly the same despite responding to treatment differently. This is conceptually the same as arguing that IV recovers the ATE—I [Knox] think few reasonable analysts would argue that compliers are somehow exactly the same as the full sample. In our paper, we prove the following is logically equivalent to GCBSGH’s proposal: in ideal experimental settings where civilians of different racial groups are randomly assigned to police encounters pre-stop, acknowledging that biased police may stop minority civilians for as little as jaywalking but white civilians only for assault—yet arguing that both sets of stops are somehow identical in potential for police violence.

But there is a more important point that we hadn’t appreciated on first read: GCBSGH’s proposal is described as working even when treatment ignorability doesn’t hold. We now examine that aspect closely and find the proposed approach recovers the estimand in this more general setting only if post-treatment selection bias exactly cancels out omitted variable bias. Of course, analysts are free to assume whatever they want, but we think federal judges and civil rights orgs are unlikely to find this argument compelling.

We realize this exchange got heated, so we’ve tried hard to dial the tone back and just focus on the intellectual arguments. The back-and-forth has been relegated to a short section that runs through claimed “counterexamples” (which all mirror the parenthetical asides and appendices in our original paper) and also the fact that we couldn’t find a textual basis for their critique anywhere in our paper. Ultimately this is a pretty minor point, but we said we needed assumptions 1-4 for X under conditions Y, and they attacked us for assumptions 2/4/5 not being necessary for Z. Frustrating, but I guess we could’ve been clearer.

I’ve not tried to follow all the details here so I’ll let people go at it in the comments section. I’ll just say that I’m not sure what to make of the “knife-edge” or “measure zero” issue. Almost all statistical methods are correct only under precise assumptions which in observational data will never hold. But that doesn’t mean the methods are useless. When we estimated incumbency advantage in congressional elections (here and here), our estimates were only strictly kosher assuming random assignment or linearity and additivity, none of which are actually correct. This is not to say that Knox et al. are wrong in their arguments, just to say that biases will always be with us. Regarding the point about post-treatment selection bias exactly canceling out omitted variable bias: I think that must depend on what you’re estimating. This gets back to my comment in that earlier post that some of the disagreement between Knox et al. and Gaebler et al. is that they’re estimating different things.

I sent the above paragraph to Knox, who replied:

We must point out that your statement about estimands is simply inaccurate. Our new paper considers the exact same estimand as GCBSGH and we are very explicit about that. Our original paper also considered this estimand, among several others.

We also think the problem is pretty intuitive. You cannot, in general, pick up halfway through a selective process and run a standard analysis without bias, because the selection almost always breaks the apples-to-apples comparison between treatment and control.

The only time when you can ignore selection is when various differences happen to accidentally cancel (see our paper), and there are simply no substantive reasons to believe this is happening. That is the real problem. As we say in the paper, this is no as-if-randomness assumption. This is an assumption about observations being the same despite responding to treatment differently. The knife-edgedness isn’t the root of the issue, it just makes it worse.
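
To make the selection problem concrete, here is a minimal Python sketch of the jaywalking-versus-assault scenario described above. Everything in it is an assumption made for illustration: the function name `simulate`, the stop thresholds (0.3 and 0.7), the force probabilities, and the true effect of 0.10 are invented, and the model is not taken from either paper. Race is randomized, an unrecorded severity score drives both the stop and the use of force, and minority civilians are stopped at a lower severity threshold.

```python
import random

def simulate(n=200_000, true_effect=0.10, seed=0):
    """Toy model: race is randomized, but the stop decision is racially biased."""
    rng = random.Random(seed)
    force = {0: [], 1: []}  # recorded force outcomes among *stopped* civilians
    for _ in range(n):
        minority = rng.randint(0, 1)          # randomized "treatment"
        severity = rng.random()               # unrecorded seriousness of the encounter
        threshold = 0.3 if minority else 0.7  # biased stop rule (hypothetical numbers)
        if severity <= threshold:
            continue                          # never stopped, so never in the data
        # Race adds the same amount to the chance of force in every encounter.
        p_force = 0.10 + 0.50 * severity + true_effect * minority
        force[minority].append(1 if rng.random() < p_force else 0)
    rate = {g: sum(v) / len(v) for g, v in force.items()}
    return rate[1] - rate[0]                  # naive comparison among stops

naive = simulate()
print(f"true effect of race on force: 0.10, naive estimate from stop data: {naive:+.3f}")
```

In this toy setup the stopped minority encounters have average severity 0.65 while the stopped white encounters average 0.85, so the naive difference in force rates among stops comes out near zero even though race raises the probability of force by 0.10 in every encounter. Different made-up numbers would push the naive estimate above or below the truth; the point is only that the comparison among stops is no longer apples to apples.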

I think that even if the estimands are the same mathematically, they can correspond to different applied goals. I still think the two sides of this debate are closer than they think. But in any case at this point you can read this all yourself.

33 Comments

  1. Dean Knox says:

    Thanks for posting. The link to our paper seems to be dead. Here it is:
    “Can Racial Bias in Policing Be Credibly Estimated Using Data Contaminated by Post-Treatment Selection?”
    https://dcknox.github.io/files/KnoxLoweMummolo_PostTreatmentSelectionPolicing.pdf

    To follow up on Andy’s last point, our goals are identical to GCBSGH. The real difference, I think, is in our philosophical approach to identification issues. When causal effects are unidentified under plausible assumptions, we advocate careful partial identification approaches and derive nonparametric sharp bounds. In contrast, GCBSGH advocate making stronger accidental-cancellation assumptions that, if true, mean there is no problem to begin with.
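
For readers unfamiliar with partial identification, the following Python sketch shows the general idea in its simplest textbook form: worst-case (Manski-style) bounds on the mean of a 0/1 outcome when some outcomes are never recorded. This is an assumption-laden illustration, not the sharp bounds derived in the paper; the function name `worst_case_bounds` and the data are hypothetical.

```python
def worst_case_bounds(observed_outcomes, n_missing):
    """Bounds on the mean of a 0/1 outcome when n_missing outcomes are unobserved."""
    n_obs = len(observed_outcomes)
    total = n_obs + n_missing
    observed_sum = sum(observed_outcomes)
    lower = observed_sum / total                 # every missing outcome is 0
    upper = (observed_sum + n_missing) / total   # every missing outcome is 1
    return lower, upper

# Hypothetical data: 60 recorded outcomes (18 ones) and 40 unrecorded cases.
lo, hi = worst_case_bounds([1] * 18 + [0] * 42, n_missing=40)
print(f"mean is between {lo:.2f} and {hi:.2f}")
```

The bounds report a range instead of a single number, and the range narrows only when assumptions or additional data rule out some of the possibilities for the unrecorded cases.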

  2. Joshua says:

    As someone who’s interested in whether observational studies of racial bias in policing are biased because of a reliance on stop or arrest records…I’m wondering if someone can translate that abstract into English? I can usually parse technical prose to make sense of it…but with that abstract I can only do so to a very minimal extent.

    • jim says:

      “… translate that abstract into English…”

      Got you covered:

      “We’re trying to sort out all the confounding effects in the data regarding racial discrimination in policing. We’re not having much success. There are so many actual and proposed and theoretical opposing confounding effects that we can’t tell WTH is going on! It’s like the Grinch said: Noise Noise Noise!”

    • Jonathan Mummolo says:

      Hey Joshua, sure. I can take a stab at that.

      Using arrest records to estimate racial bias in the use of force is fraught because racial bias can affect the decision to arrest. Specifically, it’s a problem because if there is bias in arrests, the white arrest cases will differ from the nonwhite arrest cases. The example we use is minorities may be arrested for less serious offenses, though often this won’t show up in the records. These differences affect whether police use force. In a new paper, GCBSGH claims this is not a problem because we can assume those differences won’t distort our estimates. Our analysis shows the only way that can be true is if various differences happen to exactly offset one another. But since there is no reason to think that accidental cancellation would occur, we argue that is implausible, and that instead, analysts should use techniques to describe the range of possible discrimination rather than reporting biased estimates. The reason this is so important is because recent work shows that ignoring selection bias can greatly understate discrimination in the use of force.

      • jim says:

        Great thanks Jonathan.

        This is an interesting discussion from a statistical methods point of view, but the initial assumption of purely *racial* discrimination is *highly* questionable, as is the assumption that the primary form of racial discrimination is W/NW.

        There are myriad confounding factors that certainly affect every aspect of police behavior toward potential suspects: clothing; attitude / body language; the location of the suspect (are they hanging out in a parking lot or having coffee at starbucks?); social behavior and associates; language and accent; and, sometimes, smell. Hardly any of this data is included in police records, yet it could be at least as influential if not more influential than race.

        So, while methodologically the discussion is interesting, in practice it’s irrelevant because the assumptions that underpin the method can’t possibly be valid.

        • Luis z says:

          I am not sure, but I guess that if we assume independence between those factors and race, then with a big enough sample the analysis should still hold.

          And if we prove that there is no independence, then a more philosophical problem arises. If a certain ethnic group is associated with a certain factor, and police discriminate against that factor, can’t it be considered racism?

          In the end, these questions should be asked with the end goal of understanding what is wrong with a society and how can it be fixed. researchers should not forget that.

          • Joshua says:

            Luis –

            +1

            That said, I have no idea why we’d assume they’re independent (not that you said otherwise).

            And I think it should be a default that we assume they aren’t independent until it’s shown that they are.

            • jim says:

              “And I think it should be a default that we assume they aren’t independent until it’s shown that they are.”

              We assume they’re independent because we already know they are. To suggest that how people dress or speak doesn’t influence people’s perception of others is preposterous. No doubt some confounding factors are at least loosely correlated with race or ethnicity or cultural factors. But there’s also the possibility that criminal behavior is loosely correlated with race or ethnicity or cultural factors. Or do you reject that possibility?

              • Joshua says:

                jim –

                > To suggest that how people dress or speak doesn’t influence people’s perception of others is preposterous.

                Well then it’s a good thing that I’m not saying that. Read what I wrote again. I’m saying that you can’t assume that judging people based on how they dress is independent from race.

                > No doubt some confounding factors are at least loosely correlated with race or ethnicity or cultural factors.

                Please define “loosely associated.”

                > But there’s also the possibility that criminal behavior is loosely correlated with race or ethnicity or cultural factors. Or do you reject that possibility?

                Well – that gets a bit complicated. For example, in urban environments and among males of certain age groups, there is a correlation with arrests and imprisonment and race.

                But it’s interesting how that correlation diminishes or actually disappears when you move to rural environments or diminishes when you move to different ages and with females.

                And it’s also problematic to assess any correlation between criminality and race and/or ethnicity if you aren’t controlling for biases in the perception of, and definition of criminality. Certainly you’re aware of the differences in how the courts have perceived the criminality of drug trade in crack versus powder cocaine? Do you think there is an association between corporate fraud and race? So yeah, I don’t make assumptions about the correlation of which you speak.

                And of course, there’s the whole issue as to why, if there is a correlation, why that correlation exists and whether or not it exists because of hundreds of years of behaviors that we now consider criminal. But I guess that’s a different issue.

        • Joshua says:

          jim –

          > , but the initial assumption of purely *racial* discrimination is *highly* questionable.

          Given the historical context, I don’t think it’s *highly* questionable in the least. Given the historical context, where race has been the explanatory factor for hundreds of years, seems to me it should be the default assumption. Not to say that it’s proven per se, but if you’re going to have a default assumption, I’d say independence of those other factors from race should *not* be the default assumption.

          On top of that, the evidence from rates of stop and frisks relative to outcomes from the stops showing merit for the discrepancy in those rates is evidence. Rates of stops while driving where clothing or accent wouldn’t be discernable but race would be, is evidence that race is an explanatory factor.

          • jim says:

            “Rates of stops while driving where clothing or accent wouldn’t be discernable but race would be, is evidence that race is an explanatory factor.”

            I could be convinced of that. Obviously I’ve heard of crimes like “driving while black” and I’ve always presumed that was a real thing. But I’ve never reviewed a paper on it, so I don’t know the methods that have been used to establish the effect and I don’t know its magnitude. I accept it at face value but from a quantitative point of view it’s not an easy thing to tease out, so I’m not going to bet the farm that it’s true.

            “Given the historical context, I don’t think it’s *highly* questionable in the least.”

            I do. We’re all aware of the historical context but if we’re just going to presume that’s continuing why are we doing the study? The object of the study is to understand the current situation. Even if race does play a role in driving stops, is it appropriate to assume it plays the same role in every situation? No, because style of dress and behavior could either reduce or amplify other biases.

            2020 isn’t 1950. The color barrier in sports is ancient history. Americans embraced rock-n-roll, Motown and disco (even cops!). Oprah is widely admired by people of all races. Rap and hip-hop are mainstream. There are even black Republicans that have actually been elected to office. Obama was elected president. Twice.

            • Joshua says:

              jim –

              > but if we’re just going to presume that’s continuing why are we doing the study?

              To test the default assumption is a fine reason to do the study. You are suggesting one default assumption as opposed to another. I’m saying I don’t agree with your choice. We have hundreds of years of unarguable precedent to support a default assumption that race is an explanatory factor in discrepancies for how people of different races are treated by police (if not the only factor), independent of the other factors you listed. We can theorize that the other factors have an explanatory influence independent of race, but we don’t have nearly the same body of evidence and so it seems to me you should provide the evidence before considering it as a default assumption.

              > Even if race does play a role in driving stops, is it appropriate to assume it plays the same role in every situation?

              I would say of course not. But I don’t think that saying that race is an explanatory factor independent across the aggregated stats, independent of those other variables, makes that assumption.

              > 2020 isn’t 1950. The color barrier in sports is ancient history. Americans embraced rock-n-roll, Motown and disco (even cops!). Oprah is widely admired by people of all races. Rap and hip-hop are mainstream. There are even black Republicans that have actually been elected to office. Obama was elected president. Twice.

              On the flip side, we could list all manner of associations with race and societal outcomes. The extent to which people think that race explains societal outcomes is largely predicted by the race of the predictor. But your anecdotal reasoning is largely irrelevant to the question of what your default assumption should be. You are arguing for the default assumption that the factors are causally independent, without offering evidence beyond some anecdotal associations over a relatively short period, and in contradiction to hundreds of years of unarguable evidence otherwise.

              Sure – check the default assumptions running the other way, and be aware of your default assumptions and check their plausibility, and try not to just reach conclusions without accounting for assumptions. But again none of that seems to me to be convincing as to why we should make the default assumption that race isn’t an independent explanatory factor for discrepancies in how the police treat people of different races.

            • Joshua says:

              and jim –

              Please keep this in mind…

              -snip-
              African Americans were stopped at disproportionately high rates, were more likely to have force used against them during a stop, and were less likely to be found in possession of contraband
              -snip-

              Presumably, the explanatory factors for why people are stopped and frisked would be correlated with the success rate at finding contraband. Whatever factors lie in the decision chain, they should result in a success rate. How would you explain why blacks were stopped and frisked at a higher rate while blacks stopped were less likely to be in possession of contraband? Because of their clothing or accent independent of race?

              https://www.ramapo.edu/law-journal/thesis/racial-biases-within-stop-and-frisk-the-product-of-inherently-flawed-judicial-precedent/

              • somebody says:

                > Presumably, the explanatory factors for why people are stopped and frisked would be correlated with the success rate at finding contraband. Whatever factors lie in the decision chain, they should result in a success rate. How would you explain why blacks were stopped and frisked at a higher rate while blacks stopped were less likely to be in possession of contraband? Because of their clothing or accent independent of race?

                I don’t know if that necessarily constitutes evidence of racial bias. There’s been a lot of work on fairness in machine learning that applies here

                https://5harad.com/papers/fair-ml.pdf

                https://5harad.com/papers/threshold-test.pdf

                in particular, section 3.2 deals with limitations of classification parity. There’s some discussion to be had about how “risk distributions” can be a consequence of modeling choices and how bad false positive rates might mean a program isn’t worth doing at all, but it’s all a little more sophisticated than “unequal false positive rates => discrimination.”
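
The point about risk distributions can be sketched with a small Python example. It is a toy illustration under invented assumptions, not the model from the linked papers: the two risk distributions, the shared search threshold of 0.3, and the names `search_stats`, `group_a`, and `group_b` are all hypothetical. Everyone whose risk exceeds the same threshold is searched, and contraband is found with probability equal to that person's risk.

```python
import random

def search_stats(draw_risk, threshold=0.3, n=100_000, seed=1):
    """Return (search rate, hit rate) when everyone with risk > threshold is searched."""
    rng = random.Random(seed)
    searched = hits = 0
    for _ in range(n):
        risk = draw_risk(rng)              # chance this person carries contraband
        if risk > threshold:               # identical threshold for every group
            searched += 1
            hits += rng.random() < risk    # contraband found with probability = risk
    return searched / n, hits / searched

# Two hypothetical risk distributions (made-up numbers).
def group_a(rng):
    return rng.uniform(0.2, 0.5)           # many people just above the threshold

def group_b(rng):
    # mostly low risk, plus a small high-risk tail
    return rng.uniform(0.0, 0.3) if rng.random() < 0.8 else rng.uniform(0.6, 1.0)

for name, draw in [("A", group_a), ("B", group_b)]:
    rate, hit = search_stats(draw)
    print(f"group {name}: searched {rate:.2f}, hit rate {hit:.2f}")
```

With these made-up distributions, group A is searched more often and has the lower hit rate even though the decision rule is identical for both groups. That shows only that unequal hit rates are logically compatible with an equal threshold; it says nothing about which risk distributions, or which thresholds, describe real stop data.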

            • Dean Eckles says:

              Some relevant evidence (from some of the people involved in this more technical debate) about being stopped depending on observability of the person inside a car https://news.stanford.edu/2020/05/05/veil-darkness-reduces-racial-bias-traffic-stops/

  3. Joshua says:

    Jonathan –

    Thx.

    > In a new paper, GCBSGH claims this is not a problem because we can assume those differences won’t distort our estimates.

    That seems odd. Why would we make that assumption?

    So now I need someone to translate their thesis, as their abstract is also rather opaque.

  4. Gabriel says:

    I find KLM’s response very convincing.

    I’m having a hard time thinking through the implications of your comment that all statistical estimates are correct only under precise assumptions. Yes, I agree. But it also seems clear to me that there are more and less plausible assumptions, although I admit that I don’t have a good general grasp of what I mean by ‘plausible’ outside the specific circumstances of a given study. But clearly we don’t want ‘all statistical estimates rest on assumptions’ to throw open the door to researchers assuming whatever is convenient.

    What I think is particularly difficult about subset ignorability is that it is very clearly identifying that there is a problem — in fact, that there are two problems — and then assuming that they cancel. This seems qualitatively different than the typical case in which a researcher attempts to adjust for known problems, e.g. if I want the effect of schooling on wages, and I know the relationship is confounded by ability and family income, I am supposed to try to do something about these known sources of bias, or else be far more conservative in what I say about the relationship between schooling and wages (e.g. a partial identification approach). Maybe I use an IV that everybody will argue about, or maybe I adjust with proxy measures of the confounders — possibly unsuccessfully, but either strategy strikes me as more plausible than assuming that these unknowns just so happen to cancel each other out. In this sense, it seems that some assumptions are clearly inferior to others, and that the point of KLM is to say ‘these assumptions are very implausible and when studying something as serious as police violence we should be more circumspect’. Wouldn’t you agree that the KLM partial identification approach rests on more credible assumptions than the subset ignorability approach?

  5. Michael Nelson says:

    It seems to me all anybody needs is a short paper that says: “We think you are wrong. We ran a simulation with this code under these assumptions. And here are the standard errors of the estimated effect across the different levels of degree of assumption violations.” The actual work would be long, but the paper would be short. It looks like this is the approach at the heart of GCBSGH’s paper, but it’s the heart of an 800 lb. gorilla requiring one to cut through layer after layer before reaching it. I’m not saying the proofs and alternative perspectives on the field and delineations of conceptual frameworks and explanations of new methods and applications to secondary datasets aren’t good, important work. It seems GCBSGH could have written several useful papers from all that content without even addressing the KLM study. And then they could’ve referenced those papers in their brief write-up of their counterexample simulation.

    I sympathize with KLM for having to generate an intelligent response to this–it doesn’t contain their code (or the word “code”), but there is room for a 238-word paragraph where the authors discuss a topic that they describe in that very paragraph as “not the aim of our paper.” And then another, 255-word paragraph several pages later that ends with the authors noting that this second topic, too, is beyond the “focus” of their paper. For all I know, KLM were dead wrong. But I don’t know, and I probably won’t unless GCBSGH write a paper telling me what their paper said.

    • Carlos Ungil says:

      I think there is a bit of the following going on here:

      K&al. We cannot learn much about X from measurements Y which are biased. Y = X + b + noise implies that mean(Y) is not a good estimator of X

      G&al. With the standard assumption that the measurements are unbiased, observations of Y can be used to estimate X. Your math is wrong because mean(Y) is a good estimator of X when b equals zero.

      I’m not sure a computer simulation would help.

      • Michael Nelson says:

        You may be right. My thinking was that, since all assumptions are wrong, GCBSGH could use simulations to show how wrong the assumption would have to be to undermine results based on it, then argue the required size of the violation is implausible. That’s what I inferred GCBSGH were doing with their simulation’s sensitivity analysis results, which they claim demonstrate robustness to violation of the ignorability assumption. That’s the “heart” of the paper I wish they had cut out and written up on its own. But if you’re right, it’s all the more frustrating that they even included the simulation.
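
The kind of sensitivity sweep described in this exchange can be sketched in a few lines of Python, using the Y = X + b + noise setup from the comment above. This is a hedged illustration, not GCBSGH's simulation: the function name `mean_error`, the true value X = 1.0, the noise scale, and the grid of bias values are all assumptions.

```python
import random
import statistics

def mean_error(b, x=1.0, n=20_000, seed=2):
    """Error of mean(Y) as an estimate of X when Y = X + b + noise."""
    rng = random.Random(seed)
    ys = [x + b + rng.gauss(0.0, 1.0) for _ in range(n)]
    return statistics.fmean(ys) - x

# Sweep over hypothetical sizes of the assumption violation b.
for b in (0.0, 0.1, 0.25, 0.5):
    print(f"b = {b:.2f}: mean(Y) - X = {mean_error(b):+.3f}")
```

The error of mean(Y) tracks b and does not shrink as n grows, so a sweep like this shows how large b must be to matter but cannot say which value of b is plausible. That judgment has to come from outside the simulation.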

  6. Erin says:

    I don’t understand why these people keep fighting with each other. GCBSGH are saying “theoretically, if you could perfectly control for the confound X which creates the selection issue, then the effect is identified.” KLM are arguing that it is impossible to ever perfectly control for X, and the bias created from imperfectly controlling for X is likely to be especially large in this particular context (something I’m absolutely sure GCBSGH would agree with). I wish both sides would tone the rhetoric down a notch and try to better appreciate the arguments their “opponents” are making—because their current strategy is causing a lot of unhelpful (and unnecessary) confusion.

      • Jake T says:

        In fairness, only one side here is engaged in a large scale PR campaign about their work.

        • JB says:

          +1 Jake T. I agree with Erin’s call for less rhetoric and more substance. Not sure if this feedback should be directed at both parties, though. See:

          https://twitter.com/dean_c_knox/status/1276508868500164612
          https://twitter.com/jonmummolo/status/1275790509647241222
          https://twitter.com/jgaeb1/status/1276261257050353664
          https://twitter.com/5harad/status/1275931524819386368

          I may be missing some context, but it seems like GCBSGH have been courteous and fairly dispassionate in their engagement. On the other hand, KLM have used phrases such as:

          – “silly logic puzzles about knife-edge scenarios”
          – “only error is in their reading comprehension”
          – “justify the same old statistical mistakes”
          – “this is another irrelevant critique”
          – “absurdity of these tired claims”
          – “deliberate misreading and attempted misleading”
          – “this isn’t a good faith critique”
          – “use this issue to try and score cheap points”
          – “complaints aren’t useful to applied researchers”
          – “misleading, out-of-context gotcha paper”
          – “assuming away the challenges is reckless”
          – “if you don’t mind, we’d like to get back to doing useful work”
          – “we need useful ideas, not word games & gotchas”

          This type of language is immature at best. It has no place in a serious academic discussion.

          I don’t yet know where I stand on the substance of this debate (there is certainly a lot to digest). That said, I must say that it was distracting to read these threads and try to get past the snark and hostility. I am also puzzled by the bad faith that KLM assume in their colleagues, who seem credible and also in ideological agreement with their values? Hoping that the discourse improves from here on out, for everyone’s sake.

  7. somebody says:

    Methinks Dean Knox et al have spent much too much time attacking the mathematics and too little time on the plain English.

    > We must point out that your statement about estimands is simply inaccurate. Our new paper considers the exact same estimand as GCBSGH and we are very explicit about that. Our original paper also considered this estimand, among several others.

    This seems flatly wrong to me. I understand they’re giving the same formal definition for controlled direct effect among the observed, but GCBSGH is talking about the effect for a prosecutor making the decision to charge or not charge, while Knox et al. are laser focused on the propensity for police escalation into violence. Knox et al. then restate the ignorability assumption in several ways for the violence escalation example, all of which seem implausible, but that doesn’t really convince me that it’s implausible for the prosecution example.

    > In this setting, rearranging terms in Proposition 1 reveals that GCBSGHa’s “subset ignorability” assumption requires that officers to be equally violent in “assaults” and “jaywalking” encounters (i.e., have the same average potential outcomes).

    Remembering that they’re implicitly conditioning on all available covariates here, yeah that seems pretty implausible. But on the other hand “prosecutors be equally likely to charge in “assault” and “disorderly conduct” arrests conditional on all available covariates” seems pretty okay depending on the covariates (I replaced jaywalking because it’s not generally a criminal offense so “charging” doesn’t make any sense). Maybe they aren’t exactly equal, but they’re certainly closer than the propensities for violence across jaywalking and assault.

    You can dispute my assessment of the qualitative plausibility; these gut checks are inherently subjective. But the point is that the plausibility and reasonability of assumptions depend on the qualities of a particular example, not just the mathematical formalism. If you want to sniff test an assumption by way of qualitative description, you can’t do it by constructing a completely different but mathematically equivalent analysis. As such, when Knox et al. say the papers are about the same estimand and the same assumptions, it’s just not true in any meaningful way. I get that they want to be careful about circumscribing one’s claims when the stakes are as high as in a national discussion about racial justice. There certainly have been terrible published analyses that implicitly assume police data is an unbiased data source, then conclude that the police must not be biased. But my ideological sympathy aside, “Our new paper considers the exact same estimand as GCBSGH and we are very explicit about that” just seems like a dishonest claim.

  8. Ken K says:

    Ah the dangers of inference without theory….

  9. Fafa says:

    My comment on the previous post stands. I was hoping it wouldn’t. KLM continue to be deliberately obtuse and catty about a reasonable critique of their work. The abstract of their supposedly “detailed examination” contains the absurdly sarcastic sentence: “GCBSGHa’s proposal merits close study, as it posits a massive methodological breakthrough which, if confirmed, would undermine over 40 years of research on selection bias”. Come on.

    (The Section 3 arguments, in which they make the dramatic claims about the ‘knife-edge’ conditions of GCBSGH, are narrowly focused on police violence, presumably because the GCBSGH assumptions are most unreasonable when the “re-randomization of race” is done by the same actor (i.e. the police). But the takeaway from GCBSGH was that the KLM paper implies the entire chain post encounter is contaminated even in an “idealized experiment”, while the response here is that just the police link is always contaminated.)

    I actually like the original KLM paper and thought it was a good contribution. I just hate the unbelievably snide attitude in the twitter correspondence and now this.
