Did this study really identify “the most discriminatory federal judges”?

Christian Smith, Nicholas Goldrosen, Maria-Veronica Ciocanel, Rebecca Santorella, Chad Topaz, and Shilad Sen write:

In the aggregate, racial inequality in criminal sentencing is an empirically well-established social problem. Yet, data limitations have made it impossible to determine and name the most racially discriminatory federal judges. The authors use a new, large-scale database to determine and name the observed federal judges who impose the harshest sentence length penalties on Black and Hispanic defendants. . . . While acknowledging limitations of unobserved cases and variables, the authors find evidence that several judges give Black and Hispanic defendants double the sentences they give observationally equivalent white defendants.

They fit a multilevel model! That makes me happy.

I heard about this from Jeff Lax and David Hogg, who pointed me to a post on Twitter by law professor Jonah Gelbach disputing the above claims. Gelbach writes, “the data are incomplete . . . a match rate of less than 50% . . . the match rate varies substantially across districts . . . endogeneity concerns . . . The dependent variable is specified as the log of 1 plus the sentence length . . .” The criticisms are a mixed bag. For example, at one point Gelbach writes, “One concern is that judges w/few defendants will have higher-variance random slope estimates, raising the possibility that the results would be an artifact of estimating lots of effects & then picking largest values, which are more likely to happen w/judges having few cases.” But he’s actually getting things backward here, for reasons discussed by Phil and me in our 1999 paper, “All maps of parameter estimates are misleading.”
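To see why the direction matters, here is a minimal sketch. It is not the model from the paper: it is a toy normal-normal partial-pooling simulation, and every number in it (`sigma`, `tau`, the two caseload sizes of 5 and 500) and every function name is an assumption made up for illustration. Raw per-judge estimates put small-caseload judges at the top of the list; partially pooled estimates, the kind a multilevel model produces, pull those noisy judges toward the average.

```python
import random


def partial_pool(raw, n, sigma=1.0, tau=0.2, mu=0.0):
    """Posterior mean of a judge effect under a normal-normal model.

    raw   : the judge's unpooled estimate
    n     : number of cases behind that estimate
    sigma : assumed sd of a single case around the judge's effect (hypothetical)
    tau   : assumed sd of true judge effects (hypothetical)
    mu    : assumed population mean effect (hypothetical)
    """
    se2 = sigma ** 2 / n                  # sampling variance of the raw estimate
    weight = tau ** 2 / (tau ** 2 + se2)  # weight given to the judge's own data
    return mu + weight * (raw - mu)


def simulate(num_judges=2000, seed=1):
    """Simulate judges with either 5 or 500 cases each."""
    rng = random.Random(seed)
    judges = []
    for _ in range(num_judges):
        n = rng.choice([5, 500])
        true = rng.gauss(0.0, 0.2)           # true judge effect, sd = tau
        raw = rng.gauss(true, 1.0 / n ** 0.5)  # noisy unpooled estimate
        judges.append({"n": n, "raw": raw, "pooled": partial_pool(raw, n)})
    return judges


def share_small_in_top(judges, key, top=50):
    """Fraction of the `top` largest estimates that come from 5-case judges."""
    ranked = sorted(judges, key=lambda j: j[key], reverse=True)[:top]
    return sum(j["n"] == 5 for j in ranked) / top


judges = simulate()
print("raw estimates:   ", share_small_in_top(judges, "raw"))
print("pooled estimates:", share_small_in_top(judges, "pooled"))
```

In this toy setup the first share comes out well above one half and the second well below it: ranking by raw estimates mostly selects judges with few cases, while ranking by pooled estimates mostly selects judges with many.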

At this point, I think Jeff was hoping I’d adjudicate and share my own conclusion. But I don’t want to! For two reasons.

1. It takes work. Some things don’t take much work at all. After reading Alexey Guzey’s criticisms of Why We Sleep and then reading the relevant parts of Why We Sleep, it’s pretty clear what’s going on. Reading the overblown claims of John Gottman followed by the breakdown by journalist Laurie Abraham, again, this wasn’t a hard case to judge (although it seems that Abraham has made her own mistakes). Similarly with beauty-and-sex-ratios, ages-ending-in-9, and various other bits of junk science: serious flaws were immediately apparent to me. This case is not like that.

2. The “send it to Andy” approach to judging tough statistics questions doesn’t scale.

So instead I’m going to give some generic advice of the sort I’ve given many times before, involving workflow, or the trail of breadcrumbs. What I want to see is graphs of the data and fitted model. For each judge, make a scatterplot with a dot for each defendant they sentenced. Y-axis is the length of sentence they gave, x-axis is the predicted length based on the regression model excluding judge-specific factors. Use four colors of dots for white, black, hispanic, and other defendants, and then also plot the fitted line. You can make each of these graphs pretty small and still see the details, which allows a single display showing lots of judges. Order them in decreasing order of estimated sentencing bias.

This won’t answer all our questions, but it’s a start. With these graphs in hand, you’ll be able to more carefully go through the different concerns with the study.
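The display described above could be sketched roughly as follows. This is only an illustration under assumptions: the record keys (`judge`, `race`, `predicted`, `actual`), the `bias_by_judge` lookup, the colors, and both function names are hypothetical, and the 45-degree reference line is a stand-in for the judge-specific fitted line, whose form depends on the model. Plotting uses matplotlib, imported inside `plot_panels` so the data preparation runs without it.

```python
from collections import defaultdict

# Hypothetical color choice for the four groups named in the text.
COLORS = {"white": "tab:blue", "black": "tab:orange",
          "hispanic": "tab:green", "other": "tab:gray"}


def prepare_panels(records, bias_by_judge):
    """Group defendants by judge; order judges by decreasing estimated bias.

    records       : iterable of dicts with hypothetical keys
                    'judge', 'race', 'predicted', 'actual'
    bias_by_judge : dict mapping judge -> estimated bias from the fitted model
    """
    by_judge = defaultdict(list)
    for rec in records:
        by_judge[rec["judge"]].append(rec)
    order = sorted(by_judge, key=lambda j: bias_by_judge[j], reverse=True)
    return [(judge, by_judge[judge]) for judge in order]


def plot_panels(panels, ncols=6, path="judges.png"):
    """Draw one small scatterplot per judge and save the grid to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    nrows = -(-len(panels) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(2 * ncols, 2 * nrows),
                             sharex=True, sharey=True, squeeze=False)
    axes = axes.ravel()
    values = [r[k] for _, rows in panels for r in rows
              for k in ("predicted", "actual")]
    lo, hi = min(values), max(values)
    for ax in axes[len(panels):]:
        ax.set_visible(False)  # hide unused cells in the last row
    for ax, (judge, rows) in zip(axes, panels):
        for race, color in COLORS.items():
            pts = [r for r in rows if r["race"] == race]
            ax.scatter([r["predicted"] for r in pts],
                       [r["actual"] for r in pts], s=4, color=color)
        # Reference line y = x; a judge-specific fitted line would replace it.
        ax.plot([lo, hi], [lo, hi], color="black", linewidth=0.5)
        ax.set_title(str(judge), fontsize=7)
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Made-up records, only to show the expected shape of the input.
records = [
    {"judge": "A", "race": "white", "predicted": 24.0, "actual": 20.0},
    {"judge": "A", "race": "black", "predicted": 24.0, "actual": 26.0},
    {"judge": "B", "race": "hispanic", "predicted": 36.0, "actual": 60.0},
]
panels = prepare_panels(records, {"A": 0.1, "B": 0.4})
# plot_panels(panels)  # uncomment if matplotlib is installed
```

Sharing the axes across panels is a deliberate choice here: it makes judges directly comparable at a glance, at the cost of some wasted space for judges whose cases cluster in a narrow range.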

Also, it’s kinda wack that in their Tables 4 and 5, which cover individual judges, they just give point estimates and no uncertainties. What’s the point of fitting a big-ass model and then not presenting uncertainties??
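For what it’s worth, going from a log-scale estimate and its standard error to an interval on the ratio scale takes very little work. Here is a minimal sketch with made-up numbers: `log(2)` is simply what “double the sentences” corresponds to on the log scale, and the standard error of 0.3 is hypothetical, not taken from the paper. The function name is also my own.

```python
import math


def ratio_interval(b, se, z=1.96):
    """Convert a log-scale coefficient and standard error to the ratio scale.

    Returns (ratio, lower, upper), where the bounds are the endpoints of an
    approximate 95% interval (for z = 1.96) exponentiated from the log scale.
    Note: if the outcome is log(1 + sentence length), exp(b) is a ratio of
    (1 + sentence length), which is close to a ratio of sentence lengths
    only when sentences are long relative to 1 unit.
    """
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


# Hypothetical judge: point estimate of "double", standard error of 0.3.
ratio, lower, upper = ratio_interval(math.log(2), 0.3)
print(f"ratio {ratio:.2f}, interval ({lower:.2f}, {upper:.2f})")
```

With a standard error of that size, an estimate of “double” is compatible with anything from a modest increase to more than triple, which is exactly the sort of information a table of point estimates hides.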

Separate from all of this is the leap from a statistical pattern to actual discrimination (whatever that means, exactly). I’m not getting into that here.

It seems that Smith et al. are making two claims: first a general statement that blacks and hispanics are given slightly longer sentences than whites and others, on average; second a particular claim about the judges on the top of their list. I’ll just say this: if their general claim of aggregate bias is correct, then of course there will be variation, with some judges more biased than others. As everyone here recognizes, bias (statistical or otherwise) can come from some combination of the judge, the cases he or she sees, and the judge’s institutional setting.

P.S. Update here.

7 thoughts on “Did this study really identify ‘the most discriminatory federal judges’?”

  1. Did they remove the offending causal language in the most recent version? I didn’t see any of it when perusing the paper.

    I’ve seen a couple of shoddy papers in the past that accuse groups of discriminatory behavior based on dubious analyses. It’s one of the few things that actually irks me with the current state of social science research, because the targets of the accusations have almost no recourse against “science” saying they are discriminatory. I feel we should be hesitant to ascribe such motives to people.

    From what I saw, this paper avoids that and sticks with descriptive language, and if so, that’s really good.

    On the flip side, I still dislike the attitude of finding a result with a model, showing some p-values, and calling it a day. Researchers should be trying to find flaws and inconsistencies in order to improve the model—they should be hunting for problems, not defending them. And like Gelman says, graph the data!

  2. Even the first claim, “a general statement that blacks and hispanics are given slightly longer sentences than whites and others, on average,” can leave the statistical estimand ambiguous. Are the defendants’ other socioeconomic variables, such as family background, income, education, and occupation, pre-treatment confounders that we would like to adjust for, or are they mediators that carry an indirect effect? No data can tell these two possibilities apart.

    • I think this is more than theoretically possible. In my experience, one way people in lower status groups cope is by adopting and even exaggerating the attitudes of higher status groups. I don’t know him, so I can’t really say, but I think I see this in Clarence Thomas, for example.

  3. Table 3 feels problematic to me as it only includes the judges who give higher sentences to minorities by one/two SD above average.
    What about the judges who gave lower sentences to minorities by one/two SD below average?

    What does it mean that a judge one SD below average gives ~5% (a guesstimated number) shorter sentences to black defendants compared to white defendants?
    What does it mean that a judge two SDs below average gives ~40% (again another guesstimated number) shorter sentences to black defendants compared to white defendants?
    What if somebody were to claim from the data that ~30% (guesstimate) of the judges are biased against whites in favor of black defendants?
    Would that be a reasonable inference to make?

    The implausible effect size of 87% longer sentencing for Hispanics compared to white defendants for judges 2SD above average is another red light for interpreting these numbers.
