Emma Pierson writes:

My two sisters and I, with my friend Jacob Steinhardt, spent the last several days looking at the statistical methodology in a paper which has received a lot of press – Incarceration and Its Disseminations: COVID-19 Pandemic Lessons From Chicago’s Cook County Jail (results in supplement), published in Health Affairs. (Here’s the New York Times op-ed one of the authors published this weekend.) The central finding in the paper, quoting the abstract, is that community infections from people released from a single jail in Illinois in March are “associated with 15.7 percent of all documented novel coronavirus disease (COVID-19) cases in Illinois and 15.9 percent in Chicago as of April 19, 2020”. From the New York Times op-ed, “Roughly one in six of all cases in the city and state were linked to people who were jailed and released from this single jail”. On the basis of this claim, both the paper and the op-ed make a bunch of policy recommendations in the interest of public health – eg, reducing unnecessary arrests.

To be clear, we largely agree with these policy recommendations – and separate from this paper, there’s been a lot of good work documenting the dire COVID situation in jails and prisons. Mass incarceration was a public health emergency even before the COVID-19 pandemic, and tens of thousands of people have now contracted coronavirus in overcrowded jails and prisons. The Cook County Jail was identified as “the largest-known source of coronavirus infections in the U.S.” in April. Many incarcerated individuals have died after being denied appropriate medical care.

However, we also feel that the statistical methodology in the paper is sufficiently flawed that it does not provide strong evidence to back up its policy recommendations, or much evidence at all about the effect of jail releases on community infections.

We are concerned both about the statistical methods it uses and the effect sizes it estimates. We have shared our concerns with the first author, who helpfully shared his data and thoughts but did not persuade us that any of the concerns below are invalid. Given the high-profile nature of the paper, we thought the statistics community would benefit from open discussion. Depending on what you and your readers think, we may reach out to the journal editors as well.

The analysis relies on multivariate regression. It regresses COVID cases in each zip code on the number of inmates released from the Cook County Jail to that zip code, adjusting for variables which include the number or proportion of Black residents, poverty rate, public transit utilization rate, and population density. A number of these variables are highly correlated: for example, the correlation between the number of Black residents in a zip code and the number of inmates released in March is 0.86 in the full Illinois sample (and 0.84 in Chicago zip codes). The results in the paper testify to the dangers of multivariate regression on highly correlated variables: the signs on several regression coefficients (eg, public transit utilization) flip from positive in the bivariate regressions (Supplemental Exhibit 2) to negative in the multivariate regressions (Supplemental Exhibit 3). If the regression coefficients do in fact have causal interpretations, we should infer that if more people use public transit, COVID cases will decrease, which doesn’t seem plausible.
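This sign-flip behavior is easy to reproduce in a toy simulation. The numbers below are made up; only the correlation structure (two predictors correlated at roughly 0.9, as in the paper's data) mimics the setting:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 355  # roughly the number of Illinois zip codes in the paper
x1 = rng.normal(size=n)                    # stand-in for inmates released
x2 = 0.9 * x1 + 0.45 * rng.normal(size=n)  # correlated covariate (r ~ 0.9)
y = 1.0 * x1 - 0.5 * x2 + 0.3 * rng.normal(size=n)

def ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y)), *X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_bivariate = ols([x2], y)[1]          # slope of y ~ x2 alone
b_multivariate = ols([x1, x2], y)[2]   # coefficient on x2 given x1

print(b_bivariate, b_multivariate)     # bivariate positive, multivariate negative
```

The bivariate slope on `x2` is positive (it proxies for `x1`), but once `x1` enters the regression the coefficient on `x2` flips sign, exactly the pattern seen between Supplemental Exhibits 2 and 3.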

Given the small samples (50 zip codes in Chicago, and 355 overall) and highly correlated variables, it is unsurprising that the results in the paper are not robust across alternate plausible specifications. Here are two examples of this.

First, we examined how the effect size estimate (ie, the coefficient on how many inmates were released in March) varied depending on which controls were included (assessing all subsets of the original controls used in the paper) and which subset of the dataset was used (Chicago or the full sample). We found that the primary effect estimate varied by a factor of nearly 5 across specifications, and that for some specifications, was no longer statistically significant. Similarly, in Appendix Exhibit 2 (top table, Chicago estimate) the paper shows that including all zip codes rather than just those with at least 5 cases (which adds only 3 additional zip codes) renders the paper’s effect estimate no longer statistically significant.
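A specification sweep of this kind can be sketched as follows. The data here are simulated, not the paper's; the covariates are hypothetical stand-ins that share a common factor, and the point is only how much the key coefficient moves as correlated controls are toggled in and out:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 355
shared = rng.normal(size=n)  # latent factor the covariates all track

# Four hypothetical correlated controls, plus the key predictor
controls = {f"c{i}": shared + 0.5 * rng.normal(size=n) for i in range(4)}
releases = shared + 0.5 * rng.normal(size=n)
cases = 0.5 * releases + shared + rng.normal(size=n)

def coef_on_releases(ctrl_names):
    """OLS coefficient on `releases` with the given subset of controls."""
    X = np.column_stack([np.ones(n), releases]
                        + [controls[c] for c in ctrl_names])
    return np.linalg.lstsq(X, cases, rcond=None)[0][1]

# Refit for every subset of controls and track the key coefficient
coefs = [coef_on_releases(s)
         for r in range(len(controls) + 1)
         for s in itertools.combinations(controls, r)]
print(min(coefs), max(coefs))  # the estimate moves a lot across specifications
```

With correlated controls, which subset is included materially changes the headline coefficient, which is why robustness across specifications matters.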

Second, the results are not robust when fitting other standard regression models which account for overdispersion. As a robustness check, the paper fits a Poisson model to the case count data (Appendix Exhibit 4). Standard checks for overdispersion, like the Pearson χ² statistic from the Poisson regression, or fitting a quasipoisson model, imply the data are overdispersed. So we refit the Poisson model used in Appendix Exhibit 4 on the Chicago sample using three methods which are more robust to overdispersion: (1) we bootstrapped the confidence intervals on the regression estimate from the original Poisson model; (2) we fit a quasipoisson model; and (3) we fit a negative binomial model. All three of these methods yield confidence intervals which are much wider than the original Poisson confidence intervals, and which overlap zero. (For consistency with the original paper, we performed all these regressions using the same covariates used in Appendix Exhibit 4 for the original Poisson regression. We’re pretty sure that setup is non-standard, however, or at least it doesn’t seem to agree with the way you do it in your stop-and-frisk paper—it models cases as exponential in population, rather than using population as an offset term—so we also perform an alternate regression using a more standard covariate setup. Both methods yield similar conclusions.) Overall this analysis implies that, when overdispersion in the data is correctly modeled, the Chicago sample results are no longer statistically significant. Even in the full sample, the results are often not statistically significant depending on what specification is used.
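The overdispersion check itself is simple. Here is a sketch on simulated counts (an intercept-only Poisson fit stands in for the paper's full model; the negative binomial parameters are made up to produce a Chicago-sized overdispersed sample):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50                       # Chicago-sized sample
mu, r = 20.0, 2.0            # NB mean and "size"; variance = mu + mu**2 / r
p = r / (r + mu)
y = rng.negative_binomial(r, p, size=n)  # overdispersed counts

# Intercept-only Poisson fit: the MLE of the rate is just the sample mean
mu_hat = y.mean()

# Pearson chi-square statistic; under a correctly specified Poisson model,
# X2 / df should be close to 1
pearson_x2 = ((y - mu_hat) ** 2 / mu_hat).sum()
dispersion = pearson_x2 / (n - 1)
print(dispersion)  # well above 1: Poisson standard errors are too narrow
```

When the dispersion statistic is far above 1, Poisson confidence intervals are too narrow, which is why the quasipoisson and negative binomial refits widen them.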

To be clear, I would remain skeptical of the basic empirical strategy in the paper even if the sample were ten times as large, and none of the above applied. But I’m curious about your and your readers’ thoughts on this – as well as alternate strategies for estimating the number of community infections jail releases cause.

In addition to the methodological concerns summarized above, the effect sizes estimated in the paper seem somewhat unlikely. The paper estimates that each person released from the Cook County Jail results in approximately two additional reported cases in the zip code they are released to. Two facts make us question this finding. First, the CDC estimates that cases are underreported by a factor of ten, so to cause two reported infections, the average released person would have to cause twenty infections. Second, not everyone who left Cook County Jail was infected. Positive test rates at Cook County Jail are 11%; the overall fraction of inmates who were infected is likely lower, since it is probable that individuals with COVID-19 were more likely to be tested, but we use 11% as a reasonable upper bound on the fraction of released people who were infected. Combining these two facts, in order for the average person released to cause two reported cases, the average infected person released in March would have to cause nearly two hundred cases by April 19 (when the paper sources its case counts). This isn’t impossible — there’s a lot of uncertainty about the reproductive number and the degree of case underreporting, and not all detainees are included in the booking dataset. Still, we coded up a quick simulation based on estimates of the reproductive number in Illinois, and it seems somewhat unlikely.
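The arithmetic behind this back-of-envelope check, using the paper's point estimate and the two adjustments described above:

```python
# Back-of-envelope check on the paper's effect size
reported_cases_per_release = 2.149  # the paper's point estimate
underreporting_factor = 10          # CDC: ~10 infections per reported case
fraction_infected = 0.11            # jail positive-test rate (an upper bound)

implied = (reported_cases_per_release * underreporting_factor
           / fraction_infected)
print(round(implied))  # ~195 onward infections per infected released person
```

That is, roughly 195 infections caused by the average infected released person in about a month, which is what makes the estimate hard to believe.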

My reply:

There are three things going on here: language, statistics, and policy.

*Language.* In the article, the authors say, “Although we cannot infer causality, it is possible that, as arrested individuals are exposed to high-risk spaces for infection in jails and then later released to their communities, the criminal justice system is turning them into potential disease vectors for their families, neighbors, and, ultimately, the general public.” And in the op-ed: “for each person cycled through Cook County Jail, our research shows that an additional 2.149 cases of Covid-19 appeared in their ZIP code within three to four weeks after the inmate’s discharge.” I appreciate their careful avoidance of causal language. But “2.149”? That’s ridiculous! Why not just say 2.14923892348901827390823? Even setting aside all identification issues, you could never estimate this parameter to three decimal places. Indeed, the precision of that number is immediately destroyed by the vague “three to four weeks” that follows it. You might as well say that a dollop of whipped cream ways 2.149 grams.

But I don’t like this bit in the op-ed: “Roughly one in six of all cases in the city and state were linked to people who were jailed and released from this single jail, according to data through April 19.” All the analysis is at the aggregate level. Cases have not been “linked to people” at the jail at all!

I do not mean to single out the authors of this particular article. The real message is that it’s hard to be precise when describing what you’ve learned from data. These authors were so careful in almost everything they wrote, but even so, they slipped up at one point!

*Statistics.* The supplementary material has a scatterplot of the data from the 50 zip codes in Chicago that had 5 or more coronavirus cases during this period:

First off, I’m not clear why the “Inmates released” variable is sometimes negative. I guess that represents some sort of recoding or standardization, but in that case I’d prefer to see the raw variable in the graph.

As to the rest of the graphs: the key result is that the rate of coronavirus cases is correlated with the rate of inmate releases but *not* correlated with poverty rate. That seems kinda surprising.

So it seems that the authors of this paper did find something interesting. If I’d been writing the article, I would’ve framed it that way: We found this surprising pattern in the zip code data, and here are some possible explanations. Next logical step is to look at data from other cities. Also it seems that we should try to understand better the selection bias in who gets tested.

Unfortunately, the usual way to write a research article is to frame it in terms of a conclusion and then to defend that conclusion against all comers. Hence robustness checks and all the rest. That’s too bad. I’d rather frame this as an exploratory analysis than as an attempt at a definitive causal inference.

My take-home point is not that this article is bad, but rather that they saw something interesting that it would be worth tracking down, comparing to other cities and also thinking about the selection bias in who got tested.

*Policy.* Unlike the journal Psychological Science, I support criminal justice reform. I’m in agreement with the op-ed that we should reduce the number of people in jail and prison. I’d say that whatever the story is regarding these coronavirus numbers.

> My take-home point is not that this article is bad

But the article is very, very bad, right? This article seems (to me) at least as bad as himmicanes and all the stuff you (correctly!) rant against on a regular basis.

> they slipped up at one point!

Huh? Emma Pierson argues (convincingly, in my view) that they “slipped up” in many, many ways. Do you disagree with her claims? If so, please explain. She has convinced me!

I hope you are not pulling your punches because you agree with the policy recommendations . . .

D:

I’d have the same reaction to the article whether or not I agreed with its policy recommendations. And, yes, I do think this is a bad article, as bad as the himmicanes article. When I wrote, “My take-home point is not that this article is bad,” I’m just saying that the badness of the article is not my take-home point. There are a lot of bad articles. At least this bad article has an interesting set of scatterplots. Those researchers or someone else could follow up and see if anything’s going on with this.

In the above post, I bring up three issues (following Pierson et al.):

1. There is a causal inference made based on a regression that has the usual problems of noisy data, forking paths, and an underlying lack of substantive theory. As usual in these settings, the data are consistent with the authors’ claims but the data are also consistent with zero effects or with a zillion other stories.

2. The scatterplots show an interesting pattern in the data from Chicago. It would be good for this to be explored further. I’d say the regression analysis is just a distraction from this data pattern. I don’t care how many robustness checks were done in that paper; what I really want is for them to look into the data more carefully.

3. In most of the article and op-ed, the authors are pretty careful to avoid causal language. This does not resolve the problems noted in item 1 above, but it’s something. But in one place they flat-out said something false, or at the very least highly misleading. I don’t like it when Cass Sunstein does this in the New York Times, I don’t like it when David Brooks does this, I don’t like it when anybody does it. In all these cases, I think the problem is as much sloppiness and journalistic convention as anything else, but I still don’t like it.

Andrew wrote:

“You might as well say that a dollop of whipped cream ways 2.149 grams.”

Change that to

“You might as well say that a dollop of whipped cream weighs 2.149 grams.”

Sure you could say that too.

But you can’t say that a dollop of whipped cream wheys 2.149 grams.

The units don’t match.

For some subset of the population, a homonym error is worse than a regression misinterpretation.

Thanks, Paul, Andrew and gec — I needed a good laugh about now.

Wait a minute: 50 zip codes in Chicago? When I read the fine print, I see they exclude any zip code with fewer than five detected COVID cases. Censoring on the dependent variable has to bias the estimate upward.

Sure enough, when I keep reading I find the regression where they include the zeros. Now the estimate for Chicago is 1 COVID case per released inmate, plus or minus one. I find that result entirely believable (although the other objections still apply).

(By the way, the negative numbers in the plots occur because X and Y are regression adjusted for the other independent variables.)

Good thing we have powerful computers and statistical programming languages and spreadsheets and million column data sets. If we could write some code to check our assumptions someday we might figure out how to use all that stuff.

Weren’t people spreading it on purpose in prisons in the hope of getting let out? Thought I saw a video a few months ago of that going on, maybe in LA.

There’s some sort of error in those scatter plots. Each point represents one of the 50 Chicago zip codes, with COVID rate per capita on the Y and another variable on the X. The plot of interest (inmates released in March) has two points with COVID rates above 0.008 (one of which, upper right, probably contributes strongly to the line of fit). All the other plots only have one point with COVID rates above 0.008. Similarly, the lowest COVID rate on the “inmates released” plot (just above 0.002) doesn’t appear in any of the other plots. Something is wrong here, at least with those outliers.

Feeling rather impish at this point, I can’t resist rephrasing what you said as, “Fin says there’s something fishy in those scatter plots.” ;~)

Is this the right plaice for puns?

Is there a wrong place for puns?

Now we’re really floundering…

https://en.wikipedia.org/wiki/Plaice

Very good.

There’s no wrong place for puns, but that one gave me a haddock.

Oh my cod!

Bassphemy

> But I don’t like this bit in the op-ed: “Roughly one in six of all cases in the city and state were linked to people who were jailed and released from this single jail, according to data through April 19.” All the analysis is at the aggregate level. Cases have not been “linked to people” at the jail at all!

Maybe your issue is with the lack of evidence in this case. Or do you dislike that kind of language in general? Nine out of ten lung cancer deaths are linked to smoking, but all the analysis is done at the aggregate level: one could also say that lung cancer deaths have not been linked to smoking at all.

What is with commenters on this blog who want smoking to be actually-not-so-bad?

We more or less understand the mechanisms by which smoking causes lung cancer: https://onlinelibrary.wiley.com/doi/full/10.1002/ijc.27816

And we have data about whether individuals who smoke are more likely to get cancer: https://onlinelibrary.wiley.com/doi/full/10.1002/ijc.27339

It’s not even remotely the same as the situation in the quotation where we know nothing about whether prisoners’ contacts are more likely to get COVID and that is what is driving the spike in their zip codes.

I think you’re missing the point here a bit, anon e mouse.

I’m not sure that I follow your point here. All the analysis is done at the aggregate level? I presume that most of the research on this is actually done at the individual level, i.e. looking at the prevalence of lung cancer in individuals who smoke, not looking at the prevalence of lung cancer in zip codes and the prevalence of smoking in those zip codes and then trying to correlate the two. This is, though, precisely the sort of analysis the article in question attempts.

One case is, of course, much clearer than the other (that was intentional). In the cancer example you compare prevalence in a group of individuals who smoke and in a group of individuals who don’t smoke (smoking, you might get cancer). In the covid example you compare prevalence in groups of individuals who are “close” to different numbers of people released from jail who may be infected (being close to infected people, you might get infected). The definition of closeness may not be good enough and the study may be overlooking many issues, but that was not the question.

The question is whether one can say that lung cancer cases have been linked to smoking, when it may not be 100% certain for any individual smoker (if that’s not correct consider a different example; I’m not an oncologist, and maybe lung cancer cases in smokers can actually be classified as smoking cancers and non-smoking cancers). I guess that in the coronavirus jailbreak case it sounds wrong to present it in terms of an epidemiological link because (unlike in the smoking example) a more detailed description could have identified precisely which cases were linked through specific social contact chains to those people released from jail.

To be clear, in the lung cancer example the “cases linked to smoking” are relatively well identified: the cases due to smoking would be the majority of the cases in smokers. But the same language is often used in less clear-cut settings like “deaths linked to air pollution” or whatever.

I see the problem as being largely that “linked to” is a very fuzzy phrase that could be used to describe a lot of things, and could be interpreted in many different ways.

speaking of significant language problems, you might want to change your title, given that the issue here, as is clear in the article and op-ed, is jails and not prisons. the distinction between these two rather different institutions is very obviously highly relevant to the findings and what the authors suggest are their implications. jails cycle people in and out constantly in very high numbers; prisons are a different situation.

Eru:

Following your comment, I changed the title from “Coronavirus breaking out of prison” to “Coronavirus jailbreak.”

It seems very odd indeed that the main findings of the original article come from fitting a model to data where the number of COVID cases in the area (the response variable!) is five or more. That’s like a regression to understand income where you take out all the poor people first, out of a misguided idea that low incomes aren’t statistically material.
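The distortion from selecting on the response variable is easy to see in a toy example. This sketch is Gaussian for simplicity, and in this setting truncation attenuates the slope; with zero-heavy count data, as in the paper, the selection can instead push the estimate upward, but either way the truncated fit no longer estimates the true relationship:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
y = x + rng.normal(size=n)   # true slope is 1

slope_full = np.polyfit(x, y, 1)[0]

keep = y > 1.0               # analogue of dropping low-case zip codes
slope_truncated = np.polyfit(x[keep], y[keep], 1)[0]

print(slope_full, slope_truncated)  # the truncated fit is badly biased
```

The full-sample fit recovers a slope near 1, while the fit on the truncated sample does not.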

does that seem unreasonable in the study context where they have data from only one jail, the catchment area of which does not represent the entire state and during a relatively early time period in which covid had not yet penetrated all ZIPs in the state? the study notes that the 5 or more factor was dictated by the limitations of official data… what alternative would you have used?

The alternative I’d have used is to add to every conclusion (in both the article and the op-ed) “…in zip codes with at least five COVID cases.” It’s an awkward addition, and it almost absurdly narrows the claim the authors are trying to make, both of which are benefits above and beyond being completely accurate–the data limitations actually do make the analysis awkward and do narrow the range of possible (empirically supported) conclusions to an absurd degree. That’s something the reader should get a sense of right away. Within the data/evaluation community, we know arbitrary limitations like this are just something you have to put up with when using secondary data, and we (hopefully) mentally penalize the results accordingly, but most NYT readers will not do so.

The normal approach would be to increase the analysis unit size until the full city population is covered.

They could have grouped the data above the zip-code level such that each grouping consists of adjacent zips with some minimum population and the required five cases (or some other characteristics). I can’t imagine this would take more than half a day to do.

I can’t see a reason that this approach would cause any new problems. There’s nothing socially or analytically significant about zips as far as I know.
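A minimal sketch of that grouping idea: one-dimensional and greedy, whereas a real version would merge geographically adjacent zip codes and respect minimum populations. The case counts below are made up:

```python
# Greedy sketch: merge consecutive areas until each group has >= 5 cases.
def group_min_cases(cases, min_cases=5):
    """Return lists of indices, each group meeting the case threshold."""
    groups, current, total = [], [], 0
    for i, c in enumerate(cases):
        current.append(i)
        total += c
        if total >= min_cases:
            groups.append(current)
            current, total = [], 0
    if current:  # fold any short remainder into the last group
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups

print(group_min_cases([0, 2, 7, 1, 1, 1, 1, 9, 0]))
# -> [[0, 1, 2], [3, 4, 5, 6, 7, 8]]
```

Every area is kept, so nothing is censored on the response; the price is coarser analysis units.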

Zip codes tend to be built on administrative boundaries like city or county boundaries. The socioeconomic situation of people in adjacent zip codes can be VASTLY different. Particularly in gentrifying cities. So, there’s that.

The bouncing betas issue is C&R 101 and should’ve been caught by reviewers. The standard fix is to make composites of the collinear predictors. I’d be curious if Ms. Pierson and her sisters tried that and, if they did, how it changed the results.
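The composite fix mentioned above might look like this: z-score the collinear predictors and average them into a single index. The covariates here are simulated stand-ins built from a shared factor:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 355
factor = rng.normal(size=n)  # shared factor behind the collinear controls

# Three hypothetical collinear covariates
controls = np.column_stack([factor + 0.4 * rng.normal(size=n)
                            for _ in range(3)])

# Composite: average of the z-scored covariates, entered as one predictor
zscored = (controls - controls.mean(axis=0)) / controls.std(axis=0)
composite = zscored.mean(axis=1)

r = np.corrcoef(composite, factor)[0, 1]
print(r)  # the composite tracks the shared factor closely
```

Entering the single composite instead of the three near-duplicates avoids the bouncing-beta problem, at the cost of no longer attributing the effect to any one covariate.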

I also think there needed to be more justification for the choices the authors made in the analysis, or at least acknowledgement of the alternatives. They can’t rule out forking paths retroactively, but they can show what would’ve happened had they taken other paths. In an age with arXiv and post-publication review, journal space limitations aren’t an excuse anymore.

The evidence cannot support the authors’ strong conclusions, and the editor should’ve forced a change there. Do these criticisms warrant contacting the journal? The collinearity issue may, particularly if the analysis with composites gives very different answers. Even then, I think the reasonable thing for the journal to do would be to publish this as a letter to the editor.

Otherwise, most of these appear to be conventional, if obviously weak, analyses. Weakness is not, in my opinion, a reasonable basis for a journal to flag a published paper, although sure, the journal should apologize for the failures of its pre-publication review (ha). Analytical weakness is the basis for the kind of post-publication take-downs happening right now. Rhetorical weakness is the basis for public embarrassment. I urge Ms. Pierson and her sisters to write up their counterarguments and alternative analyses in a new paper that builds on (and buries) the original. Heck, ask the original author if he wants to co-author.