Implicit assumptions in the Tversky/Kahneman example of the blue and green taxicabs

Juan de Oyarbide writes:

In Chapter 16 of the book “Thinking, Fast and Slow,” titled “Causes Trump Statistics,” Daniel Kahneman introduces the distinction between statistical base rates and causal base rates in Bayes’ rule. Kahneman claims, with a simple example, that our reasoning often fails to find the correct Bayesian mathematical model, and that this depends on how the problem is presented to us. He says that under some circumstances the omission of priors generates an overestimation of posterior probabilities.

I wonder whether the two presentations of the problem actually have the same mathematical representation, or whether there might be some model misidentification. I think the way the information is presented could condition our understanding of the priors, and therefore the associated uncertainty (e.g., information on population probabilities with uncertainty about risk is not the same as having the same probabilities of risk associated with each population and then equal population weights).

Oyarbide provides further details:

I found the problem online, I will share it below.

A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data: 85% of the cabs in the city are Green and 15% are Blue. A witness identified the cab as Blue. The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time.

What is the probability that the cab involved in the accident was Blue rather than Green?

Now consider a variation of the same story, in which only the presentation of the base rate has been altered. You are given the following data: The two companies operate the same number of cabs, but Green cabs are involved in 85% of accidents. The information about the witness is as in the previous version.

Kahneman writes:

The two versions of the problem are mathematically indistinguishable, but they are psychologically quite different. People who read the first version do not know how to use the base rate and often ignore it. In contrast, people who see the second version give considerable weight to the base rate, and their average judgment is not too far from the Bayesian solution. Why?

In the first version, the base rate of Blue cabs is a statistical fact about the cabs in the city. A mind that is hungry for causal stories finds nothing to chew on: How does the number of Green and Blue cabs in the city cause this cab driver to hit and run? In the second version, in contrast, the drivers of Green cabs cause more than 5 times as many accidents as the Blue cabs do. The conclusion is immediate: the Green drivers must be a collection of reckless madmen! You have now formed a stereotype of Green recklessness, which you apply to unknown individual drivers in the company.

The stereotype is easily fitted into a causal story, because recklessness is a causally relevant fact about individual cabdrivers. In this version, there are two causal stories that need to be combined or reconciled. The first is the hit and run, which naturally evokes the idea that a reckless Green driver was responsible. The second is the witness’s testimony, which strongly suggests the cab was Blue. The inferences from the two stories about the color of the car are contradictory and approximately cancel each other. The chances for the two colors are about equal (the Bayesian estimate is 41%, reflecting the fact that the base rate of Green cabs is a little more extreme than the reliability of the witness who reported a Blue cab). The cab example illustrates two types of base rates.

Statistical base rates are facts about a population to which a case belongs, but they are not relevant to the individual case. Causal base rates change your view of how the individual case came to be. The two types of base-rate information are treated differently: Statistical base rates are generally underweighted, and sometimes neglected altogether, when specific information about the case at hand is available. Causal base rates are treated as information about the individual case and are easily combined with other case-specific information.

Oyarbide writes:

My question is: are the problems mathematically indistinguishable? In the first case we don’t have information about risk, so some prior should be incorporated before including population facts. My second question is: is there such a thing as a statistical base rate and a causal base rate? Shouldn’t we always write a problem based on causality and incorporate population information into the priors?

My reply is that neither problem is fully mathematically specified; they both rely on implicit assumptions of independence or random sampling. So you can think of the problems as different to the extent that the different scenarios might bring to mind different models of departures from this unstated assumption.
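For reference, here is the Bayesian arithmetic behind Kahneman’s 41% figure, sketched in Python (the variable names are mine):

```python
# Taxicab problem: P(cab is Blue | witness says "Blue") via Bayes' rule.
prior_blue = 0.15  # base rate: 15% of cabs are Blue
p_correct = 0.80   # witness identifies each color correctly 80% of the time

# P(says "Blue") = P(says Blue | Blue) P(Blue) + P(says Blue | Green) P(Green)
numer = p_correct * prior_blue
denom = numer + (1 - p_correct) * (1 - prior_blue)
posterior_blue = numer / denom

print(round(posterior_blue, 3))  # 0.414, Kahneman's "41%"
```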

This is not an argument against self-citations. It’s an argument about how they should be counted. Also, a fun formula that expresses the estimated linear regression coefficient as a weighted average of local slopes.

From Regression and Other Stories, section 8.3: Least squares slope as a weighted average of slopes of pairs:
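The identity says that the least-squares slope equals a weighted average of the slopes of all pairs of points, with weights proportional to (x_i - x_j)^2. Here is a quick numerical check with made-up data:

```python
from itertools import combinations

def ols_slope(x, y):
    """Ordinary least-squares slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))

def pairwise_slope(x, y):
    """Weighted average of pairwise slopes, weights (x_i - x_j)^2."""
    num = den = 0.0
    for i, j in combinations(range(len(x)), 2):
        if x[i] != x[j]:  # pairs with equal x carry zero weight anyway
            w = (x[i] - x[j]) ** 2
            num += w * (y[i] - y[j]) / (x[i] - x[j])
            den += w
    return num / den

x = [1.0, 2.0, 4.0, 7.0]
y = [2.1, 2.9, 5.2, 8.8]
print(abs(ols_slope(x, y) - pairwise_slope(x, y)) < 1e-12)  # True
```

Note that pairs with equal x-values contribute nothing to the weighted average, which is the connection to the "influence" discussion below.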

I was reminded of this after an email discussion with Shravan Vasishth, which started with this note from him:

I just saw this article on gaming h-indices:

Isn’t it odd that google scholar does not have a button for excluding self-citations in the computation of citation counts?

This paper also led me to search online for a portal where I could buy some citations to boost my h-index. I found one… in Ukraine:

You can just order a publication here. Cool! It’s a prize-winning website too (did they buy that prize as well?)

I replied that I don’t think that self-citations are so bad, though. In all seriousness, I think all my own self-citations are perfectly legitimate.

I once received an angry email from a famous economist attacking me for referring too much to my own work. I replied that this was not out of egotism; it was just that I was referring to the work with which I was most familiar.

Shravan responded:

I have nothing against self-citations per se. I feel that the problem with self-citations is: it can be used to boost one’s metrics. This can happen even without the researcher gaming the system consciously. If my lab produces some 25 articles a year, then it is essentially guaranteed that my h-index and citation count will steadily go up. The question that one should be able to answer when using h-indices and citation counts to evaluate a scientist is: what is the relevance of this scientist for the field? I think it is fair to need to know what the metrics are with and without self-citation. If my h-index without self-citations is 25 and with is 55, that’s very informative.

IIRC Web of Science allows one to exclude self-citations when counting beans, sorry, I meant publications; but google scholar does not.

Self-citations are a real problem when doing bibliometrics because they represent a mix of how “productive” a scientist is plus how relevant they are.

Other clever ways I have seen (I don’t have any hard data; this is just an impressionistic view based on 25 years of observation) of researchers boosting citations without self-citing are

– in the review process: reviewers often try to force authors to cite their work.

– through citation cartels. The better connected one is in the field, the more one will get cited. This can be due to where one went to school, which part of the world one is from, how much effort one spends schmoozing with the big guns at conferences, etc.

Good point. I don’t think there’s anything wrong with my self-citing, but I agree that when I self-cite, this should not be taken as evidence of the influence of my work.

Here’s the point. A citation-linking service such as Google Scholar has two useful purposes:

1. You click on a paper and it shows you everything in the database that has cited that paper. This sort of forward tracing is often helpful in turning up other work that is relevant to what you’re studying. For this purpose, self-citations can be very helpful: if I’m already reading a paper by some research team, I might well be interested in their follow-up work.

2. Count the citations (or do some sort of weighted count, page-rank style) and it gives you a measure of the paper’s influence. For this purpose, self-citations don’t seem relevant, as it doesn’t really count as influence if the only person who’s citing a paper is the author. So I agree with Shravan that there should be an option to not count those. Also, I guess, another option would be to turn off all citations to and from the International Supply Chain Technology Journal.

Hence the title of this post. Self-citations shouldn’t be counted when measuring influence, but they should be included in the list of forward references.

The above discussion of influence in citations reminded me of the concept of “influence” in regression. When two points have the same x-value, they provide no information about the slope. Which is kinda like self-citations: it only counts as “influence” if you move away from the baseline in some way. Indeed, one might say that the further you move, the more influence you have. Which suggests the possibility of some model-based influence measure, going beyond counting and even going beyond a recursive counting-like procedure such as page rank.

Pervasive randomization problems, here with headline experiments

Randomized experiments (i.e., A/B tests, RCTs) are great. A simple treatment vs. control experiment where all units have the same probability of assignment to treatment ensures that receiving treatment is not systematically correlated with any observed or unobserved characteristics of the experimental units. There will be differences in, e.g., mean covariates between treatment and control, but these are already accounted for in standard statistical inference about the effects of the treatment.

However, things can go wrong in randomization. Often this is understandable as some version of latent noncompliance or attrition. Some units get assigned to treatment, but something downstream overrides that and the original assignment is lost (a kind of latent noncompliance). Or maybe when that mismatch is detected, something downstream drops those observations from the data. Or maybe treatment causes units (e.g., users of an app) to exit immediately (e.g., the app crashes) and that unit isn’t logged as having been exposed to the experiment.

So it is good to check that some key summaries of the assignments are not extremely implausible under the assumed randomization. For example, we may do a joint test for differences in pre-treatment covariates. Or — and this is particularly useful when we lack any or many covariates — we can just test that the number of units in each treatment is consistent with our planned (e.g., Bernoulli(1/2)) randomization; in the tech industry, this is sometimes called a “sample ratio mismatch” (SRM) test.
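As a concrete sketch (my own code, not from any particular experimentation platform), an SRM test for a two-arm experiment is just a one-degree-of-freedom chi-squared goodness-of-fit test on the two counts:

```python
import math

def srm_pvalue(n_treat, n_control, p_treat=0.5):
    """Chi-squared goodness-of-fit test (1 df) for sample ratio mismatch."""
    n = n_treat + n_control
    expected_t = n * p_treat
    expected_c = n * (1 - p_treat)
    chi2 = ((n_treat - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # Survival function of a chi-squared variate with 1 df
    return math.erfc(math.sqrt(chi2 / 2))

# A planned 50/50 split that came out 10,100 vs 9,900: not alarming.
print(srm_pvalue(10_100, 9_900))  # ~0.157
# 10,500 vs 9,500: strong evidence the randomization went wrong.
print(srm_pvalue(10_500, 9_500))  # ~1.5e-12
```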

These kinds of problems are quite common. One very common way they happen arises from the streaming arrival of randomization units to the point where treatment is applied. In cases where users aren’t logged in, this is unavoidable. In cases where there is a universe of user accounts, it can still be a dead end to randomize them all to treatments and use that as the analytical sample: most of these users would never have touched the part of the service where the treatment is applied. So instead it is common to trigger logging of exposure to the experiment and just analyze that sample of users (which might be less than 1% of all users); use of this kind of “triggering” or exposure logging is very common, but also can present these problems. For example, an analysis of experiments across several products at Microsoft found that around 6% of such experiments had sample ratio mismatches (at p<0.0005).

Here’s another example of randomization problems — with public data.

Upworthy Research Archive

Nathan Matias, Kevin Munger, Marianne Aubin Le Quere, and Charles Ebersole worked with Upworthy to curate and release a data set of over 15,000 experiments, with a total of over 150,000 treatments. Each of these experiments modifies the headline or image associated with an article on Upworthy, as displayed when viewing a different focal article; the outcome is then clicks on these headlines. You may recall Upworthy as a key innovator in “clickbait” and especially clickbait with a particular ideological tilt.

One of the things I really like about how they released this data is that they initially made only a subset of the experiments available as an exploratory data set. This allowed researchers to do initial analyses of that data and then preregister analyses and/or predictions for the remaining data. To me this helpfully highlighted that sometimes the best way to provide a data set as a public good isn’t to provide it all at once, but to structure how it is released.

Randomization problems

There were some problems with the randomization to treatments in the released data. In particular, Garrett Johnson pointed out to me that many times there were too many or too few viewers assigned to one of the treatments (i.e. SRMs). In 2021, I followed up on this some more. (The analysis below is based on the 4,869 experiments in the exploratory data set with at least 1,000 observations.)

If you do a chi-squared test of the proportion in each treatment, you get a p-value distribution that looks like this once you zoom in on the interesting part:

ECDF of p-values in the Upworthy data

That is, there are way too many tiny p-values compared with the uniform distribution — or, more practically, there are lots of experiments that don’t seem to have the right number of observations in each condition. Some further analyses suggested that these “bad” SRM experiments were especially common for experiments created in a particular period:

histogram of Upworthy p-values by week of experiment

But it was hard to say much about why that was.

So in 2022 I contacted Nathan Matias and Kevin Munger. They took this quite seriously, but also — because they had not conducted these experiments or built the tooling with which they were conducted — it was difficult for them to investigate the problem.

Well, last week they publicly released the results of their investigation. They hypothesize that this problem was caused by some caching, whereby subsequent visitors to a particular focal article page might be shown the same treatment headlines for other articles. This would create an odd kind of autocorrelated randomization. Perhaps point estimates could still be unbiased and consistent, but inference based on assuming independent randomization could be wrong.

I hadn’t personally encountered this kind of caching issue before in an experiment I’ve examined. Other caching issues can crop up, such as where a new treatment will have more cache misses, potentially slowing things down enough that some logging doesn’t happen. So this is perhaps a useful addition to a menagerie of randomization devils. (Some of these issues are discussed in this paper and others.)

They identify a particular period where this problem is concentrated: June 25, 2013 to January 10, 2014.

In advance of their announcement, Nathan and colleagues contacted the several teams who have published research using this amazing collection of experiments. Excluding the data from the period (making up 22% of the experiments) with this particularly acute excess of SRMs, these teams generally didn’t have their core results change all that much, so that’s nice.

Remaining problems?

I looked back at the full data set of experiments. Looking outside of the period where the problem is concentrated, there are still too many SRMs. 113 of the experiments outside this period have SRM p-values < 0.001. That’s 0.45% with a 95% confidence interval of [0.37%, 0.54%], so this is clearly an excess of such imbalanced experiments (compared with an expected 0.1% under the null) — even if much, much fewer than in the bad period (when this fraction is roughly 2/3). The problem is worse before, rather than after, the acute period, which makes sense if the team fixed a root cause:

ECDF of p-values in the Upworthy data by period

If only around half a percent of the remaining experiments have problems, likely many uses of this data are unaffected. After all, removing 22% of the experiments didn’t have big effects on the conclusions of other work. However, of course we don’t necessarily know we have power to detect all violations of the null hypothesis of successful randomization — including some that could invalidate the results of that experiment. But, overall, compared with not having done these tests, I think on balance we perhaps have more reason to be confident in the remaining experiments — especially those after the acute period.


I hope this is an interesting case study that further illustrates how pervasive and troublesome randomization problems can be. And I may have another example coming soon.

[This post is by Dean Eckles. Because this post discusses practices in the Internet industry, I note that my disclosures include related financial interests and that I’ve been involved in designing and building some of those experimentation systems.]

More on the disconnect between who voters support and what they support

Palko writes:

One poll and I’m a bit suspicious of it, but still

U. of North Florida poll, FL reg voters

Trump 50
Biden 43

DeSantis 51
Biden 42

6 wk abortion ban, no exceptions
Support 22
Oppose 75

Concealed carry
Support 21
Oppose 77

Ban CRT/DEI on campus
Support 35
Oppose 61

Seems plausible.

P.S. I’m posting this to appear a few months in the future so maybe more polls on this will have come by then.

What to do with age? (including a regression predictor linearly and also in discrete steps)

Dale Lehman writes:

Recently released preprint regarding COVID/FLU vaccines and potential risks of stroke in the aged population:
I haven’t read it carefully and my first impression is that it is a worthy effort and appropriately caveated. Also, the conclusions seem reasonable given the observational nature of the study and limitations of the data. My questions involve 2 things. The subgroups the examine are 65-74 years old, 75-84, and >85. I’ve seen these types of binning common in medical studies. But why? The actual ages are certainly known, so why not treat age as a continuous variable? So, of course I look for the data, thinking that I can see if treating age as continuous reveals anything of interest. Here is the link for the data:
It says “All data produced in the present work are contained in the manuscript.” That seems like a worthless statement to me – it is virtually true of all papers that the data produced in the study is in the paper. The underlying data, apparently, is not addressed. No doubt it is protected for the usual privacy reasons (though I really don’t see the dangers in having this data on the anonymized 5 million+ Medicare recipients), but can’t they at least talk about the data used in the study in the data availability statement?

My reply: we discuss this general point in chapter 12 of Regression and Other Stories. Short answer is that discrete binning isn’t perfect, but it’s transparent and can be better than a simple linear model. One thing that people don’t always realize is that you can do binning and linear together, for example in R, y ~ z + age + age.65.74 + age.75.84 + age.85.up, which can have the best of both worlds. The fitted model looks kind of goofy: it’s a step function with a slope, so it has a bit of a sawtooth appearance, but it gives some of the flexibility of binning along with doing something with trends within each age category.
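To illustrate the step-function-with-a-slope idea, here is a hypothetical Python sketch (the coefficients are made up for illustration, and with 65+ data one bin must serve as the baseline to avoid collinearity with the intercept):

```python
def age_predictors(age):
    """Age entered linearly plus in discrete bins (65-74 is the baseline bin)."""
    return [
        age,                             # linear trend
        1.0 if 75 <= age < 85 else 0.0,  # indicator: 75-84
        1.0 if age >= 85 else 0.0,       # indicator: 85+
    ]

# Made-up coefficients: a common slope plus a jump for each bin.
slope, step_75_84, step_85up = 0.02, 0.5, 1.1

def fit(age):
    b = age_predictors(age)
    return slope * b[0] + step_75_84 * b[1] + step_85up * b[2]

# The fit is piecewise linear with jumps at 75 and 85: a sawtooth.
print(fit(74.9), fit(75.0))  # roughly 1.498 then 2.0: a jump at the bin boundary
```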

In Stan, “~” should be called a “distribution statement,” not a “sampling statement.”

Aki writes:

The sampling statement name for tilde (~) statements in Stan documentation has been replaced with distribution statement. The background is

  • the tilde (~) statement does not do any sampling (but due to the sampling statement name, some people had wrongly assumed so, which caused confusion)
  • in the literature, tilde (~) is usually read “is distributed as”
  • right side of the tilde (~) statement can only be a built-in or user defined distribution
  • distribution statement makes it more natural to discuss the difference in defining the model with collection of distributions or with log density increments

The change affects only the documentation. The documentation has been revised to be clearer about the differences between describing models with distribution statements and with increment log density (target +=) statements.

As this blog lacks equations support, it’s best to go read the updated documentation on distribution statements.

In addition, Stan User’s Guide sections on censored data models and zero-inflated count models are good examples illustrating the difference between describing a data model with distribution statement or writing the likelihood directly with increment log density statement.

This doesn’t change Stan’s capabilities or performance in any way, but I think it’s still important!  I’ve often heard people say that the ~ in a Stan model corresponds to sampling, and it doesn’t!

Also don’t forget this example:

theta ~ normal(0, 1);
theta ~ normal(0, 1);

If you include the above two lines of code in your Stan model, the result is not a redundant specification that theta is distributed as normal(0, 1). Rather, the above two lines add two terms to the target log density. They are equivalent to:

target += normal_lpdf(theta | 0, 1);
target += normal_lpdf(theta | 0, 1);

and equivalent to the explicit and less computationally-efficient expression:

target += -0.5 * log(2*pi()) - 0.5 * square(theta);
target += -0.5 * log(2*pi()) - 0.5 * square(theta);

and, because the product of two normal(0, 1) densities is proportional to a normal(0, 1/sqrt(2)) density, equivalent (up to a constant in the target) to the single line of code:

theta ~ normal(0, 1/sqrt(2));

or, equivalently:

target += normal_lpdf(theta | 0, 1/sqrt(2));
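You can verify the equivalence numerically; this Python sketch (outside Stan, just mirroring normal_lpdf) checks that two normal(0, 1) increments and one normal(0, 1/sqrt(2)) increment differ only by a constant in theta, which is all that matters for posterior inference:

```python
import math

def normal_lpdf(x, mu, sigma):
    """Log density of the normal distribution, as in Stan's normal_lpdf."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

# Difference between the two-increment and one-increment versions of the target:
diffs = [2 * normal_lpdf(t, 0, 1) - normal_lpdf(t, 0, 1 / math.sqrt(2))
         for t in (-2.0, -0.5, 0.0, 1.0, 3.0)]
print(all(abs(d - diffs[0]) < 1e-12 for d in diffs))  # True: constant in theta
```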

Anyway, it’s not a “sampling statement.” This misinterpretation was bugging us for a while and then we just recently noticed the faulty terminology in the manual (which could well be something that I used to write or say myself!), so we fixed it!

Stan does have a sampling statement; it’s called “random number generation”

The funny thing is that you can do sampling in Stan! It has to be done in the generated quantities block; for example:

phi = normal_rng(0, 1);

What happens if you include the following two lines in your generated quantities block?

phi = normal_rng(0, 1);
phi = normal_rng(0, 1);

Stan code is executed sequentially, and the above code will first sample phi from a unit normal, then sample phi again from a unit normal. So the first line above is completely overwritten, in the same way that if you write, a=2; a=3; in Stan (or just about any other language), it will give the same result as a simple a=3;.

Stan of course also does sampling in its posterior inference. That’s something different. Here we’re talking about random sampling within Stan.

[Edit: Fixed RNG notation.]

New online Stan course: 80 videos + hosted live coding environment

Scott Spencer and AthlyticZ have a new online Stan course coming out soon and it looks fantastic! Scott has a ton of experience with Stan and is a great teacher (we’ve co-taught in-person Stan classes together several times). The course uses examples from sports, but the coding and modeling techniques are applicable to any domain. Check out Scott’s post on the Stan forum for all the details.

Comedy and child abuse in literature

I recently read Never Mind, the first of the Patrick Melrose novels by Edward St. Aubyn. I was vaguely aware of these novels, I guess from reviews when the books came out or the occasional newspaper or magazine feature story (the author is the modern-day Evelyn Waugh, etc.). Anyway, yeah, the book was great. Hilarious, thought-provoking, pitch-perfect, the whole deal. A masterful performance.

Also, it’s a very funny book about child abuse. The child abuse scenes in the book are not funny. They’re horrible, not played for laughs at all. But the book itself is hilarious, and child abuse is at the center of it.

This got me thinking about other classics of literary humor that center on child abuse: Lolita, of course, and also The Nickel Boys and L’Arabe du Futur.

I guess this is related to the idea that discomfort can be an aspect of humor. In these books, some of the grim humor arises from the disconnect between the horrible actions being portrayed and their deadpan depictions on the page.

I found all the above-mentioned books to be very funny, very fun to read, and very upsetting to read at the same time.

This well-known paradox of R-squared is still buggin me. Can you help me out?

There’s this well-known—ok, maybe not well-enough known—example where you have a strong linear predictor but R-squared is only 1%.

The example goes like this. Consider two states of equal size, one a “blue” state where the Democrats consistently win 55% of the two-party vote and the other a “red” state where Republicans win 55-45. The two states are much different in their politics! Now suppose you want to predict people’s votes based on the states they live in. Code the binary outcome as 1 for Republicans and 0 for Democrats: this is a random variable with standard deviation 0.5. Given state, the predicted value is either 0.45 or 0.55, hence a random variable with standard deviation 0.05. The R-squared is, then, 0.05^2/0.5^2 = 0.01, or 1%.

There’s no trick here; the R-squared here really is 1%. We’ve brought up this example before, and commenters pointed to this article by Rosenthal and Rubin from 1979 giving a similar example and this article by Abelson from 1985 exploring the issue further.

I don’t have any great intuition for this one, except to say that usually we’re not trying to predict one particular voter; we’re interested in aggregates. So the denominator of R-squared, the total variance, which is 0.5^2 in this particular case, is not of much interest.

I’m not thrilled with that resolution, though, because suppose we compare two states, one in which the Democrats win 70-30 and one in which the Republicans win 70-30. The predicted probability is either 0.7 or 0.3, hence a standard deviation of 0.2, so the R-squared is 0.2^2/0.5^2 = 0.16. Still a very low-seeming value, even though in this case you’re getting a pretty good individual prediction (the likelihood ratio is (0.7/0.3)/(0.3/0.7) = 0.49/0.09 = 5.4).
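The arithmetic in both versions fits in a one-line function; this little sketch (my framing, parameterizing the example by the winning party’s vote share) reproduces the 1% and 16% figures:

```python
def r_squared(p):
    """Two equal-size states: Republican share p in one, 1 - p in the other.

    The outcome (1 = Republican) has sd 0.5 overall; the state-level
    prediction is p or 1 - p, hence sd |p - 0.5|.
    """
    sd_outcome = 0.5
    sd_predicted = abs(p - 0.5)
    return (sd_predicted / sd_outcome) ** 2

print(round(r_squared(0.55), 4))  # 0.01: the 55-45 example
print(round(r_squared(0.70), 4))  # 0.16: the 70-30 example
```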

I guess the right way of thinking about this sort of example is to consider some large number of individual predictions . . . I dunno. It’s still buggin me.

“Beyond the black box: Toward a new paradigm of statistics in science” (talks this Thursday in London by Jessica Hullman, Hadley Wickham, and me)

Sponsored by the Alan Turing Institute, the talks will be Thurs 20 June 2024, 5:30pm, at Kings College London. You can register for the event here, and here are the titles and abstracts of the three talks:

Beyond the black box: Toward a new paradigm of statistics in science

Andrew Gelman

Standard paradigms for data-based decision making and policy analysis fail and have led to a replication crisis in science – because they can’t handle uncertainty and variation and don’t seriously engage with the quality of evidence. We discuss how this has happened, touching on the piranha problem, the butterfly effect, the magic number 16, the one-way-street fallacy, the backpack fallacy, the Edlin factor, Clarke’s law, the analyst’s paradox, and the greatest trick the default ever pulled. We then discuss ways to go beyond the push-a-button, take-a-pill model to a more active engagement of data in science.

Data Analysis as Imagination

Jessica Hullman

Learning from data, whether in exploratory or confirmatory analysis settings, requires one to reason about the likelihood of many competing explanations. However, people are boundedly rational agents who often engage in pattern-finding at the expense of recognising uncertainty or considering potential sources of heterogeneity and variation in the effects they seek to discover. Taking this seriously motivates new classes of interface tools that help people extend their imagination in hypothesising and interpreting effects.

Data science in production

Hadley Wickham

This talk will discuss what it means to put data science “in production”. In industry, any successful data science project will be run repeatedly for months or years, typically on a server that can’t be worked with interactively. This poses an entirely new set of challenges that won’t be encountered in university classes, but that are vital to overcome if you want to have an impact in your job.

In this talk, Hadley discusses three principles useful for understanding data science in production: not just once, not just one computer, and not just alone. Hadley discusses the challenges associated with each and, where possible, what solutions (both technical and sociological) are currently available.


Myths of American history from the left, right, and center; also a discussion of the “Why everything you thought you knew was wrong” genre of book.

Sociologist Claude Fischer has an interesting review of an edited book, “Myth America: Historians Take On the Biggest Legends and Lies About Our Past.”

I’m a big fan of the “Why everything you thought you knew was wrong” genre—it’s a great way to get into the topic. Don’t get me wrong: I’m not a fan of contrarianism for its own sake, especially when dressed in the language of expertise (see, for example, here and here). My point, rather, is that if you do have something reasonable to say, the contrarian or revisionist framework can be a good way to do it.

Just as a contrast: Our book Bayesian Data Analysis, published in 1995, had many reasonable takes that were different from what was standard in Bayesian statistics at the time. We just presented our results and methods straight, with no editorializing and very little discussion of how our perspective differed from what had come before. That was fine—the book was a success, after all—but, arguably, our presentation would’ve been even clearer and more compelling had we talked about where we thought that existing practice was wrong.

Why did we do it that way, writing the book using such a non-confrontational framing? It was my reaction to the academic literature on Bayesian statistics which was full of debate and controversy. Debate and controversy are fun, and can be a great way to learn—but the message I wanted to convey in our book was that Bayesian methods are useful for solving real problems, both theoretical and applied, not that Bayesian inference was necessary or that it satisfied some optimality property. I wanted BDA to be the book that took Bayesian inference beyond philosophy and argument toward methodology and applications. So I consciously avoided framing anything as controversial. The goal was to demonstrate how and why to do it, not to win a debate.

As I say, I think our tack worked. But there are ways in which it’s better to acknowledge, address, and argue against contrary views, rather than to simply present your own perspective. Both for historical reasons and for pedagogical practice, it’s good to talk about what else is out there—and also to explore the holes in your own recommended approach. In writing BDA, we were pushing against decades of Bayesian statistics being associated with philosophy and argument; thirty years later, we have moved on, Bayesian methods are here to stay, and there’s room for more open discussion of challenges and controversies.

To return to the topic of our post . . . In his discussion of “Myth America: Historians Take On the Biggest Legends and Lies About Our Past,” Fischer writes:

The book’s premise is that these myths derange our politics and undermine sound public policy. Although the authors address a few “bipartisan myths,” they focus on myths of the Right. . . . They might have [written about] the kernels of truths that are found in some conservative stories and also by addressing myths on the Left. . . .

Fischer shares a bunch of interesting examples. First, some myths that were successfully shot down in the book under discussion:

Akhil Reed Amar argues that the Constitution was not designed to restrain popular democracy but was instead a remarkably populist document for its time.

Daniel Immerwahr debunks the sanctimony that the U.S. has not pursued empire . . .

Michael Kazin criticizes the depiction of socialism as a recent infection from overseas. . . .

Elizabeth Hinton challenges the view that harsh police suppression is typically a reaction to criminal violence. She chronicles the long history, especially but not only in the South, of authorities aggressively policing even quiescent communities.

Before going on, let me say that this first myth, that the Constitution was designed to restrain popular democracy, seems to me as much of a myth of the Left as of the Right. On the right, sure, there’s this idea that the constitution protects us from mob rule; but it also seems like a leftist take, that the so-called founding fathers were just setting up a government to protect the interests of property owners. To the extent that this belief is actually in error, it seems to me that Amar is indeed addressing a myth on the Left.

In any case, Fischer continues with “a few examples of conceding some conservative points”:

Carol Anderson reviews recent Republican efforts, starting long before Trump, to cry voter fraud. Although little evidence points to substantial fraud these days, imposters, vote-buying, and messing with ballot boxes was common in the past. It was most visible, though not necessarily more common, in immigrant-filled cities run by Democratic machines. . . .

Erika Lee’s and Geraldo Cadava’s chapters on the southern border and on undocumented immigration undercut the current Fox News hysteria. They discuss the long history of cross-border movement and the repeated false alarms about foreigners. . . . However, the authors might have admitted that large-scale immigration is often disruptive. . . . reactions were not just xenophobic, but often over real material and cultural worries.

Contributors describe as “rebellions” the violence that broke out in Black neighborhoods at many points in the 20th century, but they do not dignify similar actions on the Right as rebellions, for example, the anti-immigrant riots of the 19th century and the anti-busing violence of the 1970s. These outbursts also entailed aggrieved communities raging against elites who imported scabs and elites who imposed school integration. Why are some labeled as rebellions and others as riots?

Naomi Oreskes and Erik M. Conway debunk the “The Magic of the Marketplace” myth. American businessmen and American economic growth have always relied heavily on government investment and subsidies . . . Still, a complete story would have appreciated how risk-taking entrepreneurs, from the Vanderbilts to the Fords, effectively deployed resources in ways that enabled prosperity for most Americans.

Fischer then provides some “legends of the Left, critique of which might burnish the historians’ reputation for objectivity and balance”:

The slumbering progressive vote. Seemingly forever, but certainly in recent years, voices on the Left have claimed that millions of working-class, minority voters are poised to vote for progressives if only candidates and parties spoke to their interests (the “What’s the Matter with Kansas?” question). Repeatedly, such voters have not emerged. Trump, however, did mobilize many chronic non-voters, suggesting that there are probably more slumbering right-populists than left-populists.

Explaining the Civil War. In a perennial argument, some on the Right minimize the role of slavery so as to promote the “Lost Cause” story that the war was about states’ rights . . . Some on the Left have also downplayed slavery, preferring to interpret the war as a struggle between different kinds of business interests, thereby both inflating the role of capitalism and blaming it. Also wrong; the cause was slavery.

“People of Color.” This label is an ahistorical effort to sort ethnoracial groups into two classes . . . submerges from view the vastly different experiences of, say, the descendants of slaves, third-generation Mexican-Americans, refugees from Afghanistan, and immigrants from Ghana. It would seem to lump together somewhat pale Latin or Native Americans with dark-skinned but economically successful Asians. . . . POC is a rhetorical slogan, not a historically-rooted category.

I don’t have anything to add here. Fischer’s discussion and examples are interesting. I don’t buy his argument that more left-right balance in the book would represent an “opportunity to increase public confidence in professional history.” To put together a book like this in order to increase public confidence seems like a mug’s game. I think you just have to put together the best book you can. And, at that point, you can do it two ways. Either present a frankly partisan view and own up to it, saying something like, “Lots of sources will give you the dominant conservative perspective on American history; here, we present a perspective that is well known in academia but does not always make it into the school books or the news broadcasts.” That’s what James Loewen did in his classic book, “Lies My Teacher Told Me.” Or you try your best to present a range of political perspectives and then you make that clear, saying something like, “There are many persistent misreadings of American history from conservatives, and our book shoots these down. In addition we have seen overreactions from the other direction, and we discuss some of these too.” I don’t think either of these presentations will do much to “increase public confidence in professional history,” but they have the advantage of clarity, and they help the editors and the readers alike to place the book within past and current political debates.

P.S. One of the editors of the book under discussion is Princeton historian Kevin Kruse, who came up in this space a couple years ago regarding some plagiarism in his Ph.D. thesis. Fischer didn’t mention this in his post. I guess plagiarism isn’t so relevant, given that he’s just the editor of the book, not the author. As an editor of a couple of books myself, I ended up doing a lot of editing of other people’s chapters, which I guess is kind of the opposite of plagiarism. I think that’s expected, that the editors will do some writing as necessary to get the project done. I have no idea how Kruse and his co-editor Julian Zelizer operated with this particular book.

P.P.S. The other editor of that book is Julian Zelizer. I’ve collaborated with Adam Zelizer. Are they related? How many Zelizers could there be in political science??

One way you can understand people is to look at where they prefer to see complexity.

In her article, “On not sleeping with your students,” philosopher Amia Srinivasan writes that she was struck by “how limited philosophers’ thinking was”:

How could the same people who were used to wrestling with the ethics of eugenics and torture (issues you might have imagined were more clear-cut) think that all there was to say about professor-student sex was that it was fine if consensual?

Many philosophers prefer to see complexity only where it suits them.

This was interesting, and it gave me two thoughts.

First there’s the whole asshole angle: a philosopher being proudly bold and transgressive by considering the virtues of torture while not reflecting on issues closer to home. This reminds me of our quick rule of thumb: when someone seems to be acting like a jerk, an economist will defend the behavior as being the essence of morality, but when someone seems to be doing something nice, an economist will raise the bar and argue that he’s not being nice at all. The point is that in some areas of academia it’s considered a positive to be counterintuitive and unpredictable. One thing I like about Srinivasan is that she’s not doing that. Like Bertrand Russell, she’s direct. Don’t get me wrong, Bertrand Russell had lots of problems in his philosophy as well as in his life—just take a look at Ray Monk’s biography of him—but I appreciate the clarity and directness of his popular philosophical writing. Indeed, that clarity and directness can make it easier to see problems in what he wrote, and that’s good too.

The bit that really caught me in the above excerpt, though, was that last sentence, which got me thinking that one way you can understand people is to look at where they prefer to see complexity. I’m not quite sure what to do with this; I’m still chewing on it. It reminds me of the principle that you can understand people by looking at what bothers them. I wrote a post on that, many years ago, but now I can’t find it.

Loving, hating, and sometimes misinterpreting conformal prediction for medical decisions

This is Jessica. Conformal prediction, referring to a class of distribution-free approaches to quantifying predictive uncertainty, has attracted interest for medical AI applications. Reasons include that prediction sets seem to align with the kinds of differential diagnoses doctors already use, and that they can support common triage decisions like ruling in and ruling out critical conditions.

However, like any uncertainty quantification technique, the nuance needed to describe what conformal approaches provide can get lost in translation. We have catalogs of common misinterpretations of p-values, confidence intervals, Bayes factors, AUC, etc., to which we might now add misinterpretations of conformal prediction. The below set is based on what I’m seeing as I read papers about applying conformal prediction for medical decision-making. If you’ve encountered others that I’ve missed (even if not in a health setting), please share them.

Misconception 1: Conformal prediction provides individualized uncertainty

It would be great if we could get prediction sets with true conditional coverage without having to make distributional assumptions, i.e., if we could guarantee that the probability that a prediction set at any fixed test point X_n+1 contains the true label is at least 1 – alpha. Unfortunately, assumption-free conditional coverage is not possible. But some enthusiastic takes on conformal prediction describe what it provides as if it is achieved. 

For example, Dawei Xie pointed me to this Nature Medicine commentary that calls for clinical uses of AI to include predictive uncertainty. The authors start with what appears to be a common motivation for conformal prediction in health: standard AI pipelines optimize population-level accuracy, failing to capture “the vital clinical fact that each patient is a unique person,” motivating methods that can “provide reliable advice for all individual patients.” The goal is to use uncertainty associated with the prediction to decide whether to abstain and bring in a human expert, who might gather more information or consider how the model was developed. 

This is all fine. The problem is that they propose to solve this challenge with conformal prediction, which they describe as a new tool “that can produce personalized measures of uncertainty.” You can get “relaxed” versions of conditional coverage, but no truly personalized quantification of uncertainty.

Misconception 2: The non-conformity score makes conformal prediction robust to distribution shift

Another potential source of misinterpretation is the non-conformity score. In split conformal prediction, this is the score that is calculated for (x,y) pairs in a held-out calibration set in order to find the threshold expected to achieve at least 1-alpha coverage on test instances. Then given a new instance, its non-conformity score is compared to the threshold to determine which labels go in the prediction set. The non-conformity score can be any negatively-oriented score function derived from the trained model’s predictions, though the closer it approximates a residual the more useful the sets are likely to be. A simple example would be 1 – f_hat(xi)_y where f_hat(xi)_y is the softmax value for label y produced by the last layer of a neural net, and the threshold is based on the distribution of 1 – f_hat(xi)_yi in the calibration set, where yi is the true label. 
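As a concrete sketch of the split-conformal recipe just described (the three-class model and its softmax outputs are made up for illustration; only numpy is assumed):

```python
import numpy as np

def conformal_threshold(cal_softmax, cal_labels, alpha=0.1):
    """Split conformal: compute the threshold on non-conformity scores
    1 - f_hat(x)_y over a held-out calibration set."""
    n = len(cal_labels)
    scores = 1.0 - cal_softmax[np.arange(n), cal_labels]  # 1 - f_hat(xi)_{yi}
    # finite-sample-corrected quantile, giving >= 1 - alpha marginal coverage
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, level, method="higher")

def prediction_set(test_softmax, q):
    """Include every label whose non-conformity score is at most the threshold."""
    return np.where(1.0 - test_softmax <= q)[0]

# Toy calibration data: 3 classes, softmax peaked at the true label.
rng = np.random.default_rng(0)
cal_labels = rng.integers(0, 3, size=500)
cal_softmax = rng.dirichlet([1, 1, 1], size=500) * 0.3
cal_softmax[np.arange(500), cal_labels] += 0.7

q = conformal_threshold(cal_softmax, cal_labels, alpha=0.1)
print(prediction_set(np.array([0.8, 0.15, 0.05]), q))
```

Note that the guarantee this threshold buys is marginal: averaged over calibration and test draws, not conditional on any particular patient.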

One could say that non-conformity scores capture how dissimilar an (x,y) pair under consideration is from what the model has learned about label posterior distributions from the training data. But some of the application papers I’m seeing make more generic statements, describing the score as measuring how strange the new instance is, as if in some absolute sense, or how unusual it is relative to the training data, as if the score were used to detect distribution shift.

Misconception 3: You can get knowledge-free robustness to distribution shift 

Some papers acknowledge that standard split conformal coverage is not robust to violations of exchangeability, and cite work that relaxes this assumption to get coverage under certain types of distribution shifts. The risk here is describing these approaches as if one can get valid coverage under shifts without having to introduce any additional assumptions. Even in the work of Gibbs et al., which makes the least assumptions as far as I can tell, you still have to select a function class that covers the shifts you want coverage to be robust to. There is no “knowledge-free” way around violations of the typical assumptions.

Misconception 4: Conformal prediction can only provide marginal coverage over the randomness in calibration set and test points

In contrast to the above, I’ve also seen a few more skeptical takes on conformal prediction for medical decision making, arguing that conformal prediction sets are unreliable under shifts in input and label distributions and for subsets of the data. Papers that make these arguments can also mislead, by implying that any use of conformal prediction equates to simple split conformal prediction where coverage is marginal over the randomness in the calibration and test set points. This neglects to acknowledge the development of approaches that provide class-conditional or group-conditional coverage or the previously mentioned attempts at coverage under classes of shifts. Beware blanket statements that write off entire classes of approaches based on what the simplest variations achieve. 

Progress in AI may be exploding, but achieving nuance in discussions of uncertainty quantification is still hard.

Statistics Blunder at the Supreme Court

Joe Stover points to this op-ed by lawyer and political activist Ted Frank, who writes:

Even Supreme Court justices are known to be gullible. In a dissent from last week’s ruling against racial preferences in college admissions, Justice Ketanji Brown Jackson enumerated purported benefits of “diversity” in education. “It saves lives,” she asserts. “For high-risk Black newborns, having a Black physician more than doubles the likelihood that the baby will live.”

A moment’s thought should be enough to realize that this claim is wildly implausible. . . . the actual survival rate is over 99%.

Indeed, there’s no treatment that will take the survival rate up to 198%.

Frank continues:

How could Justice Jackson make such an innumerate mistake? A footnote cites a friend-of-the-court brief by the Association of American Medical Colleges, which makes the same claim in almost identical language. It, in turn, refers to a 2020 study . . . [which] makes no such claims. It examines mortality rates in Florida newborns between 1992 and 2015 and shows a 0.13% to 0.2% improvement in survival rates for black newborns with black pediatricians (though no statistically significant improvement for black obstetricians).

The AAMC brief either misunderstood the paper or invented the statistic. (It isn’t saved by the adjective “high-risk,” which doesn’t appear and isn’t measured in Greenwood’s paper.)

Here’s the quote from the brief by the Association of American Medical Colleges:

And for high-risk Black newborns, having a Black physician is tantamount to a miracle drug: it more than doubles the likelihood that the baby will live.

Here’s the relevant passage from the cited article, “Physician–patient racial concordance and disparities in birthing mortality for newborns”:

And here’s the relevant table:

Stover summarizes:

As far as I can tell, the justification for the quote is probably in Table 1, col. 1. Baseline mortality rate (white newborn + white dr) is 290 (per 100k). Black newborn is +604 above that giving 894/100k. Then -494 from that when it is black newborn + black dr giving 400/100k. So the black newborn mortality rate is more than cut in half when the doctor is also black.

So while the amicus brief did seem to misunderstand or misrepresent the study, the qualitative finding still holds.

Of course, maybe there are other statistical problems. I figure these basic stats don’t need a model though and could have been pulled out of the raw dataset easily.
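Stover’s back-of-the-envelope reading of Table 1 is easy to check in a few lines:

```python
# Stover's reading of Table 1, col. 1 (mortality per 100,000 births).
baseline = 290                 # white newborn, white physician
black_newborn_excess = 604     # additional deaths for Black newborns
black_physician_effect = 494   # reduction with a Black physician

black_with_white_dr = baseline + black_newborn_excess               # 894 per 100k
black_with_black_dr = black_with_white_dr - black_physician_effect  # 400 per 100k

print(black_with_white_dr, black_with_black_dr,
      round(black_with_white_dr / black_with_black_dr, 2))
```

So, taking those coefficients at face value, the Black newborn mortality rate is indeed more than cut in half when the physician is also Black.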

There’s also Table 2 of the article, which presents data on babies with and without comorbidities. I’m guessing that’s what the amicus brief was talking about when referring to “at-risk” newborns.

In any case, the judge’s key mistake was to trust the amicus brief. I guess this shows a general problem when judges rely on empirical evidence. On one hand, a judge and a judge’s staff are a bunch of lawyers with no particular expertise in evaluating scientific claims—it’s not like they’re gonna go read journal articles and try to untangle what’s in Table 1 of the Results section. On the other hand, evidently the Association of American Medical Colleges has no such expertise either. I can see why some judges would prefer to rely entirely on legal reasoning and leave empirical findings aside. Then again, sometimes they need to rule based on the facts of a case, and then empirical results can matter . . . so I’m not sure what they’re supposed to do! I guess I’m overthinking this somehow, but I’m not quite sure where.

How could the judge’s opinion have been changed to accurately summarize this research? Instead of “For high-risk Black newborns, having a Black physician more than doubles the likelihood that the baby will live,” she could’ve written, “A study from Florida found Black infant mortality rates to be half as high with Black physicians as with White physicians.” OK, this could probably be phrased better, but here are the key improvements:
– Instead of just saying the statement as a general truth, localize it to “a study from Florida.”
– Instead of saying “more than doubles the likelihood that the baby will live,” say that the mortality rate halved.

That last bit is kind of funny . . . but I can see that if you’re writing an amicus brief in a hurry, you can, without reflection, think that “reducing risk of death by half” is the same as “doubling the survival rate.” I mean, sure, once you think about it, it’s obviously wrong, but it almost sounds right if you’re just letting the words flow. This is not an excuse!—I’m sure that whoever wrote that brief is really embarrassed right now—just an attempt at understanding.

Evaluating quantitative evidence is hard! A couple posts from the archive brought up errors from Potter Stewart and Antonin Scalia. I’ll do my small part in all of this by referring to these people as judges, not “Justices.”

Faculty and postdoc jobs in computational stats at Newcastle University (UK)

If you’re looking for a job in computational statistics, Newcastle is hiring one or two faculty positions and a postdoc position. The application deadline is in 10 days for the postdoc position and in a month for the faculty positions.

Close to the action

The UK (and France) are where much of the action is in MCMC, especially theory, in my world. There are great Bayesian computational statisticians in Newcastle, including Professor Chris Oates, and the department head, Professor Murray Pollock. In case you don’t know UK geography, Durham is a mere 30km down the road, with even more MCMC and comp stats researchers.

Faculty position(s)

These are lecturer and senior lecturer positions, which are like assistant and associate professor positions in the U.S.

Lecturer Job ad

Application deadline: 18 July 2024

3 year postdoc position

Application deadline: 23 June 2024

Postdoc job ad

About the postdoc

I’m at a conference with both Chris Oates and Murray Pollock in London right now. Murray just gave a really exciting talk on the topic of the postdoc, which is federated learning as part of the FUSION ERC project, which involves a bigger network of MCMC researchers including Christian Robert and Eric Moulines in Paris and Gareth Roberts at University of Warwick. They’re applying cutting edge diffusion processes (e.g., Brownian bridges) to recover exact solutions to the federated learning problem (where subsampled data sets, for instance from different hospitals, are fit independently and their posteriors are later combined without sharing all the data).

For more information

Contact: Murray Pollock (Murray.Pollock (at)

Arnold Foundation and Vera Institute argue about a study of the effectiveness of college education programs in prison.

OK, this one’s in our wheelhouse. So I’ll write about it. I just want to say that writing this sort of post takes a lot of effort. When it comes to social engagement, my benefit/cost ratio is much higher if I just spend 10 minutes writing a post about the virtues of p-values or whatever. Maximizing the number of hits and blog comments isn’t the only goal, though, and I do find that writing this sort of long post helps me clarify my thinking, so here we go. . . .

Jonathan Ben-Menachem writes:

Two criminal justice reform heavyweights are trading blows over a seemingly arcane subject: research methods. . . . Jennifer Doleac, Executive Vice President of Criminal Justice at Arnold Ventures, accused the Vera Institute of Justice of “research malpractice” for their evaluation of New York college-in-prison programs. In a response posted on Vera’s website, President Nick Turner accused Doleac of “giving comfort to the opponents of reform.”

At first glance, the study at the core of this debate doesn’t seem controversial: Vera evaluated Manhattan DA-funded college education programs for New York prisoners and found that participants were less likely to commit a new crime after exiting prison. . . . Vera used a method called propensity score matching, and constructed a “control” group on the basis of prisoners’ similarity to the “treatment” group. . . . Despite their acknowledgment that “differences may remain across the groups,” Vera researchers contended that “any remaining differences on unobserved variables will be small.”

Doleac didn’t buy it. . . . She argued that propensity score matching could not account for potentially different “motivation and focus.” In other words, the kind of people who apply for classes are different from people who don’t apply, so the difference in outcomes can’t be attributed to prison education. . . .

Here’s Doleac’s full comment:

Vera Institute just released this study of a college-in-prison education program in NY, funded by the Manhattan DA’s Criminal Justice Investment Initiative. Researchers compared people who chose to enroll in the program with similar-looking people who chose not to. This does not isolate the treatment effect of the education program. It is very likely that those who enrolled were more motivated to change, and/or more able to focus on their goals. This pre-existing difference in motivation & focus likely caused both the difference in enrollment in the program and the subsequent difference in recidivism across groups.

This report provides no useful information about whether this NY program is having beneficial effects.

Now we return to Ben-Menachem for some background:

This fight between big philanthropy and a nonprofit executive is extremely rare, and points to a broader struggle over research and politics. The Vera Institute boasts a $264 million operating budget, and . . . has been working on bail reform since the 1960s. Arnold Ventures was founded in 2010, and the organization has allocated around $400 million to criminal justice reform—some of which went to Vera.

How does the debate over methods relate to larger policy questions? Ben-Menachem writes:

Although propensity score matching does have useful applications, I might have made a critique similar to Doleac if I was a peer reviewer for an academic journal. But I’m not sure about Doleac’s claim that Vera’s study provides “no useful information,” or her broader insistence on (quasi) experimental research designs. Because “all studies on this topic use the same flawed design,” Doleac argued, “we have *no idea* whether in-prison college programming is a good investment.” This is a striking declaration that nothing outside of causal inference counts.

He connects this to an earlier controversy:

In 2018, Doleac and Anita Mukherjee published a working paper called “The Moral Hazard of Lifesaving Innovations: Naloxone Access, Opioid Abuse, and Crime” which claimed that naloxone distribution fails to reduce overdose deaths while also “making riskier opioid use more appealing.” In addition to measurement problems, the moral hazard frame partly relied on an urban myth—“naloxone parties,” where opioid users stockpile naloxone, an FDA approved medication designed to rapidly reverse overdose, and intentionally overdose with the knowledge that they can be revived. The final version of the study includes no references to “naloxone parties,” removes the moral hazard framing from the title, and describes the findings as “suggestive” rather than causal.

Later that year, Doleac and coauthors published a research review in Brookings citing her controversial naloxone study claiming that both naloxone and syringe exchange programs were unsupported by rigorous research. Opioid health researchers immediately demanded a retraction, pointing to heaps of prior research suggesting that these policies reduce overdose deaths (among other benefits). . . .

Ben-Menachem connects this to debates between economists and others regarding the role of causal inference. He writes:

While causal inference can be useful, it is insufficient on its own and arguably not always necessary in the policy context. By contrast, Vera produces research using a very wide variety of methods. This work teaches us about the who, where, when, what, why, and how of criminalization. Causal inference primarily tells us “whether.”

I disagree with him on this one. Propensity score matching (which should be followed up with regression adjustment; see for example our discussion here) is a method that is used for causal inference. I will also channel my causal-inference colleagues and say that, if your goal is to estimate and understand the effects of a policy, causal inference is absolutely necessary. Ben-Menachem’s mistake is to identify “causal inference” with some particular forms of natural-experiment or instrumental-variables analyses. Also, no matter how you define it, causal inference primarily tells us, or attempts to tell us, “how much” and “where and when,” not “whether.” I agree with his larger point, though, which is that understanding (what we sometimes call “theory”) is important.

I think Ben-Menachem’s framing of this as economists-doing-causal-inference vs. other-researchers-doing-pluralism misses the mark. Everybody’s doing causal inference here, one way or another, and indeed matching can be just fine if it is used as part of a general strategy for adjustment, even if, as with other causal inference methods, it can do badly when applied blindly.

But let’s move on. Ben-Menachem continues:

In a recent interview about Arnold Ventures’ funding priorities, Doleac explained that her goal is to “help build the evidence base on what works, and then push for policy change based on that evidence.” But insisting on “rigorous” evidence before implementing policy change risks slowing the steady progress of decarceration to a grinding halt. . . .

In an email, Vera’s Turner echoed this point. “The cost of Doleac’s apparently rigid standard is that it not only devalues legitimate methods,” he wrote, “but it sets an unreasonably and unnecessarily high burden of proof to undo a system that itself has very little evidence supporting its current state.”

Indeed, mass incarceration was not built on “rigorous research.” . . . Yet today some philanthropists demand randomized controlled trials (or “natural experiments”) for every brick we want to remove from the wall of mass incarceration. . . .

Decarceration is a fight that takes place on the streets and in city halls across America, not in the halls of philanthropic organizations. . . . the narrow emphasis on the evaluation standards of academic economists will hamstring otherwise promising efforts to undo the harms of criminalization.

Several questions arise here:

1. What can be learned from this now-controversial research project? What does it tell us about the effects of New York college-in-prison programs, or about programs to reduce prison time?

2. Given the inevitable weaknesses of any study of this sort (including studies that Doleac or I or other methods critics might like), how should its findings inform policy?

3. What should advocates’ or legislators’ views of the policy options be, given that the evidence in favor of the status quo is far from rigorous by any standard?

4. Given questions 1, 2, 3 above, what is the relevance of methodological critiques of any study in a real-world policy context?

Let me go through these four questions in turn.

1. What can be learned from this now-controversial research project?

First we have to look at the study! Here it is: “The Impacts of College-in-Prison Participation on Safety and Employment in New York State: An Analysis of College Students Funded by the Criminal Justice Investment Initiative,” published in November 2023.

I have no connection to this particular project, but I have some tenuous connection to both of the organizations involved in this debate, as many years ago I attended a brief meeting at the Arnold Foundation regarding a study being done by the Vera Institute on a program they were running in the correctional system. And many years ago my aunt Lucy taught math at Sing Sing prison for a while.

Let’s go to the Vera report, which concludes:

The study found a strong, significant, and consistent effect of college participation on reducing new convictions following release. Participation in this form of postsecondary education reduced reconviction by at least 66 percent. . . .

Vera also conducted a cost analysis of these seven college-in-prison programs . . . Researchers calculated the costs reimbursed by CJII, as well as two measures of the overall cost: the average cost per student and the costs of adding an additional group of 10 or 20 students to an existing college program . . . Adding an additional group of 10 or 20 students to those colleges that provided both education and reentry services would cost colleges approximately $10,500 per additional student, while adding an additional group of students to colleges that focused on education would cost approximately $3,800 per additional student. . . . The final evaluation report will expand this cost analysis to a benefit-cost analysis, which will evaluate the return on investment of these monetary and resource outlays in terms of avoided incarceration, averted criminal victimization, and increased labor force participation and improved income.

And they connect this to policy:

This research indicates that academic college programs are highly effective at reducing future convictions among participating students. Yet, interest in college in prison among prospective students far outstrips the ability of institutions of higher education to provide that programming, due in no small part to resource constraints. In such a context, funding through initiatives such as CJII and through state and federal programs not only supports the aspirations of people who are incarcerated but also promotes public safety.

Now let’s jump to the methods. From page 13 of the report onward:

To understand the impact of access to a college education on the people in the program, Vera researchers needed to know what would have happened to these people if they had not participated in the program. . . . Ideally, researchers need these comparisons to be between groups that are otherwise as similar as possible to guard against attributing outcomes to the effects of education that may be due to the characteristics of people who are eligible for or interested in participating in education. In a fair comparison of students and nonstudents, the only difference between the two is that students participated in college education in prison while nonstudents did not. . . . One study of the impacts of college in prison on criminal legal system outcomes found that people who chose or were able to access education differed in their demographics, employment and conviction histories, and sentence lengths from people who did not choose or have the ability to access education. This indicates a need for research and statistical methods that can account for such “selection” into college education . . .

The best way to create the fair comparisons needed to estimate causal effects is to perform a randomized experiment. However, this was not done in this study due to the ethical impact of withholding from a comparison group an intervention that has established positive benefits . . . Vera researchers instead aimed to create a fairer comparison across groups using a statistical technique called propensity score matching . . . Vera researchers matched students and nonstudents on the following variables:
– demographics . . .
– conviction history . . .
– correctional characteristics . . .
– education characteristics . . .
Researchers considered nonstudents to be eligible for comparison not only if they met the same academic and behavioral history requirements as students but also if they had a similar time to release during the CIP period, a similar age at incarceration, and a similar time from prison admission to eligibility. . . . when evaluating whether an intervention influences an outcome of interest, it is a necessary but not sufficient condition that the intervention happens before the outcome. Vera researchers therefore defined a “start date” for students and a “virtual start date” for nonstudents in order to determine when to begin measuring in-facility outcomes, which included Tier II, Tier III, high-severity, and all misconducts. . . . To examine the effect of college education in prison on misconducts and on reported wages, Vera researchers used linear regression on the matched sample. For formal employment status and for an incident within six months and 12 months of release that led to a new conviction, Vera used logistic regression on the matched sample. For recidivism at any point following release, Vera used survival analysis on the matched sample to estimate the impact of the program on the time until an incident that leads to a new conviction occurs.
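To make the matching idea concrete, here is a minimal sketch of nearest-neighbor propensity-score matching on simulated data. This is my own toy illustration of the general technique, not Vera's actual implementation: one made-up confounder, a true treatment effect of zero, and matching on a one-variable logistic propensity model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data, purely illustrative: one confounder x drives both
# treatment take-up and the outcome; the true treatment effect is zero.
n = 2000
x = rng.normal(size=n)
treated = rng.random(n) < 1 / (1 + np.exp(-2 * x))
y = x + rng.normal(size=n)  # outcome depends on x, not on treatment

# The naive treated-vs-control comparison is badly confounded:
naive = y[treated].mean() - y[~treated].mean()

# Step 1: estimate propensity scores P(treated | x) with a
# one-variable logistic fit (a few Newton-Raphson steps; in a
# real study this would use many covariates).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (treated - p))
score = 1 / (1 + np.exp(-X @ beta))

# Step 2: match each treated unit to the control unit with the
# nearest propensity score (matching with replacement).
ctrl_idx = np.flatnonzero(~treated)
matches = ctrl_idx[np.abs(score[ctrl_idx][None, :]
                          - score[treated][:, None]).argmin(axis=1)]

# Step 3: compare outcomes within the matched sample.
matched = y[treated].mean() - y[matches].mean()
print(f"naive difference:   {naive:.2f}")
print(f"matched difference: {matched:.2f}")
```

The naive difference comes out large even though the treatment does nothing, while the matched difference is close to zero; the residual gap is exactly the kind of unmeasured-confounding worry raised below, since matching can only balance the variables you have.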

What about the concern expressed by Doleac regarding differences that are not accounted for by the matching and adjustment variables? Here’s what the report says:

Vera researchers have attempted to control [I’d prefer the term “adjust” — ed.] for pre-incarceration factors, such as conviction history, age, and gender, that may contribute to misconducts in prison. However, Vera was not able to control for other pre-incarceration factors that have been found in the literature to contribute to misconducts, such as marital status and family structure, mental health needs, a history of physical abuse, antisocial attitudes and beliefs, religiosity, socioeconomic disadvantage and exposure to geographically concentrated poverty, and other factors that, if present, would still allow a person to remain eligible for college education but might influence misconducts. Vera researchers also have not been able to control for factors that may be related to misconducts, including characteristics of the prison management environment, such as prison size, and the proportion of people incarcerated under age 25, as Vera did not have access to information about the facilities where nonstudents were incarcerated. Vera also did not have access to other programs that students and nonstudents may be participating in, such as work assignments, other programming, or health and mental health service engagement, which may influence in-facility behavior and are commonly used as controls in the literature. If other literature on the subject is correct and education does help to lower misconducts, Vera may have, by chance, mismatched students with controls who, unobserved to researchers and unmeasured in the data, were less likely to have characteristics or be exposed to environments that influence misconducts. While prior misconducts, assigned security class, and time since admission may, as proxies, capture some of this information, they may do so imperfectly.

They have plans to mitigate these limitations going forward:

First, Vera will receive information on new students and newly eligible nonstudents who have enrolled or become eligible following receipt of the first tranche of data. Researchers will also have the opportunity to follow the people in the analytical sample for the present study over a longer period of time. . . . Second, researchers will receive new variables in new time periods from both DOCCS and DOL. Vera plans to obtain more detailed information on both misconducts and counts of misconducts that take place in different time periods for the final report. . . . Next, Vera will obtain data on pre-incarceration wages and formal employment status, which could help researchers to achieve better balance between students and nonstudents on their work histories . . .

In summary: Yeah, observational studies are hard. You adjust for what you can adjust for, then you can do supplementary analyses to assess the sizes and directions of possible biases. I’m kinda with Ben-Menachem on this one: Doleac’s right that the study “does not isolate the treatment effect of the education program,” but there’s really no way to isolate this effect—indeed, there is no single “effect,” as any effect will vary by person and depend on context. But to say that the report “provides no useful information” about the effect . . . I think that’s way too harsh.

Another way of saying this is that, speaking in general terms, I don’t find adjusting for existing pre-treatment variables to be a worse identification strategy than instrumental variables, or difference-in-differences, or various other methods that are used for causal inference from observational studies. All these methods rely on strong, false assumptions. I’m not saying that these methods are equivalent, either in general or in any particular case, just that all have flaws. And indeed, in her work with the Arnold Foundation, Doleac promotes various criminal-justice reforms. So I’m not quite sure why she’s so bothered by this particular Vera study. I’m not saying she’s wrong to be bothered by it; there just must be more to the story, other reasons she has for concern that were not mentioned in her above-linked social media post.

Also, I don’t believe that estimate from the Vera study that the treatment reduces recidivism by 66%. No way. See the section “About that ’66 percent'” below for details. So there are reasons to be bothered by that report; I just don’t quite get where Doleac is coming from in her particular criticism.

2. Given the inevitable weaknesses of any study of this sort, how should its findings inform policy?

I guess it’s the usual story: each study only adds a bit to the big picture. The Vera study is encouraging to the extent that it’s part of a larger story that makes sense and is consistent with observation. The results so far seem too noisy to be able to say much about the size of the effect, but maybe more will be learned from the followups.

3. What should advocates’ or legislators’ views of the policy options be, given that the evidence in favor of the status quo is far from rigorous by any standard?

This I’m not sure. It depends on your understanding of justice policy. Ben-Menachem and others want to reduce mass incarceration, and this makes sense to me, but others have different views and take the position that mass incarceration has positive net effects.

I agree with Ben-Menachem that policymakers should not stick with the status quo, just on the basis that there is no strong evidence in favor of a particular alternative. For one thing, the status quo is itself relatively recent, so it’s not like it can be supported based on any general “if it ain’t broke, don’t fix it” principle. But . . . I don’t think Doleac is taking a stick-with-the-status-quo position either! Yes, she’s saying that the Vera study “provides no useful information”—a statement I don’t really agree with—but I don’t see her saying that New York’s college-in-prison education program is a bad idea, or that it shouldn’t be funded. I take Doleac as saying that, if policymakers want to fund this program, they should be clear that they’re making this decision based on their theoretical understanding, or maybe based on political concerns, not based on a solid empirical estimate of its effects.

4. Given questions 1, 2, 3 above, what is the relevance of methodological critiques of any study in a real-world policy context?

Methodological critique can help us avoid overconfidence in the interpretation of results.

Concerns such as Doleac’s regarding identification help us understand how different studies can differ so much in their results: in addition to sampling variation and varying treatment effect, the biases of measurement and estimation depend on context. Concerns such as mine regarding effect sizes should help when taking exaggerated estimates and mapping them to cost-benefit analyses.

Even with all our concerns, I do think projects such as this Vera study are useful in that they connect the qualitative aspects of administering the program with quantitative evaluation. It’s also important that the project itself has social value and that the proposed mechanism of action makes sense. I’m reminded of our retrospective control study of the Millennium Villages project (here’s the published paper, here and here are two unpublished papers on the design of the study, and here’s a later discussion of our study and another evaluation of the project): the study could never have been perfect, but we learned a lot from doing a careful comparison.

To return to Ben-Menachem’s post, I think the framing of this as a “fight over rigor” is a mistake. The researchers at the Vera Institute and the economist at the Arnold Foundation seem to be operating at the same, reasonable, level of rigor. They’re concerned about causal identification and generalizability, they’re trying to learn what they can from observational data, etc. Regression adjustment with propensity scores is no more or less rigorous than instrumental variables or change-point analysis or multilevel modeling or any other method that might be applied in this sort of problem. It’s really all about the details.

It might help to compare this to an example we’ve discussed in this space many times before: flawed estimates of the effect of air pollution on lifespan. There’s a lot of theory and evidence that air pollution is bad for your life expectancy. The theory and evidence are not 100% conclusive—there’s this idea that a little bit of pollution can make you stronger by stimulating your immune system or whatever—but we’re pretty much expecting heavy indoor air pollution to be bad for you.

The question then comes up: what that is policy relevant is learned from a really bad study of the effects of air pollution? I’d say, pretty much nothing. I have a more positive take on the Vera study, partly because it is very directly studying the effect of a treatment of interest. The analysis has some omitted-variable concerns, and the published estimates are, I believe, way too high, but it still seems to me to be moving the ball forward. I guess that one way they could do better would be to focus on more immediate outcomes. I get that reduction in recidivism is the big goal, but that’s kind of indirect, meaning that we would expect smaller effects and noisier estimates. Direct outcomes of participation in the program could be a better thing to focus on. But I’m speaking in general terms here, as I have no knowledge of the prison system etc.

About that “66 percent”

As noted above, the Vera study concluded:

Participation in this form of postsecondary education reduced reconviction by at least 66 percent.

“At least 66 percent” . . . where did this come from? I searched the paper for “66” and found this passage:

Vera’s study found that participation in college in prison reduced the risk of reconviction by 66 to 67 percent (a relative risk of 0.33 and 0.34). (See Table 7.) The impact of participation in college education was found to reduce reconviction in all three of the analyses (six months, 12 months, and at any point following release). The consistency of estimated treatment effects gives Vera confidence in the validity of this finding.

And here is the relevant table:

Ummmm . . . no. Remember Type M errors? The raw estimate is HUGE (a reduction in risk of 66%) and the standard error is huge too (I guess it’s about 33%, given that a p-value of 0.05 corresponds to an estimate that’s approximately two standard errors away from zero) . . . that’s the classic recipe for bias.

Give it a straight-up Edlin factor of 1/2 and your estimated effect is to reduce the risk of reconviction by 33%, which still sounds kinda high to me, but I’ll leave this one to the experts. The Vera report states that they “detected a much stronger effect than prior studies,” and those prior studies could very well be positively biased themselves, so, yeah, my best guess is that any true average effect is less than 33%.
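To see why a huge estimate paired with a huge standard error is a recipe for overestimation, here is a small simulation of the Type M (magnitude) error. The numbers are my own assumptions for illustration, not the report's: suppose the true risk reduction is 20 percentage points and the standard error is 33 points, as guessed above, and look at what gets reported when only statistically significant results make it into the headline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers for illustration (not from the Vera report):
true_effect = 20.0  # true risk reduction, in percentage points
se = 33.0           # standard error, as guessed from the p-value above

# A million hypothetical replications of the study:
est = rng.normal(true_effect, se, size=1_000_000)

# Keep only the replications that are "significant" in the right
# direction (estimate more than 1.96 standard errors above zero):
significant = est[est > 1.96 * se]

print(f"share significant:        {significant.size / est.size:.2f}")
print(f"avg significant estimate: {significant.mean():.0f} points")
print(f"exaggeration ratio:       {significant.mean() / true_effect:.1f}x")
```

Under these assumed numbers, the study only clears the significance bar a small fraction of the time, and when it does, the reported effect is several times larger than the truth. That is the sense in which the published 66% should be shrunk, not taken as a floor.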

So when they say, “at least 66 percent”: I think that’s just wrong, an example of the very common statistical error of reporting an estimate without correcting for bias.

Also, I don’t buy that the result appearing in all three of the analyses represents a “consistency of estimated treatment effects” that should give “confidence in the validity of this finding.” The three analyses have a lot of overlap, no? I don’t have the raw data to check what proportion of the reconvictions within 12 months or at any point following release already occurred within 6 months, and I’m not saying the three summaries are entirely redundant. But they’re not independent pieces of information either. I have no idea why the estimates are soooo close to each other; I guess that is probably just one of those chance things which in this case give a misleading illusion of consistency.

Finally, to say a risk reduction of “66 to 67 percent” is a ridiculous level of precision, given that even if you were to just take the straight-up classical 95% intervals you’d get a range of risk reductions of something like 90 percent to zero percent (a relative risk between 0.1 and 1.0).
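The arithmetic behind that interval is simple, assuming only that the published relative risk of about 0.33 is just barely significant at p = 0.05 (my reading of the situation, not a number the report states):

```python
import math

rr = 0.33   # published relative risk (a risk reduction of 67%)
z = 1.96    # two-sided critical value for p = 0.05

log_rr = math.log(rr)        # about -1.11 on the log scale
se_log = abs(log_rr) / z     # SE implied if the estimate is just significant

lo = math.exp(log_rr - z * se_log)  # lower end of the 95% interval
hi = math.exp(log_rr + z * se_log)  # upper end: exactly 1.0 by construction

print(f"implied 95% interval for relative risk: [{lo:.2f}, {hi:.2f}]")
```

Under this back-of-the-envelope approximation the interval runs from a relative risk of about 0.11 up to 1.0, so quoting the estimate to the nearest percentage point conveys far more precision than the data can support.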

So we’re seeing overestimation of effect size and overconfidence in what can be learned by the study, which is an all-too-common problem in policy analysis (for example here).

None of this has anything to do with Doleac’s point. Even with no issues of identification at all, I don’t think this treatment effect estimate of 66% (or “at least 66%” or “66 to 67 percent”) decline in recidivism should be taken seriously.

To put it another way, if the same treatment were done on the same population, just with a different sample of people, what would I expect to see? I don’t know—but my best estimate would be that the observed difference would be a lot less than 66%. Call it the Edlin factor, call it Type M error, call it an empirical correction, call it Bayes; whatever you want to call it, I wouldn’t feel comfortable taking that 66% as an estimated effect.

As I always say for this sort of problem, this does not mean that I think the intervention has no effect, or that I have any certainty that the effect is less than the claimed estimate. The data are, indeed, consistent with that claimed 66% decline. The data are also consistent with many other things, including (in my view more plausibly) smaller average effects. What I’m disagreeing with is the claim that the study provides strong evidence for that claimed effect, and I say this based on basic statistics, without even getting into causal identification.

P.S. Ben-Menachem is a Ph.D. student in sociology at Columbia and he’s published a paper on police stops in the APSR. I don’t recall meeting him, but maybe he came by the Playroom at some point? Columbia’s a big place.

How would the election turn out if Biden or Trump were replaced by a different candidate?

Paul Campos points to this post where political analyst Nate Silver writes:

If I’d told you 10 years ago a president would seek re-election at 81 despite a supermajority of Americans having concerns about his age, and then we’d hit 8% inflation for 2 years, you wouldn’t be surprised he was an underdog for reelection. You’d be surprised it was even close! . . .

Trump should drop out! . . . Biden would lose by 7 points, but I agree, the Republican Party and the country would be better served by a different nominee.

Campos points out that the claim that we “hit 8% inflation for 2 years” is untrue—actually, “Inflation on a year over year basis hit 8% or higher for exactly seven months of the Biden presidency, from March through September of 2022, not ‘two years.’ It did not hit 8% in any calendar year”—and I guess that’s part of the issue here. The fact that Silver, who is so statistically aware, made this mistake is an interesting example of something that a lot of people have been talking about lately, the disjunction between economic performance and economic perception. I don’t know how Nate will respond to the “8% inflation for 2 years” thing, but I guess he might say that it feels like 8% to people, and that’s what matters.

But then you’d want to rephrase Nate’s statement slightly, to say something like:

If I’d told you 10 years ago a president would seek re-election at 81 (running against an opponent who is 77) despite a supermajority of Americans having concerns about his age, and with inflation hitting 9% in the president’s second year and then rapidly declining to 3.5% but still a concern in the polls . . .

If Nate had told me that ten years ago, I’m not sure what I’d have thought. I guess if he’d given me that scenario, I would’ve asked about the rate of growth in real per-capita income . . . ummm, here’s something . . . It seems that real per-capita disposable personal income increased by 1.1% during 2023. These sorts of numbers depend on what you count (for example, real per-capita GDP increased by 2.3% during that period) and what is your time window (real per-capita disposable personal income dropped a lot in 2022 and then has gradually increased since then, while the increase in GDP per capita has been more steady).

In any case, economic growth of 1 or 2% is, from the perspective of recent history, neither terrible nor great. Given past data on economic performance and election outcome, I would not be at all surprised to find the election to be close, as can be seen in this graph from Active Statistics:

The other thing is a candidate being 81 years old . . . it’s hard to know what to say about this one. Elections have often featured candidates who have some unprecedented issue that could be a concern to many voters, for example Obama being African-American, Mitt Romney being Mormon, Hillary Clinton being female, Bill Clinton openly having had affairs, George W. Bush being a cheerleader . . . The age issue came up with Reagan; see, for example, this news article by Lou Cannon from October 1984, which had this line:

Dr. Richard Greulich, scientific director of the National Institute on Aging, said Reagan is in “extraordinarily good physical shape” for his age.

Looking back, this quote is kind of amazing, partly because it’s hard to imagine an official NIH scientist issuing this sort of statement—nowadays, we’d hear something from the president’s private doctor and there’d be no reason for an outsider to take it seriously—and partly because of how careful Greulich was to talk about “physical shape” and not mental shape, which is relevant given Reagan’s well-known mental deterioration during his second term.

The 2020 and 2024 elections are a new thing in that both candidates are elderly, and, at least as judged by some of their statements and actions, appear to have diminished mental capacity. When considering the age issue last year (in reaction to earlier posts by Campos and Silver), I ended up with this equivocal conclusion:

Comparing Biden and Trump, it’s not clear what to do with the masses of anecdotal data; on the other hand, it doesn’t seem quite right to toss all that out and just go with the relatively weak information from the base rates. I guess this happens a lot in decision problems. You have some highly relevant information that is hard to quantify, along with some weaker, but quantifiable statistics. . . . I find it very difficult to think about this sort of question where the available data are clearly relevant yet have such huge problems with selection.

Both Biden and Trump were subject to primary challenges this year, and the age criticisms didn’t get much traction for either of them. I’m guessing this is because, fairly or not, there was some perception that the age issue had already been litigated in earlier primary election campaigns where Biden and Trump defeated multiple younger alternatives.

Putting this all together, and in response to Nate’s implicit question, if you had told me 10 years ago that the president would seek re-election at 81 (running against an opponent who is 77) despite a supermajority of Americans having concerns about his age, and with inflation hitting 9% in the president’s second year and then rapidly declining to 3.5% but still a concern in the polls, then I’d probably first ask about recent changes in GDP and income per capita and then say that I would not be surprised if the election were close, nor for that matter would I be surprised if one of the candidates were leading by a few points in the polls.

What about Nate’s other statement: “Trump should drop out! . . . Biden would lose by 7 points, but I agree, the Republican Party and the country would be better served by a different nominee”?

Would replacing Trump by an alternative candidate increase the Republican party’s share of the two-party vote by 3.5 percentage points?

We can’t ever know this one, but there are some ways to think about the question:

– There’s some political science research on the topic. Steven Rosenstone in his classic 1983 book, Forecasting Presidential Elections, estimates that politically moderate nominees do better than those with more extreme views, but with a small effect of around 1 percentage point. When it comes to policy, Trump is pretty much in the center of his party right now, and it seems doubtful that an alternative Republican candidate would be much closer to the center of national politics. A similar analysis goes for Biden. In theory, either Trump or Biden could be replaced by a more centrist candidate who could do better in the election, but that doesn’t seem to be where either party is going right now.

– Trump has some unique negatives. He lost a previous election as an incumbent, he’s just been convicted of a felony, and he’s elderly and speaks incoherently, which is a minus in its own right and also makes it harder for the Republicans to use the age issue against Biden. Would replacing Trump by a younger candidate with less political baggage gain the party 3.5 percentage points of the vote? I’m inclined to think no, again by analogy to other candidate attributes which, on their own, seemed like potential huge negatives but didn’t seem to have such large impacts on the election outcome. Mitt Romney and Hillary Clinton both performed disappointingly, but I don’t think anyone is saying that Romney’s religion and Clinton’s gender cost them 3.5 percentage points of the vote. Once the candidates are set, voters seem to set aside their concerns about the individual candidate.

– Political polarization just keeps increasing, which leads us to expect less cross-party voting and less short-term impact of the candidate on the party’s vote share. If the effect of changing the nominee was on the order of 1 or 2 percentage points a few decades ago, it’s hard to picture the effect being 3.5 percentage points now.

The other thing is that Trump in 2016 and 2020 performed roughly as well as might have been expected given the economic and political conditions at the time. I see no reason to think that a Republican alternative would’ve performed 3.5 percentage points better in either of these elections. It’s just hard to say. Trump is arguably a much weaker candidate in 2024 than he was in 2016 and 2020, given his support for insurrection, felony conviction, and increasing incoherence as a speaker. If you want to say that a different Republican candidate would do 3.5 percentage points better in the two-party vote, I think you’d have to make your argument on those grounds.

P.S. You might ask why, as a political scientist, I’d be responding to arguments from a law professor and a nonacademic pundit/analyst. The short answer is that these arguments are out there, and social media is a big part of the conversation; the analogy twenty or more years ago would’ve been responding to a news article, magazine story, or TV feature. The longer answer is that academia moves more slowly. There must be a lot of relevant political science literature here that I’m not aware of . . . obviously, given that the last time we carefully looked at these issues was in 1993! I can read Campos and Silver and social media, process my thoughts, and post them here, which is approximately a zillion times faster and less effortful than writing an article on the topic for the APSR or whatever. Back in the day I would’ve posted this on the Monkey Cage, and then maybe another political scientist would’ve followed it up with a more informed perspective.

P.P.S. In a followup post, Campos introduces a concept I’d not heard before, the “backup quarterback syndrome”:

Nate Silver has fallen for the backup quarterback syndrome, which is the well-known fact that, on any team that isn’t completely dominating its competition, the backup quarterback tends to be the most popular player, because fans can so easily project their fantasies onto that player, since the starting quarterback’s flaws are viewed in real time, while the backup quarterback can bask in the future glory attributed to him by optimism bias.

I disagree with Campos regarding Nate here: it’s my impression that when Nate expresses strong confidence that a replacement Republican would do much better than Trump, and speculates that a replacement Democrat would do much better than Biden, Nate is not making a positive statement about Ron DeSantis or Gretchen Whitmer or whomever, so much as comparing Trump and Biden to major-party nominees from the past. Nate’s argument in support of the backup quarterback is based on his assessment of the flaws of the current QB’s.

That said, I like the phrase, “backup quarterback syndrome.” It does seem like a fallacy. It’s probably been studied (maybe not specifically in the football-fan context) in the heuristic-and-biases literature.

1. Why so many non-econ papers by economists? 2. What’s on the math GRE and what does this have to do with stat Ph.D. programs? 3. How does modern research on combinatorics relate to statistics?

Someone who would prefer to remain anonymous writes:

A lot of the papers I’ve been reading that sound really interesting don’t seem to involve economics per se (e.g., . . .), but they usually seem to come out of econ (as opposed to statistics) departments. Why is that? Is it a matter of culture? Or just because there are more economists? Or something else?

And here’s the longer version of my question.

I’ve been reading your blog for a couple of years and this post of yours, “Is an Oxford degree worth the parchment it’s printed on?”, from a month ago got me thinking about studying statistics. My background is mainly in engineering (BS CompE/Math, MS EE). Is it possible to get accepted to a good stats program with my background? I know people who have gone into econ with an engineering background, but not statistics. I’ve also been reading some epidemiology papers that are really cool, so statistics seems ideal, since it’s heavily used in both econ and epidemiology, but I wonder if there’s some domain specific knowledge I’d be missing.

I’ve noticed that a lot of programs “strongly recommend” taking the GRE math subject test; is that pretty much required for someone with an unorthodox background? I’d probably have to read a topology and number theory text, and maybe a couple others, to get an acceptable GRE math score, but those don’t seem too relevant to statistics (?). I’ve done that sort of thing before – I read and did all the exercises in a couple of engineering texts when I switched fields within engineering, and I could do it again, but, if given the choice, there are other things I’d rather spend my time on.

Also, I recently ran into my old combinatorics professor, and he mentioned that he knew some people in various math departments who used combinatorics in statistics for things like experimental design. Is that sort of work purely the realm of the math departments, or does that happen in stats departments too? I loved doing combinatorics, and it would be great if I could do something in that area too.

My reply:

1. Here are a few reasons why academic economists do so much work that does not directly involve economics:

a. Economics is a large and growing field in academia, especially if you include business schools. So there are just a lot of economists out there doing work and publishing papers. They will branch out into non-economics topics sometimes.

b. Economics is also pretty open to research on non-academic topics. You don’t always see that in other fields. For example, I’ve been told that in political science, students and young faculty are often advised not to work in policy analysis.

c. Economists learn methodological tools, in particular, time series analysis and observational studies, which are useful in other empirical settings.

d. Economists are plugged in to the news media, so you might be more likely to hear about their work.

2. Here’s the syllabus for the GRE math subject test. I don’t remember any topology or number theory on the exam, but it’s possible they changed the syllabus some time during the past 40 years, also it’s not like my memory is perfect. Topology is cool—everybody should know a little bit of topology, and even though it only very rarely arises directly in statistics, I think the abstractions of topology can help you understand all sorts of things. Number theory, yeah, I think that’s completely useless, although I could see how they’d have it on the test, because being able to answer a GRE math number theory question is probably highly correlated with understanding math more generally.

3. I am not up on the literature for combinatorics for experimental design. I doubt that there’s a lot being done in math departments in this area that has much relevance for applied statistics, but I guess there must be some complicated problems where this comes up. I too think combinatorics is fun. There probably are some interesting connections between combinatorics and statistics which I just haven’t thought about. My quick guess would be that there are connections to probability theory but not much to applied statistics.

P.S. This blog is on a lag, also sometimes we respond to questions from old emails.

Questions and Answers for Applied Statistics and Multilevel Modeling

Last semester, every student taking my course was required to contribute before each class to a shared Google doc by putting in a question about the reading or the homework, or by answering another student’s question. The material on this document helped us guide discussion during the class.

At the end of the semester, the students were required to add one more question, which then I responded to in the document itself.

Here it is!