Comedy and child abuse in literature

I recently read Never Mind, the first of the Patrick Melrose novels by Edward St. Aubyn. I was vaguely aware of these novels, I guess from reviews when the books came out, or from the occasional newspaper or magazine feature (the author as the modern-day Evelyn Waugh, etc.). Anyway, yeah, the book was great. Hilarious, thought-provoking, pitch-perfect, the whole deal. A masterful performance.

Also, it’s a very funny book about child abuse. The child abuse scenes in the book are not funny. They’re horrible, not played for laughs at all. But the book itself is hilarious, and child abuse is at the center of it.

This got me thinking about other classics of literary humor that center on child abuse: Lolita, of course, and also The Nickel Boys and L’Arabe du Futur.

I guess this is related to the idea that discomfort can be an aspect of humor. In these books, some of the grim humor arises from the disconnect between the horrible actions being portrayed and their deadpan depictions on the page.

I found all the above-mentioned books to be very funny, very fun to read, and very upsetting to read at the same time.

This well-known paradox of R-squared is still bugging me. Can you help me out?

There’s this well-known—ok, maybe not well-enough known—example where you have a strong linear predictor but R-squared is only 1%.

The example goes like this. Consider two states of equal size, one a “blue” state where the Democrats consistently win 55% of the two-party vote and the other a “red” state where the Republicans win 55-45. The two states are very different in their politics! Now suppose you want to predict people’s votes based on what state they live in. Code the binary outcome as 1 for Republicans and 0 for Democrats: this is a random variable with standard deviation 0.5. Given state, the predicted value is either 0.45 or 0.55, hence a random variable with standard deviation 0.05. The R-squared is then 0.05^2/0.5^2 = 0.01, or 1%.
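For readers who want to check the arithmetic, here is a quick simulation of the two-state example; the number of voters per state is an arbitrary choice for the simulation:

```python
# Simulation of the two-state example: strong state-level signal, 1% R-squared.
import numpy as np

n = 100_000  # voters per state (arbitrary; any large number works)
rng = np.random.default_rng(0)

# Outcome coded 1 = Republican, 0 = Democrat
y_blue = rng.binomial(1, 0.45, n)  # blue state: Republicans win 45%
y_red = rng.binomial(1, 0.55, n)   # red state: Republicans win 55%
y = np.concatenate([y_blue, y_red])

# The prediction given state is just that state's Republican share
y_hat = np.concatenate([np.full(n, 0.45), np.full(n, 0.55)])

r_squared = np.var(y_hat) / np.var(y)  # close to 0.05^2 / 0.5^2 = 0.01
```

The numerator is exactly 0.05^2 = 0.0025 by construction; the denominator comes out near 0.5^2 = 0.25 in the simulation.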

There’s no trick here; the R-squared here really is 1%. We’ve brought up this example before, and commenters pointed to this article by Rosenthal and Rubin from 1979 giving a similar example and this article by Abelson from 1985 exploring the issue further.

I don’t have any great intuition for this one, except to say that usually we’re not trying to predict one particular voter; we’re interested in aggregates. So the denominator of R-squared, the total variance, which is 0.5^2 in this particular case, is not of much interest.

I’m not thrilled with that resolution, though, because suppose we compare two states, one in which the Democrats win 70-30 and one in which the Republicans win 70-30. The predicted probability is either 0.7 or 0.3, hence has a standard deviation of 0.2, so the R-squared is 0.2^2/0.5^2 = 0.16. That still seems very low, even though in this case you’re getting a pretty good individual prediction (the likelihood ratio is (0.7/0.3)/(0.3/0.7) = 0.49/0.09 ≈ 5.4).
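The 70-30 comparison can be checked directly; this is just the arithmetic from the paragraph above:

```python
# The 70-30 comparison, done analytically: R-squared still looks small even
# though each individual prediction is fairly informative.
p_blue, p_red = 0.30, 0.70  # Republican share of the two-party vote in each state

# Predicted values are 0.3 or 0.7, so their standard deviation is 0.2
sd_pred = (p_red - p_blue) / 2
r_squared = sd_pred**2 / 0.5**2  # 0.2^2 / 0.5^2 = 0.16

# Likelihood ratio for one voter's party as evidence about their state
lr = (p_red / (1 - p_red)) / (p_blue / (1 - p_blue))  # 0.49/0.09, about 5.4
```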

I guess the right way of thinking about this sort of example is to consider some large number of individual predictions . . . I dunno. It’s still bugging me.

“Beyond the black box: Toward a new paradigm of statistics in science” (talks this Thursday in London by Jessica Hullman, Hadley Wickham, and me)

Sponsored by the Alan Turing Institute, the talks will be Thursday 20 June 2024, 5:30pm, at King’s College London. You can register for the event here, and here are the titles and abstracts of the three talks:

Beyond the black box: Toward a new paradigm of statistics in science

Andrew Gelman

Standard paradigms for data-based decision making and policy analysis fail, and have led to a replication crisis in science, because they can’t handle uncertainty and variation and don’t seriously engage with the quality of evidence. We discuss how this has happened, touching on the piranha problem, the butterfly effect, the magic number 16, the one-way-street fallacy, the backpack fallacy, the Edlin factor, Clarke’s law, the analyst’s paradox, and the greatest trick the default ever pulled. We then discuss ways to go beyond the push-a-button, take-a-pill model to a more active engagement of data in science.

Data Analysis as Imagination

Jessica Hullman

Learning from data, whether in exploratory or confirmatory analysis settings, requires one to reason about the likelihood of many competing explanations. However, people are boundedly rational agents who often engage in pattern-finding at the expense of recognising uncertainty or considering potential sources of heterogeneity and variation in the effects they seek to discover. Taking this seriously motivates new classes of interface tools that help people extend their imagination in hypothesising and interpreting effects.

Data science in production

Hadley Wickham

This talk will discuss what it means to put data science “in production”. In industry, any successful data science project will be run repeatedly for months or years, typically on a server that can’t be worked with interactively. This poses an entirely new set of challenges that won’t be encountered in university classes, but that are vital to overcome if you want to have an impact in your job.

In this talk, Hadley discusses three principles useful for understanding data science in production: not just once, not just one computer, and not just alone. Hadley discusses the challenges associated with each and, where possible, what solutions (both technical and sociological) are currently available.


Myths of American history from the left, right, and center; also a discussion of the “Why everything you thought you knew was wrong” genre of book.

Sociologist Claude Fischer has an interesting review of an edited book, “Myth America: Historians Take On the Biggest Legends and Lies About Our Past.”

I’m a big fan of the “Why everything you thought you knew was wrong” genre—it’s a great way to get into the topic. Don’t get me wrong: I’m not a fan of contrarianism for its own sake, especially when dressed in the language of expertise (see, for example, here and here). My point, rather, is that if you do have something reasonable to say, the contrarian or revisionist framework can be a good way to do it.

Just as a contrast: Our book Bayesian Data Analysis, published in 1995, had many reasonable takes that were different from what was standard in Bayesian statistics at the time. We just presented our results and methods straight, with no editorializing and very little discussion of how our perspective differed from what had come before. That was fine—the book was a success, after all—but, arguably, our presentation would’ve been even clearer and more compelling had we talked about where we thought that existing practice was wrong.

Why did we do it that way, writing the book using such a non-confrontational framing? It was my reaction to the academic literature on Bayesian statistics which was full of debate and controversy. Debate and controversy are fun, and can be a great way to learn—but the message I wanted to convey in our book was that Bayesian methods are useful for solving real problems, both theoretical and applied, not that Bayesian inference was necessary or that it satisfied some optimality property. I wanted BDA to be the book that took Bayesian inference beyond philosophy and argument toward methodology and applications. So I consciously avoided framing anything as controversial. The goal was to demonstrate how and why to do it, not to win a debate.

As I say, I think our tack worked. But there are ways in which it’s better to acknowledge, address, and argue against contrary views, rather than to simply present your own perspective. Both for historical reasons and for pedagogical practice, it’s good to talk about what else is out there—and also to explore the holes in your own recommended approach. In writing BDA, we were pushing against decades of Bayesian statistics being associated with philosophy and argument; thirty years later, we have moved on, Bayesian methods are here to stay, and there’s room for more open discussion of challenges and controversies.

To return to the topic of our post . . . In his discussion of “Myth America: Historians Take On the Biggest Legends and Lies About Our Past,” Fischer writes:

The book’s premise is that these myths derange our politics and undermine sound public policy. Although the authors address a few “bipartisan myths,” they focus on myths of the Right. . . . They might have [written about] the kernels of truths that are found in some conservative stories and also by addressing myths on the Left. . . .

Fischer shares a bunch of interesting examples. First, some myths that were successfully shot down in the book under discussion:

Akhil Reed Amar argues that the Constitution was not designed to restrain popular democracy but was instead a remarkably populist document for its time.

Daniel Immerwahr debunks the sanctimony that the U.S. has not pursued empire . . .

Michael Kazin criticizes the depiction of socialism as a recent infection from overseas. . . .

Elizabeth Hinton challenges the view that harsh police suppression is typically a reaction to criminal violence. She chronicles the long history, especially but not only in the South, of authorities aggressively policing even quiescent communities.

Before going on, let me say that this first myth, that the Constitution was designed to restrain popular democracy, seems to me as much of a myth of the Left as of the Right. On the right, sure, there’s this idea that the constitution protects us from mob rule; but it also seems like a leftist take, that the so-called founding fathers were just setting up a government to protect the interests of property owners. To the extent that this belief is actually in error, it seems to me that Amar is indeed addressing a myth on the Left.

In any case, Fischer continues with “a few examples of conceding some conservative points”:

Carol Anderson reviews recent Republican efforts, starting long before Trump, to cry voter fraud. Although little evidence points to substantial fraud these days, imposters, vote-buying, and messing with ballot boxes was common in the past. It was most visible, though not necessarily more common, in immigrant-filled cities run by Democratic machines. . . .

Erika Lee’s and Geraldo Cadava’s chapters on the southern border and on undocumented immigration undercut the current Fox News hysteria. They discuss the long history of cross-border movement and the repeated false alarms about foreigners. . . . However, the authors might have admitted that large-scale immigration is often disruptive. . . . reactions were not just xenophobic, but often over real material and cultural worries.

Contributors describe as “rebellions” the violence that broke out in Black neighborhoods at many points in the 20th century, but they do not dignify similar actions on the Right as rebellions, for example, the anti-immigrant riots of the 19th century and the anti-busing violence of the 1970s. These outbursts also entailed aggrieved communities raging against elites who imported scabs and elites who imposed school integration. Why are some labeled as rebellions and others as riots?

Naomi Oreskes and Erik M. Conway debunk the “The Magic of the Marketplace” myth. American businessmen and American economic growth have always relied heavily on government investment and subsidies . . . Still, a complete story would have appreciated how risk-taking entrepreneurs, from the Vanderbilts to the Fords, effectively deployed resources in ways that enabled prosperity for most Americans.

Fischer then provides some “legends of the Left, critique of which might burnish the historians’ reputation for objectivity and balance”:

The slumbering progressive vote. Seemingly forever, but certainly in recent years, voices on the Left have claimed that millions of working-class, minority voters are poised to vote for progressives if only candidates and parties spoke to their interests (the “What’s the Matter with Kansas?” question). Repeatedly, such voters have not emerged. Trump, however, did mobilize many chronic non-voters, suggesting that there are probably more slumbering right-populists than left-populists.

Explaining the Civil War. In a perennial argument, some on the Right minimize the role of slavery so as to promote the “Lost Cause” story that the war was about states’ rights . . . Some on the Left have also downplayed slavery, preferring to interpret the war as a struggle between different kinds of business interests, thereby both inflating the role of capitalism and blaming it. Also wrong; the cause was slavery.

“People of Color.” This label is an ahistorical effort to sort ethnoracial groups into two classes . . . submerges from view the vastly different experiences of, say, the descendants of slaves, third-generation Mexican-Americans, refugees from Afghanistan, and immigrants from Ghana. It would seem to lump together somewhat pale Latin or Native Americans with dark-skinned but economically successful Asians. . . . POC is a rhetorical slogan, not a historically-rooted category.

I don’t have anything to add here. Fischer’s discussion and examples are interesting. I don’t buy his argument that more left-right balance in the book would represent an “opportunity to increase public confidence in professional history.” To put together a book like this in order to increase public confidence seems like a mug’s game. I think you just have to put together the best book you can. And, at that point, you can do it two ways. Either present a frankly partisan view and own up to it, saying something like, “Lots of sources will give you the dominant conservative perspective on American history; here, we present a perspective that is well known in academia but does not always make it into the school books or the news broadcasts.” That’s what James Loewen did in his classic book, “Lies My Teacher Told Me.” Or you try your best to present a range of political perspectives and then you make that clear, saying something like, “There are many persistent misreadings of American history from conservatives, and our book shoots these down. In addition we have seen overreactions from the other direction, and we discuss some of these too.” I don’t think either of these presentations will do much to “increase public confidence in professional history,” but they have the advantage of clarity, and they help the editors and the readers alike to place the book within past and current political debates.

P.S. One of the editors of the book under discussion is Princeton historian Kevin Kruse, who came up in this space a couple years ago regarding some plagiarism in his Ph.D. thesis. Fischer didn’t mention this in his post. I guess plagiarism isn’t so relevant here, given that Kruse is just the editor of the book, not the author. As an editor of a couple of books myself, I ended up doing a lot of editing of other people’s chapters, which I guess is kind of the opposite of plagiarism. I think that’s expected, that the editors will do some writing as necessary to get the project done. I have no idea how Kruse and his co-editor Julian Zelizer operated with this particular book.

P.P.S. The other editor of that book is Julian Zelizer. I’ve collaborated with Adam Zelizer. Are they related? How many Zelizers could there be in political science??

One way you can understand people is to look at where they prefer to see complexity.

In her article, “On not sleeping with your students,” philosopher Amia Srinivasan writes that she was struck by “how limited philosophers’ thinking was”:

How could the same people who were used to wrestling with the ethics of eugenics and torture (issues you might have imagined were more clear-cut) think that all there was to say about professor-student sex was that it was fine if consensual?

Many philosophers prefer to see complexity only where it suits them.

This was interesting, and it gave me two thoughts.

First there’s the whole asshole angle: a philosopher being proudly bold and transgressive by considering the virtues of torture while not reflecting on issues closer to home. This reminds me of our quick rule of thumb: when someone seems to be acting like a jerk, an economist will defend the behavior as being the essence of morality, but when someone seems to be doing something nice, an economist will raise the bar and argue that he’s not being nice at all. The point is that in some areas of academia it’s considered a positive to be counterintuitive and unpredictable. One thing I like about Srinivasan is that she’s not doing that. Like Bertrand Russell, she’s direct. Don’t get me wrong, Bertrand Russell had lots of problems in his philosophy as well as in his life (just take a look at Ray Monk’s biography of him), but I appreciate the clarity and directness of his popular philosophical writing. Indeed, that clarity and directness can make it easier to see problems in what he wrote, and that’s good too.

The bit that really caught me in the above excerpt, though, was that last sentence, which got me thinking that one way you can understand people is to look at where they prefer to see complexity. I’m not quite sure what to do with this; I’m still chewing on it. It reminds me of the principle that you can understand people by looking at what bothers them. I wrote a post on that, many years ago, but now I can’t find it.

Loving, hating, and sometimes misinterpreting conformal prediction for medical decisions

This is Jessica. Conformal prediction, referring to a class of distribution-free approaches to quantifying predictive uncertainty, has attracted interest for medical AI applications. Reasons include that prediction sets seem to align with the kinds of differential diagnoses doctors already use, and that they can support common triage decisions like ruling in and ruling out critical conditions.

However, like any uncertainty quantification technique, the nuance needed to describe what conformal approaches provide can get lost in translation. We have catalogs of common misinterpretations of p-values, confidence intervals, Bayes factors, AUC, etc., to which we might now add misinterpretations of conformal prediction. The below set is based on what I’m seeing as I read papers about applying conformal prediction for medical decision-making. If you’ve encountered others that I’ve missed (even if not in a health setting), please share them.

Misconception 1: Conformal prediction provides individualized uncertainty

It would be great if we could get prediction sets with true conditional coverage without having to make distributional assumptions, i.e., if we could guarantee that the probability that a prediction set at any fixed test point X_n+1 contains the true label is at least 1 – alpha. Unfortunately, assumption-free conditional coverage is not possible. But some enthusiastic takes on conformal prediction describe what it provides as if conditional coverage were achieved.

For example, Dawei Xie pointed me to this Nature Medicine commentary that calls for clinical uses of AI to include predictive uncertainty. The authors start with what appears to be a common motivation for conformal prediction in health: standard AI pipelines optimize population-level accuracy, failing to capture “the vital clinical fact that each patient is a unique person,” motivating methods that can “provide reliable advice for all individual patients.” The goal is to use uncertainty associated with the prediction to decide whether to abstain and bring in a human expert, who might gather more information or consider how the model was developed. 

This is all fine. The problem is that they propose to solve this challenge with conformal prediction, which they describe as a new tool “that can produce personalized measures of uncertainty.” You can get “relaxed” versions of conditional coverage, but no truly personalized quantification of uncertainty.

Misconception 2: The non-conformity score makes conformal prediction robust to distribution shift

Another potential source of misinterpretation is the non-conformity score. In split conformal prediction, this is the score that is calculated for (x,y) pairs in a held-out calibration set in order to find the threshold expected to achieve at least 1-alpha coverage on test instances. Then given a new instance, its non-conformity score is compared to the threshold to determine which labels go in the prediction set. The non-conformity score can be any negatively-oriented score function derived from the trained model’s predictions, though the closer it approximates a residual the more useful the sets are likely to be. A simple example would be 1 – f_hat(xi)_y where f_hat(xi)_y is the softmax value for label y produced by the last layer of a neural net, and the threshold is based on the distribution of 1 – f_hat(xi)_yi in the calibration set, where yi is the true label. 
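The split-conformal recipe just described (calibration scores, corrected quantile threshold, thresholded prediction sets) can be sketched in a few lines. The three-class "model" f_hat below is a toy assumption for illustration, not from any of the papers discussed:

```python
# Minimal split-conformal sketch using the 1 - softmax non-conformity score.
# The three-class "model" f_hat is a toy assumption for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n_cal, alpha = 1000, 0.1

def f_hat(x):
    """Toy model: softmax over three class scores that depend on x."""
    logits = np.stack([x, 1 - x, 0.5 * np.ones_like(x)], axis=1)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Calibration: score each (x, y) pair by 1 - softmax prob of the TRUE label
x_cal = rng.uniform(size=n_cal)
y_cal = rng.integers(0, 3, size=n_cal)
scores = 1 - f_hat(x_cal)[np.arange(n_cal), y_cal]

# Threshold: finite-sample-corrected (1 - alpha) quantile of the scores
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# Prediction set for a new point: every label whose score clears the threshold
x_new = np.array([0.8])
pred_set = [y for y in range(3) if 1 - f_hat(x_new)[0, y] <= q]
```

Nothing in this recipe looks at how far x_new is from the training data; the threshold is a single quantile of calibration scores, which is why the coverage guarantee is marginal, not individualized.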

One could say that non-conformity scores capture how dissimilar an (x,y) pair under consideration is from what the model has learned about label posterior distributions from the training data. But some of the application papers I’m seeing make more generic statements, describing the score as measuring how strange the new instance is, as if in an absolute sense, or how unusual the new instance is relative to the training data, as if the score were used to detect distribution shift.

Misconception 3: You can get knowledge-free robustness to distribution shift 

Some papers acknowledge that standard split conformal coverage is not robust to violations of exchangeability, and cite work that relaxes this assumption to get coverage under certain types of distribution shifts. The risk here is describing these approaches as if one can get valid coverage under shifts without having to introduce any additional assumptions. Even in the work of Gibbs et al., which makes the least assumptions as far as I can tell, you still have to select a function class that covers the shifts you want coverage to be robust to. There is no “knowledge-free” way around violations of the typical assumptions.

Misconception 4: Conformal prediction can only provide marginal coverage over the randomness in calibration set and test points

In contrast to the above, I’ve also seen a few more skeptical takes on conformal prediction for medical decision making, arguing that conformal prediction sets are unreliable under shifts in input and label distributions and for subsets of the data. Papers that make these arguments can also mislead, by implying that any use of conformal prediction equates to simple split conformal prediction where coverage is marginal over the randomness in the calibration and test set points. This neglects to acknowledge the development of approaches that provide class-conditional or group-conditional coverage or the previously mentioned attempts at coverage under classes of shifts. Beware blanket statements that write off entire classes of approaches based on what the simplest variations achieve. 
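For instance, class-conditional (sometimes called Mondrian) split conformal just computes one calibration threshold per true class instead of a single pooled one, so coverage holds within each class rather than only marginally. The toy score distributions below are assumptions for illustration:

```python
# Class-conditional ("Mondrian") split conformal: one calibration threshold
# per true class rather than a single pooled threshold. The toy score
# distributions are assumptions; class 1 is deliberately "harder".
import numpy as np

rng = np.random.default_rng(2)
alpha, n_per_class = 0.1, 500

# Calibration non-conformity scores grouped by true class
cal_scores = {0: rng.beta(2, 5, n_per_class),   # mostly low scores: easy class
              1: rng.beta(5, 2, n_per_class)}   # mostly high scores: hard class

# Finite-sample-corrected quantile level, applied within each class
level = np.ceil((n_per_class + 1) * (1 - alpha)) / n_per_class
q = {c: np.quantile(s, level) for c, s in cal_scores.items()}

def prediction_set(score_by_class):
    """A label enters the set if its score clears its OWN class threshold."""
    return [c for c, s in score_by_class.items() if s <= q[c]]
```

The harder class ends up with a higher threshold, so it is included more readily; that is the mechanism by which per-class coverage is restored.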

Progress in AI may be exploding, but achieving nuance in discussions of uncertainty quantification is still hard.

Statistics Blunder at the Supreme Court

Joe Stover points to this op-ed by lawyer and political activist Ted Frank, who writes:

Even Supreme Court justices are known to be gullible. In a dissent from last week’s ruling against racial preferences in college admissions, Justice Ketanji Brown Jackson enumerated purported benefits of “diversity” in education. “It saves lives,” she asserts. “For high-risk Black newborns, having a Black physician more than doubles the likelihood that the baby will live.”

A moment’s thought should be enough to realize that this claim is wildly implausible. . . . the actual survival rate is over 99%.

Indeed, there’s no treatment that will take the survival rate up to 198%.

Frank continues:

How could Justice Jackson make such an innumerate mistake? A footnote cites a friend-of-the-court brief by the Association of American Medical Colleges, which makes the same claim in almost identical language. It, in turn, refers to a 2020 study . . . [which] makes no such claims. It examines mortality rates in Florida newborns between 1992 and 2015 and shows a 0.13% to 0.2% improvement in survival rates for black newborns with black pediatricians (though no statistically significant improvement for black obstetricians).

The AAMC brief either misunderstood the paper or invented the statistic. (It isn’t saved by the adjective “high-risk,” which doesn’t appear and isn’t measured in Greenwood’s paper.)

Here’s the quote from the brief by the Association of American Medical Colleges:

And for high-risk Black newborns, having a Black physician is tantamount to a miracle drug: it more than doubles the likelihood that the baby will live.

Here’s the relevant passage from the cited article, “Physician–patient racial concordance and disparities in birthing mortality for newborns”:

And here’s the relevant table:

Stover summarizes:

As far as I can tell, the justification for the quote is probably in Table 1, col. 1. Baseline mortality rate (white newborn + white dr) is 290 (per 100k). Black newborn is +604 above that giving 894/100k. Then -494 from that when it is black newborn + black dr giving 400/100k. So the black newborn mortality rate is more than cut in half when the doctor is also black.

So while the amicus brief did seem to misunderstand or misrepresent the study, the qualitative finding still holds.

Of course, maybe there are other statistical problems. I figure these basic stats don’t need a model though and could have been pulled out of the raw dataset easily.
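To spell out that arithmetic (numbers as read off Table 1 by Stover, rates per 100,000 births), along with the survival-rate framing that shows why "doubles the likelihood that the baby will live" can't be right:

```python
# Arithmetic from Table 1 as Stover reads it (rates per 100,000 births),
# plus the survival-rate framing that shows why "doubles the likelihood
# the baby will live" can't be right.
baseline = 290                         # white newborn, white physician
black_white_dr = baseline + 604        # 894: black newborn, white physician
black_black_dr = black_white_dr - 494  # 400: black newborn, black physician

mortality_ratio = black_white_dr / black_black_dr  # about 2.2: more than halved

survival_white_dr = 1 - black_white_dr / 100_000   # 0.99106
survival_black_dr = 1 - black_black_dr / 100_000   # 0.99600
survival_ratio = survival_black_dr / survival_white_dr  # about 1.005, not 2
```

Mortality more than halves, but survival goes from 99.1% to 99.6%, a relative change of about half a percent; doubling a survival rate that is already near 99% is arithmetically impossible.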

There’s also Table 2 of the article, which presents data on babies with and without comorbidities. I’m guessing that’s what the amicus brief was talking about when referring to “at-risk” newborns.

In any case, the judge’s key mistake was to trust the amicus brief. I guess this shows a general problem when judges rely on empirical evidence. On one hand, a judge and a judge’s staff are a bunch of lawyers with no particular expertise in evaluating scientific claims (it’s not like they’re gonna go read journal articles and try to untangle what’s in Table 1 of the Results section). On the other hand, evidently the Association of American Medical Colleges has no such expertise either. I can see why some judges would prefer to rely entirely on legal reasoning and leave empirical findings aside. But sometimes they need to rule based on the facts of a case, and then empirical results can matter . . . so I’m not sure what they’re supposed to do! I guess I’m overthinking this somehow, but I’m not quite sure where.

How could the judge’s opinion have been changed to accurately summarize this research? Instead of “For high-risk Black newborns, having a Black physician more than doubles the likelihood that the baby will live,” she could’ve written, “A study from Florida found Black infant mortality rates to be half as high with Black physicians as with White physicians.” OK, this could probably be phrased better, but here are the key improvements:
– Instead of just saying the statement as a general truth, localize it to “a study from Florida.”
– Instead of saying “more than doubles the likelihood that the baby will live,” say that the mortality rate halved.

That last bit is kind of funny . . . but I can see that if you’re writing an amicus brief in a hurry, you can, without reflection, think that “reducing risk of death by half” is the same as “doubling the survival rate.” I mean, sure, once you think about it, it’s obviously wrong, but it almost sounds right if you’re just letting the words flow. This is not an excuse!—I’m sure that whoever wrote that brief is really embarrassed right now—just an attempt at understanding.

Evaluating quantitative evidence is hard! A couple posts from the archive brought up errors from Potter Stewart and Antonin Scalia. I’ll do my small part in all of this by referring to these people as judges, not “Justices.”

Faculty and postdoc jobs in computational stats at Newcastle University (UK)

If you’re looking for a job in computational statistics, Newcastle is hiring one or two faculty positions and a postdoc position. The application deadline is in 10 days for the postdoc position and in a month for the faculty positions.

Close to the action

The UK (and France) are where much of the action is in MCMC, especially theory, in my world. There are great Bayesian computational statisticians in Newcastle, including Professor Chris Oates, and the department head, Professor Murray Pollock. In case you don’t know UK geography, Durham is a mere 30km down the road, with even more MCMC and comp stats researchers.

Faculty position(s)

These are lecturer and senior lecturer positions, which are roughly equivalent to assistant and associate professor positions in the U.S.

Lecturer Job ad

Application deadline: 18 July 2024

3 year postdoc position

Application deadline: 23 June 2024

Postdoc job ad

About the postdoc

I’m at a conference with both Chris Oates and Murray Pollock in London right now. Murray just gave a really exciting talk on the topic of the postdoc, which is federated learning as part of the FUSION ERC project, which involves a bigger network of MCMC researchers including Christian Robert and Eric Moulines in Paris and Gareth Roberts at University of Warwick. They’re applying cutting edge diffusion processes (e.g., Brownian bridges) to recover exact solutions to the federated learning problem (where subsampled data sets, for instance from different hospitals, are fit independently and their posteriors are later combined without sharing all the data).

For more information

Contact: Murray Pollock (Murray.Pollock (at)

Arnold Foundation and Vera Institute argue about a study of the effectiveness of college education programs in prison.

OK, this one’s in our wheelhouse. So I’ll write about it. I just want to say that writing this sort of post takes a lot of effort. When it comes to social engagement, my benefit/cost ratio is much higher if I just spend 10 minutes writing a post about the virtues of p-values or whatever. Maximizing the number of hits and blog comments isn’t the only goal, though, and I do find that writing this sort of long post helps me clarify my thinking, so here we go. . . .

Jonathan Ben-Menachem writes:

Two criminal justice reform heavyweights are trading blows over a seemingly arcane subject: research methods. . . . Jennifer Doleac, Executive Vice President of Criminal Justice at Arnold Ventures, accused the Vera Institute of Justice of “research malpractice” for their evaluation of New York college-in-prison programs. In a response posted on Vera’s website, President Nick Turner accused Doleac of “giving comfort to the opponents of reform.”

At first glance, the study at the core of this debate doesn’t seem controversial: Vera evaluated Manhattan DA-funded college education programs for New York prisoners and found that participants were less likely to commit a new crime after exiting prison. . . . Vera used a method called propensity score matching, and constructed a “control” group on the basis of prisoners’ similarity to the “treatment” group. . . . Despite their acknowledgment that “differences may remain across the groups,” Vera researchers contended that “any remaining differences on unobserved variables will be small.”

Doleac didn’t buy it. . . . She argued that propensity score matching could not account for potentially different “motivation and focus.” In other words, the kind of people who apply for classes are different from people who don’t apply, so the difference in outcomes can’t be attributed to prison education. . . .
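Doleac's objection can be illustrated with toy data. Everything below (the covariates, coefficients, and the unobserved "motivation" variable) is made up for illustration; this is the general propensity-score-matching setup, not Vera's actual analysis:

```python
# Toy illustration of propensity score matching and of the unobserved-
# confounder objection. All covariates and coefficients here are invented;
# this is NOT Vera's actual analysis.
import numpy as np

rng = np.random.default_rng(3)
n = 2000
age = rng.normal(30, 8, n)   # hypothetical observed covariate
prior = rng.poisson(2, n)    # hypothetical observed covariate

# Enrollment depends on observed covariates AND on unobserved motivation
motivation = rng.normal(size=n)
p_enroll = 1 / (1 + np.exp(-(-1 + 0.02 * age - 0.1 * prior + motivation)))
treated = rng.binomial(1, p_enroll).astype(bool)

# Propensity score from observed covariates only (in practice this would be
# estimated, e.g. by logistic regression; here we use the known
# observed-covariate part of the toy model for brevity)
pscore = 1 / (1 + np.exp(-(-1 + 0.02 * age - 0.1 * prior)))

# Match each enrolled person to the non-enrolled person w/ nearest pscore
controls = np.where(~treated)[0]
matches = controls[np.abs(pscore[controls][None, :]
                          - pscore[treated][:, None]).argmin(axis=1)]

# Recidivism depends on motivation only: the true program effect is ZERO
recid = rng.binomial(1, 1 / (1 + np.exp(motivation)))
effect_estimate = recid[treated].mean() - recid[matches].mean()
# effect_estimate comes out negative anyway: the unobserved confounder
# survives the matching
```

Even though the program does nothing in this toy world, the matched comparison shows the enrolled group reoffending less, because matching on observed covariates cannot balance the unobserved motivation that drove enrollment in the first place.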

Here’s Doleac’s full comment:

Vera Institute just released this study of a college-in-prison education program in NY, funded by the Manhattan DA’s Criminal Justice Investment Initiative. Researchers compared people who chose to enroll in the program with similar-looking people who chose not to. This does not isolate the treatment effect of the education program. It is very likely that those who enrolled were more motivated to change, and/or more able to focus on their goals. This pre-existing difference in motivation & focus likely caused both the difference in enrollment in the program and the subsequent difference in recidivism across groups.

This report provides no useful information about whether this NY program is having beneficial effects.

Now we return to Ben-Menachem for some background:

This fight between big philanthropy and a nonprofit executive is extremely rare, and points to a broader struggle over research and politics. The Vera Institute boasts a $264 million operating budget, and . . . has been working on bail reform since the 1960s. Arnold Ventures was founded in 2010, and the organization has allocated around $400 million to criminal justice reform—some of which went to Vera.

How does the debate over methods relate to larger policy questions? Ben-Menachem writes:

Although propensity score matching does have useful applications, I might have made a critique similar to Doleac if I was a peer reviewer for an academic journal. But I’m not sure about Doleac’s claim that Vera’s study provides “no useful information,” or her broader insistence on (quasi) experimental research designs. Because “all studies on this topic use the same flawed design,” Doleac argued, “we have *no idea* whether in-prison college programming is a good investment.” This is a striking declaration that nothing outside of causal inference counts.

He connects this to an earlier controversy:

In 2018, Doleac and Anita Mukherjee published a working paper called “The Moral Hazard of Lifesaving Innovations: Naloxone Access, Opioid Abuse, and Crime” which claimed that naloxone distribution fails to reduce overdose deaths while also “making riskier opioid use more appealing.” In addition to measurement problems, the moral hazard frame partly relied on an urban myth—“naloxone parties,” where opioid users stockpile naloxone, an FDA approved medication designed to rapidly reverse overdose, and intentionally overdose with the knowledge that they can be revived. The final version of the study includes no references to “naloxone parties,” removes the moral hazard framing from the title, and describes the findings as “suggestive” rather than causal.

Later that year, Doleac and coauthors published a research review in Brookings citing her controversial naloxone study claiming that both naloxone and syringe exchange programs were unsupported by rigorous research. Opioid health researchers immediately demanded a retraction, pointing to heaps of prior research suggesting that these policies reduce overdose deaths (among other benefits). . . .

Ben-Menachem connects this to debates between economists and others regarding the role of causal inference. He writes:

While causal inference can be useful, it is insufficient on its own and arguably not always necessary in the policy context. By contrast, Vera produces research using a very wide variety of methods. This work teaches us about the who, where, when, what, why, and how of criminalization. Causal inference primarily tells us “whether.”

I disagree with him on this one. Propensity score matching (which should be followed up with regression adjustment; see for example our discussion here) is a method that is used for causal inference. I will also channel my causal-inference colleagues and say that, if your goal is to estimate and understand the effects of a policy, causal inference is absolutely necessary. Ben-Menachem’s mistake is to identify “causal inference” with some particular forms of natural-experiment or instrumental-variables analyses. Also, no matter how you define it, causal inference primarily tells us, or attempts to tell us, “how much” and “where and when,” not “whether.” I agree with his larger point, though, which is that understanding (what we sometimes call “theory”) is important.

I think Ben-Menachem’s framing of this as economists-doing-causal-inference vs. other-researchers-doing-pluralism misses the mark. Everybody’s doing causal inference here, one way or another, and indeed matching can be just fine if it is used as part of a general strategy for adjustment, even if, as with other causal inference methods, it can do badly when applied blindly.

But let’s move on. Ben-Menachem continues:

In a recent interview about Arnold Ventures’ funding priorities, Doleac explained that her goal is to “help build the evidence base on what works, and then push for policy change based on that evidence.” But insisting on “rigorous” evidence before implementing policy change risks slowing the steady progress of decarceration to a grinding halt. . . .

In an email, Vera’s Turner echoed this point. “The cost of Doleac’s apparently rigid standard is that it not only devalues legitimate methods,” he wrote, “but it sets an unreasonably and unnecessarily high burden of proof to undo a system that itself has very little evidence supporting its current state.”

Indeed, mass incarceration was not built on “rigorous research.” . . . Yet today some philanthropists demand randomized controlled trials (or “natural experiments”) for every brick we want to remove from the wall of mass incarceration. . . .

Decarceration is a fight that takes place on the streets and in city halls across America, not in the halls of philanthropic organizations. . . . the narrow emphasis on the evaluation standards of academic economists will hamstring otherwise promising efforts to undo the harms of criminalization.

Several questions arise here:

1. What can be learned from this now-controversial research project? What does it tell us about the effects of New York college-in-prison programs, or about programs to reduce prison time?

2. Given the inevitable weaknesses of any study of this sort (including studies that Doleac or I or other methods critics might like), how should its findings inform policy?

3. What should advocates’ or legislators’ views of the policy options be, given that the evidence in favor of the status quo is far from rigorous by any standard?

4. Given questions 1, 2, 3 above, what is the relevance of methodological critiques of any study in a real-world policy context?

Let me go through these four questions in turn.

1. What can be learned from this now-controversial research project?

First we have to look at the study! Here it is: “The Impacts of College-in-Prison Participation on Safety and Employment in New York State: An Analysis of College Students Funded by the Criminal Justice Investment Initiative,” published in November 2023.

I have no connection to this particular project, but I have some tenuous connection to both of the organizations involved in this debate: many years ago I attended a brief meeting at the Arnold Foundation regarding a study the Vera Institute was conducting on a program in the correctional system. And many years ago my aunt Lucy taught math at Sing Sing prison for a while.

Let’s go to the Vera report, which concludes:

The study found a strong, significant, and consistent effect of college participation on reducing new convictions following release. Participation in this form of postsecondary education reduced reconviction by at least 66 percent. . . .

Vera also conducted a cost analysis of these seven college-in-prison programs . . . Researchers calculated the costs reimbursed by CJII, as well as two measures of the overall cost: the average cost per student and the costs of adding an additional group of 10 or 20 students to an existing college program . . . Adding an additional group of 10 or 20 students to those colleges that provided both education and reentry services would cost colleges approximately $10,500 per additional student, while adding an additional group of students to colleges that focused on education would cost approximately $3,800 per additional student. . . . The final evaluation report will expand this cost analysis to a benefit-cost analysis, which will evaluate the return on investment of these monetary and resource outlays in terms of avoided incarceration, averted criminal victimization, and increased labor force participation and improved income.

And they connect this to policy:

This research indicates that academic college programs are highly effective at reducing future convictions among participating students. Yet, interest in college in prison among prospective students far outstrips the ability of institutions of higher education to provide that programming, due in no small part to resource constraints. In such a context, funding through initiatives such as CJII and through state and federal programs not only supports the aspirations of people who are incarcerated but also promotes public safety.

Now let’s jump to the methods. From page 13 of the report onward:

To understand the impact of access to a college education on the people in the program, Vera researchers needed to know what would have happened to these people if they had not participated in the program. . . . Ideally, researchers need these comparisons to be between groups that are otherwise as similar as possible to guard against attributing outcomes to the effects of education that may be due to the characteristics of people who are eligible for or interested in participating in education. In a fair comparison of students and nonstudents, the only difference between the two is that students participated in college education in prison while nonstudents did not. . . . One study of the impacts of college in prison on criminal legal system outcomes found that people who chose or were able to access education differed in their demographics, employment and conviction histories, and sentence lengths from people who did not choose or have the ability to access education. This indicates a need for research and statistical methods that can account for such “selection” into college education . . .

The best way to create the fair comparisons needed to estimate causal effects is to perform a randomized experiment. However, this was not done in this study due to the ethical impact of withholding from a comparison group an intervention that has established positive benefits . . . Vera researchers instead aimed to create a fairer comparison across groups using a statistical technique called propensity score matching . . . Vera researchers matched students and nonstudents on the following variables:
– demographics . . .
– conviction history . . .
– correctional characteristics . . .
– education characteristics . . .
Researchers considered nonstudents to be eligible for comparison not only if they met the same academic and behavioral history requirements as students but also if they had a similar time to release during the CIP period, a similar age at incarceration, and a similar time from prison admission to eligibility. . . . when evaluating whether an intervention influences an outcome of interest, it is a necessary but not sufficient condition that the intervention happens before the outcome. Vera researchers therefore defined a “start date” for students and a “virtual start date” for nonstudents in order to determine when to begin measuring in-facility outcomes, which included Tier II, Tier III, high-severity, and all misconducts. . . . To examine the effect of college education in prison on misconducts and on reported wages, Vera researchers used linear regression on the matched sample. For formal employment status and for an incident within six months and 12 months of release that led to a new conviction, Vera used logistic regression on the matched sample. For recidivism at any point following release, Vera used survival analysis on the matched sample to estimate the impact of the program on the time until an incident that leads to a new conviction occurs.
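To make the matching logic concrete, here is a minimal sketch of propensity score matching on simulated data. Everything in it is hypothetical—the covariates, the selection-on-motivation mechanism, and the assumed true effect of −0.10 on the reconviction probability are made up for illustration; this is not Vera's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical covariates standing in for the matching variables
# (demographics, conviction history, correctional and education
# characteristics).
x = rng.normal(size=(n, 3))

# Selection into the program: people higher on x[:, 0] ("motivation,"
# say) enroll more often.
treated = rng.random(n) < 1 / (1 + np.exp(-(x[:, 0] - 0.5)))

# Reconviction depends on that same covariate, plus an assumed true
# treatment effect of -0.10 on the probability scale.
p_recid = np.clip(0.40 - 0.05 * x[:, 0] - 0.10 * treated, 0.01, 0.99)
y = rng.random(n) < p_recid

# Step 1: propensity scores from a logistic regression, fit by Newton's
# method (to keep the example numpy-only).
X1 = np.column_stack([np.ones(n), x])
b = np.zeros(X1.shape[1])
for _ in range(20):
    p = 1 / (1 + np.exp(-X1 @ b))
    b += np.linalg.solve(X1.T @ ((p * (1 - p))[:, None] * X1),
                         X1.T @ (treated - p))
ps = 1 / (1 + np.exp(-X1 @ b))

# Step 2: one-to-one nearest-neighbor matching on the propensity score,
# with replacement.
controls = np.flatnonzero(~treated)
matches = controls[np.argmin(np.abs(ps[controls][None, :] -
                                    ps[treated][:, None]), axis=1)]

# Step 3: compare reconviction rates, raw and in the matched sample.
naive = y[treated].mean() - y[~treated].mean()
matched = y[treated].mean() - y[matches].mean()
print(f"raw difference:     {naive:+.3f}")
print(f"matched difference: {matched:+.3f}")
```

Note that the "motivation" covariate is observed in this toy example, so matching can adjust for it. Doleac's point is precisely that in the real study any such variable is unobserved, in which case no amount of matching on the recorded variables fixes the comparison.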

What about the concern expressed by Doleac regarding differences that are not accounted for by the matching and adjustment variables? Here’s what the report says:

Vera researchers have attempted to control [I’d prefer the term “adjust” — ed.] for pre-incarceration factors, such as conviction history, age, and gender, that may contribute to misconducts in prison. However, Vera was not able to control for other pre-incarceration factors that have been found in the literature to contribute to misconducts, such as marital status and family structure, mental health needs, a history of physical abuse, antisocial attitudes and beliefs, religiosity, socioeconomic disadvantage and exposure to geographically concentrated poverty, and other factors that, if present, would still allow a person to remain eligible for college education but might influence misconducts. Vera researchers also have not been able to control for factors that may be related to misconducts, including characteristics of the prison management environment, such as prison size, and the proportion of people incarcerated under age 25, as Vera did not have access to information about the facilities where nonstudents were incarcerated. Vera also did not have access to other programs that students and nonstudents may be participating in, such as work assignments, other programming, or health and mental health service engagement, which may influence in-facility behavior and are commonly used as controls in the literature. If other literature on the subject is correct and education does help to lower misconducts, Vera may have, by chance, mismatched students with controls who, unobserved to researchers and unmeasured in the data, were less likely to have characteristics or be exposed to environments that influence misconducts. While prior misconducts, assigned security class, and time since admission may, as proxies, capture some of this information, they may do so imperfectly.

They have plans to mitigate these limitations going forward:

First, Vera will receive information on new students and newly eligible nonstudents who have enrolled or become eligible following receipt of the first tranche of data. Researchers will also have the opportunity to follow the people in the analytical sample for the present study over a longer period of time. . . . Second, researchers will receive new variables in new time periods from both DOCCS and DOL. Vera plans to obtain more detailed information on both misconducts and counts of misconducts that take place in different time periods for the final report. . . . Next, Vera will obtain data on pre-incarceration wages and formal employment status, which could help researchers to achieve better balance between students and nonstudents on their work histories . . .

In summary: Yeah, observational studies are hard. You adjust for what you can adjust for, then you can do supplementary analyses to assess the sizes and directions of possible biases. I’m kinda with Ben-Menachem on this one: Doleac’s right that the study “does not isolate the treatment effect of the education program,” but there’s really no way to isolate this effect—indeed, there is no single “effect,” as any effect will vary by person and depend on context. But to say that the report “provides no useful information” about the effect . . . I think that’s way too harsh.

Another way of saying this is that, speaking in general terms, I don’t find adjusting for existing pre-treatment variables to be a worse identification strategy than instrumental variables, or difference-in-differences, or various other methods that are used for causal inference from observational studies. All these methods rely on strong, false assumptions. I’m not saying that these methods are equivalent, either in general or in any particular case, just that all have flaws. And indeed, in her work with the Arnold Foundation, Doleac promotes various criminal-justice reforms. So I’m not quite sure why she’s so bothered by this particular Vera study. I’m not saying she’s wrong to be bothered by it; there just must be more to the story, other reasons she has for concern that were not mentioned in her above-linked social media post.

Also, I don’t believe that estimate from the Vera study that the treatment reduces recidivism by 66%. No way. See the section “About that ‘66 percent’” below for details. So there are reasons to be bothered by that report; I just don’t quite get where Doleac is coming from in her particular criticism.

2. Given the inevitable weaknesses of any study of this sort, how should its findings inform policy?

I guess it’s the usual story: each study only adds a bit to the big picture. The Vera study is encouraging to the extent that it’s part of a larger story that makes sense and is consistent with observation. The results so far seem too noisy to be able to say much about the size of the effect, but maybe more will be learned from the followups.

3. What should advocates’ or legislators’ views of the policy options be, given that the evidence in favor of the status quo is far from rigorous by any standard?

This one I’m not sure about. It depends on your understanding of justice policy. Ben-Menachem and others want to reduce mass incarceration, and this makes sense to me, but others have different views and take the position that mass incarceration has positive net effects.

I agree with Ben-Menachem that policymakers should not stick with the status quo, just on the basis that there is no strong evidence in favor of a particular alternative. For one thing, the status quo is itself relatively recent, so it’s not like it can be supported based on any general “if it ain’t broke, don’t fix it” principle. But . . . I don’t think Doleac is taking a stick-with-the-status-quo position either! Yes, she’s saying that the Vera study “provides no useful information”—a statement I don’t really agree with—but I don’t see her saying that New York’s college-in-prison education program is a bad idea, or that it shouldn’t be funded. I take Doleac as saying that, if policymakers want to fund this program, they should be clear that they’re making this decision based on their theoretical understanding, or maybe based on political concerns, not based on a solid empirical estimate of its effects.

4. Given questions 1, 2, 3 above, what is the relevance of methodological critiques of any study in a real-world policy context?

Methodological critique can help us avoid overconfidence in the interpretation of results.

Concerns such as Doleac’s regarding identification help us understand how different studies can differ so much in their results: in addition to sampling variation and varying treatment effect, the biases of measurement and estimation depend on context. Concerns such as mine regarding effect sizes should help when taking exaggerated estimates and mapping them to cost-benefit analyses.

Even with all our concerns, I do think projects such as this Vera study are useful in that they connect the qualitative aspects of administering the program with quantitative evaluation. It’s also important that the project itself has social value and that the proposed mechanism of action makes sense. I’m reminded of our retrospective control study of the Millennium Villages project (here’s the published paper, here and here are two unpublished papers on the design of the study, and here’s a later discussion of our study and another evaluation of the project): the study could never have been perfect, but we learned a lot from doing a careful comparison.

To return to Ben-Menachem’s post, I think the framing of this as a “fight over rigor” is a mistake. The researchers at the Vera Institute and the economist at the Arnold Foundation seem to be operating at the same, reasonable, level of rigor. They’re concerned about causal identification and generalizability, they’re trying to learn what they can from observational data, etc. Regression adjustment with propensity scores is no more or less rigorous than instrumental variables or change-point analysis or multilevel modeling or any other method that might be applied in this sort of problem. It’s really all about the details.

It might help to compare this to an example we’ve discussed in this space many times before: flawed estimates of the effect of air pollution on lifespan. There’s a lot of theory and evidence that air pollution is bad for your life expectancy. The theory and evidence are not 100% conclusive—there’s this idea that a little bit of pollution can make you stronger by stimulating your immune system or whatever—but we’re pretty much expecting heavy indoor air pollution to be bad for you.

The question then comes up: what of policy relevance is learned from a really bad study of the effects of air pollution? I’d say, pretty much nothing. I have a more positive take on the Vera study, partly because it is very directly studying the effect of a treatment of interest. The analysis has some omitted-variables concerns, and the published estimates are, I believe, way too high, but it still seems to me to be moving the ball forward. I guess that one way they could do better would be to focus on more immediate outcomes. I get that reduction in recidivism is the big goal, but that’s kind of indirect, meaning that we would expect smaller effects and noisier estimates. Direct outcomes of participation in the program could be a better thing to focus on. But I’m speaking in general terms here, as I have no knowledge of the prison system etc.

About that “66 percent”

As noted above, the Vera study concluded:

Participation in this form of postsecondary education reduced reconviction by at least 66 percent.

“At least 66 percent” . . . where did this come from? I searched the paper for “66” and found this passage:

Vera’s study found that participation in college in prison reduced the risk of reconviction by 66 to 67 percent (a relative risk of 0.33 and 0.34). (See Table 7.) The impact of participation in college education was found to reduce reconviction in all three of the analyses (six months, 12 months, and at any point following release). The consistency of estimated treatment effects gives Vera confidence in the validity of this finding.

And here is the relevant table:

Ummmm . . . no. Remember Type M errors? The raw estimate is HUGE (a reduction in risk of 66%) and the standard error is huge too (I guess it’s about 33%, given that a p-value of 0.05 corresponds to an estimate that’s approximately two standard errors away from zero) . . . that’s the classic recipe for bias.
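A quick simulation shows how this recipe plays out. Suppose, purely for illustration, that the true risk reduction were 20 percentage points, with the 33-point standard error guessed above; then a study like this would rarely reach significance, and when it did, the estimate would be several times too large:

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.20   # assumed true risk reduction, for illustration only
se = 0.33            # rough standard error implied by the report

# A million hypothetical replications of the study.
est = rng.normal(true_effect, se, size=1_000_000)
signif = np.abs(est) > 1.96 * se          # replications reaching p < 0.05
power = signif.mean()
exaggeration = np.abs(est[signif]).mean() / true_effect
print(f"power ~ {power:.2f}, exaggeration ratio ~ {exaggeration:.1f}")
```

Under these assumed numbers, the study reaches significance only about 9% of the time, and the statistically significant estimates average roughly four times the true effect. That’s the Type M error in action.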

Give it a straight-up Edlin factor of 1/2 and your estimated effect is to reduce the risk of reconviction by 33%, which still sounds kinda high to me, but I’ll leave this one to the experts. The Vera report states that they “detected a much stronger effect than prior studies,” and those prior studies could very well be positively biased themselves, so, yeah, my best guess is that any true average effect is less than 33%.

So when they say, “at least 66 percent”: I think that’s just wrong, an example of the very common statistical error of reporting an estimate without correcting for bias.

Also, I don’t buy that the result appearing in all three of the analyses represents a “consistency of estimated treatment effects” that should give “confidence in the validity of this finding.” The three analyses have a lot of overlap, no? I don’t have the raw data to check what proportion of the reconvictions within 12 months or at any point following release already occurred within 6 months, and I’m not saying the three summaries are entirely redundant. But they’re not independent pieces of information either. I have no idea why the estimates are soooo close to each other; I guess that is probably just one of those chance things which in this case give a misleading illusion of consistency.

Finally, to say a risk reduction of “66 to 67 percent” is a ridiculous level of precision, given that even if you were to just take the straight-up classical 95% intervals you’d get a range of risk reductions of something like 90 percent to zero percent (a relative risk between 0.1 and 1.0).
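That rough interval follows from the usual back-of-envelope calculation on the log-risk scale, assuming the reported relative risk of 0.33 is just significant at p ≈ 0.05 (so that the estimate sits about two standard errors from zero):

```python
import numpy as np

rr = 0.33                     # reported relative risk
log_rr = np.log(rr)           # about -1.11 on the log scale
se_log = abs(log_rr) / 1.96   # SE implied if the estimate is just at p ~ 0.05
lo = np.exp(log_rr - 1.96 * se_log)
hi = np.exp(log_rr + 1.96 * se_log)
print(f"95% interval for the relative risk: ({lo:.2f}, {hi:.2f})")
# prints: 95% interval for the relative risk: (0.11, 1.00)
```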

So we’re seeing overestimation of effect size and overconfidence in what can be learned by the study, which is an all-too-common problem in policy analysis (for example here).

None of this has anything to do with Doleac’s point. Even with no issues of identification at all, I don’t think this treatment effect estimate of 66% (or “at least 66%” or “66 to 67 percent”) decline in recidivism should be taken seriously.

To put it another way, if the same treatment were done on the same population, just with a different sample of people, what would I expect to see? I don’t know—but my best estimate would be that the observed difference would be a lot less than 66%. Call it the Edlin factor, call it Type M error, call it an empirical correction, call it Bayes; whatever you want to call it, I wouldn’t feel comfortable taking that 66% as an estimated effect.

As I always say for this sort of problem, this does not mean that I think the intervention has no effect, or that I have any certainty that the effect is less than the claimed estimate. The data are, indeed, consistent with that claimed 66% decline. The data are also consistent with many other things, including (in my view more plausibly) smaller average effects. What I’m disagreeing with is the claim that the study provides strong evidence for that claimed effect, and I say this based on basic statistics, without even getting into causal identification.

P.S. Ben-Menachem is a Ph.D. student in sociology at Columbia and he’s published a paper on police stops in the APSR. I don’t recall meeting him, but maybe he came by the Playroom at some point? Columbia’s a big place.

How would the election turn out if Biden or Trump were replaced by a different candidate?

Paul Campos points to this post where political analyst Nate Silver writes:

If I’d told you 10 years ago a president would seek re-election at 81 despite a supermajority of Americans having concerns about his age, and then we’d hit 8% inflation for 2 years, you wouldn’t be surprised he was an underdog for reelection. You’d be surprised it was even close! . . .

Trump should drop out! . . . Biden would lose by 7 points, but I agree, the Republican Party and the country would be better served by a different nominee.

Campos points out that the claim that we “hit 8% inflation for 2 years” is untrue—actually, “Inflation on a year over year basis hit 8% or higher for exactly seven months of the Biden presidency, from March through September of 2022, not ‘two years.’ It did not hit 8% in any calendar year”—and I guess that’s part of the issue here. The fact that Silver, who is so statistically aware, made this mistake is an interesting example of something that a lot of people have been talking about lately, the disjunction between economic performance and economic perception. I don’t know how Nate will respond to the “8% inflation for 2 years” thing, but I guess he might say that it feels like 8% to people, and that’s what matters.

But then you’d want to rephrase Nate’s statement slightly, to say something like:

If I’d told you 10 years ago a president would seek re-election at 81 (running against an opponent who is 77) despite a supermajority of Americans having concerns about his age, and with inflation hitting 9% in the president’s second year and then rapidly declining to 3.5% but still a concern in the polls . . .

If Nate had told me that ten years ago, I’m not sure what I’d have thought. I guess if he’d given me that scenario, I would’ve asked about the rate of growth in real per-capita income . . . ummm, here’s something . . . It seems that real per-capita disposable personal income increased by 1.1% during 2023. These sorts of numbers depend on what you count (for example, real per-capita GDP increased by 2.3% during that period) and on the time window (real per-capita disposable personal income dropped a lot in 2022 and has gradually increased since then, while the increase in GDP per capita has been more steady).

In any case, economic growth of 1 or 2% is, from the perspective of recent history, neither terrible nor great. Given past data on economic performance and election outcome, I would not be at all surprised to find the election to be close, as can be seen in this graph from Active Statistics:

The other thing is a candidate being 81 years old . . . it’s hard to know what to say about this one. Elections have often featured candidates who have some unprecedented issue that could be a concern to many voters, for example Obama being African-American, Mitt Romney being Mormon, Hillary Clinton being female, Bill Clinton openly having had affairs, George W. Bush being a cheerleader . . . The age issue came up with Reagan; see for example this news article by Lou Cannon from October 1984, which had this line:

Dr. Richard Greulich, scientific director of the National Institute on Aging, said Reagan is in “extraordinarily good physical shape” for his age.

Looking back, this is kind of an amazing quote, partly because it’s hard to imagine an official NIH scientist issuing this sort of statement—nowadays, we’d hear something from the president’s private doctor and there’d be no reason for an outsider to take it seriously—and partly because of how careful Greulich was to talk about “physical shape” and not mental shape, which is relevant given Reagan’s well-known mental deterioration during his second term.

The 2020 and 2024 elections are a new thing in that both candidates are elderly, and, at least as judged by some of their statements and actions, appear to have diminished mental capacity. When considering the age issue last year (in reaction to earlier posts by Campos and Silver), I ended up with this equivocal conclusion:

Comparing Biden and Trump, it’s not clear what to do with the masses of anecdotal data; on the other hand, it doesn’t seem quite right to toss all that out and just go with the relatively weak information from the base rates. I guess this happens a lot in decision problems. You have some highly relevant information that is hard to quantify, along with some weaker, but quantifiable statistics. . . . I find it very difficult to think about this sort of question where the available data are clearly relevant yet have such huge problems with selection.

Both Biden and Trump were subject to primary challenges this year, and the age criticisms didn’t get much traction for either of them. I’m guessing this is because, fairly or not, there was some perception that the age issue had already been litigated in earlier primary election campaigns where Biden and Trump defeated multiple younger alternatives.

Putting this all together, and in response to Nate’s implicit question, if you had told me 10 years ago that the president would seek re-election at 81 (running against an opponent who is 77) despite a supermajority of Americans having concerns about his age, and with inflation hitting 9% in the president’s second year and then rapidly declining to 3.5% but still a concern in the polls, then I’d probably first ask about recent changes in per-capita GDP and income and then say that I would not be surprised if the election were close, nor for that matter would I be surprised if one of the candidates were leading by a few points in the polls.

What about Nate’s other statement: “Trump should drop out! . . . Biden would lose by 7 points, but I agree, the Republican Party and the country would be better served by a different nominee”?

Would replacing Trump by an alternative candidate increase the Republican party’s share of the two-party vote by 3.5 percentage points?

We can’t ever know this one, but there are some ways to think about the question:

– There’s some political science research on the topic. Steven Rosenstone in his classic 1983 book, Forecasting Presidential Elections, estimates that politically moderate nominees do better than those with more extreme views, but with a small effect of around 1 percentage point. When it comes to policy, Trump is pretty much in the center of his party right now, and it seems doubtful that an alternative Republican candidate would be much closer to the center of national politics. A similar analysis goes for Biden. In theory, either Trump or Biden could be replaced by a more centrist candidate who could do better in the election, but that doesn’t seem to be where either party is going right now.

– Trump has some unique negatives. He lost a previous election as an incumbent, he’s just been convicted of a felony, and he’s elderly and speaks incoherently, which is a minus in its own right and also makes it harder for the Republicans to use the age issue against Biden. Would replacing Trump by a younger candidate with less political baggage gain the party 3.5 percentage points of the vote? I’m inclined to think no, again by analogy to other candidate attributes which, on their own, seemed like potential huge negatives but didn’t seem to have such large impacts on the election outcome. Mitt Romney and Hillary Clinton both performed disappointingly, but I don’t think anyone is saying that Romney’s religion and Clinton’s gender cost them 3.5 percentage points of the vote. Once the candidates are set, voters seem to set aside their concerns about the individual candidate.

– Political polarization just keeps increasing, which leads us to expect less cross-party voting and less short-term impact of the candidate on the party’s vote share. If the effect of changing the nominee was on the order of 1 or 2 percentage points a few decades ago, it’s hard to picture the effect being 3.5 percentage points now.

The other thing is that Trump in 2016 and 2020 performed roughly as well as might have been expected given the economic and political conditions at the time. I see no reason to think that a Republican alternative would’ve performed 3.5 percentage points better in either of these elections. It’s just hard to say. Trump is arguably a much weaker candidate in 2024 than he was in 2016 and 2020, given his support for insurrection, felony conviction, and increasing incoherence as a speaker. If you want to say that a different Republican candidate would do 3.5 percentage points better in the two-party vote, I think you’d have to make your argument on those grounds.

P.S. You might ask why, as a political scientist, I’d be responding to arguments from a law professor and a nonacademic pundit/analyst. The short answer is that these arguments are out there, and social media is a big part of the conversation; the analogy twenty or more years ago would’ve been responding to a news article, magazine story, or TV feature. The longer answer is that academia moves more slowly. There must be a lot of relevant political science literature here that I’m not aware of . . . obviously, given that the last time we carefully looked at these issues was in 1993! I can read Campos and Silver and social media, process my thoughts, and post them here, which is approximately a zillion times faster and less effortful than writing an article on the topic for the APSR or whatever. Back in the day I would’ve posted this on the Monkey Cage, and then maybe another political scientist would’ve followed it up with a more informed perspective.

P.P.S. In a followup post, Campos introduces a concept I’d not heard before, the “backup quarterback syndrome”:

Nate Silver has fallen for the backup quarterback syndrome, which is the well-known fact that, on any team that isn’t completely dominating its competition, the backup quarterback tends to be the most popular player, because fans can so easily project their fantasies onto that player, since the starting quarterback’s flaws are viewed in real time, while the backup quarterback can bask in the future glory attributed to him by optimism bias.

I disagree with Campos regarding Nate here: it’s my impression that when Nate expresses strong confidence that a replacement Republican would do much better than Trump, and speculates that a replacement Democrat would do much better than Biden, Nate is not making a positive statement about Ron DeSantis or Gretchen Whitmer or whomever, so much as comparing Trump and Biden to major-party nominees from the past. Nate’s argument in support of the backup quarterback is based on his assessment of the flaws of the current QBs.

That said, I like the phrase, “backup quarterback syndrome.” It does seem like a fallacy. It’s probably been studied (maybe not specifically in the football-fan context) in the heuristics-and-biases literature.

1. Why so many non-econ papers by economists? 2. What’s on the math GRE and what does this have to do with stat Ph.D. programs? 3. How does modern research on combinatorics relate to statistics?

Someone who would prefer to remain anonymous writes:

A lot of the papers I’ve been reading that sound really interesting don’t seem to involve economics per se (e.g., …), but they usually seem to come out of econ (as opposed to statistics) departments. Why is that? Is it a matter of culture? Or just because there are more economists? Or something else?

And here’s the longer version of my question.

I’ve been reading your blog for a couple of years and this post of yours, “Is an Oxford degree worth the parchment it’s printed on?”, from a month ago got me thinking about studying statistics. My background is mainly in engineering (BS CompE/Math, MS EE). Is it possible to get accepted to a good stats program with my background? I know people who have gone into econ with an engineering background, but not statistics. I’ve also been reading some epidemiology papers that are really cool, so statistics seems ideal, since it’s heavily used in both econ and epidemiology, but I wonder if there’s some domain specific knowledge I’d be missing.

I’ve noticed that a lot of programs “strongly recommend” taking the GRE math subject test; is that pretty much required for someone with an unorthodox background? I’d probably have to read a topology and number theory text, and maybe a couple others, to get an acceptable GRE math score, but those don’t seem too relevant to statistics (?). I’ve done that sort of thing before – I read and did all the exercises in a couple of engineering texts when I switched fields within engineering, and I could do it again, but, if given the choice, there are other things I’d rather spend my time on.

Also, I recently ran into my old combinatorics professor, and he mentioned that he knew some people in various math departments who used combinatorics in statistics for things like experimental design. Is that sort of work purely the realm of the math departments, or does that happen in stats departments too? I loved doing combinatorics, and it would be great if I could do something in that area too.

My reply:

1. Here are a few reasons why academic economists do so much work that does not directly involve economics:

a. Economics is a large and growing field in academia, especially if you include business schools. So there are just a lot of economists out there doing work and publishing papers. They will branch out into non-economics topics sometimes.

b. Economics is also pretty open to research on non-academic topics. You don’t always see that in other fields. For example, I’ve been told that in political science, students and young faculty are often advised not to work in policy analysis.

c. Economists learn methodological tools, in particular, time series analysis and observational studies, which are useful in other empirical settings.

d. Economists are plugged in to the news media, so you might be more likely to hear about their work.

2. Here’s the syllabus for the GRE math subject test. I don’t remember any topology or number theory on the exam, but it’s possible they changed the syllabus some time during the past 40 years, also it’s not like my memory is perfect. Topology is cool—everybody should know a little bit of topology, and even though it only very rarely arises directly in statistics, I think the abstractions of topology can help you understand all sorts of things. Number theory, yeah, I think that’s completely useless, although I could see how they’d have it on the test, because being able to answer a GRE math number theory question is probably highly correlated with understanding math more generally.

3. I am not up on the literature for combinatorics for experimental design. I doubt that there’s a lot being done in math departments in this area that has much relevance for applied statistics, but I guess there must be some complicated problems where this comes up. I too think combinatorics is fun. There probably are some interesting connections between combinatorics and statistics which I just haven’t thought about. My quick guess would be that there are connections to probability theory but not much to applied statistics.

P.S. This blog is on a lag, also sometimes we respond to questions from old emails.

Questions and Answers for Applied Statistics and Multilevel Modeling

Last semester, every student taking my course was required to contribute before each class to a shared Google doc by putting in a question about the reading or the homework, or by answering another student’s question. The material in this document helped us guide discussion during the class.

At the end of the semester, the students were required to add one more question, which then I responded to in the document itself.

Here it is!

“‘Pure Craft’ Is a Lie” and other essays by Matthew Salesses

I came across a bunch of online essays—“posts”—by Matthew Salesses, a professor of writing at Columbia:

‘Pure Craft’ Is a Lie

How Do We Teach Revision?

Who’s at the Center of Workshop and Who Should Be?

7 Things I Teach: A Manifesto

Also 22 Revision Prompts, which are so great that I’ll devote a separate post to them.

As a writer and a teacher of creative work (yes, statistical analysis is creative work!), I’m very interested in the above topics, and Salesses has a lot of interesting things to say.

I should warn you that he has a strong political take, and the political perspective is central to his thinking—but I think his advice should be valuable even to readers who don’t share his views on cultural politics. I’d draw the analogy to Tom Wolfe, whose cultural conservatism informs his views on art, views that should be of interest even to people who disagree with him on politics. It’s possible to be a big fan of Tom Wolfe while at the same time thinking it was pitiful for him to take his cultural politics so far as to deny evolution. Anyway, you can think what you want about Salesses’s political views and still appreciate his thoughts on writing and teaching, in the same way as you can still enjoy The Painted Word and From Bauhaus to Our House without having to subscribe to Wolfe’s political views. Pushing a position to its extreme can yield interesting results.

And more!

It seems that Salesses was doing this Pleiades Magazine blog for some period in 2015. Some googling turned up this fun list. Here’s Salesses:

I have been thinking for a while about how our attempts to define craft terms influence our students’ (and our own) aesthetics, and I have wanted to try other definitions. How to define “tone,” for example, seemed especially difficult. Here are some alternate definitions, for now:

Tone: an orientation toward the world

Plot: acceptance or rejection of consequences

Conflict: what gives or takes away the illusion of free will

Character arc: how a character changes or fails to change

Story arc: how the world in which the character lives is changed or fails to be changed

Characterization: what makes the character different from everyone else

Relatability: is it clear how the implied author is presenting the characters

Believability: the differences and similarities between various characters’ expectations

Vulnerability: the real author’s stakes in the implied author

Setting: awareness of the world

Pacing: modulation of breath

Structure: the organization of meaning

I guess that he wrote some other cool posts but I don’t know how many, and I can’t find any link that lets me scroll through them.

A few years after writing those posts, Salesses published a book, Craft in the Real World, which . . . is on sale at the local bookstore. I think I’ll buy it!

According to the publisher’s website, Craft in the Real World is an NPR Best Book of the Year, an Esquire Best Nonfiction Book of the Year, an Electric Literature Best Book of the Year. But I hadn’t heard of it until this recent google, following up on Salesses’s posts. It’s a big world out there, when a book written by a Columbia colleague, on a topic that interests me, and which received multiple awards, was unfamiliar to me. The funny thing is, when I read the above-linked posts, I thought, This guy should write a book! And it turns out he did.

I recommend reading all this along with the advice of writing coach Thomas Basbøll.

He has some questions about a career in sports analytics.

Sometimes I get requests from high school students to answer questions. For example, Noah C. sent this list:

I’m writing with some questions for my final economics project. We pick a career aspiration, write about the job market and experience necessary to work in the industry, and conduct an interview with someone who has done relevant work before. I chose sports analytics as my field of choice, and I know you’ve done some statistics work with a professional sports team in the past.

Can you provide a general overview of the work you did with the team?
What was the workload like? How did it compare to the normal amount of work you need to do as a professor?
How did you begin working with the team in the first place? Did they contact you, or was it from your end?
What was the culture like from the organization’s side? Were they excessively demanding?
What advice do you have for someone interested in studying statistics?
What facet of experience is most valuable for someone looking for a job in sports analytics?
If someone wanted to work in the behind-the-scenes aspect of sports, is it best to start from within the world of the particular sport, or come in from outside?
How does the application of statistics in a social science/political context differ from a sporting context?

And some less serious questions:

Python or R, which is more useful to know?
What is your favorite metric for evaluating players?
Favorite athletes currently?
(My teacher is a big mets fan) Why are the Mets so bad historically? (be nasty please)

Before answering the above questions, let me emphasize that I only know about some small corner of sports analytics. That said, here are my responses:

1. I helped the team fit multilevel Bayesian models for various aspects of game play and player evaluation.

2. A couple of us met weekly for an hour and a half with some people from the team’s analytics group. They were the ones who did almost all the work. I think we were helpful because they could bounce ideas off us, and sometimes we had suggestions, and also we helped them build, test, and debug their code.

3. People from the team contacted me. They had read one of my books and found the methods there to be useful, and they wanted to go further.

4. We agreed ahead of time on a certain number of hours per week. On occasion we’d do a bunch of work outside the scheduled meetings, but that didn’t take up too much time, and, in any case, it was fun.

5. There are lots of ways to learn statistics. Ideally you can do it in the context of working on some application that is of interest to you.

6. I’m not sure what is the most valuable experience if you want to go into sports analytics. If I had to guess, I’d say programming with data, being able to manipulate data, make graphs, extract insights.

7. Some of the sports analytics people I’ve met have a strong sports background; others have strong backgrounds in quantitative analysis and are interested in the sport they are working on. I think it would be hard to do this work if you had little to no interest in the sport.

8. There are some similarities between social-science statistics and sports statistics, also some differences. With sports we typically have a lot more data on individual performers. See this post, Minor-league Stats Predict Major-league Performance, Sarah Palin, and Some Differences Between Baseball and Politics.

9. I use R, which is popular in academic statistics, economics, and political science. In the business world, it’s my impression that Python is more popular. I fit my models in Stan, which can be called from R or Python.

10. I don’t have a favorite metric for evaluating players. If you fit an item-response model (see here, for example), then player abilities are parameters in the model and are estimated from data. So you don’t have a metric (in the sense of some summary that is a combination of an individual player’s stats); you have a model that allows you to simultaneously estimate the relative abilities of all the players. It also makes sense to model players multidimensionally: in the general sense you can break skills into offense and defense, or subdivide them further (different sorts of rushing, receiving, or blocking skills in football; strikeouts, walks, home runs, and performance for balls in play in baseball; different sorts of matchups in basketball; etc.).
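To give a sense of the “model, not metric” idea, here is a toy sketch using a Bradley-Terry model, a simple relative of the item-response models mentioned above. All the players, abilities, and matchup counts below are invented for illustration; this is not the model actually used with any team. Each player’s ability is a latent parameter, and every matchup outcome is a noisy comparison of two of them:

```python
import math
import random

# Toy example: latent abilities generate pairwise matchup outcomes,
# then we recover ability estimates jointly by maximum likelihood.
random.seed(1)
true_ability = {"A": 1.0, "B": 0.3, "C": -0.5, "D": -0.8}
players = list(true_ability)

def logistic(x):
    return 1 / (1 + math.exp(-x))

# Simulate matchups: P(i beats j) = logistic(ability_i - ability_j).
games = []
for _ in range(2000):
    i, j = random.sample(players, 2)
    p_win = logistic(true_ability[i] - true_ability[j])
    games.append((i, j, i if random.random() < p_win else j))

# Fit abilities by gradient ascent on the Bradley-Terry log-likelihood.
est = {p: 0.0 for p in players}
for _ in range(200):
    grad = {p: 0.0 for p in players}
    for i, j, winner in games:
        resid = (1.0 if winner == i else 0.0) - logistic(est[i] - est[j])
        grad[i] += resid
        grad[j] -= resid
    for p in players:
        est[p] += 0.002 * grad[p]
    mean = sum(est.values()) / len(est)          # abilities are only
    est = {p: v - mean for p, v in est.items()}  # identified up to a shift

print({p: round(v, 2) for p, v in sorted(est.items())})
```

No single stat is being summarized here: each player’s estimated ability borrows strength from every game in the dataset, which is the point of fitting a model rather than computing a metric.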

11. I haven’t been following sports too closely recently, so my favorite athlete depends on what I’ve been watching lately. Shohei is amazing, Mbappé and Messi gave us quite a show last summer, Simone Biles can do incredible things, ya gotta love Patrick Mahomes, . . . we could go on and on.

12. Hey, the Mets just won today. You gotta believe! Regarding their history, I recommend Jimmy Breslin’s classic book, Can’t Anybody Here Play This Game? Breslin also wrote a beautiful biography of Damon Runyon—a great read for anyone who’s ever lived in this city.

I strongly doubt that any human has ever typed the phrase, “torment executioners,” on any keyboard—except, of course, in discussions such as this.

Greg Mayer writes:

Still both appalled and amused by the notion of “torment executioners,” I googled the term and found earlier usages. Several of the earlier usages seem to be associated with scientific journals that are unfamiliar to me, and come from publishers like “Allied Academies” and “Hilaris Publishers.” Other early usages are associated with health and self help websites. Many usages show a decidedly non-idiomatic grasp of English.

Here are two examples. In “Examination Finds Better Approaches to Battle the Narcotic Emergency” by S. Uttam in the Journal of Psychology and Cognition, and submitted in January of 2021, there are multiple terms using torment: “torment prescriptions,” “torment executioners,” and “torment hindering properties.” (I did not click further into the site, as the page did not inspire confidence.)

And, from a chiropractic clinic in Chicago from July 2017, there are also multiple uses of torment in addition to “torment executioners.” The following sentence from the clinic site captures the general style of the writing in sites featuring the term “torment executioners”:

The most serious issue with back and neck torment from auto crashes is that because of the horrible idea of the mischance makes a substance irritation pathway end up noticeably actuated that doesn’t end rapidly.

Overall, it looks like “torment” pops up repeatedly when the intended meaning is “pain,” but instead some other word is used, either because of a deliberate attempt to avoid using “pain” (because it appears in some source material?) or because the writer is unfamiliar with idiomatic English.

I replied that I did some google searching too, and it looks to me like all the references are either fake research or examples of internet spam or link farms or whatever they are called right now: machine-generated pages that exist to pop up on a google search. I think what these pages do is to scrape text from random places on the internet and then run it through some synonym program to make the plagiarism less detectable.

I strongly doubt that any human has ever typed the phrase, “torment executioners,” on any keyboard—except, of course, in discussions such as this.

Mayer followed up:

The case of the piece by Uttam in the Journal of Psychology and Cognition may be similar to the UNR case—in an actual journal of sorts, and using imprecise synonyms.

The phrase also came up in a book from 2018 called Anxiety Disorder. It seems to be self-published, but Amazon has an audiobook of it.

Indeed, apparently the internet is awash in machine-produced books. Presumably with chatbots this will only get worse.

Report of average change from an Alzheimer’s drug: I don’t get the criticism here.

Alexander Trevelyan writes:

I was happy to see you take a moment to point out the issues with the cold water study that was making the rounds recently. I write occasionally about what I consider to be a variety of suspect practices in clinical trial reporting, often dealing with deceptive statistical methods/reporting. I’m a physicist and not a statistician myself—I was in a group that had joint meetings with Raghu Parthasarathy’s lab at Oregon—but I’ve been trying to hone my understanding of clinical trial stats recently.

Last week, the Alzheimer’s drug Leqembi (lecanemab) was approved by the FDA, which overall seems fine, but it rekindled some debate about the characterization of the drug causing a “27% slowing in cognitive decline” over placebo; see here. This 27% figure was touted by, for example, the NIH NIA in a statement about the drug’s promise.

So here’s my issue, which I’d love to hear your thoughts on (since this drug is a fairly big deal in Alzheimer’s and has been quite controversial)—the 27% number is a simple percentage difference that was calculated by first finding the change in baseline for the placebo and treatment groups on the CDR-SB test (see first panel of Figure 2 in the NEJM article), then using the final data point for each group to calculate the relative change between placebo and treatment. Does this seem as crazy to you as it does to me?

First, the absolute difference in the target metric was under 3%. Second, calculating a percentage difference on a quantity that we’ve rescaled to start at zero seems a bit… odd? It came to my attention because a smaller outfit—one currently under investigation by about every three-letter federal agency you can name—just released their most recent clinical trial results, which had very small N and no error bars, but a subgroup that they touted hovered around zero and they claimed a “200% difference!” between the placebo and treatment groups (the raw data points were a +0.6 and -0.6 change).

OK, I’ll click through and take a look . . .

My first reaction is that it’s hard to read a scholarly article from an unfamiliar field! Lots of subject-matter concepts that I’m not familiar with, also the format is different from things I usually read, so it’s hard for me to skim through to get to the key points.

But, OK, this isn’t so hard to read, actually. I’m here in the Methods and Results section of the abstract: They had 1800 Alzheimer’s patients, half got treatment and half got placebo, and their outcome is the change in score in “Clinical Dementia Rating–Sum of Boxes (CDR-SB; range, 0 to 18, with higher scores indicating greater impairment).” I hope they adjust for the pre-test score; otherwise they’re throwing away information, but in this case the sample size is so large that this should be no big deal, we should get approximate balance between the two groups.

In any case, here’s the result: “The mean CDR-SB score at baseline was approximately 3.2 in both groups. The adjusted least-squares mean change from baseline at 18 months was 1.21 with lecanemab and 1.66 with placebo.” So both groups got worse. That’s sad but I guess expected. And I guess this is how they got the 27% slowing thing: Average decline in control group was 1.66, average decline in treatment group is 1.21, you take 1 – 1.21/1.66 = 0.27, so a 27% slowing in cognitive decline.
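That back-of-the-envelope check is simple enough to write out directly, using the two group means quoted from the paper:

```python
# Relative "slowing" of cognitive decline: both groups got worse
# (CDR-SB increased), and the headline number compares the mean
# changes from baseline at 18 months.
placebo_change = 1.66    # mean CDR-SB change, placebo group
treatment_change = 1.21  # mean CDR-SB change, lecanemab group

relative_slowing = 1 - treatment_change / placebo_change
print(round(relative_slowing * 100))  # 27 (percent)
```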

Now moving to the statistical analysis section of the main paper: Lots of horrible stuff with significance testing and alpha values, but I can ignore all this. The pattern in the data seems clear. Figure 2 shows time trends for averages. I’d also like to see trajectories for individuals. Overall, though, saying “an average 27% slowing in cognitive decline” seems reasonable enough, given the data they show in the paper.

I sent the above to Trevelyan, who responded:

Interesting, but now I’m worried that maybe I spend too much time on the background and not enough time in making my main concern more clear. I don’t have any issues with the calculation of the percent difference, per se, but rather what it is meant to represent (i.e., the treatment effect). As you noted, and is unfortunately the state of the field, the curves always go down in Alzheimer’s treatment—but that doesn’t have to be the case! The holy grail is something that makes the treatment curve go up! The main thing that set off alarm bells for me is that the “other company” I referenced claims to have observed an improvement with their drug and an associated 200%(!) slowing in cognitive decline. In their case, the placebo got 0.6 points worse and the treatment 0.6 points better, so 200%! But their treatment could’ve gotten 10 points better and the placebo 10 points worse, and that’s also 200%! Or maybe 0.000001 points better versus 0.000001 points worse—again, 200%.

I think my overall concern is, “why are we using a metric that can break in such an obvious way under perfectly reasonable (if currently aspirational) treatment outcomes?”

See here for data from “other company” if you are curious (scroll down to mild subgroup, ugh).

And here’s a graph made by Matthew Schrag, who is an Alzheimer’s researcher and data sleuth, which rescales the change in the metric and shows the absolute change in the CDR-SB test. The inner plot shows the graph from the original paper; the larger plot is rescaled:
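The breakdown Trevelyan describes is easy to demonstrate directly. Using the made-up numbers from his examples, the percent-slowing formula returns the same “200%” no matter how large or small the absolute effect is, as long as the treatment improvement mirrors the placebo decline:

```python
def percent_slowing(placebo_change, treatment_change):
    """Percent 'slowing' of decline relative to placebo, where a
    positive change means the score got worse."""
    return 100 * (1 - treatment_change / placebo_change)

# All three scenarios report "200% slowing," even though the
# absolute effects differ by seven orders of magnitude:
print(percent_slowing(0.6, -0.6))    # 200.0
print(percent_slowing(10.0, -10.0))  # 200.0
print(percent_slowing(1e-6, -1e-6))  # 200.0
```

The formula is fine when both groups decline, but once the treatment arm can cross zero, the percentage no longer tracks the size of the effect.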

My reply: I’m not sure. I get your general point, but if you have a 0-18 score and it increases from 3.2 to 4.8, that seems like a meaningful change, no? They’re not saying they stopped the cognitive decline, just that they slowed it by 27%.

P.S. I talked with someone who works in this area who says that almost everyone in the community is skeptical about the claimed large benefits for lecanemab, and also that there’s general concern that resources spent on this could be better used in direct services. This is not to say the skeptics are necessarily right—I know nothing about all this!—but just to point out that there’s a lot of existing controversy here.

Who is the Stephen Vincent Benet of today?

For some reason the other day I was thinking about summer camp, and in particular the names of some of the campers: Travis Levi, Tony Kiefer, Patrick Amory, Southy Grinalds, Rusty Zorbaugh, . . . I remember very little about the actual kids. Some I liked, some I didn’t. I’m not in touch with any of them. Once on the street several decades ago I saw someone whose face looked familiar, I think my face was familiar to him too, we looked at each other and said hi but then were puzzled and walked away. I think that was Travis Levi but really I have no idea. The names, though, they have an emotional resonance for me. Not because of the people attached to them; it’s more the sound of these names that carries the feeling.

The resonance of these names reminded me of Stephen Vincent Benet’s classic poem, “American Names,” which begins:

I have fallen in love with American names,
The sharp names that never get fat,
The snakeskin-titles of mining-claims,
The plumed war-bonnet of Medicine Hat,
Tucson and Deadwood and Lost Mule Flat.

And ends with the beautiful stanza:

I shall not rest quiet in Montparnasse.
I shall not lie easy at Winchelsea.
You may bury my body in Sussex grass,
You may bury my tongue at Champmédy.
I shall not be there. I shall rise and pass.
Bury my heart at Wounded Knee.

Which made me wonder, who is the Stephen Vincent Benet of today? Back when I would go to used bookstores 40 or 50 years ago, there would often be some dusty hardbound volumes of his poems and stories on the shelves—I guess his books sold a lot of copies in the midcentury period. I think he’d be considered a “middlebrow,” to use the terminology of the time—here’s a good essay on the topic by literary critic Louis Menand. Benet was kinda classy, kinda folksy, took life and literature seriously but with a sense of humor, but lacked some depth; Dwight Macdonald described his book-length poem, John Brown’s Body, as “sometimes solemn, sometimes gay, always straining to put it across, like a night-club violinist.” That’s a problem with book-length poems in general—I’d say the same thing of Vikram Seth’s The Golden Gate, for example. But Seth’s a novelist, not a proclaimer in the mode of Benet.

Here’s a good and unfortunately anonymous mini-biography of Benet, which concludes, “The measure of his achievement, however, is indisputably John Brown’s Body, a poem whose naïveté and conventionality in themes, techniques and viewpoints are raised, by the greatness of its subject and Benét’s devoted craftsmanship, to the level of high folk art.” That seems about right.

I’m thinking that maybe the closest match to Stephen Vincent Benet in recent years is . . . Alice Walker? Successful writer of serious books that are not pulpy but are not quite considered all that as literature, public figure, a conscious representative of America in some way. I’m not sure, but maybe that’s the best fit, recognizing that literature as a whole has a much smaller cultural footprint today than it had a hundred years ago. Another possible match would be Stephen King, who fits into the “folk art” and “Americana” slots but as a massive bestseller has played a different role in our culture.

To what extent is psychology different from other fields regarding fraud and replication problems?

David Budescu read my recent article, How Academic Fraudsters Get Away With It (based on this blog post, in case that first link is paywalled for you), and wrote:

Can’t argue with most of your points and I can’t help but notice that some of them represent potentially testable psychological theories.

The recurrence of these problems in psychology is really painful, especially when some of the people involved are friends and collaborators.

The one point I don’t understand is why people are so eager to highlight the problem in psychology. If you get the daily Retraction Watch email, like I do, or look at their database, it is obvious that the problem is much worse in biomedical research (both in terms of quantity and, probably, potential impact and cost).

I wonder if the obsession with psychology may cause some people to underestimate the magnitude and breadth of the problem. Finally, I am curious how your analysis can explain fraudulent behavior in dentistry, cancer research, etc.

There are two issues here. The first is how the points made in my article, and by others on social media, represent potentially testable psychological theories. I have no idea, but if any psychologists want to look into this one, go for it!

The second issue is to what extent psychology is different from other fields regarding fraud and replication problems. Here are a few things I’ve written on the topic:
Why Does the Replication Crisis Seem Worse in Psychology?
Why Did It Take So Many Decades for the Behavioral Sciences to Develop a Sense of Crisis Around Methodology and Replication?
Biology as a cumulative science, and the relevance of this idea to replication

More on the oldest famous person ever (just considering those who lived to at least 104)

Yesterday the newspaper ran an obituary of Jack Jennings, who was part of the story that inspired The Bridge on the River Kwai:

His family believes that Mr. Jennings was the last survivor of the estimated 85,000 British, Australian and Indian soldiers who were captured when the British colony of Singapore fell to Japanese forces in February 1942. . . .

To build bridges, Mr. Jennings and at least 60,000 P.O.W.s — and thousands more local prisoners — were forced to cut down and debark trees, saw them into half-meter lengths, dig and carry earth to build embankments, and drive piles into the ground.

He died at 104. At first read I thought he was personally responsible for that Kwai story, but it then became clear that he was not himself famous; he was just involved in a famous event. Fair enough. Still worthy of an obituary.

In any case, this made me think about a question we discussed a couple years ago regarding who is the oldest famous person.

In honor of Mr. Jennings, I’ll restrict myself here to people who lived to at least 104. Wikipedia has lists, which I assume are pretty comprehensive, so I just went there.

Brooke Astor lived to 105 and Rose Kennedy lived to 104.

Marjory Stoneman Douglas lived to 108, but I’d only heard of her because of the horrible crime done at the school that was named after her, so I don’t think this quite counts as being famous for herself. On the other hand, given that shooting, she seems to be the most famous person who’s lived to that age.

The guy who directed and produced Pal Joey, directed On the Town, wrote and directed Damn Yankees, and was involved in a bunch of other Broadway classics lived to 107. He’s named George Abbott, and I’d never heard of him before writing this post, but he seems to be legitimately famous. He wrote the book for a Rodgers and Hart show!

And then there’s Olivia de Havilland, who lived to 104 and was Paul Campos’s choice for longest-living famous person (as always, excluding people who are famous because of their longevity), and I continue to hold out for Beverly Cleary, who lived to a slightly older 104.

Oscar Niemeyer lived to 104. He’s famous!

Vance Trimble lived to 107! I’d never heard of Vance Trimble or anything about him—his name just jumped out at me when I was going through one of the lists of centenarians on wikipedia—but I should’ve heard of him. Listen to this: “He won a Pulitzer Prize for national reporting in recognition of his exposé of nepotism and payroll abuse in the U.S. Congress. . . . He was inducted into the Oklahoma Journalism Hall of Fame in 1974.” And he wrote biographies of Sam Walton, Ronald Reagan, Chris Whittle, and other business and political figures. Vance Trimble. If he’d lived in New York or Los Angeles, he’d be famous. Or maybe if he’d lived in New York or Los Angeles, he’d just have been one of many many reporters and not stood out from the crowd. Who knows?

George Seldes lived to 104 as well. He’s not famous and hasn’t been famous for nearly 100 years, but I once read a book he wrote, so I recognized the name. He was a political journalist.

Bel Kaufman, author of Up the Down Staircase, which I’ve never read but have seen on a shelf—it has a memorable title—lived to 103. But we’re not considering 103-year-olds here. This post is limited to 104s and up. If we were covering 103-year-olds, I’d mention that she went to Hunter College and Columbia University! Her actual name was Bella, which for professional reasons was “shortened because Esquire only accepted manuscripts from male authors.” At least, that’s what wikipedia says. Perhaps this particular fact or claim will soon appear uncredited in an article by retired statistics professor Ed Wegman.

Herman Wouk lived to 103 also. He really was famous! He wrote The Caine Mutiny and Marjorie Morningstar, which were made into two iconic films of the 1950s. But, no, we’re not doing any 103’s here, so no more on him.

Jacques Barzun lived to 104. He’s a famous name, used to appear in the New York Times book review and places like that. It’s still hard for me to think of him as famous or important. To me, he just seems like someone who was well connected. Nothing like Olivia de Havilland or Beverly Cleary, who made enduring cultural artifacts, or even George Abbott, who did what it took to make some of those musicals happen. But I’ve heard of Barzun and had a vague idea of what he did, so I guess I’ll have to count him here.

On to wikipedia’s list of centenarians who were engineers, mathematicians, and scientists . . . The only one who lived to at least 104 who resonates at all is Arthur R. von Hippel, listed as “German-American physicist and co-developer of radar.” Co-developer of radar . . . that’s pretty important! If I’m gonna count “one of the 60,000 soldiers who was part of the story that inspired Bridge on the River Kwai” as a famous person, then I’ll have to include “co-developer of radar” for sure. And he lived to 105. He didn’t quite reach the longevity of Vance Trimble, but he also “discovered the ferroelectric and piezoelectric properties of barium titanate.” He’s on the efficient frontier of age and fame, at least by my standards. Much more so than, say, Bernard Holden, “British railway engineer.”

Rush Limbaugh’s grandfather lived to 104. Sorry, but being a relative of a famous person doesn’t make you famous. Sure, I mentioned Rose Kennedy earlier, but she’s different. As the Kennedy matriarch, she was famous in her own right.

Huey Long lived to 105! But of course it was a different Huey Long. This oldster was a jazz singer. He was a member of the Ink Spots for several months in 1945. Sorry, not famous by my standards. Sharing the name of a more-famous person was enough to get my attention but not enough to count.

Hmmm, who else have we got? There’s Edward Fenlon, lived to 105, “American politician, member of the Michigan House of Representatives.” Completely obscure, but he’s from Michigan so maybe he’d be on Paul Campos’s list.

It says on wikipedia that Saint Anthony lived to 105 (born in 251, died in 356). He’s famous! But, hey, what can I say, I have some doubts about his numbers.

So, oldest famous person ever? I’ll have to go with the guy who directed Pal Joey, On the Town, and Damn Yankees and lived to 107.