“Political Prediction and the Wisdom of Crowds”: Evaluating an election forecast over time by comparing to betting odds over time

Rajiv Sethi, Julie Seager, et al. write:

We evaluate the relative forecasting performance of three statistical models and a prediction market for several outcomes decided during the November 2024 elections in the United States: the winner of the presidency, the popular vote, fifteen competitive states in the Electoral College, eleven Senate races, and thirteen House races. We argue that conventional measures of predictive accuracy such as the average daily Brier score reward modeling flaws that result in predictable reversals, as long as such movements are in a direction that is aligned with the eventual outcome. Instead, we adopt a test based on the idea that the strength of a model can be measured by the profitability of a trader who believes its forecasts and bets on the market based on this belief. . . . We find that all models failed to beat the market in the headline contract but some did so convincingly in contracts referencing less visible races.

They continue:

The ability of prediction markets to absorb novel sources of information and respond rapidly to unfolding and unprecedented events is a strength relative to statistical models, which are built and calibrated based on an assumption that the past will remain a good guide to the future. But markets also have weaknesses relative to models, being prone to excess volatility and occasionally vulnerable to price manipulation. The question of whether markets or models are more accurate on average is therefore an empirical one, and cannot be answered based on logical reasoning alone. In this paper, we examine this empirical question using data from three statistical models—FiveThirtyEight [Elliott Morris], the Economist [Dan Rosenheck, Ben Goodrich, Geonhee Han, and me], and Silver Bulletin [Nate Silver]—and the Polymarket exchange, which was the only venue on which contracts for a broad range of electoral outcomes were listed for the entire period from early August until election day on November 5.
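
To make the contrast in that abstract concrete, here is a minimal R sketch of the two ways of scoring a forecast: the average daily Brier score, and the profit of a trader who believes the model and bets against the market price. All the numbers are made up, and the one-contract-per-day trading rule is an assumption chosen for simplicity, not the rule used by Sethi et al.

```r
# Hypothetical daily probabilities that an event happens (made-up numbers).
model_prob   <- c(0.55, 0.60, 0.48, 0.52, 0.58)  # a statistical model
market_price <- c(0.50, 0.53, 0.51, 0.55, 0.57)  # price of a contract paying 1
outcome <- 1                                     # 1 if the event happened, else 0

# Average daily Brier score: mean squared error of the probability forecasts.
brier <- function(p, y) mean((p - y)^2)

# Toy trading rule (an assumption): each day, buy one contract if the model
# thinks the market price is too low, sell one if it thinks the price is too
# high, and do nothing if they agree. Each contract pays 1 if the event happens.
trader_profit <- function(p_model, price, y) {
  position <- sign(p_model - price)  # +1 buy, -1 sell, 0 no trade
  sum(position * (y - price))
}

brier(model_prob, outcome)
brier(market_price, outcome)
trader_profit(model_prob, market_price, outcome)
```

The Brier score judges each day's forecast on its own, whereas the profit measure only rewards the model on days when it disagrees with the market in the right direction.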

I’m pretty sure that if the Economist had run with Ben Goodrich’s ideas when putting together their presidential election forecast (see section A.2 of this paper), we would’ve performed better in Sethi et al.’s evaluation.

This is not to say that anyone but Ben deserves credit for that (hypothetically) better performance; we ultimately made the decision to go with the simpler model. My point here is only the familiar one that, those long juicy time series notwithstanding, ultimately this is only a sample of size 1, first because this is all based on a single national election and second because the outcome of the evaluation can depend so much on a single choice we made during our modeling and implementation process.

The idea of evaluating a forecast by comparing it to market prices is interesting, and it sends my thoughts in two opposite directions:

1. Given that a market exists, it makes sense to evaluate any outside information (in this case, public forecasts) based on what they add in predictive power to the forecast. Richard Clarida explains this idea in chapter 9 of our book, A Quantitative Tour of the Social Sciences.

2. Conversely, market prices are presumably influenced by public forecasts and, beyond that, new polling information shifts the markets and forecasts together. A few days before the election we discussed an aberrant poll from Iowa, which shifted both betting markets and forecasts.

Putting these perspectives together, it could make sense not just to have markets and forecasts compete but to ask where markets will do better and where forecasts will do better.

In general I’d expect markets to do better with one-of-a-kind information and forecasts to do better with numerical data that is part of an ongoing process.

For example, it was not clear in a forecast how to model information from new voter registrations, data from neighbor polls, or perceptions vs. reality of inflation. But these are factors that markets can incorporate in some ways.

Incorporating that Iowa poll, though, is the sort of thing that a forecast can do very well. Bayesian inference and partial pooling (across states and regions, over time, and among poll organizations) do not come naturally to people, but a model-based forecast can just crunch and include that new information easily. It won’t be perfect, but accounting for new polls is in the wheelhouse of our election forecasting models. This suggests that if you’re betting, you might want to go with market odds but then use the shift in the public forecasts to get a sense of how much your predictions should change given this new piece of information.

PhD position at UBC in Temporal Ecology Lab

This post is by Lizzie

My lab has an open position for a PhD student to join the lab. We’re looking for someone bright, motivated and collaborative to study how seed and seedling pathogens influence forest regeneration and diversity. This project would be part of a broader PhD with room to develop your own projects. This project is in close collaboration with the Plant Ecology Group at ETH Zürich, which is led by Professor Janneke Hille Ris Lambers.

If you’re interested, please find more information (including how to apply) on this page.

We’re open to folks from diverse training backgrounds. If your background is more computational and/or mathematical but you’d be excited to spend a few weeks outside and do a little lab work then you could be an ideal candidate; if you’re really interested, please apply no matter how well you think you line up with the ad.

Application review begins 1 July 2025 so apply soon for full consideration.

Approximate posterior recalibration

Tiffany Cai, Philip Greengard, Ben Goodrich, and I write:

Bayesian inference is often implemented using approximations, and resulting posterior uncertainty intervals can then be too narrow, not fully capturing the uncertainty in the model. We address the question of how to adjust these approximate posteriors so that they appropriately capture uncertainty. We introduce two methods that extend simulation-based calibration checking (SBC) to widen approximate posterior uncertainty intervals so as to aim for marginal calibration. We demonstrate these methods in several experimental settings, and we discuss the challenge of calibration using posterior inferences.
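
To give a feel for the general idea of checking and widening intervals by simulation, here is a toy R sketch. It is not either of the methods in the paper: it uses a conjugate normal model where the exact posterior is known, an "approximation" that I deliberately make overconfident by halving the posterior sd, and a single global widening factor chosen so that 90% intervals cover the true parameter in 90% of prior-predictive simulations. All names and settings are assumptions for illustration.

```r
set.seed(1)
n_sims <- 2000
n_obs  <- 10
sigma  <- 1   # known data sd

# 1. Simulate parameters from the prior, theta ~ normal(0, 1), then data.
theta <- rnorm(n_sims, 0, 1)
ybar  <- rnorm(n_sims, theta, sigma / sqrt(n_obs))  # sample mean of n_obs points

# 2. Exact conjugate posterior for each simulated dataset ...
post_var  <- 1 / (1 + n_obs / sigma^2)
post_mean <- post_var * (n_obs / sigma^2) * ybar
# ... and a deliberately overconfident approximation (posterior sd halved).
approx_sd <- 0.5 * sqrt(post_var)

# 3. Standardized errors of the approximation across simulations.
z <- (theta - post_mean) / approx_sd

# Coverage of central intervals after multiplying the approximate sd by `scale`.
coverage <- function(scale, level = 0.9) {
  mean(abs(z) <= scale * qnorm(1 - (1 - level) / 2))
}

# 4. Widening factor that gives nominal 90% coverage over the simulations.
scale_hat <- unname(quantile(abs(z), 0.9)) / qnorm(0.95)

coverage(1)          # coverage of the unadjusted 90% intervals
coverage(scale_hat)  # coverage after widening
scale_hat            # should be near 2, undoing the halved sd
```

The calibration here is marginal, averaging over the prior, which is the sense of calibration that the abstract refers to.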

Also relevant are these papers:

Bayesian score calibration for approximate models, by Joshua Bon, David Warne, David Nott, and Christopher Drovandi

Simulation-based calibration of uncertainty intervals under approximate Bayesian estimation, by Terrance Savitsky and Julie Gershunskaya

Dan Luu and I consider possible reasons for bridge collapse

Dan Luu asks, Why is bridge declining?:

My partner and I [Luu] started playing bridge recently, and people at the local bridge club often comment on how young we are. You’ll find serious competitive players of all ages, but the median age of a casual player is probably in the 60s or 70s. There are a lot of discussions about why this is. The most common reasons I hear are:

1. People who are retired have more time to play games, the reason bridge looks so old is that that’s who has free time.

1.b. Bridge isn’t actually declining, as long as people keep retiring, the population of bridge players isn’t going to decline

2. Kids nowadays don’t have the attention span and play video games instead

3. Bridge takes a really long time to learn

4. Poker has sucked all the air out of the room, at least outside of Asian card games

(1) and (2) seem pretty obviously wrong to me. People my age and younger spend plenty of time playing games, they just choose games that aren’t bridge. As (2) hints at, they often choose video games, but they often choose tabletop games (including other card games) instead. . . . (2) seems easily falsified by the popularity of chess or go, games which take a substantially longer attention span than bridge . . .

I think (3) is a correct reason, but it isn’t sufficient to explain what’s going on. It’s true that bridge has an extremely steep learning curve, much more so than games like chess or go . . . (3) has some explanatory power and is consistent with the card games that are still popular, like poker, not to mention da lao er, da bai fen, tuo la ji, zhao pengyou, etc. (Asian card games which haven’t seen a decline in popularity the way bridge has).

Luu continues:

However, I don’t think (3) is a sufficient explanation. if people really want an easy entry, they could teach spades as an entry into bridge . . . A related comment is that it’s not clear that the initial learning curve to bridge is steeper than for a moderately complex modern board game like Terra Mystica (TM). . . . Modern board games as complex as TM aren’t exactly wildly popular, but they’re not in decline, either. If anything, it’s the opposite . . .

Even if it’s true, (4) is more of a symptom than an explanation.

Then he lands the punch:

There’s one factor that seems obvious to me that I haven’t heard. . . . there’s a stodgy bureaucracy around bridge. . . .

If you look at bidding systems in use today, most casual players in the U.S. use some variation or evolution of a single system. There’s more variety at higher levels, in tournament play, but systems that are considered too strange are either banned or highly discouraged. For example, Forcing Pass systems (where opening with a bid of “pass”, instead of indicating a hand not strong enough to bid, is used to indicate a strong hand) are banned in American (ACBL) tournaments. In tournaments where FP systems aren’t banned, players note that people who run FP are often severely penalized by tournament officials . . . And, because penalties are applied at the discretion of tournament officials, additional penalties can be applied for minor offenses . . .

If you contrast this to younger games like Magic or Netrunner, meta-breaking discoveries are lauded; people are rewarded for thinking about the meta and figuring out how to break it. If you manage to “break” the game in a way that’s not cheating, rules will be updated for the next tournament, but you’ll win the tournament in a way that’s considered to be creative and fair. In Bridge, officials may ban your bidding system for being too innovative in the middle of a tournament . . .

If, in the abstract, you imagine one card game where, decades after someone proposes a new meta, officials still ban it or penalize it as a Highly Unusual Method, and another card game, where people reward players for coming up with creative meta-breaking changes, and then you tell people to guess the relative age distribution of the games (ceteris paribus), I think most people would guess that the more creative game has a younger player base and the more conservative game has an older player base.

You might argue that this bidding system stuff doesn’t matter outside of the highest levels of play, but the cultural impact trickles down to the beginner level. When initially learning bridge, my partner asked “why does the system say to do X?” a couple times, but she quickly learned to not ask that question. . . . If you ask the analogous question when learning Netrunner (why was this deck constructed in this way), people *love* talking about this; part of teaching someone Netrunner is teaching people the reasons behind deckbuilding decisions and helping people learn how to construct better decks.

Much like (3), I don’t think that creativity stifling bureaucracy is a sufficient reason to explain the greying of bridge, but it seems like a plausible contributing factor.

Interesting. But I have another explanation, which I think is much bigger than the (1), (2), (3), and (4) listed by Luu, and that’s just that many pastimes are time-bound in their popularity. Bridge was huge for a roughly fifty-year period during the early part of the twentieth century–for example, Wikipedia says, “The number of people who play contract bridge has declined since its peak in the 1940s, when a survey found it was played in 44% of US households.” At that point, there’s nowhere to go but down.

Put it another way. Why don’t they name anyone Susan anymore? See the above graph. At the peak of its popularity, the name Susan was given to more than 1% of babies born in this country. Then Susan became less popular. That makes sense; there are trends. Then Susan became a baby-boomer name, kind of old-fashioned. Now Susan is an old-lady name, and parents aren’t picking it anymore, partly because it sounds like an old-lady name and partly because people choose from a wider variety of names, as illustrated by this graph from chapter 2 of Regression and Other Stories:

The point is, there’s no need for a Susan-specific explanation of the decline in popularity of “Susan.” It’s just the ebb and flow of popularity, along with a feedback mechanism by which, once a name gets less popular and becomes associated with a specific period in the past, that can drive the popularity down even further. This is something we’ve particularly seen with girls’ names over the years. “Susan” could mount a comeback (as has “Sophie” but not “Bertha,” to consider two names that were popular a couple generations before “Susan”), and there’s something to be said about how these trends can vary–but I think the baseline explanation is just the age thing.

Similarly with bridge, a game that was absolutely huge a hundred years ago, then gradually declined in popularity with competition with other amusements (TV is the usual culprit that’s given for the decline of everything else), then, like “Susan,” it got this generational tinge, so it’s less appealing to the kids–it’s an old-people’s game.

But what about chess and go? They’re a lot older than bridge and are going as strong as ever, or at least they haven’t cratered among the young in the way that bridge has. I could give a couple of answers: chess and go have special properties, they’ve stood the test of time in a way that other games haven’t. I’m not claiming that a popular board game will necessarily go into sharp decline or be associated with old people (similarly with names); I’m just saying we shouldn’t be surprised when this happens.

The funny thing is, a few weeks ago at a party I was chatting with someone who had been the world champion at one of these modern tabletop card games–it might have even been Netrunner! Whatever game it was, I guess it’s orders of magnitude less popular than bridge, but it did sound like it was most popular among young adults. We’ll see how things look in fifty years.

P.S. If you google *Dan Luu bridge* you’ll also come across this fun post, where Luu writes:

You don’t have to be at a party to see this phenomenon in action, but there’s a curious thing I regularly see at parties in social circles where people value intelligence and cleverness without similarly valuing on-the-ground knowledge or intellectual rigor. People often discuss the standard trendy topics (some recent ones I’ve observed at multiple parties are how to build a competitor to Google search and how to solve the problem of high transit construction costs) and explain why people working in the field today are doing it wrong and then explain how they would do it instead. . . .

Asking people why they think their solutions would solve valuable problems in the field has become a hobby of mine when I’m at parties where this kind of superficial pseudo-technical discussion dominates the party. What I’ve found when I’ve asked for details is that, in areas where I have some knowledge, people generally don’t know what sub-problems need to be solved to solve the problem they’re trying to address, making their solution hopeless. . . .

Since I often attend parties with programmers, this means I often hear programmers retelling their cocktail-party level understanding of another field . . . An example I enjoyed was this Twitter thread where Hillel Wayne discussed how programmers without knowledge of trad engineering often have incorrect ideas about what trad engineering is like . . . Hillel compared the perceptions of people who’d actually worked in multiple fields to pop-programmer perceptions of trad engineering. One of the many examples of this that Hillel gives is when people talk about bridge building, where he notes that programmers say things like

The predictability of a true engineer’s world is an enviable thing. But ours is a world always in flux, where the laws of physics change weekly. If we did not quickly adapt to the unforeseen, the only foreseeable event would be our own destruction.

and

No one thinks about moving the starting or ending point of the bridge midway through construction.

But Hillel interviewed a civil engineer who said that they had to move a bridge! Of course, civil engineers don’t move bridges as frequently as programmers deal with changes in software but, if you talk to actual, working, civil engineers, many civil engineers frequently deal with changing requirements after a job has started that’s not fundamentally different from what programmers have to deal with at their jobs.

Luu continues with this amusing example:

A line I often hear from programmers is that programming is like “having to build a plane while it’s flying”, implicitly making the case that programming is harder than designing and building a plane since people who design and build planes can do so before the plane is flying. But, of course, someone who designs airplanes could just as easily say “gosh, my job would be very easy if I could build planes with 4 9s of uptime and my plane were allowed to crash and kill all of the passengers for 1 minute every week”.
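
As a quick check on the arithmetic in that last line (this is my own back-of-the-envelope calculation, not Luu's): one minute of downtime per week does work out to roughly four nines of uptime.

```r
minutes_per_week <- 7 * 24 * 60        # 10080 minutes in a week
uptime <- 1 - 1 / minutes_per_week     # fraction of time "up" with 1 minute down

uptime                                 # just above 0.9999
uptime >= 0.9999                       # TRUE: at least four nines
```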

All the little decisions we have to make when developing public-facing software

Jonah, Philip, Gustavo, and I are putting together an R package, caliBISG, implementing our calibrated Bayesian improved surname geocoding algorithm. The background to the awkward name is that BISG already existed, and in our recent paper we added a calibration step.
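
For readers who haven't seen BISG before, the uncalibrated version is just Bayes' rule: combine the distribution of race given surname with the distribution of location given race, assuming surname and location are independent conditional on race. Here is a minimal base-R sketch with made-up numbers. The function name and inputs are hypothetical; this is not the caliBISG package code, and it does not include our calibration step.

```r
# Made-up inputs for one surname and one county (illustration only).
# P(race | surname), e.g. from a census surname table:
p_race_given_surname <- c(api = 0.45, black = 0.15, hispanic = 0.02,
                          white = 0.35, other = 0.03)
# P(county | race): share of each group's total population living in the county:
p_geo_given_race <- c(api = 0.004, black = 0.001, hispanic = 0.001,
                      white = 0.002, other = 0.002)

# Bayes' rule: P(race | surname, county) is proportional to
# P(race | surname) * P(county | race).
bisg_posterior <- function(p_race_given_surname, p_geo_given_race) {
  stopifnot(identical(names(p_race_given_surname), names(p_geo_given_race)))
  unnorm <- p_race_given_surname * p_geo_given_race
  unnorm / sum(unnorm)
}

round(bisg_posterior(p_race_given_surname, p_geo_given_race), 2)
```

In this toy example the county has a relatively large share of the api group, so the posterior probability for that group ends up above the surname-only probability of 0.45.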

Setting up the R package required many steps (none of which were done by me). Below are some excerpts from the email thread.

Gustavo:

Testing the package–

I think the last name ‘Lee’ is a good illustrative example, so I ran it for Spokane county (2% Asian) and King county (where Seattle is; almost 20% Asian).

> print_comparison_tables(race_probabilities(c("Lee", "Lee"), c("wa", "wa"), c("spokane", "king")))
Surname: Lee
State: WA
County: Spokane
Year: 2020

Race   Pr_calibisg  Pr_bisg
----------------------------------------
API    0.11         0.27
White  0.76         0.62

Surname: Lee
State: WA
County: King
Year: 2020

Race   Pr_calibisg  Pr_bisg
----------------------------------------
API    0.52         0.75
White  0.35         0.18
----------------------------------------

Here you already see the huge difference in estimates that caliBISG offers relative to BISG–caliBISG’s estimates are much closer to the 45% Asian proportion of Lees in the U.S. census in 2010. BISG’s 75% number is huge even if King county has a large Asian population. The package was incredibly easy to use to do this.

Comments:

– Would be very useful to be able to aggregate counties or states. For example, if I want to compare probabilities or predict race in the greater Dallas area, I would want to aggregate across 11 counties all within Texas. If I wanted to do the same for the Philadelphia metro area, I would want to aggregate across counties in NJ, DE, and PA.

– Not much else. I used all of the functions in the package with ease and the documentation was clear.

– Slightly getting ahead of myself, but if we can pass the output to the functions in this package or write our own, it wouldn’t take that much to put a paper about racially polarized voting together.

– Can we add a function for seeing all of the names together? I know the dataframe would be enormous, but seeing just the top 100 white, Black, Hispanic, etc. names would be useful in some contexts.

Minor comments:

– I had to use a personal access token to download the package—I’m guessing because RStudio isn’t connected to my github and the repo is private.

– Slightly more straightforward to rename print_comparison_tables to just print (since you can mask the function and use the compare_bisg class)

– I went ahead and downloaded a third state, but probabilities couldn’t be returned for it (Florida)–I’m guessing that’s just not set up on the back end yet though.

All in all, this is looking great. Thanks for all the work behind this Jonah and Philip!

Jonah:

Below are some responses to your comments and questions.

Would be very useful to be able to aggregate counties or states. For example, if I want to compare probabilities or predict race in the greater Dallas area, I would want to aggregate across 11 counties all within Texas. If I wanted to do the same for the Philadelphia metro area, I would want to aggregate across counties in NJ, DE, and PA.

Yeah I could imagine that being very useful. Can you say more about this? Imagine you’re talking to someone who thinks more like a software developer than a social science researcher (which is basically true of me, although working with Andrew I’ve been involved in various social science projects). How are you imagining the package would let the user specify something like this and how would you expect the aggregation to be done internally? Would we need to weight the counties according to population size? Or surname frequency? Or are you imagining something different? Would the user specify a list of states or counties to aggregate?

Can we add a function for seeing all of the names together? I know the dataframe would be enormous, but seeing just the top 100 white, Black, Hispanic, etc. names would be useful in some contexts.

Do you mean the top names in a county? State? Overall? I don’t think the data I have has enough information to provide this. The files I have don’t include a number for how many people in each state or county have a certain name. But maybe Philip does have data on this that I don’t have?

Slightly more straightforward to rename print_comparison_tables to just print (since you can mask the function and use the compare_bisg class)

Yeah I thought about this. I think we would need to set a default maximum number of tables to print instead of defaulting to printing one per row. Otherwise if someone used race_probabilities() with really long input vectors it could end up printing hundreds or thousands of tables. We could let the user specify something like options(calibisg_max_print = some_positive_integer) to change the default, which is similar to what base R does with its max.print option. How does that sound? We could also provide an argument to the print method to override the default. And what would you think is a good default number of tables/rows to print?

I went ahead and got rid of print_comparison_tables in favor of just defining a print method for the compare_bisg class. I set the default to print tables for a maximum of 5 rows but this can be changed either via print(max_print = …) or by setting options(calibisg.max_print = …), which will change the default for an entire R session. I’m not sure if 5 is the right default, just needed something to use for now. What do you suggest for a default?

I also added something similar for the number of digits to print. The default is 2 but can be changed either via print(digits = …) or options(calibisg.digits = …).

The GitHub repo should now be updated with these changes.

I went ahead and downloaded a third state, but probabilities couldn’t be returned for it (Florida)–I’m guessing that’s just not set up on the back end yet though.

It wasn’t set up for this yet, but I worked on it yesterday and I think it should be working now if you want to reinstall and try again. It should also now work to request states that we don’t have caliBISG for at all and it will still give you traditional BISG estimates. That should work for all but six states where I’m missing some information I need to calculate BISG, but Philip is working on getting me that info. So for now regular BISG should work for 44 states and caliBISG should work for 7 states. And for regular BISG you don’t need to download any large files. The BISG calculations are done by the package on the fly using smaller data files that are small enough to include with the package when you install it.

Thanks for your advice and feedback!
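
The print-limit pattern Jonah describes (a default that can be overridden per call or for the whole session via options()) can be sketched in base R as follows. The class name demo_tables and the option names demo.max_print and demo.digits are hypothetical stand-ins, not the package's actual internals.

```r
# S3 print method whose defaults come from session-wide options,
# falling back to 5 rows and 2 digits if the options are unset.
print.demo_tables <- function(x, ...,
                              max_print = getOption("demo.max_print", 5),
                              digits = getOption("demo.digits", 2)) {
  n_show <- min(nrow(x), max_print)
  for (i in seq_len(n_show)) {
    cat(sprintf("row %d: %s\n", i,
                formatC(x$prob[i], format = "f", digits = digits)))
  }
  if (nrow(x) > n_show) {
    cat(sprintf("... %d more row(s) not shown\n", nrow(x) - n_show))
  }
  invisible(x)
}

# A small object of the hypothetical class, with made-up probabilities.
x <- structure(data.frame(prob = c(0.11, 0.76, 0.52, 0.35, 0.27, 0.62, 0.75, 0.18)),
               class = c("demo_tables", "data.frame"))

print(x)                          # uses the default of 5 rows
print(x, max_print = 3)           # override for a single call

old <- options(demo.max_print = 2)  # override for the rest of the session
print(x)
options(old)                        # restore the previous setting
```

Because the defaults are evaluated when print() is called, changing the option takes effect immediately without redefining anything.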

Gustavo:

This is our idea: If more than one state, separate inference for each state—all the counties in all of the states. Suppose 3 states. Package would need to do caliBISG for each state separately–then we would weight by county level population size to get the metro area.

Regarding the top names in a county, I actually can’t think of a practical use for this feature, so let’s drop this one.

Regarding printing of the tables, the way it looks now (with the updated package) looks good to me. Adding a max print makes sense, too, but as a default let’s say four? Past four tables you need to scroll on most displays and that gets confusing.
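
The metro-area aggregation Gustavo proposes (county-level probabilities combined using county population as weights) could be sketched like this in base R. The function name, county names, and all numbers are hypothetical; this is not something the package currently does.

```r
# Population-weighted average of county-level race probabilities.
# probs: matrix with one row per county and one column per race category.
# pop:   county populations, in the same order as the rows of probs.
aggregate_counties <- function(probs, pop) {
  stopifnot(nrow(probs) == length(pop), all(pop > 0))
  w <- pop / sum(pop)     # weights sum to 1
  colSums(probs * w)      # w[i] multiplies row i of probs
}

# Made-up probabilities for one surname in three counties of a metro area.
county_probs <- rbind(
  county_a = c(api = 0.10, white = 0.80, other = 0.10),
  county_b = c(api = 0.50, white = 0.40, other = 0.10),
  county_c = c(api = 0.30, white = 0.60, other = 0.10)
)
county_pop <- c(county_a = 500000, county_b = 2000000, county_c = 250000)

round(aggregate_counties(county_probs, county_pop), 3)
```

Since each county's probabilities would be computed separately before this step, the counties can come from different states without changing the aggregation.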

Jonah:

Changing the default to print 4 tables sounds good. I’ll do that now. Were you also able to check that using more states works? I think you said you previously tried Florida before I had enabled that, but that should now be working. It should also now work to request estimates for a state we don’t have caliBISG for and you should still get BISG (except for 6 states that Philip is still getting the necessary data for).

For example, this should work for BISG even though we don’t have caliBISG for Maryland yet:

> most_probable_race("Smith", "MD", "Allegany")

   name year state   county calibisg_race bisg_race in_census
1 smith 2020    MD allegany          <NA>  white_nh        NA

Warning message:
caliBISG is not available for 1 input(s). Returning NA estimates for those cases.

Gustavo:

I tested MD as you suggested, but also a bunch of other states. Unless I happened to guess the exact states that we don’t have the data for yet, there are still a bunch that the package doesn’t return BISG predictions for.

> most_probable_race("Smith", "MD", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    MD allegany          <NA>  white_nh        NA
Warning message:
caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
> most_probable_race("Smith", "SD", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    SD allegany          <NA>      <NA>        NA
Warning messages:
1: caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
2: Traditional BISG is not available for 1 input(s). Returning NA estimates for those cases.
> most_probable_race("Smith", "ND", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    ND allegany          <NA>      <NA>        NA
Warning messages:
1: caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
2: Traditional BISG is not available for 1 input(s). Returning NA estimates for those cases.
> most_probable_race("Smith", "SC", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    SC allegany          <NA>      <NA>        NA
Warning messages:
1: caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
2: Traditional BISG is not available for 1 input(s). Returning NA estimates for those cases.
> most_probable_race("Smith", "NC", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    NC allegany          <NA>      <NA>        NA
Warning messages:
1: caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
2: Traditional BISG is not available for 1 input(s). Returning NA estimates for those cases.

Jonah:

Thanks for checking. I think the issue here is that there’s no Allegany county in the states you tried except Maryland. Currently at the least the state and county have to exist in order to be able to compute BISG using the data Philip sent me (Philip is that correct?). The surname doesn’t necessarily have to exist (we have an “all other names” distribution).

Gustavo:

Oh, yes, duh.

BISG estimate comes up here:

> most_probable_race("Smith", "SD", "Buffalo")
   name year state  county calibisg_race bisg_race in_census
1 smith 2020    SD buffalo          <NA>      aian        NA

Jonah:

Ok great, glad that works. One other update about hosting the large caliBISG files. We had talked about hosting them on Harvard Dataverse since GitHub has stricter file size limits. However, I’ve been researching this a bit more, and if we don’t track the files in the GitHub repository but rather only attach them to GitHub releases of the package, it seems like there are no file size limitations. This would be nice because it means we could have everything in the same place rather than host the code on GitHub and the files elsewhere. I’m going to test this out by creating an unadvertised release so I can see if I can upload large files and download them with the package.

Me:

sounds good; thanks!

(As you can see, I’ve been a very valuable contributor to this process.)

Jonah:

A few quick updates and questions:

– I’ve got the GitHub downloads working. See below for instructions for trying it out.

– I think we’re basically ready for more people to try out the package. Do you think we should make the repository public for that or wait until we’ve gotten more feedback? If the latter, we can give specific people access to the private repo.

– We seem to not have full county names for Florida. Apparently the Florida voter file uses abbreviated county names, whereas other states have full names. Philip is working on getting the full names for Florida from the census files.

– I adapted the previous demo into a package vignette. See attached HTML version.

– To try out this newest version of the package use the code below. The GitHub personal access token (PAT) is required because it’s still a private repository.

install.packages(c("pak", "gitcreds"))

# will prompt you to enter your GITHUB PAT
gitcreds::gitcreds_set()

# you may need to restart your R session before
# trying to install the package if it fails to
# detect your PAT
pak::pak("jgabry/caliBISG")

# download all 7 caliBISG files from GitHub
# first deleting any old versions of the files
library(caliBISG)
delete_all_data()
download_data()

# check that the files were downloaded
available_data()

# use the functions like before
most_probable_race("Smith", "WA", "King")
race_probabilities("Smith", "WA", "King")

# I added a function to list the valid county names
valid_counties("WA")

Following up on my previous email, we now have full county names for Florida (thanks Philip!).

I just tagged a new test release (v0.0.2) and uploaded the new versions of the files (they’re all the same except for Florida).

1. Are you all ok with making the repository public now even while we’re still gathering feedback? Or would you prefer to wait until we’re fully ready for a release? Everything is easier if the repo is public (installing, downloading the data) but I can manage giving out access selectively if you’d rather keep it private for now.

2. I’ve added all of you as coauthors of the R package, but let me know if you’d rather not be listed. It’s common to have two citations in a case like this, one for the R package itself and one for the paper it’s based on. So when the user does citation(“caliBISG”) in R I’ll have it give both citations. Does that sound OK?

Gustavo:

Some feedback from testers. One was particularly helpful. I’ll send them over as I hear back:

Everything worked smoothly. Neat package.

I installed everything with no hiccups or issues. I played around a bit and the package is great — very intuitive and results look reasonable for the cases I tried.

In case they’re useful, three quick thoughts on usability:

· FIPS support: I wonder if it’s worth supporting FIPS codes as alternatives to character for counties. That’ll help you avoid weird matching issues (e.g. looks like Saint Lawrence County in NY has to be formatted as “st.lawrence” or else you get NAs returned). FIPS codes might be a nice alternative for people who don’t want to worry about that cleaning.

· Internal auto-replication: One minor thing: if you’re trying to get estimates for a bunch of names from the same county, it felt a bit clunky to have to replicate the state and county vector. For example, I think the way to check four surnames from the same county is to do something like:

most_probable_race(c("Smith", "Simko", "Novoa", "Gelman"),
                   rep("WA", 4),
                   rep("King", 4))

You need to replicate the state / county, or else you get NAs returned. I totally see why it works like that, to ensure the names / locations are all the same thing. But, I wonder if it’s worth implementing an edge case for length > 1 names, but length == 1 county and state and automatically replicate them internally (and maybe print a message that you did). To me, this looks much cleaner:

most_probable_race(c("Smith", "Simko", "Novoa", "Gelman"),
                   "WA",
                   "King")
# Message: generating BISG predictions for four surnames in King County, WA.

And that would let you directly insert a column from some other data frame, e.g:

most_probable_race(df$surnames,
                   "WA",
                   "King")

The counterargument is you could already just do that with data frame columns for state and county, so it’s not really critical. Just a small suggestion.

Pivot option? I wonder if it’s worth having a quick binary argument in the two main prediction functions for a long data frame output. Some people might prefer it that way if they want to do some later grouping based on particular racial groups or estimates (e.g. all estimates > some fixed value). It would automate something like:

library(tidyr)  # provides pivot_longer() and starts_with()
example_df <- most_probable_race(c("Smith", "Simko", "Novoa", "Gelman"), "WA", "King")
example_df |>
  pivot_longer(cols = starts_with("bisg_") | starts_with("calibisg_"),
               names_to = c("method", "race"),
               names_pattern = "^(bisg|calibisg)_(.+)$",
               values_to = "prob")

Good luck with the package! This is an awesome project. Any sense of when the other states will be added? I can’t wait to use it in my own work.

okay, a couple of notes. First, you guys need a tutorial. Just a basic tutorial showing how it works. It needs to be front and center in the github. Second:
> download_data()
* Downloading, reading, and saving file for: FL, 2020
Error: Failed to fetch release info: HTTP 404
available_data() returns 0?
character(0)
download_data(c("VT", "WA"), 2020)
* Downloading, reading, and saving file for: VT, 2020
Error: Failed to fetch release info: HTTP 404
looking at the example returns an http 404 error?
I wonder if this is because the package points to the github which isn’t open yet?
I also downloaded and compiled from source

Jonah:

This is great feedback, thanks! A few comments on the various suggestions:

– I think using FIPS codes in addition to names is a great idea. It’s definitely confusing that in some places the county names have appropriate spaces between words or after periods but other times they don’t. I guess this is just how they came out of the voter files? So providing FIPS as an alternative would be great. Philip, I guess we just need a dictionary to map between FIPS and county name for each state.

– I will definitely update to allow state and county to be length 1 so the user can provide a bunch of names for the same state and county more easily. I had thought about this at one point and forgot. I just opened an issue in the repository so I won’t forget.

– I have mixed thoughts on whether we should provide the pivot functionality ourselves since it’s pretty easy for people to do on their own. I’m open to it, just probably not urgent. I’ll open an issue in the repository so we remember to decide on this.

– We already have a tutorial, it’s just not front and center in the GitHub repository yet, it’s a package vignette. I forgot that installing from GitHub doesn’t automatically install the package vignettes. I’ll add some tutorial examples to the readme on the GitHub home page and add a note about how to get the vignettes when installing from GitHub.

– I think the person who got a 404 error while downloading the data doesn’t have a GitHub PAT set up. They said they downloaded the repository and built the package from source themselves, which doesn’t require a PAT (installing via pak() or install_github() does require one but not building it yourself). But running download_data() does require one. I just went ahead (one minute ago) and made the repo public, so I think it should now work without the PAT.

Here’s a link to the issue tracker for the package if you want to follow along when I complete things: https://github.com/jgabry/caliBISG/issues

Jonah again:

A few follow ups to my last email:

1. I’ve now already added the functionality for providing a single state and county with multiple different names. So this now works if you freshly install from GitHub:

most_probable_race(c("Lopez", "Jackson"), "WA", "King")

2. I also updated the readme on the GitHub landing page for the package to show how to download and access the tutorial vignette. I also added a very simple example to the readme itself.

3. Philip you asked previously what we would do about citations when we also have a JSS paper. I’ve seen cases where people replace the R package citation with the JSS citation and other cases when they ask people to cite everything. When someone calls citation(“caliBISG”) we can put a note indicating how we prefer they cite our work. For example, we could say to always cite your original paper about the caliBISG method and to cite the JSS paper if they used the implementation in the R package. Or we could ask them to cite all three (JRSS, JSS, R package). We can figure out what we prefer when we actually have the JSS paper.

4. Regarding the county names, I definitely still think FIPS is a good idea, but while we’re waiting on that would you all prefer if I made sure names with a period in them like the one mentioned in the feedback always have spaces (e.g. convert st.lawrence to st. lawrence)? Or would you prefer that we leave the names the way they are, which I guess is directly from the voter file or census? It would be easy for me to write a short script that makes sure there’s always a space after a period before I include the files with the package.

And I have a question about how we want to handle the FIPS codes. Here are a few options for how to let the user provide FIPS codes:

1. Add an argument fips that can be specified instead of county (using 3 or 5 digit FIPS codes):
most_probable_race(name, state, fips = FIPS)

2. Add an argument fips that can be specified instead of both state and county (using 5 digit FIPS codes):
most_probable_race(name, fips = FIPS)

3. Keep the user interface how it currently is and provide a function that converts between FIPS and county:
most_probable_race(name, state, county = fips_to_county(FIPS))

I’m open to any of these options (or a different one if you have a suggestion), but I do have a slight preference for the third option because it keeps the function signature the cleanest. It avoids the situation where we have a fips argument as well as the county and state arguments but only a subset of those arguments can be specified at a given time. In that situation we either need to error if they specify them at the same time or throw a warning and document which argument will take precedence. It’s not a huge deal to do that obviously, but in general I find APIs like that annoying and it seems cleaner to just have what we currently have but provide a way for the user to easily convert between FIPS and county.

Do any of you have a strong preference for one of these options or a different one?

Philip:

I prefer the third option and agree with what you wrote. The fips_to_county(FIPS) function would also have to take a state as input, right? Something like fips_to_county(state, FIPS)? Or alternatively a list of states fips_to_county(states, FIPS)?

An additional option would be to include FIPS county code in the output.

Jonah:

Yeah we could do it with a `state` argument if we ask for 3-digit FIPS codes. There are also 5-digit codes that include state info. The data you gave me has both, although in separate variables (I can combine them to create the 5-digit codes, which seem to be pretty widely used).

Gustavo:

Ditto on preferring option 3.

Yes would need to provide state as well.

For what it’s worth, there are functions from packages on CRAN that already do this. For example, censable::recode_fips_abb() converts state abbreviation to FIPS.

Jonah:

Glad you guys also prefer option 3. In terms of the function to convert between fips and county, do you have a preference between these two options?

1) fips_to_county(fips = "001", state = "NY")
2) fips_to_county(fips = "36001")

Both of these options specify Albany county in NY. I guess we could support both, but if one seems preferable it’s simpler to just go with that option.

Philip:

I’ll defer to Gustavo on this one.

Gustavo:

Let’s do fips_to_county(fips = "36001")
And just throw an error saying “fips should be a 5 digit code”, so it’s obvious what went wrong if people input something else?

Jonah:

So we’ll go with fips_to_county(fips = "36001") with an informative error message if they don’t provide a 5-digit code. I’ll go ahead and implement that.

If you reinstall from GitHub you should now be able to use the fips_to_county() function. So, for example, these two calls to most_probable_race() should return the same output:

most_probable_race(
  name = "Chan",
  state = "NY",
  county = "Albany"
)
most_probable_race(
  name = "Chan",
  state = "NY",
  county = fips_to_county("36001")
)

If you provide an invalid FIPS code there are several different errors you could get, depending on if you provide the wrong number of digits or if the code is the right length but doesn’t correspond to any real county. For example:

> fips_to_county("123")
Error: `fips` must be a character vector of 5-digit FIPS codes.

> fips_to_county("12345")
Error: The following FIPS codes could not be converted: 12345

Me:

Hi all. I have nothing to add to this discussion . . . but could I blog it? This sort of realistic discussion about coding is not something that’s usually taught in school!

And they all said yes, so here we are.

I doubt any of you are interested in all the details above, but there’s something to be said for sharing this whole long exchange, just to get a sense of what it takes to build this sort of software package in a way that will be useful to people.
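
For readers who want to see what the two main interface decisions in the thread might look like in code, here is a minimal sketch in R. It is not the caliBISG source: the helper names recycle_location() and fips_to_county_sketch() are hypothetical, the lookup table is a one-row toy, and the returned county spelling ("Albany") is an assumption. Only the FIPS code 36001 for Albany County, NY and the wording of the two error messages come from the emails above.

```r
# Illustrative sketch only; these helpers are hypothetical and are NOT the
# caliBISG package's internals.

# Decision 1: allow a length-1 state/county to be recycled across many names.
recycle_location <- function(name, state, county) {
  n <- length(name)
  recycle_one <- function(x, label) {
    if (length(x) == 1L) {
      rep(x, n)                      # single value: repeat it for every name
    } else if (length(x) == n) {
      x                              # already one value per name
    } else {
      stop("`", label, "` must have length 1 or the same length as `name`.",
           call. = FALSE)
    }
  }
  data.frame(
    name = name,
    state = recycle_one(state, "state"),
    county = recycle_one(county, "county"),
    stringsAsFactors = FALSE
  )
}

# Decision 2: convert 5-digit FIPS codes to county names, with informative errors.
# Toy lookup table (assumption): a real one would cover every county.
toy_fips_table <- c("36001" = "Albany")

fips_to_county_sketch <- function(fips) {
  # Reject anything that is not a character vector of exactly five digits.
  if (!is.character(fips) || !all(grepl("^[0-9]{5}$", fips))) {
    stop("`fips` must be a character vector of 5-digit FIPS codes.",
         call. = FALSE)
  }
  county <- unname(toy_fips_table[fips])
  # Well-formed codes that are not in the table come back as NA.
  if (anyNA(county)) {
    stop("The following FIPS codes could not be converted: ",
         paste(fips[is.na(county)], collapse = ", "),
         call. = FALSE)
  }
  county
}

inputs <- recycle_location(c("Smith", "Simko", "Novoa", "Gelman"), "WA", "King")
albany <- fips_to_county_sketch("36001")
```

Keeping the conversion in a separate helper, as in the third option discussed above, means the main prediction functions never have to decide which of fips, state, and county takes precedence.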

Unfair to Galton

In my post the other day about Monsters, I wrote about “scientists who held political views that you might now call odious, such as Francis Galton’s racism (which, like Laura Ingalls Wilder’s views, were close to the core of his statistical work) or J. B. S. Haldane’s communism (which seems more peripheral to his contributions to biology, although I expect that Haldane himself saw some connections there).”

A colleague who knows more about Galton than I do argued that the core of Galton’s statistical work had nothing to do with eugenics, even by Galton’s definition. It had been my impression that even when Galton was writing about heights of siblings or whatever, eugenics was not far from his concerns, but according to my colleague, Galton’s work on correlation was originally motivated by trying to understand the paradoxes of evolution (not “eugenics,” particularly) and “when he had his breakthrough he saw that it was not heredity at all – it was a general statistical phenomenon.” My colleague continued, “Galton’s 1st book can be called eugenic – it said talent runs in families. But it had no effect. Why? Because everybody ‘knew’ that. He got no response and he moved on – he didn’t change his views but he moved on to other questions with no eugenical side.”

From a modern perspective, I can’t see how you can avoid labeling Galton as racist. For example, he charmingly wrote:

Visitors to Ireland after the potato famine generally remarked that the Irish type of face seemed to have become more prognathous–that is more like the Negro in the protrusion of the lower jaw. The interpretation of that which was that the men who survived the Starvation and other deadly accidents of that horrible time were generally of low and coarse organization.

Talk about adding insult to injury! Starve a country and then say that the survivors have been selected to be “low and coarse.”

Also this:

Average negroes possess too little intellect, self-reliance, and self-control to make it possible for them to sustain the burden of any respectable form of civilization without a large measure of external guidance and support.

And:

The Hindoo cannot fulfil the required conditions nearly as well as the Chinaman, for he is inferior to him in strength, industry, aptitude for saving, business habits, and prolific power. The Arab is little more than an eater up of other men’s produce; he is a destroyer rather than a creator, and he is unprolific.

Lots of people back in the 1800s wrote like that, and I don’t know enough about that period of history to assess the ways in which Galton was more or less racist than other educated Englishmen of his time. Perhaps one thing that disturbs me about Galton is that this was not just casual racism, just some guy in the pub making Irish jokes, but rather his carefully-thought-out views. But this is different from my colleague’s point about Galton’s statistical work, which is that it moved away from concerns about evolution and heredity and toward more general mathematical understanding of regression and correlation. It should be possible for me to be bothered by Galton’s racial views and to be bothered by their political implications, and to be interested in the connection between racist attitudes and the history of statistics, while also recognizing the development of ideas of correlation and regression on their own terms.

Also, to get back to the racism, the notorious Galton quotations above are from a lot of writing that he produced during his life. These remarks might well represent something close to his max level of racism rather than the mean or median. And such beliefs did not necessarily get in the way of his scientific investigations. For example, my colleague informed me, “When Galton first looked at fingerprints he also looked for a possible difference between African and English prints – were the first group less evolved (simpler) than the second? He could see no such difference. He once wrote (in 1863), ‘Exercising the right of occasional suppression and slight modification, it is truly absurd to see how plastic a limited number of observations become, in the hands of men with preconceived ideas.’ Evidently Galton had no such preconceived idea.” In some sense his rejection of a racial idea in this case was even more impressive, if he indeed came into the analysis expecting to see such a pattern.

Again, no need to single out Galton; he gets more of our attention because of the importance of his contributions to statistics. Casual racial thinking comes up all the time in the history of quantitative social and biological sciences. For example, in his 1957 book, Probability, Statistics, and Truth, Richard von Mises attempted to explain an underdispersion in the monthly rates of girl and boy births as being caused by different sex ratios among different racial or socio-economic groups; see pages 84-85 of this article. Mises does not present a rigorous argument, and if you try to look at it carefully, the math breaks down; my point is just that, when confronted with an apparently unexpected pattern in social data, he reached for a racial explanation. The question is not whether he was a “racist” in whatever terms might be used–according to Wikipedia, he left Germany after the Nazis took power–just that racial thinking was in the air.

As we’ve discussed in other contexts, the point of this discussion is not to characterize Galton as “good” or “bad” but rather to better understand his statistical and his racial views in context. History is important.

What does “Neuromancer” have to teach us about the role of AI in society?

This post is by Phil Price, not Andrew.

From junior high through about sophomore year in college I read a lot of science fiction, went to some science fiction conventions, etc., but then I drifted away from the genre for eight or nine years. What brought me back was “Neuromancer”, by William Gibson. It had come out in 1984 when I was in college but I guess I had already stopped reading science fiction by then, or else I somehow missed that specific book, so I didn’t get around to reading it until about 1992. The book is generally credited as starting the “cyberpunk” sub-genre, of which Neal Stephenson’s “Snow Crash” is another great example, although there are many progenitors with similar DNA; indeed I’m not sure why the movie Blade Runner isn’t given the credit (or perhaps something even earlier).

I haven’t read Neuromancer in about thirty years, but came across it while browsing a bookstore and thought eh, why not give it another read, the development of artificial intelligence is a major theme and maybe it’ll be interesting in that context, not just for entertainment.

There are some minor spoilers below, nothing that I think would interfere with one’s enjoyment of the book but if you are especially picky about this kind of thing then you might want to stop reading now.

In Neuromancer we encounter two types of artificial intelligence: artificial _general_ intelligence, as personified (machinified?) by AI’s known as Wintermute and Neuromancer; and ‘constructs’ such as the “Dixie Flatline construct”, which, we are told, is not truly intelligent but merely seems intelligent. The Dixie Flatline construct is “just a bunch of ROM” that answers questions the way a guy called “Dixie Flatline” would answer them himself. But then it turns out it’s not just about answering questions, Dixie Flatline can also hack into computer systems, pretty much like the person on whom it is based. And it can’t _just_ be ROM because it can remember things that you tell it.

I recall being somewhat puzzled by the distinction between the real AI’s and the “construct”, back when I read the book, since the construct sure _seems_ intelligent. But now that distinction seems entirely reasonable: the construct behaves very much like an LLM like ChatGPT, which…well, I know there are people who think that as LLM’s get more sophisticated they are going to turn into artificial general intelligences, but I don’t think that’s the case. Artificial general intelligence is possible, and an LLM might even be a key component to attaining it, but I don’t think any LLM, no matter how grand, will be enough on its own. My younger self was puzzled by the distinction between the construct and a “real” AI, but now it makes perfect sense to me! Just think of the construct as an LLM that is trained to respond like a specific real person.

Another somewhat-realistic-seeming element of the book is that there’s an organization, colloquially called the “Turing Cops”, that is tasked with preventing AI’s from becoming too powerful. There’s a fear that if an AI becomes powerful enough it could destroy humanity, or at least do terrible things. There’s a lot of current discussion about whether or how AI’s should be regulated, although at least for now I don’t think that discussion is focused on the capabilities so much as who can use them and how, whereas in the book the Turing Cops only care how smart the AI’s get.

So…what about the title of this post, what does the book have to teach us about the role of AI in society? Nothing. Or at least, nothing I can think of. It’s a work of fiction written forty years ago by someone who, by his own admission, knew nothing about the technologies he was writing about. Nowadays we might say he was “vibe-writing” or something. There’s lots of nutty stuff and some plot holes.

I guess I’ll mention one more thing that is purely on the literary side. The main character of the book, a hacker/cracker named Case, is objectively a horrible person, as is his girlfriend Molly. They lie, cheat, steal, kill, get involved in a scheme that kills dozens of innocent people and show no remorse about it, etc. But I spent the whole book rooting for them! The story is told mostly from Case’s point of view, and I kind of adopted his view of the world. He’s not without a sense of emotion or a sense of morality, but while reading I found that I liked the people he liked, disliked the people he disliked, was appalled by the things he found appalling but unbothered by the things that didn’t bother him. I don’t really have a point here, I just find the phenomenon interesting.

This post is by Phil.

Better and worse ways to mix human and LLM responses in behavioral research (but you still have to figure what you’re measuring)

This is Jessica. At a recent workshop on LLMs for social science (and then in comments on my last post) I learned of this new paper by Broska, Howes, and van Loon which advocates mixed subjects design experiments: experiments that combine human subjects data with LLM simulation data. The basic idea is to gather enough human subjects data to learn the correlation between the LLM responses and human responses, then to combine responses across the two sources into an aggregate estimate. This estimate depends on solving an optimization problem to tune the magnitude of an adjustment to the human subjects estimate that will account for bias in the LLM responses and minimize the variance of the resulting estimator. You end up with an estimated effect that is at least as precise as you would get from just using the human responses. 

For example, the estimator for a population mean looks like this: 

sample mean of human responses – lambda(sample mean of LLM responses on instances with human responses – sample mean of LLM responses on instances without human responses)

where lambda is found by taking the least squares coefficient from regressing the human responses on the LLM predictions for those inputs (i.e., the covariance of human labels and model estimates divided by the variance in the LLM’s predictions on those inputs), and multiplying it by a shrinkage factor N/(n+N), where N is the number of instances for which we only have LLM predictions and n is the number of instances for which we have human labeled data. As N gets much larger than n, this factor approaches 1 and there is less shrinkage; if N is smaller than n, the factor is at most ½, bringing the final estimate closer to the human subjects estimate.

Note that all of the inputs (i.e., the Xs) for which we have gathered human and LLM responses have to be drawn from the same distribution. This way any bias in LLM predictions cancels out on average on the right hand side of the above equation, and the estimator is centered at the same parameter we’re trying to estimate with the human responses. When the sample sizes of the labeled (human) data and unlabeled data for which we only have LLM predictions are both large, the central limit theorem applies and can be used to get confidence intervals. Things get a bit more complicated when your target is a regression coefficient, but the same basic idea holds.
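
To make the mean estimator concrete, here is a minimal sketch in R on simulated data. Everything about the simulation is an assumption made for illustration (the sample sizes, the biased and noisy simulate_llm() function, the noise levels), and mixed_mean() is a hypothetical helper name, not code from the Broska et al. paper. It only implements the formula as described above: the human sample mean, minus lambda times the difference in LLM means between the labeled and unlabeled instances.

```r
set.seed(1)

# Simulated data; all values here are invented for illustration.
n <- 200    # instances with both human and LLM responses
N <- 5000   # instances with LLM responses only

# Hypothetical LLM: a biased, noisy function of the underlying quantity.
simulate_llm <- function(truth) 0.5 + 0.8 * truth + rnorm(length(truth), sd = 0.5)

# Labeled and unlabeled inputs are drawn from the same distribution.
truth_labeled <- rnorm(n, mean = 3)
human         <- truth_labeled + rnorm(n, sd = 0.5)
llm_labeled   <- simulate_llm(truth_labeled)
llm_unlabeled <- simulate_llm(rnorm(N, mean = 3))

mixed_mean <- function(human, llm_labeled, llm_unlabeled) {
  n <- length(human)
  N <- length(llm_unlabeled)
  # Least squares slope from regressing human responses on LLM predictions.
  slope <- cov(human, llm_labeled) / var(llm_labeled)
  # Shrink the slope by N / (n + N) to get lambda.
  lambda <- slope * N / (n + N)
  # Human mean, adjusted by the gap between labeled and unlabeled LLM means.
  estimate <- mean(human) - lambda * (mean(llm_labeled) - mean(llm_unlabeled))
  list(estimate = estimate, lambda = lambda)
}

fit <- mixed_mean(human, llm_labeled, llm_unlabeled)
human_only <- mean(human)
```

Comparing fit$estimate with human_only across many simulated datasets (varying n, N, and the noise in simulate_llm()) is one way to see how much precision the LLM responses buy, and how that depends on the estimated correlation.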

This approach is using prediction powered inference; for more background on PPI, see the original paper or this more recent paper that connects it to surrogate outcomes. The Broska et al. paper goes further by showing how you can use the estimated correlation between human and LLM responses to determine the effective sample size for your mixed design study: the equivalent number of human subjects you’d need to get an estimator as precise as what you get from the mixed responses. They also show how to use it to do power analysis aimed at finding the optimal (under some budget) mix of human and LLM samples given a desired power level and effective sample size. 

Is this the answer for integrating LLMs in behavioral studies? 

My thoughts on the mixed design experiment are that A) it’s the most reasonable approach I’ve seen in the LLMs for social science literature for integrating LLM simulations into confirmatory-style experiments, certainly much better than using the LLM data without adjustment after doing some spot checks, but B) it leans hard on being able to get good point estimates of the relationship between LLM and human data, and overall there’s little reason to believe we’re going to see marked improvements in the reliability of treatment effects estimated in behavioral studies due to having access to the cheaper faster LLM data source.

Regarding A, I’d much rather see people use this approach than try to argue that their LLM simulation results are close enough to human subject results to be substitutes based on a purported lack of evidence to assume otherwise. E.g., this approach is much better than doing some null hypothesis significance tests of LLM versus human results and deciding the LLM responses are valid substitutes when you see no significant differences. 

Regarding B, ultimately this approach depends on getting a good estimate of the correlation between the human and LLM responses. We end up in the familiar territory of trying to do power analysis from potentially noisy pilot samples. If your estimate of the correlation is noisy, because you have limited human data (since the point after all is to not have to gather too much human data!) then you’re going to end up misleading yourself by trusting that the estimate you get out of this approach is unbiased. If it’s not, for example when the sample size of the human data is in the range that many people in behavioral fields are accustomed to publishing on, then how likely are the researchers to want to mess with LLMs to get more precise estimates, given the other considerations that come into play, like how to best adapt the experimental protocol for LLMs? The paper acknowledges the potential noisiness of the correlation estimate used to tune lambda, but leaves it to the reader to decide if it’s worth propagating uncertainty there to the sample size calculations they propose. 

Another challenge–which is not a weakness of the proposed mixed subjects method, but of the status quo approach to behavioral studies in many fields–is that often the human sample is not a random sample, and so no one really knows what population it represents. Sometimes authors will try to do posthoc weighting in estimation to mimic a nationally representative sample, but it’s not the norm; instead many empirical study papers give only summaries based on convenience samples but tout them as estimating some general effect for people at large. So when we lean on this estimate of correlation between the LLM and human responses under the guise that we will arrive at an unbiased, maximally precise estimate of the true effect we are not necessarily any closer to being able to interpret what we’ve learned. 

Again, none of this is to say that PPI isn’t a reasonable method to draw on when researchers are aware of the risks (and the PPI estimates can likely be further improved, e.g., through first recalibrating the LLM for the specific population, as mentioned in the second PPI paper above). But it’s just a technical tool and doesn’t really help us address deeper epistemological questions of what we hope to learn by mixing human and LLM responses.

The replication crisis and the failure of theory within social psychology

John “not Indiana Jones” Williams writes, “I thought you would be interested in and a bit annoyed by this review from Science of a book on one aspect of the replication crisis. There is no mention of significance tests.”

Williams is pointing to a review by historian of science Elizabeth Lunbeck of a book by Ruth Leys about the discredited social-priming work (see here for some background) of John Bargh and others.

We’re already several levels deep in abstraction here:

1. The actual psychological phenomenon of social priming, such as it is.

2. The experimental techniques used by Bargh et al. to study that phenomenon.

3. The journal articles used to promote the idea. These articles cannot simply be considered as instantiations of the experiments: it takes a lot of effort and intellectual “technology,” as it were, to transform ambiguous or equivocal data into convincing and confident claims of empirical evidence and theoretical coherence.

4. The structures of the science and news media establishments that led to the claims in the journal articles being widely disseminated and believed.

5. The backlash within psychology and elsewhere against social priming, including failed replication experiments, statistical theory exploring how it all could’ve gone so wrong (that’s the “significance tests” thing to which Williams was referring), and news media reports of the debunking. I was involved in some of this myself.

6. Leys’s book, which tells some aspects of the above stories.

7. Lunbeck’s review of Leys’s book, which is what I’ll be writing about.

8. The reaction by Williams and me to Lunbeck’s review.

9. This post, which injects this all into social media discussion, so that you and others can react to my reaction to Lunbeck’s review of Leys’s book on the backlash to the promotion within science and the news media of journal articles that provided (unintentionally) misleading summaries of experiments designed to investigate a dubious psychological theory.

It’s an inverted pyramid balanced on a pinpoint.

I haven’t read Leys’s book, so my comments here are on Lunbeck’s review. She tells the story well, focusing on the psychology theories, not on the statistics or the academic politics.

I appreciate that. There’s lots to say about the statistical problems of unreplicated or unreplicable research–indeed, I’ve had a lot to say about the topic myself!–but that’s all just a means to an end. And, as I and others have argued, the fundamental problem with a lot of this bad research is not the bad statistics but rather the bad substantive theory, along with bad connections between theory and data. The bad statistics enables the bad science to appear successful; it does not in itself make the science bad.

To put it another way: bad science analyzed using good statistics does not produce good science; at best, it just makes it harder to be fooled by noise. Conversely, good science can proceed just fine using bad statistics. Good statistics should make good science more efficient and effective, but it’s usually not necessary.

So I think it’s appropriate that Lunbeck (and, I assume, Leys in the book being reviewed) does not dwell on the statistical errors that led to decades of overconfidence; rather, she focuses on the problems with the social priming theory and experiments:

Bargh marshaled this striking finding to support his claim that “automaticity,” not free will or intentionality, powerfully governs behavior. . . . Automatic responses–quick, efficient, intuitive–were just as powerful in shaping behavior as were more cognitively complex and considered ones, the theory went. . . . Critics charged that the original experimenters had ignored a long tradition of research and theorizing focused on so-called demand characteristics, the motivations at work in both subjects and researchers in the setting of the psychological experiment. . . . [Leys] is particularly focused on psychology’s long history of downplaying intentionality in human behavior. . . . priming researchers were repeatedly snared by conceptual and theoretical traps of their own devising. For instance, they eventually posited that “moderators,” such as desires to affiliate or gender, influence individuals’ responses to primes, but this undermined the generalizability of their experimental results.

Well put.

Just one more thing regarding “the generalizability of their experimental results.” Those published results are consistent with null effects. This was the point of the classic Simmons, Nelson, and Simonsohn (2011) methods paper. I think Lunbeck is aware of this–she refers to “confirmation bias and a ubiquity of tautological statements dressed up as theory.” I just wouldn’t want readers of the review to be left with the impression that those experimental results could be interpreted as claimed, even locally.

To put it another way: Mind the gap between the direct experimental results (the entire raw data along with precise descriptions of the conditions of the experiment) and how these results are presented in the published journal articles.

And, as we discussed yesterday, mind the gap between the extravagant claims in the title and abstract of a published paper, and the actual measurements that were conducted.

“Scientific poetic license?” What do you call it when someone is lying but they’re doing it in such a socially-acceptable way that nobody ever calls them on it?

When talking about the problems with science discourse, I often use this example: A 3-day study is called “long term,” and nobody even seems to notice the problem.

And sometimes I use this example: “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” even though it had no evidence of anyone actually becoming more powerful.

Here’s my question. What do we call this sort of thing? I want to say “lying,” but that doesn’t seem quite right.

Why not call it “lying”? To lie is to knowingly tell an untruth, and the authors of these statements undoubtedly knew that they were false: the authors of the first paper linked above knew that 3 days is not long term, and the authors of the second paper knew that their study had no data on power. They were knowingly telling untruths in the title of one paper and the abstract of the other.

But it doesn’t quite feel like lying, at least not to me. Why do I say this? Because I think of a lie, not just from the perspective of states of knowledge but also in terms of the motivations of the liar. People lie to hide things, or to avoid being caught, or to make themselves look better than they really are . . . a lie is a form of cheating. And, just as a lie is not just an untruth but a known untruth, similarly, I think of the act of lying as having a certain intentionality.

Bob Carpenter’s a linguist, so I asked him what we should call this behavior, if not “lying.” His first suggestion was “bullshitting,” which is when people just say stuff without any attempt to be truthful. But that didn’t seem right either: I don’t have the feeling that the authors of those two papers were bullshitting, exactly. I think the authors of those papers thought they were following the rules; indeed, they were following the rules of academic writing. It’s a form of ritualized insincerity, where you write things that sound impressive (“long term,” “instantly become more powerful”) even when these impressive-sounding things are not supported by, or are even contradicted by, available evidence. Bob then suggested the term “exaggeration,” which is closer to the mark than “lying” or “bullshitting” but doesn’t quite nail it either.

So I’m still stuck on this one. You might say that none of this matters: we know what’s going on so why get hung up on naming it. But I think that naming things is important; indeed, we have an entire lexicon devoted to this endeavor.

Why do I think naming is important? I think that when we can give something a good name, it helps us understand the problem better and even point toward a solution. Also, once something’s named, it can be easier to spot it in the wild. That’s the situation with a lot of fallacies, I think.

P.S. Raphael in comments suggests the term “reckless disregard for the truth.” That sounds about right!

Survey Statistics: Kish’s (and Meng’s) design effect ?

Leslie Kish was a foundational figure in Survey Statistics. Unlike some other foundational figures in statistics, Kish was not interested in eugenics. Kish had better ideas, like “Anything a human being can eat, I can eat.” And considering the effect of a survey design on its error: the design effect.

See the post introducing this blog series, where the comment section already dives into this topic, thanks to Raphael Nishimura !

Suppose we want to estimate a population mean (e.g. Republican vote). We can use a sample mean to estimate it. Then the error (sample mean vs population mean) depends on how we choose the sample. What if survey-takers are more or less Republican than the population ? A simple random sample (SRS), where every sample has the same probability, is the usual benchmark. Its sample mean is unbiased, but there is still variance.

Surveys designed to sample within groups (“strata”) can improve representativeness and reduce variance. In contrast, surveys designed to sample entire groups (“clusters”) can increase variance relative to a sample of the same size that is more mixed across groups.
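The contrast between strata and clusters can be seen in a small simulation. This is only a sketch under made-up assumptions: the population (100 equal-sized groups with very different support rates), the sample size of 500, and the function names are all hypothetical, chosen so that the three designs use the same number of respondents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100 equal-sized groups whose support rates differ a lot.
n_groups, group_size = 100, 100
group_rate = rng.uniform(0.05, 0.95, size=n_groups)
y = (rng.random((n_groups, group_size)) < group_rate[:, None]).astype(float)

def srs_mean(n=500):
    """Simple random sample of n people from the whole population."""
    idx = rng.choice(y.size, size=n, replace=False)
    return y.ravel()[idx].mean()

def stratified_mean(per_group=5):
    """Sample per_group people at random within every group (stratum)."""
    cols = rng.random((n_groups, group_size)).argsort(axis=1)[:, :per_group]
    return np.take_along_axis(y, cols, axis=1).mean()

def cluster_mean(n_clusters=5):
    """Sample n_clusters whole groups and include everyone in them."""
    groups = rng.choice(n_groups, size=n_clusters, replace=False)
    return y[groups].mean()

# Variance of the sample mean across repeated draws, all with 500 respondents.
reps = 4000
var_srs = np.var([srs_mean() for _ in range(reps)])
var_strat = np.var([stratified_mean() for _ in range(reps)])
var_cluster = np.var([cluster_mean() for _ in range(reps)])

print(f"SRS: {var_srs:.6f}  stratified: {var_strat:.6f}  cluster: {var_cluster:.6f}")
```

In this toy setup the groups differ a lot from each other, which is exactly the case where stratifying helps and clustering hurts; with groups that all look alike, the three variances would be much closer.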

Kish introduced the design effect in his 1965 book Survey Sampling.


Xiao-Li Meng, another statistician with marvelous ideas (and a lot of patience for my emails), derived a general formula for Kish’s design effect in his 2018 “Statistical Paradises and Paradoxes”. Let’s first review Meng 2018’s notation:

Meng 2018 shows that the error relative to a simple random sample increases with population size (N – 1) and correlation between inclusion in the sample (R) and the quantity of interest (G). For example, if people who take surveys are also more likely to support a political candidate, expect more survey error.
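In symbols, here is a sketch of that result as I understand it. The subscript conventions are my reading of Meng's notation, so treat them as assumptions: j indexes the N people in the population, G_j is the quantity of interest, R_j is 1 if person j is in the sample and 0 otherwise, n is the sample size, and f = n/N.

```latex
% Error of the sample mean, written as a product of three terms
\bar{G}_n - \bar{G}_N
  = \rho_{R,G} \times \sqrt{\frac{1-f}{f}} \times \sigma_G

% Design effect: mean squared error relative to the SRS variance
\mathrm{Deff}
  = \frac{E_R\!\left[(\bar{G}_n - \bar{G}_N)^2\right]}{\mathrm{Var}_{\mathrm{SRS}}(\bar{G}_n)}
  = (N-1)\, D_I ,
\qquad D_I = E_R\!\left[\rho_{R,G}^2\right]
```

Here \rho_{R,G} and \sigma_G are the finite-population correlation and standard deviation (dividing by N), and the SRS variance is (1-f)/n times S_G^2 = N\sigma_G^2/(N-1). Plugging these in, the factors of f and n cancel and N - 1 is what remains.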

Sharon Lohr asks readers to derive the above in the 3rd edition of her Sampling book as exercise 17 in Chapter 15, if you’d like a homework assignment.

Sampling Design and Analysis: Third Edition — Sharon Lohr
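For those who would rather check than derive: the error identity from Meng 2018 (error = correlation × sqrt((1 − f)/f) × population standard deviation) is exact for any realized sample, so it can be verified numerically. The population size, support rate, and response probabilities below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: G = 1 if the person supports the candidate.
N = 100_000
G = rng.binomial(1, 0.45, size=N).astype(float)

# Hypothetical response mechanism: supporters are slightly more likely to respond.
p_respond = np.where(G == 1, 0.012, 0.010)
R = (rng.random(N) < p_respond).astype(float)

n = R.sum()
f = n / N

# Left side: actual error of the sample mean.
error = G[R == 1].mean() - G.mean()

# Right side: correlation x sqrt((1 - f) / f) x population sd (ddof=0).
rho = np.corrcoef(R, G)[0, 1]
sigma_G = G.std()
identity = rho * np.sqrt((1 - f) / f) * sigma_G

print(f"n = {int(n)}, rho = {rho:.4f}, error = {error:.4f}, identity = {identity:.4f}")
```

The two sides agree up to floating-point error because the identity is algebra, not an approximation. What the simulation adds is a feel for scale: with f this small, even a tiny correlation between responding and supporting gets multiplied by a large factor.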

Meng 2018 asks: can survey weights offer a statistical paradise ? Survey weights describe how the survey sample can be scaled up to the population. One way to think about them is how many people in the population each person in the sample represents.

(Paradise = reducing correlation between sample inclusion and quantities of interest, isn’t that what we dream of ?)

Well, Gelman 2007 begins with “Survey weighting is a mess.” Kish 1992 also describes weighting as “ad hoc” and “hidden in appendices”. But before we despair, let’s look at Meng’s design effect with weights:
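A sketch of the weighted version, again with the caveat that the notation is my reconstruction: W_j is the weight for person j, \tilde{R}_j = R_j W_j, and CV_W is the coefficient of variation of the weights among the sampled units. The first line follows from the same algebra as the unweighted identity; the factorization into A_W in the second line is an assumption about what A_W denotes.

```latex
% Error of the weighted sample mean
\bar{G}_W - \bar{G}_N
  = \rho_{\tilde{R},G} \times \sqrt{\frac{1 - f + CV_W^2}{f}} \times \sigma_G

% The same thing, separating out the extra factor due to unequal weights
\bar{G}_W - \bar{G}_N
  = \rho_{\tilde{R},G} \times A_W \times \sqrt{\frac{1-f}{f}} \times \sigma_G ,
\qquad A_W = \sqrt{1 + \frac{CV_W^2}{1-f}}
```

Under that reading, A_W^2 = 1 + CV_W^2/(1-f), which is close to 1 + CV_W^2 when the sampling fraction f is small.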

A_W in the above is almost Kish’s Deff for the increase in variance due to haphazard weights. From Kish’s 1965 book:

Kish’s derivation seems “model-based” (see e.g. Little 2004), assuming y are independent with variance S^2. If you’d like another homework assignment, Sharon Lohr asks readers to derive the above in the 3rd edition of her Sampling book as exercise 17 in Chapter 7.
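For reference, the standard form of Kish's weighting design effect is n Σw² / (Σw)², which equals 1 + CV², where CV is the coefficient of variation of the weights. Under the model-based assumption (independent y with common variance S²), the weighted mean has variance S² Σw² / (Σw)², and dividing by the unweighted S²/n gives the ratio. Here is a minimal sketch; the function names and the example weights are hypothetical.

```python
import numpy as np

def kish_deff(weights):
    """Kish's design effect due to unequal weights: n * sum(w^2) / sum(w)^2."""
    w = np.asarray(weights, dtype=float)
    return w.size * np.sum(w**2) / np.sum(w)**2

def effective_sample_size(weights):
    """Sample size divided by the design effect: sum(w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return np.sum(w)**2 / np.sum(w**2)

# Hypothetical weights: the last respondent "represents" four times as many people
# as the first.
weights = np.array([1.0, 1.0, 2.0, 4.0])

deff = kish_deff(weights)
cv2 = weights.var() / weights.mean()**2  # squared coefficient of variation (ddof=0)

print(deff, 1 + cv2, effective_sample_size(weights))
```

Equal weights give a design effect of exactly 1; the more the weights vary, the fewer "effective" respondents are left.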

But I’d like to connect Kish’s Deff from haphazard weights to Meng’s discussion. This might be a more difficult homework assignment. If W is independent of G and R (haphazard weights), is the data defect the same as for unweighted (i.e. tilde{D}_I = D_I in Meng’s notation) ? Dashing hope that the benefit of weights outweighs (pun !) the increase in Deff ?

Big Fiction, Dan Sinykin, and George V. Higgins

After reading Dan Sinykin’s article on close reading the other day, I checked out his 2023 book, “Big Fiction: How Conglomeration Changed the Publishing Industry and American Literature.” It was fascinating, and now I want to track down the books on the topic that he refers to in the footnotes.

I’ve read and enjoyed some books by John Sutherland on the history of literature and publishing in the twentieth century, as well as lots of biographies of authors and books about magazines and newspapers that touch on various aspects of book publishing, and I’ll have to say that Sinykin takes it to the next level. Not that “Big Fiction” is better than those earlier books, exactly, just that he has what seems to me to be a more comprehensive perspective, covering both the business and literary angles. He also had lots of good stories about the authors, editors, and agents involved in the literary-fiction publishing business, and these stories had just the right level of detail; even the bits that might seem kind of gossipy gave insight. For example on page 83 there’s a letter from publisher Bennett Cerf that Sinykin accurately describes as “bizarre” and “outlandish” in its sexism, which reminded me that as a kid we had this book, Bennett Cerf’s Book of Riddles–it had a drawing of a big red rock eater on the cover (the answer to the riddle, “What’s big, red and eats rocks?”), and now I kinda want to have washed my hands after touching it (Cerf’s book, that is, not Sinykin’s). The point is that this is not just a goofy and slightly disturbing story about a now-forgotten mid-twentieth-century middlebrow celebrity, it also gives some sense of the sorts of Asimov-like behavior that were routinely tolerated back then.

Unrelatedly, one thing I like about Sinykin’s book is that it’s “literary” as well as “sociological.” That is, he talks a lot about the business of books, but he also discusses the literary quality of the books. For example, he discusses the novels of Danielle Steel with respect but without avoiding a consideration of their flaws. The book is mostly sociology–but Sinykin’s book has enough of the literary perspective that it seemed clear that he has some interest in these novels for their own sake.

One thing that surprised me about this book is that it nowhere refers to Gordon Hutner’s book, “What America Read: Taste, Class, and the Novel, 1920-1960” (briefly mentioned in this post from earlier this year, Revolutionary Road and That Darned Chatbot). The two books have a lot of overlap, not in time frame–Sinykin’s story begins when Hutner’s ends–but in the way that they place the literary fiction that’s remembered today within a larger framework including genre and non-genre fiction that was published during the same era.

It’s not that Sinykin is unaware of Hutner–in the preface, he remarks that he collaborated with the older author–; I guess he just decided that his book had enough references as it is, without getting into research on the pre-1960 period. Still, I’d like to hear Sinykin’s take on how the two books fit together.

George V. Higgins

One of the themes of Big Fiction is the shift of how novels are published. Up until 1980, new books would be published either in hardcover or pocket-sized paperback forms. Genre fiction–mysteries, science fiction, westerns, etc.–would be published only in paperback, while other fiction–literary work by John Updike or Saul Bellow or whatever but also popular fiction by James Michener, John Le Carré, etc.–would first be published in hardback and then go to paperback a year later. That was just how they did it. But in the 80s various publishers started going with large-format paperbacks (called “trade” paperbacks, a terminology that has always confused me but has something to do with differentiation within the business of book distribution), and then around twenty years later they began to pretty much retire the pocket-sized paperback format, which bums me out because I like to have a book in my pocket. The good news is that there’s about 50 years worth of paperbacks out there, so as long as I don’t want to read something that’s been written in the present century, I can often find it in pocket format. The funny thing is, they still print these pocket-sized paperbacks in France. Just not here, for some reason. Even in France, though, there does seem to be some move to the larger format.

Anyway, George V. Higgins–see here for background. Higgins had a bit of a hard-luck story in publishing. He had the fortune or misfortune that his first published book, The Friends of Eddie Coyle, which came out in 1970, was his most commercially successful and also, arguably, his best work. He wrote roughly a book a year for the next thirty years until his untimely death (real old-school stuff: he was a heavy eater and drinker and died of a heart attack). The general feeling is that if he could’ve just stepped off the treadmill and spent two or three years writing each book, he could’ve published 10 or 15 excellent books instead of 30 books which were often wonderful but which had serious flaws. Part of this surely was just his temperament, but another factor was the disintegration of the mass-market publishing world, a slow and complicated process that coincided with his career. He was always jumping from publisher to publisher. Sure, his books after the first did not sell so well, but this had to do with the fragmentation of the market. In his book, Sinykin talks a bit about literary authors such as Joan Didion and Colson Whitehead who wrote literary novels with genre elements, and a bit about best-selling genre authors such as Stephen King, but not so much about Higgins’s category of genre novels with literary aspirations. A good analogy might be Ross Macdonald–his subject, styles, and themes are much different from those of Higgins, but the category seems to fit.

Anyway, it seems that what Higgins needed, and never really had, was a supportive editor, someone who could’ve supplied the tough love to explain what wasn’t working in the books, while fronting Higgins the money so he could’ve spent more time getting each book right. That’s how at least some of the publishing industry used to work, but it didn’t work for Higgins, who never was in that stable situation. Again, this must’ve been partly just the way that Higgins liked to live and work, but I feel that if he’d had a good relationship with an editor and publisher, he could’ve done better.

Nobody buys books, so nobody edits books

This came up a few years ago in my discussion of a book I enjoyed, Suburban Dicks:

The book is well written, but every once in a while there’s a passage that’s just off, to the extent that I wonder if the book had an editor. Here’s an example:

“Listen, I know what it sounds like, but, I don’t know, think of it this way,” Andrea said. “You were a child-psych major at Rutgers, right? And you got a job at Robert Wood Johnson as a family caseworker for kids in the pediatric care facility, right?”

“Yeah.”

Who talks that way? This is a classic blunder, to have character A tell character B something she already knows, just to inform the reader. I understand how this can happen—in an early draft. But it’s the job of an editor to fix this, no?

But then it struck me . . . nobody buys books! More books are published than ever before, but it’s cheap to publish a book. Sell a few thousand and you break even, I guess. (Maybe someone in comments can correct me here.) There’s not so much reading for entertainment any more, not compared to the pre-internet days. I’m guessing the economics in book publishing is that the money’s in the movie rights. So, from the publisher’s point of view, the reason for this book is not so much that it might sell 50,000 copies and make some money, but that they get part of the rights for the eventual filmed version (again, experts on publishing, feel free to correct me on this one). So, from that point of view, who cares if there are a few paragraphs that never got cleaned up? And, to be honest, those occasional slip-ups didn’t do much to diminish my reading experience. Seeing some uncorrected raw prose breaks the fourth wall a bit, but the book as a whole is pretty transparent; indeed, there’s a kind of charm to seeing the author as a regular guy who occasionally drops a turd of a paragraph.

It makes me sad that there was no editor to carefully read the book and point out the occasional lapses in continuity, but I can understand the economics of why the publisher didn’t bother. I’m sure the eventual movie script will be looked over more carefully.

It’s been four years since that post came up, so let’s do some googling . . . There doesn’t seem to be any movie, but there is a sequel, “The Self-Made Widow.” So that’s something.

My own experiences in publishing

All my books have been published with academic publishers. In retrospect I probably should’ve gone with the same publisher for all of them, but for various reasons I’m spread all over, having published books with CRC Press, Oxford University Press, Wiley, Cambridge University Press, and Princeton University Press. Each time there was a reason. We wanted to publish Red State Blue State with a trade press and get real publicity, and we even found a literary agent, but no trade press was interested which is why we went with Princeton. Then a few years later I was working on Crimes Against Data and I found an agent who’d been recommended to me . . . the agent was enthusiastic about the idea and said they were ready to shop it to some leading publishers . . . but then a couple days before I was going to sign the contract, this story came out–this was Jeffrey Epstein’s literary agency! See here for more background. I really dodged a bullet with that one. On the minus side, without that push from the agents I never got around to writing that book, indeed I have no idea if it will ever happen. So, yeah, intermediaries such as editors and literary agents can really make a difference.

Dan Sinykin

The other thing that Sinykin’s book made me wonder about was . . . What’s the story of the publication of books such as his? 75 or 100 years ago, books such as Big Fiction would’ve been published by “trade,” not academic presses. Indeed, 75 or 100 years ago, “Dan Sinykin,” if he were doing this sort of thing, would likely have been a journalist in the Edmund Wilson mode, not a college professor. Back then–but no longer!–there were lots more jobs in journalism than in academia. A similar example is Leah Garrett’s book about war novels, which again was published by a university press but 75 years ago might have appeared in the form of magazine journalism and a possible trade book.

This isn’t necessarily a bad thing–as a successful college professor, I reach much smaller audiences than I would as a comparably successful journalist, but I have a lot more flexibility in what to write about and in how much time to put into each thing I write–but it’s a change.

Sinykin’s published and edited a few academic books, and he occasionally writes for general-interest publications (that’s what brought his work to my attention in the first place), so I wouldn’t be surprised if he has some ideas for a popular or “trade” book to come next. Nobody buys books, and he’s unlikely to make real money on a trade book, but I guess it could give him the news-media credentials to publish more magazine articles, op-eds, etc., maybe even go on the lecture circuit, who knows?

Dan Sinykin on close reading in literature, and me on close reading in statistics

Interesting article by Dan Sinykin on close reading:

Reading, a skill easily taken for granted, is difficult–all the more so when reading literature that wields language as a medium for art. . . . It’s easy to see why close reading, which demands patience, openness to others, and slow, careful thought, is having a moment among academics. . . . academics are rediscovering the quiet excitement of close reading, a relief from the overheated corporate pablum routinely suffocating us.

Pablum is not just corporate

To the above, I’d just add that the “overheated pablum” is not just “corporate”; it’s coming from all sources. We see it in scientific research papers, on twitter, on NPR, all sorts of places that are not themselves corporate (OK, sure, twitter is owned by a corporation but the individual people posting the pablum are not themselves corporate). Overheated pablum is a style, a dominant style for the usual Gresham sort of reasons and also because, as Sinykin says, our discourse has evolved into a position where nobody pays attention to anything, so things are written with the expectation that nobody will pay attention to them, etc. I see this sometimes online when someone will criticize something of mine, but they’re criticizing things I never said–one example is here, and there are links to a few more at the end of this post. Or, more directly, there are examples such as the psychology study described as “long term” even though it took place over only 3 days, or the study that claimed “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” even though it had no evidence of anyone actually becoming more powerful.

As I’ve complained, my frustration about such things is not just that this happens–that credentialed scholars write articles with titles and abstracts that are flat-out false–but that nobody seems to care, even when these things are pointed out. To put it in Sinykin’s terms:

– People aren’t doing close readings on scientific papers, even high-profile papers that get tons of media attention.

– Scientists aren’t even doing close reading on their own papers! For example, in the papers linked above, I don’t know that the authors even considered the idea that there could be a contradiction between describing a study as “long term” even though it only lasted 3 days, nor that there could be a problem with claiming that people in their study “instantly become more powerful” even though the study had no measurements of power.

To put it another way, expressions such as “long term” and “more powerful” are to be taken metaphorically, as a sales job, in the same way that you might say that someone “is going to cure cancer in two years” even if, no, you don’t actually think they’re going to cure cancer in two years. It’s advertising-speak, it’s letter-of-recommendation-speak, it’s Ted-talk-speak, it’s the-title-and-abstract-of-papers-published-in-Psychological-Science-speak, and if everybody does it, then it doesn’t count as lying, indeed it doesn’t even feel like lying.

Bad news when close reading is no longer expected

I don’t agree with all the political things that Sinykin says, but that’s not so important for my point here, which is that I agree that there’s not enough close reading going on, and when audiences don’t read more closely, this provides less incentive for authors to write for close readers. Indeed, authors can become indignant at close readers, attacking them as obsessed Javerts. We’re seeing a development of a new equilibrium.

Just by analogy, consider the much-discussed phenomenon of Netflix-style movies where every action on the screen is announced by the characters as it is happening. This is so annoying! It’s certainly not the pattern with every Netflix show, but it happens a lot with the more generic offerings you’ll see on that and other streaming channels, and it’s said that the reason is that people have a movie on in the background while they’re doing something else so the running explanation allows the movie to be followed in that passive way. Then when more movies and shows are produced in this way, it encourages more of this background watching, and you get a new equilibrium in which everybody’s paying less attention–the viewers are paying less attention to what’s happening on the screen, and the actors, directors, and producers are paying less attention. Is this a bad thing? Maybe not! Maybe it’s a throwback to the classic era of radio drama, I don’t know. The point is that close reading, or close watching, is a choice. At least, it’s a choice if I’m reading or watching in my native language. When I’m reading or watching in French, I need to apply my full concentration at all times or I’ll lose the thread.

The 4 aspects of close reading

Sinykin gives two examples of close reading, one from the Odyssey and one from the Bible. These examples made it clear that close reading is four things.

1. Most directly, close reading is figuring out the literal meaning of the text. Who are the characters, who is saying what (this can be tricky when reading long stretches of dialogue), who’s alive and who’s dead, what is the sequence of action, etc. It’s possible to read a story and fail in this very basic task, sometimes because the author is hiding things (Agatha Christie, Gene Wolfe, etc.), sometimes because the story is itself incoherent (it’s easier to think of examples from movies where continuity or logic is violated, but this can happen in written stories as well), sometimes just because you’re reading quickly, watching the story go by, and not focusing on the details.

2. The second step of close reading is understanding the characters’ motivations: not just what is happening but why. Sometimes this is explicitly stated, but usually you have to figure it out. In this category I’d also place whatever struggles the reader might have with unreliable narrators, information gaps, and whatever deliberate ambiguities are in the text.

3. The third step is following all the details that flow by. Often I read a book and enjoy it, but only on rereading do I notice all sorts of little things that I zipped by the first time in my rush to follow the story. This can even happen on the umpteenth reading! I recently reread Forlesen, and it was full of fun bits that I’d not previously caught. I’m not talking about subtle references or misdirections or deep themes or “easter eggs” or whatever, just the granular bits of conversation, thought, and event that I’d earlier skipped without noticing.

4. Finally, close reading involves understanding a literary work in its historical and cultural context. This has two parts. From one direction, in reading a story or watching a movie or TV show we can learn a lot about the time and place when it was produced, just from things happening in the background–patterns of speech, clothing styles, the way people are milling around in street scenes, etc.–this is what we call the Speed Racer principle. From the other direction, if you know something about the culture within which a work was produced, you can get additional insights into what the author of the story is trying to say.

It’s that fourth aspect of close reading that Sinykin focuses on:

Late in The Odyssey, Odysseus, who has endured 10 years of wandering to return home from the Trojan War, encounters his childhood nurse. No one has yet recognized him, and he does not want to be recognized. He appears a stranger. His erstwhile nurse washes his feet and, in doing so, sees a scar on his thigh, startling her into recognition. The mark on the body becomes, once noticed by a caring, knowing observer, auratic, suffused with meaning. At that moment, Homer interrupts the story with some 70 lines about how Odysseus suffered the wound that left the scar, only to pick up when the nurse drops Odysseus’ foot in the basin.

We might think, given how we have learned to read stories in our time, that Homer interjects the history of the scar into the scene to induce a feeling of suspense, suggests [philologist Erich] Auerbach. But we would be wrong. Suspense requires a distinction between foreground and background, which is unknown to Homer, who writes everything in a fully saturated now. While narrating the history of the scar, he does not expect us to be waiting to find out what happens with the nurse. He expects us, argues Auerbach, to be 100 percent in the presence of the past. Homer must describe the scar because if he did not, we would be left with an unexplained, mysterious detail, which he cannot bear. Everything must be illuminated. He must account for the scar. Everything in Homer proceeds with clarity, “never a lacuna, never a gap, never a glimpse of unplumbed depth.”

Sinykin shares a related close reading of a biblical passage, then concludes:

We can learn about a people through its style, its literature, which bears an ineradicable record of its version of reality. This, at least, was Auerbach’s gambit. The method is close reading. Others do it differently and can be no less exhilarating. It starts with a cultivation of sensitivity to art and language.

Agreed. I’d just add that this has several aspects, starting with the most basic of trying to figure out, as a reader, what exactly is happening in the story and what could be going on in the minds of the characters.

Close reading in statistics

This was all on my mind because just last year we were discussing the connections between close reading in literature and close reading in statistics (see also here).

As a statistician, when I read a report closely I go through the four steps listed above:

1. First, I try to put in the effort to understand exactly what was going on in the experiments being discussed. This can be difficult! Research papers often don’t include crucial information such as how exactly the experiment is done and what measurements were taken.

2. Next, it’s useful to understand the authors’ scientific goals. This is usually pretty clear from the way the results are presented.

3. Then there’s the struggle to follow all the details. A paper can have a lot of graphs and tables, and each one can take a lot of close reading to figure out. Especially when the paper has errors, as in the notorious work of Brian Wansink or Richard Tol.

4. Finally, the context of the work. Is this a Psychological Science paper from the 2010-2015 era? A natural experiment from the bad old days of regression discontinuity analysis? Or maybe something that we would expect to be done well? As with a story or novel, it’s good to know what genre you’re reading. And, from the other direction, the just-taken-for-granted aspects of a paper can give us insight into the scientific culture that it came from.

What exactly is “close reading”?

After all this, I was wondering how other people define “close reading,” so I looked up the term on wikipedia:

In literary criticism, close reading is the careful, sustained interpretation of a brief passage of a text. A close reading emphasizes the single and the particular over the general, via close attention to individual words, the syntax, the order in which the sentences unfold ideas, as well as formal structures. Close reading is thinking about both what is said in a passage (the content) and how it is said (the form, i.e., the manner in which the content is presented), leading to possibilities for observation and insight. . . .

In the practice of literary studies, the technique of close reading emerged in 1920s Britain in the work of I. A. Richards, his student William Empson, and the poet T. S. Eliot, all of whom sought to replace an “impressionistic” view of literature then dominant with what Richards called a “practical criticism” focused on language and form. American New Critics in the 1930s and 1940s anchored their views in similar fashion, and promoted close reading as a means of understanding that the autonomy of the work (often a poem) mattered more than anything else, including authorial intention, the cultural contexts of reception, and most broadly, ideology.

Hmmm, interesting.

The first paragraph above is a good match for how I was thinking about close reading and how Sinykin discusses the concept.

But the description at the end of the second paragraph, describing the attitude of the American New Critics, is pretty much the exact opposite of what we were discussing! Sinykin explicitly talked about how you can use knowledge of the Homeric and biblical contexts to understand what the authors of those passages were doing, and I was saying something similar with regard to reading scientific papers.

So now I’m confused: Is close reading centered on an understanding of cultural and historical context and authorial intention (my take, and I think Sinykin’s) or is it about “the autonomy of the work . . . more than . . . authorial intention, the cultural contexts of reception, and . . . ideology”? What’s going on here???

P.S. I can’t figure out how Sinykin’s article ended up at a sports site. I once published something at Baseball Prospectus, but my article was actually about baseball so that made a bit more sense. I’m not complaining–Sinykin’s post was interesting, and it was written in a friendly, nonacademic style that fit in with other articles at that site–I just wonder how it happened. I see that, in addition to teaching English at Emory University, Sinykin is also a professor of Quantitative Methods. So maybe he’ll appreciate this post!

Nuking New York, never easy

I was listening to this Unclear and Present Danger podcast on Fail Safe . . . I’ve never seen that movie but I did read the book many years ago. There was a copy of the paperback kicking around in our house when I was a kid, and I read the book a couple times.

The plot of the book and movie (spoiler alert!) is that a computer error leads to U.S. planes being sent to drop H-bombs on the Soviet Union. The Americans and Soviets try to recall or shoot down the bombers, but one gets through and nukes Moscow. In order to preserve the peace, the U.S. president orders the Air Force to nuke New York.

One thing Jamelle and John talked about in the podcast was the extreme implausibility of the ending, first that the president would order a nuclear attack on an American city and second that the bomber crews would do it. In retrospect, I agree this could not have happened, but when I read the book that scene didn’t seem so jarring.

So the question is: why did it not seem off? Why did the ending not destroy the sense of the book?

I have a few theories:

1. The book (and, I guess, the movie) was full of realistic procedural details. Kind of like Tom Clancy, perhaps–I’ve never actually read anything by Clancy–, the author establishes a this-is-the-real-world-where-the-grownups-work feeling, so that when the events and motivations become more farfetched, they’re happening in this sort of hyperrealist setting.

2. The tension builds up throughout the book: Are they going to stop all the bombers, what will happen, etc. Reading the book, I really didn’t want a total war between the countries. With that in the background, nuking New York just didn’t seem so bad. Jamelle and John are a later generation, they’re coming at this after the two-superpower world, they remember how traumatic the World Trade Center attacks were, and a nuke in Manhattan would be something like a hundred times worse. But compared to the obliteration of our entire civilization, a loss of one city doesn’t seem like so much.

3. They said that the movie ended with a New York street scene, thus emphasizing the human cost of the bomb (I guess they didn’t include a corresponding scene in Moscow?). In the book, sure, the decision to bomb New York is presented as difficult–much is made of the fact that the president’s wife is in the city at the time–; still, text is not as emotion-provoking as images. Also, I did not live in or near New York at the time.

4. There’s also the more general point that, as a kid, I didn’t really expect books or movies to be realistic. Even setting aside fairy tales and science fiction, regular old realistic stories weren’t so realistic. Did Murder on the Orient Express make any sense? Cop shows on TV? Sitcoms? Considering the narrative art we were exposed to, there was internal consistency but not much real-world realism. So, sure, the conclusion of Fail Safe didn’t make so much sense, but, in the context of the book, which is a thriller with a kind of puzzle plot, everyone working to try to stop the world from ending, the goal is for them to find the solution to the puzzle, in the same way that this might be done in a locked-room mystery, for example, or a science fiction story set on a spaceship.

5. Continuing with that last point: the Cold War was often framed as a game. Mutually Assured Destruction was expressed in terms of game theory, the technocratic experts were gaming various scenarios. As kids, we played Risk, which wasn’t literally an expression of the Cold War but it did involve lots of armies battling all over the world. In a game, you think strategically, not in terms of human cost. So, from that perspective, nuking New York is a clever play, and the whole game is interesting because there are different players but the goal is not to “win” and beat the others, it’s to survive. To put it in modern terms, it’s more like a game of Loup Garou than a game of Risk.

6. Returning to the appeal of the story: I think the usual explanation for the popularity of Fail Safe is that the accidental-nuclear-war scenario is so scary and so real-seeming. And I agree that’s part of it. But I think a big appeal is the puzzle solving, the twist that at first everyone thinks they’re playing Risk, then they gradually realize they’re playing Loup Garou and they have to communicate sincerity in a world in which trust has been mostly destroyed. It’s a reframing of the Cold War nuclear standoff.

And one more thing. If this were all happening today, maybe nuking New York wouldn’t be such a traumatic decision? Back in the early 1960s, political candidate Barry Goldwater notoriously said, “Sometimes I think this country would be better off if we could just saw off the eastern seaboard and let it float out to sea,” but that was recognized as an extreme statement. Nowadays, political polarization is such that I suspect that a lot of Americans would be like, sure, ok, let’s wipe out New York, and could you take away San Francisco too? I don’t know, I just think that nuking New York in 2025 would have a different feel than nuking the city back in 1960, when it was the economic and cultural heart of the country rather than just one more regional interest.

Burdick

The authors of Fail Safe are Eugene Burdick and Harvey Wheeler. After reading Fail Safe as a kid, I went to the library and found two other books by Burdick: The Ugly American and The 480. Both were really interesting, also in the same vein of sending a political message in the context of a realistic insider-ish story. The Ugly American was about how the U.S. was screwing up in Vietnam by not winning the hearts and minds of the people, and the titular character was a hero, or at least a positive figure (despite what you might think from the title): he was “ugly” in the sense of being unpolished, not like the urbane U.S. political and military leaders, but he was working directly with the Vietnamese and making their lives better. The story was entirely in a Cold War context, and the message was that we’re gonna lose to the Communists if we don’t do things right. The book came out in 1958, when U.S. involvement in Southeast Asia was just getting started–Burdick was ahead of his time!–and the title of the book became a catchphrase, but I guess its message was not adopted by the government. Or maybe there was a larger problem, an internal contradiction between the idea that we should be winning hearts and minds and the idea that the U.S. should be the geopolitical boss. Anyway, a thought-provoking book with a strong theme.

The 480 was another good one: an unusual novel in that the villains were number-crunching political consultants! The number in the title referred to the number of demographic categories that these consultants had used to model the U.S. electorate, and the idea was that they would win elections by targeting these categories–I guess we’d call it “microtargeting” today. The plot had some similarities to The Manchurian Candidate, a book and movie that came out a few years earlier, but The 480 played it straight whereas The Manchurian Candidate was some mixture of horror story and over-the-top satire. Eugene Burdick and Richard Condon just had different sensibilities. Condon also wrote the political-themed satires Winter Kills and Prizzi’s Honor, the latter of which was made into a John Huston movie.

When writing this post, I was curious about Burdick, who’d written these three important political novels–not important as literature, but important in their early expressions of three different political themes. According to Wikipedia, he was a war hero, a cold war liberal, and a political scientist at the University of California. But I never got to meet him–he died 25 years before I started working there. Maybe someone could make a movie or play about his life–he lived through a lot, had a lot to say, and had a view of the world that not too many people have anymore. A real throwback, in an interesting way.

P.S. Henry Farrell writes, regarding Burdick, “It’s worth reading the recent Jill Lepore book, If Then, which has a fair amount about his involvement with Simulmatics.” I’ll have to check it out. I remember seeing the review of that book when it came out, and I thought it was cool that someone else was interested in the topic. I don’t know if I’ve ever met anyone who read The 480; at least, it’s never come up in conversation, and I only came across it in a circuitous way. The book has stuck in my mind for decades, and I included it in my list of the five best books on how Americans vote.

On the statement, “American academia is entering a period of even more uncertainty”

The above sentence was written by sociologist Philip Cohen. I guess his statement about American academia is literally true, but the “even more” part seems misleading to me. My take on American academia is that it has been one of the economically cosseted islands of the American economy (along with the health care and police/military/security industries) within the sea of uncertainty, at-will employment, etc.

Academia, health, and security are three areas of the economy that have had atypically low uncertainty over the past few decades: they’ve been close to recession-proof and, yes, there is always belt-tightening but not a lot of people actually getting fired. One exception to this is the replacement of full-time teaching positions with adjuncts, but that seems different from the issues discussed in your above post.

I guess what I’m saying here is . . . ummm, I’m not saying that everyone in academia has it easy, just because I have it easy, having been lucky enough to step on the escalator at the right time (if maybe not as lucky as the lazybones discussed here). Rather, given all the uncertainty in the economy in the U.S. and the world during the past twenty years, I wouldn’t say it would be hard to believe this would come to academia also. Especially given that there have been direct political efforts to attack academia. Even beyond that, though, it’s hard for any institution to hold out against the tide.

A relevant analogy here might be the police. When people talk about cutting the budget for the police or reducing the autonomy of police officers, police departments fight back, often pretty loudly. The police are like the university in that both are valued because: (1) they provide a necessary function for society, a function for which there is always a demand for more, and (2) they are highly politicized and active in politics (academia on the left and police on the right). The health-care industry is different. It satisfies property #1 but not property #2: the health-care industry is not associated with either side politically. Although that seems to be changing, with doctors and nurses moving toward the Democrats and the Republican party working pretty hard to antagonize them.

“Monsters: A Fan’s Dilemma”

At the recommendation of a blog commenter, I read the above-titled book by critic and memoirist Claire Dederer. The promotional material describes it as “a passionate, provocative, blisteringly smart interrogation of how we make and experience art in the age of #MeToo, and of the link between genius and monstrosity.” This didn’t sound so promising to me–it reminded me of about a zillion op-ed and arts page articles that I’ve come across in the past few years, and I didn’t feel like I needed another lecture about how we should separate the art from the artist, or conversely an explanation of how Kevin Spacey was never actually a good actor or whatever.

But the book was neither of those things. It was excellent and stimulated many thoughts which I’ll now share:

Who is worse, Pablo Picasso or Laura Ingalls Wilder?

This is not a serious question. Or, I should say, it’s a serious question that I am deliberately framing in a non-serious way, just as a way of demonstrating that there’s no unidimensional scale of badness.

Here’s the point. As a human being, Picasso seems like the worse of these two artists. As Dederer puts it, “The used-up women in his life make a fleshy pig-pile, so much that it can be hard to remember which is which: Fernande Olivier, Eva Gouel, Olga Khoklova, Marie-Thérèse Walter, Dora Maar, Françoise Gilot, and Jacqueline Roque. Two killed themselves–and so did Picasso’s grandson, Pablito–and most of the rest were left with their lives shattered after their time with Picasso. . . . . Picasso’s granddaughter Marina wrote in her memoir: ‘He submitted them to his animal sexuality, tamed them, bewitched them, ingested them, and crushed them onto his canvas. After he had spent many nights extracting their essence, once they were bled dry, he would dispose of them.’ It’s no crime to love a lot of women–even if it makes the women in question cross or jealous or crazy or suicidal. But of course Picasso was also abusive toward those women (beatings and burnings), and moreover he was a predator of young girls, who fascinated him and whom he used as models.”

On the other side, I have no reason to think that Laura Ingalls Wilder was an abusive person or that she did mean things at all (beyond the bad behavior that is occasional in all of us).

But . . . you can look at Picasso’s art and appreciate it straight up–as artifacts in themselves and in their role in politics and the development of art–without needing to concern yourself with his biography. No doubt his abusive behavior was connected to his artistic achievement, but the art was not about the brutality.

Wilder, on the other hand, embedded racism into the core of her books. Dederer informs us that the following sentence appeared on the first page in the early editions of Little House on the Prairie: “There were no people; only Indians lived there.” You can separate Picasso’s art from his life in a way that you can’t separate Wilder from her political and social attitudes.

To look at this another way, consider scientists who held political views that you might now call odious, such as Francis Galton’s racism (which, like Laura Ingalls Wilder’s views, was close to the core of his statistical work) or J. B. S. Haldane’s communism (which seems more peripheral to his contributions to biology, although I expect that Haldane himself saw some connections there). My goal here is not to go around canceling people–it would be absolutely ridiculous to abandon the scientific insights or try to retroactively diminish the contributions of people with problematic social or political views, and it would be even more hopeless if we were to try to remove all the assholes from history too–at some point there’d be just about nobody left–even gentle Einstein had some strongly racist views, also apparently was not such a nice husband, perhaps in the manner of modern sports stars who go through life expecting that other people will take care of them and clean up all their messes–; rather, the biography is part of the story. When we talk about historical figures, we talk about when they lived and where they were from and sometimes about their personal lives; their political views and personal actions can be relevant to our understanding too.

Benefit of Clergy

Last time this topic came up, I brought up George Orwell’s classic essay, “Benefit of Clergy: Some Notes on Salvador Dali,” where he discusses how to simultaneously think of the famous Surrealist painter as both a great artist and a terrible person.

It really shouldn’t be so hard to say that Einstein was a brilliant physicist, also a campaigner for peace, also had some racist views, also was a bit of a pig who expected other people to clean up his messes. It shouldn’t be hard to say that Yuval Peres was a brilliant mathematician, a generous colleague, and a sexual harasser, or that Neil Gaiman went through life doing bad things but he also wrote influential books. But somehow it can be hard for people to do this. Dederer’s book is a thoughtful exploration of why this separation can be harder than it looks, why it is that, as she puts it, “The person does the crime and it’s the work that gets stained.”

To put it another way, if you don’t want to say, “X has been a good person and valuable contributor to society in many ways, but in some other ways he’s behaved badly and exploited his position,” it doesn’t necessarily mean that you’re clueless–that X’s misdeeds blind you to his contributions–; it could just mean that, in your judgment, the misdeeds outweigh the contributions enough that you don’t feel comfortable celebrating the contributions, or that the misdeeds change your interpretation of the contributions. Although it can go in the other direction too. I know Yuval as a generous colleague, willing to put in the time and thought to work out a difficult math problem with me. Years later I heard he had a side career as a sexual harasser, and that’s horrible, also I wonder if that flowed out of his generosity as a mathematician. That is–and I say this without knowing any of the context, so I’m really just using his case to represent the general principles here–it seems plausible to me that he was following his usual practice of being a caring, involved colleague to these women, and this care engaged his emotions, which, when combined with poor judgment and lack of self-control, led to his repeated inappropriate behavior.

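Consider this diagram:

[Diagram: a triangle with “brilliant mathematician” at the top vertex and “generous colleague” and “sexual harasser” at the other two vertices.]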

I put “brilliant mathematician” at the top here because, even if it might not be the most important thing about Yuval, it’s his most distinctive attribute: there are a lot more generous colleagues and sexual harassers in the world than there are brilliant mathematicians.

In any case, the point of the above triangle is that all three of its vertices go together. Yuval’s brilliance as a mathematician facilitates his generosity as a colleague. It’s a lot easier to be helpful if you have a deep understanding. And the generosity put him in a position that facilitated the harassment. My point is not to claim that if you want the brilliance, you need to accept the harassment–I suspect that had the consequences been clearer, Yuval would’ve been able to restrain himself–; my point is just that his misdeeds are connected to his virtues.

The principle of retroactivity

Dederer writes, “The principle of retroactivity means that if you’ve done something sufficiently asshole-like, it follows that you were an asshole all along.”

I guess this is true, in that everyone–well, just about everyone–really is “an asshole all along,” in some sense. Roman Polanski was an asshole, Albert Einstein was an asshole, Orwell and Dali were assholes of course, Terry Speed was an asshole long before he harassed that postdoc, also you and I and most of our neighbors–including those who have never done any harassment of any sort–are assholes in some aspects of our lives. Being an asshole is part of the human condition.

What I’m saying is that, once you have reason to look back in time for asshole behavior, you’ll be able to find it.

Dederer continues: “a current moment can remake the past anew, can imbue the past with new truth . . . the stain travels backward, affecting and defining the perpetrator not just at the time of the abuse, and not just after the abuse, but before he committed the crime.”

This reminds me of how it can be hard to assess how good a book or movie is, until you get to the end. A story of suspense or mystery can be very compelling, but only if the mystery is resolved in a satisfactory way. If the solution is a cheat, this reflects backward and makes the early parts of the story retrospectively flawed. Conversely, a great ending can retrospectively make earlier parts of the book or movie all make sense.

And this makes me wonder whether Dederer’s quote is revealing a problem we have when thinking about people and events, which is that we try to fit things into a storyline, whether that be a “Breaking Bad”-style decline into depravity or a redemption arc or a he-was-an-asshole-all-along narrative.

The idea of genius

Dederer talks about the problematic idea of the “genius,” which reminded me of my problems with the scientist-as-hero narrative. It’s a problem! There are geniuses, but they make their own characteristic errors. Even the best scientists make scientific errors; as I wrote here:

Brilliance represents an upper bound on the quality of your reasoning, but there is no lower bound. The most brilliant scientist in the world can take really dumb stances. Indeed, the success that often goes with brilliance can encourage a blind stubbornness. Not always–some top scientists are admirably skeptical of their own ideas–but sometimes. And if you want to be stubborn, again, there’s no lower bound on how wrong you can be. The best driver in the world can still decide to turn the steering wheel and crash into a tree.

But that’s the outside take. Dederer also looks at it from the perspective of the “monster”: “The experience of channeling something, of being a servant to something bigger than yourself, isn’t just for the prodigy, or even just the young–Picasso retained it throughout his life. . . . Part of Picasso’s lifelong practice was to give himself to this greater power. This freedom was actually part of his job–paradoxically, part of his discipline.”

I can relate to that. I’ve been so lucky in my life to be able to work on problems that I think are important and interesting, and I do feel a sense of responsibility to make the most of my time here.

I don’t agree with everything Dederer says on the topic, though, for example: “Isn’t the genius the person who changes everything about his or her field? . . . If you go by that definition, Duchamp is actually a greater artist than Picasso. If a Renaissance artist time-traveled to the twentieth century, he would’ve recognized what Picasso was doing as painting. But Duchamp would’ve made zero sense to him as art. Duchamp changed everything. But Duchamp doesn’t fulfill an image that we have in our minds of genius.”

Sure, I’ll buy the what-the-Renaissance-artist-would-think bit, but . . . I don’t think that makes Duchamp a genius. Or, maybe he was a genius at promotion; it doesn’t make him a genius at art. In contrast, Picasso really was a genius at art! I know these judgments are subjective; my point is that I don’t think that being “the person who changes everything” is either a necessary or sufficient condition for genius.

Are we “excited by their asshole-ness?”

Later, Dederer writes, “Part of the reason so much attention has been trained on men like Picasso and Hemingway is exactly because they’re assholes. We are excited by their asshole-ness.”

Ummmm, who is this “we” you are talking about? I’m excited by the art that Picasso and Hemingway created, and then I’m interested in learning more about their lives. If they were super nice guys, I’d still be excited about their work. Now you might say that being an asshole was a condition for their work–perhaps the only way they could’ve made such contributions was through a single-minded focus that excluded all others–but, even so, at best that just means the asshole-ness was necessary, not that this is what attracts us to them.

Yeah, I know the trope of the sexy bad boy . . . here it is right here for you . . . but I think it’s orthogonal to the “genius” thing. Some people are fascinated by sexy bad boys, some people aren’t; I don’t think that’s the key to the appeal of Picasso or Hemingway.

Do writers and artists get special dispensation to be assholes?

Dederer writes, “Writers want to be left alone to write, and be waited on. . . . at least a few men are onto themselves. The novelist John Banville told the Irish Times that he was, not to put too fine a point on it, a shitty dad, and what’s more, probably most writers are. ‘[Writing] was very hard . . . on the people around me, on my children. I have not been a good father. I don’t think any writer is. You take so much and suck up so much of the oxygen that it’s very hard on one’s loved ones.’”

What an asshole (Banville, that is, not Dederer). Indeed, Banville’s a double asshole in that quote, first by being a bad father (I’ll take his word on that) and second for blaming it on being a writer. Lots of writers have no problem being good fathers. There are 24 hours in the day, and there’s enough “oxygen” to be a good writer and a good parent. Look, Banville: if you or Neil Gaiman or Philip Roth or whoever wants to go around being an asshole, that’s you, and that’s all. Get over yourself, dude. You can take your Prince of Asturias Award for Literature and stick it where the sun don’t shine. That said, you might be a good writer; I’m not claiming otherwise.

George Orwell, Rebecca West, Claire Dederer, James Wolcott

I have the above list of names in my notes from reading the book. Unfortunately, I can’t remember what I wanted to say about them! It’s like a puzzle–What do these names have in common?–but I can’t figure out what it is. I read Dederer’s book several months ago.

One thing is that George Orwell and Rebecca West are pseudonyms, and Dederer writes about writers taking a new identity. I’m not sure where Wolcott fits in, though.

The monsters in our lives

Dederer’s deepest message is that the real issue with being a fan–or choosing not to be a fan–of art “monsters” (including monsters in their actions such as Roman Polanski and Pablo Picasso and monsters in their ideologies such as Laura Ingalls Wilder) has nothing to do with famous people and everything to do with people we love.

Not to get all Freudian about it, but the real challenge is dealing with the monsters of our childhood. Whether this is family members who physically abused or neglected us, or authority figures who abused their trust, or loved ones who treated us well but were abusive to others, we’re reliving those original contradictions of the people who were important in our lives. That’s why it’s so hard. The decision to reread Harry Potter or not, or to enjoy the dramatic stylings of Kevin Spacey . . . ultimately these are easy questions. If they feel hard, it’s because they stand in for closer, more personal questions.

Similarly, when thinking about academic misconduct, the fundamental challenges come when people who we’ve loved and respected have taken advantage of us–or of others.

We write about Francis Galton or Woody Allen or Yuval Peres because that’s less uncomfortable than writing about people closer to us.

An email from Jenny Diski

Also, Dederer wrote about the author Jenny Diski, which brought to mind an email exchange I had with Diski back in 2010. I wrote:

I’m writing to you because of a reaction I had to an offhand remark in your recently published review of a book on Psycho. You wrote:

“Skerry isn’t really one to let go of jargon. In the preface he explains how to read his book, not as most books are doomed to be read, from beginning to end, but differently and ‘in keeping with the multiplicity of voices that make up the text’. It gets quite scary: ‘The temporal structure of these chapters goes from the present-tense narrative of my research trip in Chapter 1 to the achronological, “cubist” structure of Chapter 3 . . .’”

I don’t know any of the people involved, but I suspect that Skerry was not intentionally writing in jargon; it’s just hard to write clearly. Harder than many readers realize, and maybe harder than you, as a professional writer, realize. My guess is that Skerry was trying his best but he just doesn’t know any better.

I had a similar discussion with a friend on this topic a while ago, where he was accusing academics of deliberately writing obscurely, to make their work seem deeper than it really is, and I replied that we’d all like to write clearly but it’s not so easy to do so. I’ve written several books myself, but I’m a statistician, not a creative writer, and I’m always struggling to write clearly and with minimal jargon.

There are some fundamental difficulties here, the largest of which, I think, is that the natural way to explain a confusing point is to add more words—but if you add too many words, it’s hard to follow the underlying idea. Especially given that writing is one-dimensional; you can’t help things along with intonation, gestures, and facial expressions. There’s the smiley-face and its cousin, the gratuitous exclamation point (which happened to be remarked upon by Alan Bennett in that same issue of the LRB), but that’s slim pickings considering all the garnishes available for augmenting face-to-face spoken conversation.

My full reactions are here.

Anyway, I hope this is useful to you in giving a slightly different perspective on academic writing. In short: when we write badly, it’s not always on purpose!

To which Diski replied:

Thanks for your email. I’m sure you’re right when it comes to your field – any field involving maths, but I’m not so sure about the humanities. I think there is very little that can’t be said about movies or even literature and history plainly (by which I mean well written) enough to be accessible to any literate person.

In any case, Skerry was, I thought, using the idea of postmodernism idiotically and unnecessarily in order to make his book appear more scholarly. Just my opinion, of course.

It’s always so great when a person responds in a serious way to a serious question.

Last words

Not the last words of Dederer’s book, but the last words of the second-to-last chapter. She writes:

You will solve nothing by means of your consumption; the idea you can is a dead end. The way you consume art doesn’t make you a bad person, or a good one. You’ll have to find some other way to accomplish that.

Stanford Human Trafficking Data Lab Hiring for a Full-time Postdoctoral Scholar

Ben Seiler sends along this opportunity:

The Stanford Human Trafficking Data Lab is accepting applications for a postdoctoral fellowship position to join a project investigating trafficking risks in charcoal supply chains in Brazil. The position is open to recent graduates of PhD programs in statistics, economics, computer science, operations research, or related data science fields. The position provides opportunities to participate in rigorous, quantitative research on human trafficking, including supply chain network analysis and geospatial modeling. The successful candidate will have strong data science skills, including experience working with large, complex data from varied sources, and machine learning methodologies. The underlying data are complex and will require sophisticated data management and integration skills. A candidate should have proficiency with GIS software and Python, strong written and interpersonal communication skills, and a demonstrated interest in addressing social justice issues through data-driven research. The postdoc will work in partnership with PI Grant Miller (Stanford University School of Medicine) and other research team members, and will contribute to study design, participate in field research, conduct data analysis, and disseminate findings through academic publications and presentations. The postdoctoral fellow will be expected to focus mainly on this project, but may spend up to 20% of their time on independent research. For more on the Lab’s other ongoing projects see https://htdatalab.stanford.edu/projects/. The postdoc will be based at Stanford University in the Department of Health Policy, located near to the Departments of Economics and Political Science and the Graduate School of Business.

TO APPLY: Please email a single PDF document named “Lastname_Firstname_HTDL_Postdoc_2025” containing the following materials to Lydia Aletrais at [email protected]:

1. A cover letter describing your interest in this position, your relevant training and experience, and your earliest and preferred start dates
2. A current CV
3. A transcript (unofficial is fine)
4. Names, e-mail addresses, and phone numbers of 2-3 references.

Applications will be considered on a rolling basis. Short-listed applicants will be asked to complete a technical exercise and may be called for an interview.

Given that they’re doing network analysis and geospatial modeling, I expect that knowledge of Stan would be useful too.

When are AI/ML models unlikely to help with decision-making?

This is Jessica. Enthusiasm for integrating predictive models–most recently LLMs–to support or even replace human decision-makers can be found in many domains these days, from medicine to financial investments to content moderation. Surely not all decisions are better made with the help of AI. But how do you tell when a decision problem is not a good candidate for applying a predictive model trained on past data? As we previously discussed, one of the tricky things about the long-running debate about when a statistical model will outperform human decision-makers is that many of the intuitions people like to cite for not using AI/ML don’t stand up to scrutiny. For example, saying that humans are better at knowing how to incorporate information about anomalous cases like the patient with the broken leg suggests either that humans have encountered enough of these rare cases to make a good prediction about the effects (a contradiction), or that the ways in which anomalous events like breaking one’s leg impact the future are not subject to much uncertainty (unrealistic as a general rule).

So what are good criteria for when to avoid deploying AI models for decision-making? In a new paper called Artificial Intelligence and Actor-Specific Decisions, Teppo Felin, Mari Sako, and I write: 

Artificial intelligence (AI) is increasingly seen as potentially replacing humans in decision making and problem solving across numerous domains. We argue that AI is useful for a broad range of decisions, but not for actor-specific ones. Actor-specific decisions are (a) forward-looking, (b) individual and idiosyncratic, and require (c) reasoning, and some form of (d) experimentation or intervention. These four criteria—informally captured by the “FIRE” acronym—demarcate when decisions are more conducive to being made by humans rather than AI. The “actor” in actor-specificity refers to the focal decision maker, thus highlighting the need for a first-person point of view to decision making—an approach that cannot be modeled from a third-person, population-level perspective (which is also the basis of AI). We also show how the FIRE criteria jointly implicate and offer normative guidance for actor-specific, strategic decision making. We discuss the implications of our arguments for the theory-based view and the key cognitive processes of search, representation, and aggregation.

The paper is still in working form, and we are looking for feedback. Overall, I really like the view we put forth as a description of how many of the decisions people face are personal and “messy” in ways that fall outside the bounds of formal models of prediction for decision-making. Here’s a quick summary of the four criteria:

Forward-looking decisions: Sometimes good decisions hinge on trying to predict futures that are unlikely to be like the past, in which case AI is unlikely to help. “An entrepreneur could disrupt the software-as-a-service market by …” calls for reasoning about what could actually disrupt a market, not repeating high plausibility responses according to historical data. An example where AI is not what you want is VC investment decisions: using AI will get you investments that look like what worked in the past but to the exclusion of high variance, high upside opportunities, which is what VCs dream of. 

Individual and idiosyncratic decisions: A common refrain is that we can’t rely on AI models for decision-making in cases where we care about individual decisions, like making the best choice for this particular patient. However, as mentioned above, once you define a decision problem with a relevant dataset and train a statistical model to solve it, it’s hard for humans to exceed its performance. The problem with this perspective is that it ignores the fact that decision-makers must bring in individual epistemic commitments or understandings of the world to act in many situations. 

For example, the true data-generating model is often not known, leaving questions as to what types of features matter and how to interpret them for the decision-maker to contend with.  Consider, for example, cases where high stakes medical decisions have to be made, but evidence of the value of accounting for certain patient properties is still emerging – different doctors may prefer to be more or less optimistic, perhaps based on their reading of the patient’s preference for taking risks. It is also often the case that ground truth labels are not so fixed and universal as in distinguishing cats from dogs; dermatological diagnoses, for example, can be subject to inherent ambiguity in the observational data provided by a patient. Given that they will be liable for their decisions (and may see their autonomy and sense of personal responsibility as central to their role as decision-makers), should we force them to decide based on what some majority vote preference suggests? There are also ways in which the objective used in modern machine learning pipelines underdetermines certain properties we want to hold for the predictions. For example, arbitrary choices made in model training like random seed can produce fitted models that will look equivalent on training data, but which we would not consider equivalent as soon as the distribution shifts (i.e., some will tank much more than others). In other words, there are often aspects of “good” decision-making that are not encoded in the usual objective, but which humans may have some insight into (e.g., experts having opinions on which features are likely to causally relate to the outcome and which are only associationally or spuriously related). Deploying an AI model shuts down the possibility for individual decision-makers to exercise judgment in cases where the right model might be debatable. 
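To make the random-seed point concrete, here is a minimal sketch of my own (a toy illustration, not something from the paper). It assumes a linear model fit by gradient descent from a seed-dependent starting point, with two features that are exact copies of each other in the training data. The function names, sample size, and noise level are all made up for the example. Every seed reaches the same training error, but the fitted weights differ in how they split credit between the two features, so the models come apart once the features stop moving together.

```python
import numpy as np

def fit_linear(X, y, seed, steps=2000, lr=0.1):
    """Gradient descent on mean squared error from a seed-dependent random start."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def mse(X, y, w):
    """Mean squared prediction error of weights w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

data_rng = np.random.default_rng(0)
n = 500

# Training data: the second feature is an exact copy of the first,
# so the data cannot say how to split the weight between them.
x1 = data_rng.normal(size=n)
X_train = np.column_stack([x1, x1])
y_train = x1 + 0.1 * data_rng.normal(size=n)

# Shifted data: the second feature is now independent of the first,
# and the outcome still depends only on the first.
x1_new = data_rng.normal(size=n)
x2_new = data_rng.normal(size=n)
X_shift = np.column_stack([x1_new, x2_new])
y_shift = x1_new + 0.1 * data_rng.normal(size=n)

results = []
for seed in range(10):
    w = fit_linear(X_train, y_train, seed)
    results.append((seed, mse(X_train, y_train, w), mse(X_shift, y_shift, w)))

for seed, train_err, shift_err in results:
    print(f"seed={seed}  train MSE={train_err:.4f}  shifted MSE={shift_err:.4f}")
```

Nothing in the training objective distinguishes these fits. Preferring the one that puts its weight on the first feature requires knowing something about which feature is causally related to the outcome, which is the kind of judgment the usual objective does not encode.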

Reasoning-based decisions: When decisions must be made in contexts where relevant data are scarce, causal reasoning provides a way to generate predictions and think experimentally. This can require thinking about downstream contingencies or implications of decision subjects’ agency, like imagining the potential symptoms of a given intervention and then the implications of those symptoms for patients’ behavior so as to reason about the intervention’s longer and shorter term effects. This kind of flexibility is not the target of conventional approaches to training AI models. In contexts where innovation is important and uncertainty is high, decisions can require formulating and reasoning about subproblems across stages in time. Consider the Wright brothers, who achieved heavier-than-air flight by identifying specific problems like lift, propulsion, and steering that they could then make progress on. There’s also a certain creative aspect to some decisions that people make – for example, some decisions are big enough that they shape the decision-maker’s future beliefs and alter their perception of opportunities and value itself. A decision in this sense is not just extrapolation or optimization, it’s a commitment underpinned by some type of forward looking causal reasoning about a future the decision-maker wants to reach. Things like choosing who to marry or what career to pursue are choices about what you are willing to commit to and who you want to be in the future more so than optimizations based on where you’ve been in the past. 

Decisions requiring experimentation and intervention: Many decisions are made only after the decision-maker has done some exploration or data generation, i.e., has taken some action and observed the results. Humans can even decide to test interventions that go against existing data. Returning to the Wright brothers, after identifying subproblems they made concrete progress by experimenting–e.g., testing in wind tunnels, working with light-weight aluminum for the engine–some of which ran counter to available evidence at the time. Human-identified interventions go deeper than the sorts of planning we’ve seen from AI agents so far – we choose what to test, when to intervene, and how to frame the problem in ways that might not be implied by any existing dataset. While agentic AI has become popular, so far its successes have been confined to tasks where the goals and environment are well-defined, like booking flights online.

Actor-specific decisions

We use “actor-specific” to summarize the first person view of decision-making that arises when the above criteria hold, emphasizing how decision-makers are often future-oriented and guided by personal concerns like their longer term goals and sense of identity. This is not equivalent to simply acknowledging that decision-makers can have different beliefs, e.g., differing sets of prior experiences as in a “personalist” view of probability. Actor-specific decisions are not amenable to plugging in an AI model – they require decision-makers to go beyond simply responding to information predictably based on past habits to formulate the problems themselves, where their autonomy as decision-makers is integral to how they approach the problem. This can be contrasted with the third person perspective assumed in ML and decision theory, where knowledge of the true data-generating process is assumed and the decision-maker’s goal is to optimize their behavior in response to information signals under a known utility function. The latter can get us pretty far in many settings (and I obviously rely heavily on decision theory in lots of my work) but we need vocabulary for talking about where it doesn’t apply.

One of our other points is that strategic decisions–including all sorts of business strategy considerations–are actor-specific. These kinds of decisions are made under competition, where the unique perspective or causal theorizing the decision-maker brings is critical to their success.

Anything you can do with Bayesian inference you can do in other ways. Bayesian inference is a bit like calculus: You can do derivatives and integrals without calculus (indeed, mathematicians in pre-Newtonian times were able to compute limits, with care), but calculus makes it a lot easier. Similarly, I find that Bayesian inference makes it a lot easier to combine information.

This came up in comments a few years ago:

Anything you can do with Bayesian inference you can do in other ways. Bayesian inference is a bit like calculus: You can do derivatives and integrals without calculus (indeed, mathematicians in pre-Newtonian times were able to compute limits, with care), but calculus makes it a lot easier. Similarly, I find that Bayesian inference makes it a lot easier to combine information. For example, I’m sure that someone could do MRP non-Bayesianly–and indeed there is a non-Bayesian tradition of partial pooling for small-area estimation in sample surveys–but I think it’s no coincidence that the widespread use of MRP has come along with the Bayesian approach.

If you look at my applied research papers, you’ll see a lot of analyses that maybe could’ve been done in non-Bayesian ways but in fact which my colleagues and I did Bayesianly, and which I suspect would never have been solved had we not had Bayesian tools.

There are also a lot of non-Bayesian success stories in statistics, and that’s fine too.

Bayesian inference is many things. It’s a set of tools for solving problems and also a framework for understanding statistical methods. Other statistical approaches similarly serve this dual duty; for example, classical hypothesis testing is a set of methods and also a framework in which statistical inference is viewed as a set of testing problems. I don’t find that particular framework very helpful–indeed, I think it often gets in the way–but I do recognize that there are many problems for which methods developed in that tradition can be useful. Recall our discussion of lasso.

Different sequences in narrative: why suspense can go flat

After writing this post recommending a novel by Rebecca Makkai, I was curious and clicked over to Makkai’s webpage, where I saw this post with some tips on how not to write. It seems like reasonable advice. You can go read it and see what you think.

There’s just one thing I want to add, which is that in a work of narrative (I was going to say “work of fiction,” but this applies to narrative nonfiction too), there are four different sequences:

1. Chronological time.

2. The order that events are presented in the story.

3. The order in which a particular character learns things: first a character knows A, then she knows B, then she knows C. These will not necessarily be in time order. Consider an inferential plot such as in a detective story, or even something simpler like a family drama where different characters learn past secrets at different times.

4. The order of events as constructed by the author. This could be chronological but it doesn’t have to be. For example, you might start writing a story and then realize in the middle that some bit of backstory is needed.

I think that another way of saying Makkai’s advice in that post is to say that, as an author, you should not confuse these four orderings.

The specific issue Makkai is concerned about is authors inappropriately trying to add surprise in a story. Surprise comes from manipulating sequence 2 above, and Makkai is saying, I think, that for this to work it has to also arise in sequence 3–it needs to be a character’s surprise, not just the reader’s surprise. Also, though, there’s sequence 4. As an author, you shouldn’t start off planning to be surprised. It doesn’t work that way. If you want to plan some surprise in sequences 2 and 3, go for it, but this should come naturally out of a logical sequence 4. Otherwise the surprise will seem artificial, in the same way that it would be artificial for an author to design a murder mystery, say, with all the details, and then only at the end try to figure out what the killer’s motive was. From a logical standpoint, the motive comes first.