More on the emptiness of the government’s “gold standard science” slogan

Yesterday we discussed the ridiculousness of the government mandating so-called gold standard science while at the same time promoting fraudulent, debunked, and flat-out fake research.

Since then, I came across a relevant news item, this one regarding the FBI:

Agents have been forced out. Others have been demoted or put on leave with no explanation. And in an effort to hunt down the sources of news leaks, Mr. Patel is forcing employees to take polygraph tests.

Polygraph tests are not gold standard science.

Survey Statistics: 2 flavors of calibration

I get confused when the same word is used for 2 different things. (I get confused a lot.)

In Survey Statistics, the word “calibration” is used in 2 different ways. Both are attempts to align estimates from our survey to an external source of data (e.g. census tables):

  1. Poststratification: Calibrate our estimates of means E[Y] to population data about another variable X.
  2. Intercept Correction: Calibrate our estimates of regressions E[Y|X] to aggregate data about E[Y].

Using more sources of data makes a statistical method good (a principle Andrew learned from Hal Stern). Kuriwaki et al. 2024 use both flavors of calibration to estimate Republican vote share by race and congressional district. Here is their Figure 4:

Suppose we want to estimate E[Y], the population mean. But we only have Y in the survey sample. For example, suppose Y is voting Republican. We can use the sample mean, Ehat[Y | sample], but what if survey-takers are more or less Republican than the population ?

If we have population data on X, e.g. racial group, we can estimate Republican vote share by racial group E[Y|X] and aggregate according to the known distribution of racial groups, invoking the law of total expectation: E[Y] = E[E[Y|X]]. So if our sample has the wrong distribution of racial groups, at least we fix that with some calibration. Replacing “E” with estimates “Ehat”, poststratification calibrates our estimate of population mean E[Y] to the known distribution of X, using E[Ehat[Y | X, sample]].

For example, suppose 60% of white voters vote Republican, E[Y | X = white] = 60%, and E[Y | X = non-white] = 25%. Suppose P(X = white) = 70%. But our sample has the wrong distribution of racial groups, e.g. P(X = white | sample) = 50%. Without correcting this, our estimate would be: 60% * 50% + 25% * 50% = 42.5%, too low due to our sample containing too few white voters. But with the correct population distribution of X: 60% * 70% + 25% * 30% = 49.5%, roughly the 2016 election results. So instead of assuming E[Y] = E[Y | sample], our new calculation assumes E[Y | white] = E[Y | white, sample] and E[Y | non-white] = E[Y | non-white, sample].
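The arithmetic above can be checked in a few lines. This is only a sketch of the two-cell example, not of any package's implementation; the function name `poststratify` and the dictionary layout are made up for illustration.

```python
def poststratify(cell_means, cell_shares):
    """Average the cell means E[Y | X = cell] over a distribution of X."""
    total = sum(cell_shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("cell shares must sum to 1")
    return sum(cell_means[c] * cell_shares[c] for c in cell_means)


# Numbers from the example in the text.
cell_means = {"white": 0.60, "non-white": 0.25}         # Ehat[Y | X, sample]
sample_shares = {"white": 0.50, "non-white": 0.50}      # P(X | sample)
population_shares = {"white": 0.70, "non-white": 0.30}  # P(X) from census tables

naive = poststratify(cell_means, sample_shares)           # same as the raw sample mean
calibrated = poststratify(cell_means, population_shares)  # poststratified estimate

print(f"uncorrected: {naive:.3f}, poststratified: {calibrated:.3f}")
```

The only thing that changes between the two calls is the distribution of X; the cell means are taken from the sample both times, which is exactly the assumption stated at the end of the paragraph.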

For more, see Lumley 2010 Chapter 7: Poststratification, Raking, and Calibration. For implementation try:

survey::calibrate()
To summarize poststratification:
  • want: population mean E[Y]
  • have: Y,X in sample, X in population

But let’s flip this around. From election results we know E[Y]. We want our estimates of vote by racial group E[Y|X] to be calibrated to those known totals.

  • want: E[Y|X]
  • have: Y,X in sample, X in population, AND population mean E[Y]

This is a different flavor of calibration ! With poststratification, it was the totals themselves we wanted to estimate E[Y], and the regression E[Y|X] was a step along the way. Now the regression E[Y|X] is what we want to estimate, but the known total E[Y] helps us. So when we estimate E[Y|X], we constrain it to aggregate to the known E[Y]. One way this is done is called the “Logit Shift” in Rosenman et al. 2023 and “Intercept Correction” in Ghitza and Gelman 2020 (Edit: it appears also on p.769 of Ghitza and Gelman 2013 as a “simple adjustment”). This intercept correction isn’t needed if we do a regression of Y on X in the population. As described in regression textbooks, the fitted values Yhat (i.e. Ehat[Y|X]) will aggregate to the mean Y if your model includes an intercept. What breaks that in survey statistics is that we do the regression in the sample, not the population.

For example, suppose our sample includes too many Democratic voters overall because they are more likely to take surveys. So our Yhat for all racial groups are incorrectly shifted towards Democrats. We can correct the intercept of our model by subtracting enough so that the calibrated Yhats now aggregate to the correct lower proportion of Democrats.
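Here is a minimal sketch of one way to do such a correction: add a single constant to every cell estimate on the logit scale, and choose the constant so that the estimates aggregate to the known total. The cell estimates and the target below are hypothetical, Y is coded as voting Republican (so a sample that is too Democratic needs a positive shift), and the bisection search is just a convenient solver, not necessarily what the cited papers use.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def aggregate(cell_probs, shares, delta):
    """Population mean implied by shifting every cell by delta on the logit scale."""
    return sum(shares[c] * expit(logit(cell_probs[c]) + delta) for c in cell_probs)


def logit_shift(cell_probs, shares, target, lo=-10.0, hi=10.0, iters=200):
    """Find delta so the shifted cell estimates aggregate to target.

    aggregate() is strictly increasing in delta, so bisection is enough.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if aggregate(cell_probs, shares, mid) < target:
            lo = mid
        else:
            hi = mid
    delta = (lo + hi) / 2.0
    shifted = {c: expit(logit(p) + delta) for c, p in cell_probs.items()}
    return delta, shifted


# Hypothetical survey estimates of E[Y | X], too low because the sample leans Democratic.
cell_probs = {"white": 0.55, "non-white": 0.20}
shares = {"white": 0.70, "non-white": 0.30}  # known P(X)
target = 0.495                               # known E[Y] from election results

delta, shifted = logit_shift(cell_probs, shares, target)
print(f"shift on logit scale: {delta:.4f}")
print({c: round(p, 4) for c, p in shifted.items()})
```

Shifting on the logit scale keeps every estimate between 0 and 1, which a shift on the probability scale would not guarantee.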

Simplifying Kuriwaki et al. 2024 who aim to estimate Republican vote share by racial group E[Y|X] (by congressional district, which we will ignore here):

  • They first calibrate (2nd flavor) E[Z|X] to add Z (education) to the auxiliary data X (racial group), using known education aggregates from census tables.
  • They then calibrate (1st flavor) estimates of Republican vote share among racial groups E[Y|X] using auxiliary data X,Z. This is done by estimating E[Y|X,Z] and then averaging over the distribution of Z | X.
  • They then calibrate (2nd flavor again) E[Y|X] to known aggregate E[Y] from election results.

They use “calibration” to refer to both flavors (bolding my own):

Second, we improve upon existing survey modeling methods by developing two new calibration techniques… multilevel regression and poststratification (MRP) for small area estimation. MRP uses hierarchical modeling and calibration weights …Furthermore, we develop a two way survey calibration, which simultaneously calibrates estimates to both election results by geography and an external survey, instead of only to geography.

Calibration in machine learning

I get even more confused when other fields (e.g. machine learning) also use “calibration”.

Our 2nd flavor of calibration, the “intercept correction”, ensures that E[Y] = E[Yhat]. This is sometimes called “mean calibration”. As noted above, we get this for free if sample and population are the same, and our regression of Y on X includes an intercept. It doesn’t matter if our model is correct at all !
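A small simulation makes the "for free" part concrete. This is a sketch with made-up data: the truth is quadratic, the fitted model is a straight line (so the model is wrong), and least squares is done by hand with the usual closed-form formulas.

```python
import random

random.seed(0)
n = 1000
x = [random.uniform(-2, 2) for _ in range(n)]
y = [xi ** 2 + random.gauss(0, 1) for xi in x]  # true E[Y|X] = X^2, not linear

xbar = sum(x) / n
ybar = sum(y) / n

# Least squares WITH an intercept: y ~ intercept + slope * x
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sxy / sxx
intercept = ybar - slope * xbar
mean_yhat = sum(intercept + slope * xi for xi in x) / n

# Least squares WITHOUT an intercept: y ~ slope0 * x
slope0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
mean_yhat0 = sum(slope0 * xi for xi in x) / n

print(f"mean of y:                       {ybar:.3f}")
print(f"mean of yhat, with intercept:    {mean_yhat:.3f}")
print(f"mean of yhat, without intercept: {mean_yhat0:.3f}")
```

With the intercept, the fitted values average to the mean of y up to floating-point error, even though a line is a poor description of these data. Without it, they do not.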

A stricter form of calibration requires E[Y | Yhat] = Yhat. See for example p.30 of PATTERNS, PREDICTIONS, AND ACTIONS: A story about machine learning, by Moritz Hardt and Benjamin Recht. Or Jessica’s posts, e.g. here. This holds if our model for E[Y|X] is correct, which is much harder than just throwing in an intercept term to get mean calibration.
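One rough way to look at the stricter condition is to bin observations by Yhat and compare the average Y to the average Yhat within each bin. The sketch below does this on simulated data for a misspecified model and a correctly specified one; the helper names (`fit_line`, `calibration_table`, `max_gap`) and the choice of five equal-size bins are assumptions made for illustration.

```python
import random


def fit_line(u, y):
    """Least-squares intercept and slope for y ~ a + b * u."""
    n = len(u)
    ubar, ybar = sum(u) / n, sum(y) / n
    slope = (sum((a - ubar) * (b - ybar) for a, b in zip(u, y))
             / sum((a - ubar) ** 2 for a in u))
    return ybar - slope * ubar, slope


def calibration_table(y, yhat, n_bins=5):
    """Sort by yhat, cut into equal-size bins, return (mean yhat, mean y) per bin."""
    pairs = sorted(zip(yhat, y))
    size = len(pairs) // n_bins
    table = []
    for b in range(n_bins):
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[b * size:end]
        table.append((sum(p[0] for p in chunk) / len(chunk),
                      sum(p[1] for p in chunk) / len(chunk)))
    return table


def max_gap(table):
    return max(abs(mean_y - mean_yhat) for mean_yhat, mean_y in table)


random.seed(1)
x = [random.uniform(-2, 2) for _ in range(1000)]
y = [xi ** 2 + random.gauss(0, 1) for xi in x]  # true E[Y|X] = X^2

# Wrong model: linear in x. Mean-calibrated, but E[Y | Yhat] != Yhat.
a, b = fit_line(x, y)
wrong = calibration_table(y, [a + b * xi for xi in x])

# Right model: linear in x^2.
x2 = [xi ** 2 for xi in x]
a2, b2 = fit_line(x2, y)
right = calibration_table(y, [a2 + b2 * u for u in x2])

print(f"largest bin gap, wrong model: {max_gap(wrong):.2f}")
print(f"largest bin gap, right model: {max_gap(right):.2f}")
```

Both models include an intercept, so both are mean-calibrated; only the correctly specified one has bin averages of Y that track the bin averages of Yhat.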

“Gold standard science”

I got this email the other day from a journalist at a major news organization:

I’m a science reporter from **, wondering if you’d have some time to talk/reflect on the Trump administration’s embrace and use of concepts like “Gold Standard Science” and replicability – and the extent to which this is being used to improve science, or not.

I responded that it’s impossible to take this seriously, given that the same people who are claiming to advocate so-called gold standard science have been energetically pushing junk social science such as unsupported claims of widespread election fraud and junk biological science such as, most notoriously, a discredited paper on vaccines and autism. This is the absolute opposite of gold-standard science.

Don’t get me wrong. The government has supported bad science in the past–both Brian Wansink and Cass Sunstein have held government posts (under Bush Jr. and Obama, respectively), and I guess that lots of papers in the glory days of Psychological Science and PNAS from 2010-2015 were conducted in part using public funding. And the notorious Excel error paper is said to have influenced government funding. That was all a little bit different, though, because these were projects that seemed reasonable at first and then only in retrospect were recognized to be fatally flawed. The vaccines and autism stuff, though: that’s junk science that the government is endorsing, years after the fraud has been revealed.

Then this news recently came out, “White House Health Report Included Fake Citations”:

A report on children’s health released by the Make America Healthy Again Commission referred to scientific papers that did not exist. . . .

“It makes me concerned about the rigor of the report, if these really basic citation practices aren’t being followed,” said Katherine Keyes, a professor of epidemiology at Columbia University who was listed as the author of a paper on mental health and substance use among adolescents. Dr. Keyes has not written any paper by the title the report cited, nor does one seem to exist by any author.

Props to Keyes for stating her objections so mildly. I’d be screaming right now had that happened to me. “Concerned about the rigor of the report,” indeed!

From the official “Restoring Gold Standard Science” statement:

Employees shall not engage in scientific misconduct nor knowingly rely on information resulting from scientific misconduct. . . . Except as prohibited by law, and consistent with relevant policies that protect national security or sensitive personal or confidential business information, agency heads shall . . . make publicly available the following information within the agency’s possession: the data, analyses, and conclusions associated with scientific and technological information produced or used by the agency that the agency reasonably assesses will have a clear and substantial effect on important public policies or important private sector decisions . . . the models and analyses (including, as applicable, the source code for such models) . . .

The good news, I guess, is that the Wakefield autism data and the recent White House health report have no actual data, nor do they have any source code, so . . . nothing needs to be made publicly available! There’s no way to publicly share data and references that don’t exist. On the other hand, it does seem to be the case that government employees are “engaging in scientific misconduct” and “knowingly relying on information resulting from scientific misconduct” by promoting these fraudulent statements, so that seems to be a problem.

That report with the fake citations was released on 22 May 2025 and the Gold Standard Science document is dated 23 May 2025 so maybe the authors of that report are off the hook, as their violation occurred before this new policy was announced.

According to the news article, a government spokesperson “did not answer a question about the source of the fabricated references and downplayed them as ‘minor citation and formatting errors.’” She said that “the substance of the MAHA report remains the same.”

We’ve seen that before! Fake data, fake references, garbled analysis, whatever it is, when critics point out the problems, the reaction is not to reassess but to double down. This is absolutely horrible behavior. The right thing would be to step back and say, “Hey, if we’re relying on discredited research and backing up our claims with fake citations, maybe we shouldn’t be so sure of ourselves.” But nooooo, they don’t do that. This is classic junk-science behavior.

Again, this is the opposite of anything that should be called “gold standard science.”

Beyond all that, the term “gold standard science” makes me uncomfortable, as I associate it with research in various fields where there is some causal identification and statistical significance which is then used to bully readers into accepting iffy claims. Setting aside anything about the current government, I’d be happy if the terms “gold standard” and “science” were never used in the same sentence (so I’d slightly change what I wrote on page 1 of this article from a few years back).

P.S. More here.

Pascal’s triangle, the Ramanujan principle, and what makes something look like a part of an ellipse or a part of a parabola?

John Cook writes:

The nth row of Pascal’s triangle contains the binomial coefficients C(n, r) for r ranging from 0 to n. For large n, if you print out the numbers in the nth row vertically in binary you can see a circular arc.

He explains:

The length of the numerical representation of a number is roughly proportional to its logarithm. Changing the base only changes the proportionality constant. The examples above suggests that a plot of the logarithms of a row of Pascal’s triangle will be a portion of a circle, up to some scaling of one of the axes, so in general we have an ellipse.
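To see the picture Cook describes, one can print a row in binary and compare the printed lengths to base-2 logarithms. This is a quick sketch; the choice n = 40 is arbitrary.

```python
import math

n = 40  # arbitrary row; larger n gives a smoother arc
row = [math.comb(n, r) for r in range(n + 1)]

# Printing the row vertically in binary: the right edge of the text traces the arc.
for c in row:
    print(bin(c)[2:])

# The printed length of each number is its bit length, which is floor(log2(c)) + 1.
bit_lengths = [c.bit_length() for c in row]
log2_values = [math.log2(c) for c in row]
print(bit_lengths)
```

The bit length differs from log2 by less than one, so the silhouette of the printed column is essentially a plot of log2 C(n, r) against r.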

Cook continues with an explanation of why the ellipse fits so well:

Wolfgang pointed out that the curve should be a parabola rather than an ellipse because the binomial distribution is asymptotically normal. Makes perfect sense.

So I redid my plots with the parabola that interpolates log C(n, r) at 0, n/2, and n. This also gives a very good fit, but not as good!

But that’s not a fair comparison because it’s comparing the best (least squares) elliptical fit to a convenient parabolic fit.

So I redid my plots again with the least squares parabolic fit. The fit was better, but still not as good as the elliptical fit.

I think the reason the ellipse fits better than the parabola has to do with the limitations of the central limit theorem. First of all, it applies to CDFs, not PDFs. Second, it applies to absolute error, not relative error. In practice, the CLT gives a good approximation in the middle but not in the tails. With all the curves mentioned above, the maximum error is in the tails.

Beyond the issue of the tails, I think there’s a perceptual issue, which is that we learn about parabolas in their convex orientation, as here:

The other thing is that a circle or an ellipse is finite and a parabola keeps going forever. So, the very fact that this graph stops makes it look less parabola-like, as compared to the sort of graph you might make where you can visually follow the curve off the edge of the graph.

The Ramanujan principle

Also I wanted to connect Cook’s point, that a table of numbers expressed in positional notation is approximately a graph of their logarithms, to the Ramanujan principle:

Tables are commonly read as crude graphs: what you notice in a table of numbers is (a) the minus signs, and thus which values are positive and which are negative, and (b) the length of each number, that is, its order of magnitude.

The name of the principle comes from a famous story of the mathematician Srinivasa Ramanujan supposedly conjecturing the asymptotic form of the partition function based on a look at a table of the first several partition numbers: he was essentially looking at a graph on the logarithmic scale.
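For a sense of what such a table looks like, here is a sketch that computes partition numbers with the standard dynamic-programming recurrence and prints every 20th one. The spacing and the range up to 200 are arbitrary choices, and this is not a reconstruction of the table in the story.

```python
def partition_numbers(n_max):
    """Return [p(0), p(1), ..., p(n_max)], the number of partitions of each n."""
    p = [1] + [0] * n_max
    # Allow parts of size 1, then 2, ...; each pass adds partitions using that part size.
    for part in range(1, n_max + 1):
        for total in range(part, n_max + 1):
            p[total] += p[total - part]
    return p


p = partition_numbers(200)
for n in range(0, 201, 20):
    # Right edge of the printed numbers = number of digits ~ log10 p(n)
    print(f"{n:3d}  {p[n]}")
```

Read as a crude graph, the right edge of the column grows quickly at first and then more slowly, which is the shape of a logarithm growing like a multiple of the square root of n, as in the Hardy-Ramanujan asymptotic formula.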

Survey Statistics: it is the people

Alan Zaslavsky and his course drew me into survey statistics. We focused on a simple question at the heart of statistics: how can we make statements about a population, given a sample? We need to represent everyone, because we all matter and are all unique. But not everyone can be in our sample. As Andrew says, this is what makes it so hard.

Why look at survey data ? As my teammate David Shor says here, super politically engaged people are overrepresented in the media. Survey data provides a counterbalance to that, aiming to represent everyone.

So let’s jump in with a new blog series !

Who is this blog series for ? Mainly folks who already know some survey statistics and want to learn more together. Folks who have heard of, could define, or even use these concepts, but have questions about them:

I’ll attempt to introduce concepts along the way. But there will be gaps, which we should chat about in the comments. And I will ask you questions, so please participate. It is people that make survey statistics (and anything) great.

p.s. The title comes from It is The People: A Pacific Crest Trail Film by Elina Osborne. She quotes the Māori proverb “What is the most important thing ? It is people, it is people, it is people”. I just did a long “solo” hike of the Virginia section of the Appalachian Trail, where I learned so much from hikers and trail town residents. As a survey statistician, I get to keep listening and learning.

Names in fiction (Perkus Tooth, Morrison Roog, Ragle Gumm, Addison Doug, Bodie Kane, and Thalia Keith)

One of the books currently on our bathroom shelf is a collection of stories called The Book of Other People. I bought it because it contains a story by Daniel Clowes (somehow this embarrasses me, in the same way I’m embarrassed to be a fan of R.E.M.). Lots of the stories in the book are excellent. The one I want to talk about today is Perkus Tooth, by Jonathan Lethem. I usually don’t like Lethem’s writing (sorry, Phil!), and I can’t say I loved this story either, but the names . . . ahhhh, the names! Perkus Tooth, Morrison Roog, and lots more. I was reminded of Philip K. Dick (and not just from the “Roog,” which I guess is a direct homage), whose characters had unforgettable everyman-loser names like Joe Chip, Bob Arctor, Ragle Gumm, and, my personal favorite, Addison Doug.

It’s not easy to come up with such names. Dickens could do it, of course. Updike too: it takes guts to name your character Rabbit Angstrom. But not every writer has it. Recently I read the novel I Have Some Questions for You, by Rebecca Makkai–it was excellent, I recommend it!–but the names, they were nothing special. It’s not that the names of the characters in that book were wrong, exactly, or cliched; it’s more that they were almost too logical, as if at every branch of the tree she chose the most probable outcome. They didn’t have the right level of idiosyncrasy. Don’t get me wrong, I still think this was a great book on many levels, I enjoyed reading it, and I’m still thinking about it after it was over. The names thing isn’t the most important part of the book. All the names were fine; just in the whole they fit in too well. It would be better with a little friction. In contrast, Meg Wolitzer is good with names: she has a way of making them realistic but still special in some way. I bring up Wolitzer because I was reminded of her book when reading Makkai’s novel. They had a similar feel: a reflection of youthful friendships from the perspective of adulthood. Also Claire Messud: her characters have good names too: Ludovic Seeley and all the rest.

The ladder of abstraction in statistical graphics

It was so much fun having a graphics post yesterday that I thought I’d do another, this time sharing one of my favorite recent articles, which begins:

Graphical forms such as scatterplots, line plots, and histograms are so familiar that it can be easy to forget how abstract they are. As a result, we often produce graphs that are difficult to follow. We propose a strategy for graphical communication by climbing a ladder of abstraction, starting with simple plots of special cases and then at each step embedding a graph into a more general framework. We demonstrate with two examples, first graphing a set of equations related to a modeled trajectory and then graphing data from an analysis of income and voting.

I really like this idea of presenting a sequence of increasingly abstract graphs. It’s kind of a graphical analogue to statistical workflow, in that we can understand the more complicated product by explicitly connecting it to the simpler steps that came before. All too often, we have this killer graph which we then have to spend lots of time explaining. Instead of presenting the graph and providing a separate explanation, my new recommendation is to build up from simpler graphs, explaining each new degree of freedom as it comes up.

Statistical graphics: When does it make sense to introduce deliberate distortion to counteract an expected perceptual illusion?

It’s been a while since we’ve had a post entirely devoted to graphics!

Kaiser writes:

The link here contains an example of how the line-angle illusion can lead to misreading of trends on line charts:

Is there a bigger difference in revenue at Time 1 than Time 2? Many of us will think so but on careful judgment, I think all of us can agree that the difference at Time 2 is in fact larger. . . .

Studies have shown that humans tend to read not the vertical gaps but the angular gaps. Again, this issue is illustrated in the first mentioned paper:

Matthias explained that their implementation of the hammock plot uses a strategy to counteract this line-angle illusion.

I take this to mean they distort the data in such a way that after readers apply the line-angle illusion, the resulting view would convey correctly the correct trend. A kind of double negative strategy. The paper linked above offers one such counter-illusion strategy.

I imagine this is a bit controversial as we are introducing deliberate distortion to counteract an expected perceptual illusion.

I’m not aware of any software that offers built-in functions that perform this type of illusion-busting adjustments. Do you know any?

I’ve actually thought about this question a lot! In some sense, just about all statistical graphs introduce deliberate distortion to counteract an expected perceptual illusion, in the sense that, with the exception of maps and astronomical charts, a graph is an abstract representation of data.

But to get closer to what Kaiser is asking: the analogy I’ve given is, suppose you’re building a wooden chair but using boards that are warped. In this case, the right thing to do is to incorporate the warp into the design, i.e. cut some pieces shorter than others and at different angles, etc., so that they fit together as is, rather than trying to go all rectilinear and then glue/nail everything together. The trouble with the latter strategy is that the wood will exert pressure on the joints and eventually the chair will break or distort itself in some way.

So, similarly, if you can anticipate that a graph will be misread, it’s a good idea to account for this possibility in the design.

Most simply we do this by just not making graphs such as tilted pie charts, 3-D bar charts, and other gimmicks that jump off the page and engage all sorts of visual illusions.

The other common option is to make an additional graph. For example, you could keep the graph above with the two time series but just add another graph showing their difference. Why not?

But what about Kaiser’s original question: are there any graphs with deliberate distortions designed to counteract perceptual illusions of visual artifacts? (I guess that a log transformation doesn’t count here.)

I don’t have any perfect examples here, but I have one example from my applied research that comes close.

The example comes from my paper with Yotam on social penumbras. First there’s the explanatory diagram:

Then the data graph:

There are two things going on here.

First, the diagram is a circle but the data graphs are quarter circles, which we did for two reasons: (a) the quarter circle takes up only a quarter as much space, which is important when we’re displaying 14 of these at once (yes, you could just make smaller full circles but then you have only half the resolution when comparing sizes), and (b) for the goal of comparing one group to the next, quarter circles are better because you can compare the slices, as compared to full circles which all just look like bullseyes and are hard to tell apart.

Second, areas of shapes are notoriously difficult to compare. Why did we do these damn circle plots at all? Why not dot plots or repeated bar charts or some other visualization that would facilitate linear comparisons? The answer is that it was important to us to preserve the “feel” of the penumbra, the idea of concentric social groups. We were willing to pay a bit in statistical clarity in order to have this conceptual unity of the graph and the content of the paper.

But then the issue arises that, when comparing areas, people don’t really compare areas. Nor do they compare linear dimensions. At least according to Cleveland’s classic book, the implicit comparison is something in between. So by displaying the data as areas, we’re knowingly handing people a distortion. For example, if a certain group represents 1% of the population, then the core group (the yellow circle in the graph) will take up 1% of the area of the full circle and thus will be 10% in linear dimension.

That’s bad, right? Maybe not! Several of the groups in our study did have core populations, and if these were displayed as 1% in the linear dimension, they’d be really hard to tell apart. By using these intuitive-looking area graphs, we’re implicitly doing a square-root transformation without having to explain it. So I think it’s fair to say that we’re taking advantage of a perceptual illusion.
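The implicit square-root transformation is easy to tabulate. In this sketch, `linear_fraction` is a made-up helper name and the population shares other than 1% are hypothetical.

```python
import math


def linear_fraction(area_fraction):
    """Radius of the inner circle relative to the full circle, given its share of the area."""
    return math.sqrt(area_fraction)


# Hypothetical core-group shares of the population.
for share in (0.005, 0.01, 0.02, 0.05):
    print(f"area share {share:6.1%} -> linear share {linear_fraction(share):6.1%}")
```

Shares that would be nearly indistinguishable as lengths (0.5% versus 2%) become radii of roughly 7% versus 14% of the full circle, which is much easier to see.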

P.S. Pro tip: Do you see how, in our graph, we order the groups by increasing size? Not alphabetically. You should almost never display your data alphabetically. OK, here’s a rare counterexample, with a slightly prettier visualization and some more graphs in Section 2.3 (“All graphs are comparisons”) of Regression and Other Stories:

For these we displayed the data straight up, no distortion.

LLMs as behavioral study participants

This is Jessica. There is lots of talk these days about how generative models will transform social science. Think using LLMs to simulate human behavior for purposes like designing and conducting social science studies, marketing research, or testing social systems, persuasive messaging, interfaces, etc.  Much of this is still contentious, but there is some consensus emerging around how LLMs can help versus hinder progress in social science research. 

Here I’m mostly going to consider using generative models in experimental studies of human behavior. Initially I started paying attention to this because of the potential trainwreck vibes from a statistical validation perspective… social scientists start relying on simulations with models that even the computer scientists don’t fully understand to learn about people, what could go wrong? There is lots of room for overinterpreting noise. But given the value (economic, epistemic, etc) of being able to predict what people will do, I think it’s worth considering what a rigorous methodology is for this kind of simulation science. As I write this, I’m on my way to a workshop on LLM-based behavioral simulations which will hopefully give me more food for thought on what this does / does not look like. 

At a high level, the emerging consensus is that LLMs may never make good substitutes for human study participants in the sense of letting us learn new things about human behavior without having to deal with humans. How do you discover new facts about the cognitive or social world until you’ve attempted to understand how generative models align with human behavior for those research questions? Even if your LLM-based simulation ends up aligning well with human results, you will ultimately have to collect enough human results to validate that you have a good simulation. It’s like trying to figure out how to generate synthetic data from real survey results before those results have been analyzed: you won’t know what is most important to preserve. At a more basic level, how do you expect to accurately simulate conditional distributions you’ve never observed? LLMs may do well on imputation-style problems (like filling in missing answers to some survey questions, or inferring what responses would have been to questions that hadn’t been asked yet) but we should not necessarily expect good performance when we try to extrapolate to new scenarios.

One problem is that there will be biases that can compound across simulations. Similar to how having a huge sample size doesn’t necessarily give you better estimates in behavioral science when there’s selection bias operating, the fact that LLM simulation results often differ in some ways–and in particular, tend to produce more “extreme” results than humans–leaves the risk that we mislead ourselves. For example, LLM simulations often result in lower variance and diversity and more pronounced stereotypes than human results. Many of the studies I’ve seen people try to replicate with LLMs so far focus on replicating average treatment effects that are assumed to be constant, usually with representative U.S. samples. There’s some evidence that in such cases, effect sizes identified using generative models can be highly correlated with those based on humans, including for studies that could not be in the models’ training data. But when the focus is on heterogeneous effects or particular groups, things may get more distorted. In general, bias is challenging given that when we run experiments we are typically trying to understand the “edges” of some effect, e.g., by controlling for confounders while imposing interventions that we think will maximize the target difference. This isn’t to say that some progress can’t be made, e.g., by doing a bunch of generalization tests to identify scenarios where generative agents provide a useful impressionistic summary of human behavior, and being careful not to take them too far out of that neighborhood. And bigger/newer models appear to reduce distortions. But for brand new scenarios we’re stuck with heuristic approaches.

There is some interest in using LLMs as a first pass tool to prioritize existing results in need of replication. I’m not very enthusiastic about this either, given the shaky foundations of replication as a hallmark of good science. The LLM-based replications of social science studies I’ve seen so far are a mixed bag, with the same kinds of arbitrariness you see in human-based replications when it comes to determining how well a behavior has replicated (e.g., asking is the direction of the effect the same while ignoring big differences in magnitude). Some interesting questions come up about what it means to perform a valid replication with LLMs given their propensity for memorization, and to validate that an agent’s actions are a meaningful simulation of human behavior. Do we need to establish that an agent’s actions are self-consistent or “intentional” in the same way that human actions can be before we can trust them? For example, I’ve seen authors asking LLM agents to explain their behavior as evidence that a replication is trustworthy, similar to how you might elicit a human participant’s reasoning. This kind of thing is a big fraught pile of worms. There are also many additional degrees of freedom relative to human replications because you often have to adjust the experimental procedure to get reasonably robust results from LLMs, due to sensitivity to prompt variations and strategies. There is some advice emerging on how to identify more consistent prompts, prompt with demographic attributes, aggregate simulation results, etc. See e.g., here.  

Where LLM simulations have clearer promise is in exploratory theory building. John Horton’s (now old) 2023 paper describes how LLMs can play the role of economic models, where before conducting experiments you use them to understand the space of possible effects under particular assumptions about behavior. For example, maybe you want to understand how agents that are myopic in a particular way solve a negotiation problem, to help you brainstorm what you might see in a study with humans. Or maybe you want help thinking through the range of effects you might see from different types of participants in a study you’re designing, to help you with sample size calculations. If we agree that thorough piloting is generally a good thing for behavioral research, then LLMs can be helpful in this regard, and potentially even more so when they are also used more directly to assist brainstorming, e.g., by generating hypotheses about important covariates.

There’s a tension in some of this literature between focusing on how well LLMs capture idealized behavior (e.g., economic rationality, predicting future events) versus how well they can be used to simulate human behavior with all of its biases. There are downstream use cases for both. If I’m having a generative agent negotiate deals or manage my wealth on my behalf, I want them to be more strategic and rational than I am, whereas if I am a company trying to understand what my customers value most in my products, I will prefer realism. The LLM-as-idealized-human approach is cleaner to study, as we have a better sense of what we’re looking for, but also potentially more limited (since there is a lot we can do with non-generative e.g., “rational” models in this regard). 

Russian roulette: You can have a deterministic potential-outcome framework, or an asymmetric utility function, but not both

Jonas Mikhaeil and I write:

It has been proposed in medical decision analysis to express the “first do no harm” principle as an asymmetric utility function in which the loss from killing a patient would count more than the gain from saving a life. Such a utility depends on unrealized potential outcomes, and we show how this yields a paradoxical decision recommendation in a simple hypothetical example involving games of Russian roulette. The problem is resolved if we allow the potential outcomes to be random variables. This leads us to conclude that, if you are interested in this sort of asymmetric utility function, you need to move to the stochastic potential outcome framework. We discuss the implications of the choice of parameterization in this setting.

I like this paper! Working out the example and writing it up helped me understand a bunch of things that had puzzled me regarding causal modeling and inference.

Jonas and I engaged on this project after hearing from Amanda Kowalski about her recent paper with Neil Christy, which got us thinking about what you can get from stochastic models for potential outcomes.

P.S. Here’s the final version of our paper, ultimately titled “Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals.”

Ecologists’ endless quest for automatic inference

This post is by Lizzie.

At the end of a recent course I taught on Bayesian approaches (which reminds me I should blog an update on that) a student asked ‘so when do we divide up our data into test and training?’ This stopped me a little as the whole course was on a workflow approach to science and stats that I hoped hammered home how to gain mechanistic insights from simulated data, preparing you for more insights using retrodictive checks on a model fit to your empirical data, etc. I was on the spot suddenly realizing some gaps and failures in my course content. I also should not have been surprised, as ecologists are going in big time on machine learning (are there other uses for test/training data? Yes, but that’s the dominant place this language is in use in my field now, IMHO), and we (I) don’t step back and teach the different approaches.

In discussing this with a stats colleague recently he mentioned the endless search for automatic inference. ‘Feed in data, pull crank, get scientific inference.’ It’s the opposite of the workflow to me. I also think it’s not going to work well, but it’s clearly the dream, and an alarming percent of ecology is devoted to it, without even knowing it.

Machine learning is the new best hope of automatic inference for ecology (and a lot of other fields) without anyone seeming to notice what they’re not getting. It’s amazing to me how many students seem blithely unaware of what machine learning is going to give you — (good) predictions for out-of-sample data, but a difficult time finding interpretable parameters and all the science that can go with them. (And, yes, I know some of the machine learning approaches are working on changing this.) So they see it as the inference approach.

The previous best hope of automatic inference was model comparison (LOO is the new magic, AIC was a big — BIG — hit, before that was stepwise regression with an alarming number of ecologists never learning any potential for problems with stepwise regression, but I digress) and it’s still running strong in some circles. Fit 6 or 600 or so models and compare them to see which is best. In my area, the models balloon since we have no idea what climatic driver to include. For example, I think water matters to trees growing outside, so for a precipitation variable, should I use total precipitation? Or maybe just during the growing season? Or, wait, maybe divide up growing and non-growing season. But then for the non-growing season, should I use snow depth? Snow water equivalent (SWE)? This is so hard, and there’s no clear answer.

Automatic inference to the rescue! You can put them all in with model comparison, including a suite of possible interactions, and see which ones really matter. Yay!

Did this work? Not at all if you ask me. I recently saw a tree ring talk that did this but you can tell the best fitting model actually made no biological sense after they thought about it more, so they presented the ‘second best model.’ And I am quite sure the second and third best model were pretty similar in any comparison metric you wanted to throw at them and they might have had really different answers to how the world works. (Ecologists have tried one way around this — model averaging, which I don’t think offers much either.) I am not sure why everyone is doing this other than that (1) we have all tacitly agreed it’s okay and (2) the other option seems harder, more uncertain and maybe we have not all tacitly agreed it’s okay.
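To make the "second and third best model were pretty similar" point concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the variable names, the correlations among the precipitation measures, the sample size, and the simulated "truth" are assumptions chosen for illustration, not anything from a real tree-ring dataset. It fits one single-predictor regression per candidate driver and ranks them by AIC.

```python
import numpy as np

def aic_ols(y, X):
    """AIC for a Gaussian linear model fit by least squares (intercept added here)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the residual variance
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(1)
n = 60  # hypothetical number of years of tree-ring data

# Hypothetical precipitation measures, built to be strongly correlated.
total_precip = rng.normal(size=n)
growing_precip = 0.9 * total_precip + 0.45 * rng.normal(size=n)
snow_depth = 0.8 * total_precip + 0.6 * rng.normal(size=n)
swe = 0.95 * snow_depth + 0.3 * rng.normal(size=n)

# Assumed truth for this simulation: only growing-season precipitation matters.
growth = 0.5 * growing_precip + rng.normal(size=n)

candidates = {
    "total_precip": total_precip,
    "growing_precip": growing_precip,
    "snow_depth": snow_depth,
    "swe": swe,
}

# One single-predictor model per candidate driver, ranked from lowest AIC up.
results = sorted((aic_ols(growth, x), name) for name, x in candidates.items())
for aic, name in results:
    print(f"{name:15s} AIC = {aic:7.2f}  (delta = {aic - results[0][0]:5.2f})")
```

With correlated candidates and a modest sample size, the AIC gaps between the top few models will often be small, and changing the seed can reorder them. The ranking alone does not tell you how the world works.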

What have we never gotten out of this as best I can tell:
(a) We start to see new patterns in what matters in these model comparisons and say, ‘hey — all this work together really shows we should focus on SWE in this context. Thank goodness we did model comparison as there is no other way we would have figured this out.’
(b) We use something we learned in model comparison to design an experiment that teaches us something new. Like, ‘wow, I never thought extreme heat in August would be so important, I will now set up an experiment to test the role of extreme heat in August. I am so glad I put that predictor — and extreme heat in every other month and in 3-month windows — in my model so I could find this out.’
(c) The feeling of joy at saying, ‘look at my minimum adequate model! This is great and so helpful.’
We never get these things because the results are almost always a mess. We all know this as best I can tell so we don’t even look closely at them as reviewers any more.

What’s the other option?

The other option to me is that you pick your few best-guess damn variables — the ones you can make predictions about and describe the functional relationship of them to your response variable(s) and you put those in your model. Maybe you fit a few models, but not endless models. In my experience, the first step in this process alone (picking those variables) gains me way more insights than any model comparison ever has. Why? Because it’s the opposite of automatic inference. It requires me to think.

What’s the downside of this other option? One would be that we pick the wrong predictors and never see that amazing predictor we would have just tossed in on model comparison. But given where 20+ years of model comparison has gotten us I am discounting this possibility. The other — and this is what students in my classes are really worried about — is that we don’t all tacitly agree this is okay. Many students I suggest this to don’t think it’s okay. They see how widespread model comparison and its ilk are and worry they cannot get published without it. They aren’t even trained in how to pick those variables.

We’re so over the top on automatic inference we don’t even train our students to be prepared for anything else. And worse yet, we tell them they’re doing (good) science.

With machine learning* we’re slipping even further away from science and our training is getting even worse as best I can tell. Students at UBC in data science learn to ‘tidy’ data as though there is no domain expertise in this process. ‘Tidy’ means removing outliers, gap filling and other things that horrify me to see students learn in their first term. How on earth do they know what an outlier is when they don’t even know what the data are? After this they learn random forests and some simple neural nets. Science done.

What’s the solution? I desperately hope people smarter than me are working on this question. One answer is obviously raising our standards and discounting work that doesn’t really give us much from whatever model comparison they used. Another is better training — I think we all need to admit that training has got to change with machine learning on the rise. A lot of students I work with now only take data science — they learn only machine learning and don’t know what a regression is or think it is anything they use. They need to see how interconnected all the inference methods are and what aims each one works well on for now (and not) and be prepared that that might change. This seems tractable. What seems less tractable is better training in science — training students to know there’s no automatic inference for science and getting useful insights is actually messier, harder, and involves more uncertainty than most people tell you (but, if you ask me, it’s also a lot more fun).

*We’re somehow also now calling most of machine learning ‘AI’ in ecology. Are other fields doing this? Why (I mean, other than wanting to sound like you’re doing the absolute coolest, most cutting edge thing)?

Jerzy Neyman, Sigmund Freud, and Milton Friedman walk into a bar . . . (the mistaken association of null hypothesis testing with rigor)

This discussion thread reminded me of the pervasive way in which null hypothesis significance testing is (mistakenly) thought to have some special level of rigor not possessed by other methods in statistics and machine learning. Christian Robert and I discuss this in our article, “Not only defended but also applied”: The perceived absurdity of Bayesian inference.

I see an analogy to strict theories in other fields that have an air of austere rigor. I’m thinking of rational choice theory in political science, monetarism in economics, and Freudian psychoanalysis in the 1940s-1970s (there were all sorts of therapies, but 5-days-a-week Freudian analysis had an air of rigor back then, I think; there was the idea that “strict Freudian” therapy was the best), or what was called “theory” in literary studies a couple decades ago. There was the sense that these were the most rigorous, or coldly analytical, approaches: expensive regimens, difficult to follow, but the most effective. Recall our discussion of how economics in the early 2000s was like Freudian psychology in the 1950s.

Election analytics positions available at the New York Times

Will Davis writes:

I oversee the Election Analytics department (aka The Needle and The Times/Siena Poll) at The Times.

We’re lucky enough to have two great roles open on the team right now for people excited to make a career in the field of election analytics.

* Election analyst: This is a mostly technical role, but it comes with the opportunity to write. This person would be responsible for some of the most essential elements of our election polling and modeling — modeling unit-level turnout and vote share, creating baselines for The Needle and helping us continue to innovate the design of The Times/Siena Poll.

* Election researcher: This is a great job for somebody with less of a technical skillset who’s eager to be around a team they can learn a ton from. It’s primarily focused on manual research and data entry.

These sound like excellent opportunities to combine statistical modeling, computation, and graphics.

Taking our Models Seriously (my talk at StanBio Connect, this Friday 9am)

StanBio is a free, one-day online conference that will take place on Friday, 30 May 2025 from 9 am to 5 pm ET. Here’s my talk:

Taking our Models Seriously

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

In biomedical research we often use models that are “mechanistic” rather than “phenomenological”: that is, we try to model an underlying process rather than simply fitting a curve. Mechanistic models are necessary for inference with latent variables (as in pharmacology when there is interest in concentration within an internal organ) and useful for learning and extrapolating from sparse data. We discuss statistical and computational challenges we have resolved, along with some open problems.

Now that I’m posting this, I can’t remember what I was going to say! I guess I’ll have to figure it out between now and Friday morning.

Market and antimarket: The story of the Berkeley Electronic Press

William Davies writes:

The words ‘market’ and ‘capitalism’ are frequently used as if they were synonymous. Especially where someone is defending the ‘free market’, it is generally understood that they are also making an argument for ‘capitalism’. Yet the two terms can also denote very different sets of institutions and logics. According to the taxonomy developed by the economic historian Fernand Braudel, they may even be opposed to each other.

In Braudel’s analogy, long phases of economic history are layered one on top of another like the storeys of a house. At the bottom is ‘material life’, an opaque world of basic consumption, production and reproduction. Above this sits ‘economic life’, the world of markets, in which people encounter one another as equals in relations of exchange, but also as potential competitors. Markets are characterised by transparency: prices are public, and all relevant activity is visible to everyone. And because of competition, profits are minimal, little more than a ‘wage’ for the seller. Sitting on top of ‘economic life’ is ‘capitalism’. This, as Braudel sees it, is the zone of the ‘antimarket’: a world of opacity, monopoly, concentration of power and wealth, and the kinds of exceptional profit that can be achieved only by escaping the norms of ‘economic life’. Market traders engage with one another at a designated time and place, abiding by shared rules (think of a town square on market day); capitalists exploit their unrivalled control over time and space in order to impose their rules on everyone else (think of Wall Street). . . . Capitalism, in Braudel’s words, is ‘where the great predators roam and the law of the jungle operates’.

Interesting. Put this way, it all seems obvious, and I guess this must all be well known in economics, but I’ve never thought of it that way.

There can be no sharp distinction between “economic life” and “capitalism” or between “the market” and “the law of the jungle”—even the largest companies have to compete in some way in order to make payroll—but, yeah, the idea of capitalism as an antimarket, that makes sense. I was already familiar with the idea that firms are, in the words of Dan Davies, “islands of central planning linked by bridges of price signals,” but I hadn’t thought about this as a sort of definition of capitalism. (Dan Davies is, I assume, not directly related to William Davies who was quoted above.)

Here’s an example. Later in his article, William Davies writes:

Academic publishing, for example, is one of the most egregious rent-grabs around. Scholars, editors and reviewers work for free, so that large copyright-protected conglomerates can charge libraries several thousand pounds a year for digital access to journals they can’t do without. The profit margins of the big scientific publishers run as high as 40 per cent, enough to make the boss of Shell blush. Hence the enthusiasm for projects such as the not-for-profit Open Library of Humanities, set up by Birkbeck academics in 2013, which now publishes 33 open access journals per year. When it’s capitalism that’s the problem, and not markets, the only alternative is post-capitalism.

There’s some truth to that. I use Arxiv and my home page and, for that matter, this blog, to communicate scientific ideas directly without paying rent to Elsevier etc. I also publish articles in journals and I publish books with for-profit and non-profit publishers, so you could say I operate in some sort of mixed economy of publication.

But then I can tell you a story that puts us back into the capitalism-as-antimarket situation.

About 25 years ago, my friend Aaron Edlin started a set of journals which he called the Berkeley Electronic Press. The clever idea was that his journals would be online-only and freely accessible to all and, if you published a paper for one of his journals, you agreed to review some number of submissions. Also, the journals were arranged in four different tiers: you’d submit an article, and the editors would decide based on the reviews which tier your article would go in. Aaron’s an economist, and these innovations seemed like great resolutions of the problem of hassling reviewers and the problem of deciding what to publish. I published a paper in one of Aaron’s journals, back in the day, and it all went very smoothly. My article ended up in the third-tier “Contributions” category, and it’s only been cited 30 times, but, hey, what are you gonna do? The experience was much better than the usual story with academic journals where they act like they’re doing you some sort of huge favor for publishing your article. It was all very efficient and low-key. Aaron got his friends to edit some of these journals. He asked me too, but I was too busy.

In any case, their original business model didn’t seem to have worked out. Now they’ve just become one more crappy series of paywalled journals. I went to the Berkeley Electronic Press website and saw this: “In 2011, bepress chose to exit the commercial subscription-based journal business in order to focus all of our energies on our open access services; this meant selling the 60+ bepress journals which we had published for the last decade.”

I’m thinking that the mistake was to have 60+ journals in the first place. How can you possibly keep track of all of that? Maybe they should’ve just capped their number of journals at 10, and then it could all have worked out, I dunno. I don’t fault Aaron for this—I’ve started all sorts of projects that didn’t continue the way I’d originally planned, and nothing lasts forever in any case. He did keep those journals going for a few years, which isn’t nothing.

The relevance to the main theme of this post is that the Berkeley Electronic Press started out as some sort of cooperative or possibly market-based system but then got sucked into the capitalist antimarket.

We think of capitalism = market and cooperative being the opposite of a market, but in this case the connections go differently. The cooperative and market versions of the Berkeley Electronic Press are similar in that they involve some sort of open exchange between independent agents, whereas the capitalist antimarket version is all happening behind many layers of obscurity.

I recognize that none of this is new to economists. This particular perspective was new to me, though, hence this post.

“Perplexing Plots”: Crime fiction, modernism, and the air of rigor

I just finished this book, Perplexing Plots, by David Bordwell. It’s excellent. I don’t know that most of you would like it, but it was right in my sweet spot. The book came out in 2023, and I was sorry to learn that the author has recently died. He was already retired, and none of us live forever, but it still made me sad. Reading the book made me want to have a conversation with him.

Another way of looking at it, though, is that it’s wonderful that Bordwell managed to finish this book before he exhausted his allotted time on Earth. Also, he and his wife, Kristin Thompson, had a blog. Which I’ve added to the “Not currently active” section of our link page. Which I guess is where all blogs will eventually end up.

Experimentation

Regarding the book itself: Bordwell treats crime fiction and noir film in parallel, not comparing the two so much as considering them as a single entity (along with the now-minor field of crime/mystery stage drama). He draws connections between these popular forms (generally considered “lowbrow” or “middlebrow”) and “highbrow” modernist fiction.

One connection is the idea of experimentation. Modernist fiction is famously experimental. Bordwell argues that popular crime fiction and film have been able to get away with a lot of experimentation too, in part by working within familiar forms–he gives many examples, including the heist movies The Killing (Kubrick) and Pulp Fiction (Tarantino).

Rigor

Another connection between crime fiction and modernism, which I don’t recall explicitly mentioned by Bordwell, is the air of rigor. Modernist fiction, poetry, art, and architecture are associated with following strict, often unnatural-seeming rules, and there’s a sense that, to appreciate them, you need to give into their constraints. To say that you don’t like a modernist story because it makes no sense, or that you don’t like a modernist chair because it’s uncomfortable, that would be missing the point, as a key aspect of modernism is the rejection of traditional expectations and comforts.

Detective stories, and crime fiction more generally, have their own areas of rigor, from the “fair play” rules of the so-called golden age, to later expectations regarding characterization, point of view, and suspense. As with modernism, constraints can facilitate experimentation.

“Exploratory data analysis” and “confirmatory data analysis” are the same thing.

This is not new–it just happened to come up in class the other day and I thought it was worth saying again. I made the point in my 2003 article, A Bayesian formulation of exploratory data analysis and goodness-of-fit testing, and in chapter 6 of Bayesian Data Analysis, back in 1995.

Here’s an explainer from 2010:

So-called exploratory and confirmatory methods are not in opposition (as is commonly assumed) but rather go together. The history on this is that “confirmatory data analysis” refers to p-values, while “exploratory data analysis” is all about graphs, but both these approaches are ways of checking models. Bayesians should embrace graphical displays of data—which I interpret as visual posterior predictive checks—rather than, as is typical, treating exploratory data analysis as something to be done quickly before getting to the real work of modeling.

Here’s how I see things usually going in a work of applied statistics:

Step 1: Exploratory data analysis. Some plots of raw data, possibly used to determine a transformation.

Step 2: The main analysis—maybe model-based, maybe non-parametric, whatever. It is typically focused, not exploratory.

Step 3: That’s it.

I have a big problem with Step 3 (as maybe you could tell already). Sometimes you’ll also see some conventional model checks such as chi-squared tests or qq plots, but rarely anything exploratory. Which is really too bad, considering that a good model can make exploratory data analysis much more effective and, conversely, I’ll understand and trust a model a lot more after seeing it displayed graphically along with data.

There’s some history to all of this. John Tukey’s classic book, Exploratory Data Analysis, was published in 1977 but I believe he began to work in that area in the 1960s, about ten or fifteen years after doing his also extremely influential work on multiple comparisons (that is, confirmatory data analysis). I’ve always assumed that Tukey was finding p-values to be too limited a tool for doing serious applied statistics–something like playing the piano with mittens. I’m sure Tukey was super-clever at using the methods he had to learn from data, but it must have come to him that he was getting the most from his graphical displays of p-values and the like, rather than from their Type 1 and Type 2 error probabilities that he’d previously focused so strongly on. From there it was perhaps natural to ditch the p-values and the models entirely–as I’ve written before, I think Tukey went a bit too far in this particular direction–and see what he could learn by plotting raw data. This turned out to be an extremely fruitful direction for researchers, and followers in the Tukey tradition–I’m thinking of statisticians such as Bill Cleveland, Howard Wainer, Andreas Buja, Diane Cook, Antony Unwin, etc.–continued to make progress here. More recently, this has overlapped with work by Hadley Wickham and others on EDA-friendly graphics software and work by Jessica Hullman and others on the communication of uncertainty and variation.

The actual methods and case studies in the EDA book . . . well, that’s another story. Hanging rootograms, stem-and-leaf plots, goofy plots of interactions, the January temperature in Yuma, Arizona—all of this is best forgotten or, at best, remembered as an inspiration for important later work. Tukey was a compelling writer, though–I’ll give him that. I read Exploratory Data Analysis twenty-five years ago and was captivated. At some point I escaped its spell and asked myself why I should care about the temperature in Yuma–but, at the time, it all made perfect sense. Even more so once I realized that his methods are ultimately model-based and can be even more effective if understood in that way (a point that I became dimly aware of while completing my Ph.D. thesis in 1990–when I realized that the model I’d spent two years working on didn’t actually fit my data–and which I first formalized at a conference talk in 1997 and published in 2003 and 2004. It’s funny how slowly these ideas develop.).

It’s funny about Tukey’s work on multiple comparisons. He wrote an entire unpublished book on the topic, along with many short research articles. Individually each of these articles is readable and compelling, but, stepping back, I see how it’s all based on a foolish familywise error framework.

Here’s what I think happened: Tukey was following some version of the operations-research approach to statistics associated with Wald. It makes sense–this was the framework that they used to win the second world war. And it was pretty much the only game in town (yes, there was also Bayes, but for historical reasons Bayesian methods got no respect back then). Tukey was brilliant, a great problem solver, co-inventor of the fast Fourier transform and lots of other things, but for whatever reason he didn’t apply his depth, breadth, and creativity toward thinking about the fundamentals. I guess you’d call him a fox. In the usual telling, the fox is the hero and the hedgehog is the boring obsessive, and it’s fair to say that the world benefited from Tukey’s foxness, his interest in developing new methods and solving problems rather than refactoring the foundations of statistics–but I think that his lack of depth in that area contributed to him wasting a lot of time and effort on multiple comparisons, to the extent that his only way forward was to rip it all up and start again with EDA.

From a modern perspective, EDA and CDA are the same thing: they’re both ways of comparing observed data to hypothetical replications under a model. Recall the goal of EDA to discover the unexpected: “the unexpected” is defined relative to “the expected,” hence EDA is model checking just as CDA is, with the only differences being: (a) in EDA the display is visual rather than numerical, (b) in EDA the reference model is often defined implicitly rather than explicitly.

Indeed, the “news you can use” aspects of this post–and of my general point about EDA being the same as CDA–are:

1. If you’re gonna do EDA, make your reference model as explicit as possible. The clearer your assumptions, the better you can find problems. It’s Popper–or, really, Lakatos–in action.

2. If you’re gonna fit complex models (which we’re doing more and more of in statistics and machine learning), EDA is more important than ever. EDA is not a set of qq plots you make before getting to the serious bit of modeling; it’s a key step in workflow. I frame this Bayesianly as this is the simplest way for me to do the work, but you can do non-Bayesian versions as long as you have generative models for your data, and as long as your methods are flexible enough to accommodate different sources of information. (Recall the most important aspect of a statistical method.)
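The claim that EDA and CDA are both comparisons of observed data to hypothetical replications under a model can be sketched in a few lines. This is a simplified illustration, not the procedure from the 2003 article: the reference model (a normal distribution with plug-in mean and standard deviation rather than a full posterior), the test statistic, the function name `replication_check`, and the data are all assumptions made for the example.

```python
import numpy as np

def replication_check(y, statistic, n_rep=1000, seed=0):
    """Compare a statistic of the observed data to its distribution under
    replications from an explicit reference model.

    Reference model (an assumption of this sketch): y ~ Normal(mean(y), sd(y)),
    using plug-in estimates instead of posterior draws.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.mean(y), np.std(y, ddof=1)
    # Hypothetical replicated datasets, one per row, same size as the data.
    y_rep = rng.normal(mu, sigma, size=(n_rep, len(y)))
    t_rep = np.array([statistic(row) for row in y_rep])
    t_obs = statistic(y)
    # "Confirmatory" summary: how often a replication is at least as extreme.
    p_value = float(np.mean(t_rep >= t_obs))
    return t_obs, t_rep, p_value

# Hypothetical data: mostly unremarkable values plus one extreme observation.
y = np.concatenate([np.zeros(99), [100.0]])
t_obs, t_rep, p_value = replication_check(y, np.max)
print(f"observed max = {t_obs:.1f}, "
      f"replicated max range = [{t_rep.min():.1f}, {t_rep.max():.1f}], "
      f"p = {p_value:.3f}")
```

The numerical summary `p_value` and a histogram of `t_rep` with `t_obs` marked on it are two displays of the same comparison. The graph is the "exploratory" version and the tail probability is the "confirmatory" one, and both only mean something relative to the reference model, which is why it helps to state that model explicitly.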

Unfortunately, Tukey was stuck in an old-fashioned statistical framework–good enough to solve operations research problems in the war, but not strong enough for all the applied problems he and others were encountering in the 1960s and later–so, for historical reasons, EDA and CDA were perceived as opposites. Which is too bad. Hence this post, which repeats things that my colleagues and I have been saying for 30 years, and which was implicit in the work of Rubin, Box, and others for longer than that.

Eunji Kim’s book, “The American Mirage: How Reality TV Upholds the Myth of Meritocracy”

My Columbia political science colleague Eunji Kim shared with me her new book on American public opinion. In its overall structure, it’s the most well-written and coherent academic political science book I can recall reading. Much better organized than our Red State Blue State book, which had a general theme and lots of little studies but no coherent message, and various other political science books I’ve read that are either a collection of only loosely-related articles, or, at the other extreme, a single article puffed up to book length. I enjoyed reading it from beginning to end. It’s kinda too bad because almost nobody reads academic books from beginning to end, but at least this one reader appreciated the care that was put into the book, not just the research and writing but also the structuring of the argument.

I have a few thoughts and questions:

1. The political science consensus

Kim presents herself as going against the political science consensus by studying entertainment media. And I guess she’s right; Figure 2 of the book seems pretty clear.

Still, at times I feel she overstates the case. On page 3, she writes, “We tend to think of our political beliefs as well reasoned and carefully considered, or, at worst, determined by what’s happening around us right now. In this hyper-politicized world full of partisan news media, it seems implausible that something as frivolous as the latest reality TV show from Netflix, cop shows, and superhero movies, to name just a few, could possibly affect something as profound as people’s political attitudes.”

And then she continues with, “This is a reasonable viewpoint shared by the vast majority of political observers and scholars of public opinion.” What’s the evidence for this claim? I’m not trying to be picky here! I would guess that most political scientists do think that entertainment could affect political and social attitudes. Indeed, entertainment news often includes speculations about political effects of TV shows, celebrity endorsements, actions on the sports field, etc. Granted, these entertainment news stories are not written by political scientists, but political scientists are aware of such speculations–I know I am! One problem here is that lots of people, political scientists and others alike, tend to overestimate the effects of things that come to mind (this is related to Tversky and Kahneman’s “availability bias”). For example, when Joe Rogan endorsed Donald Trump and Taylor Swift endorsed Kamala Harris, there was a lot of talk about what effect these might have. I don’t think anybody thinks these effects are zero; at the same time, such effects of individual endorsers are not going to be huge, because there are a lot of celebrities out there.

Speaking more generally about entertainment’s effect on culture and political attitudes, again, there’s been lots of discussion and speculation on the effects of cultural influences including movies (Dirty Harry, Death Wish, Star Wars, Rambo, Back to the Future, Do the Right Thing, The Matrix, etc etc etc), business stories (McDonalds, De Lorean, Microsoft, Apple, Google, Tesla, etc.), pop stars, sports stars. You get lots of concern on the right about permissive cultural messages encouraging permissive attitudes and liberal voting, and corresponding concern on the left about vigilante movies and corporate success stories encouraging conservative social and political attitudes. I’ll take Kim’s word for it that political scientists have published very little on the topic, but I don’t think that means that they (we!) find it “implausible that something as frivolous . . . could possibly affect something as profound as people’s political attitudes.” I think it’s more what she talks about later in the book, that these things are very hard to study! We’ve pretty much ceded the field to sociologists. So, fair enough to say that this is an under-researched topic; I just think she’s setting up a straw man by saying that “the vast majority of political observers and scholars of public opinion” believe that entertainment media doesn’t matter.

Also, on p.139 Kim writes, “A credible possibility exists that entertainment media may still impact policy attitudes,” and then she cites some studies. Some of these studies may be iffy–I’ll get back to that later–but, in any case, doesn’t this kind of undercut her claim that nobody studies these things? But, yeah, if it’s really true that “80,000-plus academic treatises” mention Fox News (as she says on p.177), then fair enough to make the point that the effects of entertainment media are understudied.

Umm, here’s another place where I think Kim slightly misrepresents the field of political science. On p.179, she quotes Pye and Verba (1965) as saying, “Some would say that politics may be found everywhere–in the club room and the business office, among schoolmen and churchmen, and even the household–but surely politics assumes its great dimensions only when its stage is the state and its powers can shape the law of the land.” She then writes, “While many intuitively or instinctively postulate that politics only reveals itself in the corridors of power, I humbly diverge from that perspective. I align myself with those who recognize politics as a perennial undercurrent, coursing through the veins of everyday American life.” But Pye and Verba weren’t saying that politics only reveals itself in the corridors of power. They explicitly noted that politics occurs everywhere. What they said is that politics “assumes its great dimensions” only with affairs of state. So again I think Kim is arguing against a strawman.

On the other hand, maybe the narrowness of political science that she’s arguing against is not such a strawman. Yes, political scientists recognize the political aspects of interactions at work, at home, on the football field, wherever–but if you look at what’s published in political science journals (including by me!) it’s pretty much all about government policies, political parties, and public opinion, maybe with the occasional article about labor union elections or church politics or whatever. If you want to look for articles on politics within firms, within families, etc., you’ll have to go to the economics and sociology literature. And that ain’t right! The insights of political science should be relevant to understanding small-scale politics, in the same way that economics and sociology provide insights in the small as well as the large.

2. Upward mobility and economic pessimism

Kim argues that entertainment media portray an unrealistic world of just deserts in which anyone who is deserving and works hard should be able to achieve success. How does this fit into another thing we hear about, which is a general pessimism in America, and indeed around the world? This came up during the 2024 campaign: by the usual measures, the economy was going well–not booming, but going fine, with low unemployment, moderate and decreasing inflation, slow but positive GDP growth, etc.–but there was a general sense of there being a bad economy, and also a larger sense that America was no longer operating as it should, that traditional social and economic roles were disappearing, etc. This was taken to be a big part of the vote swing to the Republicans.

I’m not saying there’s a contradiction here–it’s possible for people to feel that America is still falling apart while still being cheered by reality-TV success stories. I’m just not sure how this all fits together, and I didn’t see it addressed in Kim’s book.

Let me put it this way: the entertainment media have given people the sense that with hard work anyone can climb the ladder, and this can push people toward supporting traditional Republican low-tax, low-spending policies and opposing redistributionary policies of the Democrats–but at the same time people seem to feel that their economic situation is precarious. Again, this is easy enough to explain in traditional political-science language: the voters feel that times are tough for them and so they oppose government giveaways to the undeserving others, and you could say that exposure to “Cops” gives people the sense that the poors are bad people and that exposure to “Shark Tank” gives people the sense that poor people have only themselves to blame–but people are not applying this logic to themselves, right? When it comes to their own economic problems (the price of eggs or the disappearing industrial base or whatever), then people see themselves as the victims. So there still are some pieces of the puzzle that I’m not understanding.

3. What is “meritocracy”?

Kim mentions meritocracy many times in the book but I didn’t see a definition. It seems to me that meritocracy has two meanings. The first is simply that people’s station in their jobs and society is based on their abilities or merit. The second is that people with merit get to run things. I think the second definition is more accurate: the idea that the people with merit get to run things is the “ocracy” part of meritocracy (“government or the holding of power by people selected on the basis of their ability”). In that case, though, the concept of meritocracy is itself self-destroying, because one thing the people with merit will do with their power is to get favored positions for their friends and family. I elaborate on this point here and also here. I think this issue is important because it moves the debate beyond “Is meritocracy good or bad?” to “Can meritocracy exist at all?”

To put it in political science terms: You have a population whose attitudes on social and economic mobility are being affected by entertainment media. But then there’s the question of how the attitudes transform to votes. I’m not saying Kim needs to figure it all out in this book; I’m just saying that, with “meritocracy” itself having such a slippery definition, this makes it even harder to interpret the book’s empirical results in terms of their impact on political outcomes–this is the issue I raised in point #2 above.

4. Reliance on shaky studies by others

In several places in her book, Kim refers to studies that I don’t trust. These are a bunch of papers published roughly around the notorious 2010-2015 period when the replication crisis in social science was at its height. I wasn’t quite sure what Kim’s take was on these studies. On one hand, she seems to be taking them at face value, which I think is a mistake, as I see these studies essentially as noise-mining exercises. On the other hand, she also downplays the importance of those results (or, I would say, those unsubstantiated claims).

Here are some examples:
pp.6-7. “It turns out that those who vote in schools are notably more inclined to support school funding initiatives due to ‘contextual priming.'” I looked up this paper and it’s based on observational data from one referendum in one year in one state. Indeed, on page 48 of Kim’s book she does a good job of pointing out the weaknesses of this sort of one-off observational study.
p.7. “Even chance encounters, like observing someone in apparent poverty in the streets of Boston, reportedly reshape attitudes on wealth redistribution.” This was a notoriously noisy experiment where the result went in the opposite direction as expected and did not even reach the conventional level of statistical significance; see here. My point is not that we should only talk about “statistically significant” results but rather that this study could easily–indeed, I think would usually–have been presented as a null finding.
p.7. “Scholars even found evidence of how shark attacks or lousy weather affected people’s voting behaviors.” This shark attack study has major problems, indeed I think it would be fair to describe it as debunked; see here and here.

Kim follows up with, “much more systematic than the political impacts of shark attacks or random encounters with out-group members would be the influence of entertainment media, given the ubiquity in our daily lives.” I agree. That’s one reason I think it’s funny that she reported those questionable studies with a straight face! Indeed, my colleagues and I have argued that those silly claims of large and persistent effects from trivial inputs cannot be all or mostly true, for the mathematical reason that a large number of large and persistent effects cannot coexist.

Kim continues by saying, “Perhaps the glaring irony is that, while political scientists have no trouble believing the powerful impact of ostensibly arbitrary or seemingly irrelevant events on public opinion, they have been reluctant to study the media that citizens primarily consume.” To that I reply: (a) lots of political scientists don’t believe those claims about the purportedly large and systematic effects of shark attacks, stranger encounters, etc. Yes, some do (Achen and Bartels are decorated political science academics, and that streets-of-Boston study did win two awards), but lots of political scientists (including Anthony Fowler, Andy Hall, and me!) are skeptical and think that the social priming research is a dead end; and (b) as Kim explains so well in her book, it’s hard to study the political effects of entertainment media! I don’t think political scientists are reluctant to study entertainment media; it just takes a lot of effort so we gravitate to where the data and research questions are cleaner. Indeed, one of the big problems with social priming research is that it’s easy to run an experiment or grab observational data and perform what appears to be a well-identified statistical analysis. So Kim is on to something here . . .

To continue:
p.43. “The emotional resonance or ‘affective imprint’ of these media narratives can leave a lasting impact on our attitudes and behavior, often occurring at an unconscious level.” I’m very skeptical of these claims of large and persistent unconscious effects. I looked at one such claim carefully and it disintegrated under careful inspection; see here. I don’t think these claims make a lot of sense, and I don’t find the evidence given in support of them to be at all convincing. On the plus side, I think Kim’s main story is strong on its own–for one thing, there’s no need to believe that the effects of Shark Tank etc. on attitudes about meritocracy etc. are “occurring at an unconscious level”–so I think she could strengthen her argument by detaching it from the questionable replication-crisis-era research that she is citing.
p.49: Kim skeptically discusses a paper about Harry Potter that “explores the hypothesis that avid Potterheads might be less inclined to support Trump.” I think that paper might have been a parody; in any case, it has problems of causal identification (see here). I think Kim’s discussion of that paper was good, in that she points out the lack of any coherent theoretical story or model there.
p.139. “MTV’s 16 and Pregnant is credited with altering rates of teen childbearing.” I haven’t looked into this myself, but that claim has been contested.

5. Media consumption

This is something I know very little about. I learned a lot from all the examples in the book. There was one place where I can share something. On page 34, Kim writes, “Remember how sports fans run for cover when their team’s performance is too depressing to watch? Politics, just like any team sport, works similarly.” I wrote a paper about this in 2016 with Doug Rivers, David Rothschild, and Sharad Goel. We found strong evidence for differential nonresponse. This wasn’t media consumption, it was response to opinion polls, but it seems related. Rothschild and I also made the point, with additional data, in an article that year in Slate.

On page 77, Kim reports that 40% of respondents said they watched America’s Got Talent, 31% watched Shark Tank, and 27% watched Hell’s Kitchen. Elsewhere she writes that this is the percentage of people who described themselves as “regular viewers” of the shows. Wow–those are huge numbers! I say this as someone who’s never watched any of the shows on the list of Figure 3.7, except that I watched part of an episode of Survivor once, about 25 years ago. So I’m not tuned in to the viewing habits of the average American. Fair enough! But I also wonder what people mean when they say they are “regular viewers” of a show. On p.76, it says that American Idol and America’s Got Talent had average audiences of 5-10 million. 10 million is 3% of Americans. Can it really work out that 3% of people watch the show at any given time, but 30% are regular viewers? What does it mean to be a regular viewer if you watch the show less than 10% of the time? I’m not saying Kim is wrong here; I’m just having difficulty making sense of these numbers. It’s not like in the 1970s, when popular TV shows had ratings in the 30s.
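As a rough sketch of the arithmetic above, here is the calculation in Python. The population figure of 335 million is an assumption (the text only says that 10 million is about 3% of Americans), and the bound assumes that every viewer of a typical episode is a self-described regular viewer, so it is the most generous case for the survey numbers.

```python
# Rough check of the viewership arithmetic discussed above.
# ASSUMPTION: a U.S. population of about 335 million (not a figure from the book).
US_POPULATION = 335_000_000


def audience_share(avg_audience, population=US_POPULATION):
    """Fraction of the population watching a typical episode."""
    return avg_audience / population


def max_episode_fraction(avg_audience, regular_viewer_share, population=US_POPULATION):
    """Upper bound on the fraction of episodes watched by a typical
    self-described regular viewer, assuming the whole audience of each
    episode is made up of regular viewers."""
    return audience_share(avg_audience, population) / regular_viewer_share


share = audience_share(10_000_000)             # top of the 5-10 million range
bound = max_episode_fraction(10_000_000, 0.40)  # 40% say they are regular viewers
print(f"audience share per episode: {share:.1%}")
print(f"max fraction of episodes watched by a regular viewer: {bound:.1%}")
```

Under these assumptions the bound comes out below 10%, which is the puzzle: either "regular" means something very loose, or the survey and the ratings are measuring different things (for example, delayed or streamed viewing).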

6. Conflicting messages from entertainment media and social media

As discussed above, Kim argues that entertainment media has been pushing an if-you-work-hard-in-America-you-can-make-it story, what she calls “meritocracy,” and that this helps to explain why many Americans oppose policies of economic redistribution from rich to poor. Of course this isn’t the only reason for people to oppose redistribution–there are economic arguments, political ideology, partisanship, all sorts of reasons–she’s just saying it’s part of the story, and a shift of even a few percentage points can make a difference. And I can buy her argument about the importance of entertainment media.

But what about social media? It seems like Facebook, Twitter, etc., are full of people pushing get-rich-quick schemes, people making a million dollars off of some scam or another. These don’t represent hard work or meritocracy; they represent a way that people are getting something for nothing. You might see this and envy those people, or want to be them, but I wouldn’t think that a natural reaction to seeing social media posts on house-flippers or whatever would be to conclude that the American economic system is fair.

So here’s my concern. Kim is pitting two economic narratives against each other:

A. The economic system is complicated; we need regulation and transfer payments to keep things running effectively. I’d call this the liberal or Keynesian view, the idea that for moral reasons the government should work to reduce income inequality, also this will be good for the economy as a whole, by increasing employment and output.

B. People pretty much get their economic just deserts; if you work hard you can make it in our society. I’d call this the conservative or monetarist view, the idea that transfer payments are a moral hazard discouraging people from working, and that limited government will grow the economy and help everyone.

Kim doesn’t quite say it like this, but I get the impression that she’d associate narrative A with the sorts of economic statistics you’d see in the news media (GDP growth, unemployment rate, inflation rate, etc.), while narrative B is being pushed by entertainment media.

But what about this other narrative:

C. The system is rigged: no matter how hard you work, you’re just spinning your wheels; the real money is being made by people who are doing some scam or who’ve found their way into some passive investments (real estate, bitcoin, whatever). This is a cynical narrative–I guess it has some truth; it’s still cynical–and it’s different from A and B above.

My impression is that narrative C, which can have liberal or conservative policy implications, is big on social media. And I’m not quite sure how it fits into Kim’s book, which is so focused on the just-deserts message being pushed on entertainment media. Even if a tiktok account is telling you that anyone can become rich through bitcoin, that’s not a message of “meritocracy” as it doesn’t connect to hard work or to any sense of personal merit.

I sent the above to Eunji and she replied:

To address your first point regarding the political science consensus, I understand your concern about overstating the skepticism around entertainment media’s role in shaping political attitudes. As you rightly point out, discussions about the potential political impact of celebrities, TV shows, and movies do occur regularly in the media and among political scientists, though they’re rarely the subject of academic writing. My intention wasn’t to suggest that scholars entirely dismiss these effects, but rather to highlight how understudied they are compared to more traditional political research. In my effort to write a more accessible book, I see now that my argument might have come across as setting up a straw man.

That said, I think part of the reason my writing leans in this direction is due to my own experience in the field. I can’t help but reflect on the immense skepticism I faced throughout grad school regarding my focus on entertainment media. Well, part of the skepticism, I think, arose from the fact that I am not American–a ridiculous point of view that I unfortunately heard repeatedly. It often felt like my work wasn’t being taken seriously, and I was told repeatedly that it wasn’t “real” political science. Many scholars have cast serious doubt on whether entertainment media can truly shape political attitudes, and that created a sense of swimming against a very strong current in my research. Many of them have all been generous mentors, and I admire their scholarship, but as a grad student, it felt a bit daunting that my argument goes against what some of the towering figures in my subfield wrote.

One way I think about contemporary economic pessimism, which I touch on in the book, perhaps more implicitly, is the distinction between how people perceive their personal situation (egotropic) versus how they think about the national economy (sociotropic). Many Americans report feeling a heightened sense of personal economic insecurity, but they also continue to believe that anyone can succeed if they work hard. What’s interesting is that those who feel economically insecure on a personal level are often the ones most drawn to rags-to-riches stories in entertainment media. This may partly explain why the perception of personal hardship coexists with the belief in upward mobility.

As for the 2024 election, I think the core intuition of my book does help explain what happened. Throughout the campaign, liberal and mainstream media were telling Americans that the economy was doing better than they thought, but on platforms like TikTok, there were millions of videos where everyday Americans complained about the rising cost of basic items, like burgers at McDonald’s. Here, we see a significant gap between the elite/mainstream media narrative and the media consumed by ordinary Americans–those TikTok videos, in particular, shaped economic perceptions much more powerfully for a wider swath of the electorate. In a way, the entertainment and social media content that reflects economic frustrations could have been far more influential in shaping perceptions of the economy than the messages coming from traditional outlets. TikTok and social media platforms are obviously very different from reality TV I studied in this book, and my current working projects are dealing with these newer topics.

Your thoughts on the definition of “meritocracy” are very helpful. This is something I’ve personally struggled with, as the definition remains so fuzzy, and every scholar seems to use it in different ways. I’m not sure I fully agree with what James Flynn argues–that to the extent people with merit get higher status, they would use that status for nepotism, so to speak. In real life, of course, that’s often what happens, but I always find it interesting that in America, job referrals are so widely accepted. They happen far more frequently than in places like Korea, for example, where most systems are based on standardized exams and applications. When I think of meritocracy, I tend to think of the common, though admittedly fuzzy, definition that many historians and sociologists seem to use—namely, the belief that anyone who works hard can get ahead economically and socially in America. I think this definition makes even more sense when considering the downstream consequences I showed in Chapter 6, where people begin to view the rich as more deserving, attributing their success to internal factors. The electoral implications of meritocratic beliefs (whether believing in meritocracy makes you more likely to vote for a conservative party) are much harder to prove, and honestly, that’s something I didn’t even attempt in this book. The case of Trump, however, is an interesting one because he is essentially a product of reality TV. That was an opportunity I grabbed to talk about the electoral consequences of entertainment media in the last chapter and in my separate paper published in APSR.

On the studies you flagged as shaky, I completely understand your concerns and appreciate your rigorous eye. The studies you mentioned do indeed come from that fraught period of social science (!), and I can see now how they might undercut the strength of my argument. I will move away from relying too heavily on them, especially in light of the replication crisis. I want the core of my argument to stand on more solid ground, and your suggestions will definitely help me revise the evidence base.

On the question of media consumption and the numbers behind “regular viewership”: your QJPS paper on the mythical swing voter is actually what inspired Jin Woo and me to write a paper about temporal variations in partisan media consumption (we even cite your paper in that one)! For the percentages of people who say they regularly watched shows like America’s Got Talent, 40% is the figure that respondents report. However, the definition of “regular viewer” is, of course, extremely fuzzy, and this is a long-standing debate among media scholars about how on earth we can reliably measure these things. Surveys tend to inflate these numbers, where people’s perceptions of “regularity” vary significantly. I used both survey and behavioral data from Nielsen, and I see how that could lead to confusion. When survey respondents report that they regularly watch a show, they might include people who watch it on-demand, via streaming, or in a delayed fashion, which inflates the number of self-identified “regular viewers.”

Lastly, regarding the conflicting narratives and messages from entertainment media and social media: you’re right that countless different narratives are circulating on social media. One interesting observation, though, is that many young people now view social media creators as aspirational career models. What’s particularly striking is that becoming a successful social media creator who generates millions of views (and, consequently, significant income) seems to require little more than sheer talent. Unlike traditional career paths, which often demand family money or connections, all you need is a phone and some compelling content. I often wonder how this shifts the way younger generations think about meritocracy and who is truly deserving of success. In this economy, it seems that if you’re poor, you have no excuse–after all, you’ve got a phone too!

So that last point is that, when some Youtube or Tiktok video is promoting some get-rich-through-passive-income scheme, what they’re implicitly selling is not the scam that they’re promoting but rather the meta-message provided by their success: it’s not that you, the viewer, can or should become rich by flipping houses or selling Beanie Babies or whatever, but rather that you can become successful by creating your own Youtube channel. The meritocracy, or whatever it is, comes from the idea that anyone with a phone can, by working hard enough, become a successful influencer.

Prior as data, prior as belief, prior as soft constraint, prior as unconditional distribution in a generative model

In Bayesian inference, the prior density is this thing that you multiply by the likelihood to get the unnormalized posterior density.

For the purpose of inference conditional on the data, the prior is the prior is the prior, and where it comes from doesn’t matter.

But in a generative context–which is necessary for experimental design, model checking, prediction, extrapolation, and causal inference–the source and interpretation of the prior distribution do matter.

Here are a few places the prior can come from:

Prior as data. More specifically, prior as information external to the data included in the likelihood. For example if you’re analyzing diagnostic test data, you could use lab test results on known samples to construct a prior for the sensitivity and specificity of the test. There’s no need for these lab results to come before (“prior to”) the assays being analyzed. One implication here is that you can pile priors on top of each other. For example, in Stan you could have:

  theta ~ normal(0, 10);
  theta ~ normal(0.2, 0.4);

where the first line represents a regularizing prior–a statement that you can be pretty sure that the true value of theta is less than 10 in absolute value, or, equivalently, a statement that you are only demanding that your method work well for problems in which the true theta is in that range–and the second line represents information from some external data. That latter information could just as well be framed as a factor of the likelihood: y_prior ~ normal(theta, 0.4), where y_prior = 0.2. But it will often come into the model as part of the prior.
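To see what piling up these two priors does, here is a minimal sketch in Python (not Stan) that multiplies the two normal factors analytically. The helper names are hypothetical, introduced only for illustration; the only inputs are the numbers in the Stan snippet above. It also checks the point about framing: the second factor gives the same log density whether it is written as a prior on theta or as a likelihood for y_prior.

```python
import math


def combine_normal_factors(factors):
    """Multiply normal densities in theta, each given as (mean, sd).
    The product is proportional to a normal density whose precision is the
    sum of the precisions and whose mean is the precision-weighted average."""
    precisions = [1.0 / sd**2 for _, sd in factors]
    total_precision = sum(precisions)
    mean = sum(p * m for p, (m, _) in zip(precisions, factors)) / total_precision
    return mean, math.sqrt(1.0 / total_precision)


def log_normal_pdf(x, mean, sd):
    """Log density of normal(mean, sd) evaluated at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


# theta ~ normal(0, 10); theta ~ normal(0.2, 0.4);
mean, sd = combine_normal_factors([(0.0, 10.0), (0.2, 0.4)])
print(f"combined prior: normal({mean:.4f}, {sd:.4f})")

# Same factor, two framings: prior on theta vs. likelihood of y_prior = 0.2.
theta = 0.35  # an arbitrary value of theta at which to compare
as_prior = log_normal_pdf(theta, 0.2, 0.4)
as_likelihood = log_normal_pdf(0.2, theta, 0.4)
print(as_prior, as_likelihood)
```

The combined prior is dominated by the more precise factor, so the regularizing normal(0, 10) barely moves it. This closed form applies only because both factors are normal; in general Stan just adds the log densities.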

Prior as belief. Here you’re conditioning on some assumption or belief as a way of constructing a statistical procedure which will be optimal over the space defined by averaging over your assumed prior. This belief might not be based on any data and it might not correspond to any real population.

Prior as soft constraint. Here you’re regularizing (or, in the case of a flat prior, purposely not regularizing). This sort of prior is implicitly defined based on what it does when it’s multiplied by the likelihood (as discussed here), and the likelihood depends on the data, hence, a regularizing prior can depend on the data too. Then there’s no longer a generative model (a joint distribution of parameters and data) but that doesn’t have to stop you from using the posterior distribution that comes out.

Prior as unconditional distribution. This is the full generative model, and it’s a wonderful thing when you have it, but often you don’t. The math of the posterior predictive distribution “thinks” that this is what the prior is doing, but often it’s not.
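Here is a minimal sketch, in Python, of what it means to treat the prior as the unconditional distribution in a generative model: draw theta from the prior, then draw data given theta. The normal(0.2, 0.4) prior is borrowed from the example above; the data model y ~ normal(theta, 1) and the function name are assumptions made up for illustration.

```python
import random


def simulate_generative(n_sims, prior_mean, prior_sd, data_sd, seed=0):
    """Draw (theta, y) pairs from the joint distribution:
    theta ~ normal(prior_mean, prior_sd), then y | theta ~ normal(theta, data_sd)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        theta = rng.gauss(prior_mean, prior_sd)  # unconditional draw of the parameter
        y = rng.gauss(theta, data_sd)            # data drawn conditional on theta
        draws.append((theta, y))
    return draws


# ASSUMPTION: data_sd = 1.0 is a made-up measurement scale for this sketch.
draws = simulate_generative(n_sims=10_000, prior_mean=0.2, prior_sd=0.4, data_sd=1.0)
y_mean = sum(y for _, y in draws) / len(draws)
print(f"prior predictive mean of y: {y_mean:.2f}")
```

This only works when the prior is a proper distribution you are willing to sample from. A flat prior on an unbounded space gives the first line of the loop nothing to draw from.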

These different ideas overlap. For example, Zwet and I used the Open Science Collaboration database and the Cochrane database to construct our proposal for informative default priors scaled by the standard error of estimates. This is a prior coming from external data, it also approximates our belief on what might be the effect of a new study, it regularizes our inferences, and it corresponds to the unconditional effect size in a generative model in which studies are drawn from a population that is similar to those historical databases.

For all the aspects of the prior to work together . . . that’s the dream. But it doesn’t always happen that way. That’s most obvious for a flat prior on an unbounded space, which cannot correspond to any generative model, but it also is the case for weakly informative priors, for models that we are using but we don’t believe (because we’re consciously excluding some potentially relevant information), etc. So it’s good to be aware of the different ways a prior can be constructed and interpreted.