Uber could use your statistical analysis.

Johannes Hallermeier writes:

After CU, I went on to work at Uber, as an applied scientist on the policy research team.
I’m reaching out because Uber is hiring for a number of interesting applied science & engineering roles across policy, marketing and marketplace.
—————————————-
Applied science / engineering roles at Uber
– two roles on policy research (#1, #3)
– two roles on marketing applied science (#2, #4, plus a third role coming online soon)
– several roles on marketplace (pricing, matching, incentives, etc., see rest of list below)
before applying, please ping [email protected], including your CV and preferred role (for questions and/or referrals)

Applied Scientist II, Policy & Consumer Research

Applied Scientist II, Brand Marketing

Applied Scientist II, Earnings Policy

Applied Scientist II, Marketing Applied Scientist II

Machine Learning Engineer

Optimization Engineer (Operations Research)

Backend Engineer, Rider Pricing & Incentives

Senior Optimization Engineer

Senior Software Engineer, Rider Pricing and Incentives

Senior Software Engineer, Dynamic Pricing

Staff Machine Learning Engineer, Causal Inference

Staff Software Engineer

Staff Software Engineer, Rider Pricing Platform

Staff Machine Learning Engineer, Pricing and Incentives

Staff Machine Learning Engineer, Dynamic Pricing

Staff Optimization Engineer, Dynamic Pricing

Senior Staff Machine Learning Engineer, Marketplace Pricing & Incentives

Science Manager – Dynamic Pricing

Manager, Science

Senior Staff Engineer – Marketplace Competitive Intelligence

Scientist II – Competitive Intelligence

Sr. Scientist – Competitive Intelligence

Sr. Scientist – Competitive Intelligence

Scientist II, Pricing and Incentives

Senior Scientist, Pricing and Incentives

I guess maybe if you interview for a job there, they’ll pick you up directly from the airport?

In any case, I can only assume that expertise in Bayes and Stan will come in handy.

Validity and deduction in causal inference

Kevin Esterling writes:

I wanted to share a paper my co-authors and I recently published on the necessity of construct and external validity for deduction in causal inference.

The reason I write to you is that we discuss your “Why ask Why” paper [coauthored with Guido Imbens] at some length (for example, on p. 9 of the PDF) and show that from a deductive perspective, in omitting assumptions for construct and external validity the analyst inadvertently changes their “what if”-type question, that is intended to be deductive, into a “why”-type exploratory question. Adding assumptions for construct and external validity is required to preserve the deductiveness of a “what if” question.

What they say in their article makes sense to me. We discuss similar issues from a modeling perspective in chapter 19 of Regression and Other Stories (https://sites.stat.columbia.edu/gelman/regression/). I guess there will be some controversy from proponents of causal identification, not that they would disagree with these points but they might argue that there is not a tradeoff between internal and external validity. To put it another way, I sometimes think that causal identification strategies are overrated because they lead people to focus on sometimes minor issues of internal validity while ignoring the elephant in the room that is external validity. I express that view here: https://statmodeling.stat.columbia.edu/2020/01/13/how-to-get-out-of-the-credulity-rut-regression-discontinuity-edition-getting-beyond-whack-a-mole/ and here: https://statmodeling.stat.columbia.edu/2021/03/11/regression-discontinuity-analysis-is-often-a-disaster-so-what-should-you-do-instead-do-we-just-give-up-on-the-whole-natural-experiment-idea-heres-my-recommendation/. From the other direction, proponents of causal identification argue that doing a study with strong internal validity requires effort that can also pay off in external validity. Or, to put it another way, they argue that a study with poor internal validity is typically not well formulated, and it will have external validity problems too. I discuss some of those debates in my chapter in the volume, Field Experiments and Their Critics: https://sites.stat.columbia.edu/gelman/research/published/yalecausal2.pdf.

Esterling responded:

That’s an interesting point about the possible dependence in the types of validity in that if a study has poor internal validity, it’s probably just badly done and so will lack the other validities as well. But I don’t think the reverse is true in that a researcher who obsesses over and achieves perfect internal validity might then neglect considerations of construct and external validity. It’s not that there is any fundamental tradeoff between internal and external/construct validity, it’s just how the researcher budgets their time and mental effort. Maybe the last 10% of time and effort researchers put into internal validity could be better allocated to warranting external and construct validity.

Which I take to be the gist of your blog post on RDD cautioning about “overconfidence borne from the slogan, ‘causal identification,’ which leads researchers, reviewers, and outsiders to think that the analysis has some special truth value”–that overconfidence is exactly what motivated our paper.

And then what you write in your chapter in field experiments and their critics, “the mapping from any research finding—experimental or observational–is in effect an ongoing conversation among models, data, and analysis”–yes, exactly!

PhD position at UBC in Temporal Ecology Lab

This post is by Lizzie

My lab has an open position for a PhD student to join the lab. We’re looking for someone bright, motivated and collaborative to study how seed and seedling pathogens influence forest regeneration and diversity. This project would be part of a broader PhD with room to develop your own projects. This project is in close collaboration with the Plant Ecology Group at ETH Zürich, which is led by Professor Janneke Hille Ris Lambers.

If you’re interested, please find more information (including how to apply) on this page.

We’re open to folks from diverse training backgrounds. If your background is more computational and/or mathematical but you’d be excited to spend a few weeks outside and do a little lab work then you could be an ideal candidate; if you’re really interested, please apply no matter how well you think you line up with the ad.

Application review begins 1 July 2025 so apply soon for full consideration.

Dan Sinykin on close reading in literature, and me on close reading in statistics

Interesting article by Dan Sinykin on close reading:

Reading, a skill easily taken for granted, is difficult–all the more so when reading literature that wields language as a medium for art. . . . It’s easy to see why close reading, which demands patience, openness to others, and slow, careful thought, is having a moment among academics. . . . academics are rediscovering the quiet excitement of close reading, a relief from the overheated corporate pablum routinely suffocating us.

Pablum is not just corporate

To the above, I’d just add that the “overheated pablum” is not just “corporate”; it’s coming from all sources. We see it in scientific research papers, on twitter, on NPR, all sorts of places that are not themselves corporate (OK, sure, twitter is owned by a corporation but the individual people posting the pablum are not themselves corporate). Overheated pablum is a style, a dominant style for the usual Gresham sort of reasons and also because, as Sinykin says, our discourse has evolved into a position where nobody pays attention to anything, so things are written with the expectation that nobody will pay attention to them, etc. I see this sometimes online when someone will criticize something of mine, but they’re criticizing things I never said–one example is here, and there are links to a few more at the end of this post. Or, more directly, there are examples such as the psychology study described as “long term” even though it took place over only 3 days, or the study that claimed “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” even though it had no evidence of anyone actually becoming more powerful.

As I’ve complained, my frustration about such things is not just that this happens–that credentialed scholars write articles with titles and abstracts that are flat-out false–but that nobody seems to care, even when these things are pointed out. To put it in Sinykin’s terms:

– People aren’t doing close readings on scientific papers, even high-profile papers that get tons of media attention.

– Scientists aren’t even doing close reading on their own papers! For example, in the papers linked above, I don’t know that the authors even considered the idea that there could be a contradiction between describing a study as “long term” even though it only lasted 3 days, nor that there could be a problem with claiming that people in their study “instantly become more powerful” even though the study had no measurements of power.

To put it another way, expressions such as “long term” and “more powerful” are to be taken metaphorically, as a sales job, in the same way that you might say that someone “is going to cure cancer in two years” even if, no, you don’t actually think they’re going to cure cancer in two years. It’s advertising-speak, it’s letter-of-recommendation-speak, it’s Ted-talk-speak, it’s the-title-and-abstract-of-papers-published-in-Psychological-Science-speak, and if everybody does it, then it doesn’t count as lying, indeed it doesn’t even feel like lying.

Bad news when close reading is no longer expected

I don’t agree with all the political things that Sinykin says, but that’s not so important for my point here, which is that I agree that there’s not enough close reading going on, and when audiences don’t read more closely, this provides fewer incentives for authors to write for close readers. Indeed, authors can become indignant at close readers, attacking them as obsessed Javerts. We’re seeing a development of a new equilibrium.

Just by analogy, consider the much-discussed phenomenon of Netflix-style movies where every action on the screen is announced by the characters as it is happening. This is so annoying! It’s certainly not the pattern with every Netflix show, but it happens a lot with the more generic offerings you’ll see on that and other streaming channels, and it’s said that the reason is that people have a movie on in the background while they’re doing something else so the running explanation allows the movie to be followed in that passive way. Then when more movies and shows are produced in this way, it encourages more of this background watching, and you get a new equilibrium in which everybody’s paying less attention–the viewers are paying less attention to what’s happening on the screen, and the actors, directors, and producers are paying less attention. Is this a bad thing? Maybe not! Maybe it’s a throwback to the classic era of radio drama, I don’t know. The point is that close reading, or close watching, is a choice. At least, it’s a choice if I’m reading or watching in my native language. When I’m reading or watching in French, I need to apply my full concentration at all times or I’ll lose the thread.

The 4 aspects of close reading

Sinykin gives two examples of close reading, one from the Odyssey and one from the Bible. These examples made it clear that close reading is four things.

1. Most directly, close reading is figuring out the literal meaning of the text. Who are the characters, who is saying what (this can be tricky when reading long stretches of dialogue), who’s alive and who’s dead, what is the sequence of action, etc. It’s possible to read a story and fail in this very basic task, sometimes because the author is hiding things (Agatha Christie, Gene Wolfe, etc.), sometimes because the story is itself incoherent (it’s easier to think of examples from movies where continuity or logic is violated, but this can happen in written stories as well), sometimes just because you’re reading quickly, watching the story go by, and not focusing on the details.

2. The second step of close reading is understanding the characters’ motivations: not just what is happening but why. Sometimes this is explicitly stated, but usually you have to figure it out. In this category I’d also place whatever struggles the reader might have with unreliable narrators, information gaps, and whatever deliberate ambiguities are in the text.

3. The third step is following all the details that flow by. Often I read a book and enjoy it, but only on rereading do I notice all sorts of little things that I zipped by the first time in my rush to follow the story. This can even happen on the umpteenth reading! I recently reread Forlesen, and it was full of fun bits that I’d not previously caught. I’m not talking about subtle references or misdirections or deep themes or “easter eggs” or whatever, just the granular bits of conversation, thought, and event that I’d earlier skipped without noticing.

4. Finally, close reading involves understanding a literary work in its historical and cultural context. This has two parts. From one direction, in reading a story or watching a movie or TV show we can learn a lot about the time and place when it was produced, just from things happening in the background–patterns of speech, clothing styles, the way people are milling around in street scenes, etc.–this is what we call the Speed Racer principle. From the other direction, if you know something about the culture within which a work was produced, you can get additional insights into what the author of the story is trying to say.

It’s that fourth aspect of close reading that Sinykin focuses on:

Late in The Odyssey, Odysseus, who has endured 10 years of wandering to return home from the Trojan War, encounters his childhood nurse. No one has yet recognized him, and he does not want to be recognized. He appears a stranger. His erstwhile nurse washes his feet and, in doing so, sees a scar on his thigh, startling her into recognition. The mark on the body becomes, once noticed by a caring, knowing observer, auratic, suffused with meaning. At that moment, Homer interrupts the story with some 70 lines about how Odysseus suffered the wound that left the scar, only to pick up when the nurse drops Odysseus’ foot in the basin.

We might think, given how we have learned to read stories in our time, that Homer interjects the history of the scar into the scene to induce a feeling of suspense, suggests [philologist Erich] Auerbach. But we would be wrong. Suspense requires a distinction between foreground and background, which is unknown to Homer, who writes everything in a fully saturated now. While narrating the history of the scar, he does not expect us to be waiting to find out what happens with the nurse. He expects us, argues Auerbach, to be 100 percent in the presence of the past. Homer must describe the scar because if he did not, we would be left with an unexplained, mysterious detail, which he cannot bear. Everything must be illuminated. He must account for the scar. Everything in Homer proceeds with clarity, “never a lacuna, never a gap, never a glimpse of unplumbed depth.”

Sinykin shares a related close reading of a biblical passage, then concludes:

We can learn about a people through its style, its literature, which bears an ineradicable record of its version of reality. This, at least, was Auerbach’s gambit. The method is close reading. Others do it differently and can be no less exhilarating. It starts with a cultivation of sensitivity to art and language.

Agreed. I’d just add that this has several aspects, starting with the most basic of trying to figure out, as a reader, what exactly is happening in the story and what could be going on in the minds of the characters.

Close reading in statistics

This was all on my mind because just last year we were discussing the connections between close reading in literature and close reading in statistics (see also here).

As a statistician, when I read a report closely I go through the four steps listed above:

1. First, I try to put in the effort to understand exactly what was going on in the experiments being discussed. This can be difficult! Research papers often don’t include crucial information such as how exactly the experiment is done and what measurements were taken.

2. Next, it’s useful to understand the authors’ scientific goals. This is usually pretty clear from the way the results are presented.

3. Then there’s the struggle to follow all the details. A paper can have a lot of graphs and tables, and each one can take a lot of close reading to figure out. Especially when the paper has errors, as in the notorious work of Brian Wansink or Richard Tol.

4. Finally, the context of the work. Is this a Psychological Science paper from the 2010-2015 era? A natural experiment from the bad old days of regression discontinuity analysis? Or maybe something that we would expect to be done well? As with a story or novel, it’s good to know what genre you’re reading. And, from the other direction, the just-taken-for-granted aspects of a paper can give us insight into the scientific culture that it came from.

What exactly is “close reading”?

After all this, I was wondering how other people define “close reading,” so I looked up the term on wikipedia:

In literary criticism, close reading is the careful, sustained interpretation of a brief passage of a text. A close reading emphasizes the single and the particular over the general, via close attention to individual words, the syntax, the order in which the sentences unfold ideas, as well as formal structures. Close reading is thinking about both what is said in a passage (the content) and how it is said (the form, i.e., the manner in which the content is presented), leading to possibilities for observation and insight. . . .

In the practice of literary studies, the technique of close reading emerged in 1920s Britain in the work of I. A. Richards, his student William Empson, and the poet T. S. Eliot, all of whom sought to replace an “impressionistic” view of literature then dominant with what Richards called a “practical criticism” focused on language and form. American New Critics in the 1930s and 1940s anchored their views in similar fashion, and promoted close reading as a means of understanding that the autonomy of the work (often a poem) mattered more than anything else, including authorial intention, the cultural contexts of reception, and most broadly, ideology.

Hmmm, interesting.

The first paragraph above is a good match for how I was thinking about close reading and how Sinykin discusses the concept.

But the description at the end of the second paragraph, describing the attitude of the American New Critics, is pretty much the exact opposite of what we were discussing! Sinykin explicitly talked about how you can use knowledge of the Homeric and biblical contexts to understand what the authors of those passages were doing, and I was saying something similar with regard to reading scientific papers.

So now I’m confused: Is close reading centered on an understanding of cultural and historical context and authorial intention (my take, and I think Sinykin’s) or is it about “the autonomy of the work . . . more than authorial intention, the cultural contexts of reception, and . . . ideology”? What’s going on here???

P.S. I can’t figure out how Sinykin’s article ended up at a sports site. I once published something at Baseball Prospectus, but my article was actually about baseball so that made a bit more sense. I’m not complaining–Sinykin’s post was interesting, and it was written in a friendly, nonacademic style that fit in with other articles at that site–I just wonder how it happened. I see that, in addition to teaching English at Emory University, Sinykin is also a professor of Quantitative Methods. So maybe he’ll appreciate this post!

Russian roulette: You can have a deterministic potential-outcome framework, or an asymmetric utility function, but not both

Jonas Mikhaeil and I write:

It has been proposed in medical decision analysis to express the “first do no harm” principle as an asymmetric utility function in which the loss from killing a patient would count more than the gain from saving a life. Such a utility depends on unrealized potential outcomes, and we show how this yields a paradoxical decision recommendation in a simple hypothetical example involving games of Russian roulette. The problem is resolved if we allow the potential outcomes to be random variables. This leads us to conclude that, if you are interested in this sort of asymmetric utility function, you need to move to the stochastic potential outcome framework. We discuss the implications of the choice of parameterization in this setting.

I like this paper! Working out the example and writing it up helped me understand a bunch of things that had puzzled me regarding causal modeling and inference.
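To see why an asymmetric utility "depends on unrealized potential outcomes," here is a minimal sketch. It is not the Russian roulette example from the paper; it is a generic two-outcome illustration, and the death rates and the harm weight are made-up numbers. The marginal death rates under control and under treatment do not pin down how many patients are killed by the treatment (they would have lived untreated but die treated); that share can fall anywhere between the standard Fréchet bounds. With a symmetric utility this does not matter, but once harm counts extra, the utility is no longer determined by the marginal rates.

```python
def harm_bounds(p_death_control, p_death_treated):
    """Range of P(killed by treatment) consistent with the two marginal death rates.

    "Killed by treatment" means: would survive under control, dies under treatment.
    These are the Frechet bounds on that joint probability.
    """
    lower = max(0.0, p_death_treated - p_death_control)
    upper = min(p_death_treated, 1.0 - p_death_control)
    return lower, upper


def asymmetric_utility(p_death_control, p_death_treated, p_killed, harm_weight):
    """Expected utility of treating: P(saved) minus harm_weight times P(killed)."""
    # P(saved) - P(killed) must equal the difference in marginal death rates.
    p_saved = p_killed + p_death_control - p_death_treated
    return p_saved - harm_weight * p_killed


# Hypothetical marginal death rates (assumptions for illustration only).
p0, p1 = 0.5, 0.25
least_harm, most_harm = harm_bounds(p0, p1)

for weight in (1.0, 2.0):  # 1.0 = symmetric utility, 2.0 = harm counts double
    u_best = asymmetric_utility(p0, p1, least_harm, weight)
    u_worst = asymmetric_utility(p0, p1, most_harm, weight)
    print(f"harm weight {weight}: utility between {u_worst} and {u_best}")
```

With a harm weight of 1 the two ends of the range coincide, so only the marginal rates matter; with a weight of 2 they separate, and the decision hinges on a quantity that no one observes.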

Jonas and I engaged on this project after hearing from Amanda Kowalski about her recent paper with Neil Christy, which got us thinking about what you can get from stochastic models for potential outcomes.

P.S. Here’s the final version of our paper, ultimately titled “Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals.”

A study is conducted on two groups. When does it make sense to report two separate estimates, and when does it make sense to just report the pooled estimate?

A journalist writes that he read a paper reporting on a medical experiment conducted on two different groups of people, and all that was reported was the estimated average effects. In this case, the treatment when applied to people in the first group was qualitatively different from the treatment when applied to the second group.

That is, there were groups 1 and 2, and in each group, there was a comparison of T to C. The journalist wanted to see T1 – C1 and T2 – C2, but all that was reported was (T1 + T2)/2 – (C1 + C2)/2, and the concern was that T1 and T2 were two different things. When asked why he didn’t share the separate estimates for 1 and 2, the author said that his team didn’t do this because they didn’t want to risk introducing too many statistical comparisons into their analysis.

The journalist asked for my thoughts on this, and I replied as follows:

Yes, I’ve seen people do this sort of averaging before. Sometimes it’s a mistake, other times it makes some sense because the separate estimates can be so noisy. The situation is that you can get a more stable estimate of (A+B)/2 than you can of either A or B, so that’s cool. The bad news is that now you’re not estimating either A or B, you’re estimating (A+B)/2, so the question is what interpretation does this have.
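The stability point can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the two group estimates are independent and using made-up effect sizes and standard errors (none of these numbers come from the paper the journalist read):

```python
import math


def pooled_average(est_1, se_1, est_2, se_2):
    """Equal-weight average of two independent estimates, with its standard error."""
    est = (est_1 + est_2) / 2
    se = math.sqrt(se_1 ** 2 + se_2 ** 2) / 2
    return est, se


# Hypothetical group-level estimates of T1 - C1 and T2 - C2.
effect_1, se_1 = 0.30, 0.20
effect_2, se_2 = -0.10, 0.20

avg, se_avg = pooled_average(effect_1, se_1, effect_2, se_2)

print(f"group 1: {effect_1:.2f} (se {se_1:.2f})")
print(f"group 2: {effect_2:.2f} (se {se_2:.2f})")
print(f"average: {avg:.2f} (se {se_avg:.2f})")
```

With equal standard errors, the average has a standard error smaller by a factor of the square root of 2. But in this hypothetical the average sits between two effects of opposite sign, so the more stable number describes neither group.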

Here’s an example where some averaging is ok. Way back a few decades ago my colleagues and I estimated “the incumbency advantage” in congressional elections. We estimated the effect separately for each election year, which made sense, because the effect was changing over time. It could’ve been reasonable to estimate by averaging over each decade, because within any given decade it doesn’t change so much and then you get a more stable estimate. What we did, though, was estimate for each year and then plot the time series of estimates, so that the reader could do the smoothing by eye–I think that was the best way to go, short of fitting a hierarchical time series model, which would’ve been more work (but maybe now this would be the way to go).

What we did not do, though, was separately estimate the incumbency advantage for Democrats and for Republicans. Actually, we did separate estimates, plotted the separate time series, and they just looked like two noisy versions of the same thing, so we decided to make things simple and estimate a single incumbency advantage for each year. I think this was ok, largely because (A+B)/2 can be interpreted as the average incumbency advantage for that year, and (a) there’s no strong theoretical reason to think the incumbency advantage would be much different between the two parties, and (b) even if it is, we’re estimating an average incumbency advantage, which has a clear enough interpretation.

Here’s an example where averaging doesn’t make sense to me. Many years ago I was working with some colleagues who were studying civil war. I don’t remember all the details, but the basic story was they were fitting logistic regression to predict whether a country would be in civil war. The data were country-years, and the outcome was 1 if civil war and 0 if not. I argued that they should be fitting two separate models: one model predicting the probability that a civil war starts in a given country and year, and one model predicting whether a civil war ends. These would be fit to two different datasets, the first being all the country-years that were not already in civil war and the second being the others. So, for example, the United States from 1789-1861 and 1866-present would be in the first dataset, and the United States from 1861-1865 would be in the second dataset. There’d be a lot more data points in dataset 1 than in dataset 2; that’s just the way it is. The point is that there’s no good reason to be interested in averages of these two processes.
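The split into two datasets can be sketched as follows. This is an illustration only: the record layout, the function name, and the toy country are hypothetical, the data are assumed to have no gaps between years, and assigning each country-year by the previous year's state is one possible convention for the boundary years, not necessarily the one used in the original project.

```python
def split_by_prior_state(records):
    """Split country-year records into an onset dataset and a termination dataset.

    records: iterable of (country, year, in_war) tuples, with in_war equal to 0 or 1.
    A country-year is assigned by the country's state in the previous observed year:
      at peace before -> onset dataset, outcome is 1 if a war starts
      at war before   -> termination dataset, outcome is 1 if the war ends
    The first year observed for each country is dropped (prior state unknown).
    """
    onset, termination = [], []
    previous = {}  # country -> in_war value for the most recent year processed
    for country, year, in_war in sorted(records):
        if country in previous:
            if previous[country] == 0:
                onset.append((country, year, in_war))
            else:
                termination.append((country, year, 1 - in_war))
        previous[country] = in_war
    return onset, termination


# Toy data for one hypothetical country: peace, peace, war, war, peace, peace.
toy = [("A", 1, 0), ("A", 2, 0), ("A", 3, 1), ("A", 4, 1), ("A", 5, 0), ("A", 6, 0)]
onset, termination = split_by_prior_state(toy)
print("onset dataset:      ", onset)
print("termination dataset:", termination)
```

Each dataset would then get its own logistic regression, so that the predictors of a war starting are not forced to share coefficients with the predictors of a war ending.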

I don’t know enough about the context of the problem to say more than that.

Generalizing for Sampling and Causal Inference (my talk 3pm today at the University of Maryland)

Monday, April 28, 2025, 3:00 PM, 1101 A. James Clark Hall, University of Maryland, College Park:

We can combine model and design-based inference to address the following challenges of generalizing from sample to population: sparse data, small-area estimation, adjustment for non-census variables, cluster sampling, and survey weights. The methods are intellectually exciting and also important in the real world, as we demonstrate using examples in public health and public opinion, medical research, and policy analysis.

There will be discussion from Barry Graubard and Partha Lahiri. I’ll be discussing this paper and some related ideas.

It’s always fun to come back to the University of Maryland. I took classes in probability and stochastic processes there, many years ago.

P.S. Here’s a link to the talk.

Friday 10am: Online conversation on “Experiments, Causal Inference, and Limits of Evidence” with Nancy Cartwright and Berna Devezer

It’s organized by Martin Paul Fritze and Cait Lamberton from the Center for Empirical Philosophy and Behavioral Insights, it takes place 10am Fri 25 Apr 2025, and the zoom link is here.

The three of us will each speak for 15 minutes and then there will be discussion. I have no idea what any of us will say . . . it should be full of surprises.

As background here are some of my earlier interactions with Cartwright and Devezer:

Benefits and limitations of randomized controlled trials: I agree with Deaton and Cartwright

More on possibly rigor-enhancing practices in quantitative psychology research

And, as a bonus, this from Dan Simpson:

Dan’s Paper Corner: Can we model scientific discovery and what can we learn from the process?

and this from Jessica Hullman:

Taking theory more seriously in psychological science

Show up to this event and come prepared with some tough questions!

Dumb statistical models, always making people look bad

This is Jessica. In my Prediction for Decision-making class last quarter we talked about human predictions and decisions relative to algorithmic ones. This led to a discussion about why it’s often hard to demonstrate the value of human knowledge once you have a decent statistical model. 

Some readers may be aware of studies under the heading of “clinical versus statistical prediction,” most of which have found that statistical models generally outperform or match the accuracy of human predictors (see, e.g., Meehl’s 1954 book, where he summarizes the evidence at the time, and other meta-analyses). Then more recently (last 10 years or so) there has been an uptick in studies showing that when you give human decision-makers access to AI model predictions, they tend to do worse than the AI alone.

There are a few ways to look at this from the standpoint of information that is available to the decision-maker. One is that human knowledge is valuable for guiding the development of the model, but once you have a statistical model, it’s a better aggregator of the information. This is echoed by research on judgmental bootstrapping, where a statistical model trained on a human expert’s past judgments will tend to outperform that expert.
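A small simulation shows how judgmental bootstrapping can work. It is a sketch under strong, made-up assumptions: a single cue, an expert whose judgments follow the cue plus case-to-case inconsistency, and inconsistency that is pure noise unrelated to the outcome. The model is fit only to the expert's judgments and never sees an outcome.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(predictions, outcomes):
    """Mean squared error of predictions against realized outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(outcomes)


def simulate(n=2000, seed=1):
    rng = random.Random(seed)
    cues = [rng.gauss(0, 1) for _ in range(n)]
    outcomes = [x + rng.gauss(0, 1) for x in cues]   # what actually happens
    judgments = [x + rng.gauss(0, 1) for x in cues]  # expert: right policy, noisy execution

    half = n // 2
    # "Bootstrap" the expert: regress past judgments (not outcomes) on the cue.
    intercept, slope = fit_line(cues[:half], judgments[:half])
    model_predictions = [intercept + slope * x for x in cues[half:]]

    expert_mse = mse(judgments[half:], outcomes[half:])
    model_mse = mse(model_predictions, outcomes[half:])
    return expert_mse, model_mse


expert_mse, model_mse = simulate()
print(f"expert MSE on new cases:          {expert_mse:.2f}")
print(f"model-of-expert MSE on new cases: {model_mse:.2f}")
```

The model wins here because it applies the expert's own policy without the inconsistency. If the expert's deviations from that policy carried real information about the outcome, which this sketch rules out by construction, the comparison could go the other way.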

This can be seen as a result of how we evaluate predictions. Regardless of how a human might arrive at some judgment on a specific case, we have to evaluate them actuarially (i.e., over a set of cases defined on some reference group, like other patients with similar characteristics). Minimizing loss over aggregates is what a statistical model is designed to do, so if you evaluate human judgment against statistical predictions in aggregate on data similar to what the model was trained on, then you should expect statistical prediction to win (unless your model really sucks) because it defines optimal use of the information for the task you’ve set up. From this perspective we shouldn’t be surprised. 

But the difficulty of showing the value of human expertise once you have a decent model can still make people uncomfortable, because it seems to contradict intuitions we have about what humans bring to decision scenarios. There are ways in which these intuitions can be shown to be misguided, but also some perspectives from which they seem to hold weight. 

Myths about human advantages

First let’s talk about what common beliefs about the superiority of clinical judgment tend to miss. For example, there’s an intuition that people can detect when a new instance is anomalous and step in to improve on the model’s prediction. I.e., there will always be long-tail events that are underrepresented in training data, and people will know better what to do on these cases. A frequently cited example is what to expect from the person with the broken leg. In his book Meehl uses the example of estimating the probability that a professor, who is often at the movies on Tuesday night (with probability 0.9 according to the actuarial table), will go to the movies this Tuesday. Imagine that we know that he just broke his leg and got a hip cast (and let’s also assume this is pre-Americans with Disabilities Act). The clinician knows that this rules out the movies since he won’t fit in the seat with his bulky cast.

But how often does this kind of scenario arise? Grove summarizes Meehl’s analysis of why this example does not vindicate clinical judgment:

First, the base rate of people breaking a leg is low, and so “broken leg cases” need not substantially decrease the accuracy of statistical predictions. Second, in the broken leg example, we have a highly reliable theory allowing us to predict clinically that Professor A has a very small probability of going to the movies; the theory rests on the physics of fitting a person with a hip cast into a 1954 movie seat. However, behavioral science theories are extremely seldom as well corroborated as those of physical mechanics. Third, the broken leg in Meehl’s example reduces the probability of movie attendance to zero, or some figure close to it. By contrast, when rare events occur in applied psychology, the event seldom guarantees that a person will, or will not, engage in the behavior of interest. Indeed, human behavior is so multidetermined that even unusual events typically change the probabilities modestly, or at most moderately. In sum, it is easy to see that “broken leg cases” exist and offer an opportunity for the clinician to do what the formula cannot. However, as Paul always maintained, it is very difficult to know whether a given case is a bona fide broken leg case, i.e., whether the clinician should overrule the actuarial prediction for a particular individual, or follow it.

A variation on the intuition that people can deal better with anomalies is that they can detect when a model is going outside its training distribution and adjust the prediction. But what do we have to assume for that to be a reliable benefit of human judgment? Larry Hedges, a fellow Northwestern faculty member, gave a talk a while back where he walked through some examples from medicine and education. One was the example of an educator who has access to evidence from a randomized experiment (an estimate of the average treatment effect for some population) of some intervention that can improve student learning outcomes. The educator has more detailed information about the specific sample of students in their classroom relative to the general population from the controlled experiment. Should we trust them to be able to estimate the conditional average treatment effect if they apply the intervention in their classroom, and adjust how they apply it based on this? Hedges’ point was to show, given various ways of setting up this problem, that we repeatedly find that this would imply the educator had access to more data than is realistic, potentially many times as much as was used to estimate the ATE from the randomized controlled trial.
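I don't have Hedges' exact setups at hand, so the following is only a simplified stand-in for the kind of calculation involved, not his analysis. It assumes a two-arm trial with equal allocation, a common outcome standard deviation, and a subgroup that makes up a fixed fraction of the population. The function names and the numbers are hypothetical.

```python
import math


def se_diff_in_means(sigma, n_total):
    """Standard error of a difference in means with n_total split equally across two arms."""
    n_per_arm = n_total / 2
    return sigma * math.sqrt(2 / n_per_arm)


def trial_size_for_subgroup(n_total, subgroup_fraction):
    """Total sample size needed so the subgroup alone has n_total observations."""
    return n_total / subgroup_fraction


# Hypothetical numbers: a 400-student trial, and a classroom "type" that is 5% of the population.
se_ate = se_diff_in_means(sigma=1.0, n_total=400)
se_subgroup = se_diff_in_means(sigma=1.0, n_total=400 * 0.05)
needed = trial_size_for_subgroup(400, 0.05)
print(se_ate, se_subgroup, needed)
```

In this toy version, matching the precision of the ATE estimate for a subgroup that is 5% of the population takes 20 times as much data as the original trial, which gives a sense of why the educator's implied information can look unrealistic.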

This doesn’t necessarily mean that people can’t improve upon model predictions when a model is clearly out of domain (see, e.g., a recent paper on model-assisted decision making with Matt Hardy, Dan Goldstein, Jake Hofman, and Sam Zhang where we looked at how well people could predict weather in a new city given access to predictions from a model trained on a different city). We just can’t easily argue that this kind of tailoring of statistical evidence is due to people having more direct experience with the new domain. 

I remember Hedges concluding that if there’s anything that he expects to get from a human doctor over statistical predictions, it’s perhaps about knowing how to get him to adhere to treatment plans, e.g., reading his mood or what motivates him to pay attention. 

One of the other reasons that is sometimes cited for why humans should have some advantage is that they can act in the world to acquire more information as needed, while a predictive model, such as might be used to assist in diagnosing medical conditions or making treatment decisions, will not by default. But access to extra features alone can be hard to motivate as a general reason for expecting human superiority. If there are some important and easily elicitable pieces of information that aren’t available to the model (e.g., some info about the patient’s preferences, more recent measurements, or results from some test we order conditional on a risk prediction), we would probably be trying either to bring them into the prediction model or to design a separate statistical decision rule for incorporating them, and either of these would ultimately outperform the human judgment. So our reasons for having humans involved end up being because we haven’t yet done what we should be doing.

Some better theories of human advantages

I think there is something to be said for human judgment along a few lines. 

One is about the ability to construct causal theories. This paper by Felin and Holweg describes the difference as not just a matter of differing information access between the human and the model; it’s that the human can a) conceive of possibilities that may contradict the prior evidence, and b) construct tests to see if these “delusions” might hold water. Humans excel at forward-looking causal reasoning, which leads to opportunities for exploring novel ideas that the backward-looking imitative statistical learning paradigm can’t match. We can reason about counterfactuals and act when we think there’s something convincing. They summarize this difference as humans being driven by data-belief asymmetries in ways that statistical models can’t be, i.e., people can hold beliefs that seem to contradict the prior evidence (e.g., the possibility of heavier-than-air human-powered flight immediately prior to the Wright brothers’ experiments) but which, when explored through thoughtful experimentation, hold weight.

I had a chat with Felin awhile back that got me thinking about the example of venture capitalists deciding what startups to back (which he mentions here). Apparently firms that use AI to help with such decisions are less likely to back the rare but truly innovative companies that go on to achieve major success compared to those that don’t. For this kind of decision under huge uncertainty and with built-in domain shift perhaps humans are not so bad.  

A related perspective to the humans-as-forward-looking-causal-reasoners hypothesis that I find compelling is that humans excel at identifying the right level of abstraction to reason about things in the world. There are plenty of demonstrations of statistical models relying on superficial clues (barns in the background of images of sheep) that are only loosely related to the target task (identify the animal) and sensitive to specific training conditions (conditions under which these particular photos were taken). Humans, at least superficially, seem more capable of developing abstractions that allow them to extract information that can be translated robustly from one situation to another; e.g., the relevant concept to reason about animals being near barns is farm animals, not just sheep. Or imagine a self-driving car encountering an exploding fire hydrant. If such events are very sparse in the training data, it will be very uncertain about how to proceed, whereas a human will understand that it means water on the road and go from there.

P.S. It’s important to note that just because head-to-head comparisons with statistical models tend to make people look bad, this doesn’t mean that there isn’t sometimes value in human knowledge over statistical predictions. In AI-assisted decision-making research, there is work on learning to defer or learning with abstention, which is about optimizing performance by figuring out which decision tasks to give to the human versus the model based on estimated performance on different regions of the feature space. There have also been a few recent papers that attempt to quantify or test for unique complementary information on the part of the human decision-makers. We have a few papers about quantifying the value of complementary information in AI-assisted decision-making. Alur et al. have a related paper on tests for when humans can discriminate between instances that are indistinguishable to a statistical model.

These papers don’t get at exactly what it is that humans bring that is outside the known context of the problem, they simply show how to identify when humans have some complementary information to model predictions.
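To make the learning-to-defer idea concrete, here is a deliberately crude sketch. It assumes the feature space has already been cut into named regions and that we have held-out records of whether the model and the human were each correct. Actual learning-to-defer methods typically learn the routing rule jointly with the predictor rather than comparing raw accuracies, and the function name, region labels, and toy data below are all hypothetical.

```python
from collections import defaultdict


def fit_deferral_policy(records):
    """records: iterable of (region, model_correct, human_correct) from held-out data.

    Returns a dict mapping each region to "human" or "model", whichever
    had more correct decisions there (ties go to the model).
    """
    counts = defaultdict(lambda: [0, 0])  # region -> [model hits, human hits]
    for region, model_ok, human_ok in records:
        counts[region][0] += int(model_ok)
        counts[region][1] += int(human_ok)
    return {
        region: "human" if human_hits > model_hits else "model"
        for region, (model_hits, human_hits) in counts.items()
    }


# Made-up validation data: the model is better on routine cases, the human on unusual ones.
records = (
    [("routine", True, True)] * 7
    + [("routine", True, False)] * 2
    + [("routine", False, False)] * 1
    + [("unusual", False, True)] * 3
    + [("unusual", True, True)] * 1
    + [("unusual", True, False)] * 1
)
policy = fit_deferral_policy(records)
print(policy)
```

Note that this only tells us where the human does better on past data, not why, which is the limitation described above.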

Blue Rose Research is hiring!

Blue Rose Research has a few roles that we’re actively hiring for as we gear up to elect more Democrats in 2026 and advance progressive causes!

A bit about our work:

  • For the 2024 US election, we used engineering and data science to advise major progressive organizations on directing hundreds of millions of dollars to the right ads and states.

  • We tested thousands of social media videos, ads and talking points in the 2024 election cycle and partnered with orgs across the space to ensure that the most effective messages were deployed from the state legislative level all the way up to the Presidential race, and spanning the issue advocacy space as well.

  • We also tracked public opinion to inform overall strategy and make sure that decisionmakers had a more comprehensive understanding of what issues were most important to voters.

  • And we’ve built up a technical stack that enables us to keep developing innovative machine learning, statistical, and engineering solutions.

Now as we are looking ahead to 2026, we are hiring for the following positions:

    • Data Scientist – Machine Learning (Salary: $130k – $170k)
      Our message testing team deploys custom models in Python to estimate the causal treatment effect of videos and messages. Apply if this sounds fun! (It is.)

    • Machine Learning Engineer (Salary: $130k – $170k)
      Our engineering team uses off-the-shelf and custom fine-tuned LLMs to enhance our traditional modeling pipelines as well as develop new cutting-edge applications. Apply if this sounds fun! (It is.)

  • And our overall jobs board will be updated here as additional roles open up

All positions are remote, with optional office time with the team in New York City.

Please circulate and apply!

Reach out with any questions ([email protected]).

Causality and Crime: In science as in genre storytelling, the thrill of the unexpected can only come with reference to (and in confounding) some preexisting norm.

In his final book, Perplexing Plots, the late David Bordwell wrote:

It’s not accidental that mystery stories are drawn to tricky shufflings of viewpoint or chronology. A plotline built on a detective’s present-time inquiry into past events helps us understand when the order of events is rearranged or character perspective changes.

This reminds me of two things related to statistics:

1. From my article with Jessica, Is Your Chart a Detective Story? Or a Police Report?:

Every data visualization is a story, a plot to be unraveled—but some are more approachable than others. Modern statistical displays of data—grids of scatterplots for inspecting correlations, for example—succeed by being transparent and allowing trends in the data to stand out. In contrast, classic data visualizations often succeed, paradoxically, by being a bit opaque: a puzzle that a reader figures out. . . .

In science we are delighted by unexpected brilliance, which we immediately try to systematize. The same goes for visualization: When we see a new and revelatory graph, we want to take it apart and see how it works. . . .

We can liken this experience to narrative, a lens through which many great (and lesser) works of art have been interpreted. Narrative involves some interplay between plot and perspective, events and interpretation, storyline and characters. Similarly, the practice of science can be viewed as the interplay between data and models. Data are the facts. Models are the characters whose perspectives and assumptions shape what we take away from the story. At the simplest level, the choice of how to visualize data structures the viewer’s experience of those data by promoting certain comparisons over others. It’s a character choice, a choice of model. . . .

Much has been written about how different forms of narrative involve the reader in different ways, from the relatively passive engagement of viewers of a film, to the more active involvement of those following a serial television drama, to the experience of people reading novels who must in a sense create entire movies in their heads. Data visualizations can fall in different places along this continuum. The stories told by some are so strong and clear that they require little from the viewer. Others are far more demanding. One could draw an analogy to works of art that are more or less accessible to the audience—but with the difference that hard-to-follow art is often intentionally ambiguous, whereas challenging visualizations are meant to be understood. In that sense, visualizations are more like video games than art or music. They invoke a trial-and-error experience reminiscent of the “active learning” approaches studied by educational psychologists.

As with video games, it is often the more unconventional visualizations that are the most appealing ones, even to broad audiences. That which is not familiar is more challenging; and aesthetic choices, like the use of pleasing shapes and symmetry, can help entice the viewer to try and solve the puzzle. . . .

What is exciting and unconventional is also a function of our expectations. Music is said to be compelling to the extent that it balances expectation and surprise: A note is interesting when it catches us off-guard, but then it should also make sense within the larger pattern of the piece as it develops. The same is true for storytelling: The thrill of the unexpected can only come with reference to (and in confounding) some preexisting norm.

In addition to addressing the issue of the pleasures of difficulty in narrative, this seems closely related to another of Bordwell’s points, which is that genre fiction can be highly experimental in form and makes this accessible by placing those innovations in a stylized context that is comfortable to readers or viewers.

2. The forward logic of the data generation process and the reverse logic of inference. In a statistical “generative model” or “directed acyclic graph,” there is a logical order: decisions and outcomes happen in time and can influence what comes in the future. In statistical learning, we start with the data and go backward to make inference about parameters that have already been generated and forward to make inference about predictive quantities. When we fit a model and apply it to the future, we’re going back and forth in time.
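Here is a toy version of that back-and-forth, using a beta-binomial model chosen purely for convenience. The uniform Beta(1, 1) prior, the function names, and the numbers are all assumptions for illustration.

```python
import random


def simulate(theta, n, rng):
    """Forward in time: a parameter generates binary outcomes."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]


def posterior(data, prior_a=1.0, prior_b=1.0):
    """Backward: from observed data to a Beta posterior for the parameter."""
    successes = sum(data)
    return prior_a + successes, prior_b + len(data) - successes


def predictive_prob(a, b):
    """Forward again: posterior predictive probability that the next outcome is 1."""
    return a / (a + b)


rng = random.Random(1)
data = simulate(theta=0.7, n=50, rng=rng)  # generation runs parameter -> data
a, b = posterior(data)                     # inference runs data -> parameter
print(a, b, predictive_prob(a, b))         # prediction runs parameter -> future data
```

The three functions are called in the order data generation, inference, prediction, but the inference step reasons about something (theta) that was fixed before any of the data existed.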

I think I’ve written something on the logic of data generation and the logic of inference in statistics, but no amount of searching turns anything up. The closest is my article with Guido, Why ask why? Forward causal inference and reverse causal questions, which also appears in Regression and Other Stories as section 21.5, “Causes of effects and effects of causes.”

“Exposing omitted moderators: Explaining why effect sizes differ in the social sciences”

Antonia Krefeld-Schwalb, Eli Rosen Sugerman, and Eric Johnson write:

Policymakers increasingly rely on behavioral science in response to global challenges, such as climate change or global health crises. But applications of behavioral science face an important problem: Interventions often exert substantially different effects across contexts and individuals. We examine this heterogeneity for different paradigms that underlie many behavioral interventions. We study the paradigms in a series of five preregistered studies across one in-person and 10 online panels, with over 11,000 respondents in total. We find substantial heterogeneity across settings and paradigms, apply techniques for modeling the heterogeneity, and introduce a framework that measures typically omitted moderators.

I like this. It reminds me of our piranha paper but directly informed by empirical data.

The focus on treatment interactions—equivalently, variation of effect size—makes sense to me. It’s something I’ve been thinking about for a long time, for example this from 2005, this from 2014, this from 2015, this from 2023, and our recent paper on causal quartets for visualizing varying treatment effects. Also this paper from 2004 that I’m still chewing on.

But I haven’t thought so much about modeling (rather than just describing) variation in effects, so this paper by Krefeld et al. seems like an important step forward.

Magnitude and direction

Also I was struck by this statement from the paper:

Moderators are associated with effect sizes through two paths—effecting manipulation intensity and interacting with the effect of the manipulation.

This reminds me of something regarding education research—really, policy research in general—that I’ve been saying a lot recently but haven’t written down. The idea is that education is like a vector with a magnitude and a direction. The magnitude is how hard students work on their own or in small groups—those are the two scenarios where most of the learning gets done—and the direction is what they learn.

As teachers, we have two jobs. Job #1 is to motivate students to learn, that is, to increase the magnitude. Job #2 is to teach correct and useful things, that is, to get the direction right. My books are a mix of #1 and #2. To help with the magnitude, we try to structure the material to be clear, to smooth the path to learning and to give students lots of handholds: stories, examples, math, code, explanations, homeworks, all sorts of things. To help with the direction, we work hard to include useful material and to remove or to argue against ideas we think are counterproductive. For example, when we were writing BDA back in the early 1990s, a big idea in Bayesian inference was decision theory of Bayes estimators, and another idea was Bayesian null hypothesis testing. We put in very little on those topics in the book, and most of what we did put in was to explain why we weren’t putting it in. From the other direction, we pivoted the book around three chapters on hierarchical modeling, model checking, and the relation between design and analysis: to us, these were important concepts that students might otherwise not see.

As you can see just from the above paragraph, it’s a lot easier for me to think about direction than magnitude, which makes sense because I think I’m a much better statistician than a teacher, and indeed my teaching is best done not one-on-one but rather in this broadcasty way by exploring ideas through writing.

To get back to education research: I think most education interventions, at least the ideas that get tested in controlled trials, are focused on improving magnitude. The idea is that the subject-matter experts are supposed to get the direction right, and the education researchers work on the magnitude.

But this has implications for education research! What I’m calling “the magnitude,” which is motivation for students to work hard, figure things out, and learn on their own or with peers, is in large part an interaction between the teacher and student. That’s right, interactions again!

And not just education research. So many social interventions are ultimately about motivation.

This idea—that the most important part of a treatment is in its interaction with the people being treated—is in direct conflict with the dominant approach of thinking about causal inference, what I call the black-box or push-a-button, take-a-pill model of science. Something’s gotta give, and maybe this new paper by Krefeld et al. will take us a little bit in the right direction.

P.S. I could do without the trolley example—I’d be happy to never again hear about that fat guy (described in this article as “a large man wearing a backpack,” which I guess is the politically correct way to say “fat guy” now), but, hey, it’s their paper, they can use whatever examples they want!

New Yorker magazine demonstrates a naive faith in social science

David Remnick writes:

How you interpret and prioritize the cascade of reasons for Trump’s reëlection is a kind of Rorschach test. It will require a long reckoning before anyone can conclude which of the leading factors—economic anxiety, cultural politics, racism, misogyny, Biden’s decline, Harris’s late start—was determinative.

Here I don’t want to talk about the content of Remnick’s article (much has been and will be written about the new administration, from many political perspectives), but rather about the implicit faith he shows in an impossible version of social science. An attitude which perhaps should be no surprise, coming from a magazine that regularly publishes Malcolm Gladwell.

Two things from the above-quoted sentence jump out at me.

First, the final word, “determinative.” Trump won the popular vote nationwide and in the decisive swing state by a little less than 2 percentage points, which, yeah, is close, so it makes sense to look at how different factors could’ve affected the outcome. But the idea that one factor was “determinative” . . . that’s just nuts. There’s no “determinative” here. Lots was going on.

Second, the phrase, “It will require a long reckoning before anyone can conclude,” which implies that there’s some true answer that will at some point be figured out. Now, let me be clear on this: often in history there is a true answer, and sometimes, with care and effort, researchers can figure it out. For example, here’s Walter Mebane’s estimate of the actual number of votes for Bush and Gore in Florida in 2000. At the time of the aborted vote counting, there was huge uncertainty about what the voters had actually wanted, but the actual tally of vote intentions now seems clear. This is separate from the legal question of what should’ve been done in November and December, 2000, but it’s an example where social science can, with time, resolve a potential factual dispute.

In this case, though, there’s no clear question to answer, and the idea that, after some “long reckoning,” anyone can conclude “which of the leading factors . . . was determinative,” makes no sense. Or, to put it another way, “anyone can conclude” anything at any time, but no long reckoning is required; indeed, the takes are coming thick and fast on the op-ed pages every day.

What can social science do?

To step back for a moment, my problem with the above-quoted snippet from the New Yorker article is that it seems to me to imply a faith in a simplistic version of social science, in which: (a) when something happens, there’s one factor that determines it, and (b) that, in time, people will figure out what that factor is.

I’m cool with the attitude that, with time, historians and social scientists can get a better understanding, both about what actually happened (looking at many data sources, not just exit polls and geographic vote totals) and about what could’ve happened otherwise (causal inference, counterfactuals and all that). So I’d’ve been cool if Remnick had just said something like, “It will require a long reckoning before we can have a clear sense of what actually happened during the campaign and the election, and before we can come to informed guesses about how things could have gone differently.”

Picky, picky?

Arguably, that last sentence of mine is what Remnick actually meant to say; he just wrote something that was a little bit sloppy, excusable given that he’s not a social scientist and also he was writing on deadline. His job is not to get all the nuances right; he’s writing for the New Yorker, not the American Political Science Review, after all, and if he uses the phrase “conclude which of the leading factors . . . was determinative” as shorthand for “how things could have gone differently,” then, no big deal, he got his point across.

The Speed Racer principle

So I’m not actually saying that Remnick did anything wrong here. Rather, what’s interesting to me is that I think his phrasing represents an implicit belief in social science as a way of finding the “determinative” factor.

Remember the Speed Racer Principle? Sometimes the most interesting aspect of a cultural product is not its overt content but rather its unexamined assumptions. I think this New Yorker quote is interesting in revealing an unexamined assumption about how the social world works, and what social science can do.

Individual probability, model multiplicity, and multicalibration

This is Jessica. I’ve been posting recently on questions related to individual probability, i.e., assigning probability to individual events, related to a course I just wrapped up where this was a theme. For example, previously we talked about how statistical exchangeability (where the joint distribution of a sequence of random variables is unaffected by the order in which they’re observed) is from one perspective all you need to reconcile differences between “groupist” (e.g., probability as long run frequency over a set of similar events) and “individualist” notions of the probability of a single event (e.g., a probability as based on some expert’s beliefs about a specific event conditioned on some prior experience). A fundamental problem with individual probabilities is the reference class problem: to assign a probability to an event that will only happen once, we have to identify some group of events that we believe capture the essential characteristics of the event in question and estimate the probability over that. But often there will be several equally-appropriate-seeming reference groups that we must choose between. For example, if we have a criminal defendant with a combination of prior arrests and convictions that we have never seen together before, which subset of these features do we use in estimating their probability of committing another crime if released, assuming that using different reference groups results in different conditional probabilities?
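A toy illustration of the problem, with entirely made-up records and hypothetical field names: the same new case gets a different probability depending on which feature we condition on, and the full combination has no data at all.

```python
def rate(records, **conditions):
    """Observed outcome rate among records matching all the given feature values.

    Returns None when no record matches, i.e., the reference class is empty.
    """
    matches = [r for r in records if all(r[k] == v for k, v in conditions.items())]
    if not matches:
        return None
    return sum(r["outcome"] for r in matches) / len(matches)


def make(arrests, convictions, outcomes):
    return [{"arrests": arrests, "convictions": convictions, "outcome": o} for o in outcomes]


# Made-up historical records; nobody in them has both prior arrests and prior convictions.
records = (
    make(True, False, [1, 0, 0, 0])
    + make(False, True, [1, 1, 0, 1])
    + make(False, False, [0, 0])
)

# A new case with both features could be placed in any of these reference classes.
print(rate(records))                                  # everyone
print(rate(records, arrests=True))                    # condition on arrests only
print(rate(records, convictions=True))                # condition on convictions only
print(rate(records, arrests=True, convictions=True))  # the exact combination: no data
```

With these invented numbers the three usable reference classes give 0.4, 0.25, and 0.75, and nothing in the data says which one is the right one for the new case.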

The reference class problem is associated with the idea that there is often no unique way to assign a conditional probability to a particular event. Consequently, predictive multiplicity, the fact that machine learning problems often admit multiple competing models that perform more or less equally well, has been accepted as a consequence of the underspecification of individual probabilities in the machine learning literature. Multiplicity is sometimes described as the Rashomon effect, or through the idea of a Rashomon set, i.e., a set of models that predict equally accurately but which make conflicting predictions on some subset of data space. For example, in Breiman’s well known paper on the two cultures of statistical modeling, he uses the Rashomon effect to argue that we have to be careful not to draw conclusions about the process that generated some data using a single explanatory model unless we can somehow rule out all the competing models. This possibility has been discussed more recently in conjunction with concerns like algorithmic fairness, where the presence of other equally accurate models that assign some specific person the opposite prediction is considered unsettling.

But is multiplicity due to the underspecification of individual probabilities really a fundamental property of predictive models learned from data? Imagine you have two models that are equally supported by the data (and make predictions close to the true conditional probabilities for the various possible reference groups). But they disagree non-trivially in their predictions. Should we accept that such situations exist and cannot necessarily be resolved? 

A new paper by Roth and Tolbert reframes this as “the reference class problem at scale.” More concretely, say you have some distribution over a collection of elements representing different combinations of features, which might represent, e.g., different people’s records. Assume some true function F that maps from features of these records to outcomes. We can define a set of reference classes representing subsets of the elements; for example, if we have a universe of elements representing different combinations of an age, race, and income variable, then we can define subsets for different possible combinations of the values of these variables.

A model can be considered consistent with a reference class if, when you average the model’s predictions over elements in that reference class, you are within some error epsilon of the true rate over that reference class.

One way to then think about predictive multiplicity is a case where we have at least two models that are consistent on all of the reference classes for some error bound epsilon, but that frequently make different predictions. In Roth and Tolbert’s characterization, this means that the probability that their predictions differ by more than some small amount epsilon is greater than epsilon. 
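Here is a small sketch of these two definitions. It is a toy, not Roth and Tolbert's formal setup: it assumes four equally likely elements described by two binary features, treats the true outcome probabilities as known, and uses made-up numbers and hypothetical function names.

```python
def class_average(values, members):
    return sum(values[e] for e in members) / len(members)


def is_consistent(model, truth, reference_classes, eps):
    """True if the model's average prediction is within eps of the true rate on every class."""
    return all(
        abs(class_average(model, c) - class_average(truth, c)) <= eps
        for c in reference_classes
    )


def disagreement_probability(model_a, model_b, eps):
    """Share of (equally weighted) elements where the two models differ by more than eps."""
    return sum(abs(model_a[e] - model_b[e]) > eps for e in model_a) / len(model_a)


# Elements are (x1, x2) pairs; the probabilities are invented for illustration.
truth = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.8}
model_a = dict(truth)
model_b = {(0, 0): 0.35, (0, 1): 0.25, (1, 0): 0.45, (1, 1): 0.95}

# Reference classes defined by one feature at a time.
marginal_classes = [
    [(0, 0), (0, 1)], [(1, 0), (1, 1)],  # x1 = 0, x1 = 1
    [(0, 0), (1, 0)], [(0, 1), (1, 1)],  # x2 = 0, x2 = 1
]
# Finer reference classes: each combination of the two features.
cell_classes = [[e] for e in truth]

eps = 0.05
print(is_consistent(model_a, truth, marginal_classes, eps))
print(is_consistent(model_b, truth, marginal_classes, eps))
print(disagreement_probability(model_a, model_b, eps))
print(is_consistent(model_b, truth, marginal_classes + cell_classes, eps))
```

Both models are consistent on every one-feature reference class, yet they differ by 0.15 on every element. Adding the finer classes exposes the second model as inconsistent.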

Now, returning to the previous question, should we accept such multiplicity as a fact of life in machine learning, an inevitable symptom of an underspecified learning problem? 

The recent work by Roth and Tolbert and a more technical version by Roth et al. argue that this kind of multiplicity is always resolvable, at least in theory. The answer to finding resolution lies in multicalibration, which I’ve discussed previously on the blog. A multicalibrated model is one whose predictions are approximately calibrated on every group in any efficiently identifiable, possibly intersecting collection of groups that are supported by the data.

Multicalibration can be achieved by algorithms that work via a boosting process. For any reference class with sufficient probability mass on which the model is not found to be consistent, with access to data sampled from the same distribution, we can produce a new model with a lower squared error. We continue doing this until we arrive at a model that is consistent with all of the reference classes.
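A stripped-down sketch of this patching loop follows. It is a simplification in at least two ways, both of which are my assumptions: it enforces only average consistency on each reference class (weaker than full multicalibration, which also conditions on the predicted value), and it uses known true rates where a real algorithm would estimate them from sampled data. Elements are equally weighted, and the function names and numbers are hypothetical.

```python
def class_gap(model, truth, members):
    """True rate minus average prediction over a reference class."""
    return sum(truth[e] - model[e] for e in members) / len(members)


def squared_error(model, truth):
    return sum((model[e] - truth[e]) ** 2 for e in truth) / len(truth)


def patch_until_consistent(model, truth, reference_classes, eps, max_rounds=10_000):
    """Repeatedly shift predictions on a violated class by its gap until none is violated."""
    model = dict(model)
    for _ in range(max_rounds):
        violated = [c for c in reference_classes if abs(class_gap(model, truth, c)) > eps]
        if not violated:
            return model
        members = violated[0]
        gap = class_gap(model, truth, members)
        for e in members:
            # Each shift lowers the squared error, so the loop cannot cycle forever.
            model[e] = min(1.0, max(0.0, model[e] + gap))
    raise RuntimeError("did not reach consistency within max_rounds")


truth = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.8}
start = {(0, 0): 0.35, (0, 1): 0.25, (1, 0): 0.45, (1, 1): 0.95}
classes = [
    [(0, 0), (0, 1)], [(1, 0), (1, 1)],
    [(0, 0), (1, 0)], [(0, 1), (1, 1)],
    [(0, 0)], [(0, 1)], [(1, 0)], [(1, 1)],
]
patched = patch_until_consistent(start, truth, classes, eps=0.01)
print(squared_error(start, truth), squared_error(patched, truth))
```

Shifting a class by its gap reduces the squared error by the class's share of the elements times the gap squared, which is why a class needs enough probability mass (and a large enough gap) for the patch to count as progress.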

This leads Roth and colleagues to argue that: 

although individual probabilities are unknowable, they are contestable via a computationally and data efficient process that must lead to agreement. Thus we cannot find ourselves in a situation in which we have two equally accurate and unimprovable models that disagree substantially in their predictions—providing an answer to what is sometimes called the predictive or model multiplicity problem. 

In other words, they show that with sufficient data we can improve one or both of the models until we have a multicalibrated model. Put this way, this is not so surprising. The contributions of these papers are more nuanced of course, and include showing how bounds on the amount of data needed scale with parameters like the error bound and probability mass of the considered subsets.

I like how framing predictive multiplicity in terms of models being inconsistent with reference groups nicely connects underspecification of individual probabilities with model multiplicity. This link has previously been left vague. My only real complaint (which applies to lots of theory papers related to calibration) is the downplaying of the distinction between theoretically possible and practically possible in some of the statements. E.g., in the quote above, saying “we cannot find ourselves in a situation in which we have two equally accurate and unimprovable models that disagree substantially in their predictions” requires some qualification, because we can absolutely find ourselves in a situation where we have multiple models that disagree and we don’t have sufficient data (because we’re dealing with rare classes/very large labels and limited data). 

One question this work has me now thinking about is when observing model multiplicity is still useful, even if you could reconcile the models to some extent through cross-calibration. It’s directly relevant to some work that Abhraneel Sarma, Dawei Xie and I have been doing related to Cynthia Rudin’s vision that predictive multiplicity is a good thing, e.g., because it provides room for identifying models that align with human preferences like fairness or monotonicity constraints on the relationship between features and outcome. Our part of this has been to develop an interactive interface (and more generally think through a workflow for incorporating domain expertise in model selection) for Rashomon sets of Generalized Additive Models. Here, the individual models in the set can differ in terms of what features they include, how they are coded, and what their shape functions look like. 

It seems that when a model’s predictions are intended as a decision aid for a human expert, then rather than trying to reconcile the multiplicity entirely by cross-calibrating, we may do better by deploying a model that aligns with the domain experts’ preferences. One reason that I’ve brought up before is that trusting the calibration data can seem to contradict the reasons for including humans in a decision process in the first place: often we want them there because we are wary of distributions shifting or the model predictions unknowingly reflecting artifacts in the training data. Involving the expert in model selection to ensure the deployed model captures key aspects of how human experts think features relate to the outcome (e.g., having asthma should not result in lower risk) may produce better decisions in practice by providing some robustness against shifts, even if we could have improved calibration according to historical data. There are also problems that arise when experts have access to some highly performant model but it conflicts with their expectations about how features should be used. I’ve heard stories about this happening in medical settings where it’s discovered after a fancy new model is deployed that it’s basically being ignored.

On that claim about “How does energy impact economic growth”

Hanno Böck writes:

I recently saw a graphic coming from here posted multiple times on social media that I found quite misleading in its data representation.

There exist some variations of it, but they all share the same problem.

The most notable issue is that the graphic uses logarithmic scales on both axes. This has the effect of squeezing everything together on the upper right end and visually creates a much stronger correlation than there actually is.

Another thing to note, and this is where I’d be curious what you think about it, is that it gives an R^2 value of 0.8 at the bottom. First of all, R^2 is, as far as I can tell, not something that can be easily and intuitively understood (it seems a simple r coefficient would be more appropriate). But that’s not the main problem. The value is, as far as I can tell, simply wrong.

When I try to calculate R^2 for that data, I get 0.43. It appears that what was done here was to calculate the R^2 value over the log values of the input data. (If I do that, I get 0.81.)

In case you want to play with the data, here’s some quick python I wrote to create similar graphs with a non-log scale, and the relevant data sources from the world bank and EIA.

My reply:

I don’t think the logarithmic scale is a problem, and it’s fine to compute the R-squared of log-scaled data. In any case, the scatterplot tells the story; I don’t think R-squared adds anything here.
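To see how much the reported number can depend on the scale, here is a minimal sketch in Python. The data are simulated, not the World Bank or EIA series from the graph; the power-law slope (0.8) and the noise level (0.7) are made-up values chosen only for illustration, and `r_squared` is a small helper written for this example.

```python
import math
import random

def r_squared(x, y):
    """Squared Pearson correlation, which equals R^2 for a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

rng = random.Random(0)

# Hypothetical data: x spans three orders of magnitude and
# y follows a power law in x with multiplicative noise.
log_x = [rng.uniform(math.log(100), math.log(100_000)) for _ in range(200)]
log_y = [0.8 * lx + rng.gauss(0, 0.7) for lx in log_x]
x = [math.exp(v) for v in log_x]
y = [math.exp(v) for v in log_y]

r2_raw = r_squared(x, y)          # R^2 computed on the original scale
r2_log = r_squared(log_x, log_y)  # R^2 computed on the log scale

print(f"R^2 on raw values: {r2_raw:.2f}")
print(f"R^2 on log values: {r2_log:.2f}")
```

The two numbers answer different questions: one is the share of variance explained by a straight line on the original scale, the other the share explained by a straight line on the log scale. Neither is "the" R-squared of the data, which is one more reason to look at the scatterplot.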

I clicked through to the source, and the real problem seems to be their title, “How does energy impact economic growth.” The data they show are cross-sectional with no such causal implication.

Böck responded:

I’m surprised that you don’t see a problem in the log scale. I believe this is the main issue with this graph. (As a rule of thumb, I’d say log scales should rarely be used in public communication at all, as they are not easy to understand intuitively. If they are used, there needs to be a good explanation, which I don’t see here.)

To maybe illustrate this more clearly, I have attached linear and log-scaled versions of the data. To me, they tell a different story. The log version implies that there is a general, strong correlation between electricity consumption and per capita gdp. But the actual data tells me that the correlation is only present below a certain threshold, and above that, we have extreme differences of energy use in countries with very similar gdp levels. (E.g. quite rich countries like Denmark/Switzerland with a very low electricity use.)

Regarding your point about causal inference, that’s probably a valid point as well, but not really what I’m trying to get at here. The reason is that I don’t think that blog post got a lot of attention, but the graphic is shared very widely.

Böck posted a longer discussion here. Setting aside the above-discussed issues with the log scale and R-squared, the rest of his post has interesting economics content.

The piranha principle: What does it mean, exactly?

Barnabas Szaszi writes:

I’m contacting you now to get your advice/interpretation on the piranha theorems regarding a paper I’m writing about the generalizability of behavioral interventions. Here, I use the Piranha theorem to back my claim that the average effect of nudges is small:

First, there are theoretical reasons to assume that the average effect of choice architecture interventions is modest. It has been widely argued that in systems such as human behavior – and more broadly in social and behavioral sciences – phenomena are causally dense (Meehl, 1967, Gelman, 2011, Almaatouqa et al. 2022). The Piranha theorem (Gelman 2017 blogpost, Tosh et al. 2025) suggests that in such systems, many influencing factors operate concurrently, the interaction and interference of these factors would overwhelm most main effects leading to predominantly small effects.

I was wondering if you could help to clarify whether my interpretation is correct, these effects remain usually individually small because they overwhelm each other, or they are rather unlikely to remain individually small because of their cumulative impact?

My reply:

The basic piranha paradox is that there are subfields where many papers claim to find large and general effects, but of different factors. In the wild, where all these factors are operating in an uncontrolled fashion, the result of adding all these large effects, if these effects are independent, is that outcomes will be extremely variable. Given that real-world behavior is not so unpredictable, this implies that, either there are not a large number of large effects, or that the large effects happen to have large negative interactions that cause them to cancel out. But in that latter case the original claim (that effects are general) is false.

It’s also correct to say that many small effects won’t remain small, in the sense that, even when researchers talk about small effects, these effects are not so small. For example, consider an intervention that is claimed to increase some desirable behavior by 10%. That sounds small, but given that any intervention won’t work on most people, an average treatment effect of 0.1 is still large; see here. If you have many different treatments, each with an average effect of 0.1, then the piranha issue still arises: in the wild, with these treatments operating in an uncontrolled way, all these effects will add up in unexpected ways, leading to complete chaos–unless the effects interact in a way to mostly cancel each other out, in which case the effects of each treatment are highly context-dependent, in a way that would cast doubt on the original claims of universality.
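The adding-up argument can be sketched with a toy simulation. This is not the formal statement of the piranha theorems, just the variance arithmetic under strong assumptions: the factors are independent, their effects are purely additive, and each factor is "on" for a random half of people. The number of factors `K` and the effect size `d` are hypothetical values picked for illustration.

```python
import math
import random

rng = random.Random(1)

K = 100          # hypothetical number of independent factors
d = 0.5          # hypothetical effect of each factor, in outcome units
n_people = 5000  # simulated individuals

def total_effect():
    # Each factor is "on" with probability 1/2 and, when on, shifts the outcome by d.
    return sum(d for _ in range(K) if rng.random() < 0.5)

totals = [total_effect() for _ in range(n_people)]
mean = sum(totals) / n_people
sd_sim = math.sqrt(sum((t - mean) ** 2 for t in totals) / (n_people - 1))

# Each on/off factor contributes variance d^2 / 4, and independent variances add.
sd_theory = d * math.sqrt(K) / 2

print(f"simulated sd of the summed effects: {sd_sim:.2f}")
print(f"theoretical sd, d * sqrt(K) / 2:    {sd_theory:.2f}")
```

With these made-up numbers, the factors alone produce a standard deviation five times the size of any single effect. If observed outcomes vary less than that, then either the effects are smaller, or there are fewer of them, or they interact so as to cancel, which is the trilemma described above.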

The R-squared on this is kinda low, no? (Nobel prize edition)

An economist who would prefer anonymity points to the above wacky graph of a “robust regression.” It’s from a paper written by 2 out of the 3 recent Nobel prize winners in economics!

The full paper is here, and my correspondent points us to p. 921 of the published version.

My correspondent writes:

They can do what they want. They ignore all criticism, even when repeated by mainstream economists in whispers.

“They ignore all criticism” seems pretty standard in science. I guess the only hope is for the field to advance through external criticism. In that sense, it’s fine for questionable papers to be published, as long as data and code are made available and as long as the journals do not hold criticism to higher standards than the original work.

Unfortunately, giving out Nobel prizes is kind of the opposite of criticism (and here’s another recent example).

“Florida man eats diet of butter, cheese, beef; cholesterol oozes from his body”: How much am I to blame for this?

Oh, this is horrible, no joke:

What could go wrong with eating an extremely high-fat diet of beef, cheese, and sticks of butter? Well, for one thing, your cholesterol levels could reach such stratospheric levels that lipids start oozing from your blood vessels, forming yellowish nodules on your skin.

That was the disturbing case of a man in Florida who showed up at a Tampa hospital with a three-week history of painless, yellow eruptions on the palms of his hands, soles of his feet, and elbows. His case was published today in JAMA Cardiology.

The man, said to be in his 40s, told doctors that he had adopted a “carnivore diet” eight months prior. His diet included between 6 lbs and 9 lbs of cheese, sticks of butter, and daily hamburgers that had additional fat incorporated into them. Since taking on this brow-raising food plan, he claimed his weight dropped, his energy levels increased, and his “mental clarity” improved.

I’m uncomfortably reminded of my friend Seth Roberts, who died of heart failure not long after adopting a diet in which he ate half a stick of butter a day, which he claimed from self-experimentation to have improved his brain function.

The Florida man was ingesting a lot more than half a stick of butter per day; still, I was struck by his report of improved mental clarity, which uncomfortably echoed Seth’s claims from a decade earlier.

I feel some very small responsibility for this chain of events. Seth was a friend of mine at the University of California in the early 1990s–we taught a couple of courses together–and he kept me informed of his work on self-experimentation. In 2005 he published a research article on the topic, and I promoted it on the blog. Even then, his self-experimentation was unusual, but he hadn’t gone off the deep end. We had further blog discussion that year (also here), and Alex Tabarrok linked to it from his Marginal Revolution blog. From there it was picked up by Stephen Dubner and Steven Levitt in the New York Times. At the time I thought this multiple stage of amplification was pretty cool, and at this point Seth didn’t need my help to get publicity. Dubner and Levitt invited him to guest-blog on his diet at Freakonomics, then a year later Seth published a successful diet book, which I reviewed positively, as did Tabarrok. Around that time, I also published a couple of conversations with Seth on his research; see here and here. Seth also started his own blog and website where people using his diet could share tips and otherwise communicate. He was followed by mid-level celebrities such as Dennis Prager and Tucker Max.

Seth was an early figure in the paleo-lifestyle movement. Indeed, he was telling me his theories about the benefits of caveman diet and caveman lifestyle many years before I heard it anywhere else. In the late 1990s or early 2000s he even wrote a book on the topic, but he never found a publisher, and I guess he abandoned the project after publishing his diet book.

Going through the blog archives, I was surprised to learn that, as early as 2007, Seth was claiming cognitive benefits from oil consumption. He shared data showing it had improved his reaction time. I think this must have been a byproduct of him drinking oil as part of his weight-loss plan. Here’s what I wrote back in 2007:

Encouraged by the success of his self-experimentation to help his sleep, mood, and weight concerns, Seth Roberts has been experimenting with the effects of drinking flaxseed oil. . . . Commenting on another recent one of Seth’s self-experiments, I wrote,

Seth,
Not to be a wet blanket or anything, but aren’t you worried that your findings might be due to expectation effects: you knew which oil you were taking when doing the tests, right?

Seth replied,

Andrew, no, I’m not worried that the results are due to expectations. If the results always conformed to my expectations, I’d be worried, but they haven’t — see my post about eggs. Moreover, this particular result confirms a result that was a surprise. In other words, I’ve gotten the same result when I was expecting it and when I wasn’t expecting it.

I’m still concerned, though. Seth is saying that it’s not just an expectation effect because he wasn’t always expecting the results. But I could see a bias arising from positive feedback, as follows: You try a new treatment and then see what happens after, with no expectations except that things might change. There is some noise to this measurement–just at random, it will be higher or lower than before. Having seen this, you adjust your expectations; this then affects your next measurement, etc.

In retrospect, I think I was too mild. I suggested to Seth: “maybe you could get a partner in experimentation, someone who lives or works nearby, and he or she could give you a randomly assigned oil. That is, your partner would know which oil you’re getting, but you wouldn’t. In fact, you wouldn’t even know if you were being given something new that day.” Seth agreed but wrote, “they are low on my to-do list.”

Again, I wish I’d pushed back harder on that:

– A clean negative result could’ve given Seth (and me!) a healthy dose of skepticism and motivated him to think more carefully about his theories when deciding what experiments to try next.

– A clean positive result could’ve given him and his followers a level of confidence and motivated him to experiment more systematically, rather than spiraling out of control with goofy and possibly dangerous ideas such as eating a half a stick of butter a day.

We had another exchange in 2009, where, again, I said, “maybe this could all be a confirmation bias,” and, again, Seth pushed back, and, again, I now wish I’d taken a firmer line.

As the years went on, Seth gained confidence and his experiments became more and more dubious, culminating in the stick-of-butter thing which, maybe it wasn’t fatal, but maybe it was, and I doubt it was helpful, either to his well-being or others’.

This is a long chain of events, so I don’t feel a high degree of responsibility for my small part in encouraging the crazier aspects of the paleo lifestyle movement. But, to the extent that I was proud that my promotion of Seth’s work led to his media success, I should correspondingly feel bad about the negative consequences.

How did Laura Wattenberg’s baby name predictions turn out, 15 years later?

The famed baby name analyst (see also here, and lots more here) came out with a new post, 15 Years Ago I Made a Prediction. Here’s How it Turned Out:

I [Wattenberg] recently came across a list I created back in 2009. Parents Magazine had challenged me to look a decade into the future and predict the top baby names of 2019. To make it more interesting, I restricted my choices to names that ranked outside the top 40 at that time. Any bullseyes would be genuine predictions of change, not just staying the course.

My prediction list:

As of 2019, only three of the names—Amelia, Harper and Lucas—actually made it all the way to the top 10. (Henry joined them a year later.) As a group, though, the names rose in popularity by a hearty 88% from the statistical baseline I was working with.

Graph showing steady popularity growth of a set of 20 baby names from 2008-2019.
Table of the 20 predicted names showing that 17 of the 20 rose in popularity as of 2019.

So far, that looks like a solid set of predictions. But wait. To predict future top 10 hits, I naturally chose names that were already on the rise. Did my so-called expertise add any predictive value, or did the names simply follow their expected trendlines?

As a test, I created a second control list. These names were statistical doppelgangers to the predicted names at the starting point: close in popularity, and rising at a similar rate.

It turns out that the doppelganger names rose less dramatically and were already in decline a decade later. None made it to the top 10. Score one for expertise!

Line graph showing the popularity of "predicted" names rising steadily, and the popularity of "control" names rising only slightly then declining

What made my prediction list different from the control group? Looking name by name, the answer appears to be sounds—specifically, trendy ending sounds. My choices mostly avoided the hot suffixes of the 2000s like -den and -ton. My one concession to the trend, Peyton, turned out to be my single worst prediction. Meanwhile the biggest hits I failed to identify, Scarlett and Willow, have much more distinctive sounds.

The lesson is that not all popularity is created equal. Names that are part of creative sound trends share their style space with dozens of other similar names.

Good stuff. Also fun that she did a control group. We associate the control group with causal inference, but here she’s using it to evaluate a prediction, which mathematically is a similar problem–it’s a comparison!–but without the direct estimation of any causal effect.

How far can exchangeability get us toward agreeing on individual probability?

This is Jessica. What’s the common assumption behind the following? 

    • Partial pooling of information over groups in hierarchical Bayesian models 
    • In causal inference of treatment effects, saying that the outcome you would get if you were treated (Y^a) shouldn’t change depending on whether you are assigned the treatment (A) or not
    • Acting as if we believe a probability is the “objective chance” of an event even if we prefer to see probability as an assignment of betting odds or degrees of belief to an event

The question is rhetorical, because the answer is in the post title. These are all examples where statistical exchangeability is important. Exchangeability says the joint distribution of a set of random variables is unaffected by the order in which they are observed. 

Exchangeability has broad implications. Lately I’ve been thinking about it as it comes up at the ML/stats intersection, where it’s critical to various methods: achieving coverage in conformal prediction, using counterfactuals in analyzing algorithmic fairness, identifying independent causal mechanisms in observational data, etc. 

This week it came up in the course I’m teaching on prediction for decision-making. A student asked whether exchangeability was of interest because often people aren’t comfortable assuming data is IID. I could see how this might seem like the case given how application-oriented papers (like on conformal prediction) sometimes talk about the exchangeability requirement as an advantage over the usual assumption of IID data. But this misses the deeper significance, which is that it provides a kind of practical consensus between different statistical philosophies. This consensus, and the ways in which it’s ultimately limited, is the topic of this post.

Interpreting the probability of an individual event

One of the papers I’d assigned was Dawid’s “On Individual Risk,” which, as you might expect, talks about what it means to assign probability to a single event. Dawid distinguishes “groupist” interpretations of probability that depend on identifying some set of events, like the frequentist definition of probability as the limiting frequency over hypothetical replications of the event, from individualist interpretations, like a “personal probability” reflecting the beliefs of some expert about some specific event conditioned on some prior experience. For the purposes of this discussion, we can put Bayesians (subjective, objective, and pragmatic, as Bob describes them here) in the latter personalist-individualist category. 

On the surface, the frequentist treatment of probability as an “objective” quantity appears incompatible with the individualist notion of probability as a descriptor of a particular event from the perspective of the particular observer (or expert) ascribing beliefs. If you have a frequentist and a personalist thinking about the next toss of a coin, for example, you would expect the probability the personalist assigns to depend on their joint distribution over possible sequences of outcomes, while the frequentist would be content to know the limiting probability. But de Finetti’s theorem shows that if someone believes a sequence of events to be exchangeable, then you can’t distinguish their beliefs about those random variables from conceiving of independent events with some underlying probability. Given a sequence of exchangeable Bernoulli random variables X1, X2, X3, … , you can think of a draw from their joint distribution as sampling p ~ mu, then drawing X1, X2, X3, … from Bernoulli(p) (where mu is a distribution on [0,1]). So the frequentist and personalist can both agree, under exchangeability, that p is meaningful for decision making. David Spiegelhalter recently published an essay on interpreting probability that he ended by commenting on how remarkable this pragmatic consensus is.
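The two-stage description is easy to simulate. The sketch below takes mu to be Uniform(0,1), which is an assumption made only for illustration (the theorem allows any distribution on [0,1]), and looks at just the first two variables X1 and X2.

```python
import random

rng = random.Random(2)
n = 100_000

# Tally the four possible outcomes of (X1, X2).
counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
for _ in range(n):
    p = rng.random()            # stage 1: p ~ mu, here Uniform(0, 1) by assumption
    x1 = int(rng.random() < p)  # stage 2: X1, X2 drawn independently from Bernoulli(p)
    x2 = int(rng.random() < p)
    counts[(x1, x2)] += 1

freq = {k: v / n for k, v in counts.items()}

# Exchangeable: swapping the order leaves the probability unchanged (both are 1/6 in theory).
print("P(X1=1, X2=0) ~", round(freq[(1, 0)], 3))
print("P(X1=0, X2=1) ~", round(freq[(0, 1)], 3))
# Not independent once p is integrated out: theory gives 1/3, not 1/2 * 1/2 = 1/4.
print("P(X1=1, X2=1) ~", round(freq[(1, 1)], 3))
```

Under this choice of mu, the order of the outcomes doesn’t matter, but seeing X1 = 1 does raise the probability that X2 = 1. That is the personalist’s learning from experience; conditional on p, the draws are the frequentist’s independent tosses.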

But Dawid’s goal is to point out ways in which the apparent alignment is not as satisfactory as it may seem in resolving the philosophical chasm. It’s more like we’ve thrown a (somewhat flimsy) plank over it. Exchangeability may sometimes get us across by allowing the frequentist and personalist to coordinate in terms of actions, but we have to be careful how much weight we put on this.  

The reference set depends on the state of information

One complication is that the personalist’s willingness to assume exchangeability depends on the information they have. Dawid uses the example of trying to predict the exam score of some particular student. If they have no additional information to distinguish the target student from the rest, the personalist might be content to be given an overall limiting relative frequency p of failure across a set of students. But as soon as they learn something that makes the individual student unique, p is no longer the appropriate reference for the individual student’s probability of passing the exam. 

As an aside, this doesn’t mean that exchangeability is only useful if we think of members of some exchangeable set as identical. There may still be practical benefits of learning from the other students in the context of a statistical model, for example. See, e.g., Andrew’s previous post on exchangeability as an assumption in hierarchical models, where he points out that assuming exchangeability doesn’t necessarily mean that you believe everything is indistinguishable, and if you have additional information distinguishing groups, you can incorporate that in your model as group-level predictors.

But for the purposes of personalists and frequentists agreeing on a reference for the probability of a specific event, the dependence on information is not ideal. Can we avoid this by making the reference set more specific? What if we’re trying to predict a particular student’s score on a particular exam in a world where that particular student is allowed to attempt the same exam as many times as they’d like? Now that the reference group refers to the particular student and particular exam, would the personalist be content to accept the limiting frequency as the probability of passing the next attempt? 

The answer is, not necessarily. This imaginary world still can’t get us to the generality we’d need for exchangeability to truly reconcile a personalist and frequentist assessment of the probability. 

Example where the limiting frequency is constructed over time

Dawid illustrates this by introducing a complicating (but not at all unrealistic) assumption: that the student’s performance on their next try on the exam will be affected by their performance on the previous tries. Now we have a situation where the limiting frequency of passing on repeated attempts is constructed over time. 

As an analogy, consider drawing balls from an urn, where when we draw our first ball, there is 1 red ball and 1 green ball in it. Upon drawing a ball, we immediately return it and add an additional ball of the same color. At each draw, each ball in the urn is equally likely to be drawn, and the sequence of colors is exchangeable.

Given that p is not known, which do you think the personalist would prefer to consider as the probability of a red ball on the first draw: the proportion of red balls currently in the urn, or the limiting frequency of drawing a red ball over the entire sequence? 

Turns out in this example, the distinction doesn’t actually matter: the personalist should just bet 0.5. So why is there still a problem in reconciling the personalist assessment with the limiting frequency?

The answer is that we now have a situation where knowledge of the dynamic aspect of the process makes it seem contradictory for the personalist to trust the limiting frequency. If they know it’s constructed over time, then on what ground is the personalist supposed to assume the limiting frequency is the right reference for the probability on the first draw? This gets at the awkwardness of using behavior in the limit to think about individual predictions we might make.
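Here is a small simulation of the urn, as a sketch of the scheme described above (start with 1 red and 1 green, return each drawn ball along with one more of the same color). The number of urns and the number of draws per urn are arbitrary choices, and a finite run only approximates the limiting frequency.

```python
import random

rng = random.Random(3)

def polya_draws(n_draws):
    """Simulate one urn; return a list of booleans, True meaning a red ball was drawn."""
    red, green = 1, 1
    draws = []
    for _ in range(n_draws):
        is_red = rng.random() < red / (red + green)
        draws.append(is_red)
        # Return the ball and add one more of the same color.
        if is_red:
            red += 1
        else:
            green += 1
    return draws

n_urns, n_draws = 2000, 500
runs = [polya_draws(n_draws) for _ in range(n_urns)]

# Across urns, how often is the first draw red?
first_red = sum(run[0] for run in runs) / n_urns

# Within each urn, the long-run fraction of red draws (a stand-in for the limiting frequency).
long_run = [sum(run) / n_draws for run in runs]

print(f"share of urns whose first draw is red: {first_red:.2f}")
print(f"long-run red fraction ranges from {min(long_run):.2f} to {max(long_run):.2f}")
```

The first draw is red about half the time, matching the bet of 0.5, but the long-run fraction within any one urn can land almost anywhere between 0 and 1. That fraction is an outcome of the process, not something fixed before the first draw, which is what makes it an awkward reference for the first draw’s probability.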

Why this matters in the context of algorithmic decision-making

This example is related to some of my prior posts on why calibration does not satisfy everyone as a means of ensuring good decisions. The broader point in the context of the course I’m teaching is that when we’re making risk predictions (and subsequent decisions) about people, such as in deciding who to grant a loan or whether to provide some medical treatment, there is inherent ambiguity in the target quantity. Often there are expectations that the decision-maker will do their best to consider the information about that particular person and make the best decision they can. What becomes important is not so much that we can guarantee our predictions behave well as a group (e.g., calibration) but that we understand how we’re limited by the information we have and what assumptions we’re making about the reference group in an individual case.