“On the uses and abuses of regression models: a call for reform of statistical practice and teaching”: We’d appreciate your comments . . .

John Carlin writes:

I wanted to draw your attention to a paper that I’ve just published as a preprint: On the uses and abuses of regression models: a call for reform of statistical practice and teaching (pending publication I hope in a biostat journal). You and I have discussed how to teach regression on a few occasions over the years, but I think with the help of my brilliant colleague Margarita Moreno-Betancur I have finally figured out where the main problems lie – and why a radical rethink is needed. Here is the abstract:

When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. More broadly, statistics is widely understood to provide a body of techniques for “modelling data”, underpinned by what we describe as the “true model myth”, according to which the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective leads to a range of problems in the application of regression methods, including misguided “adjustment” for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline an alternative approach to the teaching and application of regression methods, which begins by focussing on clear definition of the substantive research question within one of three distinct types: descriptive, predictive, or causal. The simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of “input” variables, but their conceptualisation and usage should follow from the purpose at hand.

The paper is aimed at the biostat community, but I think the same issues apply very broadly at least across the non-physical sciences.

Interesting. I think this advice is roughly consistent with what Aki, Jennifer, and I say and do in our books Regression and Other Stories and Active Statistics.

More specifically, my take on teaching regression is similar to what Carlin and Moreno say, with the main difference being that I find that students have a lot of difficulty understanding plain old mathematical models. I spend a lot of time teaching the meaning of y = a + bx, how to graph it, etc. I feel that most regression textbooks focus too much on the error term and not enough on the deterministic part of the model. Also, I like what we say on the first page of Regression and Other Stories, about the three tasks of statistics being generalizing from sample to population, generalizing from control to treatment group, and generalizing from observed data to underlying constructs of interest. I think models are necessary for all three of these steps, so I do think that understanding models is important, and I’m not happy with minimalist treatments of regression that describe it as a way of estimating conditional expectations.

The first of these tasks is sampling inference, the second is causal inference, and the third refers to measurement. Statistics books (including my own) spend lots of time on sampling and causal inference, not so much on measurement. But measurement is important! For an example, see here.

If any of you have reactions to Carlin and Moreno’s paper, or if you have reactions to my reactions, please share them in comments, as I’m sure they’d appreciate it.

Conformal prediction and people

This is Jessica. A couple weeks ago I wrote a post in response to Ben Recht’s critique of conformal prediction for quantifying uncertainty in a prediction. Compared to Ben, I am more open-minded about conformal prediction and associated generalizations like conformal risk control. Quantified uncertainty is inherently incomplete as an expression of the true limits of our knowledge, but I still often find value in trying to quantify it over stopping at a point estimate.

If expressions of uncertainty are generally wrong in some ways but still sometimes useful, then we should be interested in how people interact with different approaches to quantifying uncertainty. So I’m interested in seeing how people use conformal prediction sets relative to alternatives. This isn’t to say that I think conformal approaches can’t be useful without being human-facing (which is the direction of some recent work on conformal decision theory). I just don’t think I would have spent the last ten years thinking about how people interact and make decisions with data and models if I didn’t believe that they need to be involved in many decision processes. 

So now I want to discuss what we know from the handful of controlled studies that have looked at human use of prediction sets, starting with the one I’m most familiar with since it’s from my lab.

In Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling, we study people making decisions with the assistance of a predictive model. Specifically, they label images with access to predictions from a pre-trained computer vision model. In keeping with the theme that real world conditions may deviate from expectations, we consider two scenarios: one where the model makes highly accurate predictions because the new images are from the same distribution as those that the model is trained on, and one where the new images are out of distribution. 

We compared their accuracy and the distance between their responses and the true label (in the Wordnet hierarchy, which conveniently maps to ImageNet) across four display conditions. One was no assistance at all, so we could benchmark unaided human accuracy against model accuracy for our setting. People were generally worse than the model in this setting, though the human with AI assistance was able to do better than the model alone in a few cases.

The other three displays were variations on model assistance, including the model’s top prediction with the softmax probability, the top 10 model predictions with softmax probabilities, and a prediction set generated using split conformal prediction with 95% coverage.
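For readers who haven’t worked with it, here is a minimal sketch of how split conformal prediction sets of this kind can be constructed from a held-out calibration set, using the simple 1 − softmax(true class) nonconformity score. This is a generic illustration, not the exact recipe from our paper (which may use a different score); the function names and the simulated data are mine.

```python
import numpy as np

def conformal_threshold(cal_softmax, cal_labels, alpha=0.05):
    """Calibrate a score threshold on held-out data, using the simple
    nonconformity score 1 - p_model(true label)."""
    n = len(cal_labels)
    scores = 1.0 - cal_softmax[np.arange(n), cal_labels]
    # finite-sample-corrected quantile targeting 1 - alpha marginal coverage
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q)

def prediction_set(softmax, qhat):
    """All labels whose nonconformity score falls under the calibrated threshold."""
    return np.flatnonzero(1.0 - softmax <= qhat)

# Illustrative usage with simulated data: 500 calibration examples, 100 classes.
rng = np.random.default_rng(0)
cal_softmax = rng.dirichlet(np.ones(100), size=500)
cal_labels = rng.integers(0, 100, size=500)
qhat = conformal_threshold(cal_softmax, cal_labels, alpha=0.05)
print(prediction_set(rng.dirichlet(np.ones(100)), qhat))
```

The key point is that the 95% is a marginal coverage guarantee over calibration and test data drawn from the same distribution, which is exactly what the out-of-distribution condition in our study stresses.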

We calibrated the prediction sets we presented offline, not dynamically. Because the human is making decisions conditional on the model predictions, we should expect the distribution to change. But often we aren’t going to be able to calibrate adaptively because we don’t immediately observe the ground truth. And even if we do, at any particular point in time we could still be said to hover on the boundary of having useful prior information and steering things off course. So when we introduce a new uncertainty quantification to any human decision setting, we should be concerned with how it works when the setting is as expected and when it’s not, i.e., the guarantees may be misleading.

Our study partially gets at this. Ideally we would have tested some cases where the stated coverage guarantee for the prediction sets was false. But for the out-of-distribution images we generated, we would have had to do a lot of cherry-picking of stimuli to break the conformal coverage guarantee as much as the top-1 coverage broke. The coverage degraded a little but stayed pretty high over the entire set of out-of-distribution instances for the types of perturbations we focused on (>80%, compared to 70% for top 10 and 43% for top 1). For the set of stimuli we actually tested, the coverage for all three was a bit higher, with top-1 coverage getting the biggest bump (70%, compared to 83% for top 10 and 95% for conformal). Below are some examples of the images people were classifying (where easy and hard is based on the cross-entropy loss given the model’s predicted probabilities, and smaller and larger refers to the size of the prediction sets).

We find that prediction sets don’t offer much value over top-1 or top-10 displays when the test instances are iid, and they can reduce accuracy on average for some types of instances. However, when the test instances are out of distribution, accuracy is slightly higher with access to prediction sets than with either top-k. This was the case even though the prediction sets for the OOD instances get very large (the average set size for “easy” OOD instances, as defined by the distribution of softmax values, was ~17, for “hard” OOD instances it was ~61, with people sometimes seeing sets with over 100 items). For the in-distribution cases, average set size was about 11 for the easy instances, and 30 for the hard ones.  

Based on the differences in coverage across the conditions we studied, our results are more likely to be informative for situations where conformal prediction is used because we think it’s going to degrade more gracefully under unexpected shifts. I’m not sure it’s reasonable to assume we’d have a good hunch about that in practice though.

In designing this experiment in discussion with my co-authors, and thinking more about the value of conformal prediction to model-assisted human decisions, I’ve been thinking about when a “bad” (in the sense of coming with a misleading guarantee) interval might still be better than no uncertainty quantification. I was recently reading Paul Meehl’s clinical vs. statistical prediction, where he contrasts the clinical judgments doctors make based on intuitive reasoning with statistical judgments informed by randomized controlled experiments. He references a distinction between the “context of justification” for some internal sense of probability that leads to a decision like a diagnosis, and the “context of verification” where we collect the data we need to verify the quality of a prediction.

The clinician may be led, as in the present instance, to a guess which turns out to be correct because his brain is capable of that special “noticing the unusual” and “isolating the pattern” which is at present not characteristic of the traditional statistical techniques. Once he has been so led to a formulable sort of guess, we can check up on him actuarially. 

Thinking about the ways prediction intervals can affect decisions makes me think that whenever we’re dealing with humans, there’s potentially going to be a difference between what an uncertainty expression says and can guarantee and the value of that expression for the decision-maker. Quantifications with bad guarantees can still be useful if they change the context of discovery in ways that promote broader thinking or taking the idea of uncertainty seriously. This is what I meant when in my last post I said “the meaning of an uncertainty quantification depends on its use.” But precisely articulating how they do this is hard. It’s much easier to identify ways calibration can break.

There are a few other studies that look at human use of conformal prediction sets, but to avoid making this post even longer, I’ll summarize them in an upcoming post.

P.S. There have been a few other interesting posts on uncertainty quantification in the CS blogosphere recently, including David Stutz’s response to Ben’s remarks about conformal prediction, and on designing uncertainty quantification for decision making from Aaron Roth.

A new piranha paper

Kris Hardies points to this new article, Impossible Hypotheses and Effect-Size Limits, by Wijnand and Lennert van Tilburg, which states:

There are mathematical limits to the magnitudes that population effect sizes can take within the common multivariate context in which psychology is situated, and these limits can be far more restrictive than typically assumed. The implication is that some hypothesized or preregistered effect sizes may be impossible. At the same time, these restrictions offer a way of statistically triangulating the plausible range of unknown effect sizes.

This is closely related to our Piranha Principle, which we first formulated here and then followed up with this paper. It’s great to see more work being done in this area.

Statistical practice as scientific exploration

This was originally going to happen today, 8 Mar 2024, but it got postponed to some unspecified future date, I don’t know why. In the meantime, here’s the title and abstract:

Statistical practice as scientific exploration

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Much has been written on the philosophy of statistics: How can noisy data, mediated by probabilistic models, inform our understanding of the world? After a brief review of that topic (in short, I am a Bayesian but not an inductivist), I discuss the ways in which researchers when using and developing statistical methods are acting as scientists, forming, evaluating, and elaborating provisional theories about the data and processes they are modeling. This perspective has the conceptual value of pointing toward ways that statistical theory can be expanded to incorporate aspects of workflow that were formerly tacit or informal aspects of good practice, and the practical value of motivating tools for improved statistical workflow, as described in part in this article: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf

The whole thing is kind of mysterious to me. In the email invitation it was called the UPenn Philosophy of Computation and Data Workshop, but then they sent me a flyer where it was called the Philosophy of A.I., Data Science, & Society Workshop in the Quantitative Theory and Methods Department at Emory University. It was going to be on zoom so I guess the particular university affiliation didn’t matter.

In any case, the topic is important, and I’m always interested in speaking with people on the philosophy of statistics. So I hope they get around to rescheduling this one.

Relating t-statistics and the relative width of confidence intervals

How much does a statistically significant estimate tell us quantitatively? If you have an estimate that’s statistically distinguishable from zero with some t-statistic, what does that say about your confidence interval?

Perhaps most simply, with a t-statistic of 2, your 95% confidence interval will nearly touch 0. That is, it extends about 100% of the estimate in each direction, so it covers everything from nothing (0%) to around double your estimate (200%).

More generally, for a 95% confidence interval (CI), 1.96/t — or let’s say 2/t — gives the relative half-width of the CI. So for an estimate with t=4, everything from half your estimate to 150% of your estimate is in the 95% CI.
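In code, the rule of thumb is just this (a quick sketch using the normal critical value; the function name is mine):

```python
from scipy import stats

def relative_ci(t, coverage=0.95):
    """CI endpoints as a fraction of the point estimate, for an estimate
    whose t-statistic (estimate / standard error) is t, using normal theory."""
    z = stats.norm.ppf(1 - (1 - coverage) / 2)   # critical value, 1.96 for 95%
    return 1 - z / t, 1 + z / t

for t in (1.96, 3, 4, 5, 6):
    lo, hi = relative_ci(t)
    print(f"t = {t}: 95% CI runs from {lo:.0%} to {hi:.0%} of the estimate")
# t = 4 gives roughly 51% to 149%, i.e., half to one and a half times the estimate.
```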

For other commonly-used nominal coverage rates, the confidence intervals have a width that is less conducive to a rule of thumb, since the critical value isn’t something nice like ~2. (For example, with 99% CIs, the Gaussian critical value is 2.58.) Let’s look at 90, 95, and 99% confidence intervals for t = 1.96, 3, 4, 5, and 6:

Confidence intervals on the scale of the estimate

You can see, for example, that even at t=5, the halved point estimate is still inside the 99% CI. Perhaps this helpfully highlights how much more precision you need to confidently state the size of an effect than just to reject the null.

These “relative” confidence intervals are just this smooth function of t (and thus the p-value), as displayed here:

confidence intervals on scale of the estimate by p-value and t-statistic

It is only when the statistical evidence against the null is overwhelming — “six sigma” overwhelming or more — that you’re also getting tight confidence intervals in relative terms. Among other things, this highlights that if you need to use your estimates quantitatively, rather than just to reject the null, default power analysis is going to be overoptimistic.

A caveat: All of this just considers standard confidence intervals based on normal theory labeled by their nominal coverage. Of course, many p < 0.05 estimates may have been arrived at by wandering through a garden of forking paths, or precisely because it passed a statistical significance filter. Then these CIs are not going to conditionally have their advertised coverage.

My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar)

Hi all. The other day I announced a talk I’ll be giving at the NYU economics seminar. It will be Thurs 18 Apr 12:30pm at 19 West 4th St., room 517.

In my earlier post, I’d given the wrong day for the talk. I’d written that it was this Thurs, 7 Mar. That was wrong! Completely my fault here; I misread my own calendar.

So I hope nobody shows up to that room tomorrow! Thank you for your forbearance.

I hope to see youall on Thurs 18 Apr. Again, here’s the title and abstract:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focus on causal identification, with this goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

Defining optimal reliance on model predictions in AI-assisted decisions

This is Jessica. In a previous post I mentioned methodological problems with studies of AI-assisted decision-making, such as are used to evaluate different model explanation strategies. The typical study set-up gives people some decision task (e.g., Given the features of this defendant, decide whether to convict or release), has them make their decision, then gives them access to a model’s prediction, and observes if they change their mind. Studying this kind of AI-assisted decision task is of interest as organizations deploy predictive models to assist human decision-making in domains like medicine and criminal justice. Ideally, the human is able to use the model to improve the performance they’d get on their own or if the model was deployed without a human in the loop (referred to as complementarity). 

The most frequently used definition of appropriate reliance is that if the person goes with the model prediction but it’s wrong, this is overreliance. If they don’t go with the model prediction but it’s right, this is labeled underreliance. Otherwise it is labeled appropriate reliance. 
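Concretely, this commonly used post-hoc labeling amounts to something like the following (a minimal sketch of the definition just described, not of our proposal; the function and argument names are mine):

```python
def label_reliance(followed_ai: bool, ai_correct: bool) -> str:
    """Post-hoc reliance labeling as commonly used in the literature:
    judged purely by whether the AI's recommendation turned out to be right."""
    if followed_ai and not ai_correct:
        return "overreliance"
    if not followed_ai and ai_correct:
        return "underreliance"
    return "appropriate reliance"

# e.g., a participant who followed an AI prediction that happened to be wrong:
print(label_reliance(followed_ai=True, ai_correct=False))  # "overreliance"
```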

This definition is problematic for several reasons. One is that the AI might have a higher probability than the human of selecting the right action, but still end up being wrong. It doesn’t make sense to say the human made the wrong choice by following it in such cases. Because it’s based on post-hoc correctness, this approach confounds two sources of non-optimal human behavior: not accurately estimating the probability that the AI is correct versus not making the right choice of whether to go with the AI or not given one’s beliefs. 

By scoring decisions in action space, it also equally penalizes not choosing the right action (which prediction to go with) in a scenario where the human and the AI have very similar probabilities of being correct and one where either the AI or human has a much higher probability of being correct. Nevertheless, there are many papers doing it this way, some with hundreds of citations. 

In A Statistical Framework for Measuring AI Reliance, Ziyang Guo, Yifan Wu, Jason Hartline and I write: 

Humans frequently make decisions with the aid of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human who retains control over the final decision. Researchers have identified ensuring that a human has appropriate reliance on an AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates the concepts of reliance as the probability the decision-maker follows the AI’s prediction from challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision making studies from literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not accurately differentiating the signals. We evaluate these losses by comparing to a baseline and a benchmark for complementary performance defined by the expected payoff achieved by a rational agent facing the same decision task as the behavioral agents.

It’s a similar approach to our rational agent framework for data visualization, but here we assume a setup in which the decision-maker receives a signal consisting of the feature values for some instance, the AI’s prediction, the human’s prediction, and optionally some explanation of the AI decision. The decision-maker chooses which prediction to go with. 

We can compute the upper bound or best attainable performance in such a study (rational benchmark) as the expected score of a rational decision-maker on a randomly drawn decision task. The rational decision-maker has prior knowledge of the data generating model (the joint distribution over the signal and ground truth state). Seeing the instance in a decision trial, they accurately perceive the signal, arrive at Bayesian posterior beliefs about the distribution of the payoff-relevant state, then choose the action that maximizes their expected utility over the posterior. We calculate this in payoff space as defined by the scoring rule, such that the cost of an error can vary in magnitude. 

We can define the “value of rational complementation” for the decision-problem at hand by also defining the rational agent baseline: the expected performance of the rational decision-maker without access to the signal on a randomly chosen decision task from the experiment. Because it represents the score the rational agent would get if they could rely only on their prior beliefs about the data-generating model, the baseline is the expected score of a fixed strategy that always chooses the better of the human alone or the AI alone.

figure showing human alone, then ai alone, then (with ample room between them) rational benchmark

If designing or interpreting an experiment on AI reliance, the first thing we might want to do is look at how close to the benchmark the baseline is. We want to see a decent amount of room for the human-AI team to improve performance over the baseline, as in the image above. If the baseline is very close to the benchmark, it’s probably not worth adding the human.

Once we have run an experiment and observed how well people make these decisions, we can treat the value of complementation as a comparative unit for interpreting how much value adding the human contributes over making the decision with the baseline. We do this by normalizing the observed score within the range where the rational agent baseline is 0 and the rational agent benchmark is 1 and looking at where the observed human+AI performance lies. This also provides a useful sense of effect size when we are comparing different settings. For example, if we have two model explanation strategies A and B we compared in an experiment, we can calculate expected human performance on a randomly drawn decision trial under A and under B and measure the improvement by calculating (score_A − score_B)/value of complementation. 
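In code, this bookkeeping is just a rescaling. Here is a small sketch with made-up numbers (not from the paper) to make the normalization concrete:

```python
def value_of_complementation(baseline, benchmark):
    """Room for improvement between the rational agent baseline
    (better of human alone or AI alone) and the rational benchmark."""
    return benchmark - baseline

def normalized_score(observed, baseline, benchmark):
    """Observed human+AI score rescaled so baseline = 0 and benchmark = 1."""
    return (observed - baseline) / (benchmark - baseline)

def effect_size(score_a, score_b, baseline, benchmark):
    """Difference between two conditions in units of the value of complementation."""
    return (score_a - score_b) / value_of_complementation(baseline, benchmark)

# Illustrative numbers: baseline 0.70, benchmark 0.90, and observed human+AI
# scores of 0.78 under explanation strategy A and 0.74 under strategy B.
print(normalized_score(0.78, 0.70, 0.90))   # ≈ 0.4
print(effect_size(0.78, 0.74, 0.70, 0.90))  # ≈ 0.2
```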

figure showing human alone, then ai alone, then human+ai, then rational benchmark

We can also decompose sources of error in study participants’ performance. To do this, we define a “mis-reliant” rational decision-maker benchmark, which is the expected score of a rational agent constrained to the reliance level that we observe in study participants. Hence this is the best score a decision-maker who relies on the AI the same overall proportion of the time could attain had they perfectly perceived the probability that the AI is correct relative to the probability that the human is correct on every decision task. Since the mis-reliant benchmark and the study participants have the same reliance level (i.e., they both accept the AI’s prediction the same percentage of the time), the difference in their decisions lies entirely in accepting the AI predictions at different instances. The mis-reliant rational decision-maker always accepts the top X% AI predictions ranked by performance advantage over human predictions, but study participants may not.

figure showing human alone, ai alone, human+ai, misreliant rational benchmark, rational benchmark

By calculating the mis-reliant rational benchmark for the observed reliance level of study participants, we can distinguish between reliance loss, the loss from over- or under-relying on the AI (defined as the difference between the rational benchmark and mis-reliant benchmark divided by the value of rational complementation), and discrimination loss, the loss from not accurately differentiating the instances where the AI is better than the human from the ones where the human is better than the AI (defined as the difference between the mis-reliant benchmark and the expected score of participants divided by the value of rational complementation). 
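Continuing the sketch above, the decomposition itself is a pair of normalized differences (illustrative numbers again, not from the paper):

```python
def decompose_losses(benchmark, misreliant_benchmark, observed, baseline):
    """Split the shortfall from the rational benchmark into:
    - reliance loss: cost of over- or under-relying on the AI overall
    - discrimination loss: cost of accepting the AI on the wrong instances
    Both are expressed in units of the value of rational complementation."""
    value = benchmark - baseline
    reliance_loss = (benchmark - misreliant_benchmark) / value
    discrimination_loss = (misreliant_benchmark - observed) / value
    return reliance_loss, discrimination_loss

# Illustrative numbers: benchmark 0.90, mis-reliant benchmark 0.86,
# observed participant score 0.78, baseline 0.70.
print(decompose_losses(0.90, 0.86, 0.78, 0.70))  # ≈ (0.2, 0.4)
```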

We apply this approach to some well-known studies on AI reliance and can extend the original interpretations to varying degrees, ranging from observing a lack of potential to see complementarity in the study, given how much better the AI was than the human going in, to the original interpretation missing that participants’ reliance levels were pretty close to rational and they just couldn’t distinguish which instances they should go with the AI on. We also observe researchers making comparisons across conditions for which the upper and lower bounds on performance differ without accounting for the difference.

There’s more in the paper – for example, we discuss how the rational benchmark, our upper bound representing the expected score of a rational decision-maker on a randomly chosen decision task, may be overfit to the empirical data. This occurs when the signal space is very large (e.g., the instance is a text document) such that we observe very few human predictions per signal. We describe how the rational agent could determine the best response on the optimal coarsening of the empirical distribution, such that the true rational benchmark is bounded by this and the overfit upper bound. 

While we focused on showing how to improve research on human-AI teams, I’m excited about the potential for this framework to help organizations as they consider whether deploying an AI is likely to improve some human decision process. We are currently thinking about what sorts of practical questions (beyond Could pairing a human and AI be effective here?) we can answer using such a framework.

Mindlessness in the interpretation of a study on mindlessness (and why you shouldn’t use the word “whom” in your dating profile)

This is a long post, so let me give you the tl;dr right away: Don’t use the word “whom” in your dating profile.

OK, now for the story. Fasten your seat belts, it’s going to be a bumpy night.

It all started with this message from Dmitri with subject line, “Man I hate to do this to you but …”, which continued:

How could I resist?

https://www.cnbc.com/2024/02/15/using-this-word-can-make-you-more-influential-harvard-study.html

I’m sorry, let me try again … I had to send this to you BECAUSE this is the kind of obvious shit you like to write about. I like how they didn’t even do their own crappy study they just resurrected one from the distant past.

OK, ok, you don’t need to shout about it!

Following the link we see this breathless CNBC news story:

Using this 1 word more often can make you 50% more influential, says Harvard study

Sometimes, it takes a single word — like “because” — to change someone’s mind.

That’s according to Jonah Berger, a marketing professor at the Wharton School of the University of Pennsylvania who’s compiled a list of “magic words” that can change the way you communicate. Using the word “because” while trying to convince someone to do something has a compelling result, he tells CNBC Make It: More people will listen to you, and do what you want.

Berger points to a nearly 50-year-old study from Harvard University, wherein researchers sat in a university library and waited for someone to use the copy machine. Then, they walked up and asked to cut in front of the unknowing participant.

They phrased their request in three different ways:

“May I use the Xerox machine?”
“May I use the Xerox machine because I have to make copies?”
“May I use the Xerox machine because I’m in a rush?”

Both requests using “because” made the people already making copies more than 50% more likely to comply, researchers found. Even the second phrasing — which could be reinterpreted as “May I step in front of you to do the same exact thing you’re doing?” — was effective, because it indicated that the stranger asking for a favor was at least being considerate about it, the study suggested.

“Persuasion wasn’t driven by the reason itself,” Berger wrote in a book on the topic, “Magic Words,” which published last year. “It was driven by the power of the word.” . . .

Let’s look into this claim. The first thing I did was click to the study—full credit to CNBC Make It for providing the link—and here’s the data summary from the experiment:

If you look carefully and do some simple calculations, you’ll see that the percentage of participants who complied was 37.5% under treatment 1, 50% under treatment 2, and 62.5% under treatment 3. So, ok, not literally true that both requests using “because” made the people already making copies more than 50% more likely to comply: 0.50/0.375 = 1.33, and an increase of 33% is not “more than 50%.” But, sure, it’s a positive result. There were 40 participants in each treatment, so the standard error is approximately 0.5/sqrt(40) = 0.08 for each of those averages. The key difference here is 0.50 – 0.375 = 0.125; that’s the difference between the compliance rates under the treatments “May I use the Xerox machine?” and “May I use the Xerox machine because I have to make copies?”, and this will have a standard error of approximately sqrt(2)*0.08 = 0.11.
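Here’s that arithmetic, if you want to check it yourself (the condition labels in the comments are mine):

```python
import math

n = 40                             # participants per treatment
p1, p2, p3 = 0.375, 0.50, 0.625    # compliance: no reason, placebic reason, real reason

se_each = 0.5 / math.sqrt(n)       # conservative binomial standard error, about 0.08
se_diff = math.sqrt(2) * se_each   # standard error of a difference of two rates, about 0.11

print(p2 / p1 - 1)                 # 0.333..., the "33% increase"
print(p2 - p1, round(se_diff, 2))  # 0.125, with standard error of about 0.11
```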

The quick summary from this experiment: an observed difference in compliance rates of 12.5 percentage points, with a standard error of 11 percentage points. I don’t want to say “not statistically significant,” so let me just say that the estimate is highly uncertain, so I have no real reason to believe it will replicate.

But wait, you say: the paper was published. Presumably it has a statistically significant p-value somewhere, no? The answer is, yes, they have some “p < .05” results, just not of that particular comparison. Indeed, if you just look at the top rows of that table (Favor = small), then the difference is 0.93 - 0.60 = 0.33 with a standard error of sqrt(0.6*0.4/15 + 0.93*0.07/15) = 0.14, so that particular estimate is just more than two standard errors away from zero. Whew!

But now we’re getting into forking paths territory:

- Noisy data
- Small sample
- Lots of possible comparisons
- Any comparison that’s statistically significant will necessarily be huge
- Open-ended theoretical structure that could explain just about any result.

I’m not saying the researchers were trying to do anything wrong. But remember, honesty and transparency are not enuf. Such a study is just too noisy to be useful.

But, sure, back in the 1970s many psychology researchers not named Meehl weren’t aware of these issues. They seem to have been under the impression that if you gather some data and find something statistically significant for which you could come up with a good story, then you’ve discovered a general truth.

What’s less excusable is a journalist writing this in the year 2024. But it’s no surprise, conditional on the headline, “Using this 1 word more often can make you 50% more influential, says Harvard study.”

But what about that book by the University of Pennsylvania marketing professor? I searched online, and, fortunately for us, the bit about the Xerox machine is right there in the first chapter, in the excerpt we can read for free. Here it is:

He got it wrong, just like the journalist did! It’s not true that including the meaningless reason increased persuasion just as much as the valid reason did. Look at the data! The outcomes under the three treatments were 37.5%, 50%, and 62.5%. 50% – 37.5% ≠ 62.5% – 37.5%. Ummm, ok, he could’ve said something like, “Among a selected subset of the data with only 15 or 16 people in each treatment, including the meaningless reason increased persuasion just as much as the valid reason did.” But that doesn’t sound so impressive! Even if you add something like, “and it’s possible to come up with a plausible theory to go with this result.”

The book continues:

Given the flaws in the description of the copier study, I’m skeptical about these other claims.

But let me say this. If it is indeed true that using the word “whom” in online dating profiles makes you 31% more likely to get a date, then my advice is . . . don’t use the word “whom”! Think of it from a potential-outcomes perspective. Sure, you want to get a date. But do you really want to go on a date with someone who will only go out with you if you use the word “whom”?? That sounds like a really pretentious person, not a fun date at all!

OK, I haven’t read the rest of the book, and it’s possible that somewhere later on the author says something like, “OK, I was exaggerating a bit on page 4 . . .” I doubt it, but I guess it’s possible.

Replications, anyone?

To return to the topic at hand: In 1978 a study was conducted with 120 participants in a single location. The study was memorable enough to be featured in a business book nearly fifty years later.

Surely the finding has been replicated?

I’d imagine yes; on the other hand, if it had been replicated, this would’ve been mentioned in the book, right? So it’s hard to know.

I did a search, and the article does seem to have been influential:

It’s been cited 1514 times—that’s a lot! Google lists 55 citations in 2023 alone, and in what seem to be legit journals: Human Communication Research, Proceedings of the ACM, Journal of Retailing, Journal of Organizational Behavior, Journal of Applied Psychology, Human Resources Management Review, etc. Not core science journals, exactly, but actual applied fields, with unskeptical mentions such as:

What about replications? I searched on *langer blank chanowitz 1978 replication* and found this paper by Folkes (1985), which reports:

Four studies examined whether verbal behavior is mindful (cognitive) or mindless (automatic). All studies used the experimental paradigm developed by E. J. Langer et al. In Studies 1–3, experimenters approached Ss at copying machines and asked to use it first. Their requests varied in the amount and kind of information given. Study 1 (82 Ss) found less compliance when experimenters gave a controllable reason (“… because I don’t want to wait”) than an uncontrollable reason (“… because I feel really sick”). In Studies 2 and 3 (42 and 96 Ss, respectively) requests for controllable reasons elicited less compliance than requests used in the Langer et al study. Neither study replicated the results of Langer et al. Furthermore, the controllable condition’s lower compliance supports a cognitive approach to social interaction. In Study 4, 69 undergraduates were given instructions intended to increase cognitive processing of the requests, and the pattern of compliance indicated in-depth processing of the request. Results provide evidence for cognitive processing rather than mindlessness in social interaction.

So this study concludes that the result didn’t replicate at all! On the other hand, it’s only a “partial replication,” and indeed they do not use the same conditions and wording as in the original 1978 paper. I don’t know why not, except maybe that exact replications traditionally get no respect.

Langer et al. responded in that journal, writing:

We see nothing in her results [Folkes (1985)] that would lead us to change our position: People are sometimes mindful and sometimes not.

Here they’re referring to the table from the 1978 study, reproduced at the top of this post, which shows a large effect of the “because I have to make copies” treatment under the “Small Favor” condition but no effect under the “Large Favor” condition. Again, given the huge standard errors here, we can’t take any of this seriously, but if you just look at the percentages without considering the uncertainty, then, sure, that’s what they found. Thus, in their response to the partial replication study that did not reproduce their results, Langer et al. emphasized that their original finding was not a main effect but an interaction: “People are sometimes mindful and sometimes not.”

That’s fine. Psychology studies often measure interactions, as they should: the world is a highly variable place.

But, in that case, everyone’s been misinterpreting that 1978 paper! When I say “everybody,” I mean this recent book by the business school professor and also the continuing references to the paper in the recent literature.

Here’s the deal. The message that everyone seems to have learned, or believed they learned, from the 1978 paper is that meaningless explanations are as good as meaningful explanations. But, according to the authors of that paper when they responded to criticism in 1985, the true message is that this trick works sometimes and sometimes not. That’s a much weaker message.

Indeed the study at hand is too small to draw any reliable conclusions about any possible interaction here. The most direct estimate of the interaction effect from the above table is (0.93 – 0.60) – (0.24 – 0.24) = 0.33, with a standard error of sqrt(0.93*0.07/15 + 0.60*0.40/15 + 0.24*0.76/25 + 0.24*0.76/25) = 0.19. So, no, I don’t see much support for the claim in this post from Psychology Today:

So what does this all mean? When the stakes are low people will engage in automatic behavior. If your request is small, follow your request with the word “because” and give a reason—any reason. If the stakes are high, then there could be more resistance, but still not too much.

This happens a lot in unreplicable or unreplicated studies: a result is found under some narrow conditions, and then it is taken to have very general implications. This is just an unusual case where the authors themselves pointed out the issue. As they wrote in their 1985 article:

The larger concern is to understand how mindlessness works, determine its consequences, and specify better the conditions under which it is and is not likely to occur.

That’s a long way from the claim in that business book that “because” is a “magic word.”

Like a lot of magic, it only works under some conditions, and you can’t necessarily specify those conditions ahead of time. It works when it works.

There might be other replication studies of this copy machine study. I guess you couldn’t really do it now, because people don’t spend much time waiting at the copier. But the office copier was a thing for several decades. So maybe there are even some exact replications out there.

In searching for a replication, I did come across this post from 2009 by Mark Liberman that criticized yet another hyping of that 1978 study, this time from a paper by psychologist Daniel Kahneman in the American Economic Review. Kahneman wrote:

Ellen J. Langer et al. (1978) provided a well-known example of what she called “mindless behavior.” In her experiment, a confederate tried to cut in line at a copying machine, using various preset “excuses.” The conclusion was that statements that had the form of an unqualified request were rejected (e.g., “Excuse me, may I use the Xerox machine?”), but almost any statement that had the general form of an explanation was accepted, including “Excuse me, may I use the Xerox machine because I want to make copies?” The superficiality is striking.

As Liberman writes, this represented a “misunderstanding of the 1978 paper’s results, involving both a different conclusion and a strikingly overgeneralized picture of the observed effects.” Liberman performs an analysis of the data from that study which is similar to what I have done above.

Liberman summarizes:

The problem with Prof. Kahneman’s interpretation is not that he took the experiment at face value, ignoring possible flaws of design or interpretation. The problem is that he took a difference in the distribution of behaviors between one group of people and another, and turned it into generic statements about the behavior of people in specified circumstances, as if the behavior were uniform and invariant. The resulting generic statements make strikingly incorrect predictions even about the results of the experiment in question, much less about life in general.

Mindfulness

The key claim of all this research is that people are often mindless: they respond to the form of a request without paying attention to its context, with “because” acting as a “magic word.”

I would argue that this is exactly the sort of mindless behavior being exhibited by the people who are promoting that copying-machine experiment! They are taking various surface aspects of the study and using it to draw large, unsupported conclusions, without being mindful of the details.

In this case, the “magic words” are things like “p < .05,” “randomized experiment,” “Harvard,” “peer review,” and “Journal of Personality and Social Psychology” (this notwithstanding). The mindlessness comes from not looking into what exactly was in the paper being cited.

In conclusion . . .

So, yeah, thanks for nothing, Dmitri! Three hours of my life spent going down a rabbit hole. But, hey, if any readers who are single have read far enough down in the post to see my advice not to use “whom” in your dating profile, it will all have been worth it.

Seriously, though, the “mindlessness” aspect of this story is interesting. The point here is not, Hey, a 50-year-old paper has some flaws! Or the no-less-surprising observation: Hey, a pop business book exaggerates! The part that fascinates me is that there’s all this shaky research that’s being taken as strong evidence that consumers are mindless—and the people hyping these claims are themselves demonstrating the point by mindlessly following signals without looking into the evidence.

The ultimate advice that the mindfulness gurus are giving is not necessarily so bad. For example, here’s the conclusion of that online article about the business book:

Listen to the specific words other people use, and craft a response that speaks their language. Doing so can help drive an agreement, solution or connection.

“Everything in language we might use over email at the office … [can] provide insight into who they are and what they’re going to do in the future,” says Berger.

That sounds ok. Just forget all the blather about the “magic words” and the “superpowers,” and forget the unsupported and implausible claim that “Arguments, requests and presentations aren’t any more or less convincing when they’re based on solid ideas.” As often is the case, I think these Ted-talk style recommendations would be on more solid ground if they were just presented as the product of common sense and accumulated wisdom, rather than leaning on some 50-year-old psychology study that just can’t bear the weight. But maybe you can’t get the airport book and the Ted talk without a claim of scientific backing.

Don’t get me wrong here. I’m not attributing any malign motivations to any of the people involved in this story (except for Dmitri, I guess). I’m guessing they really believe all this. And I’m not using “mindless” as an insult. We’re all mindless sometimes—that’s the point of the Langer et al. (1978) study; it’s what Herbert Simon called “bounded rationality.” The trick is to recognize your areas of mindlessness. If you come to an area where you’re being mindless, don’t write a book about it! Even if you naively think you’ve discovered a new continent. As Mark Twain apparently never said, it ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

The usual disclaimer

I’m not saying the claims made by Langer et al. (1978) are wrong. Maybe it’s true that, under conditions of mindlessness, all that matters is the “because” and any empty explanation will do; maybe the same results would show up in a preregistered replication. All I’m saying is that the noisy data that have been presented don’t provide any strong evidence in support of such claims, and that’s what bothers me about all those confident citations in the business literature.

P.S.

After writing the above post, I sent this response to Dmitri:

OK, I just spent 3 hours on this. I now have to figure out what to do with this after blogging it, because I think there are some important points here. Still, yeah, you did a bad thing by sending this to me. These are 3 hours I could’ve spent doing real work, or relaxing . . .

He replied:

I mean, yeah, that’s too bad for you, obviously. But … try to think about it from my point of view. I am more influential, I got you to work on this while I had a nice relaxing post-Valentine’s day sushi meal with my wife (much easier to get reservations on the 15th and the flowers are a lot cheaper), while you were toiling away on what is essentially my project. I’d say the magic words did their job.

Good point! He exploited my mindlessness. I responded:

Ok, I’ll quote you on that one too! (minus the V-day details).

I’m still chewing on your comment that you appreciate the Beatles for their innovation as much as for their songs. The idea that there are lots of songs of similar quality but not so much innovation, that’s interesting. The only thing is that I don’t know enough about music, even pop music, to have a mental map of where everything fits in. For example, I recently heard that Coldplay song, and it struck me that it was in the style of U2. But I don’t really know if U2 was the originator of that soaring sound. I guess Pink Floyd is kinda soaring too, but not quite in the same way . . . etc etc … the whole thing was frustrating to me because I had no sense of whether I was entirely bullshitting or not.

So if you can spend 3 hours writing a post on the above topic, we’ll be even.

Dmitri replied:

I am proud of the whole “Valentine’s day on the 15th” trick, so you are welcome to include it. That’s one of our great innovations. After the first 15-20 Valentine’s days, you can just move the date a day later and it is much easier.

And, regarding the music, he wrote:

U2 definitely invented a sound, with the help of their producer Brian Eno.

It is a pretty safe bet that every truly successful musician is an innovator—once you know the sound it is easy enough to emulate. Beethoven, Charlie Parker, the Beatles, all the really important guys invented a forceful, effective new way of thinking about music.

U2 is great, but when I listened to an entire U2 song from beginning to end, it seemed so repetitive as to be unlistenable. I don’t feel that way about the Beatles or REM. But just about any music sounds better to me in the background, which I think is a sign of my musical ignorance and tone-deafness (for real, I’m bad at recognizing pitches) more than anything else. I guess the point is that you’re supposed to dance to it, not just sit there and listen.

Anyway, I warned Dmitri about what would happen if I post his Valentine’s Day trick:

I post this, then it will catch on, and it will no longer work . . . just warning ya! You’ll have to start doing Valentine’s Day on the 16th, then the 17th, . . .

To which Dmitri responded:

Yeah but if we stick with it, it will roll around and we will get back to February 14 while everyone else is celebrating Valentines Day on these weird wrong days!

I’ll leave him with the last word.

How large is that treatment effect, really? (My talk at the NYU economics seminar, Thurs 18 Apr, not Thurs 7 Mar)

Thurs 18 Apr 2024 (not Thurs 7 Mar), 12:30pm at 19 West 4th St., room 517:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focus on causal identification, with this goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

Our new book, Active Statistics, is now available!

Coauthored with Aki Vehtari, this new book is lots of fun, perhaps the funnest I’ve ever been involved in writing. And it’s stuffed full of statistical insights. The webpage for the book is here, and the link to buy it is here or directly from the publisher here.

With hundreds of stories, activities, and discussion problems on applied statistics and causal inference, this book is a perfect teaching aid, a perfect adjunct to a self-study program, and an enjoyable bedside read if you already know some statistics.

Here’s the quick summary:

This book provides statistics instructors and students with complete classroom material for a one- or two-semester course on applied regression and causal inference. It is built around 52 stories, 52 class-participation activities, 52 hands-on computer demonstrations, and 52 discussion problems that allow instructors and students to explore the real-world complexity of the subject. The book fosters an engaging “flipped classroom” environment with a focus on visualization and understanding. The book provides instructors with frameworks for self-study or for structuring the course, along with tips for maintaining student engagement at all levels, and practice exam questions to help guide learning. Designed to accompany the authors’ previous textbook Regression and Other Stories, its modular nature and wealth of material allow this book to be adapted to different courses and texts or be used by learners as a hands-on workbook.

It’s got 52 of everything because it’s structured around a two-semester class, with 13 weeks per semester and 2 classes per week. It’s really just bursting with material, including some classic stories and lots of completely new material. Right off the bat we present a statistical mystery that arose with a Wikipedia experiment, and we have a retelling of the famous Literary Digest survey story but with a new and unexpected twist (courtesy of Sharon Lohr and J. Michael Brick). And the activities have so much going on! One of my favorites is a version of the two truths and a lie game that demonstrates several different statistical ideas.

People have asked how this differs from my book with Deborah Nolan, Teaching Statistics: A Bag of Tricks. My quick answer is that Active Statistics has a lot more of everything, it’s structured to cover an entire two-semester course in order, and it’s focused on applied statistics. Including a bunch of stories, activities, demonstrations, and problems on causal inference, a topic that is not always so well integrated into the statistics curriculum. You’re gonna love this book.

You can buy it here or here. It’s only 25 bucks, which is an excellent deal considering how stuffed it is with useful content. Enjoy.

How to code and impute income in studies of opinion polls?

Nate Cohn asks:

What’s your preferred way to handle income in a regression when income categories are inconsistent across several combined survey datasets? Am I best off just handling this with multiple categorical variables? Can I safely create a continuous variable?

My reply:

I thought a lot about this issue when writing Red State Blue State. My preferred strategy is to use a variable that we could treat as continuous. For example, when working with ANES data I was using income categories 1,2,3,4,5, which corresponded to the 1-16th, 16-33rd, 34-66th, 67-95th, and 96-100th percentiles. If you have different surveys with different categories, you could use some somewhat consistent scaling, for example one survey you might code as 1,3,5,7 and another might be coded as 2,4,6,8. I expect that other people would disagree with this advice, but this is the sort of thing that I was doing. I’m not so much worried about the scale being imperfect or nonlinear. But if you have a non-monotonic relation, you’ll have to be more careful.
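One concrete way to implement that kind of roughly consistent scaling is to map each survey’s income categories to the midpoints of the population percentiles they cover. This is just a sketch of the idea: the ANES cut points below are the ones mentioned above, while the second survey’s cut points are hypothetical, purely for illustration.

```python
# Map each survey's income categories to approximate population-percentile
# midpoints, so different surveys land on a roughly comparable continuous scale.
anes_bounds = [(0, 16), (16, 33), (33, 66), (66, 95), (95, 100)]
other_bounds = [(0, 25), (25, 50), (50, 75), (75, 100)]   # hypothetical survey

def percentile_midpoints(bounds):
    """Return {category: midpoint percentile} for a survey's income categories."""
    return {cat + 1: (lo + hi) / 2 for cat, (lo, hi) in enumerate(bounds)}

print(percentile_midpoints(anes_bounds))   # {1: 8.0, 2: 24.5, 3: 49.5, 4: 80.5, 5: 97.5}
print(percentile_midpoints(other_bounds))  # {1: 12.5, 2: 37.5, 3: 62.5, 4: 87.5}
```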

Cohn responds:

Two other thoughts for consideration:

— I am concerned about non-monotonicity. At least in this compilation of 2020 data, the Democrats do best among rich and poor, and sag in the middle. It seems even more extreme when we get into the highest/lowest income strata, ala ANES. I’m not sure this survives controls—it seems like there’s basically no income effect after controls—but I’m hesitant to squelch a possible non-monotonic effect that I haven’t ruled out.

—I’m also curious for your thoughts on a related case. Suppose that (a) dataset includes surveys that sometimes asked about income and sometimes did not ask about income, (b) we’re interested in many demographic covariates, besides income, and; (c) we’d otherwise clearly specify the interaction between income and the other variables. The missing income data creates several challenges. What should we do?

I can imagine some hacky solutions to the NA data problem besides outright removing observations (say, set all NA income to 1 and interact our continuous income variable with whether we have actual income data), but if we interact other variables with the NA income data there are lots of cases (say, MRP where the population strata specifies income for full pop, not in proportion to survey coverage) where we’d risk losing much of the power gleaned from other surveys about the other demographic covariates. What should we do here?

My quick recommendation is to fit a model with two stages, first predicting income given your other covariates, then predicting your outcome of interest (issue attitude, vote preference, whatever) given income and the other covariates. You can fit the two models simultaneously in one Stan program. I guess then you will want some continuous coding for income (could be something like sqrt(income) with income topcoded at $300K) along with a possibly non-monotonic model at the second level.
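To make the structure of that two-stage idea concrete, here is a stripped-down sequential sketch with simulated data and scikit-learn. The recommendation above is to fit both stages jointly in one Stan program, which would also propagate the imputation uncertainty; this plug-in version does not, and all the variable names and data here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))                          # other demographic covariates
income = 0.8 * X[:, 0] + rng.normal(size=n)          # simulated continuous income score
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * income + X[:, 1]))))  # simulated outcome
asked = rng.random(n) < 0.6                          # income asked in only some surveys

# Stage 1: predict income from the other covariates where it was asked,
# and impute the prediction where it was not.
stage1 = LinearRegression().fit(X[asked], income[asked])
income_filled = np.where(asked, income, stage1.predict(X))

# Stage 2: predict the outcome of interest from (imputed) income
# plus the other covariates.
stage2 = LogisticRegression().fit(np.column_stack([income_filled, X]), y)
print(stage2.coef_)
```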

Minor-league Stats Predict Major-league Performance, Sarah Palin, and Some Differences Between Baseball and Politics

In politics, as in baseball, hot prospects from the minors can have trouble handling big-league pitching.

Right after Sarah Palin was chosen as the Republican nominee for vice president in 2008, my friend Ubs, who grew up in Alaska and follows politics closely, wrote the following:

Palin would probably be a pretty good president. . . . She is fantastically popular. Her percentage approval ratings have reached the 90s. Even now, with a minor nepotism scandal going on, she’s still about 80%. . . . How does one do that? You might get 60% or 70% who are rabidly enthusiastic in their love and support, but you’re also going to get a solid core of opposition who hate you with nearly as much passion. The way you get to 90% is by being boringly competent while remaining inoffensive to people all across the political spectrum.

Ubs gives a long discussion of Alaska’s unique politics and then writes:

Palin’s magic formula for success has been simply to ignore partisan crap and get down to the boring business of fixing up a broken government. . . . It’s not a very exciting answer, but it is, I think, why she gets high approval ratings — because all the Democrats, Libertarians, and centrists appreciate that she’s doing a good job on the boring non-partisan stuff that everyone agrees on and she isn’t pissing them off by doing anything on the partisan stuff where they disagree.

Hey–I bet you never thought you’d see the words “boringly competent,” “inoffensive,” and “Sarah Palin” in the same sentence!

Prediction and extrapolation

OK, so what’s the big deal? Palin got a reputation as a competent nonpartisan governor but when she hit the big stage she shifted to hyper-partisanship. The contrast is interesting to me because it suggests a failure of extrapolation.

Now let’s move to baseball. One of the big findings of baseball statistics guru Bill James is that minor-league statistics, when correctly adjusted, predict major-league performance. James is working through a three-step process: (1) naive trust in minor league stats, (2) a recognition that raw minor league stats are misleading, (3) a statistical adjustment process, by which you realize that there really is a lot of information there, if you know how to use it.

For a political analogy, consider Scott Brown. When he was running for the Senate last year, political scientist Boris Shor analyzed his political ideology. The question was, how would he vote in the Senate if he were elected? Boris wrote:

We have evidence from multiple sources. The Boston Globe, in its editorial endorsing Coakley, called Brown “in the mode of the national GOP.” Liberal bloggers have tried to tie him to the Tea Party movement, making him out to be very conservative. Chuck Schumer called him “far-right.”

In 2002, he filled out a Votesmart survey on his policy positions in the context of running for the State Senate. Looking through the answers doesn’t reveal too much beyond that he is a pro-choice, anti-tax, pro-gun Republican. His interest group ratings are all over the map. . . .

All in all, a very confusing assessment, and quite imprecise. So how do we compare Brown to other state legislators, or more generally to other politicians across the country?

My [Boris’s] research, along with Princeton’s Nolan McCarty, allows us to make precisely these comparisons. Essentially, I use the entirety of state legislative voting records across the country, and I make them comparable by calibrating them through Project Votesmart’s candidate surveys.

By doing so, I can estimate Brown’s ideological score very precisely. It turns out that his score is -0.17, compared with her [Coakley’s] score of 0.02. Liberals have lower scores; conservatives higher ones. Brown’s score puts him at the 34th percentile of his party in Massachusetts over the 1995-2006 time period. In other words, two thirds of other Massachusetts Republican state legislators were more conservative than he was. This is evidence for my [Boris’s] claim that he’s a liberal even in his own party. What’s remarkable about this is the fact that Massachusetts Republicans are the most, or nearly the most, liberal Republicans in the entire country!

Very Jamesian, wouldn’t you say? And Boris’s prediction was borne out by Scott Brown’s voting record: he indeed was the most liberal of the Senate’s Republicans.

Political extrapolation

OK, now back to Sarah Palin. First, her popularity. Yes, Gov. Palin was popular, but Alaska is a small (in population) state, and surveys find that most of the popular governors in the U.S. are in small states. Here are data from 2006 and 2008:

[Graph: approval ratings of U.S. governors plotted by state population, 2006 and 2008.]

There are a number of theories about this pattern; what’s relevant here is that a Bill James-style statistical adjustment might be necessary before taking state-level stats to the national level.

The difference between baseball and politics

There’s something else going on, though. It’s not just that Palin isn’t quite so popular as she appeared at first. There’s also a qualitative shift. From “boringly competent nonpartisan” to . . . well, leaving aside any questions of competence, she’s certainly no longer boring or nonpartisan! In baseball terms, this is like Ozzie Smith coming up from the minors and becoming a Dave Kingman-style slugger. (Please excuse my examples which reveal how long it’s been since I’ve followed baseball!)

So how does baseball differ from politics, in ways that are relevant to statistical forecasting?

1. In baseball there is only one goal: winning. Scoring more runs than the other team. Yes, individual players have other goals: staying healthy, getting paid, not getting traded to Montreal, etc., but overall the different goals are aligned, and playing well will get you all of these to some extent.

But there are two central goals in politics: winning and policy. You want to win elections, but the point of winning is to enact policies that you like. (Sure, there are political hacks who will sell out to the highest bidder, but even these political figures represent some interest groups with goals beyond simply being in office.)

Thus, in baseball we want to predict how a player can help his team win, but in politics we want to predict two things: electoral success and also policy positions.

2. Baseball is all about ability–natural athletic ability, intelligence (as Bill James said, that and speed are the only skills that are used in both offense and defense), and plain old hard work, focus, and concentration. The role of ability in politics is not so clear. In his remarks that started this discussion, Ubs suggested that Palin had the ability and inclination to solve real problems. But it’s not clear how to measure such abilities in a way that would allow any generalization to other political settings.

3. Baseball is the same environment at all levels. The base paths are the same length in the major leagues as in AA ball (at least, I assume that’s true!), the only difference is that in the majors they throw harder. OK, maybe the strike zone and the field dimensions vary, but pretty much it’s the same game.

In politics, though, I dunno. Some aspects of politics really do generalize. The Massachusetts Senate has got to be a lot different from the U.S. Senate, but, in their research, Boris Shor and Nolan McCarty have shown that there’s a lot of consistency in how people vote in these different settings. But I suspect things are a lot different for the executive, where your main task is not just to register positions on issues but to negotiate.

4. In baseball, you’re in or out. If you’re not playing (or coaching), you’re not really part of the story. Sportswriters can yell all they want but who cares. In contrast, politics is full of activists, candidates, and potential candidates. In this sense, the appropriate analogy is not that Sarah Palin started as Ozzie Smith and then became Dave Kingman, but rather a move from being Ozzie Smith to being a radio call-in host, in a world in which media personalities can be as powerful, and as well-paid, as players on the field. Perhaps this could’ve been a good move for, say, Bill Lee, in this alternative universe? A player who can’t quite keep the ball over the plate but is a good talker with a knack for controversy?

Commenter Paul made a good point here:

How many at-bats long is a governorship? The most granular I could imagine possibly talking is a quarter. At the term level we’d be doing better making each “at-bat” independent of the previous. 20 or so at-bats don’t have much predictive value either. Even over a full 500 at-bat season, fans try to figure out whether a big jump in BABIP is a sign of better bat control or luck.

The same issues arise at very low at-bat counts too. If you bat in front of a slugger, you can sit on pitches in the zone. If you’ve got a weakness against a certain pitching style, you might not happen to see it. And once the ball is in the air, luck is a huge factor in if it travels to a fielder or between them.

I suspect if we could somehow get a political candidate to hold 300-400 different political jobs in different states, with different party goals and support, we’d be able to do a good job predicting future job performance, even jumping from state to national levels. But the day to day successes of a governor are highly correlative.

Indeed, when it comes to policy positions, a politician has lots of “plate appearances,” that is, opportunities to vote in the legislature. But when it comes to elections, a politician will only have at most a couple dozen in his or her entire career.

All the above is from a post from 2011. I thought about it after this recent exchange with Mark Palko regarding the political candidacy of Ron DeSantis.

In addition to everything above, let me add one more difference between baseball and politics. In baseball, the situation is essentially fixed, and pretty much all that matters is player ability. In contrast, in politics, the most important factor is the situation. In general elections in the U.S., the candidate doesn’t matter that much. (Primaries are a different story.) In summary, to distinguish baseball players in ability we have lots of data to estimate a big signal; to distinguish politicians in vote-getting ability we have very little data to estimate a small signal.

Leap Day Special!

The above graph is from a few years ago but is particularly relevant today!

It’s funny that, in leap years, approximately 10% fewer babies are born on 29 Feb. I think it would be cool to have a Leap Day birthday. But I guess most people, not being nerds, would prefer the less-“weird” days before and after.

There’s lots of good stuff at the above link; I encourage you to read the whole thing.

In the years since, we’ve improved Stan so we can fit and improve the birthdays time series decomposition model using full Bayesian inference.

Here’s Aki’s birthday case study which has all the details. This will also be going into our Bayesian Workflow book.

A suggestion on how to improve the broader impacts statement requirement for AI/ML papers

This is Jessica. Recall that in 2020, NeurIPS added a requirement that authors include a statement of ethical aspects and future societal consequences extending to both positive and negative outcomes. Since then, requiring broader impact statements in machine learning papers has become a thing.

The 2024 NeurIPS call has not yet been released, but in 2023 authors were required to complete a checklist where they had to respond to the following: “If appropriate for the scope and focus of your paper, did you discuss potential negative societal impacts of your work?”, with either Y, N, or N/A with explanation as appropriate. More recently, ICML introduced a requirement that authors include impact statements in submitted papers: “a statement of the potential broader impact of their work, including its ethical aspects and future societal consequences. This statement should be in a separate section at the end of the paper (co-located with Acknowledgements, before References), and does not count toward the paper page limit.”

ICML provided authors who didn’t feel they had much to say with the following boilerplate text:

“This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.”  

but warned authors “to think about whether there is content which does warrant further discussion, as this statement will be apparent if the paper is later flagged for ethics review.”

I find this slightly amusing in that it sounds like what I would expect authors to be thinking even without an impact statement: This work is like, so impactful, for society at large. It’s just like, really important, on so many levels. We’re out of space unfortunately, so we’ll have to leave it at that.\newline\newline\newline\newline Love, \newline\newline\newline\newline the authors \newline\newline\newline\newline

I have an idea that might increase the value of the exercises, both for authors and those advocating for the requirements: Have authors address potential impacts in the context of their discussion of related work *with references to relevant critical work*, rather than expecting them to write something based on their own knowledge and impressions (which is likely to be hard for many authors for reasons I discuss below).  In other words, treat the impact statement as another dimension of contextualizing one’s work against existing scholarship, rather than a free-form brainstorm.

Why do I think this could be an improvement?  Here’s what I see as the main challenges these measures run into (both my own thoughts and those discussed by others):  

  1. Lack of incentives for researchers to be forthright about possible negative implications of their work, and consequently a lack of depth in the statements they write. Having them instead find and cite existing critical work on ethical or societal impacts doesn’t completely resolve this, but presumably the critical papers aren’t facing quite the same incentives to say only the minimum amount. I expect it is easier for the authors to refer to the kind of critiques that ethics experts think are helpful than it is for them to write such critical reflections themselves.
  2. Lack of transparency around how impacts statements factor into reviews of papers. Authors perceive reviewing around impacts statements as a black box, and have responded negatively to the idea that their paper could potentially get rejected for not sufficiently addressing broader impacts. But authors have existing expectations about the consequences for not citing some relevant piece of prior work.
  3. Doubts about whether AI/ML researchers are qualified to reflect on the broader impacts of their work. Relative to, say, the humanities, or even areas of computer science that are closer to social science, like HCI, it seems pretty reasonable to assume that researchers submitting machine learning papers are less likely to gravitate to, and be skilled at, thinking about social and ethical problems, though they are skilled at thinking about technical problems. Social impacts of technology require different sensibilities and training to make progress on (though I think there are also technical components to these problems, which is why both sides are needed). Why not acknowledge this by encouraging the authors to first consult what has been said by experts in these areas, and add their two cents only if there are aspects of the possible impacts, or steps to be taken to address them (e.g., algorithmic solutions), that they perceive to be unaddressed by existing scholarship? This would better acknowledge that just any old attempt to address ethics is not enough (consider, e.g., Gemini’s attempt not to stereotype, which was not an appropriate way to integrate ethical concerns into the tech). It would also potentially encourage more exchange between what currently can appear to be two very divided camps of researchers.
  4. Lack of established processes for reflecting on ethical implications in time to do something about them (e.g., choose a different research direction) in tech research. Related work is often one of the first sections to be written in my experience, so at least those authors who start working on their paper in advance of the deadline might have a better chance of acknowledging potential problems and adjusting their work in response. I’m less convinced that this will make much of a difference in many cases, but thinking about ethical implications early is part of the end goal of requiring broader impacts statements as far as I can tell, and my proposal seems more likely to help than hurt for that goal.

The above challenges are not purely coming from my imagination. I was involved in a couple survey papers led by Priyanka Nanayakkara on what authors said in NeurIPS broader impacts statements, and many contained fairly vacuous statements that might call out buzzwords like privacy or fairness but didn’t really engage with existing research. If we think it’s important to properly understand and address potential negative societal impacts of technology, which is the premise of requiring impacts statements to begin with, why expect a few sentences that authors may well be adding at the last minute to do this justice? (For further evidence that that is what’s happening in some cases, see e.g., this paper reporting on the experiences of authors writing statements). Presumably the target audience of the impact statements would benefit from actual scholarship on the societal implications over rushed and unsourced throwing around of ethical-sounding terms. And the authors would benefit from having to consult what those who are investing the time to think through potential negative consequences carefully have to say.

Some other positive byproducts of this might be that the published record does a better job of directing attention to where critical scholarship needs to be further developed (again, leading to more of a dialogue between the authors and the critics). This seems critical, as some of the societal implications of new ML contributions will require both ethicists and technologists to address. And those investing the time to think carefully about potential implications should see more engagement with their work among those building the tools.

I described this to Priyanka, who also read a draft of this post, and she pointed out that an implicit premise of the broader impact requirements is that the authors are uniquely positioned to comment on the potential harms of their work pre-deployment. I don’t think this is totally off base (since obviously the authors understand the work at a more detailed level than most critics), but to me it misses a big part of the problem: that of misaligned incentives and training (#1, #3 above). It seems contradictory to imply that these potential consequences are not obvious and require careful reflection AND that people who have not considered them before will be capable of doing a good job at articulating them.

At the end of the day, the above proposal is an attempt to turn an activity that I suspect currently feels “religious” for many authors into something they can apply their existing “secular” skills to. 

When Steve Bannon meets the Center for Open Science: Bad science and bad reporting combine to yield another ovulation/voting disaster

The Kangaroo with a feather effect

A couple of faithful correspondents pointed me to this recent article, “Fertility Fails to Predict Voter Preference for the 2020 Election: A Pre-Registered Replication of Navarrete et al. (2010).”

It’s similar to other studies of ovulation and voting that we’ve criticized in the past (see for example pages 638-640 of this paper).

A few years ago I ran across the following recommendation for replication:

One way to put a stop to all this uncertainty: preregistration of studies of all kinds. It won’t quell existing worries, but it will help to prevent new ones, and eventually the truth will out.

My reaction was that this was way too optimistic. The ovulation-and-voting study had large measurement error, high levels of variation, and any underlying effects were small. And all this is made even worse because they were studying within-person effects using a between-person design. So any statistically significant difference they find is likely to be in the wrong direction and is essentially certain to be a huge overestimate. That is, the design has a high Type S error rate and a high Type M error rate.
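
To illustrate what high Type S and Type M error rates mean in practice, here is a quick simulation sketch in R; the effect size and standard error are made-up numbers chosen to mimic a tiny effect studied with a noisy design, not estimates from this literature:

set.seed(123)
true_effect <- 0.01   # hypothetical small effect
se <- 0.05            # hypothetical standard error from a noisy between-person design
est <- rnorm(1e5, mean = true_effect, sd = se)   # many hypothetical replications of the study
signif <- abs(est / se) > 1.96                   # which replications reach p < 0.05
mean(sign(est[signif]) != sign(true_effect))     # Type S: share of significant estimates with the wrong sign
mean(abs(est[signif])) / true_effect             # Type M: average exaggeration among significant estimates

With numbers like these, the occasional “significant” estimate has roughly a one-in-four chance of having the wrong sign and overstates the true effect by something like a factor of ten.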

And, indeed, that’s what happened with the replication. It was a between-person comparison (that is, each person was surveyed at only one time point), there was no direct measurement of fertility, and this new study was powered to only be able to detect effects that were much larger than would be scientifically plausible.

The result: a pile of noise.

To the authors’ credit, their title leads right off with “Fertility Fails to Predict . . .” OK, not quite right, as they didn’t actually measure fertility, but at least they foregrounded their negative finding.

Bad Science

Is it fair for me to call this “bad science”? I think this description is fair. Let me emphasize that I’m not saying the authors of this study are bad people. Remember our principle that honesty and transparency are not enough. You can be of pure heart, but if you are studying a small and highly variable effect using a noisy design and crude measurement tools, you’re not going to learn anything useful. You might as well just be flipping coins or trying to find patterns in a table of random numbers. And that’s what’s going on here.

Indeed, this is one of the things that’s bothered me for years about preregistered replications. I love the idea of preregistration, and I love the idea of replication. These are useful tools for strengthening research that is potentially good research and for providing some perspective on questionable research that’s been done in the past. Even the mere prospect of preregistered replication can be a helpful conceptual tool when considering an existing literature or potential new studies.

But . . . if you take a hopelessly noisy design and preregister it, that doesn’t make it a good study. Put a pile of junk in a fancy suit and it’s still a pile of junk.

In some settings, I fear that “replication” is serving as a shiny object to distract people from the central issues of measurement, and I think that’s what’s going on here. The authors of this study were working with some vague ideas of evolutionary psychology, and they seem to be working under the assumption that, if you’re interested in theory X, the way to science is to gather some data that have some indirect connection to X and then compute some statistical analysis in order to make an up-or-down decision (“statistically significant / not significant” or “replicated / not replicated”).

Again, that’s not enuf! Science isn’t just about theory, data, analysis, and conclusions. It’s also about measurement. It’s quantitative. And some measurements and designs are just too noisy to be useful.

As we wrote a few years ago,

My criticism of the ovulation-and-voting study is ultimately quantitative. Their effect size is tiny and their measurement error is huge. My best analogy is that they are trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

At some point, a set of measurements is so noisy that biases in selection and interpretation overwhelm any signal and, indeed, nothing useful can be learned from them. I assume that the underlying effect size in this case is not zero—if we were to look carefully, we would find some differences in political attitude at different times of the month for women, also different days of the week for men and for women, and different hours of the day, and I expect all these differences would interact with everything—not just marital status but also age, education, political attitudes, number of children, size of tax bill, etc etc. There’s an endless number of small effects, positive and negative, bubbling around.

Bad Reporting

Bad science is compounded by bad reporting. Someone pointed me to a website called “The National Pulse,” which labels itself as “radically independent” but seems to be an organ of the Trump wing of the Republican party, and which featured this story, which they seem to have picked up from the notorious sensationalist site, The Daily Mail:

STUDY: Women More Likely to Vote Trump During Most Fertile Point of Menstrual Cycle.

A new scientific study indicates women are more likely to vote for former President Donald Trump during the most fertile period of their menstrual cycle. According to researchers from the New School for Social Research, led by psychologist Jessica L Engelbrecht, women, when at their most fertile, are drawn to the former President’s intelligence in comparison to his political opponents. The research occurred between July and August 2020, observing 549 women to identify changes in their political opinions over time. . . .

A significant correlation was noticed between women at their most fertile and expressing positive opinions towards former President Donald Trump. . . . the 2020 study indicated that women, while ovulating, were drawn to former President Trump because of his high degree of intelligence, not physical attractiveness. . . .

As I wrote above, I think that research study was bad, but, conditional on the bad design and measurement, its authors seem to have reported it honestly.

The news report adds new levels of distortion.

– The report states that the study observed women “to identify changes in their political opinions over time.” First, the study didn’t “observe” anyone; they conducted an online survey. Second, they didn’t identify any changes over time: the women in the study were surveyed only once!

– The report says something about “a significant correlation” and that “the study indicated that . . .” This surprised me, given that the paper itself was titled, “Fertility Fails to Predict Voter Preference for the 2020 Election.” How do you get from “fails to predict” to “a significant correlation”? I looked at the journal article and found the relevant bit:

Results of this analysis for all 14 matchups appear in Table 2. In contrast to the original study’s findings, only in the Trump-Obama matchup was there a significant relationship between conception risk and voting preference [r_pb (475) = −.106, p = .021] such that the probability of intending to vote for Donald J. Trump rose with conception risk.

Got it? They looked at 14 comparisons. Out of these, one was “statistically significant” at the 5% level. This is the kind of thing you’d expect to see from pure noise, or the mathematical equivalent, which is a study with noisy measurements of small and variable effects. The authors write, “however, it is possible that this is a Type I error, as it was the only significant result across the matchups we analyzed,” which I think is still too credulous a way to put it; a more accurate summary would be to say that the data are consistent with null effects, which is no surprise given the realistic possible sizes of any effects in this very underpowered study.
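
For the record, here is the back-of-the-envelope calculation in R, treating the 14 tests as independent (they aren’t exactly, since the matchups share respondents, but it gives the idea):

1 - 0.95^14   # chance of at least one p < 0.05 among 14 null comparisons: about 0.51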

The authors of the journal article also write, “Several factors may account for the discrepancy between our [lack of replication of] the original results.” They go on for six paragraphs giving possible theories—but never once considering the possibility that the original studies and theirs were just too noisy to learn anything useful.

Look. I don’t mind a bit of storytelling: why not? Storytelling is fun, and it can be a good way to think about scientific hypotheses and their implications. The reason we do social science is because we’re interested in the social world; we’re not just number crunchers. So I don’t mind that the authors had several paragraphs with stories. The problem is not that they’re telling stories, it’s that they’re only telling stories. They don’t ever reflect that this entire literature is chasing patterns in noise.

And this lack of reflection about measurement and effect size is destroying them! They went to all this trouble to replicate this old study, without ever grappling with that study’s fundamental flaw (see kangaroo picture at the top of this post). Again, I’m not saying that the authors are bad people or that they intend to mislead; they’re just doing bad, 2010-2015-era psychological science. They don’t know better, and they haven’t been well served by the academic psychology establishment, which has promoted and continues to promote this sort of junk science.

Don’t blame the authors of the bad study for the terrible distorted reporting

Finally, it’s not the authors’ fault that their study was misreported by the Daily Mail and that Steve Bannon-associated website. “Fails to Predict” is right there in the title of the journal article. If clickbait websites and political propagandists want to pull out that p = 0.02 result from your 14 comparisons and spin a tale around it, you can’t really stop them.

The Center for Open Science!

Science reform buffs will enjoy these final bits from the published paper:

When do we expect conformal prediction sets to be helpful? 

This is Jessica. Over on substack, Ben Recht has been posing some questions about the value of prediction bands with marginal guarantees, such as one gets from conformal prediction. It’s an interesting discussion that caught my attention since I have also been musing about where conformal prediction might be helpful. 

To briefly review: given a training data set (X1, Y1), … , (Xn, Yn), and a test point (Xn+1, Yn+1) drawn from the same distribution, conformal prediction returns a subset of the label space for which we can make coverage guarantees about the probability of containing the test point’s true label Yn+1. A prediction set Cn achieves distribution-free marginal coverage at level 1 − alpha when P(Yn+1 ∈ Cn(Xn+1)) >= 1 − alpha for all joint distributions P on (X, Y). The commonly used split conformal prediction procedure attains this by adding a couple of steps to the typical modeling workflow: you first split the data into a training set and a calibration set, fitting the model on the training set. You choose a heuristic notion of uncertainty from the trained model, such as the softmax values (pseudo-probabilities from the last layer of a neural network), to create a score function s(x,y) that encodes disagreement between x and y (in a regression setting these are just the residuals). You then compute q_hat, the ceiling((n+1)(1−alpha))/n empirical quantile of the scores on the calibration set, where n is now the size of the calibration set. Finally, given a new instance x_n+1, you construct a prediction set for y_n+1 by including all y’s for which the score is less than or equal to q_hat. There are various ways to achieve slightly better performance, such as using cumulative summed scores and regularization instead.
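
Here is a minimal R sketch of that split-conformal recipe for classification, assuming you already have held-out calibration predictions in hand; the function and object names are my own for illustration, not from any particular package:

split_conformal_set <- function(cal_probs, cal_labels, new_probs, alpha = 0.1) {
  # cal_probs: n x C matrix of predicted class probabilities on the calibration set
  # cal_labels: integer true labels (1..C) for the calibration rows
  # new_probs: length-C vector of predicted probabilities for a new instance
  n <- nrow(cal_probs)
  scores <- 1 - cal_probs[cbind(seq_len(n), cal_labels)]    # score = 1 - prob. assigned to the true label
  level <- min(1, ceiling((n + 1) * (1 - alpha)) / n)       # conformal quantile level
  q_hat <- quantile(scores, probs = level, type = 1)        # empirical quantile of the calibration scores
  which(1 - new_probs <= q_hat)                             # every label whose score clears the threshold
}

With alpha = 0.1, the returned set contains the true label for about 90 percent of new draws from the same distribution, in exactly the marginal sense discussed above.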

Recht makes several good points about limitations of conformal prediction, including:

—The marginal coverage guarantees are often not very useful. Instead we want conditional coverage guarantees that hold conditional on the value of Xn+1 we observe. But you can’t get true conditional coverage guarantees (i.e., P(Yn+1 ∈ Cn(Xn+1)|Xn+1 = x) >= 1 − alpha for all P and almost all x) if you also want the approach to be distribution free (see e.g., here), and in general you need a very large calibration set to be able to say with high confidence that there is a high probability that your specific interval contains the true Yn+1.

—When we talk about needing prediction bands for decisions, we are often talking about scenarios where the decisions we want to make from the uncertainty quantification are going to change the distribution and violate the exchangeability criterion. 

—Additionally, in many of the settings where we might imagine using prediction sets there is potential for recourse. If the prediction is bad, resulting in a bad action being chosen, the action can be corrected, i.e., “If you have multiple stages of recourse, it almost doesn’t matter if your prediction bands were correct. What matters is whether you can do something when your predictions are wrong.”

Recht also criticizes research on conformal prediction as being fixated on the ability to make guarantees, irrespective of how useful the resulting intervals are. E.g., we can produce sets with 95% coverage with only two points, and the guarantees are always about coverage instead of the width of the interval.

These are valid points, worth discussing given how much attention conformal prediction has gotten lately. Some of the concerns remind me of the same complaints we often hear about traditional confidence intervals we put on parameter estimates, where the guarantees we get (about the method) are also generally not what we want (about the interval itself) and only actually summarize our uncertainty when the assumptions we made in inference are all good, which we usually can’t verify. A conformal prediction interval is about uncertainty in a model’s prediction on a specific instance, which perhaps makes it more misleading to some people given that it might not be conditional on anything specific to the instance. Still, even if the guarantees don’t stand as stated, I think it’s difficult to rule out an approach without evaluating how it gets used. Given that no method ever really quantifies all of our uncertainty, or even all of the important sources of uncertainty, the “meaning” of an uncertainty quantification really depends on its use, and what the alternatives considered in a given situation are. So I guess I disagree that one can answer the question “Can conformal prediction achieve the uncertainty quantification we need for decision-making?” without considering the specific decision at hand, how we are constructing the prediction set exactly (since there are ways to condition the guarantees on some instance-specific information), and how the decision would be made without a prediction set.

There are various scenarios where prediction sets are used without a human in the loop, such as to improve predictions or to directly calibrate decisions, where it seems hard to argue that the uncertainty quantification adds no value over using none at all. Conformal prediction for alignment purposes (e.g., controlling the factuality or toxicity of LLM outputs) seems to be on the rise. However, I want to focus here on a scenario where we are directly presenting a human with the sets. One type of setting where I’m curious whether conformal prediction sets could be useful is the one where 1) models are trained offline and used to inform people’s decisions, and 2) it’s hard to rigorously quantify the uncertainty in the predictions using anything the model produces internally, like softmax values, which can be overfit to the training sample.

For example, a doctor needs to diagnose a skin condition and has access to a deep neural net trained on images of skin conditions for which the illness has been confirmed. Even if this model appears to be more accurate than the doctor on evaluation data, the hospital may not be comfortable deploying the model in place of the doctor. Maybe the doctor has access to additional patient information that may in some cases allow them to make a better prediction, e.g., because they can decide when to seek more information through further interaction or monitoring of the patient. This means the distribution does change upon acting on the prediction, and I think Recht would say there is potential for recourse here, since the doctor can revise the treatment plan over time (he lists a similar example here). But still, at any given point in time, there’s a model and there’s a decision to be made by a human.    

Giving the doctor information about the model’s confidence in its prediction seems like it should be useful in helping them appraise the prediction in light of their own knowledge. Similarly, giving them a prediction set over a single top-1 prediction seems potentially preferable so they don’t anchor too heavily on a single prediction. Deep neural nets for medical diagnoses can do better than many humans in certain domains while still having relatively low top-1 accuracy (e.g., here). 

A naive thing to do would be to just choose some number k of predictions from the model we think a doctor can handle seeing at once, and show the top-k with softmax scores. But an adaptive conformal prediction set seems like an improvement in that at least you get some kind of guarantee, even if it’s not specific to your interval like you want. Set size conveys information about the level of uncertainty like the width of a traditional confidence interval does, which seems more likely to be helpful for conveying relative uncertainty than holding set size constant and letting the coverage guarantee change (I’ve heard from at least one colleague who works extensively with doctors that many are pretty comfortable with confidence intervals). We can also take steps toward the conditional coverage that we actually want by using an algorithm that calibrates the guarantees over different classes (labels), or that achieves a relaxed version of conditional coverage, possibilities that Recht seems to overlook. 

So while I agree with all the limitations, I don’t necessarily agree with Recht’s concluding sentence here:

“If you have multiple stages of recourse, it almost doesn’t matter if your prediction bands were correct. What matters is whether you can do something when your predictions are wrong. If you can, point predictions coupled with subsequent action are enough to achieve nearly optimal decisions.” 

It seems possible that seeing a prediction set (rather than just a single top prediction) will encourage a doctor to consider other diagnoses that they may not have thought of. Presenting uncertainty often has _some_ effect on a person’s reasoning process, even if they can revise their behavior later. The effect of seeing more alternatives could be bad in some cases (they get distracted by labels that don’t apply), or it could be good (a hurried doctor recognizes a potentially relevant condition they might have otherwise overlooked). If we allow for the possibility that seeing a set of alternatives helps, it makes sense to have a way to generate them that gives us some kind of coverage guarantee we can make sense of, even if it gets violated sometimes.

This doesn’t mean I’m not skeptical of how much prediction sets might change things over more naively constructed sets of possible labels. I’ve spent a bit of time thinking about how, from the human perspective, prediction sets could or could not add value, and I suspect it’s going to be nuanced; the real value probably depends on how the coverage responds under realistic changes in distribution. There are lots of questions that seem worth trying to answer in particular domains where models are being deployed to assist decisions. Does it actually matter in practice, such as in a given medical decision setting, for the quality of decisions that are made if the decision-makers are given a set of predictions with coverage guarantees versus a top-k display without any guarantees? And, what happens when you give someone a prediction set with some guarantee but there are distribution shifts such that the guarantees you give are not quite right? Are they still better off with the prediction set or is this worse than just providing the model’s top prediction or top-k with no guarantees? Again, many of the questions could also be asked of other uncertainty quantification approaches; conformal prediction is just easier to implement in many cases. I have more to say on some of these questions based on a recent study we did on decisions from prediction sets, where we compared how accurately people labeled images using them versus other displays of model predictions, but I’ll save that for another post since this is already long.

Of course, it’s possible that in many settings we would be better off using some inherently interpretable model for which we no longer need a distribution-free approach. And ultimately we might be better off if we can better understand the decision problem the human decision-maker faces and apply decision theory to try to find better strategies, rather than leaving it up to the human how to combine their knowledge with what they get from a model prediction. I think we still barely understand how this occurs even in the high-stakes settings that people often talk about.

I love this paper but it’s barely been noticed.

Econ Journal Watch asked me and some others to contribute to an article, “What are your most underappreciated works?,” where each of us wrote 200 words or less about an article of ours that had received few citations.

Here’s what I wrote:

What happens when you drop a rock into a pond and it produces no ripples?

My 2004 article, Treatment Effects in Before-After Data, has only 23 citations and this goes down to 16 after removing duplicates and citations from me. But it’s one of my favorite papers. What happened?

It is standard practice to fit regressions using an indicator variable for treatment or control; the coefficient represents the causal effect, which can be elaborated using interactions. My article from 2004 argues that this default class of models is fundamentally flawed in considering treatment and control conditions symmetrically. To the extent that a treatment “does something” and the control “leaves you alone,” we should expect before-after correlation to be higher in the control group than in the treatment group. But this is not implied by the usual models.

My article presents three empirical examples from political science and policy analysis demonstrating the point. The article also proposes some statistical models. Unfortunately, these models are complicated and can be noisy to fit with small datasets. It would help to have robust tools for fitting them, along with evidence from theory or simulation of improved statistical properties. I still hope to do such work in the future, in which case perhaps this work will have the influence I hope it deserves.
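
To make the asymmetry argument in that passage concrete, here is a small simulation sketch in R with made-up numbers (not taken from the 2004 article): the control condition just adds a little measurement noise, while the treatment adds an effect that varies across people, which pulls down the before-after correlation in the treated group.

set.seed(1)
n <- 1000
before <- rnorm(n, 50, 10)
effect <- rnorm(n, 5, 8)                   # treatment effect varies from person to person
after_control <- before + rnorm(n, 0, 3)   # control "leaves you alone": small disturbance only
after_treated <- before + effect + rnorm(n, 0, 3)
cor(before, after_control)                 # about 0.96
cor(before, after_treated)                 # about 0.76, lower, as the asymmetry argument predicts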

Here’s the whole collection. The other contributors were Robert Kaestner, Robert A. Lawson, George Selgin, Ilya Somin, and Alexander Tabarrok.

My contribution got edited! I prefer my original version shown above; if you’re curious about the edited version, just follow the link and you can compare for yourself.

Others of my barely-noticed articles

Most of my published articles have very few citations; it’s your usual Zipf or long-tailed thing. Some of those have narrow appeal and so, even if I personally like the work, it is understandable that they haven’t been cited much. For example, “Bayesian hierarchical classes analysis” (16 citations) took a lot of effort on our part and appeared in a good journal, but ultimately it’s on a topic that not many researchers are interested in. For another example, I enjoyed writing “Teaching Bayes to Graduate Students in Political Science, Sociology, Public Health, Education, Economics, . . .” (17 citations) and I think if it reached the right audience of educators it could have a real influence, but it’s not the kind of paper that gets built upon or cited very often. A couple of my ethics and statistics papers from my Chance column only have 14 citations each; no surprise given that nobody reads Chance. At one point I was thinking of collecting them into a book, as this could get more notice.

Some papers are great but only take you part of the way there. I really like my morphing paper with Cavan and Phil, “Using image and curve registration for measuring the goodness of fit of spatial and temporal predictions” (12 citations) and, again, it appeared in a solid journal, but it was more of a start than a finish to a research project. We didn’t follow it up, and it seems that nobody else did either.

Sometimes we go to the trouble of writing a paper and going through the review process, but then it gets so little notice that I ask myself in retrospect, why did we bother? For example, “Objective Randomised Blinded Investigation With Optimal Medical Therapy of Angioplasty in Stable Angina (ORBITA) and coronary stents: A case study in the analysis and reporting of clinical trials” has been cited only 5 times since its publication in 2019—and three of those citations were from me. It seems safe to say that this particular dropped rock produced few ripples.

What happened? That paper had a good statistical message and a good applied story, but we didn’t frame it in a general-enough way. Or . . . it wasn’t quite that, exactly. It’s not a problem of framing so much as of context.

Here’s what would’ve made the ORBITA paper work, in the sense of being impactful (i.e., useful): either a substantive recommendation regarding heart stents or a general recommendation (a “method”) regarding summarizing and reporting clinical studies. We didn’t have either of these. Rather than just getting the paper published, we should’ve done the hard work to move forward in one of those two directions. Or, maybe our strategy was ok if we can use this example in some future article. The article presented a great self-contained story that could be part of larger recommendations. But the story on its own didn’t have impact.

This is a good reminder that what typically makes a paper useful is if it can get used by people. A starting point is the title. We should figure out who might find the contents of the article useful and design the title from there.

Or, for another example, consider “Extension of the Isobolographic Approach to Interactions Studies Between More than Two Drugs: Illustration with the Convulsant Interaction between Pefloxacin, Norfloxacin, and Theophylline in Rats” (5 citations). I don’t remember this one at all, and maybe it doesn’t deserve to be read—but if it does, maybe it should’ve been more focused on the general approach so it could’ve been more directly useful to people working in that field.

“Information, incentives, and goals in election forecasts” (21 citations). I don’t know what to say about this one. I like the article, it’s on a topic that lots of people care about, the title seems fine, but not much impact. Maybe more people will look at it in 2024? “Accounting for uncertainty during a pandemic” is another one with only 21 citations. For that one, maybe people are just sick of reading about the goddam pandemic. I dunno; I think uncertainty is an important topic.

The other issue with citations is that people have to find your paper before they would consider citing it. I guess that many people in the target audiences for our articles never even knew they existed. From that perspective, it’s impressive that anything new ever gets cited at all.

Here’s an example of a good title: “A simple explanation for declining temperature sensitivity with warming.” Only 25 citations so far, but I have some hopes for this one: the title really nails the message, so once enough people happen to come across this article one way or another, I think they’ll read it and get the point, and this will eventually show up in citations.

“Tables as graphs: The Ramanujan principle” (4 citations). OK, I love this paper too, but realistically it’s not useful to anyone! So, fair enough. Similarly with “‘How many zombies do you know?’ Using indirect survey methods to measure alien attacks and outbreaks of the undead” (6 citations). An inspired, hilarious effort in my opinion, truly a modern classic, but there’s no real reason for anyone to actually cite it.

“Should we take measurements at an intermediate design point?” (3 citations). This is the one that really bugs me. Crisp title, clean example, innovative ideas . . . it’s got it all. But it’s sunk nearly without a trace. I think the only thing to do here is to pursue the research further, get new results, and publish those. Maybe also set up the procedure more explicitly as a method, rather than just the solution to a particular applied problem.

I’ve been mistaken for a chatbot

… Or not, according to what language is allowed.

At the start of the year I mentioned that I am on a bad roll with AI just now, and the start of that roll began in late November when I received reviews back on a paper. One reviewer sent in a 150-word review saying it was written by chatGPT. The editor echoed, “One reviewer asserted that the work was created with ChatGPT. I don’t know if this is the case, but I did find the writing style unusual ….” What exactly was unusual was not explained.

That was November 20th. By November 22nd my computer shows a newly created file named ‘tryingtoproveIamnotchatbot,’ which is just a txt file where I pasted in the GitHub commits showing progress on the paper. I figured maybe this would prove to the editors that I did not submit any work by chatGPT.

I didn’t. There are many reasons for this. One is I don’t think that I should. Further, I suspect chatGPT is not so good at this (rather specific) subject and between me and my author team, I actually thought we were pretty good at this subject. And I had met with each of the authors to build the paper, its treatise, data and figures. We had a cool new meta-analysis of rootstock x scion experiments and a number of interesting points. Some of the points I might even call exciting, though I am biased. But, no matter, the paper was the product of lots of work and I was initially embarrassed, then gutted, about the reviews.

Once I was less embarrassed I started talking timidly about it. I called Andrew. I told folks in my lab. I got some fun replies. Undergrads in my lab (and others later) thought the review itself may have been written by chatGPT. Someone suggested I rewrite the paper with chatGPT and resubmit. Another that I just write back one line: I’m Bing.

What I took away from this was myriad, but I came up with a couple of next steps. One was deciding that this was not a great peer review process and that I should reach out to the editor (and, as one co-author suggested, cc the editorial board). Another was to not be so mortified as to not talk about this.

What I took away from these steps were two things:

1) chatGPT could now control my language.

I connected with a senior editor at the journal. No one is in a good position here, and the editor and reviewers are volunteering their time in a rapidly changing situation. I feel for them and for me and my co-authors. The editor and I tried to bridge our perspectives. It seems he could not have imagined that I or my co-authors would be so offended. And I could not have imagined that the journal already had a policy of allowing manuscripts to use chatGPT, as long as it was clearly stated.

I was also given some language changes to consider, so I might sound less like chatGPT to reviewers. These included some phrases I wrote in the manuscript (e.g. `the tyranny of terroir’). Huh. So where does that end? Say I start writing so I sound less to the editor and others ‘like chatGPT’ (and I never figured out what that means), then chatGPT digests that and then what? I adapt again? Do I eventually come back around to those phrases once they have rinsed out of the large language model?

2) Editors are shaping the language around chatGPT.

Motivated by a co-author’s suggestion, I wrote a short reflection which recently came out in a careers column. I much appreciate the journal recognizing this as an important topic and that they have editorial guidelines to follow for clear and consistent writing. But I was surprised by the concerns from the subeditors on my language. (I had no idea my language was such a problem!)

The problem was that I wrote: I’ve been mistaken for a chatbot (and similar language). The argument was that I had not been mistaken — my writing had been. The debate that ensued was fascinating. If I had been in a chatroom and this happened, then I could write `I’ve been mistaken for a chatbot’ but since my co-authors and I wrote this up and submitted it to a journal, it was not part of our identities. So I was over-reaching in my complaint. I started to wonder: if I could not say ‘I was mistaken for an AI bot’ — why does the chatbot get ‘to write’? I went down an existential hole, from which I have not fully recovered.

And since then I am still mostly existing there. On the upbeat side, writing the reflection was cathartic and the back and forth with the editors — who I know are just trying to do their jobs too — gave me more perspectives and thoughts, however muddled. And my partner recently said to me, “perhaps one day it will be seen as a compliment to be mistaken for a chatbot, just not today!”

Also, since I don’t know of an archive that takes such things, I will paste the original unedited version below.

I have just been accused of scientific fraud. It’s not data fraud (which, I guess, is a relief because my lab works hard at data transparency, data sharing and reproducibility). What I have just been accused of is writing fraud. This hurts, because—like many people—I find writing a paper a somewhat painful process.

Like some people, I comfort myself by reading books on how to write—both to be comforted by how much the authors of such books stress that writing is generally slow and difficult, and to find ways to improve my writing. My current writing strategy involves willing myself to write, multiple outlines, then a first draft, followed by much revising. I try to force this approach on my students, even though I know it is not easy, because I think it’s important we try to communicate well.

Imagine my surprise then when I received reviews back that declared a recently submitted paper of mine a chatGPT creation. One reviewer wrote that it was `obviously Chat GPT’ and the handling editor vaguely agreed, saying that they found `the writing style unusual.’ Surprise was just one emotion I had; so were shock, dismay, and a flood of confusion and alarm. Given how much work goes into writing a paper, it was quite a hit to be accused of being a chatbot—especially in short order without any evidence, and given the efforts that accompany the writing of almost all my manuscripts.

I hadn’t written a word of the manuscript with chatGPT and I rapidly tried to think through how to prove my case. I could show my commits on GitHub (with commit messages including `finally writing!’ and `Another 25 mins of writing progress!’ that I never thought I would share), I could try to figure out how to compare the writing style of my pre-chatGPT papers on this topic to the current submission, maybe I could ask chatGPT if it thought it wrote the paper…. But then I realized I would be spending my time trying to prove I am not a chatbot, which seemed a bad outcome to the whole situation. Eventually, like all mature adults, I decided what I most wanted to do was pick up my ball (manuscript) and march off the playground in a small fury. How dare they?

Before I did this, I decided to get some perspectives from others—researchers who work on data fraud, co-authors on the paper and colleagues, and I found most agreed with my alarm. One put it most succinctly to me: `All scientific criticism is admissible, but this is a different matter.’

I realized these reviews captured both something inherently broken about the peer review process and—more importantly to me—about how AI could corrupt science without even trying. We’re paranoid about AI taking over us weak humans and we’re trying to put in structures so it doesn’t. But we’re also trying to develop AI so it helps where it should, and maybe that will be writing parts of papers. Here, chatGPT was not part of my work and yet it had prejudiced the whole process simply by its existential presence in the world. I was at once annoyed at being mistaken for a chatbot and horrified that reviewers and editors were not more outraged at the idea that someone had submitted AI generated text.

So much of science is built on trust and faith in the scientific ethics and integrity of our colleagues. We mostly trust others did not fabricate their data, and I trust people do not (yet) write their papers or grants using large language models without telling me. I wouldn’t accuse someone of data fraud or p-hacking without some evidence, but a reviewer felt it was easy enough to accuse me of writing fraud. Indeed, the reviewer wrote, `It is obviously [a] Chat GPT creation, there is nothing wrong using help ….’ So it seems, perhaps, that they did not see this as a harsh accusation, and the editor thought nothing of passing it along and echoing it, but they had effectively accused me of lying and fraud in deliberately presenting AI generated text as my own. They also felt confident that they could discern my writing from AI—but they couldn’t.

We need to be able to call out fraud and misconduct in science. Currently, the costs to the people who call out data fraud seem too high to me, and the consequences for being caught too low (people should lose tenure for egregious data fraud in my book). But I am worried about a world in which a reviewer can casually declare my work AI-generated, and the editors and journal editor simply shuffle along the review and invite a resubmission if I so choose. It suggests not only a world in which the reviewers and editors have no faith in the scientific integrity of submitting authors—me—but also an acceptance of a world where ethics are negotiable. Such a world seems easy for chatGPT to corrupt without even trying—unless we raise our standards.

Side note: Don’t forget to submit your entry to the International Cherry Blossom Prediction Competition!

Cherry blossoms—not just another prediction competition

It’s back! As regular readers know, the Cherry Blossom Prediction Competition will run throughout February 2024. We challenge you to predict the bloom date of cherry trees at five locations around the world and win prizes.

We’ve been promoting the competition for three years now, but we haven’t really emphasized the history of the problem. You might be surprised to learn that bloom date prediction interested several famous nineteenth-century statisticians. Co-organizer Jonathan Auerbach explains:

The “law of the flowering plants” states that a plant blooms after being exposed to a predetermined quantity of heat. The law was discovered in the mid-eighteenth century by René Réaumur, an early adopter of the thermometer—but it was popularized by Adolphe Quetelet, who devoted a chapter to the law in his Letters addressed to HRH the Grand Duke of Saxe Coburg and Gotha (Letter Number 33). See this tutorial for details.

Kotz and Johnson list Letters as one of eleven major breakthroughs in statistics prior to 1900, and the law of the flowering plants appears to have been well known throughout the nineteenth century, influencing statisticians such as Francis Galton and Florence Nightingale. But its popularity among statisticians waned as the field transitioned from a collection of “fundamental” constants to a collection of principles for quantifying uncertainty. In fact, Ian Hacking mocks the law as a particularly egregious example of stamp-collecting statistics.

But the law is widely used today! Charles Morren, Quetelet’s collaborator, later coined the term phenology, the name of the field that studies life-cycle events such as bloom dates. Phenologists keep track of accumulated heat, or growing degree days, to predict bloom dates, crop yields, and the emergence of insects. Predictions are made using a methodology that is largely unchanged since Quetelet’s time, despite the large amounts of data now available and amenable to machine learning.
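To make the accumulated-heat idea concrete, here is a toy growing-degree-day calculation in R. The base temperature, heat requirement, and temperature series are all made up for illustration; they are not taken from the competition or from any fitted phenological model.

base_temp <- 5                                   # assumed base temperature (degrees C)
heat_sum  <- 150                                 # assumed heat requirement for bloom (degree-days)
set.seed(1)
daily_mean_temp <- seq(2, 18, length.out = 60) + rnorm(60, sd = 2)  # made-up spring temperatures
gdd <- cumsum(pmax(daily_mean_temp - base_temp, 0))                 # heat accumulated above the base
predicted_bloom_day <- which(gdd >= heat_sum)[1]                    # first day the heat sum is reached
predicted_bloom_day

In practice, the base temperature and heat requirement would be calibrated to historical bloom records for each site rather than picked by hand as above.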

Fun with Dååta: Reference librarian edition

Rasmus Bååth reports the following fun story in a blog post, The source of the cake dataset (it’s a hierarchical modeling example included with the R package lme4).

Rasmus writes,

While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938).”

The search is on.

… after a fair bit of flustered searching, I realized that this scholarly work, despite its obvious relevance to society, was nowhere to be found online.

The plot thickens like cake batter until Megan N. O’Donnell, a reference librarian (officially, Research Data Services Lead!) at Iowa State, home of the original thesis, gets involved. She replies to Rasmus’s query,

Sorry for the delay — I got caught up in a deadline. The scan came out fairly well, but page 16 is partially cut off. I’ll put in a request to have it professionally scanned, but that will take some time. Hopefully this will do for now.

Rasmus concludes,

She (the busy Research Data Services Lead with a looming deadline) is apologizing to me (the random Swede with an eccentric cake thesis digitization request) that it took a few days to get me everything I asked for!?

Reference librarians are amazing! Read the whole story and download the actual manuscript from Rasmus’s original blog post. The details of the experimental design are quite interesting, including the device used to measure cake breakage angle, a photo of which is included in the post.
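If you’d like to poke at the data yourself, here is a minimal sketch of the kind of hierarchical model the cake dataset is typically used to illustrate. The particular formula (a random intercept for each replicate nested within recipe) is just one reasonable specification, not necessarily the model Rasmus had in mind.

library(lme4)   # the cake dataset ships with lme4
data(cake)
# Breakage angle as a function of recipe and baking temperature,
# with a random intercept for each replicate nested within recipe.
fit <- lmer(angle ~ recipe * temperature + (1 | recipe:replicate), data = cake)
summary(fit)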

I think it’d be fun to organize a class around generating new, small-scale, and amusing data sets like this one. Maybe it sounds like more fun than it would actually be; data collection is grueling. Andrew says he’s getting tired of teaching data communication, and he’s been talking a lot more about the importance of data collection on the blog, so maybe next year…

P.S. On a related note, there’s something called a baath cake that’s popular in Goa and confused my web search.