The optimizer’s curse

The above sketch shows a decision tree.

The circles are uncertainty nodes and the squares are decision nodes. Read the tree from left to right: to start, there is uncertainty of which of the strata i=1,…,I you will be in. In any given stratum, you will have to decide between options 1 and 2, and for each of these decision options there is uncertainty about the payoff.

The goals are:

(a) Conditional on the stratum, pick the best decision. This is the local decision problem.

(b) Averaging over the strata, evaluate the expected value of the tree, that is, the expected value under an optimal decision analysis given the uncertainty.

The challenge is that you don’t know which internal decision is best, because there is uncertainty about the payoffs.

The “optimizer’s curse” is that if, for each stratum in step (a), you make the best decision given available information–that is, you estimate the expected payoff under each of the two decision options and then pick the one whose expected payoff is higher–then if you use these expected payoffs in step (b) you will systematically overestimate the value of the tree.

The “curse” here is not that the optimizer is making bad decisions, it’s that a naive estimate will be overly optimistic about the net value because you’re selecting on choices that look good.
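Here is a minimal simulation sketch of that overestimate. The specifics are assumptions made for illustration and are not the setup of the paper: true payoffs and estimation noise are drawn from normal distributions, every stratum gets equal weight, and simulate_tree is a hypothetical name.

```python
import random


def simulate_tree(n_strata=20000, payoff_sd=1.0, noise_sd=1.0, seed=1):
    """Compare the naive tree value with what the naive decisions actually earn."""
    rng = random.Random(seed)
    naive_sum = realized_sum = oracle_sum = 0.0
    for _ in range(n_strata):
        # Assumed: true expected payoffs of options 1 and 2 in this stratum.
        true = [rng.gauss(0.0, payoff_sd) for _ in range(2)]
        # Noisy estimates of those payoffs (the "available information").
        est = [t + rng.gauss(0.0, noise_sd) for t in true]
        # Step (a): pick the option whose estimated payoff is higher.
        pick = 0 if est[0] >= est[1] else 1
        naive_sum += est[pick]      # what a naive step (b) would plug in
        realized_sum += true[pick]  # what the chosen option is really worth
        oracle_sum += max(true)     # value if the payoffs were known exactly
    return naive_sum / n_strata, realized_sum / n_strata, oracle_sum / n_strata


naive, realized, oracle = simulate_tree()
print(f"naive estimate of tree value:     {naive:.3f}")
print(f"value the decisions really earn:  {realized:.3f}")
print(f"value with perfect information:   {oracle:.3f}")
```

In this sketch the decisions themselves are sensible given the noisy estimates; the gap between the first two printed numbers is the curse, since it comes entirely from plugging the selected estimates into the average.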

In 2007, Erwann Rogard, Hao Lu, and I published a paper on the topic, including the above diagram. Here’s our abstract:

The evaluation of decision trees under uncertainty is difficult because of the required nested operations of maximizing and averaging. Pure maximizing (for deterministic decision trees) or pure averaging (for probability trees) are both relatively simple because the maximum of a maximum is a maximum, and the average of an average is an average. But when the two operators are mixed, no simplification is possible, and one must evaluate the maximization and averaging operations in a nested fashion, following the structure of the tree. Nested evaluation requires large sample sizes (for data collection) or long computation times (for simulations).

An alternative to full nested evaluation is to perform a random sample of evaluations and use statistical methods to perform inference about the entire tree. We show that the most natural estimate is biased and consider two alternatives: the parametric bootstrap and hierarchical Bayes inference. We explore the properties of these inferences through a simulation study.

I kinda like the paper. I wouldn’t say it’s one of my all-time favorites, but I think it’s interesting, and I like that we offer two different solutions to the problem.

On the downside, the paper seems to have disappeared without a trace. In 20 years, it’s only been cited three times, and none of them look very impressive:

“Using Alternating Decision Treets,” indeed.

Maybe one problem with our paper was its dry-as-dust title, “Evaluation of multilevel decision trees.”

This all came to mind because Sean Manning pointed me to this post, “The best cause will disappoint you: An intro to the optimisers curse.” Now that’s a good title.

It seems that the term “optimizer’s curse” came from this 2006 paper by James Smith and Robert Winkler, which has a lot of overlap with our article that appeared a year later. Both papers use hierarchical Bayesian analysis. Their paper is better than ours, for sure, and not just in the title, as they make a much better case for the importance of the problem. But we were working independently. Too bad: had we joined forces we could’ve produced something better, as each of the two papers had lots of material that was not in the other. Smith and Winkler consider the problem of choosing among many options with different levels of uncertainty, whereas we consider a multiplicity of binary decisions. These are just two cases of the general principle.
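The many-options version can be sketched the same way. This is a textbook normal-normal shrinkage example under assumed, known variances, not the model from either paper; the noise levels, the prior, and the name disappointment are all hypothetical. It compares picking the option with the highest raw estimate to picking the option with the highest posterior mean, and in each case records how far the chosen option's estimated value sits above its true value.

```python
import random


def disappointment(n_options=10, prior_sd=1.0, n_sims=20000, seed=2):
    """Average (estimated minus true) value of the chosen option, two ways."""
    rng = random.Random(seed)
    # Assumed: options differ in how noisy their value estimates are.
    noise_sds = [0.5 + 1.5 * k / (n_options - 1) for k in range(n_options)]
    naive_gap = shrunk_gap = 0.0
    for _ in range(n_sims):
        true = [rng.gauss(0.0, prior_sd) for _ in range(n_options)]
        est = [t + rng.gauss(0.0, s) for t, s in zip(true, noise_sds)]
        # Posterior mean under a N(0, prior_sd^2) prior with known noise sd:
        # noisier estimates are shrunk harder toward the prior mean of zero.
        post = [e * prior_sd**2 / (prior_sd**2 + s**2)
                for e, s in zip(est, noise_sds)]
        i = max(range(n_options), key=lambda k: est[k])   # naive choice
        j = max(range(n_options), key=lambda k: post[k])  # shrunken choice
        naive_gap += est[i] - true[i]
        shrunk_gap += post[j] - true[j]
    return naive_gap / n_sims, shrunk_gap / n_sims


naive_gap, shrunk_gap = disappointment()
print(f"average disappointment, raw estimates:   {naive_gap:.3f}")
print(f"average disappointment, posterior means: {shrunk_gap:.3f}")
```

When the prior used for shrinkage matches the process that generated the true values, as it does by construction here, the second number should hover near zero. With a misspecified prior that would no longer be guaranteed.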

The above-linked post, by someone who goes by the handle “titotal,” is good too. It doesn’t have any new technical material, but it explains the problem in plain English from first principles, goes through some examples, and discusses some of the policy implications.

Treating AI review like the contentious policy design problem it is

This is Jessica. Many researchers are thinking about what we should do about scientific peer review now that AI makes producing papers so much easier. Submission numbers keep getting higher: in the past week, I saw reports that the most recent ACL submission cycle got 17k+ submissions, up from ~10k last cycle. TMLR went from getting 500 submissions every 60 days or so to getting the same number every 19 days. There are simply not enough human reviewers to handle the surge, at least not without a dip in quality. The noisier the review system gets, the greater the incentive to submit sloppy papers, because you might get lucky. This is the so-called “review death spiral.”

It is a hard problem. Quotas on submissions per author are one avenue forward, which TMLR just announced it would adopt. Not surprisingly, many reviewers are also turning to AI to help. The question becomes how to design AI review protocols to help reduce some of the noise, through preliminary filtering or flagging or helping guide human attention to parts of a paper that are most likely to be problematic. 

But what sorts of checks should an AI review assistant run on a paper? It’s useful to separate basic integrity violations AI could flag, like is there evidence of plagiarism, fake citations, missing code/data to reproduce main results (which are comparatively less controversial) from “epistemic filters,” like does the paper pass replicability checks, robustness checks, preregistration checks, statistical significance checks, etc. There’s a temptation to blur these things in proposing how to apply AI to review. It’s easy to assume that the metascientists have already established that practices like replicability or preregistration are truth-indicating and we can just implement them at scale (and indeed, ML researchers are citing open science and other reform arguments to back their proposals).

But if there’s one lesson to be learned from the aftermath of the replication crisis, it’s that there is no small, stable, non-conflicting set of detectable signals of good science that will find the good stuff and reject the bad. There are heuristics that can be useful prompts for deliberation – get in the habit of preregistering, make sure you can replicate your results, test the sensitivity of your results to choices you made along the way – but things get weird when we start treating them like universal requirements. Authors shift attention away from unrewarded signals, like better theory or exploratory work, and become preoccupied with rigor signaling through their methods. The result is not necessarily more thoughtfulness. 

And so even if the AI review tools we create are simply intended to inform human reviewers about what checks a paper passed, what we implement will have important policy implications by incentivizing more work like that in the future. I don’t think we are in a good position to predict what happens if suddenly we require multiverse robustness or statistical significance in a field like machine learning, which has in many ways been all about iterative improvement and “frictionless reproducibility” rather than individual results passing all the robustness checks.

The answer is not to avoid using AI in review until we can find a non-gameable set of credibility qualities to have AI focus on, as some have recently argued (though I agree with the linked paper that we need more rigor in how we go about motivating review tools). Non-gameability sounds nice, but any automated review policy that allocates attention will be gameable, because ensuring good science is not so simple as finding the right checklist. The relevant question is instead what assumptions and downstream incentives we are willing to tolerate. To this end, at the very least we should get in the habit of spelling out the assumptions we’re making, so that the trade-offs of focusing on particular proxies become explicit.

I wrote up this view recently in a paper called “Stop Treating Metascientific Heuristics as Quality Filters in AI Review.” Here’s the abstract: 

AI-implemented checks for reproducibility, robustness, preregistration, claim scope, and other intended proxies for scientific credibility can extend human reviewers’ capabilities. However, treating metascientific heuristics–whose theoretical grounding remains contested or incomplete–as necessary and sufficient signals for filtering out bad science is counterproductive to scientific progress. The emerging literature blurs the line between integrity filtering, based on necessary but insufficient signals of validity like reproducibility of stated results or lack of fake citations, and epistemic filtering, which uses machine-detectable signals to judge scientific quality. Drawing on critical metascience, we show that commonly proposed signals of research quality are insufficiently justified as general indicators of scientific value. The answer is not necessarily to ban AI in review, given the deluge of submissions venues are facing. Instead, in recognition of how any use of automated signals–even when deployed with human oversight–will shape attention and create incentives upstream, developers of AI review tools should explicitly specify their assumptions about how proxy signals inform on scientific quality in the context of specific review decisions. This approach treats AI review contributions as contestable decision policies that will shape future research, acknowledging the value-laden nature of scientific judgment and surfacing relevant tradeoffs. 

Rather than arguing for or against any particular proxies, I’m more interested in the methodological and philosophical mindset we should bring to the new questions raised by AI review. To demonstrate what I mean by more explicit motivation, I analyze an example review decision problem and set of detectable signals in the appendix, drawing on an analysis of how statistical significance and exact replication success relate to signal-to-noise ratios measured under error from a recent paper by Erik van Zwet, Andrew, and Witold Więcek. The takeaway is that the value of a proxy will depend on how you define the latent state you care about (e.g., whether the direction of an effect was correctly estimated, how big the true signal-to-noise ratio is), what you assume about the generating process (i.e., how the proxy noisily reflects the latent state), and what you assume about the decision-maker’s choice of actions and utility function. By suggesting this approach, I am *not* suggesting that one can validate a new review tool’s utility before it’s been deployed. The point is that there will be trade-offs no matter what, and the best we can do is be concrete about the kinds of assumptions that have to hold for proxies to be useful in review, so the community can debate what risks they are willing to accept.

In this sense, my argument is very much along the same lines as Devezer et al.’s argument that those proposing reform procedures should adopt more formal methodology to avoid unwarranted overgeneralization. Once checks become part of review infrastructure, they stop being neutral diagnostics and become policy levers. Let’s start treating them as such in research on AI review.

Gambling provides a gentle rocking of the emotions to put you in a pleasant baby-like state

A commenter recommended the book, Addiction by Design: Machine Gambling in Las Vegas, by the anthropologist Natasha Dow Schüll, and I checked it out of the library. It’s a study of people who play slot machines and video poker, focusing on the locals: Vegas residents who have some low-level gambling addictions as part of their lives.

Nowadays, I guess that much of this business has been supplanted by machine gambling that you can do on your phone in the comfort of your own home. But the market for gambling must be far from being tapped: I imagine that there are many millions of potential gambling addicts out there, available to be hooked by some form of gambling or another.

As a statistician, I have mixed feelings about gambling. Ever since I was a kid, I’ve thought that probability is cool, and I like to bet. When we were kids we had a toy roulette set that we would play (just betting chips, not real money) and I’ve enjoyed poker and informal sports betting. The last time I bet on anything was about 20 years ago, but that’s just more me getting older than anything else.

At the same time, there are all these addicts, and all the people who might not be addicts but who still degrade their standard of living, not to mention reward evil people (even if they’re pleasant as individuals, they’re in an evil business; sorry, Nate!). And it just keeps getting worse.

To a statistician, this is all an endlessly fascinating topic: the odds and all that, but also whatever it is in people’s brains that motivates them to spend thousands of dollars on lottery tickets, etc.

As Schüll writes in her book, the popularity of machine gambling (which she says is the source of the majority of casino gambling profits in Vegas) is particularly puzzling in that people are just pulling the lever over and over again, without the sense of human context or any feeling of agency.

There’s also the interaction between the players and the people who make money from the machines:

For extreme machine gamblers, the experience of play is an end in itself–an “autotelic” zone beyond value as such, in that “no other reward than continuing the experience is required to keep it going.” Conversely, for the gambling industry the zone is a means to an end; although it carries no value in and of itself, it is possible to derive value from it. . . . In effect, gamblers’ drive to remain indefinitely suspended in the zone is rerouted, via the technological detours of the gambling industry, toward a destination of complete depletion.

It’s not just “the technological detours of the gambling industry,” it’s also politics: the industry doing what it takes to keep all this going, a gradual effort over many decades that continues to this day.

Later, Schüll summarizes:

Gambling addicts play machines to suspend themselves in a state of equilibriated affect.

This seems pretty accurate.

I would just add two things.

First, this equilibrium is not flat. It’s periods of stress, punctuated with the occasional excitement of winning and the frequent relaxing calm of losing. The best analogy I can think of is the way that a baby is calmed, not by lying completely still, but by being rocked in a somewhat irregular fashion.

Second, stakes matter. That “state of equilibriated affect” can only be achieved when real money is involved. I guess this is related to the phenomenon of habituation in drug exposure. Schüll talks with someone who started on a zero-stakes poker video game but then moved to the machines that take real dollars. We discussed this general idea recently in our post, Why isn’t it possible to play a fun and serious game of poker not for money?

It’s a good thing that babies don’t work that way–you can rock them a reasonable amount and they’ll be happy. No need to keep upping the stakes until the crib does a loop and the baby flies out the window. Although I guess that might happen if there were money in it.

“Are prediction markets causing more harm than good?”

The other day I was invited to an “anti-debate” on the above topic, scheduled for this afternoon. I’d not heard about the concept of an anti-debate before; here’s the description:

The Anti-Debate is a new format for debate where participants build on each other’s insights, so that greater complexity can emerge.

Despite its name, the Anti-Debate is not anti-debate. It actually starts out like a traditional debate, with opening statements and rebuttals. But then it goes further — guiding participants to explore how they might integrate their perspectives into a bigger picture. Hence our tagline: First Debate, Then Elevate.

Sounds reasonable to me. They refer to the concept of steel-manning, and I’m skeptical of that, but I agree that standard debate formats have problems (just read The Topeka School!) and I’m very open to this sort of alternative.

The organizer, Winter Ku, referred to my posts on “the statistical skepticism about betting markets versus polls (self-reinforcing prices, thin volume), and more recently the integrity and harm concerns in your ‘Uh oh prediction markets’ writing, e.g. manipulation, the absence of insider-trading rules, and the gambling-like risks to vulnerable users,” and it seemed like it would be fun to have a chance to speak on this with several hundred people who might well be inclined to disagree with me. At the very least, I’d get some good questions, lots of pushback, and I’d probably change my mind about a few things.

The anti-debate was to be held at Manifest, an annual festival about prediction markets and forecasting at the same California location that had this blogging workshop a couple months ago. Unfortunately I was only invited to the Manifest thing a couple days ago and I wasn’t able to fly out on such short notice.

I hope the anti-debate goes well without me! Actually, it’ll probably go better without me than with me. I think I’m a careful and interesting writer with lots of good ideas, but I don’t know how well I’d do in a live debate. I imagine I’d get flustered. On the other hand, sharing objections to prediction markets, in front of a crowd coming from a much different perspective than me, but open to listening, could possibly do some good, as well as being a learning experience for me.

So maybe next year! I don’t know if they’ll put the anti-debate up on youtube or whatever; if so, it would be interesting to see the arguments on both sides.

P.S. I came across this entertaining and meandering report from Peter Miller describing this Manifest conference. So now I know what I missed!

When is detecting AI-generated text worthwhile?

This is Jessica. AI-text detectors are coming to play a bigger role in adjudicating what texts are worthy of our attention. There was the surprising case of an apparently AI-generated short story winning the Commonwealth Foundation Short Story Prize: Pangram, the leading detector, returns it as 100% AI generated. Pangram’s false positive rate is reported as roughly 1 in 10,000 in its own audits and near zero on medium-to-long passages in an external audit. Applying Pangram to the other 4 stories that won awards this year suggests two others were heavily AI-assisted. More recently, the NeurIPS Position Paper track announced that it was desk rejecting 18% of submitted papers that were detected by Pangram as fully AI-generated. Another 13% are getting followed up on with the authors to investigate AI use. In this case the Call for Papers made clear that submissions should be “substantially written by human authors,” so this should not have come as a surprise.

We’re having to reconsider what authorship means. Can a person create literature or express their position on a subject without writing a single sentence themselves? When do we really care who strung the words together?   

Some people think detection is a waste of our collective time because we will never reach an equilibrium. AI-generated text will keep shifting toward what passes the detector. Human writers will continually update their beliefs about what features are indicative of AI-writing, but will also be influenced to write more like AI by reading so much AI text. There’s no stable target, just an endless cat-and-mouse game that incentivizes being savvy enough at any given time to avoid getting flagged. Meanwhile people are being morally scorned and suffering reputational damage for being caught on the wrong side of things. This may disproportionately affect some writers (like non-native English speakers) who are finally seeing the playing field leveled a bit.

On the other hand, there are situations where it really is important to know who strung the words together. Education is the most obvious one. It’s just very hard to teach someone to think if they’re not writing down their ideas themselves. 

The problem is that outside of select scenarios like teaching, what we really tend to care about is who controlled the ideas, and this is not equivalent to who strung the words together. Some would argue that the latter is becoming increasingly irrelevant given that AI can write more fluently than many people and many people prefer AI-generated text. 

Of course the reason we’re seeing detection used to filter paper submissions is because the ideal process–where the content of each paper is carefully considered on its own merits–is increasingly untenable given the huge surge in submissions in some fields. It’s easy to pump out credible-seeming papers with minimal human oversight using AI, and enough people are doing this to create serious problems. 

Mostly my response is that if we are going to debate the value of detection we should be willing to make our assumptions explicit. So let’s walk through a toy model to think about what we’re really conjecturing about.

One way to think of the latent state that we actually care about in paper review is the author type. Let’s say type A authors come up with their ideas and do a lot of the writing themselves. Type B authors rely on AI to do much of the thinking for them, and also use AI to do much of the writing. Type C authors come up with their own ideas, but engage in extensive prompting to get AI to write everything they want to say for them.*

For each paper, we choose to either pass or reject, conditional on the output of a Pangram check. Let’s say we only care about whether it flags 100% AI generated or not, so the signal s is binary, where s=1 means AI detected.

Based on available Pangram audits, if a text is actually written heavily by AI there is a very high chance it flags as AI-generated: beta=P(s=1|AI written) with beta very close to 1. If a text is not written by AI, there is a very small chance it flags as AI-generated: alpha=P(s=1|human written). Pangram’s internal audits put alpha around 10^−4 but other audits find essentially zero false positives for medium-to-long passages. 

So P(s=1|A)=alpha, and if we assume Types B and C use AI to a similar extent for the writing, then beta=P(s=1|B)=P(s=1|C). The posterior probability that a flagged paper is from a Type B author is then:

P(B|s=1) = (beta × p_B)/(alpha × p_A + beta × p_B + beta × p_C), and since alpha is tiny and beta is close to 1, P(B|s=1) ≈ p_B/(p_B + p_C)
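To see how little is lost in that approximation, here is a small sketch comparing the exact posterior to p_B/(p_B + p_C). The equal prevalences and the beta values are placeholders chosen only for illustration, and the function names are hypothetical; alpha = 10^-4 is the figure from Pangram's own audits mentioned above.

```python
def posterior_b_exact(p_a, p_b, p_c, alpha, beta):
    """P(B | s=1) from Bayes' rule, with no approximation."""
    flagged = alpha * p_a + beta * p_b + beta * p_c  # P(s=1)
    return beta * p_b / flagged


def posterior_b_approx(p_b, p_c):
    """P(B | s=1) when alpha is treated as zero; beta then cancels out."""
    return p_b / (p_b + p_c)


# Placeholder prevalences: one third of authors in each type.
p_a = p_b = p_c = 1.0 / 3.0
for alpha in (1e-4, 0.0):
    for beta in (0.95, 0.99):
        exact = posterior_b_exact(p_a, p_b, p_c, alpha, beta)
        approx = posterior_b_approx(p_b, p_c)
        print(f"alpha={alpha:g} beta={beta:g} "
              f"exact={exact:.6f} approx={approx:.6f}")
```

With alpha this small, the Type A term barely registers in the denominator, which is why the answer ends up depending almost entirely on the ratio of Type B to Type C authors.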

The relevant considerations become what we think the author population looks like, and how costly we think a false positive versus a false negative are. 

As a starting point, let’s say that for our conference submissions this year, Type C is the rarest, at 20%, and Type A and Type B equally split the remaining mass at 40% each. Let’s also say that we consider rejecting an acceptable paper, c_FP, to be twice as bad as passing an unacceptable one, c_FN.

The optimal decision rule is to reject if c_FN * P(B|s=1) > c_FP * P(A or C|s=1), or equivalently P(B|s=1) > c_FP/(c_FN + c_FP)

With c_FP=2 and c_FN=1, this means we reject if P(B|s=1) > 2/3.

Under the prevalence assumptions above, P(B|s=1) is approximately 2/3, so we are right on the boundary. From the standpoint of making the right decisions for this particular conference cycle, it’s not obviously bad. But if Type C is a little more common, e.g., we shift a little mass from p_A to p_C to make p_C 0.25, then P(B|s=1) is about 0.62 and we shouldn’t desk reject based only on the flag. Similarly, if we were to decide that falsely rejecting an acceptable paper is three times as bad as passing an unacceptable one, the threshold would rise to 3/4 and we shouldn’t rely on the flag alone.
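The scenarios above can be checked with a few lines of code. This sketch uses the approximation P(B|s=1) ≈ p_B/(p_B + p_C) and exact fractions, so the boundary case is not blurred by floating-point rounding. The function names are hypothetical, and the last scenario (p_C = 0.1) is an extra made-up case included only to show a situation where the flag alone would suffice.

```python
from fractions import Fraction


def posterior_b(p_b, p_c):
    """Approximate P(B | s=1), treating alpha as negligible."""
    return p_b / (p_b + p_c)


def reject_threshold(c_fp, c_fn):
    """Reject only if P(B | s=1) exceeds c_FP / (c_FN + c_FP)."""
    return Fraction(c_fp, c_fn + c_fp)


def should_desk_reject(p_b, p_c, c_fp, c_fn):
    return posterior_b(p_b, p_c) > reject_threshold(c_fp, c_fn)


scenarios = {
    "baseline (p_C=0.20, c_FP=2)": (Fraction(2, 5), Fraction(1, 5), 2, 1),
    "more Type C (p_C=0.25)": (Fraction(2, 5), Fraction(1, 4), 2, 1),
    "costlier false positive (c_FP=3)": (Fraction(2, 5), Fraction(1, 5), 3, 1),
    "hypothetical: fewer Type C (p_C=0.10)": (Fraction(2, 5), Fraction(1, 10), 2, 1),
}
for name, (p_b, p_c, c_fp, c_fn) in scenarios.items():
    post = posterior_b(p_b, p_c)
    thresh = reject_threshold(c_fp, c_fn)
    verdict = should_desk_reject(p_b, p_c, c_fp, c_fn)
    print(f"{name}: posterior={float(post):.3f} "
          f"threshold={float(thresh):.3f} reject_on_flag={verdict}")
```

Because the rule uses a strict inequality, the baseline case, which sits exactly on the boundary, comes out as "do not reject on the flag alone"; at that point the two actions have equal expected cost, so either choice is defensible.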

This model is obviously very simple. But it shows us what kinds of things we have to make assumptions about in the most basic case. Obviously I don’t really know how many people are using AI blindly to write papers, nor how many people are relying heavily on AI to write up their own ideas. You should take my numbers with a grain of salt. Personally I can’t imagine how relying on AI to do all the writing when I came up with the ideas would ever feel efficient, because I tend to have strong opinions on how things are said. But I can accept I am probably more of a control freak than many others. And AI overreliance is easy to slip into. Maybe papers chairs from recent ML conferences (or arXiv moderators) have estimates on bad-actor rates based on what they are seeing. 

What this exercise can’t tell us is how scientific progress is impacted by the warping of incentives that can happen when we use AI-detection as a filter. Classic principal-agent problems suggest that when we care about something hard to observe—like scientific quality or long-term epistemic value—but must rely on observable proxy signals to judge authors’ outputs, we should expect authors to shift more effort toward improving exclusively on those proxies. Avoiding m-dashes and ‘not this, but this’ constructions and whatever else currently ups the posterior probability of AI-generation is orthogonal to the actual thinking that research requires. What if relying more heavily on AI to write up our ideas is a good idea for science in the long run, in terms of more clearly communicating the ideas or saving a lot of time, so that we can get more good ideas out in the same amount of time? Then too much emphasis on detection might slow us down. However, I’m doubtful we are currently anywhere near a state of the world where discouraging writing with AI is as costly for scientific progress as spending time reviewing and reading many more questionable AI-generated papers is. The bigger threat at the moment is the slop overwhelming our ability to find the good stuff.

*We could also posit Type D authors that get AI to generate the ideas, but then write the papers themselves to evade detection, or are extremely good at getting AI-written text to evade detection. But this seems much less likely so I’m ignoring it.

Against shallow anti-rational humanism

Jessica writes:

I get so tired of people dumping on decision theory because real world decisions are complex. If decision theory is so deeply flawed, I’d love to know what alternative methods the critics advise for trying to evaluate and improve decision making in some real world setting. Should we give up on modeling completely because some cause problems for our assumptions? What happened to the epistemic value of attempting to formalize goals so as to better understand what components we think are at play? Do we really want to go back to talking about man as a creature of instinct and habit and leave it at that?

I agree, and this reminds me of a discussion from twenty years ago (!) about the transition from viewing people as “rational animals” to viewing people as “irrational computers.”

Here’s Thomas Jefferson from 1823:

We believed . . . that man was a rational animal, endowed by nature with rights, and with an innate sense of justice; and that he could be restrained from wrong and protected in right, by moderate powers, confided to persons of his own choice, and held to their duties by dependence on his own will.

He’s coming from a liberal (in the U.S. politics) perspective, with the idea being that rationality is a way to move forward from outmoded feudal arrangements. Not that this was so easy–Jefferson owned slaves!–, but nobody said that rationality was easy, just that it’s a way forward.

This association, in which the left was associated with utopian rationality and the right was associated with sensible acceptance of irrationality, continued for another century. Consider, for example, the contrast between the rationalist and socialist George Bernard Shaw and the Catholic conservative G. K. Chesterton. This association of rationality with the left continued through the New Deal period in the U.S. and the idea of the Soviet Union as being scientifically socialist. The second world war pitted Soviet central planning and “Fordist” American organization against the blood-and-soil Axis powers.

Sometime during the mid-cold-war period there was a shift, at least in the U.S. and its allies, where science and technology were associated with the military-industrial complex and gained a conservative tinge, while the left embraced an anti-technology, back-to-the-land vision. “Humanism” moved from a conservative, roll-back-the-tide, Chestertonian position to a liberal, fight-the-Man position.

Nowadays things are a mess: conservatives support military and police hardware, coal, nuclear power, bitcoin, data centers, and gas guzzlers more generally, but conservatives also oppose vaccines and scientific research more generally, and Biblical creationism hasn’t gone away either. And, with conservatives in charge of the country and much of public discourse, liberals are often defining themselves based on what they oppose.

I’m with Jessica in that I see no conflict between humanism and rationality. Rationality is an ideal or a way of being, not an algorithm. Yes, we’re animals, and rationality is one of our very useful tricks. I wouldn’t want to abandon rationality or define ourselves against it, any more than I’d want to abandon running or singing or any of the other things that we can do so well, when we do them well.

Noem’s Razor and why I think the concept of “unintended consequences” is overrated

I was thinking more about Noem’s Razor (“Never attribute to stupidity that which is adequately explained by malice”) and it reminded me that “unintended consequences” often were actually intended, a principle that I discussed back in 2008 in the context of Freakonomics, that reliable purveyor of conventional wisdom; see also here and here.

One of my general problems with the concept of “unintended consequences” is that it so often seems to be used either as an argument against a proposed reform (recommending to not do this seemingly good thing because of its unintended consequences; what Albert Hirschman called the “perversity thesis” in his classic book, The Rhetoric of Reaction) or as a way to get evildoers off the hook by arguing that their bad actions were actually the unintended consequences of somebody’s good intentions.

I have a similar problem with Hanlon’s Razor (“Never attribute to malice that which is adequately explained by stupidity”). Often Hanlon’s Razor applies, that’s for sure, but I also think it can be a way to let people off the hook.

Also, often the simpler explanation is the right one. In the motivating example for Noem’s Razor, someone described the lethal behavior of the immigration police in Minneapolis as a “Sad case of poor incentive design (ICErs create expensive externalities bc of legal, reputation. etc costs of processing bad detentions and arrests. Textbook amateur mistake.”–but it seemed to me more likely that those police were doing what the government wanted. So the incentives (by which the agents can break the law without fear of consequences) worked directly. There’s no evidence that the consequences were unintended. The political consequences may well have been not as desired, but I see that as more of a political miscalculation than anything else.

As the economists say, when there’s a policy that seems like it doesn’t make sense, think more carefully about the incentives. And of course this principle applies much more generally, as in the literature on regulatory capture.

I don’t buy the argument that the nice guys are the real assholes. I think the assholes are usually the real assholes.

That doesn’t mean I think that all purported do-gooders are actually doing good–see here, for example. People need to be evaluated based on what they do, not what they say.

15 new articles on statistical workflow!

Aki, Richard, Lizzie, and I put together a special issue on Statistical Workflow for the Philosophical Transactions of the Royal Society. I guess “royal” isn’t as impressive as it used to be, but still.

Statistics and data analytics play an increasingly important role in and across science and policy. But much of what is done by the best practitioners–their “workflow”–is tacit knowledge only glanced over in textbooks and research articles. In this new collection covering a wide range of disciplines, leading statisticians and researchers discuss the motivations and details for their workflows.

The four of us did this project because we were all interested in Bayesian workflow, and we wanted to learn more about statistical workflow in general, not just the Bayesian part.

Here’s what’s in the issue:

  • Statistical workflow, by Andrew Gelman, Aki Vehtari & Richard McElreath
  • Unsupervised machine learning for scientific discovery: workflow and best practices, by Andersen Chang, Tiffany M Tang, Tarek M Zikry & Geneva I Allen
  • PCS workflow for veridical data science in the age of AI, by Zachary T Rewolinski & Bin Yu
  • Simulations in statistical workflows, by Paul-Christian Bürkner, Marvin Schmitt & Stefan T Radev
  • An automatic finite-sample robustness metric: when can dropping a little data change conclusions? Part I: definitions and experiments, by Ryan Giordano, Rachael Meager & Tamara Broderick
  • An automatic finite-sample robustness metric: when can dropping a little data change conclusions? Part II: theory and intuition, by Ryan Giordano, Rachael Meager & Tamara Broderick
  • Building a Backdrop of Meaning in Magnitude (BoMM) as part of research workflow, by Megan Dailey Higgs
  • A preliminary data analysis workflow for meta-analysis of dependent effect sizes, by Elizabeth Tipton, James Pustejovsky & Jingru Zhang
  • A four-step simulation-based workflow for ecological analysis and science, by EM Wolkovich, T Jonathan Davies, William D Pearse & Michael Betancourt
  • Scientific workflow in experimental economics, by Anna Dreber & Séverine Toussaert
  • Hidden processes of workflow in cognitive developmental psychology, by Lauren N. Girouard & Susan A. Gelman
  • Reproducible workflow for online AI in digital health, by Susobhan Ghosh et al.
  • Model checks for Bayesian estimation and forecasting of health coverage indicators in low- and middle-income countries, by Leontine Alkema et al.
  • Closing the gap between statistical and scientific workflows for improved forecasts in ecology, by Victor Van der Meersch, James Regetz, T Jonathan Davies & EM Wolkovich
  • Machine learning workflows in climate modeling: design patterns and insights from case studies, by Tian Zheng et al.

Lots of good stuff here, and lots of different perspectives. Thanks to all the authors. The issue is here, and all the papers should be freely available.

If you have any thoughts on the articles in the volume, or on any other statistical workflow topics, just let us know right here in the comments box.

How much skill is in “skill games”? There can’t be much.

A few years ago we posted on luck vs. skill in poker and luck vs. skill in sports.

A new one of these came up when Palko pointed me to this disturbing news article, “They Look Like Slot Machines. They Pay Out in Cash. And Critics Say They Are Getting Workers Killed,” which reports:

Store clerks in Pennsylvania have been robbed and shot while handling payouts for “skill games,” which are not subject to the security standards required of gambling operations. . . .

They look like casino slot machines and video arcade games, but they are neither. They are skill games. Like their name implies, players must use their skills — memory, reflexes, strategy, recognition — to win cash. They don’t solely rely on the luck of the draw, like with slot machines. . . .

The Pennsylvania Gaming Control Board licenses 17 casinos and 75 truck stop video gaming terminal facilities, requiring them to have secure facilities, trained staff, and digital video recording. Their gambling machines also have to be linked to a centralized computer monitoring system. Businesses that offer skill games are not held to any standards, their critics say. As a result, some are putting their employees in danger by having them pay winners with cash. . . .

Some gruesome stories follow, along with predictable quotes from evil people making money off these things.

“Skill games”?

But here’s my question. How much skill is actually in these “skill games”? I assume not much, because, if the games really did involve skill, then skillful players could just show up and win regularly.

I guess the “skill games” could involve some small amount of skill, but not enough so that skillful players could beat the house edge.

James Heathers will fix Wiley’s problems for less than 3.7 million dollars (that is, 2,553,739 Jamaican beef patties, 47,064 whisky-sodden meals at Newark airport, or nearly 218 invites to a conference featuring Gray Davis, Grover Norquist, and a rabbi)

The data thug quotes from an April 2023 post by the EVP of Research at Wiley:

In September 2022, Wiley identified and immediately alerted the industry to paper mill activity we found operating at scale. Specifically, we found fraudulent outside editors that had subverted our processes and workflows, leading to a proliferation of bad content. This scheme hit Hindawi’s Special Issues program hard.

For those who are unfamiliar with academic publishing: Wiley is a long-established firm.

Back when I was a student, Wiley was perhaps considered the #1 publisher within statistics. They published Feller’s classic books on probability, Cochran’s classics on design of experiments and survey sampling, and many other standard texts.

In recent decades, as with other academic publishers, they’ve branched out into other publishing-related businesses, for example, Hindawi, which has a habit of filling your inbox with spam about dodgy journals. From Wikipedia: “In 2023 and after over 7000 article retractions in Hindawi journals related to the publication of articles originating from paper mills, Wiley announced that it will cease using the Hindawi brand and will integrate Hindawi’s 200 remaining journals into its main portfolio. The Wiley CEO who initiated the Hindawi acquisition stepped down in the wake of those announcements.”

To those of us of a certain age, seeing Wiley and Hindawi in the same sentence is disturbing in itself, a sign of what the world of publishing has come to. Not that publishing has ever been pure (just for example, back in the 1960s and 70s, legitimate publishers released fake-science books such as Chariots of the Gods, The Bermuda Triangle, and The Jupiter Effect); still, it was sad to see the once-respected Wiley name dragged so low.

You can hire James Heathers for less than $3.7 million

Heathers points out that, because Wiley is a public company, certain of its business records are required to be public, and he found this:

Heathers explains:

‘Legal settlement’ is exactly what it sounds like, and the footnote description is ‘a litigation matter related to consideration for a previous acquisition’.

The shorthand is: their own shareholders sued them. They said they were going to, and did. . . .

This is not uncommon . . . Any large public company in business for long enough has seen a suit or two like this. . . . Generally, they settle. . . . this is noticeably more expensive than running a full-scale proactive research integrity program.

And here’s the kicker:

For 3.7M, you could have the world. I [Heathers] am quite confident in saying: I could run that as the operating budget of a fraud mitigation unit for multiple years, and drop the amount of nonsense by . . . maybe two-thirds, three-quarters? within that time.

All right, then!

This is not new to Wiley

Just one thing. This is not new. Wiley’s been in the lucrative science-fraud business for a while. Recall this story from 2011, “Wiley Wegman chutzpah update: Now you too can buy a selection of garbled Wikipedia articles, for a mere $1400-$2800 per year!”

But, yeah, the Hindawi business sounds a lot worse. When Wiley was conned by a formerly respected academic into republishing Wikipedia content and charging money for it, that was just a one-time breach in editorial standards. The Hindawi story seems like something else entirely. On the other hand, when it comes to fraudulent publishing, they had some track record.

Adversarial journalism

Heathers writes:

There absolutely IS adversarial journalism in academia/research/science/etc. Science Magazine, Undark, Vox, etc. have all published great pieces on this.

Ahhhh, Undark Magazine . . . that brings up memories. A few years ago, Undark published a terrible article, misrepresenting a scientific story in which I’d been involved. That was adversarial journalism in the worst sense, in that the journalists were coming in with an agenda and using it to distort and slam anyone who disagreed with it. See here for more on that story.

I’m not saying you shouldn’t trust anything in Undark just cos they ran that one bad article, any more than I’d un-recommend the books by Cochran just cos they were published by Wiley. I just thought it was funny that Heathers mentioned Undark in particular, given that my only experience with that magazine was so unpleasant.

The Application Matters: Medical Ethics and Counterfactual Utilities

I believe, as applied statisticians, we need to get our hands dirty and immerse ourselves in the applications we try to address. This post is mostly about medical ethics and the famous “first, do no harm” principle. It is also an attempt to understand how statistics can serve medical practice. The motivation for this comes from a recent debate in the statistics literature about counterfactual losses, which often invokes this “first, do no harm” principle as a motivation. Much has been written about the theory of these counterfactual losses (and I’m sure they will find a fruitful application), but do they actually speak to the challenge of medical decision-making that the “first, do no harm” principle seeks to address?

I will argue that they cannot, because this principle is concerned with medicine at its most human: medical practice centered on the relationship between an individual patient and an individual physician. But what can statistics help with? Modern medical obligations acknowledge that medicine is embedded in society; they highlight medical practitioners’ concern with justice and with reducing health disparities. These are concerns statistics can help to address.

But let me start at the beginning. There’s a recent literature that considers decision making under counterfactual loss: what if the utility of your decisions depends not only on the realized outcome but also on what could have been, on a counterfactual? A paradigmatic example is the following “first, do no harm” utility: Suppose you’re administering a drug and there are only two extreme outcomes. The patient may live, or they will die. The literature (e.g., Bordley, 2009, Ben-Michael et al., 2023, Christy and Kowalski, 2026) has interpreted the medical aphorism “first, do no harm” as requiring a utility function that assigns asymmetric weights to saving a life and causing a patient’s death. The disutility from killing a patient who, counterfactually, would have survived outweighs the positive utility of saving a patient who otherwise would have died. Although this may initially seem attractive, several authors have pointed out complications that arise when decisions are based on such counterfactual losses (e.g., Dawid and Senn, 2023, Sarvet and Stensrud, 2023).

Andrew and I contributed to this literature with a small example that seemingly produces a counterintuitive recommendation, which I discuss below.

In response, Koch and co-authors write:

[T]his seemingly nonsensical result can be reasonable in a different setting. […] It may be reasonable for a  physician to prefer standard care, prioritizing the avoidance of adverse counterfactual outcomes over  improvements in expected benefits. Indeed, such a decision reflects the Hippocratic principle of “do  no harm”. […] This example underscores the fact that a utility function represents the preferences of the  decision-maker and is therefore inherently subjective and context-dependent.

This uncovers a problem with our argument based on intuition (see, this decision doesn’t make sense, does it?). Intuition, of course, can be misleading. One way our example might be misleading, as Koch et al. point out, is that it may describe a setting in which we simply do not hold these counterfactual utilities. If we were to transplant the same recommendation into an appropriate setting, it might no longer appear nonsensical and might instead conform to how we think we should behave.

This has me very excited. I believe statistics is at its best when it takes its applications seriously. So, in this blog post, I want to do just that.

I will briefly give the example Andrew and I came up with to show that a “do no harm” utility can lead to counterintuitive decision recommendations. We do so through an example involving Russian roulette. It is a useful example, but by no means an accurate representation of what we would consider plausible in real medical settings. What it does show, however, is that we need to be really careful with these “do no harm” utilities: if we don’t really hold them, they may lead to nonsensical decisions.

Taking the application seriously, we will dive into medical ethics to ask whether the proposed counterfactual “do no harm” utilities help with medical decisions. We do so by briefly examining the origin and history of the “first, do no harm” principle. We will see that “do no harm” is perhaps best understood in the context of a professional ethic that commits physicians to the rules of their craft and to respect for each individual patient. Statistics cannot truly speak to this individual-level patient-physician relationship. Since the Hippocratic Oath, however, medicine has changed substantially. With the advent of scientific methods in clinical medicine, doctors face new moral obligations not captured by the “do no harm” principle. Some of these new obligations arise from the relationship between medicine and society; others arise from the use of scientific methods themselves. We will look at modern medical oaths to get a glimpse of these new obligations and of how statistics can help fulfill them.

Russian Roulette 

As a starting point, let me present our simple and somewhat morbid example in which counterfactual utilities give a counterintuitive decision recommendation: Imagine we are choosing between two games of Russian roulette. In the first game, the status quo, we play with a six-chamber gun, one chamber of which is loaded. That is, we face a one-in-six chance of death. We are then offered the option to switch to a seven-chamber gun, the new alternative “treatment.” If we switch, we face better odds: only a one-in-seven chance of dying. By switching games, we lower our probability of death, which to me seems preferable. 

What would the counterfactual “do no harm” utility function recommend? To figure this out, we treat the outcomes under either game of Russian roulette as (independent) potential outcomes and divide the population of players into four principal strata based on survival status. Only two of the principal strata are relevant for our decision, those in which a player would survive one game but die playing the other. It’s easy to work out that with probability 6/42 switching to the new gun saves you: you would die under the status quo but survive under the treatment. But with probability 5/42, you would have survived under the status quo, but switching to the new gun, you will die. Suppose we interpret “first, do no harm” as mandating that the negative repercussions of our treatment choice, the death of a player, outweigh the benefits of saving a life. For example, suppose saving a life has utility +1, while the death of a player has utility −2. Then the 6/42 chance that the treatment saves you, weighted by +1, is outweighed by the 5/42 chance that the treatment kills you in cases where, counterfactually, you would have lived, weighted by −2: the expected counterfactual utility of switching is 6/42 − 10/42 = −4/42, which is negative.

Under this counterfactual utility, we ought not to switch. It recommends we stick to the status quo, under which we face a higher chance of death. This strikes me as a counterintuitive decision recommendation.
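
The arithmetic behind this recommendation can be checked in a few lines. What follows is a minimal sketch in Python, assuming independent potential outcomes as in the text; the function name `switching_utility` and its parameters are hypothetical, introduced only for illustration.

```python
from fractions import Fraction


def switching_utility(p_die_status_quo, p_die_treatment, u_saved=1, u_harmed=-2):
    """Expected counterfactual utility of switching guns.

    Assumes the two potential outcomes are independent, so the probability
    of each principal stratum is a product of marginal probabilities.
    """
    # Stratum "saved": would die under the status quo, survives under treatment.
    p_saved = p_die_status_quo * (1 - p_die_treatment)
    # Stratum "harmed": would survive under the status quo, dies under treatment.
    p_harmed = (1 - p_die_status_quo) * p_die_treatment
    expected_utility = u_saved * p_saved + u_harmed * p_harmed
    return p_saved, p_harmed, expected_utility


# Six-chamber gun (status quo) versus seven-chamber gun (treatment).
p_saved, p_harmed, eu = switching_utility(Fraction(1, 6), Fraction(1, 7))
print(p_saved, p_harmed, eu)
# prints 1/7 5/42 -2/21 (that is, 6/42, 5/42 and -4/42 in lowest terms)
```

With symmetric weights (`u_harmed=-1`) the same function returns 6/42 − 5/42 = 1/42, which is positive, so it is the asymmetry alone that flips the recommendation toward the status quo.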

The “First, do no harm” Principle

There is, however, a limit to the force of this argument based on intuition. One might argue that the recommendation in the Russian roulette example is not evidence against counterfactual utilities in general, but rather an indication that, when playing Russian roulette, we do not hold utilities of this kind. When transplanted to a setting where we have such asymmetric counterfactual utilities, the same recommendation might be sensible. The counterfactual-utility literature often motivates asymmetric counterfactual utilities by appealing to the “first, do no harm” principle in medicine.

For the rest of this post, I will discuss whether counterfactual utilities are useful in this paradigmatic application: medical decision-making.

In a paper frequently cited by advocates of counterfactual utilities, Cedric Smith (2005) discusses the origin and limitations of the “first, do no harm” principle. It is actually not part of the Hippocratic Oath, or the wider Hippocratic corpus, as is often implied, but has somewhat nebulous roots. Smith traces its origin to the seventeenth-century English physician Thomas Sydenham. While undoubtedly catchy, this principle is not embedded in a larger ethical framework that would give guidance on its interpretation or justifications for its use.

This is a problem because, taken literally, this “first, do no harm” principle is a poor guide to medical decision-making. Let me cite Louis Lasagna, an American physician of the last century who was very involved in rethinking the Hippocratic Oath:

“To observe this advice [first, do no harm] literally is to deny important therapy to everyone, since only inert nostrums [quack medicine without active pharmaceutical ingredients] can be guaranteed to do no harm. It is more reasonable to ask doctors to balance the potential gains against the possible harm; would that we could only quantify these probabilities more precisely!” (Lasagna cited in  Smith, 2005)

A call to action for us statisticians if I ever saw one. Of course, the counterfactual-utility literature that cites this principle is not advocating what Lasagna warns against: doing absolutely no harm. Its proponents are well aware that benefits and risks must be carefully weighed against each other. If the principle is not meant to be taken literally, then its obscure origin becomes a problem: it gives us little insight into what actually matters to medical practitioners, because it is disconnected from any wider tradition that would help us interpret it. 

Luckily, we can find a similar, more nuanced statement in the Hippocratic corpus (Epidemics I):

“Declare the past, recognize the present, foretell the future: attend to these things. As to diseases, make a habit of two things—to help, or at least to do no harm. The art has three factors, the disease, the patient, the physician. The physician is the servant of the art.”

The Greek word here is technē (orig. τέχνη) which we might also want to translate as “craft”.  Medicine is a craft because the decisions a physician has to face cannot be made by rote application of knowledge. As a craftsperson, the physician as an individual becomes relevant. That is why the Hippocratic Oath commits the physician, as an individual, to be benevolent in each patient interaction. Medical ethics based on the Hippocratic Oath is not focused on outcomes, let alone utility, but concerned with the character of the physician and their obligations toward their patient (Pellegrino, 2006). It centers the patient-physician relationship. 

With this background in mind, we can understand why the “benevolence” implied in the imperative to help is qualified with the phrase “or at least do no harm”: if I’m already committed to help, it may seem that I’m already committed to do no harm. Lynn Jansen (2022) argues that this is where the professional aspect of medicine enters: As a professional, the physician needs to restrict their actions to those that align with their profession. That is, while they strive for benevolence in the sense of furthering the patient’s overall well-being, they reject all courses of action that would harm the patient’s medical well-being. This second aspect is often called non-maleficence.

Statistics and Medicine 

In modern medicine, this tension is heightened. Taking the patient’s moral agency seriously, a physician must be careful not to “confuse technical with moral authority” (Pellegrino, 2006) or override patients’ values. This is worth keeping in mind. The patient must be involved in weighing benefits and risks. Thus, the medical professional does not have sole discretion to choose an optimal treatment. “Help, or at least do no harm” is a professional mantra that guides a physician in their interactions with patients. It is not a constraint on optimal decision-making; it is a moral commitment to respect each patient.

This conception of medicine is in stark contrast to the world seen through the lens of statistics. Compare this focus on the individuality of both patient and physician with the following quotation from an 1835 report to the Academy of Sciences, written by a committee of four mathematicians, including Poisson, on operations for gallstones: 

“In statistical affairs … the first care before all else is to lose sight of the man taken in isolation in order to consider him only as a fraction of the species. It is necessary to strip him of his individuality to arrive at the elimination of all accidental effects that individuality can introduce into the question.” (taken from Hacking, 1990)

Statistics’ power lies in constructing aggregates, making disparate things hold together (Desrosières, 1998). Historically, these aggregates were useful for the emerging nation-state and were quickly adopted to address large-scale social problems, such as public health. Many professions, including medicine, strongly resisted losing sight of the particular (in our case, the individual patient) in favor of aggregates. Even randomized experiments, which we nowadays all too easily accept as the gold standard of evidence, had a hard time entering clinical medicine (Porter, 2020).

Due to this tension, modern medicine has a dual nature.  On the one hand, doctors are still committed to treating their patients as individuals — medicine is the art of healing. Yet with advances of scientific methods within medicine, and with the recognition that health must be understood in the context of society, doctors face new moral obligations (Pellegrino, 2006).

Modern Medical Oaths

To get a glimpse of these new obligations and the self-understanding of doctors in the twenty-first century, we can look to modern versions of medical oaths. While many doctors still take the ancient Hippocratic Oath, many medical schools revise the original text or students take an additional self-formulated oath. In 2005, for example, students at Weill Cornell Medical College began taking a revised Hippocratic Oath. Let me highlight a brief excerpt:

I vow […]

That above all else I will serve the highest interests of my patients through the practice of my science and my art; That I will be an advocate for patients in need and strive for justice in the care of the sick.

Notice the emphasis on justice; it’s not idiosyncratic to this oath. Two further examples show similar themes. The University of Pittsburgh School of Medicine’s class of 2024 took an oath that highlighted the social determinants of health and advocated for a more equitable health care system. Harvard Medical School’s class of 2019 vowed to combat structural oppression and promote social justice. In this admittedly selective set of examples, much emphasis is placed on how medicine relates to society. Core commitments are justice and the building of an equitable health care system.

So, how can we statisticians help modern medical practice? Modern medical ethics places great emphasis on patients’ autonomy and their freedom to choose based on their own values. For a patient’s decision to be well informed, deliberation about benefits and risks is central — but the decision ultimately depends on a personal tradeoff shaped by the patient’s values. For this reason, our goal should perhaps not be to optimize treatment decisions. We do need to help estimate the benefits and risks of treatments more accurately, but treatment decisions remain part of the individual patient-physician relationship. Instead, we should put more emphasis on identifying and reducing disparities in the health care system, focusing on medicine as embedded in society. The most important task may not be deciding which drug to administer, but reducing inequalities in access to treatment in the first place. I believe statistics has an important role to play in making health care systems more equitable and more just. 

I’m on the EPA science advisory board.

I just joined, and I’m one of 37 people on the board, a mix of people from academia, industry, and government.

If you google *EPA science advisory board*, you get sent to this page, which at first seems reasonable:

Look carefully, though, and you’ll see that the most recent meeting was in 2024!

Here’s the official description:

The SAB is a Federal Advisory Committee established by Congress to provide advice to the agency on scientific and technical matters. It is administered by the EPA Science Advisory Board Staff Office through a Designated Federal Officer. All meetings are open to the public. . . . SAB panel members serve until the work of the panel is complete. Some meetings are held virtually. Panels usually conduct 2-3 video teleconferences and one in-person meeting to discuss reports and work products before providing advice to the Administrator through the Chartered SAB.

My dad worked for the Environmental Protection Agency a long time ago–he was in Mobile Source Enforcement, which pretty much involved stopping people from manufacturing or selling leaded gasoline–and I’m inclined to serve my country when asked.

Then again, I just read this news article about Lee Zeldin, the current administrator of the EPA, and it’s pretty scary. The official documentation says that our role is to provide advice to the administrator. I guess then it’s up to him to decide what to do about it.

The Pick-the-Winner-Picker Heuristic: Preference for Categorically Correct Forecasts

A couple years ago, Jay Naborn wrote:

I am studying people’s preference for categorically correct forecasts (such as getting the winner of a sports game right) over error-minimizing ones (such as getting close on the margin). We have experimental evidence of this, why it happens, etc.

What I would be interested in doing is demonstrating that this preference is/can be a mistake. To do so, it would be nice to show that doing well in terms of minimizing continuous error is a better predictor of future winner-picking than is doing well in terms of winner-picking. I am curious if you have any leads as to some existing dataset that would be helpful here, or some simulation/modeling strategy that may work.

I replied that, yes, this relates to a point we made here.

Recently Naborn followed up:

The blog post you sent (and a couple others of yours) were very informative for our background thinking. My work (with Jonathan Bogard) on forecast evaluation is now published at the Journal of Marketing Research.

And here’s the abstract:

People routinely make decisions based on predictions made by others (e.g., political pundits, market analysts), so it is in their best interest to identify high-quality forecasts. Experts characterize good forecasting as minimization of continuous error (i.e., predictions close to the eventual outcome). By contrast, the present work reveals that laypeople typically see good forecasts as those that correctly predict an event’s categorical outcome (e.g., the winning team). Using within-subjects, between-subjects, and incentive-compatible designs, fifteen studies demonstrate this “pick-the-winner-picker heuristic” as well as its psychological mechanism: People evaluate forecasts by assigning separate weights to (a) categorical correctness and (b) continuous error minimization, depending on the overall importance of the categorical and continuous dimensions for that situation. Thus, in the common case when the categorical dimension matters most (e.g., sports contests), people prize forecasts that accurately predicted the categorical outcome (e.g., the winner, not the margin of victory). However, when the categorical dimension’s stakes are experimentally reduced, an attenuation is observed. While this describes how people typically evaluate forecasts, crucially, a dimension’s importance is not necessarily related to its diagnosticity of forecaster skill or reliability. Accordingly, the pick-the-winner-picker heuristic may constitute a normative mistake, while framing manipulations help debias judgments.

Interesting. It’s good to see research on this topic.
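
One simulation strategy of the kind Naborn asked about could be sketched as follows. This is only an illustrative sketch in Python, not the method used in the published paper: every number in it (the spread of expected margins, the game-to-game noise, the range of forecaster noise, the sample sizes) is a made-up assumption, and the helper names `pearson` and `score` are hypothetical. Each simulated forecaster gets a short track record, scored both by share of winners picked and by mean absolute error, and each score is then correlated with winner-picking accuracy over a long future run.

```python
import random


def pearson(x, y):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def score(noise_sd, n_games, rng):
    """Simulate one forecaster; return (share of winners picked, mean absolute error)."""
    hits, abs_err = 0, 0.0
    for _ in range(n_games):
        expected_margin = rng.gauss(0, 7)              # assumed spread of matchup strength
        outcome = expected_margin + rng.gauss(0, 12)   # assumed game-to-game noise
        forecast = expected_margin + rng.gauss(0, noise_sd)
        hits += (forecast > 0) == (outcome > 0)        # categorical correctness
        abs_err += abs(forecast - outcome)             # continuous error
    return hits / n_games, abs_err / n_games


rng = random.Random(2024)
# Assumed forecaster skill: a smaller noise sd means a more skilled forecaster.
skills = [rng.uniform(1, 15) for _ in range(300)]
past = [score(sd, 20, rng) for sd in skills]              # short track record
future_acc = [score(sd, 2000, rng)[0] for sd in skills]   # long future horizon

# How well does each past score predict future winner-picking?
r_winner = pearson([acc for acc, _ in past], future_acc)
r_error = pearson([-mae for _, mae in past], future_acc)  # negated so higher is better
print(f"past winner-picking vs future winner-picking: r = {r_winner:.2f}")
print(f"past (negative) MAE  vs future winner-picking: r = {r_error:.2f}")
```

Comparing the two printed correlations under different assumed track-record lengths and noise levels is one way to explore when continuous error is the more diagnostic measure; the conclusion will depend on those assumptions.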

An economist writes: “the fulminations over the #1 pick seem overheated to me.”

Jonathan Falk writes:

I [Falk] am always amazed at the amount of (digital) ink spilled on the perverse incentives involved in tanking to get the #1 draft pick. The current local woes of the Giants and Jets obviously contribute a lot to these discussions, but they happen all the time. As an economist, it’s clear to me that the value of a draft pick is the incremental value, not the absolute value. I’m completely aware that the upper tails of distributions have much more dispersion than the center, or even the 80th-90th percentile does, but the fulminations over the #1 pick still seem overheated to me.

First, of course, is the fact that assessment is made with error, and there are plenty of #1 busts in every sport. #2s can be busts as well, of course, but that merely lowers the expected difference between #1 and #2 as the true value of both is attenuated towards 0 — #1 loses more.

Second, there is the issue of team fit. Greatness is a vector, not a number, and if the teams ahead of you in draft order need something else, you still stand a chance of getting the player optimized for your needs. Going the other way, of course, is that higher draft picks absolutely lower the number of teams that can steal your guy.

Third, teams are… teams. One person can only contribute so much. So the relevant assessment is not how much better A is than B, but how much the addition of A versus the addition of B will change the prospects of your team, which I think is pretty obviously a lower difference, though I guess your rationale for voting runs in the other direction: you ought to judge a small incremental addition by the gigantic difference between winning a championship or not.

Fourth, more narrowly economic, every incrementally higher pick costs more. I don’t think that effect is huge in the context of overall payrolls, but isn’t that then another anomaly? If #1 picks are so dramatically better than, say, #5 picks, why aren’t they paid multiples more?

I don’t really have anything to say here, because I have no sense of how much teams are paying for #1 or #2 picks. I do remember a couple years ago that everyone was talking about Wemby, but basketball’s different than football because there are only 5 players on the court, so one player can make more of a difference.

The case of Wemby makes me think that one way this could be studied would be to compare different years. In some years there is a clear consensus #1 pick, other years not.

What advice do you have for this student who’s in his first year of college and interested in both statistics and political science?

Joey Jennings writes:

I’m a first-year statistics major and wanted to reach out because statistics and political science were my two main options when choosing a major, and I’m still considering law school down the line.

I’m very interested in how statistical thinking intersects with politics, public policy, and legal reasoning, and your career seems to embody that combination. I was hoping to ask whether you have any general advice for a student early in college who is trying to keep these paths open and build a strong foundation.

My response: I think I’m too old and too privileged to offer much useful advice to a young student just starting out. My own experience is that I always loved math but I didn’t want to do pure math–it just seemed pointless to try to prove theorems, knowing that there would be other mathematicians who were better than me, proving better theorems–so I studied physics, but then I took some classes in probability and statistics and the subject really grooved with me. I also took some political science classes, and it was interesting to see the relevance of mathematical and statistical ideas in understanding various aspects of voting and political representation. Back then the state of the art in political analytics was pretty low. There was some good work, but also lots of unthinking applications of inappropriate models, so there were lots of openings for a student to do innovative work. I guess things are even better now, in the sense that you can do innovative work at a much higher level, making use of what’s already out there.

As for advice: ok, yeah, I still think it’s a good idea to “learn to code.” Coding is the most rigorous thing out there, and it’s how we understand our statistical models (as discussed in our Bayesian Workflow book). Work on real applications where you can. And choose your courses more based on the quality of the teachers than on the descriptions of the classes.

And, ummm, anyone else out there have any further advice to offer?

A study is retracted after it turns out that its authors were misrepresented as “third-party experts” even though they were actually paid by the company?

Gur Huberman points to this news article:

A Study Is Retracted, Renewing Concerns About the Weedkiller Roundup

Problems with a 25-year-old landmark paper on the safety of Roundup’s active ingredient, glyphosate, have led to calls for the E.P.A. to reassess the widely used chemical.

In 2000, a landmark study claimed to set the record straight on glyphosate, a contentious weedkiller used on hundreds of millions of acres of farmland. The paper found that the chemical, the active ingredient in Roundup, wasn’t a human health risk despite evidence of a cancer link.

Last month, the study was retracted by the scientific journal that published it a quarter century ago . . .

The 2000 paper, a scientific review conducted by three independent scientists, was for decades cited by other researchers as evidence of Roundup’s safety. It became the cornerstone of regulations that deemed the weedkiller safe.

But since then, emails uncovered as part of lawsuits against the weedkiller’s manufacturer, Monsanto, have shown that the company’s scientists played a significant role in conceiving and writing the study.

Oh, what was that significant role?

Monsanto employees praised each other for their “hard work” on the paper, which included data collection, writing and review. One Monsanto employee expressed hope that the study would become “‘the’ reference on Roundup and glyphosate safety.” . . .

In retracting the study last month, the journal, Regulatory Toxicology and Pharmacology, cited “serious ethical concerns regarding the independence and accountability of the authors.” Martin van den Berg, the journal’s editor in chief, said the paper had based its conclusions largely on unpublished studies by Monsanto. . . . There was no disclosure of a conflict of interest on the part of the authors beyond a mention in the acknowledgments that Monsanto had provided scientific support.

There seems to be some controversy about the safety of this pesticide:

Dr. Philip J. Landrigan, who is a pediatrician and epidemiologist and the director of the Program in Global Public Health at Boston College . . . recently chaired an advisory committee for a global glyphosate study that found that even low doses of glyphosate-based herbicides caused leukemia in rats. . . .

Laboratory tests first flagged potential risks posed by exposure to glyphosate as far back as the early 1980s, and soon after, studies of Midwestern farmers exposed to herbicides started to show an increase in certain cancers. A U.S.-backed effort to eradicate coca fields in Colombia by spraying glyphosate from planes onto hundreds of thousands of acres of cropland led to widespread reports of illnesses among residents.

The 2000 paper declaring glyphosate safe was published against that backdrop. . . .

Bayer has paid out more than $10 billion to settle approximately 100,000 Roundup claims . . .

And then there’s the bigger picture:

The retraction points to a wider problem of research secretly funded by industries like tobacco and lead, said David Rosner, co-director of the Center for the History and Ethics of Public Health at Columbia University. “Shading the science to favor the corporate interest,” he said, was likely “the rule rather than the exception.” Journals needed to “press scientists more forcefully to identify conflicts of interest,” he said. “Huge financial interests are at stake.”

The most disturbing thing in the linked emails was that the Monsanto people referred to the authors of that paper as “third party experts” and as “independent experts.”

But if they were paid by Monsanto, then it doesn’t seem accurate to characterize them as “third party” or “independent” experts.

The research article appeared in 2000. The emails were released in 2017 in the process of a lawsuit. The article was retracted in 2025 (although the official publication date of the retraction is February, 2026, i.e., a month after the writing of this post).

I don’t know what to think about all this. On one hand, how much can you trust research on a controversial topic that was written, funded, and reviewed by one of the parties to the controversy? They do say this in the paper, “In this effort, the authors have had the cooperation of Monsanto Company that has provided complete access to its database of studies and other documentation,” but it sounds like Monsanto provided more than data access.

I guess I could try to read the original article . . . OK, let’s take a look:

The paper goes into details on three studies from 1988, 1991, and 1992 of oral doses in rats over 10 or 15 days. Then it looks like there was another study from 1973 on oral doses in rats for 15 days, and then three studies of skin exposure from 1983 and 1991, two on monkeys and one on humans. Then there’s a mouse study from 1992, rat studies from 1987 and 1992, a dog study from 1985, a rat study from 1979, a mouse study from 1983, a rat study from 1981, . . . ok, I’m getting tired now. There’s not really much for me to chew on here as a statistician. It does seem that belief in these results is going to boil down to your trust in the research team, and so the undisclosed conflicts of interest are a big deal.

On the other hand . . . I’ve done research funded by Novartis–they paid my colleagues and they paid me directly too. We published a paper based on that work–two of the authors were Novartis employees and two of the other authors had worked for me at the time (more precisely, they’d worked at Columbia under my supervision). That project used Novartis data, but it was a little different from the above-discussed Roundup article in that its purpose was methods rather than policy.

Also I did some consulting for Monsanto at one point, I think! I can’t remember the details, I think I was on the scientific advisory board of some company that was doing some agricultural stuff, I went to one of their meetings and then I stopped hearing from them, actually I can’t even remember if they paid me. So I’m not gonna get on my high horse and denounce industry-funded or pharma-funded research in general terms.

Two Health Economists Walk into a Bar: What bothered me in that conversation of Jay Bhattacharya and Emily Oster

Last week I was at a conference on enhancing scientific integrity (as I reported here), and one of the sessions was an interview of Jay Bhattacharya, the current director of the National Institutes of Health, and Emily Oster, a professor of economics at Brown University.

I referred to that session in a post the other day regarding the recent case of a report from the Centers for Disease Control and Prevention that was pulled by Bhattacharya, in his additional capacity as acting director of the CDC. I’ll get back to that story in a bit, but here I wanted to talk about some larger things that bothered me in the interview.

Before getting to my disagreements, let me give my positive take, which is that both the people in the interview had an air of moral seriousness.

This is important. So much of the discourse in politics and social science these days is polluted with cynicism, whether it be from history professor Niall Ferguson decrying the “wokeness” on college campuses when he’s not encouraging college students to do “oppo research” on each other, or Lawrence Summers sleazing around with a sex trafficker and then trying to enlist his rich friends to intimidate student journalists, or Cass Sunstein writing an entire book on a topic he knows nothing about, or Sunstein’s friend Adrian Vermeule promoting election denial, or Mehmet Oz and Andrew Huberman trading off their medical and scientific credentials to hawk dietary supplements, or Steven Levitt promoting dubious claims on mind-body healing and global warming denialism (presumably because they’re cool and transgressive, respectively), or Matthew Walker torturing the data, etc etc. I’m talking about researchers who see science as a path to glory, not to understanding, and politically-minded academics who will happily promote stupid ideas that push their agenda. Beyond that there are straight-up politicians who lie, cheat, and steal, and that’s bad too–but here I’m talking about that nexus between government, policy, and the human sciences.

Anyway, Bhattacharya and Oster weren’t like that. They recognize that we’re talking about serious issues here. When asked about disruptions to NIH funding, Bhattacharya emphasized the larger goal of improving public health, making the point that they want to fund a portfolio of projects to address health challenges. I have no sense of how things are run internally within NIH, so I’m not saying I agree or disagree with his particular administrative directions, but I appreciated that he kept his eye on the ball by emphasizing ultimate goals. For her part, Oster questioned Bhattacharya on a number of issues. She too gave the sense that this is a serious topic, not just a political game.

How to do better is another question! Last month Oster wrote positively about some silly dietary guidelines recently released by the FDA, and if you read her op-ed carefully she doesn’t actually seem to agree with most of those guidelines (the best thing she could say about them was that they were “not crazy”), so I take it that in writing that piece she was making a sort of persuasion calculation that the best way to be effective is to mix the criticism with a gallon of sugar. That’s not my style. So, Oster uses a different approach than I do, and I’m sure we’d have our differences in how to interpret statistical evidence. But, again, I think she’s engaging with moral seriousness.

And it’s possible to be morally serious while still having fun. Consider Nate Silver. Nate’s an entertaining writer–I try to be too!–and I’ve had my disagreements with him regarding statistics and communication, but I think he’s coming from a place of intellectual and moral seriousness that shows respect for the challenges of political analytics and the stakes involved. Indeed, sometimes when he’s disagreed with me, it’s on the implicit grounds that he’s making progress in understanding the real world, doing some analytical engineering that is outpacing the statistical theory. I still think there’s a benefit to interrogating the edge cases where our methods break down . . . anyway, my point is that I’m not just using the term “moral seriousness” to refer to things that I agree with. I’m talking about an attitude that I see in Bhattacharya, Oster, and Silver that I don’t see in, say, Niall Ferguson or Andrew Huberman.

Now, to return to our main thread, these are the parts of last week’s interview that bothered me:

1. When asked about some news reports regarding the NIH and CDC, Bhattacharya dismissed them as “fake news.” This annoyed me for two reasons. First, he offered no evidence that the reports were untrue. Second, he was appointed by a man who spews out false statements at an amazing rate, including on the topic of public health. Who are we supposed to trust here? News reports or a political appointee? Also, Bhattacharya himself has a record of being sloppy with the facts, as I happen to know because it happened to me.

Now, don’t get me wrong, I’m not saying that Bhattacharya was lying or misinformed regarding recent NIH and CDC policies. It could well be that the news items were erroneous or misleading–and, if so, I can see how Bhattacharya would be legitimately annoyed. And he should feel free to express his annoyance! But just dismissing the reports as “fake news” . . . that’s not a serious response.

As I wrote above, I appreciate that Bhattacharya treats the nation’s public health spending with the seriousness it deserves. As a statistician, I think information needs to be treated with respect as well. Which means he should be addressing serious news reports and, for that matter, respecting the institution of journalism. Which he wasn’t doing here.

2. When the topic of vaccines came up, Bhattacharya came out strongly in favor of vaccination, and he expressed the view that it is better for vaccination to be voluntary rather than mandatory. This could be. I guess it depends on the context. For almost all my life, childhood vaccines were mandatory, just about everybody got vaccinated, and just about nobody complained about it. So mandatory vaccination can work just fine–we have decades of experience on this one. The bad news is that in the past few years, vaccination has become politicized and anti-vax attitudes have become embedded in right-wing politics. So it could be that Bhattacharya is right and the mandates will have to go, we’ll just have to accept more sick and dead kids and adults, just the price to pay for this aspect of political dysfunction. I don’t know, but it could be, so I’m not going to criticize Bhattacharya for his hot take on this issue.

What bothered me was . . . if you are going to go with a voluntary vaccination strategy, I think you’d want a strong strategy of encouraging people to choose vaccination for themselves and their kids. So I think his response would’ve been stronger if he’d also said something about how to vigorously promote vaccine usage. That’s part of public health policy too. Also, Bhattacharya doesn’t have a great track record on this issue: just a few years ago he was part of an anti-vax organization. See here for the ugly story. OK, fine, everybody makes mistakes and has lapses in judgment. But then at least he should address that, in the past, he’s been part of the problem. To just say that you want vaccines to be optional but without addressing that history, that’s not right.

3. The un-publishing of that CDC report. Bhattacharya said he stopped the CDC from publishing the report because it was using an approach called a test-negative design, which he thinks is a bad statistical method. When he said this, Oster jumped in and said that she too thought it was a bad method. It was only a brief exchange and there was no time for either of them to give a reference or to explain why they think the method is bad. In the meantime, it seems that the report has been leaked; see here. One of the authors of the report said, “I’m strongly opposed to this kind of censorship . . . It should be out in the world at large for the scientific community to judge it for what it is.”

I think the best next step would be for the CDC to release the report officially, along with a critical response from a statistician explaining how the method is flawed. Bhattacharya said it was common knowledge that the method was terrible; on the other hand, it seems that this “test-negative design” is a standard approach for studying the effect of vaccines in the population after they have been released; see also here. So at the very least it would be a valuable educational opportunity to see this article that was on the verge of publication, and to understand its purported problems. Publishing the report along with a companion article discussing its problems, that could make sense. Canceling the report without explaining why (and, no, just saying you don’t like this method isn’t enough of an explanation) . . . that’s not serious science. Scientific integrity is not being advanced by this sort of behavior.

I was also upset that Oster just jumped into the discussion to say that she, too, hates the test-negative design. Neither Bhattacharya nor Oster are statisticians. They’re health economists. It’s fine for a health economist to have an opinion on a statistical method, but, to be so sure about it, that doesn’t seem right to me. To the extent that Bhattacharya and Oster have legitimate concerns about the statistical method, they can work with a statistician to express these concerns openly and scientifically.

I’m not saying that statisticians or epidemiologists are always right or that other professionals should defer to them. Statisticians can be wrong, really wrong, and the errors can be compounded by a presumption that they know what they’re doing. So question these reports all you want. But then is the time to bring in an expert of your own, not to wing it.

Above I talked about moral seriousness regarding outcomes. There’s also moral seriousness regarding methods, and neither of the two people in that interview were displaying it. Also important is moral seriousness about communication, which has not been displayed by Bhattacharya, who has yet to come to grips with the fact that he was on the board of an anti-vax organization.

P.S. Dorothy Bishop provides a detailed discussion of this event.

If that CDC report had just included some fake citations and some crazy dietary advice, the boss would surely have approved it for publication.

From a news article, “C.D.C. Cancels Publication of Study Showing Benefits of Covid Vaccines”:

The acting head of the Centers for Disease Control and Prevention has canceled the publication of a study that found that the Covid vaccine sharply cut the odds of hospitalizations and emergency visits last winter, a Health Department spokesman said. . . .

The study, conducted by C.D.C. scientists, calculated the effectiveness of Covid shots by looking at the vaccination status of people who had sought care at hospitals and emergency rooms. It found that vaccination cut the likelihood of emergency visits due to Covid by 50 percent and of hospitalizations by 55 percent, according to a summary of the study viewed by The New York Times.

It was scheduled to be published on March 19 in The Morbidity and Mortality Weekly Report, the C.D.C.’s flagship journal. News of its cancellation was reported earlier by The Washington Post.

Some former C.D.C. officials said it was unusual for the head of the agency to cancel a scientific publication that had already been cleared by the agency’s staff scientists and had been scheduled for publication.

So what happened?

Andrew Nixon, a spokesman for the Department of Health and Human Services . . . said that assessment “identified concerns regarding the methodological approach to estimating vaccine effectiveness, and the manuscript was not accepted for publication.”

But:

“I’ve never seen a case where an article in the M.M.W.R. that got to that stage was not published,” said Dr. Michael Iademarco, who led the center that included the publication’s operations from 2014 to 2022.

And:

The approach employed in this research has been used for years by scientists at the C.D.C. and elsewhere to gauge the real-world performance of flu and Covid vaccines, said Dr. Fiona Havers, a vaccine expert who resigned from the agency in June.

No link to the report itself. Maybe the authors should anonymously email it to [email protected] and then it can appear in the next file dump.

It must be horrible to be working for CDC right now. They were literally shot at by an anti-vax terrorist, and now the in-house anti-vaxxers are suppressing their reports. Meanwhile the government is releasing health-related reports with fake citations and is releasing dietary guidelines which are so bad that even a supporter of these guidelines can do no better than describing them as “not crazy.”

So, that’s the way it’s going. The report with fake citations is released. The “not crazy” (actually, crazy) advice is promoted. The CDC report is suppressed. I guess it doesn’t meet the government’s high standards. Maybe if they’d thrown in some fake citations and some nutty health advice, it would’ve been approved for publication. That’s how you get “gold standard science,” right?

P.S. More here. I hope that future updates are coming.

“Making Your Research Free May Cost You”

Stephanie Lee writes:

Stephanie Rolin, a mental-health services researcher, found out last month that a journal had accepted her latest paper for publication. But there was an asterisk. Community Mental Health Journal was requiring her to fork over about $4,400 — a fee that she hadn’t budgeted for, and one she says she cannot afford. . . .

Most studies appear in paywalled journals, and critics have long contended that those paywalls enrich publishers while gatekeeping taxpayers from the research they fund. The NIH has been pushing for more openness in the ecosystem into which it pours nearly $48 billion annually, and its biggest move yet took effect on July 1. Under a policy that was approved by the Biden administration to take effect at the end of 2025, and moved up six months by the Trump administration, all agency-funded research must now be made freely and immediately available. The previous policy had allowed papers to stay paywalled for up to a year.

But since July 1, some publishers have only given researchers one way to comply with the NIH’s mandate: paying fees that were previously optional. In a year when federal funding has been exceptionally unreliable, scientists say they are stressed about spending thousands of grant dollars on unexpected and questionable open-access charges.

Things don’t have to be this way, open-science experts say: These fees are imposed entirely by publishers. The most prominent examples are Springer Nature and Elsevier, for-profit enterprises that generate billions in revenue. . . .

When Rolin submitted to Community Mental Health Journal earlier this year, she expected the process to go as it had when she’d published in its pages before. At the time, Springer Nature — which sets policies for the 3,000-plus journals under its umbrella — gave NIH-funded authors a “hybrid” of two choices. They could pay an open-access fee to make their study available right away. Or, for free, they could put their paper behind the journal’s paywall while preparing a second copy that was identical save for formatting changes and copy edits. Within 12 months of journal publication, this author’s version would become openly available on a federal database called PubMed Central, in line with a 2008-era NIH requirement. . . .

In late July, Community Mental Health Journal hit Rolin with a $4,390 bill for article-processing charges. Springer Nature’s website now explains that publishing behind a paywall is “not a viable option” for authors like her because it “conflicts with immediate public access policies, such as NIH’s policy.” . . .

Rolin said she’d been aware that the NIH policy was forthcoming, but was surprised by Springer Nature’s hard-line interpretation. Similarly, Elsevier’s terms and conditions for putting studies on PubMed Central list options that involve either author-paid fees or delayed embargoes that wouldn’t comply with the NIH’s mandate. A page describing how NIH-funded authors can “comply with NIH’s public access requirements” has been deleted. . . .

Not every publisher is responding in kind. The JAMA journals, published by the American Medical Association, say that immediately after publication, authors can post their accepted manuscript in a repository of their choice. . . .

But Springer Nature and Elsevier aren’t the only ones reacting to the NIH’s mandate this way. Melanie J. Scott, an associate professor of surgery at the University of Pittsburgh, had a paper accepted in August by the Journal of Leukocyte Biology, which is published by the Society for Leukocyte Biology and Oxford University Press. . . .

In the meantime, researchers will have to figure out how to foot the bill. . . .

Now I’m wondering exactly what the government policy is. I’d think it would be fine to post the paper on a preprint server such as arXiv, then it doesn’t matter what’s happening with the journals, right?

The funny thing is, this happened to me just the other day, with this article, I think it was, which is indeed published in a Springer journal. Fortunately for me, this research was not NIH-funded so I did not need to pay, nor did I need to withdraw my submission from the journal. I can’t remember how much they wanted to charge me because I was never going to pay. Maybe $2K? And Theory and Society is not a major journal! I like Theory and Society (I’ve published two papers there in the past year); I’m just saying that it’s wack to ask someone to pay $2K to publish there.

P.S. It’s good to see a government policy that was pushed by both the Biden and Trump administrations so we can talk about it without getting into a political tangle.

My talk at Stanford later this month: “What to do when your estimate is 1 standard error away from 0?”

Tuesday 28 Apr 2026, 4pm in CoDa E160:

What to do when your estimate is 1 standard error away from 0?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We provide a new answer to this simple yet very important question. Thinking clearly about this problem leads us to bring in many ideas in statistical analysis and computing, including causal identification, meta-analysis, Mister P, expectation propagation, decision analysis, experimental design, and the fundamental unity of Bayesian and frequentist statistics. We demonstrate our approach in examples from many applications, including medicine, social science, business, sports, and public policy.

This work is joint with Witold Więcek and Erik van Zwet.

In addition to all the above, I’ll probably drift into some related general topics such as the role of experimentation in science and engineering and the limitations of thinking about policy analysis in terms of causal inference.