The Anthropic Principle in Statistics and Science (my talk this Mon 29 June, 4:20pm London time)

The Anthropic Principle in Statistics and Science

The anthropic principle in physics states that our existence implies certain constraints on the natural conditions under which we evolved. In statistics, a corresponding anthropic principle can be used to infer properties of the models we should fit to data. For example, experiments are typically aimed to have a precision sufficient to estimate effects of interest but without overkill; it is rare to have an estimate that is 10 standard errors from zero. We demonstrate through several examples in social and medical sciences how the anthropic principle, combined with Bayesian inference, can be used to improve statistical practice.

Here are a couple of applications of the idea:

• [2000] Should we take measurements at an intermediate design point?

• [2022] A proposal for informative default priors scaled by the standard error of estimates (with Erik van Zwet)

In my talk I’ll discuss these and other examples. I think this anthropic principle is really important, arguably more important in statistics than in physics, which is the field where it originated.

Here’s the zoom information for the talk on Mon 29 June, 4:20pm London time:

https://imperial-ac-uk.zoom.us/j/97341955036?pwd=1kKNbPAwJthKtG55ynXMVF3TLSvIbl.1
Meeting ID: 973 4195 5036
Passcode: J3Ue$f

I’ll be speaking (remotely) at this conference celebrating the 60th birthday of physicist Andrew Jaffe. This seems to be the season for 60th birthday conferences.

I know AJ from when he was visiting the Flatiron Institute last year. We worked together on The Squealer: Sensification of model exploration and model misfit. There’s no connection between the Squealer and the anthropic principle; I decided to speak on the latter topic because I thought it would be of general interest to an audience of physicists.

Bayesian Workflow exists as a physical book!

We’re very excited about this book. It’s the result of several years of effort. You can order from the publisher or from Amazon.

Here’s the book’s webpage, which includes the data and code for the book’s examples and case studies, of which there are many.

Here’s the table of contents:

Part 1: From Bayesian inference to Bayesian workflow
1. Bayesian theory and Bayesian practice
2. Statistical modeling and workflow
3. Computational tools
4. Introduction to workflow: Modeling performance on a multiple choice exam

Part 2: Statistical workflow
5. Building statistical models
6. Using simulations to capture uncertainty
7. Prediction, generalization, and causal inference
8. Visualizing and checking fitted models
9. Comparing and improving models
10. Statistical inference and scientific inference

Part 3: Computational workflow
11. Fitting statistical models
12. Diagnosing and fixing problems with fitting
13. Approximate algorithms and approximate models
14. Simulation-based calibration checking
15. Statistical modeling as software development

Part 4: Case studies
16. Coding a series of models: Simulated data of movie ratings
17. Prior specification for regression models: Reanalysis of a sleep study
18. Predictive model checking and comparison: Clinical trial
19. Building up to a hierarchical model: Coronavirus testing
20. Using a fitted model for decision analysis: Classification competition
21. Posterior predictive checking: Stochastic learning in dogs
22. Incremental development and testing: Black cat adoptions
23. Debugging a model: World Cup football
24. Leave-one-out cross validation model checking and comparison: Roaches
25. Model building and expansion: Golf putting
26. Model building with latent variables: Markov models for animal movement
27. Model building: Time-series decomposition for birthdays
28. Models for regression coefficients and variable selection: Student grades
29. Sampling problems with latent variables: No vehicles in the park
30. Challenge of multimodality: Differential equation for planetary motion
31. Simulation-based calibration checking in model development workflow

Appendices
A. Statistical and computational workflow for Bayesians and non-Bayesians
B. How to get the most out of Bayesian Data Analysis

One way to think of the book is that it’s all the things missing from BDA, like how to set up an informative prior, what to do when your computations aren’t converging, how to work through a series of models fit to the same data, how to design and perform simulated-data experiments . . . and all sorts of other things too.

The core of the book–parts 1 through 3–clocks in under 200 pages, and then we have another 300 pages full of case studies demonstrating different aspects of Bayesian statistical and computational workflow. The appendices should be useful to you too, first because the workflow ideas in this book apply to non-Bayesian inference too, and second because BDA still has lots of valuable material in it, so it’s good to know where to look.

This new Bayesian Workflow book could change your life (we hope), and I thank my coauthors, Aki Vehtari and Richard McElreath, with Daniel Simpson, Charles C. Margossian, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, Martin Modrák, Vianey Leos Barajas, for all their care and effort. We thank our employers and various funding agencies for giving us the resources to be able to write this book as a side project along with all our daily responsibilities. And we thank many people for their input on earlier versions of the book, along with the Stan developers making so much of this work possible and the Stan community of users for supplying a continuing series of challenges that have motivated many of the ideas and methods discussed in the book.

I posted this already on the blog and you can see answers to some questions in the comments there. I’m posting it again here because, hey, we don’t come out with a new book every day!

I hope you find the book readable, interesting, and useful.

Out of the frying pan and into the fire: Scientific American returned to form, and then this happened:

Last month I wrote the following post. I scheduled it for November, but then some Scientific American-related news arose, so I’m bumping it up in the schedule.

First, here’s my post from May:

I’m not saying this is the same Scientific American as of old. Martin Gardner is long gone, and in the age of social media the articles are shorter. That’s the way of the world. But it’s got serious, interesting articles, a mix of pure science, applied science, policy, and service journalism. The latest in science without the boosterism of so much of science and technology reporting.

Last year, though, the magazine was much more political:

A bit of policy is fine, and there’s a lot of science to global warming, for sure. I wouldn’t want Scientific American to “bothsides” the issue. I’m not saying they need entirely to stick to sports, as it were. But the politicking was getting out of control. I’m glad they’ve returned to their lane.

Then the other day this happened:

Scientific American has been acquired by LabX Media Group, which holds Discover Magazine, IFLScience, and a number of other science publications. . . . And they have started out by firing writers and editors. . . .

I know nothing about LabX Media Group or the new Scientific American management, so I have no sense of whether this is a mere budget-cutting realignment or a full-on Sports Illustrated-style bust-out operation. Martin Gardner is a culture hero and deservedly so, but that was a long time ago, and those days aren’t coming back. Indeed, blogs like these, many of which are Gardner-inspired in one way or another, have taken his place.

It’s funny how magazines, even online, keep disappearing. The model of paying a magazine $50 a year for a subscription and getting a range of interesting material seems more reasonable than paying $50 each for subscriptions for a bunch of individual bloggers, but, with the exception of the New Yorker, the New York Times, and a few others, we don’t really see much of that.

One way to see this is that I’m not myself a Scientific American reader. I follow all these blogs, many of which are science themed, and each of which, in its own way, goes into more depth than I’d get from a Scientific American article. So there’s this weird thing where I’m concerned about something that I’m not reading anyway. Which is different from Sports Illustrated. Back when Sports Illustrated was a real thing, I’d buy it from time to time. I read it for the articles, as the saying goes.

That said, institutions continue in their own way. I was happy recently to see that Scientific American had pulled itself out of its politicized rut, so it’s a disappointment if it’s now getting taken apart.

“Springer Nature has removed two studies by Max Planck.”

Jim Moody points to this news article, “Why have papers by one of history’s most famous physicists been retracted? Springer Nature has removed two studies by Max Planck. A bot may be to blame.”

If you’re gonna retract something from Max Planck, I’d suggest starting here, with the notorious Manifesto of the Ninety-Three German Intellectuals defending Kaiser Wilhelm’s invasion of Belgium. Here are a couple of retractable passages:

It is not true that the life and property of a single Belgian citizen was injured by our soldiers without the bitterest self-defense having made it necessary.

It is not true that our troops treated Louvain brutally. Furious inhabitants having treacherously fallen upon them in their quarters, our troops with aching hearts were obliged to fire a part of the town as a punishment.

I guess they were the world’s most moral army. “Aching hearts” . . . that must have absolutely sucked. Really mean of those Belgians for defending themselves.

Just to be clear, I’m not saying that Planck should be “canceled.”

Who among us hasn’t retroactively disgraced ourselves with a lachrymose defense of military aggression?

I’m just saying, if you have to retract a paper by Max Planck, I’d retract that one.

P.S. The funny thing is that the above-linked article describes the famous physicist as “almost as widely revered for his character as his physics. In 1933, for example, he bravely confronted Adolf Hitler over Nazi Germany’s discriminatory laws against Jews.” I’ve never read anything about Planck’s life so I don’t know what changed with him between 1914 and 1933. Maybe the loss of the war in 1918 soured him on armed adventures.

Supplement that alphabetized display with another graph showing the states in a more informative order.

I just wrote a long post inspired by a recent post from economist Paul Krugman. Krugman’s post was good, but I’m annoyed that his graph (reproduced above) lists the states alphabetically. Don’t do that! It’s called the Alabama first error.

I would’ve put this as a P.S. on my earlier post but I was afraid that would distract people from my larger point, so I’m just raising the graphical issue here.

If the goal is to have a look-up table, then, sure, alphabetical is fine. But I don’t think that’s the point of that graph. Indeed, if you wanted a look-up table, I’d still prefer a non-alphabetical graph and then you could click to get the numbers in a spreadsheet.

How best to order the states in that graph, then? You could try different things. My first idea is to list in order of average per-capita income by state. (These rankings don’t change much over time; for clarity we could just order by average per-capita income in 2020.)

P.S. All the commenters so far are disagreeing with me, so let me reassess.

I doubt that most readers are looking at this graph to look up individual states. I think the goal is to present the general trend and variation across U.S. states. For this purpose, alphabetical order makes it hard to see systematic patterns that might be clearer using any reasonable ordering.

That said, alphabetical order has the benefit of familiarity, and given that all of you think this is important, I’m willing to believe that my take is a minority view, and maybe the designer of the graph is better off going with the majority.

So I’ll alter my recommendation. Instead of saying, “Don’t alphabetize,” I’ll say, “Supplement with another graph showing the states in a more informative order.”

Structural equation modeling (SEM) and positive definiteness

This post is from Bob.

Mitzi and I were swotting up on structural equation models (SEM) for our class this past Monday at the Modern Modeling and Methods (M3) conference at Fordham University. It was a lot of fun and now I think I understand SEM notation. I really like these applied conferences and this was a group of psychometricians, econometricians, and sociometricians. Many if not most of them thought about models in terms of SEM, so we thought we should figure it out. But I was left with a concern you may be able to help me sort out.

The example

The first worked example in Ken Bollen’s seminal 1989 textbook on SEM is a study of how industrialization relates to democracy. It comes from his paper,

  • Bollen, Kenneth A. (1979). “Political Democracy and the Timing of Development.” American Sociological Review, 44(4).

and was reprised in his book

  • Bollen, Kenneth A. (1989). Structural Equations with Latent Variables. Wiley.

I had the pleasure of sitting across from Ken at the invited speakers dinner at the conference, so I’m glad I looked into SEM before that. Good news for the SEM devotees—he released a completely revised guide to SEM a few months ago.

The data and parameters

The data consists of eleven covariates (called “indicators” in SEM) for each of 75 countries. Four of the covariates are related to democracy in 1960 (y1, y2, y3, y4), the same four measurements were taken again in 1965 (y5, y6, y7, y8), and there were three measurements of industrialization in 1960 (x1, x2, x3).

The SEM model the original researcher came up with here assumes three latent scalars per country, industrialization in 1960 (IND60), level of democracy in 1960 (DEM60), and level of democracy in 1965 (DEM65). These latent parameters are related in the following way: democracy in 1960 is a regression on industrialization in 1960, and democracy in 1965 is a regression on both democracy in 1960 and industrialization in 1960.

The covariates are then modeled like a seemingly unrelated regression in econometrics. The four democracy 1965 parameters are treated as regressions on the latent level of democracy in 1965, and similarly for the democracy in 1960, and industrialization in 1960.

Rather than independent errors, a SEM model explicitly indicates with arrows which pairs of observations are allowed to have non-zero correlation in the covariance matrix for the observations. The three industrialization observations are assumed to have zero correlation—there are no arrows between any of the three measurements in the SEM diagram. Each of the four measurements in 1960 is assumed to covary with the same measurement taken in 1965. In addition, the second and fourth measurement in each year are assumed to be correlated with each other, which leads to a box-like structure.

The SEM diagram

Here are the arrows in the diagram, where I’m not using their standard LISREL notation, but writing them in R expression syntax to indicate what is regressed on what. In their graphical notation, just replace ~ with <-. All three latent variables and all eleven measurements are indexed by country.

IND60
DEM60 ~ IND60
DEM65 ~ DEM60, IND60

x1, x2, x3 ~ IND60
y1, y2, y3, y4 ~ DEM60
y5, y6, y7, y8 ~ DEM65
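To make the arrows concrete, here is a minimal simulation sketch of the regressions listed above. It is an illustration only: the coefficient values, the single shared loading, the unit noise scale, and the function name simulate_country are all hypothetical, and the measurement errors are drawn independently, so the correlated-error structure described next is not included.

```python
import random


def simulate_country(rng, b_dem60=1.5, b_dem65_dem60=0.8, b_dem65_ind60=0.6,
                     loading=1.0, noise_sd=1.0):
    """Draw one country's three latent scalars and eleven measurements.

    Every numeric default is a made-up placeholder, not an estimate
    from Bollen's data. Measurement errors are independent here.
    """
    # Structural part: IND60 is exogenous, DEM60 ~ IND60, DEM65 ~ DEM60, IND60.
    ind60 = rng.gauss(0.0, 1.0)
    dem60 = b_dem60 * ind60 + rng.gauss(0.0, noise_sd)
    dem65 = (b_dem65_dem60 * dem60 + b_dem65_ind60 * ind60
             + rng.gauss(0.0, noise_sd))

    # Measurement part: each indicator is a regression on its latent variable.
    x = [loading * ind60 + rng.gauss(0.0, noise_sd) for _ in range(3)]
    y_1960 = [loading * dem60 + rng.gauss(0.0, noise_sd) for _ in range(4)]
    y_1965 = [loading * dem65 + rng.gauss(0.0, noise_sd) for _ in range(4)]

    return {"IND60": ind60, "DEM60": dem60, "DEM65": dem65,
            "x": x, "y": y_1960 + y_1965}


rng = random.Random(1960)
countries = [simulate_country(rng) for _ in range(75)]
```

A fitted SEM would give each indicator its own loading, intercept, and error scale; the shared values here just keep the sketch short.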

The covariance structure is indicated by stating which pairs of measurements are modeled with non-zero correlation. The first four just pair the measurements of the same thing across 1960 and 1965.

y1 <-> y5
y2 <-> y6
y3 <-> y7
y4 <-> y8

The last pair of correlations are within 1960 and within 1965.

y2 <-> y4
y6 <-> y8

Together, these induce an odd box structure, where y2 is correlated with y6 and y4, both of which are correlated with y8, but y2 and y8 are assumed to have zero correlation.

y2 <-> y6
^      ^
|      |
v      v
y4 <-> y8
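The box can be checked mechanically. The sketch below encodes the six arrows as a set of unordered pairs and builds a 0/1 pattern showing which entries of the 11 x 11 correlation matrix are free. The names free_pairs, is_free, and structural_zeros_among are hypothetical helpers for illustration, not part of any SEM package.

```python
measurements = ["x1", "x2", "x3"] + [f"y{i}" for i in range(1, 9)]

# Pairs joined by a <-> arrow in the diagram.
free_pairs = {
    frozenset(pair) for pair in [
        ("y1", "y5"), ("y2", "y6"), ("y3", "y7"), ("y4", "y8"),  # across years
        ("y2", "y4"), ("y6", "y8"),                              # within a year
    ]
}


def is_free(a, b):
    """True on the diagonal and for pairs connected by an arrow."""
    return a == b or frozenset((a, b)) in free_pairs


# 1 marks a free entry, 0 marks a structural zero.
pattern = [[1 if is_free(a, b) else 0 for b in measurements]
           for a in measurements]


def structural_zeros_among(names):
    """List the pairs within `names` that are forced to zero correlation."""
    return sorted(
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if not is_free(a, b)
    )


print(structural_zeros_among(["y2", "y4", "y6", "y8"]))
```

Within the box, the two diagonals (y2 with y8, and y4 with y6) are the only pairs without an arrow.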

Stan implementation

We didn’t get this far in my half of the class, so I will share here the Stan Playground example where I fit Bollen’s example (you can get the data and the Stan model through the Playground link).

It gets the right answer compared to lavaan/blavaan, which is nice. In the Stan code, xi is IND60 and eta1, eta2 are DEM60, DEM65. The relations among the latent parameters are modeled directly as regressions. The correlations among the observations are modeled using soft zeroing, where I just put a tight prior around zero on the structural zero elements, because Stan doesn’t give you a good way of setting up structural zeroes in a covariance matrix (Sean Pinkney or Ben Goodrich might know how to do this?).

This makes me curious how the lavaan package in R manages this. There’s a Bayesian version of lavaan built on top of Stan, blavaan. The first example right at the top of the home pages for both lavaan and blavaan is Bollen’s democracy model. I guess it’s like the Scottish lip cancer data set for spatial modeling or Fisher’s iris data for regressions.

My questions

Consider a simple diagram among measurements like the following.

x <-> y
y <-> z

This says there can be non-zero correlation between x/y and also between y/z, but the correlation between x/z is zero. It’s a simplified case of the box we saw in the actual example. These arrows imply the correlation matrix looks as follows.

|        1  rho[x,y]         0 |
| rho[x,y]         1  rho[y,z] | = Omega
|        0  rho[y,z]         1 |

Given that the correlation matrix Omega must be positive definite, this limits the range of rho[x,y] and rho[y,z]. For example, we can’t have rho[x,y] = rho[y,z] = 0.9, or rho[x,z] would have to be greater than zero to maintain positive definiteness.
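The constraint can be verified directly. For a 3 x 3 correlation matrix, Sylvester's criterion says Omega is positive definite exactly when its leading principal minors are positive. With rho[x,z] = 0 this reduces to rho[x,y]^2 + rho[y,z]^2 < 1. The function name below is a hypothetical helper, and 0.6 and 0.7 are arbitrary illustrative values.

```python
def is_positive_definite(rho_xy, rho_yz, rho_xz=0.0):
    """Sylvester's criterion for the 3x3 correlation matrix Omega.

    Variables are ordered x, y, z. The 1x1 leading minor is 1, so only
    the 2x2 minor and the full determinant need checking.
    """
    minor_2 = 1.0 - rho_xy ** 2
    det = (1.0 + 2.0 * rho_xy * rho_yz * rho_xz
           - rho_xy ** 2 - rho_yz ** 2 - rho_xz ** 2)
    return minor_2 > 0.0 and det > 0.0


print(is_positive_definite(0.9, 0.9))               # False: 1 - 0.81 - 0.81 < 0
print(is_positive_definite(0.6, 0.6))               # True:  1 - 0.36 - 0.36 > 0
print(is_positive_definite(0.9, 0.9, rho_xz=0.7))   # True once rho[x,z] is freed
```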

Q1: Why doesn’t SEM instead say that the correlation rho[x,z] is just the minimum value it can be given rho[x,y] and rho[y,z]? I’m suggesting that we instead treat the above diagram as implying no additional correlation between x and z other than that implied by the correlation between x and y and the correlation between y and z. That is, why try to shrink rho[x,z] all the way to zero? From the text, it feels like the motivation is to enforce zero correlation in the model. But all this is doing is simplifying regressions; it won’t actually enforce zero correlation among the measurements that are modeled with zero correlation. I wish I’d asked Ken this question at dinner, but I’ll ping him about this blog post and hopefully get a response.
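To put numbers on Q1: solving the determinant condition for rho[x,z] gives an open interval centered at rho[x,y] * rho[y,z] with half-width sqrt((1 - rho[x,y]^2)(1 - rho[y,z]^2)). The center is the value at which the partial correlation of x and z given y is zero, which is one possible reading of "no additional correlation." Whether the question intends that value or the end of the interval nearest zero is left open here. The function name is a hypothetical helper.

```python
import math


def rho_xz_interval(rho_xy, rho_yz):
    """Open interval of rho[x,z] values that keep Omega positive definite.

    Assumes |rho_xy| < 1 and |rho_yz| < 1.
    """
    center = rho_xy * rho_yz
    half_width = math.sqrt((1.0 - rho_xy ** 2) * (1.0 - rho_yz ** 2))
    return center - half_width, center + half_width


low, high = rho_xz_interval(0.9, 0.9)
print(round(low, 2), round(high, 2))  # 0.62 1.0
```

For rho[x,y] = rho[y,z] = 0.9, zero falls well outside the interval, which is why the structural zero is infeasible in that case.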

Of course, in the pragmatic Bayesian workflow, we’d use posterior predictive checks to evaluate whether there’s unmodeled correlation between x and z.

Q2: I’m also curious what Andrew and others think about enforcing structural zeroes in correlation between measurements as opposed to just estimating a dense covariance matrix and inspecting where the correlations fall.

Getting justice can require a lot of effort, and usually at some point we’ll just give up, which is what the cheaters rely on.

I just read this compelling op-ed by Brendan Ballou, “One Man Stole $660 Million. He’ll Never Pay It Back,” which tells the story of several brazen white-collar criminals who avoided prosecution for federal crimes by the simple expedient of bribing the president of the United States. Ballou argues, though, that there could still be ways of catching these guys:

In a world where the Department of Justice and the president are either indifferent to or actively support rich criminals, what can be done? Fortunately, there is a range of legal tools that ordinary citizens can use to pursue civilly the sort of corruption that would ordinarily be prosecuted criminally.

The shareholders potentially cheated by Mr. Wiederhorn could sue the Trump inaugural committee under the federal civil RICO law — written to destroy the Mafia — for seemingly helping to secure Mr. Wiederhorn’s freedom. Companies that follow the law can sue rivals, like Binance, that do not, under California’s Unfair Competition Law. And investors scammed by Mr. Milton can sue the political committees he donated to if they were “unjustly enriched” by his scheme. . . .

When regular citizens can’t act themselves, they can pressure their local prosecutors to do so. Recall Mr. Homan’s $50,000 in cash from undercover F.B.I. agents. This Justice Department may not continue the investigation. But Mr. Homan’s personal business is headquartered in Virginia, and it would be awfully interesting to find out whether Mr. Homan reported that money on his state tax returns. If he didn’t, he may well have committed a crime. . . .

He concludes:

Criminals and government officials are barely hiding their schemes, and their brazenness is meant to make us feel helpless, to think that nothing can be done. That is false. We already have the legal tools to fight corruption. We just need to use them.

This is inspirational and I hope someone does all of this.

My point in the present post is that getting justice can require a lot of effort.

Here’s an example. The other day I was talking with someone about research fraud, and he characterized the Michael Lacour story as the biggest scandal ever in political science. I disagreed. It was my impression that Lacour had been forgotten (here’s some background), but what about the time that the American Political Science Association gave an award to a plagiarized book? Here’s the story. I’d never heard of any of the people involved in that episode, but it incensed me that APSA had done this.

I wasn’t the only angry person. Indeed, I’d heard about the Frank Fischer case from Alan Sokal, who’d emailed an academic official at Rutgers University, where the plagiarist worked, but there was no useful response. So I decided to take a whack at it. I sent off this email to the people on the committee that had given that award:

Dear APSA Public Policy Section:

I learned recently that you gave your 2017 Aaron Wildavsky Enduring Contribution Award to Frank Fischer for his 2003 book Reframing Public Policy. I was surprised to hear this, given that the book appears to have plagiarized material. For background, see this document by Krešimir Petković and Alan Sokal:
https://chronicle-assets.s3.amazonaws.com/5/items/biz/pdf/plagiarism_fischer.pdf
and this note by Petković:
https://chronicle-assets.s3.amazonaws.com/5/items/biz/pdf/Petkovic_Experiment_with_CPS.pdf
and this news article for further background:
https://www.chronicle.com/article/alan-sokal-takes-aim-at-an/124969

Petković, a political science graduate student in Croatia, found places in Fischer’s 2003 book where he had used materials from previously published work by others without giving full attribution. In addition to copying without attribution (as Petković writes, Fischer mentions the book he copied from, but nowhere near the copied passage), Fischer also makes mistakes such as misspelling authors’ names and reproduces errors that arose in the original sources.

Two of the works from which Fischer copied in his 2003 book without appropriate attribution are:

Majone, Giandomenico, 1989. Evidence, Argument, and Persuasion in the Policy Process. New Haven: Yale University Press.

Walsh, David, 1972. Sociology and the Social World. In: Filmer, Paul, Phillipson, Michael, Silverman, David and Walsh, David, New Directions in Sociological Theory. London, Collier-Macmillan: 15-35. [Also published by MIT Press, Cambridge, Mass., 1973.]

I am not an expert in this area and have no intention of pursuing any formal process here. Indeed, I am not even a member of APSA. However, I am a political scientist and, as such, am distressed to see APSA promoting plagiarism.

My recommendation is that you retract the award. If that is too difficult, one thing you could do is retroactively also give this award to Majone (1989) and Walsh (1972). It does not seem fair that they did the work and someone else gets the award, no? I do not know Prof. Fischer and am making no judgment regarding the quality of his writing. It may be that it is indeed an enduring contribution to the field; if so, all authors of this enduring contribution should be recognized.

Yours,

Andrew Gelman
Professor, Department of Statistics
Professor, Department of Political Science
Columbia University, New York

P.S. I have also cc-ed the members of APSA’s Committee on Professional Ethics, Rights, and Freedoms.

From APSA’s guide to professional ethics:

“7. Political scientists, like all scholars, are expected to practice intellectual honesty and to uphold the scholarly standards of their discipline.

7.1 Plagiarism, the deliberate appropriation of the work of others represented as one’s own, not only may constitute a violation of the civil law but represents a serious breach of professional ethics.

7.2 Departments of political science should make it clear to both faculty and students that such misconduct will lead to disciplinary action and, in the case of serious offenses, may result in dismissal.”

A few months later I followed up:

Hi all. I was just wondering what happened with this. As I wrote last year to **, I am not submitting a formal grievance or complaint. I just wanted to let the committee be aware of this situation so that they can have the opportunity to fix it.
So I was interested to find out how things have progressed, as it seems to be an embarrassment to APSA to have given a major award for a book with plagiarized material!
Andy

After several months I hadn’t heard back from the committee so I pinged them in June. A couple weeks later they got back to me and said they couldn’t do anything because it had not been submitted as a formal complaint.

Fair enough. I didn’t think it would be right for me to file the complaint myself, given that I’m not at all knowledgeable about this area of political science.

Meanwhile, the books that had been plagiarized, Majone (1989) and Walsh (1972), never got that award. Doesn’t seem fair to me!

Anyway, my point is that it takes work to pursue these things, and it’s more my inclination to point out the problem than to go through the political and administrative steps needed to rectify the problem.

I’m not dissing “the political and administrative steps”–I have a lot of respect for people who can do these things!–it’s just not something that I’m good at.

Here’s another example. I once had a colleague who plagiarized my work. When I realized what was going on, I was stunned. But then, looking back, I realized that I’d been warned of this behavior years earlier, indeed my memory flashed back to a time that I’d seen something else he’d plagiarized from me, and I’d just kind of filed that image in my mind and forgotten it. My collaborator and I had a good thing going, and, hey, nobody’s perfect, so it was easier to look away. When I confronted him about the plagiarism–this was a long time ago–he kind of wriggled around, saying that he didn’t want to share credit with me on the project I’d been working on with him–at one point I was dictating formulas to him over the phone–but we could jointly write a separate article on the topic. This just pissed me off, but, ultimately, he won, in the sense that he correctly calculated that I was rational enough not to want to get involved in a major scandal early in my career. Yes, he’s the one who would’ve looked bad had I raised a formal complaint, but it wouldn’t have done my reputation any favors to be seen as a complainer. Also, though, I won, in that I stopped my involvement in this project and I moved on to better collaborators.

The episode bothered me (which is why I keep talking about it), but my cost-benefit analysis led to the decision to not file a formal complaint. That’s the decision-theory analysis. The game-theory analysis is that my colleague could see ahead to the next move: he knew I was rational and that it would be a net loss to me to make a fuss about his actions, and I expect that this minimax analysis led him to the conclusion that he’d be safe in plagiarizing me. Yes, he was taking a risk to his reputation in doing so, but it was a calculated risk, in his mind less than the expected benefit to his reputation of taking full credit for this part of our joint research.

What should be done?

I’m not sure. In academic scandals, maybe it’s best just to move on. So what if some obscure political scientist got some award that he didn’t deserve? So what if some researcher publishes substandard work because he decides to not credit a collaborator? Worse things happen every day in academia. Indeed, if you want to talk about the worst scandal in modern political science, I might give the nod to Samuel Huntington’s book, The Clash of Civilizations and the Remaking of World Order, not because of plagiarism or anything like that, but just because arguably it’s had a large and malign influence in the world. Given all the problems in social science, plagiarism is the least of our concerns. So, although it annoys me, ultimately I think the appropriate strategy is to just let it happen, to talk about it but not to worry about seeking justice.

When it comes to business and government corruption, though, I agree with Ballou that something should be done. Legislatures should be writing laws, local and state governments should be prosecuting, lawyers should be suing, etc. These guys are stealing, giving and taking bribes . . . this is the kind of thing that degrades the entire economic and political system.

So, again, I hope some people make some of the moves that Ballou recommends. They should just be aware that it will take a lot of effort and persistence.

Treating AI review like the contentious policy design problem it is

This is Jessica. Many researchers are thinking about what we should do about scientific peer review now that AI makes producing papers so much easier. Submission numbers keep getting higher: in the past week, I saw reports that the most recent ACL submission cycle got 17k+ submissions, up from ~10k last cycle. TMLR went from getting 500 submissions every 60 days or so to getting the same number every 19 days. There are simply not enough human reviewers to handle the surge, at least not without a dip in quality. The noisier the review system gets, the greater the incentive to submit sloppy papers, because you might get lucky. This is the so-called “review death spiral.”

It is a hard problem. Quotas on submissions per author are one avenue forward, which TMLR just announced it would adopt. Not surprisingly, many reviewers are also turning to AI to help. The question becomes how to design AI review protocols to help reduce some of the noise, through preliminary filtering or flagging or helping guide human attention to parts of a paper that are most likely to be problematic. 

But what sorts of checks should an AI review assistant run on a paper? It’s useful to separate basic integrity violations AI could flag, like is there evidence of plagiarism, fake citations, missing code/data to reproduce main results (which are comparatively less controversial) from “epistemic filters,” like does the paper pass replicability checks, robustness checks, preregistration checks, statistical significance checks, etc. There’s a temptation to blur these things in proposing how to apply AI to review. It’s easy to assume that the metascientists have already established that practices like replicability or preregistration are truth-indicating and we can just implement them at scale (and indeed, ML researchers are citing open science and other reform arguments to back their proposals).

But if there’s one lesson to be learned from the aftermath of the replication crisis, it’s that there is no small, stable, non-conflicting set of detectable signals of good science that will find the good stuff and reject the bad. There are heuristics that can be useful prompts for deliberation – get in the habit of preregistering, make sure you can replicate your results, test the sensitivity of your results to choices you made along the way – but things get weird when we start treating them like universal requirements. Authors shift attention away from unrewarded signals, like better theory or exploratory work, and become preoccupied with rigor signaling through their methods. The result is not necessarily more thoughtfulness. 

And so even if the AI review tools we create are simply intended to inform human reviewers about what checks a paper passed, what we implement will have important policy implications by incentivizing more work like that in the future. I don’t think we are in a good position to predict what happens if suddenly we require multiverse robustness or statistical significance in a field like machine learning, which has in many ways been all about iterative improvement and “frictionless reproducibility” rather than individual results passing all the robustness checks.

The answer is not to avoid using AI in review until we can find a non-gameable set of credibility qualities to have AI focus on, as some have recently argued (though I agree with the linked paper that we need more rigor in how we go about motivating review tools). Non-gameability sounds nice, but any automated review policy that allocates attention will be gameable, because ensuring good science is not so simple as finding the right checklist. The relevant question is instead what assumptions and downstream incentives we are willing to tolerate. To this end, at the very least we should get in the habit of spelling out the assumptions we’re making, so that the trade-offs of focusing on particular proxies become explicit.

I wrote up this view recently in a paper called “Stop Treating Metascientific Heuristics as Quality Filters in AI Review.” Here’s the abstract: 

AI-implemented checks for reproducibility, robustness, preregistration, claim scope, and other intended proxies for scientific credibility can extend human reviewers’ capabilities. However, treating metascientific heuristics–whose theoretical grounding remains contested or incomplete–as necessary and sufficient signals for filtering out bad science is counterproductive to scientific progress. The emerging literature blurs the line between integrity filtering, based on necessary but insufficient signals of validity like reproducibility of stated results or lack of fake citations, and epistemic filtering, which uses machine-detectable signals to judge scientific quality. Drawing on critical metascience, we show that commonly proposed signals of research quality are insufficiently justified as general indicators of scientific value. The answer is not necessarily to ban AI in review, given the deluge of submissions venues are facing. Instead, in recognition of how any use of automated signals–even when deployed with human oversight–will shape attention and create incentives upstream, developers of AI review tools should explicitly specify their assumptions about how proxy signals inform on scientific quality in the context of specific review decisions. This approach treats AI review contributions as contestable decision policies that will shape future research, acknowledging the value-laden nature of scientific judgment and surfacing relevant tradeoffs. 

Rather than arguing for or against any particular proxies, I’m more interested in the methodological and philosophical mindset we should bring to the new questions raised by AI review. To demonstrate what I mean by more explicit motivation, I analyze an example review decision problem and set of detectable signals in the appendix, drawing on an analysis of how statistical significance and exact replication success relate to signal-to-noise ratios measured under error from a recent paper by Erik van Zwet, Andrew, and Witold Więcek. The takeaway is that the value of a proxy will depend on how you define the latent state you care about (e.g., whether the direction of an effect was correctly estimated, how big the true signal-to-noise ratio is), what you assume about the generating process (i.e., how the proxy noisily reflects the latent state), and what you assume about the decision-maker’s choice of actions and utility function. By suggesting this approach, I am *not* suggesting that one can validate a new review tool’s utility before it’s been deployed. The point is that there will be trade-offs no matter what, and the best we can do is be concrete about the kinds of assumptions that have to hold for proxies to be useful in review, so the community can debate what risks they are willing to accept.
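To make the dependence on assumptions concrete, here is a toy simulation. It is not the analysis from the appendix or from the van Zwet et al. paper; it is a minimal sketch under assumptions I am making up for illustration: true signal-to-noise ratios are drawn from a normal distribution centered at zero with standard deviation `snr_sd`, the observed z-score is the true value plus standard normal noise, the proxy is "|z| > 1.96," and the latent state of interest is whether the sign was estimated correctly. The function name `proxy_value` is hypothetical.

```python
import random

def proxy_value(snr_sd, n_sims=100_000, threshold=1.96, seed=0):
    """Simulate how well 'statistically significant' tracks 'sign estimated correctly'.

    Illustrative assumptions (not taken from any paper): true signal-to-noise
    ratios are Normal(0, snr_sd), and the observed z-score equals the true
    signal-to-noise ratio plus standard normal noise.
    """
    rng = random.Random(seed)
    # significant? -> [number of studies, number with correctly estimated sign]
    counts = {True: [0, 0], False: [0, 0]}
    for _ in range(n_sims):
        snr = rng.gauss(0.0, snr_sd)        # latent state
        z = snr + rng.gauss(0.0, 1.0)       # noisy measurement of the latent state
        significant = abs(z) > threshold    # the machine-detectable proxy
        sign_correct = (z > 0) == (snr > 0)
        counts[significant][0] += 1
        counts[significant][1] += sign_correct
    return {
        "share_significant": counts[True][0] / n_sims,
        "p_sign_correct_given_significant": counts[True][1] / counts[True][0],
        "p_sign_correct_given_not_significant": counts[False][1] / counts[False][0],
    }

# The same proxy means different things under different assumed populations of studies.
for snr_sd in (0.5, 1.0, 3.0):
    print(snr_sd, proxy_value(snr_sd))
```

Changing `snr_sd`, the threshold, or the definition of the latent state changes how informative the proxy is, and none of this yet says anything about the reviewer's actions or utilities. That is the sense in which the assumptions need to be written down before a check is treated as a filter.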

In this sense, my argument is very much along the same lines as Devezer et al.’s argument that those proposing reform procedures should adopt more formal methodology to avoid unwarranted overgeneralization. Once checks become part of review infrastructure, they stop being neutral diagnostics and become policy levers. Let’s start treating them as such in research on AI review.

“Howard Lutnick gives top Cantor Fitzgerald jobs to his sons Brandon and Kyle” is a very clean example of meritocracy.

In a post about possible corruption in the government and finance sector, Paul Campos points to a news article entitled, “Howard Lutnick gives top Cantor Fitzgerald jobs to his sons Brandon and Kyle,” that features an adorable photo of the three Lutnicks standing next to a fashion model.

Campos labels this as, “The Meritocracy!”, and clearly he’s being ironic: his point is that it seems unlikely that these two twenty-somethings are really the people with the most merit needed to run this zillion-dollar company. All things are possible, but it would be an amazing coincidence if, among all the possible financial executives out there, these two happened to be the best.

And, sure, I get that.

But now I want to point to my old post on the topic, Meritocracy won’t happen: The problem’s with the “ocracy.”

The short version is that the news item, “Howard Lutnick gives top Cantor Fitzgerald jobs to his sons Brandon and Kyle,” is a very clean example of meritocracy. Lutnick Sr. had the merit (in whatever sense) that took him to the top of the heap, and he used that merit to get jobs for his kids: that’s the “ocracy” part.

If all that merit did was get you top jobs and lots of money, that’s not meritocracy, that’s just merit-based employment and pay. What makes it “meritocracy” is that the people with the merit don’t just get nice jobs, they also get to be in charge of everything (“ocracy”). And one thing you do when you’re in charge is take care of your kids!

As Mark Palko discussed over ten years ago, our society seems to have become more tolerant of nepotism. Or maybe the point is that nepotism has always been a thing, but in recent years there’s been more of an effort by rich people and the news media to portray nepotistic hires as having special merit of their own. This is not to say that children of the successful cannot make great contributions themselves—John Quincy Adams comes to mind, also Julian Lennon had that cool song a few decades ago where he sounded just like his dad, so that’s something too. And then there was Oliver Wendell Holmes, Jr., who surpassed his famous father in achievements. And Alexander of Macedon didn’t do so bad either.

Anyway, “meritocracy” implies that the people with merit rule society, and they’ll use their power to help their kids.

Nepo babies aren’t a counterexample to meritocracy, they’re a central part of it.

To select or not to select?

This post is by Aki

New preprint To select or not to select: predictively consistent priors instead of model selection with Anna Elisabeth Riha, Leevi Lindgren, David Kohns, Paul Bürkner and me. arXiv.2606.22850

tl;dr: Model selection is not a substitute for building good models in the first place.

Abstract: Bayesian modelling workflows often consider multiple candidate models of varying complexity. Model selection is commonly used to navigate potential trade-offs between model complexity and generalisability to new data. We study when model selection is unnecessary or can even be harmful for predictive performance in finite data regimes and find that the need for selecting simpler models can depend on prior choice. We formalise predictively consistent priors, which keep prior predictive implications stable as model complexity increases. Across examples and numerical experiments, including adding covariates in linear and logistic regression, forward variable selection, and nonlinear modelling, flexible models with predictively consistent priors typically match or outperform selected simpler models in out-of-sample predictive performance. When selection helps, it can indicate poor joint prior implications, such as excessive prior mass on implausible predictive values. Based on our findings, we propose replacing the notion of sparsity or parsimony at the level of model components with specifying priors that remain sensible in predictive space as models become more complex.

These ideas have been around, but there was no single easy paper to refer to explaining and illustrating some important aspects of model selection. Sure, model selection can reduce overfitting, but even better is to use big models and predictively consistent priors.

This is a long (76 pages) slow science paper. I had been showing variants of some plots in my talks years ago, but polishing the explanations and adding more theory took a long time. Anna, Leevi, David, and Paul all did great work on this.

Survey Statistics: perfect collinearity in the sample but not in the population

In 2019, Andrew blogged about collinearity in Bayesian models. In the comments, he pointed to an example from Bayesian Data Analysis, 2nd edition (BDA2). I think it is a useful example to keep in mind when extrapolating from sample to population. Since folks (like me) may only have BDA3 on their shelf, I thought I’d talk thru it.

Amazon.com: Bayesian Data Analysis, Second Edition (Chapman & Hall/CRC Texts in Statistical Science): 9781584883883: Andrew Gelman, John B. Carlin, Hal S. Stern, Donald B. Rubin: Books

Pretend it is 1980 and we are at the US Census Bureau. We just revamped the occupational coding system, and it’s so much better ! We want 1980-style codes on all our old data that only had 1970-style codes. Let’s trade in our peasant blouses for some shoulder pads.

Say we have double-coded training data (n = 10,000) with:

  • O_1980 = occupation coded in the 1980 coding system
  • O_1970 = occupation coded in the 1970 coding system
  • E = education, either high or low
  • I = income, either high or low

We want to impute O_1980 for the single-coded full dataset (N = 1,000,000) with only O_1970, E, and I.

Consider everyone with a specific occupation according to the 1970 codes, e.g. Accountants. Say there are 200 accountants in the double-coded training data and they have either high income and high education or low income and low education. They have either OCCUP1 or OCCUP2 according to the 1980 codes.

From BDA2 Table 9.1:

Say we use standard regression software to fit p(O_1980 | O_1970 = Accountants, E, I). It will flag the predictors E and I as perfectly collinear, because in the double-coded training sample, education and income are perfectly correlated.

Suppose you drop education and use only income. The single-coded data actually has some low education and high income folks. The model only uses income, so 90% of them get OCCUP1. But suppose I drop income and use only education. My model only uses education, so only 10% of them get OCCUP1. Who is correct ?

As the authors say:

the truth is that we have essentially no evidence on the split for these units… the occupational split for the ‘E=low, I=high’ units should vary between, say, 90/10 and 10/90. … If some variable should or could be in the model on substantive grounds, then it should be included even if it is not ‘statistically significant’ and even if there is no information in the data to estimate it using traditional methods.
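To see the disagreement in numbers, here is a minimal sketch. The counts are hypothetical (BDA2 Table 9.1 is not reproduced here); I chose them only to match the 200 accountants and the 90%/10% splits described above, assuming 100 accountants in each of the two observed cells. The helper `share_occup1` is likewise made up for illustration.

```python
from collections import Counter

# Hypothetical double-coded accountants: (education, income, 1980 code) -> count.
# These counts are illustrative assumptions, not the values in BDA2 Table 9.1.
training = Counter({
    ("high", "high", "OCCUP1"): 90,
    ("high", "high", "OCCUP2"): 10,
    ("low", "low", "OCCUP1"): 10,
    ("low", "low", "OCCUP2"): 90,
})

def share_occup1(predictor, value):
    """Empirical P(OCCUP1 | predictor = value), using a single predictor."""
    idx = {"education": 0, "income": 1}[predictor]
    total = sum(n for key, n in training.items() if key[idx] == value)
    occup1 = sum(n for key, n in training.items()
                 if key[idx] == value and key[2] == "OCCUP1")
    return occup1 / total

# A unit that appears only in the single-coded data: low education, high income.
print("income-only model:   ", share_occup1("income", "high"))    # 0.9
print("education-only model:", share_occup1("education", "low"))  # 0.1

# No training unit has this combination, so the sample cannot arbitrate.
n_cell = sum(n for (e, i, _), n in training.items() if e == "low" and i == "high")
print("training units with E=low, I=high:", n_cell)               # 0
```

Both single-predictor models fit the training sample equally well and give opposite answers for the off-diagonal units, which is why the choice of which collinear predictor to drop is not innocuous once you extrapolate to the population.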

 

Mind-body healing: An exchange.

This has come up a few times on the blog already:

Carroll/Langer: Credulous, scientist-as-hero reporting from a podcaster who should know better

7 steps to junk science that can achieve worldly success

A suggestion for Freakonomics and Sean Carroll: Interview Nick Brown

Two researchers in the Harvard psychology department published a paper reporting that they could make people heal faster by telling them that more time had passed. Nick Brown and I looked at this paper carefully and didn’t think that it offered good evidence for its claims. Meanwhile, the paper was promoted uncritically in various media outlets.

As I wrote a couple years ago, to the extent that healing is important, I think it’s important not to overstate evidence for speculative claims about what works. Individual and societal resources are limited. If you want to say something like, “Sure, this is pie-in-the-sky research, but if it works it would be wonderful (‘kind of amazing,’ as physics podcaster Sean Carroll might say), so it deserves our attention, respect, and funding as a high-risk, high-return possibility” . . . go for it. That argument could be made. But then that argument should be made. Don’t fudge it by acting as if there’s evidence that isn’t really there.

Nick and I published an article in a psychology journal discussing the problems with the paper in question, framing it as a more general exploration of how scientific errors can propagate. One of the authors of the original paper then published an article in that journal arguing that we had gotten it wrong and that they really did have strong evidence. Nick and I didn’t find their response convincing on scientific or statistical grounds, but we thought it could possibly be rhetorically effective: just as a piece of writing, if you read it in isolation, it might make you think that we were full of crap. So we closed the loop by replying in the journal, basically restating what we’d said in our earlier article.

The four articles are in different places online and I thought it could be helpful to have all of them in the same place. So here they are:

Peter Aungle and Ellen Langer (2023), Physical healing as a function of perceived time:

In this study we wounded study participants following a standardized procedure and manipulated perceived time to test whether perceived time affected the rate of healing. We measured the amount of healing that occurred across three conditions using a within-subjects design: Slow Time (half as fast as clock time), Normal Time (clock time), and Fast Time (twice as fast as clock time). Based on the theory of mind–body unity—which posits simultaneous and bidirectional influences of mind on body and body on mind—we hypothesized that wounds would heal faster or slower when perceived time was manipulated to be experienced as longer or shorter respectively. Although the actual elapsed time was 28 min in all three conditions, significantly more healing was observed in the Normal Time condition compared to the Slow Time condition, in the Fast Time condition compared to the Normal Time condition, and in the Fast Time condition compared to the Slow Time condition. These results support the hypothesis that the effect of time on physical healing is directly affected by one’s psychological experience of time, independent of the actual elapsed time.

Andrew Gelman and Nicholas Brown (2024), How statistical challenges and misreadings of the literature combine to produce unreplicable science: An example from psychology:

Given the well-known problems of replicability, how is it that researchers at respected institutions continue to publish and publicize studies that are fatally flawed in the sense of not providing evidence to support their strong claims? We argue that two general problems are: (a) difficulties of analyzing data with multilevel structure and (b) misinterpretation of the literature. We demonstrate with the example of a recently published claim that altering patients’ subjective perception of time can have a notable effect on physical healing. We discuss ways of avoiding or at least reducing such problems, including comparing final results to simpler analyses, moving away from shot-in-the-dark phenomenological studies, and more carefully examining previous published claims. Making incorrect choices in multilevel modeling is just one way that things can go wrong, but this example also provides a window into more general problems with complicated designs, cutting-edge statistical methods, and the connections between substantive theory, experimental design, data collection, and replication.

Peter Aungle, Daniel Chen, and Nicholas Holmes (2026), Beyond Statistical Myopia: Replying to a Misguided Critique of Mind-Body Research:

In response to Gelman and Brown’s recent critique of Aungle and Langer (2023), we argue that their article illustrates how narrow statistical reasoning and selective literature review can misrepresent and undermine credible scientific findings. Using their discussion of perceived time and physical healing as a case study, we identify three general problems: (a) a failure to accurately characterize the methods and results of the study they critique, (b) misinterpretations and omissions in their review of the relevant literature, and (c) a tendency to generalize from isolated statistical issues to sweeping claims about the invalidity of mind-body research. We adopt Gelman and Brown’s recommended model and find that the main effect remains robust. We also document errors in their interpretations of other cited studies and demonstrate that they ignore decades of rigorous, well-replicated research on placebo effects and health mindsets. By examining their critique in detail, we highlight how methodological skepticism, when untethered from accurate reading and balanced appraisal, can mislead rather than clarify.

Nicholas Brown and Andrew Gelman (2026), This is the reason for external replication: Response to Aungle et al. (2026):

In an earlier article we addressed a controversy regarding a form of mind-body healing, arguing that a recent paper had overstated evidence from experiments and from literature review. In reaction, one of the authors of that paper disputed our claims. Here we explain why we remain skeptical.

The short answer is that, no, we don’t see any evidence that manipulating people’s subjective experience of time will help them heal better, nor do we see evidence that telling people that they’re exercising will get them to lose weight without there being any changes in their diet or exercise, or various other things claimed in that original paper. I do think it’s possible for researchers, through a combination of sloppy statistics, forking paths, and inaccurate literature review, to create an impression of a strong body of evidence even when nothing is going on–this was a point made eloquently in the classic 2011 article by Simmons, Nelson, and Simonsohn. And I think this combination is enough not just for people to mislead others, but, more importantly, to fool themselves, which can then allow them to spread misunderstanding in the scientific literature, the popular press, and, yes, NPR, Ted, and podcasts.

The whole thing makes me sad, to see researchers caught in a loop of misunderstanding so that, even after their mistakes are pointed out to them, they double down and remain confused. There’s no way that the authors of the above papers will agree with me on this point, and maybe they will find all this to be condescending, but I’m completely sincere here. It makes me sad to see people aim their careers in this direction. The good news is that over the years I’ve received many many emails from young researchers who see this sort of thing going on in their labs and want to do better. I guess the best way to get a grip on this problem is to see how others have been trapped in it.

Golems, auditors, and AI

This post is by Phil.

Some time ago I wrote some thoughts about “Neuromancer” ( https://statmodeling.stat.columbia.edu/2025/06/12/what-does-neuromancer-have-to-teach-us-about-the-role-of-ai-is-society/ ), which features two kinds of artificial intelligence, one of which seems like it could be realized with a Large Language Model, i.e. we could pretty much make it today. The other is something more powerful, an artificial general intelligence that not only has computational power but also imagination and desires. I think it’s an open question whether an LLM can have genuine desires (and even a genuine imagination) as opposed to being able to pretend that it does. Also an open question whether that distinction even makes sense to talk about.

I’ve read some other fiction within the past few months that has also given me things to think about, AI-wise.

First there was Feet of Clay, by Terry Pratchett. Pratchett writes lightweight, fun, but generally forgettable fantasy novels. I mentioned that book in an earlier post, https://statmodeling.stat.columbia.edu/2026/01/21/what-a-coincidence-what-a-coincidence/ , because it uses a rare plot device that happened to crop up in the very next book that I read. But I mention it now for a different reason: in the book there are golems (an animated, artificial humanoid in Jewish folklore created entirely from inanimate matter, such as clay or mud) that are treated pretty much like robots. A golem’s operating system is written on a piece of paper contained in its head. In the book, Golems are treated like we treat industrial robots or Roombas or similar: they are given simple, repetitive tasks at which they work, sometimes day and night. Nobody feels bad about using them however they want, because the golems have no emotions. Or do they? In the book some golems get together and create a golem of their own, and give it instructions that are…well, basically they are trying to create something more human. Of course, the fact that they desire to do such a thing suggests that they are not in fact emotionless objects.

Well, I just read another Pratchett book, “Thief of Time”. (Spoilers follow. Stop reading here if you want to read this book and be surprised.) This book has beings called ‘auditors’ who are responsible for maintaining order in the universe. They are described as being nearly emotionless except for hating disorder. To them, humans pretty much personify disorder so I think they could be said to hate humans. To better understand humans so they can learn to control us better, some of the auditors create human bodies for themselves and occupy them…and, uh oh, with the bodies come emotions. They get hungry, they can feel pain, things taste good or taste bad, etc. As they strive to satisfy their bodies’ desires, they start to act more and more like humans. They want things.

I mention this here because it touches on something I wonder about AIs, or at least LLMs: can they have desires? Certainly they can be told to _pretend_ they do — one could prompt an LLM to pretend that it wishes to take over the world, for example — but would it _really_ “want” to take over the world? Would it want anything at all?

Thinking about those kinds of questions, I realized that I don’t understand human emotions and sensations either. I don’t see how a bunch of computer circuits can be made to feel pain, but I also don’t understand how a bunch of nerves and neurons can feel pain either. I can understand how either one can respond to stimuli — if the temperature at this point exceeds such-and-such a temperature, fire these muscles — but I’m talking about the _sensation_ of pain. How does that arise? And is there something about a computer that works with voltages on a chip that prevents it from being able to have that sensation? Do nerves and brains somehow allow a sensation that literally cannot be duplicated in silico?

Sadly, Thief of Time did not answer any of those questions for me. But it did get me thinking about them, so I guess that’s something.

This post is by Phil

Workshop on Rethinking the Role of Bayesianism in the Age of Modern AI

Esmeralda Whitammer, Sara Wade, Vincent Fortuin, Konstantina Palla, and Theodore Papamarkou write:

We are organising a focused workshop on Rethinking the Role of Bayesianism in the Age of Modern AI from October 26 to 30, 2026, bringing together researchers exploring the frontiers of Bayesian machine learning and deep learning. The meeting will take place in Edinburgh, Scotland, UK, and will be hosted by the University of Edinburgh’s School of Informatics.

This workshop follows in the footsteps of the meetings held at Dagstuhl in 2024 and MBZUAI in 2025. This year, the meeting is growing and becoming an official event of the International Society for Bayesian Analysis (ISBA)’s new section on Bayesian AI. We are planning to maintain the collaborative and interactive spirit of the previous meetings, with a programme that includes talks, panel discussions, poster sessions, and ample time for interaction among participants representing a wide range of perspectives and expertise.

Looks interesting!  They should invite Aki for sure.

The new rule in economics: One star is p < 0.20, two stars is a set of steak knives, three stars is you're fired.

Someone pointed me to a series of applied economics papers:

1. George Borjas and Nate Breznau, Ideological bias in the production of research findings:

Our study exploits an opportunity to observe 158 researchers working independently in 71 teams during an experiment. After being asked their position on immigration policy, they used the same data to answer the same empirical question: Does immigration affect public support for social welfare programs? . . . teams composed of pro-immigration researchers estimated more positive impacts of immigration on public support for social programs, while anti-immigration teams estimated more negative impacts. The differences arise because different teams adopted different model specifications. . .

The results include an unusual labeling of statistical significance:

Usually it’s one star for p < 0.05, two stars for p < 0.01, as here:

or here:

These are not intended to be authoritative references; they just turned up in a quick search. The point is that 0.05 is the usual standard. Using 0.10 is a way of manufacturing a “statistically significant” result when you don’t have it in your data (as here). In the case of the Borjas and Breznau paper, the data were too variable to get a conventionally strong result, but they still wanted to get it published, and so they shifted the stars. I’m surprised that the reviewers didn’t catch it!

Don’t get me wrong. I don’t think people should be using statistical significance, at any level, as a threshold. To get a sense of my perspective you can read our paper, Abandon Statistical Significance. Even if you have an estimate that’s just one standard error from zero, that’s still evidence of the direction of the effect, as long as no selection is going on.
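To put a rough number on that last point, here is a back-of-the-envelope sketch. It assumes a normal likelihood, a flat prior on the effect, and no selection; those are my simplifying assumptions for illustration (a flat prior is itself a strong choice, and an informative prior would pull these numbers toward 0.5). The function name `prob_sign_correct` is made up.

```python
from statistics import NormalDist

def prob_sign_correct(z):
    """Posterior probability that the true effect has the same sign as the estimate,
    for an estimate z standard errors from zero.

    Assumes a normal likelihood, a flat prior, and no selection on the estimate.
    """
    return NormalDist().cdf(abs(z))

for z in (0.5, 1.0, 2.0):
    print(f"{z} standard errors from zero -> {prob_sign_correct(z):.2f}")
```

Under these assumptions an estimate one standard error from zero gives about an 84% probability that the sign is right: far from certainty, but not nothing.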

2. Katrin Auspurg and Josef Brüderl, Fragile Evidence for an Ideological Bias in the Production of Research Findings: Comment on Borjas and Breznau:

Although we were able to reproduce B&B’s numerical results, our reanalysis shows that the reported association is not robust. Specifically, the association hinges on a coding error. Data from four teams that contradict the ideology hypothesis were excluded from the analysis due to idiosyncratic variable coding. Correcting this error renders the ideology effect no longer statistically significant. Also, B&B employed a different outcome variable and weighting scheme to that used in a previous paper based on the same data. These two analytical decisions further contribute to the observed ideology effect. Correcting the coding error or using the same specification as in the previous paper renders the ideology effect indistinguishable from zero. . . .

They also go with the 10% significance level, I guess to be consistent with the original paper?

3. Nate Breznau and George Borjas, A Lack of Robustness in Robustness Checking from Auspurg and Brüderl:

In our published paper, we explicitly acknowledged the limitations of our findings which are based on secondary data and a small sample. After examining Auspurg and Brüderl’s claims, we conclude that they have not presented any new evidence that warrants any correction to our conclusions. . . .

This rejoinder includes the table at the top of this post, in which the significance level has now crept up to 0.20.

I’m anticipating a few more rounds of this, culminating in a table by Breznau and Borjas in which anything with a two-sided p-value of less than 0.5 is given a star. Everybody’s a winner!
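For reference, here is what those star thresholds mean in terms of how many standard errors an estimate has to be from zero. This is just the standard conversion from a two-sided p-value to a z-score under a normal approximation; the helper name `z_cutoff` is made up for this sketch.

```python
from statistics import NormalDist

def z_cutoff(p):
    """Smallest |z| that yields a two-sided p-value below p (normal approximation)."""
    return NormalDist().inv_cdf(1 - p / 2)

for p in (0.01, 0.05, 0.10, 0.20, 0.50):
    print(f"p < {p:.2f}  <=>  |z| > {z_cutoff(p):.2f}")
```

So a star at p < 0.20 goes to any estimate more than about 1.28 standard errors from zero, and a star at p < 0.5 would require only about two-thirds of a standard error.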

P.S. Just kidding in the title of the post. This “p < 0.20" thing isn't really the new rule in econ; it's just something from this one paper. It may be that its authors got some special exemption from the 0.05 threshold.

Online haters in the low-budget literary biz

I’m a big fan of John Lennon (the American author, not the English musician, but, sure, I’m a fan of the musician too). I’ve read most of his books, and it saddens me that literature is such a niche interest that even a versatile, talented, and accessible novelist such as Lennon can’t make a living out of it. OK, I understand the economics: if there were more money to be made from writing fiction, more people would be doing it, there’d be more competition, so it’s not clear that Lennon himself would thrive in that environment. But still.

Lennon’s an interesting case in that he’s had a certain amount of success (early books being published by serious commercial presses and getting respected reviews, and these books made it into stores to the extent that readers such as me came across them), he gets asked to write for the London Review of Books (all they ever publish of me is letters!), and he has a comfortable job teaching at an Ivy League university–but his fiction nowadays . . . ummm, “disappears without a trace” would be putting it too strongly, but readers have to go and search for it. There are just too many people out there who can write well and would like to write for a living, and too few people who want to pick up a book and read a story. The numbers don’t work out.

The above is all background to a weird and kind of mysterious story, which is that there’s someone online who hates Lennon’s guts, but not for any personal reason, just professional grievances of some sort. The person in question is Colin Fleming, and he seems to be, like Lennon himself, a moderately successful writer, which, as discussed, seems like a frustrating position to be in. Fleming has a low opinion of Lennon’s work. That’s fine; literary judgment is subjective. But he’s so angry at Lennon, which just seems odd to me. Lennon’s just some guy, right? Fleming’s blog reminds me of a wacky book from fifty years ago by disaffected journalist Richard Kostelanetz (see some discussion here). I find something fascinating about these cul-de-sacs of literature and publishing–but it’s disturbing to see it happening in real time, directed at a real person.

If you want to draw connections, you can note that Lennon once reviewed a book by James Lasdun who once wrote a book about how someone had stalked him. Fleming doesn’t appear to be a stalker; he’s just really angry in a way that seems disproportionate to whatever set him off. At least, that’s my perspective; Fleming seems angry that Lennon has reached literary heights while writing really bad stuff, but, as I see it, Lennon is just getting by–publishing four stories in the New Yorker over a twenty-year period isn’t enough to pay the bills–and I think he’s an excellent writer. I get that Fleming is angry, but it doesn’t seem to me that he’s picking an appropriate target.

P.S. Just incidentally, I think Fleming underestimates the difficulty of coming up with a good title. Coming up with a good title is harder than it looks (unless you’re Donald Westlake). When people can do it, they deserve our respect. When they can’t, they deserve our sympathy, not our mockery. Even some great books have mediocre titles.

P.P.S. Just for fun, here’s a review by Lennon of a recent book by Stephen King.

A tool for learning about Fourier transforms

Eric Novik came by my talk the other day and we were chatting about a number of things, including how much we forget as the years go by. I remarked that I used to be very comfortable with Fourier analysis and was able to use it as a research tool—see section 2.2 of my Ph.D. thesis, and it also came up in my research leading to R-hat (although it didn’t make it into the writeup)—but at this point I only understand Fourier analysis on a conceptual level. It’s not one of these things that stuck with me.

In response, Eric pointed to this app that he created (with chatbot assistance) to help him understand some things about Fourier series. Maybe it will be useful to some of you too. The source code is here.

Gray Davis, Grover Norquist, and a rabbi walk into Peter Thiel’s Dialog conference . . . and get no press coverage!

You know that Oscar Wilde saying, “There is only one thing in the world worse than being talked about, and that is not being talked about”?

This came to mind with respect to three once-famous people: Gray Davis, Grover Norquist, and a rabbi.

Act 1 (2021-2022): I receive emails from some sort of, ummm, I don’t want to call it a “scam” exactly . . . let’s call it a “networking event,” featuring luminaries such as “Gray Davis – Of Counsel, Loeb & Loeb. Fmr. Governor, California. [Los Angeles],” “Grover Norquist – President, Americans for Tax Reform. [Washington, D.C.],” and “David Wolpe – Rabbi, Sinai Temple. [Los Angeles].”

It seemed to be a great opportunity–just look at the email:

Hello Andrew,

We’ve heard a lot of great things about you, which is why you’ve been selected for membership. Dialog members–ranging from scientists to elected politicians, CEOs, artists, economists, media figures, and political dissidents–regularly convene to intellectually challenge each other in off-the-record conversations exploring pressing issues. We think you’d add an exciting perspective!

On the other hand, they were charging $16,846, which, as you may have heard, would cover the cost of a lot of Jamaican beef patties.

If they’d really heard a lot of great things about me, and they thought I’d add an exciting perspective, you wouldn’t think they’d charge me for the privilege, right?

I asked the organizers, who replied:

I absolutely get it; the majority of those who are invited to Dialog typically only attend conversations or gatherings as the keynote speaker, and if money is involved, it’s typically because they’re being paid to attend.

To keep Dialog fully independent and off the record, it is 100% participant funded–everyone who attends pays to do so.

Wow! So Gray Davis, Grover Norquist, and the rabbi were paying thousands of dollars to mingle with each other? It kinda makes you wonder. One of the other listed members was “Turki Al Faisal Al Saud, Former Minister of Intelligence, Saudi Arabia”–no, I’m not kidding there! I wonder if they let him take the bone saw on the plane? I bet he had a great conversation with “Zeke Emanuel – Vice Provost for Global Initiatives, Professor & Chair, Department of Medical Ethics and Health Policy, University of Pennsylvania.” And what about “Lawrence Summers – President Emeritus & Professor, Harvard University. Fmr. Secretary of the Treasury, United States”: did he really pay? It’s hard for me to imagine Larry paying for anything out of his own pocket. Maybe he got some friendly Harvard donor to fork over the money?

As I discussed in the above-linked blog post, I could see reasons why Gray Davis or Grover Norquist might want to talk with me, and I could see reasons why I might want to talk with Gray Davis or Grover Norquist, but I can’t figure out why each of us needs to spend $16,846 to do it. We could just talk on the phone for free!

Act 2 (2026): This arrives in the inbox:

I am reaching out on behalf of the WIRED team. We are working on a story about Dialog, the private, invite-only organization co-founded by Peter Thiel.

WIRED has obtained internal Dialog records, exposed by its website, including a membership directory and the registration list for the group’s 2026 retreat. Your name appears in them.

We wanted to give you the opportunity to comment before we publish. We’d welcome any response, including whether you’d confirm your affiliation with Dialog and anything you’d like to say about the group or your involvement.

Our deadline is 1pm EST, but if you’d need more time to prepare a response, please let me know as soon as possible.

As many of you know, I never check my email before 4pm. It’s actually daylight time here in New York, not standard time, but either way it’s before 4.

In any case, another email arrived soon after:

My sincerest apologies for this mixup! Please ignore our previous email. Your name was mixed up with a list of Dialog attendees we are trying to reach. We’re actually reaching out because we saw your 2022 blog post about being invited to the event, wanted to mention it in our story, and thought you should have the opportunity to comment on it, if you wanted to.

That evening I saw the message and replied:

Hi, sure, feel free to quote me. I stand by what I wrote before. I’ve never actually attended the Dialog event, as I have better uses for my $16,000.

The news article appeared soon after, under the title, “Leak Exposes Members of Peter Thiel’s Secretive ‘Dialog’ Society,” with subtitle, “More than 200 of the world’s elites registered for a retreat whose agenda runs from panels on cult-building and sex to prepping for World War III. An associated app offers matchmaking.”

Wow–I had no idea! I have to say, the idea of seeing Gray Davis, Grover Norquist, and a rabbi talking about cult-building and sex . . . ok, still not worth $16,846, but maybe there’s some entertainment value there.

Act 3 (2026): The story was picked up by other news organizations. I know this because a few of them contacted me directly and asked if I had anything more. I forwarded them three of the emails I’d received back in 2021 and 2022. There was also a story in the Hollywood Reporter (Palko pointed me to it) mentioning anti-free-press warrior Peter Thiel and a bunch of movie stars and executives, the most notable of whom was Benj Pasek, one of the composers of the music for La La Land.

It’s hard for me to picture Benj Pasek forking over $16,846 for the opportunity to mingle with Gray Davis, Grover Norquist, a rabbi, and the head of Saudi intelligence. But maybe his agent paid for it? I dunno.

Act 4 (2026): Here’s what I’m wondering. What do Gray Davis, Grover Norquist, and the rabbi think about all this? Each of them is a bigshot in his own field (failed politician, political lobbyist, religious leader), but none of them is important enough to be mentioned in any of these news articles.

How humiliating!

There was a time when the name Gray Davis meant something, a time when Grover Norquist had armies at his command, a time when a rabbi could call down thunderbolts. And now they’re just anonymous names in a list. What a comedown. Here’s my advice to these three guys: Fire your publicist.

I hope at least that they enjoyed the conferences. $16,846 is real money!

The only thing I don’t get is why the news organizations are making such a big deal about all of this. It’s an annual conference where rich guys spend thousands of dollars to be in each other’s company. No joke, it doesn’t sound much different from a country club.

And there’s this whole bit about the membership list being a secret. I don’t get why this is supposed to be a big thing either. Country clubs keep their membership lists secret too–it’s part of the whole exclusivity cachet. They’re not the public library, y’know!

Summary

Gray Davis, Grover Norquist, and a rabbi got the worst of all worlds. They had to go to a boring conference, they paid $16,846, they got no press coverage out of the deal, and they didn’t get any Jamaican beef patties.

I have no idea what food they serve at the Dialog conferences. I’m guessing it’s standard crappy upscale catering food, nothing nearly as good as you could get for $2.85 at Golden Krust here on 125 St.

LLM-generated Stan case study on Galileo’s inclined plane experiment

This post is from Bob.

I’ve been planning for at least a couple years to generate a case study around Galileo’s use of an inclined plane instrumented with water clocks to estimate the terrestrial gravitational constant. Here are some photographs of a replica in the Museo Galileo (click to blow them up). And here’s a video simulation of the experiment. We replace his clever pendulum apparatus explained in the video and the web page with simple Bayesian statistics so we can actually estimate the gravitational constant.
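To give a feel for the kind of inverse problem involved, here is a minimal sketch in plain Python. It is not the Stan model from the case study; it swaps in a conjugate normal update that works only because the noise scale is assumed known. Everything specific here is an assumption made for illustration: the ramp angle, the 5/7 factor for a solid ball rolling without slipping, the measurement noise, the prior on g, and the hypothetical function names `simulate` and `posterior_g`.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the case study.
G_TRUE = 9.81                # m/s^2, used only to simulate data
THETA = math.radians(5.0)    # assumed ramp angle
ROLL = 5.0 / 7.0             # solid sphere rolling without slipping
SIGMA = 0.02                 # assumed known sd of distance measurements (m)
COEF = 0.5 * ROLL * math.sin(THETA)  # distance = COEF * g * t^2


def simulate(times, rng):
    """Simulate noisy distances traveled down the ramp at the given times."""
    return [COEF * G_TRUE * t**2 + rng.gauss(0.0, SIGMA) for t in times]


def posterior_g(times, distances, prior_mean=10.0, prior_sd=5.0):
    """Normal posterior for g under a normal prior and known noise sd.

    The model is d_i ~ Normal(g * x_i, SIGMA) with x_i = COEF * t_i^2,
    which is linear in g, so the posterior is available in closed form.
    """
    x = [COEF * t**2 for t in times]
    prior_prec = 1.0 / prior_sd**2
    data_prec = sum(xi**2 for xi in x) / SIGMA**2
    post_prec = prior_prec + data_prec
    weighted = sum(xi * di for xi, di in zip(x, distances)) / SIGMA**2
    post_mean = (prior_prec * prior_mean + weighted) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)


rng = random.Random(1234)
times = [0.5 * k for k in range(1, 9)]   # 0.5 s, 1.0 s, ..., 4.0 s
distances = simulate(times, rng)
g_mean, g_sd = posterior_g(times, distances)
print(f"posterior for g: mean {g_mean:.2f}, sd {g_sd:.3f}")
```

With an unknown noise scale or a less convenient likelihood there is no closed form, which is where a sampler such as Stan earns its keep.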

The case study

Here is a draft.

Bob Carpenter. 2026. Estimating g from Galileo’s Water Clock: A scientific Bayesian inverse problem with Stan and CmdStanPy. GitHub.

I list myself as the author here because I’m responsible and AIs can’t own copyright in the U.S., but 100% of the text and code was written by Claude Opus 4.8 (medium or high effort, but I can’t recall which). I used the desktop app, which doesn’t allow sharing, but you can try it yourself.

The prompt

Here’s the sloppy prompt I used, which I just typed in without much thought in a couple minutes to get a feel for what it could do on its own.

I would like to generate a case study written in Quarto and using CmdStanPy to demonstrate solving scientific Bayesian inverse problems. I want to use a simulation of Galileo’s water clock experiment, which can be used to estimate the gravitational constant. I would like you to start by generating the mathematical model description in LaTeX, the model code in Stan to solve the inverse problem, and a simulation driver in Python using CmdStanPy and plotnine for plotting. Please just `import plotnine as pn` and use `pn.geom…`, etc. All I need in the output now is a call to `.summary()` on the fit returned by `.sample()`. Wrap this all up in a quarto document for me from which I can generate HTML by calling `quarto render galileo.qmd`.

It was done before I got back to my desk with a cup of coffee (well under five minutes). So not quite the several hours Andrew said it took him to write his case study on the New York Knicks basketball team, which he posted earlier today. Of course, this was much simpler and I didn’t have to think through any details before generating it.

Is it right?

What Claude produced looks really good to me. If a student had done this, I’d have given them an A. I can’t object to the way it described Galileo’s experiment, wrote the math, wrote the Stan code, wrote the Python simulation, or plotted the raw data as Andrew is always urging us to do.*

The source

You can find the source .qmd file on my GitHub:

https://github.com/bob-carpenter/case-studies/tree/master/galileo-gravity

It’s short, so I would have just included it, but the blog software blocked my post after considering it an attack on the site. To get it to render with resources embedded, I had to ask Claude a follow-up question and manually insert a single line of config into the .yaml header for the markdown document.

Putting this blog post together took longer than writing the prompt and checking the results.


* Maybe Claude runs a little simulation of Andrew like I do. Andrew himself claims to run a simulation of Jennifer Hill—it’s the basis of his handy statistical lexicon entry for “WWJD,” which he told me stands for “What would Jennifer do?” Unfortunately, neither the lexicon entry nor its underlying link explains the acronym.

Gambling provides a gentle rocking of the emotions to put you in a pleasant baby-like state

A commenter recommended the book, Addiction by Design: Machine Gambling in Las Vegas, by the anthropologist Natasha Dow Schüll, and I checked it out of the library. It’s a study of people who play slot machines and video poker, focusing on the locals: Vegas residents who have some low-level gambling addictions as part of their lives.

Nowadays, I guess that much of this business has been supplanted by machine gambling that you can do on your phone in the comfort of your own home. But the market for gambling must be far from being tapped: I imagine that there are many millions of potential gambling addicts out there, available to be hooked by some form of gambling or another.

As a statistician, I have mixed feelings about gambling. Ever since I was a kid, I’ve thought that probability is cool, and I like to bet. When we were kids we had a toy roulette set that we would play (just betting chips, not real money) and I’ve enjoyed poker and informal sports betting. The last time I bet on anything was about 20 years ago, but that’s just more me getting older than anything else.

At the same time, there are all these addicts, and all the people who might not be addicts but who still degrade their standard of living, not to mention reward evil people (even if they’re pleasant as individuals, they’re in an evil business; sorry, Nate!). And it just keeps getting worse.

To a statistician, this is all an endlessly fascinating topic: the odds and all that, but also whatever it is in people’s brains that motivates them to spend thousands of dollars on lottery tickets, etc.

As Schüll writes in her book, the popularity of machine gambling (which she says is the source of the majority of casino gambling profits in Vegas) is particularly puzzling in that people are just pulling the lever over and over again, without the sense of human context or any feeling of agency.

There’s also the interaction between the players and the people who make money from the machines:

For extreme machine gamblers, the experience of play is an end in itself–an “autotelic” zone beyond value as such, in that “no other reward than continuing the experience is required to keep it going.” Conversely, for the gambling industry the zone is a means to an end; although it carries no value in and of itself, it is possible to derive value from it. . . . In effect, gamblers’ drive to remain indefinitely suspended in the zone is rerouted, via the technological detours of the gambling industry, toward a destination of complete depletion.

It’s not just “the technological detours of the gambling industry,” it’s also politics: the industry doing what it takes to keep all this going, a gradual effort over many decades that continues to this day.

Later, Schüll summarizes:

Gambling addicts play machines to suspend themselves in a state of equilibriated affect.

This seems pretty accurate.

I would just add two things.

First, this equilibrium is not flat. It’s periods of stress, punctuated with the occasional excitement of winning and the frequent relaxing calm of losing. The best analogy I can think of is the way that a baby is calmed, not by lying completely still, but by being rocked in a somewhat irregular fashion.

Second, stakes matter. That “state of equilibriated affect” can only be achieved when real money is involved. I guess this is related to the phenomenon of habituation in drug exposure. Schüll talks with someone who started on a zero-stakes poker video game but then moved to the machines that take real dollars. We discussed this general idea recently in our post, Why isn’t it possible to play a fun and serious game of poker not for money?

It’s a good thing that babies don’t work that way–you can rock them a reasonable amount and they’ll be happy. No need to keep upping the stakes until the crib does a loop and the baby flies out the window. Although I guess that might happen if there were money in it.