The high cost of split R-hat

This post is by Bob.

I’ve been thinking a lot lately about R-hat given that I’m using it for online convergence monitoring in our new Walnuts implementation. In that setting, where I use Welford accumulators to update R-hat estimates every iteration, I can’t use split R-hat without way too much buffering. So I’ve been thinking about the effect of splitting, too, and whether we need it. I asked Andrew and he said Kenny Shirley once produced an example where split R-hat diagnosed non-convergence that regular R-hat didn’t, but that example is lost to time and we’ve never seen this kind of behavior with NUTS as far as I know (please give us an example in the comments or via email to Andrew if you have one).
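Here is a minimal sketch of what per-iteration R-hat monitoring with Welford accumulators can look like. It is not the Walnuts code; the class and function names are made up for illustration, and it assumes all chains have the same number of draws. Each chain keeps only a count, a running mean, and a running sum of squared deviations, which is why non-split R-hat needs no buffering of past draws.

```python
import math


class Welford:
    """Running mean and variance for one chain (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; requires at least two draws.
        return self.m2 / (self.n - 1)


def online_rhat(accumulators):
    """Non-split R-hat from per-chain accumulators (equal chain lengths assumed)."""
    m = len(accumulators)
    n = accumulators[0].n
    means = [a.mean for a in accumulators]
    grand_mean = sum(means) / m
    var_of_means = sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    mean_of_vars = sum(a.variance() for a in accumulators) / m
    return math.sqrt((n - 1) / n + var_of_means / mean_of_vars)


# Toy usage: two short chains, updated one draw at a time.
chains = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
accs = [Welford() for _ in chains]
for t in range(3):
    for acc, chain in zip(accs, chains):
        acc.update(chain[t])
print(online_rhat(accs))
```

Splitting is awkward in this setup because the midpoint of each chain moves as draws arrive, so a fixed pair of accumulators per chain cannot track the two halves without storing draws.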

Relating R-hat and ESS

My intuition was that we could set a low enough R-hat threshold that it would ensure a high enough effective sample size (ESS) when we crossed it. The relation’s a little tighter than I thought, with

    Rhat^2 ≈ 1 + M / ESS,

where M is the number of chains and ESS is the effective sample size of all chains combined. There’s a multivariate proof in Vats and Knudson, 2021, Revisiting the Gelman-Rubin diagnostic, Statistical Science (see page 2 and section 5 for details), but it’s pretty straightforward to get the intuition when you reduce R-hat^2 to (N-1)/N + var(chain-means) / mean(chain-variances), where N is the number of draws per chain, as Charles Margossian did in his nested R-hat paper. Vats and Knudson disapprove of Andrew and Aki’s suggested threshold of 1.1 from BDA3, because it is satisfied with a combined ESS of 20 across Andrew’s default 4 chains.

Being me, I tried to validate my intuition with simulations rather than linear algebra. Also, I like to see that what theory entails actually works in practice, to make sure I’ve understood all the assumptions baked into the theory (one can’t prove anything without assumptions!). When asked to code a simulation using ArviZ, Claude inserted a (2 * M) in the numerator in place of the M. Where did that come from, I asked? It told me it needed the factor of 2 because ArviZ uses split R-hat. D’oh! Of course it does, because we’ve doubled M without increasing ESS.
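A simulation along these lines can be sketched without ArviZ. The version below is an illustration under stated assumptions, not the simulation described above: it uses stationary AR(1) chains, for which the asymptotic ESS is M * N * (1 - rho) / (1 + rho), and it compares the average of R-hat^2 - (n-1)/n (with n the length of the chains actually fed in) against M / ESS for plain and split chains. All function names and default settings are hypothetical.

```python
import math
import random


def sample_var(xs):
    """Sample variance with the n - 1 denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def rhat_sq_excess(chains):
    """var(chain means) / mean(chain variances), i.e. R-hat^2 - (n-1)/n."""
    means = [sum(c) / len(c) for c in chains]
    variances = [sample_var(c) for c in chains]
    return sample_var(means) / (sum(variances) / len(variances))


def ar1_chain(n, rho, rng):
    """Stationary AR(1) chain with unit marginal variance."""
    x = rng.gauss(0.0, 1.0)
    chain = [x]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(n - 1):
        x = rho * x + scale * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def simulate(m=4, n=200, rho=0.5, reps=500, seed=1234):
    """Average excess for plain and split chains, plus the predicted M / ESS."""
    rng = random.Random(seed)
    plain_total = split_total = 0.0
    for _ in range(reps):
        chains = [ar1_chain(n, rho, rng) for _ in range(m)]
        halves = [h for c in chains for h in (c[: n // 2], c[n // 2:])]
        plain_total += rhat_sq_excess(chains)
        split_total += rhat_sq_excess(halves)
    ess = m * n * (1.0 - rho) / (1.0 + rho)  # asymptotic ESS for AR(1)
    return plain_total / reps, split_total / reps, m / ess


plain, split, predicted = simulate()
print(f"predicted M / ESS: {predicted:.4f}")
print(f"plain excess:      {plain:.4f}")
print(f"split excess:      {split:.4f}")
```

If the relation holds, the plain excess should land near M / ESS and the split excess near 2 * M / ESS, up to Monte Carlo noise.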

A worked example

Suppose we have 4 chains with a combined ESS of 400. Then sqrt(1 + 4/400) ≈ 1.005 and sqrt(1 + (2 * 4) / 400) ≈ 1.01. We’ve effectively doubled the number after the 1 by splitting. Unlike Vats and Knudson, I usually don’t need an ESS >> 100, so the 400 required for split R-hat < 1.01 is perhaps a bit too conservative for my tastes. On the other hand, we face a practical problem estimating ESS reliably with fewer than 50 or so ESS per chain. Estimation is challenging because it relies on autocorrelation estimates from the chains themselves, which become much noisier when based on shorter chains. (Side question: Do we not combine autocorrelation estimates across chains to reduce standard error because some chains might not be mixing?) Also, we know this algebra wasn't a coincidence of 4 chains and 400 draws. The Taylor expansion of sqrt(1 + x) is the convergent series

    sqrt(1 + x) = 1 + x/2 - x^2 / 8 + x^3 / 16 + ...

When x < 0.1, the first-order approximation, sqrt(1 + x) ≈ 1 + x / 2, is good.
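The arithmetic is easy to check directly. This is a small sketch that takes Rhat^2 ≈ 1 + M / ESS at face value (with 2 * M for split chains); the function names are made up for illustration.

```python
import math


def rhat_from_ess(ess, num_chains, split=False):
    """Approximate R-hat implied by a combined ESS."""
    m = 2 * num_chains if split else num_chains
    return math.sqrt(1.0 + m / ess)


def ess_for_threshold(threshold, num_chains, split=False):
    """Combined ESS at which the approximate R-hat equals the threshold."""
    m = 2 * num_chains if split else num_chains
    return m / (threshold ** 2 - 1.0)


print(rhat_from_ess(400, 4))                   # roughly 1.005
print(rhat_from_ess(400, 4, split=True))       # roughly 1.01
print(ess_for_threshold(1.01, 4))              # roughly 199
print(ess_for_threshold(1.01, 4, split=True))  # roughly 398
print(ess_for_threshold(1.1, 4))               # roughly 19

# First-order Taylor approximation versus the exact square root.
for x in (0.01, 0.02, 0.1):
    exact = math.sqrt(1.0 + x)
    approx = 1.0 + x / 2.0
    print(x, exact, approx, approx - exact)
```

The threshold of 1.1 with 4 chains comes out at a combined ESS of about 19, in line with the figure of 20 quoted above, and the first-order error at x = 0.1 is a bit over 0.001.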

The bottom line for practitioners

We need around twice as many draws to get below a fixed threshold with split R-hat as with the original R-hat.

Guess who’s getting the big-money donations in the Maine U.S. Senate race?

Just in time for July 4th, Tom Ferguson, Paul Jorgensen, Matthias Lalisse, and Jie Chen share the above graph and write:

What can one Senate race reveal about the hidden machinery of American politics? In Maine, donor patterns expose how campaign finance can shape party competition, political narratives, and the choices voters are asked to make long before ballots are counted. . . .

Platner is strongly supported by Senator Bernie Sanders and other progressives, while many establishment Democrats dislike him. Major media keep printing articles questioning his character. By contrast, Collins’ somewhat contradictory legislative history attracts less coverage. . . .

Our tabulations of the race show that Collins is much closer to a typical Republican pattern (or, to be fair, those of the Old Guard Democratic leaders [Nancy Pelosi and Chuck Schumer, along with Paul Ryan and Mitch McConnell]) in a key respect: the size profile of her donors. . . .

The Republican Senator from Maine is hugely dependent on very large donors. By contrast, Platner strikingly resembles Sanders: he attracts essentially no big money. Recently the numbers of billionaires supporting the candidates has emerged as an issue. A very few have supported Platner with small sums. Almost a hundred (counting spouses) have made contributions of varying sizes to Collins. The overall configuration is as shown [above] and is perfectly obvious.

They also report:

If you put aside contributions that are below the $200 threshold for disclosure, the percentage of money received from Maine donors differs sharply between the candidates. Senate elections have been nationalized for a long time. Contributions from Maine itself make up approximately 20% of all money for Platner; by contrast, Collins’ rate is slightly under 3%. (Not a misprint.) Her biggest contributors include a Who’s Who of prominent financiers in private equity and hedge funds, including Steve Schwarzman of BlackRock, Ken Griffin of Citadel, along with other well known Republican donors, including Larry Ellison of Oracle.

And they give an example of how this works:

A day after a Super Pac backing her received a $2 million dollar contribution from a private equity magnate who, according to press reports, stood to gain munificently from President Trump’s One Big Beautiful Bill, [Collins] provided a crucial vote to spring the bill out of committee. Then she loudly voted against it on the floor.

Another way of looking at this is to ask, why would a person living outside of Maine give $100,000+ to Susan Collins? Roughly speaking, the following conditions are needed:
1. The donor has to be rich enough to be able to spare $100,000 as loose change.
2. It has to be legally possible to give this amount of money, or the perceived consequences of violating the law have to be minimal.
3. The donor has to consider Republican Party control of the U.S. Senate to be important enough to be worth spending $100,000 to make a small change in the probability of this happening.
4. It has to be easy to write the check; that is, the donor does not need to get the agreement of many other people to release the money.
5. Any negative political, social, and economic consequences of revealing oneself to be a strong partisan have to be mild, compared to the perceived benefits of making the donation.

And in recent years these five conditions have increasingly been present:
1. There are more and more super-rich people who can spend $100,000 without blinking an eye.
2. The Supreme Court keeps liberalizing campaign finance laws, and the government has become much more encouraging and tolerant of corruption. On the rare occasions when people are prosecuted, they get off, and even on the rare occasions when they are imprisoned for corruption, they get pardoned.
3. With political polarization, the two parties are further apart than ever, and party-line voting in Congress has become the norm.
4. The money is being given by individuals, or by companies controlled by single individuals. It’s not like the old days, where, if General Motors made a campaign contribution, they’d need the coordination of some board of directors.
5. This last one is the most interesting. A flip side of partisan polarization is that, if you give a lot of money to the Republicans, it will piss off a lot of Democrats, and vice versa. Political independents might not be so happy either. One way out is that it’s becoming easier and easier to skirt the regulations and campaign in secret. Beyond this, I guess these donors have decided that the Republican business sphere is large enough that they can afford to alienate Democrats and independents. And BlackRock, Citadel, and Oracle are not primarily customer-facing businesses.

The optimizer’s curse

The above sketch shows a decision tree.

The circles are uncertainty nodes and the squares are decision nodes. Read the tree from left to right: to start, there is uncertainty of which of the strata i=1,…,I you will be in. In any given stratum, you will have to decide between options 1 and 2, and for each of these decision options there is uncertainty about the payoff.

The goals are:

(a) Conditional on the stratum, pick the best decision. This is the local decision problem.

(b) Averaging over the strata, evaluate the expected value of the tree, that is, the expected value under an optimal decision analysis given the uncertainty.

The challenge is that you don’t know which internal decision is best, because there is uncertainty about the payoffs.

The “optimizer’s curse” is that if, for each stratum in step (a), you make the best decision given available information–that is, you estimate the expected payoff under each of the two decision options and then pick the one whose expected payoff is higher–then if you use these expected payoffs in step (b) you will systematically overestimate the value of the tree.

The “curse” here is not that the optimizer is making bad decisions, it’s that a naive estimate will be overly optimistic about the net value because you’re selecting on choices that look good.
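The bias is easy to see in a toy simulation. The sketch below is not the simulation study from the paper; it assumes equally weighted strata, true payoffs drawn from a standard normal, and independent normal estimation noise, and every name in it is made up for illustration. It compares the naive value (average of the selected estimates), the realized value (average true payoff of the selected options), and the oracle value (average of the truly better options).

```python
import random


def simulate_tree(num_strata=5000, noise_sd=1.0, seed=2007):
    """Return (naive, realized, oracle) values of a two-option tree."""
    rng = random.Random(seed)
    naive = realized = oracle = 0.0
    for _ in range(num_strata):
        # True expected payoffs of the two options in this stratum.
        true = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        # Noisy estimates of those payoffs.
        est = [t + rng.gauss(0.0, noise_sd) for t in true]
        # Step (a): pick the option that looks best.
        choice = 0 if est[0] >= est[1] else 1
        naive += est[choice]      # what step (b) would plug in
        realized += true[choice]  # what the chosen option actually pays
        oracle += max(true)       # value with perfect information
    return naive / num_strata, realized / num_strata, oracle / num_strata


naive, realized, oracle = simulate_tree()
print(f"naive estimate of tree value: {naive:.3f}")
print(f"realized value of decisions:  {realized:.3f}")
print(f"oracle value:                 {oracle:.3f}")
```

Under these assumptions the naive value sits well above the realized value, and even above the oracle value, although the decisions themselves are the best available given the estimates.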

In 2007, Erwann Rogard, Hao Lu, and I published a paper on the topic, including the above diagram. Here’s our abstract:

The evaluation of decision trees under uncertainty is difficult because of the required nested operations of maximizing and averaging. Pure maximizing (for deterministic decision trees) or pure averaging (for probability trees) are both relatively simple because the maximum of a maximum is a maximum, and the average of an average is an average. But when the two operators are mixed, no simplification is possible, and one must evaluate the maximization and averaging operations in a nested fashion, following the structure of the tree. Nested evaluation requires large sample sizes (for data collection) or long computation times (for simulations).

An alternative to full nested evaluation is to perform a random sample of evaluations and use statistical methods to perform inference about the entire tree. We show that the most natural estimate is biased and consider two alternatives: the parametric bootstrap and hierarchical Bayes inference. We explore the properties of these inferences through a simulation study.

I kinda like the paper. I wouldn’t say it’s one of my all-time favorites, but I think it’s interesting, and I like that we offer two different solutions to the problem.

On the downside, the paper seems to have disappeared without a trace. In 20 years, it’s only been cited three times, and none of them look very impressive:

“Using Alternating Decision Treets,” indeed.

Maybe one problem with our paper was its dry-as-dust title, “Evaluation of multilevel decision trees.”

This all came to mind because Sean Manning pointed me to this post, “The best cause will disappoint you: An intro to the optimisers curse.” Now that’s a good title.

It seems that the term “optimizer’s curse” came from this 2006 paper by James Smith and Robert Winkler, which has a lot of overlap with our article that appeared a year later. Both papers use hierarchical Bayesian analysis. Their paper is better than ours, for sure, and not just in the title, as they make a much better case for the importance of the problem. But we were working independently. Too bad: had we joined forces we could’ve produced something better, as each of the two papers had lots of material that was not in the other. Smith and Winkler consider the problem of choosing among many options with different levels of uncertainty, whereas we consider a multiplicity of binary decisions. These are just two cases of the general principle.

The above-linked post, by someone who goes by the handle “titotal,” is good too. It doesn’t have any new technical material, but it explains the problem in plain English from first principles, goes through some examples, and discusses some of the policy implications.

Survey Statistics: Big Changes in the Times/Siena Poll

Yesterday Nate Cohn wrote about The Big Changes Coming to the Times/Siena Poll, with more details in their poll of Maine.

Say we want to estimate average Platner support in Maine’s likely electorate, E(Y). But we only have survey respondents, R = 1.

The NYT uses survey weights to weight respondents, E(YW | R = 1). In contrast, some pollsters use MRP, fitting a Multilevel Regression model for Platner support, then applying it to the population, E(E_model(Y | X, R = 1)).
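As a toy illustration of the weighted estimate E(YW | R = 1), here is a minimal sketch with one weighting variable. The groups, population shares, and responses are invented for illustration, and this is plain post-stratification on a single variable, not the Times/Siena procedure.

```python
def poststratification_weights(sample_groups, population_shares):
    """Weight each respondent by population share / sample share of their group."""
    n = len(sample_groups)
    counts = {}
    for g in sample_groups:
        counts[g] = counts.get(g, 0) + 1
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(y, w):
    """Weighted average of responses y with weights w."""
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)


# Hypothetical sample: group A is overrepresented (8 of 10 respondents).
groups = ["A"] * 8 + ["B"] * 2
support = [1] * 6 + [0] * 2 + [0] * 2  # 1 = supports the candidate
shares = {"A": 0.5, "B": 0.5}          # assumed population shares

weights = poststratification_weights(groups, shares)
print(sum(support) / len(support))      # unweighted respondent mean
print(weighted_mean(support, weights))  # weighted estimate
```

Here the unweighted mean is 0.6, while the weighted estimate is about 0.375, which matches averaging the group means 0.75 and 0 with the population shares.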

Nate discusses 2 Big Changes to how they construct the weights W.

(The polar bear has not yet hiked in ME, but he is training for it. The above is in TN.)

Big Change 1: Support score

A few weeks ago we saw the NYT started weighting on “synthetic 2024 vote”, which is recalled 2024 vote that is validated with the voter file and imputed if needed.

Now they’re also weighting on support score = E(2024 vote | other X variables). Nate explains the motivation:

While a poll can’t weight on dozens of variables, the support score lets us pile a lot of information into a single measure.

This reminded me of the causal inference context, where D’Amour and Franks (2021) “see especially strong performance for propensity weights computed with respect to the prognostic score”, where the prognostic score is E(Y | X, control). In our survey context, this would be a model for Platner support Y. Instead, the NYT uses 2024 vote, perhaps for applicability across multiple outcomes Y?

Big Change 2: Energy balancing

Beyond adding new weighting variables, they’re also changing how they calculate the weights. Nate notes the challenge of weighting on many variables and interactions with typical sample sizes. So they are turning to the R package WeightIt, which implements the energy balancing method from Huling & Mak (2024):

This article introduces a new weighting method, called energy balancing, which instead aims to balance weighted covariate distributions. By directly targeting distributional imbalance, the proposed weighting strategy can be flexibly utilized in a wide variety of causal analyses without the need for careful model or moment specification.

The energy balancing weights do not use outcome Y, but the paper notes that estimates can be improved with a model for Y.
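To make "balancing weighted covariate distributions" concrete, here is a sketch of a weighted energy distance between a sample and a target in one dimension. It assumes the quantity being minimized is the standard energy distance, 2 E|X - Y| - E|X - X'| - E|Y - Y'|; it is not the WeightIt implementation, it does no optimization, and the data and function name are invented for illustration.

```python
def weighted_energy_distance(x, w, target):
    """Energy distance between the w-weighted sample x and an equally weighted target."""
    total = sum(w)
    p = [wi / total for wi in w]  # normalized sample weights
    q = 1.0 / len(target)         # equal weight on each target point
    cross = sum(pi * q * abs(xi - t) for xi, pi in zip(x, p) for t in target)
    within_x = sum(
        pi * pj * abs(xi - xj)
        for xi, pi in zip(x, p)
        for xj, pj in zip(x, p)
    )
    within_t = sum(q * q * abs(s - t) for s in target for t in target)
    return 2.0 * cross - within_x - within_t


# Hypothetical binary covariate: 25% ones in the sample, 50% in the target.
sample = [0, 0, 0, 1]
target = [0, 1, 0, 1]

print(weighted_energy_distance(sample, [1, 1, 1, 1], target))  # uniform weights
print(weighted_energy_distance(sample, [1, 1, 1, 3], target))  # upweight the lone 1
```

With uniform weights the distance is 0.125; upweighting the single respondent with x = 1 so that the weighted distribution matches the target drives it to zero. No outcome Y and no regression model appears anywhere in the calculation.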

How do energy balancing weights handle the challenge of jointly weighting on many variables with typical sample sizes “without the need for model specification”?

OK, I guess Lawrence “Epstein” Krauss didn’t follow his brother’s advice.

The former Arizona State University physicist reported in 2018 this advice from his “religious right wing law professor brother” [that’s Krauss’s description, not mine]:

Therefore i think you should pursue a mixed strategy. On the one hand, you should non-aggressively, soberly, suggest that the groping allegation is exaggerated but likely the result of a good faith misunderstanding. At the same time you should acknowledge that all these accusations have woken you up. You had never fully realized how vulnerable women are, and how the “me-too” campaign reflects decades of oppression and exploitation. You were blindly ignorant of, and insensitive to, this reality. this blind ignorance was all the more inexcusable in that you yourself have a daughter. You absolutely pledge that all your future behavior will reflect this newfound realization. You pledge to enroll (and indeed you should find and enroll in before making this pledge) in a program designed to educate and sensitize men to the pervasive atmosphere of sexual assault and harassment. You pledge to devote the rest of your career to this goal and to change your behavior to reflect this new realization. You pledge never ever again to make gestures that even have a slight chance of being perceived as harassing to females. You apologize profusely for all your hurtful gestures in the past, and recognize that the women who have complained about you are not making their complaints up. You were too physical in the past, you were blind to the vulnerability of women exposed to men in positions of power and influence, you abused that position and their trust even though you were sure at the time that you were doing nothing wrong. You know better now, because you understand women’s vulnerability in ways you didn’t before. You humbly ask Arizona State, or indeed any university that is interested, to give you another chance to show that you are in fact nothing but a caring, active physicist who is now more respectful of women. You are absolutely dedicated to pursuing your academic aspirations without future distractions. 
Importantly, you should do something dramatic, such as offer 100% of the royalties from your next book to some foundation that assists women who have been victims of harassment.

Jeez, what an asshole, to recommend that the “caring, active physicist” bring his daughter into his P.R. strategy.

In any case it seems that Krauss did not follow his brother’s advice. Not only does Krauss express no remorse about his own behavior, he hedges his bets on Jeffrey Epstein, referring to the financier’s “alleged criminality.” (Elsewhere he wrote that “everyone was a victim, including Jeffrey here.”)

Scroll down below Krauss’s linked post and here are the other things they recommend you read:

“Why We Need to Talk About Transgender School Shooters,” indeed. On the other hand, it seems that this is only the 325th most important thing they needed to talk about, so maybe that need wasn’t so great.

And here’s the second of those links:

I agree with the sub-heading on this one. The paradox is that Arizona State, Harvard, and other Epstein-associated universities were themselves “rewarding those who exemplify and cultivate intellectual vices.”

And, yes, I’m saying intellectual vices, not just financial and sexual vices. To the extent that Epstein stood for anything intellectually, it was the principle of recirculating B.S. from well-placed elites (as here). Also the above proposed parade of insincerity (oh, sorry, the “mixed strategy”) is an intellectual vice. For that matter, I think it was an intellectual vice for Biggar to misrepresent the position of someone with whom he had an academic and political dispute.

That’s fine. Biggar can be correct in his larger point even if he does not always live up to these ideals himself, and it’s not his fault that he happened to have published on the same website as someone who is a kind of negative illustration of his point. It’s just interesting to see the juxtaposition.

But, hey, for a mere $4500 you can go on a one-week cruise with this guy (that’s Krauss, not Biggar). I think that part of what you get for this equivalent of 3130 Jamaican beef patties is the right to come up to him on the boat and say, “Hey, Lorrie, what’s your position on the statement, ‘You had never fully realized how vulnerable women are, and how the “me-too” campaign reflects decades of oppression and exploitation. . . . You absolutely pledge that all your future behavior will reflect this newfound realization. . . . You pledge to devote the rest of your career to this goal’?” For $4500, the least he can give you is a straight answer.

Cheapskate evolutionary biologist underpays his statistical help

OK, this one was funny. I searched the Epstein files for “statistician” and found this receipt from biologist Robert Trivers:

Only $1000 for the statistician??? What a cheapskate! Especially given that he said the statistician “did an outstanding job.”

Given all the statistical problems in evolutionary biology, maybe he should’ve allocated more of his research budget to the statistician.

Some background on Trivers is here.

The Anthropic Principle in Statistics and Science (my talk this Mon 29 June, 4:20pm London time)

The Anthropic Principle in Statistics and Science

The anthropic principle in physics states that our existence implies certain constraints on the natural conditions under which we evolved. In statistics, a corresponding anthropic principle can be used to infer properties of the models we should fit to data. For example, experiments are typically aimed to have a precision sufficient to estimate effects of interest but without overkill; it is rare to have an estimate that is 10 standard errors from zero. We demonstrate through several examples in social and medical sciences how the anthropic principle, combined with Bayesian inference, can be used to improve statistical practice.

Here are a couple of applications of the idea:

• [2000] Should we take measurements at an intermediate design point?

• [2022] A proposal for informative default priors scaled by the standard error of estimates (with Erik van Zwet)

In my talk I’ll discuss these and other examples. I think this anthropic principle is really important, arguably more important in statistics than in physics, which is the field where it originated.

Here’s the zoom information for the talk on Mon 29 June, 4:20pm London time:

https://imperial-ac-uk.zoom.us/j/97341955036?pwd=1kKNbPAwJthKtG55ynXMVF3TLSvIbl.1
Meeting ID: 973 4195 5036
Passcode: J3Ue$f

I’ll be speaking (remotely) at this conference celebrating the 60th birthday of physicist Andrew Jaffe. This seems to be the season for 60th birthday conferences.

I know AJ from when he was visiting the Flatiron Institute last year. We worked together on The Squealer: Sensification of model exploration and model misfit. There’s no connection between the Squealer and the anthropic principle; I decided to speak on the latter topic because I thought it would be of general interest to an audience of physicists.

Bayesian Workflow exists as a physical book!

We’re very excited about this book. It’s the result of several years of effort. You can order from the publisher or from Amazon.

Here’s the book’s webpage, which includes the data and code for the book’s examples and case studies, of which there are many.

Here’s the table of contents:

Part 1: From Bayesian inference to Bayesian workflow
1. Bayesian theory and Bayesian practice
2. Statistical modeling and workflow
3. Computational tools
4. Introduction to workflow: Modeling performance on a multiple choice exam

Part 2: Statistical workflow
5. Building statistical models
6. Using simulations to capture uncertainty
7. Prediction, generalization, and causal inference
8. Visualizing and checking fitted models
9. Comparing and improving models
10. Statistical inference and scientific inference

Part 3: Computational workflow
11. Fitting statistical models
12. Diagnosing and fixing problems with fitting
13. Approximate algorithms and approximate models
14. Simulation-based calibration checking
15. Statistical modeling as software development

Part 4. Case studies
16. Coding a series of models: Simulated data of movie ratings
17. Prior specification for regression models: Reanalysis of a sleep study
18. Predictive model checking and comparison: Clinical trial
19. Building up to a hierarchical model: Coronavirus testing
20. Using a fitted model for decision analysis: Classification competition
21. Posterior predictive checking: Stochastic learning in dogs
22. Incremental development and testing: Black cat adoptions
23. Debugging a model: World Cup football
24. Leave-one-out cross validation model checking and comparison: Roaches
25. Model building and expansion: Golf putting
26. Model building with latent variables: Markov models for animal movement
27. Model building: Time-series decomposition for birthdays
28. Models for regression coefficients and variable selection: Student grades
29. Sampling problems with latent variables: No vehicles in the park
30. Challenge of multimodality: Differential equation for planetary motion
31. Simulation-based calibration checking in model development workflow

Appendices
A. Statistical and computational workflow for Bayesians and non-Bayesians
B. How to get the most out of Bayesian Data Analysis

One way to think of the book is that it’s all the things missing from BDA, like how to set up an informative prior, what to do when your computations aren’t converging, how to work through a series of models fit to the same data, how to design and perform simulated-data experiments . . . and all sorts of other things too.

The core of the book–parts 1 through 3–clocks in under 200 pages, and then we have another 300 pages full of case studies demonstrating different aspects of Bayesian statistical and computational workflow. The appendices should be useful to you too, first because the workflow ideas in this book apply to non-Bayesian inference too, and second because BDA still has lots of valuable material in it, so it’s good to know where to look.

This new Bayesian Workflow book could change your life (we hope), and I thank my coauthors, Aki Vehtari and Richard McElreath, with Daniel Simpson, Charles C. Margossian, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, Martin Modrák, Vianey Leos Barajas, for all their care and effort. We thank our employers and various funding agencies for giving us the resources to be able to write this book as a side project along with all our daily responsibilities. And we thank many people for their input on earlier versions of the book, along with the Stan developers making so much of this work possible and the Stan community of users for supplying a continuing series of challenges that have motivated many of the ideas and methods discussed in the book.

I posted this already on the blog and you can see answers to some questions in the comments there. I’m posting it again here because, hey, we don’t come out with a new book every day!

I hope you find the book readable, interesting, and useful.

Out of the frying pan and into the fire: Scientific American returned to form, and then this happened:

Last month I wrote the following post. I scheduled it for November, but then some Scientific American-related news arose, so I’m bumping it up in the schedule.

First, here’s my post from May:

I’m not saying this is the same Scientific American as old. Martin Gardner is long gone, and in the age of social media the articles are shorter. That’s the way of the world. But it’s got serious, interesting articles, a mix of pure science, applied science, policy, and service journalism. The latest in science without the boosterism of so much of science and technology reporting.

Last year, though, the magazine was much more political:

A bit of policy is fine, and there’s a lot of science to global warming, for sure. I wouldn’t want Scientific American to “bothsides” the issue. I’m not saying they need entirely to stick to sports, as it were. But the politicking was getting out of control. I’m glad they’ve returned to their lane.

Then the other day this happened:

Scientific American has been acquired by LabX Media Group, which holds Discover Magazine, IFLScience, and a number of other science publications. . . . And they have started out by firing writers and editors. . . .

I know nothing about LabX Media Group or the new Scientific American management, so I have no sense of whether this is a mere budget-cutting realignment or a full-on Sports Illustrated-style bust-out operation. Martin Gardner is a culture hero and deservedly so, but that was a long time ago, and those days aren’t coming back. Indeed, blogs like these, many of which are Gardner-inspired in one way or another, have taken his place.

It’s funny how magazines, even online, keep disappearing. The model of paying a magazine $50 a year for a subscription and getting a range of interesting material seems more reasonable than paying $50 each for subscriptions for a bunch of individual bloggers, but, with the exception of the New Yorker, the New York Times, and a few others, we don’t really see much of that.

One way to see this is that I’m not myself a Scientific American reader. I follow all these blogs, many of which are science themed, and each of which, in its own way, goes into more depth than I’d get from a Scientific American article. So there’s this weird thing where I’m concerned about something that I’m not reading anyway. Which is different from Sports Illustrated. Back when Sports Illustrated was a real thing, I’d buy it from time to time. I read it for the articles, as the saying goes.

That said, institutions continue in their own way. I was happy recently to see that Scientific American had pulled itself out of its politicized rut, so it’s a disappointment if it’s now getting taken apart.

“Springer Nature has removed two studies by Max Planck.”

Jim Moody points to this news article, “Why have papers by one of history’s most famous physicists been retracted? Springer Nature has removed two studies by Max Planck. A bot may be to blame.”

If you’re gonna retract something from Max Planck, I’d suggest starting here, with the notorious Manifesto of the Ninety-Three German Intellectuals defending Kaiser Wilhelm’s invasion of Belgium. Here are a couple of retractable passages:

It is not true that the life and property of a single Belgian citizen was injured by our soldiers without the bitterest self-defense having made it necessary.

It is not true that our troops treated Louvain brutally. Furious inhabitants having treacherously fallen upon them in their quarters, our troops with aching hearts were obliged to fire a part of the town as a punishment.

I guess they were the world’s most moral army. “Aching hearts” . . . that must have absolutely sucked. Really mean of those Belgians for defending themselves.

Just to be clear, I’m not saying that Planck should be “canceled.”

Who among us hasn’t retroactively disgraced ourselves with a lachrymose defense of military aggression?

I’m just saying, if you have to retract a paper by Max Planck, I’d retract that one.

P.S. The funny thing is that the above-linked article describes the famous physicist as “almost as widely revered for his character as his physics. In 1933, for example, he bravely confronted Adolf Hitler over Nazi Germany’s discriminatory laws against Jews.” I’ve never read anything about Planck’s life so I don’t know what changed with him between 1914 and 1933. Maybe the loss of the war in 1918 soured him on armed adventures.

Supplement that alphabetized display with another graph showing the states in a more informative order.

I just wrote a long post inspired by a recent post from economist Paul Krugman. Krugman’s post was good, but I’m annoyed that his graph (reproduced above) lists the states alphabetically. Don’t do that! It’s called the Alabama first error.

I would’ve put this as a P.S. on my earlier post but I was afraid that would distract people from my larger point, so I’m just raising the graphical issue here.

If the goal is to have a look-up table, then, sure, alphabetical is fine. But I don’t think that’s the point of that graph. Indeed, if you wanted a look-up table, I’d still prefer a non-alphabetical graph and then you could click to get the numbers in a spreadsheet.

How best to order the states in that graph, then? You could try different things. My first idea is to list in order of average per-capita income by state. (These rankings don’t change much over time; for clarity we could just order by average per-capita income in 2020.)

P.S. All the commenters so far are disagreeing with me, so let me reassess.

I doubt that most readers are looking at this graph to look up individual states. I think the goal is to present the general trend and variation across U.S. states. For this purpose, alphabetical order makes it hard to see systematic patterns that might be clearer using any reasonable ordering.

That said, alphabetical order has the benefit of familiarity, and given that all of you think this is important, I’m willing to believe that my take is a minority view, and maybe the designer of the graph is better off going with the majority.

So I’ll alter my recommendation. Instead of saying, “Don’t alphabetize,” I’ll say, “Supplement with another graph showing the states in a more informative order.”

Structural equation modeling (SEM) and positive definiteness

This post is from Bob.

Mitzi and I were swotting up on structural equation models (SEM) for our class this past Monday at the Modern Modeling and Methods (M3) conference at Fordham University. It was a lot of fun and now I think I understand SEM notation. I really like these applied conferences and this was a group of psychometricians, econometricians, and sociometricians. Many if not most of them thought about models in terms of SEM, so we thought we should figure it out. But I was left with a concern you may be able to help me sort out.

The example

The first worked example in Ken Bollen’s seminal 1989 textbook on SEM is a study of how industrialization relates to democracy. It comes from his paper,

  • Bollen, Kenneth A. (1979). “Political Democracy and the Timing of Development.” American Sociological Review, 44(4).

and was reprised in his book

  • Bollen, Kenneth A. (1989). Structural Equations with Latent Variables. Wiley.

I had the pleasure of sitting across from Ken at the invited speakers dinner at the conference, so I’m glad I looked into SEM before that. Good news for the SEM devotees—he released a completely revised guide to SEM a few months ago.

The data and parameters

The data consists of eleven covariates (called “indicators” in SEM) for each of 75 countries. Four of the covariates are related to democracy in 1960 (y1, y2, y3, y4), the same four measurements were taken again in 1965 (y5, y6, y7, y8), and there were three measurements of industrialization in 1960 (x1, x2, x3).

The SEM model the original researcher came up with here assumes three latent scalars per country, industrialization in 1960 (IND60), level of democracy in 1960 (DEM60), and level of democracy in 1965 (DEM65). These latent parameters are related in the following way: democracy in 1960 is a regression on industrialization in 1960, and democracy in 1965 is a regression on both democracy in 1960 and industrialization in 1960.

The covariates are then modeled like a seemingly unrelated regression in econometrics. The four democracy measurements from 1965 are treated as regressions on the latent level of democracy in 1965, and similarly for the democracy measurements in 1960 and the industrialization measurements in 1960.

Rather than independent errors, a SEM model explicitly indicates with arrows which pairs of observations are allowed to have non-zero correlation in the covariance matrix for the observations. The three industrialization observations are assumed to have zero correlation—there are no arrows between any of the three measurements in the SEM diagram. Each of the four measurements in 1960 is assumed to covary with the same measurement taken in 1965. In addition, the second and fourth measurement in each year are assumed to be correlated with each other, which leads to a box-like structure.

The SEM diagram

Here are the arrows in the diagram, where I’m not using their standard LISREL notation, but writing them in R expression syntax to indicate what is regressed on what. In their graphical notation, just replace ~ with <-. All three latent variables and all eleven measurements are indexed by country.

IND60
DEM60 ~ IND60
DEM65 ~ DEM60, IND60

x1, x2, x3 ~ IND60
y1, y2, y3, y4 ~ DEM60
y5, y6, y7, y8 ~ DEM65

The covariance structure is indicated by stating which pairs of measurements are modeled with non-zero correlation. The first four just pair the measurements of the same thing across 1960 and 1965.

y1 <-> y5
y2 <-> y6
y3 <-> y7
y4 <-> y8

The last pair of correlations are within 1960 and within 1965.

y2 <-> y4
y6 <-> y8

Together, these induce an odd box structure, where y2 is correlated with y6 and y4, both of which are correlated with y8, but y2 and y8 are assumed to have zero correlation.

y2 <-> y6
^      ^
|      |
v      v
y4 <-> y8
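To make the box concrete, here is a minimal Python sketch that turns the six pairs above into a pattern of free (possibly non-zero) versus structurally zero correlations among y1 through y8. The function name free_pattern and the boolean-matrix representation are my own illustration, not anything lavaan or LISREL uses internally.

```python
import numpy as np

# Measurement names and the correlated pairs listed in the diagram above.
names = [f"y{i}" for i in range(1, 9)]
pairs = [("y1", "y5"), ("y2", "y6"), ("y3", "y7"), ("y4", "y8"),
         ("y2", "y4"), ("y6", "y8")]

def free_pattern(names, pairs):
    """Boolean matrix: True where a correlation is free, False where it is fixed to zero."""
    idx = {name: i for i, name in enumerate(names)}
    free = np.eye(len(names), dtype=bool)  # the diagonal is always free (it is 1)
    for a, b in pairs:
        free[idx[a], idx[b]] = True
        free[idx[b], idx[a]] = True
    return free

free = free_pattern(names, pairs)
print(free.astype(int))

# The two diagonals of the box are structural zeros.
print(free[names.index("y2"), names.index("y8")])
print(free[names.index("y4"), names.index("y6")])
```

Printing the matrix makes it easy to see the four sides of the box (y2/y6, y2/y4, y4/y8, y6/y8) and its two missing diagonals (y2/y8 and y4/y6).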

Stan implementation

We didn’t get this far in my half of the class, so I will share here the Stan Playground example where I fit Bollen’s example (you can get the data and the Stan model through the Playground link).

It gets the right answer compared to lavaan/blavaan, which is nice. In the Stan code, xi is IND60 and eta1, eta2 are DEM60, DEM65. The relations among the latent parameters are modeled directly as regressions. The correlations among the observations are modeled using soft zeroing, where I just put a tight prior around zero on the structural zero elements, because Stan doesn’t give you a good way of setting up structural zeroes in a covariance matrix (Sean Pinkney or Ben Goodrich might know how to do this?).

This makes me curious how the lavaan package in R manages this. There’s a Bayesian version of lavaan built on top of Stan, blavaan. The first example right at the top of the home pages for both lavaan and blavaan is Bollen’s democracy model. I guess it’s like the Scottish lip cancer data set for spatial modeling or Fisher’s iris data for regressions.

My questions

Consider a simple diagram among measurements like the following.

x <-> y
y <-> z

This says there can be non-zero correlation between x/y and also between y/z, but the correlation between x/z is zero. It’s a simplified case of the box we saw in the actual example. These arrows imply the correlation matrix looks as follows.

|        1  rho[x,y]         0 |
| rho[x,y]         1  rho[y,z] | = Omega
|        0  rho[y,z]         1 |

Given that the correlation matrix Omega must be positive definite, this limits the range of rho[x,y] and rho[y,z]. For example, we can’t have rho[x,y] = rho[y,z] = 0.9 with rho[x,z] = 0; at those values, rho[x,z] would have to be greater than zero to maintain positive definiteness.
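Here is a quick numerical check of that claim, as a sketch in Python with NumPy. The helper names chain_corr and is_positive_definite are mine. With rho[x,z] fixed at zero, the determinant of Omega is 1 - rho[x,y]^2 - rho[y,z]^2, so the matrix is positive definite only when rho[x,y]^2 + rho[y,z]^2 < 1.

```python
import numpy as np

def chain_corr(rho_xy, rho_yz, rho_xz=0.0):
    """3 x 3 correlation matrix for (x, y, z) with the given pairwise correlations."""
    return np.array([[1.0,    rho_xy, rho_xz],
                     [rho_xy, 1.0,    rho_yz],
                     [rho_xz, rho_yz, 1.0]])

def is_positive_definite(m):
    """A symmetric matrix is positive definite iff all its eigenvalues are positive."""
    return bool(np.all(np.linalg.eigvalsh(m) > 0))

bad = chain_corr(0.9, 0.9)  # 0.81 + 0.81 > 1, so not positive definite
ok = chain_corr(0.6, 0.6)   # 0.36 + 0.36 < 1, so positive definite

print(is_positive_definite(bad), np.linalg.det(bad))
print(is_positive_definite(ok), np.linalg.det(ok))
```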

Q1: Why doesn’t SEM instead say that the correlation rho[x,z] is just the minimum value it can be given rho[x,y] and rho[y,z]? I’m suggesting that we instead treat the above diagram as implying no additional correlation between x and z other than that implied by the correlation between x and y and the correlation between y and z. That is, why try to shrink rho[x,z] all the way to zero? From the text, it feels like the motivation is to enforce zero correlation in the model. But all this is doing is simplifying regressions; it won’t actually enforce zero correlation among the measurements that are modeled with zero correlation. I wish I’d asked Ken this question at dinner, but I’ll ping him about this blog post and hopefully get a response.
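To put numbers on Q1, here is a small sketch of the range of rho[x,z] that keeps a 3 x 3 correlation matrix positive definite given the other two correlations. Requiring the determinant 1 + 2 rho[x,y] rho[y,z] rho[x,z] - rho[x,y]^2 - rho[y,z]^2 - rho[x,z]^2 to be positive gives an open interval centered at rho[x,y] * rho[y,z] with half-width sqrt((1 - rho[x,y]^2) * (1 - rho[y,z]^2)). The function name rho_xz_interval is just for illustration.

```python
import math

def rho_xz_interval(rho_xy, rho_yz):
    """Open interval of rho[x,z] values for which the 3 x 3 matrix is positive definite."""
    center = rho_xy * rho_yz
    half_width = math.sqrt((1 - rho_xy**2) * (1 - rho_yz**2))
    return center - half_width, center + half_width

lo, hi = rho_xz_interval(0.9, 0.9)
print(f"{lo:.2f}, {hi:.2f}")  # 0.62, 1.00

lo2, hi2 = rho_xz_interval(0.6, 0.6)
print(f"{lo2:.2f}, {hi2:.2f}")  # -0.28, 1.00
```

So with rho[x,y] = rho[y,z] = 0.9, zero is well outside the feasible range, whereas with 0.6 it is inside. One possible reading of “no additional correlation” is the center of the interval, rho[x,y] * rho[y,z], which is the value at which the partial correlation of x and z given y is zero; that is my gloss rather than anything in the SEM literature I’ve checked.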

Of course, in the pragmatic Bayesian workflow, we’d use posterior predictive checks to evaluate whether there’s unmodeled correlation between x and z.

Q2: I’m also curious what Andrew and others think about enforcing structural zeroes in correlation between measurements as opposed to just estimating a dense covariance matrix and inspecting where the correlations fall.

Getting justice can require a lot of effort, and usually at some point we’ll just give up, which is what the cheaters rely on.

I just read this compelling op-ed by Brendan Ballou, “One Man Stole $660 Million. He’ll Never Pay It Back,” which tells the story of several brazen white-collar criminals who avoided prosecution for federal crimes by the simple expedient of bribing the president of the United States. Ballou argues, though, that there could still be ways of catching these guys:

In a world where the Department of Justice and the president are either indifferent to or actively support rich criminals, what can be done? Fortunately, there is a range of legal tools that ordinary citizens can use to pursue civilly the sort of corruption that would ordinarily be prosecuted criminally.

The shareholders potentially cheated by Mr. Wiederhorn could sue the Trump inaugural committee under the federal civil RICO law — written to destroy the Mafia — for seemingly helping to secure Mr. Wiederhorn’s freedom. Companies that follow the law can sue rivals, like Binance, that do not, under California’s Unfair Competition Law. And investors scammed by Mr. Milton can sue the political committees he donated to if they were “unjustly enriched” by his scheme. . . .

When regular citizens can’t act themselves, they can pressure their local prosecutors to do so. Recall Mr. Homan’s $50,000 in cash from undercover F.B.I. agents. This Justice Department may not continue the investigation. But Mr. Homan’s personal business is headquartered in Virginia, and it would be awfully interesting to find out whether Mr. Homan reported that money on his state tax returns. If he didn’t, he may well have committed a crime. . . .

He concludes:

Criminals and government officials are barely hiding their schemes, and their brazenness is meant to make us feel helpless, to think that nothing can be done. That is false. We already have the legal tools to fight corruption. We just need to use them.

This is inspirational and I hope someone does all of this.

My point in the present post is that getting justice can require a lot of effort.

Here’s an example. The other day I was talking with someone about research fraud, and he characterized the Michael Lacour story as the biggest scandal ever in political science. I disagreed. It was my impression that Lacour had been forgotten (here’s some background), but what about the time that the American Political Science Association gave an award to a plagiarized book? Here’s the story. I’d never heard of any of the people involved in that episode, but it incensed me that APSA had done this.

I wasn’t the only angry person. Indeed, I’d heard about the Frank Fischer case from Alan Sokal, who’d emailed an academic official at Rutgers University, where the plagiarist worked, but there was no useful response. So I decided to take a whack at it. I sent off this email to the people on the committee that had given that award:

Dear APSA Public Policy Section:

I learned recently that you gave your 2017 Aaron Wildavsky Enduring Contribution Award to Frank Fischer for his 2003 book Reframing Public Policy. I was surprised to hear this, given that the book appears to have plagiarized material. For background, see this document by Krešimir Petković and Alan Sokal:
https://chronicle-assets.s3.amazonaws.com/5/items/biz/pdf/plagiarism_fischer.pdf
and this note by Petković:
https://chronicle-assets.s3.amazonaws.com/5/items/biz/pdf/Petkovic_Experiment_with_CPS.pdf
and this news article for further background:
https://www.chronicle.com/article/alan-sokal-takes-aim-at-an/124969

Petković, a political science graduate student in Croatia, found places in Fischer’s 2003 book where he had used materials from previously published work by others without giving full attribution. In addition to copying without attribution (as Petković writes, Fischer mentions the book he copied from, but nowhere near the copied passage), Fischer also makes mistakes such as misspelling authors’ names and reproduces errors that arose in the original sources.

Two of the works from which Fischer copied in his 2003 book without appropriate attribution are:

Majone, Giandomenico, 1989. Evidence, Argument, and Persuasion in the Policy Process. New Haven: Yale University Press.

Walsh, David, 1972. Sociology and the Social World. In: Filmer, Paul, Phillipson, Michael, Silverman, David and Walsh, David, New Directions in Sociological Theory. London, Collier-Macmillan: 15-35. [Also published by MIT Press, Cambridge, Mass., 1973.]

I am not an expert in this area and have no intention of pursuing any formal process here. Indeed, I am not even a member of APSA. However, I am a political scientist and, as such, am distressed to see APSA promoting plagiarism.

My recommendation is that you retract the award. If that is too difficult, one thing you could do is retroactively also give this award to Majone (1989) and Walsh (1972). It does not seem fair that they did the work and someone else gets the award, no? I do not know Prof. Fischer and am making no judgment regarding the quality of his writing. It may be that it is indeed an enduring contribution to the field; if so, all authors of this enduring contribution should be recognized.

Yours,

Andrew Gelman
Professor, Department of Statistics
Professor, Department of Political Science
Columbia University, New York

P.S. I have also cc-ed the members of APSA’s Committee on Professional Ethics, Rights, and Freedoms.

From APSA’s guide to professional ethics:

“7. Political scientists, like all scholars, are expected to practice intellectual honesty and to uphold the scholarly standards of their discipline.

7.1 Plagiarism, the deliberate appropriation of the work of others represented as one’s own, not only may constitute a violation of the civil law but represents a serious breach of professional ethics.

7.2 Departments of political science should make it clear to both faculty and students that such misconduct will lead to disciplinary action and, in the case of serious offenses, may result in dismissal.”

A few months later I followed up:

Hi all. I was just wondering what happened with this. As I wrote last year to **, I am not submitting a formal grievance or complaint. I just wanted to let the committee be aware of this situation so that they can have the opportunity to fix it.
So I was interested to find out how things have progressed, as it seems to be an embarrassment to APSA to have given a major award for a book with plagiarized material!
Andy

After several months I hadn’t heard back from the committee so I pinged them in June. A couple weeks later they got back to me and said they couldn’t do anything because it had not been submitted as a formal complaint.

Fair enough. I didn’t think it would be right for me to file the complaint myself, given that I’m not at all knowledgeable about this area of political science.

Meanwhile, the books that had been plagiarized, Majone (1989) and Walsh (1972), never got that award. Doesn’t seem fair to me!

Anyway, my point is that it takes work to pursue these things, and it’s more my inclination to point out the problem than to go through the political and administrative steps needed to rectify the problem.

I’m not dissing “the political and administrative steps”–I have a lot of respect for people who can do these things!–it’s just not something that I’m good at.

Here’s another example. I once had a colleague who plagiarized my work. When I realized what was going on, I was stunned. But then, looking back, I realize that I’d been warned of this behavior years earlier, indeed my memory flashed back to a time that I’d seen something else he’d plagiarized from me, and I’d just kind of filed that image in my mind and forgotten it. My collaborator and I had a good thing going, and, hey, nobody’s perfect, so it was easier to look away. When I confronted him about the plagiarism–this was a long time ago–he kind of wriggled around, saying that he didn’t want to share credit with me on the project I’d been working on with him–at one point I was dictating formulas to him over the phone–but we could jointly write a separate article on the topic. This just pissed me off, but, ultimately, he won, in the sense that he correctly calculated that I was rational enough not to want to get involved in a major scandal early in my career. Yes, he’s the one who would’ve looked bad had I raised a formal complaint, but it wouldn’t have done my reputation any favors to be seen as a complainer. Also, though, I won, in that I stopped my involvement in this project and I moved on to better collaborators.

The episode bothered me (which is why I keep talking about it), but my cost-benefit analysis led to the decision to not file a formal complaint. That’s the decision-theory analysis. The game-theory analysis is that my colleague could see ahead to the next move: he knew I was rational and that it would be a net loss to me to make a fuss about his actions, and I expect that this minimax analysis led him to the conclusion that he’d be safe in plagiarizing me. Yes, he was taking a risk to his reputation in doing so, but it was a calculated risk, in his mind less than the expected benefit to his reputation of taking full credit for this part of our joint research.

What should be done?

I’m not sure. In academic scandals, maybe it’s best just to move on. So what if some obscure political scientist got some award that he didn’t deserve? So what if some researcher publishes substandard work because he decides to not credit a collaborator? Worse things happen every day in academia. Indeed, if you want to talk about the worst scandal in modern political science, I might give the nod to Samuel Huntington’s book, The Clash of Civilizations and the Remaking of World Order, not because of plagiarism or anything like that, but just because arguably it’s had a large and malign influence in the world. Given all the problems in social science, plagiarism is the least of our concerns. So, although it annoys me, ultimately I think the appropriate strategy is to just let it happen, to talk about it but not to worry about seeking justice.

When it comes to business and government corruption, though, I agree with Ballou that something should be done. Legislatures should be writing laws, local and state governments should be prosecuting, lawyers should be suing, etc. These guys are stealing, giving and taking bribes . . . this is the kind of thing that degrades the entire economic and political system.

So, again, I hope some people make some of the moves that Ballou recommends. They should just be aware that it will take a lot of effort and persistence.

Treating AI review like the contentious policy design problem it is

This is Jessica. Many researchers are thinking about what we should do about scientific peer review now that AI makes producing papers so much easier. Submission numbers keep getting higher: in the past week, I saw reports that the most recent ACL submission cycle got 17k+ submissions, up from ~10k last cycle. TMLR went from getting 500 submissions every 60 days or so to getting the same number every 19 days. There are simply not enough human reviewers to handle the surge, at least not without a dip in quality. The noisier the review system gets, the greater the incentive to submit sloppy papers, because you might get lucky. This is the so-called “review death spiral.”

It is a hard problem. Quotas on submissions per author are one avenue forward, which TMLR just announced it would adopt. Not surprisingly, many reviewers are also turning to AI to help. The question becomes how to design AI review protocols to help reduce some of the noise, through preliminary filtering or flagging or helping guide human attention to parts of a paper that are most likely to be problematic. 

But what sorts of checks should an AI review assistant run on a paper? It’s useful to separate basic integrity violations AI could flag, like is there evidence of plagiarism, fake citations, missing code/data to reproduce main results (which are comparatively less controversial) from “epistemic filters,” like does the paper pass replicability checks, robustness checks, preregistration checks, statistical significance checks, etc. There’s a temptation to blur these things in proposing how to apply AI to review. It’s easy to assume that the metascientists have already established that practices like replicability or preregistration are truth-indicating and we can just implement them at scale (and indeed, ML researchers are citing open science and other reform arguments to back their proposals).

But if there’s one lesson to be learned from the aftermath of the replication crisis, it’s that there is no small, stable, non-conflicting set of detectable signals of good science that will find the good stuff and reject the bad. There are heuristics that can be useful prompts for deliberation – get in the habit of preregistering, make sure you can replicate your results, test the sensitivity of your results to choices you made along the way – but things get weird when we start treating them like universal requirements. Authors shift attention away from unrewarded signals, like better theory or exploratory work, and become preoccupied with rigor signaling through their methods. The result is not necessarily more thoughtfulness. 

And so even if the AI review tools we create are simply intended to inform human reviewers about what checks a paper passed, what we implement will have important policy implications by incentivizing more work like that in the future. I don’t think we are in a good position to predict what happens if suddenly we require multiverse robustness or statistical significance in a field like machine learning, which has in many ways been all about iterative improvement and “frictionless reproducibility” rather than individual results passing all the robustness checks.

The answer is not to avoid using AI in review until we can find a non-gameable set of credibility qualities to have AI focus on, as some have recently argued (though I agree with the linked paper that we need more rigor in how we go about motivating review tools). Non-gameability sounds nice, but any automated review policy that allocates attention will be gameable, because ensuring good science is not so simple as finding the right checklist. The relevant question is instead what assumptions and downstream incentives we are willing to tolerate. To this end, at the very least we should get in the habit of spelling out the assumptions we’re making, so that the trade-offs of focusing on particular proxies become explicit.

I wrote up this view recently in a paper called “Stop Treating Metascientific Heuristics as Quality Filters in AI Review.” Here’s the abstract: 

AI-implemented checks for reproducibility, robustness, preregistration, claim scope, and other intended proxies for scientific credibility can extend human reviewers’ capabilities. However, treating metascientific heuristics–whose theoretical grounding remains contested or incomplete–as necessary and sufficient signals for filtering out bad science is counterproductive to scientific progress. The emerging literature blurs the line between integrity filtering, based on necessary but insufficient signals of validity like reproducibility of stated results or lack of fake citations, and epistemic filtering, which uses machine-detectable signals to judge scientific quality. Drawing on critical metascience, we show that commonly proposed signals of research quality are insufficiently justified as general indicators of scientific value. The answer is not necessarily to ban AI in review, given the deluge of submissions venues are facing. Instead, in recognition of how any use of automated signals–even when deployed with human oversight–will shape attention and create incentives upstream, developers of AI review tools should explicitly specify their assumptions about how proxy signals inform on scientific quality in the context of specific review decisions. This approach treats AI review contributions as contestable decision policies that will shape future research, acknowledging the value-laden nature of scientific judgment and surfacing relevant tradeoffs. 

Rather than arguing for or against any particular proxies, I’m more interested in the methodological and philosophical mindset we should bring to the new questions raised by AI review. To demonstrate what I mean by more explicit motivation, I analyze an example review decision problem and set of detectable signals in the appendix, drawing on an analysis of how statistical significance and exact replication success relate to signal-to-noise ratios measured under error from a recent paper by Eric van Zwet, Andrew, and Witold Więcek. The takeaway is that the value of a proxy will depend on how you define the latent state you care about (e.g., whether the direction of an effect was correctly estimated, how big the true signal-to-noise ratio is), what you assume about the generating process (i.e., how the proxy noisily reflects the latent state), and what you assume about the decision-maker’s choice of actions and utility function. By suggesting this approach, I am *not* suggesting that one can validate a new review tool’s utility before it’s been deployed. The point is that there will be trade-offs no matter what, and the best we can do is be concrete about the kinds of assumptions that have to hold for proxies to be useful in review, so the community can debate what risks they are willing to accept.

In this sense, my argument is very much along the same lines as Devezer et al.’s argument that those proposing reform procedures should adopt more formal methodology to avoid unwarranted overgeneralization. Once checks become part of review infrastructure, they stop being neutral diagnostics and become policy levers. Let’s start treating them as such in research on AI review.

“Howard Lutnick gives top Cantor Fitzgerald jobs to his sons Brandon and Kyle” is a very clean example of meritocracy.

In a post about possible corruption in the government and finance sector, Paul Campos points to a news article entitled, “Howard Lutnick gives top Cantor Fitzgerald jobs to his sons Brandon and Kyle,” that features an adorable photo of the three Lutnicks standing next to a fashion model.

Campos labels this as, “The Meritocracy!”, and clearly he’s being ironic: his point is that it seems unlikely that these two twenty-somethings are really the people with the most merit needed to run this zillion-dollar company. All things are possible, but it would be an amazing coincidence if, among all the possible financial executives out there, these two would happen to be the best.

And, sure, I get that.

But now I want to point to my old post on the topic, Meritocracy won’t happen: The problem’s with the “ocracy.”

The short version is that the news item, “Howard Lutnick gives top Cantor Fitzgerald jobs to his sons Brandon and Kyle,” is a very clean example of meritocracy. Lutnick Sr. had the merit (in whatever sense) that took him to the top of the heap, and he used that merit to get jobs for his kids: that’s the “ocracy” part.

If all that merit did was get you top jobs and lots of money, that’s not meritocracy, that’s just merit-based employment and pay. What makes it “meritocracy” is that the people with the merit don’t just get nice jobs, they also get to be in charge of everything (“ocracy”). And one thing you do when you’re in charge is take care of your kids!

As Mark Palko discussed over ten years ago, our society seems to have become more tolerant of nepotism. Or maybe the point is that nepotism has always been a thing, but in recent years there’s been more of an effort by rich people and the news media to portray nepotistic hires as having special merit of their own. This is not to say that children of the successful cannot make great contributions themselves—John Quincy Adams comes to mind, also Julian Lennon had that cool song a few decades ago where he sounded just like his dad, so that’s something too. And then there was Oliver Wendell Holmes, Jr., who surpassed his famous father in achievements. And Alexander of Macedon didn’t do so bad either.

Anyway, “meritocracy” implies that the people with merit rule society, and they’ll use their power to help their kids.

Nepo babies aren’t a counterexample to meritocracy, they’re a central part of it.

To select or not to select?

This post is by Aki

New preprint To select or not to select: predictively consistent priors instead of model selection with Anna Elisabeth Riha, Leevi Lindgren, David Kohns, Paul Bürkner and me. arXiv.2606.22850

tl;dr: Model selection is not a substitute for building good models in the first place.

Abstract: Bayesian modelling workflows often consider multiple candidate models of varying complexity. Model selection is commonly used to navigate potential trade-offs between model complexity and generalisability to new data. We study when model selection is unnecessary or can even be harmful for predictive performance in finite data regimes and find that the need for selecting simpler models can depend on prior choice. We formalise predictively consistent priors, which keep prior predictive implications stable as model complexity increases. Across examples and numerical experiments, including adding covariates in linear and logistic regression, forward variable selection, and nonlinear modelling, flexible models with predictively consistent priors typically match or outperform selected simpler models in out-of-sample predictive performance. When selection helps, it can indicate poor joint prior implications, such as excessive prior mass on implausible predictive values. Based on our findings, we propose replacing the notion of sparsity or parsimony at the level of model components with specifying priors that remain sensible in predictive space as models become more complex.

These ideas have been around, but there was no single easy paper to refer to explaining and illustrating some important aspects of model selection. Sure, model selection can reduce overfitting, but even better is to use big models and predictively consistent priors.

This is a long (76 pages) slow science paper. I had been showing variants of some plots in my talks years ago, but polishing the explanations and adding more theory took a long time. Anna, Leevi, David, and Paul all did great work on this.

Survey Statistics: perfect collinearity in the sample but not in the population

In 2019, Andrew blogged about collinearity in Bayesian models. In the comments, he pointed to an example from Bayesian Data Analysis, 2nd edition (BDA2). I think it is a useful example to keep in mind when extrapolating from sample to population. Since folks (like me) may only have BDA3 on their shelf, I thought I’d talk thru it.


Pretend it is 1980 and we are at the US Census Bureau. We just revamped the occupational coding system, and it’s so much better! We want 1980-style codes on all our old data that only had 1970-style codes. Let’s trade in our peasant blouses for some shoulder pads.

Say we have double-coded training data (n = 10,000) with:

  • O_1980 = occupation coded in the 1980 coding system
  • O_1970 = occupation coded in the 1970 coding system
  • E = education, either high or low
  • I = income, either high or low

We want to impute O_1980 for the single-coded full dataset (N = 1,000,000) with only O_1970, E, and I.

Consider everyone with a specific occupation according to the 1970 codes, e.g. Accountants. Say there are 200 accountants in the double-coded training data and they have either high income and high education or low income and low education. They have either OCCUP1 or OCCUP2 according to the 1980 codes.

From BDA2 Table 9.1:

Say we use standard regression software to fit p(O_1980 | O_1970 = Accountants, E, I). It will flag the predictors E and I as perfectly collinear, because in the double-coded training sample, education and income are perfectly correlated.

Suppose you drop education and use only income. The single-coded data actually has some low education and high income folks. The model only uses income, so 90% of them get OCCUP1. But suppose I drop income and use only education. My model only uses education, so only 10% of them get OCCUP1. Who is correct?
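Here is a minimal Python sketch of the two competing answers. The cell counts are hypothetical: I assume 100 high/high accountants with 90 coded OCCUP1 and 100 low/low accountants with 10 coded OCCUP1, which is consistent with the 200 accountants and the 90%/10% figures above but is not copied from Table 9.1. The helper share_occup1 is likewise just for illustration.

```python
import numpy as np

# Hypothetical cells: (education, income, number of accountants, number coded OCCUP1).
# 1 = high, 0 = low. Only high/high and low/low appear in the training sample.
cells = [(1, 1, 100, 90), (0, 0, 100, 10)]

edu = np.concatenate([np.full(n, float(e)) for e, i, n, k in cells])
inc = np.concatenate([np.full(n, float(i)) for e, i, n, k in cells])
occup1 = np.concatenate([np.r_[np.ones(k), np.zeros(n - k)] for e, i, n, k in cells])

# Design matrix with intercept, education, income: 3 columns but only rank 2,
# because education and income are identical in this sample.
X = np.column_stack([np.ones_like(edu), edu, inc])
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)

def share_occup1(predictor, value):
    """Observed share of OCCUP1 among training units with predictor == value."""
    return occup1[predictor == value].mean()

# Imputing for an 'E = low, I = high' accountant from the single-coded data:
p_income_only = share_occup1(inc, 1.0)     # model that kept income
p_education_only = share_occup1(edu, 0.0)  # model that kept education
print(p_income_only, p_education_only)
```

Both one-predictor models fit the training sample equally well, yet they give opposite imputations for the cell the sample never saw.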

As the authors say:

the truth is that we have essentially no evidence on the split for these units… the occupational split for the ‘E=low, I=high’ units should vary between, say, 90/10 and 10/90. … If some variable should or could be in the model on substantive grounds, then it should be included even if it is not ‘statistically significant’ and even if there is no information in the data to estimate it using traditional methods.

 

Mind-body healing: An exchange.

This has come up a few times on the blog already:

Carroll/Langer: Credulous, scientist-as-hero reporting from a podcaster who should know better

7 steps to junk science that can achieve worldly success

A suggestion for Freakonomics and Sean Carroll: Interview Nick Brown

Two researchers in the Harvard psychology department published a paper reporting that they could make people heal faster by telling them that more time had passed. Nick Brown and I looked at this paper carefully and didn’t think that it offered good evidence for its claims. Meanwhile, the paper was promoted uncritically in various media outlets.

As I wrote a couple years ago, to the extent that healing is important, I think it’s important not to overstate evidence for speculative claims about what works. Individual and societal resources are limited. If you want to say something like, “Sure, this is pie-in-the-sky research, but if it works it would be wonderful (‘kind of amazing,’ as physics podcaster Sean Carroll might say), so it deserves our attention, respect, and funding as a high-risk, high-return possibility” . . . go for it. That argument could be made. But then that argument should be made. Don’t fudge it by acting as if there’s evidence that isn’t really there.

Nick and I published an article in a psychology journal discussing the problems with the paper in question, framing it as a more general exploration of how scientific errors can propagate. One of the authors of the original paper then published an article in that journal arguing that we had gotten it wrong and that they really did have strong evidence. Nick and I didn’t find their response convincing on scientific or statistical grounds, but we thought it could possibly be rhetorically effective: just as a piece of writing, if you read it in isolation, it might make you think that we were full of crap. So we closed the loop by replying in the journal, basically restating what we’d said in our earlier article.

The four articles are in different places online and I thought it could be helpful to have all of them in the same place. So here they are:

Peter Aungle and Ellen Langer (2023), Physical healing as a function of perceived time:

In this study we wounded study participants following a standardized procedure and manipulated perceived time to test whether perceived time affected the rate of healing. We measured the amount of healing that occurred across three conditions using a within-subjects design: Slow Time (half as fast as clock time), Normal Time (clock time), and Fast Time (twice as fast as clock time). Based on the theory of mind–body unity—which posits simultaneous and bidirectional influences of mind on body and body on mind—we hypothesized that wounds would heal faster or slower when perceived time was manipulated to be experienced as longer or shorter respectively. Although the actual elapsed time was 28 min in all three conditions, significantly more healing was observed in the Normal Time condition compared to the Slow Time condition, in the Fast Time condition compared to the Normal Time condition, and in the Fast Time condition compared to the Slow Time condition. These results support the hypothesis that the effect of time on physical healing is directly affected by one’s psychological experience of time, independent of the actual elapsed time.

Andrew Gelman and Nicholas Brown (2024), How statistical challenges and misreadings of the literature combine to produce unreplicable science: An example from psychology:

Given the well-known problems of replicability, how is it that researchers at respected institutions continue to publish and publicize studies that are fatally flawed in the sense of not providing evidence to support their strong claims? We argue that two general problems are: (a) difficulties of analyzing data with multilevel structure and (b) misinterpretation of the literature. We demonstrate with the example of a recently published claim that altering patients’ subjective perception of time can have a notable effect on physical healing. We discuss ways of avoiding or at least reducing such problems, including comparing final results to simpler analyses, moving away from shot-in-the-dark phenomenological studies, and more carefully examining previous published claims. Making incorrect choices in multilevel modeling is just one way that things can go wrong, but this example also provides a window into more general problems with complicated designs, cutting-edge statistical methods, and the connections between substantive theory, experimental design, data collection, and replication.

Peter Aungle, Daniel Chen, and Nicholas Holmes (2026), Beyond Statistical Myopia: Replying to a Misguided Critique of Mind-Body Research:

In response to Gelman and Brown’s recent critique of Aungle and Langer (2023), we argue that their article illustrates how narrow statistical reasoning and selective literature review can misrepresent and undermine credible scientific findings. Using their discussion of perceived time and physical healing as a case study, we identify three general problems: (a) a failure to accurately characterize the methods and results of the study they critique, (b) misinterpretations and omissions in their review of the relevant literature, and (c) a tendency to generalize from isolated statistical issues to sweeping claims about the invalidity of mind-body research. We adopt Gelman and Brown’s recommended model and find that the main effect remains robust. We also document errors in their interpretations of other cited studies and demonstrate that they ignore decades of rigorous, well-replicated research on placebo effects and health mindsets. By examining their critique in detail, we highlight how methodological skepticism, when untethered from accurate reading and balanced appraisal, can mislead rather than clarify.

Nicholas Brown and Andrew Gelman (2026), This is the reason for external replication: Response to Aungle et al. (2026):

In an earlier article we addressed a controversy regarding a form of mind-body healing, arguing that a recent paper had overstated evidence from experiments and from literature review. In reaction, one of the authors of that paper disputed our claims. Here we explain why we remain skeptical.

The short answer is that, no, we don’t see any evidence that manipulating people’s subjective experience of time will help them heal better, nor do we see evidence that telling people that they’re exercising will get them to lose weight without there being any changes in their diet or exercise, or various other things claimed in that original paper. I do think it’s possible for researchers, through a combination of sloppy statistics, forking paths, and inaccurate literature review, to create an impression of a strong body of evidence even when nothing is going on–this was a point made eloquently in the classic 2011 article by Simmons, Nelson, and Simonsohn. And I think this combination is enough not just for people to mislead others, but, more importantly, to fool themselves, which can then allow them to spread misunderstanding in the scientific literature, the popular press, and, yes, NPR, Ted, and podcasts.

The whole thing makes me sad, to see researchers caught in a loop of misunderstanding so that, even after their mistakes are pointed out to them, they double down and remain confused. There’s no way that the authors of the above papers will agree with me on this point, and maybe they will find all this to be condescending, but I’m completely sincere here. It makes me sad to see people aim their careers in this direction. The good news is that over the years I’ve received many many emails from young researchers who see this sort of thing going on in their labs and want to do better. I guess the best way to get a grip on this problem is to see how others have been trapped in it.

Golems, auditors, and AI

This post is by Phil.

Some time ago I wrote some thoughts about “Neuromancer” ( https://statmodeling.stat.columbia.edu/2025/06/12/what-does-neuromancer-have-to-teach-us-about-the-role-of-ai-is-society/ ), which features two kinds of artificial intelligence, one of which seems like it could be realized with a Large Language Model, i.e. we could pretty much make it today. The other is something more powerful, an artificial general intelligence that not only has computational power but also imagination and desires. I think it’s an open question whether an LLM can have genuine desires (and even a genuine imagination) as opposed to being able to pretend that it does. Also an open question whether that distinction even makes sense to talk about.

I’ve read some other fiction within the past few months that has also given me things to think about, AI-wise.

First there was Feet of Clay, by Terry Pratchett. Pratchett writes lightweight, fun, but generally forgettable fantasy novels. I mentioned that book in an earlier post, https://statmodeling.stat.columbia.edu/2026/01/21/what-a-coincidence-what-a-coincidence/ , because it uses a rare plot device that happened to crop up in the very next book that I read. But I mention it now for a different reason: in the book there are golems (animated, artificial humanoids in Jewish folklore, created entirely from inanimate matter such as clay or mud) that are treated pretty much like robots. A golem’s operating system is written on a piece of paper contained in its head. In the book, golems are treated like we treat industrial robots or Roombas or similar: they are given simple, repetitive tasks at which they work, sometimes day and night. Nobody feels bad about using them however they want, because the golems have no emotions. Or do they? In the book some golems get together and create a golem of their own, and give it instructions that are…well, basically they are trying to create something more human. Of course, the fact that they desire to do such a thing suggests that they are not in fact emotionless objects.

Well, I just read another Pratchett book, “Thief of Time”. (Spoilers follow. Stop reading here if you want to read this book and be surprised.) This book has beings called ‘auditors’ who are responsible for maintaining order in the universe. They are described as being nearly emotionless except for hating disorder. To them, humans pretty much personify disorder so I think they could be said to hate humans. To better understand humans so they can learn to control us better, some of the auditors create human bodies for themselves and occupy them…and, uh oh, with the bodies come emotions. They get hungry, they can feel pain, things taste good or taste bad, etc. As they strive to satisfy their bodies’ desires, they start to act more and more like humans. They want things.

I mention this here because it touches on something I wonder about AIs, or at least LLMs: can they have desires? Certainly they can be told to _pretend_ they do — one could prompt an LLM to pretend that it wishes to take over the world, for example — but would it _really_ “want” to take over the world? Would it want anything at all?

Thinking about those kinds of questions, I realized that I don’t understand human emotions and sensations either. I don’t see how a bunch of computer circuits can be made to feel pain, but I also don’t understand how a bunch of nerves and neurons can feel pain either. I can understand how either one can respond to stimuli — if the temperature at this point exceeds such-and-such a temperature, fire these muscles — but I’m talking about the _sensation_ of pain. How does that arise? And is there something about a computer that works with voltages on a chip that prevents it from being able to have that sensation? Do nerves and brains somehow allow a sensation that literally cannot be duplicated in silico?

Sadly, Thief of Time did not answer any of those questions for me. But it did get me thinking about them, so I guess that’s something.


Workshop on Rethinking the Role of Bayesianism in the Age of Modern AI

Esmeralda Whitammer, Sara Wade, Vincent Fortuin, Konstantina Palla, and Theodore Papamarkou write:

We are organising a focused workshop on Rethinking the Role of Bayesianism in the Age of Modern AI from October 26 to 30, 2026, bringing together researchers exploring the frontiers of Bayesian machine learning and deep learning. The meeting will take place in Edinburgh, Scotland, UK, and will be hosted by the University of Edinburgh’s School of Informatics.

This workshop follows in the footsteps of the meetings held at Dagstuhl in 2024 and MBZUAI in 2025. This year, the meeting is growing and becoming an official event of the International Society for Bayesian Analysis (ISBA)’s new section on Bayesian AI. We are planning to maintain the collaborative and interactive spirit of the previous meetings, with a programme that includes talks, panel discussions, poster sessions, and ample time for interaction among participants representing a wide range of perspectives and expertise.

Looks interesting! They should invite Aki for sure.