The high cost of split R-hat

This post is by Bob.

I’ve been thinking a lot lately about R-hat given that I’m using it for online converging monitoring in our new Walnuts implementation. In that setting, where I use Welford accumulators to update R-hat estimates every iteration, I can’t use split R-hat without way too much buffering. So I’ve been thinking about the effect of splitting, too, and whether we need it. I asked Andrew and he said Kenny Shirley once produced an example where split R-hat diagnosed non-convergence that regular R-hat didn’t, but that example is lost to time and we’ve never seen this kind of behavior with NUTS as far as I know (please give us an example in the comments or via email to Andrew if you have).

Relating R-hat and ESS

My intuition was that we could set a low enough R-hat threshold that it would ensure a high enough effective sample size (ESS) when we crossed it. The relation’s a little tighter than I thought, with

    Rhat^2 ≈ 1 + M / ESS,

where M is the number of chains and ESS is the effective sample size of all chains combined. There’s a multivariate proof in Vats and Knudson, 2021, Revisitng the Gelman-Rubin diagnostic, Statistical Science, page 2 and section 5 for details, but it’s pretty straightforward to get the intuition when you reduce R-hat^2 to (N-1)/N + var(chain-means) / man(chain-variances) as Charles Margossian did in his nested R-hat paper. Vats and Knudson disapprove of Andrew and Aki’s suggested threshold of 1.1 from BDA3, because it is satisfied with a combined ESS of 20 across Andrew’s default 4 chains.

Being me, I tried to validate my intuition with simulations rather than linear algebra. Also, I like to see that things work in practice that theory entails to make sure I’ve understood all the assumptions baked into the theory (one can’t prove anything without assumptions!). When asked to code a simulation using ArviZ, Claude inserted a (2 * M) in the numerator in place of the M. Where did that come from, I asked? It told me it needed the factor of 2 because ArviZ uses split Rhat. D’oh! Of course it does, because we’ve doubled M without increasing ESS.

A worked example

Suppose we have 4 chains with a combined ESS of 400. Then sqrt(1 + 4/400) ≈ 1.005 and sqrt(1 + (2 * 4) / 400) ≈ 1.01. We’ve effectively doubled the number after the 1 by splitting. Unlike Vats and Knudson, I usually don’t need an ESS >> 100, so the 400 required for split R-hat < 1.01 is perhaps a bit too conservative for my tastes. On the other hand, we face a practical problem estimating ESS reliably with fewer than 50 or so ESS per chain. Estimation is challenging because it relies on autocorrelation estimates from the chains themselves, which become much noisier when based on shorter chains. (Side question: Do we not combine autocorrelation estimates across chains to reduce standard error because some chains might not be mixing?) Also, we know this algebra wasn't a coincidence of 4 chains and 400 draws. The Taylor expansion of sqrt(1 + x) is the convergent sequence

    sqrt(1 + x) = 1 + x/2 - x^2 / 8 + x^3 / 16 + ...

When x < 0.1, the first-order approximation, sqrt(1 + x) = 1 + x / 2, is good.

The bottom line for practitioners

We need around twice as many draws to get below a fixed threshold with split R-hat than with the original R-hat.

Structural equation modeling (SEM) and positive definiteness

This post is from Bob.

Mitzi and I were swotting up on structural equation models (SEM) for our class this past Monday at the Modern Modeling and Methods (M3) conference at Fordham University. It was a lot of fun and now I think I understand SEM notation. I really like these applied conferences and this was a group of psychometrician, econometricians, and sociometricians. Many if not most of them thought about models in terms of SEM, so we thought we should figure it out. But I was left with a concern you may be able to help me sort out.

The example

The first worked example in Ken Bollen’s seminal 1979 textbook on SEM is a study of how industrialization relates to democracy. It comes from his paper,

  • Bollen, Kenneth A. (1979). “Political Democracy and the Timing of Development.” American Sociological Review, 44(4).

and was reprised in his book

  • Bollen, Kenneth A. (1989). Structural Equations with Latent Variables. Wiley.

I had the pleasure of sitting across from Ken at the invited speakers dinner at the conference, so I’m glad I looked into SEM before that. Good news for the SEM devotees—he released a completely revised guide to SEM a few months ago.

The data and parameters

The data consists of eleven covariates (called “indicators” in SEM) for each of 75 countries. Four of the covariates are related to democracy in 1960 (y1, y2, y3, y4), the same four measurements were taken again again in 1965 (y5, y6, y7, y8) , and there were three measurements of industrialization in 1960 (x1, x2, x3).

The SEM model the original researcher came up with here assumes three latent scalars per country, industrialization in 1960 (IND60), level of democracy in 1960 (DEM60), and level of democracy in 1965 (DEM65). These latent parameters are related in the following way: democracy in 1960 is a regression on industrialization in 1960, and democracy in 1965 is a regression on both democracy in 1960 and industrialization in 1960.

The covariates are then modeled like a seemingly unrelated regression in econometrics. The four democracy 1965 parameters are treated as regressions on the latent level of democracy in 1965, and similarly for the democracy in 1960, and industrialization in 1960.

Rather than independent errors, a SEM model explicitly indicates with arrows which pairs of observations are allowed to have non-zero correlation in the covariance matrix for the observations. The three industrialization observations are assumed to have zero correlation—there are no arrows between any of the three measurements in the SEM diagram. Each of the four measurements in 1960 is assumed to covary with the same measurement taken in 1965. In addition, the second and fourth measurement in each year are assumed to be correlated with each other, which leads to a box-like structure.

The SEM diagram

Here are the arrows in the diagram, where I’m not using their standard LISREL notation, but writing them in R expression syntax to indicate what is regressed on what. In their graphical notation, just replace ~ with <-. All three latent variables and all eleven measurements are indexed by country.

IND60
DEM60 ~ IND60
DEM65 ~ DEM60, IND60

x1, x2, x3 ~ IND60
y1, y2, y3, y4 ~ DEM60
y5, y6, y7, y8 ~ DEM65

The covariance structure is indicated by stating which pairs of measurements are modeled with non-zero correlation. The first four just pair the measurements of the same thing across 1960 and 1965.

y1 <-> y5
y2 <-> y6
y3 <-> y7
y4 <-> y8

The last pair of correlations are within 1960 and within 1965.

y2 <-> y4
y6 <-> y8

Together, these induce an odd box structure, where y2 is correlated with y6 and y4, both of which are correlated with y8, but y2 and y8 are assumed to have zero correlation.

y2 <-> y6
^      ^
|      |
v      v
y4 <-> y8

Stan implementation

We didn’t get this far in my half of the class, so I will share here the Stan Playground example where I fit Bollen’s example (you can get the data and the Stan model through the Playground link:

It gets the right answer compared to lavaan/blavaan, which is nice. In the Stan code, xi is IND60 and eta1, eta2 are DEM60, DEM65. The relation among the latent parameters are modeled directly as regressions. The correlations among the observations are modeled using soft zeroing, where I just put a tight prior around zero on the structural zero elements, because Stan doesn’t give you a good way of setting up structural zeroes in a covariance matrix (Sean Pinkney or Ben Goodrich might know how to do this?).

This makes me curious how the lavaan package in R manages this. There’s a Bayesian version of lavaan built on top of Stan, blavaan. The first example right at the top of the home pages for both the lavaan and blavaan is Bollen’s democracy model. I guess it’s like the Scottish lip cancer data set for spatial modeling or Fisher’s iris data for regressions.

My questions

Consider a simple diagram among measurements like the following.

x <-> y
y <-> z

This says there can be non-zero correlation between A/B and also between B/C, but the correlation between A/C is zero. It’s a simplified case of the box we saw in the actual example. These arrows implies the correlation matrix looks as follows.

|        1  rho[x,y]         0 |
| rho[x,y]         1  rho[y,z] | = Omega
|        0  rho[y,z]         1 |

Given that the correlation matrix Omega must be positive definite, this limits the range of rho[x,y] and rho[y,z]. For example, we can’t have rho[x,y] = rho[y,z] = 0.9, or rho[x,z] would have to be greater than zero to maintain positive definiteness.

Q1: Why doesn’t SEM instead say that the correlation rho[x,z] is just the minimum value it can be given rho[x,y] and rho[y,z]? I’m suggesting that we instead treat the above diagram as implying no additional correlation between x and z other than that implied by the correlation between x and y and the correlation between y and z? That is, why try to shrink rho[x,z] all the way to zero? From the text, it feels like the motivation is to enforce zero correlation in the model. But all this is doing is simplifying regressions—it won’t actually enforce zero correlation among the measurements that are modeled with zero correlation. I wished I’d asked Ken this question at dinner, but I’ll ping him about this blog post and hopefully get a response.

Of course, in the pragmatic Bayesian workflow, we’d use posterior predictive checks to evaluate whether there’s unmodeled correlation between x and z.

Q2: I’m also curious what Andrew and others think about enforcing structural zeroes in correlation between measurements as opposed to just estimating a dense covariance matrix and inspecting where the correlations fall.

LLM-generated Stan case study on Galileo’s inclined plane experiment

This post is from Bob.

I’ve been planning for at least a couple years to generate a case study around Galielo’s use of an inclined plane instrumented with water clocks to estimate the terrestrial gravitational constant. Here are some photographs of a replica in the Museo Galileo (click to blow them up). And here’s a video simulation of the experiment. We replace his clever pendulum apparatus explained in the video and the web page with simple Bayesian statistics so we can actually estimate the gravitational constant.

The case study

Here is a draft.

Bob Carpenter. 2026. Estimating g from Galileo’s Water Clock: A scientific Bayesian inverse problem with Stan and CmdStanPy. GitHub.

I list myself as the author here because I’m responsible and AIs can’t own copyright in the U.S., but 100% of the text and code was written by Claude Opus 4.8 (medium or high effort, but I can’t recall which). I used the desktop app, which doesn’t allow sharing, but you can try it yourself.

The prompt

Here’s the sloppy prompt I used, which I just typed in without much thought in a couple minutes to get a feel for what it could do on its own.

I would like to generate a case study written in Quarto and using CmdStanPy to demonstrate solving scientific Bayesian inverse problems. I want to use a simulation of Galileo’s water clock experiment, which can be used to estimate the gravitational constant. I would like you to start by generating the mathematical model description in LaTeX, the model code in Stan to solve the inverse problem, and a simulation driver in Python using CmdStanPy and plotnine for plotting. Please just `import plotnine as pn` and use `pn.geom…`, etc. All I need in the output now is a call to `.summary()` on the fit returned by `.sample()`. Wrap this all up in a quarto document for me from which I can generate HTML by calling `quarto render galileo.qmd`.

It was done before I got back to my desk with a cup of coffee (well under five minutes). So not quite the several hours Andrew said it took him to write his case study on the New York Knicks basketball team, which he posted earlier today. Of course, this was much simpler and I didn’t have to think through any details before generating it.

Is it right?

What Claude produced looks really good to me. If a student had done this, I’d given them an A. I can’t object to the way it described Galileo’s experiment, wrote the math, wrote the Stan code, wrote the Python simulation, or plotted the raw data as Andrew is always urging us to do.*

The source

You can find the source .qmd file on my GitHub:

https://github.com/bob-carpenter/case-studies/tree/master/galileo-gravity

It’s short, so I would have just included it, but the blog software blocked my post after considering it an attack on the site. To get it to render with resources embedded, I had to ask Claude a follow-up question and manually insert a single line of config into the .yaml header for the markdown document.

Putting this blog post together took longer than writing the prompt and checking the results.


*   Maybe Claude runs a little simulation of Andrew like I do. Andrew himself claims to run a simulation of Jennifer Hill—it’s the basis of his
handy statistical lexicon entry for “WWJD,” which he told me stands for “What would Jennifer do?” Unfortunately, neither the lexicon entry nor its underlying link explains the acronym.

Stein’s method, learning and inference -or- how to really monitor convergence and thin chains

This post is from Bob.

I’ve been thinking a lot about scores (gradients of the log density function) and how they can be used for convergence monitoring. We know that the expected value of the score is zero. Stein generalized this with Stein operators. In the monomial case, the Stein operators give you functions in increasing degrees, all of which have zero expectation in the posterior. Here theta is the variable being sampled and S is the score function, so that S(theta) is the gradient of the target log density evaluated at theta.

    Order 0: S(theta)

    Order 1: 1 + theta .* S(theta)

    Order 2: 2 * theta + theta^2 .* S(theta)

This leads to a natural test for convergence of first, second, and third moments. Just compute Monte Carlo estimates of these quantities and see if they’re zero. We’d want to standardize for standard deviation to make the result scale-free like R-hat. To develop some intuitions, in a standard normal distribution p(theta) = normal(theta | 0, I), we have S(theta) = -theta, and thus S(theta) converges to zero at the same rate as our variable theta converges to its true value; the order 1 test is 1 – theta^2, which we know has expectation zero because theta^2 has a ChiSquared(1) distribution with expectation of 1). The order 1 case corresponds to equipartition in physics and the form D + theta’ * S(theta) also naturally has zero expectation as shown in the viral theorem in physics in the 1870s.

Diving into this a bit more led me back to Jackson Gorham and Lester Mackey’s work on Stein’s method. They haven’t been sitting still since introducing the basic idea, which kernelizes the idea above. Mackey et al. have produced an absolutely wonderful summary of this body of work in two forms. The first is a dense, 41-slide deck with all the key definitions and results. I’d suggest at least skimming this first.

Lester Mackey. April 2026. Stein’s Method, Learning, and Inference.. GitHub.

Mackey along with Chris Oates and Qiang Liu, who have also worked heavily in this area, put together a definitive monograph. They’ve presented a great deal of difficult material in a way that I can digest (though it’s going to be rough going if you’re not well versed in sampling and how MCMC is traditionally measured and evaluated).

Qiang Liu, Lester Mackey, Chris Oates. March 2026. Probabilistic Inference and Learning with Stein’s Method. arXiv.

In particular, they go over Stein variational inference, which seems to me like it would be the ideal way to perform quasi Monte Carlo-like inference for statistical models if we could only get a robust version to scale. The idea’s to initialize a bunch of points, then use optimization to minimize a kernelized Stein discrepancy of the empirical distribution of those points to the true distribution.

The Kappa Zoo: David Eubanks’s online monograph on rating models

David Eubanks writes:

My site is kappazoo.com, and it’s still a work in progress. I would rather have emailed after I had the new goodness-of-fit code done, but I saw that you’re doing a summer workshop (on Andrew’s blog) [editor: Modern Modeling Methods (M3)], so thought I’d mention it now.

It may be billed as a work in progress, but it’s a complete draft with no missing sections that provides a really nice overview of rating/crowdsourcing models. These are the models that dragged me into statistics, namely Bayesian rating models formulated as noisy measurement models. The first model of this kind that I or Eubanks could find was Phil Dawid and Allan Skene’s (1979) paper on rating.

Eubanks works through a great deal of workflow without calling it that. There are multiple model evaluation and comparison measures used and explained with connections to information-theoretic notions like entropy.

There’s a long discussion of Cohen’s kappa statistic, which is a commonly reported statistic measuring inter-rater agreement. As Eubanks notes, it doesn’t deliver on its promise of adequately measuring inter-rater agreement. The discussion is quite good here and complementary to the discussion from me and Becky Passonneau in our paper on rating in NLP, though our conclusions are the same.

I was surprised to see that Eubanks has a section comparing item-response theory (IRT) models with difficulty. I’ve been trying to convince people this is important for years. It took me around ten years to figure out how to move from Dawid and Skene’s IRT-0-like model to an IRT-1-like model, which we report in in our latest paper on crowdsourcing with difficulty parameters (which also works through a lot of Bayesian workflow in considering different models). I can’t identify what took so long—it seems so obvious to me now.

Full day Stan tutorial at Modern Modeling Methods (M3) this summer in New York (22 June 2026)

This post is from Bob

Mitzi Morris and Bob Carpenter, two of Stan’s developers, will be presenting a tutorial on Stan and Bayesian data analysis aimed at psychometricians this summer.

Abstract

This workshop is a full day, hands-on introduction to Bayesian modeling and statistical inference using the probabilistic programming language Stan.

The course will be organized around the key properties of Bayesian statistical modeling for science, including the nature of uncertainty, modeling a generative process through a data generating distribution, modeling existing knowledge through a prior, and pushing uncertainty through inference. As we do this, we will show how Stan can be used to both code the models and perform statistical inference for quantities of interest, be they retrospective parameter estimates or prospective predictions or forecasts. We will concentrate on full Bayesian posterior inference, including a discussion of calibration, model checking for both prior and posterior inference, and model comparison with cross-validation. We will spend some time showing how some structural equation models (SEM) can be translated directly to Stan and will also introduce psychological models for educational testing, crowdsourcing, rating and ranking, and real-time decision processes.

This class will require a notebook computer with a network connection (Wifi will be available in the classroom). We will use the Stan Playground, which runs Stan in the browser, which we will pre-populate with models of interest. We will probably also break into R or Python at various points to demonstrate methods not yet supported by the Playground, such as the brms regression expression language.

Andrew on psychometrics

Andrew once told me that any model you could come up with was probably invented by a psychometrician 50 years ago (make that 60—he said it at least 10 years ago). I have evidence that he’s right form the project that drew me into Bayesian statistics—crowdsourcing. Andrew and Jennifer Hill helped me formulate a crowdsourcing model where raters give you noisy measurements of underlying categorical variables (e.g., they answer survey questions about whether a word in context is a noun, for example, to use something I was working on at the time). Turns out Phil Dawid and A.P. Skene published the same model in 1979 in one of the earlier applications of the expectation maximization (EM) algorithm and they used natural language data (drawn from medical records).

The rest of the conference

The rest of the program looks really great—it’s just the kind of applied wrestling with real data that I like.

Speaking of Jennifer Hill, she’s one of the keynote speakers at M3. Every talk of Jennifer’s I’ve attended has been great. You may know her as Andrew’s co-author on the regression books, which I cannot recommend highly enough if you’re interested in this kind of applied modeling.

Beyond the conference

You see the same kind of Bayesian modeling focus for real data at venues such as ISEC (international ecology conference to which I went to once just because I like these models and these kinds of conferences), StanCon (see you in Uppsala in August!), and GeoMed (which Mitzi attends). I’m sure there are more in other fields. I’m always disappointed that there’s almost nothing like these kinds of nitty-gritty applied papers at ISBA (Nagoya this summer) or BayesComp (somewhere next year). The conferences about Bayesian statistics or computing that I’ve been to have all been super theoretical.

Why are there squares everywhere in statistics (e.g., normal density, variance, least squares, etc.)?

I remember asking my colleagues at Carnegie Mellon this very same question as I was first learning basic statistics in the early 1990s and they gave the same kind of answers as I found more recently in the AskStatistics subreddit. It’s an evergreen question, coming up regularly enough I think it’s fair to say it’s a zombie. This is probably going to be familiar material to most blog readers, but if you’re like me when I first started reading this blog decades ago, then read on.

The answers are all over the place ranging from “the central limit theorem” to “it’s distance” to “it makes everything positive” to “it is smooth unlike absolute value” (another path to positivity) to “mathematical convenience” or “the math says so” (sure, but how?) to the “central limit theorem” (a more specific form of “the math says so”) to the link to entropy (it is the distribution that maximizes entropy for a given mean and variance), and so on.

I think the easiest answer for the person asking “why squares?” is due to Gauss by way of Pythagoras. Simply put,

    *** the mean is the number that minimizes square error. ***

That is, if I have a sequence x[1], …, x[N], then mu = mean(x) is the number that minimizes the error

    err(mu, x) = SUM{n in 1:N} (x[n] – mu)^2,

in the sense that

    ARGMIN_mu err(mu, x) = mean(x).

You can verify this by taking a derivative of the summation, showing that it’s zero at the mean, then confirming that minimizing, and checking that you have convexity by checking that the second derivative is positive.

Gauss realized that he could define a density where mean(x) is the maximum likelihood estimator. It is now called the “Gaussian distribution” or “normal distribution”. That is, we have

    log normal(x | mu, 1) = -1/2 (x – mu)^2 + const.

If we have a bunch of observations and want to maximize their likelihood, that’s equivalent to minimizing the error I wrote down above,

    ARGMAX_mu PRODUCT{n in :N} normal(x[n] | mu, 1)

        = ARGMAX_mu PRODUCT{n in 1:N} exp(-1/2 (x[n] – mu)^2) * const

        = ARGMAX_mu SUM{n in 1:N} log(exp(-1/2 (x[n] – mu)^2)) + log(const)

        = ARGMAX_mu N * log(const) + SUM{n in 1:N} -1/2 (x[n] – mu)^2

        = ARGMIN_mu 1/2 SUM{n in 1:N} (x[n] – mu)^2

        = mean(x)

This is why “ordinary least squares” fit for a regression with normal error terms yields the same result as maximum likelihood.

Some of the other answers in these threads are correct, but they failed to enlighten the original posters as to why there is a square. Just saying “because of the math” like my colleagues did is not helpful without saying which math! For example, the normal distribution, with its quadratic log density term, arises through the central limit theorem asymptotically. Squaring does emphasize outliers as people said, but that’s not the reason. If you say “for mathematical convenience,” it would help to say what it buys you, such as the sum of normals remaining normals, normal being its own conjugate prior in regression, integrals being very nice, tails being very nicely behaved, the connection to maximum entropy, residence in the exponential family, etc. etc.

Why not absolute values instead of squares?

We can use absolute values. In that case, medians replace means, because absolute error has the pleasant property that

    *** the median is the number that minimizes absolute error. ***

In symbols, that’s

    ARGMIN_mu SUM{x in 1:N} abs(x[n] – mu) = median(x)

Connection to Bayesian inference

If our model is correct (big “if” that’s almost always false, of course), then we know that the posterior mean is the parameter estimate that minimizes expected square error, whereas the posterior median minimizes expected absolute error. Just like in the simpler case. It’s important to note methodologically that we don’t get a natural point estimate out of Bayesian inference without specifying an error function. If the error function is quadratic, the error minimizing estimate is the posterior mean, whereas if the error function is absolute error, the error minimizing estimate is the posterior median.

The multivariate normal case

There is a strong relation to distance, as some of the answers on the AskStatistics subreddit said. More precisely, squared error is just squared Euclidean distance from the point [mu mu … mu]. This points the way to how Gauss defined the multivariate normal distribution. Simply replace the squared Euclidean distance term (x[n] – mu)^2 term with a quadratic form involving the inverse covariance,

    log normal(y | mu, Sigma) = -1/2 (x[n] – mu)’ * inverse(Sigma) * (x[n] – mu) + const,

where Sigma is a positive-definite covariance matrix. This quadratic form is the distance in the space defined by the Euclidean metric inverse(Sigma). The resulting distance over vectors is known as the “Mahalanobis distance” in statistics. For example, if Sigma = [[1, 0.9], [0.9, 1]], for a highly correlated bivariate normal, it will be easier (less distance under the quadratic form) to move along the diagonal x = y than transverse to the diagonal x = y. That is, the points [1 1] and [2 2] are closer to each other given the inverse metric Sigma (distance approximately 1.05) than [1 1] is to [1 1 + sqrt(2)] (distance approximately 10.5), even though they are the same simple Euclidean distance away from each other.

Alchemize: PyMC’s model to replace Stan/PyMC, etc. with an LLM

This post is from Bob

I’ll let Thomas Wiecki, who is one of the core PyMC devs and one of the partners at PyMC Labs, speak for himself here:

If you haven’t seen what people are doing with agentic AI, this is a good example. I’m really happy that Thomas and PyMC Labs are sharing their thoughts and initial tries at things like this as I think it has the potential to benefit everyone working on modeling.

If you want to see the basis of the agent’s instructions, check out the “skill” for PyMC that Chris Fonnesbeck wrote.

We’ve already batted this around a bit in email with Thomas, so I can summarize some talking points:

LLM-based chatbots are really good at translating. Compiling (or more technically correct, transpiling) a statistical model down to a language like Rust or C++ or JAX is a kind of translation.

You can start from PyMC’s execution trace, but you can also start with a model description. You could also start with something like Stan code.

The biggest bottleneck to deploying Bayesian models in my opinion is the inherent variance and unreliability of MCMC-based inference. Our workflow proposals are all about making sure this doesn’t go wrong. Wiecki’s point here is that we can have the bots go through the workflow. Iterating until the gradients and log densities match is a good example, but this could be extended to more parts of workflow.

The skills feel a lot like writing a textbook for a bot. I have no idea how hard or easy this is or how much it improves over the baseline. Jeremy Magland built a RAG-like helper for Stan that compressed the Stan Reference Manual down to 1K tokens for context (like a skill) and allowed it to search and import from the Stan User’s Guide, but never measured how much it improved over the baseline. It really feels like it should also have the Stan Functions Reference, BDA3, Regression and Other Stories, and the Bayesian Workflow book, as well.

Hopefully we’ll asymptote at writing a textbook sized set of skills and not have to write one per target model (that is, something like the Stan User’s Guide, Reference Manual, Functions Reference).

I’m curious as to whether it will eventually be able to make writing hard models easier. I’m thinking of efforts like epinow2, which involves a very large chunk of Stan code.

As the foundation models and chatbot tuning changes, there’s going to be an issue of regression testing and tuning for whatever the latest models are.

P.S. This effort explains how Thomas was able to create the huge posteriordb pull requests for PyMC (#320 and #319)!

P.P.S. The latest thing Claude (Opus 4.7) did that impressed me was generate the ess(MatrixXd, vector) function in summary.hpp. This function estimates effective sample size Stan style (discounting for R-hat > 1) on a ragged array of Markov chains. We have to generalize all the posterior analysis tools to deal with the new asynchronous parallel sampler). I had Stan’s ESS function and all the other functions I’d written for the ragged structures to give it as a guide. It’s very easy to code review that it matches Stan’s implementation for the new data structure. I only had to tweak the output a little bit for style.

David W. Hogg on why we do astrophysics (in the face of LLMs and the lack of clinical value)

David W. Hogg, who takes his role as scientific gadfly seriously, recently (February 2026) released an interesting rumination on science, LLMs, and the human component of research.

I like that his first move is to insert an ungrammatical question mark into the title to question the whole process.

There is a lot to unpack here, starting with the claim that LLMs “show no signs of intelligence.” I seriously don’t know what “signs” people are looking for other than the ability to converse in 200 languages fluently in just about any subject known to humanity. But like most AI benchmarks, we tossed out the Turing Test as soon as a computer could make a passable go of it.

One comment I can get behind is, “a data scientist who has taken an astronomy class might be better prepared [to deal with astrophysics data and modeling] than an astronomer who has taken a data science class.” The usual feeling among scientists is the opposite, but I’m guessing this is because they’ve just never worked with truly talented programmers like the ones we have at Flatiron (e.g., Brian Ward, Steve Bronder, Jeff Soules, and Robert Blackwell, among the folks with whom I work).

The note is stuffed with both descriptive accounts and normative statements such as “Every scientific paper is written to help all of its writers, and all of its readers, learn and grow, no matter their career stages.” I’m imagining living in the Big Rock Candy Mountains of academia, where that’s true.

I just don’t get “I believe that (in astrophysics) software is written to support the astrophysics literature, and that every important piece of software should have an associated paper in the astrophysics literature.” Why not publish software papers in software venues like JOSS? Most literatures won’t even publish software papers, so good on the astrophysics folks for allowing them.

Folks who bemoan the publish-or-perish mentality in academia should consider the statement, “My point is that astrophysics is the astrophysics literature.” I agree if you extend that to include things not peer reviewed like software.

David takes on workflow with the contentious statement, “a lot of ‘implicit knowledge’ or folklore about things like how to observe, how to reduce data, how to organize projects, how to visualize data and models, how to read and write, and so on. Much of this never appears in the literature. Is that not also astrophysics? Yes it is, but it is astrophysics practice.” I’m postmodern enough to believe it’s impossible to separate astrophysics from astrophysics practice. Isn’t the computing also astrophysics practice? He wants that in the astro literature. For this, he concludes, “I would welcome a project in which we tried to make much of this implicit knowledge explicit.” Just not in the literature? We never could get the workflow paper published.

I know David’s senior enough to know how the sausage is made, so I don’t see how he can claim, “Papers (and the authorships on those papers) and the citations of those papers are not ‘coin’ of anything!” Of course they’re the “coin” of both hiring, tenure and promotion committees. I believe Andrew has gone on record saying people value publications so highly because nobody wants to devalue the coin that put them where they are (though I’m sure he said it more cleanly than that).

David says, “it isn’t really correct, if you use an LLM, to give that LLM co-authorship on your paper,”. It’s not legal in the United States, either.

More normative statements like, “You can’t decide not to cite a relevant paper because you don’t like the author, or don’t like the author’s institution, or don’t like their funding sources.” I guess he hasn’t hung out in linguistics. Or mathematics. My undergraduate advisor, Ed Palmer, was a student of Frank Harary. Harary and Béla Bollobás developed an entirely parallel theory of random graphs, proving all the same theorems and doggedly refusing to cite each other.

This just seems patently false: “Anyone who has the capability of getting a PhD in astrophysics has the capability of doing many remunerative things, substantially more remunerative than my job.” He then goes on to one of his main philosophical (economic?) points, “If all we really wanted was to know how the Universe worked, we would start a hedge fund, and use the proceeds to pay an astrophysics institute, filled with people who wanted to do astrophysics rather than find out the answers.” I’m not convinced that someone who’s good at science will also be good at business. I think folks like Jim Simons, who co-founded our institute with Marilyn Simons, is an exception. Sure, if you can make tens of billions of dollars, you can fund a lot of science. Or buy a lot of Jamaican beef patties to bring it back along to themes of the blog.

I don’t understand why he says, “No astronomer (that I know) is improving the calibration of JWST instruments because they want the US Navy to have a higher kill rate.” Is that because he doesn’t think people who do this should be called “astronomers?”

Some of the arguments are silly, but at least get those niggling question marks, “We put ‘the universe’ in ‘the university’?” I also find trickle-down theories of training to be particularly weak. Here, David says, “We train a technical workforce.” Sure, but training a technical workforce by teaching them astrophysics seems very inefficient—see the point above on hiring data scientists versus astrophysicists. Perhaps the weakest argument is that astrophysics funding takes away from even more dangerous things the government could be funded, under the heading “We beat ploughshares into swords.” I also think “We create opportunities for development.” in the sense of development sites in Chile and similar places for observatories. Just think about how useful that money could be spent in some other way for development? On the other hand, I can fully get behind, “Astrophysics is a satisfying activity.” I’m still surprised I get paid to do what I love!

A lot of this paper is about LLMs, but I don’t think that’s the interesting part. Right in the abstract, David discusses, “two possible (extreme and bad) policy recommendations related to the use of LLMs in astrophysics, dubbed ‘let-them-cook’ and ‘ban-and-punish.’ I argue strongly against both of these; it is not going to be easy to develop or adopt good moderate policies.”

As a good writer, he sticks the landing. First, by reframing the question as, “Finally: Why did I write this white paper? I wrote this because I became concerned about some of the ideas circulating in the astrophysics community about LLMs and their capabilities, conflating what are (in my view) text-interpolators with what are (in my view) scientists.” Then concluding ” Ultimately, I think the real question we face—if we do indeed face a question—is not the question of how we do astrophysics. It is the question of why we do astrophysics.”

Expanding the Stan User’s Guide

This post is from Bob.

The Stan User’s Guide has been evolving organically along with the project. I’m writing this post for two reasons—to let you know what we’ve added lately and to encourage you to add your own chapters.

History

Initialy there was just one doc that included what is now the Reference Manual, Function sReference and CmdStan Manual. You can browse them all on the web or download pdfs from the Stan documentation web site.

Most of the topics grew up opportunistically from either things the team was interested in, new features we added to Stan, or just trying to cover the basics of the chapters of Bayesian Data Analysis and Data Analysis Using Regression and Multilevel/Hierarchical Models.

Recent additions

There have been some recent additions, including some from new faces:

There are also a metric ton of minor clarifications, fixes, and model examples, such as Mitzi Morris’s additional sufficient statistics optimizations and you can simply look at all the documentation pull requests.

The future is you

You can see from Franziska’s pull request, Abner’s PR, and Brynjólfur’s PR that we provide a lot of guidance and feedback. I found it super useful to write a lot of the documentation when I was learning statistics because I got feedback from Andrew Gelman, Ben Goodrich, Aki Vehtari, and Michael Betancourt, among others.

Some topics that would be nice to add

There is a lot to add to our existing discussions and you can see a lot of that in the pull request list on our documentation repository. There are also many topics that we don’t cover or don’t cover well. Off the top of my head, roughly ordered near the top according to my assessment of how impactful it would be to add.

  • causal inference using Rubin’s potential outcomes framework
  • Thompson sampling for reinforcement learning/bandits (sorry, Andrew!)
  • penalized complexity priors and the Besag-York-Molie II spatial model (working from Mitzi’s Morris’s case study)
  • Extreme value models
  • RNA-seq and DNA-seq composition models at the read level and k-mer level
  • Econometric models (I don’t even know what these are, but there are books)
  • Stationarity-constraining parameterizations of VAR models a la Sarah Heaps
  • Neural networks
  • A/B testing (ideally sequential)
  • spatio-temporal models along the lines of Leon Held’s survey
  • non-trivial examples of ODE models like Lotka-Volterra (my case study), SEIR (Elizaveta Semenova et al.’s case study) or even better, pharmacokinetic/pharmacodynamic (multiple case studies around Torsten), soil carbon (my very early case study, but there’s been much better work since)
  • Hilbert-space approximations to Gaussian processes (GPs)
  • PDE approximations to GPs
  • Fourier space analyses of GPs
  • N-gram language models
  • Plackett-Luce model for contest with more than two players (this can follow my case study for StanCon)

Some of those may already be mentioned in our doc, so sorry if you’ve already written something. Also, feel free to suggest more topics in the comments.

In ML, everyone’s Humpty Dumpty

This post is from Bob.

I used to work in natural language semantics, and the following dialogue from Lewis Carroll’s Through the Looking Glass, and What Alice Found There was the most common pull-quote to see at the beginning of a thesis.

“When I use a word,: Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean — neither more nor less.”

“The question is,” said Alice, “whether you can make words mean so many different things.”

“The question is,” said Humpty Dumpty, “which is to be master — that’s all.”

Humpty Dumpty came to mind recently after a spate of discussions with ML folks about inference (i.e., what they call “learning”).

What in the world does “empirical Bayes” mean?”

Empirical Bayes came up on Wednesday with some ML folks I was talking to, then I ran into Dave Blei this morning, who told me he’s giving a sequence of talks on empirical Bayes over the next few weeks (at Columbia and at University of Chicago). I asked him what “empirical Bayes” meant to him, because it seems to be used very fluidly in ML. He gave me a new usage, saying he used it for any model that uses data to fit parameters of a prior, including just plain old hierarchical modeling fit with sampling. Dave gave the example of ARD in Gaussian processes (aka, hierarchical models).

I would only use the term in the way described in the Wikipedia entry for “Empirical Bayes”, namely

… empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out.

I pinged Mark Goldstein, one of our top-notch ML postdocs, and he pretty much reeled off the Wikipedia definition. So there seems to be a lot of variation in how this is used.

Robbins on Empirical Bayes

Dave also pointed me to the following video by Herbert Robbins (yes, that Robbins, who was so far ahead of the computational statistics curve that he introduced stochastic gradient descent and multi-armed bandits in the early 1950s).

Terminology drift in ML

I’ve been a bit shocked at how many technical terms have drifted in meaning in ML. I’m not talking about people making honest mistakes or clueless mistakes, I’m talking about true drift in meaning where the ML folks will stand by their definitions.

Likelihood: I’ve seen “likelihood” used for what I’d call the data generating distribution and sometimes just as a synonym for density. Aki has this one covered in his recent post, A data model is not just a likelihood. In stats, the likelihood is defined as the function L(theta) = p(y_obs | theta)—that is, it’s a function of theta for some fixed observed data.

Causal: With the advent of LLMs, anything with an autoregressive structure is now being described as “causal” in a “past causes the future” sense. This is even being extended to arbitrary directed graphical models, which are now being called “causal” even when there’s no explicit causality being modeled. That is, you can now describe a simple regression from an observational experience as “causal” with no extra work.

Estimation: Estimation is almost always called “learning” in ML.

Parameters: These are usually called “weights” for neural networks, which I think now make up 99.9% of all work in ML. But if you tell ML folks that neural networks are parametric models, they’ll most often deny it. A statistician would confusingly call a neural network a “non-parametric model” and then tell you that means it has a lot of parameters.

Inference: In our diffusion model reading group, the ML postdocs tell me that “inference” means what I would call “posterior predictive sampling”. For example, generating output to a query from an LLM would be called “inference.”

Bias: In statistics, this usually means expected error. In ML, it’s heavily overloaded. It can be used to name just about any kind of error measure (e.g., errors in Matt Hoffman’s sampling papers). ML folks also use the term “bias” to mean the intercept in a regression (no, I’m not kidding).

Prior: This is often called an “inductive bias” in ML circles, which can include aspects of data generating distributions as well as priors.

In Bayesian statistics, a prior is the marginal distribution over parameters. In ML and informal presentations of “Bayesian statistics,” it’s just any marginal that gets plugged into Bayes’s rule. For instance, the prevalence of a disease p(disease+) is called the “prior” when I evaluate positive predictive accuracy p(disease=+ | test=+) given a testing sensitivity distribution p(test=+ | disease=+). In a k-way classification, “prior” means the marginal distribution over categories.

Bayesian: I think in ML this term is used very broadly for any situation in which there’s a prior not strictly stated as a penalty function for penalized learning (i.e., regression). I think Andrew’s down with this definition as he also uses Bayesian for anything that looks vaguely Bayesian no matter how inference is performed. For example, Empirical Bayes is just Bayes to Andrew, as is using a Laplace approximation or even a simple maximum likelihood estimate (just think of it as a one-point posterior summary!).

Regression: Note quite on topic, but I think of neural networks as just a GPU-friendly form of non-linear regression.

Uncertainty quantification: This is the primary subject of statistics, though I think the term “uncertainty quantification” is much more prevalent in engineering/signal processing than in ML. There are even journals of that title that look sort of like statistics journals.

I’d find reading ML papers easier if there was less meaning drift from well-established terminology. I’m not saying the ML folks should be up to date with Gelman’s idiomatic statistical lexicon (the concepts are fun, and it can be useful for talking to Andrew and people in his circle like the blog readers, but I wouldn’t recommend using these terms in papers without explanation).

I’m sure there are many more terms have been coined or have drifted one way or another that I’m forgetting about.

Nutpie: state-of-the-art mass matrix adaptation for HMC

This post is from Bob

Nutpie laps NUTS

The short story is that the Nutpie sampler is usually twice as fast as Stan’s sampler for Stan models and twice as fast as PyMC’s samplers for PyMC models. The speed improvement is due entirely to better mass-matrix adaptation. Everything else is the same—dual averaging for step size adaptation and the biased-progressive, multinomial no-U-turn sampler for Hamiltonian Monte Carlo. Note that the speed is not coming from Python or Rust; we have a C++ implementation in the works we plan to interface with Python and then R.

The backstory

Adrian Seyboldt, a member of PyMC Labs, designed the sampler and released the open-source implementation linked above. He built it so he could try out Rust! Adrian’s also behind Stan’s game-changing new sum-to-zero parameterization along with Sean Pinkney. He’s a physicist by training and has amazing geometric intuition.

I tried Nutpie and it didn’t disappoint. So I wanted to understand the algorithm. I couldn’t understand the Rust code and Adrian hadn’t written any pseudocode or algorithm description. Eliot Carlson, with a recently minted MA in stats from Columbia, showed up at my office on the recommendation of Philip Greengard to look for a computational stats project. I asked him what he thought about Rust and like any great grad student, he wasn’t daunted by the fact that he knew nothing about Rust and was a statistician by training. With serious digging and some help from Adrian, Eliot extracted pseudocode descriptions and we were able to formulate the algorithm in what I hope is an understandable way. Eliot now works full time at PyMC Labs.

The paper

The arXiv paper just came out yesterday.

We’d love to hear any comments, corrections, or suggestions people might have.

My role in this was just helping to formulate the whole thing as a stats paper in a language that computational statisticians could understand. Adrian and Eliot worked out the proofs, from which I learned a lot.

Why is Nutpie better?

There are two basic reasons why Nutpie’s diagonal mass-matrix adaptation is better than Stan’s:

1. Better initialization: Use the gradients at the initial point to estimate mass matrix and make the whole sampler scale invariant.

2. Two sources of information: Use the position and the gradient of the log density at the position as two sources of information with which to estimate a mass matrix

The motivation for (1) is that the outer product of the scores is a one-sample Monte Carlo estimate of the expected negative Hessian of the target log density. The one-sample estimate is regularized by taking a geometric average with an identity matrix—using a few more draws here may be more stable.

Will wonders never cease, but it turns out to be relatively simple to take the geometric mean as the midpoint of a geodesic in the affine-invariant Riemannian manifold (AIRM) of positive-definite matrices. It works out to regular geometric mean in the diagonal case, making it simple to implement.

The motivation for (2) is that in a multivariate normal distribution with covariance Sigma, the covariance of the draws is Sigma and the covariance of the scores (gradients of log density of the draws) is Sigma inverse. These are then geometrically averaged to produce an estimate. This can converge very fast—many fewer iterations than the dimensionality of the problem.

As with Stan, the default in Nutpie is to use a diagonal mass matrix. Unlike Stan, Nutpie also lets you use a low-rank plus diagonal approximation (as in L-BFGS optimization), which can go a long way to decorrelating hierarchical models. Using a rank K approximation, what would otherwise be N^3 operations like solves or matrix products become K^2 * N, which is manageable for K of order 10 or less.

Mathematical motivation

The geometric average of the covariance of the draws and the inverse covariance of the scores minimizes the Fisher divergence from a standard normal distribution to the preconditioned target density. Fisher divergence is like Kullback-Leibler divergence, but instead measures the expected distance between gradients rather than the difference between log densities. This turns out in practice to lead to a better diagonal preconditioner. The paper provides proofs and an appendix with geometric motivations.

What we do not prove and what I do not believe anyone understands is how to select an optimal static preconditioner when the target log density has varying curvature, in any form: diagonal, low rank plus diagonal, or dense. I think what we really need is dynamic adaptation that isn’t cubic like Riemannian and doesn’t require implicit integrators.

SparseNUTS: Preconditioning hierarchical models in HMC with a sparse “Laplace approximation” at the marginal mode

This post is from Bob.

Cole Monnahan (who came up with the idea and did the heavy lifting), two of his colleagues from the Template Model Builder (TMB) project, Kasper Kristensen and James T. Thorson, and I just put a paper up on arXiv.

If you have any feedback on either, we’d love to hear it.

The method

It’s quite simple if you know how the components work. The tricky part is getting them all to play together nicely, efficiently, and robustly.

1. Take a max marginal mode of a hierarchical model, of the kind produced by the linear mixed effects package (lme4) or Template Model Builder (TMB). Both packages are implemented in C++ under the hood and distributed with an R interface.

2. Center a second-order Taylor series approximation at the marginal mode using a precision matrix (inverse covariance) rather than covariance matrix. This is like a Laplace approximation, but it’s not centered at a global mode.

3. Take the resulting sparse precision matrix and use it to precondition the target density. This is equivalent to using the precision as a mass matrix in HMC, as Radford Neal showed in his introduction in the MCMC Handbook. This approach is necessitated by Stan’s lack of sparse mass matrix support.

4. Use TMB’s R interface and Andrew Johnson’s StanEstimators package to make the model available to Stan’s samplers and other tooling.

There’s nothing Stan-specific about this technique. It could be rolled into PyMC and NumPyro with a lot of work if JAX’s experimental sparse library is up to the task of duplicating what TMB is doing.

Empirical evaluation

The paper is evaluated on 15+ realistic models, some of which can be scaled. The results show that with this preconditioning, Stan can scale to 10K+ parameter hierarchical models when there is sparseness and high correlations.

The paper demonstrate how much better SparseNUTS is than trying to use Stan’s built-in diagonal or dense mass matrix estimators. Obviously, the dense mass matrix won’t scale to 10K parameters as the mass matrix will have 100M entries and that becomes nearly impossible to estimate or manage computationally in Stan. Stan is very slow to estimate mass matrices in these cases (WALNUTS should be better, which we hope to roll out soon), but the real problem is that diagonal preconditioning is insufficient for hierarchical models.

Should you use it?

Yes! The form of hierarchical models addressed by SparseNUTS are widely used and Stan by itself simply cannot fit them. SparseNUTS also provides much more flexibility than INLA in writing custom likelihoods and priors.

Why not code this method in Stan?

We do not have the sparse Hessian tooling within Stan’s automatic differentiation library to implement this directly. Nor are we likely to roll out something for specific model classes, as we try to stick to black-box techniques in the core of Stan. This isn’t because we don’t like model-class specific methods. We are happy to implement them in special packages like brms. It’s just because one has to limit one’s scope somehow to keep a software project manageable with limited developer hours.

Some alternatives on the horizon

We are evaluating how far we’ll be able to scale Charles Margossian’s black-box approach to marginalization, which Steve Bronder and Charles have gotten into shape so it will be in the next Stan release. As INLA demonstrates, marginalization can be an efficient alternative to sampling the whole model if you can get the marginalization right.

We’re also evaluating how far we can get with Adrian Seyboldt’s Nutpie sampler’s approach to estimating low-rank plus diagonal mass matrices based on Fisher divergence, which could also potentially solve this problem as most of the structure in hierarchical models is low rank. Adrian and Eliot Carlsen (both of PyMC Labs) and I are writing that up now and we will release the paper soon. So far, Nutpie’s low-rank plus diagonal adaptation works incredibly well for some problems, but fails spectacularly on others for reasons we do not yet understand.

What do Socrates, Herb Simon, and Andrew Gelman have in common? A love of the oral tradition

This post is from Bob.

With all the hand-wringing over generative AI ruining people’s ability to think, I thought I’d turn the clock back a few millennia, a few decades, then a few minutes.

Socrates

Let’s start by turning the clock back a few millennia when the deep thinkers were wringing their hands over writing. Writing wasn’t exactly new in Socrates’s time, but it wasn’t widespread in the pre-classical and very early classical period. Here’s Plato channeling Socrates in Phaedrus.

And so it is that you by reason of your tender regard for the writing that is your offspring have declared the very opposite of its true effect. If men learn this, it will implant forgetfulness in their souls. They will cease to exercise memory because they rely on that which is written, calling things to remembrance no longer from within themselves, but by means of external marks.

What you have discovered is a recipe not for memory, but for reminder. And it is no true wisdom that you offer your disciples, but only the semblance of wisdom, for by telling them of many things without teaching them you will make them seem to know much while for the most part they know nothing. And as men filled not with wisdom but with the conceit of wisdom they will be a burden to their fellow

You know, Phaedrus, that is the strange thing about writing, which makes it truly correspond to painting. The painter’s products stand before us as though they were alive. But if you question them, they maintain a most majestic silence. It is the same with written words. They seem to talk to you as though they were intelligent, but if you ask them anything about what they say from a desire to be instructed they go on telling just the same thing forever.

As Shane Parrish says on the philosophy blog from which I lifted the quote, “Oh the irony of having an argument against writing in a written text.” Of course, it took Plato to write it down. The reason Socrates’s name is not on the Columbia library is because he didn’t write! We only know of him because his student, Plato, did write, as did Plato’s student, Aristotle.

I’m old enough to remember the same arguments about diminished thinking ability rolled out against calculators, personal computers, the internet, and Wikipedia. I’m a too young to remember the same arguments leveled against slide rules, sound recording, motion pictures, or television.

Herb Simon

I sat in on Herb’s graduate-level cognitive science classes in the late 1980s at Carnegie Mellon. They were fascinating and largely focused on associative vs. logical reasoning in a production system framework. Kahnemann later popularized this distinction using the terms “fast” and “slow.”

That’s all background. The relevant fact for today’s discussion is that Herb started every semester by telling students to put away their paper and pens, saying they were distractions from attention. Herb would let you submit an essay arguing for note-taking being cognitively useful and he’d let you take notes if you turned it in. This is a level of dedication to the oral tradition that is unique in my experience. People today think phones are distracting, but if you’re serious, you’ll realize pencil and paper are also distracting. Herb did project slides from a computer (against the express wishes of our humanities and social sciences dean, who thought projectors and computers in classes would ruin thinking). He also used the board. So not purely oral. That’s the perfect segue to our contemporary subject…

Andrew Gelman

If I understood Andrew correctly when he was talking about his classes the other day, he won’t let students use phones or computers in his class. I believe this is for the same reason Herb didn’t want them to use pencil and paper. Namely, that he thinks it takes their attention away from focusing on learning the material. I don’t think this anti-phone, anti-computer stance is unusual—I’ve heard a lot of professors complain about students use of devices and try to enforce some kind of ban.

What’s unusual about Andrew these days is that he doesn’t use slides for his public talks. Even for big deal talks like plenary sessions at ISBA. It’s straight up oral tradition plus pantomimed graphs. Being Andrew, he doesn’t even need notes. I believe this dedication to the oral tradition in presenting statistics is unique to Andrew.

I don’t know if Andrew uses slides in his classes. When I was at Columbia, I sat in on classes where he would put the title slide up, then riff on interesting side material for an hour without ever advancing the slide deck (for me, that master class style was awesome). Andrew often mentions that he does live coding in his classes, but I don’t know if he lets students follow along on their computers.

Back to the oral tradition

If we combine Andrew’s no slides approach and Herb’s no notes approach, we’re back to Socrates’s happy place. A pure oral tradition.

P.S. I’m not advocating giving up writing. Or giving up taking notes. Or not being able to use the internet during class. Or giving up LLMs, for that matter. Then again, my attitude’s purely theoretical—I haven’t been responsible for a college class since 1996, which was still before mobile phones and computers were available to students.

Everything I need to know I learned in Little League

This post is by Bob

“Little League” is what we call baseball for kids in the United States. I often tell people that I learned a ton about how to behave and how to approach problems, teamwork, and life in little league. Turns out I’ve been saying that for a while. My sister just sent me this little poster I made for my dad at some point.

Dad repeated this advice regularly, even decades after my baseball-playing days. I still believe it’s good advice, so I’m sharing.

I put the most important advice first—keep your eye on the ball. That’s really key to just about anything.

I have found that hustle is also really critical in life. Dad and I loved hustling baseball players like Pete Rose. Dad used to drive me from Detroit to Cincinnatti in the early 70s to see the Big Red Machine in person, then drive back for work the next day. I find it demoralizing today how players just watch their hits rather than hustling as soon as there’s contact. I really miss “little ball”, which is why Cleveland’s my favorite team (that and it’s Mitzi’s home town).

The staying loose part is also really important and really hard. No editor, so I included keeping your eye on the ball twice. Without the duplication, I could have saved enough space to not cramp the bottom—otherwise, my graphical layout’s pretty good.

For me, sportsmanship is really critical. I also makes me sad that players only shake hands with their own team after the end of the game. We always had to go and shake every other player’s hand and tell them “good game” (even if it wasn’t). And the pros did the same.

I can’t emphasize the teamwork advice enough for the real world—part of that should have said “there’s enough credit to go around.” I should have put that higher up. Listening to how star players respond to interviews is key—it’s usually along the lines of, “I’m just trying to play my role and help the other 8 guys out on the field.”

Getting in front of the ball is also super important not only literally in baseball, but also metaphorically in life. You can do so much by just getting in front of the ball. It might hurt a bit when it hits you if you can’t catch it cleanly, but at least it didn’t get by you! I might rephrase “throw overhand” as “take the straight ahead approach” rather than “trying to get fancy.”

As a bonus, my sister also sent along this photo of our Little League days in Detroit.

That’s dad in the back and me on the far left of the back row. This is 1972 or 1973, so I was 8 or 9 years old (top row, far left) and dad was only 29 or 30. At the time this was taken, he was paying his way through law school photographing sports teams and accident scenes (I tagged along to both and “helped” in the darkroom). I love the attention to detail in the arrangement of gloves on the first row and the classically crossed bats—I also learned photography and design from dad, not to mention printing. Also notice how poor the neighborhood was. One of my teammates, Adam (can’t recall his last name), is wearing dress shoes; you can’t see Tibor’s shoes in the back, but they were mostly duct tape. We couldn’t even afford new baseballs, so I’m not sure how dad managed the spiffy uniforms—probably shilling a pizza joint or auto body repair on the back.

 
 

EVERYTHING I NEED TO KNOW I LEARNED IN LITTLE LEAGUE*

DAD ON THE GAME
keep your eye on the ball

DAD ON HUSTLE
run, don’t walk

DAD ON BATTING
stay loose, keep your shoulder down & keep your eye on the ball

DAD ON SPORTSMANSHIP
don’t be a bad loser & don’t be a bad winner; shake hands

DAD ON TEAMWORK
there are 8 other players to help you

DAD ON FIELDING
get in front of the ball & keep your eye on it

DAD ON THROWING
overhand, it goes straight

* FROM MY DAD [Mack L. Carpenter]

Seven-parameter drift-diffusion pdfs and cdfs now in Stan

This post is from Bob.

Drift-diffusion models

Whew. The cdf function for the seven-parameter drift-diffusion model was just merged. The pdf was merged a few months ago. This is a big deal. These pdfs and cdfs are used for in decision-time models in cognitive psychology. There’s a really nice illustration through NLM on nih.gov. The basic idea is that you have a binary task like deciding if an image is red or blue. The data being recorded is time to decision and the decision being made. The underlying generative model is a continuous Wiener diffusion process that has a lag time to get started before drifting with some bias toward opposing decision boundaries. The decision is determined by when it crosses a boundary and which one it crosses (see the illustration). The cdf is important when the task ends before a decision is made, giving you censored observations, which require cdfs or truncated pdfs to implement.

The first time I saw this model being applied was by Bruno Nicenboim and Shravan Vasishth (psycholinguists at Potsdam at the time, though Bruno has since moved to Tilburg) about six or eight years ago. At that point, it took Stan a month or so to fit the model (yes, that’s a month, not a typo)—you may know them as two of the three authors of the really wonderful book, Introduction to Bayesian Data Analysis for Cognitive Science (2025, CRC), which, in its final chapter, covers accumulator models of which the drift-diffusion model is one form. Now these models are very fast in Stan with the new built-in functions.

The pull requests and engineering challenge

Hats off to Franziska Henrich, a cognitive psychologist and Stan developer at the University of Freiburg, aka GitHub user Franzi2114, for writing the code and bearing with Steve Bronder’s hundreds of comments and fixes and my final round of a hundred or so change requests. You can see all the gory details in the discussions around the pull requests and in the code itself.

These pair of functions were perhaps the two hardest functions to get into Stan for a myriad of reasons. The most challenging obstacle beyond the inherent complication of the functions themselves is that our testing framework for densities can’t handle seven-parameter densities. So all the tests had to be projected into subsets of parameters (and seven choose four or five or whatever it was led to a lot of tests). A further difficulty is that to make the arithmetic stable, the code branches all over the place (see the Hartmann and Klaeur article linked below), which also complicates testing.

Some academic background

In addition to Vasishth et al.’s book chapter, there is a vast literature on drift-diffusion models in cognitive psychology and elsewhere. Most relevantly, Franziska wrote an open-access article about the model and the Stan implementation.

Luckily, Hartmann and Klauer provided the derivatives in a previous (closed-access) article.

As is often the case, you can find a pdf through Google Scholar. It’s a daunting pile of mathematics that puts the “M” in “mathematical psychology.” Luckily for us, the authors published an R implementation in package WienR on CRAN (the name is because it’s the Wiener diffusion model underlying the process), which Franziska could use for testing.

Coming to the Stan language next release

We just put out a new Stan release, so we have plenty of time to get the language wrappers around the math library functions before the next release of Stan. Ideally, we’ll also have a User’s Guide chapter with examples of how to use them. We’re always open to new *User’s Guide* chapters about models or methodologies in wide use, and as you can see from this example, we take pull requests, which go down much more easily with the User’s Guide.

A slew of improvements to NUTS

This post is from Bob.

Hold onto your hats, because 2026 promises to bring a whole slew of improvements to MCMC for continuously differentiable densities. The really awesome part is that all of these improvements are orthogonal, so they stack. I’m going to list the ones we know work first, followed by a couple in which we have high hopes.

I’m currently working on implementing all this in a C++ reference implementation, which I plan to roll out with an interface like that of Nutpie. You can follow here:

You’ll find the work in progress on branches. We are, of course, always happy to get feedback on the code, and I’m happy to get feedback on these or other ideas for sampling.

Fisher divergence for mass matrix adaptation

Adrian Seyboldt developed much faster and more robust and better targeted adaptation using Fisher divergence for his sampler Nutpie. I’m helping Adrian and Eliot Carlsen finish up a paper on this, which we hope to release in a week or so. In addition to better diagonal and dense estimators, it contains a really nice low-rank plus diagonal preconditioner that seems very effective (the risk is getting stuck too much in the subspace defined by the low rank structure). It also contains all the mathematical proofs, thanks to Adrian. The basic idea is that Fisher divergence adaptation targets getting the preconditioned target as close to a standard normal in terms of gradients as possible (not as close as possible in terms of density—that’s KL divergence). With some surprising mathematical magic, you can solve the exact optimization of Fisher divergence by taking the geometric mean in the affine-invariant manifold of positive-definite matrices of two quantities: the inverse covariance of the draws and the covariance of the scores (where the score is the gradient of the log density). In a multivariate normal, the covariance of the scores is the inverse covariance of the target density, has zero expectation, and thus acts like a control variate. Who knew? (That was a rhetorical question. Ben Goodrich, of course, knew—he knows everything.)

Adaptive step size on the fly

This was developed by Nawaf Bou-Rabee, Tore Kleppe, Sifan Liu, and me over the course of a few papers expanding on our Gibbs self tuning (GIST) ideas, culminating in our WALNUTS paper (on arXiv). The basic idea here is at each leapfrog step to try to take a step, and if the Hamiltonian diverges past a tolerance, try a smaller step size (half the original). There’s a bit of adjustment to do for reversibility (very much like in the MALT sampler of Lionel Riou-Durand et al. [the paper has an all star cast]), but the expectation is that the adjustment will be close to unity and that seems to hold in practice. There’s some subtlety here in that the WALNUTS sampler we’ve produced can be slower than NUTS (on a per gradient basis) for densities that NUTS can fit well, but if it’s well tuned in terms of minimal numbers of micro steps per macro step in the WALNUTS algorithm, it will be more efficient on a per-iteration basis. This shows we really are fixing the integrator, even if that doesn’t always give us a better sampler on a per-gradient basis. This same kind of issue comes up with higher-order leapfrog algorithms—they’re more precise, but often not worth the effort. On the plus side, when you have highly varying scales, like in the funnel density, WALNUTS can sample it effectively where NUTS will just fail silently reporting overly optimistic R-hat and ESS values.

Isokinetic sampler

As the name implies, the idea of isokinetic sampler is that you keep the kinetic energy constant. This is carried out by designing a kinetic energy function that’s different than the usual quadratic model derived from Newtonian physics (decomposed a la Hamilton, of course). This was developed decades ago by Mark Tuckerman at NYU’s Courant Institute for molecular dynamics, written about in Leimkuhler and Matthews Molecular Dynamics textbook, and reinvented recently by Jakob Robink and Uroš Seljak under the name “microcanonical sampling.” Isokinetic sampling has the remarkable property that you can fix a single energy level set and the isokinetic sampler can be restricted to that, yet remain ergodic for the distribution. Although they’re singing the praises of unadjusted methods, Reuben Cohn-Gordon joined Jakob and Uroš and they wrote a joint paper on Metropolis-adjusted methods (on arXiv). Of course, you could extend this to something like jittered HMC, multinomial HMC, or NUTS (we discuss these alternatives for standard HMC in the various GIST papers in detail). Recently my colleagues Tore Kleppe and Nawaf Bou-Rabee (my partners in crime on GIST) have verified the results presented by Robnik et al. and it seems to give a pretty clean factor of 1.5 to 3 over the standard NUTS kinetic energy model.

Unadjusted sampling, at least for warmup

In statistics, we’re used to thinking of unbiased MCMC with the correct ergodicity properties—you run them longer and you get a better answer. But in estimation in stats, we’re often OK with introducing a bit of bias if it gives us a large enough reduction in variance that we get better answers. If you view sampling the same way, then removing the Metropolis step from HMC gives you a biased sampler that can be a lot faster at moving toward the typical set where we want to sample than an adjusted sampler. And while Jakob, Uroš and Reuben are mainly working on unadjusted samplers and trying to prove error bounds for them under pretty strong assumptions, they have also suggested the reasonable tactic of using unadjusted sampling during warmup then switching to adjusted later so that everything works as expected for longer runs.

Smoother adaptation of Nutpie

Nutpie, like NUTS before it, works in blocks. It evaluates a given number of MCMC iterations, then updates its mass matrix estimate. I’ve developed a smooth alternative that simply exponentially discounts the past at a decreasing rate to mimic the block structure of Stan’s warmup. This seems to work very well and doesn’t suffer the problem of Nutpie of never converging due to a finite cap on block size (Adrian’s working on changing the version in Nutpie in some way to get around the initial design of 100-length blocks; Stan works in a sequence of blocks that doubles in size).

Adam replacement for dual averaging

Stan used the dual averaging stochastic gradient descent algorithm to set step size. It’s not spelled out in the short NUTS paper that dual averaging is an SGD algorithm because the NUTS paper never gives you the objective and its gradient. If you work it out backwards from the dual averaging algorithm, you can deduce that Hoffman and Gelman used a normal target (i.e., squared error), the gradient of which is the negative difference between the observation and the target. This gives you simple stochastic gradients you can plug into any SGD optimizer. I’ve just done this this week, but I’ve already found that Adam is much faster to converge, less highly variable after convergence, and more monotonic getting to convergence in the short tests I’ve done so far. I had to add an update discounting factor to Adam or you get the terrible oscillations rather than convergence for which both dual averaging and Adam are known.

Concurrent adaptation and convergence monitoring

This is a big one that Andrew’s been asking for for years. I’ve finished the online R-hat monitor (it’s in a branch of that name in the Walnuts repo on flatironinstitute). It uses the C++11 threads library to build an asynchronous, non-blocking monitor. The chains roll along as usual, each within its own thread, but they accumulate their within chain means and variances via Welford’s algorithm. They publish these in a per-chain Atomic using relaxed memory guarantees (hence the lack of blocking). Then there is also a monitor thread running that continually reads the per-chain means and variances and computes R-hat (the original one, not the split or ranked versions that are currently used in posterior (R) and ArviZ (Python)). I am able to run at least 100 R-hat checks per second without slowing down the chains noticeably, so it can detect convergence within a few iterations of when it happens even for very fast models. This all works blazingly fast on my new Mac Studio Server running 16 chains in parallel(*). Next up, I have to monitor adaptation. For that, there’s not a pre-built solution, but I’ll be using something like R-hat to monitor whether the mass matrix and step sizes have converged (on a log scale, which makes the monitoring respect the positive-definite manifold structure and operate scale free). And I’ll have to use a double-buffered store rather than an atomic because the vectors required for monitoring convergence of the mass matrix are D-dimensional, not default copyable.

(*) Mac ARM hardware

Not an algorithmic change, but … if you haven’t gotten the memo yet, the new Mac ARM chips are very well suited to parallel sampling like we do in Stan. Their memory architecture is much more tightly integrated into the CPU and more multiplexed than Intel architecture. This is great for sampling or other tasks where data and parameters are being hit asynchronously in memory in parallel.

I just got a Mac Studio with 20 performance cores on the M3 Ultra chip and 256GB of memory. I would highly recommend this specific machine if you have US$6K to spend (your cost may vary—electronics seem to be way more expensive outside of the U.S.). They also have a US$2K starter model, which should also be great compared to just about anything else. Even the MacBooks are good—the inexpensive Air I got a few years ago crushed my mega-expensive (albeit 8 year old) iMac running Intel Xeon chips. People have told me Stan’s broken on Windows after comparing Windows and Mac, but really, it’s just the memory architecture. I didn’t do any controlled tests, but I swear that tests that were taking 3s now take 1-2s after upgrading to the Tahoe OS (bunch of small UI changes that make pretty much no difference to my experience); Apple says they’re continuing to integrate more ARM goodness into their OS, so maybe that was it? The release notes don’t mention anything about heavy thread performance.

I don’t know how long this is going to be useful. I plan to develop a lot of these algorithms in JAX to run on GPU, which is going to be way faster than anything you can do on the desktop.


TENTATIVE IMPROVEMENTS

These haven’t been well tested by us yet.

Generalized HMC

NUTS is not good for GPUs. Matt Hoffman et al. have a new paper out that’s going into the next edition of the MCMC handbook about MCMC on modern hardware that explains why (on arXiv for now). Hugh Dance et al. just wrote another paper showing how to code something like NUTS on GPU, but it’s still expensive. Something like generalized HMC promises to give much of the advantage of NUTS implicitly without the elaborate recursive doubling structure. This is the kind of sampling to which Matt Hoffman and his former colleague Pavel Sountsov developed (Matt left Google and doesn’t work on sampling any more as far as I know). This is also what Gilad Turok, Chirag Modi, and I found with delayed rejection. It also sidesteps the problem of having to chain all the adjustments for local step-size adaptativity which can add up in a bad way for longer chains. It does add some additional tuning, which is how much to partially refresh momentum, which is a kind of proxy for path length for U-turns. It also has the advantage of being implicitly Rao-Blackwellized compared to HMC (you can average all the steps on an HMC path when computing expectations, but it’s just usually not worth the storage—this may turn out to be the case for G-HMC too).

Local mass matrix adaptation

I think of this as the final frontier. If we could locally condition, we wouldn’t need variable step sizes. We don’t have anything that works for this in complex cases. In Stan, you get a choice of a diagonal or dense preconditioned, and in Nutpie you also get low rank plus diagonal, but they’re all global preconditions, not local.

In very simple cases, we can use GIST to generate a mass matrix by taking an inverse Wishart sample around the negative Hessian. If you set the degrees of freedom correctly to get low variance, this pretty much perfectly preconditions a multivariate normal and should be a sound strategy in any log concave density. It’s just expensive, even with everything Cholesky factored. But that’s not enough for a system like Stan. The problem in going to more general models is that the Hessian’s no longer guaranteed to be positive definite (as it would be in a log concave density). This means we have to use something like Michael Betancourt’s softabs technique to condition the Hessian back to positive definite (e.g., eigendecompose, move negative eigenvalues up to positive, put back together, which is prohibitive because of the cubic cost). As an alternative, Nawaf and Tore have been looking at some explicit Riemannian integrators that only require a few Hessian-vector products, which are cheap with autodiff (linear in dimension rather than quadratic). This avoids the problem with the implicit integrator in Riemannian HMC, which is an additional obstacle beyond the need for positive-definite metrics. We might go back to implicit midpoint, as Arya Pourzanjani explored in his thesis, because we can use GIST to set a step size where we can guarantee stability; even so, the implicit nature of the algorithm is a real challenge for reversibility.

From Bayesian inference to LLMs (Steve Bronder’s 2025 CppCon talk)

This post is from Bob.

Steve is now a C++ celebrity!

The production values for these are top notch. They’re also highly Google ranked, so you can’t miss them when you search for advanced C++ topics.

Steve’s been writing C++ code for Stan since before the pandemic. He now works with me at Flatiron Institute as a software engineer. He joined the Stan team after taking a GPU class and asking us if we had any use for GPUs. We said, “Yes, please.”

I say Steve’s a celebrity because of the prestige of CppCon—it’s where the language developers, top educators, and infrastructure tool builders hang out. For example, Steve had lunch with Matt Goldbolt. I use Goldbolt’s app all the time and didn’t realize it was a person’s name until Steve mentioned the lunch. To get some sense of CppCon, you can see Godbolt himself going full fanboy over a CppCon selfie with Laurie Kirk.

The model underlying R-hat and a Bayesian estimator

This post is from Bob

Andrew and I were talking the other day about generalizing R-hat convergence monitoring to the situation where we have multiple asynchronous threads running chains and we needed ragged input. This is because I’m coding (with Steve Bronder and Brian Ward’s help) a parallel auto-stopping version of Stan combining the step-size adaptivity of WALNUTS and the warmup of Nutpie—stay tuned (or follow it or join in and help on the WALNUTS GitHub).

The usual R-hat assumes each chain is the same length. Andrew suggested it would be good to go back to the model to think about how to generalize. I had never thought of R-hat that way, but it turns out it’s really simple, so I can take you all along for the ride.

A Stan model for Bayesian R-hat

Here’s the Stan model. The input is an M by N matrix of draws theta—the output includes the posterior for R and the indicator if it is below 1.01. The mean of the former is the Bayesian estimate of R, i.e., R-hat. The mean of the latter is the estimated probability that R is below 1.01, just to give it a perhaps not completely arbitrary cutoff.

rhat.stan

data {
  int<lower=1> M;
  int<lower=2> N;
  matrix[M, N] theta;
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=0> tau;
  vector<multiplier=tau>[M] alpha;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 5);
  tau ~ normal(0, 5);
  alpha ~ normal(0, tau);
  for (m in 1:M) {
    theta[m] ~ normal(mu + alpha[m], sigma);
  }
}
generated quantities {
  real<lower=1> R = sqrt(1 + square(tau) / square(sigma));
  int<lower=0, upper=1> R_ok = R < 1.01;
}

The draws in chain m, (i.e., theta[m] == theta[m, 1:N]), are given a normal(mu + alpha[m], sigma) distribution, so that mu is the global mean across chains and alpha[m] is the difference in mean in chain m from the global mean. The posterior standard deviation of all of the draws is represented by sigma. The variable tau is the standard deviation of the alpha[1:M] variables, so that represents the scale of variation among the chain means. The rest of the priors are weakly informative assuming the values are on roughly a unit scale or slightly larger.

The multiplier=tau converts alpha to a non-centered parameterization given its prior distribution normal(0, tau). The non-centered parameterization is more efficient here, though it may not be with large M and large N.

The generated quatnties define R as the square root of 1 + (tau / sigma)^2. This makes it easy to see that it's the variation between chains as represented by tau, that must go to zero for R to converge to 1. The sigma is just providing the proper scaling to make the R statistic unit free.

Stan will propagate uncertainty through tau and sigma to R. Then the indicator R_ok will be 1 if R is less than 1.01, so its posterior mean is the probability that R is less than 1.01. This is how you code event probability estimators in Stan.

The Bayesian estimate of R, which we can call the "Bayesian R-hat", is the posterior mean of R. The "hat" etymology derives from statisticians liking to decorate random variables with hats to indicate estimates of the random variable. So if X is a random variable, X-hat is an estimate of that variable. We could've taken a posterior median---that's a different estimator with slightly different properties.

A test model

We need a simple model to generate draws that we can test. I'll choose one that's not trivial, so HMC will have to work a bit and we can see some R-hat values that aren't all 1 after rounding.

eval.stan

parameters {
  vector<lower=0>[5] theta;	   
}
model {
  theta ~ lognormal(0, 1);
}

A Student-t with 3 degrees of freedom also works.

A simulator using CmdStanPy

Let's kick the tires, as they say in my hometown of Detroit. We're going to generate some data using the test model

fit.py

import numpy as np
import cmdstanpy as csp
csp.disable_logging()

model = csp.CmdStanModel(stan_file='eval.stan')
fit = model.sample(iter_warmup=10, iter_sampling=200, chains=8, max_treedepth=1, show_progress=False)
print(fit.summary(sig_figs=3))

draws = fit.stan_variable("theta")
rhat_model = csp.CmdStanModel(stan_file='rhat.stan')
for d in range(5):
    theta = draws[:, d]
    num_chains = fit.runset.chains
    draws_per_chain = fit.num_draws_sampling
    theta_matrix = theta.reshape(num_chains, draws_per_chain)
    rhat_data = {
        "M": num_chains,
        "N": draws_per_chain,
        "theta": theta_matrix
    }
            
    rhat_fit = rhat_model.sample(data=rhat_data, show_progress=False)
    R_hat = np.mean(rhat_fit.stan_variable("R"))
    R_ok = np.mean(rhat_fit.stan_variable("R_ok"))
    print(f"{d=};  R-hat: {R_hat:0.3f}; Pr[R < 1.01]: {R_ok:0.3f}")

I find Python very readable, but let me walk you through it. We import and turn off some of the annoying warning messages. Then we compile the evaluation model, sample it, and print the summary. We set tree depth to 1 to force NUTS to act like Langevin and thus sample inefficiently. And we only go 10 warmup iterations and 200 sampling iterations over eight chains.

Then we extract the draws from the posterior as a big 3D array draws of shape (D x M x N), where D is number of parameters, M number of chains, and N iterations per chain.

Then we loop over all the variables in the model, with d going from 0 to 4 (Python indexes from 0, unlike R). For each variable we extract the draws as theta. We grab out the number of chains and draws per chain and reshape the matrix so that it's (M x N) and put it into a data dictionary (like an R list) called rhat_data.

Then we take the data and fit using the Rhat model and extract our Bayesian estimate of R-hat as the posterior mean of the variable R and same for R_ok. Then we report.

Here's the output with columns trimmed for readability.

(stanenv) rhat$ python3 fit.py
          Mean    MCSE  StdDev   R_hat
theta[1]  1.63  0.1070    1.75    1.04
theta[2]  1.46  0.0915    1.61    1.03
theta[3]  1.63  0.1890    2.06    1.07
theta[4]  1.50  0.0995    1.78    1.03
theta[5]  1.68  0.1110    1.98    1.03

d=0;  R-hat: 1.046;  Pr[R < 1.01]: 0.016
d=1;  R-hat: 1.019;  Pr[R < 1.01]: 0.306
d=2;  R-hat: 1.058;  Pr[R < 1.01]: 0.002
d=3;  R-hat: 1.011;  Pr[R < 1.01]: 0.616
d=4;  R-hat: 1.014;  Pr[R < 1.01]: 0.525

Looks reasonable on first glance comparing the estimate R_hat value from CmdStanPy and the Bayesian approach.

Because it's a hierarchical model, we can't directly calculate a maximum likelihood estimate---we'd have to marginalize and use max marginal likelihood. Marginalizing out the intermediate normal (alpha) would also be a better way to code rhat.stan. I imagine fitting the model in brms would also be more robust than my simple example code here---they have better priors. I'll leave all that as an exercise for the reader.

Citation

I went back to the original for the first time ever to verify that indeed this is right model for what R-hat is modeling.

Assistant professor position at USI in Lugano

This post is from Bob

Antonietta Mira writes,

Assistant Professor position (tenure-track) in Theoretical Data Science and Machine Learning, Università della Svizzera italiana, Lugano, Switzerland

USI is looking for a talented junior statistician for Tenure Track Assistant Professor position in “Theoretical Data Science and Machine Learning”. The deadline for applications is 30 November 2025 and we are looking for a starting date around the Summer 2026. Starting salary is 110,000 CHF per year. More details about the required profile can be found on:

Official announcement.

The faculty is located in Lugano in the South of Switzerland on the University’s main campus, 1.5 hrs from Zürich and 1 hr from Milan. It has been expanding its activities in Data Science, Statistics and Machine Learning. Since 2024, we are now having both a bachelor’s and master’s degree in Data Science and a master’s degree in Artificial Intelligence. In August 2026, we will be hosting the European Meeting of Statisticians in Lugano.
Timeline of the hiring process

  • 30 November 2025: application deadline
  • 2-4 February 2026: In person interviews in Lugano, Switzerland.
  • Summer 2026: envisioned starting date

For any questions, do not hesitate to contact the Dean, Prof. Ernst C. Wit ([email protected]).

Mitzi and I have visited Antonietta for a week on two separate occasions over the last few years and we absolutely loved Lugano the city, the USI campus, and getting to hang out with Antonietta. Teaching is in English and faculty salaries in Switzerland are much better than anywhere else I know in Europe (1 Swiss franc is about US$1.25).