Stein’s method, learning and inference -or- how to really monitor convergence and thin chains

This post is from Bob.

I’ve been thinking a lot about scores (gradients of the log density function) and how they can be used for convergence monitoring. We know that the expected value of the score is zero. Stein generalized this with Stein operators. In the monomial case, the Stein operators give you functions in increasing degrees, all of which have zero expectation in the posterior. Here theta is the variable being sampled and S is the score function, so that S(theta) is the gradient of the target log density evaluated at theta.

    Order 0: S(theta)

    Order 1: 1 + theta .* S(theta)

    Order 2: 2 * theta + theta^2 .* S(theta)

This leads to a natural test for convergence of first, second, and third moments. Just compute Monte Carlo estimates of these quantities and see if they’re zero. We’d want to standardize for standard deviation to make the result scale-free like R-hat. To develop some intuitions, in a standard normal distribution p(theta) = normal(theta | 0, I), we have S(theta) = -theta, and thus S(theta) converges to zero at the same rate as our variable theta converges to its true value; the order 1 test is 1 – theta^2, which we know has expectation zero because theta^2 has a ChiSquared(1) distribution with expectation of 1). The order 1 case corresponds to equipartition in physics and the form D + theta’ * S(theta) also naturally has zero expectation as shown in the viral theorem in physics in the 1870s.

Diving into this a bit more led me back to Jackson Gorham and Lester Mackey’s work on Stein’s method. They haven’t been sitting still since introducing the basic idea, which kernelizes the idea above. Mackey et al. have produced an absolutely wonderful summary of this body of work in two forms. The first is a dense, 41-slide deck with all the key definitions and results. I’d suggest at least skimming this first.

Lester Mackey. April 2026. Stein’s Method, Learning, and Inference.. GitHub.

Mackey along with Chris Oates and Qiang Liu, who have also worked heavily in this area, put together a definitive monograph. They’ve presented a great deal of difficult material in a way that I can digest (though it’s going to be rough going if you’re not well versed in sampling and how MCMC is traditionally measured and evaluated).

Qiang Liu, Lester Mackey, Chris Oates. March 2026. Probabilistic Inference and Learning with Stein’s Method. arXiv.

In particular, they go over Stein variational inference, which seems to me like it would be the ideal way to perform quasi Monte Carlo-like inference for statistical models if we could only get a robust version to scale. The idea’s to initialize a bunch of points, then use optimization to minimize a kernelized Stein discrepancy of the empirical distribution of those points to the true distribution.

The Kappa Zoo: David Eubanks’s online monograph on rating models

David Eubanks writes:

My site is kappazoo.com, and it’s still a work in progress. I would rather have emailed after I had the new goodness-of-fit code done, but I saw that you’re doing a summer workshop (on Andrew’s blog) [editor: Modern Modeling Methods (M3)], so thought I’d mention it now.

It may be billed as a work in progress, but it’s a complete draft with no missing sections that provides a really nice overview of rating/crowdsourcing models. These are the models that dragged me into statistics, namely Bayesian rating models formulated as noisy measurement models. The first model of this kind that I or Eubanks could find was Phil Dawid and Allan Skene’s (1979) paper on rating.

Eubanks works through a great deal of workflow without calling it that. There are multiple model evaluation and comparison measures used and explained with connections to information-theoretic notions like entropy.

There’s a long discussion of Cohen’s kappa statistic, which is a commonly reported statistic measuring inter-rater agreement. As Eubanks notes, it doesn’t deliver on its promise of adequately measuring inter-rater agreement. The discussion is quite good here and complementary to the discussion from me and Becky Passonneau in our paper on rating in NLP, though our conclusions are the same.

I was surprised to see that Eubanks has a section comparing item-response theory (IRT) models with difficulty. I’ve been trying to convince people this is important for years. It took me around ten years to figure out how to move from Dawid and Skene’s IRT-0-like model to an IRT-1-like model, which we report in in our latest paper on crowdsourcing with difficulty parameters (which also works through a lot of Bayesian workflow in considering different models). I can’t identify what took so long—it seems so obvious to me now.

Full day Stan tutorial at Modern Modeling Methods (M3) this summer in New York (22 June 2026)

This post is from Bob

Mitzi Morris and Bob Carpenter, two of Stan’s developers, will be presenting a tutorial on Stan and Bayesian data analysis aimed at psychometricians this summer.

Abstract

This workshop is a full day, hands-on introduction to Bayesian modeling and statistical inference using the probabilistic programming language Stan.

The course will be organized around the key properties of Bayesian statistical modeling for science, including the nature of uncertainty, modeling a generative process through a data generating distribution, modeling existing knowledge through a prior, and pushing uncertainty through inference. As we do this, we will show how Stan can be used to both code the models and perform statistical inference for quantities of interest, be they retrospective parameter estimates or prospective predictions or forecasts. We will concentrate on full Bayesian posterior inference, including a discussion of calibration, model checking for both prior and posterior inference, and model comparison with cross-validation. We will spend some time showing how some structural equation models (SEM) can be translated directly to Stan and will also introduce psychological models for educational testing, crowdsourcing, rating and ranking, and real-time decision processes.

This class will require a notebook computer with a network connection (Wifi will be available in the classroom). We will use the Stan Playground, which runs Stan in the browser, which we will pre-populate with models of interest. We will probably also break into R or Python at various points to demonstrate methods not yet supported by the Playground, such as the brms regression expression language.

Andrew on psychometrics

Andrew once told me that any model you could come up with was probably invented by a psychometrician 50 years ago (make that 60—he said it at least 10 years ago). I have evidence that he’s right form the project that drew me into Bayesian statistics—crowdsourcing. Andrew and Jennifer Hill helped me formulate a crowdsourcing model where raters give you noisy measurements of underlying categorical variables (e.g., they answer survey questions about whether a word in context is a noun, for example, to use something I was working on at the time). Turns out Phil Dawid and A.P. Skene published the same model in 1979 in one of the earlier applications of the expectation maximization (EM) algorithm and they used natural language data (drawn from medical records).

The rest of the conference

The rest of the program looks really great—it’s just the kind of applied wrestling with real data that I like.

Speaking of Jennifer Hill, she’s one of the keynote speakers at M3. Every talk of Jennifer’s I’ve attended has been great. You may know her as Andrew’s co-author on the regression books, which I cannot recommend highly enough if you’re interested in this kind of applied modeling.

Beyond the conference

You see the same kind of Bayesian modeling focus for real data at venues such as ISEC (international ecology conference to which I went to once just because I like these models and these kinds of conferences), StanCon (see you in Uppsala in August!), and GeoMed (which Mitzi attends). I’m sure there are more in other fields. I’m always disappointed that there’s almost nothing like these kinds of nitty-gritty applied papers at ISBA (Nagoya this summer) or BayesComp (somewhere next year). The conferences about Bayesian statistics or computing that I’ve been to have all been super theoretical.

Why are there squares everywhere in statistics (e.g., normal density, variance, least squares, etc.)?

I remember asking my colleagues at Carnegie Mellon this very same question as I was first learning basic statistics in the early 1990s and they gave the same kind of answers as I found more recently in the AskStatistics subreddit. It’s an evergreen question, coming up regularly enough I think it’s fair to say it’s a zombie. This is probably going to be familiar material to most blog readers, but if you’re like me when I first started reading this blog decades ago, then read on.

The answers are all over the place ranging from “the central limit theorem” to “it’s distance” to “it makes everything positive” to “it is smooth unlike absolute value” (another path to positivity) to “mathematical convenience” or “the math says so” (sure, but how?) to the “central limit theorem” (a more specific form of “the math says so”) to the link to entropy (it is the distribution that maximizes entropy for a given mean and variance), and so on.

I think the easiest answer for the person asking “why squares?” is due to Gauss by way of Pythagoras. Simply put,

    *** the mean is the number that minimizes square error. ***

That is, if I have a sequence x[1], …, x[N], then mu = mean(x) is the number that minimizes the error

    err(mu, x) = SUM{n in 1:N} (x[n] – mu)^2,

in the sense that

    ARGMIN_mu err(mu, x) = mean(x).

You can verify this by taking a derivative of the summation, showing that it’s zero at the mean, then confirming that minimizing, and checking that you have convexity by checking that the second derivative is positive.

Gauss realized that he could define a density where mean(x) is the maximum likelihood estimator. It is now called the “Gaussian distribution” or “normal distribution”. That is, we have

    log normal(x | mu, 1) = -1/2 (x – mu)^2 + const.

If we have a bunch of observations and want to maximize their likelihood, that’s equivalent to minimizing the error I wrote down above,

    ARGMAX_mu PRODUCT{n in :N} normal(x[n] | mu, 1)

        = ARGMAX_mu PRODUCT{n in 1:N} exp(-1/2 (x[n] – mu)^2) * const

        = ARGMAX_mu SUM{n in 1:N} log(exp(-1/2 (x[n] – mu)^2)) + log(const)

        = ARGMAX_mu N * log(const) + SUM{n in 1:N} -1/2 (x[n] – mu)^2

        = ARGMIN_mu 1/2 SUM{n in 1:N} (x[n] – mu)^2

        = mean(x)

This is why “ordinary least squares” fit for a regression with normal error terms yields the same result as maximum likelihood.

Some of the other answers in these threads are correct, but they failed to enlighten the original posters as to why there is a square. Just saying “because of the math” like my colleagues did is not helpful without saying which math! For example, the normal distribution, with its quadratic log density term, arises through the central limit theorem asymptotically. Squaring does emphasize outliers as people said, but that’s not the reason. If you say “for mathematical convenience,” it would help to say what it buys you, such as the sum of normals remaining normals, normal being its own conjugate prior in regression, integrals being very nice, tails being very nicely behaved, the connection to maximum entropy, residence in the exponential family, etc. etc.

Why not absolute values instead of squares?

We can use absolute values. In that case, medians replace means, because absolute error has the pleasant property that

    *** the median is the number that minimizes absolute error. ***

In symbols, that’s

    ARGMIN_mu SUM{x in 1:N} abs(x[n] – mu) = median(x)

Connection to Bayesian inference

If our model is correct (big “if” that’s almost always false, of course), then we know that the posterior mean is the parameter estimate that minimizes expected square error, whereas the posterior median minimizes expected absolute error. Just like in the simpler case. It’s important to note methodologically that we don’t get a natural point estimate out of Bayesian inference without specifying an error function. If the error function is quadratic, the error minimizing estimate is the posterior mean, whereas if the error function is absolute error, the error minimizing estimate is the posterior median.

The multivariate normal case

There is a strong relation to distance, as some of the answers on the AskStatistics subreddit said. More precisely, squared error is just squared Euclidean distance from the point [mu mu … mu]. This points the way to how Gauss defined the multivariate normal distribution. Simply replace the squared Euclidean distance term (x[n] – mu)^2 term with a quadratic form involving the inverse covariance,

    log normal(y | mu, Sigma) = -1/2 (x[n] – mu)’ * inverse(Sigma) * (x[n] – mu) + const,

where Sigma is a positive-definite covariance matrix. This quadratic form is the distance in the space defined by the Euclidean metric inverse(Sigma). The resulting distance over vectors is known as the “Mahalanobis distance” in statistics. For example, if Sigma = [[1, 0.9], [0.9, 1]], for a highly correlated bivariate normal, it will be easier (less distance under the quadratic form) to move along the diagonal x = y than transverse to the diagonal x = y. That is, the points [1 1] and [2 2] are closer to each other given the inverse metric Sigma (distance approximately 1.05) than [1 1] is to [1 1 + sqrt(2)] (distance approximately 10.5), even though they are the same simple Euclidean distance away from each other.

Alchemize: PyMC’s model to replace Stan/PyMC, etc. with an LLM

This post is from Bob

I’ll let Thomas Wiecki, who is one of the core PyMC devs and one of the partners at PyMC Labs, speak for himself here:

If you haven’t seen what people are doing with agentic AI, this is a good example. I’m really happy that Thomas and PyMC Labs are sharing their thoughts and initial tries at things like this as I think it has the potential to benefit everyone working on modeling.

If you want to see the basis of the agent’s instructions, check out the “skill” for PyMC that Chris Fonnesbeck wrote.

We’ve already batted this around a bit in email with Thomas, so I can summarize some talking points:

LLM-based chatbots are really good at translating. Compiling (or more technically correct, transpiling) a statistical model down to a language like Rust or C++ or JAX is a kind of translation.

You can start from PyMC’s execution trace, but you can also start with a model description. You could also start with something like Stan code.

The biggest bottleneck to deploying Bayesian models in my opinion is the inherent variance and unreliability of MCMC-based inference. Our workflow proposals are all about making sure this doesn’t go wrong. Wiecki’s point here is that we can have the bots go through the workflow. Iterating until the gradients and log densities match is a good example, but this could be extended to more parts of workflow.

The skills feel a lot like writing a textbook for a bot. I have no idea how hard or easy this is or how much it improves over the baseline. Jeremy Magland built a RAG-like helper for Stan that compressed the Stan Reference Manual down to 1K tokens for context (like a skill) and allowed it to search and import from the Stan User’s Guide, but never measured how much it improved over the baseline. It really feels like it should also have the Stan Functions Reference, BDA3, Regression and Other Stories, and the Bayesian Workflow book, as well.

Hopefully we’ll asymptote at writing a textbook sized set of skills and not have to write one per target model (that is, something like the Stan User’s Guide, Reference Manual, Functions Reference).

I’m curious as to whether it will eventually be able to make writing hard models easier. I’m thinking of efforts like epinow2, which involves a very large chunk of Stan code.

As the foundation models and chatbot tuning changes, there’s going to be an issue of regression testing and tuning for whatever the latest models are.

P.S. This effort explains how Thomas was able to create the huge posteriordb pull requests for PyMC (#320 and #319)!

P.P.S. The latest thing Claude (Opus 4.7) did that impressed me was generate the ess(MatrixXd, vector) function in summary.hpp. This function estimates effective sample size Stan style (discounting for R-hat > 1) on a ragged array of Markov chains. We have to generalize all the posterior analysis tools to deal with the new asynchronous parallel sampler). I had Stan’s ESS function and all the other functions I’d written for the ragged structures to give it as a guide. It’s very easy to code review that it matches Stan’s implementation for the new data structure. I only had to tweak the output a little bit for style.

David W. Hogg on why we do astrophysics (in the face of LLMs and the lack of clinical value)

David W. Hogg, who takes his role as scientific gadfly seriously, recently (February 2026) released an interesting rumination on science, LLMs, and the human component of research.

I like that his first move is to insert an ungrammatical question mark into the title to question the whole process.

There is a lot to unpack here, starting with the claim that LLMs “show no signs of intelligence.” I seriously don’t know what “signs” people are looking for other than the ability to converse in 200 languages fluently in just about any subject known to humanity. But like most AI benchmarks, we tossed out the Turing Test as soon as a computer could make a passable go of it.

One comment I can get behind is, “a data scientist who has taken an astronomy class might be better prepared [to deal with astrophysics data and modeling] than an astronomer who has taken a data science class.” The usual feeling among scientists is the opposite, but I’m guessing this is because they’ve just never worked with truly talented programmers like the ones we have at Flatiron (e.g., Brian Ward, Steve Bronder, Jeff Soules, and Robert Blackwell, among the folks with whom I work).

The note is stuffed with both descriptive accounts and normative statements such as “Every scientific paper is written to help all of its writers, and all of its readers, learn and grow, no matter their career stages.” I’m imagining living in the Big Rock Candy Mountains of academia, where that’s true.

I just don’t get “I believe that (in astrophysics) software is written to support the astrophysics literature, and that every important piece of software should have an associated paper in the astrophysics literature.” Why not publish software papers in software venues like JOSS? Most literatures won’t even publish software papers, so good on the astrophysics folks for allowing them.

Folks who bemoan the publish-or-perish mentality in academia should consider the statement, “My point is that astrophysics is the astrophysics literature.” I agree if you extend that to include things not peer reviewed like software.

David takes on workflow with the contentious statement, “a lot of ‘implicit knowledge’ or folklore about things like how to observe, how to reduce data, how to organize projects, how to visualize data and models, how to read and write, and so on. Much of this never appears in the literature. Is that not also astrophysics? Yes it is, but it is astrophysics practice.” I’m postmodern enough to believe it’s impossible to separate astrophysics from astrophysics practice. Isn’t the computing also astrophysics practice? He wants that in the astro literature. For this, he concludes, “I would welcome a project in which we tried to make much of this implicit knowledge explicit.” Just not in the literature? We never could get the workflow paper published.

I know David’s senior enough to know how the sausage is made, so I don’t see how he can claim, “Papers (and the authorships on those papers) and the citations of those papers are not ‘coin’ of anything!” Of course they’re the “coin” of both hiring, tenure and promotion committees. I believe Andrew has gone on record saying people value publications so highly because nobody wants to devalue the coin that put them where they are (though I’m sure he said it more cleanly than that).

David says, “it isn’t really correct, if you use an LLM, to give that LLM co-authorship on your paper,”. It’s not legal in the United States, either.

More normative statements like, “You can’t decide not to cite a relevant paper because you don’t like the author, or don’t like the author’s institution, or don’t like their funding sources.” I guess he hasn’t hung out in linguistics. Or mathematics. My undergraduate advisor, Ed Palmer, was a student of Frank Harary. Harary and Béla Bollobás developed an entirely parallel theory of random graphs, proving all the same theorems and doggedly refusing to cite each other.

This just seems patently false: “Anyone who has the capability of getting a PhD in astrophysics has the capability of doing many remunerative things, substantially more remunerative than my job.” He then goes on to one of his main philosophical (economic?) points, “If all we really wanted was to know how the Universe worked, we would start a hedge fund, and use the proceeds to pay an astrophysics institute, filled with people who wanted to do astrophysics rather than find out the answers.” I’m not convinced that someone who’s good at science will also be good at business. I think folks like Jim Simons, who co-founded our institute with Marilyn Simons, is an exception. Sure, if you can make tens of billions of dollars, you can fund a lot of science. Or buy a lot of Jamaican beef patties to bring it back along to themes of the blog.

I don’t understand why he says, “No astronomer (that I know) is improving the calibration of JWST instruments because they want the US Navy to have a higher kill rate.” Is that because he doesn’t think people who do this should be called “astronomers?”

Some of the arguments are silly, but at least get those niggling question marks, “We put ‘the universe’ in ‘the university’?” I also find trickle-down theories of training to be particularly weak. Here, David says, “We train a technical workforce.” Sure, but training a technical workforce by teaching them astrophysics seems very inefficient—see the point above on hiring data scientists versus astrophysicists. Perhaps the weakest argument is that astrophysics funding takes away from even more dangerous things the government could be funded, under the heading “We beat ploughshares into swords.” I also think “We create opportunities for development.” in the sense of development sites in Chile and similar places for observatories. Just think about how useful that money could be spent in some other way for development? On the other hand, I can fully get behind, “Astrophysics is a satisfying activity.” I’m still surprised I get paid to do what I love!

A lot of this paper is about LLMs, but I don’t think that’s the interesting part. Right in the abstract, David discusses, “two possible (extreme and bad) policy recommendations related to the use of LLMs in astrophysics, dubbed ‘let-them-cook’ and ‘ban-and-punish.’ I argue strongly against both of these; it is not going to be easy to develop or adopt good moderate policies.”

As a good writer, he sticks the landing. First, by reframing the question as, “Finally: Why did I write this white paper? I wrote this because I became concerned about some of the ideas circulating in the astrophysics community about LLMs and their capabilities, conflating what are (in my view) text-interpolators with what are (in my view) scientists.” Then concluding ” Ultimately, I think the real question we face—if we do indeed face a question—is not the question of how we do astrophysics. It is the question of why we do astrophysics.”

Expanding the Stan User’s Guide

This post is from Bob.

The Stan User’s Guide has been evolving organically along with the project. I’m writing this post for two reasons—to let you know what we’ve added lately and to encourage you to add your own chapters.

History

Initialy there was just one doc that included what is now the Reference Manual, Function sReference and CmdStan Manual. You can browse them all on the web or download pdfs from the Stan documentation web site.

Most of the topics grew up opportunistically from either things the team was interested in, new features we added to Stan, or just trying to cover the basics of the chapters of Bayesian Data Analysis and Data Analysis Using Regression and Multilevel/Hierarchical Models.

Recent additions

There have been some recent additions, including some from new faces:

There are also a metric ton of minor clarifications, fixes, and model examples, such as Mitzi Morris’s additional sufficient statistics optimizations and you can simply look at all the documentation pull requests.

The future is you

You can see from Franziska’s pull request, Abner’s PR, and Brynjólfur’s PR that we provide a lot of guidance and feedback. I found it super useful to write a lot of the documentation when I was learning statistics because I got feedback from Andrew Gelman, Ben Goodrich, Aki Vehtari, and Michael Betancourt, among others.

Some topics that would be nice to add

There is a lot to add to our existing discussions and you can see a lot of that in the pull request list on our documentation repository. There are also many topics that we don’t cover or don’t cover well. Off the top of my head, roughly ordered near the top according to my assessment of how impactful it would be to add.

  • causal inference using Rubin’s potential outcomes framework
  • Thompson sampling for reinforcement learning/bandits (sorry, Andrew!)
  • penalized complexity priors and the Besag-York-Molie II spatial model (working from Mitzi’s Morris’s case study)
  • Extreme value models
  • RNA-seq and DNA-seq composition models at the read level and k-mer level
  • Econometric models (I don’t even know what these are, but there are books)
  • Stationarity-constraining parameterizations of VAR models a la Sarah Heaps
  • Neural networks
  • A/B testing (ideally sequential)
  • spatio-temporal models along the lines of Leon Held’s survey
  • non-trivial examples of ODE models like Lotka-Volterra (my case study), SEIR (Elizaveta Semenova et al.’s case study) or even better, pharmacokinetic/pharmacodynamic (multiple case studies around Torsten), soil carbon (my very early case study, but there’s been much better work since)
  • Hilbert-space approximations to Gaussian processes (GPs)
  • PDE approximations to GPs
  • Fourier space analyses of GPs
  • N-gram language models
  • Plackett-Luce model for contest with more than two players (this can follow my case study for StanCon)

Some of those may already be mentioned in our doc, so sorry if you’ve already written something. Also, feel free to suggest more topics in the comments.

In ML, everyone’s Humpty Dumpty

This post is from Bob.

I used to work in natural language semantics, and the following dialogue from Lewis Carroll’s Through the Looking Glass, and What Alice Found There was the most common pull-quote to see at the beginning of a thesis.

“When I use a word,: Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean — neither more nor less.”

“The question is,” said Alice, “whether you can make words mean so many different things.”

“The question is,” said Humpty Dumpty, “which is to be master — that’s all.”

Humpty Dumpty came to mind recently after a spate of discussions with ML folks about inference (i.e., what they call “learning”).

What in the world does “empirical Bayes” mean?”

Empirical Bayes came up on Wednesday with some ML folks I was talking to, then I ran into Dave Blei this morning, who told me he’s giving a sequence of talks on empirical Bayes over the next few weeks (at Columbia and at University of Chicago). I asked him what “empirical Bayes” meant to him, because it seems to be used very fluidly in ML. He gave me a new usage, saying he used it for any model that uses data to fit parameters of a prior, including just plain old hierarchical modeling fit with sampling. Dave gave the example of ARD in Gaussian processes (aka, hierarchical models).

I would only use the term in the way described in the Wikipedia entry for “Empirical Bayes”, namely

… empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out.

I pinged Mark Goldstein, one of our top-notch ML postdocs, and he pretty much reeled off the Wikipedia definition. So there seems to be a lot of variation in how this is used.

Robbins on Empirical Bayes

Dave also pointed me to the following video by Herbert Robbins (yes, that Robbins, who was so far ahead of the computational statistics curve that he introduced stochastic gradient descent and multi-armed bandits in the early 1950s).

Terminology drift in ML

I’ve been a bit shocked at how many technical terms have drifted in meaning in ML. I’m not talking about people making honest mistakes or clueless mistakes, I’m talking about true drift in meaning where the ML folks will stand by their definitions.

Likelihood: I’ve seen “likelihood” used for what I’d call the data generating distribution and sometimes just as a synonym for density. Aki has this one covered in his recent post, A data model is not just a likelihood. In stats, the likelihood is defined as the function L(theta) = p(y_obs | theta)—that is, it’s a function of theta for some fixed observed data.

Causal: With the advent of LLMs, anything with an autoregressive structure is now being described as “causal” in a “past causes the future” sense. This is even being extended to arbitrary directed graphical models, which are now being called “causal” even when there’s no explicit causality being modeled. That is, you can now describe a simple regression from an observational experience as “causal” with no extra work.

Estimation: Estimation is almost always called “learning” in ML.

Parameters: These are usually called “weights” for neural networks, which I think now make up 99.9% of all work in ML. But if you tell ML folks that neural networks are parametric models, they’ll most often deny it. A statistician would confusingly call a neural network a “non-parametric model” and then tell you that means it has a lot of parameters.

Inference: In our diffusion model reading group, the ML postdocs tell me that “inference” means what I would call “posterior predictive sampling”. For example, generating output to a query from an LLM would be called “inference.”

Bias: In statistics, this usually means expected error. In ML, it’s heavily overloaded. It can be used to name just about any kind of error measure (e.g., errors in Matt Hoffman’s sampling papers). ML folks also use the term “bias” to mean the intercept in a regression (no, I’m not kidding).

Prior: This is often called an “inductive bias” in ML circles, which can include aspects of data generating distributions as well as priors.

In Bayesian statistics, a prior is the marginal distribution over parameters. In ML and informal presentations of “Bayesian statistics,” it’s just any marginal that gets plugged into Bayes’s rule. For instance, the prevalence of a disease p(disease+) is called the “prior” when I evaluate positive predictive accuracy p(disease=+ | test=+) given a testing sensitivity distribution p(test=+ | disease=+). In a k-way classification, “prior” means the marginal distribution over categories.

Bayesian: I think in ML this term is used very broadly for any situation in which there’s a prior not strictly stated as a penalty function for penalized learning (i.e., regression). I think Andrew’s down with this definition as he also uses Bayesian for anything that looks vaguely Bayesian no matter how inference is performed. For example, Empirical Bayes is just Bayes to Andrew, as is using a Laplace approximation or even a simple maximum likelihood estimate (just think of it as a one-point posterior summary!).

Regression: Note quite on topic, but I think of neural networks as just a GPU-friendly form of non-linear regression.

Uncertainty quantification: This is the primary subject of statistics, though I think the term “uncertainty quantification” is much more prevalent in engineering/signal processing than in ML. There are even journals of that title that look sort of like statistics journals.

I’d find reading ML papers easier if there was less meaning drift from well-established terminology. I’m not saying the ML folks should be up to date with Gelman’s idiomatic statistical lexicon (the concepts are fun, and it can be useful for talking to Andrew and people in his circle like the blog readers, but I wouldn’t recommend using these terms in papers without explanation).

I’m sure there are many more terms have been coined or have drifted one way or another that I’m forgetting about.

Nutpie: state-of-the-art mass matrix adaptation for HMC

This post is from Bob

Nutpie laps NUTS

The short story is that the Nutpie sampler is usually twice as fast as Stan’s sampler for Stan models and twice as fast as PyMC’s samplers for PyMC models. The speed improvement is due entirely to better mass-matrix adaptation. Everything else is the same—dual averaging for step size adaptation and the biased-progressive, multinomial no-U-turn sampler for Hamiltonian Monte Carlo. Note that the speed is not coming from Python or Rust; we have a C++ implementation in the works we plan to interface with Python and then R.

The backstory

Adrian Seyboldt, a member of PyMC Labs, designed the sampler and released the open-source implementation linked above. He built it so he could try out Rust! Adrian’s also behind Stan’s game-changing new sum-to-zero parameterization along with Sean Pinkney. He’s a physicist by training and has amazing geometric intuition.

I tried Nutpie and it didn’t disappoint. So I wanted to understand the algorithm. I couldn’t understand the Rust code and Adrian hadn’t written any pseudocode or algorithm description. Eliot Carlson, with a recently minted MA in stats from Columbia, showed up at my office on the recommendation of Philip Greengard to look for a computational stats project. I asked him what he thought about Rust and like any great grad student, he wasn’t daunted by the fact that he knew nothing about Rust and was a statistician by training. With serious digging and some help from Adrian, Eliot extracted pseudocode descriptions and we were able to formulate the algorithm in what I hope is an understandable way. Eliot now works full time at PyMC Labs.

The paper

The arXiv paper just came out yesterday.

We’d love to hear any comments, corrections, or suggestions people might have.

My role in this was just helping to formulate the whole thing as a stats paper in a language that computational statisticians could understand. Adrian and Eliot worked out the proofs, from which I learned a lot.

Why is Nutpie better?

There are two basic reasons why Nutpie’s diagonal mass-matrix adaptation is better than Stan’s:

1. Better initialization: Use the gradients at the initial point to estimate mass matrix and make the whole sampler scale invariant.

2. Two sources of information: Use the position and the gradient of the log density at the position as two sources of information with which to estimate a mass matrix

The motivation for (1) is that the outer product of the scores is a one-sample Monte Carlo estimate of the expected negative Hessian of the target log density. The one-sample estimate is regularized by taking a geometric average with an identity matrix—using a few more draws here may be more stable.

Will wonders never cease, but it turns out to be relatively simple to take the geometric mean as the midpoint of a geodesic in the affine-invariant Riemannian manifold (AIRM) of positive-definite matrices. It works out to regular geometric mean in the diagonal case, making it simple to implement.

The motivation for (2) is that in a multivariate normal distribution with covariance Sigma, the covariance of the draws is Sigma and the covariance of the scores (gradients of log density of the draws) is Sigma inverse. These are then geometrically averaged to produce an estimate. This can converge very fast—many fewer iterations than the dimensionality of the problem.

As with Stan, the default in Nutpie is to use a diagonal mass matrix. Unlike Stan, Nutpie also lets you use a low-rank plus diagonal approximation (as in L-BFGS optimization), which can go a long way to decorrelating hierarchical models. Using a rank K approximation, what would otherwise be N^3 operations like solves or matrix products become K^2 * N, which is manageable for K of order 10 or less.

Mathematical motivation

The geometric average of the covariance of the draws and the inverse covariance of the scores minimizes the Fisher divergence from a standard normal distribution to the preconditioned target density. Fisher divergence is like Kullback-Leibler divergence, but instead measures the expected distance between gradients rather than the difference between log densities. This turns out in practice to lead to a better diagonal preconditioner. The paper provides proofs and an appendix with geometric motivations.

What we do not prove and what I do not believe anyone understands is how to select an optimal static preconditioner when the target log density has varying curvature, in any form: diagonal, low rank plus diagonal, or dense. I think what we really need is dynamic adaptation that isn’t cubic like Riemannian and doesn’t require implicit integrators.

SparseNUTS: Preconditioning hierarchical models in HMC with a sparse “Laplace approximation” at the marginal mode

This post is from Bob.

Cole Monnahan (who came up with the idea and did the heavy lifting), two of his colleagues from the Template Model Builder (TMB) project, Kasper Kristensen and James T. Thorson, and I just put a paper up on arXiv.

If you have any feedback on either, we’d love to hear it.

The method

It’s quite simple if you know how the components work. The tricky part is getting them all to play together nicely, efficiently, and robustly.

1. Take a max marginal mode of a hierarchical model, of the kind produced by the linear mixed effects package (lme4) or Template Model Builder (TMB). Both packages are implemented in C++ under the hood and distributed with an R interface.

2. Center a second-order Taylor series approximation at the marginal mode using a precision matrix (inverse covariance) rather than covariance matrix. This is like a Laplace approximation, but it’s not centered at a global mode.

3. Take the resulting sparse precision matrix and use it to precondition the target density. This is equivalent to using the precision as a mass matrix in HMC, as Radford Neal showed in his introduction in the MCMC Handbook. This approach is necessitated by Stan’s lack of sparse mass matrix support.

4. Use TMB’s R interface and Andrew Johnson’s StanEstimators package to make the model available to Stan’s samplers and other tooling.

There’s nothing Stan-specific about this technique. It could be rolled into PyMC and NumPyro with a lot of work if JAX’s experimental sparse library is up to the task of duplicating what TMB is doing.

Empirical evaluation

The paper is evaluated on 15+ realistic models, some of which can be scaled. The results show that with this preconditioning, Stan can scale to 10K+ parameter hierarchical models when there is sparseness and high correlations.

The paper demonstrate how much better SparseNUTS is than trying to use Stan’s built-in diagonal or dense mass matrix estimators. Obviously, the dense mass matrix won’t scale to 10K parameters as the mass matrix will have 100M entries and that becomes nearly impossible to estimate or manage computationally in Stan. Stan is very slow to estimate mass matrices in these cases (WALNUTS should be better, which we hope to roll out soon), but the real problem is that diagonal preconditioning is insufficient for hierarchical models.

Should you use it?

Yes! The form of hierarchical models addressed by SparseNUTS are widely used and Stan by itself simply cannot fit them. SparseNUTS also provides much more flexibility than INLA in writing custom likelihoods and priors.

Why not code this method in Stan?

We do not have the sparse Hessian tooling within Stan’s automatic differentiation library to implement this directly. Nor are we likely to roll out something for specific model classes, as we try to stick to black-box techniques in the core of Stan. This isn’t because we don’t like model-class specific methods. We are happy to implement them in special packages like brms. It’s just because one has to limit one’s scope somehow to keep a software project manageable with limited developer hours.

Some alternatives on the horizon

We are evaluating how far we’ll be able to scale Charles Margossian’s black-box approach to marginalization, which Steve Bronder and Charles have gotten into shape so it will be in the next Stan release. As INLA demonstrates, marginalization can be an efficient alternative to sampling the whole model if you can get the marginalization right.

We’re also evaluating how far we can get with Adrian Seyboldt’s Nutpie sampler’s approach to estimating low-rank plus diagonal mass matrices based on Fisher divergence, which could also potentially solve this problem as most of the structure in hierarchical models is low rank. Adrian and Eliot Carlsen (both of PyMC Labs) and I are writing that up now and we will release the paper soon. So far, Nutpie’s low-rank plus diagonal adaptation works incredibly well for some problems, but fails spectacularly on others for reasons we do not yet understand.

What do Socrates, Herb Simon, and Andrew Gelman have in common? A love of the oral tradition

This post is from Bob.

With all the hand-wringing over generative AI ruining people’s ability to think, I thought I’d turn the clock back a few millennia, a few decades, then a few minutes.

Socrates

Let’s start by turning the clock back a few millennia when the deep thinkers were wringing their hands over writing. Writing wasn’t exactly new in Socrates’s time, but it wasn’t widespread in the pre-classical and very early classical period. Here’s Plato channeling Socrates in Phaedrus.

And so it is that you by reason of your tender regard for the writing that is your offspring have declared the very opposite of its true effect. If men learn this, it will implant forgetfulness in their souls. They will cease to exercise memory because they rely on that which is written, calling things to remembrance no longer from within themselves, but by means of external marks.

What you have discovered is a recipe not for memory, but for reminder. And it is no true wisdom that you offer your disciples, but only the semblance of wisdom, for by telling them of many things without teaching them you will make them seem to know much while for the most part they know nothing. And as men filled not with wisdom but with the conceit of wisdom they will be a burden to their fellow

You know, Phaedrus, that is the strange thing about writing, which makes it truly correspond to painting. The painter’s products stand before us as though they were alive. But if you question them, they maintain a most majestic silence. It is the same with written words. They seem to talk to you as though they were intelligent, but if you ask them anything about what they say from a desire to be instructed they go on telling just the same thing forever.

As Shane Parrish says on the philosophy blog from which I lifted the quote, “Oh the irony of having an argument against writing in a written text.” Of course, it took Plato to write it down. The reason Socrates’s name is not on the Columbia library is because he didn’t write! We only know of him because his student, Plato, did write, as did Plato’s student, Aristotle.

I’m old enough to remember the same arguments about diminished thinking ability rolled out against calculators, personal computers, the internet, and Wikipedia. I’m a too young to remember the same arguments leveled against slide rules, sound recording, motion pictures, or television.

Herb Simon

I sat in on Herb’s graduate-level cognitive science classes in the late 1980s at Carnegie Mellon. They were fascinating and largely focused on associative vs. logical reasoning in a production system framework. Kahnemann later popularized this distinction using the terms “fast” and “slow.”

That’s all background. The relevant fact for today’s discussion is that Herb started every semester by telling students to put away their paper and pens, saying they were distractions from attention. Herb would let you submit an essay arguing for note-taking being cognitively useful and he’d let you take notes if you turned it in. This is a level of dedication to the oral tradition that is unique in my experience. People today think phones are distracting, but if you’re serious, you’ll realize pencil and paper are also distracting. Herb did project slides from a computer (against the express wishes of our humanities and social sciences dean, who thought projectors and computers in classes would ruin thinking). He also used the board. So not purely oral. That’s the perfect segue to our contemporary subject…

Andrew Gelman

If I understood Andrew correctly when he was talking about his classes the other day, he won’t let students use phones or computers in his class. I believe this is for the same reason Herb didn’t want them to use pencil and paper. Namely, that he thinks it takes their attention away from focusing on learning the material. I don’t think this anti-phone, anti-computer stance is unusual—I’ve heard a lot of professors complain about students use of devices and try to enforce some kind of ban.

What’s unusual about Andrew these days is that he doesn’t use slides for his public talks. Even for big deal talks like plenary sessions at ISBA. It’s straight up oral tradition plus pantomimed graphs. Being Andrew, he doesn’t even need notes. I believe this dedication to the oral tradition in presenting statistics is unique to Andrew.

I don’t know if Andrew uses slides in his classes. When I was at Columbia, I sat in on classes where he would put the title slide up, then riff on interesting side material for an hour without ever advancing the slide deck (for me, that master class style was awesome). Andrew often mentions that he does live coding in his classes, but I don’t know if he lets students follow along on their computers.

Back to the oral tradition

If we combine Andrew’s no slides approach and Herb’s no notes approach, we’re back to Socrates’s happy place. A pure oral tradition.

P.S. I’m not advocating giving up writing. Or giving up taking notes. Or not being able to use the internet during class. Or giving up LLMs, for that matter. Then again, my attitude’s purely theoretical—I haven’t been responsible for a college class since 1996, which was still before mobile phones and computers were available to students.

Everything I need to know I learned in Little League

This post is by Bob

“Little League” is what we call baseball for kids in the United States. I often tell people that I learned a ton about how to behave and how to approach problems, teamwork, and life in little league. Turns out I’ve been saying that for a while. My sister just sent me this little poster I made for my dad at some point.

Dad repeated this advice regularly, even decades after my baseball-playing days. I still believe it’s good advice, so I’m sharing.

I put the most important advice first—keep your eye on the ball. That’s really key to just about anything.

I have found that hustle is also really critical in life. Dad and I loved hustling baseball players like Pete Rose. Dad used to drive me from Detroit to Cincinnatti in the early 70s to see the Big Red Machine in person, then drive back for work the next day. I find it demoralizing today how players just watch their hits rather than hustling as soon as there’s contact. I really miss “little ball”, which is why Cleveland’s my favorite team (that and it’s Mitzi’s home town).

The staying loose part is also really important and really hard. No editor, so I included keeping your eye on the ball twice. Without the duplication, I could have saved enough space to not cramp the bottom—otherwise, my graphical layout’s pretty good.

For me, sportsmanship is really critical. I also makes me sad that players only shake hands with their own team after the end of the game. We always had to go and shake every other player’s hand and tell them “good game” (even if it wasn’t). And the pros did the same.

I can’t emphasize the teamwork advice enough for the real world—part of that should have said “there’s enough credit to go around.” I should have put that higher up. Listening to how star players respond to interviews is key—it’s usually along the lines of, “I’m just trying to play my role and help the other 8 guys out on the field.”

Getting in front of the ball is also super important not only literally in baseball, but also metaphorically in life. You can do so much by just getting in front of the ball. It might hurt a bit when it hits you if you can’t catch it cleanly, but at least it didn’t get by you! I might rephrase “throw overhand” as “take the straight ahead approach” rather than “trying to get fancy.”

As a bonus, my sister also sent along this photo of our Little League days in Detroit.

That’s dad in the back and me on the far left of the back row. This is 1972 or 1973, so I was 8 or 9 years old (top row, far left) and dad was only 29 or 30. At the time this was taken, he was paying his way through law school photographing sports teams and accident scenes (I tagged along to both and “helped” in the darkroom). I love the attention to detail in the arrangement of gloves on the first row and the classically crossed bats—I also learned photography and design from dad, not to mention printing. Also notice how poor the neighborhood was. One of my teammates, Adam (can’t recall his last name), is wearing dress shoes; you can’t see Tibor’s shoes in the back, but they were mostly duct tape. We couldn’t even afford new baseballs, so I’m not sure how dad managed the spiffy uniforms—probably shilling a pizza joint or auto body repair on the back.

 
 

EVERYTHING I NEED TO KNOW I LEARNED IN LITTLE LEAGUE*

DAD ON THE GAME
keep your eye on the ball

DAD ON HUSTLE
run, don’t walk

DAD ON BATTING
stay loose, keep your shoulder down & keep your eye on the ball

DAD ON SPORTSMANSHIP
don’t be a bad loser & don’t be a bad winner; shake hands

DAD ON TEAMWORK
there are 8 other players to help you

DAD ON FIELDING
get in front of the ball & keep your eye on it

DAD ON THROWING
overhand, it goes straight

* FROM MY DAD [Mack L. Carpenter]

Seven-parameter drift-diffusion pdfs and cdfs now in Stan

This post is from Bob.

Drift-diffusion models

Whew. The cdf function for the seven-parameter drift-diffusion model was just merged. The pdf was merged a few months ago. This is a big deal. These pdfs and cdfs are used for in decision-time models in cognitive psychology. There’s a really nice illustration through NLM on nih.gov. The basic idea is that you have a binary task like deciding if an image is red or blue. The data being recorded is time to decision and the decision being made. The underlying generative model is a continuous Wiener diffusion process that has a lag time to get started before drifting with some bias toward opposing decision boundaries. The decision is determined by when it crosses a boundary and which one it crosses (see the illustration). The cdf is important when the task ends before a decision is made, giving you censored observations, which require cdfs or truncated pdfs to implement.

The first time I saw this model being applied was by Bruno Nicenboim and Shravan Vasishth (psycholinguists at Potsdam at the time, though Bruno has since moved to Tilburg) about six or eight years ago. At that point, it took Stan a month or so to fit the model (yes, that’s a month, not a typo)—you may know them as two of the three authors of the really wonderful book, Introduction to Bayesian Data Analysis for Cognitive Science (2025, CRC), which, in its final chapter, covers accumulator models of which the drift-diffusion model is one form. Now these models are very fast in Stan with the new built-in functions.

The pull requests and engineering challenge

Hats off to Franziska Henrich, a cognitive psychologist and Stan developer at the University of Freiburg, aka GitHub user Franzi2114, for writing the code and bearing with Steve Bronder’s hundreds of comments and fixes and my final round of a hundred or so change requests. You can see all the gory details in the discussions around the pull requests and in the code itself.

These pair of functions were perhaps the two hardest functions to get into Stan for a myriad of reasons. The most challenging obstacle beyond the inherent complication of the functions themselves is that our testing framework for densities can’t handle seven-parameter densities. So all the tests had to be projected into subsets of parameters (and seven choose four or five or whatever it was led to a lot of tests). A further difficulty is that to make the arithmetic stable, the code branches all over the place (see the Hartmann and Klaeur article linked below), which also complicates testing.

Some academic background

In addition to Vasishth et al.’s book chapter, there is a vast literature on drift-diffusion models in cognitive psychology and elsewhere. Most relevantly, Franziska wrote an open-access article about the model and the Stan implementation.

Luckily, Hartmann and Klauer provided the derivatives in a previous (closed-access) article.

As is often the case, you can find a pdf through Google Scholar. It’s a daunting pile of mathematics that puts the “M” in “mathematical psychology.” Luckily for us, the authors published an R implementation in package WienR on CRAN (the name is because it’s the Wiener diffusion model underlying the process), which Franziska could use for testing.

Coming to the Stan language next release

We just put out a new Stan release, so we have plenty of time to get the language wrappers around the math library functions before the next release of Stan. Ideally, we’ll also have a User’s Guide chapter with examples of how to use them. We’re always open to new *User’s Guide* chapters about models or methodologies in wide use, and as you can see from this example, we take pull requests, which go down much more easily with the User’s Guide.

A slew of improvements to NUTS

This post is from Bob.

Hold onto your hats, because 2026 promises to bring a whole slew of improvements to MCMC for continuously differentiable densities. The really awesome part is that all of these improvements are orthogonal, so they stack. I’m going to list the ones we know work first, followed by a couple in which we have high hopes.

I’m currently working on implementing all this in a C++ reference implementation, which I plan to roll out with an interface like that of Nutpie. You can follow here:

You’ll find the work in progress on branches. We are, of course, always happy to get feedback on the code, and I’m happy to get feedback on these or other ideas for sampling.

Fisher divergence for mass matrix adaptation

Adrian Seyboldt developed much faster and more robust and better targeted adaptation using Fisher divergence for his sampler Nutpie. I’m helping Adrian and Eliot Carlsen finish up a paper on this, which we hope to release in a week or so. In addition to better diagonal and dense estimators, it contains a really nice low-rank plus diagonal preconditioner that seems very effective (the risk is getting stuck too much in the subspace defined by the low rank structure). It also contains all the mathematical proofs, thanks to Adrian. The basic idea is that Fisher divergence adaptation targets getting the preconditioned target as close to a standard normal in terms of gradients as possible (not as close as possible in terms of density—that’s KL divergence). With some surprising mathematical magic, you can solve the exact optimization of Fisher divergence by taking the geometric mean in the affine-invariant manifold of positive-definite matrices of two quantities: the inverse covariance of the draws and the covariance of the scores (where the score is the gradient of the log density). In a multivariate normal, the covariance of the scores is the inverse covariance of the target density, has zero expectation, and thus acts like a control variate. Who knew? (That was a rhetorical question. Ben Goodrich, of course, knew—he knows everything.)

Adaptive step size on the fly

This was developed by Nawaf Bou-Rabee, Tore Kleppe, Sifan Liu, and me over the course of a few papers expanding on our Gibbs self tuning (GIST) ideas, culminating in our WALNUTS paper (on arXiv). The basic idea here is at each leapfrog step to try to take a step, and if the Hamiltonian diverges past a tolerance, try a smaller step size (half the original). There’s a bit of adjustment to do for reversibility (very much like in the MALT sampler of Lionel Riou-Durand et al. [the paper has an all star cast]), but the expectation is that the adjustment will be close to unity and that seems to hold in practice. There’s some subtlety here in that the WALNUTS sampler we’ve produced can be slower than NUTS (on a per gradient basis) for densities that NUTS can fit well, but if it’s well tuned in terms of minimal numbers of micro steps per macro step in the WALNUTS algorithm, it will be more efficient on a per-iteration basis. This shows we really are fixing the integrator, even if that doesn’t always give us a better sampler on a per-gradient basis. This same kind of issue comes up with higher-order leapfrog algorithms—they’re more precise, but often not worth the effort. On the plus side, when you have highly varying scales, like in the funnel density, WALNUTS can sample it effectively where NUTS will just fail silently reporting overly optimistic R-hat and ESS values.

Isokinetic sampler

As the name implies, the idea of isokinetic sampler is that you keep the kinetic energy constant. This is carried out by designing a kinetic energy function that’s different than the usual quadratic model derived from Newtonian physics (decomposed a la Hamilton, of course). This was developed decades ago by Mark Tuckerman at NYU’s Courant Institute for molecular dynamics, written about in Leimkuhler and Matthews Molecular Dynamics textbook, and reinvented recently by Jakob Robink and Uroš Seljak under the name “microcanonical sampling.” Isokinetic sampling has the remarkable property that you can fix a single energy level set and the isokinetic sampler can be restricted to that, yet remain ergodic for the distribution. Although they’re singing the praises of unadjusted methods, Reuben Cohn-Gordon joined Jakob and Uroš and they wrote a joint paper on Metropolis-adjusted methods (on arXiv). Of course, you could extend this to something like jittered HMC, multinomial HMC, or NUTS (we discuss these alternatives for standard HMC in the various GIST papers in detail). Recently my colleagues Tore Kleppe and Nawaf Bou-Rabee (my partners in crime on GIST) have verified the results presented by Robnik et al. and it seems to give a pretty clean factor of 1.5 to 3 over the standard NUTS kinetic energy model.

Unadjusted sampling, at least for warmup

In statistics, we’re used to thinking of unbiased MCMC with the correct ergodicity properties—you run them longer and you get a better answer. But in estimation in stats, we’re often OK with introducing a bit of bias if it gives us a large enough reduction in variance that we get better answers. If you view sampling the same way, then removing the Metropolis step from HMC gives you a biased sampler that can be a lot faster at moving toward the typical set where we want to sample than an adjusted sampler. And while Jakob, Uroš and Reuben are mainly working on unadjusted samplers and trying to prove error bounds for them under pretty strong assumptions, they have also suggested the reasonable tactic of using unadjusted sampling during warmup then switching to adjusted later so that everything works as expected for longer runs.

Smoother adaptation of Nutpie

Nutpie, like NUTS before it, works in blocks. It evaluates a given number of MCMC iterations, then updates its mass matrix estimate. I’ve developed a smooth alternative that simply exponentially discounts the past at a decreasing rate to mimic the block structure of Stan’s warmup. This seems to work very well and doesn’t suffer the problem of Nutpie of never converging due to a finite cap on block size (Adrian’s working on changing the version in Nutpie in some way to get around the initial design of 100-length blocks; Stan works in a sequence of blocks that doubles in size).

Adam replacement for dual averaging

Stan used the dual averaging stochastic gradient descent algorithm to set step size. It’s not spelled out in the short NUTS paper that dual averaging is an SGD algorithm because the NUTS paper never gives you the objective and its gradient. If you work it out backwards from the dual averaging algorithm, you can deduce that Hoffman and Gelman used a normal target (i.e., squared error), the gradient of which is the negative difference between the observation and the target. This gives you simple stochastic gradients you can plug into any SGD optimizer. I’ve just done this this week, but I’ve already found that Adam is much faster to converge, less highly variable after convergence, and more monotonic getting to convergence in the short tests I’ve done so far. I had to add an update discounting factor to Adam or you get the terrible oscillations rather than convergence for which both dual averaging and Adam are known.

Concurrent adaptation and convergence monitoring

This is a big one that Andrew’s been asking for for years. I’ve finished the online R-hat monitor (it’s in a branch of that name in the Walnuts repo on flatironinstitute). It uses the C++11 threads library to build an asynchronous, non-blocking monitor. The chains roll along as usual, each within its own thread, but they accumulate their within chain means and variances via Welford’s algorithm. They publish these in a per-chain Atomic using relaxed memory guarantees (hence the lack of blocking). Then there is also a monitor thread running that continually reads the per-chain means and variances and computes R-hat (the original one, not the split or ranked versions that are currently used in posterior (R) and ArviZ (Python)). I am able to run at least 100 R-hat checks per second without slowing down the chains noticeably, so it can detect convergence within a few iterations of when it happens even for very fast models. This all works blazingly fast on my new Mac Studio Server running 16 chains in parallel(*). Next up, I have to monitor adaptation. For that, there’s not a pre-built solution, but I’ll be using something like R-hat to monitor whether the mass matrix and step sizes have converged (on a log scale, which makes the monitoring respect the positive-definite manifold structure and operate scale free). And I’ll have to use a double-buffered store rather than an atomic because the vectors required for monitoring convergence of the mass matrix are D-dimensional, not default copyable.

(*) Mac ARM hardware

Not an algorithmic change, but … if you haven’t gotten the memo yet, the new Mac ARM chips are very well suited to parallel sampling like we do in Stan. Their memory architecture is much more tightly integrated into the CPU and more multiplexed than Intel architecture. This is great for sampling or other tasks where data and parameters are being hit asynchronously in memory in parallel.

I just got a Mac Studio with 20 performance cores on the M3 Ultra chip and 256GB of memory. I would highly recommend this specific machine if you have US$6K to spend (your cost may vary—electronics seem to be way more expensive outside of the U.S.). They also have a US$2K starter model, which should also be great compared to just about anything else. Even the MacBooks are good—the inexpensive Air I got a few years ago crushed my mega-expensive (albeit 8 year old) iMac running Intel Xeon chips. People have told me Stan’s broken on Windows after comparing Windows and Mac, but really, it’s just the memory architecture. I didn’t do any controlled tests, but I swear that tests that were taking 3s now take 1-2s after upgrading to the Tahoe OS (bunch of small UI changes that make pretty much no difference to my experience); Apple says they’re continuing to integrate more ARM goodness into their OS, so maybe that was it? The release notes don’t mention anything about heavy thread performance.

I don’t know how long this is going to be useful. I plan to develop a lot of these algorithms in JAX to run on GPU, which is going to be way faster than anything you can do on the desktop.


TENTATIVE IMPROVEMENTS

These haven’t been well tested by us yet.

Generalized HMC

NUTS is not good for GPUs. Matt Hoffman et al. have a new paper out that’s going into the next edition of the MCMC handbook about MCMC on modern hardware that explains why (on arXiv for now). Hugh Dance et al. just wrote another paper showing how to code something like NUTS on GPU, but it’s still expensive. Something like generalized HMC promises to give much of the advantage of NUTS implicitly without the elaborate recursive doubling structure. This is the kind of sampling to which Matt Hoffman and his former colleague Pavel Sountsov developed (Matt left Google and doesn’t work on sampling any more as far as I know). This is also what Gilad Turok, Chirag Modi, and I found with delayed rejection. It also sidesteps the problem of having to chain all the adjustments for local step-size adaptativity which can add up in a bad way for longer chains. It does add some additional tuning, which is how much to partially refresh momentum, which is a kind of proxy for path length for U-turns. It also has the advantage of being implicitly Rao-Blackwellized compared to HMC (you can average all the steps on an HMC path when computing expectations, but it’s just usually not worth the storage—this may turn out to be the case for G-HMC too).

Local mass matrix adaptation

I think of this as the final frontier. If we could locally condition, we wouldn’t need variable step sizes. We don’t have anything that works for this in complex cases. In Stan, you get a choice of a diagonal or dense preconditioned, and in Nutpie you also get low rank plus diagonal, but they’re all global preconditions, not local.

In very simple cases, we can use GIST to generate a mass matrix by taking an inverse Wishart sample around the negative Hessian. If you set the degrees of freedom correctly to get low variance, this pretty much perfectly preconditions a multivariate normal and should be a sound strategy in any log concave density. It’s just expensive, even with everything Cholesky factored. But that’s not enough for a system like Stan. The problem in going to more general models is that the Hessian’s no longer guaranteed to be positive definite (as it would be in a log concave density). This means we have to use something like Michael Betancourt’s softabs technique to condition the Hessian back to positive definite (e.g., eigendecompose, move negative eigenvalues up to positive, put back together, which is prohibitive because of the cubic cost). As an alternative, Nawaf and Tore have been looking at some explicit Riemannian integrators that only require a few Hessian-vector products, which are cheap with autodiff (linear in dimension rather than quadratic). This avoids the problem with the implicit integrator in Riemannian HMC, which is an additional obstacle beyond the need for positive-definite metrics. We might go back to implicit midpoint, as Arya Pourzanjani explored in his thesis, because we can use GIST to set a step size where we can guarantee stability; even so, the implicit nature of the algorithm is a real challenge for reversibility.

From Bayesian inference to LLMs (Steve Bronder’s 2025 CppCon talk)

This post is from Bob.

Steve is now a C++ celebrity!

The production values for these are top notch. They’re also highly Google ranked, so you can’t miss them when you search for advanced C++ topics.

Steve’s been writing C++ code for Stan since before the pandemic. He now works with me at Flatiron Institute as a software engineer. He joined the Stan team after taking a GPU class and asking us if we had any use for GPUs. We said, “Yes, please.”

I say Steve’s a celebrity because of the prestige of CppCon—it’s where the language developers, top educators, and infrastructure tool builders hang out. For example, Steve had lunch with Matt Goldbolt. I use Goldbolt’s app all the time and didn’t realize it was a person’s name until Steve mentioned the lunch. To get some sense of CppCon, you can see Godbolt himself going full fanboy over a CppCon selfie with Laurie Kirk.

The model underlying R-hat and a Bayesian estimator

This post is from Bob

Andrew and I were talking the other day about generalizing R-hat convergence monitoring to the situation where we have multiple asynchronous threads running chains and we needed ragged input. This is because I’m coding (with Steve Bronder and Brian Ward’s help) a parallel auto-stopping version of Stan combining the step-size adaptivity of WALNUTS and the warmup of Nutpie—stay tuned (or follow it or join in and help on the WALNUTS GitHub).

The usual R-hat assumes each chain is the same length. Andrew suggested it would be good to go back to the model to think about how to generalize. I had never thought of R-hat that way, but it turns out it’s really simple, so I can take you all along for the ride.

A Stan model for Bayesian R-hat

Here’s the Stan model. The input is an M by N matrix of draws theta—the output includes the posterior for R and the indicator if it is below 1.01. The mean of the former is the Bayesian estimate of R, i.e., R-hat. The mean of the latter is the estimated probability that R is below 1.01, just to give it a perhaps not completely arbitrary cutoff.

rhat.stan

data {
  int<lower=1> M;
  int<lower=2> N;
  matrix[M, N] theta;
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=0> tau;
  vector<multiplier=tau>[M] alpha;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 5);
  tau ~ normal(0, 5);
  alpha ~ normal(0, tau);
  for (m in 1:M) {
    theta[m] ~ normal(mu + alpha[m], sigma);
  }
}
generated quantities {
  real<lower=1> R = sqrt(1 + square(tau) / square(sigma));
  int<lower=0, upper=1> R_ok = R < 1.01;
}

The draws in chain m, (i.e., theta[m] == theta[m, 1:N]), are given a normal(mu + alpha[m], sigma) distribution, so that mu is the global mean across chains and alpha[m] is the difference in mean in chain m from the global mean. The posterior standard deviation of all of the draws is represented by sigma. The variable tau is the standard deviation of the alpha[1:M] variables, so that represents the scale of variation among the chain means. The rest of the priors are weakly informative assuming the values are on roughly a unit scale or slightly larger.

The multiplier=tau converts alpha to a non-centered parameterization given its prior distribution normal(0, tau). The non-centered parameterization is more efficient here, though it may not be with large M and large N.

The generated quatnties define R as the square root of 1 + (tau / sigma)^2. This makes it easy to see that it's the variation between chains as represented by tau, that must go to zero for R to converge to 1. The sigma is just providing the proper scaling to make the R statistic unit free.

Stan will propagate uncertainty through tau and sigma to R. Then the indicator R_ok will be 1 if R is less than 1.01, so its posterior mean is the probability that R is less than 1.01. This is how you code event probability estimators in Stan.

The Bayesian estimate of R, which we can call the "Bayesian R-hat", is the posterior mean of R. The "hat" etymology derives from statisticians liking to decorate random variables with hats to indicate estimates of the random variable. So if X is a random variable, X-hat is an estimate of that variable. We could've taken a posterior median---that's a different estimator with slightly different properties.

A test model

We need a simple model to generate draws that we can test. I'll choose one that's not trivial, so HMC will have to work a bit and we can see some R-hat values that aren't all 1 after rounding.

eval.stan

parameters {
  vector<lower=0>[5] theta;	   
}
model {
  theta ~ lognormal(0, 1);
}

A Student-t with 3 degrees of freedom also works.

A simulator using CmdStanPy

Let's kick the tires, as they say in my hometown of Detroit. We're going to generate some data using the test model

fit.py

import numpy as np
import cmdstanpy as csp
csp.disable_logging()

model = csp.CmdStanModel(stan_file='eval.stan')
fit = model.sample(iter_warmup=10, iter_sampling=200, chains=8, max_treedepth=1, show_progress=False)
print(fit.summary(sig_figs=3))

draws = fit.stan_variable("theta")
rhat_model = csp.CmdStanModel(stan_file='rhat.stan')
for d in range(5):
    theta = draws[:, d]
    num_chains = fit.runset.chains
    draws_per_chain = fit.num_draws_sampling
    theta_matrix = theta.reshape(num_chains, draws_per_chain)
    rhat_data = {
        "M": num_chains,
        "N": draws_per_chain,
        "theta": theta_matrix
    }
            
    rhat_fit = rhat_model.sample(data=rhat_data, show_progress=False)
    R_hat = np.mean(rhat_fit.stan_variable("R"))
    R_ok = np.mean(rhat_fit.stan_variable("R_ok"))
    print(f"{d=};  R-hat: {R_hat:0.3f}; Pr[R < 1.01]: {R_ok:0.3f}")

I find Python very readable, but let me walk you through it. We import and turn off some of the annoying warning messages. Then we compile the evaluation model, sample it, and print the summary. We set tree depth to 1 to force NUTS to act like Langevin and thus sample inefficiently. And we only go 10 warmup iterations and 200 sampling iterations over eight chains.

Then we extract the draws from the posterior as a big 3D array draws of shape (D x M x N), where D is number of parameters, M number of chains, and N iterations per chain.

Then we loop over all the variables in the model, with d going from 0 to 4 (Python indexes from 0, unlike R). For each variable we extract the draws as theta. We grab out the number of chains and draws per chain and reshape the matrix so that it's (M x N) and put it into a data dictionary (like an R list) called rhat_data.

Then we take the data and fit using the Rhat model and extract our Bayesian estimate of R-hat as the posterior mean of the variable R and same for R_ok. Then we report.

Here's the output with columns trimmed for readability.

(stanenv) rhat$ python3 fit.py
          Mean    MCSE  StdDev   R_hat
theta[1]  1.63  0.1070    1.75    1.04
theta[2]  1.46  0.0915    1.61    1.03
theta[3]  1.63  0.1890    2.06    1.07
theta[4]  1.50  0.0995    1.78    1.03
theta[5]  1.68  0.1110    1.98    1.03

d=0;  R-hat: 1.046;  Pr[R < 1.01]: 0.016
d=1;  R-hat: 1.019;  Pr[R < 1.01]: 0.306
d=2;  R-hat: 1.058;  Pr[R < 1.01]: 0.002
d=3;  R-hat: 1.011;  Pr[R < 1.01]: 0.616
d=4;  R-hat: 1.014;  Pr[R < 1.01]: 0.525

Looks reasonable on first glance comparing the estimate R_hat value from CmdStanPy and the Bayesian approach.

Because it's a hierarchical model, we can't directly calculate a maximum likelihood estimate---we'd have to marginalize and use max marginal likelihood. Marginalizing out the intermediate normal (alpha) would also be a better way to code rhat.stan. I imagine fitting the model in brms would also be more robust than my simple example code here---they have better priors. I'll leave all that as an exercise for the reader.

Citation

I went back to the original for the first time ever to verify that indeed this is right model for what R-hat is modeling.

Assistant professor position at USI in Lugano

This post is from Bob

Antonietta Mira writes,

Assistant Professor position (tenure-track) in Theoretical Data Science and Machine Learning, Università della Svizzera italiana, Lugano, Switzerland

USI is looking for a talented junior statistician for Tenure Track Assistant Professor position in “Theoretical Data Science and Machine Learning”. The deadline for applications is 30 November 2025 and we are looking for a starting date around the Summer 2026. Starting salary is 110,000 CHF per year. More details about the required profile can be found on:

Official announcement.

The faculty is located in Lugano in the South of Switzerland on the University’s main campus, 1.5 hrs from Zürich and 1 hr from Milan. It has been expanding its activities in Data Science, Statistics and Machine Learning. Since 2024, we are now having both a bachelor’s and master’s degree in Data Science and a master’s degree in Artificial Intelligence. In August 2026, we will be hosting the European Meeting of Statisticians in Lugano.
Timeline of the hiring process

  • 30 November 2025: application deadline
  • 2-4 February 2026: In person interviews in Lugano, Switzerland.
  • Summer 2026: envisioned starting date

For any questions, do not hesitate to contact the Dean, Prof. Ernst C. Wit ([email protected]).

Mitzi and I have visited Antonietta for a week on two separate occasions over the last few years and we absolutely loved Lugano the city, the USI campus, and getting to hang out with Antonietta. Teaching is in English and faculty salaries in Switzerland are much better than anywhere else I know in Europe (1 Swiss franc is about US$1.25).

Sabbatical and pre-faculty positions at Flatiron Institute in NYC

This post is from Bob.

Sabbatical visitors

If you work in computational stats or ML (or even in other branches of applied math) and have a sabbatical coming up and would like to spend it at the Center for Computational Mathematics, which is part of the Flatiron Institute in NYC, please drop me a line:

Pre-faculty visitors

Some of our postdoc applicants have wound up getting faculty offers, at which point we can make them visiting researcher offers at better than postdoc salaries for a year. We have great computing resources, great physical space, and a wonderful set of colleagues across a range of scientific computing areas of interest.

What we did last year

Last year, Nawaf Bou-Rabee was on sabbatical here and Sifan Liu was here as a pre-faculty visitor. Four different papers about which I’m excited came out of this collaboration and we really feel like we’re just getting warmed up, if I may be permitted a pun.

Nawaf stayed on as a part-time visiting researcher, and with Tore and Sifan, we’ve turned our attention to mass matrix adaptation.

Sifan has left us for Duke, where I have zero doubt she’ll be hugely successful. She has the same kind of research X-ray vision that I last experienced working with Matt Hoffman. I frankly couldn’t keep up. These projects with us were just a fraction of what she worked on while here. She also collaborated with a bunch of new people locally and came up with a couple novel normalizing flow implementations that connect to quasi Monte Carlo. I can’t wait to see what she does next.

Going forward: flows and diffusions

This year, Luhuan Wu is here as a pre-faculty visitor before she heads off to a faculty position at Johns Hopkins in Applied Math and Statistics. Most of our new ML postdocs this year have worked on diffusion models and normalizing flows (this includes Luhuan, Mark Goldstein, and Louis Grenioux), as have many of our research scientists and previous postdocs (though a group of three postdocs who collaborated on a diffusion plus HMC model of galactic dust denoising project to measure cosmic microwave background all got jobs and left). I hope to spend at least half of my time going forward, starting in January, working on normalizing flows and diffusions with the goal of developing a practical tool that applied statisticians can use. What got me super excited about this was when Justin Domke took a five month leave of absence here—his work with Abhinav Agrawal on normalizing flows actually really works robustly, in many hard cases better than Stan, though it takes a bajillion flops, for which you need a good GPU. If you’re interested in this project as a visitor here, please let me know!

It’s a JAX, JAX, JAX, JAX World

This post is by Bob.

The title is based on the similarly named classic film.

 

“Big” models moving from Stan to JAX

Ever since the big ML frameworks PyTorch and TensorFlow were released, the Stan developers have been worried they’re going to put Stan out of business (we built Stan’s autodiff before those packages existed, but after Theano). While that hasn’t quite happened yet, I now believe our days are numbered. For high end applications, Stan is slowly, but surely, being replaced by JAX. Many places I go (don’t want Andrew to jump on a hyperbolic use of “everywhere”), I hear about people switching from Stan to JAX.

Here are four examples:

1. At StanCon in Oxford in 2024, Elizaveta Semenova started her talk by saying something to the effect of, “I’m sorry to say this here, but I don’t use Stan any more—I switched to JAX through NumPyro for scalability.”

2. Mitzi Morris just started working as a contractor for the U.S. Center for Disease Control (CDC) (!? as they say in chess). Their public GitHub repositories have old Stan code they used to use that has been replaced by JAX, for which they are building up a library of code. It’s very hard to build reusable code in Stan given its blocked structure and the limited form of includes; Sean Pinkney has gone further than I thought possible with his helpful Stan functions project. The CDC models are for wastewater-informed forecasting—here’s the project overview.

3. Andrew posted a job announcement from the L.A. Dodgers baseball team a week ago that said, “We have a soft spot for jax and numpyro but Stan and PyMC folks are obviously always of interest.” Like Andrew, they apparently don’t like using their shift key.

4. Matt Hoffman’s been saying this for years and backing it up with adaptive ensemble samplers, convergence monitoring, etc. He, Pavel Sountsov, and Colin Carroll wrote a draft chapter for the second edition of the MCMC Handbook, Running Markov Chain Monte Carlo on Modern Hardware and Software. It contains complete instructions for massively parallelizing HMC on a GPU using JAX.

But what about the hardware?

The biggest obstacle for people moving is finding the hardware on which to run JAX most efficiently—it’s really tailored for multiprocessing and GPU processing and I don’t believe most of the Stan users have access to this kind of hardware to fit their models. But I believe this is going to change over the next ten years. That, and I believe we’re going to get better and better Macs—the ARM chips are way faster than the Intel chips for the kind of random-access memory needed in Stan programs.

New samplers moving to JAX

New samplers like the micro-canonical HMC of Jakob Robnik and Uroš Seljak (and more recently Reuben Cohn-Gordon) are being coded only in JAX. Like many others, they added their package (see the previous link) to the Blackjax package. They even have a competitor for posteriordb in the form of Inference Gym.

A very nice feature of putting things up on Blackjax is that you can use them with any Python-defined log density function—it doesn’t even need to come from JAX. Brian Ward managed to plug Stan models into JAX (by which I mean having JAX call Stan’s C++, not generating JAX code from Stan).

Static vs. dynamic automatic differentiation

We built Stan with automatic differentiation before PyTorch, TensorFlow or JAX existed. We went with the same dynamic design as PyTorch eventually chose, despite Matt Hoffman and I knowing that the static TensorFlow/JAX approach could be more performant. The problem was that we didn’t have the people to implement enough derivatives to do it that way. Instead, we just started autodiffing through functions in the Eigen matrix library (like matrix multiplication and division) and in Boost (like the Runge-Kutta 4/5 ODE solver and many of the special functions). The static approach of XLA (which is the infrastructure under JAX and TensorFlow) does limit expressiveness of things like loops and conditionals to not condition on parameters, making it challenging, if not impossible to write iterative algorithms in JAX.

Graphical modeling

Tools like BUGS, PyMC, and NumPyro are all fundamentally based on the notion of a directed acyclic graphical model. That is, you have nodes representing random variables with each variable being conditionally independent given the nodes that point to it. You specify the distribution of each node given the nodes on which it depends. Transforms are represented by deterministic nodes. The upside to constraining oneself to graphical models is that everything has to remain clearly generative (assuming you avoid improper flat priors, that is). This lets tools like PyMC automate a lot of workflow in the same way that we can with brms in Stan. When you go outside that paradigm, as you can in PyMC by adding density statements, the built-in automation of workflow breaks. So while it’s possible, they generally don’t recommend it. This came up in an earlier blog post I wrote, What’s a generative model? PyMC and Stan edition.

Differentiable programming

Stan does not work on a graphical modeling base. You can write graphical models in Stan, but we just treat them as defining a log density (that was the leap that led to Stan—I thought about how to code JAGS to generate log densities rather than conditional samplers as they do in BUGS/JAGS). In Stan, we just declare constrained parameters and define a log density over them. That’s it (the Jacobian adjustment for the change of variables is kept under the hood). There are generated quantities, but that’s conceptually after sampling.

Like Stan, JAX is also a differentiable programming language. Unlike Stan, it’s wonderfully compositional and general.

Writing JAX models like Stan models

As much as people like to use NumPyro and sometimes even PyMC to generate JAX code, I think it may be easier in the end to just write JAX directly. That way, nothing gets between you and JAX and you don’t have to figure out how to filter JAX through middleware. When you do that, the models can be organized very much like in Stan.

Brian Ward and I took some time to work through what a simple linear regression would look like coded this way in JAX. I went over it a couple weeks ago with Andrew and he didn’t think it was too bad. Here’s the example.

GitHub Gist: linear regression in JAX.

In this example, we first do the constraining parameter transforms and extract the Jacobian, then define the model directly. Although we didn’t need it for this simple example, the Oryx library in JAX provides an extensive library of constraining transforms with Jacobians. It’s using the really cool PyTree features of JAX to move between structured log densities and array-based serialized log densities. This is sooo cool and the fact that it can all be compiled away is even cooler.

In JAX, there’s no distribution statement syntactic sugar, but then even Andrew thinks those were a mistake in Stan. I still like them, though I admit they’ve caused a lot of confusion in terms of people thinking about how Stan works. It’s odd to find myself on the more permissive side of language design discussion for once.

Generated quantities of the form used in Stan are trivial to code directly in JAX with vmap. Removing all these special constructs is super helpful for learnability, as is having the language embedded in Python (as much as Python is terrible for this kind of thing, much like R, because of its lack of static typing, its global interpreter lock, and it’s R-like scope, I believe it’s well on its way to becoming the lingua franca of numerical analysis.

Generating JAX from Stan?

People have asked if we were going to work on generating JAX code from Stan programs. I doubt it, given how easy it is to just define models directly in JAX and given how few dedicated developers we now have. The whole point of Stan was to provide a structured way to do derivatives for statistics models. We can just do that directly in JAX as the above gist shows.

Giving up working on Stan?

No, we’re not giving up on Stan. People still use BUGS! Stan’s going to keep being used for a long time if history is any indication. We have lots of strategies for making it faster, adding samplers that will work well on CPU but not GPU, etc. I don’t plan to be involved in coding for Stan any more. It’s just too complicated for me. My plan is to write standalone samplers like WALNUTS, following Adrian Seyboldt’s lead for Nutpie. If you’re OK with Python but haven’t tried Nutpie, I’d highly recommend it—it’s twice as fast as Stan and more robust due to its adaptation—I’m rolling that into the new WALNUTS code and maybe we’ll find the cycles to roll it into Stan itself after more testing.

Condition numbers for HMC and the funnel

This post is by Bob.

Back to some technical statistical computing.

Condition numbers for random walks

The usual notion of condition number is the ratio of the largest to the smallest eigenvalue of the negative Hessian. Large eigenvalues correspond to high curvature and small eigenvalues to low curvature. Condition numbers matter because the step size needs to be small enough to deal with the regions of high curvature and thus will require many steps to traverse flatter regions of low curvature. Eigenvalues of the negative Hessian act like inverse variances (they are inverse variances in a multivariate normal with a diagonal covariance matrix), and are thus squared scales. If you set the step size to be consistent with the direction of highest curvature, you have to take a number of steps equal to the condition number to move in the direction of lowest curvature—this is the condition number. It bounds how many steps are going to be required to get roughly independent draws.

Neal’s funnel

Radford Neal introduced a funnel density in his slice sampling paper. I assume he was well aware of just how nasty this example is. The funnel is a centered parameterization of a hierarchical model with no data in N dimensions:

y ~ normal(0, 3)
x[1:N - 1] ~ normal(0, exp(y / 2))

Here’s a density plot of y versus x[1] from the Stan User’s Guide chapter on reparameterization.

As you move along the y axis between +6 and -6, the condition number goes from 1000 to roughly 1 at the origin back up to 1000. From conditioning, both the mouth and the neck of the funnel are tricky. And this is only +/- two standard deviations, which is only approximately 95% of the probability mass. One of the things that makes the funnel nasty is that during the move from -6 to 6, the eigenstructure changes with the principal eigenvector (the one with the largest eigenvalue), changes alignment from along the x axes to along the y axis.

It is very hard to estimate the uncertainty in the funnel using sampling, even independent sampling. The problem is that x[n]^2 has a mean of roughly 100, but x[n]^4 has a mean of 2 x 10^8 (!) and thus x[n]^2 itself has a standard deviation of 1.4 x 10^4 (I’m using the fact that var[X^2] = E[X^4] - E[X^2]^2). This has to be enormously skewed to the right because the values are bounded below by 0. Even with 10 billion independent draws from the funnel, the estimates of the expectation and variance of the x coordinates are all over the place.

Condition numbers for HMC

HMC is so effective precisely because it overcomes the random walk behavior of Metropolis. Where Metropolis takes O(N^2) amount of work to move a distance of N, HMC only requires O(N^5/4). But there’s still this nasty constant from conditioning lurking in that asymptotic complexity result.

I don’t know how I missed it before, but I only learned about this paper at the MCM conference in Chicago last month:

Langmore et al. introduce an appropriate notion of condition for HMC,

kappa = [ SUM_{n=1}^N (lambdaMax / lambda[n])^4 ]^(1/4)

where lambda[1:N] are the eigenvalues of the negative Hessian, and lambdaMax = max(lambda[1:N]). This tells us that it’s worse to have one big eigenvalue (one highly curved dimension) and many small eigenvalues (flat dimensions) than the other way around. Therefore, the funnel is actually more poorly conditioned for HMC in the mouth than in the neck. In the mouth, the largest eigenvalue corresponds to the relatively slow moving y axis and the x axes are all much lower curvature relatively speaking. The reason the neck is usually considered the source of the problem is that the leapfrog algorithm in HMC is only a first-order (i.e., gradient-based) approximation of the Hamiltonian trajectory, and it can diverge pretty quickly in regions of high curvature. It turns out that if you take HMC or NUTS and use a fixed step size, you cannot explore the tails of either the neck or the mouth of the funnel very well.