Assistant professor position at USI in Lugano

Posted on October 24, 2025 3:30 PM by Bob Carpenter

This post is from Bob

Assistant Professor position (tenure-track) in Theoretical Data Science and Machine Learning, Università della Svizzera italiana, Lugano, Switzerland

USI is looking for a talented junior statistician for a tenure-track Assistant Professor position in “Theoretical Data Science and Machine Learning”. The deadline for applications is 30 November 2025 and we are looking for a starting date around summer 2026. Starting salary is 110,000 CHF per year. More details about the required profile can be found here:

Official announcement.

The faculty is located in Lugano in the south of Switzerland on the University’s main campus, 1.5 hrs from Zürich and 1 hr from Milan. It has been expanding its activities in Data Science, Statistics and Machine Learning. Since 2024, we have offered both a bachelor’s and a master’s degree in Data Science and a master’s degree in Artificial Intelligence. In August 2026, we will be hosting the European Meeting of Statisticians in Lugano.

Timeline of the hiring process

30 November 2025: application deadline
2-4 February 2026: in-person interviews in Lugano, Switzerland
Summer 2026: envisioned starting date

For any questions, do not hesitate to contact the Dean, Prof. Ernst C. Wit ([email protected]).

Mitzi and I have visited Antonietta for a week on two separate occasions over the last few years and we absolutely loved Lugano the city, the USI campus, and getting to hang out with Antonietta. Teaching is in English and faculty salaries in Switzerland are much better than anywhere else I know in Europe (1 Swiss franc is about US$1.25).

Sabbatical and pre-faculty positions at Flatiron Institute in NYC

Posted on October 22, 2025 3:00 PM by Bob Carpenter

This post is from Bob.

Sabbatical visitors

If you work in computational stats or ML (or even in other branches of applied math) and have a sabbatical coming up and would like to spend it at the Center for Computational Mathematics, which is part of the Flatiron Institute in NYC, please drop me a line:

[email protected]

Pre-faculty visitors

Some of our postdoc applicants have wound up getting faculty offers, at which point we can make them visiting researcher offers at better than postdoc salaries for a year. We have great computing resources, great physical space, and a wonderful set of colleagues across a range of scientific computing areas of interest.

What we did last year

Last year, Nawaf Bou-Rabee was on sabbatical here and Sifan Liu was here as a pre-faculty visitor. Four different papers about which I’m excited came out of this collaboration and we really feel like we’re just getting warmed up, if I may be permitted a pun.

Gibbs self tuning (GIST) for dynamic HMC, co-authored with Nawaf and Milo Marsden.
GIST for step size adaptation, co-authored with Nawaf, Milo, and Tore Kleppe.
WALNUTS for local step size tuning, with Nawaf, Sifan, and Tore.
No underrun sampler (NURS), which is a gradient-free implementable version of the hit-and-run sampler, with Nawaf, Sifan, and Stefan Oberdörster.

Nawaf stayed on as a part-time visiting researcher, and with Tore and Sifan, we’ve turned our attention to mass matrix adaptation.

Sifan has left us for Duke, where I have zero doubt she’ll be hugely successful. She has the same kind of research X-ray vision that I last experienced working with Matt Hoffman. I frankly couldn’t keep up. These projects with us were just a fraction of what she worked on while here. She also collaborated with a bunch of new people locally and came up with a couple novel normalizing flow implementations that connect to quasi Monte Carlo. I can’t wait to see what she does next.

Going forward: flows and diffusions

This year, Luhuan Wu is here as a pre-faculty visitor before she heads off to a faculty position at Johns Hopkins in Applied Math and Statistics. Most of our new ML postdocs this year have worked on diffusion models and normalizing flows (this includes Luhuan, Mark Goldstein, and Louis Grenioux), as have many of our research scientists and previous postdocs (though a group of three postdocs who collaborated on a project combining diffusion and HMC to denoise galactic dust for measuring the cosmic microwave background all got jobs and left). I hope to spend at least half of my time going forward, starting in January, working on normalizing flows and diffusions with the goal of developing a practical tool that applied statisticians can use. What got me super excited about this was when Justin Domke took a five month leave of absence here. His work with Abhinav Agrawal on normalizing flows actually really works robustly, in many hard cases better than Stan, though it takes a bajillion flops, for which you need a good GPU. If you’re interested in this project as a visitor here, please let me know!

It’s a JAX, JAX, JAX, JAX World

Posted on October 3, 2025 3:00 PM by Bob Carpenter

This post is by Bob.

The title is based on the similarly named classic film.

“Big” models moving from Stan to JAX

Ever since the big ML frameworks PyTorch and TensorFlow were released, the Stan developers have been worried they’re going to put Stan out of business (we built Stan’s autodiff before those packages existed, but after Theano). While that hasn’t quite happened yet, I now believe our days are numbered. For high end applications, Stan is slowly, but surely, being replaced by JAX. Many places I go (don’t want Andrew to jump on a hyperbolic use of “everywhere”), I hear about people switching from Stan to JAX.

Here are four examples:

1. At StanCon in Oxford in 2024, Elizaveta Semenova started her talk by saying something to the effect of, “I’m sorry to say this here, but I don’t use Stan any more—I switched to JAX through NumPyro for scalability.”

2. Mitzi Morris just started working as a contractor for the U.S. Centers for Disease Control and Prevention (CDC) (!? as they say in chess). Their public GitHub repositories have old Stan code they used to use that has been replaced by JAX, for which they are building up a library of code. It’s very hard to build reusable code in Stan given its blocked structure and the limited form of includes; Sean Pinkney has gone further than I thought possible with his helpful Stan functions project. The CDC models are for wastewater-informed forecasting; here’s the project overview.

3. Andrew posted a job announcement from the L.A. Dodgers baseball team a week ago that said, “We have a soft spot for jax and numpyro but Stan and PyMC folks are obviously always of interest.” Like Andrew, they apparently don’t like using their shift key.

4. Matt Hoffman’s been saying this for years and backing it up with adaptive ensemble samplers, convergence monitoring, etc. He, Pavel Sountsov, and Colin Carroll wrote a draft chapter for the second edition of the MCMC Handbook, Running Markov Chain Monte Carlo on Modern Hardware and Software. It contains complete instructions for massively parallelizing HMC on a GPU using JAX.

But what about the hardware?

The biggest obstacle for people moving is finding the hardware on which to run JAX most efficiently—it’s really tailored for multiprocessing and GPU processing and I don’t believe most of the Stan users have access to this kind of hardware to fit their models. But I believe this is going to change over the next ten years. That, and I believe we’re going to get better and better Macs—the ARM chips are way faster than the Intel chips for the kind of random-access memory needed in Stan programs.

New samplers moving to JAX

New samplers like the micro-canonical HMC of Jakob Robnik and Uroš Seljak (and more recently Reuben Cohn-Gordon) are being coded only in JAX. Like many others, they added their package (see the previous link) to the Blackjax package. They even have a competitor for posteriordb in the form of Inference Gym.

A very nice feature of putting things up on Blackjax is that you can use them with any Python-defined log density function—it doesn’t even need to come from JAX. Brian Ward managed to plug Stan models into JAX (by which I mean having JAX call Stan’s C++, not generating JAX code from Stan).

Static vs. dynamic automatic differentiation

We built Stan with automatic differentiation before PyTorch, TensorFlow or JAX existed. We went with the same dynamic design as PyTorch eventually chose, despite Matt Hoffman and I knowing that the static TensorFlow/JAX approach could be more performant. The problem was that we didn’t have the people to implement enough derivatives to do it that way. Instead, we just started autodiffing through functions in the Eigen matrix library (like matrix multiplication and division) and in Boost (like the Runge-Kutta 4/5 ODE solver and many of the special functions). The static approach of XLA (which is the infrastructure under JAX and TensorFlow) does limit expressiveness: things like loops and conditionals cannot condition on parameters, which makes it challenging, if not impossible, to write iterative algorithms in JAX.

Graphical modeling

Tools like BUGS, PyMC, and NumPyro are all fundamentally based on the notion of a directed acyclic graphical model. That is, you have nodes representing random variables with each variable being conditionally independent given the nodes that point to it. You specify the distribution of each node given the nodes on which it depends. Transforms are represented by deterministic nodes. The upside to constraining oneself to graphical models is that everything has to remain clearly generative (assuming you avoid improper flat priors, that is). This lets tools like PyMC automate a lot of workflow in the same way that we can with brms in Stan. When you go outside that paradigm, as you can in PyMC by adding density statements, the built-in automation of workflow breaks. So while it’s possible, they generally don’t recommend it. This came up in an earlier blog post I wrote, What’s a generative model? PyMC and Stan edition.

Differentiable programming

Stan does not work on a graphical modeling base. You can write graphical models in Stan, but we just treat them as defining a log density (that was the leap that led to Stan—I thought about how to code JAGS to generate log densities rather than conditional samplers as they do in BUGS/JAGS). In Stan, we just declare constrained parameters and define a log density over them. That’s it (the Jacobian adjustment for the change of variables is kept under the hood). There are generated quantities, but that’s conceptually after sampling.

Like Stan, JAX is also a differentiable programming language. Unlike Stan, it’s wonderfully compositional and general.

Writing JAX models like Stan models

As much as people like to use NumPyro and sometimes even PyMC to generate JAX code, I think it may be easier in the end to just write JAX directly. That way, nothing gets between you and JAX and you don’t have to figure out how to filter JAX through middleware. When you do that, the models can be organized very much like in Stan.

Brian Ward and I took some time to work through what a simple linear regression would look like coded this way in JAX. I went over it a couple weeks ago with Andrew and he didn’t think it was too bad. Here’s the example.

GitHub Gist: linear regression in JAX.

In this example, we first do the constraining parameter transforms and extract the Jacobian, then define the model directly. Although we didn’t need it for this simple example, the Oryx library in JAX provides an extensive library of constraining transforms with Jacobians. It’s using the really cool PyTree features of JAX to move between structured log densities and array-based serialized log densities. This is sooo cool and the fact that it can all be compiled away is even cooler.
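To make the organization described above concrete, here is a minimal plain-Python sketch of the same three steps (constrain, collect the log Jacobian, add up the log density). It is not the gist and it is not JAX: the function names, the choice of priors, and the use of the standard `math` module are all assumptions made for illustration. In JAX the same structure would be written with `jax.numpy` so that the log density can be differentiated.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def constrain(theta_unc):
    """Map unconstrained (alpha, beta, log_sigma) to (alpha, beta, sigma).

    Also returns the log absolute Jacobian determinant of the transform.
    """
    alpha, beta, log_sigma = theta_unc
    sigma = math.exp(log_sigma)
    log_jacobian = log_sigma  # log |d sigma / d log_sigma| = log_sigma
    return (alpha, beta, sigma), log_jacobian

def log_density(theta_unc, x, y):
    """Log posterior density over the unconstrained parameters, up to a constant."""
    (alpha, beta, sigma), log_jacobian = constrain(theta_unc)
    lp = log_jacobian
    # Priors (illustrative choices, not taken from the gist).
    lp += normal_lpdf(alpha, 0.0, 1.0)
    lp += normal_lpdf(beta, 0.0, 1.0)
    lp += normal_lpdf(sigma, 0.0, 1.0)  # half-normal up to a constant
    # Likelihood: y[n] ~ normal(alpha + beta * x[n], sigma).
    for x_n, y_n in zip(x, y):
        lp += normal_lpdf(y_n, alpha + beta * x_n, sigma)
    return lp
```

A sampler only ever sees `log_density` as a function of the unconstrained vector, which is the same contract a Stan program provides.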

In JAX, there’s no distribution statement syntactic sugar, but then even Andrew thinks those were a mistake in Stan. I still like them, though I admit they’ve caused a lot of confusion in terms of people thinking about how Stan works. It’s odd to find myself on the more permissive side of language design discussion for once.

Generated quantities of the form used in Stan are trivial to code directly in JAX with vmap. Removing all these special constructs is super helpful for learnability, as is having the language embedded in Python (as much as Python is terrible for this kind of thing, much like R, because of its lack of static typing, its global interpreter lock, and its R-like scope, I believe it’s well on its way to becoming the lingua franca of numerical analysis).

Generating JAX from Stan?

People have asked if we were going to work on generating JAX code from Stan programs. I doubt it, given how easy it is to just define models directly in JAX and given how few dedicated developers we now have. The whole point of Stan was to provide a structured way to do derivatives for statistics models. We can just do that directly in JAX as the above gist shows.

Giving up working on Stan?

No, we’re not giving up on Stan. People still use BUGS! Stan’s going to keep being used for a long time if history is any indication. We have lots of strategies for making it faster, adding samplers that will work well on CPU but not GPU, etc. I don’t plan to be involved in coding for Stan any more. It’s just too complicated for me. My plan is to write standalone samplers like WALNUTS, following Adrian Seyboldt’s lead for Nutpie. If you’re OK with Python but haven’t tried Nutpie, I’d highly recommend it—it’s twice as fast as Stan and more robust due to its adaptation—I’m rolling that into the new WALNUTS code and maybe we’ll find the cycles to roll it into Stan itself after more testing.

Condition numbers for HMC and the funnel

Posted on September 18, 2025 3:00 PM by Bob Carpenter

This post is by Bob.

Back to some technical statistical computing.

Condition numbers for random walks

The usual notion of condition number is the ratio of the largest to the smallest eigenvalue of the negative Hessian. Large eigenvalues correspond to high curvature and small eigenvalues to low curvature. Condition numbers matter because the step size needs to be small enough to deal with the regions of high curvature and thus will require many steps to traverse flatter regions of low curvature. Eigenvalues of the negative Hessian act like inverse variances (they are inverse variances in a multivariate normal with a diagonal covariance matrix), and are thus inverse squared scales. If you set the step size to be consistent with the direction of highest curvature, a random walk has to take a number of steps on the order of the condition number to move across the direction of lowest curvature. The condition number thus bounds how many steps are going to be required to get roughly independent draws.

Neal’s funnel

Radford Neal introduced a funnel density in his slice sampling paper. I assume he was well aware of just how nasty this example is. The funnel is a centered parameterization of a hierarchical model with no data in N dimensions:

y ~ normal(0, 3)
x[1:N - 1] ~ normal(0, exp(y / 2))

Here’s a density plot of y versus x[1] from the Stan User’s Guide chapter on reparameterization.

As you move along the y axis between +6 and -6, the condition number goes from 1000 to roughly 1 at the origin and back up to 1000. In terms of conditioning, both the mouth and the neck of the funnel are tricky. And this is only +/- two standard deviations, which covers only approximately 95% of the probability mass. One of the things that makes the funnel nasty is that during the move from -6 to 6, the eigenstructure changes: the principal eigenvector (the one with the largest eigenvalue) changes alignment from along the x axes to along the y axis.

It is very hard to estimate the uncertainty in the funnel using sampling, even independent sampling. The problem is that x[n]^2 has a mean of roughly 100, but x[n]^4 has a mean of 2 x 10^8 (!) and thus x[n]^2 itself has a standard deviation of 1.4 x 10^4 (I’m using the fact that var[X^2] = E[X^4] - E[X^2]^2). This has to be enormously skewed to the right because the values are bounded below by 0. Even with 10 billion independent draws from the funnel, the estimates of the expectation and variance of the x coordinates are all over the place.
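The moments quoted above can be checked in closed form. The sketch below assumes only the funnel definition given earlier (y ~ normal(0, 3) and x[n] ~ normal(0, exp(y / 2))) together with the standard identity E[exp(t * y)] = exp(t^2 * sd_y^2 / 2) for a zero-mean normal y; the function name is made up for illustration.

```python
import math

def funnel_x_moments(sd_y=3.0):
    """Closed-form moments of x[n] in the funnel.

    Conditional on y, x has variance exp(y), so
    E[x^2 | y] = exp(y) and E[x^4 | y] = 3 * exp(2 * y).
    """
    mean_x2 = math.exp(0.5 * sd_y ** 2)        # E[x^2] = E[exp(y)]
    mean_x4 = 3.0 * math.exp(2.0 * sd_y ** 2)  # E[x^4] = 3 * E[exp(2y)]
    sd_x2 = math.sqrt(mean_x4 - mean_x2 ** 2)  # sd[x^2] from var = E[x^4] - E[x^2]^2
    return mean_x2, mean_x4, sd_x2

mean_x2, mean_x4, sd_x2 = funnel_x_moments()
print(mean_x2, mean_x4, sd_x2)
```

With sd_y = 3 this gives a mean of x^2 near 90, a mean of x^4 near 2 x 10^8, and a standard deviation of x^2 near 1.4 x 10^4, in line with the figures in the text.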

Condition numbers for HMC

HMC is so effective precisely because it overcomes the random walk behavior of Metropolis. Where Metropolis takes O(N^2) work to move a distance of N, HMC only requires O(N^(5/4)). But there’s still this nasty constant from conditioning lurking in that asymptotic complexity result.

I don’t know how I missed it before, but I only learned about this paper at the MCM conference in Chicago last month:

Langmore, I., Dikovsky, M., Geraedts, S., Norgaard, P. and Von Behren, R., 2019. A condition number for Hamiltonian Monte Carlo. arXiv preprint arXiv:1905.09813.

Langmore et al. introduce an appropriate notion of condition for HMC,

kappa = [ SUM_{n=1}^N (lambdaMax / lambda[n])^4 ]^(1/4)

where lambda[1:N] are the eigenvalues of the negative Hessian, and lambdaMax = max(lambda[1:N]). This tells us that it’s worse to have one big eigenvalue (one highly curved dimension) and many small eigenvalues (flat dimensions) than the other way around. Therefore, the funnel is actually more poorly conditioned for HMC in the mouth than in the neck. In the mouth, the largest eigenvalue corresponds to the relatively slow moving y axis and the x axes are all much lower curvature relatively speaking. The reason the neck is usually considered the source of the problem is that the leapfrog algorithm in HMC is only a first-order (i.e., gradient-based) approximation of the Hamiltonian trajectory, and it can diverge pretty quickly in regions of high curvature. It turns out that if you take HMC or NUTS and use a fixed step size, you cannot explore the tails of either the neck or the mouth of the funnel very well.
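As a small numerical illustration of the asymmetry, the sketch below evaluates kappa exactly as the formula is written in this post and compares it to the usual ratio of largest to smallest eigenvalue. The two spectra are made-up examples, and the function names are hypothetical.

```python
def hmc_condition_number(eigenvalues):
    """kappa = [ sum_n (lambdaMax / lambda[n])^4 ]^(1/4), as written in the post."""
    lam_max = max(eigenvalues)
    return sum((lam_max / lam) ** 4 for lam in eigenvalues) ** 0.25

def classical_condition_number(eigenvalues):
    """Ratio of the largest to the smallest eigenvalue."""
    return max(eigenvalues) / min(eigenvalues)

# Ten dimensions, with the same classical condition number of 100 in both cases.
one_big = [100.0] + [1.0] * 9    # one highly curved dimension, nine flat ones
one_small = [100.0] * 9 + [1.0]  # nine highly curved dimensions, one flat one

print(classical_condition_number(one_big), hmc_condition_number(one_big))
print(classical_condition_number(one_small), hmc_condition_number(one_small))
```

Both spectra have the same classical condition number, but kappa is roughly 173 for the first and roughly 100 for the second, because every flat dimension contributes its own large term to the sum.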

Stats and ML postdoc and permanent hiring season officially open at Flatiron

Posted on September 16, 2025 3:00 PM by Bob Carpenter

This post is from Bob.

Flatiron Institute’s Center for Computational Mathematics, where I am based, is looking to hire both postdocs and permanent research scientists. Here are the job ads.

Flatiron Research Fellow ($91K)
Research Scientist ($130K–$215K)

The retirement and health benefits are far better than any job I’ve ever had. All of our research staff also gets their own budget for travel and other academic expenses. We’re also centrally located in the Flatiron district of New York on lower Fifth Avenue, very close to NYU, Google, and Meta.

Flatiron Institute is the most pleasant place I’ve ever worked, including the latter days of Bell Labs. I’m pretty sure it’s largely because of not having to write grants or teach combined with a highly collaborative research environment. Our postdocs are not assigned to grants—they’re largely free to work on whatever they want as long as it’s on mission.

Flatiron’s mission

I like our non-BS mission statement enough to quote it.

The mission of the Flatiron Institute is to advance scientific research through computational methods, including data analysis, theory, modeling and simulation.

Jim and Marilyn Simons founded the institute as a way to fill the gap in academia around funding computational tools. We back this up with five permanent software engineers in our center (including Brian Ward, Steve Bronder, and Jeff Soules, all of whom have worked on pieces of Stan; Brian’s our main language developer and Steve’s our main math library developer).

Flatiron’s structure

I am the group leader for computational statistics in the Center for Computational Mathematics, which includes traditional applied math and numerical analysis like PDE solvers, performant linear algebra, etc., as well as machine learning and statistics. Flatiron Institute has four other centers for computational science, focused on biology, neuroscience, astrophysics, and quantum physics. Each center has roughly 30 permanent research scientists, 20 postdocs, and 5 software engineers, plus an incredibly effective admin staff. We also have the Scientific Compute Core, with hundreds of state of the art GPUs and hundreds of thousands of CPUs; that’s where all of Stan’s continuous integration runs these days. The SCC is staffed with high performance compute specialists who can help you scale your work and code it efficiently and also help package your work for continuous integration and for distribution.

Flatiron is about 25% of the budget and about half the staff of the Simons Foundation, whose mission is also science oriented, but through grant funding (we give out over $300M in funding per year for autism, climate science, physics, and math) and scientific outreach (Math for America, Quanta Magazine, Sandbox Films, etc.).

This year’s hires

To give you a sense of who we hire, here are the postdocs (er, Flatiron Research Fellows) we hired last round.

Mark Goldstein (NYU CS)
Louis Grenioux (Polytechnique applied math)
Tetiana Parshakova (Stanford math)
Zahra Kadkhodaie (NYU neuroscience), partially hosted by the Center for Computational Neuroscience

We also hired a one-year visiting researcher who has a permanent faculty job in math/stats lined up at Hopkins next year:

Luhuan Wu (Columbia statistics)

This is bringing us up to critical mass in diffusion models and variational inference, as well as strengthening the optimization team. I plan to spend a lot of time over the next year focusing on normalizing flows and diffusions for variational inference and focusing on how to code statistical models in JAX.

Our two outgoing stats postdocs, Yuling Yao and Charles Margossian, landed stats faculty jobs at University of Texas and University of British Columbia. Two of our ML postdocs, Wenda Zhou and Ruben Ohana, landed jobs at OpenAI and NVIDIA, after using their postdocs to transition from theoretical Ph.D.s to ML engineering and research.

We currently have three research scientists on the stats/ML side in addition to the senior research scientists, me and Lawrence Saul:

There’s also another half dozen or so more senior postdocs on the ML/stats side (it’s hard to count because so many people do interdisciplinary work here).

We also have five visiting researchers who are here 20% time. They mostly show up once/week other than Stéphane, who bunches up longer visits because his commute’s transatlantic.

Dave Blei
Nawaf Bou-Rabee
Andrew Gelman
Daniel Lee
Stéphane Mallat

On site, not remote

Our jobs are all in person and on site in New York City. Let me add that this decision came down from the top before I arrived and I have zero latitude in breaking it with hires.

Show, don’t tell: ChatGPT 5 marginalizing Gelman’s measurement error model in Stan

Posted on September 9, 2025 3:00 PM by Bob Carpenter

This post is from Bob

Even though the post is from me, the story is partly about Andrew, but mostly about how LLMs are getting better at math. Rather than telling everyone about LLMs and having a theoretical discussion, I thought it’d be useful to show you another example of the kind of interaction with LLMs that I find super useful.

The backstory

Andrew frequently mentions that he’s impatient with Stan’s default sampling times and would like something to give him a rough answer faster. So I and several colleagues have been working on finding faster adaptation (for now, I would recommend Adrian Seyboldt’s fast adapting Nutpie sampler, which can already be run with models coded in Stan or PyMC). That means evaluation. So I finally asked Andrew for an example.

Gelman’s measurement error model

Here’s the measurement error model that Andrew gave me:

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
  vector[N] x;
}
model {
  x ~ normal(mu_x, sigma_x);
  y ~ normal(a + b*x, sigma);
  x_star ~ normal(x, sigma_x_star);
}

As an aside, I love that there’s an appropriate lower bound on N.

This is a classic measurement error model—if you’re not familiar, there’s a short introduction to exactly this flavor of model in the Stan User’s Guide introduction to measurement error models.

Bob’s revised model

This looks like a BDA type model, with improper flat priors. And the latent true values of the covariates x have a centered parameterization. Both of these can cause problems fitting, so I switched to a non-centered parameterization using offset and multipliers, then added some weakly informative priors for all the parameters.

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
  vector<offset=mu_x, multiplier=sigma_x>[N] x;
}
model {
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  mu_x ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
  sigma_x ~ lognormal(0, 0.5);
  
  x ~ normal(mu_x, sigma_x);
  y ~ normal(a + b * x, sigma);
  x_star ~ normal(x, sigma_x_star);
}

It fits better than Andrew’s but still not well. With the default 1000 warmup and 1000 sampling iterations on four chains, there were still R-hat values above 1.2. If we crank that up to an order of magnitude more iterations, everything looks OK, but it still has an integrated autocorrelation time of around 100 (meaning ESS is 1/100 times the sample size).

GPT 5’s revised model

Looking at the model, I thought it looked like a prime candidate for marginalizing out the x parameter, since it has a normal prior and it’s used in a normal likelihood. But I’m terrible at this kind of math and there weren’t any math stats folks around like Philip Greengard, so I turned to GPT 5, which had already done some math for me I couldn’t do myself (calculating the second and fourth moments of Neal’s funnel). Here’s my interaction with GPT 5.0 set to “auto”. GPT 3.5 could barely write a syntactically correct Stan program for linear regression, so let me just say it’s come a long way since then.

[Bob] I have the following Stan program for measurement error in x and I’m wondering if there is some way to marginalize out some of the variables to make this sample more efficiently. I figured since everything is normal there would be some chance of this. In particular, I would like to marginalize the x out if possible.
```stan
[code from above]
```
I would be OK slightly modifying the priors on the sigma to be inverse gammas on sigma^2 if that would help grease the wheels of the marginalization.

And lo and behold, the genie returned a correctly marginalized Stan program:

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
}
transformed parameters {
  real inv_var_x  = inv_square(sigma_x);
  real inv_var_xs = inv_square(sigma_x_star);
  real tilde_v    = 1.0 / (inv_var_x + inv_var_xs);              // Var(x | x*)
  real<lower=0> sd_xstar = sqrt(square(sigma_x) + square(sigma_x_star));
  real<lower=0> sd_y_cond = sqrt(square(sigma) + square(b) * tilde_v);
  vector[N] tilde_mu = tilde_v * ( inv_var_x * rep_vector(mu_x, N)
                                 + inv_var_xs * x_star );
}
model {
  // hyperpriors (use whatever you prefer; half-normal/lognormal work fine)
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  mu_x ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
  sigma_x ~ lognormal(0, 0.5);

  // marginalized likelihood
  x_star ~ normal(mu_x, sd_xstar);                 // p(x*)
  y ~ normal(a + b * tilde_mu, sd_y_cond);         // p(y | x*)
}

Followed by this advice:

[GPT5] Integrating out μx as well is possible with a normal prior, but it induces dense cross-observation covariance (rank-1 update) and is typically worse for HMC/NUTS. Keeping μx as a parameter is usually better.

Did it work?

Yes! Not only did it work, it worked on the first try. And the results matched what I got from a much longer run of the full model, but with a great ESS.
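One way to check the marginalization independently of sampling is to integrate x out numerically for a single observation and compare against the closed form used in the marginalized program. The sketch below does this with a simple trapezoid rule; the function names, grid settings, and test values are assumptions chosen for illustration, while `tilde_v`, `tilde_mu`, `sd_xstar`, and `sd_y_cond` mirror the transformed parameters in the Stan program above.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of normal(mean, sd) evaluated at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def integrate_out_x(y, x_star, a, b, mu_x, sigma, sigma_x, sigma_x_star,
                    n_grid=4001, half_width=10.0):
    """Trapezoid-rule integral over x of p(x) * p(y | x) * p(x_star | x)."""
    lo = mu_x - half_width * sigma_x
    hi = mu_x + half_width * sigma_x
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = lo + i * h
        f = (normal_pdf(x, mu_x, sigma_x)
             * normal_pdf(y, a + b * x, sigma)
             * normal_pdf(x_star, x, sigma_x_star))
        total += f * (0.5 if i in (0, n_grid - 1) else 1.0)  # half weight at endpoints
    return total * h

def marginalized_density(y, x_star, a, b, mu_x, sigma, sigma_x, sigma_x_star):
    """Closed form p(x_star) * p(y | x_star) from the marginalized program."""
    tilde_v = 1.0 / (1.0 / sigma_x ** 2 + 1.0 / sigma_x_star ** 2)  # Var(x | x*)
    tilde_mu = tilde_v * (mu_x / sigma_x ** 2 + x_star / sigma_x_star ** 2)
    sd_xstar = math.sqrt(sigma_x ** 2 + sigma_x_star ** 2)
    sd_y_cond = math.sqrt(sigma ** 2 + b ** 2 * tilde_v)
    return normal_pdf(x_star, mu_x, sd_xstar) * normal_pdf(y, a + b * tilde_mu, sd_y_cond)
```

If the two functions agree for arbitrary parameter values, the marginalized program defines the same density over (a, b, mu_x, sigma, sigma_x) as the original one.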

GPT, can you explain to our users at home how you did that?

I was thinking this kind of operation would be useful to include in the efficiency chapter of the Stan User’s Guide. So I asked the genie to explain itself.

[Bob] This is great and it worked exactly as advertised and had an effective sample size for estimating parameter means of about half the total number of sampling iterations and agreed with a long run of the original model. Now I would like to write this up as an example for our users in the efficiency chapter of the [*Stan User’s Guide*](https://mc-stan.org/docs/stan-users-guide/efficiency-tuning.html). So I want an explanation pitched for a similar audience. Given that we’re going with the factored conditional form, could you go into a bit more detail about how you derived the forms of the four variables you have, the constants $\tilde{v}$ and $\mu_i$, and the distributions for $\x_i^\star$ and $y_i \mid x_i^\star$? You can break it down into known properties of normals, though please remind me of what these are because I’m not very good with classical math stats manipulations of exponential families, nor will many of our users be.

And, of course, with an LLM, Bob’s your uncle (from the French idiom et voilà):

Of course, I will have to double check all the math until I understand it thoroughly, then convert to the Stan User’s Guide style.

An evaluation scheme for sampling

Posted on July 24, 2025 3:00 PM by Bob Carpenter

This post is by Bob

I’ve been working with Reuben Cohn-Gordon on evaluations of MCMC samplers, and we’ve come up with a scheme that I quite like for evaluation. Before we finish a more formal write up, I’d like to get some feedback from the blog readers.

Evaluating with an effective sample size target

First, let’s suppose we have an effective sample size (ESS) target, say something like ESS = 100 (or choose a dozen to make David MacKay’s ghost and Andrew happy or choose 10,000 because reviewer 2 insisted—the point is, pick a number that matters to you—I like 100). The standard error (standard deviation of the estimator) will be

se = sd / sqrt(ESS),

and thus errors will be distributed approximately as

err ~ normal(0, se),

where sd is the posterior standard deviation of the variable whose expectation is being estimated. With ESS = 100, the standard error will be 1/10th of a standard deviation and we thus expect 68% or so of our estimates to have an error of less than 0.1 sd. In this situation, our uncertainty in the value of a variable is dominated by standard deviation, not standard error. That’s why we stop at ESS = 100. You can choose a different ESS target if you like, but we have to choose one for a paper.
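The arithmetic behind those numbers is short enough to spell out. This is a minimal sketch using only the two formulas above and the normal approximation to the error; the function names are made up for illustration.

```python
import math

def standard_error(sd, ess):
    """se = sd / sqrt(ESS)."""
    return sd / math.sqrt(ess)

def prob_error_within(threshold, se):
    """P(|err| < threshold) when err ~ normal(0, se)."""
    return math.erf(threshold / (se * math.sqrt(2.0)))

se_100 = standard_error(1.0, 100)        # in units of posterior sd
se_10000 = standard_error(1.0, 10_000)   # 100 times the ESS, 10 times smaller error
print(se_100, prob_error_within(0.1, se_100), se_10000)
```

With ESS = 100 the standard error is 0.1 sd and about 68% of estimates land within 0.1 sd of the true value; getting to 0.01 sd takes ESS = 10,000.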

From a cold start

We’d like to evaluate how long a sampler takes to get to ESS = 100 from a cold start such as a random initialization. We care about cold starts because it’s usually the best we can do in applied problems. That can mean sampling from the prior or from something like a uniform(-2, 2) as we do in Stan. In some cases we can take a better guess or use something like Pathfinder variational inference to generate an init, but we’re not going to consider special cases like this.

Evaluate square parameter estimates

A problem with HMC is that it’s easy to boost ESS on parameter means (by taking roughly half an orbit where every other draw acts like a control variate) while tanking performance on parameters squared (see, e.g., this post on the Stan Forums about harmonics in HMC/jittered HMC/NUTS trajectories). So we want to evaluate estimates of both parameter posterior means and posterior variance.

Gradients are (usually) the dominant cost

We’ll evaluate using gradients as a proxy for compute, because most of the tight algorithm implementations are dominated by log density and gradient calls, which happen together with autodiff. Of course, this doesn’t work for gradient-free samplers, such as Gibbs or random-walk Metropolis. We can measure those using wall time, but that’s going to be very implementation dependent rather than algorithm dependent. The obstacle to just counting gradients until the ESS target is reached is that the ESS estimators are themselves very noisy, especially for ESS around 25, which is what we need in each of four chains to hit ESS = 100 overall, for example.

Reference moments versus effective sample size

So instead of estimating ESS, which is noisy, we propose to compare in terms of error versus reference moments. We have to be careful to evaluate those either analytically or with relatively low standard error compared to the samplers we’re evaluating so that noise in the reference moments does not dominate the evaluation. So we will take for granted that we can run millions of MCMC iterations and get reference moments that are much more precise than the estimates we will get from the shorter runs from the samplers being evaluated. This becomes very challenging if your ESS target gets much bigger than 100, say to 10,000 to cut errors by a factor of 10 (we only get square root growth, so we need a sample 100 times larger to reduce errors by a factor of 10).

Our proposed metric

What we propose is running until the standardized error in each expectation of interest (D parameters, D parameters squared, 1 log density) is below 0.1. Suppose we have a random variable Z whose posterior mean is E[Z | y] and its posterior standard deviation is sd[Z | y]. Given an estimate hatZ, the standardized (not standard) error is

standardized-error = (hatZ – E[Z | y]) / sd[Z | y].

The standardized error is how many standard deviations the estimate is away from the mean. Taking Z = Theta, Z = Theta^2, and Z = log p(Theta) as our random variables of interest, we want to calculate how many gradients it takes until the standardized error in each one in each dimension is below 0.1 in absolute value.
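
Here is a rough Python sketch of how the metric could be computed from a matrix of draws. It is only an illustration under my own assumptions: the function names, the ordering of the reference moments (parameters, then squared parameters, then log density), and the toy standard normal target are all made up for the example.

```python
import numpy as np

def standardized_errors(draws, log_density, ref_mean, ref_sd):
    """Standardized errors for theta, theta^2, and the log density.

    draws: (N, D) array; log_density: (N,) array; ref_mean and ref_sd are
    (2 * D + 1,) reference moments ordered as [theta, theta^2, log density].
    """
    z = np.column_stack([draws, draws**2, log_density])
    return (z.mean(axis=0) - ref_mean) / ref_sd

def meets_tolerance(std_errs, tol=0.1):
    """True if the worst absolute standardized error is below the tolerance."""
    return bool(np.max(np.abs(std_errs)) < tol)

# Toy target (an assumption for the demo): standard normal in D = 2,
# for which the reference moments are available analytically.
D = 2
ref_mean = np.concatenate(
    [np.zeros(D), np.ones(D), [-0.5 * D * np.log(2 * np.pi) - 0.5 * D]]
)
ref_sd = np.concatenate(
    [np.ones(D), np.sqrt(2.0) * np.ones(D), [np.sqrt(D / 2.0)]]
)

rng = np.random.default_rng(1234)
draws = rng.standard_normal((100, D))  # stand-in for sampler output
log_density = -0.5 * D * np.log(2 * np.pi) - 0.5 * np.sum(draws**2, axis=1)
errs = standardized_errors(draws, log_density, ref_mean, ref_sd)
print(np.round(errs, 3), meets_tolerance(errs))
```

In an actual evaluation the draws would come from the sampler under test and the count of gradient evaluations would be recorded at the first point where `meets_tolerance` returns true.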

Some issues

One of the practical issues that arises is that algorithms like Stan’s implementation of the no-U-turn sampler (NUTS) involve a fixed warmup period of N_warmup iterations and a fixed sampling period of N_sampling iterations. We only want to run warmup as long as necessary so that the total number of gradient evals to get all estimates below our prescribed error threshold is minimized. But we don’t know how many iterations or gradient evaluations that is a priori. So what we will do is run increasingly large windows of adaptation as one might do in practice, starting with 50 iterations, then going to 100, then 200, and so on. Maybe with a base of sqrt(2) like f-stops in a camera, i.e., 50, 71, 100, 141, 200, 283, … Not perfect, but at least I won’t be making our cluster go brrr for a week.
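
The f-stop style schedule is easy to generate. This is a small sketch; the function name and its defaults are my own choices for illustration.

```python
def warmup_schedule(start=50, n_windows=6, base=2 ** 0.5):
    """Geometrically growing warmup lengths, rounded to whole iterations."""
    return [round(start * base**k) for k in range(n_windows)]

print(warmup_schedule())  # [50, 71, 100, 141, 200, 283]
```

Setting `base=2` gives the plain doubling schedule 50, 100, 200, and so on.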

The second practical issue is the usual one facing the evaluation of stochastic processes—there can be high variability due to random seeds. We can mitigate this somewhat by pairing the initializations in the comparisons, but it’s still very noisy. Stan can vary by an order of magnitude in the time it takes to solve these problems, so we need quite a few more evaluations than are traditional to have any confidence in a result such as the expected number of gradients required for a sampler to achieve standardized errors lower than 0.1. The noise is exacerbated by requiring the maximum error to be less than the tolerance.

I suspect we’ll hear more in the comments…

Stan for multimodal mixtures—from exponential CPS to linear DP

Posted on July 22, 2025 3:00 PM by Bob Carpenter

This post is from Bob

I’ve been thinking about evaluation recently because I’ve been working with colleagues on new samplers, which means evaluating how well they work (more on that soon). This in turn means coming up with target densities on which to evaluate them.

A combinatorial multimodal test case

I wanted something clearly multimodal and hence not log concave. I remember somebody’s paper (help with citation?) used a mixture of four two-dimensional isotropic normals, separated enough to make transition possible, but still a bit difficult. Not to give the game away, but here’s a posterior plot of a sample drawn from Stan—the imbalance in component weights is intentional, as I’ll describe below.

I’m still working on posteriordb with the Stan gang (see the authors of the linked paper) and Inference Gym with Reuben Cohn-Gordon (another linguist by training and programming language geek turned to MCMC), and thought it’d be nice to have something a little more general than just the 2D example. So I got out my notebook, and realized the generalization to D dimensions involves 2^D mixture components that are normal with unit covariance located at the points in {-r, r}^D.

p(y | r) = SUM_{mu in {-r, r}^D} 1/2^D normal(y | mu, I).

I then generalized to allow setting the probability that Y[d] > 0 to be p in (0, 1) to get a non-uniform mixture. This leads to a slightly more complex density because of the non-uniformity.

p(y | r, p) = SUM_{mu in {-r, r}^D} p^k * (1 - p)^(D - k) * normal(y | mu, I), where k = sum(mu == r) is the number of positive coordinates of mu.

Coding in Stan with continuation-passing style

So how do we code this in Stan? Obviously it needs to be recursive or at least iterative to deal with the D being unknown at compilation time. Whenever I see recursion, I immediately think of continuation passing style (CPS). So I came up with this Stan program to code a generalization in D dimensions.

functions {
  real mm(vector y, real r, real p, int d, real lp) {
    if (d == 0) {
      return lp;
    }
    real lp1 = mm(y, r, p, d - 1, lp + normal_lpdf(y[d] | r, 1));
    real lp2 = mm(y, r, p, d - 1, lp + normal_lpdf(y[d] | -r, 1));
    return log_mix(p, lp1, lp2);
  }

  real mm_lpdf(vector y, real r, real p, int D) {
    return mm(y, r, p, D, 0);
  }
}
data {
  int D;   // number of dimensions
  real r;  // modes in {-r, r}^D
  real p;  // p = Pr[Y[d] > 0]
}
parameters {
  vector[D] y;
}
model {
  y ~ mm(r, p, D);
}

The log_mix function is defined as follows, but implemented in a more stable way.

log_mix(p, lp1, lp2)
    = log_sum_exp(log(p) + lp1, log(1 - p) + lp2)
    = log(exp(log(p) + lp1) + exp(log(1 - p) + lp2))
    = log(p * exp(lp1) + (1 - p) * exp(lp2)).
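
To see why the stable form matters, here is a small Python comparison in which NumPy's `logaddexp` plays the role of `log_sum_exp`. This is a sketch of the idea, not a description of how Stan implements `log_mix` internally, and the `_naive` function is just the last line of the definition above.

```python
import numpy as np

def log_mix(p, lp1, lp2):
    """Stable two-component log mixture via log-sum-exp."""
    return np.logaddexp(np.log(p) + lp1, np.log1p(-p) + lp2)

def log_mix_naive(p, lp1, lp2):
    """Direct transcription of log(p * exp(lp1) + (1 - p) * exp(lp2))."""
    return np.log(p * np.exp(lp1) + (1 - p) * np.exp(lp2))

# The two agree for moderate log densities.
print(log_mix(0.3, -1.0, -2.0), log_mix_naive(0.3, -1.0, -2.0))

# For very negative log densities the naive form underflows to log(0).
with np.errstate(divide="ignore"):
    print(log_mix(0.3, -1000.0, -1001.0), log_mix_naive(0.3, -1000.0, -1001.0))
```

The second line prints a finite value for the stable version and `-inf` for the naive one, which is the kind of failure that shows up quickly once the modes are far apart.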

If you unfold the recursion manually, the leaves wind up being the log densities and the weights wind up percolating as described in the definition. If you’re having trouble seeing this, manually expanding the D = 1 and then D = 2 cases will help. It’s compact, but it’s still exponential in cost to evaluate a log density and gradient (i.e., O(2^D)). Although it’s slow in higher dimensions, it works.

Python scripts

The plot above is from the following Python code that sets

D = 2, r = 2.5, and p = 2.0/3.0.

For those of you considering a move to Python, having a clone of data frames (pandas) and ggplot2 (plotnine) is a godsend. And yes, of course the LLMs know how to code pandas and plotnine.

import cmdstanpy as csp
import pandas as pd
import plotnine as pn

model = csp.CmdStanModel(stan_file='mm.stan')
D = 2
r = 2.5
p = 2.0 / 3.0
data = {'D': D, 'r': r, 'p': p}
fit = model.sample(data=data, iter_sampling=5_000)
print(fit.summary(sig_figs=2))

y = fit.stan_variable('y')
df = pd.DataFrame({'y1': y[:, 0], 'y2': y[:, 1]})
plot = (
    pn.ggplot(df, pn.aes(x='y1', y='y2'))
    + pn.geom_vline(xintercept=[-r, r], color='red', linetype='dashed')
    + pn.geom_hline(yintercept=[-r, r], color='red', linetype='dashed')
    + pn.geom_point(alpha=0.1)
    + pn.scale_x_continuous(breaks=[-r, 0, r])
    + pn.scale_y_continuous(breaks=[-r, 0, r])
    + pn.coord_fixed()
    + pn.theme_minimal()
)
plot.save('mm.jpg', dpi=300)

The knockoffs of data frames in pandas and of ggplot2 in plotnine are a godsend if you’re transitioning to Python from R (which I would highly recommend).

Dynamic programming to the rescue

Because it involved CPS, I mailed it off to Brian Ward around midnight last night. I’m a decent programmer, but Brian’s next level. By the time I arrived today at 10 am, he had rewritten the target density as follows.

  real mm_lpdf(vector y, real r, real p, int d) {
    if (d == 0) {
      return 0;
    }
    real lower_mixture = mm_lpdf(y | r, p, d - 1);
    real lp1 = lower_mixture + normal_lpdf(y[d] | r, 1);
    real lp2 = lower_mixture + normal_lpdf(y[d] | -r, 1);
    return log_mix(p, lp1, lp2);
  }

[Edit: Switched everything to lpdf from a mix of lpdf and lupdf.]

He saw that the recursions were doing the same thing in each branch and could be shared. Because there’s only one recursive call, Brian’s code is linear (i.e., O(D)). It achieves this speedup using dynamic programming (DP). DP calculates partial solutions that can be combined into larger solutions rather than recomputing them. DP’s the technique that you need to solve the harder L33T-code quizzes you’ll get during technical interviews these days. Other examples where DP can be helpful for statistical models include the fast Fourier transform (FFT), the forward algorithm for hidden Markov models (HMMs), and the Poisson-binomial distribution. The first two are coded efficiently in Stan and the last I showed how to code in a Stan forum post on Poisson-binomial.
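
If you want to convince yourself that the two recursions compute the same thing, here is a plain Python port of both Stan functions with a call counter bolted on. The port is my own sketch (the names `mm_exponential` and `mm_linear`, the counter argument, and the test point `y` are made up), and Python indexes from zero where Stan indexes from one.

```python
import math

def normal_lpdf(y, mu):
    """Log density of normal(y | mu, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mu) ** 2

def log_mix(p, lp1, lp2):
    """Stable log(p * exp(lp1) + (1 - p) * exp(lp2))."""
    a, b = math.log(p) + lp1, math.log1p(-p) + lp2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mm_exponential(y, r, p, d, lp, calls):
    """Port of the two-branch recursion; makes two recursive calls per level."""
    calls[0] += 1
    if d == 0:
        return lp
    lp1 = mm_exponential(y, r, p, d - 1, lp + normal_lpdf(y[d - 1], r), calls)
    lp2 = mm_exponential(y, r, p, d - 1, lp + normal_lpdf(y[d - 1], -r), calls)
    return log_mix(p, lp1, lp2)

def mm_linear(y, r, p, d, calls):
    """Port of the shared-subproblem recursion; one recursive call per level."""
    calls[0] += 1
    if d == 0:
        return 0.0
    lower = mm_linear(y, r, p, d - 1, calls)
    return log_mix(p, lower + normal_lpdf(y[d - 1], r),
                   lower + normal_lpdf(y[d - 1], -r))

D, r, p = 10, 2.5, 2.0 / 3.0
y = [0.3 * d - 1.0 for d in range(D)]  # arbitrary evaluation point
calls_exp, calls_lin = [0], [0]
lp_exp = mm_exponential(y, r, p, D, 0.0, calls_exp)
lp_lin = mm_linear(y, r, p, D, calls_lin)
print(lp_exp, lp_lin)              # same value up to floating point
print(calls_exp[0], calls_lin[0])  # 2047 11
```

At D = 10 that is 2^(D + 1) - 1 = 2047 function calls versus D + 1 = 11, and the gap doubles with every added dimension.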

Try it yourself in the Stan Playground

If you want to play with this yourself, Brian built a version using the Stan Playground that you can run in the browser.

Stan Playground: Combinatorial normal mixtures

Here’s what it looks like after setting D = 3, running sampling, and then viewing a histogram with all three dimensions selected.

It’s a live demo, so you can edit the data to set r, D, and p. And it’s really fast due to the DP. Just like in ShinyStan and especially like its generic in-the-browser version MCMCMonitor (from many of the same developers as Stan Playground), you can view 3D projections of the higher-dimensional draws and rotate them to see it making 8 balls in 3D, 7 of which are visible in the screen grab. Or you can go to higher dimensions and view projections down to two or three dimensions. You might want to increase the number of draws per chain to get cleaner delineation of the posterior densities in the visualizations.

Chaining Bayesian inference with priors constructed from posterior draws

Posted on May 13, 2025 3:00 PM by Bob Carpenter

This post is from Bob.

Chenyang Zhong, a stats professor at Columbia, presented the following paper at our Bayesian computation reading group on Friday.

Chenyang Zhong, Shouxuan Ji, Tian Zheng. 2024. Graph-Enabled Fast MCMC Sampling with an Unknown High-Dimensional Prior Distribution. arXiv 2408.02122.

The goal is to be able to chain Bayesian inferences on a data stream in situations where there’s no analytic form of the posterior. The problem is that the textbook solution of using analytic posteriors (e.g., chaining binomial likelihoods with beta priors) only works for simple conjugate models.
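
For reference, this is what the textbook conjugate case looks like in a few lines of Python. The function name and the counts are made up for illustration. Chaining two beta-binomial updates gives exactly the same posterior as a single update on the pooled data, which is the behavior we would like to approximate when no closed form exists.

```python
def beta_binomial_update(a, b, successes, trials):
    """Posterior beta(a', b') after binomial data under a beta(a, b) prior."""
    return a + successes, b + (trials - successes)

prior = (1.0, 1.0)
post1 = beta_binomial_update(*prior, successes=7, trials=10)    # first data set
post2 = beta_binomial_update(*post1, successes=12, trials=20)   # posterior as prior
batch = beta_binomial_update(*prior, successes=19, trials=30)   # pooled data
print(post2, batch)  # (20.0, 12.0) (20.0, 12.0)
```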

The model

The unknown prior in this case is constructed from draws from the posterior of a previous model. To ground us with some notation, suppose our joint model is the product of a likelihood and prior,

p(y, theta | x) = p(y | theta, x) * p(theta).

Sequential data

For example, consider the case where we receive a sequence of data sets

(x1, y1), (x2, y2), …, (xn, yn), ….

In Chenyang’s case, the data is private, so you can view the problem as a kind of federated learning or as a kind of meta-analysis. We know we can fit a model p(theta, y | x) and get posterior draws

theta_post(1), …, theta_post(M) ~ p(theta | x1, y1).

Chenyang assumes the posterior draws may be shared. What we’d like to do is use the posterior p(theta | x1, y1) as the prior for theta when analyzing data x2, y2, i.e.,

p(theta | x1, x2, y1, y2) propto p(y2 | theta, x2) * p(theta | y1, x1).

But we don’t have a closed form expression for p(theta | y1, x1), so what do we do?

Kernel density estimate as a posterior approximation

Chenyang’s idea is to use a kernel density estimate with a normal basis as a proxy for p(theta | y1, x1). Specifically, what he’s going to do is write an empirical prior that penalizes squared distance from the posterior draws.

p(theta | y1, x1) approx 1/M SUM_m normal(theta | theta_post(m), h * I),

where I is the identity matrix and h > 0 is a variance parameter. Most of Chenyang’s paper is about how to compute this efficiently. By “high dimensions” he’s talking about a modest 6 to 20 dimensions. He takes M to be about 10,000. He then goes about constructing a really neat way to Metropolis sample exactly using only nearest neighbors of theta.

It’s easy to establish that the maximum likelihood estimate and posterior mean of the prior will be the sample mean of the posterior draws theta_post(1), …, theta_post(M). How strongly it concentrates around that mean will depend on how spread out the posterior draws are. What’s interesting is that how hard it pulls does not depend on how large M is; the more the merrier in terms of accuracy. But it will depend on the variance term h.
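
Here is a brute-force Python sketch of the kernel density prior, evaluating all M kernels directly rather than using the nearest-neighbor graph from the paper. The function names and the simulated stand-in draws are my own assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def kde_log_prior(theta, draws, h):
    """log of 1/M * SUM_m normal(theta | draws[m], h * I).

    theta: (D,) array; draws: (M, D) array of posterior draws; h: variance.
    """
    M, D = draws.shape
    sq_dist = np.sum((draws - theta) ** 2, axis=1)
    log_kernels = -0.5 * D * np.log(2 * np.pi * h) - 0.5 * sq_dist / h
    return logsumexp(log_kernels) - np.log(M)

def kde_sample(rng, draws, h, size):
    """Draw from the mixture: pick a posterior draw, add normal(0, h * I) noise."""
    idx = rng.integers(draws.shape[0], size=size)
    noise = np.sqrt(h) * rng.standard_normal((size, draws.shape[1]))
    return draws[idx] + noise

rng = np.random.default_rng(0)
draws = rng.normal(size=(1000, 3))  # stand-in for theta_post(1), ..., theta_post(M)
samples = kde_sample(rng, draws, h=0.1, size=50_000)
print(samples.mean(axis=0) - draws.mean(axis=0))  # close to zero
print(kde_log_prior(np.zeros(3), draws, h=0.1))
```

The sampled mean tracks the mean of the draws regardless of M, while h only controls how much extra spread the kernels add around each draw.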

What if we just use Stan?

Stan’s pretty darn fast at normal approximations, so what if we just coded the approximate posterior directly rather than trying to use a graph of nearest neighbors and adjust? Turns out it works very cleanly. Given the data set in his paper, which is a logistic regression with N = 1500, it takes Stan 2s to fit the posterior for (x1, y1), and 35s to fit the posterior for (x2, y2) using 10,000 posterior draws from the approximate posterior (this is my 2017 iMac Pro, which is very slow compared to current ARM-based Macs).

I coded this all up as a Stan case study that you can find here, along with the results:

Stan case study for Zhong’s kernel posterior approximation.

Should I add this as a new technique to the third section of the Stan User’s Guide? This problem comes up all the time on our forums.

What’s left to do?

I have ~~two~~ four questions left after fitting the model.

How to set the variance term h? It doesn’t affect the mean, but it does affect how strongly the prior concentrates. Chenyang mentioned something about potentially wanting to discount the past, which you can do. Is there a way to set that by tweaking h? Alternatively, is there a way to set it optimally so that the posterior for the final fit is closest to p(theta | x1, y1, x2, y2) in cases where we want to equally weight the past?
How many posterior draws do we need? In simple cases like this, probably not 10,000!
Is this just an easier way to do the computation rather than estimating a covariance matrix and using a multivariate normal? The reason I ask is that I know you can generate from the empirical covariance by differencing in this way—it’s equation (11) in Goodman and Weare’s affine invariant sampling paper.
How can this approach handle constrained parameters? We can just keep exactly the same code and keep the constraints on the parameters and everything should work, but it seems more natural to lay down a multivariate normal approximation on the unconstrained scale (e.g., after log transforming positive constrained parameters).

[edit: added third and fourth question]

AISTATS ’25 Best Paper award—Margossian and Saul on exact recovery of means and correlation in VI

Posted on May 5, 2025 8:07 PM by Bob Carpenter

The best paper award for AISTATS ’25 was just awarded to the following paper.

Charles C. Margossian and Lawrence K. Saul. 2025. Variational Inference in Location-Scale Families: Exact Recovery of the Mean and Correlation Matrix. AISTATS.

Here’s Charles, nattily dressed in a blue blazer for the award ceremony:

Photo credit: Daniel Lee (the Cornell CS professor, not the Stan developer)

For those who might not know Charles, he was a Ph.D. student of Andrew’s, is currently a postdoc at Flatiron Institute, and as he posted here recently, is on his way to start a faculty position in the statistics department at the University of British Columbia (Vancouver). I can say from a front row seat that Lawrence is really really good at coaching writing and presenting and Charles is a great presenter (though the latter wouldn’t help with the best paper award—they knew about it before the presentation so we got to see the practice talk). If you get a chance to take one of Lawrence’s scientific writing tutorials, take it—it will for sure give you a leg-up in a best paper competition.

It’s a small world in Bayesian-flavored AI and stats. I believe that’s Stephan Mandt on the right of the photo—he’s the general program chair for AISTATS this year. Stephan was a postdoc with Dave Blei at Columbia while I was there and worked with Matt Hoffman at Google. Charles has also published papers on VI with Dave Blei. Daniel Lee (who took the photo above) was at Bell Labs with me in the late 1990s. I sometimes ran into Lawrence on the train back then going to AT&T Labs.

Analytics positions for the Miami Marlins (baseball)

Posted on April 3, 2025 2:41 PM by Bob Carpenter

Cross-posted from the Stan forums jobs board.

Work with Alex Andorra

Alexandre Andorra, of Learning Bayesian Statistics podcast and PyMC/ArviZ fame, is now working for the Miami Marlins. If you don’t know Alex’s podcast, I’d encourage you to check it out; he has interviewed a number of the Stan devs, including two live interviews during the last StanCon.

All new Marlins

As those of you who follow baseball know, the Marlins were an expansion team in the early 90s who had some early success, but have been underperforming recently. Look for that to turn around because Peter Bendix, the former general manager of the Tampa Bay Rays, the most analytics-oriented team in the sport, has moved to become president of baseball operations and general manager in Miami.

The analytics job openings

The Marlins are recruiting for two positions:

Both jobs have the following line item in the Skill Requirements section:

Experience with probabilistic programming languages preferred (Stan, PyMC)

How is an American research university funded?

Posted on March 20, 2025 3:00 PM by Bob Carpenter

[Update: I hadn’t realized that Johns Hopkins was such an outlier when writing this. I followed up with some stats from other universities in this comment responding to Andrew.]

If you want some insight into why American academics are losing their minds, it helps to understand how American research is funded. The U.S. classifies universities into tiers by how much research they do (a “college” is like a university but doesn’t have graduate programs). The R1 universities are the research universities you’ve heard of like Stanford and Johns Hopkins and Columbia and University of Michigan. What you may not realize is that a huge chunk of their operating budget is derived from grants.

Here’s a link to the Johns Hopkins Annual Financial Report, 2023. I chose Hopkins because it’s been coming up with a lot of our job candidates. The breakdown of the budget is on pages 4 and 5. The bottom line is that the university, a non-profit organization, made a $414M “profit.” But let’s break down where the money is coming in and out.

Page 4: Operating Revenue and Operating Expenses

First note that the numbers are in thousands of U.S. dollars. The total income is $7.8B. Let’s break that down.

“Tuition” brings in only $830M. Universities don’t collect full tuition—they give out a lot of financial aid. And they just don’t have that many students.

“Grants, contracts, etc.” bring in $2.3B (excluding the APL).

“Applied Physics Laboratory” brings in another $2.3B (yes, same number). They mince words in the intro, but it’s basically a defense contractor if you read the bullet items, whose mission is, quite frankly, frightening (click through for details of their “warfighting” support). The U.S. sadly entangles its defense budget and university budgets.

“Contributions” from individuals, foundations and corporations make up another $210M. I wonder how the great whale Bloomberg’s $1B donation is accounted—I’m guessing this is lottery-style reporting of $1B, not net-present value of $1B.

“Net assets released from restrictions” of $110M, which means previously restricted donations become available to be spent.

“Clinical services,” i.e., the Hopkins hospital system, brought in $890M net!!! More than tuition, but less than grants.

“Reimbursements from affiliates” of $760M is basically things affiliated with Hopkins like a broader network of hospitals and research institutions for services rendered. Yikes. Add that to clinical services and you have a hospital network making way more money than tuition.

“Other revenues” of $190M. No clue as to what this is.

“Endowment payout” of $425M. U.S. non-profit law requires institutions to pay out a fraction of their endowment every year. My guess is that they’d pay out zero if they could, since university presidents are largely incentivized around two things: raising the endowment and raising U.S. News and World Report rankings.

“Auxiliary expense” income of $105M from things like bookstores, housing, and dining. They’re actually turning a huge “profit” on this stuff! Who knew?

“Maryland State aid” of $65M. Basically a drop in the bucket.

“Investment return” of $73M. I don’t know if this is the endowment or other investments.

Page 5: Other changes in net assets with and without donor restrictions

This is mostly investment stuff related to pensions, and investment return. There is one item of note, the contributions.

“Contributions” totalled $2.1B! This is what it sounds like, but the contributions are typically restricted (i.e., they come with strings attached, typically building a building with someone’s name on it or funding some tenure line). And yet I’m guessing they still pester their alumni with annual “donate to Hopkins” letters. I know both Michigan State and Edinburgh never seem to miss a move with their begging.

Overhead

The government is threatening to reduce, or maybe already has reduced (hard to keep up), overhead rates from 60% to 15%. A 60% rate means that if you apply for $1 of direct costs, you also apply for $0.60 of overhead. The plan will remove 3/4 of the overhead revenue, which for Hopkins was $460M, which means a revenue reduction of about $345M.

The much bigger effect will be what’s happening to Columbia with just cuts across the board in NIH and NSF funding. If that hits Hopkins, it’s going to hurt, because that’s a large part of their $2.3B research grant budget.

Was 2023 an outlier?

No. It’s just more of the same in 2024. If you follow the link it has 2023/2024 both. In 2024, the bottom-line “profit” was up to $2.6B!!!

Panic setting in?

Despite this recent success (I didn’t find 2024) they’re panicking and reportedly laying off 2000 employees. If you click through to the article, you’ll realize that’s 250 people “cut” in the U.S. and 1900 internationally, mostly working on international health aid, with another 200 locals “furloughed” (I don’t know exactly what that means in this context).

BOTTOM LINE: JOHNS HOPKINS 2023 BUDGET

Research grants and overhead: $2.3B
Defense contracting: $2.3B
Health care and other services: $1.6B
Tuition: $841M
Endowment payout: $425M
Bookstores, housing, dining: $105M
State aid: $64M

Operating “profit”: $410M

Bottom-line “profit”: $2.3B

Which leaves us with the question of what one calls a “profit” at a non-profit? “Changes in Net Assets” is the term of art used in the reports.

Generalized linear neural network models

Posted on February 4, 2025 4:11 PM by Bob Carpenter

This is Bob.

Are neural nets the future of regression?

Andrew was visiting Flatiron last Friday (really last Friday, not six months ago), and I was asking the question that’s been on my mind lately: will neural networks put regression modelers out of work?

Andrew hired me and Matt Hoffman in 2010 to work out how to specify and fit hierarchical regression models with interactions. He wanted to create a system that automatically added interactions, non-linearities, etc., guided by a vaguely conceived “topology of models.” This is a combinatorial nightmare, even with a handful of covariates and non-linearities, and not even considering continuous variation in things like priors.

Black box non-linear function approximations

Fast forward 15 years and regressions from neural networks are ubiquitous. Rather than specifying interactions, non-linearities, etc., we just let a highly overparameterized deep neural net sort it out. This idea of black-box, non-linear function approximation is not new. I first saw it with random forests (the Bayesian analogue of which is Bayesian additive regression trees) and more recently, gradient-boosted decision trees (the go-to method in Kaggle competitions).

Do we have enough data?

The only thing holding us back from using neural networks everywhere is limited data. It’s clear as our data sets get bigger that neural network regression works very well (see, e.g., LLMs, image recognition, and image generation systems, all of which fit largely black-box deep neural network models).

Uncertainty quantification

As we were talking about this, Andrew kept returning to uncertainty quantification. I somehow couldn’t convince him that we can do exactly the same thing as we are currently doing. There’s no fundamental difference between using a neural network and using bespoke hand-tooled covariate combinations—just different functions mapping the covariates to expected values.

Here’s a document explaining the connection

I didn’t have time to explain this to Andrew at the board, so I wrote it up as a document. This goes over how you can take a GLM and swap out the linear component for a neural network and then proceed as usual. It contains an example of a two-hidden layer perceptron model coded in Stan.

Bob Carpenter. 2025. It’s all neural nets to me. Technical note.

I’m always happy to get feedback if people have comments or suggestions. Keep in mind the purpose here is not a publication, but just explaining how a neural network can be swapped in for the linear function in a generalized linear model.
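
To show how small the swap is, here is a NumPy sketch of a logistic regression likelihood with two interchangeable predictors, a linear one and a two-hidden-layer perceptron. Everything here is an assumption made for illustration (the function names, the tanh activations, the layer widths, and the random inputs); it is not the Stan code from the technical note.

```python
import numpy as np

def linear_predictor(x, alpha, beta):
    """Classic GLM linear predictor: alpha + x * beta."""
    return alpha + x @ beta

def mlp_predictor(x, W1, b1, W2, b2, w3, b3):
    """Two-hidden-layer perceptron standing in for the linear predictor."""
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h2 @ w3 + b3

def bernoulli_logit_lpmf(y, eta):
    """SUM_n log bernoulli(y[n] | inv_logit(eta[n])), computed stably."""
    return np.sum(y * eta - np.logaddexp(0.0, eta))

rng = np.random.default_rng(0)
N, K, H = 5, 3, 4  # observations, covariates, hidden units
x = rng.standard_normal((N, K))
y = rng.integers(0, 2, size=N)

eta_lin = linear_predictor(x, 0.5, rng.standard_normal(K))
eta_mlp = mlp_predictor(
    x,
    rng.standard_normal((K, H)), rng.standard_normal(H),
    rng.standard_normal((H, H)), rng.standard_normal(H),
    rng.standard_normal(H), 0.0,
)

# The likelihood code is identical; only the function producing eta changed.
print(bernoulli_logit_lpmf(y, eta_lin), bernoulli_logit_lpmf(y, eta_mlp))
```

Priors on the weights and posterior inference then proceed exactly as they would for the regression coefficients.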

Treasure trove of forensic details in arXiv’s LaTeX source code

Posted on January 7, 2025 3:00 PM by Bob Carpenter

There’s gold in them thar hills^*

When you submit a paper to arXiv, you send them a bundle including the LaTeX source, figures, etc. These are all available for download through the arXiv site. This morning, I was downloading the source^** for the original Hoffman and Gelman no-U-turn sampler paper. If you want to follow along, here’s the arXiv link, but you have to click through to the “TeX Source” link under the “Access Paper:” header on the top right side under the banner. What I found was a treasure trove of comments that never made it to the paper, some of which I will share below.

Examples

Returning to Hoffman and Gelman’s arXiv source LaTeX, what struck me was the following comment right after the algorithm itself.

%% Algorithm ?? is more efficient than algorithm
%% ??, but the policy of sampling uniformly from
%% $\cC$ leaves something to be desired. We would prefer to select an
%% element of $\cC$ that is farther away from the initial position
%% $\theta^t$, rather than face the possibility of performing many costly
%% gradient evaluations just to wind up choosing an element of $\cC$ that
%% is close to where we started. Algorithm ?? addresses
%% this issue by giving preference to points subtrees that do not include
%% the starting point $\{\theta^t, r^t\}$. [To do: explain why this is
%%   valid. Probably proof by induction is the easiest way to go.]

Neither the arXiv preprint nor the final JMLR paper has a clearly delineated inductive proof. In both versions, we get “this is equivalent to a Metropolis-Hastings kernel with proposal …, and it is straightforward to show that it obeys detailed balance” (second-to-last sentence on page 1604 of JMLR paper).

Presumably on the principle of minimizing surface area for reviewers to gripe, the following useful comment from the abstract didn’t make the final cut.

%% This issue is compounded when the
%% target distribution depends on a set of parameters that cannot be
%% updated by HMC (such as discrete parameters) and are updated
%% independently of the parameters updated by HMC. 
%% In this case, optimal settings of $L$ may change from iteration to
%% iteration.

Here’s another useful comment that wound up on the cutting-room floor. I’m not saying these should all be in the paper—usually there are so many things you can add and qualify that it requires some judgement. But for the dedicated and interested reader, the paper would have been more useful with the elided comments.

%% Even if we assume that there exists some transformation of the
%% parameter space under which all parameters are i.i.d. and that this
%% transformation can be applied cheaply (i.e. in $O(D)$ time, for
%% example using a low-rank transformation matrix to avoid the $O(D^2)$
%% cost of dense matrix multiplication and the $O(D^3)$ cost of dense
%% matrix inversion), the cost of obtaining an effectively independent
%% sample using RWM is still $O(D^2)$ \citep{Creutz:1988}. Gibbs also
%% requires $O(D^2)$ operations per effectively independent sample in
%% this setting, since it must update $D$ parameters and it must perform
%% a transformation costing $O(D)$ operations after each update.

There are also useful explanations of figures that never made the final cut, like this one, which expands the diagram from the one of “naive NUTS” in the paper figures to what the paper calls “efficient NUTS.”

%% %% %% Figure ?? illustrates how an iteration of NUTS might
%% %% %% proceed once the slice and initial momentum variables have been
%% %% %% resampled. Initially (a), we have only one node. We double the size of
%% %% %% the tree to two nodes by taking a single step forward (b), and since
%% %% %% the new point is valid tentatively set $w^{t+1}$ to that new point
%% %% %% (with probability $1/1=1$). We then redouble the size of the tree to
%% %% %% four nodes, taking two steps forward (c). Only one of the two new
%% %% %% nodes is valid, so the probability of choosing a node from the new
%% %% %% half-tree is $1/2$ (the ratio of the number of valid new nodes to
%% %% %% valid old nodes). In this example, we randomly choose to stick with
%% %% %% the old value of $w^{t+1}$. Next, we again double the size of the tree
%% %% %% by taking four steps backward from $w^-$ (d). We discover that the new
%% %% %% half-tree satisfies the stopping criterion, and so we cannot select
%% %% %% any points from it. Finally, we double the tree one more time, this
%% %% %% time going forward (e). This half-tree contains some valid points and
%% %% %% does not satisfy the stopping criterion, but a subtree of it does
%% %% %% satisfy the stopping criterion, so we invalidate the points in that
%% %% %% subtree. The number of valid points in this half-tree (3) is the same
%% %% %% as the number of valid points in the old half-tree (3), so we choose a
%% %% %% point uniformly at random from the new half-tree for $w^{t+1}$. At
%% %% %% this point, the end points $w^-$ and $w^+$ satisfy the stopping
%% %% %% criterion, and we return $w^{t+1}$ as the new position-momentum pair.

The following would have been nice.

%% Also, we should probably have a scatterplot showing target versus
%% realized criteria (mean acceptance probability, mean energy change)
%% that shows that the stochastic approximation scheme pretty much works,
%% and maybe a plot showing convergence speed.

There’s more where these came from—I was just cherry picking from the algorithm, abstract, intro, and conclusion.

Who knew?

I’ve never heard anyone mention diving into the source of papers, so I wonder just what’s out there to be mined. I also wonder how many authors realize that comments in their arXiv LaTeX are forever.
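
If you want to go mining yourself, a few lines of Python will pull the full-line comments out of a .tex file once you’ve downloaded and unpacked the source bundle. This is a minimal sketch: the function name is made up, it only looks at lines that start with a percent sign, and it ignores trailing comments after text on the same line.

```python
import re

def full_line_comments(tex: str) -> list[str]:
    """Return the text of lines that consist only of a LaTeX comment."""
    comments = []
    for line in tex.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("%"):
            # Drop the leading run of percent signs and spaces.
            comments.append(re.sub(r"^[%\s]+", "", stripped).rstrip())
    return [c for c in comments if c]

example = "\n".join([
    r"Acceptance rates near 65\% are kept in the text.",
    "%% A remark that never made it into the paper.",
    "  % [To do: explain why this is valid.]",
    r"\section{Introduction} % trailing comments are skipped",
    "%",
])
for comment in full_line_comments(example):
    print(comment)
```

To run it on a real paper, read the unpacked .tex file into a string and pass it in place of `example`.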

^* An American idiom meaning there’s value to be found from exploring in a particular place; see Wiktionary for a definition and etymology.

^** I downloaded the LaTeX source of Hoffman and Gelman’s paper in order to produce a ChatGPT(o1[plus]) translation of the efficient NUTS algorithm to Python. I need to code a similar algorithm for a new sampler we’re exploring and wanted to make sure I had understood the structure of the NUTS algorithm, because it’s a very subtle recursion. GPT continues to impress!

Softmax is on the log, not the logit scale

Posted on December 26, 2024 3:00 PM by Bob Carpenter

Bad Stan naming

I realized recently that we followed the confusing terminological convention of ML in our description of Stan’s categorical_logit function. In Stan, if there’s a suffix to a distribution, it describes the scale of one or more of the parameters. For example,

poisson_log(y | u) == poisson(y | exp(u)).

So when we write categorical(y | p) we take p to be a simplex (sequence of finite, non-negative values that sum to 1). So it would make sense that categorical_logit(y | logit(p)) would be equivalent, where logit(p) = log(p / (1 - p)). But that’s not how it works in Stan. Instead,

categorical_logit(y | u) = categorical(y | softmax(u)).

We made the same mistake everyone in ML makes in their variable naming! We call the u here “logits”, when in fact they’re (unnormalized [see below]) log probabilities. This is probably due to the fact that if u is a regression, then the resulting system is called “multinomial logistic regression.”

Example

The softmax function is defined by softmax(u) = exp(u) / sum(exp(u)). When used like this, the arguments to softmax are log probabilities, not logit probabilities. Here’s a little snippet of Python to illustrate.

>>> import numpy as np
>>> import scipy as sp
>>> import scipy.special
>>> p = np.asarray([0.2, 0.5, 0.3])
>>> def logit(p): return np.log(p / (1 - p))
...
>>> logit_p = logit(p)
>>> log_p = np.log(p)
>>> sp.special.softmax(logit_p)
array([0.14893617, 0.59574468, 0.25531915])
>>> sp.special.softmax(log_p)
array([0.2, 0.5, 0.3])

This shows that to round-trip probabilities through softmax, the appropriate operation is the natural logarithm, not the logit function.

Origin of the confusion

So where did this confusion come from? Let’s look at a standard binary logistic regression. There we take

p(y | alpha, beta, x) = bernoulli(y | inv_logit(alpha + beta * x))

where

inv_logit(v) = exp(v) / (1 + exp(v)).

Writing inverse logit this way suggests how to write a logistic regression with a categorical distribution and softmax.

p(y | alpha, beta, x) = categorical(y | softmax([0, alpha + beta * x]))

That’s because

softmax([0, alpha + beta * x])
  = [exp(0), exp(alpha + beta * x)] / (exp(0) + exp(alpha + beta * x))
  = [1, exp(alpha + beta * x)] / (1 + exp(alpha + beta * x))
  = [1 / (1 + exp(alpha + beta * x)), exp(alpha + beta * x) / (1 + exp(alpha + beta * x))]
  = [1 - inv_logit(alpha + beta * x), inv_logit(alpha + beta * x)].

This derivation shows that the second outcome of the categorical in this formulation, which plays the role of y = 1 in the Bernoulli version, has probability inv_logit(alpha + beta * x). But this connection falls apart in the multinomial case when there are more than two outcomes.
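The binary identity can be checked numerically. The following is a minimal sketch using only Python’s standard library; the helper names softmax and inv_logit and the test values of v (standing in for alpha + beta * x) are illustrative choices, not anything from Stan.

```python
import math

def softmax(u):
    """Softmax of a list of floats, shifted by the max for numerical stability."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

def inv_logit(v):
    """Inverse logit (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-v))

# v stands in for alpha + beta * x; the values are arbitrary
for v in [-2.0, 0.0, 0.7, 3.5]:
    p = softmax([0.0, v])
    assert math.isclose(p[0], 1.0 - inv_logit(v))
    assert math.isclose(p[1], inv_logit(v))
```

The assertions pass silently, which is all the check is meant to show: with two outcomes and the first input pinned to 0, softmax and inverse logit agree.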

In traditional frequentist K-outcome multinomial logistic regressions, the first input to softmax is pinned to 0 for identifiability, just as in the binary case.

softmax([0, u[2], ..., u[K]]) = [exp(0), exp(u[2]), ..., exp(u[K])] / (exp(0) + exp(u[2]) + ... + exp(u[K]))

This leads to asymmetry in the regression as we don’t have a regression for the first element. What it does do is make softmax and log proper inverses. If you reduce the choice to just the first category and some other category, then you get a standard binomial logistic regression again. But you still can’t round trip the multinomial case with logit, because

exp(u[2]) / (exp(0) + exp(u[2]) + ... + exp(u[K])) != inv_logit(u[2])

To see that this is still not going to produce logits in the multinomial case, here’s some more Python.

>>> log_p
array([-1.60943791, -0.69314718, -1.2039728 ])
>>> log_p_zero = log_p - log_p[0]
>>> log_p_zero
array([0.        , 0.91629073, 0.40546511])
>>> sp.special.softmax(log_p_zero)
array([0.2, 0.5, 0.3])

So as you can see, softmax isn’t identified without pinning one of the values—we can add or subtract a constant from each element of the input and get the same value. But this still doesn’t turn the inputs to softmax into logits.

>>> def inv_logit(v): return 1 / (1 + np.exp(-v))
...
>>> inv_logit(log_p_zero)
array([0.5       , 0.71428571, 0.6       ])

So you can see that the input 0.91629073 is not the logit of the probability even when pinning a value to zero to identify.
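One way to read the pinned inputs is as log ratios against the first category, log(p[k] / p[1]), rather than logits, log(p[k] / (1 - p[k])). The two coincide only when there are exactly two categories. Here is a minimal sketch in plain Python that reuses the probabilities from the example; the variable names are made up for illustration.

```python
import math

p = [0.2, 0.5, 0.3]

# inputs to softmax after pinning the first one to zero
u = [math.log(p_k) - math.log(p[0]) for p_k in p]

# log of the ratio of each probability to the first (reference) category
log_ratio_vs_first = [math.log(p_k / p[0]) for p_k in p]

# logit: log odds of each category against all the others combined
logit_p = [math.log(p_k / (1 - p_k)) for p_k in p]

print(u)
print(log_ratio_vs_first)
print(logit_p)
```

The first two lists agree up to floating point, and the third differs from them in every position.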

P.S. I really miss being able to write math on the blog and really hate that all my old posts with math no longer render. Maybe if Andrew reminds us why it went away, someone will have a suggestion on how to fix.

Applications of (Bayesian) variational inference?

Posted on December 17, 2024 3:00 PM by Bob Carpenter

I’m curious about whether anyone’s using variational inference, and more specifically, using variational approximations to estimate posterior expectations for applied work. And if so, what kinds of reactions have you gotten from readers or reviewers?

I see a lot of talk in papers about how variational inference (VI) scales better than MCMC at the cost of only approximating the posterior. MCMC, which is often characterized as “approximate”, is technically asymptotically exact. In practice, MCMC’s limitation is that it only delivers a few decimal places of accuracy, not that it is biased, at least in cases where MCMC can sample the posterior.

But I don’t recall ever seeing anyone use VI for inference in applied statistics. In particular, I’m curious if there are any Bayesian applications of VI, by which I mean applications where the variational approximation is used to estimate Bayesian posterior expectations in the usual way for an applied statistics problem of interest. That is, I’m wondering if anyone uses a variational approximation q(theta | phi), where phi is fixed as usual, to approximate a Bayesian posterior p(theta | y) and use it to estimate expectations as follows.
E[f(theta) | y] ≈ INTEGRAL f(theta) q(theta | phi) d.theta.

This could be computed with Monte Carlo when it is possible to sample from q(theta | phi).
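As a minimal sketch of that Monte Carlo step, the code below averages f over draws from q. It assumes a univariate normal q(theta | phi) with made-up values phi = (mu, sigma); the function name mc_expectation and all the numbers are hypothetical and stand in for whatever approximation a VI fit would return.

```python
import math
import random

def mc_expectation(f, sample_q, num_draws, rng):
    """Monte Carlo estimate of E_q[f(theta)]: average f over draws from q."""
    total = 0.0
    for _ in range(num_draws):
        total += f(sample_q(rng))
    return total / num_draws

# Hypothetical fitted approximation q(theta | phi) = normal(mu, sigma),
# with phi = (mu, sigma) invented for illustration.
mu, sigma = 1.0, 0.5
sample_q = lambda rng: rng.gauss(mu, sigma)

rng = random.Random(1234)
est_mean = mc_expectation(lambda theta: theta, sample_q, 100_000, rng)
est_second_moment = mc_expectation(lambda theta: theta**2, sample_q, 100_000, rng)

# under q, the exact values are mu = 1.0 and mu^2 + sigma^2 = 1.25
print(est_mean, est_second_moment)
```

The Monte Carlo error here shrinks with the number of draws, but any mismatch between q and the true posterior does not, which is the concern raised above.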

I’m using our Pathfinder variational inference system (now in Stan) to initialize MCMC, but I wouldn’t trust inference based on Pathfinder because of the very restrictive variational family (i.e., multivariate normal with low rank plus diagonal covariance). Similarly, most of the theoretical results I’ve been seeing around VI are for normal approximating families, particularly of the mean field (diagonal covariance) variety. Mean field approximations are easy to manipulate theoretically and computationally, but seem to make poor candidates for predictive inference, where there is often substantial posterior correlation and non-Gaussianity.

I know that there are ML applications to autoencoding that use variational inference, but I’m specifically asking about applied statistics applications that would be published in an applied journal, not a stats methodology or ML journal. I’ve seen some applications of point estimates from VI to “fit” latent Dirichlet allocation (LDA) models, but the ones I’ve seen don’t compute any expectations other than point estimates of parameters from a local mode among combinatorially many modes.

I’m curious about applications using ML techniques like normalizing flows as the variational family. I would expect those to be of more practical interest to applied statisticians than all the VI that has come before. I’ve seen cases where VI outperforms NUTS from Abhinav Agrawal and Justin Domke using a 10-layer deep, 20-ish neuron wide, real non-volume preserving (realNVP) flow touched up with importance sampling—their summary paper’s still under review and Abhinav’s thesis is being revised. But it requires a lot of compute, which isn’t cheap these days. The cases where realNVP outperforms include funnels, multimodal targets, bananas and other varying curvature models (like from an IRT 2PL posterior). I suspect the costs and accessibility of the equivalent of an NVIDIA H100 GPU will drop to a point where everyone will be able to use these methods in 10 years. It’s what I’m spending at least half my time on these days—JAX is fun and ChatGPT can (help) translate Stan programs pretty much line for line into JAX.

Flatiron Institute hiring: postdocs, joint faculty, and permanent research positions

Posted on October 24, 2024 5:00 PM by Bob Carpenter

This is Bob.

We’re hiring

It’s that time of year again and we’re hiring at all levels at the Center for Computational Mathematics (CCM) at Flatiron Institute (the in-house research arm of Simons Foundation). As they are listed, job ads will appear here:

CCM job listings

Our mission

I finally work at a place whose mission aligns with what I actually do.

The mission of the Flatiron Institute is to advance scientific research through computational methods, including data analysis, theory, modeling and simulation.

This place is entirely researcher driven. This is the only top-down directive I have, and I can choose what I work on as long as I stay productive and on mission.

About Flatiron Institute

In addition to CCM, Flatiron has centers devoted to computational biology, computational neuroscience, computational astrophysics, and computational quantum physics (not to be confused with quantum computing). It’s a very problem-rich environment, because all these domain scientists have lots of interesting data (for example, I spent an hour yesterday talking to a CCB researcher about his data involving flies jumping when exposed to light, and we went over how to build a hierarchical Bayesian time series model and I wrote some Stan code to match). Each center is comprised of roughly 60 people, around 30 of whom are full-time research scientists, 20 of whom are postdocs, 5 of whom are software engineers, and 3 of whom are dedicated admin staff.

We also have a dedicated high-performance computing group, the Scientific Compute Core (SCC), and a massive (for academia) compute cluster with hundreds of H100 GPUs and hundreds of thousands of CPU cores.

It’s all on site

We do not hire remotely for full-time jobs. Luckily, our offices are really nice, bright, and well maintained from IT to physical facilities.

P.S. “Flatiron” is the name of the neighborhood—our offices are two blocks south of the Flatiron Building, in one of the nicest areas of Manhattan (a short walk to NYU, Google, and Meta).

About CCM

Our work can broadly be categorized into traditional numerical analysis and scientific computing, computational statistics, and machine learning. In all of these areas, we work on theory and algorithms and release open-source software. All of our research and external funding is paid for by the $5 billion endowment of Simons Foundation; we are not allowed to apply for grants. As an aside, Simons Foundation hands out over $300 million in science funding per year.

We have really great software engineers; in particular, Brian Ward and Steve Bronder work with me on Stan-related projects and with other people on other projects across Flatiron. There’s much more of a career track here for software than in academia, and it’s allowed us to recruit really top-notch people.

We also have a fantastic admin staff, which makes an incredible difference.

Flatiron research fellows (aka postdocs)

We hire postdocs for three years. We’ll be trying to hire around 6 postdocs this year across the areas of CCM interest. I’ll be out of postdocs after this year—Yuling Yao started a faculty job at UT-Austin, Charles Margossian is on the job market, and Sifan Liu (technically visiting faculty) will be starting her faculty job at Duke next year.

In general, we realize that our postdocs can’t strictly follow our mission if they want to get an academic job, because academia still doesn’t value software contributions very much. Nevertheless, I would very much like to recruit someone who’s just as interested in contributing to open-source software as publishing NeurIPS papers. One of our recent postdocs, Wenda Zhou, who did his Ph.D. in stats at Columbia, used his time here to retool as a neural network researcher and ML engineer—he now works at OpenAI.

Joint faculty position with Cooper Union

This year, we’re looking to hire someone for a joint position with Cooper Union, an engineering-focused college in NYC. Here’s the job posting:

Cooper Union Computer Science / Flatiron Institute CCM: joint tenure-track faculty position

It’s 3 years of a 50-50 position, then transitions to “full” time at Cooper Union (scare quotes because academic jobs in the U.S. are typically 9 month jobs). That’ll be good for tenure decisions because academic departments have a hard time with split commitments.

Research scientists

We have a few openings across ML and statistics for permanent research scientist positions. We’re looking for people with at least a few years of experience post-Ph.D., but will consider people just finishing a first postdoc. We can also hire much more senior research scientists like me or Lawrence Saul, our head of ML. I would ideally like to hire someone who has worked on both computational Bayesian statistics and on neural networks.

We’re still tweaking the job ad, but it’ll go out any day now and show up on the list posted above.

ChatGPT o1-preview can code Stan

Posted on October 22, 2024 3:00 PM by Bob Carpenter

This is Bob.

Yes, but can it Stan?

The first few instantiations of ChatGPT haven’t been so good at Stan. This is perhaps not surprising, because there’s relatively little written about Stan on the web compared to, say, Python, C++, or R.

Impressive, whatever you call it

It’s a whole new ball game with the new o1-preview model rolled out by OpenAI. I was tipped off to this fact by a blog post from Keith Goldfeld (of NYU Population Health) on the ouR data generation blog, Can ChatGPT help construct non-trivial statistical models? An example with Bayesian ‘random’ splines. It’s well worth reading.

A lot of our readers will object to the tagline from OpenAI for o1-preview, “uses advanced reasoning.” Comment season is open. Whether you want to call this “reasoning” or “understanding” or pick your own term, the ability of OpenAI’s new o1-preview model is quite remarkable. It’s getting better at generalizing and working from the data it has. The o1-preview model gives you a protocol as it uses what AI researchers call “chain of thought” (yes, I’m teeing up another one for the doubters, to continue with the sports metaphors). You have to sign up and pay and then it’s rate limited, because it’s spending on the order of 15 to 150 seconds to answer a query as it goes through iterative refinement on its output. I’m here to tell you that the wait’s worth it, and that’s also what I’m hearing from people around here who lean on it for complicated code refactors.

Bayesian workflow is universal

I am about to do a tutorial tomorrow at Flatiron Institute on Bayesian workflow. I’d like to help all the scientists here understand how to apply Bayesian workflow using SBC, PPC, LOO, etc., no matter how they fit their models (Monte Carlo methods, variational inference, amortized inference, simulation-based inference, etc.).

Galileo’s inclined plane experiment

I thought it would help the presentation to have a concrete science example. I figured Galileo’s inclined plane experiment as a way to estimate the acceleration due to gravity would be fun. Galileo set up a ball on an inclined plane, then measured how long it took to move a given distance. With that experimental data, we can fast forward to Newton and set up a statistical model where we use Galileo’s data to estimate the gravitational acceleration (on earth).

ChatGPT o1-preview cracks physics-based Stan

So I asked ChatGPT o1-preview to derive the physics for me, then to write the Stan model. I’d already written the Stan model at that point—it’s one thing I can still do better than GPT. But the one it wrote looks good. I went and checked some online physics tutorials as I can’t remember my high school physics well enough to derive it myself. It decided to use the inertial formula for a solid sphere with no friction. It got all the algebra right as far as I can tell. This is much better than earlier versions of ChatGPT did at algebra.
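For reference, here is one way to arrive at the timing formula t = sqrt(14 s^2 / (5 g h)) used in the code in this post. This is a sketch under an assumption that is not spelled out in the transcript summary: a solid sphere rolling from rest without slipping, which gives an acceleration along the incline of (5/7) g sin(theta), with sin(theta) = h / s. The function names are invented for illustration.

```python
import math

def roll_time(s, h, g=9.81):
    """Time for a solid sphere to roll distance s down an incline with drop h.

    Assumes rolling without slipping from rest, so the acceleration along
    the incline is (5/7) * g * sin(theta), with sin(theta) = h / s.
    """
    accel = (5.0 / 7.0) * g * (h / s)
    return math.sqrt(2.0 * s / accel)   # from s = accel * t^2 / 2

def closed_form_time(s, h, g=9.81):
    """The same quantity written as sqrt(14 s^2 / (5 g h))."""
    return math.sqrt(14.0 * s**2 / (5.0 * g * h))

def g_from_time(t, s, h):
    """Invert the closed form to recover g from a noise-free time."""
    return 14.0 * s**2 / (5.0 * h * t**2)

s = math.hypot(5.0, 2.5)   # same length and height as the simulation
h = 2.5
t = roll_time(s, h)
print(t, closed_form_time(s, h), g_from_time(t, s, h))
```

The two time calculations agree, and inverting the closed form recovers the value of g that went in, which is the relationship the statistical model exploits.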

Show, don’t tell

Here’s the transcript from my session in case you’d like to dive deeper on how I’m prompting it:
ChatGPT o1-preview simulates Galileo’s inclined plane.

Keith Goldfeld’s blog post also goes over in detail how he used it to generate code for splines.

Just the code

Here’s the Python simulation, inference, and plotting code that I put together from the GPT output.

import numpy as np

def simulate_times(N, sigma, length, height):
    g = 9.81  # gravitational acceleration in m/s^2
    s = np.sqrt(length**2 + height**2)
    h = height
    t = np.sqrt((14 * s**2) / (5 * g * h))
    t_obs = np.random.lognormal(mean=np.log(t), sigma=sigma, size=N)
    return {
        'length': length, 'height': height,
        'N': N, 't_obs': t_obs
    }

data = simulate_times(N=100, sigma=0.1, length=5, height=2.5)

import cmdstanpy as csp
m = csp.CmdStanModel(stan_file='galileo.stan')
draws = m.sample(data = data)
print(draws.summary())

import pandas as pd
import plotnine as pn

df = pd.DataFrame({'g': draws.stan_variable('g'),
                   'sigma': draws.stan_variable('sigma')})
plot = (pn.ggplot(df, pn.aes(x='g', y='sigma'))
            + pn.geom_point()
            + pn.theme(aspect_ratio=1)
            + pn.geom_hline(yintercept=0.10, color='blue', size=1)
            + pn.geom_vline(xintercept=9.81, color='red', size=1)
            )
plot.show()

Here’s the Stan program that I wrote—GPT’s is pretty darn close, but my priors are tighter. I chose lognormal error just to keep everything positive, not because multiplicative (proportional) error is right for measuring time. I suspect the errors are more naturally additive based on the precision of the measuring device.

data {
  real length, height; 
  int N;  vector[N] t_obs;
}
transformed data {
  real s = sqrt(length^2 + height^2);
  real h = height;
}
parameters {
  real<lower=0> g;      // gravity accel in m/s^2
  real<lower=0> sigma;  // measurement error scale
}
model {
  real log_t_true = 0.5 * (log(14) + 2 * log(s) - log(5) - log(g) - log(h));
  t_obs ~ lognormal(log_t_true, sigma);
  sigma ~ exponential(1);
  g ~ lognormal(log(10), 0.25);
}

I threw in tighter priors than GPT—these are weakly informative. A more general measurement error model would allow for lack of calibration (i.e., measurement bias), and use a constant offset as well as an error term.
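To make the last point concrete, here is a hedged sketch of how data from such a model could be simulated, using additive normal noise plus a constant offset instead of lognormal error. The function name simulate_times_additive, the offset parameter, and the numeric values are all made up for illustration. It uses only the standard library and returns t_obs as a plain list.

```python
import math
import random

def simulate_times_additive(N, sigma, offset, length, height, g=9.81, seed=0):
    """Simulate timings with additive normal noise and a constant offset.

    offset models a miscalibrated timer (measurement bias) and sigma is
    the noise scale, both in seconds. All names here are illustrative.
    """
    rng = random.Random(seed)
    s = math.sqrt(length**2 + height**2)
    t_true = math.sqrt(14 * s**2 / (5 * g * height))
    t_obs = [t_true + offset + rng.gauss(0.0, sigma) for _ in range(N)]
    return {'length': length, 'height': height, 'N': N, 't_obs': t_obs}

data_additive = simulate_times_additive(N=100, sigma=0.05, offset=0.1,
                                        length=5, height=2.5)
```

With a single incline configuration, the offset and g both act only through the mean observed time, so the data alone cannot separate them. Fitting a model like this would need timings from several lengths or heights, or an informative prior on the offset.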

Where have all the count words gone? In defense of “fewer” and “among”

Posted on October 13, 2024 3:00 PM by Bob Carpenter

This is cranky linguist Bob.

The lack of count markers is starting to bug me. To wit…

Usage of “fewer” vs. “less”

The prescriptive rule in English is that “fewer” applies to groups of countable objects whereas “less” applies to uncountable masses. I’d say “many dogs” rather than “much dogs” because dogs are countable. Similarly, I’d say “fewer dogs” not “less dogs.” You can say “less water” but not “fewer water”, though you can say “fewer bottles of water” if you’re willing to introduce a partitive.

Usage of “among” vs. “between”

A similar prescriptive rule is that one uses “between” for two things and “among” for more than two. So it’s “between you and me” but “among the three of us”.

Kids these days

I’ve noticed that “fewer” and “among” are being used infrequently these days, at least in spoken language and language written by scientists. This is understandable as languages tend to evolve toward regularity. It’s just that I used to work on count nouns when I was studying natural language semantics and internalized the rules. So now I find it jarring to hear collocations that jumble the mass/count distinction. This must be how Francophones feel when a non-native speaker confuses the gender of a noun. I can’t stop my internal checksum from flashing an error code!

References

Merriam-Webster. 2024. ‘between’ vs. ‘among’
Merriam-Webster. 2024. ‘fewer’ vs. ‘less’
Bob Carpenter. 1994. Distribution, collection, and quantification: A type-logical account of plurality. Laboratory for Computational Linguistics Technical Report. Carnegie Mellon University.

Yes, I’m human

The blog’s asked me so many times when writing this post that I feel I need to share.

Defining statistical models in JAX?

Posted on October 8, 2024 3:00 PM by Bob Carpenter

This is Bob. And I’d like to know the best way for us to code a bunch of models in JAX to use to evaluate parallel algorithms including normalizing flows. I’m going to dump out my current thinking, but I’m really hoping to get feedback from experts on the best way to do this without starting a flame war in the comments.

Why not Stan? Ask Elizaveta!

The bottom line is that in order to evaluate the parallel algorithms we’re considering, we need fast parallel execution in-kernel on the GPU. Stan has some ability to offload compute to GPU, but not to the extent that we can parallelize entire model evaluations.

Elizaveta Semenova’s words at StanCon are still ringing in my ears—she started her live interview with Alex Andorra by saying, “I don’t use Stan any more.”

Why JAX?

Elizaveta needed to integrate neural networks for the Bayesian optimization she’s doing and for that turned to JAX. (The interview with Elizaveta and Chris Wymant will soon be up on Alex’s podcast, Learn Bayes Stats, along with the interview of Brian Ward and Mitzi Morris in another segment that also took place live at StanCon—the podcast is a ton of fun and both Andrew and I have done interviews).

The real reason for choosing JAX isn’t that all the cool kids are using it (though everyone I know on the CS side has pretty much switched to JAX, including my own personal bellwether, Matt Hoffman). JAX is beautifully compositional in the same way as Unix. I suppose we could’ve used PyTorch, but JAX just feels much more natural to a computer scientist like me. I just love the way it can compose JIT and autodiff to enable massively parallel differentiable programs. There are really two applications I have in mind: normalizing flows (the main topic of this post) and parallelized MCMC of the form Matt Hoffman’s been propounding lately (Charles Margossian, a former Ph.D. student of Andrew’s and one of our postdocs here, did an internship with Matt at Google working out how to do R-hat in a massively parallel setting with 1000+ chains that communicate with each other to accelerate convergence, after which a single draw is taken from each in the limiting case).

Normalizing flows

I think there is a good chance that normalizing flow-based variational inference will displace MCMC as the go-to method for Bayesian posterior inference as soon as everyone gets access to good GPUs. I’ve been looking into normalizing flows with Gilad Turok, Sifan Liu, Justin Domke, and Abhinav Agrawal. Justin visited Flatiron for five months and during that time, we didn’t manage to program a distribution in JAX that his and Abhinav’s take on realNVP, as coded in the repo vistan, couldn’t fit well. They’re busy writing up a more extensive evaluation in a follow-up paper and the results only look better. Gilad was able to port their vistan code to Blackjax and replicate all their results on our clusters here—he’ll be submitting a PR to Blackjax soon.

My thinking on normalizing flows was inspired by the last model we fit with Justin—a centered parameterization of a hierarchical IRT 2PL model with around 1000 total parameters (this is a nice example due to additive non-identifiability, multiplicative non-identifiability, and funnels from the hierarchical priors). With this parameterization, Stan struggles to the point where I’d say it can’t really fit the model. Justin and Abhinav’s RealNVP fit it quite well—much better than Stan managed. It just took a massive number of flops on a state-of-the-art GPU. One of the things Justin and Abhinav’s approach to flows relies on for convergence is a massive number of evaluations of the log density and gradients for computing the approximate KL-divergence stochastic gradient (i.e., the ELBO). So we needed to code the models in JAX to run entirely on the GPU. So I’m looking for an easier way to do this.
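To show where all those log density evaluations come from, here is a toy sketch of a Monte Carlo ELBO estimate. It is not Justin and Abhinav’s method: the variational family is a univariate normal rather than a realNVP flow, the target is a made-up unnormalized standard normal, and it is written in plain Python rather than JAX. It estimates only the ELBO value; in JAX the gradient would come from autodiff through the same computation.

```python
import math
import random

def log_normal_density(x, mu, sigma):
    """Log density of normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def elbo_estimate(log_p, mu, sigma, num_draws, rng):
    """Monte Carlo ELBO: average of log p(theta) - log q(theta) over draws from q.

    log_p is an unnormalized target log density and q is normal(mu, sigma).
    Every draw costs one evaluation of log_p.
    """
    total = 0.0
    for _ in range(num_draws):
        theta = mu + sigma * rng.gauss(0.0, 1.0)   # reparameterized draw from q
        total += log_p(theta) - log_normal_density(theta, mu, sigma)
    return total / num_draws

# Toy unnormalized target: standard normal without its normalizing constant.
log_p = lambda theta: -0.5 * theta * theta

rng = random.Random(42)
elbo_matching = elbo_estimate(log_p, 0.0, 1.0, 1000, rng)   # q matches the target
elbo_poor_fit = elbo_estimate(log_p, 1.5, 0.3, 1000, rng)   # q is off
print(elbo_matching, elbo_poor_fit)
```

When q matches the target, every term equals the log normalizing constant, 0.5 * log(2 * pi), so the estimate has no Monte Carlo noise. A worse q lowers the ELBO by its KL divergence from the target. Each optimization step needs a fresh batch of such draws, which is why running the log density entirely on the GPU matters.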

Workflow in JAX

Colin Carroll (Google employee, PyMC dev) just presented a talk about Bayes and JAX at PyData Vermont. He covers the whole workflow in JAX and talks about his bayeux repository. Colin talks about Adrian Seyboldt’s new nutpie sampler in Rust, which Adrian also just presented at StanCon. There’s no write-up, but we’re looking into reverse engineering the Rust into C++ for Stan; it works quite well. Adrian’s agreed to come out and give a talk here at Flatiron on his sampler in the new year. But that’s a different topic.

For now, I want to do a lot more evaluations of Justin and Abhinav’s take on realNVP, and we’re trying to figure out how to code up a couple dozen models in JAX. There are many possibilities.

PyMC

PyMC can produce JAX output. The PyMC devs just did a little hackathon and created about ten pull requests in the posteriordb repository for PyMC implementations.

with pm.Model() as hierarchical:
    eta = pm.Normal("eta", 0, 1, shape=J)
    mu = pm.Normal("mu", 0, sigma=10)
    tau = pm.HalfNormal("tau", 10)
    theta = pm.Deterministic("theta", mu + tau * eta)
    obs = pm.Normal("obs", theta, sigma=sigma, observed=y)

All of the approaches in Python wind up having to name variables and then provide string-based names. I don’t know if the sigma=sigma thing is necessary for the scale parameter. I like that the distributions are vectorized here. It’s too bad that there’s an observed= in the data models—I think that means the models as defined aren’t as flexible as the BUGS models in terms of specifying what’s data at run time. At the same time, Thomas Wiecki was telling me you could use NaN to code the equivalent of R’s NA and do inference, so I think that observed value can have missingness.

Not all of the PyMC models look so much like a graphical model.

NumPyro

NumPyro is the version of Pyro that generates JAX on the back end. NumPyro looks like BUGS (or Turing.jl), which is not necessarily a bad thing. Here’s the NumPyro version of Andrew’s favorite example model, eight schools (the arguments to the top-level function are the data):

def eight_schools(J, sigma, y=None):
    mu = numpyro.sample('mu', dist.Normal(0, 5))
    tau = numpyro.sample('tau', dist.HalfCauchy(5))
    with numpyro.plate('J', J):
        theta = numpyro.sample('theta', dist.Normal(mu, tau))
        numpyro.sample('obs', dist.Normal(theta, sigma), obs=y)

pangolin

pangolin can produce JAX output. This is an “early-stage probabilistic inference project” rather than a longstanding embedded PPL like PyMC or NumPyro. Specifically, it’s a graphical modeling language that looks a lot like the others, and it has back ends for Stan, JAGS, and JAX. It’s very experimental and a work in progress, but the models look nice. Python doesn’t let you use ~ as a binary operator (it’s the unary bitwise complement), so sampling statements are written as assignments. Here it’s not so clear that y and stddevs are the data.

mu = pg.normal(0,10)                                             # μ ~ normal(0,10)
tau = pg.exp(pg.normal(5,1))                                     # τ ~ lognormal(5,1)
theta = [pg.normal(mu,tau) for i in range(num_schools)]          # θ[i] ~ normal(μ,τ)
y = [pg.normal(theta[i],stddevs[i]) for i in range(num_schools)] # y[i] ~ normal(θ[i],stddevs[i])

No names here, but they have to get introduced later if you want to do I/O. The doc also makes it clear how things line up. Unlike the other approaches, this uses standard Python comprehensions, which I don’t think are super efficient in JAX judging from the JAX doc I’ve read. But I think there are lots of ways to code in pangolin. The problem is when you release “Hello, World!” code, people read it as what your project does rather than as a simple example.

postjax

We can just code models in JAX. Bernardo Williams (Ph.D. student at U. Helsinki) just coded a bunch of models directly in JAX in his GitHub postjax. I couldn’t find eight schools, but here’s a simple logistic regression model as a class with a method defined as follows.

def logp(self, theta):
    sqrt_alpha = jnp.sqrt(self.alpha_var)
    data = self.data
    X = data["X"]
    y = data["y"]
    assert len(theta) == self.D
    return jnp.sum(jss.norm.logpdf(theta, 0.0, sqrt_alpha)) + jnp.sum(
        jss.bernoulli.logpmf(y, sigmoid(jnp.dot(X, theta)))
    )

The variable self.alpha_var is set as data in the constructor as is the data dictionary data. I’d have been tempted to put alpha_var into the data input.

Other options?

I’d really like to hear about other options for coding statistical models in JAX.

Straight to XLA?

Both JAX and TensorFlow run by compilation down to XLA (stands for “accelerated linear algebra”). Matthijs Vákár, who coded a lot of the Stan parser and code generator, is working on autodiff down at that level. That may be a good eventual target for a compiler, but it’s a lot easier to start in JAX. Similarly, we could have targeted LLVM with Stan rather than C++, but we rely on so much pre-existing C++ infrastructure that it would have been challenging. Likewise, I think coding directly at the XLA level would be painful at this stage, not that I’ve ever tried it or even know what it looks like. I just know we’re going to need a lot more than linear algebra.

Stan

For comparison, I really wish we could just use Stan. Here’s what eight schools looks like in Stan. This includes all the data declarations that were implicit in the other programs (which used either a closure or function argument to capture data directly).

data {
  int J;
  vector[J] y;
  vector[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta;
}
model {
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
  mu ~ normal(0, 5);
}

I’m thinking the way I would code something that follows Stan’s execution logic in JAX directly would be something like this:

import jax.numpy as jnp
from jax import vmap
from jax.scipy.stats import expon, norm

class LinearRegression:
    def __init__(self, data):
        self._data = data   # dict with arrays 'x' and 'y'

    def num_params_unc(self):
        return 3

    def log_density(self, params_unc):
        # deserialize the unconstrained vector, accumulating the log Jacobian
        r = Reader(params_unc)
        alpha = r.real()
        beta = r.real()
        sigma = r.real_lb(0)
        log_jacobian = r.log_jacobian()

        log_prior = 0
        log_prior += norm.logpdf(alpha, 0, 1)
        log_prior += norm.logpdf(beta, 0, 1)
        log_prior += expon.logpdf(sigma)   # exponential(1)

        log_likelihood_fun = lambda x, y: norm.logpdf(y, alpha + beta * x, sigma)
        log_likelihood = jnp.sum(vmap(log_likelihood_fun)(self._data['x'], self._data['y']))

        return log_jacobian + log_prior + log_likelihood

where I’m relying on a Reader class that follows the reader I first coded for Stan in order to define the log density over a vector. It’s really a deserializer. I’m wondering if I can just lean more on the pytree construct in JAX to simplify my interfaces, but I’m just getting started with JAX myself.

import jax.numpy as jnp

class Reader:
    def __init__(self, params):
        self._params = params
        self._lj = 0     # accumulated log Jacobian determinant
        self._next = 0   # index of the next unread parameter

    def real(self):
        x = self._params[self._next]
        self._next += 1
        return x

    def real_lb(self, lb):
        x_unc = self.real()
        self._lj += x_unc   # log |d/dx_unc (lb + exp(x_unc))| = x_unc
        return lb + jnp.exp(x_unc)

    # ... other constraining transforms ...

    def log_jacobian(self):
        return self._lj

Statistical Modeling, Causal Inference, and Social Science

Author Archives: Bob Carpenter
