Aki looking for a doctoral student to develop Bayesian workflow

I (Aki) am looking for a doctoral student with Bayesian background to work on Bayesian workflow and cross-validation (see my publication list for my recent work) at Aalto University, Finland (the world’s happiest country). You would also collaborate with Andrew and Stan and ArviZ developers. You can apply through the ELLIS PhD program (dl October 31)

The Dodgers are hiring

Brendan Cooley writes:

I’m an analyst with the Los Angeles Dodgers. We are looking for budding Bayesian statisticians and deep learners that are interested in joining us for an internship next summer. We have a soft spot for jax and numpyro but Stan and PyMC folks are obviously always of interest. Job description is here.

Students can get in touch at [email protected] with any questions.

It looks like the Dodgers might be facing the Mets in the first round of the playoffs this year. So, if you do take this job, please do me a favor and hold off for a couple weeks on giving the Dodgers any useful ideas!

The Miami Marlins are hiring

This post is from Daniel.

Bryant Davis at the Miami Marlins forwarded me a job posting. They’re looking for someone Bayesian to join them!


The Miami Marlins Baseball Research and Baseball Solutions departments are seeking entry- and senior-level Data Scientists and Baseball Analysts. In particular, they are seeking individuals who’ve worked in probabilistic programming languages – anyone with experience in Stan or PyMC who has a passion for working in sports is encouraged to apply! For more details, check out the job posting here.

Show, don’t tell: ChatGPT 5 marginalizing Gelman’s measurement error model in Stan

This post is from Bob

Even though the post is from me, the story is partly about Andrew, but mostly about how LLMs are getting better at math. Rather than telling everyone about LLMs and having a theoretical discussion, I thought it’d be useful to show you another example of the kind of interaction with LLMs that I find super useful.

The backstory

Andrew frequently mentions that he’s impatient with Stan’s default sampling times and would like something to give him a rough answer faster. So I and several colleagues have been working on finding faster adaptation (for now, I would recommend Adrian Seyboldt’s fast adapting Nutpie sampler, which can already be run with models coded in Stan or PyMC). That means evaluation. So I finally asked Andrew for an example.

Gelman’s measurement error model

Here’s the measurement error model that Andrew gave me:

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
  vector[N] x;
}
model {
  x ~ normal(mu_x, sigma_x);
  y ~ normal(a + b*x, sigma);
  x_star ~ normal(x, sigma_x_star);
}

As an aside, I love that there’s an appropriate lower bound on N.

This is a classic measurement error model—if you’re not familiar, there’s a short introduction to exactly this flavor of model in the Stan User’s Guide introduction to measurement error models.

Bob’s revised model

This looks like a BDA type model—there are improper flat priors. And the latent true values of the covariates x have a centered parameterization. Both of these can cause problem fitting, so I swithced to a non-centered parameterization using offset and multiplers, then added some weakly informative priors for all the parameters.

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
  vector[N] x;
}
model {
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  mu_x ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
  sigma_x ~ lognormal(0, 0.5);
  
  x ~ normal(mu_x, sigma_x);
  y ~ normal(a + b * x, sigma);
  x_star ~ normal(x, sigma_x_star);
}

It fits better than Andrew’s but still not well. With the default 1000 warmup and 1000 sampling iterations on four chains, there were still R-hat values above 1.2. If we crank that up to an order of magnitude more iterations, everything looks OK, but it still has an integrated autocorrelation time of around 100 (meaning ESS is 1/100 times the sample size).

GPT 5’s revised model

Looking at the model, I thought like it looked like a prime candidate for marginalizing out the x parameter—it has a normal prior and it’s used in a normal likelihood. But I’m terrible at this kind of math and there weren’t any math stats folks around like Philip Greengard, so I turned to GPT 5, which had already done some math for me I couldn’t do myself (calculating the second and fourth moments of Neal’s funnel). Here’s my interaction with GPT 5.0 set to “auto”. GPT 3.5 could barely write a syntactically correct Stan program for linear regression, so let me just say it’s come a long way since then.

[Bob] I have the following Stan program for measurement error in x and I’m wondering if there is some way to marginalize out some of the variables to make this sample more efficiently. I figured since everything is normal there would be some chance of this. In particular, I would like to marginalize the x out if possible.

```stan
[code from above]
```

I would be OK slightly modifying the priors on the sigma to be inverse gammas on sigma^2 if that would help grease the wheels of the marginalization.

And lo and behold, the genie returned a correctly marginalized Stan program:

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
}
transformed parameters {
  real inv_var_x  = inv_square(sigma_x);
  real inv_var_xs = inv_square(sigma_x_star);
  real tilde_v    = 1.0 / (inv_var_x + inv_var_xs);              // Var(x | x*)
  real<lower=0> sd_xstar = sqrt(square(sigma_x) + square(sigma_x_star));
  real<lower=0> sd_y_cond = sqrt(square(sigma) + square(b) * tilde_v);
  vector[N] tilde_mu = tilde_v * ( inv_var_x * rep_vector(mu_x, N)
                                 + inv_var_xs * x_star );
}
model {
  // hyperpriors (use whatever you prefer; half-normal/lognormal work fine)
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  mu_x ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
  sigma_x ~ lognormal(0, 0.5);

  // marginalized likelihood
  x_star ~ normal(mu_x, sd_xstar);                 // p(x*)
  y ~ normal(a + b * tilde_mu, sd_y_cond);         // p(y | x*)
}

Followed by this advice:

  • [GPT5] Integrating out μx as well is possible with a normal prior, but it induces dense cross-observation covariance (rank-1 update) and is typically worse for HMC/NUTS. Keeping μx as a parameter is usually better.

Did it work?

Yes! Not only did it work, it worked on the first try. And the results matched what I got from a much longer run of the full model, but with a great ESS.

GPT, can you explain to our users at home how you did that?

I was thinking this kind of operation would be useful to include in the efficiency chapter of the Stan User’s Guide. So I asked the genie to explain itself.

[Bob] This is great and it worked exactly as advertised and had an effective sample size for estimating parameter means of about half the total number of sampling iterations and agreed with a long run of the original model. Now I would like to write this up as an example for our users in the efficiency chapter of the [*Stan User’s Guide*](https://mc-stan.org/docs/stan-users-guide/efficiency-tuning.html). So I want an explanation pitched for a similar audience. Given that we’re going with the factored conditional form, could you go into a bit more detail about how you derived the forms of the four variables you have, the constants $\tilde{v}$ and $\mu_i$, and the distributions for $\x_i^\star$ and $y_i \mid x_i^\star$? You can break it down into known properties of normals, though please remind me of what these are because I’m not very good with classical math stats manipulations of exponential families, nor will many of our users be.

And, of course, with an LLM, Bob’s your uncle (from the French idiom et voilà):

Of course, I will have to double check all the math until I understand it thoroughly, then convert to the Stan User’s Guide style.

They’re looking for businesses that want to use their Bayesian inference software, I think?

Kirill Parinov writes:

I recently co-founded Boston Bayes, a Dutch venture studio, together with the founders who created RxInfer — a powerful open-source probabilistic programming framework enabling organizations to make data-driven decisions under uncertainty. https://lazydynamics.com/

Boston Bayes is now seeking independent founders (for CEO positions) to launch new ventures powered by this technology — startups that turn uncertainty into a competitive edge through real-time probabilistic inference.

Boston Bayes will provide the tools, deep tech, and ongoing support needed to build the future of probabilistic AI, while the founders maintain independence and ownership of their ventures. You can read more here: https://bostonbayes.com

Do you happen to know anyone in your network who might be interested in founding a spin-off with Boston Bayes’ support?

All of this is baffling to me, starting with that I don’t know what a “venture studio” is, also it’s named Boston but it’s in the Netherlands, also I don’t understand what it means to be seeking “independent founders for CEO positions,” also I’m not quite sure what is the real-time probabilistic inference they are talking about. Also I don’t get what’s up with RxInfer, but Bayesian inference is cool, and anything we put in Stan and our workflow book and our research articles is open-source, so anyone is free to use these ideas in whatever computer program they’re writing.

In any case, maybe someone reading this post is doing Bayesian workflow in the business world and would find this to be useful?

Uber could use your statistical analysis.

Johannes Hallermeier writes:

After CU, I went on to work at Uber, as an applied scientist on the policy research team.
I’m reaching out because Uber is hiring for a number of interesting applied science & engineering roles across policy, marketing and marketplace.
—————————————-
Applied science / engineering roles at Uber
– two roles on policy research (#1, #3)
– two roles on marketing applied science (#2, #4, plus a third role coming online soon)
– several roles on marketplace (pricing, matching, incentives, etc., see rest of list below)
before applying, please ping [email protected], including your CV and preferred role (for questions and/or referrals)

Applied Scientist II, Policy & Consumer Research

Applied Scientist II, Brand Marketing

Applied Scientist II, Earnings Policy

Applied Scientist II, Marketing Applied Scientist II

Machine Learning Engineer

Optimization Engineer (Operations Research)

Backend Engineer, Rider Pricing & Incentives

Senior Optimization Engineer

Senior Software Engineer, Rider Pricing and Incentives

Senior Software Engineer, Dynamic Pricing

Staff Machine Learning Engineer, Causal Inference

Staff Software Engineer

Staff Software Engineer, Rider Pricing Platform

Staff Machine Learning Engineer, Pricing and Incentives

Staff Machine Learning Engineer, Dynamic Pricing

Staff Optimization Engineer, Dynamic Pricing

Senior Staff Machine Learning Engineer, Marketplace Pricing & Incentives

Science Manager – Dynamic Pricing

Manager, Science

Senior Staff Engineer – Marketplace Competitive Intelligence

Scientist II – Competitive Intelligence

Sr. Scientist – Competitive Intelligence

Sr. Scientist – Competitive Intelligence

Scientist II, Pricing and Incentives

Senior Scientist, Pricing and Incentives

I guess maybe if you interview for a job there, they’ll pick you up directly from the airport?

In any case, I can only assume that expertise in Bayes and Stan will come in handy.

(1) Fitting hierarchical models in genetics, (2) A Stan model that runs faster with 400,000 latent parameters, (3) Super-scalable penalized maximum likelihood inference for biome problems, (4) “In the end, I basically gave up working on biology because of the politics.”

Our recent post elicited this comment from Bob that was so detailed I thought it deserved its own post.

This one comment from Bob has enough material for 4 standalone posts:

1. Fitting hierarchical models in genetics: why the full Bayesian approach gives the right answer, whereas an existing shortcut method using approximations and likelihood ratio tests does not.

2. Stan’s HMC scales well in high dimension, but not so well in bad geometry, so for this problem it was better to use a Poisson/gamma model with 400,000 latent parameters than a much lower dimensional but computationally awkward negative binomial model. (Recall the dictum that mixture models all have computational problems and that all computational problems are essentially mixture models.)

3. “Super-scalable penalized maximum likelihood inference for biome problems with some collaborators in optimization. With 150K biomes averaging 3 megabases each and arbitrary amounts of data, it fits in about 20m on a couple decent servers (doesn’t need GPU).”

4. The struggles of interdisciplinary, or interlab, collaboration within certain subfields of biology.

Bob’s summary of that last point: “It’s trivial to rewrite a statistically sound, fully Bayesian version of DESeq2, but biologists are too conservative to use new tools, so I haven’t even tried to fix this. In the end, I basically gave up working on biology because of the politics.”

And here’s Bob’s comment in full:

There are already hierarchical models in wide use in genomics. For example, rMATS is a system for inferring splice variants. The statistical computation in that package is a mess, which led them to the ridiculous conclusion that fewer variants from a population with deeper sequencing is better for estimating the population values than more variants with less deep sequencing (basic stats will show you the latter is better if your goal is estimating the population). I rewrote it all as a Bayesian model. It gets the right answer statistically, namely that it’s better to have more samples at lower sequence depth, unlike rMATS. The problem with rMATS is their Laplace approximation for max marginal likelihood and then their likelihood ratio tests, which together give them the wrong results statistically, even for their model.

Negative binomials induce really bad geometry, as does just about anything that gives you multiple ways to explain the same data (i.e., either higher mean or higher dispersion can explain a large value). So I reparameterized in two ways. First, I coded the differential expression with a gamma-Poisson rather than negative binomial, which with 20 sequencing runs over 20K splice variants led to 400K latent parameters. Second, I use an overall mean plus difference parameterization on Gelman’s advice. Stan’s HMC scales well in high dimension, but not so well in bad geometry, so fitting is OK.

For political reasons, I haven’t been able to publish. As an example, my original collaborator’s Ph.D. supervisor literally won’t let her work on anything other than his project and the person whose lab she rotated through is only interested if it produces “breakthrough biology” we can publish in Nature, which is not something I can deliver. I haven’t been able to find anyone who can help me evaluate this and write it up in a way that the biologists would be OK with. I’d be happy to share the code. I even hired an intern specifically to work on this, but the intern showed up and refused to work on the problem!

More recently, I worked on super-scalable penalized maximum likelihood inference for biome problems with some collaborators in optimization. With 150K biomes averaging 3 megabases each and arbitrary amounts of data, it fits in about 20m on a couple decent servers (doesn’t need GPU). It’s like Rob Patro et al.’s salmon, but I figured out how to do all the bias adjustments for things like hexamer and positional bias (just push the uncertainty through Bayes style—not hard), which is why Rob told me that he abandoned the approach. I’ve had the same problem getting a biologist to help evaluate and write up. Even my collaborators on the autism project where I have a ton of gut biome data from Simons Foundation’s Autism Research Initiative (SFARI) and we did publish a paper, but they say they have zero time to do this. Everyone says they know someone, but when I contact them, they’re too busy. If I could come up with half a year funding for a postdoc, they’d loan me one (they really do assign their postdocs work—it’s not at all like CS and stats). It’s amazing to me how constrained biologists are in working with other people. The code we produced was really great, because Robert Gower (an optimization specialist) helped add a mirror descent optimization algorithm, which is about an order of magnitude faster than the L-BFGS default in Stan. And Robert Blackwell (it’s the 3 Roberts project), a high-performace compute specialist, added some high performance sparse matrix computation in C++ and figured out how to scale it out on a (CPU) cluster. What’s cool is that time doesn’t depend on the data size other than the really fast conversion to k-mer counts.

The autism gut biome paper for which I helped with the Stan modeling is Multi-level analysis of the gut–brain axis shows autism spectrum disorder-associated molecular and microbial profiles. This project was led by Jamie Morton, who was an amazing postdoc at Flatiron. Jamie wrote a really nice overview of how to think about compositional analyses like differential expression in Establishing microbial composition measurement standards with reference frames. But he has zero time to help build a state-of-the-art differential expression tool.

I would suggest looking at DESeq2 to see what a mess current differential expression inference is. It’s pretty easy to see why none of their p-values are well calibrated. It’s trivial to rewrite a statistically sound, fully Bayesian version of DESeq2, but biologists are too conservative to use new tools, so I haven’t even tried to fix this.

In the end, I basically gave up working on biology because of the politics.

Using hierarchical modeling to get more stable rankings of gene expression

Will Macnair writes:

I am a computational biologist (with a maths/stats background) working with genomic data. One task that comes up often is estimating changes in expression level between two conditions, for many thousands of genes, then ranking the genes in order of some measure of “confidence.”

My question is: In a Bayesian regression setting, what is a good way to rank the genes? If I calculate Bayes factors for “full model” vs “covariates-only model” and use these to rank the genes, is this ok? Are there better options?

To give a bit more detail, this task is typically referred to as “differential expression”, and we usually do this for all genes that pass some filter on very low / very boring expression. In the frequentist setting, the ranking is done with an adjusted p-value, and there are various popular methods for doing this (e.g. edgeR and DESeq2). For some recent analysis I have been using rstanarm, and I’ve been wondering what would be the best way to achieve a confidence-type ranking in this setting. The reason we do ranking is that we might do follow-up experiments for some genes, however the experiments are a lot of work. So we would like to prioritise genes where we have high confidence that something interesting is going on.

Some other thoughts:

– I searched previous blog entries and found this one. There is some nice discussion in the comments, in particular the thread started by Bob Carpenter where he talks about framing it as ranking. If I’ve missed something more recent, please point me to that!

– There are also some comments from Daniel Lakeland on how there has been a lot of work on the FDR approach, despite it being a bad framing of the problem (i.e. “effect yes/no” rather than “ranking under cost constraints”). The blog was posted in 2016, and the approach taken in the field has not really changed much since then, so your input would be valuable.

– In the blog post, and also in BDA3, you say that you don’t really like Bayes factors, as having a lump of probability at 0 doesn’t really make any sense. I agree with that, however in this case I don’t really want to make statements about the value of the parameter, just how confident we are that something is going on. Could Bayes factors here be ok?

– I wonder if an alternative approach would be to decide on a minimum interesting effect size, then ask which genes have the highest posterior probability of an effect size larger than this.

– I think elsewhere you have suggested addressing multiple testing-type problems by putting everything into one large hierarchical model. I have over 10k genes, and a few hundred samples, so this doesn’t seem practical (at least not for me!).

– In my analysis, I have calculated Bayes factors (for “full model” vs “only covariates”), but also tried ranking by mean(case_vs_control) / sd(case_vs_control). The Bayes factors seem to have some advantages, e.g. they are especially low for genes that are known to have strong sex effects, and sex is one of the covariates.

My reply: I know very little about genomics–ok, actually I know nothing at all about genomics, although I’ve had the occasional conversation of this sort where it seems that I’ve been able to offer helpful advice. So, rather than attempt any sort of solution, I will just share some scattered thoughts:

1. Why do you want to rank the genes? What are you going to do with the ranking?

My general thought is that, even if you’re doing ranking, there are lots of ways to do this, and it’s hard for me to think of a realistic example in which tail-area probabilities are the right way to do the ranking–see discussion here–except in some very simple symmetric problems where all analyses lead to the same result.

2. My problem with the Bayes factor is not so much with the lump of probability at zero, but rather with the dependence on the prior distribution for the unconstrained parameters in the model. Take K well-identified parameters on unit scale and change their weak prior from independent normal(0,100) to independent normal(0,1000), and you’ve decreased the model’s Bayes factor by a factor of 10^K without changing the inferences conditional on the model. We discuss this sort of thing further in Chapter 7 of BDA3 (chapter 6 of the earlier editions). When comparing models, you can establish conventions with weak priors so as to obtain stable Bayes factors, but that’s what these are–conventions–and there’s reason to expect the resulting Bayes factors to make much sense (that’s the point that I made in my 1995 article with Rubin on the topic), so you’ll just have to evaluate these as one more possible decision rule with no clearly-defined theoretical basis.

3. You’re right that I have suggested addressing multiple testing-type problems by putting everything into one large hierarchical model! Here’s my paper with Jennifer and Masanao explaining this, published in the Journal of Research on Educational Effectiveness in 2012.

4. You write that you over 10k genes, and a few hundred samples, so this doesn’t seem practical. But it is practical! We routinely fit multilevel regressions in Stan with hundreds of thousands of data points and thousands of predictors. Try it and see how it goes! If it’s slow, then my recommendation is to set the group-level variance parameters to reasonable values, then it’s just a simple regression model with a prior, and you can fit it fast using an optimizer, just a simple mode-finder or you can use ADVI or Pathfinder if you’d like.

5. I can see that ranking by mean(case_vs_control) / sd(case_vs_control) might not work so well because the sd can be noisy. You can probably do better by fitting a hierarchical model to the sd’s.

6. Again speaking generally, you should be able to use simulated-data experimentation to get a sense of how any of these methods is working. Simulated-data experimentation has two advantages here: first, by construction you know the true values of all the parameters so you can directly compare the performance of different methods; second, the very act of setting up the simulation forces you to figure out exactly what you are trying to learn, which brings us back to my item #1 above.

Stan for multimodal mixtures—from exponential CPS to linear DP

This post is from Bob

I’ve been thinking about evaluation recently because I’ve been working with colleagues on new samplers, which means evaluating how well they work (more on that soon). This in turn means coming up with target densities on which to evaluate them.

A combinatorial multimodal test case

I wanted something clearly multimodal and hence not log concave. I remember somebody’s paper (help with citation?) used a mixture of four two-dimensional isotropic normals, separated enough to make transition possible, but still a bit difficult. Not to give the game away, but here’s a posterior plot of a sample drawn from Stan—the imbalance in component weights is intentional, as I’ll describe below.



I’m still working on posteriordb with the Stan gang (see the authors of the linked paper) and Inference Gym with Reuben Cohn-Gordon (another linguist by training and programming language geek turned to MCMC), and thought it’d be nice to have something a little more general than just the 2D example. So I got out my notebook, and realized the generalization to D dimensions involves 2^D mixture components that are normal with unit covariance located at the points in {-r, r}^D.

p(y | r) = SUM_{mu in {-r, r}^D} 1/2^D normal(y | mu, I).

I then generalized to allow setting the probability that Y[d] > 0 to be p in (0, 1) to get a non-uniform mixture. This leads to a slightly more complex density because of the non-uniformity.

p(y | r) = SUM_{mu in {-r, r}^D} binomial(sum(mu == r) | D, p) * normal(y | mu, I).

Coding in Stan with continuation-passing style

So how do we code this in Stan? Obviously it needs to be recursive or at least iterative to deal with the D being unknown at compilation time. Whenever I see recursion, I immediately think of continuation passing style (CPS). So I came up with this Stan program to code a generalization in D dimensions.

functions {
  real mm(vector y, real r, real p, int d, real lp) {
    if (d == 0) {
      return lp;
    }
    real lp1 = mm(y, r, p, d - 1, lp + normal_lpdf(y[d] | r, 1));
    real lp2 = mm(y, r, p, d - 1, lp + normal_lpdf(y[d] | -r, 1));
    return log_mix(p, lp1, lp2);
  }

  real mm_lpdf(vector y, real r, real p, int D) {
    return mm(y, r, p, D, 0);
  }
}
data {
  int D;   // number of dimensions
  real r;  // modes in {-r, r}^D
  real p;  // p = Pr[Y[d] > 0]
}
parameters {
  vector[D] y;
}
model {
  y ~ mm(r, p, D);
}

The log_mix function is defined as follows, but implemented in a more stable way.

log_mix(p, lp1, lp2)
    = log_sum_exp(log(p) + lp1, log(1 - p) + lp2)
    = log(exp(log(p) + lp1) + exp(log(1 - p) + lp2))
    = log(p * exp(lp1) + (1 - p) * exp(lp2)).

If you unfold the recursion manually, the leaves wind up being the log densities and the weights wind up percolating as described in the definition. If you’re having trouble seeing this, manually expanding the D = 1 and then D = 2 cases will help. It’s compact, but it’s still exponential in cost to evaluate a log density and gradient (i.e., O(2^D)). Although it’s slow in higher dimensions, it works.

Python scripts

The plot above is from the following Python code that sets

D = 2, r = 2.5, and p = 2.0/3.0.

For those of you considering a move to Python, having a clone of data frames (pandas) and ggplot2 (plotnine) is a godsend. And yes, of course the LLMs know how to code pandas and plotnine.

import cmdstanpy as csp
import pandas as pd
import plotnine as pn

model = csp.CmdStanModel(stan_file='mm.stan')
D = 2
r = 2.5
p = 2.0 / 3.0
data = {'D': D, 'r': r, 'p': p}
fit = model.sample(data = data, iter_sampling=5_000)
print(fit.summary(sig_figs=2))

y = fit.stan_variable('y')
df = pd.DataFrame({'y1': y[:, 0], 'y2': y[:, 1]})
plot = (
    pn.ggplot(df, pn.aes(x='y1', y='y2'))
    + pn.geom_vline(xintercept=[-r, r], color='red', linetype='dashed')
    + pn.geom_hline(yintercept=[-r, r], color='red', linetype='dashed')
    + pn.geom_point(alpha=0.1)
    + pn.scale_x_continuous(breaks=[-r, 0, r])
    + pn.scale_y_continuous(breaks=[-r, 0, r])
    + pn.coord_fixed()
    + pn.theme_minimal()
)
plot.save('mm.jpg', dpi=300)

The knockoff of data frames in pandas and ggplot2 in plotnine are a godsend if you’re transitioning to Python from R (which I would highly recommend).

Dynamic programming to the rescue

Because it involved CPS, I mailed it off to Brian Ward around midnight last night. I’m a decent programmer, but Brian’s next level. By the time I arrived today at 10 am, he had rewritten the target density as follows.

  real mm_lpdf(vector y, real r, real p, int d) {
    if (d == 0) {
      return 0;
    }
    real lower_mixture = mm_lpdf(y | r, p, d - 1);
    real lp1 = lower_mixture + normal_lpdf(y[d] | r, 1);
    real lp2 = lower_mixture + normal_lpdf(y[d] | -r, 1);
    return log_mix(p, lp1, lp2);
  }

[Edit: Switched everything to lpdf from a mix of lpdf and lupdf.]

He saw that the recursions were doing the same thing in each branch and could be shared. Because there’s only one recursive call, Brian’s code is linear (i.e., O(D)). It achieves this speedup using dynamic programming (DP). DP calculates partial solutions that can be combined into larger solutions rather than recomputing them. DP’s the technique that you need to solve the harder L33T-code quizzes you’ll get during technical interviews these days. Other examples where DP can be helpful for statistical models include the fast Fourier transform (FFT), the forward algorithm for hidden Markov models (HMMs), and the Poisson-binomial distribution. The first two are coded efficiently in Stan and the latter I showed how to code in a Stan forum post on Poisson-binomial.

Try it yourself in the Stan Playground

If you want to play with this yourself, Brian built a version using the Stan Playground that you can run in the browser.

Here’s what it looks like after setting D = 3, running sampling, and then viewing a histogram with all three dimensions selected.



It’s a live demo, so you can edit the data to set r, D, and p. And it’s really fast due to the DP. Just like in ShinyStan and especially like its generic in-the-browser version MCMCMonitor (from many of the same developers as Stan Playground), you can view 3D projections of the higher-dimensional draws and rotate them to see it making 8 balls in 3D, 7 of which are visible in the screen grab. Or you can go to higher dimensions and view projections down to two or three dimensions. You might want to increase the number of draws per chain to get cleaner delineation of the posterior densities in the visualizations.

The Dodgers are hiring

Richard Anderson writes:

I manage the data science team with the Los Angeles Dodgers and have a job post that may be of interest to students or readers of your blog. If you know anyone who may be interested, they are welcome to reach out to me directly.

From the job description:

We are especially interested in candidates with either (1) demonstrated strength in deep learning with applications to spatiotemporal data or (2) demonstrated strength in Bayesian hierarchical modeling & probabilistic forecasting.

And, in the list of areas of preferred expertise:

You have written probabilistic models in NumPyro, PyMC, or Stan with custom likelihoods and priors.

I can only assume that Leo Durocher would be spinning in his grave. “Quantitative Analysis,” indeed!

StanBio

This is Eric.

On 30 May, we held the first virtual StanBio Connect conference. One large video recording is now available on the conference website and the Stan YouTube channel under the Live tab. We have individual videos, but we have not processed them yet. I was impressed with the depth and the variety of the talks, and I hope we will be able to do it again next year.

In the meantime, I am considering turning stanbio.org into a destination for all things Stan in biomedical research. This would involve restructuring the website (it’s built with Quarto, so that’s not too difficult) and hosting both the original and linked content. If you think it’s a good idea and want to volunteer and contribute, please send an email to [email protected] with the subject line ‘StanBio’.

loo R package 10 years!

This post is by Aki.

The loo R package has its 10 year anniversary today! Jonah Gabry made the first loo package release (v0.1.0) 10 years ago on June 26.

  • loo has been downloaded more than 4 million times from the RStudio CRAN mirror (there are more than 80 mirrors, but the RStudio mirror is likely to be one of the most popular ones)
  • R-universe counts 304 other R packages using loo
  • based on R-universe scores, loo is in top 100 among 26,819 packages

Here’s a blog post about the background, advances during the years, and a bit about the future.

Stan

When I (Aki) got involved with Stan project, there was a need for a model selection criterion that would be fast, robust, and easy to compute. I had used cross-validation (CV) a lot, but it did require some expertise to know which computational approach to use in which case, and how to diagnose the reliability. Brute-force leave-one-out cross-validation with repeated model inference is slow. Gelfand et al. (1992) had proposed importance sampling leave-one-out (LOO) CV, but 1) that estimate may have infinite variance (e.g. Peruggia, 1997), 2) due to skewed distribution the estimate would be over-optimistic with high probability, and 3) there was no good diagnostic for reliability.

DIC

Andrew had been using DIC (Spiegelhalter et al., 2002), which was simple and fast, but 1) it assumes the predictions are made using posterior mean of the parameters (and not by integrating over the posterior), 2) it’s not invariant to parameter transformations, and 3) it was known to fail for multimodal posteriors and flexible models.

WAIC

The Widely-applicable information criterion (WAIC; Watanabe, 2010) seemed promising at first, 1) being simple and fast, 2) assuming predictions using posterior predictive distribution, 3) being invariant to parameter transformations, and 4) it works with multimodal and singular posteriors. Alas, more testing of WAIC revealed it also fails with more flexible models without a warning.

Diagnostics

It was now clear that there was a pressing need for a diagnostic for the model comparison criterion computation.

I started to investigate a diagnostic for WAIC. WAIC can be presented as a truncated Taylor series approximation. In difficult cases, the higher order functional cumulant terms are not small. Can we diagnose whether MCMC estimates of higher order terms have finite variance?

Koopman et al. (2009) had proposed a diagnostic for importance sampling: 1) fit generalized Pareto distribution (GPD) to the tail of the importance ratio distribution, 2) use shape parameter k to diagnose whether the variance is finite (k <= 1/2), and 3) reject the estimate if the variance is not finite. The problem was that there was no advice what to do if the estimate is rejected and this seemed to happen a lot with WAIC and importance sampling LOO. Furthermore, finite but high variance is also problematic.

I then realized that if we assume the tail of the ratio distribution is close to a generalized Pareto distribution, and we fit the generalized Pareto distribution, we can use that as a model for the tail. To make the computation practical, this model is used to replace the raw ratios with modelled (smoothed) ratios, which are then used in further computations. In theory, modeling should reduce variability (and if the model is good, the bias can be negligible) and this was observed in practice; smoothed importance sampling LOO performed better than WAIC (or had similar performance for simple boring models).

Jonah implemented the method in loo package while we were making the experiments and writing the paper, and loo v0.1.0 was released on Github on June 26 before the papers were public.

Main papers

As theory and experiments did take a lot of pages, we decided to split the paper into “Very Good Importance Sampling” (arXived July 9, the name inspired by WAIC) and “Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models” (arXived July 16). Eventually the paper names were changed to “Pareto smoothed importance sampling” (PSIS) and “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC”. The latter did get published in Statistics and Computing in less than 14 months, while the PSIS paper (final version 58 pages) took more than 9 years to get published eventually in JMLR.

While the PSIS algorithm stayed practically the same the whole time, the theoretical justifications and diagnostics did improve over the years. Dan Simpson and Yuling Yao provided help with the theory and joined as co-authors.

The PSIS estimator always has finite variance with a cost of some bias. If Pareto-k<0.7, the bias and variance are small (see details in the PSIS paper; Vehtari et al., 2024). Although 9 years felt too long time and we had a nasty case of reviewer 2, I’m happy about the improved theoretical understanding of the pre-asymptotic behaviour.

Outliers and influential observations

The loo package did gain additional practical advice on how examining the number of parameters, the effective number of parameters (p_loo), and the number of observations can provide information on whether the high Pareto-k values are likely due to a) badly misspecified model with outliers or b) well specified but flexible model.

As we get Pareto-k diagnostic for each LOO-fold, that can be used as identifying influential or problematic observations, but also to focus the additional computation only for the specific LOO-folds.

Additional computation for high Pareto-k cases

The simplest approach is to just re-run MCMC for the folds with high Pareto-k. rstanarm and brms packages know enough about the data and model, so that they can provide automated approach for this.

To speed-up computation, we developed moment matching approach (Paananen et al., 2021) that can adjust the posterior draws faster than what re-running MCMC would take, to better match the proposal and target. The needed functionality and a vignette was added to loo. brms makes it easy to use moment matching LOO with one option.

The loo package also has a vignette and support for K-fold-CV which is robust and relatively fast if, e.g. K=10.

Large data

Even though PSIS-LOO is generally fast, with big enough data it can be slow. We developed a sub-sampling LOO approach (Magnusson et al., 2019, 2022) with a vignette in the loo package. In the sub-sampling approach we use a faster but biased estimate for all LOO-folds and a slower but (almost) unbiased estimate for a subset of LOO-folds. The biggest benefits of this approach can be seen in the projpred package (github version), where PSIS-LOO given the full data search path is fast, but doing the search for each LOO-fold is slow. Using subsampling and the difference-estimator we can get more than a 10-fold speedup as demonstrated in one of the case studies.

Predictive checking

Cross-validation can be used to improve predictive checking. Posterior predictive checking can fail with flexible models as the same data are used for fitting and checking. We added useful functions to loo package to support LOO predictive checks and LOO probability integral transformation (LOO-PIT) calibration checks to bayesplot (Gabry et al., 2019; Säilynoja et al., 2022, 2025).

Beyond LOO-CV

The package is named loo as it started as an implementation of the PSIS-LOO algorithm (and we had only US and Finnish people thinking about the name). But it was natural to extend it beyond LOO.

Leave-one-group-out (LOGO) is useful if we want to predict for new groups. LOGO is challenging for importance sampling as the posterior for group specific parameters changes a lot if we leave out all the group specific observations. We can use K-fold-CV (loo vignette) or integrate out the group specific parameters (Roaches case study).

While LOO is valid for analysing the observation model in time series models, we may sometimes prefer leave-future-out cross-validation (LFO-CV), as it has smaller bias if the prediction task is in the future. The loo package has a vignette demonstrating how PSIS and occasional re-fits can be used for fast LFO-CV (Bürkner, Gabry and Vehtari, 2020).

The downside of LFO-CV is that it uses only a small part of the data for fitting the model and making predictions for the future, and thus has high variance. If the focus is model comparison, it is better to use K-fold-CV or hv-block-CV with joint log score as these have smaller variance, which leads to higher model selection efficiency for time series models (Cooper et al., 2025a).

Leaving out more than one observation and using joint log score improves the model selection efficiency also in the case of spatial models (Cooper et al., 2025b).

Often temporal and spatial models are presented as non-factorized normal (or t) models, in which case we need to use properties of multivariate normal (or t) to compute LOO (Bürkner, Gabry and Vehtari, 2020). The loo package has a vignette and brms uses this approach for non-factorized models.

Comparing models

Instead of just using point estimates of the predictive performance, we can also quantify the related uncertainty, which is especially useful when doing model comparison. From the beginning, the loo package was reporting the log score (elpd) difference and the related standard error based on the recommendation by Vehtari and Lampinen (2002). We (Sivula et al., 2025) have investigated in more detail the conditions when the standard error and related normal approximation are accurate. The normal approximation can be used to estimate probability that model B has better predictive performance than model A.

Model averaging

The LOO computed log score (elpd_loo) has a connection to information criteria (see, e.g. Vehtari and Ojanen, 2012). Inspired by information criteria based model weights and stacking, we developed Bayesian stacking (Yao et al., 2018), which is also implemented in the loo package. Furthermore, we extended Bayesian stacking to Bayesian hierarchical stacking (Yao et al., 2022a) and stacking for non-mixing computations (Yao et al., 2022b). There is a loo package vignette for stacking.

Other scores and metrics

The loo package supports also (S)CRPS, MAE, RMSE, MSE, ACC, and BACC, although not as nicely as log score (see below for future plans).

ArviZ

loo is an R package, but PyStan and PyMC users needed fast cross-validation, too. The Python and Julia ArviZ libraries were subsequently developed to include most of the methods that are in the loo R package,

Future

Because PSIS is useful in many other cases beyond just LOO, PSIS functionality has now been implemented in the posterior package, so that packages that would like to use just the Pareto smoothing and diagnostics do not need to depend on the loo package. We’re also in the progress of changing loo to use more of these functions from posterior.

We’re also refactoring loo to improve modularity and usability (as part of Google Summer of Code) focusing on:

  • easier use of different scores and metrics (e.g. RMSE, R^2, CRPS)
  • easier use of different cross-validation variants
  • easier use of joint log score
  • more diagnostic information provided to the user, e.g. for the uncertainty normal approximation

The ArviZ team is also working to update ArviZ.

CV-FAQ

The loo package users have been asking many questions, and eventually I wrote CV-FAQ answering the most frequently asked questions (and setting straight some common misconceptions).

Thanks

Very big thanks to Jonah Gabry for writing the loo package (see also other contributors) and to the ArviZ team for implementing the methods in Python and Julia! All the methods developed in the papers would not be widely used without these packages (The Practical LOO-CV paper has been cited more than 5900 times). While the methods in the papers made certain things possible, it is the software that makes using the methods easy!

Stanford Human Trafficking Data Lab Hiring for a Full-time Postdoctoral Scholar

Ben Seiler sends along this opportunity:

The Stanford Human Trafficking Data Lab is accepting applications for a postdoctoral fellowship position to join a project investigating trafficking risks in charcoal supply chains in Brazil. The position is open to recent graduates of PhD programs in statistics, economics, computer science, operations research, or related data science fields. The position provides opportunities to participate in rigorous, quantitative research on human trafficking, including supply chain network analysis and geospatial modeling. The successful candidate will have strong data science skills, including experience working with large, complex data from varied sources, and machine learning methodologies. The underlying data are complex and will require sophisticated data management and integration skills. A candidate should have proficiency with GIS software and Python, strong written and interpersonal communication skills, and a demonstrated interest in addressing social justice issues through data-driven research. The postdoc will work in partnership with PI Grant Miller (Stanford University School of Medicine) and other research team members, and will contribute to study design, participate in field research, conduct data analysis, and disseminate findings through academic publications and presentations. The postdoctoral fellow will be expected to focus mainly on this project, but may spend up to 20% of their time on independent research. For more on the Lab’s other ongoing projects see https://htdatalab.stanford.edu/projects/. The postdoc will be based at Stanford University in the Department of Health Policy, located near to the Departments of Economics and Political Science and the Graduate School of Business.

TO APPLY: Please email a single PDF document named
“Lastname_Firstname_HTDL_Postdoc_2025” containing the following materials to Lydia Aletrais at [email protected]:

1. A cover letter describing your interest in this position, your relevant training and experience, and our earliest and preferred start dates
2. A current CV
3. A transcript (unofficial is fine)
4. Names, e-mail addresses, and phone numbers of 2-3 references.

Applications will be considered on a rolling basis. Short-listed applicants will be asked to complete a technical exercise and may be called for an interview.

Given that they’re doing network analysis and geospatial modeling, I expect that knowledge of Stan would be useful too.

Election analytics positions available at the New York Times

Will Davis writes:

I oversee the Election Analytics department (aka The Needle and The Times/Siena Poll) at The Times.

We’re lucky enough to have two great roles open on the team right now for people excited to make a career in the field of election analytics.

* Election analyst: This is a mostly technical role, but it comes with the opportunity to write. This person would be responsible for some most essential elements of our election polling and modeling — modeling unit-level turnout and vote share, creating baselines for The Needle and helping us continue to innovate the design of The Times/Siena Poll.

* Election researcher: This is a great job for somebody with less of a technical skillset who’s eager to be around a team they can learn a ton from. It’s primarily focused on manual research and data entry.

These sound like excellent an opportunity to combine statistical modeling, computation, and graphics.

Taking our Models Seriously (my talk at StanBio Connect, this Friday 9am)

StanBio is a free, one-day online conference that will take place on Friday, 30 May 2025 from 9 am to 5 pm ET. Here’s my talk:

Taking our Models Seriously

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

In biomedical research we often use models that are “mechanistic” rather than “phenomenological”: that is, we try to model an underlying process rather than simply fitting a curve. Mechanistic models are necessary for inference with latent variables (as in pharmacology when there is interest in concentration within an internal organ) and useful for learning and extrapolating from sparse data. We discuss statistical and computational challenges we have resolved, along with some open problems.

Now that I’m posting this, I can’t remember what I was going to say! I guess I’ll have to figure it out between now and Friday morning.

Using Stan to do sequential Bayesian updating

Gaurav Sood writes:

I was thinking about Bob’s post, and I think we can do something with Wasserstein Barycenter Priors

I have the Gaussian thing going here.

But the right approach is the barycenter priors ….

I haven’t been following the details here, but I think this discussion is about a problem that I’ve seen many times in applications, which is that you want to update your posterior as new data come in, and it’s expensive to re-fit the model with all your data each time a new little batch of data come in. We’ve solved this problem in different ways at different times: sometimes we approximate the posterior distribution by a normal or mixture of normals; other times we approximate the posterior for the hyperparameters and use it as a prior for the model we’re fitting to the new data; sometimes the problem can be simplified by factorization in some way or another. There’s also particle filtering, which is a particular class of iterative simulation algorithm using multiple chains.

Naming the problem

As with any problem that comes up often, there are many solutions. I’m excited that Bob recently implemented a version in Stan. Bob called it “chaining Bayesian inference” . . . I’m not thrilled with that term. “Sequential Bayesian updating” sounds better to me, so that’s what I used in the title of this post. I’m open to other suggestions.

Gaurav’s approach

Here’s what Gaurav writes in his new post:
Continue reading

Chaining Bayesian inference with priors constructed from posterior draws

 

This post is from Bob.

 

Chenyang Zhong, a stats professor at Columbia, presented the following paper at our Bayesian computation reading group on Friday.

The goal is to be able to chain Bayesian inferences on a data stream in situations where there’s no analytic form of the posterior. The problem is that the textbook solution of using analytic posteriors (e.g., chaining binomial likelihoods with beta priors), only works for simple conjugate models.

The model

The unknown prior in this case is constructed from draws from the posterior of a previous model. To ground us with some notation, suppose our joint model is the product of a likelihood and prior,

        p(y, theta | x) = p(y | theta, x) * p(theta).

Sequential data

For example, consider the case where we receive a sequence of data sets

        (x1, y1), (x2, y2), …, (xn, yn), ….

In Chenyang’s case, the data is private, so you can view the problem as a kind of federated learning or as a kind of meta-analysis. We know we can fit a model p(theta, y | x) and get posterior draws

        theta_post(1), …, theta_post(M) ~ p(theta | x1, y1).

Chenyang assumes the posterior draws may be shared. What we’d like to do is use the posterior p(theta | x1, y1) as the prior for theta when analyzing data x2, y2, i.e.,

        p(theta | x1, x2, y1, y2) propto p(y2 | theta, x2) * p(theta | y1, x1).

But we don’t have a closed form expression for p(theta | y1, x1), so what do we do?

Kernel density estimate as a posterior approximation

Chenyang’s idea is to use a kernel density estimate with a normal basis as a proxy for p(theta | y1, x1). Specifically, what he’s going to do is write an empirical prior that penalizes squared distance from the posterior draws.

        p(theta | y1, x1) approx 1/M SUM_m normal(theta | theta_post(m), h * I),

where I is the identity matrix and h > 0 is a variance parameter. Most of Chenyang’s paper is about how to compute this efficiently. By “high dimensions” he’s talking about a modest 6 to 20 dimensions. He takes M to be about 10,000. He then goes about constructing a really neat way to Metropolis sample exactly using only nearest neighbors of theta.

It’s easy to establish that the maximum likelihood estimate and posterior mean of the prior will be the sample mean of the posterior draws theta(1), …, theta(M). How strongly it concentrates around that mean will depend on how spread out the posterior draws are. What’s interesting is that how hard it pulls does not matter how large M is—the more the merrier in terms of accuracy. But it will depend on the variance term h.

What if we just use Stan?

Stan’s pretty darn fast at normal approximations, so what if we just coded the approximate posterior directly rather than trying to use a graph of nearest neighbors and adjust? Turns out it works very cleanly. Given the data set in his paper, which is a logistic regression with N = 1500, it takes Stan 2s to fit the posterior for (x1, y1), and 35s to fit the posterior for (x2, y2) using 10,000 posterior draws from the approximate posterior (this is my 2017 iMac Pro, which is very slow compared to current ARM-based Macs).

I coded this all up as a Stan case study that you can find here, along with the results:

Should I add this as a new technique to the third section of the Stan User’s Guide? This problem comes up all the time on our forums.

What’s left to do?

I have two four questions left after fitting the model.

  1. How to set the variance term h? It doesn’t affect the mean, but it does affect how strongly the prior concentrates. Chenyang mentioned something about potentially wanting to discount the past, which you can do. Is there a way to set that by tweaking h? Alternatively, is there a way to set it optimally so that the posterior for the final fit is closest to p(theta | x1, y1, x2, y2) in cases where we want to equally weight the past?
     
  2. How many posterior draws do we need? In simple cases like this, probably not 10,000!
     
  3. Is this just an easier way to do the computation rather than estimating a covariance matrix and using a multivariate normal? The reason I ask is that I know you can generate from the empirical covariance by differencing in this way—it’s equation (11) in Goodman and Weare’s affine invariant sampling paper.
     
  4. How can this approach handle constrained parameters? We can just keep exactly the same code and keep the constraints on the parameters and everything should work, but it seems more natural to lay down a multivariate normal approximation on the unconstrained scale (e.g., after log transforming positive constrained parameters).

[edit: added third and fourth question]

“No Vehicles in the Park”: A multilevel-model computing saga (including some Bayesian workflow)

This is a story about how hard it can be to fit even a simple multilevel model. It’s a long story, but that’s ok: we have a lot to think about:

1. The data and the problem
2. Fitting an item-response model using lme4
3. Understanding and expanding the fitted model
4. Fitting using rstanarm
5. The funnel rears its ugly head
6. Fitting the model using Stan
7. A model with 3 constant terms
8. Two simulation experiments
9. Summary of results
10. Other software options
11. Conclusion
12. How to write this up in a way to be most useful to readers?

It’s a story in twelve parts, but no Comedian, no Rorschach, not even a cartoon smiley face. It’s something I needed to work through, nevertheless.

1. The data and the problem

It began with this blog post by Dan Luu a couple of years ago, which pointed to an online quiz from Dave Turner that begins:

Every question is about a hypothetical park. The park has a rule: “No vehicles in the park.” Your job is to determine if this rule has been violated.

You might know of some rule in your jurisdiction which overrides local rules, and allows certain classes of vehicles. Please disregard these rules; the park isn’t necessarily in your jurisdiction. Or perhaps your religion allows certain rules to be overridden. Again, please answer the question of whether the rule is violated (not whether the violation should be allowed).

Then 27 questions follows with different scenarios; for example:

And you can see where you, and others, would draw the line. Turner generously shares this file with 51953 responses from 2409 people giving yes/no answers to the 27 questions. The columns are:
• Submission id (a unique id for the respondent)
• Question id (a unique id for each question)
• Indicator if the person described in the survey question has a male-sounding name
• Indicator if the person described in the survey question has a white-sounding name
• Response (1 for the response, “Yes, this violates the rule” and 0 for the response, “No, this does not violate the rule”)

The wordings for the 27 items are here.

From some googling, I gather that the No Vehicles in the Park problem dates from a law review article by H. L. A. Hart from 1958 (see footnote 1 here), but I first heard about it from the above-linked post from Dan Luu.

Our goal here is to fit an item-response / ideal-point model including variation over respondents and items. In short: some respondents are more likely to answer Yes to these questions and some items are more likely to elicit Yes responses.

A lot can be done with these data, but our entire post here will be devoted to the statistically–but, as we shall see, not computationally–simple problem of estimating the very basic model of independent responses:

Pr(y_i=1) = invlogit(X_i*beta + a_{j[i]} – b_{k[i]}),

where i=1,…,51953 indexes the responses, j[i] indexes the person giving that response, k[i] indexes the item, and X is a 51953 x 4 matrix with 4 columns:
• Constant term (that is, a column of 1’s)
• Indicator for male-sounding name for item i
• Indicator for white-sounding name for item i
• The number of responses given by person j[i]. In this dataset about two-thirds of the respondents answered all 27 questions (they are rule-followers!), with most of the others answering between 7 and 26 items, and one person answering only 2 questions.

2. Fitting an item-response model using lme4

We start by setting up the data in R:

park <- read.csv("park.csv")
N <- nrow(park)
respondents <- sort(unique(park$submission_id))
J <- length(respondents)
respondent <- rep(NA, N)
for (j in 1:J){
  respondent[park$submission_id == respondents[j]] <- j
}
items <- sort(unique(park$question_id))
K <- length(items)
item <- park$question_id
y <- park$answer
n_responses <- rep(NA, J)
for (j in 1:J){
  n_responses[j] <- sum(respondent==j)
}
male_name <- park$male
white_name <- park$white
n_responses_full <- n_responses[respondent]

And then we fit the model using lme4:

library("arm")
data_matrix <- data.frame(y, respondent, item, male_name, white_name, n_responses_full)
fit_lme4 <- glmer(y ~ (1 | item) + (1 | respondent) + male_name + white_name + n_responses_full, family=binomial(link="logit"), data=data_matrix)
display(fit_lme4)

Which yields:

                 coef.est coef.se
(Intercept)      -1.50     0.46  
male_name         0.07     0.03  
white_name        0.08     0.03  
n_responses_full -0.03     0.01  

Error terms:
 Groups     Name        Std.Dev.
 respondent (Intercept) 1.92    
 item       (Intercept) 2.31    
 Residual               1.00    
---
number of obs: 51953, groups: respondent, 2409; item, 27

The next step is to understand the estimated coefficients and error terms. Along the way to doing that, we'll run into some challenges.

3. Understanding and expanding the fitted model

3.1. Understanding the intercept

Going through the above output in order:
- The intercept is -1.50. This corresponds to the prediction (on the logit scale) for a data point where all the predictors are zero . . . This is hard to interpret, given that n_responses_full is never zero: it usually takes on the value of 27.
- Before going through the rest of the output, let's work to understand the intercept term. We could compute the predicted probability by setting n_responses_full=27, thus, invlogit(-1.50 - 0.03*27), but instead let's just code that last predictor in terms of the number of items skipped so that zero is a reasonable baseline:

n_skipped <- K - n_responses

We then re-fit the model:

n_skipped_full <- n_skipped[respondent]
data_matrix <- data.frame(y, respondent, item, male_name, white_name, n_skipped_full)
fit_lme4 <- glmer(y ~ (1 | item) + (1 | respondent) + male_name + white_name + n_skipped_full, family=binomial(link="logit"), data=data_matrix)
display(fit_lme4)

Which yields:

               coef.est coef.se
(Intercept)    -2.42     0.45  
male_name       0.07     0.03  
white_name      0.08     0.03  
n_skipped_full  0.03     0.01  

Error terms:
 Groups     Name        Std.Dev.
 respondent (Intercept) 1.92    
 item       (Intercept) 2.31    
 Residual               1.00    
---
number of obs: 51953, groups: respondent, 2409; item, 27

OK, so let's try again:

- For an average respondent given an average item, if that question is given with a female, nonwhite name and the respondent is someone who answered all 27 questions, the estimated probability of a "Yes, this violates the rule" response is invlogit(-2.42) = 0.08.

Just to check, I'll go the raw data:

> mean(y)
[1] 0.238735

Huh? I'm surprised. This should be close to 0.08, no? OK, let me restrict just to the cases with zero values for the three predictors:

> mean(y[male_name==0 & white_name==0 & n_skipped_full==0])
[1] 0.22

And it's not like this is an empty category:

> sum(male_name==0 & white_name==0 & n_skipped_full==0)
[1] 10748

We can also compute the predicted probability for an average case:

> invlogit(-2.42 + 0.07*mean(male_name) + 0.08*mean(white_name) + 0.03*mean(n_skipped_full))
[1] 0.09

This too is nothing close to the mean of the raw data.

At this point, I'm not quite sure what's going on here. It must have to do with the nonlinearity of the logit function: for an average case, the predicted probability is 0.09, but sometimes that probability is much higher or lower. At the low end it's bounded by zero, so the mean probability (rather than the probability of the mean) will be much more than 0.09.

Hey, we can check this by running the model in "predict" mode:

> mean(predict(fit_lme4, type="response"))
[1] 0.23

So, yeah, it's the extreme nonlinearity of the logit when the predicted probability for an average case is near 0 (in this case, it's 0.09) and the variance on the logit scale is high (which it is; check out the high estimates for the standard deviations of the respondent and item effects above).

3.2. A simulated-data experiment

When in doubt, I like to check with fake data. Here it's easy enough to simulate from the fitted model. First we simulate the varying intercepts given the estimated variance parameters:

set.seed(123)
a_respondent_sim <- rnorm(J, 0, sqrt(VarCorr(fit_lme4)$respondent))
a_item_sim <- rnorm(K, 0, sqrt(VarCorr(fit_lme4)$item))

Then we throw in the estimated regression coefficients and pipe it through the logit:

b_sim <- fixef(fit_lme4)
X <- cbind(1, male_name, white_name, n_skipped_full)
p_sim <- invlogit(a_respondent_sim[respondent] + a_item_sim[item] + X %*% b_sim)
y_sim <- rbinom(N, 1, p_sim)

Next we add the simulated responses to the data frame and re-fit the model:

data_sim <- data.frame(data, y_sim)
fit_lme4_sim <- glmer(y_sim ~ (1 | item) + (1 | respondent) + male_name + white_name + n_responses_full, family=binomial(link="logit"), data=data_sim)
display(fit_lme4_sim)

And here's what pops out:

                 coef.est coef.se
(Intercept)      -1.88     0.53  
male_name         0.06     0.03  
white_name        0.12     0.03  
n_responses_full -0.04     0.01  

Error terms:
 Groups     Name        Std.Dev.
 respondent (Intercept) 1.87    
 item       (Intercept) 2.65    
 Residual               1.00    
---
number of obs: 51953, groups: respondent, 2409; item, 27

The parameter estimates are within a standard error of what came before . . . I guess the real question is why is the standard error for the intercept so high?

The standard error for the intercept is 0.45 (or 0.41 when fit to the simulated data); this seems huge for a logistic regression fit to 52,000 data points.

The fitting challenge here has got to be coming from those 27 item parameters: a high variance and only K=27 will lead to some uncertainty in the rest of the fitted model.

Let's look at those 27 estimates:

> a_item_hat <- ranef(fit_lme4)$item
> print(a_item_hat)
   (Intercept)
1        7.607
2        3.009
3       -0.738
4       -1.016
5        1.155
6       -0.970
7       -0.508
8        1.986
9        3.232
10       1.344
11      -1.254
12      -3.171
13       0.489
14      -1.239
15      -0.586
16       0.610
17      -2.783
18      -1.593
19       0.022
20      -2.793
21       1.038
22       0.295
23      -0.045
24      -1.491
25      -0.039
26      -0.503
27       2.008
> options(digits=1)
> a_item_hat <- ranef(fit_lme4)$item
> print(a_item_hat)
   (Intercept)
1         7.61
2         3.01
3        -0.74
4        -1.02
5         1.15
6        -0.97
7        -0.51
8         1.99
9         3.23
10        1.34
11       -1.25
12       -3.17
13        0.49
14       -1.24
15       -0.59
16        0.61
17       -2.78
18       -1.59
19        0.02
20       -2.79
21        1.04
22        0.30
23       -0.05
24       -1.49
25       -0.04
26       -0.50
27        2.01

Ummm, hard to read these. Let's extract the item names and attach them to those numbers:

wordings <- read.csv("park.txt", header=FALSE)$V2
wordings <- s ubstr(wordings, 2, nchar(wordings)-1)
a_item_hat <- unlist(a_item_hat)
names(a_item_hat) <- wordings

The space in "s ubstr" above is there to defeat Columbia's stupid firewall which is otherwise blocking this post. There should be no such space in the code.

Anyway, we can now print the estimates in order:

> sort(a_item_hat)
              kite     paper_airplane                iss             toycar         ice_skates 
             -3.17              -2.79              -2.78              -1.59              -1.49 
           toyboat          parachute              plane          surfboard             skates 
             -1.25              -1.24              -1.02              -0.97              -0.74 
skateboard_carried         wheelchair           stroller            travois               sled 
             -0.59              -0.51              -0.50              -0.05              -0.04 
             rccar              horse         quadcopter         skateboard         wagon_kids 
              0.02               0.30               0.49               0.61               1.04 
             wagon            rowboat               bike           memorial          ambulance 
              1.15               1.34               1.99               2.01               3.01 
            police                car 
              3.23               7.61 

Interesting. The distribution for these items is far from normal. Almost everybody thinks the private car violates the rule, and most people think the same about the police car and the ambulance (???), while, at the other extreme, just about everybody's cool with the kite, the paper airplane, and the space station flying above. I'm actually kinda surprised that these coefficients aren't even further away from zero . . . let's compare these estimated item effects to the raw averages:

item_avg <- rep(NA, K)
for (k in 1:K) {
  item_avg[k] <- mean(y[item==k])
}
plot(item_avg, a_item_hat, pch=20)

Here's the resulting graph:

The raw rates range from 2% saying that the kite violates the rule to 99% saying that car violates the rule. Again, it's surprising to me these aren't 0% and 100% but maybe some respondents were taking the piss.

I'll re-plot labeling the points and on the logit scale:

plot(logit(item_avg), a_item_hat, type="n")
text(logit(item_avg), a_item_hat, names(a_item_hat), cex=.5)

You can't read all the items but you get the basic idea.

For the purpose of estimating these parameters we didn't really need the item-response model, but that's because the data are balanced: people were given the items randomly. In general with this sort of rating problem, it's important to fit this sort of model to account for systematic differences in who rates which items.

Just for laffs we can look at the parameter estimates and data averaged by respondents:

a_respondent_hat <- unlist(ranef(fit_lme4)$respondent)
respondent_avg <- rep(NA, J)
for (j in 1:J) {
  respondent_avg[j] <- mean(y[respondent==j])
}
plot(respondent_avg, a_respondent_hat, pch=20, cex=.4)

We can see a few things from the graph:
- Some number of respondents answered Yes or No to all the items. From this graph we can't see how many because of overplotting, but we can just compute sum(respondent_avg==1) and sum(respondent_avg==0) directly and find that 38 people answered Yes to all their items (that is, saying the No Vehicles in the Park rule always applies) and 10 answered only No (that is, saying that the rule never applies).
- There's quite a bit of variation in how people respond--this is something we can also see from the large value for the estimated variation of these intercepts in the fitted model.
- There's some variation in the scatterplot, which is mostly coming from variation in the number of items that people answered. The fewer items you answer, the more your intercept is partially pooled toward the mean of the distribution (zero, in this case).

3.3. Understanding the fitted model

OK, that was a real rabbit hole! Now let's return to the fitted multilevel logistic regression, fit_lme4

               coef.est coef.se
(Intercept)    -2.42     0.45  
male_name       0.07     0.03  
white_name      0.08     0.03  
n_skipped_full  0.03     0.01  

Error terms:
 Groups     Name        Std.Dev.
 respondent (Intercept) 1.92    
 item       (Intercept) 2.31    
 Residual               1.00    
---
number of obs: 51953, groups: respondent, 2409; item, 27

Going through the table from the top:
- We've already talked about the intercept.
- The coefficient for male_name is 0.07: people in this experiment are slightly more likely to say that someone with a male name is violating the rules. This is a small effect (and, yes, we can legitimately call it an "effect," as I'm pretty sure that the "treatment" of using a male or female name was assigned at random), shifting the estimated probability of "Yes, this violates the rule" from 0.24 (recall, that's the mean of y in the data) to invlogit(logit(0.24) + 0.07) = 0.253.
- The coefficient for white_name is 0.08: people in this experiment are slightly more likely to say that someone with a white name is violating the rules: again, the estimated effect is small, an estimated increased of 1.5 percentage points in the probability of a Yes response.
- People who skipped more questions are more likely to say, "Yes, this violates the rule." This effect is not so small: comparing a person who skipped 10 questions to someone who skipped none, the predicted probability of saying Yes increases from 0.24 to invlogit(logit(0.24) + 0.03*10) = 0.30.

This is kind of interesting, and it makes me think that we should also include in the model a predictor for the order in which the questions were asked. Perhaps there's a trend where, as the questions go on, people are increasingly likely to answer No.

So I'll create that new predictor:

order_of_response <- rep(NA, N)
for (j in 1:J) {
  order_of_response[respondent==j] <- 1:n_responses[j]
}
data_matrix <- data.frame(data_matrix, order_of_response)

And add it to the model:

fit_lme4 <- glmer(y ~ (1 | item) + (1 | respondent) + male_name + white_name + n_skipped_full + order_of_response, data=data, family=binomial(link="logit"))
display(fit_lme4)

Here's the result:

                  coef.est coef.se
(Intercept)       -2.59     0.45  
male_name          0.07     0.03  
white_name         0.08     0.03  
n_skipped_full     0.04     0.01  
order_of_response  0.01     0.00  

Error terms:
 Groups     Name        Std.Dev.
 respondent (Intercept) 1.92    
 item       (Intercept) 2.33    
 Residual               1.00    
---
number of obs: 51953, groups: respondent, 2409; item, 27

So, no, order of response (a predictor that varies from 1 to 27 in the data) doesn't seem to do much here. Good to know.

We conclude our understanding of the fitted model by interpreting the error variances. The distributions of respondent and item intercepts have estimated standard deviations of around 2, which is a big number on the logit scale: roughly speaking, the model estimates that about two-thirds of respondent intercepts are in the range (-1.92, 1.92), and about two-thirds of item intercepts are in the range (-2.33, 2.33). A shift of +/- 2 on the logit scale is big when you convert it to probabilities. If we start with that probability of 0.24, subtracting or adding 2 on the logit scale takes you to invlogit(logit(0.24) - 2) = 0.05 or invlogit(logit(0.24) + 2) = 0.70 on the probability scale. So, as we've already seen in the above graph, there's lots of variation across respondents and items.

4. Fitting using stan_glmer

All the above is fine. Sometimes, though, lme4 doesn't work so well. It can yield degenerate estimates for varying-intercept, varying slope models (see for example here) and, more generally, has the usual problems of point estimation that it understates uncertainty and can produce noisy estimates. So by default we prefer to use Stan, even if it's a little bit slower.

That's easy to do here using rstanarm, so first we load in that package and set it up to run parallel chains:

library("rstanarm")
options(mc.cores = parallel::detectCores())

And then we fit the model:

fit_rstanarm <- stan_glmer(y ~ (1 | item) + (1 | respondent) + male_name + white_name + n_skipped_full + order_of_response, family=binomial(link="logit"), data=data_matrix)
print(fit_rstanarm)

But in this case it takes forever. It takes so long that I hit the escape key, restarted R, and ran a minimal version:

fit_rstanarm <- stan_glmer(y ~ (1 | item) + (1 | respondent) + male_name + white_name + n_skipped_full + order_of_response, family=binomial(link="logit"), data=data_matrix, iter=200, control=list(max_treedepth=5))
print(fit_rstanarm)

There's an annoyance in the stan_glmer syntax where the number of iterations is set in one part of the function call and the treedepth is set elsewhere. This is a legacy of the rstanarm package using rstan rather than the newer interface, cmdstanr. The control options are cleaner when running Stan programs directly using cmdstanr, as we'll see below.

Anyway, to return to our main narrative, when we ran stan_glmer for 100 iterations and max treedepth of 5 and (by default) 4 chains, it's fast, but the chains do not mix well, and we get a long warning message, including this:

The largest R-hat is 4.34, indicating chains have not mixed.
Markov chains did not converge! Do not analyze results! 

Rather than chugging through thousands of iterations, I paused to think for a moment, and, from my experience fitting multilevel models, I realized the problem. It's subtle, involving both modeling and computing.

5. The funnel rears its ugly head

The problem is "the funnel," also called "the whirlpool," "the vortex," and a few other things, and it comes up with group-level variance parameters in multilevel models. David van Dyk, Zaiying Huang, John Boscardin, and I discuss the problem in this paper from 2008: the usual parameterization of the Gibbs sampler or Metropolis algorithm for such problems can get stuck when the estimated group-level variance is near zero, and the alternative, "non-centered" parameterization, can get stuck when the estimated group-level variance is near infinity. This discussion from an old version of the Stan User's Guide demonstrates the problem in a simple special case. In our 2008 paper, we recommended a redundant parameterization that works in an expanded parameter space, but this doesn't work so well with Stan, because it involves a renormalization of the parameters at each iteration.

We do not (yet) have a universal solution to the funnel problem. Here are a few options, along with their strengths and weaknesses. I'll discuss them in the context of a simple varying-intercept model with measurements i and groups j[i]:

y_i ~ normal(a_{j[i]} + X_i*beta, sigma_y), i=1,...,n
a_j ~ normal(mu, sigma_a), j=1,...,J

1. Standard parameterization as above. Works fine if sigma_a is well estimated from data and distinct from zero but runs into problems when there is posterior mass near zero, as in the 8 schools example from chapter 5 of BDA. The problem here is that in multilevel modeling you typically want sigma_a to become indistinguishable from zero: that's desirable behavior if you include high-quality group-level predictors in X.

2. Alternative parameterization:

y_i ~ normal(mu + sigma_a*xi_{j[i]} + X_i*beta, sigma_y), i=1,...,n
xi_j ~ normal(0, 1), j=1,...,J

This is mathematically the same model as above, but now it's parameterized in terms of sigma_a and xi rather than sigma_a and a = mu + sigma_a*xi. This fixes the geometry of the posterior distribution near sigma_a=0, allowing HMC (or Gibbs or Metropolis) to slide up and down the "cone" of the joint distribution rather than getting struck in the "vortex." The alternative parameterization works just fine with the 8 schools, as demonstrated in Appendix C of BDA, and other problems where there is only weak information on the group-level scale parameter sigma_a. The bad news is that it fails in the opposite scenario in which sigma_a is well estimated with a posterior distribution that is distinct from zero.

3. Redundant parameterization:

y_i ~ normal(mu + kappa*xi_{j[i]} + X_i*beta, sigma_y), i=1,...,n
xi_j ~ normal(0, tau), j=1,...,J

This model on (kappa, tau, xi) does not separately identify kappa or tau, but it does identify their product. As we discuss in our 2008 paper, you can do Gibbs or Metropolis on this larger space and just re-compute sigma_a = kappa*tau and a = mu + kappa*xi at each iteration, and it works for both the problematic examples discussed above, along with everything in between! The bad news is that if you try this redundant parameterization in Stan or other Hamiltonian Monte Carlo (HMC)-based probabilistic programming languages, it will fail because there's no joint distribution in the probability space: HMC will just drift and never converge. And you can't even run non-convergent HMC on (kappa, tau, xi) and hope to get good inferences for the generated quantities (sigma_a, a),

4. Alternate between standard and alternative parameterizations:

The redundant parameterization includes the standard and alternative parameterizations as special cases (kappa=1 and tau=1, respectively). So, instead of jointly estimating (kappa, tau, xi), you can slide back and forth between the two parameterizations, which should give you something like 50% efficiency: not as good as the optimal parameterization for your problem but not getting stuck in the bad parameterization. This approach has two challenges: first, it makes the algorithm more complicated, especially given the problems of tuning the step size and mass matrix for HMC; second, in more complicated multilevel models with more than one grouping factor (as in the example that motivated this post!), you need to do this for each batch of parameters.

5. Decide which of the standard or alternative parameterizations to use based on a preliminary analysis of the data. As we shall see, this can work in practice--you can pick one parameterization, try it, and if you have convergence problems you can switch to the other--but it's awkward. And, again, this can run into combinatorial difficulties if you have multiple grouping factors.

6. Use a zero-avoiding and infinity-avoiding prior to rule out the bad areas of the funnel. This can work, but it's not always easy: There's no generic prior that will always solve the problem, and you have to be careful not to rule out zones in parameter space that you actually want to keep in the posterior.

7. Fit using the marginal posterior mode of the group-level variance parameter. The joint mode won't work (it has a pole at (sigma_a=0, a=mu); this is a well-known problem discussed in chapter 5 of BDA and elsewhere), but you can work with the marginal mode of sigma_a. That's what lme4 does. This works fine in the example that motivated this post, but, for harder problems, marginal-mode inference can be unstable. It often yields degenerate estimates, especially when the model includes varying slopes as well as intercepts, and even when the estimated variance parameters are not on the boundary, they can be noisy. These problems can be addressed with zero-avoiding priors (see our papers from 2013 and 2014), but then that's another required step. Indeed, one reason we wrote Stan in the first place was that lme4 was giving us problems when our models started to get complicated.

8. Other approximate algorithms. ADVI and Pathfinder are already programmed in Stan, and soon we'll have nested Laplace, which is pretty similar to lme4. These options could work well; I'll have to try them out. For now I'll just have to say they're untested, and, given that they are not trying to fully explore the posterior distribution, I expect they'll have problems.

9. New HMC implementations. Bob Carpenter has been talking about a new sampler that auto-adapts in a way so that it fits the funnel just fine. Maybe this will be in Stan soon and solve all our funnel problems?

10. Some combination of the above solutions. For example: start with a stable fast algorithm such as Pathfinder, include zero-avoiding and infinity-avoiding priors as necessary, run the new adaptive HMC algorithm, also be willing to change the parameterization if it's still running slowly. This would not represent a perfectly automatic solution but could still be a reliable workflow.

11. Resolve computing problems by simplifying the model. A big challenge with the funnel is when the unexplained group-level variance is statistically indistinguishable from zero. That's what happens in the 8 schools example. One solution here is to just set the group-level scale to zero--that is, to remove that batch of parameters from the model entirely. That's equivalent to partially pooling the group estimates all the way to the fitted model: in this setting, the model is difficult to fit in part because there's no much there to be estimated, so why not simplify? In other cases, you might be uncomfortable partially pooling all the way down, so you could pin the group-level scale to some small value--for example, sigma_a=5 in the 8 schools model--which will get rid of the funnel problem. The drawback to this approach is that it requires further human intervention and also is changing the model you're fitting.

6. Fitting the model using Stan

To return to our main thread . . . we're trying to fit the model with respondent and item effects. We had problems with rstanarm so let's try using straight-up Stan which will give us more flexibility in coding.

We put our first try into a file, park_1.stan:

data {
  int<lower=0> N, J, K, L;
  array[N] int y;
  array[N] int respondent;
  array[N] int item;
  matrix[N,L] X;
}
parameters {
  vector[L] b;
  real<lower=0> sigma_respondent, sigma_item;
  vector<offset=0, multiplier=sigma_respondent>[J] a_respondent;
  vector<offset=0, multiplier=sigma_item>[K] a_item;
}
model {
  a_respondent ~ normal(0, sigma_respondent);
  a_item ~ normal(0, sigma_item);
  y ~ bernoulli_logit(X*b + a_respondent[respondent] + a_item[item]);
}

This uses the alternative parameterization, scaling the respondent and item effects in proportion to their scale parameters, so I'm pretty sure that it will fail, just as the version from rstanarm failed. But we'll check.

In R, we set up the design matrix and the data list to be sent to Stan:

X <- cbind(rep(1,N), male_name, white_name, n_skipped_full, order_of_response)
stan_data <- list(N=N, J=J, K=K, L=ncol(X), y=y, respondent=respondent, item=item, X=X)

Then we compile and run; to be on the safe side we just do 100 warmup and 100 saved iterations per chain and set max_treedepth=5:

library("cmdstanr")
park_1 <- cmdstan_model("park_1.stan")
fit_1 <- park_1$sample(data=stan_data, parallel_chains=4, iter_warmup=100, iter_sampling=100, max_treedepth=5)
print(fit_1)

Here's the (bad) result:

         variable      mean    median      sd    mad        q5       q95 rhat ess_bulk ess_tail
 lp__             -23496.85 -22783.60 1468.31 332.40 -26998.40 -22548.48 2.95        4       12
 b[1]                 -1.14     -1.14    0.18   0.23     -1.37     -0.79 2.41        5       21
 b[2]                  0.03      0.03    0.02   0.02      0.00      0.07 1.02       69      175
 b[3]                  0.03      0.03    0.03   0.03     -0.01      0.08 1.09       61      117
 b[4]                  0.03      0.02    0.00   0.00      0.02      0.03 1.19       15       48
 b[5]                 -0.01     -0.01    0.02   0.02     -0.04      0.01 2.88        4       12
 sigma_respondent      0.06      0.06    0.01   0.01      0.05      0.07 2.18        5       12
 sigma_item            1.08      1.16    0.28   0.28      0.55      1.39 3.07        4       14
 a_respondent[1]      -0.01     -0.01    0.05   0.04     -0.10      0.06 3.06        4       12
 a_respondent[2]      -0.03     -0.03    0.06   0.09     -0.12      0.05 3.94        4       12
. . .

It still has problems if you run longer. For example this is what you get if you do the default 1000 warmup plus 1000 saved iterations per chain, again setting max_treedepth=5:

         variable      mean    median     sd   mad        q5       q95 rhat ess_bulk ess_tail
 lp__             -15146.77 -15098.70 133.90 77.98 -15415.81 -15004.80 1.60        6       11
 b[1]                 -2.66     -2.71   0.34  0.24     -3.12     -1.96 1.55        8       43
 b[2]                  0.08      0.08   0.03  0.03      0.02      0.12 1.09       31      463
 b[3]                  0.07      0.07   0.03  0.04      0.02      0.13 1.24       12       76
 b[4]                  0.04      0.04   0.01  0.00      0.03      0.04 1.08      186      616
 b[5]                  0.01      0.01   0.00  0.00      0.01      0.02 1.04      444     1399
 sigma_respondent      1.92      1.94   0.07  0.06      1.80      2.01 1.54        7       12
 sigma_item            2.40      2.35   0.31  0.28      1.99      3.00 1.22       14       64
 a_respondent[1]       0.21      0.32   1.80  1.79     -2.93      3.08 1.23       12       25
 a_respondent[2]      -0.41     -0.43   0.67  0.69     -1.48      0.63 1.03      144     1040
. . .

The estimated group-level variances are big here, so it makes sense to use the standard parameterization, which is actually simpler to code in Stan. Here it is, in the file park_2.stan:

data {
  int<lower=0> N, J, K, L;
  array[N] int y;
  array[N] int respondent;
  array[N] int item;
  matrix[N,L] X;
}
parameters {
  vector[L] b;
  real<lower=0> sigma_respondent, sigma_item;
  vector[J] a_respondent;
  vector[K] a_item;
}
model {
  a_respondent ~ normal(0, sigma_respondent);
  a_item ~ normal(0, sigma_item);
  y ~ bernoulli_logit(X*b + a_respondent[respondent] + a_item[item]);
}

We'll fit this to our data, running 1000 warmup plus 1000 saved iterations per chain, max_treedepth=5:

park_2 <- cmdstan_model("park_2.stan")
fit_2 <- park_2$sample(data=stan_data, parallel_chains=4, max_treedepth=5)
print(fit_2)

It takes 100 seconds to run on my laptop, and here's what comes out:

         variable      mean    median    sd   mad        q5       q95 rhat ess_bulk ess_tail
 lp__             -16718.24 -16717.70 39.66 39.88 -16783.80 -16652.80 1.02      324      921
 b[1]                 -2.64     -2.58  0.50  0.52     -3.49     -1.82 1.43        8       13
 b[2]                  0.07      0.07  0.03  0.03      0.02      0.12 1.01      425     1137
 b[3]                  0.08      0.08  0.03  0.03      0.03      0.13 1.01      557     1135
 b[4]                  0.04      0.04  0.01  0.01      0.03      0.04 1.01      175      413
 b[5]                  0.01      0.01  0.00  0.00      0.01      0.02 1.00      863     1314
 sigma_respondent      1.96      1.96  0.04  0.04      1.89      2.03 1.04      145      280
 sigma_item            2.53      2.49  0.38  0.37      2.00      3.21 1.01      648      896
 a_respondent[1]      -0.39     -0.35  1.76  1.74     -3.41      2.49 1.01      417      956
 a_respondent[2]      -0.36     -0.32  0.70  0.66     -1.58      0.71 1.01      509      548
. . .

Not so nice! We can run it in full default setting (4 chains, 1000 warmup plus 1000 saved iterations per chain, max_treedepth=10):

fit_2 <- park_2$sample(data=stan_data, parallel_chains=4)
print(fit_2)

It now isn't too bad:

         variable      mean    median    sd   mad        q5       q95 rhat ess_bulk ess_tail
 lp__             -16714.93 -16715.20 39.87 39.88 -16780.50 -16650.59 1.00     1213     1805
 b[1]                 -2.53     -2.53  0.51  0.48     -3.40     -1.72 1.02      195      233
 b[2]                  0.07      0.07  0.03  0.03      0.02      0.12 1.00     6622     3208
 b[3]                  0.08      0.08  0.03  0.03      0.03      0.14 1.00     7494     2747
. . .

But it took 8 minutes to run. 8 minutes!

If you know what model you want to fit, and you're not worried about any problems in the data or whatever, then, sure, 8 minutes is fine, it's basically nothing. It took me a lot more than 8 minutes just to write material in this section of this post.

But if you're not sure exactly what you're planning to do, if you want to explore different models, if you want to do the usual Bayesian workflow, then you don't want to wait minutes. Not if you can avoid it.

7. A model with 3 constant terms

So what's going wrong here? We used the standard parameterization for the multilevel model, which should work better when the estimated group-level variance is large--and indeed we're now seeing better mixing for sigma_respondent and sigma_item, the group-level variance parameters--but now there's a problem with b[1], the coefficient for the constant term in the regression.

The intercept b[1] is identified in the model. The respondent effects and item effects both have distributions with mean 0. To put it another way, suppose the model were set up with:

  a_respondent ~ normal(mu_respondent, sigma_respondent);
  a_item ~ normal(mu_item, sigma_item);

Then the posterior distribution would be improper: the parameters b[1], mu_respondent, and mu_item would not be identified. To put it another way, the model would have a translation invariance, as you could add any arbitrary constants to mu_respondent and mu_item and subtract these constants from b[1] without changing the target function. By setting mu_respondent and mu_item to zero:

  a_respondent ~ normal(0, sigma_respondent);
  a_item ~ normal(0, sigma_item);

we've removed this invariance: the respondent and item effects are softly constrained toward zero. The constraint is "soft" rather than "hard" on the a_respondent and a_item parameters because the hard constraint is on the mean of their distribution; conditional on sigma_respondent and sigma_item, the respondent and item effects have informative priors centered at zero.

But . . . the soft constraint is not enough in this problem. Or, to put it more precisely, it's enough to identify the parameters but not enough to ensure smooth computation. In this model there are high posterior correlations between b[1], sum(a_respondent), and sum(a_item), and it's hard for our Hamiltonian Monte Carlo in its default settings to move efficiently in this space. Stan does some adaptation, but I think the adaptation is on the diagonal of the mass matrix so it doesn't correct for the correlations.

What to do?

I mentioned this problem to some colleagues, and they suggested using Stan's new sum_to_zero_vector type, as explained in this case study from Mitzi Morris. Here's how it works in the No Vehicles in the Park example--I put it in the file park_3.stan:

data {
  int<lower=0> N, J, K, L;
  array[N] int y;
  array[N] int respondent;
  array[N] int item;
  matrix[N,L] X;
}
parameters {
  vector[L] b;
  real<lower=0> sigma_respondent, sigma_item;
  sum_to_zero_vector[J] a_respondent;
  sum_to_zero_vector[K] a_item;
}
model {
  a_respondent ~ normal(0, sigma_respondent);
  a_item ~ normal(0, sigma_item);
  y ~ bernoulli_logit(X*b + a_respondent[respondent] + a_item[item]);
}

Here's the R code:

park_3 <- cmdstan_model("park_3.stan")
fit_3 <- park_3$sample(data=stan_data, parallel_chains=4, max_treedepth=5)
print(fit_3)

It runs in 100 seconds with this perfect mixing:

        variable      mean    median    sd   mad        q5       q95 rhat ess_bulk ess_tail
 lp__             -16714.82 -16714.70 38.92 38.99 -16779.30 -16651.30 1.00     1137     1894
 b[1]                 -2.60     -2.60  0.06  0.06     -2.70     -2.50 1.00     2237     2939
 b[2]                  0.07      0.07  0.03  0.03      0.02      0.12 1.00     7398     3227
 b[3]                  0.08      0.08  0.03  0.03      0.03      0.13 1.00     9024     2887
 b[4]                  0.04      0.04  0.01  0.01      0.03      0.05 1.00     1426     2576
 b[5]                  0.01      0.01  0.00  0.00      0.01      0.02 1.00     4123     3391
 sigma_respondent      1.95      1.95  0.04  0.04      1.89      2.02 1.00     2023     2919
 sigma_item            2.48      2.44  0.38  0.36      1.96      3.14 1.00     8438     2806
 a_respondent[1]      -0.33     -0.28  1.72  1.75     -3.20      2.47 1.00     2998     2964
 a_respondent[2]      -0.36     -0.34  0.67  0.67     -1.51      0.71 1.00     5091     2466
. . .

That's a win!

There is some awkwardness to this model: with the sum of the respondent and item effects constrained to zero, some extra mathematical effort is required when making inferences for new items and new respondents.

8. Two simulation experiments

I'm curious about a couple of things here. First, is this weak-identification thing a problem with this particular dataset or would it arise with data simulated from the model? Second, yeah yeah I get it that the correlations cause a problem with the non-sum-to-zero model, but does this also have something to do with the nonlinearity of the logit?

We can answer--or, at least, study--both these questions using simulation, first by simulating data from the fitted multilevel logistic model, second by fitting the normal rather than logit model.

First the simulation. We'll use the simulated data y_sim from Section 3.2 above and then try fitting all these models:

stan_data_sim <- stan_data
stan_data_sim$y <- y_sim
fit_1_sim <- park_1$sample(data=stan_data_sim, parallel_chains=4, max_treedepth=5)
print(fit_1_sim)
fit_2_sim <- park_2$sample(data=stan_data_sim, parallel_chains=4, max_treedepth=5)
print(fit_2_sim)
fit_3_sim <- park_3$sample(data=stan_data_sim, parallel_chains=4, max_treedepth=5)
print(fit_3_sim)

Yeah, I know my code is ugly. Feel free to clean it for your own use.

Anyway, here's what happens:
- As before, all three models took 100 seconds on my laptop.
- park_1, the model with the alternative parameterization for the group-level variance, again shows poor mixing. Not as bad as with the real data, but still unacceptable for most practical purposes.
- park_2, the model with the natural parameterization, does better but still shows poor mixing for the intercept, b[1], with an R-hat of 1.3.
- park_3, the model with the sum_to_zero constraint, has essentially perfect convergence.

So the problem was with the model, not just with the data.

Next the normal model. We could simulate data from the normal distribution, but let's just try fitting the same item-response model as before, just with a normal error term. This seems weird given that the data are binary, but we can do it. Here's the new model, park_1_normal.stan:

data {
  int<lower=0> N, J, K, L;
  vector[N] y;
  array[N] int respondent;
  array[N] int item;
  matrix[N,L] X;
}
parameters {
  vector[L] b;
  real<lower=0> sigma_respondent, sigma_item, sigma_y;
  vector[J] a_respondent;
  vector[K] a_item;
}
model {
  a_respondent ~ normal(0, sigma_respondent);
  a_item ~ normal(0, sigma_item);
  y ~ normal(X*b + a_respondent[respondent] + a_item[item], sigma_y);
}

This is the same as before, except: (a) we changed the specification of y from integer array to vector, (b) we added sigma_y as a parameter, and (c) we changed the data model from bernoulli_logit to normal.

We create park_2_normal.stan and park_3_normal.stan in the same way, then we compile and run all three:

park_normal <- list(cmdstan_model("park_1_normal.stan"),
                    cmdstan_model("park_2_normal.stan"),
                    cmdstan_model("park_3_normal.stan"))
fit_normal <- as.list(rep(NA, 3))
for (i in 1:3){
  fit_normal[[i]] <- (park_normal[[i]])$sample(data=stan_data, parallel_chains=4, max_treedepth=5)
  print(fit_normal[[i]])
}

Each one of these takes 70 seconds on my laptop--computing the gradient for the normal is a bit faster than for the logistic function, but for whatever reason it doesn't save much in compute time.

How did the computation go? Almost the same as before. Model 1 had terrible mixing, model 2 had poor mixing, but model 3 had some issues too:

        variable     mean   median    sd   mad       q5      q95 rhat ess_bulk ess_tail
 lp__             38745.21 38745.60 35.71 36.03 38685.09 38801.40 1.01      408      760
 b[1]                 0.20     0.20  0.00  0.00     0.19     0.21 1.03      114      392
 b[2]                 0.00     0.00  0.00  0.00     0.00     0.01 1.00      386     1073
 b[3]                 0.01     0.01  0.00  0.00     0.00     0.01 1.01      570      973
 b[4]                 0.00     0.00  0.00  0.00     0.00     0.00 1.09       37       68
 b[5]                 0.00     0.00  0.00  0.00     0.00     0.00 1.02      216      643
 sigma_respondent     0.19     0.19  0.00  0.00     0.18     0.19 1.01      867     1550
 sigma_item           0.23     0.23  0.03  0.03     0.18     0.29 1.04      135      207
 sigma_y              0.30     0.30  0.00  0.00     0.30     0.31 1.00     3670     2827
 a_respondent[1]     -0.04    -0.04  0.13  0.13    -0.25     0.17 1.02      124      294
. . .

This time I think the problem was with the data.

We can check with a simple simulation experiment. First we simulate data from the above-fitted model, using the posterior medians of the parameters b, sigma_respondent, sigma_item, and sigma_y, then simulating the vectors a_respondent and a_item, then simulating new data from the normal model:

library("posterior")
sims <- as_draws_rvars(fit_normal[[3]]$draws())
a_respondent_sim <- rnorm(J, 0, median(sims$sigma_respondent))
a_item_sim <- rnorm(K, 0, median(sims$sigma_item))
b_sim <- median(sims$b)
sigma_y_sim <- median(sims$sigma_y)
y_sim <- rnorm(N, a_respondent_sim[respondent] + a_item_sim[item] + X %*% b_sim, sigma_y_sim)

We then fit the three Stan models to this simulated dataset:

stan_data_sim$y <- y_sim
fit_normal_sim <- as.list(rep(NA, 3))
for (i in 1:3){
  fit_normal_sim[[i]] <- (park_normal[[i]])$sample(data=stan_data_sim, parallel_chains=4, max_treedepth=5)
  print(fit_normalsim[[i]])
}

Again, each model runs in about 70 seconds. Model 1 mixes terribly, model 2 again has problems with b[1], model 3 mixes ok, not great. Actually there's poor mixing for b[4], which is the coefficient for n_skipped. I'm really not sure why, also not sure why there's a problem for the normal model and not the binomial. That's why we do experiments--because things don't always go as expected! If you run it at default settings, mixing is fine but then it does take several minutes.

9. Summary of results

OK, we've successfully fit the model. I could talk more about summarizing the inferences for the 27 items and the 2409 respondents, and about checking the fit using posterior predictive checks, but the focus of today's post is computing.

With computing, then, where are we? What's the takeaway?

• For this example, the model fits just fine using glmer() in the lme4 package.

• lme4 computes the maximum marginal likelihood estimate of the variance parameters, averaging over the regression coefficients. Because this is a multilevel model, it doesn't work to try to maximize the joint likelihood (or find the joint posterior mode); the math on this is in chapter 5 of BDA.

• For this example, the model fails with stan_glmer() in the rstanarm package.

• For this example, the model fits fine in Stan--if you use sum_to_zero_vector.

• When the model isn't fitting well, it makes sense to run with fewer iterations and max_treedepth=5 so that you're not spinning all your cycles for fits that aren't working anyway.

• Despite the above experience for this particular dataset, we can't always recommend lme4, as it has problems when group-level variance parameters are near zero, and it has problems for varying-intercept, varying slope models.

• Surprisingly, Stan converged less well with the normal model than the binomial model.

• We hope that future developments in Stan will make this process easier:

- Nested Laplace should give us the functionality of lmer4 in a more flexible setting, not restricted to the classes of models included in that package and also allowing priors for the variance parameters

- Better adaptation should allow us to reduce the number of default iterations and also avoid the current problem where adaptation goes really slowly for the most poorly-specified models, in violation of the fail fast principle.

- Bob Carpenter told me they have a new variant of Nuts that does some sort of local adaptation which automatically works for the funnel. So then we won't need to choose between the two parameterizations.

10. Other software options

So far, we've tried lme4, rstanarm, and Stan. There are a few more options we can try.

There's INLA, which I think has pretty much the same strengths and weaknesses as lme4 for this problem, so I won't try to figure out how to use it here.

There's fast Nono, which would swallow this problem whole--but I think it only works for normal models, not logistic.

There's brms, which I think has pretty much the same strengths and weaknesses as rstanarm for this problem, so I won't try to figure out how to use it here.

And then there's variational inference. Stan has two variational methods--ADVI and Pathfinder. I'll run them in their default settings from cmdstanr. To stay focused, let's just look at model 3, the parameterization that fit well in Stan:

advi_3 <- park_3$variational(data=stan_data)
pathfinder_3 <- park_3$pathfinder(data=stan_data)

ADVI in its default setting takes 4 seconds, Pathfinder takes 70 seconds. They both yield estimates that are in the ballpark, and they both understate uncertainty compared to the model fit in full Stan, which I include below for easy comparison:

> print(advi_3)
         variable      mean    median    sd   mad        q5       q95
 lp__             -19315.46 -19312.10 59.06 60.49 -19416.52 -19222.58
 lp_approx__       -1220.23  -1219.59 35.18 35.89  -1278.46  -1161.80
 b[1]                 -2.70     -2.70  0.01  0.01     -2.72     -2.68
 b[2]                  0.06      0.06  0.03  0.02      0.02      0.10
 b[3]                  0.09      0.09  0.06  0.06     -0.01      0.19
 b[4]                  0.01      0.01  0.00  0.00      0.01      0.02
 b[5]                  0.00      0.00  0.00  0.00      0.00      0.00
 sigma_respondent      2.08      2.08  0.03  0.03      2.03      2.13
 sigma_item            2.62      2.57  0.37  0.36      2.07      3.29
 a_respondent[1]      -0.80     -0.81  0.95  0.99     -2.38      0.82
. . .
> print(pathfinder_3)
         variable      mean    median   sd  mad        q5       q95
 lp_approx__       -1250.91  -1250.91 0.00 0.00  -1250.91  -1250.91
 lp__             -18120.00 -18120.00 0.00 0.00 -18120.00 -18120.00
 b[1]                 -2.23     -2.23 0.00 0.00     -2.23     -2.23
 b[2]                  0.07      0.07 0.00 0.00      0.07      0.07
 b[3]                  0.01      0.01 0.00 0.00      0.01      0.01
 b[4]                  0.04      0.04 0.00 0.00      0.04      0.04
 b[5]                  0.01      0.01 0.00 0.00      0.01      0.01
 sigma_respondent      1.36      1.36 0.00 0.00      1.36      1.36
 sigma_item            2.13      2.13 0.00 0.00      2.13      2.13
 a_respondent[1]      -0.03     -0.03 0.00 0.00     -0.03     -0.03
. . .
> print(fit_3)
        variable      mean    median    sd   mad        q5       q95 rhat ess_bulk ess_tail
 lp__             -16714.82 -16714.70 38.92 38.99 -16779.30 -16651.30 1.00     1137     1894
 b[1]                 -2.60     -2.60  0.06  0.06     -2.70     -2.50 1.00     2237     2939
 b[2]                  0.07      0.07  0.03  0.03      0.02      0.12 1.00     7398     3227
 b[3]                  0.08      0.08  0.03  0.03      0.03      0.13 1.00     9024     2887
 b[4]                  0.04      0.04  0.01  0.01      0.03      0.05 1.00     1426     2576
 b[5]                  0.01      0.01  0.00  0.00      0.01      0.02 1.00     4123     3391
 sigma_respondent      1.95      1.95  0.04  0.04      1.89      2.02 1.00     2023     2919
 sigma_item            2.48      2.44  0.38  0.36      1.96      3.14 1.00     8438     2806
 a_respondent[1]      -0.33     -0.28  1.72  1.75     -3.20      2.47 1.00     2998     2964
. . .

Actually, Pathfinder shows no uncertainty at all, and it also returns the warning, "Pareto k value (9.2) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor."

ADVI is so fast that maybe it could be good as a starting point for Stan's default NUTS algorithm.

Let's try it! First, Model 1:

advi_1 <- park_1$variational(data=stan_data)
fit_after_advi_1 <- park_1$sample(data=stan_data, init=advi_1, parallel_chains=4, max_treedepth=5)
print(fit_after_advi_1)

After the usual hundred seconds, here's what we see:

         variable      mean    median    sd   mad        q5       q95 rhat ess_bulk ess_tail
 lp__             -15074.93 -15074.55 50.84 52.11 -15158.91 -14991.80 1.02      193      436
 b[1]                 -2.54     -2.52  0.46  0.48     -3.35     -1.86 1.40        8       19
 b[2]                  0.07      0.07  0.03  0.03      0.02      0.13 1.03      335      782
 b[3]                  0.08      0.08  0.03  0.03      0.03      0.14 1.01      345      569
 b[4]                  0.04      0.04  0.01  0.01      0.03      0.04 1.04       77      138
 b[5]                  0.01      0.01  0.00  0.00      0.01      0.02 1.01      594     1121
 sigma_respondent      1.96      1.95  0.04  0.04      1.89      2.02 1.02      187      357
 sigma_item            2.49      2.41  0.38  0.34      2.01      3.24 1.07       51      107
 a_respondent[1]      -0.36     -0.26  1.71  1.77     -3.33      2.35 1.01      293      648
 a_respondent[2]      -0.31     -0.27  0.70  0.69     -1.57      0.75 1.01      388      583
. . .

It's not magic--there are still problems with mixing--but it's not bad, either. Similarly when we do with model 2. So for this problem I think that the best option is to run full Stan with max_treedepth=5 and using ADVI as a starting point. Ideally with model 3, but even with one of the poorly-mixing models we still end up with a reasonable answer. It's definitely a better option than running for 8 minutes and still not mixing.

11. Conclusion

It's not always so easy to fit multilevel models. And this model was a pretty easy one! Granted, it had non-nested groupings, but other than that it was simplicity itself: logistic regression with a few predictors and two batches of varying intercepts. For our data and model, lme4 did the job just fine--but we know that in harder problems lme4 can fail. Unfortunately, rstanarm did not work well at all--it was a problem with the parameterization used for the group-level scale parameter. Stan worked, but to avoid it taking too long, we needed a few tricks: (a) use of the right parameterization for those scale parameters, (b) the sum-to-zero constraint on the batches of varying intercepts, (c) setting max_treedepth to a level below its default of 10. Using ADVI as a starting point could make sense too.

This can be taken as a success story (we could fit the model we wanted! using full Bayes! in a reasonable amount of time!) or as a bummer (we needed to work so hard! even for this simple model!).

The next time I work on this sort of problem I'll start with lme4. Often it fails, but when it does work, it's fast, and it's a good place to start. lme4 will rarely be my final step, though, first because I'd like to more fully account for uncertainty, and second because lme4 only works on a limited class of models.

Looking at this problem from an applied perspective, we're coming into the data analysis wanting to estimate ideal points for respondents and items. We'll also want to check the fit of the model and then use it to assess some of the claims made by Dan Luu that got us thinking about this problem in the first place (see the link at the very top of this post). And we'd probably be inclined to elaborate on the model, going far beyond whatever happens to be programmed in lme4, rstanarm, and brms.

It almost makes me want to just give up! But then I remember that all this effort is still easier than the alternative of trying to cobble together a non-model-based data summary. Also the models we're fitting here have the advantage of also working for unbalanced data, which is often an issue in item-response or ideal-point estimation. Finally, doing all the above research should be helpful for me and others working on future problems.

12. How to write this up in a way to be most useful to readers?

When writing up this sort of experience, I'm always torn between making it look clean or showing all the struggles. On one hand, the struggles are important and they give a feel of our workflow. On the other hand, as a reader you could end up getting lost in all the details. I guess the right thing to do is to first present the clean version with the best solution (in this case, that would be lme4), then discuss what to do in harder problems (in this case, just jump to the Stan program that I'm calling Model 3), and only then go back and discuss the full workflow with all the guesses and experimentation.

Perhaps when writing this up more formally I'll do those steps. For now, I'll keep it as is. It already took me many hours to get this far. And I don't think this is the final version. Indeed, I'm hoping that this post will motivate Bob, Charles, and others to put in some improvements to Stan that will allow everything to run much more smoothly! But even after this particular problem gets solved using default software, there will be new problems. The workflow of computation, simulation, and experimentation will continue.

Analytics positions for the Miami Marlins (baseball)

Cross-posted from the Stan forums jobs board.

Work with Alex Andorra

Alexandre Andorra, of Learning Bayesian Statisticspodcast and PyMC/ArviZ fame, is now working for the Miami Marlins. If you don’t know Alex’s podcast, I’d encourage you to check it out—he has interviewed a number of the Stan devs, including two live interviews during the last StanCon.

All new Marlins

As those of you who follow baseball know, the Marlins were an expansion team in the early 90s who had some early success, but have been underperforming recently. Look for that to turn around because Peter Bendix, the former general manager of the Tampa Bay Rays, the most analytics-oriented team in the sport, has moved to become president of baseball operations and general manager in Miami.

The analytics job openings

The Marlins are recruiting for two positions:

Both jobs have the following line item in the Skill Requirements section:

Experience with probabilistic programming languages preferred (Stan, PyMC)


Elections for the Stan Governing Body 2025

(this post is by Charles)

Calling all members of the Stan community to action!

We’re renewing the Stan Governing Body. The SGB co-organizes StanCon and related events (such as StanConnect!), works on initiatives to fund developers, and more generally helps set the directions of the project.

From the Stan forum:

We’re renewing the Stan Governing Body (SGB) with all 5 seats up for grabs. Current SGB members may still run, however they are not guaranteed to preserve their seats. As in previous years, we will use the Stan forum for nominations.

Please respond to this post to self-nominate for either a 1-year or 2-year term on the SGB before April 14th. We encourage all nominees to briefly summarize their experiences with Stan and their goals for the SGB. Feel free to add links to any content you think is relevant. And if you know someone who you think would be a good fit for the SGB please let them know and encourage them to nominate themselves.

You can find more details on the original post.

 

On a more personal note…

On a more personal note, I have decided to not run for a third term. Serving for two years has been an immense privilege.

It doesn’t take much reflection to realize what an integral part of my career Stan has been (and continues to be). I discovered Stan during my first job out of college at Metrum Research Group: I was contributing to the C++ library and building features which would enable applications in Pharmacometrics. My first pull request was the matrix exponential function in 2016. My first research poster was titled “Stan functions for pharmacometrics.” My first conference talk was at StanCon 2017, aka the inaugural StanCon. The project and the people working on it encouraged me—even empowered me—to pursue a PhD in Statistics, which I did at Columbia with, as my advisor, the illustrious stranger who created this blog. Throughout the years, Stan has connected me to people, both in statistics and in many applied domains.

I must recognize that, as I dove deeper into academia, I started dedicating less and less time to Stan. I remember during my first year in grad school proposing, as a final class project, adding a new feature to Stan. The prof told that me that such work was important for the scientific community but that I needed to find something that was conceptually more substantive. I was also given a clear signal that the PhD program was to train researchers, not engineers.

It’s unfortunate that incentives in academia often misalign with the goals of developing (and maintaining!) high-quality open-source software—at least in the short term. What the creators of Stan pulled off in an academic context was remarkable and unorthodox. In the end, it took an international collaboration across academia and industry to carry the project. I can’t do everyone justice, so I’ll refer you to the list of developers.

I have often felt torn between my work as a researcher and as an engineer. Sure, sometimes you can align the goals: I did leverage that final class project to build a prototype feature in Stan and wrote a research paper on it. But this anecdote strikes me more as the exception than the rule. Even after I started working at the Flatiron Institute—an institution that champions the development of scientific software and hires full-time software engineers—I only devoted so much time to Stan. My colleague Brian Ward jokes that during my almost 3 years here, he has never seen me write any C++ (and I almost haven’t). I’m embarrassed by how much time it’s taken me to write up documentation on our suite of HMM functions; I’m not very active on the Stan forum; and it is now a StanCon tradition to have someone publicly bash me for not finishing Stan’s integrated Laplace approximation (this month I hope—Steve Bronder and I are one unit test away from completing the C++ pull request. Steve has done a lot of work to create a clean user interface.)

The reality is that I’ll never get to work full time on Stan like I did before the PhD. And I suppose that’s ok. I enjoy the “conceptual” work—that is the more methodological/theoretical research that I’ve been doing, and I trust that some of it is useful to Stan users and to the broader Bayesian modeling community. I’m actually very fond of my non-Stan collaborators—if you can believe it! But the feeling of not doing enough for Stan… yeah, that’s a real thing.

Two years ago, when the announcement for the SGB election was posted, I saw an opportunity to devote more time to the Stan project in a structured manner. The SGB does a lot and I focused on bringing back StanCon. Concretely, I co-organized StanCon’23 and StanCon’24, and laid the groundwork for StanCon’26 (yes, we’re taking a one-year break). I liked working on these conferences. Sure, it’s work, but a lot of people generously contribute their time, and if the tasks are properly delegated, it all becomes very manageable. Ultimately, it’s very rewarding to see a vision brainstormed over a zoom call come to life when we all gather, say, at a pub in Oxford, and streams of colleagues we haven’t see in months, sometimes years, keep pouring in and gathering around a large table.

I also believe that StanCon is the best applied Bayesian conference out there. Period. And I think its participants benefit immensely from attending—whether by learning a lot from the tutorials or exchanging ideas with top experts. We’re a community of doers.

I hope the next batch of SGB members will tackle this opportunity with the same ambition and pride that the past bodies have displayed. And even as some of us move on, we’ll be here to insure a smooth transition and provide support where we can.