StanBio Connect 2025

This is Eric.

It’s hard to follow Jesus, but here we are.

Vianey Leos Barajas and I are organizing StanBio Connect 2025 conference: https://stanbio.org

StanBio is a free, one-day online conference that will take place on Friday, 30 May 2025 from 9 am to 4 pm ET. We are interested in broad applictions of Stan in biomedicine including drug discovery and development, bioinformatics, medical devices, health economics, real-world evidence, and decision-making under uncertainty. If you are developing Stan models and want to show off your work, we hope you would consider submitting an abstract by 15 April. We are planning for keynotes to be about 45 minutes and regular talks to be about 20 minutes long.

Stay tuned for the program to be released in a month or so.

We are also looking for sponsors, with the money going directly to Stan/NumFocus to help pay for the conference and to support the development of Stan. If your organization is using Stan, please consider supporting the project by sending a note to eric dot novik at nyu dot edu.

Better priors for AR, ARX, LTX, DR, MA, ARMA, and VAR models

We arXived last year a paper The ARR2 prior: flexible predictive prior definition for Bayesian auto-regressions” by David Kohns, Noa Kallionen, Yann McLatchie, and me (Aki). Last week, we uploaded a revised version.

The idea of the paper is to have a prior that is predictively consistent, that is, if we add more terms (e.g., lags and covariates) to the model, the prior predictive distribution should not change substantially. This can be done by setting the prior on R-squared and using variance decomposition inspired by R2D2 prior. Using such priors allows using big models without fear of overfitting, there is no need to try to select the number of lags, and the information from the data is more efficiently used. If needed, a smaller model can be selected afterward considering the model selection as decision problem given the big model (see, e.g. Using reference models in variable selection and Advances in projection predictive inference).

The first version included the derivations for AR, ARX, LTX, and DR models. The revision adds results about implied priors on characteristic roots and partial autocorrelations, case study with quasi-cyclical behavior, and extensions to MA, ARMA, ARDL, and VAR models. The paper and associated git repo include Stan code.

Generalized linear neural network models

This is Bob.

Are neural nets the future of regression?

Andrew was visiting Flatiron last Friday (really last Friday, not six months ago), and I was asking the question that’s been on my mind lately: will neural networks put regression modelers out of work?

Andrew hired me and Matt Hoffman in 2010 to work out how to specify and fit hierarchical regression models with interactions. He wanted to create a system that automatically added interactions, non-linearities, etc., guided by a vaguely conceived “topology of models.” This is a combinatorial nightmare, even with a handful of covariates and non-linearities, and not even considering continuous variation in things like priors.

Black box non-linear function approximations

Fast forward 15 years and regressions from neural networks are ubiquitous. Rather than specifying interactions, non-linearities, etc., we just let a highly overparameterized deep neural net sort it out. This idea of black-box, non-linear function approximation is not new. I first saw it with random forests (the Bayesian analogue of which is Bayesian additive regression trees) and more recently, gradient-boosted decision trees (the go-to method in Kaggle competitions).

Do we have enough data?

The only thing holding us back from using neural networks everywhere is limited data. It’s clear as our data sets get bigger that neural network regression works very well (see, e.g., LLMs, image recognition, and image generation systems, all of which fit largely black-box deep neural network models).

Uncertainty quantification

As we were talking about this, Andrew kept returning to uncertainty quantification. I somehow couldn’t convince him that we can do exactly the same thing as we are currently doing. There’s no fundamental difference between using a neural network and using bespoke hand-tooled covariate combinations—just different functions mapping the covariates to expected values.

Here’s a document explaining the connection

I didn’t have time to explain this to Andrew at the board, so I wrote it up as a document. This goes over how you can take a GLM and swap out the linear component for a neural network and then proceed as usual. It contains an example of a two-hidden layer perceptron model coded in Stan.

I’m always happy to get feedback if people have comments or suggestions. Keep in mind the purpose here is not a publication, but just explaining how a neural network can be swapped in for the linear function in a generalized linear model.

Treasure trove of forensic details in arXiv’s LaTeX source code

There’s gold in them thar hills*

When you submit a paper to arXiv, you send them a bundle including the LaTeX source, figures, etc. These are all available for download through the arXiv site. This morning, I was downloading the source** for the original Hoffman and Gelman no-U-turn sampler paper. If you want to follow along, ere’s the arXiv link, but you have to click through to the “TeX Source” link under the “Access Paper:” header on the top right side under the banner. What I found was a treasure trove of comments that never made it to the paper, some of which I will share below.

Examples

Returning to Hoffman and Gelman’s arXiv source LaTeX, what struck me was the following comment right after the algorithm itself.

%% Algorithm ?? is more efficient than algorithm
%% ??, but the policy of sampling uniformly from
%% $\cC$ leaves something to be desired. We would prefer to select an
%% element of $\cC$ that is farther away from the initial position
%% $\theta^t$, rather than face the possibility of performing many costly
%% gradient evaluations just to wind up choosing an element of $\cC$ that
%% is close to where we started. Algorithm ?? addresses
%% this issue by giving preference to points subtrees that do not include
%% the starting point $\{\theta^t, r^t\}$. [To do: explain why this is
%%   valid. Probably proof by induction is the easiest way to go.]

Neither the arXiv preprint nor the final JMLR paper have a clearly delineated inductive proof. In both versions, we get “this is equivalent to a Metropolis_Hastings kernel with proposal …, and it is straightforward to show that it obeys detailed balance” (second-to-last sentence on page 1604 of JMLR paper).

Presumably on the principle of minimizing surface area for reviewers to gripe, the following useful comment from the abstract didn’t make the final cut.

%% This issue is compounded when the
%% target distribution depends on a set of parameters that cannot be
%% updated by HMC (such as discrete parameters) and are updated
%% independently of the parameters updated by HMC. 
%% In this case, optimal settings of $L$ may change from iteration to
%% iteration.

Here’s another useful comment that wound up on the cutting-room floor. I’m not saying these should all be in the paper—usually there are so many things you can add and qualify that it requires some judgement. But for the dedicated and interested reader, the paper would have been more useful with the elided comments.

%% Even if we assume that there exists some transformation of the
%% parameter space under which all parameters are i.i.d. and that this
%% transformation can be applied cheaply (i.e. in $O(D)$ time, for
%% example using a low-rank transformation matrix to avoid the $O(D^2)$
%% cost of dense matrix multiplication and the $O(D^3)$ cost of dense
%% matrix inversion), the cost of obtaining an effectively independent
%% sample using RWM is still $O(D^2)$ \citep{Creutz:1988}. Gibbs also
%% requires $O(D^2)$ operations per effectively independent sample in
%% this setting, since it must update $D$ parameters and it must perform
%% a transformation costing $O(D)$ operations after each update.

There are also useful explanations of figures that never made the final cut, like this one, which expands the diagram from the one of “naive NUTS” in the paper figures to what the paper calls “efficient NUTS.”

%% %% %% Figure ?? illustrates how an iteration of NUTS might
%% %% %% proceed once the slice and initial momentum variables have been
%% %% %% resampled. Initially (a), we have only one node. We double the size of
%% %% %% the tree to two nodes by taking a single step forward (b), and since
%% %% %% the new point is valid tentatively set $w^{t+1}$ to that new point
%% %% %% (with probability $1/1=1$). We then redouble the size of the tree to
%% %% %% four nodes, taking two steps forward (c). Only one of the two new
%% %% %% nodes is valid, so the probability of choosing a node from the new
%% %% %% half-tree is $1/2$ (the ratio of the number of valid new nodes to
%% %% %% valid old nodes). In this example, we randomly choose to stick with
%% %% %% the old value of $w^{t+1}$. Next, we again double the size of the tree
%% %% %% by taking four steps backward from $w^-$ (d). We discover that the new
%% %% %% half-tree satisfies the stopping criterion, and so we cannot select
%% %% %% any points from it. Finally, we double the tree one more time, this
%% %% %% time going forward (e). This half-tree contains some valid points and
%% %% %% does not satisfy the stopping criterion, but a subtree of it does
%% %% %% satisfy the stopping criterion, so we invalidate the points in that
%% %% %% subtree. The number of valid points in this half-tree (3) is the same
%% %% %% as the number of valid points in the old half-tree (3), so we choose a
%% %% %% point uniformly at random from the new half-tree for $w^{t+1}$. At
%% %% %% this point, the end points $w^-$ and $w^+$ satisfy the stopping
%% %% %% criterion, and we return $w^{t+1}$ as the new position-momentum pair.

The following would have been nice.

%% Also, we should probably have a scatterplot showing target versus
%% realized criteria (mean acceptance probability, mean energy change)
%% that shows that the stochastic approximation scheme pretty much works,
%% and maybe a plot showing convergence speed.

There’s more where these came from—I was just cherry picking from the algorithm, abstract, intro, and conclusion.

Who knew?

I’ve never heard anyone mention diving into the source of papers, so I wonder just what’s out there to be mined. I also wonder how many authors realize that comments in their arXiv LaTeX are forever.


* An American idiom meaning there’s value to be found from exploring in a particular place; see the wikitionary for a definition and etymology.

** I downloaded the LaTeX source of Hoffman and Gelman’s paper in order to produce a ChatGPT(o1[plus]) translation of the efficient NUTS algorithm to Python. I need to code a similar algorithm for a new sampler we’re exploring and wanted to make sure I had understood the structure of the NUTS algorithm, because it’s a very subtle recursion. GPT continues to impress!

Softmax is on the log, not the logit scale

Bad Stan naming

I realized recently that we followed the confusing terminological convention of ML in our description of Stan’s categorical_logit function. In Stan, if there’s a suffix to a distribution, it describes the scale of one or more of the parameters. For example,

poisson_log(y | u) == poisson(y | exp(u)).

So when we write categorical(y | p) we take p to be a simplex (sequence of finite, non-negative values that sum to 1). So it would make sense that categorical_logit(y | logit(p)) would be equivalent, where logit(p) = log(p / (1 - p)). But that’s not how it works in Stan. Instead,

caetgorical_logit(y | u) = categorical(y | softmax(u)).

We made the same mistake everyone on ML makes in their variable naming! We call the u here “logits”, when in fact they’re (unnormalized [see below]) log probabilities. This is probably due to the fact that if u is a regression, then the resulting system is called “multinomial logistic regression.”

Example

The softmax function is defined by softmax(u) = exp(u) / sum(exp(u)). When used like this, the arguments to softmax are log probabilities, not logit probabilities. Here’s a little snippet of Python to illustrate (the style sheet is adding the extra space, not me, and I don’t want to fix it manually in this post with a hack because it’ll mess up the page if the style sheet is ever fixed).


>>> p = np.asarray([0.2, 0.5, 0.3])

>>> def logit(p): return np.log(p / (1 - p))
...

>>> logit_p = logit(p)

>>> log_p = np.log(p)

>>> sp.special.softmax(logit_p)
array([0.14893617, 0.59574468, 0.25531915])

>>> sp.special.softmax(log_p)
array([0.2, 0.5, 0.3])

This shows that for the round trip probabilities through softmax, the appropriate operation is the natural logarithm, not the logit function.

Origin of the confusion

So where did this confusion come from? Let’s look at a standard binary logistic regression. There we take


p(y | alpha, beta, x) = bernoulli(y | inv_logit(alpha + beta * x))

where

inv_logit(v) = exp(v) / (1 + exp(v)).

Writing inverse logit this way suggests how to write a logistic regression with a categorical distribution and softmax.


p(y | alpha, beta, x) = categorical(y | softmax([0, alpha + beta * x]))

that’s because

softmax([0, alpha + beta * x])
= [exp(0), exp(alpha + beta * x)] / (exp(0) + exp(alpha + beta * x))
= [1, exp(alpha + beta(x)] / (1 + exp(alpha + beta * x))
= [1 / (1 + exp(alpha + beta * x), exp(alpha + beta * x) / (1 + exp(alpha + beta * x)]
= [1 - inv_logit(alpha + beta * x), inv_logit(alpha + beta * x)],

This derivation shows that the probability of the categorical in this formulation returning 1 is inv_logit(alpha + beta * x). But this connection falls apart in the multinomial case when there are more than two outcomes.

In traditional frequentist K outcome multinomial logistic regressions, the first input to softmax is pinned to 0 for identifiability just as in the binary case.

softmax([0, u[2], ..., u[K1])
= [exp(0), exp(u[2]), ..., exp(u[K])] / (exp(0) + exp(u[2]) + ... + exp(u[K]))

This leads to asymmetry in the regression as we don’t have a regression for the first element. What it does do is make softmax and log proper inverses. If you reduce the choice to just the first category and some other category, then you get a standard binomial logistic regression again. But you still can’t round trip the multinomial case with logit, because


exp(u[2]) / (exp(0) + exp(u[2]) + ... + exp(u[K])) != inv_logit(u[2])

To see that this is still not going to produce logits in the multinomial case, here’s some more Python.


>>> log_p
array([-1.60943791, -0.69314718, -1.2039728 ])

>>> log_p_zero = log_p - log_p[0]

>>> log_p_zero
array([0. , 0.91629073, 0.40546511])

>>> sp.special.softmax(log_p_zero)
array([0.2, 0.5, 0.3])

So as you can see, softmax isn’t identified without pinning one of the values—we can add or subtract a constant from each element of the input and get the same value. But this still doesn’t turn the inputs to softmax into logits.

>>> def inv_logit(v): 1 / (1 + exp(-v))

>>> inv_logit(log_p_zero)
array([0.5 , 0.71428571, 0.6 ])

So you can see that the input 0.91629073 is not the logit of the probability even when pinning a value to zero to identify.

P.S. I really miss being able to write math on the blog and really hate that all my old posts with math no longer render. Maybe if Andrew reminds us why it went away, someone will have a suggestion on how to fix.

Call for StanCon 2025+

We are looking for volunteers to organize the next StanCons for 2025 and beyond!

If you are interested in making a proposal (or even just discussing the possibility of making one), please consider reaching out to [email protected].

To make a proposal, please submit this form. The Stan Governing Body will begin reviewing proposals on November 30th. For more information, see our complete post on the Stan forum.

The Red Sox are hiring

Here’s another job opportunity for baseball enthusiasts and Stan users! The Boston Red Sox are building out their R&D group and are currently hiring for the position of Senior Analyst, Baseball Analytics. Although the listed qualifications don’t specifically mention Stan, I’ve been told they have a particular interest in applicants who have experience fitting hierarchical models with Stan (or other probabilistic programming languages). Check out the job posting for all the details.

Probabilistic numerics and the folk theorem of statistical computing

U.S. election day is tomorrow. So let’s talk about something else:

1. Encoding prior information using non-generative modeling

I was talking with Hong Ge about the uses of non-generative models in probabilistic programming. An example I gave is the use of prior information on derived quantities. For example, suppose you have a logistic regression with a weak prior on the coefficients:

data {
  int N;
  array[N] int y;
  vector[N] x;
}
parameters {
  real a, b;
}
model {
  a ~ normal(0, 5);
  b ~ normal(0, 5);
  y ~ bernoulli_logit(a + b*x);
}

and you feed it some data x_i, y_i, i=1,…,N. And then you want to add additional information on predicted values. Perhaps, for example, you think the predicted probability for observations 1, 2, and 3 is likely to be close to 50%. Then you could add something like this to the model block:

  vector[N] prob;
  prob = inv_logit(a + b*x);
  prob[1:3] ~ normal(0.5, 0.2);

I’m purposely making this code kinda clunky and purposely using a model that’s not completely well specified (p is constrained to between 0 and 1, but the specified normal prior has no such constraint) just to show how direct this process can be. Also I haven’t actually programmed the model and checked it, so my code could have a bug! Anyway, the point is that we can just throw in this prior information wherever we want, not just on the official “parameters” of the model.

As noted above, the resulting Stan program does not correspond to a generative model! The code produces a target function that is a constant plus a log posterior density, log p(a,b|y), implicitly conditional on the unmodeled data x and N, without separately defining a prior distribution p(a,b) or a data distribution p(y|a,b). There’s no direct way to sample from the prior. The way I like to think of it is that the prior information on those predicted probabilities represent additional data.

2. Replacing hard constraints by soft constraints

This sort of prior on a derived quantity can be useful in many statistical settings. For example, when poststratifying a survey or causal estimate, it’s often the case that we have some partial information on the full poststratification table and also exact information (for example, from a census) on certain marginals. So we might have the marginal totals for age x ethnicity, county, and ethnicity x sex, but not the full joint table. In that case, fitting can be difficult: it can be awkward to parameterize the models so that the margins are known, also computation can be slow: all these hard constraints can lead to difficult geometry.

We can often fix this problem by replacing the hard constraints by soft constraints. As the saying goes, uncertainty greases the wheels of commerce. This is not (yet) automatic—there’s a tradeoff in that if you make the constraints too soft, you’re throwing away some information, but if you make them too narrow, your computational problems can return—and I think further research is needed on developing automatic or semi-automatic methods for implementing soft constraints in such problems.

3. The folk theorem

Regular readers will know the Folk Theorem of Statistical Computing (for more on the topic, see this post by Bob). The above is an example of that folk theorem, in that, in the real world, purportedly hard constraints actually are soft! For example, census numbers are typically only estimates, not exact. Indeed, the first place this idea came up in my own work was with Frédéric Bois and Jiming Jiang when fitting our differential equation model in toxicology: our algorithm had poor mixing, and we realized that the problem was that certain biological measurements we’d taken as known, were only measured; we added the measurement error terms and our model fit better.

4. Probabilistic numerics

Hong pointed out a connection of the above ideas to probabilistic numerics, a field of numerical analysis that I’d never heard of. Here’s a reference that Hong recommends.

Stan Playground: Run Stan on the web, play with your program and data at will, and no need to download anything on your computer

Just in time for Halloween, we have a scarily effective implementation of Stan on the web, full of a veritable haunted house of delicious treats.

Brian Ward, Jeff Soules, and Jeremy Magland write:

Stan Playground is a new open-source, browser-based editor and runtime environment for Stan models. Users can edit, compile, and run models, as well as analyze the results using built-in plots and statistics or custom analysis code in Python or R, all with no local installation required. . . .

Whether you’re a new user, an educator trying to teach Stan, or an experienced user who just doesn’t have their new laptop configured yet, we hope to make your life just a bit easier.

You can visit the live website here: https://stan-playground.flatironinstitute.org

Feature Overview

For users familiar with tools like the Compiler Explorer 1, repl.it or JSFiddle, Stan Playground hopes to provide a similar experience for Stan models.

Stan editor

The site features an editor for Stan code with syntax highlighting and provides warnings and errors from the Stan compiler for instant feedback.

Compiling models

Compilation of the models is the only part of Stan Playground which is not run locally. We provide a public server for convenience, but you can also host your own.

Preparing data

Data can be provided in JSON format in its own editor, or can be generated from code written in R (using webR) or Python (using pyodide), including code that imports published datasets.

Sampling

After a model has been compiled, sampling can be run entirely in your local browser.

Viewing and analyzing results

Stan Playground has several built-in ways of viewing the samples, but also supports performing your own analysis, again in R or Python.

Sharing

Stan Playground has built-in sharing features to allow you to download a copy of your project, upload an existing project, or share via a Github Gist . . . You can also prepare custom links . . . For example, this link will load the “golf” case study from the example models repository: https://stan-playground.flatironinstitute.org/?title=Knitr%20-%20Golf%20-%20Golf%20Angle&stan=https://raw.githubusercontent.com/stan-dev/example-models/master/knitr/golf/golf_angle.stan&data=https://raw.githubusercontent.com/stan-dev/example-models/master/knitr/golf/golf1.data.json

I tried it out and it really worked!

I wrote a simple example that simulates fake data from a linear regression with known parameters (in the transformed data block) and then expresses the log posterior density (in the model block):

data {
  int N;             // sample size
  real a_, b_;   // true values of parameters
  real<lower=0> s_;  //   for the simulation
}
transformed data {
  vector[N] x = linspaced_vector(N, 0, 1);
  vector[N] y;
  for (n in 1:N) y[n] = normal_rng(a_ + b_*x[n], s_);
}
parameters {
  real a, b;
  real<lower=0> s;
}
model {
  y ~ normal(a + b*x, s);
}

With data:

{
  "N": 1000,
  "a_": 0.2,
  "b_": 0.3,
  "s_": 0.5
}

It’s so easy

Setting this up a and running it in Stan Playground is the simplest thing in the world:

1. Put the Stan program in the Stan program window.
2. Enter the data (in this case, the sample size of the regression and true parameter values) as a .json in the data window and click to save.
3. In the command window, click once to compile and click again to run.
4. The output (in tabular and graphical form) appears in the output window.

That’s it! The image at the top of this post shows the results.

Also, it catches many bugs in the compilation and sampling stages.

StanCon 2024 Oxford: recorded talks are now released!

(This post is by Charles)

The title says it all: recordings of StanCon 2024 are now available on Stan’s youtube channel.

We’re happy to make the content of StanCon 2024 accessible, even to those who couldn’t make it in person. And for those who were there, you might be tempted to revisit the excellent presentations. This is a wonderful snapshot of the applied and methodological work happening in Stan, and more broadly in the Bayesian modeling community. (StanCon is Stan centric, but not Stan exclusive!)

I’m very excited about all the research that is happening!!

What makes an MCMC sampler GPU-friendly?

(This post is by Charles)

Art Owen (Stanford) read our paper on nesting Rhat to assess convergence in the many-short-chains regime of MCMC. He made a lot of great comments and asked some clarification questions. Notably:

It wasn’t clear to me what makes an MCMC GPU-friendly.  Is there a canonical reference?  E.g., maybe the Lao paper.

I’m not sure there is a canonical reference with an entirely satisfactory exposition. Lao et al (2020) present the main ideas from an engineering perspective and discuss why it is difficult to implement NUTS on a GPU. Meanwhile, Hoffman et al (2021) present ChEES-HMC, a GPU-friendly alternative to NUTS. (This paper inspired the design of other GPU-friendly samplers: see Sountsov and Hoffman (2022), Hoffman and Sountsov (2022), and Riou-Durand et al (2022).)

A key ingredient to make a sampler GPU-friendly is that each Markov chain requires the same operations, albeit with different inputs. Then each operation can be executed in parallel on single-instruction-multiple-data (SIMD) hardware. For example, in HMC, the gradients of the log density for all chains are evaluated in parallel and this calculation of many gradients is efficient. This is to be contrasted with more traditional implementations of parallel MCMC where each chain runs asynchronously on its own processing unit, and so there is no parallel evaluation of the gradients. The latter approach is less efficient than SIMD, which to me is one of the more subtle points made by Junpeng Lao and colleagues.

With this in mind, NUTS is not GPU-friendly because at each iteration, each chain simulates a Hamiltonian trajectory with a different length, meaning each chain requires a different number of operations. So we end up waiting for the slowest chain at each iteration or trying to anticipate the longest trajectory. This is likely a challenge, not only for NUTS, but for many locally adaptive methods (although not all).
 

Additional input from Matt

I’ve discussed the topic many times with Matt and Bob, and sought them out for additional insight.

Matt sends me the draft of an article which dives into the topic. The article is unpublished and currently being reviewed, so I cannot share it. But I believe it will become the canonical reference Art was asking for.

The paper carefully explains that using SIMD only works well when separate chains perform the same computation at the same time. This can be violated if a sampler is adaptive or has a control flow that makes the chains differ in the amount of computation they perform. Matt points to some results, which show that NUTS has a heavy control flow, meaning it spends a lot of computation beyond evaluating and differentiating the target log density (remember, calculating gradients on SIMD is efficient). For those more familiar with the NUTS paper, these operations include managing the internal tree proposal stack, which will incur a number of operations unique to each chain.

Matt adds:

Although raggedness (different chains running for different numbers of leapfrog steps or whatever) is a serious issue, I think it may actually be secondary to NUTS’s other overheads (switching back and forth between states at either end of the trajectory, other control flow stuff). Alexey Radul tried to implement a version of NUTS that deals with raggedness by syncing on every gradient call, which allows for chains to have different numbers of samples at different times, but IIRC the overhead of the virtual machine he created (in TensorFlow!!!) to solve the problem incurred more overhead than it was worth.

Additional input from Bob

To shed more light on why NUTS’s overhead, and more generally why control flow is a problem, Bob points to this blog post he wrote. I highly recommend reading it. The post discusses SIMD, memory management, vectorization, and branch-point prediction. A heavy control flow leads to a more difficult memory management and branch-point predictions, both of which are harder on a GPU than a CPU. A GPU-friendly sampler would avoid both, as much as possible.

(On the same email thread, Bob Carpenter and I also started a discussion on how certain likelihoods can be difficult to evaluate, for example when the likelihood is based on a system of differential equations. I’m not including this here, because this post seems long enough and I will visit this topic in another post.)

Postdoc opportunity! to work with me here at Columbia! on Bayesian workflow! for contamination models! With some wonderful collaborators!!

Laboratory assays are central to much of biomedical research. My colleagues and I recently received a research grant to do better assays using Bayesian inference. Beyond the usual challenges of fitting nonlinear hierarchical models to real data that can sometimes be above or below detection limits, the project involves several special challenges:

– We estimate concentrations of a compound using calibration assays. But when samples are contaminated, which can happen all the time in the real world (for example, you gather dust samples from the kitchen, and they are contaminated with sugar), the calibration curve can change.

– More generally, we want to simultaneously classify a sample as contaminated or not, and estimate its concentration under some model of possible contamination, while recognizing that our contamination model could be way off.

– Experimental design. This includes designing protocols for spiking some dilutions with contaminants to estimate the contaminated calibration curves, and also, once everything is working, designing assays to get more information out of each plate. We believe that current designs are inefficient, devoting too much of the assay space to estimating the calibration curve and not enough for each unknown sample.

– We want to build a robust implementation so that practitioners can use our new approach (whatever exactly it is) with confidence, in place of what is currently given out by the standard software that comes with the assay machines.

In short, workflow. We’re not just analyzing a dataset, we’re proposing to create the new default, and then assess where it makes a difference in real-world studies.

And to do this we’re hiring a postdoc. Someone who wants to build models in Stan, and who is also interested in the super-important but often underemphasized area of experimental design, and who is also interested in building a reliable tool that includes diagnostics and warnings when it’s not working well. We’re aiming for an end-to-end statistics project here, with immediate applications to public health, and my colleagues and I are really stoked to do it.

You, the postdoc, will be at the center of this. In addition to working with us on this project, you’ll be part of the active research communities in Columbia’s statistics and biostatistics departments, with lots of opportunity for collaborations on theory and applications alike. Really a wonderful opportunity.

This exciting project is a collaboration between me, Prof. Qixuan Chen in the biostatistics department, and Prof. Matt Perzanowski in the department of environmental health.

All of us have collaborated before, and we each bring strengths to the project:

– I have experience in Bayesian modeling and computing and am particularly interested in methods for diagnosing and expanding models that do not fit the data.

– Qixuan has worked on a wide variety of problems involving Bayesian latent-parameter modeling in the biomedical sciences.

– Matt operates a lab performing bioassays for public health studies, which allows us to develop and evaluate our research on an ongoing data stream and have real-world impact.

The official job announcement is here, and some background is in these papers from 2004 and from 2007.

Interested candidates should submit a detailed CV, a cover letter outlining research interests and career goals, and contact information for three references to Dr. Qixuan Chen at [email protected].

ChatGPT o1-preview can code Stan

This is Bob.

Yes, but can it Stan?

The first few instantiations of ChatGPT haven’t been so good at Stan. This is perhaps not surprising, because there’s relatively little written about Stan on the web compared to, say, Python, C++, or R.

Impressive, whatever you call it

It’s a whole new ball game with the new o1-preview model rolled out by OpenAI. I was tipped off to this fact by a blog post from Keith Goldfeld (of NYU Population Health) on the ouR data generation blog, Can ChatGPT help construct non-trivial statistical models? An example with Bayesian ‘random’ splines. It’s well worth reading.

A lot of our readers will object to the tagline from OpenAI for o1-preview, “uses advanced reasoning.” Comment season is open. Whether you want to call this “reasoning” or “understanding” or pick your own term, the ability of OpenAI’s new o1-preview model is quite remarkable. It’s getting better at generalizing and working from the data it has. The o1-preview model gives you a protocol as it uses what AI researchers call “chain of thought” (yes, I’m teeing up another one for the doubters, to continue with the sports metaphors). You have to sign up and pay and then it’s rate limited, because it’s spending on the order of 15 to 150 seconds to answer a query as it goes through iterative refinement on its output. I’m here to tell you that the wait’s worth it, and that’s also what I’m hearing from people around here who lean on it for complicated code refactors.

Bayesian workflow is universal

I am about to do a tutorial tomorrow at Flatiron Institute on Bayesian workflow. I’d like to help all the scientists here understand how to apply Bayesian workflow using SBC, PPC, LOO, etc., no matter how they fit their models (Monte Carlo methods, variational inference, amortized inference, simulation-based inference, etc.).

Galileo’s inclined plane experiment

I thought it would help the presentation to have a concrete science example. I figured Galileo’s inclined plane experiment as a way to estimate the gravitational constant would be fun. Galileo set up a ball on an inclined plane, then measured how long it took to move a given distance. With that experimental data, we can fast forward to Newton and set up a statistical model where we use Galileo’s data to estimate the gravitational constant (on earth).

ChatGPT o1-preview cracks physics-based Stan

So I asked ChatGPT o1-preview to derive the physics for me, then to write the Stan model. I’d already written the Stan model at that point—it’s one thing I can still do better than GPT. But the one it wrote looks good. I went and checked some online physics tutorials as I can’t remember my high school physics well enough to derive it myself. It decided to use the inertial formula for a solid sphere with no friction. It got all the algebra right as far as I can tell. This is much better than earlier versions of ChatGPT did at algebra.

Show, don’t tell

Here’s the transcript from my session in case you’d like to dive deeper on how I’m prompting it:
ChatGPT o1-preview simulates Galileo’s inclined plane.

Keith Goldfeld’s blog post also goes over in detail how he used it to generate code for splines.

Just the code

Here’s the Python simulation, inference, and plotting code that I put together from the GPT output.

import numpy as np

def simulate_times(N, sigma, length, height):
    g = 9.81  # gravitational acceleration in m/s^2
    s = np.sqrt(length**2 + height**2)
    h = height
    t = np.sqrt((14 * s**2) / (5 * g * h))
    t_obs = np.random.lognormal(mean=np.log(t), sigma=sigma, size=N)
    return {
        'length': length, 'height': height,
        'N': N, 't_obs': t_obs
    }

data = simulate_times(N=100, sigma=0.1, length=5, height=2.5)

import cmdstanpy as csp
m = csp.CmdStanModel(stan_file='galileo.stan')
draws = m.sample(data = data)
print(draws.summary())

import pandas as pd
import plotnine as pn

df = pd.DataFrame('g': draws.stan_variables('m'),
                      'sigma': draws.stan_variables('sigma'))
df = pd.DataFrame(draws.stan_variables())
plot = (pn.ggplot(df, pn.aes(x='g', y='sigma'))
            + pn.geom_point()
            + pn.theme(aspect_ratio=1)
            + pn.geom_hline(yintercept=0.10, color='blue', size=1)
            + pn.geom_vline(xintercept=9.81, color='red', size=1)
            )
plot.show()

Here’s the Stan program that I wrote—GPT’s is pretty darn close, but my priors are tighter. I chose lognormal error just to keep everything positive, not because multiplicative (proportional) error is right for measuring time. I suspect the errors are more naturally additive based on the precision of the measuring device.

data {
  real length, height; 
  int N;  vector[N] t_obs;
}
transformed data {
  real s = sqrt(length^2 + height^2);
  real h = height;
}
parameters {
  real g;      // gravity accel in m/s^2
  real sigma;  // measurement error
}
model {
  real log_t_true = 0.5 * (log(14) + 2 * log(s) - log(5) - log(g) - log(h));
  t_obs ~ lognormal(log_t_true, sigma);
  sigma ~ exponential(1);
  g ~ lognormal(log(10), 0.25);
}

I threw in tighter priors than GPT—these are weakly informative. A more general measurement error model would allow for lack of calibration (i.e., measurement bias), and use a constant offset as well as an error term.

Supporting Bayesian modeling workflows with iterative filtering for multiverse analysis

Anna Riha, Nikolas Siccha, Antti Oulasvirta, and Aki Vehtari write:

When building statistical models for Bayesian data analysis tasks, required and optional iterative adjustments and different modelling choices can give rise to numerous candidate models. In particular, checks and evaluations throughout the modelling process can motivate changes to an existing model or the consideration of alternative models to ultimately obtain models of sufficient quality for the problem at hand. Additionally, failing to consider alternative models can lead to overconfidence in the predictive or inferential ability of a chosen model. The search for suitable models requires modellers to work with multiple models without jeopardising the validity of their results. Multiverse analysis offers a framework for transparent creation of multiple models at once based on different sensible modelling choices, but the number of candidate models arising in the combination of iterations and possible modelling choices can become overwhelming in practice. Motivated by these challenges, this work proposes iterative filtering for multiverse analysis to support efficient and consistent assessment of multiple models and meaningful filtering towards fewer models of higher quality across different modelling contexts. Given that causal constraints have been taken into account, we show how multiverse analysis can be combined with recommendations from established Bayesian modelling workflows to identify promising candidate models by assessing predictive abilities and, if needed, tending to computational issues. We illustrate our suggested approach in different realistic modelling scenarios using real data examples.

They’re just getting started! Lots more needs to be done. I’ve been interested in the general idea for awhile; the challenge is to get it working for some good examples and then to develop more general tools and abstract more general principles. As Riha et al. demonstrate, it can help to work in the directions of modeling and computation at the same time.

Defining statistical models in JAX?

This is Bob. And I’d like to know the best way for us to code a bunch of models in JAX to use to evaluate parallel algorithms including normalizing flows. I’m going to dump out my current thinking, but I’m really hoping to get feedback from experts on the best way to do this without starting a flame war in the comments.

Why not Stan? Ask Elizaveta!

The bottom line is that in order to evaluate the parallel algorithms we’re considering, we need fast parallel execution in-kernel on the GPU. Stan has some ability to offload compute to GPU, but not to the extent that we can parallelize entire model evaluations.

Elizaveta Semenova’s words at StanCon are still ringing in my ears—she started her live interview with Alex Andorra by saying, “I don’t use Stan any more.”

Why JAX?

Elizaveta needed to integrate neural networks for the Bayesian optimization she’s doing and for that turned to JAX. (The interview with Elizaveta and Chris Wymant will soon be up on Alex’s podcast, Learn Bayes Stats, along with the interview of Brian Ward and Mitzi Morris in another segment that also took place live at StanCon—the podcast is a ton of fun and both Andrew and I have done interviews).

The real reason for why JAX isn’t that all the cool kids are using it (though everyone I know on the CS side has pretty much switched to JAX, including my own personal bellwether, Matt Hoffman). JAX is beautifully compositional in the same way as Unix. I suppose we could’ve used PyTorch, but JAX just feels much more natural to a computer scientist like me. I just love the way it can compose JIT and autodiff to enable massively parallel differentiable programs. There are really two applications I have in mind, normalizing flows (the main topic of this post) and parallelized MCMC of the form Matt Hoffman’s been propounding lately (Charles Margossian, a former Ph.D. student of Andrew’s and one of our postdocs here, did an internship with Matt at Google working out how to do R-hat in a massively parallel setting with 1000+ chains that communicate with each other to accelerate convergence, after which a single draw is taken from each in the limiting case).

Normalizing flows

I think there is a good chance that normalizing flow-based variational inference will displace MCMC as the go-to method for Bayesian posterior inference as soon as everyone gets access to good GPUs. I’ve been looking into normalizing flows with Gilad Turok, Sifan Liu, Justin Domke, and Abhinav Agrawal. Justin visited Flatiron for five months and during that time, we didn’t manage to program a distribution in JAX that his and Abhinav’s take on realNVP, as coded in the repo vistan, couldn’t fit well. They’re busy writing up a more extensive evaluation in a follow-up paper and the results only look better. Gilad was able to port their vistan code to Blackjax and replicate all their results on our clusters here—he’ll be submitting a PR to Blackjax soon.

My thinking on normalizing flows was inspired by the last model we fit with Justin—a centered parameterization of a hierarchical IRT 2PL model with around 1000 total parameters (this is a nice example due to additive non-identifiability, multiplicative non-identifiability, and funnels from the hierarchical priors). With this parameterization, Stan struggles to the point where I’d say it can’t really fit the model. Justin and Abhinav’s RealNVP fit it quite well—much better than Stan managed. It just took a massive number of flops on a state-of-the-art GPU. One of the things Justin and Abhinav’s approach to flows relies on for convergence is a massive number of evaluations of the log density and gradients for computing the approximate KL-divergence stochastic gradient (i.e., the ELBO). So we needed to code the models in JAX to run entirely on the GPU. So I’m looking for an easier way to do this.

Workflow in JAX

Colin Carroll (Google employee, PyMC dev) just presented a talk about Bayes and JAX at PyData Vermont. He covers the whole workflow in JAX and talks about his bayeaux repository. Colin talks about Adrian Seyboldt’s new nutpie sampler in Rust, which Adrian also just presented at StanCon. There’s no write-up, but we’re looking into reverse engineering the Rust into C++ for Stan—it works quite well. Adrian’s agreed to come out and give a talk here at Flatiron on his sampler in the new year. But that’s a different topic.

For now, I want to do a lot more evaluations of Justin and Abhinav’s take on realNVP, and we’re trying to figure out how to code up a couple dozen models in JAX. There are many possibilities.

PyMC

PyMC can produce JAX output. The PyMC devs just did a little hackathon and created about ten pull requests in the posteriordb repository for PyMC implementations.

with pm.Model() as hierarchical:
    eta = pm.Normal("eta", 0, 1, shape=J)
    mu = pm.Normal("mu", 0, sigma=10)
    tau = pm.HalfNormal("tau", 10)
    theta = pm.Deterministic("theta", mu + tau * eta)
    obs = pm.Normal("obs", theta, sigma=sigma, observed=y)

All of the approaches in Python wind up having to name variables and then provide string-based names. I don’t know if the sigma=sigma thing is necessary for the scale parameter. I like that the distributions are vectorized here. It’s too bad that there’s an observed= in the data models—I think that means the models as defined aren’t as flexible as the BUGS models in terms of specifying what’s data at run time. At the same time, Thomas Wiecki was telling me you could use NaN to code the equivalent of R’s NA and do inference, so I think that observed value can have missingness.

Not all of the PyMC models look so much like a graphical model.

NumPyro

NumPyro is the version of Pyro that generates JAX on the back end. NumPyro looks like BUGS (or Turing.jl), which is not necessarily a bad thing. Here’s the NumPyro version of Andrew’s favorite example model, eight schools (the arguments to the top-level function are the data):

def eight_schools(J, sigma, y=None):
    mu = numpyro.sample('mu', dist.Normal(0, 5))
    tau = numpyro.sample('tau', dist.HalfCauchy(5))
    with numpyro.plate('J', J):
        theta = numpyro.sample('theta', dist.Normal(mu, tau))
        numpyro.sample('obs', dist.Normal(theta, sigma), obs=y)

pangolin

pangolin can produce JAX output. This is an “early-stage probabilistic inference project” rather than a longstanding embedded PPL like PyMC or NumPyro. Specifically, it’s a graphical modeling language that looks a lot like the others, and it has back ends for Stan, JAGS, and JAX. It’s very experimental and a work in progress, but the models look nice. Python doesn’t let you overload the ~ operator, which is unary arithmetic complement. Here it’s not so clear that y and stddevs are the data.

mu = pg.normal(0,10)                                             # μ ~ normal(0,10)
tau = pg.exp(pg.normal(5,1))                                     # τ ~ lognormal(5,1)
theta = [pg.normal(mu,tau) for i in range(num_schools)]          # θ[i] ~ normal(μ,τ)
y = [pg.normal(theta[i],stddevs[i]) for i in range(num_schools)] # y[i] ~ normal(θ[i],stddevs[I])

No names here, but they have to get introduced later if you want to do I/O. The doc also makes it clear how things line up. unlike the other approaches, this uses standard Python comprehensions, which I don’t think are super efficient in JAX judging from the JAX doc I’ve read. But I think there are lots of ways to code in pangolin. The problem is when you release “Hello, World!” code, people read it as what your project does rather than as a simple example.

postjax

We can just code models in JAX. Bernardo Williams (Ph.D. student at U. Helsinki) just coded a bunch of models directly in JAX in his GitHub postjax. I couldn’t find eight schools, but here’s a simple logistic regression model as a class with a method defined as follows.

def logp(self, theta):
    sqrt_alpha = jnp.sqrt(self.alpha_var)
    data = self.data
    X = data["X"]
    y = data["y"]
    assert len(theta) == self.D
    return jnp.sum(jss.norm.logpdf(theta, 0.0, sqrt_alpha)) + jnp.sum(
        jss.bernoulli.logpmf(y, sigmoid(jnp.dot(X, theta)))
    )

The variable self.alpha_var is set as data in the constructor as is the data dictionary data. I’d have been tempted to put alpha_var into the data input.

Other options?

I’d really like to hear about other options for coding statistical models in JAX.

Straight to XLA?

Both JAX and TensorFlow run by compilation down to XLA (stands for “accelerated linear algebra”). Mattijs Vákár, who coded a lot of the Stan parser and code generator, is working on autodiff down at that level. That may be a good eventual target for a compiler, but it’s a lot easier to start in JAX. Similarly, we could have targeted LLVM with Stan rather than C++, but we rely on so much pre-existing C++ infrastructure that would have been challenging. Similarly, I think coding directly at the XLA level would be painful at this stage, not that I’ve ever tried it or even know what it looks like. I just know we’re going to need a lot more than linear algebra.

Stan

For comparison, I really wish we could just use Stan. Here’s what eight schools looks like in Stan. This includes all the data declarations that were implicit in the other programs (which used either a closure or function argument to capture data directly).

data {
  int J;
  vector[J] y;
  vector[J] sigma;
}
parameters {
  real mu;
  real tau;
  vector[J] theta;
}
model {
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
  mu ~ normal(0, 5);
}

I’m thinking the way I would code something that follows Stan’s execution logic in JAX directly would be something like this:

class LinearRegression:
    def __init__(self, data):
        self._data = data

    def num_params_unc(self):
        return 3

    def log_density(self, params_unc):
        reader r = Reader(params_Unc)
        alpha = r.real()
        beta = r.real()
        sigma = r.realLB(lower=0)
        log_jacobian = r.lp_

        log_prior = 0
        log_prior += norm.logpdf(alpha, 0, 1)
        log_prior += norm.logpdf(beta, 0, 1)
        log_prior += exponential.logpdf(sigma, 1)

        log_likelihood = 0
        log_likelihood_fun = lambda x, y: norm.logpdf(y, alpha + beta * x, sigma)
        log_likelihood += sum(vmap(log_likelihood)(zip(self._data['x'], self._data['y'])))

        return log_jacobian + log_prior + log_likelihood

where I’m relying on a Reader class that follows the reader I first coded for Stan in order to define the log density over a vector. It’s really a deserializer. I’m wondering if I can just lean more on the pytree construct in JAX to simplify my interfaces, but I’m just getting started with JAX myself.

class Reader:
    def __init__(self, params):
        self._params = params
        self._lj = 0
        self._next = 1

    def real(self):
        x = self._params[self._next_]
        self._next += 1
        return x

    def real_lb(lb):
        x_unc = self.read_real()
        self._lj += x_unc
        return lb + jax.numpy.exp(x_unc)

    ... other constraining transforms ...

    def log_jacobian():
        return self._lp

Stan’s autodiff is 4x faster than JAX on CPU but 5x slower on GPU (in one eval)

This is Bob.

JAX on my mind

I’ve been thinking a lot about JAX lately. JAX is appealing to a computer scientist like me, due to its beautifully compositional architecture for coding autodiff on GPUs. Had JAX existed when we started coding Stan in 2011, we would’ve used that rather than rolling our own autodiff system. Apparently Theano did exist at that time, but we didn’t hear about it until long after releasing the first version of Stan.

Why JAX?

Originally, my interest was sparked by Matt Hoffman’s work on Cheese and Meads samplers (too lazy to look up their capitalization pattern, so going Andrew style) that use massively parallel HMC steps on GPU (Matt was driven to abandon NUTS because its recursive structure is anathema to GPU acceleration). It continues to be kindled by statements from people like Elizaveta Semenova, who announced during the recent StanCon at Oxford that she had to give up Stan and moved to NumPyro because she couldn’t code neural nets easily in Stan and Stan doesn’t scale on the GPU (we have some operations that can be sent to GPU, but we can’t keep the whole eval in kernel). The full blown fire is due to Justin Domke and Abhinav Agrawal’s work on normalizing flows—they have a repo, vistan, that implements their methods, including a greatly improved version of autodiff variational inference (ADVI) and a version of real non-volume preserving (realNVP) normalizing flows. Gilad Turok, who’s a research analyst here and is applying to grad school for next year, is almost done with a better engineered version we can submit to Blackjax.

We haven’t gotten far, but so far we haven’t found a density yet that the realNVP flows, followed by importance resampling, didn’t fit nearly perfectly. For example, it could fit a 1000-parameter hierarchical IRT 2PL without any identifying strategy other than priors, which is a model that Stan cannot fit well at all due to the product of funnels structure of the posterior geometry. Justin and Abhinav’s ADVI and realNVP variants rely on JAX’s ability to massively parallelize the log density and gradient calculations in order to stabilize the stochastic gradients used to calculate KL divergence up to a constant (i.e., the ELBO).

Comparing Stan and JAX performance

At the recent StanCon in Oxford, Simon Maskell presented a result I’ve been meaning to generate for a year or two, which is a comparison of JAX and Stan on CPU. Here’s Simon (or at least his arm!) presenting the result:

Working out the algebra problem Simon’s slide title presents (JG = 20 * JC, JG= 5 * SC, where JG is JAX on GPU, JC is JAX on CPU, and SC is Stan on CPU), we see that JAX on GPU is 5 times faster than Stan on CPU, which is in turn 4 times faster than JAX on CPU. For the evaluation, Simon was using a high-end consumer-grade (RTX) NVIDIA GPU—we will re-run Simon’s results on our state-of-the-art GPUs over the next couple of months and report back. Tensorflow did a similar evaluation with similar results a few years ago, but I wasn’t able to find it.

CPU to GPU over next decade?

This is actually kind of a bummer in that it means we couldn’t really move to JAX as a back end for most of our users because it’s not very performant on CPUs. My guess is that the median Stan user in terms of computation has a 5-year old Window’s notebook with no GPU. Part of my interest in things like JAX is my feeling that in 10 years, this is no longer going to be the case, and almost everyone trying to fit a statistical model will be working on a cluster.

StanCon 2024… is a wrap!

We made it…! Another StanCon! We’re on the other side.

StanCon ’24 took place on September 9th – 13th at Oxford University and, of all the StanCons I attended, this may well have been my favorite so far.

Some highlights:

  • The high quality of the talks. I remember thinking, during the first day “Wow, this is the fifth talk I’ve enjoyed today. There’s no low point.” (And I’m usually the first one to stare at the window after slide 1 and space out for the next twenty minutes.) I can think of two reasons why the talks were so engaging: (i) a lot of talks were applied, and speakers were mindful to introduce the field of application to a non-expert audience. The pedagogy was great. (ii) At the same time, all the talks were about Bayesian modeling, and so, while diverse in application, the overall topic of conference remained focused. We also had a packed and engaged lecture room, and there was no shortage of questions and comments (I don’t believe a single speaker got away with just a round of applause at the end of their talk).
  • The high quality of the tutorials. I stepped in and out for most tutorials, so I can only share some impressions. On Monday, Aki Vehtari, Noa Kallioinen and Teemu Säilynoja spoke about model comparison, and from what I saw, they did a wonderful job balancing technical justification and pragmatic recommendation. In parallel, Richard McElreath taught an introduction to Stan, and I discovered his excellent cat adoption example (a survival model), where all aspects of the model are built from the ground up. On Friday, I attended the course by Elizaveta Semenova and Anna Riha, who completely demystified Bayesian optimization; it was a great to learn about a new topic Stan could be applied to. I didn’t have a chance to stop by the other tutorials, though I heard positive feedback from attendees.
  • Gathering between events. The coffee room was animated, groups formed in the corridor, and folks were quick to find a pub in the evening. One night, I stood up to leave, and by the time I had gone around to say goodbye and make just one more comment about this presentation, or this idea, or the future of applied statistics, another hour had passed. We had an excellent reception dinner at Exeter college: we toasted many times to Brian (if you know, you know) and moved a “forbidden table” to take a group picture. When the main conference ended on Thursday, many came to play football (soccer :p) in the university park. Naturally, we were caught by the rain which made the grass wet and the game very slapstick. This is the second football game in the history of StanCon and once again I was on the losing team (next time!!). Overall, several attendees commented that the Stan community was friendly and willing to engage with each other’s problems, which is exactly what we hope for from such a conference.
  • Oxford: I like Oxford. Its buildings, its pubs, its beautiful parks and its beautiful grass. And I liked getting a sausage roll in the morning on my way to the conference registration desk, and fish’n’chips after a wet football game. I would’ve liked some heating in the dorm I stayed at, but hey, there was a kettle.

Goals for the Stan Governing Body (SGB)

I’ll now put on my SGB hat, and say that the conference helped us meet several goals of the SGB, including fostering the community, giving Stan developers and Bayesian experts a place to meet, and reaching out to Stan curious—sometimes even Bayesian curious—researchers and students.

The organizers were mindful of creating a diverse, inclusive, and accessible environment. We don’t collect official data on our participants, so what follows is an estimate. Focusing on one dimension, we can look at the participation of women in what is a historically male-dominated environment. Women presented 2/4 keynotes lectures, a bit short of 1/3 of the talks, and led 2/6 tutorials. To be transparent, this composition of the program arose rather organically—we did make an effort to achieve gender parity for the keynotes, but even then, the candidate list was long for both women and men. I believe we took several steps in the right direction to make StanCon more inclusive, and the SGB will discuss more thoroughly how well this and other facts align with our goals, and what further initiatives we need to take.

One of our key strategies to make StanCon more inclusive was the scholarship program, which was need-based and awarded based on (i) whether the participant would benefit from attending the conference, (ii) whether the participant is contributing to the conference (e.g. giving a talk) and (iii) whether the participant is from an underrepresented group at StanCon and more generally in STEM. Very few candidates satisfied all three criterion, and we would consider anyone who matched at least criterion (i) as a qualified applicant. The main restriction was monetary, and we awarded the scholarship to about ~1/3 of applicants. I’m left with the bitter thought that we could’ve awarded more scholarships (more and earlier for those who needed to file a visa application), if we had anticipated funding from late ticket purchases or gotten another sponsor, etc. These are regrets and lessons we will leverage for future StanCons, and hopefully we can admit more qualified applicants next time.

Recordings and public material

Another way in which we make the event accessible is by making talks available online for free. We recorded most of the talks (as long as we had consent from the speaker), and our videographer is currently editing the content. These things take a bit of time, but we’ll share the videos soon enough.

Also, the two panel discussions, led by Alex Andorra, will feature as the first ever live episodes of the podcast Learning Bayesian Statistics. Check out the podcast, its high quality content is available for free.

StanCon 2025 and other events

Well, now is the time to think about the next StanCon (and for some of us, to continue thinking about it)! The first step will be to find a host and local organizers, and soon the SGB will release a call for proposals.

If you want to support the Stan Community, hosting the conference is an excellent way to do so. It’s work but it’s an important event—-I really believe the Bayesian community would be worst off without it, and fortunately, we’ve had many volunteers who for past StanCons have taken on this value- and passion-driven project.

We also plan to have other events to cultivate the Stan Community, including StanConnects, which are shorter events focused on one particular application, and we want to support courses for summer schools. And of course, we’re always trying to support volunteers who wish to run such events.

The Mets are looking to hire a data scientist

Michale Jerman writes:

We’re looking to expand our group, and we’d obviously be interested in hiring the type of people who follow you.

The job is Senior Data Scientist, Baseball Analytics:

The New York Mets are seeking a Senior Data Scientist in Baseball Analytics. The Senior Data Scientist will build, test, and present statistical models that inform decision-making in all facets of Baseball Operations. This position requires strong background in complex statistics and data analytics, as well as the ability to communicate statistical model details and findings to both a technical and non-technical audience. Prior experience in or knowledge of baseball is a plus, but is not required.

Essential Duties & Responsibilities:

  • Build statistical models to answer a wide variety of baseball-related questions affecting the operations of the organization using advanced knowledge of statistics and data analytics and exercising appropriate discretion and judgment regarding development of statistical models
  • Interpret data and report conclusions drawn from their analyses
  • Present model outputs in an effective way, both for technical and non-technical audiences
  • Communicate well with both the Baseball Analytics team as well as other Baseball Operations personnel to understand the parameters of any particular research project
  • Provide advice on the desired outputs from the data engineering team, and guidance to the Baseball Systems team on how best to present model results
  • Assist with recruiting, hiring, and mentoring new analysts in the Baseball Analytics department
  • Evaluate potential new data sources and technologies to determine their validity and usefulness
  • Consistently analyze recent research in analytics that can help improve the modeling work done by the Baseball Analytics department

Qualifications:

  • Ph.D. in statistics or a related field, or equivalent professional experience
  • Strong background in a wide variety of statistical techniques
  • Strong proficiency in R, Python, or similar, as well as strong proficiency in SQL
  • Basic knowledge of data engineering and front-end development is a plus, for the purpose of communicating with those departments
  • Strong communication skills
  • Ability to work cooperatively with others, and to take control of large-scale projects with little or no daily oversight

The job description doesn’t literally mention Stan, but I can assure you that proficiency in Bayesian modeling and computing in Stan would be a plus.

Join the team before the Mets win that world championship and you can share some of the credit!

Modeling Weights to Generalize (my talk this Wed noon at the Columbia University statistics department)

In the student-organized seminar series, Wed 11 Sep 2024 noon in room 903 room 1025 Social Work Bldg:

Modeling Weights to Generalize

A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights came from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratifcation on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest. For background, see here: https://www.stat.columbia.edu/~gelman/research/unpublished/weight_regression.pdf

Research!

Doctoral student positions in Bayesian workflow at Aalto, Finland

Apply for fully funded doctoral student position to work with me (Aki) on Bayesian workflow, model selection, inference, and diagnostics!

Apply through a joint call fcai.fi/doctoral-program and choose me as the supervisor, and tell in the motivation letter why you want to work with me. DL Sep 9.

You can also ask me more about possible research topics and information how it is like to do doctoral studies at Aalto and in the Bayesian workflow group.

Research would include collaboration with Andrew, Richard McElreath, Paul Bürkner and Stan development team, and potential research visits with them.

(There are also many other topics and professors to choose from in AI, machine learning and probabilistic modeling…)