ChatGPT o1-preview can code Stan

This is Bob.

Yes, but can it Stan?

The first few instantiations of ChatGPT haven’t been so good at Stan. This is perhaps not surprising, because there’s relatively little written about Stan on the web compared to, say, Python, C++, or R.

Impressive, whatever you call it

It’s a whole new ball game with the new o1-preview model rolled out by OpenAI. I was tipped off to this fact by a blog post from Keith Goldfeld (of NYU Population Health) on the ouR data generation blog, Can ChatGPT help construct non-trivial statistical models? An example with Bayesian ‘random’ splines. It’s well worth reading.

A lot of our readers will object to the tagline from OpenAI for o1-preview, “uses advanced reasoning.” Comment season is open. Whether you want to call this “reasoning” or “understanding” or pick your own term, the ability of OpenAI’s new o1-preview model is quite remarkable. It’s getting better at generalizing and working from the data it has. The o1-preview model gives you a protocol as it uses what AI researchers call “chain of thought” (yes, I’m teeing up another one for the doubters, to continue with the sports metaphors). You have to sign up and pay and then it’s rate limited, because it’s spending on the order of 15 to 150 seconds to answer a query as it goes through iterative refinement on its output. I’m here to tell you that the wait’s worth it, and that’s also what I’m hearing from people around here who lean on it for complicated code refactors.

Bayesian workflow is universal

I am about to do a tutorial tomorrow at Flatiron Institute on Bayesian workflow. I’d like to help all the scientists here understand how to apply Bayesian workflow using SBC, PPC, LOO, etc., no matter how they fit their models (Monte Carlo methods, variational inference, amortized inference, simulation-based inference, etc.).

Galileo’s inclined plane experiment

I thought it would help the presentation to have a concrete science example. I figured Galileo’s inclined plane experiment as a way to estimate the gravitational constant would be fun. Galileo set up a ball on an inclined plane, then measured how long it took to move a given distance. With that experimental data, we can fast forward to Newton and set up a statistical model where we use Galileo’s data to estimate the gravitational constant (on earth).

ChatGPT o1-preview cracks physics-based Stan

So I asked ChatGPT o1-preview to derive the physics for me, then to write the Stan model. I’d already written the Stan model at that point—it’s one thing I can still do better than GPT. But the one it wrote looks good. I went and checked some online physics tutorials as I can’t remember my high school physics well enough to derive it myself. It decided to use the inertial formula for a solid sphere with no friction. It got all the algebra right as far as I can tell. This is much better than earlier versions of ChatGPT did at algebra.

Show, don’t tell

Here’s the transcript from my session in case you’d like to dive deeper on how I’m prompting it:
ChatGPT o1-preview simulates Galileo’s inclined plane.

Keith Goldfeld’s blog post also goes over in detail how he used it to generate code for splines.

Just the code

Here’s the Python simulation, inference, and plotting code that I put together from the GPT output.

import numpy as np

def simulate_times(N, sigma, length, height):
    g = 9.81  # gravitational acceleration in m/s^2
    s = np.sqrt(length**2 + height**2)
    h = height
    t = np.sqrt((14 * s**2) / (5 * g * h))
    t_obs = np.random.lognormal(mean=np.log(t), sigma=sigma, size=N)
    return {
        'length': length, 'height': height,
        'N': N, 't_obs': t_obs
    }

data = simulate_times(N=100, sigma=0.1, length=5, height=2.5)

import cmdstanpy as csp
m = csp.CmdStanModel(stan_file='galileo.stan')
draws = m.sample(data = data)
print(draws.summary())

import pandas as pd
import plotnine as pn

df = pd.DataFrame('g': draws.stan_variables('m'),
                      'sigma': draws.stan_variables('sigma'))
df = pd.DataFrame(draws.stan_variables())
plot = (pn.ggplot(df, pn.aes(x='g', y='sigma'))
            + pn.geom_point()
            + pn.theme(aspect_ratio=1)
            + pn.geom_hline(yintercept=0.10, color='blue', size=1)
            + pn.geom_vline(xintercept=9.81, color='red', size=1)
            )
plot.show()

Here’s the Stan program that I wrote—GPT’s is pretty darn close, but my priors are tighter. I chose lognormal error just to keep everything positive, not because multiplicative (proportional) error is right for measuring time. I suspect the errors are more naturally additive based on the precision of the measuring device.

data {
  real length, height; 
  int N;  vector[N] t_obs;
}
transformed data {
  real s = sqrt(length^2 + height^2);
  real h = height;
}
parameters {
  real g;      // gravity accel in m/s^2
  real sigma;  // measurement error
}
model {
  real log_t_true = 0.5 * (log(14) + 2 * log(s) - log(5) - log(g) - log(h));
  t_obs ~ lognormal(log_t_true, sigma);
  sigma ~ exponential(1);
  g ~ lognormal(log(10), 0.25);
}

I threw in tighter priors than GPT—these are weakly informative. A more general measurement error model would allow for lack of calibration (i.e., measurement bias), and use a constant offset as well as an error term.

13 thoughts on “ChatGPT o1-preview can code Stan

  1. And o1-preview is only a preview of the o1, a model of similar size. In benchmarks (“evals”) released by OpenAI, o1 is a big step upwards still from o1-preview.

    The speculation is that o1 (like the preview model) builds a tree of thought traces, which are evaluated and iteratively selected by a reward model. The summary we see is actually a reconstruction, not the real traces. Anyway, this requires a lot of computation but allows the model to try different things and evaluate them internally, so unlike bare LLMs it finally has a bit of working memory and self reflection, which help a lot with consistency and errors.

    This is not Stan-related, but I have currently access to both Claude and the OpenAI models, and have in my R and Python coding for some reason converged to using Claude Sonnet 3.5 almost entirely.

    In those “evals” from OpenAI, o1-mini is actually a bit better in coding than o1-preview, probably because unlike o1-preview, it’s a model that has finished its training. o1-preview has wider world knowledge though, it is not so STEM-focused, and obviously the final o1 will outperform the mini version.

  2. Haven’t tried Stan. I use both Claude and OpenAI models (o1) for python (non-stats related things) and R coding. Both are good with generic stuff. But, I would say responses from Claude models are more elegant!

    • I agree Claude’s code is more elegant, especially if you ask for things like using tidyverse, and writing functional code. It’s English is more elegant as well, and compared to GPT-4o, it says more often it’s uncertain, or doesn’t know.

      Not that Claude wouldn’t hallucinate sometimes at the limits of its knowledge. One particularly amusing example were different correlation structures available for random effects in brms. I was like “oh, I have to think whether that compound symmetry would be good in my model, and nice that there’s now a low-rank choice too”.

  3. Bob – to what extent would you say you need to know Stan yourself in order to use this effectively? Is it just a potential time saver in that it could do things quickly and you can skim over them, using your expertise, and potentially make minor amendments and have saved a load of time not having to type everything. Or would you to some degree trust what it outputs in quite simple cases if you didn’t know Stan too well yourself?

    • There is a basic computer science result which tells you that ChatGPT cannot possibly generate reliably correct code: it’s called the halting problem.

      Whether or not ChatGPT et. al. can generate _useful_ results or can be fun to use, or whether or not other techniques for producing relevant code samples might or might not be better are all _empirical_ questions. (That are widely recognized to currently be in ChatGPT’s favor. By a mile. Or more. (See the recent posts by Bob C.))

      But the underlying mathematical truth here is that it’s not possible (in general) to prove things about code, so machine-generated code has to be demonstrated correct by other means.

      (Ditto for LLM system text outputs, but for different reasons.) The point that natural langauge prompts are not formal specifications means that the translation to code isn’t formally defined is important here: if the linguistic prompts were adequate to stipulate the code, then it’d just be a compilation problem. But presumably the advantage of ChatGPTing is that you don’t have to do the detail work to get most of the details, and giving a formally adequate stipulation of the problem would defeat the point of using ChatGPT.

      That doesn’t change the empirical fact that Bob C. likes ChatGPT. Or the empirical fact that Tokyo Dave thinks that it’s snake oil.

      • I’m not sure the halting problem is particularly applicable here. All that says is that for *some* arbitrary programs, the only way to know for sure that it halts is to run the program and wait for it to halt. That doesn’t mean such a property applies to *all* functions–clearly, a computer program like “print Hello World” halts. It certainly doesn’t mean it’s not possible in general to prove things about code.

        I agree that machine-generated code shouldn’t always be blindly trusted, but that’s also true of human-written code. And it doesn’t necessarily mean that LLMs cannot possibly generate reliably correct code. In fact, that seems plainly false: modern LLMs score quite well on some coding tasks even compared to expert humans, and continue to improve as they become larger and better-trained.

    • Can’t speak for Stan code for I haven’t lately written raw Stan and especially not with LLMs. But in general, I wouldn’t trust the output of any of these tools. (That’s not to say that a current LLM embedded into a large context of iterative prompting or some other large framework, or a future LLM, wouldn’t be trustable.)

      But the thing of “not knowing your language” is nuanced. Languages and libraries develop and acquire new features all the time, and even if they didn’t, it’s hard to remember everything. Often it’s slow to write the code without googling a lot (or reading a manual), but when you have it ready, it’s relatively easy to understand or check. (And you can use an LLM to explain parts.)

      At least for me, the use of LLM in coding is pretty iterative. I extend and fix what they give me, or point out errors, and they extend or generalize the code I give them.

      (On halting problem: I don’t think it’s relevant here either. But I agree that LLMs are not formal compilers, oracles or solvers in any sense, and can’t be, any more than a human.)

      • And yeah, all my LLM experience is about code snippets: scripting or small-scale production functionality. Working with larger code bases, especially ones that are relatively foreign to you, is a whole another topic.

        • Yesterday I had ChatGPT translate a large blob of pyspark code to spark SQL. I went back through to check the results, found a difference, and it turned out there was a bug in the original code that it silently fixed in the translation haha. Pretty fun!

          I’ve found myself working mostly in snippets too though — but I suspect I’ll get more comfortable with the bigger transformations in time.

    • “Bob – to what extent would you say you need to know Stan yourself in order to use this effectively?”

      I’m not Bob, but my answer is you need to know Stan (or any other language) quite well first. I use LLM’s to write some R code fairly often, like a custom function I want done quickly. I don’t think they’ve ever given the correct answer out of the box. But they usually get close and I know enough to tweak the code to my needs. It saves a lot of time overall.

      On the other hand, I teach sophomores how to code in R and allow them to use LLM’s to help with their coding. It’s rarely helpful. 1) they don’t know enough about R to ask a specific question. 2) they don’t know enough about R to troubleshoot the produced code. 3) Even if they get workable code, they don’t know enough about R to be able to interpret it. In this setting, it wastes a lot of time and its hard to see what anyone has gained.

      • Thanks Jeff – yeh this is pretty much what I was asking. I have the same sense with R and it concerns me that people also seem to think that, without knowing what they want or why they want it, they could just outsource analyses to an LLM (e.g., I saw an example of some kind of LLM analytic helper the other day where the person asked it to impute missing ages using the median, then later on asked for a visualisation of the ages and was led to comment that it was interesting that there is a lot of people around age 30 – the median they imputed!!! You can’t outsource thinking, at least not yet…).

        I wonder with Stan if, knowing quite a lot about Bayesian analysis and brms, it might still be possible to get quite some value out of LLMs helping with the coding as a scaffold for learning it, because you at least still know at a higher level what should be happening/what you are trying to achieve even if not exactly what the code should look like. But I very much agree they currently seem more useful when you can quickly assess their outputs to quite a high level yourself.

  4. Someone asked me a bunch of questions offline about my comments here that I thought normalizing flows might overtake MCMC. I got permission to repost the questions so that I could answer here (and maybe other people could throw in their opinions):

    1. What has prevented wider adoption of this approach so far? Is the Jacobian calculation the main bottleneck?

    I think there are a few blockers. One, it’s relatively new and it’s hard to know what to read and how much to trust the results. All you have are these short empirical papers at venues like NeurIPS, then surveys that give you an overwhelming palette of choices. Also, you need a GPU in order to do enough nested ELBO evaluations to accurately compute gradients for minimizing KL divergence, which is prohibitive. My guess is that the median Stan user has a 5-year old Windows notebook and no budget to buy GPU time on a cluster. I assume this is going to change over the next decade to follow the compute.

    You need the Jacobians to be efficient in order to evaluate log densities, which you need for gradient-based variational inference. But that’s not a problem for most of the normalizing flow architectures. For RealNVP, they’re trivially 1.

    2. What specific advantages do you still see in MCMC-based inference (if there are any left)?

    If (and this is a big if), you have geometric ergodicity for your Markov chains, then you know you are going to get an unbiased result that converges exponentially fast (in total variation distance). That is, we know that the algorithm will be asymptotically exact, but also converge well in practice. I think people believe that MCMC is going to be unbiased because it is asymptotically, but that’s not true if you use something like Metropolis in high dimensions or you use centered parameterizations of hierarchical models in HMC.

    3. Regarding your comment about GPU availability: What scale of computational resources would be needed? Are we talking about an OpenAI-sized datacenter, or would a smaller GPU cluster suffice?

    For the stuff I’ve seen, a decent consumer GPU should be enough to see substantial gains, but we’ve been working on beefy H100s for experiments. So far, it hasn’t required strapping multiple GPUs together (even an H100 only has 80GB of GRAM), because we’ve only gone up to 1000-dimensional problems.

    4. Could you share more about the types of models you’ve successfully implemented using this approach and what computational resources they required?

    We’ve fit models with extreme funnel geometry like Neal’s in 100s of dimensions. We can fit lightly (not combinatorially) multimodal posteriors. It fits things like Rosenbrock, etc., that are challenging for MCMC. The largest scale thing we’ve fit is a 1000-dimensional hierarchical IRT 2PL model with a centered parameterization and the additive and multiplicative non-identifiability in the IRT 2PL model only identified through symmetric priors. It worked better than MCMC in those cases as measured by Wasserstein distance.

    I’ve been working with Justin Domke (UMass CS prof) and his Ph.D. student Abhinav Agrawal (who should graduate next month), as well as Gilad Turok here. The thing to read is Justin and Abhinav’s paper on realNVP (the arXiv version that has the appendix): BBVI site

    Some of Abhinav’s more recent work is on GitHub: vistan repo. The paper going along with this should be out soon—the draft goes through much much more extensive evaluation than the original paper sited above.

    Gilad has this all working in a Blackjax port and he’ll be submitting a PR to Blackjax soon. He’s also been looking into some alternative black-box VI algorithms based on normalizing flows, and has found one he thinks works better—more on that later (not trying to be cagey—I just can’t remember the name of the other system he was exploring).

    • > that I thought normalizing flows might overtake MCMC

      I’m thinking when do they overtake the distributions I put in my models. The little I’ve futzed with them in 1D they’re amazing.

Leave a Reply

Your email address will not be published. Required fields are marked *