Nicholas Carlini on LLMs and AI for research programmers

Nicholas Carlini starts his blog post, How I use AI, with this provocative line.

I don’t think that “AI” models [a] (by which I mean: large language models) are over-hyped.

To put this in perspective, his conclusion, which he nicely puts up front, is rather modest compared to the hype cycle.

I would say I’m at least 50% faster at writing code for both my research projects and my side projects as a result of these models.

Carlini backs up his conclusion by taking us on a deep dive on how he’s used LLMs over the past year as part of his own workflow for doing research at DeepMind (Google), with examples organized into the following sections.

How I use AI

  • To make applications
  • As a tutor
  • To get started
  • To simplify code
  • For boring tasks
  • To automate tasks
  • As an API reference
  • As a search engine
  • To solve one-offs
  • To teach me
  • Solving solved problems
  • To fix errors

This is very much in line with how I’ve been using LLM chatbots.

P.S. I’d also highly recommend LLMs for brainstorming for tabletop role-playing games.

What’s a generative model? PyMC and Stan edition

I’ve been hanging out on both the Stan Discourse and PyMC Discourse forums. They’re both very friendly and relatively high volume for open-source stats packages. I’m on the PyMC forum because I want to learn PyMC so that I can generate JAX code for normalizing flow-based variational inference; I’m working with Justin Domke and Abhinav Agrawal at UMass and Gilad Turok here at Flatiron, starting from their paper on black-box variational inference with Dan Sheldon, and with Abhinav’s vistan GitHub repo.

At one point, I was suggesting on the PyMC forums that you could apply a change-of-variables in a constrained sampling problem and then adjust with the Jacobian determinant. This came up in a topic, Separation between model and data/parameters, which was about how to generate a positive increasing sequence. The other participants gently pointed out that my proposed approach is not standard practice in PyMC because it isn’t “generative” in the sense they care about (more on that below). Their preferred approach is to generate an unordered sequence and sort it, which can work, but introduces problems with identifiability in the unconstrained space for HMC.

Discriminative regression?

I remember way back in the 2000s, before Andrew and I were working together, we met at a cafe at Columbia. I wanted to get his take on discriminative vs. generative modeling. He couldn’t understand what I meant. Turns out the Bayesian statisticians like Andrew think of everything as generative. The canonical “discriminative” model in machine learning is logistic regression. In Bayesian statistics, like in classical statistics, it generates data y given a parameter vector beta and covariates x. But we don’t typically model x. In the ML world, this was contrasted with the naive Bayes classifier, which did model x and was thus called “generative.” What Andrew always brings up is that we can model x if we want to, it just doesn’t affect our posterior for beta or y in a logistic regression (there’s a section of BDA that works through this in detail). Modeling the covariates only becomes useful when we have missing data or need to simulate both the independent and dependent variables. One upshot of this view is that you won’t see any semicolons in parametric densities in Andrew’s writing, like normal(y; mu, sigma).

Is PyMC a graphical modeling language?

The PyMC language is a graphical modeling language organized around forward sampling. At least I think it is—they never come out and say that anywhere I can find on their site. In general, I find these embedded languages like PyMC and JAX and NumPyro very confusing to understand conceptually, because the documentation tends to be example driven rather than specification driven. If someone knows a link to something that looks more like a specification for how PyMC works internally, I’d love to see it.

Constrained variables in PyMC

So let’s say you’re working in PyMC and want to use a positively constrained variable. If you use a lognormal, beta, Dirichlet, or Wishart distribution, how do you deal with the positive, interval, simplex, and positive definite constraints? In BUGS, and from what I’ve seen in PyMC, you deal with these by choosing among the hand-picked set of distributions with the appropriate constrained support.* The Wishart is built in, so you can write down in a generative fashion that Sigma has a Wishart distribution without ever having to declare that Sigma is positive definite—it’s handled behind the scenes somehow, so the model looks generative. And, of course, people have figured out Wishart random generation.

Constrained variables in Stan

In Stan, we just declare constrained variables and then apply a change-of-variables correction (specifically, we add the log absolute determinant of the Jacobian of the constraining transform to the target density). So for the positive ordered case, we start with unconstrained parameters and then take the cumulative sum of their exponentiation. The exponentiation is non-linear, so we need a change-of-variables correction, which is carried out under the hood. So in Stan, this looks like this:

parameters {
  positive_ordered[5] y;
}
model {
  y ~ normal(0, 1);
}

PyMC-style constraints in Stan

This is “generative” in the Bayesian sense in that we have a well-defined sampling distribution for y. What the PyMC folks were objecting to is that it’s not generative in the simple “forward modeling” sense. That is, you can’t working from their built-ins, generate an instance of y this way. Instead, PyMC works as follows when translated into Stan:

parameters {
  vector[5] y_unsorted;
}
transformed parameters {
  positive_ordered[5] y = sort_ascending(y_unsorted);
}

The problem here is that this is going to disconnect the raw parameters from the sorting and thus introduce some non-identifiability on the unconstrained scale.

But can we generate from these models in Stan? Not directly—we have to use MCMC. Mitzi Morris has been running into these problems with sum-to-zero constraints. They’re hard (but not impossible) to generate using anything other than MCMC, because you have to work out a forward way to generate a constrained variable.

It’s all about the built-ins

I stressed the built-ins, because PyMC is perfectly OK with the equivalent of this Stan model, which works exactly the same way as the ordered case in Stan,

parameters {
  real<lower=0>: y;
}
model {
  y ~ lognormal(0, 1);
}

As with the ordered case, we define a constrained variable y, which means under the hood we use an exponentiation transform and add the unconstrained variable to the log density (it’s the log of the absolute derivative of the exponent transform). Then we apply the lognormal density and sample with Hamiltonian Monte Carlo. The reason this second case is copacetic in the PyMC world is that they have a lognormal distribution built-in. If they had a normal_positive_ordered distribution built in, then the first model would be “generative” in the PyMC world, too. We happen to know how to easily generate lognormal variates—just generate normal variates and exponentiate them.


*Of course, you an always resort to the Poisson trick from BUGS, but I have a feeling the PyMC folks will think that’s cheating.

Bayesian statistics: the three cultures

Subjective Bayes

A traditional subjective Bayesian will assume a data generating distribution (aka a likelihood when viewed as a function of the parameters). Then, conditional on this assumption, they encode their prior beliefs as a prior distribution over parameters. After this, they turn the crank on posterior inference and don’t look back.

I’m not sure if anyone ever followed this philosophy strictly, nor do I know if anyone would register their affiliation as subjective Bayesian these days.

Objective Bayes

I believe the so-called “objective Bayes” philosophy stems from a combination of wanting to do hypothesis testing (perhaps using Bayes factors) and deeply feeling what Andrew calls the “Bayesian cringe.” To quote from a paper by one of its leading proponents (my highlights),

Reference analysis produces objective Bayesian inference, in the sense that inferential statements depend only on the assumed model and the available data, and the prior distribution used to make an inference is least informative in a certain information-theoretic sense.

In: Berger, Bernardo and Sun. 2009. The formal definiton of reference priors.

There are still lots of folks characterizing (at least some of) their work this way judging from the fact that they run conferences, publish books with “objective Bayes” in the title, etc. It’s in some sense behind the wide gamma(epsilon, epsilon) and normal(0, 10_000) priors employed in the BUGS examples.

Pragmatic Bayes

Andrew follows a philosophy that I call “pragmatic Bayes.” It’s laid out crisply in the first edition of Gelman, Carlin, Stern, and Rubin’s book Bayesian Data Analysis, the second paragraph of which begins as follows (my highlights).

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:

  1. Setting up a full probability model—a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.
  2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ul- timate interest, given the observed data.
  3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.

First, the probability model is joint, which means it includes the prior and the likelihood. Second, it characterizes our inputs as “knowledge” rather than “belief.” Third, it encourages practitioners to evaluate how well a model fits and its predictive consequences, and if there are issues, go back and try again. This process is what Andrew finally branded as “workflow.”

Standard operating procedure in engineering

This is standard operating procedure in engineering, where it is simply called “iterative design” . It’s the way almost all machine learning is done. Coming from a CS and ML background, I was shocked to learn that statisticians didn’t think this way.

Postscript on the strategy behind BDA

Andrew told me he made the explicit decision early on when writing the first edition of BDA to skip the philosophy and just “do” science. He and Rubin never characterized their iterative design process with a name.

I think that’s a wise choice, because it’s hard to characterize other peoples’ philosophical beliefs accurately and even harder to change them through debate. Or as editors like to tell writers, “show, don’t tell.” Or as I used to say back when I worked on linguistics and philosophy, if your scientific method needs a philosophy of science disclaimer, it’s probably not very convincing science.

Post2script on likelihood vs. prior

Although some of this discussion is about approaches to priors, I agree with Andrew that the choice of prior is no more or less “subjective” than the choice of likelihood. Andrew summed this up neatly in his post, “Straining on the gnat of the prior distribution while swallowing the camel that is the likelihood” .

Post3sript on belief vs. knowledge

From a philosophical perspective, I prefer a more epistemological characterization of priors and likelihoods in terms of “knowledge” rather than a doxastic characterization in terms of “belief”. This is the framing originally provided by Laplace and later explored more thoroughly by John Stuart Mill and followed by Gelman et al. in BDA.

Post4script on etymology behind this post

In 1959, C.P. Snow wrote what became a famous essay on the arts vs. sciences, titled “The two cultures.” In 2001, L. Breiman wrote an influential essay, titled “Statistical modeling: the two cultures.” [edit: Thanks for correct citation, Ben.] Breiman’s contrast is about the distinction between modeling a generative process explicitly versus using a very flexible model (“high capacity” in ML speak). Breiman championed decision forests in his work, and his approach is still winning Kaggle contests where there’s not enough data to fit a state-of-the-art neural network.

I wonder if Andrew considers decision forests and neural networks to be examples of what he called “the unfolding flower,” for example in this post.

Faculty and postdoc jobs in computational stats at Newcastle University (UK)

If you’re looking for a job in computational statistics, Newcastle is hiring one or two faculty positions and a postdoc position. The application deadline is in 10 days for the postdoc position and in a month for the faculty positions.

Close to the action

The UK (and France) are where much of the action is in MCMC, especially theory, in my world. There are great Bayesian computational statisticians in Newcastle, including Professor Chris Oates, and the department head, Professor Murray Pollock. In case you don’t know UK geography, Durham is a mere 30km down the road, with even more MCMC and comp stats researchers.

Faculty position(s)

These are lecturer and senior lecturer position, which is like assistant and associate professors in the U.S.

Lecturer Job ad

Application deadline: 18 July 2024

3 year postdoc position

Application deadline: 23rd June 2024

Postdoc job ad

About the postdoc

I’m at a conference with both Chris Oates and Murray Pollock in London right now. Murray just gave a really exciting talk on the topic of the postdoc, which is federated learning as part of the FUSION ERC project, which involves a bigger network of MCMC researchers including Christian Robert and Eric Moulines in Paris and Gareth Roberts at University of Warwick. They’re applying cutting edge diffusion processes (e.g., Brownian bridges) to recover exact solutions to the federated learning problem (where subsampled data sets, for instance from different hospitals, are fit independently and their posteriors are later combined without sharing all the data).

For more information

Contact: Murray Pollock (Murray.Pollock (at) newcastle.ac.uk)

HMC fails when you initialize at the mode

Nawaf Bou-Rabee was telling me and Chirag Modi that a high-dimensional normal initialized at the mode would be a disaster. I was skeptical as I’d never seen this behavior. So I dashed back to the computer and ran some simulations. Summary: Nawaf’s right! And the reason I hadn’t seen it is because NUTS handles the problem with variable adaptation during warmup.

Edit: I forgot the firsrt time to say why after I had written it into the PyMC Discourse response, but Daniel Lakeland reminded me. It’s because the mode is highest density, so all your proposals will be lower density. And the density goes down quickly in high dimensions as you move away from the mode. I have some illustrations of this in my case study on the curse of dimensionality.

Simulation results

I defined a 10,000-dimensional standard normal in Stan and used CmdStanPy to run Stan’s default NUTS for 4 chains of 1000 iterations each to give me 4000 draws.

If I initialize at the origin and fix the step size to 1 (which is stable for the leapfrog algorithm), all of the transitions diverge. I get an effective sample size (ESS) of 0.

If I initialize at the origin and fix the step size to 0.5, I get an ESS of about 20.

If I initialize randomly uniform(-2, 2) using the Stan default and fix the step size to 0.5, I get an ESS of about 2000 (in 4000 draws).

If I initialze at the origin, but let NUTS do its thing adapting step size, I get an ESS of about 9000. NUTS can use multiple step sizes during adaptation, so it can get away from the origin with a small step size and then crank it back up during the last phase of adaptation before sampling.

If I take Stan’s defaults, which is to let NUTS do its own thing adapting step size and initializing randomly uniform(-2, 2), I also get an ESS of about 9000.

Cool! The defaults for NUTS are very robust even in this nasty example.

PyMC provides this option and used to recommend it, but they must have done this ages ago and they now discourage initialization at the mode in their documentation.

Stan program

parameters {
  vector[10000] y;
}
model {
  y ~ normal(0, 1);
}

Python script

import cmdstanpy as csp
import numpy as np

N = 10000
y = np.zeros(N)
model = csp.CmdStanModel(stan_file='normal.stan')

# INIT ORIGIN, STEP SIZE >> 0.5
fit0 = model.sample(inits={'y': y}, step_size=1, iter_warmup=0, adapt_engaged=False)
fit0.summary()

# INIT ORIGIN, STEP SIZE = 0.5
fit1 = model.sample(inits={'y': y}, step_size=0.5, iter_warmup=0, adapt_engaged=False)
fit1.summary()

# INIT RANDOM, STEP SIZE = 0.5
fit2 = model.sample(step_size=0.5, iter_warmup=0, adapt_engaged=False)
fit2.summary()

# INIT ORIGIN, ADAPT STEP SIZE
fit3 = model.sample(inits={'y': y})
fit3.summary()

# NUTS DEFAULT INIT (uniform(-2, 2)), ADAPT STEP SIZE
fit4 = model.sample()
fit4.summary()

GPT today: Buffon’s Needle in Python with plotting (and some jokes)

Werner Krauth is visiting NYU’s Simons Center for Physics from ENS in Paris. He’s a physicist and the author of the 2006 book Statistical Mechanics: Algorithms and Computations. And he’s stopping by our reading group Fridays, where I learned about his most peculiar approach to plotting. He coded it in Postscript (!!! see below) and ran simulations on the printer. The result is quite beautiful. Here’s Figure 1.10 from the text.


Buffon's needles, by Werner Krauth, copyright 2006

I wondered how well GPT would do at this task. It’s pretty complicated, but then Buffon’s well represented on the internet. Let’s give it a shot. Here’s my prompt:

BC: From a book on molecular dynamics, “We want to follow in Count Buffon’s footsteps and and perform our own needle-throwing experiments. One of these, with 2000 samples, is shown in Fig. 1.10.” Write a Python program to generate those 2000 samples. Then take the 2000 samples and write code to plot them. Each needle should be displayed as a line connected to a needle eye with a small ball in the middle. There should be 17 “cracks” in the experiment and the ball should be solid if a needle overlaps a crack. The field overall is sized 20 high x 16 wide and has 17 vertical cracks. Each needle is unit length.

At which point, GPT gave me three blocks of code (see below), and when I ran it, this was the result.


Buffon's needles by ChatGPT4 May 23, 2024 (version 1)

This is actually better than I expected. You can see it got the distinction right of white circles versus red circles, but not quite the right visualization. So I asked it to try again with this prompt:

BC: That’s great, but there are several improvements I’d like to see.

1. Black and white rendering only.
2. Thinner lines and smaller circles and disks.
3. Plot the vertical cracks with thin lines.
4. Put a needle like shape (like an American football) on one end of each needle to give it an orientation.

I just need the final Plotting the Needles code for this.

Et voila.


Buffon's needles by ChatGPT May 23, 2024 (version 2)

Well, almost. It forgot to distinguish the ones that overlapped with circles rather than black disks.

This took a total of 2 minutes (plus 10 minutes to get into the blog, because it’s totally manual).

Here’s the code snippets it produced.

import numpy as np

def generate_needles(num_samples, field_width, field_height):
    # Generate random midpoints
    x = np.random.uniform(0, field_width, num_samples)
    y = np.random.uniform(0, field_height, num_samples)
    # Generate random angles from 0 to 2*pi
    theta = np.random.uniform(0, 2*np.pi, num_samples)
    
    return x, y, theta

# Constants
num_samples = 2000
field_width = 16
field_height = 20

x, y, theta = generate_needles(num_samples, field_width, field_height)

def check_overlaps(x, y, theta, field_width, num_cracks):
    crack_positions = np.linspace(0, field_width, num_cracks + 1)
    overlaps = np.zeros(len(x), dtype=bool)
    
    for i in range(len(x)):
        # Calculate endpoints
        x1 = x[i] + 0.5 * np.cos(theta[i])
        y1 = y[i] + 0.5 * np.sin(theta[i])
        x2 = x[i] - 0.5 * np.cos(theta[i])
        y2 = y[i] - 0.5 * np.sin(theta[i])
        
        # Check for overlap
        min_x = min(x1, x2)
        max_x = max(x1, x2)
        overlaps[i] = np.any((crack_positions > min_x) & (crack_positions < max_x))
        
    return overlaps

overlaps = check_overlaps(x, y, theta, field_width, 17)

import matplotlib.pyplot as plt

def plot_needles(x, y, theta, overlaps):
    fig, ax = plt.subplots(figsize=(8, 6))
    for i in range(len(x)):
        # Endpoints
        x1 = x[i] + 0.5 * np.cos(theta[i])
        y1 = y[i] + 0.5 * np.sin(theta[i])
        x2 = x[i] - 0.5 * np.cos(theta[i])
        y2 = y[i] - 0.5 * np.sin(theta[i])
        
        # Plot needle
        ax.plot([x1, x2], [y1, y2], 'k-')
        # Plot needle eye
        ax.plot(x[i], y[i], 'ro' if overlaps[i] else 'wo')
    
    ax.set_xlim(0, field_width)
    ax.set_ylim(0, field_height)
    ax.set_aspect('equal')
    plt.show()

plot_needles(x, y, theta, overlaps)

This is very neat and well organized code! Here's the followup code.

import matplotlib.pyplot as plt
import matplotlib.patches as patches

def plot_needles(x, y, theta, overlaps):
    fig, ax = plt.subplots(figsize=(8, 6))
    
    # Plot vertical cracks
    field_width = 16
    num_cracks = 17
    crack_positions = np.linspace(0, field_width, num_cracks + 1)
    for cp in crack_positions:
        ax.axvline(x=cp, color='gray', linestyle='-', linewidth=0.5)
    
    # Settings for needles and circles
    line_width = 0.5
    circle_radius = 0.05  # radius for midpoint marker
    needle_length = 1
    football_length = 0.2  # length of the "football" shape
    
    for i in range(len(x)):
        # Calculate endpoints
        x1 = x[i] + 0.5 * np.cos(theta[i])
        y1 = y[i] + 0.5 * np.sin(theta[i])
        x2 = x[i] - 0.5 * np.cos(theta[i])
        y2 = y[i] - 0.5 * np.sin(theta[i])
        
        # Draw needle
        ax.plot([x1, x2], [y1, y2], 'k-', lw=line_width)
        
        # Draw "American football" shape
        football_x = x[i] + 0.5 * np.cos(theta[i]) - football_length * np.cos(theta[i])
        football_y = y[i] + 0.5 * np.sin(theta[i]) - football_length * np.sin(theta[i])
        ellipse = patches.Ellipse((football_x, football_y), football_length, circle_radius, 
                                  angle=np.degrees(theta[i]), color='black')
        ax.add_patch(ellipse)
        
        # Draw needle eye
        color = 'black' if overlaps[i] else 'white'
        ax.plot(x[i], y[i], marker='o', markersize=circle_radius*30, markeredgecolor='black', markerfacecolor=color)
    
    ax.set_xlim(0, field_width)
    ax.set_ylim(0, field_height)
    ax.set_aspect('equal')
    ax.axis('off')  # Turn off the axis
    plt.show()

plot_needles(x, y, theta, overlaps)

What's really amazing here is that it recognizes "American football shape" and can convert that into Matplotlib code.

Getting better at humor

GPT's getting funnier. This is from Mitzi.

MM: please give me a joke with the punchline "all your base are belong to us"

GPT4: Why did the computer take over the entire internet?

Because it finally decoded the message: "All your base are belong to us."

Thank you sir, may I have another?

MM: that's pretty good. please give me another one.

ChatGPT: Why did the grammar teacher fail the evil overlord's essay?

Because it ended with "All your base are belong to us."

A good comic has a lot of material.

MM: another one?

GPT: Why did the spaceship's navigation system malfunction?

Because the aliens reprogrammed it to say, "All your base are belong to us."

And because it's that time of year.

MM: please make a baseball joke about this

GPT: Why did the opposing baseball team start panicking in the middle of the game?

Because the scoreboard flashed a message: "All your base are belong to us."

Mitzi says "home team" would have been funnier.

The Postscript postscript

Werner shared his postscript code for generating his plot. It runs as part of the postscript rendering. The relevant part is at the bottom starting at "4000".

%!PS-Adobe-2.0 EPSF-2.0
%%Title: IN_buffon_exp_new.eps
%%BoundingBox: 45 131 430 625
%%  Postscript figure from 
%   Statistical Mechanics: Algorithms and Computations
%                   Werner Krauth
%            Oxford University Press (2006)
%
/cm { 0.0351 div} def
/thicklinewidth 1.5 def
/normallinewidth 1.0 def    % 0.35377 mm or 1/72 inch 
/thinlinewidth 0.5 def
/white 1.0 def     %    0 % gray = white
/black .0 def      %   100% gray= black
/tinyradius 0.03 cm  def  %  
                           %
/dot {  % usage: x y radius dot (draws a dot (filled circle) of given radius in current grayvalue)
gsave
   0 360 arc
   gsave
      fill
   grestore
   gsave
      black setgray
      stroke
   grestore
grestore
} def
/circle {  % usage: x y radius circle (draws an unfilled circle of given radius)
           % the linewidth is not specified
gsave
   0 360 arc
   black setgray
   stroke
grestore
} def
/nran {    % usage: N nran (picks random number between 1 and N)
rand exch mod 1 add } def 
/ran01 {    % usage ran01  (picks random number between 0 and 1)
/xnorm 10000 def
rand xnorm mod cvr xnorm div } def 
%%%%%
%%%%% end common area of all Smac postscript figures.
%%%%%
/sc {0.75 mul} def
100 srand
/needle { % usage: angle x y needle, draws a needle, which for angle = 0 extends from
   % -0.5 sc cm to + 0.5 sc cm
   % allows different gray values. Version 04-JUN-03
   gsave
      translate
      rotate
      /l 0.7 sc cm def % factor setting lateral size of hole
      /x .2 sc cm def % 2 x: length of hole
      /y .6 sc cm def % length of shaft
      /radius l dup mul x dup mul add sqrt def
      /alpha x l atan def
%
% make the shaft
%
      thinlinewidth setlinewidth
      1 setlinejoin
      1 setlinecap
      gsave
         y 2 div x -1 mul add 0 translate
         y -1 mul 0 cm moveto
         0 sc cm 0 cm lineto
         gsave
            black setgray
            stroke
         grestore
%
% make the hole
%
         newpath
         x l -1 mul radius 90 alpha add 90 alpha sub arcn
         x l radius 270 alpha add 270 alpha sub arcn
% fill
         gsave
            black setgray
            stroke
         grestore
      grestore
      0 0 tinyradius 1.3 mul dot
   grestore
} def
2 cm 5 cm translate
/ymax 22 sc cm def
/xmax 18. cm def
/delx 1. sc cm def
/ncracks 18 def

thinlinewidth setlinewidth
gsave
200  thinlinewidth setlinewidth
   ncracks {0 sc cm -.5 sc cm moveto 0 sc cm ymax .5 sc cm add lineto stroke delx 0 translate} repeat
   stroke
grestore
white setgray
thinlinewidth setlinewidth
4000 {
   /angle {360 nran} def
   /xcenter {delx .5 mul ran01 mul} def
   /ycenter {ymax ran01 mul} def
   /xc xcenter def
   /an angle def
   xc an cos .5 sc cm mul abs sub 0 lt {black setgray}{white setgray} ifelse
   2 nran 1 gt {/xc delx xc sub def } if
   /xc xc ncracks 1 sub nran 1 sub delx mul add def
   an xc ycenter needle
} repeat
showpage

Mitzi’s and my talks in Trieste 3 and 4 June 2024 (yes, they’ll be broadcast)

I couldn’t resist quoting Andrew’s statistical lexicon in my abstract! It’d be great to see you in person if you’re nearby, but otherwise, the talks will go out via MS Teams (links below).


Prior and Posterior Probability Checks in Stan or How to Write a Stan Model to Simulate Data and Why You Should

Mitzi Morris, Columbia University, Statistics Department

A Stan model infers the model parameters given the data. The data-generating program reverses the computation: given parameters, it outputs the data according to the specified distributions. Often simulated datasets are created by one-off scripts in R or Python. Writing a data-generating program using Stan allows for more systematic exploration of the consequences of choice of prior and difficult data regimes.

11am, Tuesday, 4 June in Room 3A at DEAMS, University of Trieste, and available also from MS TEAMS at the following link: MS Teams link for talk


GIST: Gibbs self-tuning for locally adaptive Hamiltonian Monte Carlo

Bob Carpenter, Flatiron Institute, Center for Computational Mathematics

I will present a novel and flexible framework for localized tuning of Metropolis samplers, including Hamiltonian Monte Carlo (HMC). In the Gibbs self tuning (GIST) framework, an algorithm’s tuning parameters are Gibbs sampled each iteration conditioned on the current position and momentum. For adaptively sampling path lengths, I will show that randomized integration time Hamiltonian Monte Carlo, the no-U-turn sampler, and the apogee-to-apogee path sampler all fit within this unified framework as special cases. I’ll provide two examples. One is a multinomial form of randomized bidirectional HMC with a 100% acceptance rate. The second is a much simpler alternative to the no-U-turn sampler for locally adapting path lengths. In all of these samplers, correctness depends on simulating the Hamiltonian dynamics forward and backward randomly in time. The key to making local tuning practical is randomization (aka uncertainty), which as Andrew Gelman likes to say, “greases the wheels of commerce.” I will conclude with a discussion of the opportunity this framework presents for adapting HMC’s step size and mass matrix.

Joint work with Nawaf Bou-Rabee (Rutgers, Camden), Milo Marsden (Stanford), Edward Roualdes (Cal State, Chico), Chirag Modi (Flatiron Institute), and Gilad Turok (Flatiron Institute)

4pm, Monday, 3 June in room 1A (‘Aula Conferenze Bruno de Finetti ‘), at Department of Business, Economics, Mathematics, and Statistics (DEAMS), University of Trieste, Via Valerio 4/1, 34127, Trieste. Available via MS Teams at the following link: MS Teams link for talk


MCMC draws cannot fill the posterior in high dimensions

Word on the street

I often hear people say things like “We do MCMC to get a full representation of the posterior.” Their intuition seems to be that MCMC is going to take a set of draws that covers the posterior in some way.

Misleading textbook examples

I’m just as guilty of this as everyone else. The textbook examples are in 1D or 2D. We throw darts at a unit square and keep the ones in an inscribed circle to calculate pi, for example, or we sample independently form a 1D distribution and fill in a histogram (in the limit of infinitely many draws and infinitesimal bins, the histogram approaches the density). In low dimensions, we can effectively “cover” the posterior.

A simple counterexample in high dimensions

This is not how things work in high dimensions. There’s no way to get enough draws to even remotely fill the volume of the posterior. Let’s suppose we have an N-dimensional standard normal y ~ normal(0, I). There are 2**N different quadrants in which a draw is equally likely to land, representing the signs of the numbers y[1], y[2], …, y[N]. If N = 20, there are more than one million quadrants. So we won’t even get a draw in each quadrant in 20 dimensions, much less a set of draws that in any sense “covers” or “fills” the posterior.

Luckily, we don’t need coverage

Our goal isn’t to fill the posterior’s volume with closely packed draws. That’s a relief, because that goal’s clearly impossible in high dimensions. Instead, our goal is to compute posterior expectations of the form E[f(Theta) | y] for parameters Theta and data y. That we can do effectively with a sample that does not cover the posterior. In fact, we can usually get away with something like an effective sample size of 100, at which point the standard error in the expectation is 10% of the posterior standard deviation.

Job Ad: Spatial Statistics Group Lead at Oak Ridge National Laboratory

Robert Stewart, of Oak Ridge National Lab (ORNL), who we met at StanCon, is looking to fill the following role:

It’s a research group leader position with an emphasis on published research that’s relevant for the Department of Energy (the group that runs the national labs). Robert came and gave a talk here at Flatiron Institute and the scale and range of problems they’re tasked with solving is really interesting (e.g., inferring the height, construction, and use of every human-constructed building in the world). They’ve got access to the computing to tackle this kind of thing, too.

After meeting Robert, I invited him out to Flatiron Institute to give a talk to our group (Center for Computational Mathematics). They’re working on a fascinating range of problems around climate science, energy, etc. In all cases, there is a challenging mix of measurement uncertainty, missing data, and prior knowledge to combine.

ORNL is looking for someone with a Ph.D. and 6 years of experience—that is, about the point where you’d get tenure in academia. Unlike academia, they also list an M.S. with 15 years of experience as an alternative.

The visa sponsorship” section says “not available”, so I assume that means you have to be a U.S. citizen.

Evaluating MCMC samplers

I’ve been thinking a lot about how to evaluate MCMC samplers. A common way to do this is to run one or more iterations of your contender against a baseline of something simple, something well understood, or more rarely, the current champion (which seems to remain NUTS, though we’re open to suggestions for alternatives).

Reporting comparisons of averages (and uncertainty)

Then, what do you report? What I usually see is a report of averages over runs, such as average effective sample size per gradient eval. Sometimes I’ll see medians, but I like averages better here as it’s a fairer indication of expected time over multiple runs. What I don’t often see is estimates of uncertainty in the estimate of the average (std error) or across runs (std deviation). I can’t remember ever seeing someone evaluate something like the probability that one sampler is better than another for a problem, though you do occasionally see those “significance” asterisks (or at least you used to) in ML or search evaluations.

NUTS ESS has high variance

The range of errors we get from running NUTS on even mildly difficult problems implies variations of effective sample size of an order of magnitude. This variance is a huge problem for evaluation. If you follow Andrew Gelman’s advice, then you will run multiple chains and wait for them all to finish. That makes the run time an order statistic which is going to depend strongly on the cross-chain variance.

Time-bound runs?

In the past, Andrew asked about running for a fixed time rather than a fixed number of iterations, but we were worried about inducing bias. Chains tend to slow down in speed when they adapt to a too-small step size or when they wander around big flat regions for too long. It’d be nice if we could fulfill Andrew’s request to figure out how to combine timed runs in a way that doesn’t introduce bias. Suggestions more than welcome! If you run 1024 chains in the style of Matt Hoffman and Charles Margossian, then the order statistics for timing become very sensitive to tails.

Within-class versus across-class variation

This is a classic statistical comparison problem. In the sampler I’m evaluating now that is inspiring me to rethink evals, the within-group variance of NUTS and the within-group variance of the contender is greater than the difference between the mean time of NUTS and the mean time of the contender.

Distributions of ratios?

So I’m thinking that rather comparing ratios of average run times, we should look at histograms of ratios of draws from NUTS’s statistics and draws for the contender’s statistics. This will give us a much wider range of values, which for some of which will have NUTS taking twice as long and for some of them will have the contender taking twice as long. Luckily, it shouldn’t be as bad as a ratio of two normals, as we don’t expect values near zero in the denominator giving us massive values like in generating Cauchy variates (as the ratio of two normals). Averaging ratios doesn’t make a lot of sense because it’s sensitive to direction, but averaging log odds might make sense, because it’s symmetric around even odds.

Just show everything

We don’t need to try to reduce things to a single statistic. I often argued when I was in NLP that I’d rather see a confusion matrix than an accuracy (sensitivity, F-measure, etc.) and I’d rather see the ROC curve than just get the single value for AUC. These reductions to single statistics are very lossy and a single number is often a poor summary of a complex multivariate situation. And in NLP, a client often has a specific operating point in mind (high sensitivity, high specificity, etc.).

RMSE of standardized parameters

What I’m currently doing is showing the boxplots of root mean square error for standardized parameters. Standardizing is a trick I learned from Andrew that brings everyone’s scales into alignment. That is, standard error is just standard deviation divided by the square root of effective sample size. Scaling the parameters normalizes all parameters to have a standard deviation of 1, so they’ll be weighted equally when comparing results.

Why not minimum ESS?

The problem with minimum ESS is that it’s an extreme order statistic in high dimensional models, and is thus going to be difficult to estimate. It also ignores the errors on all the other parameters.

Direct standard error calculation

If you know the true value and can perform multiple estimation runs, you can calculate standard error empirically as the standard deviation of the errors. Given the standard error, we can back out expected ESS from the relation that standard error is standard deviation divided by the square root of ESS. I learned that technique from Aki Vehtari, who used it to calibrate our new ESS estimators that discount for approximate convergence (actually a few years old as of now).

Something multivariate?

Maybe I should be thinking of something that’s more multivariate, like the multivariate ESS from Vats, Flegal, and Jones? I’m afraid that would introduce a scaling problem in dimensionality.

GIST: Gibbs self-tuning for HMC

I’m pleased as Punch to announce our new paper,

We followed the mathematician alphabetical author-ordering convention.

The basic idea

The basic idea is so simple, I’m surprised it’s not more popular. The general GIST sampler couples the HMC algorithm tuning parameters (step size, number of steps, mass matrix) with the position (model parameters) and momentum. Each iteration of the Markov chain, we resample momentum using a Gibbs step (as usual in HMC), then we sample the tuning parameters in a second Gibbs step, conditioning on the current position and resampled momentum. The proposal is generated by the leapfrog algorithm using the sampled tuning parameters. The accept/reject step then uses the ratio of the joint densities, which now includes the tuning parameters. Of course, we need to specify a conditional distribution of tuning parameters to make this concrete.

The same idea could be applied to other Metropolis samplers.

Some prior art

We were inspired by a combination of NUTS and some negative results from Nawaf regarding acceptance bounds in delayed rejection generalized HMC (more on that later).

Radford Neal used the coupling idea for the Metropolis acceptance probability in his recent paper Non-reversibly updating a uniform [0,1] value for Metropolis accept/reject decisions. This is for generalized HMC, which is one-step HMC with partial momentum refresh. The partial refresh means the momentum flips in HMC matter and will typically thwart directed movement. Neal’s appraoch groups the acceptances together so that generalized HMC can make directed movement like HMC itself. Although this easily fits within the GIST framework, the proposal for acceptance probability doesn’t depend on the current position or momentum.

We (Chirag Modi, Gilad Turok and I) are working on a follow-up to my recent paper with Chirag and Alex Barnett on delayed rejection HMC. The goal there was to properly sample multiscale and varying scale densities (i.e., “stiff” Hamiltonians). In the follow up, we’re using generalized HMC with delayed rejection rather than Radford’s coupling idea, and show it can also generate directed exploration. The paper’s drafted and should be out soon. Spoiler alert! DR-G-HMC works better than DR-HMC in its ability to adjust to local changes in scale, and it’s competitive with NUTS in some cases where there are varying scales, but not in most problems.

Algorithms that randomize the number of steps (or stepsize), like those in Changye Wu and Christian Robert’s Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale, can also be seen as an instance of GIST. From this view, they are coupling the number of steps and sampling that number from a fixed uniform distribution each iteration. This doesn’t condition on the current position. The general framework immediately suggests biasing those draws toward longer jumps, for example by sampling uniformly from the second half of the trajectory.

Sam Livingstone was visiting Flatiron Institute for the last couple of weeks, and when I ran the basic idea by him, he suggested I have a look at the paper by Chris Sherlock, Szymon Urbas, and Matthew Ludkin, The apogee to apogee path sampler. This is a really cool idea that uses U-turns in potential energy (negative log density), which is univariate, rather than in position space. They evolve the Hamiltonian forward and backward in time for a given number of U-turns in log density, and then sample from among the points on the path (like NUTS does). One issue is that a regular normal distribution on momentum can have trouble sampling through the range of log densities (see, e.g., Sam et al.’s paper on kinetic energy choice in HMC). Because we were going to be discussing his paper in our reading group, I wrote to Chris Sherlock and he told me they have been working on the apogee to apogee idea since before NUTS was released! It’s a very similar idea, with similar forward/backward in time balancing. The other idea it shares with NUTS is a biasing toward longer jumps—this is done in a really clever way that we can borrow for GIST. Milo and Nawaf figured out how NUTS and the apogee-to-apogee sampler can be fit into the GIST framework, which simplifies their correctness proofs. Once you’ve defined a proper conditional distribution for GIST, you’re done. The apogee-to-apogee paper is also nice in evaluating a randomized stepsize version of HMC they call “blurred” HMC (and which of course, fits into the general GIST framework the same way the Wu and Robert sampler does).

If you know of other prior art or examples, we’d love to hear about them.

Our concrete alternative to NUTS

The alternative to NUTS we discuss in the GIST paper proposes a number of steps by iterating the leapfrog algorithm until a U-turn, then randomly sampling along the trajectory (like the improvement to the original NUTS that Michael Betancourt introduced or the apogee-to-apogee sampler). We then use a crude biasing mechanism (compared to the apogee-to-apogee sampler or NUTS) toward longer paths. That’s it, really. If you look at what that means for evaluating, we have to run a trajectory from the point sampled backwards in time until a U-turn—you don’t really get away from that forward-and-backward thing in any of these samplers.

We evaluate mean-square jump distance, rejections, and errors on parameter estimates and squared parameter estimates. It’s a little behind NUTS’s performance, but mostly in the same ballpark. In most cases, the variance among NUTS runs was greater than the difference between the mean NUTS and new algorithm run times. The evals demonstrate why it’s necessary to look at both parameter and squared parameter estimates. As draws become more anti-correlated, which happens by maximizing expected square jump distance, estimates for parameters improve, but error goes way up on estimates of squared parameters. I provide an example in this Stan forum post, which was inspired by a visit with Wu and Robert to discuss their randomized steps algorithm. Also, before we submit to a journal, I need to scale all the root mean-square-error calculations to Z scores.

What we’re excited about here is that it’s going to be easy to couple step size adaptation. We might even be able to adapt the mass matrix this way and get a cheap approxmation to Riemannian HMC.

Try it out

The evaluation code is up under an MIT License on my GitHub repo adaptive-hmc. I’m about to go add some more user-facing doc on how to run the evaluations. I’m really a product coder at heart, so I always find it challenging to hit the right level of doc/clarity/robustness/readability with research code. To get log densities and gradients from Stan models to develop our algorithm, we used BridgeStan.

I’m always happy to get suggestions on improving my code, so feel free to open issues or send me email.

Thank you!

This project would’ve been much harder if I hadn’t gotten feedback on the basic idea and code from Stan developer and stats prof Edward Roualdes. We also wouldn’t have been able to do this easily if Edward hadn’t developed BridgeStan. This kind of code is so off-by-one, numerator-vs-denominator, negative vs. positive, log vs. exponent mistake prone, that it makes my head spin. It doesn’t help that I’ve moved from R to Python, where the upper bounds are exclusive! Edward found a couple of dingers in my original code. Thanks also to Chirag for helping me understand the general framework and on the code. Both Edward and Chirag are working on better alternatives to our simple alternative to NUTS, which will be showing up in the same adaptive HMC repo—just keep in mind this is all research code!

What’s next?

Hopefully we’ll be rolling out a bunch of effective GIST samplers soon. Or even better, maybe you will…

P.S. In-person math and computing

The turnaround time on this paper from conception to arXiv is about the fastest I’ve ever been involved with (outside of simple theoretical notes I used to dash out). I think the speed is from two things: (1) the idea is easy, and (2) Milo, Nawaf and I spent four days together at Rutgers about a month ago working pretty much full time on this paper (with some time outs for teaching and talks!). We started with a basic idea, then worked out all the theory and developed the alternative to NUTS and tried some alternatives over those four days. It’s very intense working like that, but it can be super productive. We even triple coded on the big screen as we developed the algorithm and evaluated alternatives. Then we divided the work of writing the paper cleanly among us—as always, modularity is the key to scaling.

“Close but no cigar” unit tests and bias in MCMC

I’m coding up a new adaptive sampler in Python, which is super exciting (the basic methodology is due to Nawaf Bou-Rabee and Tore Kleppe). Luckily for me, another great colleague, Edward Roualdes, has been keeping me on the straight and narrow by suggesting stronger tests and pointing out actual bugs in the repository (we’ll open access it when we put the arXiv paper up—hopefully by the end of the month).

There are a huge number of potential fencepost (off by one), log-vs-exponential, positive-vs-negative, numerator-vs-denominator, and related errors to make in this kind of thing. For example, here’s a snippet of the transition code.

L = self.uturn(theta, rho)
LB = self.lower_step_bound(L)
N = self._rng.integers(LB, L)
theta_star, rho_star = self.leapfrog(theta, rho, N)
rho_star = -rho_star
Lstar = self.uturn(theta_star, rho_star)
LBstar = self.lower_step_bound(Lstar)
if not(LBstar <= N and N < Lstar):
    ... reject ...

Looks easy, right? Not quite. The uturn function returns the number of steps to get to a point that is one step past the U-turn point. That is, if I take L steps from (theta, rho), I wind up closer than to where I started than if I take L - 1 steps. The rng.integers function samples uniformly, but it’s Python, so it excludes the upper bound and samples from {LB, LB + 1, .., L - 1} . That’s correct, because I want to choose a number of steps greater than 1 and less than the point past which you’ve made a U-turn. Let’s just say I got this wrong the first time around.

Because it’s MCMC and I want a simple proof of correctness, I have to make sure the chain’s reversible. So I see how many steps to get one past a U-turn coming back (after momentum flip), which is Lstar. Now I have to grab its lower bound, and make sure that I take a number of steps between the lower bound (inclusive) and upper bound (exclusive). Yup, had this wrong at one point. But the off-by-one error shows up in a position that is relatively rare given how I was sampling.

For more fun, we have to compute the acceptance probability. In theory, it’s just p(theta_star, rho_star, N) / p(theta, rho, N) in this algorithm, which looks as follows on the log scale.

log_accept = (
    self.log_joint(theta_star, rho_star) - np.log(Lstar - LBstar)
    - (log_joint_theta_rho - np.log(L - LB))
)

That’s because p(N | theta_star, rho_star) = 1 / (Lstar - LBstar) given the uniform sampling with Lstar excluded and LBstar included. But then I substituted the uniform distribution for a binomial, and made the following mistake.

log_accept = (
  self.log_joint(theta_star, rho_star) - self.length_log_prob(N, Lstar)
  - (log_joint_theta_rho - self.length_log_prob(N, L))
)

I only had the negation in -np.log(L - LB) because it was equivalent to np.log(1 / (L - LB)) with a subtraction instead of a division. Luckily Edward caught this one in the code review. I should’ve just coded the log density and added it rather than subtracted it. Now you’d think this would lead to an immediate and glaring bug in the results because MCMC is a delicate algorithm. In this case, the issue is that (N - L) and (N - Lstar) are identically distributed and only range over values of roughly 5 to 7. That’s a minor difference in a stochastic acceptance probability that’s already high. How hard was this to detect? With 100K iterations, everything looked fine. With 1M iterations, the estimates of parameters continued to follow a 1 / sqrt(iterations) trend in error, but showed the estimates of parameters squared asymptotic with residual error only after 100K iterations. That is, it required 1M iterations and an evaluation of the means of squared parameters to detect this bug.

I then introduced a similar error when I went to a binomial number of steps selection. I was using sp.stats.binom.logpmf(N, L, self._success_prob) when I should have been using sp.stats.binom.logpmf(N, L - 1, self._success_prob). As an aside, I like SciPy’s clear naming here vs. R’s dbinom(log.p = True, ...). What I don’t like about Python is that the discrete uniform doesn’t include its endpoint. Of course, the binomial includes its endpoint as an option, so these two versions need to be coded off by 1. Of course, I missed the L - 1. This only introduced a bug because I didn’t do the matching adjustment in testing whether things were reversible. That’s if not(1 <= N and N < Lstar) to match the Lstar - 1 in the logpmf() call. If I ran it all the way to L, then I would've needed N <= Lstar. This is another subtle difference that only shows up after more than 100K iterations.

We introduced a similar problem into Stan in 2016 when we revised NUTS to do multinomial sampling rather than slice sampling. It was an off-by-one error on trajectory length. All of our unit tests of roughly 10K iterations passed. A user spotted the bug by fitting a 2D correlated normal with known correlation for 1M iterations as a test and realizing estimates were off by 0.01 when they should've had smaller error. We reported this on the blog back when it happened, culminating in the post Michael found the bug in Stan's new sampler.

I was already skeptical of empirical results in papers and this is making me even more skeptical!

P.S. In case you don't know the English idiom "close but no cigar", here's the dictionary definition from Cambridge (not Oxford!).

Intelligence is whatever machines cannot (yet) do

I had dinner a few nights ago with Andrew’s former postdoc Aleks Jakulin, who left the green fields of academia for entrepreneurship ages ago. Aleks was telling me he was impressed by the new LLMs, but then asserted that they’re clearly not intelligent. This reminded me of the old saw in AI that “AI is whatever a machine can’t do.”

In the end, the definition of “intelligent” is a matter of semantics. Semantics is defined by conventional usage, not by fiat (the exception seems to be an astronomical organization trying to change the definition of “planet” to make it more astronomically precise). We do this all the time. If you think about what “water” means, it’s incredibly vague. In the simplest case, how many minerals can it contain before we call it “mud” rather than “water”? Does it even have to be made of H20 if we can find a clear liquid on an alternative earth that will nourish us in the same way (this is a common example in philosophy from Hilary Putnam, I believe)? When the word “water” was first introduced into English, let’s just say that our understanding of chemistry was less developed than it is now. The word “intelligent” is no different. We’ve been using the term since before computers, and now we have to rethink what it means. By convention, we could decide as a group of language users to define “intelligent” however we want. Usually such decisions are guided by pragmatic considerations (or at least I’d like to think so—this is the standard position of pragmatist philosophers of language, like Richard Rorty). For instance, we could decide to exclude GPT because (a) it’s not embodied in the same way as a person, (b) it doesn’t have long-term memory, (c) it runs on silicon rather than cells, etc.

It would be convenient for benchmarking if we could fix a definition of “intelligence” to work with. What we do instead is just keep moving the bar on what counts as “intelligent.” I doubt people 50 years ago (1974) would have said you can play chess without being intelligent. But as soon as Deep Blue beat the human chess champion, everyone changed their tune and the chorus became “chess is just a game” and “it’s finite” and “it has well defined rules, unlike real life.” Then when IBM’s Watson trounced the world champion at Jeopardy!, a language based game, it was dismissed as a parlor trick. Obviously because a machine can play Jeopardy!, the reasoning went, it doesn’t require intelligence.

Here’s the first hit on Google I found searching for something like [what machines can’t do]. This one’s in a popular magazine, not the scientific literature. It’s the usual piece in the genre of “ML is amazing, but it’s not intelligent because it can’t do X”.

Let’s go over Toews’s list of AI’s failures circa 2021 (these are direct quotes).

  1. Use “common sense.” A man went to a restaurant. He ordered a steak. He left a big tip. If asked what the man ate in this scenario, a human would have no problem giving the correct answer—a steak. Yet today’s most advanced artificial intelligence struggles with prompts like this.
     
  2. Learn continuously and adapt on the fly. Today, the typical AI development process is divided into two distinct phases: training and deployment.
     
  3. Understand cause and effect. Today’s machine learning is at its core a correlative tool. It excels at identifying subtle patterns and associations in data. But when it comes to understanding the causal mechanisms—the real-world dynamics—that underlie those patterns, today’s AI is at a loss.
     
  4. “Reason ethically…In 2016, Microsoft debuted an AI personality on Twitter named Tay. The idea was for Tay to engage in online conversations with Twitter users as a fun, interactive demonstration of Microsoft’s NLP technology. It did not go well. Within hours, Internet trolls had gotten Tay to tweet a wide range of offensive messages: for instance, “Hitler was right” and “I hate feminists and they should all die and burn in hell.”

(1) ChatGPT-4 gets these common-sense problems mostly right. But it’s not logic. The man may have ordered a steak, gotten it, sent it back, ordered the fish instead, and still left a big tip. This is a problem with a lot of the questions posed to GPT about whether X follows from Y. It’s not a sound inference, just the most likely thing to happen, or as we used to say, the “default.” Older AIs were typically designed around sound inference and weren’t so much trying to emulate human imprecision (having said that, my grad school admissions essay was about and my postdoc was funded by a grant on default logics back in the 1980s!).

(2) You can do in-context learning with ChatGPT, but it doesn’t retain anything long term without retraining/fine tuning. It will certainly adapt to its task/listener on the fly throughout a conversation (arguably the current systems like ChatGPT adapt to their interlocuter too much—it’s what they were trained to do via reinforcement learning). Long-term memory is perhaps the biggest technical challenge to overcome, and it’s been interesting to see people going back to LSTM/recursive NN ideas (transformers, the neural net architecture underlying ChatGPT, were introduced in a paper titled “Attention is all you need”, which used long, but finite memory).

(3) ChatGPT 4 is pretty bad at causal inference. But it’s probably above the bar for what Toews’s complaints. It’ll get simple “causal inference” right the same way people do. In general, humans are pretty bad at causal inference. We are way too prone to jump to causal conclusions based on insufficient evidence. Do we classify baseball announcers as not intelligent when they talk about how a player struggles with high pressure situations after N = 10 plate appearances in the playoffs? We’re also pretty bad at reasoning about things that go against our preconceptions. Do we think Fisher was not intelligent because he argued that smoking didn’t cause cancer? Do we think all the anthropogenic global warming deniers are not intelligent? Maybe they’re right and it’s just a coincidence that temps have gone up coinciding with industrialization and carbon emissions. Seems like a highly suspicious coincidence, but causation is really hard when you can’t do randomized controlled trials (and even then it’s not so easy because of all the possible mediation).

(4) How you call this one depends on whether you think the front-line fine-tuning of ChatGPT made a reasonably helpful/harmless/truthful bot or not and whether the “ethics” it was trained with are yours. You can certainly jailbreak even ChatGPT-4 to send it spiraling into hate land or fantasy land. You can jailbreak some of my family in the same way, but I wouldn’t go so far as to say they weren’t intelligent. You can find lots of folks who think ChatGPT is too “woke”. This is a running theme on the GPT subreddit. It’s also a running theme among anti-woke billionaires, as reflected in the UK’s Daily Telegraph article title, “ChatGPT may be the next big thing, but it’s a biased woke robot.”

I’ve heard a lot of people say their dog is more intelligent than ChatGPT. I suppose they would argue for a version of intelligence that doesn’t require (1) or (4) and is very tolerant of poor performance in (2) and (3).

You probably don’t have a general algorithm for an MLE of Gaussian mixtures

Those of you who are familiar with Garey and Johnson’s 1979 classic, Computers and Intractability: a guide to the theory of NP-completeness, may notice I’m simply “porting” their introduction, including the dialogue, to the statistics world.

Imagine Andrew had tasked me and Matt Hoffman with fitting simple standard (aka isotropic, aka spherical) Gaussian mixtures rather than hierarchical models. Let’s say that Andrew didn’t like that K-means got a different answer every time he ran it, K-means++ wasn’t much better, and even using soft-clustering (i.e., fitting the stat model with EM) didn’t let him replicate simulated data. Would we have something like Stan for mixtures. Sadly, no. Matt and I may have tried and failed. We wouldn’t want to go back to Andrew and say,

  • “We can’t find an efficient algorithm. I guess we’re just too dumb.”

We’re computer scientists and we know about proving hardness. We’d like to say,

  • “We can’t find an efficient algorithm, because no such algorithm is possible.”

But that would’ve been beyond Matt’s and my grasp, because, in this particular case, it would require solving the biggest open problem in theoretical computer science. Instead, it’s almost certain we would have come back and said,

  • “We can’t find an efficient algorithm, but neither can all these famous people.”

That seems weak. Why would we say that? Because we could’ve proven that the problem is NP-hard. A problem is in the class P if it can be solved in polynomial time with a deterministic algorithm. A problem is in the class NP when there is a non-deterministic (i.e., infinitely parallel) algorithm to solve it in polynomial time. It’s NP-hard if it’s just as hard as any other NP algorithm (formally specified through reductions, a powerful CS proof technique that’s the basis of Gödel’s incompleteness theorem). An NP-hard algorithm often has a non-deterministic algorithm to solve it makes a complete set of (exponentially many) guesses in parallel and then spends polynomial time on each one verifying whether or not it is a solution. An algorithm is NP-complete if it is NP-hard and a member of NP. Some well known NP-complete problems are bin packing, satisfiability in propositional logic, and the traveling salesman problem—there’s a big list of NP-complete problems.

Nobody has found a tractable algorithm to solve an NP-hard problem. When we (computer scientists) say “tractable,” we mean solvable in polynomial time with a deterministic algorithm (i.e., the problem is in P). The only known algorithms for NP-hard problems are exponential. Researchers have been working for the last 50+ years trying to prove that the class of NP problems is disjoint from the class of P problems.

In other words, there’s a Turing Award waiting for you if you can actually turn response (3) into response (2).

In the case of mixtures of standard (spherical, isotropic) Gaussians there’s a short JMLR paper with a proof that maximum likelihood estimation is NP-hard.

And yes, that’s the same Tosh as who was the first author of the “piranha” paper.

Ising models that are not restricted to be planar are also NP-hard.

What both these problems have in common is that they are combinatorial and require inference over sets. I think (though am really not sure) that one of the appeals of quantum computing is potentially solving NP-hard problems.

P.S. How this story really would’ve went is that we would’ve told Andrew that some simple distributions over NP-hard problem instances lead to expected polynomial time algorithms and we’d be knee-deep in the kinds of heuristics used to pack container ships efficiently.

ISBA 2024 Satellite Meeting: Lugano, 25–28 June

Antonietta Mira is organizing a satellite workshop before ISBA. It’s free, there is still time to submit a poster, and it’s a great excuse to visit Lugano. Here are the details:

I really like small meetings like this. Mitzi and I are going to be there and then continue on to ISBA.

Varying slopes and intercepts in Stan: still painful in 2024

Andrew recently blogged the following: Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson. This is the kind of model Andrew et al. used for one example in Red State, Blue State, which is the varying effect of income on Republican preference by state. Each state had its own slope and intercept related with a multivariate hierarchical prior. The version in Gelman and Hill’s regression book is a hack that tried to scale an inverse Wishart; the LKJ is what they would have used if Ben Goodrich had created it at that point.

Andrew points to a tutorial on Bayesian varying effects models from Will Hipson, which is really nice in the way it steps through workflow, building up the model in stages. The model Hipson develops is an improvement on what we have in our User’s Guide. After everything else, I circle back and talk about doc, trying to connect it to my recent post on why doc is so dangerous.

I think we can do a bit better in the current verison of Stan, but I have to confess up front that Andrew’s right—this is still painful. This took me around three hours to put together the model and simulations and blog post and I’m the one who designed the language! This would’ve been much faster if I wasn’t trying to bring it up to a “publishable” standard as an example of how I like to see Stan code written.

The original Stan model

Here’s Will Hipson’s model:

data {
  int N_obs; // number of observations
  int N_pts; // number of participants
  int K; // number of predictors + intercept
  int pid[N_obs]; // participant id vector
  matrix[N_obs, K] x; // matrix of predictors
  real y[N_obs]; // y vector
}

parameters {
  matrix[K, N_pts] z_p; // matrix of intercepts and slope
  vector[K] sigma_p; // sd for intercept and slope
  vector[K] beta; // intercept and slope hyper-priors
  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
  real sigma; // population sigma
}

transformed parameters {
  matrix[K, N_pts] z; // non-centered version of beta_p
  z = diag_pre_multiply(sigma_p, L_p) * z_p; 
}

model {
  vector[N_obs] mu;
  
  // priors
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  sigma_p ~ exponential(1);
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N_obs) {
    mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
  }
  y ~ normal(mu, sigma);
}

generated quantities {
  matrix[2, 2] Omega;
  Omega = multiply_lower_tri_self_transpose(L_p);
}

Warning: There’s a bug in this code in that it only handles the K = 2 case. You can see this with the 1 and 2 hardcoded in the definition of mu[i].

My Stan model

The documentation for the model is at the top of the Stan code, then the Stan code only has a single line of doc other than explanations of the variables (which I wouldn’t include in non-tutorial code, just to link this back to what I was saying a few posts ago about comments).

/**
 * Varying slopes and intercept hierarchical linear regression.
 * N observations organized into J groups, with jj[n] being the group
 * and x[n, 1:K] the covariates for observation n.  The covariate
 * matrix x should include a column of 1s to include a slope.
 * 
 * The slopes and intercept per group have a multivariate normal prior
 * and the scale has an exponential prior.  The location of the
 * multivariate normal prior has a standard normal hyperprior and its
 * covariance is decomposed into a correlation matrix with an LKJ
 * hyperprior and a scale vector with an exponential hyperprior. In
 * symbols: 
 *
 * Likelihod:
 *   y[n] ~ normal(x[n] * beta[1:K, jj[n]], sigma) for n in 1:N
 *
 * Priors:
 *   sigma ~ exponential(1)
 *   beta[1:K, j] ~ multi_normal(nu, Sigma) for j in 1:J
 * 
 * Hyperpriors:
 *   nu ~ normal(0, 1)
 *   scale(Sigma) ~ exponential(1)
 *   corr(Sigma) ~ lkj(2)
 *
 * where scale(Sigma) is the scale vector and corr(Sigma) is the
 * correlation matrix of Sigma.
 *
 * For efficiency and numerical stability, the covariance and
 * correlation matrices are Cholesky factored.
 */
data {
  int<lower=0> J;                      // number of groups
  int<lower=0> N;                      // number of observations
  array[N] int<lower=1, upper=J> jj;   // group per observation
  int<lower=1> K;                      // number of covariates
  matrix[N, K] x;                      // data matrix
  vector[N] y;                         // observations
}
parameters {
  vector[K] nu;                        // location of beta[ , j]
  vector<lower=0>[K] tau;              // scale of beta[ , j]
  cholesky_factor_corr[K] L_Omega;     // Cholesky of correlation of beta[ , j]
  matrix[K, J] beta_std;               // standard beta (beta - nu) / Sigma
  real<lower=0> sigma;                 // observation error for y
}
transformed parameters {
  matrix[K, J] beta = rep_matrix(nu, J)
                      + diag_pre_multiply(tau, L_Omega) * beta_std;
}
model {
  nu ~ normal(0, 1);
  tau ~ exponential(1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(beta_std) ~ normal(0, 1);  // beta[ , j] ~ multi_normal(nu, Sigma)
  sigma ~ exponential(1);
  y ~ normal(rows_dot_product(x, beta[ , jj]'), sigma);
}
generated quantities {
  matrix[K, K] Sigma                   // covariance of beta[, j]
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega));
}

(WordPress is really annoying in its mishandling of angle brackets in pre environments.)

I started the first version using K = 2 and loops rather than vectorization. Next, I generalized from K = 2 to general K. Then I found the rows dot product function and got rid of the last loop. (Loops are fast in Stan—it’s the redundant autodiff, like multiple calculations of log(sigma) that are the time wasters in non-vectorized code.)

One could make the case for suffixing all the parameters of the prior for beta using _beta. You’d probably want to do that in a context with multiple groupings. It gets very hard to read even in this simple case—with multiple groupings it’s a right royal mess. Long variable names are very hard to read in math.

I put all the comments to the right at end of lines, so they don’t waste any vertical space and don’t get in the way of actually reading the code. Except the one true inline code comment that I’d leave, which is the implication of all the non-centered parameterization on the distribution of beta.

Let’s take it for a spin

I actually wrote the model first. I would generally recommend debugging the stages, but I’ve built this model many times, so I decided to roll the dice.

What I haven’t done before is simulate data in Python. Luckily, I have a very helpful coding assistant named ChatGPT. I have a system prompt that looks like this (I’m not actually a professor any more, but I thought it’d be easier to underestand that way).

WHO: I am a professor in statistics and machine learning with a Ph.D. in cognitive science and computer science. I understand (a) graduate level probability and statistics, (b) graduate-level machine learning including neural networks, (c) applied mathematics including calculus and analysis, matrices and linear algebra, combinatorics and graph theory, mathematical logic and set theory, (c) cognitive science including psycholinguistics and cognitive anthropology, (d) linguistics including phonology, morphology, syntax, semantics, and pragmatics, and (e) philosophy including 20th century analytic philosophy and philosophy of mind.

HOW: Respond in a concise and neutral academic style without embellishments.

Please call me “Bob”.

This was a one-and-done prompt. I’ve also had a bit of practice formulating these queries.

BOB: In Python, I want to set K = 5 covariates J = 10 groups. I want to generate a random K-vector location nu from a standard normal, and a (K x K) random covariance matrix Sigma of roughly unit scale (multiplying Cholesky factors or taking entries from a random walk would be OK). Then generate a K x J matrix beta where the columns are drawn from a multivariate normal distribution with location nu and covariance Sigma.

Then I want to set N = 1000 observations. For each observation in in range(N), generate a random group jj[n] in range(J) uniformly. Take a scale sigma = 1.5 for errors, and generate an N x K standard normal matrix x of covariates, with the first column set to all 1 values for an intercept. Then generate y from a hierarchical linear regression where the coefficient vector for observation n is given by beta[:, jj[n]] and the noise is normal with location 0 and scale sigma.

GPT-4 (Feb 28, 2024): [It just led with the code, which is unusual.]

import numpy as np
import scipy.linalg

# Set parameters
K = 5  # number of covariates
J = 10  # number of groups
N = 1000  # number of observations
sigma = 1.5  # scale of errors

# Generate nu and Sigma
nu = np.random.normal(0, 1, K)  # K-vector location from a standard normal
A = np.random.normal(0, 1, (K, K))
Sigma = np.dot(A, A.T)  # create a positive semi-definite matrix for covariance

# Generate beta
beta = np.random.multivariate_normal(nu, Sigma, J).T  # K x J matrix

# Generate observations
jj = np.random.randint(0, J, N)  # random group for each observation
x = np.random.normal(0, 1, (N, K))  # N x K standard normal matrix
x[:, 0] = 1  # set first column to all 1s for intercept

# Generate y
y = np.zeros(N)
for n in range(N):
    y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)

y[:5]  # Show the first 5 values of y as a sample output

The hierarchical linear regression model was successfully generated with the specified parameters. The first five values of the dependent variable y from the model are approximately:

−2.27
8.86
8.20
1.30
7.27

These values are based on the simulation with K=5 covariates,
J=10 groups, N=1000 observations, and an error scale σ=1.5.

It looks like it actually ran that Python code? I haven’t seen output in this format before from GPT—the code’s in a pulldown labeled “Analyzing”.

Anway, GPT’s code worked out of the box. It only took a couple of minutes to pull the data generation into a function and return a dictionary with simulations and simulated parameter values to compare to the fitted values. My Python code illustrates my suggestion a few posts back to turn blocks of code into functions rather than putting a bunch of comments around them in the script.

import numpy as np
import scipy.linalg

def sim_param_data(K, J, N, sigma):
    nu = np.random.normal(0, 1, K)
    A = np.random.normal(0, 1, (K, K))
    Sigma = np.dot(A, A.T)
    beta = np.random.multivariate_normal(nu, Sigma, J).T 
    jj = np.random.randint(0, J, N)
    x = np.random.normal(0, 1, (N, K))
    x[:, 0] = 1
    y = np.zeros(N)
    for n in range(N):
        y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
    return nu, Sigma, beta, {'J': J, 'N': N, 'jj': jj + 1, 'K': K, 'x': x, 'y': y}
    
import cmdstanpy as csp
nu, Sigma, beta, data = sim_param_data(K = 5, J = 10, N = 1000, sigma = 1.5)
model = csp.CmdStanModel(stan_file = 'rsbs.stan')
fit = model.sample(data)

It takes Stan about 20 seconds to fit this data set, R-hats all less than 1.01, is low, ESS in the thousands from a sample of size 4000, and all but a couple parameters are all recovered within 95% posterior intervals. There is quite a lot of uncertainty here with this little data and this many groups—don’t take those point estimates of covariance too seriously!

Appendix on doc

Let’s digress and talk about doc. I wrote a blog post a few days ago on doc, and this triggers some of the same issues. I want to say up front that doc is hard and if you go and look at code I’ve written, there will be a lot of places where you can improve the doc. Same for the code. So this is a kind of normative theory of doc, not what one might expect in reality. People only have a finite amount of time for any project. You might want to take a look at the doc in R parts of his example with the same eye.

First, there’s a scaffolding example which has the classic problem of documentation just for the sake of documentation.

vector[N] mu; // declaring a mu vector

You see the same thing in the final example where “vector[N] y” is documented as “y vector”. For the same reason, I don’t like this from an early example,

  sigma ~ exponential(1); // using an exponential prior on sigma

And this is what I meant by documenting the language.

mu[i] = x[i] * beta; // * is matrix multiplication in this context

and

  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix

Going back to the final example, rather than “population sigma”, I would prefer “error scale” as it does not rely on the conventional notation sigma to pick out the scale.

The comment for z says “non-centered version of beta_p”, but the non-centered variable here is z_p. The terminology of “centering” is around the population mean, not zero.

Continuing with doc for z, I don’t understand what it means to be a version of beta_p. There is no beta_p in the model, so maybe some external doc? In the definition of mu, you can see beta acting as the location of the non-centered parameterization.

Did anyone spot the bug in this model? This is the real reason we don’t trust doc and have to read the code. It only works for K = 2. You’ll see a hard-coded 1 and 2 on the line defining mu[i] despite other parts of the program using K. My advice in this situation is to just bite the bullet and code the K = 2 case first. Then generalize later if you need to. I code the general case in the next section.

I want to emphasize that I’m not trying to pick on Will Hipson here. I’m assuming his intent was to be pedagogical, as the comment density drops as the models get more complicated. And the code is really good—better than in our User’s Guide.

This example also emphasizes why code review is so useful—a second set of eyes is the top thing I can recommend for improving code. Just knowing your code will be reviewed helps you write better code.

When all else fails, add a code comment

Another way of saying this is that you should treat inline code comments as a last resort when there is no other way to make your intentions clear.

I used to teach a session of Andrew’s statistical communication class once a year and I’d focus on communicating a computational API. Most of the students hated it because they signed up for the class to hear Andrew talk about stats, not me talk about API design. At least one student just up and walked out every year! So if you’re that student, now’s your chance to bail.

Comments considered harmful

Most academics, before they will share code with me, tell me they have to “clean it up.” I invariably tell them not to bother, and at best, they will dilly dally and shilly shally and apologize for lack of comments. What they don’t realize is that they were on the right track in the first place. The best number of inline code comments is zero. Nada. Zilch. Nil. Naught.

Why are comments so harmful? They lie! Even with the best of intent, they might not match the actual implementation. They often go stale over time. You can write whatever you want in a comment and there’s no consistency checking with the code.

You know what doesn’t lie? Code. Code doesn’t lie. So what do professional programmers do? They don’t trust comments and read the code instead. At this point, comments just get in the way.

What’s a bajillion times better than comments?

Readable code. Why? It’s self documenting. To be self documenting, code needs to be relatively simple and modular. The biggest mistake beginners make in writing code is lack of modularity. Without modularity, it’s impossible to build code bottom up, testing as you go.

It’s really hard to debug a huge program. It’s really easy to debug modules built up piece by piece on top of already-tested modules. So design top down, but build code bottom up. This is why we again and again stress in our writing on Bayesian workflow and in our replies to user questions on forums, that it helps immensely to scaffold up a complicated model one piece at a time. This lets you know when you add something and it causes a failure.

Knowing where to draw lines between modules is, unfortunately, a matter of experience. The best way to get that experience? Read code. In the product coding world, code is read much more often than it’s written. That means much more effort typically goes into production code to make it readable. This is very unlike research code which might be written once and never read again.

There is a tradeoff here. Code is more readable with short variable names and short function names. It’s easier to apprehend the structure of the expression a * b + c**2 than meeting_time * number_of_meetings + participants**2. We need to strike a balance with not too long, but still informative variable names.

And why are beginners so afraid of wasting horizontal space while being spendthrifts on the much more valuable vertical space? I have no explanation. But I see a lot of code from math-oriented researchers that looks like this, ((a*b)/c)+3*9**2+cos(x-y). Please use spaces around operators and no more parens than are necessary to disambiguate given attachment binding.

When should I comment?

Sometimes you’re left with no choice and have to drop in a comment as a last resort. This should be done if you’re doing something non-idiomatic with the language or coding an unusual algorithm or something very involved. In this case, a little note inline about intent and/or algebra can be helpful. That’s why commenting is sometimes called a method of last resort.

But whatever you do, comment for people who know the language better than you. Don’t write a comment that explains what a NumPy function does—that’s what the NumPy doc is for. Nobody wants to see this:

int num_observations = 513;  // declare num_observations as an integer and set equal to 513

But people who feel compelled to comment will write just this kind of thing, thinking it makes their code more professional. If you think this is a caricature, you don’t read enough code.

The other thing you don’t want to do is this:

#####################################################
################## INFERENCE CODE ###################
#####################################################
...
...
...

This is what functions are for. Write a function called inference() and call it. It will also help prevent accidental reuse of global variables, which is always a problem in scripting languages like R and Python. Don’t try to fix hundreds or thousands of lines of unstructured code with structured comments.

Another thing to keep in mind is that vertical space is very precious in coding, because we want to be able to see as much of the code as we can at a time without scrolling. Don’t waste vertical space with useless or even harmful comments.

Do not, and I repeat, do not use /* ... */ style comments inline with code. It’s too easy to get confused when it’s a lot of lines and it’s doubly confusing when nested. Instead, use line comments (// in C++ and Stan, # in Python and R). Use the comment-region command in emacs or whatever does the same in your IDE. With line comments, the commented out code will be very visible, as in the following example.

for (int n = 0; n < N; ++n) {
  // int x = 5
  // int y = x * x * 3;
  // int z = normal_rng(y, 1);
  z = n * 3
}

Compare that to what I often see, which is some version of the following.

for (int n = 0; n < N; ++n) {
  /* int x = 5
  int y = x * x * 3;
  int z = normal_rng(y, 1); */
  z = n * 3
}

In the first case, it's easy to just scan down the margin and see what's commented out.

After commenting out and fixing everything, please be a good and respectful citizen and just delete all the edited out code before merging or releasing. Dead code makes the live code hard to find and one always wonders why it's still there---was it a mistake or some future plan or what? When I first showed up at Bell Labs in the mid 1990s, I was handed a 100+ page Tcl/Tk script for running a speech recognizer and told only a few lines were active, but I'd have to figure out which ones. Don't do that!

The golden triangle

What I stressed in Andrew's class is the tight interconnection between three aspects of production code:


$latex \textrm{API Documentation} \leftrightarrow \textrm{Unit tests} \leftrightarrow \textrm{Code}$

 

The API documentation should be functionally oriented and say what the code does. It might include a note as to how it does it if that is relevant to its use. An example might be different algorithms to compute the same thing that are widely known by name and useful in different situations. The API doc should ideally be specific enough to be the basis of both unit testing and coding. So I'm not saying don't document. I'm saying don't document how inline code works, document your API's intent.

The reason I call this the "golden" triangle is the virtuous cycle it imposes. If the API doc is hard to write, you know there's a problem with the way the function has been specified or modularized. With R and Python programmers, that's often because the code is trying to do too many things for a single function and the input types and output types become a mess of dependencies. This leads to what programmers identify as a "bad smell" in the code. If the code or the unit tests are hard to write, you know there's a problem with the API specification.

Clients (human and computational) are going to see and "feel" the API. That's where the "touch" is that designers like to talk about in physical object design. Things need to feel natural for the application, or in the words of UI/UX designers, it needs to offer affordances (in the past, we might have said it should be intuitive). It needs to feel natural for the application. Design the API first from the client perspective. Sometimes you have to suffer on the testing side to maintain a clean and documentable API, but that clean API is your objective.

What about research code?

Research code is different. It doesn't have to be robust. It doesn't have to be written to be read by multiple people in the future. You're usually writing end-to-end tests rather than unit tests, though that can be dangerous. It still helps to develop bottom-up with testing.

What research code should be is reproducible. There should be a single script to run that generates all the output for a paper. That way, even if the code's ugly, at least the output's reproducible and someone with enough interest can work through it.

And of course, research code needs to be tested that it's doing what it's supposed to be doing. And it needs to be audited to make sure it's not "cheating" (like cross-validating a time-series, etc.).

Notebooks, Quarto, and other things that get in the way of coding and documenting

With all due respect to Donald Knuth (never a good start), literate programming is a terrible way to develop code. (On a related topic, I would totally recommend at least the beginning part of Knuth's notes on how to write math.)

I don't love them, but I use Quarto and Jupyter (nee iPython) notebooks for writing reproducible tutorial material. But only after I've sorted out the code. These tools mix text and code and make too many compromises along the way to make them good at either task. Arguably the worst sin is that it winds up obfuscating the code with a bunch of text. Jupyter also makes it possible to get into inconsistent states because it doesn't automatically re-run everything. Quarto is just a terrible typesetting platform, inheriting all the flaws of pandoc, citeproc, with the added joy of HTML and LaTeX interoperability and R and Python interoperability. We use it for Stan docs so that we can easily generate HTML and LaTeX, but it always feels like there should be a better way to do this as it's a lot of trial and error due to the lack of specs for markdown.

Evaluating samplers with reference draws

I started thinking a bit more about what we’re doing when we use something like posteriordb (from Stan) or Inference Gym (from TensorFlow Probability) to evaluate a sampler.

posteriordb gives you Stan programs and 10,000 reference draws from their posterior. The reference draws must be generated by another sampler. In some cases, like Neal’s funnel or high-dimensional normals, etc., we can just run forward simulation and get independent draws. In cases of models with data, we don’t actually know the posterior analytically—that’s why we’re running MCMC in the first place! I think what posteriordb did was run NUTS for enough iterations it could be thinned down to 10,000 roughly independent draws.

Standard error in reference

The standard error for $latex N$ independent draws estimating a particular quantity like the posterior mean of a parameter is

$latex \textrm{se} = \frac{\displaystyle \textrm{sd}}{\displaystyle \sqrt{N}}$

where sd is the standard deviation of the quantity in the posterior. Standard Monte Carlo methods scale horribly due to that square root—Markov chain Monte Carlo can be even worse because rather than the number of draws, we have the effective sample size, which can be much lower than the number of iterations for more challenging posteriors due to the correlation in the draws.

Because $latex \sqrt{10,000} = 100$, with 10,000 draws, or standard error is only 1/100th of our standard deviation. In general, we need four times as many draws to cut the error in half, or 100 times as many draws to add a decimal place of accuracy by cutting error by a factor of ten. That means we can’t reasonably expect something like posteriordb to store enough draws to make fine-grained convergence evaluation possible. A slightly more economical alternative would be to store long term averages. There we cold run a million iterations and save parameter and squared parameter estimates. We could even run a hundred million by farming out 100 jobs on a cluster. But that’s still only three or four decimal places of accuracy. After that, maybe we could wring out another decimal place or two with control variates.

That’s a lot of residual error to have to deal with when evaluating an algorithm that asymptotically approaches the correct answer. The only thing that saves us is that the algorithms we’re evaluating are also samplers and are subject to the same inverse square root rate of reduction in standard error.

Report error in evaluations

What we really want to report is the standard error of the residual between our new model’s estimate and the truth. But we don’t usually have the truth. In practice, we usually do one of two things, depending on how compute-intensive it is to fit my model and our compute resources and patience.

Easy model:. Run the new algorithm 100 times or even 1000 times and report something like a histogram of errors. With this many runs, we can directly estimate standard error (assuming our draws are unbiased) and we can estimate it even more tightly and without assumptions if we know the true values.

Hard model:. Run the new algorithm once or twice and use an estimated standard error for the new algorithm and report that instead of an empirical histogram of errors. This suffers from estimation error in the effective sample size, which can be large.

The problem with both of these aproaches is that it’s only reporting variability of the new algorithm, treating the reference draws as true. We should be going one step further and folding the error in posteriordb or another reference database into the uncertainty reporting of the error of the new algorithm. The additional error could be done with simulation (and maybe analytically—normals are nice that way).

That’s easy enough to do reporting single univariate means. I do not know how to do this for evaluating error in something like KL-divergence, which itself has a Monte Carlo integration, or something like Wasserstein distance. In the Pathfinder paper, we used a semi-discrete Wassertein distance algorithm. I have to admit I don’t even know how that algorithm works.

Evaluating delayed rejection

I ran into this problem because I want to evaluate our delayed rejection HMC and show that the error goes to zero as I take more draws. Specifically, I want to extend the illustration in the paper that shows that standard HMC and NUTS fails, but DR-HMC succeeds at sampling from multiscale distributions like Neal’s funnel in its centered parameterization.

There are two fundamental problems with this.

Problem One: By the time we get past an effective samle size of 10,000 from the new algorithm, the error we’re estimating is dominated by the standard error of posteriordb. We don’t even know the real error of posteriordb, only an estimated standard error assuming draws are independent.

Problem Two: The second problem is that we have no guarantees that the reference draws are unbiased draws from the posterior. For example, in the centered parameterization of Neal’s funnel, there is no step size for HMC that will properly explore the posterior—each step size picks out a range of log scale parameters it can explore (as we demonstrate in the delayed rejection paper). The standard error estimate from the reference draws assumes independent draws from the posterior, not biased draws. If the posteriordb draws are biased, the error reported for a perfect sampler will not go to zero as the number of posteriordb and new sampler draws goes to infinity.

Help?

Does anyone have any suggestions of what we can do in the face of potentially biased and stochastic samples for evaluation? I could wait until the Stan meeting on Thursday and ask Aki, who’s a font of inspiration on topics like this, but I thought I’d ask a wider group and see how that goes.

Fun with Dååta: Reference librarian edition

Rasmuth Bååth reports the following fun story in a blog post, The source of the cake dataset (it’s a hierarchical modeling example included with the R package lme4).

Rasmuth writes,

While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938).

The search is on.

… after a fair bit of flustered searching, I realized that this scholarly work, despite its obvious relevance to society, was nowhere to be found online.

The plot thickens like cake batter until Megan N. O’Donnell, a reference librarian (officially, Research Data Services Lead!) at Iowa State, the source of the original, gets involved. She replies to Rasmuth’s query,

Sorry for the delay — I got caught up in a deadline. The scan came out fairly well, but page 16 is partially cut off. I’ll put in a request to have it professionally scanned, but that will take some time. Hopefully this will do for now.

Rasmuth concludes,

She (the busy Research Data Services Lead with a looming deadline) is apologizing to me (the random Swede with an eccentric cake thesis digitization request) that it took a few days to get me everything I asked for!?

Reference librarians are amazing! Read the whole story and download the actual manuscript from Rasmuth’s original blog post. The details of the experimental design are quite interesting, including the device used to measure cake breakage angle, a photo of which is included in the post.

I think it’d be fun to organize a class around generating new, small scale and amusing data sets like this one. Maybe it sounds like more fun than it would actually be—data collection is grueling. Andrew says he’s getting tired of teaching data communication, and he’s been talking a lot more about the importance of data collection on the blog, so maybe next year…

P.S. In a related note, there’s something called a baath cake that’s popular in Goa and confused my web search.

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.