Mixtures: A debugging story

I’m not a great programmer. The story that follows is not intended to represent best programming practice or even good programming practice. It’s just something that happened to me, and it’s the kind of thing that’s happened to me many times before, so I’m sharing it with you here.

The problem

For a research project I needed to fit a regression model with an error term that is a mixture of three normals. I googled *mixture model Stan* and came to this page with some code:

data {
  int K;          // number of mixture components
  int N;          // number of data points
  array[N] real y;         // observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector[K] sigma;  // scales of mixture components
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K) {
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}

I should’ve just started by using this code as is, but instead I altered it in a couple of ways:

data {
  int M;
  int N;
  int K;
  vector[N] v;
  matrix[N,K] X;
}
parameters {
  vector[K] beta;
  simplex[M] lambda;
  ordered[M] mu;
  vector[M] sigma;
}
model {
  lambda ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  sigma ~ lognormal(0, 2);
  for (n in 1:N){
    vector[M] lps = log(lambda);
    for (m in 1:M){
      lps[m] += normal_lpdf(v[n] | X*beta + mu[m], sigma[m]);
    }
    target += log_sum_exp(lps);
  }
}

The main thing was adding the linear predictor, X*beta, but I also renamed a couple of variables to make the code line up with the notation in the paper I was writing, also I added a prior on the mixture component sizes and I removed some of the code from the Stan User’s Guide that increased computational efficiency but which seemed to me to make the code harder to read to newcomers.

I fit this model setting M=3, and . . . it was really slow! I mean, stupendously slow. My example had about 1000 data points and it was taking, oh, I dunno, close to an hour to run?

This just made no sense. It’s not just that it was slow; also, the slowness was just not right, which made me concerned that something else was going wrong.

Also there was poor convergence. Some of this was understandable, as I was fitting the model to data that had been simulated from a linear regression with normal errors, so there weren’t actually three components. But, still, the model has priors, and I don’t think the no-U-turn sampler (NUTS) algorithm used by Stan should have so much trouble traversing this space.

Playing with priors

It was time to debug. My first thought was that the slowness was caused by poor geometry: if NUTS is moving poorly, it can take up to 1024 steps per iteration, and then each iteration will take a long time. Poor mixing and slow convergence go together.

A natural way to fix this problem is to make the priors stronger. With strong priors, the parameters are restricted to fall in a small, controlled zone, and the geometry should be better.

I tried a few things with priors, the most interesting of which was to set up a hierarchical model for the scales of the mixture components. I added this line to the parameters block:

  real log_sigma_0;

And then, in the model block, I replaced “sigma ~ lognormal(0, 2);” with:

  sigma ~ lognormal(log_sigma_0, 1);

One thing that made the priors relatively easy to set up here was that the data are on unit scale: they’re the logarithms of sampling weights, and the sampling weights were normalized to have a sample mean of 1. Also, we’re not gonna have weights of 1000, so they have a limited range on the log scale.

In any case, these steps of tinkering with the prior weren’t helping. The model was still running ridiculously slowly.

Starting from scratch

OK, what’s going on? I decided to start from the other direction: Instead of starting with my desired model and trying to clean it up, I started with something simple and built up.

The starting point is the code in the Stan User’s Guide, given above. I simulated some fake data and fitted it:

library("rstanarm")
set.seed(123)

# Simulate data
N <- 100
K <- 3
lambda <- c(0.5, 0.3, 0.2)
mu <- c(-2, 0, 2)
sigma <- c(1, 1, 1)
z <- sample(1:K, n, replace=TRUE, prob=lambda)
v <- rnorm(N, mu[z], sigma[z])

# Fit model
mixture <- cmdstan_model("mixture_2.stan")
mixture_data <- list(y=v, N=N, K=3)
v_mixture_fit <- mixture$sample(data=mixture_data, seed=123, chains=4, parallel_chains=4)
print(v_mixture_fit)

And it worked just fine, ran fast, no convergence problems. It also worked fine with N=1000; it just made sense to try N=100 first in case any issues came up.

Then I changed the Stan code to my notation, using M rather than K and a couple other things, and still no problems.

Then I decided to make it a bit harder by setting the true means of the mixture components to be identical, changing "mu <- c(-2, 0, 2)" in the above code to

mu <- c(0, 0, 0)

Convergence was a bit worse, but it was still basically ok. So the problem didn't seem to be the geometry.

Next step was to add predictors to the model. I added them to the R code and the Stan code . . . and then the problem returned.

So I'd isolated the problem. It was with the regression predictors. But what was going on? One problem could be nonidentification of the constant term in the regression with the location parameters mu in the mixture model---but I'd been careful not to include a constant term in my regression, so it wasn't that.

Finding the bug

I stared at the code harder and found the problem! It was in this line of the Stan program:

      lps[m] += normal_lpdf(v[n] | X*beta + mu[m], sigma[m]);

The problem is that the code is doing one data point at a time, but "X*beta" has all the data together! So I fixed it. I changed the above line to:

      lps[m] += normal_lpdf(v[n] | Xbeta[n] + mu[m], sigma[m]);

and added the following line to the beginning of the model block:

  vector[N] Xbeta = X*beta;

Now it all works. I found the bug.

What next?

The code runs and does what it is supposed to do. Great. Now I have to go back to the larger analysis and see whether everything makes sense.

Here was the output of the simple normal linear regression fit to the data I'd simulated:

            Median MAD_SD
(Intercept)  0.57   0.10 
x           -0.16   0.01 

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.02   0.02

And here was the result of fitting the regression model in Stan with error term being a mixture of 3 normals:

    variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
 lp__        -1340.21 -1339.91 2.85 2.77 -1345.21 -1335.96 1.00      681      458
 beta[1]        -0.16    -0.16 0.01 0.01    -0.18    -0.14 1.00     2139     2357
 lambda[1]       0.32     0.14 0.34 0.19     0.01     0.94 1.02      419      977
 lambda[2]       0.39     0.27 0.35 0.36     0.01     0.95 1.01      506      880
 lambda[3]       0.28     0.11 0.32 0.15     0.01     0.92 1.01      543      879
 mu[1]          -0.14    -0.02 0.61 0.57    -1.43     0.57 1.02      385      281
 mu[2]           0.58     0.56 0.51 0.33    -0.25     1.56 1.01      474      536
 mu[3]           1.53     1.37 1.56 0.67     0.60     2.42 1.01      562     1025
 sigma[1]        0.77     0.85 0.30 0.22     0.21     1.09 1.01      522      481
 sigma[2]        0.85     0.93 0.32 0.16     0.27     1.16 1.00      644     1028
 sigma[3]        0.77     0.79 0.50 0.29     0.25     1.11 1.00      910     1056
 log_sigma_0    -0.35    -0.34 0.64 0.65    -1.42     0.68 1.00     1474     1605

The estimate for the slope parameter, beta, seems fine, but it's hard to judge mu and sigma. Ummm, we can take a weighted average for mu, 0.32*(-0.14) + 0.39*0.58 + 0.28*1.53 = 0.61, which seems kind of ok although a bit off from the 0.57 we got from the linear regression. What about sigma? It's harder to tell. We can compute the weighted variance of the mu's plus the weighted average of the sigma^2's, but it's getting kinda messy, also really we want to account for the posterior uncertainty---as you can see from above, these uncertainty intervals are really wide, which makes sense given that we're fitting a mixture of 3 normals to data that were simulated from a single normal distribution . . . Ok, this is getting messy. Let's just do it right.

I added a generated quantities block to the Stan program to compute the total mean and standard deviation of the error term:

generated quantities {
  real mu_total = sum(lambda.*mu)/sum(lambda);
  real sigma_total = sqrt(sum(lambda.*((mu - mu_total)^2 + sigma^2))/sum(lambda));
}

And here's what came out:

    variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
 mu_total        0.56     0.56 0.11 0.11     0.39     0.74 1.00     2172     2399
 sigma_total     1.02     1.02 0.03 0.02     0.98     1.06 1.00     3116     2052

Check. Very satisfying.

The next step is to continue on with the research on the problem that motivated all this. Fitting a regression with mixture model for errors, that was just a little technical thing I needed to do. It's annoying that it took many hours (not even counting the hour it took to write this post!) and even more annoying that I can't do much with this---it's just a stupid little bug, nothing that we can even put in our workflow book, I'm sorry to say---but now it's time to move on.

Sometimes, cleaning the code, or getting the model to converge, or finding a statistically-significant result, or whatever, takes so much effort that when we reach that ledge, we just want to stop and declare victory right there. But we can't! Or, at least, we shouldn't! Getting the code to do what we want is just a means to an end, not an end in itself.

P.S. I cleaned the code some more, soft-constraining mu_total to 0 so I could include the intercept back into the model. Here's the updated Stan program:

data {
  int M;
  int N;
  int K;
  vector[N] v;
  matrix[N,K] X;
}
parameters {
  vector[K] beta;
  simplex[M] lambda;
  ordered[M] mu;
  vector[M] sigma;
  real log_sigma_0;
}
model {
  vector[N] Xbeta = X*beta;
  lambda ~ lognormal(log(1./M), 1);
  mu ~ normal(0, 10);
  sigma ~ lognormal(log_sigma_0, 1);
  sum(lambda.*mu) ~ normal(0, 0.1);
  for (n in 1:N){
    vector[M] lps = log(lambda);
    for (m in 1:M){
      lps[m] += normal_lpdf(v[n] | Xbeta[n] + mu[m], sigma[m]);
    }
    target += log_sum_exp(lps);
  }
}
generated quantities {
  real mu_total = sum(lambda.*mu);
  real sigma_total = sqrt(sum(lambda.*((mu - mu_total)^2 + sigma^2)));
}

6 thoughts on “Mixtures: A debugging story

  1. I’ll speak up for the subset of applied Bayesians that work in the field of uncertainty quantification for computer models. I’m generally a fan of what you’ve done to elucidate Bayesian workflow (as partly demonstrated in this post), but there are parts of it that make less sense for our field (or physicists, chemists, etc., that do what we do). Our likelihoods often include physical models and our priors often include expert knowledge. We want to quantify uncertainty even if our resulting statistical models are not strongly identifiable. MCMC mixing may be bad, but we are less inclined to relax the physical models or expert priors. This is not to say that we statisticians don’t add our tweaks, like measurement error models, biases to the physical models, and even modularizing the inference, but sometimes the real thing we’re after to truly reflect our uncertain state of knowledge is some complicated posterior falling on a manifold with multiple modes. This is all to say that the part of the Bayesian workflow that involves changing the model to explore/fix MCMC problems is sometimes not the right thing to do—sometimes you really need different MCMC algorithms.

    • Devin:

      Rather than saying that we are changing the model to improve the computation, let me put it this way: Computational problems motivate us to look more carefully at our model, and then we realize that we have a lot of prior information we had not yet included. Putting in that prior information can help the computation. (That’s the contrapositive of the folk theorem of statistical computing.) And, yes, sometimes there’s a model we really want to fit and it’s a computational challenge and we have to deal with it. In the case of the above post, though, the added prior information was stuff that we really should’ve, or could’ve, had in the model in the first place.

  2. Thanks for sharing! I’d also just like to say that your demonstrated commitment to transparency and intellectual honesty about making errors is very inspiring. This is the sort of culture we need, and I really appreciate you promoting it in the way you do.

  3. I like this post. Recently I had to debug a Stan model that I was writing for a laboratory experiment that included a bunch of serial dilutions. There were a lot of latent variables and hierarchical components to the model. It just wouldn’t converge. I kept thinking the problem was something about the geometry and maybe needed reparameterizing…just couldn’t figure it out…for a long time… turns out, I managed to write the power function backwards, “pow(y, x)” instead of “pow(x, y).” aaaarrrgh. Couldn’t believe it was something so simple and dumb.

Leave a Reply

Your email address will not be published. Required fields are marked *