Using Stan to do sequential Bayesian updating

Gaurav Sood writes:

I was thinking about Bob’s post, and I think we can do something with Wasserstein Barycenter Priors

I have the Gaussian thing going here.

But the right approach is the barycenter priors ….

I haven’t been following the details here, but I think this discussion is about a problem that I’ve seen many times in applications, which is that you want to update your posterior as new data come in, and it’s expensive to re-fit the model with all your data each time a new little batch of data come in. We’ve solved this problem in different ways at different times: sometimes we approximate the posterior distribution by a normal or mixture of normals; other times we approximate the posterior for the hyperparameters and use it as a prior for the model we’re fitting to the new data; sometimes the problem can be simplified by factorization in some way or another. There’s also particle filtering, which is a particular class of iterative simulation algorithm using multiple chains.

Naming the problem

As with any problem that comes up often, there are many solutions. I’m excited that Bob recently implemented a version in Stan. Bob called it “chaining Bayesian inference” . . . I’m not thrilled with that term. “Sequential Bayesian updating” sounds better to me, so that’s what I used in the title of this post. I’m open to other suggestions.

Gaurav’s approach

Here’s what Gaurav writes in his new post:

The Problem

Bayesian inference becomes challenging when you need to chain analyses across sequential datasets but lack conjugate priors with closed-form posteriors. While textbook solutions work beautifully for simple cases like binomial-beta models, real-world scenarios often involve complex posteriors that resist analytical treatment.

Chenyang Zhong has an elegant solution using kernel density estimation to approximate posteriors as priors for subsequent analyses. Zhong approximates the posterior from the first dataset using a mixture of normals: p(θ | y₁, x₁) ≈ (1/M) Σᵢ Normal(θ | θᵢ⁽¹⁾, h·I)

where θᵢ⁽¹⁾ are posterior draws from the first analysis, and h controls the kernel bandwidth. This creates an empirical prior for analyzing the second dataset.

The approach is clever—it preserves the shape of the posterior while allowing efficient MCMC sampling. Zhong even develops a sophisticated nearest-neighbor method for fast Metropolis sampling.

A Wasserstein Alternative

The Gaussian distribution that minimizes Wasserstein-2 distance to any empirical distribution is simply the one matching the first two moments. This means we can replace Zhong’s kernel density approximation with:

fit_wasserstein_prior(posterior_draws):
    """Wasserstein-optimal Gaussian approximation"""
    mu = np.mean(posterior_draws, axis=0)
    Sigma = np.cov(posterior_draws.T)
    return mu, Sigma

That’s it. No optimization, no bandwidth selection, no nearest-neighbor graphs.

Implementation Comparison

Zhong’s approach requires:

Choosing bandwidth h Implementing mixture model in Stan Complex log-sum-exp calculations

Wasserstein approach:

Automatic parameter selection Simple multivariate normal prior Standard Stan implementation Comparable or faster performance

Here’s the key Stan difference:

stan// Zhong's mixture approach
model {
  y ~ bernoulli_logit(x * beta);
  vector[B] lp;
  for (b in 1:B) {
    lp[b] = normal_lpdf(beta | beta0[b], h);
  }
  target += log_sum_exp(lp) - log(B);
}

// Wasserstein approach  
model {
  y ~ bernoulli_logit(x * beta);
  beta ~ multi_normal(mu_prior, Sigma_prior);
}

When Does This Work?

The Wasserstein approximation is optimal when:

The true posterior is approximately Gaussian You want to preserve the covariance structure Computational simplicity matters

It may underperform when posteriors have:

Strong multimodality Heavy tails Complex dependence structures

Gaurav includes a bunch more Stan and Python code in his post.

6 thoughts on “Using Stan to do sequential Bayesian updating

  1. Hence my comment on Bob’s post to consider copulas. Use a Gaussian copula with whatever marginals you want. With normal margins, it’s the same as Wasserstein.

  2. It’s always fun when you a bunch of fun math like optimal transport and the result is use a normal approximation with the sample mean and covariance. An alternative use of barycenters is the approach of David Dunson et al. to federated learning that I cited in my post.

    There are two immediate problems with multivariate normals, namely that (a) they’re inflexible, and (b) they’re slow. I think (a) should be pretty obvious compared to more flexible density families that can handle skew, longer tails, multimodality, etc. For (b), just evaluating the density is cubic in dimensionality with a standard multivariate normal approximation. One can Cholesky factor the covariance matrix for a one-time cubic cost and then the density evaluations become quadratic. But that’s still expensive in high dimensions. Zhong’s method is linear in dimensionality with a “constant” factor of B. If that B needs to grow even linearly with dimensionality, then it’s still going to be quadratic in dimension.

    • Not that I understand anything on this topic, but Bob’s first sentence

      “It’s always fun when you a bunch of fun math like optimal transport and the result is use a normal approximation with the sample mean and covariance.”

      is missing a verb. Or, so I believe.

      • Yes, I guess it is the verb “do”:

        It’s always fun when you **do** a bunch of fun math, like optimal transport, and the result is: use a normal approximation with the sample mean and covariance.

        i.e., complicated stuff still lead us to the Gaussian, of course that’s ironical

    • Wait linear in dimesionality? So hes proposing an independent product of 1d KDE estimates? (I didn’t follow the details) That would seem to throw away a lot of dependency information. Unless you’re adding a big new dataset you could easily wind up with less info about the parameters afterwards than you had at the start.

  3. Thanks, Bob! Makes sense. I hadn’t completely grokked: “Zhong’s method is linear in dimensionality with a “constant” factor of B.”

Leave a Reply

Your email address will not be published. Required fields are marked *