Gaurav Sood writes:
I was thinking about Bob’s post, and I think we can do something with Wasserstein Barycenter Priors
I have the Gaussian thing going here.
But the right approach is the barycenter priors ….
I haven’t been following the details here, but I think this discussion is about a problem that I’ve seen many times in applications, which is that you want to update your posterior as new data come in, and it’s expensive to re-fit the model with all your data each time a new little batch of data come in. We’ve solved this problem in different ways at different times: sometimes we approximate the posterior distribution by a normal or mixture of normals; other times we approximate the posterior for the hyperparameters and use it as a prior for the model we’re fitting to the new data; sometimes the problem can be simplified by factorization in some way or another. There’s also particle filtering, which is a particular class of iterative simulation algorithm using multiple chains.
Naming the problem
As with any problem that comes up often, there are many solutions. I’m excited that Bob recently implemented a version in Stan. Bob called it “chaining Bayesian inference” . . . I’m not thrilled with that term. “Sequential Bayesian updating” sounds better to me, so that’s what I used in the title of this post. I’m open to other suggestions.
Gaurav’s approach
Here’s what Gaurav writes in his new post:
Bayesian inference becomes challenging when you need to chain analyses across sequential datasets but lack conjugate priors with closed-form posteriors. While textbook solutions work beautifully for simple cases like binomial-beta models, real-world scenarios often involve complex posteriors that resist analytical treatment.
Chenyang Zhong has an elegant solution using kernel density estimation to approximate posteriors as priors for subsequent analyses. Zhong approximates the posterior from the first dataset using a mixture of normals: p(θ | y₁, x₁) ≈ (1/M) Σᵢ Normal(θ | θᵢ⁽¹⁾, h·I)
where θᵢ⁽¹⁾ are posterior draws from the first analysis, and h controls the kernel bandwidth. This creates an empirical prior for analyzing the second dataset.
The approach is clever—it preserves the shape of the posterior while allowing efficient MCMC sampling. Zhong even develops a sophisticated nearest-neighbor method for fast Metropolis sampling.
The Gaussian distribution that minimizes Wasserstein-2 distance to any empirical distribution is simply the one matching the first two moments. This means we can replace Zhong’s kernel density approximation with:
fit_wasserstein_prior(posterior_draws): """Wasserstein-optimal Gaussian approximation""" mu = np.mean(posterior_draws, axis=0) Sigma = np.cov(posterior_draws.T) return mu, SigmaThat’s it. No optimization, no bandwidth selection, no nearest-neighbor graphs.
Zhong’s approach requires:
Choosing bandwidth h Implementing mixture model in Stan Complex log-sum-exp calculations
Wasserstein approach:
Automatic parameter selection Simple multivariate normal prior Standard Stan implementation Comparable or faster performance
Here’s the key Stan difference:
stan// Zhong's mixture approach model { y ~ bernoulli_logit(x * beta); vector[B] lp; for (b in 1:B) { lp[b] = normal_lpdf(beta | beta0[b], h); } target += log_sum_exp(lp) - log(B); } // Wasserstein approach model { y ~ bernoulli_logit(x * beta); beta ~ multi_normal(mu_prior, Sigma_prior); }
Gaurav includes a bunch more Stan and Python code in his post.
Hence my comment on Bob’s post to consider copulas. Use a Gaussian copula with whatever marginals you want. With normal margins, it’s the same as Wasserstein.
It’s always fun when you a bunch of fun math like optimal transport and the result is use a normal approximation with the sample mean and covariance. An alternative use of barycenters is the approach of David Dunson et al. to federated learning that I cited in my post.
There are two immediate problems with multivariate normals, namely that (a) they’re inflexible, and (b) they’re slow. I think (a) should be pretty obvious compared to more flexible density families that can handle skew, longer tails, multimodality, etc. For (b), just evaluating the density is cubic in dimensionality with a standard multivariate normal approximation. One can Cholesky factor the covariance matrix for a one-time cubic cost and then the density evaluations become quadratic. But that’s still expensive in high dimensions. Zhong’s method is linear in dimensionality with a “constant” factor of B. If that B needs to grow even linearly with dimensionality, then it’s still going to be quadratic in dimension.
Not that I understand anything on this topic, but Bob’s first sentence
“It’s always fun when you a bunch of fun math like optimal transport and the result is use a normal approximation with the sample mean and covariance.”
is missing a verb. Or, so I believe.
Yes, I guess it is the verb “do”:
It’s always fun when you **do** a bunch of fun math, like optimal transport, and the result is: use a normal approximation with the sample mean and covariance.
i.e., complicated stuff still lead us to the Gaussian, of course that’s ironical
Wait linear in dimesionality? So hes proposing an independent product of 1d KDE estimates? (I didn’t follow the details) That would seem to throw away a lot of dependency information. Unless you’re adding a big new dataset you could easily wind up with less info about the parameters afterwards than you had at the start.
Thanks, Bob! Makes sense. I hadn’t completely grokked: “Zhong’s method is linear in dimensionality with a “constant” factor of B.”