
Lu Zhang, Bob Carpenter, Aki Vehtari and I write:
We introduce Pathfinder, a variational method for approximately sampling from differentiable log densities. Starting from a random initialization, Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the inverse Hessian estimates produced by the optimizer. Pathfinder returns draws from the approximation with the lowest estimated Kullback-Leibler (KL) divergence to the true posterior. We evaluate Pathfinder on a wide range of posterior distributions, demonstrating that its approximate draws are better than those from automatic differentiation variational inference (ADVI) and comparable to those produced by short chains of dynamic Hamiltonian Monte Carlo (HMC), as measured by 1-Wasserstein distance. Compared to ADVI and short dynamic HMC runs, Pathfinder requires one to two orders of magnitude fewer log density and gradient evaluations, with greater reductions for more challenging posteriors. Importance resampling over multiple runs of Pathfinder improves the diversity of approximate draws, reducing 1-Wasserstein distance further and providing a measure of robustness to optimization failures on plateaus, saddle points, or in minor modes. The Monte Carlo KL-divergence estimates are embarrassingly parallelizable in the core Pathfinder algorithm, as are multiple runs in the resampling version, further increasing Pathfinder’s speed advantage with multiple cores.
The current title of the paper is actually “Pathfinder: Parallel quasi-Newton variational inference.” We didn’t say it in the abstract above, but our original motivation for Pathfinder was to get better starting points for Hamiltonian Monte Carlo; at some point it became clear that the way to do this is to construct a variational approximation to the target distribution. I gave the above title to the blog post to draw attention to the idea that Pathfinder finds where to go.
In the world of Stan, we see three roles for Pathfinder: (1) getting a fast approximation (should be faster and more reliable than ADVI), (2) part of a revamped adaptation (warmup) phase, which again we hope will be faster and more reliable than what we have now, and (3) more effectively getting rid of minor modes. This is a cool paper (and I’m the least important of the four authors) and I’m really excited about the idea of having this in Stan and other probabilistic programming languages.
P.S. Bob provides further background:
The original idea was that the intermediate value theorem from calculus would guarantee that if we started with a random init in the tail and followed an optimization path, that path would have to pass through the bulk of the probability mass on its way to the mode or pole (for example, hierarchical models don’t have modes because the density is unbounded). Here’s an ASCII art version of the diagram of what should happen:
*---------[--------]---------> opt path
   TAIL      BODY      HEAD

The brackets pick out the region in the body of the distribution corresponding to reasonable posterior draws (see below for how we refined this idea). For instance, in a multivariate standard normal, this is a thin shell at a fixed radius from the mode at the origin.
We were at the same time playing around with the notion of the typical set from information theory. The theorem behind it is an asymptotic result: as a sample grows, its mean log density approaches the negative entropy. We have been using the term informally to denote the subset of the sample space where a sampler will draw samples. But that’s not quite what it is. Formally, the typical set for a single draw is defined by the set of draws whose log density is within epsilon of the expected log density, i.e.,
Typical_epsilon = {theta : log p(theta | y) in E[log p(Theta | y)] +/- epsilon}.
This +/- epsilon business doesn’t work well when log p(Theta | y) is skewed, as it often is in applied problems. So instead we settled on the set of theta whose log densities fall in the 99% central interval of log densities. That’s where 99% of our draws land, and it should make a good target for initializing an MCMC sampler, our original goal.
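As a concrete sketch of that refinement, here is how one might flag which points fall inside the 99% central interval of log densities, given reference log densities from posterior draws. The function name and signature are illustrative, not the paper’s actual code:

```python
import numpy as np

def in_central_band(log_dens_points, log_dens_ref, lower=0.5, upper=99.5):
    """Boolean mask: which points have log densities inside the central
    interval (default 99%) of the reference log densities?
    Hypothetical helper, for illustration only."""
    lo, hi = np.percentile(log_dens_ref, [lower, upper])
    pts = np.asarray(log_dens_points)
    return (pts >= lo) & (pts <= hi)
```

Points below the band are in the tail, points above it are in the head (near the mode or pole), and points inside it are in the body.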
Given an optimization path, we can evaluate the points in parallel to detect whether they’re in the body of the distribution. If we start MCMC chains at those points, the ones in the tail should drift higher in density, the ones in the head should drift lower, and those in the body should bounce around. The only problem is that it’s hard to run MCMC on arbitrary densities robustly without a lot of adaptation, which is precisely what we were designing Pathfinder to avoid. Also, running MCMC is intrinsically serial.

So next we considered evaluating the volume around a point and using density times volume as a proxy for being in the body of a distribution. We could use the low-rank-plus-diagonal covariance approximation from the L-BFGS (quasi-Newton) optimizer to evaluate the volume. It worked better than MCMC and was very fast, but still failed in many cases.

The final idea was to just evaluate the evidence lower bound (ELBO), equivalently the KL divergence from the approximation to the target density, for the normal approximation derived from L-BFGS at each point along the optimization path. That can also be done easily in parallel. It’s way more robust to run L-BFGS on the objective log p(theta | y) than it is to optimize ELBO[normal(mu, Sigma) || log p(theta | y)] directly over mu and Sigma, as ADVI does. We also don’t have ADVI’s serialization bottleneck, where the evaluations of the ELBO need to be staged one after another; all of our ELBO evals happen in parallel.
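To make the ELBO step concrete, here is a minimal Monte Carlo sketch. It assumes a dense Cholesky factor for the normal approximation rather than the paper’s low-rank-plus-diagonal form, and all names are illustrative:

```python
import numpy as np

def elbo_estimate(log_p, mu, L, n_draws=30, seed=0):
    """Monte Carlo estimate of the ELBO, E_q[log p(theta) - log q(theta)],
    for q = normal(mu, L @ L.T), where log_p is the (possibly
    unnormalized) target log density.  The log_p evaluations at the
    draws are independent and could run in parallel."""
    rng = np.random.default_rng(seed)
    d = len(mu)
    z = rng.standard_normal((n_draws, d))
    draws = mu + z @ L.T                       # theta = mu + L z ~ q
    log_q = (-0.5 * np.sum(z**2, axis=1)       # log q(theta), via z
             - np.sum(np.log(np.diag(L)))
             - 0.5 * d * np.log(2.0 * np.pi))
    log_p_vals = np.array([log_p(th) for th in draws])
    return np.mean(log_p_vals - log_q)
```

Running this at each point along the optimization path and keeping the approximation with the highest ELBO (lowest estimated KL divergence) is the selection step described above.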
Along the way, we lost the original Pathfinder idea of finding a point on the optimization path in the body of the distribution. The point where the ELBO is best is often beyond the body of the distribution. For example, the best normal approximation to a multivariate normal is centered at the mode.
After the basic idea worked, we found it getting trapped in local optima. The last idea was to run multiple optimization paths of Pathfinder in parallel, then importance resample the results. That essentially gives us a low-rank multivariate normal mixture approximation of a posterior, and that worked much better. Each Pathfinder instance runs in parallel, too, and the only serialization bottleneck is importance resampling, which is fast.
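The paper uses Pareto-smoothed importance resampling; a plain (unsmoothed) sketch of the mechanics, with illustrative names, looks like this:

```python
import numpy as np

def importance_resample(draws, log_p, log_q, n_out, seed=0):
    """Resample approximate draws with replacement, with weights
    proportional to p/q.  log_p and log_q are arrays of target and
    approximation log densities at the draws.  This is plain importance
    resampling; the paper's version Pareto-smooths the weights first."""
    rng = np.random.default_rng(seed)
    log_w = np.asarray(log_p) - np.asarray(log_q)   # log importance ratios
    log_w = log_w - log_w.max()                      # stabilize the exp
    w = np.exp(log_w)
    idx = rng.choice(len(draws), size=n_out, replace=True, p=w / w.sum())
    return np.asarray(draws)[idx]
```

The draws pooled from multiple Pathfinder runs form a mixture of normal approximations; the importance weights correct that mixture toward the target before resampling, which is what downweights draws stuck in minor modes.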
Next up, we’re going to try to tackle Phase II of Stan’s warmup. The question is whether Pathfinder’s covariance estimates will be good enough to circumvent, or at least shorten, Phase II. Ben Bales has already done a lot of work around this in his thesis, like parallelizing adaptation in Phase II. The final goal, of course, is robust, massively parallelizable MCMC. But even without parallelization, Pathfinder is way faster than ADVI or Stan’s Phase I warmup.
