Alp Kucukelbir, Rajesh Ranganath, Dave Blei, and I write:

We describe an automatic variational inference method for approximating the posterior of differentiable probability models. Automatic means that the statistician only needs to define a model; the method forms a variational approximation, computes gradients using automatic differentiation and approximates expectations via Monte Carlo integration. Stochastic gradient ascent optimizes the variational objective function to a local maximum. We present an empirical study that applies hierarchical linear and logistic regression models to simulated and real data.

Alp clarifies:

This is not standard variational Bayes. The core differences are:

+ gaussian variational approximation in unconstrained parameter space

+ gradient-based algorithm. *not* coordinate ascent.

+ stochastic optimize. *not* the same as stochastic variational inference (SVI) where data is subsampled.

Our next goals are to test the current code on more complicated models and get started on implementing SVI. This will allow stan to address massive datasets, which piqued many people’s interest at NIPS 2014.

My contributions to this project so far have mostly been limited to expressing a lot of enthusiasm about the idea, making some connections between people, and supplying some data for one of the examples.

There’s a lot of good stuff here and I have high hopes for it. The code is open and available for all; I hope it will be in regular Stan soon so anybody can easily try it out.

How is this different from just approximating the posterior (of transformed parameters) by a Gaussian? Isn’t that a standard technique? (I know I’m missing something.)

After choosing the Gaussian family for the approximation, the important thing is to choose where to put it (i.e., setting the mean of the Gaussian) and how wide to make it (setting its covariance). After all, you could just take a zero-mean Gaussian with unit covariance and call it a Gaussian approximation, although it will be quite a poor one for most posterior distributions.

There are different algorithms (Laplace’s method/Laplace approximation, VB, EP…) for finding the mean and covariance and they will often find different values and they have different computational complexities or requirements. Check, for example, Figure 1 in http://research.microsoft.com/pubs/67425/tr-2005-173.pdf for some comparisons (e.g., alpha = 0 is VB and alpha = 1 is EP in this example).

How black-box is this? Is this the kind of thing where one could have part of the model coded outside of Stan? I don’t think so, since the autodiff is invoked in the description.

The application I have in mind is applying Stan with mechanistic disease models. These models can be pretty complicated (i.e. there is a whole lot that goes on between the model parameters and the computed quantities that are fed into the likelihood). I have tried to rebuild an existing model in Stan but it was pretty creaky. The killer app for me would be something where Stan can hand off parameter samples, the model (outside Stan) can hand back computed quantities, then Stan evaluates likelihood and draws new samples (and repeat). I realize this may be missing the point as the autodiff seems a critical part of Stan (I would guess what I describe would need some numerical approximation of the gradients), but though I would mention in case (a) I am wrong and Stan can actually operate like this now, or (b) it might be possible in the future. As a plug, this would greatly widen the applicability of Stan, at least in my field. As for the current version of Stan — looks awesome, very grateful for the developers, and am just waiting for an analysis that is better suited for it.

We’re running into similar problems with PK/PD models we’re working on.

But I’m afraid we don’t have any plans to allow arbitrary plug-in non-differentiable functions.

Without gradients, it’s hard to explore the posterior efficiently. You might want to try something like emcee, which is set up to do what you want. It should at least be better than the standard Metropolis approach to fully black-box inference. Especially if the problem isn’t very high dimensional.

If you could write the analysis function in templated C++, then you can link it to Stan and Stan can just autodiff through it. What we really need is an easy way for users to write extensions in C++. It’s on our near-term to-do list, along with adding a way to import function libraries written in Stan.

Hi Bob – thanks for the advice! Will have a go with emcee (it seems I can get the same sampler (AIES) in R with the LaPlacesDemon package).

‘…an easy way to write extensions in C++…’ — would be good. Being one that only ventures outside of R with trepidation, Rcpp has made my life a lot easier.

VBSTAN? Please please please don’t make “Stan” all caps. People will think it’s an acronym. And then Andrew will just stir things up by saying “Stan” stands for “sampling through adaptive neighborhoods”. It doesn’t. It’s a joke. Really. Andrew likes to play with names and second-order associations, if people haven’t noticed. Feel free to humour him by chuckling politely when he makes a joke about “Stan” being an acronym. But don’t quote him on it, because he’s just teasing everyone (well me, mostly; I take the bait every time).

A logical point of comparison would be infer.net which uses variational message passing and EP. It got some bad speed comparisons for some regression problems when the project was in an early stage, but I imagine that those were implementation details. I gave up on figuring it out for comparisons for a pretty lazy reason: they use Csoft = C# + custom stuff for models. I’ve never used C# or F#, and using it on linux seems incompletely supported and requires something like Mono, Wine, and IronPython and jiggering of libraries on my cluster.