https://twitter.com/roydanroy/status/1106366689065734145

I’ve dabbled with both Church and Stan but was always curious about their conceptual differences. Now that this is being discussed here, I was wondering if others could enlighten me on the differences between Stan’s way of doing things and the LISP+Probability way of doing things? Just pointers to other resources would be great, if not a full-fledged argument.

Most symbolic differentiation systems, on the other hand, require a formula, not a program, to differentiate, so they don’t support loops (other than implicitly in matrices/tensors, etc.).

David: Daniel’s on the right track in that speedups are possible when there are algebraic reductions that can be done in the formula that results for the derivatives. Anything like that which can be computed statically is a big win, which is why we’re looking hard at Stan program transformations. The disadvantage of literally getting a formula for the derivative is that it’s rarely set up to efficiently share subexpression computations with the function evaluation code itself, whereas this is built into reverse-mode autodiff.

Some nice implementation details come for free in reverse-mode autodiff from the adjoint propagation—any re-entrancy in the expression graph (re-use of variables) is automatically handled by a kind of dynamic programming (it memoizes all results in the expression graph above a certain point in a topological sort rather than combinatorially multiplying them out).
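To make the adjoint-propagation idea concrete, here is a minimal reverse-mode autodiff sketch (a hypothetical toy, not Stan’s actual implementation): each node’s adjoint accumulates contributions from every use, so a shared subexpression is differentiated once rather than combinatorially multiplied out.

```python
# Toy reverse-mode autodiff illustrating adjoint propagation.
# (Hypothetical sketch; Stan's real tape-based implementation differs.)

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_var, local_partial)
        self.adjoint = 0.0       # accumulated d(output)/d(this node)

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

def backward(output):
    """Propagate adjoints in reverse topological order; each node is
    visited once, which is the dynamic-programming/memoization step."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.adjoint += partial * node.adjoint

# f(x) = (x + x) * x is re-entrant: x appears three times, but its
# adjoint is accumulated, not recomputed per use.  f(x) = 2x^2, f'(x) = 4x.
x = Var(3.0)
y = (x + x) * x
backward(y)
```

Running this leaves `x.adjoint == 12.0`, i.e. 4·3, even though `x` feeds into the graph at three places.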

Second, missing count data is the one case I always bring up where it’d be nice to have a discrete sampler. There’s no way to properly implement that model in Stan. It doesn’t present the same multimodality problems (at least with most models of missingness!) as the NP-hard combinatoric discrete samplers, so I’d like to be able to code it. I’ve seen Stan users go to the extreme length of approximating the Poisson by marginalizing out a large interval around the mean. But that’ll be problematic for things like negative binomial with high dispersion, which won’t have the small posterior variance that the Poisson has.
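The truncated-marginalization workaround mentioned above can be sketched as follows (a hypothetical example, assuming a Poisson latent count and normal measurement error): sum the latent count out over a wide interval around the mean, on the log scale for stability.

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def poisson_lpmf(n, lam):
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def normal_lpdf(y, mu, sigma):
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def marginal_log_lik(y, lam, sigma, width=10):
    """Marginalize the latent true count n out of p(y | n) p(n | lam)
    by summing over an interval of +/- width standard deviations
    around the Poisson mean."""
    lo = max(0, int(lam - width * math.sqrt(lam)))
    hi = int(lam + width * math.sqrt(lam)) + 1
    terms = [poisson_lpmf(n, lam) + normal_lpdf(y, n, sigma)
             for n in range(lo, hi)]
    return log_sum_exp(terms)
```

For a high-dispersion negative binomial the interval would have to be much wider, which is exactly the problem the comment raises.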

Stan 2.18 supports multi-core mapping for the likelihood (and/or the prior or posterior predictive simulations if those are compute intensive). This lets you do the same kind of parallelization that’d be used for neural network layers (it’s a big matrix-matrix multiply for each layer and then a non-linear inverse logit [sigmoid] plus backprop, which is also parallelizable).

This has shown speedups of a factor of 50 using 80 cores and has had a similar effect on workflow. With 16 cores, the speedup will depend on how much computation can be fed to each core as a unit per log density evaluation.
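The shape of this map-style likelihood parallelization can be sketched in a few lines (a hypothetical Python analogue, not Stan’s `map_rect` itself): shard the data, evaluate each shard’s log likelihood on a worker, and sum.

```python
import math
from multiprocessing import Pool

def shard_log_lik(args):
    """Log likelihood of one shard of the data (toy iid-normal example)."""
    ys, mu, sigma = args
    return sum(-0.5 * math.log(2 * math.pi) - math.log(sigma)
               - 0.5 * ((y - mu) / sigma) ** 2 for y in ys)

def parallel_log_lik(ys, mu, sigma, n_shards=4):
    """Split data into shards, map each to a worker, sum the results.
    The speedup depends on how much work each shard carries."""
    shards = [ys[i::n_shards] for i in range(n_shards)]
    with Pool(n_shards) as pool:
        return sum(pool.map(shard_log_lik,
                            [(s, mu, sigma) for s in shards]))
```

Because log densities add, the sharded sum is exactly the serial sum; only the wall-clock time changes.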

Just to give a flavor, the last time I ran across this I was dealing with imperfectly measured counts, and the research goal was to model the underlying counts from the imperfect measurements (among other things). My first reaction was to model the counts as Poisson, with the observed measurements having expectation equal to the true count and some error (normal was the first thing I looked at). Since, at least according to my understanding, the counts would be discrete parameters, this model could not be expressed in the natural, naive way in Stan. I then did a bunch of research looking into continuous approximations (e.g. doing an Anscombe transformation or a custom probability function that was a smooth continuous approximation of the Poisson) and potentially marginalizing.
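For reference, the Anscombe transformation mentioned is 2·√(x + 3/8): for moderately large Poisson means it is approximately normal with variance 1, which is what makes a continuous approximation plausible. A quick sketch checking that (using a simple Knuth-style Poisson sampler, purely for illustration):

```python
import math
import random

def anscombe(x):
    """Anscombe variance-stabilizing transform for Poisson counts:
    for large lambda, 2*sqrt(x + 3/8) is approximately normal
    with variance 1."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)

def poisson_draw(lam):
    """Knuth's multiplication algorithm; fine for moderate lambda."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

random.seed(0)
lam = 50.0
draws = [anscombe(poisson_draw(lam)) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
# var comes out close to 1 regardless of lambda, which is the point
# of the variance stabilization.
```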

In the end I found that it didn’t matter, because the Poisson was actually a pretty bad fit for the data (much higher variability than Poisson). Anyhow, I just feel that I would have been better off if I had been able to fit the model and diagnose it rather than going off on this side research project to work around the issue.

All this is to say that Stan is a great tool and I love it. I just look at all of the amazing flexible models that can be expressed parsimoniously in the language and see discrete parameters (even with their limitations) as being a natural feature to target.

The issue is not so much the paradigm of HMC—we’d be happy to throw in Gibbs or whatever if it makes things faster. The issue is that a Gibbs sampler with discrete parameters can take forever to converge, and it’s much faster with the mixture implementation (when that can be done). Potential poor convergence is a particular concern for Stan, which is intended to run as automatically as possible.

I understand that, while it is a natural thing to do in Gibbs sampling, discrete parameters don’t really fit into the paradigm of HMC.

Suppose you have a 4-core machine. You set up a special chain, and 3 other chains at higher temperatures. In the special chain you attempt to randomly perturb the discrete parameters with some probability, and do HMC with some other probability. At the higher temperatures you’ll be likely to accept the discrete perturbations more often. Each chain attempts to accept swaps with its higher-temperature neighbor at random intervals. The ultimate result, if it works, would be to occasionally have the special chain randomly jump to a different discrete set of parameters.

It’s a thought at least.
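The scheme can be sketched on a toy target (hypothetical example: one binary discrete parameter z selecting between two well-separated modes, random-walk Metropolis standing in for the HMC step). The hot chains flip z easily; swaps carry those jumps down to the cold chain.

```python
import math
import random

random.seed(1)

# Toy target: discrete z picks a mode; theta is the continuous parameter.
MU = {0: -5.0, 1: 5.0}
LOG_W = {0: math.log(0.3), 1: math.log(0.7)}

def log_p(z, theta):
    return LOG_W[z] - 0.5 * (theta - MU[z]) ** 2

TEMPS = [1.0, 2.0, 4.0, 8.0]      # chain 0 is the "special" cold chain
states = [(0, -5.0) for _ in TEMPS]

def mh_step(state, temp):
    z, theta = state
    if random.random() < 0.3:
        # discrete perturbation: propose flipping z
        z_new = 1 - z
        if math.log(random.random()) < (log_p(z_new, theta)
                                        - log_p(z, theta)) / temp:
            z = z_new
    else:
        # random-walk update of theta (HMC in the real proposal)
        t_new = theta + random.gauss(0, 1)
        if math.log(random.random()) < (log_p(z, t_new)
                                        - log_p(z, theta)) / temp:
            theta = t_new
    return z, theta

cold_z = []
for it in range(20000):
    states = [mh_step(s, t) for s, t in zip(states, TEMPS)]
    # propose a swap between one random adjacent pair of chains
    i = random.randrange(len(TEMPS) - 1)
    li, lj = log_p(*states[i]), log_p(*states[i + 1])
    log_acc = (1 / TEMPS[i] - 1 / TEMPS[i + 1]) * (lj - li)
    if math.log(random.random()) < log_acc:
        states[i], states[i + 1] = states[i + 1], states[i]
    cold_z.append(states[0][0])

frac_z1 = sum(cold_z) / len(cold_z)
```

Without the hot chains, a cold chain started in one mode would essentially never flip z, since the flip is crushed by the mode separation at temperature 1.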

(I’ve been doing some neural network stuff and getting about 15x speedup over 16-core CPU-based machines — I attribute the relatively small speedup to using recurrent neural networks, which are inherently less parallelizable — and even this speedup has made model development thinkable.)

0.2 * L1(x|p) + 0.8 * L2(x|p)

This lets you do inference on p averaged over the prior for discrete values of q.

This becomes intractable as the number of discrete parameters increases because of combinatorial explosion in the number of terms you need (for q1 having 3 states, q2 having 5, and q3 having 10, you need 3 * 5 * 10 = 150 different terms).

If you’re trying to do inference on the posterior probability of the q, you can make the mixture probabilities parameters:

q1 * L1(x|p) + q2 * L2(x|p)

and make q1, q2, … a simplex with some prior.
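Both variants can be computed stably on the log scale. A sketch with two hypothetical component likelihoods L1 and L2 sharing a parameter p (the specific densities here are made up for illustration):

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def normal_lpdf(x, mu, sigma):
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

# Two hypothetical component likelihoods sharing the parameter p.
def log_L1(x, p): return normal_lpdf(x, p, 1.0)
def log_L2(x, p): return normal_lpdf(x, p + 3.0, 1.0)

def log_mix_fixed(x, p):
    """log(0.2 * L1(x|p) + 0.8 * L2(x|p)): the discrete parameter
    marginalized over a fixed prior."""
    return log_sum_exp([math.log(0.2) + log_L1(x, p),
                        math.log(0.8) + log_L2(x, p)])

def log_mix_simplex(x, p, q):
    """Same marginalization with simplex weights q = (q1, q2) treated
    as parameters, so their posterior can be inferred."""
    return log_sum_exp([math.log(q[0]) + log_L1(x, p),
                        math.log(q[1]) + log_L2(x, p)])
```

With q fixed at (0.2, 0.8) the two functions agree exactly, which is a handy sanity check when moving from the fixed-prior to the inferred-simplex version.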

The most common uses we’ve seen of discrete parameters are for mixture models. The Stan User’s Guide describes how to write mixture models directly, summing over the discrete parameters, which generally gives faster computation; see chapter 6 of the current version of the manual.

Examples do arise of discrete models that are not easily computed mixtures. Some of these are just really hard computational problems that we’d not want to attack with direct MCMC; for example, if you’re trying to simulate from the Ising model you’ll want to introduce a structure of auxiliary variables; it’s not like you’d want to run Gibbs or whatever directly. For other problems, the discrete parameters are few enough or isolated enough that MCMC could work, and for those problems it could be helpful to augment Stan to allow such parameters. There’s some work being done in Stan on discrete parameters, so maybe this will appear at some point.

Could you talk a little more about discrete parameter marginalization? In my work, the biggest limitation I’ve encountered with Stan is the inability to specify discrete parameters.

is a pretty horrific thing to contemplate ;-)

I can imagine a backend that produces high-level Julia code which links the low-level C++ Stan math library, so that at least you can have a direct in-Julia *interface* to the fitting machinery even if the actual computation is done in C++ code hidden from view. That’d be super nice.

A lot of our effort these days is going into cluster support for multi-core parallelization and GPU support. That’s also going to present challenges for installation, but I think we’ll be building wrappers at that level for popular platforms like AWS. We’ll see what people come up with.

Julia’s autodiff system is interesting in the way it lets third-party libraries contribute differentiable functions. So it’s not out of the question. Having a clean intermediate representation in OCaml is going to make it much easier to experiment with alternative backends.

Looks like one day Stan will only work on Docker with a dozen containers.

So I see this as meaning OCaml is the language in which they are rewriting the front-end entirely, and then adding on stuff.

They mention the example option to use TensorFlow Probability and PyTorch as back-ends… languages where the actual computations get expressed. That’s all fine and dandy, but Julia is where I really want to get my numerical computing done. I’m guessing there is enough interest that it has a reasonable chance to get done if the frontend changes to facilitate it.

With a compiler, first you want to read a language, and then decide what computations that language has specified, optimize those computations, and then spit out something a computer can understand to carry out the calculations. OCaml is enabling the compilation phase. Whatever the backend will be will enable the calculation phase itself. I imagine they’ll start with a C++ backend like it currently has.

Currently, if I understand it correctly, the compiler itself is written in C++, and it generates C++, which then is compiled to machine language and runs the calculation. What this will do is change the compiler from a cumbersome C++ program that reads Stan code to a light and fluffy soufflé of OCaml that reads Stan code. Most likely the back end will start out as C++, and then once the OCaml-based compiler is well in place, someone will come along and give us a backend like Julia ;-)
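The front-end/IR/backend split being described can be illustrated with a toy pipeline (a made-up illustration in Python, not the actual Stan compiler): the front end parses one statement into a tiny IR, and interchangeable backends emit target code from that same IR.

```python
# Toy compiler-pipeline sketch: parse -> IR -> pluggable code generators.
# (Hypothetical illustration only; the real stanc is far richer.)

def parse(src):
    """Front end: turn 'y ~ normal(mu, sigma)' into a tiny IR tuple."""
    lhs, rhs = src.split("~")
    dist, args = rhs.strip().rstrip(")").split("(")
    return ("sample", lhs.strip(), dist.strip(),
            [a.strip() for a in args.split(",")])

def emit_cpp(ir):
    """One backend over the IR: C++-flavored target code."""
    _, var, dist, args = ir
    return f"target += {dist}_lpdf({var} | {', '.join(args)});"

def emit_julia(ir):
    """Another backend over the same IR: Julia-flavored target code."""
    _, var, dist, args = ir
    return f"target += logpdf({dist.capitalize()}({', '.join(args)}), {var})"

ir = parse("y ~ normal(mu, sigma)")
```

The point of the clean IR is exactly this: swapping the backend touches `emit_*` only, never the front end.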
