stanc3: rewriting the Stan compiler

I’d like to introduce the stanc3 project, a complete rewrite of the Stan 2 compiler in OCaml.

Join us!

With this rewrite and migration to OCaml, there’s a great opportunity to join us on the ground floor of a new era. Your enthusiasm for or expertise in programming language theory and compiler development can help bring Stan into the modern world of language design, implementation, and optimization. If this sounds interesting, we could really use your help! We’re meeting twice a week for a quick standup on Mondays and Wednesdays at 10am EST, and I’m always happy to help people get started via email, hangout, or coffee. If you’re an existing Stan user, get ready for friendly new language features and performance upgrades to existing code! It might be a little bumpy along the way, but we have a really great bunch of people working on it who all care most about making Stan a great platform for practicing scientists with bespoke modeling needs.

The opportunity

Stan is a successful and mature modeling language with core abstractions that have struck a chord, but our current C++ compiler inhibits some next-gen features that we think our community is ready for. Our users and contributors have poured a huge amount of statistical expertise into the Stan project, and we now have the opportunity to put similar amounts of programming language theory and compiler craftsmanship into practice. The rewrite will also aim at a more modular architecture, which will allow tooling to be built on top of the Stan compiler: think IDE auto-completion and error highlighting, as well as programming and statistical code linters that can help users with common sources of modeling issues.

OCaml’s powerful and elegant pattern matching and seasoned parsing library make it a natural fit for the kinds of symbolic computation required of compilers. These strengths make it much more pleasant and productive for the task at hand, which is reflected in OCaml’s frequent use by programming language researchers and compiler implementers. OCaml’s flagship parsing library Menhir enabled Matthijs Vákár to rewrite the Stan parsing phase in about a week, and to add hundreds of new custom error messages in another week. Matthijs is obviously a beast, but I think he would agree that OCaml & Menhir definitely helped. Come join us and see for yourself :)
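
To give a flavor of why pattern matching helps (a toy sketch, not stanc3’s actual internals), here is a miniature expression type and a structural query over it, each a single pattern match:

    (* A toy expression AST in the spirit of a compiler intermediate
       representation; the constructors are illustrative, not stanc3's. *)
    type expr =
      | Var of string
      | Lit of float
      | FunApp of string * expr list
      | BinOp of expr * string * expr

    (* "Does this expression mention a given variable?" in five lines. *)
    let rec mentions name = function
      | Var v -> v = name
      | Lit _ -> false
      | FunApp (_, args) -> List.exists (mentions name) args
      | BinOp (lhs, _, rhs) -> mentions name lhs || mentions name rhs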

New language features

After we replicate the current Stan compiler’s functionality, we will be targeting new language features. The to-do list includes, but is not necessarily limited to:

  • tuples
  • tools for representing and working with ragged arrays
  • higher-order functions (functions that take other functions as arguments)
  • variadic functions
  • annotations
    • to bring methods like Posterior Predictive Checking and Simulation-Based Calibration into Stan itself
    • to label variables as “silent” (not output), or as living on a GPU or other separate hardware
    • to assist those who would like to use Stan as an algorithms workbench
  • user-defined gradients
  • representations for missing data and sparse matrices
  • discrete parameter marginalization

Next-gen optimization

Beyond new language features, the rewrite unlocks next-gen compiler optimizations. Here is just some of the low-hanging fruit:

  • peephole optimizations: we might notice when a user types log(1 - x) and replace it with log1m(x) automatically (see the sketch after this list)
  • finding redundant computations and sharing the results
  • moving computation up outside of loops (including the sampling loop!)
  • using the data sizes to ahead-of-time compile a specialized version of the Stan program in which we can easily unroll loops, inline functions, and pre-allocate memory
  • pulling parts of the Math library into the Stan compiler to e.g. avoid checking input matrices for positive-definiteness on every iteration of HMC
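
As promised in the first bullet, here is a minimal sketch of the log(1 - x) rewrite, reusing the toy expr type from the earlier sketch; the real IR and pass machinery will of course differ:

    (* Rewrite log(1 - x) to the more numerically stable log1m(x),
       recursing so the rule fires anywhere inside an expression. *)
    let rec peephole = function
      | FunApp ("log", [ BinOp (Lit 1., "-", x) ]) ->
          FunApp ("log1m", [ peephole x ])
      | FunApp (f, args) -> FunApp (f, List.map peephole args)
      | BinOp (lhs, op, rhs) -> BinOp (peephole lhs, op, peephole rhs)
      | (Var _ | Lit _) as e -> e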

There is a wealth of information at the Stan language level that we can take advantage of to produce more efficient code than the more mature C++ compilers we rely on can manage on their own, and we can use the new compiler to pass some of that information along to the C++ code we generate. Maria Gorinova showed us with SlicStan how to automatically move code to its most efficient (Stan) block, along with a nice composition-friendly syntax. We can use similar static analysis tools in a probabilistic setting to e.g. allow for discrete parameters via automated Rao-Blackwellization (i.e. integrating them out), or to discover conjugacy relationships and use analytic solutions where applicable. We can go a step further and integrate with a symbolic differentiation library to get symbolic derivatives for Stan code as a fast substitute for automatic differentiation.
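
To give a taste of that last idea, here is a symbolic d/dx over the same toy expr type with just a few rules; a real implementation needs full coverage plus algebraic simplification:

    (* Symbolic derivative with respect to x: enough rules for flavor. *)
    let rec diff x = function
      | Var v -> Lit (if v = x then 1. else 0.)
      | Lit _ -> Lit 0.
      | BinOp (u, "+", v) -> BinOp (diff x u, "+", diff x v)
      | BinOp (u, "*", v) ->
          (* product rule: (uv)' = u'v + uv' *)
          BinOp (BinOp (diff x u, "*", v), "+", BinOp (u, "*", diff x v))
      | FunApp ("log", [ u ]) -> BinOp (diff x u, "/", u) (* chain rule *)
      | _ -> failwith "derivative rule not implemented in this sketch"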

Advanced techniques

Once we’ve created a platform for expressing Stan language concepts and optimizing them, we’ll naturally want to bring as much of our computation onto that platform as possible so we can optimize holistically. This will mean either using techniques like Lightweight Modular Staging to parse our existing C++ library into our Stan compiler representation, or beginning a project to rewrite and simplify the Stan Math library in the Stan language itself. We hope that, with some of the extensions above, we’ll be able to express the vast majority of the Math library in the Stan language and lean heavily on a symbolic differentiation library and the stanc3 code generator to produce optimized C++ code. This should shrink the size of our Math library by something like 12x and take the code generation techniques used in PyTorch to the next level.

Alternative backend targets (TensorFlow, PyTorch, etc.)

At that point, targeting multiple backends will become fairly trivial. We can put compilation times squarely in the cross-hairs and provide an interpreted Stan that immediately gives feedback and has minimal time-to-first-draw. We can also target other backends like TensorFlow Probability and PyTorch that do not possess the wealth of specialty statistical functions and distributions that we do, but may make better use of the 100,000 spare GPUs you have sitting in your garage.

Comments

        • It sounds like OCaml is the *front end*, meaning it reads Stan programs and converts them into some back-end language for describing the numerical calculations involved.

          They mention the example option to use TensorFlow Probability and PyTorch as back-ends… languages where the actual computations get expressed. That’s all fine and dandy, but Julia is where I really want to get my numerical computing done. I’m guessing there is enough interest that it has a reasonable chance of getting done if the front-end changes facilitate it.

          With a compiler, first you want to read a language, and then decide what computations that language has specified, optimize those computations, and then spit out something a computer can understand to carry out the calculations. OCaml is enabling the compilation phase. Whatever the backend will be will enable the calculation phase itself. I imagine they’ll start with a C++ backend like it currently has.

          Currently, if I understand it correctly, the compiler itself is written in C++, and it generates C++, which is then compiled to machine language and runs the calculation. What this will do is change the compiler from a cumbersome C++ program that reads Stan code to a light and fluffy soufflé of OCaml that reads Stan code. Most likely the back end will start out as C++, and then once the OCaml-based compiler is well in place, someone will come along and give us a backend like Julia ;-)
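
          To make the “read, decide, optimize, spit out” picture concrete, the last step can be as small as a recursive walk that prints C++ expressions; a toy over the expr type sketched in the post, nothing like the real code generator:

              (* Toy code generator: expression AST -> C++ source text. *)
              let rec emit = function
                | Var v -> v
                | Lit f -> string_of_float f
                | FunApp (f, args) ->
                    f ^ "(" ^ String.concat ", " (List.map emit args) ^ ")"
                | BinOp (l, op, r) ->
                    "(" ^ emit l ^ " " ^ op ^ " " ^ emit r ^ ")"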

        • It sounds like they will be using OCaml as an intermediary that will do some compiler passes before C++ sees it, related to the “next-gen” features mentioned, emitting some kind of pre-optimized C++.

        • > After we replicate the current Stan compiler’s functionality, we will be targeting new language features

          So I see this as meaning OCaml is the language in which they are rewriting the front-end entirely, and then adding on stuff.

        • We don’t have any concrete plans yet for alternative back ends. Stan’s math library is going to be hard to duplicate elsewhere, so we may wind up implementing the language with only limited function support when we do look at alternative backends.

          Julia’s autodiff system is interesting in the way it lets third-party libraries contribute differentiable functions. So it’s not out of the question. Having a clean intermediate representation in OCaml is going to make it much easier to experiment with alternative backends.

        • I didn’t mean to imply you guys were definitely doing Julia, but I did imagine that it’d be easier to experiment with backends once you have a really nice modular front-end.

          I can imagine a backend that produces high-level Julia code which links the low-level C++ Stan math library, so that at least you have a direct in-Julia *interface* to the fitting machinery even if the actual computation is done in C++ code hidden from view. That’d be super nice.

  1. It may be good news for those of us who install once and work, but having to install OCaml (and Julia?) infrastructure on other people’s machines, and keeping compile flags up to date, is going to be more nightmarish than it is now.

    Looks like one day Stan will only work on Docker with a dozen containers.

    • We’re hoping to actually make this easier. We’re going to be distributing platform-specific binaries for the OCaml transpiler, so at least it won’t require fiddling with compiler flags.

      A lot of our effort these days is going into cluster support for multi-core parallelization and GPU support. That’s also going to present challenges for installation, but I think we’ll be building wrappers at that level for popular platforms like AWS. We’ll see what people come up with.

      • Any word on how far along GPU support is? I’ll be showing `rstanarm` and `brms` to co-workers tomorrow and will mention GPU support (where the stated goal is to try to get a 100x speedup) but I haven’t seen anything lately about whether GPU support might enter beta this year, next year, or we don’t know when.

        (I’ve been doing some neural network stuff and getting about 15x speedup over 16-core CPU-based machines — I attribute the relatively small speedup to using recurrent neural networks which are inherently less parallelizable — and even this speedup has made model development thinkable.)

        • [Sorry, wasn’t clear: GPU was 15x faster than a 16-core EC2 instance for the same training data and network. This turned 24+ hour runs into an hour and a half, which allows you to keep enough focus to make progress.]

        • Wayne: It depends on what you’re doing as far as GPU speedup goes. The big wins are for compute-intensive functions like Cholesky factorization where a lot of work can be done on the GPU.

          Stan 2.18 supports multi-core mapping for the likelihood (and/or the prior or posterior predictive simulations if those are compute intensive). This lets you do the same kind of parallelization that’d be used for neural network layers (a big matrix-matrix multiply for each layer, then a non-linear inverse logit [sigmoid], plus backprop, which is also parallelizable).

          This has shown speedups of a factor of 50 using 80 cores and has had a similar effect on workflow. With 16 cores, the speedup will depend on how much computation can be fed to each core as a unit for each log density evaluation.
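
          The shape of that mapping is easy to sketch: shard the data, evaluate each shard’s log density contribution in parallel, and sum. A toy using OCaml 5 Domains with a hypothetical per-shard lp function, not Stan’s actual map machinery:

              (* Shard-parallel log likelihood: one domain per shard,
                 then sum the partial log densities. *)
              let parallel_log_lik lp shards =
                shards
                |> List.map (fun shard -> Domain.spawn (fun () -> lp shard))
                |> List.map Domain.join
                |> List.fold_left ( +. ) 0.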

  2. This sounds amazing, and if I’m honest a little scary.

    Could you talk a little more about discrete parameter marginalization? In my work, the biggest limitation I’ve encountered with Stan is the inability to specify discrete parameters.

    • Ian:

      The most common uses we’ve seen of discrete parameters are for mixture models. The Stan User’s Guide describes how to write mixture models directly, summing over the discrete parameters, which generally gives faster computation; see chapter 6 of the current version of the manual.

      Examples do arise of discrete models that are not easily-computed mixtures. Some of these are just really hard computational problems that we’d not want to attack with direct MCMC; for example, if you’re trying to simulate from the Ising model, you’ll want to introduce a structure of auxiliary variables; it’s not like you’d want to run Gibbs or whatever directly. For other problems, the discrete parameters are few enough or isolated enough that MCMC could work, and for those problems it could be helpful to augment Stan to allow such parameters. There’s some work being done in Stan on discrete parameters, so maybe this will appear at some point.

      • I think I oversold it as “biggest limitation.” I should have said “only limitation.” My use cases have tended to be rather standardly within Stan’s main strengths. There have definitely been times where I did want a discrete parameter, and marginalizing it was beyond what my brain was willing to do during the model exploration phase.

        I understand that, while it is a natural thing to do in Gibbs sampling, discrete parameters don’t really fit into the paradigm of HMC.

        • Ian:

          The issue is not so much the paradigm of HMC—we’d be happy to throw in Gibbs or whatever if it makes things faster. The issue is that a Gibbs sampler with discrete parameters can take forever to converge, and it’s much faster with the mixture implementation (when that can be done). Potential poor convergence is a particular concern for Stan, which is intended to run as automatically as possible.

        • I’ve dealt quite extensively from the research side with poor convergence and phase transitions in MCMC, so I know there are no free lunches. In fact, if you could do it efficiently in all cases then P=NP, which is _probably_ not the case.

          Just to give a flavor: the last time I ran across this, I was dealing with imperfectly measured counts, and the research goal was to model the underlying counts from the imperfect measurements (among other things). My first reaction was to model the counts as Poisson, with the observed measurements having expectation equal to the true count, plus some error (normal was the first thing I looked at). Since, at least according to my understanding, the counts would be discrete parameters, this model could not be expressed in the natural naive way in Stan. I then did a bunch of research looking into continuous approximations (e.g. doing an Anscombe transformation, or a custom probability function that was a smooth continuous approximation of the Poisson) and potentially marginalizing.

          In the end I found that it didn’t matter, because the Poisson was actually a pretty bad fit for the data (much higher variability than Poisson). Anyhow, I just feel that I would have been better off if I had been able to fit the model and diagnose it, rather than going off on this side research project to work around the issue.

          All this is to say that Stan is a great tool and I love it. I just look at all of the amazing flexible models that can be expressed parsimoniously in the language and see discrete parameters (even with their limitations) as a natural feature to target.

        • Ian: Yes, the plan is to do the marginalization automatically in cases where it’s feasible. Matthijs is working on it. We’ve also been toying with adding an inefficient Gibbs implementation, but for reasons Andrew mentions, we don’t trust it. We also don’t know how the warmup is affected by changing discrete parameters. (The evaluation should be possible with PyMC3.)

          Second, missing count data is the one case I always bring up where it’d be nice to have a discrete sampler. There’s no way to properly implement that model in Stan. It doesn’t present the same multimodality problems (at least with most models of missingness!) as the NP-hard combinatoric discrete samplers, so I’d like to be able to code it. I’ve seen Stan users go to the extreme length of approximating the Poisson by marginalizing out a large interval around the mean. But that’ll be problematic for things like the negative binomial with high dispersion, which won’t have the small posterior variance that the Poisson has.

    • I think what’s meant by discrete parameter marginalization is writing likelihoods as a discrete mixture over the different possible values of the discrete parameter. For example, if p is a vector of continuous parameters, L1(x|p) is the likelihood when q = 1, L2(x|p) is the likelihood when q = 2, and the prior puts a 20% chance on q = 1 and 80% on q = 2, then you write the likelihood for x as:

      L1(x|p)*.2 + L2(x|p)*.8

      This lets you do inference on p averaged over the prior for discrete values of q.

      This becomes intractable as the number of discrete parameters increases because of the exponential explosion in the number of terms you need (if q1 has 3 states, q2 has 5, and q3 has 10, you need 3 * 5 * 10 = 150 different terms).

      If you’re trying to do inference on the posterior probability of the q, you can make the mixture probabilities parameters:

      L1(x|p)*q1 + L2(x|p)*q2

      and make q1, q2, … a simplex parameter with some prior.
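
      On the log scale that prior-weighted sum is a log-sum-exp; a generic sketch of just the arithmetic (OCaml for concreteness, not Stan code):

          (* Numerically stable log(sum_i exp(x_i)). *)
          let log_sum_exp xs =
            let m = List.fold_left max neg_infinity xs in
            m +. log (List.fold_left (fun acc x -> acc +. exp (x -. m)) 0. xs)

          (* log p(x) = log(sum_k pi_k * p(x | q = k)); for the example above,
             log_marginal [log 0.2; log 0.8] [l1; l2]. *)
          let log_marginal log_weights log_liks =
            log_sum_exp (List.map2 ( +. ) log_weights log_liks)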

    • P.S. Yet another idea is for the language itself to allow discrete parameters and then have the marginalization be done automatically in the compilation. But that seems to me like it would be really hard, as it would only work for some sorts of models but not others.

      • Seems like it should be possible to do something like umbrella sampling to enable discrete transitions.

        Suppose you have a 4-core machine. You set up a special chain, and 3 other chains at higher temperatures. In the special chain you attempt to randomly perturb the discrete parameters with some probability, and do HMC with some other probability. At the higher temperatures you’ll be likely to accept the discrete perturbations more often. Each chain attempts to accept swaps with its higher-temperature neighbor at random intervals. The ultimate result, if it works, would be to occasionally have the special chain randomly jump to a different discrete set of parameters.

        It’s a thought at least.
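
        For what it’s worth, the swap step in such a scheme is usually the standard replica-exchange Metropolis rule; a sketch with all names hypothetical:

            (* Chains i and j target p(theta)^beta; swap their states with
               probability min(1, exp((beta_i - beta_j) * (logp_j - logp_i))). *)
            let log_swap_prob ~beta_i ~beta_j ~logp_i ~logp_j =
              min 0. ((beta_i -. beta_j) *. (logp_j -. logp_i))

            let accept_swap ~beta_i ~beta_j ~logp_i ~logp_j =
              log (Random.float 1.) < log_swap_prob ~beta_i ~beta_j ~logp_i ~logp_j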

        • There are lots of things we could do if we wanted to do discrete sampling. We could easily do Metropolis discrete steps or Gibbs. The problem is that we can’t efficiently, in general, compute the Markov blanket for a discrete parameter. So we can’t do what BUGS or JAGS do and efficiently set up Gibbs sampling (or generalizations thereof).

        • I just imagine that it’d be extremely unlikely to get acceptance… if you have discrete parameters and you try to change them, it’s going to be the case that the continuous parameters should also change a bunch… which is why I imagine an umbrella type sampling procedure would make sense, basically relax the posterior quite a bit so that what would be “big” changes due to discrete variables are still acceptable at the “high” temperatures, and then you can rely on HMC to move the continuous variables into the region where they ought to be, and try to anneal these things down into the temperature = 1 version… otherwise I see discrete transitions as extremely large barriers to acceptance in many cases.

    • I’m not sure how it’s implemented, but I can easily imagine that you’d have some loop, for example, moving through some data and multiplying, adding, square-rooting, etc., parameters together with data, and each time through the loop you have to construct and spit out a specialized autodiff “object” that represents the contribution of this particular expression at this particular point in the loop to the overall derivative… So the memory overhead would be proportional to the size of the data times the number of parameters involved… Whereas a compiler could analyze it and realize all it has to do is accumulate a couple extra numbers through the loop…

      • Daniel: What you’re talking about sounds more like code-transforming autodiff. General loop handling is very challenging with code transformation because the reverse pass needs to go in reverse order, so the autodiff code can’t be generated with the same control structure.

        Most symbolic differentiation systems, on the other hand, require a formula, not a program, to differentiate, so they don’t support loops (other than implicitly in matrices/tensors, etc.).

        David: Daniel’s on the right track in that speedups are possible when there are algebraic reductions that can be done in the resulting derivative formula. Anything like that which can be computed statically is a big win, which is why we’re looking hard at Stan program transformations. The disadvantage of literally getting a formula for the derivative is that it’s rarely set up to efficiently share subexpression computations with the function evaluation code itself, whereas this is built into reverse-mode autodiff.

        Some nice implementation details come for free in reverse-mode autodiff from the adjoint propagation—any re-entrancy in the expression graph (re-use of variables) is automatically handled by a kind of dynamic programming (it memoizes all results in the expression graph above a certain point in a topological sort rather than combinatorially multiplying them out).
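
        To make the tape idea concrete, here is a minimal reverse-mode sketch (a toy, implying nothing about Stan’s actual C++ implementation): each operation records its inputs and local partials, and the reverse pass sweeps the tape newest-first, accumulating adjoints.

            type node = {
              value : float;
              mutable adj : float;             (* accumulated adjoint *)
              parents : (node * float) list;   (* (input, local partial) *)
            }

            let tape : node list ref = ref []  (* newest node at the head *)

            let mk value parents =
              let n = { value; adj = 0.; parents } in
              tape := n :: !tape;
              n

            let var v = mk v []
            let ( *: ) a b = mk (a.value *. b.value) [ (a, b.value); (b, a.value) ]
            let ( +: ) a b = mk (a.value +. b.value) [ (a, 1.); (b, 1.) ]
            let log_ a = mk (log a.value) [ (a, 1. /. a.value) ]

            (* Reverse pass: seed the output, then propagate adjoints
               newest-first; re-used nodes simply accumulate, which is the
               memoization mentioned above. *)
            let grad out =
              out.adj <- 1.;
              List.iter
                (fun n ->
                  List.iter (fun (p, d) -> p.adj <- p.adj +. d *. n.adj) n.parents)
                !tape

        For z = log_ (x *: y) +: x with x = var 2. and y = var 3., grad z leaves x.adj = 1.5 (that is, 1/x + 1) and y.adj at about 0.333 (1/y).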

  3. Really looking forward to this; the time to first sample (I was actually just using the optimizer for MAPs) for moderately sized (~30k rows) and complex (thousands of predictors in tens of groups) HLMs in rstan was longer than I could wait (no evaluations after hours). After going through the terrible agony of coding the same model in TensorFlow, it started up in less than a minute.

  4. I came across this comment from Dan Roy at U Toronto

    https://twitter.com/roydanroy/status/1106366689065734145

    I’ve dabbled with both Church and Stan but was always curious about their conceptual differences. Now that this is being discussed here, I was wondering if others could enlighten me on the differences between Stan’s way of doing things and the LISP+Probability way of doing things? Just pointers to other resources would be great, if not a full-fledged argument.

    • LISP has its own philosophy about language design (homoiconic syntax (only (nested) parentheses and atoms as syntactic objects)), and we clearly aren’t in that tradition. Computationally, Church uses a fairly simplistic generative modeling approach that doesn’t seem to work in finite time on most modeling problems, whereas Stan’s main paradigm is a mostly imperatively-specified joint log density function coupled with black-box Hamiltonian Monte Carlo for sampling from the posterior.
