Stan’s Within-Chain Parallelization now available with brms

The just-released version 2.14.0 of the R package brms supports within-chain parallelization of Stan. This new functionality is based on the recently introduced reduce_sum function in Stan, which makes it possible to evaluate sums over (conditionally) independent log-likelihood terms in parallel, using multiple CPU cores at the same time via threading. The idea of reduce_sum is to exploit the associativity and commutativity of the sum operation, which allow any large sum to be split into many smaller partial sums.
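Concretely, for N conditionally independent observations the log likelihood decomposes into partial sums that can be evaluated on separate threads, for example:

```
log p(y | theta) = sum_{i=1}^{N} log p(y_i | theta)
                 = sum_{i=1}^{k} log p(y_i | theta)  +  sum_{i=k+1}^{N} log p(y_i | theta)
```

Each partial sum is computed independently and the results are added at the end, which is why the decomposition leaves the total unchanged.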

Paul Bürkner did an amazing job enabling within-chain parallelization via threading for a broad range of models supported by brms. Note that threading is currently only available with the CmdStanR backend of brms, since the minimal Stan version supporting reduce_sum is 2.23 and rstan is still at 2.21. It may still take some time until rstan can directly support threading, but once configured, users will usually not notice any difference between the two backends.
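For illustration, enabling threading in brms looks roughly like this (a minimal sketch using the `threading()` helper and the `backend = "cmdstanr"` argument introduced with brms 2.14.0; the formula uses the `epilepsy` example dataset shipped with brms, and the exact model is only illustrative):

```r
library(brms)

# Fit a Poisson multilevel model with 2 threads per chain via reduce_sum.
# Requires CmdStanR and a CmdStan >= 2.23 installation.
fit <- brm(
  count ~ zAge + (1 | patient),
  data    = epilepsy,
  family  = poisson(),
  backend = "cmdstanr",
  threads = threading(2)
)
```

With 4 chains (the default) and 2 threads per chain, this run would occupy up to 8 cores in total, so the thread count should be chosen with the available hardware in mind.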

We encourage users to read the new threading vignette in order to get an intuition for the new feature and for the speedups one can expect for a given model. The speed gain from adding more CPU cores per chain will depend on many model details. In brief:

  • Stan models taking days or hours can run in a few hours or minutes, but models that already run in just a few minutes will be hard to accelerate
  • Models with computationally expensive likelihoods will parallelize better than those with cheap-to-evaluate ones, such as a normal or Bernoulli likelihood
  • Non-hierarchical models and hierarchical models with few groupings will benefit greatly from parallelization, while hierarchical models with many random effects will gain somewhat less in speed

The new threading feature is marked as "experimental" in brms, since it is entirely new and some details may need to change depending on further experience with it. We are looking forward to hearing users' stories about the new feature on the Stan Discourse forums.

New Within-Chain Parallelisation in Stan 2.23: This One's Easy for Everyone!

What’s new? The new and shiny reduce_sum facility released with Stan 2.23 is far more user-friendly than what came before and makes it much easier to scale Stan programs across more CPU cores. While Stan is awesome for writing models, as the size of the data or the complexity of the model increases, it can become impractical to work iteratively with the model due to overly long execution times. Our new reduce_sum facility allows users to utilise more than one CPU per chain, so that performance can be scaled to the needs of the user, provided that the user has access to the respective resources, such as a multi-core computer or (even better) a large cluster. reduce_sum is designed to calculate in parallel a (large) sum of independent function evaluations, which is essentially the evaluation of the likelihood of the observed data with independent contributions, as applies to most Stan programs (GP problems would not qualify, though).

Where do we come from? Before 2.23, the map_rect facility in Stan was the only tool enabling CPU-based parallelisation. Unfortunately, map_rect has an awkward interface, since it forces the user to pack their model into a set of weird data structures. Using map_rect often requires a complete rewrite of the model, which is error-prone, time-intensive, and certainly not user-friendly. In addition, chunks of work had to be formed manually, leading to great confusion around how to "shard" things. As a result, map_rect was only used by a small number of super-users. I feel like I should apologise for map_rect, given that I proposed the design. Still, map_rect did drive some crazy analyses with up to 600 cores!

What is it about? reduce_sum leverages the fact that the sum operation is associative. As a consequence, we can break a large sum of independent terms into an arbitrary number of partial sums. Hence, the user needs to provide a "partial sum" function. This function must follow conventions that allow it to evaluate arbitrary partial sums. The key to user-friendliness is that the partial sum function accepts an arbitrary number of additional arguments of arbitrary structure. Therefore, the user can formulate their model naturally, as no awkward packing/unpacking is needed. Finally, the actual slicing into smaller partial sums is performed fully automatically, tuning the computational task to the given resources.
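The conventions are easiest to see in a minimal sketch, modelled on the Poisson example in the Stan documentation: the partial sum function receives a slice of the sliced argument plus its start and end indices in the full data, followed by any shared arguments (here, the rate parameter).

```stan
functions {
  // Partial sum over a slice of the data; start and end are the
  // indices of y_slice within the full data array y.
  real partial_sum(int[] y_slice, int start, int end, real lambda) {
    return poisson_lpmf(y_slice | lambda);
  }
}
data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0> lambda;
}
model {
  int grainsize = 1;  // 1 lets the scheduler choose slice sizes automatically
  lambda ~ gamma(2, 2);
  target += reduce_sum(partial_sum, y, grainsize, lambda);
}
```

The first argument after the function is the array being sliced, the grainsize controls the (suggested) size of the partial sums, and all remaining arguments are passed unchanged to every partial sum evaluation.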

What can users expect? As usual, the answer is "it depends". Great… but on what? Well, first of all we have to account for the fact that we do not parallelise the entire Stan program; only a fraction of the total program is run in parallel. The theoretical speedups in this case are described by Amdahl's law (plot taken from the respective Wikipedia page):

[Figure: Amdahl's law — theoretical speedup as a function of the number of cores, for different parallel fractions of the program; from Wikipedia]

You can see that only when the parallel fraction of the task is really large (beyond 95%) can you expect very good scaling of performance up to many cores. Still, doubling the speed is easily done in most cases with just 2-3 cores. Thus, users should pack as much of their Stan program as possible into the partial sum function to increase the fraction of parallel work load – not only the data likelihood, but ideally also the per-observation calculation of the model mean, for example. For Stan programs this will usually mean moving code from the transformed parameters and model blocks into the partial sum function. As a bonus for doing so, we have actually observed that this can speed up your program – even when using only a single core! The reason is that reduce_sum slices the given task into many small ones, which improves the use of CPU caches.
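For reference, Amdahl's law gives the theoretical speedup S from running a fraction p of the program in parallel on n cores:

```
S(n) = 1 / ((1 - p) + p / n)
```

For example, with p = 0.95 and n = 8 cores, S(8) = 1 / (0.05 + 0.95/8) ≈ 5.9, and even with infinitely many cores the speedup is capped at 1 / (1 - p) = 20.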

How can users apply it? Easy! Grab CmdStan 2.23 and dive into our documentation (R / Python users may use CmdStanR / CmdStanPy – RStan 2.23 is underway). I would recommend going over our documentation in this order:

1. A case study which adapts Richard McElreath’s intro to map_rect for reduce_sum
2. User manual introduction to reduce_sum parallelism with a simple example as well: 23.1 Reduce-Sum
3. Function reference: 9.4 Reduce-Sum Function

I am very happy with the new facility. It was a tremendous piece of work to get this into Stan, and I want to thank my Stan team colleagues Ben Bales, Steve Bronder, Rok Cesnovar, and Mitzi Morris for making all of this possible in a really short time frame. We are looking forward to seeing what our users will do with it. We definitely encourage everyone to try it out!