New Within-Chain Parallelisation in Stan 2.23: This One's Easy for Everyone!

What's new? The new and shiny reduce_sum facility released with Stan 2.23 is far more user-friendly than what came before and makes it much easier to scale Stan programs across CPU cores. While Stan is awesome for writing models, as the size of the data or the complexity of the model grows, long execution times can make it impractical to work with the model iteratively. The new reduce_sum facility lets users devote more than one CPU to each chain, so that performance can be scaled to the user's needs, provided they have access to the respective resources such as a multi-core computer or (even better) a large cluster. reduce_sum is designed to calculate in parallel a (large) sum of independent function evaluations. This is essentially the evaluation of the likelihood of the observed data as a sum of independent contributions, which applies to most Stan programs (GP problems would not qualify, though).

Where do we come from? Before 2.23, the map_rect facility was the only tool in Stan enabling CPU-based parallelisation. Unfortunately, map_rect has an awkward interface, since it forces the user to pack their model into a set of weird data structures. Using map_rect often requires a complete rewrite of the model, which is error-prone, time-intensive, and certainly not user-friendly. In addition, chunks of work had to be formed manually, leading to great confusion around how to "shard" things. As a result, map_rect was only used by a small number of super-users. I feel like I should apologise for map_rect, given that I proposed the design. Still, map_rect did drive some crazy analyses with up to 600 cores!

What is it about? reduce_sum leverages the fact that the sum operation is associative. As a consequence, we can break a large sum of independent terms into an arbitrary number of partial sums. Hence, the user needs to provide a "partial sum" function. This function must follow conventions that allow it to evaluate arbitrary partial sums. The key to user-friendliness is that the partial sum function accepts an arbitrary number of additional arguments of arbitrary structure. Therefore, the user can formulate their model naturally, as no awkward packing/unpacking is needed. Finally, the actual slicing into smaller partial sums is performed fully automatically, which tunes the computational task to the given resources.
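As a sketch of what this looks like in practice, here is a minimal Bernoulli-logit model along the lines of the logistic regression example in the documentation (the data names `y`, `x` and parameters `alpha`, `beta` are illustrative; Stan 2.23 array syntax):

```stan
functions {
  // Partial sum over observations start..end. reduce_sum supplies the
  // slice y_slice and the indices start and end; all further arguments
  // (x, alpha, beta) are passed through unchanged.
  real partial_sum(int[] y_slice, int start, int end,
                   vector x, real alpha, real beta) {
    return bernoulli_logit_lpmf(y_slice | alpha + beta * x[start:end]);
  }
}
data {
  int<lower=0> N;
  int<lower=0, upper=1> y[N];
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
model {
  int grainsize = 1;  // 1 = let the scheduler pick slice sizes automatically
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  target += reduce_sum(partial_sum, y, grainsize, x, alpha, beta);
}
```

Note that the model code reads almost exactly like its serial counterpart; only the likelihood statement moves into the partial sum function.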

What can users expect? As usual, the answer is "it depends". Great… but on what? Well, first of all, we have to account for the fact that we do not parallelise the entire Stan program; only a fraction of the total program runs in parallel. The theoretical speedups in this case are described by Amdahl's law (the plot is taken from the respective Wikipedia page)

[Figure: Amdahl's law, theoretical speedup versus number of cores for different parallel fractions (from Wikipedia)]

You can see that only when the fraction of parallel work is really large (beyond 95%) can you expect very good scaling of the performance up to many cores. Still, doubling the speed is easily achieved in most cases with just 2-3 cores. Thus, users should pack as much of their Stan program as possible into the partial sum function to increase the parallel fraction of the workload: not only the data likelihood, but ideally also, for example, the calculation of the per-observation model mean. For Stan programs this will usually mean that code from the transformed parameters and model blocks is moved into the partial sum function. As a bonus for doing so, we have actually observed that this can speed up your program even when using only a single core! The reason is that reduce_sum slices the given task into many small ones, which improves the use of CPU caches.
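As a hypothetical sketch of this advice, instead of computing the linear predictor once in the transformed parameters block, it can be computed inside the partial sum function so that this work is parallelised as well (illustrative design matrix `X` and coefficient vector `beta`):

```stan
functions {
  real partial_sum(int[] y_slice, int start, int end,
                   matrix X, vector beta) {
    // The per-observation mean X[start:end] * beta is computed inside the
    // slice rather than once up front, so this work runs in parallel too.
    return bernoulli_logit_lpmf(y_slice | X[start:end] * beta);
  }
}
```

The more of the per-observation computation lives inside this function, the larger the parallel fraction in Amdahl's law.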

How can users apply it? Easy! Grab CmdStan 2.23 and dive into our documentation (R / Python users may use CmdStanR / CmdStanPy; RStan 2.23 is underway). I recommend going through our documentation in this order:

1. A case study which adapts Richard McElreath’s intro to map_rect for reduce_sum
2. User manual introduction to reduce_sum parallelism with a simple example as well: 23.1 Reduce-Sum
3. Function reference: 9.4 Reduce-Sum Function

I am very happy with the new facility. It was a tremendous piece of work to get this into Stan and I want to thank my Stan team colleagues Ben Bales, Steve Bronder, Rok Cesnovar, and Mitzi Morris for making all of this possible in a really short time frame. We are looking forward to what our users will do with it. We definitely encourage everyone to try it out!

10 Comments

  1. Andrew says:

    Sebastian:

    I see in the link that there’s an example for logistic regression. Can I expect that there will soon be implementations of logistic regression in rstanarm, brms, etc., that will do this automatically? Or even a logistic regression function within Stan that does all these reduce_sum operations by itself? As a lazy user, I want the benefits of this speedup without having to go in and program all those steps myself.

    Here’s what I’d like to see: I enter my credit card into some online parallel cluster and reserve, say, 400 processors. I then run Stan, telling it I can access 100 processors per chain, and in compilation it does what’s necessary to distribute the computations. Then it would just run.

    Is this feasible?

    • Sebastian Weber says:

I really hope that rstanarm and brms will provide reduce_sum as an option. I did have an open ear to both developers when designing reduce_sum. The question is just how efficient that will be versus a hand-crafted model. It is going to be interesting. As for making this all automatic in Stan itself: well, it is possible to plug this stuff under the hood of our glm functions in order to give you speedups whenever you don't have a GPU (or maybe the CPU version also helps with smaller data sizes). I am not sure. Porting all those glm functions to use this would be quite some work, and I am sceptical whether it would really be helpful anyway. As you can see from Amdahl's law, you really have to cram just about everything into the parallel portion of the model to get really good speedups. However, I would really encourage everyone who has runtime challenges to dive into reduce_sum: it is really easy to code given that you can master writing Stan programs. I ported a number of very complex models to this thing in a matter of ~30 minutes or so.

      If you buy 400 cores and run 4 chains, then we can hopefully get to the point where slower running chains can get the resources from faster finishing chains… but yeah, you should go and buy those crazy 128 core machines for your office…

  2. Ian Fellows says:

    This is great! Are there any facilities for profiling Stan code so that the user can figure out where bottlenecks are so that they know where to apply this? I usually have a good idea, but sometimes am surprised and end up optimizing something that makes no difference.

    • Sebastian Weber says:

We don't have a profiler for Stan, no. After all, it's a C++ program which you can profile using standard profiling tools, but that's neither easy nor intuitive to use. However, you should not be looking for individual operations to put into reduce_sum. Instead, try to put as much of your Stan program as possible into reduce_sum. Basically, this means moving most of the transformed parameters block and the model block into the reduce_sum thing. With this strategy you maximise the portion of the code which runs in parallel, which is so important, as you can see from Amdahl's law. The only downside of doing so is that you won't get the intermediate quantities (like the per-data-row mean) as output from your Stan runs, since the partial sum function only returns the log-likelihood value. So you will need to re-calculate those intermediates, which are often needed for predictive checks, in the generated quantities block. But don't worry about that; it is computationally cheap.
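A sketch of what recomputing such intermediates might look like (the names `X`, `beta`, `N` are illustrative, continuing a hypothetical Bernoulli-logit model):

```stan
generated quantities {
  // Recompute the per-observation mean for posterior predictive checks.
  // This block runs once per draw and needs no gradients, so it is cheap.
  vector[N] mu = X * beta;
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = bernoulli_logit_rng(mu[n]);
}
```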

      I think Stan programs where we worry about speed will be written differently in the future – at least this is what I will do.

  3. Ignacio says:

    Any chance that the next Stacon will have a tutorial session about this?

    • Sebastian Weber says:

      That’s probably a good idea. Let’s see. In the meantime please refer to the documentation – we have really tried to make it as easy as possible for users to get started. If there is still some “friction”, then just head to the Stan discourse forums.

@Ian Fellows: Looks like brms is gaining support for reduce_sum (see https://github.com/paul-buerkner/brms/issues/892)… but if porting the Stan code generated by brms over to reduce_sum is an option for you, then I can only encourage you to do that already now. It is really not that hard, but it requires you to touch Stan code and handle CmdStan. Still, in your case you can expect to run things considerably faster, and with runtimes of days the time investment will be well worth it. Again, the forum is there for help.

  4. mathDR says:

Can you comment on the recommended usage of map_rect and reduce_sum simultaneously? Is there a class of models that would benefit from having both paradigms?

    • Sebastian Weber says:

You should refrain from using a nested parallelisation scheme. It can be done and Stan supports it, but it's best to choose a sensible unit of computation and then do a single layer of parallelisation. This way you reduce the overhead which is always present with any parallelisation. If your computational task is so huge that this concern is irrelevant, then it does not really matter if or how you nest things. Given that we usually calculate the log-likelihood as a sum of independent terms, I would opt for always using reduce_sum, as it is much easier to use given its improved interface over map_rect.
