After getting 80 zillion comments on that last post with all that political content, I wanted to share something that’s purely technical.

It’s something Bob Carpenter wrote in a conversation regarding implementing algorithms in Stan:

One thing we are doing is having the matrix library return more expression templates rather than copying on return as it does now. This is huge in that it avoids a lot of intermediate copies. Some of these don’t look so big when running a single chain, but they stand out when running multiple chains in parallel, where there’s more overall memory pressure.
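For readers who haven’t seen expression templates before, here’s a minimal sketch (illustrative only, not Stan’s or Eigen’s actual code) of the idea: `operator+` returns a lightweight expression object instead of a materialized result, so a chained sum like `a + b + c` is evaluated in one pass on assignment, with no intermediate copy allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal expression-template sketch. Vec materializes any expression on
// assignment in a single loop; Add just records its operands and computes
// elements lazily, so no intermediate result is ever allocated.
struct Vec {
  std::vector<double> data;
  double operator[](std::size_t i) const { return data[i]; }
  std::size_t size() const { return data.size(); }
  template <class Expr>
  Vec& operator=(const Expr& e) {  // materialize: one loop, no temporaries
    data.resize(e.size());
    for (std::size_t i = 0; i < e.size(); ++i) data[i] = e[i];
    return *this;
  }
};

template <class L, class R>
struct Add {                       // holds references; computes lazily
  const L& l;
  const R& r;
  double operator[](std::size_t i) const { return l[i] + r[i]; }
  std::size_t size() const { return l.size(); }
};

template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

With this, `d = a + b + c;` builds a tiny `Add<Add<Vec, Vec>, Vec>` expression and fills `d` in one loop, rather than allocating a temporary for `a + b` first.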

Another current focus for fusing operations is the GPU so that we don’t need to move data on and off GPU between operations.

Stan only does a single arena-based heap allocation other than for local vector and Eigen::Matrix objects (which are standard RAII). Actually, it’s more of an expanding thing, since it’s not pre-sized. But each bit only gets allocated once in exponentially increasing chunk sizes, so there’ll be at most log(N) chunks. It then reuses that heap memory across iterations and only frees at the end.
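Here’s a toy sketch of that arena pattern (invented for this post, not Stan’s actual `stack_alloc`): chunks grow exponentially, so allocating N bytes total needs at most O(log N) chunks, and a `recover()` call rewinds the cursor so the same memory is reused across iterations without freeing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy arena allocator: exponentially growing chunks, bump-pointer
// allocation, and reuse across iterations via recover(). Memory is only
// released when the arena itself is destroyed.
class Arena {
  std::vector<std::vector<char>> chunks_;  // owned blocks, doubling in size
  std::size_t cur_ = 0;                    // chunk currently being filled
  std::size_t used_ = 0;                   // bytes used in that chunk
 public:
  void* alloc(std::size_t n) {
    while (true) {
      if (cur_ == chunks_.size()) {        // need a fresh, bigger chunk
        std::size_t grown = std::size_t{1} << (10 + chunks_.size());
        chunks_.emplace_back(n > grown ? n : grown);
      }
      if (used_ + n <= chunks_[cur_].size()) break;
      ++cur_;                              // current chunk is full; move on
      used_ = 0;
    }
    void* p = chunks_[cur_].data() + used_;
    used_ += n;
    return p;
  }
  void recover() { cur_ = 0; used_ = 0; }  // reuse memory across iterations
  std::size_t num_chunks() const { return chunks_.size(); }
};
```

The first chunk size (1024 bytes) is an arbitrary choice for the sketch; the point is the doubling, which keeps the chunk count logarithmic in total bytes allocated.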

Another thing we’ve done is add a standard functional map operation that internally partially evaluates reverse-mode autodiff for each block over which the function is mapped. This reduces overall memory use and keeps the partial evaluations more memory local, both of which speed things up at the expense of clunkier Stan code.
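A toy illustration of why that helps (the names and the per-block gradient function here are invented for the example): each block’s value and gradient are computed immediately, so the intermediates for one block can be discarded before the next block runs, and peak memory is bounded by the block size rather than the whole data set.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Value plus gradient contribution from one block of data.
struct BlockResult {
  double value;
  std::vector<double> grad;
};

// Per-block function: here f(x) = sum of x_i^2, with gradient 2 * x_i.
BlockResult sum_sq_block(const std::vector<double>& x) {
  BlockResult r{0.0, std::vector<double>(x.size())};
  for (std::size_t i = 0; i < x.size(); ++i) {
    r.value += x[i] * x[i];
    r.grad[i] = 2 * x[i];
  }
  return r;
}

// Map f over blocks, partially evaluating the derivative per block so each
// block's intermediates die before the next block is processed.
BlockResult checkpointed_map(
    const std::function<BlockResult(const std::vector<double>&)>& f,
    const std::vector<double>& data, std::size_t block_size) {
  BlockResult total{0.0, {}};
  for (std::size_t start = 0; start < data.size(); start += block_size) {
    std::size_t end =
        start + block_size < data.size() ? start + block_size : data.size();
    std::vector<double> chunk(data.begin() + start, data.begin() + end);
    BlockResult r = f(chunk);  // block's intermediates are freed right here
    total.value += r.value;
    total.grad.insert(total.grad.end(), r.grad.begin(), r.grad.end());
  }
  return total;
}
```

In a real autodiff system the per-block evaluation would build and then discard a small tape rather than computing the gradient analytically, but the memory profile is the same.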

The other big thing we’re doing now is looking at static matrix types of the sort used by Autograd and JAX. Stan lets you assign into a matrix, which destroys any hope of memory locality. If matrices never have their entries modified after creation (for example, when built by a comprehension or returned whole by an operation), then the values and adjoints can each be kept as a separate contiguous matrix of doubles. Our early experiments are showing a 5–50-fold speedup, depending on how the cost of the copies being eliminated compares to the cost of the operations themselves. Addition benefits a lot: it currently incurs O(N^2) copies for O(N^2) operations (on N x N matrices). Multiplication, with an O(N^2) copy cost but an O(N^3) operation cost, is less optimizable when matrices get big.
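To make the layout point concrete, here’s an illustrative sketch (the types are invented for this example, not Stan’s design): if a matrix is never assigned into after creation, its values and adjoints can live in two contiguous double arrays instead of a matrix of autodiff variables, so reverse-mode updates become cache-friendly streaming loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Static" (immutable-after-creation) matrix as a struct of arrays:
// all values in one contiguous block, all adjoints in another.
struct StaticMatrix {
  std::vector<double> val;  // contiguous values, written once at creation
  std::vector<double> adj;  // contiguous adjoints, updated in the reverse pass
  explicit StaticMatrix(std::size_t n) : val(n, 0.0), adj(n, 0.0) {}
};

// Reverse pass for C = A + B: adj_A += adj_C and adj_B += adj_C. With the
// struct-of-arrays layout this is a single pass over three dense arrays,
// with no pointer-chasing through per-entry autodiff variables.
void add_reverse(StaticMatrix& a, StaticMatrix& b, const StaticMatrix& c) {
  for (std::size_t i = 0; i < c.adj.size(); ++i) {
    a.adj[i] += c.adj[i];
    b.adj[i] += c.adj[i];
  }
}
```

Compare that to a matrix of per-entry `var` objects, where values and adjoints are interleaved (or scattered on the heap) and every pass touches twice the memory it needs.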

I love this because I don’t understand half of what Bob’s talking about, but I know it’s important. To make decisions under uncertainty, we want to fit hierarchical models, and that can get computationally expensive. In short: technical progress on computing allows us to fit better models, so we can learn more.

Recall the problems with this study, which could’ve been avoided by a Bayesian analysis of uncertainty in specificity and sensitivity, and multilevel regression and poststratification for adjusting for differences between sampling and population. In that particular example, no new computational developments are required—Stan will work just fine as it is, and to the extent that Stan improvements would help, it would be in documentation and interfaces. But, moving forward, computational improvements will help us fit bigger and better models. This stuff can make a real difference in our lives, so I wanted to highlight how technical it can all get.

This is very important; I can only work with models that do not overtax my patience D:

Two qualifications on this:

1. None of the new Stan developments I’m reporting on are being done by me. I’m involved in the design of the immutable matrices, but mainly just as a commenter to point out constraints coming from the rest of Stan’s architecture. I’m not even involved in the design of anything on the GPU. Steve Bronder’s doing the work on immutable matrices, Tadej Ciglarič on expression templates to reduce copies of matrices, and the whole GPU team, including Rok Česnovar, Steve, and Tadej, is working on unfolding things for the GPU. And this will eventually have to hit the language, where I believe Sean Talts and Matthijs Vákár laid the groundwork with earlier work on optimizations in the generated code.

2. I wrote this in reply to an email from a Columbia student interested in autodiff. I told Andrew that everything I said should be well known in theory to anyone working seriously on autodiff or GPUs. What’s less well known is the approach the student was taking, which was similar to Adept’s use of expression templates to do intermediate partial autodiff evaluations (called “checkpointing” in the autodiff literature; it’s also what map does when run in a single chain). That didn’t show up until a later reply.

I also forgot to mention the cool parallel stuff that Sebastian Weber and Ben Bales so heroically pushed through for the latest release; see Sebastian’s blog post on within-chain parallelism from a few days ago. This involved a lot of very careful work on our autodiff structure to make this efficient.

Although it’s nothing really novel for autodiff, I finished integrating complex numbers into the Stan math lib, with a lot of help from my friends. We should soon have complex data types and functions, plus complex linear algebra operations like FFTs and asymmetric eigendecomposition.

It really is kind of mind-blowing thinking about how much work is going on simultaneously with Stan. I told the first few Columbia students who asked about volunteering on the project before version 1 was released that they’d have to sit in the office and write C++ with me, Matt, and Daniel if they wanted to contribute, because nothing was stable enough to scale out work to new devs.

Tadej is also working on using expression templating to simplify adding efficient GPU support for new functions. Here is the paper on that for anyone interested: https://dl.acm.org/doi/abs/10.1145/3388333.3388654

And a YouTube presentation: https://www.youtube.com/watch?v=0wUxv6Mv61g&feature=emb_title

I am training an AI to write Carpenterese. So far it’s generating things like this:

“A heap operation can be complicated so there must be a well-designed data structure to manage the heap, preferably inside glibc although alternative approaches are possible. The data structure will have both micro and macro structure and one should take care when defining the structure to allow for parallelization, not necessarily now but in the future. Memory allocation and de-allocation in particular can be a problem with malleable data types.”

Someday I hope to have the AI write extensive comments on all of Bob’s posts. There is no way I will ever be able to do it myself.

I think Phil’s comment will be Best of Thread no matter how many more comments are added!

Try this: https://pdos.csail.mit.edu/archive/scigen/

You can tweak their open-source word and phrase dictionary to align with Carpenterese.

Bob, how much is the Stan team looking at e.g. tensorflow for some of these issues? I don’t know much about how reverse mode autodiff works under the hood, but some of the things you mentioned—like GPU-support and not being able to assign into an existing matrix—sound familiar to me as a tensorflow user.

Hi. I’m on the TensorFlow Probability team. We’ve had some initial discussions about providing a TensorFlow backend for Stan, and there was some very initial prototyping done by Sean Talts. This is an area of active interest for us, but it’s not being actively worked on right now due to resource constraints.

Everyone’s looking at everything. It’s a small friendly niche community. Hi, Rif! Speaking of the community, I’m really hoping Prob Prog happens later this year.

Stan’s autodiff is much more like PyTorch and JAX or TensorFlow in eager mode than it is like the highly optimized TensorFlow graph execution. The beauty of TensorFlow’s graph mode is that they have the whole computational call structure available at compile time, so there’s a lot of optimization potential. We’re looking to push the kinds of operations TensorFlow can do on its graphs into the Stan language code generator.

A lot of the early design decisions were made based on wanting to be able to autodiff through everything and have a flexible BUGS-style way of building up models. I knew at the time we were paying a price by having matrices assignable, but the alternative would’ve been a lot of matrix autodiff and ODE solver autodiff and whatnot that we didn’t have capacity for at the time. As is, we could just plug into Eigen and Boost and away we went. In retrospect, had I known where the project was going in scale, we might’ve made different decisions. Matt Hoffman was lobbying for code generation a la Theano, which gives you a great deal of flexibility for optimization, but also restricts the kinds of operations you can do, much like TensorFlow graph execution. I was lobbying for general imperative code, and since I was writing the language parser and code generator, I got to make that decision.

Had a system like Theano or TensorFlow or PyTorch existed when we started writing Stan, we almost certainly would’ve just chosen one of those libraries. We didn’t want to write an autodiff system—it was just that the existing ones weren’t engineered well enough to be easily extensible in the ways we needed.