How does Stan work? A reading list.

Bob writes, to someone who is doing work on the Stan language:

The basic execution structure of Stan is in the JSS paper (by Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell) and in the reference manual. The details of autodiff are in the arXiv paper (by Bob Carpenter, Matt Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt). These are sort of background for what we’re trying to do.
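
For readers who have not seen reverse-mode autodiff before, here is a minimal sketch of the general idea in Python. It is not Stan's implementation (which is C++ and far more careful about memory); the names `Var`, `exp`, and `grad` are made up for illustration. Each node records its parents and local partial derivatives during the forward pass, and a reverse sweep then accumulates adjoints.

```python
import math


class Var:
    """A scalar node in the expression graph (hypothetical name, not Stan's API)."""

    def __init__(self, value, parents=()):
        self.value = value      # result of the forward pass
        self.adjoint = 0.0      # d(output)/d(this node), filled in by grad()
        self.parents = parents  # pairs of (parent node, local partial derivative)

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def exp(x):
    """Exponential of a Var; the local partial is exp(x) itself."""
    v = math.exp(x.value)
    return Var(v, ((x, v),))


def grad(output, inputs):
    """Reverse sweep: propagate adjoints from the output back to the inputs."""
    order, seen = [], set()

    def visit(node):
        # Post-order depth-first search: parents are listed before the node.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    for node in order:
        node.adjoint = 0.0
    output.adjoint = 1.0
    # Reversed post-order visits every node before any of its parents.
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial
    return [x.adjoint for x in inputs]


x, y = Var(2.0), Var(3.0)
f = x * y + exp(x)
df_dx, df_dy = grad(f, [x, y])  # analytically: y + exp(x) and x
```

One reverse sweep yields the derivative with respect to every input at once, which is why this mode suits gradients of a scalar log density with many parameters.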

If you haven’t read Maria Gorinova’s MS thesis and POPL paper (with Andrew Gordon and Charles Sutton), you should probably start there.

Radford Neal’s intro to HMC is nice, as is the one in David MacKay’s book. Michael Betancourt’s papers are the thing to read to understand HMC deeply; he just wrote another brain bender on geometric autodiff (all on arXiv). Starting with the one on hierarchical models would be good, as it explains the necessity of reparameterizations.
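
To make the HMC references a bit more concrete, here is a minimal sketch of basic HMC with a leapfrog integrator, roughly along the lines of Neal's introduction. It assumes a unit mass matrix and a fixed step size and number of steps, so it is plain HMC and not the adaptive NUTS variant Stan runs. The function names and arguments are hypothetical.

```python
import math
import random


def leapfrog(q, p, grad_log_density, step_size, num_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    q, p = list(q), list(p)
    g = grad_log_density(q)
    for _ in range(num_steps):
        p = [pi + 0.5 * step_size * gi for pi, gi in zip(p, g)]  # half step for momentum
        q = [qi + step_size * pi for qi, pi in zip(q, p)]        # full step for position
        g = grad_log_density(q)
        p = [pi + 0.5 * step_size * gi for pi, gi in zip(p, g)]  # second half step
    return q, p


def hmc_step(q, log_density, grad_log_density, step_size, num_steps, rng):
    """One HMC transition: resample momentum, integrate, then accept or reject."""
    p0 = [rng.gauss(0.0, 1.0) for _ in q]
    q_new, p_new = leapfrog(q, p0, grad_log_density, step_size, num_steps)
    # Hamiltonian = potential energy (-log density) + kinetic energy.
    h_old = -log_density(q) + 0.5 * sum(pi * pi for pi in p0)
    h_new = -log_density(q_new) + 0.5 * sum(pi * pi for pi in p_new)
    # Metropolis correction for the integrator's discretization error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q


# Illustrative target: a standard normal in one dimension.
def std_normal_lp(q):
    return -0.5 * sum(x * x for x in q)


def std_normal_grad(q):
    return [-x for x in q]


rng = random.Random(1)
q = [0.0]
draws = []
for _ in range(2000):
    q = hmc_step(q, std_normal_lp, std_normal_grad, 0.25, 6, rng)
    draws.append(q[0])
```

The step size and number of steps are the tuning parameters that NUTS and Stan's adaptation choose automatically; here they are fixed by hand.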

Also I recommend our JEBS paper (with Daniel Lee and Jiqiang Guo) as it presents Stan from a user’s rather than a developer’s perspective.

And, for more general background on Bayesian data analysis, we recommend Statistical Rethinking by Richard McElreath and BDA3.

3 thoughts on “How does Stan work? A reading list.”

  1. I was aiming for people who wanted to jump in as developers on the C++ side. And it’s just for the HMC/NUTS sampling. The differences between the original Hoffman and Gelman NUTS and what Stan runs now involve improvements to adaptation (documented in the reference manual) and to the adapted NUTS-like sampling (documented in Betancourt’s exhaustive HMC paper and in the conceptual introduction). It’s still worth reading the original Hoffman and Gelman JMLR paper.

    For variational inference, the thing to read is the Kucukelbir et al. JMLR paper on automatic differentiation variational inference.

    For optimization, we’re using L-BFGS as described on the web, in Numerical Recipes, and in the classic Nocedal and Wright textbook.

    For higher-order autodiff, we’re using pretty standard forward mode with reverse nested inside. I keep meaning to write the follow-on arXiv paper for that. Maybe after the recent round of testing improvements goes in.

    There’s all kinds of implementation detail for more complex derivatives for various solvers from linear solvers in matrix libraries, through algebraic equation solvers, to the ordinary differential equation solvers. Yi Zhang’s pushing that work toward PDEs and differential algebraic equations. Charles Margossian wrote an intro paper on autodiff that should also be useful for understanding how all this stuff works. We could use a nice reference on the implementation details of autodiff through these solvers. There are also details not in the paper like the new lazy adjoint-Jacobian helper which lets us write simple and memory/time-efficient multivariate derivatives.

    A level up from the math and sampling library, there’s the language. That’s all being rewritten in OCaml and Matthijs Vákár is working on theory for it at the same time. We don’t have good intro docs there, but there’s a lot of the usual functional stuff going on there.

    At a higher level, it’s worth diving into how rstanarm implements GLMs. There’s numerical conditioning of the data matrices coupled with some nice default priors. I’m not sure what the best reference is there. Then there’s also all the work around evaluation with R-hat (for sampling), K-hat (for variational inference and other approximate methods), simulation-based calibration (SBC), and Pareto-smoothed importance sampling for efficient approximate leave-one-out cross-validation (PSIS-LOO).

    A lot of our current work is focused around scaling and generalizing. At the lowest level, that’s primarily multi-core and multi-threading for map operations and GPU support for big matrix operations. I don’t think we have great doc for any of that anywhere.

    At the middle level, it’s a lot of efficient, general functions for modeling. For instance, we want to write higher-order Gaussian process covariance functions and use partial evaluation of derivatives (what the autodiff literature calls checkpointing) to reduce memory and hence improve speed (just about any reduction in memory pressure yields an improvement in speed in these cache-heavy numerical algorithms).

    At the highest level, it’s new algorithms like Laplace approximations, marginal optimization and/or sampling, expectation propagation, etc. We have lots of drafts of things floating around for the algorithms.

  2. I know the above post was aimed at developers and understanding how Stan works, but since it did mention Statistical Rethinking – I have to say that the free Statistical Rethinking lectures on YouTube, combined with this re-coded Statistical Rethinking in ‘brms’, the amazing vignettes for ‘brms’ and ‘rstanarm’, the prior choice recommendations here, and this blog, really make learning accessible for the end user like myself.
    This stuff gets more user friendly all the time :)
    I really appreciate all the people involved.
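
As a small illustration of the R-hat diagnostic mentioned in the first comment, here is a minimal sketch of the classic split R-hat as described in BDA3. It assumes equal-length chains with at least four draws each, and it is not necessarily what Stan reports today, since newer versions of the diagnostic add rank normalization. The function name `split_rhat` is made up for this example.

```python
import math
import random
from statistics import mean, variance


def split_rhat(chains):
    """Classic split R-hat for a list of equal-length chains of scalar draws."""
    half = len(chains[0]) // 2
    # Split every chain into its first and second half (dropping the middle
    # draw when the length is odd), so within-chain drift shows up too.
    splits = []
    for chain in chains:
        splits.append(chain[:half])
        splits.append(chain[-half:])
    n = half
    between = n * variance([mean(s) for s in splits])  # B: between-chain variance
    within = mean(variance(s) for s in splits)         # W: mean within-chain variance
    var_plus = (n - 1) / n * within + between / n      # pooled variance estimate
    return math.sqrt(var_plus / within)


rng = random.Random(0)
# Four chains sampling the same distribution versus chains stuck in two places.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
stuck = [[rng.gauss(5.0 * (j % 2), 1.0) for _ in range(500)] for j in range(4)]
print(split_rhat(mixed))  # close to 1
print(split_rhat(stuck))  # well above 1
```

Values near 1 say the chains agree with each other; values well above 1 say they have not mixed.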
