Thanks, NVIDIA

Andrew and I both received a note like this from NVIDIA:

We have reviewed your NVIDIA GPU Grant Request and are happy to support your work with the donation of (1) Titan Xp to support your research.

Thanks!

In case other people are interested, NVIDIA’s GPU grant program provides ways for faculty or research scientists to request GPUs; they also have graduate fellowships and larger programs.

Stan on the GPU

The pull requests are stacked up and being reviewed and integrated into the testing framework as I write this. Stan 2.19 (or 2.20 if we get out a quick 2.19 in the next month) will have OpenCL-based GPU support for double-precision matrix operations like multiplication and Cholesky decomposition. And the GPU speedups are stackable with the multi-core MPI speedups that just came out in CmdStan 2.18 (RStan and PyStan 2.18 are in process and will be out soon).
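For CmdStan users, both features are switched on at build time. If the current development branches are any guide, enabling them should look something like the following in CmdStan's make/local file (a sketch; the exact flag names could still change before the release):

STAN_OPENCL=true        # compile the OpenCL (GPU) code paths
OPENCL_DEVICE_ID=0      # which OpenCL device to use
OPENCL_PLATFORM_ID=0    # which OpenCL platform that device lives on
STAN_MPI=true           # compile the MPI (multi-core/multi-machine) code paths
CXX=mpicxx              # MPI builds need an MPI compiler wrapper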

[Figure: plot of GPU timing]

Figure 1. The plot shows the latest performance figures for Cholesky factorization; the x-axis is the matrix dimension and the y-axis is the speedup versus the regular (CPU) Cholesky factorization. I’m afraid I don’t know which CPU/GPU combo this was tested on.

Academic hardware grants

I’ve spent my academic career coasting on donated hardware, back when hardware was a bigger deal. It started at Edinburgh in the mid-80s with a Sun workstation donated to our department. LaTeX on the big screen was just game changing over working on a terminal and then printing the PostScript. Then we got Dandelions from Xerox (crazy Lisp machines with a do-what-I-mean command line), and continued with really great HP Unix workstations at Carnegie Mellon that had crazy high-res CRT monitors for the late ’80s. Then I went into industry, where we had to pay for hardware. Now that I’m back in academia, I’m glad to see there are still hardware grant programs.

Stan development is global

We’re also psyched that so much core Stan development is coming from outside of Columbia. For the core GPU developers, Steve Bronder is at Capital One and Erik Štrumbelj and Rok Češnovar are at the University of Ljubljana. Erik’s the one who tipped me off about the NVIDIA GPU Grant program.

Daniel Lee is also helping out with the builds and testing and coding standards, and he’s at Generable. Sean Talts is also working on the integration here at Columbia; he played a key design role in the recent MPI launch, which was largely coded by Sebastian Weber at Novartis in Basel.

Where do I learn about log_sum_exp, log1p, lccdf, and other numerical analysis tricks?

Richard McElreath inquires:

I was helping a colleague recently fix his MATLAB code by using log_sum_exp and log1m tricks. The natural question he had was, “where do you learn this stuff?”

I checked Numerical Recipes, but the statistical parts are actually pretty thin (at least in my 1994 edition).

Do you know of any books/papers that describe these techniques?

I’d love to hear this blog’s answers to these questions.

I replied that I learned numerical analysis “on the street” through HMM implementations. HMMs are also a good introduction to the kind of dynamic programming technique I used for that Poisson-binomial implementation we discussed (which we’ll build into Stan one of these days—it’ll be a fun project for someone). Then I picked up the rest through a hodge-podge of case-based learning.

“Numerical analysis” is the name of the field and the textbooks where you’ll learn log_sum_exp and log1p and complementary CDFs, and learn how 0 is so very different from 1 (the smallest double-precision floating-point value greater than zero is around 10^-300, whereas the largest double-precision value less than 1 is about 1 − 10^-16), which is rather relevant for statistical computation. You’ll also learn about catastrophic cancellation (which makes naive variance calculations so unstable) and things like the stable Welford algorithm for calculating variance, which also has the nice property of behaving as a streaming accumulator (i.e., it needs only constant memory). I don’t know which books are good, but there are lots of web sites and course materials you can try.
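To give a flavor, here's a minimal R sketch of two of those tricks (the function names are mine): log_sum_exp, which pulls out the max so the exponentials can't overflow, and Welford's one-pass variance accumulator, which avoids the catastrophic cancellation of the naive sum-of-squares formula.

log_sum_exp <- function(u) {
  m <- max(u)                         # factor out the max so exp() can't overflow
  m + log(sum(exp(u - m)))
}
log_sum_exp(c(-1000, -1001))          # naive log(sum(exp(u))) underflows to -Inf

welford_var <- function(x) {
  mean <- 0; m2 <- 0; n <- 0
  for (xi in x) {                     # one pass, constant memory
    n <- n + 1
    delta <- xi - mean
    mean <- mean + delta / n          # update running mean
    m2 <- m2 + delta * (xi - mean)    # update running sum of squared deviations
  }
  m2 / (n - 1)                        # sample variance
}
welford_var(1e9 + c(4, 7, 13, 16))    # 30, where sum(x^2) - n * mean^2 blows up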

The more advanced versions of this will be about matrices and how to maintain the stability of iterative algorithms: things like pivoted LL^T (Cholesky) decompositions and how to do stable matrix division. A lot of that’s also about dealing with memory caching, using blocked algorithms for efficiency. A decent matrix multiplier will be more than an order of magnitude faster than a naive approach on large matrices.

“Algorithms and data structures” is the CS class where you learn about things like dynamic programming (e.g., how to calculate HMM likelihoods, fast Fourier transforms, and matrix multiplication ordering).
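For instance, the HMM forward algorithm is just a dynamic program over the latent states, and in log space it leans on the same log_sum_exp trick as above. A minimal R sketch (the names and argument layout are mine):

log_sum_exp <- function(u) { m <- max(u); m + log(sum(exp(u - m))) }

hmm_log_likelihood <- function(log_pi, log_A, log_B, y) {
  # log_pi: length-K vector of log initial-state probabilities
  # log_A:  K x K matrix, log_A[i, j] = log Pr(state j at t | state i at t - 1)
  # log_B:  K x V matrix of log emission probabilities
  # y:      observation sequence coded as integers in 1:V
  K <- length(log_pi)
  alpha <- log_pi + log_B[, y[1]]          # forward values at t = 1
  for (t in seq_along(y)[-1]) {
    alpha <- vapply(1:K, function(k)
      log_sum_exp(alpha + log_A[, k]) + log_B[k, y[t]],
      numeric(1))                          # marginalize out the previous state
  }
  log_sum_exp(alpha)                       # log p(y): marginalize the final state
}

Each step is O(K^2), so the whole likelihood is O(T K^2) rather than the O(K^T) of brute-force enumeration, which is the dynamic programming payoff.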

Algorithms class won’t typically get into the low-level caching and branch prediction stuff you need to know to build something like Stan efficiently. There, you need to start diving into the compiler literature and the generated assembly and machine code. I can highly recommend Agner Fog’s overviews on C++ optimization—they’re free and cover most of what you need to know to start thinking about writing efficient C++ (or Fortran—the game’s a bit different with statically typed functional languages like ML).

The 1D integrator in Stan (it will probably land in 2.19—there are a few kinks to work out in Ben Bales’s math lib code) uses inputs that provide both the integration variable and its complement (x and the closest boundary of the integration interval minus x). Ben Goodrich helped a lot, as usual, with these complicated numerical things. The result is an integrator with enough precision to integrate the beta distribution between 0 and 1 (the trick is the asymptote at 1).

Integration in general is another advanced numerical analysis field with tons of great references on error accumulation. Leimkuhler and Reich is the usual intro reference that’s specific to Hamiltonian systems; we use the leapfrog (Störmer-Verlet) integrator for NUTS, and this book has a nice analysis. We’re looking now into some implicit algorithms to deal with “stiff” systems that cause relatively simple explicit algorithms like Runge-Kutta to require step sizes so small as to be impractical; we already offer implicit solvers within Stan for dynamics modeling (the _bdf integrators). Hairer et al. is the more mathematically advanced reference for integrators. There are tons of great course notes and applied mathematics books out there for implementing Euler, implicit Euler, Runge-Kutta, Adams-Moulton, implicit midpoint, etc., all of which have different error and symplecticity properties that heavily tie into implementing efficient Hamiltonian dynamics. Yi Zhang at Metrum is now working on improving our underlying algorithms and adding partial differential equation solvers. Now I have a whole new class of algorithms to learn.
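For the curious, the leapfrog integrator itself fits in a few lines. Here's a hedged R sketch for a unit mass matrix, where grad_U returns the gradient of the potential energy (the negative log density); all the names are mine:

leapfrog <- function(theta, rho, grad_U, eps, L) {
  rho <- rho - 0.5 * eps * grad_U(theta)         # initial half step for momentum
  for (l in 1:L) {
    theta <- theta + eps * rho                   # full step for position
    if (l < L) rho <- rho - eps * grad_U(theta)  # full step for momentum
  }
  rho <- rho - 0.5 * eps * grad_U(theta)         # final half step for momentum
  list(theta = theta, rho = rho)
}

# example: standard normal target, U(theta) = theta^2 / 2, so grad_U is the identity
leapfrog(theta = 1, rho = 0.5, grad_U = identity, eps = 0.1, L = 20)

The symmetric half steps are what make the scheme reversible and symplectic, which is why the error in the Hamiltonian stays bounded (for stable step sizes) instead of accumulating.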

So much for my getting away from Columbia after I “learned statistics”. I should at least record the half-lecture I do on this topic for Andrew’s stats communication class (the other half of the class I do focuses on writing API specs). I figure it’s communicating with the computer and communicating with users, but at least one student per year walks out in disgust at my stretching the topic so broadly to include this computer sciency stuff.

The current state of the Stan ecosystem in R

(This post is by Jonah)

Last week I posted here about the release of version 2.0.0 of the loo R package, but there have been a few other recent releases and updates worth mentioning. At the end of the post I also include some general thoughts on R package development with Stan and the growing number of Stan users who are releasing their own packages interfacing with rstan or one of our other packages.

Interfaces

rstanarm and brms: Version 2.17.4 of rstanarm and version 2.2.0 of brms were both released to provide compatibility with the new features in loo v2.0.0. Two of the new vignettes for the loo package show how to use it with rstanarm models, and we have also just released a draft of a vignette on how to use loo with brms and rstan for many “non-factorizable” models (i.e., observations not conditionally independent). brms is also now officially supported by the Stan Development Team (welcome Paul!) and there is a new category for it on the Stan Forums.

rstan: The next release of the rstan package (v2.18) is not out yet (we need to get Stan 2.18 out first), but it will include a loo() method for stanfit objects in order to save users a bit of work. Unfortunately, we can’t save you the trouble of having to compute the pointwise log-likelihood in your Stan program though! There will also be some new functions that make it a bit easier to extract HMC/NUTS diagnostics (thanks to a contribution from Martin Modrák).
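In the meantime, the workflow with loo v2.0.0 looks roughly like this, assuming fit is a stanfit object whose generated quantities block defines a pointwise log_lik vector (a sketch of current usage, not the forthcoming loo() method):

library(loo)

# pointwise log-likelihood as an iterations x chains x N array
log_lik <- extract_log_lik(fit, parameter_name = "log_lik",
                           merge_chains = FALSE)

# relative effective sample sizes, used by the PSIS diagnostics
r_eff <- relative_eff(exp(log_lik))

loo1 <- loo(log_lik, r_eff = r_eff)
print(loo1)  # elpd_loo, p_loo, and the Pareto k diagnostic table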

Visualization

bayesplot: A few weeks ago we released version 1.5.0 of the bayesplot package (mc-stan.org/bayesplot), which also integrates nicely with loo 2.0.0. In particular, the diagnostic plots using the leave-one-out cross-validated probability integral transform (LOO-PIT) from our paper Visualization in Bayesian Workflow (preprint on arXiv, code on GitHub) are easier to make with the latest bayesplot release. Also, TJ Mahr continues to improve the bayesplot experience for ggplot2 users by adding (among other things) more functions that return the data used for plotting in a tidy data frame.
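As a hedged sketch of that LOO-PIT workflow (assuming y is the outcome vector, yrep is a draws-by-observations matrix of posterior predictive replications, and log_lik is the pointwise log-likelihood matrix):

library(bayesplot)
library(loo)

psis_fit <- psis(-log_lik)  # importance ratios for LOO are the negative log-likelihoods
lw <- weights(psis_fit)     # smoothed log weights (pass r_eff to psis() in real use)
ppc_loo_pit_overlay(y, yrep, lw = lw)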

shinystan: Unfortunately, there hasn’t been a shinystan (mc-stan.org/shinystan) release in a while because I’ve been busy with all of these other packages, papers, and various other Stan-related things. We’ll try to get out a release with a few bug fixes soon. (If you’re annoyed by the lack of new features in shinystan recently let me know and I will try to convince you to help me solve that problem!)

(Update: I forgot to mention that despite the lack of shinystan releases, we’ve been working on better introductory materials. To that end, Chelsea Muth, Zita Oravecz, and I recently published an article User-friendly Bayesian regression modeling: A tutorial with rstanarm and shinystan (view).)

Other tools

loo: We released version 2.0.0, a major update to the loo package (mc-stan.org/loo). See my previous blog post.

projpred: Version 0.8.0 of the projpred package (mc-stan.org/projpred) for projection predictive variable selection for GLMs was also released shortly after the loo update in order to take advantage of the improvements to the Pareto smoothed importance sampling algorithm. projpred can already be used quite easily with rstanarm models and we are working on improving its compatibility with other packages for fitting Stan models.

rstantools: Unrelated to the loo update, we also released version 1.5.0 of the rstantools package (mc-stan.org/rstantools), which provides functions for setting up R packages interfacing with Stan. The major changes in this release are that usethis::create_package() is now called to set up the package (instead of utils::package.skeleton), fewer manual changes to files are required by users after calling rstan_package_skeleton(), and we have a new vignette walking through the process of setting up a package (thanks Stefan Siegert!). Work is being done to keep improving this process, so be on the lookout for more updates soonish.

Stan related R packages from other developers

There are now well over fifty packages on CRAN that depend in some way on one of our R packages mentioned above! You can find most of them by looking at the “Reverse dependencies” section on the CRAN page for rstan, but that doesn’t count the ones that depend on bayesplot, shinystan, loo, etc., but not rstan.

Unfortunately, given the growing number of these packages, we haven’t been able to look at each one of them in detail. For obvious reasons we prioritize giving feedback to developers who reach out to us directly to ask for comments and to those developers who make an effort to follow our recommendations for developers of R packages interfacing with Stan (included with the rstantools package since its initial release in 2016). If you are developing one of these packages and would like feedback please let us know on the Stan Forums. Our time is limited but we really do make a serious effort to answer every single question asked on the forums (thank you to the many Stan users who also volunteer their time helping on the forums!).

My primary feelings about this trend of developing Stan-based R packages are ones of excitement and gratification. It’s really such an honor to have so many people developing these packages based on all the work we’ve done! There are also a few things I’ve noticed that I hope will change going forward. I’ll wrap up this post by highlighting two of these issues that I hope developers will take seriously:

(1) Unit testing

(2) Naming user-facing functions

The number of these packages that have no unit tests (or very scant testing) is a bit scary. Unit tests won’t catch every possible bug (we have lots of tests for our packages and people still find bugs all the time), but there is really no excuse for not unit testing a package that you want other people to use. If you care enough to do everything required to create your package and get it on CRAN, and if you care about your users, then I think it’s fair to say that you should care enough to write tests for your package. And there’s really no excuse these days with the availability of packages like testthat to make this process easier than it used to be! Can anyone think of a reasonable excuse for not unit testing a package before releasing it to CRAN and expecting people to use it? (Not a rhetorical question. I really am curious given that it seems to be relatively common or at least not uncommon.) I don’t mean to be too negative here. There are also many packages that seem to have strong testing in place! My motivation for bringing up this issue is that it is in the best interest of our users.
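Getting started really is cheap. Here's a minimal, self-contained testthat example, testing a little log_sum_exp utility defined inline (in a real package the tests would live under tests/testthat/):

library(testthat)

log_sum_exp <- function(u) {
  m <- max(u)
  m + log(sum(exp(u - m)))
}

test_that("log_sum_exp matches the naive formula on stable inputs", {
  u <- c(-1, 0, 1)
  expect_equal(log_sum_exp(u), log(sum(exp(u))))
})

test_that("log_sum_exp does not overflow on extreme inputs", {
  expect_equal(log_sum_exp(c(1000, 1000)), 1000 + log(2))
})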

Regarding function naming: this isn’t nearly as big of a deal as unit testing, it’s just something I think developers (including myself) of packages in the Stan R ecosystem can do to make the experience better for our users. rstanarm and brms both import the generic functions included with rstantools in order to be able to define methods with consistent names. For example, whether you fit a model with rstanarm or with brms, you can call log_lik() on the fitted model object to get the pointwise log-likelihood (it’s true that we still have a bit left to do to get the names across rstanarm and brms more standardized, but we’re actively working on it). If you are developing a package that fits models using Stan, we hope you will join us in trying to make it as easy as possible for users to navigate the Stan ecosystem in R.
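Concretely, for a hypothetical package whose fitted-model objects have class "myfit" and carry a stanfit object, defining the method is only a few lines once the generic is imported from rstantools (a sketch; the class and field names are made up):

# in the hypothetical package's R code
#' @importFrom rstantools log_lik
#' @export
log_lik.myfit <- function(object, ...) {
  # return the draws x observations pointwise log-likelihood matrix that
  # the package's Stan program computes in its generated quantities block
  loo::extract_log_lik(object$stanfit, parameter_name = "log_lik")
}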

loo 2.0 is loose

This post is by Jonah and Aki.

We’re happy to announce the release of v2.0.0 of the loo R package for efficient approximate leave-one-out cross-validation (and more). For anyone unfamiliar with the package, the original motivation for its development is in our paper:

Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5), 1413–1432. doi:10.1007/s11222-016-9696-4. (published version, arXiv preprint)

Version 2.0.0 is a major update (release notes) to the package that we’ve been working on for quite some time and in this post we’ll highlight some of the most important improvements. Soon I (Jonah) will follow up with a post about important new developments in our various other R packages.

New interface, vignettes, and more helper functions to make the package easier to use

Because of certain improvements to the algorithms and diagnostics (summarized below), the interfaces, i.e., the loo() and psis() functions and the objects they return, also needed some improvement. (Click on the function names in the previous sentence to see their new documentation pages.) Other related packages in the Stan R ecosystem (e.g., rstanarm, brms, bayesplot, projpred) have also been updated to integrate seamlessly with loo v2.0.0. (Apologies to anyone who happened to install the update during the short window between the loo release and when the compatible rstanarm/brms binaries became available on CRAN.)

Three vignettes now come with the loo package and are also available (and more nicely formatted) online at mc-stan.org/loo/articles:

  • Using the loo package (version >= 2.0.0) (view)
  • Bayesian Stacking and Pseudo-BMA weights using the loo package (view)
  • Writing Stan programs for use with the loo package (view)

A vignette about K-fold cross-validation using new K-fold helper functions will be included in a subsequent update. Since the last release of loo we have also written a paper, Visualization in Bayesian workflow, that includes several visualizations based on computations from loo.

Improvements to the PSIS algorithm, effective sample sizes and MC errors

The approximate leave-one-out cross-validation performed by the loo package depends on Pareto smoothed importance sampling (PSIS). In loo v2.0.0, the PSIS algorithm (psis() function) corresponds to the algorithm in the most recent update to our PSIS paper, including adapting the Pareto fit with respect to the effective sample size and using a weakly informative prior to reduce the variance for small effective sample sizes. (I believe we’ll be updating the paper again with some proofs from new coauthors.)

For users of the loo package for PSIS-LOO cross-validation, and not just the PSIS algorithm for importance sampling, an even more important update is that the latest version of the same PSIS paper referenced above describes how to compute the effective sample size estimate and Monte Carlo error for the PSIS estimate of elpd_loo (expected log predictive density for new data). Thus, in addition to the Pareto k diagnostic (an indicator of convergence rate – see the paper) already available in previous loo versions, we now also report an effective sample size that takes into account both the MCMC efficiency and the importance sampling efficiency. Here’s an example of what the diagnostic output table from loo v2.0.0 looks like (the particular intervals chosen for binning are explained in the papers and also in the package documentation):

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     240   91.6%   205
 (0.5, 0.7]   (ok)         7    2.7%   48
   (0.7, 1]   (bad)        8    3.1%   7
   (1, Inf)   (very bad)   7    2.7%   1

We also compute and report the Monte Carlo SE of elpd_loo to give an estimate of the accuracy. If some k > 1 (which means the PSIS-LOO approximation is not reliable, as in the example above), NA will be reported for the Monte Carlo SE. We hope that showing the relationship between the k diagnostic, effective sample size, and MCSE of elpd_loo will make it easier to interpret the diagnostics than in previous versions of loo that only reported the k diagnostic. This particular example is taken from one of the new vignettes, which uses it as part of a comparison of unstable and stable PSIS-LOO behavior.

Weights for model averaging: Bayesian stacking, pseudo-BMA and pseudo-BMA+

Another major addition is the loo_model_weights() function, which, thanks to the contributions of Yuling Yao, can be used to compute weights for model averaging or selection. loo_model_weights() provides a user-friendly interface to the new stacking_weights() and pseudobma_weights() functions, which are implementations of the methods from Using stacking to average Bayesian predictive distributions (Yao et al., 2018). As shown in the paper, Bayesian stacking (the default for loo_model_weights()) provides better model averaging performance than “Akaike style” weights. However, the loo package does also include Pseudo-BMA weights (PSIS-LOO based “Akaike style” weights) and Pseudo-BMA+ weights, which are similar to Pseudo-BMA weights but use a so-called Bayesian bootstrap procedure to better account for the uncertainties. We recommend the Pseudo-BMA+ method instead of, for example, WAIC weights, although we prefer the stacking method to both. In addition to the Yao et al. paper, the new vignette about computing model weights demonstrates some of the motivation for our preference for stacking when appropriate. A sketch of basic usage follows below.
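Assuming log_lik1 and log_lik2 are pointwise log-likelihood matrices (or arrays) from two candidate models, basic usage looks roughly like this:

library(loo)

lpd_list <- list(log_lik1, log_lik2)

loo_model_weights(lpd_list, method = "stacking")                # the default

loo_model_weights(lpd_list, method = "pseudobma", BB = FALSE)   # Pseudo-BMA
loo_model_weights(lpd_list, method = "pseudobma", BB = TRUE)    # Pseudo-BMA+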

Give it a try

You can install loo v2.0.0 from CRAN with install.packages("loo"). Additionally, reinstalling an interface that provides loo functionality (e.g., rstanarm, brms) will automatically update your loo installation. The loo website with online documentation is mc-stan.org/loo and you can report a bug or request a feature on GitHub.

Three new domain-specific (embedded) languages with a Stan backend

One is an accident. Two is a coincidence. Three is a pattern.

Perhaps it’s no coincidence that there are three new interfaces that use Stan’s C++ implementation of adaptive Hamiltonian Monte Carlo (currently an updated version of the no-U-turn sampler).

  • ScalaStan embeds a Stan-like language in Scala. It’s a Scala package largely (if not entirely) written by Joe Wingbermuehle.
    [GitHub link]

  • tmbstan lets you fit TMB models with Stan. It’s an R package listing Kasper Kristensen as author.
    [CRAN link]

  • SlicStan is a “blockless” and self-optimizing version of Stan. It’s a standalone language coded in F# written by Maria Gorinova.
    [pdf language spec]

These are in contrast with systems that entirely reimplement a version of the no-U-turn sampler, such as PyMC3, ADMB, and NONMEM.

StanCon is next week, Jan 10-12, 2018

It looks pretty cool!

Wednesday, Jan 10

Invited Talk: Predictive information criteria in hierarchical Bayesian models for clustered data. Sophia Rabe-Hesketh and Daniel Furr (U California, Berkeley) 10:40-11:30am

Does the New York City Police Department rely on quotas? Jonathan Auerbach (Columbia U) 11:30-11:50am

Bayesian estimation of mechanical elastic constants. Ben Bales, Brent Goodlet, Tresa Pollock, Linda Petzold (UC Santa Barbara) 11:50am-12:10pm

Joint longitudinal and time-to-event models via Stan. Sam Brilleman, Michael Crowther, Margarita Moreno-Betancur, Jacqueline Buros Novik, Rory Wolfe (Monash U, Columbia U) 12:10-12:30pm

Lunch 12:30-2:00pm

ScalaStan. Joe Wingbermuehle (Cibo Technologies) 2:00-2:20pm

A tutorial on Hidden Markov Models using Stan. Luis Damiano, Brian Peterson, Michael Weylandt 2:20-2:40pm

Student Ornstein-Uhlenbeck models served three ways (with applications for population dynamics data). Aaron Goodman (Stanford U) 2:40-3:00pm

SlicStan: a blockless Stan-like language. Maria I. Gorinova, Andrew D. Gordon, Charles Sutton (U of Edinburgh) 3:00-3:20pm

Break 3:20-4:00pm

Invited Talk: Talia Weiss (MIT) 4:00-4:50pm

Thursday, Jan 11

Invited Talk: Sean Taylor and Ben Letham (Facebook) 10:40-11:30am

NPCompare: a package for nonparametric density estimation and two populations comparison built on top of PyStan. Marco Inacio (U of São Paulo/UFSCar) 11:30-11:50am

Introducing idealstan, an R package for ideal point modeling with Stan. Robert Kubinec (U of Virginia) 11:50am-12:10pm

A brief history of Stan. Daniel Lee (Generable) 12:10-12:30pm

Lunch 12:30-1:30pm

Computing steady states with Stan’s nonlinear algebraic solver. Charles C. Margossian (Metrum, Columbia U) 1:30-1:50pm

Flexible modeling of Alzheimer’s disease progression with I-Splines. Arya A. Pourzanjani, Benjamin B. Bales, Linda R. Petzold, Michael Harrington (UC Santa Barbara) 1:50-2:10pm

Intrinsic Auto-Regressive (ICAR) Models for Spatial Data. Mitzi Morris (Columbia U) 2:10-2:30pm

Modeling/Data Session + Classes 2:30-4:10pm

Open session for consultations on modeling and data problems with Stan developers and modelers. 2:30-4:10pm

Session 3 of Intro to Stan 2:30-4:10pm

2:30-3:30pm Have I converged successfully? How to verify fit and diagnose fit problems, Bob Carpenter

What’s new in Stan 3:30-4:10pm

Invited Talk: Manuel Rivas (Stanford U) 4:00-4:50pm

Friday, Jan 12

Invited Talk: Susan Holmes (Stanford U) 10:40-11:30am

Aggregate random coefficients logit — a generative approach. Jim Savage, Shoshana Vasserman 11:30-11:50am

The threshold test: Testing for racial bias in vehicle searches by police. Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, Emma Pierson (Stanford U) 11:50am-12:10pm

Assessing the safety of Rosiglitazone for the treatment of type II diabetes. Konstantinos Vamvourellis, K. Kalogeropoulos, L. Phillips (London School of Economics and Political Science) 12:10-12:30pm

Lunch 12:30-1:30pm

Causal inference with the g-formula in Stan. Leah Comment (Harvard U) 1:30-1:50pm

Bayesian estimation of ETAS models with Rstan. Fausto Fabian Crespo Fernandez (Universidad San Francisco de Quito) 1:50-2:10pm

Invited Talk: Andrew Gelman 2:10-3:00 (Columbia U) (virtual)

Classes/Tutorials

We have tutorials that start at the crack of 8am for those desiring further edification beyond the program—these do not run in parallel to the main session but do run parallel to each other:

Introduction to Stan: Know how to program? Know basic statistics? Curious about Bayesian analysis and Stan? This is the course for you. Hands-on, focused, and an excellent way to get started working in Stan. Two hours every day, 6 hours total. Jonah Sol Gabry.

Executive decision making the Bayesian way: This is for non-technical managers and technical folks who need to communicate with managers to learn the core of decision making under uncertainty. One hour every day. Jonathan Auerbach, Breck Baldwin, Eric Novik.

Advanced Hierarchical Models in Stan: The hard stuff. Very interactive, very intense. Topics vary by day. Ben Goodrich.

Model assessment, model selection and inference after model selection. Aki Vehtari.

How to develop for Stan at the C++ level: Overview of Stan C++ architecture and build/development process for contributors. Charles Christopher Margossian.

[edited by Aki: added Model selection tutorial]

Stan Roundup, 10 November 2017

We’re in the heart of the academic season and there’s a lot going on.

  • James Ramsey reported a critical performance regression bug in Stan 2.17 (this affects the latest CmdStan and PyStan, not the latest RStan). Sean Talts and Daniel Lee diagnosed the underlying problem as being with the change from char* to std::string arguments—you can’t pass a char* and rely on the implicit std::string constructor without the penalty of memory allocation and copying. The fix reverts to how things were before, with const char* arguments. Ben Goodrich is working with Sean Talts to cherry-pick into Stan the fix for the performance regression that made the 2.17 release very slow for the other interfaces. RStan 2.17 should be out soon, and it will be the last pre-C++11 release. We’ve already opened the C++11 floodgates on our development branches (yoo-hoo!).

  • Quentin F. Gronau, Henrik Singmann, and E. J. Wagenmakers released the bridgesampling package in R. Check out the arXiv paper. It runs with output from Stan and JAGS.

  • Andrew Gelman and Bob Carpenter’s proposal was approved by Coursera for a four-course introductory concentration on Bayesian statistics with Stan: 1. Bayesian Data Analysis (Andrew), 2. Markov Chain Monte Carlo (Bob), 3. Stan (Bob), 4. Multilevel Regression (Andrew). The plan is to finish the first two by late spring and the second two by the end of the summer in time for Fall 2018. Advait Rajagopal, an economics Ph.D. student at the New School, is going to be leading the exercise writing, managing the Coursera platform, and will also TA the first few iterations. We’ve left open the option for us or others to add a prequel and sequel, 0. Probability Theory and 5. Advanced Modeling in Stan.

  • Dan Simpson is in town and dropped a casual hint that order statistics would clean up the discretization and binning issues that Sean Talts and crew were having with the simulation-based algorithm testing framework (aka the Cook-Gelman-Rubin diagnostics). Lo and behold, it works. Michael Betancourt worked through all the math on our (chalk!) board and I think they are now ready to proceed with the paper and recommendations for coding in Stan. As I’ve commented before, one of my favorite parts of working on Stan is watching the progress on this kind of thing from the next desk.

  • Michael Betancourt tweeted about using Andrei Kashcha’s javascript-based vector field visualization tool for visualizing Hamiltonian trajectories and, with multiple trajectories, the Hamiltonian flow. Richard McElreath provides a link to visualizations of the fields for light, normal, and heavy-tailed distributions. The Cauchy’s particularly hypnotic, especially with many fewer particles and velocity highlighting.

  • Krzysztof Sakrejda finished the fixes for standalone function generation in C++. This lets you generate a double- and int-only version of a Stan function for inclusion in R (or elsewhere). This will go into RStan 2.18.

  • Sebastian Weber reports that the Annals of Applied Statistics paper, Bayesian aggregation of average data: An application in drug development, was finally formally accepted after two years in process. I think Michael Betancourt, Aki Vehtari, Daniel Lee, and Andrew Gelman are co-authors.

  • Aki Vehtari posted a case study for review on extreme-value analysis and user-defined functions in Stan [forum link — please comment there].

  • Aki Vehtari, Andrew Gelman, and Jonah Gabry have made a major revision of the Pareto smoothed importance sampling paper, with an improved algorithm, new Monte Carlo error and convergence rate results, and new experiments with varying sample sizes and different functions. The next loo package release will use the new version.

  • Bob Carpenter (it’s weird writing about myself in the third person) posted a case study for review on Lotka-Volterra predator-prey population dynamics [forum link — please comment there].

  • Sebastian and Sean Talts led us through the MPI design decisions about whether to go with our own MPI map-reduce abstraction or just build the parallel map function we’re going to implement in the Stan language. Pending further review from someone with more MPI experience, the plan’s to implement the function directly, then worry about generalizing when we have more than one function to implement.

  • Matt Hoffman (inventor of the original NUTS algorithm and co-founder of Stan) dropped in on the Stan meeting this week and let us know he’s got an upcoming paper generalizing Hamiltonian Monte Carlo sampling and that his team at Google’s working on probabilistic modeling for Tensorflow.

  • Mitzi Morris, Ben Goodrich, Sean Talts and I sat down and hammered out the services spec for running the generated quantities block of a Stan program over the draws from a previous sample. This will decouple the model fitting process from the posterior predictive inference process (because the generated quantities block generates a ỹ according to p(ỹ | θ), where ỹ is a vector of predictive quantities and θ is the vector of model parameters). Mitzi then finished the coding and testing and it should be merged soon. She and Ben Bales are working on getting it into CmdStan and Ben Goodrich doesn’t think it’ll be hard to add to RStan.

  • Mitzi Morris extended the spatial case study with leave-one-out cross-validation and WAIC comparisons of the simple Poisson model, a heterogeneous random effects model, a spatial random effects model, and a combined heterogeneous and spatial model with two different prior configurations. I’m not sure if she posted the updated version yet (no, because Aki is also in town and suggested checking the Pareto k-hats, which said no).

  • Sean Talts split out some of the longer tests for less frequent application to get distribution testing time down to 1.5 hours to improve flow of pull requests.

  • Sean Talts is taking another one for the team by leading the charge to auto-format the C++ code base and then proceed with pre-commit autoformat hooks. I think we’re almost there after a spirited discussion of readability and our ability to assess it.

  • Sean Talts also added precompiled headers to our unit and integration tests. This is a worthwhile speedup when running lots of tests and part of the order of magnitude speedup Sean’s eked out.

ps. some edits made by Aki

Stan Roundup, 27 October 2017

I missed two weeks and haven’t had time to create a dedicated blog for Stan yet, so we’re still here. This is only the update for this week. From now on, I’m going to try to concentrate on things that are done, not just in progress, so you can get a better feel for the pace of things getting done.

Not one, but two new devs!

This is my favorite news to post, hence the exclamation.

  • Matthijs Vákár from University of Oxford joined the dev team. Matthijs’s first major commit is a set of GLM functions for negative binomial with log link (2–6 times speedup), normal linear regression with identity link (4–5 times), Poisson with log link (factor of 7), and Bernoulli with logit link (9 times). Wow! And he didn’t just implement the straight-line case—this is a fully vectorized implementation as a density, so we’ll be able to use them this way:
    int y[N];                        // observations
    matrix[N, K] x;                  // "data" matrix
    vector[K] beta;                  // slope coefficients
    real alpha;                      // intercept coefficient

    y ~ bernoulli_logit_glm(x, beta, alpha);
    

    These stand in for what is now written as

    y ~ bernoulli_logit(x * beta + alpha);
    

    and before that was written

    y ~ bernoulli(inv_logit(x * beta + alpha)); 
    

    Matthijs also successfully defended his Ph.D. thesis—welcome to the union, Dr. Vákár.

  • Andrew Johnson from Curtin University also joined. In his first bold move, he literally refactored the entire math test suite to bring it up to cpplint standard. He’s also been patching doc and other issues.

Visitors

  • Kentaro Matsuura, author of Bayesian Statistical Modeling Using Stan and R (in Japanese), visited and we talked about what he’s been working on and how we’ll handle the syntax for tuples in Stan.

  • Shuonan Chen visited the Stan meeting, then met with Michael (and me a little bit) to talk bioinformatics—specifically about single-cell PCR data and modeling covariance due to pathways. She had a well-annotated copy of Kentaro’s book!

Other news

  • Bill Gillespie presented a Stan and Torsten tutorial at ACoP.

  • Charles Margossian had a poster at ACoP on mixed solving (analytic solutions with forcing functions); his StanCon submission on steady state solutions with the algebraic solver was accepted.

  • Krzysztof Sakrejda nailed down the last bit of the standalone function compilation, so we should be rid of regexp based C++ generation in RStan 2.17 (coming soon).

  • Ben Goodrich has been cleaning up a bunch of edge cases in the math lib (hard things like the Bessel functions) and also added a chol2inv() function that inverts the matrix corresponding to a Cholesky factor (naming from LAPACK under review—suggestions welcome).

  • Bob Carpenter and Mitzi Morris taught a one-day Stan class in Halifax at Dalhousie University. Lots of fun seeing Stan users show up! Mike Lawrence, of Stan tutorial YouTube fame, helped people with debugging and installs—nice to finally meet him in person.

  • Ben Bales got the metric initialization into CmdStan, so we’ll finally be able to restart sampling (the metric used to be called the mass matrix—it’s just the inverse of a regularized estimate of the global posterior covariance during warmup).

  • Michael Betancourt just returned from SACNAS (diversity in STEM conference attended by thousands).

  • Michael also revised his history of MCMC paper, which has been conditionally accepted for publication. Read it on arXiv first.

  • Aki Vehtari was awarded funding for a two-year postdoc position on a joint project on Stan algorithms and models, co-supervised with Andrew Gelman; it’ll also be joint between Helsinki and New York. Sounds like fun!

  • Breck Baldwin and Sean Talts headed out to Austin for the NumFOCUS summit, where they spent two intensive days talking largely about project governance and sustainability.

  • Imad Ali is leaving Columbia to work for the NBA league office (he’ll still be in NYC) as a statistical analyst! That’s one way to get access to the data!

Halifax, NS, Stan talk and course Thu 19 Oct

Halifax, here we come.

I (Bob, not Andrew) am going to be giving a talk on Stan and then Mitzi and I will be teaching a course on Stan after that. The public is invited, though space is limited for the course. Here are details if you happen to be in the Maritime provinces.

TALK: Stan: A Probabilistic Programming Language for Bayesian Inference

Date: Thursday October 19, 2017

Time: 10am

Location: Slonim Conference room (#430), Goldberg Computer Science Building, Dalhousie University, 6050 University Avenue, Halifax

Abstract

I’ll describe Stan’s probabilistic programming language, and how it’s used, including

  • blocks for data, parameters, and predictive quantities
  • transforms of constrained parameters to unconstrained spaces, with automatic Jacobian corrections
  • automatic computation of first- and higher-order derivatives
  • operator, function, and linear algebra library
  • vectorized density functions, cumulative distributions, and random number generators
  • user-defined functions
  • (stiff) ordinary differential equation solvers

I’ll also provide an overview of the underlying algorithms for full Bayesian inference and for maximum likelihood estimation:

  • adaptive Hamiltonian Monte Carlo for MCMC
  • L-BFGS optimization and transforms for MLE

I’ll also briefly describe the user-facing interfaces: RStan (R), PyStan (Python), CmdStan (command line), Stan.jl (Julia), and MatlabStan (MATLAB).

I’ll finish with an overview of what’s on the immediate horizon:

  • GPU matrix operations
  • MPI multi-core, multi-machine parallelism
  • data parallel expectation propagation for approximate Bayes
  • marginal Laplace approximations

TUTORIAL: Introduction to Bayesian Modeling and Inference with RStan

Instructors:

  • Bob Carpenter, Columbia University
  • Mitzi Morris, Columbia University

Date: Thursday October 19, 2017

Time: 11:30am-5:30pm (following the seminar on Stan at 10am)

Location: Slonim Conference room (#430)
Goldberg Computer Science Building
Dalhousie University
6050 University Avenue, Halifax

Registration: EventBrite Registration Page

Description:

This short course will provide

  • an introduction to Bayesian modeling
  • an introduction to Monte Carlo methods for Bayesian inference
  • an overview of the probabilistic programming language Stan

Stan provides a language for coding Bayesian models along with state-of-the-art inference algorithms based on gradients. There will be an overview of how Stan works, but the main focus will be on the RStan interface and building applied models.

The afternoon will be devoted to a case study of hierarchical modeling, the workhorse of applied Bayesian statistics. We will show how hierarchical models pool estimates toward the population means based on population variance and how this automatically estimates regularization and adjusts for multiple comparisons. The focus will be on probabilistic inference, and in particular on testing posterior predictive calibration and the sharpness of predictions.

Installing RStan: Please show up with RStan installed. Instructions are linked from here:

Warning: follow the instructions step-by-step; even though installation involves a CRAN package, it’s more complex than just installing from RStudio because a C++ toolchain is required at runtime.

If you run into trouble, please ask for help on our forums—they’re very friendly:

Full Day Schedule

10:00-11:00am Open Seminar – Introduction to the “Stan” System

11:00-11:30am Break

11:30am-1:00pm Tutorial part 1

1:00pm -2:00pm Lunch Break

2:00pm -3:30pm Tutorial part 2

3:30pm -3:45pm Break

3:45pm -5:30pm Tutorial part 3

Stan Roundup, 6 October 2017

I missed last week and almost forgot to add this week’s.

  • Jonah Gabry returned from teaching a one-week course for a special EU research institute in Spain.

  • Mitzi Morris has been knocking out bug fixes for the parser and some pull requests to refactor the underlying type inference to clear the way for tuples, sparse matrices, and higher-order functions.

  • Michael Betancourt with help from Sean Talts spent last week teaching an intro course to physicists about Stan. Charles Margossian attended and said it went really well.

  • Ben Goodrich, in addition to handling a slew of RStan issues, has been diving into the math library to define derivatives for Bessel functions.

  • Aki Vehtari has put us in touch with the MxNet developers at Amazon UK and Berlin and we had our first conference call with them to talk about adding sparse matrix functionality to Stan (Neil Lawrence is working there now).

  • Aki is also working on revising the EP as a way of life paper and finalizing other Stan-related papers.

  • Bob Carpenter and Andrew Gelman have recruited Advait Rajagopal to help us with the Coursera specialization we’re going to offer (contingent on coming to an agreement with Columbia). The plan’s to have four courses: Intro to BDA (Andrew), Stan (Bob), MCMC (Bob), and Regression and Other Stories (Andrew).

  • Ben Bales finished the revised pull request for vectorized RNGs. Turns out these things are much easier to write than they are to test thoroughly. Pesky problems with instantiations by integers and whatnot turn up.

  • Daniel Lee is getting ready for ACoP, which Bill Gillespie and Charles Margossian will also be presenting at.

  • Steven Bronder and Rok Češnovar, with some help from Daniel Lee, are going to merge the ViennaCL library for GPU matrix ops with their own specializations for derivatives in Stan into the math library. This is getting close to being real for users.

  • Sean Talts, when he wasn’t teaching or learning physics, has been refactoring the Jenkins test facilities. As our tests get bigger and we get more developers, it’s getting harder and harder to maintain stable continuous integration testing.

  • Breck Baldwin is taking over dealing with StanCon. Our goal is to get up to 150 registrations.

  • Breck Baldwin has also been working with Andrew Gelman and Jonathan Auerbach on non-conventional statistics training (like at Maker Fairs)—they have the beginnings of a paper. Breck’s highly recommending the math museum in NY to see how this kind of thing’s done.

  • Bob Carpenter published a Wiki page on a Stan 3 model concept, which is probably what we’ll be going with going forward. It’s pretty much like what we have now with better const correctness and some better organized utility functions.

  • Imad Ali went to the New England Sports Stats conference. Expect to see more models of basketball using Stan soon.

  • Ben Goodrich fixed the problem with exception handling in RStan on some platforms (always a pain because it happened on Macs and he’s not a Mac user).

  • Advait Rajagopal has been working with Imad Ali on adding ARMA and ARIMA time-series functions to rstanarm.

  • Aki Vehtari is working to enhance the loo package with automated code for K-fold cross validation for (g)lmer models.

  • Lizzie Wolkovich visited us for a meeting (she’s on our NumFOCUS leadership body), where she reported that she and a postdoc have been working on calibrating Stan models for phenology (look it up).

  • Krzysztof Sakrejda has been working on proper standalone function generation for Rcpp. Turns out to be tricky with their namespace requirements, but I think we have it sorted out as of today.

  • Michael Andreae has kicked off his meta-analysis and graphics project at Penn State with Jonah Gabry and Ben Goodrich chipping in.

  • Ben Goodrich also fixed the infrastructure for RStan so that multiple models may be supported more easily, which should make it much easier for R package writers to incorporate Stan models.

  • Yuling Yao gave us the rundown on where ADVI testing stands. It may falsely report convergence when it’s not at a maximum, it may converge to a local minimum, or it may converge but the Gaussian approximation may be terrible, either in terms of the posterior means or the variances. He and Andrew Gelman are looking at using Pareto smoothed importance sampling (a la the loo package) to try to sort out the quality of the approximation. Yuling thinks convergence is mostly scaling issues and preconditioning along with natural gradients may solve the problem. It’s nice to see grad students sink their teeth into a problem! It’d be great if we could come up with a more robust ADVI implementation that had diagnostic warnings if the approximation wasn’t reliable.

Will Stanton hit 61 home runs this season?

[edit: Juho Kokkala corrected my homework. Thanks! I updated the post. Also see some further elaboration in my reply to Andrew’s comment. As Andrew likes to say …]

So far, Giancarlo Stanton has hit 56 home runs in 555 at bats over 149 games. Miami has 10 games left to play. What’s the chance he’ll hit 61 or more home runs? Let’s make a simple back-of-the-envelope Bayesian model and see what the posterior event probability estimate is.

Sampling notation

A simple model that assumes a home run rate per at bat with a uniform (conjugate) prior:

$latex \theta \sim \mbox{Beta}(1, 1)$

The data we’ve seen so far is 56 home runs in 555 at bats, so that gives us our likelihood.

$latex 56 \sim \mbox{Binomial}(555, \theta)$

Now we need to simulate the rest of the season and compute event probabilities. We start by assuming the at-bats in the rest of the season is Poisson.

$latex \mathit{ab} \sim \mbox{Poisson}(10 \times 555 / 149)$

We then take the number of home runs to be binomial given the number of at bats and the home run rate.

$latex h \sim \mbox{Binomial}(\mathit{ab}, \theta)$

Finally, we define an indicator variable that takes the value 1 if the total number of home runs is 61 or greater and the value of 0 otherwise.

$latex \mbox{gte61} = \mbox{I}[h \geq (61 - 56)]$

Event probability

The probability Stanton hits 61 or more home runs (conditioned on our model and his performance so far) is then the posterior expectation of that indicator variable,

$latex \displaystyle \mbox{Pr}[h \geq 61 - 56] = \int_{\theta} \sum_{ab} \sum_{h} \mathrm{I}[h \geq 61 - 56] \times \mbox{Binomial}(h \mid ab, \theta) \times \mbox{Poisson}(ab \mid 10 \times 555 / 149) \times \mbox{Beta}(\theta \mid 1 + 56, 1 + 555 - 56) \, \mathrm{d}\theta.$

Computation in R

The posterior for $latex \theta$ is analytic because the prior is conjugate, letting us simulate the posterior chance of success given the observed successes (56) and number of trials (555). The number of at bats is independent and also easy to simulate with a Poisson random number generator. We then simulate the number of home runs in the remaining at bats as a binomial variate, and finally, we compare it to the number needed and report the fraction of simulations in which the simulated number of home runs puts Stanton at 61 or more:

> sum(rbinom(1e5,
             rpois(1e5, 10 * 555 / 149),
             rbeta(1e5, 1 + 56, 1 + 555 - 56))
       >= (61 - 56)) / 1e5
[1] 0.34

That is, I’d give Stanton about a 34% chance conditioned on all of our assumptions and what he’s done so far.

Disclaimer

The above is intended for recreational use only and is not intended for serious bookmaking.

Exercise

You guessed it—code this up in Stan. You can do it for any batter, any number of games left, etc. It really works for any old statistics. It’d be even better hierarchically with more players (that’ll pull the estimate for $latex \theta$ down toward the league average). Finally, the event probability can be done with an indicator variable in the generated quantities block.

The basic expression looks like you need discrete random variables, but we only need them here for the posterior calculation in generated quantities. So we can use Stan’s random number generator functions to do the posterior simulations right in Stan.

Stan Weekly Roundup, 25 August 2017

This week, the entire Columbia portion of the Stan team is out of the office and we didn’t have an in-person/online meeting this Thursday. Mitzi and I are on vacation, and everyone else is either teaching, TA-ing, or attending the Stan course. Luckily for this report, there’s been some great activity out of the meeting even if I don’t have a report of what everyone around Columbia has been up to. If a picture’s really worth a thousand words, this is the longest report yet.

  • Ari Hartikainen has produced some absolutely beautiful parallel coordinate plots of HMC divergences* for multiple parameters. The divergent transitions are shown in green and the lines connect a single draw. The top plot is unnormalized, whereas the bottom scales all parameters to a [0, 1] range.


    [Figure: Ari's parallel coordinates plot of divergences]

    You can follow the ongoing discussion on the forum thread. There are some further plots for larger models and some comparisons with the pairs plots that Michael Betancourt has been recommending for the same purpose (the problem with pairs is that it’s very very slow, at least in RStan, because it has to draw quadratically many plots).

  • Sebastian Weber has a complete working prototype of the MPI (multi-core parallelization) in place and has some beautiful results to report. The first graph is the speedup he achieved on a 20-core server (all in one box with shared memory):


    [Figure: Sebastian's MPI speedup plot]

    The second graph shows what happens when the problem size grows (those bottom numbers on the x-axis are the number of ODE systems being solved, whereas the top number remains the number of cores used).


    [Figure: Sebastian's weak scaling plot]

    As with Ari’s plots, you can follow the ongoing discussion on the forum thread. And if you know something about MPI, you can even help out. Sebastian’s been asking if anyone who knows MPI would like to check his work—he’s learning it as he goes (and doing a bang-up job of it, I might add!).

 

These lists are incomplete

After doing a handful of these reports, I’m sorry to say you’re only seeing a very biased selection of activity around Stan. For the full story, I’d encourage you to jump onto our forums or GitHub (warning: very high traffic, even if you focus).


 * Divergences in Stan arise when the Hamiltonian, which should be conserved along a trajectory, diverges. It’s basically a numerical simulation problem: if we could perfectly follow the Hamiltonian through complex geometries, there wouldn’t be any divergences. This is a great diagnostic mechanism to signal that something’s going wrong and resulting estimates might be biased. It may seem to make HMC more fragile, but the problem is that Gibbs and Metropolis will fail silently in a lot of these situations (though BUGS will often help you out of numerical issues by crashing).

Stan Weekly Roundup, 11 August 2017

This week, more Stan!

  • Charles Margossian is rock star of the week, finishing off the algebraic solver in the math library and getting it all plumbed through Stan and documented. Now you can solve nonlinear systems of equations and get derivatives with the implicit function theorem, all as part of defining your log density. There is a chapter in the revised manual.

  • The entire Stan Development Team, spearheaded by Ben Goodrich needing fixes for RStan, is about to roll out Stan 2.17 along with the interfaces. Look for that to begin trickling out next week. This will fix some further install and error message reporting issues as well as include the algebraic solver. We are also working on moving things toward Stan 3 behind the scenes. We won’t just keep incrementing 2.x forever!

  • Ben Goodrich fixed the inlining declarations in C++ to allow multiple Stan models to be linked or built in a single translation unit. This will be part of the 2.17 release.

  • Sean Talts is still working on build issues after Travis changed some config and compilers changed everywhere disrupting our continuous integration.

  • Sean is working with Michael Betancourt on the Cook-Gelman-Rubin diagnostic, and they have gotten to the bottom of the quantization errors (using only 1000 posterior draws and splitting into too many bins).

  • Imad Ali is looking ahead to spatio-temporal models as he wraps up the spatial models in RStanArm.

  • Yuling Yao and Aki Vehtari finished the stacking paper (like model averaging), adding references, etc.

  • Yuling has also made some updates to the loo package that are coming soon.

  • Andrew Gelman wrote papers (with me and Michael) about R-hat and effective sample size and a paper on how priors can only be understood in the context of the likelihood.

  • Jonah Gabry, Ben and Imad have been working on the paper for priors based on R^2

  • Andrew, Breck Baldwin and I are trying to put together some educational proposals for NSF to teach intro stats, and we’re writing more grants now that it’s grant season. All kinds of interesting things at NSF and NIH with spatial modeling.

  • Jonah Gabry continues the ggplot-ification of the new Gelman and Hill book.

  • Ben Goodrich has been working on multivariate effective sample size calculations.

  • Breck Baldwin has been working with Michael on courses before Stan (and elsewhere).

  • Jonah Gabry has been patching all the pkgdown references for our doc and generally cleaning things up with cross-references and organization.

  • Mitzi Morris finished adding a data qualifier to function arguments to allow propagation of data-only-ness as required by our ODE functions and now the algebraic solver.

  • Dootika Vats was visiting Michael; she’s been working on multivariate effective sample sizes, which considers how bad things can be with linear combinations of parameters; scaling is quadratic, so there are practical issues. She has previously worked on efficient and robust effective sample size calculations which look promising for inclusion in Stan.

  • Bob Carpenter has been working on exposing properties of variables in a Stan program in the C++ model at runtime and statically.

Stan Weekly Roundup, 3 August 2017

You’d almost think we were Europeans based on how much we’ve slowed down over the summer.

  • Imad Ali, Jonah Gabry, and Ben Goodrich finished the online pkgdown-style documentation for all the Stan Development Team supported R packages. They can be accessed via http://mc-stan.org/(package_name), e.g.,

    The Stan manual will also get converted as soon as we can get to it.

  • Ben Goodrich added the nonlinear inverse link functions with Gaussian outcomes, following the “self-starting” functions of lme4’s nlmer.

  • Rob Trangucci has added further GP doc to the manual and is working on multinomial logit to RStanArm.

  • Jonah Gabry released ShinyStan 2.4, and a new bayesplot release is on the way—they’re more flexible about ggplot2 theming. There will also be RStanArm releases coordinated with the next edition of Gelman and Hill, bringing it up to date with modern R.

  • Breck Baldwin has been trying to wrangle governance issues.

  • Imad Ali is working on some basketball models and waiting for NBA data. Also supervising our high school student and working on the nonlinear models for RStanArm.

  • Aki Vehtari is continuing work on Pareto smoothed importance sampling with Jonah. StanCon Helsinki planning is underway; still waiting on a date.

  • Ben Bales rewrote append arrays and the initial RNG vectorization.

  • Bill Gillespie has been learning C++ and software development with Charles’s help. His first pending pull request adds a linear interpolation function like the one in BUGS.

  • Charles Margossian finished the Torsten 0.8.3 release (that’s Metrum’s pharmacometrics package wrapping RStan).

  • Charles also finished the pull request for the algebraic solver and it’s passed code review, so it should land in the math lib soon.

  • Charles is also writing some docs on how to start programming with Stan, based on what he’s been learning teaching Bill to write C++.

  • Charles and Bill are also adapting the mixed solver for a PK/PD journal.

  • Michael Betancourt wrote a case study about QR decomposition that’s up on the web site. He’s since been at JSM talking about Stan, where there were lots of posters citing Stan. He gave away a lot of stickers through the Metrum booth.

  • Michael, Aki, and Rob Trangucci have been working on GPs and Michael has a case study in the works.

  • Michael also made a GP movie tweet that’s gotten a ton of positive feedback on Twitter (along with the spatial, methodology, and QR decomposition case studies).

  • Andrew Gelman wrote up a draft of an R-hat and ESS calculation paper with me and Michael.

  • Mitzi Morris launched the spatial model case studies with a fit for the intrinsic conditional autoregressive (ICAR) model, with some neat parameterizations by Dan Simpson. She’s also got the Cook-Gelman-Rubin validation in the wings.

  • Mitzi is also adding a data specification for variables that will let us write functions that only apply to data.

  • Sean Talts and Daniel Lee have been hammering away at the C++ builds across all our repos to allow better conditional compilation of optional external libs like CVODES (for ODE solving), MPI (process parallelism), and OpenCL (GPUs).

  • Sean and Michael have also been finding anomalies in their Cook-Gelman-Rubin stats (as has Mitzi) when the number of replications is cranked up to the thousands.

Stan Weekly Roundup, 28 July 2017

Here’s the roundup for this past week.

  • Michael Betancourt added case studies for methodology in both Python and R, based on the work he did getting the ML meetup together.

  • Michael Betancourt, along with Mitzi Morris, Sean Talts, and Jonah Gabry, taught the Women in Machine Learning workshop at Viacom in NYC; 60 attendees worked their way up from simple linear regression through Poisson regression to GPs.

  • Ben Goodrich has been working on new R^2 analyses and priors, as well as the usual maintenance on RStan and RStanArm.

  • Aki Vehtari was at the summer school in Valencia teaching Stan.

  • Aki has also been kicking off planning for StanCon in Helsinki 2019. Can’t believe we’re planning that far ahead!

  • Sebastian Weber was in Helsinki giving a talk on Stan, but there weren’t many Bayesians there to get excited about Stan; he’s otherwise been working with Aki on variable selection.

  • Imad Ali is finishing up the spatial models in RStanArm and moving on to new classes of models (we all know his goal is to model basketball, which is a very spatially continuous game!).

  • Ben Bales has been working on generic append-array functions and vectorizing random number generators. We learned his day job is teaching robotics with Lego to mechanical engineering students!

  • Charles Margossian is finishing up the algebraic solvers (very involved autodiff issues there, as with the ODE solvers) and wrapping up a final release of Torsten before he moves to Columbia to start the Ph.D. program in stats. He’s also writing the mixed solver paper with feedback from Michael Betancourt and Bill Gillespie.

  • Mitzi Morris added runtime warning messages for problems arising in declarations, which inadvertently fixed another bug arising for declarations with sizes for which constraints couldn’t be satisfied (as in size zero simplexes).

  • Miguel A. Martinez Beneito, along with Mitzi Morris and Dan Simpson, with input from Michael Betancourt and Andrew Gelman, now have spatial models with matching results across GeoBUGS, INLA, and Stan. They also worked out better priors, so Stan is now competitive in fitting; it turns out the sum-to-zero constraint on the spatial random effects hurt the posterior geometry more than it helped identifiability (a sketch of the sum-to-zero trick appears at the end of this list).

  • Michael Andreae resubmitted papers with Ben Goodrich and Jonah Gabry and is working on some funding prospects.

  • Sean Talts (with help from Daniel Lee) has most of the C++11/C++14 dev ops in place so we’ll be able to start using all those cool toys.

  • Sean Talts and Michael Betancourt, with some help from Mitzi Morris, have been doing large-scale Cook-Gelman-Rubin evaluations for simple and hierarchical models and finding some surprising results (being discussed on Discourse). My money’s on them getting to the bottom of what’s going on soon; Dan Simpson’s jumping in to help out on diagnostics in the same Discourse thread.

  • Aki Vehtari reports that Amazon UK (with Neil Lawrence and crew) are using Stan, so we expect to see some more GP activity at some point.

  • We spent a long time discussing how to solve the multiple translation unit problems. It looks at first glance like Eigen just inlines every function and that may also work for us (if a function is declared inline, it may be defined in multiple translation units).

  • Solène Desmée, along with France Mentré and others, has been fitting time-to-event models in Stan and has a new open-access publication, Nonlinear joint models for individual dynamic prediction of risk of death using Hamiltonian Monte Carlo: application to metastatic prostate cancer. You may remember France as the host of last year’s PK/PD Stan conference in Paris.
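
About that sum-to-zero constraint mentioned above: for the curious, here’s roughly the trick at issue, written as an RStan model string. This is a sketch reconstructed from the ICAR case study, not the exact code; node1 and node2 list the edges of the adjacency graph, and the data setup is omitted.

library(rstan)

# ICAR prior in pairwise-difference form, with a soft sum-to-zero
# constraint on the spatial effects phi. The soft version is gentler
# on the posterior geometry than a hard constraint.
icar_model <- "
data {
  int<lower=1> N;                          // number of regions
  int<lower=1> N_edges;                    // number of adjacency pairs
  int<lower=1, upper=N> node1[N_edges];    // edge i joins node1[i]...
  int<lower=1, upper=N> node2[N_edges];    // ...and node2[i]
}
parameters {
  vector[N] phi;                           // spatial random effects
}
model {
  // ICAR prior: penalize squared differences across neighbors
  target += -0.5 * dot_self(phi[node1] - phi[node2]);
  // soft sum-to-zero constraint
  sum(phi) ~ normal(0, 0.001 * N);
}
"
# fit <- stan(model_code = icar_model, data = ...)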

Stan Weekly Roundup, 21 July 2017

It was another productive week in Stan land. The big news is that

  • Jonathan Auerbach, Tim Jones, Susanna Makela, Swupnil Sahai, and Robin Winstanley won first place in a New York City competition for predicting elementary school enrollment. Jonathan told me, “I heard 192 entered, and there were 5 finalists….Of course, we used Stan (RStan specifically). … Thought it might be Stan news worthy.”

    I’d say that’s newsworthy. Jon also provided a link to the “challenge” page, a New York City government sponsored “call for innovations”: Enhancing School Zoning Efforts by Predicting Population Change.

    They took home a US$20K paycheck for their efforts!

    Stan’s seeing quite a lot of use these days among demographers and others looking to predict forward from time series data. Jonathan’s been very active using government data sets (see his StanCon 2017 presentation with Rob Trangucci, Twelve Cities: Does lowering speed limits save pedestrian lives?). Hopefully they’ll share their code—maybe they already have. I really like to see this combination of government data and careful statistical analysis.

In other news,

  • Andrew Gelman’s been driving a lot of rethinking of our interfaces because he’s revising his and Jennifer Hill’s regression book (the revision will be two books). Specifically, he’s thinking a lot about workflow and how we can manage model expansion by going from fixed to modeled parameters. Right now, the process is onerous in that you have to move variables from the data block to the parameters block and keep updating their priors (see the sketch at the end of this list). Andrew wants to be able to do this from the outside, but Michael Betancourt and I both expressed a lot of skepticism, since it would break some of our fundamental abstractions (like a Stan program defining the density that’s fit!). More to come on this hot topic; any ideas on how to manage developing a model this way would be appreciated. This goes back to the very first thing Matt Hoffman, Michael Malecki, and I worked on with Andrew when he hired us, before we’d conceived Stan. You’d think we’d have better advice on this after all this time. I’ve seen people do everything from using the C++ preprocessor to writing elaborate program-generation code in R.

  • Breck Baldwin has been working on governance and we’re converging on a workable model that we’ll share with everyone soon. The goal’s to make the governance clear and less of a smoke-filled room job by those of us who happen to go to lunch after the weekly meetings.

  • Jonah Gabry is taking on the ggplot2-ification of the new regression book and trying to roll everything into a combination of RStanArm and bayesplot. No word yet if the rest of the tidyverse is to follow. Andrew said, “I’ll see what Jonah comes up with” or something to that effect.

  • Jonah has also been working on priors for multilevel regression and poststratification with Yajuan Si (a former postdoc of Andrew’s, now at U. Wisconsin); the trick is doing something reasonable when you have lots of interactions.

  • Ben Goodrich has been working on the next release of RStanArm. It’s beginning to garner a lot of contributions. Remember that the point has been to convert a lot of common point estimation packages to Bayesian versions and supply them with familiar R interfaces. Ben’s particularly been working on survival analysis lately.

  • Our high school student intern (don’t know if I can mention names online—the rest of our developers are adults!) is working on applying the Cook-Gelman-Rubin metric to evaluating various models. We’re doing much more of this method and it needs a less verbose and more descriptive name!

  • Mitzi Morris submitted a pull request for the Stan repo to add line numbers to error messages involving compound declare-and-define statements.

  • A joint effort by Mitzi Morris, Dan Simpson, Imad Ali, and Miguel A. Martinez Beneito has led to convergence of models and fits among BUGS, Stan, and INLA for intrinsic conditional autoregressive (ICAR) models. Imad’s building the result into RStanArm and has figured out how to extend the loo (leave-one-out cross-validation) package to deal with spatial models. Look for it in a Stan package near you soon. Mitzi’s working on the case study, which has been updated in the example-models repo.

  • Charles Margossian knocked off a paper on the mixed ODE solver he’s been working on, with a longer paper promised that goes through all the code details. Not sure if that’s on arXiv or not. He’s also been training Bill Gillespie to code in C++, which is great news for the project, since Charles has to contend with his first year of grad school next year (whereas Bill’s facing a pleasant retirement of tracking down memory leaks). Charles is also working on getting the algebraic and steady-state solvers integrated into Torsten before the fall.

  • Krzysztof Sakrejda has a new paper out motivating a custom density he wrote in Stan for tracking delays in disease incidence in a countrywide analysis for Thailand. I’m not sure where he put it.

  • Yuling Yao is revising the stacking paper (for a kind of improved model averaging). I believe this is going into the loo package (or is maybe already there). So much going on with Stan I can’t keep up!

  • Yuling also revised the simulated tempering paper, which is some custom code on top of Stan to fit models with limited multimodality. There was some discussion about realistic examples with limited multimodality and we hope to have a suite of them to go along with the paper.

  • Sebastian Weber, Bob Carpenter, Michael Betancourt, and Charles Margossian had a long and productive meeting to design the MPI interface. It’s very tricky trying to find an interface that’ll fit into the Stan language and let you ship off data once and reuse it. I think we’re almost there. The design discussion’s on Discourse.

  • Rob Trangucci is finishing the GP paper (after revising the manual chapter) with Dan Simpson, Aki Vehtari, and Michael Betancourt.
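
Back to Andrew’s fixed-to-modeled point above: here’s a minimal, made-up illustration of the kind of model expansion he’s talking about (toy RStan model strings invented for this sketch, not anything from the book). Going from the first program to the second means physically moving tau from data to parameters and giving it a prior.

# Version 1: tau is fixed, supplied with the data.
fixed_tau <- "
data {
  int<lower=0> N;
  vector[N] y;
  real<lower=0> tau;   // fixed scale, passed in as data
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, tau);
  y ~ normal(mu, 1);
}
"

# Version 2: tau is modeled, so it moves to parameters and needs a prior.
modeled_tau <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> tau;   // now estimated from the data
}
model {
  tau ~ normal(0, 1);  // half-normal, given the lower bound
  mu ~ normal(0, tau);
  y ~ normal(mu, 1);
}
"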

I’m sure I’m missing a lot of goings on, especially among people not at our weekly meetings. If you know of things that should be on this list, please let me know.

Short course on Bayesian data analysis and Stan 23-25 Aug in NYC!

Jonah “ShinyStan” Gabry, Mike “Riemannian NUTS” Betancourt, and I will be giving a three-day short course next month in New York, following the model of our successful courses in 2015 and 2016.

Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan.) If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.

Class structure and example topics for the three days:

Day 1: Foundations
Foundations of Bayesian inference
Foundations of Bayesian computation with Markov chain Monte Carlo
Intro to Stan with hands-on exercises
Real-life Stan
Bayesian workflow

Day 2: Linear and Generalized Linear Models
Foundations of Bayesian regression
Fitting GLMs in Stan (logistic regression, Poisson regression)
Diagnosing model misfit using graphical posterior predictive checks
Little data: How traditional statistical ideas remain relevant in a big data world
Generalizing from sample to population (surveys, Xbox example, etc)

Day 3: Hierarchical Models
Foundations of Bayesian hierarchical/multilevel models
Accurately fitting hierarchical models in Stan
Why we don’t (usually) have to worry about multiple comparisons
Hierarchical modeling and prior information

Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Optimization
Warmup, adaptation, and convergence
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes

Again, information on the course is here.

The course is organized by Lander Analytics.

The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.

Animating a spinner using ggplot2 and ImageMagick

It’s Sunday, and I [Bob] am just sitting on the couch peacefully ggplotting to illustrate basic sample spaces using spinners (a trick I’m borrowing from Jim Albert’s book Curve Ball). There’s an underlying continuous outcome (i.e., where the spinner lands) and a quantization into a number of regions to produce a discrete outcome (e.g., “success” and “failure”). I’m quite pleased with myself for being able to use polar coordinates to create the spinner and arrow. ggplot works surprisingly well in polar coordinates once you figure them out; almost everything people have said about them online is confused and the doc itself assumes you’re a bit more of a ggplotter and geometer than me.

I’m so pleased with it that I show the plot to Mitzi. She replies, “Why don’t you animate it?” I don’t immediately say, “What a waste of time,” then get back to what I’m doing. Instead, I boast, “It’ll be done when you get back from your run.” Luckily for me, she goes for long runs—I just barely had the prototype working as she got home. And then I had to polish it and turn it into a blog post. So here it is, for your wonder and amazement.

[animated spinner GIF]

Here’s the R magic.

library(ggplot2)
draw_curve <- function(angle) {
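  # two outcomes with fixed probabilities: 30% success, 70% failure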
  df <- data.frame(outcome = c("success", "failure"), prob = c(0.3, 0.7)) 
  plot <-
    ggplot(data=df, aes(x=factor(1), y=prob, fill=outcome)) +
    geom_bar(stat="identity", position="fill") +
    coord_polar(theta="y", start = 0, direction = 1) +
    scale_y_continuous(breaks = c(0.12, 0.7), labels=c("success", "failure")) +
    geom_segment(aes(y= angle/360, yend= angle/360, x = -1, xend = 1.4), arrow=arrow(type="closed"), size=1) +
    theme(axis.title = element_blank(),
          axis.ticks = element_blank(),
          axis.text.y = element_blank()) +
    theme(panel.grid = element_blank(),
          panel.border = element_blank()) +
    theme(legend.position = "none") +
    geom_point(aes(x=-1, y = 0), color="#666666", size=5)
  return(plot)
}
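# arrow angle for each frame: the increment shrinks by one degree per
# frame (67 - i), so the arrow decelerates smoothly to a stop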
ds <- c()
pos <- 0
for (i in 1:66) {
  pos <- (pos + (67 - i)) %% 360
  ds[i] <- pos
}
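# pad with ten frames at the start and end so the animation pauses
# before spinning and after stopping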
ds <- c(rep(0, 10), ds)
ds <- c(ds, rep(ds[length(ds)], 10))

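# render one PNG per frame; the zero-padding in the filenames keeps
# them in frame order when sorted alphabetically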
for (i in 1:length(ds)) {
  ggsave(filename = paste("frame", ifelse(i < 10, "0", ""), i, ".png", sep=""),
         plot = draw_curve(ds[i]), device="png", width=4.5, height=4.5)
}

I probably should've combined the theme() calls. Ben would've been able to define ds in a one-liner and then map ggsave over the frames—something like the sketch below. I hope it's at least clear what my code does (it just decrements the number of degrees the arrow moves each frame by one—no physics involved).
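
In the spirit of guessing, the mapped version might look something like this (an untested sketch: cumsum builds the same decelerating angles as my loop, and sprintf handles the zero-padding):

library(ggplot2)
library(purrr)

ds <- c(rep(0, 10), cumsum(66:1) %% 360)  # same decelerating positions
ds <- c(ds, rep(tail(ds, 1), 10))         # hold the final frame
walk(seq_along(ds), function(i)
  ggsave(sprintf("frame%02d.png", i), plot = draw_curve(ds[i]),
         width = 4.5, height = 4.5))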

After producing the frames in alphabetical order (all that ifelse and paste mumbo-jumbo), I went to the output directory and ran the results through ImageMagick (which I'd previously installed on my now ancient Macbook Pro) from the terminal, using

> convert *.png -delay 3 -loop 0 spin.gif

That took a minute or two. Each of the PNGs is about 100KB, but the final output is only 2.5MB or so. Maybe I should've gone with less delay (I don't even know what the units are!), fewer rotations, and maybe a slower final slowing down (maybe study the physics). How do the folks at Pixar ever get anything done?

P.S. I can no longer get the animation package to work in R, though it used to work in the past. It just wraps up those calls to ImageMagick.
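
For reference, here's roughly what the animation-package route would look like if it were cooperating (a sketch reusing draw_curve and ds from above, and assuming ImageMagick is on the PATH; the interval is in seconds):

library(animation)

saveGIF({
  for (i in seq_along(ds)) print(draw_curve(ds[i]))  # one frame per angle
}, movie.name = "spin.gif", interval = 0.03)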

P.P.S. That salmon and teal color scheme is the default!

Stan Weekly Roundup, 14 July 2017

Another week, another bunch of Stan updates.

  • Kevin Van Horn and Elea McDonnell Feit put together a tutorial on Stan [GitHub link] that covers linear regression, multinomial logistic regression, and hierarchical multinomial logistic regression.
  • Andrew has been working on writing up our “workflow”. That includes Chapter 1, Verse 1 of Bayesian Data Analysis: (1) specifying a joint density for all observed and unobserved quantities, (2) performing inference, and (3) model criticism and posterior predictive checks. It also includes fake-data generation and program checking (the Cook-Gelman-Rubin procedure—see the sketch after this list; eliminating divergent transitions in HMC) and comparison to other inference systems as a sanity check. It further includes the process of building up the model incrementally, starting from a simple model. We’re trying to write this up in our case studies and make it the focus of the upcoming Stan book.
  • Ben Goodrich is working on RStanArm with lots of new estimators, specifically following nlmer, for GLMs with unusual inverse link functions. This led to some careful evaluation, which uncovered some multimodal behavior.
  • Breck Baldwin has been pushing through governance discussions so we can start thinking about how to make decisions about the Stan project when not everyone agrees. I think we’re going to go more with a champion model than a veto model; stay tuned.
  • Mitzi Morris has been getting a high-school intern up to speed for doing some model comparisons and testing.
  • Mitzi Morris has implemented the Besag-York-Mollie likelihood with improved priors provided by Dan Simpson. You can check out the ongoing branch in the stan-dev/example-models repo.
  • Aki Vehtari has been working on improving Pareto smoothed importance sampling and refining effective sample size estimators.
  • Imad Ali has prototypes of the intrinsic conditional autoregressive models for RStanArm.
  • Charles Margossian is working on gradients of steady-state ODE solvers for Torsten and a mixed solver for forcing functions in ODEs; papers are in the works, including a paper selected to be highlighted at ACoP.
  • Jonah Gabry is working on a visualization paper with Andrew for submission and is gearing up for the Stan course later this summer. He’s also been debugging R packages.
  • Sebastian Weber has been working on the low-level architecture for MPI including a prototype linked from the Wiki. The holdup is in shipping out the data to the workers. Anyone know MPI and want to get involved?
  • Jon Zelner and Andrew Gelman have been looking at adding hierarchical structure to discrete-parameter models for phylogeny. These models are horribly intractable, so they’re trying to figure out what to do when you can’t marginalize and can’t sample (you can write these models in PyMC3 or BUGS, but you can’t explore the posterior), and when you can do some kind of pre-pruning (as is popular in natural language processing and speech recognition pipelines).
  • Matthew Kay has a GitHub package, tidybayes, that aims to integrate data prep and sampler-output munging in a tidyverse style (wrapping the output of samplers like JAGS and Stan).
  • Quentin F. Gronau has a Bridgesampling package on CRAN, the short description of which is “Provides functions for estimating marginal likelihoods, Bayes factors, posterior model probabilities, and normalizing constants in general, via different versions of bridge sampling (Meng & Wong, 1996)”. I heard about it when Ben Goodrich recommended it on the Stan forum.
  • Juho Piironen and Aki Vehtari arXived their paper, Sparsity information and regularization in the horseshoe and other shrinkage priors. Stan code included, naturally.
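
About the Cook-Gelman-Rubin checks in Andrew’s workflow item above: the procedure is simple to sketch in R. Draw a parameter from its prior, simulate data, fit, and record where the true value falls in the posterior; repeat, and the quantiles should be uniform. Here’s a toy version (the model and all settings are made up for illustration):

library(rstan)

# Toy model: unknown mean, unit-scale data, standard normal prior.
model <- stan_model(model_code = "
  data {
    int<lower=0> N;
    vector[N] y;
  }
  parameters {
    real mu;
  }
  model {
    mu ~ normal(0, 1);
    y ~ normal(mu, 1);
  }
")

n_reps <- 20
q <- rep(NA_real_, n_reps)
for (r in 1:n_reps) {
  mu_true <- rnorm(1, 0, 1)                  # draw parameter from prior
  y_sim <- rnorm(10, mu_true, 1)             # simulate fake data
  fit <- sampling(model, data = list(N = 10, y = y_sim), refresh = 0)
  q[r] <- mean(extract(fit)$mu < mu_true)    # posterior quantile of truth
}
hist(q)  # should look roughly uniform if the software is calibrated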

Stan Weekly Roundup, 7 July 2017

Holiday weekend, schmoliday weekend.

  • Ben Goodrich and Jonah Gabry shipped RStan 2.16.2 (their numbering is a little beyond base Stan, which is at 2.16.0). This reintroduces error reporting that got lost in the 2.15 refactor, so please upgrade if you want to debug your Stan programs!
  • Joe Haupt translated the JAGS examples in the second edition of John Kruschke’s book Doing Bayesian Data Analysis into Stan. Kruschke blogged it and Haupt has a GitHub page with the Stan programs. I still owe him some comments on the code.
  • Andrew Gelman has been working on the second edition of his and Jennifer Hill’s regression book, which is being rewritten as two linked books and translated to Stan. He’s coordinating with Jonah Gabry and Ben Goodrich on the RStanArm replacements for lme4 and lm/glm in R.
  • Sean Talts got in the pull request for enabling C++11/C++14 in Stan. This is huge for us developers as we have a lot of pent-up demand for C++11 features on the back burner.
  • Michael Betancourt, with feedback from the NumFOCUS advisory board for Stan, put together a web page of guidelines for using the Stan trademarks.
  • Gianluca Baio released version 1.0.5 of survHE, a survival analysis package based on RStan (and INLA and ShinyStan). There’s also the GitHub repo that Jacki Buros Novik made available with a library of survival analysis models in Stan. Techniques from these packages will probably make their way into RStanArm eventually (Andrew’s putting in a survival analysis example in the new regression book).
  • Mitzi Morris finished testing the Besag-York-Mollie model in Stan and it passes the Cook-Gelman-Rubin diagnostics. Given that GeoBUGS gets a different answer, we now think GeoBUGS is wrong, but those tests haven’t finished running yet (it’s much slower than Stan in terms of effective sample size per unit time if you want to get to convergence).
  • Imad Ali has been working with Mitzi on getting the BYM model into RStanArm.
  • Jonah Gabry taught a one-day Stan class in Padua (Italy) while on vacation. That’s how much we like talking about Stan.
  • Ben Goodrich just gave a Stan talk at the useR! conference in Brussels, following close on the heels of his Berlin meetup. You can find a lot of information about upcoming events at our events page.

  • Mitzi Morris and Michael Betancourt will be teaching a one-day Stan course for the Women in Machine Learning meetup event in New York on 22 July 2017 hosted by Viacom. Dan Simpson’s comment on the blog post was priceless.
  • Martin Černý improved the standalone function parser he wrote for Stan (to make it easier to expose Stan functions in R and Python).
  • Aki Vehtari arXived a new version of the horseshoe prior paper with a parameter to control regularization more tightly, especially for logistic regression. It has the added benefit of being more robust and removing divergent transitions in the Hamiltonian simulation. Look for that to land in RStanArm soon.
  • Charles Margossian continues to make speed improvements on the Stan models for Torsten and is also working on getting the algebraic equation solver into Stan so we can do fixed points of diff eqs and other fun applications. If you follow the link to the pull request, you can also see my extensive review of the code. It’s not easy to put a big feature like this into Stan, but we provide lots of help.
  • Marco Inacio got in a pull request for definite numerical integration. There are, needless to say, all sorts of subtle numerical issues swirling around integration (see the snippet after this list). Marco is working from John Cook’s basic implementation of numerical integration, and John’s been nice enough to offer it under a BSD license so that it’s compatible with Stan.
  • Rayleigh Lei is working on vectorizing all the binary functions and has a branch with the testing framework. This is really hairy template programming, but probably a nice break after his first year of grad school at U. Michigan!
  • Allen Riddell and Ari Hartikainen have been working hard on Windows compatibility for PyStan, which is no walk in the park. Windows has been the bane of our existence since starting this project and if all the world’s applied statisticians switched to Unix (Linux or Mac OS X), we wouldn’t shed a tear.
  • Yajuan Si, Andrew Gelman, Rob Trangucci, and Jonah Gabry have been working on a survey weighting module for RStanArm. Sounds like RStanArm’s quickly becoming the swiss army knife (leatherman?) of Bayesian modeling.
  • Andrew Gelman finished a paper on (issues with) NHST and is wondering about clinical effects that are small by design because they’re being compared to the state of the art treatment as a baseline.
  • My own work on mixed-mode tests continues apace. The most recent pull request adds logical operators (and, or, not) to our autodiff library (they’ve been in Stan—this just rounds out the math lib operators directly) and removes 4000 lines of old code (replacing it with 1000 new lines, but that includes doc and three operators in both forward and reverse mode). I’m optimistic that this will eventually be done and we’ll have RHMC and autodiff Laplace approximations.
  • Ben Bales submitted a pull request for appending arrays, which is under review and will be generalized to arbitrary Stan array types.
  • Ben Bales also submitted a pull request for the initial vectorization of RNGs. This should make programs doing posterior predictive inference so much cleaner.
  • I wrote a functional spec for standalone generated quantities. This would let us do posterior predictive inference after fitting the model. As you can see, even simple things like this take a lot of work. That spec is conservative on a task-by-task basis, but given the correlations among tasks, probably not so conservative in total.
  • I also patched potentially undefined bools in Stan; who knew that an uninitialized bool member in a C++ class could hold values like 127? This followed on from Ben Goodrich filing the issue after a picky R compiler flagged some undefined behavior. Not a bug, but the code’s cleaner now.
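
On the numerical integration bullet above, here’s the classic illustration of why definite integration is touchier than it looks (this uses base R’s integrate, not Marco’s pull request): quadrature can silently miss all of the mass when the interval is huge relative to the peak.

integrate(dnorm, -Inf, Inf)   # correct: 1 with a tiny error estimate
integrate(dnorm, 0, 20000)    # can silently report ~0: the quadrature
                              # points never land near the mass at zero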