The current state of the Stan ecosystem in R

(This post is by Jonah)

Last week I posted here about the release of version 2.0.0 of the loo R package, but there have been a few other recent releases and updates worth mentioning. At the end of the post I also include some general thoughts on R package development with Stan and the growing number of Stan users who are releasing their own packages interfacing with rstan or one of our other packages.

Interfaces

rstanarm and brms: Version 2.17.4 of rstanarm and version 2.2.0 of brms were both released to provide compatibility with the new features in loo v2.0.0. Two of the new vignettes for the loo package show how to use it with rstanarm models, and we have also just released a draft of a vignette on how to use loo with brms and rstan for many “non-factorizable” models (i.e., observations not conditionally independent). brms is also now officially supported by the Stan Development Team (welcome Paul!) and there is a new category for it on the Stan Forums.
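As a rough illustration of the rstanarm workflow mentioned above, here is a minimal sketch. It assumes rstanarm and loo (v2.0.0 or later) are installed. The model (mpg regressed on wt in the built-in mtcars data) and the sampler settings are picked only for illustration and are not taken from the vignettes.

```r
library(rstanarm)
library(loo)

# Illustrative model only: a Gaussian linear regression on a built-in dataset.
# chains/iter are kept small so the sketch runs quickly.
fit <- stan_glm(mpg ~ wt, data = mtcars,
                chains = 2, iter = 1000, seed = 123, refresh = 0)

# rstanarm computes the pointwise log-likelihood for you, so loo() can be
# called directly on the fitted model object.
loo_result <- loo(fit)
print(loo_result)
```

The printed summary includes the Pareto k diagnostics, which are worth checking before trusting the estimates.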

rstan: The next release of the rstan package (v2.18) is not out yet (we need to get Stan 2.18 out first), but it will include a loo() method for stanfit objects in order to save users a bit of work. Unfortunately, we can’t save you the trouble of having to compute the point-wise log-likelihood in your Stan program though! There will also be some new functions that make it a bit easier to extract HMC/NUTS diagnostics (thanks to a contribution from Martin Modrák).

Visualization

bayesplot: A few weeks ago we released version 1.5.0 of the bayesplot package (mc-stan.org/bayesplot), which also integrates nicely with loo 2.0.0. In particular, the diagnostic plots using the leave-one-out cross-validated probability integral transform (LOO-PIT) from our paper Visualization in Bayesian Workflow (preprint on arXiv, code on GitHub) are easier to make with the latest bayesplot release. Also, TJ Mahr continues to improve the bayesplot experience for ggplot2 users by adding (among other things) more functions that return the data used for plotting in a tidy data frame.

shinystan: Unfortunately, there hasn’t been a shinystan (mc-stan.org/shinystan) release in a while because I’ve been busy with all of these other packages, papers, and various other Stan-related things. We’ll try to get out a release with a few bug fixes soon. (If you’re annoyed by the lack of new features in shinystan recently let me know and I will try to convince you to help me solve that problem!)

(Update: I forgot to mention that despite the lack of shinystan releases, we’ve been working on better introductory materials. To that end, Chelsea Muth, Zita Oravecz, and I recently published an article, User-friendly Bayesian regression modeling: A tutorial with rstanarm and shinystan.)

Other tools

loo: We released version 2.0.0, a major update to the loo package (mc-stan.org/loo). See my previous blog post.

projpred: Version 0.8.0 of the projpred package (mc-stan.org/projpred) for projection predictive variable selection for GLMs was also released shortly after the loo update in order to take advantage of the improvements to the Pareto smoothed importance sampling algorithm. projpred can already be used quite easily with rstanarm models and we are working on improving its compatibility with other packages for fitting Stan models.

rstantools: Unrelated to the loo update, we also released version 1.5.0 of the rstantools package (mc-stan.org/rstantools), which provides functions for setting up R packages interfacing with Stan. The major changes in this release are that usethis::create_package() is now called to set up the package (instead of utils::package.skeleton), fewer manual changes to files are required by users after calling rstan_package_skeleton(), and we have a new vignette walking through the process of setting up a package (thanks Stefan Siegert!). Work is being done to keep improving this process, so be on the lookout for more updates soonish.

Stan related R packages from other developers

There are now well over fifty packages on CRAN that depend in some way on one of our R packages mentioned above! You can find most of them by looking at the “Reverse dependencies” section on the CRAN page for rstan, although that doesn’t count the ones that depend on bayesplot, shinystan, loo, etc., without also depending on rstan.

Unfortunately, given the growing number of these packages, we haven’t been able to look at each one of them in detail. For obvious reasons we prioritize giving feedback to developers who reach out to us directly to ask for comments and to those developers who make an effort to follow our recommendations for developers of R packages interfacing with Stan (included with the rstantools package since its initial release in 2016). If you are developing one of these packages and would like feedback please let us know on the Stan Forums. Our time is limited but we really do make a serious effort to answer every single question asked on the forums (thank you to the many Stan users who also volunteer their time helping on the forums!).

My primary feelings about this trend of developing Stan-based R packages are ones of excitement and gratification. It’s really such an honor to have so many people developing these packages based on all the work we’ve done! There are also a few things I’ve noticed that I hope will change going forward. I’ll wrap up this post by highlighting two of these issues that I hope developers will take seriously:

(1) Unit testing

(2) Naming user-facing functions

The number of these packages that have no unit tests (or very scant testing) is a bit scary. Unit tests won’t catch every possible bug (we have lots of tests for our packages and people still find bugs all the time), but there is really no excuse for not unit testing a package that you want other people to use. If you care enough to do everything required to create your package and get it on CRAN, and if you care about your users, then I think it’s fair to say that you should care enough to write tests for your package. And there’s really no excuse these days with the availability of packages like testthat to make this process easier than it used to be! Can anyone think of a reasonable excuse for not unit testing a package before releasing it to CRAN and expecting people to use it? (Not a rhetorical question. I really am curious given that it seems to be relatively common or at least not uncommon.) I don’t mean to be too negative here. There are also many packages that seem to have strong testing in place! My motivation for bringing up this issue is that it is in the best interest of our users.
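To give a sense of how little code a basic test requires, here is a minimal sketch using testthat. The function center_predictors() is a hypothetical helper invented for this example and does not come from any of the packages discussed here. In a real package the tests would live in a file under tests/testthat/.

```r
library(testthat)

# Hypothetical helper of the kind a modeling package might export:
# subtracts each column's mean from that column.
center_predictors <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  sweep(x, 2, colMeans(x))
}

test_that("center_predictors returns columns with mean zero", {
  x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
  centered <- center_predictors(x)
  expect_equal(dim(centered), dim(x))
  expect_equal(colMeans(centered), c(0, 0))
})

test_that("center_predictors rejects non-matrix input", {
  expect_error(center_predictors("not a matrix"))
})
```

Even two small tests like these document the intended behavior and catch regressions when the function is later refactored.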

Regarding function naming: this isn’t nearly as big of a deal as unit testing, it’s just something I think developers (including myself) of packages in the Stan R ecosystem can do to make the experience better for our users. rstanarm and brms both import the generic functions included with rstantools in order to be able to define methods with consistent names. For example, whether you fit a model with rstanarm or with brms, you can call log_lik() on the fitted model object to get the pointwise log-likelihood (it’s true that we still have a bit left to do to get the names across rstanarm and brms more standardized, but we’re actively working on it). If you are developing a package that fits models using Stan, we hope you will join us in trying to make it as easy as possible for users to navigate the Stan ecosystem in R.
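The shared-generic idea can be sketched in base R. The class "toyfit" and its contents are hypothetical and exist only for this example. The generic is defined locally so the code runs on its own, whereas a real package would import log_lik() from rstantools and only define the method.

```r
# Defined here only so the sketch is self-contained; a real package would
# import this generic from rstantools rather than define its own.
log_lik <- function(object, ...) UseMethod("log_lik")

# Hypothetical fitted-model object: posterior draws of a mean plus the data.
toy_fit <- structure(
  list(
    mu_draws = c(-0.1, 0.0, 0.2),  # 3 posterior draws of the mean
    sigma    = 1,                  # standard deviation, treated as known
    y        = c(0.5, -1.2, 0.3, 0.8)  # 4 observations
  ),
  class = "toyfit"
)

# The method returns a draws-by-observations matrix of pointwise
# log-likelihood values.
log_lik.toyfit <- function(object, ...) {
  t(sapply(object$mu_draws, function(mu) {
    dnorm(object$y, mean = mu, sd = object$sigma, log = TRUE)
  }))
}

ll <- log_lik(toy_fit)
dim(ll)  # one row per draw, one column per observation
```

Because the user-facing name is the same across packages, code written against log_lik() does not need to know which package fit the model.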

ShinyStan v2.0.0

For those of you not familiar with ShinyStan, it is a graphical user interface for exploring Stan models (and more generally MCMC output from any software). For context, here’s the post on this blog first introducing ShinyStan (formerly shinyStan) from earlier this year.

ShinyStan v2.0.0 released

ShinyStan v2.0.0 is now available on CRAN. This is a major update with a new look and a lot of new features. It also has a new(ish) name: ShinyStan is the app/GUI and shinystan the R package (both had formerly been shinyStan for some reason apparently not important enough for me to remember). Like earlier versions, this version has enhanced functionality for Stan models but is compatible with MCMC output from other software packages too.

You can install the new version from CRAN like any other package:

install.packages("shinystan")

If you prefer a version with a few minor typos fixed you can install from GitHub using the devtools package:

devtools::install_github("stan-dev/shinystan", build_vignettes = TRUE)

(Note: after installing the new version and checking that it works we recommend removing the old one by running remove.packages("shinyStan").)

If you install the package and want to try it out without having to first fit a model you can launch the app using the preloaded demo model:

library(shinystan)
launch_shinystan_demo()

Notes

This update contains a lot of changes: new features, greater UI stability, and an entirely new look. Some release notes can be found on GitHub and there are also some instructions for getting started on the ShinyStan wiki page. Here are two highlights:

  • The new interactive diagnostic plots for Hamiltonian Monte Carlo. In particular, these are designed for models fit with Stan using NUTS (the No-U-Turn Sampler).

    [Diagnostics screenshot]

  • The deploy_shinystan function, which lets you easily deploy ShinyStan apps for your models to RStudio’s ShinyApps hosting service. Each of your apps (i.e. each of your models) will have a unique URL. To use this feature please also install the shinyapps package: devtools::install_github("rstudio/shinyapps").

The plan is to release a minor update with bug fixes and other minor tweaks in a month or so. So if you find anything we should fix or change (or if you have any other suggestions) we’d appreciate the feedback.

Google Refine

Tools worth knowing about:

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help (admittedly 49760×951 is bigger than I’d really like to deal with in the browser with js… but on a subset yes). [I might write this example up later.]

Go watch the screencast videos for Refine. Data-entry problems are rampant in stuff we all use — leading or trailing spaces; mixed decimal-indicators; different units or transformations used in the same column; mixed lettercase leading to false duplicates; that’s only the beginning. Refine certainly would help find duplicates, and it counts things for you too. Just counting rows is too much for researchers sometimes (see yesterday’s post)!
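For comparison, here is how two of these problems (stray spaces and mixed lettercase producing false duplicates) look when handled by hand in base R. This is only a sketch with made-up values, not a description of how Refine works internally.

```r
# Hypothetical messy column: stray spaces and inconsistent letter case.
raw <- c("New York", " new york", "NEW YORK ", "Boston", "boston")

# A naive count treats every variant as a distinct value.
length(unique(raw))

# Strip surrounding whitespace and unify case before comparing.
clean <- tolower(trimws(raw))

table(clean)            # counts per cleaned value
raw[duplicated(clean)]  # original entries that repeat an earlier row
```

Refine's appeal is that it does this kind of clustering and counting interactively, without you having to guess in advance which transformations are needed.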

Refine 2.0 adds some data-collection tools for scraping and parsing web data. I have not had a chance to play with any of this kind of advanced scripting with it yet. I also have not had occasion to use Freebase which seems sort of similar (in that it is mostly open data with web APIs) to infochimps (for more on this, see the infochimps R package by Drew Conway).

RStudio – new cross-platform IDE for R

The new R environment RStudio looks really great, especially for users new to R. In teaching, these are often people new to programming anything, much less statistical models. The R GUIs were different on each platform, with (sometimes modal) windows appearing and disappearing and no unified design. RStudio fixes that and has already found a happy home on my desktop.

Initial impressions

I’ve been using it for the past couple of days. For me, it replaces the niche that R.app held: looking at help, quickly doing something I don’t want to pollute a project workspace with; sometimes data munging, merging, and transforming; and prototyping plots. RStudio is better than R.app at all of these things. For actual development and papers, though, I remain wedded to emacs+ess (good old C-x M-c M-Butterfly).

Favorite features in no particular order

  • plots seamlessly made in new graphics devices. This is huge: instead of one active plot window named something like quartz(1) the RStudio plot window holds a whole stack of them, and you can click through to previous ones that would be overwritten and ‘lost’ in R.app.
  • help viewer. Honestly I use this more than anything else in R.app and the RStudio one is prettier (mostly by being not set in Times), and you can easily get contextual help from the source doc or console pane (hit tab for completions, then F1 on what you want).
  • workspace viewer with types and dimensions of objects. Another reason I sometimes used R.app instead of emacs. This one doesn’t seem much different from the R.app one, but its integration into the environment is better than the floaty thing that R.app does.
  • ‘Import Dataset’ menu item and button in the workspace pane. For new R users, the answer to “How do I get data into this thing?” has always been “Use one of the textbook package’s included datasets until you learn to read.csv()”. This is a much better answer.
  • obviously, the cross-platform nature of RStudio took the greatest engineering effort. The coolest platform is actually the server version: it runs on a server and you access it using a modern browser (i.e., no IE). (“While RStudio is compatible with Internet Explorer, other browsers provide an improved user experience through faster JavaScript performance and more consistent handling of browser events.”)
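For readers who have not yet met the read.csv() route mentioned in the list above, here is a minimal sketch. The column names and values are invented, and inline text stands in for a file on disk so the example is self-contained.

```r
# Inline text stands in for a CSV file; with a real file you would pass its
# path as the first argument instead of using `text =`.
csv_text <- "id,group,score
1,a,3.5
2,b,4.0
3,a,2.5"

scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect column types and dimensions, much as the workspace viewer does.
str(scores)
```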

It would be nice if…

  • indents worked like emacs. I think my code looks nice largely because of emacs+ess. The default indent of two spaces is nice (see the Google style guide) but where newlines line up by default is pretty helpful in avoiding silly typing errors (omitted commas, unclosed parentheses
  • you could edit data.frames, which I’ll guess they are working on. It must be hard, since the R.app one and the X one that comes up in emacs are so abysmal (the R.app one is the least bad). RStudio currently says “Editing of matrix and data.frame objects is not currently supported in RStudio.” :-(

Overall, really great stuff!