Another week, another bunch of Stan updates.
- Kevin Van Horn and Elea McDonnell Feit put together a tutorial on Stan [GitHub link] that covers linear regression, multinomial logistic regression, and hierarchical multinomial logistic regression.
- Andrew has been working on writing up our “workflow”. That includes the Chapter 1, Verse 1 steps of Bayesian Data Analysis: (1) specifying a joint density for all observed and unobserved quantities, (2) performing inference, and (3) model criticism and posterior predictive checks. It also includes fake data generation and program checking (the Cook-Gelman-Rubin procedure; eliminating divergent transitions in HMC) and comparison to other inference systems as a sanity check. It further includes the process of building up the model incrementally starting from a simple model. We’re trying to write this up in our case studies and make it the focus of the upcoming Stan book.
- Ben Goodrich has been working on RStanArm with lots of new estimators, specifically following from nlmer, for GLMs with unusual inverse link functions. This led to some careful evaluation, uncovering some multimodal behavior.
- Breck Baldwin has been pushing through governance discussions so we can start thinking about how to make decisions about the Stan project when not everyone agrees. I think we’re going to go more with a champion model than a veto model; stay tuned.
- Mitzi Morris has been getting a high-school intern up to speed for doing some model comparisons and testing.
- Mitzi Morris has implemented the Besag-York-Mollie likelihood with improved priors provided by Dan Simpson. You can check out the ongoing branch in the stan-dev/example-models repo.
- Aki Vehtari has been working on improving Pareto smoothed importance sampling and refining effective sample size estimators.
- Imad Ali has prototypes of the intrinsic conditional autoregressive models for RStanArm.
- Charles Margossian is working on gradients of steady-state ODE solvers for Torsten and a mixed solver for forcing functions in ODEs; papers are in the works, including a paper selected to be highlighted at ACoP.
- Jonah Gabry is working on a visualization paper with Andrew for submission and is gearing up for the Stan course later this summer. He has also been debugging R packages.
- Sebastian Weber has been working on the low-level architecture for MPI including a prototype linked from the Wiki. The holdup is in shipping out the data to the workers. Anyone know MPI and want to get involved?
- Jon Zelner and Andrew Gelman have been looking at adding hierarchical structure to discrete-parameter models for phylogeny. These models are horribly intractable, so they’re trying to figure out what to do when you can’t marginalize and can’t sample (you can write these models in PyMC3 or BUGS, but you can’t explore the posterior). They’re also trying to figure out when you can do some kind of pre-pruning (as is popular in natural language processing and speech recognition pipelines).
- Matthew Kay has a GitHub package TidyBayes that aims to integrate data and sampler data munging in a TidyVerse style (wrapping the output of samplers like JAGS and Stan).
- Quentin F. Gronau has a Bridgesampling package on CRAN, the short description of which is “Provides functions for estimating marginal likelihoods, Bayes factors, posterior model probabilities, and normalizing constants in general, via different versions of bridge sampling (Meng & Wong, 1996)”. I heard about it when Ben Goodrich recommended it on the Stan forum.
- Juho Piironen and Aki Vehtari arXived their paper, Sparsity information and regularization in the horseshoe and other shrinkage priors. Stan code included, naturally.
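For readers who haven’t seen the fake-data check mentioned in the workflow item above, here is a minimal sketch of the idea in Python. It is not our actual test code. It assumes a toy conjugate model (standard normal prior on a mean, unit-variance normal observations) so that the posterior is available in closed form; in real use the posterior would come from draws from a sampler such as Stan. The function name and all the settings are made up for illustration. The point is that if you draw the parameter from the prior, simulate data from it, and compute the posterior quantile of the true parameter, those quantiles should look uniform on (0, 1) when the program is correct.

```python
import random
from statistics import NormalDist, fmean, pvariance


def posterior_quantiles(n_reps=2000, n_obs=10, seed=1):
    """Posterior quantiles of the true parameter over repeated fake data sets.

    Toy model (an assumption for illustration):
        mu ~ Normal(0, 1)
        y_i ~ Normal(mu, 1),  i = 1..n_obs
    The exact posterior is Normal(sum(y) / (n_obs + 1), sd = sqrt(1 / (n_obs + 1))).
    """
    rng = random.Random(seed)
    quantiles = []
    for _ in range(n_reps):
        # 1. Draw the "true" parameter from the prior.
        mu_true = rng.gauss(0.0, 1.0)
        # 2. Simulate fake data given that parameter.
        y = [rng.gauss(mu_true, 1.0) for _ in range(n_obs)]
        # 3. Compute the posterior (closed form here; sampler draws in practice).
        post_var = 1.0 / (1.0 + n_obs)
        post_mean = post_var * sum(y)
        posterior = NormalDist(post_mean, post_var ** 0.5)
        # 4. Record where the true parameter falls in the posterior.
        quantiles.append(posterior.cdf(mu_true))
    return quantiles


qs = posterior_quantiles()
# For a correct program these should be close to the Uniform(0, 1)
# values: mean 1/2 and variance 1/12.
print(fmean(qs), pvariance(qs))
```

If the posterior computation had a bug (say, the wrong posterior variance), the quantiles would pile up near 0 and 1 or near 0.5 instead of spreading out evenly, and a histogram of `qs` would show it.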
I guess Andrew didn’t get too far on that workflow thing ;-)
Thanks, I was entering things during the Stan meeting and must have gotten absorbed in the conversation. The workflow is really occupying a lot of our thinking in terms of how we want to move Stan forward along with our writing about it and teaching it.
I’d be interested in hearing how many readers of this blog are “tidyverse” users. Personally, I am already used to “normal” R and have found tidyverse stuff to slow everything down, so I usually remove it when working with others’ code. It almost seems like a Python 2 vs. Python 3 thing is forming here.
Well, I use ggplot2 for basically everything, and I use readr a fair amount to read in all those weird files people produce (read_fwf is essential for prying information out of many government sources). I’ll do a little dplyr here and there, though I tend to prefer sqldf for general munging.
I haven’t gotten very far with the whole piping thing %>% etc, though I admit to being a fan of the Unix pipe, usually again, sqldf to the rescue.
I do think there’s a forking culture here though.
Yeah, I was thinking mostly of the pipes. Of course, if there is a function with a certain feature I want (or one that is more efficient), then I will use it. Regarding reading in data, sometimes these functions can get “too smart”. I was using fread (data.table package) for a while until one time I saw it silently mess up a column (during a Kaggle competition, so maybe this bug was triggered on purpose). Now I am scared of it.
Also, for some reason I have never liked the appearance of ggplot’s charts, there is just something cartoonish about them. Since ggplot is probably the “gateway drug” into the tidyverse for many people, perhaps that is where I diverged.