It’s so hard to compare the efficiency of MCMC samplers

I thought I’d share a comment I made on Perry de Valpine’s post on the NIMBLE blog.

Perry was posting about a paper that tried to compare efficiency of Stan, NIMBLE, and JAGS for some linear modeling problems. You can find the link in Perry’s post if you really care and then the paper links to their GitHub. Perry was worried they were misrepresenting NIMBLE and users would get the wrong idea. We decided to take the approach of established companies and simply ignore this kind of criticism. But this time I couldn’t resist as it’s not really about us.

Here’s my comment (lightly edited):

Comparing systems on an equal footing is a well-nigh impossible task, which is why we shy away from doing it in Stan. The polite thing to do with these kinds of comparisons is to send your code to the devs of each system for tuning. That’s what we did for our autodiff eval paper.

I don’t like this paper’s evaluation any more than you do! I’d like to see an evaluation with (a) arithmetic on an equal footing, (b) the kinds of priors we actually use in our applied work, and (c) something higher dimensional than p = 100 (as in p = 10,000 or even p = 100,000 like in the genomics regressions I’m working on now). Then the evaluation I care about is time to ESS = 100 as measured by our conservative cross-chain ESS evaluations that also allow for antithetic sampling (Stan can produce samples whose ESS is higher than the number of iterations; many estimators just truncate at the number of iterations because they don’t understand ESS and its relation to square error through the MCMC CLT). The problem with this kind of eval is that we want to represent actual practice but also minimize warmup to put systems in as favorable a light as possible. In simple GLMs like these, Stan usually only needs 100 or maybe 200 iterations of warmup compared to harder models. So if you use our default of 1000 warmup iterations and then run sampling until you hit ESS = 100, you’ve wasted a lot of time on unnecessary warmup. But in practice, you don’t know if you can get away with less warmup (though you can use something like iterative deepening algorithms to probe deeper, they’re not built in yet).

One way to get around this sensitivity to warmup is to evaluate ESS/second after warmup (or what you might call “burnin” if you’re still following BUGS’s terminology). But given that we rarely need more than ESS = 100 and want to run at least 4 parallel chains to debug convergence, that’s not many iterations and you start getting a lot of variance in the number of iterations it takes to get there. And things are even more sensitive to getting adaptation right. And also, I don’t think ESS/second after warmup is the metric practitioners care about unless they’re trying to evaluate tail statistics, at which point they should be seriously considering control variates rather than more sampling.
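To make the metric concrete, here’s a minimal R sketch of “time to ESS = 100,” using the posterior package’s ess_bulk(), which implements the conservative cross-chain, rank-normalized estimate. The AR(1) chains and the timing number below are stand-ins for real post-warmup sampler output and measured wall time, not anything from the paper being discussed.

library(posterior)  # ess_bulk() gives the cross-chain, rank-normalized ESS

set.seed(1234)
iters <- 2000
chains <- 4
rho <- 0.9  # autocorrelation of the fake chains
draws <- matrix(NA_real_, nrow = iters, ncol = chains)
for (c in 1:chains) {
  x <- numeric(iters)
  x[1] <- rnorm(1)
  for (i in 2:iters)
    x[i] <- rho * x[i - 1] + sqrt(1 - rho^2) * rnorm(1)
  draws[, c] <- x  # stand-in for post-warmup draws of one parameter
}

sampling_seconds <- 3.2            # hypothetical measured wall time after warmup
ess <- ess_bulk(draws)             # rows are iterations, columns are chains
ess_per_second <- ess / sampling_seconds
seconds_to_ess_100 <- 100 / ess_per_second
c(ess = ess, ess_per_second = ess_per_second,
  seconds_to_ess_100 = seconds_to_ess_100)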

In other words, this is a really hard problem.

I then read Perry’s follow-up and couldn’t help myself. I actually looked at their Stan code. Then I had a follow-up comment.

I just read their source code. It’s not exactly Stan best practices for statistics or computation. For instance, in mixture_model.stan, there are redundant data computations per iteration (line 25), redundant distributions (also line 25), inefficient function calls (line 31), and a conjugate parameterization inducing extra work like sqrt (line 23). Then in AFT_non_informative.stan, they use very weakly informative priors (so it’s misnamed), there are missing constraints on constrained variables (lines 17, 32), redundant computation of subexpressions and of constants (lines 26, 27), missing algebraic reductions (also lines 26 and 27), redundant initialization and setting (lines 22/23 and 26/27), redundant computations (line 32).

The worst case for efficiency is in their coding of linear models where they use a loop rather than a matrix multiply (LinearModel_conjugate.stan, lines 30–32, which also violates every naming convention in the world by mixing underscore separators and camel case). This code also divides by 2 everywhere when it should be multiplying by 0.5 and it also has a bunch of problems like the other code of missing constraints (this one’s critical—`sigma` needs to be constrained to be greater than 0).

Then when we look at LinearModel_non_informative_hc.stan, things get even worse. It combines the problems of LinearModel_conjugate with two really bad ones for performance: not vectorizing the normal distribution and needlessly truncating the Cauchy distribution. These would add up to at least a factor of 2 and probably much more.
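That kind of gap is easy to demonstrate even in plain R (this is my own illustration of the general loop-versus-vectorization point, not their code and not Stan): the same log likelihood computed row by row is dramatically slower than one matrix multiply plus one vectorized density call.

# R analogy for the loop-vs-matrix-multiply and unvectorized-density issues above.
set.seed(1)
N <- 10000
K <- 100
X <- matrix(rnorm(N * K), N, K)
beta <- rnorm(K)
y <- as.vector(X %*% beta + rnorm(N))

loop_ll <- function() {
  ll <- 0
  for (n in 1:N) {                    # one dot product and one density per row
    mu_n <- sum(X[n, ] * beta)
    ll <- ll + dnorm(y[n], mu_n, 1, log = TRUE)
  }
  ll
}

vec_ll <- function() {
  # one matrix multiply, one vectorized density evaluation
  sum(dnorm(y, as.vector(X %*% beta), 1, log = TRUE))
}

all.equal(loop_ll(), vec_ll())
system.time(for (i in 1:50) loop_ll())
system.time(for (i in 1:50) vec_ll())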

And of course, none of these exploited within-chain parallelization or the GPU. And there was no use of sufficient stats in the conjugate cases like LinearModel_conjugate.stan.

My slides and paper submissions for Prob Prog 2021

Prob Prog 2021 just ended. Prob Prog is the big (250 registered attendees, and as many as 180 actually online at one point) probabilistic programming conference. It’s a very broadly scoped conference.

The online version this year went very smoothly. It ran a different schedule every day to accommodate different time zones. So I wound up missing the Thursday talks other than the posters because of the early start. There was a nice amount of space between sessions to hang out in the break rooms and chat.

Given that there’s no publication for this conference, I thought I’d share my slides here. The talks should go up on YouTube at some point.

Slides: What do we need from a PPL to support Bayesian workflow?

There was a lot of nice discussion around bits of workflow we don’t really discuss in the paper or book: how to manage file names for multiple models, how to share work among distributed teammates, how to put models into production and keep them updated for new data. In my talk, I brought up issues others have to deal with like privacy or intellectual property concerns.

My main focus was on modularity. After talking to a bunch of people after my talk, I still don’t think we have any reasonable methodology as a field to test out components of a probabilistic program that are between the level of a density we can unit test and a full model we can subject to our whole battery of workflow tests. How would we go about just testing a custom GP prior or spatio-temporal model component? There’s not even a way to represent such a module in Stan, which was the motivation for Maria Gorinova‘s work on SlicStan. Ryan Bernstein (a Stan developer and Gelman student) is also working on IDE-like tools that provide a new language for expressing a range of models.

Then Eli Bingham (of Pyro fame) dropped the big question: is there any hope we could use something like these PPLs to develop a scalable, global climate model? Turns out that we don’t even know how they vet the incredibly complicated components of these models. Just the soil carbon models are more complicated than most of the PK/PD models we fit and they’re one of the simplest parts of these models.

I submitted two abstracts this year. They then invited me to do a plenary session, and I decided to focus on the first.

Paper submission 1: What do we need from a probabilistic programming language to support Bayesian workflow?

Paper submission 2: Lambdas, tuples, ragged arrays, and complex numbers in Stan

P.S. Andrew: have you considered just choosing another theme at random? It’s hard to imagine it’d be harder to read than this one.

How many infectious people are likely to show up at an event?

Stephen Kissler and Yonatan Grad launched a Shiny app,

Effective SARS-CoV-2 test sensitivity,

to help you answer the question,

How many infectious people are likely to show up to an event, given a screening test administered n days prior to the event?

Here’s a screenshot.



The app is based on some modeling they did with Stan followed by simulation-based predictions. Here’s the medRxiv paper.

Stephen M. Kissler et al. 2020. SARS-CoV-2 viral dynamics in acute infections. medRxiv.

Users input characteristics of the test taken, the event size, the time the tests are taken before the event, the duration of the event itself, etc. The tool then produces plots of expected number of infectious people at your event and even the projected viral load at your event with uncertainties.

This obviously isn’t a tool for amateurs. I don’t even understand the units Ct for the very first input slider; the authors describe it as “the RT-qPCR cycle threshold”. They said they’d welcome feedback in making this more usable.

Probabilities for action and resistance in Blades in the Dark

Later this week, I’m going to be GM-ing my first session of Blades in the Dark, a role-playing game designed by John Harper. We’ve already assembled a crew of scoundrels in Session 0 and set the first score. Unlike most of the other games I’ve run, I’ve never played Blades in the Dark, I’ve only seen it on YouTube (my fave so far is Jared Logan’s Steam of Blood x Glass Cannon play Blades in the Dark!).

Action roll

In Blades, when a player attempts an action, they roll a number of six-sided dice and take the highest result. The number of dice rolled is equal to their action rating (a number between 0 and 4 inclusive) plus modifiers (0 to 2 dice). The details aren’t important for the probability calculations. If the total of the action rating and modifiers is 0 dice, the player rolls two dice and takes the worst. This is sort of like disadvantage and (super-)advantage in Dungeons & Dragons 5e.

A result of 1-3 is a failure with a consequence, a result of 4-5 is a success with a consequence, and a result of 6 is an unmitigated success without a consequence. If there are two or more 6s in the result, it’s a success with a benefit (aka a “critical” success).

The GM doesn’t roll. In a combat situation, you can think of the player’s roll as encapsulating a turn of the player attacking and the opponent(s) counter-attacking. On a result of 4-6, the player hits; on a result of 1-5, the opponent hits back or the situation becomes more desperate in some other way, like the character being disarmed or losing their footing. On a critical result (two or more 6s in the roll), the player succeeds with a benefit, perhaps cornering the opponent away from their flunkies.

Resistance roll

When a player suffers a consequence, they can resist it. To do so, they gather a pool of dice for the resistance roll and spend an amount of stress equal to six minus the highest result. Again, if they have zero dice in the pool, they roll two dice and take the worst. If the player rolls a 6, the character takes no stress. If they roll a 1, the character takes 5 stress (which would very likely take them out of the action). If the player has multiple dice and rolls two or more 6s, they actually reduce their stress by 1.

For resistance rolls, the value between 1 and 6 matters, not just whether it’s in 1-3, in 4-5, equal to 6, or if there are two 6s.

Probabilities
Resistance rolls are order statistics (the maximum) of pools of six-sided dice. Action rolls just group those results. Plus a little sugar on top for criticals. We could do this the hard way (combinatorics) or we could do this the easy way. That decision was easy.

Here’s a plot of the results for action rolls, with dice pool size on the x-axis and line plots of results 1-3 (fail plus a complication), 4-5 (succeed with complication), 6 (succeed), and 66 (critical success with benefit). This is based on 10M simulations.



You can find a similar plot from Jasper Flick on AnyDice, in the short note Blades in the Dark.

I find the graph pretty hard to scan, so here’s a table in ASCII format, which also includes the resistance roll probabilities. The 66 result (at least two 6 rolls in the dice pool) is a possibility for both a resistance roll and an action roll. Both decimal places should be correct given the 10M simulations.

            RESISTANCE                      ACTION           BOTH
DICE    1    2    3    4    5    6     1-3  4-5    6      66
----  ----------------------------     -------------    ----
 0d   .31  .25  .19  .14  .08  .03     .75  .22  .03     .00
 1d   .17  .17  .17  .17  .17  .17     .50  .33  .17     .00
 2d   .03  .08  .14  .19  .25  .28     .25  .44  .28     .03
 3d   .01  .03  .09  .17  .29  .35     .13  .45  .35     .07
 4d   .00  .01  .05  .14  .29  .39     .06  .42  .39     .13
 5d   .00  .00  .03  .10  .27  .40     .03  .37  .40     .20
 6d   .00  .00  .01  .07  .25  .40     .02  .32  .40     .26
 7d   .00  .00  .01  .05  .22  .39     .01  .27  .39     .33
 8d   .00  .00  .00  .03  .19  .38     .00  .23  .38     .39

One could go for more precision with more simulations, or resort to working them all out combinatorially.
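For anyone who wants to reproduce the numbers, here’s a minimal sketch of the simulation approach in R (my own rewrite, at a smaller scale than the 10M-draw runs behind the table):

# Simulate action-roll outcome probabilities for pools of 0 to 8 dice.
set.seed(1234)
action_probs <- function(dice, M = 1e5) {
  if (dice == 0) {
    rolls <- matrix(sample(1:6, 2 * M, replace = TRUE), ncol = 2)
    result <- apply(rolls, 1, min)   # zero dice: roll two, take the lowest
    crit <- rep(FALSE, M)            # no criticals on a zero-dice roll
  } else {
    rolls <- matrix(sample(1:6, dice * M, replace = TRUE), ncol = dice)
    result <- apply(rolls, 1, max)   # otherwise take the highest
    crit <- rowSums(rolls == 6) >= 2
  }
  c("1-3" = mean(result <= 3),
    "4-5" = mean(result >= 4 & result <= 5),
    "6"   = mean(result == 6 & !crit),
    "66"  = mean(crit))
}
round(t(sapply(0:8, action_probs)), 2)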

The hard way

The hard way is a bunch of combinatorics. These aren’t too bad because of the way the dice are organized. The probability that the highest of N dice is less than or equal to k is the probability that a single die is less than or equal to k, raised to the N-th power, that is, (k/6)^N. It’s just that there are a lot of cells in the table. And then the differences would be required. Too error prone for me. Criticals can be handled Sherlock Holmes style by subtracting the probability of a non-critical from one. A non-critical either has no sixes (5^N possibilities with N dice) or exactly one six ((N choose 1) * 5^(N - 1) possibilities). That’s not so bad. But there are a lot of entries in the table. So let’s just simulate.
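Still, the exact action-roll probabilities are only a few lines of R if you want to check the simulated table against the combinatorics above (again my own sketch, not code from the original post):

# Exact action-roll probabilities via the combinatorics described above.
exact_action <- function(n) {
  if (n == 0) {                        # zero dice: roll two, take the lowest
    p_fail <- 1 - (3 / 6)^2            # P(lowest <= 3) = 1 - P(both >= 4)
    p_six  <- (1 / 6)^2                # both dice must show 6
    return(c("1-3" = p_fail, "4-5" = 1 - p_fail - p_six, "6" = p_six, "66" = 0))
  }
  p_one_six <- n * (1 / 6) * (5 / 6)^(n - 1)    # exactly one 6
  p_crit    <- 1 - (5 / 6)^n - p_one_six        # two or more 6s
  c("1-3" = (3 / 6)^n,                          # highest <= 3
    "4-5" = (5 / 6)^n - (3 / 6)^n,              # highest is 4 or 5
    "6"   = p_one_six,
    "66"  = p_crit)
}
round(t(sapply(0:8, exact_action)), 3)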

Edit: Cumulative Probability Tables

I really wanted the cumulative probability tables of a result or better (I suppose I could’ve also done it as result or worse). I posted these first on the Blades in the Dark forum. It uses Discourse, just like Stan’s forum.

Action Rolls

Here’s the cumulative probabilities for action rolls.



And here’s the table of cumulative probabilities for action rolls, with 66 representing a critical, 6 a full success, and 4-5 a partial success:

ACTION ROLLS, CUMULATIVE
        probability of result or better
 dice   4-5+     6+     66
    0  0.250  0.028  0.000
    1  0.500  0.167  0.000
    2  0.750  0.306  0.028
    3  0.875  0.421  0.074
    4  0.938  0.518  0.132
    5  0.969  0.598  0.196
    6  0.984  0.665  0.263
    7  0.992  0.721  0.330
    8  0.996  0.767  0.395

Resistance Rolls

And here are the basic probabilities for resistance rolls.



Here’s the table of stress probabilities based on dice pool size:

RESISTANCE ROLLS

             Probability of Stress
Dice    5    4    3    2    1    0   -1
   0  .31  .25  .19  .14  .08  .03  .00 
   1  .17  .17  .17  .17  .17  .17  .00
   2  .03  .08  .14  .19  .25  .28  .03
   3  .00  .03  .09  .17  .28  .35  .07
   4  .00  .01  .05  .13  .28  .39  .13
   5  .00  .00  .03  .10  .27  .40  .20
   6  .00  .00  .01  .07  .25  .40  .26
   7  .00  .00  .01  .05  .22  .39  .33
   8  .00  .00  .00  .03  .19  .37  .40

Here’s the plot for the cumulative probabilities for resistance rolls.



Here’s the table of cumulative resistance rolls.

RESISTANCE ROLLS, CUMULATIVE

             Probability of Stress or Less
Dice       5      4     3     2    1    0     -1
   0    1.00    .69   .44   .25   .11  .03   .00 
   1    1.00    .83   .67   .50   .33  .17   .00
   2    1.00    .97   .89   .75   .56  .31   .03
   3    1.00   1.00   .96   .87   .70  .42   .07
   4    1.00   1.00   .99   .94   .80  .52   .13
   5    1.00   1.00  1.00   .97   .87  .60   .20
   6    1.00   1.00  1.00   .98   .91  .67   .26
   7    1.00   1.00  1.00   .99   .94  .72   .33
   8    1.00   1.00  1.00  1.00   .96  .77   .40

For example, with 4 dice (the typical upper bound for resistance rolls), there’s an 80% chance that the character takes 1, 0, or -1 stress, and 52% chance they take 0 or -1 stress. With 0 dice, there’s a better than 50-50 chance of taking 4 or more stress because the probability of 3 or less stress is only 44%.

Finally, here’s the R code for the resistance and cumulative resistance.

# row = dice, col = c(1:6, 66)
resist <- matrix(0, nrow = 8, ncol = 7)
resist[1, 1:6] <- 1/6
for (d in 2:8) {
  for (result in 1:5) {
    resist[d, result] <-
      sum(resist[d - 1, 1:result]) * 1/6 +
      resist[d - 1, result] *  (result -1) / 6
  }
  resist[d, 6] <- sum(resist[d - 1, 1:5]) * 1/6 +
                  sum(resist[d - 1, 6]) * 5/6
  resist[d, 7] <- resist[d - 1, 7] + resist[d - 1, 6] * 1/6
}

cumulative_resist <- resist  # just for sizing
for (d in 1:8) {
  for (result in 1:7) {
    cumulative_resist[d, result] <- sum(resist[d, result:7])
  }
}

library('reshape')
library('ggplot2')


zero_dice_probs <-  c(11, 9, 7, 5, 3, 1, 0) / 36
zero_dice_cumulative_probs <- zero_dice_probs
for (n in 1:7)
  zero_dice_cumulative_probs[n] <- sum(zero_dice_probs[n:7])

z <- melt(cumulative_resist)  # X1 = dice, X2 = result, value = prob
stress <- 6 - z$X2
df <- data.frame(dice = z$X1, stress = as.factor(stress), prob = z$value)
df <- rbind(df, data.frame(dice = rep(0, 7), stress = as.factor(6 - 1:7), prob = zero_dice_cumulative_probs))

cumulative_plot <- ggplot(df, aes(x = dice, y = prob,
                   colour = stress, group = stress)) +
  geom_line() + geom_point() +
  xlab("dice for resistance roll") +
  ylab("prob of stress or less") +
  scale_x_continuous(breaks = 0:8)
cumulative_plot
ggsave('cumulative-resistance.jpg', plot = cumulative_plot, width = 5, height = 4)


z2 <- melt(resist)  # X1 = dice, X2 = result, value = prob
stress2 <- 6 - z2$X2
df2 <- data.frame(dice = z2$X1, stress = as.factor(stress2), prob = z2$value)
df2 <- rbind(df2, data.frame(dice = rep(0, 7), stress = as.factor(6 - 1:7),
                             prob = zero_dice_probs))

plot <- ggplot(df2, aes(x = dice, y = prob,
               colour = stress, group = stress)) +
  geom_line() + geom_point() +
  xlab("dice for resistance roll") +
  ylab("prob of stress") +
  scale_x_continuous(breaks = 0:8)
plot
ggsave('resistance.jpg', plot = plot, width = 5, height = 4)

Drunk-under-the-lamppost testing

Edit: Glancing over this again, it struck me that the title may be interpreted as being mean. Sorry about that. It wasn’t my intent. I was trying to be constructive and I really like that analogy. The original post is mostly reasonable other than on this one point that I thought was important to call out.

I’m writing a response here to Abraham Mathews’s post, Best practices for code review, R edition, because my comment there didn’t show up and I think the topic’s important. Mathews’s post starts out on the right track, then veers away from best practices in the section “What code should be reviewed?” where he says,

…In general, we would never want to review four or five different files at a time. Instead, code review should be more targeted and the focus must be on the code that requires more prudent attention. The goal should be to review files or lines of code that contain complex logic and may benefit from a glance from other team members.

Given that guidance, the single file from the above example that we should never consider for code review is basic_eda.R. It’s just a simple file with procedural code and will only be run once or twice. …

The standard for code review in industry and large open-source projects is to review every piece of code before it’s merged. The key to this strategy is ensuring that every line of code has been viewed by at the very least the author and one trusted team member. Sampling-based code review that’s biased toward where the group thinks errors may be has the drunk-under-the-lamppost problem of not covering a large chunk of the code. Software developers obsess over test coverage, but full coverage is very challenging and we haven’t been able to get there with Stan. If we were developing flight control, pacemaker, or transactional banking software, the standards would be much higher.

Typically, APIs are designed top-down from a client’s perspective (the client being a human or another piece of code that calls the API), then coded bottom up. Each component is reviewed and unit tested before being merged. The key to this strategy is being able to develop high-level modules with the confidence that the low-level pieces work. It may sound like it’s going to take longer to unit test as you go, but the net is a huge time savings with the upside of having more reliable code.

It’s also critical to keep the three key components of software development in sync: documenting (i.e., design), testing, and coding. In larger projects, features of any size always start with a functional spec outlining how it works from the client point of view—that’s usually written the way the eventual documentation will be written, because that’s what says what the code does. As with the doc, the key is to make sure the API being delivered is both easy to document and easy to test. For example, large functions with intertwined, dependent arguments, as often found in REPL languages like R, Python, and Julia, produce what programmers call a “bad smell”, precisely because such functions are hard to document and test.

Consider the rgamma function in R. It takes three parameter arguments, shape, rate, and scale. Experienced statisticians might know that scale and rate parameters are conventionally inverses, yet this isn’t mentioned in the doc anywhere other than implicitly with the values of the default arguments. What happens if you supply both scale and rate? The doc doesn’t say, so I just tried it. It does not return an error, as one might expect from languages that try to keep their arguments coherent, but rather uses the rate and ignores the scale (order doesn’t matter). At the point someone proposed the rgamma function’s API, someone else should’ve piped up and said, “Whoa, hang on there a second, cowpoke; this function’s going to be a mess to test and document because of the redundant arguments.” With three parameter arguments, each of which may or may not be supplied (and with rate and scale being redundant), the tests need to cover behavior for all 8 possible input patterns. The doc should really say what happens when both scale and rate are specified. Instead, it just says “Invalid arguments will result in return value ‘NaN’, with a warning.” That implies that inconsistent rate and scale arguments (e.g., rate = 10, scale = 10) aren’t considered invalid arguments.

I should also say that my comments above are intended for API design, such as an R package one might want to distribute or a piece of infrastructure a lab or company wants to support. I wouldn’t recommend this style of functional design and doc and testing for exploratory research code, because it’s much harder to design up front and isn’t intended to be portable or distributed beyond a tiny group of collaborators. I’m not saying don’t test such code, I’m just saying the best practices there would be different than for designing APIs for public consumption. For example, no need to test Windows and Linux and Mac if you only ever target one platform, no reason to test all the boundary conditions if they’re never going to be used, and so on. It absolutely still helps to design top down and write modular reusable components bottom up. It’s just usually not apparent what these modules will be until after many iterations.

P.S. I highly recommend Hunt and Thomas’s book, The Pragmatic Programmer. It’s a breeze to read and helped me immensely when I was making the move from a speech recognition researcher writing experimental code to an industrial programmer. Alternatives I’ve read suffer from being too long and pedantic, too dogmatic, and/or too impractical.

P.P.S. I’ve been meaning to write a blog post on the differences in best practices in research versus publicly distributed code. I know they’re different, but haven’t been able to characterize what I’d recommend in the way of methodology for research code. Maybe that’s because I spend maybe one day/month on personal or group research code (for example, the code Andrew and I developed for an analysis of SARS-CoV-2 seroprevalence), and nineteen days a month working on Stan API code. I’d be curious as to what other people do to test and organize their research code.

Make Andrew happy with one simple ggplot trick

By default, ggplot expands the space above and below the x-axis (and to the left and right of the y-axis). Andrew has made it pretty clear that he thinks the x axis should be drawn at y = 0. To remove the extra space around the axes when you have continuous (not discrete or log scale) axes, add the following to a ggplot plot,

plot <-
  plot + 
  scale_x_continuous(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0))

Maybe it could even go in a theme.

Hats off to A5C1D2H2I1M1N2O1R2T1 (I can't make these handles up) for posting the solution on Stack Overflow.

Naming conventions for variables, functions, etc.

The golden rule of code layout is that code should be written to be readable. And that means readable by others, including you in the future.

Three principles of naming follow:

1. Names should mean something.

2. Names should be as short as possible.

3. Use your judgement to balance (1) and (2).

The third one’s where all the fun arises. Do we use “i” or “n” for integer loop variables by convention? Yes, we do. Do we choose “inv_logit” or “inverse_logit”? Stan chose “inv_logit”. Do we choose “complex” or “complex_number”? C++ chose “complex”, as well as choosing “imag” over “imaginary” for the method to pull the imaginary component out.

Do we use names like “run_helper_function”, which is both long and provides zero clue as to what it does? We don’t if we want to do unto others as we’d have them do unto us.

P.S. If the producers of Silicon Valley had asked me, Winnie would’ve dumped Richard after a fight about Hungarian notation, not tabs vs. spaces.

Beautiful paper on HMMs and derivatives

I’ve been talking to Michael Betancourt and Charles Margossian about implementing analytic derivatives for HMMs in Stan to reduce memory overhead and increase speed. For now, one has to implement the forward algorithm in the Stan program and let Stan autodiff through it. I worked out the adjoint method (aka reverse-mode autodiff) derivatives of the HMM likelihood (basically, reverse-mode autodiffing the forward algorithm), but it was stepwise and the connection to forward-backward wasn’t immediately obvious. So I thought maybe someone had already put a bow on this in the literature.

It was a challenging Google search, but I was rewarded with one of the best papers I’ve read in ages and by far the best thing I’ve ever read on hidden Markov models (HMM) and their application:

The paper provides elegant one-liners for the forward algorithm, the backward algorithm, the likelihood, and the derivative of the likelihood with respect to model parameters. For example, here’s the formula for the likelihood:

$latex L = \pi^{\top} \cdot \textrm{diag}(B_1) \cdot A \cdot \textrm{diag}(B_2) \cdot A \cdots A \cdot \textrm{diag}(B_T) \cdot 1.$

where $latex \pi$ is the initial state distribution, $latex B_t$ is the vector of emission densities for the states, $latex A$ is the stochastic transition matrix, and $latex 1$ is a vector of 1s. Qin et al.’s software uses an external package to differentiate the solution for $latex \pi$ as the stationary distribution for the transition matrix $latex A$, i.e., $latex \pi^{\top} \cdot A = \pi^{\top}.$
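Here’s that one-liner transcribed into R, with a tiny made-up two-state example (the transition matrix, initial distribution, and normal emission parameters below are mine, just to make the sketch runnable):

# pi: initial state distribution (length K); A: K x K transition matrix
# (rows sum to 1); B: T x K matrix with B[t, k] = p(y[t] | state k).
hmm_lik <- function(pi, A, B) {
  alpha <- pi * B[1, ]                        # pi^T diag(B_1)
  for (t in 2:nrow(B))
    alpha <- as.vector(alpha %*% A) * B[t, ]  # ... A diag(B_t)
  sum(alpha)                                  # ... times the vector of 1s
}

A  <- rbind(c(0.9, 0.1),
            c(0.2, 0.8))
pi <- c(0.5, 0.5)
y  <- c(-0.1, 0.3, 2.2, 1.9)
B  <- cbind(dnorm(y, 0, 1),   # state 1 emits normal(0, 1)
            dnorm(y, 2, 1))   # state 2 emits normal(2, 1)
hmm_lik(pi, A, B)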

The forward and backward algorithms are stated just as neatly, as are the derivatives of the likelihood w.r.t. the parameters. The authors put the likelihood and derivatives together to construct a quasi-Newton optimizer to fit maximum likelihood estimates of HMMs. They even use second derivatives for estimating standard errors. For Stan, we just need the derivatives to plug into our existing quasi-Newton solvers and Hamiltonian Monte Carlo.

But that’s not all. The paper’s about an application of HMMs to single-channel kinetics in chemistry, a topic about which I know nothing. The paper starts with a very nice overview of HMMs and why they’re being chosen for this chemistry problem. The paper ends with a wonderfully in-depth discussion of the statistical and computational properties of the model. Among the highlights is the joint modeling of multiple experimental data sets with varying measurement error.

In conclusion, if you want to understand HMMs and are comfortable with matrix derivatives, read this paper. Somehow the applied math crowd gets these algorithms down correctly and cleanly where the stats and computer science literatures flail in comparison.

Of course, for stability in avoiding underflow of the densities, we’ll need to work on the log scale. Or if we want to keep the matrix formulation, we can use the incremental rescaling trick to rescale the columns of the forward algorithm and accumulate our own exponent to avoid underflow. We’ll also have to autodiff through the solution to the stationary distribution algorithm, but Stan’s internals make that particular piece of plumbing easy to fit in and also a potential point of derivative optimization. We also want to generalize to the case where the transition matrix $latex A$ depends on predictors at each time step through a multi-logit regression. With that, we’d be able to fit anything that can be fit with the nicely designed and documented R package moveHMM, which can already be used in R to fit a range of maximum likelihood estimates for HMMs.
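Here’s what the incremental rescaling trick looks like in R, reusing the hmm_lik() inputs from the sketch above: renormalize the forward vector at every step and accumulate the log scale factors, which yields the log likelihood without underflow.

hmm_log_lik <- function(pi, A, B) {
  alpha <- pi * B[1, ]
  log_lik <- log(sum(alpha))
  alpha <- alpha / sum(alpha)                  # rescale the forward vector
  for (t in 2:nrow(B)) {
    alpha <- as.vector(alpha %*% A) * B[t, ]
    log_lik <- log_lik + log(sum(alpha))       # accumulate the log scale factor
    alpha <- alpha / sum(alpha)
  }
  log_lik
}

hmm_log_lik(pi, A, B)   # matches log(hmm_lik(pi, A, B)) for short series,
                        # but keeps working when T is long enough to underflow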

Econometrics postdoc and computational statistics postdoc openings here in the Stan group at Columbia

Andrew and I are looking to hire two postdocs to join the Stan group at Columbia starting January 2020. I want to emphasize that these are postdoc positions, not programmer positions. So while each position has a practical focus, our broader goal is to carry out high-impact, practical research that pushes the frontier of what’s possible in Bayesian modeling. This particular project is focused on extremely challenging econometric modeling problems and statistical computation and will be carried out in conjunction with some really great economists (details in the job descriptions below).

These positions are funded through a generous gift from the Alfred P. Sloan Foundation.

Computational statistics postdoc

The Stan group at Columbia is looking to hire a Postdoctoral Research Scholar to work on computational statistics. The goal of the project is to:

* develop algorithms for solving differential and algebraic equations, potentially stochastic and partial

* fit large-scale hierarchical models, either through core sampling improvements or through approximations such as nested Laplace or variational inference.

In both projects, there is wide latitude for extending the state of the art in computational statistics. The Stan project encompasses a team of dozens of developers distributed around the world and this work will be done in collaboration with that wider team. The wider team provides expertise in everything from numerical analysis and applied mathematics to programming language theory and parallel computation. The position is well funded to travel to conferences and visit collaborators.

The project is funded through a grant focused on Bayesian econometric modeling, which provides concrete applications that will provide a focus for the work as well as a second postdoc funded to develop those applications concurrently with developing the tools needed to extend the existing state of the art. The Stan group at Columbia is also working on applications of differential and algebraic equations in soil carbon modeling and pharmacology and applications of large scale hierarchical models in education and in survey sampling for political science.

The position will be housed in the Applied Statistics Center at Columbia University and supervised by Bob Carpenter. The initial appointment will be for 18 months (January 2020 through June 2022) with a possibility of extension.

Columbia is an EEO/AA employer

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Bob Carpenter, bob.carpenter@columbia.edu. The position is available starting in January 2020, and we will review applications as they arrive.

Econometrics Postdoc

The Stan group at Columbia is looking to hire a Postdoctoral Research Scholar to work on Bayesian econometric modeling and methodology. The goal is to create a bridge from modern econometric modeling to current Bayesian computational practice by generating a range of illustrative case studies illustrating best practices. Many of these best practices will need to be developed from the ground up and there is wide latitude for novel work.

This work will be carried out in collaboration with several economists and methodologists outside of Columbia University:

* Empirical auction analysis, where the theory around optimal design can be used to improve econometric methods used to draw inferences from the performance of real auctions in practice, including jointly modeling all components of a bidding system in order to test the structural assumptions driving mechanism design decisions. With Prof. Shoshanna Vasserman (Stanford)

* Bounded rationality and decision making in dynamic and stochastic environments, where macroeconomic models may be expressed in the form of dynamic, stochastic, general equilibrium models which can be extended to higher orders to model bounded rationality in agents making decisions in dynamic and stochastic environments. With Prof. Thomas Sargent, New York University.

* External validity of policy targeting for subgroups, with the goal of applying interventions where they will benefit the most while avoiding harming other subgroups, and a focus on combining data across multiple settings using meta-analysis. With Prof. Rachel Meager, London School of Economics.

* Causal models of interventions in education policy, where the focus is on time-series data organized by classroom, school, and larger groupings in the context of heterogeneous demographics. With Prof. Sophia Rabe-Hesketh, University of California, Berkeley.

Basic capabilities to fit these models exist in Stan currently and this grant will support a second postdoc to help extend those capabilities to more complicated systems.

The position will be housed in the Applied Statistics Center at Columbia University and supervised by Andrew Gelman. The initial appointment will be for 18 months (January 2020 through June 2022) with a possibility of extension.

Columbia is an EEO/AA employer

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Andrew Gelman, at gelman@stat.columbia.edu. The position is available starting January 1, 2020, and we will review applications as soon as they arrive.

Non-randomly missing data is hard, or why weights won’t solve your survey problems and you need to think generatively

Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley’s blog. Part of the problem is all the jargon in survey sampling—I couldn’t understand Lumley’s language of estimators and least squares; part of it is that missing data is hard.

The full data model

Imagine we have a very simple population of $latex N^{\textrm{pop}}$ items with normally distributed values and a standard deviation known to be 2,

$latex y_n \sim \textrm{normal}(\mu, 2) \ \textrm{for} \ n \in 1:N^{\textrm{pop}}.$

To complete the Bayesian model, we’ll assume a standard normal prior on $latex \mu$,

$latex \mu \sim \textrm{normal}(0, 1).$

Now we’re not going to observe all $latex y_n$, but only a sample of the $latex N^{\textrm{pop}}$ elements. If the model is correct, our inferences will be calibrated in expectation given a random sample of items $latex y_n$ from the population.

Missing data

Now let’s assume the sample of $latex y_n$ we observe is not drawn at random from the population. Imagine instead that we have a subset of $latex N$ items from the population, and for each item $latex n$, there is a probability $latex \pi_n$ that the item will be included in the sample. We’ll take the log odds of inclusion to be equal to the item’s value,

$latex \pi_n = \textrm{logit}^{-1}(y_n)$.

Now when we collect our sample, we’ll do something like poll $latex N = 2000$ people from the population, but each person $latex n$ only has a $latex \pi_n$ chance of responding. So we only wind up with $latex N^{\textrm{obs}}$ observations, with $latex N^{\textrm{miss}} = N - N^{\textrm{obs}}$ observations missing.

This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew’s post on pre-election polling, Don’t believe the bounce).

So how do we do the careful adjustment?

Approach 1: Weighted likelihood

A traditional approach is to inverse weight the log likelihood terms by the inclusion probability,

$latex \sum_{n = 1}^{N^{\textrm{obs}}} \frac{1}{\pi_n} \log \textrm{normal}(y_n \mid \mu, 2).$

Thus if an item has a 20% chance of being included, its weight is 5.

In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).

for (n in 1:N_obs)
  target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);

If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate $latex \hat{\mu}$ is the true value $latex \mu$). This is borne out in simulation.
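Here’s a quick R-only version of that check, maximizing the weighted log likelihood directly with optimize() rather than running Stan; the constants match the setup above (N = 2000, known sd of 2, standard normal draws for mu).

set.seed(1234)
inv_logit <- function(u) 1 / (1 + exp(-u))

weighted_fit <- function() {
  mu <- rnorm(1, 0, 1)
  y <- rnorm(2000, mu, 2)
  pi <- inv_logit(y)                       # inclusion probabilities
  obs <- rbinom(2000, 1, pi) == 1          # who responds
  y_obs <- y[obs]
  w <- 1 / pi[obs]                         # inverse-probability weights
  neg_wll <- function(m) -sum(w * dnorm(y_obs, m, 2, log = TRUE))
  c(mu = mu, mu_hat = optimize(neg_wll, c(-10, 10))$minimum)
}

sims <- t(replicate(500, weighted_fit()))
mean(sims[, "mu_hat"] - sims[, "mu"])      # near zero: roughly unbiased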

Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan’s blog post You better check yo self before you wreck yo self.

One reason the intervals are too narrow is that we are weighting the data as if we had observed $latex N$ items when we’ve only observed $latex N^{\textrm{obs}}$ items. That is, their weights are what we’d expect to get if we’d observed $latex N$ items.

So my next thought was to standardize. Let’s take the inverse weights and normalize so the sum of inverse weights is equal to $latex N^{\textrm{obs}}.$ That also fails. The posterior intervals are still too narrow under simulation.

Sure, we could keep fiddling weights in an ad hoc way for this problem until they were better calibrated empirically, but this is clearly the wrong approach. We’re Bayesians and should be thinking generatively. Maybe that’s why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).

Approach 2: Missing data

What is going on generatively? We poll $latex N$ people out of a population of $latex N^{\textrm{pop}}$, each of whom has a $latex \pi_n$ chance of responding, leading to a set of responses of size $latex N^{\textrm{obs}}.$

Given that we know how $latex \pi$ relates to $latex y$, we can just model everything (in the real world, this stuff is really hard and everything’s estimated jointly).

Specifically, the $latex N^{\textrm{miss}} = N - N^{\textrm{obs}}$ missing items each get parameters $latex y^{\textrm{miss}}_n$ representing how they would’ve responded had they responded. We also model response, so we have an extra term $latex \textrm{bernoulli}(0 \mid \textrm{logit}^{-1}(y_n^{\textrm{miss}}))$ for the unobserved values and an extra term $latex \textrm{bernoulli}(1 \mid \textrm{logit}^{-1}(y_n))$ for the observed values.

This works. Here’s the Stan program.

data {
  int N_miss;
  int N_obs;
  vector[N_obs] y_obs;
}
parameters {
  real mu;
  vector[N_miss] y_miss;
}
model {
  // prior
  mu ~ normal(0, 1);
  // observed data likelihood
  y_obs ~ normal(mu, 2);
  1 ~ bernoulli_logit(y_obs);
  // missing data likelihood and missingness
  y_miss ~ normal(mu, 2);
  0 ~ bernoulli_logit(y_miss);
}

The Bernoulli sampling statements are vectorized and repeated for each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale, and could have been written:

for (n in 1:N_miss)
  0 ~ bernoulli(inv_logit(y_miss[n]));

or even more explicitly,

for (n in 1:N_miss)
  target += bernoulli_lpmf(0 | inv_logit(y_miss[n]));

And here’s the simulation code, including a cheap run at SBC:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(logical = FALSE))

printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }
inv_logit <- function(u) 1 / (1 + exp(-u))

printf("Compiling model.")
model <- stan_model('missing.stan')

for (m in 1:20) {

# SIMULATE DATA
mu <- rnorm(1, 0, 1);
N_tot <- 1000
y <- rnorm(N_tot, mu, 2)
z <- rbinom(N_tot, 1, inv_logit(y))
y_obs <- y[z == 1]
N_obs <- length(y_obs)
N_miss <- N_tot - N_obs

# COMPILE AND FIT STAN MODEL
fit <- sampling(model,
                data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),
                chains = 1, iter = 5000, refresh = 0)
mu_ss <- extract(fit)$mu
mu_hat <- mean(mu_ss)
q25 <- quantile(mu_ss, 0.25)
q75 <- quantile(mu_ss, 0.75)
printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s;  mu_hat = %5.2f",
       mu, q25, q75, ifelse(q25 <= mu && mu <= q75, "yes", "no"), mean(mu_ss))

}

Here's some output with random seeds, with mu, mu_hat and 50% intervals and indicator of whether mu is in the 50% posterior interval.

mu =  0.60 in 50pct( 0.50,  0.60) =  no;  mu_hat =  0.55
mu = -0.73 in 50pct(-0.67, -0.56) =  no;  mu_hat = -0.62
mu =  1.13 in 50pct( 1.00,  1.10) =  no;  mu_hat =  1.05
mu =  1.71 in 50pct( 1.67,  1.76) = yes;  mu_hat =  1.71
mu =  0.03 in 50pct(-0.02,  0.08) = yes;  mu_hat =  0.03
mu =  0.80 in 50pct( 0.76,  0.86) = yes;  mu_hat =  0.81

The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.

Exercise

How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-respondents were approached on the way to 1000 respondents. This would be tricky in Stan as we don't have discrete parameter sampling---it'd have to be marginalized out.

Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.

Seeking postdoc (or contractor) for next generation Stan language research and development

The Stan group at Columbia is looking to hire a postdoc* to work on the next generation compiler for the Stan open-source probabilistic programming language. Ideally, a candidate will bring language development experience and also have research interests in a related field such as programming languages, applied statistics, numerical analysis, or statistical computation.

The language features on the roadmap include lambdas with closures, sparse matrices and vectors, ragged arrays, tuples and structs, user-defined Jacobians, and variadic functions. The parser, intermediate representation, and code generation are written in OCaml using the Menhir compiler framework. The code is hosted on GitHub in the stanc3 repo; the current design documents are in the design docs repo. The generated code is templated C++ that depends on the automatic differentiation framework in the Stan math library and is used by Stan’s statistical inference algorithms.

The research and development for Stan will be carried out in collaboration with the larger Stan development team, which includes a large group of friendly and constructive collaborators within and outside of Columbia University. In addition to software development, the team has a diverse set of research interests in programming language semantics, applied statistics, statistical methodology, and statistical computation. Of particular relevance to this project is foundational theory work on programming language semantics for differentiable and probabilistic programming languages.

The position would be housed in the Applied Statistics Center at Columbia University and supervised by Bob Carpenter. The initial appointment will be for one year with a possible reappointment for a second year.

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Bob Carpenter, carp@alias-i.com. The position is available immediately and we will review applications as they arrive.

Thanks to Schmidt Futures for the funding to make this possible!


* We could also hire a contractor on an hourly basis. For that, I’d be looking for someone with experience who could hit the ground running with the OCaml code.

(Markov chain) Monte Carlo doesn’t “explore the posterior”

[Edit: (1) There’s nothing dependent on Markov chain—the argument applies to any Monte Carlo method in high dimensions. (2) No, (MC)MC is not broken.]

First some background, then the bad news, and finally the good news.

Spoiler alert: The bad news is that exploring the posterior is intractable; the good news is that we don’t need to explore all of it to calculate expectations.

Sampling to characterize the posterior

There’s a misconception among Markov chain Monte Carlo (MCMC) practitioners that the purpose of sampling is to explore the posterior. For example, I’m writing up some reproducible notes on probability theory and statistics through sampling (in pseudocode with R implementations) and have just come to the point where I’ve introduced and implemented Metropolis and want to use it to exemplify convergence monitoring. So I did what any right-thinking student would do and borrowed one of my mentor’s diagrams (which is why this will look familiar if you’ve read the convergence monitoring section of Bayesian Data Analysis 3).

First M steps of isotropic random-walk Metropolis with proposal scale normal(0, 0.2) targeting a bivariate normal with unit variance and 0.9 correlation. After 50 iterations, we haven’t found the typical set, but after 500 iterations we have. Then after 5000 iterations, everything seems to have mixed nicely through this two-dimensional example.

This two-dimensional traceplot gives the misleading impression that the goal is to make sure each chain has moved through the posterior. This low-dimensional thinking is nothing but a trap in higher dimensions. Don’t fall for it!
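For reference, here’s a minimal R version of the sampler behind that kind of plot (not the exact code used for the figure): isotropic random-walk Metropolis with a normal(0, 0.2) proposal targeting a unit-variance bivariate normal with 0.9 correlation.

set.seed(1234)
log_p <- function(theta, rho = 0.9) {   # target log density up to a constant
  -(theta[1]^2 - 2 * rho * theta[1] * theta[2] + theta[2]^2) / (2 * (1 - rho^2))
}

M <- 5000
theta <- matrix(NA_real_, M, 2)
theta[1, ] <- c(-2.5, 2.5)              # deliberately bad initialization
for (m in 2:M) {
  proposal <- theta[m - 1, ] + rnorm(2, 0, 0.2)
  log_accept <- log_p(proposal) - log_p(theta[m - 1, ])
  theta[m, ] <- if (log(runif(1)) < log_accept) proposal else theta[m - 1, ]
}

plot(theta[, 1], theta[, 2], type = "o", cex = 0.3,
     xlab = expression(theta[1]), ylab = expression(theta[2]))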

Bad news from higher dimensions

It’s simply intractable to “cover the posterior” in high dimensions. Consider a 20-dimensional standard normal distribution. There are 20 variables, each of which may be positive or negative, leading to a total of $latex 2^{20}$, or more than a million, orthants (generalizations of quadrants). In 30 dimensions, that’s more than a billion. You get the picture—the number of orthants grows exponentially, so we’ll never cover them all explicitly through sampling.
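A quick way to see this in R is to count the distinct orthants (sign patterns) hit by 10,000 draws from a 20-dimensional standard normal. Nearly every draw lands in its own orthant, and together they touch only about 1% of the $latex 2^{20}$ orthants.

set.seed(1234)
D <- 20
M <- 10000
draws <- matrix(rnorm(M * D), M, D)
orthant <- apply(draws > 0, 1, function(s) paste(as.integer(s), collapse = ""))
length(unique(orthant))   # distinct orthants visited by the M draws
2^D                       # total number of orthants (1,048,576)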

Good news in expectation

Bayesian inference is based on probability, which means integrating over the posterior density. This boils down to computing expectations of functions of parameters conditioned on data. This we can do.

For example, we can construct point estimates that minimize expected square error by using posterior means, which are just expectations conditioned on data, which are in turn integrals, which can be estimated via MCMC,

$latex \begin{array}{rcl} \hat{\theta} & = & \mathbb{E}[\theta \mid y] \\[8pt] & = & \int_{\Theta} \theta \times p(\theta \mid y) \, \mbox{d}\theta \\[8pt] & \approx & \frac{1}{M} \sum_{m=1}^M \theta^{(m)},\end{array}$

where $latex \theta^{(1)}, \ldots, \theta^{(M)}$ are draws from the posterior $latex p(\theta \mid y).$

If we want to calculate predictions, we do so by using sampling to calculate the integral required for the expectation,

$latex p(\tilde{y} \mid y) \ = \ \mathbb{E}[p(\tilde{y} \mid \theta) \mid y] \ \approx \ \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}).$

If we want to calculate event probabilities, it’s just the expectation of an indicator function, which we can calculate through sampling, e.g.,

$latex \mbox{Pr}[\theta_1 > \theta_2] \ = \ \mathbb{E}\left[\mathrm{I}[\theta_1 > \theta_2] \mid y\right] \ \approx \ \frac{1}{M} \sum_{m=1}^M \mathrm{I}[\theta_1^{(m)} > \theta_2^{(m)}].$

The good news is that we don’t need to visit the entire posterior to compute these expectations to within a few decimal places of accuracy. Even so, MCMC isn’t magic—those two or three decimal places will be zeroes for tail probabilities.
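Here’s what those three Monte Carlo estimates look like in R, using independent draws from a made-up posterior as stand-ins for MCMC output, and assuming a normal(theta, 1) sampling distribution for the predictive example:

set.seed(1234)
M <- 10000
theta1 <- rnorm(M, 1.0, 0.5)     # stand-in draws from p(theta1 | y)
theta2 <- rnorm(M, 0.8, 0.5)     # stand-in draws from p(theta2 | y)

mean(theta1)                     # posterior mean E[theta1 | y]
mean(dnorm(1.5, theta1, 1))      # predictive density p(y_tilde = 1.5 | y)
mean(theta1 > theta2)            # event probability Pr[theta1 > theta2 | y]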

NYC Meetup Thursday: Under the hood: Stan’s library, language, and algorithms

I (Bob, not Andrew!) will be doing a meetup talk this coming Thursday in New York City. Here’s the link with registration and location and time details (summary: pizza unboxing at 6:30 pm in SoHo):

After summarizing what Stan does, this talk will focus on how Stan is engineered. The talk follows the organization of the Stan software.

Stan math library: differentiable math and stats functions, template metaprograms to manage constants and vectorization, matrix derivatives, and differential equation derivatives.

Stan language: block structure and execution, unconstraining variable transforms and automatic Jacobians, transformed data, parameters, and generated quantities execution.

Stan algorithms: Hamiltonian Monte Carlo and the no-U-turn sampler (NUTS), automatic differentiation variational inference (ADVI).

Stan infrastructure and process: Time permitting, I can also discuss Stan’s developer process, how the code repositories are organized, and the code review and continuous integration process for getting new code into the repository.

Slides

I realized I’m missing a good illustration of NUTS and how it achieves detailed balance and preferentially selects positions on the Hamiltonian trajectory toward the end of the simulated dynamics (to minimize autocorrelation in the draws). It was only an hour, so I skipped the autodiff section and scalable algorithms section and jumped to the end. I’ll volunteer to do another meetup with the second half of the talk.

StanCon Helsinki streaming live now (and tomorrow)

We’re streaming live right now!

Timezone is Eastern European Summer Time (EEST) +0300 UTC

Here’s a link to the full program [link fixed].

There have already been some great talks and they’ll all be posted with slides and runnable source code after the conference on the Stan web site.

Thanks, NVIDIA

Andrew and I both received a note like this from NVIDIA:

We have reviewed your NVIDIA GPU Grant Request and are happy to support your work with the donation of (1) Titan Xp to support your research.

Thanks!

In case other people are interested, NVIDIA’s GPU grant program provides ways for faculty or research scientists to request GPUs; they also have graduate fellowships and larger programs.

Stan on the GPU

The pull requests are stacked up and being reviewed and integrated into the testing framework as I write this. Stan 2.19 (or 2.20 if we get out a quick 2.19 in the next month) will have OpenCL-based GPU support for double-precision matrix operations like multiplication and Cholesky decomposition. And the GPU speedups are stackable with the multi-core MPI speedups that just came out in CmdStan 2.18 (RStan and PyStan 2.18 are in process and will be out soon).

Plot of GPU timing

Figure 1. The plot shows the latest performance figures for Cholesky factorization; the X-axis is the matrix dimensionality and the Y-axis the speedup vs. the regular Cholesky factorization. I’m afraid I don’t know which CPU/GPU combo this was tested on.

Academic hardware grants

I’ve spent my academic career coasting on donated hardware back when hardware was a bigger deal. It started at Edinburgh in the mid-80s with a Sun Workstation I donated to our department. LaTeX on the big screen was just game changing over working on a terminal then printing the postscript. Then we got Dandelions from Xerox (crazy Lisp machines with a do-what-I-mean command line), continued with really great HP Unix workstations at Carnegie Mellon that had crazy high-res CRT monitors for the late ’80s. Then I went into industry, where we had to pay for hardware. Now that I’m back in academia, I’m glad to see there are still hardware grant programs.

Stan development is global

We’re also psyched that so much core Stan development is coming from outside of Columbia. For the core GPU developers, Steve Bronder is at Capital One and Erik Štrumbelj and Rok Češnovar are at the University of Ljubljana. Erik’s the one who tipped me off about the NVIDIA GPU Grant program.

Daniel Lee is also helping out with the builds and testing and coding standards, and he’s at Generable. Sean Talts is also working on the integration here at Columbia; he played a key design role in the recent MPI launch, which was largely coded by Sebastian Weber at Novartis in Basel.

Where do I learn about log_sum_exp, log1p, lccdf, and other numerical analysis tricks?

Richard McElreath inquires:

I was helping a colleague recently fix his MATLAB code by using log_sum_exp and log1m tricks. The natural question he had was, “where do you learn this stuff?”

I checked Numerical Recipes, but the statistical parts are actually pretty thin (at least in my 1994 edition).

Do you know of any books/papers that describe these techniques?

I’d love to hear this blog’s answers to these questions.

I replied that I learned numerical analysis “on the street” through HMM implementations. HMMs are also a good introduction to the kind of dynamic programming technique I used for that Poisson-binomial implementation we discussed (which we’ll build into Stan one of these days—it’ll be a fun project for someone). Then I picked up the rest through a hodge-podge of case-based learning.

“Numerical analysis” is the name of the field and the textbooks where you’ll learn log_sum_exp and log1p and complementary cdfs and learn how 0 is so very different than 1 (the smallest double-precision floating point value greater than zero is around 10^-300, whereas the largest double-precision value less than 1 is about 1 - 10^-16), which is rather relevant for statistical computation. You’ll also learn about catastrophic cancellation (which makes naive variance calculations so unstable) and things like the stable Welford algorithm for calculating variance, which also has the nice property of behaving as a streaming accumulator (i.e., it’s memoryless). I don’t know which books are good, but there are lots of web sites and course materials you can try.
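For the curious, here’s what those tricks look like in a few lines of R (a sketch, not library code):

# log_sum_exp: log(exp(a) + exp(b) + ...) without overflow or underflow,
# by factoring out the max.
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}
log_sum_exp(c(-1000, -1001))    # naive log(sum(exp(x))) underflows to -Inf

# log1p(x) = log(1 + x) accurately for tiny x (built into base R);
# log1m is the same idea for log(1 - x), handy for complementary probabilities.
log1m <- function(x) log1p(-x)
log1m(1e-20)                    # naive log(1 - 1e-20) rounds to 0

# Welford's streaming accumulator for the variance: one pass, constant memory,
# no catastrophic cancellation.
welford_var <- function(x) {
  n <- 0; m <- 0; M2 <- 0
  for (xi in x) {
    n <- n + 1
    delta <- xi - m
    m <- m + delta / n
    M2 <- M2 + delta * (xi - m)
  }
  M2 / (n - 1)
}
x <- rnorm(1e5, mean = 1e9, sd = 1)   # big mean is what breaks the naive E[x^2] - E[x]^2 formula
c(welford = welford_var(x), builtin = var(x))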

The more advanced versions of this will be about matrices and how to maintain stability of iterative algorithms. Things like pivoting LL^t decompositions and how to do stable matrix division. A lot of that’s also about how to deal with caching in memory with blocking algorithms to do this efficiently. A decent matrix multiplier will be more than an order of magnitude faster than a naive approach on large matrices.

“Algorithms and data structures” is the CS class where you learn about things like dynamic programming (e.g., how to calculate HMM likelihoods, fast Fourier transforms, and matrix multiplication ordering).

Algorithms class won’t typically get into the low-level caching and branch-point prediction stuff you need to know to build something like Stan efficiently. There, you need to start diving into the compiler literature and the generated assembly and machine code. I can highly recommend Agner Fog’s overviews on C++ optimization—they’re free and cover most of what you need to know to start thinking about writing efficient C++ (or Fortran—the game’s a bit different with statically typed functional languages like ML).

The 1D integrator in Stan (probably landing in 2.19—there are a few kinks to work out in Ben Bales’s math lib code) uses an input that provides both the integrated value and its complement (x and the closest boundary of the integration minus x). Ben Goodrich helped a lot, as usual, with these complicated numerical things. The result is an integrator with enough precision to integrate the beta distribution between 0 and 1 (the trick is the asymptote at 1).

Integration in general is another advanced numerical analysis field with tons of great references on error accumulation. Leimkuhler and Reich is the usual intro reference that’s specific to Hamiltonian systems; we use the leapfrog (Störmer-Verlet) integrator for NUTS and this book has a nice analysis. We’re looking now into some implicit algorithms to deal with “stiff” systems that cause relatively simple explicit algorithms like Runge-Kutta to require step sizes so small as to be impractical; we already offer them within Stan for dynamics modeling (the _bdf integrators). Hairer et al. is the more mathematically advanced reference for integrators. There are tons of great course notes and applied mathematics books out there for implementing Euler, implicit Euler, Runge-Kutta, Adams-Moulton, implicit midpoint, etc., all of which have different error and symplecticness properties, which heavily tie into implementing efficient Hamiltonian dynamics. Yi Zhang at Metrum is now working on improving our underlying algorithms and adding partial differential equation solvers. Now I have a whole new class of algorithms to learn.
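To give a flavor of the leapfrog (Störmer-Verlet) integrator mentioned above, here’s a toy R implementation for a one-dimensional harmonic oscillator with H(q, p) = q^2/2 + p^2/2 (my own sketch); the point is that the Hamiltonian is nearly conserved even over long trajectories, which is what makes the integrator suitable for Hamiltonian Monte Carlo.

leapfrog <- function(q, p, grad_U, eps, L) {
  for (l in 1:L) {
    p <- p - 0.5 * eps * grad_U(q)   # half step for momentum
    q <- q + eps * p                 # full step for position
    p <- p - 0.5 * eps * grad_U(q)   # half step for momentum
  }
  list(q = q, p = p)
}

grad_U <- function(q) q              # potential U(q) = q^2 / 2
z <- leapfrog(q = 1, p = 0, grad_U, eps = 0.1, L = 100)
c(H_start = 1^2 / 2 + 0^2 / 2,
  H_end   = z$q^2 / 2 + z$p^2 / 2)   # nearly equal: small, oscillating error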

So much for my getting away from Columbia after I “learned statistics”. I should at least record the half-lecture I do on this topic for Andrew’s stats communication class (the other half of the class I do focuses on writing API specs). I figure it’s communicating with the computer and communicating with users, but at least one student per year walks out in disgust at my stretching the topic so broadly to include this computer sciency stuff.

The current state of the Stan ecosystem in R

(This post is by Jonah)

Last week I posted here about the release of version 2.0.0 of the loo R package, but there have been a few other recent releases and updates worth mentioning. At the end of the post I also include some general thoughts on R package development with Stan and the growing number of Stan users who are releasing their own packages interfacing with rstan or one of our other packages.

Interfaces

rstanarm and brms: Version 2.17.4 of rstanarm and version 2.2.0 of brms were both released to provide compatibility with the new features in loo v2.0.0. Two of the new vignettes for the loo package show how to use it with rstanarm models, and we have also just released a draft of a vignette on how to use loo with brms and rstan for many “non-factorizable” models (i.e., observations not conditionally independent). brms is also now officially supported by the Stan Development Team (welcome Paul!) and there is a new category for it on the Stan Forums.

rstan: The next release of the rstan package (v2.18) is not out yet (we need to get Stan 2.18 out first), but it will include a loo() method for stanfit objects in order to save users a bit of work. Unfortunately, we can’t save you the trouble of having to compute the pointwise log-likelihood in your Stan program though! There will also be some new functions that make it a bit easier to extract HMC/NUTS diagnostics (thanks to a contribution from Martin Modrák).
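In the meantime the usual pattern looks something like the sketch below, where model.stan and stan_data are placeholders and the Stan program is assumed to declare a vector[N] log_lik in its generated quantities block:

library(rstan)
library(loo)

fit <- stan("model.stan", data = stan_data)            # your model and data
log_lik <- extract_log_lik(fit, merge_chains = FALSE)  # iterations x chains x N array
r_eff <- relative_eff(exp(log_lik))                    # adjusts for MCMC efficiency
print(loo(log_lik, r_eff = r_eff))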

Visualization

bayesplot: A few weeks ago we released version 1.5.0 of the bayesplot package (mc-stan.org/bayesplot), which also integrates nicely with loo 2.0.0. In particular, the diagnostic plots using the leave-one-out cross-validated probability integral transform (LOO-PIT) from our paper Visualization in Bayesian Workflow (preprint on arXiv, code on GitHub) are easier to make with the latest bayesplot release. Also, TJ Mahr continues to improve the bayesplot experience for ggplot2 users by adding (among other things) more functions that return the data used for plotting in a tidy data frame.
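As a rough sketch of what a LOO-PIT check looks like with an rstanarm model (I’m assuming bayesplot 1.5.0’s ppc_loo_pit_overlay() here, so check the package documentation for the exact arguments):

library(rstanarm)
library(bayesplot)
library(loo)

fit <- stan_glm(mpg ~ wt, data = mtcars)
psis_fit <- psis(-log_lik(fit))          # smoothed importance weights for LOO
ppc_loo_pit_overlay(
  y = mtcars$mpg,
  yrep = posterior_predict(fit),
  lw = weights(psis_fit)                 # log weights from PSIS
)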

shinystan: Unfortunately, there hasn’t been a shinystan (mc-stan.org/shinystan) release in a while because I’ve been busy with all of these other packages, papers, and various other Stan-related things. We’ll try to get out a release with a few bug fixes soon. (If you’re annoyed by the lack of new features in shinystan recently, let me know and I will try to convince you to help me solve that problem!)

(Update: I forgot to mention that despite the lack of shinystan releases, we’ve been working on better introductory materials. To that end, Chelsea Muth, Zita Oravecz, and I recently published an article User-friendly Bayesian regression modeling: A tutorial with rstanarm and shinystan (view).)

Other tools

loo: We released version 2.0.0, a major update to the loo package (mc-stan.org/loo). See my previous blog post.

projpred: Version 0.8.0 of the projpred package (mc-stan.org/projpred) for projection predictive variable selection for GLMs was also released shortly after the loo update in order to take advantage of the improvements to the Pareto smoothed importance sampling algorithm. projpred can already be used quite easily with rstanarm models and we are working on improving its compatibility with other packages for fitting Stan models.

rstantools: Unrelated to the loo update, we also released version 1.5.0 of the rstantools package (mc-stan.org/rstantools), which provides functions for setting up R packages interfacing with Stan. The major changes in this release are that usethis::create_package() is now called to set up the package (instead of utils::package.skeleton), fewer manual changes to files are required by users after calling rstan_package_skeleton(), and we have a new vignette walking through the process of setting up a package (thanks Stefan Siegert!). Work is being done to keep improving this process, so be on the lookout for more updates soonish.
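If you haven’t set up such a package before, the process now starts with something like the following (the package name and Stan file are made up; the new vignette walks through the rest):

library(rstantools)
rstan_package_skeleton(path = "rstanlm", stan_files = "lm.stan")
# then follow the instructions printed to the console (and the vignette)
# to finish configuring, building, and installing the package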

Stan related R packages from other developers

There are now well over fifty packages on CRAN that depend in some way on one of our R packages mentioned above! You can find most of them by looking at the “Reverse dependencies” section on the CRAN page for rstan, but that doesn’t count the ones that depend on bayesplot, shinystan, loo, etc., but not rstan.

Unfortunately, given the growing number of these packages, we haven’t been able to look at each one of them in detail. For obvious reasons we prioritize giving feedback to developers who reach out to us directly to ask for comments and to those developers who make an effort to follow our recommendations for developers of R packages interfacing with Stan (included with the rstantools package since its initial release in 2016). If you are developing one of these packages and would like feedback, please let us know on the Stan Forums. Our time is limited but we really do make a serious effort to answer every single question asked on the forums (thank you to the many Stan users who also volunteer their time helping on the forums!).

My primary feelings about this trend of developing Stan-based R packages are ones of excitement and gratification. It’s really such an honor to have so many people developing these packages based on all the work we’ve done! There are also a few things I’ve noticed that I hope will change going forward. I’ll wrap up this post by highlighting two of these issues that I hope developers will take seriously:

(1) Unit testing

(2) Naming user-facing functions

The number of these packages that have no unit tests (or very scant testing) is a bit scary. Unit tests won’t catch every possible bug (we have lots of tests for our packages and people still find bugs all the time), but there is really no excuse for not unit testing a package that you want other people to use. If you care enough to do everything required to create your package and get it on CRAN, and if you care about your users, then I think it’s fair to say that you should care enough to write tests for your package. And there’s really no excuse these days with the availability of packages like testthat to make this process easier than it used to be! Can anyone think of a reasonable excuse for not unit testing a package before releasing it to CRAN and expecting people to use it? (Not a rhetorical question. I really am curious given that it seems to be relatively common or at least not uncommon.) I don’t mean to be too negative here. There are also many packages that seem to have strong testing in place! My motivation for bringing up this issue is that it is in the best interest of our users.
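To be concrete, a minimal testthat setup is just a tests/testthat.R file that calls test_check() plus one or more test files; the package and function names below are placeholders:

# tests/testthat.R
library(testthat)
library(yourpackage)
test_check("yourpackage")

# tests/testthat/test-col_means.R
test_that("col_means matches colMeans on numeric data frames", {
  df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
  expect_equal(col_means(df), colMeans(df))
})

test_that("col_means errors on non-data-frame input", {
  expect_error(col_means(1:10))
})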

Regarding function naming: this isn’t nearly as big of a deal as unit testing, it’s just something I think developers (including myself) of packages in the Stan R ecosystem can do to make the experience better for our users. rstanarm and brms both import the generic functions included with rstantools in order to be able to define methods with consistent names. For example, whether you fit a model with rstanarm or with brms, you can call log_lik() on the fitted model object to get the pointwise log-likelihood (it’s true that we still have a bit left to do to get the names across rstanarm and brms more standardized, but we’re actively working on it). If you are developing a package that fits models using Stan, we hope you will join us in trying to make it as easy as possible for users to navigate the Stan ecosystem in R.
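For example (a sketch using a small built-in dataset just for illustration), the same generic works regardless of which package did the fitting:

library(rstanarm)
fit1 <- stan_glm(mpg ~ wt, data = mtcars)
ll1 <- log_lik(fit1)   # posterior draws x observations matrix

library(brms)
fit2 <- brm(mpg ~ wt, data = mtcars)
ll2 <- log_lik(fit2)   # same shape, same generic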

loo 2.0 is loose

This post is by Jonah and Aki.

We’re happy to announce the release of v2.0.0 of the loo R package for efficient approximate leave-one-out cross-validation (and more). For anyone unfamiliar with the package, the original motivation for its development is in our paper:

Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432. doi:10.1007/s11222-016-9696-4. (published version, arXiv preprint)

Version 2.0.0 is a major update (release notes) to the package that we’ve been working on for quite some time and in this post we’ll highlight some of the most important improvements. Soon I (Jonah) will follow up with a post about important new developments in our various other R packages.

New interface, vignettes, and more helper functions to make the package easier to use

Because of certain improvements to the algorithms and diagnostics (summarized below), the interfaces, i.e., the loo() and psis() functions and the objects they return, also needed some improvement. (Click on the function names in the previous sentence to see their new documentation pages.) Other related packages in the Stan R ecosystem (e.g., rstanarm, brms, bayesplot, projpred) have also been updated to integrate seamlessly with loo v2.0.0. (Apologies to anyone who happened to install the update during the short window between the loo release and when the compatible rstanarm/brms binaries became available on CRAN.)

Three vignettes now come with the loo package and are also available (and more nicely formatted) online at mc-stan.org/loo/articles:

  • Using the loo package (version >= 2.0.0) (view)
  • Bayesian Stacking and Pseudo-BMA weights using the loo package (view)
  • Writing Stan programs for use with the loo package (view)

A vignette about K-fold cross-validation using new K-fold helper functions will be included in a subsequent update. Since the last release of loo we have also written a paper, Visualization in Bayesian workflow, that includes several visualizations based on computations from loo.

Improvements to the PSIS algorithm, effective sample sizes and MC errors

The approximate leave-one-out cross-validation performed by the loo package depends on Pareto smoothed importance sampling (PSIS). In loo v2.0.0, the PSIS algorithm (psis() function) corresponds to the algorithm in the most recent update to our PSIS paper, including adapting the Pareto fit with respect to the effective sample size and using a weakly informative prior to reduce the variance for small effective sample sizes. (I believe we’ll be updating the paper again with some proofs from new coauthors.)
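For anyone using psis() directly rather than through loo(), the interface is small. Here’s a sketch with simulated log importance ratios (the numbers are made up, just to show the calls):

library(loo)
set.seed(1)
log_ratios <- -rexp(1000)                 # fake log importance ratios
psis_result <- psis(log_ratios, r_eff = 1)
lw <- weights(psis_result)                # smoothed, normalized log weights
pareto_k_values(psis_result)              # the k diagnostic discussed below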

For users of the loo package for PSIS-LOO cross-validation, and not just the PSIS algorithm for importance sampling, an even more important update is that the latest version of the same PSIS paper referenced above describes how to compute the effective sample size estimate and Monte Carlo error for the PSIS estimate of elpd_loo (expected log predictive density for new data). Thus, in addition to the Pareto k diagnostic (an indicator of convergence rate – see the paper) already available in previous loo versions, we now also report an effective sample size that takes into account both the MCMC efficiency and the importance sampling efficiency. Here’s an example of what the diagnostic output table from loo v2.0.0 looks like (the particular intervals chosen for binning the k values are explained in the papers and the package documentation):

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     240   91.6%   205
 (0.5, 0.7]   (ok)         7    2.7%   48
   (0.7, 1]   (bad)        8    3.1%   7
   (1, Inf)   (very bad)   7    2.7%   1

We also compute and report the Monte Carlo SE of elpd_loo to give an estimate of its accuracy. If any k > 1 (which means the PSIS-LOO approximation is not reliable, as in the example above), NA will be reported for the Monte Carlo SE. We hope that showing the relationship between the k diagnostic, the effective sample size, and the MCSE of elpd_loo will make the diagnostics easier to interpret than in previous versions of loo, which only reported the k diagnostic. This particular example is taken from one of the new vignettes, which uses it as part of a comparison of unstable and stable PSIS-LOO behavior.

Weights for model averaging: Bayesian stacking, pseudo-BMA and pseudo-BMA+

Another major addition is the loo_model_weights() function, which, thanks to the contributions of Yuling Yao, can be used to compute weights for model averaging or selection. loo_model_weights() provides a user-friendly interface to the new stacking_weights() and pseudobma_weights() functions, which implement the methods from Using stacking to average Bayesian predictive distributions (Yao et al., 2018). As shown in the paper, Bayesian stacking (the default for loo_model_weights()) provides better model averaging performance than “Akaike style” weights. However, the loo package also includes Pseudo-BMA weights (PSIS-LOO based “Akaike style” weights) and Pseudo-BMA+ weights, which are similar to Pseudo-BMA weights but use a so-called Bayesian bootstrap procedure to better account for uncertainty. We recommend the Pseudo-BMA+ method over, for example, WAIC weights, although we prefer the stacking method to both. In addition to the Yao et al. paper, the new vignette about computing model weights demonstrates some of the motivation for our preference for stacking when appropriate.
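Usage looks roughly like this (a sketch with two rstanarm models; stacking is the default method, and method = "pseudobma" with the default BB = TRUE gives Pseudo-BMA+):

library(rstanarm)
library(loo)

fit1 <- stan_glm(mpg ~ wt, data = mtcars)
fit2 <- stan_glm(mpg ~ wt + am, data = mtcars)
loo1 <- loo(fit1)
loo2 <- loo(fit2)

loo_model_weights(list(model1 = loo1, model2 = loo2))                       # stacking
loo_model_weights(list(model1 = loo1, model2 = loo2), method = "pseudobma") # Pseudo-BMA+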

Give it a try

You can install loo v2.0.0 from CRAN with install.packages("loo"). Additionally, reinstalling an interface that provides loo functionality (e.g., rstanarm, brms) will automatically update your loo installation. The loo website with online documentation is mc-stan.org/loo and you can report a bug or request a feature on GitHub.

Three new domain-specific (embedded) languages with a Stan backend

One is an accident. Two is a coincidence. Three is a pattern.

Perhaps it’s no coincidence that there are three new interfaces that use Stan’s C++ implementation of adaptive Hamiltonian Monte Carlo (currently an updated version of the no-U-turn sampler).

  • ScalaStan embeds a Stan-like language in Scala. It’s a Scala package largely (if not entirely) written by Joe Wingbermuehle.
    [GitHub link]

  • tmbstan lets you fit TMB models with Stan. It’s an R package listing Kasper Kristensen as author.
    [CRAN link]

  • SlicStan is a “blockless” and self-optimizing version of Stan. It’s a standalone language, implemented in F# and written by Maria Gorinova.
    [pdf language spec]

These are in contrast with systems that entirely reimplement a version of the no-U-turn sampler, such as PyMC3, ADMB, and NONMEM.

StanCon is next week, Jan 10-12, 2018

It looks pretty cool!

Wednesday, Jan 10

Invited Talk: Predictive information criteria in hierarchical Bayesian models for clustered data. Sophia Rabe-Hesketh and Daniel Furr (U California, Berkeley) 10:40-11:30am

Does the New York City Police Department rely on quotas? Jonathan Auerbach (Columbia U) 11:30-11:50am

Bayesian estimation of mechanical elastic constants. Ben Bales, Brent Goodlet, Tresa Pollock, Linda Petzold (UC Santa Barbara) 11:50am-12:10pm

Joint longitudinal and time-to-event models via Stan. Sam Brilleman, Michael Crowther, Margarita Moreno-Betancur, Jacqueline Buros Novik, Rory Wolfe (Monash U, Columbia U) 12:10-12:30pm
Lunch 12:30-2:00pm

ScalaStan. Joe Wingbermuehle (Cibo Technologies) 2:00-2:20pm
A tutorial on Hidden Markov Models using Stan. Luis Damiano, Brian Peterson, Michael Weylandt 2:20-2:40pm

Student Ornstein-Uhlenbeck models served three ways (with applications for population dynamics data). Aaron Goodman (Stanford U) 2:40-3:00pm

SlicStan: a blockless Stan-like language. Maria I. Gorinova, Andrew D. Gordon, Charles Sutton (U of Edinburgh) 3:00-3:20pm
Break 3:20-4:00pm

Invited Talk: Talia Weiss (MIT) 4:00-4:50pm

Thursday, Jan 11

Invited Talk: Sean Taylor and Ben Letham (Facebook) 10:40-11:30am

NPCompare: a package for nonparametric density estimation and two populations comparison built on top of PyStan. Marco Inacio (U of São Paulo/UFSCar) 11:30-11:50am

Introducing idealstan, an R package for ideal point modeling with Stan. Robert Kubinec (U of Virginia) 11:50am-12:10pm

A brief history of Stan. Daniel Lee (Generable) 12:10-12:30pm
Lunch 12:30-1:30pm

Computing steady states with Stan’s nonlinear algebraic solver. Charles C. Margossian (Metrum, Columbia U) 1:30-1:50pm

Flexible modeling of Alzheimer’s disease progression with I-Splines. Arya A. Pourzanjani, Benjamin B. Bales, Linda R. Petzold, Michael Harrington (UC Santa Barbara) 1:50-2:10pm

Intrinsic Auto-Regressive (ICAR) Models for Spatial Data, Mitzi Morris (Columbia U) 2:10-2:30pm

Modeling/Data Session + Classes 2:30-4:10pm

Open session for consultations on modeling and data problems with Stan developers and modelers. 2:30-4:10pm

Session 3 of Intro to Stan 2:30-4:10pm

Have I converged successfully? How to verify fit and diagnose fit problems. Bob Carpenter 2:30-3:30pm

What is new to Stan 3:30-4:10pm

Invited Talk: Manuel Rivas (Stanford U) 4:00-4:50pm

Friday, Jan 12

Invited Talk: Susan Holmes (Stanford U) 10:40-11:30am

Aggregate random coefficients logit — a generative approach. Jim Savage, Shoshana Vasserman 11:30-11:50am

The threshold test: Testing for racial bias in vehicle searches by police. Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, Emma Pierson (Stanford U) 11:50am-12:10pm

Assessing the safety of Rosiglitazone for the treatment of type II diabetes. Konstantinos Vamvourellis, K. Kalogeropoulos, L. Phillips (London School of Economics and Political Science) 12:10-12:30pm
Lunch 12:30-1:30pm

Causal inference with the g-formula in Stan. Leah Comment (Harvard U) 1:30-1:50pm
Bayesian estimation of ETAS models with Rstan. Fausto Fabian Crespo Fernandez (Universidad San Francisco de Quito) 1:50-2:10pm

Invited Talk: Andrew Gelman (Columbia U, virtual) 2:10-3:00pm

Classes/Tutorials

We have tutorials that start at the crack of 8am for those desiring further edification beyond the program—these do not run in parallel with the main session but do run in parallel with each other:

Introduction to Stan: Know how to program? Know basic statistics? Curious about Bayesian analysis and Stan? This is the course for you. Hands-on, focused, and an excellent way to get started working in Stan. Two hours every day, 6 hours total. Jonah Sol Gabry.

Executive decision making the Bayesian way: For non-technical managers, and for technical folks who need to communicate with managers, covering the core of decision making under uncertainty. One hour every day. Jonathan Auerbach, Breck Baldwin, Eric Novik.

Advanced Hierarchical Models in Stan: The hard stuff. Very interactive, very intense. Topics vary by day. Ben Goodrich.

Model assessment, model selection and inference after model selection. Aki Vehtari.

How to develop for Stan at the C++ level: Overview of Stan C++ architecture and build/development process for contributors. Charles Christopher Margossian.

[edited by Aki: added Model selection tutorial]