**Someone’s wrong on the internet**

And I have to do something about it.

Following on from Dan’s post on ~~Barry Gibb~~ statistical model evaluation, here’s an example inspired by a paper I found on Google Scholar searching for Stan citations. The paper (which there is no point in citing) concluded that JAGS was faster than Stan by several orders of magnitude for a simple hierarchical model.

**First problem: thinning**

The first problem in their evaluation was thinning. JAGS has a long autocorrelation time for these models because of the correlated parameters in the posterior (Gibbs isn’t rotation invariant because of its componentwise sampling). So what the author did was run JAGS and Stan for a gazillion iterations and thinned.
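To see how slowly componentwise Gibbs moves on a correlated posterior, here’s a minimal toy (a two-dimensional correlated normal, not the paper’s model):

```python
import numpy as np

# Componentwise Gibbs on a bivariate normal with correlation rho.
# Full conditionals: x | y ~ normal(rho * y, sqrt(1 - rho^2)), and
# symmetrically for y | x.
rng = np.random.default_rng(1)
rho, n = 0.99, 50_000
sd_cond = np.sqrt(1 - rho**2)
x = y = 0.0
xs = np.empty(n)
for i in range(n):
    x = rng.normal(rho * y, sd_cond)
    y = rng.normal(rho * x, sd_cond)
    xs[i] = x

# Each update moves along only one axis, so the chain inches along the
# ridge: the lag-1 autocorrelation is about rho^2 ~ 0.98, and the
# effective sample size is a tiny fraction of n.
ac1 = np.corrcoef(xs[:-1], xs[1:])[0, 1]
```

Rotate the same target so the axes align with the principal components and the autocorrelation disappears, which is the rotation-invariance point above.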

Thinning always loses information. Don’t do it unless you can’t afford the memory to store the unthinned draws. The problem here is that thinning loses more information when the autocorrelation time is low. Because Stan’s effective sample size per iteration is much higher, thinning disproportionately penalizes its true speed.

The trick is to figure out how much time it takes to get the effective sample size you need and compare that. Each system should be allowed to get there on its own terms. For Stan, that might mean 2K iterations and for JAGS it might mean 100K (remember JAGS iterations can be much much faster than Stan’s). Whatever you do, don’t run Stan for 100K iterations and thin!
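To make that concrete, here’s a crude sketch of the metric in Python (a naive truncated-autocorrelation ESS estimator; in practice you’d use the estimators built into Stan or the coda package):

```python
import numpy as np

def ess(chain):
    """Crude effective sample size: n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first non-positive lag."""
    n = len(chain)
    c = chain - chain.mean()
    acf = np.correlate(c, c, mode="full")[n - 1:] / (c @ c)
    s = 0.0
    for r in acf[1:]:
        if r <= 0:
            break
        s += r
    return n / (1 + 2 * s)

# The fair comparison is seconds per effective sample, with each system
# run long enough to hit the target ESS on its own terms:
#   cost = wall_time_seconds / ess(chain)
rng = np.random.default_rng(0)
iid = rng.standard_normal(10_000)   # an ideally mixing "chain"
print(ess(iid))                     # near 10_000 for white noise
```

A highly autocorrelated JAGS chain can have an ESS orders of magnitude below its iteration count, which is exactly what raw iterations-per-second comparisons hide.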

**Second problem: convergence**

The authors used centered parameterizations.

Of course, we don’t want to compare two programs that get the wrong answer. The second problem was that the models weren’t going to fit well with a centered parameterization in either Stan or JAGS. Stan will try hard to fit anyway, cranking down the step size so it can get into the neck of the funnel (the portion of the posterior with low hierarchical variance, where all the lower-level coefficients are pulled near the hierarchical mean). That’s why Stan takes longer. But really, neither system is going to robustly get the right answer with a centered parameterization.

As Michael Betancourt often says, speed comparisons only make sense when conditioned on the models actually fitting. For both Stan and JAGS in these models, you want to use a non-centered parameterization. The non-centered parameterization was originally developed for Gibbs and Metropolis, not HMC. Matt Hoffman rediscovered it in the context of HMC, so we used to call it the “Matt trick” before we learned its name and history for Gibbs and Metropolis.
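The trick itself is just a shift and scale of a standard normal. A numerical sketch with made-up values for the hierarchical mean and scale:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = 1.0, 0.5          # hierarchical mean and scale (made-up values)

# Centered:      theta ~ normal(mu, tau)
# Non-centered:  theta_raw ~ normal(0, 1);  theta = mu + tau * theta_raw
# Both give the same marginal distribution for theta, but the non-centered
# version decorrelates theta from tau in the posterior, removing the
# funnel geometry that chokes both HMC and Gibbs.
theta_raw = rng.standard_normal(100_000)
theta = mu + tau * theta_raw

print(theta.mean(), theta.std())   # approximately mu and tau
```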

With parameterizations that work and no thinning, the results should be rather different: Stan will do relatively better. It really helps Stan to have models that are well specified for the data; when the models can’t be made to fit the data well, Stan really struggles to explore the posterior. This is a pure example of Andrew’s folk theorem in action.

**Third problem: inefficient coding**

Both programs need to be coded efficiently in order to compare the speed of the underlying system (unless you’re making a meta-point of what “ordinary” users who don’t read the manual or ask on the forums do with a system, which can be a valid point to make).

It looks like you can vectorize JAGS models now. I don’t know how much that helps JAGS with speed. When Matt Hoffman and I were looking at JAGS before developing Stan, he managed to implement a vectorized Bernoulli-logit that was something like an order of magnitude faster than looping. Stan will also take advantage of shared computations, dropping constants, etc. in its vectorized distributions.
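For a sense of what vectorization buys, here’s the same Bernoulli-logit log-likelihood written both ways in Python/NumPy (a toy analogue, not the actual JAGS or Stan implementation):

```python
import numpy as np

def bernoulli_logit_lpmf_loop(y, alpha):
    """Looped log-likelihood: one scalar log1p/exp call per observation."""
    total = 0.0
    for yi, a in zip(y, alpha):
        total += yi * a - np.log1p(np.exp(a))
    return total

def bernoulli_logit_lpmf_vec(y, alpha):
    """Vectorized version: same math, one pass over whole arrays.
    (A numerically careful version would branch on the sign of alpha.)"""
    return np.sum(y * alpha - np.log1p(np.exp(alpha)))

rng = np.random.default_rng(0)
alpha = rng.normal(size=10_000)
y = (rng.random(10_000) < 1 / (1 + np.exp(-alpha))).astype(float)
# Both give the same answer; the vectorized one is much faster, and a
# system like Stan can additionally share subexpressions across the sum.
```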

**Fourth problem: poor choice of priors**

The examples are using the diffuse inverse gamma priors popularized by WinBUGS (and perpetuated by JAGS). Andrew’s written about their dangers at length. In summary, they pull estimates into the tails and wind up being much stronger than their users expect.
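To see how strong these “diffuse” priors actually are, push the classic gamma(0.001, 0.001) precision prior through to the standard deviation (a quick simulation, not a result from the paper):

```python
import numpy as np

# The WinBUGS-era default: precision ~ gamma(0.001, 0.001), which is an
# inverse-gamma(0.001, 0.001) prior on the variance.
rng = np.random.default_rng(0)
prec = rng.gamma(shape=0.001, scale=1 / 0.001, size=100_000)
prec = np.maximum(prec, 1e-300)   # guard: tiny draws underflow float64
sd = 1 / np.sqrt(prec)

# On the scale users actually think in (the standard deviation), almost
# all of the prior mass sits at extreme values -- hardly "noninformative".
frac_huge = (sd > 10).mean()
```

The estimate-pulling behavior follows: with so much prior mass at extreme scales, the prior fights the data instead of staying out of the way.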

Inverse gammas are popular for variance parameters because they’re conjugate. That’s the ideal situation for JAGS, because JAGS can very efficiently exploit the conjugacy when doing Gibbs sampling. Try comparing the speed of JAGS with non-conjugate priors and you’ll see very different results.

Placing priors on variance or precision is tricky because their units are the square and inverse square of the units of the variate. It’s much easier to think in terms of scale (standard deviation) because it’s in the same units as the variate.

**A final issue: comparing easy models**

The models being used for comparison are very easy, especially for JAGS. They’re fairly low dimensional and use simple conjugate priors.

We recommend comparing simulated data that can grow in both parameter size and data size. You’ll find Stan tends to scale fairly well.
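A simulator where both dials are explicit makes this easy to set up (a sketch of a one-way normal hierarchical model with made-up hyperparameters):

```python
import numpy as np

def simulate_hierarchical(J, N, mu=0.0, tau=1.0, sigma=1.0, seed=0):
    """Simulate a one-way normal hierarchical model:
    theta_j ~ normal(mu, tau); y_ij ~ normal(theta_j, sigma).
    J controls parameter size, N controls data size per group."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu, tau, size=J)
    y = rng.normal(theta[:, None], sigma, size=(J, N))
    return theta, y

# Scale each dial independently and time each system at every setting.
theta, y = simulate_hierarchical(J=50, N=200)
```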

Model complexity is where Stan really shines. These comparisons always tend to be for simple models.

I love that we’re in “content mode” this week!

It might be worth writing a “how to break Stan” case study. It’s kinda hiding in a lot of the other case studies (like Mike’s on divergences and on priors on scale parameters), but it might be worth bringing it out more. An “Eight simple ways to make your Stan code slower” type of thing.

I think “content mode” is just a coincidental alignment of our need to procrastinate.

Yes, a “Stan bloopers” kind of thing could work. I really like the book *GUI Bloopers*; it could be a model.

> coincidental

Yep, when I scheduled mine the queue looked pretty open.

Ouch, it wasn’t the ecologists, was it?

On Gibbs samplers:

Lots of people like Gibbs samplers. One motivation is that you can directly sample from the conditional posterior without any *immediate* need to calculate posterior densities. However, in many cases this is just a delayed cost that has to be paid later! Unless the entire model posterior is conjugate, you will need to recalculate the posterior densities for the nodes you used the conjugate sampler on if you later do something like a Metropolis-Hastings (MH) update that influences those nodes. If you had used an MH sampler instead of Gibbs, you could just cache the posterior density and not recalculate it. But if you skipped that calculation by using a Gibbs sampler, you can’t take advantage of the caching when you do the MH update.
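To make the caching point concrete, here’s a generic random-walk Metropolis sketch (not tied to any particular model) where the current state’s log density is computed once and reused:

```python
import numpy as np

def metropolis(logp, x0, n, scale=1.0, seed=0):
    """Random-walk Metropolis that caches the current log density, so
    logp is evaluated once per proposal, never for the current state."""
    rng = np.random.default_rng(seed)
    x, lp = x0, logp(x0)        # cache the current log density
    out = np.empty(n)
    for i in range(n):
        x_new = x + scale * rng.standard_normal()
        lp_new = logp(x_new)    # the only density evaluation this step
        if np.log(rng.random()) < lp_new - lp:
            x, lp = x_new, lp_new   # accept: update the cache
        out[i] = x
    return out

# Standard normal target (log density up to an additive constant).
draws = metropolis(lambda x: -0.5 * x * x, 0.0, 20_000, scale=2.0)
```

Interleave a conjugate Gibbs update on `x` into that loop and the cached `lp` goes stale, so it has to be recomputed before the next MH step — the delayed cost described above.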

You still have the advantage of a 100% acceptance rate… but if you have a lot of correlated nodes that you are directly sampling from individually, suddenly that 100% acceptance rate is not so great either, as the entire MCMC sampler will crawl very slowly across the posterior despite conditionally efficient updates.

Of course, there are plenty of models for which Gibbs sampling is great…but just because you can use a Gibbs sampler doesn’t mean you should.

I completely agree with your point. How about comparing them in terms of other criteria such as coverage, bias, etc., with a particular focus on variance parameters? Is there anything you would advise, as you did above, before making any comparison? I am talking about models like generalized linear mixed models with non-Gaussian outcomes.

Belay:

For evaluating two Monte Carlo algorithms that are computing the same thing, I don’t think it usually makes sense to evaluate using coverage, bias, etc. I’d start with wall time per effective sample size.

I was presupposing that any comparison is conditioned on both algorithms computing the right answer.