Following on from his talk at StanCon, Michael Betancourt just wrote three Stan case studies, all of which are must reads:

- Diagnosing Biased Inference with Divergences: This case study discusses the subtleties of accurate Markov chain Monte Carlo estimation and how divergences can be used to identify biased estimation in practice.

- Identifying Bayesian Mixture Models: This case study discusses the common pathologies of Bayesian mixture models as well as some strategies for identifying and overcoming them.

- How the Shape of a Weakly Informative Prior Affects Inferences: This case study reviews the basics of weakly-informative priors and how the choice of a specific shape of such a prior affects the resulting posterior distribution.

**Reproducible R scripts**

They all come with fully reproducible knitr scripts to run in RStan. The same lessons hold for the other interfaces, so don’t let the R put you off.

**A spectator sport**

It was particularly fun sitting in the office the day Michael went and figured out all the mixture modeling properties. It followed on from one of our Stan meetings and some of my own failed experiments.

**Publish or perish**

It’s really a shame that this kind of methodological study is so hard to publish, because all three of these deserve to be widely cited. Maybe Andrew has some ideas of how to turn these into “regular” papers. The main thing journal articles give us is a way to argue that we got research results from our research grants. Not a small thing!

**Other case studies**

We have a bunch of case studies up and are always looking for more. The full list and instructions for submitting are on the Stan case studies page.

**Getting information out of the posterior fit object in R**

And in case you didn’t see it, Jonah wrote up a guide for how to extract the kind of information you need for extracting information from a Stan fit object in R.

+1

“A Few Case Studies More”.

“The Good, The Bad, and the Divergent”.

These are great. Thanks for sharing!

Great!

From the third case study:

“It is important to note however, that these unit-scale priors alone do specify a weakly informative prior! They are weakly informative only when our parameters have appropriate units.”

Seems there’s a crucial “not” missing in the first sentence?

“Although there is recent work being done to develop formal criteria for selecting the exact shape of the prior distribution[…]”

This sounds interesting. Hope to hear about this more.

Thanks for the great illustration showing how easy mixture modeling can be in Stan!

However, I wondered about the following lines:

“While the ordering is limited to scalar parameters, it can still prove useful when the component distributions are multivariate. Although we cannot order the multivariate parameters themselves, ordering any one of the parameters is sufficient to break the labeling degeneracy for the entire mixture.”

But then, Frühwirth-Schnatter (2001, JASA, https://doi.org/10.1198/016214501750333063) writes:

“It is, however, important to realize that within a Bayesian analysis the constraint has to be selected carefully. As a general rule, a constraint that ignores the geometry of the posterior does not induce a unique labeling and introduces a bias toward the constraint.”

I think the problem is that order constraints on the means work quite well (because they separate the modes geometrically), whereas order constraints on the SDs don’t separate the modes. But I am not an expert, am I overlooking something?

Daniel:

If there is prior information that is different for the different mixture components (that is, if the prior distribution is not exchangeable over the components (1,…,K)), then an ordering constraint will not change the posterior distribution for predictive quantities. But if the prior distribution is asymmetric than including a constraint will change the model.

Just to add a reference to the Weakly Informative Priors paper, this recent one by Nadja Klein and Thomas Kneib offers a fairly detailed simulation study for priors with different tail properties: ]

http://projecteuclid.org/euclid.ba/1448323525

So some of these papers do occasionally slip through the publishing filter, although it’s definitely hard.

“Maybe Andrew has some ideas of how to turn these into “regular” papers. The main thing journal articles give us is a way to argue that we got research results from our research grants.”

Why don’t you just start your own Stan Case Studies journal? I’d love to see a journal that has some of the aspects of classic scientific journals (peer-review, citability) but also shares some aspects of a wiki/forum that the Stan users-list and mcmc-stan.org provides. Case studies like these that really lay out coding methods would be “sticky posts” in the journal. Then you could also invite papers that are more application focused. You could have a nicely moderated comment section as well where authors and readers can go for more discussion and feedback.

C’mon, Bob, “homebrew” is the tastiest!

Reading on the divergent transitions issue, it seems as if the main idea is that HMC can’t explore a space when the high probability region of some dimension sometimes decreases to nearly zero width (“funnel”). The example has a parameterization where s can range over a region that includes zero, so a parameter given

m ~ normal(u,s)

when s is small the HMC traverses across the whole region of interest (in m space) in one step

the alternative

m0 ~ normal(0,1)

m <- u + m0*s

now m0, the parameter, has constant geometry, and m is a deterministic function of m0, so it's easy to find a single step size that makes moving around the m0 dimension work, and yet, different s values still correspond to different m values, deterministically.

So, I like that, but I wonder if you (Michael?) could comment on how priors that keep scale parameters away from 0 affect this? For example, suppose instead of a half-cauchy prior on the scale parameter, (which puts a bunch of probability mass very near 0) you used a gamma(2,2) type prior which goes to zero density at tau = 0 thereby declaring that you expect some variability between schools.

Obviously divergences can still occur in other situations, but in this particular "Funnel" geometry, it seems to be related to scale parameters that have priors that put probability mass close to zero. In many instances, I really expect some variability and I have some order of magnitude for my expectation, and I use gamma(n, n/s) as a prior to interpolate between an exponential with mean s (maximum entropy for a positive parameter with known mean and equal to gamma(1,1.0/s)) and a gaussian tightly concentrated around s which is gamma(n,n/s) for large n.

The nice property is that for very moderate n (like 1.5 or 2) you have a fair bias away from 0, which is almost always something you want in a scale parameter for variability.

I may be missing something, but… in the second post, on mixtures, it seems as if at least one of the outer sums for the likelihoods should be products (I.e., across the index for the data), no?

These case studies have been very very helpful. I will need to go back and run all the code myself and try things out on my own to better understand everything.

In mixture models, the ordering constraints on mu make a lot of sense to me. The super tight priors feel difficult to justify to reviewers however. Saying that this is for computational convenience may not cut it. I guess it depends on who one is writing for. Psychologists and psycholinguists will have a heart attack.

I suspect that in my own work I can get away with the ordering constraint alone though; I will report my experience with this approach, Michael. Thanks again for this helpful set of notes. If you want to write a paper using real data to illustrate the issues, I have literally a dozen studies to work with.

I will present some of this stuff at Bayes@Lund on April 20th, where I look forward to meeting Richard McElreath. See: http://www.maths.lu.se/bayeslund2017/

It’s funny I encountered a variant of this problem trying to use a mechanistic model representing the analytical solution of a coupled ODE system. The problem involves inverse estimation of two rate parameters, and I found I had to induce identifiability by setting super strong priors that essentially allowed no overlap. Without that, I would get these bimodal posterior pathologies. Now I’m wondering if I could recast that model using the ordering constraint. Thanks to Michael Betancourt for really hammering out the math underlying the intuition here!

In mixture models, I think it has to be true that if the posterior is multimodal the R-hats will be (quite a bit) larger than 1. But R-hats can be very large for a lot of other reasons in the specific case of mixture models; is this right?

Ah! Nice.