I was having a conversation with Andrew that went like this yesterday:

*Andrew:*- Hey, someone’s giving a talk today on HMMs (that someone was Yang Chen, who was giving a talk based on her
*JASA*paper Analyzing single-molecule protein transportation experiments via hierarchical hidden Markov models).

Maybe we should add some specialized discrete modules to Stan so we can fit HMMs. Word on the street is that Stan can’t fit models with discrete parameters. *Me:*- Uh, we can already fit HMMs in Stan. There’s a section in the manual that explains how (section 9.6, Hidden Markov Models).

We’ve had some great traffic on the mailing list lately related to movement HMMs, which were all the rage at ISEC 2016 for modeling animal movement (e.g., whale diving behavior in the presence or absence of sonar). *Andrew:*- But how do you marginalize out the discrete states?
*Me:*- The forward algorithm (don’t even need the backward part for calculating HMM likelihoods). It’s a dynamic programming algorithm that calculates HMM likelihoods in , where is the number of time points and the number of HMM states.
*Andrew:*- So you built that in C++ somehow?
*Me:*- Nope, just straight Stan code. It’s hard to generalize to C++ given that the state emissions models vary by application and the transitions can involve predictors. Like many of these classes of models, Stan’s flexible enough to, say, implement all of the models described in Roland Langrock et al.’s moveHMM package for R. The doc for that package is awesome—best intro the models I’ve seen. I like the concreteness of resolving details in code. Langrock’s papers are nice, too, so I suspect his co-authored book on HMMs for time series is probably worth reading, too. That’s another one that’d be nice to translate to Stan.

**More HMMs in Stan coming soon (I hope)**

Case study on movement HMMs coming soon, I hope. And I’ll pull out the HMMs tuff into a state-space chapter. Right now, it’s spread between the CJS ecology models (states are which animals are alive) in the latent discrete parameters chapters and HMMs in the time-series chapter. I’m also going to try to get some of the more advanced sufficient statistics, N-best, and chunking algorithms I wrote for LingPipe going in Stan along with some semi-supervised learning methods I used to work on in NLP. I’m very curious to compare the results in Stan to those derived by plugging in (penalized) maximum likelihood parameter estimates

**Getting working movement HMM code**

Ben Lambert and I managed to debug the model through a mailing list back and forth, which also contains code for the movement HMMs (which also use a cool wrapped Cauchy distribution that Ben implemented as a function). The last word on that thread has the final working models with covariates.

**P.S.**

We should be able to fit Yang Chen’s hierarchical HMMs in Stan. She assumed a univariate normal emission density that was modeled hierarchically across experiments, with a shared transition matrix involving only a few states. We wouldn’t normally perfrom her model selection step; see Andrew’s previous post on overfitting cluster analysis which arose out of a discussion I had with Ben Bolker at ISEC.

A common application in HMMs as in her paper is to do point estimation over the transition and emission probabilities rather than full Bayes, which can be quite difficult. Basically, EM while marginalizing out the hidden states. Is that doable in Stan? (Maybe that motivates GMO!)

Yes, of course. The forward algorithm marginalizes out the hidden states. You can fit a penalized MLE at that point rather than doing full Bayes if you want. And I don’t think GMO is going to marginalize out discrete states, is it?

Or are you talking about the hierarchical parameters? Yang was doing blocked Gibbs, alternating the high-level and low-level parameters.

Maybe I misunderstood the talk, but it seemed like Yang was trying to evaluate the Markovian assumption (exponential state occupancy, basically) by taking an MLE (or posterior mean? couldn’t quite tell), then running Viterbi to find the best path sequence z given a point estimate of the parameters, then looking at the transitions in z. That seems to me like it’s going to be biased because you’re taking the MLE from a point estimate based on the model. As Richard Davis pointed out, you don’t observe the hidden states. But at that point she was wrapping up and there wasn’t a lot of time for questions.

Oops, forgot the cat :-) The one I used is Tor-Odin, Mitzi’s and my Maine coon cat, who, in 1994, was one of the first cats on the web. He’s sadly no longer with us.

Great stuff – I really ought to try this in Stan myself. A co-author of mine, Vianey Leos-Barajas, has been telling me for some time now that I should look into this. How’s the computational speed compared to say moveHMM? I don’t have any experience myself really with Bayesian approaches to fitting HMMs, all I know is that people from the movement modelling community usually lament the slow mixing. But I guess most of those people use WinBUGS (?).

By the way, I’ve also been discussing the issue of selecting the number of states with Ben Bolker. Together with a PhD student, I wrote a discussion-like paper on this issue, which is on the arXiv: https://arxiv.org/abs/1701.08673. Any comments appreciated.

P.S.: Thanks for the kind words! :-)

I don’t know how fast moveHMM is. It’ll be quite a bit slower than performing a maximum likelihood estimate but will be more competitive for hierarchical modeling. So far, we haven’t seen mixing problems. There is the usual label-switching problems if the prior is symmetric, but we can identify by ordering parameters.

The way we could make this faster is to build the forward algorithm in C++ with custom derivative propagation.

I’ll take a look at the arXiv paper. This is really more up Andrew’s alley than mine.

The HMMs are known to have the label bias problem. In probabilistic programming formulation of HMMS, is the issue still around? Or it is intrinsic to HMM? What can be done to alleviate this problem?

Presumably you’re talking about the same thing as Léon Bottou (1991) talked about in his French Ph.D. thesis, and which motivated John Lafferty and friends to introduce conditional random fields (CRFs) in this paper.

The paper provides examples of what they call bias (not the same thing as an estimator’s bias) for max entropy Markov models (MEMMs). Those models aren’t generative models like HMMs and they seem to indicate it’s specifically a problem for MEMMs.

Now what would be interesting is to see how much better, if any, full Bayesian inference would be compared to plugging in maximum likelihood estimates and then extracting first-best sequences. We already know taking first-best sequences for the latent states doesn’t give you good marginal estimates (you need forward-backward to compute marginals for that and for subsequences—I built all that into LingPipe years ago).

By the way, none of this has anything to do with probabilistic programming. Yang Chen just wrote a straight up Gibbs sampler. It’s just that Stan lets you fit a wide range of models relatively easily (compared to implementing Gibbs for each one).

You could also use Stan to perform full Bayesian inference for CRFs. In fact, I may get around to doing that this year at least as far as writing some case studies of how to do it. I used to spend a lot of time working on HMMs and CRFs when I worked on speech and natural language processing.

Yes, I was referring to the Leon’s paper on CRF. I use CRF (CRFSuite implementation) a lot to perform NER on short text utterances. Built lots of models and they work great. The point estimates by L-BFGS are very fast as we need to re-train models from time to time. What is lacking in CRFSuite is n-best outputs and also the capability to say “I am not certain about this top sequence”. Will it be possible in full Bayesian treatment? What about scaling? Our models are trained from datasets that have few hundreds of thousands unique discrete features.

You don’t need full Bayes for that. I built a stochastic gradient descent estimator for CRFs as part of LingPipe (LingPipe also includes a simple HMM estimator). Then I built a general interfaces (this was Java) that let you extract the n-best sequence of outputs (it’s a well known algorithm that generalized Viterbi with a priority queue). The n-best entity extractor takes n-best chunks, as in for noun phrase or named entity chunking (less well known algorithm that generalized forward-backward for subsequences with a priority queue). I really like the way the interfaces turned out because they’re iterators that iterate in descending order of probability estimate.

Here’s a tutorial. All the code’s available open source.

None of it’s full Bayesian inference—it’s effectively penalized maximum likelihood estimation with those point estimates plugged in. Very common, but not very principled.

Full Bayes does the right thing in propagating the uncertainty in the estimates through the inferences. It basically averages predictions over the posterior. It gets a lot more expensive if we want to use something like 100 draws of the parameters and then fit with each of those. 100 times as expensive.

Scaling should be OK in the range you’re talking about, but it will be slow compared to L-BFGS (and SGD). It’ll probably take at least an order of magnitude longer to fit, if not two. Especially without some better customized gradients in the forward algorithm. I want to run all these analyses, so I don’t know the answer—this is just a guess. Mixing of the sampler should be good with HMC.

There were two QMSS students at Columbia whose M.A. theses were doing HMMs in Stan, plus one QMSS student who was helping Greg Wawro do a HMM in Stan.

Ben:

I’m proud to have founded a program whose students now know more about statistics than I do!

I think I asked about this in another post but never got a reply. Is it possible to implement Dirichlet processes in Stan? I would like to put Dirichlet process distributions on the transition probabilities of an HMM.

You can approximate a DP with a finite-dimensional Dirichlet prior. That’s usually pretty close if you have even a rough idea of how many clusters you’re going to get.

Unfortunately, those models can’t be fit. By Stan or by anybody else. For the number-of-cluster-selection applications to which DPs are put, they don’t really work in the sense that (a) the posterior is intractable, so we can only approximate with a single (or handful of) mode(s) and thus can’t accurately calculate any of the Bayesian posterior predictive statistics, and (b) number of cluster selection is itself problematic (see Andrew’s post). The usual response from the people who like these models is “so what.” Now if they led with that in their papers, I’d be OK with all this, but it’s usually presented as some kind of cluster number selection panacea that actually works rather than providing some kind of heuristic that might produce pretty clusters. See Blei et al.’s reading the tea leaves papers about how to even evaluate the output of one of these models and the early papers by Griffiths and Steyvers on LDA showing how different runs that found different modes compared.