Thanks for the link. The LeGland and Mevel paper is much more mathematical and much harder for me to read. It cites an earlier paper with an ergodicity result, but I’m not sure for which sampler. In our practical experience, Hamiltonian Monte Carlo (HMC) has mixed well for HMMs, but autodiffing the forward algorithm in Stan is a lot slower than a custom expectation-maximization (EM) implementation. We can also use Stan’s quasi-Newton optimizer (L-BFGS), which takes advantage of derivatives (and approximate Hessians), to find maximum likelihood estimates (MLE) efficiently. That’ll also get faster and more scalable when we move from autodiffing the forward algorithm to analytical gradients.
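For readers who want to see what is being differentiated, here is a minimal sketch of the forward algorithm on the log scale for a discrete HMM. It is plain Python, not Stan code, and the function and argument names (`hmm_log_likelihood`, `log_init`, `log_trans`, `log_emit`) are made up for illustration. The returned value is the log marginal likelihood that an autodiff system or an optimizer such as L-BFGS would work with.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def hmm_log_likelihood(log_init, log_trans, log_emit, obs):
    """Forward algorithm: log p(obs) for a discrete HMM with K hidden states.

    log_init[k]     = log p(z_1 = k)
    log_trans[j][k] = log p(z_t = k | z_{t-1} = j)
    log_emit[k][y]  = log p(y_t = y | z_t = k)
    obs             = non-empty list of observed symbol indices
    """
    K = len(log_init)
    # alpha[k] = log p(y_1, ..., y_t, z_t = k)
    alpha = [log_init[k] + log_emit[k][obs[0]] for k in range(K)]
    for y in obs[1:]:
        alpha = [
            log_sum_exp([alpha[j] + log_trans[j][k] for j in range(K)])
            + log_emit[k][y]
            for k in range(K)
        ]
    # Marginalize over the final hidden state.
    return log_sum_exp(alpha)
```

Each time step costs O(K^2) operations, so the whole pass is O(T K^2) for T observations.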

If I’m understanding the gist of their recursive algorithm, it’s very similar to how I implemented stochastic gradient descent (SGD) for HMMs and conditional random fields (CRFs) in LingPipe (the NLP toolkit I worked on before Stan). I built recursive accumulators following the forward-backward algorithm and fed them into a standard batched SGD with an interface to control the learning rate (the ones that satisfy the Robbins-Monro conditions work in theory, but not so well in practice). The main reason I chose to go back into academia to work with Andrew is so that I could understand discussions like these :-).
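To make the learning-rate remark concrete, here is a small sketch of batched SGD with a pluggable step-size schedule. This is not the LingPipe code. The names `robbins_monro_rate` and `sgd` and the default constants are hypothetical. The schedule eta_t = eta_0 / (1 + t / tau) is one standard choice that satisfies the Robbins-Monro conditions: the step sizes sum to infinity while their squares have a finite sum.

```python
def robbins_monro_rate(step, initial_rate=0.1, decay=10.0):
    """Step size eta_t = eta_0 / (1 + t / tau).

    The rates sum to infinity and their squares sum to a finite value,
    which is what the Robbins-Monro conditions require.
    """
    return initial_rate / (1.0 + step / decay)

def sgd(grad_fn, params, batches, num_epochs, rate_fn=robbins_monro_rate):
    """Batched SGD; grad_fn(params, batch) returns the gradient as a list."""
    params = list(params)
    step = 0
    for _ in range(num_epochs):
        for batch in batches:
            rate = rate_fn(step)
            grad = grad_fn(params, batch)
            params = [p - rate * g for p, g in zip(params, grad)]
            step += 1
    return params

def mean_grad(params, batch):
    """Toy objective: gradient of 0.5 * (mu - x)^2 averaged over a batch."""
    mu = params[0]
    return [sum(mu - x for x in batch) / len(batch)]

estimate = sgd(mean_grad, [0.0], [[1.0, 2.0], [2.0, 3.0]], num_epochs=1000)
```

Because `rate_fn` is just a callable, a constant or more slowly decaying schedule can be swapped in without touching the loop, which is roughly the kind of control over the learning rate described above.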

This reply has so many acronyms I feel like I owe readers a glossary.

@Adam: Thanks for the reference.

@Mark: Jason cites some earlier work on the relation in his paper. The new part for me is the connection of derivatives to expectations in log-linear models.
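The connection can be checked numerically on a toy log-linear model. In the sketch below, each outcome x has a feature vector f(x) and probability proportional to exp(theta . f(x)); the gradient of the log normalizer log Z(theta) then equals the expected feature vector. The three-outcome setup and all function names are invented for illustration, and the finite-difference gradient is only there as an independent check.

```python
import math

def score(theta, fx):
    """Unnormalized log probability theta . f(x)."""
    return sum(t * f for t, f in zip(theta, fx))

def log_partition(theta, features):
    """log Z(theta) = log sum_x exp(theta . f(x)), computed stably."""
    scores = [score(theta, fx) for fx in features]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def expected_features(theta, features):
    """E[f(x)] under p(x) = exp(theta . f(x)) / Z(theta)."""
    log_z = log_partition(theta, features)
    probs = [math.exp(score(theta, fx) - log_z) for fx in features]
    return [
        sum(p * fx[d] for p, fx in zip(probs, features))
        for d in range(len(theta))
    ]

def finite_diff_gradient(theta, features, eps=1e-5):
    """Central-difference approximation to the gradient of log Z."""
    grad = []
    for d in range(len(theta)):
        up, down = list(theta), list(theta)
        up[d] += eps
        down[d] -= eps
        diff = log_partition(up, features) - log_partition(down, features)
        grad.append(diff / (2 * eps))
    return grad

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # f(x) for three outcomes
theta = [0.3, -0.5]
expectation = expected_features(theta, features)
gradient = finite_diff_gradient(theta, features)
```

The two lists agree up to finite-difference error. In an HMM or CRF the sum over outcomes is a sum over state sequences, and the forward-backward algorithm is what computes the corresponding expectations efficiently.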

What I like about the paper I linked above is the clear presentation of the forward-backward algorithm and derivatives in matrix form. That particular form makes the connection between the algorithm for expectations and the general derivative propagation we need for Stan (reverse-mode autodiff) very clear. For example, we’ll have varying transition matrices defined as multi-logit regressions based on time-specific predictors (aka features in NLP). I think there’s a name for that model in NLP where the transitions are modeled with logistic regressions.
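As a rough illustration of time-varying transitions, the sketch below builds one transition matrix from a predictor vector using a softmax (multi-logit) regression per source state. The coefficient layout `beta[j][k]` and the function names are assumptions made for this example, not anything from Stan or the paper.

```python
import math

def softmax(xs):
    """Map real-valued scores to probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transition_matrix(beta, x_t):
    """Transition matrix at time t from predictors x_t.

    beta[j][k] is a (hypothetical) coefficient vector for moving from
    state j to state k; row j of the result is
    softmax over k of beta[j][k] . x_t.
    """
    return [
        softmax([sum(b * x for b, x in zip(beta_jk, x_t)) for beta_jk in beta_j])
        for beta_j in beta
    ]

# Two hidden states, two predictors.
beta = [
    [[0.0, 0.0], [1.0, -1.0]],  # transitions out of state 0
    [[0.5, 0.5], [0.0, 0.0]],   # transitions out of state 1
]
P_t = transition_matrix(beta, [1.0, 2.0])
```

In each row here one destination has its coefficients fixed at zero, which is a common way to identify a multi-logit model. A forward pass would call `transition_matrix` once per time step instead of reusing a single fixed matrix.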

Forward-mode autodiff is less burdensome than reverse mode, but it could probably also benefit from the analytical treatment. The trick would be to find an efficient way to have reverse-mode nested in forward mode so we could get second derivatives (Hessians) more efficiently and accurately than approximate gradient methods.

Thanks Adam,

I haven’t seen this paper, but yes, it seems to have all the things that Jason was saying (although I recall him saying these things a decade before the paper).

Anyway, thanks for the reference, and Bob can add this to his bibliography!

Mark

Mark,

Was this what you had in mind? https://www.cs.jhu.edu/~jason/papers/eisner.spnlp16.pdf

Francois LeGland and Laurent Mevel, Recursive estimation in hidden Markov models, Proceedings of the 36th IEEE Conference on Decision and Control, vol. 4, pp. 3468–3473, 1997.

These authors use such recursions to propose an online parameter estimation scheme; see also the subsequent paper:

Francois LeGland and Laurent Mevel, Exponential forgetting and geometric ergodicity in hidden Markov models, Mathematics of Control, Signals and Systems, vol. 13, no. 1, pp. 63–93, 2000.