## A little story of the Folk Theorem of Statistical Computing

I know I promised I wouldn’t blog, but this one is so clean and simple. And I already wrote it for the stan-users list anyway so it’s almost no effort to post it here too:

A colleague and I were working on a data analysis problem and had a very simple overdispersed Poisson regression with a hierarchical, varying-intercept component. We ran it, and it was super slow and nowhere near converging after 2000 iterations. We took a look and found the problem: the predictor matrix of our regression lacked a constant term. The constant term comes in by default if you use glmer or rstanarm, but if you write the linear predictor as X*beta in a Stan model, you have to remember to put a column of 1’s in the X matrix (or to add a “mu” to the regression model), and we’d forgotten to do that.

Once we added in that constant term, the model (a) ran much faster (because adaptation was smoother) and (b) converged just fine after only 100 iterations.
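To make the missing-constant issue concrete, here is a minimal sketch in plain Python (not our actual Stan model). The helper names `add_intercept` and `linear_predictor`, the toy predictor values, and the coefficients are all made up for illustration. It shows that without a column of 1’s, X*beta is forced to be 0 (an expected count of exp(0) = 1 under a log link) whenever the predictors are 0, while adding the column lets the first coefficient act as the baseline.

```python
import math


def add_intercept(X):
    """Return a copy of X with a leading column of 1s (the constant term)."""
    return [[1.0] + list(row) for row in X]


def linear_predictor(X, beta):
    """Compute X * beta row by row for a list-of-lists design matrix."""
    if any(len(row) != len(beta) for row in X):
        raise ValueError("each row of X must have len(beta) entries")
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


# Toy design matrix with one predictor and no constant column (hypothetical data).
X = [[0.0], [1.0], [2.0]]

# Without a constant term, the linear predictor is pinned to 0 where x = 0.
eta_without = linear_predictor(X, [0.5])

# With a column of 1s, the first coefficient plays the role of "mu".
X1 = add_intercept(X)
eta_with = linear_predictor(X1, [3.0, 0.5])

# Poisson means under a log link.
mean_without = [math.exp(e) for e in eta_without]
mean_with = [math.exp(e) for e in eta_with]

print(X1)
print(eta_without, eta_with)
```

In a Stan program the equivalent fix is either to pass in the augmented matrix as data or to add a separate intercept parameter to the linear predictor, as described above.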

Yet another instance of the folk theorem.

1. Hernan Bruno says:

I am estimating variations of a model I have been working on to make model comparisons and robustness checks. I can sometimes predict how poorly the model will fit by the time it takes. Would that “Sampling Time” work as an alternative to WAIC? ;)
And sometimes code that feels unusually slow ends up having a bug, like a variable that was defined but by mistake never made it into the log_prob statement.

2. I took that model for poll aggregation from a few days ago, and made it a simultaneous model for Clinton, Trump, Other using a smooth Gaussian process and a Dirichlet distribution for the polls. After I fixed the problem that one of the polls had exactly 0% for “other”, I tried to run it, and killed it after a half-hour wait for just a SINGLE sample.

Turns out that with 400+ days, the straightforward Gaussian process with a 160,000-element covariance matrix is just ***too slow***.

I wish this were a folk-theorem issue, but I don’t think so.

• Corey says:

Maybe pick some inducing points and use the Fully Independent Training Conditional approximation (as described here)? You could even infer and/or marginalize over the inducing points; I bet this recently described prior on unobserved(!) input values would work well for marginalizing over inducing points.

• One thing I might be able to do in a straightforward way is just estimate at the days where polls are available, not the intermediate time points. I was looking to get a full trajectory through every day between the start and end of the dataset. There are only 110 unique polling points, versus 444 different days between the first and last poll.

I’ll look at the paper, and see what I can do.
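For a rough sense of the sizes discussed in this thread, here is a back-of-the-envelope sketch in Python. It assumes a dense covariance matrix with n × n entries and uses the standard leading-order estimate of about n³/3 floating-point operations for a Cholesky factorization. The function name `dense_gp_cost` is made up for illustration, and the estimate ignores everything else a sampler does, such as gradient computation.

```python
def dense_gp_cost(n):
    """Return (covariance entries, approximate Cholesky flops) for n input points."""
    entries = n * n
    cholesky_flops = n ** 3 / 3  # leading-order term only
    return entries, cholesky_flops


all_days = 444   # every day between the first and last poll
poll_days = 110  # only the days on which a poll is available

entries_all, flops_all = dense_gp_cost(all_days)
entries_polls, flops_polls = dense_gp_cost(poll_days)

print("covariance entries:", entries_all, "vs", entries_polls)
print("approximate Cholesky cost ratio:", round(flops_all / flops_polls, 1))
```

Under these assumptions, restricting the process to the 110 polling days shrinks the covariance matrix from 197,136 entries to 12,100 and cuts the factorization cost by a factor of roughly 66, since the cost ratio is (444/110)³.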