>If it takes months to do a calculation, then that doesn’t leave much time to check that the calculation is for sure doing what you expect.

Oh ya, I hear ya haha. Nothing worse than waiting ages for a model to fit only to see it fail. Usually I’ve been able to get around that by iteratively complexifying it bit by bit, so that I can troubleshoot nested models on data subsets that fit much faster.

>you wanna be doing as much as that as possible regardless of what Stan’s Rhats/ESS estimates say

Agreed! I was just using the automated diagnostics to flag when the chains failed to converge / mix (these are all across 4 independent chains, by default), so that I might rerun them longer or under different conditions.

RE: half-cauchies, ya, I’ve had trouble with them before, and will usually swap ’em out on scale parameters for a half-normal or exponential if I suspect they’re causing problems. The fat tails are a nice hedge but often not worth the trouble.

> That seems nice cause it would be easy to code.

OK, so it sounds like there’s nothing terribly unprincipled I’m not seeing wrt that sort of short-cutting! Thanks!

> Like 8-schools model if the 8 measurements were really precise is better done with a centered parameterization.

TBH I don’t have the best intuitions for when one is preferred over the other — usually I’ll just flip between them when having trouble, or else use the non-centered form when working with really crummy data. Should probably actually give that 2013 Betancourt and Girolami paper a read!

Agreed on your follow-up comment! And really, in this current case the waiting time between units of your first type of information would be measured in centuries or whatever haha, so that’s why I’m looking for cheap workarounds!

—

I was thinking more about my answer with the months and I think it’s kinda wrong.

Months is still probably bad, but I think here’s a better way to get at that answer.

Running a model (whatever that means, and whatever algorithms or whatever we’re using) produces two kinds of information (using information informally):

1. Information useful for directly improving the model

2. Information useful for the application

If you’re doing biology, the biology info is the second. In the process of doing this, you might find divergences, or other computational or statistical problems with your model. This is the first type of information.

Before you really trust the second type of information your model generates (application info), you need to spend some time generating the first type of information (modeling info). So even if we say “we can perfect our model with 10 units of information” and we generate 1 unit of information each iteration, then at roughly 3 months per iteration, 10 units of information -> 2.5 years before you even start generating the second kind of information, the stuff that actually matters for your application.

The second piece here is that the product of running your model is information — that information derives value from its use elsewhere. Even if you tell me, “my higher level decision is simple, we can simply wait 2.5 years for the answer and be done with it” I am skeptical.

I think the reality is that this higher level decision is likely to lead to iterations in the types of information you want to extract about the problem at hand (sending you all the way back to the beginning). So a decision process that can support this sort of delay in information would need to have a timeline in the decades probably.

—

And I’m exaggerating numbers a bit there. You already said:

> (this is all on lots of replicated simulated data to explore the statistical properties / viability of this model, before applying it to the big empirical data at hand — the hope being that the strong priors on the bottom-level parameters get learned in the course of fitting)

So you’re using smaller experiments to generate #1-type information, which is good, but I think the relationship between #2 and the next level of the problem will be similar to the relationship between #1 and #2, so a unit of computation that takes 3 months -> bad.

—

Yeah months is too slow.

> I’m not sure what’s meant by programming difficulty, but I can specify & implement the model

Just checking that you were able to write the model without too much difficulty (if it took months to write it out, for instance, that is too long).

> are there other concerns with that strategy when using HMC / NUTS

If it takes months to do a calculation, then that doesn’t leave much time to check that the calculation is for sure doing what you expect.

So what you said:

> this is all on lots of replicated simulated data to explore the statistical properties / viability of this model

Like you wanna be doing as much as that as possible regardless of what Stan’s Rhats/ESS estimates say (and you’d do it for individual fits and you’d do it for EP too) and if that takes months oof.

As far as crazy long chains, I don’t think there’s anything special about NUTS there. Just the same, generic, if-you’re-having-to-run-this-long-why-not-longer-maybe-you-haven’t-mixed concerns. Multiple-chains + Rhat is the workaround.
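To make that concrete, the multiple-chains + Rhat check amounts to comparing between-chain and within-chain variance. A minimal split-R-hat sketch in Python/NumPy (the textbook formula; Stan's current version also does rank-normalization, so this is not its exact implementation):

```python
import numpy as np

def split_rhat(chains):
    """Split-R-hat: split each chain in half, then compare between-chain
    to within-chain variance. chains has shape (n_chains, n_draws)."""
    n = chains.shape[1] // 2
    # splitting doubles the chain count and catches chains that drift
    halves = np.concatenate([chains[:, :n], chains[:, n:2 * n]], axis=0)
    chain_means = halves.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = halves.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))            # 4 chains, same target
assert split_rhat(mixed) < 1.01               # near 1: consistent chains
stuck = mixed + np.arange(4)[:, None]         # chains at different levels
assert split_rhat(stuck) > 1.1                # inflated: hasn't mixed
```

The point of the if-you-had-to-run-this-long worry is exactly the second case: a single long chain can look fine internally while sitting in the wrong place, and only disagreement across independent chains reveals it.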

> sigma_a ~ half-cauchy(0,1)

> sigma_b ~ half-cauchy(0,1)

The half-cauchies can be hard to work with. Bob has a note here: https://statmodeling.stat.columbia.edu/2021/03/25/the-folk-theorem-revisited/#comment-1768023

> sigma_a and sigma_b can be reliably learned from just some small subset of _i, I could do away with the hierarchical structure and fit the wholly flat model in parallel on each group simultaneously

That seems nice cause it would be easy to code. Also you could pull estimates of sigma_a and sigma_b before the full fits mix and then use that for a full fit with fixed hierarchical parameters.

Even if the model is that simple, if some groups have lots of data and others have little, you might need to mix non-centered and centered parameterizations. Like 8-schools model if the 8 measurements were really precise is better done with a centered parameterization.
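For what it's worth, the two parameterizations differ only in where the shift-and-scale happens. A toy Monte Carlo sketch in Python (not a Stan model, just illustrating that centered and non-centered target the same marginal, so you can switch freely based on which geometry the sampler handles better):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, tau = 4.0, 3.0   # made-up hyperparameters

# centered: the group-level parameter is sampled on its natural scale
theta_centered = rng.normal(mu, tau, size=10_000)

# non-centered: sample a standard normal, then shift and scale;
# identical distribution, but the sampler works in an isotropic space
# whose geometry doesn't depend on tau (which is what kills HMC in
# the weak-data "funnel" regime)
theta_raw = rng.normal(0.0, 1.0, size=10_000)
theta_noncentered = mu + tau * theta_raw

# same marginal for theta either way
assert abs(theta_centered.mean() - theta_noncentered.mean()) < 0.15
assert abs(theta_centered.std() - theta_noncentered.std()) < 0.15
```

Rough rule of thumb, consistent with the 8-schools remark: when the data strongly constrain each group's parameter, the centered form is fine (and the non-centered one can hurt); when data are weak and the prior dominates, non-centered usually wins.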

—

Interesting, thank you! The Expectation Propagation approach described sounds promising, but I’ll have to give it a closer read to really understand all the fiddly bits (some parts are more familiar to me than others, e.g. in grad school I spent a lot of time exploring proposal distributions that respect the PSD constraint of high-dimensional correlation matrices, so the struggle in 5.3 echoes old frustrations haha).

Out of curiosity, is the desire to avoid just throwing longer chains at an inference problem motivated by concerns for efficiency, or something else? Using Met-Hastings in phylogenetics it was customary to just run longer and longer analyses, hundreds of millions of iterations long, when some complex posterior geometry proved especially unyielding. Algorithmic details notwithstanding, are there other concerns with that strategy when using HMC / NUTS? Specifically, the model that I’m hoping to fit is this one (https://pastebin.com/Wr32H1Z9), the sum of a logistic growth and decay function, except with varying intercepts (of dimension bajillion) on various parameters. In simulation, I find the (non-hierarchical) model fits super easily when presented with lots of data (like 128 whole observations… though with lots of non-identifiability across certain focal parameters, e.g. https://i.imgur.com/HVWW8Or.png), but with few data points (e.g. https://i.imgur.com/6e8NqOt.png — 90% HPDIs and posterior predictive intervals of a posterior fit plotted), I find that I’m often not passing even the most basic of diagnostics until hitting 100s of thousands of iterations (having set up a little while() loop to do some automated checking and increase adapt_delta, chain length, max_treedepth, etc. as warranted, depending on what runtime warnings get reported).

(this is all on lots of replicated simulated data to explore the statistical properties / viability of this model, before applying it to the big empirical data at hand — the hope being that the strong priors on the bottom-level parameters get learned in the course of fitting)
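For readers who don't follow the pastebin link, a curve of that growth-plus-decay shape looks roughly like this (hypothetical parameterization invented for illustration; the real model is at the link):

```python
import numpy as np

def growth_plus_decay(t, A, k_g, t_g, k_d, t_d):
    """Sum of a logistic growth curve and a logistic decay curve.
    Hypothetical parameterization, just to illustrate the shape;
    the actual Stan model is in the linked pastebin."""
    growth = A / (1.0 + np.exp(-k_g * (t - t_g)))   # rises from 0 to A
    decay = A / (1.0 + np.exp(k_d * (t - t_d)))     # falls from A to 0
    return growth + decay

t = np.linspace(0, 10, 101)
y = growth_plus_decay(t, A=1.0, k_g=2.0, t_g=3.0, k_d=2.0, t_d=7.0)
# hump-shaped: higher in the middle than at either end
assert y[50] > y[0] and y[50] > y[-1]
```

The non-identifiability complaint above is easy to see in this form: with few observations, trade-offs between the growth and decay components (e.g. shifting t_g and t_d together) can produce nearly identical curves.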

But the full hierarchical model won’t ever actually sample, because it seems like it’s too big (given prior experience working with smaller data in Stan). Even something like:

y ~ normal(mu, sigma) //assume all y are mean-centered and rescaled to unit variance

mu = a_i + b_i*x //ditto all x

a_i ~ normal(0,sigma_a)

b_i ~ normal(0,sigma_b)

sigma_a ~ half-cauchy(0,1) //or I guess half-cauchy(1) depending on taste

sigma_b ~ half-cauchy(0,1)

Seems like it would never get off the ground if _i indexes, like, a million groups. Right? Thus my initial thinking that if in the simple linear model above, sigma_a and sigma_b can be reliably learned from just some small subset of _i, I could do away with the hierarchical structure and fit the wholly flat model in parallel on each group simultaneously. If all the a_i and b_i only affect each other through the sigmas, there’s no benefit to fitting them jointly?
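FWIW, conditional on fixed sigma_a and sigma_b the groups really are independent, and in the Gaussian toy model above each group's (a_i, b_i) posterior is even available in closed form, so the fit-in-parallel idea is exact there rather than an approximation. A sketch with simulated data (all numbers invented):

```python
import numpy as np

def group_posterior(x, y, sigma, sigma_a, sigma_b):
    """Exact Gaussian posterior for one group's (a_i, b_i) in the flat
    linear model y ~ normal(a_i + b_i*x, sigma), with the hierarchical
    scales held fixed. Conditional on sigma_a/sigma_b the groups don't
    talk to each other, so each group can be fit in parallel."""
    X = np.column_stack([np.ones_like(x), x])
    prior_prec = np.diag([1 / sigma_a**2, 1 / sigma_b**2])
    prec = X.T @ X / sigma**2 + prior_prec      # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (X.T @ y) / sigma**2           # ridge-style shrinkage
    return mean, cov

# hypothetical single group: recover its intercept and slope
rng = np.random.default_rng(2)
x = rng.normal(size=200)
a_true, b_true = 0.5, -1.2
y = a_true + b_true * x + rng.normal(0, 1.0, size=200)
mean, cov = group_posterior(x, y, sigma=1.0, sigma_a=1.0, sigma_b=1.0)
assert abs(mean[0] - a_true) < 0.3 and abs(mean[1] - b_true) < 0.3
```

The caveat is the conditioning: if sigma_a/sigma_b are fixed at point estimates rather than integrated over, the per-group posteriors will be slightly overconfident about the hyperparameters.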

—

Nikolai:

You do not want to run for 10,000 iterations. Can you try constraining the model with strong priors? Another approach is to split and combine the data, as here: http://www.stat.columbia.edu/~gelman/research/published/ep_life_jmlr.pdf
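To give a flavor of the split-and-combine idea, reduced to its simplest ingredient: per-shard Gaussian posterior approximations multiplied together, so precisions add. The EP algorithm in the linked paper iterates this with cavity distributions; treat this as a sketch under a flat prior, with made-up data:

```python
import numpy as np

def combine_gaussians(means, variances):
    """Combine per-shard Gaussian posterior approximations by
    multiplying densities: precisions add, and the combined mean is
    precision-weighted. This is one ingredient of split-and-combine
    schemes, not the full EP algorithm."""
    prec = 1.0 / np.asarray(variances)
    var = 1.0 / prec.sum()
    mean = var * (prec * np.asarray(means)).sum()
    return mean, var

# hypothetical: 4 shards of N(theta, 1) data with theta = 2 and a flat
# prior; each shard's posterior for theta is N(shard mean, 1/n_shard)
rng = np.random.default_rng(3)
shards = rng.normal(2.0, 1.0, size=(4, 250))
shard_means = shards.mean(axis=1)
shard_vars = np.full(4, 1.0 / 250)
mean, var = combine_gaussians(shard_means, shard_vars)
# with conjugate Gaussians the combination matches the full-data posterior
assert abs(mean - shards.mean()) < 1e-6
assert abs(var - 1.0 / 1000) < 1e-9
```

In the Gaussian case this recombination is exact; the work in the paper is about making the idea behave when the shard posteriors are only approximately Gaussian.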

—

Generally too slow to mix over the super high-dimensional parameter space! Sometimes I get lots of divergent iterations, but then usually increasing the target acceptance rate and using a non-centered parameterization is enough to fix that. Otherwise it just takes too long to fit (-> pass MCMC diagnostics across multiple independent chains), having to sample a million parameters at a time or whatever. I’m not sure what’s meant by programming difficulty, but I can specify & implement the model in Stan with fewer instances of the different categories; it’s just scaling it up to the full parameter space that makes it slow (slow as in getting 10,000 iterations takes months to finish).

—

> this ends up being utterly intractable to fit in Stan

Intractable because it is too slow? Or lots of divergences and hard to figure out what’s going on? Or too difficult to program?

—

(alternatively, I’ve also considered fitting a buncha sub-models to the full data, each with some subset of the focal parameters, and then trying to interpret focal quantities across posterior predictive simulation averaged across them, which I’d usually have implemented and fit anyway during iterative model construction, but that seems less satisfying. And then the last approach I’ve used is just to fit multiple chains for weeks at a time — but that sucks!)

—

One idea I’ve had has been empirical Bayes flavored — fit the full model to lots of random subsets of the data, say 100-genes at a time, and then characterize the distribution of estimated model hyperparameters. Then, fit each independent model (gene-by-gene, say) given whatever informative hyperprior is induced, adding another level to the hierarchy. Would this be a reliable approximation to the full hierarchical model? Is there a better alternative approach?
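A toy version of that subset-then-hyperprior workflow, on a stripped-down one-level normal model with invented sizes (a method-of-moments estimate stands in for the actual Stan subset fits):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma_a_true = 1.5                       # hyperparameter we want to learn
n_groups, n_obs = 100_000, 20
a = rng.normal(0.0, sigma_a_true, size=n_groups)
y = a[:, None] + rng.normal(0.0, 1.0, size=(n_groups, n_obs))  # sigma = 1

def estimate_sigma_a(group_means, sigma=1.0, n_obs=20):
    # method-of-moments: Var(group mean) = sigma_a^2 + sigma^2 / n_obs
    v = group_means.var(ddof=1) - sigma**2 / n_obs
    return np.sqrt(max(v, 0.0))

# "fit" the hyperparameter on many random subsets of 100 groups each,
# then characterize the distribution of the estimates
subset_estimates = []
for _ in range(200):
    idx = rng.choice(n_groups, size=100, replace=False)
    subset_estimates.append(estimate_sigma_a(y[idx].mean(axis=1)))
subset_estimates = np.array(subset_estimates)

# the induced hyperprior: centered near the truth, with a spread that
# quantifies how well a 100-group subset pins sigma_a down
assert abs(subset_estimates.mean() - sigma_a_true) < 0.15
```

In the real workflow each subset estimate would come from a full Stan fit rather than a moment formula, and the empirical distribution of those estimates (its center and spread) would become the informative prior on sigma_a in the gene-by-gene fits. The usual empirical-Bayes caveat applies: reusing the data to set the prior and then conditioning on it understates hyperparameter uncertainty somewhat.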
