Thank you for the link! I will give it a look.

(also see some typos in my earlier comment — was distracted on mobile. Sorry! Had meant to write:

>since occasionally the non-*centered* form will sample more efficiently.

>specify the model in its fully *centered* form (which, read aloud, is often much easier for me to parse, at least)

)

]]>You might be interested in Maria Gorinova’s paper with Matt Hoffman and me from a couple of years ago on ‘Automatic Reparameterisation of Probabilistic Programs’ (https://arxiv.org/abs/1906.03028). A few of the takeaways:

– Converting a model from centered to non-centered parameterizations is quite mechanical, while the reverse in general requires a symbolic algebra system, so you want to start with the centered parameterization.

– A model with N sampling sites has 2**N ways to choose centering vs non-centering, and that’s intractable to search over in general, but you can do pretty well with a gradient-based / hill-climbing optimization strategy.

– With an automatic search you can actually find intermediate parameterizations, partway between centered and non-centered, that work better than either extreme. Intuitively: non-centering works well when you have weak or no evidence (so the posterior is not very different from the prior), but poorly when you have very strong evidence (because it induces coupling in your latent variables via ‘explaining away’ effects), and in real models the strength of the evidence is usually neither zero nor infinity, but somewhere in between.
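To make the intermediate parameterizations concrete, here is a small numpy sketch of one way to interpolate — this is my reading of the paper’s partially-centered scheme, so treat the exact form as an assumption. `lam = 1` gives the centered form, `lam = 0` the non-centered one, and every `lam` in between leaves the marginal distribution of `x` unchanged:

```python
import numpy as np

def partially_centered_sample(mu, sigma, lam, rng):
    """Draw x ~ Normal(mu, sigma) via a lam-interpolated parameterization.

    lam = 1.0 -> centered:      x_tilde ~ Normal(mu, sigma), x = x_tilde
    lam = 0.0 -> non-centered:  x_tilde ~ Normal(0, 1),      x = mu + sigma * x_tilde
    """
    # Auxiliary draw: x_tilde ~ Normal(lam * mu, sigma ** lam)
    x_tilde = lam * mu + sigma ** lam * rng.standard_normal()
    # Deterministic recovery of x; algebraically this collapses to
    # mu + sigma * z, so x ~ Normal(mu, sigma) for every lam in [0, 1].
    return mu + sigma ** (1.0 - lam) * (x_tilde - lam * mu)

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0
results = {}
for lam in (0.0, 0.5, 1.0):
    draws = np.array([partially_centered_sample(mu, sigma, lam, rng)
                      for _ in range(20000)])
    results[lam] = (draws.mean(), draws.std())
    print(lam, results[lam])
```

The implied distribution is identical for every `lam`; what differs is the geometry the sampler sees, which is exactly what the search over `lam` exploits.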

]]>One thing I’ve wondered about before involves the automated exploration of centered and non-centered parameterizations of hierarchical (or flat) models. I feel that when implementing these in Stan, I must often rely on vague heuristics for when one form or the other is appropriate, and sampling of some parameters in the same model may benefit from the non-centered form where sampling for other parameters may not. Often the final arbiter of what’s appropriate where is observed performance, since occasionally the non-hierarchical form will sample more efficiently.

Since reparameterization is such a straightforward & rote procedure, why not incorporate it into something like the warm-up process? A user could specify the model in its fully non-centered form (which, read aloud, is often much easier for me to parse, at least), and an algorithm could toggle different parameterizations on and off, trying to adaptively identify which transformed target distributions are easiest to explore (according to convergence & mixing diagnostics like E-FMI or whatever). Is the reason this is not done just that warm-up would have to start from scratch with each parameterization (but would it)?

Even so, it seems like something that would be fairly straightforward to implement, even at the cost of lots of early redundancy (certainly faster than doing everything by hand). Just something that enumerates (some subset of) possible non-centered parameterizations, launches however many independent samplers, and then prunes the ones that are sampling poorly might offer considerably improved convenience and efficiency.
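The enumerate-and-prune idea could be sketched roughly like this — everything here is a made-up placeholder (the site names, and `score` standing in for a real mixing diagnostic like E-FMI or effective sample size from actual sampler runs):

```python
import itertools

# Hypothetical sampling sites in some hierarchical model.
sites = ["mu", "tau", "theta"]

# Every assignment of centered (True) / non-centered (False) per site:
# this is the 2**N combinations mentioned above.
configs = list(itertools.product([False, True], repeat=len(sites)))
assert len(configs) == 2 ** len(sites)

def score(config):
    # Placeholder diagnostic: here we just pretend non-centering helps
    # every site. A real version would launch a sampler per config and
    # return a mixing statistic.
    return sum(not centered for centered in config)

# "Launch" a sampler per configuration, then prune to the best mixers:
ranked = sorted(configs, key=score, reverse=True)
survivors = ranked[:2]
print(survivors)
```

Even this brute-force version only makes sense for small N; past a handful of sites you would need the gradient-based search from the paper rather than full enumeration.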

]]>Would it take more computation to fully characterize the joint distribution of a collection of p independent random variables than to do each marginally? Specifically, thinking in curse-of-dimensionality terms: would assurance that one has mixed adequately over the whole target distribution not involve some check that you haven’t completely ignored some lonely orthant of the high-dimensional space (i.e., if you don’t know a priori that the parameters are indeed independent)? Or would it just scale linearly with p?

Most of the time I see a high-dimensional parameter space, it is in the context of a hierarchical model. It seems like if the hyperparameters are well identified, adding more members to some population of effects won’t induce interdependence between parameters in the same way that adding more top-level parameters would.

]]>This is a good first-order explanation, but for models with many parameters, dependencies between parameters will often make the sampler slower, as it has to take smaller timesteps and do more of them per iteration. Bigger models are harder to sample in general.

]]>I think a lot could be done to give hints, Bob would be the person to ask. But the real reason it can’t be done automatically is because you want the **posterior** distribution to be on unit scale, and you can’t really automatically know the posterior scale without running the sampler.

]]>Naive question: why can’t this be done automatically by the code behind the scenes?

Can this be done as say a pre compiler optimization?

]]>The “longer time” in this post isn’t wall time, but rather number of iterations. HMC should mix at the same rate per-iteration on arbitrary dimensionality, but yeah each iteration is going to take longer on larger data and dimensionality due to all the extra computation.

]]>The reason to put everything on unit scale is so that MCMC proposals that are relatively “isotropic” will work well. It’s a sampling issue, not a modeling issue. If you have one parameter that’s 1e6 +- 1e5 and another parameter that’s .002 +- .0003, then it’s really hard for a sampler to figure out it should propose a movement that’s around 100,000 in one dimension and around .0001 in another.
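Here is a small numpy illustration of exactly that failure mode, using the two scales above with a toy random-walk Metropolis sampler (the sampler and step sizes are my own stand-ins for illustration, not anything Stan does internally):

```python
import numpy as np

# Hypothetical 2-D Gaussian posterior with the mismatched scales above:
# one parameter 1e6 +- 1e5, the other .002 +- .0003.
MEANS = np.array([1e6, 0.002])
SDS = np.array([1e5, 0.0003])

def log_post(x):
    return -0.5 * np.sum(((x - MEANS) / SDS) ** 2)

def mh_acceptance(step, scales, n=2000, seed=0):
    """Random-walk Metropolis acceptance rate with an isotropic proposal
    of width `step`, applied per-dimension after multiplying by `scales`."""
    rng = np.random.default_rng(seed)
    x = MEANS.copy()
    accepts = 0
    for _ in range(n):
        prop = x + step * scales * rng.standard_normal(2)
        if np.log(rng.uniform()) < log_post(prop) - log_post(x):
            x = prop
            accepts += 1
    return accepts / n

# One isotropic step size in the raw space: tuned for the big dimension,
# it overshoots the tiny one by ~8 orders of magnitude, so nearly
# everything is rejected.
raw = mh_acceptance(step=1e5, scales=np.ones(2))

# Same sampler after rescaling each dimension to unit scale: a single
# step size now serves both dimensions.
unit = mh_acceptance(step=1.0, scales=SDS)

print(f"raw acceptance: {raw:.3f}, rescaled acceptance: {unit:.3f}")
```

The rescaled run accepts a healthy fraction of proposals while the raw run accepts essentially none, which is the “hard for a sampler to figure out” problem in miniature.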

]]>I guess if you JUST read the warnings, you could come away assuming the defaults are what you find people doing. But if you read any manuals, guides, tutorials, recommendations, papers, etc., then the defaults are as you presented.

So good work Stan peeps.

]]>I honestly blamed (and continue to blame) myself for being a poor trainer, but now I can also blame the Stan warnings (but not the developers, you guys are great!).

]]>This shift is attributed to many factors: the idea of workflow integrates computing into the bigger modeling problem; the flexibility of the Bayesian formulation enables the free creation of more models; referees’ reports take longer to arrive; and we have more papers to write, so there is no free computer sitting around waiting on the referees’ reports.

]]>Andrew (other):

It’s a waste of time because “just letting the computer get on with it” won’t get me anywhere! It just puts off the day of reckoning when you have to figure out what’s wrong with the model. If you have poor mixing and you run it overnight for a zillion iterations, then you’ve wasted a day.

]]>+1

Some combination of these things is what I tend to do myself and what I recommend when working with others. Things like bugs or typos should ideally have been ironed out by building tiny versions that exemplify each moving part of the model and then running those parts on some simulated data so you can verify each component is working as intended. But neither I nor anyone I know is always that disciplined, so it is crucial to keep in mind that odd behavior can sometimes be traced back to using an “i” instead of a “j” somewhere.

I tend to think of 1 and 5 as different aspects of the same idea: Whether you are changing priors, changing scales, or reparameterizing, you are ultimately changing the *meaning* of the quantities in the model. While these are mechanically different, they all represent what you might call “semantic” fixes. By changing the meaning of what the model is meant to represent, you are also changing various technical features of the model to make it run better.

A side benefit to doing these “semantic” fixes is that they are deepening your understanding of the model and its relationship to the data. This guides you in steps 6/7 when you explore more substantive changes to model structure.
