> shouldn’t be looking for the one perfect prior

That comment and the overall post have made me less uncertain that it is really all about getting a good enough (good enough for exactly what?) probabilistic representation (model) of how the data came about and ended up being accessible to you (the analyst).

(e.g. “how you should best represent what you plan to act upon prior to acting in the world. The “how one should best represent” to profitably advance inquiry being logic” http://statmodeling.stat.columbia.edu/2017/09/27/value-set-act-represent-possibly-act-upon-aesthetics-ethics-logic/ )

Unfortunately, the bad metaphysics supporting the idea that there must/should be one perfect/true/best model leads many into an endless, dizzying spiral toward the black hole of certainty – which can never be reached (as it presupposes direct access or correspondence to reality).

An example that comes to mind is Dennis Lindley’s quest for an axiom system for statistical inference, and his reluctance to break from it: https://www.youtube.com/watch?v=cgclGi8yEu4 (e.g. at around 18:45).

What I meant to say was that a light tail means you can’t escape the heavy region around zero without a lot of data, so that region (the bulk of the prior) needs to be wide enough to contain all the reasonable parameter values.

You’re saying the light tail strongly affects inferences around the heavy region towards zero? This is not so intuitive…

I’m not sure if anyone has derived a Jeffreys’ prior for this model. I feel like it would be hard, but I’m genuinely not sure.

As for the case where you’re pretty sure there is over-dispersion, the base model at zero may still do well (in particular it’s useful for the case where your sample doesn’t show much overdispersion). Alternatively, you might want to put the base model somewhere else in the space and build a PC prior off that. An example where this has been done for the correlation parameter in a bivariate normal distribution is here (https://arxiv.org/pdf/1512.06217.pdf).

I think you’ll be fine in this case as long as your tail isn’t too light. Maybe a Student-t with 7 degrees of freedom would be a good idea. But the end point is you should try a couple of priors and see how they go on some existing data. You should also simulate some data that you think is realistic, but isn’t near the base model, and see how the prior performs. If there’s anything that I wish I’d brought out more in the post, it’s the idea that we shouldn’t be looking for the one perfect prior, but rather a set of “good enough” priors that we can compare and check.
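As a quick sketch of why the Student-t-7 suggestion matters (my own illustration, with unit scales assumed, not from the comment itself): drawing from a half-normal and a half-t7 and comparing upper quantiles shows how much further the t tail lets the prior reach if your guessed scale is off.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Draws from two candidate priors for a random-effect standard deviation:
# a half-normal(0, 1) and a half-Student-t with 7 degrees of freedom.
half_normal = np.abs(rng.normal(0.0, 1.0, n))
half_t7 = np.abs(rng.standard_t(7, n))

# Compare upper quantiles: the t7 tail reaches further, so a mis-specified
# scale is penalized less than under the half-normal.
for q in [0.5, 0.9, 0.99, 0.999]:
    print(f"q={q}: half-normal={np.quantile(half_normal, q):.2f}, "
          f"half-t7={np.quantile(half_t7, q):.2f}")
```

The medians are similar, but the extreme quantiles diverge, which is exactly the "try a couple of priors on simulated data" check being suggested.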

This is a problem I’ve often encountered with a half-normal prior. The tails are too light, so if the scale is even slightly wrong you are massively penalized for it.

The exponential seems to do much better, as does the t with 3-7 degrees of freedom (although a lower dof has a higher chance of giving divergences).
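One way to quantify "the tails are too light" (a hypothetical unit-scale comparison, not from the comment): compute the tail mass P(sigma > k) for the three candidates. The half-normal tail collapses orders of magnitude faster than the exponential or half-t.

```python
from scipy import stats

# Tail mass P(sigma > k) for three candidate priors on a scale parameter,
# all with unit scale: half-normal, exponential, and half-Student-t(3).
for k in [2, 4, 8]:
    half_normal = 2 * stats.norm.sf(k)   # folded standard normal
    exponential = stats.expon.sf(k)
    half_t3 = 2 * stats.t.sf(k, df=3)    # folded t with 3 dof
    print(f"k={k}: half-normal={half_normal:.2e}, "
          f"exponential={exponential:.2e}, half-t3={half_t3:.2e}")
```

At k = 8 the half-normal assigns essentially zero mass, which is the "massive penalty" for a slightly wrong scale; the exponential and half-t3 still leave the sampler room.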

Yeah – this is a problem that you get when you transform bounded parameters. Honestly, I don’t know how to deal with this intersection of prior specification and computation, but it’s definitely something that needs thought.

A boomer?

Personally, I care about the topic but got distracted by wondering, what’s the multiplicative inverse of a millennial?

In any case… We are in a somewhat funny situation, where all prior information strongly suggests that the dispersion parameter is >0. Usually the over-dispersion parameter is estimated to be >0 even in relatively large studies, which seems logical to me when we look at medical events happening to patients and do not put much information about the patients in the model. Our prior clearly should not prevent the model from finding the case where there are no random effects, but one worry is that the prior should also not favor that case (or values of the over-dispersion parameter near 0) “too much”. Whatever that means – but in a sense we would get an inappropriately precise / insufficiently uncertain estimate of any treatment effect (if we are talking about a randomized controlled clinical trial) if we concentrate too much posterior mass near zero for the over-dispersion parameter.

I wonder what kind of prior would work sensibly as a weakly informative prior in this sort of setting… Half-normal (or half-T) on the untransformed dispersion parameter (=quite flat towards zero)?!
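On the "quite flat towards zero" point, a small check (my own unit-scale illustration): the half-normal density has zero derivative at the origin, so it is nearly flat over a neighborhood of zero, whereas an exponential starts at its maximum and decays immediately.

```python
from scipy import stats

# Density near zero for unit-scale candidates: the half-normal is flat at
# the origin, while the exponential decays from x = 0 onward.
for x in [0.0, 0.1, 0.5]:
    print(f"x={x}: half-normal={2 * stats.norm.pdf(x):.3f}, "
          f"exponential={stats.expon.pdf(x):.3f}")
```

So a half-normal (or half-t) on the untransformed dispersion parameter does give roughly uniform prior weight across the small values that matter for the "no over-dispersion" case.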

Yes, I did some limited experimentation with a half-normal on the sd for ‘0 true variance random effects’ at some point, and it wasn’t working as well as frequentist random effects. Then if I go to something ‘more frequentist’, like a normal(-1, 10) on the log of the sd, sampling gets hairy and transitions get divergent.

> The first thing is that it should peak at zero and go down as the standard deviation increases. Why? Because we need to ensure that our prior doesn’t prevent the model from easily finding the case where the random effect should not be in the model. The easiest way to ensure this is to have the prior decay away from zero.

That makes some sense in theory. But if there is any posterior mass in a small neighborhood of zero in the constrained space, then there is mass out to negative infinity in the unconstrained space, and there will probably be divergent-transition warnings from Stan in practice. So it seems you have to thread a needle: choose a prior with a peak at zero so as to get a posterior whose mass is bounded away from zero, yet concentrated enough near zero that you discover you are better off without that part of the model.
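A small numerical sketch of the unconstrained-space point (Stan samples a positive-constrained parameter via its logarithm; the draws below are my own illustration, not posterior output):

```python
import numpy as np

rng = np.random.default_rng(1)

# Posterior-like draws of a standard deviation with mass piled up near zero
# in the constrained space.
sigma = np.abs(rng.normal(0.0, 0.01, 10_000))

# On the unconstrained (log) scale that Stan actually explores, mass near
# zero stretches far into the negative tail, which is where the sampler
# struggles and divergences tend to appear.
log_sigma = np.log(sigma)
print(f"min sigma = {sigma.min():.2e}, min log(sigma) = {log_sigma.min():.1f}")
```

A narrow spike near zero on the constrained scale becomes an extremely long, thin funnel on the log scale, which is the geometry behind the divergent-transition warnings.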

Thanks Keith

I think the idea is that there is a lot of similarity between new and old style random effects. So it’s an extension of an existing concept rather than a new one. Also these guys are biostatisticians, who are DRILLED on mixed effects models, so they’re building from a common ground.

I also came away asking “Why refer to new-style random effects as random effects?” instead of coming up with a new name for them, at least in a non-Bayesian setting. For posterior distributions, I think it is fine to refer to margins that are common to all groups and margins that are group-specific.

Hodges and Clayton may redefine effects, but there is still the problem of “random”. What if the effects are deterministic but unknown? In general I prefer “unknown parameters” to “random parameters”.

I was kidding. Which songs?

Thanks. What can I say, I love a theme and am not organized enough for my footnotes to be integers.

Not a girl and largely imaginary. There was an earlier draft before I added more section titles where it looked a lot like I’d just had a bad break up and I’d taken a time machine to 1998. I added some happy songs to balance :p

Thanks for the clarification! I’m not surprised that it works (it’s pretty sensible!). My preference for distributing standard deviations directly is based on generalizability to situations where you don’t want the weights to be a priori exchangeable. This is much more interpretable on a standard deviation scale. But that’s really just choosing what to communicate. The prior comes from the same foundation, it’s just a different expression of the idea.

Now, if you were distributing the total precision across the simplex, I would probably feel differently. An example of a model that does this (or something similar) is the Leroux model in spatial statistics, which I am not fond of. (See equation 3 in this paper https://arxiv.org/pdf/1601.01180.pdf)

Dan seems to prefer a scaled simplex for the vector of K standard deviations, which is almost the same thing. In any event, the decov prior seems to work well, and we have had approximately zero questions on Discourse or the old Google Groups site where people were having trouble fitting a model with stan_[g]lmer and the answer was to fiddle with the hyperparameters of the decov prior.
