In response to the above question, Aki writes:

I recommend as an easy default option

real nu;

nu ~ gamma(2, 0.1);

This prior was proposed and analysed by Juárez and Steel (2010) (Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics 28, 52–66.). Juárez and Steel compare it to the Jeffreys prior and report that the difference is small. Simpson et al (2014) (arXiv:1403.4630) propose a theoretically well justified “penalised complexity (PC) prior”, which they show to behave well for the degrees of freedom, too. The PC prior might be the best choice, but it requires numerical computation of the prior (which could be computed on a grid and interpolated, etc.). It would be feasible to implement in Stan, but it would require some work. Unfortunately no one has compared the PC prior and this gamma prior directly, but based on discussion with Daniel Simpson, although the PC prior would be better, this gamma(2, 0.1) prior is not a bad choice. Thus, I would use it until someone implements the PC prior for the degrees of freedom of the Student’s t in Stan.
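(For intuition, here is a quick Monte Carlo sketch, not part of Aki’s comment, of where the gamma(2, 0.1) prior puts its mass. Note that Stan parameterises the gamma by shape and rate, while Python’s `gammavariate` takes shape and scale, so rate 0.1 becomes scale 10.)

```python
import random

random.seed(0)

# Stan's gamma(2, 0.1) is shape alpha = 2, rate beta = 0.1.
# Python's gammavariate(alpha, beta) uses beta as the SCALE, so scale = 1/0.1 = 10.
draws = [random.gammavariate(2.0, 10.0) for _ in range(100_000)]

mean = sum(draws) / len(draws)                          # analytic mean is shape/rate = 20
share_small = sum(d < 2 for d in draws) / len(draws)    # mass on very heavy-tailed t's
share_large = sum(d > 50 for d in draws) / len(draws)   # mass on near-Gaussian nu
```

So the prior concentrates on moderate degrees of freedom (prior mean 20), with a little mass below nu = 2 (infinite-variance territory) and a little above 50 (essentially Gaussian).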

Thanks Aki / Andrew!

We’d love to implement some of these things in Stan, but it’s not currently clear how to go about that. (I guess this is a more general question for the Stan guys: how do we add priors? In INLA there’s a scripting option, as well as a “tabulate” option that does the log-linear interpolation internally. Is there any plan to add something along those lines?)

As for the gamma(2,0.1) prior, the primary feature that separates it from the PC prior is the tail. We showed that to keep mass around the Gaussian, you can’t have a prior on nu with a finite expectation. So perhaps something with a heavier tail would perform better.
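To make the tail point concrete, here is a small stdlib-Python check (a sketch, not from the original comment) of how light the gamma(2, 0.1) tail is, and that its expectation is finite:

```python
import math

# Gamma(shape=2, rate=b) has the closed-form survival function
#   P(nu > x) = (1 + b*x) * exp(-b*x),
# and mean shape/rate, which is finite -- exactly the property the
# PC-prior argument says you cannot have if you want to keep mass
# around the Gaussian (nu -> infinity) limit.
b = 0.1

def survival(x):
    return (1.0 + b * x) * math.exp(-b * x)

prior_mean = 2.0 / b          # = 20, finite
tail_mass = survival(100.0)   # P(nu > 100), about 5e-4
```

With only about 0.05% prior probability on nu > 100, near-Gaussian models get very little mass under this prior.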

If someone actually wants to do the tests, all of the code is available here: http://www.math.ntnu.no/inla/r-inla.org/case-studies/Martins-etal-2014/t/

To produce the figure from the paper with the gamma(2,0.1) prior, you should only need to change one line in the new.prior() function.

Pedantic nitpick time!

‘Definition 1. A prior π(θ) forces overfitting (or overfits) if the density of the prior is zero at the base model. This implies vanishing prior mass in the neighbourhood of the base model which means that the posterior cannot concentrate on the base model even when it is the true model.’

The final clause of the definition doesn’t follow from the first one: You can poke holes in a density — change its values in a set of measure zero — without changing anything substantive. Definition 1 ought to leave out mention of the density and just target consistency under the base model directly: ‘Definition 1. A prior π(θ) forces overfitting (or overfits) if the posterior cannot concentrate on the base model even when it is the true model.’

Hence an informal definition.

We know that the correct condition is that the prior puts sufficient mass in a KL ball around the solution (well, actually, that’s not correct either, but I don’t care about logarithmic factors), but we don’t think that’s a) as clean, or b) as useful as the version we have.
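For the record, the “sufficient mass in a KL ball” condition being alluded to is, stated loosely (constants and the usual second-moment refinement suppressed), the standard prior-mass condition for posterior concentration:

```latex
% Prior-mass condition for posterior concentration at rate \varepsilon_n
% (Ghosal--Ghosh--van der Vaart style, stated loosely):
\Pi\bigl(\theta : \mathrm{KL}(p_0 \,\|\, p_\theta) \le \varepsilon_n^2\bigr)
  \;\ge\; e^{-C n \varepsilon_n^2}
```

The point in the text is that verifying this for a single flexibility parameter requires knowing the local KL geometry, which is exactly what is unavailable in practice.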

The aim is to avoid detailed answers to unimportant questions and head instead for useful answers to useful questions.

Just to be a touch more expansive here, the stated aim of the paper is to build up, component-by-component, a practical, principled (nb: not axiomatic – none of us are Kolmogorov) way of constructing weakly informative (subjective) priors for parameters in practically useful models. These take into account a) the structure of the model, b) the idea of a simpler “base” model, c) a notion of deviation from that model, and d) some idea of strength of penalisation.

Because of this, we have to be somewhat “loose” in our definitions and we can’t shoot for the minimum. Because we don’t know what the KL ball looks like for that parameter (it will depend on the global structure in a way that I’m pretty sure hasn’t been studied), we can’t just say “sufficient mass” (which would correspond to the density not decaying too quickly at the base model [polynomial is probably fine here]), because we don’t have a good feel for what that means. The best that we can do is say that if the density is non-zero at the base model and it decays “not too quickly” in the local-KL scale, then everything will probably be fine.

To be more specific, if we consider a scale mixture of a Gaussian

x | sigma ~ N(0,sigma^2)

sigma ~ pi(sigma),

there is a big difference between

1) a polynomial decay of pi(sigma) to zero around sigma = 0,

2) pi(sigma) \in (0, \infty) at zero, and

3) pi(sigma) = O(sigma^{-alpha}), alpha \in (0,1) near sigma=0

These cases correspond to a marginal prior on x, denoted pi(x), that has

1) pi(x) non-zero and finite at zero

2) pi(x) has a logarithmic spike at zero

3) pi(x) = O(|x|^{-alpha}) as |x| -> 0.

So the two cases that are probably ok from an “asymptotics” point of view (1 and 2) give very different behaviours, with the second being more conservative in that it shrinks small signals towards zero strongly, while keeping Gaussian tails for the others. (This is ok in low dimensions; heavy tails are needed when setting independent priors in very high dimensions, as you need to basically mimic the inherent dependency required to get sparse models.)

The specific statement of this is Theorem 4 of the first version of the PC prior paper [http://arxiv.org/pdf/1403.4630v1.pdf], which isn’t in the modern version (except hidden in the proof of Theorem 4 of the new version) as the referees didn’t want it in the paper.

Ugh, it’s annoying when the referees want material removed and then later you need to refer to the pre-journal version because people keep asking about that specific thing…

Arguably this is my fault. I thought it was obvious why a result like this was meaningful (I was wrong). And then I thought that if the reviewers thought it was too specific for the paper and not interesting, then it didn’t need to be there (I was wrong again).

But sometimes that’s how it goes :)

The definition I suggested is even simpler than the original and has the virtue of not stating a falsehood…

Well no. It’s not simpler to use and actually leads to you needing to consider what “sufficient mass” is. It’s also fundamentally a different thing.

The informal definition is usefully wrong. And that’s not an accident. It also makes sense within the goals of the paper.

But thanks for the discussion – it’s clear that this needs to be signposted better. As I was saying to Aki above, it’s hard to anticipate what people’s specific problems will be.

Whaaaa…!? If you remove from the original definition the phrase ‘…the density of the prior is zero at the base model. This implies vanishing prior mass in the neighbourhood of the base model which means that…’ you’ll arrive at my suggestion. It literally contains only words found in the original definition, in the same order.

The problem I see with the original phrasing is that people who don’t realize there’s a falsehood will learn the falsehood, and then they’ll be confused later; and people who do realize what the falsehood is will find it to be a puzzling lapse in an otherwise valuable contribution.

I don’t know what the PC prior is. There are two ways to implement a function in Stan — either as a function in the Stan language itself or under the hood in C++. In both cases, the major concern is differentiability. For speed, the best thing to do is include analytic derivatives on the C++ side. For functions in C++, you need to rebuild everything on top — we don’t yet have a simple extension mechanism for user-contributed functions that aren’t part of Stan itself.

There’s a GitHub Wiki page on functions in C++ (along with a lot more about contributing in C++ to Stan): https://github.com/stan-dev/stan/wiki/Contributing-New-Functions-to-Stan

And the Stan manual for functions in Stan itself: http://mc-stan.org/manual.html

Bob:

The PC prior and its computational challenges are discussed in the above post!

Thanks Bob!

For some of the other cases, the scripting language is probably enough, but for the d.o.f. parameter the prior has to be computed numerically, which probably makes it necessary to do in C++.

We (Mark Steel and I) recently explored another “weakly informative” prior for the degrees of freedom of the Student-t distribution. The idea is to set a uniform prior on a bounded (and interpretable) measure of kurtosis for the Student-t density, which is a one-to-one function of the degrees of freedom. Given that the degrees of freedom control the tail/kurtosis of the density, this strategy assigns a flat prior to an interpretable function of the degrees of freedom. This induces a proper prior on the degrees of freedom that can be approximated using a kernel estimator.

This strategy also scales the influence of the degrees of freedom on the shape of the density. For instance, values of the df in [0,1] have a greater influence on the shape of the density than values in [100,101]. This is naturally taken into account by the measure of kurtosis.

In case this is of interest:

http://arxiv.org/abs/1406.7674
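The change-of-variables mechanics behind “uniform prior on a bounded functional of nu” can be sketched in a few lines of Python. The bounded map g below is a made-up placeholder, not the kurtosis measure from the paper; the point is only how the induced prior on nu arises.

```python
# Sketch: put a Uniform(0, 1) prior on a bounded one-to-one map g(nu)
# and read off the induced prior on nu by change of variables.
# g(nu) = nu / (1 + nu) is a HYPOTHETICAL placeholder map from (0, inf)
# to (0, 1), NOT the kurtosis measure used by Rubio and Steel.

def g_prime(nu):
    # derivative of g(nu) = nu / (1 + nu)
    return 1.0 / (1.0 + nu) ** 2

def induced_prior(nu):
    # If g(nu) ~ Uniform(0, 1), then pi(nu) = |g'(nu)| * 1.
    return g_prime(nu)

# Midpoint-rule check that the induced density on nu is proper:
# it should integrate to (almost) 1 over (0, 1000).
total = sum(induced_prior(0.005 + 0.01 * i) * 0.01 for i in range(100_000))
```

With the real kurtosis measure the derivative would not be available in closed form, which is where the kernel approximation mentioned above comes in.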

Thanks Javier – I look forward to reading this!

I like already that you defined the prior on a meaningful parameter and transformed back. I think that’s vital for a problem like this!

Thanks.

We did something similar with the skewness parameter.

I will try to take a closer look at your paper. I noticed that you also mention a possible use of your approach for constructing priors for skewness and kurtosis parameters. This is an interesting (to me, at least) problem since, in many “flexible” distributions, the roles of the shape parameters are not separate: in the skew-t by Azzalini and Capitanio, or the SAS by Jones and Pewsey, for example, the “skewness” parameter controls the asymmetry, the mode, the spread, the tail behaviour in each direction, and the shape around the mode (basically, everything). In the distributions we analysed in the manuscript, each parameter plays a clear and separate role.

Anyway, I should just tell you these things in the Common Room not in a blog!

In “Bayesian robust inference of sample selection using selection-t models”, I used

nu ~ gamma(1, 0.1).

http://www.sciencedirect.com/science/article/pii/S0047259X1300256X#

The mean of the prior is 10. What is more, it has the following quantiles:

> qgamma(0.025, 1, 0.1)

[1] 0.2531781

> qgamma(0.975, 1, 0.1)

[1] 36.88879

Therefore, my prior covers both extreme cases, placing mass on nu below 1 as well as on large values. For comparison, the gamma(2, 0.1) prior has quantiles:

> qgamma(0.025, 2, 0.1)

[1] 2.422093

> qgamma(0.975, 2, 0.1)

[1] 55.71643

It is unnecessary to allow nu to be larger than 30 or 50, but it is very important to allow nu to be smaller than 1.

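For readers without R handy, those qgamma() values can be reproduced with stdlib Python (a sketch, not part of the original comment), by bisecting the closed-form Gamma CDFs for shapes 1 and 2:

```python
import math

# Gamma(shape k, rate b) CDF has a closed form for integer shapes:
#   k = 1: F(x) = 1 - exp(-b*x)
#   k = 2: F(x) = 1 - (1 + b*x) * exp(-b*x)
def gamma_cdf(x, shape, rate):
    if shape == 1:
        return 1.0 - math.exp(-rate * x)
    if shape == 2:
        return 1.0 - (1.0 + rate * x) * math.exp(-rate * x)
    raise ValueError("closed form implemented for shapes 1 and 2 only")

def gamma_quantile(p, shape, rate):
    # Bisection on the CDF; [0, 1000] comfortably brackets these quantiles.
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q1_lo = gamma_quantile(0.025, 1, 0.1)   # ~0.2532, matches qgamma(0.025, 1, 0.1)
q1_hi = gamma_quantile(0.975, 1, 0.1)   # ~36.89,  matches qgamma(0.975, 1, 0.1)
q2_lo = gamma_quantile(0.025, 2, 0.1)   # ~2.422,  matches qgamma(0.025, 2, 0.1)
q2_hi = gamma_quantile(0.975, 2, 0.1)   # ~55.72,  matches qgamma(0.975, 2, 0.1)
```

This makes the contrast explicit: gamma(1, 0.1) puts 2.5% of its mass below nu = 0.25, whereas gamma(2, 0.1) puts 2.5% below nu = 2.4, effectively ruling out nu < 1.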

Could you expand on the statement “…, but it is very important to allow nu to be smaller than 1.”

Peng:

You write, “it is very important to allow nu to be smaller than 1.”

Really? I’d never heard of that before, of using a t with df less than 1.

Can you explain further?

For good practical results I’ve used an exponential prior on nu-1 (where nu is the df parameter for the t distribution). Of course, an exponential prior is just a special case of a gamma distribution with mean=sd, so it can be generalized to a gamma. In particular, I’ve used mean=sd=29, so that the prior mean of nu is 30, which implies that highly kurtotic and nearly normal t distributions are about equally weighted by the prior.

nu-1 ~ dgamma( mean=29 , sd=29 )

is approx. same as

nu-1 ~ dgamma( shape=1 , rate=0.0345 )

which is same as

nu-1 ~ dexp( 1/29 )

More at http://doingbayesiandataanalysis.blogspot.com/2015/04/informed-priors-for-bayesian-comparison.html
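The (mean, sd) to (shape, rate) conversion behind the equivalence above is just two formulas; a quick stdlib check (my sketch, not from the blog post):

```python
# A Gamma with mean m and sd s has shape = m^2 / s^2 and rate = m / s^2.
# With m = s = 29 the shape collapses to 1, i.e. an exponential with
# rate 1/29, matching the dexp(1/29) line above.
m, s = 29.0, 29.0
shape = (m / s) ** 2
rate = m / s ** 2
```

So dgamma(mean=29, sd=29) is not merely approximately dexp(1/29): mean = sd forces shape = 1 exactly.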

We tried this, and found that these priors had bad coverage for models away from the mean. (See Figure 4 of http://arxiv.org/pdf/1403.4630.pdf, code to reproduce the figure mentioned in an above comment)

Thanks very much for the specific pointer. I see that the PC prior does better than the exponential, but I also see that an exponential with a mean around 30 (which would be between the blue and purple bars in your Figure 4) does a pretty good job. I hope good enough that I can still say “for good practical results,” especially when the focus is not on recovering the value of the df parameter but on getting an estimate of the mean parameter that’s at least better than assuming normality (df=infinity). But this issue is good to have on my radar for future improvement. Thanks again.

I just like that we’re all talking about informative priors. It’s a step forward from 25 years ago when we were all trying to come up with the right noninformative priors.

Which makes me curious to ask: What is the final application where this particular question arose. i.e. What is the larger application here that this prior is being used to model?

Rahul:

The t distribution comes up in many many applications where we are concerned about possible outliers or a mixture distribution so we want to use a distribution with wider tails than the normal. In any given application, of course it should be possible to use a more informative prior, but just in general a weakly informative prior should be possible given the typical goals of such analyses.

One of the things that we tried to do with the PC priors was to encode this choice into the prior specification. For the prior on nu, we define the base model to be nu = infinity (i.e. the component is Gaussian), and then penalise deviations from this. So basically, the data has the opportunity to depart from Gaussianity, but is not forced to.

This corresponds to the situation where we think things are probably normal, but we are concerned there may be some contamination. In this case, we want to use as much of the information as possible, and therefore we need to avoid falsely high values of nu. Hence the prior we derive is weakly informative in this scenario.

You could imagine the opposite, where the base model is Cauchy. In this case, we are assuming our data is horrible, and we want to shrink towards the model that really only uses the central data and ignores the tails. You could then derive a prior for this case.

Even with a distribution tailored to serve as the prior for the degrees of freedom, wouldn’t it make more sense to directly implement a mixture of t distributions over this prior, marginalized over the mixture weights? If you’re using the former exclusively as a prior in t distributions, it’s not clear what the benefit of keeping it separate is.

Is there a proper theory, system or general principle regarding which information and how much there should be in a weakly informative “default” prior? How does “information” square with “default”?

In some sense, this is what we wrote our paper (http://arxiv.org/pdf/1403.4630.pdf) about.

But the short answer is “it doesn’t”.

For non-trivial models, there is an amount of choice needed to make “weakly informative” priors, which means that you need to check the assumptions before you use them. (Although, this is also true of “objective” priors). Hence, one of the important things for “weakly informative” priors is to state exactly what information is being injected into the system.

As for theory, there isn’t much. It’s either too general to be of use, or too specific to cover small perturbations away from the considered models. The work of van der Vaart and the other Dutch masters demonstrates both of these extremes. The lesson seems to be “put prior mass in the right place”. But that’s not actually enough. It’s been known since Stein that *how* you do this is very very important, but to my knowledge, for even something as simple as a (deep) multilevel model, there isn’t actually theory about how to do this. (I’d love to be wrong here – so please let me know!)

(I apologise for being in “pimp my paper” mode. But it just happens to be really relevant to this discussion)

Interesting. Do you believe that there is a “true model” (which would probably make you a frequentist)? If there is none, why is overfitting (as you define it) a problem?

I don’t care if there’s a true model. But using the mathematics of a true model is often very useful.

Overfitting is a problem because one of the fundamental tasks of statistics is prediction, in which case we want to be able to generalise beyond the current sample and hence we care about overfitting.

More generally, the types of models that we talk about in the paper can be very flexible (mathematically, they’re often singular, which means loosely that different parameters can give the same distribution). In this case, you really need to control things carefully.

Incidentally, I’ve never thought of “overfitting” as a frequentist (or a Bayesian) problem. I think of it as a stability problem. Inference that overfits is unstable to new data (in the guise of a repeated experiment, or sub-sampling).

> I don’t care if there’s a true model. But using the mathematics of a true model is often very useful.

Nicely put.

Here is my recent attempt to animate that idea https://galtonbayesianmachine.shinyapps.io/GaltonBayesianMachine/

(Based on papers by Stephen Stigler – http://onlinelibrary.wiley.com/doi/10.1111/j.1467-985X.2010.00643.x/pdf or http://chance.amstat.org/2011/02/galton/)

This is such an unusual question, I might just have another shot.

Why should you “avoid overfitting” in a context where there is no true model?

My feeling is that “avoiding overfitting” is a similar idea to putting a barrier to increased complexity. So a model that avoids overfitting does its best to ensure that any complexity in the inferred model is forced there by the data rather than by the modelling assumptions.

How to put this into practice is a totally different thing. Bayesians (can/should/do) use priors. But we also use things like cross-validation and predictive checks.

Dan, I think I agree with much of what you just wrote. But then I wonder whether your definition is appropriate. If the problem with overfitting isn’t really that a true simple model, if it exists, will have probability zero, is this a good way to define what overfitting is?

In a prediction context, by the way, it often doesn’t matter too much to have parameters that are truly zero (even if they exist) estimated as 10^{-8} with their posterior being concentrated close to zero.

On the other hand, with a low sample size we may want to enforce parameter estimators to be zero if the true parameter (if it exists) is nonzero but small in absolute value.

I doubt that the “true model” is a good reference point to define what overfitting is.

The problem is not that we “overfit” the truth but rather that we fit more than what the data allow us to fit reliably.

Christian: We never really talk about a true model in the paper (except in the places we have to – some asymptotics in section 2 that aren’t massively important). What we’re trying to do is *exactly* what you’re saying! We want the flexibility parameters to be small (zero) unless the data tells us they aren’t. And so we have to explicitly build that in.

On a different topic, in conclusion to an Oprah-style wish fulfilment exercise, I visualised some learning theory for Bayesian hierarchical models and it appeared. Today on the arXiv. I haven’t read it properly, so I don’t know if it’s good, but at least I know The Secret works! http://arxiv.org/abs/1505.04984

One point: for nu > 2, the variance of a standard t distribution is nu/(nu-2). So if we’re going to set priors on things, it makes sense to rescale the distribution so that it always has unit variance. Our three-parameter t distribution is then

T(mu, sigma, nu) = mu + sigma*\sqrt{(nu-2)/nu} * t_{nu}, nu>2

where t_{nu} is a standard t variable with nu d.o.f.
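A Monte Carlo sanity check of that rescaling (my sketch, using the normal/chi-squared representation of the t; not from the original comment):

```python
import math
import random

random.seed(1)

def standardised_t(mu, sigma, nu, n):
    """Draws from mu + sigma * sqrt((nu-2)/nu) * t_nu, which has
    mean mu and variance sigma^2 for nu > 2."""
    scale = sigma * math.sqrt((nu - 2.0) / nu)
    out = []
    for _ in range(n):
        z = random.gauss(0.0, 1.0)
        # chi-squared with nu d.o.f. as a sum of nu squared standard normals
        v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(int(nu)))
        out.append(mu + scale * z / math.sqrt(v / nu))  # z / sqrt(v/nu) ~ t_nu
    return out

draws = standardised_t(0.0, 1.0, 5, 200_000)
m = sum(draws) / len(draws)
var = sum((d - m) ** 2 for d in draws) / len(draws)
```

Without the sqrt((nu-2)/nu) factor the sample variance would be near nu/(nu-2) = 5/3 for nu = 5; with it, the variance is tied down to sigma^2 regardless of nu.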