Sander Greenland sent me this paper that he wrote with Mohammad Ali Mansournia, which discusses possible penalty functions for penalized maximum likelihood or, equivalently, possible prior distributions for Bayesian posterior mode estimation, in the context of logistic regression. Greenland and Mansournia write:

We consider some questions that arise when considering alternative penalties . . . Taking simplicity of implementation and interpretation as our chief criteria, we propose that the log-F(1,1) prior provides a better default penalty than other proposals. Penalization based on more general log-F priors is trivial to implement and facilitates sensitivity analyses of penalty strength as the number of added observations (prior degrees of freedom) are varied.
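To make the "added observations" reading concrete, here is a minimal sketch. It assumes the standard form of the log-F(m,m) density on a coefficient, proportional to exp(mβ/2)/(1+exp(β))^m, and checks numerically that this penalty equals the binomial log-likelihood of one pseudo-record with m/2 successes in m trials, covariate value 1 for that coefficient and 0 for everything else. The function names are made up for illustration and do not come from the paper.

```python
import math

def log_f_penalty(beta, m=1.0):
    """Log density of a log-F(m, m) prior at beta, up to an additive constant."""
    # log of exp(m * beta / 2) / (1 + exp(beta)) ** m
    return 0.5 * m * beta - m * math.log1p(math.exp(beta))

def pseudo_record_loglik(beta, m=1.0):
    """Binomial log-likelihood (dropping the combinatorial constant) of
    m/2 successes in m trials, with linear predictor equal to beta."""
    p = 1.0 / (1.0 + math.exp(-beta))
    return 0.5 * m * math.log(p) + 0.5 * m * math.log(1.0 - p)

# The two expressions agree for any beta and any m (hypothetical check values).
for m in (1.0, 2.0, 4.0):
    for beta in (-3.0, 0.0, 1.5):
        assert math.isclose(log_f_penalty(beta, m),
                            pseudo_record_loglik(beta, m), abs_tol=1e-12)
```

Under this reading, varying m is the sensitivity analysis: a larger m means more pseudo-observations and a stronger pull toward zero.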

Greenland and Mansournia talk a lot about the penalty function of Firth (1993) and of Jeffreys priors but these don’t interest me much except for historical reasons. With a regression model, the form of the likelihood function, and thus the Jeffreys prior, changes every time you add a predictor so it’s hard to envision these as general defaults.

They argue that their log-F prior works better than other alternatives in the literature, including the weakly-informative Cauchy family recommended in my 2008 paper with Jakulin, Pittau, and Su, and implemented in the bayesglm() function (which in turn is soon to be replaced and improved by an implementation in Stan).

I haven’t thought much about different functional forms. The Cauchy prior does this flat thing near the extremes, which can work well if occasionally there really are infinite or near-infinite coefficients (which can happen if there is some redundancy in a model), but maybe other families such as log-F could do better for most problems. Also, as discussed in that paper with Jakulin et al., our recommended Cauchy(0,2.5) prior really is too weak in general, so I’m open to the idea that there are better default options out there. In my own applied work I’ve been gradually shifting toward the use of stronger priors (indeed, as I’ve noted before, part of this shift came about because Sander Greenland convinced me that in many problems the prior information for some parameters can be much stronger than the information available from the data).
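One way to see the "flat thing near the extremes" is to compare how hard each prior pulls a coefficient back toward zero, that is, the derivative of the log prior density. The sketch below does this for Cauchy(0,2.5) and log-F(m,m), assuming the standard density forms; the function names are illustrative only.

```python
import math

def cauchy_score(beta, scale=2.5):
    """Derivative of the log Cauchy(0, scale) density with respect to beta."""
    return -2.0 * beta / (scale ** 2 + beta ** 2)

def log_f_score(beta, m=1.0):
    """Derivative of the log log-F(m, m) density with respect to beta."""
    # d/dbeta [m*beta/2 - m*log(1 + exp(beta))] = -(m/2) * tanh(beta/2)
    return -0.5 * m * math.tanh(0.5 * beta)

for beta in (0.5, 2.5, 10.0, 50.0):
    print(f"beta={beta:5.1f}  cauchy={cauchy_score(beta):+.4f}  "
          f"logF={log_f_score(beta):+.4f}")
```

The Cauchy pull peaks in magnitude at beta equal to the scale and then fades toward zero, so very large coefficients are barely penalized. The log-F pull levels off at m/2 in magnitude, so large coefficients keep being pushed back at a steady rate.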

Greenland and Mansournia also write that “a properly shifted prior or (more conveniently) no prior be used for intercepts or coefficients that could reasonably have extreme values.” I think this is right, but I’m loath to suggest no prior at all, as this can give degenerate estimates in extreme cases such as data with all successes or all failures. (No one would waste time fitting a logistic regression on such trivial data, but this sort of degenerate pattern can occur in subsets of data, and sometimes a dataset is divided into pieces, with each piece analyzed separately, so we do need to think of how our procedures will work in such cases.) In any case, I agree about translating the model so that coefficients for intercepts and main effects are interpretable. We discussed this point in section 2.1 of our paper.
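The all-successes case can be worked out in closed form for an intercept-only model, which makes the degeneracy easy to see. The sketch below assumes y successes in n trials and, purely for illustration, puts a log-F(m,m) prior on the intercept (Greenland and Mansournia would not necessarily do this). Setting the penalized score to zero gives the mode log((y + m/2)/(n - y + m/2)). The function name is hypothetical.

```python
import math

def penalized_mode_intercept(y, n, m=1.0):
    """Posterior mode of the intercept in an intercept-only logistic model
    with y successes in n trials and a log-F(m, m) prior (m = 0: no prior)."""
    successes = y + 0.5 * m        # observed plus pseudo-successes
    failures = (n - y) + 0.5 * m   # observed plus pseudo-failures
    if successes == 0 or failures == 0:
        # Degenerate case: the likelihood keeps increasing as beta -> +/- infinity.
        return math.copysign(math.inf, successes - failures)
    return math.log(successes / failures)

print(penalized_mode_intercept(5, 5, m=0.0))  # no prior: estimate is infinite
print(penalized_mode_intercept(5, 5, m=1.0))  # finite: log(5.5 / 0.5) = log(11)
```

Even the weak log-F(1,1) penalty is enough to keep the estimate finite here, which is the practical concern when a dataset is split into pieces and some pieces happen to contain only successes or only failures.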

A lot of the details of this remain up in the air so it’s good to see this new paper by Greenland and Mansournia.

This is interesting. We’ve done some work revising the PC prior paper, looking at how the tails of the prior on the standard deviation affect the risk estimates for the normal-normal model (in the Polson and Scott style), and we recovered almost identical graphs of the risk for increasing signal size for the half-Cauchy as we did for the exponential (which we recommended). In fact, the only time the tails of the prior on the standard deviation bothered us at all was in the very high-dimensional case, where you need the heavier tail for shrinkage (our results are too soft to say *how heavy* the tail needs to be).

I think the biggest difference between these and the log-F (if I read the paper correctly – I’ve only had a chance to scan it) is that the log-F density is finite at zero, while a half-Cauchy on the standard deviation (or any prior on the standard deviation with a non-zero, finite density at zero) gives the implied prior on the coefficient a logarithmic spike at zero. So the behaviour of the two priors will be very, very different, especially in the presence of small signals (where the shrinkage of the half-Cauchy will be severe).
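The spike can be checked numerically. The sketch below assumes the coefficient has a normal prior with mean zero whose standard deviation has a half-Cauchy(0,1) prior, and integrates out the standard deviation with a simple trapezoid rule in t = log(sigma). It also evaluates the log-F(1,1) density, which works out to 1/(2π cosh(β/2)). The function names and integration settings are illustrative choices, not anything from the papers under discussion.

```python
import math

def marginal_density(beta, lo=-40.0, hi=40.0, steps=16000):
    """Marginal density of beta when beta | sigma ~ N(0, sigma^2) and
    sigma ~ half-Cauchy(0, 1). Requires beta != 0 (the density diverges there)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h  # t = log(sigma); the 1/sigma in the normal cancels dsigma = sigma dt
        normal_part = math.exp(-0.5 * beta * beta * math.exp(-2.0 * t)) / math.sqrt(2.0 * math.pi)
        prior_part = (2.0 / math.pi) / (1.0 + math.exp(2.0 * t))
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * normal_part * prior_part
    return total * h

def log_f11_density(beta):
    """Density of the log-F(1, 1) distribution at beta."""
    return 1.0 / (2.0 * math.pi * math.cosh(0.5 * beta))

for beta in (1.0, 1e-2, 1e-4, 1e-6):
    print(f"beta={beta:g}  half-Cauchy mixture={marginal_density(beta):.4f}  "
          f"log-F(1,1)={log_f11_density(beta):.4f}")
```

As beta shrinks toward zero the mixture density keeps growing, roughly in proportion to log(1/|beta|), while the log-F(1,1) density settles at 1/(2π).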

I look forward to reading the paper more carefully (although Word makes maths ugly…).