Sander Greenland sent me this paper that he wrote with Mohammad Ali Mansournia, which discusses possible penalty functions for penalized maximum likelihood or, equivalently, possible prior distributions for Bayesian posterior mode estimation, in the context of logistic regression. Greenland and Mansournia write:
We consider some questions that arise when considering alternative penalties . . . Taking simplicity of implementation and interpretation as our chief criteria, we propose that the log-F(1,1) prior provides a better default penalty than other proposals. Penalization based on more general log-F priors is trivial to implement and facilitates sensitivity analyses of penalty strength as the number of added observations (prior degrees of freedom) is varied.
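Their point about easy implementation can be made concrete. Here's a quick sketch (my own illustration, not code from the paper) of posterior-mode logistic regression with a log-F(m, m) prior on each slope, added as an explicit penalty term in the objective; the function names and the toy data are made up for the example:

```python
import numpy as np
from scipy.optimize import minimize

def neg_penalized_loglik(beta, X, y, m=1.0):
    """Negative log posterior: Bernoulli log-likelihood plus a
    log-F(m, m) log prior on each slope (intercept left unpenalized).
    The log-F(m, m) log density is (m/2)*b - m*log(1 + exp(b)) + const."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    slopes = beta[1:]  # column 0 is the intercept
    logprior = np.sum((m / 2.0) * slopes - m * np.logaddexp(0.0, slopes))
    return -(loglik + logprior)

def fit(X, y, m=1.0):
    beta0 = np.zeros(X.shape[1])
    res = minimize(neg_penalized_loglik, beta0, args=(X, y, m), method="BFGS")
    return res.x

# Tiny dataset with complete separation: x perfectly predicts y, so the
# unpenalized MLE is infinite, but the penalized mode is finite.
X = np.column_stack([np.ones(6), np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])])
y = np.array([0, 0, 0, 1, 1, 1])
beta_hat = fit(X, y, m=1.0)
```

Increasing m strengthens the penalty, which is the sensitivity analysis they describe: m plays the role of the number of added prior observations.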
Greenland and Mansournia talk a lot about the penalty function of Firth (1993) and about Jeffreys priors, but these don’t interest me much except for historical reasons. With a regression model, the form of the likelihood function, and thus the Jeffreys prior, changes every time you add a predictor, so it’s hard to envision these as general defaults.
They argue that their log-F prior works better than other alternatives in the literature, including the weakly-informative Cauchy family recommended in my 2008 paper with Jakulin, Pittau, and Su, and implemented in the bayesglm() function (which in turn is soon to be replaced and improved by an implementation in Stan).
I haven’t thought much about different functional forms. The Cauchy prior does this flat thing near the extremes which can work well if occasionally there really are infinite or near-infinite coefficients (which can happen if there is some redundancy in a model) but maybe other families such as log-F could do better for most problems. Also, as discussed in that paper with Jakulin et al., our recommended Cauchy(0,2.5) prior really is too weak in general so I’m open to the idea that there are better default options out there. In my own applied work I’ve been gradually shifting toward the use of stronger priors (indeed, as I’ve noted before, part of this attitude is coming because Sander Greenland convinced me that in many problems the prior information for some parameters can be much stronger than the information available from the data).
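That tail difference is easy to check numerically. Here's a quick comparison (my own illustration, not from either paper) of the Cauchy(0, 2.5) and log-F(1,1) log densities: the Cauchy log density falls off only logarithmically in the coefficient, while the log-F(1,1) log density falls off linearly, so the Cauchy is far more forgiving of huge coefficients:

```python
import numpy as np

def cauchy_logpdf(b, scale=2.5):
    # log density of Cauchy(0, scale)
    return -np.log(np.pi * scale * (1.0 + (b / scale) ** 2))

def logf11_logpdf(b):
    # log-F(1,1) density: f(b) = exp(b/2) / (pi * (1 + exp(b)))
    return 0.5 * b - np.logaddexp(0.0, b) - np.log(np.pi)

betas = np.array([1.0, 5.0, 10.0, 20.0])
# gap > 0 means the Cauchy puts more density there than log-F(1,1)
gap = cauchy_logpdf(betas) - logf11_logpdf(betas)
```

The gap grows steadily with the coefficient and turns positive out in the tail, which is exactly the "flat thing near the extremes" above: the Cauchy tolerates near-infinite coefficients, while log-F shrinks them harder.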
Greenland and Mansournia also recommend that “a properly shifted prior or (more conveniently) no prior be used for intercepts or coefficients that could reasonably have extreme values.” I think this is right, but I’m loath to suggest no prior at all, as this can give degenerate estimates in extreme cases such as data with all successes or all failures. (No one would waste time fitting a logistic regression on such trivial data, but this sort of degenerate pattern can occur in subsets of data, and sometimes a dataset is divided into pieces, with each piece analyzed separately, so we do need to think of how our procedures will work in such cases.) In any case, I agree about translating the model so that coefficients for intercepts and main effects are interpretable. We discussed this point in section 2.1 of our paper.
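The all-successes case makes the point in one line. With n successes out of n trials, the unpenalized intercept-only MLE diverges to infinity, but (my own illustration here, not from the post or the paper) a log-F(1,1) penalty on the intercept gives a finite mode with a simple closed form, logit((n + 0.5)/(n + 1)), which is just what you'd get from adding half a success and half a failure:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n = 5  # five trials, all successes

def neg_obj(b):
    # log-likelihood n*b - n*log(1 + e^b), plus log-F(1,1) log prior
    # b/2 - log(1 + e^b); collected terms, negated for minimization
    return -((n + 0.5) * b - (n + 1) * np.logaddexp(0.0, b))

mode = minimize_scalar(neg_obj).x
# Setting the derivative (n + 0.5) - (n + 1)*sigmoid(b) to zero gives
# sigmoid(b) = (n + 0.5)/(n + 1), i.e. b = log((n + 0.5)/0.5).
closed_form = np.log((n + 0.5) / 0.5)
```

So the penalized fit behaves sensibly even on the degenerate subsets mentioned above, which is the argument for keeping at least a weak prior on the intercept.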
A lot of the details here remain up in the air, so it’s good to see this new paper by Greenland and Mansournia.