Skip to content
 

The default prior for logistic regression coefficients in Scikit-learn

Someone pointed me to this post by W. D., reporting that, in Python’s popular Scikit-learn package, the default prior for logistic regression coefficients is normal(0,1)—or, as W. D. puts it, L2 penalization with a lambda of 1.

In the post, W. D. makes three arguments. I agree with two of them.

1. I agree with W. D. that it makes sense to scale predictors before regularization. (There are various ways to do this scaling, but I think that scaling by 2*observed sd is a reasonable default for non-binary outcomes.)

2. I agree with W. D. that default settings should be made as clear as possible at all times.

3. I disagree with the author that a default regularization prior is a bad idea. As a general point, I think it makes sense to regularize, and when it comes to this specific problem, I think that a normal(0,1) prior is a reasonable default option (assuming the predictors have been scaled). I think that rstanarm is currently using normal(0,2.5) as a default, but if I had to choose right now, I think I’d go with normal(0,1), actually.

Apparently some of the discussion of this default choice revolved around whether the routine should be considered “statistics” (where primary goal is typically parameter estimation) or “machine learning” (where the primary goal is typically prediction). As far as I’m concerned, it doesn’t matter: I’d prefer a reasonably strong default prior such as normal(0,1) both for parameter estimation and for prediction.

Again, I’ll repeat points 1 and 2 above: You do want to standardize the predictors before using this default prior, and in any case the user should be made aware of the defaults, and how to override them.

P.S. Sander Greenland and I had a discussion of this. Sander disagreed with me so I think it will be valuable to share both perspectives.

Sander wrote:

The following concerns arise in risk-factor epidemiology, my area, and related comparative causal research, not in formulation of classifiers or other pure predictive tasks as machine learners focus on…

As you may already know, in my settings I don’t think scaling by 2*SD makes any sense as a default, instead it makes the resulting estimates dependent on arbitrary aspects of the sample that have nothing to do with the causal effects under study or the effects one is attempting control with the model. Decontextualized defaults are bound to create distortions sooner or later, alpha = 0.05 being of course the poster child for that. So it seems here: Regularizing by a prior with variance 1 after rescaling by 2*SD means extending the arbitrariness to made-up prior information and can be pretty strong for a default, adding a substantial amount of pseudo-information centered on the null without any connection to an appropriate loss function. It is then capable of introducing considerable confounding (e.g., shrinking age and sex effects toward zero and thus reducing control of distortions produced by their imbalances). Weirdest of all is that rescaling everything by 2*SD and then regularizing with variance 1 means the strength of the implied confounder adjustment will depend on whether you chose to restrict the confounder range or not. Thus I advise any default prior introduce only a small absolute amount of information (e.g., two observations worth) and the program allow the user to increase that if there is real background information to support more shrinkage.

Of course high-dimensional exploratory settings may call for quite a bit of shrinkage, but then there is a huge volume of literature on that and none I’ve seen supports anything resembling assigning a prior based on 2*SD rescaling, so if you have citations showing it is superior to other approaches in comparative studies, please send them along!

I replied that I think that scaling by population sd is better than scaling by sample sd, and the way I think about scaling by sample sd is as an approximation to scaling by population sd. I also think the default I recommend, or other similar defaults, are safer than a default of no regularization, as this leads to problems with separation. It sounds like you would prefer weaker default priors. I think that weaker default priors will lead to poorer parameter estimates and poorer predictions–but estimation and prediction are not everything, and I could imagine that for some users, including epidemiology, weaker priors could be considered more acceptable.

Sander responded:

A severe question would be what is “the” population SD? The county? The state? The nation? The world? All humans who ever lived?

Maybe you are thinking of descriptive surveys with precisely pre-specified sampling frames. But no comparative cohort study or randomized clinical trial I have seen had an identified or sharply defined population to refer to beyond the particular groups they happened to get due to clinic enrollment, physician recruitment, and patient cooperation.

That aside, do we use “the” population restricted by the age restriction used in the study? Which would mean the prior SD for the per-year age effect would vary by peculiarities like age restriction even if the per-year increment in outcome was identical across years of age and populations. All that seems very weird, more along the lines of statistical numerology rather than empirical science (as if there were some magic in SD – why not the intraquartile or intraquintile or intratertile range? all of which could be equally bad, but aren’t necessarily worse).

I don’t recommend no regularization over weak regularization, but problems like separation are fixed by even the weakest priors in use. The weak priors I favor have a direct interpretation in terms of information being supplied about the parameter in whatever SI units make sense in context (e.g., mg of a medication given in mg doses).

As for “poorer parameter estimates” that is extremely dependent on the performance criteria one uses to gauge “poorer” (bias is often minimized by the Jeffreys prior which is too weak even for me – even though it is not as weak as a Cauchy prior). And “poor” is highly dependent on context. In comparative studies (which I have seen you involved in too), I’m fine with a prior that pulls estimates toward the range that debate takes place among stakeholders, so they can all be comfortable with the results. But no stronger than that, because a too-strong default prior will exert too strong a pull within that range and thus meaningfully favor some stakeholders over others, as well as start to damage confounding control as I described before. Worse, most users won’t even know when that happens; they will instead just defend their results circularly with the argument that they followed acceptable defaults. Again, 0.05 is the poster child for that kind of abuse, and at this point I can imagine parallel strong (if even more opaque) distortions from scaling of priors being driven by a 2*SD covariate scaling.

My reply regarding Sander’s first paragraph is that, yes, different goals will correspond to different models, and that can make sense. In practice with rstanarm we set priors that correspond to the scale of 2*sd of the data, and I interpret these as representing a hypothetical population for which the observed data are a sample, which is a standard way to interpret regression inferences.

Regarding Sander’s concern that users “they will instead just defend their results circularly with the argument that they followed acceptable defaults”: Sure, that’s a problem. But in any case I’d like to have better defaults, and I think extremely weak priors is not such a good default as it leads to noisy estimates (or, conversely, users not including potentially important predictors in the model, out of concern over the resulting noisy estimates). Informative priors—regularization—makes regression a more powerful tool.

21 Comments

  1. “Informative priors—regularization—makes regression a more powerful tool” powerful for what?

    The what needs to be carefully considered whereas defaults are supposed to be only place holders until that careful consideration is brought to bear. Given my sense of the literature, that will often be just overlooked so “warnings” that it shouldn’t be, should be given.

  2. I honestly think the only sensible default is to throw an error and complain until a user gives an explicit prior. Cranking out numbers without thinking is dangerous. Imagine if a computational fluid mechanics program supplied defaults for density and viscosity and temperature of a fluid. No way is that better than throwing an error saying “please supply the properties of the fluid you are modeling”

    • I agree! I wish R hadn’t taken the approach of always guessing what users intend.

      I’m curious what Andrew thinks, because he writes that statistics is the science of defaults.

      We supply default warmup and adaptation parameters in Stan’s fitting routines. But those are a bit different in that we can usually throw diagnostic errors if sampling fails. And most of our users don’t understand the details (even I don’t understand the dual averaging tuning parameters for setting step size—they seem very robust, so I’ve never bothered).

      • Andrew says:

        Daniel, Bob:

        I think defaults are good; I think a user should be able to run logistic regression on default settings. The defaults should be clear and easy to follow. But there’s a tradeoff: once we try to make a good default, it can get complicated (for example, defaults for regression coefficients with non-binary predictors need to deal with scaling in some way). The default warmup in Stan is a mess, but we’re working on improvements, so I hope the new version will be more effective and also better documented. In particular, in Stan we have different goals: one goal is to get reliable inferences for the models we like, another goal is to more quickly reveal problems with models we don’t like.

      • Bob, the Stan sampling parameters do not make assumptions about the world or change the posterior distribution from which it samples, they are purely about computational efficiency. So they are about “how well did we calculate a thing” not “what thing did we calculate”.

        I don’t think there should be a default when it comes to modeling decisions. I think it makes good sense to have defaults when it comes to computational decisions, because the computational people tend to know more about how to compute numbers than the applied people do. But the applied people know more about the scientific question than the computing people do, and so the computing people shouldn’t implicitly make choices about how to answer applied questions.

  3. Tom Holden says:

    Hi Andrew,
    Do you not think the variance of these default priors should scale inversely with the number of parameters being estimated?
    How regularization optimally scales with sample size and the number of parameters being estimated is the topic of this CrossValidated question: https://stats.stackexchange.com/questions/438173/how-should-regularization-parameters-scale-with-data-size
    It would be great to hear your thoughts. It could make for an interesting blog post!
    Thanks in advance,
    Tom

    • >How regularization optimally scales …

      this can only be defined by specifying an objective function. since the objective function changes from problem to problem, there can be no one answer to this question

      • Tom Holden says:

        I mean in the sense of large sample asymptotics. You can take in-sample CV MSE or expected out of sample MSE as the objective.

        • In my opinion this is problematic, because real world conditions often have situations where mean squared error is not even a good approximation of the real world practical utility.

          The questions can be good to have an answer to because it lets you do some math, but the problem is people often reify it as if it were a very very important real world condition.

          Let me give you an example, since I’m near the beach this week… suppose you have low mean squared error in predicting the daily mean tide height… this might seem very good, and it is very good if you are a cartographer and need to figure out where to put the coastline on your map… but if you are a beach house owner, what matters is whether the tide is 36 inches above your living room floor.

          It would absolutely be a mistake to spend a bunch of time thinking up a book full of theory about how to “adjust penalties” to “optimally in predictive MSE” adjust your prediction algorithms. Such a book, while of interest to pure mathematicians would undoubtedly be taken as a bible for practical applied problems, in a mistaken way.

          • The alternative book, which is needed, and has been discussed recently by Rahul, is a book on how to model real world utilities and how different choices of utilities lead to different decisions, and how these utilities interact. For example, your inference model needs to make choices about what factors to include in the model or not, which requires decisions, but then your decisions for which you plan to use the predictions also need to be made, like whether to invest in something, or build something, or change a regulation etc.

        • Also, Wald’s theorem shows that you might as well look for optimal decision rules inside the class of Bayesian rules, but obviously, the truly optimal decision rule would be the one that puts a delta-function prior on the “real” parameter values. If you are using a normal distribution in your likelihood, this would reduce mean squared error to its minimal value… But if you have an algorithm for discovering the exact true parameter values in your problem without even seeing data (ie. as a prior) what do you need statistics for ;-)

          so the problem is hopeless… the “optimal” prior is the one that best describes the actual information you have about the problem. And that obviously can’t be a one-size-fits-all thing.

    • Andrew says:

      Tom,

      When the number of predictors increases in this way, you’ll want to fit a hierarchical model in which the amount of partial pooling is a hyperparameter that is estimated from the data.

      • Don’t we just want to answer this whole kerfuffle with “use a hierarchical model”? W.D., in the original blog post, says

        Furthermore, the lambda is never selected using a grid search. Someone learning from this tutorial who also learned about logistic regression in a stats or intro ML class would have no idea that the default options for sklearn’s LogisticRegression class are wonky, not scale invariant, and utilizing untuned hyperparameters.

        By grid search for lambda, I believe W.D. is suggesting the common practice of choosing the penalty scale to optimize some end-to-end result (typically, but not always predictive cross-validation). This isn’t usually equivalent to empirical Bayes, because it’s not usually maximizing the marginal.

        There’s simply no accepted default approach to logistic regression in the machine learning world or in the stats world. For a start, there are three common penalties in use, L1, L2 and mixed (elastic net). Only elastic net gives you both identifiability and true zero penalized MLE estimates. Then there’s the matter of how to set the scale. Even if you cross-validate, there’s the question of which decision rule to use.

        I’d say the “standard” way that we approach something like logistic regression in Stan is to use a hierarchical model. That still leaves you choice of prior family, for which we can throw the horseshoe, Finnish horseshoe, and Cauchy (or general Student-t) into the ring. And choice of hyperprior, but that’s usually less sensitive with lots of groups or lots of data per group.

        • Andrew says:

          Bob:

          A hierarchical model is fine, but (a) this doesn’t resolve the problem when the number of coefficients is low, (b) non-hierarchical models are easier to compute than hierarchical models because with non-hierarchical models we can just work with the joint posterior mode, and (c) lots of people are fitting non-hierarchical models and we need defaults for them.

  4. Rahul says:

    Good parameter estimation is a sufficient but not necessary condition for good prediction?

  5. OliP says:

    Sander said “It is then capable of introducing considerable confounding (e.g., shrinking age and sex effects toward zero and thus reducing control of distortions produced by their imbalances). Weirdest of all is that rescaling everything by 2*SD and then regularizing with variance 1 means the strength of the implied confounder adjustment will depend on whether you chose to restrict the confounder range or not.”

    I wonder if anyone is able to provide pointers to papers to book sections that discuss these issues in greater detail?

    • Sander Greenland says:

      I’d love to hear of any book coverage…

      On the general debate among different defaults vs. each other and vs. contextually informed priors, entries 1-20 and 52-56 of this blog discussion may be of interest (the other entries digress into a largely unrelated discussion of MBI):
      https://discourse.datamethods.org/t/what-are-credible-priors-and-what-are-skeptical-priors/580

      To clarify “rescaling everything by 2*SD and then regularizing with variance 1 means the strength of the implied confounder adjustment will depend on whether you chose to restrict the confounder range or not”:
      Consider that the less restricted the confounder range, the more confounding the confounder can produce and so in this sense the more important its precise adjustment; yet also the larger its SD and thus the the more shrinkage and more confounding is reintroduced by shrinkage proportional to the confounder SD (which is implied by a default unit=k*SD prior scale). This behavior seems to me to make this default at odds with what one would want in the setting.

      At the very least such examples show the danger of decontextualized and data-dependent defaults. Too often statisticians want to introduce such defaults to avoid having to delve into context and see what that would demand. Another default with even larger and more perverse biasing effects uses k*SE as the prior scale unit with SE=the standard error of the estimated confounder coefficient: The bias that produces increases with sample size (note that the harm from bias increases with sample size as bias comes to dominate random error).

  6. I don’t get the scaling by two standard deviations. Why transform to mean zero and scale two? It seems like just normalizing the usual way (mean zero and unit scale), you can choose priors that work the same way and nobody has to remember whether they should be dividing by 2 or multiplying by 2 or sqrt(2) to get back to unity.

    I could understand having a normal(0, 2) default prior for standardized predictors in logistic regression because you usually don’t go beyond unit scale coefficients with unit scale predictors; at least not without co-linearity.

Leave a Reply