Comments on: The default prior for logistic regression coefficients in Scikit-learn

By: OliP

OliP — Fri, 29 Nov 2019 22:04:41 +0000

In reply to Sander Greenland. Many thanks for the link and for elaborating.

By: Daniel Lakeland

Daniel Lakeland — Fri, 29 Nov 2019 21:42:16 +0000

Bob, the Stan sampling parameters do not make assumptions about the world or change the posterior distribution from which Stan samples; they are purely about computational efficiency. So they are about “how well did we calculate a thing”, not “what thing did we calculate”.

I don’t think there should be a default when it comes to modeling decisions. I think it makes good sense to have defaults when it comes to computational decisions, because the computational people tend to know more about how to compute numbers than the applied people do. But the applied people know more about the scientific question than the computing people do, and so the computing people shouldn’t implicitly make choices about how to answer applied questions.

By: Andrew

Andrew — Fri, 29 Nov 2019 20:55:19 +0000

In reply to Bob Carpenter. Daniel, Bob: I think defaults are good; I think a user should be able to run logistic regression on default settings. The defaults should be clear and easy to follow. But there's a tradeoff: once we try to make a good default, it can get complicated (for example, defaults for regression coefficients with non-binary predictors need to deal with scaling in some way). The default warmup in Stan is a mess, but we're working on improvements, so I hope the new version will be more effective and also better documented. In particular, in Stan we have different goals: one goal is to get reliable inferences for the models we like, another goal is to more quickly reveal problems with models we don't like.

By: Andrew

Andrew — Fri, 29 Nov 2019 20:50:38 +0000

In reply to Bob Carpenter. Bob: A hierarchical model is fine, but (a) this doesn't resolve the problem when the number of coefficients is low, (b) non-hierarchical models are easier to compute than hierarchical models because with non-hierarchical models we can just work with the joint posterior mode, and (c) lots of people are fitting non-hierarchical models and we need defaults for them.

By: Andrew

Andrew — Fri, 29 Nov 2019 20:48:36 +0000

In reply to Bob Carpenter. Bob: As discussed here, we scale continuous variables by 2 sd's because this puts them on the same approximate scale as 0/1 variables.
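
A minimal sketch of the arithmetic behind the 2-sd scaling, assuming a balanced 0/1 predictor and a made-up continuous predictor (`ages` and `scale_by_two_sd` are illustrative names, not from the post). A balanced binary variable has standard deviation 0.5, so the gap between its two categories is 2 sd; dividing a centered continuous variable by 2 sd gives it the same standard deviation of 0.5.

```python
import statistics

def scale_by_two_sd(values):
    """Center a continuous predictor and divide by two standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / (2 * sd) for v in values]

# A balanced 0/1 predictor: its sd is 0.5, so the 0 -> 1 step equals 2 sd.
binary = [0, 1] * 50
binary_sd = statistics.pstdev(binary)

# Hypothetical continuous predictor (ages 20..69, each appearing twice).
ages = [20 + (i % 50) for i in range(100)]
scaled_ages = scale_by_two_sd(ages)
scaled_sd = statistics.pstdev(scaled_ages)

print(binary_sd, scaled_sd)
```

With an unbalanced binary predictor the sd is smaller than 0.5, so the match is only approximate, which is consistent with "same approximate scale" above.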

By: Bob Carpenter

Bob Carpenter — Fri, 29 Nov 2019 20:26:34 +0000

I don’t get the scaling by two standard deviations. Why transform to mean zero and scale two? It seems like just normalizing the usual way (mean zero and unit scale), you can choose priors that work the same way and nobody has to remember whether they should be dividing by 2 or multiplying by 2 or sqrt(2) to get back to unity.

I could understand having a normal(0, 2) default prior for standardized predictors in logistic regression because you usually don’t go beyond unit scale coefficients with unit scale predictors; at least not without collinearity.

By: Bob Carpenter

Bob Carpenter — Fri, 29 Nov 2019 20:21:15 +0000

In reply to Andrew.

Don’t we just want to answer this whole kerfuffle with “use a hierarchical model”? W.D., in the original blog post, says

Furthermore, the lambda is never selected using a grid search. Someone learning from this tutorial who also learned about logistic regression in a stats or intro ML class would have no idea that the default options for sklearn’s LogisticRegression class are wonky, not scale invariant, and utilizing untuned hyperparameters.

By grid search for lambda, I believe W.D. is suggesting the common practice of choosing the penalty scale to optimize some end-to-end result (typically, but not always, predictive cross-validation). This isn’t usually equivalent to empirical Bayes, because it’s not usually maximizing the marginal likelihood.
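
To illustrate the "not scale invariant" complaint, here is a rough sketch, not sklearn's implementation: a one-predictor, no-intercept logistic regression with a fixed L2 penalty, fit by Newton's method in plain Python. The data, the function name `fit_l2_logistic`, and the metres/centimetres framing are all made up for illustration. The fixed penalty of 1.0 is meant to mirror sklearn's default of an L2 penalty with `C=1.0`, under the assumption that the penalty is applied as 0.5 * penalty * beta**2.

```python
import math

def fit_l2_logistic(xs, ys, penalty=1.0, iters=50):
    """Minimize negative log-likelihood + 0.5 * penalty * beta**2
    for a single predictor with no intercept, using Newton's method."""
    beta = 0.0
    for _ in range(iters):
        grad = penalty * beta
        hess = penalty
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-beta * x))
            grad += (p - y) * x
            hess += p * (1.0 - p) * x * x
        beta -= grad / hess
    return beta

# Hypothetical data: the same centered predictor in metres and in centimetres.
ys = [0, 0, 1, 0, 1, 1]
x_m = [-0.20, -0.10, -0.05, 0.05, 0.10, 0.20]
x_cm = [100 * x for x in x_m]

beta_m = fit_l2_logistic(x_m, ys)    # slope per metre
beta_cm = fit_l2_logistic(x_cm, ys)  # slope per centimetre

# Convert the centimetre fit back to "per metre" for comparison.
slope_m_from_cm = 100 * beta_cm
print(beta_m, slope_m_from_cm)
```

Without a penalty the two fits would describe the same model. With the same fixed penalty, the metre-scale coefficient is shrunk far more heavily, so the fitted probabilities depend on the units of the predictor.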

There’s simply no accepted default approach to logistic regression in the machine learning world or in the stats world. For a start, there are three common penalties in use: L1, L2, and mixed (elastic net). Only elastic net gives you both identifiability and exact-zero penalized maximum likelihood estimates. Then there’s the matter of how to set the scale. Even if you cross-validate, there’s the question of which decision rule to use.

I’d say the “standard” way that we approach something like logistic regression in Stan is to use a hierarchical model. That still leaves you choice of prior family, for which we can throw the horseshoe, Finnish horseshoe, and Cauchy (or general Student-t) into the ring. And choice of hyperprior, but that’s usually less sensitive with lots of groups or lots of data per group.

By: Bob Carpenter

Bob Carpenter — Fri, 29 Nov 2019 20:10:02 +0000

In reply to Daniel Lakeland. I agree! I wish R hadn't taken the approach of always guessing what users intend. I'm curious what Andrew thinks, because he writes that statistics is the science of defaults. We supply default warmup and adaptation parameters in Stan's fitting routines. But those are a bit different in that we can usually throw diagnostic errors if sampling fails. And most of our users don't understand the details (even I don't understand the dual averaging tuning parameters for setting step size; they seem very robust, so I've never bothered).

By: Sander Greenland

Sander Greenland — Fri, 29 Nov 2019 17:23:27 +0000

In reply to OliP.

I’d love to hear of any book coverage…

On the general debate among different defaults vs. each other and vs. contextually informed priors, entries 1-20 and 52-56 of this blog discussion may be of interest (the other entries digress into a largely unrelated discussion of MBI):
https://discourse.datamethods.org/t/what-are-credible-priors-and-what-are-skeptical-priors/580

To clarify “rescaling everything by 2*SD and then regularizing with variance 1 means the strength of the implied confounder adjustment will depend on whether you chose to restrict the confounder range or not”:
Consider that the less restricted the confounder range, the more confounding the confounder can produce, and so in this sense the more important its precise adjustment; yet also the larger its SD, and thus the more shrinkage is applied and the more confounding is reintroduced by shrinkage proportional to the confounder SD (which is implied by a default unit=k*SD prior scale). This behavior seems to me to put this default at odds with what one would want in the setting.
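
A small sketch of the dependence described above, assuming k=2 and a unit prior sd on the rescaled coefficient (the age ranges and the name `implied_prior_sd_per_unit` are hypothetical). If x is divided by 2*SD, a prior sd of 1 on the rescaled coefficient corresponds to a prior sd of 1/(2*SD) on the coefficient per original unit, so a wider confounder range means a tighter prior per year of age.

```python
import statistics

def implied_prior_sd_per_unit(values, prior_sd_scaled=1.0):
    """Prior sd on the per-original-unit coefficient when the predictor is
    divided by 2*SD and the rescaled coefficient has sd `prior_sd_scaled`."""
    return prior_sd_scaled / (2 * statistics.pstdev(values))

# Hypothetical study populations: unrestricted vs. restricted age range.
full_range = list(range(20, 81))   # ages 20..80
restricted = list(range(40, 61))   # ages 40..60

full = implied_prior_sd_per_unit(full_range)
narrow = implied_prior_sd_per_unit(restricted)

# The wider range gets the smaller prior sd per year, i.e. more shrinkage.
print(full, narrow)
```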

At the very least such examples show the danger of decontextualized and data-dependent defaults. Too often statisticians want to introduce such defaults to avoid having to delve into context and see what that would demand. Another default with even larger and more perverse biasing effects uses k*SE as the prior scale unit, with SE = the standard error of the estimated confounder coefficient: the bias that this produces increases with sample size (note that the harm from bias increases with sample size as bias comes to dominate random error).

By: OliP

OliP — Fri, 29 Nov 2019 11:33:28 +0000

Sander said “It is then capable of introducing considerable confounding (e.g., shrinking age and sex effects toward zero and thus reducing control of distortions produced by their imbalances). Weirdest of all is that rescaling everything by 2*SD and then regularizing with variance 1 means the strength of the implied confounder adjustment will depend on whether you chose to restrict the confounder range or not.”

I wonder if anyone is able to provide pointers to papers or book sections that discuss these issues in greater detail?

By: Daniel Lakeland

Daniel Lakeland — Thu, 28 Nov 2019 18:47:24 +0000

In reply to Tom Holden.

Also, Wald’s theorem shows that you might as well look for optimal decision rules inside the class of Bayesian rules, but obviously, the truly optimal decision rule would be the one that puts a delta-function prior on the “real” parameter values. If you are using a normal distribution in your likelihood, this would reduce mean squared error to its minimal value… But if you have an algorithm for discovering the exact true parameter values in your problem without even seeing data (i.e., as a prior), what do you need statistics for? ;-)

So the problem is hopeless… the “optimal” prior is the one that best describes the actual information you have about the problem. And that obviously can’t be a one-size-fits-all thing.

By: Daniel Lakeland

Daniel Lakeland — Thu, 28 Nov 2019 18:17:32 +0000

In reply to Daniel Lakeland. The alternative book, which is needed, and has been discussed recently by Rahul, is a book on how to model real world utilities and how different choices of utilities lead to different decisions, and how these utilities interact. For example, your inference model needs to make choices about what factors to include in the model or not, which requires decisions, but then the decisions for which you plan to use the predictions also need to be made, like whether to invest in something, build something, or change a regulation.

By: Daniel Lakeland

Daniel Lakeland — Thu, 28 Nov 2019 18:13:16 +0000

In reply to Tom Holden.

In my opinion this is problematic, because real world conditions often have situations where mean squared error is not even a good approximation of the real world practical utility.

The question can be good to have an answer to because it lets you do some math, but the problem is that people often reify it as if it were a very very important real world condition.

Let me give you an example, since I’m near the beach this week… suppose you have low mean squared error in predicting the daily mean tide height… this might seem very good, and it is very good if you are a cartographer and need to figure out where to put the coastline on your map… but if you are a beach house owner, what matters is whether the tide is 36 inches above your living room floor.
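
A toy version of the tide example, with entirely made-up numbers: two hypothetical forecasts are scored by mean squared error and by how many flood days they miss. The 36-inch threshold is taken from the example above; everything else (heights, forecasts, function names) is an assumption for illustration.

```python
FLOOD_LEVEL = 36  # inches, threshold from the beach house example

def mse(actual, forecast):
    """Mean squared error between actual and forecast heights."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def missed_floods(actual, forecast, level=FLOOD_LEVEL):
    """Days on which the tide exceeded `level` but the forecast did not."""
    return sum(1 for a, f in zip(actual, forecast) if a > level and f <= level)

# Hypothetical tide heights in inches over five days; day 4 floods.
actual     = [10, 12, 11, 40, 9]
forecast_a = [10, 12, 11, 30, 9]    # exact on calm days, underpredicts the peak
forecast_b = [16, 18, 17, 41, 15]   # biased high on calm days, catches the peak

mse_a, mse_b = mse(actual, forecast_a), mse(actual, forecast_b)
missed_a = missed_floods(actual, forecast_a)
missed_b = missed_floods(actual, forecast_b)

print(mse_a, missed_a)
print(mse_b, missed_b)
```

Forecast A wins on MSE but misses the one day that matters to the homeowner, while forecast B loses on MSE and misses nothing. Which forecast is "better" depends on the utility, not on the error metric alone.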

It would absolutely be a mistake to spend a bunch of time thinking up a book full of theory about how to “adjust penalties” so that your prediction algorithms are “optimal in predictive MSE”. Such a book, while of interest to pure mathematicians, would undoubtedly be taken as a bible for practical applied problems, in a mistaken way.

By: Tom Holden

Tom Holden — Thu, 28 Nov 2019 17:52:24 +0000

In reply to Daniel Lakeland. I mean in the sense of large sample asymptotics. You can take in-sample CV MSE or expected out of sample MSE as the objective.

By: Andrew

Andrew — Thu, 28 Nov 2019 17:37:13 +0000

In reply to Tom Holden. Tom, When the number of predictors increases in this way, you'll want to fit a hierarchical model in which the amount of partial pooling is a hyperparameter that is estimated from the data.

By: Daniel Lakeland

Daniel Lakeland — Thu, 28 Nov 2019 17:18:36 +0000

In reply to Rahul. Some problems are insensitive to some parameters. Imagine the failure of a bridge. It could be very sensitive to the strength of one particular connection. But because that connection will fail first, it is insensitive to the strength of the over-specced beam.

By: Daniel Lakeland

Daniel Lakeland — Thu, 28 Nov 2019 17:15:34 +0000

In reply to Tom Holden. “How regularization optimally scales ...”: this can only be defined by specifying an objective function. Since the objective function changes from problem to problem, there can be no one answer to this question.

By: Rahul

Rahul — Thu, 28 Nov 2019 17:06:43 +0000

Good parameter estimation is a sufficient but not necessary condition for good prediction?

By: Tom Holden

Tom Holden — Thu, 28 Nov 2019 16:46:14 +0000

Hi Andrew,
Do you not think the variance of these default priors should scale inversely with the number of parameters being estimated?
How regularization optimally scales with sample size and the number of parameters being estimated is the topic of this CrossValidated question: https://stats.stackexchange.com/questions/438173/how-should-regularization-parameters-scale-with-data-size
It would be great to hear your thoughts. It could make for an interesting blog post!
Thanks in advance,
Tom
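
One possible reading of the question, offered as a sketch under strong assumptions rather than as Tom's own argument: suppose the p coefficients have independent normal priors with common scale s, and the predictors are standardized and uncorrelated (the symbols s, p, and \tau are introduced here for illustration). Then the prior variance of the linear predictor grows linearly in p, and holding it at a fixed value \tau^2 requires a per-coefficient variance proportional to 1/p.

```latex
\beta_j \sim \mathcal{N}(0, s^2) \ \text{independently}, \quad j = 1, \dots, p,
\qquad
\operatorname{Var}\!\left(x^\top \beta \mid x\right) = s^2 \sum_{j=1}^{p} x_j^2 ,
\qquad
\mathbb{E}\!\left[x_j^2\right] = 1 \;\Rightarrow\;
\mathbb{E}\!\left[\operatorname{Var}\!\left(x^\top \beta \mid x\right)\right] = p\, s^2 ,
\qquad
p\, s^2 = \tau^2 \;\Rightarrow\; s^2 = \frac{\tau^2}{p} .
```

Whether a fixed prior variance for the linear predictor is the right target is itself a modeling choice, which is the point taken up in the replies.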

By: Daniel Lakeland

Daniel Lakeland — Thu, 28 Nov 2019 15:51:09 +0000

I honestly think the only sensible default is to throw an error and complain until a user gives an explicit prior. Cranking out numbers without thinking is dangerous. Imagine if a computational fluid mechanics program supplied defaults for density and viscosity and temperature of a fluid. No way is that better than throwing an error saying “please supply the properties of the fluid you are modeling”.

By: Keith O'Rourke

Keith O'Rourke — Thu, 28 Nov 2019 15:01:38 +0000

“Informative priors—regularization—makes regression a more powerful tool” powerful for what?

The what needs to be carefully considered, whereas defaults are supposed to be only placeholders until that careful consideration is brought to bear. Given my sense of the literature, that consideration will often just be overlooked, so “warnings” that it shouldn’t be should be given.