Many thanks for the link and for elaborating.

]]>Bob, Stan’s sampling parameters do not make assumptions about the world or change the posterior distribution being sampled; they are purely about computational efficiency. They are about “how well did we calculate a thing,” not “what thing did we calculate.”

I don’t think there should be a default when it comes to modeling decisions. I think it makes good sense to have defaults when it comes to computational decisions, because the computational people tend to know more about how to compute numbers than the applied people do. But the applied people know more about the scientific question than the computing people do, and so the computing people shouldn’t implicitly make choices about how to answer applied questions.

]]>Daniel, Bob:

I think defaults are good; I think a user should be able to run logistic regression on default settings. The defaults should be clear and easy to follow. But there’s a tradeoff: once we try to make a good default, it can get complicated (for example, defaults for regression coefficients with non-binary predictors need to deal with scaling in some way). The default warmup in Stan is a mess, but we’re working on improvements, so I hope the new version will be more effective and also better documented. In particular, in Stan we have different goals: one goal is to get reliable inferences for the models we like, another goal is to more quickly reveal problems with models we don’t like.

]]>Bob:

A hierarchical model is fine, but (a) it doesn’t resolve the problem when the number of coefficients is low, (b) non-hierarchical models are easier to compute because we can just work with the joint posterior mode, and (c) lots of people are fitting non-hierarchical models and we need defaults for them.

]]>Bob:

As discussed here, we scale continuous variables by 2 sd’s because this puts them on the same approximate scale as 0/1 variables.
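A minimal numpy sketch of this rescaling, on hypothetical data (the variable names are illustrative): a balanced 0/1 predictor has SD about 0.5, so dividing a continuous predictor by 2 SDs puts it on that same scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictors: one continuous, one binary with ~50/50 split.
age = rng.normal(45, 12, size=1000)      # continuous predictor
treated = rng.integers(0, 2, size=1000)  # 0/1 predictor

# A balanced 0/1 predictor has SD ~ 0.5, so its full range (1) spans ~2 SDs.
# Dividing a centered continuous predictor by 2*SD gives it SD 0.5 as well,
# putting its coefficient on roughly the same scale as the binary one's.
age_scaled = (age - age.mean()) / (2 * age.std())

print(age_scaled.std())  # 0.5 by construction
print(treated.std())     # ~0.5 for a roughly balanced binary predictor
```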

]]>I could understand having a normal(0, 2) default prior for standardized predictors in logistic regression, because you usually don’t go beyond unit-scale coefficients with unit-scale predictors; at least not without collinearity.

]]>Don’t we just want to answer this whole kerfuffle with “use a hierarchical model”? W.D., in the original blog post, says

Furthermore, the lambda is never selected using a grid search. Someone learning from this tutorial who also learned about logistic regression in a stats or intro ML class would have no idea that the default options for sklearn’s LogisticRegression class are wonky, not scale invariant, and utilizing untuned hyperparameters.

By grid search for lambda, I believe W.D. is suggesting the common practice of choosing the penalty scale to optimize some end-to-end result (typically, but not always, predictive cross-validation). This isn’t usually equivalent to empirical Bayes, because it’s not usually maximizing the marginal likelihood.
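The kind of grid search over the penalty scale described above can be sketched with scikit-learn (the data and the grid of C values here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
logit = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.25])
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Choose the penalty scale end-to-end by cross-validation:
# C = 1/lambda is searched over a grid and scored on held-out folds,
# not by maximizing a marginal likelihood (so this is not empirical Bayes).
fit = LogisticRegressionCV(
    Cs=np.logspace(-3, 3, 13), cv=5, penalty="l2", max_iter=1000
).fit(X, y)
print(fit.C_)  # the cross-validation-selected penalty scale
```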

There’s simply no accepted default approach to logistic regression in the machine learning world or in the stats world. For a start, there are three common penalties in use, L1, L2 and mixed (elastic net). Only elastic net gives you both identifiability and true zero penalized MLE estimates. Then there’s the matter of how to set the scale. Even if you cross-validate, there’s the question of which decision rule to use.
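To make the penalty menu and the scale question concrete, a small scikit-learn sketch on synthetic data (the specific C values are arbitrary choices, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=(n, 1))
y = (rng.random(n) < 1 / (1 + np.exp(-2 * x[:, 0]))).astype(int)

# The three common penalties; each still needs a scale (C = 1/lambda):
l2 = LogisticRegression(penalty="l2", C=1.0).fit(x, y)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(x, y)
en = LogisticRegression(penalty="elasticnet", solver="saga",
                        l1_ratio=0.5, C=1.0, max_iter=5000).fit(x, y)

# None of these is scale invariant: record the predictor in different
# units and the fitted probabilities change, because the penalty applies
# to the coefficient on whatever scale the data happen to arrive in.
x_small = x / 100.0  # same information, different units
p = LogisticRegression(C=1.0).fit(x, y).predict_proba(x)[:, 1]
p_small = LogisticRegression(C=1.0).fit(x_small, y).predict_proba(x_small)[:, 1]
print(np.max(np.abs(p - p_small)))  # far from 0: not scale invariant
```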

I’d say the “standard” way that we approach something like logistic regression in Stan is to use a hierarchical model. That still leaves you choice of prior family, for which we can throw the horseshoe, Finnish horseshoe, and Cauchy (or general Student-t) into the ring. And choice of hyperprior, but that’s usually less sensitive with lots of groups or lots of data per group.

]]>I agree! I wish R hadn’t taken the approach of always guessing what users intend.

I’m curious what Andrew thinks, because he writes that statistics is the science of defaults.

We supply default warmup and adaptation parameters in Stan’s fitting routines. But those are a bit different in that we can usually throw diagnostic errors if sampling fails. And most of our users don’t understand the details (even I don’t understand the dual averaging tuning parameters for setting step size—they seem very robust, so I’ve never bothered).

]]>I’d love to hear of any book coverage…

On the general debate among different defaults vs. each other and vs. contextually informed priors, entries 1-20 and 52-56 of this blog discussion may be of interest (the other entries digress into a largely unrelated discussion of MBI):

https://discourse.datamethods.org/t/what-are-credible-priors-and-what-are-skeptical-priors/580

To clarify “rescaling everything by 2*SD and then regularizing with variance 1 means the strength of the implied confounder adjustment will depend on whether you chose to restrict the confounder range or not”:

Consider that the less restricted the confounder range, the more confounding the confounder can produce, and so, in this sense, the more important its precise adjustment; yet the larger its SD, and thus the more shrinkage is applied, and the more confounding is reintroduced by shrinkage proportional to the confounder SD (which is what a default unit = k*SD prior scale implies). This behavior seems to me to put this default at odds with what one would want in the setting.

At the very least such examples show the danger of decontextualized and data-dependent defaults. Too often statisticians want to introduce such defaults to avoid having to delve into context and see what it would demand. Another default with even larger and more perverse biasing effects uses k*SE as the prior scale unit, with SE = the standard error of the estimated confounder coefficient: the bias it produces increases with sample size (note that the harm from bias increases with sample size, as bias comes to dominate random error).
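The arithmetic behind the SD-dependence can be made explicit: a Normal(0, 1) prior on the coefficient of x/(2*SD) implies a prior SD of 1/(2*SD) on the original-scale coefficient, so widening the confounder’s range (raising its SD) tightens the implied prior and increases shrinkage of the adjustment. A small sketch:

```python
import numpy as np

def implied_prior_sd(sd_x, k=2.0, prior_sd_scaled=1.0):
    """Prior SD on the original-scale coefficient implied by a
    Normal(0, prior_sd_scaled) prior on the coefficient of x/(k*SD(x))."""
    # If z = x/(k*sd_x), then beta_z = beta_x * k * sd_x,
    # so SD(beta_x) = prior_sd_scaled / (k * sd_x).
    return prior_sd_scaled / (k * sd_x)

# Restricted confounder range (small SD) vs unrestricted (large SD):
print(implied_prior_sd(sd_x=1.0))  # 0.5 : looser prior on the adjustment
print(implied_prior_sd(sd_x=5.0))  # 0.1 : 5x tighter, i.e. more shrinkage
```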

]]>I wonder if anyone is able to provide pointers to papers or book sections that discuss these issues in greater detail?

]]>Also, Wald’s theorem shows that you might as well look for optimal decision rules inside the class of Bayesian rules, but obviously, the truly optimal decision rule would be the one that puts a delta-function prior on the “real” parameter values. If you are using a normal distribution in your likelihood, this would reduce mean squared error to its minimal value… But if you have an algorithm for discovering the exact true parameter values in your problem without even seeing data (i.e., as a prior), what do you need statistics for ;-)

So the problem is hopeless… the “optimal” prior is the one that best describes the actual information you have about the problem. And that obviously can’t be a one-size-fits-all thing.

]]>The alternative book, which is needed, and has been discussed recently by Rahul, is a book on how to model real-world utilities, how different choices of utilities lead to different decisions, and how these utilities interact. For example, your inference model requires decisions about which factors to include, but the downstream decisions for which you plan to use the predictions also need to be made: whether to invest in something, or build something, or change a regulation, etc.

]]>In my opinion this is problematic, because real world conditions often have situations where mean squared error is not even a good approximation of the real world practical utility.

The questions can be good to have an answer to because answering them lets you do some math, but the problem is that people often reify it as if it were a very important real-world condition.

Let me give you an example, since I’m near the beach this week… suppose you have low mean squared error in predicting the daily mean tide height… this might seem very good, and it is very good if you are a cartographer and need to figure out where to put the coastline on your map… but if you are a beach house owner, what matters is whether the tide is 36 inches above your living room floor.
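The tide example can be made numerical: two forecasts with identical MSE can have completely different value to the beach-house owner. The numbers below are invented for illustration (heights in inches relative to the living-room floor, flooding above 36):

```python
import numpy as np

threshold = 36.0  # inches above the living-room floor: flooding

truth = np.array([10, 12, 11, 40, 13, 12, 38, 11, 10, 12], dtype=float)

# Two forecasts with identical MSE but different placement of the errors:
fc_a = truth.copy(); fc_a[[3, 6]] -= 6  # errs only on the two flood days
fc_b = truth.copy(); fc_b[[1, 8]] += 6  # errs only on calm days

mse_a = np.mean((fc_a - truth) ** 2)
mse_b = np.mean((fc_b - truth) ** 2)

floods = truth > threshold
print(mse_a, mse_b)                # identical MSE (7.2 each)
print((fc_a > threshold)[floods])  # forecast A misses both floods
print((fc_b > threshold)[floods])  # forecast B catches both floods
```

Cartographer’s utility (MSE) ranks the forecasts equally; the homeowner’s utility (catching threshold exceedances) does not.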

It would absolutely be a mistake to spend a bunch of time thinking up a book full of theory about how to “adjust penalties” so as to make your prediction algorithms “optimal in predictive MSE.” Such a book, while of interest to pure mathematicians, would undoubtedly be taken as a bible for practical applied problems, in a mistaken way.

]]>I mean in the sense of large-sample asymptotics. You can take in-sample CV MSE or expected out-of-sample MSE as the objective.

]]>Tom,

When the number of predictors increases in this way, you’ll want to fit a hierarchical model in which the amount of partial pooling is a hyperparameter that is estimated from the data.
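A toy stand-in for that idea, in the simplest setting I can think of (a normal-means model rather than a full hierarchical logistic regression, with invented numbers): the pooling hyperparameter tau is estimated from the data, and it determines how much each estimate is shrunk.

```python
import numpy as np

rng = np.random.default_rng(0)

# J noisy coefficient estimates:
#   y_j ~ Normal(theta_j, sigma^2),  theta_j ~ Normal(0, tau^2),
# with the amount of partial pooling (tau) estimated from the data itself.
J, sigma, tau_true = 50, 1.0, 1.0
theta = rng.normal(0, tau_true, J)
y = theta + rng.normal(0, sigma, J)

# Marginally y_j ~ Normal(0, tau^2 + sigma^2), so a simple moment/ML
# estimate of tau^2 is mean(y^2) - sigma^2, floored at zero.
tau2_hat = max(np.mean(y ** 2) - sigma ** 2, 0.0)

# Partial pooling: shrink each raw estimate toward 0 by the estimated factor.
shrink = tau2_hat / (tau2_hat + sigma ** 2)
theta_hat = shrink * y
print(shrink)  # between 0 (complete pooling) and 1 (no pooling)
```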

]]>Some problems are insensitive to some parameters. Imagine the failure of a bridge. It could be very sensitive to the strength of one particular connection, but because that connection will fail first, it is insensitive to the strength of the over-specced beam.

]]>>How regularization optimally scales …

This can only be defined by specifying an objective function. Since the objective function changes from problem to problem, there can be no single answer to this question.

]]>Do you not think the variance of these default priors should scale inversely with the number of parameters being estimated?

How regularization optimally scales with sample size and the number of parameters being estimated is the topic of this CrossValidated question: https://stats.stackexchange.com/questions/438173/how-should-regularization-parameters-scale-with-data-size
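One piece of the scaling question can be shown directly in scikit-learn (synthetic data, arbitrary C): its objective is 0.5*||w||^2 + C * sum of per-observation losses, so for a fixed C the penalty does not grow with n, and the same nominal lambda regularizes a small sample heavily and a large sample hardly at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def fitted_coef(n, C):
    # Synthetic logistic data with true coefficient 1.
    x = rng.normal(size=(n, 1))
    y = (rng.random(n) < 1 / (1 + np.exp(-x[:, 0]))).astype(int)
    return LogisticRegression(C=C, max_iter=1000).fit(x, y).coef_[0, 0]

# Same C = 1/lambda, very different sample sizes: the penalty term is
# fixed while the data term grows with n, so the effective amount of
# regularization per observation shrinks as n grows.
small = fitted_coef(n=50, C=0.01)
large = fitted_coef(n=50_000, C=0.01)
print(small, large)  # small-n fit is shrunk far more toward zero
```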

It would be great to hear your thoughts. It could make for an interesting blog post!

Thanks in advance,

Tom ]]>

The “what” needs to be carefully considered, whereas defaults are supposed to be only placeholders until that careful consideration is brought to bear. Given my sense of the literature, that step will often just be overlooked, so warnings that it shouldn’t be should be given.

]]>