I just had reason to reread this article from 2009, and I think it holds up just fine!

Just to emphasize, I’m not saying you *have to* scale predictors by dividing by two standard deviations, nor am I even saying that you *should* do this scaling. I’m just saying that this scaling is a useful default, and in most settings I’ve seen I prefer it to the current default of doing no scaling at all.
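To make that default concrete, here is a minimal sketch (simulated data and made-up coefficients, not from the paper) showing that centering a numeric input and dividing by two standard deviations simply multiplies its coefficient by 2 sd, putting it on the same low-to-high footing as a binary input:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# A numeric predictor (e.g., age) and a binary predictor (e.g., treatment)
age = rng.normal(45, 12, n)
treat = rng.integers(0, 2, n)

# Outcome with known coefficients on the original scale
y = 2.0 + 0.5 * age + 6.0 * treat + rng.normal(0, 5, n)

# The proposed default: center and divide numeric inputs by 2 standard deviations
age_z = (age - age.mean()) / (2 * age.std())

# Fit by least squares on both scales
X_raw = np.column_stack([np.ones(n), age, treat])
X_std = np.column_stack([np.ones(n), age_z, treat])
b_raw = np.linalg.lstsq(X_raw, y, rcond=None)[0]
b_std = np.linalg.lstsq(X_std, y, rcond=None)[0]

# The rescaled age coefficient equals the raw coefficient times 2 sd(age):
# the change in y as age moves from a "low" to a "high" value
print(b_raw[1], b_std[1], b_raw[1] * 2 * age.std())
```

On the rescaled scale the age coefficient (roughly 0.5 × 24 ≈ 12 here) is directly comparable to the treatment coefficient (about 6), which the raw coefficient of 0.5 obscures.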

Here’s how the paper concludes:

Rescaling numeric regression inputs by dividing by two standard deviations is a reasonable automatic procedure that avoids conventional standardization’s incompatibility with binary inputs. Standardizing does not solve issues of causality [4], conditioning [1], or comparison between fits with different data sets [2], but we believe it usefully contributes to the goal of understanding a model whose predictors are on different scales.

It can be a challenge to pick appropriate ‘round numbers’ for scaling regression predictors, and standardization, as we have defined it here, gives a general solution which is, at the very least, an interpretable starting point. We recommend it as an automatic adjunct to displaying coefficients on the original scale.

This does not stop us from keeping variables on some standard, well-understood scale (for example, in predicting election outcomes given unemployment rate, coefficients can be interpreted as percentage points of vote per percentage point change in unemployment), but we would use our standardization as a starting point. In general, we believe that our recommendations will lead to more understandable inferences than the current default, which is typically to include variables however they happen to have been coded in the data file. Our goal is for regression coefficients to be interpretable as changes from low to high values (for binary inputs or numeric inputs that have been scaled by two standard deviations).

We also center each input variable to have a mean of zero so that interactions are more interpretable. Again, in some applications it can make sense for variables to be centered around some particular baseline value, but we believe our automatic procedure is better than the current default of using whatever value happens to be zero on the scale of the data, which all too commonly results in absurdities such as age=0 years or party identification=0 on a 1–7 scale. Even with such scaling, the correct interpretation of the model can be untangled from the regression by pulling out the right combination of coefficients (for example, evaluating interactions at different plausible values of age such as 20, 40, and 60); the advantage of our procedure is that the default outputs in the regression table can be compared and understood in a consistent way.
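A small sketch (simulated data, not from the paper) of why centering helps with interactions: with raw coding, the age “main effect” is the slope at party identification = 0, a value that does not exist on a 1–7 scale; after centering, it is the slope at the mean of the other input:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

age = rng.normal(40, 15, n)
party = rng.integers(1, 8, n).astype(float)  # party ID on a 1-7 scale

# True model with an age x party interaction (made-up coefficients)
y = 1.0 + 0.04 * age + 0.3 * party + 0.02 * age * party + rng.normal(0, 1, n)

def fit(x1, x2):
    X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Raw coding: the "main effect" of age is its slope at party = 0,
# a point that does not even exist on the 1-7 scale
b_raw = fit(age, party)

# Centered coding: the age coefficient is its slope at the mean of party,
# a typical value in the data
b_c = fit(age - age.mean(), party - party.mean())

print(b_raw[1], b_c[1])  # slope at party = 0 vs. slope at mean party
```

Algebraically, centering shifts the age coefficient from b1 to b1 + b3·mean(party), which is why the two tables report different (but reconcilable) main effects.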

We also hope that these ideas could be applied to predictive comparisons for logistic regression and other nonlinear models [11], and beyond that to multilevel models and nonlinear procedures such as generalized additive models [12]. Nonlinear models can best be summarized graphically, either compactly through summary methods such as graphs of coefficient estimates or nomograms [13–15], showing the (perhaps nonlinear) relationship between the expected outcome and each input as it is varied. But to the extent that numerical summaries are useful—and they certainly will be used—we would recommend, as a default starting point, evaluating at the mean ±1 standard deviation of each input variable. For linear models this reduces to the scaling presented in this paper.
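For the nonlinear case, the suggested default of evaluating at the mean ±1 standard deviation of each input can be sketched as follows (the fitted logistic coefficients and the data summaries here are made-up numbers, purely for illustration):

```python
import math

# Hypothetical fitted logistic regression: logit P(y = 1) = a + b*x
# (made-up coefficients, not from any real fit)
a, b = -1.0, 0.8

def invlogit(u):
    return 1 / (1 + math.exp(-u))

# Assumed data summaries for the input x
mean_x, sd_x = 2.0, 0.5

# Default numerical summary: predictive comparison as x moves
# from one sd below its mean to one sd above (a 2-sd change)
lo, hi = mean_x - sd_x, mean_x + sd_x
comparison = invlogit(a + b * hi) - invlogit(a + b * lo)
print(round(comparison, 3))  # change in P(y = 1) over the 2-sd change
```

For a linear model the same recipe just reproduces the coefficient times 2 sd, which is exactly the scaling the paper recommends.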

Finally, one might dismiss the ideas in this paper with the claim that users of regressions should understand their predictors well enough to interpret all coefficients. Our response is that regression analysis is used routinely enough that it is useful to have a routine method of scaling. For example, scanning through recent issues of two leading journals in medicine and one in economics, we found

• Table 5 of Itani et al. [16], which reports odds ratios (exponentiated logistic regression coefficients) for a large set of predictors. Most of the predictors are binary or were dichotomized, with a few numeric predictors remaining, which were rescaled by dividing by one standard deviation. As argued in this paper, dividing by one (rather than two) standard deviation will lead the user to understate the importance of these continuous inputs.

• Table 2 of Murray et al. [17], which reports linear regression coefficients for log income and latitude; the latter has a wide range in the data set and so unsurprisingly has a coefficient estimate that is very small on the absolute scale.

• Table 4 of Adda and Cornaglia [18], which reports linear regression coefficients for some binary predictors and some numerical predictors. Unsurprisingly, the coefficients for predictors such as age and education (years), house size (number of bedrooms), and family size are much smaller in magnitude than those for indicators for sex, ethnicity, church attendance, and marital status.

We bring up these examples not to criticize these papers or their journals, but to point out that, even in the most professional applied work, standard practice yields coefficients for numeric predictors that are hard to interpret. Our proposal is a direct approach to improving this interpretability.

Thinking more generally, we could consider scaling a variable by subtracting M and dividing by S, and determining M and S in part from the data. This includes various existing approaches as special cases and gives a clue about how to think about these procedures in hierarchical models or whenever else we are trying to generalize to new populations.

In general I’m all about understanding scale. I tend to prefer picking the scale out of some “prior” knowledge of how big the measurements are.

But there’s another method I’d like to mention here, which is to pick a particular condition and choose the scaling such that, by definition, one or a few of the coefficients in the model are exactly 1.

The uncertainty then becomes part of the scale of the measurements. For example, suppose you are doing a regression of daily calories consumed where

C = a + b*Age + c*ActivityLevel + error

Instead, we say “there exists c* and A* such that”

(C/c*) = 1 + ((Age - MedianAge)/A*) + q*ActivityLevel + error

We simply *define* the constant and Age coefficients to be 1, and estimate the c* and A* needed to make that happen. q then becomes a measure of how important ActivityLevel changes are relative to age changes (because the age-change coefficient is defined to be 1).
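Here is one way to compute that reparameterization in practice (a sketch with simulated data and made-up coefficients; the names c_star, A_star, and q mirror the c*, A*, and q above): fit the ordinary regression, then back out the scales that make the constant and Age coefficients exactly 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

age = rng.normal(40, 12, n)
activity = rng.uniform(0, 10, n)

# Simulated daily calories (made-up coefficients)
cal = 2000 + 15 * (age - np.median(age)) + 40 * activity + rng.normal(0, 100, n)

# Ordinary fit: cal = a + b*(age - median) + c*activity + error
X = np.column_stack([np.ones(n), age - np.median(age), activity])
a, b, c = np.linalg.lstsq(X, cal, rcond=None)[0]

# Reparameterize so the constant and age coefficients are exactly 1:
#   cal/c_star = 1 + (age - median)/A_star + q*activity + scaled error
c_star = a      # calorie scale: predicted calories at median age, zero activity
A_star = a / b  # age scale: years of age worth one "calorie unit"
q = c / a       # activity effect relative to the defined scales
print(c_star, A_star, q)
```

The two parameterizations describe the same fitted model, so this is purely a change of units chosen for interpretability.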

Daniel:

This reminds me of that “number needed to treat” thing.

Maybe an extra complication arises when the regression input is latent. Take a measurement-error model, for example: when the input X is measured with error, should I divide by 2 sd(x), or by two standard deviations of the inferred true measurement?

Yuling:

Yes, good point! There’s some research to do here . . .

Why *two* standard deviations?

Interpretive parity with an unscaled Bernoulli covariate with p = 0.5: such a covariate has standard deviation 0.5, so its one-unit change (0 to 1) equals a change of 2 standard deviations.
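In arithmetic form, just spelling out that claim:

```python
import math

# A Bernoulli covariate with p = 0.5 has standard deviation
# sqrt(p * (1 - p)) = sqrt(0.25) = 0.5
p = 0.5
sd = math.sqrt(p * (1 - p))

# So its one-unit change (0 -> 1) spans exactly 2 standard deviations,
# matching a numeric input that has been divided by 2 sd
units_in_sds = 1 / sd
print(sd, units_in_sds)  # 0.5 2.0
```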