Daniel Egan sent me a link to an article, “Standardized or simple effect size: What should be reported?” by Thom Baguley, that recently appeared in the British Journal of Psychology. Here’s the abstract:

It is regarded as best practice for psychologists to report effect size when disseminating quantitative research findings. Reporting of effect size in the psychological literature is patchy — though this may be changing — and when reported it is far from clear that appropriate effect size statistics are employed. This paper considers the practice of reporting point estimates of standardized effect size and explores factors such as reliability, range restriction and differences in design that distort standardized effect size unless suitable corrections are employed. For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size. Guidelines for deciding what effect size metric to use and how to report it are outlined. Foremost among these are: (i) a preference for simple effect size over standardized effect size, and (ii) the use of confidence intervals to indicate a plausible range of values the effect might take. Deciding on the appropriate effect size statistic to report always requires careful thought and should be influenced by the goals of the researcher, the context of the research and the potential needs of readers.

Egan writes:

I run into the problem of reporting coefficients all the time, mostly in the context of presenting effects to non-statisticians. While my audiences are generally bright, the obvious question always asked is “which of these is the biggest effect?” The fact that a sex dummy has a large numerical point estimate relative to number-of-purchases is largely irrelevant – it’s because sex’s range is tiny compared to the other covariates.

But moreover, sex is irrelevant to “policy-making” – we can’t change a person’s sex! So what we’re interested in is the viable range over which we could influence an independent variable, and the second-order likely effect upon the dependent.

So two questions:

1. For pedagogical effect, is there any way of getting around these problems? How can we communicate the effects to non-statisticians easily (think of someone who has exactly 10 minutes to understand your whole report)?

2. Is there any easy way to infer the elasticity of the effect – i.e., how much can we change the dependent variable by attempting to exogenously change one of the independents? While I know that I could design an experiment to do this, I work mostly with observational data – and this “effect” size is really what matters the most.

My quick reply to Egan is to refer to my article with Iain Pardoe on average predictive comparisons, where we discuss some of these concerns.

I also have some thoughts on the Baguley article:

On one hand, I like to see thoughtful, sensible statistical advice for practitioners–advice that links technical statistical concerns with applied goals. I’ve only participated in a little bit of psychology research myself, so I’m not the best judge of what tools are most useful there.

That said, based on my experience in social science and environmental research, I disagree with Baguley’s advice. He recommends reporting effect sizes–in my terminology, regression coefficients–on raw scales, and he warns against standardization. This is close to the opposite advice that I give in my recent article on rescaling regression inputs by dividing by two standard deviations.

How can Baguley and I, applied researchers in the not-so-different fields of psychology and political science, come to such different conclusions? I think it’s because we place slightly different (negative) values on the various mistakes we’ve seen.

What bugs me the most is regression coefficients defined on scales that are uninterpretable or nearly so: for example, coefficients for age and age-squared (in a political context, do we really care about the difference between a 51-year-old and a 52-year-old? How are we supposed to understand the resulting coefficients such as 0.01 and 0.003?) or, worse still, a predictor such as the population of a country (which will give nicely-interpretable coefficients on the order of 10^-8). I used to deal with these problems by rescaling by hand, for example using age/10 or population in millions. But big problems remained: manual scaling is arbitrary (why not age/20? Should we express income in thousands or tens of thousands of dollars?); it still left difficulties in comparing coefficients (if we’re typically standardizing by factors of 10, this leaves a lot of play in the system); and it is difficult to give as general advice.

Applied researchers–myself included–tend to use default options, especially in the crucial model-building phase. I’m not just concerned about the presentation of final results–important as that is–but also in the ways that we can use regression coefficients (effect sizes) in understanding the models we build as intermediate steps in our data exploration. For example, it is common practice–and good practice, in my opinion–to consider interactions of the largest main effects in a model. But what does “largest” mean? Standardizing puts things on an approximately common scale, and with no effort.

This is perhaps less of a problem in experimental psychology, where statistical models are traditionally specified in advance–but, as we analyze increasingly complex data structures, statistical exploration and modeling become increasingly entwined, and model-building strategy can make a difference.

Baguley’s article performs a useful service by reminding us of the many problems inherent in standardized measures. These problems are real, but I think they’re much smaller than the problems of raw coefficients. For example, his Figure 1 shows how, by selecting extreme x-values, a user can (misleadingly) make a standardized regression coefficient go from 0.60 to 0.73. Point taken. But suppose you’re predicting test scores given grade point average (on a 0-4 scale) and SAT score (scaled from 200-800). Now you’re talking about a factor of over 100! Even if you, say, rescale SAT by dividing by 100, the coefficient for it and GPA could easily be off by a factor of 2 in interpretation. Or, to put it another way, why should a change of 1 in GPA be considered the same as a change of 1, or 100, in SAT?

OK, I’ve made my point as best I can here. Let me summarize the take-home point.

1. For comparing coefficients for different predictors *within* a model, standardizing gets the nod. (Although I don’t standardize binary inputs. I code them as 0/1, and then I standardize all other numeric inputs by dividing by two standard deviations, thus putting them on approximately the same scale as 0/1 variables.)
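A minimal sketch of this default; the data and variable names (age, female) are hypothetical, but the rule is the one stated above: leave binary inputs as 0/1, divide everything else by twice its standard deviation.

```python
import numpy as np

def rescale_inputs(X, binary_cols):
    """Divide each non-binary column by twice its standard deviation,
    leaving binary (0/1) columns unchanged, so all inputs end up on
    roughly the same scale as a 0/1 variable."""
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        if j not in binary_cols:
            X[:, j] = X[:, j] / (2 * X[:, j].std())
    return X

rng = np.random.default_rng(0)
age = rng.normal(45, 15, size=1000)      # continuous input
female = rng.integers(0, 2, size=1000)   # binary input, kept as 0/1
X = np.column_stack([age, female])
Xs = rescale_inputs(X, binary_cols={1})
print(Xs[:, 0].std())  # 0.5 by construction, matching a binary input's sd
```

A coefficient on the rescaled age column then corresponds to a two-standard-deviation change, roughly comparable to flipping the 0/1 input.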

2. For comparing coefficients for the same predictors across *different data sets*, it’s better not to standardize–or, to standardize only once. Baguley discusses the slipperiness of standardized effect sizes when the denominator starts changing under your feet.

Writers on the subject (including myself) have not done a good job of explaining the differences between these scenarios, even when we have recognized the issue in our applied work. I hope this brief essay is helpful in this regard.

As the author – I'm really pleased with your comment. Psychology has gone too far (in my opinion) toward the default of standardizing without thinking through the implications. In some situations this is clearly a very bad move – and in others less so.

In terms of communicating effects, I can't think of an easy answer just yet. I think emphasizing interval estimates over point estimates and focusing on prediction may be a start. Belatedly, I've come to realize that communicating findings is a harder problem than the analysis. The idea of reporting effect size seems like a good one, but when you read hundreds of results reported with partial eta-squared (often mislabeled as eta-squared) just because that is what SPSS reports, something has gone wrong (it is just a mindless ritual of "must report effect size").

One difference between psychology and the complex modeling you are involved in is (as you note) the complexity of the models. I can't see a strong argument ever to standardize effect size when reporting the difference in means between two experimental conditions. A further difference is that it is very common to compare such results between studies. This is when the standardization can be most misleading. Quite often the samples will differ in variances, the designs will differ and the measures will differ in reliability (e.g., different numbers of items on a memory test). A researcher needs to understand that two studies can have identical differences in means (e.g., percentage accuracy) and very different standardized differences. Many psychologists I speak to seem surprised by this.

Just a few quick thoughts – the last line of the abstract (I think) shows some fundamental agreement between myself and Andrew:

"Deciding on the appropriate effect size statistic to report always requires careful thought and should be influenced by the goals of the researcher, the context of the research and the potential needs of readers."

Thom

As an applied economist, my basic feeling is that I don't care how you've scaled the variables as long as you tell me and I can replicate the model myself. Since I'm going to reproduce the model and play with it myself, my preference is for authors to do a minimal amount of manipulation (just scaling to keep things numerically of similar magnitudes, arbitrarily). In fact, the thing I most regret is that nobody reports variance-covariance matrices of coefficients. I grant that my willingness to do this makes me unusual among readers of the article, but the article has to be pretty important to me, or I won't bother… and if the article is not important, why would I care how to carefully interpret the results? That's what referees and editors ought to be for…

This whole situation reminds me strongly of the difference between "dimensional" and "non-dimensional" or "dimensionless" models in physics.

The basic idea is that you have some phenomenon for which there is a set of variables that determine the state of the system, and an ODE or PDE that describes the evolution of these variables. For example, perhaps pressure, temperature, and number of molecules for a system involving a gas flowing through a container. Our variables are then P, T, and N. They can be measured in any of a variety of systems of units. Perhaps PSI, Fahrenheit, and moles, or kPa, Kelvin, and molecules… or some mixture of different units…

If you model the system based on these arbitrary units you will get a model which applies only to those units. However, for every system there is generally a "good" choice of reference values which you can measure your variables from, and this will change your system into one where all the variables take on values around zero or one, and each of the terms in your equation can then be compared against others, and certain terms dropped due to their small relative size.

In this example, we could define "primed" variables which are nondimensional as follows:

P = Patm * P'

T = Tinit * (1+T')

N = (Patm*Vcontainer)/(Tinit*R) * N'

Now P' starts at 1, N' starts at 1 (PV = NRT is the gas law), and T' starts at 0.

This scaling is not arbitrary, it has a goal of choosing a reference value so that the nondimensional variables have "nice" sizes, and so that the reference values are "natural" or "commonly occurring" or generally easily "reproducible" in a lab setting. In other words, when you see a well constructed nondimensionalization you can recognize it.
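To make the gas example concrete, here is a small sketch of that nondimensionalization (the container volume and reference values below are illustrative numbers I've picked, not from the original comment):

```python
# Nondimensionalize P, T, N using the reference values from the text:
# P = Patm * P',  T = Tinit * (1 + T'),  N = (Patm*Vcontainer)/(Tinit*R) * N'
R = 8.314            # J/(mol*K), gas constant
P_atm = 101325.0     # Pa, reference pressure (1 atm)
T_init = 293.15      # K, reference temperature
V_container = 0.01   # m^3, assumed container volume

def nondimensionalize(P, T, N):
    P_prime = P / P_atm                                 # starts at 1
    T_prime = T / T_init - 1                            # starts at 0
    N_prime = N * (T_init * R) / (P_atm * V_container)  # starts at 1
    return P_prime, T_prime, N_prime

# At the reference state the gas law P V = N R T fixes N:
N0 = P_atm * V_container / (R * T_init)
print(nondimensionalize(P_atm, T_init, N0))  # (1.0, 0.0, 1.0)
```

The point is visible in the output: at the reference state every primed variable sits exactly at 0 or 1, so deviations from those values are directly comparable across the three quantities.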

In statistical models for social sciences I can easily imagine doing a similar thing. For example, rather than standardizing by 2 standard deviations of a particular variable, like age, you might standardize against a simple external value like "life expectancy" which can be gotten from someone else's statistical analysis (like the census bureau), thus making your new variable approximately in the range 0 to 1. Or you might standardize by years between presidential elections (4), defining a new "age prime" as

Age' = (Age – (Age at first eligibility to vote))/(Yrs between Elections).

So that you're essentially counting age as a function of the number of elections participated in. (I don't like this as much, but for some purposes it may be ideal.)

For something like the population of a country, perhaps define the variable as actual population divided by median population among all countries; or if you're interested in population density, define a variable as population divided by average density times actual area.
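A quick sketch of these external-reference scalings; the voting age of 18, the 4-year cycle, and the median-population figure are assumptions for illustration:

```python
# Rescale inputs by external reference values rather than sample
# standard deviations, so the units mean something outside the data set.
VOTING_AGE = 18            # assumed age of first eligibility to vote
YEARS_BETWEEN_ELECTIONS = 4

def age_prime(age):
    """Age measured in election cycles since first eligibility:
    Age' = (Age - voting age) / (years between elections)."""
    return (age - VOTING_AGE) / YEARS_BETWEEN_ELECTIONS

MEDIAN_POPULATION = 5_000_000  # hypothetical external figure

def population_prime(pop):
    """Population relative to the median country."""
    return pop / MEDIAN_POPULATION

print(age_prime(50))                 # 8.0 election cycles since eligibility
print(population_prime(10_000_000))  # 2.0, i.e., twice the median country
```

Unlike a sample standard deviation, these denominators do not shift when you move to a new data set, which is the reproducibility property the physics analogy is after.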

These ideas are well explicated in many physics/PDE modeling books.

I agree that there's no one-size-fits-all solution. One tool that works for many situations I encounter is percent of maximum possible (POMP) scores. POMP scoring can be used for any variable that has a defined theoretical minimum and maximum (GPA and SAT scores both fit this; so do a lot of Likert-type attitude and personality scales). Basically you just transform the measure to go from 0 to 100.

Since it's just a linear transformation of the raw score, you don't run into some of the problems of between-sample comparisons that come with standardized scores. It puts different measures on a nominally comparable scale. And 0-100 is a fairly intuitive metric for most people: 0 means "the worst that you can possibly do on this measurement," and 100 means the best.

Daniel: You suggest defining variables so that they are approximately in the range 0 to 1. The trouble is that will make the coefficients for such variables seem large, when compared to coefficients for binary variables, which take on the values 0 and 1 exactly. The difficulty is that a binary variable has a sd of 0.5 (assuming p is not too close to 0 or 1), whereas a continuous variable defined on (0,1) will have a much smaller sd: a uniform (0,1) random variable has sd of 0.29, and something more bell-shaped will have an even smaller sd. Thus, the typical comparisons for such a variable will be much smaller than 1. And, taking a coefficient which corresponds to a change from 0 to 1 will be overstating the predictive effect of this input in your model.
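The standard deviations in this comparison can be checked directly from the formulas (the Beta(5, 5) is my own stand-in for "something more bell-shaped" on (0, 1)):

```python
import math

# sd of a binary variable with probability p of being 1: sqrt(p*(1-p))
p = 0.5
sd_binary = math.sqrt(p * (1 - p))
print(sd_binary)  # 0.5

# sd of a uniform(0, 1) variable: 1/sqrt(12)
sd_uniform = 1 / math.sqrt(12)
print(round(sd_uniform, 2))  # 0.29

# A bell-shaped Beta(5, 5) on (0, 1) is narrower still:
# var = a*b / ((a+b)^2 * (a+b+1))
a = b = 5
sd_beta = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
print(round(sd_beta, 2))  # 0.15
```

So a coefficient per unit change on a (0, 1)-scaled continuous input spans two, three, or more standard deviations, which is why it overstates the input's predictive importance relative to a 0/1 variable.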

Sanjay: See comment to Daniel. 0-100 is bad in the other direction, in that you'll have tiny coefficients. In general, I don't find the change from, say, x=56 to x=57 to be so interesting.

As Iain and I discuss in our paper, defaults matter.

Of course you can always reanalyze the data yourself, transform as you'd like, etc. – but I'd like the default coefs to be directly interpretable, in a way that is often not the case with raw-scaled inputs.

If you've set up a subjective prior distribution for the regression coefficients, you've already established a scale for each coefficient – the prior expectation of its absolute value. You can then compare the posterior expectation to this prior expectation to see how things turned out.

Standardizing the variables is a poor (but perhaps expedient) substitute for this, giving roughly the same results if you assume that the variables have been prescreened to include only those for which the variation in the available data is thought a priori to be enough to (possibly) see an effect. After standardizing, you might then (if you're not trying to get the best results) use the same prior for all regression coefficients, and interpret them in the same way. But I wouldn't regard this as anything but a quick hack for when you're not willing (or able) to think about the variables in more detail.
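As a concrete sketch of that prior-scale comparison (my own illustration with normal distributions and made-up numbers, not a procedure from the discussion above): under a Normal(0, sd^2) prior the expected absolute value is sd*sqrt(2/pi), and for a normal posterior the expected absolute value is the folded-normal mean.

```python
import math

def normal_expected_abs(mean, sd):
    """E|X| for X ~ Normal(mean, sd^2) (folded-normal mean)."""
    return (sd * math.sqrt(2 / math.pi) * math.exp(-mean**2 / (2 * sd**2))
            + mean * math.erf(mean / (sd * math.sqrt(2))))

prior_scale = normal_expected_abs(0.0, 1.0)  # prior E|beta|, about 0.80
post_abs = normal_expected_abs(1.2, 0.3)     # hypothetical posterior E|beta|

# Ratio > 1 means the effect came out larger than the prior scale suggested.
print(post_abs / prior_scale)
```

Standardizing by the data's spread amounts to replacing this per-coefficient prior scale with a single common one, which is the substitution being criticized.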

Radford: I pretty much agree with you. But I prefer the term "default" to "hack." We need defaults in all sorts of settings, and it's not hackish to come up with good defaults.

"But what does "largest" mean? Standardizing puts things on an approximately common scale, and with no effort." My impression is that in most situations a common metric is the _least_ important problem for a meaningful interpretation of the coefficients. Instead, the main problem in most applied contexts tends to be how easy or difficult (or relevant or irrelevant) a change in one predictor is relative to a change in the others for the question at hand. For instance, what is easier to change, income or party ID? This is true even for relatively similar inputs such as disposable income and income inequality. One has to master the subject matter in order to evaluate how large a particular change is: standardization and common metrics might be of little help here.

My comment is ultimately related to Radford's: the choice of a "good" nondimensionalization is intimately tied to the expected size of a given variable. Also, it wasn't so much that I recommended transforming so that things are in the range 0 to 1, but rather so that typical variations are on the order of 1 and the initial value is either 0 or 1 (in a dynamics model)… there are reasons for doing this that are perhaps irrelevant to statistical models but more relevant to PDE-type models.

In areas of physics like fluid mechanics, if we see a factor of 2 difference between one thing and another, we're often extremely pleased, as effect sizes can range over factors of 10 or 100.

The advantage of choosing a nondimensionalization relative to a typical value that can be estimated or constructed independently of your model is that things are very interpretable… If the pressure goes up by 1 atm, that's an understandable unit; if it goes up by 38 kPa, it's annoying to figure out whether that's big or small.

Similarly, if a coefficient in a statistical model tells you that an age difference of 3 election cycles has as big an effect as male vs. female at a given age, that's interesting and interpretable as well… even if the coefficients themselves are not directly comparable. On the other hand, 3 election cycles is 12 years, and when you measure on that raw scale the coefficient is not very helpful.