Daniel Egan sent me a link to an article, “Standardized or simple effect size: What should be reported?” by Thom Baguley, that recently appeared in the British Journal of Psychology. Here’s the abstract:
It is regarded as best practice for psychologists to report effect size when disseminating quantitative research findings. Reporting of effect size in the psychological literature is patchy — though this may be changing — and when reported it is far from clear that appropriate effect size statistics are employed. This paper considers the practice of reporting point estimates of standardized effect size and explores factors such as reliability, range restriction and differences in design that distort standardized effect size unless suitable corrections are employed. For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size. Guidelines for deciding what effect size metric to use and how to report it are outlined. Foremost among these are: (i) a preference for simple effect size over standardized effect size, and (ii) the use of confidence intervals to indicate a plausible range of values the effect might take. Deciding on the appropriate effect size statistic to report always requires careful thought and should be influenced by the goals of the researcher, the context of the research and the potential needs of readers.
I run into the problem of reporting coefficients all the time, mostly in the context of presenting effects to non-statisticians. While my audiences are generally bright, the obvious question always asked is “which of these is the biggest effect?” The fact that a sex dummy has a large numerical point estimate relative to number-of-purchases is largely irrelevant – it’s because sex’s range is tiny compared to other covariates.
But moreover, sex is irrelevant to “policy-making” – we can’t change a person’s sex! So what we’re interested in is the viable range over which we could influence an independent variable, and the likely second-order effect upon the dependent.
So two questions:
1. For pedagogical purposes, is there any way of getting around these problems? How can we communicate the effects to non-statisticians easily (think of someone who has exactly 10 minutes to understand your whole report)?
2. Is there any easy way to infer the elasticity of the effect – i.e., how much can we change the dependent by attempting to exogenously change one of the independents? While I know that I could design an experiment to do this, I work mostly with observational data – and this “effect” size is really what matters the most.
My quick reply to Egan is to refer to my article with Iain Pardoe on average predictive comparisons, where we discuss some of these concerns.
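For readers who haven’t seen that paper, here’s the rough idea in code – a minimal sketch only, not the estimator from the article, and the function and variable names are just placeholders: take your fitted model, move the input of interest between two values you care about while leaving everyone’s other inputs at their observed values, and average the change in predictions.

```python
import numpy as np

def average_predictive_comparison(predict, X, j, lo, hi):
    """Rough sketch of an average predictive comparison for input j.

    predict -- function mapping an (n, k) input matrix to predicted outcomes
    X       -- observed inputs, shape (n, k)
    j       -- column index of the input of interest
    lo, hi  -- two values of input j worth comparing (say, its 10th and 90th percentiles)
    """
    X_lo, X_hi = X.copy(), X.copy()
    X_lo[:, j] = lo                     # everyone gets the low value of input j
    X_hi[:, j] = hi                     # everyone gets the high value of input j
    diffs = predict(X_hi) - predict(X_lo)
    return diffs.mean() / (hi - lo)     # average change in prediction per unit of input j
```

For a plain linear model this just gives back the coefficient; the payoff comes with logistic regressions, interactions, and other nonlinear settings, where the comparison depends on where the other inputs happen to sit.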
I also have some thoughts on the Baguley article:
On one hand, I like to see thoughtful, sensible statistical advice for practitioners–advice that links technical statistical concerns with applied goals. I’ve only participated in a little bit of psychology research myself, so I’m not the best judge of what tools are most useful there.
That said, based on my experience in social science and environmental research, I disagree with Baguley’s advice. He recommends reporting effect sizes–in my terminology, regression coefficients–on raw scales, and he warns against standardization. This is close to the opposite advice that I give in my recent article on rescaling regression inputs by dividing by two standard deviations.
How can Baguley and I, applied researchers in the not-so-different fields of psychology and political science, come to such different conclusions? I think it’s because we place slightly different (negative) values on the various mistakes we’ve seen.
What bugs me the most is regression coefficients defined on scales that are uninterpretable or nearly so: for example, coefficients for age and age-squared (in a political context, do we really care about the difference between a 51-year-old and a 52-year-old? How are we supposed to understand the resulting coefficients, such as 0.01 and 0.003?) or, worse still, a predictor such as the population of a country (which will give “nicely interpretable” coefficients on the order of 10^-8). I used to deal with these problems by rescaling by hand, for example using age/10 or population in millions. But big problems remained: manual scaling is arbitrary (why not age/20? Should we express income in thousands or tens of thousands of dollars?); it still leaves difficulties in comparing coefficients (if we’re typically rescaling by factors of 10, this leaves a lot of play in the system); and it is hard to give as general advice.
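To see how much play there is, a tiny made-up arithmetic example (the numbers are invented, purely for illustration): suppose the fitted effect is 0.02 points of the outcome per year of age. Then the coefficient you report is whatever your arbitrary manual rescaling says it is.

```python
effect_per_year = 0.02       # invented number, purely for illustration
print(effect_per_year * 1)   # 0.02 if you regress on age in years
print(effect_per_year * 10)  # 0.2  if you regress on age/10
print(effect_per_year * 20)  # 0.4  if you regress on age/20
```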
Applied researchers–myself included–tend to use default options, especially in the crucial model-building phase. I’m not just concerned about the presentation of final results–important as that is–but also about the ways that we can use regression coefficients (effect sizes) to understand the models we build as intermediate steps in our data exploration. For example, it is common practice–and good practice, in my opinion–to consider interactions of the largest main effects in a model. But what does “largest” mean? Standardizing puts things on an approximately common scale, and with no effort.
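As a rough illustration of how this plays out in the model-building loop (a sketch with invented helper names, not code from any particular package): standardize, fit the main effects, and let the absolute sizes of the coefficients nominate the interaction to try next.

```python
import numpy as np

def ols_slopes(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]    # drop the intercept

def largest_main_effects(X, y, names, top=2):
    Z = X / (2 * X.std(axis=0))             # each input divided by two standard deviations
    coefs = ols_slopes(Z, y)
    order = np.argsort(-np.abs(coefs))      # rank inputs by absolute standardized coefficient
    return [names[i] for i in order[:top]]  # candidates for the next interaction term
```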
This is perhaps less of a problem in experimental psychology, where statistical models are traditionally specified in advance–but, as we analyze increasingly complex data structures, statistical exploration and modeling become increasingly entwined, and model-building strategy can make a difference.
Baguley’s article performs a useful service by reminding us of the many problems inherent in standardized measures. These problems are real, but I think they’re much smaller than the problems of raw coefficients. For example, his Figure 1 shows how, by selecting extreme x-values, a user can (misleadingly) make a standardized regression coefficient go from 0.60 to 0.73. Point taken. But suppose you’re predicting test scores given grade point average (on a 0-4 scale) and SAT score (scaled from 200-800). Now you’re talking about scales that differ by a factor of over 100! Even if you, say, rescale SAT by dividing by 100, the coefficients for it and for GPA could still easily differ by a factor of 2 in interpretation. Or, to put it another way, why should a change of 1 in GPA be considered the same as a change of 1, or 100, in SAT?
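To put some numbers on the GPA/SAT example, here’s a toy simulation (the data and the relative importances are made up): the two predictors are built to matter about equally, but the raw coefficients differ by a factor of about 150, simply because of the scales; dividing each input by two standard deviations puts the coefficients back on comparable footing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
gpa = rng.uniform(0, 4, n)          # grade point average, 0-4 scale
sat = rng.uniform(200, 800, n)      # SAT score, 200-800 scale

# Invented outcome in which one point of GPA is "worth" about 150 SAT points
y = 10 * gpa + (10 / 150) * sat + rng.normal(0, 5, n)

def ols_slopes(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]    # drop the intercept

raw = ols_slopes(np.column_stack([gpa, sat]), y)
std = ols_slopes(np.column_stack([gpa / (2 * gpa.std()),
                                  sat / (2 * sat.std())]), y)

print("raw coefficients (per 1 unit of each input):", raw)   # wildly different magnitudes
print("coefficients per 2 sd of each input:        ", std)   # now roughly comparable
```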
OK, I’ve made my point as best I can here. Let me summarize the take-home points.
1. For comparing coefficients for different predictors within a model, standardizing gets the nod. (Although I don’t standardize binary inputs. I code them as 0/1, and then I standardize all other numeric inputs by dividing by two standard deviations, thus putting them on approximately the same scale as 0/1 variables. A short code sketch of this rule appears after the list.)
2. For comparing coefficients for the same predictors across different data sets, it’s better not to standardize–or, to standardize only once. Baguley discusses the slipperiness of standardized effect sizes when the denominator starts changing under your feet.
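Here is the rule from point 1 as a short code sketch – my own summary, not any packaged implementation, and the binary-detection test is just one reasonable choice:

```python
import numpy as np

def rescale_inputs(X):
    """Leave 0/1 inputs alone; divide every other numeric input by two standard deviations."""
    Z = np.asarray(X, dtype=float).copy()
    for j in range(Z.shape[1]):
        col = Z[:, j]
        if set(np.unique(col)) <= {0.0, 1.0}:   # binary input: keep the 0/1 coding
            continue
        Z[:, j] = col / (2 * col.std())         # other numeric inputs: divide by two sd
    return Z
```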
Writers on the subject (including myself) have not done a good job at explaining the differences between these scenarios, even when we have recognized the issue in our applied work. I hope this brief essay is helpful in this regard.