Nasir Bashir writes:
Effect sizes (ESs), also known as standardized differences, are a scale-independent measure for quantifying differences in variables across two groups. They are widely adopted within the health and social sciences, and two scenarios in which ESs are commonly used are: (i) to compare baseline characteristics between groups, and (ii) to quantify the experimental effect by comparing outcomes between the groups. This information will not be new to many readers of the blog. What is more interesting is that there has been very little in the way of comprehensive Python software for computing ESs, especially given how heavily Python is used for scientific computing.
There is now a new Python package named effectsize, which provides comprehensive treatment of ESs for both continuous and categorical variables, including more complex functionality such as the ability to deal with skewed data, multinomial categories, and weighted statistics. effectsize implements the methodology outlined by Yang and Dalton, “A Unified Approach to Measuring the Effect Size between Two Groups Using SAS” (2012), i.e., the mean difference divided by the pooled standard deviation for continuous variables and a multivariate Mahalanobis distance for categorical variables.
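To make the two definitions concrete, here is a minimal NumPy sketch. It does not use the effectsize package and should not be read as its implementation. It assumes the pooled standard deviation is the square root of the average of the two sample variances, and that the categorical measure is a Mahalanobis-type distance on the category proportions with the first category dropped. The function names continuous_es and categorical_es and the example data are hypothetical.

```python
import numpy as np


def continuous_es(x1, x2):
    """Standardized difference for a continuous variable."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Assumption: pooled SD = sqrt of the average of the two sample variances.
    pooled_sd = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2.0)
    return (x1.mean() - x2.mean()) / pooled_sd


def categorical_es(p1, p2):
    """Mahalanobis-type distance between two vectors of category proportions."""
    # Drop the first category: proportions sum to 1, so it is redundant.
    t = np.asarray(p1, dtype=float)[1:]
    c = np.asarray(p2, dtype=float)[1:]
    # Multinomial covariance matrix, averaged over the two groups.
    s = -(np.outer(t, t) + np.outer(c, c)) / 2.0
    np.fill_diagonal(s, (t * (1 - t) + c * (1 - c)) / 2.0)
    diff = t - c
    return float(np.sqrt(diff @ np.linalg.solve(s, diff)))


# Hypothetical blood pressure readings (mmHg) in two groups.
treated = [120.0, 125.0, 130.0, 135.0, 140.0]
control = [130.0, 135.0, 140.0, 145.0, 150.0]
d_cont = continuous_es(treated, control)

# Hypothetical binary variable: 50% vs 30% in the second category.
d_bin = categorical_es([0.5, 0.5], [0.7, 0.3])

print(f"continuous ES: {d_cont:.3f}")
print(f"binary ES:     {d_bin:.3f}")
```

For a binary variable the matrix form collapses to the familiar difference in proportions divided by the square root of the averaged binomial variances, which is a useful sanity check on the general formula.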
The main function within the package (effectsize.compute) takes a Pandas DataFrame and a variable within the DataFrame defining the two groups of interest, and will compute ESs for all variables which the user specifies. The software is open-source, and all source code, documentation, and usage examples are available on the GitHub repository (https://github.com/nbashir97/effectsize). Binary installers for effectsize are available through both PyPI and the conda-forge channel on Conda. The developer, Nasir Bashir, is a UK-based researcher who is open to suggestions for expanding on the software’s functionality and also developments to the underlying statistical theory.
GitHub repository: https://github.com/nbashir97/effectsize
PyPI page: https://pypi.org/project/effectsize/
I know nothing about this, but I like the idea of people putting these sorts of packages online, so here it is, make of it what you will!
I do not quite understand “(ii) to quantify the experimental effect by comparing outcomes between the groups”.
In experiments, the difference between groups is the “real” effect plus or minus chance variation, is it not? If so, can we talk about effect sizes only with regard to populations?
What you are referring to (the “real” effect) is what we might call the unstandardised effect size. For example, if we are looking at the change in blood pressure (in mmHg) for those taking a new medication vs those taking a placebo, then we could say that the effect of the drug is to reduce blood pressure by X mmHg. However, if we were to take the “X mmHg” and divide it by the pooled standard deviation across the experimental and control groups, then we would get what we might call the standardised effect size, or just effect size for short. This would now be interpreted in standard deviation units, i.e., those taking the drug have a blood pressure which is Y standard deviations lower than those not taking the drug. This is roughly how effect sizes are interpreted for experimental outcomes, and they are very commonly used in experimental psychology in this regard.
Another very common use for effect sizes in the context of experimental outcomes is in meta-analysis. For example, let’s say we are now looking at 10 trials on this new blood pressure drug, and 7 of them measure blood pressure using mmHg, 2 of them use kPa, and the final study uses some other arbitrary measure that is not mmHg or kPa. Imagine we are not entirely sure how to translate all of these onto a single scale of measurement. Well, in this case we would compute the effect size for each trial, and this is now our single scale of measurement, because standard deviation units are interpreted the same way regardless of the original scale used to measure the raw values. Now these 10 trials can be pooled together into a meta-analysis on the effect size scale. I believe this is the context in which Hedges’ g was originally introduced, and it is the way of pooling studies on different outcome scales that the Cochrane Collaboration recommends.
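The scale-independence claim can be checked numerically. The sketch below uses made-up blood pressure readings and the conversion 1 mmHg ≈ 0.133322 kPa. The helper standardized_difference is hypothetical and assumes the pooled standard deviation is the square root of the average of the two sample variances.

```python
import numpy as np


def standardized_difference(x1, x2):
    """Mean difference divided by a pooled SD (average of the two variances)."""
    pooled_sd = np.sqrt((np.var(x1, ddof=1) + np.var(x2, ddof=1)) / 2.0)
    return (np.mean(x1) - np.mean(x2)) / pooled_sd


MMHG_TO_KPA = 0.133322  # approximate conversion factor

# Hypothetical readings from one trial, recorded in mmHg.
drug_mmhg = np.array([118.0, 124.0, 131.0, 127.0, 122.0, 135.0])
placebo_mmhg = np.array([130.0, 138.0, 129.0, 141.0, 136.0, 144.0])

# The same readings expressed in kPa.
drug_kpa = drug_mmhg * MMHG_TO_KPA
placebo_kpa = placebo_mmhg * MMHG_TO_KPA

# Raw mean differences depend on the unit of measurement ...
raw_diff_mmhg = drug_mmhg.mean() - placebo_mmhg.mean()
raw_diff_kpa = drug_kpa.mean() - placebo_kpa.mean()

# ... but the standardized difference does not.
d_mmhg = standardized_difference(drug_mmhg, placebo_mmhg)
d_kpa = standardized_difference(drug_kpa, placebo_kpa)

print(f"raw difference: {raw_diff_mmhg:.2f} mmHg vs {raw_diff_kpa:.2f} kPa")
print(f"standardized:   {d_mmhg:.4f} vs {d_kpa:.4f}")
```

Rescaling multiplies both the mean difference and the pooled standard deviation by the same constant, so the ratio is unchanged. This is only an invariance to the unit of measurement; it says nothing about whether the standard deviations in different trials are comparable.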
You did misunderstand me here, I’m afraid. The calculation was not my issue. Based on J. Cohen, my impression always was that in samples you cannot determine effect sizes. Of course, one can calculate an effect size /measure/ (for example for meta-analyses, as you mentioned). But if I had an experiment with n1=n2=5 and found d=0.5, I could not call this “the effect size of the experimental intervention”. I could just say that I calculated d=0.5. The true effect of the experimental intervention might be d=0.9 or d=0.0 or even d=-0.1. The d=0.5 is a combination of the unknown true effect and random error. Therefore, I have the impression that calling the sample d an effect size is misleading.
I have frequently encountered the reasoning: “p was < 0.05 in my experiment, therefore I may interpret the calculated effect size measure from my sample as the true effect”, completely ignoring random variation.
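A small simulation illustrates how much a sample d can wander around the true value when n1=n2=5. This is only a sketch under stated assumptions: both groups are normally distributed with unit standard deviation, the true standardized difference is 0.5, and the seed and number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed for reproducibility

n_per_group = 5
true_d = 0.5      # assumed true standardized difference
n_sims = 20_000   # number of simulated experiments

# Each row is one simulated experiment.
control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
treated = rng.normal(true_d, 1.0, size=(n_sims, n_per_group))

# Sample d for every simulated experiment.
pooled_sd = np.sqrt(
    (treated.var(axis=1, ddof=1) + control.var(axis=1, ddof=1)) / 2.0
)
sample_d = (treated.mean(axis=1) - control.mean(axis=1)) / pooled_sd

# How often does the sample d even have the wrong sign?
share_negative = float(np.mean(sample_d < 0))
lower, upper = np.percentile(sample_d, [2.5, 97.5])

print(f"share of experiments with d < 0: {share_negative:.2f}")
print(f"central 95% of sample d values:  {lower:.2f} to {upper:.2f}")
```

Under these assumptions the sample d has the wrong sign in a substantial share of experiments, and the central 95% range spans far more than the difference between d=0.0 and d=0.9, which is the point being made about random error.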
Ney – I see what you mean, that is my fault for abuse of terminology. Yes, I would also not call the sample-specific d (or any other standardized measure) the same thing as the true effect, as the calculation also includes random variation. Reading back, I think my use of the term “experimental effect” here is quite confusing as I seem to be implying that the sample-specific d and the true effect are one and the same (which, as you say, they are not).
Are you not aware that one can’t draw more general inferences regarding the “size” of an effect when using “standardized effect sizes” based on sample-specific variances? There is a huge literature about this issue. In particular, Sander Greenland has written extensively on this topic. I suggest that you spend some time becoming educated. What you are proposing is of great concern.
I’m increasingly convinced that rote application of standardized effect size measures is too widespread already and is mainly bad. Perhaps the exception is for simple rules of thumb for power / sample size calculations.
I also think that, where possible, effect sizes should be kept in the units of the original scale on which the outcome was measured and not standardized as this greatly helps with interpretability.
Having said that, standardized indices of effect size are widespread in their use and you mentioned a couple of scenarios where they may be useful. My personal use case for them was for comparing groups after propensity score weighting, which is what led me to develop this package.
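For the propensity score use case, a weighted standardized difference can be sketched as follows. This is not the effectsize package's code. It assumes one common form of the weighted sample variance, which reduces to the usual unbiased variance when all weights are equal, and the weights shown are made up rather than estimated from a propensity score model.

```python
import numpy as np


def weighted_mean_var(x, w):
    """Weighted mean and one common form of the weighted sample variance."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    mean = np.sum(w * x) / np.sum(w)
    # Reduces to the unbiased sample variance when all weights are equal.
    scale = np.sum(w) / (np.sum(w) ** 2 - np.sum(w ** 2))
    var = scale * np.sum(w * (x - mean) ** 2)
    return mean, var


def weighted_standardized_difference(x1, w1, x2, w2):
    """Difference in weighted means over the pooled weighted SD."""
    m1, v1 = weighted_mean_var(x1, w1)
    m2, v2 = weighted_mean_var(x2, w2)
    return (m1 - m2) / np.sqrt((v1 + v2) / 2.0)


# Hypothetical covariate values (e.g. age) in the two groups.
treated = np.array([61.0, 58.0, 70.0, 66.0, 73.0])
control = np.array([52.0, 49.0, 63.0, 57.0, 68.0])

# Hypothetical weights; in practice these would come from a fitted
# propensity score model (e.g. inverse probability of treatment weights).
w_treated = np.array([1.2, 1.5, 0.9, 1.0, 0.8])
w_control = np.array([0.7, 0.6, 1.4, 1.0, 1.9])

d_unweighted = weighted_standardized_difference(
    treated, np.ones(5), control, np.ones(5)
)
d_weighted = weighted_standardized_difference(
    treated, w_treated, control, w_control
)

print(f"before weighting: {d_unweighted:.3f}")
print(f"after weighting:  {d_weighted:.3f}")
```

Comparing the value before and after weighting is the usual way such a measure is used as a balance diagnostic: a weighted value closer to zero suggests the weights have made the groups more similar on that covariate.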
I should have been clearer in my reply to Ney: I was outlining how effect sizes are computed in the context of experimental outcomes, and giving examples of scenarios in which they are used. I did not make any claims about how generalisable the inferences are based on sample-specific calculations, or at least I did not intend to. I apologise if this was unclear in my wording.
Nasir — It seems to me that you really do not have any idea what you are talking about with regard to “standardization” and effect size. Please spend some time learning about this issue.