Different effect sizes used in statistics/psychometrics or machine learning; what should this researcher do?

Tony Lee writes:

One of my research projects is a meta-analysis synthesizing the effect sizes of studies that employed machine/deep learning to predict personality scores. The obtained (mean) effect size would then be compared with the effect sizes of studies based on conventional psychological assessment.

As you know, in statistics there are several commonly used effect sizes, such as Pearson’s r or Cohen’s d, among many others. Yet studies in machine (deep) learning often use very different metrics to evaluate performance (e.g., accuracy/AUC-ROC/F1/kappa for binary classification problems, and loss functions like MAE/MSE/RMSE for regression problems).

Now the question arises: Is there a convergent effect size that we can use to measure globally across different performance metrics in ML/DL studies? Or is it possible (or reasonable) to convert the performance metrics used in ML into, say, Pearson’s correlation coefficient?

I’ve read some papers and searched extensively, but the solutions are either unsatisfying or don’t address the problem directly. For example, some suggest converting AUC-ROC into r (for binary classification problems, but this does not work for regression problems). Converting MAE/MSE/RMSE into r seems mathematically feasible; is that right?

My ultimate goal is to find a global effect size estimator that allows me to cross-compare these performance metrics.

David Powers wrote a paper (2007) on ML evaluation using concepts like informedness and markedness. Rama Ramakrishnan also wrote a blog post on something similar. I wonder if this could be a solution to my problem? I am also considering the Matthews correlation coefficient as a potential effect size estimator for classification problems and Pearson’s r for regression problems (the two can then be cross-compared), but I am not certain to what extent this covers the different types of ML/DL performance metrics.

My reply: I’d think that just about any performance metric would work, but since you get to choose, I suggest you choose something that makes sense for your application. I don’t like things like AUC-ROC because they seem more like mathematical definitions with no clear applied interpretation. So, I’m not really offering an answer to Lee’s question; I’m just suggesting a direction for him to go.
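
For concreteness, here is a minimal sketch of the conversions Lee asks about. It is a sketch under assumptions, not a recommendation: the AUC-to-d step assumes a binormal equal-variance model, the d-to-r step assumes equal group sizes, and the MSE-to-r step assumes roughly unbiased predictions and requires the test-set variance of the outcome. The function names are just illustrative.

```python
# Rough metric-to-r conversions, under the assumptions stated above.
from math import sqrt
from scipy.stats import norm

def auc_to_r(auc):
    """AUC -> Cohen's d (binormal model) -> point-biserial r (equal groups)."""
    d = sqrt(2) * norm.ppf(auc)   # d = sqrt(2) * Phi^{-1}(AUC)
    return d / sqrt(d**2 + 4)     # r = d / sqrt(d^2 + 4)

def mse_to_r(mse, var_y):
    """MSE -> r via R^2 = 1 - MSE/Var(y); only sensible when MSE <= Var(y)."""
    r2 = 1 - mse / var_y
    return sqrt(r2) if r2 > 0 else 0.0

print(round(auc_to_r(0.75), 2))                 # ~0.43
print(round(mse_to_r(mse=0.64, var_y=1.0), 2))  # 0.6
```

The point is less that these conversions are exact than that each one needs extra information (group proportions, Var(y)) that the original papers may or may not report, which is a theme in the comments below.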

9 thoughts on “Different effect sizes used in statistics/psychometrics or machine learning; what should this researcher do?”

  1. There is no theoretical reason, much less a requirement, for using only one effect size metric. Different metrics tell different stories. Given that, why not use multiple metrics?

  2. It occurs to me that the essence of Cohen’s _d_ is (Y-estimate − Y-actual)/scale factor. (You could square this to get rid of the sign and summarize it across the test sample.)

    The model provides Y-estimate and Y-actual comes from the test sample, so the key is to pick an appropriate scale factor.

    SD(Y) might be a good starting point (although Dylan Wiliam argues that for educational measures, the average growth over one year might be a better scale).

    Then again, from what I hear from my colleagues in meta-analysis, the tricky part is often that the papers do not give all the information needed to calculate the effect size. If you are interested in continuous prediction problems, maybe scaling the r.m.s.e. on the test set would work.
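
    To make that concrete, here is a minimal sketch of the scaled test-set error described above (the data and variable names are made up; any fitted model producing predictions would do):

    ```python
    # (Y-estimate - Y-actual) / scale factor, squared and summarized over the
    # test set as RMSE / SD(y). Swap SD(y) for one year's average growth if
    # you prefer the educational-measurement scaling mentioned above.
    import numpy as np

    def scaled_rmse(y_true, y_pred):
        """Root-mean-square prediction error, scaled by the SD of the outcome."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
        return rmse / np.std(y_true)

    rng = np.random.default_rng(0)
    y = rng.normal(size=200)                     # test-set outcomes
    yhat = y + rng.normal(scale=0.6, size=200)   # predictions with error SD 0.6
    print(round(scaled_rmse(y, yhat), 2))        # roughly 0.6
    ```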

  3. Another challenge is that the same measure might not be comparable between studies, since they likely use different kinds of samples. Standardized effect size measures depend on the variation in the outcome (e.g., measures of personality traits), and this can be quite different between the samples used by personality psychologists and by computer scientists.
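
    A quick simulation of that point, with purely illustrative numbers: the same absolute prediction error gives a noticeably smaller r when the sample covers a narrower range of the trait.

    ```python
    # Identical prediction error, different outcome variation: r shrinks in
    # the range-restricted sample even though the predictions are no worse.
    import numpy as np

    rng = np.random.default_rng(1)
    trait = rng.normal(size=100_000)                       # true personality score
    pred = trait + rng.normal(scale=0.8, size=trait.size)  # noisy prediction

    full_r = np.corrcoef(trait, pred)[0, 1]

    restricted = np.abs(trait) < 0.7                       # a homogeneous sample
    restr_r = np.corrcoef(trait[restricted], pred[restricted])[0, 1]

    print(round(full_r, 2), round(restr_r, 2))             # about 0.78 vs 0.44
    ```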

  4. One of the nice things about machine learning is that there are a variety of standard data sets available, and it’s pretty common for different techniques to be tried on these data sets. (That’s a typical thing done in, say, a predictive analytics course.)

    Given that, it would seem likely that there’s some study somewhere of how different effect size measures compare across a wide variety of these standard data sets, although these are not likely to be personality related. (Surely, this would be a nice dissertation topic for somebody.)

    So, if we imagine a matrix with effect size measures across the top and studies as rows, this MIGHT provide SOME ability to impute the effect size measures that couldn’t be directly calculated from the data available.

    I’m with Donald T’s comment above, that different effect size measures tell somewhat different stories. The search for one perfect measure is futile.

  5. If every statistics problem is a decision problem, asking how to compare standard psychometric inventories to fancier ML algorithms on raw data sounds to me like asking “how many items is this fancy algorithm equivalent to, and can I scrap the expensive inventory in favor of the cheap data?” So pick a baseline inventory and make that the unit. “This ML algorithm is equivalent to 1.5x Short Big Five inventories.” Sounds decision-relevant to me! (Or if you don’t like that, express it in terms of items: “this ML algorithm for Openness is worth 19.5 Likert items”, say. One way to make this concrete is sketched after this thread.)

    • Gwern:

      I don’t think every statistics problem is a decision problem (except in the empty sense that everything is a “decision problem” in that you have to decide what to do next, or whether to continue). Decision making is one application of statistics, not the only one.
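
    One hypothetical way to cash out the “worth N items” framing above, assuming classical test theory applies and treating the ML score’s squared correlation with the trait as if it were a reliability (all numbers are made up):

    ```python
    # Spearman-Brown, solved for the lengthening factor n:
    #   rho_n = n*rho_1 / (1 + (n-1)*rho_1)
    #   =>  n = rho_n*(1 - rho_1) / (rho_1*(1 - rho_n))

    def equivalent_items(target_reliability, single_item_reliability):
        rho_1, rho_n = single_item_reliability, target_reliability
        return rho_n * (1 - rho_1) / (rho_1 * (1 - rho_n))

    # Say the ML predictor for Openness correlates 0.80 with the trait
    # (reliability proxy 0.80**2 = 0.64) and a single Likert item has
    # reliability 0.15:
    print(round(equivalent_items(0.80**2, 0.15), 1))   # about 10.1 items
    ```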

  6. AUC-ROC (discrimination or concordance) has a pretty intuitive interpretation once you get away from viewing the actual ROC curve (which I don’t find particularly useful in medical statistics). It can be interpreted as the probability that a randomly chosen case from one group has a higher score than a randomly chosen case from the other group (see the sketch after this thread).

    • Anon:

      That makes sense. But it’s not clear to me why this probability is of interest. To put it another way: given that AUC-ROC exists and is used by people, it makes sense to understand it in this way, but I don’t see why there’d be much interest in it from first principles.

      That said, one could say the same thing about root mean squared error, for example, and I don’t mind using that measure. So I guess a lot of this has to do with what we’re used to seeing.

      • The appeal of this nonparametric measure, for me, often arises with quality-of-life instruments, which have no real meaning attached to the score. In that case, when comparing, say, groups in a clinical trial, it makes sense to want to express how much benefit the treatment has in terms of a probability, without over-interpreting the scale scores. It’s certainly not the only way to view such a situation, but it can make sense.
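
    A quick numerical check of the concordance reading mentioned at the top of this thread, comparing the all-pairs probability with scikit-learn’s AUC on simulated data (assumes scikit-learn is available):

    ```python
    # AUC equals the probability that a randomly chosen positive case scores
    # higher than a randomly chosen negative case (ties counted as one half).
    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(2)
    y = rng.integers(0, 2, size=500)                 # binary outcomes
    score = y + rng.normal(scale=1.5, size=y.size)   # noisy risk scores

    pos, neg = score[y == 1], score[y == 0]
    diffs = pos[:, None] - neg[None, :]              # all positive/negative pairs
    concordance = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

    print(round(concordance, 3), round(roc_auc_score(y, score), 3))  # should agree
    ```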
