Data concerns when interpreting comparisons of gender equality between countries

A journalist pointed me to this research article, “Gender equality and sex differences in personality: evidence from a large, multi-national sample,” by Tim Kaiser (see also news report by Angela Lashbrook here), which states:

A large, multinational (N = 926,383) dataset was used to examine sex differences in Big Five facet scores for 70 countries. Difference scores were aggregated to a multivariate effect size (Mahalanobis’ D). . . .

Countries’ difference scores were related to an index of gender equality, revealing a positive weighted correlation of r = .39 . . .

Using multivariate effect sizes derived from latent scores with invariance constraints, the study of sex differences in personality becomes more robust und replicable. Sex differences in personality should not be interpreted as results of unequal treatment . . .

The journalist wrote, “This study found that as gender equality increases, so do gender differences. Have you seen evidence of this in research, including your own research?”

I replied as follows:

I have not worked in this area of gender equality myself, so I can’t say I’ve seen this pattern, or not seen it, in my own research. This particular study linked above has a lot of details; the main pattern appears in its figure 2. I have no sense of why this “Mahalanobis D” number is so much higher in Latvia than Iran, but I could well imagine this could be an artifact of survey responses. In general, if the survey responses are noisier, you’d expect a measure such as this D to be closer to zero. If you look in the paper, you’ll see that D is based on various personality inventories, and I could well imagine that these responses would have different interpretations in Latvia, Iceland, and Finland than in Iran, Mexico, and Jamaica. In addition, it says in the paper that “the assessment procedures have selected for English language subjects with Internet access.” This would seem to destroy all their interpretations of the results when comparing countries. So I would be wary of taking these results too seriously. That said, there’s nothing wrong with speculation, as long as it is clearly labeled as such.
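To make the point about noise concrete, here is a minimal sketch for the single-variable case. It rests on assumptions that are mine, not the paper's: the observed response is a true score plus independent noise, the noise has the same variance in both groups, and "reliability" means the share of observed variance that is true-score variance. Under those assumptions the mean difference is unchanged while the standard deviation grows, so the standardized difference shrinks by the square root of the reliability. The function name is hypothetical.

```python
import math

def attenuated_d(true_d, reliability):
    """Standardized difference after adding independent measurement noise.

    Assumes observed = true score + noise, with reliability defined as
    var(true) / (var(true) + var(noise)). The mean difference stays the same
    but the standard deviation is inflated, so d shrinks by sqrt(reliability).
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return true_d * math.sqrt(reliability)

# Same underlying difference, two hypothetical levels of survey noise.
clean = attenuated_d(1.0, 1.0)   # no measurement noise
noisy = attenuated_d(1.0, 0.64)  # 36% of observed variance is noise
print(clean, noisy)
```

If response quality differs across countries, this alone could produce different observed effect sizes even when the underlying differences are identical.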

Finally, I’m skeptical about the following claim made in this paper: “The degree to which a society allows individuals to express biological gender differences can vary. If a society ensures that men and women have exactly the same access to all resources that this society has to offer, the biological factors could be expressed more strongly than in more repressive societies. A stronger sexual dimorphism should therefore be seen more as an expression of a successful gender policy.” I don’t see how the results in their paper—even if you ignore any potential data issues and assume the survey responses have the same meanings in each country—lead to this conclusion.

Here’s another line from the paper: “The results presented suggest that greater sexual dimorphism should not be interpreted as an indicator of a society that discriminates against a particular sex, but rather as an indicator of a successful gender equality policy.” I don’t understand this at all. Even taking the data at face value, they have two measures: D (the measure of sex differences in personality) and GGGI (the measure of lack of sex differences in outcomes in the country). D is weakly correlated with GGGI. Fine. But then a high value of D is a measure of a high value of D; it’s not an indicator of GGGI. For that matter, GGGI is not a measure of “policy” either.

I am also concerned about some of the details in the paper. For example, on page 7, Antarctica, Puerto Rico, and Andorra are listed as countries. Andorra, maybe, although I can’t imagine we’d learn much from such an odd case. Puerto Rico is of course not a country, and Antarctica even less so.

Anyway, my quick reaction is that it’s a good step for these findings to be published but I think they are being way overinterpreted.

I sent the above comments to Laurie Rudman, a researcher who works in related areas, and she agreed that assessing and analyzing cross-cultural data can be difficult. Rudman pointed me to this paper, “Mind the level: problems with two recent nation-level analyses in psychology,” by Toon Kuppens and Thomas Pollet, which raises some of these issues (although not with the Kaiser paper discussed above).

P.S. The journalist asked for some clarifications, so I added these points:

1. What is the Mahalanobis D? When comparing two groups (for example, male and female survey respondents) on a single variable (for example, height), you can just take the difference and report, say, that on average men are one standard deviation taller than women, or whatever it is. When comparing two groups on multiple variables (for example, several different personality assessment survey responses), you can construct a “multivariate distance measure.” Mahalanobis D is one such measure. Bigger numbers correspond to larger average differences between men and women in whatever variables are measured. In the above-linked paper, my issue with the Mahalanobis D was not how it was defined, but rather with the data used to compute it. I’m concerned (a) that the responses to the survey questions will have different meanings in different countries, (b) that there will be nonresponse bias due to the overrepresentation of English-speaking internet users, and (c) that these issues will be correlated with key variables in the study.

2. Regarding the point about biological differences: Part of my concern is that I don’t really know what is meant by “biological gender differences” in this context. My other concern is that the conclusions of the paper relate to things not measured in the paper. For example, the paper refers to discrimination, but there’s nothing in the data about discrimination. And the paper refers to a gender equality policy, but there’s nothing in the paper about gender equality policies. Just in general, I’m wary about conclusions that don’t directly connect to the data.
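For readers who want to see the Mahalanobis D from point 1 written out, here is a minimal sketch using the textbook definition: the square root of d' S^{-1} d, where d is the vector of mean differences between the two groups and S is the pooled within-group covariance matrix. This is not the paper's procedure (which works with latent scores under invariance constraints); the function names and the toy numbers are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented copy so the caller's inputs are not modified.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x

def mahalanobis_d(mean_diff, pooled_cov):
    """Mahalanobis D = sqrt(d' S^{-1} d) for mean differences d and pooled covariance S."""
    x = solve_linear(pooled_cov, mean_diff)
    return math.sqrt(sum(d * xi for d, xi in zip(mean_diff, x)))

# One variable: a difference of 1 unit with a standard deviation of 2 units.
print(mahalanobis_d([1.0], [[4.0]]))
# Two standardized variables, each differing by half a standard deviation.
print(mahalanobis_d([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))  # uncorrelated
print(mahalanobis_d([0.5, 0.5], [[1.0, 0.5], [0.5, 1.0]]))  # correlated 0.5
```

With a single variable this reduces to the usual standardized difference. With several variables, D depends on the correlations among them as well as on the individual differences, which is one reason it is hard to compare across datasets whose measurement quality differs.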

18 thoughts on “Data concerns when interpreting comparisons of gender equality between countries”

  1. The post uses “gender” and “sex” pretty much interchangeably. (Differences are sometimes called “gender differences” and sometimes “sex differences”.) How do you decide when to use one word and when to use the other?

    • Terry:

      I usually will use the word “sex” rather than “gender” in such settings. In the above post, I used “gender” because that was the word used in the paper I was discussing.

  2. Dear Andrew,

    I think it’s very important to consider every analysis in its relevant context, which I am not sure you have sufficiently done here.
    It is interesting that gender differences in personality consistently replicate throughout the world, right?

    Now, there’s an important question of why this occurs. Many people would argue that these differences emerge because of societal factors such as traditionalism, economic equality and such.
    We know that different societies vary quite a lot in the extent to which they are progressive.
    So it does make a lot of sense to see whether the following assertion is correct:

    (1) “Difference in gender changes across countries, such that more liberal countries show less of a difference in personality between men and women”

    I don’t think it would be accurate to call the current study exploratory, because it is basically a replication which fixes slight problems in previous research (although it introduces new problems).

    I think it does provide reasonable evidence against assertion (1) above, don’t you?
    One might say that claiming that the reverse is true (more liberal->more gender gap) is pushing it a bit–but in the broader context of the discussion, this is actually not the key element.

    The paper definitely has some meaningful shortcomings, addresses some of the shortcomings of the previous research, and so on.
    But I think it should meaningfully change your degree of belief regarding Assertion 1 above.

    • Michael:

      What you say makes sense, and I agree that it’s unsurprising, but still interesting, that gender differences in personality surveys replicate around the world. As I wrote in my post, I think it’s a good step for these findings to be published. My concern, as discussed above, is with the cross-national comparisons (issues of interpretation of the survey responses and of representativeness of the samples), and with over-interpretation of the results more generally (including statements in the conclusion that don’t connect to any data in the paper).

    • Michael,

      I am far less convinced of the claim you make given the data and methods used. Whenever I see someone take parameter estimates from a Structural Equation model and analyze them with a univariate method, my immediate assumption is that the results were not what the researchers wanted to find when modeling the item level data and so they pushed on until they found a method that demonstrated what they wanted to see. I also get quite suspicious when the p-value is not reported for the chi-square goodness of fit, when the rest of the paper is grounded in threshold cutoff decisions.

      Finally, the claim that this correlation was based on 900k+ is simply not true. It was based on an n = ~30 (whatever the number of countries was) and the correlation was driven by two countries in the upper right corner away from the big circle blob of countries in the center of the graph.
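To make the unit-of-analysis point concrete, here is a minimal sketch of a weighted Pearson correlation across countries. The weighting scheme (each country weighted by its sample size) and the function name are assumptions for illustration, and the toy numbers are invented, not taken from the paper. Whatever the weights, the correlation is still computed over one point per country, so the number of data points is the number of countries, not the number of respondents.

```python
import math

def weighted_corr(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    return cov / math.sqrt(var_x * var_y)

# Three hypothetical countries: equality index, difference score, sample size.
equality = [0.0, 1.0, 2.0]
difference = [0.0, 1.0, 0.0]
print(weighted_corr(equality, difference, [1, 1, 1]))  # equal weights
print(weighted_corr(equality, difference, [1, 1, 0]))  # third country dropped
```

In this toy case, changing the weight on a single country moves the correlation from zero to one, which is the kind of sensitivity worth checking when a few countries sit far from the rest.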

  3. Haven’t read this article, but many personality researchers think that intercultural comparisons of self-report scores are very difficult to do because of the reference-group problem (i.e. when making personality ratings, people tend to compare themselves to other people in their own culture/reference group, not to everyone in the world, or to a “global average person”). For instance, there is at least one large multicultural personality dataset in which the Japanese are the least conscientious of all nationalities, and Finns are more extraverted than Americans, and there are other similar results that go against anthropological evidence and the experiences of people who have lived in different countries.

    It has been suggested that the gender difference thing might be because of a variant of the reference-group problem, namely, that in less equal countries women tend to compare themselves only to other women and men only to other men, whereas in more equal countries, everyone compares themselves to everyone. I’m not sure if anyone has looked into this more closely.

    • I think personality researchers are right. Prejudice is a complex phenomenon. It creeps into intercultural comparisons, more generally. Being ensconced in social science and even NGO communities, I saw this firsthand.

      I have an Asian background, although I have lived nearly all of my life in the US. I would say that I was raised to compare myself to men rather than women. That could be because I spent so much of my life with my Dad’s colleagues, too, particularly in Madison, Wisconsin, as a teen. There, a Univ. of Wisconsin professor, Richard Robinson, enabled me to attend some very interesting political science and law seminars occasionally. These fields were dominated by male academics. The University of Wisconsin’s political science and South Asia departments were hubs that shaped our foreign policies to a large extent. Richard Cheney was there when my father taught there too.

      The elites in developing countries have been cultivating intellectual female leadership for a couple of centuries at least. Subsets have attended British and American universities since the 40’s and 50’s. These subsets were then cultivated to take leadership positions. I doubt that they, at least, compare themselves to women or even men then.

      As for whether women in ‘equal’ countries compare themselves to everyone, it depends on the situation. Typically mentorship of academic women has been through men, in my experience. I would argue that Gordon Allport, Elliot Richardson, and Kingman Brewster played an underestimated role in encouraging female leadership. Then again, they had unique family backgrounds.

      • @Sameera

        So, I’m not understanding the point. Doesn’t who you compare yourself to depend on who’s your peer set?

        Say, you are surrounded by men or are in a profession where the greatest work was predominantly done by males, isn’t it just natural that you compare yourself to men?

        But how is that indicative of prejudice?

  4. Hi Andrew,

    thanks for sharing your concerns with my preprint. As a PhD student, I am not used to the media attention this particular article is getting. All I can say is that despite all the criticism, I feel honored that it is discussed in this blog. I think your comments are foreshadowing some of the upcoming comments in peer-review. Some of the results were probably expressed in somewhat strong language and the interpretations sometimes extrapolate very strongly. I will surely reconsider those in the revised version of the article. The scatter plot in the article is somewhat misleading. The statement is based on a correlation weighted according to the size of the subgroups, which is not visible at all in the plot.

    I’m not sure I can follow some of the points. For example, I don’t understand why you think D is based on “different personality inventories”. All data is based on subscales from the same inventory, the IPIP-120. Your point concerning cultural differences in interpretation is understandable. An Iranian woman will probably interpret the item “I feel comfortable around people” differently than a Danish man. Yes, comparing different cultures has its problems, some of which limit the informative value of the data – I decided to do it anyway.

    • Tim:

      Thanks for the comment. As I wrote above, I think it’s good that you’re publishing these results: Get it out there, open to discussion, and that’s the way to move forward.

  5. ‘Finally, I’m skeptical about the following claim made in this paper: “The degree to which a society allows individuals to express biological gender differences can vary. If a society ensures that men and women have exactly the same access to all resources that this society has to offer, the biological factors could be expressed more strongly than in more repressive societies. A stronger sexual dimorphism should therefore be seen more as an expression of a successful gender policy.” I don’t see how the results in their paper—even if you ignore any potential data issues and assume the survey responses have the same meanings in each country—lead to this conclusion.‘

    The “biological sex differences will be stronger in the most equal-opportunity societies” claim has become extremely popular in alt-right-ish internet discussions of societal sex differences in income, profession, etc., often among the same people who advocate for genetic differences in intelligence between races, etc., and who then use these claimed factoids to argue against current attempts to remedy inequalities. The usual thing that is cited is that most nurses are women even in very egalitarian places like Scandinavia. I don’t know if there has ever been a rigorous attempt to study the question – it looks like this paper isn’t it.

    • This is not an “alt right” claim, this is the explanation I found in the current evolutionary psychology literature. If you are interested, the article quotes some of the previous studies that speak for this effect. Some results of this study challenge the previous evolutionary psychological assumptions about the effects of equality. Those who read the study would also find that not only the “stereotypical female” personality traits correlate with the GGGI, but also the facets “achievement-striving” and “assertiveness”. I don’t know the Norwegian Nurse study.

      Scepticism about this study is something I very much welcome. The implication that I serve politically extreme camps with it, however, is not.

  6. >>>This study found that as gender equality increases, so do gender differences.<<<

    Isn't this an oxymoron? What does it really mean? How can equality & differences both increase? Am I getting the definitions wrong?

    • Rahul:

      I do think there’s some incoherence in the arguments I’ve seen presented in this area. Roughly speaking, there are five sorts of country-level variables to look at:

      1. Policies and customs: These could include laws on women’s equality, abortion, child support, etc., as well as the prevalence of woman-friendly private-sector policies such as maternity leave and laws that restrict the clothing women can wear in public, etc. Also some measure of the left-right position of the government (although that could go in item 4 below, depending on whether we’re thinking of the government’s political stance as affecting policy or as a measure of the country’s political culture).

      2. Health outcomes: Life expectancy is tricky, though. Is equality in life expectancy a sign of equality, or a sign that things are really bad for women, if they don’t have their usual several years advantage compared to men?

      3. Sociological and economic outcomes: Labor force participation for men and women, proportions of women going on to higher education, courses of study at university, etc.

      4. Social and political attitudes: Opinions of men and women from surveys on attitudes regarding women’s equality, gay rights, abortion, the role of women in the workplace and in politics, etc. There are two ways to summarize this: First, how much do the people in the country support ideals of equality between the sexes; Second, how much do attitudes of men and women differ?

      5. Personality surveys: Examples would be the personality inventories used in the study described above.

      Adding to the mess is that countries change over time, and there are also variables such as the wealth of the country, its geographical location, and its ethnic composition, all of which seem relevant but are not directly captured in any of the measures above.

      In any case, if all five of the above sorts of measures were positively correlated, all would be clear: Countries with policies favoring women’s equality have more liberal politics, better health outcomes for women, more equality in social and economic outcomes, more support for women’s equality, and smaller differences between men and women in issue attitudes and personality measurements.

      But, to the extent these have been studied, it appears that the relevant cross-country correlations are not uniformly positive. For example, the statement provided by Nick above, “most nurses are women even in very egalitarian places like Scandinavia,” sounds like evidence in favor of a zero or negative correlation between items 1 and 3 in my list. The statement here that “the countries that minted the most female college graduates in fields like science, engineering, or math were also some of the least gender-equal countries” represents a negative correlation between items 3 and 4, or perhaps within different categories of item 3.

      The challenge here is that the five items above can’t all be highly negatively correlated with each other—and, in any case, we wouldn’t expect them to be. Rather, there will be lots of the expected positive correlations, with occasional negative correlations that are newsworthy.
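A short worked check of why five measures can't all be strongly negatively correlated, in the simplified case (an assumption made only for illustration) where every pair has the same correlation rho. The variance of the sum of k standardized variables is k + k(k-1)rho, and a variance can't be negative, so rho can be no lower than -1/(k-1). The function names are hypothetical.

```python
def variance_of_sum(k, rho):
    """Variance of the sum of k standardized variables with common pairwise correlation rho."""
    return k + k * (k - 1) * rho

def min_common_correlation(k):
    """Lowest common pairwise correlation that keeps variance_of_sum non-negative."""
    return -1.0 / (k - 1)

k = 5
bound = min_common_correlation(k)
print(bound)                       # lower bound on a common correlation for five measures
print(variance_of_sum(k, bound))   # zero variance exactly at the bound
print(variance_of_sum(k, -0.5))    # negative, so a common correlation of -0.5 is impossible
```

So with five measures, a common pairwise correlation below -0.25 cannot occur, which is consistent with expecting mostly positive correlations and only occasional negative ones.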

      • Isn’t this some sort of methodological overreach?

        Beyond a point, there’s no way to correct for noise using algorithmic prowess. Nor will a million data points save you if all you’ve got is a highly inaccurate, unreliable and non-repeatable yardstick.

        E.g., you cannot have a noisy, inaccurate thermometer with a least count of 10 C and try to build on those measurements a theory of cryogenic helium or low-temperature superconductivity where microkelvin temperature changes are the relevant scale.

        Till we can resolve the measurement mess isn’t this sort of thing too ambitious a target to make any claims about? Aren’t people attempting to make claims about problems that are grossly a mismatch for the power and abilities of the tools they have?

        • Rahul:

          I don’t think it’s a methodological overreach, so much as a common confusion about measurement. It’s pretty common to see some measurement taken as a proxy of some underlying property, without much thought of measurement issues. We saw this in the “North Carolina is not a democracy” survey and in lots of other examples, and it’s coming up here too.
