
You’ve got data on 35 countries, but it’s really just N=3 groups.

Jon Baron points to a recent article, “Societal inequalities amplify gender gaps in math,” by Thomas Breda, Elyès Jouini, and Clotilde Napp (supplementary materials here), and writes:

A particular issue bothers me whenever I read studies like this, which use nations as the unit of analysis and then make some inference from correlations across nations. And I suspect that the answer to my concern is well known (but maybe not to the authors of these studies, and not to me) and you could just say what it is.

My concern is this. The results here are based on correlations across 35 nations. But, if you look at the supplement, you will see that these nations fall into distinct groups. A whole bunch of them are “northern European.” These countries are related to each other in many ways, including geography, history, and culture. It seems to me that these kinds of relationships reduce the true number of observations to some number much smaller than 35. The whole result in the paper could come from the fact that this particular culture happens to have two features: low inequality and high emphasis on women’s education. In reality, these two features need not have any relationship. (I suspect that Japan and China would help break that correlation, for example.)

To take an extreme example, suppose you had a sample consisting of all the countries of Europe and all the countries of Africa. That would be quite a few countries. And, within that total sample, you could find lots of highly significant correlations. But this would clearly be an N of 2, for all practical purposes.

Or suppose we count all the U.S. states as if they were separate “states” (in the sense of nations). Why not? Surely Massachusetts and Alabama are no more similar than Germany and Norway.

I can’t think of a purely statistical way to solve this problem. But that doesn’t mean there isn’t one.

I replied with a link to this discussion from a few years ago on that controversial claim that high genetic diversity, or low genetic diversity, is bad for the economy, in which I wrote:

Two economics professors, Quamrul Ashraf and Oded Galor, wrote a paper, “The Out of Africa Hypothesis, Human Genetic Diversity, and Comparative Economic Development,” that is scheduled to appear in the American Economic Review. . . . Ashraf and Galor have, however, been somewhat lucky in their enemies, in that they’ve been attacked by a bunch of anthropologists who have criticized them on political as well as scientific grounds. This gives the pair of economists the scientific and even moral high ground, in that they can feel that, unlike their antagonists, they are the true scholars, the ones pursuing truth wherever it leads them, letting the chips fall where they may.

The real issue for me is that the chips aren’t quite falling the way Ashraf and Galor think they are. . . .

The way to go is to start with the big pattern they noticed: the most genetically diverse countries (according to their measure) are in east Africa, and they’re poor. The least genetically diverse countries are remote undeveloped places like Bolivia and are pretty poor. Industrialized countries are not so remote (thus they have some diversity) but they’re not filled with east Africans (thus they’re not extremely genetically diverse). From there, you can look at various subsets of the data and perform various side analyses, as the authors indeed do for much of their paper.

And this post from a couple years later, where I wrote:

I continue to think that Ashraf and Galor’s paper is essentially an analysis of three data points (sub-Saharan Africa, remote Andean countries and Eurasia). It offered little more than the already-known stylized fact that sub-Saharan African countries are very poor, Amerindian countries are somewhat poor, and countries with Eurasians and their descendants tend to have middle or high incomes.

I asked if this was helpful.

Baron replied:

Yes and no. The second reference seems to state the problem, but in a way that is specific to this case. Yet I see this sort of thing all the time. I’m not concerned about the direction of causality. That is a separate problem. I think that the correlations across countries are highly deceptive when the countries fall into groups of very similar countries. When correlations are deceptive in this way, they are not useful for inferring any sort of causality, even if we can infer its direction from other considerations.

In assessing the reliability of a correlation (by any method) it helps when N is higher. With an N of 35, a correlation can be clearly significant (for example) when the same correlation would be nowhere close to significant if the N is 3. Yet, if 35 countries fall into 3 groups of highly similar countries, the effective N is more like 3 than 35. Even worse, if the countries fall into two groups, you cannot even compute a true correlation at all. It only appears that you can when you use country as the unit of analysis. In the present study, this problem is exacerbated because the sampling of countries used, out of the population of all countries, is not at all random.
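The effective-N point can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the original discussion: 35 hypothetical countries split into 3 groups of sizes 12, 12, and 11, group-level values drawn independently for x and y (so there is no true relationship), and within-group noise with standard deviation 0.3. It counts how often the usual test for a correlation, which treats the 35 countries as independent, declares significance at the 5% level.

```python
import numpy as np

def naive_rejection_rate(clustered, n_sims=2000, seed=0):
    """Share of simulations where a naive N=35 correlation test rejects at 5%."""
    rng = np.random.default_rng(seed)
    group = np.repeat([0, 1, 2], [12, 12, 11])  # assumed grouping of 35 countries
    n = group.size
    t_crit = 2.0345  # two-sided 5% critical value of Student's t with 33 df
    rejections = 0
    for _ in range(n_sims):
        if clustered:
            # x and y share nothing except the group structure (assumed noise sd 0.3)
            x = rng.normal(size=3)[group] + 0.3 * rng.normal(size=n)
            y = rng.normal(size=3)[group] + 0.3 * rng.normal(size=n)
        else:
            x = rng.normal(size=n)
            y = rng.normal(size=n)
        r = np.corrcoef(x, y)[0, 1]
        t = r * np.sqrt((n - 2) / (1 - r**2))
        rejections += int(abs(t) > t_crit)
    return rejections / n_sims

rate_independent = naive_rejection_rate(clustered=False)
rate_clustered = naive_rejection_rate(clustered=True)
print(f"independent countries: {rate_independent:.3f}")
print(f"3 tight groups:        {rate_clustered:.3f}")
```

With independent countries the rejection rate should sit near the nominal 5%; with three tight groups it should be far higher, even though x and y are unrelated by construction.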

The same problem occurs, however, when analyzing U.S. states, in studies that look at the entire population of states. Many of these correlations arise because states fall into groups: Confederacy, New England plus West Coast, flyover country. For example, I’m sure that “average humidity” correlates with “percent of evangelical Christians” across states, but that is really the result of the Confederacy alone, hence a historical accident.

I guess an analogous problem occurs with time series. If “year” is the unit of analysis, you can get a nice correlation that is really the result of two linear trends over some period of time. (You can even use “day” and have a huge N.) I think this problem has been solved statistically. But I don’t know of any way of solving the problem of what might be called spatial or multi-dimensional similarity.
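The time-series version of the problem is easy to reproduce. The sketch below uses made-up numbers (400 time points, slopes of 0.05 and 0.03, unit-variance noise) to generate two series that share nothing but a linear trend, then compares the raw correlation with the correlation after removing a fitted linear trend from each series. Detrending is one standard remedy, shown here only as an illustration, not as the specific solution Baron has in mind.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(400)

# Two unrelated series; each is a linear trend plus independent noise.
x = 0.05 * t + rng.normal(size=t.size)
y = 0.03 * t + rng.normal(size=t.size)

def detrend(series, time):
    """Subtract the least-squares straight line fitted against time."""
    slope, intercept = np.polyfit(time, series, 1)
    return series - (slope * time + intercept)

r_raw = np.corrcoef(x, y)[0, 1]
r_detrended = np.corrcoef(detrend(x, t), detrend(y, t))[0, 1]

print(f"raw correlation:       {r_raw:.2f}")
print(f"detrended correlation: {r_detrended:.2f}")
```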

I don’t know that I’d say that the high percentage of evangelical Christians in the southern U.S. is a result of the Confederacy (maybe we’d still see it if those states had never seceded and that war had never happened), but that’s not really the point here.

My response to Baron’s question is that you can deal with this sort of clustering by fitting a multilevel model including indicators for group as well as country. That said, this won’t solve the whole problem. As always, inferences depend on the specification of the model. In particular, including group indicators in your regression won’t necessarily resolve the problem. Ultimately I think you have to go to more careful models. For example, if you are comparing what’s going on in 35 countries, but they’re all in 3 groups, you might want to separately do analyses between and within groups.
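As a rough illustration of doing analyses separately between and within groups, here is a sketch that splits a pooled relationship into a between-group slope (regressing group means on group means) and a within-group slope (after subtracting each group's mean). It is not a multilevel model, and the function name and toy data are hypothetical; the toy data are built so that the two slopes disagree in sign.

```python
import numpy as np

def between_within_slopes(x, y, group):
    """Return (between-group slope, within-group slope) of y on x."""
    x, y, group = (np.asarray(a) for a in (x, y, group))
    labels = np.unique(group)
    x_means = np.array([x[group == g].mean() for g in labels])
    y_means = np.array([y[group == g].mean() for g in labels])

    # Between: least-squares slope through the group means (unweighted).
    between = np.polyfit(x_means, y_means, 1)[0]

    # Within: pooled slope after removing each group's own mean.
    idx = np.searchsorted(labels, group)
    xw = x - x_means[idx]
    yw = y - y_means[idx]
    within = (xw @ yw) / (xw @ xw)
    return between, within

# Hypothetical toy data: 3 groups of 3 "countries" each.
group = np.repeat(["A", "B", "C"], 3)
x = np.array([-1, 0, 1, 4, 5, 6, 9, 10, 11], dtype=float)
y = np.array([1, 0, -1, 6, 5, 4, 11, 10, 9], dtype=float)

between, within = between_within_slopes(x, y, group)
pooled = np.polyfit(x, y, 1)[0]
print(f"pooled {pooled:.2f}, between {between:.2f}, within {within:.2f}")
```

In this toy case the pooled slope is driven almost entirely by the differences between groups, which is the pattern a country-level correlation can hide.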


  1. Jonathan says:

    Just to note it’s an absurdity to make such a statement about Evangelical Protestantism and the Confederacy or Jim Crow, etc. It correlates but in the same way that Catholic correlates with MA: there are a lot of Irish in MA. Denominational differences between parts of the country exist. Some of these follow historical migration patterns and others reflect the growth of denominations within a region. I feel ridiculous having to mention this.

    In terms of model division, I saw an op-ed by Michael Porter yesterday that cited a ‘rigorous’ social index. Is such a thing possible? I’d say no but when he cherry picks murder rate, then I know that’s a bullshit op-ed. People treat these issues as though they’re epidemiology, as though ‘determining’ penetration rates in specific populations of virulent diseases with relatively known infection rates – depending on exposure modeling, etc. – is the same thing as taking something buried way below the surface, like some measure of genetic diversity, and applying that to a population. You can see a rough connection: some groups are perhaps more prone to certain infections given certain other factors, like the way it appears HIV spreads in Africa depending on rates of already existing infections (meaning some of the research shows a form of opportunism). But in general? Treading on Wansink territory.

  2. For all the discussions we’ve had about foundational Bayes vs Frequentist ideas etc, the real thing that gets me fired up about using Bayesian analysis is the fact that I can carry out an analysis regardless of what my model is like (computation issues aside). So in an analysis of social inequality and gender gaps in math or the like, I’d strongly recommend to stop thinking about the “statistics” (ie. traditional questions of power and choice of test and assumptions of normality and etc etc) and start thinking about the physical / mechanistic model of the phenomenon. Start with a textual description of what you think is going on causally:

    “Within each country, societal issues related to gender roles result in varying degrees of economic distinctions between genders, this leads to differences in expectation of life-trajectories, and that leads to teachers having different expectations for child achievement. The result is that teachers teach children different material or spend different amounts of time explaining different things to boys vs girls. In addition there may be some inherent differences in the mean or variance of inherent talent among the populations of boys and girls, and a difference in what each child perceives as valuable and worth spending time on or what is interesting. These feedback effects develop through time as children who achieve at something tend to do more of that thing and less of another thing. The net result is through time as children age there is a widening gender gap in achievement among various topics”

    Or whatever, that’s just some stylized idea of what people might think. But suppose it is what you think…Now start encoding that into a mathematical model.

    mathematically we have several different academic/school related topics, perhaps math, language, sports, music, etc. We have in each country some attitude about whether each topic is “more male” or “more female”, we have associated effort and encouragement by society for each gender, we have children’s perceived gender role and level of interest, we have etc etc etc. These are the parameters we use to describe the process.

    Next we need the process description:

    rate of improvement in skills related to topic X for each child is functionally related to the inputs that go into skill development, including encouragement, individual child interest, time spent by the child, availability of instruction in the topic… And country or societal level parameters determine some of the encouragement, and some of the availability…. and society level parameters are similar within groups of countries…. and then across the world different groups of countries have some similarities as well…

    In the end you’ll have a large model for worldwide educational variation across multiple topics in which there are thousands of parameters you are uncertain about. This is *the reality* of the problem. Now, because of these thousands of parameters, you’ll want to look for sources of data which can inform the quantities of interest: surveys of children’s interest, datasets on teacher populations: age, gender, subject they teach.. Data on spending in each country, time-series data tracking individual children’s achievement, time series data across different eras… whatever, each source of data is something you can potentially use to constrain the parameters within a given country, and thereby also constrain parameters within neighboring/similar countries, and thereby constrain parameters within continents… etc etc. But data won’t be uniformly available in all locations for all topics. So you’re going to have to work to provide reasonably well thought out priors for your parameters.

    Next you’ll say: gee that’s all well and good, but I don’t have any of that information right now, and I do have this one great dataset with 18 data points in it across 3 countries, how can I make progress so that I can get grants and tenure? That’s a huge hard problem you’ve just described, I’d much rather just grab some dataset and calculate some p values…

    And now we know why so little progress is made, because we’ve *institutionalized* non-science as if it were in fact the pinnacle of scientific achievement: knowledge from pretending everything comes out of random number generators and reified the idea that without really thinking about how things work very much we can just grab some small datasets and pretend that the data comes out of random number generators, and check to see if we can mathematically detect differences between RNG A and RNG B.

    • yyw says:

      Probably start small when facing such a highly complex question like this. Instead of an abstract inequality index of some sort, take a few policies implemented in recent decades and do within-country longitudinal analysis (not exactly a small problem). Then maybe go from there and expand to similar countries with somewhat different policies.

      • I agree that when it comes to doing analyses it can be useful to build up to more complexity. But I think it’s always a good idea to flesh out the theory you think best describes what’s going on before proceeding to analyses. It’s important for perspective and to help you identify how various sources of data might help; often there really is more than one or even more than a few relevant sources of data. Without some kind of broad-perspective model you won’t be capable of making good use of the available data, or of identifying the best data collection effort.

  3. Nikolai Vetr says:

    This is pretty directly analogous to the case of phylogenetic pseudoreplication (Felsenstein 1985), where independence is incorrectly assumed between observations on separate tips of a tree.

    I don’t think there are any good statistical models for how nations form (?), but another way to maybe accommodate spatial autocorrelation could be through gaussian process regression, with covariance matrix informed by pairwise geographic distances (through e.g. waypoints). Statistical Rethinking 13.4.1 has a nice walk through of how to fit this sort of model in Stan.
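A minimal sketch of the covariance-from-distances idea, assuming a common squared-exponential kernel and made-up coordinates for four hypothetical countries; the parameter names `eta`, `rho`, and `jitter` are illustrative choices, and this is not the Stan code from the book. It builds the covariance matrix and draws one set of correlated country-level effects from it.

```python
import numpy as np

def sq_exp_cov(dist, eta=1.0, rho=0.5, jitter=1e-6):
    """Squared-exponential covariance: eta^2 * exp(-(rho * d)^2), plus diagonal jitter."""
    dist = np.asarray(dist, dtype=float)
    return eta**2 * np.exp(-(rho * dist) ** 2) + jitter * np.eye(dist.shape[0])

# Hypothetical 2-D "locations": three neighbours and one far-away country.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 0.5], [8.0, 6.0]])

# Pairwise Euclidean distances between all locations.
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
K = sq_exp_cov(dist)

# One draw of country-level effects: nearby countries get similar values.
rng = np.random.default_rng(0)
effects = rng.multivariate_normal(np.zeros(len(coords)), K)
print(np.round(K, 3))
print(np.round(effects, 3))
```

In a full model these effects would enter the regression as varying intercepts, with `eta` and `rho` estimated rather than fixed.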

    A Chinese Restaurant model could also be used to average over possible partitions of the countries, if one is uncertain about clustering and unwilling to assume that relatedness between countries is influenced much by their geographic closeness.
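To make the partition idea concrete, here is a sketch that draws random partitions of 35 items from a Chinese Restaurant Process prior. The concentration value `alpha=1.0` is an arbitrary assumption. This samples only from the prior; averaging over partitions in an actual analysis would need a full model and a posterior sampler, which is not shown.

```python
import numpy as np

def sample_crp_partition(n_items, alpha, rng):
    """Draw cluster labels for n_items from a Chinese Restaurant Process prior."""
    labels = [0]   # the first item always starts cluster 0
    counts = [1]   # current size of each cluster
    for _ in range(1, n_items):
        # Join an existing cluster with probability proportional to its size,
        # or open a new one with probability proportional to alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels

rng = np.random.default_rng(0)
partition = sample_crp_partition(35, alpha=1.0, rng=rng)
n_clusters = len(set(partition))
print(partition)
print("clusters:", n_clusters)
```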

  4. gwern says:

    Using a MLM strikes me as not being as efficient as possible: we know these similarities decay with distance, whether temporal, physical, or genetic, and we can even construct the phylogenetic trees. Just lumping the countries together in a cluster ignores these differences which can be large or small, and provide purchase for regression on the residuals from what would be expected from their autocorrelation. (Iceland should not be considered as similar to Sweden as, say, Finland.) The autocorrelation should be modeled as directly as possible.

  5. Nat says:

    We can also think about this problem from a design of experiment perspective. If we could control the design of the experiment and collection of data then we would like to have our points spread out across the design space. For example, nations with a wide range of “societal inequalities” or “genetic diversity”. However, it sounds like Baron is describing a situation where we are restricted to using existing data or cannot control the design of the experiment. The data points are clustered together into groups rather than spread out across the design space. We do not have much information about how the dependent variable varies for values of the independent variables between the clusters of data points. Under strong assumptions such as assuming things are linear across the design space (linear regression) I think you might still fit a model with low uncertainty. However, under weaker assumptions I think you might fit a model with high uncertainty (e.g., Gaussian process regression).

    There is also the issue of spatial and temporal scales. When we are binning data spatially or temporally we are free to use very small or large bins. The size of the bins seems related to the amount of uncertainty we have in each observation. For example, if we estimate median household income at the state level then we should have much less uncertainty than estimating the value at the county or zip code level. Similarly, household spending each year is less variable than monthly or daily spending. We can consider a trade-off between lots of data points each with high uncertainty or a few data points each with low uncertainty. I think if we can incorporate this uncertainty in each observation into our modeling process then it might in some sense level the playing field.

  6. Guive says:

    Is there a name for this problem?

  7. yyw says:

    If we ignore all the modeling inadequacies and causation/correlation issues, this article could easily be renamed “Societal inequalities narrow gender gaps in reading”. A statement like “inequalities are detrimental to the performance of girls relative to boys in the three topics math, reading, and science” could easily be rewritten as something like “inequalities are beneficial to the performance of boys relative to girls in the three topics math, reading, and science”. While that seems ridiculous, the original statement was pretty terrible too, considering how much girls outperform boys in reading.

    In any case, a zero sum outcome measure like the ratio of female/male high performers is not nearly as relevant an outcome measure as say the percentage of girls/boys that are high performers.

  8. Bruce Bradbury says:

    (For some reason I can’t see the other 4 comments)

    I think it is more informative to think of these sorts of problems as problems of unobserved confounder variables (here ‘culture’) rather than as small N. This encourages us to think about exactly what cultural variables might be important, how they might work and whether we can measure them. Thinking about these theoretical issues is likely to be more useful than focusing on questions of statistical inference.

    I suspect Baron might reply to this statement by saying that thinking about unobserved confounders is about causality – and he is just interested in associations. My response is that if we are interested in whether the observed association will extend to some sort of broader population then these confounders will influence this inference if they are different in the broader population than in the observed sample. For example, when we include East Asia in the sample.

    • This sounds right to me. This isn’t a problem of drawing overly confident conclusions about the association in the sample, due to having a “misleadingly” high N; if all you care about is the association present in the sample, that confidence is warranted. But of course, you don’t just care about the sample; you’re studying it to understand a larger population. The real objection here is just that the sample is not representative of the populations (“all existing countries” and/or “all possible countries”) we really want to learn about from a study like this.

      Likewise, with those non-causal associations between US states in different regions — the associations are real, if perhaps uninteresting, so it isn’t a problem that we can conclude them with more confidence by considering more states per region. If they’re dominated by inter-region trends, then it would be inadvisable to apply them within a region (there could be a Simpson’s paradox situation). But that’s just a matter of knowing what your numbers mean, and what they don’t mean.

  9. Richard says:

    Perhaps not exactly what Jon is thinking about, but Simpson’s Paradox is a good demonstration of the dangers of breaking groups apart for analysis.

  10. hay says:

    “But I don’t know of any way of solving the problem of what might be called spatial or multi-dimensional similarity.”

    This is a curious statement, because the problem described of N=35 countries vs. N=3 countries in the example is exactly one of spatial autocorrelation. There are multiple ways to deal with that, and I guess it is theoretically possible that you do not have enough variation between groups to get a precise estimate of whatever quantity you are interested in. Nevertheless, as McElreath noted, the spatial distances can be defined on any variety of differences/similarities, including multidimensional similarity.

    I am not sure the time series example is exactly analogous, as what is described there is something of a temporal aggregation problem. There is of course a similar modifiable areal unit problem, but this should probably be thought of as a distinct problem.
