How do segregation measures change when you change the level of aggregation?

In a discussion of workplace segregation, Philip Cohen posts some graphs that led me to a statistical question.

I’ll pose my question below, but first the graphs:

In a world of zero segregation of jobs by sex, the top graph above would have a spike at 50% (or, whatever the actual percentage is of women in the labor force) and, in the bottom graph, the pink and blue lines would be in the same place and would look like very steep S curves. The difference between the pink and blue lines represents segregation by job.

One thing I wonder is how these graphs would change if we redefine occupation. (For example, is my occupation “mathematical scientist,” “statistician,” “teacher,” “university professor,” “statistics professor,” or “tenured statistics professor”?) Finer or coarser classification would give different results, and I wonder how this would work.

This is not at all meant as a criticism of Cohen’s claims; it’s just a statistical question. I’m guessing that someone’s looked into this already and that there’s some research literature on the topic.
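To make the aggregation question concrete, here is a minimal sketch using the index of dissimilarity (a measure that comes up in the comments below), D = 0.5 * sum over occupations of |w_i/W - m_i/M|. All occupation names, groupings, and counts are hypothetical, invented purely for illustration; they are not taken from Cohen's data or from the census.

```python
from collections import defaultdict


def dissimilarity(counts):
    """Index of dissimilarity: 0.5 * sum_i |w_i/W - m_i/M|.

    counts: iterable of (women, men) pairs, one pair per occupation.
    """
    counts = list(counts)
    total_w = sum(w for w, _ in counts)
    total_m = sum(m for _, m in counts)
    return 0.5 * sum(abs(w / total_w - m / total_m) for w, m in counts)


# Hypothetical data: detailed occupation -> (broad group, women, men).
FINE = {
    "nurse": ("health", 90, 10),
    "doctor": ("health", 30, 70),
    "elementary teacher": ("education", 80, 20),
    "professor": ("education", 40, 60),
    "engineer": ("technical", 20, 80),
}


def aggregate(fine):
    """Collapse detailed occupations into their broad groups."""
    coarse = defaultdict(lambda: [0, 0])
    for group, women, men in fine.values():
        coarse[group][0] += women
        coarse[group][1] += men
    return {group: tuple(c) for group, c in coarse.items()}


d_fine = dissimilarity((w, m) for _, w, m in FINE.values())
d_coarse = dissimilarity(aggregate(FINE).values())
print(f"fine classification:   D = {d_fine:.3f}")
print(f"coarse classification: D = {d_coarse:.3f}")
```

In this toy example the coarse classification reports less segregation than the fine one. Merging categories can never raise this particular index, because |a + b| <= |a| + |b| for the terms being merged; how large the drop is depends entirely on which occupations get lumped together.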

10 thoughts on “How do segregation measures change when you change the level of aggregation?”

  1. Well, there’s only one occupation reported on the census, so there’s no ambiguity. In general, you ought to want as disaggregated a job classification as you can get, subject to having a reliable estimate of the fraction female in each job. As you aggregate up job levels you just lose information. The big worry, I’d guess, would be two jobs with the same classification but very different segregation patterns, or vice versa: “washroom attendant” vs. “washroom attendant in men’s room” and “washroom attendant in women’s room” or, somewhat less fancifully, “basketball player” vs. “NBA player” and “WNBA player.”

    Of course, if you really worry about the ambiguity in job classification in the census you’ve got a different set of problems for which I presume the census has a host of white papers.

  2. This kind of question comes to my mind a lot when I read headlines about how “disease x is the number y largest killer of z”. If the breakdown is reasonable then this can be reasonable, but if you want to make your very own most important disease, you simply break down the other categories finely enough so yours is the biggest…

  3. With both the Gini and the index of dissimilarity, the more categories you use the higher your inequality score, in general. With the Gini this has to do with the calculus of the area under the Lorenz curve, I think. With gender segregation, it may have to do with the math as well, but it also fits the substantive pattern of gender distributions. For example, if you lump nurses and doctors into “professionals,” you see less segregation than if you separate them. And if you further break nurses and doctors down to pediatric nurses, nurse anesthetists, pediatricians and cardiologists, you would get a higher dissimilarity score still. To make comparisons over time or space, you have the further complication that the occupational composition shifts, and that the labor force size changes. No easy answers to how to do it right.

  4. Within sociology the “proper” measurement of class has always been a problem. Class is just a set of occupations such that its occupants share similar experiences, e.g. job security, status, income, work hours, chances of promotion, etc. Recently, a new literature has been developing around so-called micro classes, see for example:

    Jonsson, Jan O., David B. Grusky, Matthew Di Carlo, Reinhard Pollak, and Mary C. Brinton. 2009. “Microclass Mobility: Social Reproduction in Four Countries,” American Journal of Sociology, Volume 114, Number 4.

    This fine-grained classification requires very large datasets, such that each cell contains a reasonable number of observations, and the computer power to handle such datasets. I guess that is the main reason why the empirical consequences of this fine-grained classification are only now being explored, even though the question is so much older.

  5. Andrew, this question is answered in the racial residential segregation literature and would provide parallels, I believe. See, for example (but not only), Claude Fischer et al.’s 2004 Demography article, Sean Reardon and Barrett Lee’s work on segregation scale, and Rick Grannis’ work on defining neighborhoods.

    Also, Chris Winship’s article on segregation measures covers many of the statistical issues in racial residential segregation that seem relevant to your post.

  6. Slightly off topic, I find it incredible that 55% of the men in the US are truck drivers, janitors, or manual laborers. Is that statistic right? That’s over 80 million people.

  7. This is indeed a very interesting statistical issue. Roland Rathelot and I attempt to solve the problem raised by Cortese, Falk and Cohen (“Further considerations on the methodological analysis of segregation indices”, American Sociological Review, 1976) and Winship (1977), previously mentioned by Mike3550. Even if there is no segregation, standard segregation indices are going to be strictly positive when units (businesses, classrooms…) are small, due to the random allocation and finite population effects. Using a binomial mixture model, we show that segregation indices are only partially identified in general.

    Roland also has a previous paper, based on the same idea but using the parametric assumption of a mixture of beta distributions to recover point identification.

    Hope you find them interesting, and of course comments welcome!
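The small-unit effect described in comment 7 can be seen in a quick simulation. The sketch below assigns each worker a sex at random with the same probability in every unit, so there is no segregation by construction, and then computes the index of dissimilarity. The function names, unit sizes, and seed are hypothetical choices for illustration. This only demonstrates the upward bias; it is not the binomial mixture model the comment refers to.

```python
import random


def dissimilarity(counts):
    """Index of dissimilarity: 0.5 * sum_i |w_i/W - m_i/M|."""
    counts = list(counts)
    total_w = sum(w for w, _ in counts)
    total_m = sum(m for _, m in counts)
    return 0.5 * sum(abs(w / total_w - m / total_m) for w, m in counts)


def simulate_d(n_units, unit_size, p_female=0.5, seed=0):
    """D under purely random allocation: every worker in every unit
    is female with the same probability p_female."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_units):
        women = sum(rng.random() < p_female for _ in range(unit_size))
        counts.append((women, unit_size - women))
    return dissimilarity(counts)


# Same total workforce (10,000 workers), split two different ways.
d_small = simulate_d(n_units=2000, unit_size=5)
d_large = simulate_d(n_units=20, unit_size=500)
print(f"units of 5 workers:   D = {d_small:.3f}")
print(f"units of 500 workers: D = {d_large:.3f}")
```

With small units the index comes out well above zero even though allocation is random, which is one reason a finer classification can raise a measured index without any change in underlying segregation.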
