
Lee Nguyen Tran Kim Song Shimazaki

Andrew Lee writes:

I am a recent M.A. graduate in sociology. I am primarily qualitative in method but have been moving in a more mixed-methods direction ever since I discovered sports analytics (Moneyball, Football Outsiders, Wages of Wins, etc.).

For my thesis I studied Korean-Americans in education in the health professions through a comparison of Asian ethnic representation in Los Angeles-area medical and dental schools.

I did this by counting the different Asian ethnic groups at the medical and dental schools of UC Irvine, USC, and Loma Linda University, using surnames as an identifier (I coded for ethnicity using an algorithm from the North American Association of Central Cancer Registries which correlated surnames with ethnicity: http://www.naaccr.org/Research/DataAnalysisTools.aspx). The coding was mostly easy, since “Nguyen” and “Tran” are always Vietnamese, “Kim” and “Song” are Korean, “Shimazaki” is Japanese, etc.

Now, the first time around I found that Chinese-Americans and Vietnamese-Americans were proportionally the most numerous Asian ethnics at the medical/dental schools of UC Irvine and USC (each about 10% of graduating classes), while Korean-Americans were a distant third (3-5%). At Loma Linda University, however, Korean-Americans were about 30% of dental school graduating classes every year and 20% of medical school graduates. Chinese- and Vietnamese-Americans, meanwhile, were in the low single digits at Loma Linda (Japanese never exceed 2% at any school, strangely enough).

These results were surprising because I had expected all three schools’ Asian ethnic representation to mirror the region’s demographics; I thought I had made a mistake, so I decided to try recoding. I suspected that I might be subconsciously over-counting Korean-Americans by coding all instances of the “Lee” surname as Korean rather than Chinese. So I decided that, unless an “ethnic” first name could be used to identify an individual as a particular ethnicity, I would label all ambiguous cases as simply Chinese.

So for example, a “Steve Lee” or “Jonathan Lee” would always be labeled as Chinese. I would only label a person with a “Lee” last name as Korean if their ethnic Korean given name was included as a marker (Korean and Chinese given names are quite distinct).

After the recoding, my results were almost the same, except that the percentage of Chinese at Loma Linda University’s medical school went up to almost 10%, and the proportions of Koreans at Loma Linda University went down to 25% in the dental school and 15% in the medical school.
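The recoding described above amounts to bounding a group's share from both sides: the ambiguous “Lee” cases are assigned either all to that group or all to another. A minimal sketch in Python of that bookkeeping follows. The function name and the counts are hypothetical, chosen only for illustration; they are not the thesis data.

```python
def share_bounds(n_total, n_certain, n_ambiguous):
    """Return (low, high) bounds on one group's share of a class.

    low  assumes every ambiguous surname belongs to some other group;
    high assumes every ambiguous surname belongs to this group.
    """
    if n_total <= 0 or n_certain + n_ambiguous > n_total:
        raise ValueError("counts are inconsistent")
    low = n_certain / n_total
    high = (n_certain + n_ambiguous) / n_total
    return low, high


# Hypothetical class of 100 graduates: 25 clearly Korean names
# and 5 ambiguous "Lee" entries with no identifying given name.
low, high = share_bounds(n_total=100, n_certain=25, n_ambiguous=5)
print(f"Korean share is between {low:.0%} and {high:.0%}")
```

If the substantive conclusion holds at both ends of the interval, the ambiguity in the coding is unlikely to be driving it.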

Sorry for the long, boring paragraphs above on coding, but I just wanted to know if my methodology seems “rigorous” enough and if you would see these results as valid?

I want to make sure about these results, because they were the starting point from which I undertook a qualitative study of Asian-American medical/dental students at Loma Linda University to find out why the demographics at Loma Linda were so different. I found that Loma Linda University being a Seventh-Day Adventist university was the key deciding factor; the fact that Loma Linda University is a religious university encouraged self-selection among applicants as well as potential applicants to its medical/dental programs. Many non-Adventist potential applicants to Loma Linda’s programs decided against applying because they were turned off by the university’s overtly religious self-presentation; Korean-Americans are a highly Christian ethnic group, on the other hand, and so are well-represented within Adventism. The fact that Korean immigrants to America are also usually at least middle class and college-educated/professionally-trained means that they had advantages in academic achievement. So middle-class Korean-American Adventists are overwhelmingly choosing the medical/dental route as a way of ensuring financial stability, raising their family’s social status, and also signaling religious commitment (Adventism places a heavy emphasis on physical health in its religious doctrine, so the health professions enjoy an extra level of prestige).

My reply: Interesting. It reminds me of this article by Ron Unz. I think there must be an academic literature on ethnic coding of names. If you really want to learn some statistics you can fit a model in which the ethnic status of each person is a discrete latent variable, and then estimate using the EM algorithm. But to start I think it makes sense to do what you’re doing, trying out various extreme assumptions to bound your answer.
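One possible reading of the latent-variable suggestion, sketched in Python: treat each person's ethnicity as an unobserved label, take the probability of each surname within each group as known from reference data, and use EM to estimate the group shares. This is only a sketch under stated assumptions, not necessarily the model intended above. The table `SURNAME_GIVEN_GROUP` and all of its values are hypothetical placeholders rather than Census figures, and any surname not listed is lumped into a single "other" category.

```python
# HYPOTHETICAL values of P(surname | group); placeholders, not Census data.
SURNAME_GIVEN_GROUP = {
    "Korean":     {"Kim": 0.20, "Lee": 0.15, "Park": 0.10},
    "Chinese":    {"Lee": 0.05, "Wang": 0.10, "Chen": 0.10},
    "Vietnamese": {"Nguyen": 0.30, "Tran": 0.10},
}
LISTED = {s for table in SURNAME_GIVEN_GROUP.values() for s in table}


def surname_prob(group, surname):
    """P(surname | group); unlisted surnames share the leftover mass."""
    table = SURNAME_GIVEN_GROUP[group]
    if surname in LISTED:
        return table.get(surname, 0.0)
    return 1.0 - sum(table.values())


def em_group_shares(surname_counts, n_iter=200):
    """Estimate group shares from surname counts by EM."""
    groups = list(SURNAME_GIVEN_GROUP)
    shares = {g: 1.0 / len(groups) for g in groups}  # uniform start
    total = sum(surname_counts.values())
    for _ in range(n_iter):
        expected = dict.fromkeys(groups, 0.0)
        for surname, count in surname_counts.items():
            # E-step: posterior probability of each group for this surname.
            weights = {g: shares[g] * surname_prob(g, surname) for g in groups}
            norm = sum(weights.values())
            if norm == 0.0:
                raise ValueError(f"surname {surname!r} has zero probability")
            for g in groups:
                expected[g] += count * weights[g] / norm
        # M-step: new shares are the expected group counts over the total.
        shares = {g: expected[g] / total for g in groups}
    return shares


# Hypothetical surname counts for one pooled set of graduating classes.
counts = {"Kim": 20, "Lee": 20, "Park": 10, "Wang": 5, "Chen": 5,
          "Nguyen": 15, "Tran": 5, "Other": 60}
for group, share in em_group_shares(counts).items():
    print(f"{group}: {share:.1%}")
```

In this setup an ambiguous name such as “Lee” is never forced into one group; it is split in proportion to the current share estimates, which is the soft version of the all-Korean versus all-Chinese recoding.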

11 Comments

  1. Ron Unz says:

    Actually, your correspondent might want to consult the books by Nathaniel Weyl, who pioneered exactly this surname analysis technique— The Creative Elite in America (1966) and The Geography of American Achievement (1989). Both have been long out of print, but are inexpensively available on Amazon.com.

    The very simple idea is to focus solely upon uniquely distinctive ethnic names, then estimate the total ethnic representation using the relevant ratios, easily available from Census data. Thus, Nguyen, Tran, Kim, and Park would be used, while all the Lees would be ignored completely, because of the obvious ambiguity. Almost all ethnic groups have at least a smattering of such extremely distinctive surnames, and once you determine them and their ratios, it doesn’t really matter whether they constitute just 5% of the total or 40%.

    Obviously, such statistical methods are only reliable for large sample sizes, and may not be effective if you’re examining e.g. just a single medical school class. But if you were looking at 20 years of such classes, or all the American med school classes in 2012, they might work reasonably well.

    In my own research, such Weyl Analysis could only be applied to the NMS semifinalist lists, whose numbers were sufficiently large, and I employed it as a validation-check upon my more subjective estimates of Jewish surnames for that dataset. I was very pleased to find that the match was quite close. For example, my estimate of Jewish NMS semifinalists based on direct inspection was 5.95%, while the estimates based on Weyl Analysis were 5.92% and 6.03%, depending upon the particular subset of distinctive surnames selected. This tended to increase my confidence in the direct inspection methods I also applied to the far smaller lists of Olympiad winners and such.

  2. Ron Unz says:

    Actually, I forgot to mention one necessary simplifying assumption: that performance is equal across different surnames of a given ethnicity. This may not actually be correct, and Weyl noted that certain particular Anglo-Saxon names were massively over-represented in high-performance results compared to other, more common ones. But it would be difficult to determine this secondary factor for relatively small ethnicities due to sampling size problems.
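The distinctive-surname estimate described in the two comments above reduces to one division and one ratio. A minimal Python sketch follows, assuming (as stated above) that performance is equal across surnames within a group. The function name, the list size, and the count of Kims are hypothetical; the roughly 18% figure for “Kim” among Koreans is the one quoted elsewhere in this thread.

```python
def weyl_estimate(n_marker, marker_fraction, n_list):
    """Estimate a group's share of a list from one distinctive surname.

    n_marker        -- people on the list bearing the marker surname
    marker_fraction -- fraction of the whole group bearing that surname
                       in the reference population (e.g. from Census data)
    n_list          -- total number of people on the list
    """
    if not 0 < marker_fraction <= 1:
        raise ValueError("marker_fraction must be in (0, 1]")
    if n_list <= 0:
        raise ValueError("n_list must be positive")
    estimated_group_size = n_marker / marker_fraction
    return estimated_group_size / n_list


# Hypothetical list of 10,000 names containing 36 Kims,
# with "Kim" taken as about 18% of all Koreans.
share = weyl_estimate(n_marker=36, marker_fraction=0.18, n_list=10_000)
print(f"Estimated Korean share: {share:.1%}")
```

With small lists the marker count is noisy, so the estimate inherits that noise scaled up by `1 / marker_fraction`, which is why the method is only recommended for large samples.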

  3. Keith M. Ellis says:

    “This may not actually be correct, and Weyl noted that certain particular Anglo-Saxon names were massively over-represented in high-performance results compared to other, more common ones.”

    Yeah, as a layperson, that problem jumped out at me: if you’re using particular surnames as representative indicators of ethnicity, then you need to be certain that those surnames are pretty much proportionately represented in your population under study as they are in the reference population. Not only will this not necessarily be true when the populations are smaller (because of noise), but there could be systematic regional variation (because there are different ethnicities within these linguistic/national ethnic groups and for various reasons, as immigrants to the US, they may not be evenly distributed at smaller scales) and socioeconomic variation (also because of more fine-grained ethnic distinctions, or just locally historically contingent), and probably other kinds of variation.

    Again, I’m nearly totally ignorant about all this — but couldn’t there be other, similar methods by which one might check for this?

  4. Ben Bolker says:

    Would it be feasible to use a second hierarchical level in the model where individual surnames were treated as random effects (i.e., nested within their ethnic groups)?

  5. Patrick says:

    Ryan Enos at Harvard has a paper with a model to try to identify race by surname and location that might be of interest:

    http://ryandenos.com/papers/chicago_threat.pdf

  6. Ron Unz says:

    Keith M. Ellis: you need to be certain that those surnames are pretty much proportionately represented in your population under study as they are in the reference population.

    I’m not sure this is too serious a problem, at least in the cases under discussion. For example, when I was analyzing NMS semifinalist lists, I just used “Nguyen” as the proxy for Vietnamese and “Kim” as the proxy for Koreans, since they’re such high percentages of each of those groups (about 28% and 18% respectively), overwhelmingly more relatively common than any Anglo-Saxon name. I’d really be surprised if they’d differ too widely in characteristics from the ethnic general population, but I suppose someone could check this by comparing e.g. mean age.

    For most other groups, including Jews, a large number of names need to be used to produce any sort of reasonable sample, which makes the process much less efficient. Also, since the Census doesn’t count Jews, their total numbers are uncertain/ambiguous to at least 10-15%, inducing similar errors in multiplicative factors to use for surname sampling.

    Incidentally, Weyl compared the results for “old Chinese” surnames to those of post-WWII immigrants, and found that the latter performed 2-3x better academically, which is pretty reasonable since the former immigrant flows had been far less selective.

    I think I saw a news story somewhere a few months back that Gregory Clark had just unveiled this “revolutionary new technique” for ethnic analysis, which is a little silly since Weyl’s very detailed research has been around for almost 50 years…

  7. Andrew says:

    Wow, very cool! I am the correspondent referred to in the above post! Thank you Prof. Gelman for posting this as well as for bringing the post to the attention of various scholars with relevant expertise. This is very humbling, both to have my humble M.A. thesis be given this attention AND because I have a lot yet to learn.

  8. Andrew says:

    And I do remember reading the Unz article in the American Conservative several weeks ago, and thinking “Cool, maybe I should call this guy!”

  9. […] the NMS semifinalists and a larger set which Weyl himself had used for his own ethnic analyses. As I mentioned on Prof. Andrew Gelman’s statistics blog, my estimate of recent Jewish NMS semifinalists was 5.95% based on direct inspection, and 5.92% and […]
