I was looking at baby name data last night and I stumbled upon something curious. I follow the baby names blog occasionally but not regularly, so I’m not sure if it’s been noticed before. Let me present it like this: Take the statement…
Of the top 100 boys and top 100 girls names, only ___% contain the letter __.
I’m using the SSA baby names page, so that’s U.S. births, and I’m looking at the decade of 2000-2009 (so kids currently aged 4 to 13). Which letters would you expect to have the lowest rate of occurrence?
As expected, the lowest score is for Q, which appears zero times. (Jacqueline ranks #104 for girls.) It’s the second lowest that surprised me.
(… You can pause and try to guess now. Spoilers to follow.)
Of the other big-point Scrabble letters, Z appears in four names (Elizabeth, Zachary, Mackenzie, Zoe) and X in six, of which five are closely related (Alexis, Alexander, Alexandra, Alexa, Alex, Xavier). J is heavily overrepresented, especially as an initial letter, with 29 names. Former powerhouse names James and John have fallen a bit lately, to #17 and #18, but Jacob and Joshua have surged past them and rank #1 and #3 in the 2000s.
Lower than any of those is a letter I normally think of as middle-range (ranking 15th in the ETAOIN SHRDLU ordering): F occurs in only three top 100 names, all girls (Jennifer, Faith, Sofia). It’s not that F names never existed. Names like Frank, Jeff, Fred, and Cliff used to be common, but they have all greatly declined in recent years.
And it’s not just that F has the fewest names (other than Q); its names rank lower as well. Jennifer, which was #1 in the 1970s and #2 in the 1980s, is down to #39 in the 2000s. All the other letters have at least one high-ranking name. For X, Alexis is #11 for girls and Alexander is #13 for boys. For Z, Elizabeth is #9. (Zachary was #16 in the 1990s, down to #27 in the 2000s.)
The other two letters that occur in fewer than 10 names are P and W. W has five names, all boys, but three of them rank in the top ten (Matthew #4, Andrew #7, William #10, Owen, Wyatt). P has six names, with two in the top ten (Christopher #6, Joseph #9, Sophia #13, Stephanie, Paige, Patrick).
The P list provides an interesting clue to what may have happened to F. The top four P names all use PH to make the F sound. Perhaps part of the reason that the F has disappeared from names is that in names people prefer to spell the F sound as PH. (But then if so, that would leave P as underrepresented instead.)
But counting the number of names in the top 100 is a crude way of looking at this. What I really want is
Among [all/male/female] births in [year], ___% were given a name that [contains/starts with/ends with] [text string].
so that I can input the bracketed variables and get the % as output. Then I’d run that for each single letter for the past 20 years or so, and then I’d draw a graph plotting each letter across time so I could see where the letters rank relative to each other and how they’ve trended.
SSA provides complete comma-delimited text files for each year showing number of births for every name with 5 or more occurrences, so the data is available.
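For what it's worth, the fill-in-the-blank statement above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it assumes each line of a yearly file looks like name,sex,count, and the function names (parse_ssa, share_of_births) and the sample counts are invented for illustration, not taken from the SSA data.

```python
import csv
import io

def parse_ssa(text):
    """Parse lines assumed to look like name,sex,count into tuples."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        name, sex, count = row
        rows.append((name, sex, int(count)))
    return rows

def share_of_births(rows, text, mode="contains", sex=None):
    """Percent of births (all, or only sex 'M' or 'F') whose name matches."""
    needle = text.lower()
    tests = {
        "contains": lambda n: needle in n,
        "starts with": lambda n: n.startswith(needle),
        "ends with": lambda n: n.endswith(needle),
    }
    match = tests[mode]
    total = hits = 0
    for name, row_sex, count in rows:
        if sex is not None and row_sex != sex:
            continue
        total += count  # weight each name by its number of births
        if match(name.lower()):
            hits += count
    return 100.0 * hits / total if total else 0.0

# Made-up counts for illustration only, not real SSA figures.
SAMPLE = """Sofia,F,40
Emma,F,60
Jacob,M,70
Frank,M,30
"""
rows = parse_ssa(SAMPLE)
f_all = share_of_births(rows, "f")
f_girls = share_of_births(rows, "f", sex="F")
f_initial_boys = share_of_births(rows, "f", mode="starts with", sex="M")
print(f_all, f_girls, f_initial_boys)
```

To get the trend over time, one would read each year's file in turn, call share_of_births for each of the 26 letters, and plot the resulting percentages by year. Note that names with fewer than 5 occurrences are not in the files, so the denominator here is births with listed names, not all births.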
I suppose I could do it in Excel, but it would be slow and laborious. I imagine you stat people have tools (and practice) that could do it much more efficiently and thoroughly.
I don’t know if this interests you enough to spend any time playing around with it. Maybe if you have a student looking for an exercise to play with you could put it out there.
P.S. Other fun fact: A’s dominance of names seems to be increasing. I didn’t count all 100 for the 2000s, but in the top 10 male and female names for 2012, 19 of 20 contain the letter A. Next is I with 12.
I asked whether you would count the F in Cliff twice. I sort of think it should count double, actually, as it represents that much more exposure to the letter.
No, I wouldn’t. I suppose you could do it either way, but I’m thinking in terms of the name contains the letter or it doesn’t, not how many times.
The difference is more obvious when you think of higher frequencies. Like it would be interesting to say “75% of all boys born in 2012 have an A in their name”, but not so much to say, “In the first names of the 2,000,000 boys born in 2012, the letter A occurs 1,600,000 times”. The latter method only compares letters to other letters. The former compares letters to people, which is more interesting to me.
Or to put it another way, I’m not interested in letter frequency per se across the limited text corpus of baby names. I’m interested in the probability that a person you meet will have a certain letter somewhere in his name. So if there were a trend for names to become longer, in my conception all the frequencies would go up as a result, whereas in the other conception they’re essentially relative frequencies so it would be zero sum.
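To make the two conceptions concrete, here is a small Python sketch that computes both for the same data. The function names and the birth counts are made up for illustration; only the distinction between the two measures comes from the discussion above.

```python
# Made-up counts for illustration only: (name, births).
rows = [("Cliff", 10), ("Frank", 10), ("Jacob", 80)]

def contains_rate(rows, letter):
    """Percent of people whose name has the letter at least once."""
    letter = letter.lower()
    total = sum(count for _, count in rows)
    hits = sum(count for name, count in rows if letter in name.lower())
    return 100.0 * hits / total

def occurrences_per_person(rows, letter):
    """Average number of times the letter appears in a person's name."""
    letter = letter.lower()
    total = sum(count for _, count in rows)
    occ = sum(name.lower().count(letter) * count for name, count in rows)
    return occ / total

f_people = contains_rate(rows, "f")            # Cliff and Frank each count once
f_letters = occurrences_per_person(rows, "f")  # Cliff's double F counts twice
print(f_people, f_letters)
```

The first measure compares letters to people, so it answers "what share of people have an F somewhere in their name". The second counts every occurrence, so the Cliffs contribute two Fs each.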
In any case, in response to Ubs’s original question, this looks like a job for perl or whatever the cool kids are using these days. Python? I dunno. I bet one of our readers could download the data, crunch the numbers, and make a cool graph, all during the time it will take me to write my next blog post.