As an applied statistician, I don’t do a lot of heavy math. I did prove a true theorem once (with the help of some collaborators), but that was nearly twenty years ago. Most of the time I walk along pretty familiar paths, just hoping that other people will do the mathematical work necessary for me to fit my models (for example, taking care of all the intricacies of implementing differential equation models in Stan, or developing the mathematical tools necessary to derive algorithms to sample from difficult distributions).
Every once in a while, though, I’m reminded that a baseline level of mathematical expertise allows me (and others with similar training) to see problems from a distance and resolve them as necessary. This sort of mathematical skill can be nearly invisible while it is being applied, and even afterward it’s not always apparent what was being done.
Mathematical understanding can be used not just to solve a well-formulated problem; it also helps us decide what problems are worth solving in the first place.
I thought of this general point after some back-and-forth regarding a recently published article by Anne Case and Angus Deaton on trends in death rates. If you haven’t been following this story on the blog, you can read my recent Slate article for some background.
The study was first summarized as an increase in death rates for 45-54-year-old non-Hispanic white Americans (see, for example, Ross Douthat and Paul Krugman), but after “age adjustment”—that is, correcting for the change in age distribution, standardizing to a common distribution of ages—the pattern looks much different. We then learned more by looking at other ages and breaking up the data for men and women. The biggest part of the story is a comparison to mortality trends in other countries, but I won’t get into that now. Here I’ll be focusing on the U.S. data.
What I want to talk about is the value of a mathematical understanding of different sorts of bias correction, a kind of thinking that is known by many statisticians but is rarely part of the formal curriculum—we learn it “on the street,” as it were.
Let’s start with a first-order bias. Here’s a graph of #deaths among 45-54-year-old non-Hispanic whites in the U.S., based on data taken directly from the CDC website:
But that’s just raw number of deaths. The population is increasing too. Let’s take a look:
Hey—the population increased and then decreased in this age group! That’s the baby boomers entering and leaving the 45-54 category. Anyway, this population pattern tracks pretty closely to the #deaths pattern.
Looking at trends in number of deaths without adjusting for population is like looking at nominal price trends without adjusting for inflation. It’s a first-order bias, and (almost) everyone knows not to do it.
So the natural step is to look at changes in mortality rate, #deaths/#people in this group:
But then we have to worry about another bias. As noted above, the baby boom generation was moving through, and so we’d expect the average age among 45-54-year-olds to be increasing, which all by itself would lead to an increase in mortality rate via the aging of the group.
As expected, the 45-54-year-olds are getting older. But what’s happening with 2001? Is that for real? Let’s just double-check by pulling off ages from another dataset:
Yup, it seems real. Just quickly, let’s consider 2001. 2001-55=1946, and the jumpiness of the lines at the start of the above graph is tracking corresponding jumps in the number of babies born each year during the 1940s.
OK, the next question is: How would the change in age distribution affect the death rate in the 45-54 category? In other words, what is the bias in the above raw mortality curve, due to age composition?
We can do a quick calculation by taking the death rate by single year of age in 1999 and using it along with each year’s age distribution to track what the mortality rate in the 45-54 group would have been if there were no change in underlying death rates by age. Thus, all the changes in the graph below represent the statistical artifact of age composition:
Now let’s line up this curve with the changes in raw death rate:
About half the change can be attributed to aggregation bias.
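The composition-only calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the R script mentioned later in this post: the function name `expected_rate`, the two-age simplification, and every number are invented for the example and are not CDC data.

```python
# Hypothetical illustration: all numbers below are invented, not CDC data.

def expected_rate(base_rates, population):
    """Aggregate death rate implied by FIXED age-specific rates and a given age mix.

    base_rates: dict mapping age -> deaths per person in the baseline year
    population: dict mapping age -> number of people of that age in the target year
    """
    total_people = sum(population.values())
    expected_deaths = sum(base_rates[age] * n for age, n in population.items())
    return expected_deaths / total_people


# Made-up baseline rates that rise with age (two ages stand in for 45-54).
base_rates = {45: 0.003, 54: 0.006}

# Made-up populations: the group skews younger at first and older later.
pop_early = {45: 600, 54: 400}
pop_late = {45: 400, 54: 600}

rate_early = expected_rate(base_rates, pop_early)
rate_late = expected_rate(base_rates, pop_late)
print(rate_early, rate_late)
```

With these toy inputs the aggregate rate rises from about 0.0042 to about 0.0048 even though no age-specific rate changed. That rise is purely an artifact of the group getting older.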
We can sharpen this comparison by anchoring the expected-trend-in-death-rate-just-from-changing-age-composition graph at 2013, the end of the time series, instead of 1999. Here’s what we get:
And here it’s clear: since 2003, all the changes in raw death rate in this group can be explained by changes in age composition.
The much-heralded increase in death rates among middle-aged non-Hispanic white Americans happened entirely in the first part of the series.
In summary so far: this adjustment for changes in age composition is a second-order bias correction, less important than the first-order correction for raw population changes but large enough to qualitatively change the trend story.
Now that we’ve identified the bias, we can correct by producing age-adjusted death rates: for each year in time, we take the death rates by year of age and average them, thus computing the death rate that would’ve been observed had the population distribution of 45-54-year-olds been completely flat each year.
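Here is a rough sketch of that flat-distribution adjustment in Python, next to the raw rate for comparison. The function names and all counts are hypothetical, chosen so that the age-specific rates are identical in the two years while the age mix shifts.

```python
# Hypothetical illustration: counts are invented, not CDC data.

def raw_rate(deaths, population):
    """Total deaths divided by total population in the age group."""
    return sum(deaths.values()) / sum(population.values())


def age_adjusted_rate(deaths, population):
    """Unweighted mean of the single-age death rates (a flat age distribution)."""
    rates = [deaths[age] / population[age] for age in population]
    return sum(rates) / len(rates)


# Year A skews young, year B skews old; the rate at each age is the same
# in both years (0.005 at age 45, 0.010 at age 54).
pop_a, deaths_a = {45: 600, 54: 400}, {45: 3, 54: 4}
pop_b, deaths_b = {45: 400, 54: 600}, {45: 2, 54: 6}

raw_a, raw_b = raw_rate(deaths_a, pop_a), raw_rate(deaths_b, pop_b)
adj_a = age_adjusted_rate(deaths_a, pop_a)
adj_b = age_adjusted_rate(deaths_b, pop_b)
print(raw_a, raw_b, adj_a, adj_b)
```

In this toy setup the raw rate moves from 0.007 to 0.008, while the age-adjusted rate stays at 0.0075 in both years, which is what we want from an adjustment when nothing has changed at any individual age.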
The age-adjusted numbers show an increasing death rate until 2003-2005 and then a steady state since then:
But this is only one way to perform the age adjustment. Should we be concerned, with Anne Case, that “there are a very large number of ways someone can age-adjust this cohort” and that each method comes “with its own implicit assumptions, and that each answers a different question”?
The answer is no, we need not be so concerned with exactly how the age adjustment is done in this case. I’ll show this empirically and then discuss more generally.
First the empirics. I performed three age adjustments to these data: first assuming a uniform distribution of ages 45-54, as shown above; second using the distribution of ages in 1999, which is skewed toward the younger end of the 45-54 group; and third using the 2013 age distribution, which is skewed older.
Here’s what we found:
The results don’t differ much, with no change in the qualitative trends and not much change in the numbers either.
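To make the mechanics of that comparison concrete, here is a small Python sketch that standardizes the same two sets of age-specific rates to three different reference distributions. Everything in it is an assumption made for illustration: the function `standardized_rate`, the three-age simplification, the rates, and the weights are invented and do not reproduce the actual results.

```python
# Hypothetical illustration: rates and weights are invented, not CDC data.

def standardized_rate(rates, weights):
    """Weighted mean of age-specific rates, using a reference age distribution."""
    total_weight = sum(weights.values())
    return sum(rates[age] * weights[age] for age in rates) / total_weight


# Made-up age-specific death rates in an early and a late year.
rates_early = {45: 0.0030, 50: 0.0040, 54: 0.0060}
rates_late = {45: 0.0034, 50: 0.0044, 54: 0.0065}

# Three candidate reference distributions (weights need not sum to 1).
standards = {
    "uniform": {45: 1, 50: 1, 54: 1},
    "skewed young": {45: 5, 50: 3, 54: 2},
    "skewed old": {45: 2, 50: 3, 54: 5},
}

# Relative change from the early to the late year under each standard.
ratios = {}
for name, weights in standards.items():
    early = standardized_rate(rates_early, weights)
    late = standardized_rate(rates_late, weights)
    ratios[name] = late / early
    print(name, round(early, 5), round(late, 5), round(ratios[name], 3))
```

In this toy example the estimated increase is roughly 9 to 11 percent under all three standards, even though the levels shift with the choice of weights. How closely the standards agree on real data depends on how differently the rates move at different ages, so this is a check worth running rather than a guarantee.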
It’s important to do some age adjustment, but it doesn’t matter so much exactly how you do the age adjustment. In math jargon, age-adjustment corrects a second-order bias, while the choice of age adjustment represents a third-order correction.
That’s why, when I did my analysis a week or so ago, I performed a simple age adjustment. Based on my statistical experience and general mathematical understanding, I had a sense that the choice of age adjustment was a third-order decision that really wouldn’t have any practical implications. So I didn’t even bother to check. I did it here just for the purpose of teaching this general concept, and also in response to Case’s implication that the whole age-adjustment thing was too assumption-laden to trust. Case was making the qualitative point that any adjustment requires assumptions; I’m making a quantitative analysis of how much these assumptions make a difference.
So far I’ve been focusing entirely on the headline trends in mortality among 45-54-year-old non-Hispanic whites. But there’s nothing stopping us from grabbing the data separately for men and women:
These separate age-adjusted trends tell a new and interesting story. All the bias correction in the world won’t get you there; you have to pull in new data.
To put it another way: Age adjustment was a necessary first step. But now that we’ve dealt with that, we can move forward and really start learning from the data.
We can also look at other ages and other groups; see here for some graphs.
Concerns about data quality
When I first heard about Case and Deaton’s paper, I didn’t think about age adjustment at all; I was alerted to the age aggregation bias by an anonymous commenter. More recently this commenter has raised skepticism regarding the ethnic categories in the CDC data. I haven’t checked this out at all but it seems worth looking into. Changes in categorization could affect observed trends.
Turtles all the way down.
Asking the question is the most important step
As I wrote the other day, the point of bias correction and data inspection is not “gotcha!” Rather, the point of correcting biases and questioning the data is that the original researchers are studying something interesting and important, and we want to help them do better.
And here’s the R script
I put my R code and all my data files here. You should be able to run the script and create all the graphs I’ve blogged.
Warning: the code is ugly. Don’t model your code after my practices! If any of you want to make a statistics lesson out of this episode, I recommend you clean the code. Meanwhile, perhaps the very ugliness of the script can give you a feeling of possibility, that even a clunky programmer like me can perform an effective graphical data analysis.
I have the feeling that Hadley could’ve done all of this analysis in about an hour using something like 20 lines of code.
There’s lots more that can be done; I’ve only looked at a small part of the available data. The numbers are public; feel free to do your own analyses.