Here are the data and code for that study of Puerto Rico deaths

A study just came out, Mortality in Puerto Rico after Hurricane Maria, by Nishant Kishore et al.:

Using a representative, stratified sample, we surveyed 3299 randomly chosen households across Puerto Rico to produce an independent estimate of all-cause mortality after the hurricane. Respondents were asked about displacement, infrastructure loss, and causes of death. We calculated excess deaths by comparing our estimated post-hurricane mortality rate with official rates for the same period in 2016.

From the survey data, we estimated a mortality rate of 14.3 deaths (95% confidence interval [CI], 9.8 to 18.9) per 1000 persons from September 20 through December 31, 2017. . . . a 62% increase in the mortality rate as compared with the same period in 2016. However, this number is likely to be an underestimate because of survivor bias. The mortality rate remained high through the end of December 2017, and one third of the deaths were attributed to delayed or interrupted health care. Hurricane-related migration was substantial.

The 14.3 per 1000 is an annualized rate: it’s 14.3 deaths per 1000 people per year. The population of Puerto Rico is 3.4 million (or 3.3 million, or 3.7 million, depending on where you look it up), so 14.3 deaths per 1000 people is an estimated 49,000 deaths per year. They’re comparing that with a previous death rate of 8.8 per 1000 per year, which comes to 30,000 deaths per year. The difference—excess deaths—is 19,000 in a year, or 5,300 in the 102 days of the study, assuming a constant mortality rate throughout the year.
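A few lines of code confirm this arithmetic (a quick sketch using the 3.4 million population figure; the other population figures shift the totals by a few percent):

```python
# Back-of-the-envelope check of the excess-death arithmetic in the post.
population = 3.4e6          # Puerto Rico population (one of several figures in circulation)
rate_2017 = 14.3 / 1000     # survey estimate, annualized deaths per person per year
rate_2016 = 8.8 / 1000      # official baseline rate
study_days = 102            # September 20 through December 31, 2017

deaths_2017 = rate_2017 * population   # ~49,000 deaths per year at the survey rate
deaths_2016 = rate_2016 * population   # ~30,000 deaths per year at the baseline rate
excess_per_year = deaths_2017 - deaths_2016            # ~19,000
excess_study_period = excess_per_year * study_days / 365
# ~5,200; the post's 5,300 comes from rounding the annual excess to 19,000 first
print(round(deaths_2017), round(deaths_2016), round(excess_study_period))
```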

Here are some data they report from official death tallies:

It would help if they were to label the gray lines. I think death rates were going up from 2010-2016 but I’m not sure. Also, I have no idea why most years show a big jump in deaths from November to December. Is that real, or is it some data issue? The drop from November to December in 2017 (in contrast to the jump in all the other years) is consistent with under-reporting of deaths, and I guess it’s also consistent with people leaving the territory who would otherwise have died.

Anyway, from the above figure you can see that reported deaths did not dramatically increase from Sep-Dec 2016 to Sep-Dec 2017; even the months with the biggest year-on-year differences show increases of less than 500 deaths per month.

So the claim from the survey is that a lot of deaths went unreported. I don’t know how this works, exactly. I’m not saying the survey results are wrong, I’m just saying I don’t know enough about what was going on in Puerto Rico to know how to think about these survey responses.

Here are all the ages and months of deaths reported in the survey of 3299 households:

This graph is useful, not just in giving an estimate for the number of post-hurricane deaths but for showing the comparison: 18 deaths in the 263 days before the hurricane and 38 deaths in the 101 days after the hurricane. Indeed, this gives a raw estimate of relative death rate of (38/101)/(18/263) = 5.5.
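The quoted ratio is easy to reproduce from the counts above:

```python
# Raw before/after death-rate ratio from the survey counts quoted above.
deaths_after, days_after = 38, 101
deaths_before, days_before = 18, 263

rate_after = deaths_after / days_after     # deaths per day, post-hurricane
rate_before = deaths_before / days_before  # deaths per day, pre-hurricane
ratio = rate_after / rate_before
print(round(ratio, 1))   # 5.5
```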

A ratio of 5.5: that seems a bit too high! This makes me wonder about reporting bias.

On a separate matter, the authors say that their survey is underestimating the number of deaths because there will be no response from households where everyone died (or a one-person household where that one person died).

I have some issues regarding survey design and weighting.

The design involves stratification (which makes sense) and clustering (which makes sense) and a procedure for getting replacement samples for abandoned homes. I’m not so sure that this last idea was so good, because the result will be to overrepresent areas with more abandoned homes. To get a sense of the importance of this bias, you’d want to know how often this happened in the data, and where. Also this replacement procedure involved “sampling from all surrounding visible houses” so I’m not sure what they did with apartment buildings.

The paper describes a two-stage weighting process, but then in the statistical analysis they don’t use their weights or account for clustering at all! Which makes me wonder why they were talking about the weights at all. They also get a standard error for the number of deaths using the Poisson distribution: That ain’t right! What you’re supposed to do is estimate the rate in each cluster and then use the cluster-level analysis to get your estimate and uncertainty. (Or fit a hierarchical model, but here I’m just talking about the classical approach.)
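To make the suggestion concrete, here is a minimal sketch of a cluster-level analysis: compute a death rate per cluster, then take the standard error from the between-cluster variation. The cluster counts below are invented for illustration; they are not the study’s data, and a real analysis would incorporate the survey weights.

```python
# Sketch of a cluster-level analysis: per-cluster death rates, with the
# standard error taken from between-cluster variation rather than a
# Poisson assumption. Counts are made up for illustration only.
import math

# (deaths, person-days observed) for a handful of hypothetical clusters
clusters = [(2, 3100), (0, 2800), (1, 2950), (3, 3300), (0, 2700), (1, 3050)]

rates = [d / pd for d, pd in clusters]   # per-cluster death rates
n = len(rates)
mean_rate = sum(rates) / n
# between-cluster variance of the mean (simple equal-weight version;
# a design-based analysis would weight clusters by their survey weights)
var_mean = sum((r - mean_rate) ** 2 for r in rates) / (n * (n - 1))
se = math.sqrt(var_mean)
print(mean_rate, se)
```

The point of the exercise: the uncertainty comes from how much clusters differ from each other, which a household-level Poisson calculation ignores.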

I also got an email from David Manheim who had some issues with the death rate comparisons. Manheim writes:

Figure S2 has the official death tolls, and makes me wonder how they justify their very high estimates and makes me wonder why they don’t discuss the time-shifting, given the much lower than usual death toll in November/December 2017.

The survey methodology is also less than ideal for what they are doing. They could have used the same dataset that they used for the baseline in 2017, which was the number of official monthly deaths in Puerto Rico. Instead, the sampling they did has tons of other potential biases and issues. As one specific example, they used random resampling of missing households and those who refuse to consent. This is not horrible, but it creates a bias they don’t seem to adjust for. . . . the sampled median age was a decade older than the actual average – indicating what seems to be an obvious bias against working-age individuals, who were presumably less likely to be home when sampled, and shifting the age ranges presumably inflates the death tolls somewhat.

Lastly, the normal counting of deaths from hurricanes is direct deaths. The original death toll report, which all of the press cites as misleading, reflects this number. The obvious comparisons to other storms is to that same number – direct deaths. The number they report is defensible, and as they cited, CDC recently recommended using it more generally, but they chose to estimate it for this particular storm, and not include comparisons to any other storm.

My overall take on this is that it’s a hard problem. It makes sense to do this sort of survey as a cross-check on official death tolls. But the uncertainties and possible biases in the estimates are so large that it’s hard to know what to do with the numbers.

Sometimes this can be difficult to capture in news reporting: the idea that this is a careful, high-quality study with summary numbers that are noisy. There’s a temptation to either dismiss the study entirely or to take its estimates as truth, but neither of those extremes is right.

It’s good news that all the data and code are publicly available so anyone can do alternative analyses.

Here’s the published paper, here’s the published supplementary material, and here’s the Github page with the data and code. The repository includes all the data except the GPS locations; you get barrio instead. In particular, I’ve been told that the posted data include cluster identifiers, so you should be able to analyze the survey respecting the design. The survey also includes potentially interesting data on questions about neighbors, so there are lots of things to explore for students or anyone else interested in going in depth here.

P.S. Rafael Irizarry, the statistician on this project, adds some comments:

I am convinced that the drop in Nov and Dec 2017 in monthly level data has to do with the fact that the government demographers are still catching up. I base this on the attached plot showing daily data. [I’m not sure if I have permission to share this plot, but I can describe it here. The reported deaths are approximately 100 per day in January 2017, then gradually decline to about 75 per day from March through October, then the hurricane lands, and deaths spike to about 120 per day for a couple weeks and then drop quickly to 100, then 75, then 50, then 25. — AG.] You can see an almost monotonic drop starting in mid-October, consistent with this being an incomplete dataset. If the rate remained constant at the rate we see in late September / early October, continuing through December, we end up with an estimate well within our confidence interval. The reason we have two datasets here is because we could not get the government to share the daily level data nor the monthly level data for 2017. The New York Times somehow got the daily data and shared it with us. The PR Institute of Statistics, an autonomous branch of the government, sent us the monthly data you see in the paper. But neither are official.

Regarding apartments, I checked with other authors and they tell me that for apartments we had a random selection of floor, then unit in that floor.

The larger ratio you note between the before-and-after rate is perhaps due in part to survival bias: the denominator is smaller than it should be. After applying the adjustment we describe, that ratio of rates goes down some.

Our findings go beyond the estimation of the death count. For me, the most important takeaways were:
1- A substantial number of the post hurricane deaths were due to lack of access to medical care. Many more than due to trauma.
2- There is evidence that the death rate was high through December, not just right after the hurricane.
3- The survey that got us this information was completed in 3 weeks for less than $75K. Government agencies can run surveys like this, preferably larger ones, right after a disaster to get an idea of what is going on.

These are all interesting points. From a mathematical or statistical standpoint, the challenge arises because we are trying to estimate the frequency of a (fortunately) rare event from a general-population survey. Just think about it: They had to survey 3300 households to learn about 38 post-hurricane deaths. Whenever you’re in the position of needing to survey thousands of people to find out about only 38 events, you know you have a statistical challenge: any estimated total will be noisy, and estimates for subsets will be even worse. So I think Rafael is right that, if you’re going to do this sort of survey, you should look at all the information you can and not get stuck on any particular number. Policymakers should be combining what they learn from different sources.
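To see how little precision 38 events can buy, consider the best case, where the count behaves as a simple Poisson variable (the design effects from clustering and weighting only make this worse):

```python
# Best-case uncertainty for an estimate built on 38 observed deaths:
# treat the count as Poisson and look at its relative standard error.
import math

count = 38
se = math.sqrt(count)                 # Poisson standard error
rel_se = se / count                   # ~16% relative uncertainty
lo, hi = count - 1.96 * se, count + 1.96 * se   # rough 95% interval
print(round(rel_se, 2), round(lo), round(hi))   # 0.16 26 50
```

A plus-or-minus 30% interval on the raw count, before any design effects, is why any single headline number from this survey is bound to be noisy.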

30 thoughts on “Here are the data and code for that study of Puerto Rico deaths”

  1. Kudos to the authors for being so transparent! I was able to use their code/data to replicate all their figures. Well done! (The contrast with the behavior of the Lancet folks regarding Iraq mortality estimates 10+ years ago is, perhaps, a sign of the increasing success of open science and/or replication standards.)

    Observation: Using their formula, I find an annualized mortality estimate for the pre-Maria time period of 2.6 deaths (95% confidence interval [CI], 1.4 to 3.8) per 1000 persons. This is, obviously, ludicrously low. Almost no country on earth has such low mortality, and the 2010 to 2017 official statistics (which the authors, to their credit, provide) show a fairly stable mortality rate of 8.3.

    If the data from the survey is so flawed for the Jan to Maria period, how much faith should we have in it for the period from Maria through December? Is there a chance that some deaths which actually occurred pre-Maria were mistakenly recorded as occurring post-Maria?

    • I am writing a letter to the editor. (The limit is 175 words.) Below is a draft. Comments sought!

      The excess mortality estimates of Kishore N, Marqués D, Mahmud A, et al. (May 29 issue)1 for the 102 days after Hurricane Maria depend on the quality of their data. We can examine that quality by calculating a mortality rate for the 263 days prior to Maria and then comparing that rate to other sources. Using their methodology and data, I calculate a mortality rate of 2.6 deaths (95% confidence interval [CI], 1.4 to 3.8) per 1000 persons from January 1 through September 19, 2017. This mortality rate is inconsistent with the rate calculated from the official monthly statistics for 2010 through 2017: 8.3 deaths (95% CI, 8.2 to 8.4) per 1000 persons. It is, statistically, almost impossible for there to have been only 18 deaths in the 3299 households prior to Maria. What explains this anomaly? One possibility is that deaths which actually occurred before Maria were (mis)-recorded as happening after. If such data errors affected 3 or more of the 38 post-Maria deaths, then the post-Maria mortality rate would no longer be statistically different from their 2016 baseline.
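The letter’s pre-Maria rate can be reproduced from the survey counts. The denominator, roughly 9,522 individuals in the 3,299 surveyed households, is the paper’s figure and is not quoted above, so treat it as an assumption here:

```python
# Rough replication of the letter's pre-Maria mortality rate.
persons = 9522               # individuals in the surveyed households (paper's figure; assumption here)
deaths_pre, days_pre = 18, 263   # survey deaths and days before the hurricane

annualized_rate = deaths_pre / persons * (365 / days_pre) * 1000
print(round(annualized_rate, 1))   # ~2.6 deaths per 1000 persons per year
```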

      • This looks like a good catch!

        I’d suggest a more general introductory explanation along the lines of:
        [small samples are really noisy, and the survey hasn’t been validated via comparison to any other dataset. Comparing the survey estimate to the official mortality figures, we see the underestimate places the true value well outside the confidence bounds, and this implies a systematic issue with the survey. Given this, one possible explanation is…] that deaths which actually occurred before Maria were (mis)-recorded as happening after. If such data errors affected 3 or more of the 38 post-Maria deaths, then the post-Maria mortality rate would no longer be statistically different from their 2016 baseline.

        • > the survey hasn’t been validated via comparison to any other dataset.

          I don’t think this would be fair to claim since they do plenty of comparisons with ACS, at least for demographic data. Which, perhaps, makes their lack of explicit comment on comparisons of mortality rates from other sources all the more glaring . . .

      • The annualized mortality for their entire survey is *below* the observed mortality rate in 2016. So in the year of the hurricane, they find below-average mortality rates? Their results are implausible.

        Additionally, we should expect the survey to have found ~84 deaths, based on the 2016 mortality rate, yet they found only 56. Their results are implausible.

        Even if you take their results as being “correct”, it would imply a completeness of death registration of ~90%. Puerto Rico has *never* had a completeness below 97%. Their results are implausible.
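The ~84-deaths figure checks out under the same assumption about the surveyed population (roughly 9,522 individuals, the paper’s figure, not quoted above), and a Poisson tail probability makes the implausibility concrete:

```python
# At the 2016 rate of 8.8 per 1000 per year, a surveyed population of
# ~9,522 (paper's figure; an assumption here) should see ~84 deaths in a
# year. The survey found 18 + 38 = 56. How surprising is that under a
# simple Poisson model?
import math

lam = 8.8 / 1000 * 9522      # expected deaths in the surveyed population, ~84
observed = 18 + 38           # 56 deaths reported in the survey

# P(X <= observed) for X ~ Poisson(lam), summed directly
term = math.exp(-lam)
p = term
for k in range(1, observed + 1):
    term *= lam / k
    p += term
print(round(lam), observed, p)   # the tail probability is well under 1%
```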

  2. … the key issue now in “improved” bean counting of “hurricane deaths” in Puerto Rico is of course… political. High death tolls bolster the political push for even more Federal aid to the supposed “poor, racially oppressed minority Americans” on that island. White Americans in Florida or Texas would not have been treated so badly by the Feds or suffered as much after a big hurricane (?) That was most certainly the media template during and following those hurricanes in Puerto Rico.

    Artificially pumping up the “death toll” with very broad definitions of ‘indirect deaths’ serves the political purpose well — the American media trumpets this new survey constantly.

    What’s the ultimate purpose of counting hurricane deaths anyway ?

    • Artificially pumping up the “death toll” with very broad definitions of ‘indirect deaths’ serves the political purpose well — the American media trumpets this new survey constantly.

      I was surprised to learn it too, but according to the CDC definition,[1] “indirect deaths” is defined vaguely enough to allow death from anything at any point[2] after surviving a disaster. However, usually the determination is made by a medical examiner who has some professional reputation to uphold. Here they switched from asking medical examiners to asking people who stand to gain (money, etc.) if the death is counted as due to the disaster.

      Really I would put the blame on the too vague definition of “indirect death” which allows this type of thing. Besides this example, who knows what trends in disaster related deaths may be due to examples shown in “medical examiner school” or conferences, etc? Basically there is no historical record of what counts in practice.

      [2] Ref 2 from above:

      • There are several reasons to care about the number of hurricane deaths. An obvious one is that it helps with resource allocation: how much money is it worth putting into preparing for hurricanes, into recovering from them, etc.? Whether this particular hurricane killed 60 people or 4000 people is surely an informative data point!

        I’m not sure what Anoneuoid means about there being a problem with the definition of indirect deaths. If, in absence of an event, you expect N deaths, and you actually get N + D, I think it’s fair if your _estimate_ of the excess deaths due to the event is equal to D. Sure there are uncertainties, but I think it’s fair to count all-cause mortality. For one thing, it would be hard to definitively say that a given death was _not_ indirectly caused by the event, especially if the event is a hurricane or other disaster that affects all aspects of life. Maybe there are more deaths from domestic violence after a hurricane (more stressful), maybe more deaths from crime (looters, desperate people), more suicides (lost everything), etc., etc. I don’t think counting all-cause mortality is a way of “artificially pumping up the ‘death toll'”, it seems legitimate to me as long as you compare all-cause mortality before the event to all-cause mortality after.

        I think people are right to be skeptical of any specific number, and it is probably worth thinking about major ways the estimates from this study could be substantially biased upwards, but the general concept seems fine to me.

        An important question is, if you use exactly the same approach for estimating mortality for a time and place when there was NOT a disaster, do you get the right number? D Kane says the numbers from the ‘before’ period are far too low. That seems to me to be a big problem.

        • I’m not sure what Anoneuoid means about there being a problem with the definition of indirect deaths.

          Check the refs, according to the current definition: if you got a shrapnel wound during a disaster in the 1970s and died of an infection of that wound 48 years later in 2018, that would qualify as an indirect death. It is totally up to the collective opinion of medical examiners to do this or not.

        • I still don’t see the problem. If deaths from 48-year-old shrapnel wounds become more likely because of the hurricane — inability to get medical treatment, shortage of antibiotics, shortage of clean water, whatever — then those _should_ count as ‘indirect deaths.’

          Of course, some deaths will happen ‘randomly’ that have nothing to do with the hurricane, but that is true prior to the hurricane too, which is all part of the base rate of all-cause mortality.

        • I think you missed the point: if someone in 2017 got a sliver in their leg during the hurricane and in 2057 it gets infected and they die, it could in theory be counted as an indirect death from the hurricane. That definitely seems problematic to me, because everything that happens happens cumulatively from past causes. So one might just as easily count the sliver death 40 years from now as indirectly caused by, say, a shortage of baby wipes in 2027 leading to altered hygiene practices that eventually led to …blablabla… and then to death by infection.

        • Daniel, I’m not sure that is really Anoneuoid’s objection, although it may be. It doesn’t make a lot of sense to me: nobody would presume to count the hurricane-related mortality by doing a before/after comparison that spans 48 years. Actually that argument would support an assertion that -if- the methodology is accurate then it undercounts the number of hurricane-related deaths, because there will be a few people who are damaged in some way by the hurricane but don’t actually die until years later, and you won’t count those this way.

          I still don’t see the conceptual problem with looking at all-cause mortality before and after the event, controlling for seasonality etc. I certainly see the methodological and statistical problems with it, but I think it is fair to attribute the difference to direct deaths + indirect deaths + noise.

        • The point is that whether a death is classified as hurricane related on a death certificate or the like is a very noisy thing. You’d do better to just look at rates before and after as you suggest and ignore the official cause

        • Daniel, I’m not sure that is really Anoneuoid’s objection, although it may be. It doesn’t make a lot of sense to me: nobody would presume to count the hurricane-related mortality by doing a before/after comparison that spans 48 years.

          If there is money/power involved? Yes, they will. And you apparently reject 48 years, so where should the cutoff be?

          IMO, there simply is no good way to define “indirect deaths due to a disaster” so we should get rid of the concept until someone figures out a way to demonstrate otherwise. I currently don’t see any function for these numbers other than as a political playground.

        • A perfectly good model of a disaster is that it instantly increases death rate, and then the rate decays exponentially back to some new steady state. The decay time constant could be days to single digits years… Any longer term trends should be modeled separately. Using this model we can estimate the indirect effects without involving any medical examiners with money or power issues.
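That spike-and-decay model is simple enough to write down. Everything below is illustrative, with numbers loosely inspired by the daily counts Irizarry describes, not estimates of anything:

```python
# Spike-and-decay model of a disaster: the death rate jumps by A at
# landfall and decays back toward baseline with time constant tau.
# All numbers are illustrative, not estimates.
import math

baseline = 75.0    # deaths per day before the storm (illustrative)
A = 45.0           # instantaneous jump at landfall (illustrative)
tau = 30.0         # decay time constant in days (illustrative)
T = 102            # days in the study window

# Excess deaths = integral of A * exp(-t / tau) from 0 to T
excess = A * tau * (1 - math.exp(-T / tau))
print(round(excess))   # 1305 with these illustrative numbers
```

As T grows, the excess approaches A·tau, so the estimate is not very sensitive to exactly where the window ends once it spans several time constants.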

  3. Don’t know much about death statistics and haven’t read the paper, but I wonder if the jump from November to December is due to deaths for which a date cannot be precisely determined. E.g. if you are missing for 7 years, maybe they declare you dead on the first Dec 31 after 7 years have elapsed? This alone would not seem to explain the jump, which represents 1% of annual deaths, but maybe there are other similar categories…

  4. Also, I have no idea why most years show a big jump in deaths from November to December

    Human mortality is seasonal. Amazingly to me, a quick search didn’t come up with a decent plot of this, so I made one… From CDC WONDER:

    Formatted Data:

    That is number of deaths for 1999-2016, not rate per 100k people or anything. For some reason it returned not applicable for that column… I guess we are supposed to get our own census data?

      • I had something else to say about the latitude, axial tilt, the cosine of what results from that, and insolation… But really just look at the March uptick. It looks like the same process is at play. I would really like to see a plot of mortality rate by birth/death/avg latitude though.

        • But really just look at the March uptick.

          Either I am looking at the wrong graph or misunderstand the point, but I don’t see (grasp?) the uptick.

          I would really like to see a plot of mortality rate by birth/death/avg latitude

          That would be interesting. I wonder if any epidemiologists have tried this? Since I live in Canada, my first thought was that lower temperatures and nasty wet weather were out to get us.

          I had not considered something like hours of sunshine though it becomes obvious as soon as you mention latitude.

          Could be a good master’s thesis?

          P.S. I just graphed some monthly mortality data from StatCan. It looks safer to spend Nov to April in Australia and May to Oct in Canada.

        • Since these are just # of deaths, grouped by month, could this also be an effect of February being a *shorter* month?

  5. I wonder if the drop in deaths in November and December is also partially due to frail people who would ordinarily have died in those months dying in September and October instead.

  6. My vague impression is that Puerto Rico has been suffering a general meltdown of government services for quite a few years now.

    You should check into Puerto Rico’s remarkably awful school test scores on a special Spanish version of the federal NAEP adapted particularly for PR. Although PR’s NAEP scores went up in 2017 over 2015, they are still incredibly awful compared to Hispanics on the mainland.

    Spending per student is low in Puerto Rico, although it’s higher than in, I believe, Utah and Wyoming. Spending on teachers is very low, but some of the shiftier categories of administrative expenses are higher per student than in any state, trailing only the notoriously gold-plated District of Columbia school system.

    It might have something to do with the fact that depopulating Puerto Rico could tip Florida from a purple state to a permanently blue state in the Electoral College.

  7. “Puerto Rico has been suffering a general meltdown of government services for quite a few years now…”

    …yes, that’s the real hurricane-news-story (largely unreported in MSM). PR government is horribly corrupt and incompetent; the electric system and overall infrastructure were in shambles long prior to any major hurricanes. PR government had recently squandered $84 Billion in borrowed money that easily could have prepared the island for annual hurricane threats — and mostly avoided the excessive deaths and suffering of the populace last fall.

    • Right.

      You hear a lot of bad things about nationalism these days, but Puerto Rico is a scary example of what a lack of nationalism can do. It’s not its own country, but nobody in the real United States cares enough about PR to impose good government either. The Democrats benefit from depopulating PR by tipping Florida blue and the Republicans are too stupid to notice their interest in keeping PR habitable.

  8. The authors continue to deserve tons of credit for all their transparency. Indeed, is there a better example of open science associated with such a high profile article? I can’t think of one.

    For example, see the discussion they are hosting on github.

  9. A colleague and I also looked into the death toll due to Hurricane Maria earlier this year. The authors of the NEJM paper are correct when they say that the authorities were not forthcoming with the data. But after several attempts late last year we got death counts up to September 19 (the day before Hurricane Maria made landfall) and from September 20 to October 31. Estimation works under the following premise: deaths per day tend to be distributed similarly in the short term under ‘normal conditions’ (e.g., no hurricanes). Therefore, if we estimate the difference in deaths per day between a post-storm period and a pre-storm period, we get an estimate of the hurricane death toll. Using our data we determined with 95% confidence that Hurricane Maria’s death toll was between 605 and 1039 victims.
    After the commotion created by the misinterpretation of the NEJM paper, the government released updated data, including revisions to September, October, and November total deaths. Inputting the new data into our code (see link to paper above) we got a revised 95% confidence interval of 720 to 1397.

In our view, it was more sensible to assume the death rate per day was fairly constant within 2017 than between 2016 and 2017. Moreover, we wanted to separate the pre and post groups as much as possible. Thus our pre-storm data set goes all the way to September 19.
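The pre/post comparison described in this comment can be sketched as follows. The commenters’ actual death counts are not given here, so the inputs below are made up for illustration:

```python
# Sketch of a pre/post daily-rate comparison with a rough Poisson-based
# interval. The counts are invented for illustration; they are NOT the
# commenters' data.
import math

pre_deaths, pre_days = 20000, 262    # Jan 1 - Sep 19 (illustrative counts)
post_deaths, post_days = 4032, 42    # Sep 20 - Oct 31 (illustrative counts)

rate_pre = pre_deaths / pre_days     # deaths per day before the storm
rate_post = post_deaths / post_days  # deaths per day after the storm
excess_total = (rate_post - rate_pre) * post_days

# Poisson variance for each daily rate, combined for the difference,
# then scaled up to the post-storm window
var_diff = pre_deaths / pre_days**2 + post_deaths / post_days**2
half_width = 1.96 * math.sqrt(var_diff) * post_days
print(round(excess_total),
      round(excess_total - half_width),
      round(excess_total + half_width))   # point estimate and rough 95% bounds
```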
