Dave Kane writes:

You wrote about the NRA conventions and firearm injuries study here.

The lead author, Anupam Jena, kindly provided me with some of the underlying data and a snippet of the code they used. You can see it all here.

The data are here.

I [Kane] wrote up a brief analysis; the R Markdown and HTML files are on GitHub.

Thanks for posting. Here is a display of the data.

The authors preferred not to share daily data. But they did share the data as it was aggregated in their letter: summed over 3-day periods the same length as the convention. I can replicate their overall estimate (the 20% drop in firearm injuries). Whether or not it is plausible that less than 1% of firearm owners attending a convention could possibly cause a 20% drop in injuries, I leave to the discussion here.
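For concreteness, the headline figure can be reproduced from the aggregate counts that come up later in this thread (129 injuries out of 10,883,433 during convention weeks, 963 out of 64,684,217 outside them). A quick Python check (my own arithmetic, not the authors' code):

```python
# Relative drop in the injury rate during convention weeks, from the
# aggregate counts quoted elsewhere in this thread.
conv_rate = 129 / 10_883_433
other_rate = 963 / 64_684_217
drop = 1 - conv_rate / other_rate
print(round(100 * drop, 1))  # -> 20.4, i.e. the ~20% relative drop
```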

Arrgg! I can’t get the image to display in a comment. Here is a link to it.

I haven’t gone to the data, only looked at your graph. Am I correct that these are the data points that correspond to the paper’s figure showing the error bars (rather than the points) for each week before and after the conventions? If so, I am shocked that they are bothering to think there is a pattern at all. The means and error bars make it appear that there is meaning in the data, but the 9 data points (one for each convention) look like noise to me. Also, is there a reason that the -3 weeks data only has 6 points while the other weekly data all show 9 points?

> Am I correct that these are the data points that correspond to the paper’s figure showing the error bars (rather than the points) for each week before and after the conventions?

Correct. There are 9 points (injury rate for each of the 9 years) for each week. In my write-up, I am able to recreate their Figure 1 (B), which is more or less the same as their Figure S1 in the supplements.

> is there a reason that the -3 weeks data only has 6 points

Yes. There is some overlap in the points:

x %>% mutate(rate = 100000*gunshot/total) %>% filter(date == -3) %>% arrange(rate)

# A tibble: 9 x 6
  year date gunshot   total convention  rate
1 2010   -3      13 1142614          F  1.14
2 2014   -3      14 1220806          F  1.15
3 2013   -3      16 1131267          F  1.41
4 2015   -3      11  771634          F  1.43
5 2007   -3      12  830806          F  1.44
6 2012   -3      36 2476577          F  1.45
7 2011   -3      23 1367774          F  1.68
8 2009   -3      20 1042228          F  1.92
9 2008   -3      21 1030124          F  2.04

> I am shocked that they are bothering to think there is a pattern at all.

Well, there is a statistically significant difference between all 9 convention weeks and all 54 non-convention weeks . . .

Which test are you using for that?

If I use a linear model on the rates it isn’t significant:

nra$rate <- nra$gunshot / nra$total * 100000

nra$nra <- nra$date==0

summary(lm(data = nra, rate~ nra))

If I use a poisson model on the counts (or with an offset to model as rates) it is significant:

summary(glm(data = nra, gunshot~nra, family = "poisson", offset = log(total) ))

summary(glm(data = nra, gunshot~nra, family = "poisson"))

If I use a negative binomial model on the counts it is not significant:

summary(glm.nb(data = nra, gunshot~nra))

If I expand out the counts to an individual-level dataset (which is how they analyzed the data in the paper, I think), I do get a significant result, but the p-value is 0.0148 rather than the p = 0.004 they reported in the paper. They do have various controls in there that we don't have access to, and they are using clustered standard errors, but it is interesting that the p-value would be pushed down by that:

data <- rbind(data.frame(date = inverse.rle(list(lengths = nra$total - nra$gunshot, values = nra$date)), gun = 0),
              data.frame(date = inverse.rle(list(lengths = nra$gunshot, values = nra$date)), gun = 1))

data$nra <- as.numeric(data$date==0)

mod.simp <- lm(data=data , gun ~ nra)

As a side note, I wonder if linear models on binary outcomes when the proportion is so close to zero might be pushing the bounds of the assumptions.

They said that their logits didn't converge, but I managed to get one done and it comes out with the same p-value as well.

mod.simp.binom <- glm(data=data , gun ~ nra, family = "binomial")

summary(mod.simp.binom)

The p-value for that is .

This is all done quickly and not to publication quality, so apologies if there are errors or misunderstandings etc.
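One way to sanity-check that 0.0148 without building the individual-level dataset: conditional on the total number of injuries, the null hypothesis says each injury falls in a convention week with probability n1/(n1+n2). A normal approximation to that binomial (my own cross-check in Python, using the aggregate counts quoted elsewhere in the thread) lands on essentially the same p-value:

```python
import math

n1, n2 = 10_883_433, 64_684_217  # convention / non-convention denominators
x1, x2 = 129, 963                # injuries in each period

k = x1 + x2                      # total injuries, conditioned on
p0 = n1 / (n1 + n2)              # null share falling in convention weeks
mean = k * p0
sd = math.sqrt(k * p0 * (1 - p0))
z = (x1 - mean) / sd
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal-approx p-value
print(round(z, 2), round(p, 4))       # z ~ -2.44, p ~ 0.0148
```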

Well, I guess if it’s statistically significant we can go straight to the TED talk.

> Which test are you using for that?

Just compare the rate in convention weeks (129 injuries out of 10,883,433 visits) versus non-convention weeks (963 injuries out of 64,684,217 visits):

> prop.test(c(129, 963), c(10883433, 64684217))

        2-sample test for equality of proportions with continuity correction

data:  c(129, 963) out of c(10883433, 64684217)
X-squared = 5.7295, df = 1, p-value = 0.01668
alternative hypothesis: two.sided
95 percent confidence interval:
 -5.339666e-06 -7.300024e-07
sample estimates:
      prop 1       prop 2
1.185288e-05 1.488771e-05

I think that this is what they are doing in the paper.
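For readers without R handy, the same numbers can be reproduced by hand. The function below is my own re-implementation of the continuity-corrected 2x2 chi-square that prop.test reports (the counts are from the comment above):

```python
import math

def prop_test_2x2(x1, n1, x2, n2):
    """Two-sample proportion test with Yates continuity correction,
    mirroring R's prop.test() for a 2x2 table."""
    N, k = n1 + n2, x1 + x2
    # Expected cell counts under the pooled null.
    e = [k * n1 / N, k * n2 / N, (N - k) * n1 / N, (N - k) * n2 / N]
    # |O - E| is the same in every cell of a 2x2 table with fixed margins.
    d = abs(x1 - e[0])
    chi2 = sum((d - 0.5) ** 2 / ei for ei in e)
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square(1) upper tail
    return chi2, p

chi2, p = prop_test_2x2(129, 10_883_433, 963, 64_684_217)
print(round(chi2, 3), round(p, 5))  # matches X-squared = 5.7295, p = 0.01668
```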

Since you are clearly more familiar with the details than I am, can you comment on the following? The paper claims to have run a multiple regression model with the week of the year (and day of week) as two of the independent variables. I don't see those results reported anywhere, but I am interested in them. Since the NRA conventions are held at the same time of year (mid April to mid May), there can easily be a seasonal effect which the week-of-year variable might have picked up (depending on how they entered that variable).

Alternatively, as Andrew originally suggested, it would be nice to see the data for each day of the year. I might have chosen a different control period rather than the entire year outside of the NRA convention period. If spring is a time of year with lower firearm injuries, then why not just compare the convention period with comparable months? And if you include the fall hunting season, that might be enough to reveal higher-than-normal firearm injuries compared with the convention months.

And the resort to multiple regression rather than logistic regression (since the latter did not converge) seems like a fairly crude attempt to get some results. Surely there are better alternatives they could have used?

If you can clarify any of these questions, I’d appreciate that.

> I don’t see those results reported anywhere, but I am interested in them.

They were not reported in detail. The supplementary appendix (pdf) includes some discussion. The authors, although helpful, have declined to release any of the relevant code.

My guess is that it does not matter much. Predicting firearm injuries is hard! I bet that even their most complex model has a trivial R^2. So, you get more-or-less the same answer as the trivial binomial test above. The rate of firearm injuries is much lower during conventions than it is outside of conventions, no matter what sort of model you use.

> it would be nice to see the data for each day of the year

The authors declined to provide daily data. I am unclear as to why. And, to be fair, I hesitate to hassle them too much. They are already better than the vast majority of authors in terms of responsiveness, at least in my experience.

> surely there are better alternatives they could have used?

Maybe, but I bet that it would not have mattered. I bet that it is, essentially, impossible to forecast whether or not person X will suffer a firearm injury. So, the world’s most complex/wonderful model would still capture almost none of the underlying variation, leaving the main conclusion, from the binomial model, unchanged.

To (mis)use Andrew’s kangaroo/feather analogy, your suggestions are fine, but it really doesn’t matter how precise we make the bathroom scale.

The reason for my points is that I suspect there may be a strong seasonal variation – where the NRA conventions occur during a lower firearm injury time of year. I haven’t found a good data source, but I did find this article in the BMJ:

“National estimates of non-fatal firearm related injuries other than gunshot wounds”

J. M. Hootman, J. L. Annest, J. A. Mercy, G. W. Ryan, S. W. Hargarten

It is US data, but from a much earlier time period (the early 1990s), and they do say "Almost half of the recreational injuries (3209 injuries) were caused by gun recoil, and nearly eight out of 10 occurred during the fall and winter."

It is the last observation that concerns me. If anywhere near 80% occur in fall and winter, then using the average weekly figures as a control is inappropriate. Since they claim to have included week of the year as an explanatory variable (somehow), I wonder exactly what they did.

Absolutely. The control should be the predicted rate from a Fourier series with yearly periodicity, fit with the convention-week data removed. Fall hunting season is obviously going to be very different from spring regardless of whether there's a convention or not.

One of my blog posts on seasonality, from the last time this came up:

http://models.street-artists.org/2016/05/02/philly-vs-bangkok-thinking-about-seasonality-in-terms-of-fourier-series/

Also, does it really make sense to express the gunshots as a rate relative to all hospital admissions? I don't think so; it should be expressed as a rate relative to the total population. Perhaps there are more sports-related injuries in high school students during the convention rather than fewer gunshots (or the like).

What total population? They don't have data for all firearm-related admissions, only when there are insurance claims from a subset of the total population. Normalizing the numbers as they do may be easier than determining the relevant population at each date.

The relevant population that I meant is the population of the US.

GunInjuries(t)/TotalUSPopulation(t)

Otherwise, if you're looking at hospitalizations, then seasonality in people's totally unrelated activities, like driving or sports or being on a ladder fixing the roof, will appear in the denominator.

Ah, I see they are maybe normalizing by commercially insured individuals, not by hospitalizations; I somehow thought that they were looking at what fraction of all hospitalizations were gun related.

It isn’t easy to get my hands on the firearm injury data (I’m working on that), but it was pretty easy to get homicide data. I selected all the gun related homicides and simply broke the year into two periods – NRA convention periods (March, April, May) and the rest of the year. Then I looked at the monthly average number of gun related homicides during those two yearly periods, over the 1976-2016 period. I can’t paste the picture here, but the monthly gun related homicide deaths are significantly (sorry: p<.0001) lower during the NRA months than the rest of the year. Homicides are not a good proxy for firearm injuries but as a crude first attempt to look at the data over the year, it sure seems like the control period should not be the rest of the year outside of the weeks surrounding the NRA conventions.

Too bad that it’s not easy to upload images, like in the Discourse software or something.

Andrew: this blog would totally benefit from a dedicated associated file-sharing platform like Nextcloud or something, where people could upload their datasets and analysis code and graph images and link it to the comments… not sure the best way. For security reasons you might need to prevent people from uploading enormous files to try to crash the server or whatever, but if it were possible somehow it’d facilitate a lot of cooperation.

Dale: Where did you get the homicide data? CDC Wonder? If you can share it that’d be great, maybe a Google Drive folder or some such thing?

How does a ~ 5 term Fourier series work to predict the seasonality? What does the residual look like for the months of the convention?

What I mean is something like:

HomicideInWeek[Y,W]/PopulationInYear[Y] = sum(a[i]*cos(2*pi*i*W/52) + b[i]*sin(2*pi*i*W/52), i from 0 to 5) + err

where W goes 0 to 51 weeks in the year

you can fit it with "lm" since it's just linear in the a[i], b[i]

leave out the weeks where the convention occurs, and look at the difference between predicted from the regression and actual in those weeks… I bet it’s basically 0

if you want to get super-fancy you can fit a loess model to the population data at the weekly level…
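A minimal sketch of that regression in Python on simulated weekly data (everything here is synthetic; the "convention weeks" and all numbers are made up for illustration, no real injury data is used):

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(52)
# Simulated weekly rate: a smooth seasonal cycle plus noise (synthetic).
true = 1.5 + 0.4 * np.cos(2 * np.pi * (weeks - 45) / 52)
y = true + rng.normal(0, 0.05, 52)

# Fourier design matrix with harmonics i = 0..5, as in the formula above
# (the i = 0 cosine term is the intercept).
K = 5
X = np.column_stack(
    [np.cos(2 * np.pi * i * weeks / 52) for i in range(K + 1)]
    + [np.sin(2 * np.pi * i * weeks / 52) for i in range(1, K + 1)]
)

# Leave out the hypothetical "convention" weeks before fitting.
conv = np.isin(weeks, [16, 17, 18])
coef, *_ = np.linalg.lstsq(X[~conv], y[~conv], rcond=None)

# Difference between the Fourier prediction and the observed values in the
# held-out weeks; no convention effect was simulated, so it is just noise.
resid = y[conv] - X[conv] @ coef
print(resid)
```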

This comment also makes me wonder about the unspecified adjustment they claim to make for seasonality. They include "week of the year" as an explanatory variable in their unreported multiple regression model. This seems like a strange way to do it. I suspect week of the year (hopefully entered as a nominal rather than a continuous variable) is a very noisy way to model seasonality, and that month would be much better (I've looked at flu data this way: weekly data are too noisy, but monthly data exhibit pretty clear seasonality). So I'm not sure that weeks are a good way to incorporate seasonal effects (hunting seasons may well be stable from year to year in terms of dates, but where weekends fall and what the weather is like probably means the weekly patterns vary a lot across years).

In the Fourier method you keep the number of terms down, so although you're computing a continuous function at the week level, it is smoothed in such a way that it can't vary wildly from week to week. With, say, 5 terms the shortest "wavelength" is 52/5, or around 10 weeks. You could maybe go to the 10th or 11th harmonic if needed, and you'd still be talking about a ~5-week wavelength.

I think using the STL (seasonal decomposition of time series data using loess) algorithm is a pretty standard method for seasonal adjustment.

It does a lot of smoothing in a lot of different directions + robustness methods put in (took me like 5 readings of the paper to finally put together what it actually does). Interestingly, the original paper suggests smoothing over the seasonal estimates (i.e. get estimates for week 1, week 2, etc., then smooth over these so there can’t be a huge jump from week 1 to week 2), but the R implementation does not appear to do this. With that said, empirically it does seem to give pretty good decompositions.

I’m sure there’s a nice Bayesian implementation of this type of decomposition; the cover picture of BDA 3rd edition is a picture of such a decomposition, clearly with more bells and whistles than base R::stl().

Might be interesting to look at the residuals of the events of interest?

Can you tell me where you are seeing that they used STL? All I can see is that they used fixed effects for the week of the year. Without more detail, I am imagining they put dummy variables for each week of the year (and only 9 years of data). I don’t see any reference to smoothing nor any details about the weekly effects.

Sorry, I wasn’t saying that they had. Haven’t read the paper. But if someone said “we adjusted for seasonality in this time series” with no further explanation, it would probably be my first guess…although putting a dummy variable for season in a regression model is up there too, depending on background of “someone”.

And, if they did that, then the whole thing is totally wrong. If you have one dummy per week and weekly data then you can fit the data exactly with the dummy regression.

I think “a reader” is advocating rather that if you could get the data you could run STL and get something better out of it than you’d get by weekly dummy.

“If you have one dummy per week and weekly data then you can fit the data exactly with the dummy regression.”

I meant “dummy variable for season”: i.e. one variable for Fall, Spring, etc. Even if it went down to weeks (i.e. one dummy variable for week 1 of the year, etc.), it would still not turn into a degenerate statistic, as long as you had two or more years of complete data.

And yes, I'm certainly not advocating that. This would certainly still be worse than an STL decomposition for several reasons: it would not remove any long-term trends, for example.
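That point is easy to check numerically: with two or more complete years, one dummy per week-of-year still leaves positive residual degrees of freedom, so the regression is not saturated. A toy sketch (Python, synthetic setup):

```python
import numpy as np

# Two full years of weekly data: 104 observations, one dummy per week-of-year.
n_years, n_weeks = 2, 52
week = np.tile(np.arange(n_weeks), n_years)              # week-of-year index
X = (week[:, None] == np.arange(n_weeks)).astype(float)  # 104 x 52 dummies

rank = np.linalg.matrix_rank(X)
print(rank, X.shape[0] - rank)  # 52 parameters, 52 residual degrees of freedom
```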

The homicide data is here: http://murderdata.org/. The firearm injury data is http://www.gunviolencearchive.org/. At the latter you can request all data (observations are the individual incidents) for a given time period – however, the data appears on many separate pages and I am not good enough with automatic retrieval to know how to automate collecting the data for an entire year (it would take very many pulls from the data source). So, I’ve asked them if I can get a full year of data more easily. If you know how to do that and can get it, I’d love to also see what an entire year looks like. As for the Fourier series, I’d be interested. I have some other methods I’d like to try as well.

I don’t think the Fourier analysis is going to work well with the Christmas holidays seasonality, for example.

It depends on what you mean by "work" and "seasonality." It will absolutely be the case that you'll have short-term "signals" lasting 1, 2, or 3 weeks that are left after doing the Fourier analysis. That's actually an intended effect. In fact, because you're looking for a deviation in one specific week within a 3-week sequence (or the like), all these high-frequency signals need to remain in the residuals in order to detect them.

So it will "work" in the sense that the Fourier prediction for the week of the conference will be a good control provided you use enough but not too many Fourier terms, and that's fairly independent of what's going on in the Christmas week (I agree that if they held their conferences very close to Christmas you'd have issues).

I think that this seasonality discussion is overblown, at least with regard to the data that the authors use for this paper. If there were enough seasonality to matter, then we would see a difference in the rate (or even the raw number of gunshot injuries) between the weeks before and the weeks after the convention.

> read_csv("yearly.csv") %>% mutate(period = case_when(date < 0 ~ "before", date > 0 ~ "after", date == 0 ~ "convention")) %>% group_by(period) %>% summarize(gunshots = sum(gunshot), totals = sum(total))

# A tibble: 3 x 3
  period     gunshots   totals
1 after           484 32372613
2 before          479 32311604
3 convention      129 10883433

The total gunshot injuries and the rates are almost identical before/after the NRA convention. So I don't think seasonality matters, given that we only have data for the 3 weeks before/after the convention.
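Computing the implied rates per 100,000 from the aggregated table above makes the point directly (my own arithmetic on the numbers in this comment):

```python
# (gunshots, total) for each period, copied from the tibble above.
periods = {"before": (479, 32_311_604),
           "after": (484, 32_372_613),
           "convention": (129, 10_883_433)}
rates = {name: 1e5 * g / t for name, (g, t) in periods.items()}
for name, rate in rates.items():
    print(name, round(rate, 3))
# before (~1.48) and after (~1.50) are nearly identical; convention is ~1.19
```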