I noticed this article in the newspaper today:

A simple rule change in Ivy League football games has led to a significant drop in concussions, a study released this week found.

After the Ivy League changed its kickoff rules in 2016, adjusting the kickoff and touchback lines by just five yards, the rate of concussions per 1,000 kickoff plays fell to two from 11, according to the study, which was published Monday in the Journal of the American Medical Association. . . .

Under the new system, teams kicked off from the 40-yard line, instead of the 35, and touchbacks started from the 20-yard line, rather than the 25.

The result? A spike in the number of touchbacks — and “a dramatic reduction in the rate of concussions” . . .

The study looked at the rate of concussions over three seasons before the rule change (2013 to 2015) and two seasons after it (2016 to 2017). Researchers saw a larger reduction in concussions during kickoffs after the rule change than they did with other types of plays, like scrimmages and punts, which saw only a slight decline. . . .

I was curious so I followed the link to the research article, “Association Between the Experimental Kickoff Rule and Concussion Rates in Ivy League Football,” by Douglas Wiebe, Bernadette D’Alonzo, Robin Harris, et al.

From the Results section:

Kickoffs resulting in touchbacks increased from a mean of 17.9% annually before the rule change to 48.0% after. The mean annual concussion rate per 1000 plays during kickoff plays was 10.93 before the rule change and 2.04 after (difference, −8.88; 95% CI, −13.68 to −4.09). For other play types, the concussion rate was 2.56 before the rule change and 1.18 after (difference, −1.38; 95% CI, −3.68 to 0.92). The difference-in-differences analysis showed that 7.51 (95% CI, −12.88 to −2.14) fewer concussions occurred for every 1000 kickoff plays after vs before the rule change.

I took a look at the table and noticed some things.

First, the number of concussions was pretty high and the drop was obviously not statistical noise: 126 (that is, 42 per year) for the first three years, down to 33 (15.5 per year) for the last two years. With the exception of punts and FG/PATs, the number of cases was large enough that the drop was clear.

Second, I took a look at the confidence intervals. The confidence interval for “other play types combined” includes zero: see the bottom line of the table. Whassup with that?

I shared this example with my classes this morning and we also had some questions about the data:

– How is “concussion” defined? Could the classification be changing over time? I’d expect that, what with increased concussion awareness, that concussion would be diagnosed more than before, which would make the declining trend in the data even more impressive. But I don’t really know.

– Why data only since 2013? Maybe that’s only how long they’ve been recording concussions.

– We’d like to see the data for each year. To calibrate the effect of a change over time, you want to see year-to-year variation, in this case a time series of the 5 years of data. Obviously the years of the concussions are available, and they might have even been used in the analysis. In the published article, it says, “Annual concussion counts were modeled by year and play type, with play counts as exposures, using maximum likelihood Poisson regression . . .” I’m not clear on what exactly was done here.

– We’d also like to see similar data from other conferences, not just the Ivy League, to see changes in games that did not follow these rules.

– Even simpler than all that, we’d like to see the raw data on which the above analysis was based. Releasing the raw data—that would be trivial. Indeed, the dataset may

alreadybe accessible—I just don’t know where to look for it. Ideally we’d move to a norm in which it was just expected that every publication came with its data and code attached (except when not possible for reasons of privacy, trade secrets, etc.). It just wouldn’t be a question.

The above requests are not meant to represent devastating criticisms of the research under discussion. It’s just hard to make informed decisions without the data.

**Checking the calculations**

Anyway, I was concerned about the last row of the above table so I thought I’d do my best to replicate the analysis in R.

First, I put the data from the table into a file, football_concussions.txt:

y1 n1 y2 n2 kickoff 26 2379 3 1467 scrimmage 92 34521 28 22467 punt 6 2496 2 1791 field_goal_pat 2 2090 0 1268

Then I read the data into R, added a new row for “Other plays” summing all the non-kickoff data, and computed the classical summaries. For each row, I computed the raw proportions and standard errors, the difference between the proportion, and the standard errors of that difference. I also computed the difference in differences, comparing the change in the concussion rate for kickoffs to the change in concussion rate for non-kickoff plays, as this comparison was mentioned in the article’s results section. I multiplied all the estimated differences and standard errors by 1000 to get the data in rates per thousand.

Here’s the (ugly) R code:

concussions <- read.table("football_concussions.txt", header=TRUE) concussions <- rbind(concussions, apply(concussions[2:4,], 2, sum)) rownames(concussions)[5] <- "other_plays" compare <- function(data) { p1 <- data$y1/data$n1 p2 <- data$y2/data$n2 diff <- p2 - p1 se_diff <- sqrt(p1*(1-p1)/data$n1 + p2*(1-p2)/data$n2) diff_in_diff <- diff[1] - diff[5] se_diff_in_diff <- sqrt(se_diff[1]^2 + se_diff[5]^2) return(list(diffs=data.frame(data, diff=diff*1000, diff_ci_lower=(diff - 2*se_diff)*1000, diff_ci_upper=(diff + 2*se_diff)*1000), diff_in_diff=diff_in_diff*1000, se_diff_in_diff=se_diff_in_diff*1000)) } print(lapply(compare(concussions), round, 2))

And here's the result:

$diffs y1 n1 y2 n2 diff diff_ci_lower diff_ci_upper kickoff 26 2379 3 1467 -8.88 -13.76 -4.01 scrimmage 92 34521 28 22467 -1.42 -2.15 -0.69 punt 6 2496 2 1791 -1.29 -3.80 1.23 field_goal_pat 2 2090 0 1268 -0.96 -2.31 0.40 other_plays 100 39107 30 25526 -1.38 -2.05 -0.71 $diff_in_diff [1] -7.5 $se_diff_in_diff [1] 2.46

The differences and 95% intervals computed this way are similar, although not identical to, the results in the published table--but there are some differences that baffle me. Let's go through the estimates, line by line:

- Kickoffs. Estimated difference is identical 95% interval is correct to two significant digits. This unimportant discrepancy could be coming because I used a binomial model and the published analysis used a Poisson regression.

- Plays from scrimmage. Estimated difference is the same for both analyses; lower confidence bound is almost the same. But the upper bounds are different: I have -0.69; the published analysis is -0.07. I'm guessing they got -0.70 and they just made a typo when entering the result into the paper. Not as bad as the famous Excel error, but these slip-ups indicate a problem with workflow. A problem I often have in my workflow too, as in the short term it is often easier to copy numbers from one place to another than to write a full script.

- Punts. Estimated difference and confidence interval work out to within 2 significant digits.

- Field goals and extra points. Same point estimate, but confidence bounds are much different, with the published intervals much wider than mine. This makes sense: There's a zero count in these data, and I was using the simple sqrt(p_hat*(1-p_hat)) standard deviation formula which gives too low a value when p_hat=0. Their Poisson regression would be better.

- All non-kickoffs. Same point estimate but much different confidence intervals. I think they made a mistake here: for one thing, their confidence interval doesn't exclude zero, even though the raw numbers show very strong evidence of a difference (30 concussions after the rule change and 100 before; even if you proportionally scale down the 100 to 60 or so, that's still sooo much more than 30; it could not be the result of noise). I have no idea what happened here; perhaps this was some artifact of their regression model?

This last error is somewhat consequential, in that it could lead readers to believe that there's no strong evidence that the underlying rate of concussions was declining for non-kickoff plays.

- Finally, the diff-in-diff is three standard errors away from zero, implying that the relatively larger decline in concussion rate in kickoffs, compared to non-kickoffs, could not be simply explained by chance variation.

**Difference in difference or ratio of ratios?**

But is the simple difference the right thing to look at?

Consider what was going on. Before the rule change, concussion rates were much higher for kickoffs than for other plays. After the rule change, concussion rates declined for all plays.

As a student in my class pointed out, it really makes more sense to compare the *rates of decline* than the absolute declines.

Put it this way. If the probabilities of concussion are:

- p_K1: Probability of concussion for a kickoff play in 2013-2015

- p_K2: Probability of concussion for a kickoff play in 2016-2017

- p_N1: Probability of concussion for a non-kickoff play in 2013-2015

- p_N2: Probability of concussion for a non-kickoff play in 2016-2017.

Following the published paper, we estimated the difference in differences, (p_K2 - p_K1) - (p_N2 - p_N1).

But I think it makes more sense to think multiplicatively, to work with the ratio of ratios, (p_K2/p_K1) / (p_N2/p_N1).

Or, on the log scale, (log p_K2 - log p_K1) - (log p_N2 - log p_N1).

What's the estimate and standard error of this comparison?

The key step is that we can use relative errors. From the Poisson distribution, the relative sd of an estimated rate is 1/sqrt(y), so this is the approx sd on the log scale. So, the estimated difference in difference of log probabilities is (log(3/1467) - log(26/2379)) - (log(30/25526) - log(100/39107)) = -0.90. That's a big number: exp(0.90) = 2.46, which means that the concussion rate fell over twice as fast in kickoff than in non-kickoff plays.

But the standard error of this difference in difference of logs is sqrt(1/3 + 1/26 + 1/30 + 1/100) = 0.64. The observed d-in-d-of-logs is -0.90, which is less than two standard errors from zero, thus not conventionally statistically significant.

So, from these data alone, we cannot confidently conclude that the relative decline in concussion rates is more for kickoffs than for other plays. That estimate of 3/1467 is just too noisy.

We could also see this using a Poisson regression with four data points. We create the data frame using this (ugly) R code:

data <- data.frame(y = c(concussions$y1[c(1,5)], concussions$y2[c(1,5)]), n = c(concussions$n1[c(1,5)], concussions$n2[c(1,5)]), time = c(0, 0, 1, 1), kickoff = c(1, 0, 1, 0))

Which produces this:

y n time kickoff 1 26 2379 0 1 2 100 39107 0 0 3 3 1467 1 1 4 30 25526 1 0

Then we load Rstanarm:

library("rstanarm") options(mc.cores = parallel::detectCores())

And run the regression:

fit <- stan_glm(y ~ time*kickoff, offset=log(n), family=poisson, data=data) print(fit)

Which yields:

Median MAD_SD (Intercept) -6.0 0.1 time -0.8 0.2 kickoff 1.4 0.2 time:kickoff -0.9 0.6

The coefficient for time is negative: concussion rates have gone down a lot. The coefficient for kickoff is positive: kickoffs have higher concussion rates. The coefficient for the interaction of time and kickoff is negative: concussion rates have been going down faster for kickoffs. But that last coefficient is less than two standard errors from zero. This is just confirming with our regression analysis what we already saw with our simple standard error calculations.

OK, but don't I think that the concussion rate really was going down faster, even proportionally, in kickoffs than in other plays? Yes, I do, for three reasons. In no particular order:

- The data *do* show a faster relative decline, albeit with uncertainty;

- The story makes sense given that the Ivy League directly changed the rules to make kickoffs safer;

- Also, there were many more touchbacks, and this directly reduces the risks.

That's fine.

**Summary**

- There seem to be two mistakes in the published paper: (1) some error in the analysis of the plays from scrimmage which led to too wide a standard error, and (2) a comparison of absolute rather than relative probabilities.

- It would be good to see the raw data for each year.

- The paper's main finding---a sharp drop in concussions, especially in kickoff plays---is supported both by the data and our substantive understanding.

- Concussion rates fell faster for kickoff plays, but they also dropped a lot in non-kickoff plays too. I think the published article should've emphasized this finding more than they did, but they perhaps here were hindered by the error they made in their analysis leading to an inappropriately wide confidence interval for non-kickoff plays. The decline in concussions for non-kickoff plays is consistent with a general rise in concussion awareness.

**P.S.** I corrected a mistake in the above Poisson regression code, pointed out by commenter TJ (see here).

So as somebody who studies this stuff (and wrote a not-dissimilar paper, here as a pre-print: https://arxiv.org/abs/1805.01271), but who had nothing to do with this analysis, let me take a stab at some of the questions from you and your students:

-How is concussion defined? However the diagnosing and reporting athletic trainers chose to define them, as this data comes from a conference-level surveillance system that relies on data provided by them.

Also I think you’re dead-on about expecting reporting changes to bias more recent numbers upwards. That appears to be the case in the NFL (SEE e.g. https://www.footballoutsiders.com/stat-analysis/2017/truth-behind-rising-injury-rates)

-Why only data since 2013? Yeah, could be because that’s how far back they had data – the NCAA-wide injury surveillance system goes back further than that, but not sure about Ivy League or for concussions specifically. Alternatively, they had 2 years post-change so maybe they didn’t want to use a ton more from the pre-change period? Second alternatively, we have 5 fingers so that’s a nice number of years to have?

-Data for each year: Yeah, I’d definitely like to see this and provided it in my article. That’s the only way that made sense to me to look at the data. In their defense, though, this is a research letter so maybe they just didn’t have space. Not sure if JAMA allows online supplements for research letters, either.

-Similar conferences: as I alluded to above there IS an NCAA-wide injury surveillance program across all sports, including football. Not sure if it includes any Ivy League schools, but you should be able to separate those out and look at the sample across other conferences. Maybe that would be a nice follow-up study.

-Raw data: Yeah. I wish I could release mine, as well, but it’s from a proprietary database. Commercial incentives don’t line up with open science, but I don’t have the data otherwise.

On the D-i-D or R-o-R front…that’s an interesting issue I hadn’t thought about for (which is kind of scary given that I teach difference-in-differences). The question is would we expect other, non-kickoff-rule changes over time to reduce concussions on kickoffs by 50% or by 1.38 per 1,000 plays, right? Given that kickoffs are fundamentally different from other plays, I might argue that other changes (reporting changes, changing play and tackling styles, other new rules, etc.) shouldn’t be expected to reduce kickoff concussions by 50%. Maybe more than 1.38 per 1,000 plays, though, since they do have further to fall. I’m genuinely not sure. It’s probably somewhere in the middle. An intriguing thought.

“But I think it makes more sense to think multiplicatively, to work with the ratio of ratios…”

Can you (or someone) elaborate on this? The authors don’t agree, so clearly there is a point of discussion.

Garnett:

I don’t know that the authors of the paper disagree with me on this. My guess is that they just didn’t think it through.

In any case, my reasoning is that I understand these effects multiplicatively.

For example, suppose that the rate of concussions fell from 10 per 1000 to 5 per 1000 in kickoff plays, and the rate fell from 2 per 1000 to 1 per 1000 in non-kickoff plays. (That’s not quite what happened in the data; I’m just using these round numbers for simplicity.) Then I’d say that concussion rates dropped by 50% in both sorts of plays, and this would be understandable as a general drop in concussion rates, nothing special for kickoffs.

Thanks for your reply.

Your example is, of course, correct. It’s also correct to say from your example that there are five fewer concussions in kickoff plays and one fewer concussions in non-kickoff plays. I don’t see how either statement, both true, motivates preference for the multiplicative model.

There have been many arguments in epidemiology, and other areas, about relative risks vs risk differences. This seems to be a similar situation, and I don’t see why the multiplicative model is necessarily preferable to the additive model.

Both numbers should be presented, and more besides.

There are a lot more non-kickoff plays than kickoff plays. Something like concussions per _game_ on kickoff plays and on non-kickoff plays might be more relevant.

On the other hand, the rate per thousand plays of one type or another is a better way of seeing which type of play is especially dangerous, and you can gauge success by looking at the reduction in the number of concussions per thousand plays.

On the other other hand, if you do something that cuts the risk of a concussion by 40% for every type of play, it would be nice to be able to see that by looking at percent reduction rather than absolute reduction, as Andrew says.

I bet we could think of other normalizations that would make sense too.

There’s no reason to expect that just one way of looking at it will be better than all others in all cases.

Garnett makes an interesting point regarding modeling and the choice of a figure of merit:

“There have been many arguments in epidemiology, and other areas, about relative risks vs [absolute] risk differences.”

Drug companies tend to report relative risk reduction because it looks more dramatic than absolute risk reduction,especially when a disease is rare. If the analogy is useful, is there anything in this concussion situation that relates to NNT, the number needed to treat, the inverse of absolute risk reduction?

I always think of this question as easiest to think about by thinking about how rates are bounded at 0. So imagine that there are 100 concussions per 1000 plays on kickoff plays and 1 concussion per 1000 plays on non-kickoff plays. And imagine that this decreases to 95 concussions per 1000 plays on kickoff plays. To have the same drop-off on an additive scale, the non-kickoff plays would have to drop to -4 concussions per 1000 plays… But that isn’t going to happen! (I think that’s what Andrew is trying to gesture at with his answer, BTW).

A different way of putting this is that we expect (almost) all effects to work multiplicatively in this context. In order to reduce the chance of a concussion on some particular play, there must be a chance of a concussion, and then we have to do something to reduce that chance, and that something is going to operate on the first chance: if that first chance is large, the harm reduction will be large, and vice versa.

I find this latter sort of explanation less intuitive than the explanation that takes advantage of the 0 bound, but it does point us to some edge cases where additive risk changes could be appropriate. So imagine that there are two disjoint Ways to get concussions, W1 and W2, with the total risk = W1 + W2 (and your data doesn’t let you distinguish between the two Ways), and then somebody actually eliminates W1: then it would be appropriate to use an additive model. Of course, it’s more likely that you’d find a way to reduce W1, so you’re sorta part-way between an additive and multiplicative model (p*W1+W2 vs. W1+W2), with the additive model being more accurate for large W2.

So basically, to convince someone that the additive scale was actually correct here, you’d need to convince them that something like the above was happening… maybe concussions come from violent tackles and from… something else? And the kickoff-improving treatment only affects the violent tackle part of the equation. Maybe there are other ways to get an (approximately) additive effect, but I think they’re all going to be pretty rare.

BTW, the above justification explains why you’d use RoR for analyzing statistically significant differences: it says nothing about how you’d want to illustrate these effects descriptively, which depends on your goal. It’s often the case that public health practitioners need to know the absolute size of the effect for e.g. cost-benefit analysis purposes. So maybe first you’d want to figure out whether you have evidence that the kickoff-improving treatment works (using the RoR scale) and then, if it does, describe the impacts on an absolute scale (e.g. it eliminates 5 concussions per 1000 kickoffs). Due to these sorts of base rate problems, this can be useful, especially if the baseline risks were operating in the opposite direction (i.e. imagine that you successfully proved that you made kickoff plays safer, but kickoff plays were ex ante the safest type of play; it might reasonably be asked: did that actually matter at all? Or would we be better off ignoring this treatment and pursuing a treatment that operates on the actually dangerous plays?).

So, using the ratio-of-ratios analysis, if there was data for each year, would you run a 3-way interaction model?

y ~ year*period*kickoff

Where ‘year’ is timepoint for each year, and ‘period’ is a binary indicator before rule change vs. after rule change.

This way you would get the ratio of the change over time of probabilities for a concussion for a kickoff play to the change over time of probability for a concussion in a non-kickoff play in the ‘after rule change’ period?

Or, would you do a two-way interaction with year*kickoff?

Or, would you just do the two-way interaction that you described, and use the ‘year’ data for visualization purposes?

It seems to me that this type of rule change intervention would be expected to have an effect of an almost immediate decline (if it worked) that maintained (and not a gradual over time decline), making the year*kickoff interaction that ignores the intervention periods, not quite right. The 3-way interaction seems interesting, but maybe not relevant in light of an immediate decline that maintained, and perhaps not enough timepoints either.

Jd:

With 5 years of data, I’d start by plotting things. But, sure, a natural first step of modeling would be a linear time trend plus a discontinuity. And you could also interact the time discontinuity with the kickoff variable. There are lots of other ways one could model it too.

Cool. This is a very helpful blog post.

“There are lots of other ways one could model it too.”

Considering that the rule change involved kickoffs, so we are interested in concussions in kickoffs, what other ways would you potentially model this data, given that you had the data for each year? …I have already described what comes to my mind…what other models would be helpful?

Jd:

Yes, we’re interested in kickoffs, but we’re interested in other plays too, so all the modeling is relevant. One choice is how to model the time trend. With just 5 years, a linear trend seems natural, but other models are possible. Also it would make sense to have intercepts vary by year: if there’s lots of year-to-year variation, this will make us less certain about the effect of any particular intervention.

Ah, ok. Makes sense.

When having intercepts vary by year (group level effect), one would not include year in the population level effects, correct? So, maybe something like y~period*kickoff + (1|year) ?

This post, with thought process and R code imbedded, is quite nice. It’s interesting that a careful reading and some quick re-analysis found what look like minor and not-so-minor errors in the paper. It makes me think of the “Many Analysts, One Data Set” paper at http://journals.sagepub.com/doi/10.1177/2515245917747646 which included “peer-review” (in quotes because this is different from the usual pre-publication peer review) of analyses, among other things. If a scientific question is worth investigating, maybe it’s worth spending a little more time and money to have this sort of peer review, and to have more than one analysis included in the manuscript.

Vince:

Yes, this is the efficiency argument for post-publication review. All papers, even the ones that will never be read again, get pre-publication review, so pre-publication review is, by necessity, shallow. Post-publication review is focused on those publications that get attention, that are of interest to some people, so it can be more focused.

And yes, it’s striking how a careful reanalysis can uncover mistakes, even in such a short article. I had the same experience with that Case and Deaton paper.

Statistics is hard, and it’s easy to make mistakes, even in relatively simple analyses.

” – How is “concussion” defined? “

… and how is it specifically linked to “kickoffs” versus other events in the game ?

seems unlikely that concussions are always immediately noticed & diagnosed during the many game events… and perhaps only noticed after a game.

are there concussions that go unnoticed?

are there types/degrees of formal “concussion” diagnoses in medical practice?

seems the “raw data” is very vulnerable to error in its collection.

As a volunteer trainer and emergency physician I would say that concussion is almost invariably diagnosed very soon after it occurs – a minute or so of observation and a few brief questions are all that are needed.

Many thanks. This is the most helpful thing I’ve read of late.

P.S. Do an edX or Coursera course, please.

Thanatos:

My colleague and I signed up to do a Coursera course but then we backed out because we weren’t quite ready to do all the work to create the course and keep it running. But I’d be happy if other people were to create such a course and use my examples!

FWIW, I couldn’t get glm() or stan_glm() to run the poisson model above–it complained about wanting starting values–but it worked when I moved the offset into the formula. I got the same numbers. https://gist.github.com/tjmahr/798337caa1438a0538edc028cad71e7b

library(“rstanarm”)

options(mc.cores = parallel::detectCores())

fit <- stan_glm(

y ~ time * kickoff + offset(log(n)),

family = poisson,

data = data)

print(fit)

Tj:

My bad. That should’ve been offset=log(n), not offset=n. I guess I’d fixed the code in my R script but forgot to fix it in the post. My fault for not using a reproducible workflow.

My guess for the Other Types combined confidence interval is that they dropped the minus sign for the upper bound during copy-paste. Their upper bound should be -0.92, not +0.92 I’d bet.