The most recent issue of Significance had a very interesting article by Stephen Senn, in which he wrote about the TeGenero TGN1412 drug trial catastrophe which occurred in March 2006, when six volunteers received the drug and two received a placebo. The six volunteers who received the drug almost immediately had massive immune system reactions – specifically a cytokine storm – and were hospitalised for at least a month.

What we have here is the potential for a statistical analysis. We’ve got a 2×2 table, so let’s do the stats.

| Cytokine storm | Placebo | Drug |
|---|---|---|
| Yes | 0 | 6 |
| No | 2 | 0 |

A 2×2 table.

We obviously can’t do a chi-square test, as the sample is too small. But we can do a Fisher’s exact test. If we do that we get a one-tailed p of 0.036. It’s a one-tailed test, so our p-value cut off is 0.025, so we don’t have evidence that the drug caused the cytokine storm, and all the subsequent ills.
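For anyone who wants to check the 0.036, here is a minimal sketch in base R. The layout of the matrix (drug in the first column, so that `alternative = "greater"` means "more storms on the drug") is my own choice, not something taken from Senn's article.

```r
# 2x2 table from the trial: rows = cytokine storm (yes/no), columns = arm.
# Drug goes in the first column so that alternative = "greater" tests for
# higher odds of a storm on the drug than on placebo.
tab <- matrix(c(6, 0,
                0, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(storm = c("yes", "no"),
                              arm   = c("drug", "placebo")))

# One-sided Fisher exact test (conditions on both margins).
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The same number by hand: with both margins fixed, only 1 of the
# choose(8, 2) = 28 equally likely ways of picking the two placebo
# volunteers puts both of them on the two storm-free volunteers.
p_by_hand <- 1 / choose(8, 2)

round(c(fisher = p_fisher, by_hand = p_by_hand), 4)
```

Both entries should come out at about 0.0357, which is the one-tailed p quoted above.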

But that’s got to be a silly thing to say. It’s obvious that the drug did cause the cytokine storm. It’s not just barely significant; it’s really, really obvious. Why is it so obvious? It’s obvious because people don’t have cytokine storms every day. In fact, if you haven’t got the Spanish Flu we’re pretty safe saying that you will never have a cytokine storm. In other words, it’s not just the data that we have obtained here that we need to take into account. We need to take into account that the probability of ever having a cytokine storm is very low. In other words, we need to take into account the prior probability. And so we have just done a Bayesian analysis.

This is a great example. I have no knowledge of cytokine storms, so I’m not quite sure how to put a prior distribution on this. One way to think about this problem is to imagine how you would analyze the data if you only had the 6/6 data from the treated group, and no data from the controls. Or what if the treated group were only 3/6? If cytokine storms are really rare, then even 3/6 cases (even 1/6?) would be evidence of a problem. (Conversely, even one cytokine storm among the controls would cast doubt upon the prior assumption that cytokine storms are extremely rare in the population represented by the study.)
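To make the "even 1/6 would be evidence" point concrete, here is a rough sketch. The background rate is entirely made up for illustration (I have no real number, and 1 in 1,000 is probably far too high); the point is only how quickly the binomial tail shrinks.

```r
# ASSUMPTION: a purely illustrative background rate of cytokine storms per
# volunteer during the trial window. The post gives no number; 1 in 1,000
# is a placeholder, not an estimate.
background_rate <- 1e-3
n_treated <- 6

# Probability of seeing at least k storms among the 6 treated volunteers
# if the drug did nothing and storms occurred only at the background rate.
k <- c(1, 3, 6)
p_at_least_k <- pbinom(k - 1, size = n_treated, prob = background_rate,
                       lower.tail = FALSE)

data.frame(storms_at_least = k, probability = signif(p_at_least_k, 3))
```

Under that assumed rate, even a single storm among six volunteers has a probability of well under 1 percent, and 6 out of 6 is essentially impossible by chance.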

**A question**

I’m curious what Stephen Senn (or Jeremy Miles) would conclude if all we had were data on 6 controls, where, say, 1 had a cytokine storm and 5 didn’t. It sounds like this would be enough to stop the clinical trial. In that case, the prior distribution would certainly be relevant!

**Just for laffs**

I analyzed the data using our default weakly informative prior distribution. In R:

    > library(arm)
    > y <- rep (c(1,0), c(6,2))
    > x <- y
    > M1 <- bayesglm (y~x, family=binomial(link="logit"))
    > display(M1)
    bayesglm(formula = y ~ x, family = binomial(link = "logit"))
                coef.est coef.se
    (Intercept) -1.86     1.88
    x            4.80     2.31
    n = 10, k = 2
    residual deviance = 1.2, null deviance = 9.0 (difference = 7.8)

So the difference is statistically significant! As noted above, this doesn’t really resolve the real issue, since apparently the problem would arise even with only 1 storm among the treated units and no control data at all. But I was just curious how this would work out.
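Here is a small sketch of how one might read those coefficients, using only base R and the numbers printed above. I am assuming that "statistically significant" is meant in the rough sense of the estimate being more than two standard errors from zero.

```r
# Point estimates and standard error copied from the display(M1) output.
intercept <- -1.86
slope     <-  4.80
slope_se  <-  2.31

# The inverse logit (plogis in base R) converts log-odds to probabilities.
p_placebo <- plogis(intercept)          # estimated storm probability, x = 0
p_drug    <- plogis(intercept + slope)  # estimated storm probability, x = 1

# Rough check of significance: estimate divided by its standard error.
z_slope <- slope / slope_se

round(c(p_placebo = p_placebo, p_drug = p_drug, z = z_slope), 2)
```

This gives estimated storm probabilities of about 0.13 on placebo and 0.95 on the drug, with the coefficient for x roughly 2.08 standard errors from zero. The weakly informative prior is what pulls the estimates away from 0 and 1.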

**By the way, don’t do the so-called Fisher exact test**

Senn discusses how the so-called Fisher exact test does not give a statistically significant result in this example. In any case, as I’ve written elsewhere (see Section 3.3 of this paper), the so-called Fisher exact test makes no sense in this sort of problem, where only one of the two margins is specified by the experimental design.

All I know about cytokine storms is what I read on Wikipedia. I guess if 1 person had a cytokine storm, you'd be pretty worried about your trial.

My thanks to Jeremy Miles for drawing this to my attention.

A fuller discussion of this example is contained in the RSS Working Party report,

1. Working Party on Statistical Issues in First-in-Man Studies. Statistical issues in first-in-man studies, Journal of the Royal Statistical Society, Series A 2007; 170: 517-579.

where we write:

"There are several possible approaches to calculating such a P-value. Fisher’s exact test is one of the most common. If this is used, a one-sided P-value of 0.0357 is obtained. This test is not uncontroversial, because it conditions on both margins, but whereas it was known in advance that the treatments would split 6 to 2 it was not known that the side-effects would also split 6 to 2. Fisher’s exact test assesses the unusualness of the pattern (all six side-effects under TGN1412 and none under placebo) given that the 6 to 2 split has occurred, but the split itself is suggestive. A test that does not condition on both margins is Barnard’s test, which yields a one-sided P-value of 0.0111.

It is clear, however, that neither of these analyses does justice to the result. All commentators are convinced, with good reason, that the drug is highly toxic in the dose given; these fairly modest P-values do not begin to do justice to that degree of conviction (Senn, 2006). This raises the issue as to how such judgments are made. There are two plausible sources of further information not conveyed by Table 7: first, the background knowledge that adverse reactions of the sort that occurred are almost impossible in subjects given placebo; secondly, the timing of the reactions. (The use of timing as a means to judge causality in epidemiology is receiving increasing attention: Farrington and Whitaker (2006).)"
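The Barnard figure quoted above can be reproduced with a short base R sketch. This is not taken from the report: it relies on my own simplification that, because the observed table is the most extreme one possible for a 6 to 2 split, the one-sided unconditional P-value is just the probability of that single table, maximised over the unknown common probability of a reaction.

```r
# Probability of the observed table (6/6 storms on drug, 0/2 on placebo)
# when both arms share a common storm probability p (the null hypothesis).
prob_observed <- function(p) dbinom(6, 6, p) * dbinom(0, 2, p)

# Barnard-style unconditional p-value: take the worst case over p.
# The observed table is the most extreme one possible, so no other
# tables need to be added to the tail.
opt <- optimize(prob_observed, interval = c(0, 1), maximum = TRUE)

round(c(p_at_max = opt$maximum, p_value = opt$objective), 4)
```

The maximum sits close to p = 0.75, where the probability is 0.75^6 × 0.25^2, or about 0.0111.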

You will see that the issue of the margins being fixed or not is addressed. However, my own view is that, where one is prepared to take a frequentist view, Fisher's exact test is a good test to use even though one of the margins is not fixed in advance, although it fails to recover some relevant information in the most extreme case. (Here the fact that both margins split 2:6 is itself suggestive.) The reasons are given in:

Streitberg, B, Röhmel, J. Alternatives to Fisher's exact test?, Biometrie und Informatik in Medizin und Biologie 1991; 22: 139-146.

However, here it is quite clear that the background information is extremely important and analysis with 'uninformative' prior distributions is a bad idea.

As someone who uses Fisher's Exact test a lot (almost always where both margins are fixed, I should point out), your point is well taken, but the problem here, even with Barnard's test which conditions on only one margin, is that 0/2 for the controls is weak evidence that the true effect is anywhere near zero. As you point out, if cytokine storms were thought to be the problem going into the experiment, you'd have needed to have enough controls to have some power to measure the true probability. Obviously, what motivated the use of a control group was not the possibility of a cytokine storm but the fear of, for example, a positive placebo effect. For this specific effect, the control group was useless, and we are left with a 6/6 in the test group, for which the 95 percent confidence interval for the probability of cytokine storms is 0.54-1 (Clopper-Pearson) or 0.60-1 (Blyth-Still-Casella). That's ignoring the fact that the effect is rare. If we want to be a little more cautious (99 percent confidence interval) we still get a lower limit of roughly 40 percent. Even at a 99.9 percent confidence level we get a lower limit on the confidence interval of 0.2817. Since this is dramatically higher than the probability of incidence in the population (hereby taken as the set of all controls) the danger is amply demonstrated.
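The Clopper-Pearson lower limits above can be checked with a short sketch in base R. `binom.test` reports the Clopper-Pearson interval; the closed form is the standard special case for x = n. The Blyth-Still-Casella interval is not in base R, so I leave it out.

```r
# Two-sided Clopper-Pearson (exact) intervals for 6 storms in 6 volunteers.
conf_levels <- c(0.95, 0.99, 0.999)
lower_cp <- sapply(conf_levels, function(cl) {
  binom.test(6, 6, conf.level = cl)$conf.int[1]
})

# With x = n the lower limit has a closed form: (alpha / 2)^(1 / n).
lower_closed <- ((1 - conf_levels) / 2)^(1 / 6)

data.frame(conf_level        = conf_levels,
           lower_binom_test  = round(lower_cp, 4),
           lower_closed_form = round(lower_closed, 4))
```

The lower limits come out at about 0.54, 0.41 and 0.28, in line with the figures quoted in the comment.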

One more thing: there still seems to be a problem with your one out of six hypothesis. Suppose the one person had been struck by a falling meteorite. Surely, while highly unlikely in the population, you wouldn't attribute causality to TeGenero and stop the trial. You need more background information than a rare event — you need a rare event which is plausibly (even by a reasonably obscure pathway) connected to the trial. My point here is that Bayesianism requires a prior not only on cytokine storms, but a prior on cytokine storms connected to this trial; I'm not sure I know how you'd make a sensible guess about this prior conditional probability before the trial started. Now if all six test group members had been struck by six different falling meteorites… what would you do then? Stop the trial? Why? Suppose two had been in fatal car accidents, one had committed suicide, one had a cytokine storm, one got leukemia, and the other was cured instantly. What then?

Let me be facetious: if the six group members had been hit by six different falling meteorites, our main problem would not be cytokine storms.

But with regard to that last question, where people are actors in their own health (rather than simple victims, as in the meteorite incident):

"Suppose two had been in fatal car accidents, one had committed suicide, one had a cytokine storm, one got leukemia, and the other was cured instantly. What then?"

Isn't there a simple response, like: we don't understand this procedure?

Igor.

don't forget the semi-colon, Jonathan; it's often useful. :)