Statistics is hard, especially if you don’t know any statistics (FDA edition)

Paul Alper shares this story:

From the NYT:

Dr. Stephen M. Hahn, the commissioner of the Food and Drug Administration, said 35 out of 100 Covid-19 patients “would have been saved because of the administration of plasma.”

He later walked this back because of confusion between Absolute Risk Reduction and Relative Risk Reduction, a common error usually promoted by drug manufacturers because relative improvement appears more dramatic to the beholder.

He [Hahn] clarified that his earlier statements suggested an absolute reduction in risk, instead of the relative risk of a certain group of patients compared with another.

The chart, analyzing the same tiny subset of Mayo Clinic study patients, did not include numerical figures, but it appeared to indicate a 30-day survival probability of about 63 percent in patients who received plasma with a low level of antibodies, compared with about 76 percent in those who received a high level of antibodies.

From the FDA:

“there appears to be roughly a 35 percent relative improvement in the survival rates of patients” who received the plasma with higher versus lower levels of antibodies.

As best as I [Alper] can figure out, the absolute risk reduction is

.37-.24 = .13

The relative risk reduction is

(.37-.24)/.37 = .35

The number needed to treat, a figure of merit which is often omitted from discussion, is

1/.13 = 7.69
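
A minimal Python sketch of that arithmetic (just illustrative; the variable names are mine, the 0.37 and 0.24 risks are the ones above):

```python
# Death risks assumed above: 1 minus the NYT's approximate 30-day survival figures.
risk_low_titer = 0.37    # patients given plasma with a low level of antibodies
risk_high_titer = 0.24   # patients given plasma with a high level of antibodies

arr = risk_low_titer - risk_high_titer   # absolute risk reduction
rrr = arr / risk_low_titer               # relative risk reduction
nnt = 1 / arr                            # number needed to treat

print(f"ARR = {arr:.2f}")   # 0.13
print(f"RRR = {rrr:.2f}")   # 0.35 -- the "35 percent" figure
print(f"NNT = {nnt:.2f}")   # 7.69 -- treat roughly 8 patients to save one
```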

Statisticians and scientists said that Dr. Hahn, in saying at the news conference that 35 out of 100 sick Covid-19 patients would have been saved by receiving plasma, appeared to have overstated the benefits.

I looked up Hahn on the internet and he’s an oncologist:

Hahn completed an internal medicine residency at the University of California, San Francisco School of Medicine where he eventually served as chief resident before embarking on a fellowship in medical oncology at the National Institutes of Health.

After completing his fellowship, Hahn worked as a medical oncologist in Santa Rosa, California. He was then recruited by his mentor, Dr. Eli J. Glatstein to complete a separate residency in radiation oncology at the NIH between 1991 and 1994, where he eventually attained the rank of commander in the U.S. Public Health Service Commissioned Corps between 1989–1995. During the period of 1993–1999, he served as chief of NCI’s Prostate Cancer Clinic in the Clinical Pharmacology Branch . . .

I don’t see any formal statistics training. That’s fine! Alper and I don’t have any oncology training and here we are posting on this. But in any case the commissioner of the FDA might well be too busy to carefully read the individual studies. I assume the fault lies with whichever assistant prepared the numbers for him.

The story with the “A/Chairman @WhiteHouseCEA” was worse, because that guy didn’t just mess up some numbers, he went all-in on the attack. The FDA story is embarrassing, but such things happen. Whoever prepared the FDA commissioner’s briefing must feel pretty bad about this one.

42 thoughts on “Statistics is hard, especially if you don’t know any statistics (FDA edition)”

  1. I agree that this whole incident is a big embarrassment for the FDA, including this misrepresentation of what the 35% allegedly means. But if Dr Hahn had just said “35 out of 100 patients who died would have been saved” (rather than “35 out of 100 sick patients”), wouldn’t he have been correct, at least insofar as the interpretation of this point statistic? – and this is still an extremely dramatic story which would have gotten near-identical headlines and, if true, is totally worth celebrating.

    The by far bigger story is that the evidence for the risk reduction, whether communicated as absolute or relative, is still very weak and will remain so until there are results from some proper randomized trials. I think focusing on the absolute v relative risk thing is an error in terms of advocacy, and we should be pointing more at the lack of randomization. At this point there is only some loose observational evidence in favour of this treatment, not much better than counting anecdotes, unless I have missed something.

    • Peter Ellis feels that “focusing on the absolute v relative risk thing is an error in terms of advocacy, and we should be pointing more at the lack of randomization.” I would agree that a proper RCT is always desired but in this instance, a basic statistical error was made. Furthermore, because the Trump people are desperate to have a miracle cure in place before the upcoming election, it is not surprising that there is a tendency/desire to convince us of that (Vietnam-War type) light at the end of the tunnel.

    • Why do you think convalescent plasma would not be beneficial if given early? I think that is a pretty difficult default position to take.

      Same with vitamin c for vitamin c deficiency and oxidative stress as well as HBOT for oxygen deficiency (and now it’s looking like probable methemoglobinemia).

      The point of science is that we do not need to run giant RCTs for every single thing, because we can generalize from past experience and knowledge.

  2. Something seems wrong in the analysis. The risk of death of 37% is possibly an order of magnitude off; the actual risk of death is much lower. This in turn would affect the absolute risk calculation and the number needed to treat calculation.

    • I’m not sure where Paul Alper got his 0.37 and 0.24 probabilities from. In the Mayo research that seems to be cited most often as justification for this (https://www.medrxiv.org/content/10.1101/2020.08.12.20169359v1) they talk about seven-day mortality going down from 11.9% to 8.7% if a patient gets a transfusion within three days, rather than four days or later. That’s a 27% relative risk reduction (but I don’t think it can be trusted, as it’s a big, rambling non-randomized observational study with lots going on). As far as I can see there’s no comparison to “no transfusion at all”, but say that 27% just on the timing were real (a big if), maybe 35% compared to a control is a plausible guess. Also note that these were unusually ill patients (but that’s ok if that’s the population we’re interested in).

      Your main point is correct – I think much more important than whether 35% is correctly presented as the relative risk reduction is the fact that the basis for “35%” seems to be extremely weak.

      • Peter F. Ellis wrote “I’m not sure where Paul Alper got his 0.37 and 0.24 probabilities from.”

        The .37 and .24 come from the following statement in the NYT:

        “The chart, analyzing the same tiny subset of Mayo Clinic study patients, did not include numerical figures, but it appeared to indicate a 30-day survival probability of about 63 percent in patients who received plasma with a low level of antibodies, compared with about 76 percent in those who received a high level of antibodies.”

        That is, 1-.63 = .37 and 1-.76 = .24.
        The subtraction is in order to look at risk rather than survival.
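
        A quick sketch of the same conversion in Python (again just illustrative), recovering the 35 percent from the chart’s approximate survival figures:

        ```python
        # Approximate 30-day survival probabilities read off the NYT chart.
        surv_low_titer = 0.63    # plasma with a low level of antibodies
        surv_high_titer = 0.76   # plasma with a high level of antibodies

        # Convert survival to risk of death, then form the relative risk reduction.
        risk_low = 1 - surv_low_titer    # about .37
        risk_high = 1 - surv_high_titer  # about .24
        rrr = (risk_low - risk_high) / risk_low

        print(f"{rrr:.2f}")  # 0.35
        ```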

        • Thanks Paul. Those numbers presumably are for very sick people, so a subset of all those in the research. I agree that does look like a likely origin of the 35% figure.

        • Hahn originally presented this as the population of people getting sick, with no mention of a subpopulation, so there is that error too.

  3. No, I don’t think Hahn gets a pass on his lack of knowledge of statistics. The job of heading up the FDA requires a reasonable level of familiarity with statistics. If he doesn’t have that, he’s not qualified, and he should resign. Now, I wouldn’t judge that by whether he has formal training in it–it is possible to be self-taught in this area, as in most areas. And it may be that, in general, he has reasonable knowledge about statistics and this one incident is just a fluke. But being the head of the FDA has broad and important decision making power: it is by no means analogous to a couple of statisticians voicing their views about oncology. If he doesn’t have the skills, he should not be in that job.

    • In one of those daily briefings Dr. Birx referred to the throat swab PCR tests as blood tests. I wouldn’t assume any of these people are particularly knowledgeable about details.

      • Anon:

        Agency heads are busy, and also if you’re interviewed a lot, you can make mistakes. Mistakes are bad—they shouldn’t be labeling throat swabs as blood tests—but I think it’s part of the agency head’s job to get good subordinates who can prepare accurate briefings.

        • Right, I just meant that for many years these people have been closer to managers who play office politics well than to scientists or doctors.

      • In every math class I ever took, the instructor made at least one sign mistake in almost every lecture. Knowledgeable people can mis-speak even when they know what they’re talking about.

    • Clyde Schechter wrote “No, I don’t think Hahn gets a pass on his lack of knowledge of statistics. The job of heading up the FDA requires a reasonable level of familiarity with statistics. If he doesn’t have that, he’s not qualified, and he should resign.”

      That seems excessively harsh, but it is not without some validity.

      In this case the issue around Covid-19 treatment has become so politicized and contested that any reported results will be intensely scrutinized from all sides, many looking for ideological justifications. It’s remarkable that this would not have been realized before issuing press statements on the report. It’s about awareness rather than statistics as such. The result is a textbook example of poor communication of science and of scientist anger. Will the FDA be trusted the next time an announcement is issued?

      Many of Hahn’s initial comments have since been clarified and walked back, but is it enough? And what damage has been done (politically as well as scientifically)? A key benchmark will be how a similar announcement is handled next time.

      • I did say that it’s possible he is, in general, knowledgeable enough about statistics for the job and maybe this was just a fluke occurrence. So I’m not calling for him to resign based just on this one incident. I’m saying that the incident clearly raises questions, and his capacity to handle the duties of the job should be looked into. If he doesn’t have the necessary skills, then he should resign. I really don’t see that as harsh.

    • It seems as if nearly every statistical concept & practice has been under such severe scrutiny for many decades. Non-experts have been at an even greater disadvantage in terms of discerning NNT’s relevancy.

      What about lead time bias?

  4. The word hard doesn’t help distinguish statistics from anything else, really. “Hard” doesn’t promote any productive conversation. There are many things that are hard I need not mention. It just turns statistics/science into an ego competition. I think subtle is a better word. Then we can ask, “how is it subtle?” “How might we combat this?” Rather than propagating fear.

  5. The purpose of the US government approach to this illness has been to make the administration look good. Thus we are told that this is a trivial ill that will disappear when the weather gets warm, only affects feeble old people, requires no change in behavior, and will evaporate when treated with miracle drugs such as hydroxychloroquine and oleandrin. In this context it is missing the point to say that Dr. Hahn made some errors in using statistics. Clearly, he knows how to get ahead in his position. He is doing quite well.
    There is a reason that Donald Trump is the President rather than someone with two PhDs.
    I used no statistical test to arrive at my opinion.

  6. Can’t we just hold agency heads and such accountable for what they say? Sure they have people doing work for them. They fail to ask obvious questions in their own briefings to ensure they understand vitally important content, maybe because they don’t understand what they need to not screw up? Then maybe someone else should be in their job?

    Where should the buck stop? How about right here, where I decline to accept excuses for people who have power. Another flavour of blaming the grad students.

    It is 100% not the underlings’ fault unless they are giving the interview and making the false statements.

    All glory, superior pay, and opportunities flow to me, but all responsibility is yours. Bad outcomes happen with that attitude, e.g., p-hacking of the worst kind. Can we risk it here? Isn’t it a little too important to have decisions made by those who think this is fine?

    • H:

      Oh, yes, the agency head should definitely be accountable for what he says. I’m just saying that, conditional on him not having the time to read the technical material, his failure was more of a personnel-management failure than a statistical-reasoning failure. His mistake was to hire or retain a team that messed up when preparing his talking points.

      • Also, there can be a tendency in bureaucracies for senior managers to insulate themselves from technical staff – so the team that has the technical understanding is actually blocked from direct communication. So a personnel-management enforced failure.

  7. There are some folks saying this is politicized, which means the purported benefits are ephemeral and the administration is lying. That is different than just making a mistake in the statistics. As Bayesians, shouldn’t you all have a prior? Convalescent plasma has been used for other diseases and the physical mechanism is straightforward, although there is no sustained research (I suspect because vaccines are better). What is the cost benefit on this treatment? How should we make a decision under uncertainty given the devastation to lives and the economy? I think Andrew is clear-headed (see comments on statistics is hard) but it seems other commenters are the ones actually politicizing this event.

    • This idea that you need an international RCT of tens of thousands of people to claim anything is nuts. I always ask, how was astronomy so successful with n = 7 and only observation?

      And the way some of these RCTs are run, I’m not sure they would discover that water can save a house from fire. If you put a little bit of water on a house that already burned down, or so much water that it washes the house away, or include fires that started in the rain so they went out quickly anyway, etc., it can be difficult to see the benefit.

      • “how was astronomy so successful with n = 7 and only observation?”

        Because it’s based on stable physical laws. Once you move away from Physics, towards Chemistry, Biology, Psychology, Sociology, Pol. Science, Economics, it gets messier and messier. The concepts are harder to measure and noise is everywhere.

        Medicine is in the middle of the spectrum, at the intersection of biology, chemistry, psychology, etc. Very hard to measure precisely and a lot of noise and proxy variables.

        However, I agree that the world can’t afford waiting for numerous RCTs, which are in this situation unethical anyway. If there is no harm from conv. plasma, vitamin C or anything else, it would be wise to try it. It’s not like we are stalling some miracle cure by using these unproven alternatives. As long as there is no harm, why not try it.

        • “Because it’s based on stable physical laws. Once you move away from Physics, towards Chemistry, Biology, Psychology, Sociology, Pol. Science, Economics, it gets messier and messier. The concepts are harder to measure and noise is everywhere.”

          Everything is based on stable physical laws. You solve for the simplified scenario and get the first approximation. Look at the calculations that go into a modern ephemeris if you want very accurate predictions: https://ssd.jpl.nasa.gov/?ephemerides

          Yet, the simplified calculations done by hand arrived at from n = 7 get us 90%+ there. No one even tries to do this type of research for medicine these days. Stuff like SIR models, law of mass action, and law of effect were developed back in the day when researchers did try to take a physics-like approach to medical data.

          Nowadays you get the “war on cancer” that spends hundreds of billions of dollars over decades to conclude “cancer is many diseases”, which is the exact opposite of finding a “law”.

          “However, I agree that the world can’t afford waiting for numerous RCTs, which are in this situation unethical anyway. If there is no harm from conv. plasma, vitamin C or anything else, it would be wise to try it. It’s not like we are stalling some miracle cure by using these unproven alternatives. As long as there is no harm, why not try it.”

          Yes, cost/risk-benefit needs to determine what treatments are used. Not statistical significance.

        • And you do not know the risk or the benefit until the ostensible effect (or non-effect) has been observed in such numbers as to elicit a “stable law”, or at least a “meta-stable” law.

        • I did blinded, randomized animal research, which is much easier to do than a human RCT but still riddled with issues. In the end it wasn’t even clear to me what was being measured; in fact, I don’t think it was possible for the experiment to answer the research question.

          To do that I needed to come up with a mathematical/computational model that made predictions and interpret the parameters, which people said was too hard. Then I found it had already been done in the 1930s!

          Didn’t matter. All anyone cared about was whether there was a significant difference between groups, though. And I had a lot of meaningless significant differences between groups.

  8. I’ve never taken a statistics course in my life. I’ve never studied statistics. The error doesn’t seem particularly “hard” to me to recognize. Absolute values versus relative values seems rather basic and applies across many domains. I’m not sure this fits into the “statistics” is hard category. This doesn’t seem to me to be a lack of a specialized kind of knowledge. Seems to me like a basic logic fail – perhaps spurred by a “motivation” to promote a politically favorable message.

    How does one distinguish basic logic, in particular as applied to math, and “statistics?”

  9. Table 3 of the Mayo paper (https://www.medrxiv.org/content/10.1101/2020.08.12.20169359v1) (which others have cited) gives the 7-day and 30-day mortality rates for patients given convalescent plasma with low, medium, and high titers of IgG antibodies. The 7-day mortality rate in the 561 patients given low titer IgG plasma was 13.7%; it was 8.9% in the 515 patients given high titer IgG. The rate ratio (relative risk) for 7-day mortality in patients given high titer IgG compared with low titer IgG is 0.089 / 0.137 = 0.65.

    Epidemiologists (and I’m not defending this practice) often (very often) calculate a “% reduction in relative risk” by subtracting the relative risk estimate from 1.00. Thus, many epidemiologists would put into a paper (or issue a press release or describe to a newspaper reporter) a statement that “high titer IgG reduced 7-day mortality by 35%”—1.00 minus 0.65.

    The BMJ EBM toolkit promotes this. https://bestpractice.bmj.com/info/us/toolkit/learn-ebm/how-to-calculate-risk/ The BMJ EBM folks call this measure the “Relative Risk Reduction (RRR).”

    The website states: “RR of 0.8 means an RRR of 20% (meaning a 20% reduction in the relative risk of the specified outcome in the treatment group compared with the control group).”

    It is, in my opinion, nonsense to describe the subtraction of the relative risk estimate from 1.00 as a “% reduction in the relative risk.”

    I illustrate the problem using the high titer IgG compared with low titer IgG data as an example.

    The 35% reduction in the relative risk for high titer IgG compared with low titer IgG (relative risk 0.089 / 0.137 = 0.65) would be a 54% increase in the relative risk for low titer compared with high titer (relative risk 0.137 / 0.089 = 1.54). The choice of referent is (always) arbitrary.

    In terms of the substantive issue about use of convalescent serum it is important to look at 30-day mortality in patients treated with high titer IgG and low titer IgG. The 30-day mortality in the 561 patients given low titer IgG plasma was 29.6%; it was 22.3% in the 515 patients given high titer IgG. The rate ratio (relative risk) for 30-day mortality in patients given high titer IgG compared with low titer IgG is 0.223 / 0.296 = 0.75. The amount of benefit at 30 days is less than the benefit at 7 days. The long-term outcomes are most important.
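
    A small Python sketch (illustrative only) of the Table 3 rates quoted above makes the referent problem explicit:

    ```python
    # 7-day and 30-day mortality from Table 3 of the Mayo preprint, as quoted above.
    mortality = {
        "7-day":  {"low_titer": 0.137, "high_titer": 0.089},
        "30-day": {"low_titer": 0.296, "high_titer": 0.223},
    }

    for horizon, rates in mortality.items():
        rr_high_vs_low = rates["high_titer"] / rates["low_titer"]
        rr_low_vs_high = rates["low_titer"] / rates["high_titer"]
        print(f"{horizon}: RR (high vs low) = {rr_high_vs_low:.2f}, "   # 0.65, then 0.75
              f"'RRR' = {1 - rr_high_vs_low:.0%}, "                     # 35%, then 25%
              f"RR (low vs high) = {rr_low_vs_high:.2f}")               # 1.54, then 1.33
    ```

    Read with the other referent, the same data become a 54% (7-day) or 33% (30-day) increase rather than a 35% or 25% reduction.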
