Yesterday we discussed difficulties with the concept of average treatment effect.

Part of designing a study is accounting for uncertainty in effect sizes. Unfortunately there is a tradition in clinical trials of making optimistic assumptions in order to claim high power. Here is an example that came up in March 2020. A doctor was designing a trial for an existing drug that he thought could be effective for high-risk coronavirus patients. I was asked to check his sample size calculation: under the assumption that the drug increased the survival rate by 25 percentage points, a sample size of N = 126 would ensure 80% power. With 126 people divided evenly into two groups, the standard error of the difference in proportions is bounded above by √(0.5*0.5/63 + 0.5*0.5/63) = 0.089, so an effect of 0.25 is at least 2.8 standard errors from zero, which is the condition for 80% power for the z-test.
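For concreteness, that arithmetic can be reproduced in a few lines. This is a sketch assuming the usual two-sided z-test at α = 0.05, where 80% power requires the true effect to be at least 1.96 + 0.84 ≈ 2.8 standard errors from zero:

```python
# Reproducing the sample-size check described above, assuming a two-sided
# z-test at alpha = 0.05 (80% power needs the effect to be at least
# 1.96 + 0.84 = 2.8 standard errors from zero).
import math

n_per_arm = 63   # N = 126 split evenly
effect = 0.25    # assumed gain in survival rate

# Worst-case (p = 0.5) standard error of a difference in proportions:
se = math.sqrt(0.5 * 0.5 / n_per_arm + 0.5 * 0.5 / n_per_arm)
print(round(se, 3))           # 0.089
print(round(effect / se, 1))  # 2.8, just clearing the 80%-power bar

# The same check for a 10-percentage-point effect, solving
# effect / sqrt(2 * 0.25 / n) = 2.8 for n:
n_needed = (2.8 / 0.10) ** 2 * 2 * 0.25
print(math.ceil(n_needed))    # 392 per arm
```

The last calculation shows why detecting a 10-percentage-point effect would take a trial roughly six times larger.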

When I asked the doctor how confident he was in his guessed effect size, he replied that he thought the effect on these patients would be higher and that 25 percentage points was a conservative estimate. At the same time, he recognized that the drug might not work. I asked the doctor if he would be interested in increasing his sample size so he could detect a 10 percentage point increase in survival, for example, but he said that this would not be necessary.

It might seem reasonable to suppose that a drug might not be effective but would have a large individual effect in case of success. But this vision of uncertainty has problems. Suppose, for example, that the survival rate was 30% among the patients who do not receive this new drug and 55% among the treatment group. Then in a population of 1000 people, it could be that the drug has no effect on the 300 people who would live either way, no effect on the 450 who would die either way, and it would save the lives of the remaining 250 patients. There are other possibilities consistent with a 25 percentage point benefit—for example the drug could save 350 people while killing 100—but we will stick with the simple scenario for now. In any case, the point is that the posited benefit of the drug is not “a 25 percentage point benefit” for each patient; rather, it’s a benefit for 25% of the patients. And, from that perspective, of course the drug could work but only on 10% of the patients. Once we’ve accepted the idea that the drug works on some people and not others—or in some comorbidity scenarios and not others—we realize that “the treatment effect” in any given study will depend entirely on the patient mix. There is no underlying number representing the effect of the drug. Ideally one would like to know what sorts of patients the treatment would help, but in a clinical trial it is enough to show that there is some clear average effect. My point is that if we consider the treatment effect in the context of variation between patients, this can be the first step in a more grounded understanding of effect size.
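The mix-dependence is easy to make concrete. Here is a minimal sketch of the 1000-patient scenario above; the second patient mix is hypothetical, just to show the same drug yielding a different average:

```python
# The 1000-patient scenario from the paragraph above: the individual
# treatment effect is +1 for the 250 patients the drug saves and 0 for
# everyone else; it is 0.25 for no one.
effects = [0] * 300 + [0] * 450 + [1] * 250  # live either way, die either way, saved
ate = sum(effects) / len(effects)
print(ate)  # 0.25 -- the "25 percentage point benefit" is an average

# A different (hypothetical) patient mix with only 100 drug-responsive
# patients gives a different "treatment effect" for the same drug:
sicker_mix = [0] * 900 + [1] * 100
print(sum(sicker_mix) / len(sicker_mix))  # 0.1
```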

This is an interesting example because the outcome is binary—live or die—so the variation in the treatment effect is obvious. By construction, the treatment effect on any given person is +1, -1, or 0, and there’d be no way for it to be 0.25 on everybody. Even in this clear case, however, I think the framing in terms of average treatment effect causes problems, as illustrated in the story above.

> Once we’ve accepted the idea that the drug works on some people and not others—or in some comorbidity scenarios and not others—we realize that “the treatment effect” in any given study will depend entirely on the patient mix.

This makes sense.

> There is no underlying number representing the effect of the drug.

This makes sense (though I like the previous sentence better).

> Ideally one would like to know what sorts of patients the treatment would help, but in a clinical trial it is enough to show that there is some clear average effect.

What I don’t get is this. It’s like giving up the thought process that got us this far.

There’s lots of variation in sick patients -> we should recognize that or we’ll mess up with our ATEs -> but it’s enough to show a clear ATE

What do you mean by this? Is it just that we shouldn’t obsess over the difference between a 10% and a 25% effect if we suspect there’s some sort of population sensitivity? Or is it something else?

My take was that the sentence loosely translated to “historically, the standard for clinical trials is to show a clear ATE”. I don’t think Andrew was necessarily saying it was good enough scientifically speaking. Given the probable noise and the number of possible factors (many of them latent), obsessing over a difference between 10% and 25% will provide few conclusive answers (though it is important to mine correlations as potential indicators). Hence, I can understand why significance in ATE is a “good enough” standard to move forward in a clinical setting while remaining unsatisfactory in a scientific one.

> By construction, the treatment effect on any given person is +1, -1, or 0, and there’d be no way for it to be 0.25 on everybody.

Doesn’t this ultimately depend on how deterministic a process you think the infection is in a particular individual? Saying “the treatment effect on any given person is +1, -1, or 0” seems to suggest that if you reran the tape, so to speak, everyone would have the same outcome. I’m not saying that’s wrong, just saying it seems to be an assumption here. But is it the only possible assumption?

In other words, “I am giving you a drug that increases your probability of survival by 25%” could mean “there’s a 25% chance that you are in the group of patients who would die without the drug and survive with it”. But couldn’t it also mean, “your survival is stochastic, and this drug will increase your personal p(survival) by 0.25”?

I agree, it seems that some variability is not being embraced or some uncertainty is not being accepted. :-)

Although rereading the post I guess that may be the point? The “effect” is big or it isn’t. However, it’s unknown and unknowable (it depends on a non-existing counterfactual). Once we start estimating things, one can be “a little bit pregnant”.

I don’t think that the fact that treatments won’t have the same effect in all patients is lost on anyone doing clinical trials, though.

Tell that to Schrödinger’s cat.

There is no basis in science for a stochastic action here. The action of the drug occurs without reference to our model of its action. If it acts differently in two situations, we infer there is a difference in the situations, not that the drug plays bingo.

The stars are indifferent to astronomy.

“The action of the drug” doesn’t mean anything outside of our model.

No it doesn’t, but chemicals do not need meaning to interact with each other.

You may have a philosophical question here. Just don’t mistake it for a scientific one.

I was being facetious about quantum effects, but what about a drug which interacts with, say, your blood glucose level, which might vary by a factor of two in a normal person depending on when their last meal was? Person A would have been cured if he had taken the drug on an empty stomach but is not on a full stomach. OK, maybe you monitored that particular timing, but what about some other neurotransmitter that you don’t even know about whose levels routinely vary? There is science to who gets cured, but if you’re not measuring the relevant variable, it looks stochastic.
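A toy simulation makes this point concrete. Everything here is made up (the threshold rule, the glucose range); the point is only that a fully deterministic rule looks like a biased coin to an analyst who never records the relevant variable:

```python
# Toy illustration: cure is fully determined by an unmeasured variable
# (say, glucose level at dosing time), but to an analyst who doesn't
# record it, the outcome looks stochastic.
import random

random.seed(1)

def cured(glucose):
    # Deterministic rule: the (hypothetical) drug works only below a threshold.
    return glucose < 100

# Glucose varies from patient to patient depending on when they last ate:
outcomes = [cured(random.uniform(70, 140)) for _ in range(10_000)]
print(sum(outcomes) / len(outcomes))  # close to (100-70)/(140-70), about 0.43
```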

Jonathan:

In Rubin’s potential outcome framework, the causal effect for each person exists and is simply y^treatment – y^control, which indeed is either +1, -1, or 0 in this example. It’s the same idea as saying that the length of your life exists; we just don’t know the number yet. But you make a good point that it can make sense in modeling to define some sort of intermediate quantity that includes some but not all sources of variation. I wonder if Imbens and Rubin discuss this in their book on causal inference.
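For readers unfamiliar with the notation, here is a minimal sketch of the potential-outcomes bookkeeping in this binary case; the four patient types are the ones discussed in the post:

```python
# Rubin's potential-outcomes setup for a binary outcome: each patient has
# two fixed potential outcomes, and the individual causal effect is
# y_treat - y_control, which can only be 0, +1, or -1.
patients = [
    {"y_treat": 1, "y_control": 1},  # would live either way: effect 0
    {"y_treat": 0, "y_control": 0},  # would die either way: effect 0
    {"y_treat": 1, "y_control": 0},  # saved by the drug: effect +1
    {"y_treat": 0, "y_control": 1},  # killed by the drug: effect -1
]
effects = [p["y_treat"] - p["y_control"] for p in patients]
print(effects)  # [0, 0, 1, -1] -- only these three values are possible
```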

What do you do with a +1 -1 0 causal effect when there are multiple observations per subject? Joe got sick five times, and the drug cured him twice and he stayed sick three times.

Jonathan:

In that case, there’s a separate treatment effect for each event, not just one per person.

I think you have the shoe on the wrong foot: if the mechanisms and outcomes of a process are entirely deterministic but literally unknowable, and the central limit theorem means the distributions of outcomes will be identical to those of a stochastic process, then parsing whether it is, in fact, a stochastic process is philosophical, if not metaphysical. After all, a blind draw of a ball from an urn is entirely deterministic but also perfectly modeled as random.

> There is no basis in science for a stochastic action here.

I disagree. It is easy to think of such bases. To name one: whether or not your immune system hits on a particularly effective antibody. The generation of such is effectively random.

And not just generation. Whether or not it is made in large enough quantities.

This goes back to the Rubin/Neyman causal model. You can do inference by modeling the potential outcomes as fixed (or, equivalently, conditioning on some subset/function of them) even if some stochastic process brings them about.

@Adam thanks I will have to read up on that.

@Ben

> But couldn’t it also mean, “your survival is stochastic, and this drug will increase your personal p(survival) by 0.25”?

How does one survive ‘better’ by 0.25? You either survive or you don’t. There is no ‘increase’ part, as it is a one-off for each patient. Basically, I know what you meant, but it’s really not a helpful figure on an individual basis.

Unfortunately, some people also interpret these sorts of numbers as:

“My symptoms will be 25% less severe”.

Dr: We could change the valve in your heart, but there is a 10% probability that you die during the operation.

Nr: That’s not a helpful figure, doc. I will either die or not, just tell me what it is!

Carlos,

The 10% in your example has been derived over many attempts on a population, not an individual.

However, one patient at a time doesn’t really benefit from that 10% (I mean, it’s comforting to hear and sounds official, but the outcome is binary; the figure is good only for comparison to other numbers, with 20% being worse).

I’m really not understanding your objections either. People do things for a “percentage benefit” all the time. Seatbelts and bike helmets don’t stop you from dying in crashes, they make it less likely. If my doctor says to get my cholesterol down, it might not help me avoid heart trouble (and maybe I would never have had any either way), but it may lower my probability of it. (Not sure I believe this, as an aside.) People like reliable cars even though a “less reliable model” might never break down. The idea of raising my survival odds by a certain percentage seems perfectly lucid to me, entirely apart from whether I would think it was worth the effort and expense.

@ Navigator: there may be all kinds of random processes in how the infection proceeds. Does your body start making a particularly effective antibody? What happens to your cytokine levels, and when? Do you get a bacterial co-infection? Et cetera, et cetera.

> I asked the doctor if he would be interested in increasing his sample size so he could detect a 10 percentage point increase in survival, for example, but he said that this would not be necessary.

Is that an example of “making optimistic assumptions in order to claim high power”? He could have said “yes” and claimed high power anyway. Maybe then the example would be about claiming high power for a 10pp effect but not being interested in a 2pp effect.

Perhaps a more effective framing would be to ask the doctor for what fraction of the population a 100% treatment effect would be worth detecting.

Isn’t that what he provided? He stated that he wanted to detect a 25 percentage point increase in survival. Say, for the sake of the example, that survival is 35% without treatment; it would then be 60% with treatment. You could say that it has a 100% effect (saves people who would have died otherwise) in 25% of the treated population.

Unless I’m missing something, that’s precisely what the 25pp increase in survival means: an additional 25% of the treated population survives, i.e. 25% of them see the 100% treatment effect of their death being prevented.

(The preceding discussion assumes the treatment doesn’t kill anyone. That would complicate slightly the argument but not substantially, I think.)

Yes, exactly. But putting the statistically equivalent threshold in the starkest “real world” terms may serve to break through the physician’s innumeracy.

What innumeracy? I don’t know where you see it in the story, if you’re referring to it.

I don’t understand either how “25% of the treated patients see a 100% effect (surviving instead of dying)” is more “real world” than “survival increases from 35% to 60% when patients are treated”.

Surely the doctor understands that the desired “keep alive a patient who would have otherwise died” appears in some fraction of the treated population.

In medicine this is often put in “number needed to treat” terms: 25 percentage points increase in survival is equivalent to saying that four patients have to be treated to prevent one death (in expectation).

Number needed to treat is indeed a common metric, and one with some advantages in terms of interpretation. But it suffers from exactly the same issues as average treatment effects. The number needed to treat (as you point out with the words “in expectation”) is only an average. It may be 10 on average, but that might mean 2 on average in one subgroup and 25 in another.
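A small sketch of this point, with made-up subgroup numbers: the overall NNT can look fine while hiding very different subgroup NNTs.

```python
# Number needed to treat (NNT) is 1 / absolute risk reduction.  The
# subgroup effect sizes below are hypothetical, just to show how one
# overall NNT can mask heterogeneity.
def nnt(risk_reduction):
    return 1 / risk_reduction

print(nnt(0.25))  # 4.0 -- the 25-percentage-point example above

# Two equally sized (hypothetical) subgroups with very different effects:
subgroup_effects = [0.45, 0.05]
overall = sum(subgroup_effects) / len(subgroup_effects)
print(nnt(overall))                        # 4.0 overall...
print([nnt(e) for e in subgroup_effects])  # ...but roughly [2.22, 20.0] by subgroup
```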

I think we all agree that more information is preferable to less information. You can go all the way there and say that the number needed to treat is one when you treat only those who would die without treatment and survive with it. If that doctor knew who they are, he would probably design the trial differently. But he doesn’t, so it’s not clear to me what the problem with his approach is.

It’s hard enough to show (or refute) an ATE. Finding heterogeneous treatment effects, confidence bounds on them, risk factors that explain them, etc. is multiple orders of magnitude harder.

I think (many of) those who compute and use these sorts of numbers understand the limitations. But the seeming alternatives are 1) nihilism, or 2) sample sizes of 10,000,000 or more for every study.

We would (or should) never claim that a mathematically describable distribution like N(0,1) is sufficiently accurate in describing most real-life phenomena that we can meaningfully interpret our estimate of the eighth moment. Nor would we claim that our effect size estimate is sufficiently precise that we can meaningfully interpret the value of its eighth significant digit. Isn’t it equally absurd to claim the inverse is true, that we can justify our model of a joint distribution of dozens of phenomena we barely understand, so long as we only use the first and second moments? Or that we can rely upon an effect size estimate that summarizes outcomes that depend on a dizzying array of factors not in the model, so long as we only interpret the first and second significant digits? If the answer is no, then the standardized effect estimate is not an estimate of a drug’s true (or even probable) effect on the virus in the human body. It’s a socially-constructed quantity, the mathematical basis of which provides a means of forming consensus among scientists about when we should conclude that a drug is worth prescribing to the public. One that feels more objective than our choice of an acceptable Type I error rate but is really just as arbitrary. It’s not just that all models are wrong: all models of systems as complex as the human body are “not even wrong.”

Now that’s a nihilistic alternative. And while I don’t buy it in practice–at a minimum, the law of large numbers ensures that the top-line results hold up most of the time for most of the people, which is better than waiting until we understand the mechanisms involved–it must be incontrovertibly true for SOME level of complexity, no?

I think the future for estimating heterogeneous treatment effects is to focus on observational datasets. That’s the only place we can easily find effective sample sizes of 10,000,000 or more.

I use a paint analogy. Everyone accepts that painting metal slows/prevents rust. But say you do an RCT of a bunch of painted pieces of metal and then check for rust. You can get:

1) Metal already painted, so additional paint does nothing

2) Metal already rusted, so paint was applied too late

3) Pieces too large for the amount of paint applied, so part of it is uncovered and rusts anyway

4) Metal that gets dinged up a lot so the paint chips off, so it will rust unless you repaint it often.

Just taking an average including all these scenarios or, even worse, only studying scenarios like that is going to lead to the conclusion “paint doesn’t work”.

That basically sums up 60 years of vitamin C research; we are seeing it with HCQ for covid, etc.

The average person does not exist, so why is almost all medical research directed at treating them?

Also:

https://www.medrxiv.org/content/10.1101/2020.06.29.20142703v1

A great tragedy is that not a single report of vitamin C levels in a covid patient has been published. They will be deficient, and will benefit from correcting that deficiency.

But if you just give some random amount for a few days at a random point in the disease process (instead of giving enough of it, before massive oxidative damage is done, and until the patient is healthy again), then it will probably look like vitamin C doesn’t do anything.

Thank you for this post. It is reminiscent of conversations I have had over and over again with researchers, either in the context of power analysis or after data collection, when trying to convince them to actually look at the raw data before averaging and modeling means. The ease and availability of statistical models for means, as well as a general expectation of their use, has created a culture where we don’t even expect justification of the assumption that means are a reasonable parameter of interest in the first place. In my experience, there is a deep-seated belief among scientists that means/averages are inherently meaningful and useful and there is no need to justify choice of models based on them.

Megan said,

“The ease and availability of statistical models for means, as well as a general expectation of their use, has created a culture where we don’t even expect justification of the assumption that means are a reasonable parameter of interest in the first place. In my experience, there is a deep-seated belief among scientists that means/averages are inherently meaningful and useful and there is no need to justify choice of models based on them.”

+1 This has become “That’s the way we’ve always done it”, “This is what’s standard in the field” — don’t think; just follow the standard procedures.