After learning of a news article by Amy Harmon on problems with medical trials (sometimes people are stuck getting the placebo when they could really use the experimental treatment, and it can be a life-or-death difference), John Langford discusses some fifteen-year-old work on optimal design in machine learning and makes the following completely reasonable point:
With reasonable record keeping of existing outcomes for the standard treatments, there is no need to explicitly assign people to a control group with the standard treatment, as that approach is effectively explored with great certainty. Asserting otherwise would imply that the nature of effective treatments for cancer has changed between now and a year ago, which denies the value of any clinical trial. . . .
Done the right way, the clinical trial for a successful treatment would start with some initial small pool (equivalent to “phase 1” in the article) and then simply expanded the pool of participants over time as it proved superior to the existing treatment, until the pool is everyone. And as a bonus, you can even compete with policies on treatments rather than raw treatments (i.e. personalized medicine).
Langford then asks: if these ideas are so good, why aren’t they done already? He conjectures:
Getting from here to there seems difficult. It’s been 15 years since EXP3.P was first published, and the progress in clinical trial design seems glacial to us outsiders. Partly, I think this is a communication and education failure, but partly, it’s also a failure of imagination within our own field. When we design algorithms, we often don’t think about all the applications, where a little massaging of the design in obvious-to-us ways so as to suit these applications would go a long ways.
I agree with these sentiments, but . . . the sorts of ideas Langford is talking about have been around in statistics for a long, long time, much more than 15 years! I welcome the involvement of computer scientists in this area, but it’s not simply that the CS people have a great idea and just need to communicate it or adapt it to the world of clinical trials. The clinical trials people already know about these ideas (not with the same terminology, but they’re the same basic ideas) but, for various reasons, haven’t widely adopted them.
P.S. The news article is by Amy Harmon, but Langford identifies it only as being from the New York Times. I don’t think it is appropriate to omit the author’s name. The publication is relevant but it’s the reporter who did the work. I certainly wouldn’t like it if someone referred to one of my articles by writing, “The Journal of the American Statistical Association reported today that . . .”
For 2, they may wish to look here to get a sense of the current grappling with this
and in particular the TV debate http://justin.tv/cochranetv/b/272278382
For 1, they may wish to look at what happens when you can't randomize.
I don't think the parallel between the NYT and the Journal of the American Statistical Association works. Journalists for the NYT get paid by the paper, the paper is held accountable (both legally and ethically) for their mistakes, and it presumably also provides research funds and assistance, editing, fact checking, etc.
So while I completely agree with your usual line against "The Lancet Study" or so, I don't think that applies to journalism in the same way.
But the 'reasonable' statement contradicts findings on placebo effects. It may perhaps be a molecular biological fact that for many drugs the effect should not change from year to year, and in that sense it's a reasonable statement. But as a standard treatment and the expectations around it become better known, the placebo component of the treatment will vary (either up or down). Thus the effect of standard treatment could very well vary a great deal depending on such a trivial matter as what the fear mongers (American press) are reporting this week.
Agree with previous comment. The NYT employee is speaking for his employer, and the paper will correct any errors regardless of what the reporter says. He is writing the story that his boss told him to write. (Okay, that is an exaggeration. Most errors go uncorrected.) But you don't have any responsibility to the Journal of the American Statistical Association, and you don't speak for them. If you did not write that last article, the journal would not have commissioned someone else to write it.
There is necessarily a high bar of proof for a phase III trial, and the suggested methodology changes would break our ability to make causal statements about the treatment. While we do have a good general idea of what happens to people under control treatments, we don't have the same certainty about the outcomes of people gathered under a particular recruitment protocol. I would fight tooth and nail against the removal of the control group from any phase III trial.
Clinical trials are rough on the participants. In some trials people die because they don't get the experimental treatment, and in others they die because they don't get the control. We don't know which it is until we finish the trial. This is both necessary for science and tragic for the participants and their families.
As for the point about adaptive design, cancer trials in particular are a hotbed for this kind of non-standard design. While I think that it is appropriate to be conservative in the adoption of these methods, we should be constantly thinking about how to minimize the suffering we inflict on research subjects.
I believe this is not a case where CS-side analysis mimics or rediscovers statistics-side analysis. If any of you think otherwise, a citation would definitely be interesting.
Just to emphasize, the EXP* analysis is none of these relatively familiar deviations from standard statistical analysis:
(1) replace a Gaussian assumption with a Binomial assumption.
(2) remove the need for a correct Bayesian prior as per Gittins indices.
(3) remove any specific distributional assumption except for IIDness.
Instead, the EXP* analysis is of the form: We removed the need to make _any_ statistical assumptions about the world. This may sound impossible, but it isn't with the right structure and the right definition of success. Understanding this alters your notion of what it means to deal with causality effects. I believe this is important in two ways:
(a) Since the 'if' in the theorem statement is weaker than for any analysis relying on statistical assumptions about the world, it provides something with a stronger guarantee than the current standard of randomized clinical trials.
(b) Making the 'if' weaker helps show us that the notion of a phased trial is broken with respect to optimality. Instead, there should be a progressing process of exploration. That's obviously challenging w.r.t. record keeping, but it seems doable with improved technology.
Since (a) meets or exceeds Ian's "high bar of proof", (b) shows us that fighting tooth and nail for the existing system is misguided.
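For readers who have not seen this family of algorithms, here is a minimal sketch of plain EXP3, the simpler relative of EXP3.P (the high-probability variant is not shown). It is an illustration only: the function names, the exploration rate gamma = 0.1, and the two hypothetical success rates are assumptions made for this example and do not come from the discussion above. The algorithm itself never uses a model of how rewards are generated, which is the "no statistical assumptions" point; the Bernoulli rewards are just a convenient way to drive the demo.

```python
import math
import random


def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Plain EXP3: exponential weights with importance-weighted rewards.

    reward_fn(t, arm) must return a reward in [0, 1]. It may be arbitrary
    (even adversarial); the update below never assumes a distribution.
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    pulls = [0] * n_arms
    total_reward = 0.0
    for t in range(n_rounds):
        total_w = sum(weights)
        # Mix the weight-based distribution with uniform exploration.
        probs = [(1 - gamma) * w / total_w + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        pulls[arm] += 1
        # Dividing by the selection probability keeps the reward estimate
        # unbiased even though only the chosen arm's outcome is observed.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale to avoid overflow; the probabilities are unchanged.
        max_w = max(weights)
        weights = [w / max_w for w in weights]
    return pulls, total_reward


def make_bernoulli_rewards(success_rates, seed=1):
    """Hypothetical outcome generator, used only to drive the demo."""
    rng = random.Random(seed)

    def reward_fn(t, arm):
        return 1.0 if rng.random() < success_rates[arm] else 0.0

    return reward_fn


# Arm 0 plays the role of a standard treatment, arm 1 a better new one.
pulls, total_reward = exp3(make_bernoulli_rewards([0.3, 0.8]), n_arms=2, n_rounds=5000)
print("assignments per arm:", pulls)
print("total reward:", total_reward)
```

The share of participants assigned to the better arm grows as evidence accumulates, but every arm keeps a selection probability of at least gamma divided by the number of arms, so exploration never stops entirely.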
In your blog, you wrote:
I've heard these ideas for a while in the evidence-based medicine world. On the other hand, most studies don't seem to be done this way, so I applaud all work in this direction, whether from clinicians, statisticians, computer scientists, or others.
To me, the challenge is in record-keeping and keeping track of who does what treatment and what happens next. Issues of Gaussian assumptions etc don't seem so important to me.
I definitely concur that record-keeping is a primary challenge. I'm enough of a theorist to hope that sharpening theorems and understanding along with some education can also be of substantial help.
I can see some issues with "start with some initial small pool (equivalent to “phase 1” in the article) and then simply expanded the pool of participants over time as it proved superior to the existing treatment…"
If initial results with a small pool are showing no effect or negative effects then it may be hard to recruit more patients to an expanded pool, even if the true effect of the drug would be substantially beneficial on average.
Depending on the rate at which you expand the pool, it might take a lot longer than with a conventional approach to either get convincing evidence that the drug is good, or that it's bad. If the drug is beneficial then the delay is bad because people who would have benefited from the drug die, and because it raises the already-high cost of developing new drugs. If the drug is harmful then the delay in finding out could be either good or bad, but is probably bad.
I don't think we live in the best of all possible worlds when it comes to drug assessment and I'm absolutely certain the situation can be improved. I'm just pointing out that some things that seem like no-brainer improvements actually aren't. A gradually expanding pool might be good or bad, perhaps depending on how it's implemented and how much it increases the cost and so on.
What one might call "micro-adaptive" trials are actually fairly commonplace. For example, try looking up "generalized drop the loser rule", and any such hit will lead you down many deep paths through the statistics of adaptive trials. All of this could be improved upon, and optimizing for information gain, minimizing regret, and maximally reducing entropy are all useful goals, which biostats folks have been thinking about (and, to a large extent, acting on) more or less forever. (Another important thing to look at is n-of-1 trials.) There are two problems with all of these approaches, both of which have been mentioned but are worth emphasizing: 1. Gathering information is extremely difficult, especially outcomes information, which may take years and be very expensive (if even possible) to get. 2. There may not be enough subjects in the entire possible population to make up for the statistical power that you give up by constantly peeking at your results. Heck, for many diseases, perhaps most, there aren't enough subjects to do it the "hard way". If you have a massive effect, then this isn't an issue, but see #1, above.
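To make the cost of peeking concrete, here is a small simulation sketch. Both arms have the same true response rate, so every "significant" result is a false positive. It compares a naive analyst who tests at every interim look against one who tests only once at the end. The sample sizes, number of looks, response rate, and function names are all assumptions chosen for illustration. Keeping the overall error rate at 5% with repeated looks requires stricter thresholds at each look, and that correction is where the power goes.

```python
import math
import random


def z_two_proportions(successes_a, successes_b, n):
    """Absolute z statistic for two arms with n subjects each."""
    pooled = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0
    return abs(successes_a - successes_b) / n / se


def false_positive_rates(n_trials=1000, n_per_arm=200, n_looks=10,
                         rate=0.3, critical=1.96, seed=0):
    """Simulate null trials; return (any-look, final-look-only) rejection rates."""
    rng = random.Random(seed)
    look_every = n_per_arm // n_looks
    rejected_any_look = 0
    rejected_final_only = 0
    for _ in range(n_trials):
        a = b = 0
        crossed = False
        for i in range(1, n_per_arm + 1):
            # Both arms share the same true rate: there is no real effect.
            a += rng.random() < rate
            b += rng.random() < rate
            # Naive interim look with the unadjusted 1.96 threshold.
            if i % look_every == 0 and z_two_proportions(a, b, i) > critical:
                crossed = True
        rejected_any_look += crossed
        rejected_final_only += z_two_proportions(a, b, n_per_arm) > critical
    return rejected_any_look / n_trials, rejected_final_only / n_trials


peek_rate, final_rate = false_positive_rates()
print("false positive rate, testing at every look:", peek_rate)
print("false positive rate, testing once at the end:", final_rate)
```

The last look coincides with the final analysis, so the any-look rate can never be lower than the final-only rate; the interesting part is how much higher it is.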
My experience with running randomized trials of commercial programs is very similar to that described by Ian Fellows: Roughly that causal effects of test treatment vs. standard-of-care (or in commercial parlance, "challenger" vs. "champion") are highly dependent on testing context, and therefore the assumption that we know what the causal effect of placebo / standard-of-care would have been in the control group without having to actually bother to run it is very, very far from reliable. The reason that, contra the excerpt in the post, this doesn't imply that the causal effects of treatments are so unstable that there is no practical point to an RCT is that the difference between the causal effects of test and control can be reasonably consistent across a wider variety of contexts (and therefore more projectable into the future) than the absolute size of either effect.
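A toy calculation shows the distinction. The numbers below are entirely hypothetical and use expected rates with no sampling noise: the response rate under standard of care shifts between contexts, while the treatment adds a constant lift. Comparing treated patients against historical records picks up the context shift, whereas a concurrent control group cancels it.

```python
# Hypothetical expected response rates under standard of care, by context.
TRUE_LIFT = 0.10
baseline_by_context = {"historical records": 0.30, "new trial site": 0.42}


def treated_rate(context):
    """Expected response rate when the test treatment is given in a context."""
    return baseline_by_context[context] + TRUE_LIFT


# Historical control: new-site treated patients versus old records.
historical_estimate = (
    treated_rate("new trial site") - baseline_by_context["historical records"]
)
# Concurrent control: treated and control patients drawn from the same site.
concurrent_estimate = (
    treated_rate("new trial site") - baseline_by_context["new trial site"]
)

print("estimate with historical control:", round(historical_estimate, 2))
print("estimate with concurrent control:", round(concurrent_estimate, 2))
```

In this sketch the historical comparison attributes the shift in baseline to the treatment, which is the failure mode described above when the control arm is replaced by past records.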
In re John's comment:
"Since (a) meets or exceeds Ian's "high bar of proof", (b) shows us that fighting tooth and nail for the existing system is misguided."
I think that what I would really need to understand about EXP3.P is how sensitive it is to two factors:
1) Difficult-to-measure group differences. There is a lot in medicine where careful measurement of covariates can't remove bias due to physician knowledge (see channeling bias, aka confounding by indication, for the most dramatic example).
2) How susceptible is the approach to being gamed? The current approach is very difficult to manipulate. Given the huge financial rewards in drug approval (maybe not an optimal system but the universe we are in), it is easy to imagine a company "putting their thumb on the scale".
There may be diseases and conditions where this highly conservative approach doesn't make sense. But the reason it is in place is not that trialists don't understand the weaknesses of the system.
In my own field, there is a huge debate about the use of observational studies for drug effects and many of the same issues come up.