Brief summary notes on Statistical Thinking for enabling better review of clinical trials.

This post is by Keith O’Rourke and, as with all posts and comments on this blog, is just a deliberation on dealing with uncertainties in scientific inquiry and should not be attributed to any entity other than the author. As with any critically-thinking inquirer, the views behind these deliberations are always subject to rethinking and revision at any time.

This post was spurred by Andrew’s recent post on Statistical Thinking enabling good science.

The day of that post, I happened to look in my email’s trash and noticed that it went back to 2011. One email way back then had an attachment entitled Learning Priorities of RCT versus Non-RCTs. I had forgotten about it. It was one of the last things I had worked on when I last worked in drug regulation.

It was a draft of summary points I was putting together for clinical reviewers (clinicians and biologists working in a regulatory agency) to give them a sense of (hopefully good) statistical thinking in reviewing clinical trials for drug approval. I thought it brought out many of the key points that were in Andrew’s post and in the paper by Tong that Andrew was discussing.

Now my summary points are in terms of statistical significance, type one error and power, but that was 2011. Additionally, I do believe (along with David Spiegelhalter) that regulatory agencies need to have lines drawn in the sand, or set cut points. They have to approve or not approve. As the seriousness of the approval increases, arguably these set cut points should move from being almost automatic defaults to inputs into a weight-of-evidence evaluation that may overturn them. I am now working on a post to give an outline of what usually happens in drug regulation. I have received some links to material from a former colleague to help update my 2011 experience base.

For this post, I have made some minor edits; it is not meant to be polished prose but simply summary notes. I thought it might be of interest to some, and hey, I have not posted in over a year and this one was quick and easy.

What can you learn from randomized versus non-randomized comparisons?
What You Can’t Learn (WYCL);
How/Why That’s Critical (HWTC);
Anticipate How To Lessen these limitations (AHTL)

What can you learn from randomized comparisons?
What You Can’t Learn (WYCL);
How/Why That’s Critical (HWTC);
Anticipate How To Lessen these limitations (AHTL)

The crucial uncertainties with randomized comparisons are:
1. With perfect execution, just two: the variation in covariate distribution imbalance, and the size of the signal of interest. The first, covariate distribution imbalance, is extra-sample or counterfactual in that with randomization you are assured balance in distribution, but any given randomization favours treatment or control by some amount – it is just not recognizable given there are unobserved covariates. However, it does not systematically favour either treatment or control, and under a null effect it leads to statistical significance only 5% of the time (i.e. the type one error rate). As for the size of the signal of interest, which determines power (a bigger signal having higher power), it is never known but only conjectured, often from limited and faulty historical data.

This is unfortunate, as it is critical to get the ratio of power to type one error high (e.g. 80% to 5%), as this better separates null signals from real signals (when it is unknown which are which). One way to see the problem of low power, say 20%, is that when there is a real signal (of just the right size to give 20% power), only about 20% of the studies investigating it will reach significance and get much attention. That subset will be highly biased upwards, being a small non-random subset of all the studies that could have been done – the ones where the observed treatment comparison happened to be very large (most likely because the covariate imbalance already by chance favoured treatment over control by more than a trivial amount). This results in exaggerated treatment effect estimates. The exaggeration goes away quickly as power increases, and even 50% power is often enough to make it unimportant (see the toy simulation after this list).

2. The execution of the randomized comparison was not perfect, but actually flawed to a degree that invalidates the above reasoning. There will always be uncertainty about how likely it is that such flaws will be noticed and about exactly how much they impaired the comparison, i.e. increased the type one error above 5% or decreased power.
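To make point 1 concrete, here is a rough toy simulation (my own made-up sample size and effect sizes, not part of the original notes). Under a null effect, chance covariate imbalance still produces "significance" only about 5% of the time; with a real but underpowered effect, the estimates that happen to reach significance exaggerate it, and the exaggeration shrinks as power grows.

```python
# Toy simulation: type one error under the null, and exaggeration of
# significant estimates at low power. All numbers here are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2011)
n_per_arm = 50

def simulate(true_effect, n_sims=10000):
    n_sig, sig_estimates = 0, []
    for _ in range(n_sims):
        # An influential but unobserved covariate: randomization balances it
        # only in expectation, not in any single trial.
        covariate_c = rng.normal(0, 1, n_per_arm)
        covariate_t = rng.normal(0, 1, n_per_arm)
        y_control = covariate_c + rng.normal(0, 1, n_per_arm)
        y_treated = true_effect + covariate_t + rng.normal(0, 1, n_per_arm)
        estimate = y_treated.mean() - y_control.mean()
        if stats.ttest_ind(y_treated, y_control).pvalue < 0.05:
            n_sig += 1
            sig_estimates.append(estimate)
    return n_sig / n_sims, (np.mean(sig_estimates) if sig_estimates else float("nan"))

rate, _ = simulate(true_effect=0.0)
print(f"Null effect: 'significant' in {rate:.1%} of trials (the type one error rate)")

for effect in (0.25, 0.5, 0.8):   # hypothetical true effects, giving low to high power
    power, mean_sig = simulate(true_effect=effect)
    print(f"True effect {effect}: power {power:.0%}, mean estimate among "
          f"significant trials {mean_sig:.2f} (exaggerated by about x{mean_sig / effect:.1f})")
```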

What You Can’t Learn (WYCL):
1. How to further decrease type one error or increase power.
2. Almost anything about the treatment mechanism for the signal detected.
3. How to credibly generalize the signal much beyond the randomized study itself.
4. How to get a good type one error rate and power/type one error balance for all, most, or even a few subgroups.

How/Why That’s Critical (HWTC):
1. Even with perfection, you will sometimes be wrong (at least 5% of the time and often 5%+) about benefits exceeding risks, and always uncertain about the precise Benefit/Harm ratio (there just are no good confidence interval methods that give even close to 5% error rates in ethical randomized comparisons in humans).
2. Wrong much more often than 5% of the time about benefits exceeding risks for subgroups, and almost always wrong about the Benefit/Harm ratio (see the sketch after this list).
3. There is almost always low power for detecting operational bias, testing assumptions, dealing with non-compliance or, more generally, finding out how things were not perfect and made the 5% actually 5%+++.
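A quick toy illustration of point 2 (hypothetical setup: ten independent subgroups, no true effect anywhere): testing each subgroup at the 5% level yields at least one apparently "significant" benefit in far more than 5% of trials.

```python
# Sketch of the subgroup problem with invented numbers: a null treatment
# tested in 10 independent subgroups at the 5% level shows a "significant"
# difference somewhere in roughly 40% of trials (1 - 0.95**10).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subgroups, n_per_cell, n_trials, hits = 10, 30, 5000, 0

for _ in range(n_trials):
    any_significant = False
    for _ in range(n_subgroups):
        y_treated = rng.normal(0, 1, n_per_cell)   # no true effect in any subgroup
        y_control = rng.normal(0, 1, n_per_cell)
        if stats.ttest_ind(y_treated, y_control).pvalue < 0.05:
            any_significant = True
    hits += any_significant

print(f"At least one 'significant' subgroup in {hits / n_trials:.0%} of null trials")
```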

Anticipate How To Lessen these limitations (AHTL)
1. Nothing cures error better than replication and the never-ending hope of getting close to perfection next time (that “guarantees” the just-5% error rate).
2. Transparent displays of all the pieces and processes that go into learning (and can go wrong).
3. More focus on errors in estimating the Benefit/Harm ratio, rather than just on whether Benefit > Harm, and on subgroups and generalization.
4. Lots of sensitivity analyses that start an endless loop of WYCL; HWTC; AHTL;…

 

What can you learn from non-randomized comparisons?
What You Can’t Learn (WYCL);
How/Why That’s Critical (HWTC);
Anticipate How To Lessen these limitations (AHTL)

The crucial uncertainties with non-randomized comparisons are:
1. The compared groups will be initially non-comparable in ways that cannot be fully appreciated or noticed, and the steps taken to lessen this (restriction, matching, stratification, adjustment, weighting and various combinations of these and others) will not be known to work with any assurance (say, even 50% of the time). Anticipating further steps will be difficult and tenuous, and though the resulting non-comparability is sometimes known to be less than it was initially, the remaining degree of non-comparability will be largely unknown (see the sketch after this list).

2. In the almost never occurring case of actually getting comparable groups, all the uncertainties of randomized comparisons still remain. Although these uncertainties are likely smaller than in randomized comparisons (especially if large groups were used to try to make the groups less non-comparable), they can still be important.
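A small sketch of point 1, with invented numbers: even after adjusting for an observed prognostic covariate, a second prognostic covariate that also drove who got treated, but was never measured, leaves a residual bias of unknown size.

```python
# Sketch: adjustment for the observed covariate helps, but an unobserved one
# the groups also differ on leaves the comparison biased. Numbers are made up.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x_obs = rng.normal(0, 1, n)       # observed prognostic covariate
x_unobs = rng.normal(0, 1, n)     # unobserved prognostic covariate

# Sicker patients (higher x_obs and x_unobs) are more likely to get treated.
p_treat = 1 / (1 + np.exp(-(0.8 * x_obs + 0.8 * x_unobs)))
treated = rng.random(n) < p_treat

y = 0.0 * treated + x_obs + x_unobs + rng.normal(0, 1, n)   # the treatment truly does nothing

crude = y[treated].mean() - y[~treated].mean()

# "Adjustment" by regressing on the observed covariate only
X = np.column_stack([np.ones(n), treated, x_obs])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"crude difference:        {crude:.2f}")
print(f"adjusted for x_obs only: {beta[1]:.2f}   (still biased; the truth is 0)")
```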

What You Can’t Learn (WYCL):
1. How to get a good sense of how unequal the compared groups were made or could be made.
2. Very general or generic methods for noticing non-comparability, recognizing how to make groups less non-comparable and doing it – at least with the data in hand. It is always very situation-specific.

How/Why That’s Critical (HWTC):
1. Never know if you would be better off ignoring the data completely. Never!
2. Unlike randomized comparisons, it is very possible that more, even a lot more, of the same type of data collection will help very little, if at all.
3. Anticipating what kind of different data and how to obtain it is very important but difficult.
4. Carefully and fully evaluating the current data for clues as to what this may be is absolutely necessary, though not very rewarding in the short run. Almost never are there any quick visible successes, but rather just clearer understandings of how unlikely success ever is and how much work and uncertainty remain.

Anticipate How To Lessen these limitations (AHTL)
1. Identify key barriers to getting less unequal groups and ways to lessen these.
2. Communicate those clearly and widely.
3. Get the academic, pharmaceutical and regulatory communities to repeatedly do this, realizing there are few rewards for academics and even fewer for pharmaceutical firms (unless their products are currently being threatened).

 

14 thoughts on “Brief summary notes on Statistical Thinking for enabling better review of clinical trials”

  1. Keith, Thanks for your post. You write that even with an excellent randomized controlled trial, you will be “always uncertain about the precise Benefit/Harm ratio (there just are no good confidence interval methods that give even close to 5% error rates in ethical randomized comparisons in humans).”

    I’d like to learn more about this issue — can you recommend any good articles about it?

    Thanks!

  2. Thanks, Keith. I too would like to read more. There are so many new twists and turns to this topic. I want to digest the information as systematically as feasible. Moreover, I don’t see how without full access to data, we can make substantial headway. But even more fundamentally, there seems to be a major need for re-evaluating allopathic theories and modalities in light of publication biases that have been highlighted by several academics.

  3. AnonymousCommentator:

    There is a lot wrapped up in that one comment.

    We can start with the technical issue that confidence coverage is usually a false claim except under convenient assumptions (e.g. Normal) and for particular types of effects (e.g. a shift in the mean only). So for just benefit, with a large sample, continuous outcomes and an effect that happens to be common for all in the treated group, the coverage can be as advertised. But how do you know the effect was common for all in the treatment group? Now go to outcomes that are not continuous, effects that are not common (interactions), sample sizes that are not large enough, and ratios of effects (Benefit/Harm), and all bets should be off.

    Somehow all that uncertainty gets forgotten about or maybe never appreciated. Given that people who have taken a couple of service courses (and perhaps even obtained a PhD in statistics) take the claims literally, these misconceptions can do a lot of harm.

    I think there will be good references, but I don't currently have any in mind. A general strategy here is to take no one's word for it and do simulations. Create fake universes where you make everything known (ideally based on some biologically informed realistic treatment effects) and determine the actual coverage rates of the confidence interval method you will use. This general strategy was discussed in my talk here https://www.youtube.com/watch?v=I7AVP9BCm1g&feature=youtu.be (also slides and code), starting from a Bayesian perspective.

    Now other things to consider are that in ethical randomized comparisons in humans there will be non-compliance, withdrawals, loss of blinding, mis-measurement, etc., etc. More fake universes are needed to investigate repeated performance there.
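    As one possible starting point, here is a minimal sketch of such a fake universe (my own invented effect sizes and a naive delta-method interval, not the code from the talk): the true Benefit/Harm ratio is known by construction, so the actual coverage of the nominal 95% interval can simply be counted.

```python
# Minimal "fake universe" coverage check with made-up numbers: skewed outcomes,
# non-common (heterogeneous) treatment effects, and a naive delta-method
# interval for the Benefit/Harm ratio. The point is the workflow, not the model.
import numpy as np

rng = np.random.default_rng(2011)
n_per_arm = 100
true_benefit, true_harm = 0.5, 0.1          # mean effects on outcomes A and B
true_ratio = true_benefit / true_harm

def one_trial_covers():
    benefit_c = rng.lognormal(0, 1, n_per_arm)
    benefit_t = rng.lognormal(0, 1, n_per_arm) + rng.normal(true_benefit, 1.0, n_per_arm)
    harm_c = rng.lognormal(0, 1, n_per_arm)
    harm_t = rng.lognormal(0, 1, n_per_arm) + rng.normal(true_harm, 0.5, n_per_arm)

    db = benefit_t.mean() - benefit_c.mean()
    dh = harm_t.mean() - harm_c.mean()
    se_b = np.sqrt(benefit_t.var(ddof=1) / n_per_arm + benefit_c.var(ddof=1) / n_per_arm)
    se_h = np.sqrt(harm_t.var(ddof=1) / n_per_arm + harm_c.var(ddof=1) / n_per_arm)

    ratio = db / dh
    # Delta-method standard error for the ratio (assumes independence and a well-behaved denominator)
    se_ratio = abs(ratio) * np.sqrt((se_b / db) ** 2 + (se_h / dh) ** 2)
    return ratio - 1.96 * se_ratio <= true_ratio <= ratio + 1.96 * se_ratio

coverage = np.mean([one_trial_covers() for _ in range(5000)])
print(f"Actual coverage of the nominal 95% interval for Benefit/Harm: {coverage:.3f}")
```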

    • Keith said:
      “Somehow all that uncertainty gets forgotten about or maybe never appreciated. Given that people who have taken a couple of service courses (and perhaps even obtained a PhD in statistics) take the claims literally, these misconceptions can do a lot of harm.”

      Yes, yes, and yes.

      Keith also said,
      “A general strategy here is to take no one's word for it and do simulations. Create fake universes where you make everything known (ideally based on some biologically informed realistic treatment effects) and determine the actual coverage rates of the confidence interval method you will use. This general strategy was discussed in my talk here https://www.youtube.com/watch?v=I7AVP9BCm1g&feature=youtu.be (also slides and code), starting from a Bayesian perspective.

      Now other things to consider are that in ethical randomized comparisons in humans there will be non-compliance, withdrawals, loss of blinding, mis-measurement, etc., etc. More fake universes are needed to investigate repeated performance there.”

      Again, yes, yes, yes!

    • Keith, thanks for the added explanation, which helped me understand your post better. I had misunderstood the statement that I quoted — I had thought your statement was about benefit/harm ratios specifically. But now I see that your statement was much more general, and that the limitations you pointed out apply to many aspects of randomised trials, though they might be especially bad for benefit/harm ratios.

      I also understand now that by the “benefit/harm ratio” you mean the ratio of something like “average treatment effect for outcome A” to “average treatment effect for outcome B”, where the treatment is expected to improve A on average and worsen B on average. It looks like you are not referring to alternatives like the ratio of patients “helped” vs. “harmed” by the investigated treatment.

      I'll watch the lecture, thanks for the link. I agree simulation is important, with the caveat that when simulating it is easy to assume (a) a lack of overdispersion, (b) independence between characteristics, and (c) linearity of effects when the real data are (a) overdispersed, (b) correlated, and (c) nonlinear. I've seen these issues make simulations overly optimistic, though my experience with this is limited and I'm not sure how far the problems generalise.
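      A toy version of caveat (a), with made-up rates: a planning simulation that generates Poisson counts can make a nominal 95% interval for a rate ratio look fine, while the same interval undercovers if the real outcomes are overdispersed with the same means.

```python
# Coverage of a Poisson-style 95% interval for a rate ratio, checked in two
# worlds: the assumed Poisson world and an overdispersed negative-binomial
# world with the same means. All numbers are invented.
import numpy as np

rng = np.random.default_rng(8)

def coverage(draw, n_per_arm=200, mu_control=2.0, rate_ratio=0.8, n_sims=3000):
    hits = 0
    for _ in range(n_sims):
        y_control = draw(mu_control, n_per_arm)
        y_treated = draw(mu_control * rate_ratio, n_per_arm)
        log_rr_hat = np.log(y_treated.mean() / y_control.mean())
        # Poisson-style standard error for the log rate ratio (the assumption under test)
        se = np.sqrt(1 / y_treated.sum() + 1 / y_control.sum())
        hits += abs(log_rr_hat - np.log(rate_ratio)) <= 1.96 * se
    return hits / n_sims

poisson = lambda mu, size: rng.poisson(mu, size)
# Same mean as the Poisson draw, but variance mu + mu**2 / 2 (overdispersed)
negbin = lambda mu, size: rng.negative_binomial(2, 2.0 / (2.0 + mu), size)

print("coverage in the assumed (Poisson) world:", coverage(poisson))
print("coverage in an overdispersed (NB) world:", coverage(negbin))
```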

  4. You state that we cannot learn “Almost anything about the treatment mechanism for the signal detected” from an RCT. Why is that? Take RCTs on blood pressure lowering medicine that show a reduction in mortality and a reduction in blood pressure. Don't those studies show us that those medicines reduce mortality by reducing blood pressure? The same for statins: don't those RCTs show us that mortality and cardiovascular disease are reduced by statins via the mechanism of reducing cholesterol?

    • > via the mechanism
      The “via the mechanism” part requires a mediation analysis, which in 2011 was fairly suspect and not widely accepted.

      Now there are better approaches https://www.hsph.harvard.edu/tyler-vanderweele/tools-and-tutorials/ but “having some data and an aching desire to get an answer” may apply. (The RCT may not be adequate).

      > studies show us that those medicines reduce mortality by reducing blood pressure
      The last paper I read on blood pressure lowering medicine (2018) argued that it was very uncertain whether that was the mechanism.

    • > Take RCTs on blood pressure lowering medicine that show a reduction in mortality and a reduction in blood pressure. Don’t those studies show us that those medicines reduce mortality by reducing blood pressure

      No, they show us that taking the drug reduces mortality. Here are plausible mechanisms:

      1) Reducing blood pressure itself reduces physical stress on the arteries resulting in less chance of arterial failure.

      2) Taking the drug alters calcium metabolism, changing inflammatory responses in a way that reduces production of very low density lipoproteins and the activity of certain immune cells, thereby reducing the buildup of harmful plaques that damage the artery walls and increase the risk of both ischemic stroke and aneurysm.

      3) Taking the drug alters the brain causing people to naturally produce different hormones and choose different foods to eat, both of which alter some aspect of blood chemistry…

      4) …. etc etc

      The same goes for any trial. Only if you can find *many different* ways of altering the thing of interest, each with very similar results (for example, 5 totally different classes of drugs, all of which reduce cholesterol and all of which produce the same reduction in stroke risk and increase in longevity), can you maybe have some sense that you understand the mechanism.

      There are many different classes of medication for the treatment of hypertension, all working in different ways. What they all have in common is that a) they reduce blood pressure, and b) they reduce long-term cardiovascular mortality. I think it's pretty safe to interpret this as causation rather than association without causation.

        You misunderstand. Yes, RCTs give you the causal inference: taking the drug causes a reduction in cardiovascular mortality…

          but it doesn’t give you the *mechanistic inference* that taking the drug causes mortality reduction because of the blood pressure reduction.

          Consider two models:

          Drug -> Blood Pressure reduction -> Mortality reduction

          vs

          Drug -> Blood Pressure reduction
          Drug -> Other effect -> Mortality reduction

          How is a single RCT going to differentiate between these two?

          Even if you have 4 or 5 different classes of meds that treat hypertension, each one might have its own “other effect” that causes the mortality reduction.

          Note that Keith above mentions that current papers even question this mechanism.

          I'm not saying the mechanism is wrong, just that even a series of RCTs, all of which show that taking blood pressure medications reduces mortality, can't really tell us that much about the mechanism. We need a lot of additional stuff to really understand the mechanism. The easiest way to tease out a mechanism is to make multiple predictions based on that mechanism and see if they all come true together.
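          To put numbers on that, here is a toy construction (invented here, not anything from the thread): two simulated worlds, one where half of the mortality benefit runs through blood pressure and one where none of it does, that produce essentially identical treatment, BP and mortality summaries in a single RCT.

```python
# Two worlds with different mechanisms but the same observable RCT data.
# World A: Drug -> BP -> Mortality plus a direct Drug -> Mortality path
#          (half of the mortality effect is mediated through BP).
# World B: Drug -> Other Effect -> Mortality, with Other Effect -> BP too
#          (none of the mortality effect is mediated through BP).
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
t = rng.integers(0, 2, n)                      # randomized treatment (0/1)

# World A
bp_a = -1.0 * t + rng.normal(0, 1, n)
mort_a = 0.5 * bp_a - 0.5 * t + rng.normal(0, np.sqrt(1.75), n)

# World B
other = -1.0 * t + rng.normal(0, np.sqrt(0.5), n)
bp_b = other + rng.normal(0, np.sqrt(0.5), n)
mort_b = other + rng.normal(0, np.sqrt(1.5), n)

def summarize(bp, mort):
    d_bp = bp[t == 1].mean() - bp[t == 0].mean()
    d_mort = mort[t == 1].mean() - mort[t == 0].mean()
    corr = np.corrcoef(bp[t == 1], mort[t == 1])[0, 1]
    return f"BP effect {d_bp:.2f}, mortality effect {d_mort:.2f}, BP-mortality corr {corr:.2f}"

print("World A (mediated through BP):", summarize(bp_a, mort_a))
print("World B (not mediated by BP): ", summarize(bp_b, mort_b))
```

          With only these observables the two worlds are indistinguishable, so separating them takes mediation assumptions or different kinds of data, not more of the same RCT.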

        Sure, there probably is an “other effect” (possibly something to do with the endothelium or inner lining of the blood vessels), but there is still causation even if it is mediated through another effect.
          Your comment about mechanisms and predictions is sensible – a lot of this sort of research is going on. It's really hard, though, because biology is really, really complicated.

        • Right, I mean all of the following mechanistic descriptions will produce the same RCT results:

          Drug -> BP -> mortality

          Drug -> BP
          Drug -> Other Effect -> Mortality

          Drug -> BP <- Other Effect
          Drug -> Other Effect -> Mortality

          Drug -> BP -> Mortality (with BP <- Other Effect)
          Drug -> Other Effect -> Mortality

          etc

          RCTs can establish that there is a causal link between doing the treatment and getting the result under some background conditions, but they don't establish how that result comes about, and without an understanding of the mechanism we are left with very little predictive power for situations other than the one involved in the RCT… So if people have a different diet, or are on different additional drugs, or have other health issues, or exercise more, or exercise less, or have different levels of financial stress, or live in different climates, or whatever… we don't know much.

        Hmm… blog stripped my diagrams of spaces… the last two were meant to show the Other Effect feeding into BP as well:

          Drug -> BP <- Other Effect
          Drug -> Other Effect -> Mortality

          Drug -> BP -> Mortality (with BP <- Other Effect)
          Drug -> Other Effect -> Mortality
