Evidence-based medicine eats itself

Posted on February 10, 2020 9:39 AM by Andrew

There are three commonly stated principles of evidence-based research:

1. Reliance when possible on statistically significant results from randomized trials;

2. Balancing of costs, benefits, and uncertainties in decision making;

3. Treatments targeted to individuals or subsets of the population.

Unfortunately and paradoxically, the use of statistics for hypothesis testing can get in the way of the movement toward an evidence-based framework for policy analysis. This claim may come as a surprise, given that one of the meanings of evidence-based analysis is hypothesis testing based on randomized trials. The problem is that principle (1) above is in some conflict with principles (2) and (3).

The conflict with (2) is that statistical significance or non-significance is typically used at all levels to replace uncertainty with certainty—indeed, researchers are encouraged to do this and it is standard practice.

The conflict with (3) is that estimating effects for individuals or population subsets is difficult. A quick calculation finds that it takes 16 times the sample size to estimate an interaction as a main effect, and given that we are lucky if our studies are powered well enough to estimate main effects of interest, it will typically be hopeless to try to obtain the near-certainty regarding interactions. That is fine if we remember principle (2), but not so fine if our experiences with classical statistics have trained us to demand statistical significance as a prerequisite for publication and decision making.
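
The factor of 16 can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming (as the author explains in the comments below) that the standard error of an interaction is twice that of a main effect at a given sample size, and that the interaction is half the size of the main effect. The function name and its default arguments are illustrative and do not come from the post.

```python
def sample_size_multiplier(se_ratio: float = 2.0, effect_ratio: float = 0.5) -> float:
    """Factor by which n must grow to estimate an interaction with the same
    signal-to-noise ratio (estimate / SE) as a main effect.

    se_ratio     -- SE(interaction) / SE(main effect) at a fixed n (assumed 2)
    effect_ratio -- size of interaction / size of main effect (assumed 0.5)
    """
    if se_ratio <= 0 or effect_ratio <= 0:
        raise ValueError("both ratios must be positive")
    # Standard errors shrink like 1/sqrt(n), so closing a gap of
    # (se_ratio / effect_ratio) in signal-to-noise takes the square of it in n.
    return (se_ratio / effect_ratio) ** 2


# How the multiplier moves as the assumed relative size of the interaction changes.
for ratio in (0.25, 0.5, 1.0, 2.0):
    print(f"interaction/main = {ratio}: multiplier = {sample_size_multiplier(effect_ratio=ratio)}")
```

The multiplier is very sensitive to the assumed relative size of the interaction, so the number is best read as a consequence of the stated assumptions rather than a general constant.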

38 thoughts on “Evidence-based medicine eats itself”

AV on February 10, 2020 9:47 AM at 9:47 am said:

All this is crassly obvious, & cannot be proclaimed until some other belief takes hold out of practice.

- Andrew on February 10, 2020 10:10 AM at 10:10 am said:
  
  Av:
  
  This all may be obvious to you, but unfortunately it’s not obvious to many researchers. Indeed, it wasn’t obvious to me until recently! The purpose of much of academic research and writing is to figure out and explore ideas, looking at them in enough different ways until the ideas seem obvious to us.
  
  I will be very happy if we reach a time when the ideas of the above post are considered obvious by most statisticians, medical and social scientists, and quantitative analysts.
  
- GK on February 10, 2020 1:33 PM at 1:33 pm said:
  
  In some fields/organizations there can be either suspension of belief or lack of awareness. The subject matter experts and overseeing bodies are not always experts in statistics and statistical rigor is usually not the immediate motivating factor. As Andrew has voiced, I wish it could be considered obvious.
  
  - Martha (Smith) on February 10, 2020 3:45 PM at 3:45 pm said:
    
    “suspension of belief ”
    
    Do you perhaps mean “suspension of disbelief”?
    
Adede on February 10, 2020 11:59 AM at 11:59 am said:

Sounds to me like we should jettison (or modify) principle 1.

- Daniel Lakeland on February 10, 2020 1:07 PM at 1:07 pm said:
  
  Modify it to “reliance on experimental evidence with random assignment and blinding when possible” would be a much better version of (1) IMHO.
  
  - Andrew on February 10, 2020 1:21 PM at 1:21 pm said:
    
    Daniel:
    
    “Controlled” is more important than “randomized,” I think.
    
    - Daniel Lakeland on February 10, 2020 1:27 PM at 1:27 pm said:
      
      Yes. “Experimental” was intended to encompass that, though it might be better to phrase it explicitly, something like:
      
      “reliance on evidence from controlled experiments with random assignment and blinding when possible.” In other words, a controlled experiment is essential; random assignment and blinding are nice to have.
    - Peter Dorman on February 10, 2020 2:15 PM at 2:15 pm said:
      
      Let me see if I understand this. Randomization is useful for minimizing selection bias, but if there are constraints on sample size (as there typically are) stratification across expected confounders can be more helpful still. This was the issue (yes?) in the famous Gosset-Fisher debate, and when I taught stats to budding young field ecologists I reviewed the debate since so much of their data collection comes from small-n studies. Of course one can randomize within strata if you can get a large enough n.
      
      Also, there are potential issues with randomization depending on how the sample frame is constructed: you could have a randomized selection procedure but it might not be randomized with respect to the full population of interest — e.g. the famous example of randomized dialing of landline numbers.
    - Daniel Lakeland on February 10, 2020 2:24 PM at 2:24 pm said:
      
      To me randomization is a gadget that we can use to asymptotically eliminate correlations between stuff you’re doing and *anything at all*. This makes it an extremely useful gadget, but it’s just a gadget.
      
      The key is finding out by repeatedly causing something what the downstream effects of causing that thing are. This information is valuable even if you don’t have asymptotically zero correlations. Ideally your model can include these correlations and correctly account for the size of your uncertainty.
      
      For example telephone surveys during the 2016 election cycle should have had a nonresponse bias built in to their model… “we might be consistently seeing bias on the order of 5% in either direction” was a fairly safe bet given how polls work… But acknowledging it would have made poll output worthless, and so to make money pollsters ignored this issue.
      
      I mean imagine if they’d said “we polled 2000 people and we’ve concluded given the possibility of nonresponse bias that Hillary Clinton will receive between 40 and 60% of the vote and has a 50% chance of winning.”
      
      Your grandmother could have told you that for free.
    - Martha (Smith) on February 10, 2020 3:41 PM at 3:41 pm said:
      
      “For example telephone surveys during the 2016 election cycle should have had a nonresponse bias built in to their model… “we might be consistently seeing bias on the order of 5% in either direction” was a fairly safe bet given how polls work… But acknowledging it would have made poll output worthless, and so to make money pollsters ignored this issue.
      
      I mean imagine if they’d said “we polled 2000 people and we’ve concluded given the possibility of nonresponse bias that Hillary Clinton will receive between 40 and 60% of the vote and has a 50% chance of winning.
      
      Your grandmother could have told you that for free.”
      
      Even if she died long before 2016? Even if she died long before I was born? By some miracle or prescience? ;~)
    - Daniel Lakeland on February 10, 2020 7:20 PM at 7:20 pm said:
      
      One word: Ouija board… Two words… Use a Ouija board… four words!
      
      https://www.youtube.com/watch?v=FAxkcPoLYcQ
    - Keith O'Rourke on February 10, 2020 3:17 PM at 3:17 pm said:
      
      > Gosset-Fisher debate
      Also taken up as blinding via randomization being more important than ensuring imbalances in important confounders rarely occur.
      
      However, randomization is the only known cure for ignorance, with the main side effect being loss of precision.
      
      Its value will depend on the subject matter, but in medicine Mendelian Randomization is making it clearer that in treatment/exposure comparisons for treatments, it’s extremely important.
    - Jay on February 10, 2020 4:19 PM at 4:19 pm said:
      
      We already rely on RCTs when possible. The problem is it is often not possible. For example, investigating long-term effects of diet on chronic disease by RCTs is infeasible (due to compliance, for one thing), so we have to rely on observational studies.
    - Dale Lehman on February 10, 2020 4:49 PM at 4:49 pm said:
      
      This week’s New England Journal of Medicine has an article on observational vs RCTs, emphasizing the relative value of the latter and weaknesses of the former. It has some good recommendations on how RCTs can be made easier and less expensive to conduct. However, I think it paints an overly stark distinction between types of studies. RCTs usually depend on two questionable assumptions. One is that intention to treat, rather than actual treatment received, is the relevant randomization factor. The other is that the randomized groups are sufficiently large to reduce the sampling variability enough to be meaningful. For the latter, they do compare the randomized groups so that they look similar (or have confounders which could be modeled), but given the number of omitted variables we can never be sure that the randomized groups are sufficiently similar. Large enough sample sizes can offset this, but RCTs are expensive and often do not have very large sample sizes. At the same time, as the amount of observational data increases (both in observations and number of features), the performance of observational studies can get better.
      
      I would not propose that observational studies are preferred to RCTs, but I do see these as on a continuum rather than stark alternatives. Both types of studies have practical limitations which make them more similar than the NEJM article suggests. I often (too often these days) find myself looking for evidence on a medical condition or treatment, only to find that there are no reasonably close RCTs (especially given Andrew’s point about the need to see the effects on particular subgroups rather than looking for average effects), and that the observational data I would like to see is simply unavailable (although, in theory, much more observational data could be made available, were it not for the insane private insurance model we use in the US, with little standardization or sharing of data).
    - Keith O'Rourke on February 10, 2020 4:56 PM at 4:56 pm said:
      
      Dale:
      
      We seem to be learning via Mendelian Randomization that there are few meaningful subgroup effects in medicine (very few piranhas swim in biological systems).
      
      See – Professor George Davey Smith – Some constraints on the scope and potential of personalised medicine https://www.youtube.com/watch?v=uiCd9m6tmt0&t=2467s
    - Daniel Lakeland on February 10, 2020 7:40 PM at 7:40 pm said:
      
      Note however that randomized != controlled experimental…
      
      We can run an experiment where for example we use some prior knowledge and decision theory to choose a treatment and then observe the outcome and model the treatment response using known confounders. You can’t eliminate all confounders using large sample sizes with this method, but you can learn a lot, and in practice you can’t eliminate confounders with high sample sizes in RCTs either, because you never get to those large enough N anyway due to cost constraints etc.
    - Zad on February 10, 2020 7:52 PM at 7:52 pm said:
      
      @Daniel Lakeland: “in practice you can’t eliminate confounders with high sample sizes in RCTs either, because you never get to those large enough N anyway due to cost constraints etc.”
      
      I think this would most likely be a problem with trials that use simple randomization; it would depend largely on the size of the study, but such trials would also give you large standard errors to reflect the uncertainty. Then again, most experienced trialists and statisticians avoid simple randomization for this reason, because of potential imbalances, and focus on blocking and stratifying based on prior knowledge of potential confounding variables.
    - Daniel Lakeland on February 10, 2020 8:15 PM at 8:15 pm said:
      
      Zad, what you’re talking about is ways to make your data more informative if your model is correct (that is, you’ve blocked or stratified on properly meaningful variables and for reasonable values of those variables). So you can learn more with smaller sample sizes. But within any group, you’re still randomizing, and within that group the probabilistic independence with unknown confounders still is limited by sample size.
      
      Like for example, suppose you know women are different from men, and body weight is important in a medical treatment… So you split by women and men, and you put them into 3 groups weight 1, weight 2, and weight 3… so you have 3 * 2 = 6 different groups, then you randomize within each group between drug A and drug B… Now you decide you want to have say 100 people in each category for some reason, you need 1200 people, which means you need to recruit somewhat more than that because you’re demanding balance among all the groups… maybe you need to see 2000 people, sort them and put them all into the various groups. Now your medical treatment is $5000 and you’ve got 1200 people: $6M to run the trial. Sure, this is doable for some people, for others it’s 2 orders of magnitude more money than they have.
    - Sameera Daniels on April 12, 2020 9:47 AM at 9:47 am said:
      
      Hey, I actually understood your commentary. Well put.
    - Sameera Daniels on October 22, 2021 3:05 PM at 3:05 pm said:
      
      Another great response
Sander Greenland on February 10, 2020 4:50 PM at 4:50 pm said:

“A quick calculation finds that it takes 16 times the sample size to estimate an interaction as a main effect”…that “16” was an absurd value before and it still is. Why? Because there are just too many context-sensitive free parameters in the general formulas for the sample sizes. Consequently, anything more precise than “it will usually take a lot more observations to estimate heterogeneity than an average effect” will be BS based on glossed-over arbitrary settings (like making the interaction way smaller than the main effect, which is hardly a general law of nature); it’s just as absurd as the old advice I heard as a student that “a sample size of 30 is large” (for what setting and purpose?).

Consider that in a simple continuous-mean comparison from a 2×2 orthogonal randomized design, the standard error (SE) for the interaction contrast will only be double the SE of a single main-effect contrast from the trial, meaning that only 4 times the sample size would be needed to get the interaction SE down to what the main effect SE was. For tests, I published some not-so-quick calculations long ago for binary-data settings of interest in my applications, which gave most sizes much less than 16 times those for main effects (Greenland S (1983). Tests for interaction in epidemiologic studies: a review and a study of power. Statistics in Medicine 2:243-251), similar to what others got in the same type of setting.
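
The doubling of the standard error can be checked directly. This is a rough sketch, assuming a balanced 2×2 design with n/4 observations per cell, independent normal errors with a common standard deviation, the main effect defined as the difference between treated and control means, and the interaction defined as a difference in differences. All function names and the simulation settings are illustrative choices, not taken from the cited paper.

```python
import math
import random
import statistics


def se_main(sigma: float, n: int) -> float:
    """SE of a main-effect contrast: mean of n/2 treated minus mean of n/2 controls."""
    return math.sqrt(sigma**2 / (n / 2) + sigma**2 / (n / 2))


def se_interaction(sigma: float, n: int) -> float:
    """SE of a difference in differences built from four cell means of n/4 each."""
    return math.sqrt(4 * sigma**2 / (n / 4))


def simulate_se(n: int = 200, sigma: float = 1.0, reps: int = 2000, seed: int = 1):
    """Monte Carlo estimate of both SEs when every true effect is zero."""
    rng = random.Random(seed)
    per_cell = n // 4
    mains, inters = [], []
    for _ in range(reps):
        # Sample mean of each of the four cells, indexed by the two binary factors.
        cell = {
            (a, b): sum(rng.gauss(0.0, sigma) for _ in range(per_cell)) / per_cell
            for a in (0, 1)
            for b in (0, 1)
        }
        # Main effect of factor a, averaging over factor b.
        mains.append((cell[1, 0] + cell[1, 1]) / 2 - (cell[0, 0] + cell[0, 1]) / 2)
        # Interaction: effect of b when a = 1 minus effect of b when a = 0.
        inters.append((cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0]))
    return statistics.stdev(mains), statistics.stdev(inters)


print("analytic ratio:", se_interaction(1.0, 200) / se_main(1.0, 200))
sim_main, sim_inter = simulate_se()
print("simulated ratio:", sim_inter / sim_main)
```

Because standard errors scale as one over the square root of n, a ratio of 2 in standard errors corresponds to a factor of 4 in sample size. Other definitions of the interaction contrast (for example, with an extra factor of one half) would change the ratio.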

- Andrew on February 10, 2020 5:46 PM at 5:46 pm said:
  
  Sander:
  
  I agree that 16 is not a magic number; it’s the product of assumptions. The larger the interactions are, the smaller this number will be. I don’t think that my number of 16 is “B.S.”; it’s clearly derived from its assumptions.
  
  Just one thing: In your comment, you write, “in a simple continuous-mean comparison from a 2×2 orthogonal randomized design, the standard error (SE) for the interaction contrast will only be double the SE of a single main-effect contrast from the trial.” I agree. But that’s what I say too! The factor of 16 in sample size comes from two factors: the factor of 4 in sample size arising from the factor of 2 in SE that you mention, and my assumption that interactions are half the size of main effects. If you’re disagreeing with me on the factor of 16, it’s because you’re saying that your interactions of interest are more than half the size of main effects. It’s hard to know about this, but I agree that the number we get will depend on this assumption.
  
  - Sander Greenland on February 10, 2020 8:19 PM at 8:19 pm said:
    
    OK we agree on the math (mere arithmetic once you plug in the numbers). But it’s a pet peeve of mine when anyone tosses out context-sensitive numbers that are unmoored from context. In my biz sometimes the interactions are bigger than the average effects, occasionally to the point of effect reversal (as with my own dissertation’s real-example data!). No surprise as there are treatments that kill some patients and save others, especially in the Wild West of real clinical practice (which includes off-label and even contraindicated usage) as opposed to the carefully-selected patients and protocols in the refined world of RCTs. In that kind of reality, saying it takes 16 times the sample size is destructive, since not only is it wrong in general but it will make it sound like there is no point in proposing to examine interactions – but if that is not in the protocol then some will scream “data dredging!” if you look at them. So yeah I think tossing off a number like 16 (and repeatedly) is very very bad nonsense, really statistical numerology (like most decontextualized “applied statistics”).
    
    - Andrew on February 10, 2020 8:41 PM at 8:41 pm said:
      
      Sander:
      
      Fair enough. All such advice is context dependent.
    - Sameera Daniels on April 12, 2020 9:51 AM at 9:51 am said:
      
      Also well put.
      
      All in all, it would be great if clinicians and healthcare policy experts would follow Andrew’s blog.
    - Sameera Daniels on October 22, 2021 3:08 PM at 3:08 pm said:
      
      Sander should be inducted into the National Academy of Sciences, if not already a member. My hope though is that he will undertake a statistics primer.
Nick Adams on February 10, 2020 10:40 PM at 10:40 pm said:

Ok, so it is not easy, but small incremental gains can get you a long way.
The amelioration of symptoms and prognosis of almost every common disease has improved since I started clinical medicine in 1987; progress built on very many RCTs, none of them perfect but together forming a tapestry of overlapping evidential strands that can be read.

- Andrew on February 11, 2020 9:09 AM at 9:09 am said:
  
  Nick:
  
  The question is, would this benefit have occurred without RCTs, just by clinicians and researchers trying different things and publishing their qualitative findings? I have no idea (by which I really mean I have no idea, not that I’m saying that RCTs have no value).
  
  - Daniel Lakeland on February 11, 2020 12:25 PM at 12:25 pm said:
    
    > just by clinicians and researchers trying different things and publishing their qualitative findings?
    
    If researchers had zero personal incentives to do this, then sure… But in the presence of career incentives to publish stuff… then the literature would be totally polluted with bullshit.
    
    hey wait…
    
jim on February 11, 2020 1:23 AM at 1:23 am said:

God forbid we ever find out for certain if organic food is healthier or not, or if calcium pills protect women from bone loss, or if two aspirin on rainy Thursdays helps protect the intellect from Ted talk damage (strengthening the skull against cranial implosions).

Same for social science. OMG I fear the day when there is precise definition for “food desert”; when we know what “quality preschool” is and what it does; when it’s known that we’ve become an “equal” society.

Thousands – nae! Tens of thousands! – would be out of work!

Save NHST! Save the economy!

Oliver on February 11, 2020 5:02 AM at 5:02 am said:

It seems odd to me that no-one has mentioned systematic reviews and meta-analyses (of RCTs) in this discussion so far, as they are generally considered the highest level of evidence in evidence-based medicine. As with anything this approach can have its limitations, but there is at least less of a focus on statistical significance for each individual RCT, more of a focus on magnitude of effect size and an acknowledgement that treatment effects can differ across different populations. Where enough data are provided for each study, or where individual patient data can be shared, there is also the potential to gain greater understanding of differences between subgroups than can be achieved by any one study.

jrkrideau on February 11, 2020 9:33 AM at 9:33 am said:

precise definition for “food desert” == Rubʿ al Khali.

I once spent ~2 weeks trying to find a definition for “housing unit” (roughly house/apartment, maybe?). It appeared that in the USA & Canada it was a case of “I know one when I see it”.

Justin on February 11, 2020 11:36 AM at 11:36 am said:

“..statistical significance or non-significance is typically used at all levels to replace uncertainty with certainty..”

I don’t agree that that is how significance testing is typically used. The wording is odd to me. If I said there is evidence from a well-designed experiment(s) to suggest a coin is unfair, I am not stating that as a truth with a capital T certainty, but as evidence for, and at a certain alpha level, and I allow for errors and discuss any assumptions.

“A quick calculation finds that it takes 16 times the sample size to estimate an interaction as a main effect, and given that we are lucky if our studies are powered well enough to estimate main effects of interest, it will typically be hopeless to try to obtain the near-certainty regarding interactions”

Then change the design to not do interactions, and/or get a larger sample (may need to save up some $). That still might be preferable to using a prior to get at the interaction. And did that 16 come from replacing uncertainty with certainty? ;)

Justin

- Chris Wilson on February 11, 2020 12:56 PM at 12:56 pm said:
  
  Justin, it is all over the place in applied research. You routinely see papers thresholding results as follows:
  1. Run a bunch of analyses on everything
  2. Report which ones have p < 0.05
  4. Report LS means or sample statistics for the responses passing #2.
  
  I see a lot of sophisticated stat types getting on Andrew’s case for “strawman NHST”, but what he is describing is rampant and widespread. Besides, I have never seen a rigorous research program out in the wild cleaving to a Neyman-Pearson decision framework consistently for long enough for type I error rates to matter…
  
- Anon on February 11, 2020 6:57 PM at 6:57 pm said:
  
  ‘then change the design to not do interactions’. I’m curious to see how you see that playing out in most real applied cases. Also, what’s the problem with a prior on an interaction? Is making some assumptions about the shape of a parameter of interest egregiously worse than the myriad of other assumptions made in your favorite stats? Or is that just the one you think is rhetorically easiest to pick on? Why is this assumption worse than, say, setting an utterly arbitrary alpha level? Subjectivity is all around us.
  
jim on February 12, 2020 7:52 AM at 7:52 am said:

Andrew:

I was thinking the other day: it’s great that there is a group of people with extraordinary statistical expertise who can identify problems with NHST and suggest alternatives; but if any method is going to be trickled down into daily use and standard practice, it’s going to be used by a much broader group of people with substantially less statistical expertise. Under those conditions, there will always be people who just want to put guts in the machine and get sausage out, and not worry too much about what happens inside. What will the shortcomings of the alternative methods be under those circumstances?

Not defending NHST by any means. But the more widely any method is used the more widely it will be abused, so that’s something to consider.

- Andrew on February 20, 2020 11:35 AM at 11:35 am said:
  
  Jim:
  
  I agree.
  
  For example, suppose we characterize the current standard approach as:
  
  Approach 0: Compute classical confidence intervals and then report YES THERE’S AN EFFECT if the interval clearly excludes zero and report MAYBE THERE’S A SMALL EFFECT if the endpoint of the interval is very close to zero and report THERE’S NO EFFECT if zero is well within the interval.
  
  Now consider the following reform:
  
  Approach 1: Use the same classification rule as above but with Bayesian posterior intervals. I think this approach would be an improvement, because it lets us include prior information. But it still has major problems.
  
  Then we can move to:
  
  Approach 2: Do Approach 1, but instead of looking at comparisons or estimates one at a time, look at all of them at once, if possible embedding them in a hierarchical model. I think this would be a further improvement, because it uses more information and helps us avoid selection bias relating to forking paths. But it still has the problem that it’s extracting certainty from uncertainty.
  
  So this moves to some sort of:
  
  Approach 3: Do good modeling and report uncertainty intervals conditional on the model, but don’t use overlap-with-zero as a way of making strong deterministic-sounding statements.
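
The contrast between Approach 0 and Approach 3 can be sketched in code. This is only an illustration under loud assumptions: the interval is a normal-approximation interval from an estimate and a standard error, the `margin` that decides when an endpoint is "very close to zero" is a hypothetical tuning parameter that the comment does not specify, and the reporting function stands in for whatever interval a well-built model would actually produce.

```python
from statistics import NormalDist


def interval(estimate: float, se: float, level: float = 0.95):
    """Normal-approximation interval (an assumption; a real model may differ)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


def approach_0_label(lo: float, hi: float, margin: float) -> str:
    """Three-way classification rule described as Approach 0.

    margin is a hypothetical tolerance for 'endpoint very close to zero'.
    """
    if min(abs(lo), abs(hi)) <= margin:
        return "MAYBE THERE'S A SMALL EFFECT"
    if lo > 0 or hi < 0:
        return "YES THERE'S AN EFFECT"
    return "THERE'S NO EFFECT"


def approach_3_report(estimate: float, se: float, level: float = 0.95) -> str:
    """Report the estimate and its interval, with no verdict attached."""
    lo, hi = interval(estimate, se, level)
    return f"estimate {estimate:.2f}, {level:.0%} interval [{lo:.2f}, {hi:.2f}]"


# Three hypothetical estimates sharing the same standard error.
for est in (0.5, 2.0, 3.0):
    lo, hi = interval(est, 1.0)
    print(approach_0_label(lo, hi, margin=0.1), "|", approach_3_report(est, 1.0))
```

Nearby estimates can receive very different labels under the classification rule, while the interval report changes smoothly with the estimate.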
  
Statistical Modeling, Causal Inference, and Social Science
