Evidence-based medicine eats itself

There are three commonly stated principles of evidence-based research:

1. Reliance when possible on statistically significant results from randomized trials;

2. Balancing of costs, benefits, and uncertainties in decision making;

3. Treatments targeted to individuals or subsets of the population.

Unfortunately and paradoxically, the use of statistics for hypothesis testing can get in the way of the movement toward an evidence-based framework for policy analysis. This claim may come as a surprise, given that one of the meanings of evidence-based analysis is hypothesis testing based on randomized trials. The problem is that principle (1) above is in some conflict with principles (2) and (3).

The conflict with (2) is that statistical significance or non-significance is typically used at all levels to replace uncertainty with certainty—indeed, researchers are encouraged to do this and it is standard practice.

The conflict with (3) is that estimating effects for individuals or population subsets is difficult. A quick calculation finds that it takes 16 times the sample size to estimate an interaction as it does to estimate a main effect, and given that we are lucky if our studies are powered well enough to estimate main effects of interest, it will typically be hopeless to try to obtain this kind of near-certainty regarding interactions. That is fine if we remember principle (2), but not so fine if our experiences with classical statistics have trained us to demand statistical significance as a prerequisite for publication and decision making.
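Here is a minimal sketch of where the 16 comes from, under two assumptions: the interaction contrast has twice the standard error of the main-effect contrast at a given sample size, and the interaction of interest is half the size of the main effect.

```python
# Sketch of the "16 times the sample size" arithmetic, under two assumptions:
# (a) the interaction contrast has twice the standard error of the main-effect
#     contrast at the same sample size, and
# (b) the interaction of interest is half the size of the main effect.
# Required sample size scales with (standard error / effect size)^2.

def required_n_ratio(se_inflation=2.0, effect_ratio=0.5):
    """Sample size needed for the interaction, relative to the main effect."""
    return (se_inflation / effect_ratio) ** 2

print(required_n_ratio())           # 16.0 under assumptions (a) and (b)
print(required_n_ratio(2.0, 1.0))   # 4.0 if the interaction is as large as the main effect
```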

38 thoughts on “Evidence-based medicine eats itself”

    • Av:

      This all may be obvious to you, but unfortunately it’s not obvious to many researchers. Indeed, it wasn’t obvious to me until recently! The purpose of much of academic research and writing is to figure out and explore ideas, looking at them in enough different ways until the ideas seem obvious to us.

      I will be very happy if we reach a time when the ideas of the above post are considered obvious by most statisticians, medical and social scientists, and quantitative analysts.

    • In some fields/organizations there can be either suspension of belief or lack of awareness. The subject matter experts and overseeing bodies are not always experts in statistics, and statistical rigor is usually not the immediate motivating factor. As Andrew has voiced, I wish it could be considered obvious.

        • Yes, “experimental” was intended to encompass that, though it might be better to phrase it explicitly, something like:

          “reliance on evidence from controlled experiments with random assignment and blinding when possible”; in other words, a controlled experiment is essential, while random assignment and blinding are nice to have.

        • Let me see if I understand this. Randomization is useful for minimizing selection bias, but if there are constraints on sample size (as there typically are) stratification across expected confounders can be more helpful still. This was the issue (yes?) in the famous Gosset-Fisher debate, and when I taught stats to budding young field ecologists I reviewed the debate since so much of their data collection comes from small-n studies. Of course one can randomize within strata if you can get a large enough n.

          Also, there are potential issues with randomization depending on how the sample frame is constructed: you could have a randomized selection procedure but it might not be randomized with respect to the full population of interest (e.g., the famous example of randomized dialing of landline numbers).

        • To me randomization is a gadget that we can use to asymptotically eliminate correlations between stuff you’re doing and *anything at all*. This makes it an extremely useful gadget, but it’s just a gadget.

          The key is finding out by repeatedly causing something what the downstream effects of causing that thing are. This information is valuable even if you don’t have asymptotically zero correlations. Ideally your model can include these correlations and correctly account for the size of your uncertainty.

          For example, telephone surveys during the 2016 election cycle should have had nonresponse bias built into their model… “we might be consistently seeing bias on the order of 5% in either direction” was a fairly safe bet given how polls work… But acknowledging it would have made poll output worthless, and so to make money pollsters ignored this issue.

          I mean imagine if they’d said “we polled 2000 people and we’ve concluded given the possibility of nonresponse bias that Hillary Clinton will receive between 40 and 60% of the vote and has a 50% chance of winning.”

          Your grandmother could have told you that for free.
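          Here is a rough numerical sketch of that point (numbers assumed, not from any actual poll): the pure sampling error at n = 2000 is small next to a plausible bias term.

```python
import math

# Rough sketch with assumed numbers: sampling error for a poll of n = 2000
# versus the extra spread from a possible nonresponse bias of ~5 points
# in either direction.
n, p = 2000, 0.5
sampling_se = math.sqrt(p * (1 - p) / n)   # about 0.011, i.e. ~1.1 points
bias = 0.05                                # assumed plausible bias, either direction

print(f"sampling margin alone (~2 SE): +/- {2 * sampling_se:.1%}")
print(f"adding +/-{bias:.0%} possible bias: roughly "
      f"{p - bias - 2 * sampling_se:.0%} to {p + bias + 2 * sampling_se:.0%}")
```

          The 40 to 60% range above presumably allows for an even more generous bias term; the point is just that the bias, not the sampling error, dominates.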

        • “Your grandmother could have told you that for free.”

          Even if she died long before 2016? Even if she died long before I was born? By some miracle or prescience? ;~)

        • > Gosset-Fisher debate
          Also taken up as the question of whether blinding via randomization is more important than ensuring that imbalances in important confounders rarely occur.

          However, randomization is the only known cure for ignorance, with the main side effect being loss of precision.

          Its value will depend on the subject matter, but in medicine, Mendelian randomization is making it clearer that in treatment/exposure comparisons it’s extremely important.

        • We already rely on RCTs when possible. The problem is it is often not possible. For example, investigating long-term effects of diet on chronic disease by RCTs is infeasible (due to compliance, for one thing), so we have to rely on observational studies.

        • This week’s New England Journal of Medicine has an article on observational studies vs. RCTs, emphasizing the relative value of the latter and weaknesses of the former. It has some good recommendations on how RCTs can be made easier and less expensive to conduct. However, I think it paints an overly stark distinction between types of studies. RCTs usually depend on two questionable assumptions. The first is that intention to treat, rather than actual treatment received, is the relevant randomization factor. The other is that the randomized groups are sufficiently large to reduce the sampling variability enough to be meaningful. For the latter, researchers do compare the randomized groups to check that they look similar (or have confounders which could be modeled), but given the number of omitted variables we can never be sure that the randomized groups are sufficiently similar. Large enough sample sizes can offset this, but RCTs are expensive and often do not have very large sample sizes. At the same time, as the amount of observational data increases (both in observations and number of features), the performance of observational studies can get better.

          I would not propose that observational studies are preferred to RCTs, but I do see these as lying on a continuum rather than being stark alternatives. Both types of studies have practical limitations which make them more similar than the NEJM article suggests. I often (too often these days) find myself looking for evidence on a medical condition or treatment, only to find that there are no reasonably close RCTs (especially given Andrew’s point about the need to see the effects on particular subgroups rather than looking for average effects), and that the observational data I would like to see is simply unavailable (although, in theory, much more observational data could be made available, were it not for the insane private insurance model we use in the US, with little standardization or sharing of data).

        • Note however that randomized != controlled experimental…

          We can run an experiment where for example we use some prior knowledge and decision theory to choose a treatment and then observe the outcome and model the treatment response using known confounders. You can’t eliminate all confounders using large sample sizes with this method, but you can learn a lot, and in practice you can’t eliminate confounders with high sample sizes in RCTs either, because you never get to those large enough N anyway due to cost constraints etc.

        • @Daniel Lakeland: “in practice you can’t eliminate confounders with high sample sizes in RCTs either, because you never get to those large enough N anyway due to cost constraints etc.”

          I think this would most likely be a problem with trials that use simple randomization, which would largely depend on the size of the study, though it would also give you large standard errors to reflect the uncertainty. Then again, most experienced trialists and statisticians avoid simple randomization for this reason, because of the potential imbalances, and focus on blocking and stratifying based on prior knowledge of potential confounding variables.

        • Zad, what you’re talking about is ways to make your data more informative if your model is correct (that is, you’ve blocked or stratified on properly meaningful variables and for reasonable values of those variables). So you can learn more with smaller sample sizes. But within any group, you’re still randomizing, and within that group the probabilistic independence with unknown confounders still is limited by sample size.

          Like for example, suppose you know women are different from men, and body weight is important in a medical treatment… So you split by women and men, and you put them into 3 groups, weight 1, weight 2, and weight 3… so you have 3 * 2 = 6 different strata, then you randomize within each stratum between drug A and drug B… Now if you decide you want, say, 100 people in each category, you need 1200 people, which means you need to recruit somewhat more than that because you’re demanding balance among all the groups… maybe you need to see 2000 people, sort them, and put them all into the various groups. Now your medical treatment is $5000 per person and you’ve got 1200 people: $6M to run the trial. Sure, this is doable for some people; for others it’s 2 orders of magnitude more money than they have.
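          (A throwaway script version of that arithmetic, using the numbers assumed above:)

```python
# The stratified-trial arithmetic above, with the assumed numbers from the comment.
sexes, weight_groups, arms = 2, 3, 2     # strata times two treatment arms
people_per_cell = 100
cost_per_person = 5000                   # dollars, assumed

n_enrolled = sexes * weight_groups * arms * people_per_cell   # 1200
total_cost = n_enrolled * cost_per_person                     # 6,000,000

print(n_enrolled, f"${total_cost:,}")
```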

  1. “A quick calculation finds that it takes 16 times the sample size to estimate an interaction as it does to estimate a main effect”…that “16” was an absurd value before and it still is. Why? Because there are just too many context-sensitive free parameters in the general formulas for the sample sizes. Consequently, anything more precise than “it will usually take a lot more observations to estimate heterogeneity than an average effect” will be BS based on glossed-over arbitrary settings (like making the interaction way smaller than the main effect, which is hardly a general law of nature); it’s just as absurd as the old advice I heard as a student that “a sample size of 30 is large” (for what setting and purpose?).

    Consider that in a simple continuous-mean comparison from a 2×2 orthogonal randomized design, the standard error (SE) for the interaction contrast will only be double the SE of a single main-effect contrast from the trial, meaning that only 4 times the sample size would be needed to get the interaction SE down to what the main-effect SE was. For tests, I published some not-so-quick calculations long ago for binary-data settings of interest in my applications, which gave sample sizes mostly much less than 16 times those for main effects (Greenland S (1983). Tests for interaction in epidemiologic studies: a review and a study of power. Statistics in Medicine 2:243-251), similar to what others got in the same type of setting.
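    A quick simulation sketch (assumed iid normal errors, balanced 2×2 design; not from the paper) bears out that factor of 2:

```python
import numpy as np

# Simulation sketch: in a balanced 2x2 design with iid normal errors, the
# interaction contrast should have about twice the standard error of a
# main-effect contrast, hence the factor of 4 in sample size.
rng = np.random.default_rng(0)
n_per_cell, reps = 50, 20000

main, inter = [], []
for _ in range(reps):
    m = {(a, b): rng.normal(0.0, 1.0, n_per_cell).mean()
         for a in (0, 1) for b in (0, 1)}          # cell means under the null
    main.append((m[1, 0] + m[1, 1]) / 2 - (m[0, 0] + m[0, 1]) / 2)
    inter.append(m[1, 1] - m[1, 0] - m[0, 1] + m[0, 0])

print(np.std(inter) / np.std(main))   # close to 2
```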

    • Sander:

      I agree that 16 is not a magic number; it’s the product of assumptions. The larger the interactions are, the smaller this number will be. I don’t think that my number of 16 is “B.S.”; it’s clearly derived from its assumptions.

      Just one thing: In your comment, you write, “in a simple continuous-mean comparison from a 2×2 orthogonal randomized design, the standard error (SE) for the interaction contrast will only be double the SE of a single main-effect contrast from the trial.” I agree. But that’s what I say too! The factor of 16 in sample size comes from two factors: the factor of 4 in sample size arising from the factor of 2 in SE that you mention, and my assumption that interactions are half the size of main effects. If you’re disagreeing with me on the factor of 16, it’s because you’re saying that your interactions of interest are more than half the size of main effects. It’s hard to know about this, but I agree that the number we get will depend on this assumption.

      • OK, we agree on the math (mere arithmetic once you plug in the numbers). But it’s a pet peeve of mine when anyone tosses out context-sensitive numbers that are unmoored from context. In my biz sometimes the interactions are bigger than the average effects, occasionally to the point of effect reversal (as with my own dissertation’s real-example data!). No surprise, as there are treatments that kill some patients and save others, especially in the Wild West of real clinical practice (which includes off-label and even contraindicated usage) as opposed to the carefully selected patients and protocols of the refined world of RCTs. In that kind of reality, saying it takes 16 times the sample size is destructive, since not only is it wrong in general but it makes it sound like there is no point in proposing to examine interactions; and if examining them is not in the protocol, then some will scream “data dredging!” if you look at them. So yeah, I think tossing off a number like 16 (and repeatedly) is very bad nonsense, really statistical numerology (like most decontextualized “applied statistics”).

  2. Ok, so it is not easy, but small incremental gains can get you a long way.
    The amelioration of symptoms and prognosis of almost every common disease has improved since I started clinical medicine in 1987; progress built on very many RCTs, none of them perfect but together forming a tapestry of overlapping evidential strands that can be read.

    • Nick:

      The question is, would this benefit have occurred without RCTs, just by clinicians and researchers trying different things and publishing their qualitative findings? I have no idea (by which I really mean I have no idea, not that I’m saying that RCTs have no value).

      • > just by clinicians and researchers trying different things and publishing their qualitative findings?

        If researchers had zero personal incentives to do this, then sure… But in the presence of career incentives to publish stuff… then the literature would be totally polluted with bullshit.

        hey wait…

  3. God forbid we ever find out for certain if organic food is healthier or not, or if calcium pills protect women from bone loss, or if two aspirin on rainy Thursdays helps protect the intellect from Ted talk damage (strengthening the skull against cranial implosions).

    Same for social science. OMG I fear the day when there is precise definition for “food desert”; when we know what “quality preschool” is and what it does; when it’s known that we’ve become an “equal” society.

    Thousands – nae! Tens of thousands! – would be out of work!

    Save NHST! Save the economy!

  4. It seems odd to me that no one has mentioned systematic reviews and meta-analyses (of RCTs) in this discussion so far, as they are generally considered the highest level of evidence in evidence-based medicine. As with anything, this approach can have its limitations, but there is at least less of a focus on statistical significance for each individual RCT, more of a focus on the magnitude of the effect size, and an acknowledgement that treatment effects can differ across different populations. Where enough data are provided for each study, or where individual patient data can be shared, there is also the potential to gain greater understanding of differences between subgroups than can be achieved by any one study.

  5. precise definition for “food desert” == Rubʿ al Khali.

    I once spent ~2 weeks trying to find a definition for “housing unit” (roughly house/apartment, maybe?). It appeared that in the USA & Canada it was a case of “I know one when I see it”.

  6. “..statistical significance or non-significance is typically used at all levels to replace uncertainty with certainty..”

    I don’t agree that that is how significance testing is typically used. The wording is odd to me. If I said there is evidence from a well-designed experiment (or experiments) to suggest a coin is unfair, I am not stating that as a truth with capital-T certainty, but as evidence for it at a certain alpha level, and I allow for errors and discuss any assumptions.

    “A quick calculation finds that it takes 16 times the sample size to estimate an interaction as it does to estimate a main effect, and given that we are lucky if our studies are powered well enough to estimate main effects of interest, it will typically be hopeless to try to obtain this kind of near-certainty regarding interactions”

    Then change the design to not do interactions, and/or get a larger sample (may need to save up some $). That still might be preferable to using a prior to get at the interaction. And did that 16 come from replacing uncertainty with certainty? ;)

    Justin
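    As a concrete version of that coin statement (counts made up for a hypothetical coin), a standard binomial test reports a p-value to weigh against a chosen alpha rather than delivering a certainty:

```python
from scipy.stats import binomtest

# Made-up example: 60 heads in 100 flips of a hypothetical coin. The test
# reports a p-value to weigh against a chosen alpha; it does not deliver
# capital-T certainty either way.
result = binomtest(k=60, n=100, p=0.5)
print(result.pvalue)   # about 0.057: not below alpha = 0.05, nor proof of fairness
```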

    • Justin, it is all over the place in applied research. You routinely see papers thresholding results as follows:
      1. Run a bunch of analyses on everything.
      2. Report which ones have p < 0.05.
      3. Report LS means or sample statistics for the responses passing #2.

      I see a lot of sophisticated stat types getting on Andrew’s case for “strawman NHST”, but what he is describing is rampant and widespread. Besides, I have never seen a rigorous research program out in the wild cleaving to a Neyman-Pearson decision framework consistently for long enough for type I error rates to matter…
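      A quick simulation sketch of that workflow (a hypothetical setup in which every true effect is zero) shows what selecting on p < 0.05 does to the reported estimates:

```python
import numpy as np
from scipy import stats

# Simulation sketch of the workflow above: many comparisons with zero true
# effect, keep only those with p < 0.05, and report their estimates.
rng = np.random.default_rng(1)
n_outcomes, n_per_group = 50, 30

kept = []
for _ in range(n_outcomes):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)     # true difference is zero
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        kept.append(b.mean() - a.mean())

print(f"{len(kept)} of {n_outcomes} null comparisons came out 'significant'")
print("reported effects:", np.round(kept, 2))  # all exaggerated in magnitude
```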

    • ‘Then change the design to not do interactions’: I’m curious to see how you see that playing out in most real applied cases. Also, what’s the problem with a prior on an interaction? Is making some assumptions about the shape of a parameter of interest egregiously worse than the myriad other assumptions made in your favorite stats? Or is that just the one you think is rhetorically easiest to pick on? Why is this assumption worse than, say, setting an utterly arbitrary alpha level? Subjectivity is all around us.

  7. Andrew:

    I was thinking the other day: it’s great that there is a group of people with extraordinary statistical expertise who can identify problems with NHST and suggest alternatives; but if any method is going to trickle down into daily use and standard practice, it’s going to be used by a much broader group of people with substantially less statistical expertise. Under those conditions, there will always be people who just want to put guts in the machine and get sausage out, and not worry too much about what happens inside. What will the shortcomings of the alternative methods be under those circumstances?

    Not defending NHST by any means. But the more widely any method is used, the more widely it will be abused. So that’s something to consider.

    • Jim:

      I agree.

      For example, suppose we characterize the current standard approach as:

      Approach 0: Compute classical confidence intervals and then report YES THERE’S AN EFFECT if the interval clearly excludes zero, report MAYBE THERE’S A SMALL EFFECT if an endpoint of the interval is very close to zero, and report THERE’S NO EFFECT if zero is well within the interval.

      Now consider the following reform:

      Approach 1: Use the same classification rule as above but with Bayesian posterior intervals. I think this approach would be an improvement, because it lets us include prior information. But it still has major problems.

      Then we can move to:

      Approach 2: Do Approach 1, but instead of looking at comparisons or estimates one at a time, look at all of them at once, if possible embedding them in a hierarchical model. I think this would be a further improvement, because it uses more information and helps us avoid selection bias relating to forking paths. But it still has the problem that it’s extracting certainty from uncertainty.

      So this moves to some sort of:

      Approach 3: Do good modeling and report uncertainty intervals conditional on the model, but don’t use overlap-with-zero as a way of making strong deterministic-sounding statements.
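      To make the contrast concrete, here is a toy sketch (the interval, the near-zero threshold, and the function names are all made up for illustration) of Approach 0’s classification rule next to Approach 3’s plain report:

```python
# Toy sketch with made-up numbers: Approach 0's thresholded summary of an
# interval versus Approach 3's plain report of the same interval.

def approach_0(lower, upper, near_zero=0.05):
    """Classify the interval the way Approach 0 does."""
    if min(abs(lower), abs(upper)) < near_zero:
        return "MAYBE THERE'S A SMALL EFFECT"
    if lower > 0 or upper < 0:
        return "YES THERE'S AN EFFECT"
    return "THERE'S NO EFFECT"

def approach_3(lower, upper):
    """Report the interval, conditional on the model, with no thresholding."""
    return f"estimated effect between {lower:+.2f} and {upper:+.2f}, given the model"

interval = (-0.03, 0.41)       # hypothetical 95% interval
print(approach_0(*interval))   # MAYBE THERE'S A SMALL EFFECT
print(approach_3(*interval))   # estimated effect between -0.03 and +0.41, given the model
```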
