“Not statistically significant” != 0, stents edition

Doug Helmreich writes:

OK, I work at a company that is involved in stents, so I’m not unbiased, but…

http://www3.imperial.ac.uk/newsandeventspggrp/imperialcollege/newssummary/news_2-11-2017-15-52-46 and especially https://www.nytimes.com/2017/11/02/health/heart-disease-stents.html

The research design is pretty cool—placebo participants got a sham surgery with no stent implanted. The results show that people with the stent did have better metrics than those with just the placebo… but the difference was not statistically significant at 95% confidence, so the authors claim there is no effect! (the difference was significant at 80% confidence). So, underpowered study becomes ammunition in the “stents have no material impact” fight.

Here are the relevant quotes. From the press release:

Coronary artery stents are lifesaving for heart attack patients, but new research suggests that the placebo effect may be larger than previously thought. . . .

“Surprisingly, even though the stents improved blood supply, they didn’t provide more relief of symptoms compared to drug treatments, at least in this patient group,” said Dr Al-Lamee, who is also an interventional cardiologist at Imperial College Healthcare NHS Trust.

From the news article:

Heart Stents Fail to Ease Chest Pain . . . When the researchers tested the patients six weeks later, both groups said they had less chest pain, and they did better than before on treadmill tests.

But there was no real difference between the patients, the researchers found. Those who got the sham procedure did just as well as those who got stents. . . .

“It was impressive how negative it was,” Dr. Redberg said of the new study. . . .

Here’s what the research article (Percutaneous coronary intervention in stable angina (ORBITA): a double-blind, randomised controlled trial, by Rasha Al-Lamee et al.) had to say:

Symptomatic relief is the primary goal of percutaneous coronary intervention (PCI) in stable angina and is commonly observed clinically. However, there is no evidence from blinded, placebo-controlled randomised trials to show its efficacy. . . . There was no significant difference in the primary endpoint of exercise time increment between groups (PCI minus placebo 16·6 s, 95% CI −8·9 to 42·0, p=0·200).

Setting aside the silliness of presenting the p-value to three significant digits, that summary is reasonable enough. The press release and the news article got it wrong by reporting a positive but non-statistically-significant change as zero (“they didn’t provide more relief of symptoms compared to drug treatments” and “Those who got the sham procedure did just as well as those who got stents”) or even negative (“It was impressive how negative it was”)! The research article got it right by saying “there is no evidence”—actually, saying “there is no strong evidence” would be a better way to put it, as the data do show some evidence for a difference.
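The reported summary is enough to back out these claims. Here's a quick sketch (all numbers are from the paper's quoted summary; the standard error is reconstructed from the 95% interval, which assumes approximate normality of the estimate):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

# Reported summary from the ORBITA paper:
est = 16.6              # PCI minus placebo, exercise-time increment (seconds)
lo95, hi95 = -8.9, 42.0

# Back out the standard error from the 95% interval
z95 = nd.inv_cdf(0.975)              # about 1.96
se = (hi95 - lo95) / (2 * z95)       # about 13 seconds

# Two-sided p-value implied by the estimate and SE
z = est / se
p = 2 * (1 - nd.cdf(z))
print(f"SE = {se:.1f}, z = {z:.2f}, p = {p:.2f}")   # p = 0.20, matching the paper

# 80% interval: the lower end essentially touches zero
z80 = nd.inv_cdf(0.90)
print(f"80% CI: ({est - z80*se:.1f}, {est + z80*se:.1f})")
```

So the correspondent's "significant at 80% confidence" is right at the boundary: with p = 0.200, the 80% interval just barely touches zero. The point stands either way: a positive estimate with a wide interval is not the same thing as zero.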

Getting to the statistics for a moment . . . Tables 1 and 3 show that there were some pre-treatment differences between treatment and control groups. This will happen under randomization, but then it’s a good idea to adjust for those differences when estimating the treatment effect. What they did was compute the gain score for each group, using post-treatment minus pre-treatment as their outcome measure. That’s not horrible but it will tend to overcorrect for pre-test—it’s equivalent to a regression adjustment with a coefficient of 1 on the pre-test, and typically you’d see a coefficient less than 1, for the usual reasons of regression to the mean.

Indeed, had the natural regression adjustment been performed, the observed difference might well have been “statistically significant” at the 5% level. Not that this should make such a difference, but imagine how all the headlines would’ve changed.
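To see what's at stake, here's a minimal fake-data sketch (all numbers invented, not the ORBITA data). The gain score is exactly a regression adjustment with the pre-test coefficient pinned at 1, while estimating that coefficient from the data typically gives something below 1 when the pre-test is a noisy measurement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n patients with a latent "true" exercise capacity,
# measured noisily before and after randomized treatment.
n, effect = 100, 20.0
truth = rng.normal(500, 100, n)          # latent exercise time (s)
pre   = truth + rng.normal(0, 60, n)     # noisy pre-treatment measurement
z     = rng.integers(0, 2, n)            # randomized treatment indicator
post  = truth + effect * z + rng.normal(0, 60, n)

# Gain-score estimate: difference in (post - pre), i.e. a pre-test
# coefficient fixed at b = 1
gain = post - pre
est_gain = gain[z == 1].mean() - gain[z == 0].mean()

# Regression adjustment: estimate b from the data
X = np.column_stack([np.ones(n), z, pre])
coef, *_ = np.linalg.lstsq(X, post, rcond=None)
est_reg, b_hat = coef[1], coef[2]

print(f"gain score: {est_gain:.1f}, regression: {est_reg:.1f}, b_hat = {b_hat:.2f}")
# b_hat comes out below 1 because the pre-test is a noisy measure of the
# latent capacity: the usual regression-to-the-mean attenuation
```

With these invented noise levels, the population value of the pre-test coefficient is about 0.74, not 1, which is the sense in which the gain score overcorrects.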

Are the raw data from this study available? The answer should be Yes, as a matter of course, but unfortunately I don’t think Lancet yet requires data and code repositories.

The other thing going on is that there are multiple outcome measures. The research paper unfortunately focuses on whether differences are “statistically significant”—this just makes me want to scream!—and I don’t know enough about the context to be sure, but if the treated patients are improving, on average, for all the outcomes, that’s useful evidence too.

Are stents a good idea?

That’s another question, and it comes down to costs and benefits. Not “Do stents work better than placebo?”, but “How much better do they work, compared to realistic alternatives?” and “What are their risks?” We need numbers here, not just wins and losses.

There’s a science question: What do stents do, what makes them work when they work and what makes them fail when they fail, etc. (I phrase this in a vague way because I know nothing of the science here.) And there are various decision questions, at the level of individual doctors and patients, and higher up when deciding which procedures to recommend and reimburse.

Lots of questions, and it’s a mistake to think that all (or even any) of them can be answered by a single number obtained from this study alone.

The problem here is not that lots of potentially useful data have been smushed and compressed into a binary summary (p greater than or less than 0.05), but that there’s a disconnect in the way that this study is used to address the scientific and policy questions we care about. There’s a flow of argument in which all scientific information goes into this one number, which is then supposed to answer all questions. And this makes no sense. Indeed, it makes so little sense that we should ask how it is that people could ever have thought this was a good idea. But that’s a subject for another post.

P.S. I wrote this post a while ago, and in the meantime I wrote an article with John Carlin and Brahmajee Nallamothu, expanding on some of these points. The article is called ORBITA and coronary stents: A case study in the analysis and reporting of clinical trials, and I have no idea where we will publish it. It can be tough to get an article published that doesn’t draw strong conclusions; indeed, our message is that researchers should express less strong conclusions from their data.

73 thoughts on “‘Not statistically significant’ != 0, stents edition”

  1. Great post – the “Not statistically significant” != 0 fallacy is pretty pervasive.

    Am I reading between the lines that you generally recommend checking pre-test balance in RCTs (maybe with a significance test), and if imbalance is found, one should adjust for it? I am asking because this seems to be a rather contentious issue. John Myles White (http://www.johnmyleswhite.com/notebook/2017/04/06/covariate-based-diagnostics-for-randomized-experiments-are-often-misleading/) argued against any testing for balance in RCTs, and I recall Deaton and Cartwright in their paper “against RCTs” arguing against significance tests (in essence saying that all significant results are by definition Type I errors). Deaton and Cartwright preferred balance checks using the normalized mean difference (which is essentially Cohen’s d, as psychologists would call it). I also seem to recall Senn arguing that imbalance is not a bad thing in RCTs, and that we shouldn’t obsess about it (as we only require NET balance of all possible confounders). That does raise the question, though: if the RCT identifies the average causal effect of some treatment on an outcome, which estimate is then preferable: the unadjusted or the adjusted one?

    • Felix:

      I do recommend adjusting for pre-treatment differences between treatment and control groups, of course. I don’t recommend testing for imbalance, I just recommend adjusting for imbalance. It would be silly not to. If you have lots of pre-treatment variables to adjust for, I recommend doing the adjustment using some regularized procedure such as Bayesian regression. I followed your link and I agree that adjusting for lots and lots of factors using least squares can add noise. But there’s no need to use least squares.

      • Andrew,

        (I seem unable to reply at the correct spot in this thread.)

        Thanks for the explicit pointer to (2) in your paper. There, you say that (2) is “the optimal linear estimate of the treatment effect”. The treatment effect is y_{post}^T – y_{post}^C in your notation and b in Freedman’s notation. Correct me if I am wrong, but I assume you mean that this is the linear unbiased estimator of minimum variance. If so, then I presume you are making some assumptions typical for OLS; please let me know what you assume.

        In any case, (2) is the same as what Freedman denotes \hat b_{MR} in his Theorem 4. Freedman compares to the unadjusted estimate, the ITT estimator. When the experiment divides treatment and control (in ORBITA, real versus placebo surgery) into equal parts, p = 1/2 in Freedman’s notation, then as he says on p. 11, the MR estimator has asymptotically smaller variance than the ITT estimator (i.e., adjustment “helps”). In the ORBITA case analyzed by your paper, p is close to 1/2. (In your post, you say “N is never large”, but I am unclear on whether this means asymptotic results here are not of interest in this analysis.) But if p \ne 1/2, then adjustment can help or hurt (p. 11). There is also bias in the MR estimator, but not in the ITT estimator, though it is asymptotically small (p. 10).

        According to your paper, ORBITA had N = 194 for exercise. For finite N (denoted n by Freedman), he says this on p. 12:

        “(vii) In a variety of examples, simulation results (not reported here) indicate the following. When the number of subjects n is 100 or 250, bias in the multiple regression estimator may be quite noticeable. If n is 500, bias is sometimes significant, but rarely of a size to matter. With n = 1000, bias is negligible, and the asymptotics seem to be quite accurate.

        “(viii) The simulations, like the analytic results, indicate a wide range of possible behavior. For instance, adjustment may help or hurt. Nominal variances for the regression estimators can be too big or too small, by factors that are quite large.”

        Freedman has additional results with better proofs here: https://www.stat.berkeley.edu/~census/neyregcm.pdf Green http://citeseerx.ist.psu.edu/viewdoc/download?doi= says that Freedman’s concerns do not apply to political science.

        The upshot, it seems to me, is that without further assumptions or qualifiers, one cannot advise always to adjust, as you seem to, even in this simple case, never mind adjusting for more covariates.

        By the way, the ITT estimator for exercise time in ORBITA is 54.5; the difference pre-treatment is 38.0 (in the same direction). The latter is (presumably) entirely due to randomness, so one does not expect the former to indicate a clinically strong effect.

        • Russ:

          Comparing the pre-test to post-test differences, which is what was done in the original paper, is a special case of the regression estimate, setting b=1. It’s true that estimating b from data can be noisy, and it can make sense to use a regularized estimate to get more stability, but setting b=1 is not a sensible way to regularize, given that on logical grounds alone we would expect b to be less than 1.

        • Andrew,

          Yes, I know they did that, and I did not discuss what they did in my previous post: They adjusted one way, you adjusted in a different way (probably better). I discussed your way versus ITT (no adjustment). The question I am asking concerns your statement that it is always better to adjust; you also said that your way is the optimal way. I asked you to say what you mean by “optimal”. That was near the beginning of my post; sorry if it got overlooked. Implicitly, I am also asking you to compare your recommendation to adjust with the detailed analysis Freedman gave. Thanks.

        • Russ:

          By “no adjustment,” do you mean just taking the difference in post-test scores, not accounting for pre-test scores at all? Setting b=0? That would be a horrible idea.

          Basically, my position is that it’s always better to adjust, that is, to estimate b, rather than to set it to a pre-set value.

          That said, I understand that b will typically be estimated in an unregularized way such as least squares, and there are some situations where it is reasonable to set b to a fixed value (e.g., 0, 1, or 0.5) based on prior information or assumptions. In those settings, I think a regularized approach would be better (instead of setting b=0, put a prior or penalty function on b to partially pool the estimate toward 0).

          I’ve seen some horrible analyses, where people control for pre-treatment variables using noisy estimates, and you’ll see coefficients all over the map. In those settings, it could well have been better to just not touch those variables in the first place.

          In the stents example, the pre-post correlation is high, and we’d expect it to be high, given that these are two measurements on the exact same people. So setting b=0 would be absolutely foolish. Setting b=1 isn’t too bad, but given the structure of the problem, you’d expect b to be less than 1, so I think it makes sense to estimate b.

          When I wrote “optimal,” I wasn’t thinking too hard about it. I don’t really care about optimality anyway. To be formal, yes, the optimality here is only asymptotic. But my real point was that it makes sense to adjust for pre-test, but not by simply subtracting.
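The "prior or penalty function on b" idea above can be made concrete in a few lines. A sketch with a conjugate normal prior (all numbers invented; `pooled_b` is a hypothetical helper, not from any package), here pooling toward a default of b = 1:

```python
# Partial pooling of the pre-test coefficient b toward a default value,
# via a Normal(b0, tau^2) prior and a normal likelihood for the
# least-squares estimate b_hat with standard error se_b.
def pooled_b(b_hat, se_b, b0=1.0, tau=0.25):
    """Posterior mean of b: a precision-weighted average of data and prior."""
    w_data, w_prior = 1 / se_b**2, 1 / tau**2
    return (w_data * b_hat + w_prior * b0) / (w_data + w_prior)

# A noisy estimate gets pulled most of the way to the default ...
print(pooled_b(0.3, se_b=0.5))   # about 0.86
# ... while a precise one barely moves.
print(pooled_b(0.8, se_b=0.05))  # about 0.81
```

Setting b to a fixed value is the tau-goes-to-zero limit of this; least squares is the tau-goes-to-infinity limit.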

        • Andrew,

          That’s right, “no adjustment” means difference in post scores. This is called the ITT estimator. This is what Freedman’s papers are about. Apparently, his papers were rather influential. It seems even critics agree he made some good points and some good recommendations. He proved some theorems and used his analyses to give a more nuanced discussion. You say you have a “position … that it’s always better to adjust,” but if it is backed up by analysis, I would very much like to see that.

        • Russ:

          It would be easy enough to run simulations and demonstrate the point. This is not a hard problem. If the true value of b is something like 0.8, then setting b=0 is a terrible idea—it’s just throwing away data. If the true value of b is closer to 0, that’s another story.

          This sort of thing comes up a lot in statistics. There are endless complexities, nonlinearities, etc., that can be added to a model. Realistically you can’t do everything. Setting parameters to zero is an extreme special case of regularization, and it’s my impression that when setting parameter to zero seems to perform well, that’s when it’s being compared to naive unregularized estimates. Doing some intermediate regularization should work better than either of these two extremes. In the stents example, I wanted to keep it simple so I did the regression estimate. The simple difference estimate is not as good, but it’s not terrible. Not correcting for pre-test at all, that would be terrible, it would be statistical incompetence to even consider this as your estimated treatment effect for this sort of problem. Again, if the true b were close to 0, that would be another story—but that’s not something you’ll see in a situation such as this, where pre-test and post-test measurements are so similar.
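That simulation is easy to run. A minimal version (all numbers invented: true b = 0.8, a treatment effect of 20, n = 100 per trial, 2000 replications), comparing the root-mean-squared error of the plug-in choices b=0 and b=1 against least-squares estimation of b:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical repeated trials: true pre-test coefficient b = 0.8,
# treatment effect 20, n = 100 patients split evenly between arms.
def one_trial(b=0.8, effect=20.0, n=100):
    pre = rng.normal(500, 100, n)
    z = np.repeat([0, 1], n // 2)                 # randomized arms
    post = b * pre + effect * z + rng.normal(0, 60, n)
    ests = {}
    for fixed in (0.0, 1.0):                      # plug in a fixed b ...
        resid = post - fixed * pre
        ests[fixed] = resid[z == 1].mean() - resid[z == 0].mean()
    X = np.column_stack([np.ones(n), z, pre])     # ... or estimate b
    ests["lsq"] = np.linalg.lstsq(X, post, rcond=None)[0][1]
    return ests

sims = [one_trial() for _ in range(2000)]
rmse = {k: float(np.sqrt(np.mean([(s[k] - 20.0) ** 2 for s in sims])))
        for k in (0.0, 1.0, "lsq")}
print(rmse)  # b=0 (ignoring the pre-test) is far noisier than b=1 or least squares
```

Under these invented numbers, b=0 roughly doubles the error of the other two estimators; b=1 and least squares come out close, with least squares slightly ahead since the true b is 0.8.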

        • Andrew,

          Yes, I agree that if the correlation is high, like 0.8, then it is better to adjust in the way you say in any reasonable situation. This can be seen from Freedman’s formulas.

          It seems you do now agree that it is not always good to adjust, however.

        • Russ:

          I think it’s always good to adjust—if the adjustment is done in a reasonable way. Least squares is not always a reasonable way to do things.

        • Andrew,

          Well, then I have to repeat my request for analysis to back up your assertion that it is always good to adjust and to say what you mean by “in a reasonable way”. Thanks.

        • Russ:

          The mathematics is simple. I think that sometimes complicated notation can get in the way.

          Here’s the story. There’s no optimal estimation method, and there can never be an optimal estimation method. If b happens to be exactly 0, then b=0 dominates any other estimate. If b happens to be very close to zero, then, for reasonable sample sizes, b=0 will dominate any purely data-based method such as least squares. Even better will be a Bayesian approach—but that’s only optimal when averaging over the prior, which is itself just a model.

          If you have in your hand an article saying that setting b=0 can dominate least-squares estimation of b, that’s fine. As I said, such a theorem is relevant if b is near 0. The point is, you could replace “0” by 1 or 0.5 or any other pre-specified value, and your theorem would still be just as true as before. So the theorem doesn’t really resolve any problems; it just pushes you back one step to consider the default choice of b.

          In the stents paper, they chose b=1, which isn’t a disastrous choice but does have some problems, given that before-after correlations of this measurement won’t be 1. It sounds like the paper you are citing chooses b=0 as a default, which will typically be a terrible idea when considering studies where the same thing is being measured before and after. But b=0 can make sense in other settings where the pre-treatment variable is not strongly related to the outcome. Indeed, if you have dozens of pre-treatment predictors, then I think b=0 is typically a reasonable default choice for most of the predictors.

          In any case, the math here is relevant mostly in helping us understand that, when you are worried about noisy estimation of b, you have to think about subject-matter or structural knowledge in order to decide what value of b to plug in (if you want to go that route) or to partially pool toward (as I’d prefer). Just setting b to 0, or 1, or 0.5, or 3.14, or whatever, doesn’t make sense without considering context.

        • Andrew,

          I don’t see how you disagree with me or Freedman’s papers (of which Felix, not I, brought up one originally). Freedman says that it is not obvious whether it is better to adjust or not (in your notation, estimate b or use b = 0 as a default) and neither decision is always the right one, and you seem to agree. Freedman does not say that it is always better not to adjust. He would agree with you that his theorems do not “resolve any problems”. Rather, their intention is to point out some problems that bear thinking about.

          By the way, I wondered why you seemed not to want to mention Freedman’s name in this discussion, nor discuss particulars of his papers, so I looked up mentions of him on your blog. To my surprise, I found this one http://statmodeling.stat.columbia.edu/2006/06/16/treatment_inter/ whose title, “Treatment interactions in experiments and observational studies”, and discussion suggest it is about precisely Freedman’s papers! However, the links there are dead, so I cannot check. Nevertheless, you wrote that you “took a look at Freedman’s papers. They were thought-provoking and fun to read”. Your post is from June 16, 2006 and Freedman’s (first) paper was received Sep. 27, 2006, so it seems possible you read a preprint of this. Also, somewhat to my surprise, I found that you wrote, “In my experience Freedman was either a liar or a fool or a completely lazy person (although if you tell untruths via the expedient of carefully not checking the truth-value of your statements, this is a sort of lying, as far as I’m concerned), or perhaps all three.” You wrote this much later in 2015: http://statmodeling.stat.columbia.edu/2015/02/14/two-unrecognized-hall-fame-statisticians/#comment-211579

        • Russ:

          It’s true that in my experience Freedman was either a liar or a fool or a completely lazy person, but that seemed irrelevant to the current discussion so I didn’t bring it up. Nobody’s perfect, and I’m sure that Freedman had some good days, unobserved by me, in which case he displayed various mixtures of truthfulness, intelligence, diligence, and integrity.

          Regarding the particular article you’re discussing, perhaps the following thought experiment would be useful: Imagine an infinite set of articles, each identical to the one you discussed, but with a different value of b. The article you discussed makes, and proves, the correct point that a regression estimate does not dominate the estimate with b set to 0. There’s another, hypothetical, article, that makes and proves the correct point that a regression estimate does not dominate the estimate with b set to 0.01; there’s another making and proving this point for b=0.02, another for b=-3.14; etc.

          That’s all fine.

          Now, what do we do with all of this? We can’t do much with it until we have some sense of what values of b might make sense. There are a lot of problems where b=0 is a reasonable default guess. The stents problem is not one of them.

          This particular thread got started when someone asked about the advice to test and adjust for imbalance. My recommendation was not to test for anything, but just to adjust appropriately for imbalance. How exactly to do this will depend on context. I think that in general the best way to do this would be using partial pooling, but then you have to decide what to partially pool toward. That’s fine, as people are already making this sort of decision. For example, to not adjust for a variable is to set b=0. From a statistical perspective, I don’t think it’s so helpful to talk about adjusting or not-adjusting; I think it makes more sense to talk about what value of b we want to use. In any case, I agree with the point that a least-squares estimate of b will not necessarily be best, so it was sloppy of me to use the term “optimal” in that paper.

          P.S. I went back and fixed the links that I could from that earlier post.

        • This really reminds me of the analogous situation where you see folks “testing” whether a hierarchical (“random effects”) term or formulation is “significant” before using it, rather than understanding partial pooling as a compromise between no pooling and complete pooling.

        • Chris:

          Yes, exactly. Such a procedure privileges b=0 and it can make rough sense when b is near 0. In that sense, these procedures can be viewed as approximations to Bayesian inference, where the prior information is included, not as a “prior distribution,” but as a default model.

        • Chris,
          > folks “testing” whether a hierarchical (“random effects”) term or formulation is “significant” before using it
          That type of thinking is hard to permanently kill (it’s vampirical); many have tried (e.g. Cochrane’s statisticians) over the years, but it keeps springing up.

        • Andrew:
          Indeed, as somebody around here likes to say, ‘statistics is the science of defaults’ :) My default is to always model the variation with partial pooling (or average over uncertainty in b in this case).

          I would be interested to see where Cochrane has commented on this issue. In my experience, that practice tends to go hand in hand with folks who were taught that “random effects are for when you want to generalize to a population from which you have randomly sampled, whereas fixed effects are for when you want the best estimate (sic) for the specific group/level/whatever sampled here”. For me, nothing about any of this really made sense until I understood the actual mechanics of partial pooling (then learned about the James-Stein estimator, etc.). It’s funny how math, properly wielded, can be a great clarifier!

        • Chris,

          From the Cochrane handbook “9.5.2 Identifying and measuring heterogeneity
          It is important to consider to what extent the results of studies are consistent. If confidence intervals for the results of individual studies (generally depicted graphically using horizontal lines) have poor overlap, this generally indicates the presence of statistical heterogeneity. More formally, a statistical test for heterogeneity is available. This chi-squared (χ2, or Chi2) test is included in the forest plots in Cochrane reviews. It assesses whether observed differences in results are compatible with chance alone. A low P value (or a large chi-squared statistic relative to its degree of freedom) provides evidence of heterogeneity of intervention effects (variation in effect estimates beyond chance).

          Care must be taken in the interpretation of the chi-squared test, since it has low power in the (common) situation of a meta-analysis when studies have small sample size or are few in number. This means that while a statistically significant result may indicate a problem with heterogeneity, a non-significant result must not be taken as evidence of no heterogeneity. This is also why a P value of 0.10, rather than the conventional level of 0.05, is sometimes used to determine statistical significance. A further problem with the test, which seldom occurs in Cochrane reviews, is that when there are many studies in a meta-analysis, the test has high power to detect a small amount of heterogeneity that may be clinically unimportant.

          Some argue that, since clinical and methodological diversity always occur in a meta-analysis, statistical heterogeneity is inevitable (Higgins 2003). Thus the test for heterogeneity is irrelevant to the choice of analysis; heterogeneity will always exist whether or not we happen to be able to detect it using a statistical test. ”

          And, if you have not, you might wish to read X. L. Meng, From unit root to Stein’s estimator to Fisher’s k statistics: If you have a moment, I can tell you more, Statistical Science 20 (2005) 141–162.
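For readers who haven't seen the mechanics of the chi-squared heterogeneity test quoted above, here is a toy fixed-effect meta-analysis (all numbers invented; with three studies there are 2 degrees of freedom, so the chi-squared survival function has the closed form exp(-Q/2)):

```python
import math

# Toy meta-analysis: three studies, each with an effect estimate and SE.
est = [0.30, 0.10, 0.55]
se  = [0.15, 0.12, 0.20]

w = [1 / s**2 for s in se]                         # inverse-variance weights
pooled = sum(wi * ei for wi, ei in zip(w, est)) / sum(w)

# Cochran's Q: weighted squared deviations from the pooled estimate,
# referred to a chi-squared distribution with k-1 df
Q = sum(wi * (ei - pooled)**2 for wi, ei in zip(w, est))
df = len(est) - 1
p = math.exp(-Q / 2)     # chi-squared survival function, exact for df = 2

print(f"pooled = {pooled:.3f}, Q = {Q:.2f}, p = {p:.2f}")
```

Here p comes out around 0.14: "not significant" at 0.05 or even at the 0.10 threshold the handbook mentions, which of course does not demonstrate that the studies are homogeneous. The same "non-significant != 0" fallacy from the post applies here too.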

    • The analysis by John Myles White seems irrelevant. Yes, if we searched enough covariates, we could find some with extreme imbalance, but this isn’t the point of the recommendation from Andrew and others to simply include initial conditions in the model automatically. I explore this more with the highly cited mouse fecal transplant study, which I think is a good example of a large conditional bias in a t-test of a gain score. I say “think” because I don’t have the raw data, so I had to create fake data that generates datasets that look like the summary statistics of the published data. This is a work in progress (or not, as I don’t know what I’ll do with it) and I would appreciate any feedback.

    • Thanks Andrew and Jeff for the replies – I liked your analysis that you linked to, Jeff, and I never heard about this highly cited mouse fecal transplant study, but the title itself sounds interesting.
      I could think of some silly reasons to adjust for covariates in RCTs. One of course is p-hacking: just adjusting on whatever will drive p < .05. But let's not go there.
      Freedman seemed to suggest that adjusted estimates exhibit small biases in RCTs (https://www.stat.berkeley.edu/~census/neyregr.pdf), and argued that the design of the experiment (here randomization) implies that one should not generally adjust on pre-treatment covariates.
      Also for the sake of argument, we may consider that in order to adjust for a pre-treatment imbalance on an observed covariate, we match participants. By doing so, we may induce imbalance in another unobserved covariate, and if that one is highly related to the outcome, we just introduced more bias, and get actually worse estimates.
      Anyway, I did not mean to derail the thread here with questions about covariates in randomization, and appreciate the replies!

    • Sorry, I will add one more point. It seems weird that in an RCT in which researchers went through the effort of collecting lots of (presumably) important pre-treatment covariates, we would feel suspicious if we saw an imbalance and would demand adjustment, but if the same researchers had just conducted an RCT without collecting them, most of us would probably be comfortable just looking at the unadjusted effect – it is, after all, unbiased. Most of us would probably NOT look at an RCT and immediately say “well, but what if a pre-treatment covariate that you did not collect was imbalanced” (and we shouldn’t say this). Strangely, it seems that folks running RCTs would have an incentive not to collect any pre-treatment variables.

      • Felix:

        The imbalance doesn’t make me “suspicious.” Suspicion has nothing to do with it. Of course you should adjust for pre-test.

        And you write, “if the same researchers just conducted an RCT without collecting this, most of us would probably be comfortable just looking at the unadjusted effect.” I can’t speak for “most of us,” but for me, I would definitely have a problem with a study that did not collect pre-test data. At least I’d want to know why they didn’t do so.

        Finally, I strongly disagree with your statement, “folks running RCTs would have an incentive not to collect any pre-treatment variables.” Collecting pre-treatment variables allows you to better line up treatment and control groups, giving you a more precise estimate of the treatment effect.

        • Andrew:
          Thanks – I was always under the impression that consumers of RCTs operate under the assumption that randomization allows us to ignore pre-treatment imbalances, and that adjustment is generally not necessary, as the unadjusted effect is an unbiased estimate of the average causal effect. You seem to say that you would generally prefer to see pre-treatment covariates collected, and since there will always be some imbalances, prefer to see the adjusted effect. But doesn’t this rob randomization of one of its most amazing features? The fact that you don’t have to spend extra time and resources to collect the pre-treatment covariates, and can safely ignore them?

          Also, I am not advocating that trialists not collect pre-treatment covariates; I am just saying they have an incentive not to, if that means they avoid scrutiny, as in “your RCT is misleading because covariate X was imbalanced.” I could see that such a criticism would be easier to evoke (because you can see the imbalance) than saying “your RCT is misleading, because an unobserved covariate Y may be imbalanced”.

        • Felix:

          1. An unbiased estimate is useless if it is too noisy. Within-person comparisons are typically necessary to get noise down to a reasonable level, and pre-treatment measurements allow within-person comparisons.

          2. If you don’t have the “extra time and resources to collect the pre-treatment covariates,” I don’t think you have any business doing the study in the first place. This particular study involved an expensive operation! Pre-treatment measurements are cheap in comparison. To do a big medical procedure, and then make a comparison based on post-treatment outcomes, without collecting pre-treatment variables, would be unethical.

          3. You can’t “safely ignore” the pre-treatment measurements. To do so will give you the wrong answer.

          4. I do not think it is appropriate for a critic to say, “your RCT is misleading because covariate X was imbalanced.” It is appropriate for a critic to say that an appropriate adjustment should be performed.

          5. Also, I don’t think that “avoiding scrutiny” is an appropriate goal. This was a high-visibility study and it was going to get scrutiny in any case. Scrutiny is good! When I do research, I don’t want to avoid scrutiny; I want people to look at our results as carefully as they can.

        • Andrew:
          Of course, I also want scrutiny, so maybe let’s just put aside hypothetical incentives that trialists may or may not have.

          But your writing that ignoring pre-test measurements in an RCT gives you the wrong answer is still somewhat confusing to me.
          Are you saying that unadjusted estimates from RCTs are wrong? And are you therefore arguing that RCTs always need to collect and include (in the analysis) pre-test measurements? I don’t want to put words in your mouth, but that is the gist that I am getting. Maybe I am about to be exposed as naive, but I always thought that in an RCT the unadjusted estimate is perfectly fine, simply because we randomized and the design guarantees unbiased estimates. Sure, precision can be improved through design and analysis choices (like blocking or covariate adjustment), but the estimate itself is unbiased. So your point 3, where you state that pre-test measurements CANNOT be ignored in an RCT and that doing so gives you the wrong answer, is something that I simply don’t understand. Were you talking about this particular trial, or is this a general recommendation?

          By the way, I am not commenting here to “win” some sort of argument. I am genuinely interested in your opinion and what I can learn from it! So, thanks for the discussion so far.

        • There are lots of things that are wrong with the usual practice. Remember that *unbiased* means that the average taken over many *repeated trials* converges to the true average. Any given trial may be far from the truth. Adjusting (properly) for known reasons why *this particular* trial might be farther away from the true average will always improve what you learn from the particular trial (of course, doing a bad job of adjusting will make things worse).

          Since most trials are not repeated, say, 350 times (or even 3 times), being unbiased is rarely of real interest.

        • The real thing that randomization does is select one of the possible assignments, giving every possible assignment an equal opportunity. This ensures that there is no human-induced bias, and that the chosen assignment has essentially no chance of being dramatically far from typical. It doesn’t mean that the actually chosen assignment will give excellent results unadjusted, only that it is exponentially hard for it to give dramatically wrong results. For example, it is potentially easy for doctors choosing people by hand to put the worst patients into the stent group because they want to make sure those patients are getting the best possible treatment, in the doctors’ opinion. That would give dramatically wrong conclusions unadjusted; however, an RNG would essentially never put all of the worst patients together. Furthermore, an RNG would essentially never put all of ANY sizeable group together. But an RNG should fail to perfectly balance every group as well… there are very few assignments that perfectly balance everything.
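          A quick simulation makes this concrete (all numbers here are made up for illustration): across many random assignments, an RNG essentially never concentrates a sizeable subgroup in one arm.

```python
import random

random.seed(1)

# Hypothetical setup: 200 patients, of whom 40 are the "worst" cases a
# well-meaning doctor might have steered into one arm by hand.
n, n_worst, arm_size = 200, 40, 100
trials = 10_000

counts = []
for _ in range(trials):
    treatment = set(random.sample(range(n), arm_size))  # random 50/50 split
    counts.append(sum(i in treatment for i in range(n_worst)))

# The treatment arm gets about half the worst patients on average, and
# across 10,000 draws never gets all of them (or none of them).
print(min(counts), max(counts))
```

The counts hover around 20 of the 40, so perfect balance is rare too, but dramatic imbalance essentially never happens.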

        • Felix, Russ:

          I gave our recommended analysis in equation (2) of our linked article. This is just about the simplest case because we’re just using this one pre-treatment predictor and the model is linear. Who knows what I’d do with all the raw data, but this is an analysis I was able to do given the summaries presented in the published paper. A larger issue is that it’s not clear that we should be looking at improvements in exercise time as a primary outcome.

        • Like Felix, I too would be interested to hear details of Andrew’s recommendation always to adjust, in light of Freedman’s analysis showing when it helps and when it hurts or misleads.

        • Thanks for chiming in Daniel – I always appreciate your comments here on this blog. For some reason, I couldn’t hit reply directly beneath your comment (maybe too many levels of nesting).
          I enjoyed your explanation of randomization giving every possible assignment an equal chance, and that therefore assignments that would yield dramatically wrong results are rare. That does make sense.

          And yet, isn’t it exactly this property of randomization (that it is unlikely to pick very badly unbalanced designs, because even though some covariates will be imbalanced, the NET effect of them will be zero in expectation under randomization) that should comfort us when looking at the unadjusted treatment effect estimate? In fact, wasn’t this Fisher’s insight into why randomization is so powerful?

          I am in absolute agreement (with both you and Andrew) that using pre-tests can increase precision, statistical power, etc., but I am really not getting why the unadjusted treatment effect seems to be “vilified” here. If I imagine reading about an RCT and finding that overall it has been properly conducted (no missingness, blinding, large N), then generally speaking I would be quite happy to see the unadjusted effect reported, and in fact would probably feel quite confident about it.

        • Felix:

          Unbiasedness won’t help you if your standard error is huge. If you do a between-person design, you need to control for good pre-treatment predictors or your standard error will be huge. In addition, big errors and lack of within-person control will make you vulnerable to bias in your measurements.

        • I agree with Andrew’s statements, but will also add that statistics is basically a way to reduce the cost of doing experiments: we never run so many N that the standard errors are small enough to ignore, because it costs too much. And we always scale up our questions to fit the available data if we have a lot of data.

  2. Both fake surgeries and real surgeries have the potential (roughly equal?) for post-operative iatrogenic problems, but only real stents have the possibility of pain (and other complications) related to the stents themselves. In line with your recent posts, this result is entirely consistent with stents having a significant positive impact, conditional on no idiosyncratic negative side effects — and the side effects could be independent of the beneficial effects.

    This would yield a recommendation which says:
    (a) all surgeries have possible bad side effects — infections, etc. and positive psychological effects, with some net effect measurable from the placebo group compared with an untreated group
    (b) stents have a significant positive impact on their own fixing the problem
    (c) sometimes, however, stents have a significant negative impact (immune reaction, side effects of drug elution, etc.)

    Some of the failure to reach statistical significance is that the treatment group is (b)+(c), not (b) alone.

    Total effect = (a) + (b) + (c) and the recommendation will depend on the variances and covariances of (a), (b) and (c)

  3. …that summary is reasonable enough. The press release and the news article got it wrong by reporting a positive but non-statistically-significant change as zero (“they didn’t provide more relief of symptoms compared to drug treatments” and “Those who got the sham procedure did just as well as those who got stents”) or even negative (“It was impressive how negative it was”)! The research article got it right by saying “there is no evidence”—actually, saying “there is no strong evidence” would be a better way to put it, as the data do show some evidence for a difference.

    You have to read the next paragraph of the summary to find out what they really think:

    PCI did not increase exercise time by more than the effect of a placebo procedure.

    Usually there is a progression of more and more ridiculous claims coming from the authors as they go from the results section to the discussion section to the press release to interviews with the media. I tend to think the last is what they really think, while the earlier claims are there for appearance purposes (like adding “suggests” to your paper at key points).

    • The last paragraph of the paper itself reads as follows (emphasis added):

      ORBITA made a blinded comparison of PCI and a placebo procedure in patients with stable angina and anatomically and haemodynamically severe coronary stenosis. *The primary endpoint of exercise time increment showed no difference between groups.* This first placebo-controlled trial of PCI for stable angina suggests that the common clinical observation of symptomatic improvement from PCI might well contain a large placebo component. Placebo-controlled efficacy data could be just as important for assessing invasive procedures, where the stakes are higher, as for assessing pharmacotherapy, where it is already standard practice.

      Table 3 of the paper shows the outcomes for 9 different variables or measured quantities. If I read this table correctly (a fairly big if, given that I’ve never been inside a medical classroom), 1 variable shows no change, 7 variables show a favorable change for those who underwent PCI, and 1 variable shows a favorable change for those on the placebo. My intuition tells me that, if there were no effect, the sign of the differences in these outcomes would follow a Bernoulli distribution with parameter 0.5 (after ignoring the ties). In this case we get 7 + and 1 – (conditional on my classification of the outcomes being correct!). Now, flipping 7 heads in 8 trials is not impossible, but it is unlikely: I get that 7 heads or 8 heads would occur about 4% of the time. In other words, this set of outcomes seems to me to be significant at the 0.05 level.


      The table showing the key endpoint (the Duke treadmill score) is shown below:

                                                     PCI                                 Placebo
      Patients assessed                              104                                 90
      Pre-randomisation                              4·24 (4·82)                         4·18 (4·65)
      Follow-up                                      5·46 (4·79)                         4·28 (4·98)
      Increment (pre-randomisation to follow-up)     1·22 (4·36; 95% CI 0·37 to 2·07)    0·10 (5·20; 95% CI –0·99 to 1·19)
      Difference in increment between groups: 1·12 (95% CI –0·23 to 2·47), p value 0·104
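      As a sanity check, the implied point estimate and p value can be recovered from the quoted 95% CI under a normal approximation (the paper may have used a t distribution, so this is only approximate):

```python
from math import erf, sqrt

lo, hi = -0.23, 2.47                # reported 95% CI for the difference in increment
est = (lo + hi) / 2                 # midpoint, 1.12 (= 1.22 - 0.10)
se = (hi - lo) / (2 * 1.96)         # implied standard error, about 0.69
z = est / se
p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided normal p value
print(round(est, 2), round(p, 3))   # 1.12 0.104, matching the reported value
```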

      • Reposting this since I noticed a bracket error near the beginning and another typo. Mods should delete the other one if they want…

        My post was more of an ftfy to Andrew’s post (addition in bold):

        The press release and the news article and the paper got it wrong by reporting a positive but non-statistically-significant change as zero

        [observes 7/8 outcomes in one direction and proposes treating them as a sample from a binomial distribution with p = 0.5]

        In other words, this set of outcomes seems to me to be significant at the 0.05 level.

        Ok, but why do you care? I’m not being sarcastic. I suspect if we work through why you care about this, there is a fallacy somewhere. Let’s derive your model:

        Assuming each result is independent of the others, we use the multiplication law for independent events: p(A and B) = p(A) * p(B). Further assume each outcome had the same probability p of being positive and 1 – p of being negative, i.e. p = p1 = p2 = ... = p8.

        Then we can see the probability of getting 7 positives and then one negative will be p1*p2*p3*p4*p5*p6*p7*(1-p8) = p^7*(1-p). Or you could see a different order like p1*p2*p3*p4*p5*p6*(1-p7)*p8 = p^7*(1-p) … etc. We can add all these up or just realize there are choose(8, 7) = 8 ways to get 7/8 positives.

        So the probability of getting 7 positives out of 8 possibilities will be choose(8,7)*p^7*(1-p)^1 = 0.03125. In general you can work this out to see it is choose(n, k)*p^k*(1-p)^(n-k), so for 8/8 positives we get choose(8,8)*p^8*(1-p)^0 = 0.00390625. Adding the two values gives ~0.0351 which is where I assume you got ~4% from.
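        That arithmetic can be checked directly (same assumptions as above: independent outcomes, common p = 0.5):

```python
from math import comb

p = 0.5
# P(at least 7 of the 8 outcomes favor PCI) under the fair-coin null
prob = sum(comb(8, k) * p**k * (1 - p)**(8 - k) for k in (7, 8))
print(prob)  # 0.03515625, i.e. the ~4% figure
```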

        – Does this sound like a process that would generate those 8 outcomes after heart surgery?

        – Do we expect “Exercise time (s), Time to 1 mm ST depression (s), Peak oxygen uptake (mL/min), SAQ-physical limitation, SAQ-angina frequency, SAQ-angina stability, EQ-5D-5L QoL, Peak stress wall motion index score, and Duke treadmill score” to have the exact same probability of being higher in the treatment group?

        – Is it reasonable to assume there are no correlations between these probabilities?

        • Ok, two more typos (plus the missed closing blockquote tag; I made another post with that fix that I hope doesn’t show up)… this site really needs a post-preview option:
          p(A and B) = p(A) * p(B)
          p = p1 = p2 = ... = p8

        • Well, you are right of course.

          You asked: Ok, but why do you care? I’m not being sarcastic.

          I was reacting to the author’s statement of “no difference” and pointing out what appears to be a systematic pattern of differences between the treated and control groups. Now, calculating a “significance level” on this blog has to be regarded as an ironic statement. I’m sorry that I did not make that more clear. Note also that choosing to do the analysis that I did required going down another fork in the road.

          My main point was to show that the conclusion that Andrew objected to in the abstract also appeared in the body of the paper.


        • I see, thanks. One thing is that the quote specifically talks about exercise time in isolation, so the overall pattern seen in the battery of tests would be a different story.

  4. Hi Andrew:

    Nice thoughtful blog post as always. I’m surprised that the randomization wasn’t between balloon angioplasty without stenting and balloon angioplasty with stenting. That would seem to me to be a more relevant comparison.


  5. It is obvious that putting in a stent makes good sense. Unfortunately, while plausible, that assertion may be false. On pages 89-90 of their book, “Ending Medical Reversals,” Prasad and Cifu comment on the SAMMPRIS study dealing with intracranial stenosis:

    “Unfortunately, the group receiving the stents was having more strokes.”

    The current controversy involves cardiac stenting; from the NYT:

    “The idea that stenting relieves chest pain is so ingrained that some experts said they expect most doctors will continue with stenting, reasoning that the new research is just one study.”

    “Even Dr. Davies hesitated to say patients like those he tested should not get stents. ‘Some don’t want drugs or can’t take them,’ he said.”

    “Stenting is so accepted that American cardiologists said they were amazed ethics boards agreed to a study with a sham control group.”

    “But in the United Kingdom, said Dr. Davies, getting approval for the study was not so difficult. Neither was it difficult to find patients.”

    And stents are not cheap. From the NYT, “inserting them costs from $11,000 to $41,000 at hospitals in the United States.” Clearly, stent makers and stent inserters will not take kindly to the new results.

    • “Stenting is so accepted that American cardiologists said they were amazed ethics boards agreed to a study with a sham control group.”

      “But in the United Kingdom, said Dr. Davies, getting approval for the study was not so difficult. Neither was it difficult to find patients.”

      This looks like it could turn out to be a great case study against research monocultures. I saw the NIH actually bragging about it a while ago; this should not be seen as a good sign: https://www.statnews.com/2018/02/12/nih-funding-drug-development/

        • Yes – in the right setting that may keep hundreds of thousands of people from being cut open to receive devices that confer little benefit or even harm. That might have been the case here…

        • “Serious adverse events included four pressure-wire related complications in the placebo group, which required PCI…”

          So wait – four people in the control group (out of 95) ended up with severe bleeding (presumably caused by the unnecessary placebo procedure) and the accepted course of action was… the thing they were saying didn’t work (PCI)?

          And while I’m here, another question: What kind of relevant “treatment effect” are they estimating when they compare PCI to “placebo surgery”? If I were a (medical) doctor, I wouldn’t care if PCI worked better than cutting someone open and sewing them back up for no reason – I’d care about how effective it was compared to anything else I might reasonably do (or compared to doing nothing).

          I just don’t get why it was necessary. But apparently the authors thought it was, because they write “The efficacy of invasive procedures can be assessed with a placebo control, as is standard for pharmacotherapy.” Is that really any comparison we should care about – the effect of “real” invasive surgery v. invasive surgery that doesn’t actually do anything?

          I’m sure there must be some good reason for this, but I admit I’m struggling to see it. I mean yeah, I see your point to some degree about getting information (” ‘It was amazing to be able to watch the entire procedure on a TV screen. I learned a lot about my heart.’ Roberta, 41″)…but would Roberta consider that information worth a 5% chance of “severe bleeding” that required further surgery?

          Full disclosure: I only read the free parts of this article, so I may very well be missing something. Still – placebo surgery sorta blows my mind as a thing we do to people.

        • > I’d care about how effective it was compared to anything else I might reasonably do (or compared to doing nothing).

          How do you measure effectiveness? Sometimes there are no objective measures and you have to rely on patient-reported outcomes.
          If you treat some patients and do nothing to some other patients, and then ask them if they got any better, you cannot really trust their answers. There are several reasons why a sham procedure can provide a better baseline for understanding what’s going on.

          Conclusion: Sham surgery is associated with a large improvement in pain and other subjective patient-reported outcomes but with relatively small effect on objective outcomes. Sham surgeries are overwhelmingly safe. The magnitude of this effect should be used when planning future sham-controlled surgery trials.

        • It looks like both groups had a diagnostic cardiac catheterization (make a small incision in the groin or wrist, then thread wires and tubes through the arteries to the coronary arteries of the heart, which they can then use to take pictures of those arteries and measure a bunch of physiological parameters). One group also had stents placed, and the other (“placebo”) just sat there sedated for 15 minutes longer to simulate the time it would have taken to place the stents and repeat the physiological measurements (FFR and iFR of the lesions that were stented). In the appendix, they state that those four people had a dissection of the coronary artery. This means that the inner lining of the artery was torn by the wires they were using, and the flap that forms can completely block flow or enlarge and lead to other very bad things like cardiac tamponade. This absolutely needed to be treated immediately because it is life-threatening – a very different situation from stable angina, which is mostly about symptoms and exercise tolerance.

          From a medical perspective, it really does matter that they included a placebo group. These patients were already medically “optimized” but still were having significant symptoms. If the intervention was cheap and with almost no risk, then clinically it wouldn’t matter much if it was placebo or not. But PCI is expensive and can lead to its own complications. Even drug-eluting stents will eventually fail, and the usual treatment is to place another stent inside that one and balloon it open – this will fail even sooner.

    • That seemed to come up a lot when I worked with surgeons – most said studying that would be unethical/unacceptable to patients but then some group somewhere did a trial and had no difficulties recruiting patients.

      Of course, we would only hear about the successes :-(

  6. In the planned paper with John Carlin and Brahmajee Nallamothu:

    We suggest that the phrase used by these authors, “We deemed a p value less than 0.05 to be significant,” should be strongly discouraged, rather than actively demanded as is currently the case by many journal editors.

    I have been trying to make this point for many years, but my medical co-researchers always put it back in as “required by the journal’s guidelines.” In case someone knows of a high-ranked paper that makes the above claim, it would act as a tranquillizer.

    I wish there were a common statement from the Lancet, NEJM, BMJ, and the ASA on the subject.

    • Looks like I used double carets in the quote above. Here it is again:

      We suggest that the phrase used by these authors, “We deemed a p value less than 0.05 to be significant,” should be strongly discouraged, rather than actively demanded as is currently the case by many journal editors.

  7. I was skimming the paper and paused on this sentence – it is a great way to represent the effect size:

    “an increase in exercise time of 21 sec from 509 to 530 sec would take a patient from the 50th percentile to the 54th percentile of the distribution.”

    It can be hard sometimes to put a treatment effect in a context (or a unit of measure) that makes clear just how big or small it is. This framing forces the reader to immediately think about the effect in the context of other possible treatments or inputs or manipulations. We should do that more ourselves when we write, and demand it more as reviewers. This is a really nice example.
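    One way to see why that shift is small: under a rough normality assumption (my assumption, not the paper’s), going from the 50th to the 54th percentile on a 21-second gain implies a spread across patients of roughly 200 seconds:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.54)  # ~0.10 standard deviations above the median
implied_sd = 21 / z             # ~209 s spread, if exercise times were normal
print(round(implied_sd))
```

This is consistent with the point made elsewhere in the thread that an estimate of 16 or 21 seconds is tiny compared to the spread of the distribution across people.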

    • I go on about this all the time with my dimensional-analysis comments, but this is the essence of a good analysis of measurement: start off by finding dimensionless ratios where 1 is a change everyone cares about.

      Instead, here it seems they are trying to give some sense of scale after the fact. If they had started with something like a measurement of healthy people and a measurement of sick people pre-operation, and then created a dimensionless ratio where the difference in averages between these groups was 1, then a post-surgical change of 1 would be restoring an average patient to average health…

      • Imagine you have two groups of patients who each exercise for three sessions before and after the intervention.

        Then (iiuc) you would want to take the average of before for each patient, average of after for each patient, take the overall average of each group, divide both into some “healthy” exercise time, then subtract the two and report the difference. Possibly there would be a hierarchical model used to get pooling on the estimates, etc but that would be the gist of it.

        I dunno, I’d much rather know the distribution of individual actual exercise times than some multi-averaged normalized value. Once you start transforming the data like that there is no intuition about what the numbers really mean.

        And if you think the intervention is improving “heart function” or whatever there needs to also be at least one plot of individual “heart function” vs individual exercise time (if there doesn’t seem to be any trending during the sessions, you can put error bars around each for the three measurements).

        • >I dunno, I’d much rather know the distribution of individual actual exercise times than some multi-averaged normalized value. Once you start transforming the data like that there is no intuition about what the numbers really mean.

          I disagree. I have no intuition about what x more minutes means, but I do know that if you start at 50% of the value established as average for healthy people of your age and finish at 75%… that is a big improvement.

          I think the key is to make the denominator a well-known, well-established value. You don’t want too much gaming of the system there.
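          A minimal sketch of that kind of normalization, with made-up reference values (the healthy and sick averages below are hypothetical, not from the trial):

```python
healthy_avg = 600.0  # hypothetical average exercise time (s) for healthy people
sick_avg = 450.0     # hypothetical pre-treatment average for patients

def normalized(t):
    # 0 = average sick patient, 1 = average healthy person
    return (t - sick_avg) / (healthy_avg - sick_avg)

print(normalized(471))  # a 21 s gain covers 0.14 of the healthy-sick gap
```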

  8. Suggestion for your consideration: In your article, also maybe give the regression form of the two analysis methods (here in R pseudo-code, but easily translated as needed):

    The model they used:
    (1) lm(I(Ypost - Ypre) ~ T, data = d)

    The model you (and I) recommend:
    (2) lm(I(Ypost - Ypre) ~ Ypre + T, data = d)
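    A small simulation (in Python rather than R, with made-up numbers) shows why model (2) is preferable: when the change score is correlated with the baseline, both models are unbiased for the treatment effect under randomization, but the adjusted model has a noticeably smaller standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    # OLS fit with classical standard errors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

n = 500
T = rng.integers(0, 2, n).astype(float)  # randomized treatment indicator
ypre = rng.normal(500.0, 100.0, n)       # hypothetical baseline exercise times (s)
ypost = 250.0 + 0.5 * ypre + 20.0 * T + rng.normal(0.0, 30.0, n)  # true effect: 20 s
change = ypost - ypre

# Model (1): change score on treatment only
b1, se1 = ols(np.column_stack([np.ones(n), T]), change)

# Model (2): also adjust for the pre-treatment measurement
b2, se2 = ols(np.column_stack([np.ones(n), ypre, T]), change)

print(b1[1], se1[1])  # unadjusted estimate and its (larger) SE
print(b2[2], se2[2])  # adjusted estimate: same target, smaller SE
```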

  9. Just curious: how would you have treated the pre-post measurements? Would you have used the gain score as dependent and the pre score as a covariate? Or would you have used ANCOVA or LMM?

  10. How can we say this study was underpowered when it was powered to find a difference of 30 seconds and it found a difference of only 16 seconds? If a meaningful effect size is in fact 30 seconds, and we didn’t find it with our sample size, and the variability was similar to the power calculation, then there wasn’t a difference as far as this study is concerned. And there’s no evidence that there was a difference.

    • Lindsay:

      Actually, with the correct analysis the estimate was 21 seconds, not 16 seconds. But, either way, 16 or 21 is pretty tiny compared to the spread of the distribution across people.

  11. Chiming in very late.
    The support value for the primary outcome of the ORBITA trial is 37%. That is, the support the data give to the supposition that the true improvement in exercise treadmill time after PCI is 30 seconds or more is 37%. See arXiv:1806.02419 for details of how this value was derived.
