The publication asymmetry: What happens if the New England Journal of Medicine publishes something that you think is wrong?

After reading my news article on the replication crisis, retired cardiac surgeon Gerald Weinstein wrote:

I have long been disappointed by the quality of research articles written by people and published by editors who should know better. Previously, I had published two articles on experimental design written with your colleague Bruce Levin [of the Columbia University biostatistics department]:

Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548.

Weinstein GS and Levin B: The effect of crossover on the statistical power of randomized studies. Ann. Thorac. Surg. 1989;48:490-495.

I [Weinstein] would like to point out some additional problems with such studies in the hope that you could address them in some future essays. I am focusing on one recent article in the New England Journal of Medicine because it is typical of so many other clinical studies:

Alirocumab and Cardiovascular Outcomes after Acute Coronary Syndrome

November 7, 2018 DOI: 10.1056/NEJMoa1801174

BACKGROUND

Patients who have had an acute coronary syndrome are at high risk for recurrent ischemic cardiovascular events. We sought to determine whether alirocumab, a human monoclonal antibody to proprotein convertase subtilisin–kexin type 9 (PCSK9), would improve cardiovascular outcomes after an acute coronary syndrome in patients receiving high-intensity statin therapy.

METHODS

We conducted a multicenter, randomized, double-blind, placebo-controlled trial involving 18,924 patients who had an acute coronary syndrome 1 to 12 months earlier, had a low-density lipoprotein (LDL) cholesterol level of at least 70 mg per deciliter (1.8 mmol per liter), a non−high-density lipoprotein cholesterol level of at least 100 mg per deciliter (2.6 mmol per liter), or an apolipoprotein B level of at least 80 mg per deciliter, and were receiving statin therapy at a high-intensity dose or at the maximum tolerated dose. Patients were randomly assigned to receive alirocumab subcutaneously at a dose of 75 mg (9462 patients) or matching placebo (9462 patients) every 2 weeks. The dose of alirocumab was adjusted under blinded conditions to target an LDL cholesterol level of 25 to 50 mg per deciliter (0.6 to 1.3 mmol per liter). “The primary end point was a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.”

RESULTS

The median duration of follow-up was 2.8 years. A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group (hazard ratio, 0.85; 95% confidence interval [CI], 0.78 to 0.93; P<0.001). A total of 334 patients (3.5%) in the alirocumab group and 392 patients (4.1%) in the placebo group died (hazard ratio, 0.85; 95% CI, 0.73 to 0.98). The absolute benefit of alirocumab with respect to the composite primary end point was greater among patients who had a baseline LDL cholesterol level of 100 mg or more per deciliter than among patients who had a lower baseline level. The incidence of adverse events was similar in the two groups, with the exception of local injection-site reactions (3.8% in the alirocumab group vs. 2.1% in the placebo group).

Here are some major problems I [Weinstein] have found in this study:

1. Misleading terminology: the “primary composite endpoint.” Many drug studies, such as those concerning PCSK9 inhibitors (which are supposed to lower LDL or “bad” cholesterol), use the term “primary endpoint” for what is actually “a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.” [Emphasis added]

Obviously, a “composite primary endpoint” is an oxymoron (which of the primary colors are composites?) but, worse, the term is so broad that it casts doubt on any conclusions drawn. For example, stroke is generally an embolic phenomenon and may be caused by atherosclerosis, but may also be due to atrial fibrillation in at least 15% of cases. Including stroke in the “primary composite endpoint” is misleading, at best.

By casting such a broad net, the investigators seem to be seeking evidence from any of the four elements in the so-called primary endpoint. Instead of being specific as to which types of events are prevented, the composite primary endpoint obscures the clinical benefit.

2. The use of relative risks, odds ratios, or hazard ratios to obscure clinically insignificant absolute differences. “A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group.” This is an absolute difference of only 1.6 percentage points. Such small differences are unlikely to be clinically important, or even to be replicated in subsequent studies, yet the authors obscure this fact by citing hazard ratios. Only in a supplemental appendix (available online) does this become apparent. Note the enlarged and prominently displayed hazard ratio in the appendix figure, which draws attention away from the almost nonexistent difference in event rates (and the lack of error bars). Of course, when the absolute differences are small, the ratio of two small numbers can be misleadingly large.

I am concerned because this type of thing is appearing more and more frequently. Minimally effective drugs are being promoted at great expense, and investigators are unthinkingly adopting questionable methods in search of new treatments. No wonder such findings can’t be replicated.
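To make the arithmetic behind Weinstein’s second point concrete, here is a minimal sketch in Python. The event counts come straight from the quoted abstract; note that the published 0.85 is a hazard ratio from a time-to-event model, so the crude relative risk computed here only happens to land nearby.

    # Absolute vs. relative effect measures, from the counts in the abstract.
    events_drug, n_drug = 903, 9462          # alirocumab group
    events_placebo, n_placebo = 1052, 9462   # placebo group

    risk_drug = events_drug / n_drug            # ~0.095
    risk_placebo = events_placebo / n_placebo   # ~0.111

    arr = risk_placebo - risk_drug  # absolute risk reduction, ~1.6 points
    rr = risk_drug / risk_placebo   # crude relative risk, ~0.86
    nnt = 1 / arr                   # number needed to treat, ~63

    print(f"ARR = {arr:.3f}, RR = {rr:.2f}, NNT = {nnt:.0f}")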

I suggested to Weinstein that he write a letter to the journal, and he replied:

Unfortunately, the New England Journal of Medicine has a strict 175-word limit on letters to the editor.

In addition, they have not been very receptive to my previous submissions. Today they rejected my short letter on an article whose conclusion was the opposite of what its own data showed, due to a similar category error, even though I kept it within that word limit.

“I am sorry that we will not be able to publish your recent letter to the editor regarding the Perner article of 06-Dec-2018. The space available for correspondence is very limited, and we must use our judgment to present a representative selection of the material received.” Of course, they have the space to publish articles that are false on their face.

Here is the letter they rejected:

Re: Pantoprazole in Patients at Risk for Gastrointestinal Bleeding in the ICU

(December 6, 2018 N Engl J Med 2018; 379:2199-2208)

This article appears to reach an erroneous conclusion based on its own data. The study implies that pantoprazole is ineffective in preventing GI bleeding in ICU patients when, in fact, the results show that it is effective.

The purpose of the study was to evaluate the effectiveness of pantoprazole in preventing GI bleeding. Instead, the abstract shifts gears and uses death within 90 days as the primary endpoint, and the Results section focuses on “at least one clinically important event (a composite of clinically important gastrointestinal bleeding, pneumonia, Clostridium difficile infection, or myocardial ischemia).” For mortality and for the composite “clinically important event,” relative risks, confidence intervals, and p-values are given, indicating no significant difference between pantoprazole and control. But no p-value was provided for GI bleeding, which is the real primary endpoint, even though “In the pantoprazole group, 2.5% of patients had clinically important gastrointestinal bleeding, as compared with 4.2% in the placebo group.” According to my calculations, the chi-square value is 7.23, with a p-value of 0.0072, indicating that pantoprazole is effective at the p<0.05 level in decreasing gastrointestinal bleeding in ICU patients. [emphasis added]

My concern is that clinicians may be misled into believing that pantoprazole is not effective in preventing GI bleeding in ICU patients when the study indicates that it is, in fact, effective.

This sort of mislabeling of end-points is now commonplace in many medical journals. I am hoping you can shed some light on this. Perhaps you might be able to get the NY Times or the NEJM to publish an essay by you on this subject, as I believe the quality of medical publications is suffering from this practice.

I have no idea. I’m a bit intimidated by medical research with all its specialized measurements and models. So I don’t think I’m the right person to write this essay; indeed I haven’t even put in the work to evaluate Weinstein’s claims above.
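One purely mechanical check, though, which requires no medical judgment: the chi-square value in the rejected letter can be reproduced from the published percentages. Here’s a minimal sketch in Python; the group sizes of 1,645 and 1,653 are my assumption (chosen to be consistent with the roughly 3,300-patient trial and the reported 2.5% and 4.2%), not figures from the letter.

    # Reconstructed 2x2 table: clinically important GI bleeding, yes/no.
    from scipy.stats import chi2_contingency

    bleed_drug, n_drug = 41, 1645        # ~2.5% with pantoprazole (reconstructed)
    bleed_placebo, n_placebo = 69, 1653  # ~4.2% with placebo (reconstructed)

    table = [[bleed_drug, n_drug - bleed_drug],
             [bleed_placebo, n_placebo - bleed_placebo]]

    chi2, p, dof, expected = chi2_contingency(table, correction=False)  # Pearson test
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # ~7.23 and ~0.0072, matching the letter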

But I do think they’re worth sharing, just because there is this “publication asymmetry” in which, once something appears in print, especially in a prestigious journal, it becomes very difficult to criticize (except in certain cases when there’s a lot of money, politics, or publicity involved).

19 thoughts on “The publication asymmetry: What happens if the New England Journal of Medicine publishes something that you think is wrong?”

  1. If the researchers can construct the composite endpoint, they must know the underlying causes of mortality, right? Couldn’t they (or an interested third party) redo the analysis with different endpoints / on subsets of the mortality data? If that leads to very different sub-models, you get a clear hint that the underlying effect works in different ways and that the model with combined endpoints is not a feasible model.

    • Your model of the researchers’ motivations is very different from mine. I pretty much assume they couldn’t get statistically significant results from each of those individual outcomes because they’re too noisy, so they constructed a composite measure and managed to score a win and a publication, chits all around…

      • I don’t know, and can’t comment on, how the composite outcome was arrived at in the study in question in the main post. But I am familiar with a number of recent studies using composite outcomes. Daniel Lakeland is only partly right. These composite outcomes are, indeed, used because the number of events for any individual outcome in a feasible study is usually too small to support reliable inference. But the composite, at least nowadays, is typically pre-specified and is not a post-hoc search for p < 0.05. In fact, some specialty societies have developed standardized composite outcomes that they recommend for use in research studies, and researchers often follow those guidelines.

        Willem is correct that usually the presumed or adjudicated underlying causes of mortality are available to the researchers. Lack of information is not the issue: it's small numbers. The fact is that today the fatality rates for some conditions are so low that it is becoming difficult to conduct studies of interventions that might lead to further improvements. (Death from early stage breast cancer with favorable biomarkers is a good example of this.) The event rates are so low that very large trials of lengthy duration are needed, and funding agencies, quite reasonably in my opinion, often feel that there are better uses to which research resources can be put.

        When the individual components of the composite are sufficiently biologically similar, or sufficiently similar in the anticipated way they will respond to the intervention under study, it strikes me as perfectly reasonable to use composites. But throwing together a bunch of outcomes whose only similarity is that they are "bad" strikes me as potentially quite misleading. That said, where to draw the line on sufficiently similar can be a difficult question.

        • If someone were to ask me what I think SHOULD be done, I’d say you should model underlying causes, as well as individual outcomes, and then from the individual outcomes you can create inference for as many composite outcomes as you like.
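          Something like this toy sketch, say. The per-component counts are invented, and combining the components assumes independence across outcomes, which is unrealistic; with patient-level data you’d model the joint outcome instead.

            import numpy as np

            rng = np.random.default_rng(0)
            n_treat, n_ctrl, draws = 9462, 9462, 10_000

            # Invented per-component event counts: (treatment arm, control arm).
            components = {"CHD death": (200, 240), "nonfatal MI": (500, 580),
                          "ischemic stroke": (100, 110), "unstable angina": (40, 60)}

            def composite_rate(n, counts):
                # Posterior draws of P(at least one component event), Beta(1,1)
                # priors, combined under an (unrealistic) independence assumption.
                p = np.column_stack([rng.beta(1 + c, 1 + n - c, draws) for c in counts])
                return 1 - np.prod(1 - p, axis=1)

            treat = composite_rate(n_treat, [t for t, _ in components.values()])
            ctrl = composite_rate(n_ctrl, [c for _, c in components.values()])
            diff = ctrl - treat
            print(f"risk difference: mean {diff.mean():.4f}, 95% interval "
                  f"({np.quantile(diff, 0.025):.4f}, {np.quantile(diff, 0.975):.4f})")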

        • Just to add a bit to what Clyde said: composite outcomes are often adopted for bad reasons, i.e., to increase the incidence of the outcome so that you can get a sensible sample size out of your sample-size calculation and hence have a reasonable shot at p<0.05. What often happens then is that any differences are driven mostly by the less serious components of the composite, so the result is difficult to interpret.

          I'm in favour generally of using multiple outcomes to draw conclusions from trials, as it's rare that there is one overriding outcome that determines whether the treatment is a good idea or not, but simple composites don't do that very well. Weighting all the components equally makes no sense for a start.

        • Composite outcomes can be very thoughtful.

          I picked up on the idea from Mosteller and Tukey and used the idea to try to model study quality effects. Later Sander Greenland showed how to partially pool the individual quality items towards the composite score as a better compromise – https://www.researchgate.net/publication/10601089_On_the_bias_produced_by_quality_scores_in_meta-analysis_and_a_hierarchical_view_of_proposed_solutions

          I also used to be involved in running expert advisory panels to develop pre-specified composite outcome scores for clinical trials. Difficult and very expensive work – but it was very enjoyable.

          From an economic perspective – if it’s ethical and possible – it’s likely better to just do a much larger trial.

        • Bit of a tangent, but I noticed that this statistical analysis plan includes a “sequential inferential approach” for the secondary outcomes, i.e. they have a sequence of outcomes and only test the next if the first is “significant”. Personally, I think that’s horrible; it just seems to be getting so far away from trying to learn and understand from the data.

          But… error control!
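          For concreteness, the procedure being described works something like this (a minimal sketch, assuming a flat 0.05 threshold throughout):

            def fixed_sequence_test(p_values, alpha=0.05):
                # Test pre-ordered outcomes until the first non-significant one;
                # everything after it is never formally tested.
                verdicts = []
                for p in p_values:
                    if p < alpha:
                        verdicts.append("significant")
                    else:
                        verdicts.append("not significant")
                        break
                verdicts += ["not tested"] * (len(p_values) - len(verdicts))
                return verdicts

            # fixed_sequence_test([0.01, 0.03, 0.20, 0.001])
            # -> ['significant', 'significant', 'not significant', 'not tested']
            # The 0.001 at the end is never tested -- which is exactly the complaint.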

        • Note I wasn’t specifically arguing that they p-hacked the composite score, but rather that, going into the whole thing, they knew they wouldn’t get any single result with enough N to reach statistical significance, so they specified a bunch of related things they could pool together and have a reasonable chance of turning into a statistically significant result. Then they pulled the handle and watched the wheels spin… and got lucky that they managed to include enough stuff with small effects to ring the bell.

        • Spinning the wheel had a cost in the hundreds of millions of dollars, so it’s understandable they tried to balance the potential strength of a positive outcome and the chance of getting a positive outcome (from the regulatory point of view, at least) at all.

        • IMHO this is down to a poor regulatory concept. A good way to handle this would be to put a per-person cost on all the things they think they might be able to improve, as well as a per-person cost on the different kinds of side effects (ideally the regulatory body would have these numbers on file), then get a Bayesian estimate of each outcome, combine the costs and benefits using the posterior, and approve the drug if its benefits outweigh its costs. Since they are getting a government-guaranteed monopoly (patent), I’d also say they should be required to specify a price ceiling they guarantee and include that in the costs.
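          A toy version of that decision rule, with entirely invented costs and counts, just to show the mechanics:

            import numpy as np

            rng = np.random.default_rng(1)
            draws, n = 10_000, 1000  # hypothetical patients per arm

            # outcome: (cost per event, events in drug arm, events in placebo arm)
            # -- all numbers invented for illustration
            outcomes = {"myocardial infarction": (-80_000, 45, 60),
                        "injection-site reaction": (-500, 38, 21)}

            net = np.zeros(draws)
            for cost, e_drug, e_placebo in outcomes.values():
                p_drug = rng.beta(1 + e_drug, 1 + n - e_drug, draws)
                p_placebo = rng.beta(1 + e_placebo, 1 + n - e_placebo, draws)
                net += cost * (p_drug - p_placebo)  # per-person change in expected cost

            net -= 2_000  # assumed per-person price of the drug
            print(f"P(net benefit > 0) = {(net > 0).mean():.2f}; "
                  f"expected net benefit per person = {net.mean():,.0f}")
            # Approve if the posterior expected net benefit is positive (or if
            # P(net benefit > 0) clears whatever bar the regulator sets).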

    • Jordan:

      I followed the link, and I don’t enjoy the p-values, “empirical” or otherwise! As usual, I think they’re working hard to get a precise answer to the wrong question.

  2. My all-time favourite composite outcome is from the J-rhythm trial:

    …the primary endpoint is defined as a composite of total mortality, symptomatic cerebral infarction, systemic embolism, major bleeding, hospitalization for heart failure requiring intravenous administration of diuretics, or physical/psychological disability requiring discontinuation of the assigned therapeutic strategy…

    lumping death in with psychological disability.

    (And in case you’re wondering, there was an absolute difference of 6.6% between groups for the primary composite endpoint – 5.6% of this due to the “disability” component and 1% due to the others.)

    • “lumping death in with psychological disability”

      Well, it wasn’t just any psychological disability; it was “physical/psychological disability requiring discontinuation of the assigned therapeutic strategy…”, which makes one wonder if the “psychological disability” in question was something quite severe, like hallucinations that prompted behavior that could be seriously harmful to the patient or bystanders.

    • Yup. And quoting from that:

      “The choice of the components of a composite endpoint should be made carefully. Because the occurrence of any one of the individual components is considered to be an endpoint event, each of the components is of equal importance in the analysis of the composite. The treatment effect on the composite rate can be interpreted as characterizing the overall clinical effect when the individual events all have reasonably similar clinical importance.”

  3. I am an ICU doctor in training and I disagree with his letter on the Pantoprazole-GI bleeding trial.

    The purpose of the trial was not just to investigate whether pantoprazole is effective in preventing GI bleeding; you can read the trial protocol and see that this was not their stated intent. That pantoprazole prevents GI bleeding is an uncontroversial position, and nobody would really run a trial to investigate this.

    The trial was done because it is increasingly recognized that giving broad swaths of patients pantoprazole to prevent GI bleeding carries tradeoffs – namely, a risk of pneumonia, gut infections, and possibly heart attacks. This is why the trial measured the composite outcome: to see if the intervention provided more help than harm overall. I normally fall in the anti-composite camp, but this is a trial where I found the composite appropriate.

    The primary outcome of mortality was chosen because these are ICU patients, at high risk of death, so ultimately what we want to know is whether our interventions are saving lives or not – though a criticism I have of this trial is that it was underpowered for that endpoint.
