Massive confusion about a study that purports to show that exercise may increase heart risk

I read this front-page New York Times article and was immediately suspicious. Here’s the story (from reporter Gina Kolata):

Could exercise actually be bad for some healthy people? A well-known group of researchers, including one who helped write the scientific paper justifying national guidelines that promote exercise for all, say the answer may be a qualified yes.

By analyzing data from six rigorous exercise studies involving 1,687 people, the group found that about 10 percent actually got worse on at least one of the measures related to heart disease: blood pressure and levels of insulin, HDL cholesterol or triglycerides. About 7 percent got worse on at least two measures. And the researchers say they do not know why.

“It is bizarre,” said Claude Bouchard, lead author of the paper, published on Wednesday in the journal PLoS One . . .

Dr. Michael Lauer, director of the Division of Cardiovascular Sciences at the National Heart, Lung, and Blood Institute, the lead federal research institute on heart disease and strokes, was among the experts not involved in the provocative study who applauded it. “It is an interesting and well-done study,” he said.

What made me suspicious? Two things. First, I didn’t see why the researcher described it as “bizarre” that some people could get less healthy under an exercise regimen. Each person is an individual, and I would not be surprised at all to learn that a treatment that is effective for most people can hurt others. Once you accept the idea of a varying treatment effect, it’s natural enough to think that the effect could be negative for some people.

The other thing that bugged me was this:

Dr. Bouchard stumbled upon the adverse exercise effects when he looked at data from his own study that examined genetics and responses to exercise. He noticed that about 8 percent seemed to be getting worse on at least one measure of heart disease risk.

But couldn’t they have been getting worse if they’d received the control instead? We all know that a simple before-after comparison doesn’t give you a causal effect (except under the strong and implausible assumption that there would be zero change under the control).

The news article continues:

Some experts, like Dr. Benjamin Levine, a cardiologist and professor of exercise sciences at the University of Texas Southwestern Medical Center, asked whether the adverse responses represented just random fluctuations in heart risk measures. Would the same proportion of people who did not exercise also get worse over the same periods of time? Or what about seasonal variations in things like cholesterol? Maybe the adverse effects just reflected the time of year when people entered the study.

But the investigators examined those hypotheses and found that they did not hold up.

Hmmm . . . now I’m curious. How did the investigators find that those claims “did not hold up”? I followed the link to the paper, but I didn’t see the promised explanation. Here’s what they had:

A fundamental question is whether there are individuals who experience one or several adverse responses (ARs) in terms of exercise-induced changes in common risk factors. . . . Data on a maximum of 1687 adults from six studies were available for analysis. . . . For the four traits studied, some subjects experienced changes in an opposite, unfavorable direction compared to the expected beneficial effects. . . . we have conservatively defined an AR as a response beyond 2×TE in a direction indicating a worsening of the risk factor. For the four traits in the present study, twice the value of TE [“the technical error (TE), defined as the within-subject standard deviation as derived from repeated measures”] would mean that ARs would be reached if the exercise training-induced increases are ≥10 mm Hg for SBP, ≥0.42 mmol/L for plasma TG, and ≥24 pmol/L for plasma FI or if there is a decrease of ≤0.12 mmol/L for HDL-C. . . .

OK, so they’re defining a negative outcome as a before-after change of at least 2 standard errors in the unfavorable direction. This is not what I would do, but let’s go with it. The question remains: can’t such a change occur even in the absence of an exercise regimen? What makes the researchers so sure that these declines are attributable to the treatment, rather than that these were people who were going to have problems in any case?
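To make the question concrete, here is a quick simulation (my own sketch, not anything from the paper): give every subject an underlying risk factor that does not change at all, take two noisy measurements with within-subject SD equal to the TE, and count how many cross the 2×TE “adverse response” cutoff purely by chance.

```python
import random

random.seed(1)
TE = 1.0         # within-subject SD ("technical error"); units are arbitrary here
n = 100_000      # simulated subjects with no true change and no treatment at all

adverse = 0
for _ in range(n):
    before = random.gauss(0, TE)     # first noisy measurement
    after = random.gauss(0, TE)      # second noisy measurement of the same true value
    if after - before >= 2 * TE:     # the paper's adverse-response cutoff
        adverse += 1

print(adverse / n)   # ≈ 0.08: the change has SD sqrt(2)*TE, so 2*TE is only ~1.4 SDs
```

So under a pure null, roughly 8% of subjects are flagged as “adverse responders” on any single trait, which is in the same ballpark as the percentages the paper reports.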

I don’t see it. Here’s all that I could find:

The percentages of adverse responders for each trait for each study are depicted in Figure 2. It is remarkable that such cases were found in each study, even though the age and health status of the subjects were widely divergent and the exercise programs were quite heterogeneous.

I don’t see why this is so remarkable. As noted above, can’t some people just be getting better and some people getting worse?


My problems with the scientific paper and the news article are:

1. They make a big deal of the idea that exercise may increase heart risk, but it seems uncontroversial to me that an activity that helps most could be harmful to some.

2. All they seem to measure are before-after changes, so how can they be so sure about attributing causality?

Am I missing something? (This is not a rhetorical question!)

This is not my area of research, and there could well be something crucial that I didn’t notice. If I am right that the study is hopelessly flawed, the question then arises as to how the expert from the National Heart, Lung, and Blood Institute got fooled. It’s not such a surprise for a statistically flawed article to appear in a scientific journal; that happens all the time. But I’d expect better from the New York Times. Not because NYT reporters know more than journal referees, but because reporters call other experts to get their take on it.

Given all this, I’ll reluctantly assign a high probability to the hypothesis that I’m missing something important here. Perhaps someone out there could help clarify the situation?

22 thoughts on “Massive confusion about a study that purports to show that exercise may increase heart risk”

  1. I have already seen multiple stories on this and heard one on the BBC. This story is spreading like wildfire.

    Anyone know of any studies that look at the impact of new studies on behavior change? It would also be interesting to compare before and after the introduction of the internet. If that makes sense.

  2. “He noticed that about 8 percent seemed to be getting worse on at least one measure of heart disease risk.”

    And this still doesn’t say anything about the overall risk for those individuals.

  3. Andrew, this is my area of research (public health), and I don’t think you’re missing anything, and I’m not the least bit surprised that it resulted in big headlines in the NYT. This happens ALL THE TIME. Recall the recent results from Harvard regarding red meat and cancer: OK, this one wasn’t exactly front page in NYT, but still they were hyping another similarly hopelessly flawed study. And, it’s not too hard to find many more examples in the NYT, some of which certainly made the front page. See John Ioannidis.

  4. Also, I’d point out that many people greatly increase their carb intake when doing serious (or not so serious) training – Gatorade, energy gel (aka sugar), “carb loading”. Maybe, just maybe, some of Gary Taubes’ hypotheses have some merit.

  5. I believe you are not missing anything and are right on target. (Of course, I am even more likely to be missing something than you!)

    I would add that it is possible that some of these individuals whose measures declined are nonetheless healthier in terms of reduced mortality and other things that can’t be measured instantly.

  6. I had some of the same problems as you, but was more astonished that the within-subject standard error was measured using three observations on each subject. (Actually in one case, two observations combined with an estimate of the standard error from another source.) What isn’t clear to me is whether there is actually a change (up or down) in results or just a lot of heterogeneity in what they call Technical Error — the 3-observation-based standard error. After all, if the TE has enough standard deviation, you’re going to find a lot of people with two-standard-deviation moves down (and up) even if nothing is going on at all.

    • zbicyclist:

      It would be very convenient if one could reasonably know that a given effect would be of the same sign for all exposed. For instance, you could justify ignoring multilevel modeling and just get an (overly) precise weighted average as an MRP (poststratified) population estimate that, although counterfactual (not a real population), would settle the sign of that treatment for any population.

      Probably sounds strange, but it is what R. Peto’s group has done for years without any justification.

      I did think a bit about vaccines – but even that’s tricky.

      Also discussed a bit with David Cox, and his opinion was that it would be a very rare case where monotonicity of effect could be well argued for.

      The problem is that many people think (or were taught) that the same effect applies to all.

      • K?, of course I agree with your point as we had a similar back and forth in comments recently. However, this “monotonicity of effect” is exactly what’s being assumed, at some level, in many nonlinear statistical models (e.g., logistic regression and, errr, Cox proportional hazards). These models do not estimate an “average effect” in any sense; they estimate THE treatment effect.

  7. You are somewhat correct. The TE is not measured on the same people; one of the studies had an ancillary group of 60 ‘controls’ not engaging in an exercise program. That these controls well match the individuals in the other studies (or HERITAGE itself) isn’t terribly well supported. It may well underestimate the variation in the other groups, especially around the tails (where 60 is a totally inadequate sample). I don’t know if the temporal spacing of measurements was the same across studies; clearly the within-person measures are going to be temporally correlated, affecting the total within-person variability. Calling it “technical error” is potentially misleading, since most people assume this to mean true lab measurement error. The equality of other sources of dispersion doesn’t appear to be addressed.

    You don’t even need variable treatment effects. The first claim (10% drop by 2 SDs on 1/6) is totally unsurprising if the changes in biomarkers are iid normal (the null expectation is about 12%). The second claim (7% on 2/6) is surprising if they are iid, but not if some of them are highly correlated (they are) or if the normality assumption is bad (glancing at the data, I think it is).
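Those null expectations are easy to check with a little binomial arithmetic (assuming, as the comment above does, six independent measures each with a one-sided 2 SD tail probability of about 2.3%):

```python
p = 0.02275   # one-sided P(Z >= 2) for a standard normal
k = 6         # number of measures, assumed independent

p_none = (1 - p) ** k                      # probability no measure worsens by 2 SDs
p_exactly_1 = k * p * (1 - p) ** (k - 1)   # probability exactly one does

p_at_least_1 = 1 - p_none                  # ≈ 0.13: close to the observed ~10%
p_at_least_2 = 1 - p_none - p_exactly_1    # ≈ 0.007: far below the observed ~7%
print(p_at_least_1, p_at_least_2)
```

So “at least one adverse response” is roughly what iid noise predicts, while “at least two” is only surprising if you ignore the correlation (and non-normality) of the measures.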

  8. Actually, Jonathan, I think the three-observation calculation of TE only occurred in a subset (N=60) of one of the cohorts.

    I had a similar surprise at the lack of comparison of the change in outcomes to the control group. My best guess is that they believe the estimate of TE captures temporal variation of the outcomes, but as you point out, we don’t know anything about the population variation of this variation, as it stands.

  9. I’m glad you brought this up. I was curious what you’d think about it.

    1. I do think the finding would be unexpected and important, if it were proven true. These were people in an exercise study. They weren’t 80 years old, they didn’t have heart failure or severe lung disease. I would not expect a substantial number to be harmed by exercise. If I did, on my doctoring days (primary care), when starting an exercise routine I might institute an n-of-1 program of some sort – people with a poor response might not be recommended exercise – which would be a big change.

    2. The possibility that this is natural blood pressure change is minimal – if there’s something real that causes a 10 mmHg change over 3 weeks, I’d love to know about it.

    3. My guess is similar to yours, but is based around behavioral possibilities. Maybe a few patients got excited about their new fitness and stopped taking their BP medications. (“I’m exercising now so I don’t need the pill any more” happens pretty often.) More likely is that there was technical error in the measurement – blood pressures can be elevated if you use the wrong cuff size, take the measurement right after someone has walked into the clinic without a break, etc. Really, it would only take a few people to have forgotten to pick up their refills that week to cause this finding, right?

    Do these guesses seem possible?

  10. Andrew,

    A few more points (I also work a lot with cardiovascular epidemiologists)

    1) I think the finding would be regarded as surprising — the expectation in this field would be that a treatment having an average beneficial effect on a risk factor would be beneficial for everyone. This expectation wouldn’t extend to outcomes like heart attack, but for blood pressure or lipid levels I think it would be expected. Not to say that the expectation is necessarily reasonable, but I do think it exists. The Oxford clinical trials group (Peto et al) argue particularly strongly that treatments beneficial on average are likely to be beneficial in all identifiable subgroups.

    2) There has been some recent work by people at the University of Sydney (Bell is the first author on the papers; Les Irwig is one of the senior authors) showing that heterogeneity in treatment effect for some pharmacological treatments is smaller than physicians had thought. For example, they showed that nearly all the apparent heterogeneity in the blood pressure effect of ACE inhibitors in randomised trials could be explained by measurement error (sensu lato, ie, including real day-to-day variation). That would reinforce the assumption of homogeneous direction of effect. There’s also been a failure to find much in the way of common genetic variants that produce any meaningful heterogeneity in treatment response (apart from in cancer, where we’re talking new mutations, not inherited variation).

    3) I think the technical error calculation is actually wrong: that is, they estimate the within-person standard deviation, and use twice this as a threshold for differences. But the standard deviation of the difference is sqrt(2) times the within-person standard deviation, so they are only using a -1.4 SD threshold. Given that, you’d expect nearly 10% to exceed the threshold if there were no average treatment effect; less with some real average effect.

    4) The assumption of causality from before-after differences is not completely bogus, although it certainly isn’t reliable. The rationale, as I see it, is that cardiovascular risk factors tend to be reasonably stable over short periods of time: blood pressure, for example, is reasonably well modelled by a slow increasing trend (few mmHg/year) plus exchangeable noise. Under this model, before-after changes coincidental with the intervention probably are caused by some aspect of the intervention, if not necessarily by the physical exercise. Now, as I said, this isn’t really reliable, since the argument requires the model to work well in the tails, but it is at least a reasonable explanation for why causality is being assumed.
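The arithmetic in point 3 above can be verified directly (a sketch using the same TE convention as the quoted paper: within-subject SD for a single measurement):

```python
import math

def normal_tail(z):
    # P(Z >= z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

TE = 1.0
sd_of_change = math.sqrt(2) * TE   # SD of (after - before) when each measurement has SD TE
z = 2 * TE / sd_of_change          # the paper's 2*TE cutoff, in SDs of the change
print(round(z, 2))                 # 1.41
print(round(normal_tail(z), 3))    # 0.079 -- about 7.9% exceed the cutoff under the null
```

In other words, a 2×TE cutoff applied to a difference of two measurements is only a 1.41 SD cutoff, so nearly 8% of subjects land beyond it even with no treatment effect.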

    • Thomas: I noticed we both mentioned work by Sir Richard of Oxford (Clinical Trials Group) but I do not know if you noticed my post above – it may seem somewhat contradictory.
      His group does argue strongly for assuming common direction, but have they persuaded anyone in the Statistics discipline? I am not aware of any credible arguments that such a very informative prior on direction of effect is justified, nor that it could be checked against the data with any reasonable _power_. You set aside a type one error (mistaking technical error for underlying treatment variability), but what about the type two error? I am not sure that there can’t be effects that are always of the same sign, but I am concerned I will often be wrong about which.
      Credible arguments would be very important (the prior is very informative) but I find them lacking. I did invite Richard to present on this at a SAMSI summer program on meta-analysis but he declined – Ken Rice (who sometimes comments here) suggested we try to find a justification for Peto’s method in one of the workshops.
      I presented the argument of taking the weighted average as a poststratified estimate of the effect in the population defined by those weights. Jim Berger quickly pointed out that it would not be a real population, presumably as the weights are based on the standard errors the effect estimates ended up having in the particular studies. I had no response at the time, but later realized that if you could justify the assumption of common direction, the counterfactualness does not matter – positive effect in any population implies positive effect for all populations.
      But if you really want to entertain such a model, why not actually use it (set up a special random effects model in which say the true effects had a gamma distribution, i.e. were always positive) and do a Bayesian analysis or a frequentist analysis perhaps profiling out magnitude to focus on direction?
      This was mentioned briefly on page 26 of my thesis – not to suggest you should read any of it.
      I would very much look forward to seeing more of what you do with this!
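One way to see why such a sign-constrained random-effects model would matter (an illustrative sketch of my own, with all distributions and parameters chosen purely for illustration): even if every true effect is positive by construction, noisy study-level estimates will often come out negative, so observed signs alone cannot settle direction.

```python
import random

random.seed(2)

n = 20_000
negative_estimates = 0
for _ in range(n):
    # True effect always positive: gamma with shape 2, scale 0.5 (mean 1.0)
    true_effect = random.gammavariate(2.0, 0.5)
    # Noisy estimate of that effect, with standard error 1.0
    estimate = random.gauss(true_effect, 1.0)
    if estimate < 0:
        negative_estimates += 1

print(negative_estimates / n)   # roughly 0.2: many negative estimates despite all-positive truth
```

A Bayesian fit of such a model would then put all the action on the magnitude, with the direction fixed by the prior – which is exactly why the prior needs the strong justification discussed above.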

  11. I admit I haven’t read the new paper, but if this data is being drawn from Bouchard’s 2011 GWAS, that study looked very shaky to me. The sample size was quite small and, from what I recall, none of the results actually reached the standard statistical significance expected of a GWAS.

  12. Another thing seems funny:

    “The study subjects exercised at a range of intensities from very moderate to fairly intense. But intensity of effort was not related to the likelihood of an untoward effect. Nothing predicted who would have an adverse response.”

    I can certainly believe that a segment of the population reacts adversely to exercise but I’d normally expect to see a relationship between the reaction and the level of exercise.

  13. Pingback: Gelman: Massive confusion about a study that purports to show that exercise may increase heart risk « Stats in the Wild
