
Random patterns in data yield random conclusions.

Bert Gunter points to this New York Times article, “How Exercise May Make Us Healthier: People who exercise have different proteins moving through their bloodstreams than those who are generally sedentary,” writing that it is “hyping a Journal of Applied Physiology paper that is now my personal record holder for most extensive conclusions from practically no data by using all possible statistical (appearing) methodology . . . I [Gunter] find it breathtaking that it got through peer review.”

OK, to dispose of that last issue first, I’ve seen enough crap published by PNAS and Lancet to never find it breathtaking that anything gets through peer review.

But let’s look at the research paper itself, “Habitual aerobic exercise and circulating proteomic patterns in healthy adults: relation to indicators of healthspan,” by Jessica Santos-Parker, Keli Santos-Parker, Matthew McQueen, Christopher Martens, and Douglas Seals, which reports:

In this exploratory study, we assessed the plasma proteome (SOMAscan proteomic assay; 1,129 proteins) of healthy sedentary or aerobic exercise-trained young women and young and older men (n = 47). Using weighted correlation network analysis to identify clusters of highly co-expressed proteins, we characterized 10 distinct plasma proteomic modules (patterns).

Here’s what they found:

In healthy young men and women, 4 modules were associated with aerobic exercise status and 1 with participant sex. In healthy young and older men, 5 modules differed with age, but 2 of these were partially preserved at young adult levels in older men who exercised; among all men, 4 modules were associated with exercise status, including 3 of the 4 identified in young adults.

Uh oh. This does sound like a mess.

On the plus side, the study is described right in the abstract as “exploratory.” On the minus side, the word “exploratory” is not in the title, nor did it make it into the news article. The journal article concludes as follows:

Overall, these findings provide initial insight into circulating proteomic patterns modulated by habitual aerobic exercise in healthy young and older adults, the biological processes involved, and the relation between proteomic patterns and clinical and physiological indicators of human healthspan.

I do think this is a bit too strong. The “initial” in “initial insight” corresponds to the study being exploratory, but it does not seem like enough of a caveat to me, especially considering that the preceding sentences (“We were able to characterize . . . Habitual exercise-associated proteomic patterns were related to biological pathways . . . Several of the exercise-related proteomic patterns were associated . . .”) had no qualifications and were written exactly how you’d write them if the results came from a preregistered study of 10,000 randomly sampled people rather than an uncontrolled study of 47 people who happened to answer an ad.

How to analyze the data better?

But enough about the reporting. Let’s talk about how this exploratory study should’ve been analyzed. Or, for that matter, how it can be analyzed, as the data are still there, right?

To start with, don’t throw away data. For example, “Outliers were identified as protein values ≥ 3 standard deviations from the mean and were removed.” Huh?

Also this: “Because of the exploratory nature of this study, significance for all subsequent analyses was set at an uncorrected α < 0.05.” This makes no sense. Look at everything. Don’t use an arbitrary threshold. Also there’s some weird thing in which proteins were divided into 5 categories. It’s kind of a mess.

To be honest, I’m not quite sure what should be done here. They’re looking at 1,129 different proteins, so some sort of structuring needs to be done. But I don’t think it makes sense to do the structuring based on this little dataset from 47 people. A lot must already be known about these proteins, right? So I think the right way to go would be to use some pre-existing structuring of the proteins, then present the correlations of interest in a grid, then maybe fit some sort of multilevel model. I fear that the analysis in the published paper is not so useful, because it’s picking out a few random comparisons, and I’d guess that a replication study using the same methods would come up with completely different results.

Finally, I have no doubt that the subtitle of the news article, “People who exercise have different proteins moving through their bloodstreams than those who are generally sedentary,” is true, because any two groups of people will differ in all sorts of ways. I think the analysis as performed won’t help much in understanding these differences in the general population, but perhaps a multilevel model, along with more data, could give some insight.

P.S. Maybe the title of this post could be compressed to the following: Random in, random out.
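The pre-existing-structure idea can be sketched roughly as follows. This is a hypothetical illustration on simulated data, not the paper’s assay: the pathway labels, the data layout, and the crude 50/50 shrinkage are all made-up stand-ins for a real protein annotation database and a real fitted multilevel model.

```python
# Sketch: instead of deriving protein modules from n = 47, use a
# pre-existing pathway annotation to group proteins, present the
# per-protein correlations with exercise status in a grid, and
# partially pool them within each pathway.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Fake data standing in for the real assay: 47 subjects x 20 proteins,
# each protein tagged with a known pathway (hypothetical labels).
n_subj, n_prot = 47, 20
exercise = rng.integers(0, 2, n_subj)          # 0 = sedentary, 1 = trained
proteins = rng.normal(size=(n_subj, n_prot))
pathway = np.repeat(["inflammation", "lipid", "coagulation", "growth"], 5)

# Per-protein correlation with exercise status.
r = np.array([np.corrcoef(proteins[:, j], exercise)[0, 1]
              for j in range(n_prot)])

# Grid of correlations, organized by the prior structure.
grid = pd.DataFrame({"pathway": pathway, "r": r}).sort_values("pathway")

# Crude partial pooling toward the pathway mean (a stand-in for a real
# multilevel model): shrink each r halfway toward its pathway average.
pathway_mean = grid.groupby("pathway")["r"].transform("mean")
grid["r_pooled"] = 0.5 * grid["r"] + 0.5 * pathway_mean
print(grid)
```

The point of the shrinkage step is that individual correlations from 47 people are mostly noise; borrowing strength within known pathways is one way to keep the exploration honest.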


  1. Zad says:

    I think they give too much weight to their results, especially after seeing them describe how they interpret results that survive multiple-testing corrections. If you’re calling your study “exploratory,” why even bother with corrections and a testing interpretation? It’s as if the word “exploratory” was just tacked on: they didn’t really believe their study was exploratory, and they took results that passed the significance filter more seriously.

  2. Kyle C says:

    Gretchen Reynolds, who writes these health articles for the Times, is a mystery. She wrote a very good book about the indeterminacy of research on exercise, weight, and health, yet she continues to pump out this hype in her day job, as if she has no choice, because, after all, it’s a daily paper and we need things to read.

    It reminds me of political reporters who write big books about how the media got the narrative all wrong in a past election, then go back out on the campaign trail and chase the daily trivia all over again.

  3. This is why I think Raymond Hubbard’s viewpoints should be given more consideration as they pertain to evaluating the merits and demerits of RCTs also.

  4. 133 says:

    Publication pressure at its best.
    They collected some samples. No or very weak rationale behind the project. No estimation or hypothesis on the effect size. No use of prior knowledge.

    This type of paper can be written in a weekend. And also relax a bit.

    • Jeff Walker says:

      Exactly. If this were in a cell biology lab, this whole study would have been figure 1 of a supplement, and the rest of the study would be a series of experiments, in mice of course, to try to build a qualitative picture of causal paths between exercise, signaling pathways, cell processes, and health. In fact, there’s already lots of this done in mice. Of course, at the other end of the science spectrum, if the outcome were some measure of happiness, it would be in PNAS.

  5. zbicyclist says:

    “Outliers were identified as protein values ≥ 3 standard deviations from the mean and were removed.”

    So here’s a question:

    If we had MISSING data, we might use multiple imputation as a way to help ensure that our findings weren’t some artifact of a particular way of handling missing data. There are some classic references on this topic to help things along, and some software to do this easily.

    But what about outliers? There are all sorts of ways to define what an outlier is, and deal with it. How you handle outliers can have an outsized effect on the analysis (pun intended).

    Are there procedures roughly analogous to multiple imputation that provide a similar way to help ensure that your analysis isn’t primarily due to how you handled outliers?
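One way to make the analogy concrete is a multiverse-style sensitivity analysis: rerun the same analysis under several defensible outlier rules and report the spread of estimates, much as multiple imputation reports across imputed datasets. A minimal sketch on simulated data; the particular rules and cutoffs here are illustrative, not recommendations.

```python
# Sensitivity of a simple estimate (the mean) to the outlier rule.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
x[:3] += 8  # a few extreme values of unknown origin

def mean_after(rule, x):
    if rule == "keep all":
        return x.mean()
    if rule == "drop > 3 sd":
        z = (x - x.mean()) / x.std()
        return x[np.abs(z) < 3].mean()
    if rule == "winsorize at 5%":
        lo, hi = np.percentile(x, [5, 95])
        return np.clip(x, lo, hi).mean()
    if rule == "median":
        return np.median(x)

rules = ["keep all", "drop > 3 sd", "winsorize at 5%", "median"]
estimates = {r: mean_after(r, x) for r in rules}
for r, est in estimates.items():
    print(f"{r:18s} {est:+.3f}")
```

If the estimates disagree materially, the conclusion rests on the outlier rule rather than on the data, and that spread belongs in the paper.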

    • Andrew says:


      I think there are two issues here. First, we want to clean our data and remove or correct mistyped entries, random survey responses, etc. (except when we care about modeling such data-quality issues). Second, if data are legitimately outlying, we want to model them appropriately. I haven’t looked at the data for the above-linked paper so it’s not clear to me whether this outlier-removing thing was for either of the two reasons above. There may just be some big numbers in the data that could be analyzed directly.

      • Nick Patterson says:

        Andrew’s view seems to me much too extreme. As a working data analyst, one sees all the time a tranche of data that rather obviously shows some massive trouble. Suppose one is doing an assay where you expect the data to be roughly standard normal, but yesterday all values were > 100. In practice you would chuck the data from that day (and see if you could figure out what went wrong). Yes, this is unmodeled censoring of the data, but it’s common sense, and a full Bayesian analysis incorporating extreme mismeasurement is pretty much a waste of time.

        • Martha (Smith) says:

          I see Andrew’s view as being cautious, but not extreme. He is basically saying that something that looks unexpected should be investigated before making decisions on how to handle it. I see his response as saying something like “Try to figure out what went wrong before you chuck the data.”

        • Andrew says:


          Huh? I wrote above, “First, we want to clean our data and remove or correct mistyped entries, random survey responses, etc. (except when we care about modeling such data-quality issues).” So, yes, if you have screwed up data, clean them. That’s what I said. I did not recommend “a full Bayesian analysis incorporating extreme mismeasurement.” Whoever you’re disagreeing with here, it’s not me.

          Regarding the paper under discussion: it’s not at all clear they were discarding “data that rather obviously shows some massive trouble.” They just said: “Outliers were identified as protein values ≥ 3 standard deviations from the mean and were removed.” These might be perfectly good data that just happen to fall outside some prespecified range. I have no idea.

          • Nick says:

            Andrew: “To start with, don’t throw away data”

            Well, sometimes data is ridiculous. My point is that sometimes the common-sense thing to do is throw it away. The authors may well have discarded important data and introduced bias, but maybe not.

            Data cleaning is often a matter of judgement, and indeed in incompetent (or dishonest) hands it can lead to a garden of forking paths. I’m not trying to defend the paper: a sample size of 47 seems absurdly small for a complicated analysis.

            And to answer jd’s query below: if all you have are 10 samples, then there are lots of questions that can’t be answered from the data.

            • Andrew says:


              My advice was in the context of this paper. I think “don’t throw away data” is a much better choice than “Outliers were identified as protein values ≥ 3 standard deviations from the mean and were removed.” They could well be discarding important information here. Also, the term “outliers” is a bad sign, as it suggests they are removing data not because the data are ridiculous but just because they exceeded some threshold; and of course the 3-sd thing is not a good sign either, as it suggests they’re following an arbitrary rule rather than using common-sense judgment.

            • Michael Nelson says:

              The choice isn’t between throwing outliers away or treating them the same as the rest of the observations. There’s also Winsorizing, using the median or other robust statistics, or computing the result both with and without the outliers and reporting the extent to which it made a difference.

              In your example, where you eliminate all observations associated with a bad batch, I’d argue that you’re throwing away data primarily because you have strong evidence of a bad batch, and only secondarily because they are outliers. If, on the other hand, you threw away a few extreme observations from each batch each day, with no evidence that they are due to anything other than coincidentally lying in the tails of the population distribution, I’d recommend one of the approaches above.

              In terms of the uncertainty over the authors’ analytical rationale, it’s only uncertain because they didn’t explicitly provide it (perhaps in a footnote). Absent that clarification, it’s perfectly reasonable to interpret what they wrote in the way that they wrote it without qualification, i.e., they provided an arbitrary rule without giving a meaningful rationale, therefore I conclude that their rationale was arbitrary. Otherwise, we have to follow every critique of every paper with the qualification “…unless their description of their methods was incomplete or inaccurate.”

              • zbicyclist says:

                Michael Nelson: That’s where my initial comment was heading. We have options (throw away, treat them as the same, segment them in a different group, Winsorize, and so on).

                But unlike missing data, where there’s a decent literature and procedures that are arguably “best practice,” identifying and treating outliers seems to be outlaw land, where it’s every researcher for themselves.

              IMHO best practice is to treat every model as an explicit measurement error process initially and then consciously decide to ignore measurement error only when that is justifiable (for some people that will be nearly always, for others almost never). The methods mentioned here are all basically two-bit measurement error models, which are sometimes, or even often, enough. Occasionally you need to pull out some big guns and do full Bayesian inference on the underlying quantity.

        • Michael Lew says:

          The “outliers” may not have been outliers at all, as it’s common for the levels of low-concentration proteins in the blood to vary geometrically. That means that the > 3 SD “outliers” might have been within that threshold if the data were more appropriately scaled (i.e., logarithmically). The dynamite graphs and pie charts do not give the necessary detail to tell.
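This point is easy to demonstrate with simulated data (the distribution below is illustrative, not taken from the paper): lognormally distributed values produce far more raw-scale observations beyond 3 SD than a normal distribution would, and those same observations look unremarkable after a log transform.

```python
# Lognormal data: raw scale vs. log scale under a "3 SD" outlier rule.
import numpy as np

rng = np.random.default_rng(2)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def frac_beyond_3sd(v):
    # Fraction of values more than 3 standard deviations from the mean.
    z = (v - v.mean()) / v.std()
    return np.mean(np.abs(z) > 3)

print(frac_beyond_3sd(raw))          # well above the normal-theory ~0.3%
print(frac_beyond_3sd(np.log(raw)))  # back near ~0.3%
```

So a fixed 3-SD rule applied to unlogged protein concentrations would delete perfectly ordinary observations from the upper tail.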

          • Martha (Smith) says:

            Thanks for this. Throwing out “outliers” without discussion of context seems unreasonable to me.

            • Shravan says:

              In my statistics course I have introduced HW assignments where students have to deal with real, already-published data, and I show them that making some very innocuous-looking decisions about whether or not to remove one or two data points completely changes the picture in terms of statistical significance. You can make the effect appear or disappear, as desired, and all choices seem reasonable in retrospect.
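A toy version of that classroom demonstration (the numbers below are made up for illustration): dropping one extreme observation moves a two-sample comparison from clearly nonsignificant to clearly significant.

```python
# One point decides "significance": Welch t statistic with and without
# a single extreme observation (critical value roughly 2 at these df).
import numpy as np

def t_stat(a, b):
    # Welch two-sample t statistic
    na, nb = len(a), len(b)
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / na + b.var(ddof=1) / nb)

a = np.array([0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.0, 2.3])
b = np.array([0.0, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9, 6.0])  # one big value

print(abs(t_stat(a, b)))        # with the extreme point: well below 2
print(abs(t_stat(a, b[:-1])))   # after dropping it: above 2
```

Both choices, keeping and dropping, would look defensible in a methods section, which is exactly the problem.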

              IMO one should not make blanket “recommendations” to remove or not remove data. It has to be decided on a case-by-case basis, and the robustness of the conclusion should be independently establishable by the reader. The author must release data and code. Even today, this rarely happens. People write things like this in papers: data and code available “upon reasonable request,” or data and code available “to competent researchers.” You contact them for code and data, and they just ignore the request. There are lots of outs in the current climate: everyone says data should be publicly available, but journals do not enforce data release. In that situation, people will continue to engage in misconduct and questionable research practices, and somehow get the p below 0.05 or above 0.05, as desired.

              One wonderful exception is the Elsevier Journal of Memory and Language, which now demands data and code release. Since it’s the top-ranking or near-top-ranking journal in my field, this will set a new standard.

              When I review papers, I insist on seeing the data and code and I do analyze it myself. I rarely come to the same conclusion as the authors.

  6. jd says:

    It often seems that sample sizes in exploratory biomarker studies are low. I think sometimes (not in the above case) it is by necessity – the condition is rare and only samples from a few individuals are to be had.
    Notice the first paragraph in this paper:
    I’ve been looking at RNA-seq data, which seems like other p>>>n sorts of biomarker studies, and this sort of p>>>n thing really bothers me. There’re lots of tools for power analysis and well known analysis pipelines to use (usually involving thousands of glm’s and hypothesis tests), but it still makes me uneasy.
    I’ve seen some people attempting ideas that look more appealing, but it seems the problems are a bit of the same (see especially last couple comments):

    It would be great to hear some comments from knowledgeable people. What if you are stuck with samples from 10 people for the condition (i.e., that’s all the samples money can buy)? Seems like a rather desperate situation, but people are still doing analysis…
