Regression to the mean continues to confuse people and lead to errors in published research

David Allison sends along this paper by Tanya Halliday, Diana Thomas, Cynthia Siu, and himself, “Failing to account for regression to the mean results in unjustified conclusions.” It’s a letter to the editor in the Journal of Women & Aging, responding to the article, “Striving for a healthy weight in an older lesbian population,” by Tomisek et al. in that journal. Halliday et al. write:

The authors conclude that the SHE [“Strong. Healthy. Energized”] program should be adopted . . . as it demonstrated “effectiveness in improving health behaviors and short-term health outcomes in the target population.” Specifically, the authors make this conclusion based upon a “marked step increase” for participants in the lowest tertile-defined category of baseline step count. However, the analysis does not support this conclusion. This is because regression to the mean (RTM), rather than treatment effectiveness, explains, in part, the arrived-at conclusion.

RTM is a statistical phenomenon that describes the tendency for extreme values observed on initial assessment to be less extreme and closer to the population mean with repeated measurement when the correlation coefficient is less than 1.0 . . . RTM is a concept that has often been ignored and misunderstood in health and obesity-related research . . . Failure to account for RTM often leads to errors in interpretation of results and unjustified conclusions. In pre-/poststudy designs that lack a comparator control group, neglecting RTM can lead to the inaccurate conclusion that an intervention was effective in improving a health outcome in a group of participants.
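To see how this plays out in a pre/post design like the one criticized here, below is a minimal simulation sketch. The numbers (step counts, noise levels) are made up for illustration and have nothing to do with the SHE data; there is no intervention in the simulation at all, yet the lowest baseline tertile still shows a "marked step increase" at follow-up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical numbers, purely for illustration (not the SHE data):
n = 300
true_steps = rng.normal(7000, 1500, n)           # each person's long-run average daily steps
baseline = true_steps + rng.normal(0, 1000, n)   # noisy baseline measurement
followup = true_steps + rng.normal(0, 1000, n)   # noisy follow-up, with NO intervention

# Select the lowest tertile on the noisy baseline measurement
low = baseline <= np.quantile(baseline, 1 / 3)

print(f"Lowest tertile, baseline mean:  {baseline[low].mean():.0f} steps")
print(f"Lowest tertile, follow-up mean: {followup[low].mean():.0f} steps")
print(f"Apparent 'improvement':         {followup[low].mean() - baseline[low].mean():.0f} steps")
```

The entire "improvement" comes from selecting on a noisy measurement; a control group selected the same way would show the same increase, which is why the lack of a comparator matters so much.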

What happened in this case? Halliday et al. continue:

In the results section of Tomisek et al. (2017), the authors state, “The SHE program was most effective for participants with low levels of physical activity and steps.” . . . The expected analytical approach of evaluating change in step count for the entire sample was not reported . . . given the acknowledgement that a decrease in body weight in the group with highest initial body weights was expected, the same logic should have been applied to the outcome of change in step counts. Thus, the conclusion that there was evidence for effectiveness is not justified given that the results are likely due to RTM and not specific intervention effects attributable to the SHE program. This does not mean that the SHE program is not effective, only that it was not convincingly shown to be effective by ordinary scientific standards in this study.

And:

Interventions that are evaluated without the use of a control group are susceptible to the reliance on results that may be a consequence of RTM. Greater vigilance regarding RTM is necessary throughout the research and publication process.

Yup.

What matters here

The news here is not that a statistical mistake got published in an obscure journal. You could spend your whole life going through the published literature finding papers with fatal statistical flaws. The news is that regression to the mean remains paradoxical and continues to mislead people; hence this post.

The authors of the original paper respond

The above-linked article includes a response from the authors of the paper. Unless I missed it, the authors in their response forgot to say, “We thank Halliday et al. for pointing out the flaw in our analysis, and here is our bias-corrected estimate…”

It’s frustrating when people can’t admit their errors. We all make mistakes; what’s important is to learn from them.

54 thoughts on "Regression to the mean continues to confuse people and lead to errors in published research"

    • Stephen:

      There’s something about regression-to-the-mean that just invites explanations. No matter how much we think we understand it, we’re always trying to explain it in a new way. Sometimes I feel that the essence of the regression-to-the-mean problem is causal inference—the mistaken imputation of causality to patterns that can be explained using random variation—but then I think of the classic problem of heights of parents and children, where causality doesn’t really come in at all.

      Someone put together a book some years ago with 50 or 100 proofs of the Pythagorean theorem. Maybe we need a book with 50 explanations of regression to the mean.

      • Yes, Andrew & Stephen. Kahneman’s standard explanation is ‘wanting’. No pun intended. I think that many of these explanations are due for revision despite the attempts to explain it in new ways. It would also require revising the 235 or so biases implicated in several domains, I speculate.

      • Galton’s original unearthing of regression to the mean plotted offspring’s height against parents’ mid-height. It certainly invites the interpretation that parents cause offspring height, as in: tall parents produce, on average, not-so-tall kids. He cleverly reversed the axes so that offspring’s height appeared to cause the mid-height of the parents; hence, tall kids produce, on average, less tall parents.
        Thus, there seemed to be no causal mechanism for the improvement of the human race. This apparent lack of generational improvement led Galton to espouse and help create eugenics, because it proved to him that without interference the human race would never improve. And to the embarrassment of statisticians of today, Karl Pearson and R.A. Fisher were the fellow originators and prime movers of eugenics.

        • “And to the embarrassment of statisticians of today, Karl Pearson and R.A. Fisher were the fellow originators and prime movers of eugenics.”

          Eugenics, with the goal of improving the lives of *EVERYONE*, is a good thing. When twisted, even slightly, by politics, racism, etc., it is an incredibly bad, dangerous thing.

          Justin
          http://www.statisticool.com

        • Justin Smith wrote:

          “Eugenics, with the goal of improving the lives of *EVERYONE*, is a good thing. When twisted, even slightly, by politics, racism, etc., it is an incredibly bad, dangerous thing.”

          Eugenics as conceived and promulgated by its statistician originators, Galton, Pearson and Fisher, was dedicated to preventing the “wrong” people from producing offspring and to encourage the “right” people to do so. This conceptual framework of the founders of the discipline did not need to be “twisted” by politicians or racists. “Improving the lives of *EVERYONE*” was never on the statistics agenda.

      • “Someone put together a book some years ago with 50 or 100 proofs of the Pythagorean theorem.”

        I believe you are referring to The Pythagorean Proposition, by E. S. Loomis, which contains 367 proofs of the Pythagorean Theorem (see Footnote 2 of https://www.cut-the-knot.org/pythagoras/index.shtml).

        I agree that we need a similar collection for regression to the mean. It might help make RTTM as basic in elementary statistics as the Pythagorean Theorem is in elementary geometry.

      • Andrew,

        isn’t regression to the mean (after an outlier is observed) essentially the same as your frequent remark that a statistically significant estimate will overestimate the magnitude of the underlying effect?

        • I’m not sure I would call it “essentially the same”, but the similarity also came to my mind last night.

    • Thank you for your helpful article. It caused a couple of ideas to sprout in the weedy garden that is my amateur thinking about statistics.

      First, mightn’t regression to the mean have been one of the primary motivations for Gosset and Fisher in their search for a measure of statistical significance? After all, they each had had prior experiences with attempts to increase agricultural output that produced “Eureka! That fertilizer really is new and improved!” moments, followed by bitter disappointment a growing-season later. The p-value then might have been conceived as a sort of defense against category errors (rather than anything to do with causation), wherein the variation within a complex system is assumed and the suspicion that a different system might be present is only allowed to creep in when the p-value is very low.

      Second, I started wondering why nobody ever seems to talk about regression to the median or regression to the mode. I assumed it’s because they produce embarrassingly precise numbers, but on further reflection realized that calculated means are usually declared with curious precision. I also thought regression to a mode, say in the case of how many children Americans who start families after a recession most often want to have (3), might tell us something interesting, whereas the mean (1.87) tells us something stupid. So I went rummaging around and, while not finding much, did find something you wrote in 1990 with regression to the mode in the title. No doubt the ideas were fresh back then, but I assumed that after 28 years Taylor & Francis would have at least moved them to the day-old bin. Alas, no such luck for me, and the paywall is fifty dollars high. So, if you wouldn’t mind, what are/were your thoughts about regression to the mode?

    • I think that a convincing explanation can be based on a graph that shows the fluctuations of symptoms from day to day. You tend to go for treatment when your symptoms are at their worst. The next day they are better (and would have been without the treatment). There is an example in my post on this topic, “Placebo effects are weak: regression to the mean is the main reason ineffective treatments appear to work”, at http://www.dcscience.net/2015/12/11/placebo-effects-are-weak-regression-to-the-mean-is-the-main-reason-ineffective-treatments-appear-to-work/
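      A toy version of this point, with made-up numbers: simulate noisy daily symptom scores around each person’s typical level, pick out the day each person would plausibly seek treatment (their worst day in the window), and compare it with the following day. No treatment enters the model at all, yet the next day looks better.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model of day-to-day symptom fluctuation (illustrative numbers only):
# each person's symptom score wobbles randomly around their own typical level.
n_people, n_days = 1000, 30
typical = rng.normal(5, 1, n_people)[:, None]            # person-level average severity
daily = typical + rng.normal(0, 2, (n_people, n_days))   # day-to-day noise, no treatment anywhere

worst_day = daily[:, :-1].argmax(axis=1)                 # the day each person "seeks treatment"
rows = np.arange(n_people)

print("Severity on the day treatment is sought:", daily[rows, worst_day].mean())
print("Severity the following day (no treatment in the model):", daily[rows, worst_day + 1].mean())
```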

      • Nice example — especially as I have been debating with myself about whether or not to resume PT treatment for my sore wrist and elbows; at least it gives me a plausible rationale not to bother today and see how they feel mañana. ;~)

  1. Like Andrew, I think it is an issue of overinterpreting a bunch of things, including natural variation, as if they were an actual shift in the average effect. We all know that a statistical phenomenon cannot cause something physical; it can only be observed. In baseball it might be a hitting streak; in research, sampling variation or floor and ceiling effects. With height, I think it may be that really tall people likely have the genetics of shorter people: just as two parents with recessive traits can produce offspring with a trait not apparent in either parent, two tall parents may be more likely to carry the genes of people who are, on average, shorter than they both are. Perhaps someone who actually understands genetics, unlike myself, could weigh in.

  2. It is always a good habit to leave open the possibility that one is wrong. It can save one from getting hammered by even more cranky temperaments who NEVER admit they are wrong. LOL

    Behind the scenes discussions in so many fields reflect that competition and incentives drive temperamental reactions. I admire responders who adapt to the people around them because there are many difficult people. Let’s be candid about this. Sticking to the merits is a very good habit of mind.

    • Many people implicitly disagree with you, even if they would say something very similar to what you said. Many view such stonewalling (tempered slightly with diplomacy) as we see here as an effective career strategy. A large corpus of stories supports this position. Is there much evidence against it?

      With a slightly different tack: if your career vision involves getting public funding for research, pursue paths that promise to find evidence for a position with strong political support. (As a scientist, of course, you will avoid mentioning that you do this.) This is not a story that institutions of Science can afford to tell about themselves. But the logic is self-evident, and history provides plenty of support.

  3. I think some of us suffer from “the curse of knowledge”, because it’s hard for me to understand how exactly people failed to account for regression to the mean. Once I had learned of it, it pretty much stuck to me like glue. So now I wonder: is it because people never learned of regression to the mean, or is it because they did learn it but simply didn’t understand it well enough to account for it properly?

    • +1

      All too often, when we understand something, it seems almost obvious, which makes it harder to explain to those who don’t (yet, I hope) understand it.
      Or, in your metaphor, we need to figure out how to make it “stick to them like glue.”

  4. This really is one of the most misunderstood concepts among the general public.

    Let’s say a rookie NBA player shoots 40% from 3 in summer league or his first year in the league. People will matter-of-factly state that he will shoot worse next season because of regression to the mean. When an individual regresses to the mean, they are regressing to THEIR mean, not the population mean. Sure, if I had to bet on it, I would say the odds are this person will shoot worse, since THEIR true mean is likely less than 40%, but THEIR true mean could be 45% or even higher.

    • You could say “Their” mean regresses back to the population mean.

      In the presence of extreme data points, hierarchical models weigh whether it was more likely that the individual’s mean was high compared with the distribution of means OR that the individual’s particular data point was high conditional on their mean (and usually it’s a combination of both).
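      As a rough sketch of that weighing (the league-wide prior and the attempt counts below are assumptions, not real NBA numbers), a beta-binomial partial-pooling estimate pulls the rookie’s observed 40% partway back toward the league mean, with the amount of pooling depending on how many attempts we have seen:

```python
import numpy as np

# Hypothetical numbers for illustration: league-wide 3-point percentage treated
# as roughly Beta-distributed. These prior parameters are assumptions, not real data.
prior_alpha, prior_beta = 125.0, 225.0   # implies a prior mean of about 35.7%

made, attempts = 40, 100                 # rookie shoots 40% on 100 attempts

# Beta-binomial posterior: the estimate of THEIR true percentage is pulled
# partway from the observed 40% toward the league mean; more attempts mean
# less pooling toward the prior.
post_alpha = prior_alpha + made
post_beta = prior_beta + (attempts - made)
post_mean = post_alpha / (post_alpha + post_beta)

print(f"Observed: {made / attempts:.1%}, partially pooled estimate: {post_mean:.1%}")
# With 1000 attempts at the same 40%, the pooled estimate would sit much
# closer to 40%: the data eventually overwhelm the prior.
```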

    • Your point actually confuses me a bit – and I think is part of the reason why this principle is so readily misunderstood. Is this rookie NBA player an unusual observation – where we would expect regression to the mean to entail the likelihood that they will shoot less than 40% next year – or are they an unusual player who normally shoots 45% and will more likely show a higher success rate next year than this year? In other words, it isn’t obvious for this person whether we should expect their shooting percentage to fall or rise – we need to make an assumption about whether they are an average player or not. In terms of the principle, what mean will they regress towards?

      • “what mean will they regress towards?”

        That’s my point. Whenever people see an observation above the mean, they assume it will then regress towards the mean. But there’s no guarantee that that person’s mean is lower than the initial observation, their mean could actually be higher.

        • Yes. That is why I think the attitude regarding why so many people misunderstand basic concepts needs to change. We lament how people don’t understand when the problem is that the subjects are not really straightforward. A similar concept is that of ecological correlation. Many analyses deal with relationships with aggregates, but think of them as relationships between individual observations. I find it no easy task to figure out when this is a problem and when it is not. My statistics courses (taken too long ago to want to mention) were silent on these important phenomena, but I did prove the Central Limit Theorem in at least 3 classes as an undergraduate (did I mention that I also stayed in a Holiday Inn Express?). I think these “paradoxes” are actually very important things and fundamental to a good statistics education. We too often ignore them and then lament that the public, and even researchers, fail to understand them.

        • ” We lament how people don’t understand when the problem is that the subjects are not really straightforward. … I think these “paradoxes” are actually very important things and fundamental to a good statistics education. We too often ignore them and then lament that the public, and even researchers, fail to understand them.”

          +1

          One approach I often take in teaching statistical inference is starting with “What we want,” and ending with “What we get,” which is not really what we want, but just a means of trying to get close to what we want. It’s important not to confuse “what we want” with “what we get”, but the language often used does not make this distinction.

      • Regression to the mean is not really about what will in fact happen, it’s about what our inferences about those events should be.

        We should expect outlier observations to be followed by more typical observations.

        In the case of the rookie, we really don’t know what this person’s mean value is, but we should infer that it’s probably closer to average for rookies than wildly different.

        Regression to the mean isn’t like gravitation, it’s not a physical force; it’s a principle of extended probabilistic logic.

        • If Mr X is rookie of the year, and we are considering his prospects precisely because he did so well, we should expect substantial regression next year.

          But what about Mr X himself? Whether his rookie scores were good, indifferent, or bad, we can expect him to try to guess his career prospects based on his first season. And under most plausible models, I believe, he (and any other player) would also be better off in expectation shrinking his prediction towards the group mean. But should he shrink as much? – after all, from his perspective, he was going to ask this question anyway and there is no hard selection-for-extremes effect.

          So if I’m right, which I’m not confident about, there is more than one right answer to ‘What should you expect Mr X’s performance next season to be?’ (Or else, from another perspective: this isn’t a well-enough defined question.) There’s nothing wrong with this, but it’s fairly subtle/confusing.

          I also question whether the selection-for-extremes effect is really a necessary component of regression to the mean. You generally get better estimates by shrinking any/all observations to the mean; is this not also ‘regression to the mean’?
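          On the ‘shrink any/all observations’ point, here is a quick sketch with entirely made-up numbers: shrinking every player’s observed percentage toward the grand mean, not just the selected extremes, reduces mean squared error against the (simulated) true values. The shrinkage weight below is a rough plug-in estimate, not an optimal formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: many players with true shooting percentages drawn
# around 35%, each observed with binomial noise on 100 attempts.
n_players, attempts = 500, 100
true_p = rng.normal(0.35, 0.04, n_players).clip(0.2, 0.5)
observed = rng.binomial(attempts, true_p) / attempts

# Shrink every observation (not just the extremes) toward the grand mean.
# The weight is a rough plug-in estimate of signal variance over total variance.
noise_var = observed.mean() * (1 - observed.mean()) / attempts
signal_var = max(observed.var() - noise_var, 1e-6)
w = signal_var / (signal_var + noise_var)
shrunk = observed.mean() + w * (observed - observed.mean())

print("MSE, raw observations:", np.mean((observed - true_p) ** 2))
print("MSE, shrunk estimates:", np.mean((shrunk - true_p) ** 2))
```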

        • Michael Jordan averaged 17.7 points per game at UNC. https://www.sports-reference.com/cbb/players/michael-jordan-1.html

          1st NBA season PPG = 28.2
          2nd NBA season PPG = 22.7
          3rd NBA season PPG = 37.1
          4th NBA season PPG = 35.0

          Career NBA PPG = 30.1

          Which of these season PPG reflect a regression toward the mean? Does any season that is lower than the previous or the average of the previous seasons reflect regression to the mean?

          A purely statistical effect of moving towards the mean makes sense to me if we are talking about randomly selecting another data point after having selected one that was in a low probability space.

          Otherwise it seems we are talking about context, causality, and random variation but calling it ‘regression to the mean’. Take Michael Jordan as an example, perhaps an exception that both proves and disproves the rule of the rookie phenom regressing to the mean. In his highest PPG seasons (1986-87 & 1987-88) he scored 37.1 and 35.0 points per game, respectively. The Bulls finished 8th and 3rd. During each of the 3-peat runs Jordan averaged 31.4 and 29.6, respectively. Was this regression to the mean only to be expected, or was it that Scottie Pippen averaged 19.1 & 20.3 points per game during each of those runs? Or was it that he faced better defense?

          https://www.basketball-reference.com/players/j/jordami01.html

        • Or missing a game or two rather than playing when having a cold, or many other potential factors. The point of regression to the mean and other threats to validity is that we need to temper our conclusions based on the possibility that it could be happening. However, if you sample based on extreme values it sets up a sample in which we’d expect that, overall, it’s more likely that sample members will reflect regression to the mean. Whereas if you don’t sample on extreme values, some people are regressing to their means from the positive side and some from the negative side; we expect half and half if the distribution is symmetrical. We assume, but don’t know, that all that roughly cancels out.

        • I suppose my main complaint is that I don’t find regression to the mean to be a compelling post-hoc argument about a research result. I agree with your comment about sampling and think it would be more fruitful to focus on issues of measurement error, sampling error, and unmeasured causes rather than regression to the mean. It appears to be a confusing concept to many that brings little real understanding of a particular phenomenon even when regression to the mean is well understood.

        • Edit:

          I forgot to mention, before asking these questions, that the average PPG across the NBA in 2015-2016 (the only year I could find quickly) was 9.7: “Which of these season PPG reflect a regression toward the mean? Does any season that is lower than the previous or the average of the previous seasons reflect regression to the mean?”

        • This makes me think of the hot hand analyses that assume that players’ shooting percentages are fixed over time.

        • There’s nothing controlled about this data. For example, the composition of the teams he played on and against changed each year. He was also at an age where the changes in physique relevant for elite athletic performance occur. Etc.

          The old joke was that only Dean Smith had ever been able to hold Jordan under 20 ppg.

        • Dan:

          The lack of a control group is a fair criticism of this study.

          However, for RTM to be a convincing argument you would have to be able to explain why measurement error in a continuous variable such as data from a step sensor would not be reasonably close to normally distributed simply because it was in the lowest tertile.

        • “Regression to the mean” also has to take into account sample sizes. In basketball, unlike baseball, stats tend to stabilize very quickly and are fairly consistent from year to year. A 40% 3pt shooter after 10 games is unlikely to really be a 40% 3pt shooter unless he is Stephen Curry. But if he shoots 40% over an entire season and takes hundreds of 3pt attempts, we can be more confident about it. Very few players score 30 points per game in the NBA, so if you do it one year, it’s most likely not a fluke. That is different than hitting 0.300 in baseball, which from my understanding can be pretty flukey.

  5. “It’s frustrating when people can’t admit their errors. We all make mistakes; what’s important is to learn from them.”

    I think that in many cases (and this may be one) it’s not that they can’t admit their errors, but that they just don’t see how what they did is an error — i.e., they’re clueless (or, worse yet, clueless that they’re clueless). The basic problem is that statistics teaching is (currently) not good enough to consistently convey the subtleties of the field, to get through the common human resistance to uncertainty and complexity. Improvement in the teaching is going to take a lot of work.

    • That’s my impression as well. The statistics course at my university for human biologists took one semester and covered some descriptive statistics (= definitions of mean, median, and mode) and how to interpret R output for a t-test.
      It is a real problem when not only active researchers but also the people in charge of teaching see statistics as a nuisance distracting them from their real work.

    • Doesn’t seem to be the case here. They say “sure, maybe it could be regression to the mean, but look at how many (crappy) lines of evidence we have”:

      In the original article, our study team noted multiple reasons to support, or justify, the intervention’s effectiveness in addition to the step increase for those with the lowest baseline step count. This includes…
      It is important not to discount the full body of evidence

      I find these “numbers of lines of evidence” arguments really irritating since:

      1) they are so easy to cherry pick
      2) one inconsistent line of evidence is more important than any number of consistent lines of evidence
      3) The theory is so vague it only predicts up/down directions and you can usually find some excuse to allow the opposite result anyway

      I advocate for “synthesizing” lines of evidence into a mathematical/computational model and then comparing the output of that model to data. I don’t think that is anything new; it’s just how science used to be done.

  6. This tendency to double down and refuse to admit your mistake seems to be context dependent. In publications, it is exceedingly rare to see an author accept criticism offered in “letters to the editor.” And when it does happen, it is usually grudging and hedged with desperate attempts to show that the main conclusions hold nevertheless.

    But when these same people get the “pink sheets” on grants that the reviewers have deemed too flawed to fund, the resubmissions almost invariably incorporate the changes requested by reviewers. As they say, money talks, nobody walks.

    • Clyde,

      Your assumption seems to be that nearly all or all ‘letters to the editor’ are accurate in their own right. Can you give us an example?

      This dynamic that you lay out may be resolved better with the Open Science Framework. Preregistration and preprints, etc.

  7. I was trying to think about a way to teach regression to the mean in a way that would be accessible to intro statistics or research methods students.

    Does this make sense?

    Give everyone a six-sided die. Get everyone to agree that the mean is 3.5.
    Now, let’s say that the dice represent first graders and the numbers represent scores on a reading assessment, with 6 being the best.

    Everyone rolls. (Calculate the mean and variance of the results.)
    The people who get 6 are then put into the gifted group and the people who get 1 are put in the remediation group. The rest (2,3,4,5) are in the typical group.

    Now, everyone rolls again and we calculate the new means (and maybe variances) for each of the three groups.

    Discuss.

    You would have to have a decent number of students, though, or give each student multiple dice (which would bring an additional possibility of looking at sample size).

    • What are the theories/hypotheses for that calculation process? And to which domains is it most relevant, less relevant, or irrelevant?

      Nor have I ever understood why dice come into it, even though I have followed Howard Raiffa’s scholarship since the ’90s.

      • The dice represent people and the rolling represents repeated measurement of the same person, in this case a person with a uniform distribution of results on the test with an expected value of 3.5. And everyone in the same population has the same uniform distribution.

        So the first roll represents a single measurement, and that measurement is given meaning in the sense of resulting in an extreme-value-based selection into treatment groups.

        Then when we do a second measurement later, we should see that the gifted program not only is not successful at maintaining gifted status, it actually seems to have a backfire effect, with about 1/6 of students now needing remedial help. However, the remedial program actually appears to move most of its students to scores of 2 or higher, so back to typical levels, and 1/6 of the students even end up scoring as gifted. Successful program.

        This illustrates how regression to the mean is a threat to internal validity that is particularly heightened when selecting groups based on extreme values.
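        For what it’s worth, here is a minimal simulation of the exercise as described above (die faces standing in for test scores); it reproduces both the return to an average near 3.5 and the roughly 1/6 fractions crossing into the opposite group.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the classroom exercise described above: each "student" is a fair
# six-sided die, so every re-roll is a fresh draw from the same distribution.
n_students = 600
first_roll = rng.integers(1, 7, n_students)
second_roll = rng.integers(1, 7, n_students)

gifted = first_roll == 6    # placed in the gifted group after roll 1
remedial = first_roll == 1  # placed in the remediation group after roll 1

print("Gifted group, mean on second roll:   ", second_roll[gifted].mean())
print("Remedial group, mean on second roll: ", second_roll[remedial].mean())
print("Share of 'gifted' now scoring 1:     ", (second_roll[gifted] == 1).mean())
print("Share of 'remedial' now scoring >= 2:", (second_roll[remedial] >= 2).mean())
# Both groups come back to an average near 3.5, and roughly 1/6 of each
# group lands in the other extreme, as described above.
```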

        • I believe I understand your explanation. However, that is one explanation. In following articles or opinion pieces about ‘regression to the mean’, there are differences and nuances in the narratives accompanying the definition, which is why this specific concept will probably require further contextually rigorous examination. As I have repeated, it’s in the informal conversations with statisticians and decision theorists more generally where logical fallacies and cognitive biases become starkly manifest. That suggests to me that we can’t rely on current explanations categorically. It will take some creative genius to sort through these logical and cognitive biases. Just my hunch at this point.

  8. https://www.nejm.org/doi/full/10.1056/NEJMsa1906848 is a very interesting study illustrating how regression to the mean can lead to biased observational studies, especially when a program is targeted at high-risk populations. “Hotspotting” is a frequently proposed health policy intervention in which you assign additional care staff to high-cost patients with the hope of reducing future costs and admissions. This intervention has been studied extensively in observational pre-post designs, with the result that patients appear to achieve significant cost and admission reductions under the program.

    https://www.nejm.org/doi/full/10.1056/NEJMsa1906848 challenges those results by pointing out that most of that cost and admission reduction is probably due to regression to the mean. It runs a proper randomized trial and demonstrates that the observed reduction in cost for patients in the program is almost identical to the observed reduction in cost for people who are not assigned to the program, illustrating that the program itself has no detectable effect. Figure 2 (https://www.nejm.org/na101/home/literatum/publisher/mms/journals/content/nejm/2020/nejm_2020.382.issue-2/nejmsa1906848/20200103/images/img_xlarge/nejmsa1906848_f2.jpeg) shows this quite well, with corresponding curves for both the treated and control populations decreasing over time (regressing to the mean) but with almost no difference between the two.
