The scandal isn’t what’s retracted, the scandal is what’s not retracted.

Andrew Han at Retraction Watch reports on a paper, “Structural stigma and all-cause mortality in sexual minority populations,” published in 2014 by Mark Hatzenbuehler, Anna Bellatorre, Yeonjin Lee, Brian Finch, Peter Muennig, and Kevin Fiscella, that claimed:

Sexual minorities living in communities with high levels of anti-gay prejudice experienced a higher hazard of mortality than those living in low-prejudice communities (Hazard Ratio [HR] = 3.03, 95% Confidence Interval [CI] = 1.50, 6.13), controlling for individual and community-level covariates. This result translates into a shorter life expectancy of approximately 12 years (95% C.I.: 4–20 years) for sexual minorities living in high-prejudice communities.

Hatzenbuehler et al. attributed some of this to “an 18-year difference in average age of completed suicide between sexual minorities in the high-prejudice (age 37.5) and low-prejudice (age 55.7) communities,” but the whole thing still doesn’t seem to add up. Suicide is an unusual cause of death. To increase the instantaneous probability of dying (that is, the hazard rate) by a factor of 3, you need to drastically increase the rate of major diseases, and it’s hard to see how living in communities with high levels of anti-gay prejudice would do that, after controlling for individual and community-level covariates.
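
To get a sense of scale, here’s a back-of-the-envelope calculation of my own (a Gompertz-type mortality curve with made-up but roughly realistic parameters, nothing taken from the paper) showing what a constant hazard ratio of 3 does to life expectancy:

```python
# Back-of-the-envelope sketch: remaining life expectancy past age 30 under a
# Gompertz-type hazard, with and without a constant hazard ratio of 3.
# The parameters a and b are assumptions: roughly 0.1% annual mortality at
# age 30, doubling every 8 years.
import numpy as np

def remaining_life_expectancy(hazard_ratio, a=0.001, b=np.log(2) / 8, horizon=90):
    t = np.linspace(0, horizon, 9001)                  # years past age 30
    dt = t[1] - t[0]
    cumulative_hazard = hazard_ratio * (a / b) * (np.exp(b * t) - 1.0)
    survival = np.exp(-cumulative_hazard)
    return survival.sum() * dt                         # E[T] = integral of S(t)

print(f"baseline:   {remaining_life_expectancy(1.0):.1f} years remaining at 30")
print(f"hazard x 3: {remaining_life_expectancy(3.0):.1f} years remaining at 30")
```

With these assumed numbers the gap comes out to roughly a dozen years, consistent with the paper’s own translation. That’s the point: a tripled all-cause hazard is an enormous effect, in the neighborhood of what’s reported for heavy smoking versus never smoking, and that’s what’s being attributed here to neighborhood attitudes after adjustment.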

Part 1. The original paper was no good.

Certain aspects of the design of the study were reasonable: they used some questions on the General Social Survey (GSS) about attitudes toward sexual minorities, aggregated these by geography to identify areas which, based on these responses, corresponded to positive and negative attitudes, then for each area they computed the mortality rate of a subset of respondents in that area, which they were able to do using “General Social Survey/National Death Index (GSS-NDI) . . . a new, innovative prospective cohort dataset in which participants from 18 waves of the GSS are linked to mortality data by cause of death.” They report that “Of the 914 sexual minorities in our sample, 134 (14.66%) were dead by 2008.” (It’s poor practice to call this 14.66% rather than 15%—it would be kinda like saying that Steph Curry is 6 feet 2.133 inches tall—but this is not important for the paper, it’s only an indirect sign of concern as it indicates a level of innumeracy on the authors’ part to have let this slip into the paper.)

The big problem with this paper—what makes it dead on arrival even before any of the data are analyzed—is that you’d expect most of the deaths to come from heart disease, cancer, etc., and the most important factor predicting death rates will be the age of the respondents in the survey. Any correlation between age of respondent and their anti-gay prejudice predictor will drive the results. Yes, they control for age in their analysis, but throwing in such a predictor won’t solve the problem. The next things to worry about are sex and smoking status, but really the game’s already over. Looking for this effect in aggregate mortality is looking for a needle in a haystack. And all this is exacerbated by the low sample size. It’s hard enough to understand mortality-rate comparisons with aggregate CDC data. How can you expect to learn anything from a sample of 900 people?
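
To put a rough number on the noise: the reported 95% interval of 1.50 to 6.13 implies a standard error of about 0.36 on the log-hazard scale, which means that no estimated hazard ratio between about 0.5 and 2 could have reached statistical significance at all. Here’s a quick retrodesign-style sketch of my own (the assumed true hazard ratio of 1.1 is purely illustrative, not a number from the paper):

```python
# Type M ("exaggeration") sketch in the spirit of Gelman and Carlin's
# retrodesign: assume a modest true effect and the noise level implied by the
# paper's reported confidence interval, then look at what the statistically
# significant estimates look like.
import numpy as np

rng = np.random.default_rng(0)

true_log_hr = np.log(1.1)                            # assumed, for illustration
se = (np.log(6.13) - np.log(1.50)) / (2 * 1.96)      # ~0.36, from the reported CI

estimates = rng.normal(true_log_hr, se, size=100_000)
significant = np.abs(estimates / se) > 1.96

exaggeration = np.abs(estimates[significant]).mean() / true_log_hr
print(f"power: {significant.mean():.2f}")
print(f"average exaggeration of |log HR| among significant estimates: {exaggeration:.1f}x")
```

In that regime, a statistically significant estimate is all but guaranteed to be a huge overestimate, which is exactly what a hazard ratio of 3 looks like to me.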

Once the analysis comes in, the problem becomes even clearer, as the headline result is a 3-fold increase in the hazard rate, which is about as plausible as that claim we discussed a few years ago that women were 3 times more likely to wear red or pink clothing during certain times of the month, or that claim we discussed a few years before that beautiful parents were 36 percent more likely to have girls.

Beyond all that—and in common with these two earlier studies—this new paper has forking paths, most obviously in that the key predictor is not the average anti-gay survey response by area, but rather an indicator for whether the area is in the top quartile.
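
Forking paths doesn’t mean the authors literally ran every variant and picked the best one; the point is that the analysis choices could have been different had the data been different. Still, a quick null simulation of my own (toy data, nothing to do with the actual study) gives a sense of how much room a few innocuous-looking coding choices create:

```python
# Toy forking-paths sketch: pure noise, plus the freedom to code the predictor
# as continuous or as a median / tertile / quartile split, and the chance of
# finding at least one nominally significant result rises above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims = 900, 2_000
hits = 0

for _ in range(n_sims):
    stigma = rng.normal(size=n)        # continuous "stigma" score: pure noise
    outcome = rng.normal(size=n)       # outcome, generated independently of stigma
    _, p_continuous = stats.pearsonr(stigma, outcome)
    pvals = [p_continuous]
    for q in (0.50, 0.67, 0.75):       # median, tertile, and quartile splits
        high = stigma > np.quantile(stigma, q)
        pvals.append(stats.ttest_ind(outcome[high], outcome[~high]).pvalue)
    hits += min(pvals) < 0.05

print(f"share of null datasets with at least one 'significant' result: {hits / n_sims:.2f}")
```

The exact number doesn’t matter; what matters is that the nominal 5% is a floor, and every additional data-dependent choice (imputation rules, covariate sets, cutoffs) pushes it higher.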

To quickly summarize:
1. This study never had a chance of working.
2. Once you look at the estimate, it doesn’t make sense.
3. The data processing and analysis had forking paths.

Put this all together and you get: (a) there’s no reason to believe the claims in this paper, and (b) you can see how the researchers obtained statistical significance, which fooled them and the journal editors into thinking that their analysis told them something generalizable to the real world.

Part 2. An error in the data processing.

So far, so boring. A scientific journal publishes a paper that makes bold, unreasonable claims based on a statistical analysis of data that no one could realistically have hoped would support those claims. Hardly worth noting in Retraction Watch.

No, what made the story noteworthy was the next chapter, when Mark Regnerus published a paper, “Is structural stigma’s effect on the mortality of sexual minorities robust? A failure to replicate the results of a published study,” reporting:

Hatzenbuehler et al.’s (2014) study of structural stigma’s effect on mortality revealed an average of 12 years’ shorter life expectancy for sexual minorities who resided in communities thought to exhibit high levels of anti-gay prejudice . . . Attempts to replicate the study, in order to explore alternative hypotheses, repeatedly failed to generate the original study’s key finding on structural stigma.

Regnerus continues:

The effort to replicate the original study was successful in everything except the creation of the PSU-level structural stigma variable.

Regnerus attributes the problem to the authors’ missing-data imputation procedure, which was not clearly described in the original paper. It’s also a big source of forking paths:

Minimally, the findings of Hatzenbuehler et al. (2014) study of the effects of structural stigma seem to be very sensitive to subjective decisions about the imputation of missing data, decisions to which readers are not privy. Moreover, the structural stigma variable itself seems questionable, involving quite different types of measures, the loss of information (in repeated dichotomizing) and an arbitrary cut-off at a top-quartile level. Hence the original study’s claims that such stigma stably accounts for 12 years of diminished life span among sexual minorities seems unfounded, since it is entirely mitigated in multiple attempts to replicate the imputed stigma variable.

Regnerus also points out the sensitivity of any conclusions to confounding in the observational study. Above, I mention age, sex, and smoking status; Regnerus mentions ethnicity as a possible confounding variable. Again, controlling for such variables in a regression model is a start, but only a start; it is a mistake to think that if you throw a possible confounder into a regression, you’ve resolved the problem.
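
Here’s a little simulation of my own (made-up numbers, and a simple linear probability model rather than anything from the paper) showing what can go wrong. The exposure has no effect at all on mortality, mortality risk rises exponentially with age, and the exposure is concentrated among older people:

```python
# The exposure has NO effect on mortality, but it is correlated with age and
# mortality risk rises exponentially with age.  "Controlling for" age with a
# single linear term leaves enough residual age-related bias to make the null
# exposure look important.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000

age = rng.uniform(20, 85, size=n)
# Exposure deliberately concentrated among older respondents, to make the point visible.
exposure = (age + rng.normal(scale=5, size=n) > 65).astype(int)

# True mortality risk: exponential in age, zero effect of the exposure.
p_death = 1 - np.exp(-0.0005 * np.exp(0.09 * (age - 20)))
died = rng.binomial(1, p_death)

# Analyst's model: mortality on exposure, "controlling for" age linearly.
X = sm.add_constant(np.column_stack([exposure, age]))
fit = sm.OLS(died, X).fit()
print(f"estimated exposure 'effect': {fit.params[1]:.3f} (se {fit.bse[1]:.3f}); true effect is 0")
print(f"overall death rate in the simulation: {died.mean():.3f}")
```

The null exposure picks up a substantial spurious “effect,” many standard errors from zero. A more flexible age adjustment (splines, age categories) would shrink this particular bias, but you have to know to do it, and age is only one of the variables in play.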

A couple months later, a correction note by Hatzenbuehler et al. appeared in the journal:

Following the publication of Regnerus’s (2017) paper, we hired an independent research group . . . A coding error was discovered. Specifically, the data analyst mis-specified the time variable for the survival models, which incorrectly addressed the censoring for individuals who died. The time variable did not correctly adjust for the time since the interview to death due to a calculation error, which led to improper censoring of the exposure period. Once the error was corrected, there was no longer a significant association between structural stigma and mortality risk among the sample of 914 sexual minorities.

I can’t get too upset about this particular mistake, given that I did something like that myself a few years back!
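
For readers who haven’t fit these models, the error described in the correction lives in the step where each person’s follow-up time is constructed. Here’s a minimal sketch of the standard setup (my own toy data and column names, using the lifelines package; this is not the authors’ code): each respondent is at risk from the interview until death, or until the end of the mortality follow-up if still alive.

```python
# Minimal sketch of constructing follow-up time for a Cox model, with made-up
# data.  Each person contributes time from interview to death, or to the end
# of mortality follow-up (censored) if still alive.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 500
FOLLOWUP_END = 2008                     # end of the mortality linkage, as in the paper

interview_year = rng.integers(1988, 2003, size=n)
age = rng.integers(25, 80, size=n)
high_stigma = rng.integers(0, 2, size=n)
# Fake death times: older people tend to die sooner; NaN means alive in 2008.
death_year = interview_year + rng.exponential(scale=60 - 0.5 * age, size=n)
death_year = np.where(death_year <= FOLLOWUP_END, death_year, np.nan)

df = pd.DataFrame({"interview_year": interview_year, "age": age,
                   "high_stigma": high_stigma, "death_year": death_year})

# The key construction: time at risk runs from the interview to death,
# or to the end of follow-up for those still alive.
df["died"] = df["death_year"].notna().astype(int)
df["years_at_risk"] = df["death_year"].fillna(FOLLOWUP_END) - df["interview_year"]

cph = CoxPHFitter()
cph.fit(df[["years_at_risk", "died", "age", "high_stigma"]],
        duration_col="years_at_risk", event_col="died")
cph.print_summary()
```

Mess up years_at_risk, for example by not anchoring it at the interview date, and the censoring is wrong for everyone in the sample, which is roughly the kind of slip the correction describes.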

Part 3. Multiple errors in the same paper.

But here’s something weird: The error reported by Hatzenbuehler et al. in their correction is completely unrelated to the errors reported by Regnerus. The paper had two completely different data problems, one involving imputation and the construction of the structural stigma variable, and another involving the measurement of survival time. And both of these errors are distinct from the confounding problem and the noise problem.

How did this all happen? It’s surprisingly common to see multiple errors in a single published paper. It’s rare for authors to reach the frequency of a Brian Wansink (150 errors in four published papers, and that was just the start of it) or a Richard Tol (approaching the Platonic ideal of more errors in a paper than there are data points), but multiple unrelated errors—that happens all the time.

How does this happen? Here are a few reasons:

1. Satisficing. The goal of the analysis is to obtain statistical significance. Once that happens, there’s not much reason to check.

2. Over-reliance on the peer-review process. This happens in two ways. First, you might be less careful about checking your manuscript, knowing that three reviewers will be going through it anyway. Second—and this is the real problem—you might assume that reviewers will catch all the errors, and then when the paper is accepted by the journal you’d think that no errors remain.

3. A focus on results. The story of the paper is that prejudice is bad for your health. (See for example this Reuters story by Andrew Seaman, linked to from the Retraction Watch article.) If you buy the conclusion, the details recede.

There’s no reason to think that errors in scientific papers are rare, and there’s no reason to be surprised that a paper with one documented error will have others.

Let’s give credit where due. It was a mistake of the journal Social Science and Medicine to publish that original paper, but, on the plus side, they published Regnerus’s criticism. Much better to be open with criticism than to play defense.

I’m not so happy, though, with Hatzenbuehler et al.’s correction notice. They do cite Regnerus, but they don’t mention that the error he noticed is completely different from the error they report. Their correction notice gives the misleading impression that their paper had just this one error. That ain’t cool.

Part 4. The reaction.

We’ve already discussed the response of Hatzenbuehler et al., which is not ideal but I’ve seen worse. At least they admitted one of their mistakes.

In the Retraction Watch article, Han linked to some media coverage, including this news article by Maggie Gallagher in the National Review with the headline, “A widely reported study on longevity of homosexuals appears to have been faked.” I suppose that’s possible, but my guess is that the errors came from sloppiness, plus not bothering to check the results.

Han also reports:

Nathaniel Frank, a lawyer who runs the What We Know project, a catalog of research related to LGBT issues housed at Columbia Law School, told Retraction Watch:

Mark Regnerus destroyed his scholarly credibility when, as revealed in federal court proceedings, he allowed his ideological beliefs to drive his conclusions and sought to obscure the truth about how he conducted his own work. There’s an enormous body of research showing the harms of minority stress, and Regnerus is simply not a trustworthy critic of this research.

Hey. Don’t shoot the messenger. Hatzenbuehler et al. published a paper that had multiple errors, used flawed statistical techniques, and came up with an implausible conclusion. Frank clearly thinks this area of research is important. If so, he should be thanking Regnerus for pointing out problems with this fatally flawed paper and for motivating Hatzenbuehler et al. to perform the reanalysis that revealed another mistake. Instead of thanking Regnerus, Frank attacks him. What’s that all about? You can dislike a guy but that shouldn’t make you dismiss his scientific contribution. Yes, maybe sometimes it takes someone predisposed to be skeptical of a study to go to the trouble to find its problems. We should respect that.

Part 5. The big picture.

As the saying goes, the scandal isn’t what’s illegal, the scandal is what’s legal. Michael Kinsley said this in the context of political corruption. Here I’m not talking about corruption or even dishonesty, just scientific error—which includes problems in data coding, incorrect statistical analysis, and, more generally, grandiose claims that are not supported by the data.

This particular paper has been corrected, but lots and lots of other papers have not, simply because they have no obvious single mistake or “smoking gun.”

The scandal isn’t what’s retracted, the scandal is what’s not retracted. All sorts of papers whose claims aren’t supported by their data.

The Retraction Watch article reproduces the cover of the journal Social Science and Medicine:

Hmmm . . . what’s that number in the circle on the bottom right? “2.742,” it appears? I couldn’t manage to blow up this particular image but I found something very similar in high resolution on Wikipedia:

“Most Cited Social Science Journal” “Impact Factor 2.814.”

I have no particular problem with the journal, Social Science and Medicine. They just happened to have the good fortune that, after publishing a paper with major problems, they ended up with an occasion to correct it. Standard practice is, unfortunately, for errors not to be noticed in the published record, or, when they are, for the authors of the paper to reply with denials that there was ever a problem.

25 thoughts on “The scandal isn’t what’s retracted, the scandal is what’s not retracted.”

  1. That Hatzenbuehler et al. originally mis-specified the time variable for the survival models seems like a big deal, given that, keeping everything else the same, their results go away. This leads me to wonder why Regnerus focuses on imputation. It seems he is able to replicate the structural stigma variable, just not the proportion of those living in high-stigma areas.

    What happened here is that someone tried to replicate the results of a paper. That person wasn’t able to. The replicator tried a few things out (namely, testing various imputation procedures) to see why the results didn’t replicate. The replicator doesn’t really find anything that he can point to as to why the results didn’t replicate. The original authors went back to the data, realized one of their grad students made a whoopsie, and the results vanished. They definitely should have addressed the imputation issue, but I don’t think you can actually say their imputation process was riddled with errors.

  2. “it is a mistake to think that if you throw a possible confounder into a regression, that you’ve resolved the problem”

    I understand that this is rather a side note in your post but can you elaborate a bit on why controlling for a confounder in a model does not resolve the problem? That is, if I put all covariates into a statistical model and compare it to a model with all covariates + target predictor, I thought I was able to test whether the additional target predictor can account for additional variance in the criterion?

    • Here are a variety of reasons why throwing a variable into your regression doesn’t work:

      1) The model is misspecified: the variable acts nonlinearly, in combination with other variables that are missing, or the like

      2) Samples are biased: the variable is not represented in the sample in a way similar to the actual population of interest. For example different groups have different non-response rates.

      3) Measurement issues, such as correlation of the variable with measurement error. Ask people if they are gay in regions where gay people are regularly abused, for example, and you might get people unwilling to admit it. The same holds for other variables, from income to how much rice you consume to whether you’ve ever been involved in police violence: people don’t necessarily answer honestly, and that measurement error can be correlated with lots of stuff.

      4) The phase of the moon…. which is sometimes used as a phrase meaning “anything and everything could matter,” but some groups use lunar calendars, and so it could even *literally* matter in social science.

      • I understand your list of reasons, but they apply to _any_ regression model, no?
        Interpretation of the adjusted regression coefficients requires the assumption that the model is correct, which, as you point out, might not be reasonable.
        That doesn’t seem to be Andrew’s point, though I’m not really sure what Andrew is getting at with the quote in Marc’s comment….

        • Andrew, thanks for replying. Your point about the marsupial problem is well taken, but I’m missing it in Marc’s comment.

          You write:
          “….the most important factor predicting death rates will be the age of the respondents in the survey. Any correlation between age of respondent and their anti-gay prejudice predictor will drive the results. Yes, they control for age in their analysis, but throwing in such a predictor won’t solve the problem.”

          I think I misunderstand what “the problem” is in this quote. Adjusting for age, with Daniel’s caveats in mind, is a standard method of estimating the effects of the predictor of interest, no?

        • Garnett:

          The problem, among other things, is that a linear adjustment will be too crude. Any bias in the age adjustment will swamp any effect of the sort they are looking to learn about.

        • Andrew, I had the same thought as Garnett when reading this post. To me it still seems like a standard case of misspecification of age. The “true” model is something like:
          Mortality = f(Age) + anti-gay prejudice*beta + epsilon.

          What they are doing is more like:
          Mortality = Age*gamma + anti-gay prejudice*beta + epsilon.
          Which is why (like you said) they are doomed to obtain inconsistent estimates of beta as long as age and anti-gay prejudice are correlated (which they are).

          So far I think we are on the same page. I must confess, though, that I do not see what the effect size has to do with this. Even if beta were very “large”, we’d still not have a hope of getting consistent estimates of beta unless we specify age correctly. Right?

        • Nahim:

          Here’s the issue. Using the wrong functional form for f(age) will bias the results. The size of the bias will be roughly of the same order of magnitude as the full effect of age, but a bit smaller (consider the difference between a curve and the best-fit straight line). If the effect of age is HUGE compared to any underlying effect of anti-gay prejudice, then the bias in the age-effect estimation will still be large compared to the underlying effect of anti-gay prejudice. And it’s not just age. Age is just one of the variables we have to worry about.

        • Another factor that I believe tends to promote the kind of thing we’re talking about here is use of language in ways that obscure that the devil is in the details. This can be illustrated in this particular case by the following quote from Marc’s original post:

          “controlling for a confounder in a model does not resolve the problem? That is, if I put all covariates into a statistical model and compare it to a model with all covariates + target predictor, I thought I was able to test whether the additional target predictor can account for additional variance in the criterion?”

          A big part of the problem here is using the word “control” in a technical meaning that is only vaguely related to the way the word is used in everyday situations. My experience is that the use of “control” here leads people to believe (innocently) that the procedure in question does something stronger than it really does. I think it would be more helpful (communicate more clearly) if the process were called “attempt to adjust for” or “attempt to take into account” rather than “control for”.

          The usage of “control for” in statistics is de facto one instance of a more general phenomenon often mentioned in this blog: “laundering out uncertainty”. There seems to be a common human tendency to want more certainty than is realistic, and the human brain often latches onto any opportunity to see or hear more certainty than is really there.

        • Good point.
          I guess “attempt to adjust for” can be further expanded to “attempt to adjust for using an unrealistic linearity assumption.”

        • Agree, it always annoyed me when a journalist would raise the issue of confounding with x and then get put in their place by an academic with the phrase – “No, we controlled for that in the analysis”.

          Now I am more often annoyed when a journalist outlines the importance of a finding in a new study and the academic quickly adds “but it needs to be replicated,” and then the journalist continues outlining the importance without even much of a pause – as if to double down on claiming to be reporting on something of real importance to their listeners.

  3. ‘It’s poor practice to call this 14.66% rather than 15%—… it’s only an indirect sign of concern as it indicates a level of innumeracy on the authors’ part to have let this slip into the paper.’
    FFS not again.
    1) Lots of journals expect two decimal places. Nothing wrong with not choosing to die on this hill and giving them what they want.
    2) Reporting 15% might leave readers wondering about a couple of things:
    2a) What’s up with that .1 death in 914 * 0.15 = 137.1 deaths.
    2b) Why the 137 deaths thus implied don’t line up with number of deaths reported elsewhere in the paper (i.e. other numbers add to 134)
    2c) Whether maybe the authors are glossing over some missing data and the sample size is closer to 894 than to 914, so the 15% of 894 works out to 134 deaths.
    3) The number is not important for the paper, so carelessness – if the cause – would still not be a good indicator of innumeracy. If anything, you could make that argument based on HR = 3.03, which really should be 3 and is an important number for the paper, but I suspect you couldn’t maintain a straight face while doing so. (Because clearly the reason behind it is #1 above, and they provide the CI, so due diligence is done.)
    4) No concern here as the raw numbers are reported, but in many cases it is indeed a good idea to include the two decimal places for future meta-analyses and overviews. Not because one believes in those two decimal places but because rounding removes information, and it’s not a good strategy to report less information if the sample size is smaller.

    • Markus:

      You can curse all you want; I’m still not convinced.

      1. I’ve published in over 100 journals and have never had a journal insist that percentages be reported to 2 decimal places (which is equivalent to reporting a proportion to 4 decimal places, e.g., 0.1466). I have no reason to think that the authors of the above-discussed paper would’ve had to “die on a hill” to report the number as 15%.

      2. What do you think a reader would’ve thought if they’d seen 15%? They’d think, “914 * 0.15 = 137.1 . . . duh, I guess that 15% was rounded to the nearest percentage point, as it was not reported as 15.0%.”

      3. The point about carelessness is that they’d write 14.66% in the first place. It would be as if I carelessly displayed illiteracy: to write 14.66% out of carelessness still requires that you wrote 14.66% to begin with. It’s not like it was a typo.

      4. You write, “it is indeed a good idea to include the two decimal places for future meta-analyses and overviews.” I cannot imagine it could ever be a good idea for the 15% in that paper to be reported as 14.66%. The only time I would think it would be a good idea to report a percentage to 2 decimal places (which, again, is reporting a proportion to 4 decimal places) would be either if the proportion is very close to 0 or 1 (for example, the incidence of a rare disease) or if it’s something that is incredibly stable, so that it could be 0.1466 in one place and 0.1469 in another, and you’d care about the difference. I’ve never seen such an example, but, sure, I guess it could happen. In that case, I’d recommend that the author explain that this is a concern, and then we can see if there is really enough information available to estimate that proportion to 4 decimal places, which I doubt.

      tl;dr: Lots of smart people do dumb things because of habit or custom or because they’ve never thought too much about it. It’s fine to give people the benefit of the doubt, but reporting 14.66% is not such a good idea. And, as discussed in the above post, the paper had multiple errors, so in this case I don’t think it makes so much sense to give the authors the benefit of the doubt on this one too.
