
“Not sure what the lesson for data analysis quality control is here, but interesting to wonder about how that mistake was not caught pre-publication.”

The Journal of the American Medical Association published a correction notice with perhaps the most boring title ever written:

Incorrect Data Due to Incorrect Conversion Factor

In the Original Investigation entitled “Effect of Intravenous Acetaminophen vs Placebo Combined With Propofol or Dexmedetomidine on Postoperative Delirium Among Older Patients Following Cardiac Surgery: The DEXACET Randomized Clinical Trial,” published in the February 19, 2019, issue of JAMA, an incorrect conversion factor of 2.4 was used to convert fentanyl to morphine equivalents. The correct conversion factor is 100. Data related to postoperative morphine equivalents administered, as reported in the Abstract, Results section of the main article text, Tables 2 and 3, and eFigures 1 and 2 in Supplement 2, were recalculated using the correct conversion factor. This article was corrected online.

Maybe next time JAMA runs a correction, they can just pick something from this list of unused titles. “Here Lies Yesterday,” perhaps?

I learned about the above correction notice from David Allison, who wrote:

Not sure what the lesson for data analysis quality control is here, but interesting to wonder about how that mistake was not caught pre-publication.

Hmmm . . . I don’t think it’s surprising at all that this mistake was not caught! I say this for three reasons:

1. It’s a Reinhart-Rogoff Excel error. If the data and code are not public, then there’s no clean workflow, and errors can be introduced just about anywhere in the process.

2. Correctness of the published claims is the responsibility of the author, not the journal. If the author didn’t care enough to check, an error can get through. Also, everyone makes mistakes. (Just go here and search for *correction notice*.)

3. There are a few million scientific papers published every year, thus even more submissions. If each submission gets three reviews, and each reviewer checks every detail of the paper . . . do the math. The result is that nobody has time to do any science! That’s why I prefer post-publication review.

Lots of errors do get caught in the review process, and that’s fine—but no surprise that lots more errors never get noticed until later, if at all.
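Regarding that conversion-factor mixup: one cheap workflow habit that would catch it is to put the conversion in a single named, tested function instead of burying a magic number in a spreadsheet or script. Here is a minimal sketch in Python (my own illustration, nothing to do with how the DEXACET team actually computed their numbers), using the factor of 100 from the correction notice:

```python
# Minimal sketch (not the DEXACET authors' code): convert IV fentanyl doses to
# morphine milligram equivalents with a named constant and a sanity check.

# Per the JAMA correction notice, the conversion factor should be 100, not 2.4.
FENTANYL_TO_MORPHINE_EQ = 100.0  # morphine-equivalent mg per mg of fentanyl

def fentanyl_mg_to_morphine_eq(dose_mg: float) -> float:
    """Convert a fentanyl dose in mg to morphine milligram equivalents."""
    if dose_mg < 0:
        raise ValueError("dose must be nonnegative")
    return dose_mg * FENTANYL_TO_MORPHINE_EQ

# Cheap regression test: 0.1 mg (100 micrograms) of IV fentanyl should come out
# to about 10 morphine-equivalent mg, a number a clinician can eyeball.
assert abs(fentanyl_mg_to_morphine_eq(0.1) - 10.0) < 1e-9
```

The point is not the arithmetic, which is trivial, but that a named constant plus one eyeball-able test makes a 2.4-versus-100 mixup visible to anyone who reads or re-runs the code.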

Association for Psychological Science takes a hard stand against criminal justice reform

Here’s the full quote, from an article to be published in one of our favorite academic journals:

The prescriptive values of highly educated groups (such as secularism, but also libertarianism, criminal justice reform, and unrestricted sociosexuality, among others) may work for groups that are highly cognitively sophisticated and self-controlled, but they may be injurious to groups with lower self-control and cognitive ability. Highly educated societies with global esteem have more influence over global trends, and so the prescriptive values promulgated by these groups are likely to influence others who may not share their other cognitive characteristics. Perhaps then highly educated and intelligent groups should be humble about promoting the unique and relatively novel values that thrive among them and perhaps should be cautious about mocking certain cultural narratives and norms that are perceived as having little value in their own society.

I have a horrible feeling that I’m doing something wrong, in that by writing this post I’m mocking certain cultural narratives and norms that are perceived as having little value in my own society.

But, hey, in my own subculture, comedy is considered to be a valid approach to advancing our understanding. So I’ll continue.

The Association for Psychological Science (a “highly educated and intelligent group,” for sure) has decided that they should be humble about promoting the unique and relatively novel values that thrive among them. And these unique and relatively novel values include . . . “criminal justice reform and unrestricted sociosexuality.” If members of the Association for Psychological Science want criminal justice reform and unrestricted sociosexuality for themselves, that’s fine. Members of the APS should be able to steal a loaf of bread without fear of their hands getting cut off, and they should be able to fool around without fear of any other body parts getting cut off . . . but for groups with lower self-control and cognitive ability—I don’t know which groups these might be, you’ll just have to imagine—anyway, for those lesser breeds, no criminal justice reform or unrestricted sociosexuality for you. Gotta bring down the hammer—it’s for your own good!

You might not agree with these positions, but in that case you’re arguing against science. What next: are you going to disbelieve in ESP, air rage, himmicanes, ovulation and voting, ego depletion, and the amazing consequences of having an age ending in 9?

OK, here’s the background. I got this email from Keith Donohue:

I recently came across the paper, “Declines in Religiosity Predicted Increases in Violent Crime—But Not Among Countries with Relatively High Average IQ”, by Clark and colleagues, which is available on research gate and is in press at Psychological Science. Some of the authors have also written about this work in the Boston Globe.

I [Donohue] will detail some of the issues that I think are important, below, but first a couple of disclaimers. First, two of the authors, Roy Baumeister and Bo Winegard, are (or were) affiliated with Florida State University (FSU). I got my PhD from FSU, but I don’t have any connection with these authors. Second, the research in this paper uses estimates of intelligence quotients (IQs) for different nations. Psychology has a long and ignoble history of using dubious measures of intellectual ability to make general claims about differences in average intelligence between groups of people – racial/ethnic groups, immigrant groups, national groups, etc. Often, these claims have aligned with prevailing prejudices or supported frankly racist social policies. This history disturbs me, and seeing echoes of it in a flagship journal for my field disturbs me more. I guess what I am trying to say is that my concerns about this paper go beyond its methodological issues, and I need to be honest about that.

With that in mind, I have tried to organize and highlight the methodological issues that might interest you or might be worth commenting on.

1.) Data source for estimates of national IQ and how to impute missing values in large datasets

a. The authors used a national IQ dataset (NIQ: Becker, 2019) that is available online, as well as two other datasets (LV12GeoIQ, NIQ_QNWSAS – I can only guess at what these initialisms stand for). These datasets seem to be based on work by Richard Lynn, Tatu Vanhanen, and (more recently) David Becker for their books IQ and the Wealth of Nations, Intelligence: A Unifying Construct for the Social Sciences, and Intelligence of Nations. The data seem to be collections of IQ estimates made from proxies of intelligence, such as school achievement or scores on other tests of mental abilities. Based on my training and experience with intelligence testing, this research decision seems problematic. Also problematic is the decision to impute missing values for some nations based on data from neighboring nations. Lynn and colleagues’ work has received a lot of criticism from within academia, as well as from various online sources. I find this criticism compelling, but I am curious about your thoughts on how researchers ought to impute missing values for large datasets. Let’s suppose that a researcher was working on a (somewhat) less controversial topic, like average endorsement of a candidate or agreement with a policy, across different voting areas. If they had incomplete data, what are some concerns that they ought to have when trying to guess at missing values?

2.) Testing hypotheses about interactions, when the main effect for one of the variables isn’t in the model

a. In Study 1, the authors tested their hypothesis that the negative relationship between religiosity and violent crime (over time) is moderated by national IQ, by using fixed effects, within-country linear regression. There are some elements to these analyses that I don’t really understand, such as the treatment of IQ (within each country) as a time-stable predictor variable and religiosity as a time-varying predictor variable (both should change, over time, right?) or even the total number of models used. However, I am mostly curious about your thoughts on testing an interaction using a model that does not include the main effects for each of the predictor variables that are in the interaction effect (which appears to be what the authors are doing). I don’t have my copy of Cohen, Cohen, West, & Aiken in front of me, but I remember learning to test models with interactions in a hierarchical fashion, by first entering main effects for the predictor variables, and then looking for changes in model fit when the interaction effect was added. I can appreciate that the inclusion of time in these analyses (some of the variables – but not IQ? – are supposed to change) makes these analyses more complicated, but I wonder if this is an example of incongruity between research hypotheses and statistical analyses.

b. Also, I wonder if this is an example of over-interpreting the meaning of moderation, in statistical analyses. I think that this happens quite a lot in psychology – researchers probe a dataset for interactions between predictor variables, and when they find them, they make claims about underlying mechanisms that might explain relationships between those predictors or the outcome variable(s). At the risk of caricaturing this practice, it seems a bit like IF [statistically significant interaction is detected] THEN [claims about latent mechanisms or structures are allowed].

3.) The use of multiverse analyses to rule-out alternative explanations

a. In Study 2, the authors use multiverse analysis to further examine the relationship between religiosity, national IQ, and violent crime. I have read the paper that you coauthored on this technique (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) and I followed some of the science blogging around Orben and Przybylski’s (2019) use of it in their study on adolescent well-being and digital technology use. If I understand it correctly, the purpose of multiverse analysis is to organize a very large number of analyses that could potentially be done with a group of variables, so as to better understand whether or not the researchers’ hypotheses (e.g., fertility affects political attitudes, digital technology use affects well-being) are generally supported – or, to be more Popperian, if they are generally refuted. In writing up the results of a multiverse analysis, it seems like the goal is to detail how different analytic choices (e.g., the inclusion/exclusion of variables, how these variables are operationalized, etc.) influence these results. With that in mind, I wonder if this is a good (or bad?) example of how to conduct and present a multiverse analysis. In reading it, I don’t get much of a sense of the different analytic decisions that the authors considered, and their presentation of their results seems a little hand-wavey – “we conducted a bunch of analyses, but they all came out the same…” But given my discomfort with the research topic (i.e., group IQ differences), and my limited understanding of multiverse analyses, I don’t really trust my judgement.

4.) Dealing with Galton’s Problem and spatial autocorrelation

a. The authors acknowledge that the data for the different nations in their analyses are likely dependent, because of geographic or cultural proximity, an issue that they identify as Galton’s Problem or spatial autocorrelation. This issue seems important, and I appreciate the authors’ attempts to address it (note: it must be interesting to be known as a “Galton’s Problem expert”). I guess that I am just curious as to your thoughts about how to handle this issue. Is this a situation in which multilevel modeling makes sense?

There are a few issues here.

First, I wouldn’t trust anything by Roy Baumeister, first because he has a track record of hyping problematic research claims, second because he has endorsed the process of extracting publishable findings from noise. Baumeister’s a big fan of research that is “interesting.” As I wrote when this came up a few years ago:

Interesting to whom? Daryl Bem claimed that Cornell students had ESP abilities. If true, this would indeed be interesting, given that it would cause us to overturn so much of what we thought we understood about the world. On the other hand, if false, it’s pretty damn boring, just one more case of a foolish person believing something he wants to believe.

Same with himmicanes, power pose, ovulation and voting, alchemy, Atlantis, and all the rest.

The unimaginative hack might find it “less broadly interesting” to have to abandon beliefs in ghosts, unicorns, ESP, and the correlation between beauty and sex ratio. For the scientists among us, on the other hand, reality is what’s interesting and the bullshit breakthroughs-of-the-week are what’s boring.

To the extent that phenomena such as power pose, embodied cognition, ego depletion, ESP, ovulation and clothing, beauty and sex ratio, Bigfoot, Atlantis, unicorns, etc., are real, then sure, they’re exciting discoveries! A horse-like creature with a big horn coming out of its head—cool, right? But, to the extent that these are errors, nothing more than the spurious discovery of patterns from random noise . . . then they’re just stories that are really “boring” (in the words of Baumeister), low-grade fiction.

Relatedly, Baumeister has some political axe to grind. This alone doesn’t mean his work is wrong—someone can have strong political views and do fine research, indeed sometimes the strong views can motivate careful work, if you really care about getting things right. Rather, the issue is that we have good technical reasons to not take Baumeister’s research seriously. (For more on problems with his research agenda, see the comment on p.37 by Smaldino here.) Given that Baumeister’s work may still have some influence, it’s good to understand his political angle.

And, no, it’s not “bullying” or an “ad hominem attack” to consider someone’s research record when evaluating his published claims.

Second, yeah, it’s my impression that you have to be careful with these cross-national IQ comparisons; see this paper from 2010 by Wicherts, Borsboom, and Dolan. Relatedly, I’m amused by the claim in the abstract by Clark et al. that “Many have argued that religion reduces violent behavior within human social groups.” I guess it depends on the religion, as is illustrated by the graph shown at the top of this page (background here).

Third, ok, sure, the paper will be published in Psychological Science, flagship journal bla bla bla. I’ve written for those journals myself, so I guess I like some of what they do—hey, they published our multiverse paper!—but the Association for Psychological Science is also a bit of a member’s club, run for the benefit of the insiders. In some way, I have to admire that they’d publish a paper on such a politically hot topic as IQ differences between countries. I’d actually have guessed that such a paper would push too many buttons for it to be considered publishable by the APS. But Baumeister is well connected—he’s “one of the world’s most prolific and influential psychologists”—so I guess that in the battle between celebrity and political correctness, celebrity won.

And, yes, that paper is politically incorrect! Check out these quotes:

Educated societies might promote secularization without considering potentially disproportionately negative consequences for more cognitively disadvantaged groups. . . .

We suspect that similar patterns might emerge for numerous cultural narratives. The prescriptive values of highly educated groups (such as secularism, but also libertarianism, criminal justice reform, and unrestricted sociosexuality, among others) may work for groups that are highly cognitively sophisticated and self-controlled, but they may be injurious to groups with lower self-control and cognitive ability.

OK, I got it. We won’t throw you in jail if you’re from a group with higher self-control and cognitive ability. But if you’re from one of the bad groups, you can do what you want: you can handle a bit of “libertarianism, criminal justice reform, and unrestricted sociosexuality, among others.” Hey, it worked for Jeffrey Epstein!

Fourth, there are questions about the statistical model. I won’t lie: I’m happy they did a multiverse analysis and fit some multilevel regressions. I’m not happy with all the p-values and statistical significance, nor am I happy with some of their arbitrary modeling decisions (“the difference led us to create two additional dummy variables, whether a country was majority Christian or not and whether a country was majority Muslim or not, and to test whether either of these dummy variables moderated the nine IQ by religiosity interactions (in the base models, without controls). None of the 18 three-way interactions were statistically significant, and so we do not interpret this possible difference between Christian majority countries and Muslim majority countries”) or new statistical methods they seemed to make up on the spot (“We arbitrarily decided that a semipartial r of .07 or higher for the IQ by religiosity interaction term would be a ‘consistent effect’ . . .”), but, hey, baby steps.

Donohue’s questions above about multiverse analysis and spatial correlations are good questions, but it’s hard for me to answer them in general, or in the context of this sort of study, where there are so many data issues.

To see where I’m coming from, consider an example that I’ve thought about a lot: the relation between income, religious attendance, geography, and vote choice. My colleagues and I wrote a whole book about this!

The red-state-blue-state project and the homicide/religion/IQ project have a lot of similarities, in that we’re understanding social behavior through demographics, and looking at geographic variation in this relationship. We’re interested in individual and average characteristics (individual and state-average incomes in our case; individual and national-average IQ’s in theirs). Clark et al. have data issues with national IQ measurements, but it’s not like our survey measurements of income are super-clean.

So, if we take Red State Blue State as a template for this sort of analysis, how does the Clark et al. paper differ? The biggest difference is that we have individual level data—survey responses on income, religious attendance, and voting—whereas Clark et al. only have averages. So they have a big ecological correlation problem. Indeed, one of the big themes of Red State Blue State is that you can’t directly understand individual correlations by looking at correlations among aggregates. The second difference between the two projects is that we had enough data that we can analyze each election year separately, whereas Clark et al. pool across years, which makes results much harder to understand and interpret. The third difference is that we developed our understanding through lots of graphs. I can’t imagine us figuring out much, had we just looked at tables of regression coefficients, statistical significance, correlations, etc.
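To make the ecological-correlation point concrete, here is a toy simulation (my own illustration, not Clark et al.’s data or model) in which the individual-level relationship between income and Republican voting is positive within every state while the relationship between state-average income and state-average vote share runs the other way:

```python
# Toy illustration of the ecological-correlation problem, red-state-blue-state
# style: within every state, richer individuals are more likely to vote
# Republican, yet richer states have lower Republican vote shares.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_per_state = 50, 1000

state_mean_income = rng.normal(0, 1, n_states)   # standardized state-average income
state_intercept = -1.0 * state_mean_income + rng.normal(0, 0.3, n_states)

agg_income, agg_vote, within_corr = [], [], []
for s in range(n_states):
    income = state_mean_income[s] + rng.normal(0, 1, n_per_state)
    # Within each state, richer individuals are more likely to vote Republican.
    p_rep = 1 / (1 + np.exp(-(state_intercept[s] + 0.8 * (income - state_mean_income[s]))))
    vote = rng.binomial(1, p_rep)
    agg_income.append(income.mean())
    agg_vote.append(vote.mean())
    within_corr.append(np.corrcoef(income, vote)[0, 1])

print("mean within-state income-vote correlation:", round(float(np.mean(within_corr)), 2))   # positive
print("across-state correlation of the averages: ",
      round(float(np.corrcoef(agg_income, agg_vote)[0, 1]), 2))                              # negative
```

Aggregate-level correlations simply do not pin down the individual-level story, which is why having only country-level averages is such a handicap.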

This is not to say that their analysis is necessarily wrong, just that it’s hard for me to make sense of this sort of big regression; there are just too many moving parts. I think the first step in trying to understand this sort of data would be some time series plots of trends of religiosity and crime, with a separate graph for each country, ordering the countries by per-capita GDP or some similar measure of wealth. Just to see what’s going on in the data before going forward.
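If you wanted to do that look-at-the-data-first step yourself, a minimal sketch would be something like the following (the file and column names are made up, since I do not have their data file):

```python
# Minimal sketch of the "look at the raw trends first" step: one small panel
# per country, with countries ordered by GDP per capita. Column names are
# hypothetical; substitute whatever is in the actual data file.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("religiosity_homicide.csv")  # hypothetical file name
order = (df.groupby("country")["gdp_per_capita"].mean()
           .sort_values().index)

n_rows = len(order) // 6 + 1
fig, axes = plt.subplots(nrows=n_rows, ncols=6,
                         figsize=(14, 2.2 * n_rows), sharex=True)
for ax, country in zip(axes.flat, order):
    sub = df[df["country"] == country].sort_values("year")
    # (you may want to standardize the two series first so they share a scale)
    ax.plot(sub["year"], sub["religiosity"], label="religiosity")
    ax.plot(sub["year"], sub["homicide_rate"], label="homicide rate")
    ax.set_title(country, fontsize=8)
for ax in axes.flat[len(order):]:
    ax.set_axis_off()  # hide any unused panels
axes.flat[0].legend(fontsize=7)
plt.tight_layout()
plt.show()
```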

Fifth, I’m kinda running out of energy to keep staring at this paper but let me point out one more thing which is the extreme stretch from the empirical findings of this paper, such as they are, to its sociological and political conclusions. Set aside for a moment problems with the data and statistical analysis, and suppose that the data show exactly what the authors claimed, that time trends in religious attendance correlate with time trends in homicide rates in low-IQ countries but not in high-IQ countries. Suppose that’s all as they say. How can you, from that pattern, draw the conclusion that “The prescriptive values of highly educated groups (such as secularism, but also libertarianism, criminal justice reform, and unrestricted sociosexuality, among others) may work for groups that are highly cognitively sophisticated and self-controlled, but they may be injurious to groups with lower self-control and cognitive ability”? You can’t. To make such a claim is not a gap in logic, it’s a chasm. Aristotle is spinning in his goddam grave, and Lewis Carroll, Georg Cantor, and Kurt Gödel ain’t so happy either. This is story time run amok. I’d say it’s potentially dangerous for the sorts of reasons discussed by Angela Saini in her book, but I guess nobody takes the Association for Psychological Science seriously anymore.

I’m surprised Psych Science would publish this paper, given its political content and given that academic psychology is pretty left-wing and consciously anti-racist. I’m guessing that it’s some combination of: (a) for the APS editors, support of the in-group is more important than political ideology, and Baumeister’s in the in-group, (b) nobody from the journal ever went to the trouble of reading the article from beginning to end (I know I didn’t enjoy the task!), (c) if they did read the paper, they’re too clueless to have understood its political implications.

But, hey, if the APS wants to take a stance against criminal justice reform, it’s their call. Who am I to complain? I’m not even a member of the organization.

I’d love to see the lords of social psychology be forced to take a position on this one—but I can’t see this ever happening, given that they’ve never taken a position on himmicanes, ESP, air rage, etc. At one point one of those lords tried to take a strong stand in favor of that ovulation-and-voting paper, but then I asked him flat-out if he thought that women were really three times more likely to wear red during certain days of the month, and he started dodging the question. These people pretty much refuse to state a position on any scientific issue, but they very strongly support the principle that anything published in their journals should not be questioned by an outsider. How they feel about scientific racism, we may never know.

P.S. One more thing, kinda separate from everything else but it’s a general point so I wanted to share it here. Clark et al. write:

Note also that noise in the data, if anything, should obscure our hypothesized pattern of results.

No no no no no. Noise can definitely obscure true underlying patterns, but it won’t necessarily obscure your hypothesized patterns. Noise can just give you more opportunities to find spurious, “statistically significant” patterns. The above quote is an example of the “What does not kill my statistical significance makes it stronger” fallacy. It’s an easy mistake to make; famous econometricians have done it. But a mistake it is.
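A quick simulation shows why the quoted reasoning fails (this is a generic toy example, not the Clark et al. setup): with a true effect of zero, noisy data plus a handful of outcomes to choose from still turns up “statistically significant” patterns at a healthy clip.

```python
# Toy demonstration that noise does not reliably "obscure" a hypothesized
# pattern: with zero true effect, noisy data plus a few outcomes to choose
# from still produces plenty of p < 0.05 results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_outcomes, n_sims = 50, 5, 2000

any_significant = 0
for _ in range(n_sims):
    x = rng.normal(size=n)                    # a predictor, measured with noise
    ys = rng.normal(size=(n_outcomes, n))     # five outcomes, all pure noise
    pvals = [stats.linregress(x, y).pvalue for y in ys]
    any_significant += min(pvals) < 0.05

print("share of simulations with at least one p < 0.05:",
      any_significant / n_sims)   # roughly 1 - 0.95**5, about 0.23
```

Noise plus forking paths is an engine for producing apparent patterns, not just for hiding real ones.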

The turtles stop here. Why we meta-science: a meta-meta-science manifesto

All those postscripts in the previous post . . . this sort of explanation of why I’m writing about the scientific process, it comes up a lot.

I spend a lot of time thinking and writing about the research process, rather than just doing research.

And all too often I often find myself taking time out to explain why I’m spending time on meta-science that I could be spending on science instead.

My answer is that meta-science discussions are, like it or not, necessary conditions for more technical work. Without meta-science, we keep getting caught in deference traps. Remember when we and others criticized silly regression discontinuity analyses? The response in some quarters was a reflexive deference to standard methods in econometrics, without reflection on the applicability of these methods to the problems at hand. Remember when we and others criticized silly psychology experiments? The response in some quarters was a reflexive deference to standard practices in that field (“You have no choice but to accept,” etc.). Remember when we and others criticized ridiculously large estimated causal effects in policy analysis? The response in some quarters was to not respond at all, perhaps based on a savvy judgment that institutional reputations would outlast technical criticism. Remember pizzagate? The response was to duck and weave, which probably would’ve worked had there not been so so so many data anomalies; it also didn’t help that this particular researcher had no powerful friends to go on the counterattack against his critics. Remember that gremlins research? Again, the researcher didn’t give an inch; he relied on the deference given to the economics profession, and the economics profession didn’t seem to care that its reputation was being used in this way. Remember beauty and sex ratio, divorce predictions, etc.? Technical criticism means nothing at all to the Freakonomicses, Gladwells, NPRs, and Teds of the world. Remember how methods of modern survey adjustment were blasted by authority figures such as the president of the American Association for Public Opinion Research and the world’s most famous statistical analyst? Again, our technical arguments didn’t matter one bit to these people. Technical reasoning didn’t come into play at all. It was just one deference trap after another.

So, yes, I spend hours dismantling these deference traps so we can get to our real work. Perhaps not the best use of my time, given all my technical training, but somebody’s gotta do it. I’m sick and tired of feeling like I have to explain myself, but the alternative, where people don’t understand why I’m doing this, seems worse. In the words of Auden, “To-morrow, perhaps the future.”

Indeed, not only do we have to spend valuable time on meta-science so we can get to our science, but sometimes we need to spend time on meta-meta-science as in the above paragraph and the postscripts to the previous post. We need the meta-meta-science to explain to people who would otherwise ask why we are wasting our time on meta-science.

THE TURTLES STOP HERE.

“The good news about this episode is that it’s kinda shut up those people who were criticizing that Stanford antibody study because it was an un-peer-reviewed preprint. . . .” and a P.P.P.S. with Paul Alper’s line about the dead horse

People keep emailing me about this recently published paper, but I already said I’m not going to write about it. So I’ll mask the details.

Philippe Lemoine writes:

So far it seems you haven’t taken a close look at the paper yourself and I’m hoping that you will, because I’m curious to know what you think and I know I’m not alone.

I really think there are serious problems with this paper and that it shouldn’t have been published without doing something about them. In my opinion, the most obvious issue is that, just looking at table **, one can see that people in the treatment groups were almost ** times as likely to be placed on mechanical ventilation as people in the control group, even though the covariates they used seem balanced across groups. As tables ** in the supplementary materials show, even after matching on propensity score, there are more than twice as many people who ended up on mechanical ventilation in the treatment groups as in the control groups.

If the control and treatment groups were really comparable at the beginning, it seems very unlikely there would be such a difference between them in the proportion of people who ended up being placed on mechanical ventilation, so I think it was a huge red flag that the covariates they used weren’t sufficient to adequately control for disease severity at baseline. (Another study with a similar design published recently in the NEJM used more covariates to control for baseline disease severity and didn’t find any effect.) They should at least have tried other specifications to see if that affected the results. But they didn’t and only used propensity score matching with exactly the same covariates in a secondary analysis.

In the discussion section, when they talk about the limitations of the study, they write this extraordinary sentence: “Due to the observational study design, we cannot exclude the possibility of unmeasured confounding factors, although we have reassuringly [emphasis mine] noted consistency between the primary analysis and the propensity score matched analyses.” But propensity score matching is just a non-parametric alternative to regression, it still assumes that treatment assignment is strongly ignorable, so how could it be reassuring that unmeasured confounding factors didn’t bias the results?

I actually have an answer to that one! In causal inference from observational data, you start with the raw-data comparison, then you adjust for basic demographics, then you adjust for other available pre-treatment predictors such as pre-existing medical conditions, smoking history, etc. And then you have to worry about adjustments for the relevant pre-treatment predictors you haven’t measured. At each step you should show what your adjustment did. If adjusting for demographics doesn’t change your answer much, and adjusting for available pre-treatment predictors doesn’t change your answer much, then it’s not so unreasonable to suppose that adjustment for other, harder-to-measure, variables won’t do much either. This is standard reasoning in observational studies (see our 1990 paper, for example). I think Paul Rosenbaum has written some more formal arguments along those lines.
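Here is the kind of staged comparison I have in mind, as a rough sketch (hypothetical file and variable names, with ordinary logistic regression standing in for whatever model a given study actually uses):

```python
# Sketch of staged adjustment in an observational comparison: report the
# treatment estimate after each successive block of covariates and see how
# much it moves. File name and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("observational_cohort.csv")  # hypothetical data file

formulas = [
    "died ~ treated",                                          # raw comparison
    "died ~ treated + age + sex",                              # + basic demographics
    "died ~ treated + age + sex + diabetes + smoker + bmi",    # + pre-treatment health
]

for f in formulas:
    fit = smf.logit(f, data=df).fit(disp=False)
    print(f"{f:55s} treated coef = {fit.params['treated']:+.3f} "
          f"(se {fit.bse['treated']:.3f})")

# If the coefficient barely moves as measured confounders are added, that is
# weak, indirect evidence that further unmeasured confounders of the same
# general type might not move it much either, which is the reasoning above.
```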

Lemoine continues:

Of course, those are hardly the only issues with this paper, as many commenters have noted on your blog. In particular, I think **’s analysis of table ** is pretty convincing, but as I noted in response to his comment, if he is right that it’s what the authors of the study did, what they say in the paper is extremely misleading. They should clarify what they did and, if the data in that table were indeed processed, which seems very likely, a correction should be made to explain what they did.

Frankly, I don’t expect ** to have anything other than a small effect, whether positive or negative, so I don’t really care about that issue. But I fear that, since the issue has become politicized (which is kind of crazy when you think about it), many people are unwilling to criticize this study because the conclusions are politically convenient and they don’t want to appear to side with **. I think this is very bad for science and that it’s important that post-publication peer review proceeds as it normally would.

I just wanted to encourage you to dig into the study yourself because I’m curious to know what you think. Moreover, if you agree there are serious issues with it and say that on your blog, the authors will be more likely to respond instead of ignoring those criticisms, as they have been doing so far.

My reply:

By now, enough has been said about this study that I don’t need to look into it in detail! At this point, it seems that nobody believes the published analysis or conclusion, and the main questions revolve around what the data actually are and where they came from. It’s become a pizzagate kind of thing. It’s possible that the authors will be able to pull a rabbit out of the hat and explain everything, but given their responses so far, I’m doubtful. As we’ve discussed, ** (and journals in general) have a poor record about responding to criticisms of their papers: at best, the most you’ll usually get is a letter published months after the original article, along with a bag of words by the original authors explaining how, surprise! none of their conclusions have changed in any way.

The good news about this episode is that it’s kinda shut up those people who were criticizing that Stanford antibody study because it was an un-peer-reviewed preprint. The problem with the Stanford antibody study is not that it was an un-peer-reviewed preprint; it’s that it had bad statistical analyses and the authors supplied no data or code. It easily could’ve been published in JAMA or NEJM or Lancet or whatever and had the same problems. Indeed, “Stanford” played a similar role as “Lancet” in giving the paper instant credibility. As did “Cornell” with the pizzagate papers.

As Kelsey Piper puts it, “the new, fast scientific process (and even the old, slow scientific process) can allow errors — sometimes significant ones — to make it through peer review.”

P.S. Keep sending me cat pictures, people! They make these posts soooo much more appealing.

P.P.S. As usual, I’m open to the possibility that the conclusions in the disputed paper are correct. Just because they haven’t made a convincing case and they haven’t shared their data and code and people have already found problems with their data, that doesn’t mean that their substantive conclusions are wrong. It just means they haven’t supplied strong evidence for their claims. Remember evidence and truth.

P.P.P.S. I better explain something that comes up sometimes with these Zombies posts. Why beat a dead horse? Remember Paul Alper’s dictum, “One should always beat a dead horse because the horse is never really dead.” Is it obsessive to post multiple takes on the same topic? Remember the Javert paradox. It’s still not too late for the authors to release their code and some version of their data and to respond in good faith on the pubpeer thread, also not too late for the journal to do something.

What could the journal do? For one, they could call on the authors to release their code and some version of their data and to respond in good faith on the pubpeer thread. That’s not a statement that the published paper is wrong; it’s a statement that the topic is important enough to engage the hivemind. Nobody’s perfect in design of a study or in data analysis, and it seems absolutely ludicrous for data and code to be hidden so that, out of all the 8 billion people in the world, only 4 people have access to this information from which such big conclusions are drawn. It’s kind of like how in World War 2, so much was done in such absolute secrecy that nobody but the U.S. Army and Joseph Stalin knew what was going on. Except here the enemy can’t spy on us, so secrecy serves no social benefit.

In Bayesian priors, why do we use soft rather than hard constraints?

Luiz Max Carvalho has a question about the prior distributions for hyperparameters in our paper, Bayesian analysis of tests with unknown specificity and sensitivity:

My reply:

1. We recommend soft rather than hard constraints when we have soft rather than hard knowledge. In this case, we don’t absolutely know that spec and sens are greater than 50%. There could be tests that are worse than that. Conversely, to the extent that we believe spec and sens to be greater than 50% we don’t think they’re 51% either.

2. I typically use normal rather than beta because normal is easier to work with, and it plays well with hierarchical models.
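To illustrate the difference (a sketch in the spirit of the discussion above, not the actual code from our paper), here are a hard constraint and a soft constraint on a test’s specificity, written as prior densities on the probability scale:

```python
# Hard vs. soft constraint on a test's specificity, written as priors.
# An illustration of the distinction, not the model from the paper.
from scipy import stats
from scipy.special import logit

# Hard constraint: specificity forced into (0.5, 1), flat inside, zero outside.
hard = stats.uniform(loc=0.5, scale=0.5)

# Soft constraint: normal prior on the logit scale, centered at logit(0.9).
# Values above 0.5 are favored, but nothing is ruled out a priori.
mu, sigma = logit(0.9), 1.0

def soft_density(spec: float) -> float:
    """Prior density on the probability scale implied by logit(spec) ~ N(mu, sigma)."""
    return stats.norm(mu, sigma).pdf(logit(spec)) / (spec * (1 - spec))  # change of variables

for spec in (0.49, 0.70, 0.95):
    print(f"spec = {spec:.2f}: hard prior = {hard.pdf(spec):.3f}, "
          f"soft prior = {soft_density(spec):.3f}")
# The hard prior is exactly zero at 0.49; the soft prior is small but positive,
# so the data can still pull the estimate below 0.5 if that is where the
# evidence points.
```

The normal-on-the-logit-scale version never assigns zero density, and it drops into a hierarchical model with no extra work, which is the other reason given above for preferring it to a beta prior.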

“The Moral Economy of Science”

In our discussion of Lorraine Daston’s “How Probabilities Came to Be Objective and Subjective,” commenter John Richters points to Daston’s 1995 article, “The Moral Economy of Science,” which is super-interesting and also something I’d never heard of before. I should really read the whole damn article and comment on everything in it, but for now I’ll just issue this pointer and let you do some of the work of reading and commenting.

An open letter expressing concerns regarding the statistical analysis and data integrity of a recently published and publicized paper

James Watson prepared this open letter to **, **, **, and **, authors of ** and to ** (editor of **). The letter has approximately 96,032 signatures from approximately 6 continents. And I heard a rumor that they have contacts at the Antarctic Polar Station who are going to sign the thing once they can get their damn fur gloves off.

I like the letter. This kind of thing should be a generic letter that applies to all research papers!

I’ve obscured the details of the letter here because I don’t want to single out the authors of this particular paper or the editor of this particular journal.

If the paper really does have all the problems that some people are concerned about, then maybe the journal in question will follow the “Wakefield rule” and retract in 2032. You thought journal review was slow? Retraction’s even slower!

A journalist who’s writing a story about this controversy asked me what I thought, and I said I didn’t know. The authors have no obligation to share their data or code, and I have no obligation to believe anything they say. Similarly, the journal has no obligation to try to get the authors to respond in a serious way to criticisms and concerns, and I have no obligation to take seriously the papers they publish. This doesn’t mean all or even most of the papers they publish are bad; it just means that we need to judge them on their merits.

P.S. The above link no longer works, so here’s another copy of the letter from Watson et al.

P.P.S. In this news article, Stephanie Lee notes:

One of the biggest concerns of the letter’s signatories was that the authors had not released their code or data, even though the Lancet has signed a pledge to share COVID-19–related data.

Blast from the past

Lizzie told me about this paper, “Bidirectionality, Mediation, and Moderation of Metaphorical Effects: The Embodiment of Social Suspicion and Fishy Smells,” which reports:

As expected (see Figure 1), participants who were exposed to incidental fishy smells invested less money (M = $2.53, SD = $0.93) than those who were exposed to odorless water (M = $3.34, SD = $1.02), planned contrast t(42) = 2.07, p = .05, Cohen’s d = 0.83, or fart spray (M = $3.38, SD = $1.23), t(42) = 2.22, p = .03, d = 0.78.

Fart spray!

The paper faithfully follows Swann’s 18 rules for success in social priming research.

I was surprised to see that people were still doing this sort of thing . . . but then I looked at the paper more carefully. Journal of Personality and Social Psychology, 2012. Just one year after they’d published that ESP paper. To criticize a psychology journal for publishing this sort of thing in 2012 would be like mocking someone for sporting a mullet in the 1980s.

Of course, just cos a paper is on a funny topic and just cos it follows the cargo-cult-science template, it doesn’t mean that it is wrong. I guess I’ll believe it when I see a preregistered replication, not before. In the meantime, just recall that experimental results can be statistically significant and look super-clean but still not replicate. The garden of forking paths is not just a slogan, it’s a real thing that can easily lead researchers to fool themselves; hence the need to be careful.

P.S. All that said, it’s still not as bad as “Low glucose relates to greater aggression in married couples”: that’s the study where they had people blasting their spouses with loud noises and sticking pins into voodoo dolls.

These issues also arise with published research on more important topics.

This is not a post about remdesivir.

Someone pointed me to this post by a doctor named Daniel Hopkins on a site called KevinMD.com, expressing skepticism about a new study of remdesivir. I guess some work has been done following up on that trial on 18 monkeys. From the KevinMD post:

On April 29th Anthony Fauci announced the National Institute of Allergy and Infectious Diseases, an institute he runs, had completed a study of the antiviral remdesivir for COVID-19. The drug reduced time to recovery from 15 to 11 days, he said, a breakthrough proving “a drug can block this virus.” . . .

While the results were preliminary, unpublished, and unconfirmed by peer review, Fauci felt an obligation, he said, to announce them immediately. Indeed, he explained, remdesivir trials “now have a new standard,” a call for researchers everywhere to consider halting any studies, and simply use the drug as routine care.

Hopkins has some specific criticisms of how the results of the study were reported:

Let us focus on something Fauci stressed: “The primary endpoint was the time to recovery.” . . . Unfortunately, the trial registry information, data which must be entered before and during the trial’s actual execution, shows Fauci’s briefing was more than just misleading. On April 16th, just days before halting the trial, the researchers changed their listed primary outcome. This is a red flag in research. . . . In other words they shot an arrow and then, after it landed, painted their bullseye. . . .

OK, this might be a fair description, or maybe not. You can click through and follow the links and judge for yourself.

Here I want to talk about two concerns that came up in this discussion which arise more generally when considering this sort of wide-open problem where many possible treatments are being considered.

I think these issues are important in many settings, so I’d like to talk about them without thinking too much about remdesivir or that particular study or the criticisms on that website. The criticisms could all be valid, or they could all be misguided, and it would not really affect the points I will make below.

Here are the 2 issues:

1. How to report and analyze data with multiple outcomes.

2. How to make decisions about when to stop a trial and use a drug as routine care.

1. In the above-linked post, Hopkins writes:

This choice [of primary endpoint], made in the planning stages, was the project’s defining step—the trial’s entry criteria, size, data collection, and dozens of other elements, were tailored to it. This is the nature of primary outcomes: they are pivotal, studies are built around them. . . .

Choosing any primary outcome means potentially missing other effects. Research is hard. You set a goal and design your trial to reach for it. This is the beating heart of the scientific method. You can’t move the goalposts. That’s not science.

I disagree. Yes, setting a goal and designing your trial to reach for it is one way to do science, but it’s not the only way. It’s not “the beating heart of the scientific method.” Science is not a game. It’s not about “goalposts”; it’s about learning how the world works.

2. Lots is going on with coronavirus, and doctors will be trying all sorts of different treatments in different situations. If there are treatments that people will be trying anyway, I don’t see why they shouldn’t be used as part of experimental protocols. My point is that, based on the evidence available, even if remdesivir should be used as routine care, it’s not clear that all the studies should be halted. More needs to be learned, and any study is just a formalization of the general idea that different people will be given different treatments.

Again, this is not a post about remdesivir. I’m talking about more general issues of experimentation and learning from data.

Age-period-cohort analysis.

Chris Winship and Ethan Fosse write with a challenge:

Since its beginnings nearly a century ago, Age-Period-Cohort analysis has been stymied by the lack of identification of parameter estimates resulting from the linear dependence between age, period, and cohort (age = period – cohort). In a series of articles, we [Winship and Fosse] have developed a set of methods that allow APC analysis to move forward despite the identification problem. We believe that our work provides a solid methodological foundation for APC analysis, one that has not existed previously. By a solid methodological foundation, we mean a set of methods that can produce substantively important results where the assumptions involved are both explicit and likely to be plausible.

After nearly a century of effort this is a big claim. How might we test it? In mathematics, if someone claims to have proved a theorem, the proof is not considered valid until others have rigorously analyzed it. Our request and hope is that researchers will interrogate our claim with similar rigor. Have we in fact succeeded after so many years of effort by others?

Full Challenge Document

APC-R Software Download

My own articles on age-period-cohort analysis are here, here, and here. The first of these was an invited discussion for the American Journal of Sociology that they decided not to publish; the second (with Jonathan Auerbach) is our summary of what went wrong with that notorious claim a few years ago about the increasing death rate of middle-aged white Americans, and the third (with Yair Ghitza and Jonathan Auerbach) is our very own age-period-cohort analysis of presidential voting.

I have not looked at Winship and Fosse’s work in detail, but I agree with their general point that the right way forward with this problem is to think about nonlinear models.
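The identification problem itself takes only a few lines to see numerically; here is a tiny check (my own illustration, not Winship and Fosse’s software) that a design matrix with linear age, period, and cohort terms loses a rank:

```python
# The APC identification problem in a few lines of linear algebra: with
# cohort = period - age, the three linear terms are exactly collinear, so the
# design matrix drops a rank and the linear effects are not separately identified.
import numpy as np

rng = np.random.default_rng(0)
age = rng.integers(20, 80, size=500)
period = rng.integers(1960, 2020, size=500)
cohort = period - age                      # birth year, by definition

X = np.column_stack([np.ones(500), age, period, cohort])
print("columns:", X.shape[1], "  rank:", np.linalg.matrix_rank(X))  # rank is 3, not 4
```

The nonlinear components of the three effects are not subject to this exact collinearity, which is one way to see why the way forward runs through nonlinear models.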

Last post on hydroxychloroquine (perhaps)

James “not this guy” Watson writes:

The Lancet study has already been consequential, for example, the WHO have decided to remove the hydroxychloroquine arm from their flagship SOLIDARITY trial.

Thanks in part to the crowdsourcing of data sleuthing on your blog, I have an updated version of doubts concerning the data reliability/veracity.

1/ Ozzy numbers:
This Australian government report (Table 5) says that as of 10th May, only 866 patients in total had been hospitalized in Australia, of whom 7.9% died (68 patients)… whereas 73 Australian patients in the Lancet paper were reported as having died. The mean age reported in the Lancet paper for Australian patients is 55.8 years. The median age for all Australian patients in the attached is 47 years, and for those hospitalized it’s 61 years. (Note the Lancet paper only included hospitalized people, up to April 14th).

2/ A very large Japanese hospital:
The Mehra et al. paper in the NEJM (Cardiovascular disease, drug therapy, and mortality in Covid-19, same data provenance, time period: Dec 20th to March 15th) gave the number of hospitals broken down by country. They had 9 hospitals in Asia (7 in China, 1 in Japan and 1 in South Korea) and 1,507 patients. Their follow-up paper in The Lancet presumably used the same data plus extra data up until April the 14th. The Lancet paper had 7,555 participants in Asia and also 9 hospitals. The assumption would be that these hospitals are the same (why would you exclude the hospitals from the first analysis in the second analysis?). Therefore, we assume that they had an extra 6048 patients in that time period.
Cases in China went from 80,860 on March the 15th to 82,295 by April the 14th (difference is 1435). South Korea: increase from 8,192 to 10,564 (difference is 2372); Japan: from 833 to 7,885 in this time (7052). This is a total increase of 10,859. If all cases in China and South Korea in the intervening period were seen in these 8 hospitals, then it would imply that 2241 patients were seen in 1 hospital in Japan in the space of a month!

3/ High dosing:
Almost 2 thirds of the data come from North America (66%, 559 hospitals). In the previous NEJM publication, the majority of the hospitals were in USA (121 versus 4 in Canada). Assuming that the same pattern holds for the extra 434 hospitals in this Lancet paper, the majority of the patients will have received doses of HCQ according to FDA recommendations: 800mg on day 1, followed by 400mg (salt weights) for 4-7 days. This is not a weight-based dosing recommendation.
The mean daily doses and durations of dosing for HCQ are given as: 596 mg (SD: 126) for an average of 4.2 days (SD: 1.9); HCQ with a macrolide: 597 mg (SD 128) and 4.3 days (SD 2). The FDA dosing for 4 days would give an average of 500mg daily, i.e. (800 + 3×400) / 4. Nowhere in the world recommends higher doses than this, with the exception of the RECOVERY trial in the UK.
So are these average daily doses possible?

4/ Disclaimer/background
It may be worth mentioning that I (or the research unit for which I work) could be seen as having a “vested interest” in chloroquine because we are running the COPCOV study (I am not an investigator on that trial). COPCOV is a COVID19 prevention trial in health workers. Participants will take low dose chloroquine as prophylaxis for 3 months (they are not sick and the doses are about 3x lower than given for treatment – so different population&dose than Lancet study). The Lancet study will inevitably damage this trial due the media attention. Understanding whether the underlying data are reliable or not is of extreme importance to our research group. Because our unit has been thinking/reading about (hydroxy)chloroquine a lot recently (and some people in the group have been studying chloroquine pharmacology for 40 years) we rapidly picked up on the “oddness” of this recent paper.

My conclusion from this is that post-publication review is a vital component of science. Medical journals need to embrace this and stop pretending that peer/editorial review will solve all problems.
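Watson’s dosing arithmetic in point 3 is easy to reproduce; here is a quick check using only the numbers quoted above:

```python
# Reproducing the dosing arithmetic from point 3 above: the maximum average
# daily dose implied by FDA-style dosing (800 mg on day 1, then 400 mg/day),
# compared with the mean daily dose reported in the Lancet paper.
loading_dose, maintenance_dose = 800, 400   # mg (salt weight)

for days in range(4, 8):                    # 4-7 days of treatment
    avg = (loading_dose + maintenance_dose * (days - 1)) / days
    print(f"{days} days of dosing -> average {avg:.0f} mg/day")
# 4 days -> 500 mg/day, and longer courses only pull the average lower, so a
# reported mean of roughly 596 mg/day is hard to square with this regimen.
```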

Perhaps the authors of that Lancet study will respond in the comments here? They haven’t yet responded on pubpeer.

P.S. The authors have this followup post which has some general discussion of their data sources but no engagement with the criticisms of the paper. On the ladder of responses to criticism, I’d put them at #4 (“Avoid looking into the question”). The good news is that they’re nowhere near #6 (“Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim”) or #7 (“Attack the messenger”). As I’ve said before, I have an open mind on this, and it’s possible the paper has no mistakes at all: maybe the criticisms are misguided. I’d feel better if the authors acknowledged the criticisms and responded in some way.

Alexey Guzey’s sleep deprivation self-experiment

Alexey “Matthew Walker’s ‘Why We Sleep’ Is Riddled with Scientific and Factual Errors” Guzey writes:

I [Guzey] recently finished my 14-day sleep deprivation self experiment and I ended up analyzing the data I have only in the standard p < 0.05 way and then interpreting it by writing explicitly about how much I believe I should update based on this data. I honestly have absolutely no knowledge of Bayesian data analysis, so I'd be curious if you think the data I have is worth analyzing in some more sophisticated manner or if you have general pointers to resources that would help me figure this out (unless the answer to this is that I should just google something like "bayesian data analysis"..) Here’s the experiment.

One concern that I have is that my Psychomotor Vigilance Task data (as an example) is just not very good (which I note explicitly in the post), and I would be worried that if I try doing any fancy analysis on it, people would be led to believe that the data is more trustworthy than it really is, based on the fancy methods (when in reality it’s garbage in garbage out type of a situation).

Here’s the background (from the linked post):

I [Guzey] slept 4 hours a night for 14 days and didn’t find any effects on cognition (assessed via Psychomotor Vigilance Task, a custom first-person shooter scenario, and SAT). I’m a 22-year-old male and normally I sleep 7-8 hours. . . .

I did not measure my sleepiness. However, for the entire duration of the experiment I had to resist regular urges to sleep . . . This sleep schedule was extremely difficult to maintain.

Lack of effect on cognitive ability is surprising and may reflect true lack of cognitive impairment, my desire to demonstrate lack of cognitive impairment due to chronic sleep deprivation and lack of blinding biasing the measurements, lack of statistical power, and/or other factors.

I believe that this experiment provides strong evidence that I experienced no major cognitive impairment as a result of sleeping 4 hours per day for 12-14 days and that it provides weak suggestive evidence that there was no cognitive impairment at all.

I [Guzey] plan to follow this experiment up with an acute sleep deprivation experiment (75 hours without sleep) and longer partial sleep deprivation experiments (4 hours of sleep per day for (potentially) 30 and more days). . . .

His main finding is a null effect, in comparison with Van Dongen et al., 2003, who reported large and consistent declines in performance after sleep deprivation.

My quick answer to Guzey’s question (“I’d be curious if you think the data I have is worth analyzing in some more sophisticated manner”) is, No, I don’t think any fancy statistical analysis is needed here. Not given the data we see here. An essentially null effect is an essentially null effect, no matter how you look at it. Looking forward, yes, I think a multilevel Bayesian approach (as described here and here) would make sense. One reason I say this is because I noticed this bit of confusion from Guzey’s description:

The more hypotheses I have, the more samples I need to collect for each hypothesis, in order to maintain the same false positive probability (https://en.wikipedia.org/wiki/Multiple_comparisons_problem). This is a n=1 study and I’m barely collecting enough samples to measure medium-to-large effects and will spend 10 hours performing PVT. I’m not in a position to test many hypotheses at once.

This is misguided. The goal should be to learn, not to test hypotheses, and the false positive probability has nothing to do with anything relevant. It would arise if your plan were to perform a bunch of hypothesis tests and then record the minimum p-value, but it would make no sense to do this, as p-values are super-noisy.

Guzey has a whole bunch of this alpha-level test stuff, and I can see why he’d do this, because that’s what it says to do in some textbooks and online tutorials, and it seems like a rigorous thing to do, but this sort of hypothesis testing is not actually rigorous, it’s just a way to add noise to your data.
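To illustrate that last point (a generic simulation, nothing to do with Guzey’s actual measurements): even when there is a real, moderate effect, the p-value from a small experiment bounces all over the place from one replication to the next.

```python
# How noisy is a p-value? Simulate the same small experiment many times with a
# fixed, real effect and look at the spread of p-values across replications.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, true_effect, n_sims = 20, 0.5, 1000      # effect of 0.5 sd, n = 20 per group

pvals = []
for _ in range(n_sims):
    control = rng.normal(0, 1, n)
    treated = rng.normal(true_effect, 1, n)
    pvals.append(stats.ttest_ind(treated, control).pvalue)

pvals = np.array(pvals)
print("10th/50th/90th percentile p-value:", np.percentile(pvals, [10, 50, 90]))
print("share significant at 0.05:", (pvals < 0.05).mean())
```

A single p-value from a single run, whether it lands above or below 0.05, just doesn’t pin down much, which is why I’d rather see the raw data, estimates, and uncertainties.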

Anyway, none of this is really an issue here because he’s sharing his raw data. That’s really all the preregistration you need. For his next study, I recommend that Guzey just preregister exactly what measurements to take, then commit to posting the data and making some graphs.

There’s not much to say about the data analysis because Guzey’s data don’t show much. It could be, though, that as Guzey says he’s particularly motivated to perform well so he can find that sleep deprivation isn’t so bad.

Why do we go short on sleep and why do we care?

God is in every leaf of every tree.

As is so often the case, we can think better about this problem by thinking harder about the details and losing a layer or two of abstraction. In this case, the abstraction we can lose is the idea of “the effect of sleep deprivation on performance.”

To unpack “the effect of sleep deprivation on performance,” we have to ask: What sleep deprivation? What performance?

There are lots of reasons for sleep deprivation. For example, maybe you work 2 jobs, or maybe you’re up all night caring for a child or some other family member, or maybe you have some medical condition so you keep waking up in the middle of the night, or maybe you stay up all night sometimes to finish your homework.

Similarly, there are different performances you might care about. If you’re short on sleep because you’re working 2 jobs, maybe you don’t want to crash your car driving home one morning. Or maybe you’re operating heavy machinery and would like to avoid cutting your arm off. Or, if you’re staying up all night for work, maybe you want to do a good job on that assignment.

Given all this, it’s hard for me to make sense of general claims about the impact, or lack of impact, of lack of sleep on performance. I have the same concerns about measuring cognitive ability, as ability depends a lot on motivation.

These concerns are not unique to Guzey’s experiment; they also arise in other research, such as the cited paper by Van Dongen et al.

This controversial hydroxychloroquine paper: What’s Lancet gonna do about it?

Peer review is not a form of quality control

In the past month there’s been a lot of discussion of the flawed Stanford study of coronavirus prevalence—it’s even hit the news—and one thing that came up was that the article under discussion was just a preprint—it wasn’t even peer reviewed!

For example, in a NYT op-ed:

This paper, and thousands more like it, are the result of a publishing phenomenon called the “preprint” — articles published long before the traditional form of academic quality control, peer review, takes place. . . . They generally carry a warning label: “This research has yet to be peer reviewed.” To a scientist, this means it’s provisional knowledge — maybe true, maybe not. . . .

That’s fine, as long as you recognize that “peer-reviewed research” is also “provisional knowledge — maybe true, maybe not.” As we’ve learned in recent years, lots of peer-reviewed research is really bad. Not just wrong, as in, hey, the data looked good but it was just one of those things, but wrong, as in, we could’ve or should’ve realized the problems with this paper before anyone even tried to replicate it.

The beauty-and-sex-ratio research, the ovulation-and-voting research, embodied cognition, himmicanes, ESP, air rage, Bible Code, the celebrated work of Andrew Wakefield, the Evilicious guy, the gremlins dude—all peer-reviewed.

I’m not saying that all peer-reviewed work is bad—I’ve published a few hundred peer-reviewed papers myself, and I’ve only had to issue major corrections for 4 of them—but to consider peer review as “academic quality control” . . . no, that’s not right. The quality of the paper has been, and remains, the responsibility of the author, not the journal.

Lancet

So, a new one came in. A recent paper published in the famous/notorious medical journal Lancet reports that hydroxychloroquine and chloroquine increased the risk of in-hospital death by 30% to 40% and increased arrhythmia by a factor of 2 to 5. The study hit the news with the headline, “Antimalarial drug touted by President Trump is linked to increased risk of death in coronavirus patients, study says.” (Meanwhile, Trump says that Columbia is “a liberal, disgraceful institution.” Good thing we still employ Dr. Oz!)

All this politics . . . in the meantime, this Lancet study has been criticized; see here and here. I have not read the article in detail so I’m not quite sure what to make of the criticisms; I linked to them on Pubpeer in the hope that some experts can join in.

Now we have open review. That’s much better than peer review.

What’s gonna happen next?

I can see three possible outcomes:

1. The criticisms are mistaken. Actually the research in question adjusted just fine for pre-treatment covariates, and the apparent data anomalies are just misunderstandings. Or maybe there are some minor errors requiring minor corrections.

2. The criticisms are valid and the authors and journal publicly acknowledge their mistakes. I doubt this will happen. Retractions and corrections are rare. Even the most extreme cases are difficult to retract or correct. Consider the most notorious Lancet paper of all, the vaccines paper by Andrew Wakefield, which appeared in 1998, and was finally retracted . . . in 2010. If the worst paper ever took 12 years to be retracted, what can we expect for just run-of-the-mill bad papers?

3. The criticisms are valid, the authors dodge and do not fully grapple with the criticism, and the journal stays clear of the fray, content to rack up the citations and the publicity.

That last outcome seems very possible. Consider what happened a few years ago when Lancet published a ridiculous article purporting to explain variation in state-level gun deaths using 25 state-level predictors representing different gun control policies. A regression with 50 data points and 25 predictors and no regularization . . . wait! This was a paper that was so fishy that, even though it was published in a top journal and even though its conclusions were simpatico with the views of gun-control experts, those experts still blasted the paper with “I don’t believe that . . . this is not a credible study and no cause and effect inferences should be made from it . . . very flawed piece of research.” A couple of researchers at Rand (full disclosure: I’ve worked with these two people) followed up with a report concluding:

We identified a number of serious analytical errors that we suspected could undermine the article’s conclusions. . . . appeared likely to support bad gun policies and to hurt future research efforts . . . overfitting . . . clear evidence that its substantive conclusions were invalid . . . factual errors and inconsistencies in the text and tables of the article.

They published a letter in Lancet with their criticisms, and the authors responded with a bunch of words, not giving an inch on any of their conclusions or reflecting on the problems of using multiple regression the way they did. And, as far as Lancet is concerned . . . that’s it! Indeed, if you go to the original paper on the Lancet website, you’ll see no link to this correspondence. Meanwhile, according to Google, the article has been cited 74 times. OK, sure, 74 is not a lot of citations, but still. It’s included in a meta-analysis published in JAMA—and one of the authors of that meta-analysis is the person who said he did not believe the Lancet paper when it came out! The point is, it’s in the literature now and it’s not going away.
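Stepping back to the statistics for a moment: here’s a toy simulation of what you get when you run least squares with 25 predictors on 50 data points and no regularization. Everything below is pure noise; none of it comes from the actual gun-policy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 25                    # 50 states, 25 "policy" predictors
r2 = []
for _ in range(1000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + pure-noise predictors
    y = rng.normal(size=n)                                      # pure-noise outcome
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2.append(1 - resid.var() / y.var())

print(round(float(np.mean(r2)), 2))   # comes out around 0.5
```

Even with no real relationships anywhere, the regression “explains” about half the variation. That’s why, at that ratio of predictors to data points, you need regularization or a multilevel model before taking any coefficient seriously.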

A few years ago I wrote, in response to a different controversy regarding Lancet, that journal reputation is a two-way street:

Lancet (and other high-profile journals such as PPNAS) play a role in science publishing that is similar to the Ivy League in universities: It’s hard to get in, but once you’re in, you have that Ivy League credential, and you have to really screw up to lose that badge of distinction.

Or, to bring up another analogy I’ve used in the past, the current system of science publication and publicity is like someone who has a high fence around his property but then keeps the doors of his house unlocked. Any burglar who manages to get inside the estate then has free run of the house. . . .

As Dan Kahan might say, what do you call a flawed paper that was published in a journal with impact factor 50 after endless rounds of peer review? A flawed paper. . . .

My concern is that Lancet papers are inappropriately taken more seriously than they should. Publishing a paper in Lancet is fine. But then if the paper has problems, it has problems. At that point it shouldn’t try to hide behind the Lancet reputation, which seems to be what is happening. And, yes, if that happens enough, it should degrade the journal’s reputation. If a journal is not willing to rectify errors, that’s a problem no matter what the journal is.

Remember Newton’s third law? It works with reputations too. The Lancet editor is using his journal’s reputation to defend the controversial study. But, as the study becomes more and more disparaged, the sharing of reputation goes the other way.

I can imagine the conversations that will occur:

Scientist A: My new paper was published in the Lancet!

Scientist B: The Lancet, eh? Isn’t that the journal that published the discredited Iraq survey, the Andrew Wakefield paper, and that weird PACE study?

A: Ummm, yeah, but my article isn’t one of those Lancet papers. It’s published in the serious, non-politicized section of the magazine.

B: Oh, I get it: The Lancet is like the Wall Street Journal—trust the articles, not the opinion pages?

A: Not quite like that, but, yeah: If you read between the lines, you can figure out which Lancet papers are worth reading.

B: Ahhh, I get it.

Now we just have to explain this to journalists and policymakers and we’ll be in great shape. Maybe the Lancet could use some sort of tagging system, so that outsiders can know which of its articles can be trusted and which are just, y’know, there?

Long run, reputation should catch up to reality. . . .

I don’t think the long run has arrived yet. Almost all the press coverage of this study seemed to be taking the Lancet label as a sign of quality.

Speaking of reputations . . . the first author of the Lancet paper is from Harvard Medical School, which sounds pretty impressive, but then again we saw that seriously flawed paper that came out of Stanford Medical School, and a few months ago we heard about a bungled job from the University of California medical school. These major institutions are big places, and you can’t necessarily trust a paper just because it comes from a generally respected medical center.

Again, I haven’t looked at the article in detail, nor am I any kind of expert on hydro-oxy-chloro-whatever-it-is, so let me say one more time that outcome 1 above is still a real possibility to me. Just cos someone sends me some convincing-looking criticisms, and there are data availability problems, that doesn’t mean the paper is no good. There could be reasonable explanations for all of this.

Be careful when estimating years of life lost: quick-and-dirty estimates of attributable risk are, well, quick and dirty.

Peter Morfeld writes:

Global burden of disease (GBD) studies and environmental burden of disease (EBD) studies are supported by hundreds of scientifically well-respected co-authors, are published in high-level journals, are cited worldwide, and have a large impact on health institutions’ reports and related political discussions.

The main metrics used to calculate the impact of exposures on the health of populations are “numbers of premature deaths,” DALYs (“disability adjusted life years”), and YLLs (“Years of Life Lost”). This large and influential branch of science overlooks seminal papers published by Robins and Greenland in the 1980s. These papers have shown that “etiologic deaths” (premature deaths due to exposure) cannot be identified from epidemiological data alone, which entails that YLLs and DALYs cannot be broken down by age or endpoints (diseases). DALYs due to exposure are problematic when interpreted in a counterfactual setting. Thus, most of this influential GBD and EBD mainstream work is scientifically unjustified.

We published a paper on this issue (open access):

Hammitt JK, Morfeld P, Tuomisto JT, Erren TC. Premature Deaths, Statistical Lives, and Years of Life Lost: Identification, Quantification, and Valuation of Mortality Risks. Risk Anal. 2019 Dec 10. doi: 10.1111/risa.13427.

Just for some additional background, in case you’d like to comment on the issue: Here is a letter exchange in Lancet with the leader of the largest GBD (global burden of disease) project worldwide (Christopher Murray, Seattle).

This exchange is not covered in our paper. It may give an indication of how the arguments and bias calculations are received.

My only comment is that I still think QALYs (or DALYs or whatever) are a good unit of measurement. The problems above are not with QALYs but with intuitively appealing yet problematic statistical estimates of them. What joker put seven dog lice in my Iraqi fez box?
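For anyone who wants the Robins-Greenland identification problem in miniature, here’s a toy calculation with made-up death counts (not taken from any real study):

```python
# Toy numbers: 10,000 exposed and 10,000 unexposed people followed for the same period.
deaths_exposed, deaths_unexposed = 200, 100

excess_deaths = deaths_exposed - deaths_unexposed   # 100 "excess" deaths
excess_fraction = excess_deaths / deaths_exposed    # 0.5

# The excess fraction is only a lower bound on the etiologic fraction: exposure
# may also have hastened deaths that would have occurred within the follow-up
# period anyway, and those never show up as excess.
etiologic_deaths_lower = excess_deaths              # at least 100 ...
etiologic_deaths_upper = deaths_exposed             # ... and possibly all 200

print(excess_fraction, etiologic_deaths_lower, etiologic_deaths_upper)
```

Years of life lost depend on which deaths were advanced and by how much, and neither of those is identified by the two death counts above. That, as I understand it, is the core of the argument.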

P.S. That above-linked discussion also involves Ty Beal, whose name rang a bell . . . here it is!

Hydroxychloroquine update

Following up on our earlier post, James “not the cancer cure guy” Watson writes:

I [Watson] wanted to relay a few extra bits of information that have come to light over the weekend.

The study has only 4 authors, which is weird for a global study of 96,000 patients (and there are no acknowledgements at the end of the paper). Studies like this in medicine usually would have 50-100 authors (often in some kind of collaborative group). The data come from the “Surgical Outcomes Collaborative”, which is in fact a company. The CEO (Sapan Desai) is the second author. One of the comments on the blog post is “I was surprised to see that the data have not been analyzed using a hierarchical model”. But not only do they not use hierarchical modelling or adjust by hospital/country, they also give almost no information about the different hospitals: which countries (just continent level), how the treated vs not treated are distributed across hospitals, etc. A previous paper by the same group in NEJM says that they use data from UK hospitals (no private hospitals are treating COVID, so the data must be from the NHS). Who is allowing some random company to use NHS data and publish with no acknowledgments? Another interesting sentence is about patient consent and ethical approval:

The data collection and analyses are deemed exempt from ethics review.

We emailed them to ask for the data, in particular to look at the dose effect which I think is key in understanding the results. They got back to us very quickly and said

Thanks for your email inquiry. Our data sharing agreements with the various governments, countries and hospitals do not allow us to share data unfortunately. I do wish you all the very best as you continue to perform trials since that is the stance we advocate. All we have said is to cease and desist the off label and unmonitored and uncontrolled use of such therapy in hospitalized patients.

So unavailable data from unknown origins . . .

Another rather remarkable aspect is how beautifully uniform the aggregated data are across continents:

For example, smoking prevalence is nearly identical, between 9.4% and 10%, across all 6 continents. As they don’t tell us which countries are involved, it’s hard to see how this matches known smoking prevalences. Antiviral use is 40.5, 40.4, 40.7, 40.2, 40.8, and 38.4%. Remarkable! I didn’t realise that treatment was so well coordinated across the world. Diabetes and other co-morbidities don’t vary much either.

I [Watson] am not accusing the authors/data company of anything dodgy, but as they give almost no details about the study and “cannot share the data”, one has to look at things from a skeptical perspective.
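Just to put a rough number on how surprising that uniformity would be, here’s a quick simulation. The per-continent sample size and the amount of between-continent variation are my own guesses, not numbers from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Guesses: ~16,000 patients per continent (96,000 / 6), and true continent-level
# rates of (say) antiviral use centered at 40% with a 5-percentage-point sd.
n_per_continent = 16_000
true_rates = np.clip(rng.normal(0.40, 0.05, size=6), 0.0, 1.0)
observed_pct = 100 * rng.binomial(n_per_continent, true_rates) / n_per_continent

print(observed_pct.round(1))
```

With that kind of modest real-world heterogeneity, the six observed percentages typically spread over ten points or so, rather than the fraction-of-a-point spread reported for five of the six continents. Of course my assumed heterogeneity is just a guess; the puzzle is that the paper gives no breakdown that would let anyone check it.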

Again, I have not looked into this at all. I’m sharing this because open data is a big deal. Right now, hydroxychloroquine is a big deal too. And we know from experience that Lancet can make mistakes. Peer review is nothing at all compared to open review.

The authors of the paper in question, or anyone else who knows more, should feel free to share information in the comments.

Doubts about that article claiming that hydroxychloroquine/chloroquine is killing people

James Watson (no, not the one who said that cancer would be cured by 2000, and not this guy either) writes:

You may have seen the paper that came out on Friday in the Lancet on hydroxychloroquine/chloroquine in COVID19 hospitalised patients. It’s got quite a lot of media attention already.

This is a retrospective study using data from 600+ hospitals in the US and elsewhere with over 96,000 patients, of whom about 15,000 received hydroxychloroquine/chloroquine (HCQ/CQ) with or without an antibiotic. The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people.

This caught my eye, as an effect size that big should have been picked up pretty quickly in the interim analyses of randomized trials that are currently happening. For example, the RECOVERY trial has a hydroxychloroquine arm and they have probably enrolled ~1500 patients into that arm (~10,000 + total already). They will have had multiple interim analyses so far and the trial hasn’t been stopped yet.

The most obvious confounder is disease severity: this is a drug that is not recommended in Europe and the USA, so doctors give it as “compassionate use”. I.e. very sick patient, so why not try it just in case. Therefore the disease severity of the patients in the HCQ/CQ groups will be greater than that of the controls. The authors say that they adjust for disease severity, but actually they use just two binary variables: oxygen saturation and qSOFA score. The second one has actually been reported to be quite bad for stratifying disease severity in COVID. The biggest problem is that they include patients who received HCQ/CQ treatment up to 48 hours post admission. This means that someone who comes in OKish and then deteriorates rapidly could be much more likely to be given the drug than someone equally sick but stable. This temporal aspect cannot be picked up by a single severity measurement.

In short, seeing such huge effects really suggests that some very big confounders have not been properly adjusted for. What’s interesting is that the New England Journal of Medicine published a very similar study a few weeks ago where they saw no effect on mortality. Guess what, they had much more detailed data on patient severity.

One thing that the authors of the Lancet paper didn’t do, which they could have done: If HCQ/CQ is killing people, you would expect a dose (mg/kg) effect. There is very large variation in the doses that the hospitals are giving (e.g. for CQ the mean daily dose is 750 but standard deviation is 300). Our group has already shown that in chloroquine self-poisoning, death is highly predictable from dose (we used stan btw, very useful!). No dose effect would suggest it’s mostly confounding.

In short, it’s a pretty poor dataset and the results, if interpreted literally, could massively damage ongoing randomized trials of HCQ/CQ.

I have not read all these papers in detail, but in general terms I am sympathetic to Watson’s point that statistical adjustment (or, as is misleadingly stated in the cited article, “controlling for” confounding factors) is only as good as what you’re adjusting for.

Again speaking generally, there are many settings where we want to learn from observational data, and so we need to adjust for differences between treated and control groups. I’d rather see researchers try their best to do such adjustments, rather than naively relying on pseudo-rigorous “identification strategies” (as, notoriously, here). So I applaud the authors for trying. I guess the next step is to look more carefully at pre-treatment differences between the two groups.
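To see how far confounding by severity alone can take you, here’s a little simulation with made-up coefficients. This is not a reanalysis of the Lancet data, just a sketch of the mechanism Watson describes:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 96_000
severity = rng.normal(size=n)                            # unmeasured disease severity

# Sicker patients are more likely to get the drug (invented coefficients):
p_treat = 1 / (1 + np.exp(-(-2.0 + 1.5 * severity)))
treated = rng.binomial(1, p_treat)

# Mortality depends on severity only; the drug has exactly zero effect:
p_death = 1 / (1 + np.exp(-(-2.5 + 1.0 * severity)))
died = rng.binomial(1, p_death)

print(died[treated == 1].mean(), died[treated == 0].mean())
```

With these made-up numbers, the crude mortality in the treated group comes out at roughly two to three times that of the untreated group, even though the drug does nothing at all, simply because severity never made it into the adjustment set. The point isn’t that this is what happened; it’s that a comparison like this lives or dies on how well severity is measured and adjusted for.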

Are the (de-identified) data publicly available? That would help.

Also, when I see a paper published in Lancet, I get concerned, as they have a bit of a reputation for chasing headlines. I’m not saying that it is for political reasons that they published a paper on the dangers of hydroxychloroquine, but this sort of thing is always a concern when Lancet is involved.

P.S. More here.

“Banishing ‘Black/White Thinking’: A Trio of Teaching Tricks”

Richard Born writes:

The practice of arbitrarily thresholding p values is not only deeply embedded in statistical practice, it is also congenial to the human mind. It is thus not sufficient to tell our students, “Don’t do this.” We must vividly show them why the practice is wrong and its effects detrimental to scientific progress. I [Born] offer three teaching examples I have found to be useful in prompting students to think more deeply about the problem and to begin to interpret the results of statistical procedures as measures of how evidence should change our beliefs, and not as bright lines separating truth from falsehood.

He continues:

Humans are natural born categorizers. We instinctively take continuous variables and draw (often) arbitrary boundaries that allow us to put names to groups. For example, we divide the continuous visible spectrum up into discrete colors like “red,” “yellow,” and “blue.” And the body mass index (BMI) is a continuous measure of a person’s weight-to-height ratio, yet a brief scan of the Internet turns up repeated examples of the classification [into three discrete categories].

In some cases, such as for color, certain categories appear to be “natural,” as if they were baked into our brains (Rosch, 1973). In other cases, categorization is related to the need to make decisions, as is the case for many medical classifications. And the fact that we communicate our ideas using language—words being discrete entities—surely contributes to this tendency.

Nowhere is the tendency more dramatic—and more pernicious—than in the practice of null hypothesis significance testing (NHST), based on p values, where an arbitrary cutoff of 0.05 is used to separate “truth” from “falsehood.” Let us set aside the first obvious problem that in NHST we never accept the null (i.e., proclaim falsehood) but rather only fail to reject it. And let us also ignore the debate about whether we should change the cutoff to something more stringent, say 0.005 (Benjamin et al., 2018), and instead focus on what I consider to be the real problem: the cutoff itself. This is the problem I refer to as “black/white thinking.”

Because this tendency to categorize using p values is (1) natural and (2) abundantly reinforced in many statistics courses, we must do more than simply tell our students that it is wrong. We must show them why it is wrong and offer better ways of thinking about statistics. What follows are some practical methods I have found useful in classroom discussions with graduate students and postdoctoral fellows in neuroscience. . . .
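Here’s one more quick classroom illustration in the same spirit, not one of Born’s three examples: convert two p-values sitting on either side of the 0.05 threshold back to z-scores and look at how little the evidence actually differs.

```python
from scipy import stats

# Two hypothetical results, one just under and one just over the threshold:
for p in (0.049, 0.051):
    z = stats.norm.isf(p / 2)     # two-sided p-value converted back to a z-score
    print(f"p = {p:.3f}  ->  z = {z:.2f}")
```

The z-scores come out around 1.97 and 1.95: essentially identical evidence, but under black/white thinking one result gets called a discovery and the other a failure.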

Create your own community (if you need to)

Back in 1991 I went to a conference of Bayesians and I was disappointed that the vast majority seemed not to be interested in checking their statistical models. The attitude seemed to be, first, that model checking was not possible in a Bayesian context, and, second, that model checking was illegitimate because models were subjective. No wonder Bayesianism was analogized to a religion.

This all frustrated me, as I’d found model checking to be highly relevant in my own Bayesian work on two different research problems, one involving inference for emission tomography (which had various challenges arising from spatial models and positivity constraints), the other involving models for district-level election results.

The good news is that, in the years since our book Bayesian Data Analysis came out, a Bayesian community has developed that is more accepting of checking models by looking at their fit to data. Many challenges remain.

The point of this story is that sometimes you can work with an existing community, sometimes you have to create your own community, and sometimes it’s a mix. In this case, my colleagues and I did not try to create a community on our own; we very clearly piggybacked off the existing Bayesian community, which indeed included lots of people who were interested in checking model fit, once it became clear that this was a theoretically valid step.

P.S. For more on the theoretical status of model checking in Bayesian inference, see this 2003 paper, A Bayesian formulation of exploratory data analysis and goodness-of-fit testing and this 2018 paper, Visualization in Bayesian workflow.

P.P.S. Zad’s cat, pictured above, is doing just fine. He doesn’t need to create his own community.

New report on coronavirus trends: “the epidemic is not under control in much of the US . . . factors modulating transmission such as rapid testing, contact tracing and behavioural precautions are crucial to offset the rise of transmission associated with loosening of social distancing . . .”

Juliette Unwin et al. write:

We model the epidemics in the US at the state-level, using publicly available death data within a Bayesian hierarchical semi-mechanistic framework. For each state, we estimate the time-varying reproduction number (the average number of secondary infections caused by an infected person), the number of individuals that have been infected and the number of individuals that are currently infectious. We use changes in mobility as a proxy for the impact that NPIs and other behaviour changes have on the rate of transmission of SARS-CoV-2. We project the impact of future increases in mobility, assuming that the relationship between mobility and disease transmission remains constant. We do not address the potential effect of additional behavioural changes or interventions, such as increased mask-wearing or testing and tracing strategies.

Nationally, our estimates show that the percentage of individuals that have been infected is 4.1% [3.7%-4.5%], with wide variation between states. For all states, even for the worst affected states, we estimate that less than a quarter of the population has been infected; in New York, for example, we estimate that 16.6% [12.8%-21.6%] of individuals have been infected to date. Our attack rates for New York are in line with those from recent serological studies [1] broadly supporting our modelling choices.

There is variation in the initial reproduction number, which is likely due to a range of factors; we find a strong association of the initial reproduction number with both population density (measured at the state level) and the chronological date when 10 cumulative deaths occurred (a crude estimate of the date of locally sustained transmission).

Our estimates suggest that the epidemic is not under control in much of the US: as of 17 May 2020, the reproduction number is above the critical threshold (1.0) in 24 [95% CI: 20-30] states. Higher reproduction numbers are geographically clustered in the South and Midwest, where epidemics are still developing, while we estimate lower reproduction numbers in states that have already suffered high COVID-19 mortality (such as the Northeast). These estimates suggest that caution must be taken in loosening current restrictions if effective additional measures are not put in place.

We predict that increased mobility following relaxation of social distancing will lead to resurgence of transmission, keeping all else constant. We predict that deaths over the next two-month period could exceed current cumulative deaths by greater than two-fold, if the relationship between mobility and transmission remains unchanged. Our results suggest that factors modulating transmission such as rapid testing, contact tracing and behavioural precautions are crucial to offset the rise of transmission associated with loosening of social distancing.

Overall, we show that while all US states have substantially reduced their reproduction numbers, we find no evidence that any state is approaching herd immunity or that its epidemic is close to over.

One question I have is about the assumptions underlying “increased mobility following relaxation of social distancing.” Even if formal social distancing rules are relaxed, if the death rate continues, won’t enough people be scared enough that they’ll limit their exposure, thus reducing the rate of transmission? This is not to suggest that the epidemic will go away, just that maybe people’s behavior will keep the infections spreading at something like the current rate? Or maybe I’m missing something here.

The report and other information are available at their website.

Unwin writes:

Below is our usual three-panel plot showing our results for the five states we used as a case study in the report – we chose them because we felt they showed different responses across the US. New in this report, we estimate the number of people who are currently infectious over time – the difference between this and the number getting newly infected each day is quite stark.

We have also put the report on open review, which is an online platform enabling open reviews of scientific papers. It’s usually used for computer science conferences but Seth Flaxman has been in touch to partner with them to try it for a pre-print. If you’d like, click on the link and you can leave us a comment or recommend a reviewer.

Lots and lots of graphs (they even followed some of my suggestions, but I’m still concerned about the way that the upper ends of the uncertainty bounds are so visually prominent), and they fit a multilevel model in Stan, which I really think is the right way to go, as it allows a flexible workflow for model building, checking, and improvement.

You can make of the conclusions what you will: the model is transparent, so you should be able to map back from inferences to assumptions.
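If you want the flavor of the semi-mechanistic model without opening the report, here’s a stripped-down forward simulation of the renewal-equation core. All the numbers are toy values; the actual model works backward from observed deaths, with mobility informing R_t, inside a Bayesian hierarchical model fit in Stan:

```python
import numpy as np

T = 120
gen = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])   # toy generation-interval pmf (sums to 1)
R = np.where(np.arange(T) < 40, 2.0, 0.8)        # toy R_t: drops when mobility falls on day 40

infections = np.zeros(T)
infections[0] = 100.0                            # seeded infections
for t in range(1, T):
    s = min(t, len(gen))
    # renewal equation: new infections = R_t times a weighted sum of recent infections
    infections[t] = R[t] * np.sum(infections[t - s:t][::-1] * gen[:s])

ifr = 0.01                                       # toy infection fatality rate
delay = np.full(20, 1 / 20)                      # toy infection-to-death delay distribution
expected_deaths = ifr * np.convolve(infections, delay)[:T]
print(expected_deaths[::10].round(1))            # expected daily deaths, every 10th day
```

The fitted model runs this machinery in reverse, roughly speaking: given the death counts and a prior that ties R_t to mobility, it infers the infection curves, the time-varying reproduction numbers, and the share of each state’s population infected.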

But the top graph looked like such strong evidence!

I just posted this a few hours ago, but it’s such an important message that I’d like to post it again.

Actually, maybe we should just post nothing but the above graph every day, over and over again, for the next 20 years.

This is hugely important, one of the most important things we need to understand about statistics.

The top graph is what got published; the bottom graph is the preregistered replication from Joe Simmons and Leif Nelson.

Remember: Just cos you have statistical significance and a graph that shows a clear and consistent pattern, it doesn’t mean this pattern is real, in the sense of generalizing beyond your sample.

But the top graph looked like such strong evidence!

Sorry.

It’s so easy to get fooled. Think of all the studies with results that look like that top graph.