“Dangerous messaging around a recent SIDS study”: This time, don’t blame the news media.

Gary Smith points to this post by Peter Attia, who writes:

Over the last couple of weeks, I [Attia] have received several messages about a “miraculous” study which has sent social media and news outlets into a frenzy, proclaiming that researchers had identified the cause of Sudden Infant Death Syndrome (SIDS). A press release about the study went a step further, asserting that the “game-changing” findings “could prevent SIDS” and quoting lead study author Dr. Carmel Harrington in saying, “we can begin to change the outcome for these babies and make SIDS a thing of the past.”

Actually, though, Attia writes:

Dr. Harrington’s study did not find the cause of SIDS, and it certainly didn’t discover a way to prevent it. What it did find was a potential biomarker associated with SIDS risk. . . . The authors found that children who died from SIDS had lower BChE activity on average than either surviving children or children who had died of other causes.

In other words, low BChE activity appears to correlate with incidence of SIDS risk . . . in this case, correlation is hardly even correlation: the association between BChE activity and SIDS risk may be significant (P = 0.0014), but the effect size is so small that it calls into question whether BChE would even have value as a biomarker of SIDS risk. With so much overlap in BChE between the SIDS and non-SIDS children, screening newborns by this metric certainly couldn’t provide any guarantees one way or the other. Children with relatively high BChE could still die of SIDS and vice versa. . . .

The news media handled this one just fine

On the other hand, the media reports of this study seem pretty reasonable. I did a quick google and found:

New York Times: “New Research Offers Clues as to Why Some Babies Die of SIDS
The study could pave the way for newborn screening — but the results still need to be corroborated by further research.”

CNN: “Study identifies potential biomarker for SIDS, but a test for it is a long way off.”

NPR: No report on the study.

Attia writes:

I don’t blame tweeters for these misinterpretations; they’re trusting the statements of reputable news outlets and making logical conclusions based on what they’re hearing and reading.

He links to an article by Benjamin Mazer, “How a SIDS Study Became a Media Train Wreck,” but all that Mazer points to is a press release, something on twitter, a segment on Good Morning America, and an article in the New York Post.

I’m sure that Good Morning America and the New York Post do lots of good things, but neither one would usually be characterized as a “reputable news outlet.”

What I’m getting at is, yeah, Attia and Mazer seem correct that a misleading press release spread wildly on social media, leading to a bunch of really stupid tweets . . . but the legit media seemed to do just fine here. So, to call this a “media train wreck” is misleading. What’s interesting here is that the misinformation and confusion spread despite the responsible behavior of reputable news outlets.

To say this again, Attia writes, “I don’t blame tweeters for these misinterpretations; they’re trusting the statements of reputable news outlets . . .” Well, sorry, but I do blame the tweeters, and actually the reputable news outlets did just fine!

We’ve seen this for decades, actually. Whether it’s nutty conspiracy theories, wacky cancer cures, or flat-out political disinformation, the news media are often pretty good about keeping an even keel and sticking to the facts. But that doesn’t stop people from spreading the confusion. From the Bermuda Triangle and Laetrile in the 1970s to vaccine denial and election denial more recently, bad ideas can circulate. This latest SIDS thing just fits into a long history of incremental scientific steps being interpreted as miracle cures, and this happens whether or not credentialed hypesters are involved.

I get really annoyed when the news media engage in science hype. It seems only fair, then, to not blame them in those cases where it’s not their fault.

One more thing

Further googling reveals that Peter Attia has a 7-part series with discredited sleep activist Matthew Walker. What’s that about??? Seems like he had too much trust in reputable academia and the news media. I hope he can read up on the problems with Walker’s work and issue some sort of retraction.

Progress in post-publication review: error is found, author will correct it without getting mad at the people who pointed out the error

Valentin Amrhein points to this quote from a recent paper in the journal Physical and Engineering Sciences in Medicine:

Some might argue that there is confusion over the interpretation and usage of p-values. More likely, its’ value is misplaced and subsequently misused. The p-value is the probability of an effect or association (hereafter collectively referred to as ‘effect’), and is most often computed through testing a (null) hypothesis like “the mean values from two samples have been obtained by random sampling from the same normal populations”. The p-value is the probability of this hypothesis. Full-stop.

Hey, they mispunctuated “its”! But that’s the least of their problems. The p-value mistake is a classic example of people not knowing what they don’t know, expressing 100% confidence in something that’s completely garbled. Or, as they say in Nudgeland, they’ve discovered a new continent.
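
For the record, here is what a p-value actually is, in a minimal Python sketch. The numbers are made up for illustration (60 heads in 100 flips of a coin, with the null hypothesis that the coin is fair) and have nothing to do with the paper in question. The p-value is the probability, computed under the assumption that the null hypothesis is true, of seeing data at least as extreme as what was observed. It is not the probability that the hypothesis is true.

```python
from math import comb

def two_sided_binomial_p(heads, flips, p_null=0.5):
    """P(count at least as far from its expected value as observed | null is true)."""
    expected = flips * p_null
    observed_gap = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        # Add up the null probabilities of every outcome as extreme as the data.
        if abs(k - expected) >= observed_gap:
            total += comb(flips, k) * p_null**k * (1 - p_null)**(flips - k)
    return total

# Hypothetical data: 60 heads in 100 flips of a coin assumed fair under the null.
p_value = two_sided_binomial_p(60, 100)
print(f"p-value = {p_value:.4f}")
```

This comes out to about 0.057. Nothing in the calculation assigns a probability to the hypothesis itself; the fair-coin assumption is an input, not an output.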

But everybody makes mistakes. And it’s the nature of ignorance that it can be hard to notice your ignorance. It can happen to any of us.

The real test is what happens next: when people point out your error. The twitter people were kind enough to point out the mistake in the above article, and Sander Greenland informs me that the author will be sending a correction to the journal. So that’s good. It’s post-publication review working just the way it should. No complaining about Stasi or terrorists or anything, just a correction. Good to see this.

P.S. Confusion about p-values appears to have a long history. Greenland gives some references here.

Is Martha (Smith) still with us?

This post is by Phil Price, not Andrew.

It occurred to me a few weeks ago that I haven’t seen a comment by Martha (Smith) in quite a while…several months, possibly many months? She’s a long-time reader and commenter and often had interesting things to say. At times she also alluded to the fact that she was getting on in years. Perhaps she has simply lost interest in the blog, or in commenting on the blog…I hope that’s what it is. Martha, if you’re still out there, please let us know!

Axios II: Attack of the Clones

This story, along with the above picture, is pretty funny. I guess the previous man in the sequence is sitting in a rocking chair somewhere and the next one is at Little League practice. What really makes the photo work, beyond these guys all looking pretty similar, is (a) they’re dressed the same way, kinda like adorable twins whose parents buy them identical outfits, and (b) the monotonic pattern of hair loss, which reinforces the sequential look.

Really, though, those Axios dudes should stick to sports (background here).

A couple weeks ago we discussed this thing that used to happen, but we don’t see so much of anymore, which is writers being treated as all-purpose pundits, even though they had zero qualifications other than writing skills. An example would be Norman Mailer. This Axios thing is kind of the reverse: some bad writers offering writing tips. I’d say “selling” writing tips, but it really doesn’t count as selling if you’re actually paying people to buy your book.

Not that I’m one to talk, giving out all my prose for free here . . .

Hey—what’s up with that method from 1998 that was going to cure cancer in 2 years??

Our recent post about the non-existent Los Angeles tunnel reminded me of another bit of hype from the past: a claim by science hero James Watson in 1998 that cancer would be cured in two years. I was curious what was up with that so I googled and came across this news article from 2013 pointing to him writing, “We now have no general of influence, much less power … leading our country’s War on Cancer.”

B-b-b-ut . . . cancer had already been cured 13 years earlier! So what’s the problem? Maybe the “war” analogy is appropriate: the threat from Germany was stopped in 1918 but then twenty years later they had to do it all over again.

And this interview from 2016, where we learn that Watson at age 88 can still serve a tennis ball at 100 miles per hour. Also, he says, “I was pessimistic about curing cancer when gene-targeted drugs began to fail, but now I’m optimistic.”

Ahhh, now he’s optimistic. That’s good! The article continues with this juicy bit:

On what he sees as the best hope for treating and even curing advanced (metastatic) cancer: an experimental drug from Boston Biomedical (for which Watson is a paid consultant):

Papers have identified the gene STAT3, a transcription factor [that turns on other genes], as expressed in most kinds of cancer. It causes cancer cells to become filled with antioxidants [which neutralize many common chemotherapies]. In the presence of the experimental drug that targets STAT3, cancers become sensitive to chemotherapies like paclitaxel and docetaxel again. This is the most important advance in the last 40 years. It really looks like late-stage cancer will be partly stopped by a drug.

Hey, wait a second! In 1998, Watson was talking about a cure for cancer in two years. According to the news article from back then, he said that the developer of this cure “would be remembered along with scientists like Charles Darwin as someone who permanently altered civilization.” And that was less than 40 years previous.

So in what sense is this advance from 2016 “the most important advance in the last 40 years,” if only 18 years earlier there had been the advance that led his pal to be remembered along with Darwin etc etc.?

I’m all for people updating their opinions based on data, but if you’re gonna be hyping things that disappear, shouldn’t you at least acknowledge that you’ve changed your mind? I have the same feeling about this as I do about Stasi-guy and the Nudgelords memory-holing their hype of disgraced food scientist Brian Wansink (work that they earlier referred to as “masterpieces”).

As discussed in a previous thread:

The problem is that Watson was using “letter of recommendation” language rather than science language or journalism language. Letters of recommendation are full of B.S.; it’s practically required. For example, I recall years ago being asked to fill out a recommendation with options that went something like this:
– best student I’ve ever seen
– in top 1% of all students
– top 5%
– top 15%
– top 50%
– bottom 50%.
Top 15% is pretty good, right? But it’s on the bottom half of the scale. So to fill out this form in the way that’s expected of me, I had to do some mental gymnastics, where first I considered some narrow category where this particular student was excellent, and then I could honestly declare the student to be in the top 5%. I don’t remember who the student was, I just remember this annoying form which is pretty much demanding that I do some exaggeration.

The flip side of this is recommendations that are too honest. For example, I remember once receiving a letter of recommendation from an econ professor saying that a certain student was pretty good, not good enough for one of the top 8 programs but ok for anything from 9 through 20. This was just obnoxious, in no small part because of the ridiculous implication that the student could be evaluated at that level of precision, also in the assumption that qualifications could be summarized in a single dimension.

Anyway, the connection to Watson is that he was an academic administrator for many years, so I guess he got in the habit of writing letters that were full of hype so he could justify each hire and promotion he made, and so he could promote his students and postdocs to jobs elsewhere.

The trouble comes when the hype gets reported straight up, with no acknowledgement later that it didn’t happen as claimed. Same as with that tunnel story.

“Fill in the Blank Leads to More Citations”: Junk science and unavailable data at the European Heart Journal

Retraction Watch links to this post by Phil Davis, who writes:

Even a casual reader of the scientific literature understands that there is an abundance of papers that link some minor detail with article publishing — say, the presence of a punctuation mark, humor, poetic, or popular song reference in one’s title — with increased citations. Do X and you’ll improve your citation impact, the researchers advise.

Over the past few years, the latest trend in X leads to more citations research has focused on minor interventions in social media, like a tweet.

There is considerable rationale for why researchers would show so much enthusiasm for this type of research. Compared to a clinical trial on human subjects, these studies are far easier to conduct.


Davis continues:

As a social media-citation researcher, all you need is a computer, a free account, and a little time. This is probably why there are so many who publish papers claiming that some social media intervention (a tweet, a Facebook like) leads to more paper citations. And given that citation performance is a goal that every author, editor, and publisher wishes to improve, it’s not surprising that these papers get a lot of attention, especially on social media.

Well put. Next comes the example:

The latest of these claims that X leads to more citations, “Twitter promotion is associated with higher citation rates of cardiovascular articles: the ESC Journals Randomized Study” was published on 07 April 2022 in the European Heart Journal (EHJ), a highly-respected cardiology journal. The paper . . . claims that promoting a newly published paper in a tweet, increases its citation performance by 12% over two years. The EHJ also published this research in 2020 as a preliminary study, boasting a citation improvement by a phenomenal 43%.

OK, publicity helps, so this doesn’t seem so surprising to me. On the other hand, Davis writes,

Media effects on article performance are known to be small, if any, which is why a stunning report of 43% followed by a less stunning 12% should raise some eyebrows. The authors were silent about the abrupt change in results, as they were with other details missing from their paper. They were silent when I asked questions about their methods and analysis, and won’t share their dataset . . . According to the European Heart Journal’s author guidelines, all research papers are required to include a Data Availability Statement, meaning, a disclosure on how readers can get authors’ data when questions arise.

Hey, that’s not good! Davis continues:

Based on my own calculations, I suspect that the Twitter-citation study was acutely underpowered to detect the reported (43% and 12%) citation differences. Based on the journals, types of papers, and primary endpoint described in the paper, I calculated that the researchers would require at least a sample size of 6693 (3347 in each arm) to detect a 1 citation difference after 2 years (SD=14.6, power=80%, alpha=0.05, two-sided test), about ten-times the sample size reported in their paper (N=694). With 347 papers in each study arm, the researchers had a statistical power of just 15%, meaning they had just a 15% chance of discovering a non-null effect if one truly existed. For medical research, power is often set at 80% or 90% when calculating sample sizes. In addition, low statistical power often exaggerates effect sizes. In other words, small sample sizes (even if the sampling was done properly) tend to generate unreliable results that over-report true effects. This could be the reason why one of their studies reports a 43% citation benefit, while the other just 12%.
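
Davis’s numbers can be roughly checked with the textbook normal-approximation formula for comparing two means. Here is a minimal Python sketch. Davis does not say which formula or software he used, so the normal approximation is an assumption here, and the function names are made up for illustration. The inputs (a 1-citation difference, SD = 14.6, alpha = 0.05 two-sided, 80% power, 347 papers per arm) are taken from the quote above.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided comparison of two means (normal approximation)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / delta ** 2)

def power_two_sample(delta, sd, n, alpha=0.05):
    """Power of the same test when each arm has n observations."""
    se = sd * sqrt(2 / n)              # standard error of the difference in means
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    shift = delta / se                 # true difference in standard-error units
    # Probability of landing in either rejection region.
    return Z.cdf(shift - z_alpha) + Z.cdf(-shift - z_alpha)

needed = n_per_arm(delta=1, sd=14.6)
achieved = power_two_sample(delta=1, sd=14.6, n=347)
print(needed, round(achieved, 2))
```

Under these assumptions the sketch gives 3347 papers per arm and power of about 15%, in line with what Davis reports.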

And where are the data?

I [Davis] contacted the corresponding author . . . by email with questions about sample size calculation and secondary analyses (sent 5 May 2022) but received no response. One week later, I contacted him again to request a copy of his dataset (sent 10 May 2022), but also received no response. (I do have prior correspondences from [the author] from earlier this year.)

After another week of silence, I contacted the European Heart Journal’s Editor-in-Chief (EiC) on 16 May 2022, asking for the editorial office to become involved. I asked for a copy of the authors’ dataset and for the journal to publish an Editorial Expression of Concern for this paper. The response from the editorial office was an invitation to submit a letter to their Discussion Forum, outlining my concerns. If accepted, EHJ would publish my letter along with the author’s response in a future issue of the journal. The process could take a long time and there was no guarantee that I would end up with a copy of the dataset.

Yup. I’ve seen this sort of thing before. It’s too hard to publish criticisms and obtain data for replication.

Davis continues:

The European Heart Journal has clear rules for the reporting of scientific results and even clearer rules for the reporting of randomized controlled trials. The validity of the research results described in the paper is irrelevant. Required parts of the paper are missing. Even if my analysis had shown that the conclusions were justifiable, the authors clearly violated two EHJ policies.

I agree. This is like Columbia University’s attitude to being caught with their hand in the U.S. News rankings cookie jar. If you’re a public-serving institution and you break the rules, the appropriate response is to make things right, not to hide and hope the problem goes away on its own.

Davis followed up the next week:

More than a month after my initial request, the editorial office of the European Heart Journal provided me [Davis] a link to the authors’ data (as of this writing, there is no public link from the published paper). . . .

The statistics for analyzing and reporting the results of Randomized Controlled Trials (RCTs) are often very simple. Because treatment groups are made similar in all respects with the exception of the intervention, no fancy statistical analysis is normally required. This is why most RCTs are analyzed using simple comparisons of sample means or medians.

Deviating from this norm, the authors of the Twitter-citation paper used Poisson regression, a more complicated model that is very useful in some fields (e.g. economics) when analyzing data with lots of independent variables. However, Poisson regression is limited in its application because it comes with a big assumption — the mean value must equal the variance. When this assumption is violated, the researcher should use a more flexible model, like the Negative Binomial.

That’s right! We discuss this in chapter 15 of Regression and Other Stories. Good catch, Davis!

He continues:

Using Poisson regression on their data, I got the same results as reported (12% citation difference, 95% confidence interval 8% to 15%, p<0.0001), which appears to be a robust and statistically significant finding. However, the model fits the data very poorly. When I analyzed their dataset using a Negative Binomial model, the data were no longer significant (13%, 95% C.I. -5% to 33%, p=0.17). Yes, the estimate was close, but the confidence interval straddled zero. Using a common technique when dealing with highly-skewed data (normalizing the data with a log transformation) and employing a simple linear model also provided non-significant results (8%, 95% C.I. -7% to 25%, p=0.33). Similarly, a simple comparison of means (t-test) was non-significant (p=0.17), as was the non-parametric (signed-rank) equivalent (p=0.33). In sum, the only test that provided a statistically significant finding was the one where the model was inappropriate for the data.
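
To see why the Poisson interval can come out so much narrower, here is a minimal Python sketch using simulated counts. These are not the study’s data and this is not Davis’s analysis. The counts are drawn from a geometric distribution (a special case of the negative binomial) with mean 8 in both arms, so there is no true effect, and the mean of 8 is an arbitrary choice. Instead of fitting a negative binomial regression, the sketch compares two standard errors for the log ratio of means: the one implied by a Poisson model with a group indicator, and a delta-method one based on the sample variances.

```python
import math
import random
import statistics

random.seed(1)

def geometric_count(mean):
    """Draw an overdispersed count: geometric, with variance = mean * (1 + mean)."""
    p = 1 / (1 + mean)
    u = 1 - random.random()  # in (0, 1], so log(u) is always defined
    return int(math.log(u) / math.log(1 - p))

# Simulated, hypothetical citation counts with no true difference between arms.
n = 347
control = [geometric_count(8) for _ in range(n)]
tweeted = [geometric_count(8) for _ in range(n)]

def log_ratio_ses(y1, y0):
    """Standard errors of log(mean(y1) / mean(y0)): Poisson-model vs. variance-based."""
    m1, m0 = statistics.mean(y1), statistics.mean(y0)
    # A Poisson model with a group indicator assumes variance = mean.
    poisson_se = math.sqrt(1 / sum(y1) + 1 / sum(y0))
    # Delta method, using the variances actually observed in each arm.
    robust_se = math.sqrt(statistics.variance(y1) / (len(y1) * m1 ** 2)
                          + statistics.variance(y0) / (len(y0) * m0 ** 2))
    return poisson_se, robust_se

poisson_se, robust_se = log_ratio_ses(tweeted, control)
ratio = statistics.variance(control) / statistics.mean(control)
print(f"variance/mean in control arm: {ratio:.1f}")
print(f"Poisson-model SE: {poisson_se:.3f}, variance-based SE: {robust_se:.3f}")
```

When the variance is several times the mean, as it is by construction here, the Poisson-model standard error is too small by roughly the square root of that ratio, and the confidence interval is too narrow by the same factor.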


And more:

The authors didn’t register their protocols or even provide justification for a Poisson regression model in their preliminary paper. A description of how their sample size was determined was missing, as was a data availability statement — both are clear violations of the journal’s policy. The editorial office was kind enough to provide me with a personal link to the dataset, but it is still not public. . . . No one is willing to admit fault, and the undeclared connection of several authors with current or past EHJ editorial board roles raises questions about special treatment.

Davis concludes:

A tarnished reputation is deep and long-lasting. I hope the editors of EHJ understand what they are sacrificing with this paper.

I’d like to believe that, but I don’t know. Lancet has published lots of politically-motivated junk, and that doesn’t stop the news media and, I guess, medical researchers, from taking it seriously. Harvard Law School is full of plagiarists and it’s still “Harvard Law School.” Malcolm Gladwell keeps selling books. Dr. Oz is on track to enter the U.S. Congress. I guess the point is that reputations are complicated. This story will indeed tarnish the reputation of the European Heart Journal, but just by a little bit, which I guess is appropriate, as it’s just one little article, right? Also the journal’s behavior, while reprehensible, is absolutely standard: it’s the same way that just about any scholarly or scientific journal will act in this situation. Kinda like maybe one reason that other universities didn’t make more of a fuss about Columbia’s U.S. News thing is that maybe lots of them have skeletons in their data closets.

I’m glad that Davis did this deep dive. I’m just less confident that this incident will do much to tarnish the reputation of the European Heart Journal, or, for that matter, its parent organization, Oxford University Press. According to its webpage, the journal has an impact factor of 29.983. That’s what counts, right? The only question is whether this paper with the unavailable data, underpowered sample, and bad analysis will pull its weight by getting at least 30 citations. And, hey, after all this blogging, maybe it will!

Calling all epidemiology-ish methodology-ish folks!

I just wanted to share that my department, Epidemiology at the University of Michigan School of Public Health, has just opened up a search for a tenure-track Assistant Professor position.

We are looking in particular for folks who are pushing forward innovative epidemiological methodology, from causal inference and infectious disease transmission modeling to the ever-expanding world of “-omics”.

We’ll be reviewing applications starting October 12th; don’t hesitate to reach out to me ([email protected]) or the search committee ([email protected]) if you have any questions!

Also – you can find the posting here.

The “You’re only allowed to publish 2 or 3 journal articles per year” rule

Yesterday we discussed a news article on science journals being overwhelmed by fake research. Here was one quote from that article:

International biotechnology consultant Glenn Begley, who has been campaigning for more meaningful links between academia and industry, said research fraud was a story of perverse incentives. He wants researchers to be banned from producing more than two or three papers per year, to ensure the focus remained on quality rather than quantity.

At first I thought that was a horrible idea. Some of us have more than two or three things to say in a year! Is Begley trying to silence us??

But then I thought, sure, it all just depends on how you define “journal articles.” Instead of publishing 20 journal articles in a year, I could put 20 articles on Arxiv and just choose 3 of them to publish in journals. That would be kinda cool, actually, as it would save my collaborators and me huge amounts of effort in dealing with review reports, paperwork from journals, and so forth. My collaborators and I could write as much as we always do, just much more efficiently.

So I’m on board with Begley’s proposal. I’m not sure how it would be enforced, and I’m not planning to do it unilaterally, but I’m starting to warm to it as a general policy.

So, again, you’d be able to continue to publish 20 or more articles per year, but there’s a logic to saying that only 2 or 3 can be in “journals.” There are still a couple of details that need to be worked out, such as how to count coauthored articles and how to think about non-journal publications such as arxiv. The next step would be to replace journals by recommender systems, an idea I’ve proposed before (see also here). This would move us toward an equilibrium in which we publish 0 journal articles per year. Or a system where journals continue to exist but have no special status, in the same way that private schools such as Columbia continue to have some value but are not considered intrinsically better than public schools.

Some thoughts on academic research and advocacy

An economist who would prefer to remain anonymous writes:

There is an important question of the degree of legitimacy we grant to academic research as advocacy.

It is widely accepted, and I think also true, that the adversary system we have in courts, where one side is seeking to find and present every possible fact and claim that tends to mitigate the guilt of the accused, and the other side strives to find and present every possible fact and claim that tends to confirm the guilt of the accused, and a learned and impartial judge accountable to the public decides, is a good system that does a creditable job of reaching truth. [Tell that to the family of Nicole Brown Simpson. — ed.] (Examining magistrates probably do a better job, but jurisdictions with examining magistrates also have defense attorneys and DAs; they just allow the judge also to participate in the research project.) It is also widely accepted, and I think also true, that it is important that high standards of integrity should be imposed on lawyers to prevent them from presenting false and misleading cases. (Especially DAs.) There is no question of “forking paths” here, each side is explicitly tasked with looking for evidence for one side.

I don’t think that this is a bad model for academic policy research. Martin Feldstein, like most prominent academic economists from both right and left, was an advocate and did not make a secret of his political views or of their sources. He was also a solid researcher and was good at using data and techniques to reach results that confirmed (and occasionally conditioned) his views. The same is true of someone like Piketty from the social-democratic left, or Sam Bowles from a more Marxist perspective, or the farther-right Sam Peltzman from Chicago.

All these individuals were transparent and responsible advocates for a particular policy regime. They all carried out careful and fascinating research and all were able to learn from each other. This is the model that I see as the de facto standard for my profession (policy economics) and I think it is adequate and functional and sustainable.

Romer’s whole “mathiness” screed is not mostly about “Chicago economists are only interested in models that adopt assumptions that conform to their prejudices”, it is IMHO mostly about “Chicago economists work hard to hide the fact that they are only interested in models that adopt assumptions that conform to their prejudice”. I think Romer exaggerates a bit (that is his advocacy) but I agree that he makes an important point.

I’m coming at this as an outsider and have nothing to add except to point out the converse point, that honesty and transparency are not enough, and if you’re a researcher and come to a conclusion that wasn’t expected, that you weren’t aiming for, that you weren’t paid or ideologically predisposed to find, that doesn’t automatically mean that you’re right. I’ve seen this attitude a lot, of researchers thinking that their conclusions absolutely must be correct because they came as a surprise or because they’re counter to their political ideologies. But that’s not right either.

What’s the origin of “Two truths and a lie”?

Someone saw my recently-published article, “‘Two truths and a lie’ as a class-participation activity,” and wanted to know what is the origin of that popular game.

I have no idea when the game first appeared, who invented it, or what were its precursors. I don’t remember it being around when I was a kid, so at the very least I think its popularity has increased in recent decades, but I’d be interested in the full story.

Can anyone help?

The Economic Pit and the Political Pendulum: Predicting Midterm Elections

This post is written jointly with Chris Wlezien.

We were given the following prompt in July 2022 and asked to write a response: “The Economy, Stupid: Accepted wisdom has it that it’s the economy that matters to voters. Will other issues matter in the upcoming midterm elections, or will it really be all about the economy?”

In the Gallup poll taken that month, 35% of Americans listed the economy as one of the most important problems facing the country today, a value which is neither high nor low from a historical perspective. What does this portend for the 2022 midterm elections? The quick answer is that the economy is typically decisive for presidential elections, but midterm elections have traditionally been better described as the swinging of the pendulum, with the voters moving to moderate the party in power. This may partly reflect “thermostatic” public response (see here and here) to government policy actions.

It has long been understood that good economic performance benefits the incumbent party, and there’s a long history of presidents trying to time the business cycle to align with election years. Edward Tufte was among the first to seriously engage this possibility in his 1978 book, Political Control of the Economy, and other social scientists have taken up the gauntlet over the years. Without commenting on the wisdom of these policies, we merely note that even as presidents may try hard to set up favorable economic conditions for reelection, they do not always succeed, and for a variety of reasons: government is only part of the economic engine, the governing party does not control it all, and getting the timing right is an imperfect science.

To the extent that presidents are successful in helping ensure their own reelection, this may have consequences in the midterms. For example, one reason the Democrats lost so many seats in the 2010 midterms may be that Barack Obama in his first two years in office was trying to avoid Carter’s trajectory; his team seemed to want a slow recovery, at the cost of still being in recession in 2010, rather than pumping up the economy too quickly and then crashing by 2012 and paying the political price then.

In the 1970s and 1980s, Douglas Hibbs, Steven Rosenstone, James Campbell, and others established the statistical pattern that recent economic performance predicts presidential elections, and this is consistent with our general understanding of voters and their motivations. Economics is only part of the referendum judgment in presidential elections; consider that incumbents tend to do well even when economic conditions are not ideal. The factors that lead incumbent party candidates to do well in presidential elections also influence Congressional elections in those years, via coattails. So the economy matters here. This is less true in midterm elections. There, voters tend to balance the president, voting for and electing candidates of the “out-party.” This now is conventional wisdom. There is variation in the tendency, and this partly reflects public approval of the president, which offers some sense of how much people like what the president is doing – we are more (less) likely to elect members of the other party, the less (more) we approve of the president. The economy matters for approval, so it matters in midterm elections, but to a lesser degree than in presidential elections, and other things matter, including policy itself.

Whatever the specific causes, it is rare for voters to not punish the president at the midterm, and historically avoiding that punishment has taken very high approval ratings. The exceptions to midterm loss are 1998, after impeachment proceedings against popular Bill Clinton were initiated, and again in 2002, when the even more popular George W. Bush continued to benefit from the 9/11 rally effect, and gains in these years were slight. It has not happened since, and to find another case of midterm gain, one has to go all the way back to Franklin Roosevelt’s first midterm in 1934. This is consistent with voters seeing the president as directing the ship of state, with Congressional voting providing a way to make smaller adjustments to the course. Democrats currently control two of the three branches of government, so it may be natural for some voters to swing Republican in the midterms, particularly given Joe Biden’s approval numbers, which have been hovering in the low 40s.

Elections are not just referenda on the incumbent; the alternative also matters. This may seem obvious in presidential election years, but it is true in midterms as well. Consider that Republican attempts to impeach Bill Clinton may have framed the 1998 election as a choice rather than a “balancing” referendum, contributing to the (slight) Democratic gains in that year. We may see something similar in 2022, encouraged in part by the Supreme Court decision on abortion, among other rulings. Given that the court now is dominated by Republican appointees, some swing voters will want to maintain Democratic control of Congress as a way to check Republicans’ judicial power and future appointments to the courts. The choice in the midterms also may be accentuated by the reemergence of Donald Trump in the wake of the FBI raid of Mar-a-Lago.

Change in party control of Congress is the result of contests in particular districts and states. These may matter less as national forces have increased in importance in recent years, but the choices voters face in those contests still do matter when they go to the polls. In the current election cycle, there has been an increase in the retirements of Democratic incumbents as would be expected in a midterm with a Democratic president, but some of the candidates Republicans are putting up may not be best positioned to win. This is particularly true in the Senate, where candidates and campaigns matter more.

Some things, like cubes, tetrahedrons, and Venn diagrams, seem so simple and natural that it’s kind of a surprise when you learn that their supply is very limited.


I know I’ve read somewhere about the challenge of Venn diagrams with 4 or more circles, but I can’t remember the place. It seems like a natural for John Cook but I couldn’t find it on his blog, so I’ll just put it here.

Venn diagrams are misleading, in the sense that they work for n = 1, 2, and 3, but not for n > 3.

n = 1: A Venn diagram is just a circle. There are 2^1 options: in or out.

n = 2: A Venn diagram is two overlapping circles, with 2^2 options: A & B, not-A & B, A & not-B, not-A & not-B.

n = 3: Again, it works just fine. The 3 circles overlap and divide the plane into 8 regions.

n = 4: Venn FAIL. Throw down 4 overlapping circles and you don’t get 16 regions. You can do it with ellipses (here’s an example I found from a quick google) but it doesn’t have the pleasing symmetry of the classic three-circle Venn, and it takes some care both to draw and to interpret.

n = 5: There’s a pretty version here but it’s no longer made out of circles.

n > 5: Not much going on here. You can find examples like this which miss the point by not including all subsets, or examples like this which look kinda goofy.

The challenge here, I think, is that we have the intuition that if something works for n = 1, 2, and 3, it will work for general n. For the symmetric Venn diagram on the plane, though, no, it doesn’t work.
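Here is a quick counting sketch of why four circles fall short. It relies on a standard geometric fact that is not stated in the post itself: two distinct circles cross in at most 2 points, so the k-th circle you draw is cut into at most 2(k - 1) arcs and adds at most that many new regions. That gives an upper bound of n^2 - n + 2 regions for n circles (counting the outside), while a Venn diagram needs 2^n. The function names are just illustrative.

```python
def max_regions_from_circles(n: int) -> int:
    """Upper bound on the number of plane regions (outside included) made by n circles.

    Each new circle meets every earlier circle in at most 2 points, so the
    k-th circle adds at most 2 * (k - 1) regions. Summing gives n*n - n + 2.
    """
    if n == 0:
        return 1  # the empty plane is a single region
    return n * n - n + 2


def regions_needed_for_venn(n: int) -> int:
    """A Venn diagram needs one region per in/out combination of n sets."""
    return 2 ** n


for n in range(1, 7):
    have = max_regions_from_circles(n)
    need = regions_needed_for_venn(n)
    # Having enough regions is necessary, not sufficient; this only rules cases out.
    verdict = "count is enough" if have >= need else "too few regions"
    print(f"n={n}: circles give at most {have}, Venn needs {need} -> {verdict}")
```

The bound matches 2^n exactly for n = 1, 2, and 3, and then falls behind for good at n = 4 (14 versus 16), since a quadratic cannot keep up with an exponential. Ellipses escape the bound because two ellipses can cross in up to 4 points.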

Here’s an analogy: We all know about cubes. If at some point you see a tetrahedron and a dodecahedron, it would be natural to think that there are infinitely many regular polyhedra, just as there are infinitely many regular polygons. But, no, there are only 5 regular polyhedra.
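The count of 5 can be checked with a short enumeration. This is a sketch of the standard argument for convex regular polyhedra (the Platonic solids), not something derived in the post: if q regular p-gons meet at each vertex, their angles must sum to less than 360 degrees, which works out to 1/p + 1/q > 1/2. Euler's formula V - E + F = 2 then pins down the number of edges. The function name and the search bound are arbitrary choices for illustration.

```python
from fractions import Fraction


def regular_polyhedra(max_sides: int = 12):
    """List (p, q, V, E, F) for each convex regular polyhedron.

    p = sides per face, q = faces meeting at each vertex.
    The search bound max_sides is arbitrary; the inequality below
    already fails whenever p or q exceeds 5.
    """
    solids = []
    for p in range(3, max_sides + 1):
        for q in range(3, max_sides + 1):
            # Angle condition at a vertex, rewritten as 1/p + 1/q > 1/2.
            excess = Fraction(1, p) + Fraction(1, q) - Fraction(1, 2)
            if excess > 0:
                # Euler's formula with V = 2E/q and F = 2E/p gives E = 1/excess.
                edges = 1 / excess
                vertices = 2 * edges / q
                faces = 2 * edges / p
                solids.append((p, q, int(vertices), int(edges), int(faces)))
    return solids


for p, q, v, e, f in regular_polyhedra():
    print(f"{{{p},{q}}}: V={v}, E={e}, F={f}")
```

Only five (p, q) pairs pass the test: {3,3} (tetrahedron), {3,4} (octahedron), {3,5} (icosahedron), {4,3} (cube), and {5,3} (dodecahedron). Regular polygons face no such constraint, which is why there are infinitely many of them.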

Some things, like cubes, tetrahedrons, and Venn diagrams, seem so simple and natural that it’s kind of a surprise when you learn that their supply is very limited.

“The distinction between exploratory and confirmatory research cannot be important per se, because it implies that the time at which things are said is important”

This is Jessica. Andrew recently blogged in response to an article by McDermott arguing that pre-registration has costs like being unfair to junior scholars. I agree with his view that pre-registration can be a pre-condition for good science but not a panacea, and was not convinced by many of the reasons presented in the McDermott article for being skeptical about pre-registration. For example, maybe it’s true that requiring pre-registration would favor those with more resources, but the argument given seemed quite speculative. I kind of doubt the hypothesis that many researchers are trying out a whole bunch of studies and then pre-registering and publishing on the ones where things work out as expected. If anything, I suspect pre-pre-registration experimentation looks more like researchers starting with some idea of what they want to see and then tweaking their study design or definition of the problem until they get data they can frame as consistent with some preferred interpretation (a.k.a. design freedoms). Whether this is resource-intensive in a divisive way seems hard to comment on without more context. Anyway, my point in this post is not to further pile on the arguments in the McDermott critique, but to bring up certain more nuanced critiques of pre-registration that I have found useful for getting a wider perspective, and which all this reminded me of.

In particular, arguments that Chris Donkin gave in a talk in 2020 about work with Aba Szollosi on pre-registration (related papers here and here) caught my interest when I first saw the talk and have stuck with me. Among several points the talk makes, one is that pre-registration doesn’t deserve privileged status among proposed reforms because there’s no single strong argument for what problem it solves. The argument he makes is NOT that pre-registration isn’t often useful, both for transparency and for encouraging thinking. Instead, it’s that bundling up a bunch of reasons why preregistration is helpful (e.g., p-hacking, HARKing, blurred boundary between EDA and CDA) misdiagnoses the issues in some cases, and risks losing the nuance in the various ways that pre-registration can help. 

Donkin starts by pointing out how common arguments for pre-registration don’t establish privileged status. For example, if we buy the “nudge” argument that pre-registration encourages more thinking which ultimately leads to better research, then we have to assume that researchers by and large have all the important knowledge or wisdom they need to do good research inside of them; it’s just that they are somehow too rushed to make use of it. Another example: the argument that we need controlled error rates in confirmatory data analysis, and thus a clear distinction between exploratory and confirmatory research, implies that the time at which things are said is important. But if we take that seriously, we’re implying there’s somehow a causal effect of saying what we will find ahead of time that makes it more true later. In other domains, like criminal law, it would seem silly to argue that because an explanation was proposed after the evidence came in, it can’t be taken seriously.

The problem, Donkin argues, is that the role of theory is often overlooked in strong arguments for pre-registration. In particular, the idea that we need a sharp contrast between exploratory versus confirmatory data analysis doesn’t really make sense when it comes to testing theory. 

For instance, Donkin argues that we regularly pretend that we have a random sample in CDA, because that’s what gives it its validity, and the barebones statistical argument for pre-registration is that with EDA we no longer have a random sample, invalidating our inferences. However, in light of the importance of this assumption that we have a random sample in CDA, post-hoc analysis is critical to confirming that we do. We should be poking the data in whatever ways we can think up to see if we can find any evidence that the assumptions required of CDA don’t hold. If not, we shouldn’t trust any tests we run anyway. (Of course, one could preregister a bunch of preliminary randomization checks. But the point seems to be that there are activities that are essentially EDA-ish that can be done only when the data comes in, challenging the default). 

When we see pre-registration as “solving” the problem of EDA/CDA overlap, we invert an important distinction related to why we expect something that happened before to happen again. The reason it’s okay for us to rely on inductive reasoning like this is that we embed the inference in theory: the explanation motivates the reason why we expect the thing to happen again. Strong arguments for pre-registration as a fix for “bad” overlap imply that this inductive reasoning is the fundamental first principle, rather than being a tool embedded in our pursuit of better theory. In other words, taking preregistration too seriously as a solution implies we should put our faith in the general principle that the past repeats itself. But we don’t use statistics because they create valid inferences, but because they are a tool for creating good theories.

Overall, what Donkin seems to be emphasizing in this is that there’s a rhetorical risk to too easily accepting that pre-registration is the solution to a clear problem (namely, that EDA and CDA aren’t well separated). Despite the obvious p-hacking examples we may think of when we think about the value of pre-registration, buying too heavily into this characterization isn’t necessarily doing pre-registration a favor, because it’s hiding a lot of nuance in ways that pre-registration can help. For example, if you ask people why pre-registration is useful, different people may stress different reasons. If you give preregistration an elevated status for the supposed reason that it “solves” the problem of EDA and CDA not being well distinguished, then, similar to how any nuance in intended usage of NHST has been lost, you may lose the nuance of preregistration as an approach that can improve science, and increase pre-occupation with a certain way of (mis)diagnosing the problems. Devezer et al. (and perhaps others I’m missing) have also pointed out the slipperiness of placing too much faith in the EDA/CDA distinction. Ultimately, we need to be a lot more careful in stating what problems we’re solving with reforms like pre-registration.

Again, none of this is to take away from the value of pre-registration in many practical settings, but to point out some of the interesting philosophical questions thinking about it critically can bring up.

NBC TV series: The Irrational

Gur points us to the publicity for this new TV show about “a professor of behavioral science who lends his expertise to an array of high-stakes cases.” I’m waiting for the episode where he tracks down the case of the missing shredder. Maybe a scene where there’s a heist at the Museum of Scholarly Misconduct and the team has to use behavioral science to solve it. They could also have a cool scene where the hero makes a “move fast and break things” speech and explains how in boring academic science everything has to be replicated, but that in the real world of crime solving we sometimes have to fake the data to move forward. That’s how they did it in Mission Impossible, right??

Why do Dickens novels have all those coincidences?

Regular readers will know that I have an answer to the above question. My resolution is based on the statistics of sampling from networks.

I like my theory, but it’s not the only one.

Here’s what Gareth Rees had to say. I don’t disagree with Rees—and I appreciate that he links to TV Tropes!—I guess I’d say that my resolution (that coincidence is one way to resolve an inherent impossibility of conveying a complex social structure in a book with a small number of characters) and his resolution (that, in the period when Dickens was writing, coincidence was viewed as a plus, not an unfortunate byproduct of necessity, in the same way that nowadays we consider contrivances such as plot and climax to be a plus in storytelling) are complementary.

Any theories you have, feel free to share in the comments.

Geoff Dyer Kazuo Ishiguro Owen Sheers David Leavitt Veronica Geng

I guess the first of this series was George Orwell.

I’m talking about writers who lay down the prose with a clear directness, a crystal-clear declarative style. Gay Talese not quite, as he has a bit of a knowingly courtly style. Not Hemingway either, as he seems too mannered.

Veronica Geng doesn’t quite fit here but I think that’s how she would’ve written, had she written extended nonfiction.

Of the authors listed above, Ishiguro is the most famous for having a style that’s ostentatiously plain (as Dyer might put it), but I’d say the others have it too. Reading their books can make me uncomfortable, in the same way as if I’m talking with someone who I realize is staring into my eyes, and I’m not interested in a staring contest.

P.S. I just finished the latest Dyer book and, when placing it on the shelf, I flipped through his two books of essays—it seems appropriate to flip through a book by Dyer and not read it cover to cover—and then, when replacing those, came across a book of essays by John Gregory Dunne which was next to them on the shelf. This was another one not to read all the way through again, but I did come across this great line about movie critic Pauline Kael: “Reading her on film is like reading Lysenko on genetics—fascinating, unless you know something about genetics.”

P.P.S. Just to be clear, I don’t think there’s any obligation for authors to write in this plain style. Hemingway, Fitzgerald, Martin Amis, lots of writers have distinctive, even flashy, styles that work well. It’s good that we can encounter a range of writing styles.

What’s the difference between Derek Jeter and preregistration?

There are probably lots of clever answers to this one, but I’ll go with: One of them was hyped in the media as a clean-cut fresh face that would restore fan confidence in a tired, scandal-plagued entertainment cartel—and the other is a retired baseball player.

Let me put it another way. Derek Jeter had three salient attributes:

1. He was an excellent baseball player, rated by one source at the time of his retirement as the 58th best position player of all time.

2. He was famously overrated.

3. He was a symbol of integrity.

The challenge is to hold 1 and 2 together in your mind.

I was thinking about this after Palko pointed me to a recent article by Rose McDermott that begins:

Pre-registration has become an increasingly popular proposal to address concerns regarding questionable research practices. Yet preregistration does not necessarily solve these problems. It also causes additional problems, including raising costs for more junior and less resourced scholars. In addition, pre-registration restricts creativity and diminishes the broader scientific enterprise. In this way, pre-registration neither solves the problems it is intended to address, nor does it come without costs. Pre-registration is neither necessary nor sufficient for producing novel or ethical work. In short, pre-registration represents a form of virtue signaling that is more performative than actual.

I think this is like saying, “Derek Jeter is no Cal Ripken, he’s overrated, gets too much credit for being in the right place at the right time, he made the Yankees worse, his fans don’t understand how the game of baseball really works, and it was a bad idea to promote him as the ethical savior of the sport.”

Here’s what I think of preregistration: It’s a great idea. It’s also not the solution to problems of science. I have found preregistration to be useful in my own work. I’ve seen lots of great work that is not preregistered.

I disagree with the claim in the above-linked paper that “Under the guidelines of preregistration, scholars are expected to know what they will find before they run the study; if they get findings they do not expect, they cannot publish them because the study will not be considered legitimate if it was not preregistered.” I disagree with that statement in part for the straight-up empirical reason that it’s false; there are counterexamples; indeed a couple years ago we discussed a political science study that was preregistered and yielded unexpected findings which were published and were considered legitimate by the journal and the political science profession.

More generally, I think of preregistration as a floor, not a ceiling. The preregistered data collection and analysis is what you need to do. In addition, you can do whatever else you want.

Preregistration remains overrated if you think it’s gonna fix science. Preregistration facilitates the conditions for better science, but if you preregister a bad design, it’s still a bad design. Suppose you could go back in time and preregister the collected work of the beauty-and-sex-ratio guy, the ESP guy, and the Cornell Food and Brand Lab guy, and then do all those studies. The result wouldn’t be a spate of scientific discoveries; it would just be a bunch of inconclusive results, pretty much no different than the inconclusive results we actually got from that crowd but with the improvement that the inconclusiveness would have been more apparent. As we’ve discussed before, the benefits of procedural reforms such as preregistration are indirect—making it harder for scientists to fool themselves and others with bad designs—but not direct. Are these indirect benefits greater than the costs? I don’t know; maybe McDermott is correct that they’re not. I guess it depends on the context.

I think preregistration can be valuable, and I say that while recognizing that it’s been overrated and inappropriately sold as a miracle cure for scientific corruption. As I wrote a few years ago:

In the long term, I believe we as social scientists need to move beyond the paradigm in which a single study can establish a definitive result. In addition to the procedural innovations [of preregistration and mock reports], I think we have to more seriously consider the integration of new studies with the existing literature, going beyond the simple (and wrong) dichotomy in which statistically significant findings are considered as true and nonsignificant results are taken to be zero. But registration of studies seems like a useful step in any case.

Derek Jeter was overrated. He was at times a drag on the Yankees’ performance. He was still an excellent player and overall was very much a net positive.

“Merchants of doubt” operating within organized science

I came across this post from 2018, where Dorothy Bishop wrote:

In Merchants of Doubt, Eric Conway and Naomi Oreskes describe how raising doubt can be used as an effective weapon against inconvenient science. On topics such as the effects of tobacco on health, climate change and causes of acid rain, it has been possible to delay or curb action to tackle problems by simply emphasising the lack of scientific consensus. . . .

The parallels with Merchants of Doubt occurred to me as I re-read the critique by Gilbert et al of the classic paper by the Open Science Collaboration (OSC) on ‘Estimating the reproducibility of psychological science’. I was prompted to do so because we were discussing the OSC paper in a journal club* and inevitably the question arose as to whether we needed to worry about reproducibility, in the light of the remarkable claim by Gilbert et al: ‘We show that OSC’s article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is also consistent with the opposite conclusion — that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%.’

That article by Gilbert et al. is indeed absolute crap and was definitively refuted by Brian Nosek and Elizabeth Gilbert (no relation to the author of Gilbert et al.) back when it came out in 2016; see more here.

In 2018 Bishop did a literature search and saw that the crappy paper by Gilbert et al. was indeed cited a bunch of times; she writes:

The strong impression was that the authors of these papers lacked either the appetite or the ability to engage with the detailed arguments in the critique, but had a sense that there was a debate and felt that they should flag this up. That’s when I started to think about Merchants of Doubt: whether intentionally or not, Gilbert et al had created an atmosphere of uncertainty to suggest there is no consensus on whether or not psychology has a reproducibility problem – people are left thinking that it’s all very complicated and depends on arguments that are only of interest to statisticians. This makes it easier for those who are reluctant to take action to deal with the issue.

The news is not all bad, though:

Fortunately, it looks as if Gilbert et al’s critique has been less successful than might have been expected, given the eminence of the authors. This may in part be because the arguments in favour of change are founded not just on demonstrations such as the OSC project, but also on logical analyses of statistical practices and publication biases that have been known about for years . . . social media allows a rapid evaluation of claims and counter-claims that hitherto was not possible when debate was restricted to and controlled by journals. The publication this week of three more big replication studies just heaps on further empirical evidence that we have a problem that needs addressing. Those who are saying ‘nothing to see here, move along’ cannot retain any credibility.

Let’s not forget, though: back when the replication crisis was blowing up, the merchants of doubt were spraying gallons and gallons of squid ink into the scientific literature. And those were just the active participants in the game; as Alexey Guzey and I discuss, a big chunk of the leadership of academic psychology has just looked away from non-replication, research misconduct, and even outright fraud.

It goes like this:

1. Lots of people are doing research using flawed methods, resulting in some literatures that are scientific dead ends.

2. Some of this bad work gets lots of publicity, which leads to scrutiny, which leads to the widespread realization of methodological problems, along with various specific examples of flawed work and failed replications.

3. There’s concern, not just about a few bad apples, but about entire subfields.

4. The merchants of doubt come in and argue that the replication rate in psychology is “statistically indistinguishable from 100%.”

5. That claim is ridiculous, but it creates just enough plausible doubt for the powers-that-be to continue business as usual.

I’m guessing that Gilbert et al. are sincere in their doubts, in that they’re true believers that their work is serving the world and that Bishop and the rest of us are just a bunch of haters; also they (Gilbert et al.) are in that methodological uncanny valley in which they know just enough statistics to think of themselves as experts but not enough to know that they don’t know what they’re talking about. That doesn’t really matter, though, as long as they can play the useful role of providing leaders in their field with covering fire, an excuse to deny the problems. As Guzey and I wrote, quantitative analysis when used unscrupulously can serve as a sort of squid ink that hides the holes in scientific reasoning, and it is the role of statisticians (and, more generally, quantitative researchers) to be bothered by this when it happens.