“U.S. Watchdog Halts Studies at N.Y. Psychiatric Center After a Subject’s Suicide”

I don’t know anyone involved in this story and don’t really have anything to add. I just wanted to post on it because it sits at the intersection of science, statistics, and academia. The New York State Psychiatric Institute is involved in a lot of the funded biostatistics research at Columbia University. Ultimately we want to save lives and improve people’s health, but in the meantime we do the work and take the funding without always thinking too much about the people involved. I don’t have any specific study in mind here; I’m just thinking in general terms.

This empirical paper has been cited 1616 times but I don’t find it convincing. There’s no single fatal flaw, but the evidence does not seem so clear. How to think about this sort of thing? What to do? First, accept that evidence might not all go in one direction. Second, make lots of graphs. Also, an amusing story about how this paper is getting cited nowadays.

1. When can we trust? How can we navigate social science with skepticism?

2. Why I’m not convinced by that Quebec child-care study

3. Nearly 20 years on

1. When can we trust? How can we navigate social science with skepticism?

The other day I happened to run across a post from 2016 that I think is still worth sharing.

Here’s the background. Someone pointed me to a paper making the claim that “Canada’s universal childcare hurt children and families. . . . the evidence suggests that children are worse off by measures ranging from aggression to motor and social skills to illness. We also uncover evidence that the new child care program led to more hostile, less consistent parenting, worse parental health, and lower‐quality parental relationships.”

I looked at the paper carefully and wasn’t convinced. In short, the evidence went in all sorts of different directions, and I felt that the authors had been trying too hard to fit it all into a consistent story. It’s not that the paper had fatal flaws—it was not at all in the category of horror classics such as the beauty-and-sex-ratio paper, the ESP paper, the himmicanes paper, the air-rage paper, the pizzagate papers, the ovulation-and-voting paper, the air-pollution-in-China paper, etc etc etc.—it just didn’t really add up to me.

The question then is, if a paper can appear in a top journal, have no single killer flaw but still not be convincing, can we trust anything at all in the social sciences? At what point does skepticism become nihilism? Must I invoke the Chestertonian principle on myself?

I don’t know.

What I do think is that the first step is to carefully assess the connection between published claims, the analysis that led to these claims, and the data used in the analysis. The above-discussed paper has a problem that I’ve seen a lot, which is an implicit assumption that all the evidence should go in the same direction, a compression of complexity which I think is related to the cognitive illusion that Tversky and Kahneman called “the law of small numbers.” The first step in climbing out of this sort of hole is to look at lots of things at once, rather than treating empirical results as a sort of big bowl of fruit where the researcher can just pick out the juiciest items and leave the rest behind.

2. Why I’m not convinced by that Quebec child-care study

Here’s what I wrote on that paper back in 2016:

Yesterday we discussed the difficulties of learning from a small, noisy experiment, in the context of a longitudinal study conducted in Jamaica where researchers reported that an early-childhood intervention program caused a 42%, or 25%, gain in later earnings. I expressed skepticism.

Today I want to talk about a paper making an opposite claim: “Canada’s universal childcare hurt children and families.”

I’m skeptical of this one too.

Here’s the background. I happened to mention the problems with the Jamaica study in a talk I gave recently at Google, and afterward Hal Varian pointed me to this summary by Les Picker of a recent research article:

In Universal Childcare, Maternal Labor Supply, and Family Well-Being (NBER Working Paper No. 11832), authors Michael Baker, Jonathan Gruber, and Kevin Milligan measure the implications of universal childcare by studying the effects of the Quebec Family Policy. Beginning in 1997, the Canadian province of Quebec extended full-time kindergarten to all 5-year olds and included the provision of childcare at an out-of-pocket price of $5 per day to all 4-year olds. This $5 per day policy was extended to all 3-year olds in 1998, all 2-year olds in 1999, and finally to all children younger than 2 years old in 2000.

(Nearly) free child care: that’s a big deal. And the gradual rollout gives researchers a chance to estimate the effects of the program by comparing, at each age, children who were and were not eligible for the program.

The summary continues:

The authors first find that there was an enormous rise in childcare use in response to these subsidies: childcare use rose by one-third over just a few years. About a third of this shift appears to arise from women who previously had informal arrangements moving into the formal (subsidized) sector, and there were also equally large shifts from family and friend-based child care to paid care. Correspondingly, there was a large rise in the labor supply of married women when this program was introduced.

That makes sense. As usual, we expect elasticities to be between 0 and 1.

But what about the kids?

Disturbingly, the authors report that children’s outcomes have worsened since the program was introduced along a variety of behavioral and health dimensions. The NLSCY contains a host of measures of child well being developed by social scientists, ranging from aggression and hyperactivity, to motor-social skills, to illness. Along virtually every one of these dimensions, children in Quebec see their outcomes deteriorate relative to children in the rest of the nation over this time period.

More specifically:

Their results imply that this policy resulted in a rise of anxiety of children exposed to this new program of between 60 percent and 150 percent, and a decline in motor/social skills of between 8 percent and 20 percent. These findings represent a sharp break from previous trends in Quebec and the rest of the nation, and there are no such effects found for older children who were not subject to this policy change.

Also:

The authors also find that families became more strained with the introduction of the program, as manifested in more hostile, less consistent parenting, worse adult mental health, and lower relationship satisfaction for mothers.

I just find all this hard to believe. A doubling of anxiety? A decline in motor/social skills? Are these day care centers really that horrible? I guess it’s possible that the kids are ruining their health by giving each other colds (“There is a significant negative effect on the odds of being in excellent health of 5.3 percentage points.”)—but of course I’ve also heard the opposite, that it’s better to give your immune system a workout than to be preserved in a bubble. They also report “a policy effect on the treated of 155.8% to 394.6%” in the rate of nose/throat infection.

OK, here’s the research article.

The authors seem to be considering three situations: “childcare,” “informal childcare,” and “no childcare.” But I don’t understand how these are defined. Every child is cared for in some way, right? It’s not like the kid’s just sitting out on the street. So I’d assume that “no childcare” is actually informal childcare: mostly care by mom, dad, sibs, grandparents, etc. But then what do they mean by the category “informal childcare”? If parents are trading off taking care of the kid, does this count as informal childcare or no childcare? I find it hard to follow exactly what is going on in the paper, starting with the descriptive statistics, because I’m not quite sure what they’re talking about.

I think what’s needed here is some more comprehensive organization of the results. For example, consider this paragraph:

The results for 6-11 year olds, who were less affected by this policy change (but not unaffected due to the subsidization of after-school care) are in the third column of Table 4. They are largely consistent with a causal interpretation of the estimates. For three of the six measures for which data on 6-11 year olds is available (hyperactivity, aggressiveness and injury) the estimates are wrong-signed, and the estimate for injuries is statistically significant. For excellent health, there is also a negative effect on 6-11 year olds, but it is much smaller than the effect on 0-4 year olds. For anxiety, however, there is a significant and large effect on 6-11 year olds which is of similar magnitude as the result for 0-4 year olds.

The first sentence of the above excerpt has a cover-all-bases kind of feeling: if results are similar for 6-11 year olds as for 2-4 year olds, you can go with “but not unaffected”; if they differ, you can go with “less affected.” Various things are pulled out based on whether they are statistically significant, and they never return to the result for anxiety, which would seem to contradict their story. Instead they write, “the lack of consistent findings for 6-11 year olds confirm that this is a causal impact of the policy change.” “Confirm” seems a bit strong to me.

The authors also suggest:

For example, higher exposure to childcare could lead to increased reports of bad outcomes with no real underlying deterioration in child behaviour, if childcare providers identify negative behaviours not noticed (or previously acknowledged) by parents.

This seems like a reasonable guess to me! But the authors immediately dismiss this idea:

While we can’t rule out these alternatives, they seem unlikely given the consistency of our findings both across a broad spectrum of indices, and across the categories that make up each index (as shown in Appendix C). In particular, these alternatives would not suggest such strong findings for health-based measures, or for the more objective evaluations that underlie the motor-social skills index (such as counting to ten, or speaking a sentence of three words or more).

Health, sure: as noted above, I can well believe that these kids are catching colds from each other.

But what about that motor-skills index? Here are their results from the appendix:

[Screenshot of the paper’s appendix table: estimates for the individual items in the motor-social skills index.]

I’m not quite sure whether + or – is desirable here, but I do notice that the coefficients for “can count out loud to 10” and “spoken a sentence of 3 words or more” (the two examples cited in the paragraph above) go in opposite directions. That’s fine—the data are the data—but it doesn’t quite fit their story of consistency.

More generally, the data are addressed in a scattershot manner. For example:

We have estimated our models separately for those with and without siblings, finding no consistent evidence of a stronger effect on one group or another. While not ruling out the socialization story, this finding is not consistent with it.

This appears to be the classic error of interpreting the non-rejection of a null hypothesis as evidence that the null is true.
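To see why this is a problem, here’s a small numerical illustration. The numbers are made up (they are not from Baker et al.); the point is that a subgroup estimate that clears the significance threshold and one that doesn’t can be entirely consistent with each other, so “no consistent evidence of a stronger effect on one group” is very weak evidence of no difference:

```python
import math

# Hypothetical subgroup estimates, for illustration only (not from the paper):
# the same effect estimated separately for kids with and without siblings.
est_sib, se_sib = 0.20, 0.07      # roughly 2.9 se from zero: "significant"
est_nosib, se_nosib = 0.10, 0.08  # roughly 1.2 se from zero: "not significant"

# The relevant comparison is the *difference* between the two subgroups.
diff = est_sib - est_nosib
se_diff = math.sqrt(se_sib**2 + se_nosib**2)  # assumes independent subsamples

print(f"with siblings:    {est_sib:.2f} (se {se_sib:.2f})")
print(f"without siblings: {est_nosib:.2f} (se {se_nosib:.2f})")
print(f"difference:       {diff:.2f} (se {se_diff:.2f}), z = {diff / se_diff:.1f}")

# The difference is 0.10 with a standard error of about 0.11: nowhere near
# statistically significant.  "No consistent evidence of a stronger effect
# on one group" is consistent both with no difference at all and with a
# difference as large as the effects themselves.
```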

And here’s their table of key results:

[Screenshot of the paper’s table of key results.]

As quantitative social scientists we need to think harder about how to summarize complicated data with multiple outcomes and many different comparisons.

As I see it, the current standard ways to summarize this sort of data are:

(a) Focus on a particular outcome and a particular comparison (ideally, though not usually, chosen via preregistration), present that as the main finding, and then tag all else as speculation.

Or, (b) Construct a story that seems consistent with the general pattern in the data, and then extract statistically significant or nonsignificant comparisons to support your case.

Plan (b) is what was done in this paper, and I think it has problems: lots of stories can fit the data, and there’s a real push toward sweeping any anomalies aside.

For example, how do you think about that coefficient of 0.308 with standard error 0.080 for anxiety among the 6-11-year-olds? You can say it’s just bad luck with the data, or that the standard error calculation is only approximate and the real standard error should be higher, or that it’s some real effect caused by what was happening in Quebec in these years—but the trouble is that any of these explanations could be used just as well to explain the 0.234 with standard error 0.068 for 2-4-year-olds, which directly maps to one of their main findings.
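Just to spell out the arithmetic on those two coefficients (the point estimates and standard errors quoted above, taking the paper’s scale and sign convention as given):

```python
# Anxiety coefficients as quoted above: (estimate, standard error).
results = {
    "6-11 year olds (the comparison group)": (0.308, 0.080),
    "2-4 year olds (the treated group)": (0.234, 0.068),
}

for label, (est, se) in results.items():
    z = est / se
    lo, hi = est - 1.96 * se, est + 1.96 * se
    print(f"{label}: {est:.3f} (se {se:.3f}), z = {z:.1f}, "
          f"95% interval [{lo:.2f}, {hi:.2f}]")

# Both coefficients are 3-4 standard errors away from zero.  Any story that
# explains away the estimate for the older kids (bad luck, understated
# standard errors, something else going on in Quebec in those years) works
# just as well against the estimate that supports the headline finding.
```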

Once you start explaining away anomalies, there’s just a huge selection effect in which data patterns you choose to take at face value and which you try to dismiss.

So maybe approach (a) is better—just pick one major outcome and go with it? But then you’re throwing away lots of data, and that can’t be right.

I am unconvinced by the claims of Baker et al., but it’s not like I’m saying their paper is terrible. They have an identification strategy, and clean data, and some reasonable hypotheses. I just think their statistical analysis approach is not working. One trouble is that statistics textbooks tend to focus on stand-alone analyses—getting the p-value right, or getting the posterior distribution, or whatever, and not on how these conclusions fit into the big picture. And of course there’s lots of talk about exploratory data analysis, and that’s great, but EDA is typically not plugged into issues of modeling, data collection, and inference.

What to do?

OK, then. Let’s forget about the strengths and the weaknesses of the Baker et al. paper and instead ask, how should one evaluate a program like Quebec’s nearly-free preschool? I’m not sure. I’d start from the perspective of trying to learn what we can from what might well be ambiguous evidence, rather than trying to make a case in one direction or another. And lots of graphs, which would allow us to see more in one place, that’s much better than tables and asterisks. But, exactly what to do, I’m not sure. I don’t know whether the policy analysis literature features any good examples of this sort of exploration. I’d like to see something, for this particular example and more generally as a template for program evaluation.
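To give a sense of what I have in mind, here’s a minimal sketch of a grid of coefficient plots, using simulated estimates and standard errors (none of these numbers come from the paper): one panel per age group, one row per outcome, point estimates with 95% intervals, everything visible in one view.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# Simulated estimates and standard errors, purely for illustration.
outcomes = ["anxiety", "aggression", "hyperactivity", "motor/social",
            "excellent health", "injury"]
groups = ["0-4 year olds", "6-11 year olds"]
est = rng.normal(0, 0.15, size=(len(groups), len(outcomes)))
se = np.full_like(est, 0.08)

fig, axes = plt.subplots(1, len(groups), figsize=(8, 3),
                         sharex=True, sharey=True)
y = np.arange(len(outcomes))
for ax, group, e, s in zip(axes, groups, est, se):
    ax.errorbar(e, y, xerr=1.96 * s, fmt="o", capsize=2)  # 95% intervals
    ax.axvline(0, color="gray", linewidth=0.5)
    ax.set_title(group)
axes[0].set_yticks(y)
axes[0].set_yticklabels(outcomes)
fig.suptitle("All estimated effects in one view (simulated data)")
fig.tight_layout()
plt.show()
```

The real version would have more outcomes and more subgroups, but the principle is the same: every comparison gets displayed, not just the ones that cleared a significance threshold.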

3. Nearly 20 years on

So here’s the story. The article was based on data collected in the late 1990s, circulated as a preprint in 2005, was summarized in a press release in 2006, and was published in a top economics journal in 2008. I heard about it in 2016, from that 2006 press release. And here we are discussing it again in 2023.

It’s kind of beating a dead horse to discuss a 20-year-old piece of research, but you know what they say about dead horses. Also, according to Google Scholar, the article has 1616 citations, including 120 citations in 2023 alone, so, yeah, still worth discussing.

That said, not all the references refer to the substance of the paper. For example, the very first paper on Google Scholar’s list of citers is a review article, Explaining the Decline in the US Employment-to-Population Ratio, and when I searched to see what they said about this Canada paper (Baker, Gruber, and Milligan 2008), here’s what was there:

Additional evidence on the effects of publicly provided childcare comes from the province of Quebec in Canada, where a comprehensive reform adopted in 1997 called for regulated childcare spaces to be provided to all children from birth to age five at a price of $5 per day. Studies of that reform conclude that it had significant and long-lasting effects on mothers’ labor force participation (Baker, Gruber, and Milligan 2008; Lefebvre and Merrigan 2008; Haeck, Lefebvre, and Merrigan 2015). An important feature of the Quebec reform was its universal nature; once fully implemented, it made very low-cost childcare available for all children in the province. Nollenberger and Rodriguez-Planas (2015) find similarly positive effects on mothers’ employment associated with the introduction of universal preschool for three-year-olds in Spain.

They didn’t mention the bit about “the evidence suggests that children are worse off” at all! Indeed, they’re just kinda lumping this in with positive studies on “the effects of publicly provided childcare.” Yes, it’s true that this new article specifically refers to “similarly positive effects on mothers’ employment,” and that earlier paper, while negative about the effect of universal child care on kids, did say, “Maternal labor supply increases significantly.” Still, when it comes to sentiment analysis, that 2008 paper just got thrown into the positivity blender.

I don’t know how to think about this.

On one hand, I feel bad for Baker et al.: they did this big research project, they achieved the academic dream of publishing it in a top journal, it’s received 1616 citations and remains relevant today—but, when it got cited, its negative message was completely lost! I guess they should’ve given their paper a more direct title. Instead of “Universal Child Care, Maternal Labor Supply, and Family Well‐Being,” they should’ve called it something like: “Universal Child Care: Good for Mothers’ Employment, Bad for Kids.”

On the other hand, for the reasons discussed above, I don’t actually believe their strong claims about the child care being bad for kids, so I’m kinda relieved that, even though the paper is being cited, some of its message has been lost. You win some, you lose some.

Cohort effects in literature (David Foster Wallace and other local heroes)

I read this review by Patricia Lockwood of a book by David Foster Wallace. I’d never read the book being reviewed, but that was no problem because the review itself was readable and full of interesting things. What struck me was how important Wallace seemed to be to her. I’ve heard of Wallace and read one or two things by him, but from my perspective he’s just one of many, many writers, with no special position in the world. I think it’s a generational thing. Wallace hit the spot for people of Lockwood’s age, a couple decades younger than me. To get a sense of how Lockwood feels about Wallace’s writing, I’d have to consider someone like George Orwell or Philip K. Dick, who to me had special things to say.

My point about Orwell and Dick (or, for Lockwood, Wallace) is not that they stand out from all other writers. Yes, Orwell and Dick are great writers with wonderful styles and a lot of interesting things to say—but that description characterizes many many others, from Charles Dickens and Mark Twain through James Jones, Veronica Geng, Richard Ford, Colson Whitehead, etc etc. Orwell and Dick just seem particularly important to me; it’s hard to say exactly why. So there was something fascinating about seeing someone else write about a nothing-special (from my perspective) writer but with that attitude that, good or bad, he’s important.

It kinda reminds me of how people used to speculate on what sort of music would’ve been made by the Beatles had they not broken up. In retrospect, the question just seems silly: they were a group of musicians who wrote some great songs, lots of great songs have been written by others since then, and there’s no reason to think that future Beatles compositions would’ve been any more amazing than the fine-but-not-earthshaking songs they wrote on their own or that others were writing during that period. What’s interesting to me here is not to think about the Beatles but to put myself into that frame of mind in which the Beatles were so important that the question, What would they have done next?, seemed to matter so much.

That’s why I call Wallace, and some of the other writers discussed above, “local heroes,” with their strongest appeal localized in cohort and time rather than in space. “Voice of a generation” would be another way to put it, but I like the framing of locality because it opens the door to considering dimensions other than cohort and time.

Modest pre-registration

This is Jessica. In light of the hassles that can arise when authors make clear that they value pre-registration by writing papers about its effectiveness but then can’t find their own pre-registration, I have been re-considering how I feel about the value of the public aspects of pre-registration.

I personally find pre-registration useful, especially when working with graduate students (as I am almost always doing). It gets us to agree on what we are actually hoping to see and how we are going to define the key quantities we compare. I trust my Ph.D. students, but when we pre-register we are more likely to find the gaps between our goals and the analyses that we can actually do because we have it all in a single document that we know cannot be further revised after we start collecting data.

Shravan Vasishth put it well in a comment on a previous post:

My lab has been doing pre-registrations for several years now, and most of the time what I learned from the pre-registration was that we didn’t really adequately think about what we would do once we have the data. My lab and I are getting better at this now, but it took many attempts to do a pre-registration that actually made sense once the data were in. That said, it’s still better to do a pre-registration than not, if only for the experimenter’s own sake (as a sanity pre-check). 

The part I find icky is that as soon as pre-registration gets discussed outside the lab, it often gets applied and interpreted as a symbol that the research is rigorous. Like the authors who pre-register must be doing “real science.” But there’s nothing about pre-registration to stop sloppy thinking, whether that means inappropriate causal inference, underspecification of the target population, overfitting to the specific experimental conditions, etc.

The Protzko et al. example could be taken as unusual, in that we might not expect the average reviewer to feel the need to double check the pre-registration when they see that the author list includes Nosek and Nelson. On the other hand, we could see it as particularly damning evidence of how pre-registration can fail in practice, when some of the researchers we associate with the highest standards of methodological rigor do not appear to take the claims they make about what practices were followed seriously enough to make sure they can back them up when asked.

My skepticism about how seriously we should take public declarations of pre-registration is influenced by my experience as author and reviewer, where, at least in the venues I’ve published in, when you describe your work as pre-registered it wins points with reviewers, increasing the chances that someone will comment about the methodological rigor, that your paper will win an award, etc. However, I highly doubt the modal reviewer or reader is checking the preregistration. At least, no reviewer has ever asked a single question about the pre-registration in any of the studies I’ve ever submitted, and I’ve been using pre-registration for at least 5 or 6 years. I guess it’s possible they are checking it and it’s just all so perfectly laid out in our documents and followed to a T that there’s nothing to question. But I doubt that… surely at some point we’ve forgotten to fully report a pre-specified exploratory analysis, or the pre-registration wasn’t clear, or something else like that. Not a single question ever seems fishy.

Something I dislike about authors’ incentives when reporting on their methods in general is that reviewers (and readers) can often be unimaginative. So what the authors say about their work can set the tone for how the paper is received. I hate when authors describe their own work in a paper as “rigorous” or “highly ecologically valid” or “first to show” rather than just allowing the details to speak for themselves. It feels like cheap marketing. But I can understand why some do it, because one really can impress some readers by saying such things. Hence, points won for mentioning pre-registration, with no real checks and balances, can be a real issue.

How should we use pre-registration in light of all this? If nobody cares to do the checking, but extra credit is being handed out when authors slap the “pre-registered” label on their work, maybe we want to pre-register more quietly.

At the extreme, we could pre-register amongst ourselves, in our labs or whatever, without telling everyone about it. Notify our collaborators by email or slack or whatever else when we’ve pinned down the analysis plan and are ready to collect the data but not expect anyone else to care, except maybe when they notice that our research is well-engineered in general, because we are the kind of authors who do our best to keep ourselves honest and use transparent methods and subject our data to sensitivity analyses etc. anyways.

I’ve implied before on the blog that pre-registration is something I find personally useful but see externally as a gesture toward transparency more than anything else. If we can’t trust authors when they claim to pre-register, but we don’t expect the reviewing or reading standards in our communities to evolve to the point where checking to see what it actually says becomes mainstream, then we could just omit the signaling aspect altogether and continue to trust that people are doing their best. I’m not convinced we would lose much in such a world as pre-registration is currently practiced in the areas I work in. Maybe the only real way to fix science is to expect people to find reasons to be self-motivated to do good work. And if they don’t, well, it’s probably going to be obvious in other ways than just a lack of pre-registration. Bad reasoning should be obvious and if it’s not, maybe we should spend more time training students on how to recognize it.

But of course this seems unrealistic, since you can’t stop people from saying things in papers that they think reviewers will find relevant. And many reviewers have already shown they find it relevant to hear about a pre-registration. Plus, of course, the only real benefit we can say with certainty that pre-registration provides is that if one pre-registers, others can verify to what extent the analysis was planned beforehand and therefore less subject to authors exploiting degrees of freedom, so we’d lose this.

An alternative strategy is to be more specific about pre-registration while crowing about it less. Include the pre-registration link in your manuscript but stop with all the label-dropping that often occurs in the abstract, the introduction, and sometimes the title itself, describing how the study is pre-registered. (I have to admit, I have been guilty of this, but from now on I intend to remove such statements from papers I’m on.)

Pre-registration statements should be more specific, in light of the fact that we can’t expect reviewers to catch deviations themselves. E.g., if you follow your pre-registration to a T, say something like “For each of our experiments, we report all sample sizes, conditions, data exclusions, and measures for the main analyses that were described in our pre-registration documents. We do not report any analyses that were not included in our pre-registration.” That makes it clear what you are knowingly claiming regarding the pre-registration status of your work. 

Of course, some people may say reasonably specific things even when they can’t back them up with a pre-registration document. But being specific at least acknowledges that a pre-registration is actually a bundle of details that we must mind if we’re going to claim to have done it, because they should impact how it’s assessed. Plus maybe the act of typing out specific propositions would remind some authors to check what their pre-registration actually says. 

If you don’t follow your pre-registration to a T, which I’m guessing is more common in practice, then there are a few strategies I could see using:

Put in a dedicated paragraph before you describe results detailing all deviations from what you pre-registered. If it’s a whole lot of stuff, perhaps the act of writing this paragraph will convince you to just skip reporting on the pre-registration altogether because it clearly didn’t work out. 

Label each individual comparison/test as pre-registered versus not as you walk through the results. Personally I think this makes things harder to keep track of than a single dedicated paragraph, but maybe there are occasionally situations where it’s better.

(back to basics:) How is statistics relevant to scientific discovery?

Following up on today’s post, “Why I continue to support the science reform movement despite its flaws,” it seems worth linking to this post from 2019, about the way in which some mainstream academic social psychologists have moved beyond denial, to a more realistic view that accepts that failure is a routine, indeed inevitable part of science, and that just because a claim is published, even in a prestigious journal, doesn’t mean it has to be correct:

Once you accept that the replication rate is not 100%, nor should it be, and once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists whom academic insiders used to refer to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is, ultimately, the discovery of truths that would eventually be discovered by someone else (a speeding along of a process that we’d hope would happen anyway), so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery. . . .

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. . . .

Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start. . . .

What, then, is—or should be—the role of statistics, and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making.

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work. . . .

We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

– Reduce the costs of failed experimentation by being more clear when research-based claims are speculative.

– React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

– Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

– Avoid the two-tier system. Give respect to a student project or arXiv paper just as you would to a paper published in Science or Nature.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

There’s more at the link, and also let me again plug my recent article, Before data analysis: Additional recommendations for designing experiments to learn about the world.

Why I continue to support the science reform movement despite its flaws

I was having a discussion with someone about problems with the science reform movement (as discussed here by Jessica), and he shared his opinion that “Scientific reform in some corners has elements of millenarian cults. In their view, science is not making progress because of individual failings (bias, fraud, qrps) and that if we follow a set of rituals (power analysis, preregistration) devised by the leaders then we can usher in a new era where the truth is revealed (high replicability).”

My quick reaction was that this reminded me of an annoying thing where people use “religion” as a term of insult. When this came up before, I wrote that maybe it’s time to retire use of the term “religion” to mean “uncritical belief in something I disagree with.”

But then I was thinking about this all from another direction, and I think there’s something there there. Not the “millenarian cults” thing, which I think was an overreaction on my correspondent’s part.

Rather, I see a paradox. From his perspective, my correspondent sees the science reform movement as having a narrow perspective, an enforced conformity that leads it into unforced errors such as publishing a high-profile paper promoting preregistration without actually itself following preregistered analysis plans. OK, he doesn’t see all of the science reform movement as being so narrow—for one thing, I’m part of the science reform movement and I wasn’t part of that project!—but he sees some core of the movement as being stuck in narrow rituals and leader-worship.

But I think it’s kind of the opposite. From my perspective, the core of the science reform movement (the Open Science Framework, etc.) has had to make all sorts of compromises with conservative forces in the science establishment, especially within academic psychology, in order to keep them on board. To get funding, institutional support, buy-in from key players, . . . that takes a lot of political maneuvering.

I don’t say this lightly, and I’m not using “political” as a put-down. I’m a political scientist, but personally I’m not very good at politics. Politics takes hard work, requiring lots of patience and negotiation. I’m impatient and I hate negotiation; I’d much rather just put all my cards face-up on the table. For some activities, such as blogging and collaborative science, these traits are helpful. I can’t collaborate with everybody, but when the connection’s there, it can really work.

But there’s more to the world than this sort of small-group work. Building and maintaining larger institutions, that’s important too.

So here’s my point: Some core problems with the open-science movement are not a product of cult-like groupthink. Rather, it’s the opposite: this core has been structured out of a compromise with some groups within psychology who are tied to old-fashioned thinking, and this politically-necessary (perhaps) compromise has led to some incoherence, in particular the attitude or hope that, by just including some preregistration here and getting rid of some questionable research practices there, everyone could pretty much continue with business as usual.

Summary

The open-science movement has always had a tension between burn-it-all-down and here’s-one-quick-trick. Put them together and it kinda sounds like a cult that can’t see outward, but I see it as more the opposite, as an awkward coalition representing fundamentally incoherent views. But both sides of the coalition need each other: the reformers need the old institutional powers to make a real difference in practice, and the oldsters need the reformers because outsiders are losing confidence in the system.

The good news

The good news for me is that both groups within this coalition should be able to appreciate frank criticism from the outside (they can listen to me scream and get something out of it, even if they don’t agree with all my claims) and should also be able to appreciate research methods: once you accept the basic tenets of the science reform movement, there are clear benefits to better measurement, better design, and better analysis. In the old world of p-hacking, there was no real reason to do your studies well, as you could get statistical significance and publication with any old random numbers, along with a few framing tricks. In the new world of science reform (even imperfect science reform), this sort of noise mining isn’t so effective, and traditional statistical ideas of measurement, design, and analysis become relevant again.

So that’s one reason I’m cool with the science reform movement. I think it’s in the right direction: its dot product with the ideal direction is positive. But I’m not so good at politics so I can’t resist criticizing it too. It’s all good.

Reactions

I sent the above to my correspondent, who wrote:

I don’t think it is a literal cult in the sense that carries the normative judgments and pejorative connotations we usually ascribe to cults and religions. The analogy was more of a shorthand to highlight a common dynamic that emerges when you have a shared sense of crisis, ritualistic/procedural solutions, and a hope that merely performing these activities will get past the crisis and bring about a brighter future. This is a spot where group-think can, and at times possibly should, kick in. People don’t have time to each individually and critically evaluate the solutions, and often the claim is that they need to be implemented broadly to work. Sometimes these dynamics reflect a real problem with real solutions, sometimes they’re totally off the rails. All this is not to say I’m opposed to scientific reform; I’m very much for it in the general sense. There’s no shortage of room for improvement in how we turn observations into understanding, from improving statistical literacy and theory development to transparency and fostering healthier incentives. I am, however, wary of the uncritical belief that the crisis is simply one of failed replications and that the performance of “open science rituals” is sufficient for reform, across the breadth of things we consider science. As a minor point, I don’t think the vast majority of prominent figures in open science intend for these dynamics to occur, but I do think they all should be wary of them.

There does seem to be a problem that many researchers are too committed to the “estimate the effect” paradigm and don’t fully grapple with the consequences of high variability. This is particularly disturbing in psychology, given that just about all psychology experiments study interactions, not main effects. Thus, a claim that effect sizes don’t vary much is a claim that effect sizes vary a lot in the dimension being studied, but have very little variation in other dimensions. Which doesn’t make a lot of sense to me.

Getting back to the open-science movement, I want to emphasize the level of effort it takes to conduct and coordinate these big group efforts, along with the effort required to hold together the coalition of skeptics (who see preregistration as a tool for shooting down false claims) and true believers (who see preregistration as a way to defuse skepticism about their claims) and to get these papers published in top journals. I’d also say it takes a lot of effort for them to get funding, but that would be kind of a cheap shot, given that I too put in a lot of effort to get funding!

Anyway, to continue, I think that some of the problems with the science reform movement are that it effectively promises different things to different people. And another problem is with these massive projects that inevitably include things that not all the authors will agree with.

So, yeah, I have a problem with simplistic science reform prescriptions, for example recommendations to increase sample size without any nod toward effect size and measurement. But much much worse, in my opinion, are the claims of success we’ve seen from researchers and advocates who are outside the science-reform movement. I’m thinking here about ridiculous statements such as the unfounded claim of 17 replications of power pose, or the endless stream of hype from the nudgelords, or the “sleep is your superpower” guy, or my personal favorite, the unfounded claim from Harvard that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

It’s almost enough to stop here with the remark that the scientific reform movement has been lucky in its enemies.

But I also want to say that I appreciate that the “left wing” of the science reform movement—the researchers who envision replication and preregistration, and the threat thereof, as tools to shoot down bad studies—have indeed faced real resistance within academia and the news media to their efforts, as lots of people will hate the bearers of bad news. And I also appreciate the “right wing” of the science reform movement—the researchers who envision replication and preregistration as a way to validate their studies and refute the critics—in that they’re willing to put their ideas to the test. Not always perfectly, but you have to start somewhere.

While I remain annoyed at certain aspects of the mainstream science reform movement, especially when it manifests itself in mass-authored articles such as the notorious recent non-preregistered paper on the effects of preregistration, or that “Redefine statistical significance” article, or various p-value hardliners we’ve encountered over the decades, I also respect the political challenges of coalition-building that are evident in that movement.

So my plan remains to appreciate the movement while continuing to criticize its statements that seem wrong or do not make sense.

I sent the above to Jessica Hullman, who wrote:

I can relate to being surprised by the reactions of open science enthusiasts to certain lines of questioning. In my view, how to fix science is about as complicated a question as we will encounter. The certainty and comfort with which many advocates of open science seem to make bold claims is hard for me to understand. Maybe that is just the way the world works, or at least the way it works if you want to get your ideas published in venues like PNAS or Nature. But the sensitivity to what gets said in public venues against certain open science practices or people reminds me very much of established academics trying to hush talk about problems in psychology, as though questioning certain things is off limits. I’ve been surprised on the blog, for example, when I think aloud about something like preregistration being imperfect and some commenters seem to have a visceral negative reaction to seeing something like that written. To me that’s the opposite of how we should be thinking.

As an aside, someone I’m collaborating with recently described to me his understanding of the strategy for getting published in PNAS. It was 1. Say something timely/interesting, 2. Don’t be wrong. He explained that ‘Don’t be wrong’ could be accomplished by preregistering and large sample size. Naturally I was surprised to hear #2 described as if it’s really that easy. Silly me for spending all this time thinking so hard about other aspects of methods!

The idea of necessary politics is interesting; not what I would have thought of but probably some truth to it. For me many of the challenges of trying to reform science boil down to people being heuristic-needing agents. We accept that many problems arise from ritualistic behavior, but we have trouble overcoming that, perhaps because no matter how thoughtful/nuanced some may prefer to be, there’s always a larger group who want simple fixes / aren’t incentivized to go there. It’s hard to have broad appeal without being reductionist I guess.

Hey! Here’s how to rewrite and improve the title and abstract of a scientific paper:

Last week in class we read and then rewrote the title and abstract of a paper. We did it again yesterday, this time with one of my recent unpublished papers.

Here’s what I had originally:

title: Unifying design-based and model-based sampling inference by estimating a joint population distribution for weights and outcomes

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

Not terrible, but we can do better. Here’s the new version:

title: MRP using sampling weights

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights come from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

How did we get there?

The title. The original title was fine—it starts with some advertising (“Unifying design-based and model-based sampling inference”) and follows up with a description of how the method works (“estimating a joint population distribution for weights and outcomes”).

But the main point of the title is to get the notice of potential readers, people who might find the paper useful or interesting (or both!).

This pushes the question back one step: Who would find this paper useful or interesting? Anyone who works with sampling weights. Anyone who uses public survey data or, more generally, surveys collected by others, which typically contain sampling weights. And anyone who’d like to follow my path in survey analysis, which would be all the people out there who use MRP (multilevel regression and poststratification). Hence the new title, which is crisp, clear, and focused.

My only problem with the new title, “MRP using sampling weights,” is that it doesn’t clearly convey that the paper involves new research. It makes it look like a review article. But that’s not so horrible; people often like to learn from review articles.

The abstract. If you look carefully, you’ll see that the new abstract is the same as the original abstract, except that we replaced the middle part:

But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses.

with this:

But what if you don’t know where the weights come from?

Here’s what happened. We started by rereading the original abstract carefully. That abstract has some long sentences that are hard to follow. The first sentence is already kinda complicated, but I decided to keep it, because it clearly lays out the problem, and also I think the reader of an abstract will be willing to work a bit when reading the first sentence. Getting to the abstract at all is a kind of commitment.

The second sentence, though, that’s another tangle, and at this point the reader is tempted to give up and just skate along to the end—which I don’t want! The third sentence isn’t horrible, but it’s still a little bit long (starting with the nearly-contentless “It is also not clear how one is supposed to account for” and ending with the unnecessary “in such analyses”). Also, we don’t even really talk much about clustering in the paper! So it was a no-brainer to collapse these into a sentence that was much more snappy and direct.

Finally, yeah, the final sentence of the abstract is kinda technical, but (a) the paper’s technical, and we want to convey some of its content in the abstract!, and (b) after that new, crisp, replacement second sentence, I think the reader is ready to take a breath and hear what the paper is all about.

General principles

Here’s a general template for a research paper:
1. What is the goal or general problem?
2. Why is it important?
3. What is the challenge?
4. What is the solution? What must be done to implement this solution?
5. If the idea in this paper is so great, why wasn’t the problem already solved by someone else?
6. What are the limitations of the proposed solution? What is its domain of applicability?

We used these principles in our rewriting of my title and abstract. The first step was for me to answer the above 6 questions:
1. Goal is to do survey inference with sampling weights.
2. It’s important for zillions of researchers who use existing surveys which come with weights.
3. The challenge is that if you don’t know where the weights come from, you can’t just follow the recommended approach to condition in the regression model on the information that is predictive of inclusion into the sample.
4. The solution is to condition on the weights themselves, which involves the additional step of estimating a joint population distribution for the weights and other predictors in the model. (A toy sketch of this idea appears just after this list.)
5. The problem involves a new concept (imagining a population distribution for weights, which is not a coherent assumption, because, in the real world, weights are constructed based on the data) and some new mathematical steps (not inherently sophisticated as mathematics, but new work from a statistical perspective). Also, the idea of modeling the weights is not completely new; there is some related literature, and one of our contributions is to take weights (which are typically constructed from a non-Bayesian design-based perspective) and use them in a Bayesian analysis.
6. Survey weights do not include all design information, so the solution offered in the paper can only be approximate. In addition the method requires distributional assumptions on the weights; also it’s a new method so who knows how useful it will be in practice.
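As promised in point 4 above, here is a toy sketch of the idea, just to make it concrete. This is my own simplified, non-Bayesian version, not the quasi-Bayesian procedure in the paper: discretize the weights into poststratification cells, estimate the outcome within each cell (in effect a saturated regression of the outcome on the weight), estimate each cell’s share of the population from the weights themselves, and combine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy survey: an outcome y and externally supplied sampling weights w
# whose construction we don't get to see.
n = 2000
w = np.exp(rng.normal(0, 0.7, n))          # weights of unknown provenance
y = rng.normal(1.0 + 0.5 * np.log(w), 1)   # outcome happens to vary with w

# 1. Discretize the weights into poststratification cells.
n_cells = 10
edges = np.quantile(w, np.linspace(0, 1, n_cells + 1))
cell = np.clip(np.digitize(w, edges[1:-1]), 0, n_cells - 1)

# 2. "Regression" of y on the weight: here just the cell means.
cell_means = np.array([y[cell == j].mean() for j in range(n_cells)])

# 3. Estimated population share of each cell: each sampled unit stands in
#    for roughly w population units, so shares are proportional to the
#    summed weights within the cell.
cell_shares = np.array([w[cell == j].sum() for j in range(n_cells)])
cell_shares /= cell_shares.sum()

# 4. Poststratify: combine the cell estimates using the estimated shares.
poststrat_estimate = np.sum(cell_shares * cell_means)
classical_weighted_mean = np.sum(w * y) / np.sum(w)   # Hajek estimator
print(poststrat_estimate, classical_weighted_mean)    # should be close
```

With enough cells this roughly reproduces the classical weighted estimate of the population mean; the point of the model-based version in the paper is that the same structure, with the cell means replaced by a regression model and the shares given a population model, then delivers small-area estimates, regressions, and other population quantities as well.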

We can’t put all of that in the abstract, but we were able to include some versions of the answers to questions 1, 3, and 4. Questions 5 and 6 are important, but it’s ok to leave them to the paper, as this is where readers will typically search for limitations and connections to the literature.

Maybe we should include the answer to question 2 in the abstract, though. Perhaps we could replace “But what if you don’t know where the weights come from?” with “But what if you don’t know where the weights come from? This is often a problem when analyzing surveys collected by others.”

Summary

By thinking carefully about goals and audience, we improved the title and abstract of a scientific paper. You should be able to do this in your own work!

The rise and fall of Seth Roberts and the Shangri-La diet

Here’s a post that’s suitable for the Thanksgiving season.

I no longer believe in the Shangri-La diet. Here’s the story.

Background

I met Seth Roberts back in the early 1990s when we were both professors at the University of California. He sometimes came to the statistics department seminar and we got to talking about various things; in particular we shared an interest in statistical graphics. Much of my work in this direction eventually went toward the use of graphical displays to understand fitted models. Seth went in another direction and got interested in the role of exploratory data analysis in science, the idea that we could use graphs not just to test or even understand a model but also as the source of new hypotheses. We continued to discuss these issues over the years.

At some point when we were at Berkeley the administration was encouraging the faculty to teach freshman seminars, and I had the idea of teaching a course on left-handedness. I’d just read the book by Stanley Coren and thought it would be fun to go through it with a class, chapter by chapter. But my knowledge of psychology was minimal so I contacted the one person I knew in the psychology department and asked him if he had any suggestions of someone who’d like to teach the course with me. Seth responded that he’d be interested in doing it himself, and we did it.

Seth was an unusual guy—not always in a good way, but some of his positive traits were friendliness, inquisitiveness, and an openness to consider new ideas. He also struggled with mood swings, social awkwardness, and difficulties with sleep, and he attempted to address these problems with self-experimentation.

After we taught the class together we got together regularly for lunch and Seth told me about his efforts in self-experimentation involving sleeping hours and mood. Most interesting to me was his discovery that seeing life-sized faces in the morning helped with his mood. I can’t remember how he came up with this idea, but perhaps he started by following the recommendation that is often given to people with insomnia to turn off TV and other sources of artificial light in the evening. Seth got in the habit of taping late-night talk-show monologues and then watching them in the morning while he ate breakfast. He found himself happier, did some experimentation, and concluded that we had evolved to talk with people in the morning, and that life-sized faces were necessary. Seth lived alone, so the more natural approach of talking over breakfast with a partner was not available.

Seth’s self-experimentation went slowly, with lots of dead-ends and restarts, which makes sense given the difficulty of his projects. I was always impressed by Seth’s dedication in this, putting in the effort day after day for years. Or maybe it did not represent a huge amount of labor for him, perhaps it was something like a diary or blog which is pleasurable to create, even if it seems from the outside to be a lot of work. In any case, from my perspective, the sustained focus was impressive. He had worked for years to solve his sleep problems and only then turned to the experiments on mood.

Seth’s academic career was unusual. He shot through college and graduate school to a tenure-track job at a top university, then continued to do publication-quality research for several years until receiving tenure. At that point he was not a superstar but I think he was still considered a respected member of the mainstream academic community. But during the years that followed, Seth lost interest in that thread of research. He told me once that his shift was motivated by teaching introductory undergraduate psychology: the students, he said, were interested in things that would affect their lives, and, compared to that, the kind of research that leads to a productive academic career did not seem so appealing.

I suppose that Seth could’ve tried to do research in clinical psychology (Berkeley’s department actually has a strong clinical program) but instead he moved in a different direction and tried different things to improve his sleep and then, later, his skin, his mood, and his diet. In this work, Seth applied what he later called his “insider/outsider perspective”: he was an insider in that he applied what he’d learned from years of research on animal behavior, an outsider in that he was not working within the existing paradigm of research in physiology and nutrition.

At the same time he was working on a book project, which I believe started as a new introductory psychology course focused on science and self-improvement but ultimately morphed into a trade book on ways in which our adaptations to Stone Age life were not serving us well in the modern era. I liked the book but I don’t think he found a publisher. In the years since, this general concept has been widely advanced and many books have been published on the topic.

When Seth came up with the connection between morning faces and depression, this seemed potentially hugely important. Were the faces really doing anything? I have no idea. On one hand, Seth was measuring his own happiness and administering his own treatments based on his own hypothesis, so the potential for expectation effects is huge. On the other hand, he said the effect he discovered was a surprise to him, and he also reported that the treatment worked with others. Neither he nor, as far as I know, anyone else has attempted a controlled trial of this idea.

In his self-experimentation, Seth lived the contradiction between the two tenets of evidence-based medicine: (1) Try everything, measure everything, record everything; and (2) Make general recommendations based on statistical evidence rather than anecdotes.

Seth’s ideas were extremely evidence-based in that they were based on data that he gathered himself or that people personally sent in to him, and he did use the statistical evidence of his self-measurements, but he did not put in much effort to reduce, control, or adjust for biases in his measurements, nor did he systematically gather data on multiple people.

The Shangri-La diet

Seth’s next success after curing his depression was losing 40 pounds on an unusual diet that he came up with, in which you can eat whatever you want as long as each day you drink a cup of unflavored sugar water, at least an hour before or after a meal. The way he theorized that his diet worked was that the carefully-timed sugar water had the effect of reducing the association between calories and flavor, thus lowering your weight set-point and making you uninterested in eating lots of food.

I asked Seth once if he thought I’d lose weight if I were to try his diet in a passive way, drinking the sugar water at the recommended time but not actively trying to reduce my caloric intake. He said he supposed not, that the diet would make it easier to lose weight but I’d probably still have to consciously eat less.

I described Seth’s diet to one of my psychologist colleagues at Columbia and asked what he thought of it. My colleague said he thought it was ridiculous. And, as with the depression treatment, Seth never had an interest in running a controlled trial, even for the purpose of convincing the skeptics.

I had a conversation with Seth about this. He said he’d tried lots of diets and none had worked for him. I suggested that maybe he was just ready at last to eat less and lose weight, and he said he’d been ready for a while but this was the first diet that allowed him to eat less without difficulty. I suggested that maybe the theory underlying Seth’s diet was compelling enough to act as a sort of placebo, motivating him to follow the protocol. Seth responded that other people had tried his diet and lost weight with it. He also reminded me that it’s generally accepted that “diets don’t work” and that people who lose weight while dieting will usually gain it all back. He felt that his diet was different in that it didn’t tell you what foods to eat or how much; rather, it changed your set point so that you didn’t want to eat so much. I found Seth’s arguments persuasive. I didn’t feel that his diet had been proved effective, but I thought it might really work, I told people about it, and I was happy about its success. Unlike my Columbia colleague, I didn’t think the idea was ridiculous.

Media exposure and success

Seth’s breakout success happened gradually, starting with a 2005 article on self-experimentation in Behavioral and Brain Sciences, a journal that publishes long articles followed by short discussions from many experts. Some of his findings from the ten experiments discussed in the article:

Seeing faces in the morning on television decreased mood in the evening and improved mood the next day . . . Standing 8 hours per day reduced early awakening and made sleep more restorative . . . Drinking unflavored fructose water caused a large weight loss that has lasted more than 1 year . . .

As Seth described it, self-experimentation generates new hypotheses and is also an inexpensive way to test and modify them. The article does not seem to have had a huge effect within research psychology (Google Scholar gives it 93 cites) but two of its contributions—the idea of systematic self-experimentation and the weight-loss method—have spread throughout the popular culture in various ways. Seth’s work was featured in a series of increasingly prominent blogs, which led to a newspaper article by the authors of Freakonomics and ultimately a successful diet book (not enough to make Seth rich, I think, but Seth had simple tastes and no desire to be rich, as far as I know). Meanwhile, Seth started a blog of his own which led to a message board for his diet that he told me had thousands of participants.

Seth achieved some measure of internet fame, with fans including Nassim Taleb, Steven Levitt, Dennis Prager, Tucker Max, Tyler Cowen, . . . and me! In retrospect, I don’t think having all this appreciation was good for him. On his blog and elsewhere Seth reported success with various self-experiments, the last of which was a claim of improved brain function after eating half a stick of butter a day. Even while maintaining interest in Seth’s ideas on mood and diet, I was entirely skeptical of his new claims, partly because of his increasing rate of claimed successes. It took Seth close to 10 years of sustained experimentation to fix his sleep problems, but in later years it seemed that all sorts of different things he tried were effective. His apparent success rate was implausibly high. What was going on? One problem is that sleep hours and weight can be measured fairly objectively, whereas if you measure brain function by giving yourself little quizzes, it doesn’t seem hard at all for a bit of unconscious bias to drive all your results. I also wonder if Seth’s blog audience was a problem: if you have people cheering on your every move, it can be that much easier to fool yourself.

Seth also started to go down some internet rabbit holes. On one hand, he was a left-wing Berkeley professor who supported universal health care, Amnesty International, and other liberal causes. On the other hand, his paleo-diet enthusiasm brought him close to various internet right-wingers, and he was into global warming denial and kinda sympathetic to Holocaust denial, not because he was a Nazi or anything but just because he had a distrust-of-authority thing going on. I guess that if he’d been an adult back in the 1950s and 1960s he would’ve been on the extreme left, but more recently it’s been the far right where the rebels are hanging out. Seth also had sympathy for some absolutely ridiculous and innumerate research on sex ratios and absolutely loved the since-discredited work of food behavior researcher Brian Wansink; see here and here. The point here is not that Seth believed things that turned out to be false—that happens to all of us—but rather that he had a soft spot for extreme claims that were wrapped in the language of science.

Back to Shangri-La

A few years ago, Seth passed away, and I didn’t think of him too often, but then a couple years ago my doctor told me that my cholesterol level was too high. He prescribed a pill, which I’m still taking every day, and he told me to switch to a mostly-plant diet and lose a bunch of weight.

My first thought was to try the Shangri-La diet. That cup of unflavored sugar water, at least an hour before or after any meal. Or maybe I did the spoonful of unflavored olive oil, I can’t remember which. Anyway, I tried it for a few days, also following the advice to eat less. And then after a few days, I thought: if the point is to eat less, why not just do that? So that’s what I did. No sugar water or olive oil needed.

What’s the point of this story? Not that losing the weight was easy for me. For a few years before that fateful conversation, my doctor had been bugging me to lose weight, and I’d vaguely wanted that to happen, but it hadn’t. What worked was me having this clear goal and motivation. And it’s not like I’m starving all the time. I’m fine; I just changed my eating patterns, and I take in a lot less energy every day.

But here’s a funny thing. Suppose I’d stuck with the sugar water and everything else had been the same. Then I’d have lost all this weight, exactly when I’d switched to the new diet. I’d be another enthusiastic Shangri-La believer, and I’d be telling you, truthfully, that only since switching to that diet had I been able to comfortably eat less. But I didn’t stick with Shangri-La and I lost the weight anyway, so I won’t make that attribution.

OK, so after that experience I had a lot less belief in Seth’s diet. The flip side of being convinced by his earlier self-experiment was becoming unconvinced after my own self-experiment.

And that’s where I stood until I saw this post at the blog Slime Mold Time Mold about informal experimentation:

For the potato diet, we started with case studies like Andrew Taylor and Penn Jilette; we recruited some friends to try nothing but potatoes for several days; and one of the SMTM authors tried the all-potato diet for a couple weeks.

For the potassium trial, two SMTM hive mind members tried the low-dose potassium protocol for a couple of weeks and lost weight without any negative side effects. Then we got a couple of friends to try it for just a couple of days to make sure that there weren’t any side effects for them either.

For the half-tato diet, we didn’t explicitly organize things this way, but we looked at three very similar case studies that, taken together, are essentially an N = 3 pilot of the half-tato diet protocol. No idea if the half-tato effect will generalize beyond Nicky Case and M, but the fact that it generalizes between them is pretty interesting. We also happened to know about a couple of other friends who had also tried versions of the half-tato diet with good results.

My point here is not to delve into the details of these new diets, but rather to point out that they are like the Shangri-La diet in being different from other diets, being associated with some theory, being evaluated through before-after studies on people who wanted to lose weight, and yielding success.

At this point, though, my conclusion is not that unflavored sugar water is effective in making it easy to lose weight, or that unflavored oil works, or that potatoes work, or that potassium works. Rather, the hypothesis that’s most plausible to me is that, if you’re at the right stage of motivation, anything can work.

Or, to put it another way, I now believe that the observed effect of the Shangri-La diet, the potato diet, etc., comes from a mixture of placebo and selection. The placebo is that just about any gimmick can help you lose weight, and keep the weight off, if it somehow motivates you to eat less. The selection is that, once you’re ready to try something like this diet, you might be ready to eat less.

But what about “diets don’t work”? I guess that diets don’t work for most people at most times. But the people trying these diets are not “most people at most times.” They’re people with a high motivation to eat less and lose weight.

I’m not saying I have an ironclad case here. I’m pretty much now in the position of my Columbia colleague who felt that there’s no good reason to believe that Seth’s diet is more effective than any other arbitrary series of rules that somewhere includes the suggestion to eat less. And, yes, I have the same impression of the potato diet and the other ideas mentioned above. It’s just funny that it took so long for me to reach this position.

Back to Seth

I wouldn’t say the internet killed Seth Roberts, but ultimately I don’t think it did him any favors to become an internet hero, in the same way that it’s not always good for an ungrounded person to become an academic hero, or an athletic hero, or a musical hero, or a literary hero, or a military hero, or any other kind of hero. The stuff that got you to heroism can be a great service to the world, but what comes next can be a challenge.

Seth ended up believing in his own hype. In this case, the hype was not that he was an amazing genius; rather, the hype was about his method, the idea that he had discovered modern self-experimentation (to the extent that this rediscovery can be attributed to anybody, it should be to Seth’s undergraduate adviser, Allen Neuringer, in this article from 1981). Maybe even without his internet fame Seth would’ve gone off the deep end and started to believe he was regularly making major discoveries; I don’t know.

From a scientific standpoint, Seth’s writings are an example of the principle that honesty and transparency are not enough. He clearly described what he did, but his experiments got to be so flawed as to be essentially useless.

After I posted my obituary of Seth (from which I took much of the beginning of this post), there were many moving tributes in the comments, and I concluded by writing, “It is good that he found an online community of people who valued him.” That’s how I felt at the time, but in retrospect, maybe not. If I could’ve done it all over again, I never would’ve promoted his diet, a promotion that led to all the rest.

I’d guess that the wide dissemination of Seth’s ideas was a net benefit to the world. Even if his diet idea is bogus, it seems to have made a difference to a lot of people. And even if the discoveries he reported from his self-experimentation (eating half a stick of butter a day improving brain functioning and all the rest) were nothing but artifacts of his hopeful measurement protocols, the idea of self-experimentation was empowering to people—and I’m assuming that even his true believers (other than himself) weren’t actually doing the butter thing.

Setting aside the effects on others, though, I don’t think that this online community was good for Seth in his own work or for his personal life. In some ways he was ahead of his time, as nowadays we’re hearing a lot about people getting sucked into cult-like vortexes of misinformation.

P.S. Lots of discussion in comments, including this from the Slime Mold Time Mold bloggers.

Dorothy Bishop on the prevalence of scientific fraud

Following up on our discussion of replicability, here are some thoughts from psychology researcher Dorothy Bishop on scientific fraud:

In recent months, I [Bishop] have become convinced of two things: first, fraud is a far more serious problem than most scientists recognise, and second, we cannot continue to leave the task of tackling it to volunteer sleuths.

If you ask a typical scientist about fraud, they will usually tell you it is extremely rare, and that it would be a mistake to damage confidence in science because of the activities of a few unprincipled individuals. . . . we are reassured [that] science is self-correcting . . .

The problem with this argument is that, on the one hand, we only know about the fraudsters who get caught, and on the other hand, science is not prospering particularly well – numerous published papers produce results that fail to replicate and major discoveries are few and far between . . . We are swamped with scientific publications, but it is increasingly hard to distinguish the signal from the noise.

Bishop summarizes:

It is getting to the point where in many fields it is impossible to build a cumulative science, because we lack a solid foundation of trustworthy findings. And it’s getting worse and worse. . . . in clinical areas, there is growing concern that systematic reviews that are supposed to synthesise evidence to get at the truth instead lead to confusion because a high proportion of studies are fraudulent.

Also:

[A] more indirect negative consequence of the explosion in published fraud is that those who have committed fraud can rise to positions of influence and eminence on the back of their misdeeds. They may become editors, with the power to publish further fraudulent papers in return for money, and if promoted to professorships they will train a whole new generation of fraudsters, while being careful to sideline any honest young scientists who want to do things properly. I fear in some institutions this has already happened.

Given all the above, it’s unsurprising that, in Bishop’s words,

To date, the response of the scientific establishment has been wholly inadequate. There is little attempt to proactively check for fraud . . . Even when evidence of misconduct is strong, it can take months or years for a paper to be retracted. . . . this relaxed attitude to the fraud epidemic is a disaster-in-waiting.

What to do? Bishop recommends that some subset of researchers be trained as “data sleuths,” to move beyond the current whistleblower-and-vigilante system into something more like “the equivalent of a police force.”

I don’t know what to think about that. On one hand, I agree that whistleblowers and critics don’t get the support that they deserve; on the other hand, we might be concerned about who would be attracted to the job of official police officer here.

Setting aside concerns about Bishop’s proposed solution, I do see her larger point about the scientific publication process being so broken that it can actively interfere with the development of science. In a situation parallel to Cantor’s diagonal argument or Russell’s theory of types, it would seem that we need a scientific literature, and then, alongside it, a vetted scientific literature, and then, alongside that, another level of vetting, and so on. In medical research this sort of system has existed for decades, with a huge number of journals for the publication of original studies; and then another, smaller but still immense, set of journals that publish nothing but systematic reviews; and then some distillations that make their way into policy and practice.

Clarke’s Law

And don’t forget Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud. All the above problems also arise with the sorts of useless noise mining we’ve been discussing in this space for nearly twenty years now. I assume most of those papers do not involve fraud, and even when there are clearly bad statistical practices such as rooting around for statistical significance, I expect that the perpetrators think of these research violations as merely serving the goal of larger truths.

So it’s not just fraud. Not by a longshot.

Also, remember the quote from Bishop above: “those who have committed fraud can rise to positions of influence and eminence on the back of their misdeeds. They may become editors, with the power to publish further fraudulent papers in return for money, and if promoted to professorships they will train a whole new generation of fraudsters, while being careful to sideline any honest young scientists who want to do things properly. I fear in some institutions this has already happened.” Replace “fraud” by “crappy research” and, yeah, we’ve been there for awhile!

P.S. Mark Tuttle points us to this news article by Richard Van Noorden, “How big is science’s fake-paper problem?”, that makes a similar point.

“Open Letter on the Need for Preregistration Transparency in Peer Review”

Brendan Nyhan writes:

Wanted to share this open letter. I know preregistration isn’t useful for the style of research you do, but even for consumers of preregistered research like you it’s essential to know if the preregistration was actually disclosed to and reviewed by reviewers, which in turn helps make sure that exploratory and confirmatory analyses are adequately distinguished, deviations and omissions labeled, etc. (The things I’ve seen as a reviewer… are not good – which is what motivated me to organize this.)

The letter, signed by Nyhan and many others, says:

It is essential that preregistrations be considered as part of the scientific review process.

We have observed a lack of shared understanding among authors, editors, and reviewers about the role of preregistration in peer review. Too often, preregistrations are omitted from the materials submitted for review entirely. In other cases, manuscripts do not identify important deviations from the preregistered analysis plan, fail to provide the results of preregistered analyses, or do not indicate which analyses were not preregistered.

We therefore make the following commitments and ask others to join us in doing so:

As authors: When we submit an article for review that has been preregistered, we will always include a working link to a (possibly anonymized) preregistration and/or attach it as an appendix. We will identify analyses that were not preregistered as well as notable deviations and omissions from the preregistration.

As editors: When we receive a preregistered manuscript for review, we will verify that it includes a working link to the preregistration and/or that it is included in the materials provided to reviewers. We will not count the preregistration against appendix page limits.

As reviewers: We will (a) ask for the preregistration link or appendix when reviewing preregistered articles and (b) examine the preregistration to understand the registered intention of the study and consider important deviations, omissions, and analyses that were not preregistered in assessing the work.

I’ve actually been moving toward more preregistration in my work. Two recent studies we’ve done that have been preregistered are:

– Our project on generic language and political polarization

– Our evaluation of the Millennium Villages project

And just today I met with two colleagues on a medical experiment that’s in the pre-design stage—that is, we’re trying to figure out the design parameters. To do this, we need to simulate the entire process, including latent and observed data, then perform analyses on the simulated data, then replicate the entire process to ensure that the experiment will be precise enough to be useful, at least under the assumptions we’re making. This is already 90% of preregistration, and we had to do it anyway. (See recommendation 3 here.)
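To make that concrete, here is a minimal sketch of what I mean by simulating the whole process. This is not the actual project code; the effect size, noise level, and sample size below are placeholder values. Fake data are generated under assumed parameters, the planned analysis is run on the fake data, and the whole thing is replicated to see whether the design would be precise enough to be worth doing.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_and_analyze(n, true_effect=0.3, sigma=1.0):
    """Simulate one fake dataset for a two-arm experiment and run the planned analysis."""
    z = rng.binomial(1, 0.5, size=n)            # treatment assignment
    latent = true_effect * z                     # latent, noise-free outcome shift
    y = latent + rng.normal(0, sigma, size=n)    # observed outcome
    est = y[z == 1].mean() - y[z == 0].mean()    # planned analysis: difference in means
    se = np.sqrt(y[z == 1].var(ddof=1) / (z == 1).sum() +
                 y[z == 0].var(ddof=1) / (z == 0).sum())
    return est, se

# Replicate the entire process to check the precision the design delivers.
results = np.array([simulate_and_analyze(n=200) for _ in range(1000)])
print("mean estimate:", results[:, 0].mean().round(3))
print("sd of estimates across replications:", results[:, 0].std().round(3))
print("typical standard error:", results[:, 1].mean().round(3))
```

If the variation in the estimates is large relative to any effect we would care about, the design needs more data or better measurement, and it is better to learn that before collecting anything.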

So, yeah, given that I’m trying now to simulate every study ahead of time before gathering any data, preregistration pretty much comes for free.

Preregistration is not magic—it won’t turn a hopelessly biased, noisy study into something useful—but it does seem like a useful part of the scientific process, especially if we remember that preregistering an analysis should not stop us from performing later, non-preregistered analyses.

Preregistration should be an addition to the research project, not a limitation!

I guess that Nyhan et al.’s suggestions are good, if narrow in that they’re focused on the very traditional journal-reviewer system. I’m a little concerned with the promise that they as reviewers will “examine the preregistration to understand the registered intention of the study and consider important deviations, omissions, and analyses that were not preregistered in assessing the work.” I mean, sure, fine in theory, but I would not expect or demand that every reviewer do this for every paper that comes in. If I had to do all that work every time I reviewed a paper, I’d have to review many fewer papers a year, and I think my total contribution to science as a reviewer would be much less. If I’m gonna go through and try to replicate an analysis, I don’t want to waste that on a review that only 4 people will see. I’d rather blog it and maybe write it up in some other form (as for example here), as that has the potential to help more people.

Anyway, here’s the letter, so go sign it—or perhaps sign some counter-letter—if you wish!

Another reason so much of science is so bad: bias in what gets researched.

Nina Strohminger and Olúfémi Táíwò write:

Most of us have been taught to think of scientific bias as a distortion of scientific results. As long as we avoid misinformation, fake news, and false conclusions, the thinking goes, the science is unbiased. But the deeper problem of bias involves the questions science pursues in the first place. Scientific questions are infinite, but the resources required to test them — time, effort, money, talent — are decidedly finite.

This is a good point. Selection bias is notoriously difficult for people to think about, as by its nature it depends on things that haven’t been seen.

I like Strohminger and Táíwò’s article and have only two things to add.

1. They write about the effects of corporations on what gets researched, using as examples the strategies of cigarette companies and oil companies to fund research to distract from their products’ hazards. I agree that this is an issue. We should also be concerned about influences from sources other than corporations, including the military, civilian governments, and advocacy organizations. There are plenty of bad ideas to go around, even without corporate influence. And, setting all this aside, there’s selection based on what gets publicity, along with what might be called scientific ideology. Think about all that ridiculous research on embodied cognition or on the factors that purportedly influence the sex ratio of babies. These ideas fit certain misguided models of science and have sucked up lots of attention and researcher effort without any clear motivation based on funding, corporate or otherwise. My point here is just that there are a lot of ways that the scientific enterprise is distorted by selection bias in what gets studied and what gets published.

2. They write: “The research on nudges could be completely unbiased in the sense that it provides true answers. But it is unquestionably biased in the sense that it causes scientists to effectively ignore the most powerful solutions to the problems they focus on. As with the biomedical researchers before them, today’s social scientists have become the unwitting victims of corporate capture.” Agreed. Beyond this, though, that research is not even close to being unbiased in the sense of providing accurate answers to well-posed questions. We discussed this last year in the context of a fatally flawed nudge meta-analysis: it’s a literature of papers with biased conclusions (the statistical significance filter), with some out-and-out fraudulent studies mixed in.

My point here is that these two biases—selection bias in what is studied, and selection bias in the studies themselves—go together. Neither bias alone would be enough. If there were only selection bias in what was studied, the result would be lots of studies reporting high uncertainty and no firm conclusions, and not much to sustain the hype machine. Conversely, if there were only selection bias within each study, there wouldn’t be such a waste of scientific effort and attention. Strohminger and Táíwò’s article is valuable because they emphasize selection bias in what is studied, which is something we haven’t been talking so much about.
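To see that second bias, the statistical significance filter, in isolation, here is a toy simulation (my own invented numbers, not a model of the nudge literature specifically): a small true effect, noisy studies, and a literature that only gets to see the estimates that cross the p < 0.05 threshold.

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, se = 0.1, 0.15                 # small true effect, noisy studies (invented values)
estimates = rng.normal(true_effect, se, size=100_000)
published = estimates[np.abs(estimates / se) > 1.96]   # the filter: only "significant" results survive

print("true effect:", true_effect)
print("mean published estimate:", round(published.mean(), 3))
print("share of published estimates with the wrong sign:", round((published < 0).mean(), 3))
```

The surviving estimates come out several times larger than the true effect, and a few even point in the wrong direction, and that is before adding any measurement problems or fraud.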

Simulations of measurement error and the replication crisis: Update

Last week we ran a post, “Simulations of measurement error and the replication crisis: Maybe Loken and I have a mistake in our paper?”, reporting some questions that neuroscience student Federico D’Atri asked about a paper that Eric Loken and I wrote a few years ago. It’s one of my favorite papers so it was good to get feedback on it. D’Atri had run a simulation and had some questions, and in my post I shared some old code from the paper. Eric and I then looked into it. We discussed with D’Atri and here’s what we found:

1. No, Loken and I did not have a mistake in our paper. (More on that below.)

2. The code I posted on the blog was not the final code for our paper. Eric had made the final versions. From Eric, here’s:
(a) the final code that we used to make the figures in the paper, where we looked at regression slopes, and
(b) a cleaned version of the code I’d posted, where we looked at correlations.
The code I posted last week was something in my files, but it was not the final version of the code, hence the confusion about what was being conditioned on in the analysis.

Regarding the code, Eric reports:

All in all we get the same results whether it’s correlations or t-tests of slopes. At small samples, and for small effects, the majority of the stat sig cors/slopes/t-tests are larger in the error than the non error (when you compare them paired). The graph’s curve does pop up through 0.5 and higher. It’s a lot higher if r = 0.08, and it’s not above 50% if the r is 0.4. It does require a relatively small effect, but we also have .8 reliability.
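For readers who want to play with this, here is a minimal sketch of that paired comparison. It is not Eric’s code; the true correlation, reliability, and sample size are placeholders chosen in the spirit of the scenario described above. The idea is to simulate error-free and error-prone versions of the same data, keep the error-prone correlations that reach statistical significance, and count how often they exceed the corresponding error-free correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n, reliability, sims = 0.15, 50, 0.8, 20_000   # placeholder values
err_sd = np.sqrt(1 / reliability - 1)   # error sd giving roughly 0.8 reliability
crit = 1.96 / np.sqrt(n - 3)            # Fisher-z cutoff for two-sided p < 0.05

wins, sig = 0, 0
for _ in range(sims):
    x = rng.normal(size=n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # error-free data
    x_obs = x + err_sd * rng.normal(size=n)                  # error-prone measurements
    y_obs = y + err_sd * rng.normal(size=n)
    r_ideal = np.corrcoef(x, y)[0, 1]
    r_noisy = np.corrcoef(x_obs, y_obs)[0, 1]
    if np.arctanh(r_noisy) > crit:      # select noisy estimates that are significant (and positive)
        sig += 1
        wins += int(r_noisy > r_ideal)  # paired comparison against the error-free estimate
print("significant noisy correlations larger than the paired error-free one:", wins / sig)
```

Under settings like these the paired share should land in the neighborhood Eric describes, and it drops as the sample size or the true correlation grows; the point of the sketch is just to make explicit what is being conditioned on.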

3. Some interesting questions remain. Federico writes:

I don’t think there’s an error in the code used to produce the graphs in the paper; rather I personally find that certain sentences in the paper may lead to some misunderstandings. I also concur with the main point made in their paper, that large estimates obtained from a small sample in high-noise conditions should not be trusted and I believe they do a good job of delivering this message.

What Andrew and Eric show is the proportion of larger correlations achieved when noisy measurements are selected for statistical significance, compared to the estimate one would obtain in the same scenario without measurement error and without selecting for statistical significance. What I had initially thought was that there was an equal level of selection for statistical significance applied to both scenarios. They essentially show that under conditions of insufficient power to detect the true underlying effect, doing enough selection based on statistical significance, can produce an overestimation much higher than the attenuation caused by measurement error.

This seems quite intuitive to me, and I would like to clarify it with an example. Consider a true underlying correlation in ideal conditions of 0.15 and a sample size of N = 25, and the extreme scenario where measurement error is infinite (in which case the noisy x and y will be uncorrelated). In this case, the measurements of x and y under ideal conditions will be totally uncorrelated with those obtained under noisy conditions, hence the correlation estimates in the two different scenarios as well. If I select for significance the correlations obtained under noisy conditions, I am only looking at correlations greater than 0.38 (for α = 0.05, two-tailed test), which I’ll be comparing to an average correlation of 0.15, since the two estimates are completely unrelated. It is clear then that the first estimate will almost always be greater than the second. The greater the noise, the more uncorrelated the correlation estimates obtained in the two different scenarios become, making it less likely that obtaining a large estimate in one case would also result in a large estimate in the other case.

My criticism is not about the correctness of the code (which is correct as far as I can see), but rather how relevant this scenario is in representing a real situation. Indeed, I believe it is very likely that the same hypothetical researchers who made selections for statistical significance in ‘noisy’ measurement conditions would also select for significance in ideal measurement conditions, and in that case, they would obtain an even higher frequency of effect overestimation when selecting for statistical significance (once selecting for the direction of the true effect), as well as a greater ease in achieving statistically significant results.

However, I think it could be possible that in research environments where measurement error is greater (and isn’t modeled), there might be an incentive, or a greater co-occurrence, of selection for statistical significance and poor research practices. Without evidence of this, though, I find it more interesting to compare the two scenarios assuming similar selection criteria.

Also I’m aware that in situations deviating from the simple assumptions of the case we are considering here (simple correlation between x and y and uncorrelated measurement errors), complexities can arise. For example, as you probably know better than me, in multiple regression scenarios where two predictors, x1 and x2, are correlated and their measurement errors are also correlated (which can occur with certain types of measures, such as self-reporting where individuals prone to overestimating x1 may also tend to overestimate x2), and only x1 is correlated with y, there is an inflation of Type I error for x2 and asymptotically β2 is biased away from zero.

Eric adds:

Glad we resolved the initial confusion about our article’s main point and associated code. When you [Federico] first read our article, you were interested in different questions than the one we covered. It’s a rich topic, with lots of work to be done, and you seem to have several ideas. Our article addressed the situation where someone might acknowledge measurement error, but then say “my finding is all the more impressive because if not for the measurement error I would have found an even bigger effect.” We target the intuition that if a dataset could be made error free by waving a wand, that the data would necessarily show a larger correlation. Of course the “iron law” (attenuation) holds in large samples. Unsurprisingly, however, in smaller samples, data with measurement error can have a larger realized correlation. And after conditioning on the statistical significance of the observed correlations, a majority of them could be larger than the corresponding error free correlation. We treated the error free effect (the “ideal study”) as the counterfactual (“if only I had no error in my measurements”), and thus filtered on the statistical significance of the observed error prone correlations. When you tried to reproduce that graph, you applied the filter differently, but you now find that what we did was appropriate for the question we were answering.

By the way, we deliberately kept the error modest. In our scenario, the x and y values have about 0.8 reliability—widely considered excellent measurement. I agree that if the error grows wildly, as with your hypothetical case, then the observed values are essentially uncorrelated with the thing being measured. Our example though was pretty realistic—small true effect, modest measurement error, range of sample sizes. I can see though that there are many factors to explore.

Different questions are of interest in different settings. One complication is that, when researchers say things like, “Despite limited statistical power . . .” they’re typically not recognizing that they have been selecting on statistical significance. In that way, they are comparing to the ideal setting with no selection.

And, for reasons discussed in my original paper with Eric, researchers often don’t seem to think about measurement error at all! They have the (wrong) impression that having a “statistically significant” result gives retroactive assurance that their signal-to-noise ratio is high.

That’s what got us so frustrated to start with: not just that noisy studies get published all the time, but that many researchers seem to not even realize that noise can be a problem. Lots and lots of correspondence with researchers who seem to feel that if they’ve found a correlation between X and Y, where X is some super-noisy measurement with some connection to theoretical concept A, and Y is some super-noisy measurement with some connection to theoretical concept B, that they’ve proved that A causes B, or that they’ve discovered some general connection between A and B.

So, yeah, we encourage further research in this area.

“Modeling Social Behavior”: Paul Smaldino’s cool new textbook on agent-based modeling

Paul Smaldino is a psychology professor who is perhaps best known for his paper with Richard McElreath from a few years ago, “The Natural Selection of Bad Science,” which presents a sort of agent-based model that reproduces the growth in the publication of junk science that we’ve seen in recent decades.
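I haven’t checked this against their published model, but the flavor of that kind of agent-based argument can be conveyed with a toy simulation along these lines (every parameter below is invented for illustration): labs vary in methodological effort, lower effort yields more publishable positive findings, and the most productive labs get imitated.

```python
import numpy as np

rng = np.random.default_rng(4)
n_labs, generations = 100, 200
effort = rng.uniform(0.1, 1.0, n_labs)   # each lab's methodological rigor (invented scale)

for _ in range(generations):
    # Lower effort means more publishable "positive" findings per period.
    output = rng.poisson(5.0 * (1.1 - effort))
    # Selection on productivity: the most prolific lab's effort level is copied
    # (with a little drift) by the least prolific lab.
    best, worst = output.argmax(), output.argmin()
    effort[worst] = np.clip(effort[best] + rng.normal(0, 0.02), 0.01, 1.0)

# In this toy setup, a lab's false-finding share rises as its effort falls.
print("mean effort after selection on output:", round(effort.mean(), 2))
print("implied mean false-finding share:", round((0.5 * (1.0 - effort)).mean(), 2))
```

Run long enough, selection on publication counts drags average effort down without any individual lab intending anything of the kind, which is roughly the dynamic the Smaldino and McElreath paper is about.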

Since then, it seems that Smaldino has been doing a lot of research and teaching on agent-based models in social science more generally, and he just came out with a book, “Modeling Social Behavior: Mathematical and Agent-Based Models of Social Dynamics and Cultural Evolution.” The book has social science, it has code, it has graphs—it’s got everything.

It’s an old-school textbook with modern materials, and I hope it’s taught in thousands of classes and sells a zillion copies.

There’s just one thing that bothers me. The book is entertainingly written and bursting with ideas, and it does a great job of raising concerns about the models it simulates rather than acting as if everything is already settled. My concern is that nobody reads books anymore. If I think about students taking a class in agent-based modeling and using this book, it’s hard for me to picture most of them actually reading the book. They’ll start with the homework assignments and then flip through the book to try to figure out what they need. That’s how people read nonfiction books nowadays, which I guess is one reason that books, even those I like, are typically repetitive and low on content. Readers don’t expect the book to offer a delightful reading experience, so authors don’t deliver it, and then readers expect it even less, etc.

To be clear: this is a textbook, not a trade book. It’s a readable and entertaining book in the way that Regression and Other Stories is a readable and entertaining book, not in the way that Guns, Germs, and Steel is. Still, within the framework of being a social science methods book, it’s entertaining and thought-provoking. Also, I like it as a methods book because it’s focused on models rather than on statistical inference. We tried to get a similar feel with A Quantitative Tour of Social Sciences but with less success.

So it kinda makes me sad to see this effort of care put into a book that probably very few students will read from paragraph to paragraph. I think things were different 50 years ago: back then, there wasn’t anything online to read, you’d buy a textbook and it was in front of you so you’d read it. On the plus side, readers can now go in and make the graphs themselves—I assume that Smaldino has a website somewhere with all the necessary code—so there’s that.

P.S. In the preface, Smaldino is “grateful to all the modelers whose work has inspired this book’s chapters . . . particularly want to acknowledge the debt owed to the work of,” and then he lists 16 names, one of which is . . . Albert-László Barabási!

Huh?? Is this the same Albert-László Barabási who said that scientific citations are worth $100,000 each? I guess he did some good stuff too? Maybe this is worthy of an agent-based model of its own.

Wow—those are some really bad referee reports!

Dale Lehman writes:

I missed this recent retraction but the whole episode looks worth your attention. First the story about the retraction.

Here are the referee reports and authors responses.

And, here is the author’s correspondence with the editors about retraction.

The subject of COVID vaccine safety (or lack thereof) is certainly important and intensely controversial. The study has some fairly remarkable claims (deaths due to the vaccines numbering in the hundreds of thousands). The peer reviews seem to be an exemplary case of your statement that “the problems with peer review are the peer reviewers.” The data and methodology used in the study seem highly suspect to me – but the author appears to respond to many challenges thoughtfully (even if I am not convinced) and raises questions about the editorial practices involved with the retraction.

Here are some more details on that retracted paper.

Note the ethics statement about no conflicts – doesn’t mention any of the people supposedly behind the Dynata organization. Also, I was surprised to find the paper and all documentation still available despite being retracted. It includes the survey instrument. From what I’ve seen, the worst aspect of this study is that it asked people if they knew people who had problems after receiving the vaccine – no causative link even being asked for. That seems like an unacceptable method for trying to infer deaths from the vaccine – and one that the referees should never have permitted.

The most amazing thing about all this was the review reports. From the second link above, we see that the article had two review reports. Here they are, in their entirety:

The first report is an absolute joke, so let’s just look at the second review. The author revised in response to that review by rewriting some things, then the paper was published. At no time were any substantive questions raised.

I also noticed this from the above-linked news article:

“The study found that those who knew someone who’d had a health problem from Covid were more likely to be vaccinated, while those who knew someone who’d experienced a health problem after being vaccinated were less likely to be vaccinated themselves.”

Here’s a more accurate way to write it:

“The study found that those who SAID THEY knew someone who’d had a health problem from Covid were more likely to SAY THEY WERE vaccinated, while those who SAID THEY knew someone who’d experienced a health problem after being vaccinated were less likely to SAY THEY WERE vaccinated themselves.”

Yes, this sort of thing arises with all survey responses, but I think the subjectivity of the response is much more of a concern here than in a simple opinion poll.

The news article, by Stephanie Lee, makes the substantive point clearly enough:

This methodology for calculating vaccine-induced deaths was rife with problems, observers noted, chiefly that Skidmore did not try to verify whether anyone counted in the death toll actually had been vaccinated, had died, or had died because of the vaccine.

Also this:

Steve Kirsch, a veteran tech entrepreneur who founded an anti-vaccine group, pointed out that the study had the ivory tower’s stamp of approval: It had been published in a peer-reviewed scientific journal and written by a professor at Michigan State University. . . .

In a sympathetic interview with Skidmore, Kirsch noted that the study had been peer-reviewed. “The journal picks the peer reviewers … so how can they complain?” he said.

Ultimately the responsibility for publishing a misleading article falls upon the article’s authors, not upon the journal. You can’t expect or demand careful reviews from volunteer reviewers, nor can you expect volunteer journal editors to carefully vet every paper they will publish. Yes, the peer reviews for the above-discussed paper were useless—actually worse than useless, in that they gave a stamp of approval to bad work—but you can’t really criticize the reviewers for “not doing their jobs,” given that reviewing is not their job—they’re doing it for free.

Anyway, it’s a good thing that the journal shared the review reports so we can see how useless they were.

A successful example of “adversarial collaboration.” When does this approach work and when does it not?

Stephen Ceci, Shulamit Kahn, and Wendy Williams write:

We synthesized the vast, contradictory scholarly literature on gender bias in academic science from 2000 to 2020. . . . Claims and counterclaims regarding the presence or absence of sexism span a range of evaluation contexts. Our approach relied on a combination of meta-analysis and analytic dissection. We evaluated the empirical evidence for gender bias in six key contexts in the tenure-track academy: (a) tenure-track hiring, (b) grant funding, (c) teaching ratings, (d) journal acceptances, (e) salaries, and (f) recommendation letters. We also explored the gender gap in a seventh area, journal productivity, because it can moderate bias in other contexts. . . . Contrary to the omnipresent claims of sexism in these domains appearing in top journals and the media, our findings show that tenure-track women are at parity with tenure-track men in three domains (grant funding, journal acceptances, and recommendation letters) and are advantaged over men in a fourth domain (hiring). For teaching ratings and salaries, we found evidence of bias against women; although gender gaps in salary were much smaller than often claimed, they were nevertheless concerning.

They continue:

Even in the four domains in which we failed to find evidence of sexism disadvantaging women, we nevertheless acknowledge that broad societal structural factors may still impede women’s advancement in academic science. . . . The key question today is, in which domains of academic life has explicit sexism been addressed? And in which domains is it important to acknowledge continuing bias that demands attention and rectification lest we maintain academic systems that deter the full participation of women? . . .

Our findings of some areas of gender neutrality or even a pro-female advantage are very much rooted in the most recent decades and in no way minimize or deny the existence of gender bias in the past. Throughout this article, we have noted pre-2000 analyses that suggested that bias either definitely or probably was present in some aspects of tenure-track academia before 2000. . . .

The authors characterize this project as an “adversarial collaboration”:

This article represents more than 4.5 years of effort by its three authors. By the time readers finish it, some may assume that the authors were in agreement about the nature and prevalence of gender bias from the start. However, this is definitely not the case. Rather, we are collegial adversaries who, during the 4.5 years that we worked on this article, continually challenged each other, modified or deleted text that we disagreed with, and often pushed the article in different directions. . . . Kahn has a long history of revealing gender inequities in her field of economics, and her work runs counter to Ceci and Williams’s claims of gender fairness. . . . In 2019, she co-organized a conference on women in economics, and her most recent analysis in 2021 found gender inequities persisting in tenure and promotion in economics. . . . Her findings diverge from Ceci and Williams’s, who have published a number of studies that have not found gender bias in the academy, such as their analyses of grants and tenure-track hiring . . .

Although our divergent views are real, they may not be evident to readers who see only what survived our disagreements and rewrites; the final product does not reveal the continual back and forth among the three of us. Fortunately, our viewpoint diversity did not prevent us from completing this project on amicable terms. Throughout the years spent working on it, we tempered each other’s statements and abandoned irreconcilable points, so that what survived is a consensus document that does not reveal the many instances in which one of us modified or cut text that another wrote because they felt it was inconsistent with the full corpus of empirical evidence. . . .

Editors and board members can promote science by encouraging, when possible, diverse viewpoints and by commissioning teams of adversarial coauthors (as this particular journal, Psychological Science in the Public Interest, was founded to do—to bring coauthors together in an attempt to resolve their historic differences). Knowing that one’s writing will be criticized by one’s divergently thinking coauthors can reduce ideologically driven criticisms that are offered in the guise of science. . . .

Interesting. In the past I’ve been suspicious of adversarial collaborations—whenever I’ve tried such a thing it hasn’t worked so well, and examples I’ve seen elsewhere have seemed to have more of the “adversarial” than the “collaboration.”

Here are two examples (here and here) where I tried to work with people who I disagreed with, but they didn’t want to work with me.

I get it: in both places I was pretty firm that they had been making strong claims that were not supported by their evidence, and there was no convenient halfway point where they could rest. Ideally they’d just have agreed with me, but it’s pretty rare that people will just give up something they’ve already staked a claim on.

I’m not saying these other researchers are bad people. In each case, there was a disagreement about the strength of evidence. My point is just that there was no clear way forward regarding an adversarial collaboration. So I just wrote my articles on my own; I consider each of these to be a form of “asynchronous collaboration.” Still better than nothing.

But this one by Ceci, Kahn, and Williams seems to have worked well. Perhaps it’s easier in psychology than in political science, for some reason?

That said, I can’t imagine a successful adversarial collaboration with the psychologists who published some of the horrible unreplicable stuff from the 2005-2020 era. They just seem too invested in their claims, also they achieved professional success with that work and have no particular motivation to lend their reputations to any work that might shoot it down. By their behavior, they treat their claims as fragile and would not want them to be put to the test. The Ceci, Kahn, Williams example is different, perhaps, because there are policy questions at stake, and all of them are motivated to persuade people in the middle of the debate. In contrast, the people pushing some of the more ridiculous results in embodied cognition and evolutionary psychology have no real motivation to persuade skeptics or even neutrals; they just need to keep their work from being seen as completely discredited.

This is related to my point about research being held to a higher standard when it faces active opposition.

Frictionless reproducibility; methods as proto-algorithms; division of labor as a characteristic of statistical methods; statistics as the science of defaults; statisticians well prepared to think about issues raised by AI; and robustness to adversarial attacks

Tian points us to this article by David Donoho, which argues that some of the rapid progress in data science and AI research in recent years has come from “frictionless reproducibility,” which he identifies with “data sharing, code sharing, and competitive challenges.” This makes sense: the flip side of the unreplicable research that has destroyed much of social psychology, policy analysis, and related fields is that when we can replicate an analysis with a press of a button using open-source software, it’s much easier to move forward.

Frictionless reproducibility

Frictionless reproducibility is a useful goal in research. It can take a while between the development of a statistical idea and its implementation in a reproducible way, and that’s ok. But it’s good to aim for that stage. The effort it takes to make a research idea reproducible is often worth it, in that getting to reproducibility typically requires a level of care and rigor beyond what is necessary just to get a paper published. One thing Stan has taught me is how much you learn in the process of developing a general tool that will be used by strangers.

I think that statisticians have a special perspective for thinking about these issues, for the following reason:

Methods as proto-algorithms

As statisticians, we’re always working with “methods.” Sometimes we develop new methods or extend existing methods; sometimes we place existing methods into a larger theoretical framework; sometimes we study the properties of methods; sometimes we apply methods. Donoho and I are typical of statistics professors in having done all these things in our work.

A “method” is a sort of proto-algorithm, not quite fully algorithmic (for example, it could require choices of inputs, tuning parameters, expert inputs at certain points) but it follows some series of steps. The essence of a method is that it can be applied by others. In that sense, any method is a bridge between different humans; it’s a sort of communication among groups of people who may never meet or even directly correspond. Fisher invented logistic regression and decades later some psychometrician uses it; the method is a sort of message in a bottle.

Division of labor as a characteristic of statistical methods

There are different ways to take this perspective. One direction is to recognize that almost all statistical methods involve a division of labor. In Bayes, one agent creates the likelihood model and another agent creates the prior model. In bootstrap, one agent comes up with the estimator and another agent comes up with the bootstrapping procedure. In classical statistics, one agent creates the measurement protocol, another agent designs the experiment, and a third agent performs the analysis. In machine learning, there are the training and test sets. With public surveys, one group conducts the survey and computes weights; other groups analyze the data using the weights. Etc. We discussed this general idea a few years ago here.
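As a trivial illustration of that division of labor (my own example, not anything from the linked discussion): in the bootstrap, one person can supply the estimator and another the resampling procedure, and neither needs to know anything about the other’s code beyond the interface.

```python
import numpy as np

# Agent 1 supplies an estimator: any function from a data array to a number.
def trimmed_mean(data, prop=0.1):
    lo, hi = np.quantile(data, [prop, 1 - prop])
    return data[(data >= lo) & (data <= hi)].mean()

# Agent 2 supplies the bootstrap procedure, knowing nothing about the estimator's internals.
def bootstrap_se(data, estimator, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    stats = [estimator(rng.choice(data, size=len(data), replace=True)) for _ in range(n_boot)]
    return np.std(stats, ddof=1)

data = np.random.default_rng(1).exponential(size=100)
print("estimate:", trimmed_mean(data), "bootstrap se:", bootstrap_se(data, trimmed_mean))
```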

But that’s not the direction I want to go right here. Instead I want to consider something else, which is the way that a “method” is an establishment of a default; see here and also here.

Statistics as the science of defaults

The relevance to the current discussion is that, to the extent that defaults are a move toward automatic behavior, statisticians are in the business of automating science. That is, our methods are “successes” to the extent that they enable automatic behavior on the part of users. As we have discussed, automatic behavior is not a bad thing! When we make things automatic, users can think at the next level of abstraction. For example, push-button linear regression allows researchers to focus on the model rather than on how to solve a matrix equation, and it can even take them to the next level of abstraction and think about prediction without even thinking about the model. As teachers and users of research, we then are (rightly) concerned that lack of understanding can be a problem, but it’s hard to go back. We might as well complain that the vast majority of people drive their cars with no understanding of how those little explosions inside the engine make the car go round.

Statisticians well prepared to think about issues raised by AI

To get back to the AI issue: I think that we as statisticians are particularly well prepared to think about the issues that AI brings, because the essence of statistics is the development of tools designed to automate human thinking about models and data. Statistical methods are a sort of slow-moving AI, and it’s kind of always been our dream to automate as much of the statistics process as possible, while recognizing that for Cantorian reasons (see section 7 here) we will never be there. Given that we’re trying, to a large extent, to turn humans into machines or to routinize what has traditionally been a human behavior that has required care, knowledge, and creativity, we should have some insight into computer programs that do such things.

In some ways, we statisticians are even more qualified to think about this than computer scientists are, in that the paradigmatic action of a computer scientist is to solve a problem, whereas the paradigmatic action of a statistician is to come up with a method that will allow other people to solve their problems.

I sent the above to Jessica, who wrote:

I like the emphasis on frictionless reproducibility as a critical driver of the success in ML. Empirical ML has clearly emphasized methods for ensuring the validity of predictive performance estimates (hold out sets, common task framework etc) compared to fields that use statistical modeling to generate explanations, like social sciences, and it does seem like that has paid off.

From my perspective, there’s something else that’s been very successful though as well – post-2015ish there’s been a heavy emphasis on making models robust to adversarial attack. Being able to take an arbitrary evaluation metric and incorporate it into your loss function so you’re explicitly training for it is also likely to improve things fast. We comment on this a bit in a paper we wrote last year reflecting on what, if anything, recent concerns about ML reproducibility and replicability have in common with the so-called replication crisis in social science.

I do think we are about at max hype currently in terms of perceived success of ML, though, and it can be hard to tell sometimes how much the emerging evidence of success from ML research is overfit to the standard benchmarks. Obviously there have been huge improvements on certain test suites, but just this morning, for instance, I saw an ML researcher present a pretty compelling graph showing that the “certified robustness” of the top LLMs (GPT-3.5, GPT-4, Llama 2, etc.), when trained on the common datasets (ImageNet, MNIST, etc.), has not really improved much at all in the past 7-8 years. This was a line graph where each line denoted changes in robustness for a different benchmark (ImageNet, MNIST, etc.) with new methodological advances. Each point in a line represented the robustness of a deep net on that particular benchmark given whatever was considered the state of the art in robust ML at that time. The x-axis was related to time, but each tick represented a particular paper that advanced the state of the art. It’s still very easy to trick LLMs into generating toxic text, leaking private data they trained on, or changing their mind based on what should be an inconsequential change to the wording of a prompt, for example.

Debate over effect of reduced prosecutions on urban homicides; also larger questions about synthetic control methods in causal inference.

Andy Wheeler writes:

I think this back and forth may be of interest to you and your readers.

There was a published paper attributing very large increases in homicides in Philadelphia (+70 homicides a year!) to the policies of progressive prosecutor Larry Krasner. A group of researchers then published a thorough critique, going through different potential variants of data and models, showing that quite a few reasonable variants estimate reduced homicides (with standard errors often covering 0):

Hogan original paper
Kaplan et al. critique
Hogan response
my writeup

I know those posts are a lot of weeds to dig into, but they touch on quite a few topics that are recurring themes for your blog—many researcher degrees of freedom in synthetic control designs, published papers getting more deference (the Kaplan critique was rejected by the same journal), a researcher not sharing data/code and using that obfuscation as a shield in response to critics (e.g. your replication data is bad so your critique is invalid).

I took a look, and . . . I think this use of synthetic control analysis is not good. I pretty much agree with Wheeler, except that I’d go further than he does in my criticism. He says the synthetic control analysis in the study in question has data issues and problems with forking paths; I’d say that even without any issues of data and forking paths (for example, had the analysis been preregistered), I still would not like it.

Overview

Before getting to the statistical details, let’s review the substantive context. From the original article by Hogan:

De-prosecution is a policy not to prosecute certain criminal offenses, regardless of whether the crimes were committed. The research question here is whether the application of a de-prosecution policy has an effect on the number of homicides for large cities in the United States. Philadelphia presents a natural experiment to examine this question. During 2010–2014, the Philadelphia District Attorney’s Office maintained a consistent and robust number of prosecutions and sentencings. During 2015–2019, the office engaged in a systematic policy of de-prosecution for both felony and misdemeanor cases. . . . Philadelphia experienced a concurrent and historically large increase in homicides.

I would phrase this slightly differently. Rather than saying, “Here’s a general research question, and we have a natural experiment to learn about it,” I’d prefer the formulation, “Here’s something interesting that happened, and let’s try to understand it.”

It’s tricky. On one hand, yes, one of the major reasons for arguing about the effect of Philadelphia’s policy on Philadelphia is to get a sense of the effect of similar policies there and elsewhere in the future. On the other hand, Hogan’s paper is very much focused on Philadelphia between 2015 and 2019. It’s not constructed as an observational study of any general question about policies. Yes, he pulls out some other cities that he characterizes as having different general policies, but there’s no attempt to fully involve those other cities in the analysis; they’re just used as comparisons to Philadelphia. So ultimately it’s an N=1 analysis—a quantitative case study—and I think the title of the paper should respect that.

Following our “Why ask why” framework, the Philadelphia story is an interesting data point motivating a more systematic study of the effect of prosecution policies on crime. For now we have this comparison of the treatment case of Philadelphia to the control of 100 other U.S. cities.

Here are some of the data. From Wheeler (2023), here’s a comparison of trends in homicide rates in Philadelphia to three other cities:

Wheeler chooses these particular three comparison cities because they were the ones that were picked by the algorithm used by Hogan (2022). Hogan’s analysis compares Philadelphia from 2015-2019 to a weighted average of Detroit, New Orleans, and New York during those years, with those cities chosen because their weighted average lined up with that of Philadelphia during the years 2010-2014. From Hogan:

As Wheeler says, it’s kinda goofy for Hogan to line these up using homicide counts rather than homicide rates . . . I’ll have more to say in a bit regarding this use of synthetic control analysis. For now, let me just note that the general pattern in Wheeler’s longer time series graph is consistent with Hogan’s story: Philadelphia’s homicide rate moved up and down over the decades, in vaguely similar ways to the other cities (increasing throughout the 1960s, slightly declining in the mid-1970s, rising again in the late 1980s, then gradually declining since 1990), but then steadily increasing from 2014 onward. I’d like to see more cities on this graph (natural comparisons to Philadelphia would be other Rust Belt cities such as Baltimore and Cleveland; also, hey, why not show a mix of other large cities such as LA, Chicago, Houston, Miami, etc.), but this is what I’ve got here. Also it’s annoying that the above graphs stop in 2019. Hogan does have this graph just for Philadelphia that goes to 2021, though:

As you can see, the increase in homicides in Philadelphia continued, which is again consistent with Hogan’s story. Why only use data up to 2019 in the analyses? Hogan writes:

The years 2020–2021 have been intentionally excluded from the analysis for two reasons. First, the AOPC and Sentencing Commission data for 2020 and 2021 were not yet available as of the writing of this article. Second, the 2020–2021 data may be viewed as aberrational because of the coronavirus pandemic and civil unrest related to the murder of George Floyd in Minnesota.

I’d still like to see the analysis including 2020 and 2021. The main analysis is the comparison of time series of homicide rates, and, for that, the AOPC and Sentencing Commission data would not be needed, right?

In any case, based on the graphs above, my overview is that, yeah, homicides went up a lot in Philadelphia since 2014, an increase that coincided with reduced prosecutions and which didn’t seem to be happening in other cities during this period. At least, so I think. I’d like to see the time series for the rates in the other 96 cities in the data as well, going from, say, 2000, all the way to 2021 (or to 2022 if homicide data from that year are now available).

I don’t have those 96 cities, but I did find this graph, going back to 2000, from a different Wheeler post:

Ignore the shaded intervals; what I care about here is the data. (And, yeah, the graph should include zero, since it’s in the neighborhood.) There has been a national increase in homicides since 2014. Unfortunately, from this national trend line alone I can’t separate out Philadelphia and any other cities that might have instituted a de-prosecution strategy during this period.

So, my summary, based on reading all the articles and discussions linked above, is . . . I just can’t say! Philadelphia’s homicide rate went up since 2014 during the same period that it decreased prosecutions, and this was part of a national trend of increased homicides—but there’s no easy way given the directly available information to compare to other cities with and without that policy. This is not to say that Hogan is wrong about the policy impacts, just that I don’t see any clear comparisons here.

The synthetic controls analysis

Hogan and the others make comparisons, but the comparisons they make are to that weighted average of Detroit, New Orleans, and New York. The trouble is . . . that’s just 3 cities, and homicide rates can vary a lot from city to city. It just doesn’t make sense to throw away the other 96 cities in your data. The implied counterfactual is that, if Philadelphia had continued post-2014 with its earlier sentencing policy, its homicide rates would look like this weighted average of Detroit, New Orleans, and New York—but there’s no reason to expect that, as this averaging is chosen by lining up the homicide rates from 2010-2014 (actually the counts and populations, not the rates, but that doesn’t affect my general point, so I’ll just talk about rates right now, as that’s what makes more sense).

And here’s the point: There’s no good reason to think that an average of three cities that gives you numbers comparable to Philadelphia’s homicide rates in the five previous years will give you a reasonable counterfactual for trends in the next five years. There’s no mathematical reason we should expect the time series to work that way, nor do I see any substantive reason based on sociology or criminology or whatever to expect anything special from a weighted average of cities that is constructed to line up with Philadelphia’s numbers for those five years.
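To be clear about what that weighted-average step is doing, here’s a minimal sketch of the weight-fitting, with made-up pre-period numbers standing in for the real data. Hogan matched on counts and populations rather than rates and used his own software, so treat this only as an illustration of the general recipe, not his analysis:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up 2010-2014 homicide rates per 100,000; these are NOT the real figures.
phila  = np.array([21.0, 21.5, 21.0, 16.0, 16.5])            # treated city
donors = np.array([[43.0, 48.0, 54.0, 45.0, 43.0],           # Detroit-like
                   [49.0, 57.0, 53.0, 41.0, 39.0],           # New Orleans-like
                   [ 6.4,  6.3,  5.1,  4.0,  3.9]])          # New York-like

# Choose nonnegative weights summing to 1 so the weighted donor average
# tracks Philadelphia during the pre-period.
def pre_period_misfit(w):
    return np.sum((phila - w @ donors) ** 2)

res = minimize(pre_period_misfit, x0=np.ones(3) / 3,
               bounds=[(0, 1)] * 3,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
print(res.x)  # weights on the three donor cities

# The implied post-2014 counterfactual is just these same weights applied to the
# donors' later rates; nothing in the fit guarantees that extrapolation is sensible.
```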

The other thing is that this weighted-average thing is not what I’d imagined when I first heard that this was a synthetic controls analysis.

My understanding of a synthetic controls analysis went like this. You want to compare Philadelphia to other cities, but there are no other cities that are just like Philadelphia, so you break up the city into neighborhoods and find comparable neighborhoods in other cities . . . and when you’re done you’ve created this composite “city,” using pieces of other cities, that functions as a pseudo-Philadelphia. In creating this composite, you use lots of neighborhood characteristics, not just matching on a single outcome variable. And then you do all of this with other cities in your treatment group (cities that followed a de-prosecution strategy).

The synthetic controls analysis here differed from what I was expecting in three ways:

1. It did not break up Philadelphia and the other cities into pieces, jigsaw-style. Instead, it formed a pseudo-Philadelphia by taking a weighted average of other cities. This is a much more limited approach, using much less information, and I don’t see it as creating a pseudo-Philadelphia in the full synthetic-controls sense.

2. It only used that one variable to match the cities, leading to concerns about comparability that Wheeler discusses.

3. It was only done for Philadelphia; that’s the N=1 problem.

Researcher degrees of freedom, forking paths, and how to think about them here

Wheeler points out many forking paths in Hogan’s analysis, lots of data-dependent decision rules in the coding and analysis. (One thing that’s come up before in other settings: At this point, you might ask how we know that Hogan’s decisions were data-dependent, as this is a counterfactual statement involving the analyses he would have done had the data been different. And my answer, as in previous cases, is that, given that the analysis was not preregistered, we can only assume it is data-dependent. I say this partly because every non-preregistered analysis I’ve ever done has been in the context of the data, and also because, if all the data coding and analysis decisions had been made ahead of time (which is what would have been required for these decisions to not be data-dependent), then why not preregister? Finally, let me emphasize that researcher degrees of freedom and forking paths do not represent criticisms or flaws of a study; they’re just a description of what was done, and in general I don’t think they’re a bad thing at all; indeed, almost all the papers I’ve ever published include many, many data-dependent coding and decision rules.)

Given all the forking paths, we should not take Hogan’s claims of statistical significance at face value, and indeed the critics find that various alternative analyses can change the results.

In their criticism, Kaplan et al. say that reasonable alternative specifications can lead to null or even opposite results compared to what Hogan reported. I don’t know if I completely buy this—given that Philadelphia’s homicide rate increased so much since 2014, it’s hard for me to see how a reasonable estimate would find that its policy reduced the homicide rate.

To me, the real concern is with comparing Philadelphia to just three other cities. Forking paths are real, but I’d have this concern even if the analysis were identical and it had been preregistered. Preregister it, whatever, you’re still only comparing to three cities, and I’d like to see more.

Not junk science, just difficult science

As Wheeler implicitly says in his discussion, Hogan’s paper is not junk science—it’s not like those papers on beauty and sex ratio, or ovulation and voting, or air rage, himmicanes, ages ending in 9, or the rest of our gallery of wasted effort. Hogan and the others are studying real issues. The problem is that the data are observational, sparse, and highly variable; that is, the problem is hard. And it doesn’t help when researchers are under the impression that these real difficulties can be easily resolved using canned statistical identification techniques. In that respect, we can draw an analogy to the notorious air-pollution-in-China paper. But this one’s even harder, in the following sense: The air-pollution-in-China paper included a graph with two screaming problems: an estimated life expectancy of 91 and an out-of-control nonlinear fitted curve. In contrast, the graphs in the Philadelphia-analysis paper all look reasonable enough. There’s nothing obviously wrong with the analysis, and the problem is a more subtle issue of the analysis not fully accounting for variation in the data.

Gigerenzer: Simple heuristics to run a research group

The cognitive psychologist writes:

Collaboration between researchers has become increasingly common, enabling a level of discovery and innovation that is difficult if not impossible to achieve by a single person. But how can one establish and maintain an environment that fosters successful collaboration within a research group? In this case study, I use my own experience when directing the ABC Research Group at the Max Planck Institute for Human Development in Berlin. I first describe the heuristic principles for setting up a research group, including (i) common topic and multiple disciplines, (ii) open culture, (iii) spatial proximity, and (iv) temporal proximity. Then I describe heuristics for maintaining the open culture, such as setting collective goals, including contrarians, distributing responsibility, making bets, the cake rule, and side-by-side writing. These heuristics form an “adaptive toolbox” that shapes the intellectual and social climate. They create a culture of friendly but rigorous discussion, embedded in a family-like climate of trust where everyone is willing to expose their ignorance and learn from the other members. Feeling accepted and trusted encourages taking the necessary risks to achieve progress in science.

This all makes sense to me. I’ve never been very good about organizing research groups myself, so probably I should try to follow this advice.

Here are Gigerenzer’s principles for How to Start a Research Group:

Principle 1: Common topic, multiple disciplines

Principle 2: Create an open culture

Principle 3: Spatial proximity

Principle 4: Temporal proximity

Then, How to Maintain a Culture:

Set collective goals

Distribute responsibility

Secure open culture

One thing I’d like to hear more about is the difficulty of setting all of this up. Here are some challenges:

– To ensure spatial proximity, you need an institution to commit to the space, which in turn can require “politics”; that is, negotiation with powerful people at the institution to secure the space as needed.

– To ensure temporal proximity, you need a steady flow of funds, which requires fundraising or grant-writing. The challenge is to be able to do this without being overwhelmed, as in some biomedical labs where it seems that the only thing ever going on is writing grant proposals.

– There’s also the challenge of getting people in the same room at a time when email and remote meetings are so easy.

There are other challenges too. Setting up an open culture isn’t hard for me, but some of the other steps have been difficult. One thing I like is that the lab that Gigerenzer runs is not called the Gigerenzer Lab. So many professors do this, having the lab named after themselves, and that just seems so tacky.

No, this paper on strip clubs and sex crimes was never gonna get retracted. Also, a reminder of the importance of data quality, and a reflection on why researchers often think it’s just fine to publish papers using bad data under the mistaken belief that these analyses are “conservative” or “attenuated” or something like that.

Brandon Del Pozo writes:

Born in Bensonhurst, Brooklyn in the 1970s, I came to public health research by way of 23 years as a police officer, including 19 years in the NYPD and four as a chief of police in Vermont. Even more tortuously, my doctoral training was in philosophy at the CUNY Graduate Center.

I am writing at the advice of colleagues because I remain extraordinarily vexed by a paper that came out in 2021. It purports to measure the effects of opening strip clubs on sex crimes in NYC at the precinct level, and finds substantial reductions within a week of opening each club. The problem is the paper is implausible from the outset because it uses completely inappropriate data that anyone familiar with the phenomena would find preposterous. My colleagues and I, who were custodians of the data and participants in the processes under study when we were police officers, wrote a very detailed critique of the paper and called for its retraction. Beyond our own assertions, we contacted state agencies who went on the record about the problems with the data as well.

For their part, the authors and editors have been remarkably dismissive of our concerns. They said, principally, that we are making too big a deal out of the measures being imprecise and a little noisy. But we are saying something different: the study has no construct validity because it is impossible to measure the actual phenomena under study using its data.

Here is our critique, which will soon be out in Police Practice and Research. Here is the letter from the journal editors, and here is a link to some coverage in Retraction Watch. I guess my main problem is the extent to which this type of problem was missed or ignored in the peer review process, and why it is being so casually dismissed now. Is it a matter of economists circling their wagons?

My reply:

1. Your criticisms seem sensible to me. I also have further concerns with the data (or maybe you pointed these out in your article and I did not notice), in particular the distribution of data in Figure 1 of the original article. Most weeks there seem to be approximately 20 sex crime stops (which they misleadingly label as “sex crimes”), but then there’s one week with nearly 200? This makes me wonder what is going on with these data.
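As a quick back-of-the-envelope check on how extreme that week is, suppose (hypothetically) that weekly stops really did fluctuate around a stable mean of about 20; a count near 200 would then be essentially impossible, which suggests the series is capturing something else that week:

```python
from scipy.stats import poisson

# If weekly counts were roughly Poisson with mean 20...
print(poisson.sf(199, mu=20))    # P(count >= 200): effectively zero
print(poisson.ppf(0.9999, 20))   # even the 99.99th percentile is under 40
```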

2. I see from the Retraction Watch article that one of the authors responded, “As far as I am concerned, a serious (scientifically sound) confutation of the original thesis has not been given yet.” This raises the interesting question of burden of proof. Before the article is accepted for publication, it is the authors’ job to convincingly justify their claim. After publication, the author is saying that the burden is on the critic (i.e., you). To put it another way: had your comment been in a pre-publication referee report, it should’ve been enough to make the editors reject the paper or at least require more from the authors. But post-publication is another story, at least according to current scientific conventions.

3. From a methodological standpoint, the authors follow the very standard approach of doing an analysis, finding something, then performing a bunch of auxiliary analyses (robustness checks) to rule out alternative explanations. I am skeptical of robustness checks; see also here. In some ways, the situation is kind of hopeless, in that, as researchers, we are trained to respond to questions and criticism by trying our hardest to preserve our original conclusions.

4. One thing I’ve noticed in a lot of social science research is a casual attitude toward measurement. See here for the general point, and over the years we’ve discussed lots of examples, such as arm circumference being used as a proxy for upper-body strength (we call that the “fat arms” study) and a series of papers characterizing days 6-14 of the menstrual cycle as the days of peak fertility, even though the days of peak fertility vary a lot from woman to woman, with a consensus summary being days 10-17. The short version of the problem here, especially in econometrics, is that there’s a general understanding that if you use bad measurements, it should attenuate (that is, pull toward zero) your estimated effect sizes; hence, if someone points out a measurement problem, a common reaction is to think that it’s no big deal because, if the measurements are off, that just led to “conservative” estimates. Eric Loken and I wrote this article once to explain the point, but the message has mostly not been received.
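Here’s a small simulation of the point, with assumed numbers: a small true effect, a noisy measurement, and selection on statistical significance. On average the noise does attenuate the estimate, but the estimates that clear the significance bar, the ones that tend to get published, come out exaggerated rather than conservative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n, noise_sd = 0.1, 50, 1.0   # assumed values, chosen for illustration
sig_estimates = []

for _ in range(5000):
    x = rng.normal(size=n)
    x_measured = x + rng.normal(scale=noise_sd, size=n)  # error-laden measurement
    y = true_effect * x + rng.normal(size=n)
    slope, intercept, r, pval, se = stats.linregress(x_measured, y)
    if pval < 0.05:
        sig_estimates.append(abs(slope))

# The noisy measurement pulls the average estimate toward zero, but conditional on
# statistical significance the estimates are several times larger than the true 0.1.
print(len(sig_estimates), round(float(np.mean(sig_estimates)), 3))
```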

5. Given all the above, I can see how the authors of the original paper would be annoyed. They’re following standard practice, their paper got accepted, and now all of a sudden they’re appearing in Retraction Watch!

6. Separate from all the above, there’s no way that paper was ever going to be retracted. The problem is that journals and scholars treat retraction as a punishment of the authors, not as a correction of the scholarly literature. It’s pretty much impossible to get an involuntary retraction without there being some belief that there has been wrongdoing. See discussion here. In practice, a fatal error in a paper is not enough to force retraction.

7. In summary, no, I don’t think it’s “economists circling their wagons.” I think this is a mix of several factors: a high bar for post-publication review, a general unconcern with measurement validity and reliability, a trust in robustness checks, and the fact that retraction was never a serious option. Given that the authors of the original paper were not going to issue a correction on their own, the best outcome for you was to either publish a response in the original journal (which would’ve been accompanied by a rebuttal from the original authors) or to publish in a different journal, which is what happened. Beyond all this, the discussion quickly gets technical. I’ve done some work on stop-and-frisk data myself and I have decades of experience reading social science papers, but even I was getting confused by all the moving parts, and indeed I could well imagine being convinced by someone on the other side that your critiques were irrelevant. The point is that the journal editors are not going to feel comfortable making that judgment, any more than I would be.

Del Pozo responded by clarifying some points:

Regarding the data with outliers in my point 1 above, Del Pozo writes, “My guess is that this was a week when there was an intense search for a wanted pattern rape suspect. Many people were stopped by police above the average of 20 per week, and at least 179 of them were innocent. We discuss this in our reply; not only do these reports not record crimes in nearly all cases, but several reports may reflect police stops of innocent people in the search for one wanted suspect. It is impossible to measure crime with stop reports.”

Regarding the issue of pre-publication and post-publication review in my point 2 above, Del Pozo writes, “We asked the journal to release the anonymized peer reviews to see if anyone had at least taken up this problem during review. We offered to retract all of our own work and issue a written apology if someone had done basic due diligence on the matter of measurement during peer review. They never acknowledged or responded to our request. We also wrote that it is not good science when reviewers miss glaring problems and then other researchers have to upend their own research agenda to spend time correcting the scholarly record in the face of stubborn resistance that seems more about pride than science. None of this will get us a good publication, a grant, or tenure, after all. I promise we were much more tactful and diplomatic than that, but that was the gist. We are police researchers, not the research police.”

To paraphrase Thomas Basbøll, they are not the research police because there is no such thing as the research police.

Regarding my point 3 on the lure of robustness checks and their problems, Del Pozo writes, “The first author of the publication was defensive and dismissive when we were all on a Zoom together. It was nothing personal, but an Italian living in Spain was telling four US police officers, three of whom were in the NYPD, that he, not us, better understood the use and limits of NYPD and NYC administrative data and the process of gaining the approvals to open a strip club. The robustness checks all still used opening dates based on registration dates, which do not associate with actual opening in even a remotely plausible way to allow for a study of effects within a week of registration. Any analysis with integrity would have to exclude all of the data for the independent variable.”

Regarding my point 4 on researchers’ seemingly-strong statistical justifications for going with bad measurements, Del Pozo writes, “Yes, the authors literally said that their measurement errors at T=0 weren’t a problem because the possibility of attenuation made it more likely that their rejection of the null was actually based on a conservative estimate. But this is the point: the data cannot possibly measure what they need it to, in seeking to reject the null. It measures changes in encounters with innocent people after someone has let New York State know that they plan to open a business in a few months, and purports to say that this shows sex crimes go down the week after a person opens a sex club. I would feel fraudulent if I knew this about my research and allowed people to cite it as knowledge.”

Regarding my point 6 that just about nothing ever gets involuntarily retracted without a finding of research misconduct, Del Pozo points to an “exception that proves the rule: a retraction for the inadvertent pooling of heterogeneous results in a meta analysis that was missed during peer review, and nothing more.”

Regarding my conclusions in point 7 above, Del Pozo writes, “I was thinking of submitting a formal replication to the journal that began with examining the model, determining there were fatal measurement errors, then excluding all inappropriate data, i.e., all the data for the independent variable and 96% of the data for the dependent variable, thereby yielding no results, and preventing rejection of the null. Voila, a replication. I would be so curious to see a reviewer in the position of having to defend the inclusion of inappropriate data in a replication. The problem of course is replications are normatively structured to assume the measurements are sound, and if anything you keep them all and introduce a previously omitted variable or something. I would be transgressing norms with such a replication. I presume it would be desk rejected.”

Yup, I think such a replication would be rejected for two reasons. First, journals want to publish new stuff, not replications. Second, they’d see it as a criticism of a paper they’d published, and journals usually don’t like that either.