More than 10k scientific papers were retracted in 2023

Hi all, here to talk about one of my favorite scientific topics: integrity and correction of science.

Here is some good news for most of us and for humanity: more than 10,000 scientific papers have been retracted this year. Aside from the researchers who received these retraction notices (some of them for multiple papers), and the publishers, this is quite good news, I would argue. It comes after a big year for this topic and for the detection of fraudulent practices (see, for instance, how Guillaume Cabanac easily found papers generated by ChatGPT) and of very problematic journals, with Hindawi journals probably being more problematic than most. Many retractions and reports have focused on duplicated images or the use of tortured phrases. New fraudulent practices have also emerged and been uncovered (see, for instance, our findings on “sneaked references,” whereby some editors and journals have manipulated the metadata of accepted papers to inflate citations of specific scholars and journals).

Of course, some like me may always see the glass half empty, and I would still argue that probably many more papers should have been retracted and that, as I have lamented many times, the process of correcting the scientific literature is too slow, too opaque, and too bureaucratic, while at the same time not protecting, funding, or rewarding the hard-working sleuths behind the work. Most of the sleuthing work takes place in spite of, rather than thanks to, the present publication and editorial system. Often the data or metadata that would facilitate investigations are not published or available (e.g., metadata about ethics approval or about reviewing practices).

Still, I suppose it is a kind of victory that sleuthing work is taken seriously these days, and I would like to take the opportunity of this milestone of 10,000 retracted papers to invite some of you to also participate in PubPeer discussions. I am sure your input would be quite helpful there.

Happy to read thoughts and comments on the milestone and its importance. I will continue to write (a bit more regularly I hope) here on this topic.

Lonni Besançon

“Has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?”

Kelsey Piper writes:

I’m writing about the replication crisis for Vox and I was wondering if you saw this blog post from one of the DARPA replication project participants, particularly the section that argues:

I frequently encounter the notion that after the replication crisis hit there was some sort of great improvement in the social sciences, that people wouldn’t even dream of publishing studies based on 23 undergraduates any more (I actually saw plenty of those), etc. Stuart Ritchie’s new book praises psychologists for developing “systematic ways to address” the flaws in their discipline. In reality there has been no discernible improvement.

Your blog post yesterday about scientists who don’t care about doing science struck a similar tone, and I was curious: do you think we’re in a better place w/r/t the replication crisis than we were ten years ago? Or has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?

My discussion of that above-quoted blog post appeared a couple years ago. I agreed with some of that post and disagreed with other parts.

Regarding Piper’s question, “has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?,” I don’t think the influence has been zero! For one thing, this crisis has influenced my own research practices, and I assume it’s influenced many others as well. And it’s my general impression that journals such as Psychological Science and PNAS don’t publish as much junk as they used to. I haven’t done any formal study of this, though.

P.S. For some other relevant recent discussions, see More on possibly rigor-enhancing practices in quantitative psychology research and (back to basics:) How is statistics relevant to scientific discovery?.

Exploring pre-registration for predictive modeling

This is Jessica. Jake Hofman, Angelos Chatzimparmpas, Amit Sharma, Duncan Watts, and I write:

Amid rising concerns of reproducibility and generalizability in predictive modeling, we explore the possibility and potential benefits of introducing pre-registration to the field. Despite notable advancements in predictive modeling, spanning core machine learning tasks to various scientific applications, challenges such as data-dependent decision-making and unintentional re-use of test data have raised questions about the integrity of results. To help address these issues, we propose adapting pre-registration practices from explanatory modeling to predictive modeling. We discuss current best practices in predictive modeling and their limitations, introduce a lightweight pre-registration template, and present a qualitative study with machine learning researchers to gain insight into the effectiveness of pre-registration in preventing biased estimates and promoting more reliable research outcomes. We conclude by exploring the scope of problems that pre-registration can address in predictive modeling and acknowledging its limitations within this context.

Pre-registration is no silver bullet for good science, as we discuss in the paper and later in this post. However, my coauthors and I are cautiously optimistic that adapting the practice could help address a few problems that can arise in predictive modeling pipelines, like research on applied machine learning. Specifically, there are two categories of concerns where pre-specifying the learning problem and strategy may lead to more reliable estimates. 

First, most applications of machine learning are evaluated using predictive performance. Usually we evaluate this on held-out test data, because it’s too costly to obtain a continuous stream of new data for training, validation and testing. The separation is crucial: performance on held-out test data is arguably the key criterion in ML, so making reliable estimates of it is critical to avoid a misleading research literature. If we mess up and access the test data during training (test set leakage), then the results we report are overfit. It’s surprisingly easy to do this (see e.g., this taxonomy of types of leakage that occur in practice). While pre-registration cannot guarantee that we won’t still do this anyway, having to determine details like how exactly features and test data will be constructed a priori could presumably help authors catch some mistakes they might otherwise make.
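
To make the leakage point concrete, here is a minimal sketch, assuming scikit-learn and entirely synthetic data (my own illustration, not an example from our paper or the linked taxonomy): selecting features on the full dataset before splitting inflates held-out accuracy even when the labels are pure noise.

```python
# Minimal sketch of one easy-to-make form of test-set leakage (illustrative only).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # many noise features
y = rng.integers(0, 2, size=200)   # labels unrelated to X

# Leaky: feature selection sees the test rows, so the "informative" features
# are partly informative about the test labels themselves.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=1)
leaky_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Clean: split first; selection and model fitting happen on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean_acc = pipe.fit(X_tr, y_tr).score(X_te, y_te)

print(f"leaky accuracy ~{leaky_acc:.2f}, clean accuracy ~{clean_acc:.2f}")
# The leaky number typically lands above 0.5 despite the random labels;
# the clean number hovers around chance.
```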

Beyond test set leakage, other types of data-dependent decisions threaten the validity of test performance estimates. Predictive modeling problems admit many degrees-of-freedom that authors can (often unintentionally) exploit in the interest of pushing the results in favor of some a priori hypothesis, similar to the garden of forking paths in social science modeling. For example, researchers may spend more time tuning their proposed methods than baselines they compare to, making it look like their new method is superior when it is not. They might report on straw man baselines after comparing test accuracy across multiple variations. They might only report the performance metrics that make test performance look best. Etc. Our sense is that most of the time this is happening implicitly: people end up trying harder for the things they are invested in. Fraud is not the central issue, so giving people tools to help them avoid unintentionally overfitting is worth exploring.
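
As a toy illustration of how this kind of implicit selection biases test estimates upward (my own sketch, not an analysis from the paper), imagine ten pipeline variants that are all exactly as good as one another; reporting the best observed test number still overstates the truth.

```python
# Toy sketch: "try several variants, report the best test accuracy" inflates
# estimates even when every variant has identical true performance.
import numpy as np

rng = np.random.default_rng(42)
true_accuracy = 0.70   # every variant is truly 70% accurate
n_test = 500           # size of the held-out test set
n_variants = 10        # hyperparameter settings, metrics, baselines tried
n_sims = 2000

best_of_k = []
for _ in range(n_sims):
    # observed test accuracy per variant: binomial noise around the truth
    observed = rng.binomial(n_test, true_accuracy, size=n_variants) / n_test
    best_of_k.append(observed.max())

print(f"true accuracy: {true_accuracy:.3f}")
print(f"mean reported (best of {n_variants}): {np.mean(best_of_k):.3f}")
# The reported number is biased upward by selection, not by any real gain.
```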

Whenever the research goal is to provide evidence on the predictability of some phenomena (Can we predict depression from social media? Can we predict civil war onset? etc.) there’s a risk that we exploit some freedoms in translating the high level research goal to a specific predictive modeling exercise. To take an example my co-authors have previously discussed, when predicting how many re-posts a social media post will get based on properties of the person who originally posted, even with the dataset and model specification held fixed, exercising just a few degrees of freedom can change the qualitative nature of the results. If you treat it as a classification problem and build a model to predict whether a post will receive at least 10 re-posts, you can get accuracy close to 100%. If you treat it as a regression problem and predict how many re-posts a given post gets without any data filtering, R^2 hovers around 35%. The problem is that only a small fraction of posts exceed the threshold of 10 re-posts, and predicting which posts do—and how far they spread—is very hard.  Even when the drift in goal happens prior to test set access, the results can paint an overly optimistic picture. Again pre-registering offers no guarantees of greater construct validity, but it’s a way of encouraging authors to remain aware of such drift. 
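
Here is a toy sketch of that framing effect using synthetic heavy-tailed counts (my illustration with scikit-learn, not the social media data my coauthors analyzed): the classification framing looks impressive largely because of class imbalance, while the regression framing on the same data looks much weaker.

```python
# Same synthetic data, two framings: threshold classification vs. count regression.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(7)
n = 20000
X = rng.normal(size=(n, 5))  # poster features
# heavy-tailed re-post counts, only weakly related to the features
reposts = rng.poisson(np.exp(0.3 * X[:, 0] + rng.normal(scale=1.5, size=n)))

X_tr, X_te, y_tr, y_te = train_test_split(X, reposts, random_state=0)

# Classification framing: most posts fall below the threshold, so even a model
# that mostly predicts "no" gets high accuracy.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr >= 10)
print("classification accuracy:", accuracy_score(y_te >= 10, clf.predict(X_te)))

# Regression framing: predicting the actual count is much harder.
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", r2_score(y_te, reg.predict(X_te)))
```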

The specific proposal

One challenge we run into in applying pre-registration to predictive modeling is that because we usually aren’t aiming for explanation, we’re willing to throw lots of features into our model, even if we’re not sure how they could meaningfully contribute, and we’re agnostic to what sort of model we use so long as its inductive bias seems to work for our scenario. Deciding the model class ahead of time as we do in pre-registering explanatory models can be needlessly restrictive. So, the protocol we propose has two parts. 

First, prior to training, one answers the following questions, which are designed to be addressable before looking at any correlations between features and outcomes:

Phase 1 of the protocol: learning problem, variables, dataset creation, transformations, metrics, baselines

Then, after training and validation but before accessing test data, one answers the remaining questions:

Phase 2 of the protocol: prediction method, training details, test-set access, anything else
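
Purely as an illustration of what gets pinned down at each phase, here is a sketch of the template’s structure as Python dataclasses; the field names are my paraphrase of the questions, and the actual forms are in the repo mentioned next.

```python
# Illustrative sketch of the two-phase template structure (not the real forms).
from dataclasses import dataclass

@dataclass
class Phase1:
    """Answered before any training, before looking at feature-outcome relationships."""
    learning_problem: str   # e.g., "predict depression from survey responses"
    variables: str          # outcome and candidate features
    dataset_creation: str   # sampling, filtering, train/validation/test split
    transformations: str    # planned feature and label preprocessing
    metrics: str            # e.g., "accuracy and AUC on the held-out test set"
    baselines: str          # comparison methods and how much tuning they get

@dataclass
class Phase2:
    """Answered after training and validation, but before touching the test set."""
    prediction_method: str  # final model class and configuration
    training_details: str   # tuning budget, model-selection procedure
    access_test: bool       # are we ready to access the test data?
    anything_else: str      # deviations from Phase 1, notes
```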

Authors who want to try it can grab the forms by forking this dedicated GitHub repo and include them in their own repository.

What we’ve learned so far

To get a sense of whether researchers could benefit from this protocol, we observed as six ML Ph.D. students applied it to a prediction problem we provided (predicting depression in teens using responses to the 2016 Monitoring the Future survey of 12th graders, subsampled from data used by Orben and Przybylski). This helped us see where they struggled to pre-specify decisions in phase 1, presumably because doing so was out of line with their usual process of figuring some things out as they conducted model training and validation. We had to remind several to be specific about metrics and data transformations in particular. 

We also asked them in an exit interview what else they might have tried if their test performance had been lower than they expected. Half of the six participants described procedures that, if not fully reported, seemed likely to compromise the validity of their test estimates (things like going back to re-tune hyperparameters and then trying again on test data). This suggests that there’s an opportunity for pre-registration, if widely adopted, to play a role in reinforcing good workflow. This may be especially useful in fields where ML models are being applied but expertise in predictive modeling is still sparse.

The caveats 

It was reassuring to directly observe examples where this protocol, if followed, might have prevented overfitting. However, the fact that we saw these issues despite having explained and motivated pre-registration during these sessions, and walked the participants through it, suggests that pre-specifying certain components of a learning pipeline alone is not necessarily enough to prevent overfitting. 

It was also notable that while all of the participants but one saw value in pre-registering, their specific understandings of why and how it could work varied. There was as much variety in their understandings of pre-registration as there was in the ways they approached the same learning problem. Pre-registration is not going to be the same thing to everyone, nor will it be used the same way, because the ways it helps are multi-faceted. As a result, it’s dangerous to interpret the mere act of pre-registration as a stamp of good science. 

I have some major misgivings about putting too much faith in the idea that publicly pre-registering guarantees that estimates are valid, and I hope that this protocol gets used responsibly: as something authors choose to do because they feel it helps them prevent unintentional overfitting, rather than as a simple solution that guarantees to the world that your estimates are gold. It was nice to observe that a couple of study participants seemed particularly drawn to pre-registering for its perceived “intrinsic” value, remarking on its usefulness as a personally-imposed set of constraints to incorporate into their typical workflow.

It won’t work for all research projects. One participant figured out while talking aloud that prior work he’d done identifying certain behaviors in transformer models would have been hard to pre-register because it was exploratory in nature.

Another participant fixated on how the protocol was still vulnerable: people could lie about not having already experimented with training and validation, there’s no guarantee that the train/test split authors describe is what they actually used to produce their estimates, etc. Computer scientists tend to be good at imagining loopholes that adversarial attacks could exploit, so maybe they will be less likely to oversell pre-registration as guaranteeing validity. At the end of the day, it’s still an honor system. 

As we’ve written before, part of the issue with many claims in ML-based research is that performance estimates for some new approach often represent something closer to best-case performance due to overlooked degrees of freedom, but they get interpreted as expected performance. Pre-registration is an attempt at ensuring that the estimates that get reported are more likely to represent what they’re meant to be. Maybe it’s better, though, to try to change readers’ perception that such estimates can be taken at face value to begin with. I’m not sure. 

We’re open to feedback on the specific protocol we provide and curious to hear how it works out for those who try it. 

P.S. Against my better judgment, I decided to go to NeurIPS this year. If you want to chat pre-registration or threats to the validity of ML performance estimates find me there Wed through Sat.

Modest pre-registration

This is Jessica. In light of the hassles that can arise when authors make clear that they value pre-registration by writing papers about its effectiveness but then they can’t find their pre-registration, I have been re-considering how I feel about the value of the public aspects of pre-registration. 

I personally find pre-registration useful, especially when working with graduate students (as I am almost always doing). It gets us to agree on what we are actually hoping to see and how we are going to define the key quantities we compare. I trust my Ph.D. students, but when we pre-register we are more likely to find the gaps between our goals and the analyses that we can actually do because we have it all in a single document that we know cannot be further revised after we start collecting data.

Shravan Vasishth put it well in a comment on a previous post:

My lab has been doing pre-registrations for several years now, and most of the time what I learned from the pre-registration was that we didn’t really adequately think about what we would do once we have the data. My lab and I are getting better at this now, but it took many attempts to do a pre-registration that actually made sense once the data were in. That said, it’s still better to do a pre-registration than not, if only for the experimenter’s own sake (as a sanity pre-check). 

The part I find icky is that as soon as pre-registration gets discussed outside the lab, it often gets applied and interpreted as a symbol that the research is rigorous. Like the authors who pre-register must be doing “real science.” But there’s nothing about pre-registration to stop sloppy thinking, whether that means inappropriate causal inference, underspecification of the target population, overfitting to the specific experimental conditions, etc.

The Protzko et al. example could be taken as unusual, in that we might not expect the average reviewer to feel the need to double-check the pre-registration when they see that the author list includes Nosek and Nelson. On the other hand, we could see it as particularly damning evidence of how pre-registration can fail in practice, when some of the researchers we associate with the highest standards of methodological rigor do not appear to take the claims they make about their own practices seriously enough to make sure they can back them up when asked. 

My skepticism about how seriously we should take public declarations of pre-registration is influenced by my experience as author and reviewer, where, at least in the venues I’ve published in, when you describe your work as pre-registered it wins points with reviewers, increasing the chances that someone will comment about the methodological rigor, that your paper will win an award, etc. However, I highly doubt the modal reviewer or reader is checking the preregistration. At least, no reviewer has ever asked a single question about the pre-registration in any of the studies I’ve ever submitted, and I’ve been using pre-registration for at least 5 or 6 years. I guess it’s possible they are checking it and it’s just all so perfectly laid out in our documents and followed to a T that there’s nothing to question. But I doubt that… surely at some point we’ve forgotten to fully report a pre-specified exploratory analysis, or the pre-registration wasn’t clear, or something else like that. Not a single question ever seems fishy.

Something I dislike about authors’ incentives when reporting on their methods in general is that reviewers (and readers) can often be unimaginative. So what the authors say about their work can set the tone for how the paper is received. I hate when authors describe their own work in a paper as “rigorous” or “highly ecologically valid” or “first to show” rather than just allowing the details to speak for themselves. It feels like cheap marketing. But I can understand why some do it, because one really can impress some readers by saying such things. Hence, winning points for mentioning pre-registration, with no real checks and balances, can be a real issue.  

How should we use pre-registration in light of all this? If nobody cares to do the checking, but extra credit is being handed out when authors slap the “pre-registered” label on their work, maybe we want to pre-register more quietly.

At the extreme, we could pre-register amongst ourselves, in our labs or whatever, without telling everyone about it. Notify our collaborators by email or Slack or whatever else when we’ve pinned down the analysis plan and are ready to collect the data, but not expect anyone else to care, except maybe when they notice that our research is well-engineered in general, because we are the kind of authors who do our best to keep ourselves honest, use transparent methods, and subject our data to sensitivity analyses anyway.

I’ve implied before on the blog that pre-registration is something I find personally useful but see externally as a gesture toward transparency more than anything else. If we can’t trust authors when they claim to pre-register, but we don’t expect the reviewing or reading standards in our communities to evolve to the point where checking to see what it actually says becomes mainstream, then we could just omit the signaling aspect altogether and continue to trust that people are doing their best. I’m not convinced we would lose much in such a world as pre-registration is currently practiced in the areas I work in. Maybe the only real way to fix science is to expect people to find reasons to be self-motivated to do good work. And if they don’t, well, it’s probably going to be obvious in other ways than just a lack of pre-registration. Bad reasoning should be obvious and if it’s not, maybe we should spend more time training students on how to recognize it.

But of course this seems unrealistic, since you can’t stop people from saying things in papers that they think reviewers will find relevant. And many reviewers have already shown they find it relevant to hear about a pre-registration. Plus, of course, the only real benefit we can say with certainty that pre-registration provides is that if one pre-registers, others can verify to what extent the analysis was planned beforehand and was therefore less subject to authors exploiting degrees of freedom, so we’d lose this.  

An alternative strategy is to be more specific about pre-registration while crowing about it less. Include the pre-registration link in your manuscript, but stop with all the label-dropping that often occurs in the abstract, the introduction, and sometimes the title itself, announcing that the study is pre-registered. (I have to admit, I have been guilty of this, but from now on I intend to remove such statements from papers I’m on.)

Pre-registration statements should be more specific, in light of the fact that we can’t expect reviewers to catch deviations themselves. E.g., if you follow your pre-registration to a T, say something like “For each of our experiments, we report all sample sizes, conditions, data exclusions, and measures for the main analyses that were described in our pre-registration documents. We do not report any analyses that were not included in our pre-registration.” That makes it clear what you are knowingly claiming regarding the pre-registration status of your work. 

Of course, some people may say reasonably specific things even when they can’t back them up with a pre-registration document. But being specific at least acknowledges that a pre-registration is actually a bundle of details that we must mind if we’re going to claim to have done it, because they should impact how it’s assessed. Plus maybe the act of typing out specific propositions would remind some authors to check what their pre-registration actually says. 

If you don’t follow your pre-registration to a T, which I’m guessing is more common in practice, then there are a few strategies I could see using:

Put in a dedicated paragraph before you describe results detailing all deviations from what you pre-registered. If it’s a whole lot of stuff, perhaps the act of writing this paragraph will convince you to just skip reporting on the pre-registration altogether because it clearly didn’t work out. 

Label each individual comparison/test as pre-registered versus not as you walk through the results. Personally I think this makes things harder to keep track of than a single dedicated paragraph, but maybe there are occasionally situations where it’s better.

(back to basics:) How is statistics relevant to scientific discovery?

Following up on today’s post, “Why I continue to support the science reform movement despite its flaws,” it seems worth linking to this post from 2019, about the way in which some mainstream academic social psychologists have moved beyond denial, to a more realistic view that accepts that failure is a routine, indeed inevitable part of science, and that, just because a claim is published, even in a prestigious journal, that doesn’t mean it has to be correct:

Once you accept that the replication rate is not 100%, nor should it be, and once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists whom academic insiders used to refer to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is, ultimately, the discovery of truths that ultimately would be discovered by someone else sometime in the future, thus, the speeding along of a process that we’d hope would happen anyway, so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery. . . .

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. . . .

Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start. . . .

What, then, is—or should be—the role of statistics, and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making.

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work. . . .

We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

– Reduce the costs of failed experimentation by being more clear when research-based claims are speculative.

– React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

– Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

– Avoid the two-tier system. Give respect to a student project or Arxiv paper just as you would to a paper published in Science or Nature.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

There’s more at the link, and also let me again plug my recent article, Before data analysis: Additional recommendations for designing experiments to learn about the world.

Why I continue to support the science reform movement despite its flaws

I was having a discussion with someone about problems with the science reform movement (as discussed here by Jessica), and he shared his opinion that “Scientific reform in some corners has elements of millenarian cults. In their view, science is not making progress because of individual failings (bias, fraud, qrps) and that if we follow a set of rituals (power analysis, preregistration) devised by the leaders than we can usher in a new era where the truth is revealed (high replicability).”

My quick reaction was that this reminded me of an annoying thing where people use “religion” as a term of insult. When this came up before, I wrote that maybe it’s time to retire use of the term “religion” to mean “uncritical belief in something I disagree with.”

But then I was thinking about this all from another direction, and I think there’s something there there. Not the “millenarian cults” thing, which I think was an overreaction on my correspondent’s part.

Rather, I see a paradox. From his perspective, my correspondent sees the science reform movement as having a narrow perspective, an enforced conformity that leads it into unforced errors such as publishing a high-profile paper promoting preregistration without actually itself following preregistered analysis plans. OK, he doesn’t see all of the science reform movement as being so narrow—for one thing, I’m part of the science reform movement and I wasn’t part of that project!—but he sees some core of the movement as being stuck in narrow rituals and leader-worship.

But I think it’s kind of the opposite. From my perspective, the core of the science reform movement (the Open Science Framework, etc.) has had to make all sorts of compromises with conservative forces in the science establishment, especially within academic psychology, in order to keep them on board. To get funding, institutional support, buy-in from key players, . . . that takes a lot of political maneuvering.

I don’t say this lightly, and I’m not using “political” as a put-down. I’m a political scientist, but personally I’m not very good at politics. Politics takes hard work, requiring lots of patience and negotiation. I’m impatient and I hate negotiation; I’d much rather just put all my cards face-up on the table. For some activities, such as blogging and collaborative science, these traits are helpful. I can’t collaborate with everybody, but when the connection’s there, it can really work.

But there’s more to the world than this sort of small-group work. Building and maintaining larger institutions, that’s important too.

So here’s my point: Some core problems with the open-science movement are not a product of cult-like groupthink. Rather, it’s the opposite: this core has been structured out of a compromise with some groups within psychology who are tied to old-fashioned thinking, and this politically-necessary (perhaps) compromise has led to some incoherence, in particular the attitude or hope that, by just including some preregistration here and getting rid of some questionable research practices there, everyone could pretty much continue with business as usual.

Summary

The open-science movement has always had a tension between burn-it-all-down and here’s-one-quick-trick. Put them together and it kinda sounds like a cult that can’t see outward, but I see it as more the opposite, as an awkward coalition representing fundamentally incoherent views. But both sides of the coalition need each other: the reformers need the old institutional powers to make a real difference in practice, and the oldsters need the reformers because outsiders are losing confidence in the system.

The good news

The good news for me is that both groups within this coalition should be able to appreciate frank criticism from the outside (they can listen to me scream and get something out of it, even if they don’t agree with all my claims) and should also be able to appreciate research methods: once you accept the basic tenets of the science reform movement, there are clear benefits to better measurement, better design, and better analysis. In the old world of p-hacking, there was no real reason to do your studies well, as you could get statistical significance and publication with any old random numbers, along with a few framing tricks. In the new world of science reform (even imperfect science reform), this sort of noise mining isn’t so effective, and traditional statistical ideas of measurement, design, and analysis become relevant again.

So that’s one reason I’m cool with the science reform movement. I think it’s in the right direction: its dot product with the ideal direction is positive. But I’m not so good at politics so I can’t resist criticizing it too. It’s all good.

Reactions

I sent the above to my correspondent, who wrote:

I don’t think it is a literal cult in the sense that carries the normative judgments and pejorative connotations we usually ascribe to cults and religions. The analogy was more of a shorthand to highlight a common dynamic that emerges when you have a shared sense of crisis, ritualistic/procedural solutions, and a hope that merely performing these activities will get past the crisis and bring about a brighter future. This is a spot where group-think can, and at times possibly should, kick in. People don’t have time to each individually and critically evaluate the solutions, and often the claim is that they need to be implemented broadly to work. Sometimes these dynamics reflect a real problem with real solutions, sometimes they’re totally off the rails. All this is not to say I’m opposed to scientific reform; I’m very much for it in the general sense. There’s no shortage of room for improvement in how we turn observations into understanding, from improving statistical literacy and theory development to transparency and fostering healthier incentives. I am, however, wary of the uncritical belief that the crisis is simply one of failed replications and that the performance of “open science rituals” is sufficient for reform, across the breadth of things we consider science. As a minor point, I don’t think the vast majority of prominent figures in open science intend for these dynamics to occur, but I do think they all should be wary of them.

There does seem to be a problem that many researchers are too committed to the “estimate the effect” paradigm and don’t fully grapple with the consequences of high variability. This is particularly disturbing in psychology, given that just about all psychology experiments study interactions, not main effects. Thus, a claim that effect sizes don’t vary much is a claim that effect sizes vary a lot in the dimension being studied, but have very little variation in other dimensions. Which doesn’t make a lot of sense to me.

Getting back to the open-science movement, I want to emphasize the level of effort it takes to conduct and coordinate these big group efforts, along with the effort required to keep together the coalition of skeptics (who see preregistration as a tool for shooting down false claims) and true believers (who see preregistration as a way to defuse skepticism about their claims) and get these papers published in top journals. I’d also say it takes a lot of effort for them to get funding, but that would be kind of a cheap shot, given that I too put in a lot of effort to get funding!

Anyway, to continue, I think that some of the problems with the science reform movement are that it effectively promises different things to different people. And another problem is with these massive projects that inevitably include things that not all the authors will agree with.

So, yeah, I have a problem with simplistic science reform prescriptions, for example recommendations to increase sample size without any nod toward effect size and measurement. But much much worse, in my opinion, are the claims of success we’ve seen from researchers and advocates who are outside the science-reform movement. I’m thinking here about ridiculous statements such as the unfounded claim of 17 replications of power pose, or the endless stream of hype from the nudgelords, or the “sleep is your superpower” guy, or my personal favorite, the unfounded claim from Harvard that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

It’s almost enough to stop here with the remark that the scientific reform movement has been lucky in its enemies.

But I also want to say that I appreciate that the “left wing” of the science reform movement—the researchers who envision replication and preregistration and the threat of replication and preregistration as a tool to shoot down bad studies—have indeed faced real resistance within academia and the news media to their efforts, as lots of people will hate the bearers of bad news. And I also appreciate the “right wing” of the science reform movement—the researchers who envision replication and preregistration as a way to validate their studies and refute the critics—in that they’re willing to put their ideas to the test. Not always perfectly, but you have to start somewhere.

While I remain annoyed at certain aspects of the mainstream science reform movement, especially when it manifests itself in mass-authored articles such as the notorious recent non-preregistered paper on the effects of preregistration, or that “Redefine statistical significance” article, or various p-value hardliners we’ve encountered over the decades, I also respect the political challenges of coalition-building that are evident in that movement.

So my plan remains to appreciate the movement while continuing to criticize its statements that seem wrong or do not make sense.

I sent the above to Jessica Hullman, who wrote:

I can relate to being surprised by the reactions of open science enthusiasts to certain lines of questioning. In my view, how to fix science is about as complicated a question as we will encounter. The certainty and level of comfort with making bold claims that many advocates of open science seem to have is hard for me to understand. Maybe that is just the way the world works, or at least the way it works if you want to get your ideas published in venues like PNAS or Nature. But the sensitivity to what gets said in public venues against certain open science practices or people reminds me very much of established academics trying to hush talk about problems in psychology, as though questioning certain things is off limits. I’ve been surprised on the blog for example when I think aloud about something like preregistration being imperfect and some commenters seem to have a visceral negative reaction to seeing something like that written. To me that’s the opposite of how we should be thinking.

As an aside, someone I’m collaborating with recently described to me his understanding of the strategy for getting published in PNAS. It was 1. Say something timely/interesting, 2. Don’t be wrong. He explained that ‘Don’t be wrong’ could be accomplished by preregistering and large sample size. Naturally I was surprised to hear #2 described as if it’s really that easy. Silly me for spending all this time thinking so hard about other aspects of methods!

The idea of necessary politics is interesting; not what I would have thought of but probably some truth to it. For me many of the challenges of trying to reform science boil down to people being heuristic-needing agents. We accept that many problems arise from ritualistic behavior, but we have trouble overcoming that, perhaps because no matter how thoughtful/nuanced some may prefer to be, there’s always a larger group who want simple fixes / aren’t incentivized to go there. It’s hard to have broad appeal without being reductionist I guess.

Hey! Here’s how to rewrite and improve the title and abstract of a scientific paper:

Last week in class we read and then rewrote the title and abstract of a paper. We did it again yesterday, this time with one of my recent unpublished papers.

Here’s what I had originally:

title: Unifying design-based and model-based sampling inference by estimating a joint population distribution for weights and outcomes

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

Not terrible, but we can do better. Here’s the new version:

title: MRP using sampling weights

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights come from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

How did we get there?

The title. The original title was fine—it starts with some advertising (“Unifying design-based and model-based sampling inference”) and follows up with a description of how the method works (“estimating a joint population distribution for weights and outcomes”).

But the main point of the title is to get the notice of potential readers, people who might find the paper useful or interesting (or both!).

This pushes the question back one step: Who would find this paper useful or interesting? Anyone who works with sampling weights. Anyone who uses public survey data or, more generally, surveys collected by others, which typically contain sampling weights. And anyone who’d like to follow my path in survey analysis, which would be all the people out there who use MRP (multilevel regression and poststratification). Hence the new title, which is crisp, clear, and focused.

My only problem with the new title, “MRP using sampling weights,” is that it doesn’t clearly convey that the paper involves new research. It makes it look like a review article. But that’s not so horrible; people often like to learn from review articles.

The abstract. If you look carefully, you’ll see that the new abstract is the same as the original abstract, except that we replaced the middle part:

But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses.

with this:

But what if you don’t know where the weights come from?

Here’s what happened. We started by rereading the original abstract carefully. That abstract has some long sentences that are hard to follow. The first sentence is already kinda complicated, but I decided to keep it, because it clearly lays out the problem, and also I think the reader of an abstract will be willing to work a bit when reading the first sentence. Getting to the abstract at all is a kind of commitment.

The second sentence, though, that’s another tangle, and at this point the reader is tempted to give up and just skate along to the end—which I don’t want! The third sentence isn’t horrible, but it’s still a little bit long (starting with the nearly-contentless “It is also not clear how one is supposed to account for” and ending with the unnecessary “in such analyses”). Also, we don’t even really talk much about clustering in the paper! So it was a no-brainer to collapse these into a sentence that was much more snappy and direct.

Finally, yeah, the final sentence of the abstract is kinda technical, but (a) the paper’s technical, and we want to convey some of its content in the abstract!, and (b) after that new, crisp, replacement second sentence, I think the reader is ready to take a breath and hear what the paper is all about.

General principles

Here’s a general template for a research paper:
1. What is the goal or general problem?
2. Why is it important?
3. What is the challenge?
4. What is the solution? What must be done to implement this solution?
5. If the idea in this paper is so great, why wasn’t the problem already solved by someone else?
6. What are the limitations of the proposed solution? What is its domain of applicability?

We used these principles in our rewriting of my title and abstract. The first step was for me to answer the above 6 questions:
1. Goal is to do survey inference with sampling weights.
2. It’s important for zillions of researchers who use existing surveys which come with weights.
3. The challenge is that if you don’t know where the weights come from, you can’t just follow the recommended approach to condition in the regression model on the information that is predictive of inclusion into the sample.
4. The solution is to condition on the weights themselves, which involves the additional step of estimating a joint population distribution for the weights and other predictors in the model (a crude sketch of the general idea follows this list).
5. The problem involves a new concept (imagining a population distribution for weights, which is not a coherent assumption, because, in the real world, weights are constructed based on the data) and some new mathematical steps (not inherently sophisticated as mathematics, but new work from a statistical perspective). Also, the idea of modeling the weights is not completely new; there is some related literature, and one of our contributions is to take weights (which are typically constructed from a non-Bayesian design-based perspective) and use them in a Bayesian analysis.
6. Survey weights do not include all design information, so the solution offered in the paper can only be approximate. In addition the method requires distributional assumptions on the weights; also it’s a new method so who knows how useful it will be in practice.
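
To give some intuition for point 4 above, here is a deliberately crude sketch that bins the weights rather than fitting a joint model; it is my toy illustration on made-up data, not the quasi-Bayesian procedure the abstract describes.

```python
# Crude sketch: model the outcome given the weight, estimate a population
# distribution for the weight, then poststratify. Toy illustration only.
import numpy as np

rng = np.random.default_rng(3)
n = 2000
w = np.exp(rng.normal(0, 0.5, size=n))          # sampling weights supplied with the data
p = 1 / (1 + np.exp(-(0.5 * np.log(w) - 0.2)))  # outcome probability related to the weight
y = rng.binomial(1, p)

# Step 1: estimate E[y | weight], here with cell means over weight deciles.
edges = np.quantile(w, np.linspace(0, 1, 11))
cell = np.clip(np.digitize(w, edges[1:-1]), 0, 9)
y_given_cell = np.array([y[cell == c].mean() for c in range(10)])

# Step 2: estimate each cell's share of the population. With no outside
# information, take a cell's share proportional to the total weight it carries.
pop_share = np.array([w[cell == c].sum() for c in range(10)])
pop_share /= pop_share.sum()

# Step 3: poststratify, i.e., average the cell estimates over the population shares.
print("poststratified estimate:", float(y_given_cell @ pop_share))
print("classical weighted mean:", float(np.average(y, weights=w)))
```

In this binned version the answer lands close to the classical weighted mean; the payoff of the model-based route is regularization and the ability to produce small-area estimates and regressions, as the abstract says.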

We can’t put all of that in the abstract, but we were able to include some versions of the answers to questions 1, 3, and 4. Questions 5 and 6 are important, but it’s ok to leave them to the paper, as this is where readers will typically search for limitations and connections to the literature.

Maybe we should include the answer to question 2 in the abstract, though. Perhaps we could replace “But what if you don’t know where the weights come from?” with “But what if you don’t know where the weights come from? This is often a problem when analyzing surveys collected by others.”

Summary

By thinking carefully about goals and audience, we improved the title and abstract of a scientific paper. You should be able to do this in your own work!

Of course it’s preregistered. Just give me a sec

This is Jessica. I was going to post something on Bak-Coleman and Devezer’s response to the Protzko et al. paper on the replicability of research that uses rigor-enhancing practices like large samples, preregistration, confirmatory tests, and methodological transparency, but Andrew beat me to it. But since his post didn’t get into one of the surprising aspects of their analysis (beyond the paper making a causal claim without a study design capable of assessing causality), I’ll blog on it anyway.

Bak-Coleman and Devezer describe three ways in which the measure of replicability that Protzko et al. use to argue that the 16 effects they study are more replicable than effects in prior studies deviates from prior definitions of replicability:

  1. Protzko et al. define replicability as the chance that any replication achieves significance in the hypothesized direction as opposed to whether the results of the confirmation study and the replication were consistent 
  2. They include self-replications in calculating the rate
  3. They include repeated replications of the same effect and replications across different effects in calculating the rate

Could these deviations in how replicability is defined have been decided post-hoc, so that the authors could present positive evidence for their hypothesis that rigor-enhancing practices work? If they preregistered their definition of replicability, we would not be so concerned about this possibility.  Luckily, the authors report that “All confirmatory tests, replications and analyses were preregistered both in the individual studies (Supplementary Information section 3 and Supplementary Table 2) and for this meta-project (https://osf.io/6t9vm).”

But wait – according to Bak-Coleman and Devezer:

the analysis on which the titular claim depends was not preregistered. There is no mention of examining the relationship between replicability and rigor-improving methods, nor even how replicability would be operationalized despite extensive descriptions of the calculations of other quantities. With nothing indicating this comparison or metric it rests on were planned a priori, it is hard to distinguish the core claim in this paper from selective reporting and hypothesizing after the results are known. 

Uh-oh, that’s not good. At this point, some OSF sleuthing was needed. I poked around the link above, and the associated project containing analysis code. There are a couple analysis plans: Proposed Overarching Analyses for Decline Effect final.docx, from 2018, and Decline Effect Exploratory analyses and secondary data projects P4.docx, from 2019. However, these do not appear to describe the primary analysis of replicability in the paper (the first describes an analysis that ends up in the Appendix, and the second a bunch of exploratory analyses that don’t appear in the paper). About a year later, the analysis notebooks with the results they present in the main body of the paper were added. 

According to Bak-Coleman on X/Twitter: 

We emailed the authors a week ago. They’ve been responsive but as of now, they can’t say one way or another if the analyses correspond to a preregistration. They think they may be in some documentation.

In the best case scenario where the missing preregistration is soon found, this example suggests that there are still many readers and reviewers for whom some signal of rigor suffices even when the evidence of it is lacking. In this case, maybe the reputation of authors like Nosek reduced the perceived need on the part of the reviewers to track down the actual preregistration. But of course, even those who invented rigor-enhancing practices can still make mistakes!

In the alternative scenario where the preregistration is not found soon, what is the correct course of action? Surely at least a correction is in order? Otherwise we might all feel compelled to try our luck at signaling preregistration without having to inconvenience ourselves by actually doing it.

More optimistically, perhaps there are exciting new research directions that could come out of this. Like wearable preregistration, since we know from centuries of research and practice that it’s harder to lose something when it’s sewn to your person. Or we could submit our preregistrations to OpenAI, I mean Microsoft, who could make a ChatGPT-enabled Preregistration Buddy that has not only trained on your preregistration but also knows how to please a human judge who wants to ask questions about what it said.

More on possibly rigor-enhancing practices in quantitative psychology research

In a paper entitled “Causal claims about scientific rigor require rigorous causal evidence,” Joseph Bak-Coleman and Berna Devezer write:

Protzko et al. (2023) claim that “High replicability of newly discovered social-behavioral findings is achievable.” They argue that the 86% rate of replication observed in their replication studies is due to “rigor-enhancing practices” such as confirmatory tests, large sample sizes, preregistration and methodological transparency. These findings promise hope as concerns over low rates of replication have plagued the social sciences for more than a decade. Unfortunately, the observational design of the study does not support its key causal claim. Instead, inference relies on a post hoc comparison of a tenuous metric of replicability to past research that relied on incommensurable metrics and sampling frames.

The article they’re referring to is by a team of psychologists (John Protzko, Jon Krosnick, et al.) reporting “an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using rigor-enhancing practices: confirmatory tests, large sample sizes, preregistration, and methodological transparency. . . .”

When I heard about that paper, I teed off on their proposed list of rigor-enhancing practices.

I’ve got no problem with large sample sizes, preregistration, and methodological transparency. And confirmatory tests can be fine too, as long as they’re not misinterpreted and not used for decision making.

My biggest concern is that the authors or readers of that article will think that these are the best rigor-enhancing practices in science (or social science, or psychology, or social psychology, etc.), or the first rigor-enhancing practices that researchers should reach for, or the most important rigor-enhancing practices, or anything like that.

Instead, I gave my top 5 rigor-enhancing practices, in approximately decreasing order of importance:

1. Make it clear what you’re actually doing. Describe manipulations, exposures, and measurements fully and clearly.

2. Increase your effect size, e.g., do a more effective treatment.

3. Focus your study on the people and scenarios where effects are likely to be largest.

4. Improve your outcome measurement.

5. Improve pre-treatment measurements.

The suggestions of “confirmatory tests, large sample sizes, preregistration, and methodological transparency” are all fine, but I think all are less important than the 5 steps listed above. You can read the linked post to see my reasoning; also there’s Pam Davis-Kean’s summary, “Know what the hell you are doing with your research.” You might say that goes without saying, but it doesn’t, even in some papers published in top journals such as Psychological Science and PNAS!

You can also read a response to my post from Brian Nosek, a leader in the replication movement and one of the coauthors of the article being discussed.

In their new article, Bak-Coleman and Devezer take a different tack than me, in that they’re focused on challenges of measuring replicability of empirical claims in psychology, whereas I was more interested in the design of future studies. To a large extent, I find the whole replicability thing important to the extent that it gives researchers and users of research less trust in generic statistics-backed claims; I’d guess that actual effects typically vary so much based on context that new general findings are mostly not to be trusted. So I’d say that Protzko et al., Nosek, Bak-Coleman and Devezer, and I are coming from four different directions. (Yes, I recognize that Nosek is one of the authors of the Protzko et al. paper; still, in his blog comment he seemed to have a slightly different perspective). The article by Bak-Coleman and Devezer seems very relevant to any attempt to understand the empirical claims of Protzko et al.

The rise and fall of Seth Roberts and the Shangri-La diet

Here’s a post that’s suitable for the Thanksgiving season.

I no longer believe in the Shangri-La diet. Here’s the story.

Background

I met Seth Roberts back in the early 1990s when we were both professors at the University of California. He sometimes came to the statistics department seminar and we got to talking about various things; in particular we shared an interest in statistical graphics. Much of my work in this direction eventually went toward the use of graphical displays to understand fitted models. Seth went in another direction and got interested in the role of exploratory data analysis in science, the idea that we could use graphs not just to test or even understand a model but also as the source of new hypotheses. We continued to discuss these issues over the years.

At some point when we were at Berkeley the administration was encouraging the faculty to teach freshman seminars, and I had the idea of teaching a course on left-handedness. I’d just read the book by Stanley Coren and thought it would be fun to go through it with a class, chapter by chapter. But my knowledge of psychology was minimal so I contacted the one person I knew in the psychology department and asked him if he had any suggestions of someone who’d like to teach the course with me. Seth responded that he’d be interested in doing it himself, and we did it.

Seth was an unusual guy—not always in a good way, but some of his positive traits were friendliness, inquisitiveness, and an openness to consider new ideas. He also struggled with mood swings, social awkwardness, and difficulties with sleep, and he attempted to address these problems with self-experimentation.

After we taught the class together we got together regularly for lunch and Seth told me about his efforts in self-experimentation involving sleeping hours and mood. Most interesting to me was his discovery that seeing life-sized faces in the morning helped with his mood. I can’t remember how he came up with this idea, but perhaps he started by following the recommendation that is often given to people with insomnia to turn off TV and other sources of artificial light in the evening. Seth got in the habit of taping late-night talk-show monologues and then watching them in the morning while he ate breakfast. He found himself happier, did some experimentation, and concluded that we had evolved to talk with people in the morning, and that life-sized faces were necessary. Seth lived alone, so the more natural approach of talking over breakfast with a partner was not available.

Seth’s self-experimentation went slowly, with lots of dead-ends and restarts, which makes sense given the difficulty of his projects. I was always impressed by Seth’s dedication in this, putting in the effort day after day for years. Or maybe it did not represent a huge amount of labor for him, perhaps it was something like a diary or blog which is pleasurable to create, even if it seems from the outside to be a lot of work. In any case, from my perspective, the sustained focus was impressive. He had worked for years to solve his sleep problems and only then turned to the experiments on mood.

Seth’s academic career was unusual. He shot through college and graduate school to a tenure-track job at a top university, then continued to do publication-quality research for several years until receiving tenure. At that point he was not a superstar but I think he was still considered a respected member of the mainstream academic community. But during the years that followed, Seth lost interest in that thread of research. He told me once that his shift was motivated by teaching introductory undergraduate psychology: the students, he said, were interested in things that would affect their lives, and, compared to that, the kind of research that leads to a productive academic career did not seem so appealing.

I suppose that Seth could’ve tried to do research in clinical psychology (Berkeley’s department actually has a strong clinical program) but instead he moved in a different direction and tried different things to improve his sleep and then, later, his skin, his mood, and his diet. In this work, Seth applied what he later called his “insider/outsider perspective”: he was an insider in that he applied what he’d learned from years of research on animal behavior, an outsider in that he was not working within the existing paradigm of research in physiology and nutrition.

At the same time he was working on a book project, which I believe started as a new introductory psychology course focused on science and self-improvement but ultimately morphed into a trade book on ways in which our adaptations to Stone Age life were not serving us well in the modern era. I liked the book but I don’t think he found a publisher. In the years since, this general concept has been widely advanced and many books have been published on the topic.

When Seth came up with the connection between morning faces and depression, this seemed potentially hugely important. Were the faces really doing anything? I have no idea. On one hand, Seth was measuring his own happiness and administering his own treatments based on his own hypothesis, so the potential for expectation effects is huge. On the other hand, he said the effect he discovered was a surprise to him and he also reported that the treatment worked with others. Neither he nor, as far as I know, anyone else has attempted a controlled trial of this idea.

In his self-experimentation, Seth lived the contradiction between the two tenets of evidence-based medicine: (1) Try everything, measure everything, record everything; and (2) Make general recommendations based on statistical evidence rather than anecdotes.

Seth’s ideas were extremely evidence-based in that they were based on data that he gathered himself or that people personally sent in to him, and he did use the statistical evidence of his self-measurements, but he did not put in much effort to reduce, control, or adjust for biases in his measurements, nor did he systematically gather data on multiple people.

The Shangri-La diet

Seth’s next success after curing his depression was losing 40 pounds on an unusual diet that he came up with, in which you can eat whatever you want as long as each day you drink a cup of unflavored sugar water, at least an hour before or after a meal. The way he theorized that his diet worked was that the carefully-timed sugar water had the effect of reducing the association between calories and flavor, thus lowering your weight set-point and making you uninterested in eating lots of food.

I asked Seth once if he thought I’d lose weight if I were to try his diet in a passive way, drinking the sugar water at the recommended time but not actively trying to reduce my caloric intake. He said he supposed not, that the diet would make it easier to lose weight but I’d probably still have to consciously eat less.

I described Seth’s diet to one of my psychologist colleagues at Columbia and asked what he thought of it. My colleague said he thought it was ridiculous. And, as with the depression treatment, Seth never had an interest in running a controlled trial, even for the purpose of convincing the skeptics.

I had a conversation with Seth about this. He said he’d tried lots of diets and none had worked for him. I suggested that maybe he was just ready at last to eat less and lose weight, and he said he’d been ready for awhile but this was the first diet that allowed him to eat less without difficulty. I suggested that maybe the theory underlying Seth’s diet was compelling enough to act as a sort of placebo, motivating him to follow the protocol. Seth responded that other people had tried his diet and lost weight with it. He also reminded me that it’s generally accepted that “diets don’t work” and that people who lose weight while dieting will usually gain it all back. He felt that his diet was different in that it didn’t tell you what foods to eat or how much; rather, it changed your set point so that you didn’t want to eat so much. I found Seth’s arguments persuasive. I didn’t feel that his diet had been proved effective, but I thought it might really work, I told people about it, and I was happy about its success. Unlike my Columbia colleague, I didn’t think the idea was ridiculous.

Media exposure and success

Seth’s breakout success happened gradually, starting with a 2005 article on self-experimentation in Behavioral and Brain Sciences, a journal that publishes long articles followed by short discussions from many experts. Some of his findings from the ten experiments discussed in the article:

Seeing faces in the morning on television decreased mood in the evening and improved mood the next day . . . Standing 8 hours per day reduced early awakening and made sleep more restorative . . . Drinking unflavored fructose water caused a large weight loss that has lasted more than 1 year . . .

As Seth described it, self-experimentation generates new hypotheses and is also an inexpensive way to test and modify them. The article does not seem to have had a huge effect within research psychology (Google Scholar gives it 93 cites) but two of its contributions—the idea of systematic self-experimentation and the weight-loss method—have spread throughout the popular culture in various ways. Seth’s work was featured in a series of increasingly prominent blogs, which led to a newspaper article by the authors of Freakonomics and ultimately a successful diet book (not enough to make Seth rich, I think, but Seth had simple tastes and no desire to be rich, as far as I know). Meanwhile, Seth started a blog of his own which led to a message board for his diet that he told me had thousands of participants.

Seth achieved some measure of internet fame, with fans including Nassim Taleb, Steven Levitt, Dennis Prager, Tucker Max, Tyler Cowen, . . . and me! In retrospect, I don’t think having all this appreciation was good for him. On his blog and elsewhere Seth reported success with various self-experiments, the last of which was a claim of improved brain function after eating half a stick of butter a day. Even while maintaining interest in Seth’s ideas on mood and diet, I was entirely skeptical of his new claims, partly because of his increasing rate of claimed successes. It took Seth close to 10 years of sustained experimentation to fix his sleep problems, but in later years it seemed that all sorts of different things he tried were effective. His apparent success rate was implausibly high. What was going on? One problem is that sleep hours and weight can be measured fairly objectively, whereas if you measure brain function by giving yourself little quizzes, it doesn’t seem hard at all for a bit of unconscious bias to drive all your results. I also wonder if Seth’s blog audience was a problem: if you have people cheering on your every move, it can be that much easier to fool yourself.

Seth also started to go down some internet rabbit holes. On one hand, he was a left-wing Berkeley professor who supported universal health care, Amnesty International, and other liberal causes. On the other hand, his paleo-diet enthusiasm brought him close to various internet right-wingers, and he was into global warming denial and kinda sympathetic to Holocaust denial, not because he was a Nazi or anything but just because he had a distrust-of-authority thing going on. I guess that if he’d been an adult back in the 1950s and 1960s he would’ve been on the extreme left, but more recently it’s been the far right where the rebels are hanging out. Seth also had sympathy for some absolutely ridiculous and innumerate research on sex ratios and absolutely loved the since-discredited work of food behavior researcher Brian Wansink; see here and here. The point here is not that Seth believed things that turned out to be false—that happens to all of us—but rather that he had a soft spot for extreme claims that were wrapped in the language of science.

Back to Shangri-La

A few years ago, Seth passed away, and I didn’t think of him too often, but then a couple years ago my doctor told me that my cholesterol level was too high. He prescribed a pill, which I’m still taking every day, and he told me to switch to a mostly-plant diet and lose a bunch of weight.

My first thought was to try the Shangri-La diet. That cup of unflavored sugar water, at least an hour before or after a meal. Or maybe I did the spoonful of unflavored olive oil, I can’t remember which. Anyway, I tried it for a few days, also following the advice to eat less. And then after a few days, I thought: if the point is to eat less, why not just do that? So that’s what I did. No sugar water or olive oil needed.

What’s the point of this story? Not that losing the weight was easy for me. For a few years before that fateful conversation, my doctor had been bugging me to lose weight, and I’d vaguely wanted that to happen, but it hadn’t. What worked was me having this clear goal and motivation. And it’s not like I’m starving all the time. I’m fine; I just changed my eating patterns, and I take in a lot less energy every day.

But here’s a funny thing. Suppose I’d stuck with the sugar water and everything else had been the same. Then I’d have lost all this weight, exactly when I’d switched to the new diet. I’d be another enthusiastic Shangri-La believer, and I’d be telling you, truthfully, that only since switching to that diet had I been able to comfortably eat less. But I didn’t stick with Shangri-La and I lost the weight anyway, so I won’t make that attribution.

OK, so after that experience I had a lot less belief in Seth’s diet. The flip side of being convinced by his earlier self-experiment was becoming unconvinced after my own self-experiment.

And that’s where I stood until I saw this post at the blog Slime Mold Time Mold about informal experimentation:

For the potato diet, we started with case studies like Andrew Taylor and Penn Jillette; we recruited some friends to try nothing but potatoes for several days; and one of the SMTM authors tried the all-potato diet for a couple weeks.

For the potassium trial, two SMTM hive mind members tried the low-dose potassium protocol for a couple of weeks and lost weight without any negative side effects. Then we got a couple of friends to try it for just a couple of days to make sure that there weren’t any side effects for them either.

For the half-tato diet, we didn’t explicitly organize things this way, but we looked at three very similar case studies that, taken together, are essentially an N = 3 pilot of the half-tato diet protocol. No idea if the half-tato effect will generalize beyond Nicky Case and M, but the fact that it generalizes between them is pretty interesting. We also happened to know about a couple of other friends who had also tried versions of the half-tato diet with good results.

My point here is not to delve into the details of these new diets, but rather to point out that they are like the Shangri-La diet in that each is different from other diets, is associated with some theory, was evaluated through before-after studies on some people who wanted to lose weight, and yielded success.

At this point, though, my conclusion is not that unflavored sugar water is effective in making it easy to lose weight, or that unflavored oil works, or that potatoes work, or that potassium works. Rather, the hypothesis that’s most plausible to me is that, if you’re at the right stage of motivation, anything can work.

Or, to put it another way, I now believe that the observed effect of the Shangri-La diet, the potato diet, etc., comes from a mixture of placebo and selection. The placebo is that just about any gimmick can help you lose weight, and keep the weight off, if it somehow motivates you to eat less. The selection is that, once you’re ready to try something like this diet, you might be ready to eat less.

But what about “diets don’t work”? I guess that diets don’t work for most people at most times. But the people trying these diets are not “most people at most times.” They’re people with a high motivation to eat less and lose weight.

I’m not saying I have an ironclad case here. I’m pretty much now in the position of my Columbia colleague who felt that there’s no good reason to believe that Seth’s diet is more effective than any other arbitrary series of rules that somewhere includes the suggestion to eat less. And, yes, I have the same impression of the potato diet and the other ideas mentioned above. It’s just funny that it took so long for me to reach this position.

Back to Seth

I wouldn’t say the internet killed Seth Roberts, but ultimately I don’t think it did him any favors to become an internet hero, in the same way that it’s not always good for an ungrounded person to become an academic hero, or an athletic hero, or a musical hero, or a literary hero, or a military hero, or any other kind of hero. The stuff that got you to heroism can be a great service to the world, but what comes next can be a challenge.

Seth ended up believing in his own hype. In this case, the hype was not that he was an amazing genius; rather, the hype was about his method, the idea that he had discovered modern self-experimentation (to the extent that this rediscovery can be attributed to anybody, it should be to Seth’s undergraduate adviser, Allen Neuringer, in this article from 1981). Maybe even without his internet fame Seth would’ve gone off the deep end and started to believe he was regularly making major discoveries; I don’t know.

From a scientific standpoint, Seth’s writings are an example of the principle that honesty and transparency are not enough. He clearly described what he did, but his experiments got to be so flawed as to be essentially useless.

After I posted my obituary of Seth (from which I took much of the beginning of this post), there were many moving tributes in the comments, and I concluded by writing, “It is good that he found an online community of people who valued him.” That’s how I felt at the time, but in retrospect, maybe not. If I could’ve done it all over again, I never would’ve promoted his diet, a promotion that led to all the rest.

I’d guess that the wide dissemination of Seth’s ideas was a net benefit to the world. Even if his diet idea is bogus, it seems to have made a difference to a lot of people. And even if the discoveries he reported from his self-experimentation (eating half a stick of butter a day improving brain functioning and all the rest) were nothing but artifacts of his hopeful measurement protocols, the idea of self-experimentation was empowering to people—and I’m assuming that even his true believers (other than himself) weren’t actually doing the butter thing.

Setting aside the effects on others, though, I don’t think that this online community was good for Seth in his own work or for his personal life. In some ways he was ahead of his time, as nowadays we’re hearing a lot about people getting sucked into cult-like vortexes of misinformation.

P.S. Lots of discussion in comments, including this from the Slime Mold Time Mold bloggers.

Dorothy Bishop on the prevalence of scientific fraud

Following up on our discussion of replicability, here are some thoughts from psychology researcher Dorothy Bishop on scientific fraud:

In recent months, I [Bishop] have become convinced of two things: first, fraud is a far more serious problem than most scientists recognise, and second, we cannot continue to leave the task of tackling it to volunteer sleuths.

If you ask a typical scientist about fraud, they will usually tell you it is extremely rare, and that it would be a mistake to damage confidence in science because of the activities of a few unprincipled individuals. . . . we are reassured [that] science is self-correcting . . .

The problem with this argument is that, on the one hand, we only know about the fraudsters who get caught, and on the other hand, science is not prospering particularly well – numerous published papers produce results that fail to replicate and major discoveries are few and far between . . . We are swamped with scientific publications, but it is increasingly hard to distinguish the signal from the noise.

Bishop summarizes:

It is getting to the point where in many fields it is impossible to build a cumulative science, because we lack a solid foundation of trustworthy findings. And it’s getting worse and worse. . . . in clinical areas, there is growing concern that systematic reviews that are supposed to synthesise evidence to get at the truth instead lead to confusion because a high proportion of studies are fraudulent.

Also:

[A] more indirect negative consequence of the explosion in published fraud is that those who have committed fraud can rise to positions of influence and eminence on the back of their misdeeds. They may become editors, with the power to publish further fraudulent papers in return for money, and if promoted to professorships they will train a whole new generation of fraudsters, while being careful to sideline any honest young scientists who want to do things properly. I fear in some institutions this has already happened.

Given all the above, it’s unsurprising that, in Bishop’s words,

To date, the response of the scientific establishment has been wholly inadequate. There is little attempt to proactively check for fraud . . . Even when evidence of misconduct is strong, it can take months or years for a paper to be retracted. . . . this relaxed attitude to the fraud epidemic is a disaster-in-waiting.

What to do? Bishop recommends that some subset of researchers be trained as “data sleuths,” to move beyond the current whistleblower-and-vigilante system into something more like “the equivalent of a police force.”

I don’t know what to think about that. On one hand, I agree that whistleblowers and critics don’t get the support that they deserve; on the other hand, we might be concerned about who would be attracted to the job of official police officer here.

Setting aside concerns about Bishop’s proposed solution, I do see her larger point about the scientific publication process being so broken that it can actively interfere with the development of science. In a situation parallel to Cantor’s diagonal argument or Russell’s theory of types, it would seem that we need a scientific literature, and then, alongside it, a vetted scientific literature, and then, alongside that, another level of vetting, and so on. In medical research this sort of system has existed for decades, with a huge number of journals for the publication of original studies; and then another, smaller but still immense, set of journals that publish nothing but systematic reviews; and then some distillations that make their way into policy and practice.

Clarke’s Law

And don’t forget Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud. All the above problems also arise with the sorts of useless noise mining we’ve been discussing in this space for nearly twenty years now. I assume most of those papers do not involve fraud, and even when there are clearly bad statistical practices such as rooting around for statistical significance, I expect that the perpetrators think of these research violations as merely serving the goal of larger truths.

So it’s not just fraud. Not by a longshot.

Also, remember the quote from Bishop above: “those who have committed fraud can rise to positions of influence and eminence on the back of their misdeeds. They may become editors, with the power to publish further fraudulent papers in return for money, and if promoted to professorships they will train a whole new generation of fraudsters, while being careful to sideline any honest young scientists who want to do things properly. I fear in some institutions this has already happened.” Replace “fraud” by “crappy research” and, yeah, we’ve been there for awhile!

P.S. Mark Tuttle points us to this news article by Richard Van Noorden, “How big is science’s fake-paper problem?”, that makes a similar point.

Brian Nosek on “Here are some ways of making your study replicable”:

Brian Nosek is a leader of the replication movement in science and a coauthor of an article on replicability that we discussed the other day.

They discussed the rigor-enhancing practices of “confirmatory tests, large sample sizes, preregistration, and methodological transparency,” and in my post I wrote that those were not the first things I’d suggest to increase rigor in science. My recommendations were (1) Make it clear what you’re actually doing, (2) Increase your effect size, e.g., do a more effective treatment, (3) Focus your study on the people and scenarios where effects are likely to be largest, (4) Improve your outcome measurement: a more focused and less variable outcome measure, (5) Improve pre-treatment measurements, and finally (6) the methods listed in the above-linked article: “confirmatory tests, large sample sizes, preregistration, and methodological transparency.”

I sent this post to Nosek, and he replied:

For your list of practices:

#1: We did this for both methodological and statistical practices.

#2: I suspect that every lab was motivated to get the largest effect that they could given the research question that they were studying (ours certainly was). But, you’ll observe in the findings that we didn’t get very large effect sizes on average. Instead, they are what I believe are around what most “real” effect sizes are for the messy concepts that social scientists study.

#3: We didn’t do this. Each lab used a sampling firm and all studies were conducted through that firm. It is possible that a lab would have tried to tailor the design to the sample, but these were very heterogeneous samples, so that would not likely have been very effective.

#4: I suspect that every lab did this the best that they could. Simultaneously, most of the research in this is pretty on-the-edge discovery work, so not necessarily a lot of existing evidence to make use of (with variation across experiments and labs).

#5: I suspect that this was done for a couple of experiments from some labs, but not others. (None from mine did so.)

I like all of your suggestions for improving rigor. I would counterargue that some of them become more meaningfully impactful on the research process as the evidence base matures (e.g., where to get the largest effect size, what are effective pretreatment measurements). In the context of discovery research like the experiments in this paper, we could only speculate about these in trying to design the most rigorous studies. The practices that we highlight are “easily” applied no matter the maturity of the domain and evidence base.

On your other points: I think the paper provides proof-of-concept that even small effects are highly replicable. And, I am much more sanguine than you are about the benefits of preregistration. Maybe we can find some time to argue about that in the future!

I disagree with Geoff Hinton regarding “glorified autocomplete”

Computer scientist and “godfather of AI” Geoff Hinton says this about chatbots:

“People say, It’s just glorified autocomplete . . . Now, let’s analyze that. Suppose you want to be really good at predicting the next word. If you want to be really good, you have to understand what’s being said. That’s the only way. So by training something to be really good at predicting the next word, you’re actually forcing it to understand. Yes, it’s ‘autocomplete’—but you didn’t think through what it means to have a really good autocomplete.”

This got me thinking about what I do at work, for example in a research meeting. I spend a lot of time doing “glorified autocomplete” in the style of a well-trained chatbot: Someone describes some problem, I listen and it reminds me of a related issue I’ve thought about before, and I’m acting as a sort of FAQ, but more like a chatbot than a FAQ in that the people who are talking with me do not need to navigate through the FAQ to find the answer that is most relevant to them; I’m doing that myself and giving a response.

I do that sort of thing a lot in meetings, and it can work well, indeed often I think this sort of shallow, associative response can be more effective than whatever I’d get from a direct attack on the problem in question. After all, the people I’m talking with have already thought for awhile about whatever it is they’re working on, and my initial thoughts may well be in the wrong direction, or else my thoughts are in the right direction but are just retracing my collaborators’ past ideas. From the other direction, my shallow thoughts can be useful in representing insights from problems that these collaborators had not ever thought about much before. Nonspecific suggestions on multilevel modeling or statistical graphics or simulation or whatever can really help!

At some point, though, I’ll typically have to bite the bullet and think hard, not necessarily reaching full understanding in the sense of mentally embedding the problem at hand into a coherent schema or logical framework, but still going through whatever steps of logical reasoning that I can. This feels different than autocomplete; it requires an additional level of focus. Often I need to consciously “flip the switch,” as it were, to turn on that focus and think rigorously. Other times, I’m doing autocomplete and either come to a sticking point or encounter an interesting idea, and this causes me to stop and think.

It’s almost like the difference between jogging and running. I can jog and jog and jog, thinking about all sorts of things and not feeling like I’m expending much effort, my legs pretty much move up and down of their own accord . . . but then if I need to run, that takes concentration.

Here’s another example. Yesterday I participated in the methods colloquium in our political science department. It was Don Green and me and a bunch of students, and the structure was that Don asked me questions, I responded with various statistics-related and social-science-related musings and stories, students followed up with questions, I responded with more stories, etc. Kinda like the way things go here on the blog, but spoken rather than typed. Anyway, the point is that most of my responses were a sort of autocomplete—not in a word-by-word chatbot style, more at a larger level of chunkiness, for example something would remind me of a story, and then I’d just insert the story into my conversation—but still at this shallow, pleasant level. Mellow conversation with no intellectual or social strain. But then, every once in awhile, I’d pull up short and have some new thought, some juxtaposition that had never occurred to me before, and I’d need to think things through.

This also happens when I give prepared talks. My prepared talks are not super-well prepared—this is on purpose, as I find that too much preparation can inhibit flow. In any case, I’ll often find myself stopping and pausing to reconsider something or another. Even when describing something I’ve done before, there are times when I feel the need to think it all through logically, as if for the first time. I noticed something similar when I saw my sister give a talk once: she had the same habit of pausing to work things out from first principles. I don’t see this behavior in every academic talk, though; different people have different styles of presentation.

This seems related to models of associative and logical reasoning in psychology. As a complete non-expert in that area, I’ll turn to wikipedia:

The foundations of dual process theory likely come from William James. He believed that there were two different kinds of thinking: associative and true reasoning. . . . images and thoughts would come to mind of past experiences, providing ideas of comparison or abstractions. He claimed that associative knowledge was only from past experiences describing it as “only reproductive”. James believed that true reasoning could enable overcoming “unprecedented situations” . . .

That sounds about right!

After describing various other theories from the past hundred years or so, Wikipedia continues:

Daniel Kahneman provided further interpretation by differentiating the two styles of processing more, calling them intuition and reasoning in 2003. Intuition (or system 1), similar to associative reasoning, was determined to be fast and automatic, usually with strong emotional bonds included in the reasoning process. Kahneman said that this kind of reasoning was based on formed habits and very difficult to change or manipulate. Reasoning (or system 2) was slower and much more volatile, being subject to conscious judgments and attitudes.

This sounds a bit different from what I was talking about above. When I’m doing “glorified autocomplete” thinking, I’m still thinking—this isn’t automatic and barely conscious behavior along the lines of driving to work along a route I’ve taken a hundred times before—I’m just thinking in a shallow way, trying to “autocomplete” the answer. It’s pattern-matching more than it is logical reasoning.

P.S. Just to be clear, I have a lot of respect for Hinton’s work; indeed, Aki and I included Hinton’s work in our brief review of 10 pathbreaking research articles during the past 50 years of statistics and machine learning. Also, I’m not trying to make a hardcore, AI-can’t-think argument. Although not myself a user of large language models, I respect Bob Carpenter’s respect for them.

I think that where Hinton got things wrong in the quote that led off this post was not in his characterization of chatbots, but rather in his assumptions about human thinking, in not distinguishing autocomplete-like associative reasoning from logical thinking. Maybe Hinton’s problem in understanding this is that he’s just too logical! At work, I do a lot of what seems like autocomplete—and, as I wrote above, I think it’s useful—but if I had more discipline, maybe I’d think more logically and carefully all the time. It could well be that Hinton has that habit or inclination to always be in focus. If Hinton does not have consistent personal experience of shallow, autocomplete-like thinking, he might not recognize it as something different, in which case he could be giving the chatbot credit for something it’s not doing.

Come to think of it, one thing that impresses me about Bob is that, when he’s working, he seems to always be in focus. I’ll be in a meeting, just coasting along, and Bob will interrupt someone to ask for clarification, and I suddenly realize that Bob absolutely demands understanding. He seems to have no interest in participating in a research meeting in a shallow way. I guess we just have different styles. It’s my impression that the vast majority of researchers are like me, just coasting on the surface most of the time (for some people, all of the time!), while Bob, and maybe Geoff Hinton, is one of the exceptions.

P.P.S. Sometimes we really want to be doing shallow, auto-complete-style thinking. For example, if we’re writing a play and want to simulate how some characters might interact. Or just as a way of casting the intellectual net more widely. When I’m in a research meeting and I free-associate, it might not help immediately solve the problem at hand, but it can bring in connections that will be helpful later. So I’m not knocking auto-complete; I’m just disagreeing with Hinton’s statement that “by training something to be really good at predicting the next word, you’re actually forcing it to understand.” As a person who does a lot of useful associative reasoning and also a bit of logical understanding, I think they’re different, both in how they feel and also in what they do.

P.P.P.S. Lots more discussion in comments; you might want to start here.

P.P.P.P.S. One more thing . . . actually, it might deserve its own post, but for now I’ll put it here: So far, it might seem like I’m denigrating associative thinking, or “acting like a chatbot,” or whatever it might be called. Indeed, I admire Bob Carpenter for doing very little of this at work! The general idea is that acting like a chatbot can be useful—I really can help lots of people solve their problems in that way, also every day I can write these blog posts that entertain and inform tens of thousands of people—but it’s not quite the same as focused thinking.

That’s all true (or, I should say, that’s my strong impression), but there’s more to it than that. As discussed in my comment linked to just above, “acting like a chatbot” is not “autocomplete” at all, indeed in some ways it’s kind of the opposite. Locally it’s kind of like autocomplete in that the sentences flow smoothly; I’m not suddenly jumping to completely unrelated topics—but when I do this associative or chatbot-like writing or talking, it can lead to all sorts of interesting places. I shuffle the deck and new hands come up. That’s one of the joys of “acting like a chatbot” and one reason I’ve been doing it for decades, long before chatbots ever existed! Walk along forking paths, and who knows where you’ll turn up! And all of you blog commenters (ok, most of you) play helpful roles in moving these discussions along.

“Open Letter on the Need for Preregistration Transparency in Peer Review”

Brendan Nyhan writes:

Wanted to share this open letter. I know preregistration isn’t useful for the style of research you do, but even for consumers of preregistered research like you it’s essential to know if the preregistration was actually disclosed to and reviewed by reviewers, which in turn helps make sure that exploratory and confirmatory analyses are adequately distinguished, deviations and omissions labeled, etc. (The things I’ve seen as a reviewer… are not good – which is what motivated me to organize this.)

The letter, signed by Nyhan and many others, says:

It is essential that preregistrations be considered as part of the scientific review process.

We have observed a lack of shared understanding among authors, editors, and reviewers about the role of preregistration in peer review. Too often, preregistrations are omitted from the materials submitted for review entirely. In other cases, manuscripts do not identify important deviations from the preregistered analysis plan, fail to provide the results of preregistered analyses, or do not indicate which analyses were not preregistered.

We therefore make the following commitments and ask others to join us in doing so:

As authors: When we submit an article for review that has been preregistered, we will always include a working link to a (possibly anonymized) preregistration and/or attach it as an appendix. We will identify analyses that were not preregistered as well as notable deviations and omissions from the preregistration.

As editors: When we receive a preregistered manuscript for review, we will verify that it includes a working link to the preregistration and/or that it is included in the materials provided to reviewers. We will not count the preregistration against appendix page limits.

As reviewers: We will (a) ask for the preregistration link or appendix when reviewing preregistered articles and (b) examine the preregistration to understand the registered intention of the study and consider important deviations, omissions, and analyses that were not preregistered in assessing the work.

I’ve actually been moving toward more preregistration in my work. Two recent studies we’ve done that have been preregistered are:

– Our project on generic language and political polarization

– Our evaluation of the Millennium Villages project

And just today I met with two colleagues on a medical experiment that’s in the pre-design stage—that is, we’re trying to figure out the design parameters. To do this, we need to simulate the entire process, including latent and observed data, then perform analyses on the simulated data, then replicate the entire process to ensure that the experiment will be precise enough to be useful, at least under the assumptions we’re making. This is already 90% of preregistration, and we had to do it anyway. (See recommendation 3 here.)
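For readers who haven’t done this sort of design-by-simulation, here’s a bare-bones sketch of the workflow in code. This is a toy two-arm experiment with parameters I made up; the actual medical study is more complicated, but the loop is the same: simulate fake data, run the planned analysis on it, repeat, and check whether the design is precise enough.

```python
# Minimal sketch of fake-data simulation at the design stage (toy example).
import numpy as np

rng = np.random.default_rng(2024)

def simulate_and_analyze(n_per_arm, true_effect, sd):
    """One fake replication of the planned experiment plus its planned analysis."""
    control = rng.normal(0.0, sd, n_per_arm)
    treated = rng.normal(true_effect, sd, n_per_arm)
    est = treated.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / n_per_arm + treated.var(ddof=1) / n_per_arm)
    return est, se

# Replicate the whole design many times and look at the distribution of estimates.
sims = [simulate_and_analyze(n_per_arm=50, true_effect=0.5, sd=2.0) for _ in range(1000)]
ests = np.array([est for est, _ in sims])
ses = np.array([se for _, se in sims])

print(f"mean estimate {ests.mean():.2f}, sd of estimates {ests.std():.2f}, "
      f"typical se {ses.mean():.2f}")
# If the sd of the estimates is large relative to the effects we care about,
# the design needs a bigger sample, better measurements, or a stronger treatment.
```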

So, yeah, given that I’m trying now to simulate every study ahead of time before gathering any data, preregistration pretty much comes for free.

Preregistration is not magic—it won’t turn a hopelessly biased, noisy study into something useful—but it does seem like a useful part of the scientific process, especially if we remember that preregistering an analysis should not stop us from performing later, non-preregistered analyses.

Preregistration should be an addition to the research project, not a limitation!

I guess that Nyhan et al.’s suggestions are good, if narrow in that they’re focused on the very traditional journal-reviewer system. I’m a little concerned with the promise that they as reviewers will “examine the preregistration to understand the registered intention of the study and consider important deviations, omissions, and analyses that were not preregistered in assessing the work.” I mean, sure, fine in theory, but I would not expect or demand that every reviewer do this for every paper that comes in. If I had to do all that work every time I reviewed a paper, I’d have to review many fewer papers a year, and I think my total contribution to science as a reviewer would be much less. If I’m gonna go through and try to replicate an analysis, I don’t want to waste that on a review that only 4 people will see. I’d rather blog it and maybe write it up in some other form (as for example here), as that has the potential to help more people.

Anyway, here’s the letter, so go sign it—or perhaps sign some counter-letter—if you wish!

Another reason so much of science is so bad: bias in what gets researched.

Nina Strohminger and Olúfémi Táíwò write:

Most of us have been taught to think of scientific bias as a distortion of scientific results. As long as we avoid misinformation, fake news, and false conclusions, the thinking goes, the science is unbiased. But the deeper problem of bias involves the questions science pursues in the first place. Scientific questions are infinite, but the resources required to test them — time, effort, money, talent — are decidedly finite.

This is a good point. Selection bias is notoriously difficult for people to think about, as by its nature it depends on things that haven’t been seen.

I like Strohminger and Táíwò’s article and have only two things to add.

1. They write about the effects of corporations on what gets researched, using as examples the strategies of cigarette companies and oil companies to fund research to distract from their products’ hazards. I agree that this is an issue. We should also be concerned about influences from sources other than corporations, including the military, civilian governments, and advocacy organizations. There are plenty of bad ideas to go around, even without corporate influence. And, setting all this aside, there’s selection based on what gets publicity, along with what might be called scientific ideology. Think about all that ridiculous research on embodied cognition or on the factors that purportedly influence the sex ratio of babies. These ideas fit certain misguided models of science and have sucked up lots of attention and researcher effort without any clear motivation based on funding, corporate or otherwise. My point here is just that there are a lot of ways that the scientific enterprise is distorted by selection bias in what gets studied and what gets published.

2. They write: “The research on nudges could be completely unbiased in the sense that it provides true answers. But it is unquestionably biased in the sense that it causes scientists to effectively ignore the most powerful solutions to the problems they focus on. As with the biomedical researchers before them, today’s social scientists have become the unwitting victims of corporate capture.” Agreed. Beyond this, though, that research is not even close to being unbiased in the sense of providing accurate answers to well-posed questions. We discussed this last year in the context of a fatally failed nudge meta-analysis: it’s a literature of papers with biased conclusions (the statistical significance filter), with some out-and-out fraudulent studies mixed in.
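To see the statistical significance filter in action, here’s a quick simulation with invented numbers (nothing to do with the actual nudge literature): unbiased but noisy studies, published only when they cross the p < 0.05 threshold.

```python
# Toy simulation of the statistical significance filter (invented numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

true_effect = 0.1    # a small real effect
se = 0.15            # noisy studies: standard error larger than the effect
n_studies = 10_000

estimates = rng.normal(true_effect, se, n_studies)   # each study is unbiased
z = estimates / se
significant = np.abs(z) > stats.norm.ppf(0.975)      # the p < 0.05 filter

print(f"share significant        : {significant.mean():.2f}")
print(f"mean of all estimates    : {estimates.mean():.2f}")
print(f"mean of significant ones : {estimates[significant].mean():.2f}")
# The estimates that survive the filter average roughly three times the true
# effect, even though every individual study was unbiased before selection.
```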

My point here is that these two biases—selection bias in what is studied, and selection bias in the studies themselves—go together. Neither bias alone would be enough. If there were only selection bias in what was studied, the result would be lots of studies reporting high uncertainty and no firm conclusions, and not much to sustain the hype machine. Conversely, if there were only selection bias within each study, there wouldn’t be such a waste of scientific effort and attention. Strohminger and Táíwò’s article is valuable because they emphasize selection bias in what is studied, which is something we haven’t been talking so much about.

Hydrology Corner: How to compare outputs from two models, one Bayesian and one non-Bayesian?

Zac McEachran writes:

I am a Hydrologist and Flood Forecaster at the National Weather Service in the Midwest. I use some Bayesian statistical methods in my research work on hydrological processes in small catchments.

I recently came across a project that I want to use a Bayesian analysis for, but I am not entirely certain what to look for to get going on this. My issue: NWS uses a protocol for calibrating our river models using a mixed conceptual/physically-based model. We want to assess whether a new calibration is better than an old calibration. This seems like a great application for a Bayesian approach. However, a lot of the literature I am finding (and methods I am more familiar with) are associated with assessing goodness-of-fit and validation for models that were fit within a Bayesian framework, and then validated in a Bayesian framework. I am interested in assessing how a non-Bayesian model output compares with another non-Bayesian model output with respect to observations. Someday I would like to learn to use Bayesian methods to calibrate our models but one step at a time!

My response: I think you need somehow to give a Bayesian interpretation to your non-Bayesian model output. This could be as simple as taking 95% prediction intervals and interpreting them as 95% posterior intervals from a normally-distributed posterior. Or if the non-Bayesian fit only gives point estimates, then do some bootstrapping or something to get an effective posterior. Then you can use external validation or cross validation to compare the predictive distributions of your different models, as discussed here; also see Aki’s faq on cross validation.
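Here’s roughly what that could look like in code. This is only a sketch with invented numbers, and `normal_from_interval` and `mean_log_score` are just names I made up for the two steps: turn each calibration’s 95% prediction interval into a normal predictive distribution, then compare calibrations on held-out observations using the mean log score.

```python
# Sketch: compare two non-Bayesian calibrations via approximate predictive
# distributions and a held-out log score (all numbers invented).
import numpy as np
from scipy import stats

def normal_from_interval(lower, upper):
    """Interpret a 95% prediction interval as a normal predictive distribution."""
    mean = (np.asarray(lower) + np.asarray(upper)) / 2
    sd = (np.asarray(upper) - np.asarray(lower)) / (2 * 1.96)
    return mean, sd

def mean_log_score(observed, lower, upper):
    mean, sd = normal_from_interval(lower, upper)
    return stats.norm.logpdf(observed, loc=mean, scale=sd).mean()

# Hypothetical held-out observations and the two calibrations' 95% intervals.
observed  = np.array([12.0, 30.0, 18.0, 55.0])
old_lower = np.array([ 5.0, 20.0, 10.0, 30.0])
old_upper = np.array([25.0, 50.0, 35.0, 80.0])
new_lower = np.array([ 8.0, 24.0, 12.0, 45.0])
new_upper = np.array([20.0, 40.0, 28.0, 70.0])

print("old calibration, mean log score:", round(mean_log_score(observed, old_lower, old_upper), 2))
print("new calibration, mean log score:", round(mean_log_score(observed, new_lower, new_upper), 2))
# The calibration with the higher (less negative) mean log score is making
# better-calibrated out-of-sample predictions, under these assumptions.
```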

A Hydrologist and Flood Forecaster . . . how cool is that?? Last time we had this level of cool was back in 2009 when we were contacted by someone who was teaching statistics to firefighters.

Wow—those are some really bad referee reports!

Dale Lehman writes:

I missed this recent retraction but the whole episode looks worth your attention. First the story about the retraction.

Here are the referee reports and authors responses.

And, here is the author’s correspondence with the editors about retraction.

The subject of COVID vaccine safety (or lack thereof) is certainly important and intensely controversial. The study has some fairly remarkable claims (deaths due to the vaccines numbering in the hundreds of thousands). The peer reviews seem to be an exemplary case of your statement that “the problems with peer review are the peer reviewers.” The data and methodology used in the study seem highly suspect to me – but the author appears to respond to many challenges thoughtfully (even if I am not convinced) and raises questions about the editorial practices involved with the retraction.

Here are some more details on that retracted paper.

Note the ethics statement about no conflicts – doesn’t mention any of the people supposedly behind the Dynata organization. Also, I was surprised to find the paper and all documentation still available despite being retracted. It includes the survey instrument. From what I’ve seen, the worst aspect of this study is that it asked people if they knew people who had problems after receiving the vaccine – no causative link even being asked for. That seems like an unacceptable method for trying to infer deaths from the vaccine – and one that the referees should never have permitted.

The most amazing thing about all this was the review reports. From the second link above, we see that the article had two review reports. Here they are, in their entirety:

The first report is an absolute joke, so let’s just look at the second review. The author revised in response to that review by rewriting some things, then the paper was published. At no time were any substantive questions raised.

I also noticed this from the above-linked news article:

“The study found that those who knew someone who’d had a health problem from Covid were more likely to be vaccinated, while those who knew someone who’d experienced a health problem after being vaccinated were less likely to be vaccinated themselves.”

Here’s a more accurate way to write it:

“The study found that those who SAID THEY knew someone who’d had a health problem from Covid were more likely to SAY THEY WERE vaccinated, while those who SAID THEY knew someone who’d experienced a health problem after being vaccinated were less likely to SAY THEY WERE vaccinated themselves.”

Yes, this sort of thing arises with all survey responses, but I think the subjectivity of the response is much more of a concern here than in a simple opinion poll.

The news article, by Stephanie Lee, makes the substantive point clearly enough:

This methodology for calculating vaccine-induced deaths was rife with problems, observers noted, chiefly that Skidmore did not try to verify whether anyone counted in the death toll actually had been vaccinated, had died, or had died because of the vaccine.

Also this:

Steve Kirsch, a veteran tech entrepreneur who founded an anti-vaccine group, pointed out that the study had the ivory tower’s stamp of approval: It had been published in a peer-reviewed scientific journal and written by a professor at Michigan State University. . . .

In a sympathetic interview with Skidmore, Kirsch noted that the study had been peer-reviewed. “The journal picks the peer reviewers … so how can they complain?” he said.

Ultimately the responsibility for publishing a misleading article falls upon the article’s authors, not upon the journal. You can’t expect or demand careful reviews from volunteer reviewers, nor can you expect volunteer journal editors to carefully vet every paper they will publish. Yes, the peer reviews for the above-discussed paper were useless—actually worse than useless, in that they gave a stamp of approval to bad work—but you can’t really criticize the reviewers for “not doing their jobs,” given that reviewing is not their job—they’re doing it for free.

Anyway, it’s a good thing that the journal shared the review reports so we can see how useless they were.

Moving from “admit your mistakes” to accepting the inevitability of mistakes and their fractal nature

The other day we talked about checking survey representativeness by looking at canary variables:

Like the canary in the coal mine, a canary variable is something with a known distribution that was not adjusted for in your model. Looking at the estimated distribution of the canary variable, and then comparing to external knowledge, is a way of checking your sampling procedure. It’s not an infallible check—your sample, or your adjusted sample, can be representative for one variable but not another—but it’s something you can do.
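Here’s a tiny sketch of what such a check might look like in code, with entirely invented data, weights, and benchmark:

```python
# Sketch of a canary-variable check (all data and numbers invented).
import numpy as np

rng = np.random.default_rng(7)
n = 2_000

# Suppose these weights came from an earlier adjustment (say, on age and education).
weights = rng.gamma(shape=2.0, scale=0.5, size=n)

# Canary variable: something with a known population value that was NOT adjusted for.
canary = rng.binomial(1, 0.58, size=n)   # what the sample happens to contain
known_population_rate = 0.65             # external benchmark (made up here)

weighted_rate = np.average(canary, weights=weights)
print(f"weighted sample rate {weighted_rate:.2f} vs known rate {known_population_rate:.2f}")
# A large gap is a warning that the sample, or the adjustment, is off in ways
# the weighting did not fix.
```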

Then I noticed another reference, from 2014:

What you’d want to do [when you see a problem] is not just say, Hey, mistakes happen! but rather to treat these errors as information, as model checks, as canaries in the coal mine and use them to improve your procedure. Sort of like what I did when someone pointed out problems in my election maps.

Canaries all around us

When you notice a mistake, something that seemed to fit your understanding but turned out to be wrong, don’t memory-hole it; engage with it. I get soooo frustrated with David Brooks, or the Nudgelords (further explanation here), or the Freakonomics team or, at a more technical level, the Fivethirtyeight team, when they don’t wrestle with their mistakes.

Dudes! A mistake is a golden opportunity, a chance to learn. You don’t get these every day—or maybe you do! To throw away such opportunities . . . it’s like leaving the proverbial $20 bill on the table.

When Matthew Walker or Malcolm Gladwell get caught out on their errors and they bob and weave and avoid confronting the problem, then I don’t get frustrated in the same way. Their entire brand is based on simplifying the evidence. Similarly with Brian Wansink: there was no there there. If he were to admit error, there’d be nothing left.

But David Brooks, Nudge, Freakonomics, Fivethirtyeight . . . they’re all about explanation, understanding, and synthesis. Sure, it would be a short-term hit to their reputations to admit they got fooled by bad statistical analyses (on the topic of Jews, lunch, beauty, and correlated forecasts, respectively) that happened to align with their ideological or intellectual preconceptions, but longer-term, they could do so much better. C’mon, guys! There’s more to life than celebrity, isn’t there? Try to remember what got you interested in writing about social science in the first place.

Moving from “admit your mistakes” to accepting the inevitability of mistakes and their fractal nature

I wonder whether part of this is the implicit dichotomy of “admit when you’re wrong.” We’re all wrong all the time, but when we frame “being wrong” as something that stands out, something that needs to be admitted, maybe that makes it easier for us to miss all the micro-errors that we make. If we could get in the habit of recognizing all the mistakes we make every day, all the false starts and blind alleys and wild goose chases that are absolutely necessary in any field of inquiry, then maybe it would be less of a big deal to face up to mistakes we make that are pointed out to us by others.

Mistakes are routine. We should be able to admit them forthrightly without even needing to swallow hard and face up to them, as it were. For example, Nate Silver recently wrote, “The perfect world is one in which the media is both more willing to admit mistakes—and properly frame provisional reporting as provisional and uncertain—and the public is more tolerant of mistakes. We’re not living that world.” Which I agree with, and it applies to Nate too. Maybe we need to go even one step further and not think of a mistake as something that needs to be “admitted,” but just something that happens when we are working on complicated problems, whether they be problems of straight-up journalism (with reports coming from different sources), statistical modeling (relying on assumptions that are inevitably wrong in various ways), or assessment of evidence more generally (at some point you end up with pieces of information that are pointing in different directions).

A successful example of “adversarial collaboration.” When does this approach work and when does it not?

Stephen Ceci, Shulamit Kahn, and Wendy Williams write:

We synthesized the vast, contradictory scholarly literature on gender bias in academic science from 2000 to 2020. . . . Claims and counterclaims regarding the presence or absence of sexism span a range of evaluation contexts. Our approach relied on a combination of meta-analysis and analytic dissection. We evaluated the empirical evidence for gender bias in six key contexts in the tenure-track academy: (a) tenure-track hiring, (b) grant funding, (c) teaching ratings, (d) journal acceptances, (e) salaries, and (f) recommendation letters. We also explored the gender gap in a seventh area, journal productivity, because it can moderate bias in other contexts. . . . Contrary to the omnipresent claims of sexism in these domains appearing in top journals and the media, our findings show that tenure-track women are at parity with tenure-track men in three domains (grant funding, journal acceptances, and recommendation letters) and are advantaged over men in a fourth domain (hiring). For teaching ratings and salaries, we found evidence of bias against women; although gender gaps in salary were much smaller than often claimed, they were nevertheless concerning.

They continue:

Even in the four domains in which we failed to find evidence of sexism disadvantaging women, we nevertheless acknowledge that broad societal structural factors may still impede women’s advancement in academic science. . . . The key question today is, in which domains of academic life has explicit sexism been addressed? And in which domains is it important to acknowledge continuing bias that demands attention and rectification lest we maintain academic systems that deter the full participation of women? . . .

Our findings of some areas of gender neutrality or even a pro-female advantage are very much rooted in the most recent decades and in no way minimize or deny the existence of gender bias in the past. Throughout this article, we have noted pre-2000 analyses that suggested that bias either definitely or probably was present in some aspects of tenure-track academia before 2000. . . .

The authors characterize this project as an “adversarial collaboration”:

This article represents more than 4.5 years of effort by its three authors. By the time readers finish it, some may assume that the authors were in agreement about the nature and prevalence of gender bias from the start. However, this is definitely not the case. Rather, we are collegial adversaries who, during the 4.5 years that we worked on this article, continually challenged each other, modified or deleted text that we disagreed with, and often pushed the article in different directions. . . . Kahn has a long history of revealing gender inequities in her field of economics, and her work runs counter to Ceci and Williams’s claims of gender fairness. . . . In 2019, she co-organized a conference on women in economics, and her most recent analysis in 2021 found gender inequities persisting in tenure and promotion in economics. . . . Her findings diverge from Ceci and Williams’s, who have published a number of studies that have not found gender bias in the academy, such as their analyses of grants and tenure-track hiring . . .

Although our divergent views are real, they may not be evident to readers who see only what survived our disagreements and rewrites; the final product does not reveal the continual back and forth among the three of us. Fortunately, our viewpoint diversity did not prevent us from completing this project on amicable terms. Throughout the years spent working on it, we tempered each other’s statements and abandoned irreconcilable points, so that what survived is a consensus document that does not reveal the many instances in which one of us modified or cut text that another wrote because they felt it was inconsistent with the full corpus of empirical evidence. . . .

Editors and board members can promote science by encouraging, when possible, diverse viewpoints and by commissioning teams of adversarial coauthors (as this particular journal, Psychological Science in the Public Interest, was founded to do—to bring coauthors together in an attempt to resolve their historic differences). Knowing that one’s writing will be criticized by one’s divergently thinking coauthors can reduce ideologically driven criticisms that are offered in the guise of science. . . .

Interesting. In the past I’ve been suspicious of adversarial collaborations—whenever I’ve tried such a thing it hasn’t worked so well, and examples I’ve seen elsewhere have seemed to have more of the “adversarial” than the “collaboration.”

Here are two examples (here and here) where I tried to work with people who I disagreed with, but they didn’t want to work with me.

I get it: in both places I was pretty firm that they had been making strong claims that were not supported by their evidence, and there was no convenient halfway point where they could rest. Ideally they’d just have agreed with me, but it’s pretty rare that people will just give up something they’ve already staked a claim on.

I’m not saying these other researchers are bad people. In each case, there was a disagreement about the strength of evidence. My point is just that there was no clear way forward regarding an adversarial collaboration. So I just wrote my articles on my own; I consider each of these to be a form of “asynchronous collaboration.” Still better than nothing.

But this one by Ceci, Kahn, and Williams seems to have worked well. Perhaps it’s easier in psychology than in political science, for some reason?

That said, I can’t imagine a successful adversarial collaboration with the psychologists who published some of the horrible unreplicable stuff from the 2005-2020 era. They just seem too invested in their claims; they also achieved professional success with that work and have no particular motivation to lend their reputations to any effort that might shoot it down. By their behavior, they treat their claims as fragile and would not want them to be put to the test. The Ceci, Kahn, Williams example is different, perhaps, because there are policy questions at stake, and all of them are motivated to persuade people in the middle of the debate. In contrast, the people pushing some of the more ridiculous results in embodied cognition and evolutionary psychology have no real motivation to persuade skeptics or even neutrals; they just need to keep their work from being seen as completely discredited.

This is related to my point about research being held to a higher standard when it faces active opposition.

“I am a junior undergraduate student majoring in linguistics and have recently started conducting brain imaging studies. . . .”

Isabella Lai writes:

I am a junior undergraduate student majoring in linguistics and have recently started conducting brain imaging studies.

Yesterday, I came across a paper published in Nature Human Behaviour by Grand, Blank, Pereira, and Fedorenko that raised several concerns for me. The paper evaluates word embeddings against human ratings collected via Amazon Mechanical Turk, or, in their words, “investigates context-dependent knowledge using semantic projection in word embeddings,” but the methodology might have a few issues. Take the simplest example: the use of Pearson’s correlation as an evaluation measure for concept acquisition might be misleading. The acquisition of linguistic concepts with multiple dimensions might involve phase transitions, and Pearson’s correlation, especially when outliers are smoothed out, measures only the linear relationship between variables. It might not capture the nonlinear character of such phase transitions.

I thought you might be interested in this paper, as it is a paradigmatic study of Computational X, or Computationalized X, where X is a humanities subject that has been taken over by machine learning models before a coherent, formalized theoretical foundation has been developed for it. Sometimes X is linguistics, other times it is psychology, and most likely it stands at the intersection of the two, given their shared status as modern sciences and their shared lack of a well-defined mathematical basis.

I have been thinking about the neural bases of language since college, and after a few hands-on experiences in data preprocessing using MEG and EEG, I have come up with a tentative hypothesis that I hope to share with you. My proposal is that the lack of evidence for an exact neural correlate of language, or of any theoretical concepts in linguistics (e.g., Professor Chomsky’s favorite merge combinator, or lexical access), might actually be evidence that language is an innate capacity. In other words, we might not need any additional consumption of blood glucose when using language, regardless of whether it is for computational purposes (talking to oneself via the “I-language”) or for communication purposes, because language is the air of our thoughts. It might be the case that poverty of the stimulus is trivially true, but a median correlation of 0.47 and 52 category-feature pairs are inadequate to disprove it.

I am curious about your opinion, from a statistical perspective, on this perhaps overly naive hypothesis (though it might just be parsimonious, if it turns out that it can be developed and might one day go beyond being trivially true).

My reply: You raise two issues: (1) the use of theory-free statistical analysis to obtain data summaries that are then used to draw general scientific conclusions, and (2) evidence for neural correlates of language, etc.

For item #1, I’ll just say there’s nothing wrong with looking at correlations and other theory-free data summaries as a way to turn up interesting patterns that can then be studied more carefully. The process goes like this: (a) root around in all sorts of data, run all sorts of experiments, calculate all sorts of things, look for interesting correlations; (b) consider these “interesting correlations” as anomalies in existing informal theories about the world; (c) form specific hypotheses and test them. (Here I’m using the idea of “testing” a hypothesis in the sense of science, not so-called “hypothesis testing” in statistics.)
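To make concrete the point, raised in Lai’s letter, about what a correlation coefficient does and does not capture, here is a minimal toy sketch of my own (simulated data, nothing from the paper under discussion): a smooth linear relationship and an abrupt, threshold-like “phase transition” can both produce a large Pearson correlation, which is why I’d treat the coefficient as a screening summary in the sense of step (a) above, not as a description of the underlying process.

```python
# Toy illustration (simulated data, not from Grand et al.): Pearson's r
# summarizes linear association and can be similarly large for a smooth
# linear relationship and for an abrupt, threshold-like one.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)

# Smooth linear relationship plus noise
y_linear = 2 * x + rng.normal(0, 0.3, size=x.size)

# Abrupt "phase transition": flat below a threshold, then a jump
y_step = (x > 0.6).astype(float) + rng.normal(0, 0.3, size=x.size)

for label, y in [("linear", y_linear), ("step", y_step)]:
    r, _ = pearsonr(x, y)
    rho, _ = spearmanr(x, y)
    print(f"{label:6s}  Pearson r = {r:.2f}   Spearman rho = {rho:.2f}")
```

On simulated data like this, both relationships come out with r roughly in the 0.7–0.9 range, so a single correlation tells you that something is going on but not whether the mapping is gradual or abrupt; answering that question takes a model, or at least a plot.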

For item #2, I have no idea. This one’s not just outside my areas of expertise, it’s also outside anything I’ve ever really thought about.

Lai adds:

On a separate note, I read your blog post yesterday discussing John Tierney’s opinion piece on school shootings and the potential negative impact of active-shooter drills. I hope to share my intuition: school shootings are fat-tailed events, and extreme cases involving psychopathic individuals are more likely to occur than default priors would suggest. As the media reports on and exposes more school shootings, psychopaths who previously may not have considered such a possibility are now more likely to view school shootings as a newly discovered option. The increased visibility of these incidents might contribute to a compounding rise in the frequency of school shootings, which might not be mitigated except through interventions such as better gun control measures.

Yeah, that’s pretty much my take too. But I don’t have any evidence on this concern, one way or another.
