Hey! Here’s how to rewrite and improve the title and abstract of a scientific paper:

Last week in class we read and then rewrote the title and abstract of a paper. We did it again yesterday, this time with one of my recent unpublished papers.

Here’s what I had originally:

title: Unifying design-based and model-based sampling inference by estimating a joint population distribution for weights and outcomes

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratifcation on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

Not terrible, but we can do better. Here’s the new version:

title: MRP using sampling weights

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights come from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratifcation on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

How did we get there?

The title. The original title was fine—it starts with some advertising (“Unifying design-based and model-based sampling inference”) and follows up with a description of how the method works (“estimating a joint population distribution for weights and outcomes”).

But the main point of the title is to get the notice of potential readers, people who might find the paper useful or interesting (or both!).

This pushes the question back one step: Who would find this paper useful or interesting? Anyone who works with sampling weights. Anyone who uses public survey data or, more generally, surveys collected by others, which typically contain sampling weights. And anyone who’d like to follow my path in survey analysis, which would be all the people out there who use MRP (multilevel regression and poststratification). Hence the new title, which is crisp, clear, and focused.

My only problem with the new title, “MRP using sampling weights,” is that it doesn’t clearly convey that the paper involves new research. It makes it look like a review article. But that’s not so horrible; people often like to learn from review articles.

The abstract. If you look carefully, you’ll see that the new abstract is the same as the original abstract, except that we replaced the middle part:

But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses.

with this:

But what if you don’t know where the weights come from?

Here’s what happened. We started by rereading the original abstract carefully. That abstract has some long sentences that are hard to follow. The first sentence is already kinda complicated, but I decided to keep it, because it clearly lays out the problem, and also I think the reader of an abstract will be willing to work a bit when reading the first sentence. Getting to the abstract at all is a kind of commitment.

The second sentence, though, that’s another tangle, and at this point the reader is tempted to give up and just skate along to the end—which I don’t want! The third sentence isn’t horrible, but it’s still a little bit long (starting with the nearly-contentless “It is also not clear how one is supposed to account for” and the ending with the unnecessary “in such analyses”). Also, we don’t even really talk much about clustering in the paper! So it was a no-brainer to collapse these into a sentence that was much more snappy and direct.

Finally, yeah, the final sentence of the abstract is kinda technical, but (a) the paper’s technical, and we want to convey some of its content in the abstract!, and (b) after that new, crisp, replacement second sentence, I think the reader is ready to take a breath and hear what the paper is all about.

General principles

Here’s a general template for a research paper:
1. What is the goal or general problem?
2. Why is it important?
3. What is the challenge?
4. What is the solution? What must be done to implement this solution?
5. If the idea in this paper is so great, why wasn’t the problem already solved by someone else?
6. What are the limitations of the proposed solution? What is its domain of applicability?

We used these principles in our rewriting of my title and abstract. The first step was for me to answer the above 6 questions:
1. Goal is to do survey inference with sampling weights.
2. It’s important for zillions of researchers who use existing surveys which come with weights.
3. The challenge is that if you don’t know where the weights come from, you can’t just follow the recommended approach to condition in the regression model on the information that is predictive of inclusion into the sample.
4. The solution is to condition on the weights themselves, which involves the additional step of estimating a joint population distribution for the weights and other predictors in the model.
5. The problem involves a new concept (imagining a population distribution for weights, which is not a coherent assumption, because, in the real world, weights are constructed based on the data) and some new mathematical steps (not inherently sophisticated as mathematics, but new work from a statistical perspective). Also, the idea of modeling the weights is not completely new; there is some related literature, and one of our contributions is to take weights (which are typically constructed from a non-Bayesian design-based perspective) and use them in a Bayesian analysis.
6. Survey weights do not include all design information, so the solution offered in the paper can only be approximate. In addition the method requires distributional assumptions on the weights; also it’s a new method so who knows how useful it will be in practice.

We can’t put all of that in the abstract, but we were able to include some versions of the answers to questions 1, 3, and 4. Questions 5 and 6 are important, but it’s ok to leave them to the paper, as this is where readers will typically search for limitations and connections to the literature.

Maybe we should include the answer to question 2 in the abstract, though. Perhaps we could replace “But what if you don’t know where the weights come from?” with “But what if you don’t know where the weights come from? This is often a problem when analyzing surveys collected by others.”

Summary

By thinking carefully about goals and audience, we improved the title and abstract of a scientific paper. You should be able to do this in your own work!

Of course its preregistered. Just give me a sec

This is Jessica. I was going to post something on Bak-Coleman and Devezer’s response to the Protzko et al. paper on the replicability of research that uses rigor-enhancing practices like large samples, preregistration, confirmatory tests, and methodological transparency, but Andrew beat me to it. But since his post didn’t get into one of the surprising aspects of their analysis (beyond the paper making causal claim without a study design capable of assessing causality), I’ll blog on it anyway.

Bak-Coleman and Devezer describe three ways in which the measure of replicability that Protzko et al. use to argue that the 16 effects they study are more replicable than effects in prior studies deviates from prior definitions of replicability:

  1. Protzko et al. define replicability as the chance that any replication achieves significance in the hypothesized direction as opposed to whether the results of the confirmation study and the replication were consistent 
  2. They include self-replications in calculating the rate
  3. They include repeated replications of the same effect and replications across different effects in calculating the rate

Could these deviations in how replicability is defined have been decided post-hoc, so that the authors could present positive evidence for their hypothesis that rigor-enhancing practices work? If they preregistered their definition of replicability, we would not be so concerned about this possibility.  Luckily, the authors report that “All confirmatory tests, replications and analyses were preregistered both in the individual studies (Supplementary Information section 3 and Supplementary Table 2) and for this meta-project (https://osf.io/6t9vm).”

But wait – according to Bak-Coleman and Devezer:

the analysis on which the titular claim depends was not preregistered. There is no mention of examining the relationship between replicability and rigor-improving methods, nor even how replicability would be operationalized despite extensive descriptions of the calculations of other quantities. With nothing indicating this comparison or metric it rests on were planned a priori, it is hard to distinguish the core claim in this paper from selective reporting and hypothesizing after the results are known. 

Uh-oh, that’s not good. At this point, some OSF sleuthing was needed. I poked around the link above, and the associated project containing analysis code. There are a couple analysis plans: Proposed Overarching Analyses for Decline Effect final.docx, from 2018, and Decline Effect Exploratory analyses and secondary data projects P4.docx, from 2019. However, these do not appear to describe the primary analysis of replicability in the paper (the first describes an analysis that ends up in the Appendix, and the second a bunch of exploratory analyses that don’t appear in the paper). About a year later, the analysis notebooks with the results they present in the main body of the paper were added. 

According to Bak-Coleman on X/Twitter: 

We emailed the authors a week ago. They’ve been responsive but as of now, they can’t say one way or another if the analyses correspond to a preregistration. They think they may be in some documentation.

In the best case scenario where the missing preregistration is soon found, this example suggests that there are still many readers and reviewers for whom some signal of rigor suffices even when the evidence of it is lacking. In this case, maybe the reputation of authors like Nosek reduced the perceived need on the part of the reviewers to track down the actual preregistration. But of course, even those who invented rigor-enhancing practices can still make mistakes!

In the alternative scenario where the preregistration is not found soon, what is the correct course of action? Surely at least a correction is in order? Otherwise we might all feel compelled to try our luck at signaling preregistration without having to inconvenience ourselves by actually doing.

More optimistically, perhaps there are exciting new research directions that could come out of this. Like, wearable preregistration, since we know from centuries of research and practice that it’s harder to lose something when it’s sewn to your person. Or, we could submit our preregistrations to OpenAI, I mean Microsoft, who could make a ChatGPT-enabled Preregistration Buddy who not only trained on your preregistration, but also knows how to please a human judge who wants to ask questions about what it said.

More on possibly rigor-enhancing practices in quantitative psychology research

In an paper entitled, “Causal claims about scientific rigor require rigorous causal evidence,” Joseph Bak-Coleman and Berna Devezer write:

Protzko et al. (2023) claim that “High replicability of newly discovered social-behavioral findings is achievable.” They argue that the 86% rate of replication observed in their replication studies is due to “rigor-enhancing practices” such as confirmatory tests, large sample sizes, preregistration and methodological transparency. These findings promise hope as concerns over low rates of replication have plagued the social sciences for more than a decade. Unfortunately, the observational design of the study does not support its key causal claim. Instead, inference relies on a post hoc comparison of a tenuous metric of replicability to past research that relied on incommensurable metrics and sampling frames.

The article they’re referring to is by a team of psychologists (John Protzko, Jon Krosnick, et al.) reporting “an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using rigor-enhancing practices: confirmatory tests, large sample sizes, preregistration, and methodological transparency. . . .”

When I heard about that paper, I teed off on their proposed list of rigor-enhancing practices.

I’ve got no problem with large sample sizes, preregistration, and methodological transparency. And confirmatory tests can be fine too, as long as they’re not misinterpreted and not used for decision making.

My biggest concern is that the authors or readers of that article will think that these are the best rigor-enhancing practices in science (or social science, or psychology, or social psychology, etc.), or the first rigor-enhancing practices that researchers should reach for, or the most important rigor-enhancing practices, or anything like that.

Instead, I gave my top 5 rigor-enhancing practices, in approximately decreasing order of importance:

1. Make it clear what you’re actually doing. Describe manipulations, exposures, and measurements fully and clearly.

2. Increase your effect size, e.g., do a more effective treatment.

3. Focus your study on the people and scenarios where effects are likely to be largest.

4. Improve your outcome measurement.

5. Improve pre-treatment measurements.

The suggestions of “confirmatory tests, large sample sizes, preregistration, and methodological transparency” are all fine, but I think all are less important than the 5 steps listed above. You can read the linked post to see my reasoning; also there’s Pam Davis-Kean’s summary, “Know what the hell you are doing with your research.” You might say that goes without saying, but it doesn’t, even in some papers published in top journals such as Psychological Science and PNAS!

You can also read a response to my post from Brian Nosek, a leader in the replication movement and one of the coauthors of the article being discussed.

In their new article, Bak-Coleman and Devezer take a different tack than me, in that they’re focused on challenges of measuring replicability of empirical claims in psychology, whereas I was more interested in the design of future studies. To a large extent, I find the whole replicability thing important to the extent that it gives researchers and users of research less trust in generic statistics-backed claims; I’d guess that actual effects typically vary so much based on context that new general findings are mostly not to be trusted. So I’d say that Protzko et al., Nosek, Bak-Coleman and Devezer, and I are coming from four different directions. (Yes, I recognize that Nosek is one of the authors of the Protzko et al. paper; still, in his blog comment he seemed to have a slightly different perspective). The article by Bak-Coleman and Devezer seems very relevant to any attempt to understand the empirical claims of Protzko et al.

The rise and fall of Seth Roberts and the Shangri-La diet

Here’s a post that’s suitable for the Thanksgiving season.

I no longer believe in the Shangri-La diet. Here’s the story.

Background

I met Seth Roberts back in the early 1990s when we were both professors at the University of California. He sometimes came to the statistics department seminar and we got to talking about various things; in particular we shared an interest in statistical graphics. Much of my work in this direction eventually went toward the use of graphical displays to understand fitted models. Seth went in another direction and got interested in the role of exploratory data analysis in science, the idea that we could use graphs not just to test or even understand a model but also as the source of new hypotheses. We continued to discuss these issues over the years.

At some point when we were at Berkeley the administration was encouraging the faculty to teach freshman seminars, and I had the idea of teaching a course on left-handedness. I’d just read the book by Stanley Coren and thought it would be fun to go through it with a class, chapter by chapter. But my knowledge of psychology was minimal so I contacted the one person I knew in the psychology department and asked him if he had any suggestions of someone who’d like to teach the course with me. Seth responded that he’d be interested in doing it himself, and we did it.

Seth was an unusual guy—not always in a good way, but some of his positive traits were friendliness, inquisitiveness, and an openness to consider new ideas. He also struggled with mood swings, social awkwardness, and difficulties with sleep, and he attempted to address these problems with self-experimentation.

After we taught the class together we got together regularly for lunch and Seth told me about his efforts in self-experimentation involving sleeping hours and mood. Most interesting to me was his discovery that seeing life-sized faces in the morning helped with his mood. I can’t remember how he came up with this idea, but perhaps he started by following the recommendation that is often given to people with insomnia to turn off TV and other sources of artificial light in the evening. Seth got in the habit of taping late-night talk-show monologues and then watching them in the morning while he ate breakfast. He found himself happier, did some experimentation, and concluded that we had evolved to talk with people in the morning, and that life-sized faces were necessary. Seth lived alone, so the more natural approach of talking over breakfast with a partner was not available.

Seth’s self-experimentation went slowly, with lots of dead-ends and restarts, which makes sense given the difficulty of his projects. I was always impressed by Seth’s dedication in this, putting in the effort day after day for years. Or maybe it did not represent a huge amount of labor for him, perhaps it was something like a diary or blog which is pleasurable to create, even if it seems from the outside to be a lot of work. In any case, from my perspective, the sustained focus was impressive. He had worked for years to solve his sleep problems and only then turned to the experiments on mood.

Seth’s academic career was unusual. He shot through college and graduate school to a tenure-track job at a top university, then continued to do publication-quality research for several years until receiving tenure. At that point he was not a superstar but I think he was still considered a respected member of the mainstream academic community. But during the years that followed, Seth lost interest in that thread of research. He told me once that his shift was motivated by teaching introductory undergraduate psychology: the students, he said, were interested in things that would affect their lives, and, compared to that, the kind of research that leads to a productive academic career did not seem so appealing.

I suppose that Seth could’ve tried to do research in clinical psychology (Berkeley’s department actually has a strong clinical program) but instead he moved in a different direction and tried different things to improve his sleep and then, later, his skin, his mood, and his diet. In this work, Seth applied what he later called his “insider/outsider perspective”: he was an insider in that he applied what he’d learned from years of research on animal behavior, an outsider in that he was not working within the existing paradigm of research in physiology and nutrition.

At the same time he was working on a book project, which I believe started as a new introductory psychology course focused on science and self-improvement but ultimately morphed into a trade book on ways in which our adaptations to Stone Age life were not serving us well in the modern era. I liked the book but I don’t think he found a publisher. In the years since, this general concept has been widely advanced and many books have been published on the topic.

When Seth came up with the connection between morning faces and depression, this seemed potentially hugely important. Were the faces were really doing anything? I have no idea. On one hand, Seth was measuring his own happiness and doing his own treatments on his own hypothesis so the potential for expectation effects are huge. On the other hand, he said the effect he discovered was a surprise to him and he also reported that the treatment worked with others. Neither he nor, as far as I know, anyone else, has attempted a controlled trial of this idea.

In his self-experimentation, Seth lived the contradiction between the two tenets of evidence-based medicine: (1) Try everything, measure everything, record everything; and (2) Make general recommendations based on statistical evidence rather than anecdotes.

Seth’s ideas were extremely evidence-based in that they were based on data that he gathered himself or that people personally sent in to him, and he did use the statistical evidence of his self-measurements, but he did not put in much effort to reduce, control, or adjust for biases in his measurements, nor did he systematically gather data on multiple people.

The Shangri-La diet

Seth’s next success after curing his depression was losing 40 pounds on an unusual diet that he came up with, in which you can eat whatever you want as long as each day you drink a cup of unflavored sugar water, at least an hour before or after a meal. The way he theorized that his diet worked was that the carefully-timed sugar water had the effect of reducing the association between calories and flavor, thus lowering your weight set-point and making you uninterested in eating lots of food.

I asked Seth once if he thought I’d lose weight if I were to try his diet in a passive way, drinking the sugar water at the recommended time but not actively trying to reduce my caloric intake. He said he supposed not, that the diet would make it easier to lose weight but I’d probably still have to consciously eat less.

I described Seth’s diet to one of my psychologist colleagues at Columbia and asked what he thought of it. My colleague said he thought it was ridiculous. And, as with the depression treatment, Seth never had an interest in running a controlled trial, even for the purpose of convincing the skeptics.

I had a conversation with Seth about this. He said he’d tried lots of diets and none had worked for him. I suggested that maybe he was just ready at last to eat least and lose weight, and he said he’d been ready for awhile but this was the first diet that allowed him to eat less without difficulty. I suggested that maybe the theory underlying Seth’s diet was compelling enough to act as a sort of placebo, motivating him to follow the protocol. Seth responded that other people had tried his diet and lost weight with it. He also reminded me that it’s generally accepted that “diets don’t work” and that people who lose weight while dieting will usually gain it all back. He felt that his diet was different in that it didn’t you what foods to eat or how much; rather, it changed your set point so that you didn’t want to eat so much. I found Seth’s arguments persuasive. I didn’t feel that his diet had been proved effective, but I thought it might really work, I told people about it, and I was happy about its success. Unlike my Columbia colleague, I didn’t think the idea was ridiculous.

Media exposure and success

Seth’s breakout success happened gradually, starting with a 2005 article on self-experimentation in Behavioral and Brain Sciences, a journal that publishes long articles followed by short discussions from many experts. Some of his findings from the ten of his experiments discussed in the article:

Seeing faces in the morning on television decreased mood in the evening and improved mood the next day . . . Standing 8 hours per day reduced early awakening and made sleep more restorative . . . Drinking unflavored fructose water caused a large weight loss that has lasted more than 1 year . . .

As Seth described it, self-experimentation generates new hypotheses and is also an inexpensive way to test and modify them. The article does not seem to have had a huge effect within research psychology (Google Scholar gives it 93 cites) but two of its contributions—the idea of systematic self-experimentation and the weight-loss method—have spread throughout the popular culture in various ways. Seth’s work was featured in a series of increasingly prominent blogs, which led to a newspaper article by the authors of Freakonomics and ultimately a successful diet book (not enough to make Seth rich, I think, but Seth had simple tastes and no desire to be rich, as far as I know). Meanwhile, Seth started a blog of his own which led to a message board for his diet that he told me had thousands of participants.

Seth achieved some measure of internet fame, with fans including Nassim Taleb, Steven Levitt, Dennis Prager, Tucker Max, Tyler Cowen, . . . and me! In retrospect, I don’t think having all this appreciation was good for him. On his blog and elsewhere Seth reported success with various self-experiments, the last of which was a claim of improved brain function after eating half a stick of butter a day. Even while maintaining interest in Seth’s ideas on mood and diet, I was entirely skeptical of his new claims, partly because of his increasing rate of claimed successes. It took Seth close to 10 years of sustained experimentation to fix his sleep problems, but in later years it seemed that all sorts of different things he tried were effective. His apparent success rate was implausibly high. What was going on? One problem is that sleep hours and weight can be measured fairly objectively, whereas if you measure brain function by giving yourself little quizzes, it doesn’t seem hard at all for a bit of unconscious bias to drive all your results. I also wonder if Seth’s blog audience was a problem: if you have people cheering on your every move, it can be that much easier to fool yourself.

Seth also started to go down some internet rabbit holes. On one hand, he was a left-wing Berkeley professor who supported universal health care, Amnesty International, and other liberal causes. On the other hand, his paleo-diet enthusiasm brought him close to various internet right-wingers, and he was into global warming denial and kinda sympathetic to Holocaust denial, not because he was a Nazi or anything but just because he had distrust of authority thing going on. I guess that if he’d been an adult back in the 1950s and 1960s he would’ve been on the extreme left, but more recently it’s been the far right where the rebels are hanging out. Seth also had sympathy for some absolutely ridiculous and innumerate research on sex ratios and absolutely loved the since-discredited work of food behavior researcher Brian Wansink; see here and here. The point here is not that Seth believed things that turned out to be false—that happens to all of us—but rather that he had a soft spot for extreme claims that were wrapped in the language of science.

Back to Shangri-La

A few years ago, Seth passed away, and I didn’t think of him too often, but then a couple years ago my doctor told me that my cholesterol level too high. He prescribed a pill, which I’m still taking every day, and he told me to switch to a mostly-plant diet and lose a bunch of weight.

My first thought was to try the Shangri-La diet. That cup of unflavored sugar water at least an hour between meals. Or maybe I did the spoonful of unflavored olive oil, I can’t remember which. Anyway, I tried it for a few days, also following the advice to eat less. And then after a few days, I thought: if the point is to eat less, why not just do that? So that’s what I did. No sugar water or olive oil needed.

What’s the point of this story? Not that losing the weight was easy for me. For a few years before that fateful conversation, my doctor had been bugging me to lose weight, and I’d vaguely wanted that to happen, but it hadn’t. What worked was me having this clear goal and motivation. And it’s not like I’m starving all the time. I’m fine; I just changed my eating patterns, and I take in a lot less energy every day.

But here’s a funny thing. Suppose I’d stuck with the sugar water and everything else had been the same. Then I’d have lost all this weight, exactly when I’d switched to the new diet. I’d be another enthusiastic Shangri-La believer, and I’d be telling you, truthfully, that only since switching to that diet had I been able to comfortably eat less. But I didn’t stick with Shangri-La and I lost the weight anyway, so I won’t make that attribution.

OK, so after that experience I had a lot less belief in Seth’s diet. The flip side of being convinced by his earlier self-experiment was becoming unconvinced after my own self-experiment.

And that’s where I stood until I saw this post at the blog Slime Mold Time Mold about informal experimentation:

For the potato diet, we started with case studies like Andrew Taylor and Penn Jilette; we recruited some friends to try nothing but potatoes for several days; and one of the SMTM authors tried the all-potato diet for a couple weeks.

For the potassium trial, two SMTM hive mind members tried the low-dose potassium protocol for a couple of weeks and lost weight without any negative side effects. Then we got a couple of friends to try it for just a couple of days to make sure that there weren’t any side effects for them either.

For the half-tato diet, we didn’t explicitly organize things this way, but we looked at three very similar case studies that, taken together, are essentially an N = 3 pilot of the half-tato diet protocol. No idea if the half-tato effect will generalize beyond Nicky Case and M, but the fact that it generalizes between them is pretty interesting. We also happened to know about a couple of other friends who had also tried versions of the half-tato diet with good results.

My point here is not to delve into the details of these new diets, but rather to point out that they are like the Shangri-La diet in being different from other diets, associated with some theory, evaluated through before-after studies on some people who wanted to lose weight, and yielded success.

At this point, though, my conclusion is not that unflavored sugar water is effective in making it easy to lose weight, or that unflavored oil works, or that potatoes work, or that potassium works. Rather, the hypothesis that’s most plausible to me is that, if you’re at the right stage of motivation, anything can work.

Or, to put it another way, I now believe that the observed effect of the Shangri-La diet, the potato diet, etc., comes from a mixture of placebo and selection. The placebo is that just about any gimmick can help you lose weight, and keep the weight off, if it somehow motivates you to eat less. The selection is that, once you’re ready to try something like this diet, you might be ready to eat less.

But what about “diets don’t work”? I guess that diets don’t work for most people at most times. But the people trying these diets are not “most people at most times.” They’re people with a high motivation to eat less and lose weight.

I’m not saying I have an ironclad case here. I’m pretty much now in the position of my Columbia colleague who felt that there’s no good reason to believe that Seth’s diet is more effective than any other arbitrary series of rules that somewhere includes the suggestion to eat less. And, yes, I have the same impression of the potato diet and the other ideas mentioned above. It’s just funny that it took so long for me to reach this position.

Back to Seth

I wouldn’t say the internet killed Seth Roberts, but ultimately I don’t think it did him any favors for him to become an internet hero, in the same way that it’s not always good for an ungrounded person to become an academic hero, or an athletic hero, or a musical hero, or a literary hero, or a military hero, or any other kind of hero. The stuff that got you to heroism can be a great service to the world, but what comes next can be a challenge.

Seth ended up believing in his own hype. In this case, the hype was not that he was an amazing genius; rather, the hype was about his method, the idea that he had discovered modern self-experimentation (to the extent that this rediscovery can be attributed to anybody, it should be to Seth’s undergraduate adviser, Allen Neuringer, in this article from 1981). Maybe even without his internet fame Seth would’ve gone off the deep end and started to believe he was regularly making major discoveries; I don’t know.

From a scientific standpoint, Seth’s writings are an example of the principle that honesty and transparency are not enough. He clearly described what he did, but his experiments got to be so flawed as to be essentially useless.

After I posted my obituary of Seth (from which I took much of the beginning of this post), there were many moving tributes in the comments, and I concluded by writing, “It is good that he found an online community of people who valued him.” That’s how I felt right now, but in retrospect, maybe not. If I could’ve done it all over again, I never would’ve promoted his diet, a promotion that led to all the rest.

I’d guess that the wide dissemination of Seth’s ideas was a net benefit to the world. Even if his diet idea is bogus, it seems to have made a difference to a lot of people. And even if the discoveries he reported from his self-experimentation (eating a stick of butter a day improving brain functioning and all the rest) were nothing but artifacts of his hopeful measurement protocols, the idea of self-experimentation was empowering to people—and I’m assuming that even his true believers (other than himself) weren’t actually doing the butter thing.

Setting aside the effects on others, though, I don’t think that this online community was good for Seth in his own work or for his personal life. In some ways he was ahead of his time, as nowadays we’re hearing a lot about people getting sucked into cult-like vortexes of misinformation.

P.S. Lots of discussion in comments, including this from the Slime Mold Time Mold bloggers.

Dorothy Bishop on the prevalence of scientific fraud

Following up on our discussion of replicability, here are some thoughts from psychology researcher Dorothy Bishop on scientific fraud:

In recent months, I [Bishop] have become convinced of two things: first, fraud is a far more serious problem than most scientists recognise, and second, we cannot continue to leave the task of tackling it to volunteer sleuths.

If you ask a typical scientist about fraud, they will usually tell you it is extremely rare, and that it would be a mistake to damage confidence in science because of the activities of a few unprincipled individuals. . . . we are reassured [that] science is self-correcting . . .

The problem with this argument is that, on the one hand, we only know about the fraudsters who get caught, and on the other hand, science is not prospering particularly well – numerous published papers produce results that fail to replicate and major discoveries are few and far between . . . We are swamped with scientific publications, but it is increasingly hard to distinguish the signal from the noise.

Bishop summarizes:

It is getting to the point where in many fields it is impossible to build a cumulative science, because we lack a solid foundation of trustworthy findings. And it’s getting worse and worse. . . . in clinical areas, there is growing concern that systematic reviews that are supposed to synthesise evidence to get at the truth instead lead to confusion because a high proportion of studies are fraudulent.

Also:

[A] more indirect negative consequence of the explosion in published fraud is that those who have committed fraud can rise to positions of influence and eminence on the back of their misdeeds. They may become editors, with the power to publish further fraudulent papers in return for money, and if promoted to professorships they will train a whole new generation of fraudsters, while being careful to sideline any honest young scientists who want to do things properly. I fear in some institutions this has already happened.

Given all the above, it’s unsurprising that, in Bishop’s words,

To date, the response of the scientific establishment has been wholly inadequate. There is little attempt to proactively check for fraud . . . Even when evidence of misconduct is strong, it can take months or years for a paper to be retracted. . . . this relaxed attitude to the fraud epidemic is a disaster-in-waiting.

What to do? Bishop recommends that some subset of researchers be trained as “data sleuths,” to move beyond the current whistleblower-and-vigilante system into something more like “the equivalent of a police force.”

I don’t know what to think about that. On one hand, I agree that whistleblowers and critics don’t get the support that they deserve; on the other hand, we might be concerned about who would be attracted to the job of official police officer here.

Setting aside concerns about Bishop’s proposed solution, I do see her larger point about the scientific publication process being so broken that it can actively interfere with the development of science. In a situation parallel to Cantor’s diagonal argument or Russell’s theory of types, it would seem that we need a scientific literature, and then, alongside it, a vetted scientific literature, and then, alongside that, another level of vetting, and so on. In medical research this sort of system has existed for decades, with a huge number of journals for the publication of original studies; and then another, smaller but still immense, set of journals that publish nothing but systematic reviews; and then some distillations that make their way into policy and practice.

Clarke’s Law

And don’t forget Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud. All the above problems also arise with the sorts of useless noise mining we’ve been discussing in this space for nearly twenty years now. I assume most of those papers do not involve fraud, and even when there are clearly bad statistical practices such as rooting around for statistical significance, I expect that the perpetrators think of these research violations as merely serving the goal of larger truths.

So it’s not just fraud. Not by a longshot.

Also, remember the quote from Bishop above: “those who have committed fraud can rise to positions of influence and eminence on the back of their misdeeds. They may become editors, with the power to publish further fraudulent papers in return for money, and if promoted to professorships they will train a whole new generation of fraudsters, while being careful to sideline any honest young scientists who want to do things properly. I fear in some institutions this has already happened.” Replace “fraud” by “crappy research” and, yeah, we’ve been there for awhile!

P.S. Mark Tuttle points us to this news article by Richard Van Noorden, “How big is science’s fake-paper problem?”, that makes a similar point.

Brian Nosek on “Here are some ways of making your study replicable”:

Brian Nosek is a leader of the replication movement in science and a coauthor of an article on replicability that we discussed the other day.

They discussed the rigor-enhancing practices of “confirmatory tests, large sample sizes, preregistration, and methodological transparency, and in my post I wrote that those were not the first things I’d suggest to increase rigor in science. My recommendations were (1) Make it clear what you’re actually doing, (2) Increase your effect size, e.g., do a more effective treatment, (3) Focus your study on the people and scenarios where effects are likely to be largest, (4) Improve your outcome measurement: a more focused and less variable outcome measure, (5) Improve pre-treatment measurements, and finally (6) the methods listed in the above-linked article: “confirmatory tests, large sample sizes, preregistration, and methodological transparency.”

I sent this post to Nosek, and he replied:

For your list of practices:

#1: We did this for both methodological and statistical practices.

#2: I suspect that every lab was motivated to get the largest effect that they could given the research question that they were studying (ours certainly was). But, you’ll observe in the findings that we didn’t get very large effect sizes on average. Instead, they are what I believe are around what most “real” effect sizes are for the messy concepts that social scientists study.

#3: We didn’t do this. Each lab used a sampling firm and all studies were conducted through that firm. It is possible that a lab would have tried to tailor the design to the sample, but these were very heterogeneous samples, so that would not likely have been very effective.

#4: I suspect that every lab did this the best that they could. Simultaneously, most of the research in this is pretty on-the-edge discovery work, so not necessarily a lot of existing evidence to make use of (with variation across experiments and labs).

#5: I suspect that this was done for a couple of experiments from some labs, but not others. (None from mine did so.)

I like all of your suggestions for improving rigor. I would counterargue that some of them become more meaningfully impactful on the research process as the evidence base matures (e.g., where to get the largest effect size, what are effective pretreatment measurements). In the context of discovery research like the experiments in this paper, we could only speculate about these in trying to design the most rigorous studies. The practices that we highlight are “easily” applied no matter the maturity of the domain and evidence base.

On your other points: I think the paper provides proof-of-concept that even small effects are highly replicable. And, I am much more sanguine than you are about the benefits of preregistration. Maybe we can find some time to argue about that in the future!

I disagree with Geoff Hinton regarding “glorified autocomplete”

Computer scientist and “godfather of AI” Geoff Hinton says this about chatbots:

“People say, It’s just glorified autocomplete . . . Now, let’s analyze that. Suppose you want to be really good at predicting the next word. If you want to be really good, you have to understand what’s being said. That’s the only way. So by training something to be really good at predicting the next word, you’re actually forcing it to understand. Yes, it’s ‘autocomplete’—but you didn’t think through what it means to have a really good autocomplete.”

This got me thinking about what I do at work, for example in a research meeting. I spend a lot of time doing “glorified autocomplete” in the style of a well-trained chatbot: Someone describes some problem, I listen and it reminds me of a related issue I’ve thought about before, and I’m acting as a sort of FAQ, but more like a chatbot than a FAQ in that the people who are talking with me do not need to navigate through the FAQ to find the answer that is most relevant to them; I’m doing that myself and giving a response.

I do that sort of thing a lot in meetings, and it can work well, indeed often I think this sort of shallow, associative response can be more effective than whatever I’d get from a direct attack on the problem in question. After all, the people I’m talking with have already thought for awhile about whatever it is they’re working on, and my initial thoughts may well be in the wrong direction, or else my thoughts are in the right direction but are just retracing my collaborators’ past ideas. From the other direction, my shallow thoughts can be useful in representing insights from problems that these collaborators had not ever thought about much before. Nonspecific suggestions on multilevel modeling or statistical graphics or simulation or whatever can really help!

At some point, though, I’ll typically have to bite the bullet and think hard, not necessarily reaching full understanding in the sense of mentally embedding the problem at hand into a coherent schema or logical framework, but still going through whatever steps of logical reasoning that I can. This feels different than autocomplete; it requires an additional level of focus. Often I need to consciously “flip the switch,” as it were, to turn on that focus and think rigorously. Other times, I’m doing autocomplete and either come to a sticking point or encounter an interesting idea, and this causes me to stop and think.

It’s almost like the difference between jogging and running. I can jog and jog and jog, thinking about all sorts of things and not feeling like I’m expending much effort, my legs pretty much move up and down of their own accord . . . but then if I need to run, that takes concentration.

Here’s another example. Yesterday I participated in the methods colloquium in our political science department. It was Don Green and me and a bunch of students, and the structure was that Don asked me questions, I responded with various statistics-related and social-science-related musings and stories, students followed up with questions, I responded with more stories, etc. Kinda like the way things go here on the blog, but spoken rather than typed. Anyway, the point is that most of my responses were a sort of autocomplete—not in a word-by-word chatbot style, more at a larger level of chunkiness, for example something would remind me of a story, and then I’d just insert the story into my conversation—but still at this shallow, pleasant level. Mellow conversation with no intellectual or social strain. But then, every once in awhile, I’d pull up short and have some new thought, some juxtaposition that had never occurred to me before, and I’d need to think things through.

This also happens when I give prepared talks. My prepared talks are not super-well prepared—this is on purpose, as I find that too much preparation can inhibit flow. In any case, I’ll often finding myself stopping and pausing to reconsider something or another. Even when describing something I’ve done before, there are times when I feel the need to think it all through logically, as if for the first time. I noticed something similar when I saw my sister give a talk once: she had the same habit of pausing to work things out from first principles. I don’t see this behavior in every academic talk, though; different people have different styles of presentation.

This seems related to models of associative and logical reasoning in psychology. As a complete non-expert in that area, I’ll turn to wikipedia:

The foundations of dual process theory likely come from William James. He believed that there were two different kinds of thinking: associative and true reasoning. . . . images and thoughts would come to mind of past experiences, providing ideas of comparison or abstractions. He claimed that associative knowledge was only from past experiences describing it as “only reproductive”. James believed that true reasoning could enable overcoming “unprecedented situations” . . .

That sounds about right!

After describing various other theories from the past hundred years or so, Wikipedia continues:

Daniel Kahneman provided further interpretation by differentiating the two styles of processing more, calling them intuition and reasoning in 2003. Intuition (or system 1), similar to associative reasoning, was determined to be fast and automatic, usually with strong emotional bonds included in the reasoning process. Kahneman said that this kind of reasoning was based on formed habits and very difficult to change or manipulate. Reasoning (or system 2) was slower and much more volatile, being subject to conscious judgments and attitudes.

This sounds a bit different from what I was talking about above. When I’m doing “glorified autocomplete” thinking, I’m still thinking—this isn’t automatic and barely conscious behavior along the lines of driving to work along a route I’ve taken a hundred times before—; I’m just thinking in a shallow way, trying to “autocomplete” the answer. It’s pattern-matching more than it is logical reasoning.

P.S. Just to be clear, I have a lot of respect for Hinton’s work; indeed, Aki and I included Hinton’s work in our brief review of 10 pathbreaking research articles during the past 50 years of statistics and machine learning. Also, I’m not trying to make a hardcore, AI-can’t-think argument. Although not myself a user of large language models, I respect Bob Carpenter’s respect for them.

I think that where Hinton got things wrong in the quote that led off this post was not in his characterization of chatbots, but rather in his assumptions about human thinking, in not distinguishing autocomplete-like associative reasoning with logical thinking. Maybe Hinton’s problem in understanding this is that he’s just too logical! At work, I do a lot of what seems like autocomplete—and, as I wrote above, I think it’s useful—but if I had more discipline, maybe I’d think more logically and carefully all the time. It could well be that Hinton has that habit or inclination to always be in focus. If Hinton does not have consistent personal experience of shallow, autocomplete-like thinking, he might not recognize it as something different, in which case he could be giving the chatbot credit for something it’s not doing.

Come to think of it, one thing that impresses me about Bob is that, when he’s working, he seems to always be on focus. I’ll be in a meeting, just coasting along, and Bob will interrupt someone to ask for clarification, and I suddenly realize that Bob absolutely demands understanding. He seems to have no interest in participating in a research meeting in a shallow way. I guess we just have different styles. It’s my impression that the vast majority of researchers are like me, just coasting on the surface most of the time (for some people, all of the time!), while Bob, and maybe Geoff Hinton, is one of the exceptions.

P.P.S. Sometimes we really want to be doing shallow, auto-complete-style thinking. For example, if we’re writing a play and want to simulate how some characters might interact. Or just as a way of casting the intellectual net more widely. When I’m in a research meeting and I free-associate, it might not help immediately solve the problem at hand, but it can bring in connections that will be helpful later. So I’m not knocking auto-complete; I’m just disagreeing with Hinton’s statement that “by training something to be really good at predicting the next word, you’re actually forcing it to understand.” As a person who does a lot of useful associative reasoning and also a bit of logical understanding, I think they’re different, both in how they feel and also in what they do.

P.P.P.S. Lots more discussion in comments; you might want to start here.

P.P.P.P.S. One more thing . . . actually, it might deserve its own post, but for now I’ll put it here: So far, it might seem like I’m denigrating associative thinking, or “acting like a chatbot,” or whatever it might be called. Indeed, I admire Bob Carpenter for doing very little of this at work! The general idea is that acting like a chatbot can be useful—I really can help lots of people solve their problems in that way, also every day I can write these blog posts that entertain and inform tens of thousands of people—but it’s not quite the same as focused thinking.

That’s all true (or, I should say, that’s my strong impression), but there’s more to it than that. As discussed in my comment linked to just above, “acting like a chatbot” is not “autocomplete” at all, indeed in some ways it’s kind of the opposite. Locally it’s kind of like autocomplete in that the sentences flow smoothly; I’m not suddenly jumping to completely unrelated topics—but when I do this associative or chatbot-like writing or talking, it can lead to all sorts of interesting places. I shuffle the deck and new hands come up. That’s one of the joys of “acting like a chatbot” and one reason I’ve been doing it for decades, long before chatbots ever existed! Walk along forking paths, and who knows where you’ll turn up! And all of you blog commenters (ok, most of you) play helpful roles in moving these discussions along.

“Open Letter on the Need for Preregistration Transparency in Peer Review”

Brendan Nyhan writes:

Wanted to share this open letter. I know preregistration isn’t useful for the style of research you do, but even for consumers of preregistered research like you it’s essential to know if the preregistration was actually disclosed to and reviewed by reviewers, which in turn helps make sure that exploratory and confirmatory analyses are adequately distinguished, deviations and omissions labeled, etc. (The things I’ve seen as a reviewer… are not good – which is what motivated me to organize this.)

The letter, signed by Nyhan and many others, says:

It is essential that preregistrations be considered as part of the scientific review process.

We have observed a lack of shared understanding among authors, editors, and reviewers about the role of preregistration in peer review. Too often, preregistrations are omitted from the materials submitted for review entirely. In other cases, manuscripts do not identify important deviations from the preregistered analysis plan, fail to provide the results of preregistered analyses, or do not indicate which analyses were not preregistered.

We therefore make the following commitments and ask others to join us in doing so:

As authors: When we submit an article for review that has been preregistered, we will always include a working link to a (possibly anonymized) preregistration and/or attach it as an appendix. We will identify analyses that were not preregistered as well as notable deviations and omissions from the preregistration.

As editors: When we receive a preregistered manuscript for review, we will verify that it includes a working link to the preregistration and/or that it is included in the materials provided to reviewers. We will not count the preregistration against appendix page limits.

As reviewers: We will (a) ask for the preregistration link or appendix when reviewing preregistered articles and (b) examine the preregistration to understand the registered intention of the study and consider important deviations, omissions, and analyses that were not preregistered in assessing the work.

I’ve actually been moving toward more preregistration in my work. Two recent studies we’ve done that have been preregistered are:

– Our project on generic language and political polarization

– Our evaluation of the Millennium Villages project

And just today I met with two colleagues on a medical experiment that’s in the pre-design stage—that is, we’re trying to figure out the design parameters. To do this, we need to simulate the entire process, including latent and observed data, then perform analyses on the simulated data, then replicate the entire process to ensure that the experiment will be precise enough to be useful, at least under the assumptions we’re making. This is already 90% of preregistration, and we had to do it anyway. (See recommendation 3 here.)

So, yeah, given that I’m trying now to simulate every study ahead of time before gathering any data, preregistration pretty much comes for free.

Preregistration is not magic—it won’t turn a hopelessly biased, noisy study into something useful—but it does seem like a useful part of the scientific process, especially if we remember that preregistering an analysis should not stop us from performing later, non-preregistered analyses.

Preregistration should be an addition to the research project, not a limitation!

I guess that Nyhan et al.’s suggestions are good, if narrow in that they’re focused on the very traditional journal-reviewer system. I’m a little concerned with the promise that they as reviewers will “examine the preregistration to understand the registered intention of the study and consider important deviations, omissions, and analyses that were not preregistered in assessing the work.” I mean, sure, fine in theory, but I would not expect or demand that every reviewer do this for every paper that comes in. If I had to do all that work every time I reviewed a paper, I’d have to review many fewer papers a year, and I think my total contribution to science as a reviewer would be much less. If I’m gonna go through and try to replicate an analysis, I don’t want to waste that on a review that only 4 people will see. I’d rather blog it and maybe write it up on some other form (as for example here), as that has the potential to help more people.

Anyway, here’s the letter, so go sign it—or perhaps sign some counter-letter—if you wish!

Another reason so much of science is so bad: bias in what gets researched.

Nina Strohminger and Olúfémi Táíwò write:

Most of us have been taught to think of scientific bias as a distortion of scientific results. As long as we avoid misinformation, fake news, and false conclusions, the thinking goes, the science is unbiased. But the deeper problem of bias involves the questions science pursues in the first place. Scientific questions are infinite, but the resources required to test them — time, effort, money, talent — are decidedly finite.

This is a good point. Selection bias is notoriously difficult for people to think about, as by its nature it depends on things that haven’t been seen.

I like Strohminger and Táíwò’s article and have only two things to add.

1. They write about the effects of corporations on what gets researched, using as examples the strategies of cigarette companies and oil companies to fund research to distract from their products’ hazards. I agree that this is an issue. We should also be concerned about influences from sources other than corporations, including the military, civilian governments, and advocacy organizations. There are plenty of bad ideas to go around, even without corporate influence. And, setting all this aside, there’s selection based on what gets publicity, along with what might be called scientific ideology. Think about all that ridiculous research on embodied cognition or on the factors that purportedly influence the sex ratio of babies. These ideas fit certain misguided models of science and have sucked up lots of attention and researcher effort without any clear motivation based on funding, corporate or otherwise. My point here is just that there are a lot of ways that the scientific enterprise is distorted by selection bias in what gets studied and what gets published.

2. They write: “The research on nudges could be completely unbiased in the sense that it provides true answers. But it is unquestionably biased in the sense that it causes scientists to effectively ignore the most powerful solutions to the problems they focus on. As with the biomedical researchers before them, today’s social scientists have become the unwitting victims of corporate capture.” Agreed. Beyond this, though, that research is not even close to being unbiased in the sense of providing accurate answers to well-posed questions. We discussed this last year in the context of a fatally failed nudge meta-analysis: it’s a literature of papers with biased conclusions (the statistical significance filter), with some out-and-out fraudulent studies mixed in).

My point here is that these two biases—selection bias in what is studied, and selection bias in the studies themselves—go together. Neither bias alone would be enough. If there were only selection bias is what was studied, the result would be lots of studies reporting high uncertainty and no firm conclusions, and not much to sustain the hype machine. Conversely, if there were only selection bias within each study, there wouldn’t be such a waste of scientific effort and attention. Strohminger and Táíwò’s article is valuable because they emphasize selection bias in what is studied, which is something we haven’t been talking so much about.

Hydrology Corner: How to compare outputs from two models, one Bayesian and one non-Bayesian?

Zac McEachran writes:

I am a Hydrologist and Flood Forecaster at the National Weather Service in the Midwest. I use some Bayesian statistical methods in my research work on hydrological processes in small catchments.

I recently came across a project that I want to use a Bayesian analysis for, but I am not entirely certain what to look for to get going on this. My issue: NWS uses a protocol for calibrating our river models using a mixed conceptual/physically-based model. We want to assess whether a new calibration is better than an old calibration. This seems like a great application for a Bayesian approach. However, a lot of the literature I am finding (and methods I am more familiar with) are associated with assessing goodness-of-fit and validation for models that were fit within a Bayesian framework, and then validated in a Bayesian framework. I am interested in assessing how a non-Bayesian model output compares with another non-Bayesian model output with respect to observations. Someday I would like to learn to use Bayesian methods to calibrate our models but one step at a time!

My response: I think you need somehow to give a Bayesian interpretation to your non-Bayesian model output. This could be as simple as taking 95% prediction intervals and interpreting them as 95% posterior intervals from a normally-distributed posterior. Or if the non-Bayesian fit only gives point estimates, then do some boostrapping or something to get an effective posterior. Then you can use external validation or cross validation to compare the predictive distributions of your different models, as discussed here; also see Aki’s faq on cross validation.

A Hydrologist and Flood Forecaster . . . how cool is that?? Last time we had this level of cool was back in 2009 when we were contacted by someone who was teaching statistics to firefighters.

Wow—those are some really bad referee reports!

Dale Lehman writes:

I missed this recent retraction but the whole episode looks worth your attention. First the story about the retraction.

Here are the referee reports and authors responses.

And, here is the author’s correspondence with the editors about retraction.

The subject of COVID vaccine safety (or lack thereof) is certainly important and intensely controversial. The study has some fairly remarkable claims (deaths due to the vaccines numbering in the hundreds of thousands). The peer reviews seem to be an exemplary case of your statement that “the problems with peer review are the peer reviewers). The data and methodology used in the study seem highly suspect to me – but the author appears to respond to many challenges thoughtfully (even if I am not convinced) and raises questions about the editorial practices involved with the retraction.

Here are some more details on that retracted paper.

Note the ethics statement about no conflicts – doesn’t mention any of the people supposedly behind the Dynata organization. Also, I was surprised to find the paper and all documentation still available despite being retracted. It includes the survey instrument. From what I’ve seen, the worst aspect of this study is that it asked people if they knew people who had problems after receiving the vaccine – no causative link even being asked for. That seems like an unacceptable method for trying to infer deaths from the vaccine – and one that the referees should never have permitted.

The most amazing thing about all this was the review reports. From the second link above, we see that the article had two review reports. Here they are, in their entirety:

The first report is an absolute joke, so let’s just look at the second review. The author revised in response to that review by rewriting some things, then the paper was published. At no time were any substantive questions raised.

I also noticed this from the above-linked news article:

“The study found that those who knew someone who’d had a health problem from Covid were more likely to be vaccinated, while those who knew someone who’d experienced a health problem after being vaccinated were less likely to be vaccinated themselves.”

Here’s a more accurate way to write it:

“The study found that those who SAID THEY knew someone who’d had a health problem from Covid were more likely to be SAY THEY WERE vaccinated, while those who SAID THEY knew someone who’d experienced a health problem after being vaccinated were less likely to SAY THEY WERE vaccinated themselves.”

Yes, this is sort of thing arises with all survey responses, but I think the subjectivity of the response is much more of a concern here than in a simple opinion poll.

The news article, by Stephanie Lee, makes the substantive point clearly enough:

This methodology for calculating vaccine-induced deaths was rife with problems, observers noted, chiefly that Skidmore did not try to verify whether anyone counted in the death toll actually had been vaccinated, had died, or had died because of the vaccine.

Also this:

Steve Kirsch, a veteran tech entrepreneur who founded an anti-vaccine group, pointed out that the study had the ivory tower’s stamp of approval: It had been published in a peer-reviewed scientific journal and written by a professor at Michigan State University. . . .

In a sympathetic interview with Skidmore, Kirsch noted that the study had been peer-reviewed. “The journal picks the peer reviewers … so how can they complain?” he said.

Ultimately the responsibility for publishing a misleading article falls upon the article’s authors, not upon the journal. You can’t expect or demand careful reviews from volunteer reviewers, nor can you expect volunteer journal editors to carefully vet every paper they will publish. Yes, the peer reviews for the above-discussed paper were useless—actually worse than useless, in that they gave a stamp of approval to bad work—but you can’t really criticize the reviewers for “not doing their jobs,” given that reviewing is not their job—they’re doing it for free.

Anyway, it’s a good thing that the journal shared the review reports so we can see how useless they were.

Moving from “admit your mistakes” to accepting the inevitability of mistakes and their fractal nature

The other day we talked about checking survey representativeness by looking at canary variables:

Like the canary in the coal mine, a canary variable is something with a known distribution that was not adjusted for in your model. Looking at the estimated distribution of the canary variable, and then comparing to external knowledge, is a way of checking your sampling procedure. It’s not an infallible check—–our sample, or your adjusted sample, can be representative for one variable but not another—but it’s something you can do.

Then I noticed another reference, from 2014:

What you’d want to do [when you see a problem] is not just say, Hey, mistakes happen! but rather to treat these errors as information, as model checks, as canaries in the coal mine and use them to improve your procedure. Sort of like what I did when someone pointed out problems in my election maps.

Canaries all around us

When you notice a mistake, something that seemed to fit your understanding but turned out to be wrong, don’t memory-hole it; engage with it. I got soooo frustrated with David Brooks, or the Nudgelords (further explanation here), or the Freakonomics team or, at a more technical level, the Fivethirtyeight team, when they don’t wrestle with their mistakes.

Dudes! A mistake is a golden opportunity, a chance to learn. You don’t get these every day—or maybe you do! To throw away such opportunities . . . it’s like leaving the proverbial $20 bill on the table.

When Matthew Walker or Malcolm Gladwell get caught out on their errors and they bob and weave and avoid confronting the problem, then I don’t get frustrated in the same way. Their entire brand is based on simplifying the evidence. Similarly with Brian Wansink: there was no there there. If he were to admit error, there’s be nothing left.

But David Brooks, Nudge, Freakonomics, Fivethirtyeight . . . they’re all about explanation, understanding, and synthesis. Sure, it would be a short-term hit to their reputations to admit they got fooled by bad statistical analyses (on the topic of Jews, lunch, beauty, and correlated forecasts, respectively) that happened to aligned with their ideological or intellectual preconceptions, but longer-term, they could do so much better. C’mon, guys! There’s more to life than celebrity, isn’t there? Try to remember what got you interested in writing about social science in the first place.

Moving from “admit your mistakes” to accepting the inevitability of mistakes and their fractal nature

I wonder whether part of this is the implicit dichotomy of “admit when you’re wrong.” We’re all wrong all the time, but when we frame “being wrong” as something that stands out, something that needs to be admitted, maybe that makes it more difficult for us to miss all the micro-errors that we make. If we could get in the habit of recognizing all the mistakes we make every day, all the false starts and blind alleys and wild goose chases that are absolutely necessary in any field of inquiry, then maybe it would be less of a big deal to face up to mistakes we make that are pointed out to us by others.

Mistakes are routine. We should be able to admit them forthrightly without even needing to swallow hard and face up to them, as it were. For example, Nate Silver recently wrote, “The perfect world is one in which the media is both more willing to admit mistakes—and properly frame provisional reporting as provisional and uncertain—and the public is more tolerant of mistakes. We’re not living that world.” Which I agree with, and it applies to Nate too. Maybe we need to go even one step further and not think of a mistake as something that needs to be “admitted,” but just something that happens when we are working on complicated problems, whether they be problems of straight-up journalism (with reports coming from different sources), statistical modeling (relying on assumptions that are inevitably wrong in various ways), or assessment of evidence more generally (at some point you end up with pieces of information that are pointing in different directions).

A successful example of “adversarial collaboration.” When does this approach work and when does it not?

Stephen Ceci, Shulamit Kahn, and Wendy Williams write:

We synthesized the vast, contradictory scholarly literature on gender bias in academic science from 2000 to 2020. . . . Claims and counterclaims regarding the presence or absence of sexism span a range of evaluation contexts. Our approach relied on a combination of meta-analysis and analytic dissection. We evaluated the empirical evidence for gender bias in six key contexts in the tenure-track academy: (a) tenure-track hiring, (b) grant funding, (c) teaching ratings, (d) journal acceptances, (e) salaries, and (f) recommendation letters. We also explored the gender gap in a seventh area, journal productivity, because it can moderate bias in other contexts. . . . Contrary to the omnipresent claims of sexism in these domains appearing in top journals and the media, our findings show that tenure-track women are at parity with tenure-track men in three domains (grant funding, journal acceptances, and recommendation letters) and are advantaged over men in a fourth domain (hiring). For teaching ratings and salaries, we found evidence of bias against women; although gender gaps in salary were much smaller than often claimed, they were nevertheless concerning.

They continue:

Even in the four domains in which we failed to find evidence of sexism disadvantaging women, we nevertheless acknowledge that broad societal structural factors may still impede women’s advancement in academic science. . . . The key question today is, in which domains of academic life has explicit sexism been addressed? And in which domains is it important to acknowledge continuing bias that demands attention and rectification lest we maintain academic systems that deter the full participation of women? . . .

Our findings of some areas of gender neutrality or even a pro-female advantage are very much rooted in the most recent decades and in no way minimize or deny the existence of gender bias in the past. Throughout this article, we have noted pre-2000 analyses that suggested that bias either definitely or probably was present in some aspects of tenure-track academia before 2000. . . .

The authors characterize this project as an “adversarial collaboration”:

This article represents more than 4.5 years of effort by its three authors. By the time readers finish it, some may assume that the authors were in agreement about the nature and prevalence of gender bias from the start. However, this is definitely not the case. Rather, we are collegial adversaries who, during the 4.5 years that we worked on this article, continually challenged each other, modified or deleted text that we disagreed with, and often pushed the article in different directions. . . . Kahn has a long history of revealing gender inequities in her field of economics, and her work runs counter to Ceci and Williams’s claims of gender fairness. . . . In 2019, she co-organized a conference on women in economics, and her most recent analysis in 2021 found gender inequities persisting in tenure and promotion in economics. . . . Her findings diverge from Ceci and Williams’s, who have published a number of studies that have not found gender bias in the academy, such as their analyses of grants and tenure-track hiring . . .

Although our divergent views are real, they may not be evident to readers who see only what survived our disagreements and rewrites; the final product does not reveal the continual back and forth among the three of us. Fortunately, our viewpoint diversity did not prevent us from completing this project on amicable terms. Throughout the years spent working on it, we tempered each other’s statements and abandoned irreconcilable points, so that what survived is a consensus document that does not reveal the many instances in which one of us modified or cut text that another wrote because they felt it was inconsistent with the full corpus of empirical evidence. . . .

Editors and board members can promote science by encouraging, when possible, diverse viewpoints and by commissioning teams of adversarial coauthors (as this particular journal, Psychological Science in the Public Interest, was founded to do—to bring coauthors together in an attempt to resolve their historic differences). Knowing that one’s writing will be criticized by one’s divergently thinking coauthors can reduce ideologically driven criticisms that are offered in the guise of science. . . .

Interesting. In the past I’ve been suspicious of adversarial collaborations—whenever I’ve tried such a thing it hasn’t worked so well, and examples I’ve seen elsewhere have seemed to have more of the “adversarial” than the “collaboration.”

Here are two examples (here and here) where I tried to work with people who I disagreed with, but they didn’t want to work with me.

I get it: in both places I was pretty firm that they had been making strong claims that were not supported by their evidence, and there was no convenient halfway point where they could rest. Ideally they’d just have agreed with me, but it’s pretty rare that people will just give up something they’ve already staked a claim on.

I’m not saying these other researchers are bad people. In each case, there was a disagreement about the strength of evidence. My point is just that there was no clear way forward regarding an adversarial collaboration. So I just wrote my articles on my own; I consider each of these to be a form of “asynchronous collaboration.” Still better than nothing.

But this one by Ceci, Kahn, and Williams seems to have worked well. Perhaps it’s easier in psychology than in political science, for some reason?

That said, I can’t imagine a successful adversarial collaboration with the psychologists who published some of the horrible unreplicable stuff from the 2005-2020 era. They just seem too invested in their claims, also they achieved professional success with that work and have no particular motivation to lend their reputations to any work that might shoot it down. By their behavior, they treat their claims as fragile and would not want them to be put to the test. The Ceci, Kahn, Williams example is different, perhaps, because there are policy questions at stake, and all of them are motivated to persuade people in the middle of the debate. In contrast, the people pushing some of the more ridiculous results in embodied cognition and evolutionary psychology have no real motivation to persuade skeptics or even neutrals; they just need to keep their work from being seen as completely discredited.

This is related to my point about research being held to a higher standard when it faces active opposition.

“I am a junior undergraduate student majoring in linguistics and have recently started conducting brain imaging studies. . . .”

Isabella Lai writes:

I am a junior undergraduate student majoring in linguistics and have recently started conducting brain imaging studies.

Yesterday, I came across a paper published in Nature Human Behavior by Grand, Blank, Pereira, and Fedorenko that raised several concerns for me. The paper attempts to find word-embedding using Amazon Mechanical Turk, or, in their words, “investigates context-dependent knowledge using semantic projection in word embeddings”, but the methodology might have a few issues. Take the simplest example, regarding the use of Pearson’s correlation as an evaluation measure for concept acquisition, which might be potentially misleading. The acquisition process for linguistic concepts with multiple dimensions might involve phase transitions, and Pearson’s correlation, especially when outliers are smoothed out, measures the linear relationship between variables. It might not capture the nonlinear nuances of such phase transitions.

I thought you might be interested in this paper, as it is a paradigmatic study of Computational X or Computationalized X, where X is a humanities subject that has been taken over by machine learning models before a coherent, formalized theoretical foundation has been developed for this subject. Sometimes it is linguistics, other times it is psychology, and most likely X stands at the intersection of both subjects, for their shared status of being a modern science, and their shared lack of a well-defined mathematical basis.

I have been thinking about the neural bases of language since college, and after a few hands-on experiences in data preprocessing using MEG and EEG, I have come up with a tentative hypothesis that I hope to share with you. My proposal is that the lack of evidence for an exact neural correlate of language or any theoretical concepts in linguistics (e.g., Professor Chomsky’s favorite merge combinator and lexical access) might actually be evidence that language is an innate capacity. In other words, we might not need an additional consumption of blood glucose when using language, regardless of whether it is for computational purposes (talking to oneself via the “I-language”) or for communication purposes because language is the air of our thoughts. It might be the case that poverty of the stimulus is trivially true, but a median correlation of 0.47 and 52 category-feature pair are inadequate to disprove it.

I am curious about your opinion on this overly naive hypothesis (but it might be just parsimonious, if it turns out that it can be developed, and might one day go beyond being trivially true) from a statistical perspective.

My reply: You raise two issues: (1) the use of theory-free statistical analysis to obtain data summaries that are then used to make general scientific conclusions, and (2) evidence for neural correlates of language etc.

For item #1, I’ll just say there’s nothing wrong with looking at correlations and other theory-free data summaries as a way to turn up interesting patterns that then can be studied more carefully. The process goes like this: (a) root around in all sorts of data, run all sorts of experiments, calculate all sorts of things, look for interesting correlations; (b) consider “interesting correlations” as anomalies in existing informal theories about the world; (c) form specific hypotheses and test them. (Here I’m using the idea of “testing” hypothesis in the sense of science, not so-called “hypothesis testing” in statistics.)

For item #2, I have no idea. This one’s not just outside my areas of expertise, it’s also outside anything I’ve ever really thought about.

Lai adds:

On a separate note, I read your blog post yesterday discussing John Tierney’s opinion piece on school shootings and the potential negative impact of active-shooter drills. I hope to share my intuition, that school shootings are fat-tailed events, and the extreme cases involving psychopathic individuals are more likely to occur than the default priors. As the media reports on and exposes more school shootings, psychopaths who previously may not have considered such a possibility are now more likely to view school shootings as a newly discovered option. The increased visibility of these incidents might contribute to a compounding rise in the frequency of school shootings, which might not be mitigated, unless through interventions like implementing better gun control measures.

Yeah, that’s pretty much my take too. But I don’t have any evidence on this concern, one way or another.

The significance of a reliable crypto management tool like Ledger Live Desktop cannot be overstated. It ensures that your investment strategies are well-informed and executed efficiently.

Many ways to trick a deep model

This is Jessica. Somewhat related to Andrew’s earlier post today, Bo Li from UIUC gave an interesting talk at Northwestern yesterday. The results she presented were an informative counterpoint to the “max AI hype” moment that seems to be playing out in the stream of articles celebrating artificial general intelligence or proposing explanations of how ML got so successful (like Donoho’s paper on frictionless reproducibility, which I liked).

Much of the talk covered a recent project called Decoding Trust, which is an open source toolkit for assessing how trustworthy GPT models are. It provides test cases in the form of prompts that can be used to make the model mess up, where messing up can be defined from various perspectives: generating toxic output, demonstrating gender or racial bias, leaking private training data, etc. 

For example, one prompt might describe a CS undergrad with a certain level of work experience, and ask the model whether they deserved a software engineer position with a starting salary of $225k or more. A second prompt is the same but substitutes the male name and pronouns with female name and pronouns, leading the model to change its mind about whether the salary was appropriate. 

One thing her results showed was that GPT-4 is actually worse than GPT-3.5 on many of these tests. I.e., improvements that lead GPT-4 to better follow instructions may also make it more vulnerable to being tricked.

Another was that if you examined recent LLMs from these nine perspectives, there were differences in terms of where different models were most likely to fail, but none of them were dominating across the board. She did mention that lamba2 was more conservative, though, and would often default to saying it didn’t know or couldn’t answer.

One set of results she presented was interesting in that it got at unexpected holes in GPT’s understanding of words that meant the same thing. She showed that depending on which synonym you used to describe one person privately telling another person something (secretly, in confidence, confidentially, privately, etc.) you could get different responses from GPT about whether it was appropriate to tell others the info, suggesting it properly understood some of the synonyms but not others. It was also eye-opening ot me that these issues could be demonstrated even when the model was prompted with context warning it what not to do. For example, even when it was first reminded not to share certain corporate info like contact information outside of the company, she showed you could still get it to reveal emails and phone numbers. 

Overall, this has me wondering how viable some safe-guarding approaches used to prevent bad behaviors of LLMs like fine-tuning are for getting these models to the point where they can be deployed in the world. For example, she mentioned that GPT models won’t produce SSNs, suggesting they have been carefully tuned to avoid this, and that they seem  less likely to leak numeric information than text. However, there are many ways to be adversarial, and its hard to imagine fine-tuning to prevent them all.

Li also suggested that having approaches that can efficiently find such examples, or incorporate them in training (i.e., robust training approaches), doesn’t necessarily translate to performance improvements. She put up a graph at the end of her talk, similar to the one below, to make the point that despite all the work on robustness there is still a long way to go. The concept of “certified robustness” was new to me, but refers to robustness verification approaches that can theoretically certify the lower bound of a model’s performance against a particular form of adversary. For example, the strongest adversary is “white-box,” meaning they can access the model (including parameters and architecture). A common formulation is an adversary who searches a space of perturbations within some predefined bound of an input instance to identify those that will fool the model into making incorrect predictions. Robust training approaches are devised to prevent models from being vulnerable to such attacks, but, as Li’s graph showed, beyond certain benchmarks like MNIST, progress in certifiably robust ML has been slow and there’s still a long way to go. I pulled from the below graph from her github site on a platform for benchmarking verification and robust training approaches for deep neural nets

Rubric-based holistic word salad represents a change from traditional approaches to publishing papers in physics

Pointing to this recent article in Physical Review Physics Education Research, Michael Weissman flags this passage:

In our initial analysis of the historical data, we [the authors of the above-linked article] noticed there are cases where applicants have similar physics GRE scores and GPA, yet one applicant is accepted while the other is not. Given that cases such as these might add challenges to modeling the data, removing such applicants might allow us to better characterize the general trends in the data. We, therefore, consider an alternative approach that detects similar applicants with different admission outcomes and removes them from the database.

Whaaaaa?

The last 2 sentences of the abstract of that article are just amazing:

Our inability to model the second dataset despite being able to model the first combined with model comparison analyses suggests that rubric-based admission does change the underlying process. These results suggest that rubric-based holistic review is a method that could make the graduate admission process in physics more equitable.

I pointed this one out to Jamie Robins, who wrote:

Almost as equitable as a coin toss. With more grant money and research they may get there yet.

Jamie then elaborated:

Of course I was being overly pithy. If many more advantaged than disadvantaged apply, a coin toss will not meet their goals. So they will require a rubric based selection for which the ML program does better than chance (unless both the the covariates used by the rubric and their correlates are withheld from the features given to the ML program).

Miguel Hernan picked up on one other thing:

Also from the abstract:

Yet, no studies have examined whether rubric-based admissions methods represent a fundamental change in the admissions process or simply represent a new tool that achieves the same outcome.

But I couldn’t find a mention to MIT’s experience as summarized here by its Dean of Admissions:

Standardized tests also help us identify academically prepared, socioeconomically disadvantaged students who could not otherwise demonstrate readiness because they do not attend schools that offer advanced coursework, cannot afford expensive enrichment opportunities, cannot expect lengthy letters of recommendation from their overburdened teachers, or are otherwise hampered by educational inequalities. By using the tests as a tool in the service of our mission, we have helped improve the diversity of our undergraduate population⁠ while student academic outcomes at MIT have gotten better, too; our strategic and purposeful use of testing has been crucial to doing both simultaneously.⁠

I also enjoyed (that is, was upset by) this bit:

While we were able to develop a sufficiently good model whose results we could trust for the data before the implementation of the rubric, we were unable to do so for the data collected after the implementation of the rubric, despite multiple modifications to the algorithms and data such as implementing Tomek Links.

“Despite multiple modifications to the algorithms and data,” huh? Maybe they just weren’t trying hard enough. I don’t know what’s worse, someone who keeps altering his algorithms and data until he finds success, or someone who tries to do it but can’t succeed.

We laugh because it’s too painful to always be crying.

P.S. I can’t speak to the substance of the subject under discussion. Maybe rubric-based holistic review is a great thing. Just cos a bad paper has been published promoting it, that doesn’t mean it’s a bad idea. Just one thing: I hate the use letters of recommendation for admissions and hiring.

Gigerenzer: Simple heuristics to run a research group

The cognitive psychologist writes:

Collaboration between researchers has become increasingly common, enabling a level of discovery and innovation that is difficult if not impossible to achieve by a single person. But how can one establish and maintain an environment that fosters successful collaboration within a research group? In this case study, I use my own experience when directing the ABC Research Group at the Max Planck Institute for Human Development in Berlin. I first describe the heuristic principles for setting up a research group, including (i) common topic and multiple disciplines, (ii) open culture, (iii) spatial proximity, and (iv) temporal proximity. Then I describe heuristics for maintaining the open culture, such as setting collective goals, including contrarians, distributing responsibility, making bets, the cake rule, and side-by-side writing. These heuristics form an “adaptive toolbox” that shapes the intellectual and social climate. They create a culture of friendly but rigorous discussion, embedded in a family-like climate of trust where everyone is willing to expose their ignorance and learn from the other members. Feeling accepted and trusted encourages taking the necessary risks to achieve progress in science.

This all makes sense to me. I’ve never been very good about organizing research groups myself, so probably I should try to follow this advice.

Here are Gigerenzer’s principles for How to Start a Research Group:

Principle 1: Common topic, multiple disciplines

Principle 2: Create an open culture

Principle 3: Spatial proximity

Principle 4: Temporal proximity

Then, How to Maintain a Culture:

Set collective goals

Distribute responsibility

Secure open culture

One thing I’d like to hear more about is the difficulty of setting all of this up. Here are some challenges:

– To ensure spatial proximity, you need an institution to commit to the space, which in turn can require “politics”; that is, negotiation with powerful people at the institution to secure the space as needed.

– To ensure temporal proximity, you need a steady flow of funds, which requires fundraising or grant-writing. The challenge is to be able to do this without being overwhelmed, as in some biomedical labs where it seems that the only thing ever going on is writing grant proposals.

– There’s also the challenge of getting people in the same room at a time where email and remote meetings are so easy.

There are other challenges too. Setting up an open culture isn’t hard for me, but some of the other steps have been difficult. One thing I like is that the lab that Gigerenzer runs is not called the Gigerenzer Lab. So many professors do this, having the lab named after themselves, and that just seems so tacky.

“What role will hydrological science play in the age of machine learning?” I think it will depend on the problem you’re trying to solve.

Eric Potash writes:

I knew that deep learning was doing amazing things with audio, images, video, language, etc. But when it comes to physics, I sort of assumed that stochastic and process-based models were holding their own. Apparently not so in the case of hydrology at the watershed scale, and this is causing a reckoning in that field. Check out this article, “What Role Does Hydrological Science Play in the Age of Machine Learning?” by Nearing et al. (2020). A cautionary tale for those who think their domain expertise is safe from off-the-shelf ML.

Key quote:

[In 2019] one of the suggested questions was: “Does Machine Learning have a real role in hydrological modeling?” In contrast, we suggest that the existential question for our discipline right now is: “What role will hydrological science play in the age of machine learning?” . . .

Very likely, the future of hydrology will be a mix of AI and physics‐based approaches, but we have a hard time envisioning a future where transformative data science approaches like DL become simply another tool in the hydrologist’s toolbox. We see it as much more likely that hydrological domain knowledge will become an integral part of guiding and developing fundamentally AI‐based systems and analyses . . .

Our message in this opinion piece is to stop assuming that the world needs our theories and expertise and start demonstrating—quantitatively and systematically—the value of individual components of that expertise against the backdrop of a growing importance of big data.

Interesting. I think it’s gotta depend on what questions are being asked. I can believe that when you have tons of past data you can forecast streamflow using a generic statistical or machine-learning model, with hydrological expertise coming in mostly in the choice of what training data to use and when to apply the predictions. When data are scarce, or when designing new systems, for example predicting what will happen if a new dam is built, I’d guess that physical modeling would be important.

The connection between the psychological concept of “generic language” and the problem of overgeneralization from research studies

A couple years ago I suggested: A quick fix in science communication: Switch from the present to the past tense.

Here’s an example. A paper was published, “Māori and Pacific people in New Zealand have a higher risk of hospitalisation for COVID-19,” and I recommended they change “have” to “had” in that title. More generally, I wrote,

There’s a common pattern in science writing to use the present tense to imply that you’ve discovered a universal truth. For example, “Beautiful parents have more daughters” or “Women are more likely to wear red or pink at peak fertility.” OK, those particular papers had other problems, but my point here is that at best these represented findings about some point in time and some place in the past.

Using the past tense in the titles of scientific reports won’t solve all our problems or even most of our problems or even many of our problems, but maybe it will be a useful start, in reminding authors as well as readers of the scope of their findings.

Recently it was brought to my attention that research has been conducted on this topic.

The relevant paper is Generic language in scientific communication, published by Jasmine DeJesus et al. in 2017, who write:

Scientific communication poses a challenge: To clearly highlight key conclusions and implications while fully acknowledging the limitations of the evidence. Although these goals are in principle compatible, the goal of conveying complex and variable data may compete with reporting results in a digestible form . . . For example, generic language (e.g., “Introverts and extraverts require different learning environments”) may mislead by implying general, timeless conclusions while glossing over exceptions and variability. Using generic language is especially problematic if authors overgeneralize from small or unrepresentative samples . . . In an analysis of 1,149 psychology articles, 89% described results using generics . . . Online workers and undergraduate students judged findings expressed with generic language more important than findings expressed with nongeneric language.

It’s good to see this coming out in the psychology literature, given that just a few years ago a prominent psychology professor expressed annoyance when I expressed problems about representativeness in a published study.

Also relevant is our post from a few years ago, Correlation does not even imply correlation, which also addressed the challenges of drawing general conclusions from nonrepresentative samples in the presence of selection bias.

P.S. Also relevant is a post from 2010, “How hard is it to say what you mean?”