
Why I continue to support the science reform movement despite its flaws

I was having a discussion with someone about problems with the science reform movement (as discussed here by Jessica), and he shared his opinion that “Scientific reform in some corners has elements of millenarian cults. In their view, science is not making progress because of individual failings (bias, fraud, qrps) and that if we follow a set of rituals (power analysis, preregistration) devised by the leaders then we can usher in a new era where the truth is revealed (high replicability).”

My quick reaction was that this reminded me of an annoying thing where people use “religion” as a term of insult. When this came up before, I wrote that maybe it’s time to retire use of the term “religion” to mean “uncritical belief in something I disagree with.”

But then I was thinking about this all from another direction, and I think there’s something there there. Not the “millenarian cults” thing, which I think was an overreaction on my correspondent’s part.

Rather, I see a paradox. From his perspective, my correspondent sees the science reform movement as having a narrow perspective, an enforced conformity that leads it into unforced errors such as publishing a high-profile paper promoting preregistration without actually itself following preregistered analysis plans. OK, he doesn’t see all of the science reform movement as being so narrow—for one thing, I’m part of the science reform movement and I wasn’t part of that project!—but he sees some core of the movement as being stuck in narrow rituals and leader-worship.

But I think it’s kind of the opposite. From my perspective, the core of the science reform movement (the Open Science Framework, etc.) has had to make all sorts of compromises with conservative forces in the science establishment, especially within academic psychology, in order to keep them on board. To get funding, institutional support, buy-in from key players, . . . that takes a lot of political maneuvering.

I don’t say this lightly, and I’m not using “political” as a put-down. I’m a political scientist, but personally I’m not very good at politics. Politics takes hard work, requiring lots of patience and negotiation. I’m impatient and I hate negotiation; I’d much rather just put all my cards face-up on the table. For some activities, such as blogging and collaborative science, these traits are helpful. I can’t collaborate with everybody, but when the connection’s there, it can really work.

But there’s more to the world than this sort of small-group work. Building and maintaining larger institutions, that’s important too.

So here’s my point: Some core problems with the open-science movement are not a product of cult-like groupthink. Rather, it’s the opposite: this core has been structured out of a compromise with some groups within psychology who are tied to old-fashioned thinking, and this politically-necessary (perhaps) compromise has led to some incoherence, in particular the attitude or hope that, by just including some preregistration here and getting rid of some questionable research practices there, everyone could pretty much continue with business as usual.

Summary

The open-science movement has always had a tension between burn-it-all-down and here’s-one-quick-trick. Put them together and it kinda sounds like a cult that can’t see outward, but I see it as more the opposite, as an awkward coalition representing fundamentally incoherent views. But both sides of the coalition need each other: the reformers need the old institutional powers to make a real difference in practice, and the oldsters need the reformers because outsiders are losing confidence in the system.

The good news

The good news for me is that both groups within this coalition should be able to appreciate frank criticism from the outside (they can listen to me scream and get something out of it, even if they don’t agree with all my claims) and should also be able to appreciate research methods: once you accept the basic tenets of the science reform movement, there are clear benefits to better measurement, better design, and better analysis. In the old world of p-hacking, there was no real reason to do your studies well, as you could get statistical significance and publication with any old random numbers, along with a few framing tricks. In the new world of science reform—even imperfect science reform—this sort of noise mining isn’t so effective, and traditional statistical ideas of measurement, design, and analysis become relevant again.

So that’s one reason I’m cool with the science reform movement. I think it’s in the right direction: its dot product with the ideal direction is positive. But I’m not so good at politics so I can’t resist criticizing it too. It’s all good.

Reactions

I sent the above to my correspondent, who wrote:

I don’t think it is a literal cult in the sense that carries the normative judgments and pejorative connotations we usually ascribe to cults and religions. The analogy was more of a shorthand to highlight a common dynamic that emerges when you have a shared sense of crisis, ritualistic/procedural solutions, and a hope that merely performing these activities will get past the crisis and bring about a brighter future. This is a spot where group-think can, and at times possibly should, kick in. People don’t have time to each individually and critically evaluate the solutions, and often the claim is that they need to be implemented broadly to work. Sometimes these dynamics reflect a real problem with real solutions, sometimes they’re totally off the rails. All this is not to say I’m opposed to scientific reform; I’m very much for it in the general sense. There’s no shortage of room for improvement in how we turn observations into understanding, from improving statistical literacy and theory development to transparency and fostering healthier incentives. I am, however, wary of the uncritical belief that the crisis is simply one of failed replications and that the performance of “open science rituals” is sufficient for reform, across the breadth of things we consider science. As a minor point, I don’t think the vast majority of prominent figures in open science intend for these dynamics to occur, but I do think they all should be wary of them.

There does seem to be a problem that many researchers are too committed to the “estimate the effect” paradigm and don’t fully grapple with the consequences of high variability. This is particularly disturbing in psychology, given that just about all psychology experiments study interactions, not main effects. Thus, a claim that effect sizes don’t vary much is a claim that effect sizes vary a lot in the dimension being studied, but have very little variation in other dimensions. Which doesn’t make a lot of sense to me.

Getting back to the open-science movement, I want to emphasize the level of effort it takes to conduct and coordinate these big group efforts, along with the effort required to keep together the coalition of skeptics (who see preregistration as a tool for shooting down false claims) and true believers (who see preregistration as a way to defuse skepticism about their claims) and get these papers published in top journals. I’d also say it takes a lot of effort for them to get funding, but that would be kind of a cheap shot, given that I too put in a lot of effort to get funding!

Anyway, to continue, I think that some of the problems with the science reform movement are that it effectively promises different things to different people. And another problem is with these massive projects that inevitably include things that not all the authors will agree with.

So, yeah, I have a problem with simplistic science reform prescriptions, for example recommendations to increase sample size without any nod toward effect size and measurement. But much much worse, in my opinion, are the claims of success we’ve seen from researchers and advocates who are outside the science-reform movement. I’m thinking here about ridiculous statements such as the unfounded claim of 17 replications of power pose, or the endless stream of hype from the nudgelords, or the “sleep is your superpower” guy, or my personal favorite, the unfounded claim from Harvard that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

It’s almost enough to stop here with the remark that the scientific reform movement has been lucky in its enemies.

But I also want to say that I appreciate that the “left wing” of the science reform movement—the researchers who envision replication and preregistration, and the threat of replication and preregistration, as tools to shoot down bad studies—has indeed faced real resistance within academia and the news media to their efforts, as lots of people will hate the bearers of bad news. And I also appreciate the “right wing” of the science reform movement—the researchers who envision replication and preregistration as a way to validate their studies and refute the critics—in that they’re willing to put their ideas to the test. Not always perfectly, but you have to start somewhere.

While I remain annoyed at certain aspects of the mainstream science reform movement, especially when it manifests itself in mass-authored articles such as the notorious recent non-preregistered paper on the effects of preregistration, or that “Redefine statistical significance” article, or various p-value hardliners we’ve encountered over the decades, I also respect the political challenges of coalition-building that are evident in that movement.

So my plan remains to appreciate the movement while continuing to criticize its statements that seem wrong or do not make sense.

I sent the above to Jessica Hullman, who wrote:

I can relate to being surprised by the reactions of open science enthusiasts to certain lines of questioning. In my view, how to fix science is about as complicated a question as we will encounter. The certainty/level of comfortableness with making bold claims that many advocates of open science seem to have is hard for me to understand. Maybe that is just the way the world works, or at least the way it works if you want to get your ideas published in venues like PNAS or Nature. But the sensitivity to what gets said in public venues against certain open science practices or people reminds me very much of established academics trying to hush talk about problems in psychology, as though questioning certain things is off limits. I’ve been surprised on the blog for example when I think aloud about something like preregistration being imperfect and some commenters seem to have a visceral negative reaction to see something like that written. To me that’s the opposite of how we should be thinking.

As an aside, someone I’m collaborating with recently described to me his understanding of the strategy for getting published in PNAS. It was 1. Say something timely/interesting, 2. Don’t be wrong. He explained that ‘Don’t be wrong’ could be accomplished by preregistering and large sample size. Naturally I was surprised to hear #2 described as if it’s really that easy. Silly me for spending all this time thinking so hard about other aspects of methods!

The idea of necessary politics is interesting; not what I would have thought of but probably some truth to it. For me many of the challenges of trying to reform science boil down to people being heuristic-needing agents. We accept that many problems arise from ritualistic behavior, but we have trouble overcoming that, perhaps because no matter how thoughtful/nuanced some may prefer to be, there’s always a larger group who want simple fixes / aren’t incentivized to go there. It’s hard to have broad appeal without being reductionist I guess.

“Guns, Race, and Stats: The Three Deadliest Weapons in America”

Geoff Holtzman writes:

In April 2021, The Guardian published an article titled “Gun Ownership among Black Americans is Up 58.2%.” In June 2022, Newsweek claimed that “Gun ownership rose by 58 percent in 2020 alone.” The Philadelphia Inquirer first reported on this story in August 2020, and covered it again as recently as March 2023 in a piece titled “The Growing Ranks of Gun Owners.” In between, more than two dozen major media outlets reported this same statistic. Despite inconsistencies in their reporting, all outlets (directly or indirectly) cite as their source an infographic based on a survey conducted by a firearm industry trade association.

Last week, I shared my thoughts on the social, political, and ethical dimensions of these stories in an article published in The American Prospect. Here, I address whether and to what extent their key statistical claim is true. An examination of the infographic—produced by the National Shooting Sports Foundation (NSSF)—reveals that it is not. Below, I describe six key facts about the infographic that undermine the media narrative. After removing all false, misleading, or meaningless words from the Guardian’s headline and Newsweek’s claim, the only words remaining are “Among,” “Is,” “In,” and “By.”

(1) 58.2% only refers to the first six months of 2020

To understand demographic changes in firearms purchases or ownership in 2020, one needs to ascertain firearm sales or ownership demographics from before 2020 and after 2020. The best way to do this is with a longitudinal panel, which is how Pew found no change in Black gun ownership rates among Americans from 2017 (24%) to 2021 (24%). Longitudinal research in The Annals of Internal Medicine also found no change in gun ownership among Black Americans from 2019 (21%) through 2020/2021 (21%).

By contrast, the NSSF conducted a one-time survey of its own member retailers. In July 2020, the NSSF asked these retailers to compare demographics in the first six months of 2020 to demographics in the first six months of 2019. A full critique of this approach and its drawbacks would require a lengthy discussion of the scientific literature on recency bias, telescoping effects, and so on. To keep this brief, I’d just like to point out that by July 2020, many of us could barely remember what the world was like back in 2019.

Ironically, the media couldn’t even remember when the survey took place. In September 2020, NPR reported—correctly—that “according to AOL News,” the survey concerned “the first six months of 2020.”  But in October of 2020, CNN said it reflected gun sales “through September.” And by June 2021, CNN revised its timeline to be even less accurate, claiming the statistic was “gun buyers in 2020 compared to 2019.”

Strangely, it seems that AOL News may have been one of the few media outlets that actually looked at the infographic it reported. The timing of the survey—along with other critical but collectively forgotten information on its methods—is printed at the top of the infographic. The entire top quarter of the NSSF-produced image is devoted to these details: “FIREARM & AMMUNITION SALES DURING 1ST HALF OF 2020, Online Survey Fielded July 2020 to NSSF Members.”

But as I discuss in my article in The American Prospect, a survey about the first half of 2020 doesn’t really support a narrative about Black Americans’ response to “protests throughout the summer” of 2020 or to that November’s “contested election.” This is a great example of a formal fallacy (post hoc reasoning), memory bias (more than one may have been at work here), and motivated reasoning all rolled into one. To facilitate these cognitive errors, the phrase “in 2020” is used ambiguously in the stories, referring at times to the first six months of 2020 and at times to specific days or periods during the last seven months of the year. This part of the headlines and stories is not false, but it does conflate two distinct time periods.

The results of the NSSF survey cannot possibly reflect the events of the Summer and Fall of 2020. Rather, the survey’s methods and materials were reimagined, glossed over, or ignored to serve news stories about those events.

(2) 58.2% describes only a tiny, esoteric fraction of Americans

To generalize about gun owner demographics in the U.S., one has to survey a representative, random sample of Americans. But the NSSF survey was not sent to a representative sample of Americans—it was only sent to NSSF members. Furthermore, it doesn’t appear to have been sent to a random sample of NSSF members—we have almost no information on how the sample of fewer than 200 participants was drawn from the NSSF’s membership of nearly 10,000. Most problematically—and bizarrely—the survey is supposed to tell us something about gun buyers, yet the NSSF chose to send the survey exclusively to its gun sellers.

The word “Americans” in these headlines is being used as shorthand for “gun store customers as remembered by American retailers up to 18 months later.” In my experience, literally no one assumes I mean the latter when I say the former. The latter is not representative of the former, so this part of the headlines and news stories is misleading.

(3) 58.2% refers to some abstract, reconstructed memory of Blackness

The NSSF doesn’t provide demographic information for the retailers it surveyed. Demographics can provide crucial descriptive information for interpreting and weighting data from any survey, but their omission is especially glaring for a survey that asked people to estimate demographics. But there’s a much bigger problem here.

We don’t have reliable information about the races of these retailers’ customers, which is what the word “Black” is supposed to refer to in news coverage of the survey. This is not an attack on firearms retailers; it is a well-established statistical tendency in third-party racial identification. As I’ve discussed in The American Journal of Bioethics, a comparison of CDC mortality data to Census records shows that funeral directors are not particularly accurate in reporting the race of one (perfectly still) person at a time. Since that’s a simpler task than searching one’s memory and making statistical comparisons of all customers from January through June of two different years, it’s safe to assume that the latter tends to produce even less accurate reports.

The word “Black” in these stories really means “undifferentiated masses of people from two non-consecutive six-month periods recalled as Black.” Again, the construct picked out by “Black” in the news coverage is a far cry from the construct actually measured by the survey.

(4) 58.2% appears to be about something other than guns

The infographic doesn’t provide the full wording of survey items, or even make clear how many items there were. Of the six figures on the infographic, two are about “sales of firearms,” two are about “sales of ammunition,” and one is about “overall demographic makeup of your customers.” But the sixth and final figure—the source of that famous 58.2%—does not appear to be about anything at all. In its entirety, the text accompanying that figure reads: “For any demographic that you had an increase, please specify the percent increase.”

Percent increase in what? Firearms sales? Ammunition sales? Firearms and/or ammunition sales? Overall customers? My best guess would be that the item asked about customers, since guns and ammo are not typically assigned a race. But the sixth figure is uninterpretable—and the 58.2% statistic meaningless—in the absence of answers.

(5) 58.2% is about something other than ownership

I would not guess that the 58.2% statistic was about ownership, unless this were a multiple choice test and I was asked to guess which answer was a trap.

The infographic might initially appear to be about ownership, especially to someone primed by the initial press release. It’s notoriously difficult for people to grasp distinctions like those between purchases by customers and ownership in a broader population. I happen to think that the heuristics, biases, and fallacies associated with that difficulty—reverse inference, base rate neglect, affirming the consequent, etc.—are fascinating, but I won’t dwell on them here. In the end, ammunition is not a gun, a behavior (purchasing) is not a state (ownership), and customers are none of the above.

To understand how these concepts differ, suppose that 80% of people who walk into a given gun store in a given year own a gun. The following year, the store could experience a 58% increase in customers, or a 58% increase in purchases, but not observe a 58% increase in ownership. Why? Because even the best salesperson can’t get 126% of customers to own guns. So the infographic neither states nor implies anything specific about changes in gun ownership.
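Here’s that arithmetic spelled out, using the hypothetical numbers from the example above (a quick sketch, nothing from the actual survey):

```python
# Hypothetical store from the example: 80% of customers own a gun,
# and the number of customers (or purchases) rises 58% the next year.
baseline_ownership = 0.80
reported_increase = 0.58

# If ownership among customers rose by that same 58%, the implied rate would be:
implied_ownership = baseline_ownership * (1 + reported_increase)
print(f"Implied ownership rate among customers: {implied_ownership:.1%}")  # 126.4%, which is impossible
```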

(6) 58.2% was calculated deceptively

I can’t tell if the data were censored (e.g., by dropping some responses before analysis) or if the respondents were essentially censored (e.g., via survey skip logic), but 58.2% is the average guess only of retailers who reported an increase in Black customers. Retailers who reported no increase in Black customers were not counted toward the average. Consequently, the infographic can’t provide a sample size for this bar chart. Instead, it presents a range of sample sizes for individual bars: “n=19-104.”

Presenting means from four distinct, artificially constructed, partly overlapping samples as a single bar chart without specifying the size of any sample renders that 58.2% number uninterpretable. It is quite possible that only 19 of 104 retailers reported an increase in Black customers, and that all 104 reported an increase in White customers—for whom the infographic (but not the news) reported a 51.9% increase. Suppose 85 retailers did not report an increase in Black customers, and instead reported no change for that group (i.e., a change of 0%). Then if we actually calculated the average change in demographics reported by all survey respondents, we would find just a 10.6% increase in Black customers (19/104 x 58.2%), as compared to a 51.9% increase in white customers (104/104 x 51.9%).
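To make that recount concrete, here’s the same arithmetic as a short sketch (the 19-of-104 split and the zeros for the other 85 retailers are assumptions for illustration, as stated above; the response-level data aren’t available):

```python
# Hypothetical recount: 19 of 104 retailers report a 58.2% increase in Black customers,
# the other 85 report no change (0%); all 104 report a 51.9% increase in White customers.
n_total = 104
n_increase_black = 19

avg_black = (n_increase_black * 0.582 + (n_total - n_increase_black) * 0.0) / n_total
avg_white = (n_total * 0.519) / n_total

print(f"Average reported change, Black customers: {avg_black:.1%}")  # about 10.6%
print(f"Average reported change, White customers: {avg_white:.1%}")  # 51.9%
```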

A proper analysis of the full survey data could actually undermine the narrative of a surge in gun sales driven by Black Americans. In fact, a proper calculation may even have found a decrease, not an increase, for this group. The first two bar charts on the infographic report percentages of retailers who thought overall sales of firearms and of ammunition were “up,” “down,” or the “same.” We don’t know if the same response options were given for the demographic items, but if they were, a recount of all votes might have found a decrease in Black customers. We’ll never know.

The 58.2% number is meaningless without additional but unavailable information. Or, to use more technical language, it is a ceiling estimate, as opposed to a real number. In my less-technical write-up, I simply call it a fake number.

This is kind of in the style of our recent article in the Atlantic, The Statistics That Come Out of Nowhere, but with a lot more detail. Or, for a simpler example, a claim from a few years ago about political attitudes of the super-rich, which came from a purported survey about which no details were given. As with some of those other claims, the reported number of 58% was implausible on its face, but that didn’t stop media organizations from credulously repeating it.

On the plus side, a few years back a top journal (yeah, you guessed it, it was Lancet, that fount of politically-motivated headline-bait) published a ridiculous study on gun control and, to their credit, various experts expressed their immediate skepticism.

To their discredit, the news media reports on that 58% thing did not even bother running it by any experts, skeptical or otherwise. Here’s another example (from NBC), here’s another (from Axios), here’s CNN . . . you get the picture.

I guess this story is just too good to check, it fits into existing political narratives, etc.

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).
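If you haven’t used CmdStanPy before, the basic workflow is short. Here’s a minimal sketch (the Stan file and data are placeholders, not taken from the book):

```python
from cmdstanpy import CmdStanModel

# Compile the Stan program (path is a placeholder) and sample from the posterior.
model = CmdStanModel(stan_file="bernoulli.stan")
data = {"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}
fit = model.sample(data=data, chains=4, iter_sampling=1000, seed=123)

print(fit.summary())    # posterior summaries as a pandas DataFrame
draws = fit.draws_pd()  # draws as a pandas DataFrame, ready for pandas or plotnine
```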

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.
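For anyone who wants to follow that advice, the setup amounts to a couple of preamble lines (a sketch; I haven’t checked what the book actually uses):

```latex
% Palatino for text and math, Source Code Pro as the typewriter/code font
\usepackage{mathpazo}
\usepackage{sourcecodepro}  % replaces Courier for verbatim and code listings
```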

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.

Hey! Here’s how to rewrite and improve the title and abstract of a scientific paper:

Last week in class we read and then rewrote the title and abstract of a paper. We did it again yesterday, this time with one of my recent unpublished papers.

Here’s what I had originally:

title: Unifying design-based and model-based sampling inference by estimating a joint population distribution for weights and outcomes

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

Not terrible, but we can do better. Here’s the new version:

title: MRP using sampling weights

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights come from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

How did we get there?

The title. The original title was fine—it starts with some advertising (“Unifying design-based and model-based sampling inference”) and follows up with a description of how the method works (“estimating a joint population distribution for weights and outcomes”).

But the main point of the title is to get the notice of potential readers, people who might find the paper useful or interesting (or both!).

This pushes the question back one step: Who would find this paper useful or interesting? Anyone who works with sampling weights. Anyone who uses public survey data or, more generally, surveys collected by others, which typically contain sampling weights. And anyone who’d like to follow my path in survey analysis, which would be all the people out there who use MRP (multilevel regression and poststratification). Hence the new title, which is crisp, clear, and focused.

My only problem with the new title, “MRP using sampling weights,” is that it doesn’t clearly convey that the paper involves new research. It makes it look like a review article. But that’s not so horrible; people often like to learn from review articles.

The abstract. If you look carefully, you’ll see that the new abstract is the same as the original abstract, except that we replaced the middle part:

But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses.

with this:

But what if you don’t know where the weights come from?

Here’s what happened. We started by rereading the original abstract carefully. That abstract has some long sentences that are hard to follow. The first sentence is already kinda complicated, but I decided to keep it, because it clearly lays out the problem, and also I think the reader of an abstract will be willing to work a bit when reading the first sentence. Getting to the abstract at all is a kind of commitment.

The second sentence, though, that’s another tangle, and at this point the reader is tempted to give up and just skate along to the end—which I don’t want! The third sentence isn’t horrible, but it’s still a little bit long (starting with the nearly-contentless “It is also not clear how one is supposed to account for” and ending with the unnecessary “in such analyses”). Also, we don’t even really talk much about clustering in the paper! So it was a no-brainer to collapse these into a sentence that was much more snappy and direct.

Finally, yeah, the final sentence of the abstract is kinda technical, but (a) the paper’s technical, and we want to convey some of its content in the abstract!, and (b) after that new, crisp, replacement second sentence, I think the reader is ready to take a breath and hear what the paper is all about.

General principles

Here’s a general template for a research paper:
1. What is the goal or general problem?
2. Why is it important?
3. What is the challenge?
4. What is the solution? What must be done to implement this solution?
5. If the idea in this paper is so great, why wasn’t the problem already solved by someone else?
6. What are the limitations of the proposed solution? What is its domain of applicability?

We used these principles in our rewriting of my title and abstract. The first step was for me to answer the above 6 questions:
1. Goal is to do survey inference with sampling weights.
2. It’s important for zillions of researchers who use existing surveys which come with weights.
3. The challenge is that if you don’t know where the weights come from, you can’t just follow the recommended approach to condition in the regression model on the information that is predictive of inclusion into the sample.
4. The solution is to condition on the weights themselves, which involves the additional step of estimating a joint population distribution for the weights and other predictors in the model.
5. The problem involves a new concept (imagining a population distribution for weights, which is not a coherent assumption, because, in the real world, weights are constructed based on the data) and some new mathematical steps (not inherently sophisticated as mathematics, but new work from a statistical perspective). Also, the idea of modeling the weights is not completely new; there is some related literature, and one of our contributions is to take weights (which are typically constructed from a non-Bayesian design-based perspective) and use them in a Bayesian analysis.
6. Survey weights do not include all design information, so the solution offered in the paper can only be approximate. In addition, the method requires distributional assumptions on the weights; also, it’s a new method, so who knows how useful it will be in practice.

We can’t put all of that in the abstract, but we were able to include some versions of the answers to questions 1, 3, and 4. Questions 5 and 6 are important, but it’s ok to leave them to the paper, as this is where readers will typically search for limitations and connections to the literature.

Maybe we should include the answer to question 2 in the abstract, though. Perhaps we could replace “But what if you don’t know where the weights come from?” with “But what if you don’t know where the weights come from? This is often a problem when analyzing surveys collected by others.”

Summary

By thinking carefully about goals and audience, we improved the title and abstract of a scientific paper. You should be able to do this in your own work!

Some people have no cell phone and never check their email before 4pm.

Paul Alper points to this news article, “Barely a quarter of Americans still have landlines. Who are they?”, by Andrew Van Dam, my new favorite newspaper columnist. Van Dam writes:

Only 2 percent of U.S. adults use only landlines. Another 3 percent mostly rely on landlines and 1 percent don’t have phones at all. The largest group of holdouts, of course, are folks 65 and older. That’s the only demographic for which households with landlines still outnumber wireless-only households. . . . about 73 percent of American adults lived in a household without a landline at the end of last year — a figure that has tripled since 2010.

Here are some statistics:

“People who have cut the cord” — abandoning landlines to rely only on wireless — “are generally more likely to engage in risky behaviors,” Blumberg told us. “They’re more likely to binge drink, more likely to smoke and more likely to go without health insurance.” That’s true even when researchers control for age, sex, race, ethnicity and income.

OK, they should say “adjust for,” not “control for,” but I get the idea.

The article continues:

Until recently, we weren’t sure that data even existed. But it turns out we were looking in the wrong place. Phone usage is tracked in the National Health Interview Survey, of all things, the same source we used in previous columns to measure the use of glasses and hearing aids by our fellow Americans.

Here are just some of the factors that have been published in the social priming and related literatures as having large effects on behavior.

This came up in our piranha paper, and it’s convenient to have these references in one place:

Here are just some of the factors that have been published in the social priming and related literatures as having large and predictable effects on attitudes and behavior: hormones (Petersen et al., 2013; Durante et al., 2013), subliminal images (Bartels, 2014; Gelman, 2015b), the outcomes of recent football games (Healy et al., 2010; Graham et al., 2022; Fowler and Montagnes, 2015, 2022), irrelevant news events such as shark attacks (Achen and Bartels, 2002; Fowler and Hall, 2018), a chance encounter with a stranger (Sands, 2017; Gelman, 2018b), parental socioeconomic status (Petersen et al., 2013), weather (Beall and Tracy, 2014; Gelman, 2018a), the last digit of one’s age (Alter and Hershfield, 2014; Kühne et al., 2015), the sex of a hurricane name (Jung et al., 2014; Freese, 2014), the sexes of siblings (Blanchard and Bogaert, 1996; Bogaert, 2006; Gelman and Stern, 2006), the position in which a person is sitting (Carney et al., 2010; Cesario and Johnson, 2018), and many others.

These individual studies have lots of problems (see references below to criticisms); beyond that, the piranha principle implies that it would be very difficult for many of these large and consistent effects to coexist in the wild.

References to the claims:

Kristina M. Durante, Ashley Rae, and Vladas Griskevicius. The fluctuating female vote: Politics, religion, and the ovulatory cycle. Psychological Science, 24:1007–1016, 2013.

Larry Bartels. Here’s how a cartoon smiley face punched a big hole in democratic theory. Washington Post, https://www.washingtonpost.com/news/monkey-cage/wp/2014/09/04/heres-how-a-cartoon-smiley-face-punched-a-big-hole-in-democratic-theory/, 2014.

A. J. Healy, N. Malhotra, and C. H. Mo. Irrelevant events affect voters’ evaluations of government performance. Proceedings of the National Academy of Sciences, 107:12804–12809, 2010.

Matthew H. Graham, Gregory A. Huber, Neil Malhotra, and Cecilia Hyunjung Mo. Irrelevant events and voting behavior: Replications using principles from open science. Journal of Politics, 2022.

C. H. Achen and L. M. Bartels. Blind retrospection: Electoral responses to drought, flu, and shark attacks. Presented at the Annual Meeting of the American Political Science Association, 2002.

Anthony Fowler and Andrew B. Hall. Do shark attacks influence presidential elections? Reassessing a prominent finding on voter competence. Journal of Politics, 80:1423–1437, 2018.

Melissa L. Sands. Exposure to inequality affects support for redistribution. Proceedings of the National Academy of Sciences, 114:663–668, 2017.

Michael Bang Petersen, Daniel Sznycer, Aaron Sell, Leda Cosmides, and John Tooby. The ancestral logic of politics: Upper-body strength regulates men’s assertion of self-interest over economic redistribution. Psychological Science, 24:1098–1103, 2013.

Alec T. Beall and Jessica L. Tracy. The impact of weather on women’s tendency to wear red or pink when at high risk for conception. PLoS One, 9:e88852, 2014.

A. L. Alter and H. E. Hershfield. People search for meaning when they approach a new decade in chronological age. Proceedings of the National Academy of Sciences, 111:17066–17070, 2014.

Kiju Jung, Sharon Shavitt, Madhu Viswanathan, and Joseph M. Hilbe. Female hurricanes are deadlier than male hurricanes. Proceedings of the National Academy of Sciences, 111:8782–8787, 2014.

R. Blanchard and A. F. Bogaert. Homosexuality in men and number of older brothers. American Journal of Psychiatry, 153:27–31, 1996.

A. F. Bogaert. Biological versus nonbiological older brothers and men’s sexual orientation. Proceedings of the National Academy of Sciences, 103:10771–10774, 2006.

D. R. Carney, A. J. C. Cuddy, and A. J. Yap. Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21:1363–1368, 2010.

References to some criticisms:

Andrew Gelman. The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management, 41:632–643, 2015a.

Andrew Gelman. Disagreements about the strength of evidence. Chance, 28:55–59, 2015b.

Anthony Fowler and B. Pablo Montagnes. College football, elections, and false-positive results in observational research. Proceedings of the National Academy of Sciences, 112:13800–13804, 2015.

Anthony Fowler and B. Pablo Montagnes. Distinguishing between false positives and genuine results: The case of irrelevant events and elections. Journal of Politics, 2022.

Andrew Gelman. Some experiments are just too noisy to tell us much of anything at all: Political science edition. Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2018/05/29/exposure-forking-paths-affects-support-publication/, 2018b.

Andrew Gelman. Another one of those “Psychological Science” papers (this time on biceps size and political attitudes among college students). Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2013/05/29/another-one-of-those-psychological-science-papers/, 2013.

Andrew Gelman. When you believe in things that you don’t understand. Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2014/04/15/believe-things-dont-understand/, 2018a.

Simon Kühne, Thorsten Schneider, and David Richter. Big changes before big birthdays? Panel data provide no evidence of end-of-decade crises. Proceedings of the National Academy of Sciences, 112:E1170, 2015.

Jeremy Freese. The hurricane name people strike back! Scatterplot, https://scatter.wordpress.com/2014/06/16/the-hurricane-name-people-strike-back/, 2014.

Andrew Gelman and Hal Stern. The difference between “significant” and “not significant” is not itself statistically significant. American Statistician, 60:328–331, 2006.

J. Cesario and D. J. Johnson. Power poseur: Bodily expansiveness does not matter in dyadic interactions. Social Psychological and Personality Science, 9:781–789, 2018.

Lots more out there:

The above is not intended to be an exhaustive or representative list or even a full list of examples we’ve covered here on the blog! There’s the “lucky golf ball” study, the case of the missing shredder, pizzagate, . . . we could go on forever. The past twenty years have featured many published and publicized claims about essentially irrelevant stimuli having large and predictable effects, along with quite a bit of criticism and refutation of these claims. The above is only a very partial list, just a paragraph giving a small sense of the wide variety of stimuli that are supposed to have been demonstrated to have large and consistent effects, and it’s relevant to our general point that it’s not possible for all these effects to coexist in the world. Again, take a look at the piranha paper for further discussion of this point.

Here are the most important parts of statistics:

Statistics is associated with random numbers: normal distributions, probability distributions more generally, random sampling, randomized experimentation.

But I don’t think these are the most important parts of statistics.

I thought about this when rereading this post that I wrote a while ago but happened to appear yesterday. Here’s the relevant bit:

In his self-experimentation, Seth lived the contradiction between the two tenets of evidence-based medicine: (1) Try everything, measure everything, record everything; and (2) Make general recommendations based on statistical evidence rather than anecdotes.

Seth’s ideas were extremely evidence-based in that they were based on data that he gathered himself or that people personally sent in to him, and he did use the statistical evidence of his self-measurements, but he did not put in much effort to reduce, control, or adjust for biases in his measurements, nor did he systematically gather data on multiple people.

I think that these are the most important parts of statistics:
(a) to reduce, control, or adjust for biases and variation in measurement, and
(b) to systematically gather data on multiple cases.
This all should be obvious, but I don’t think it comes out clearly in textbooks, including my own. We get distracted by the shiny mathematical objects.

And, yes, random sampling and randomized experimentation are important, as is statistical inference in all its mathematical glory—our BDA book is full of math—but you want those sweet, sweet measurements as your starting point.

Zipf’s law and Heaps’s law. Also, when is big as bad as infinity? And why unit roots aren’t all that.

John Cook has a fun and thoughtful post on Zipf’s law, which “says that the frequency of the nth word in a language is proportional to n^(−s),” linking to an earlier post of his on Heaps’s law, which “says that the number of unique words in a text of n words is approximated by Kn^β, where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 and 0.6.” Unsurprisingly, you can derive one of these laws from the other; see links on the aforementioned wikipedia page.

In his post on Zipf, Cook discusses the idea that setting large numbers to infinity can work in some settings but not in others. In some way this should already be clear to you—for example, if a = 3.4 + 1/x and x is very large, then if you’re interested in a, for most purposes you can just say a = 3.4, but if you care about x, you can’t just call it infinity. If you can use infinity, that simplifies your life. As Cook puts it, “Infinite is easier than big.” Another way of saying this is that, if you can use infinity, you can use some number like 10^8, thus avoiding literal infinities but getting many of the benefits in simplicity and computation.

Cook continues:

Whether it is possible to model N [the number of words in the language] as infinite depends on s [the exponent in the Zipf formula]. The value of s that models word frequency in many languages is close to 1. . . . When s = 1, we don’t have a probability distribution because the sum of 1/n from 1 to ∞ diverges. And so no, we cannot assume N = ∞. Now you may object that all we have to do is set s to be slightly larger than 1. If s = 1.0000001, then the sum of n^(−s) converges. Problem solved. But not really.

When s = 1 the series diverges, but when s is slightly larger than 1 the sum is very large. Practically speaking this assigns too much probability to rare words.

I like how he translates the model error into a real-world issue.
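To put a number on “very large,” here’s a quick check with scipy (a sketch; s = 1.0000001 is Cook’s value, and the one-million-word cutoff is just an arbitrary choice for illustration):

```python
import numpy as np
from scipy.special import zeta

s = 1.0000001
Z = zeta(s)  # normalizing constant, the sum over n of n^(-s); roughly 1e7 for s this close to 1

# Share of probability mass that Zipf with this exponent puts on the first million ranks
n = np.arange(1, 1_000_001)
head_mass = np.sum(n ** -s) / Z

print(f"normalizing constant: {Z:.3g}")                   # ~1e7
print(f"mass on the first 10^6 words: {head_mass:.1e}")   # ~1e-06: nearly all the mass sits on rarer words
```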

This all reminds me of a confusion that sometimes arises in statistical inference. As Cook says, if you have problems with infinity, you’ll often also have problems with large finite numbers. For example, it’s not good to have an estimate that has an infinite variance, but if it has a very large variance, you’ll still have instability. Convergence conditions aren’t just about yes or no, they’re also about how close you are. Similarly with all that crap in time series about unit roots. The right question is not, Is there a unit root? It’s, What are you trying to model?

In all seriousness, what’s USC gonna do about its plagiarizing professor?

His plagiarism unambiguously violates the USC Integrity and Accountability Code:

Our Integrity and Accountability Code is anchored to USC’s Unifying Values and aligns our everyday decisions with the institution’s mission and compliance obligations. The Code is a vital resource for all faculty and staff throughout USC, including those at Keck Medicine at USC and the Keck School of Medicine.

The university policy continues:

To protect our reputation and promote our mission, every Trojan must do their part and act with integrity in our learning, teaching and research activities. This includes . . .

Never tolerating acts of plagiarism, falsification or fabrication of data, or other forms of academic and research misconduct.

I guess his defense would be that he didn’t “tolerate” these acts of plagiarism, because he didn’t know that they happened. But that would imply that he did not read his own book, which violates another part of that policy:

Making sure that all documentation and published findings are accurate, complete and unbiased.

Also it implies he was not telling the truth when he said the following in his book: “I went out and spoke to the amazing scientists around the world who do these kinds of experiments, and what I uncovered was astonishing.” Unless of course he never said this and his ghostwriter made it up, in which case he didn’t read that part either.

At some point you have to take responsibility for what is written under your name, right? I understand that in collaborative work it’s possible for coauthors to include errors or even fabrications without the other authors knowing, but he was the sole author of this book.

As the official USC document says:

Plagiarism is the appropriation of another person’s ideas, processes, results, or words without giving appropriate credit.

Here’s what the USC medical school professor said:

I am grateful that my collaborator has confirmed that I did not contribute to, nor was I aware of, any of the plagiarized or non-attributed passages in my books . . . I followed standard protocols and my attorney and I received several verbal [sic] and written assurances from this highly respected individual that she had run the book through multiple software checks to ensure proper attributions.

Ummmm . . . what about the “long sections of a chapter on the cardiac health of giraffes [that] appear to have been lifted from a 2016 blog post on the website of a South African safari company titled, ‘The Ten Craziest Facts You Should Know About A Giraffe'”? That didn’t look odd at all??

My “I have run the book through multiple software checks” T-shirt has people asking a lot of questions already answered by my shirt.

Also weird that the ghostwriter gave an assurance that she had run “multiple software checks.” This sounds like the author of record and his attorney (!) already had their suspicions. Who goes around asking for “several verbal and written assurances”? I get it: the author of record didn’t just pay for the words in the book; he also paid for an assurance that any plagiarism wouldn’t get caught.

I’m completely serious about this question:

What if a student at the USC medical school (oh, sorry, the “Keck School of Medicine”) were to hand in a plagiarized final paper? Would that student be kicked out of the program? What if the student said that he didn’t know about the plagiarism because he’d hired a ghostwriter to write the paper? And the ghostwriter supplied several verbal and written assurances that she had run the book through multiple software checks. Then would it be ok?

I have no personal interest in this one; I’m not going to file a formal complaint with USC or whatever. I just think it’s funny that USC doesn’t seem to care. What ever happened to “To protect our reputation and promote our mission . . .”? “Every Trojan,” indeed. To paraphrase Leona Helmsley, only the little people have to follow the rules, huh?

Why this matters

Junk science pollutes our discourse, Greshamly overwhelming the real stuff out there. Confident bullshitters suck up attention, along with TV, radio (NPR, of course), and space on airport bookshelves across the nation. When this regurgitated crap gets endorsements by Jane Goodall, Al Gore, Murray Gell-Mann, and other celebrities, it crowds out whatever is really being learned about the world.

There’s room for pop science writing and general health advice, for sure. This giraffe crap ain’t it.

On the other hand

Let’s get real. All this is much better than professors who engage in actual dangerous behavior such as conspiracy theorizing, election denial, medical hype, etc. I guess what bothers me about this USC case is the smugness of it all. The professors who push baseless conspiracy theories or dubious cures typically have an air of desperation to them. OK, not all of them. But often. Even Wansink at the height of his fame had a sort of overkinetic nervous aspect. And presumably they believe their own hype. Even those business school professors who made up data think they’re doing it in support of some true theory, right? But USC dude had to have known he was contracting out his reputation, just so he could get one more burst of fame with “The Ten Craziest Facts You Should Know About A Giraffe” or whatever. In any case, yeah, Alex Jones is a zillion times worse so let’s keep our perspective here.

Also, the USC doc is a strong supporter of vaccines, so he’s using his reputation for good rather than trying to use political polarization to score political points. I guess he can forget about moving to Stanford.

Of course it’s preregistered. Just give me a sec

This is Jessica. I was going to post something on Bak-Coleman and Devezer’s response to the Protzko et al. paper on the replicability of research that uses rigor-enhancing practices like large samples, preregistration, confirmatory tests, and methodological transparency, but Andrew beat me to it. But since his post didn’t get into one of the surprising aspects of their analysis (beyond the paper making a causal claim without a study design capable of assessing causality), I’ll blog on it anyway.

Bak-Coleman and Devezer describe three ways in which the measure of replicability that Protzko et al. use to argue that the 16 effects they study are more replicable than effects in prior studies deviates from prior definitions of replicability:

  1. Protzko et al. define replicability as the chance that any replication achieves significance in the hypothesized direction as opposed to whether the results of the confirmation study and the replication were consistent 
  2. They include self-replications in calculating the rate
  3. They include repeated replications of the same effect and replications across different effects in calculating the rate

Could these deviations in how replicability is defined have been decided post-hoc, so that the authors could present positive evidence for their hypothesis that rigor-enhancing practices work? If they preregistered their definition of replicability, we would not be so concerned about this possibility.  Luckily, the authors report that “All confirmatory tests, replications and analyses were preregistered both in the individual studies (Supplementary Information section 3 and Supplementary Table 2) and for this meta-project (https://osf.io/6t9vm).”

But wait – according to Bak-Coleman and Devezer:

the analysis on which the titular claim depends was not preregistered. There is no mention of examining the relationship between replicability and rigor-improving methods, nor even how replicability would be operationalized despite extensive descriptions of the calculations of other quantities. With nothing indicating this comparison or metric it rests on were planned a priori, it is hard to distinguish the core claim in this paper from selective reporting and hypothesizing after the results are known. 

Uh-oh, that’s not good. At this point, some OSF sleuthing was needed. I poked around the link above, and the associated project containing analysis code. There are a couple analysis plans: Proposed Overarching Analyses for Decline Effect final.docx, from 2018, and Decline Effect Exploratory analyses and secondary data projects P4.docx, from 2019. However, these do not appear to describe the primary analysis of replicability in the paper (the first describes an analysis that ends up in the Appendix, and the second a bunch of exploratory analyses that don’t appear in the paper). About a year later, the analysis notebooks with the results they present in the main body of the paper were added. 

According to Bak-Coleman on X/Twitter: 

We emailed the authors a week ago. They’ve been responsive but as of now, they can’t say one way or another if the analyses correspond to a preregistration. They think they may be in some documentation.

In the best-case scenario, where the missing preregistration is soon found, this example suggests that there are still many readers and reviewers for whom some signal of rigor suffices even when the evidence of it is lacking. In this case, maybe the reputation of authors like Nosek reduced the perceived need on the part of the reviewers to track down the actual preregistration. But of course, even those who invented rigor-enhancing practices can still make mistakes!

In the alternative scenario where the preregistration is not found soon, what is the correct course of action? Surely at least a correction is in order? Otherwise we might all feel compelled to try our luck at signaling preregistration without having to inconvenience ourselves by actually doing it.

More optimistically, perhaps there are exciting new research directions that could come out of this. Like, wearable preregistration, since we know from centuries of research and practice that it’s harder to lose something when it’s sewn to your person. Or, we could submit our preregistrations to OpenAI, I mean Microsoft, who could make a ChatGPT-enabled Preregistration Buddy that is not only trained on your preregistration but also knows how to please a human judge who wants to ask questions about what it said.

More on possibly rigor-enhancing practices in quantitative psychology research

In a paper entitled “Causal claims about scientific rigor require rigorous causal evidence,” Joseph Bak-Coleman and Berna Devezer write:

Protzko et al. (2023) claim that “High replicability of newly discovered social-behavioral findings is achievable.” They argue that the 86% rate of replication observed in their replication studies is due to “rigor-enhancing practices” such as confirmatory tests, large sample sizes, preregistration and methodological transparency. These findings promise hope as concerns over low rates of replication have plagued the social sciences for more than a decade. Unfortunately, the observational design of the study does not support its key causal claim. Instead, inference relies on a post hoc comparison of a tenuous metric of replicability to past research that relied on incommensurable metrics and sampling frames.

The article they’re referring to is by a team of psychologists (John Protzko, Jon Krosnick, et al.) reporting “an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using rigor-enhancing practices: confirmatory tests, large sample sizes, preregistration, and methodological transparency. . . .”

When I heard about that paper, I teed off on their proposed list of rigor-enhancing practices.

I’ve got no problem with large sample sizes, preregistration, and methodological transparency. And confirmatory tests can be fine too, as long as they’re not misinterpreted and not used for decision making.

My biggest concern is that the authors or readers of that article will think that these are the best rigor-enhancing practices in science (or social science, or psychology, or social psychology, etc.), or the first rigor-enhancing practices that researchers should reach for, or the most important rigor-enhancing practices, or anything like that.

Instead, I gave my top 5 rigor-enhancing practices, in approximately decreasing order of importance:

1. Make it clear what you’re actually doing. Describe manipulations, exposures, and measurements fully and clearly.

2. Increase your effect size, e.g., do a more effective treatment.

3. Focus your study on the people and scenarios where effects are likely to be largest.

4. Improve your outcome measurement.

5. Improve pre-treatment measurements.

The suggestions of “confirmatory tests, large sample sizes, preregistration, and methodological transparency” are all fine, but I think all are less important than the 5 steps listed above. You can read the linked post to see my reasoning; also there’s Pam Davis-Kean’s summary, “Know what the hell you are doing with your research.” You might say that goes without saying, but it doesn’t, even in some papers published in top journals such as Psychological Science and PNAS!
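As a rough illustration of why items 2 and 4 on that list can matter so much, here is a back-of-the-envelope power calculation. This is my own sketch with arbitrary numbers, not anything from the Protzko et al. paper or my original post: for a two-arm comparison the standard error scales like sigma*sqrt(2/n), so doubling the effect or halving the outcome noise buys as much as quadrupling the sample size.

```python
# Back-of-the-envelope power calculation (my illustration, arbitrary numbers):
# doubling the effect or halving outcome noise beats doubling n.
import math
from scipy import stats

def power_two_arm(effect, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means."""
    se = sigma * math.sqrt(2.0 / n_per_arm)
    z = stats.norm.ppf(1 - alpha / 2)
    lam = effect / se
    return (1 - stats.norm.cdf(z - lam)) + stats.norm.cdf(-z - lam)

print("baseline (effect 0.2, sd 1, n 100/arm):", round(power_two_arm(0.2, 1.0, 100), 2))
print("double the sample size:               ", round(power_two_arm(0.2, 1.0, 200), 2))
print("double the effect size:               ", round(power_two_arm(0.4, 1.0, 100), 2))
print("halve the outcome noise:              ", round(power_two_arm(0.2, 0.5, 100), 2))
```

Under these made-up numbers, doubling the sample size helps far less than a more effective treatment or a less noisy outcome measure does, which is the point of putting those items ahead of raw sample size.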

You can also read a response to my post from Brian Nosek, a leader in the replication movement and one of the coauthors of the article being discussed.

In their new article, Bak-Coleman and Devezer take a different tack from mine, in that they’re focused on the challenges of measuring the replicability of empirical claims in psychology, whereas I was more interested in the design of future studies. I find the whole replicability thing important mostly to the extent that it gives researchers and users of research less trust in generic statistics-backed claims; I’d guess that actual effects typically vary so much based on context that new general findings are mostly not to be trusted. So I’d say that Protzko et al., Nosek, Bak-Coleman and Devezer, and I are coming from four different directions. (Yes, I recognize that Nosek is one of the authors of the Protzko et al. paper; still, in his blog comment he seemed to have a slightly different perspective.) The article by Bak-Coleman and Devezer seems very relevant to any attempt to understand the empirical claims of Protzko et al.

The rise and fall of Seth Roberts and the Shangri-La diet

Here’s a post that’s suitable for the Thanksgiving season.

I no longer believe in the Shangri-La diet. Here’s the story.

Background

I met Seth Roberts back in the early 1990s when we were both professors at the University of California. He sometimes came to the statistics department seminar and we got to talking about various things; in particular we shared an interest in statistical graphics. Much of my work in this direction eventually went toward the use of graphical displays to understand fitted models. Seth went in another direction and got interested in the role of exploratory data analysis in science, the idea that we could use graphs not just to test or even understand a model but also as the source of new hypotheses. We continued to discuss these issues over the years.

At some point when we were at Berkeley the administration was encouraging the faculty to teach freshman seminars, and I had the idea of teaching a course on left-handedness. I’d just read the book by Stanley Coren and thought it would be fun to go through it with a class, chapter by chapter. But my knowledge of psychology was minimal so I contacted the one person I knew in the psychology department and asked him if he had any suggestions of someone who’d like to teach the course with me. Seth responded that he’d be interested in doing it himself, and we did it.

Seth was an unusual guy—not always in a good way, but some of his positive traits were friendliness, inquisitiveness, and an openness to consider new ideas. He also struggled with mood swings, social awkwardness, and difficulties with sleep, and he attempted to address these problems with self-experimentation.

After we taught the class together we got together regularly for lunch and Seth told me about his efforts in self-experimentation involving sleeping hours and mood. Most interesting to me was his discovery that seeing life-sized faces in the morning helped with his mood. I can’t remember how he came up with this idea, but perhaps he started by following the recommendation that is often given to people with insomnia to turn off TV and other sources of artificial light in the evening. Seth got in the habit of taping late-night talk-show monologues and then watching them in the morning while he ate breakfast. He found himself happier, did some experimentation, and concluded that we had evolved to talk with people in the morning, and that life-sized faces were necessary. Seth lived alone, so the more natural approach of talking over breakfast with a partner was not available.

Seth’s self-experimentation went slowly, with lots of dead-ends and restarts, which makes sense given the difficulty of his projects. I was always impressed by Seth’s dedication in this, putting in the effort day after day for years. Or maybe it did not represent a huge amount of labor for him; perhaps it was something like a diary or blog, which is pleasurable to create even if it seems from the outside to be a lot of work. In any case, from my perspective, the sustained focus was impressive. He had worked for years to solve his sleep problems and only then turned to the experiments on mood.

Seth’s academic career was unusual. He shot through college and graduate school to a tenure-track job at a top university, then continued to do publication-quality research for several years until receiving tenure. At that point he was not a superstar but I think he was still considered a respected member of the mainstream academic community. But during the years that followed, Seth lost interest in that thread of research. He told me once that his shift was motivated by teaching introductory undergraduate psychology: the students, he said, were interested in things that would affect their lives, and, compared to that, the kind of research that leads to a productive academic career did not seem so appealing.

I suppose that Seth could’ve tried to do research in clinical psychology (Berkeley’s department actually has a strong clinical program) but instead he moved in a different direction and tried different things to improve his sleep and then, later, his skin, his mood, and his diet. In this work, Seth applied what he later called his “insider/outsider perspective”: he was an insider in that he applied what he’d learned from years of research on animal behavior, an outsider in that he was not working within the existing paradigm of research in physiology and nutrition.

At the same time he was working on a book project, which I believe started as a new introductory psychology course focused on science and self-improvement but ultimately morphed into a trade book on ways in which our adaptations to Stone Age life were not serving us well in the modern era. I liked the book but I don’t think he found a publisher. In the years since, this general concept has been widely advanced and many books have been published on the topic.

When Seth came up with the connection between morning faces and depression, this seemed potentially hugely important. Were the faces really doing anything? I have no idea. On one hand, Seth was measuring his own happiness and doing his own treatments on his own hypothesis, so the potential for expectation effects is huge. On the other hand, he said the effect he discovered was a surprise to him, and he also reported that the treatment worked with others. Neither he nor, as far as I know, anyone else has attempted a controlled trial of this idea.

In his self-experimentation, Seth lived the contradiction between the two tenets of evidence-based medicine: (1) Try everything, measure everything, record everything; and (2) Make general recommendations based on statistical evidence rather than anecdotes.

Seth’s ideas were extremely evidence-based in that they were based on data that he gathered himself or that people personally sent in to him, and he did use the statistical evidence of his self-measurements, but he did not put in much effort to reduce, control, or adjust for biases in his measurements, nor did he systematically gather data on multiple people.

The Shangri-La diet

Seth’s next success after curing his depression was losing 40 pounds on an unusual diet that he came up with, in which you can eat whatever you want as long as each day you drink a cup of unflavored sugar water, at least an hour before or after a meal. The way he theorized that his diet worked was that the carefully-timed sugar water had the effect of reducing the association between calories and flavor, thus lowering your weight set-point and making you uninterested in eating lots of food.

I asked Seth once if he thought I’d lose weight if I were to try his diet in a passive way, drinking the sugar water at the recommended time but not actively trying to reduce my caloric intake. He said he supposed not, that the diet would make it easier to lose weight but I’d probably still have to consciously eat less.

I described Seth’s diet to one of my psychologist colleagues at Columbia and asked what he thought of it. My colleague said he thought it was ridiculous. And, as with the depression treatment, Seth never had an interest in running a controlled trial, even for the purpose of convincing the skeptics.

I had a conversation with Seth about this. He said he’d tried lots of diets and none had worked for him. I suggested that maybe he was just ready at last to eat less and lose weight, and he said he’d been ready for awhile but this was the first diet that allowed him to eat less without difficulty. I suggested that maybe the theory underlying Seth’s diet was compelling enough to act as a sort of placebo, motivating him to follow the protocol. Seth responded that other people had tried his diet and lost weight with it. He also reminded me that it’s generally accepted that “diets don’t work” and that people who lose weight while dieting will usually gain it all back. He felt that his diet was different in that it didn’t tell you what foods to eat or how much; rather, it changed your set point so that you didn’t want to eat so much. I found Seth’s arguments persuasive. I didn’t feel that his diet had been proved effective, but I thought it might really work, I told people about it, and I was happy about its success. Unlike my Columbia colleague, I didn’t think the idea was ridiculous.

Media exposure and success

Seth’s breakout success happened gradually, starting with a 2005 article on self-experimentation in Behavioral and Brain Sciences, a journal that publishes long articles followed by short discussions from many experts. Some of his findings from the ten experiments discussed in the article:

Seeing faces in the morning on television decreased mood in the evening and improved mood the next day . . . Standing 8 hours per day reduced early awakening and made sleep more restorative . . . Drinking unflavored fructose water caused a large weight loss that has lasted more than 1 year . . .

As Seth described it, self-experimentation generates new hypotheses and is also an inexpensive way to test and modify them. The article does not seem to have had a huge effect within research psychology (Google Scholar gives it 93 cites) but two of its contributions—the idea of systematic self-experimentation and the weight-loss method—have spread throughout the popular culture in various ways. Seth’s work was featured in a series of increasingly prominent blogs, which led to a newspaper article by the authors of Freakonomics and ultimately a successful diet book (not enough to make Seth rich, I think, but Seth had simple tastes and no desire to be rich, as far as I know). Meanwhile, Seth started a blog of his own which led to a message board for his diet that he told me had thousands of participants.

Seth achieved some measure of internet fame, with fans including Nassim Taleb, Steven Levitt, Dennis Prager, Tucker Max, Tyler Cowen, . . . and me! In retrospect, I don’t think having all this appreciation was good for him. On his blog and elsewhere Seth reported success with various self-experiments, the last of which was a claim of improved brain function after eating half a stick of butter a day. Even while maintaining interest in Seth’s ideas on mood and diet, I was entirely skeptical of his new claims, partly because of his increasing rate of claimed successes. It took Seth close to 10 years of sustained experimentation to fix his sleep problems, but in later years it seemed that all sorts of different things he tried were effective. His apparent success rate was implausibly high. What was going on? One problem is that sleep hours and weight can be measured fairly objectively, whereas if you measure brain function by giving yourself little quizzes, it doesn’t seem hard at all for a bit of unconscious bias to drive all your results. I also wonder if Seth’s blog audience was a problem: if you have people cheering on your every move, it can be that much easier to fool yourself.

Seth also started to go down some internet rabbit holes. On one hand, he was a left-wing Berkeley professor who supported universal health care, Amnesty International, and other liberal causes. On the other hand, his paleo-diet enthusiasm brought him close to various internet right-wingers, and he was into global warming denial and kinda sympathetic to Holocaust denial, not because he was a Nazi or anything but just because he had a distrust-of-authority thing going on. I guess that if he’d been an adult back in the 1950s and 1960s he would’ve been on the extreme left, but more recently it’s been the far right where the rebels are hanging out. Seth also had sympathy for some absolutely ridiculous and innumerate research on sex ratios and absolutely loved the since-discredited work of food behavior researcher Brian Wansink; see here and here. The point here is not that Seth believed things that turned out to be false—that happens to all of us—but rather that he had a soft spot for extreme claims that were wrapped in the language of science.

Back to Shangri-La

A few years ago, Seth passed away, and I didn’t think of him too often, but then a couple years ago my doctor told me that my cholesterol level was too high. He prescribed a pill, which I’m still taking every day, and he told me to switch to a mostly-plant diet and lose a bunch of weight.

My first thought was to try the Shangri-La diet. That cup of unflavored sugar water, at least an hour before or after a meal. Or maybe I did the spoonful of unflavored olive oil, I can’t remember which. Anyway, I tried it for a few days, also following the advice to eat less. And then after a few days, I thought: if the point is to eat less, why not just do that? So that’s what I did. No sugar water or olive oil needed.

What’s the point of this story? Not that losing the weight was easy for me. For a few years before that fateful conversation, my doctor had been bugging me to lose weight, and I’d vaguely wanted that to happen, but it hadn’t. What worked was me having this clear goal and motivation. And it’s not like I’m starving all the time. I’m fine; I just changed my eating patterns, and I take in a lot less energy every day.

But here’s a funny thing. Suppose I’d stuck with the sugar water and everything else had been the same. Then I’d have lost all this weight, exactly when I’d switched to the new diet. I’d be another enthusiastic Shangri-La believer, and I’d be telling you, truthfully, that only since switching to that diet had I been able to comfortably eat less. But I didn’t stick with Shangri-La and I lost the weight anyway, so I won’t make that attribution.

OK, so after that experience I had a lot less belief in Seth’s diet. The flip side of being convinced by his earlier self-experiment was becoming unconvinced after my own self-experiment.

And that’s where I stood until I saw this post at the blog Slime Mold Time Mold about informal experimentation:

For the potato diet, we started with case studies like Andrew Taylor and Penn Jilette; we recruited some friends to try nothing but potatoes for several days; and one of the SMTM authors tried the all-potato diet for a couple weeks.

For the potassium trial, two SMTM hive mind members tried the low-dose potassium protocol for a couple of weeks and lost weight without any negative side effects. Then we got a couple of friends to try it for just a couple of days to make sure that there weren’t any side effects for them either.

For the half-tato diet, we didn’t explicitly organize things this way, but we looked at three very similar case studies that, taken together, are essentially an N = 3 pilot of the half-tato diet protocol. No idea if the half-tato effect will generalize beyond Nicky Case and M, but the fact that it generalizes between them is pretty interesting. We also happened to know about a couple of other friends who had also tried versions of the half-tato diet with good results.

My point here is not to delve into the details of these new diets, but rather to point out that they are like the Shangri-La diet in being different from other diets, associated with some theory, evaluated through before-after studies on some people who wanted to lose weight, and reported as successes.

At this point, though, my conclusion is not that unflavored sugar water is effective in making it easy to lose weight, or that unflavored oil works, or that potatoes work, or that potassium works. Rather, the hypothesis that’s most plausible to me is that, if you’re at the right stage of motivation, anything can work.

Or, to put it another way, I now believe that the observed effect of the Shangri-La diet, the potato diet, etc., comes from a mixture of placebo and selection. The placebo is that just about any gimmick can help you lose weight, and keep the weight off, if it somehow motivates you to eat less. The selection is that, once you’re ready to try something like this diet, you might be ready to eat less.
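Here’s a tiny simulation of that selection story. It’s entirely made up, not Seth’s data, not Slime Mold Time Mold’s, not anyone’s actual diet study: people lose weight once they become ready to eat less, whether or not they also happen to adopt a gimmick diet at that moment, and a before-after study of the dieters alone “finds” a big effect.

```python
# Toy selection-effect simulation (completely made up): weight loss follows
# readiness to eat less, not the diet, yet dieters' before-after change looks
# like a large "diet effect."
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

ready_month = rng.integers(0, 24, n)                   # when readiness kicks in
monthly_loss = rng.normal(1.5, 0.5, n).clip(0, None)   # lbs lost per ready month
starts_diet = rng.random(n) < 0.5                      # half also start "the diet"

total_loss = monthly_loss * (24 - ready_month)

print("avg loss, dieters:    ", round(float(total_loss[starts_diet].mean()), 1), "lbs")
print("avg loss, non-dieters:", round(float(total_loss[~starts_diet].mean()), 1), "lbs")
# A before-after comparison among the dieters alone would report the first
# number as the "effect of the diet," even though the diet does nothing here.
```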

But what about “diets don’t work”? I guess that diets don’t work for most people at most times. But the people trying these diets are not “most people at most times.” They’re people with a high motivation to eat less and lose weight.

I’m not saying I have an ironclad case here. I’m pretty much now in the position of my Columbia colleague who felt that there’s no good reason to believe that Seth’s diet is more effective than any other arbitrary series of rules that somewhere includes the suggestion to eat less. And, yes, I have the same impression of the potato diet and the other ideas mentioned above. It’s just funny that it took so long for me to reach this position.

Back to Seth

I wouldn’t say the internet killed Seth Roberts, but ultimately I don’t think it did him any favors to become an internet hero, in the same way that it’s not always good for an ungrounded person to become an academic hero, or an athletic hero, or a musical hero, or a literary hero, or a military hero, or any other kind of hero. The stuff that got you to heroism can be a great service to the world, but what comes next can be a challenge.

Seth ended up believing in his own hype. In this case, the hype was not that he was an amazing genius; rather, the hype was about his method, the idea that he had discovered modern self-experimentation (to the extent that this rediscovery can be attributed to anybody, it should be to Seth’s undergraduate adviser, Allen Neuringer, in this article from 1981). Maybe even without his internet fame Seth would’ve gone off the deep end and started to believe he was regularly making major discoveries; I don’t know.

From a scientific standpoint, Seth’s writings are an example of the principle that honesty and transparency are not enough. He clearly described what he did, but his experiments got to be so flawed as to be essentially useless.

After I posted my obituary of Seth (from which I took much of the beginning of this post), there were many moving tributes in the comments, and I concluded by writing, “It is good that he found an online community of people who valued him.” That’s how I felt at the time, but in retrospect, maybe not. If I could’ve done it all over again, I never would’ve promoted his diet, a promotion that led to all the rest.

I’d guess that the wide dissemination of Seth’s ideas was a net benefit to the world. Even if his diet idea is bogus, it seems to have made a difference to a lot of people. And even if the discoveries he reported from his self-experimentation (eating half a stick of butter a day improving brain functioning and all the rest) were nothing but artifacts of his hopeful measurement protocols, the idea of self-experimentation was empowering to people—and I’m assuming that even his true believers (other than himself) weren’t actually doing the butter thing.

Setting aside the effects on others, though, I don’t think that this online community was good for Seth in his own work or for his personal life. In some ways he was ahead of his time, as nowadays we’re hearing a lot about people getting sucked into cult-like vortexes of misinformation.

P.S. Lots of discussion in comments, including this from the Slime Mold Time Mold bloggers.

Dorothy Bishop on the prevalence of scientific fraud

Following up on our discussion of replicability, here are some thoughts from psychology researcher Dorothy Bishop on scientific fraud:

In recent months, I [Bishop] have become convinced of two things: first, fraud is a far more serious problem than most scientists recognise, and second, we cannot continue to leave the task of tackling it to volunteer sleuths.

If you ask a typical scientist about fraud, they will usually tell you it is extremely rare, and that it would be a mistake to damage confidence in science because of the activities of a few unprincipled individuals. . . . we are reassured [that] science is self-correcting . . .

The problem with this argument is that, on the one hand, we only know about the fraudsters who get caught, and on the other hand, science is not prospering particularly well – numerous published papers produce results that fail to replicate and major discoveries are few and far between . . . We are swamped with scientific publications, but it is increasingly hard to distinguish the signal from the noise.

Bishop summarizes:

It is getting to the point where in many fields it is impossible to build a cumulative science, because we lack a solid foundation of trustworthy findings. And it’s getting worse and worse. . . . in clinical areas, there is growing concern that systematic reviews that are supposed to synthesise evidence to get at the truth instead lead to confusion because a high proportion of studies are fraudulent.

Also:

[A] more indirect negative consequence of the explosion in published fraud is that those who have committed fraud can rise to positions of influence and eminence on the back of their misdeeds. They may become editors, with the power to publish further fraudulent papers in return for money, and if promoted to professorships they will train a whole new generation of fraudsters, while being careful to sideline any honest young scientists who want to do things properly. I fear in some institutions this has already happened.

Given all the above, it’s unsurprising that, in Bishop’s words,

To date, the response of the scientific establishment has been wholly inadequate. There is little attempt to proactively check for fraud . . . Even when evidence of misconduct is strong, it can take months or years for a paper to be retracted. . . . this relaxed attitude to the fraud epidemic is a disaster-in-waiting.

What to do? Bishop recommends that some subset of researchers be trained as “data sleuths,” to move beyond the current whistleblower-and-vigilante system into something more like “the equivalent of a police force.”

I don’t know what to think about that. On one hand, I agree that whistleblowers and critics don’t get the support that they deserve; on the other hand, we might be concerned about who would be attracted to the job of official police officer here.

Setting aside concerns about Bishop’s proposed solution, I do see her larger point about the scientific publication process being so broken that it can actively interfere with the development of science. In a situation parallel to Cantor’s diagonal argument or Russell’s theory of types, it would seem that we need a scientific literature, and then, alongside it, a vetted scientific literature, and then, alongside that, another level of vetting, and so on. In medical research this sort of system has existed for decades, with a huge number of journals for the publication of original studies; and then another, smaller but still immense, set of journals that publish nothing but systematic reviews; and then some distillations that make their way into policy and practice.

Clarke’s Law

And don’t forget Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud. All the above problems also arise with the sorts of useless noise mining we’ve been discussing in this space for nearly twenty years now. I assume most of those papers do not involve fraud, and even when there are clearly bad statistical practices such as rooting around for statistical significance, I expect that the perpetrators think of these research violations as merely serving the goal of larger truths.

So it’s not just fraud. Not by a longshot.

Also, remember the quote from Bishop above: “those who have committed fraud can rise to positions of influence and eminence on the back of their misdeeds. They may become editors, with the power to publish further fraudulent papers in return for money, and if promoted to professorships they will train a whole new generation of fraudsters, while being careful to sideline any honest young scientists who want to do things properly. I fear in some institutions this has already happened.” Replace “fraud” by “crappy research” and, yeah, we’ve been there for awhile!

P.S. Mark Tuttle points us to this news article by Richard Van Noorden, “How big is science’s fake-paper problem?”, that makes a similar point.

Brian Nosek on “Here are some ways of making your study replicable”:

Brian Nosek is a leader of the replication movement in science and a coauthor of an article on replicability that we discussed the other day.

They discussed the rigor-enhancing practices of “confirmatory tests, large sample sizes, preregistration, and methodological transparency,” and in my post I wrote that those were not the first things I’d suggest to increase rigor in science. My recommendations were (1) Make it clear what you’re actually doing, (2) Increase your effect size, e.g., do a more effective treatment, (3) Focus your study on the people and scenarios where effects are likely to be largest, (4) Improve your outcome measurement: a more focused and less variable outcome measure, (5) Improve pre-treatment measurements, and finally (6) the methods listed in the above-linked article: “confirmatory tests, large sample sizes, preregistration, and methodological transparency.”

I sent this post to Nosek, and he replied:

For your list of practices:

#1: We did this for both methodological and statistical practices.

#2: I suspect that every lab was motivated to get the largest effect that they could given the research question that they were studying (ours certainly was). But, you’ll observe in the findings that we didn’t get very large effect sizes on average. Instead, they are what I believe are around what most “real” effect sizes are for the messy concepts that social scientists study.

#3: We didn’t do this. Each lab used a sampling firm and all studies were conducted through that firm. It is possible that a lab would have tried to tailor the design to the sample, but these were very heterogeneous samples, so that would not likely have been very effective.

#4: I suspect that every lab did this the best that they could. Simultaneously, most of the research in this is pretty on-the-edge discovery work, so not necessarily a lot of existing evidence to make use of (with variation across experiments and labs).

#5: I suspect that this was done for a couple of experiments from some labs, but not others. (None from mine did so.)

I like all of your suggestions for improving rigor. I would counterargue that some of them become more meaningfully impactful on the research process as the evidence base matures (e.g., where to get the largest effect size, what are effective pretreatment measurements). In the context of discovery research like the experiments in this paper, we could only speculate about these in trying to design the most rigorous studies. The practices that we highlight are “easily” applied no matter the maturity of the domain and evidence base.

On your other points: I think the paper provides proof-of-concept that even small effects are highly replicable. And, I am much more sanguine than you are about the benefits of preregistration. Maybe we can find some time to argue about that in the future!

I disagree with Geoff Hinton regarding “glorified autocomplete”

Computer scientist and “godfather of AI” Geoff Hinton says this about chatbots:

“People say, It’s just glorified autocomplete . . . Now, let’s analyze that. Suppose you want to be really good at predicting the next word. If you want to be really good, you have to understand what’s being said. That’s the only way. So by training something to be really good at predicting the next word, you’re actually forcing it to understand. Yes, it’s ‘autocomplete’—but you didn’t think through what it means to have a really good autocomplete.”

This got me thinking about what I do at work, for example in a research meeting. I spend a lot of time doing “glorified autocomplete” in the style of a well-trained chatbot: Someone describes some problem, I listen and it reminds me of a related issue I’ve thought about before, and I’m acting as a sort of FAQ, but more like a chatbot than a FAQ in that the people who are talking with me do not need to navigate through the FAQ to find the answer that is most relevant to them; I’m doing that myself and giving a response.

I do that sort of thing a lot in meetings, and it can work well, indeed often I think this sort of shallow, associative response can be more effective than whatever I’d get from a direct attack on the problem in question. After all, the people I’m talking with have already thought for awhile about whatever it is they’re working on, and my initial thoughts may well be in the wrong direction, or else my thoughts are in the right direction but are just retracing my collaborators’ past ideas. From the other direction, my shallow thoughts can be useful in representing insights from problems that these collaborators had not ever thought about much before. Nonspecific suggestions on multilevel modeling or statistical graphics or simulation or whatever can really help!

At some point, though, I’ll typically have to bite the bullet and think hard, not necessarily reaching full understanding in the sense of mentally embedding the problem at hand into a coherent schema or logical framework, but still going through whatever steps of logical reasoning that I can. This feels different than autocomplete; it requires an additional level of focus. Often I need to consciously “flip the switch,” as it were, to turn on that focus and think rigorously. Other times, I’m doing autocomplete and either come to a sticking point or encounter an interesting idea, and this causes me to stop and think.

It’s almost like the difference between jogging and running. I can jog and jog and jog, thinking about all sorts of things and not feeling like I’m expending much effort, my legs pretty much moving up and down of their own accord . . . but then if I need to run, that takes concentration.

Here’s another example. Yesterday I participated in the methods colloquium in our political science department. It was Don Green and me and a bunch of students, and the structure was that Don asked me questions, I responded with various statistics-related and social-science-related musings and stories, students followed up with questions, I responded with more stories, etc. Kinda like the way things go here on the blog, but spoken rather than typed. Anyway, the point is that most of my responses were a sort of autocomplete—not in a word-by-word chatbot style, more at a larger level of chunkiness, for example something would remind me of a story, and then I’d just insert the story into my conversation—but still at this shallow, pleasant level. Mellow conversation with no intellectual or social strain. But then, every once in awhile, I’d pull up short and have some new thought, some juxtaposition that had never occurred to me before, and I’d need to think things through.

This also happens when I give prepared talks. My prepared talks are not super-well prepared—this is on purpose, as I find that too much preparation can inhibit flow. In any case, I’ll often find myself stopping and pausing to reconsider something or another. Even when describing something I’ve done before, there are times when I feel the need to think it all through logically, as if for the first time. I noticed something similar when I saw my sister give a talk once: she had the same habit of pausing to work things out from first principles. I don’t see this behavior in every academic talk, though; different people have different styles of presentation.

This seems related to models of associative and logical reasoning in psychology. As a complete non-expert in that area, I’ll turn to wikipedia:

The foundations of dual process theory likely come from William James. He believed that there were two different kinds of thinking: associative and true reasoning. . . . images and thoughts would come to mind of past experiences, providing ideas of comparison or abstractions. He claimed that associative knowledge was only from past experiences describing it as “only reproductive”. James believed that true reasoning could enable overcoming “unprecedented situations” . . .

That sounds about right!

After describing various other theories from the past hundred years or so, Wikipedia continues:

Daniel Kahneman provided further interpretation by differentiating the two styles of processing more, calling them intuition and reasoning in 2003. Intuition (or system 1), similar to associative reasoning, was determined to be fast and automatic, usually with strong emotional bonds included in the reasoning process. Kahneman said that this kind of reasoning was based on formed habits and very difficult to change or manipulate. Reasoning (or system 2) was slower and much more volatile, being subject to conscious judgments and attitudes.

This sounds a bit different from what I was talking about above. When I’m doing “glorified autocomplete” thinking, I’m still thinking—this isn’t automatic and barely conscious behavior along the lines of driving to work along a route I’ve taken a hundred times before—I’m just thinking in a shallow way, trying to “autocomplete” the answer. It’s pattern-matching more than it is logical reasoning.

P.S. Just to be clear, I have a lot of respect for Hinton’s work; indeed, Aki and I included Hinton’s work in our brief review of 10 pathbreaking research articles during the past 50 years of statistics and machine learning. Also, I’m not trying to make a hardcore, AI-can’t-think argument. Although not myself a user of large language models, I respect Bob Carpenter’s respect for them.

I think that where Hinton got things wrong in the quote that led off this post was not in his characterization of chatbots, but rather in his assumptions about human thinking, in not distinguishing autocomplete-like associative reasoning from logical thinking. Maybe Hinton’s problem in understanding this is that he’s just too logical! At work, I do a lot of what seems like autocomplete—and, as I wrote above, I think it’s useful—but if I had more discipline, maybe I’d think more logically and carefully all the time. It could well be that Hinton has that habit or inclination to always be in focus. If Hinton does not have consistent personal experience of shallow, autocomplete-like thinking, he might not recognize it as something different, in which case he could be giving the chatbot credit for something it’s not doing.

Come to think of it, one thing that impresses me about Bob is that, when he’s working, he seems to always be in focus. I’ll be in a meeting, just coasting along, and Bob will interrupt someone to ask for clarification, and I suddenly realize that Bob absolutely demands understanding. He seems to have no interest in participating in a research meeting in a shallow way. I guess we just have different styles. It’s my impression that the vast majority of researchers are like me, just coasting on the surface most of the time (for some people, all of the time!), while Bob, and maybe Geoff Hinton, is one of the exceptions.

P.P.S. Sometimes we really want to be doing shallow, auto-complete-style thinking. For example, if we’re writing a play and want to simulate how some characters might interact. Or just as a way of casting the intellectual net more widely. When I’m in a research meeting and I free-associate, it might not help immediately solve the problem at hand, but it can bring in connections that will be helpful later. So I’m not knocking auto-complete; I’m just disagreeing with Hinton’s statement that “by training something to be really good at predicting the next word, you’re actually forcing it to understand.” As a person who does a lot of useful associative reasoning and also a bit of logical understanding, I think they’re different, both in how they feel and also in what they do.

P.P.P.S. Lots more discussion in comments; you might want to start here.

P.P.P.P.S. One more thing . . . actually, it might deserve its own post, but for now I’ll put it here: So far, it might seem like I’m denigrating associative thinking, or “acting like a chatbot,” or whatever it might be called. Indeed, I admire Bob Carpenter for doing very little of this at work! The general idea is that acting like a chatbot can be useful—I really can help lots of people solve their problems in that way, also every day I can write these blog posts that entertain and inform tens of thousands of people—but it’s not quite the same as focused thinking.

That’s all true (or, I should say, that’s my strong impression), but there’s more to it than that. As discussed in my comment linked to just above, “acting like a chatbot” is not “autocomplete” at all, indeed in some ways it’s kind of the opposite. Locally it’s kind of like autocomplete in that the sentences flow smoothly; I’m not suddenly jumping to completely unrelated topics—but when I do this associative or chatbot-like writing or talking, it can lead to all sorts of interesting places. I shuffle the deck and new hands come up. That’s one of the joys of “acting like a chatbot” and one reason I’ve been doing it for decades, long before chatbots ever existed! Walk along forking paths, and who knows where you’ll turn up! And all of you blog commenters (ok, most of you) play helpful roles in moving these discussions along.

Hey, check this out! Here’s how to read and then rewrite the title and abstract of a paper.

In our statistical communication class today, we were talking about writing. At some point a student asked why it was that journal articles are all written in the same way. I said, No, actually there are many different ways to write a scientific journal article. Superficially these articles all look the same: title, abstract, introduction, methods, results, discussion, or some version of that, but if you look in detail you’ll see that you have lots of flexibility in how to do this (with the exception of papers in medical journals such as JAMA which indeed have a pretty rigid format).

The next step was to demonstrate the point by going to a recent scientific article. I asked the students to pick a journal. Someone suggested NBER. So I googled NBER and went to its home page.

I then clicked on the most recent research paper, which was listed on the main page as “Employer Violations of Minimum Wage Laws.” Click on the link and you get this more dramatically-titled article:

Does Wage Theft Vary by Demographic Group? Evidence from Minimum Wage Increases

with this abstract:

Using Current Population Survey data, we assess whether and to what extent the burden of wage theft — wage payments below the statutory minimum wage — falls disproportionately on various demographic groups following minimum wage increases. For most racial and ethnic groups at most ages we find that underpayment rises similarly as a fraction of realized wage gains in the wake of minimum wage increases. We also present evidence that the burden of underpayment falls disproportionately on relatively young African American workers and that underpayment increases more for Hispanic workers among the full working-age population.

We actually never got to the full article (but feel free to click on the link and read it yourself). There was enough in the title and abstract to sustain a class discussion.

Before going on . . .

In class we discussed the title and abstract of the above article and considered how it could be improved. This does not mean we think the article, or its title, or its abstract, is bad. Just about everything can be improved! Criticism is an important step in the process of improvement.

The title

“Does Wage Theft Vary by Demographic Group? Evidence from Minimum Wage Increases” . . . that’s not bad! “Wage Theft” in the first sentence is dramatic—it grabs our attention right away. And the second sentence is good too: it foregrounds “Evidence” and it also tells you where the identification is coming from. So, good job. We’ll talk later about how we might be able to do even better, but I like what they’ve got so far.

Just two things.

First, the answer to the question, “Does X vary with Y?”, is always Yes. At least, in social science it’s always Yes. There are no true zeroes. So it would be better to change that first sentence to something like, “How Does Wage Theft Vary by Demographic Group?”

The second thing is the term “wage theft.” I took that as a left-wing signifier, the same way in which the use of a loaded term such as “pro-choice” or “pro-life” conveys the speaker’s position on abortion. So I took the use of that phrase in the title as a signal that the article is taking a position on the political/economic left. But then I googled the first author, and . . . he’s an “Adjunct Senior Fellow at the Hoover Institution.” Not that everyone at Hoover is right-wing, but it’s not a place I associate with the left, either. So I’ll move on and not worry about this issue.

The point here is not that I’m trying to monitor the ideology of economics papers. This is a post on how to write a scholarly paper! My point is that the title conveys information, both directly and indirectly. The term “wage theft” in the title conveys that the topic of the paper will be morally serious—they’re talking about “theft,” not just some technical violations of a law—and it also carries this political connotation. When titling your papers, be aware of the direct and indirect messages you’re conveying.

The abstract

As I said, I liked the title of the paper—it’s punchy and clear. The abstract is another story. I read it and then realized I hadn’t absorbed any of its content, so I read it again, and it was still confusing. It’s not “word salad”—there’s content in that abstract—it’s just put together in a way that I found hard to follow. The students in the class had the same impression, and indeed they were kinda relieved that I too found it confusing.

How to rewrite? The best approach would be to go into the main paper, maybe start with our tactic of forming an abstract by taking the first sentence of each of the first five paragraphs. But here we’ll keep it simple and just go with the information right there in the current abstract. Our goal is to rewrite in a way that makes it less exhausting to read.

Our strategy: First take the abstract apart, then put it back together.

I went to the blackboard and listed the information that was in the abstract:
– CPS data
– Definition of wage theft
– What happens after minimum wage increase
– Working-age population
– African American, Hispanic, White

Now, how to put this all together? My first thought was to just start with the definition of wage theft, but then I checked online and learned that the phrase used in the abstract, “wage payments below the statutory minimum wage,” is not the definition of wage theft; it’s actually just one of several kinds of wage theft. So that wasn’t going to work. Then there’s the bit from the abstract, “falls disproportionately on various demographic groups”—that’s pretty useless, as what we want to know is where this disproportionate burden falls, and by how much.

Putting it all together

We discussed some more—it took surprisingly long, maybe 20 minutes of class time to work through all these issues—and then I came up with this new title/abstract:

Wage theft! Evidence from minimum wage increases

Using Current Population Survey data from [years] in periods following minimum wage increases, we look at the proportion of workers being paid less than the statutory minimum, comparing different age groups and ethnic groups. This proportion was highest in ** age and ** ethnic groups.

OK, how is this different from the original?

1. The three key points of the paper are “wage theft,” “evidence,” and “minimum wage increases,” so that’s now what’s in the title.

2. It’s good to know that the data came from the Current Population Survey. We also want to know when this was all happening, so we added the years to the abstract. Also we made the correction of changing the tense in the abstract from the present to the past, because the study is all based on past data.

3. The killer phrase, “wage theft,” is already in the title, so we don’t need it in the abstract. That helps, because then we can use the authors’ clear and descriptive phrase, “the proportion of workers being paid less than the statutory minimum,” without having to misleadingly imply that this is the definition of wage theft, and without having to laboriously state that it’s a kind of wage theft. That was so easy!

4. We just say we’re comparing different age and ethnic groups and then report the results. This to me is much cleaner than the original abstract which shared this information in three long sentences, with quite a bit of repetition.

5. We have the ** in the last sentence because I’m not quite clear from the abstract what the take-home points are. The version we created is short enough that we could add more numbers to that last sentence, or break it up into two crisp sentences, for example, one sentence about age groups and one about ethnic groups.

In any case, I think this new version is much more readable. It’s a structure much better suited to conveying, not just the general vibe of the paper (wage theft, inequality, minority groups) but the specific findings.

Lessons for rewriters

Just about every writer is a rewriter. So these lessons are important.

We were able to improve the title and abstract, but it wasn’t easy, nor was it algorithmic—that is, there was no simple set of steps to follow. We gave ourselves the relatively simple task of rewriting without the burden of subject-matter knowledge, and it still took a half hour of work.

After looking over some writing advice, it’s tempting to think that rewriting is mostly a matter of a few clean steps: replacing the passive with the active voice, removing empty words and phrases such as “quite” and “Note that,” checking for grammar, keeping sentences short, etc. In this case, no. In this case, we needed to dig in a bit and gain some conceptual understanding to figure out what to say.

The outcome, though, is positive. You can do this too, for your own papers!

“Open Letter on the Need for Preregistration Transparency in Peer Review”

Brendan Nyhan writes:

Wanted to share this open letter. I know preregistration isn’t useful for the style of research you do, but even for consumers of preregistered research like you it’s essential to know if the preregistration was actually disclosed to and reviewed by reviewers, which in turn helps make sure that exploratory and confirmatory analyses are adequately distinguished, deviations and omissions labeled, etc. (The things I’ve seen as a reviewer… are not good – which is what motivated me to organize this.)

The letter, signed by Nyhan and many others, says:

It is essential that preregistrations be considered as part of the scientific review process.

We have observed a lack of shared understanding among authors, editors, and reviewers about the role of preregistration in peer review. Too often, preregistrations are omitted from the materials submitted for review entirely. In other cases, manuscripts do not identify important deviations from the preregistered analysis plan, fail to provide the results of preregistered analyses, or do not indicate which analyses were not preregistered.

We therefore make the following commitments and ask others to join us in doing so:

As authors: When we submit an article for review that has been preregistered, we will always include a working link to a (possibly anonymized) preregistration and/or attach it as an appendix. We will identify analyses that were not preregistered as well as notable deviations and omissions from the preregistration.

As editors: When we receive a preregistered manuscript for review, we will verify that it includes a working link to the preregistration and/or that it is included in the materials provided to reviewers. We will not count the preregistration against appendix page limits.

As reviewers: We will (a) ask for the preregistration link or appendix when reviewing preregistered articles and (b) examine the preregistration to understand the registered intention of the study and consider important deviations, omissions, and analyses that were not preregistered in assessing the work.

I’ve actually been moving toward more preregistration in my work. Two recent studies we’ve done that have been preregistered are:

– Our project on generic language and political polarization

– Our evaluation of the Millennium Villages project

And just today I met with two colleagues on a medical experiment that’s in the pre-design stage—that is, we’re trying to figure out the design parameters. To do this, we need to simulate the entire process, including latent and observed data, then perform analyses on the simulated data, then replicate the entire process to ensure that the experiment will be precise enough to be useful, at least under the assumptions we’re making. This is already 90% of preregistration, and we had to do it anyway. (See recommendation 3 here.)
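For concreteness, here is roughly what that simulate-analyze-repeat loop can look like in code. This is a generic sketch with placeholder numbers, not the actual medical experiment: fake data are simulated under assumed design parameters, the planned analysis is run, and the whole thing is repeated to check whether the design would be precise enough to be useful.

```python
# Generic sketch of simulation-based design analysis (placeholder parameters,
# not the actual experiment discussed above).
import numpy as np

rng = np.random.default_rng(0)

def simulate_and_analyze(n_per_arm, true_effect=0.3, sd=1.0):
    """One fake-data replication: simulate outcomes, return estimate and s.e."""
    control = rng.normal(0.0, sd, n_per_arm)
    treated = rng.normal(true_effect, sd, n_per_arm)
    est = treated.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / n_per_arm + treated.var(ddof=1) / n_per_arm)
    return est, se

def design_summary(n_per_arm, n_sims=2000):
    """Repeat the fake-data analysis to estimate power and typical precision."""
    ests, ses = zip(*(simulate_and_analyze(n_per_arm) for _ in range(n_sims)))
    ests, ses = np.array(ests), np.array(ses)
    return {"n_per_arm": n_per_arm,
            "power": float(np.mean(np.abs(ests / ses) > 1.96)),
            "typical_se": float(ses.mean())}

for n in (50, 200, 800):
    print(design_summary(n))
```

Swap in the real design, the planned analysis, and honest guesses at effect sizes, and most of a preregistration is already written down.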

So, yeah, given that I’m trying now to simulate every study ahead of time before gathering any data, preregistration pretty much comes for free.

Preregistration is not magic—it won’t turn a hopelessly biased, noisy study into something useful—but it does seem like a useful part of the scientific process, especially if we remember that preregistering an analysis should not stop us from performing later, non-preregistered analyses.

Preregistration should be an addition to the research project, not a limitation!

I guess that Nyhan et al.’s suggestions are good, if narrow in that they’re focused on the very traditional journal-reviewer system. I’m a little concerned with the promise that they as reviewers will “examine the preregistration to understand the registered intention of the study and consider important deviations, omissions, and analyses that were not preregistered in assessing the work.” I mean, sure, fine in theory, but I would not expect or demand that every reviewer do this for every paper that comes in. If I had to do all that work every time I reviewed a paper, I’d have to review many fewer papers a year, and I think my total contribution to science as a reviewer would be much less. If I’m gonna go through and try to replicate an analysis, I don’t want to waste that on a review that only 4 people will see. I’d rather blog it and maybe write it up in some other form (as for example here), as that has the potential to help more people.

Anyway, here’s the letter, so go sign it—or perhaps sign some counter-letter—if you wish!

Oooh, I’m not gonna touch that tar baby!

Someone pointed me to a controversial article written a couple years ago. The article remains controversial. I replied that it’s a topic that I’ve not followed in any detail and I’ll just defer to the experts. My correspondent pointed to some serious flaws in the article and asked that I link to the article here on the blog. He wrote, “I was unable to find any peer responses to it. Perhaps the discussants on your site will have some insights.”

My reply is the title of this post.

P.S. Not enough information is given in this post to figure out what is the controversial article here, so please don’t post guesses in the comments! Thank you for understanding.

What happens when someone you know goes off the political deep end?

Speaking of political polarization . . .

Around this time every year we get these news articles of the form, “I’m dreading going home to Thanksgiving this year because of my uncle, who used to be a normal guy who spent his time playing with his kids, mowing the lawn, and watching sports on TV, but has become a Fox News zombie, muttering about baby drag shows and saying that Alex Jones was right about those school shootings being false-flag operations.”

This all sounds horrible but, hey, that's just other people, right? OK, actually I did have an uncle who started out normal and got weirder and weirder, starting in the late 1970s with those buy-gold-because-the-world-is-coming-to-an-end newsletters and then getting worse from there, with different aspects of his life falling apart as his beliefs got more and more extreme. Back in 1999 he was convinced that the Y2K bug (remember that?) would destroy society. After January 1 came and nothing happened, we asked him if he wanted to reassess. His reply: the Y2K bug would indeed take civilization down, but it would be gradual, over a period of months. And, yeah, he'd always had issues, but it did get worse and worse.

Anyway, reading about poll results is one thing; having it happen to people you know is another. Recently a friend told me about another friend, someone I hadn't seen in a while. Last I spoke with that guy, a few years back, he was pushing JFK conspiracy theories. I don't believe any of these JFK conspiracy theories (please don't get into that in the comments here; just read this book instead), but lots of people believe JFK conspiracy theories; indeed, they're not as wacky as the ever-popular UFOs-as-space-aliens thing. I didn't think much about it; he was otherwise a normal guy. Anyway, the news was that in the meantime he'd become a full-bore, all-in vaccine denier.

What happened? I have no idea, as I never knew this guy that well. He was a friend, or I guess in recent years an acquaintance. I don’t really have a take on whether he was always unhinged, or maybe the JFK thing started him on a path that spiraled out of control, or maybe he just spent too much time on the internet.

I was kinda curious how he’d justify his positions, though, so I sent him an email:

I hope all is well with you. I saw your political activities online. I was surprised to see you endorse the statement that the covid vaccine is "the biggest crime ever committed on humanity." Can you explain how you think that a vaccine that's saved hundreds of thousands of lives is more of a crime committed on humanity than, say, Hitler and Stalin starting WW2?

I had no idea how he'd respond to this; maybe he'd send me a bunch of QAnon links, the electronic equivalent of a manila folder full of mimeographed screeds. It's not like I was expecting to have any useful discussion with him—once you start with the position that a vaccine is a worse crime than invading countries and starting a world war, there's really no place to turn. He did not respond to me, which I guess is fine. What was striking to me was how he didn't just take a provocative view that was not supported by the evidence (JFK conspiracy theories, election denial, O.J. is innocent, etc.); instead he staked out a position that was well beyond the edge of sanity, almost as if the commitment to extremism were part of the appeal. Kind of like the people who go along with Alex Jones on the school shootings.

Anyway, this sort of thing is always sad, but especially when it happens to someone you know, and then it doesn’t help that there are lots of unscrupulous operators out there who will do their best to further unmoor these people from reality and take their money.

From a political science perspective, the natural questions are: (1) How does this all happen? and (2) Is this all worse than before, or do modern modes of communication just make us more aware of these extreme attitudes? After all, back in the 1960s there were many prominent Americans with ridiculous extreme-right and extreme-left views, and they had a lot of followers too. The polarization of American institutions has allowed some of these extreme views to get more political prominence, so that the likes of Alex Jones and Al Sharpton can get treated with respect by the leaders of the major political parties. Political leaders have always been willing to accept the support of extremists—a vote's a vote, after all—but I have the feeling that in the past they kept such people more at arm's length.

This post is not meant to be a careful study of these questions, indeed I’m sure there’s a big literature on the topic. What happened is that my friend told me about our other friend going off the deep end, and that all got me thinking, in the way that a personal connection can make a statistical phenomenon feel so much more real.

P.S. Related is this post from last year on Seth Roberts and political polarization. Unlike my friend discussed above, Seth never got sucked into conspiracy theories, but he had this dangerous mix of over-skepticism and over-credulity, and I could well imagine that he could’ve ended up in some delusional spaces.

Generically partisan: Polarization in political communication

Gustavo Novoa, Margaret Echelbarger, et al. write:

American political parties continue to grow more polarized, but the extent of ideological polarization among the public is much less than the extent of perceived polarization (what the ideological gap is believed to be). Perceived polarization is concerning because of its link to interparty hostility, but it remains unclear what drives this phenomenon.

We propose that a tendency for individuals to form broad generalizations about groups on the basis of inconsistent evidence may be partly responsible.

We study this tendency by measuring the interpretation, endorsement, and recall of category-referring statements, also known as generics (e.g., “Democrats favor affirmative action”). In study 1 (n = 417), perceived polarization was substantially greater than actual polarization. Further, participants endorsed generics as long as they were true more often of the target party (e.g., Democrats favor affirmative action) than of the opposing party (e.g., Republicans favor affirmative action), even when they believed such statements to be true for well below 50% of the relevant party. Study 2 (n = 928) found that upon receiving information from political elites, people tended to recall these statements as generic, regardless of whether the original statement was generic or not. Study 3 (n = 422) found that generic statements regarding new political information led to polarized judgments and did so more than nongeneric statements.

Altogether, the data indicate a tendency toward holding mental representations of political claims that exaggerate party differences. These findings suggest that the use of generic language, common in everyday speech, enables inferential errors that exacerbate perceived polarization.

Nice graphs. I guess PNAS publishes good stuff from time to time.