Evilicious 3: Face the Music

A correspondent forwards me this promotional material that appeared in his inbox:

“Misbeliefs are not just about other people, they are also about our own beliefs.” Indeed.

I wonder if this new book includes the shredder story.

P.S. The book has blurbs from Yuval Harari, Arianna Huffington, and Michael Shermer (the professional skeptic who assures us that he has a haunted radio). This thing where celebrities stick together . . . it’s nuts!

P.P.S. The good news is that there’s already some new material for the eventual sequel. And it’s “preregistered”! What could possibly go wrong?

What is the prevalence of bad social science?

Someone pointed me to this post from Jonatan Pallesen:

Frequently, when I [Pallesen] look into a discussed scientific paper, I find out that it is astonishingly bad.

• I looked into Claudine Gay’s 2001 paper to check a specific thing, and found that the research approach of the paper makes no sense. (https://x.com/jonatanpallesen/status/1740812627163463842)

• I looked into the famous study about how blind auditions increased the number of women in orchestras, and found that the only significant finding is in the opposite direction. (https://x.com/jonatanpallesen/status/1737194396951474216)

• The work of Lisa Cook was being discussed because of her nomination to the fed. @AnechoicMedia_ made a comment pointing out a potential flaw in her most famous study. And indeed, the flaw was immediately obvious and fully disqualifying. (https://x.com/jonatanpallesen/status/1738146566198722922)

• The study showing judges being very affected by hunger? Also useless. (https://x.com/jonatanpallesen/status/1737965798151389225)

These studies do not have minor or subtle flaws. They have flaws that are simple and immediately obvious. I think that anyone, without any expertise in the topics, can read the linked tweets and agree that yes, these are obvious flaws.

I’m not sure what to conclude from this, or what should be done. But it is rather surprising to me to keep finding this.

My quick answer is, at some point you should stop being surprised! Disappointed, maybe, just not surprised.

A key point is that these are not just any papers, they’re papers that have been under discussion for some reason other than their potential problems. Pallesen, or any of us, doesn’t have to go through Psychological Science and PNAS every week looking for the latest outrage. He can just sit in one place, passively consume the news, and encounter a stream of prominent published research papers that have clear and fatal flaws.

Regular readers of this blog will recall dozens more examples of high-profile disasters: the beauty-and-sex-ratio paper, the ESP paper and its even more ridiculous purported replications, the papers on ovulation and clothing and ovulation and voting, himmicanes, air rage, ages ending in 9, the pizzagate oeuvre, the gremlins paper (that was the one that approached the platonic ideal of more corrections than data points), the ridiculously biased estimate of the effects of early-childhood intervention, the air pollution in China paper and all the other regression discontinuity disasters, much of the nudge literature, the voodoo study, the “out of Africa” paper, etc. As we discussed in the context of that last example, all the way back in 2013 (!), the problem is closely related to these papers appearing in top journals:

The authors have an interesting idea and want to explore it. But exploration won’t get you published in the American Economic Review etc. Instead of the explore-and-study paradigm, researchers go with assert-and-defend. They make a very strong claim and keep banging on it, defending their claim with a bunch of analyses to demonstrate its robustness. . . . High-profile social science research aims for proof, not for understanding—and that’s a problem. The incentives favor bold thinking and innovative analysis, and that part is great. But the incentives also favor silly causal claims. . . .

So, to return to the question in the title of this post, how often is this happening? It’s hard for me to say. On one hand, ridiculous claims get more attention; we don’t spend much time talking about boring research of the “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]” variety. On the other hand, do we really think that high-profile papers in top journals are that much worse than the mass of published research?

I expect that some enterprising research team has done some study, taking a random sample of articles published in some journals and then looking at each paper in detail to evaluate its quality. Without that, we can only guess, and I don’t have it in me to hazard a percentage. I’ll just say that it happens a lot—enough so that I don’t think it makes sense to trust social-science studies by default.

My correspondent also pointed me to a recent article in Harvard’s student newspaper, “I Vote on Plagiarism Cases at Harvard College. Gay’s Getting off Easy,” by “An Undergraduate Member of the Harvard College Honor Council,” who writes:

Let’s compare the treatment of Harvard undergraduates suspected of plagiarism with that of their president. . . . A plurality of the Honor Council’s investigations concern plagiarism. . . . when students omit quotation marks and citations, as President Gay did, the sanction is usually one term of probation — a permanent mark on a student’s record. A student on probation is no longer considered in good standing, disqualifying them from opportunities like fellowships and study-abroad programs. Good standing is also required to receive a degree.

What is striking about the allegations of plagiarism against President Gay is that the improprieties are routine and pervasive. She is accused of plagiarism in her dissertation and at least two of her 11 journal articles. . . .

In my experience, when a student is found responsible for multiple separate Honor Code violations, they are generally required to withdraw — i.e., suspended — from the College for two semesters. . . . We have even voted to suspend seniors just about to graduate. . . .

There is one standard for me and my peers and another, much lower standard for our University’s president.

This echoes what Jonathan Bailey has written here and here at his blog Plagiarism Today:

Schools routinely hold their students to a higher and stricter standard when it comes to plagiarism than they hold their faculty and staff to. . . .

To give an easy example: In October 2021, W. Franklin Evans, who was then the president of West Liberty University, was caught repeatedly plagiarizing in speeches he was giving as president. Importantly, it wasn’t past research that was in dispute, it was the work he was doing as president.

However, though the board did vote unanimously to discipline him, they also voted against termination and did not clarify what discipline he was receiving.

He was eventually let go as president, but only after his contract expired two years later. It’s difficult to believe that a student at the school, if faced with a similar pattern of plagiarism in their coursework, would be given that same chance. . . .

The issue also isn’t limited to higher education. In February 2020, Katy Independent School District superintendent Lance Hindt was accused of plagiarism in his dissertation. Though he eventually resigned, the district initially threw their full support behind Hindt. This included a rally for Hindt that was attended by many of the teachers in the district.

Even after he left, he was given two years of salary and had $25,000 set aside for him if he wanted to file a defamation lawsuit.

There are lots and lots of examples of prominent faculty committing scholarly misconduct and nobody seems to care—or, at least, not enough to do anything about it. In my earlier post on the topic, I mentioned the Harvard and Yale law professors, the USC medical school professor, the Princeton history professor, the George Mason statistics professor, and the Rutgers history professor, none of whom got fired. And I’d completely forgotten about the former president of the American Psychological Association and editor of Perspectives on Psychological Science who misrepresented work he had published and later was forced to retract—but his employer, Cornell University, didn’t seem to care. And the University of California professor who misrepresented data and seems to have suffered no professional consequences. And the Stanford professor who gets hyped by his university while promoting miracle cures and bad studies. And the dean of engineering at the University of Nevada. Not to mention all the university administrators and football coaches who misappropriate funds and then are quietly allowed to leave on golden parachutes.

Another problem is that we rely on the news media to keep these institutions accountable. We have lots of experience with universities (and other organizations) responding to problems by denial; the typical strategy appears to be to lie low and hope the furor will go away, which typically happens in the absence of lots of stories in the major news media. But . . . the news media have their own problems: little problems like NPR consistently hyping junk science and big problems like Fox pushing baseless political conspiracy theories. And if you consider podcasts and Ted talks to be part of “the media,” which I think they are—I guess as part of the entertainment media rather than the news media, but the dividing line is not sharp—then, yeah, a huge chunk of the media is not just susceptible to being fooled by bad science and indulgent of academic misconduct, it actually relies on bad science and academic misconduct to get the wow! stories that bring the clicks.

To return to the main thread of this post: by sanctioning students for scholarly misconduct but letting its faculty and administrators off the hook, Harvard is, unfortunately, following standard practice. The main difference, I guess, is that “president of Harvard” is more prominent than “Princeton history professor” or “Harvard professor of constitutional law” or “president of West Liberty University” or “president of the American Psychological Association” or “UCLA medical school professor” or all the others. The story of the Harvard president stays in the news, while those others all receded from view, allowing the administrators at those institutions to follow the usual plan of minimizing the problem, saying very little, and riding out the storm.

Hey, we just got sidetracked into a discussion of plagiarism. This post was supposed to be about bad research. What can we say about that?

Bad research is different from plagiarism. Obviously, students don’t get kicked out for doing bad research, using wrong statistical methods, losing their data, making claims that defy logic and common sense, claiming to have modified a paper shredder that never existed, etc etc etc. That’s the kind of behavior that, if your final paper also has formatting problems, will get you slammed with a B grade and that’s about it.

When faculty are found to have done bad research, the usual reaction is not to give them a B or to do the administrative equivalent—lowering their salary, perhaps?, or removing them from certain research responsibilities, maybe making them ineligible to apply for grants?—but rather to pretend that nothing happened. The idea is that, once an article has been published, you draw a line under it and move onward. It’s considered in bad taste—Javert-like, even!—to go back and find flaws in papers that are already resting comfortably in someone’s C.V. As Pallesen notes, so often when we do go back and look at those old papers, we find serious flaws. Which brings us to the question in the title of this post.

P.S. The paper by Claudine Gay discussed by Pallesen is here; it was published in 2001. For more on the related technical questions involving the use of ecological regression, I recommend this 2002 article by Michael Herron and Kenneth Shotts (link from Pallesen) and my own article with David Park, Steve Ansolabehere, Phil Price, and Lorraine Minnite, “Models, assumptions, and model checking in ecological regressions,” from 2001.

“AI” as shorthand for turning off our brains. (This is not an anti-AI post; it’s a discussion of how we think about AI.)

Before going on, let me emphasize that, yes, modern AI is absolutely amazing—self-driving cars, machines that can play ping-pong, chessbots, computer programs that write sonnets, the whole deal! Call it machine intelligence or whatever, it’s amazing.

What I’m getting at in this post is the way in which attitudes toward AI fit into existing practices in science and other aspects of life.

This came up recently in comments:

“AI” does not just refer to a particular set of algorithms or computer programs but also to the attitude in which an algorithm or computer program is idealized to the extent that people think it’s ok for them to rely on it and not engage their brains.

Some examples of “AI” in that sense of the term:
– When people put a car on self-driving mode and then disengage from the wheel.
– When people send out a memo produced by a chatbot without reading and understanding it first.
– When researchers use regression discontinuity analysis or some other identification strategy and don’t check that their numbers make any sense at all.
– When journal editors see outrageous claims backed by “p less than 0.05” and then just push the Publish button.

“AI” is all around us, if you just know where to look!

One thing that interests me here is how current expectations of AI in some ways match and in some ways go beyond past conceptions in science fiction. The chatbot, for example, is pretty similar to all those talking robots, and I guess you could imagine a kid in such a story asking his robot to do his homework for him. Maybe the difference is that the robot is thought to have some sort of autonomy, along with which comes some idiosyncratic fallibility (if only that the robot is too “logical” to always see clearly to the solution of a problem), whereas an AI is considered more of an institutional product with some sort of reliability, in the same sense that every bottle of Coca-Cola is the same. Maybe that’s the connection to naive trust in standardized statistical methods.

This also relates to the idea that humans used to be thought of as the rational animal but now are viewed as irrational computers. In the past, our rationality was considered to be what separates us from the beasts, either individually or through collective action, as in Locke and Hobbes. If the comparison point is animals, then our rationality is a real plus! Nowadays, though, it seems almost the opposite: if the comparison point is a computer, then what makes us special is not our rationality but our emotions.

There is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.

Following our recent post on the latest Dishonestygate scandal, we got into a discussion of the challenges of simulating fake data and performing a pre-analysis before conducting an experiment.

You can see it all in the comments to that post—but not everybody reads the comments, so I wanted to repeat our discussion here. Especially the last line, which I’ve used as the title of this post.

Raphael pointed out that it can take some work to create a realistic simulation of fake data:

Do you mean to create a dummy dataset and then run the preregistered analysis? I like the idea, and I do it myself, but I don’t see how this would help me see if the endeavour is doomed from the start? I remember your post on the beauty-and-sex ratio, which proved that the sample size was far too small to find an effect of such small magnitude (or was it in the Type S/Type M paper?). I can see how this would work in an experimental setting – simulate a bunch of data sets, do your analysis, compare it to the true effect of the data generation process. But how do I apply this to observational data, especially with a large number of variables (number of interactions scales in O(p²))?

I elaborated:

Yes, that’s what I’m suggesting: create a dummy dataset and then run the preregistered analysis. Not the preregistered analysis that was used for this particular study, as that plan is so flawed that the authors themselves don’t seem to have followed it, but a reasonable plan. And that’s kind of the point: if your pre-analysis plan isn’t just a bunch of words but also some actual computation, then you might see the problems.

In answer to your second question, you say, “I can see how this would work in an experimental setting,” and we’re talking about an experiment here, so, yes, it would’ve been better to have simulated data and performed an analysis on the simulated data. This would require the effort of hypothesizing effect sizes, but that’s a bit of effort that should always be done when planning a study.

For an observational study, you can still simulate data; it just takes more work! One approach I’ve used, if I’m planning to fit a model predicting some variable y from a bunch of predictors x, is to get the values of x from some pre-existing dataset, for example an old survey, and then just do the simulation part for y given x.
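Here’s a minimal sketch of that kind of fake-data simulation in Python. Everything below, including the predictor names, the assumed effect sizes, and the noise level, is made up for illustration; in real use you would take the x values from an existing dataset rather than simulating them.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)

# Stand-in for predictors x; in real use, pull these from a pre-existing
# dataset such as an old survey instead of simulating them.
n = 1000
x = np.column_stack([
    rng.normal(size=n),            # e.g., a standardized continuous predictor
    rng.binomial(1, 0.5, size=n),  # e.g., a binary demographic indicator
])

# Hypothesized effect sizes and noise level: these are exactly the assumptions
# that writing a pre-analysis plan forces you to state explicitly.
beta = np.array([0.2, 0.1])
y = 1.0 + x @ beta + rng.normal(scale=1.0, size=n)

# Run the planned analysis on the fake data and check whether the assumed
# effects are recoverable at this sample size.
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)  # compare to the assumed intercept and coefficients
print(fit.bse)     # if these swamp the assumed effects, the design is too noisy
```

If the standard errors from the simulated analysis are as large as the effects you hypothesized, you have learned, before collecting any data, that the planned study is too noisy to be useful.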

Raphael replied:

Maybe not the silver bullet I had hoped for, but now I believe I understand what you mean.

To which I responded:

There is no silver bullet; there is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.

Again, this is not a diss on preregistration. Preregistration does one thing; it’s not intended to fix bad aspects of the culture of science, such as the idea that you can gather a pile of data, grab some results, declare victory, and go on the Ted talk circuit, based only on the very slender bit of evidence that you were able to reject the hypothesis that your data came from a specific random number generator. That line of reasoning, where rejection of straw-man null hypothesis A is taken as evidence in favor of preferred alternative B, is wrong—but it’s not preregistration’s fault that people think that way!

P-hacking can be bad (but the problem here, in my view, is not in performing multiple analyses but rather in reporting only one of them rather than analyzing them all together); various questionable research practices are, well, questionable; and preregistration can help with that, either directly (by motivating researchers to follow a clear plan) or indirectly (by allowing outsiders to see problems in post-publication review, as here).

I am, however, bothered by the focus on procedural/statistical “rigor-enhancing practices” of “confirmatory tests, large sample sizes, preregistration, and methodological transparency.” Again, the problem is if researchers mistakenly think that following such advice will place them back on that nonexistent golden path to discovery.

So, again, I recommend to make assumptions, simulate fake data, and analyze these data as a way of constructing a pre-analysis plan, before collecting any data. That won’t put you on the golden path to discovery either!

All I can offer you here is blood, toil, tears and sweat, along with the possibility that a careful process of assumptions/simulation/pre-analysis will allow you to avoid disasters such as this ahead of time, thus avoiding the consequences of: (a) fooling yourself into thinking you’ve made a discovery, (b) wasting the time and effort of participants, coauthors, reviewers, and postpublication reviewers (that’s me!), and (c) filling the literature with junk that will later be collected in a GIGO meta-analysis and promoted by the usual array of science celebrities, podcasters, and NPR reporters.

Aaaaand . . . the time you’ve saved from all of that can be repurposed into designing more careful experiments with clearer connections between theory and measurement. Not a glide along the golden path to a discovery; more of a hacking through the jungle of reality to obtain some occasional glimpses of the sky.

It’s Ariely time! They had a preregistration but they didn’t follow it.

I have a story for you about a success of preregistration. Not quite the sort of success that you might be expecting—not a scientific success—but a kind of success nonetheless.

It goes like this. An experiment was conducted. It was preregistered. The results section was written up in a way that reads as if the experiment worked as planned. But if you go back and forth between the results section and the preregistration plan, you realize that the purportedly successful results did not follow the preregistration plan. They’re just the usual story of fishing and forking paths and p-hacking. The preregistration plan was too vague to be useful; also, the authors didn’t even bother to follow it—or, if they did follow it, they didn’t bother to write up the results of the preregistered analysis.

As I’ve said many times before, there’s no reason that preregistration should stop researchers from doing further analyses once they see their data. The problem in this case is that the published analysis was not well justified either from a statistical or a theoretical perspective, nor was it in the preregistration. Its only value appears to be as a way for the authors to spin a story around a collection of noisy p-values.

On the minus side, the paper was published, and nowhere in the paper does it say that the statistical evidence they offer from their study does not come from the preregistration. In the abstract, their study is described as “pre-registered,” which isn’t a lie—there’s a preregistration plan right there on the website—but it’s misleading, given that the preregistration does not line up with what’s in the paper.

On the plus side, outside readers such as ourselves can see the paper and the preregistrations and draw our own conclusions. It’s easier to see the problems with p-hacking and forking paths when the analysis choices are clearly not in the preregistration plan.

The paper

The Journal of Experimental Social Psychology recently published an article, “How pledges reduce dishonesty: The role of involvement and identification,” by Eyal Peer, Nina Mazar, Yuval Feldman, and Dan Ariely.

I had no idea that Ariely is still publishing papers on dishonesty! It says that data from this particular paper came from online experiments. Nothing involving insurance records or paper shredders or soup bowls or 80-pound rocks . . . It seems likely that, in this case, the experiments actually happened and that the datasets came from real people and have not been altered.

And the studies are preregistered, with the preregistration plans all available on the paper’s website.

I was curious about that. The paper had 4 studies. I just looked at the first one, which already took some effort on my part. The rest of you can feel free to look at Studies 2, 3, and 4.

The results section and the preregistration

From the published paper:

The first study examined the effects of four different honesty pledges that did or did not include a request for identification and asked for either low or high involvement in making the pledge (fully-crossed design), and compared them to two conditions without any pledge (Control and Self-Report).

There were six conditions: one control (with no possibility to cheat), a baseline treatment (possibility and motivation to cheat and no honesty pledge), and four different treatments with honesty pledges.

This is what they reported for their primary outcome:

And this is how they summarize in their discussion section:

Interesting, huh?

Now let’s look at the relevant section of the preregistration:

Compare that to what was done in the paper:

– They did the Anova, but that was not relevant to the claims in the paper. The Anova included the control condition, and nobody’s surprised that, when you give people the opportunity and motivation to cheat, some people will cheat. That was not the point of the paper. It’s fine to do the Anova; it’s just more of a manipulation check than anything else.

– There’s something in the preregistration about a “cheating gap” score, which I did not see in the paper. But if we define A to be the average outcome under the control, B to be the average outcome under the baseline treatment, and C, D, E, F to be the averages under the other four treatments, then I think the preregistration is saying they’ll define the cheating gap as B-A, and then compare this to C-A, D-A, E-A, and F-A. This is mathematically the same as looking at C-B, D-B, E-B, and F-B, which is what they do in the paper (the algebra is spelled out just after this list).

– The article jumps back and forth between different statistical summaries: “three of the four pledge conditions showed a decrease in self-reports . . . the difference was only significant for the Copy + ID condition.” It’s not clear what to make of it. They’re using statistical significance as evidence in some way, but the preregistration plan does not make it clear what comparisons would be done, how many comparisons would be made, or how they would be summarized.

– The preregistration plan says, “We will replicate the ANOVAs with linear regressions with the Control condition or Self-Report conditions as baseline.” I didn’t see any linear regressions in the results for this experiment in the published paper.

– The preregistration plan says, “We will also examine differences in the distribution of the percent of problems reported as solved between conditions using Kolmogorov–Smirnov tests. If we find significant differences, we will also examine how the distributions differ, specifically focusing on the differences in the percent of “brazen” lies, which are defined as the percent of participants who cheated to a maximal, or close to a maximal, degree (i.e., reported more than 80% of problems solved). The differences on this measure will be tested using chi-square tests.” I didn’t see any of this in the paper either! Maybe this is fine, because doing all these tests doesn’t seem like a good analysis plan to me.

What should we make of all the analyses stated in the preregistration plan that were not in the paper? Since these analyses were preregistered, I can only assume the authors performed them. Maybe the results were not impressive and so they weren’t included. I don’t know; I didn’t see any discussion of this in the paper.

– The preregistration plan says, “Lastly, we will explore interactions effects between the condition and demographic variables such as age and gender using ANOVA and/or regressions.” They didn’t report any of that either! Also there’s the weird “and/or” in the preregistration, which gives the researchers some additional degrees of freedom.
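To spell out the cheating-gap algebra from a couple of items up (my notation, following the definitions above):

```latex
(C - A) - (B - A) = C - B, \qquad (D - A) - (B - A) = D - B, \quad \text{and similarly for } E \text{ and } F.
```

And for readers trying to picture the preregistered distributional checks quoted above (the Kolmogorov–Smirnov and chi-square tests on “brazen” lies), here is a minimal sketch of what they might look like. The variable names and data below are made-up placeholders; this is not the authors’ code or data.

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

# Hypothetical per-participant outcomes: percent of problems reported as solved
# in two conditions (simulated placeholders, not the authors' data).
rng = np.random.default_rng(0)
pct_solved_baseline = 100 * rng.beta(2, 2, size=150)
pct_solved_pledge = 100 * rng.beta(2, 3, size=150)

# Kolmogorov-Smirnov test comparing the two distributions.
ks_stat, ks_p = ks_2samp(pct_solved_baseline, pct_solved_pledge)

# "Brazen" lies: participants reporting more than 80% of problems solved.
table = np.array([
    [np.sum(pct_solved_baseline > 80), np.sum(pct_solved_baseline <= 80)],
    [np.sum(pct_solved_pledge > 80), np.sum(pct_solved_pledge <= 80)],
])
chi2, chi_p, dof, expected = chi2_contingency(table)
print(ks_stat, ks_p, chi2, chi_p)
```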

Not a moral failure

I continue to emphasize that scientific problems do not necessarily correspond to moral problems. You can be a moral person and still do bad science (honesty and transparency are not enuf); to put it another way, if I say that you make a scientific error or are sloppy in your science, I’m not saying you’re a bad person.

For me to say someone’s a bad person just because they wrote a paper and didn’t follow their preregistration plan . . . that would be ridiculous! Over 99% of my published papers have no preregistration plans; and, for those that do have such plans, I’m pretty sure we didn’t follow them exactly in the published papers. That’s fine. The reason I do preregistration is not to protect my p-values; it’s just part of a larger process of hypothesizing about possible outcomes and simulating data and analysis as a prelude to measurement and data collection.

I think what happened in the “How pledges reduce dishonesty” paper is that the preregistration was both too vague and too specific. Too vague in that it did not include simulation and analysis of fake data, nor did it include quantitative hypotheses about effects and the distributions of outcomes, nor did it include anything close to what the authors ended up actually doing to support the claims in their paper. Too specific in that it included a bunch of analyses that the authors then didn’t think were worth reporting.

But, remember, science is hard. Statistics is hard. Even what might seem like simple statistics is hard. One thing I like about doing simulation-based design and analysis before collecting any data is that it forces me to make some of the hard choices early. So, yeah, it’s hard, and it’s no moral criticism of the authors of the above-discussed paper that they botched this. We’re all still learning. At the same time, yeah, I don’t think their study offers any serious evidence for the claims being made in that paper; it looks like noise mining to me. Not a moral failing; still, it’s bad science, in that there are no good links between theory, effect size, data collection, and measurement, which, as is often the case, leads to super-noisy results that can be interpreted in all sorts of ways to fit just about any theory.

Possible positive outcomes for preregistration

I think preregistration is great; again, it’s a floor, not a ceiling, on the data processing and analyses that can be done.

Here are some possible benefits of preregistration:

1. Preregistration is a vehicle for getting you to think harder about your study. The need to simulate data and create a fake world forces you to make hard choices and consider what sorts of data you might expect to see.

2. Preregistration with fake-data simulation can make you decide to redesign a study, or to not do it at all, if it seems that it will be too noisy to be useful.

3. If you already have a great plan for a study, preregistration can allow the subsequent analysis to be bulletproof. No need to worry about concerns of p-hacking if your data coding and analysis decisions are preregistered—and this also holds for analyses that are not based on p-values or significance tests.

4. A preregistered replication can build confidence in a previous exploratory finding.

5. Conversely, a preregistered study can yield a null result, for example if it is designed to have high statistical power but then does not yield statistically significant preregistered results. Failure is not always as exciting or informative as success—recall the expression “big if true”—but it ain’t nothing.

6. Similarly, a preregistered replication can yield a null result. Again, this can be a disappointment but still a step in scientific learning.

7. Once the data appear, and the preregistered analysis is done, if it’s unsuccessful, this can lead the authors to change their thinking and to write a paper explaining that they were wrong, or maybe just to publish a short note saying that the preregistered experiment did not go as expected.

8. If a preregistered analysis fails, but the authors still try to claim success using questionable post-hoc analysis, the journal reviewers can compare the manuscript to the preregistration, point out the problem, and require that the article be rewritten to admit the failure. Or, if the authors refuse to do that, the journal can reject the article as written.

9. Preregistration can be useful in post-publication review to build confidence in a published paper by reassuring readers who might have been concerned about p-hacking and forking paths. Readers can compare the published paper to the preregistration and see that it’s all ok.

10. Or, if the paper doesn’t follow the preregistration plan, readers can see this too. Again, it’s not a bad thing at all for the paper to go beyond the preregistration plan. That’s part of good science, to learn new things from the data. The bad thing is when a non-preregistered analysis is presented as if it were the preregistered analysis. And the good thing is that the reader can read the documents and see that this happened. As we did here.

In the case of this recent dishonesty paper, preregistration did not give benefit 1, nor did it give benefit 2, nor did it give benefits 3, 4, 5, 6, 7, 8, or 9. But it did give benefit 10. Benefit 10 is unfortunately the least of all the positive outcomes of preregistration. But it ain’t nothing. So here we are. Thanks to preregistration, we now know that we don’t need to take seriously the claims made in the published paper, “How pledges reduce dishonesty: The role of involvement and identification.”

For example, you should feel free to accept that the authors offer no evidence for their claim that “effective pledges could allow policymakers to reduce monitoring and enforcement resources currently allocated for lengthy and costly checks and inspections (that also increase the time citizens and businesses must wait for responses) and instead focus their attention on more effective post-hoc audits. What is more, pledges could serve as market equalizers, allowing better competition between small businesses, who normally cannot afford long waiting times for permits and licenses, and larger businesses who can.”

Huh??? That would not follow from their experiments, even if the results had all gone as planned.

There’s also this funny bit at the end of the paper:

I just don’t know whether to believe this. Did they sign an honesty pledge?

Overkill?

OK, it’s 2024, and maybe this all feels like shooting a rabbit with a cannon. A paper by Dan Ariely on the topic of dishonesty, published in an Elsevier journal, purporting to provide “guidance to managers and policymakers” based on the results of an online math-puzzle game? Whaddya expect? This is who-cares research at best, in a subfield that is notorious for unreplicable research.

What happened was I got sucked in. I came across this paper, and my first reaction was surprise that Ariely was still collaborating with people working on this topic. I would’ve thought that the crashing-and-burning of his earlier work on dishonesty would’ve made him radioactive as a collaborator, at least in this subfield.

I took a quick look and saw that the studies were preregistered. Then I wanted to see exactly what that meant . . . and here we are.

Once I did the work, it made sense to write the post, as this is an example of something I’ve seen before: a disconnect between the preregistration and the analyses in the paper, and a lack of engagement in the paper with all the things in the preregistration that did not go as planned.

Again, this post should not be taken as any sort of opposition to preregistration, which in this case led to positive outcome #10 on the above list. The 10th-best outcome, but better than nothing, which is what we would’ve had in the absence of preregistration.

Baby steps.

“Bayesian Workflow: Some Progress and Open Questions” and “Causal Inference as Generalization”: my two upcoming talks at CMU

I’ll be speaking twice at Carnegie Mellon soon.

CMU statistics seminar, Fri 5 Apr 2024, 2:15pm, in Doherty Hall A302:

Bayesian Workflow: Some Progress and Open Questions

The workflow of applied Bayesian statistics includes not just inference but also model building, model checking, confidence-building using fake data, troubleshooting problems with computation, model understanding, and model comparison. We would like to codify these steps in the realistic scenario in which researchers are fitting many models for a given problem. We discuss various issues including prior distributions, data models, and computation, in the context of ideas such as the Fail Fast Principle and the Folk Theorem of Statistical Computing. We also consider some examples of Bayesian models that give bad answers and see if we can develop a workflow that catches such problems. For background, see here: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf

CMU computer science seminar, Tues 9 Apr, 10:30am, in Gates Hillman Building 8102:

Causal Inference as Generalization

In causal inference, we generalize from sample to population, from treatment to control group, and from observed measurements to underlying constructs of interest. The challenge is that models for varying effects can be difficult to estimate from available data. For example, there is a conflict between two tenets of evidence-based medicine: (1) reliance on statistically significant estimates from controlled trials and (2) decision making for individual patients. There’s no way to get to step 2 without going beyond step 1. We discuss limitations of existing approaches to causal generalization and how it might be possible to do better using Bayesian multilevel models. For background, see here: http://www.stat.columbia.edu/~gelman/research/published/KennedyGelman_manuscript.pdf and here: http://www.stat.columbia.edu/~gelman/research/published/causalreview4.pdf and here: http://www.stat.columbia.edu/~gelman/research/unpublished/causal_quartets.pdf

In between the two talks is a solar eclipse. I hope there’s good weather in Cleveland on Monday at 3:15pm.

Bad parenting in the news, also, yeah, lots of kids don’t believe in Santa Claus

A recent issue of the New Yorker had two striking stories of bad parenting.

Margaret Talbot reported on a child/adolescent-care center in Austria from the 1970s that was run by former Nazis who were basically torturing the kids. This happened for decades. The focus of the story was a girl whose foster parents had abused her before sending her to this place. The creepiest thing about all of this was how normal it all seemed. Not normal to me, but normal to that society: abusive parents, abusive orphanage, abusive doctors, all of which fit into an authoritarian society. Better parenting would’ve helped, but it seems that all of these people were trapped in a horrible system, supported by an entrenched network of religious, social, and political influences.

In that same issue of the magazine, Sheelah Kolhatkar wrote about the parents of crypto-fraudster Sam Bankman-Fried. This one was sad in a different way. I imagine that most parents don’t want their children to grow up to be criminals, but such things happen. The part of the story that seemed particularly sad to me was how the parents involved themselves in their son’s crimes. They didn’t just passively accept it—which would be bad enough, but, sure, sometimes kids just won’t listen and they need to learn their lessons on their own—they very directly got involved and indeed profited from the criminal activity. What kind of message is that to send to your child? In some ways this is similar to the Austrian situation, in that the adults involved were so convinced of their moral righteousness. Anyway, it’s gotta be heartbreaking to realize that, not only did you not stop your child’s slide into crime, you actually participated in it.

Around the same time, the London Review of Books ran an article which motivated me to write them this letter:

Dear editors,

In his article in the 2 Nov 2023 issue, John Lanchester writes that financial fraudster Sam Bankman-Fried “grew up aware that his mind worked differently from most people’s. Even as a child he thought that the whole idea of Santa Claus was ridiculous.” I don’t know what things are like in England, but here in the United States it’s pretty common for kids to know that Santa Claus is a fictional character.

More generally, I see a problem with the idealization of rich people. It’s not enough to say that Bankman-Fried was well-connected, good at math, and had a lack of scruple that can be helpful in many aspects of life. He also has to be described as being special, so much so that a completely normal disbelief in the reality of Santa Claus is taken as a sign of how exceptional he is.

Another example is Bankman-Fried’s willingness to gamble his fortune in the hope of even greater riches, which Lanchester attributes to the philosophy of effective altruism, rather than characterizing it as simple greed.

Yours

Andrew Gelman
New York

They’ve published my letters before (here and here), but not this time. I just hope that in the future they don’t take childhood disbelief in Santa Claus as a signal of specialness, or attribute a rich person’s desire for even more money to some sort of unusual philosophy.

Paper cited by Stanford medical school professor retracted—but even without considering the reasons for retraction, this paper was so bad that it should never have been cited.

Last year we discussed a paper sent to us by Matt Bogard. The paper was called, “Impact of cold exposure on life satisfaction and physical composition of soldiers,” it appeared in the British Medical Journal, and Bogard was highly suspicious of it. As he put it at the time:

I don’t have full access to this article to know the full details and can’t seem to access the data link but with n = 49 split into treatment and control groups for these outcomes (also making gender subgroup comparisons) this seems to scream, That which does not kill my statistical significance only makes it stronger.

I took a look and I agreed that the article was absolutely terrible. I guess it was better than most of the stuff published in the International Supply Chain Technology Journal, but that’s not saying much; indeed all it’s saying is that the paper strung together some coherent sentences.

Despite the paper being clearly very bad, it had been promoted by a professor at Stanford Medical School who has a side gig advertising this sort of thing:

What’s up with that?? Stanford’s supposed to be a serious university, no? I hate to see Stanford Medical School mixed up in this sort of thing.

News!

That article has been retracted:

The reason for the retraction is oddly specific. I think it would be enough for them to have just said they’re retracting the paper because it’s no good. As Gideon Meyerowitz-Katz put it:

To sum up – this is a completely worthless study that has no value whatsoever scientifically. It is quite surprising that it got published in its current form, and even more surprising that anyone would try to use it as evidence.

I agree that the study is worthless, even without the specific concerns that caused it to be retracted.

On the other hand, I’m not at all surprised that it got published, nor am I surprised that anyone would try to use it as evidence. Crap studies are published all the time, and they’re used as evidence all the time too.

By the way, if you are curious and want to take a look at the original paper:

You still gotta pay 50 bucks.

I wonder if that Stanford dude is going to announce the retraction of the evidence he touted for his claim that “deliberate cold exposure is great training for the mind.” I doubt it—if the quality of the evidence were important, he wouldn’t have cited the study in the first place—but who knows, I guess anything’s possible.

P.S. At this point, some people are gonna complain at how critical we all are. Why can’t we just let these people alone? I’ll give my usual answer, which is that (a) junk science is a waste of valuable resources and attention, and (b) bad science drives out the good. Somewhere there is a researcher who does good and careful work but was not hired at Stanford medical school because of not being flashy enough. Just like there are Ph.D. students in psychology whose work does not get published in Psychological Science because it can’t compete with clickbait crap like the lucky golf ball study.

As Paul Alper says, one should always beat a dead horse because the horse is never really dead.

“Randomization in such studies is arguably a negative, in practice, in that it gives apparently ironclad causal identification (not really, given the ultimate goal of generalization), which just gives researchers and outsiders a greater level of overconfidence in the claims.”

Dean Eckles sent me an email with subject line, “Another Perry Preschool paper . . .” and this link to a recent research paper that reports, “We find statistically significant effects of the program on a number of different outcomes of interest.” We’ve discussed Perry Preschool before (see also here), so I was coming into this with some skepticism. It turns out that this new paper is focused more on methods than on the application. It begins:

This paper considers the problem of making inferences about the effects of a program on multiple outcomes when the assignment of treatment status is imperfectly randomized. By imperfect randomization we mean that treatment status is reassigned after an initial randomization on the basis of characteristics that may be observed or unobserved by the analyst. We develop a partial identification approach to this problem that makes use of information limiting the extent to which randomization is imperfect to show that it is still possible to make nontrivial inferences about the effects of the program in such settings. We consider a family of null hypotheses in which each null hypothesis specifies that the program has no effect on one of many outcomes of interest. Under weak assumptions, we construct a procedure for testing this family of null hypotheses in a way that controls the familywise error rate–the probability of even one false rejection–in finite samples. We develop our methodology in the context of a reanalysis of the HighScope Perry Preschool program. We find statistically significant effects of the program on a number of different outcomes of interest, including outcomes related to criminal activity for males and females, even after accounting for imperfections in the randomization and the multiplicity of null hypotheses.
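For readers parsing the jargon: “controlling the familywise error rate” over a family of outcome-specific null hypotheses is, mechanically, a multiple-testing adjustment. Here is a generic sketch using a Holm correction, with made-up outcome names and p-values; this is not the paper’s finite-sample, partial-identification procedure, just the standard idea it is built around.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# One null hypothesis per outcome, with the familywise error rate (the chance
# of even one false rejection) controlled via the Holm step-down adjustment.
# Outcome names and p-values are hypothetical, for illustration only.
outcomes = ["criminal activity", "earnings", "employment", "education"]
p_values = np.array([0.004, 0.03, 0.20, 0.65])

reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for name, p, padj, r in zip(outcomes, p_values, p_adj, reject):
    print(f"{name}: p = {p:.3f}, adjusted p = {padj:.3f}, reject = {r}")
```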

I replied: “a family of null hypotheses in which each null hypothesis specifies that the program has no effect on one of many outcomes of interest . . .”: What the hell? This makes no sense at all.

Dean responded:

Yeah, I guess it is a complicated way of saying there’s a null hypothesis for each outcome…

To me this just really highlights the value of getting the design right in the first place — and basically always including well-defined randomization.

To which I replied: Randomization is fine, but I think much less important than measurement as a design factor. The big problem with all these preschool studies is noisy data. These selection-on-statistical-significance methods then combine with forking paths to yield crappy estimates. I’d say “useless estimates,” but I guess the estimates are useful in the horrible sense that they allow the promoters of these interventions to get attention and funding.

Randomization in such studies is arguably a negative, in practice, in that it gives apparently ironclad causal identification (not really, given the ultimate goal of generalization), which just gives researchers and outsiders a greater level of overconfidence in the claims. They’re following the “strongest link” reasoning, which is an obvious logical fallacy but that doesn’t stop it from driving so much policy research.

Dean:

Yes, definitely agreed that measurement is a key part of the design, as is choosing a sample size that has any hope of detecting reasonable effect sizes.

Me: Agreed. One reason I de-emphasize the importance of sample size is that researchers often seem to think that sample size is a quick fix. It goes like this: Researchers do a study and find a result that’s 1.5 se’s from 0. So then they think that if they just increase the sample size by a factor of (2/1.5)^2, they’ll get statistical significance. Or that if their sample size were higher by a factor of (2.8/1.5)^2, they’d have 80% power. And then they do the study, they manage to find that statistically significant result, and they (a) declare victory on their substantive claim and (b) think that their sample size is retrospectively justified, in the same way that a baseball manager’s decision to leave in the starting pitcher is justified if the outcome is a W.

So, the existence of the “just increase the sample size” option can be an enabler for bad statistics and bad science. My favorite example along those lines is the beauty-and-sex-ratio study, which used a seemingly-reasonable sample size of 3000 but would realistically need something like a million people or more to have any reasonable chance of detecting any underlying signal.
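To put rough numbers on both points, here is a quick check in Python. The inflation factors are just the arithmetic from two paragraphs up; for the beauty-and-sex-ratio example, the base rate and the 0.3-percentage-point effect size are assumptions of mine for illustration, meant to convey the order of magnitude being discussed, not estimates from that study.

```python
from scipy.stats import norm

# The "just increase n" heuristic: scale the sample size by (target z / observed z)^2.
print((2.0 / 1.5) ** 2)  # about 1.8x, to push a 1.5-s.e. result past 2 s.e.
print((2.8 / 1.5) ** 2)  # about 3.5x, for nominal 80% power at the observed effect size

# A power calculation for the beauty-and-sex-ratio example under a small assumed
# true effect: base Pr(girl) about 0.485, true difference 0.003 (0.3 percentage
# points). Both numbers are assumptions for illustration.
p, delta = 0.485, 0.003
z = norm.ppf(0.975) + norm.ppf(0.80)  # alpha = 0.05 two-sided, 80% power
n_per_group = 2 * p * (1 - p) * (z / delta) ** 2
print(n_per_group)  # roughly 435,000 per group, i.e., around a million people in total
```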

Dean:

Yes, that’s a good warning. Of course, for those who know better, obviously the original, extra noisy point estimate is not really helpful at all for power analysis. I think trying to use pilots to somehow get an initial estimate of an effect (rather than learn other things) is a common trap.

Yeah, and rules of thumb about what a "big sample" is can lead you astray. I’ve run some experiments where if you’d run the same experiment with 100k people, it would have been hopeless. A couple slides from my recent course on this point are attached… involving the same "failed" experiment as the second part of my recent post https://statmodeling.stat.columbia.edu/2023/10/16/getting-the-first-stage-wrong/

P.S. Regarding the title of this post, I’m not saying that randomization is always bad or usually bad or that it’s bad on net. What I’m saying is that it can be bad, and it can be bad in important situations, the kinds of settings where people want to find an effect.

To put it another way, suppose we start by looking at a study with randomization. Would it be better without random assignment, just letting participants in the study pick their treatments or having them assigned in some other way that would be subject to unmeasurable biases? No, of course not. Randomization doesn’t make a study worse. What it can do is give researchers and consumers of research an inappropriately warm and cozy feeling, leading them to not look at serious problems of interpretation of the results of the study, for example, extracting large and unreproducible results from small noisy samples and then using inappropriately applied statistical models to label such findings as “statistically significant.”

“Andrew, you are skeptical of pretty much all causal claims. But wait, causality rules the world around us, right? Plenty have to be true.”

Awhile ago, Kevin Lewis pointed me to this article that was featured in the Wall Street Journal. Lewis’s reaction was, “I’m not sure how robust this is with just some generic survey controls. I’d like to see more of an exogenous assignment.” I replied, “Nothing wrong with sharing such observational patterns. They’re interesting. I don’t believe any of the causal claims, but that’s ok, description is fine,” to which Lewis responded, “Sure, but the authors are definitely selling the causal claim.” I replied, “Whoever wrote that looks like they had the ability to get good grades in college. That’s about it.”

At this point, Alex Tabarrok, who’d been cc-ed on all this, jumped in to say, quite reasonably, “Andrew, you are skeptical of pretty much all causal claims. But wait, causality rules the world around us, right? Plenty have to be true.”

I replied to Alex as follows:

There are lots of causal claims that I believe! For this one, there are two things going on. First, do I think the claim is true? Maybe, maybe not, I have no idea. I certainly wouldn’t stake my reputation on a statement that the claim is false. Second, how relevant do I think this sort of data and analysis are to this claim? My answer: a bit relevant but not very. When I think about the causal claims that I believe, my belief is usually not coming from some observational study.

Regarding, “Plenty have to be true.” Yup, and that includes plenty of statements that are the opposite of what’s claimed to be true. For example, a few years ago a researcher preregistered a claim that exposure to poor people would cause middle-class people to have more positive views regarding economic redistribution policies. The researcher then did a study, found the opposite result (not statistically significant, but whatever), and then published the results and claimed that exposure to poor people would reduce middle-class people’s support for redistribution. So what do I believe? I believe that for most people, an encounter (staged or otherwise) with a person on the street would have essentially no effects on their policy views. For some people in some settings, though, the encounter could have an effect. Sometimes it could be positive, sometimes negative. In a large enough study it would be possible to find an average effect. The point is that plenty of things have to be true, but estimating average causal effects won’t necessarily find any of these things. And this does not even get into the difficulty with the study linked by Kevin, where the data are observational.

Or, for another example, sure, I believe that early childhood intervention can be effective in some cases. That doesn’t give me any obligation to believe the strong claims that have been made on its behalf using flawed data analysis.

To put it another way: the authors of all these studies should feel free to publish their claims. I just think lots of these studies are pretty random. Randomness can be helpful. Supposedly Philip K. Dick used randomization (the I Ching) to write some of his books. In this case, the randomization was a way to jog his imagination. Similarly, it could be that random social science studies are useful in that they give people an excuse to think about real problems, even if the studies themselves are not telling us what the researchers claim.

Finally, I think there’s a problem in social science that researchers are pressured to make strong causal claims that are not supported by their data. It’s a selection bias. Researchers who just make descriptive claims are less likely to get published in top journals, get newspaper op-eds, etc. This is just some causal speculation of my own: if the authors of this recent study had been more clear (to themselves and to others) that their conclusions are descriptive, not causal, none of us would’ve heard about the study in the first place.

Summary

There’s a division of labor in metascience as well as in science. I lean toward skepticism, to the extent that there must be cases where I don’t get around to thinking seriously about new ideas or results that are actually important. Alex leans toward openness, to the extent that there must be cases where he goes through the effort of working out the implications of results that aren’t real. It’s probably a good thing that the science media includes both of us. We play different roles in the system of communication.

Every time Tyler Cowen says, “Median voter theorem still underrated! Hail Anthony Downs!”, I’m gonna point him to this paper . . .

Here’s Cowen’s post, and here’s our paper:

Moderation in the pursuit of moderation is no vice: the clear but limited advantages to being a moderate for Congressional elections

Andrew Gelman and Jonathan N. Katz

September 18, 2007

It is sometimes believed that it is politically risky for a congressmember to go against his or her party. On the other hand, Downs’s familiar theory of electoral competition holds that political moderation is a vote-getter. We analyze recent Congressional elections and find that moderation is typically worth less than about 2% of the vote. This suggests there is a motivation to be moderate, but not to the exclusion of other political concerns, especially in non-marginal districts. . . .

Banning the use of common sense in data analysis increases cases of research failure: evidence from Sweden

Olle Folke writes:

I wanted to highlight a paper by an author who has previously been featured on your blog when he was one of the co-authors of a paper on the effect of strip clubs on sex crimes in New York. This paper looks at the effect of criminalizing the buying of sex in Sweden and finds a 40-60% increase. However, the paper is just as problematic as the one on strip clubs. In what I view as his two main specifications he uses the timing of the ban to estimate the effect. However, while there is no variation across regions, he uses regional data to estimate the effect, which of course does not make any sense. Not surprisingly there is no adjustment for the dependence of the error term across observations.

What makes this analysis particularly weird is that there actually is no shift in the outcome if we use national data (see figure below). So basically the results must have been manufactured. As the author has not posted any replication files it is not possible to figure out what he has done to achieve the huge increase.

I think that his response to this critique is that he has three alternative estimation methods. However, these are not very convincing, and my suspicion is that none of those results would hold up to scrutiny. Also, I find the use of alternative methods both strange and problematic. First, it suggests that no single method is convincing in itself. However, doing four additional problematic analyses does not make the first one better. Also, it gives the author an out when criticized, as it involves a lot of labor to work through each analysis (especially when there are no replication data).

I took a look at the linked paper, and . . . yeah, I’m skeptical. The article begins:

This paper leverages the timing of a ban on the purchase of sex to assess its impact on rape offenses. Relying on Swedish high-frequency data from 1997 to 2014, I find that the ban increases the number of rapes by around 44–62%.

But the above graph, supplied by Folke, does not show any apparent effect at all. The linked paper has a similar graph using monthly data that also shows nothing special going on in 1999:

This one’s a bit harder to read because of the two axes, the log scale, and the shorter time frame, but the numbers seem similar. In the time period under study, the red curve is around 5.0 on the log scale per month, so 12*exp(5) ≈ 1781 per year, and the annual curve is around 2000, so that seems to line up.
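Just to make that arithmetic explicit, here is a minimal sketch; the 5.0 and 2000 are my rough readings off the two graphs, not exact values:

```python
import math

# Quick consistency check between the monthly and annual graphs.
# The 5.0 (log rapes per month) and 2000 (rapes per year) are rough
# readings off the two plots, not exact values.
monthly_log_count = 5.0
implied_annual_total = 12 * math.exp(monthly_log_count)
print(round(implied_annual_total))   # about 1781, in the same ballpark as 2000
```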

So, not much going on in the aggregate. But then the paper says:

Several pieces of evidence find that rape more than doubled after the introduction of the ban. First, Table 1 finds that the average before the ban is around 6 rapes per region and month, while after the introduction is roughly 12. Second, Table 2 presents the results of the naive analysis of regressing rape on a binary variable taking value 0 before the ban and 1 after, controlling for year, month, and region fixed effects. Results show that the post ban period is associated with an increase of around 100% of cases of rape in logs and 125% of cases of rape in the inverse hyperbolic sine transformation (IHS, hereafter). Third, a simple descriptive exercise –plotting rape normalized before the ban around zero by removing pre-treatment fixed effects– encounters that rape boosted around 110% during the sample period (Fig. 4).

OK, the averages don’t really tell us anything much at all: they’re looking at data from 1997-2014, the policy change happened in 1999, in the midst of a slow increase, and most of the change happened after 2004, as is clearly shown in Folke’s graph. So Table 1 and Table 2 are pretty much irrelevant.
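To see why the before/after averages are uninformative here, consider a toy monthly series with a smooth upward trend and no jump at all in 1999 (made-up numbers chosen only to mimic the shape of the problem, not the Swedish data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy monthly counts for 1997-2014: a smooth upward trend and NO jump in 1999.
months = 1997 + np.arange(18 * 12) / 12.0
expected = 6 + 0.75 * (months - 1997)        # rises steadily over the whole period
y = rng.poisson(expected)

before = y[months < 1999].mean()
after = y[months >= 1999].mean()
print(f"mean before the ban: {before:.1f}")
print(f"mean after the ban:  {after:.1f}")
# The post-ban average is roughly twice the pre-ban average even though nothing
# at all happens in 1999: the before/after comparison just picks up the trend.
```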

But what about Figure 4:

This looks pretty compelling, no?

I dunno. The first thing is that the claim of “more than doubling” relies very strongly on the data after 2004. log(2) = 0.69, and if you look at that graph, the points only reach 0.69 around 2007, so the inference is leaning very heavily on the model by which the treatment causes a steady annual increase, rather than a short-term change in level at the time of the treatment. The other issue is the data before 1999, which in this graph are flat but in the two graphs shown earlier in this post showed an increasing trend. That makes a big difference in Figure 4! Replace that flat line pre-1999 with a positively-sloped line, and the story looks much different. Indeed, that line is soooo flat and right on zero that I wonder if this is an artifact of the statistical fitting procedure (“Pre-treatment fixed effects are removed from the data to normalize the number of rapes around zero before the ban.”). I’m not really sure. The point is that something went wrong.

They next show their regression discontinuity model, which fits a change in level rather than slope:

There’s something else strange going on here: if they’re really fitting fixed effects for years, how can they possibly estimate a change over time? This is not making a lot of sense.
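Here is a minimal simulated sketch of the identification problem (my own toy setup, not the paper’s code): a treatment indicator that is constant within years is a linear combination of the year dummies, so a regression with year fixed effects has nothing left with which to estimate it.

```python
import numpy as np

# Toy panel: 20 regions x 12 months x years 1997-2014, treatment = (year >= 1999).
# With a dummy for every year plus the treatment indicator, the treatment column
# equals the sum of the post-1999 year dummies, so the design matrix is
# rank-deficient and the treatment effect is not identified.
years = np.arange(1997, 2015)
n_regions, n_months = 20, 12

year_per_obs = np.repeat(years, n_regions * n_months)
treatment = (year_per_obs >= 1999).astype(float)
year_dummies = (year_per_obs[:, None] == years[None, :]).astype(float)

X = np.column_stack([year_dummies, treatment])
print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
# The rank comes out one less than the number of columns: the treatment coefficient
# is not estimable from this design without some coding error or software quirk.
```

This is essentially the point in Folke’s note below: any estimate that does come out of such a specification has to be an artifact of how the collinearity got handled.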

I’m not going to go through all of this paper in detail; I just did the above quick checks to get a rough sense of what was going on and to make sure I didn’t see anything immediately wrong with Folke’s basic analysis.

Folke continued:

The paper is even stranger than I had expected. I have gotten part of the regression code, and he is estimating models that would not yield any estimate of the treatment if there were no coding error (treatment is constant within years, but he includes year fixed effects). Also, when I do the RD analysis he claims to be doing, I get the figure below, in which there clearly is not a jump of 0.6 log points…

What the hell????

This one goes into the regression discontinuity hall of fame.

The next day, Folke followed up:

It took some digging and coding to figure out how the author was able to find such a large effect. We [Joop Adema, Olle Folke, and Johanna Rickne] have now written up a draft of a comment where we show that it is all based on a specification error, and that he ends up estimating something entirely different from what he claims to be estimating.

The big picture, or, how can this sort of error be avoided or its consequences mitigated

Look, everybody makes mistakes. Statistical models are hard to fit and interpret, data can be a mess, and social science theories are vague enough that if you’re not careful you can explain just about anything.

Still, it looks like this paper was an absolute disaster and a bit of an embarrassment for the Journal of Population Economics, which published it.

Should the problems have been noticed earlier? I’d argue yes.

The problems with the regression discontinuity model—OK, we’re not gonna expect the author, reviewers, or editors of a paper to look too carefully at that—it’s a big ugly equation, after all—and we can’t expect author, reviewers, or editors to check the code—that’s a lot of work, right? Equations that don’t make sense, that’s just the cost of doing business.

The clear problem is the pattern in the aggregate data, the national time series that shows no jump in 1999.

I’m not saying that, just cos there’s no jump in 1999, that the policy had no effect. I’m just saying that the lack of jump in 1999 is right there for everyone to see. At the very least, if you’re gonna claim you found an effect, you’re under the scientific obligation to explain how you found that effect given the lack of pattern in the aggregate data. Such things can happen—you can have an effect that happens to be canceled out in the data by some other pattern at the same time—but then you should explain it, give that trail of breadcrumbs.

So, I’m not saying the author, reviewers, and editors of that paper should’ve seen all or even most of the problems with this paper. What I am saying is that they should’ve engaged with the contradiction between their claims and what was shown by the simple time series. To have not done this is a form of “scientism,” a kind of mystical belief in the output of a black box, a “believe the stats, not your lying eyes” kind of attitude.

Also, as Folke points out, the author of this paper has a track record of extracting dramatic findings using questionable data analysis.

I have no reason to think that the author is doing things wrong on purpose. Statistics is hard! The author’s key mistakes in these two papers have been:

1. Following a workflow in which contrary indications were ignored or set aside rather than directly addressed.

2. A lack of openness to the possibility that the work could be fatally flawed.

3. Various technical errors, including insufficient concern about data quality, a misunderstanding of regression discontinuity checks, and an inappropriate faith in robustness checks.

In this case, Adema, Folke, and Rickne did a lot of work to track down what went wrong in that published analysis. A lot of work for an obscure paper in a minor journal. But the result is a useful general lesson, which is why I’m sharing the story here.

Bayesian inference with informative priors is not inherently “subjective”

The quick way of saying this is that using a mathematical model informed by background information to set a prior distribution for a logistic regression is no more “subjective” than deciding to run a logistic regression in the first place.
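To make that concrete, here is a minimal sketch with made-up data (my illustration, not anything from the article): the prior enters the fit in exactly the same way as the rest of the model, as one more explicitly stated assumption.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)

# Small, noisy logistic-regression dataset: 50 observations, one predictor.
n = 50
x = rng.normal(size=n)
y = rng.binomial(1, expit(0.5 * x))
X = np.column_stack([np.ones(n), x])                      # intercept + slope

def neg_log_posterior(beta, prior_sd):
    eta = X @ beta
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))    # Bernoulli-logit likelihood
    log_prior = -0.5 * np.sum(beta ** 2) / prior_sd ** 2  # independent normal(0, prior_sd) priors
    return -(log_lik + log_prior)

flat = minimize(neg_log_posterior, np.zeros(2), args=(1e6,)).x         # prior so wide it is effectively flat
informative = minimize(neg_log_posterior, np.zeros(2), args=(1.0,)).x  # normal(0, 1) prior

print("nearly flat prior (about the maximum likelihood estimate):", np.round(flat, 2))
print("normal(0, 1) prior (posterior mode):                      ", np.round(informative, 2))
# Both fits assume the logistic functional form; the informative prior is just
# one more modeling assumption, stated explicitly rather than left implicit.
```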

Here’s a longer version:

Every once in a while you get people saying that Bayesian statistics is subjective bla bla bla, so every once in a while it’s worth reminding people of my 2017 article with Christian Hennig, Beyond subjective and objective in statistics. Lots of good discussion there too. Here’s our abstract:

Decisions in statistical data analysis are often justified, criticized or avoided by using concepts of objectivity and subjectivity. We argue that the words ‘objective’ and ‘subjective’ in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. Together with stability, these make up a collection of virtues that we think is helpful in discussions of statistical foundations and practice.

The advantage of these reformulations is that the replacement terms do not oppose each other and that they give more specific guidance about what statistical science strives to achieve. Instead of debating over whether a given statistical method is subjective or objective (or normatively debating the relative merits of subjectivity and objectivity in statistical practice), we can recognize desirable attributes such as transparency and acknowledgement of multiple perspectives as complementary goals. We demonstrate the implications of our proposal with recent applied examples from pharmacology, election polling and socio-economic stratification. The aim of the paper is to push users and developers of statistical methods towards more effective use of diverse sources of information and more open acknowledgement of assumptions and goals.

Philip K. Dick’s character names

The other day I was thinking of some of the wonderful names that Philip K. Dick gave to his characters:
Joe Chip
Glen Runciter
Bob Arctor
Palmer Eldritch
Perky Pat

And, of course, Horselover Fat.

My personal favorite names from these stories are Ragle Gumm from Time out of Joint, and Addison Doug, the main character in an obscure spaceship/time-travel story from 1974.

I feel like it shows a deep confidence to give your characters this sort of name. As names, they’re off, but at the same time they’re just right in context. “Addison Doug,” indeed.

Some authors are good at titles, some are good at last lines, some are good at names. So many books, even great books, have character names that are boring or too cute or just fine, but no more than just fine. To come up with these distinctive names is a high-risk ploy that, when it works, adds something special to the whole story.

The contrapositive of “Politics and the English Language.” One reason writing is hard:

In his classic essay, “Politics and the English Language,” the political journalist George Orwell drew a connection between cloudy writing and cloudy content.

The basic idea was: if you don’t know what you’re saying, or if you’re trying to say something you don’t really want to say, then one strategy is to write unclearly. Conversely, consistently cloudy writing can be an indication that the writer ultimately doesn’t want to be understood.

In Orwell’s words:

[The English language] becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts.

He continues:

In our time, political speech and writing are largely the defence of the indefensible. Things like the continuance of British rule in India, the Russian purges and deportations, the dropping of the atom bombs on Japan, can indeed be defended, but only by arguments which are too brutal for most people to face, and which do not square with the professed aims of the political parties. Thus political language has to consist largely of euphemism, question-begging and sheer cloudy vagueness.

A few years ago I posted on this topic, drawing an analogy to cloudy writing in science. To be sure, much of the bad writing in science comes from researchers who have never learned to write clearly. Writing is hard!

But it’s not just that. A key problem with a lot of the bad science that we see featured in PNAS, Ted, NPR, Gladwell, Freakonomics, etc., is that the authors are trying to use statistical analysis and storytelling to do something they can’t do with their science, which is to draw near-certain conclusions from noisy data that can’t support strong conclusions. This leads to tortured constructions such as this from a medical journal:

The pair‐wise results (using paired‐samples t‐test as well as in the mixed model regression adjusted for age, gender and baseline BMI‐SDS) showed significant decrease in BMI‐SDS in the parents–child group both after 3 and 24 months, which indicate that this group of children improved their BMI status (were less overweight/obese) and that this intervention was indeed effective.

However, as we wrote in the results and the discussion, the between group differences in the change in BMI‐SDS were not significant, indicating that there was no difference in change in our outcome in either of the interventions. We discussed, in length, the lack of between‐group difference in the discussion section. We assume that the main reason for the non‐significant difference in the change in BMI‐SDS between the intervention groups (parents–child and parents only) as compared to the control group can be explained by the fact that the control group had also a marginal positive effect on BMI‐SDS . . .

Obv not as bad as political journalists in the 1930s defending Stalin’s purges or whatever; the point is that the author is in the awkward position of trying to use the ambiguities of language to say something while not quite saying it. Which leads to unclear and barely readable writing, not just by accident.

The writing and the statistics have to be cloudy, because if they were clear, the emptiness of the conclusions would be apparent.
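To see how the statistics in that passage can be simultaneously “significant” and empty, here is a toy simulation (my own made-up numbers, nothing to do with the actual trial): both arms improve from baseline, so both within-group tests come out looking good, while the between-group comparison, the one that actually bears on whether the intervention worked, has no real effect in it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy version of the BMI example: both arms drift down a bit from baseline
# (regression to the mean, a secular trend, whatever); the treatment adds nothing.
n = 100
baseline_tx   = rng.normal(2.0, 0.5, n)
baseline_ctrl = rng.normal(2.0, 0.5, n)
followup_tx   = baseline_tx   - 0.15 + rng.normal(0, 0.3, n)
followup_ctrl = baseline_ctrl - 0.15 + rng.normal(0, 0.3, n)

# Within-group paired tests: both arms really did decline, so both tests
# will typically come out "significant" . . .
print("treatment arm, paired t-test p:", stats.ttest_rel(baseline_tx, followup_tx).pvalue)
print("control arm,   paired t-test p:", stats.ttest_rel(baseline_ctrl, followup_ctrl).pvalue)

# . . . but the between-group comparison of the changes is estimating a true
# difference of zero, so it says nothing about the treatment being effective.
change_tx = followup_tx - baseline_tx
change_ctrl = followup_ctrl - baseline_ctrl
print("between-group t-test p:        ", stats.ttest_ind(change_tx, change_ctrl).pvalue)
```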

The problem

Orwell’s statement, when transposed to writing a technical paper, is that if you attempt to cover the gaps in your reasoning with words, this will typically yield bad writing. Indeed, if you’re covering the gaps in your reasoning with words, you’ll either have bad writing or dishonest writing, or both. In some important way, it’s a good thing that this sort of writing is so hard to follow; otherwise it could be really misleading.

Now let’s flip it around.

Often you will find yourself trying to write an article, and it will be very difficult to write it clearly. You’ll go around and around, and whatever you do, your written output will feel like the worst of both worlds: a jargon-filled mess that is at the same time sloppy and imprecise. Try to make it more readable and it becomes even sloppier and harder to follow at a technical level; try to make it accurate and precise, and it reads like a complicated, uninterpretable set of directions.

You’re stuck. You’re in a bad place. And any direction you take makes the writing worse in some important way.

What’s going on?

It could be this: You’re trying to write something you don’t fully understand, you’re trying to bridge a gap between what you want to say and what is actually justified by your data and analysis . . . and the result is “Orwellian,” in the sense that you’re desperately using words to try to paper over this yawning chasm in your reasoning.

The solution

One way out of this trap is to follow what we could call Orwell’s Contrapositive.

It goes like this: Step back. Pause in whatever writing you’re doing. Pull out a new sheet of paper (or an empty document on the computer) and write, as directly as you can, in two columns. Column 1 is what you want to be able to say (the method is effective, the treatment saves lives, whatever); Column 2 is what is supported by your evidence (the method works better than a particular alternative in a particular setting, fewer people died in the treatment than the control group after adjusting this and that, whatever).

At that point, do the work to pull Column 2 to Column 1, or make concessions to reality to shift Column 1 toward Column 2. Do what it takes to get them to line up.

At this point, you’ve left the bad zone in which you’re trying to say more than you can honestly say. And the writing should then go much smoother.

That’s the contrapositive: if bad writing is a sign of someone trying to say the indefensible, then you can make your writing better by not trying to say the indefensible, either by expanding what is legitimately defensible or by restricting what you’re trying to say.

Remember the folk theorem of statistical computing: When you have computational problems, often there’s a problem with your model. Orwell’s Contrapositive is a sort of literary analogy to that.

One reason writing is hard

To put it another way: One reason writing is hard is that we use writing to cover the gaps in our reasoning. This is not always a bad thing! On the way to the destination of covering these gaps is the important step of revealing these gaps. We write to understand. Writing has an internal logic that can protect us from (some) errors and gaps—if we let it, by reacting to the warning sign that the writing is unclear.

Hey! Here’s a study where all the preregistered analyses yielded null results but it was presented in PNAS as being wholly positive.

Ryan Briggs writes:

In case you haven’t seen this, PNAS (who else) has a new study out entitled “Unconditional cash transfers reduce homelessness.” This is the significance statement:

A core cause of homelessness is a lack of money, yet few services provide immediate cash assistance as a solution. We provided a one-time unconditional CAD$7,500 cash transfer to individuals experiencing homelessness, which reduced homelessness and generated net societal savings over 1 y. Two additional studies revealed public mistrust in homeless individuals’ ability to manage money and the benefit of counter-stereotypical or utilitarian messaging in garnering policy support for cash transfers. This research adds to growing global evidence on cash transfers’ benefits for marginalized populations and strategies to increase policy support. Although not a panacea, cash transfers may hasten housing stability with existing social supports. Together, this research offers a new tool to reduce homelessness to improve homelessness reduction policies.

Based on that, I was surprised to read the pre-registration documents and supplemental information and learn that literally none of the outcomes that the researchers pre-registered were significant. Even the variable that they chose to focus on (days homeless) was essentially the same in the 12 month follow up (0.18 vs 0.17) and, just eyeballing Table S3, it seems the differences were rarely large and not ever significant in any single follow up period.

This is now generating news coverage about how cash transfers work to reduce homelessness (e.g., here and here).

I guess in a sense pre-registration worked because we can see that they did not expect this and had to explore to find it, but what good does that do if the press just reports it all credulously?

I have mixed feelings on this one. On one hand, I don’t like the whole statistical-significance-thresholding thing: if the study found positive results, this could be worth reporting, even if the results are within the margin of error. This within-the-margin-of-error bit should just be mentioned in the news articles. On the other hand, if the researchers are rummaging around through their results looking for something big to report, then, yeah, these results will be massively biased upward.

So, from that perspective, maybe a good headline would not be, “Homeless people were given lump sums of cash. Their spending defied stereotypes” or “B.C. researchers studied how homeless people spent a $7,500 handout. Here’s what they found,” but rather something like, “Preliminary results from a small study suggest . . .”

But then we could step back and ask, How did this study get the press in the first place? I’m guessing PNAS is the reason. So let’s head to the PNAS paper. From the abstract:

Exploratory analyses showed that over 1 y, cash recipients spent fewer days homeless, increased savings and spending with no increase in temptation goods spending, and generated societal net savings of $777 per recipient via reduced time in shelters.

I guess that “exploratory analysis” is code for non-preregistered or non-statistically-significant. Either way, I think it’s irresponsible and statistically incorrect—although, regrettably, absolutely standard practice—to report this “$777” without any regularization or partial pooling toward zero. It’s a biased estimate, and the bias could be huge.
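For a sense of what “regularization or partial pooling toward zero” would do in the simplest case, here is a sketch of a normal-normal shrinkage estimate; the standard error and prior scale below are invented for illustration, not taken from the study.

```python
# Normal-normal partial pooling: observed estimate y with standard error s,
# prior that the true effect is centered at zero with scale tau.
def shrink_toward_zero(y, s, tau):
    # Posterior mean is a precision-weighted average of the raw estimate
    # and the prior mean of zero.
    w = (1 / s**2) / (1 / s**2 + 1 / tau**2)
    return w * y

# Hypothetical inputs, loosely in the spirit of the $777 figure:
raw_estimate = 777      # net societal savings per recipient, as reported
standard_error = 600    # invented; the point is just that the estimate is noisy
prior_scale = 300       # invented; plausible effects assumed mostly within a few hundred dollars

print(shrink_toward_zero(raw_estimate, standard_error, prior_scale))
```

With these (invented) inputs the partially pooled estimate comes out around $155. The exact amount of shrinkage depends on the assumed scales, but the direction is the point: a noisy estimate that gets reported because it looks big should be pulled toward zero, not taken at face value.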

Figure 1 of the paper looks very impressive! This figure displays 35 outcomes, almost all of which go in a positive direction (fewer days homeless, more days in stable housing, higher value of savings, . . . all the way down to lower substance use severity, lower cost of all service use, and lower cost of shelter use). The very few negative outcomes were tiny compared to their uncertainty. If you look at Figure 1, the evidence looks overwhelming.

But Figure 1 does not seem like such a great summary of the data displayed elsewhere in the paper. Looking at Table 3, the good stuff all seems to be happening in the 1-month and 3-month followups, without much happening after 1 year.

Here’s what the authors wrote:

The preregistered analyses yielded null effects in cognitive and well-being outcomes, which could be due to the low statistical power from the small participant number in each condition or the possibility that any effect on cognition and well-being may take more than 1 mo to show up.

I agree that these null findings should be mentioned right up there in the abstract. They should also include the possibility that the treatment really has no consistent effect on these outcomes. It’s kinda lame to give all these alibis and never even consider that maybe there’s nothing going on.

What about the housing effects going away after a year? The authors write:

First, the cost of living is extremely high in Vancouver, and the majority of the cash was spent within the first 3 mo for most recipients. Second, while the cash provided immediate benefits, control participants eventually “caught up” over time.

On the other hand, here’s what they said about a different result:

By combining the two cash and two noncash conditions to increase statistical power, exploratory analyses showed that cash recipients showed higher positive affect at 1 mo and higher executive function at 3 mo. Based on debriefing, participants expressed that while they were initially happy with the cash transfer, moving out of homelessness into stable housing took substantial efforts and hard work in the first few months, which could explain the delayed effect on cognitive function.

They’ve successfully convinced me that they have the ability to explain any possible result they might find.

The thing that bothers me most about the paper is that the authors don’t seem to have wrestled with the ways in which their results seem to refute their theoretical framework. Their choice of what to preregister suggests that they were expecting to find large effects on cognitive and subjective well-being outcomes and then maybe, if they were lucky, they’d find some positive results on financial and housing outcomes. I guess their theory was that the money would give people a better take on life, which could then lead to material benefits. Actually, though, they found no benefits on the cognitive and subjective outcomes—when I say “no benefits,” I mean, yeah, really nothing, not just nothing statistically significant—but the money did seem to help people pay the rent for the first few months. That’s fine—there are worse things than giving low-income people some money to pay the rent!—it’s just a different story than the one they started with. It’s less of a psychology story and more of an economics story. In any case, yeah, further study is required. I just think that they could get the most from their existing study if they thought more about what went wrong with their theory.

Hey—let’s collect all the stupid things that researchers say in order to deflect legitimate criticism

When rereading this post the other day, I noticed the post that came immediately before.

I followed the link and came across the delightful story of a researcher who, after one of his papers was criticized, replied, “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)” One of the critics responded with appropriate disdain, writing:

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become . . . a vast graveyard of undead theories?

I agree. To put it another way: Yes, ha ha ha, let’s spend our time on guitar practice rather than exhuming 11-year-old published articles. Fine—I’ll accept that, as long as you also accept that we should not be citing 11-year-old articles.

As is so often the case, the authors of published work are happy to get unthinking positive publicity and citations, but when anything negative comes in, they pull up the drawbridge.

From the perspective of the ladder of responses to criticism, the above behavior isn’t so bad: they’re not suing their critics, or using surrogates to attack their critics, or labeling anybody as suicide bombers or East German secret police; they’re just trying to laugh it off. From a scientific perspective, though, it’s still pretty bad to act as if there’s something wrong with discussing the flaws of a paper that’s still being cited, just cos it’s a decade old.

Putting together a list

Anyway, this made me think of a fun project, which is to list all the different ways that researchers try to avoid addressing legitimate criticism of their published work.

Here are a few responses we’ve seen. I won’t bother finding the links right now, but if we put together a good list, I can go back and provide references for all of them.

1. The corrections do not affect the main results of the paper. (Always a popular claim, even if the corrections actually do affect the main results of the paper.)

2. The criticism should be dismissed because the critics are obsessive/Stasi/terrorists, etc. (Recall the Javert paradox.)

3. The critics are jealous losers sniping at their betters. Or, if that doesn’t work, the critics are picking on unfortunate young researchers. (I don’t think it does any favors to researchers of any age to exempt their work from criticism.)

4. The criticism is illegitimate if it does not go through the peer-review process. (A hard claim to swallow given how the peer-review process is rigged against criticism of published papers.)

5. Criticism should be a discreet exchange between author and critic, with no public criticism. (But the people who claim to hold that attitude seem to have no problem when their work is cited or praised in a public way.)

The most common response to criticism seems to be to just ignore it entirely and hope it goes away. Unfortunately, that strategy often seems to work very well!

Jonathan Bailey vs. Stephen Wolfram

Key quote:

While there are definitely environments where using a ghostwriter is acceptable, academic publishing typically isn’t one of them.

The reason is simple: Using a ghostwriter on an academic paper entails having an author do significant work on the paper without receiving credit or having their work disclosed. This is broadly seen as a breach of authorship and an act of research misconduct unto itself.

Why are all these school cheating scandals happening?

Paul Alper writes:

While the national scene is all about woke, book banning, and the like, apparently Columbia University is still dealing with the long-standing conundrum of the best way to teach kids how to read.

He’s referring to this news article, “Amid Reading Wars, Columbia Will Close a Star Professor’s Shop,” which begins:

Lucy Calkins ran a beloved — and criticized — center at Teachers College for four decades. It is being dissolved. . . .

Her curriculum had teachers conduct “mini-lessons” on reading strategies, but also gave students plenty of time for silent reading and freedom to choose their own books. Supporters say those methods empower children, but critics say they waste precious classroom minutes, and allow students to wallow in texts that are too easy.

Some of the practices she once favored, such as prompting children to guess at words using the first letter and context clues, like illustrations, have been discredited.

Over the past three years, several prominent school districts — including New York City, the nation’s largest — dropped her program, though it remains in wide use. . . .

Critics of her ideas, including some cognitive scientists and instructional experts, said her curriculum bypassed decades of settled research, often referred to as the science of reading. That body of research suggests that direct, carefully sequenced instruction in phonics, vocabulary building and comprehension is more effective for young readers than Dr. Calkins’s looser approach.

Alper writes:

This article did not at all mention anything about language specifics. I bring this up because my granddaughters are in a Minneapolis Spanish immersion primary school. Because Spanish is almost 100 per cent phonetic, and English is terrible in this regard, they spell and read better in Spanish than they do in English. The mechanics of learning to read back in my day were simple and devoid of theory or disagreement. You kept at it until you got it right. Then it was English only, because no accommodation was made for special needs, for immigrants, or for the outside world in general.

I know some people at Teachers College but I’ve never encountered Prof. Calkins, nor have I ever looked at the literature on language teaching. So I got nothin’ on this one.

But I did reply that the above story isn’t half as bad as this one from a few years back, which I titled, “What’s the stupidest thing the NYC Department of Education and Columbia University Teachers College did in the past decade?” It involved someone who was found to be a liar, a cheat, and a thief, and then, with all that known, was hired for two jobs as a school principal! And then a Teachers College professor said, “We felt that on balance, her recommendations were so glowing from everyone we talked to in the D.O.E. that it was something that we just were able to live with.” This came out in the news after the principal in question was found to have “forged answers on students’ state English exams in April because the students had not finished the tests.” Quelle surprise, no? A liar/cheat/thief gets a new job doing the same thing and then does more lying and cheating (maybe no stealing that time, though).

Alper responded:

You wrote that in 2015 which is about the same time as this story which made Fani Willis RICO famous:

Her most prominent case was her prosecution of the Atlanta Public Schools cheating scandal. Willis, an assistant district attorney at the time, served as lead prosecutor in the 2014 to 2015 trial of twelve educators accused of correcting answers entered by students to inflate the scores of state administered standardized tests.

SAT and all the others did not exist in my 1950 NYC school days, but I believe we did have the so-called Regents Exams, and they are still around. It never crossed my mind that the scoring of those exams was not on the up and up. Was I being naive? Was there more honesty and/or less messing around back then, or was it just not financially worth it?

Here’s my response:

1. This particular form of cheating sounds no easier or harder now than in the past.

2. In the past (i.e., somewhere between 1950 and 2015), tests were important for students but not so much for schools. So, yeah, students may have been motivated to cheat, but teachers and school administrators did not have any motivation, either to help students cheat or to massively cheat on their own. Nowadays, tests can be high stakes for the school administrators, and so, for some of them, cheating is worth the risk.