The data are on a 1-5 scale, the mean is 4.61, and the standard deviation is 1.64 . . . What’s so wrong about that??

James Heathers reports on the article, “Contagion or restitution? When bad apples can motivate ethical behavior,” by Gino, Gu, and Zhong (2009):

There is some sentiment data reported in Experiment 3, which seems to be reported in whole units.

They also indicated how guilty they would feel about the behavior of the person who took all the money along with some unrelated emotional measures (1 = not at all, 5 = very much)… participants in the in-group selfish condition felt more guilty (M = 4.61, SD = 1.64) about the person’s selfish behavior than the participants in the out-group selfish condition (M = 3.26, SD = 1.54), t(80) = 3.82, p < .001.

If you have a 1 to 5 scale, it isn’t possible to have M = 4.61, SD = 1.64.

Huh? Really? Yeah!

Let’s work it out. If your measurements are on a 1-5 scale, the way to maximize their standard deviation for any given mean is to put the data all at 1 and 5. If the mean is 4.61, that would imply that (4.61 – 1)/(5 – 1) = 0.9025 of the data take on the value 5, and 1 – 0.9025 = 0.0975 take on the value 1. (Just to check, 0.0975*1 + 0.9025*5 = 4.61.)

For this extreme dataset, the standard deviation is sqrt(0.0975*(1 – 4.61)^2 + 0.9025*(5 – 4.61)^2) = 1.19. So, yeah, there’s no way to get a standard deviation of 1.64 from these data. Just not possible!

Just to make sure, we can check our calculation via simulation:

n <- 1e6
y <- sample(c(1,5), n, replace=TRUE, prob=c(0.0975, 0.9025))
print(c(mean(y), sd(y)))

Here's what we get:

[1] 4.610172 1.186317


OK, let's try one more thing. Maybe n is so small that there's some kinda 1/sqrt(n-1) thing in the denominator driving the result? I don't think so. The trouble is that, to get a mean of 4.61, you need enough data (in his post, Heathers guesses "n=41 (as 189/41 = 4.6098)") that the difference between 1/sqrt(n) and 1/sqrt(n-1) wouldn't be enough to take you from 1.19 all the way up to 1.64 or even close. Also, it's kinda implausible that all the observations would be 1's and 5's anyway.
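Here's a quick Python check of that arithmetic (max_sd is just a helper name I'm making up for this post):

```python
import math

def max_sd(mean, lo=1, hi=5, n=None):
    # largest possible SD on a lo-to-hi scale for a given mean:
    # put all the data at the two endpoints
    p_hi = (mean - lo) / (hi - lo)
    var = (1 - p_hi) * (lo - mean) ** 2 + p_hi * (hi - mean) ** 2
    if n is not None:
        var *= n / (n - 1)  # sample SD divides by n - 1 instead of n
    return math.sqrt(var)

print(max_sd(4.61))        # about 1.19
print(max_sd(4.61, n=41))  # about 1.20 with the n - 1 correction: still nowhere near 1.64
```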

So what happened?

It's always easier to figure out what didn't happen than to figure out what did happen.

Here are some speculations.

One possibility is a typo, but Heathers doubts that because other calculations in the paper are consistent with the above-reported impossible numbers.

A related possibility is that this was a typo that was then propagated into the rest of the paper. For example, maybe the mean was actually 3.61, it was typed into the paper as 4.61, and then this typed-in number was used in later calculations. This would be bad workflow---you want all the computations to be done in a single script---but people use bad workflow all the time. I use bad workflow myself sometimes and end up with wrong numbers or wrongly-labeled graphs.

Another possibility is that the mean and standard deviation were calculated from two different datasets. That might sound kind of weird, but it can happen all the time, due to sloppiness or because of goofs in data processing. For example, you read in the data, calculate the mean and standard deviation for each variable, then perform some data-exclusion rule, perhaps removing data with incomplete responses to some of the questions, then you do further statistical analysis, recalculating the mean and standard deviation, among other things---but then when you pull together your numbers, you take the mean from some place and the standard deviation from the other place.

Yet another possibility is that someone involved in the data analysis or writeup was cheating in order to get a statistically-significant and thus publishable result, for example changing 3.61 to 4.61 to get a big fat difference but not touching the standard deviation. This would be a great way to cheat, because if you get caught, you can just say that you made a typo!

In any case, it's a fun little statistics example. And it's worth checking your data, even if you have no suspicion of cheating. I've often had incoherent data in problems I've worked on. Lots of things can go wrong in data processing and analysis, and we have to check things in all sorts of ways.

Infovis, infographics, and data visualization: My thoughts 12 years later

I came across this post from 2011, “Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go,” and it seemed to make sense to reassess where we are now, 12 years later.

From 2011:

I majored in physics in college and I worked in a couple of research labs during the summer. Physicists graph everything. I did most of my plotting on graph paper–this continued through my second year of grad school–and became expert at putting points at 1/5, 2/5, 3/5, and 4/5 between the x and y grid lines.

In grad school in statistics, I continued my physics habits and graphed everything I could. I did notice, though, that the faculty and the other students were not making a lot of graphs. I discovered and absorbed the principles of Cleveland’s The Elements of Graphing Data.

In grad school and beyond, I continued to use graphs in my research. But I noticed a disconnect in how statisticians thought about graphics. There seemed to be three perspectives:

1. The proponents of exploratory data analysis liked to graph raw data and never think about models. I used their tools but was uncomfortable with the gap between the graphs and the models, between exploration and analysis.

2. From the other direction, mainstream statisticians–Bayesian and otherwise–did a lot of math and fit a lot of models (or, as my ascetic Berkeley colleagues would say, applied a lot of procedures to data) but rarely made a graph. They never seemed to care much about the fit of their models to data.

3. Finally, textbooks and software manuals featured various conventional graphs such as stem-and-leaf plots, residual plots, scatterplot matrices, and q-q plots, all of which seemed appealing in the abstract but never did much for me in the particular applications I was working on.

In my article with Meng and Stern, and in Bayesian Data Analysis, and then in my articles from 2003 and 2004, I have attempted to bring these statistical perspectives together by framing exploratory graphics as model checking: a statistical graph can reveal the unexpected, and “the unexpected” is defined relative to “the expected”–that is, a model. This fits into my larger philosophy that puts model checking at the center of the statistical enterprise.

Meanwhile, my graphs have been slowly improving. I realized a while ago that I didn’t need tables of numbers at all. And here and there I’ve learned of other ideas, for example Howard Wainer’s practice of giving every graph a title.

I continued with some scattered thoughts about graphics and communication:

A statistical graph does not stand alone. It needs some words to go along with it to explain it. . . . I realized that our plots, graphically strong though they were, did not stand on their own. . . . This experience has led me to want to put more effort into explaining every graph, not merely what the points and lines are indicating (although that is important and can be hard to figure out in many published graphs) but also what is the message the graph is sending.

Most graphs are nonlinear and don’t have a natural ordering. A graph is not a linear story or a movie you watch from beginning to end; rather, it’s a cluttered house which you can enter from any room. The perspective you pick up if you start from the upstairs bathroom is much different than what you get by going through the living room–or, in graphical terms, you can look at clusters of points and lines, you can look at outliers, you can make lots of different comparisons. That’s fine but if a graph is part of a scientific or journalistic argument it can help to guide the reader a bit–just as is done automatically in the structuring of words in an article. . . .

While all this was happening, I also was learning more about decision analysis. In particular, Dave Krantz convinced me that the central unit of decision analysis is not the utility function or even the decision tree but rather the goal.

Applying this idea to the present discussion: what is the goal of a graph? There can be several, and there’s no reason to suppose that the graph that is best for achieving one of these goals will be optimal, or even good, for another. . . .

I’m a statistician who loves graphs and uses them all the time, I’m continually working on improving my graphical presentation of data and of inferences, but I’m probably stuck (without realizing it) in a bit of a rut of dotplots and lineplots. I’m aware of an infographics community . . .

Here’s an example of where I’m coming from: a blog post entitled, “Is the internet causing half the rapes in Norway? I wanna see the scatterplot.” To me, visualization is not an adornment or a way of promoting social science. Visualization is a central tool in social science research. (I’m not saying visualization is strictly necessary–I’m sure you can do a lot of good work with no visual sense at all–but I think it’s a powerful approach, and I worry about people who believe social science claims that they can’t visualize. I worry about researchers who believe their own claims without understanding them well enough to visualize the relation of these claims to the data from which they are derived.)

The rest of my post from 2011 discusses my struggles in communicating with the information visualization community–these are people who produce graphs for communication with general audiences, which motivates different goals and tools than those used by statisticians to communicate as part of the research process. Antony Unwin and I wrote a paper about these differences which was ultimately published with discussion in 2013 (and here is our rejoinder to the discussions).

Looking at all this a decade later, I’m not so interested in non-statistical information visualization anymore. I don’t mean this in a disparaging way! I think infofiz is great. Sometimes the very aspects of an infographic that make it difficult to read and deficient from a purely statistical perspective are a benefit for communication in that they can push the reader into thinking in new ways; here’s an example we discussed from a few years ago.

I continue to favor what we call the click-through solution: Start with the infographic, click to get more focused statistical graphics, click again to get the data and sources. But, in any case, the whole stat graphics vs. infographics thing has gone away, I guess because it’s clear that they can coexist; I don’t really see them as competing.

Whassup now?

Perhaps surprisingly, my graphical practices have remained essentially unchanged since 2011. I say “perhaps surprisingly,” because other aspects of my statistical workflow have changed a lot during this period. My lack of graphical progress is probably a bad thing!

A big reason for my stasis in this regard, I think, is that I’ve worked on relatively few large applied projects during the past fifteen years.

From 2004 through 2008, my collaborators and I were working every day on Red State Blue State. We produced hundreds of graphs and the equivalent of something like 10 or 20 research articles. In addition to our statistical goals of understanding our data and how they related to public opinion and voting, we knew from the start that we wanted to communicate both to political scientists and to the general public, so we were on the lookout for new ways to display our data and inferences. Indeed, we had the idea for the superplot before we ever made the actual graph.

Since 2008, I’ve done lots of small applied analyses for books and various research projects, but no big project requiring a rethinking of how to make graphs. The closest thing would be Stan, and here we have made some new displays–at least, new to me–but that work was done by collaborators such as Jonah Gabry, who did ShinyStan, and this hasn’t directly affected the sorts of graphs that I make.

I continue to think about graphs in new ways (for example, causal quartets and the ladder of abstraction), but, as can be seen in those new papers, the looks of my graphs haven’t really changed since 2011.

“Close but no cigar” unit tests and bias in MCMC

I’m coding up a new adaptive sampler in Python, which is super exciting (the basic methodology is due to Nawaf Bou-Rabee and Tore Kleppe). Luckily for me, another great colleague, Edward Roualdes, has been keeping me on the straight and narrow by suggesting stronger tests and pointing out actual bugs in the repository (we’ll open access it when we put the arXiv paper up—hopefully by the end of the month).

There are a huge number of potential fencepost (off by one), log-vs-exponential, positive-vs-negative, numerator-vs-denominator, and related errors to make in this kind of thing. For example, here’s a snippet of the transition code.

L = self.uturn(theta, rho)
LB = self.lower_step_bound(L)
N = self._rng.integers(LB, L)
theta_star, rho_star = self.leapfrog(theta, rho, N)
rho_star = -rho_star
Lstar = self.uturn(theta_star, rho_star)
LBstar = self.lower_step_bound(Lstar)
if not(LBstar <= N and N < Lstar):
    ... reject ...

Looks easy, right? Not quite. The uturn function returns the number of steps to get to a point that is one step past the U-turn point. That is, if I take L steps from (theta, rho), I wind up closer to where I started than if I take L - 1 steps. The rng.integers function samples uniformly, but it’s Python, so it excludes the upper bound and samples from {LB, LB + 1, ..., L - 1}. That’s correct, because I want to choose a number of steps greater than or equal to the lower bound and less than the point past which you’ve made a U-turn. Let’s just say I got this wrong the first time around.
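That half-open convention is easy to forget. A toy check, using the standard library's random.randrange, which excludes its upper bound just like NumPy's Generator.integers:

```python
import random

LB, L = 3, 7
draws = {random.randrange(LB, L) for _ in range(10_000)}
# the support is {3, 4, 5, 6}: the upper bound L is never drawn
assert draws == {3, 4, 5, 6}
```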

Because it’s MCMC and I want a simple proof of correctness, I have to make sure the chain’s reversible. So I see how many steps to get one past a U-turn coming back (after momentum flip), which is Lstar. Now I have to grab its lower bound, and make sure that I take a number of steps between the lower bound (inclusive) and upper bound (exclusive). Yup, had this wrong at one point. But the off-by-one error shows up in a position that is relatively rare given how I was sampling.

For more fun, we have to compute the acceptance probability. In theory, it’s just p(theta_star, rho_star, N) / p(theta, rho, N) in this algorithm, which looks as follows on the log scale.

log_accept = (
    self.log_joint(theta_star, rho_star) - np.log(Lstar - LBstar)
    - (log_joint_theta_rho - np.log(L - LB))
)

That’s because p(N | theta_star, rho_star) = 1 / (Lstar - LBstar) given the uniform sampling with Lstar excluded and LBstar included. But then I substituted the uniform distribution for a binomial, and made the following mistake.

log_accept = (
    self.log_joint(theta_star, rho_star) - self.length_log_prob(N, Lstar)
    - (log_joint_theta_rho - self.length_log_prob(N, L))
)

I only had the negation in -np.log(L - LB) because it was equivalent to np.log(1 / (L - LB)) with a subtraction instead of a division. Luckily Edward caught this one in the code review. I should’ve just coded the log density and added it rather than subtracted it. Now you’d think this would lead to an immediate and glaring bug in the results, because MCMC is a delicate algorithm. In this case, the issue is that (N - L) and (N - Lstar) are identically distributed and only range over values of roughly 5 to 7. That’s a minor difference in a stochastic acceptance probability that’s already high. How hard was this to detect? With 100K iterations, everything looked fine. With 1M iterations, the error in the estimates of the parameters continued to follow a 1 / sqrt(iterations) trend, but the estimates of the squared parameters leveled off at a residual error after about 100K iterations. That is, it required 1M iterations and an evaluation of the means of squared parameters to detect this bug.
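To pin down the sign convention, here's a self-contained sketch with made-up numbers (length_log_prob is written here as a plain function taking explicit bounds, unlike the method in the real code): the log probability of N gets added on each side, so subtracting it flips the sign.

```python
import math

def length_log_prob(N, lb, ub):
    # log p(N) for N uniform on {lb, ..., ub - 1}; -inf off the support
    return -math.log(ub - lb) if lb <= N < ub else float("-inf")

# made-up stand-ins for the sampler's state
log_joint_star, log_joint = -3.2, -3.0
L, LB, Lstar, LBstar, N = 12, 4, 11, 3, 7

# correct: add the log probability of N under each direction's distribution
log_accept = (log_joint_star + length_log_prob(N, LBstar, Lstar)) \
           - (log_joint + length_log_prob(N, LB, L))
print(log_accept)  # both ranges have width 8 here, so the uniform terms cancel: about -0.2
```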

I then introduced a similar error when I went to a binomial number of steps selection. I was using sp.stats.binom.logpmf(N, L, self._success_prob) when I should have been using sp.stats.binom.logpmf(N, L - 1, self._success_prob). As an aside, I like SciPy’s clear naming here vs. R’s dbinom(log.p = True, ...). What I don’t like about Python is that the discrete uniform doesn’t include its endpoint. Of course, the binomial includes its endpoint as an option, so these two versions need to be coded off by 1. Of course, I missed the L - 1. This only introduced a bug because I didn’t do the matching adjustment in testing whether things were reversible. That’s if not(1 <= N and N < Lstar) to match the Lstar - 1 in the logpmf() call. If I ran it all the way to L, then I would've needed N <= Lstar. This is another subtle difference that only shows up after more than 100K iterations.

We introduced a similar problem into Stan in 2016 when we revised NUTS to do multinomial sampling rather than slice sampling. It was an off-by-one error on trajectory length. All of our unit tests of roughly 10K iterations passed. A user spotted the bug by fitting a 2D correlated normal with known correlation for 1M iterations as a test and realizing estimates were off by 0.01 when they should've had smaller error. We reported this on the blog back when it happened, culminating in the post Michael found the bug in Stan's new sampler.

I was already skeptical of empirical results in papers and this is making me even more skeptical!

P.S. In case you don't know the English idiom "close but no cigar", here's the dictionary definition from Cambridge (not Oxford!).

Do research articles have to be so one-sided?

It’s standard practice in research articles as well as editorials in scholarly journals to present just one side of an issue. That’s how it’s done! A typical research article looks like this:

“We found X. Yes, we really found X. Here are some alternative explanations for our findings that don’t work. So, yeah, it’s really X, it can’t reasonably be anything else. Also, here’s why all the thickheaded previous researchers didn’t already find X. They were wrong, though, we’re right. It’s X. Indeed, it had to be X all along. X is the only possibility that makes sense. But it’s a discovery, it’s absolutely new. As was said of the music of Beethoven, each note is prospectively unexpected but retrospectively absolutely right. In conclusion: X.”

There also are methods articles, which go like this:

“Method X works. Here’s a real problem where method X works better than anything else out there. Other methods are less accurate or more expensive than X, or both. There are good theoretical reasons why X is better. It might even be optimal under some not-too-unreasonable conditions. Also, here’s why nobody tried X before. They missed it! X is, in retrospect, obviously the right thing to do. Also, though, X is super-clever: it had to be discovered. Here are some more examples where X wins. In conclusion: X.”

Or the template for a review article:

“Here’s a super-important problem which has been studied in many different ways. The way we have studied it is the best. In this article, we also discuss some other approaches which are worse. Our approach looks even better in this contrast. In short, our correct approach both flows naturally from and is a bold departure from everything that came before.”

OK, sometimes we try to do better. We give tentative conclusions, we accept uncertainty, we compare our approach to others on a level playing field, we write a review that doesn’t center on our own work. It happens. But, unless you’re Bob Carpenter, such an even-handed approach doesn’t come naturally, and, as always with this kind of adjustment, there’s always the concern of going too far (“bending over backward”) in the other direction. Recall my criticism of the popular but I think bogus concept of “steelmanning.”

So, yes, we should try to be more balanced, especially when presenting our own results. But the incentives don’t go in that direction, especially when your contributions are out there fighting with lots of ideas that other people are promoting unreservedly. Realistically, often the best we can do is to include Limitations sections in otherwise-positive papers.

One might think that a New England Journal of Medicine editorial could do better, but editorials have the same problem as review articles, which is that the authors will still have an agenda.

Dale Lehman writes in, discussing such an example:

A recent article in the New England Journal of Medicine caught my interest. The authors – a Harvard economist and a McKinsey consultant, who properly disclosed their ties – provide a variety of ways that AI can contribute to health care delivery. I can hardly argue with the potential benefits, and some areas of application are certainly ripe for improvements from AI. However, the review article seems unduly one-sided. Almost all of the impediments to application that they discuss lay the “blame” on health care providers and organizations. No mention is made of the potential errors made by AI algorithms applied in health care. This I found particularly striking since they repeatedly appeal to AI use in business (generally) as a comparison to the relatively slow adoption of AI in health care. When I think of business applications, a common error might be a product recommendation or promotion that was not relevant to a consumer. The costs of such a mistake are generally small – wasted resources, unhappy customers, etc. A mistake made by an AI recommendation system in medicine strikes me as quite a bit more serious (lost customers is not the same thing as lost patients).

To that point, the article cites several AI applications to prediction of sepsis (references 24-27). That is a particular area of application where several AI sepsis-detection algorithms have been developed, tested, and reported on. But the references strike me as cherry-picked. A recent controversy has concerned the Epic model, where the company-reported results were much better than those of the attempted replication. Also, there was a major international challenge (PhysioNet), where data was provided from 3 hospital systems, 2 of which provided the training data for the competition and the remaining system was used as the test data. Notably, the algorithms performed much better on the systems for which the training data was provided than on the test data.

My question really concerns the role of the NEJM here. Presumably this article was peer reviewed – or at least reviewed by the editors. Shouldn’t the NEJM be demanding more balanced and comprehensive review articles? It isn’t that the authors of this article say anything that is wrong, but it seems deficient in its coverage of the issues. It would not have been hard to acknowledge that these algorithms may not be ready for use (admittedly, they may outperform existing human models, but that is an area on which there is research and it should be noted in the article). Nor would it be difficult to point out that algorithmic errors and biases in health care may be a more serious matter than in other sectors of the economy.

Interesting. I’m guessing that the authors of the article were coming from the opposite direction, with a feeling that there’s too much conservatism regarding health-care innovation and they wanted to push back against that. (Full disclosure: I’m currently working with a cardiologist to evaluate a machine-learning approach for ECG diagnosis.)

In any case, yes, this is part of a general problem. One thing I like about blogging, as opposed to scholarly writing or journalism, is that in a blog post there’s no expectation or demand or requirement that we come to a strong conclusion. We can let our uncertainty hang out, without some need to try to make “the best possible case” for some point. We may be expected to entertain, but that’s not so horrible!

N=43, “a statistically significant 226% improvement,” . . . what could possibly go wrong??


They looked at at least 12 cognitive outcomes, one of which had p = 0.02, but other differences “were just shy of statistical significance.” Also:

The degree of change in the brain measure was not significantly correlated with the degree of change in the behavioral measure (p > 0.05) but this may be due to the reduced power in this analysis which necessarily only included the smaller subset of individuals who completed neuropsychological assessments during in-person visits.

This is one of the researcher degrees of freedom we see all the time: an analysis with p > 0.05 can be labeled as “marginally statistically significant” or even published straight-up as a main result (“P < 0.10”), it can get some sort of honorable mention (“this may be due to the reduced power”), or it can be declared to be a null effect.

The “this may be due to the reduced power” thing is confused, for two reasons. First, of course it’s due to the reduced power! Set n to 1,000,000,000 and all your comparisons will be statistically significant! Second, the whole point of having these measures of sampling and measurement error is to reveal the uncertainty in an estimate’s magnitude and sign. It’s flat-out wrong to take a point estimate and just suppose that it would persist under a larger sample size.
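To make the first point concrete, here's a toy calculation (mine, not from the paper): suppose a point estimate of 0.01 standard deviations just persisted unchanged as n grew; the two-sided z-test p-value would collapse to zero.

```python
import math

def two_sided_p(effect, sd, n):
    # two-sided p-value for a z-test of a sample mean against zero
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# a tiny effect of 0.01 sd units becomes "significant" with enough data
for n in [100, 10_000, 1_000_000, 1_000_000_000]:
    print(n, two_sided_p(0.01, 1.0, n))
```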

People are trained in bad statistical methods, so they use bad statistical methods, it happens every day. In this one, I’m just bothered that this “226% improvement” thing didn’t set off any alarms. To the extent that these experimental results might be useful, the authors should be publishing the raw data rather than trying to fish out statistically significant comparisons. They also include a couple of impressive-looking graphs which wouldn’t look so impressive if they were to graph all the averages in the data rather than just those that randomly exceeded a significance threshold.

Did they publish the raw data? No! Here’s the Data availability statement:

The datasets presented in this article are not readily available because due to reasonable privacy and security concerns, the underlying data are not easily redistributable to researchers other than those engaged in the current project’s Institutional Review Board-approved research. The corresponding author may be contacted for an IRB-approved collaboration. Requests to access the datasets should be directed to …

It seems like it would be pretty trivial to remove names and any other identifying information and then release the raw data. This is a study on “whether older adults retain or improve their cognitive ability over a six-month period after daily olfactory enrichment at night.” What’s someone gonna do, track down participants based on their “daily exposure to essential oil scents”?

One problem here is that Institutional Review Boards are set up with a default no-approval stance. I think it should be the opposite: no IRB approval unless you commit ahead of time to posting your raw data. (Not that my collaborators and I usually post our raw data either. Posting raw data can be difficult. That’s one reason I think it should be required, because otherwise it’s not likely to be done.)

No, it’s not “statistically implausible” when results differ between studies, or between different groups within a study.

James “not the cancer cure guy” Watson writes:

This letter by Thorlund et al. published in the New England Journal of Medicine is rather amusing. It’s unclear to me what their point is, other than the fact that they find the published results for the new COVID drug molnupiravir “statistically implausible.”

Background: The pharma company Merck got very promising results for molnupiravir at their interim analysis (~50% reduction in hospitalisation/death) but less promising results at their final analysis (30% reduction). Thorlund et al. were surprised that the data for the two study periods (before and after interim analysis) provided very different point estimates for benefit (goes the other way in the second period). They were also surprised to see inconsistent results when comparing across the different countries included in the study (non-overlapping confidence intervals).

They clearly had never read the subgroup analysis from the ISIS-2 trial: the authors convincingly showed that aspirin reduced vascular deaths in patients of all astrological birth signs except Gemini and Libra; see Figure 5 in this Lancet paper from 1988.

He’s not kidding—that Lancet paper really does talk about astrological signs. What the hell??

Regarding the letter in the New England Journal of Medicine, I guess the point is that different studies, and different groups within a study, have different patients and are conducted at different times and under different conditions, so it makes sense that they can have different outcomes, more different than would be expected to arise from pure chance when comparing two samples from an identical distribution. People often don’t seem to realize this, leading them to characterize differences as “statistically implausible” etc. rather than as just representing underlying differences across patients, scenarios, and times.

As the authors of the original study put it in their response letter in the journal:

Given the shifts in prevailing SARS-CoV-2 variants, changes in outpatient management, and inclusion of trial sites from countries with unique Covid-19 disease burdens, the trial was not necessarily conducted under uniform conditions. The differences in the results between the interim and final analyses might be statistically improbable under ideal circumstances, but they reflect the fact that several key factors could not remain constant despite a consistent trial design.


Simulation to understand two kinds of measurement error in regression

This is all super-simple; still, it might be useful. In class today a student asked for some intuition as to why, when you’re regressing y on x, measurement error on x biases the coefficient estimate but measurement error on y does not.

I gave the following quick explanation:
– You’re already starting with the model, y_i = a + bx_i + e_i. If you add measurement error to y, call it y*_i = y_i + eta_i, and then you regress y* on x, you can write y*_i = a + bx_i + e_i + eta_i, and as long as eta is independent of e, you can just combine them into a single error term.
– When you have measurement error in x, two things happen to attenuate b—that is, to pull the regression coefficient toward zero. First, if you spread out x but keep y unchanged, this will reduce the slope of y on x. Second, when you add noise to x you’re changing the ordering of the data, which will reduce the strength of the relationship.

But that’s all words (and some math). It’s simpler and clearer to do a live simulation, which I did right then and there in class!

Here’s the R code:

# simulation for measurement error
n <- 1000
x <- runif(n, 0, 10)
a <- 0.2
b <- 0.3
sigma <- 0.5
y <- rnorm(n, a + b*x, sigma)
fake <- data.frame(x,y)

fit_1 <- lm(y ~ x, data=fake)

sigma_y <- 1
fake$y_star <- rnorm(n, fake$y, sigma_y)
sigma_x <- 4
fake$x_star <- rnorm(n, fake$x, sigma_x)

fit_2 <- lm(y_star ~ x, data=fake)

fit_3 <- lm(y ~ x_star, data=fake)

fit_4 <- lm(y_star ~ x_star, data=fake)

x_range <- range(fake$x, fake$x_star)
y_range <- range(fake$y, fake$y_star)

par(mfrow=c(2,2), mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
plot(fake$x, fake$y, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="No measurement error")
abline(coef(fit_1), col="red")
plot(fake$x, fake$y_star, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on y")
abline(coef(fit_2), col="red")
plot(fake$x_star, fake$y, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on x")
abline(coef(fit_3), col="red")
plot(fake$x_star, fake$y_star, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on x and y")
abline(coef(fit_4), col="red")

The resulting plot is at the top of this post.
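Here's a quick numerical check on what the plots show (the seeded rerun is an addition; the live demo just eyeballed the graphs). The slope with noise on y should stay near the true b = 0.3, while the slope with noise on x should fall to roughly b*Var(x)/(Var(x) + sigma_x^2), which works out to about 0.10 for these settings:

```r
set.seed(123)  # seed added for reproducibility
n <- 1000
x <- runif(n, 0, 10)
y <- rnorm(n, 0.2 + 0.3*x, 0.5)
y_star <- rnorm(n, y, 1)   # measurement error on y
x_star <- rnorm(n, x, 4)   # measurement error on x
slope_noisy_y <- coef(lm(y_star ~ x))[2]   # stays close to 0.3
slope_noisy_x <- coef(lm(y ~ x_star))[2]   # attenuated toward 0.10
print(round(c(slope_noisy_y, slope_noisy_x), 2))
```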

I like this simulation for three reasons:

1. You can look at the graph and see how the slope changes with measurement error in x but not in y.

2. This exercise shows the benefits of clear graphics, including little things like making the dots small, adding the regression lines in red, labeling the individual plots, and using a common axis range for all four graphs.

3. It was fast! I did it live in class, and this is an example of how students, or anyone, can answer this sort of statistical question directly, with a lot more confidence and understanding than would come from a textbook and some formulas.

P.S. As Eric Loken and I discuss in this 2017 article, everything gets more complicated if you condition on "statistical significance."

P.P.S. Yes, I know my R code is ugly. Think of this as an inspiration: even if, like me, you’re a sloppy coder, you can still code up these examples for teaching and learning.

Intelligence is whatever machines cannot (yet) do

I had dinner a few nights ago with Andrew’s former postdoc Aleks Jakulin, who left the green fields of academia for entrepreneurship ages ago. Aleks was telling me he was impressed by the new LLMs, but then asserted that they’re clearly not intelligent. This reminded me of the old saw in AI that “AI is whatever a machine can’t do.”

In the end, the definition of “intelligent” is a matter of semantics. Semantics is defined by conventional usage, not by fiat (the exception seems to be an astronomical organization trying to change the definition of “planet” to make it more astronomically precise). We do this all the time. If you think about what “water” means, it’s incredibly vague. In the simplest case, how many minerals can it contain before we call it “mud” rather than “water”? Does it even have to be made of H2O if we can find a clear liquid on an alternative earth that will nourish us in the same way (this is a common example in philosophy from Hilary Putnam, I believe)? When the word “water” was first introduced into English, let’s just say that our understanding of chemistry was less developed than it is now. The word “intelligent” is no different. We’ve been using the term since before computers, and now we have to rethink what it means. By convention, we could decide as a group of language users to define “intelligent” however we want. Usually such decisions are guided by pragmatic considerations (or at least I’d like to think so—this is the standard position of pragmatist philosophers of language, like Richard Rorty). For instance, we could decide to exclude GPT because (a) it’s not embodied in the same way as a person, (b) it doesn’t have long-term memory, (c) it runs on silicon rather than cells, etc.

It would be convenient for benchmarking if we could fix a definition of “intelligence” to work with. What we do instead is just keep moving the bar on what counts as “intelligent.” I doubt people 50 years ago (1974) would have said you can play chess without being intelligent. But as soon as Deep Blue beat the human chess champion, everyone changed their tune and the chorus became “chess is just a game” and “it’s finite” and “it has well-defined rules, unlike real life.” Then when IBM’s Watson trounced the world champion at Jeopardy!, a language-based game, it was dismissed as a parlor trick. Obviously because a machine can play Jeopardy!, the reasoning went, it doesn’t require intelligence.

Here’s the first hit on Google I found searching for something like [what machines can’t do]. This one’s in a popular magazine, not the scientific literature. It’s the usual piece in the genre of “ML is amazing, but it’s not intelligent because it can’t do X”.

Let’s go over Toews’s list of AI’s failures circa 2021 (these are direct quotes).

  1. Use “common sense.” A man went to a restaurant. He ordered a steak. He left a big tip. If asked what the man ate in this scenario, a human would have no problem giving the correct answer—a steak. Yet today’s most advanced artificial intelligence struggles with prompts like this.
  2. Learn continuously and adapt on the fly. Today, the typical AI development process is divided into two distinct phases: training and deployment.
  3. Understand cause and effect. Today’s machine learning is at its core a correlative tool. It excels at identifying subtle patterns and associations in data. But when it comes to understanding the causal mechanisms—the real-world dynamics—that underlie those patterns, today’s AI is at a loss.
  4. “Reason ethically…In 2016, Microsoft debuted an AI personality on Twitter named Tay. The idea was for Tay to engage in online conversations with Twitter users as a fun, interactive demonstration of Microsoft’s NLP technology. It did not go well. Within hours, Internet trolls had gotten Tay to tweet a wide range of offensive messages: for instance, “Hitler was right” and “I hate feminists and they should all die and burn in hell.”

(1) ChatGPT-4 gets these common-sense problems mostly right. But it’s not logic. The man may have ordered a steak, gotten it, sent it back, ordered the fish instead, and still left a big tip. This is a problem with a lot of the questions posed to GPT about whether X follows from Y. It’s not a sound inference, just the most likely thing to happen, or as we used to say, the “default.” Older AIs were typically designed around sound inference and weren’t so much trying to emulate human imprecision (having said that, my grad school admissions essay was about default logics, and my postdoc was funded by a grant on them, back in the 1980s!).

(2) You can do in-context learning with ChatGPT, but it doesn’t retain anything long term without retraining/fine tuning. It will certainly adapt to its task/listener on the fly throughout a conversation (arguably the current systems like ChatGPT adapt to their interlocutor too much—it’s what they were trained to do via reinforcement learning). Long-term memory is perhaps the biggest technical challenge to overcome, and it’s been interesting to see people going back to LSTM/recursive NN ideas (transformers, the neural net architecture underlying ChatGPT, were introduced in a paper titled “Attention is all you need”, which used long, but finite memory).

(3) ChatGPT-4 is pretty bad at causal inference. But it’s probably above the bar set by Toews’s complaints. It’ll get simple “causal inference” right the same way people do. In general, humans are pretty bad at causal inference. We are way too prone to jump to causal conclusions based on insufficient evidence. Do we classify baseball announcers as not intelligent when they talk about how a player struggles with high pressure situations after N = 10 plate appearances in the playoffs? We’re also pretty bad at reasoning about things that go against our preconceptions. Do we think Fisher was not intelligent because he argued that smoking didn’t cause cancer? Do we think all the anthropogenic global warming deniers are not intelligent? Maybe they’re right and it’s just a coincidence that temps have gone up coinciding with industrialization and carbon emissions. Seems like a highly suspicious coincidence, but causation is really hard when you can’t do randomized controlled trials (and even then it’s not so easy because of all the possible mediation).

(4) How you call this one depends on whether you think the front-line fine-tuning of ChatGPT made a reasonably helpful/harmless/truthful bot or not and whether the “ethics” it was trained with are yours. You can certainly jailbreak even ChatGPT-4 to send it spiraling into hate land or fantasy land. You can jailbreak some of my family in the same way, but I wouldn’t go so far as to say they weren’t intelligent. You can find lots of folks who think ChatGPT is too “woke”. This is a running theme on the GPT subreddit. It’s also a running theme among anti-woke billionaires, as reflected in the UK’s Daily Telegraph article title, “ChatGPT may be the next big thing, but it’s a biased woke robot.”

I’ve heard a lot of people say their dog is more intelligent than ChatGPT. I suppose they would argue for a version of intelligence that doesn’t require (1) or (4) and is very tolerant of poor performance in (2) and (3).

Evidence, desire, support

I keep worrying, as with a loose tooth, about news media elites who are going for the UFOs-as-space-aliens theory. This one falls halfway between election denial (too upsetting for me to want to think about too often) and belief in ghosts (too weird to take seriously).

I was also thinking about the movie JFK, which I saw when it came out in 1991. As a reader of the newspapers, I knew that the narrative pushed in the movie was iffy, to say the least; still, I watched the movie intently—I wanted to believe. In the same way that in the 1970s I wanted to believe those claims that dolphins are smarter than people, or that millions of people wanted to believe in the Bermuda Triangle or ancient astronauts or Noah’s Ark or other fringe ideas that were big in that decade. None of those particular ideas appealed to me.

Anyway, this all got me thinking about what it takes for someone to believe in something. My current thinking is that belief requires some mixture of the following three things:
1. Evidence
2. Desire
3. Support

To go through these briefly:

1. I’m using the term “evidence” in a general sense to include things you directly observe and also convincing arguments of some sort or another. Evidence can be ambiguous and, much to people’s confusion, it doesn’t always point in the same direction. The unusual trajectory of Oswald’s bullet is a form of evidence, even though not as strong as has been claimed by conspiracy theorists. The notorious psychology paper from 2011 is evidence for ESP. It’s weak evidence, really no evidence at all for anything beyond the low standards of academic psychology at the time, but it played the role of evidence for people who were interested in or open to believing.

2. By “desire,” I mean a desire to believe in the proposition at hand. There can be complicated reasons for this desire. Why did I have some desire in 1991 to believe the fake JFK story, even though I knew ahead of time it was suspect? Maybe because it helped make sense of the world? Maybe because, if I could believe the story, I could go with the flow of the movie and feel some righteous anger? I don’t really know. Why do some media insiders seem to have the desire to believe that UFOs are space aliens? Maybe because space aliens are cool, maybe because, if the theory is true, then these writers are in on the ground floor of something big, maybe because the theory is a poke in the eye at official experts, maybe all sorts of things.

3. “Support” refers to whatever social environment you’re in. 30% of Americans believe in ghosts, and belief in ghosts seems to be generally socially acceptable—I’ve heard people from all walks of life express the belief—but there are some places where it’s not taken seriously, such as in the physics department. The position of ghost-belief within the news media is complicated, typically walking a fine line to avoid expressing belief or disbelief. For example, a quick search of *ghosts npr* led to this from the radio reporter:

I’m pretty sure I don’t believe in ghosts. Now, I say pretty sure because I want to leave the possibility open. There have definitely been times when I felt the presence of my parents who’ve both died, like when one of their favorite songs comes on when I’m walking the aisles of the grocery store, or when the wind chime that my mom gave me sings a song even though there’s no breeze. But straight-up ghosts, like seeing spirits, is that real? Can that happen?

This is kind of typical. It’s a news story that’s pro-ghosts, reports a purported ghost sighting with no pushback, but there’s that kinda disclaimer too. It’s similar to reporting on religion. Different religions contradict each other, and so if you want to report in a way that’s respectful of religion, you have to place yourself in a no-belief-yet-no-criticism mode: if you have a story about religion X, you can’t push back (“Did you really see the Lord smite that goat in your backyard that day?”) because that could offend adherents of that religion, but you can’t fully go with it, as that could offend adherents of every other religion.

I won’t say that all three of evidence, desire, and support are required for belief, just that they can all contribute. We can see this with some edge cases. That psychologist who published the terrible paper on ESP: he had a strong desire to believe, a strong enough desire to motivate an entire research program on his part. There was also a little bit of institutional support for the belief. Not a lot—ESP is a fringe take that would be, at best, mocked by most academic psychologists, it’s a belief that has much lower standing now than it did fifty years ago—but some. Anyway, the strong desire was enough, along with the terrible-but-nonzero evidence and the small-but-nonzero support. Another example would be Arthur Conan Doyle believing those ridiculous faked fairy photos: spiritualism was big in society at the time, so he had strong social support as well as strong desire to believe. In other cases, evidence is king, but without the institutional support it can be difficult for people to be convinced. Think of all those “they all laughed, but . . .” stories of scientific successes under adversity: continental drift and all the rest.

As we discussed in an earlier post, the “support” thing seems like a big change regarding the elite media and UFOs-as-space-aliens. The evidence for space aliens, such as it is—blurry photographs, eyewitness testimony, suspiciously missing government records, and all the rest—has been with us for half a century. The desire to believe has been out there too for a long time. What’s new is the support: some true believers managed to insert the space aliens thing into the major news media in a way that gives permission to wanna-believers to lean into the story.

I don’t have anything more to say on this right now, just trying to make sense of it all. This all has obvious relevance to political conspiracy theories, where authority figures can validate an idea, which then gives permission for other wanna-believers to push it.

Delayed retraction sampling

Colby Vorland writes:

In case it is of interest, a paper we reported 3 years, 4 months ago was just retracted:

Retracted: Effect of Moderate-Intensity Aerobic Exercise on Hepatic Fat Content and Visceral Lipids in Hepatic Patients with Diabesity: A Single-Blinded Randomised Controlled Trial

Over this time, I was sent draft retraction notices on two occasions by Hindawi’s research integrity team that were then reneged for reasons that were not clear. The research integrity team stopped responding to me, but after I involved COPE, they eventually got it done. Happy to give more details. Our full team who helped with this one was Colby Vorland, Greyson Foote, Stephanie Dickinson, Evan Mayo-Wilson, David Allison, and Andrew Brown.

As stated in the retraction notice, here are the issues:

(i) There is no mention of the clinical trial registration number, NCT03774511 (retrospectively registered in December 2018), or that this was part of a larger study. Overall, there were three arms: a control, a high-intensity exercise group (HII) and a moderate-intensity exercise group (MIC), but only the control and MIC were reported in [1].

(ii) There is no indication that references 35 and 36 [4, 5] cited in the article draw on data from the same study participants and these references are incorrectly presented as separate studies supporting the findings of the article, which may have misled readers.

(iii) The authors have stated that recruitment and randomization occurred during August-December 2017, the HII and control arms were conducted during January-August 2018, and the MIC arm was run during August-December 2018, which is a non-standard study design and was not reported in any of the articles.

(iv) The data presented in Figure 1 and Tables 1 and 2 are identical to data presented in Abdelbasset et al. [5]. With respect to Figure 1 the study has been presented without the additional study arm shown in Abdelbasset et al. [5].

(v) The data in Table 2 is identical to that shown as the MIC study arm in Abdelbasset et al. [5]. However, the p values have been presented to three decimal places whereas in Abdelbasset et al. [5] they are presented to two decimal places [5]. The data also shows inconsistent rounding. There is a particular concern where 0.046 has been rounded down to 0.04 (and hence appears statistically significant) rather than rounding up, as has occurred with other values. In addition, several items shown as in Abdelbasset et al. [5] are shown as values less than 0.01 (i.e., <0.01, 0.004 and 0.002).

(vi) There are concerns with the accuracy of the statistical tests reported in the article, because the comparisons are of within-group differences rather than using valid between-group tests such as ANOVA. Many of the p-values reported in the article could not be replicated by Vorland et al. [3], and in particular they found no significant differences between treatment groups for BMI, IHTG, visceral adipose fat, total cholesterol, and triglycerides. This was confirmed by the authors’ reanalysis, apart from triglycerides for which there was a significant difference between treatment groups according to the authors’ reanalysis.

(vii) The age ranges are slightly inconsistent between the articles, despite the studies collectively reporting on the same participants: 45–60 in [1, 4] and 40–60 in [5]. The authors state that 40–60 years reflects the inclusion criteria for the study, whereas the actual age range of the included participants was 45–60 years.

(viii) Although this was a single clinical trial, different ethical approval numbers are given in each article: PT/2017/00-019 [1], PT/2017/00-018 [4], and P.TREC/012/002146 [5].

Also this from the published retraction:

The authors do not agree to the retraction and the notice.

I appreciate the effort by Vorland et al. I’ve done this sort of thing too on occasion, and other times I’ve asked a journal to publish a letter of correction but they’ve refused. Unfortunately, retraction and correction are not scalable. Literally zillions of scientific papers are published a year, and only a handful get retracted or corrected.

How large is that treatment effect, really? (my talk at NYU economics department Thurs 18 Apr 2024, 12:30pm)

19 W 4th Street, Room 517:

How large is that treatment effect, really?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

“He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it.”

Ron Bloom points us to this wonderful article, “The Ethics of Belief,” by the mathematician William Clifford, also known for Clifford algebras. The article is related to some things I’ve written about evidence vs. truth (see here and here) but much more beautifully put. Here’s how it begins:

A shipowner was about to send to sea an emigrant-ship. He knew that she was old, and not overwell built at the first; that she had seen many seas and climes, and often had needed repairs. Doubts had been suggested to him that possibly she was not seaworthy. These doubts preyed upon his mind, and made him unhappy; he thought that perhaps he ought to have her thoroughly overhauled and refitted, even though this should put him to great expense. Before the ship sailed, however, he succeeded in overcoming these melancholy reflections. He said to himself that she had gone safely through so many voyages and weathered so many storms that it was idle to suppose she would not come safely home from this trip also. He would put his trust in Providence, which could hardly fail to protect all these unhappy families that were leaving their fatherland to seek for better times elsewhere. He would dismiss from his mind all ungenerous suspicions about the honesty of builders and contractors. In such ways he acquired a sincere and comfortable conviction that his vessel was thoroughly safe and seaworthy; he watched her departure with a light heart, and benevolent wishes for the success of the exiles in their strange new home that was to be; and he got his insurance-money when she went down in mid-ocean and told no tales.

What shall we say of him? Surely this, that he was verily guilty of the death of those men. It is admitted that he did sincerely believe in the soundness of his ship; but the sincerity of his conviction can in no wise help him, because he had no right to believe on such evidence as was before him. He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it.

Clifford’s article is from 1877!

Bloom writes:

One can go over this in two passes. One pass may be read as “moral philosophy.”

But the second pass helps one think a bit about how one ought to make precise the concept of ‘relevance’ in “relevant evidence.”

Specifically (this is remarkably deficient in the Bayesian corpus I find) I would argue that when we say “all probabilities are relative to evidence” and write the symbolic form straightaway P(A|E) we are cheating. We have not faced the fact — I think — that not every “E” has any bearing (“relevance”) one way or another on A and that it is *inadmissible* to combine the symbols because it is so easy to write ’em down. Perhaps one evades the problem by saying, well what do you *think* is the case. Perhaps you might say, “I think that E is irrelevant if P(A|E) = P(A|~E).” But that begs the question: it says in effect that *both* E and ~E can be regarded as “evidence” for A. I argue that easily leads to nonsense. To regard any utterance or claim as “evidence” for any other utterance or claim leads to absurdities. Here for instance:

A = “Water ice of sufficient quantity to maintain a lunar base will be found in the spectral analysis of the plume of the crashed lunar polar orbiter.”

E = If there are martians living on the Moon of Jupiter, Europa, then they celebrate their Martian Christmas by eating Martian toast with Martian jam.

Is E evidence for A? is ~E evidence for A? Is any far-fetched hypothetical evidence for any other hypothetical whatsoever?

Just to provide some “evidence” that I am not being entirely facetious about the Lunar orbiter; I attach also a link to now much superannuated item concerning that very intricate “experiment” — I believe in the end there was some spectral evidence turned up consistent with something like a teaspoon’s worth of water-ice per 25 square Km.

P.S. Just to make the connection super-clear, I’d say that Clifford’s characterization, “He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it,” is an excellent description of those Harvard professors who notoriously endorsed the statement, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” Also a good match to those Columbia administrators who signed off on those U.S. News numbers. In neither case did a ship go down; it’s the same philosophical principle but lower stakes. Just millions of dollars involved, no lives lost.

As Isaac Asimov put it, “A robot may not injure a human being or, through inaction, allow a human being to come to harm.” Sometimes that inaction is pretty damn active, when a shipowner or a scientific researcher or a university administrator puts in some extra effort to avoid looking at some pretty clear criticisms.

Here’s something you should do at the beginning of a project, in the middle of the project, and at the end of the project: Clearly specify your goals, and also specify what’s not in your goal set.

Here’s something from Witold’s slides on baggr, an R package (built on Stan) that does hierarchical modeling for meta-analysis:

Overall goals:

1. Implement all basic meta-analysis models and tools
2. Focus on accessibility, model criticism and comparison
3. Help people avoid basic mistakes
4. Keep the framework flexible and extend to more models

(Probably) not our goal:

5. Build a package for people who already build their models in Stan

I really like this practice of specifying goals. This is so basic that it seems like we should always be doing it—but so often we don’t! Also I like the bit where he specifies something that’s not in his goals.

Again, this all seems so natural when we see it, but it’s something we don’t usually do. We should.

People have needed rituals to turn data into truth for many years. Why would we be surprised if many people now need procedural reforms to work?

This is Jessica. How to weigh metascience or statistical reform proposals has been on my mind more than usual lately as a result of looking into and blogging about the Protzko et al. paper on rigor-enhancing practices. Seems it’s also been on Andrew’s mind.

Like Andrew, I have been feeling “bothered by the focus on procedural/statistical ‘rigor-enhancing practices’ of ‘confirmatory tests, large sample sizes, preregistration, and methodological transparency’” because I suspect researchers are taking it to heart that these steps will be enough to help them produce highly informative experiments. 

Yesterday I started thinking about it via an analogy to writing. I heard once that if you’re trying to help someone become a better writer, you should point out no more than three classes of things they’re doing wrong at one time, because too much new information can be self-defeating. 

Imagine you’re a writing consultant and people bring you their writing and you advise them on how to make it better. Keeping in mind the need to not overwhelm, initially maybe you focus on the simple things that won’t make them a great writer, but are easy to teach. “You’re doing comma splices, your transitions between paragraphs suck, you’re doing citation wrong.” You talk up how important these things are to get right to get them motivated, and you bite your tongue when it comes to all the other stuff they need help with to avoid discouraging them. 

Say the person goes away and fixes the three things, then comes back to you with some new version of what they’ve written. What do you do now? Naturally, you give them three more things. Over time as this process repeats, you eventually get to the most nuanced stuff that is harder to get right but ultimately more important to their success as a writer.

But this approach presupposes that your audience either intrinsically cares enough about improving to keep coming back, or that they have some outside reason they must keep coming back, like maybe they are a Ph.D. student and their advisor is forcing their hand. What if you can never be sure when someone walks in the door that they will come back a second time after you give your advice? In fact, what if the people who really need your help with their writing are bad writers because they fixated on the superficial advice they got in middle school or high school that boils good writing down to a formula, and considered themselves done? And now they’re looking for the next quick fix, so they can go back to focusing on whatever they are actually interested in and treating writing as a necessary evil?

Probably they will latch onto the next three heuristics you give them and consider themselves done. So if we suspect the people we are advising will be looking for easy answers, it seems unlikely that we are going to get them there using the approach above where we give them three simple things and we talk up the power of these things to make them good writers. Yet some would say this is what mainstream open science is doing, by giving people simple procedural reforms (just preregister, just use big samples, etc) and talking up how they help eliminate replication problems.  

I like writing as an analogy for doing experimental social science because both are a kind of wicked problem where there are many possible solutions, and the criteria for selecting between them are nuanced. There are simple procedural things that are easier to point out, like the comma splices or lacking transitions between paragraphs in writing, or not having a big enough sample or invalidating your test by choosing it posthoc in experimental science. But avoiding mistakes at this level is not going to make you a good writer, just like enacting simple procedural heuristics is not going to make you a good experimentalist or modeler. For that you need to adopt an orientation that acknowledges the inherent difficulty of the task and prompts you to take a more holistic approach.

Figuring out how to encourage that is obviously not easy. But one reason that starting with the simple procedural stuff (or broadly applicable stuff, as Nosek implies the “rigor-enhancing practices” are), seems insufficient to me is that I don’t necessarily think there’s a clear pathway from the simple formulaic stuff to the deeper stuff, like the connection between your theory and what you are measuring and how you are measuring it and how you specify and select among competing models. I actually think it makes more sense to go the opposite way, from why inference from experimental data is necessarily very hard as a result of model misspecification, effect heterogeneity, measurement error etc. to the ingredients that have to be in place for us to even have a chance, like sufficient sample size and valid confirmatory tests. The problem is that one can understand the concepts of preregistration or sufficient sample size while still having a relatively simple mental model of effects as real or fake and questionable research practices as the main source of issues.

In my own experience, the more I’ve thought about statistical inference from experiments over the years, the more seriously I take heterogeneity and underspecification/misspecification, to the point that I’ve largely given up doing experimental work. This is an extreme outcome of course, but I think we should expect that the more one recognizes how hard the job really is, the less likely one is to firehose the literature in one’s field with a bunch of careless dead-in-the-water style studies. As work by Berna Devezer and colleagues has pointed out, open science proposals are often subject to the same kinds of problems such as overconfident claims and reliance on heuristics that contributed to the replication crisis in the first place. This solution-ism (a mindset I’m all too familiar with as a computer scientist) can be counterproductive. 

Hey, some good news for a change! (Child psychology and Bayes)

Erling Rognli writes:

I just wanted to bring your attention to a positive stats story, in case you’d want to feature it on the blog. A major journal in my field (the Journal of Child Psychology and Psychiatry) has over time taken a strong stance for using Bayesian methods, publishing an editorial in 2016 advocating switching to Bayesian methods:

Editorial: Bayesian benefits for child psychology and psychiatry researchers – Oldehinkel – 2016 – Journal of Child Psychology and Psychiatry.

And recently following up with inviting myself and some colleagues to write a brief introduction to Bayesian methods (where we of course recommend Stan):

Editorial perspective: Bayesian statistical methods are useful for researchers in child and adolescent mental health – Rognli – Journal of Child Psychology and Psychiatry.

I think this consistent editorial support really matters for getting risk-averse researchers to start using new methods, so I think the editors of the JCPP deserve recognition for contributing to improving statistical practice in this field.

No reason to think that Bayes and Stan will, by themselves, transform child psychology, but I think it’s a step in the right direction. As Rubin used to say, one advantage of Bayes is that the work you do to set up the model represents a bridge between experiment, data, and scientific understanding. It’s getting you to think about the right questions.

Evilicious 3: Face the Music

A correspondent forwards me this promotional material that appeared in his inbox:

“Misbeliefs are not just about other people, they are also about our own beliefs.” Indeed.

I wonder if this new book includes the shredder story.

P.S. The book has blurbs from Yuval Harari, Arianna Huffington, and Michael Shermer (the professional skeptic who assures us that he has a haunted radio). This thing where celebrities stick together . . . it’s nuts!

P.P.S. The good news is that there’s already some new material for the eventual sequel. And it’s “preregistered”! What could possibly go wrong?

What is the prevalence of bad social science?

Someone pointed me to this post from Jonatan Pallesen:

Frequently, when I [Pallesen] look into a discussed scientific paper, I find out that it is astonishingly bad.

• I looked into Claudine Gay’s 2001 paper to check a specific thing, and I found that the research approach of the paper makes no sense.

• I looked into the famous study about how blind auditions increased the number of women in orchestras, and found that the only significant finding is in the opposite direction.

• The work of Lisa Cook was being discussed because of her nomination to the Fed. @AnechoicMedia_ made a comment pointing out a potential flaw in her most famous study. And indeed, the flaw was immediately obvious and fully disqualifying.

• The study showing judges being very affected by hunger? Also useless.

These studies do not have minor or subtle flaws. They have flaws that are simple and immediately obvious. I think that anyone, without any expertise in the topics, can read the linked tweets and agree that yes, these are obvious flaws.

I’m not sure what to conclude from this, or what should be done. But it is rather surprising to me to keep finding this.

My quick answer is, at some point you should stop being surprised! Disappointed, maybe, just not surprised.

A key point is that these are not just any papers, they’re papers that have been under discussion for some reason other than their potential problems. Pallesen, or any of us, doesn’t have to go through Psychological Science and PNAS every week looking for the latest outrage. He can just sit in one place, passively consume the news, and encounter a stream of prominent published research papers that have clear and fatal flaws.

Regular readers of this blog will recall dozens more examples of high-profile disasters: the beauty-and-sex-ratio paper, the ESP paper and its even more ridiculous purported replications, the papers on ovulation and clothing and ovulation and voting, himmicanes, air rage, ages ending in 9, the pizzagate oeuvre, the gremlins paper (that was the one that approached the platonic ideal of more corrections than data points), the ridiculously biased estimate of the effects of early-childhood intervention, the air pollution in China paper and all the other regression discontinuity disasters, much of the nudge literature, the voodoo study, the “out of Africa” paper, etc. As we discussed in the context of that last example, all the way back in 2013 (!), the problem is closely related to these papers appearing in top journals:

The authors have an interesting idea and want to explore it. But exploration won’t get you published in the American Economic Review etc. Instead of the explore-and-study paradigm, researchers go with assert-and-defend. They make a very strong claim and keep banging on it, defending their claim with a bunch of analyses to demonstrate its robustness. . . . High-profile social science research aims for proof, not for understanding—and that’s a problem. The incentives favor bold thinking and innovative analysis, and that part is great. But the incentives also favor silly causal claims. . . .

So, to return to the question in the title of this post, how often is this happening? It’s hard for me to say. On one hand, ridiculous claims get more attention; we don’t spend much time talking about boring research of the “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]” variety. On the other hand, do we really think that high-profile papers in top journals are that much worse than the mass of published research?

I expect that some enterprising research team has done some study, taking a random sample of articles published in some journals and then looking at each paper in detail to evaluate its quality. Without that, we can only guess, and I don’t have it in me to hazard a percentage. I’ll just say that it happens a lot—enough so that I don’t think it makes sense to trust social-science studies by default.

My correspondent also pointed me to a recent article in Harvard’s student newspaper, “I Vote on Plagiarism Cases at Harvard College. Gay’s Getting off Easy,” by “An Undergraduate Member of the Harvard College Honor Council,” who writes:

Let’s compare the treatment of Harvard undergraduates suspected of plagiarism with that of their president. . . . A plurality of the Honor Council’s investigations concern plagiarism. . . . when students omit quotation marks and citations, as President Gay did, the sanction is usually one term of probation — a permanent mark on a student’s record. A student on probation is no longer considered in good standing, disqualifying them from opportunities like fellowships and study-abroad programs. Good standing is also required to receive a degree.

What is striking about the allegations of plagiarism against President Gay is that the improprieties are routine and pervasive. She is accused of plagiarism in her dissertation and at least two of her 11 journal articles. . . .

In my experience, when a student is found responsible for multiple separate Honor Code violations, they are generally required to withdraw — i.e., suspended — from the College for two semesters. . . . We have even voted to suspend seniors just about to graduate. . . .

There is one standard for me and my peers and another, much lower standard for our University’s president.

This echoes what Jonathan Bailey has written here and here at his blog Plagiarism Today:

Schools routinely hold their students to a higher and stricter standard when it comes to plagiarism than they handle their faculty and staff. . . .

To give an easy example: in October 2021, W. Franklin Evans, who was then the president of West Liberty University, was caught repeatedly plagiarizing in speeches he was giving as president. Importantly, it wasn’t past research that was in dispute; it was the work he was doing as president.

However, though the board did vote unanimously to discipline him, they also voted against termination and did not clarify what discipline he was receiving.

He was eventually let go as president, but only after his contract expired two years later. It’s difficult to believe that a student at the school, if faced with a similar pattern of plagiarism in their coursework, would be given that same chance. . . .

The issue also isn’t limited to higher education. In February 2020, Katy Independent School District superintendent Lance Hindt was accused of plagiarism in his dissertation. Though he eventually resigned, the district initially threw its full support behind Hindt. This included a rally for Hindt that was attended by many of the teachers in the district.

Even after he left, he was given two years of salary and had $25,000 set aside for him if he wanted to file a defamation lawsuit.

There are lots and lots of examples of prominent faculty committing scholarly misconduct and nobody seems to care—or, at least, not enough to do anything about it. In my earlier post on the topic, I mentioned the Harvard and Yale law professors, the USC medical school professor, the Princeton history professor, the George Mason statistics professor, and the Rutgers history professor, none of whom got fired. And I’d completely forgotten about the former president of the American Psychological Association and editor of Perspectives on Psychological Science who misrepresented work he had published and later was forced to retract—but his employer, Cornell University, didn’t seem to care. And the University of California professor who misrepresented data and seems to have suffered no professional consequences. And the Stanford professor who gets hyped by his university while promoting miracle cures and bad studies. And the dean of engineering at the University of Nevada. Not to mention all the university administrators and football coaches who misappropriate funds and then are quietly allowed to leave on golden parachutes.

Another problem is that we rely on the news media to keep these institutions accountable. We have lots of experience with universities (and other organizations) responding to problems by denial; the typical strategy appears to be to lie low and hope the furor will go away, which typically happens in the absence of lots of stories in the major news media. But . . . the news media have their own problems: little problems like NPR consistently hyping junk science and big problems like Fox pushing baseless political conspiracy theories. And if you consider podcasts and Ted talks to be part of “the media,” which I think they are—I guess as part of the entertainment media rather than the news media, but the dividing line is not sharp—then, yeah, a huge chunk of the media is not just susceptible to being fooled by bad science and indulgent of academic misconduct, it actually relies on bad science and academic misconduct to get the wow! stories that bring the clicks.

To return to the main thread of this post: by sanctioning students for scholarly misconduct but letting its faculty and administrators off the hook, Harvard is, unfortunately, following standard practice. The main difference, I guess, is that “president of Harvard” is more prominent than “Princeton history professor” or “Harvard professor of constitutional law” or “president of West Liberty University” or “president of the American Psychological Association” or “UCLA medical school professor” or all the others. The story of the Harvard president stays in the news, while those others all receded from view, allowing the administrators at those institutions to follow the usual plan of minimizing the problem, saying very little, and riding out the storm.

Hey, we just got sidetracked into a discussion of plagiarism. This post was supposed to be about bad research. What can we say about that?

Bad research is different from plagiarism. Obviously, students don’t get kicked out for doing bad research, using wrong statistical methods, losing their data, making claims that defy logic and common sense, claiming to modify a paper shredder that has never existed, etc etc etc. That’s the kind of behavior that, if your final paper also has formatting problems, will get you slammed with a B grade, and that’s about it.

When faculty are found to have done bad research, the usual reaction is not to give them a B or to do the administrative equivalent—lowering their salary, perhaps?, or removing them from certain research responsibilities, maybe making them ineligible to apply for grants?—but rather to pretend that nothing happened. The idea is that, once an article has been published, you draw a line under it and move onward. It’s considered in bad taste—Javert-like, even!—to go back and find flaws in papers that are already resting comfortably in someone’s C.V. As Pallesen notes, so often when we do go back and look at those old papers, we find serious flaws. Which brings us to the question in the title of this post.

P.S. The paper by Claudine Gay discussed by Pallesen is here; it was published in 2001. For more on the related technical questions involving the use of ecological regression, I recommend this 2002 article by Michael Herron and Kenneth Shotts (link from Pallesen) and my own article with David Park, Steve Ansolabehere, Phil Price, and Lorraine Minnite, “Models, assumptions, and model checking in ecological regressions,” from 2001.

“AI” as shorthand for turning off our brains. (This is not an anti-AI post; it’s a discussion of how we think about AI.)

Before going on, let me emphasize that, yes, modern AI is absolutely amazing—self-driving cars, machines that can play ping-pong, chessbots, computer programs that write sonnets, the whole deal! Call it machine intelligence or whatever, it’s amazing.

What I’m getting at in this post is the way in which attitudes toward AI fit into existing practices in science and other aspects of life.

This came up recently in comments:

“AI” does not just refer to a particular set of algorithms or computer programs but also to the attitude in which an algorithm or computer program is idealized to the extent that people think it’s ok for them to rely on it and not engage their brains.

Some examples of “AI” in that sense of the term:
– When people put a car on self-driving mode and then disengage from the wheel.
– When people send out a memo produced by a chatbot without reading and understanding it first.
– When researchers use regression discontinuity analysis or some other identification strategy and don’t check that their numbers make any sense at all.
– When journal editors see outrageous claims backed by “p less than 0.05” and then just push the Publish button.

“AI” is all around us, if you just know where to look!

One thing that interests me here is how current expectations of AI in some ways match and in some ways go beyond past conceptions in science fiction. The chatbot, for example, is pretty similar to all those talking robots, and I guess you could imagine a kid in such a story asking his robot to do his homework for him. Maybe the difference is that the robot is thought to have some sort of autonomy, along with which comes some idiosyncratic fallibility (if only that the robot is too “logical” to always see clearly to the solution of a problem), whereas an AI is considered more of an institutional product with some sort of reliability, in the same sense that every bottle of Coca-Cola is the same. Maybe that’s the connection to naive trust in standardized statistical methods.

This also relates to the idea that humans used to be thought of as the rational animal but now are viewed as irrational computers. In the past, our rationality was considered to be what separates us from the beasts, either individually or through collective action, as in Locke and Hobbes. If the comparison point is animals, then our rationality is a real plus! Nowadays, though, it seems almost the opposite: if the comparison point is a computer, then what makes us special is not our rationality but our emotions.

There is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.

Following our recent post on the latest Dishonestygate scandal, we got into a discussion of the challenges of simulating fake data and performing a pre-analysis before conducting an experiment.

You can see it all in the comments to that post—but not everybody reads the comments, so I wanted to repeat our discussion here. Especially the last line, which I’ve used as the title of this post.

Raphael pointed out that it can take some work to create a realistic simulation of fake data:

Do you mean to create a dummy dataset and then run the preregistered analysis? I like the idea, and I do it myself, but I don’t see how this would help me see if the endeavour is doomed from the start? I remember your post on the beauty-and-sex ratio, which proved that the sample size was far too small to find an effect of such small magnitude (or was it in the Type S/Type M paper?). I can see how this would work in an experimental setting – simulate a bunch of data sets, do your analysis, compare it to the true effect of the data generation process. But how do I apply this to observational data, especially with a large number of variables (number of interactions scales in O(p²))?

I elaborated:

Yes, that’s what I’m suggesting: create a dummy dataset and then run the preregistered analysis. Not the preregistered analysis that was used for this particular study, as that plan is so flawed that the authors themselves don’t seem to have followed it, but a reasonable plan. And that’s kind of the point: if your pre-analysis plan isn’t just a bunch of words but also some actual computation, then you might see the problems.

In answer to your second question, you say, “I can see how this would work in an experimental setting,” and we’re talking about an experiment here, so, yes, it would’ve been better to have simulated data and performed an analysis on the simulated data. This would require the effort of hypothesizing effect sizes, but that’s a bit of effort that should always be done when planning a study.

For an observational study, you can still simulate data; it just takes more work! One approach I’ve used, if I’m planning to fit data predicting some variable y from a bunch of predictors x, is to get the values of x from some pre-existing dataset, for example an old survey, and then just do the simulation part for y given x.
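To make that concrete, here is a minimal Python sketch of the kind of fake-data simulation I have in mind. Everything in it is hypothetical: the two predictors stand in for values you would pull from a pre-existing dataset, and the coefficients and noise level are assumptions you have to commit to. The point is to run the planned analysis on the fake data and see whether the assumed effects are even recoverable at the planned sample size.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for predictors that would come from a pre-existing dataset
# (e.g., an old survey); everything here is hypothetical.
n = 200
x1 = rng.normal(0, 1, n)       # a continuous predictor
x2 = rng.binomial(1, 0.5, n)   # a binary predictor

# Assumed effect sizes and noise level: the hard choices you must make up front.
b0, b1, b2, sigma = 1.0, 0.3, 0.5, 2.0
y = b0 + b1 * x1 + b2 * x2 + rng.normal(0, sigma, n)

# Run the planned analysis on the fake data: ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rough standard error for the coefficient on x1, to see whether n = 200
# gives any real hope of estimating an assumed effect of 0.3.
resid = y - X @ beta_hat
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
print(beta_hat, se)
```

Under these made-up numbers the standard error on the x1 coefficient comes out to roughly 0.14, so the assumed effect of 0.3 is only about two standard errors: a marginal design at best, which is exactly the kind of thing you want to find out before collecting data.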

Raphael replied:

Maybe not the silver bullet I had hoped for, but now I believe I understand what you mean.

To which I responded:

There is no silver bullet; there is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.

Again, this is not a diss on preregistration. Preregistration does one thing; it’s not intended to fix bad aspects of the culture of science, such as the idea that you can gather a pile of data, grab some results, declare victory, and go on the Ted talk circuit based only on the very slender bit of evidence that you were able to reject the hypothesis that the data came from a specific random number generator. That line of reasoning, where rejection of straw-man null hypothesis A is taken as evidence in favor of preferred alternative B, is wrong—but it’s not preregistration’s fault that people think that way!

P-hacking can be bad (but the problem here, in my view, is not in performing multiple analyses but rather in reporting only one of them rather than analyzing them all together); various questionable research practices are, well, questionable; and preregistration can help with that, either directly (by motivating researchers to follow a clear plan) or indirectly (by allowing outsiders to see problems in post-publication review, as here).

I am, however, bothered by the focus on procedural/statistical “rigor-enhancing practices” of “confirmatory tests, large sample sizes, preregistration, and methodological transparency.” Again, the problem is if researchers mistakenly think that following such advice will place them back on that nonexistent golden path to discovery.

So, again, I recommend making assumptions, simulating fake data, and analyzing those data as a way of constructing a pre-analysis plan, before collecting any real data. That won’t put you on the golden path to discovery either!

All I can offer you here is blood, toil, tears and sweat, along with the possibility that a careful process of assumptions/simulation/pre-analysis will allow you to avoid disasters such as this ahead of time, thus avoiding the consequences of: (a) fooling yourself into thinking you’ve made a discovery, (b) wasting the time and effort of participants, coauthors, reviewers, and postpublication reviewers (that’s me!), and (c) filling the literature with junk that will later be collected in a GIGO meta-analysis and promoted by the usual array of science celebrities, podcasters, and NPR reporters.

Aaaaand . . . the time you’ve saved from all of that could be repurposed into designing more careful experiments with clearer connections between theory and measurement. Not a glide along the golden path to a discovery; more of a hacking through the jungle of reality to obtain some occasional glimpses of the sky.

It’s Ariely time! They had a preregistration but they didn’t follow it.

I have a story for you about a success of preregistration. Not quite the sort of success that you might be expecting—not a scientific success—but a kind of success nonetheless.

It goes like this. An experiment was conducted. It was preregistered. The results section was written up in a way that reads as if the experiment worked as planned. But if you go back and forth between the results section and the preregistration plan, you realize that the purportedly successful results did not follow the preregistration plan. They’re just the usual story of fishing and forking paths and p-hacking. The preregistration plan was too vague to be useful, and the authors didn’t even bother to follow it—or, if they did follow it, they didn’t bother to write up the results of the preregistered analysis.

As I’ve said many times before, there’s no reason that preregistration should stop researchers from doing further analyses once they see their data. The problem in this case is that the published analysis was not well justified either from a statistical or a theoretical perspective, nor was it in the preregistration. Its only value appears to be as a way for the authors to spin a story around a collection of noisy p-values.

On the minus side, the paper was published, and nowhere in the paper does it say that the statistical evidence they offer from their study does not come from the preregistration. In the abstract, their study is described as “pre-registered,” which isn’t a lie—there’s a preregistration plan right there on the website—but it’s misleading, given that the preregistration does not line up with what’s in the paper.

On the plus side, outside readers such as ourselves can see the paper and the preregistrations and draw our own conclusions. It’s easier to see the problems with p-hacking and forking paths when the analysis choices are clearly not in the preregistration plan.

The paper

The Journal of Experimental Social Psychology recently published an article, “How pledges reduce dishonesty: The role of involvement and identification,” by Eyal Peer, Nina Mazar, Yuval Feldman, and Dan Ariely.

I had no idea that Ariely is still publishing papers on dishonesty! It says that data from this particular paper came from online experiments. Nothing involving insurance records or paper shredders or soup bowls or 80-pound rocks . . . It seems likely that, in this case, the experiments actually happened and that the datasets came from real people and have not been altered.

And the studies are preregistered, with the preregistration plans all available on the papers’ website.

I was curious about that. The paper had 4 studies. I just looked at the first one, which already took some effort on my part. The rest of you can feel free to look at Studies 2, 3, and 4.

The results section and the preregistration

From the published paper:

The first study examined the effects of four different honesty pledges that did or did not include a request for identification and asked for either low or high involvement in making the pledge (fully-crossed design), and compared them to two conditions without any pledge (Control and Self-Report).

There were six conditions: one control (with no possibility to cheat), a baseline treatment (possibility and motivation to cheat and no honesty pledge), and four different treatments with honesty pledges.

This is what they reported for their primary outcome:

And this is how they summarize in their discussion section:

Interesting, huh?

Now let’s look at the relevant section of the preregistration:

Compare that to what was done in the paper:

– They did the ANOVA, but that was not relevant to the claims in the paper. The ANOVA included the control condition, and nobody’s surprised that when you give people the opportunity and motivation to cheat, some of them will cheat. That was not the point of the paper. It’s fine to do the ANOVA; it’s just more of a manipulation check than anything else.

– There’s something in the preregistration about a “cheating gap” score, which I did not see in the paper. But if we define A to be the average outcome under the control, B to be the average outcome under the baseline treatment, and C, D, E, F to be the averages under the other four treatments, then I think the preregistration is saying they’ll define the cheating gap as B-A, and then compare this to C-A, D-A, E-A, and F-A. This is mathematically the same as looking at C-B, D-B, E-B, and F-B, which is what they do in the paper.

– The article jumps back and forth between different statistical summaries: “three of the four pledge conditions showed a decrease in self-reports . . . the difference was only significant for the Copy + ID condition.” It’s not clear what to make of it. They’re using statistical significance as evidence in some way, but the preregistration plan does not make it clear what comparisons would be done, how many comparisons would be made, or how they would be summarized.

– The preregistration plan says, “We will replicate the ANOVAs with linear regressions with the Control condition or Self-Report conditions as baseline.” I didn’t see any linear regressions in the results for this experiment in the published paper.

– The preregistration plan says, “We will also examine differences in the distribution of the percent of problems reported as solved between conditions using Kolmogorov–Smirnov tests. If we find significant differences, we will also examine how the distributions differ, specifically focusing on the differences in the percent of “brazen” lies, which are defined as the percent of participants who cheated to a maximal, or close to a maximal, degree (i.e., reported more than 80% of problems solved). The differences on this measure will be tested using chi-square tests.” I didn’t see any of this in the paper either! Maybe this is fine, because doing all these tests doesn’t seem like a good analysis plan to me.

What should we make of all the analyses stated in the preregistration plan that were not in the paper? Since these analyses were preregistered, I can only assume the authors performed them. Maybe the results were not impressive and so they weren’t included. I don’t know; I didn’t see any discussion of this in the paper.

– The preregistration plan says, “Lastly, we will explore interactions effects between the condition and demographic variables such as age and gender using ANOVA and/or regressions.” They didn’t report any of that either! Also there’s the weird “and/or” in the preregistration, which gives the researchers some additional degrees of freedom.
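The cheating-gap point above is worth a one-line sanity check: comparing the gaps (C - A), (D - A), etc. against (B - A) is algebraically identical to comparing C, D, etc. against B directly, because the A terms cancel. A tiny Python check with made-up condition means (all numbers hypothetical):

```python
# Hypothetical condition means: A = control, B = baseline treatment, C = one pledge treatment.
A, B, C = 0.20, 0.60, 0.35

# The difference between the two "cheating gap" contrasts ...
gap_contrast = (C - A) - (B - A)

# ... equals the direct contrast between the two treatments, because A cancels.
assert abs(gap_contrast - (C - B)) < 1e-12
```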

Not a moral failure

I continue to emphasize that scientific problems do not necessarily correspond to moral problems. You can be a moral person and still do bad science (honesty and transparency are not enuf); to put it another way, if I say that you make a scientific error or are sloppy in your science, I’m not saying you’re a bad person.

For me to say someone’s a bad person just because they wrote a paper and didn’t follow their preregistration plan . . . that would be ridiculous! Over 99% of my published papers have no preregistration plans; and for those that do have such plans, I’m pretty sure we didn’t exactly follow them in our published papers. That’s fine. The reason I do preregistration is not to protect my p-values; it’s just part of a larger process of hypothesizing about possible outcomes and simulating data and analysis as a prelude to measurement and data collection.

I think what happened in the “How pledges reduce dishonesty” paper is that the preregistration was both too vague and too specific. Too vague in that it did not include simulation and analysis of fake data, nor did it include quantitative hypotheses about effects and the distributions of outcomes, nor did it include anything close to what the authors ended up actually doing to support the claims in their paper. Too specific in that it included a bunch of analyses that the authors then didn’t think were worth reporting.

But, remember, science is hard. Statistics is hard. Even what might seem like simple statistics is hard. One thing I like about doing simulation-based design and analysis before collecting any data is that it forces me to make some of the hard choices early. So, yeah, it’s hard, and it’s no moral criticism of the authors of the above-discussed paper that they botched this. We’re all still learning. At the same time, yeah, I don’t think their study offers any serious evidence for the claims being made in that paper; it looks like noise mining to me. Not a moral failing, but still bad science, in that there are no good links between theory, effect sizes, data collection, and measurement, which, as is often the case, leads to super-noisy results that can be interpreted in all sorts of ways to fit just about any theory.

Possible positive outcomes for preregistration

I think preregistration is great; again, it’s a floor, not a ceiling, on the data processing and analyses that can be done.

Here are some possible benefits of preregistration:

1. Preregistration is a vehicle for getting you to think harder about your study. The need to simulate data and create a fake world forces you to make hard choices and consider what sorts of data you might expect to see.

2. Preregistration with fake-data simulation can make you decide to redesign a study, or to not do it at all, if it seems that it will be too noisy to be useful.

3. If you already have a great plan for a study, preregistration can allow the subsequent analysis to be bulletproof. No need to worry about concerns of p-hacking if your data coding and analysis decisions are preregistered—and this also holds for analyses that are not based on p-values or significance tests.

4. A preregistered replication can build confidence in a previous exploratory finding.

5. Conversely, a preregistered study can yield a null result, for example if it is designed to have high statistical power but then does not yield statistically significant results on the preregistered comparisons. Failure is not always as exciting or informative as success—recall the expression “big if true”—but it ain’t nothing.

6. Similarly, a preregistered replication can yield a null result. Again, this can be a disappointment but still a step in scientific learning.

7. Once the data appear and the preregistered analysis is done, if it’s unsuccessful, this can lead the authors to change their thinking and to write a paper explaining that they were wrong, or maybe just to publish a short note saying that the preregistered experiment did not go as expected.

8. If a preregistered analysis fails, but the authors still try to claim success using questionable post-hoc analysis, the journal reviewers can compare the manuscript to the preregistration, point out the problem, and require that the article be rewritten to admit the failure. Or, if the authors refuse to do that, the journal can reject the article as written.

9. Preregistration can be useful in post-publication review to build confidence in a published paper by reassuring readers who might have been concerned about p-hacking and forking paths. Readers can compare the published paper to the preregistration and see that it’s all ok.

10. Or, if the paper doesn’t follow the preregistration plan, readers can see this too. Again, it’s not a bad thing at all for the paper to go beyond the preregistration plan. That’s part of good science, to learn new things from the data. The bad thing is when a non-preregistered analysis is presented as if it were the preregistered analysis. And the good thing is that the reader can read the documents and see that this happened. As we did here.
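Benefits 1 and 2 above come from fake-data simulation, and it’s easy to do. Here’s a minimal sketch in R of the kind of pre-study check I have in mind; all the numbers (group sizes, true effect, response model) are assumptions for illustration, not taken from the paper:

```r
# Fake-data simulation before running a study: two groups of 40 rating on a
# 1-5 scale, with an assumed true difference of 0.5 on the underlying scale.
set.seed(123)
n_sims <- 1000
n_per_group <- 40
true_diff <- 0.5
p_values <- replicate(n_sims, {
  # crude response model: normal draws, rounded and clipped to the 1-5 scale
  y0 <- pmin(5, pmax(1, round(rnorm(n_per_group, 3.0, 1.2))))
  y1 <- pmin(5, pmax(1, round(rnorm(n_per_group, 3.0 + true_diff, 1.2))))
  t.test(y1, y0)$p.value
})
power_est <- mean(p_values < 0.05)
print(power_est)  # estimated power under these assumptions
```

If the estimated power comes out low, or if the simulated estimates bounce around wildly relative to the assumed effect, that’s your cue to redesign the study or improve the measurements before collecting any real data.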

In the case of this recent dishonesty paper, preregistration did not give benefit 1, nor did it give benefit 2, nor did it give benefits 3, 4, 5, 6, 7, 8, or 9. But it did give benefit 10. Benefit 10 is unfortunately the least of all the positive outcomes of preregistration. But it ain’t nothing. So here we are. Thanks to preregistration, we now know that we don’t need to take seriously the claims made in the published paper, “How pledges reduce dishonesty: The role of involvement and identification.”

For example, you should feel free to accept that the authors offer no evidence for their claim that “effective pledges could allow policymakers to reduce monitoring and enforcement resources currently allocated for lengthy and costly checks and inspections (that also increase the time citizens and businesses must wait for responses) and instead focus their attention on more effective post-hoc audits. What is more, pledges could serve as market equalizers, allowing better competition between small businesses, who normally cannot afford long waiting times for permits and licenses, and larger businesses who can.”

Huh??? That would not follow from their experiments, even if the results had all gone as planned.

There’s also this funny bit at the end of the paper:

I just don’t know whether to believe this. Did they sign an honesty pledge?


OK, it’s 2024, and maybe this all feels like shooting a rabbit with a cannon. A paper by Dan Ariely on the topic of dishonesty, published in an Elsevier journal, purporting to provide “guidance to managers and policymakers” based on the results of an online math-puzzle game? Whaddya expect? This is who-cares research at best, in a subfield that is notorious for unreplicable research.

What happened was I got sucked in. I came across this paper, and my first reaction was surprise that Ariely was still collaborating with people working on this topic. I would’ve thought that the crashing-and-burning of his earlier work on dishonesty would’ve made him radioactive as a collaborator, at least in this subfield.

I took a quick look and saw that the studies were preregistered. Then I wanted to see exactly what that meant . . . and here we are.

Once I did the work, it made sense to write the post, as this is an example of something I’ve seen before: a disconnect between the preregistration and the analyses in the paper, and a lack of engagement in the paper with all the things in the preregistration that did not go as planned.

Again, this post should not be taken as any sort of opposition to preregistration, which in this case led to positive outcome #10 on the above list. The 10th-best outcome, but better than nothing, which is what we would’ve had in the absence of preregistration.

Baby steps.