Cohort effects in literature (David Foster Wallace and other local heroes)

I read this review by Patricia Lockwood of a book by David Foster Wallace. I’d never read the book being reviewed, but that was no problem because the review itself was readable and full of interesting things. What struck me was how important Wallace seemed to be to her. I’ve heard of Wallace and read one or two things by him, but from my perspective he’s just one of many, many writers, with no special position in the world. I think it’s a generational thing. Wallace hit the spot for people of Lockwood’s age, a couple decades younger than me. To get a sense of how Lockwood feels about Wallace’s writing, I’d have to consider someone like George Orwell or Philip K. Dick, who to me had special things to say.

My point about Orwell and Dick (or, for Lockwood, Wallace) is not that they stand out from all other writers. Yes, Orwell and Dick are great writers with wonderful styles and a lot of interesting things to say—but that description characterizes many many others, from Charles Dickens and Mark Twain through James Jones, Veronica Geng, Richard Ford, Colson Whitehead, etc etc. Orwell and Dick just seem particularly important to me; it’s hard to say exactly why. So there was something fascinating about seeing someone else write about a nothing-special (from my perspective) writer but with that attitude that, good or bad, he’s important.

It kinda reminds me of how people used to speculate on what sort of music the Beatles would’ve made had they not broken up. In retrospect, the question just seems silly: they were a group of musicians who wrote some great songs, lots of great songs have been written by others since then, and there’s no reason to think that future Beatles compositions would’ve been any more amazing than the fine-but-not-earthshaking songs they wrote on their own or that others were writing during that period. What’s interesting to me here is not the Beatles themselves but putting myself into that frame of mind in which the Beatles were so important that the question, What would they have done next?, seemed worth taking seriously.

That’s why I call Wallace, and some of the other writers discussed above, “local heroes,” with their strongest appeal localized in cohort and time rather than in space. “Voice of a generation” would be another way to put it, but I like the framing of locality because it opens the door to considering dimensions other than cohort and time.

Adding intermediate outcomes to an item-response (Bradley-Terry) model

Huib Meulenbelt writes:

Assume we have the following hierarchical Bradley-Terry model:

data {
  int<lower=0> K;                                // players
  int<lower=0> N;                                // number of rallies
  int<lower=1, upper=K> player_a;     // player a
  int<lower=1, upper=K> player_b;     // player b
  int<lower=0> y;                                 // number of rallies won
}
parameters {
  real<lower=0> sigma;
  vector[K] skill;                                   // ability for player K
}
model{
  sigma ~ lognormal(0, 0.5); 
  skill ~ normal(0, sigma);
  y ~ binomial_logit(N, skill[player_a] - skill[player_b]);
}

In this blog post you argue “there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss.”

I agree completely. For each rally I obtained its length, and I would like this variable to influence the estimated skill levels of the player and opponent. The skill levels of the two players are closer to each other when the game ends 11-7 and the match lasts 500 seconds than when the game ends 11-7 and the match lasts only 100 seconds.

So, we move from
p[player A wins over player B] = logistic(skill_A - skill_B)

to

p[player A wins over player B] = logistic(f(rally_length) * (skill_A - skill_B))

How would you define this function?

My reply: This is known in psychometrics as an item-response model with a discrimination parameter. The multiplier, f(rally_length) in the above notation, is called the discrimination: the idea is that the higher it is, the more predictive the skill-level difference is of the outcome. If discrimination is zero, the skill difference doesn’t predict at all, and a negative discrimination corresponds to a prediction that goes in the unexpected direction (the worse player being more likely to win).

My answer to the immediate question above is: sure, try it out. You could start with some simple form for the function f and see how it works. Ultimately I’m not thrilled with this model because it is not generative. I expect you can do better by modeling the length of the rally as an intermediate outcome. You can do this in Stan too. I’d recommend starting with just a single parameter per player, but you might need to add another parameter for each player if the rally length varies systematically by player after adjusting for ability.
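
To make that concrete, here is a minimal Stan sketch of one simple form for f, assuming game-level data with rallies won out of rallies played and an average rally length per game. The added names (G, n_rallies, rally_length, gamma) and the particular choice f(length) = length^gamma are illustrative assumptions, not a recommendation:

data {
  int<lower=0> K;                              // players
  int<lower=0> G;                              // games
  array[G] int<lower=1, upper=K> player_a;     // player a in each game
  array[G] int<lower=1, upper=K> player_b;     // player b in each game
  array[G] int<lower=0> n_rallies;             // rallies played in each game
  array[G] int<lower=0> y;                     // rallies won by player a
  vector<lower=0>[G] rally_length;             // average rally length per game
}
parameters {
  real<lower=0> sigma;
  vector[K] skill;                             // abilities
  real gamma;                                  // how discrimination scales with rally length
}
model {
  sigma ~ lognormal(0, 0.5);
  skill ~ normal(0, sigma);
  gamma ~ normal(0, 1);
  for (g in 1:G) {
    // one simple choice of f: discrimination = rally_length^gamma,
    // with the sign and magnitude of gamma learned from the data
    real discrimination = exp(gamma * log(rally_length[g]));
    y[g] ~ binomial_logit(n_rallies[g],
                          discrimination * (skill[player_a[g]] - skill[player_b[g]]));
  }
}

This keeps the binomial rally-count outcome; the more generative alternative described above, in which rally length is itself modeled as an intermediate outcome with a parameter or two per player, would add an explicit model for rally_length rather than treating it as a fixed input to the discrimination, and is not sketched here.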

But the biggest thing is . . . above you say that you agree with me to model score differential and not just win/loss, but then in your model you’re only including win/loss as an outcome. You’re throwing away information! Don’t do that. Whatever model you use, I strongly recommend you use score differential, not win/loss, as your outcome.

Mixtures: A debugging story

I’m not a great programmer. The story that follows is not intended to represent best programming practice or even good programming practice. It’s just something that happened to me, and it’s the kind of thing that’s happened to me many times before, so I’m sharing it with you here.

The problem

For a research project I needed to fit a regression model with an error term that is a mixture of three normals. I googled *mixture model Stan* and came to this page with some code:

data {
  int K;          // number of mixture components
  int N;          // number of data points
  array[N] real y;         // observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K) {
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}

I should’ve just started by using this code as is, but instead I altered it in a couple of ways:

data {
  int M;
  int N;
  int K;
  vector[N] v;
  matrix[N,K] X;
}
parameters {
  vector[K] beta;
  simplex[M] lambda;
  ordered[M] mu;
  vector<lower=0>[M] sigma;
}
model {
  lambda ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  sigma ~ lognormal(0, 2);
  for (n in 1:N){
    vector[M] lps = log(lambda);
    for (m in 1:M){
      lps[m] += normal_lpdf(v[n] | X*beta + mu[m], sigma[m]);
    }
    target += log_sum_exp(lps);
  }
}

The main thing was adding the linear predictor, X*beta, but I also renamed a couple of variables to make the code line up with the notation in the paper I was writing. I also added a prior on the mixture component sizes, and I removed some of the code from the Stan User’s Guide that increased computational efficiency but seemed to me to make the code harder for newcomers to read.

I fit this model setting M=3, and . . . it was really slow! I mean, stupendously slow. My example had about 1000 data points and it was taking, oh, I dunno, close to an hour to run?

This just made no sense. It’s not just that it was slow; also, the slowness was just not right, which made me concerned that something else was going wrong.

Also there was poor convergence. Some of this was understandable, as I was fitting the model to data that had been simulated from a linear regression with normal errors, so there weren’t actually three components. But, still, the model has priors, and I don’t think the no-U-turn sampler (NUTS) algorithm used by Stan should have so much trouble traversing this space.

Playing with priors

It was time to debug. My first thought was that the slowness was caused by poor geometry: if NUTS is moving poorly, it can take up to 1024 leapfrog steps per iteration (that’s 2^10, from the default maximum treedepth of 10), and then each iteration will take a long time. Poor mixing and slow convergence go together.

A natural way to fix this problem is to make the priors stronger. With strong priors, the parameters are restricted to fall in a small, controlled zone, and the geometry should be better.

I tried a few things with priors, the most interesting of which was to set up a hierarchical model for the scales of the mixture components. I added this line to the parameters block:

  real log_sigma_0;

And then, in the model block, I replaced “sigma ~ lognormal(0, 2);” with:

  sigma ~ lognormal(log_sigma_0, 1);

One thing that made the priors relatively easy to set up here was that the data are on unit scale: they’re the logarithms of sampling weights, and the sampling weights were normalized to have a sample mean of 1. Also, we’re not gonna have weights of 1000, so they have a limited range on the log scale.

In any case, these steps of tinkering with the prior weren’t helping. The model was still running ridiculously slowly.

Starting from scratch

OK, what’s going on? I decided to start from the other direction: Instead of starting with my desired model and trying to clean it up, I started with something simple and built up.

The starting point is the code in the Stan User’s Guide, given above. I simulated some fake data and fitted it:

library("cmdstanr")
set.seed(123)

# Simulate data
N <- 100
K <- 3
lambda <- c(0.5, 0.3, 0.2)
mu <- c(-2, 0, 2)
sigma <- c(1, 1, 1)
z <- sample(1:K, N, replace=TRUE, prob=lambda)
v <- rnorm(N, mu[z], sigma[z])

# Fit model
mixture <- cmdstan_model("mixture_2.stan")
mixture_data <- list(y=v, N=N, K=3)
v_mixture_fit <- mixture$sample(data=mixture_data, seed=123, chains=4, parallel_chains=4)
print(v_mixture_fit)

And it worked just fine, ran fast, no convergence problems. It also worked fine with N=1000; it just made sense to try N=100 first in case any issues came up.

Then I changed the Stan code to my notation, using M rather than K and a couple other things, and still no problems.

Then I decided to make it a bit harder by setting the true means of the mixture components to be identical, changing "mu <- c(-2, 0, 2)" in the above code to

mu <- c(0, 0, 0)

Convergence was a bit worse, but it was still basically ok. So the problem didn't seem to be the geometry.

Next step was to add predictors to the model. I added them to the R code and the Stan code . . . and then the problem returned.

So I'd isolated the problem. It was with the regression predictors. But what was going on? One problem could be nonidentification of the constant term in the regression with the location parameters mu in the mixture model---but I'd been careful not to include a constant term in my regression, so it wasn't that.

Finding the bug

I stared at the code harder and found the problem! It was in this line of the Stan program:

      lps[m] += normal_lpdf(v[n] | X*beta + mu[m], sigma[m]);

The problem is that the code is doing one data point at a time, but "X*beta" has all the data together! So I fixed it. I changed the above line to:

      lps[m] += normal_lpdf(v[n] | Xbeta[n] + mu[m], sigma[m]);

and added the following line to the beginning of the model block:

  vector[N] Xbeta = X*beta;

Now it all works. I found the bug.

What next?

The code runs and does what it is supposed to do. Great. Now I have to go back to the larger analysis and see whether everything makes sense.

Here was the output of the simple normal linear regression fit to the data I'd simulated:

            Median MAD_SD
(Intercept)  0.57   0.10 
x           -0.16   0.01 

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.02   0.02

And here was the result of fitting the regression model in Stan with error term being a mixture of 3 normals:

    variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
 lp__        -1340.21 -1339.91 2.85 2.77 -1345.21 -1335.96 1.00      681      458
 beta[1]        -0.16    -0.16 0.01 0.01    -0.18    -0.14 1.00     2139     2357
 lambda[1]       0.32     0.14 0.34 0.19     0.01     0.94 1.02      419      977
 lambda[2]       0.39     0.27 0.35 0.36     0.01     0.95 1.01      506      880
 lambda[3]       0.28     0.11 0.32 0.15     0.01     0.92 1.01      543      879
 mu[1]          -0.14    -0.02 0.61 0.57    -1.43     0.57 1.02      385      281
 mu[2]           0.58     0.56 0.51 0.33    -0.25     1.56 1.01      474      536
 mu[3]           1.53     1.37 1.56 0.67     0.60     2.42 1.01      562     1025
 sigma[1]        0.77     0.85 0.30 0.22     0.21     1.09 1.01      522      481
 sigma[2]        0.85     0.93 0.32 0.16     0.27     1.16 1.00      644     1028
 sigma[3]        0.77     0.79 0.50 0.29     0.25     1.11 1.00      910     1056
 log_sigma_0    -0.35    -0.34 0.64 0.65    -1.42     0.68 1.00     1474     1605

The estimate for the slope parameter, beta, seems fine, but it's hard to judge mu and sigma. Ummm, we can take a weighted average for mu, 0.32*(-0.14) + 0.39*0.58 + 0.28*1.53 = 0.61, which seems kind of ok although a bit off from the 0.57 we got from the linear regression. What about sigma? It's harder to tell. We can compute the weighted variance of the mu's plus the weighted average of the sigma^2's, but it's getting kinda messy, also really we want to account for the posterior uncertainty---as you can see from above, these uncertainty intervals are really wide, which makes sense given that we're fitting a mixture of 3 normals to data that were simulated from a single normal distribution . . . Ok, this is getting messy. Let's just do it right.

I added a generated quantities block to the Stan program to compute the total mean and standard deviation of the error term:

generated quantities {
  real mu_total = sum(lambda.*mu)/sum(lambda);
  real sigma_total = sqrt(sum(lambda.*((mu - mu_total).^2 + sigma.^2))/sum(lambda));
}
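
For reference, these generated quantities are just the mean and variance of a discrete mixture of normals: if the error term comes from component m with probability lambda_m, then

E[\epsilon] = \sum_{m=1}^{M} \lambda_m \mu_m, \qquad \mathrm{Var}[\epsilon] = \sum_{m=1}^{M} \lambda_m \bigl( \sigma_m^2 + (\mu_m - E[\epsilon])^2 \bigr),

that is, the weighted average of the within-component variances plus the weighted variance of the component means. (The divisions by sum(lambda) in the code are redundant, since lambda is a simplex, but they're harmless.)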

And here's what came out:

    variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
 mu_total        0.56     0.56 0.11 0.11     0.39     0.74 1.00     2172     2399
 sigma_total     1.02     1.02 0.03 0.02     0.98     1.06 1.00     3116     2052

Check. Very satisfying.

The next step is to continue on with the research on the problem that motivated all this. Fitting a regression with a mixture model for the errors was just a little technical thing I needed to do. It's annoying that it took many hours (not even counting the hour it took to write this post!) and even more annoying that I can't do much with this---it's just a stupid little bug, nothing that we can even put in our workflow book, I'm sorry to say---but now it's time to move on.

Sometimes, cleaning the code, or getting the model to converge, or finding a statistically-significant result, or whatever, takes so much effort that when we reach that ledge, we just want to stop and declare victory right there. But we can't! Or, at least, we shouldn't! Getting the code to do what we want is just a means to an end, not an end in itself.

P.S. I cleaned the code some more, soft-constraining mu_total to 0 so I could include the intercept back into the model. Here's the updated Stan program:

data {
  int M;
  int N;
  int K;
  vector[N] v;
  matrix[N,K] X;
}
parameters {
  vector[K] beta;
  simplex[M] lambda;
  ordered[M] mu;
  vector<lower=0>[M] sigma;
  real log_sigma_0;
}
model {
  vector[N] Xbeta = X*beta;
  lambda ~ lognormal(log(1./M), 1);
  mu ~ normal(0, 10);
  sigma ~ lognormal(log_sigma_0, 1);
  sum(lambda.*mu) ~ normal(0, 0.1);
  for (n in 1:N){
    vector[M] lps = log(lambda);
    for (m in 1:M){
      lps[m] += normal_lpdf(v[n] | Xbeta[n] + mu[m], sigma[m]);
    }
    target += log_sum_exp(lps);
  }
}
generated quantities {
  real mu_total = sum(lambda.*mu);
  real sigma_total = sqrt(sum(lambda.*((mu - mu_total).^2 + sigma.^2)));
}

Exploring pre-registration for predictive modeling

This is Jessica. Jake Hofman, Angelos Chatzimparmpas, Amit Sharma, Duncan Watts, and I write:

Amid rising concerns of reproducibility and generalizability in predictive modeling, we explore the possibility and potential benefits of introducing pre-registration to the field. Despite notable advancements in predictive modeling, spanning core machine learning tasks to various scientific applications, challenges such as data-dependent decision-making and unintentional re-use of test data have raised questions about the integrity of results. To help address these issues, we propose adapting pre-registration practices from explanatory modeling to predictive modeling. We discuss current best practices in predictive modeling and their limitations, introduce a lightweight pre-registration template, and present a qualitative study with machine learning researchers to gain insight into the effectiveness of pre-registration in preventing biased estimates and promoting more reliable research outcomes. We conclude by exploring the scope of problems that pre-registration can address in predictive modeling and acknowledging its limitations within this context.

Pre-registration is no silver bullet to good science, as we discuss in the paper and later in this post. However, my coauthors and I are cautiously optimistic that adapting the practice could help address a few problems that can arise in predictive modeling pipelines like research on applied machine learning. Specifically, there are two categories of concerns where pre-specifying the learning problem and strategy may lead to more reliable estimates. 

First, most applications of machine learning are evaluated using predictive performance. Usually we evaluate this on held-out test data, because it’s too costly to obtain a continuous stream of new data for training, validation and testing. The separation is crucial: performance on held-out test data is arguably the key criterion in ML, so making reliable estimates of it is critical to avoid a misleading research literature. If we mess up and access the test data during training (test set leakage), then the results we report are overfit. It’s surprisingly easy to do this (see e.g., this taxonomy of types of leakage that occur in practice). While pre-registration cannot guarantee that we won’t still do this anyway, having to determine details like how exactly features and test data will be constructed a priori could presumably help authors catch some mistakes they might otherwise make.

Beyond test set leakage, other types of data-dependent decisions threaten the validity of test performance estimates. Predictive modeling problems admit many degrees-of-freedom that authors can (often unintentionally) exploit in the interest of pushing the results in favor of some a priori hypothesis, similar to the garden of forking paths in social science modeling. For example, researchers may spend more time tuning their proposed methods than baselines they compare to, making it look like their new method is superior when it is not. They might report on straw man baselines after comparing test accuracy across multiple variations. They might only report the performance metrics that make test performance look best. Etc. Our sense is that most of the time this is happening implicitly: people end up trying harder for the things they are invested in. Fraud is not the central issue, so giving people tools to help them avoid unintentionally overfitting is worth exploring.

Whenever the research goal is to provide evidence on the predictability of some phenomena (Can we predict depression from social media? Can we predict civil war onset? etc.) there’s a risk that we exploit some freedoms in translating the high level research goal to a specific predictive modeling exercise. To take an example my co-authors have previously discussed, when predicting how many re-posts a social media post will get based on properties of the person who originally posted, even with the dataset and model specification held fixed, exercising just a few degrees of freedom can change the qualitative nature of the results. If you treat it as a classification problem and build a model to predict whether a post will receive at least 10 re-posts, you can get accuracy close to 100%. If you treat it as a regression problem and predict how many re-posts a given post gets without any data filtering, R^2 hovers around 35%. The problem is that only a small fraction of posts exceed the threshold of 10 re-posts, and predicting which posts do—and how far they spread—is very hard.  Even when the drift in goal happens prior to test set access, the results can paint an overly optimistic picture. Again pre-registering offers no guarantees of greater construct validity, but it’s a way of encouraging authors to remain aware of such drift. 

The specific proposal

One challenge we run into in applying pre-registration to predictive modeling is that because we usually aren’t aiming for explanation, we’re willing to throw lots of features into our model, even if we’re not sure how they could meaningfully contribute, and we’re agnostic to what sort of model we use so long as its inductive bias seems to work for our scenario. Deciding the model class ahead of time as we do in pre-registering explanatory models can be needlessly restrictive. So, the protocol we propose has two parts. 

First, prior to training, one answers the following questions, which are designed to be addressable before looking at any correlations between features and outcomes:

Phase 1 of the protocol: learning problem, variables, dataset creation, transformations, metrics, baselines

Then, after training and validation but before accessing test data, one answers the remaining questions:

Phase 2 of the protocol: prediction method, training details, test data access, anything else

Authors who want to try it can grab the forms by forking this dedicated github repo and including them in their own repository.

What we’ve learned so far

To get a sense of whether researchers could benefit from this protocol, we observed as six ML Ph.D. students applied it to a prediction problem we provided (predicting depression in teens using responses to the 2016 Monitoring the Future survey of 12th graders, subsampled from data used by Orben and Przybylski). This helped us see where they struggled to pre-specify decisions in phase 1, presumably because doing so was out of line with their usual process of figuring some things out as they conducted model training and validation. We had to remind several to be specific about metrics and data transformations in particular. 

We also asked them in an exit interview what else they might have tried if their test performance had been lower than they expected. Half of the six participants described procedures that, if not fully reported, seemed likely to compromise the validity of their test estimates (things like going back to re-tune hyperparameters then trying again on test data). This suggests that there’s an opportunity for pre-registration, if widely adopted, to play a role in reinforcing good workflow. This may be especially useful in fields where ML models are being applied but expertise in predictive modeling is still sparse.

The caveats 

It was reassuring to directly observe examples where this protocol, if followed, might have prevented overfitting. However, the fact that we saw these issues despite having explained and motivated pre-registration during these sessions, and walked the participants through it, suggests that pre-specifying certain components of a learning pipeline alone is not necessarily enough to prevent overfitting. 

It was also notable that while all of the participants but one saw value in pre-registering, their specific understandings of why and how it could work varied. There was as much variety in their understandings of pre-registration as there was in ways they approached the same learning problem. Pre-registration is not going to be the same thing to everyone nor even used the same way, because the ways it helps are multi-faceted. As a result, it’s dangerous to interpret the mere act of pre-registration as a stamp of good science. 

I have some major misgivings about putting too much faith into the idea that publicly pre-registering guarantees that estimates are valid, and hope that this protocol gets used responsibly, as something authors choose to do because they feel it helps them prevent unintentional overfitting rather than the simple solution that guarantees to the world that your estimates are gold. It was nice to observe that a couple of study participants seemed particularly drawn to the idea of pre-registering based on perceived “intrinsic” value, remarking about the value they saw in it as a personally-imposed set of constraints to incorporate in their typical workflow.

It won’t work for all research projects. One participant figured out while talking aloud that prior work he’d done identifying certain behaviors in transformer models would have been hard to pre-register because it was exploratory in nature.

Another participant fixated on how the protocol was still vulnerable: people could lie about not having already experimented with training and validation, there’s no guarantee that the train/test split authors describe is what they actually used to produce their estimates, etc. Computer scientists tend to be good at imagining loopholes that adversarial attacks could exploit, so maybe they will be less likely to oversell pre-registration as guaranteeing validity. At the end of the day, it’s still an honor system. 

As we’ve written before, part of the issue with many claims in ML-based research is that often performance estimates for some new approach represent something closer to best-case performance due to overlooked degrees of freedom, but they can get interpreted as expected performance. Pre-registration is an attempt at ensuring that the estimates that get reported are more likely to represent what they’re meant to be. Maybe it’s better, though, to try to change readers’ perception that such estimates can be taken at face value to begin with. I’m not sure.

We’re open to feedback on the specific protocol we provide and curious to hear how it works out for those who try it. 

P.S. Against my better judgment, I decided to go to NeurIPS this year. If you want to chat pre-registration or threats to the validity of ML performance estimates find me there Wed through Sat.

In July 2015 I was spectacularly wrong

See here.

Also interesting was this question that I just shrugged aside:

If a candidate succeeds in winning a nomination and goes on to win the election and reside in the White House do they have to give up their business interests as these would be seen as a conflict of interest? Can a US president serve in office and still have massive commercial business interests abroad?

Hey, Taiwan experts! Here’s a polling question for you:

Lee-Yen Wang writes:

I have a question about comparing polls in presidential elections in Taiwan.

The four candidates are:

– Mr. A: Hou Yu-ih, mayor of New Taipei City, from Taiwan’s main opposition Kuomintang (KMT) party.

– Mr. B: Dr. Ko Wen-je, ex-mayor of Taipei, Chairman of the People’s Party (TPP), a much smaller opposition party than the KMT.

– Mr. C: Lai Ching-te, Vice President and the Chairman of the Democratic Progressive Party (DPP), the ruling party.

– Ms. D: Hsiao Bi-khim, former envoy to the United States and a member of the DPP.

My question concerns the best method for comparing polls to determine the optimal candidate pairing for president and vice president.

The scenario involves two competing candidates, Mr. A and Mr. B, who want to decide who should be the presidential candidate and who should be the vice presidential candidate, while facing a decided pair, CD.

Polls are conducted for both AB vs. CD and BA vs. CD. Different combinations of AB and BA may have varying strengths against CD. The poll comparisons are laid out below; only the title row of the comparison table is shown:

AB | CD | AB-CD | BA | CD | BA-CD | (AB-CD) - (BA-CD)

My question is: which approach is more reasonable for determining which pair of combinations, AB or BA, is stronger against CD?

A and B campaigned heavily for the presidential nomination, with the loser in the polls becoming the vice presidential candidate. However, AB-CD elicits a different response in the polls than BA-CD.

This can be simulated: AB vs. CD is 48% to 46%, while BA vs. CD is 41% to 32%. In the case of AB vs. BA, AB beats BA by 7%. However, if we use the difference of the difference, we get 2% vs. 9%, and the difference of the difference is 7%, favoring BA. These two approaches yield inconsistent results. In a high-stakes election where both candidates campaign intensely, which approach do you suggest is theoretically sound for comparison?

The aforementioned scenario played out vividly in the recent presidential election in Taiwan in January. The controversy over resolving the tie ultimately led to the split of candidates A and B just before the official candidate registration deadline.

They each quickly chose their own vice presidential candidates to run against the CD pair.

The key question is how best to compare the poll results. There are several options:
– Should we simply compare AB vs. BA using the margin of error and ignore the situation to compare the difference of the difference?
– Should we use the margin of error as a metric and apply it to compare the difference of the difference?
– Should we instead use the ratio of estimation and normalize the probability of AB and BA?
– Should we employ the ratio method of estimation and compare the difference of the difference, where we normalize p = (AB - CD)/((AB - CD) + (BA - CD)) and q = 1 - p? The question is whether this operation makes sense.

Are there other effective methods for comparing surveys that could solve this dilemma?

I have no idea what’s going on in this story. No idea at all. But I thought some Taiwan experts might be interested, so I’m posting here.

Modest pre-registration

This is Jessica. In light of the hassles that can arise when authors make clear that they value pre-registration by writing papers about its effectiveness but then they can’t find their pre-registration, I have been re-considering how I feel about the value of the public aspects of pre-registration. 

I personally find pre-registration useful, especially when working with graduate students (as I am almost always doing). It gets us to agree on what we are actually hoping to see and how we are going to define the key quantities we compare. I trust my Ph.D. students, but when we pre-register we are more likely to find the gaps between our goals and the analyses that we can actually do because we have it all in a single document that we know cannot be further revised after we start collecting data.

Shravan Vasishth put it well in a comment on a previous post:

My lab has been doing pre-registrations for several years now, and most of the time what I learned from the pre-registration was that we didn’t really adequately think about what we would do once we have the data. My lab and I are getting better at this now, but it took many attempts to do a pre-registration that actually made sense once the data were in. That said, it’s still better to do a pre-registration than not, if only for the experimenter’s own sake (as a sanity pre-check). 

The part I find icky is that as soon as pre-registration gets discussed outside the lab, it often gets applied and interpreted as a symbol that the research is rigorous. Like the authors who pre-register must be doing “real science.” But there’s nothing about pre-registration to stop sloppy thinking, whether that means inappropriate causal inference, underspecification of the target population, overfitting to the specific experimental conditions, etc.

The Protzko et al. example could be taken as unusual, in that we might not expect the average reviewer to feel the need to double check the pre-registration when they see that the author list includes Nosek and Nelson. On the other hand, we could see it as particularly damning evidence of how pre-registration can fail in practice, when some of the researchers we associate with the highest standards of methodological rigor don’t appear to take the claims they make about what practices were followed seriously enough to make sure they can back them up when asked.

My skepticism about how seriously we should take public declarations of pre-registration is influenced by my experience as author and reviewer, where, at least in the venues I’ve published in, when you describe your work as pre-registered it wins points with reviewers, increasing the chances that someone will comment about the methodological rigor, that your paper will win an award, etc. However, I highly doubt the modal reviewer or reader is checking the preregistration. At least, no reviewer has ever asked a single question about the pre-registration in any of the studies I’ve ever submitted, and I’ve been using pre-registration for at least 5 or 6 years. I guess it’s possible they are checking it and it’s just all so perfectly laid out in our documents and followed to a T that there’s nothing to question. But I doubt that… surely at some point we’ve forgotten to fully report a pre-specified exploratory analysis, or the pre-registration wasn’t clear, or something else like that. Not a single question, ever? That seems fishy.

Something I dislike about authors’ incentives when reporting on their methods in general is that reviewers (and readers) can often be unimaginative. So what the authors say about their work can set the tone for how the paper is received. I hate when authors describe their own work in a paper as “rigorous” or “highly ecologically valid” or “first to show” rather than just allowing the details to speak for themselves. It feels like cheap marketing. But I can understand why some do it, because one really can impress some readers by saying such things. Hence, points won for mentioning pre-registration, but no real checks and balances, can be a real issue.

How should we use pre-registration in light of all this? If nobody cares to do the checking, but extra credit is being handed out when authors slap the “pre-registered” label on their work, maybe we want to pre-register more quietly.

At the extreme, we could pre-register amongst ourselves, in our labs or whatever, without telling everyone about it. Notify our collaborators by email or slack or whatever else when we’ve pinned down the analysis plan and are ready to collect the data but not expect anyone else to care, except maybe when they notice that our research is well-engineered in general, because we are the kind of authors who do our best to keep ourselves honest and use transparent methods and subject our data to sensitivity analyses etc. anyways.

I’ve implied before on the blog that pre-registration is something I find personally useful but see externally as a gesture toward transparency more than anything else. If we can’t trust authors when they claim to pre-register, but we don’t expect the reviewing or reading standards in our communities to evolve to the point where checking to see what it actually says becomes mainstream, then we could just omit the signaling aspect altogether and continue to trust that people are doing their best. I’m not convinced we would lose much in such a world as pre-registration is currently practiced in the areas I work in. Maybe the only real way to fix science is to expect people to find reasons to be self-motivated to do good work. And if they don’t, well, it’s probably going to be obvious in other ways than just a lack of pre-registration. Bad reasoning should be obvious and if it’s not, maybe we should spend more time training students on how to recognize it.

But of course this seems unrealistic, since you can’t stop people from saying things in papers that they think reviewers will find relevant. And many reviewers have already shown they find it relevant to hear about a pre-registration. Plus of course the only real benefit we can say with certainty that pre-registration provides is that if one pre-registers, others can verify to what extent the analysis was planned beforehand and was therefore less subject to authors exploiting degrees of freedom, so we’d lose this.

An alternative strategy is to be more specific about pre-registration while crowing about it less. Include the pre-registration link in your manuscript but stop with all the label-dropping that often occurs, in the abstract, the introduction, sometimes in the title itself describing how this study is pre-registered. (I have to admit, I have been guilty of this, but from now on I intend to remove such statements from papers I’m on).

Pre-registration statements should be more specific, in light of the fact that we can’t expect reviewers to catch deviations themselves. E.g., if you follow your pre-registration to a T, say something like “For each of our experiments, we report all sample sizes, conditions, data exclusions, and measures for the main analyses that were described in our pre-registration documents. We do not report any analyses that were not included in our pre-registration.” That makes it clear what you are knowingly claiming regarding the pre-registration status of your work. 

Of course, some people may say reasonably specific things even when they can’t back them up with a pre-registration document. But being specific at least acknowledges that a pre-registration is actually a bundle of details that we must mind if we’re going to claim to have done it, because they should impact how it’s assessed. Plus maybe the act of typing out specific propositions would remind some authors to check what their pre-registration actually says. 

If you don’t follow your pre-registration to a T, which I’m guessing is more common in practice, then there are a few strategies I could see using:

Put in a dedicated paragraph before you describe results detailing all deviations from what you pre-registered. If it’s a whole lot of stuff, perhaps the act of writing this paragraph will convince you to just skip reporting on the pre-registration altogether because it clearly didn’t work out. 

Label each individual comparison/test as pre-registered versus not as you walk through results. Personally I think this makes things harder to keep track of than a single dedicated paragraph, but maybe there are occasionally situations where it’s better.

“Reading Like It’s 1965”: Fiction as a window into the past

Raghu Parthasarathy writes:

The last seven books I read were all published in 1965. I decided on this literary time travel after noticing that I unintentionally read two books in a row from 1965. I thought: Why not continue? Would I get a deep sense of the mid-1960s zeitgeist? I don’t think so . . .

Contra Raghu, I do think that reading old books gives us some sense of how people used to live, and how they used to think. I have nothing new to offer on this front, but here are some relevant ideas we’ve discussed before:

1. The Speed Racer principle: Sometimes the most interesting aspect of a scientific or cultural product is not its overt content but rather its unexamined assumptions.

2. Storytelling as predictive model checking: Fiction is the working out of possibilities. Nonfiction is that too, just with more constraints.

3. Hoberman and Deliverance: Some cultural artifacts are striking because of what they leave out. My go-to example here is the book Deliverance, which was written during the U.S.-Vietnam war and, to my mind, is implicitly all about that war even though I don’t think it is mentioned even once in the book.

4. Also, Raghu mentions Stoner so I’ll point you to my post on the book. In the comments section, Henry Farrell promises us an article called “What Meyer and Rowan on Myth and Ceremony tells us about Forlesen.” So, something to look forward to.

5. And Raghu mentions Donald Westlake. As I wrote a few years ago, my favorite Westlake is Killing Time, but I also like Memory. And then there’s The Ax. And Slayground’s pretty good too. And Ordo, even if it’s kind of a very long extended joke on the idea of murder. Overall, I do think there’s a black hole at the center of Westlake’s writing: as I wrote a few years ago, he has great plots and settings and charming characters, but nothing I’ve ever read of his has the emotional punch of, say, Scott Smith’s A Simple Plan (to choose a book whose plot would fit well into the Westlake canon). But, hey, nobody can do everything. Also see here and here.

“Other than when he treated Steve Jobs, Agus, 58, had never been told anything besides that he’s awesome . . .”

Remember when we talked about the problem with the “scientist-as-hero” narrative?

Here’s another example. Celebrity doctor / USC professor David Agus had a side gig putting his name on plagiarized books that he never read.

I get that some people are busy, but talk about lazy! Putting your name on a book you didn’t write is one thing, but not even reading it! C’mon, dude.

But—hey—you can see what happened, right? The title of this post is a line from a magazine article about the story. I don’t actually like the article because it seems like it was written by Agus’s publicist (it credits Agus as being “very involved” with the books that have his name on them, without explaining how a “very involved” author didn’t notice an entire section of plagiarized material about giraffes—how exactly did he “dictate the substance” of that bit??), but that one line about “never been told anything besides that he’s awesome” is a good one, and I think it captures a big problem with how science is reported.

Similar problems arise with non-scientist celebrity academics too. Alan Dershowitz and Steven Pinker got nearly uniformly-positive coverage until it came out that they were doing favors for Jeffrey Epstein. Cass Sunstein was swaggering around saying he’d discovered a new continent (i.e., he wrote a book about a topic he knew next to nothing about). Edgelord Marc Hauser was riding high until he wasn’t, etc etc etc.

These people get so much deference that they just take it as their due. I’d prefer more historically-informed paradigms of scientific progress.

P.S. How was it that Los Angeles Magazine decided to run an article presenting the plagiarizing professor as a good guy, an innocent victim of his ghostwriter? A clue comes from Google . . . an article by that same author in that same magazine from 2022, describing the not-yet-acknowledged plagiarist as “genius Forrest Gump, a soft-spoken and menschy cancer researcher,” and an article from 2014 where “Pioneering biomedical researcher David Agus reveals which clock stoppers excite him most,” and another from 2016 featuring “Longevity expert Dr. David Agus,” and, on the plus side, this article from 2021 encouraging people to take the covid vaccine.

Also this:

Considering that this guy is “always uncomfortable being the focus of any media,” it’s funny that he has a page full of clips of TV and promotional appearances:

Including . . . a picture of a giraffe! Dude has giraffes on his mind. I’m starting to be suspicious of his implicit claim that he never read the chapter in his latest book that was plagiarized from “a 2016 blog post on the website of a South African safari company titled, ‘The Ten Craziest Facts You Should Know About A Giraffe.'”

EJG Pitman’s Notes on Non-Parametric Statistical Inference

Nigel Smeeton writes:

I see from the old online post, “The greatest works of statistics never published,” that there is interest in EJG Pitman’s Notes on Non-Parametric Statistical Inference.

Working from a poor online scan of the Notes, EJG Pitman’s early papers, and with the assistance of Jim Pitman and a US librarian, I have been able to resurrect the document and create a pdf file, now included in the Mimeo Series held by the North Carolina State University library.

Here it is.

I took a quick look and I’d say it’s more of historical interest than anything else. It’s all about hypothesis testing (sample bits: “We may have to decide from samples whether the distributions of two chance variables X and Y are the same or different” and “The question we wish to decide is ‘Is the mean of the population zero or not? Does the mean of the sample differ significantly from zero?'”), which I guess was what academic statisticians were mostly concerned with back in 1949.

Still, historical interest isn’t nothing, so I’m sharing it here. Enjoy.

On a proposal to scale confidence intervals so that their overlap can be more easily interpreted

Greg Mayer writes:

Have you seen this paper by Frank Corotto, recently posted to a university depository?

It advocates a way of doing box plots using “comparative confidence intervals” based on Tukey’s HSD in lieu of traditional error bars. I would question whether the “Error Bar Overlap Myth” is really a myth (i.e. a widely shared and deeply rooted but imaginary way of understanding the world) or just a more or less occasional misunderstanding, but whatever its frequency, I thought you might be interested, given your longstanding aversion to box plots, and your challenge to the world to find a use for them. (I, BTW, am rather fond of box plots.)

My reply: Clever but I can’t imagine ever using this method or recommending it to others. The abstract connects the idea to Tukey, and indeed the method reminds me of some of Tukey’s bad ideas from the 1950s involving multiple comparisons. I think the problem here is in thinking of “statistical significance” as a goal in the first place!

I’m not saying it was a bad idea for this paper to be written. The concept could be worth thinking about, even if I would not recommend it as a method. Not every idea has to be useful. Interesting is important too.

Russell’s Paradox of ghostwriters

A few months ago we discussed the repulsive story of a USC professor who took full credit for a series of books that were ghostwritten. It turned out that one of the books had “at least 95 separate passages” of plagiarism, including “long sections of a chapter on the cardiac health of giraffes.”

You’d think you’d remember a chapter on the cardiac health of giraffes. Indeed, if I hired someone to write a chapter under my name on the cardiac health of giraffes, I think I’d read it, just out of curiosity! But I guess this guy has no actual curiosity. He just wants another bestselling book so he can go on TV some more and mingle with rich and famous people.

OK, I’ve ranted enough about this guy. What I wanted to share today is a fascinating story from a magazine article about the affair, where the author, Joel Stein, writes, “Nearly all experts and celebrities use ghostwriters,” and then links to an amusing magazine article from 2009 subtitled, “If Sarah Palin can write a memoir in four months, can I write my life story in an afternoon?”:

When I heard that Sarah Palin wrote her upcoming 400-page autobiography, Going Rogue: An American Life, in four months, I thought, What took her so long? To prove that introspection doesn’t need to be time-consuming, I decided to try to write my memoir in one day. Since Palin had a ghostwriter, I figured it was only fair that I have help too, so I called Neil Strauss, who co-wrote the best-selling memoirs of Marilyn Manson, Mötley Crüe, Dave Navarro and Jenna Jameson. . . .

The whole article is fun. They wrote a whole memoir in an afternoon!

That particular memoir-book was a gag, but it got me thinking of this general idea of recursive writing. A writer hiring a ghostwriter . . . what a great idea! Of course this happens all the time when the writer is a brand name, as with James Patterson. But then what if Patterson’s ghostwriter is busy and hires a ghostwriter of his own . . .

Perhaps the most famous ghostwritten book is The Autobiography of Malcolm X, by Alex Haley. After Roots came out, the Malcolm X autobiography was promoted heavily based on the Haley authorship. On the other hand, parts of Roots were plagiarized, which is kind of like a ghostwriter hiring a ghostwriter.

A writer hiring a writer to do his writing . . . that sounds so funny! But should it? I’m a professional writer and I call upon collaborators all the time. Collaborative writing is very rare in literary writing; it sometimes happens in nonliterary writing (for example here, or for a less successful example, here), but usually there it follows a model of asymmetric collaboration, as with Freakonomics where Levitt supplied the material, Dubner supplied the writing, but I assume that both the content and the writing benefited from conversations between the authors.

One of the common effects of ghostwriting is to give a book a homogenized style. Writers of their own books will have their original styles—most of us cannot approach the caliber of Mark Twain, Virginia Woolf, or Jim Thompson, but style is part of how you express yourself—and nonprofessional writers can have charming idiosyncratic styles of their own. The homogenized, airport-biography style comes from writers who are talented enough to produce this sort of thing on demand, while having some financial motivation not to express originality. In contrast, Malcolm Gladwell deserves credit for producing readable prose while having his own interesting style. I doubt he uses a ghostwriter.

Every once in a while, though, there will be a ghostwriter who adds compelling writing of his own. One example is the aforementioned Alex Haley; another is the great Leonard Shecter. I’d say Stephen Dubner too, but I see him as more of a collaborator than a hired gun. Also Ralph Leighton: much of the charm in the Feynman memoirs is that voice, and you gotta give the ghostwriter some of the credit here, even if only to keep that voice as is and not replace it with generic prose.

There must be some other ghostwriters who added style rather than blandness, although I can’t think of any examples right now.

More generally, I remain interested in the idea that collaboration is so standard in academic writing (even when we are writing fiction) and for Hollywood/TV scripts (as discussed in comments) and so unusual elsewhere, with the exception of ghostwriting.

Effective Number of Parameters in a Statistical Model

This is my talk for a seminar organized by Joe Suzuki at Osaka University on Tues 10 Sep 2024, 8:50-10:20am Japan time / 19:50-21:20 NY time:

Effective Number of Parameters in a Statistical Model

Andrew Gelman, Department of Statistics, Columbia University

Degrees-of-freedom adjustment for estimated parameters is a general idea in small-sample hypothesis testing, uncertainty estimation, and assessment of prediction accuracy. The effective number of parameters gets interesting in the presence of nonlinearity, constraints, boundary conditions, hierarchical models, informative priors, discrete parameters, and other complicating factors. Many open questions remain, including: (a) defining the effective number of parameters, (b) measuring how the effective number of parameters can depend on data and vary across parameter space, and (c) understanding how the effective number of parameters changes as sample size increases. We discuss these questions using examples from demographics, imaging, pharmacology, political science, and other application areas.

It will be a remote talk—I won’t be flying to Japan—so maybe the eventual link will be accessible to outsiders.

It feels kinda weird to be scheduling a talk nearly a year in advance, but since I had to give a title and abstract anyway, I thought I’d share it with you. My talk will be part of a lecture series they are organizing for graduate students at Osaka, “centered around WAIC/WBIC and its mathematical complexities, covering topics such as Stan usage, regularity conditions, WAIC, WBIC, cross-validation, SBIC, and learning coefficients.” I’m not gonna touch the whole BIC thing.

Now that we have loo, I don’t see any direct use for “effective number of parameters” in applied statistics, but the concept still seems important for understanding fitted models, in vaguely the same way that R-squared is useful for understanding, even though it does not answer any direct question of interest. I thought it could be fun to give a talk on all the things that confuse me about effective number of parameters, because I think it’s a concept that we often take for granted without fully thinking through.
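
Just to fix ideas, one standard operationalization is the WAIC-based count, which adds up the posterior variances of the pointwise log predictive density:

p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \mathrm{Var}_{\mathrm{post}}\bigl[ \log p(y_i \mid \theta) \bigr].

For a regular model with weak priors and lots of data this is approximately the actual number of parameters, but it can be much smaller when informative priors or hierarchical structure constrain the fit, which is part of what makes the general concept slippery.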

Agreeing to give the talk could also motivate me to write a paper on the topic, which I’d like to do, given that it’s been bugging me for about 35 years now.

(back to basics:) How is statistics relevant to scientific discovery?

Following up on today’s post, “Why I continue to support the science reform movement despite its flaws,” it seems worth linking to this post from 2019, about the way in which some mainstream academic social psychologists have moved beyond denial, to a more realistic view that accepts that failure is a routine, indeed inevitable part of science, and that, just because a claim is published, even in a prestigious journal, that doesn’t mean it has to be correct:

Once you accept that the replication rate is not 100%, nor should it be, and once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists whom academic insiders used to refer to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is, ultimately, the discovery of truths that ultimately would be discovered by someone else sometime in the future, thus, the speeding along of a process that we’d hope would happen anyway, so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery. . . .

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. . . .

Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start. . . .

What, then, is—or should be—the role of statistics, and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making.

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work. . . .

We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

– Reduce the costs of failed experimentation by being more clear when research-based claims are speculative.

– React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

– Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

– Avoid the two-tier system. Give respect to a student project or Arxiv paper just as you would to a paper published in Science or Nature.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

There’s more at the link, and also let me again plug my recent article, Before data analysis: Additional recommendations for designing experiments to learn about the world.

Why I continue to support the science reform movement despite its flaws

I was having a discussion with someone about problems with the science reform movement (as discussed here by Jessica), and he shared his opinion that “Scientific reform in some corners has elements of millenarian cults. In their view, science is not making progress because of individual failings (bias, fraud, qrps) and that if we follow a set of rituals (power analysis, preregistration) devised by the leaders then we can usher in a new era where the truth is revealed (high replicability).”

My quick reaction was that this reminded me of an annoying thing where people use “religion” as a term of insult. When this came up before, I wrote that maybe it’s time to retire use of the term “religion” to mean “uncritical belief in something I disagree with.”

But then I was thinking about this all from another direction, and I think there’s something there there. Not the “millenarian cults” thing, which I think was an overreaction on my correspondent’s part.

Rather, I see a paradox. From his perspective, my correspondent sees the science reform movement as having a narrow perspective, an enforced conformity that leads it into unforced errors such as publishing a high-profile paper promoting preregistration without actually itself following preregistered analysis plans. OK, he doesn’t see all of the science reform movement as being so narrow—for one thing, I’m part of the science reform movement and I wasn’t part of that project!—but he sees some core of the movement as being stuck in narrow rituals and leader-worship.

But I think it’s kind of the opposite. From my perspective, the core of the science reform movement (the Open Science Framework, etc.) has had to make all sorts of compromises with conservative forces in the science establishment, especially within academic psychology, in order to keep them on board. To get funding, institutional support, buy-in from key players, . . . that takes a lot of political maneuvering.

I don’t say this lightly, and I’m not using “political” as a put-down. I’m a political scientist, but personally I’m not very good at politics. Politics takes hard work, requiring lots of patience and negotiation. I’m impatient and I hate negotiation; I’d much rather just put all my cards face-up on the table. For some activities, such as blogging and collaborative science, these traits are helpful. I can’t collaborate with everybody, but when the connection’s there, it can really work.

But there’s more to the world than this sort of small-group work. Building and maintaining larger institutions, that’s important too.

So here’s my point: Some core problems with the open-science movement are not a product of cult-like groupthink. Rather, it’s the opposite: this core has been structured out of a compromise with some groups within psychology who are tied to old-fashioned thinking, and this politically-necessary (perhaps) compromise has led to some incoherence, in particular the attitude or hope that, by just including some preregistration here and getting rid of some questionable research practices there, everyone could pretty much continue with business as usual.

Summary

The open-science movement has always had a tension between burn-it-all-down and here’s-one-quick-trick. Put them together and it kinda sounds like a cult that can’t see outward, but I see it as more the opposite, as an awkward coalition representing fundamentally incoherent views. But both sides of the coalition need each other: the reformers need the old institutional powers to make a real difference in practice, and the oldsters need the reformers because outsiders are losing confidence in the system.

The good news

The good news for me is that both groups within this coalition should be able to appreciate frank criticism from the outside (they can listen to me scream and get something out of it, even if they don’t agree with all my claims) and should also be able to appreciate research methods: once you accept the basic tenets of the science reform movement, there are clear benefits to better measurement, better design, and better analysis. In the old world of p-hacking, there was no real reason to do your studies well, as you could get statistical significance and publication with any old random numbers, along with a few framing tricks. In the new world of science reform (even imperfect science reform), this sort of noise mining isn’t so effective, and traditional statistical ideas of measurement, design, and analysis become relevant again.
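To see why “any old random numbers” could do the trick, here is a toy simulation (my illustration, not from the post), assuming a hypothetical study that runs 20 independent comparisons on pure noise and reports whatever comes up significant:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_comparisons, n_per_group = 1000, 20, 50
studies_with_a_hit = 0
for _ in range(n_sims):
    pvals = []
    for _ in range(n_comparisons):
        a = rng.normal(size=n_per_group)   # pure noise: no true effect anywhere
        b = rng.normal(size=n_per_group)
        pvals.append(stats.ttest_ind(a, b).pvalue)
    studies_with_a_hit += min(pvals) < 0.05
print(f"share of null studies with at least one p < 0.05: {studies_with_a_hit / n_sims:.2f}")
# roughly 1 - 0.95**20, about 0.64, even though every true effect is zero

Under honest reporting of all comparisons, as in the recommendations above, the same simulation makes clear that such “findings” are just noise.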

So that’s one reason I’m cool with the science reform movement. I think it’s in the right direction: its dot product with the ideal direction is positive. But I’m not so good at politics so I can’t resist criticizing it too. It’s all good.

Reactions

I sent the above to my correspondent, who wrote:

I don’t think it is a literal cult in the sense that carries the normative judgments and pejorative connotations we usually ascribe to cults and religions. The analogy was more of a shorthand to highlight a common dynamic that emerges when you have a shared sense of crisis, ritualistic/procedural solutions, and a hope that merely performing these activities will get past the crisis and bring about a brighter future. This is a spot where group-think can, and at times possibly should, kick in. People don’t have time to each individually and critically evaluate the solutions, and often the claim is that they need to be implemented broadly to work. Sometimes these dynamics reflect a real problem with real solutions, sometimes they’re totally off the rails. All this is not to say I’m opposed to scientific reform; I’m very much for it in the general sense. There’s no shortage of room for improvement in how we turn observations into understanding, from improving statistical literacy and theory development to transparency and fostering healthier incentives. I am, however, wary of the uncritical belief that the crisis is simply one of failed replications and that the performance of “open science rituals” is sufficient for reform, across the breadth of things we consider science. As a minor point, I don’t think the vast majority of prominent figures in open science intend for these dynamics to occur, but I do think they all should be wary of them.

There does seem to be a problem that many researchers are too committed to the “estimate the effect” paradigm and don’t fully grapple with the consequences of high variability. This is particularly disturbing in psychology, given that just about all psychology experiments study interactions, not main effects. Thus, a claim that effect sizes don’t vary much is a claim that effect sizes vary a lot in the dimension being studied, but have very little variation in other dimensions. Which doesn’t make a lot of sense to me.

Getting back to the open-science movement, I want to emphasize the level of effort it takes to conduct and coordinate these big group efforts, along with the effort required to keep together the coalition of skeptics (who see preregistration as a tool for shooting down false claims) and true believers (who see preregistration as a way to defuse skepticism about their claims) and get these papers published in top journals. I’d also say it takes a lot of effort for them to get funding, but that would be kind of a cheap shot, given that I too put in a lot of effort to get funding!

Anyway, to continue, I think that some of the problems with the science reform movement are that it effectively promises different things to different people. And another problem is with these massive projects that inevitably include things that not all the authors will agree with.

So, yeah, I have a problem with simplistic science reform prescriptions, for example recommendations to increase sample size without any nod toward effect size and measurement. But much much worse, in my opinion, are the claims of success we’ve seen from researchers and advocates who are outside the science-reform movement. I’m thinking here about ridiculous statements such as the unfounded claim of 17 replications of power pose, or the endless stream of hype from the nudgelords, or the “sleep is your superpower” guy, or my personal favorite, the unfounded claim from Harvard that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

It’s almost enough to stop here with the remark that the scientific reform movement has been lucky in its enemies.

But I also want to say that I appreciate that the “left wing” of the science reform movement—the researchers who envision replication and preregistration, and the threat of replication and preregistration, as tools to shoot down bad studies—have indeed faced real resistance within academia and the news media to their efforts, as lots of people will hate the bearers of bad news. And I also appreciate the “right wing” of the science reform movement—the researchers who envision replication and preregistration as a way to validate their studies and refute the critics—in that they’re willing to put their ideas to the test. Not always perfectly, but you have to start somewhere.

While I remain annoyed at certain aspects of the mainstream science reform movement, especially when it manifests itself in mass-authored articles such as the notorious recent non-preregistered paper on the effects of preregistration, or that “Redefine statistical significance” article, or various p-value hardliners we’ve encountered over the decades, I also respect the political challenges of coalition-building that are evident in that movement.

So my plan remains to appreciate the movement while continuing to criticize its statements that seem wrong or do not make sense.

I sent the above to Jessica Hullman, who wrote:

I can relate to being surprised by the reactions of open science enthusiasts to certain lines of questioning. In my view, how to fix science is about as complicated a question as we will encounter. The certainty/level of comfort with making bold claims that many advocates of open science seem to have is hard for me to understand. Maybe that is just the way the world works, or at least the way it works if you want to get your ideas published in venues like PNAS or Nature. But the sensitivity to what gets said in public venues against certain open science practices or people reminds me very much of established academics trying to hush talk about problems in psychology, as though questioning certain things is off limits. I’ve been surprised on the blog, for example, when I think aloud about something like preregistration being imperfect and some commenters seem to have a visceral negative reaction to seeing something like that written. To me that’s the opposite of how we should be thinking.

As an aside, someone I’m collaborating with recently described to me his understanding of the strategy for getting published in PNAS. It was 1. Say something timely/interesting, 2. Don’t be wrong. He explained that ‘Don’t be wrong’ could be accomplished by preregistering and large sample size. Naturally I was surprised to hear #2 described as if it’s really that easy. Silly me for spending all this time thinking so hard about other aspects of methods!

The idea of necessary politics is interesting; not what I would have thought of but probably some truth to it. For me many of the challenges of trying to reform science boil down to people being heuristic-needing agents. We accept that many problems arise from ritualistic behavior, but we have trouble overcoming that, perhaps because no matter how thoughtful/nuanced some may prefer to be, there’s always a larger group who want simple fixes / aren’t incentivized to go there. It’s hard to have broad appeal without being reductionist I guess.

“Guns, Race, and Stats: The Three Deadliest Weapons in America”

Geoff Holtzman writes:

In April 2021, The Guardian published an article titled “Gun Ownership among Black Americans is Up 58.2%.” In June 2022, Newsweek claimed that “Gun ownership rose by 58 percent in 2020 alone.” The Philadelphia Inquirer first reported on this story in August 2020, and covered it again as recently as March 2023 in a piece titled “The Growing Ranks of Gun Owners.” In between, more than two dozen major media outlets reported this same statistic. Despite inconsistencies in their reporting, all outlets (directly or indirectly) cite as their source an infographic based on a survey conducted by a firearm industry trade association.

Last week, I shared my thoughts on the social, political, and ethical dimensions of these stories in an article published in The American Prospect. Here, I address whether and to what extent their key statistical claim is true. An examination of the infographic—produced by the National Shooting Sports Foundation (NSSF)—reveals that it is not. Below, I describe six key facts about the infographic that undermine the media narrative. After removing all false, misleading, or meaningless words from the Guardian’s headline and Newsweek’s claim, the only words remaining are “Among,” “Is,” “In,” and “By.”

(1) 58.2% only refers to the first six months of 2020

To understand demographic changes in firearms purchases or ownership in 2020, one needs to ascertain firearm sales or ownership demographics from before 2020 and after 2020. The best way to do this is with a longitudinal panel, which is how Pew found no change in Black gun ownership rates among Americans from 2017 (24%) to 2021 (24%). Longitudinal research in The Annals of Internal Medicine also found no change in gun ownership among Black Americans from 2019 (21%) through 2020/2021 (21%).

By contrast, the NSSF conducted a one-time survey of its own member retailers. In July 2020, the NSSF asked these retailers to compare demographics in the first six months of 2020 to demographics in the first six months of 2019. A full critique of this approach and its drawbacks would require a lengthy discussion of the scientific literature on recency bias, telescoping effects, and so on. To keep this brief, I’d just like to point out that by July 2020, many of us could barely remember what the world was like back in 2019.

Ironically, the media couldn’t even remember when the survey took place. In September 2020, NPR reported—correctly—that “according to AOL News,” the survey concerned “the first six months of 2020.”  But in October of 2020, CNN said it reflected gun sales “through September.” And by June 2021, CNN revised its timeline to be even less accurate, claiming the statistic was “gun buyers in 2020 compared to 2019.”

Strangely, it seems that AOL News may have been one of the few media outlets that actually looked at the infographic it reported. The timing of the survey—along with other critical but collectively forgotten information on its methods—is printed at the top of the infographic. The entire top quarter of the NSSF-produced image is devoted to these details: “FIREARM & AMMUNITION SALES DURING 1ST HALF OF 2020, Online Survey Fielded July 2020 to NSSF Members.”

But as I discuss in my article in The American Prospect, a survey about the first half of 2020 doesn’t really support a narrative about Black Americans’ response to “protests throughout the summer” of 2020 or to that November’s “contested election.” This is a great example of a formal fallacy (post hoc reasoning), memory bias (more than one may have been at work here), and motivated reasoning all rolled into one. To facilitate these cognitive errors, the phrase “in 2020” is used ambiguously in the stories, referring at times to the first six months of 2020 and at times to specific days or periods during the last seven months. This part of the headlines and stories is not false, but it does conflate two distinct time periods.

The results of the NSSF survey cannot possibly reflect the events of the Summer and Fall of 2020. Rather, the survey’s methods and materials were reimagined, glossed over, or ignored to serve news stories about those events.

(2) 58.2% describes only a tiny, esoteric fraction of Americans

To generalize about gun owner demographics in the U.S., one has to survey a representative, random sample of Americans. But the NSSF survey was not sent to a representative sample of Americans—it was only sent to NSSF members. Furthermore, it doesn’t appear to have been sent to a random sample of NSSF members—we have almost no information on how the sample of fewer than 200 participants was drawn from the NSSF’s membership of nearly 10,000. Most problematically—and bizarrely—the survey is supposed to tell us something about gun buyers, yet the NSSF chose to send the survey exclusively to its gun sellers.

The word “Americans” in these headlines is being used as shorthand for “gun store customers as remembered by American retailers up to 18 months later.” In my experience, literally no one assumes I mean the latter when I say the former. The latter is not representative of the former, so this part of the headlines and news stories is misleading.

(3) 58.2% refers to some abstract, reconstructed memory of Blackness

The NSSF doesn’t provide demographic information for the retailers it surveyed. Demographics can provide crucial descriptive information for interpreting and weighting data from any survey, but their omission is especially glaring for a survey that asked people to estimate demographics. But there’s a much bigger problem here.

We don’t have reliable information about the races of these retailers’ customers, which is what the word “Black” is supposed to refer to in news coverage of the survey. This is not an attack on firearms retailers; it is a well-established statistical tendency in third-party racial identification. As I’ve discussed in The American Journal of Bioethics, a comparison of CDC mortality data to Census records shows that funeral directors are not particularly accurate in reporting the race of one (perfectly still) person at a time. Since that’s a simpler task than searching one’s memory and making statistical comparisons of all customers from January through June of two different years, it’s safe to assume that the latter tends to produce even less accurate reports.

The word “Black” in these stories really means “undifferentiated masses of people from two non-consecutive six-month periods recalled as Black.” Again, the construct picked out by “Black” in the news coverage is a far cry from the construct actually measured by the survey.

(4) 58.2% appears to be about something other than guns

The infographic doesn’t provide the full wording of survey items, or even make clear how many items there were. Of the six figures on the infographic, two are about “sales of firearms,” two are about “sales of ammunition,” and one is about “overall demographic makeup of your customers.” But the sixth and final figure—the source of that famous 58.2%—does not appear to be about anything at all. In its entirety, that text on the infographic reads: “For any demographic that you had an increase, please specify the percent increase.”

Percent increase in what? Firearms sales? Ammunition sales? Firearms and/or ammunition sales? Overall customers? My best guess would be that the item asked about customers, since guns and ammo are not typically assigned a race. But the sixth figure is uninterpretable—and the 58.2% statistic meaningless—in the absence of answers.

(5) 58.2% is about something other than ownership

I would not guess that the 58.2% statistic was about ownership, unless this were a multiple choice test and I was asked to guess which answer was a trap.

The infographic might initially appear to be about ownership, especially to someone primed by the initial press release. It’s notoriously difficult for people to grasp distinctions like those between purchases by customers and ownership in a broader population. I happen to think that the heuristics, biases, and fallacies associated with that difficulty—reverse inference, base rate neglect, affirming the consequent, etc.—are fascinating, but I won’t dwell on them here. In the end, ammunition is not a gun, a behavior (purchasing) is not a state (ownership), and customers are none of the above.

To understand how these concepts differ, suppose that 80% of people who walk into a given gun store in a given year own a gun. The following year, the store could experience a 58% increase in customers, or a 58% increase in purchases, but not observe a 58% increase in ownership. Why? Because even the best salesperson can’t get 126% of customers to own guns. So the infographic neither states nor implies anything specific about changes in gun ownership.
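Spelling out the arithmetic behind that impossible 126% figure:

ownership_rate = 0.80                  # share of this store's customers who already own a gun
implied_rate = ownership_rate * 1.58   # what a 58% increase in the ownership rate would require
print(f"{implied_rate:.0%}")           # 126% of customers, which is not possible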

(6) 58.2% was calculated deceptively

I can’t tell if the data were censored (e.g., by dropping some responses before analysis) or if the respondents were essentially censored (e.g., via survey skip logic), but 58.2% is the average guess only of retailers who reported an increase in Black customers. Retailers who reported no increase in Black customers were not counted toward the average. Consequently, the infographic can’t provide a sample size for this bar chart. Instead, it presents a range of sample sizes for individual bars: “n=19-104.”

Presenting means from four distinct, artificially constructed, partly overlapping samples as a single bar chart without specifying the size of any sample renders that 58.2% number uninterpretable. It is quite possible that only 19 of 104 retailers reported an increase in Black customers, and that all 104 reported an increase in White customers—for whom the infographic (but not the news) reported a 51.9% increase. Suppose 85 retailers did not report an increase in Black customers, and instead reported no change for that group (i.e., a change of 0%). Then if we actually calculated the average change in demographics reported by all survey respondents, we would find just a 10.6% increase in Black customers (19/104 x 58.2%), as compared to a 51.9% increase in white customers (104/104 x 51.9%).
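Here is that back-of-the-envelope recalculation written out, using the hypothetical split described in the paragraph above (19 of 104 retailers reporting a 58.2% increase for Black customers, the remaining 85 reporting no change, and all 104 reporting a 51.9% increase for white customers):

n_total = 104
n_increase_black = 19
avg_increase_among_reporters = 0.582   # the infographic's 58.2%
avg_increase_white = 0.519             # the infographic's 51.9%

avg_black_all = (n_increase_black * avg_increase_among_reporters
                 + (n_total - n_increase_black) * 0.0) / n_total
avg_white_all = (n_total * avg_increase_white) / n_total
print(f"Black customers: {avg_black_all:.1%}, white customers: {avg_white_all:.1%}")
# Black customers: 10.6%, white customers: 51.9%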

A proper analysis of the full survey data could actually undermine the narrative of a surge in gun sales driven by Black Americans. In fact, a proper calculation may even have found a decrease, not an increase, for this group. The first two bar charts on the infographic report percentages of retailers who thought overall sales of firearms and of ammunition were “up,” “down,” or the “same.” We don’t know if the same response options were given for the demographic items, but if they were, a recount of all votes might have found a decrease in Black customers. We’ll never know.

The 58.2% number is meaningless without additional but unavailable information. Or, to use more technical language, it is a ceiling estimate, as opposed to a real number. In my less-technical write-up, I simply call it a fake number.

This is kind of in the style of our recent article in the Atlantic, The Statistics That Come Out of Nowhere, but with a lot more detail. Or, for a simpler example, a claim from a few years ago about political attitudes of the super-rich, which came from a purported survey about which no details were given. As with some of those other claims, the reported number of 58% was implausible on its face, but that didn’t stop media organizations from credulously repeating it.

On the plus side, a few years back a top journal (yeah, you guessed it, it was Lancet, that fount of politically-motivated headline-bait) published a ridiculous study on gun control and, to their credit, various experts expressed their immediate skepticism.

To their discredit, the news media reports on that 58% thing did not even bother running it by any experts, skeptical or otherwise. Here’s another example (from NBC), here’s another (from Axios), here’s CNN . . . you get the picture.

I guess this story is just too good to check, it fits into existing political narratives, etc.

Book on Stan, R, and Python by Kentaro Matsuura

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan, R, and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).
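For readers who haven’t tried it, the basic CmdStanPy pattern looks something like this. It’s a generic sketch rather than code from the book, and the Stan file name and the toy data are placeholders:

import numpy as np
from cmdstanpy import CmdStanModel

# toy data for a simple Bernoulli model; in practice this comes from your application
data = {"N": 20, "y": np.random.default_rng(1).binomial(1, 0.3, size=20)}

model = CmdStanModel(stan_file="bernoulli.stan")   # assumes this Stan program exists
fit = model.sample(data=data, chains=4, iter_sampling=1000, seed=123)
print(fit.summary())        # posterior summaries plus R-hat and ESS diagnostics
draws = fit.draws_pd()      # posterior draws as a pandas DataFrame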

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than Courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.
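For anyone who wants to set this up, a minimal preamble along these lines should do it; this is my sketch, not the book’s actual configuration:

\usepackage{mathpazo}        % Palatino for text and math
\usepackage[T1]{fontenc}
\usepackage{sourcecodepro}   % Source Code Pro as the monospaced/code font
\usepackage{listings}
\lstset{basicstyle=\ttfamily\small}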

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.

Hey! Here’s how to rewrite and improve the title and abstract of a scientific paper:

Last week in class we read and then rewrote the title and abstract of a paper. We did it again yesterday, this time with one of my recent unpublished papers.

Here’s what I had originally:

title: Unifying design-based and model-based sampling inference by estimating a joint population distribution for weights and outcomes

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

Not terrible, but we can do better. Here’s the new version:

title: MRP using sampling weights

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights come from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

How did we get there?

The title. The original title was fine—it starts with some advertising (“Unifying design-based and model-based sampling inference”) and follows up with a description of how the method works (“estimating a joint population distribution for weights and outcomes”).

But the main point of the title is to get the notice of potential readers, people who might find the paper useful or interesting (or both!).

This pushes the question back one step: Who would find this paper useful or interesting? Anyone who works with sampling weights. Anyone who uses public survey data or, more generally, surveys collected by others, which typically contain sampling weights. And anyone who’d like to follow my path in survey analysis, which would be all the people out there who use MRP (multilevel regression and poststratification). Hence the new title, which is crisp, clear, and focused.

My only problem with the new title, “MRP using sampling weights,” is that it doesn’t clearly convey that the paper involves new research. It makes it look like a review article. But that’s not so horrible; people often like to learn from review articles.

The abstract. If you look carefully, you’ll see that the new abstract is the same as the original abstract, except that we replaced the middle part:

But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses.

with this:

But what if you don’t know where the weights come from?

Here’s what happened. We started by rereading the original abstract carefully. That abstract has some long sentences that are hard to follow. The first sentence is already kinda complicated, but I decided to keep it, because it clearly lays out the problem, and also I think the reader of an abstract will be willing to work a bit when reading the first sentence. Getting to the abstract at all is a kind of commitment.

The second sentence, though, that’s another tangle, and at this point the reader is tempted to give up and just skate along to the end—which I don’t want! The third sentence isn’t horrible, but it’s still a little bit long (starting with the nearly-contentless “It is also not clear how one is supposed to account for” and ending with the unnecessary “in such analyses”). Also, we don’t even really talk much about clustering in the paper! So it was a no-brainer to collapse these into a sentence that was much more snappy and direct.

Finally, yeah, the final sentence of the abstract is kinda technical, but (a) the paper’s technical, and we want to convey some of its content in the abstract!, and (b) after that new, crisp, replacement second sentence, I think the reader is ready to take a breath and hear what the paper is all about.

General principles

Here’s a general template for a research paper:
1. What is the goal or general problem?
2. Why is it important?
3. What is the challenge?
4. What is the solution? What must be done to implement this solution?
5. If the idea in this paper is so great, why wasn’t the problem already solved by someone else?
6. What are the limitations of the proposed solution? What is its domain of applicability?

We used these principles in our rewriting of my title and abstract. The first step was for me to answer the above 6 questions:
1. Goal is to do survey inference with sampling weights.
2. It’s important for zillions of researchers who use existing surveys which come with weights.
3. The challenge is that if you don’t know where the weights come from, you can’t just follow the recommended approach to condition in the regression model on the information that is predictive of inclusion into the sample.
4. The solution is to condition on the weights themselves, which involves the additional step of estimating a joint population distribution for the weights and other predictors in the model; a schematic sketch of the poststratification step appears just after this list.
5. The problem involves a new concept (imagining a population distribution for weights, which is not a coherent assumption, because, in the real world, weights are constructed based on the data) and some new mathematical steps (not inherently sophisticated as mathematics, but new work from a statistical perspective). Also, the idea of modeling the weights is not completely new; there is some related literature, and one of our contributions is to take weights (which are typically constructed from a non-Bayesian design-based perspective) and use them in a Bayesian analysis.
6. Survey weights do not include all design information, so the solution offered in the paper can only be approximate. In addition the method requires distributional assumptions on the weights; also it’s a new method so who knows how useful it will be in practice.
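As promised in point 4, here is a schematic of the poststratification step only, with made-up numbers. This is my illustration of the general idea, not the paper’s actual model, which also has to estimate the joint population distribution of weights and outcomes:

import numpy as np

rng = np.random.default_rng(0)
J = 5                                    # weight values discretized into J cells
# pretend posterior draws of the mean outcome within each weight cell
theta_draws = rng.normal(loc=np.linspace(0.4, 0.6, J), scale=0.02, size=(4000, J))
# estimated population share of each weight cell (itself modeled in the real method)
pop_share = np.array([0.35, 0.25, 0.20, 0.12, 0.08])

# poststratified estimate: combine cell-level estimates using the estimated cell shares
popn_draws = theta_draws @ pop_share
print(popn_draws.mean(), np.percentile(popn_draws, [2.5, 97.5]))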

We can’t put all of that in the abstract, but we were able to include some versions of the answers to questions 1, 3, and 4. Questions 5 and 6 are important, but it’s ok to leave them to the paper, as this is where readers will typically search for limitations and connections to the literature.

Maybe we should include the answer to question 2 in the abstract, though. Perhaps we could replace “But what if you don’t know where the weights come from?” with “But what if you don’t know where the weights come from? This is often a problem when analyzing surveys collected by others.”

Summary

By thinking carefully about goals and audience, we improved the title and abstract of a scientific paper. You should be able to do this in your own work!

Some people have no cell phone and never check their email before 4pm.

Paul Alper points to this news article, “Barely a quarter of Americans still have landlines. Who are they?”, by Andrew Van Dam, my new favorite newspaper columnist. Van Dam writes:

Only 2 percent of U.S. adults use only landlines. Another 3 percent mostly rely on landlines and 1 percent don’t have phones at all. The largest group of holdouts, of course, are folks 65 and older. That’s the only demographic for which households with landlines still outnumber wireless-only households. . . . about 73 percent of American adults lived in a household without a landline at the end of last year — a figure that has tripled since 2010.

Here are some statistics:

“People who have cut the cord” — abandoning landlines to rely only on wireless — “are generally more likely to engage in risky behaviors,” Blumberg told us. “They’re more likely to binge drink, more likely to smoke and more likely to go without health insurance.” That’s true even when researchers control for age, sex, race, ethnicity and income.

OK, they should say “adjust for,” not “control for,” but I get the idea.

The article continues:

Until recently, we weren’t sure that data even existed. But it turns out we were looking in the wrong place. Phone usage is tracked in the National Health Interview Survey, of all things, the same source we used in previous columns to measure the use of glasses and hearing aids by our fellow Americans.

Here are just some of the factors that have been published in the social priming and related literatures as having large effects on behavior.

This came up in our piranha paper, and it’s convenient to have these references in one place:

Here are just some of the factors that have been published in the social priming and related literatures as having large and predictable effects on attitudes and behavior: hormones (Petersen et al., 2013; Durante et al., 2013), subliminal images (Bartels, 2014; Gelman, 2015b), the outcomes of recent football games (Healy et al., 2010; Graham et al., 2022; Fowler and Montagnes, 2015, 2022), irrelevant news events such as shark attacks (Achen and Bartels, 2002; Fowler and Hall, 2018), a chance encounter with a stranger (Sands, 2017; Gelman, 2018b), parental socioeconomic status (Petersen et al., 2013), weather (Beall and Tracy, 2014; Gelman, 2018a), the last digit of one’s age (Alter and Hershfield, 2014; Kühne et al., 2015), the sex of a hurricane name (Jung et al., 2014; Freese, 2014), the sexes of siblings (Blanchard and Bogaert, 1996; Bogaert, 2006; Gelman and Stern, 2006), the position in which a person is sitting (Carney et al., 2010; Cesario and Johnson, 2018), and many others.

These individual studies have lots of problems (see references below to criticisms); beyond that, the piranha principle implies that it would be very difficult for many of these large and consistent effects to coexist in the wild.

References to the claims:

Kristina M. Durante, Ashley Rae, and Vladas Griskevicius. The fluctuating female vote: Politics, religion, and the ovulatory cycle. Psychological Science, 24:1007–1016, 2013.

Larry Bartels. Here’s how a cartoon smiley face punched a big hole in democratic theory. Washington Post, https://www.washingtonpost.com/news/monkey-cage/wp/2014/09/04/heres-how-a-cartoon-smiley-face-punched-a-big-hole-in-democratic-theory/, 2014.

A. J. Healy, N. Malhotra, and C. H. Mo. Irrelevant events affect voters’ evaluations of government performance. Proceedings of the National Academy of Sciences, 107:12804–12809, 2010.

Matthew H. Graham, Gregory A. Huber, Neil Malhotra, and Cecilia Hyunjung Mo. Irrelevant events and voting behavior: Replications using principles from open science. Journal of Politics, 2022.

C. H. Achen and L. M. Bartels. Blind retrospection: Electoral responses to drought, flu, and shark attacks. Presented at the Annual Meeting of the American Political Science Association, 2002.

Anthony Fowler and Andrew B. Hall. Do shark attacks influence presidential elections? Reassessing a prominent finding on voter competence. Journal of Politics, 80:1423–1437, 2018.

Melissa L. Sands. Exposure to inequality affects support for redistribution. Proceedings of the National Academy of Sciences, 114:663–668, 2017.

Michael Bang Petersen, Daniel Sznycer, Aaron Sell, Leda Cosmides, and John Tooby. The ancestral logic of politics: Upper-body strength regulates men’s assertion of self-interest over economic redistribution. Psychological Science, 24:1098–1103, 2013.

Alec T. Beall and Jessica L. Tracy. The impact of weather on women’s tendency to wear red or pink when at high risk for conception. PLoS One, 9:e88852, 2014.

A. L. Alter and H. E. Hershfield. People search for meaning when they approach a new decade in chronological age. Proceedings of the National Academy of Sciences, 111:17066–17070, 2014.

Kiju Jung, Sharon Shavitt, Madhu Viswanathan, and Joseph M. Hilbe. Female hurricanes are deadlier than male hurricanes. Proceedings of the National Academy of Sciences, 111:8782–8787, 2014.

R. Blanchard and A. F. Bogaert. Homosexuality in men and number of older brothers. American Journal of Psychiatry, 153:27–31, 1996.

A. F. Bogaert. Biological versus nonbiological older brothers and men’s sexual orientation. Proceedings of the National Academy of Sciences, 103:10771–10774, 2006.

D. R. Carney, A. J. C. Cuddy, and A. J. Yap. Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21:1363–1368, 2010.

References to some criticisms:

Andrew Gelman. The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management, 41:632–643, 2015a.

Andrew Gelman. Disagreements about the strength of evidence. Chance, 28:55–59, 2015b.

Anthony Fowler and B. Pablo Montagnes. College football, elections, and false-positive results in observational research. Proceedings of the National Academy of Sciences, 112:13800–13804, 2015.

Anthony Fowler and B. Pablo Montagnes. Distinguishing between false positives and genuine results: The case of irrelevant events and elections. Journal of Politics, 2022.

Andrew Gelman. Some experiments are just too noisy to tell us much of anything at all: Political science edition. Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2018/05/29/exposure-forking-paths-affects-support-publication/, 2018b.

Andrew Gelman. Another one of those “Psychological Science” papers (this time on biceps size and political attitudes among college students). Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2013/05/29/another-one-of-those-psychological-science-papers/, 2013.

Andrew Gelman. When you believe in things that you don’t understand. Statistical Modeling, Causal Inference, and Social Science, https://statmodeling.stat.columbia.edu/2014/04/15/believe-things-dont-understand/, 2018a.

Simon Kühne, Thorsten Schneider, and David Richter. Big changes before big birthdays? Panel data provide no evidence of end-of-decade crises. Proceedings of the National Academy of Sciences, 112:E1170, 2015.

Jeremy Freese. The hurricane name people strike back! Scatterplot, https://scatter.wordpress.com/2014/06/16/the-hurricane-name-people-strike-back/, 2014.

Andrew Gelman and Hal Stern. The difference between “significant” and “not significant” is not itself statistically significant. American Statistician, 60:328–331, 2006.

J. Cesario and D. J. Johnson. Power poseur: Bodily expansiveness does not matter in dyadic interactions. Social Psychological and Personality Science, 9:781–789, 2018.

Lots more out there:

The above is not intended to be an exhaustive or representative list or even a full list of examples we’ve covered here on the blog! There’s the “lucky golf ball” study, the case of the missing shredder, pizzagate, . . . we could go on forever. The past twenty years have featured many published and publicized claims about essentially irrelevant stimuli having large and predictable effects, along with quite a bit of criticism and refutation of these claims. The above is only a very partial list, just a paragraph giving a small sense of the wide variety of stimuli that are supposed to have been demonstrated to have large and consistent effects, and it’s relevant to our general point that it’s not possible for all these effects to coexist in the world. Again, take a look at the piranha paper for further discussion of this point.