These are the three ways of attacking a statistical problem (illustrated with the NFL example)

Posted on March 30, 2026 9:14 AM by Andrew

The following question came up the other day:

What's the most common four game start to an NFL season?
W W W W
W W L L
L L W W
W L W L
L W L W
L L L L

I replied:

Logic and math suggest that it’s either the first one or the last one. I think that extremely shitty teams are more prevalent than extremely good teams, so I’m guessing the last option.

There was a bunch of discussion in comments and so I thought I’d elaborate by describing three different ways of attacking this problem. The various ideas discussed in the comments to my earlier post can be thought of as approximations to these three approaches.

1. Probability calculation

Just to flesh out my intuitive reasoning above, let’s go with classic item response theory (a class of models that was originally developed around the time of the founding of the NFL, actually, but for different purposes!) and model the probability that team i beats team j as:

Pr(team i beats team j) = invlogit(a_i – a_j + b*home_ij),

where a_i and a_j are the ability parameters for teams i and j, and home_ij is a home-field measure, equal to 1 if i is the home team, -1 if j is the home team, and 0 if they’re playing on a neutral field. I’ll keep things simple by excluding the possibility of a tie game.

And now some numbers. First, what’s the home-field advantage? It says here that home teams win about 55% of their games, and if we assume this 55% roughly applies to two equally-matched teams, then b = logit(0.55) = 0.2.

Next come the team abilities. Let’s start with a normal distribution: a_i ~ normal(0, sigma_a). What’s a good value for sigma_a? Well, let’s compare a team that’s 1 sd better than average to a team that’s 1 sd worse than average. These are the 84th and 16th percentiles, which for a 32-team league would be roughly the 5th and 27th best teams. The probability that the 5th best team beats the 27th best team on a neutral field will be invlogit(2*sigma_a). What is that probability? Let’s say 90%? In that case, sigma_a = logit(0.9)/2 = 1.1. Or if the probability is 80%, then logit(0.8)/2 = 0.7. I don’t know . . . let’s say sigma_a = 1.

Now we can do some math . . . ummm, let’s just simulate a million games:

n_sim <- 1e6
b <- 0.2
sigma_a <- 1.0
a_i <- rnorm(n_sim, 0, sigma_a)
y <- rep(0, n_sim)
for (k in 1:4){
  a_j <- rnorm(n_sim, 0, sigma_a)
  home_ij <- rbinom(n_sim, 1, 0.5)
  y <- y + 10^(4-k) * rbinom(n_sim, 1, invlogit(a_i - a_j + b*home_ij))
}
output <- table(y)
names(output) <- sprintf("%04d", as.numeric(names(output)))
print(output)

Kind of hacky . . . this is how I learned how to code back in the 1970s!

Anyway, here's the result:

  0000   0001   0010   0011   0100   0101   0110   0111   1000   1001   1010   1011   1100   1101   1110   1111 
102992  56795  56726  48753  56499  48423  48170  63246  56738  48173  48075  63398  48344  62814  63208 127646

Hey, that's wack! As predicted, more at the extremes, but more 1111's than 0000's! I'd've expected they'd be equal, given that I've simulated the sigma_a's from a symmetric distribution.

Let's try again with a new set of random numbers:

  0000   0001   0010   0011   0100   0101   0110   0111   1000   1001   1010   1011   1100   1101   1110   1111 
102759  56663  56887  48353  56695  48292  48581  63118  56562  48748  48057  63443  48541  62860  62885 127556

Again, lots more 1111's!

I guess it has something to do with the home-field advantage . . . oh, I see, I have a bug in my code! I'd assigned home_ij as equally likely to be 0 or 1, but what I should be doing is having it equally likely to be -1 and 1. So I'll swap out the line

  home_ij <- rbinom(n_sim, 1, 0.5)

with:

  home_ij <- sample(c(-1,1), n_sim, replace=TRUE)

And now I'll run the corrected code. Here's what we get:

  0000   0001   0010   0011   0100   0101   0110   0111   1000   1001   1010   1011   1100   1101   1110   1111 
114165  60279  60102  48707  59340  48413  48431  59904  59922  48644  48859  59741  48284  60178  60115 114916

Ahhhh, much better.

But maybe those numbers are too extreme . . . does a good team really have a 90% chance of beating a bad team? Remember the saying, "any given Sunday"! So let's try again with sigma_a = 0.7. Here's what a million simulations gets us:

 0000  0001  0010  0011  0100  0101  0110  0111  1000  1001  1010  1011  1100  1101  1110  1111 
94877 61211 61547 53396 61252 53336 52875 61609 61432 53280 52924 61390 53079 61498 61484 94810

So, even then, lots more 4-game losing streaks and 4-game winning streaks than anything else.

Some commenters said that the NFL does schedule balancing so that good teams are more likely to play good teams and bad teams are more likely to play bad teams. This would reduce the counts at the extremes. We could model that too but I'm kinda lazy so I won't do it here. As the textbook writers say, I'll leave it as an exercise for the reader.

But what about that other thing, that there are more extremely shitty teams than extremely good teams? We can use some skewed distribution . . . ummmm, I don't know much about these! There's something called the noncentral t . . . I'd like to do something with some intuition, something I understand. OK, for now I'll just hack it, replacing the normal(0,1) distribution by a normal(0,0.7) on the positive side and normal(0,1.0) on the negative.

So, in the above code I'll add the function:

rnorm_split <- function(n_sim, mu, sigma_neg, sigma_pos) {
  z <- rnorm(n_sim, 0, 1)
  ifelse(z < 0, mu + sigma_neg*z, mu + sigma_pos*z)
}

and then change the two instances of

rnorm(n_sim, 0, sigma_a)

rnorm_split(n_sim, 0, 1.0, 0.7)

And here we have it:

  0000   0001   0010   0011   0100   0101   0110   0111   1000   1001   1010   1011   1100   1101   1110   1111 
106711  59807  59534  50637  59860  50934  50615  61828  59241  50613  50626  62066  50725  61784  61985 103034

To check uncertainties, we do it again:

  0000   0001   0010   0011   0100   0101   0110   0111   1000   1001   1010   1011   1100   1101   1110   1111 
107451  59519  59545  50471  59237  50756  51172  62029  59726  50706  50685  61763  50829  61576  61785 102750

So, yeah, slightly more 0000's than 1111's, but not a lot, so who's to say what will happen after piping it through the scheduling thing where better teams play each other more often. I still think the general pattern will hold, but it might not show up in a small dataset. We'll get back to that point in a bit.

2. Purely empirical solution

According to wikipedia, "The NFL was formed in 1920 as the American Professional Football Association (APFA) before renaming itself the National Football League for the 1922 season." So let's start in 1920. Back in the day they had a lot of ties, so I'll make the decision to exclude ties; thus the above question will be interpreted as, "What's the most common four game start to an NFL season, excluding ties?"
Somebody who knows how to scrape should be able to could scrape the data from all the NFL seasons and just count up what happened.

OK, that's the planned data analysis. Next comes the design analysis: our expectation of what we might see.

The NFL used to have about 15 to 20 teams and now it has 32; just as a rough calculation I'll go with 25 teams per season x 100 seasons = 2500 teams, with 16 possible outcomes for the first 4 games of the season. 2500/16 is approximately 150, and if the games were all decided by coin flips (which they're not), then we'd expect approximately 150 +/ sqrt(150), that is 150 +/- 12 in each category.

But the games aren't decided by coin flips. See section 1 above. Our best guess is that there will be more cases on the extremes. Let's take the above numbers, which are based on a million teams playing 4 games each, and scale them down to 2500. That is, I'll simply take the numbers above and divide them by 400:

print(output*2500/1e6)

This yields:

    0000     0001     0010     0011     0100     0101     0110     0111     1000     1001     1010     1011     1100     1101     1110     1111 
268.6275 148.7975 148.8625 126.1775 148.0925 126.8900 127.9300 155.0725 149.3150 126.7650 126.7125 154.4075 127.0725 153.9400 154.4625 256.8750

Ugly! Let's try again:

print(round(output*2500/1e6))

Which yields:

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 
 269  149  149  126  148  127  128  155  149  127  127  154  127  154  154  257

The counts at the extreme have approximate standard errors of sqrt(260) = 16, so, yeah, we should be able to detect this from all 106 NFL seasons. But the bit about 0000 being more common than 1111? That's kind of lost in the noise.

What about just the past 10 seasons (320 teams)?

print(round(output*320/1e6))

The result:

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 
  34   19   19   16   19   16   16   20   19   16   16   20   16   20   20   33

sqrt(34) = 6, so this should still be detectable, but there is a chance that the results could look weird. And you can forget about getting any useful information comparing 0000 to 1111.

One way to see this is to do a couple simulations with n_sim = 320. Here's one:

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 
  31   19   21   17   15   21   22   26   20   15   20   19   13   17   16   28

And here's another:

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 
  40   10   16   21   14   13   20   25   24   20   17   17   22   15   18   28

And another:

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 
  29   20   18   20   22   16   16   20   21   17   11   24   16   11   18   41

If you want to do it from one season, you'd run the simulation with n_sim = 32 . . . forget about it! Here's an example:

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1101 1111 
   2    2    1    2    1    1    1    4    3    1    3    4    4    3

You'll need data from multiple seasons to detect any patterns here.

3. Statistical modeling

A completely different approach is to fit a model to data. Let's assume that our helpful scrapers have done their job and supplied us with a clean dataset of all the games since 1920.

We could fit the above item-response model to the wins and losses (keeping things simple by excluding tie games).

Teams change from season to season, so I'd recommend fitting this model separately for each season, but pooling b (the home-field advantage parameter) and sigma_a (the sd of team abilities), maybe fitting a separate hierarchical model fore each decade.

But we can do better than that. Once we have the game scores, we can model them directly. Don’t model the probability of win, model the expected score differential. Something like this:

score differential ~ normal(a_i - a_j + b*home_ij, sigma_y).

The parameters a and b have slightly different interpretations now--they're on the scale of points rather than logit probabilities--but that's fine. A hierarchical model should be easy to fit. Again I'd fit a different model to each decade, or you can get more sophisticated with some sort of time series model allowing team abilities to change over the season and to have some stability between season. There's no end to the amount of modeling you can do here, if you have interest in the problem.

Once we have this model, we can simulate games and get a model-based estimate of the probability of each of the possible four-game outcomes in each season. Indeed we can do it with the actual matchups and then compare to what would happen in expectation under random matchups. The difference will give us some sense of the effect of the NFL schedule imbalances.

25 thoughts on “These are the three ways of attacking a statistical problem (illustrated with the NFL example)”

Gregory C. Mayer on March 30, 2026 10:05 AM at 10:05 am said:

Something has gone awry with the formatting; much of the post is showing as if it’s all code (grayed) in lines that don’t fit on the screen.

Reply ↓
Daniel Lakeland on March 30, 2026 10:07 AM at 10:07 am said:

Somewhere you left a code block open and the rest of the article is formatted as code, which makes it very hard to read because it doesn’t wordwrap.

Reply ↓
Will on March 30, 2026 10:42 AM at 10:42 am said:

I’m surprised there’s so much debate about this, actually. Given the choices, you’d expect the overall probability of mediocre team’s outcome to be split among WWLL, LLWW, WLWL, and LWLW; but the good team’s probability mass is entirely concentrated at WWWW.

But as others pointed out in the last thread, focusing on number of wins, rather than sequence that generated that number of wins, changes the logic. I wonder if confused people are intuitively thinking in terms of number of wins in the sequence framing of the question.

Reply ↓
Anonymous on March 30, 2026 11:20 AM at 11:20 am said:

You do the simulation is a naive way that probably works for showing off the general point, but if you were to make it more sophisticated you can make a schedule for the teams so that each win for one team causes a loss for another. You can also make sure that teams don’t play each other more than once in the first four games, which I think is fairly realistic. Doesn’t change the underlying conclusions at all though.

Reply ↓
Anoneuoid on March 30, 2026 12:39 PM at 12:39 pm said:

R script and full results here:
https://pastebin.com/Uxht7qdi

This is using the kaggle 1926-2024 dataset, which includes ties. Columns are Pattern/Count/Percent. I get for the top 10 (which I am sure will not render correctly):

LLLL 212 9.8 WWWW 189 8.8 LWWW 138 6.4 WLWW 137 6.4 WWLW 133 6.2 WLLL 129 6.0 LWLL 126 5.8 WWWL 125 5.8 LWWL 117 5.4 LLWL 115 5.3

Reply ↓
- Andrew on March 30, 2026 2:07 PM at 2:07 pm said:
  
  Hey–that fits our model pretty well! I feel vindicated (assuming these numbers are correct).
  
  Reply ↓
  - Anoneuoid on March 30, 2026 6:03 PM at 6:03 pm said:
    
    My only real contribution was telling the bot not to use tidyverse. The code looks fine to me and results make sense, but I never spot checked the source data either.
    
    As referenced in the pastebin, you can download the 1926-2024_COMBINED_NFL_SCORES csv from here: https://www.kaggle.com/datasets/flynn28/1926-2024-nfl-scores
    
    Reply ↓

Here’s a simulation where I use a Bradley-Terry model with team abilities drawn from a standard normal. I made sure to avoid duplicate games in the first four. This is simulating 1,000,000 4-game stretches involving 16 teams and making sure there are no duplicate games. The counts look like this:

temp2$ python3 fourgames-nodup.py
Sequence   | Total
LLLL       | 1838832
LLLW       | 958832
LLWL       | 959012
LLWW       | 774449
LWLL       | 959730
LWLW       | 774872
LWWL       | 774625
LWWW       | 959648
WLLL       | 959653
WLLW       | 774561
WLWL       | 775453
WLWW       | 959208
WWLL       | 774279
WWLW       | 959241
WWWL       | 958416
WWWW       | 1839189

And here’s the code, this time generated by Gemini just using Google search and following up in AI mode. It uses a dumb generate-and-test loop to make sure teams haven’t already played each other in a season—there are much better ways to do this when the constraints get tighter. Also using strings in a dictionary for indexing is super slow compared to direct numerical indexing. On the other hand, this ran fast enough to report results.

import numpy as np
from itertools import product
from collections import Counter

def simulate_no_repeats(n_seasons, n_teams, n_weeks):
    possible_seqs = ["".join(p) for p in product("WL", repeat=n_weeks)]
    total_counts = Counter({seq: 0 for seq in possible_seqs})
    
    for _ in range(n_seasons):
        abilities = np.random.normal(0, 1, n_teams)
        records = [[] for _ in range(n_teams)]
        
        # Track opponents for each team: {0: {opponent_ids}, 1: {opponent_ids}, ...}
        history = {i: set() for i in range(n_teams)}
        
        for week in range(n_weeks):
            teams_to_pair = list(range(n_teams))
            np.random.shuffle(teams_to_pair)
            
            # Simple pairing logic: attempt to pair team 'i' with the next available valid opponent
            pairs = []
            while teams_to_pair:
                t1 = teams_to_pair.pop(0)
                for idx, t2 in enumerate(teams_to_pair):
                    if t2 not in history[t1]:
                        pairs.append((t1, t2))
                        teams_to_pair.pop(idx)
                        break
                else:
                    # If a valid pairing isn't found, restart the week's shuffle
                    teams_to_pair = list(range(n_teams))
                    np.random.shuffle(teams_to_pair)
                    pairs = []

            for t1, t2 in pairs:
                # Update history
                history[t1].add(t2)
                history[t2].add(t1)
                
                # Bradley-Terry outcome, equivalently inv_logit(abilities[t1] - abilities[t2])
                prob_t1_wins = np.exp(abilities[t1]) / (np.exp(abilities[t1]) + np.exp(abilities[t2]))
                
                if np.random.random() < prob_t1_wins:
                    records[t1].append("W")
                    records[t2].append("L")
                else:
                    records[t1].append("L")
                    records[t2].append("W")
        
        season_outcomes = ["".join(r) for r in records]
        total_counts.update(season_outcomes)
        
    return total_counts

# Execute
results = simulate_no_repeats(n_seasons=1000, n_teams=16, n_weeks=4)
print(f"{'Sequence':<10} | {'Total'}")
for seq, count in sorted(results.items()):
    print(f"{seq:<10} | {count}")

I don't see a measurable difference with reducing the number of teams to 8 from 16 or with not removing duplicates in the schedule. This blog reply is too small for the full workflow :-).

At least I'm happy to see that my own intuition that 4 wins and 4 losses would be most common holds true when you take the abilities generated randomly like this.

Reply ↓

Sermet Pekin on March 30, 2026 5:05 PM at 5:05 pm said:

Since you left the schedule balancing idea as an exercise for the reader, I decided to actually code it up and see what happens.
I used Bob Carpenter’s league setup from the comments so the teams wouldn’t play duplicate games in the first month,
but I kept your skewed normal distribution to make sure there were still more garbage teams than elite ones.

First I ran a completely random schedule. As you’d expect, 0-4 and 4-0 sequences were the most common, hitting around 10.5% each.

Next, I used a strict tier system. For the first four weeks, the top 8 teams only play each other, and the bottom 8 teams play only each other.
This confirms the balancing effect—the extreme results went way down.
Both LLLL and WWWW dropped to about 7%, which is almost like a coin flip.
When you remove the unfair games, the 4-game streaks mostly disappear.

But in reality, the schedule is based on last year’s standings, which rarely maps perfectly to the new season due to roster turnover,
rookies, and injuries. So I added a wrinkle: I took that strictly balanced schedule
but randomly swapped a couple of teams into the “wrong” tier right before week 1.

Just injecting that little bit of preseason misclassification bumped the extremes back up to nearly 9%.
Turns out all it takes is a decent team getting handed an accidentally easy schedule (or a bad team getting thrown to the wolves) for the streaks to come right back.

I threw the code up as a GitHub gist if anyone wants to mess around with the noise parameter:
https://gist.github.com/SermetPekin/ec5aaa7289469606e62e06d1c712f190

Reply ↓
Anonymous on March 30, 2026 10:28 PM at 10:28 pm said:

Off topic but any comment on the Michael Weisman paper on the two leak Bayesian logic paper?

Is he on point or wrong?

Is it settled or are the criticized authors not agreeing?

Reply ↓
- Andrew on March 31, 2026 8:52 AM at 8:52 am said:
  
  Anon:
  
  I’ll do a new post on the topic but my short answer is that all these papers, on both sides of the issue, leave me very confused and I don’t know what to think.
  
  The topic is tricky because it can be so political but here I’m just talking about the statistical challenges of thinking about a problem where there are so many gaps in my understanding.
  
  Reply ↓
  - Anonymous on March 31, 2026 1:58 PM at 1:58 pm said:
    
    The author of the critical paper has a podcast, just out, with a Harvard statistician questioning him. It’s meant for general readers (less sophisticated than you). Weissman claims that the authors set a hurdle for the single outbreaks that was higher than the one he set for the double outbreaks.
    
    https://econjwatch.org/podcast/michael-weissman-on-lab-leak-and-science
    
    Thanks for your response. I agree this stuff can be tricky. Heck, even type 1 or type 2 in six sigma can be tricky. And I have seen Harvard MBAs and tenured econ profs make sunk cost fallacy errors. (Not saying I buy Weissman, just that these sort of logic errors can occur. If anything, I suspect, the plugging everything into a computer program and not thinking about the logic is part of the issue. Have seen it on other peer reviewed stats papers.)
    
    Reply ↓
    - Andrew on March 31, 2026 5:38 PM at 5:38 pm said:
      
      Yes, actually originally I was the one who was going to do that interview (I’m on the editorial board of Econ Journal Watch, which published the critical paper), but then I didn’t feel prepared to do the podcast so they had someone else do it.
    - Joshua on March 31, 2026 8:46 PM at 8:46 pm said:
      
      Andrew –
      
      Since I referenced these comments of yours over at Michael’s Substack, and in case your ears were burning, here’s the link to the thread where I did that (scroll down a bit; it’s part of a longer exchange):
      
      https://michaelweissman.substack.com/p/an-inconvenient-probability-v57/comment/236304884?
      
      I’m still trying, so far without much luck, to get Michael to answer a fairly narrow question: whether, in his Bayesian analysis, there are behavioral or institutional assumptions that are effectively built in but not addressed explicitly. I’m not trying to analyze the technical model myself; I don’t have the brains for that. What I’m doing is looking at the structure of the argument from a non-expert angle and trying to understand whether any of those implicit components were formally weighted or tested.
    - Anonymous on April 2, 2026 1:39 AM at 1:39 am said:
      
      Andrew, as the original anonymous, I’m mostly interested in your take on the logic debate itself. I think starting with something granular like that is valuable. Don’t pass just because there are many other lab leak debate topics. Is it really a simple logic error or not?
    - Anoneuoid on April 2, 2026 10:46 AM at 10:46 am said:
      
      If its so difficult for people to tell if there’s basic logic errors, identifying logic errors is a bad approach.
      
      Like you all shouldn’t need to be waiting for an expert to tell you what to think.
  - Michael Weissman on March 31, 2026 10:05 PM at 10:05 pm said:
    
    JFC Andrew- That’s a reasonable take on the broad Bayesian question but that’s not what the EJW paper was about. It was about an obvious blatant error in a simple Bayes exercise in a famous paper. They used conditional probabilities of detailed observations for the hypothesis they didn’t like and just assigned conditional probabilities of 1.0 for them for the hypothesis they liked. It was fixable using their own simulations of their own mo0del of their own data.
    I understand your agnosticism about the bigger questions but this one was like a first week homework problem and they fucked it up.
    
    For those who haven’t followed, here’s a quote from my article.
    
    It may be helpful to first illustrate the basic error with an analogy. If one were to update the odds for deciding which of two suspects committed a burglary using the observation that a blue Toyota had been seen leaving the scene it would not be correct to use the ratio
    P(drives car|suspect 2)/P(drives blue Toyota|suspect 1). That comparison would incorrectly favor the “suspect 2” hypothesis since most cars are neither blue nor Toyotas. That fundamental error in logic is precisely analogous to the core error of P2022. In what follows I shall calculate the corrections to the I2 likelihood used by P2022 when it is estimated using the same observed properties as were used for I1.
    
    Reply ↓
    - Andrew on April 1, 2026 7:31 AM at 7:31 am said:
      
      Michael:
      
      Yes, just to be clear, I’m talking about my general confusion on the topic, not on the specific issues you raised with that paper, the specific criticisms given of your paper, etc. I just felt spooked doing a podcast on the topic without having a better understanding of the big picture. I’m glad the podcast was done; I just didn’t feel up to doing it myself!
    - Michael Weissman on April 1, 2026 8:02 AM at 8:02 am said:
      
      Andrew- I think it would help if you acknowledged that despite your uncertainty about the big picture, about the “context” section of my EJW paper, and about the concluding “reflections” section, there is just no ambiguity at all about the core technical point. Pekar et a.l required that their unfavored I_1 produce two lineages differing by 2 nt and with a size ratio between 3/7 and 7/3. (equivalent to blue Toyota) For the two-spill I_2 they put no requirement at all on the nt difference or on the size ratio. Therefore they violated the basics of Bayesian logic.
      
      This was noted years ago by Angus McCowan, who also obtained the correct minimal corrections. My contribution was to show that the data to make the minimal corrections were already sitting in the Pekar paper and its supplementary files. In fact, just looking at one of their figures carefully was enough to show that within their own model of their own data P(I_2) < P(I_1).
      
      What to make of that rigorous conclusion is a matter for dispute. I don't think it changes the odds for a lab leak a whole lot because I never bought the hype that the Pekar paper was very important.
      
      But it would help if you would just say clearly that Angus, Jamie, Nick Patterson, John Marden and I have not all had simultaneous strokes causing identical breakdowns in basic reasoning. The core technical case that Angus and I make is just not an issue for dispute, regardless of what what one thinks about other issues.
    - Michael Weissman on April 1, 2026 11:48 AM at 11:48 am said:
      
      IOW
      https://drive.google.com/file/d/1FWXTYAh7TIZk2I6-srSX_RO-dPhBPjki/view?usp=sharing
    - Joshua on April 1, 2026 3:58 PM at 3:58 pm said:
      
      Michael –
      
      I’m not equipped to judge the Pekar math myself, but I do wonder about your argument that Pekar et al. would fail a “first‑week homework” assignment.
      
      Because I lack the technical background to evaluate the dispute directly, I’ve been using an LLM as a way to sanity‑check the level of the alleged error—not as an authority, but as a tool for evaluating probabilities when experts disagree and I don’t have the domain knowledge to adjudicate the details myself. After reviewing your description and the relevant papers, it concluded that beyond the already‑corrected coding bugs, the remaining issues involve subtle and contested modeling choices in phylogenetic Bayesian inference, not an unambiguous “fuck‑up” as you have described it.
      
      That matches what I see in the public record: several experts think Pekar has fundamental problems, but I can’t find anyone besides you describing the error as trivial or first‑week‑homework‑level. Even if your co‑authors share your view, that doesn’t provide independent confirmation of the “first‑week homework” framing, since co‑authors naturally share assumptions and priors. If your broader Bayesian argument assigns significant weight to the *obviousness* of the error, then that weighting itself seems to me as an observer, like an unmodeled assumption.
      
      Which brings me back to the Bayesian question I’ve been trying to ask. You’ve responded to other comments, but not to this one:
      
      Shouldn’t the probability that your interpretation is correct—versus the possibility that this is a more subtle or debatable modeling choice where reasonable experts can disagree—be made explicit, or at least tested with sensitivity analysis?
      
      I have the same question about the burglary–car analogy. In your example the observation is a single discrete fact (a specific car seen leaving the scene). In Pekar the “data” is a complex phylogenetic tree with many parameters (topology, mutation counts, rooting, sampling biases, etc.). Are the conditioning choices in P2022 really analogous to comparing “drives car” vs. “drives blue Toyota,” or does that analogy simplify away important structure?
      
      A similar issue arises with the behavioral components in your broader framework. You place substantial weight on the 3‑year non‑correction of Pekar, journal resistance to comments, slow community updating, WIV database removals, and Daszak’s “unwelcome attention” note. These rely on assumptions about how scientists and institutions—including Chinese scientists working within Chinese institutional constraints—behave under pressure. But those behavioral assumptions don’t seem to be specified or tested with probabilities or sensitivity analyses. And because expertise in Chinese institutional behavior or the psychology of Chinese scientists is highly specialized, the uncertainty around those assumptions seems especially important to model explicitly rather than treat them as fixed inputs.
      
      I’m not trying to settle the math or the origins question. I’m trying to understand how a full Bayesian analysis handles uncertainty around both the “obviousness” of technical claims and the weighting of institutional or behavioral responses.
    - Michael Weissman on April 1, 2026 6:25 PM at 6:25 pm said:
      
      Joshua- I don’t think we should continue to hijack Andrew’s blog with these lengthy posts. How’s about moving that one over to my substack on P2022, where it’s directly relevant?
    - Joshua on April 2, 2026 9:28 AM at 9:28 am said:
      
      Michael –
      
      Happy to repost my comment on your substack and move away from here.
      
      I’ve been waiting for you to address those same issues over on your substack. Since you didn’t answer over there, that’s why I posted them here.
      
      My comment was directly relevant after you posted a meme to apparently mock the idea that Andrew says he finds these Bayesian‑origins debates complicated, even if he wasn’t weighing in on the specific Pekar critique.
      
      That kind of hesitation from a respected statistician, to me, reinforces my question as to whether you should model the “obviousness” of the error, not treat it as fixed.
      
      In any case, I’ll repost my last comment on your P2022 Substack and follow up there. I would still like to know if you think your Bayesian framework should handle uncertainty around the “obviousness” of the alleged error and the behavioral assumptions in your broader argument.
Travis on March 31, 2026 5:50 PM at 5:50 pm said:

Another issue is that teams tend to alternate playing at home and away, and basically never have 4 road or 4 home games in a row. So that might make the results slightly more balanced.

Reply ↓
- Andrew on April 1, 2026 7:35 AM at 7:35 am said:
  
  Travis:
  
  I think that variation in the number of home games will be a very minor factor here compared to variation in team abilities. But let me check . . . I’ll take the code above and replace
  home_ij <- sample(c(-1,1), n_sim, replace=TRUE)
  with
  home_ij <- 2*(k%%2) - 1
  So now every team plays 2 home games and 2 away games.
  
  Here's the result for a million teams:
  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 106544 49295 69629 49970 49109 35277 50123 51005 70008 49978 71123 72382 50040 51175 72924 101418
  So, slightly fewer at the extremes but very little change from before.
  
  Reply ↓

Statistical Modeling, Causal Inference, and Social Science

These are the three ways of attacking a statistical problem (illustrated with the NFL example)

25 thoughts on “These are the three ways of attacking a statistical problem (illustrated with the NFL example)”

Leave a Reply Cancel reply