The following question came up the other day:
What's the most common four game start to an NFL season? W W W W W W L L L L W W W L W L L W L W L L L L
I replied:
Logic and math suggest that it’s either the first one or the last one. I think that extremely shitty teams are more prevalent than extremely good teams, so I’m guessing the last option.
There was a bunch of discussion in comments and so I thought I’d elaborate by describing three different ways of attacking this problem. The various ideas discussed in the comments to my earlier post can be thought of as approximations to these three approaches.
1. Probability calculation
Just to flesh out my intuitive reasoning above, let’s go with classic item response theory (a class of models that was originally developed around the time of the founding of the NFL, actually, but for different purposes!) and model the probability that team i beats team j as:
Pr(team i beats team j) = invlogit(a_i – a_j + b*home_ij),
where a_i and a_j are the ability parameters for teams i and j, and home_ij is a home-field measure, equal to 1 if i is the home team, -1 if j is the home team, and 0 if they’re playing on a neutral field. I’ll keep things simple by excluding the possibility of a tie game.
And now some numbers. First, what’s the home-field advantage? It says here that home teams win about 55% of their games, and if we assume this 55% roughly applies to two equally-matched teams, then b = logit(0.55) = 0.2.
Next come the team abilities. Let’s start with a normal distribution: a_i ~ normal(0, sigma_a). What’s a good value for sigma_a? Well, let’s compare a team that’s 1 sd better than average to a team that’s 1 sd worse than average. These are the 84th and 16th percentiles, which for a 32-team league would be roughly the 5th and 27th best teams. The probability that the 5th best team beats the 27th best team on a neutral field will be invlogit(2*sigma_a). What is that probability? Let’s say 90%? In that case, sigma_a = logit(0.9)/2 = 1.1. Or if the probability is 80%, then logit(0.8)/2 = 0.7. I don’t know . . . let’s say sigma_a = 1.
Now we can do some math . . . ummm, let’s just simulate a million games:
n_sim <- 1e6
b <- 0.2
sigma_a <- 1.0
a_i <- rnorm(n_sim, 0, sigma_a)
y <- rep(0, n_sim)
for (k in 1:4){
a_j <- rnorm(n_sim, 0, sigma_a)
home_ij <- rbinom(n_sim, 1, 0.5)
y <- y + 10^(4-k) * rbinom(n_sim, 1, invlogit(a_i - a_j + b*home_ij))
}
output <- table(y)
names(output) <- sprintf("%04d", as.numeric(names(output)))
print(output)
Kind of hacky . . . this is how I learned how to code back in the 1970s!
Anyway, here's the result:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 102992 56795 56726 48753 56499 48423 48170 63246 56738 48173 48075 63398 48344 62814 63208 127646
Hey, that's wack! As predicted, more at the extremes, but more 1111's than 0000's! I'd've expected they'd be equal, given that I've simulated the sigma_a's from a symmetric distribution.
Let's try again with a new set of random numbers:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 102759 56663 56887 48353 56695 48292 48581 63118 56562 48748 48057 63443 48541 62860 62885 127556
Again, lots more 1111's!
I guess it has something to do with the home-field advantage . . . oh, I see, I have a bug in my code! I'd assigned home_ij as equally likely to be 0 or 1, but what I should be doing is having it equally likely to be -1 and 1. So I'll swap out the line
home_ij <- rbinom(n_sim, 1, 0.5)
with:
home_ij <- sample(c(-1,1), n_sim, replace=TRUE)
And now I'll run the corrected code. Here's what we get:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 114165 60279 60102 48707 59340 48413 48431 59904 59922 48644 48859 59741 48284 60178 60115 114916
Ahhhh, much better.
But maybe those numbers are too extreme . . . does a good team really have a 90% chance of beating a bad team? Remember the saying, "any given Sunday"! So let's try again with sigma_a = 0.7. Here's what a million simulations gets us:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 94877 61211 61547 53396 61252 53336 52875 61609 61432 53280 52924 61390 53079 61498 61484 94810
So, even then, lots more 4-game losing streaks and 4-game winning streaks than anything else.
Some commenters said that the NFL does schedule balancing so that good teams are more likely to play good teams and bad teams are more likely to play bad teams. This would reduce the counts at the extremes. We could model that too but I'm kinda lazy so I won't do it here. As the textbook writers say, I'll leave it as an exercise for the reader.
But what about that other thing, that there are more extremely shitty teams than extremely good teams? We can use some skewed distribution . . . ummmm, I don't know much about these! There's something called the noncentral t . . . I'd like to do something with some intuition, something I understand. OK, for now I'll just hack it, replacing the normal(0,1) distribution by a normal(0,0.7) on the positive side and normal(0,1.0) on the negative.
So, in the above code I'll add the function:
rnorm_split <- function(n_sim, mu, sigma_neg, sigma_pos) {
z <- rnorm(n_sim, 0, 1)
ifelse(z < 0, mu + sigma_neg*z, mu + sigma_pos*z)
}
and then change the two instances of
rnorm(n_sim, 0, sigma_a)
to
rnorm_split(n_sim, 0, 1.0, 0.7)
And here we have it:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 106711 59807 59534 50637 59860 50934 50615 61828 59241 50613 50626 62066 50725 61784 61985 103034
To check uncertainties, we do it again:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 107451 59519 59545 50471 59237 50756 51172 62029 59726 50706 50685 61763 50829 61576 61785 102750
So, yeah, slightly more 0000's than 1111's, but not a lot, so who's to say what will happen after piping it through the scheduling thing where better teams play each other more often. I still think the general pattern will hold, but it might not show up in a small dataset. We'll get back to that point in a bit.
2. Purely empirical solution
According to wikipedia, "The NFL was formed in 1920 as the American Professional Football Association (APFA) before renaming itself the National Football League for the 1922 season." So let's start in 1920. Back in the day they had a lot of ties, so I'll make the decision to exclude ties; thus the above question will be interpreted as, "What's the most common four game start to an NFL season, excluding ties?"
Somebody who knows how to scrape should be able to could scrape the data from all the NFL seasons and just count up what happened.
OK, that's the planned data analysis. Next comes the design analysis: our expectation of what we might see.
The NFL used to have about 15 to 20 teams and now it has 32; just as a rough calculation I'll go with 25 teams per season x 100 seasons = 2500 teams, with 16 possible outcomes for the first 4 games of the season. 2500/16 is approximately 150, and if the games were all decided by coin flips (which they're not), then we'd expect approximately 150 +/ sqrt(150), that is 150 +/- 12 in each category.
But the games aren't decided by coin flips. See section 1 above. Our best guess is that there will be more cases on the extremes. Let's take the above numbers, which are based on a million teams playing 4 games each, and scale them down to 2500. That is, I'll simply take the numbers above and divide them by 400:
print(output*2500/1e6)
This yields:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
268.6275 148.7975 148.8625 126.1775 148.0925 126.8900 127.9300 155.0725 149.3150 126.7650 126.7125 154.4075 127.0725 153.9400 154.4625 256.8750
Ugly! Let's try again:
print(round(output*2500/1e6))
Which yields:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 269 149 149 126 148 127 128 155 149 127 127 154 127 154 154 257
The counts at the extreme have approximate standard errors of sqrt(260) = 16, so, yeah, we should be able to detect this from all 106 NFL seasons. But the bit about 0000 being more common than 1111? That's kind of lost in the noise.
What about just the past 10 seasons (320 teams)?
print(round(output*320/1e6))
The result:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 34 19 19 16 19 16 16 20 19 16 16 20 16 20 20 33
sqrt(34) = 6, so this should still be detectable, but there is a chance that the results could look weird. And you can forget about getting any useful information comparing 0000 to 1111.
One way to see this is to do a couple simulations with n_sim = 320. Here's one:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 31 19 21 17 15 21 22 26 20 15 20 19 13 17 16 28
And here's another:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 40 10 16 21 14 13 20 25 24 20 17 17 22 15 18 28
And another:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 29 20 18 20 22 16 16 20 21 17 11 24 16 11 18 41
If you want to do it from one season, you'd run the simulation with n_sim = 32 . . . forget about it! Here's an example:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1101 1111 2 2 1 2 1 1 1 4 3 1 3 4 4 3
You'll need data from multiple seasons to detect any patterns here.
3. Statistical modeling
A completely different approach is to fit a model to data. Let's assume that our helpful scrapers have done their job and supplied us with a clean dataset of all the games since 1920.
We could fit the above item-response model to the wins and losses (keeping things simple by excluding tie games).
Teams change from season to season, so I'd recommend fitting this model separately for each season, but pooling b (the home-field advantage parameter) and sigma_a (the sd of team abilities), maybe fitting a separate hierarchical model fore each decade.
But we can do better than that. Once we have the game scores, we can model them directly. Don’t model the probability of win, model the expected score differential. Something like this:
score differential ~ normal(a_i - a_j + b*home_ij, sigma_y).
The parameters a and b have slightly different interpretations now--they're on the scale of points rather than logit probabilities--but that's fine. A hierarchical model should be easy to fit. Again I'd fit a different model to each decade, or you can get more sophisticated with some sort of time series model allowing team abilities to change over the season and to have some stability between season. There's no end to the amount of modeling you can do here, if you have interest in the problem.
Once we have this model, we can simulate games and get a model-based estimate of the probability of each of the possible four-game outcomes in each season. Indeed we can do it with the actual matchups and then compare to what would happen in expectation under random matchups. The difference will give us some sense of the effect of the NFL schedule imbalances.
Something has gone awry with the formatting; much of the post is showing as if it’s all code (grayed) in lines that don’t fit on the screen.
Somewhere you left a code block open and the rest of the article is formatted as code, which makes it very hard to read because it doesn’t wordwrap.
I’m surprised there’s so much debate about this, actually. Given the choices, you’d expect the overall probability of mediocre team’s outcome to be split among WWLL, LLWW, WLWL, and LWLW; but the good team’s probability mass is entirely concentrated at WWWW.
But as others pointed out in the last thread, focusing on number of wins, rather than sequence that generated that number of wins, changes the logic. I wonder if confused people are intuitively thinking in terms of number of wins in the sequence framing of the question.
You do the simulation is a naive way that probably works for showing off the general point, but if you were to make it more sophisticated you can make a schedule for the teams so that each win for one team causes a loss for another. You can also make sure that teams don’t play each other more than once in the first four games, which I think is fairly realistic. Doesn’t change the underlying conclusions at all though.
R script and full results here:
https://pastebin.com/Uxht7qdi
This is using the kaggle 1926-2024 dataset, which includes ties. Columns are Pattern/Count/Percent. I get for the top 10 (which I am sure will not render correctly):
LLLL 212 9.8WWWW 189 8.8
LWWW 138 6.4
WLWW 137 6.4
WWLW 133 6.2
WLLL 129 6.0
LWLL 126 5.8
WWWL 125 5.8
LWWL 117 5.4
LLWL 115 5.3
Hey–that fits our model pretty well! I feel vindicated (assuming these numbers are correct).
My only real contribution was telling the bot not to use tidyverse. The code looks fine to me and results make sense, but I never spot checked the source data either.
As referenced in the pastebin, you can download the 1926-2024_COMBINED_NFL_SCORES csv from here: https://www.kaggle.com/datasets/flynn28/1926-2024-nfl-scores
Here’s a simulation where I use a Bradley-Terry model with team abilities drawn from a standard normal. I made sure to avoid duplicate games in the first four. This is simulating 1,000,000 4-game stretches involving 16 teams and making sure there are no duplicate games. The counts look like this:
And here’s the code, this time generated by Gemini just using Google search and following up in AI mode. It uses a dumb generate-and-test loop to make sure teams haven’t already played each other in a season—there are much better ways to do this when the constraints get tighter. Also using strings in a dictionary for indexing is super slow compared to direct numerical indexing. On the other hand, this ran fast enough to report results.
import numpy as np from itertools import product from collections import Counter def simulate_no_repeats(n_seasons, n_teams, n_weeks): possible_seqs = ["".join(p) for p in product("WL", repeat=n_weeks)] total_counts = Counter({seq: 0 for seq in possible_seqs}) for _ in range(n_seasons): abilities = np.random.normal(0, 1, n_teams) records = [[] for _ in range(n_teams)] # Track opponents for each team: {0: {opponent_ids}, 1: {opponent_ids}, ...} history = {i: set() for i in range(n_teams)} for week in range(n_weeks): teams_to_pair = list(range(n_teams)) np.random.shuffle(teams_to_pair) # Simple pairing logic: attempt to pair team 'i' with the next available valid opponent pairs = [] while teams_to_pair: t1 = teams_to_pair.pop(0) for idx, t2 in enumerate(teams_to_pair): if t2 not in history[t1]: pairs.append((t1, t2)) teams_to_pair.pop(idx) break else: # If a valid pairing isn't found, restart the week's shuffle teams_to_pair = list(range(n_teams)) np.random.shuffle(teams_to_pair) pairs = [] for t1, t2 in pairs: # Update history history[t1].add(t2) history[t2].add(t1) # Bradley-Terry outcome, equivalently inv_logit(abilities[t1] - abilities[t2]) prob_t1_wins = np.exp(abilities[t1]) / (np.exp(abilities[t1]) + np.exp(abilities[t2])) if np.random.random() < prob_t1_wins: records[t1].append("W") records[t2].append("L") else: records[t1].append("L") records[t2].append("W") season_outcomes = ["".join(r) for r in records] total_counts.update(season_outcomes) return total_counts # Execute results = simulate_no_repeats(n_seasons=1000, n_teams=16, n_weeks=4) print(f"{'Sequence':<10} | {'Total'}") for seq, count in sorted(results.items()): print(f"{seq:<10} | {count}")I don't see a measurable difference with reducing the number of teams to 8 from 16 or with not removing duplicates in the schedule. This blog reply is too small for the full workflow :-).
At least I'm happy to see that my own intuition that 4 wins and 4 losses would be most common holds true when you take the abilities generated randomly like this.
Since you left the schedule balancing idea as an exercise for the reader, I decided to actually code it up and see what happens.
I used Bob Carpenter’s league setup from the comments so the teams wouldn’t play duplicate games in the first month,
but I kept your skewed normal distribution to make sure there were still more garbage teams than elite ones.
First I ran a completely random schedule. As you’d expect, 0-4 and 4-0 sequences were the most common, hitting around 10.5% each.
Next, I used a strict tier system. For the first four weeks, the top 8 teams only play each other, and the bottom 8 teams play only each other.
This confirms the balancing effect—the extreme results went way down.
Both LLLL and WWWW dropped to about 7%, which is almost like a coin flip.
When you remove the unfair games, the 4-game streaks mostly disappear.
But in reality, the schedule is based on last year’s standings, which rarely maps perfectly to the new season due to roster turnover,
rookies, and injuries. So I added a wrinkle: I took that strictly balanced schedule
but randomly swapped a couple of teams into the “wrong” tier right before week 1.
Just injecting that little bit of preseason misclassification bumped the extremes back up to nearly 9%.
Turns out all it takes is a decent team getting handed an accidentally easy schedule (or a bad team getting thrown to the wolves) for the streaks to come right back.
I threw the code up as a GitHub gist if anyone wants to mess around with the noise parameter:
https://gist.github.com/SermetPekin/ec5aaa7289469606e62e06d1c712f190
Off topic but any comment on the Michael Weisman paper on the two leak Bayesian logic paper?
Is he on point or wrong?
Is it settled or are the criticized authors not agreeing?
Anon:
I’ll do a new post on the topic but my short answer is that all these papers, on both sides of the issue, leave me very confused and I don’t know what to think.
The topic is tricky because it can be so political but here I’m just talking about the statistical challenges of thinking about a problem where there are so many gaps in my understanding.
The author of the critical paper has a podcast, just out, with a Harvard statistician questioning him. It’s meant for general readers (less sophisticated than you). Weissman claims that the authors set a hurdle for the single outbreaks that was higher than the one he set for the double outbreaks.
https://econjwatch.org/podcast/michael-weissman-on-lab-leak-and-science
Thanks for your response. I agree this stuff can be tricky. Heck, even type 1 or type 2 in six sigma can be tricky. And I have seen Harvard MBAs and tenured econ profs make sunk cost fallacy errors. (Not saying I buy Weissman, just that these sort of logic errors can occur. If anything, I suspect, the plugging everything into a computer program and not thinking about the logic is part of the issue. Have seen it on other peer reviewed stats papers.)
Yes, actually originally I was the one who was going to do that interview (I’m on the editorial board of Econ Journal Watch, which published the critical paper), but then I didn’t feel prepared to do the podcast so they had someone else do it.
Andrew –
Since I referenced these comments of yours over at Michael’s Substack, and in case your ears were burning, here’s the link to the thread where I did that (scroll down a bit; it’s part of a longer exchange):
https://michaelweissman.substack.com/p/an-inconvenient-probability-v57/comment/236304884?
I’m still trying, so far without much luck, to get Michael to answer a fairly narrow question: whether, in his Bayesian analysis, there are behavioral or institutional assumptions that are effectively built in but not addressed explicitly. I’m not trying to analyze the technical model myself; I don’t have the brains for that. What I’m doing is looking at the structure of the argument from a non-expert angle and trying to understand whether any of those implicit components were formally weighted or tested.
Andrew, as the original anonymous, I’m mostly interested in your take on the logic debate itself. I think starting with something granular like that is valuable. Don’t pass just because there are many other lab leak debate topics. Is it really a simple logic error or not?
If its so difficult for people to tell if there’s basic logic errors, identifying logic errors is a bad approach.
Like you all shouldn’t need to be waiting for an expert to tell you what to think.
JFC Andrew- That’s a reasonable take on the broad Bayesian question but that’s not what the EJW paper was about. It was about an obvious blatant error in a simple Bayes exercise in a famous paper. They used conditional probabilities of detailed observations for the hypothesis they didn’t like and just assigned conditional probabilities of 1.0 for them for the hypothesis they liked. It was fixable using their own simulations of their own mo0del of their own data.
I understand your agnosticism about the bigger questions but this one was like a first week homework problem and they fucked it up.
For those who haven’t followed, here’s a quote from my article.
It may be helpful to first illustrate the basic error with an analogy. If one were to update the odds for deciding which of two suspects committed a burglary using the observation that a blue Toyota had been seen leaving the scene it would not be correct to use the ratio
P(drives car|suspect 2)/P(drives blue Toyota|suspect 1). That comparison would incorrectly favor the “suspect 2” hypothesis since most cars are neither blue nor Toyotas. That fundamental error in logic is precisely analogous to the core error of P2022. In what follows I shall calculate the corrections to the I2 likelihood used by P2022 when it is estimated using the same observed properties as were used for I1.
Michael:
Yes, just to be clear, I’m talking about my general confusion on the topic, not on the specific issues you raised with that paper, the specific criticisms given of your paper, etc. I just felt spooked doing a podcast on the topic without having a better understanding of the big picture. I’m glad the podcast was done; I just didn’t feel up to doing it myself!
Andrew- I think it would help if you acknowledged that despite your uncertainty about the big picture, about the “context” section of my EJW paper, and about the concluding “reflections” section, there is just no ambiguity at all about the core technical point. Pekar et a.l required that their unfavored I_1 produce two lineages differing by 2 nt and with a size ratio between 3/7 and 7/3. (equivalent to blue Toyota) For the two-spill I_2 they put no requirement at all on the nt difference or on the size ratio. Therefore they violated the basics of Bayesian logic.
This was noted years ago by Angus McCowan, who also obtained the correct minimal corrections. My contribution was to show that the data to make the minimal corrections were already sitting in the Pekar paper and its supplementary files. In fact, just looking at one of their figures carefully was enough to show that within their own model of their own data P(I_2) < P(I_1).
What to make of that rigorous conclusion is a matter for dispute. I don't think it changes the odds for a lab leak a whole lot because I never bought the hype that the Pekar paper was very important.
But it would help if you would just say clearly that Angus, Jamie, Nick Patterson, John Marden and I have not all had simultaneous strokes causing identical breakdowns in basic reasoning. The core technical case that Angus and I make is just not an issue for dispute, regardless of what what one thinks about other issues.
IOW
https://drive.google.com/file/d/1FWXTYAh7TIZk2I6-srSX_RO-dPhBPjki/view?usp=sharing
Michael –
I’m not equipped to judge the Pekar math myself, but I do wonder about your argument that Pekar et al. would fail a “first‑week homework” assignment.
Because I lack the technical background to evaluate the dispute directly, I’ve been using an LLM as a way to sanity‑check the level of the alleged error—not as an authority, but as a tool for evaluating probabilities when experts disagree and I don’t have the domain knowledge to adjudicate the details myself. After reviewing your description and the relevant papers, it concluded that beyond the already‑corrected coding bugs, the remaining issues involve subtle and contested modeling choices in phylogenetic Bayesian inference, not an unambiguous “fuck‑up” as you have described it.
That matches what I see in the public record: several experts think Pekar has fundamental problems, but I can’t find anyone besides you describing the error as trivial or first‑week‑homework‑level. Even if your co‑authors share your view, that doesn’t provide independent confirmation of the “first‑week homework” framing, since co‑authors naturally share assumptions and priors. If your broader Bayesian argument assigns significant weight to the *obviousness* of the error, then that weighting itself seems to me as an observer, like an unmodeled assumption.
Which brings me back to the Bayesian question I’ve been trying to ask. You’ve responded to other comments, but not to this one:
Shouldn’t the probability that your interpretation is correct—versus the possibility that this is a more subtle or debatable modeling choice where reasonable experts can disagree—be made explicit, or at least tested with sensitivity analysis?
I have the same question about the burglary–car analogy. In your example the observation is a single discrete fact (a specific car seen leaving the scene). In Pekar the “data” is a complex phylogenetic tree with many parameters (topology, mutation counts, rooting, sampling biases, etc.). Are the conditioning choices in P2022 really analogous to comparing “drives car” vs. “drives blue Toyota,” or does that analogy simplify away important structure?
A similar issue arises with the behavioral components in your broader framework. You place substantial weight on the 3‑year non‑correction of Pekar, journal resistance to comments, slow community updating, WIV database removals, and Daszak’s “unwelcome attention” note. These rely on assumptions about how scientists and institutions—including Chinese scientists working within Chinese institutional constraints—behave under pressure. But those behavioral assumptions don’t seem to be specified or tested with probabilities or sensitivity analyses. And because expertise in Chinese institutional behavior or the psychology of Chinese scientists is highly specialized, the uncertainty around those assumptions seems especially important to model explicitly rather than treat them as fixed inputs.
I’m not trying to settle the math or the origins question. I’m trying to understand how a full Bayesian analysis handles uncertainty around both the “obviousness” of technical claims and the weighting of institutional or behavioral responses.
Joshua- I don’t think we should continue to hijack Andrew’s blog with these lengthy posts. How’s about moving that one over to my substack on P2022, where it’s directly relevant?
Michael –
Happy to repost my comment on your substack and move away from here.
I’ve been waiting for you to address those same issues over on your substack. Since you didn’t answer over there, that’s why I posted them here.
My comment was directly relevant after you posted a meme to apparently mock the idea that Andrew says he finds these Bayesian‑origins debates complicated, even if he wasn’t weighing in on the specific Pekar critique.
That kind of hesitation from a respected statistician, to me, reinforces my question as to whether you should model the “obviousness” of the error, not treat it as fixed.
In any case, I’ll repost my last comment on your P2022 Substack and follow up there. I would still like to know if you think your Bayesian framework should handle uncertainty around the “obviousness” of the alleged error and the behavioral assumptions in your broader argument.
Another issue is that teams tend to alternate playing at home and away, and basically never have 4 road or 4 home games in a row. So that might make the results slightly more balanced.
Travis:
I think that variation in the number of home games will be a very minor factor here compared to variation in team abilities. But let me check . . . I’ll take the code above and replace
home_ij <- sample(c(-1,1), n_sim, replace=TRUE)with
home_ij <- 2*(k%%2) - 1So now every team plays 2 home games and 2 away games.
Here's the result for a million teams:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
106544 49295 69629 49970 49109 35277 50123 51005 70008 49978 71123 72382 50040 51175 72924 101418
So, slightly fewer at the extremes but very little change from before.