As you may have heard, there was MASSIVE FRAUD in the California election a couple days ago. Something like 4 million illegal votes, lots of dead people voting, mysterious suitcase, the whole deal. Maybe Ted Cruz will investigate. He could ask the attorney general of Texas to look into it and hire Mary Rosh and that Williams College math professor to run the stats.
In any case, candidate Larry Elder beat us all to it a few days ago when his team used Benford’s law to discover irregularities in the voting . . . before the votes had been tallied. Elder’s campaign reported that their results “can be readily reproduced.” That’s cool. I looove reproducible research.
Unfortunately the campaign, in its pre-election rush, did not get around to posting their data and code, so we can’t reproduce their analysis quite yet.
But I wanted to do something useful, and then it struck me: California . . . liberals manipulating elections . . . Hollywood . . . endless sequels . . . I got it!
Let’s apply Benford’s Law to movie and TV franchises. Are there some shenanigans going on?
So, I put together a dataset based on all the numbered movies and franchises I could think of, and then I tallied the first digits and compared to the Benford distribution. I followed the convention that the first movie in a series, if unnumbered, would be given the number 1. So, for example, Bill and Ted’s Excellent Adventure is Bill and Ted 1. Also, if there was a movie and a sequel with the same number, I didn’t count that same number both times. (I’m looking at you, Danny Ocean!)
Here are the results, plotted in chronological order of the first movie or TV show in the franchise:
Looks pretty fishy to me. Most of the franchises don’t come even close to the Benford distribution. Maybe that Larry Elder dude was on to something!
I was then curious to see what would happen if I pooled the data and looked at the distribution of all the first digits:
I was actually surprised to see how well Benford’s law fits the whole ensemble. But I guess it makes sense. Most of these franchises have 1’s and 2’s, so you’ll see more of those low numbers. Then there are some numbers in the 10’s and 100’s, which give us more 1’s, followed by declining numbers of 2 through 9.
The mechanism of this distribution is not the same as the process that creates Benford’s law (a uniform distribution on the log scale), but it produces a qualitatively similar pattern of decline across the digits.
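To see what "uniform on the log scale" means in practice, here is a minimal sketch in R. It is not part of the analysis above: it simulates numbers whose base-10 logarithms are uniform over five orders of magnitude and tallies their leading digits. The seed, the number of simulations, and the range are arbitrary assumptions, and the names n_sims, log_x, mantissa, leading_digit, and observed are made up for the illustration.

```r
# Sketch: leading digits of numbers that are uniform on the log10 scale.
# The seed, sample size, and range (0 to 5) are arbitrary assumptions.
set.seed(123)
n_sims <- 100000
log_x <- runif(n_sims, 0, 5)                    # log10 of the simulated numbers
mantissa <- log_x - floor(log_x)                # fractional part, in [0, 1)
leading_digit <- as.integer(floor(10^mantissa)) # 10^mantissa lies in [1, 10)
observed <- tabulate(leading_digit, nbins=9) / n_sims
benford <- log10(1 + 1/(1:9))                   # same as log10(d+1) - log10(d)
print(round(rbind(observed, benford), 3))
```

With this many draws, the simulated proportions should sit close to the Benford probabilities, which is the sense in which a log-uniform process "creates" the law.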
P.S. my analysis is not only reproducible; it also exists! Here are the data:
1974 Taking_of_Pelham 123
1960 Surfside 6
1958 Sunset_Strip 77
1984 Beverly_Hills 1 2 3 4 90210
1972 Godfather 1 2 3
1989 Bill_and_Ted 1 2 3
1961 Dalmatians 101 102
1960 Ocean's 11 12 13 8
1987 Leonard 6
2001 Fast_and_Furious 1 2 3 4 5 6 7 8 9
1985 Turk 182
1968 Hawaii 50
2015 Area 51
1999 Things_I_Hate_About_You 10
1957 Angry_Men 12
1987 Jump_Street 21 22
1968 Space_Odyssey 2001 2010
1935 Steps 39
1957 Yuma 310
1959 Coups 400
1970 Easy_Pieces 5
1984 Candles 16
1954 Samurai 7
1996 Hard 8
1964 Up 7 14 21 28 35 42 49 56 63
1953 Fingers_of_Dr._T 5000
1998 Rush_Hour 1 2 3
1985 Back_to_the_Future 1 2 3
2013 Brooklyn 99
1988 Naked_Gun 1 2.5 33.333
1970 Airport 1 1975 77 79
1975 Space 1999
1988 Mystery_Science_Theater 3000
2012 Zero_Dark 30
1973 Pyramid 10000 20000 25000 50000 100000
And here’s the code:
# Read in data
raw <- readLines("hollywood_benford.txt")
spaces <- gregexpr(" ", raw)
N <- length(spaces)
year <- rep(NA, N)
franchise_name <- rep(NA, N)
first_digits <- as.list(rep(NA, N))
for (i in 1:N){
  year[i] <- as.numeric(substr(raw[i], 1, spaces[[i]][1] - 1))
  franchise_name[i] <- substr(raw[i], spaces[[i]][1] + 1, spaces[[i]][2] - 1)
  n <- length(spaces[[i]]) - 1
  first_digits[[i]] <- rep(NA, n)
  for (j in 1:n){
    first_digits[[i]][j] <- as.numeric(substr(raw[i], spaces[[i]][j+1] + 1, spaces[[i]][j+1] + 1))
  }
}
franchise_name <- gsub("_", " ", franchise_name)
franchise_name <- c(franchise_name, "All franchises combined")
first_digits <- c(first_digits, list(unlist(first_digits)))
# Compute Benford probabilities
benford <- rep(NA, 9)
for (i in 1:9){
  benford[i] <- log10(i+1) - log10(i)
}
# Make the plots
pdf("hollywood_benford_1.pdf", width=11.5, height=9)
par(mfrow=c(6,6), oma=c(0,0,5.5,0))
par(mar=c(2,2,3,1), mgp=c(1.5,.5,0), tck=0)
for (i in 1:N){
  index <- sort.list(year)[i]
  n <- length(first_digits[[index]])
  hist(first_digits[[index]], ylim=c(0, 1.02*n), breaks=seq(0.5, 9.5, 1), xlab="", xaxt="n", ylab="", yaxt="n", yaxs="i", main=franchise_name[[index]])
  points(1:9, benford*n, pch=20, col="red")
  lines(1:9, benford*n, lwd=0.5, col="red")
  axis(1, c(0, 1:9, 9.5), c("", 1:9, ""))
  axis(2, 0:n, tck=-.02)
}
mtext("Benford Goes to Hollywood", side=3, line=3.5, cex=1.3, outer=TRUE)
mtext("Comparing the distributions of first digits of movie and TV franchises to the theoretical distribution predicted by Benford's law", side=3, line=1.8, cex=1.2, outer=TRUE)
dev.off()
pdf("hollywood_benford_2.pdf", width=6, height=4)
par(mar=c(2,2,3,1), mgp=c(1.5,.5,0), tck=0)
n <- length(first_digits[[N+1]])
hist(first_digits[[N+1]], ylim=c(0, 1.05*max(benford)*n), breaks=seq(0.5, 9.5, 1), xlab="", xaxt="n", ylab="", yaxt="s", yaxs="i", main=franchise_name[[N+1]])
points(1:9, benford*n, pch=20, col="red")
lines(1:9, benford*n, lwd=0.5, col="red")
axis(1, c(0, 1:9, 9.5), c("", 1:9, ""))
axis(2, tck=-.02)
dev.off()
Yeah, I know, ggplot2 would be better. I'm not claiming that this is the best code or even good code; it's just code that did the job for me right now.


*Re*producible research is old news. Elder has shown us the path forward: *PRE*producible research.
https://en.m.wikipedia.org/wiki/Benford's_law
Seems like it behaved as expected? You don’t really get multiple orders of magnitude in titles of the same franchise, but across many you could.
Anon:
Yes, exactly. Also, with the movies there’s another thing going on, which is lots of 1’s, 2’s, and 3’s. Basically I just wanted an excuse to make a pretty graph with implicit jokes.
No, they mean data that spans several orders of magnitude, not data that is always within the same order of magnitude but sometimes occurs at different absolute values.
If 101 Dalmatians was the 100th sequel to “Dalmatian”, that would satisfy the requirement, because it would demonstrate the possibility for a movie to have 100 sequels. But it isn’t; for this purpose, the Dalmatians series and the Space Odyssey series are no different than the Back to the Future series or the Leonard “series”. Series contain between 1 and 10 movies; that is not enough range for Benford’s Law to apply.
(You can see this more clearly by considering a dataset with half of the numbers drawn from the range 50-99 and some other numbers drawn from the range 8050-8099. It doesn’t matter at all that 8000 is bigger than 50; that’s not what the requirement for the data to cover “multiple orders of magnitude” is talking about. This hypothetical data is always restricted to within a range of about 50 numbers; that’s not very many orders of magnitude.)
Ugh, terrible proofreading in my comment.
“it would demonstrate the possibility for a movie to have 100 sequels. But it isn’t” – instead of “But it isn’t”, please read “There is no such possibility” or similar.
“a dataset with half of the numbers drawn from the range 50-99 and some other numbers drawn from…” – please read “half of the numbers drawn from the range 50-99 and the other half drawn from…”.
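The hypothetical dataset in this comment is easy to check. Here is a minimal sketch in R, assuming we simply enumerate every integer in the two ranges rather than sampling from them; the variable names are made up for the illustration.

```r
# Sketch of the hypothetical dataset: every integer in 50-99 and in 8050-8099.
# Enumerating the ranges (rather than sampling from them) is an assumption.
x <- c(50:99, 8050:8099)
leading_digit <- as.integer(substr(as.character(x), 1, 1))
digit_counts <- tabulate(leading_digit, nbins=9)
names(digit_counts) <- 1:9
print(digit_counts)
# No value starts with 1, 2, 3, or 4, even though 8099 is far bigger than 50.
```

Leading digits 1 through 4 never appear and 8 dominates, so values that sit far apart on the number line still give nothing like the Benford pattern.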
Michael:
The Airport series illustrates some of the challenges here. “Airport” was labeled as “Airport 1” by convention, then there was “Airport 1975,” but then they switched to “Airport ’77” and “Airport ’79.” I pretty much stuck to series where the sequences were numbered so as not to have to decide whether “Beyond the Poseidon Adventure” should be labeled “Poseidon Adventure 2” or whether “The Drowning Pool” should be labeled “Harper 2.”
I’ve been told that “Ocean’s 8” was a transparent bid for sequels, as they left the spots 9 and 10 blank. Alternatively, they could’ve made Ocean’s 10 and then counted down from there, perhaps following the classic Agatha Christie plot in slow motion through the series.
This is great!
Great but what about 102 Dalmatians?
Person:
102 Dalmatians is there–see the dataset! The first digit is a 1.
Your data is flawed – there is no Naked Gun 3 1/3. (It’s Naked Gun 33 1/3, which makes no difference since all we’re considering is the leading digit. But still.)
Michael:
Good catch! I just went in and fixed it.
Ingenious design of the y-axis, allowing reference curves to appear the same on each chart, albeit with different values.
I’m guessing maybe you share my skepticism that Benford-style analysis is the universal fraud detector. Has anyone studied what classes of mechanisms create a uniform distribution on the log scale?
What I like about this analysis is the systematic way of deriving the data… “a dataset based on all the numbered movies and franchises I could think of.” I might point out that this part of the workflow is less than universally reproducible, but that would be uncharitable.
Oh no – Jonathan, another one is saying that reproducibility is not sufficient. You can have convenience samples, errors in the data, controversial assumptions, etc., and we can reproduce all of that to our heart’s content. So make sure you also have the Limitations section disclosing you know all of the above :)
Jonathan:
Actually my description was dishonest, as I could think of others that I didn’t include, mostly because of data coding issues. For example, is Star Wars labeled 1 through 9, or do we follow the original and label them 1, 2, 3, 1, 2, 3, 7, 8, 9? And how do we handle out-of-sequence movies? And what about I Love Lucy? Do we count the Lucy Show as I Love Lucy 2? What about series like The Thin Man and the Road movies, which were not numbered at all? And I wasn’t sure how to code Jaws 3D because of the character string in the sequence: Does Benford’s law really apply to character strings that happen to have a leading digit? Also I didn’t include zillions and zillions of kids’ movie franchises, just because there were too many. I’m thinking of things like Toy Story, Ice Age, Hotel Transylvania, Frozen, etc. From the other direction, I included all the inappropriate non-sequential examples I could think of, like Hawaii 50, Hard 8, Zero Dark 30, etc. Actually, Zero Dark 30 created its own problem because you can’t apply Benford’s law to zeroes. This is why I also excluded Zero Effect, which is arguably the most underrated movie of all time.
My biggest regret is not including this data point:
Too bad this was just a blog post and not a publication in a peer-reviewed journal. If it had been in a journal, I could blast youall for being second-string Stasi parasites for daring to question me. And then I could follow up by suggesting you submit your criticisms in the form of a letter to my personal friend, the editor of the journal that published my article. Eventually your criticism would appear in Econ Journal Watch and it would go unnoticed until someone saw it and emailed it to some statistical blogger as an example of corruption in science. And so the cycle continues . . .
I was going to make a snarky comment, but I love Zero Effect as well, so this is the best analysis I have ever seen.
Stay tuned for the sequel, as Andrew analyzes major numbered sporting events! A good fit is expected, as UFC is approaching several orders of magnitude already! However, controversy erupts over how to deal with the Super Bowl, as the most common leading digit is X…