BREAKING: Benford’s law violations in California. Hollywood TV and movie franchises got some splainin to do!

Posted on September 16, 2021 9:02 AM by Andrew

As you may have heard, there was MASSIVE FRAUD in the California election a couple days ago. Something like 4 million illegal votes, lots of dead people voting, mysterious suitcase, the whole deal. Maybe Ted Cruz will investigate. He could ask the attorney general of Texas to look into it and hire Mary Rosh and that Williams College math professor to run the stats.

In any case, candidate Larry Elder beat us all to it a few days ago when his team used Benford’s law to discover irregularities in the voting . . . before the votes had been tallied. Elder’s campaign reported that their results “can be readily reproduced.” That’s cool. I looove reproducible research.

Unfortunately the campaign, in its pre-election rush, did not get around to posting their data and code, so we can’t reproduce their analysis quite yet.

But I wanted to do something useful, and then it struck me: California . . . liberals manipulating elections . . . Hollywood . . . endless sequels . . . I got it!

Let’s apply Benford’s Law to movie and TV franchises. Are there some shenanigans going on?

So, I put together a dataset based on all the numbered movies and franchises I could think of, and then I tallied the first digits and compared to the Benford distribution. I followed the convention that the first movie in a series, if unnumbered, would be given the number 1. So, for example, Bill and Ted’s Excellent Adventure is Bill and Ted 1. Also, if there was a movie and a sequel with the same number, I didn’t count that same number both times. (I’m looking at you, Danny Ocean!)
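A minimal sketch of that tallying step, assuming the numbers are kept as strings the way they appear in the titles. The helper names `first_digit` and `tally_digits` are made up for illustration and do not appear in the post's own code; the two example series are copied from the data listing further down.

```r
# Numbers as they appear in the titles, kept as strings (copied from the data listing)
fast_and_furious <- c("1", "2", "3", "4", "5", "6", "7", "8", "9")
pyramid <- c("10000", "20000", "25000", "50000", "100000")

# Hypothetical helper: the first character of each string, as an integer digit
first_digit <- function(x) as.integer(substr(x, 1, 1))

# Hypothetical helper: counts for digits 1 through 9, keeping the zero counts
tally_digits <- function(x) tabulate(first_digit(x), nbins = 9)

# Benford probabilities for leading digits 1 through 9
benford <- log10(1 + 1 / (1:9))

# Observed counts next to the counts Benford's law would predict for a series this long
observed <- tally_digits(pyramid)
expected <- benford * length(pyramid)
print(data.frame(digit = 1:9, observed = observed, expected = round(expected, 2)))
```

With only a handful of numbers per franchise, the expected counts are mostly fractions well below 1, which is one reason a single short series can't be expected to sit close to the curve.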

Here are the results, plotted in chronological order of the first movie or TV show in the franchise:

Looks pretty fishy to me. Most of the franchises don’t come even close to the Benford distribution. Maybe that Larry Elder dude was on to something!

I was then curious to see what would happen if I pooled the data and looked at the distribution of all the first digits:

I was actually surprised to see how well Benford’s law fits the whole ensemble. But I guess it makes sense. Most of these franchises have 1’s and 2’s, so you’ll see more of those low numbers. Then there are some numbers in the 10’s and 100’s, which gives us more 1’s, then declining numbers of 2 through 9.

The mechanism of this distribution is not the same as the process that creates Benford’s law (a uniform distribution on the log scale), but it produces a qualitatively similar pattern of decline across the digits.
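To see what that parenthetical means, here is a small simulation sketch that is not part of the original analysis. If log10 of a number is uniformly distributed over a whole number of orders of magnitude, the chance that its first digit is d works out to log10(d+1) - log10(d) = log10(1 + 1/d). The range of five orders of magnitude, the sample size, and the seed below are arbitrary assumptions.

```r
set.seed(1)

# Draw exponents uniformly, so the numbers 10^u are uniform on the log scale
u <- runif(1e5, min = 0, max = 5)

# The leading digit depends only on the fractional part of the exponent:
# 10^(u %% 1) lies in [1, 10), and its integer part is the first digit of 10^u
digit <- floor(10^(u %% 1))

# Observed proportions of digits 1 through 9 versus the Benford probabilities
simulated <- tabulate(digit, nbins = 9) / length(u)
benford <- log10(1 + 1 / (1:9))
print(round(rbind(simulated, benford), 3))
```

The franchise numbers get to a similar shape by a different route: lots of short sequences starting at 1, plus a few large round numbers.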

P.S. my analysis is not only reproducible; it also exists! Here are the data:

1974 Taking_of_Pelham 123
1960 Surfside 6
1958 Sunset_Strip 77
1984 Beverly_Hills 1 2 3 4 90210
1972 Godfather 1 2 3
1989 Bill_and_Ted 1 2 3
1961 Dalmatians 101 102
1960 Ocean's 11 12 13 8
1987 Leonard 6
2001 Fast_and_Furious 1 2 3 4 5 6 7 8 9
1985 Turk 182
1968 Hawaii 50
2015 Area 51
1999 Things_I_Hate_About_You 10
1957 Angry_Men 12
1987 Jump_Street 21 22
1968 Space_Odyssey 2001 2010
1935 Steps 39
1957 Yuma 310
1959 Coups 400
1970 Easy_Pieces 5
1984 Candles 16
1954 Samurai 7
1996 Hard 8
1964 Up 7 14 21 28 35 42 49 56 63
1953 Fingers_of_Dr._T 5000
1998 Rush_Hour 1 2 3
1985 Back_to_the_Future 1 2 3
2013 Brooklyn 99
1988 Naked_Gun 1 2.5 33.333
1970 Airport 1 1975 77 79
1975 Space 1999
1988 Mystery_Science_Theater 3000
2012 Zero_Dark 30
1973 Pyramid 10000 20000 25000 50000 100000

And here’s the code:

# Read in data

raw <- readLines("hollywood_benford.txt")
spaces <- gregexpr(" ", raw)
N <- length(spaces)
year <- rep(NA, N)
franchise_name <- rep(NA, N)
first_digits <- as.list(rep(NA, N))
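for (i in 1:N){
  # Year: everything before the first space
  year[i] <- as.numeric(substr(raw[i], 1, spaces[[i]][1] - 1))
  # Franchise name: the text between the first and second spaces
  franchise_name[i] <- substr(raw[i], spaces[[i]][1] + 1, spaces[[i]][2] - 1)
  # Every space after the first one is followed by one number
  n <- length(spaces[[i]]) - 1
  first_digits[[i]] <- rep(NA, n)
  for (j in 1:n){
    # Keep only the single character after the space, which is the leading digit
    first_digits[[i]][j] <- as.numeric(substr(raw[i], spaces[[i]][j+1] + 1, spaces[[i]][j+1] + 1))
  }
}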
franchise_name <- gsub("_", " ", franchise_name)

franchise_name <- c(franchise_name, "All franchises combined")
first_digits <- c(first_digits, list(unlist(first_digits)))

# Compute Benford probabilities

benford <- rep(NA, 9)
for (i in 1:9){
  benford[i] <- log10(i+1) - log10(i)
}

# Make the plots

pdf("hollywood_benford_1.pdf", width=11.5, height=9)
par(mfrow=c(6,6), oma=c(0,0,5.5,0))
par(mar=c(2,2,3,1), mgp=c(1.5,.5,0), tck=0)
for (i in 1:N){
  index <- sort.list(year)[i]
  n <- length(first_digits[[index]])
  hist(first_digits[[index]], ylim=c(0, 1.02*n), breaks=seq(0.5, 9.5, 1), xlab="", xaxt="n", ylab="", yaxt="n", yaxs="i", main=franchise_name[[index]])
  points(1:9, benford*n, pch=20, col="red")
  lines(1:9, benford*n, lwd=0.5, col="red")
  axis(1, c(0, 1:9, 9.5), c("", 1:9, ""))
  axis(2, 0:n, tck=-.02)
}
mtext("Benford Goes to Hollywood", side=3, line=3.5, cex=1.3, outer=TRUE)
mtext("Comparing the distributions of first digits of movie and TV franchises to the theoretical distribution predicted by Benford's law", side=3, line=1.8, cex=1.2, outer=TRUE)
dev.off()

pdf("hollywood_benford_2.pdf", width=6, height=4)
par(mar=c(2,2,3,1), mgp=c(1.5,.5,0), tck=0)
n <- length(first_digits[[N+1]])
hist(first_digits[[N+1]], ylim=c(0, 1.05*max(benford)*n), breaks=seq(0.5, 9.5, 1), xlab="", xaxt="n", ylab="", yaxt="s", yaxs="i", main=franchise_name[[N+1]])
points(1:9, benford*n, pch=20, col="red")
lines(1:9, benford*n, lwd=0.5, col="red")
axis(1, c(0, 1:9, 9.5), c("", 1:9, ""))
axis(2, tck=-.02)
dev.off()

Yeah, I know, ggplot2 would be better. I'm not claiming that this is the best code or even good code; it's just code that did the job for me right now.
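For what it's worth, the parsing loop could also be written more compactly with `strsplit`. This is only a sketch of an alternative, not the code that produced the plots; it assumes every line has the form "year name number number ..." separated by single spaces, and it inlines two lines from the data listing so that it runs without the file.

```r
# Two lines copied from the data listing, inlined so the sketch runs on its own
raw <- c("1972 Godfather 1 2 3",
         "1973 Pyramid 10000 20000 25000 50000 100000")

# Split on single spaces: field 1 is the year, field 2 the name, the rest are numbers
fields <- strsplit(raw, " ", fixed = TRUE)
year <- as.numeric(sapply(fields, `[`, 1))
franchise_name <- gsub("_", " ", sapply(fields, `[`, 2))

# Drop the first two fields and keep the first character of each remaining number
first_digits <- lapply(fields, function(f) as.numeric(substr(f[-(1:2)], 1, 1)))
```

The result has the same shape as in the loop version: a numeric `year`, a character `franchise_name`, and a list `first_digits` with one vector per franchise.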

17 thoughts on “BREAKING: Benford’s law violations in California. Hollywood TV and movie franchises got some splainin to do!”

Aaron Montgomery on September 16, 2021 at 9:21 am said:

*Re*producible research is old news. Elder has shown us the path forward: *PRE*producible research.

Anoneuoid on September 16, 2021 at 9:24 am said:

Benford’s law tends to apply most accurately to data that span several orders of magnitude. As a rule of thumb, the more orders of magnitude that the data evenly covers, the more accurately Benford’s law applies. For instance, one can expect that Benford’s law would apply to a list of numbers representing the populations of UK settlements. But if a “settlement” is defined as a village with population between 300 and 999, then Benford’s law will not apply.[16][17]

https://en.m.wikipedia.org/wiki/Benford's_law

Seems like it behaved as expected? You don’t really get multiple orders of magnitude in titles of the same franchise, but across many you could.

- Andrew on September 16, 2021 at 10:46 am said:
  
  Anon:
  
  Yes, exactly. Also, with the movies there’s another thing going on, which is lots of 1’s, 2’s, and 3’s. Basically I just wanted an excuse to make a pretty graph with implicit jokes.
  
- Michael Watts on September 17, 2021 at 7:44 am said:
  
  No, they mean data that spans several orders of magnitude, not data that is always within the same order of magnitude but sometimes occurs at different absolute values.
  
  If 101 Dalmatians was the 100th sequel to “Dalmatian”, that would satisfy the requirement, because it would demonstrate the possibility for a movie to have 100 sequels. But it isn’t; for this purpose, the Dalmatians series and the Space Odyssey series are no different than the Back to the Future series or the Leonard “series”. Series contain between 1 and 10 movies; that is not enough range for Benford’s Law to apply.
  
  (You can see this more clearly by considering a dataset with half of the numbers drawn from the range 50-99 and some other numbers drawn from the range 8050-8099. It doesn’t matter at all that 8000 is bigger than 50; that’s not what the requirement for the data to cover “multiple orders of magnitude” is talking about. This hypothetical data is always restricted to within a range of about 50 numbers; that’s not very many orders of magnitude.)
  
  - Michael Watts on September 17, 2021 at 7:49 am said:
    
    Ugh, terrible proofreading in my comment.
    
    “it would demonstrate the possibility for a movie to have 100 sequels. But it isn’t” – instead of “But it isn’t”, please read “There is no such possibility” or similar.
    
    “a dataset with half of the numbers drawn from the range 50-99 and some other numbers drawn from…” – please read “half of the numbers drawn from the range 50-99 and the other half drawn from…”.
    
  - Andrew on September 17, 2021 at 9:06 am said:
    
    Michael:
    
    The Airport series illustrates some of the challenges here. “Airport” was labeled as “Airport 1” by convention, then there was “Airport 1975,” but then they switched to “Airport ’77” and “Airport ’79.” I pretty much stuck to series where the sequences were numbered so as not to have to decide whether “Beyond the Poseidon Adventure” should be labeled “Poseidon Adventure 2” or whether “The Drowning Pool” should be labeled “Harper 2.”
    
    I’ve been told that “Ocean’s 8” was a transparent bid for sequels, as they left the spots 9 and 10 blank. Alternatively, they could’ve made Ocean’s 10 and then counted down from there, perhaps following the classic Agatha Christie plot in slow motion through the series.
    
Raghu Parthasarathy on September 16, 2021 at 2:08 pm said:

This is great!

Person on September 16, 2021 at 2:47 pm said:

Great but what about 102 Dalmatians?

- Andrew on September 16, 2021 at 3:07 pm said:
  
  Person:
  
  102 Dalmatians is there–see the dataset! The first digit is a 1.
  
Michael Watts on September 17, 2021 at 7:35 am said:

Your data is flawed – there is no Naked Gun 3 1/3. (It’s Naked Gun 33 1/3, which makes no difference since all we’re considering is the leading digit. But still.)

- Andrew on September 17, 2021 at 8:59 am said:
  
  Michael:
  
  Good catch! I just went in and fixed it.
  
Kaiser on September 17, 2021 at 10:31 am said:

Ingenious design of the y-axis, allowing reference curves to appear the same on each chart, albeit with different values.
I’m guessing maybe you share my skepticism that Benford style analysis is the universal fraud detector. Has anyone studied what classes of mechanisms create a uniform distribution in log scale?

Jonathan (another one) on September 17, 2021 at 10:32 am said:

What I like about this analysis is the systematic way of deriving the data… “a dataset based on all the numbered movies and franchises I could think of.” I might point out that this part of the workflow is less than universally reproducible, but that would be uncharitable.

- Kaiser on September 17, 2021 at 10:54 am said:
  
  Oh no – Jonathan another one is saying reproducibility is not sufficient. You can have convenient samples, errors in the data, controversial assumptions, etc. and we can reproduce all of that to our heart’s content. So make sure you also have the Limitations section disclosing you know all of the above :)
  
- Andrew on September 17, 2021 at 11:19 am said:
  Jonathan:
  
  Actually my description was dishonest, as I could think of others that I didn’t include, mostly because of data coding issues. For example, is Star Wars labeled 1 through 9, or do we follow the original and label them 1, 2, 3, 1, 2, 3, 7, 8, 9? And how do we handle out-of-sequence movies? And what about I Love Lucy? Do we count the Lucy Show as I Love Lucy 2? What about series like The Thin Man and the Road movies, which were not numbered at all? And I wasn’t sure how to code Jaws 3D because of the character string in the sequence: Does Benford’s law really apply to character strings that happen to have a leading digit? Also I didn’t include zillions and zillions of kids’ movie franchises, just because there were too many. I’m thinking of things like Toy Story, Ice Age, Hotel Transylvania, Frozen, etc. From the other direction, I included all the inappropriate non-sequential examples I could think of, like Hawaii 50, Hard 8, Zero Dark 30, etc. Actually, Zero Dark 30 created its own problem because you can’t apply Benford’s law to zeroes. This is why I also excluded Zero Effect, which is arguably the most underrated movie of all time.
  
  My biggest regret is not including this data point:
```
1972 Newhart 1 2
```
  Too bad this was just a blog post and not a publication in a peer-reviewed journal. If it had been in a journal, I could blast youall for being second-string Stasi parasites for daring to question me. And then I could follow up by suggesting you submit your criticisms in the form of a letter to my personal friend, the editor of the journal that published my article. Eventually your criticism would appear in Econ Journal Watch and it would go unnoticed until someone saw it and emailed it to some statistical blogger as an example of corruption in science. And so the cycle continues . . .
  - Jonathan (another one) on September 17, 2021 at 12:16 pm said:
    
    I was going to make a snarky comment, but I love Zero Effect as well, so this is the best analysis I have ever seen.
    
John N-G on September 18, 2021 at 12:08 am said:

Stay tuned for the sequel, as Andrew analyzes major numbered sporting events! A good fit is expected, as UFC is approaching several orders of magnitude already! However, controversy erupts over how to deal with the Super Bowl, as the most common leading digit is X…


Statistical Modeling, Causal Inference, and Social Science
