Just show me the data, baseball edition

Andrew’s always enjoining people to include their raw data. Jim Albert, of course, does it right. Here’s a recent post from his always fascinating baseball blog, Exploring Baseball Data with R,

The post “just” plots the raw data and does a bit of exploratory data analysis, concluding that the apparent trends are puzzling. Albert’s blog has it all. The very next post fits a simple Bayesian predictive model to answer the question every baseball fan in NY is asking,

P.S. If you like Albert’s blog, check out his fantastic intro to baseball stats, which only assumes a bit of algebra, yet introduces most of statistics through simulation. It’s always the first book I recommend to anyone who wants a taste of modern statistical thinking and isn’t put off by the subject matter,

  • Jim Albert and Jay Bennett. 2001. Curve Ball. Copernicus.


 

8 thoughts on “Just show me the data, baseball edition”

  1. This DH practice for interleague games continued for seasons 1997 through 2021, although in 2020 the MLB allowed the use of DH for all games as a health and safety measure during the COVID-19 pandemic. Starting with the 2022 season, the MLB commissioner announced that a “Universal DH” (DH in both leagues) would be used, and I assume that the Universal DH will be used in future MLB seasons.

    Is there more to this story? How is a DH batting instead of the pitcher supposed to protect against covid?

    Regarding the odd trend, we would expect about (1/9)*(1/2) ~ 0.056 of plate appearances to be by a DH. That's 9 batters for each of about 30 teams, half of which are in the American League, which used the DH.

    Then we see a deviation of about +0.0035 in ~1980-1990, then about -0.0035 in the ~2000-2010 seasons.

    The obvious connection is (1/30)*(1/9) = 0.0037, i.e., about one batter.

    Then look more carefully at the dates.

    Pre-1977: 12 teams per league

    1977-1992: 14 AL teams vs 12 NL teams

    1993-1997: 14 teams per league

    1998-2012: 14 AL teams vs 16 NL teams

    2013-2021: 15 teams per league

    https://en.m.wikipedia.org/wiki/Timeline_of_Major_League_Baseball

    So a factor of two error remains, along with error due to the increasing total number of teams. I am confident this is the explanation but I’ll leave the details to be worked out by others.
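The arithmetic in this comment can be sketched in a few lines of Python. This is only a rough check under stated assumptions, not the calculation from Albert's post: every team is assumed to get the same number of plate appearances, the DH is assumed to take exactly 1/9 of an AL team's plate appearances, and interleague games are ignored. The league sizes are the ones listed above; the name `expected_dh_share` is made up for illustration.

```python
from fractions import Fraction

# League sizes by era, copied from the list in the comment above:
# (label, AL teams, NL teams)
ERAS = [
    ("pre-1977", 12, 12),
    ("1977-1992", 14, 12),
    ("1993-1997", 14, 14),
    ("1998-2012", 14, 16),
    ("2013-2021", 15, 15),
]


def expected_dh_share(al_teams, nl_teams):
    """Expected share of all plate appearances taken by a DH.

    Assumptions (not from the source): equal plate appearances per team,
    the DH bats in 1 of 9 lineup slots for AL teams only, and
    interleague play is ignored.
    """
    al_fraction = Fraction(al_teams, al_teams + nl_teams)
    return al_fraction * Fraction(1, 9)


# Balanced leagues give (1/2) * (1/9) = 1/18, about 0.056.
baseline = expected_dh_share(12, 12)

for label, al, nl in ERAS:
    share = expected_dh_share(al, nl)
    deviation = share - baseline
    print(f"{label}: share={float(share):.4f} deviation={float(deviation):+.4f}")
```

Under these assumptions the deviation from the balanced-league baseline is about +0.0043 for 1977-1992 and about -0.0037 for 1998-2012, and zero in the eras with equal league sizes. That matches the sign and rough size of the swings described above, with the remaining gap plausibly due to the simplifications.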

    • Actually I think this makes a good causal inference example.

      We abduced an explanation, deduced the quantitative consequences of the explanation, then saw the model (even in very oversimplified form) fit the data very well. Ideally we would also make a prediction for future seasons, but that doesn’t even seem necessary. That would become necessary when/if a different model with similar fit is proposed.

      How would one approach this problem using DAGs or using NHST to pick between models controlling/adjusting for this or that variable? I genuinely do not know.

    • The DH wasn’t protective against COVID. It was nominally supposed to protect pitchers, since spring training was so short in 2020 that they wanted them not to waste time working on hitting. Of course, everyone at the time knew that was a subterfuge to make a rule change that the Commissioner had wanted for some time. Same with the extra-inning rule change to shorten games.

      • Makes sense. So it is essentially shrinkflation using “health and safety” as an excuse.

        Actually the trash pickup by me used to be twice per week before covid. Then it changed to one day each week during covid for that “health and safety” reason and has stayed that way since.

    • If you read the comment section of Jim Albert’s blog post, this explanation (a change of the AL to NL team ratio) was already discussed on July 25th, and accepted as a reasonable explanation of the observation.

      • Yea, I saw that after. I looked closer at the dates because I first saw that the deviation in plate appearances corresponded to 1 (it turned out to be 2) batters (i.e., something was happening at the team level). But someone more familiar with baseball history would probably make that connection first.

        It is interesting to see how the dates line up so precisely that we will immediately accept that explanation, even without considering the proportion of plate appearances or checking any predictions. There just isn’t going to be an alternative explanation that fits so well without any free parameters.
