He was fooled by randomness—until he replicated his study and put it in a multilevel framework. Then he saw what was (not) going on.

An anonymous correspondent who happens to be an economist writes:

I contribute to an Atlanta Braves blog and I wanted to do something for Opening Day. Here’s a very surprising regression I just ran. I took the 50 Atlanta Braves full seasons (excluding strike years and last year) and ran the OLS regression: Wins = A + B Opening_Day_Win.

I was expecting to get B fairly close to 1, i.e., “it’s only one game.” Instead I got 79.8 + 7.9 Opening_Day_Win. The first day is 8 times as important as a random day! The 95% CI is 0.5–15.2, so while you can’t quite reject B=1 at conventional significance levels, it’s really close: F-test p-value of .066.

I have an explanation for this (other than chance), which is that opening day is unique in that you’re just about guaranteed to have a meeting of your best pitcher against the other team’s, which might well give more information than a random game, but I find this really surprising. Thoughts?

Note: If I really wanted to pursue this, I would add other teams, try random games rather than opening day, and maybe look at days two and three.
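The regression in question can be sketched in a couple of lines. With a single 0/1 regressor, OLS reduces to a difference in group means; the four-season dataset below is made up for illustration, not the real Braves data:

```python
# OLS of season wins on a 0/1 opening-day result.
# With a single binary regressor, the fitted slope B is just
# mean(wins | won opener) - mean(wins | lost opener),
# and the intercept A is mean(wins | lost opener).

def ols_binary(x, y):
    won  = [yi for xi, yi in zip(x, y) if xi == 1]
    lost = [yi for xi, yi in zip(x, y) if xi == 0]
    a = sum(lost) / len(lost)
    b = sum(won) / len(won) - a
    return a, b

# Hypothetical mini-dataset: four seasons, not the real Braves numbers.
opener = [1, 1, 0, 0]
wins   = [90, 86, 82, 78]
a, b = ols_binary(opener, wins)
print(a, b)  # 80.0 8.0
```

The correspondent’s 79.8 + 7.9·Opening_Day_Win is exactly this kind of group-mean comparison: seasons that opened with a win averaged about 8 more wins than seasons that opened with a loss.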

Before I had a chance to post anything, my correspondent sent an update, subject-line “Never mind”:

I tried every other day: 7.9 is kinda high, but there are plenty of other days that are higher and a bunch of days are negative. It’s just flat-out random…. (There’s a lesson there somewhere about robustness.) Here’s the graph of the day-to-day coefficients:
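That day-by-day replication can be sketched in simulation. The team-quality numbers here (a 55%/45% good/bad split) are made up for illustration, not the real Braves data; the point is just the spread of the 162 coefficients:

```python
import random

random.seed(1)

# Simulate 50 seasons: each year the team is "good" (55% per-game win
# probability) or "bad" (45%), then plays 162 independent games.
def simulate_seasons(n_seasons=50, n_games=162):
    seasons = []
    for _ in range(n_seasons):
        p = random.choice([0.45, 0.55])
        seasons.append([1 if random.random() < p else 0 for _ in range(n_games)])
    return seasons

def slope(x, y):
    # OLS slope for a 0/1 regressor: difference in group means.
    won  = [yi for xi, yi in zip(x, y) if xi == 1]
    lost = [yi for xi, yi in zip(x, y) if xi == 0]
    return sum(won) / len(won) - sum(lost) / len(lost)

seasons = simulate_seasons()
totals = [sum(s) for s in seasons]

# One coefficient per game number, as in the correspondent's check.
coefs = [slope([s[g] for s in seasons], totals) for g in range(162)]
print(min(coefs), max(coefs))  # wide spread; game 1 is nothing special
```

Opening day’s coefficient is just one draw from this noisy distribution, which is the correspondent’s point.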

The lesson here is, as always, to take the problem you’re studying and embed it in a larger hierarchical structure. You don’t always have to go to the trouble of fitting a multilevel model; it can be enough sometimes to just place your finding as part of the larger picture. This might not get you tenure at Duke, a TED talk, or a publication in Psychological Science circa 2015, but those are not the only goals in life. Sometimes we just want to understand things.

31 thoughts on “He was fooled by randomness—until he replicated his study and put it in a multilevel framework. Then he saw what was (not) going on.”

    • Graham:

      Sure, except that (a) papers are published and conclusions get made from “p less than 0.1” also, when they fit a story people want to tell, and (b) the p-value in this case could easily have been less than 0.05. So, no, I don’t think the framework of “make a strong claim if some threshold of statistical significance is reached” works! What does work is the framework of “if you see something interesting, follow it up with more data and see if it replicates.”

  1. Totally fascinating.

    Curiously, have you repeated this for the next game played, with the same pitcher and players (say, the fourth or fifth game)? B should be fairly close to 8, right?

  2. The typical coefficient is higher than 1 across all days because team quality varies from year to year. Winning the focal game adds one to your win total, and also means that we’re more likely to be looking at a year in which the team is good.

    • Yes. I’m surprised the emailer and Gelman did not comment on this. If you added in a control for team quality (maybe a pre-season projection) then the coefficient would be at least somewhat interesting, although I think it would still be theoretically greater than 1 as winning that game contains some information about quality not captured in the pre-season projection. This is particularly surprising given that the emailer has an economics background, where interpreting regressions causally is the name of the game.

    • This seems right. Toy version of the problem:

      In half their seasons, the team is “good” and has a 55% chance of winning each game. In the other half they’re bad and have a 45% chance of winning each game.

      These assumptions imply B=2.61.

      Among the cases where the team wins game 1 (or any other individual game that you pick), 55% of the time they’re a good team that season and 45% of the time they’re a bad team. When they’re a good team, they go on to win .55*161 of their remaining games, when they’re bad .45*161. So number of wins is their game 1 win plus 161*(.55^2+.45^2), which is 82.305. When the team doesn’t win game 1, they win 161*(.45*.55+.55*.45)=79.695 games. Which is a difference of 2.61, with 1 coming from the win they already have and the remaining 1.61 coming from the game 1 winners more often being a good team and therefore doing better over the rest of the season.

      Tinkering with the numbers, if the good/bad split is 53% vs. 47% then B=1.58, and if it’s 57% vs. 43% then B=4.16.

      Now checking some actual data… in the better half of the 50 full seasons, the Atlanta Braves averaged a 58.3% winning percentage, and in the other 25 seasons they averaged 44.5%. Plugging those numbers in gives a gap of B=4.07.
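      The toy model has a closed form: condition on the game-1 result, update the chance the team is good, and compare expected season totals. A quick check of the numbers above, assuming the 50/50 good/bad prior and 162 games:

```python
# Expected-wins gap B in the toy model: each season the team is "good"
# (per-game win probability p_good) or "bad" (p_bad), 50/50 a priori.
def toy_B(p_good, p_bad, n_games=162):
    rest = n_games - 1
    # P(good season | won game 1) and P(good season | lost game 1)
    pg_win  = p_good / (p_good + p_bad)
    pg_loss = (1 - p_good) / ((1 - p_good) + (1 - p_bad))
    e_win  = 1 + rest * (pg_win * p_good + (1 - pg_win) * p_bad)
    e_loss = rest * (pg_loss * p_good + (1 - pg_loss) * p_bad)
    return e_win - e_loss

print(round(toy_B(0.55, 0.45), 2))    # 2.61
print(round(toy_B(0.53, 0.47), 2))    # 1.58
print(round(toy_B(0.57, 0.43), 2))    # 4.16
print(round(toy_B(0.583, 0.445), 2))  # 4.07 (the Braves-based numbers)
```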

      That’s close to the observed B=4.27 average that Carlos Ungil found.

  3. What’s needed here is the win total relative to the team’s winning percentage over the entire time frame. Over the last 50 years, the team has a 51.6% winning percentage, averaging 83.7 wins per 162-game season. So the Opening Day win bumps them up to 88 wins, while the loss brings them down to 80 wins.

    As Sam stated, it might be from Maddux (or some other stud) starting Opening Day and having him for the whole season. An Opening Day win might just be a proxy for having a staff ace. He could be worth 0 to 4 WAR more than an average Opening Day starter.

    There could also be some “winning leads to more winning” going on. A team one game over .500 at the trade deadline might look to add talent (the Braves this year), but if they are under .500, they might sell off pieces and sink further down.

    I’m sure there is some randomness going on, but I wouldn’t completely dismiss the initial finding.

    • But we can dismiss it because the opening day coefficient isn’t an outlier when compared to any other randomly chosen game. The coefficient on game_win is just partly picking up the quality of the team that year, which is why on average it is well above 1 – we don’t need any fancier explanations than that.

      • Not even close, for me. And as you said, it’s just picking up part of the team’s quality. There are no either-ors in baseball; it’s several pieces being added together, hopefully with a better understanding.

        Baseball is insanely noisy and just yesterday over at FanGraphs, I was happy to find a factor that correlates with an r-squared of 0.02. I expected none.

        All the low-hanging fruit is gone from baseball analysis. It’s just baby steps from now on.

        • Ya, you are overthinking this one. Nobody is suggesting this is ground-breaking research; this was a simple mix-up of correlation vs. causation.

  4. So the multilevel regression run here is `Wins ~ intercept + Game1*beta1 + Game2*beta2 + …`, right?

    To be clear, is the lesson here that we just cannot use a win on *any* day to really predict, because of the randomness of the coefficients?

    Also, from the picture, the larger coefficient values are >= 1; would 95% confidence intervals perhaps show a better picture of the randomness here?

  5. I didn’t look at the Braves’ record for each season, since they have been around a long time. But the regression is just saying that on average the Braves tend to win 80 games when they lose opening day and 88 when they win opening day. It is obviously not causal.

  6. I don’t think I agree with the assumption that beta should be 1. Better teams are more likely to win games, and better teams are more likely to win their first game. If I told you team X won their first game and team Y lost theirs, and asked you how many more wins you thought team X would have at the end of the year, you’d be silly to say 1. Of course a team that won any particular game is likely a little better in general and will win more games overall; the emailer’s story treats baseball games like coin tosses.

    • Beta is not “how many more wins” the team will have at the end of the year. Beta is “how much this win matters for the final count”. And if all the outcomes are independent (conditional on a fixed probability of winning) we can expect beta to be one. At least that’s what I’ve found: repeating the simulation https://imgur.com/a/rOzDKYm with a different probability of winning gives a similar result.

      • But of course you can’t condition on a fixed probability of winning. The confounding variable here can be thought of as a team’s ‘true’ win probability. This is correlated with winning the first game and also with total wins in a season.

        • The point is that it’s “correlated” with winning the first game and with winning every other game such that the beta is one (assuming the probability of winning every game is the same; it cannot be a confounding variable if it doesn’t vary!).

      • You’re ignoring the selection effect: are all teams equally likely to win their first game? No, better teams are more likely to do so, and better teams are more likely to win more games in general. I have no idea how you generated your graph, but I think if you generated a distribution of win percentages (or Elo or whatever), simulated a bunch of seasons, and repeated this procedure, you would see what the emailer saw.

        Beta is not “how much this win matters for the final count,” it is “how much this win matters for the final count plus selection effects,” and beta=1 ignores the selection effects.

        • If you simulate a bunch of seasons where the probability of winning every match in every season is, say, 90%, and you do the regression of the total number of wins T (a number around 0.9 times the number of matches in the season) on the result F of the first match (0 or 1), you will find beta = 1.

          Don’t you agree that if you do a regression of the number of wins excluding the first match (i.e. T if F=0 and T-1 if F=1) you will find beta = 0? The result of the first match is independent of whether the team wins more or less than 90% of the remaining matches (assuming the probability of winning every match is 90% and they are independent events).
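          A minimal sketch of that check, simulating seasons with a constant, made-up 90% win probability (the season count is chosen only to tame the noise):

```python
import random

random.seed(0)

def slope(x, y):
    # OLS slope for a 0/1 regressor: difference in group means.
    won  = [yi for xi, yi in zip(x, y) if xi == 1]
    lost = [yi for xi, yi in zip(x, y) if xi == 0]
    return sum(won) / len(won) - sum(lost) / len(lost)

P, N_GAMES, N_SEASONS = 0.9, 162, 20000
first, totals = [], []
for _ in range(N_SEASONS):
    games = [1 if random.random() < P else 0 for _ in range(N_GAMES)]
    first.append(games[0])
    totals.append(sum(games))

# Total wins on the first-game result: slope near 1.
beta_total = slope(first, totals)
# Remaining wins (total minus the first game) on the same regressor: near 0.
beta_rest = slope(first, [t - f for t, f in zip(totals, first)])
print(beta_total, beta_rest)
```

Since every season’s total is the first-game result plus the remaining wins, the two slopes differ by exactly 1.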

        • Maybe we’re saying the same thing: we would expect beta = 1 if the probability of winning was constant. But we shouldn’t expect the probability of winning to be constant, so we shouldn’t expect beta to be one either.

        • Exactly, yes. I suppose I’m pushing back because I think it should be quite clear without resorting to simulation that the results in the post above are expected and in line with reality.

        • I don’t think it’s _that_ clear what the results in the post above are. The correspondent’s conclusion was “It’s just flat-out random…” but, as Daniel Weissman pointed out, the typical coefficient is high. The point of my chart was to make clear how far the results are from what would be found if it were just noise and not caused by serial correlation.

        • > quite clear without resorting to simulation

          Clear to whom, on the basis of what?

          Whatever that may be, it can be verified by resorting to simulation (that is, by being critically scientific and using it).

          Even if it is a theorem, other than embarrassment and some wasted time, I do not see a downside.

      • Let’s think about this like good Bayesians. E[wins] = number of wins now + games_remaining*win_probability, where win_probability is a random variable denoting the proportion of *future* games you think that team will win. If your win_probability does not increase after a team wins their first game, I would argue you either 1) think beta should equal 1 and 2) are not a very good Bayesian, or 3) don’t understand sports. Perhaps this is the best decomposition of the problem: do you think win_probability should remain the same or increase after observing a single win? Elo users around the world think it should go up, and so do I.

          Beta is merely E[wins | win first game] - E[wins | lost first game], which equals 1 if win_probability is fixed after a win (non-Bayesian) and which is greater than 1 if win_probability increases after a win.

      • This actually might be a good example of how people interpret “regressions” differently. Carlos is referring to the structural model, y=xb + u, where u and x might be correlated. b is a structural parameter, (theoretically) defined to be the “effect of winning the first game on a team’s win total” (i.e. their win total if they win their first game minus the counterfactual win total if they lose their first game). This is how economists are taught to think about regression (as evidenced by the notion that we require assumptions for b to be unbiased, namely cov(x,u) = 0, when in fact the estimate is constructed such that that “assumption” always holds in the sample).

        Conversely, Jackson is interpreting y=xb + u as a regression model, where b is defined such that x and u are uncorrelated. As such, confounding variables like team quality get loaded into b in this model and its interpretation changes from the structural interpretation.

        • I come from an econometrics background and I think of this as a simple case of Cov(x,u)>0 -> E[bhat]>b. I think of the regression as the effect of winning the first game plus selection effects on a team’s win total. We know the former is 1 and the latter is positive, so the results in the blog post are completely expected. This may be one area where econometrics’ obsession with unbiasedness and confounding leads to a correct understanding the first time around; the notion that this regression coefficient should be one (or even close) is ludicrous in my eyes. Econometrics rightfully gets a bad rap on this blog a lot, but I’m chalking this one up as our first win.

          My metrics courses were pretty reduced form/causal inference, perhaps you studied at a freshwater school where it’s all about the structural interpretation, baby.

        • My point is only that if you are calling y=xb+u a ‘regression model’, then we know that x and u are uncorrelated (as that is part of the definition of a regression model), and therefore ‘b’ already has the aforementioned selection effects in it. If you are going to make some claim about cov(x,u) then you must be referring to a structural model, where ‘b’ has some theoretical meaning. Anyways, this is mostly semantics.

          My training is in econometrics as well. Reading the first chapter of Mostly Harmless was an eye-opening experience for me, as that is not how economists are typically taught regression.

  7. I feel one way to view this is as a typical biased-coin estimation problem. If you start with a uniform prior for the year’s win %, a single flip of heads will give a posterior with mean 2/3 win prob, and a tails would give a mean of 1/3. Extending to 162 games, you’d predict a 54-win difference from observing a single win vs. a single loss. Of course, a uniform prior is dumb, as we can leverage the wins from previous seasons.
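  That update is easy to verify with exact arithmetic; a sketch assuming a uniform Beta(1, 1) prior on the season win probability:

```python
from fractions import Fraction

def posterior_mean(wins, games, a=1, b=1):
    # Beta(a, b) prior; posterior is Beta(a + wins, b + games - wins),
    # whose mean is (a + wins) / (a + b + games).
    return Fraction(a + wins, a + b + games)

after_win  = posterior_mean(1, 1)  # 2/3
after_loss = posterior_mean(0, 1)  # 1/3
diff = 162 * (after_win - after_loss)
print(after_win, after_loss, diff)  # 2/3 1/3 54
```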
