Someone who wants to remain anonymous writes:
I am working to create a more accurate in-game win probability model for basketball games. My idea is for each timestep in a game (a second, 5 seconds, etc.), to use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate of the probability of the home team winning.
This problem would seem to fit a multi-level model structure well. It seems silly to estimate 2,000 regressions (one for each timestep), but the coefficients should vary at each timestep. Do you have suggestions for what type of model this could/would be? Additionally, I believe this needs to be some form of logit/probit given the binary dependent variable (win or loss).
Finally, do you have suggestions for what package could accomplish this in Stata or R?
To answer the questions in reverse order:
3. I’d hope this could be done in Stan (which can be run from R).
2. Yes, a model with varying coefficients would make sense. I’d play around with the data, graph some estimates based on different timesteps, and then from there choose a parametric model for how the coefficients change over time, one that fits the data and makes sense.
1. Don’t model the probability of a win, model the expected score differential. Yeah, I know, I know, what you really want to know is who wins. But the most efficient way to get there is to model the score differential and then map that back to win probabilities. The exact same issue comes up in election modeling: it makes sense to predict vote differential and then map that to Pr(win), rather than predicting Pr(win) directly. This is most obvious in very close games (or elections) or blowouts; in either of these settings the win/loss outcome provides essentially zero information. In a blowout the winner was never in doubt, and in a very close game who ends up ahead is close to a coin flip. But it’s true more generally that there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss.
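To make the "map that back to win probabilities" step concrete, here is a minimal sketch in Python. It is not a fitted model, just an illustration under assumptions I'm making up for the example: the final score differential is treated as normally distributed, its mean is the current lead plus the pregame expected margin scaled by the fraction of the game remaining, and its standard deviation shrinks with the square root of the time remaining. The function name `win_probability`, its parameters, and the default `full_game_sd=12.0` are all hypothetical placeholders; in practice you'd estimate the mean and the spread from data (for example in Stan), and you'd also want to deal with overtime and the discreteness of scores, which this ignores.

```python
import math


def win_probability(lead, expected_margin, frac_remaining, full_game_sd=12.0):
    """Pr(home team wins) under an assumed normal model for the final margin.

    lead            -- current home-minus-away score differential
    expected_margin -- pregame expected home margin in points (an assumed
                       sign convention: positive means the home team is favored)
    frac_remaining  -- fraction of game time remaining, between 0 and 1
    full_game_sd    -- assumed sd of the final margin at tip-off
                       (illustrative placeholder, not an estimate)
    """
    if not 0.0 <= frac_remaining <= 1.0:
        raise ValueError("frac_remaining must be between 0 and 1")

    # Expected final differential: points already banked plus the share of
    # the pregame expectation that is still to be played.
    mean_final = lead + expected_margin * frac_remaining

    # Assumed uncertainty: shrinks like sqrt(time remaining).
    sd_final = full_game_sd * math.sqrt(frac_remaining)

    if sd_final == 0.0:
        # No time left: the outcome is decided (0.5 stands in for a tie).
        if mean_final > 0:
            return 1.0
        if mean_final < 0:
            return 0.0
        return 0.5

    # Pr(final differential > 0) using the normal CDF.
    z = mean_final / sd_final
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Same 4-point lead at different stages of an evenly matched game.
    for frac in (1.0, 0.5, 0.1):
        print(frac, round(win_probability(4, 0, frac), 3))
```

The point of the sketch is that the same lead translates into very different win probabilities depending on how much uncertainty remains, and that all of this comes from a model of the differential; the win/loss outcome is never modeled directly.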