I need to model a basketball game and figure out the probabilities of each team winning, along with the probability distribution of total points.

I thought of using a pair of normal distributions, one per team, with expectations and standard deviations that decay over the course of the game.

The problem, though, is the negative values of the normal distribution, which make no sense for points.

As the game progresses and the curve (of the remaining points) moves closer to the negative part of the x-axis, a lot of probability mass falls into the negative region, so my total probabilities no longer sum to 1.

(With respect to converting the normal distribution to a discrete one, I was thinking about a continuity correction.)

So any idea how I could model this? Let’s say for convenience that the marginal distributions of each team’s points are independent:

F(x, y) = f_home(x) * f_away(y)

Then price all (x, y) correct scores and from those derive the winner and total-points probabilities.
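For what it’s worth, the discretize-truncate-renormalize route can be sketched directly. Below is a minimal Python sketch, assuming hypothetical remaining-points means and sds; `remaining_points_pmf` is a made-up helper name, not anyone’s published model:

```python
import numpy as np
from scipy.stats import norm

def remaining_points_pmf(mu, sigma, max_pts=200):
    """Discretize a normal over remaining points k = 0..max_pts with a
    continuity correction, P(X=k) ~ Phi(k+.5) - Phi(k-.5), then
    renormalize so the probabilities sum to 1 (fixing the leakage of
    mass into negative scores)."""
    k = np.arange(max_pts + 1)
    p = norm.cdf(k + 0.5, mu, sigma) - norm.cdf(k - 0.5, mu, sigma)
    return p / p.sum()

# Hypothetical remaining-points means and sds, not fitted values.
home = remaining_points_pmf(55.0, 10.0)
away = remaining_points_pmf(52.0, 10.0)

# Joint over correct scores under independence: F(x, y) = f_home(x) * f_away(y)
joint = np.outer(home, away)

p_home_win = np.tril(joint, -1).sum()  # x > y (rows index home points x)
p_tie      = np.trace(joint)           # x == y
p_away_win = np.triu(joint, 1).sum()   # x < y

# Total points: the pmf of x + y is the convolution of the marginals.
total = np.convolve(home, away)
p_over_110p5 = total[111:].sum()       # e.g. pricing an over/under at 110.5
```

Because each marginal is renormalized first, the win/tie/loss probabilities sum to 1 by construction, which was the original complaint about the raw normal.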

Yeah, ignore Andrew. He don’t know much about soccer. I, on the other hand, am a former fantasy champion in the California Penal League. So I believe that fully qualifies me to help.

Now, as for how you map expected differential into probability, and assuming that true differentials for individual games are distributed (basically) symmetrically around the mean of your estimate:

If E(diff) > 0 -> Prob(win) > .5

If E(diff) < 0 -> Prob(win) < .5
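Under that symmetry assumption, the mapping can be made concrete with a normal model. A minimal sketch; the sigma of 11 points is an assumed game-to-game spread, not an estimate:

```python
from scipy.stats import norm

def win_prob(expected_diff, sigma=11.0):
    """P(win) = P(diff > 0) when the true differential is (roughly)
    normal around the estimate. sigma is an assumed spread, not a
    fitted value."""
    return 1.0 - norm.cdf(0.0, loc=expected_diff, scale=sigma)

p_even = win_prob(0.0)   # exactly .5 at zero expected differential
p_fave = win_prob(4.0)   # a 4-point favorite lands a bit above .5
```

Note the two bullet points fall out immediately: a positive expected differential pushes the mass above zero past one half, a negative one pushes it below.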

I got no credentials. My record as an assistant coach is 3-3. So no reason for you to trust anything I write on soccer.

Use scores as Andrew suggests. One way to do this is described here: “March madness, quantile regression bracketology, and the Hayek hypothesis,”

Roger Koenker and Gilbert W. Bassett Jr., Journal of Business & Economic Statistics, Vol. 28 (2010), No. 1, pp. 26–35.

Andrew – the approach I took was a modification of the LOESS technique, which is basically just a rolling, weighted linear regression. A nice feature of LOESS is that it is very responsive to what the data are actually telling you, without trying to force them into a predetermined equation with a set of parameters to optimize (I’m not too familiar with multilevel modeling; it may have the same advantage). I used R’s locfit package, which extends the LOESS framework to incorporate logistic regression. I see your point regarding modeling point differential, but I think that may break down as you try to model late-game situations.

Even with the vast number of data points you’re going to get from NBA play-by-play data, things are still going to be noisy at 5-second bucket increments, so you’ll still probably need to “borrow” nearby data points from other time buckets to get rational win probabilities that don’t jerk up and down (this is why I went with the LOESS approach).
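The “borrowing” idea can be illustrated without locfit itself. Below is a crude kernel-weighted stand-in (a tricube-weighted win rate rather than a true local logistic fit), run on synthetic data since real play-by-play isn’t to hand:

```python
import numpy as np

def smoothed_win_prob(t_query, times, wins, bandwidth=60.0):
    """Tricube-kernel-weighted win rate around a query time: a crude
    stand-in for locfit's local logistic regression, just to show how
    borrowing nearby time buckets smooths the estimate.
    times: seconds remaining per observation; wins: 0/1 outcomes."""
    u = (times - t_query) / bandwidth
    w = np.clip(1.0 - np.abs(u) ** 3, 0.0, None) ** 3  # tricube, as in LOESS
    return float(np.sum(w * wins) / np.sum(w))

# Synthetic outcomes, just to exercise the smoother: the leading team's
# win rate is 0.8 inside the last 24 minutes, 0.5 before that.
rng = np.random.default_rng(0)
times = rng.uniform(0, 2880, 5000)
wins = (rng.random(5000) < np.where(times < 1440, 0.8, 0.5)).astype(float)

p_late  = smoothed_win_prob(200.0, times, wins)   # late game, high-rate region
p_early = smoothed_win_prob(2500.0, times, wins)  # early game, 50/50 region
```

The bandwidth plays the same role as locfit’s span: wider means smoother but slower to react, which is exactly the trade-off in the 5-second-bucket noise problem above.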

All that being said, I think my regression approach starts to break down as you get down to the last 30 seconds or so of game time. I think Brian Burke has called out similar difficulties when modeling NFL win probability (and his task was several orders of magnitude harder: more discrete states to model, and sparser data).

This is a good excuse to recall that statistical inference is different from ranking. There are many settings where data are available that would improve prediction or estimation but are not used in rating or ranking because of issues of fairness or incentives. For example, suppose students in a class are given a pre-test at the beginning of the semester and a post-test at the end. To form the final grade, it will (probably) be more efficient to include the pre-test in the grading formula, but it does not seem fair to base a student’s final grade on a pre-test.

To answer the question of whether to fit 2000 regressions, the answer is almost certainly no. As other respondents above have implied, it’s better to think about the game as a temporal stochastic process. Doing 2000 regressions, one for each “k seconds left to go” timeslice, opens you up to incoherent inferences between timeslices and throws away a lot of useful information that you could share between timeslices that are close to each other. The stochastic process view allows you to do that information sharing, and ensures that your inference will be coherent between timeslices.

It also gives you a way to do an honest accounting of your uncertainty: in the 2000-regression approach, you’d be using the same outcome 2000 times, once for each regression, but your standard errors would be computed within each model without an easy way to assess how your errors are correlated across timeslices. The stochastic process view solves this too, since you can think of each increment as a replication, and you can use one model to treat the whole game, which gives you coherent standard errors.
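As a toy version of the stochastic-process view, one can model the differential as a random walk and forward-simulate from any game state; the drift and per-second sd below are made-up illustrative numbers, not estimates:

```python
import numpy as np

rng = np.random.default_rng(42)

def win_prob_by_simulation(current_diff, seconds_left,
                           drift_per_sec=0.0, sd_per_sec=0.2,
                           n_sims=100_000):
    """Treat the score differential as a random walk (discrete Brownian
    motion): forward-simulate to the buzzer and count paths that end
    positive. One process model covers every timeslice, so estimates
    at nearby times are automatically coherent."""
    final = current_diff + rng.normal(drift_per_sec * seconds_left,
                                      sd_per_sec * np.sqrt(seconds_left),
                                      size=n_sims)
    return float(np.mean(final > 0))

p_up_ten = win_prob_by_simulation(10.0, 60.0)  # up 10 with a minute left
p_tied   = win_prob_by_simulation(0.0, 60.0)   # tied with a minute left
```

The same fitted process answers the query at every value of seconds_left, which is the coherence-across-timeslices point: there is no way for the 60-second and 65-second estimates to contradict each other.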

I guess the equivalent sports analogy would be having the score for each team at certain time points (0, 10, 22, 40, 50, for example) and trying to infer what the score might have been at intermediate times, conditional on the time series passing through the known values at the known time points.
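Conditioning a random-walk score model on known values at known times gives a Brownian bridge, which has a closed form. A sketch, assuming times in seconds and an arbitrary diffusion scale (the scores and times below are made up):

```python
import numpy as np

def bridge_mean_var(t, t0, x0, t1, x1, sigma=0.2):
    """Distribution of a Brownian-motion score path at time t, conditional
    on passing through (t0, x0) and (t1, x1): a Brownian bridge. sigma
    (points per sqrt-second) is an assumed diffusion scale."""
    frac = (t - t0) / (t1 - t0)
    mean = x0 + frac * (x1 - x0)
    var = sigma ** 2 * (t - t0) * (t1 - t) / (t1 - t0)
    return mean, var

# Score known to be 20 at minute 10 and 44 at minute 22 (hypothetical);
# infer the score distribution at minute 16.
m, v = bridge_mean_var(16 * 60, 10 * 60, 20.0, 22 * 60, 44.0)
```

The mean is just linear interpolation between the known points, and the variance pinches to zero at both endpoints, matching the intuition that the path is nailed down where it was observed.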

Yes, Stan could simulate time 51, 52, etc. sequentially as generated quantities—at least, if I’m understanding your model correctly. All of this seems to be simply forward simulation, no MCMC needed conditional on the inferences for the parameters in the model.

If Stan can do that, I would love to know how.

Wouldn’t it be even more general to try to model the scores for both teams, not just the differential? For example, some teams might be good at defending and generally have very slow-scoring games, etc. Using just the score differential, you would throw some information away.

But, I guess, it would be quite a bit more difficult as well. What would be the right choice for the joint distribution of scores, etc.?

It depends. Stan can’t do inference on discrete parameters, but Stan can simulate discrete generated quantities. So, if you have a model with continuous parameters (which would be appropriate for modeling basketball teams) with data up to halftime, and then you want to simulate from the posterior distribution of final score differentials, yes, you can do that in Stan with no problem.
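A Python stand-in for what that generated-quantities step would do, using fabricated “posterior draws” of per-minute scoring rates rather than an actual Stan fit (all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior draws of per-minute scoring rates, the kind of
# continuous parameter Stan would infer from first-half data.
home_rate = rng.normal(2.2, 0.15, size=4000)   # points per minute
away_rate = rng.normal(2.0, 0.15, size=4000)

halftime_home, halftime_away, minutes_left = 48, 51, 24

# The generated-quantities step: one simulated Poisson second half per
# posterior draw, added to the observed halftime score.
home_final = halftime_home + rng.poisson(home_rate * minutes_left)
away_final = halftime_away + rng.poisson(away_rate * minutes_left)

p_home_win = float(np.mean(home_final > away_final))
```

The discrete final scores are simulated downstream of the continuous parameters, which is exactly the division of labor described above: inference on continuous rates, simulation of discrete outcomes.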

You can see those here: http://www.advancednflstats.com/2011/03/live-ncaa-basketball-win-probability.html

Ken Pomeroy has also done something similar: http://kenpom.com/blog/index.php/weblog/entry/in-game_win_probabilities

Why is this a problem? I’m doing something similar with football (soccer) using Stan without any problems so far.

Some other, older work by Ryan J. Parker (using a Brownian motion model) and Ed Kupfer (both now work for NBA teams) is located at http://web.archive.org/web/20080820164306/http://www.whichteamwins.com/blog/2008/04/29/nba-win-probability-graphs/

and http://web.archive.org/web/20081004132640/http://sonicscentral.com/apbrmetrics/viewtopic.php?t=586

There are others who have dealt with the subject as well over at the APBRmetrics forum.

I think you’d have problems fitting that kind of thing in Stan because of the Poisson nature of the score. Perhaps basketball games score high enough that you could divide the score by 100 and treat it as a Gaussian process. That certainly wouldn’t make sense for something like baseball or football, where there are relatively few scoring events.
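The scale argument can be checked numerically: the normal approximation to a Poisson improves as the rate grows. A sketch using hypothetical typical totals (~100 points for basketball, ~4 runs for baseball):

```python
import numpy as np
from scipy.stats import norm, poisson

def tv_to_normal(lam, kmax):
    """Total-variation distance between a Poisson(lam) pmf and its
    continuity-corrected normal approximation, computed over 0..kmax."""
    k = np.arange(kmax + 1)
    p_pois = poisson.pmf(k, lam)
    p_norm = (norm.cdf(k + 0.5, lam, np.sqrt(lam))
              - norm.cdf(k - 0.5, lam, np.sqrt(lam)))
    return 0.5 * float(np.abs(p_pois - p_norm).sum())

tv_basketball = tv_to_normal(100, 250)  # ~100-point totals
tv_baseball = tv_to_normal(4, 30)       # ~4-run totals
```

The basketball-scale discrepancy is several times smaller, which is why treating the (rescaled) score as Gaussian is defensible there and not for low-scoring sports.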

And then how this interacts with Vegas lines, prediction models, etc. And whether it is a winning strategy.

http://www.stat.columbia.edu/~gelman/research/published/thirds5.pdf

(Not that I can speak intelligently on score differentials in sports…)
