I need to model a basketball game and estimate the probabilities of each winner and of the total points.

I thought of a double normal distribution, with decaying (over time) expectations and standard deviations for each team.

The problem, though, is the negative values of the normal distribution, which make no sense for points.

As the game progresses and the curve (of the remaining points) moves closer to the negative side of the x-axis, a lot of probability mass falls into the negative region, hence my total probabilities do not sum to 1.

(With respect to converting the normal distribution to a discrete one, I was thinking about a continuity correction.)

So, any idea how I could model this? Let’s say, for convenience, that the marginal distributions of each team’s points are independent:

f(x, y) = f_home(x) * f_away(y)

Then price all (x, y) correct scores and figure out the winner and total-points probabilities.
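A minimal sketch of that pricing step, assuming each marginal is a normal discretized onto non-negative integer points with a continuity correction and then renormalized (the renormalization is what fixes the negative-mass problem). The parameter values in the usage line are made up:

```python
import math

def discrete_truncated_normal(mu, sigma, max_pts=200):
    """Discretize a normal over non-negative integer points with a
    continuity correction, then renormalize so the mass sums to 1
    (recovering the probability that would fall below -0.5)."""
    cdf = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    probs = []
    for k in range(max_pts + 1):
        lo = (k - 0.5 - mu) / sigma
        hi = (k + 0.5 - mu) / sigma
        probs.append(cdf(hi) - cdf(lo))
    total = sum(probs)  # mass lost below -0.5 or above max_pts + 0.5
    return [p / total for p in probs]

def price_scores(mu_home, sig_home, mu_away, sig_away):
    """Price every correct score under independent marginals:
    f(x, y) = f_home(x) * f_away(y)."""
    home = discrete_truncated_normal(mu_home, sig_home)
    away = discrete_truncated_normal(mu_away, sig_away)
    p_home = p_away = p_tie = 0.0
    totals = {}
    for x, px in enumerate(home):
        for y, py in enumerate(away):
            p = px * py
            if x > y:
                p_home += p
            elif y > x:
                p_away += p
            else:
                p_tie += p
            totals[x + y] = totals.get(x + y, 0.0) + p
    return p_home, p_away, p_tie, totals
```

With made-up expectations of 102 vs. 98 and SDs of 10, `price_scores(102, 10, 98, 10)` returns win/tie probabilities that sum to 1 and a full total-points distribution.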

Soccer,

Yeah, ignore Andrew. He don’t know much about soccer. I, on the other hand, am a former fantasy champion in the California Penal League. So I believe that fully qualifies me to help.

Now, as for how you map expected differential into probability, and assuming that true differentials for individual games are distributed (basically) symmetrically around the mean of your estimate:

If E(diff) > 0 —> Prob(win) > .5

If E(diff) < 0 —> Prob(win) < .5
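In code, assuming the symmetric distribution is normal with some spread sigma (the default of 11 below is just a placeholder, roughly the historical SD of NBA game margins around the expected differential):

```python
import math

def win_prob(expected_diff, sigma=11.0):
    """P(win) = P(realized diff > 0), assuming the realized differential
    is normal around expected_diff. sigma=11 is a made-up placeholder."""
    return 0.5 * (1 + math.erf(expected_diff / (sigma * math.sqrt(2))))
```

This reproduces the rules above: `win_prob(0)` is exactly .5, positive expected differentials map above .5, negative ones below.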

Soccer:

I got no credentials. My record as an assistant coach is 3-3. So no reason for you to trust anything I write on soccer.

I think the issue here is modelling scoring *within* a game, whereas if you’re modelling *over* games, you have continuous parameters that give you lambdas for a Poisson distribution, and there is no problem with that.

Use scores as Andrew suggests. One way to do this is described here:

Roger Koenker and Gilbert W. Bassett Jr., “March madness, quantile regression bracketology, and the Hayek hypothesis,” JBES, Vol. 28, No. 1 (2010), pp. 26–35.

Thanks Daniel.

Andrew – the approach I took was a modification of the LOESS technique, which is basically just a rolling, weighted linear regression. A nice feature of LOESS is that it is very responsive to what the data are actually telling you, without forcing them into a pre-determined equation with a set of parameters to optimize (I’m not too familiar with multilevel modeling; it may have the same advantage). I used R’s locfit package, which extends the LOESS framework to incorporate logistic regression. I see your point regarding modeling point differential, but I think that may break down as you try to model late-game situations.

Even with the vast amount of data points you’re going to get from NBA play by play data, things are still going to be noisy at the 5 second bucket increments, so you’ll still probably need to “borrow” nearby data points from other time buckets to get rational win probabilities that don’t jerk up and down (this is why I went with the LOESS approach).

All that being said, I think my regression approach starts to break down as you get down to the last 30 seconds or so of game time. I think Brian Burke has called out similar difficulties when modeling NFL win probability (and his task was several orders of magnitude harder: more discrete states to model, and sparser data).
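For anyone who wants to try the idea without locfit, here is a rough stand-in: a logistic regression of win on score differential, fit locally with Gaussian weights over game time, so that nearby time buckets "borrow" data from each other. The function name, bandwidth, and optimizer settings are all made up, and plain gradient ascent is used only to keep the sketch dependency-free:

```python
import math

def local_logistic_win_prob(times, diffs, wins, t_query,
                            bandwidth=60.0, iters=200, lr=0.1):
    """Kernel-weighted logistic regression of win ~ score differential,
    fit locally around game time t_query (seconds). Gaussian weights let
    nearby time buckets share data, smoothing out noisy 5-second bins."""
    w = [math.exp(-0.5 * ((t - t_query) / bandwidth) ** 2) for t in times]
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for wi, d, y in zip(w, diffs, wins):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * d)))
            g0 += wi * (y - p)        # weighted log-likelihood gradient
            g1 += wi * (y - p) * d
        b0 += lr * g0 / len(w)
        b1 += lr * g1 / len(w)
    return lambda d: 1.0 / (1.0 + math.exp(-(b0 + b1 * d)))
```

Refitting at each `t_query` on a grid gives a win-probability curve that varies smoothly over game time instead of jerking between buckets.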

Ties are an issue but a minor issue. Overtime games are rare. One could just count the score before overtime (thus including ties) or else just use final score and not worry about it. For most purposes it won’t really matter.

Why is that more efficient? Unless you are rating the teacher? To the extent you believe grades reflect & signal the quality of your student product to external “purchasers” isn’t the pre-test score absolutely irrelevant?

Jonathan:

This is a good excuse to recall that statistical inference is different than ranking. There are many settings where data are available that will improve prediction or estimation but are not used in rating or ranking because of issues of fairness or incentives. For example, suppose students in a class are given a pre-test at the beginning of the semester and a post-test at the end. To form the final grade, it will (probably) be more efficient to include pre-test in the grading formula, but it does not seem fair to base a student’s final grade on a pre-test.

To answer the question of whether to fit 2000 regressions: the answer is almost certainly no. As other respondents above have implied, it’s better to think about the game as a temporal stochastic process. Doing 2000 regressions, one for each “k seconds left to go” timeslice, opens you up to incoherent inferences between timeslices and throws away a lot of useful information that you could share between timeslices that are close to each other. The stochastic process view allows you to do that information sharing and ensures that your inference will be coherent between timeslices.

It also gives you a way to do an honest accounting of your uncertainty: in the 2000-regression approach, you’d be using the same outcome 2000 times, once for each regression, but your standard errors would be computed within each model, without an easy way to assess how your errors are correlated across timeslices. The stochastic process view solves this too, since you can think of each increment as a replication and use one model to treat the whole game, which gives you coherent standard errors.
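As a toy illustration of the single-model view, treat the score differential as a driftless Gaussian random walk: the win probability at any game state then comes from one closed-form model, so estimates at nearby timeslices cannot be mutually incoherent. The per-game SD of 12 is a made-up placeholder:

```python
import math

def win_prob_path(current_diff, seconds_left, sigma_per_game=12.0,
                  game_seconds=2880):
    """One model for every timeslice: the final differential, given the
    current one, is normal with SD shrinking as sqrt(time remaining),
    so P(win) = Phi(diff / (sigma * sqrt(fraction of game left)))."""
    if seconds_left <= 0:
        return 1.0 if current_diff > 0 else 0.0
    sd = sigma_per_game * math.sqrt(seconds_left / game_seconds)
    return 0.5 * (1 + math.erf(current_diff / (sd * math.sqrt(2))))
```

The same 5-point lead is worth more with 60 seconds left than at halftime, and the curve over time is smooth by construction rather than by post-hoc smoothing.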

Hmm, I guess I’m actually thinking of a different kind of difficulty. I had a friend who was working on something similar, but in her case it was kind of a missing-data interpolation issue. Since the sum of all the missing data values had to add up to certain observed values, the outcomes for intermediate time points were conditional on later data and couldn’t be simply generated quantities.

I guess the equivalent sports analogy would be having the score for each team at certain time points (0, 10, 22, 40, 50, for example) and trying to infer what the score might have been at intermediate times, conditional on the timeseries going through the known values at the known time points.
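That conditional-interpolation problem has a clean Gaussian analogue, the Brownian bridge: the value at an intermediate time, given the values at two known time points, is normal with mean on the line between them and variance that pinches to zero at both endpoints. A sketch (a real score path would need an integer-valued, non-decreasing process, but the conditioning structure is the same):

```python
import math
import random

def bridge_sample(t0, x0, t1, x1, t):
    """Sample a Brownian motion at intermediate time t, conditional on
    it passing through (t0, x0) and (t1, x1) -- a Brownian bridge."""
    frac = (t - t0) / (t1 - t0)
    mean = x0 + frac * (x1 - x0)              # linear interpolation
    var = (t1 - t) * (t - t0) / (t1 - t0)     # zero at both endpoints
    return random.gauss(mean, math.sqrt(var))
```

Chaining bridges between each pair of adjacent known time points fills in the whole path consistently with the observed values.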

Daniel:

Yes, Stan could simulate time 51, 52, etc. sequentially as generated quantities—at least, if I’m understanding your model correctly. All of this seems to be simply forward simulation, no MCMC needed conditional on the inferences for the parameters in the model.
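A sketch of that forward-simulation pattern, with the Markov chain taken (purely for illustration) to be a Gaussian random walk on the differential and the parameters held fixed at their inferred values:

```python
import random

def simulate_rest(diff_now, steps_left, step_sd, n_sims=4000, seed=1):
    """Forward-simulate the remaining timeslices of a Markov chain,
    conditional on the current state and fixed model parameters. This
    is the generated-quantities pattern: each future increment is drawn
    sequentially, so no MCMC over the future path is needed."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_sims):
        d = diff_now
        for _ in range(steps_left):        # time 51, 52, ..., 100
            d += rng.gauss(0.0, step_sd)   # one Markov step
        finals.append(d)
    return finals
```

In a full Bayesian version you would draw `step_sd` (and any other parameters) from the posterior on each simulation pass instead of fixing it, which is exactly what a generated quantities block does per MCMC draw.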

Suppose your model is a Markov chain, so that the posterior distribution of the score differential at time i depends on the score and other parameters at i−1. Now split the time into 100 intervals and suppose you have data up to time 50. I could see how maybe Stan could sample from the posterior for time 51, but could it sample from the posterior for time 100? Various statistics at time 51 become parameters that affect time 52, which in turn affects 53, etc.

If Stan can do that, I would love to know how.

Wouldn’t it be even more general to model the scores for both teams, not just the differential? For example, some teams might be good at defending and generally have very slow-scoring games, etc. Using just the score differential, you would throw some information away.

But, I guess, it would be quite a bit more difficult as well. What would be the right choice for the joint distribution of scores, etc.?

Daniel:

It depends. Stan can’t do inference on discrete parameters, but Stan can simulate discrete generated quantities. So, if you have a model with continuous parameters (which would be appropriate for modeling basketball teams) with data up to halftime, and then you want to simulate from the posterior distribution of final score differentials, yes, you can do that in Stan with no problem.

Stan can’t sample discrete parameters. I guess so long as you’re only interested in historic games that’s OK, since they’re observed; but if you’re trying to observe, say, up to half-time and then see distributions over future score differentials (where the future score differentials are now parameters), then you won’t be able to do it until Stan can sample the Poisson paths.

You can see those here: http://www.advancednflstats.com/2011/03/live-ncaa-basketball-win-probability.html

Ken Pomeroy has also done similar: http://kenpom.com/blog/index.php/weblog/entry/in-game_win_probabilities

“I think you’d have problems fitting that kind of thing in Stan because of the Poisson nature of the score.”

Why is this a problem? I’m doing something similar with football (soccer) using Stan without any problems so far.

Some other older work by Ryan J Parker (using a Brownian motion model) and Ed Kupfer (both work for NBA teams now) is located at http://web.archive.org/web/20080820164306/http://www.whichteamwins.com/blog/2008/04/29/nba-win-probability-graphs/

and http://web.archive.org/web/20081004132640/http://sonicscentral.com/apbrmetrics/viewtopic.php?t=586

There are others who have dealt with the subject as well over at the APBRmetrics forum.

Also, your score can’t go down, but the score differential can change in both directions, so modeling the score differential would work better if you’re going to divide by 100 and use a Gaussian-type process.

It sounds like some kind of model where the score is a time-series Poisson process with a variable rate (of scoring), where the rate is predicted from the current score, score differential, possession, and proxies for tiredness and other aspects of the development of the game.

I think you’d have problems fitting that kind of thing in Stan because of the Poisson nature of the score. Perhaps basketball games score high enough that you could divide the score by 100 and treat it as a Gaussian process. That certainly wouldn’t make sense for something like baseball or football, where there are relatively few scoring events.
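For concreteness, here is a toy version of that variable-rate process: discrete one-second slices, with each team's per-slice scoring probability supplied by an arbitrary state-dependent rate function. Everything here (the rate function's form, the fixed two points per scoring event) is a made-up simplification:

```python
import random

def simulate_game(rate_fn, seconds=2880, seed=0):
    """Simulate scoring as a discrete-time approximation of a variable-
    rate Poisson process: in each one-second slice, each team scores
    with a probability given by rate_fn(t, score_a, score_b), so the
    rate can depend on time and the current game state. Each scoring
    event is worth a fixed 2 points for simplicity."""
    rng = random.Random(seed)
    a = b = 0
    for t in range(seconds):
        ra, rb = rate_fn(t, a, b)
        if rng.random() < ra:
            a += 2
        if rng.random() < rb:
            b += 2
    return a, b
```

With a constant per-second rate of about 0.018 per team, simulated final scores land around 100 points a side, and the rate function is where score-, possession-, and fatigue-dependence would plug in.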

And then how this interacts with Vegas lines, prediction models, etc. And whether it is a winning strategy.

http://www.stat.columbia.edu/~gelman/research/published/thirds5.pdf

(Not that I can speak intelligently on score differentials in sports…)
