Original data:

a: 2.231

b: -.2557

predicted prob of success for x=0: .903

New data:

a: 2.836

b: -.2481

predicted prob of success for x=0: .9446

Original data appended with New data (i.e., one big dataset):

a: 2.832

b: -.2481

predicted prob of success for x=0: .9443
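All three fits use the same inverse-logit link, so the quoted x = 0 probabilities can be checked directly. A minimal sketch, assuming the model is logit(p) = a + b·x with x = distance in feet:

```python
import math

def predicted_prob(a, b, x):
    """Inverse-logit of the linear predictor a + b*x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Reproduce the x = 0 predictions quoted above for each fit.
print(round(predicted_prob(2.231, -0.2557, 0), 3))   # 0.903  (original data)
print(round(predicted_prob(2.836, -0.2481, 0), 4))   # 0.9446 (new data)
print(round(predicted_prob(2.832, -0.2481, 0), 4))   # 0.9444 (combined; the quoted
                                                     # .9443 presumably comes from
                                                     # unrounded coefficients)
```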

What were the populations of the original and new data? Amateurs and pros, or all pro players?

Justin

Look at the data. The golfers made 45183 out of 45198 putts that were an average of 0.28 feet from the hole. 45183/45198 = 0.9997. Not 0.94.

http://www.statisticool.com/golfputtingmodels.htm

For the new data, I’m going to assume the data from 1996 and 2016–2018 are independent, merge the datasets together, and run a standard logistic regression on the combined data.
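The “standard logistic regression” step on grouped (distance, attempts, makes) counts can be sketched with a plain Newton/IRLS fit. The counts below are hypothetical, just to show the mechanics:

```python
import math

def fit_logistic(groups, iters=25):
    """Newton/IRLS fit of logit(p) = a + b*x on grouped binomial data.

    groups: list of (x, n_trials, n_successes) tuples. Returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # score (gradient) components
        h00 = h01 = h11 = 0.0    # Fisher information entries
        for x, n, y in groups:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r = y - n * p        # residual on the count scale
            w = n * p * (1.0 - p)
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += ( h11 * g0 - h01 * g1) / det   # 2x2 Newton step
        b += (-h01 * g0 + h00 * g1) / det
    return a, b

# Hypothetical made/attempted counts by distance in feet:
toy = [(2, 1000, 930), (8, 1000, 670), (14, 1000, 430), (20, 1000, 260)]
a, b = fit_logistic(toy)   # a > 0, b < 0, as in the fits above
```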

Summary:

-golfers predicted to miss more than make putts if distance > ~12ft

-prob(making distance=0 putt) = 94%. Maybe not 100% for small distances because of the well-known “yips”.

-model over-predicts success for short distances, and under-predicts success for large distances
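The first bullet follows directly from the combined-fit coefficients: the model predicts more misses than makes once a + b·x < 0, i.e. beyond x = −a/b.

```python
# Distance at which predicted success drops below 50%: solve a + b*x = 0.
a, b = 2.832, -0.2481        # combined-fit coefficients from above
crossover = -a / b
print(round(crossover, 1))   # 11.4 ft, i.e. roughly the "~12 ft" in the summary
```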

Now I’ll do the same thing, but if a distance is 16, I’ll merge those counts in with the distance 16.43 cases, etc., instead of treating them like separate categories. For example:

16    201   27

16.43 35712 7196

merged: 16.43 35913 (= 35712 + 201) 7223 (= 7196 + 27)

I wouldn’t have merged like this if it were 16.67; I would have merged that in with the 17 category. Dichotomania.
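The merging rule described above (fold a coarse integer-distance row into the nearest finer-grained row, rather than keeping separate categories) can be sketched as follows; the column layout (distance, then two count columns) matches the example, and the 0.5-ft cutoff is an assumption consistent with the 16→16.43 and 16.67→17 cases:

```python
def merge_nearest(rows, tol=0.5):
    """Fold rows into the following row when the distances are within tol.

    rows: list of (distance, count1, count2), sorted by distance."""
    out = []
    i = 0
    while i < len(rows):
        d, n1, n2 = rows[i]
        if i + 1 < len(rows) and rows[i + 1][0] - d < tol:
            d2, m1, m2 = rows[i + 1]
            out.append((d2, n1 + m1, n2 + m2))   # absorb into the finer row
            i += 2
        else:
            out.append((d, n1, n2))
            i += 1
    return out

print(merge_nearest([(16, 201, 27), (16.43, 35712, 7196)]))
# [(16.43, 35913, 7223)]
```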

Doing that, I get… not much changes at all from what I did at first.

I realized there was a ton of overdispersion, so I refit with Williams’ method to accommodate it. Much better, I think, but still not great.
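A quick way to see the overdispersion being described is the Pearson chi-square per degree of freedom for the grouped-binomial fit. This is the simpler quasi-binomial diagnostic, not Williams’ method itself (Williams instead iteratively reweights each group by 1/(1 + φ(n − 1))); the demo counts are hypothetical:

```python
import math

def pearson_dispersion(groups, a, b, n_params=2):
    """Pearson chi-square / df for a grouped-binomial logistic fit.

    groups: (x, n_trials, n_successes) tuples; a, b: fitted coefficients.
    A value well above 1 signals overdispersion; quasi-binomial practice
    scales standard errors by sqrt(phi)."""
    x2 = 0.0
    for x, n, y in groups:
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        x2 += (y - n * p) ** 2 / (n * p * (1.0 - p))
    return x2 / (len(groups) - n_params)

# Hypothetical grouped data against the combined-fit coefficients:
demo = [(2, 500, 430), (8, 500, 300), (14, 500, 180)]
print(round(pearson_dispersion(demo, 2.832, -0.2481), 1))
```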

Justin

So the analysis can be redone to see what has changed because of the new rule. Some players leave the pin in and some don’t, and some leave it in sometimes and not others, so it would be particularly interesting if the data included information as to whether the pin was left in on a given putt.

I wasn’t taught Bayesian methods in school. I have switched to them over the last few years, and it went like this: an actual estimate and uncertainty seemed more useful than a hypothesis test -> I looked at bootstrapping and CIs -> the definition of a CI wasn’t easy -> the average PI interprets a CI as a posterior probability interval anyway -> the Bayesian interpretation seemed like what was wanted. Then: I almost always needed models with varying intercepts and slopes -> for lme4 to get a good estimate of the CI for parameters it seemed I needed to do bootstrapping or MCMC -> this took a long time -> running the model in brms seemed actually faster than running it quickly in lme4 + bootstrapping. Plus brms seemed way more flexible and had tools to present results. Then: I realized I have some prior information -> I can use this in Bayesian models. And in general, running models with varying slopes and intercepts just seems easier.

So far, it just seemed like a much better way to run the same frequentist models and report results in a more useful way and avoid NHST.

Then I saw this golf model in a YouTube video, and it really made me realize how much more one could do with some thinking + expertise… now to actually try to implement this.

σ_angle is estimated at 0.02, which corresponds to σ_degrees = 1.0. According to the fitted model, there is a standard deviation of 1.0 degree in the angles of putts taken by pro golfers. The estimate of σ_angle has decreased compared to the earlier model that only had angular errors [from 0.03, i.e., 1.53 degrees]. This makes sense: now that distance errors have been included in the model, there is no need to explain so many of the missed shots using errors in angle.
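The radians-to-degrees conversion quoted above is just degrees = radians · 180/π; the radian values 0.0175 and 0.0267 below are back-computed from the reported degree figures, so their precision is an assumption:

```python
import math

# sigma_angle is in radians in the model; the write-up reports degrees.
print(round(math.degrees(0.0175), 2))   # 1.0  degree  (reported as sigma ~ 0.02)
print(round(math.degrees(0.0267), 2))   # 1.53 degrees (the earlier angle-only fit's ~0.03)
```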

They need to measure sigma_angle some other way to check the model.

At each step, you take a stylized fact such as “even if you have a very specific constant-sized angular error, the ability to sink the putt changes with distance, because the hole takes up a smaller and smaller angle at larger distances, with the angle being proportional to 1/distance” and then incorporate this knowledge into the model. Later you incorporate things like “binomial errors assume the variance is related directly to sample size, but the variance is actually related more to distance traveled”, etc.
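That first stylized fact is the angle-only geometry model: the putt drops if the angular error is within ±arcsin((R − r)/x), so with Gaussian angular errors P(make) = 2Φ(threshold/σ) − 1. A sketch, assuming the standard hole and ball diameters (4.25 in and 1.68 in):

```python
import math

HOLE_R = (4.25 / 2) / 12   # hole radius in feet
BALL_R = (1.68 / 2) / 12   # ball radius in feet

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_make(distance_ft, sigma_angle):
    """Angle-only model: the putt drops if the Gaussian angular error
    (sd = sigma_angle radians) is within +/- arcsin((R - r)/x), a
    threshold roughly proportional to 1/distance. Valid for distances
    beyond R - r."""
    threshold = math.asin((HOLE_R - BALL_R) / distance_ft)
    return 2.0 * norm_cdf(threshold / sigma_angle) - 1.0

# The make probability falls with distance purely from geometry:
for d in (2, 5, 10, 20):
    print(d, round(prob_make(d, 0.0267), 3))
```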

People often seem to think that a Bayesian analysis is just about taking a frequentist analysis and adding a prior and interpreting the results as posterior probability… but it’s not, not even close.

I think the new data are smoother for two reasons. First, the sample size for the new data is much higher, so you don’t see the pure noise variation that you see in the older, smaller data set. Second, the new data are gathered in a more organized way, I think they’re all the data from some large set of tournaments. I’m not quite sure where the older dataset came from, and there could be some problems with measurement or selection.

I probably missed something in my reading, but why does the new data (in red) look so much smoother than the old data (blue)? The old data looks more ‘real’ with more variability. Is the new data actual data from golf putts?
