Matt Selove writes:
My question is about Bayesian analysis of the linear regression model. It seems to me that in some cases this approach throws out useful information.
As an example, imagine you have two basketball players randomly drawn from the pool of NBA players (which provides the prior). You’d like to estimate how many free throws each can make out of 100. You have two pieces of information:
– Session 1: Each player shoots 100 shots, and you learn player A’s total minus player B’s total
– Session 2: Player A does another session where he shoots 100 shots alone, and you learn his total
If we take the regression approach:
y_i = number of shots made
beta_A = player A’s expected number out of 100
beta_B = player B’s expected number out of 100
x_i = vector of zeros and ones showing which player took shots
In the above example, our data points are:
y_1 (first number reported) = beta_A * 1 + beta_B * (-1) + epsilon_1
y_2 (second number reported) = beta_A * 1 + beta_B * 0 + epsilon_2
My understanding, based on chapter 14 of your Bayesian Data Analysis textbook, is that the marginal posterior distribution for player A would update using a formula similar to the classical regression estimation formula, which would only incorporate information from the second session, where only player A shoots. But it seems to me the first session, providing the difference between the two scores (did player A beat player B or lose to player B), is also useful information that should be incorporated into beliefs about player A.
Am I missing something? Would the marginal posterior distribution for player A incorporate information from both sessions or just the second session?
Indeed, both data points are relevant for inference about player A. The easiest way to see this is to just write the model directly in Stan. It’s also clear if you consider extreme cases, for example if y1=100, you’ve obviously learned something from that data point.
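To make that concrete, here is a minimal sketch of what such a Stan model could look like. The Beta(4,4) prior is an assumption for illustration (a real analysis would build the prior from the pool of NBA players), and since Stan cannot sample discrete latent variables, player A's unobserved session-1 count n_A is marginalized out:

```stan
data {
  int y1;                          // session 1: A's total minus B's total
  int<lower=0, upper=100> y2;      // session 2: A's made shots
}
parameters {
  real<lower=0, upper=1> theta_A;  // A's free-throw probability
  real<lower=0, upper=1> theta_B;  // B's free-throw probability
}
model {
  vector[101] lp;
  theta_A ~ beta(4, 4);            // illustrative prior, not the real NBA-pool prior
  theta_B ~ beta(4, 4);
  // Session 1: marginalize over A's unobserved count n_A, with n_B = n_A - y1
  for (n_A in 0:100) {
    int n_B = n_A - y1;
    if (n_B >= 0 && n_B <= 100)
      lp[n_A + 1] = binomial_lpmf(n_A | 100, theta_A)
                    + binomial_lpmf(n_B | 100, theta_B);
    else
      lp[n_A + 1] = negative_infinity();
  }
  target += log_sum_exp(lp);
  // Session 2: A alone
  y2 ~ binomial(100, theta_A);
}
```

With both likelihood contributions in the model block, the posterior for theta_A is updated by y1 as well as y2, which is the point of the reply above.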
This was posted on the econjobrumors blog too. It seems that since y1 is a difference and y2 is the actual number made by A in a single session, you should just express y1 in terms of number of shots, so that they are two observations measuring the same thing.
For example, if y1=100, then it must be the case that A made 100 in the first session. If y1 != 100, then we have a range for the number of shots A made in session 1. If y1 = 50, then A must have made anywhere between 50 and 100 shots. Thus, y1 is just another observation for number of shots A made, but it is measured with a range rather than a number.
There are different ways of analyzing the data, but to answer Selove’s original question, the Bayesian analysis does not throw away information. In particular, y1 is not just a range; you want to model the data directly.
Seems like a pretty standard Bayesian analysis.
What would the frequentist answer to this problem be? Would they throw away information from y_1 because using it seems to require priors on the betas?
There is (should be) nothing complicated going on here other than figuring out how to run Stan and comparing the posteriors to the prior.
And since most frequentist answers involve working with the likelihood, which is conveniently available as posterior/prior (the multivariate likelihood is invariant to the choice of prior), that should settle the frequentist question too (up to the multivariate likelihood).
A sketch that avoids Stan is given here:
reps=10^6
beta_A=rbeta(reps,4,4) * 100
beta_B=rbeta(reps,4,4) * 100
y_1=rbinom(reps,100,beta_A/100) - rbinom(reps,100,beta_B/100)
y_2=rbinom(reps,100,beta_A/100)
joint=cbind(beta_A,beta_B,y_1,y_2)
# For 10 on y_1 and 50 on y_2
post.1=joint[joint[,3] == 10,]
post.2=joint[joint[,4] == 50,]
post.12=post.1[post.1[,4] == 50,]
post.21=post.2[post.2[,3] == 10,]
# Graph how the distribution of beta_A changes from prior (joint) to post.i
A bit more background on two-stage simulation for doing Bayes on really simple examples is here: http://stats.stackexchange.com/questions/41794/bayesian-updating-for-a-discrete-rating-value/43048#43048