Here’s question 2 of our exam:

2. A multiple-choice test item has four options. Assume that a student taking this question either knows the answer or does a pure guess. A random sample of 100 students take the item. 60% get it correct. Give an estimate and 95% confidence interval for the percentage in the population who know the answer.

And the solution to question 1:

1. A randomized experiment is performed within a survey. 1000 people are contacted. Half the people contacted are promised a $5 incentive to participate, and half are not promised an incentive. The result is a 50% response rate among the treated group and 40% response rate among the control group.

(a) Give an estimate and standard error of the average treatment effect.

(b) Give code to fit a logistic regression of response on the treatment indicator. Give the complete code, including assigning the data, setting up the variables, etc. It is not enough to simply give the one line of code for running the logistic regression.

(a) The estimate is 0.5 – 0.4 = 0.1, and the standard error is sqrt(0.5^2/500 + 0.5^2/500) = 0.03.

(b) Here’s some code:

n <- 1000 z <- rep(c(1,0), c(n/2, n/2)) y <- rep(c(1,0,1,0), c(0.5, 0.5, 0.4, 0.6)*n/2) library("rstanarm") fit <- stan_glm(y ~ z, family=binomial(link="logit")) print(fit)

The estimated logistic regression coefficient is approximately 0.4 with a standard error of approximately 0.12, which is consistent with the answer from (a) after applying the divide-by-4 rule.

**Common mistakes**

Most of the students in the class got part (a) correct, but some were sloppy and did some sort of sqrt(p*(1-p)/n) calculation with n=1000, not recognizing that the estimate was a *difference* between proportions.

For (b), I was expecting some syntax errors---it was an in-class exam without computers, so I was just asking them to write some code as best they could---but a common mistake was that many, actually most, students did it by simulating fake data with probabilities 0.5 and 0.4, rather than by entering the actual data as in the code above. Perhaps I needed to write the question better to make it more clear what was required.

P(Correct) = P(Correct | Guess)P(Guess) + P(Correct | Know)P(Know)

P(Correct) = q

P(Correct | Guess) = 0.25

P(Correct | Know) = 1

P(Know) = p

P(Guess) = 1 – p

q = 0.25(1 – p) + p = 0.25 + 0.75p

p = (q – 0.25) / 0.75

q.hat = 0.6

p.hat = (q.hat – 0.25) / 0.75 = 0.467

n = 100

Var[p.hat] = Var[(q.hat – 0.25) / 0.75] = 1.778Var[q.hat] = 1.778(q(1 – q) / n)

SE.hat[p.hat] = sqrt(1.778(q.hat(1 – q.hat) / n)) = sqrt(1.778(0.6 * 0.4 / 100)) = 0.065

95% CI = p.hat +/- 1.96 * SE.hat[p.hat] = 0.467 +/- 1.96 * 0.065 = (0.340, 0.594)

Answer:

47% (34%, 59%)

The alternative approach is to obtain the CI for q, and then transform the endpoints to obtain the CI for p, but by linearity this produces the same CI as directly obtaining the CI for p.

This procedure can generate “ugly” confidence intervals including impossible values. For example if the correct responses were 25% instead of 60% the confidence interval would be centered around 0. And if we make the sample size small enough the confidence interval covers completely the [0,1] range.

A more concerning problem: the procedure is seriously wrong if the percentage in the population who know the answer is close to 100%. In that case all the responses may be correct by chance (because all the students selected know the answer, or those who don’t get lucky) and the confidence interval produced is pointwise and doesn’t include the true value.

I did a simulation (and I may have done something wrong!) but what I got is that coverage is reasonable for true values below 0.75. There are some patterns (low-freq and high-freq) but the coverage remains more often than not between 94% and 95% (and always above 92%). For true values above 0.75 the oscillations in coverage become larger, going below 90% four times between 0.85 and 0.95, down to 80% coverage for p=0.96 and again for p>0.98 (with coverage approaching 0% as we get closer to 100%).

Hi Carlos, you are questioning the normal approximation of a restricted variable p \in (0,1). The requirement for q+2sd< 1 result in P(observed correct)=q <0.96 and therefore P(know)=p <0.95 for this sample size. Anything higher than that leads to mass loss on the right end. This explains your simulation results.

If it is a concern, and if we still want to solve it by hand rather than referring to the Beta distribution table, we can use log-normal approximation. Let (1-q)~z and z~log_normal(mu, sigma), using mu+sigma^2/2= 1-\hat p and exp(sigma^2-1) exp(2mu +sigma ^2) = \hat p ( 1- \hat p) /100 we can derive mu and sigma (by hand), and the normal interval of log z translates into the log-normal interval of p and q.

Whew! I got the right answer. I was pretty confident I was on the right track, but would not have been surprised at a bug in either thinking or implementation. Maybe I can add a chapter on standard errors for MLE now that I’ve worked out a simple example.

Hey, question 2 is just more of the same if you’re doing this by simulation. At least if I correctly recall what a confidence interval is :-) It also helps that I’ve already coded an IRT model with guessing in working out Stan case studies. I sort of feel like all these questions have a little trick in them like this in that they’re not just turn the crank. You have to think about how the log odds scale matters in that first question. At least these questions should get students thinking hard about the likelihoods involved.

P.S. Thanks for getting the

pretags working in comments again. It helps if you want to post code.Oops. Is that second mistake because you just wanted students to use prebuilt software or is it because you wanted me to do something like a bootstrap estimate from the actual data? It gives you the same draws here to bootstrap from data and to binomial sample from the MLE. So I don’t see the problem.

For question 1, wouldn’t the standard error be sqrt(0.5^2/500 + (0.4) * (0.6)/500)? As you are subtracting a binomial with p=0.4 from a binomial with p=0.5?

Ethan:

It doesn’t matter. For this purpose, 0.25 and 0.24 are the same thing. I went with 0.25 because it’s generally cleaner in considering the design of studies to just use that 0.25 upper bound. If the estimated probability were 0.1 or something, then, yeah, I’d use p*(1-p).

Andrew:

I disagree that ‘It doesn’t matter.’ When you make people double check and doubt the ‘correct’ answer then thats annoying.

Sure it doesn’t matter in practice. For a student’s sanity it might.

Sam:

When I teach this material I make the point that sqrt(p*(1-p)) has a maximum of 0.25 and that it is very close to this maximum for a wide range of values of p. If a student were to use 0.4*0.6 in the above problem, I’d still give full credit. Either answer is ok here.

Would you lose marks if the used glm instead if stan_glm?

No.

For Question 2:

If the model used for estimating the effect size is a logistic regression, wouldn’t it make more sense to frame the effect size as an odds ratio (1.5) or a log odds ratio rather than a difference in percentages (10%)?

It seems like the odds ratio is an effect size measure that’s more likely to generalize, for example if you want to predict the effect of an incentive on a similar survey which would otherwise get, say, a 30% response rate.

In any case, thanks for sharing these questions. It’s always an interesting and slightly-humbling thing to read through.

D’oh, sorry, I meant to reference Question 1 and not Question 2.