Question 3 of our Applied Regression final exam (and solution to question 2)

Here’s question 3 of our exam:

Here is a fitted model from the Bangladesh analysis predicting whether a person with high-arsenic drinking water will switch wells, given the arsenic level in their existing well and the distance to the nearest safe well.

glm(formula = switch ~ dist100 + arsenic, family=binomial(link="logit"))
               coef.est coef.se
(Intercept)        0.00    0.08
dist100           -0.90    0.10
arsenic            0.46    0.04
n = 3020, k = 3

Compare two people who live the same distance from the nearest well but whose arsenic levels differ, with one person having an arsenic level of 0.5 and the other person having a level of 1.0. Approximately how much more likely is this second person to switch wells? Give an approximate estimate, standard error, and 95% interval.

And the solution to question 2:

2. A multiple-choice test item has four options. Assume that a student taking this question either knows the answer or does a pure guess. A random sample of 100 students take the item. 60% get it correct. Give an estimate and 95% confidence interval for the percentage in the population who know the answer.

Let p be the proportion of students in the population who would get the question correct. p has an estimate of 0.6 and a standard error of sqrt(0.5^2/100) = 0.05.

Let theta be the proportion of students in the population who actually know the answer. Based on the description above, we can write:
p = theta + 0.25*(1 – theta) = 0.25 + 0.75*theta,
thus theta = (p – 0.25)/0.75.
This gives us an estimate of theta of (0.6 – 0.25)/0.75 = 0.47 and a standard error of 0.05/0.75 = 0.07, so the 95% confidence interval is [0.47 +/- 2*0.07] = [0.33, 0.61].
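The arithmetic can be checked in a few lines of R. This is only a sketch of the calculation described in the solution; the variable names are made up for illustration, and it uses the same conservative 0.5 in the standard error for p.

```r
# Worked arithmetic for the estimate, standard error, and interval for theta.
n <- 100        # students sampled
p_hat <- 0.6    # observed proportion answering correctly
guess <- 0.25   # chance that a pure guess is correct with four options

se_p <- sqrt(0.5^2 / n)                      # conservative s.e. for p
theta_hat <- (p_hat - guess) / (1 - guess)   # invert p = 0.25 + 0.75*theta
se_theta <- se_p / (1 - guess)               # a linear transformation rescales the s.e.
interval <- theta_hat + c(-2, 2) * se_theta  # estimate +/- 2 standard errors

print(round(c(estimate = theta_hat, se = se_theta), 2))
print(round(interval, 2))
```

Carrying the unrounded values through gives an interval of roughly 0.33 to 0.60; working from the rounded 0.47 and 0.07 moves the upper end to 0.61.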

Common mistakes

Most of the students had no idea what to do here, but some of them figured out how to solve for theta. None of them got the standard error correct. The students who figured out the estimate of 0.47 simply computed a standard error as sqrt(0.47*(1 – 0.47)/100). Kinda frustrating. I’m not really sure how to teach this, although of course I could just assign this particular problem as homework and then maybe students would remember the general point about estimates and standard errors under transformations.
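To see how far apart the two approaches are, here is a rough R comparison. The function names naive_se and scaled_se are hypothetical, and both use the plug-in formula sqrt(p*(1 – p)/n) rather than the conservative 0.5, so the numbers differ slightly from the solution. The second call, with 25% correct, is an illustrative case that is not part of the exam.

```r
# Mistaken approach: apply the binomial formula directly to theta_hat.
naive_se <- function(p_hat, n) {
  theta_hat <- (p_hat - 0.25) / 0.75
  sqrt(theta_hat * (1 - theta_hat) / n)
}

# Binomial s.e. for p_hat, then rescaled through theta = (p - 0.25)/0.75.
scaled_se <- function(p_hat, n) {
  sqrt(p_hat * (1 - p_hat) / n) / 0.75
}

# The exam setting: 60% correct out of 100 students.
print(round(c(naive = naive_se(0.6, 100), scaled = scaled_se(0.6, 100)), 3))

# Illustrative case: 25% correct, so theta_hat = 0. The naive formula
# reports zero uncertainty even though the data are entirely guesses.
print(round(c(naive = naive_se(0.25, 100), scaled = scaled_se(0.25, 100)), 3))
```

The naive formula understates the uncertainty because it treats theta_hat as if it were a directly observed proportion, ignoring the noise added by guessing.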

I’m also thinking this would be a good example to program up in Stan because then all these difficulties are handled automatically.

P.S. There was some question about how you can convince yourself that the above answer is correct. Here’s how you can do it using simulation:

Start by assuming a true value of theta. It shouldn’t matter exactly what value we choose; say 0.40 as this is comfortably within our confidence interval. Then Pr(correct answer) is 0.25 + 0.75*theta = 0.55 in this case. Now it’s easy to simulate 100 students’ responses: y = rbinom(1, 100, 0.55). Do this 1000 times, so you have 1000 simulations from the sampling distribution of y, conditional on the assumed true value of theta.

Now for each of these 1000 simulated y’s, compute p_hat = y/100 and theta_hat = (p_hat – 0.25)/0.75.

And now we’re ready to use these simulations to approximate the sampling distribution of theta_hat | theta. Compute the mean of the 1000 theta_hat’s; this will be approximately 0.40 because theta_hat is an unbiased estimate of theta. Compute the sd of the 1000 theta_hat’s, and you’ll get something close to 0.07, because that’s the standard error we worked out above.
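The simulation steps above can be sketched in R as follows. The seed and variable names are arbitrary choices for illustration; a single vectorized rbinom call draws all 1000 values of y at once, which is equivalent to repeating rbinom(1, 100, 0.55) 1000 times.

```r
set.seed(123)               # arbitrary seed, only for reproducibility
n <- 100                    # students per simulated sample
n_sims <- 1000              # number of simulated samples
theta <- 0.40               # assumed true proportion who know the answer
p <- 0.25 + 0.75 * theta    # implied Pr(correct answer) = 0.55

y <- rbinom(n_sims, n, p)            # simulated counts of correct answers
p_hat <- y / n                       # sample proportion correct
theta_hat <- (p_hat - 0.25) / 0.75   # estimate of theta from each sample

print(mean(theta_hat))   # should land near 0.40
print(sd(theta_hat))     # should land near 0.07
```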

32 thoughts on “Question 3 of our Applied Regression final exam (and solution to question 2)”

  1. Thanks for posting exam questions, that’s always useful.
    But this one is either trivial or hard, depending on what you mean by “how much more likely”. If you want an absolute measure then the result depends utterly on the distance to the well. Maybe that’s what you are testing?
    If you want a relative measure and accept relative odds as an approximation, then the OR is exp(0.23), with confidence bounds exp(0.19) to exp(0.27); the approximate SE is the width of the CI divided by 4.

  2. Sorry to be a stickler on solution to no.2, but if you are asking for a confidence interval you should go through the moves and multiply the s.e. by 1.96 etc.

    Ok now for my perhaps ignorant question: why use 0.5 for the estimate’s s.e. calculation? Would you have chosen the same if 85% of students got it correct? Am I missing something obvious?

    • Jean:

      1. Good catch. I added the 95% conf interval to the above solution.

      2. I used 0.5 for the s.e. because that’s standard practice when the probability is near 50%. Had it been 85%, I would’ve used sqrt(p*(1-p)).

      • Strictly speaking using 0.5 for the s.e. calculation gives a “better” confidence interval, at least if valid values of theta are between 0 and 1.

        When the true value of the parameter is not around 0.35 the coverage gets better than 95%: for theta=0 or theta>0.7 it is above 97%, and it becomes essentially 100% for theta > 0.95. But this is “acceptable” in the sense that the coverage is at least 95% for all values of theta.

        The proposed alternative doesn’t even guarantee 90% coverage for theta between 0.85 and 0.95, and coverage collapses as theta approaches 1. This seems an acceptable confidence interval calculation only if the domain of theta is restricted and doesn’t extend to the whole [0, 1] interval.

        • Carlos:

          Sure, if I really cared about the problem I’d just fit the model in Stan. But the point of this aspect of the exam is to check that students understand the basic principles of estimation and uncertainty.

        • I don’t really care about the problem either, I care (a little) about the definition of “95% confidence interval”. If this is not one of the basic principles that you want to test you could ask the students to provide just an estimate and standard error.

    • I’ve got the same question or misunderstanding.

      Consider a modified version of the question. Assume (1) there are a million possible answers and (2) of 100 test takers, 99 get the right answer. I estimate that 99% of the population know the answer and the confidence interval is more or less meaningless (given the discrete nature of the test takers)—but it’s very small.

      Bob

    • Evan:

      I think the quickest simplest way is to search for “estimating a population proportion” in your favorite search engine. You’ll likely get enough there to bridge the gap.

  3. One (easier?) way to teach the students may be to direct them to calculate the CI for p first and then translate that CI into the CI for theta.

  4. Why not use 0.5 for the s.e. on theta also? Or the students’ answer of 0.47 for that matter? Why does the 60% value warrant widening the confidence interval?

    • Ea:

      The binomial model applies to p. Theta is a linear transformation of p. It’s theta = (p – 0.25)/0.75. The easiest way to solve the problem, as in the above solution, is to first use the binomial model to get inference for p, then apply the linear transformation to obtain inference for theta.

      • Your students’ responses seem just as correct. The binomial model applies to theta. The agent is either type 1 or type 2 and p is just a linear transformation of theta. A binomial model certainly applies to p too, but I would describe it with a mixture model.

        • Ea:

          Nope, you’re wrong on this one.

          If you need convincing, you could fit the model in Stan. Or you could do a simulation study using fake data, simulating the process 1000 times and each time computing the estimate of theta, then looking at the sd of those 1000 estimates. Or you could consider how your (mistaken) se formula works in the edge case when 100% of students get the question correct.

        • I tried to do such a simulation. But I stumbled on the problem of “What is the process to be simulated?” In particular, do we (1) assume a fixed number of students who know the answer and a variable number of students who guess correctly—leading to variation in the number of correct answers or
          (2) assume a fixed number of students (60) who get the answer correct and therefore a varying number of students who guess correctly and a correspondingly varying number of students who know the answer?

          If we choose (1) what is the proper number of students to assume know the answer?

          Bob

        • Bob:

          Try this. Start by assuming a true value of theta. It shouldn’t matter exactly what value we choose; say 0.40 as this is comfortably within our confidence interval. Then Pr(correct answer) is 0.25 + 0.75*theta = 0.55 in this case. Now it’s easy to simulate 100 students’ responses: y = rbinom(1, 100, 0.55). Do this 1000 times, so you have 1000 simulations from the sampling distribution of y, conditional on the assumed true value of theta.

          Now for each of these 1000 simulated y’s, compute p_hat = y/100 and theta_hat = (p_hat – 0.25)/0.75.

          And now we’re ready to use these simulations to approximate the sampling distribution of theta_hat | theta. Compute the mean of the 1000 theta_hat’s; this will be approximately 0.40 because theta_hat is an unbiased estimate of theta. Compute the sd of the 1000 theta_hat’s, and you’ll get something close to 0.07, because that’s the standard error we worked out above.

        • Well, after reflection I am even more confused.

          Your formulation has number of right answers distributed as binomial(0.55, 100). But it appears to me that, assuming theta = 0.4, the number of right answers is distributed as 40 + X, where X is distributed as binomial(0.25, 60).

          Unless I have made a mistake, the variance of the first distribution is 24.75 while the variance of the second is 11.25.

          I haven’t tried to do the relevant simulations, but it seems to me highly likely that the 40 + X model will give different answers than will the binomial(0.55, 100) model.

          Bob

        • Bob:

          I think my answer is correct but I didn’t describe it so clearly. Theta is the proportion of students in the population who know the answer. So if theta=0.4, that doesn’t mean that 40 of the 100 students in the class know the answer.

          To put it another way, you’re getting inference for the finite-sample quantity: what proportion of students in the class know the answer? In my problem, I’m asking for inference for the population quantity: what proportion of students in the population know the answer? The finite-sample inference will be more precise than the population inference, which makes sense: we know more about these 100 students than we do about the general population of which we are considering them as a sample.

        • > Your formulation has number of right answers distributed as binomial(0.55, 100). But it appears to me that, assuming theta = 0.4, the number of right answers is distributed as 40 + X, where X is distributed as binomial(0.25, 60).

          The number of right answers is distributed as K+X where K (the number of students who know the correct answer) is distributed as binomial(theta,100) and X (the number of correct guesses among the remaining students) is distributed as binomial(0.25,100-K).

        • Andrew, thanks for your patience and explanation. Population theta (duh!). Sigh.

          This blog is fun and useful. I only have one problem with it. When I go to my mailbox I find that I fear that there will be a large tuition bill from Columbia.

          Bob

  5. I made another, interesting, error, probably on a cultural basis. I misread :

    > A multiple-choice test item has four options

    as meaning “four *non-mutually-exclusive* options”. So the set of possible answers had cardinality 16, only one being actually correct. (The rest of my reasoning was identical to yours.)

    Is that a frequent misunderstanding?

    • This would likely be cultural. Multiple correct options on a multiple choice test in the US is almost always handled as an option.
      e.g., What is the answer?

      A: Red
      B: Blue
      C: Green
      D: All of the above
      E: A and B

      • Hmmm… I was thinking of more subtle questions. For example, given a problem P and a set of possible solution methods, one could ask, for example

        Q: What are the admissible methods for solving P, and why ?

        A: S1 because X
        B: S2 because Y
        C: S1 and S2, but S1 can be preferred if Z(P)
        D: S1 and S2, but S2 should be preferred because T(P)

        Example : a set of possible solutions to a given statistical problem, where the subject matter *may* involve a preference in the bias/precision tradeoff. Depending on P, some combinations of the answers may or may not be possible.

        This type of question makes it possible to get more information about the knowledge and reasoning abilities of the respondent than a simple alternative…

      • I often ask multiple choice questions with the instruction “select all that apply” to signal that the correct answer may consist of multiple items.

        But, this doesn’t work with popular auto-grading systems. Multiple choice exams are popular because they can be graded by machines, and the machines usually require that there be a single correct answer.

  6. Shouldn’t theta be >= (p-0.25)/0.75 to acknowledge the possibility that some may know the answer but get it incorrect (due to time constraints, transcription errors, etc.)

  7. Leaving some code here for those seeking intuition for Question 2.

    # Frequentist

    n = 100 # number of students
    theta = 0.47 # assumed population proportion of students who know answer
    z = rbinom(500, n, theta) # 500 samples of # of students who know answer
    y = rbinom(500, n-z, 0.25) # total number of correct answers among those who don’t know answer
    p = (z+y)/n # sample proportion of correct answers
    hist(p)
    mean(p)
    quantile(p, probs=c(0.025, 0.975))
    # when I ran this, I got 0.60 mean, (0.51, 0.69) 95% interval as expected

    # Bayesian

    theta = runif(1000) # “flat prior”: 1000 draws of proportion of students who know answer
    z = sapply(theta, function (q) rbinom(1, n, q)) # number of students who know answer
    y = sapply(z, function (k) rbinom(1, n-k, 0.25)) # number of correct answers among those who don’t know
    p = (z+y)/n # proportion of correct answers

    # let’s look at p versus theta
    plot(p ~ theta, las=1, cex=0.7, cex.axis=0.7)
    # get regression line
    ptheta.lm = lm(p~theta)
    abline(ptheta.lm, col="red")
    text(0.3, 0.9, paste("p = ", round(coef(ptheta.lm)[1],2), "+ ", round(coef(ptheta.lm)[2],2), "* theta"), col="red", cex=0.7)
    abline(h=0.6) # observed p was 0.6
    abline(h=c(0.51, 0.69), lty=2) # 95% interval around 0.6 as found above

    # restrict attention to samples with p around 0.6 as observed
    theta2 = theta[round(p,1)==0.6]
    hist(theta2)
    mean(theta2)
    quantile(theta2, probs=c(0.025, 0.975))
    # when I ran this, I got mean 0.47, 95% interval (0.34, 0.62), close to expected
