Here’s question 8 of our exam:

8. Out of a random sample of 50 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office.

And the solution to question 7:

7. You conduct an experiment in which some people get a special get-out-the-vote message and others do not. Then you follow up with a sample, after the election, to see if they voted. If you follow up with 500 people, how large an effect would you be able to detect so that, if the result had the expected outcome, the observed difference would be statistically significant?

Assume 250 got the treatment and 250 got the control. Then the standard error of the estimated treatment effect is sqrt(0.5^2/250 + 0.5^2/250) = 0.045. An estimate is statistically significant if it is at least 2 standard errors from 0, so the answer to the question is 0.09, an effect of 9 percentage points.
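The arithmetic in this solution can be checked with a few lines of Python. This is only a sketch: the function name `se_difference` is made up for illustration, and it assumes the 250/250 split and the worst-case rate p = 0.5 used above.

```python
from math import sqrt

def se_difference(n_treat, n_control, p=0.5):
    """Standard error of a difference between two sample proportions,
    assuming (for illustration) that both groups share the same rate p."""
    return sqrt(p * (1 - p) / n_treat + p * (1 - p) / n_control)

se = se_difference(250, 250)
threshold = 2 * se  # the "2 standard errors from 0" rule used in the solution

print(f"standard error: {se:.3f}")
print(f"smallest significant difference: {threshold:.3f}")
```

Without intermediate rounding the threshold comes out just under 0.09, which agrees with the 9 percentage points quoted above.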

**Common mistakes**

Most of the students couldn’t handle this one. One problem was forgivable: I didn’t actually say that half the people got the treatment and half got the control. I guess I should’ve made that clear in the statement of the problem.

But that wasn’t the only issue. Many of the students weren’t clear on how to get started on this one. One key point is that you can plug p=0.5 into the sqrt(p*(1-p)/n) formula.

I’ve enjoyed following along with these questions! Another issue with this question is that you ask “how large,” which seems to imply “What’s the largest effect size that could be detected as statistically significant?” It might have helped to say “how **small** an effect would you be able to detect…”.

Completely agree. I don’t think this question is up to par to pose on an exam. It’s mostly testing whether the student can untangle a hidden question rather than use the statistical skills needed.

The rule of 3 gives as a CI (0, 3 / 50), or (0%, 6%).
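For anyone wondering where the 3 comes from: with zero events in n trials, the exact one-sided 95% bound solves (1 - p)^n = 0.05, and -ln(0.05) is close to 3. A small sketch comparing the two (the function names are mine, not from any library):

```python
from math import log

def rule_of_three_upper(n):
    """Rule-of-3 upper bound for a proportion after 0 events in n trials."""
    return 3 / n

def exact_upper(n, alpha=0.05):
    """Largest p for which 0 events in n trials still has probability alpha."""
    return 1 - alpha ** (1 / n)

print("-ln(0.05) =", round(-log(0.05), 4))
for n in (10, 50, 500):
    print(n, round(rule_of_three_upper(n), 4), round(exact_upper(n), 4))
```

The rule of 3 is slightly conservative, and the gap shrinks as n grows.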

Another rule of thumb that I’ve seen: add 1 good and 1 bad -> 1/52 gives a CI of (0%, 5.7%)

p = (0+1)/(50+2) = 1.92% is the estimate obtained using Laplace’s rule of succession (which corresponds to a Bayesian analysis with a beta(1,1) prior, equivalent to having previously sampled two people, one of whom had held office). The standard error of this estimate is sqrt(p*(1-p)/(50+3)) = 1.89%.
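As a quick check, the n+3 in that standard error is what you get from the standard deviation of the Beta(1, 51) posterior. A sketch, with helper names chosen for illustration only:

```python
from math import sqrt

def laplace_estimate(successes, n):
    """Rule-of-succession estimate and the standard error quoted above."""
    p = (successes + 1) / (n + 2)
    se = sqrt(p * (1 - p) / (n + 3))
    return p, se

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

p, se = laplace_estimate(0, 50)
posterior_sd = beta_sd(1, 51)  # beta(1,1) prior plus 0 successes, 50 failures

print(f"estimate: {p:.4f}, se: {se:.4f}, posterior sd: {posterior_sd:.4f}")
```

The two numbers agree, so the quoted standard error is just the posterior standard deviation under the uniform prior.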

Indeed, a conjugate Bayesian model here results in a posterior distribution for \theta of Beta(1,51). From this, I get a 95% credible interval of [0%,7%] after rounding, with a median of 1.3%. I don’t know about the confidence coverage, but I suspect this works just fine, although it is a bit larger than the ‘rule of 3’ adduced above.
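These numbers are easy to reproduce without special libraries, because a Beta(1, b) distribution has the closed-form CDF 1 - (1 - x)^b. The sketch below assumes an equal-tailed interval (the 7% upper end matches the 97.5% quantile, but that reading is my assumption), and the function name is hypothetical.

```python
def beta_1_b_quantile(q, b):
    """Quantile of Beta(1, b), obtained by inverting the CDF 1 - (1 - x)**b."""
    return 1 - (1 - q) ** (1 / b)

b_post = 1 + 50  # uniform Beta(1, 1) prior updated with 0 successes, 50 failures

lower = beta_1_b_quantile(0.025, b_post)
median = beta_1_b_quantile(0.5, b_post)
upper = beta_1_b_quantile(0.975, b_post)

print(f"95% interval: [{lower:.4f}, {upper:.4f}], median: {median:.4f}")
```

The lower end is a small fraction of a percent, so it rounds to 0%, and the upper end is just under 7%.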

If I haven’t made a mistake, this is what the coverage looks like for true values between 0.001 and 0.999 (for each value there are three points, corresponding to simulations with 10000 events each, to give an idea of the variability).

https://imgur.com/a/ArMUX8e
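For readers who want to explore coverage themselves, here is a rough sketch. It is not the code behind the linked plot: it assumes the interval being checked is the rule-of-succession estimate plus or minus 2 standard errors, clipped to [0, 1], and it computes coverage exactly by summing binomial probabilities rather than by simulation. All function names are made up.

```python
from math import comb, sqrt

def laplace_interval(x, n, z=2.0):
    """Assumed interval: (x+1)/(n+2) +/- z * se, clipped to [0, 1]."""
    p = (x + 1) / (n + 2)
    se = sqrt(p * (1 - p) / (n + 3))
    return max(0.0, p - z * se), min(1.0, p + z * se)

def exact_coverage(theta, n=50, z=2.0):
    """Probability, under Binomial(n, theta), that the interval contains theta."""
    coverage = 0.0
    for x in range(n + 1):
        lo, hi = laplace_interval(x, n, z)
        if lo <= theta <= hi:
            coverage += comb(n, x) * theta**x * (1 - theta) ** (n - x)
    return coverage

for theta in (0.001, 0.01, 0.1, 0.5):
    print(theta, round(exact_coverage(theta), 3))
```

Because the count is discrete, coverage jumps around as theta changes, so it is worth evaluating on a fine grid before drawing conclusions.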

In general, seems about right – overly conservative if ‘true’ theta is very close to either 0 or 1. Thinking more about this, if we are bothering to specify a conjugate Beta prior, let’s just go full Bayesian and use a more informed prior :) I wouldn’t put the prior expectation any higher than 2.5% so a Beta(1,39). Plugging that in, my 95% credible interval becomes [0,4%].

Hi Andrew,

I love these questions!

Would it be possible to publish the complete exam and solutions all in one place once you are finished with the blog posts? This would make it more accessible for future references.

Best,

Andrew

Is the question “how large an effect would you be able to detect so that, if the result had the expected outcome, the observed difference would be statistically significant?” somehow different from “what observed difference would be statistically significant?”

Maybe some students didn’t know how to start because they didn’t understand what was being asked.

Did anyone propose a one-tailed test?

Yeah I think that “how large an effect” is confusing (and I also had decided I would say that I assumed half the sample got the treatment and not to think about how this sampling was done). I also wondered about the one tailed issue. What bothers me more is not knowing the baseline voting rate, since I think that actually matters. But maybe it’s a fair assumption that this is not a place with extreme voting rates.

The baseline voting rate does matter in that the variance will be lower for voting rates near 0 or 1 (because the standard error of a binomial proportion is sqrt(p * (1 - p) / n)).

The variance estimator used by Andrew is conservative in that it assumes the worst case (for variance) voting rate 0.5.
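To see how much the baseline matters, here is a sketch that recomputes the 2-standard-error threshold for several baseline rates. It assumes 250 people per group and, for simplicity, the same rate in both groups; the function name is illustrative.

```python
from math import sqrt

def significance_threshold(baseline, n_per_group=250):
    """Two standard errors of a difference in proportions, assuming both
    groups have the same baseline rate (a simplifying assumption)."""
    se = sqrt(2 * baseline * (1 - baseline) / n_per_group)
    return 2 * se

for baseline in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(baseline, round(significance_threshold(baseline), 3))
```

The threshold is largest at a baseline of 0.5 and drops to a bit over 5 percentage points at 0.1 or 0.9, so the 9-point answer is a worst case.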

Why is the question “how large an effect would you be able to detect?”? Isn’t detecting larger effects easier, so that there’s no upper limit? I feel like it should ask “how small of an effect would you be able to detect?”.

8. The real answer to this question is going to depend very strongly on your prior because you have so little data.

One approach here is to get a confidence interval by inverting the hypothesis test. This would give you an interval from 0 to 0.00103. (And yeah, this approach has issues as noted on this blog earlier in https://statmodeling.stat.columbia.edu/2014/12/11/fallacy-placing-confidence-confidence-intervals/).

Do you mean 0 to 0.058? I think that what you gave is the 5% confidence interval.

Oops! You are totally correct.
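The mix-up is easy to reproduce. With zero events in 50 trials, the probability of the data is (1 - p)^50, and the bound depends on whether you set that probability to 0.05 or to 0.95. A sketch (the function name is mine):

```python
def zero_event_bound(n, prob_zero):
    """Proportion p at which P(0 events in n trials) = (1 - p)**n equals prob_zero."""
    return 1 - prob_zero ** (1 / n)

upper_95 = zero_event_bound(50, 0.05)  # zero events has probability 5% at this p
mixed_up = zero_event_bound(50, 0.95)  # zero events has probability 95% at this p

print(f"95% upper bound: {upper_95:.4f}")
print(f"bound from swapping the levels: {mixed_up:.5f}")
```

The first value is the 0.058 suggested in the reply; the second reproduces the 0.00103 from the original comment.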

Andrew – I like this question as is. It is very real-world. Anyone working in data science or analytics will need it to size an “A/B test”. So what if you didn’t say 50:50 split? The student ought to be able to make the assumption and keep moving along.

I don’t particularly mind the problem as stated. But it is not at all real world. In the real world if you don’t know the size of the control vs treatment you wouldn’t get past GO.

By real world, I mean the analyst will not be supplied all the info/data needed to solve the problem. The analyst then needs to either find out the necessary data from other people, or make a reasonable assumption and keep going. Don’t be the analyst who throws up his/her hands and does nothing because he/she doesn’t have all the data.

Why even ask questions about power? What we care about is how precisely we can measure the effect. Ask them to choose a beta prior over p that expresses their uncertainty, then write code to generate 100 prior predictive data sets, and then calculate the average length of the posterior 95% interval…
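A minimal sketch of what that exercise might look like for the get-out-the-vote setup. Everything specific here is an assumption on my part: independent Beta(5, 5) priors on the turnout rate in each group, 250 people per group, and posterior intervals for the difference approximated from 2000 Monte Carlo draws.

```python
import random

def binomial_draw(rng, n, p):
    """Number of successes in n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))

def interval_length(rng, y_t, n_t, y_c, n_c, a, b, draws=2000):
    """Approximate length of the 95% posterior interval for p_t - p_c,
    using draws from the two conjugate Beta posteriors."""
    diffs = sorted(
        rng.betavariate(a + y_t, b + n_t - y_t)
        - rng.betavariate(a + y_c, b + n_c - y_c)
        for _ in range(draws)
    )
    return diffs[int(0.975 * draws) - 1] - diffs[int(0.025 * draws)]

def average_length(n_t=250, n_c=250, a=5.0, b=5.0, n_sims=100, seed=0):
    """Average posterior interval length over prior predictive data sets."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_sims):
        p_t = rng.betavariate(a, b)  # draw true rates from the prior
        p_c = rng.betavariate(a, b)
        y_t = binomial_draw(rng, n_t, p_t)  # simulate data given those rates
        y_c = binomial_draw(rng, n_c, p_c)
        lengths.append(interval_length(rng, y_t, n_t, y_c, n_c, a, b))
    return sum(lengths) / len(lengths)

print(f"average 95% interval length: {average_length():.3f}")
```

A more concentrated prior, or one centered away from 0.5, would give shorter intervals, which is the point of asking students to choose the prior themselves.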

It was not completely clear to me whether (7) was a question about power, but looking at the answer, it seems it was not.