Question 9 of our Applied Regression final exam (and solution to question 8)

Here’s question 9 of our exam:

9. We downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We created a new variable, age10 = age/10. We then fit a regression:

lm(formula = weight ~ age10)
            coef.est coef.se
(Intercept)    161.0     7.3
age10            2.6     1.6
  n = 2009, k = 2
  residual sd = 119.7, R-Squared = 0.00

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis). Label the axes appropriately, draw the fitted regression line, and make a scatterplot of a bunch of points consistent with the information given and with ages ranging roughly uniformly between 18 and 90.
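
For anyone who wants to check a sketch by simulation, here is a minimal Python example (not part of the exam; the intercept, slope, and residual sd are taken straight from the output above, and the ages are drawn uniformly as the question specifies):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    n = 2009
    age = rng.uniform(18, 90, size=n)                # ages roughly uniform on [18, 90]
    weight = 161.0 + 2.6 * (age / 10) + rng.normal(0, 119.7, size=n)

    plt.scatter(age, weight, s=4, alpha=0.3)
    grid = np.array([18.0, 90.0])
    plt.plot(grid, 161.0 + 2.6 * grid / 10, "r-")    # fitted regression line
    plt.xlabel("Age (years)")
    plt.ylabel("Weight (pounds)")
    plt.show()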

And the solution to question 8:

8. Out of a random sample of 50 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office.

This is a job for the Agresti-Coull interval: y* = y + 2 = 2, n* = n + 4 = 54, p* = y*/n* = 2/54 = 0.037, with standard error sqrt(p*(1-p*)/n*) = sqrt((2/54)*(52/54)/54) = 0.026. The interval is [p* +/- 2se] = [-0.014, 0.088], but the probability can’t be negative, so [0, 0.088], or simply [0, 0.09].
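
The same arithmetic as a quick check in Python:

    import math

    y, n = 0, 50
    y_star, n_star = y + 2, n + 4                      # Agresti-Coull: add 2 successes, 2 failures
    p_star = y_star / n_star                           # 2/54 = 0.037
    se = math.sqrt(p_star * (1 - p_star) / n_star)     # 0.026
    print(max(0.0, p_star - 2 * se), p_star + 2 * se)  # [0, 0.088] after truncating at zero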

Common mistakes

Most of the students remembered the Agresti-Coull interval, but some made the mistake of giving confidence intervals that excluded zero (which can’t be right, given that the data are 0/50) or that included negative values.

47 thoughts on “Question 9 of our Applied Regression final exam (and solution to question 8)”

  1. > [-0.014, 0.088], but the probability can’t be negative, so [0, 0.088]

    If I had given the response [-0.01, 0.09] and it was marked as incorrect, I would claim it’s as good as [0, 0.09] as far as the definition of a confidence interval goes.

    Anyway, it’s interesting that the confidence intervals given here (using 4 different methods) were substantially narrower: [0, 0.06] or [0, 0.07] instead of [0, 0.09].

    • Well, the Agresti-Coull interval corresponds to a Beta(2,2) prior, which shrinks estimates towards 0.5 more than Beta(1,1), and certainly a lot more than my attempt at an informative Beta(1,39) prior. It’s not clear to me that we really want Beta(2,2) here, so I would question Agresti-Coull; it seems to me ‘the rule of 3’ is better here, or just use the Bayesian approach instead.

      • I also find the Beta(2,2) too strong. By the way, unlike in the Beta(1,1) case, the corresponding 95% high-density region [0.0009, 0.0869] doesn’t include 0.

        Agresti-Coull is not really equivalent to a Bayesian analysis; it’s what Jaynes would call an ad hoc device to produce an interval with some frequentist properties.
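
        For anyone who wants to reproduce that region: with a Beta(2,2) prior and 0 successes in 50 trials the posterior is Beta(2, 52), and a crude grid search over intervals of 95% posterior mass recovers the HPD. A minimal sketch with scipy:

          import numpy as np
          from scipy.stats import beta

          post = beta(2 + 0, 2 + 50)          # Beta(2,2) prior plus 0/50 data
          tail = np.linspace(0, 0.05, 2001)   # lower-tail mass of candidate intervals
          lo, hi = post.ppf(tail), post.ppf(tail + 0.95)
          i = np.argmin(hi - lo)              # shortest 95% interval = HPD (unimodal posterior)
          print(lo[i], hi[i])                 # roughly [0.001, 0.087]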

      • There’s no point in being too picky here. For one thing, the people in the survey might get into political office after the survey was taken. So it’s not really a matter of “never held office” but “not until now”. That would bring an age distribution into it, but we don’t have any data about that.

        In the interest of being as simple as possible, and not being picky, we could just say that the result of zero responses out of 50 is unlikely if the probability p were actually non-zero. How unlikely? Probably no more than one standard deviation out. For this condition, we would have

        s = sqrt(Npq) = Np   # set the sd of the count equal to its expected value Np
        ==> Npq = (Np)^2, i.e. q = Np, and since q = 1-p:
        ==> p = 1/(N + 1) = 1/51 ~= .02

        Sure, this is smaller than the one based on Agresti-Coull, but considering the other uncertainties, not too bad.

    • That’s what I thought too. Zero successes means there’s a closed form solution for the upper bound that is pretty simple (hence reasonable for an exam question): 1 – (alpha/2)^(1/n), alpha=0.05, n=50.
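
      In Python, with the rule-of-three bound 3/n for comparison:

        alpha, n = 0.05, 50
        print(1 - (alpha / 2) ** (1 / n))   # ~0.071: two-tailed Clopper-Pearson bound at x = 0
        print(1 - alpha ** (1 / n))         # ~0.058: one-tailed version
        print(3 / n)                        # 0.06: the rule of three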

        • I prefer a Beta(0.5, 0.5) for a default (or non-informative, whatever that means) prior as it is the Jeffreys prior and thus probability-matching to a higher order than other priors.

        • I like probability-matching as a default/non-informative approach insofar as it leads to a non-data-dependent prior. This seems to be possible only for nice models with univariate parameters and picks out the Jeffreys prior in those cases.

        • In this case at least, the coverage properties of the posterior high-density regions are better for the Laplace prior than for the Jeffreys prior:

          https://imgur.com/a/Lx43Rru

          I added the average coverage (even though it doesn’t really mean anything from a frequentist point of view).

          Note as well that the reasonable response for the confidence interval is [0, 0.088] so if Laplace’s [0, 0.57] is bad Jeffreys’ [0, 0.38] is twice as bad.

        • > if Laplace’s [0, 0.57] is bad Jeffreys’ [0, 0.38] is twice as bad.

          I obviously meant [0, 0.057] and [0, 0.038].

          As a bonus, Haldane’s prior is included here: https://imgur.com/a/v4JU3MZ

          It gives the interval [0, 0] which is indeed quite unreasonable!

        • Yes, my charts are for N=50. My beta(0.5,0.5) chart corresponds to the “Jeffreys Prior HPD Interval” in figure 11 in that paper. It’s true that the two-tailed intervals in Figure 5 are better but even in that case there are some spikes that lead the authors to propose an ad-hoc modification (figure 10).

          I find the two-tailed intervals less principled, even though they behave better and are easier to calculate. For the Clopper-Pearson interval I don’t understand why the two-tailed probability alpha/2 is used even when one of the tails disappears. Using alpha (and recovering the rule of three) makes more sense for x=0 (or x=n), leading to narrower intervals which still have the right coverage (we see in figure 11 that close to 0 or 1 using alpha/2 gives too much coverage). Even when two tails with probability alpha/2 exist, the equal-tailed interval is not optimal: https://hal.archives-ouvertes.fr/hal-00537683/document
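
          For anyone who wants to reproduce those coverage curves, the exact coverage of any interval rule can be computed by summing binomial probabilities over the outcomes whose interval covers the true p. A minimal sketch for the equal-tailed Jeffreys (Beta(0.5, 0.5)) interval; the HPD versions in the charts above would need a different interval function:

            from scipy.stats import beta, binom

            def jeffreys_interval(x, n, alpha=0.05):
                # Equal-tailed interval under the Jeffreys Beta(0.5, 0.5) prior.
                lo = beta.ppf(alpha / 2, x + 0.5, n - x + 0.5) if x > 0 else 0.0
                hi = beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5) if x < n else 1.0
                return lo, hi

            def coverage(p, n=50):
                # Exact coverage: sum the binomial pmf over outcomes whose interval contains p.
                return sum(binom.pmf(x, n, p) for x in range(n + 1)
                           if jeffreys_interval(x, n)[0] <= p <= jeffreys_interval(x, n)[1])

            for p in [0.01, 0.05, 0.1, 0.3]:
                print(p, coverage(p))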

  2. Regarding Question 8:

    For the high end of the confidence interval (the low end obviously being zero), aren’t we asking how high the percent of the population that has held office could be, such that the odds of polling 0/50 are .05 or less? If so, then if .088 of the population has held office, then the odds of sampling 0/50 are 0.912^50, which is 0.01, not 0.05. (OK, 0.009994 given these exact rounded figures.) Am I misunderstanding the question?
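
    A quick check of that arithmetic in Python; inverting the tail probability directly gives the rule-of-three-style bound discussed above, not the Agresti-Coull normal approximation, which is why the endpoints differ:

      print(0.912 ** 50)            # ~0.010: probability of observing 0/50 when p = 0.088
      print(1 - 0.05 ** (1 / 50))   # ~0.058: the p for which (1 - p)^50 = 0.05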

  3. To everyone above:

    I recommend you read the Agresti and Coull paper, which discusses many of these issues. It’s good stuff. I don’t think the Agresti-Coull method solves all problems—I agree that if you have real prior information you should use it—but it does give that approximate 95% coverage, it’s easy to use, and the result is reasonable.

      • Daniel:

        Since I’d already told them how to do this in class, and since it’s in the book too, I wasn’t really expecting anything other than the Agresti-Coull interval here. There are lots of answers that one can give to this problem, but in the context of this applied statistics course, Agresti-Coull was the answer. We didn’t ever even talk about the beta distribution in that class.

        And I do think that calculating the numbers is an important part of applied statistics. I see too many cases where students (and, later on, practitioners) give nonsensical answers just because they popped out of some wrongly-applied formula.

        • Sure, but to me it’s exactly the focus on correctly applied methods, and not on getting numbers, that I’d be emphasizing. Anyone can get a number from somewhere; getting the number from the right place is what seems important.

        • Daniel:

          I like teaching the Agresti-Coull method because (a) these y=0 or y=n examples do come up in real life from time to time, and (b) talking about these examples reveals a problem with the standard sqrt(p(1-p)/n) interval, which is what I usually recommend. The interval also has approximately 95% coverage, which is not such a big deal to me but can be convenient in applications, because then I can just point to the Agresti-Coull paper and move on. As can be seen from some of the comments in the above thread, it’s good to give some reasonable answer here, or else people can do all sorts of weird things.

        • What weird things are you referring to? Is the answer [0, 0.09] more reasonable than the interval [0, 0.06] obtained using the rule of three or inverting a test?

        • Carlos:

          Well, I asked for a 95% confidence interval, which is not what those other methods give. Anyway, this was the final exam for the class I taught, based on the book I wrote. I hadn’t mentioned those other approaches in my class or my book, so it’s unlikely any students would come up with these.

        • I still find it kind of shocking that you teach a whole applied regression class that still mainly focuses on some kind of classical methods without reference to Bayes, or am I over-interpreting what you wrote?

        • Daniel:

          The course and the book are mostly Bayesian (you can see in a few months when Regression and Other Stories appears in print), but we do have some classical confidence intervals.

        • Sure, the Agresti-Coull method is fine as far as any CI procedure is fine, but obviously many procedures are fine, that’s a real world fact as well, it seems like having a logically sound method to justify your procedure is what’s really important.

        • Agree – it’s the “later on, practitioners” that should be the primary focus of applied statistics courses.

          That is: what will people repeatedly do later, given what they should have learned in the course.

          And if the course won’t or can’t cover Bayesian Workflow adequately, pointing them to Bayes may well do more harm than good.

          Bayes has to be practiced safely!

        • Keith, do you think the “safety warning” on practicing Bayesian statistics should be greater than classical/frequentist statistics? I ask, because I am honestly not sure on this one. On one hand, yes we need a good workflow to do Bayes safely. On the other, I see little to no evidence that the workflow being taught in classical stats (on average) works as advertised.

        • Good point, and the widespread number of people doing traditional stats isn’t necessarily a benefit if they mostly do the sort of stuff that leads to the kind of noisy false certainty we see so often. My impression is that sloppy Bayes is no worse than, and may well be better than, sloppy Classical stats.

        • > “safety warning” on practicing Bayesian statistics should be greater than classical/frequentist statistics?
          No, but hopefully we have learned how damaging unsafe Classical statistics has been.

          Which style of unsafe statistics is worse (Daniel’s question)?

          Not sure – whichever is the most powerful or gives the seemingly most compelling results…

        • …re: the seemingly most compelling results. Insofar as Bayes gives the more intuitive answers, one could argue that it might have a special liability issue here, whereas a strictly interpreted Frequentist approach always frustratingly answers a question that is never directly intuitive (and in many cases of no interest at all, e.g. the whole strawman NHST stuff in most applications…)
          Thankfully, I find that it is super humbling and clarifying to always insist on recalling that the Bayesian inference is all conditional on the model, up to and including the prior specification – and of course the particular data at hand. The posterior measure is relative to the prior measure. It’s all conditional on “this is the (small) world” as Lindley would say. What this prevents is naive reification of probabilities, or models for that matter, as some kind of ultimate thing (“probability does not exist!”). It puts the emphasis squarely back on the workflow of scientific reasoning and modeling assumptions at play.
          Anyhow, only tangentially related, but perhaps pertinent is that personally I only ever feel like I can understand most non-Bayesian procedures after mapping them to some kind of Bayesian analogue.

        • “which is the most powerful”

          when we interpret powerful as political power, I think it’s clear that Classical Statistics has the most political power, that is, the power to get people to believe things and change policy or alter funding decisions etc. Today Bayes is questioned at every turn, and ridiculed for being “subjective” with a focus on the prior, or modeling “belief”. People currently in power to make decisions about resources are predominantly users of Classical-type methods (hypothesis testing, straw-man NHST specifically, and to a lesser extent maximum-likelihood fitting, and in econ difference-in-differences analysis, synthetic controls, robust standard errors, etc.), all based on sampling theory, typically without mechanistic models…

          The alternative is hard: model mechanisms directly, use Bayes to constrain the model to the reasonable range of applicability, and do a lot of computing to get fitted results that are difficult for anyone without a lot of Bayesian background to understand, and that make a lot of assumptions and choices that are easy to question. It’s hard to argue against “model free inference procedures” that “guarantee unbiased estimates of causal effects” and so on. But it’s easy to argue that some specific structural assumption might be wrong and that therefore the result of a Bayesian analysis might not hold…

          So from a political perspective, I see Classical Stats as it’s applied in many areas as a way to try to wield power to crush dissent.

        • Note, on an everyday level I think the ridiculing Bayes as a “subjective belief” system is thankfully less common, but when it comes to politically charged arguments I do think that it will immediately be drawn out like a machete.

        • Daniel:

          Yup. But the funny thing is that I think that a lot of the people doing bad science also feel that they’re being pounded by classical statistics.

          It goes like this:
          – Researcher X has an idea for an experiment.
          – X does the experiment and gathers data, would love to publish.
          – Because of the annoying hegemony of classical statistics, X needs to do a zillion analyses to find statistical significance.
          – Publication! NPR! Gladwell! Freakonomics, etc.
          – Methodologist Y points to problems with the statistical analysis, the nominal p-values aren’t correct, etc.
          – X is angry: first the statistical establishment required statistical significance, now the statistical establishment is saying that statistical significance isn’t good enough.
          – From Researcher X’s point of view, statistics is being used to crush new ideas and it’s being used to force creative science into narrow conventional pathways.

          This is a narrative that’s held by some people who detest me (and, no, I’m not Methodologist Y; this might be Greg Francis or Uri Simonsohn or all sorts of people.) There’s some truth to the narrative, which is one thing that makes things complicated.

        • I mostly agree with Chris’s, Daniel’s, and Andrew’s comments here.

          From Chris “Bayesian inference is all conditional on the model, up to and including the prior specification … It puts the emphasis squarely back on the workflow of scientific reasoning and modeling assumptions at play.”

          And “only ever feel like I can understand most non-Bayesian procedures after mapping them to some kind of Bayesian analogue” – me too.

          It really should, but that’s more recent than you think, and likely still disregarded in many places. I took a shot at it being disregarded in 1996: “When a posterior is presented, I believe it should be clearly and primarily stressed as being a ‘function’ of the prior probabilities and not the probability ‘of treatment effects.’” Two cheers for Bayes: https://www.sciencedirect.com/science/article/pii/S0197245696907539?via%3Dihub

          From Daniel “I think it’s clear that Classical Statistics has the most political power … applied in many areas as a way to try to wield power to crush dissent.” That certainly is the case in areas I have worked. One concern would be how it would hold up in court :-(

          From Andrew ” first the statistical establishment required statistical significance, now the statistical establishment is saying that statistical significance isn’t good enough” I would add “not providing a clear sense of what would be widely acceptable today.”

  4. Given that there are several perfectly appropriate ways to calculate an interval for the data in question (I would use Wilson’s score interval, [0, 0.07]), what penalty was applied to students who did not follow your taught precedent? How can a penalty be justified?

    • Michael:

      The Agresti-Coull method was the only thing I taught in class, so students either got this right because they remembered what to do, or they screwed up somewhere. If someone had come up with a completely different approach on their own, I’m not sure what I’d’ve done, but it didn’t come to that.

  5. Little late, but here’s my take in PyMC3:

    import pymc3 as pm

    with pm.Model() as q_8_model:
        p = pm.Beta("p", alpha=2, beta=2)            # Beta(2,2) prior, as discussed above
        y = pm.Binomial("y", n=50, p=p, observed=0)  # 0 successes out of 50
        trace = pm.sample()                          # draw posterior samples

    pm.summary(trace)
    #    mean      sd        mc_error  hpd_2.5   hpd_97.5  n_eff       Rhat
    # p  0.036841  0.025556  0.000446  0.001293  0.08663   3875.80476  1.001024

  6. I’m confused by the CI needing to include zero. We know some ‘Americans’ have held political office, so we’re 100% certain 0 is not a possible value for ‘the proportion of Americans who have ever held political office’. It’d be different if the example was about an alien species who may or may not have a concept of political office and where zero is thus a possible answer. But in this case something like 1 in 10000 is what I would expect as a lower bound, as the interval 0 to 0.0001 will contain the true value in exactly 0% of cases.
    This concern appears to be dismissed by ‘given the data are 0/50’, but that’s irrelevant, because the data also are ‘Americans’ and we know Americans have elected offices held by other Americans. Even if we polled 5 million Americans without finding a single one who held elected office it’s still more likely that our procedure is faulty or we’re just very very unlucky than that everything we know about the American political system is wrong.

    • I think the issue is that we don’t really have an estimate of the lower bound that’s meaningfully different from 0. Let’s say all the current office holders in major institutions divided by the population… that’s maybe 50 states x a few hundred, plus a few hundred for federal; call it 1e4. Divide by 3.2e8: about 3e-5. That’s close enough to zero that the approximation of starting at zero probably doesn’t matter much. We’ve only got a sample of 50, so the upper bound on the confidence interval is on the order of 1e-1, and the interval [0.00003, 0.1] is not meaningfully different from [0, 0.1].

    • I think Andrew is going full-frequentist there, and given that zero is the maximum-likelihood estimate it is the best estimate. The phrasing of the question seems to suggest that use of prior knowledge is forbidden (“From this information…”).

      • @Daniel
        Every place with 10k inhabitants has a mayor, almost certainly every place above 5k, hence my numbers. Larger communities have more than one elected office. Not American, so can’t guess what the lowest reasonable bound is, but could be as low as 1/500 which is meaningfully different from zero.

        @Carlos
        1) But the prior knowledge that the CI can’t start below zero is expected to be incorporated? That’s doublethink. I took the instruction to mean that you can’t plug in the true number you happen to know from elsewhere.
        2) I thought this place was about thinking beyond the numbers.
        3) I’m not suggesting students whose CI starts at zero should be considered as having failed. My issue is with ‘confidence intervals that excluded zero (which can’t be right, given that the data are 0/50)’ which can’t be right, given that I know we’re talking about _Americans_. That information is given in the question, yet subsequently we’re expected to selectively ignore it.

        • So confidence intervals MUST ignore prior information, because by definition they must work (have proper coverage) regardless of the true parameter value.

          It’s just a test question, so it shouldn’t be taken as advice about what to actually do.

        • I do agree with your comments. I was just commenting on the answer that was expected. We discussed some Bayesian solutions above, in particular the one from the Beta(1,1) uniform prior, and strictly speaking those Bayesian intervals do not include the value 0.

    • Markus:

      Sure, fair enough. I hadn’t thought of that. I agree that we could do better using prior information. Considering the question is not just current officers but “Have you ever held,” the true population value is not negligible. I’ve never had a student try to include that information in the answer, but it would be ok if they did.
