## We’re done with our Applied Regression final exam (and solution to question 15)

We’re done with our exam.

And the solution to question 15:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

Both (a) and (b) are true.

(a) is true because everything’s approximately normally distributed so you’d expect a 95% chance for an estimate +/- 2 se’s to contain the true value. In real life we’re concerned with model violations, but here it’s all simulated data so no worries about bias. And n=100 is large enough that we don’t have to worry about the t rather than normal distribution. (Actually, even if n were pretty small, we’d be doing ok with estimates +/- 2 sd’s because we’re using the mad sd which gets wider when the t degrees of freedom are low.)

And (b) is true too because of the central limit theorem. Switching from a normal to a bimodal distribution will affect predictions for individual cases but it will have essentially no effect on the distribution of the estimate, which is a weighted average of 100 data points.
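Here’s a minimal simulation sketch of the procedure in Python, for anyone who wants to check both answers. It makes some assumptions that are not in the exam question: the error sd is set to 1, the bimodal errors are an equal mixture of two normals centered at -2 and +2, and the least-squares estimate and its standard error stand in for the median and mad sd of b (with weak priors these should be close, but that’s an assumption here). The function names are made up for illustration.

```python
import numpy as np

def slope_and_se(x, y):
    """Least-squares slope of y on x and its standard error."""
    n = len(x)
    xc = x - x.mean()
    b_hat = np.sum(xc * (y - y.mean())) / np.sum(xc**2)
    a_hat = y.mean() - b_hat * x.mean()
    resid = y - a_hat - b_hat * x
    sigma_hat = np.sqrt(np.sum(resid**2) / (n - 2))
    return b_hat, sigma_hat / np.sqrt(np.sum(xc**2))

def normal_errors(rng, n):
    # Assumed error sd of 1; the exam question does not specify it.
    return rng.normal(0.0, 1.0, size=n)

def bimodal_errors(rng, n):
    # Assumed bimodal shape: equal mixture of normals centered at -2 and +2.
    centers = rng.choice([-2.0, 2.0], size=n)
    return centers + rng.normal(0.0, 0.5, size=n)

def coverage(draw_errors, n=100, n_sims=1000, a=2.0, b=3.0, seed=0):
    """Count how often estimate +/- 2 se contains the true slope b."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        x = rng.uniform(0, 10, size=n)
        y = a + b * x + draw_errors(rng, n)
        b_hat, se = slope_and_se(x, y)
        if abs(b_hat - b) <= 2 * se:
            hits += 1
    return hits

normal_hits = coverage(normal_errors)
bimodal_hits = coverage(bimodal_errors)
print("normal errors: ", normal_hits, "of 1000")
print("bimodal errors:", bimodal_hits, "of 1000")
```

Both counts should land near 950, give or take simulation noise (the binomial sd is about 7).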

Common mistakes

Most of the students got (a) correct but not (b). I guess I have to bang even harder on the relative unimportance of the error distribution (except when the goal is predicting individual cases).

1. Christian Hennig says:

I actually don’t like questions with “you expect”, because whatever they expect is what they expect. If they expect something wrong, it’s still correct that they expect that.

• Andrew says:

Christian:

Good point. Instead of “You would expect the interval to contain…”, I could say, “The interval should contain…”

• Phil says:

Seems a bit pedantic, but OK. The question could be rephrased to “You should expect…”

• It’s good to explicitly state the background that should be assumed in questions. This is a stumbling block for a lot of advanced students who might know something beyond the course and so don’t know how to answer a question that, say, makes simplifying assumptions that they know are wrong. That used to drive me crazy in physics class… there’d be no correct answer for some multiple choice questions… because the “right” answer was only approximately right according to some assumptions.

• Phil says:

Daniel, to your point, I still remember how mad I was when a junior high school teacher marked me wrong for responding True to a question that was something like “True or false: a cyclonic tropical storm in the Atlantic that has wind speeds over 85 mph is a hurricane.” We had learned that an Atlantic cyclone with wind speeds over 75 mph (or something like that) is a hurricane, so the storm in the question meets the definition of hurricane. True. But no, the teacher marked it false. When I complained, he said that what he _meant_ was “True or false: a hurricane is a cyclonic tropical storm in the Atlantic that has wind speeds over 85 mph.” Yeah, OK, but that’s not what you asked! And he said “you knew what I meant.” No! No I didn’t! Grrrrr.

So, yeah, you should make sure your question asks what you intend it to ask, and not something else that is similar and that can be interpreted in some other way. I agree with the principle.

Even so, in this case, I think the ‘You would expect…’ wording is OK: I think it’s understood by all non-pedants that the ‘you’ here is intended to be someone who understands the problem and can find the right answer, it’s not literally ‘you’. Or I should say I think it’s understood by _almost_ all non-pedants: I can imagine an autistic person, for example, taking ‘you’ literally. I agree it would be good to fix this question, but it’s hard for me to believe anyone would really think this was a question about the state of mind of the test-taker rather than a statistics question.

• Carlos Ungil says:

How are these questions problematic? At least in this case: “True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).”

If the student would expect something different from the interval containing the true value approximately 950 times he can answer “false” and explain why. And he will be graded accordingly.

2. george says:

Doesn’t part (b) need to state something stronger than just “bimodal”? If the error distribution were bimodal but didn’t have finite moments I wouldn’t expect CLT results to hold.

• Andrew says:

George:

I guess so. It never came up in class so I didn’t think about this when writing the question.

• Do you think a discussion of error distributions that do not have finite moments is an important topic in an applied regression course?

• I’m kinda split on this one. I think it’s worth it for students to know something about heavy tailed distributions, particularly for certain kinds of disaster risk analysis, but teaching this in a useful way is not easy.

• george says:

Keith: yes, that or something similar. If students are going to rely on CLT results we should equip them with some knowledge of when those approximations don’t work well. Heavy-tailed distributions – or large values of higher order moments, in finite samples – are one source of trouble.
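A rough sketch of the point about finite moments raised in this thread, under assumptions that are not in the original discussion: Cauchy errors are used as an example of a distribution without finite moments (they are not bimodal), and the ratio of the sd to the mad sd of the simulated slope estimates is used as a crude check of approximate normality (it is near 1 for a normal distribution). The helper names are hypothetical.

```python
import numpy as np

def slope_estimates(draw_errors, n=100, n_sims=1000, a=2.0, b=3.0, seed=1):
    """Least-squares slope estimates from repeated simulated datasets."""
    rng = np.random.default_rng(seed)
    est = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0, 10, size=n)
        y = a + b * x + draw_errors(rng, n)
        est[s] = np.polyfit(x, y, 1)[0]  # first coefficient is the slope
    return est

def mad_sd(v):
    # Median absolute deviation, scaled to match the sd for normal data.
    return 1.4826 * np.median(np.abs(v - np.median(v)))

def sd_to_mad_sd(v):
    return np.std(v) / mad_sd(v)

normal_ratio = sd_to_mad_sd(slope_estimates(lambda rng, n: rng.normal(size=n)))
cauchy_ratio = sd_to_mad_sd(slope_estimates(lambda rng, n: rng.standard_cauchy(size=n)))
print("sd / mad sd, normal errors:", round(normal_ratio, 2))
print("sd / mad sd, Cauchy errors:", round(cauchy_ratio, 2))
```

With normal errors the ratio should be close to 1. With Cauchy errors it should be much larger, because the slope estimate is then itself Cauchy distributed and averaging 100 points does not tame the tails.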