Question 4 of our Applied Regression final exam (and solution to question 3)

Here’s question 4 of our exam:

4. A researcher is imputing missing responses for income in a social survey of American households, using for the imputation a regression model given demographic variables. Which of the following two statements is basically true?

(a) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as rich or poor: A deterministic procedure overstates your certainty, making you more likely to impute extreme values.

(b) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as middle class: By not using the error term, you’ll impute too many values in the middle of the distribution.
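
One way to check the two statements for yourself is a small simulation. The sketch below is not part of the exam. It uses made-up data (a single hypothetical predictor `x`, a simulated outcome `y` standing in for something like log income, and roughly 30% of responses set to missing at random) and compares the spread of deterministic imputations, Xβ, with that of random imputations, Xβ + ε.

```r
# All names and numbers here are illustrative assumptions, not survey data.
set.seed(123)
n <- 1000
x <- rnorm(n)                         # hypothetical demographic predictor
y <- 1 + 0.5 * x + rnorm(n, 0, 1)     # hypothetical outcome (e.g., log income)
is_missing <- runif(n) < 0.3          # about 30% missing completely at random
dat <- data.frame(x = x, y = y)

# Fit the regression to the observed cases only.
fit <- lm(y ~ x, data = dat[!is_missing, ])
sigma_hat <- summary(fit)$sigma       # estimated residual standard deviation

# Deterministic imputation: predicted values only (X beta).
pred <- predict(fit, newdata = dat[is_missing, ])
imp_det <- pred

# Random imputation: predicted values plus a drawn error (X beta + epsilon).
imp_rand <- pred + rnorm(length(pred), 0, sigma_hat)

# Compare each set of imputations with the values that were actually missing.
spread <- c(true          = sd(dat$y[is_missing]),
            deterministic = sd(imp_det),
            random        = sd(imp_rand))
print(round(spread, 2))
```

Comparing the three standard deviations shows which procedure reproduces the spread of the true values. Histograms of `imp_det` and `imp_rand` make the same comparison visually.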

And the solution to question 3:

Here is a fitted model from the Bangladesh analysis predicting whether a person with high-arsenic drinking water will switch wells, given the arsenic level in their existing well and the distance to the nearest safe well.

glm(formula = switch ~ dist100 + arsenic, family=binomial(link="logit"))
               coef.est coef.se
(Intercept)        0.00    0.08
dist100           -0.90    0.10
arsenic            0.46    0.04
n = 3020, k = 3

Compare two people who live the same distance from the nearest well but whose arsenic levels differ, with one person having an arsenic level of 0.5 and the other person having a level of 1.0. Approximately how much more likely is this second person to switch wells? Give an approximate estimate, standard error, and 95% interval.

Using the divide-by-4 rule, the expected difference in Pr(switch), per unit change in arsenic level, is approximately 0.46/4 = 0.11 (recall that with the divide-by-4 rule we round down) with standard error 0.04/4 = 0.01. But we’re looking at a difference of 0.5 in arsenic level, so we need to multiply both the estimate and the standard error by 0.5, thus 0.055 with standard error 0.005, and a 95% interval of 0.055 +/- 2*0.005 = [0.045, 0.065].
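
The arithmetic above can be reproduced in a few lines of R. This is just a sketch of the calculation; the variable names are chosen here for illustration, and the coefficient and standard error are typed in from the fitted model shown earlier.

```r
# Coefficient and standard error for arsenic, copied from the fitted model.
b_arsenic  <- 0.46
se_arsenic <- 0.04

# Divide-by-4 rule: approximate change in Pr(switch) per unit of arsenic.
# floor(... * 100) / 100 rounds down to two decimals, as in the text.
est_per_unit <- floor(b_arsenic / 4 * 100) / 100
se_per_unit  <- se_arsenic / 4

# The two people differ by 1.0 - 0.5 = 0.5 units of arsenic.
delta <- 1.0 - 0.5
est <- est_per_unit * delta
se  <- se_per_unit * delta

# Approximate 95% interval: estimate +/- 2 standard errors.
interval <- est + c(-2, 2) * se
print(c(estimate = est, se = se, lower = interval[1], upper = interval[2]))
```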

The divide-by-4 rule works when the predicted probabilities are near the middle of the range, that is, near 50/50. The arsenic example was in the textbook and students should be able to recall that the probabilities of switching are indeed not far from 50%.
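
To see how much the approximation depends on being near 50/50, here is a minimal sketch that computes the difference in Pr(switch) directly with plogis() at a few distances. The distances are hypothetical values picked for illustration, and the code assumes that dist100 is the distance to the nearest safe well in units of 100 meters.

```r
# Coefficients copied from the fitted model.
b0        <- 0.00
b_dist    <- -0.90
b_arsenic <- 0.46

# Hypothetical distances in meters (illustrative, not taken from the data).
dist_m  <- c(0, 50, 100, 300)
dist100 <- dist_m / 100   # assumed scaling: units of 100 meters

# Predicted Pr(switch) at arsenic levels 0.5 and 1.0, same distance.
p_low  <- plogis(b0 + b_dist * dist100 + b_arsenic * 0.5)
p_high <- plogis(b0 + b_dist * dist100 + b_arsenic * 1.0)

comparison <- data.frame(
  dist_m = dist_m,
  p_low  = round(p_low, 3),
  p_high = round(p_high, 3),
  diff   = round(p_high - p_low, 3)
)
print(comparison)
```

Where the predicted probabilities are near 0.5, the `diff` column stays close to the divide-by-4 answer of 0.055. At the largest distance the probabilities are far from 0.5 and the difference is much smaller.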

Common mistakes

Most of the students had no problem with this one. The ones who made mistakes did so by trying to apply the logistic formula directly and then messing up somewhere. Please please please please please: Invlogit is invlogit. Do not write it as exp(x)/(1 + exp(x)) or as 1/(1 + exp(-x)). Logit is its own function which has as much integrity as log or exp. Understand what logit looks like and you’ll be fine.
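
As a small illustration of treating these as functions in their own right, the sketch below uses base R, where plogis() is the inverse logit and qlogis() is the logit. The names `invlogit` and `logit` are aliases defined here for readability.

```r
# Aliases for the base R functions (defined here for readability).
invlogit <- plogis
logit    <- qlogis

# What invlogit looks like: an S-shaped curve from 0 to 1, centered at 0.
x <- c(-4, -2, -1, 0, 1, 2, 4)
print(round(invlogit(x), 2))

# logit and invlogit undo each other.
p <- c(0.1, 0.5, 0.9)
print(all.equal(invlogit(logit(p)), p))

# The slope of invlogit at its center is 1/4, which is where the
# divide-by-4 rule comes from. dlogis() is the derivative of plogis().
slope_at_zero <- dlogis(0)
print(slope_at_zero)
```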

12 thoughts on “Question 4 of our Applied Regression final exam (and solution to question 3)”

  1. I’m at a loss here. That “divide by 4” rule is brutal. Here it only works if the next well is a few paces away. The distance can only be positive.
    Imagine the clean well is at 300m. The 1st person has a probability of exp(0.46-2.7)/(1+exp(idem)) = 0.096, the 2nd person exp(0.23-2.7)/(1+exp(idem)) = 0.078, difference = 0.018. Not even in the ballpark of 0.055.

    • Thomas:

      First, you missed the point of the last paragraph of my post. It’s invlogit(…), not exp/(1+exp).

      Second, you missed the point of the last sentence before the last paragraph, that the probabilities of switching are indeed not far from 50%. Almost nobody in that area in Bangladesh lives as far as 300m from the nearest safe well. The distances are mostly much less, mostly less than 50m from the nearest safe well.

      I agree that if probabilities are not close to 1/2, you shouldn’t use the divide-by-4 rule, you should just work with invlogit directly. But the divide-by-4 rule is often useful, I think it’s usually a good starting point, and in this particular example it works just fine.

      • Thanks Andrew. Indeed I fail to see what’s special about invlogit versus writing out the function. Using the function shows the student can convert odds to a probability, and that’s a plus.

        I stand corrected about the distances. I assumed there would be some distance to a safe well, if wells are 50m apart they drain the same water table and will have the same arsenic content.

        I guess what bothers me about the divide by 4 (not really a) rule is that it disregards the nature of a multiplicative model. The absolute difference cannot be constant across values of other (non null) covariates. I would pass a student who answered “I cannot answer that, it all depends on the distance to the well”.

  2. I’m sure I am missing something obvious, but I am not following the prohibition against applying the invlogit directly. Both exp(x)/(1 + exp(x)) and 1/(1 + exp(-x)) are valid forms of the invlogit, why shouldn’t they be applied directly?

    • Anon:

      There’s no prohibition of anything. It’s fine to use invlogit. It’s also fine to divide by 4 if the probability is close to 1/2. An advantage of dividing by 4 is that if you have a table of regression coefficients, it’s easy to divide by 4 and see what they are.

      The bigger picture is that I want students to understand regression models quantitatively, not just qualitatively. The usual story is that people look at coefficients and just see whether they’re positive or negative and whether they’re statistically significant. I think that’s a mistake, so I train students to be able to see a fitted regression model and interpret it quantitatively.

        • Jd:

          Just to clarify: In the last paragraph of my above post, when I say “Invlogit is invlogit,” I mean, use invlogit() (or, in default R, plogis()). Do not write it as exp(x)/(1 + exp(x)) or as 1/(1 + exp(-x)) because that misses the point that invlogit() is its own function. If you want to use logistic regression, you should understand logit and invlogit as functions in their own right, not as complicated formulas involving exponentials. My point is not mathematical accuracy or speed of computation—all these formulas are indeed the same thing—but, rather, understanding.

  3. Does something like “confidence interval vs prediction interval” apply to Question 3? In other words, are we talking about an interval for the difference between individuals, or an interval between the “mean response” given the values of the predictors?

    • Well, invlogit is a special case of the logistic in which the maximum value is 1. In this special case the situation is indeed a bit confusing:

      Logit: [0, 1] -> R // qlogis(x)
      Invlogit: R -> [0, 1] // plogis(x)

      Logistic: R -> [0, 1]
      Inverse logistic: [0, 1] -> R

      So here the logit is the inverse of the logistic, which in itself is the inverse logit, which is the inverse of the inverse logistic.
