Solutions to the final exam in my first-semester applied statistics class

I just graded the final exams for my first-semester graduate statistics course that I taught in the economics department at Sciences Po.

I posted the exam itself here last week; you might want to take a look at it and try some of it yourself before coming back here for the solutions.

And see here for my thoughts about this particular exam, this course, and final exams in general.

Now on to the exam solutions, which I will intersperse with the exam questions themselves:

You may use the textbook and one feuille of notes (both sides). No other references. You can use a pocket calculator but no computer.

Also please note that all these examples are fake; they are not based on real data.

The students were given two hours to complete the exam. This was not enough time (or, to put it another way, the exam was too long). Also, I did not prepare them enough for it; see more on this point here.

Problem 1.

1000 people are surveyed in country A and asked how many political activities they have done in the previous year. 500 people say “none,” 300 say “one,” and 200 say “two or more.” The same survey is performed in country B, where 300 people say “none,” 400 say “one,” and 300 say “two or more.”

Give an estimate and a standard error for the difference in the average number of political activities done by people in country A, compared to those in country B.

Count “two or more” as 3. Then the mean for group A is .5*0 + .3*1 + .2*3 = .9 and the mean for group B is .3*0 + .4*1 + .3*3 = 1.3. The difference is .4.

Graph each of these two distributions and you’ll see that each has an sd of about 1. (A more exact calculation gives sd’s of 1.1 for group A and 1.2 for group B, but we don’t need perfection here.) The se of the difference is sqrt (s1^2/n1 + s2^2/n2). n1=n2=1000 in this case, so the se is approx sqrt (1^2/1000 + 1^2/1000) = .04.
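For anyone who wants to check the arithmetic on a computer, here is a minimal sketch in base R. It keeps the "two or more" = 3 scoring from the solution; the variable names are mine, not from the exam. With the exact sd's plugged in, the se comes out closer to .05 than to the rough .04, which is within the level of precision intended here.

```r
# Scores for the three answer categories: none / one / two or more.
# Scoring "two or more" as 3 is the simplifying assumption from the solution.
value <- c (0, 1, 3)
p.A <- c (500, 300, 200) / 1000   # proportions in country A
p.B <- c (300, 400, 300) / 1000   # proportions in country B
n.A <- 1000
n.B <- 1000

# Mean and sd of each discrete distribution
mean.A <- sum (p.A * value)
mean.B <- sum (p.B * value)
sd.A <- sqrt (sum (p.A * (value - mean.A)^2))
sd.B <- sqrt (sum (p.B * (value - mean.B)^2))

# Difference in means (B minus A) and its standard error
est.diff <- mean.B - mean.A
se.diff <- sqrt (sd.A^2/n.A + sd.B^2/n.B)
print (round (c (mean.A=mean.A, mean.B=mean.B, sd.A=sd.A, sd.B=sd.B,
                 est.diff=est.diff, se.diff=se.diff), 2))
```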

Problem 2.

A survey is done asking Americans if they support Barack Obama’s health care plan. 80% of Democrats say yes (with a standard error of 5%), 50% of independents say yes (with a standard error of 10%), and 30% of Republicans say yes (with a standard error of 5%).

Assuming the general population is 30% Democrat, 30% independent, and 20% Republican, give an estimate of the percentage of Americans who support Obama’s plan.

As a blog commenter noted, I had a typo there; the three percentages in the second paragraph were supposed to add up to 100. Since they didn’t, I’m going to renormalize and get .375, .375, .25, thus getting inference for the 80% of the population represented by these groups (or, implicitly assuming the avg for the other 20% is the same as the avg in this 80%). The estimate is then 0.375*.8 + .375*.5 + .25*.3 = .56, and the standard error (which I didn’t ask for but everybody gave anyway) is sqrt (0.375^2*.05^2 + .375^2*.1^2 + .25^2*.05^2) = .04.

(I didn’t take off credit if people simply did .3*.8 + .3*.5 + .2*.3; in an exam setting, it’s forgivable to not think of checking to see if the percentages add to 100.)
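Here is a short R sketch of the weighted-average calculation, taking the inputs directly from the problem statement (80%, 50%, and 30% support, with the population shares renormalized to sum to 1). The variable names are my own. Each weight enters the variance squared, and under these inputs the result should be roughly .56 with a standard error near .04.

```r
# Inputs from the problem statement, in the order Democrat, independent, Republican
support <- c (Dem=.80, Ind=.50, Rep=.30)   # proportion saying yes
se      <- c (Dem=.05, Ind=.10, Rep=.05)   # standard error of each proportion
share   <- c (Dem=.30, Ind=.30, Rep=.20)   # population shares as given (sum to .8)

# Renormalize the shares so that they sum to 1
w <- share / sum (share)

# Weighted average and its standard error (groups treated as independent)
est <- sum (w * support)
se.est <- sqrt (sum (w^2 * se^2))
print (round (c (est=est, se=se.est), 2))
```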

Problem 3.

A survey is done asking people if they support government-supplied child care. 55% of the men in the survey answer yes (with a standard error of 10%), and 70% of the women answer yes (with a standard error of 10%).

Give an estimate and standard error for the difference in support between men and women.

Assuming the data were collected using a simple random sample, how many people responded to the survey?

How large would the sample size have to be for the difference in support between men and women to have a standard error of 1%?

Estimate is .15 with a standard error of sqrt (.1^2+.1^2) = .14.

To get sample size: .5/sqrt(n) = .1, thus approx n=25 in each group, or 50 total. (You could be more exact using sqrt(.55*.45) and sqrt(.7*.3), but this is not really necessary.)

To get se of difference to be 1%, you have to reduce the se by a factor of .14/.01=14, thus increase sample size by a factor of 14^2, or approx 200. Total sample size needed is then 50*200=10000.
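The same three steps can be written out in R. This is only a sketch with names of my choosing; it uses the conservative .5/sqrt(n) approximation for the se of a proportion, as in the solution.

```r
# Standard errors reported for men and women
se.men <- .10
se.women <- .10

# Estimate and standard error of the difference (women minus men)
est.diff <- .70 - .55
se.diff <- sqrt (se.men^2 + se.women^2)

# Sample size per group implied by se = .5/sqrt(n), then the total
n.group <- (.5 / se.men)^2
n.total <- 2 * n.group

# The se scales as 1/sqrt(n), so cutting it to .01 multiplies n by (se.diff/.01)^2
scale.up <- (se.diff / .01)^2
n.needed <- n.total * scale.up
print (c (se.diff=se.diff, n.total=n.total, scale.up=scale.up, n.needed=n.needed))
```

Carrying the unrounded se of .1414 through gives a factor of 200 directly, which is why rounding 14^2 up to 200 does no harm.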

Problem 4.

1100 people are surveyed: 500 say they voted for John McCain and 600 say they voted for Barack Obama. Of the McCain voters, 30% support the health-care bill that is being debated in Congress. Of the Obama voters, 80% support the bill. (People who did not vote in the past election are excluded from this study.)

Suppose you take the data (y=1 for supporting the bill and 0 for opposing it) and run a linear regression, predicting y from a variable called “obama,” which is 1 for people who say they voted for Obama and 0 for people who say they voted for McCain.

The regression output looks like this:

lm(formula = y ~ obama)
(Intercept) 0.__ 0.__
obama 0.__ 0.__

n = ____, k = _
residual sd = 0.__, R-Squared = 0.__

Fill in the blanks.

For intercept, the estimate is the mean for McCain (since obama=0 in that case), which is .3. The se is sqrt (.3*.7/500) = .02. For obama coef, the estimate is the difference, .8 - .3 = .5, and the se is the se of the difference, that is sqrt (.3*.7/500 + .8*.2/600) = .03.

n=1100, k=2

To get residual sd, first figure out the residual variance within each group: .3*.7 for the McCain voters and .8*.2 for the Obama voters. Avg residual var is then (500/1100)*.3*.7 + (600/1100)*.8*.2, and residual sd is sqrt ((500/1100)*.3*.7 + (600/1100)*.8*.2) = .43.

R-squared = 1 – (residual sd)^2/(total sd)^2. Residual sd is .43; total sd is based on the model with no predictors and is sqrt (p*(1-p)), where p = (500/1100)*.3 + (600/1100)*.8 = .57, thus total sd is sqrt (.57*.43) = .50.
R-squared = 1 – .43^2/.50^2 = .26.
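One way to check these blanks is to build a data set with exactly the stated counts and fit the regression. The sketch below uses only base R (summary() rather than display(), so the layout of the output differs from the exam), and the object names are mine. The last digit can differ slightly from the hand calculation, since lm pools the residual variance and nothing is rounded along the way; R-squared, for instance, comes out near .25.

```r
# 500 McCain voters (obama=0), of whom 30% = 150 support the bill;
# 600 Obama voters (obama=1), of whom 80% = 480 support the bill
obama <- c (rep (0, 500), rep (1, 600))
y <- c (rep (1, 150), rep (0, 350), rep (1, 480), rep (0, 120))

M <- lm (y ~ obama)
s <- summary (M)

# Coefficient estimates and standard errors
print (round (s$coefficients[, c ("Estimate", "Std. Error")], 2))

# n, k, residual sd, and R-squared
print (round (c (n=length (y), k=length (coef (M)),
                 residual.sd=s$sigma, R.squared=s$r.squared), 2))
```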

Problem 5.

Students in a laboratory experiment are given the choice of which of two video games to play, a storybook adventure or a shoot-em-up game. A logistic regression is fit predicting the students’ choice (y=0 if they choose the adventure or 1 if they choose the shooting game), given their age in years. (The students’ ages in the experiment range from 18 to 25 years.)

The logistic regression output looks like this:

glm(formula = y ~ age, family = binomial(link = "logit"))
(Intercept) 3.00 1.61
age -0.20 0.08

n = 230, k = 2
residual deviance = 263.5, null deviance = 281.0 (difference = 17.4)

From this model, what is the probability of choosing the shooting game if you are 20 years old?

Graph the fitted regression line (for x ranging from 18 to 25), and put in a bunch of dots representing values of x and y that are consistent with the fitted line.

At age 20, the probability is invlogit (3-.2*20) = invlogit (-1). From the divide-by-4 rule, this is .5 + (-1)/4 = .25. (A more exact calculation gives invlogit (-1) = .27.)

To make the graph, just figure out invlogit (3-.2*18) and invlogit (3-.2*25) and connect the dots to form a smooth curve. There should be 230 dots, some at y=0 and some at y=1, but mostly at y=0.
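A small R sketch of these calculations follows. The invlogit function is defined inline so that nothing beyond base R is needed. The simulated ages (uniform on the integers 18 to 25) and the random seed are my assumptions for illustration only; the exam does not say how ages are distributed.

```r
# Inverse logit, defined here so the sketch needs no packages
invlogit <- function (x) 1 / (1 + exp (-x))

# Fitted probability of choosing the shooting game at ages 18, 20, and 25
a <- 3.00
b <- -0.20
p.18 <- invlogit (a + b*18)
p.20 <- invlogit (a + b*20)
p.25 <- invlogit (a + b*25)
print (round (c (p.18=p.18, p.20=p.20, p.25=p.25), 2))

# Fake data consistent with the fitted curve.
# Assumption: ages are spread evenly over 18..25; the seed is arbitrary.
set.seed (1)
n <- 230
age <- sample (18:25, n, replace=TRUE)
y <- rbinom (n, 1, invlogit (a + b*age))
print (mean (y))   # share of simulated students choosing the shooting game
```

To draw the picture, something like curve (invlogit (3 - .2*x), 18, 25, ylim=c(0,1)) followed by points (age, y) should work; jittering the points makes the 230 dots easier to see.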

Problem 6.

You want to perform a simulation study to see what happens if you fit a linear regression model with a noise predictor. Your model is that the underlying data follow a linear regression, y = b0 + b1*x1 + b2*x2 + error, where x1 is a binary variable (for example, sex) that equals 0 with probability 0.5 and 1 with probability 0.5, and x2 is a binary variable (for example, race) that equals 0 with probability 0.9 and 1 with probability 0.1 and is independent of x1. Also assume that the true values of b0, b1, b2 are 1.1, 2.2, and 3.3, and that the sd of the regression errors is 4.4.

Write R code to simulate 100 data points from this model, to fit and display a linear regression, and to check whether the true coefficients fall within the 68% interval as estimated from the model.

Now add two “noise” predictors to the model, x3 and x4. Both of these will be independent of everything else in the model and be drawn from normal distributions with mean 1 and standard deviation 2.

Write R code to simulate the noise predictors and to fit and display the linear regression including these (in addition to the original predictors).

Here’s my code that I wrote on the exam paper. I didn’t actually try it out–that would be cheating!–so it might have some bugs.

library (arm)  # for display() and se.coef()

n <- 100
x1 <- rbinom (n, 1, .5)
x2 <- rbinom (n, 1, .1)
b <- c (1.1, 2.2, 3.3)
sigma <- 4.4
y <- rnorm (n, b[1] + b[2]*x1 + b[3]*x2, sigma)
M1 <- lm (y ~ x1 + x2)
display (M1)

# 68% interval: is each true coefficient within 1 se of its estimate?
inside <- rep (NA, 3)
for (j in 1:3){
  inside[j] <- abs (b[j] - coef(M1)[j]) < se.coef(M1)[j]
}
print (inside)

# add the two noise predictors and refit
x3 <- rnorm (n, 1, 2)
x4 <- rnorm (n, 1, 2)
M2 <- lm (y ~ x1 + x2 + x3 + x4)
display (M2)

Problem 7.

Consider the following lottery: a ticket costs $1, and when you play there is a 0.1% chance that you will win $500, a 0.9% chance you will win $50, and a 99% chance that you win nothing.

Write R code to simulate what might happen if you play this lottery 100 times. The final line of your R code should give your net winnings after these 100 plays. (“Net winnings” = total amount won, minus total amount spent on tickets.)

Write R code to loop this 1000 times and then to calculate the probability (based on these simulations) that your net winnings after playing 100 times would be positive.

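# one run of 100 plays; x is the net winnings on each play (prize minus the $1 ticket)
n <- 100
z <- runif (n)
x <- ifelse (z < .001, 500, ifelse (z < .01, 50, 0)) - 1
total <- sum (x)

# repeat 1000 times and estimate Pr (net winnings > 0)
n.loop <- 1000
total <- rep (NA, n.loop)
for (loop in 1:n.loop){
  n <- 100
  z <- runif (n)
  x <- ifelse (z < .001, 500, ifelse (z < .01, 50, 0)) - 1
  total[loop] <- sum (x)
}
print (mean (total > 0))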

Problem 8.

A hypothetical experiment is performed in eight small cities in a country. Four cities are randomly selected to be treated and the remaining four are the controls. The treatment is for the government to offer free cable television to everyone in the city for one year, and the outcome of interest is the average number of hours of television watched per day by the residents of the city when the year is over. Data are collected from each city in two waves. First, before the experiment begins, a survey is done in each city to estimate the average number of hours of television watched. Second, after the year is over, a survey is done in each city to estimate (1) the proportion of people who took the offer of free cable television and (2) the average number of hours of television watched.

Set up appropriate notation and give R code for computing the estimated effect of offering cable TV on number of hours watched. What assumptions are needed for this estimate to be valid?

Give R code for computing the estimated effect of having cable TV on number of hours watched. What assumptions are needed for this estimate to be valid?

Define T to be the treatment (a vector that is 1 if treated and 0 if control), x = # hours watched pre-experiment, y = #hours watched post-experiment, z = proportion who took the offer in the city. Each of these variables is a vector of length 8.

Intent-to-treat estimate of offering cable TV is coef (lm (y ~ T + x))["T"].

Instrumental variables estimate of having cable TV is coef (lm (y ~ T + x))["T"] / coef (lm (z ~ T + x))["T"].

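The assumptions are the usual ones (see chapters 9 and 10 of ARM): for the intent-to-treat estimate, that the treatment was randomly assigned (ignorability); for the instrumental variables estimate, in addition, that the offer affects hours watched only through actually having cable (the exclusion restriction), that the offer does not make anyone less likely to have cable (monotonicity), and that the offer has a nonzero effect on take-up.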
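To see the two estimates computed, here is a sketch using made-up numbers for the eight cities. The data are invented purely so the code runs and say nothing about any real experiment. It follows the notation defined in the solution (T, x, y, z) and assumes nobody in the control cities gets the offer, so z is 0 there.

```r
# Invented data for 8 cities (illustration only, not real measurements)
T <- c (1, 1, 1, 1, 0, 0, 0, 0)                    # 1 = offered free cable
x <- c (3.1, 2.8, 3.5, 2.6, 3.0, 2.9, 3.4, 2.7)    # hours/day before the experiment
y <- c (3.6, 3.1, 4.0, 3.0, 3.1, 2.9, 3.5, 2.8)    # hours/day after one year
z <- c (.6, .5, .7, .4, 0, 0, 0, 0)                # proportion who took the offer

# Intent-to-treat: effect of OFFERING cable, adjusting for pre-treatment hours
itt <- coef (lm (y ~ T + x))["T"]

# Effect of the offer on take-up (the "first stage")
take.up <- coef (lm (z ~ T + x))["T"]

# Instrumental variables (ratio) estimate: effect of HAVING cable
iv <- itt / take.up
print (round (c (itt=unname (itt), take.up=unname (take.up), iv=unname (iv)), 2))
```

Note that coef() returns a named vector, so the coefficient is extracted with ["T"]. Assigning to T also masks R's shorthand for TRUE, which is harmless in a short script like this but worth remembering.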

Problem 9.

An experiment is performed on an experimental drug to lower blood pressure. The experiment is performed on ten people, who are divided into five pairs: in each pair, one person is randomly given the drug and the other person is randomly given the placebo. A pre-test and post-test measurement is taken on each person.

Pair Person x T yT yC
1 A 100 1 120 ___
1 B 110 0 ___ 120
2 C 120 0 ___ 130
2 D 130 1 100 ___
3 E 140 0 ___ 90
3 F 150 1 100 ___
4 G 160 0 ___ 140
4 H 170 1 140 ___
5 I 180 1 150 ___
5 J 190 0 ___ 160

Set up appropriate notation and give R code for the regression that is equivalent to estimating the treatment effect by taking the post-test difference between treatment and control.

Give R code for the regression that is equivalent to comparing post-test treatment to control, adjusting for pre-test.

Suppose the treatment effect is exactly 5. Fill in the unobserved values above.

Define y to be the measured outcome (that is, yT if T=1 or yC if T=0).
To get the post-test difference, do coef (lm (y ~ T))["T"] or simply mean (y[T==1]) - mean (y[T==0]). Actually we’ll want to take the negative, since we’re measuring the effect of the drug on lowering blood pressure.

To adjust for pre-test, do - coef (lm (y ~ T + x))["T"].

The above regressions give the correct point estimates but the wrong standard errors. The problem is that they do not account for the pairing. The right thing to do is to either include the pairing as a multilevel factor (that is, do lmer (y ~ T + x + (1 | Pair)) and look at the coef and se of T), or else first take the differences in pairs and then do an analysis on the n=5 differences:

n <- 5
x.diff <- rep (NA, n)
y.diff <- rep (NA, n)
for (i in 1:n){
  x.diff[i] <- x[Pair==i & T==1] - x[Pair==i & T==0]
  y.diff[i] <- y[Pair==i & T==1] - y[Pair==i & T==0]
}
display (lm (y.diff ~ 1))
display (lm (y.diff ~ x.diff))

I didn't expect people to do the above calculation (to address the pairing) in their exams, but for completeness I wanted to include it here.

Finally, if the treatment lowers blood pressure by 5, the blank elements in the table should be 125, 115, 125, 105, 85, 105, 135, 145, 155, 155.
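With the ten observations from the table entered as vectors, the estimates can be checked directly. The sketch below uses base R only (no display() or lmer()), and the names est.simple, est.adj, and est.paired are mine. The sign is flipped throughout so that a positive number means blood pressure was lowered.

```r
# Data from the table: y is the observed outcome (yT if T=1, yC if T=0)
Pair <- rep (1:5, each=2)
x <- seq (100, 190, by=10)
T <- c (1, 0, 0, 1, 0, 1, 0, 1, 1, 0)
y <- c (120, 120, 130, 100, 90, 100, 140, 140, 150, 160)

# Simple post-test comparison, and the comparison adjusting for pre-test
est.simple <- - unname (coef (lm (y ~ T))["T"])
est.adj <- - unname (coef (lm (y ~ T + x))["T"])

# Paired analysis: treated minus control within each pair
n <- 5
y.diff <- rep (NA, n)
for (i in 1:n){
  y.diff[i] <- y[Pair==i & T==1] - y[Pair==i & T==0]
}
est.paired <- - mean (y.diff)
se.paired <- sd (y.diff) / sqrt (n)

print (round (c (simple=est.simple, adjusted=est.adj,
                 paired=est.paired, se.paired=se.paired), 1))
```

The simple and paired point estimates agree, since every pair contributes one treated and one control person; it is the standard error that the pairing changes.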

Again, see here for more discussion of this exam, and of exams in general.

5 thoughts on “Solutions to the final exam in my first-semester applied statistics class”

  1. Can you comment more on your 'Count "two or more" as 3' assumption?

    The distributions feel rather differently placed on the number line, even something like assuming 3 for the first and 4 for the second almost doubles the estimate for the difference. It feels like the s.e. should include this large source of uncertainty.

    Also a typo: I assume you mean "the difference is 0.4"?

  2. Typo fixed. And, yes, more could be done with the "two or more" category; I was just trying to do something quick here, the way that someone might do for a final exam. Probably would've been a good idea to make this clear in the question.
