I’m not particularly proud of this one, but I thought it might interest some of you in any case. It’s the final exam for the course I taught this fall to the economics students at Sciences Po. Students were given two hours.
You may use the textbook and one sheet of notes (both sides). No other references. You can use a pocket calculator but no computer.
Also please note that all these examples are fake; they are not based on real data.
1000 people are surveyed in country A and asked how many political activities they have done in the previous year. 500 people say “none,” 300 say “one,” and 200 say “two or more.” The same survey is performed in country B, where 300 people say “none,” 400 say “one,” and 300 say “two or more.”
Give an estimate and a standard error for the difference in the average number of political activities done by people in country A, compared to those in country B.
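One possible way to set this up, as a rough sketch rather than an official answer: it assumes that "two or more" is coded as exactly 2 (the question does not say this) and that the two surveys are independent simple random samples. The names `y_a`, `y_b`, `diff_est`, and `diff_se` are made up for illustration.

```r
# Rebuild the individual responses from the reported counts.
# Assumption: "two or more" is coded as 2.
y_a <- rep(c(0, 1, 2), times = c(500, 300, 200))  # country A
y_b <- rep(c(0, 1, 2), times = c(300, 400, 300))  # country B

# Estimated difference in means, A minus B
diff_est <- mean(y_a) - mean(y_b)

# Standard error of a difference between two independent sample means
diff_se <- sqrt(var(y_a) / length(y_a) + var(y_b) / length(y_b))

print(round(c(estimate = diff_est, se = diff_se), 3))
```

If "two or more" hides many people with three or more activities, both the means and the standard deviations would be larger than this sketch suggests.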
A survey is done asking Americans if they support Barack Obama’s health care plan. 80% of Democrats say yes (with a standard error of 5%), 50% of independents say yes (with a standard error of 10%), and 30% of Republicans say yes (with a standard error of 5%).
Assuming the general population is 30% Democrat, 30% independent, and 20% Republican, give an estimate of the percentage of Americans who support Obama’s plan.
A survey is done asking people if they support government-supplied child care. 55% of the men in the survey answer yes (with a standard error of 10%), and 70% of the women answer yes (with a standard error of 10%).
Give an estimate and standard error for the difference in support between men and women.
Assuming the data were collected using a simple random sample, how many people responded to the survey?
How large would the sample size have to be for the difference in support between men and women to have a standard error of 1%?
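A sketch of the arithmetic behind these three parts, under stated assumptions: the two group estimates are independent, each reported standard error equals sqrt(p(1-p)/n) for its group, and a larger survey would keep the same split between men and women. All variable names are illustrative.

```r
p_men <- 0.55;   se_men <- 0.10
p_women <- 0.70; se_women <- 0.10

# Difference in support (women minus men) and its standard error
diff_est <- p_women - p_men
diff_se <- sqrt(se_men^2 + se_women^2)

# Back out group sizes from se = sqrt(p * (1 - p) / n)
n_men <- p_men * (1 - p_men) / se_men^2
n_women <- p_women * (1 - p_women) / se_women^2
n_total <- n_men + n_women

# Standard errors scale as 1 / sqrt(n), so to shrink diff_se to 0.01
# the sample must grow by the factor (diff_se / 0.01)^2
n_needed <- n_total * (diff_se / 0.01)^2

print(round(c(diff = diff_est, se = diff_se, n = n_total, n_needed = n_needed), 2))
```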
1100 people are surveyed: 500 say they voted for John McCain and 600 say they voted for Barack Obama. Of the McCain voters, 30% support the health-care bill that is being debated in Congress. Of the Obama voters, 80% support the bill. (People who did not vote in the past election are excluded from this study.)
Suppose you take the data (y=1 for supporting the bill and 0 for opposing it) and run a linear regression, predicting y from a variable called “obama,” which is 1 for people who say they voted for Obama and 0 for people who say they voted for McCain.
The regression output looks like this:
lm(formula = y ~ obama)
(Intercept) 0.__ 0.__
obama 0.__ 0.__
n = ____, k = _
residual sd = 0.__, R-Squared = 0.__
Fill in the blanks.
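One way to check the blanks is to rebuild the 1100 responses from the stated counts and fit the regression directly. This is only a sketch using base R's `lm` and `summary`, not the display function that produced the output format above. The intercept should come out as the support rate among McCain voters, and the coefficient on `obama` as the difference between the two groups.

```r
# 500 McCain voters (30% support) followed by 600 Obama voters (80% support)
obama <- rep(c(0, 1), times = c(500, 600))
y <- c(rep(c(1, 0), times = c(150, 350)),   # McCain voters: 150 support
       rep(c(1, 0), times = c(480, 120)))   # Obama voters: 480 support

fit <- lm(y ~ obama)
s <- summary(fit)

coefs <- coef(s)[, c("Estimate", "Std. Error")]
print(round(coefs, 2))
cat("n =", length(y), ", k =", length(coef(fit)), "\n")
cat("residual sd =", round(s$sigma, 2),
    ", R-Squared =", round(s$r.squared, 2), "\n")
```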
Students in a laboratory experiment are given the choice of which of two video games to play, a storybook adventure or a shoot-em-up game. A logistic regression is fit predicting the students’ choice (y=0 if they choose the adventure or 1 if they choose the shooting game), given their age in years. (The students’ ages in the experiment range from 18 to 25 years.)
The logistic regression output looks like this:
glm(formula = y ~ age, family = binomial(link = "logit"))
(Intercept) 3.00 1.61
age -0.20 0.08
n = 230, k = 2
residual deviance = 263.5, null deviance = 281.0 (difference = 17.4)
From this model, what is the probability of choosing the shooting game if you are 20 years old?
Graph the fitted regression line (for x ranging from 18 to 25), and put in a bunch of dots representing values of x and y that are consistent with the fitted line.
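A sketch of how both parts could be checked in R, using the fitted coefficients from the output above. The helper `invlogit` and the simulated ages are assumptions made for illustration: ages are drawn uniformly from 18 to 25, which the question does not specify.

```r
invlogit <- function(z) 1 / (1 + exp(-z))

# Predicted probability of choosing the shooting game at age 20
p20 <- invlogit(3.00 - 0.20 * 20)
print(round(p20, 2))

# Simulate data consistent with the fitted model (hypothetical ages)
set.seed(1)
age <- sample(18:25, 230, replace = TRUE)
y <- rbinom(230, 1, invlogit(3.00 - 0.20 * age))

# Jittered 0/1 outcomes with the fitted curve on top
plot(jitter(age, 0.5), jitter(y, 0.1),
     xlab = "age", ylab = "Pr(shooting game)", pch = 20)
curve(invlogit(3.00 - 0.20 * x), from = 18, to = 25, add = TRUE)
```

Over this age range the curve is nearly a straight line that slopes gently downward, so the dots should be mostly at 0, with a somewhat larger share at 1 among the younger students.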
You want to perform a simulation study to see what happens if you fit a linear regression model with a noise predictor. Your model is that the underlying data follow a linear regression, y = b0 + b1*x1 + b2*x2 + error, where x1 is a binary variable (for example, sex) that equals 0 with probability 0.5 and 1 with probability 0.5, and x2 is a binary variable (for example, race) that equals 0 with probability 0.9 and 1 with probability 0.1 and is independent of x1. Also assume that the true values of b0, b1, b2 are 1.1, 2.2, and 3.3, and that the sd of the regression errors is 4.4.
Write R code to simulate 100 data points from this model, to fit and display a linear regression, and to check whether the true coefficients fall within the 68% interval as estimated from the model.
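A minimal sketch of one possible answer. It treats the 68% interval as the estimate plus or minus one standard error, which is an approximation, and it uses base R's `summary` in place of whatever display function the course used. The seed and variable names are arbitrary.

```r
set.seed(123)
n <- 100
b <- c(1.1, 2.2, 3.3)   # true b0, b1, b2
sigma <- 4.4            # sd of the regression errors

x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.1)
y <- b[1] + b[2] * x1 + b[3] * x2 + rnorm(n, 0, sigma)

fit <- lm(y ~ x1 + x2)
print(summary(fit))

# Does each true coefficient fall within estimate +/- 1 standard error?
est <- coef(fit)
se <- coef(summary(fit))[, "Std. Error"]
covered <- abs(est - b) < se
print(covered)
```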
Now add two “noise” predictors to the model, x3 and x4. Both of these will be independent of everything else in the model and be drawn from normal distributions with mean 1 and standard deviation 2.
Write R code to simulate the noise predictors and to fit and display the linear regression including these (in addition to the original predictors).
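A sketch of the extended simulation. It re-simulates the original data so that the block runs on its own, and again uses `summary` as a stand-in for the course's display function.

```r
set.seed(123)
n <- 100
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.1)
y <- 1.1 + 2.2 * x1 + 3.3 * x2 + rnorm(n, 0, 4.4)

# Noise predictors: independent of everything else, normal(mean 1, sd 2)
x3 <- rnorm(n, 1, 2)
x4 <- rnorm(n, 1, 2)

fit_noise <- lm(y ~ x1 + x2 + x3 + x4)
print(summary(fit_noise))
```

Because x3 and x4 have no true effect on y, their estimated coefficients should typically land within a couple of standard errors of zero.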
Consider the following lottery: a ticket costs $1, and when you play there is a 0.1% chance that you will win $500, a 0.9% chance you will win $50, and a 99% chance that you win nothing.
Write R code to simulate what might happen if you play this lottery 100 times. The final line of your R code should give your net winnings after these 100 plays. (“Net winnings” = total amount won, minus total amount spent on tickets.)
Write R code to loop this 1000 times and then to calculate the probability (based on these simulations) that your net winnings after playing 100 times would be positive.
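One way both parts could be written, as a sketch: the function name `play_100` and the seed are arbitrary choices, not part of the question.

```r
set.seed(42)

# Net winnings from playing the lottery 100 times at $1 per ticket
play_100 <- function() {
  prizes <- sample(c(500, 50, 0), size = 100, replace = TRUE,
                   prob = c(0.001, 0.009, 0.99))
  sum(prizes) - 100 * 1
}

# Part 1: a single run of 100 plays
net_once <- play_100()
print(net_once)

# Part 2: repeat 1000 times and estimate Pr(net winnings > 0)
net <- replicate(1000, play_100())
prob_positive <- mean(net > 0)
print(prob_positive)
```

Note that winning exactly two $50 prizes gives net winnings of zero, which does not count as positive.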
A hypothetical experiment is performed in eight small cities in a country. Four cities are randomly selected to be treated and the remaining four are the controls. The treatment is for the government to offer free cable television to everyone in the city for one year, and the outcome of interest is the average number of hours of television watched per day by the residents of the city when the year is over. Data are collected from each city in two waves. First, before the experiment begins, a survey is done in each city to estimate the average number of hours of television watched. Second, after the year is over, a survey is done in each city to estimate (1) the proportion of people who took the offer of free cable television and (2) the average number of hours of television watched.
Set up appropriate notation and give R code for computing the estimated effect of offering cable TV on number of hours watched. What assumptions are needed for this estimate to be valid?
Give R code for computing the estimated effect of having cable TV on number of hours watched. What assumptions are needed for this estimate to be valid?
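A sketch of one possible setup, with city-level notation: `z` is 1 for cities offered free cable and 0 otherwise, `pre` and `post` are the average hours watched before and after, and `takeup` is the proportion who accepted the offer. The numbers below are made-up placeholders so that the code runs, not data from any study. The effect of the offer is estimated by a regression on `z`, which relies on the random assignment. The effect of having cable is then sketched as a ratio (instrumental variables) estimate, which also assumes that the offer changes viewing only through having cable and that take-up is zero in the control cities.

```r
# Hypothetical placeholder data for the 8 cities (illustration only)
set.seed(7)
z <- rep(c(1, 0), each = 4)                    # 1 = offered free cable
pre <- round(rnorm(8, 3, 0.5), 2)              # pre-treatment hours per day
takeup <- ifelse(z == 1, runif(8, 0.4, 0.8), 0)
post <- pre + 0.5 * takeup + rnorm(8, 0, 0.2)  # post-treatment hours per day

# Effect of OFFERING cable (intent to treat), adjusting for pre-treatment hours
itt_fit <- lm(post ~ z + pre)
itt <- unname(coef(itt_fit)["z"])

# Effect of the offer on take-up
first_stage <- lm(takeup ~ z + pre)
compliance <- unname(coef(first_stage)["z"])

# Effect of HAVING cable: ratio of the two estimates
iv_est <- itt / compliance
print(round(c(itt = itt, compliance = compliance, iv = iv_est), 2))
```

With only eight cities, either estimate would be very noisy, so the standard errors matter as much as the point estimates.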
An experiment is performed on an experimental drug to lower blood pressure. The experiment is performed on ten people, who are divided into five pairs: in each pair, one person is randomly given the drug and the other person is given the placebo. Pre-test and post-test measurements are taken on each person.
Pair  Person  x    T  yT   yC
1     A       100  1  120  ___
1     B       110  0  ___  120
2     C       120  0  ___  130
2     D       130  1  100  ___
3     E       140  0  ___  90
3     F       150  1  100  ___
4     G       160  0  ___  140
4     H       170  1  140  ___
5     I       180  1  150  ___
5     J       190  0  ___  160
Set up appropriate notation and give R code for the regression that is equivalent to estimating the treatment effect by taking the post-test difference between treatment and control.
Give R code for the regression that is equivalent to comparing post-test treatment to control, adjusting for pre-test.
Suppose the treatment effect is exactly 5. Fill in the unobserved values above.
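A sketch of how the three parts might look in R, using the observed values from the table. Notation: `x` is the pre-test, `treat` is the treatment indicator T, and `y` is the observed post-test (yT for treated people, yC for controls). For the last part the sketch assumes the effect is defined as yT minus yC equal to +5 for every person. The sign convention is an assumption.

```r
x <- seq(100, 190, by = 10)                   # pre-test, persons A to J
treat <- c(1, 0, 0, 1, 0, 1, 0, 1, 1, 0)      # T
y <- c(120, 120, 130, 100, 90, 100, 140, 140, 150, 160)  # observed post-test

# Equivalent to the post-test difference between treatment and control
fit_1 <- lm(y ~ treat)
print(coef(fit_1))

# Comparing post-test treatment to control, adjusting for pre-test
fit_2 <- lm(y ~ treat + x)
print(coef(fit_2))

# Potential outcomes if the effect is exactly 5 for everyone
# (assumption: effect = yT - yC)
y_T <- ifelse(treat == 1, y, y + 5)
y_C <- ifelse(treat == 1, y - 5, y)
print(data.frame(person = LETTERS[1:10], x, treat, y_T, y_C))
```

Neither regression uses the pairing; indicators for the pairs could be added as predictors, but the questions as worded do not ask for that.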