The pharmacometrics group led by France Mentré (IAME, INSERM, Univ Paris) is very pleased to host a free ISoP Statistics and Pharmacometrics (SxP) SIG local event at Faculté Bichat, 16 rue Henri Huchard, 75018 Paris, on Thursday afternoon, the 11th of July 2019.

It will feature talks from Professor Andrew Gelman, Columbia University (We’ve Got More Than One Model: Evaluating, comparing, and extending Bayesian predictions) and Professor Rob Bies, University at Buffalo (A hybrid genetic algorithm for NONMEM structural model optimization).

We welcome all of you (please register here). Registration is capped at 70 attendees.

If you would like to present some of your work (related to SxP), please contact us by July 1, 2019. Send a title and short abstract (julie.bertrand@inserm.fr).

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.
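The whole procedure can be run in a few lines of code. The class uses R (with the median and mad sd from a fitted model); here is a rough Python stand-in that uses least squares and the classical standard error of the slope, which play the same role:

```python
import math
import random

def simulate_coverage(n=100, a=2.0, b=3.0, sigma=1.0, reps=1000, seed=1):
    """Repeat the simulate-fit-check procedure and count interval coverage."""
    random.seed(seed)
    covered = 0
    for _ in range(reps):
        x = [random.uniform(0, 10) for _ in range(n)]
        y = [a + b*xi + random.gauss(0, sigma) for xi in x]
        # least-squares slope and its classical standard error
        xbar, ybar = sum(x)/n, sum(y)/n
        sxx = sum((xi - xbar)**2 for xi in x)
        bhat = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y)) / sxx
        ahat = ybar - bhat*xbar
        rss = sum((yi - (ahat + bhat*xi))**2 for xi, yi in zip(x, y))
        se_b = math.sqrt(rss/(n - 2)) / math.sqrt(sxx)
        # does the estimate +/- 2 se interval contain the true b?
        if bhat - 2*se_b <= b <= bhat + 2*se_b:
            covered += 1
    return covered

print(simulate_coverage())  # typically somewhere around 950 out of 1000
```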

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

And the solution to question 14:

14. You are predicting whether a student passes a class given pre-test score. The fitted model is Pr(Pass) = logit^−1(a_j + 0.1x),

for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

(a) For an average classroom, the curve is invlogit(1 + 0.1x), so it goes through the 50% point at x = -10. So the easiest way to draw the curve is to extend it outside the range of the data. But in the graph, the x-axis should go from 0 to 50. Recall that invlogit(5) = 0.99, so the probability of passing reaches 99% when x reaches 40. From all this information, you can draw the curve.

(b) The 25th and 75th percentage points of the normal distribution are at the mean +/- 0.67 standard deviations. Thus, the 25th and 75th percentage points of the intercepts are 1 +/- 0.67*2, that is, -0.34 and 2.34, so the curves to draw are invlogit(-0.34 + 0.1x) and invlogit(2.34 + 0.1x). These are just shifted versions of the curve from (a), shifted by 1.34/0.1 = 13.4 to the right and to the left, respectively.
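The numbers in this solution are easy to verify numerically; here is a small Python sketch with invlogit written out by hand:

```python
import math

def invlogit(z):
    """Inverse logit: maps the linear predictor to a probability."""
    return 1 / (1 + math.exp(-z))

# average classroom: a = 1, so the curve is invlogit(1 + 0.1*x)
print(round(invlogit(1 + 0.1*40), 2))  # 0.99: passing is near-certain by x = 40

# 25th and 75th percentile classrooms: intercepts 1 -/+ 0.67*2
a_lo, a_hi = 1 - 0.67*2, 1 + 0.67*2    # -0.34 and 2.34

# these are the same curve shifted by 1.34/0.1 = 13.4 along the x-axis
x = 25.0
assert abs(invlogit(a_lo + 0.1*(x + 13.4)) - invlogit(1 + 0.1*x)) < 1e-9
```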

**Common mistakes**

Students didn’t always use the range of x. The most common bad answer was to just draw a logistic curve and then put some numbers on the axes.

A key lesson that I had not conveyed well in class: draw and label the axes first, then draw the curve.

14. You are predicting whether a student passes a class given pre-test score. The fitted model is Pr(Pass) = logit^−1(a_j + 0.1x),

for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

And the solution to question 13:

13. You fit a model of the form: y ∼ x + u_full + (1 | group). The estimated coefficients are 2.5, 0.7, and 0.5, respectively, for the intercept, x, and u_full, with group and individual residual standard deviations estimated as 2.0 and 3.0, respectively. Write the above model as

y_i = a_j[i] + b x_i + ε_i

a_j = A + B u_j + η_j.

(a) Give the estimates of b, A, and B together with the estimated distributions of the error terms.

(b) Ignoring uncertainty in the parameter estimates, give the predictive standard deviation for a new observation in an existing group and for a new observation in a new group.

(a) The estimates of b, A, and B are 0.7, 2.5, and 0.5, respectively, and the estimated distributions are ε ~ normal(0, 3.0) and η ~ normal(0, 2.0).

(b) 3.0 and sqrt(3.0^2 + 2.0^2) = 3.6.
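The arithmetic in (b), sketched in Python (symbols as in the problem: for a new observation in an existing group, only the individual error is unknown; in a new group, the group-level error adds in as well):

```python
import math

sigma_y, sigma_a = 3.0, 2.0  # individual and group residual sds

sd_existing_group = sigma_y                        # a_j is known; only epsilon is new
sd_new_group = math.sqrt(sigma_y**2 + sigma_a**2)  # both eta and epsilon are new

print(sd_existing_group, round(sd_new_group, 1))  # 3.0 3.6
```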

**Common mistakes**

Almost everyone got part (a) correct, and most people got (b) also, but there was some confusion about the uncertainty for a new observation in a new group.

Parul Sehgal has a devastating review of the latest from Naomi Wolf, but while Sehgal is being justly praised for her sharp and relentless treatment of her subject, she stops short before she gets to the most disturbing and important implication of the story.

There’s an excellent case made here that Wolf’s career should have collapsed long ago under the weight of her contradictions and factual errors, but left unexplored is the question of responsibility: how enablers have sustained that career, and how many other journalistic all-stars owe their successes to the turning of blind eyes.

For example, Sehgal’s review ran in the New York Times. One of the most prominent voices at that paper, if not the most prominent, is David Brooks. . . .

Really these columnists should just stick to football writing, where it’s enough just to be entertaining, and accuracy and consistency don’t matter so much.

13. You fit a model of the form: y ∼ x + u_full + (1 | group). The estimated coefficients are 2.5, 0.7, and 0.5, respectively, for the intercept, x, and u_full, with group and individual residual standard deviations estimated as 2.0 and 3.0, respectively. Write the above model as

y_i = a_j[i] + b x_i + ε_i

a_j = A + B u_j + η_j.

(a) Give the estimates of b, A, and B together with the estimated distributions of the error terms.

(b) Ignoring uncertainty in the parameter estimates, give the predictive standard deviation for a new observation in an existing group and for a new observation in a new group.

And the solution to question 12:

12. In the regression above, suppose you replaced height in inches by height in centimeters. What would then be the intercept and slope of the regression? (One inch is 2.54 centimeters.)

The intercept remains the same at -21.51, and the slope is divided by 2.54, so it becomes 0.28/2.54 = 0.11: a change of one centimeter corresponds to 1/2.54 inches.
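One way to see this is that the linear predictor must be unchanged for any given person; a quick numerical check in Python:

```python
# same regression, two parameterizations of height
a, b_in = -21.51, 0.28
b_cm = b_in / 2.54  # one centimeter is 1/2.54 inches

height_in = 70.0
height_cm = 2.54 * height_in

# same person, same linear predictor either way
lp_in = a + b_in * height_in
lp_cm = a + b_cm * height_cm
assert abs(lp_in - lp_cm) < 1e-9

print(round(b_cm, 2))  # 0.11
```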

**Common mistakes**

About half the students got the right answer here; the others guessed in various ways: some changed the intercept as well as the slope, some multiplied by 2.54 instead of dividing. The problem’s trickier than it looks, and to make sure my answer was correct I had to think twice and check that I wasn’t getting it backward.

I’m not quite sure how I could teach the class better so that students would get this question right. Except of course by drilling them, giving them lots of homeworks covering this material.

12. In the regression above, suppose you replaced height in inches by height in centimeters. What would then be the intercept and slope of the regression? (One inch is 2.54 centimeters.)

And the solution to question 11:

11. We defined a new variable based on weight (in pounds):

heavy <- weight > 200

and then ran a logistic regression, predicting “heavy” from height (in inches):

glm(formula = heavy ~ height, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) -21.51   1.60
height        0.28   0.02
---
n = 1984, k = 2

(a) Graph the logistic regression curve (the probability that someone is heavy) over the approximate range of the data. Be clear where the line goes through the 50% probability point.

(b) Fill in the blank: near the 50% point, comparing two people who differ by one inch in height, you’ll expect a difference of ____ in the probability of being heavy.

(a) The x-axis should range from approximately 60 to 80 (most people have heights between 60 and 80 inches), and the y-axis should range from 0 to 1. The easiest way to draw the logistic regression curve is to first figure out where it goes through 0.5. That’s when the linear predictor equals 0, thus -21.51 + 0.28*x = 0, so x = 21.51/0.28 = 77. Then at that point the curve has slope 0.07 (remember the divide-by-4 rule), and that will be enough to get something pretty close to the fitted curve.

(b) As just noted, the divide-by-4 rule gives us an answer of 0.07.
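Both numbers can be checked directly; a short Python sketch, with invlogit written out by hand:

```python
import math

def invlogit(z):
    """Inverse logit: maps the linear predictor to a probability."""
    return 1 / (1 + math.exp(-z))

a, b = -21.51, 0.28
x50 = -a / b            # where the linear predictor is zero
print(round(x50, 1))    # 76.8: the curve crosses 50% at about 77 inches

# slope of the curve at the 50% point vs. the divide-by-4 rule
eps = 1e-6
slope = (invlogit(a + b*(x50 + eps)) - invlogit(a + b*(x50 - eps))) / (2*eps)
print(round(slope, 2), round(b/4, 2))  # 0.07 0.07
```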

**Common mistakes**

In making the graphs, most of the students didn’t think about the range of x, for example having x go from 0 to 100, which doesn’t make sense as there aren’t any people close to 0 or 100 inches tall. To demand that the range of the curve fit the range of the data is *not* just being picky: it changes the entire interpretation of the fitted model because changing the range of x also changes the range of probabilities on the y-axis.

When we interpret powerful as political power, I think it’s clear that Classical Statistics has the most political power, that is, the power to get people to believe things and change policy or alter funding decisions, etc. Today Bayes is questioned at every turn and ridiculed for being “subjective,” with a focus on the prior, or for modeling “belief.” The people currently in power to make decisions about resources are predominantly users of Classical-type methods (hypothesis testing, straw-man NHST specifically, and to a lesser extent maximum likelihood fitting, and in econ difference-in-differences analysis, synthetic controls, robust standard errors, etc., all based on sampling theory, typically without mechanistic models).

The alternative is hard: model mechanisms directly, use Bayes to constrain the model to the reasonable range of applicability, and do a lot of computing to get fitted results that are difficult for anyone without a lot of Bayesian background to understand, and that specifically make a lot of assumptions and choices that are easy to question. It’s hard to argue against “model free inference procedures” that “guarantee unbiased estimates of causal effects” and etc. But it’s easy to argue that some specific structural assumption might be wrong and therefore the result of a Bayesian analysis might not hold…

So from a political perspective, I see Classical Stats as it’s applied in many areas as a way to try to wield power to crush dissent.

My reply:

Yup. But the funny thing is that I think that a lot of the people doing bad science also feel that they’re being pounded by classical statistics.

It goes like this:

– Researcher X has an idea for an experiment.

– X does the experiment and gathers data, would love to publish.

– Because of the annoying hegemony of classical statistics, X needs to do a zillion analyses to find statistical significance.

– Publication! NPR! Gladwell! Freakonomics, etc.

– Methodologist Y points to problems with the statistical analysis, the nominal p-values aren’t correct, etc.

– X is angry: first the statistical establishment required statistical significance, now the statistical establishment is saying that statistical significance isn’t good enough.

– From Researcher X’s point of view, statistics is being used to crush new ideas and it’s being used to force creative science into narrow conventional pathways.

This is a narrative that’s held by some people who detest me (and, no, I’m not Methodologist Y; this might be Greg Francis or Uri Simonsohn or all sorts of people). There’s some truth to the narrative, which is one thing that makes things complicated.

11. We defined a new variable based on weight (in pounds):

heavy <- weight > 200

and then ran a logistic regression, predicting “heavy” from height (in inches):

glm(formula = heavy ~ height, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) -21.51   1.60
height        0.28   0.02
---
n = 1984, k = 2

(a) Graph the logistic regression curve (the probability that someone is heavy) over the approximate range of the data. Be clear where the line goes through the 50% probability point.

(b) Fill in the blank: near the 50% point, comparing two people who differ by one inch in height, you’ll expect a difference of ____ in the probability of being heavy.

And the solution to question 10:

10. For the above example, we then created indicator variables, age18_29, age30_44, age45_64, and age65up, for four age categories. We then fit a new regression:

lm(formula = weight ~ age30_44 + age45_64 + age65up)
             coef.est coef.se
(Intercept)  157.2    5.4
age30_44TRUE  19.1    7.0
age45_64TRUE  27.2    7.6
age65upTRUE    8.5    8.7
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis) and draw the fitted regression model. Again, this graph should be consistent with the above computer output.

The graph of weight vs. age should be identical to that in the previous problem. Fitting a new model does not change the data. The fitted regression model is four horizontal lines: a line from x=18 to x=30 at the level 157.2, a line from x=30 to x=45 at the level 157.2 + 19.1, a line from x=45 to x=65 at the level 157.2 + 27.2, and a line from x=65 to x=90 at the level 157.2 + 8.5.
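The step function described above can be written out directly; a Python sketch (the dictionary and the helper name `fitted_weight` are illustrative, not from the original):

```python
# fitted model: the intercept is the 18-29 baseline; other categories add an offset
coef = {"intercept": 157.2, "age30_44": 19.1, "age45_64": 27.2, "age65up": 8.5}

def fitted_weight(age):
    """Predicted weight (pounds) from the four-category model."""
    if age < 30:
        return coef["intercept"]
    if age < 45:
        return coef["intercept"] + coef["age30_44"]
    if age < 65:
        return coef["intercept"] + coef["age45_64"]
    return coef["intercept"] + coef["age65up"]

# the four horizontal lines of the fitted model
print([round(fitted_weight(a), 1) for a in (25, 40, 50, 70)])
# [157.2, 176.3, 184.4, 165.7]
```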

**Common mistakes**

The biggest mistake was discretizing x. Another common mistake was to draw the regression line and then draw the dots relative to the line, so that there were jumps in the underlying data at the cutpoints in the model. Also, as in the previous problem, when students drew the dots, they almost always didn’t include enough vertical spread, and their scatterplots looked nothing like what the data might look like. Again, when teaching, I need to clarify the distinction between the model and the data.

10. For the above example, we then created indicator variables, age18_29, age30_44, age45_64, and age65up, for four age categories. We then fit a new regression:

lm(formula = weight ~ age30_44 + age45_64 + age65up)
             coef.est coef.se
(Intercept)  157.2    5.4
age30_44TRUE  19.1    7.0
age45_64TRUE  27.2    7.6
age65upTRUE    8.5    8.7
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis) and draw the fitted regression model. Again, this graph should be consistent with the above computer output.

And the solution to question 9:

9. We downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We created a new variable, age10 = age/10. We then fit a regression:

lm(formula = weight ~ age10)
            coef.est coef.se
(Intercept) 161.0    7.3
age10         2.6    1.6
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis). Label the axes appropriately, draw the fitted regression line, and make a scatterplot of a bunch of points consistent with the information given and with ages ranging roughly uniformly between 18 and 90.

The x-axis should go from 18 to 90, or from 0 to 90 and the y-axis should go from approximately 100 to 300, or from 0 to 300. It’s easy enough to draw the regression line, as the intercept and slope are right there. The scatterplot should have enough vertical spread to be consistent with a residual sd of 120. Recall that approximately 2/3 of the points should fall between +/- 1 sd of the regression line in vertical distance.
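The kind of fake data the problem asks students to sketch can be simulated directly; a Python stand-in (the class uses R) that checks the two-thirds rule:

```python
import random

random.seed(0)
a, b, sigma = 161.0, 2.6, 119.7  # intercept, slope per 10 years, residual sd

n = 2009
inside = 0
for _ in range(n):
    age = random.uniform(18, 90)
    fitted = a + b * (age / 10)
    y = fitted + random.gauss(0, sigma)
    # is this point within one residual sd of the regression line?
    if abs(y - fitted) <= sigma:
        inside += 1

print(round(inside / n, 2))  # roughly 0.68: about 2/3 within one sd
```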

**Common mistakes**

Everyone could draw the regression line; nearly nobody could draw a good scatterplot. Typical scatterplots were very tightly clustered around the regression line, not at all consistent with a residual sd of 120 and an R-squared of essentially zero.

I guess we should have more assignments where students draw scatterplots and sketch possible data.

9. We downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We created a new variable, age10 = age/10. We then fit a regression:

lm(formula = weight ~ age10)
            coef.est coef.se
(Intercept) 161.0    7.3
age10         2.6    1.6
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis). Label the axes appropriately, draw the fitted regression line, and make a scatterplot of a bunch of points consistent with the information given and with ages ranging roughly uniformly between 18 and 90.

And the solution to question 8:

8. Out of a random sample of 50 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office.

This is a job for the Agresti-Coull interval. y* = y + 2, n* = n + 4, p* = y*/n* = 2/54 = 0.037, with standard error sqrt(p*(1-p*)/n*) = sqrt((2/54)*(52/54)/54) = 0.026. Estimate is [p* +/- 2se] = [-0.014, 0.088], but the probability can’t be negative, so [0, 0.088] or simply [0, 0.09].
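The same computation as a Python sketch:

```python
import math

y, n = 0, 50
y_star, n_star = y + 2, n + 4  # Agresti-Coull: add 2 successes and 2 failures
p_star = y_star / n_star       # 2/54, about 0.037
se = math.sqrt(p_star * (1 - p_star) / n_star)

lower = max(0.0, p_star - 2*se)  # truncate at zero: a proportion can't be negative
upper = p_star + 2*se
print(round(lower, 2), round(upper, 2))  # 0.0 0.09
```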

**Common mistakes**

Most of the students remembered the Agresti-Coull interval, but some made the mistake of giving confidence intervals that excluded zero (which can’t be right, given that the data are 0/50) or that included negative values.

8. Out of a random sample of 50 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office.

And the solution to question 7:

7. You conduct an experiment in which some people get a special get-out-the-vote message and others do not. Then you follow up with a sample, after the election, to see if they voted. If you follow up with 500 people, how large an effect would you be able to detect so that, if the result had the expected outcome, the observed difference would be statistically significant?

Assume 250 got the treatment and 250 got the control. Then the standard error of the estimated treatment effect is sqrt(0.5^2/250 + 0.5^2/250) = 0.045. An estimate is statistically significant if it is at least 2 standard errors from 0, so the answer to the question is 0.09, an effect of 9 percentage points.
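The same arithmetic in Python (assuming, as in the solution, a 250/250 split and plugging in p = 0.5, the value that maximizes the standard error):

```python
import math

n_treat = n_control = 250
p = 0.5  # conservative proportion for the standard error

# standard error of the difference between two independent proportions
se_diff = math.sqrt(p*(1 - p)/n_treat + p*(1 - p)/n_control)

# an estimate is statistically significant if it is at least 2 se's from 0
detectable = 2 * se_diff
print(round(se_diff, 3), round(detectable, 2))  # 0.045 0.09
```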

**Common mistakes**

Most of the students couldn’t handle this one. One problem was forgivable: I didn’t actually say that half the people got the treatment and half got the control. I guess I should’ve made that clear in the statement of the problem.

But that wasn’t the only issue. Many of the students weren’t clear on how to get started on this one. One key point is that you can plug p=0.5 into the sqrt(p*(1-p)/n) formula.

7. You conduct an experiment in which some people get a special get-out-the-vote message and others do not. Then you follow up with a sample, after the election, to see if they voted. If you follow up with 500 people, how large an effect would you be able to detect so that, if the result had the expected outcome, the observed difference would be statistically significant?

And the solution to question 6:

6. You are applying hierarchical logistic regression on a survey of 1500 people to estimate support for a federal jobs program. The model is fit using, as a state-level predictor, the Republican presidential vote in the state. Which of the following two statements is basically true?

(a) Adding a predictor specifically for this model (for example, state-level unemployment) could improve the estimates of state-level opinion.

(b) It would not be appropriate to add a predictor such as state-level unemployment: by adding such a predictor to the model, you would essentially be assuming what you are trying to prove.

Briefly explain your answer in one to two sentences.

(a) is true, (b) is false. The problem is purely predictive, and adding a good predictor should help (on average; sure, you could find individual examples where it would make things worse, but there’s no reason to think it wouldn’t help in the generically-described example above). When the goal is prediction (rather than estimating regression coefficients which will be given a direct causal interpretation), there’s no problem with adding this sort of informative predictor.

**Common mistakes**

Just about all the students got this one correct.

6. You are applying hierarchical logistic regression on a survey of 1500 people to estimate support for a federal jobs program. The model is fit using, as a state-level predictor, the Republican presidential vote in the state. Which of the following two statements is basically true?

(a) Adding a predictor specifically for this model (for example, state-level unemployment) could improve the estimates of state-level opinion.

(b) It would not be appropriate to add a predictor such as state-level unemployment: by adding such a predictor to the model, you would essentially be assuming what you are trying to prove.

Briefly explain your answer in one to two sentences.

And the solution to question 5:

5. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true?

(a) If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be near zero.

(b) It is not possible to fit an item-response model when you have more questions than students. In order to fit the model, you either need to reduce the number of questions (for example, by discarding some questions or by putting together some questions into a combined score) or increase the number of students in the dataset.

Briefly explain your answer in one to two sentences.

(a) is false. If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be negative.

(b) is false. It’s no problem at all to have more questions than students. Even in a classical regression, even without a multilevel model, this is typically no problem as long as each question is answered by a few different students.

**Common mistakes**

Most of the students had the impression that one of (a) or (b) had to be true, so a common response was to work through one of the two options, figure out that it was false, and then mistakenly conclude that the other one was true. I guess I should rephrase the question. Instead of “Which of the following statements are basically true?”, I could say, “For each of the following statements, say whether it is true or false.”

5. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true?

(a) If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be near zero.

(b) It is not possible to fit an item-response model when you have more questions than students. In order to fit the model, you either need to reduce the number of questions (for example, by discarding some questions or by putting together some questions into a combined score) or increase the number of students in the dataset.

Briefly explain your answer in one to two sentences.

And the solution to question 4:

4. A researcher is imputing missing responses for income in a social survey of American households, using for the imputation a regression model given demographic variables. Which of the following two statements is basically true?

(a) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as rich or poor: A deterministic procedure overstates your certainty, making you more likely to impute extreme values.

(b) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as middle class: By not using the error term, you’ll impute too many values in the middle of the distribution.

Option (a) is wrong and option (b) is right. We discuss this in the missing-data chapter of the book. The point prediction from a regression model gives you something in the middle of the distribution. You need to add noise in order to approximate the correct spread.
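The point is easy to demonstrate by simulation; here is a Python sketch with made-up parameters (the model and numbers are purely illustrative):

```python
import math
import random

random.seed(3)

def sd(v):
    """Sample standard deviation."""
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m)**2 for a in v) / (len(v) - 1))

# fake 'income' data: y = b*x + error (all values made up for illustration)
n, b, sigma = 2000, 1.0, 1.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [b*xi + random.gauss(0, sigma) for xi in x]

deterministic = [b*xi for xi in x]                        # X*beta only
stochastic = [b*xi + random.gauss(0, sigma) for xi in x]  # X*beta + epsilon

# deterministic imputations are too tightly clustered in the middle;
# adding noise recovers the spread of the real data
print(round(sd(y), 2), round(sd(deterministic), 2), round(sd(stochastic), 2))
```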

**Common mistakes**

Almost all the students got this one correct.

I searched up *Tony nominations mean nothing* and I found nothing. So I had to write this.

There are currently 41 theaters that the Tony awards accept when nominating their choices. If we are being as generous as possible, we could say that every one of those theaters will be hosting a performance that fits all of the requirements for an award. The Tony awards have 26 different categories. There are 129 nominations this year, not including the special categories. For a play in this day and age to not get a single nomination is just a testament to its mediocrity. Plays or people can even get multiple nominations in the same category. The Best Featured Actress in a Musical category has these marvelous nominations:

Lilli Cooper, “Tootsie”

Amber Gray, “Hadestown”

Sarah Stiles, “Tootsie”

Ali Stroker, “Oklahoma!”

Mary Testa, “Oklahoma!”

People will frequently get nominated twice in the same category for different pieces!

According to the official Tony Awards website, “A show is only eligible in the season when it first opens, no matter how long it runs on Broadway.” This immediately gets rid of many current shows, and leaves only 21 shows by my counting. I may be slightly wrong, but that is still a very small number. If there are 129 possible nominations for your piece, and you are only 1 out of 29 possibilities, receiving a Tony nomination is not a badge of honor, but a badge of shame. There was recently an article in the New York Times about how King Lear, a show that received mixed reviews, was disappointed that it only got one nomination. I’d like to see if anyone else can help me figure this out.

My reply: OK, so here’s the question. Why so many Tonys for so few shows, which would seem to reduce its value?

The most natural answer is that Tonys and Tony nominations give value to “Broadway theatre” more generally: the different shows are in friendly competition, and more awards and more nominations get the butts in the seats.

But that doesn’t *really* answer the question, as at some point there have to be diminishing returns. The real question is where’s the equilibrium.

Remember that post from a few years ago about the economist who argued that the members of the Motion Picture Academy were irrational because they were giving Oscars to insufficiently popular movies: “One would hope the Academy would at least pay a bit more attention to the people paying the bills. Not only does it seem wrong (at least to this economist) to argue that movies many people like are simply not that good, focusing on the box office would seem to make good financial sense for the Oscars as well”?

The discussion there led to familiar territory in econ-talk: How much should we think that an institution (e.g., the Oscars, the Tonys) is at a sensible equilibrium, kept there by a mixture of rational calculation and the discipline of the market, and how much should we focus on the institution’s imperfections (slowness to change, principal-agent problems, etc.) and suggest improvements?

One comparison point is academic awards. Different academic fields seem to have different rates of giving awards. It would just about always seem to make sense to add an award: for example, if the Columbia stat dept added a best research paper award for its Ph.D. students, I think this would at the margin help the recipients get jobs, more than it would hurt the prospects of the students who didn’t get the award. On balance it would benefit our program. But we don’t have such an award—or, at least, I don’t think we have. Maybe we should. The point is that it doesn’t seem that statistics academia has reached equilibrium when it comes to awards. Political science, that’s another story: they have zillions of awards, all over the place. Equilibrium may well have been reached in that case.

Dan Simpson or Brian Pike might have more thoughts on the specific case of the Tonys. Maybe someone could “at” them?

**P.S.** When I was a kid, nobody cared about the Tonys, Emmys, or Grammys. But every year we watched the Oscars, Miss America, and the Wizard of Oz.

4. A researcher is imputing missing responses for income in a social survey of American households, using for the imputation a regression model given demographic variables. Which of the following two statements is basically true?

(a) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as rich or poor: A deterministic procedure overstates your certainty, making you more likely to impute extreme values.

(b) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as middle class: By not using the error term, you’ll impute too many values in the middle of the distribution.

And the solution to question 3:

Here is a fitted model from the Bangladesh analysis predicting whether a person with high-arsenic drinking water will switch wells, given the arsenic level in their existing well and the distance to the nearest safe well.

glm(formula = switch ~ dist100 + arsenic, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept)  0.00    0.08
dist100     -0.90    0.10
arsenic      0.46    0.04
n = 3020, k = 3

Compare two people who live the same distance from the nearest well but whose arsenic levels differ, with one person having an arsenic level of 0.5 and the other person having a level of 1.0. Approximately how much more likely is this second person to switch wells? Give an approximate estimate, standard error, and 95% interval.

Using the divide-by-4 rule, the expected difference in Pr(switch), per unit change in arsenic level, is approximately 0.46/4 = 0.11 (recall that with the divide-by-4 rule we round down) with standard error 0.01. But we’re looking at a difference of 0.5, so we need to multiply these numbers by 0.5, thus 0.055 with standard error 0.005, and a 95% interval of [0.055 +/- 2*0.005] = [0.045, 0.065].

The divide-by-4 rule works when the predicted probabilities are near the middle of the range, that is, near 50/50. The arsenic example was in the textbook and students should be able to recall that the probabilities of switching are indeed not far from 50%.
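This can be verified against the exact invlogit calculation; a Python sketch (evaluating at dist100 = 0 is my assumption, chosen because the probabilities there are near 50%, where the rule applies):

```python
import math

def invlogit(z):
    """Inverse logit: maps the linear predictor to a probability."""
    return 1 / (1 + math.exp(-z))

b_arsenic, se_arsenic = 0.46, 0.04

# divide-by-4 estimate and standard error, scaled to a 0.5 difference in arsenic
approx = 0.5 * b_arsenic / 4
se = 0.5 * se_arsenic / 4

# exact difference in Pr(switch) between arsenic = 1.0 and 0.5, at dist100 = 0
# (the intercept is 0.00, so the linear predictor is just 0.46 * arsenic)
exact = invlogit(0.46 * 1.0) - invlogit(0.46 * 0.5)

# the approximation and the exact difference agree to a couple thousandths
print(round(approx, 3), round(se, 3), round(exact, 3))
```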

**Common mistakes**

Most of the students had no problem with this one. The ones who made mistakes did so by trying to apply the logistic formula directly and then messing up somewhere. Please please please please please: Invlogit is invlogit. Do not write it as exp(x)/(1 + exp(x)) or as 1/(1 + exp(-x)). Logit is its own function which has as much integrity as log or exp. Understand what logit looks like and you’ll be fine.

Here is a fitted model from the Bangladesh analysis predicting whether a person with high-arsenic drinking water will switch wells, given the arsenic level in their existing well and the distance to the nearest safe well.

glm(formula = switch ~ dist100 + arsenic, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept)  0.00    0.08
dist100     -0.90    0.10
arsenic      0.46    0.04
n = 3020, k = 3

Compare two people who live the same distance from the nearest well but whose arsenic levels differ, with one person having an arsenic level of 0.5 and the other person having a level of 1.0. Approximately how much more likely is this second person to switch wells? Give an approximate estimate, standard error, and 95% interval.

And the solution to question 2:

2. A multiple-choice test item has four options. Assume that a student taking this question either knows the answer or does a pure guess. A random sample of 100 students take the item. 60% get it correct. Give an estimate and 95% confidence interval for the percentage in the population who know the answer.

Let p be the proportion of students in the population who would get the question correct. p has an estimate of 0.6 and a standard error of sqrt(0.5^2/100) = 0.05 (using the conservative value 0.5 for the proportion).

Let theta be the proportion of students in the population who actually know the answer. Based on the description above, we can write:

p = theta + 0.25*(1 – theta) = 0.25 + 0.75*theta,

thus theta = (p – 0.25)/0.75.

This gives us an estimate of theta of (0.6 – 0.25)/0.75 = 0.47 and a standard error of 0.05/0.75 = 0.07, so the 95% confidence interval is [0.47 +/- 2*0.07] = [0.33, 0.61].

**Common mistakes**

Most of the students had no idea what to do here, but some of them figured out how to solve for theta. None of them got the standard error correct. The students who figured out the estimate of 0.47 simply computed a standard error as sqrt(0.47*(1 – 0.47)/100). Kinda frustrating. I’m not really sure how to teach this, although of course I could just assign this particular problem as homework and then maybe students would remember the general point about estimates and standard errors under transformations.

I’m also thinking this would be a good example to program up in Stan because then all these difficulties are handled automatically.

**P.S.** There was some question about how you can convince yourself that the above answer is correct. Here’s how you can do it using simulation:

Start by assuming a true value of theta. It shouldn’t matter exactly what value we choose; say 0.40 as this is comfortably within our confidence interval. Then Pr(correct answer) is 0.25 + 0.75*theta = 0.55 in this case. Now it’s easy to simulate 100 students’ responses: y = rbinom(1, 100, 0.55). Do this 1000 times, so you have 1000 simulations from the sampling distribution of y, conditional on the assumed true value of theta.

Now for each of these 1000 simulated y’s, compute p_hat = y/100 and theta_hat = (p_hat – 0.25)/0.75.

And now we’re ready to use these simulations to approximate the sampling distribution of theta_hat | theta. Compute the mean of the 1000 theta_hat’s; this will be approximately 0.40, because theta_hat is an unbiased estimate of theta. Compute the sd of the 1000 theta_hat’s, and you’ll get something close to 0.07, because that’s the standard error we worked out above.
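In R, the whole simulation sketched above is just a few lines (0.40 is the assumed true value of theta, and the seed is arbitrary):

```r
# Simulation check of the estimate and standard error for theta.
set.seed(123)
theta <- 0.40                      # assumed true value
p <- 0.25 + 0.75*theta             # Pr(correct answer) = 0.55
y <- rbinom(1000, 100, p)          # 1000 simulated groups of 100 students
p_hat <- y/100
theta_hat <- (p_hat - 0.25)/0.75
mean(theta_hat)                    # close to 0.40
sd(theta_hat)                      # close to 0.07
```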

]]>There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data. The analogy is to a fish tank full of piranhas: it won’t take long before they eat each other.

And she said, wait, you better check to see if this is right. Are piranhas cannibals? That doesn’t seem right, if they’re cannibals they’ll just eat each other and die out. But if they’re not cannibals, the analogy doesn’t work.

So when I got home, I looked it up. I googled *are piranhas cannibals*. And this is the first thing that came up:

So my analogy is safe, and we’re good to go.

**P.S.** I guess I could’ve titled the above post, Are Piranhas Cannibals?, but that would’ve violated the anti-SEO principles of this blog. Our general rule is, make the titles as boring as possible, then anyone who clicks through to read the post will be pleasantly surprised by all the entertainment value we offer.

2. A multiple-choice test item has four options. Assume that a student taking this question either knows the answer or does a pure guess. A random sample of 100 students take the item. 60% get it correct. Give an estimate and 95% confidence interval for the percentage in the population who know the answer.

And the solution to question 1:

1. A randomized experiment is performed within a survey. 1000 people are contacted. Half the people contacted are promised a $5 incentive to participate, and half are not promised an incentive. The result is a 50% response rate among the treated group and 40% response rate among the control group.

(a) Give an estimate and standard error of the average treatment effect.

(b) Give code to fit a logistic regression of response on the treatment indicator. Give the complete code, including assigning the data, setting up the variables, etc. It is not enough to simply give the one line of code for running the logistic regression.

(a) The estimate is 0.5 – 0.4 = 0.1, and the standard error is sqrt(0.5^2/500 + 0.5^2/500) = 0.03.

(b) Here’s some code:

n <- 1000
z <- rep(c(1,0), c(n/2, n/2))
y <- rep(c(1,0,1,0), c(0.5, 0.5, 0.4, 0.6)*n/2)
library("rstanarm")
fit <- stan_glm(y ~ z, family=binomial(link="logit"))
print(fit)

The estimated logistic regression coefficient is approximately 0.4 with a standard error of approximately 0.12, which is consistent with the answer from (a) after applying the divide-by-4 rule.
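For readers without rstanarm installed, here’s a sketch of the same check with base R’s glm (a classical fit rather than the Bayesian one above), using the same data construction:

```r
# Same data as above, fit with base R's glm instead of stan_glm.
n <- 1000
z <- rep(c(1, 0), c(n/2, n/2))
y <- rep(c(1, 0, 1, 0), c(0.5, 0.5, 0.4, 0.6)*n/2)
fit <- glm(y ~ z, family = binomial(link = "logit"))
coef(summary(fit))["z", 1:2]   # estimate about 0.41, se about 0.13
```

And 0.41/4 is approximately 0.1, again consistent with the answer from (a).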

**Common mistakes**

Most of the students in the class got part (a) correct, but some were sloppy and did some sort of sqrt(p*(1-p)/n) calculation with n=1000, not recognizing that the estimate was a *difference* between proportions.

For (b), I was expecting some syntax errors---it was an in-class exam without computers, so I was just asking them to write some code as best they could---but a common mistake was that many, actually most, students did it by simulating fake data with probabilities 0.5 and 0.4, rather than by entering the actual data as in the code above. Perhaps I needed to write the question better to make it more clear what was required.

]]>Here’s the first question on the test:

1. A randomized experiment is performed within a survey. 1000 people are contacted. Half the people contacted are promised a $5 incentive to participate, and half are not promised an incentive. The result is a 50% response rate among the treated group and 40% response rate among the control group.

(a) Give an estimate and standard error of the average treatment effect.

(b) Give code to fit a logistic regression of response on the treatment indicator. Give the complete code, including assigning the data, setting up the variables, etc. It is not enough to simply give the one line of code for running the logistic regression.

See tomorrow’s post for the solution and a discussion of common errors.

]]>We present a selection criterion for the Euclidean metric adapted during warmup in a Hamiltonian Monte Carlo sampler that makes it possible for a sampler to automatically pick the metric based on the model and the availability of warmup draws. Additionally, we present a new adaptation inspired by the selection criterion that requires significantly fewer warmup draws to be effective. The effectiveness of the selection criterion and adaptation are demonstrated on a number of applied problems. An implementation for the Stan probabilistic programming language is provided.

And here’s their conclusion:

Adapting an effective metric is important for the performance of HMC. This paper outlines a criterion that can be used to automate the selection of an efficient metric from an array of options. In addition, we present a new low-rank adaptation scheme that makes it possible to sample effectively from highly correlated posteriors, even when few warmup draws are available. The selection criterion and the new adaptation are demonstrated to be effective on a number of different models.

All of the necessary eigenvalues and eigenvectors needed to evaluate the selection criterion and build the new adaptation can be computed efficiently with the Lanczos algorithm, making this method suitable for models with large numbers of parameters.

This research looks like it will have a big practical impact.

]]>1. Exhortations to look at your data, make graphs, do visualizations and not just blindly follow statistical procedures.

2. Criticisms and suggested improvements for graphs, both general (pie-charts! double y-axes! colors! labels!) and specific.

3. Instruction and examples of how to make effective graphs using available software.

4. Demonstration or celebration of particular beautiful and informative graphs.

5. Theorizing about what makes certain graphs work well or poorly.

We’ve done lots of all these over the years—this blog has about 600 posts on statistical graphics—as have Kaiser Fung and others, following the lead of Tukey, Cleveland, Tufte, Wainer, etc. When writing about graphics, the above five things are what we do.

Almost always when we and others write about statistical graphics, it is in the spirit of exhortation, criticism, celebration, demonstration, or instruction—but not of open inquiry.

Yes, my views on graphics have evolved, I’m open to new ideas, and in some of my writings I’ve thought hard about the virtues of other perspectives (as in this paper with Antony Unwin on different goals in visualization)—but just about always we’re writing to advance some argument or to simply celebrate the virtues of graphical display.

There’s been some literature on comparative evaluation of different graphical approaches, but much of what I’ve seen in that area hasn’t been so impressive. It’s hard to quantitatively evaluate something as slippery as statistical graphics, given the many goals that graphs serve.

With that as background, I was very happy to read this post, “Data is Personal. What We Learned from 42 Interviews in Rural America,” where Evan Peck describes a study he did with Sofia Ayuso and Omar El-Etr:

We asked 40+ people from rural Pennsylvania to rank a set of 10 graphs. Then we talked about it.

At a farmers market, a construction site, and in university dining facilities, we interviewed 42 members of our community about graphs and charts to understand how they understand and engage with data.

We showed people 10 data visualizations about drug use that varied in their visual encodings, their style, and their source.

We asked them to rank the 10 graphs (without source information!) based on their usefulness.

After revealing the sources of the graphs, people were given an opportunity to rerank their visualizations.

The people we talked to weren’t just young and weren’t just in college. They were diverse in their education (60% never completed college) and age (26% were 55+, 33% were between 35 and 44). Through many hours of conversations, here is what we found . . .

Their most interesting findings are qualitative. You can read Peck’s post to learn more.

The point I want to make here, beyond that I found the stories fascinating, is that Peck is demonstrating a new way of writing about statistical graphics, going beyond the five standard approaches listed above.

This suggests to me that our thinking about statistical graphics is moving to a new level of sophistication, and I think that’s very important, that we can go beyond the usual tropes of exhortation, celebration, criticism, instruction, and theorizing.

This is big news for those of us who work in this field.

]]>Journal editing is a volunteer job, and people sign up for it because they want to publish exciting new work, or maybe because they enjoy the power trip, or maybe out of a sense of duty—but, in any case, they typically aren’t in it for the controversy.

Jon Baron, editor of the journal Judgment and Decision Making, saw this and wrote:

In my case, the reasons are “all three”! But it isn’t a matter of “exciting new work” so much as “solid work with warranted conclusions, even if boring”. This is a very old-fashioned experimental psychologist’s approach. Boring is good. And the “power” is not a trivial consideration; many things that academics do have the purpose of influencing their fields, and editing, for me, beats teaching, blogging, writing trade books, giving talks, or even . . . (although it does not beat writing a textbook).

I’ve been asked many times to edit journals but I’ve always said no because I’ve felt that, personally, I can make better contributions to the field as a loner. Editing a journal would require too much social skill for me. We each should contribute where we can.

Also recall this story:

I remember, close to 20 years ago, an economist friend of mine was despairing of the inefficiencies of the traditional system of review, and he decided to do something about it: He created his own system of journals. They were all online (a relatively new thing at the time), with an innovative transactional system of reviewing (as I recall, every time you submitted an article you were implicitly agreeing to review three articles by others) and a multi-tier acceptance system, so that very few papers got rejected; instead they were just binned into four quality levels. And all the papers were open-access or something like that.

The system was pretty cool, but for some reason it didn’t catch on—I guess that, like many such systems, it relied a lot on continuing volunteer efforts of its founder, and perhaps he just got tired of running an online publishing empire, and the whole thing kinda fell apart. The journals lost all their innovative aspects and became just one more set of social science publishing outlets. My friend ended up selling his group of journals to a traditional for-profit company, they were no longer free, etc. It was like the whole thing never happened.

A noble experiment, but not self-sustaining. Which was too bad, given that he’d put so much effort into building a self-sustaining structure.

Perhaps one lesson from my friend’s unfortunate experience is that it’s not enough to build a structure; you also need to build a community.

Another lesson is that maybe it can help to lean on some existing institution. This guy built up his whole online publishing company from scratch, which was kinda cool, but then when he no longer felt like running it, it dissolved, and then he ended up with a pile of money, which he probably didn’t need and he might never get around to spending, while losing the scientific influence, which is more interesting and important. Maybe it would’ve been better for him to have teamed up with an economics society, or with some university, governmental body, or public-interest organization.

Good intentions are not enough, and even good intentions + a lot of effort aren’t enough. You have to work with existing institutions, or create your own. This blog works in part because it piggybacks off the existing institution of blogging. Nowadays there isn’t much blogging anymore, but the circa 2005-era blogosphere was helpful in giving us a sense of how to set up our community. We built upon the strengths of the blogosphere and avoided some of the pitfalls.

Similarly this is the challenge of reforming scientific communication: to do something better while making use of existing institutions and channels whereby researchers donate their labor.

]]>It’s 3pm Pacific time in CSB (Cognitive Science Building) 003 at the University of California, San Diego.

This is what they asked for in the invite:

Our Friday afternoon COGS200 series has been a major foundation of the Cognitive Science community and curriculum in our department for decades and is attended by faculty and students from diverse fields (e.g. anthropology, human-computer-interface/design, AI/machine learning, neuroscience, philosophy of mind, psychology, genetics, etc).

One of the goals of our Spring quarter series is to expose attendees to research on the cultural practices surrounding data acquisition, analysis, and interpretation. In particular, we were hoping to have a section exploring current methods in statistical inference, with an emphasis on designing analyses appropriate to the question being asked. If you are interested and willing, we would love for you to share your expertise on multilevel / hierarchical modeling—as well as your more general perspective on how scientists can better deploy statistical models for conducting good, replicable science. (Relevant papers that come to mind include your 2016 paper on “multiverse analyses”, as well as your 2017 “Abandon statistical significance” paper.)

I’m still not sure what’s the best thing to talk about. I guess I’ll start with what’s in that above paragraph and then go from there.

]]>Now that abandoning significance and embracing uncertainty is in the air, we think this package, which runs in R or Stata, may be of interest to both you and your readers.

Concurve plots consonance curves, p-value functions, and S-value functions to allow readers and researchers to get a better feel of the range of values with which their data are compatible. The package is able to do this for everything from mean differences to meta-analytic estimates.

These confidence intervals aren’t really my thing—I prefer a Bayesian approach—but I’m sharing this as it might interest some of you.

]]>How does the current replication crisis, along with other recent psychological trends, affect scientific creativity? To answer this question, we consider current debates regarding replication through the lenses of creativity research and theory. Both scientific work and creativity require striking a balance between ideation and implementation and between freedom and constraints. However, current debates about replication and some of the emerging guidelines stemming from them threaten this balance and run the risk of stifling innovation.

This claim is situated in the context of a fight in psychology between the traditionalists (who want published work to stand untouched and respected for as long as possible) and replicators (who typically don’t trust a claim until it is reproduced by an outside lab).

Rather than get into this debate right here, I’d like to step back and consider the proposal of Kaufman and Glǎveanu on its own merits.

I’m 100% with them on reducing barriers to creativity, and I think that journals in psychology and elsewhere should start by not requiring “p less than 0.05” to publish things.

Nothing is stopping researchers such as the authors of the above paper from publishing their work without replication. So I’m not quite sure what they’re complaining about. They don’t like that various third parties are demanding they replicate their work, but why can’t they just ignore these demands?

Indeed, as I wrote above, I think the barriers to publication should be lowered, not raised. And if an Association for Psychological Science journal doesn’t want to publish your article (perhaps because you don’t have personal connections with the editors; see P.S. below), then you can publish it in some other journal.

If you flip a coin 6 times and get four heads, and you’d like to count that as evidence for precognition or telekinesis, and publish that somewhere, then go for it.

As long as you clearly and openly present your data, evidence, and argument, it seems fine to me to publish whatever you’ve got. And if others care enough, they can do their own replications. Not your job, no problem.

What strikes me is that the authors of the above article, and other people who present similar anti-replication arguments, are *not* merely saying they want the freedom to be creative. They, and their colleagues and students, already have that freedom. And they already have the freedom to publish un-preregistered, un-replicated work in top journals; they do it all the time.

So what’s the problem?

It seems that what these people really are pushing for is **the suppression of criticism**. It’s not that they want to publish in Psychological Science (which, in its online version, could theoretically publish unlimited numbers of papers); it’s that they don’t want the rest of us publishing there.

It’s all about status (and money and jobs and fame). Publishing in Psychological Science and PNAS has value because these journals reject a lot of papers. They’re yammering on about creativity—but nothing’s getting in the way of them being creative and conducting and publishing their unreplicated studies. No, what they want is to be able to: (a) perform these unreplicated studies, (b) publish them as is, (c) get tenure, media exposure, etc., and (d) deny legitimacy to criticism from outside. The key steps are (c) and (d), and for these they need to play gatekeeper, to maintain scarcity by preserving their private journals such as Psychological Science and PNAS for themselves and their friends, and to shout down dissenting voices from inside and outside their profession.

Innovation is not being “stifled.” What’s being stifled is their ability to have their shaky work celebrated without question within academia and the news media, their ability to dole out awards and jobs to their friends, etc.

Freedom of speech means freedom of speech. It does not mean freedom from criticism.

**P.S.** I have no idea how much reviewing happened on the above-linked paper before it was published. Here’s what it says at the end of the article:

And here’s something that the first author of the article posted on the internet recently:

This last bit is interesting as it suggests that Kaufman does not understand the Javert paradox. He’s criticizing people who “devote their time” to criticism, without recognizing that, in the real world, if you care about something and want it to be understood, you have to “devote time” to it. In the particular case under discussion, people criticized Sternberg’s policies quietly, and Sternberg responded by brushing the criticism aside. Then the critics followed up with more criticism. Sure, they could’ve just given up, but they didn’t, because they thought the topic was important.

Flip it around. Why did Kaufman and Glǎveanu write the above-linked article? It’s because they think psychology is important—important enough that they want to stop the implementation of policies that they think will slow down research in the field. Fair enough. One might disagree with them, but we can all respect the larger goal, and we can all respect that these authors think the larger goal is important enough, that they’ll devote time to it. Similarly, people who have criticized Sternberg’s policy of filling up journals with papers by himself and his friends, and suppressing dissent, have done so because they too feel that psychology is important—important enough that they want to stop the implementation of policies that they think will slow down research in the field. It’s the same damn thing. Susan Matthews put it well: We need to normalize the pursuit of accuracy as a good-intentioned piece of the scientific puzzle.

**P.P.S.** I think I see another problem. In the reference list to the above-linked paper, I see this book:

Kaufman A. B., Kaufman J. C. (Eds.). (2017). Pseudoscience: The conspiracy against science. Cambridge, MA: MIT Press.

**P.P.P.S.** Just to clarify my recommendation to “publish everything”: I do think reviewing is valuable. I just think it should be done after publication. Put everything on Arxiv-like servers, then “journals” can do the review process, where the positive outcome of a review is “endorsement,” not “publication.” Post-publication reviewers can even ask for changes as a condition of endorsement, in the same way that journals currently ask for changes as a condition of publication.

The advantages of publishing first, reviewing later, are: (a) papers aren’t sitting in limbo for years during the review process, and (b) post-publication review can concentrate on the most important papers, rather than, as now, so much of the effort going into reading and reviewing papers that just about no one will ever read.

For more, see:

An efficiency argument for post-publication review

and

]]>He [Peter Ellis] started forecasting elections in New Zealand as a way to learn how to use Stan, and the hobby has stuck with him since he moved back to Australia in late 2018.

You may remember Peter from my previous post on his analysis of NZ traffic crashes.

**The talk**

*Speaker:* Peter Ellis

*Title:* Poll position: statistics and the Australian federal election

*Abstract:* The result of the Australian federal election in May 2019 stunned many in the politics-observing class because it diverged from a long chain of published survey results of voting intention. How surprising was the outcome? Not actually a complete outlier; about a one in six chance, according to Peter Ellis’s forecasting model for Free Range Statistics. This seminar will walk through that model from its data management (the R package ozfedelect, built specifically to support it), the *state-space model written in Stan and R that produces the forecasts*; and its eventual visualisation and communication to the public. There are several interesting statistical issues relating to how we translate crude survey data into actual predicted seats, and some even more *interesting communication issues about how all this is understood by the public*. This talk is *aimed at those with an interest in one or more of R, Stan, Bayesian modelling and forecasts, and Australian voting behaviour*.

*Location:* 11am, 31 May 2019. Room G03, Learning and Teaching Building, 19 Ancora Imparo Way, Clayton Campus, Monash University [Melbourne, Australia]

**The details**

Ellis’s blog, Free Range Statistics, has the details of the Australian election model and much much more.

You can also check out his supporting R package, ozfedelect, on GitHub.

**From hobbyist to pundit**

Ellis’s hobby led to his being quoted by *The Economist* in a recent article, Did pollsters misread Australia’s election or did pundits?. Quite the hobby.

**But wait, there’s more…**

There are a lot more goodies on Peter Ellis’s blog, both with and without Stan.

**A plea**

I’d like to make a plea for a Stan version of the Bradley-Terry model (the statistical basis of the Elo rating system) for predicting Australian Football League matches. It’s an exercise somewhere in the regression sections of Gelman et al.’s *Bayesian Data Analysis* to formulate the model (including how to extend to ties). I have a half-baked Bradley-Terry case study I’ve been meaning to finish, but would be even happier to get an outside contribution! I’d be happy to share what I have so far.
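For anyone who wants to experiment before a Stan version exists, here’s a minimal sketch of a Bradley-Terry fit in base R on invented data (the team abilities and the round-robin schedule below are made up purely for illustration; a Stan version would additionally put priors on the abilities and handle ties):

```r
# Bradley-Terry sketch: Pr(i beats j) = invlogit(ability_i - ability_j),
# fit as a logistic regression with a +1/-1 design matrix.
set.seed(1)
n_teams <- 4
ability <- c(1.0, 0.4, -0.3, -1.1)          # invented "true" abilities
games <- t(combn(n_teams, 2))               # all pairings
games <- games[rep(1:nrow(games), 50), ]    # 50 rounds of a round robin
p_win <- plogis(ability[games[, 1]] - ability[games[, 2]])
y <- rbinom(nrow(games), 1, p_win)          # 1 if the first-listed team wins
X <- matrix(0, nrow(games), n_teams)
X[cbind(seq_len(nrow(games)), games[, 1])] <- 1
X[cbind(seq_len(nrow(games)), games[, 2])] <- -1
fit <- glm(y ~ X[, -1] - 1, family = binomial)  # team 1's ability pinned at 0
coef(fit)   # estimated abilities of teams 2-4 relative to team 1
```

Pinning one team’s ability at zero (by dropping its column) identifies the model, since only ability differences matter.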

[edit: fixed spelling of “Elo”]

]]>The above image, from T. J. Mahr, is a cleaned-up version of this graph:

which in turn is a slight improvement on a graph posted by Dan Goldstein (with R code!) which came from Ashton Anderson.

The original looks like this:

This is just fine, but I had a few changes to make. I thought the color scheme could be improved, and I wanted to change the order of the pieces on the graph: it didn’t seem quite right to start with the bishop. I’d use some order such as Pawn, Knight, Bishop, Queen, Rook, Castling, King, which is roughly the order in which pieces get moved (except that I’ve put Castling between Rook and King because it seems to make sense to go there).

I wonder how Anderson came up with the order in the above graph. Let me look at the code . . . OK, I see, it’s alphabetical! (B, K, N, O, P, Q, R). We don’t like alphabetical order.

But enuf complaining: I should be able to go to the code and clean things up. And, a half hour later, here it is, my (slightly) adapted code:

library(tidyverse)
theme_set(theme_minimal())
mt <- read.csv(url('https://gist.githubusercontent.com/ashtonanderson/cfbf51e08747f60472ee2132b0d35efb/raw/80acd2ad7c0fba4e85c053e61e9e5457137e00ee/moveno_piecetype_counts'))
mt$piece_type <- factor(mt$piece_type, levels=c("P","N","B","Q","R","O","K"))
mt <- mt %>%
  group_by(move_number) %>%
  mutate(tot = sum(count), frac = count/tot)
p <- ggplot(mt %>% filter(move_number <= 125), aes(move_number, frac)) +
  geom_area(aes(fill = piece_type), position = 'stack') +
  scale_fill_brewer(type='qual', palette=3, name='Piece type',
    labels=c("Pawn","Knight","Bishop","Queen","Rook","Castling","King")) +
  theme(panel.border=element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  xlab('Move number') + ylab('') +
  scale_y_continuous(labels = scales::percent, breaks=seq(0,1,0.2))
p

The only things I did were to change the order of the pieces and cut down on the y-axis labeling (I'd also like to add tick marks and change the sizes and locations of the axis labels, but I don't know how to do that in ggplot2). Also, just for laffs, I extended the x-axis to 125 moves, cos why stop at 80?

The result is the second graph above.

I prefer it to the original. To me, the generally monotone pattern allows me to see what's happening more clearly, whereas in the original, I had to spend a lot of time going back and forth between the legend and the curves. Even better would be to label the filled area directly; I don't know how to do that in ggplot2 either, but I'm sure it's easy enough for those who know the proper function call, and indeed Mahr did that for us to produce the graph shown at the top of the page.

Also there's some glitch where there's some white space in some of the early moves. I don't know where that's coming from, but I see some of it in the original graph too.

In any case, hats off to Anderson for posting his data and code (and Goldstein for sharing) so that the rest of us can easily play with it all.

**P.S.** Anne Pier Salverda made some new graphs for us:

And here's the code:

library(tidyverse)
library(ggridges)
theme_set(theme_minimal())
mt <- read.csv(url('https://gist.githubusercontent.com/ashtonanderson/cfbf51e08747f60472ee2132b0d35efb/raw/80acd2ad7c0fba4e85c053e61e9e5457137e00ee/moveno_piecetype_counts'))
mt$piece_type <- factor(
  mt$piece_type,
  levels = c("P","N","B","Q","R","O","K"),
  labels = c("Pawn", "Knight", "Bishop", "Queen", "Rook", "Castling", "King")
)
n_games = sum(mt[mt$move_number == 1, "count"])
mt <- mt %>%
  group_by(move_number) %>%
  mutate(
    tot = sum(count),
    frac = count / tot,
    frac_games = count / n_games
  ) %>%
  ungroup()
ggplot(mt %>% filter(move_number < 100),
       aes(x = move_number, y = piece_type, height = frac)) +
  geom_ridgeline(stat = "identity", col = "gray60", fill = "gray90") +
  labs(title = "Normalized by number of surviving games", x = "Move number", y = "")
ggplot(mt %>% filter(move_number < 100),
       aes(x = move_number, y = piece_type, height = frac_games)) +
  geom_ridgeline(stat = "identity", col = "gray60", fill = "gray90") +
  labs(title = "Normalized by total number of games", x = "Move number", y = "")
mt %>%
  filter(move_number < 125) %>%
  group_by(move_number) %>%
  summarize(n_games = sum(count)) %>%
  ggplot(aes(x = move_number, y = n_games)) +
  geom_line(col = "gray60") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Total number of moves", y = "Number of games")

Maybe we'd also like to see that first set of plots redone, normalizing by the number of pieces of that type remaining in the game.

]]>This is just disgraceful: powerful academics using their influence to suppress (“clamp down on”) dissent. They call us terrorists, they lie about us in their journals, and they plot to clamp down on us. I can’t say at this point that I’m surprised to see this latest, but it saddens and angers me nonetheless to see people who could be research leaders but who instead are saying:

We’re working for the clampdown

We will teach our twisted speech

To the unbelievers

We will train our blue-eyed men

To the unbelievers

What are we gonna do now?

]]>I just finished reading your oped article on reproducibility in science. As an experimental scientist – more precisely a chemical crystallographer – I have had to deal with this kind of situation a number of times, and at least two examples may serve as the possible exceptions to your rules.

One of the beauties of x-ray crystallography is the internal self-consistent confirmation of the result of an x-ray crystal structure determination. X-ray crystal structure determinations are today used (and increasingly required by journals) as essentially absolute proof of the molecular structure of a new compound. With the result of a determination there is no doubt that the crystal from which the data were obtained contained molecules with the composition and structure that can be represented by a three dimensional ball and stick model. There occasionally are some problems with measurements and refinements of structures, but they are a very, very small fraction of the mass of crystal structures (now nearing one million in the Cambridge Structural Database, the archive for such results) and those problems are almost always readily recognizable by a skilled practitioner.

Having said that, in my own experience now approaching 50 years I have had two situations in which a chemical reaction yielded a totally unexpected result with the production of one single crystal. In both cases the structure of the single crystal was not only unexpected but the structures were polymorphs of materials for which other, totally different structures were already known. Thus reactions “gone awry” yielded unpredicted and unexpected results, and the veracity of those results is essentially incontestable, and has not been challenged in any way by referees of the resulting publication.

However – and here is where I come to the point of your op-ed piece – while I have total confidence in the results of those experiments I do have serious doubts about our ability to reproduce the experiments that led to the formation of those crystals. In the interest of full disclosure we have admitted this “failing” in our papers, but that does not detract from the significance of the results of the crystal structure determination. Thus, there may be cases, at least in chemistry, where the conditions of the experiment may be difficult – but certainly not impossible! – to reproduce, but the result is still valid.

Bernstein adds:

I am reviewing some old work and just recalled a 1972 paper we wrote on a crystal structure determination of one single crystal that took an entire year to grow (while the chemist was away on sabbatical). Until then the material had resisted all attempts to crystallize it. I am attaching a copy of that paper (on muconic acid). Our best example is the second polymorph of tetrathiafulvalene, which we published in 1994. I am attaching a reprint of that paper as well. The third example is ammonium hydrogen succinate which actually is in press, so I don’t have a reprint to send you (yet!).

I don’t know anything about crystallography except that James Watson didn’t think his colleague had good fashion sense, but I can respect the general idea of hard-to-replicate results, as these arise all the time in observational social science, where you can’t just spin the wheel and get data on N more elections, or recessions, or wars, or whatever.

Lee could’ve fought on the Union side in the Civil War. Or he could’ve saved a couple hundred thousand lives by surrendering his army at some point, once he realized they were going to lose. But, conditional on fighting on, he had to innovate. He had to gamble, over and over again, because that was his only chance.

Similarly for Trump. There was no reason he had to run for president. And, once he had the Republican nomination, he still could’ve stepped down. But, conditional on him running for president with all these liabilities (he’s unpopular, he leads an unpopular party, and he has a lot of legal issues), he’s had to use unconventional tactics, and to continue to use unconventional tactics. Like Lee, there’s no point at which he could rest on his successes.

It’s sunny, I’m in England, and I’m having a very tasty beer, and Lauren, Andrew, and I just finished a paper called *The experiment is just as important as the likelihood in understanding the prior: A cautionary note on robust cognitive modelling.*

So I guess it’s time to resurrect a blog series. On the off chance that any of you have forgotten, the Against Arianism series focusses on the idea that, in the same way that Arianism^{1} was heretical, so too is the idea that priors and likelihoods can be considered separately. Rather, they are *consubstantial*–built of the same probability substance.

There is no new thing under the sun, so obviously this has been written about a lot. But because it’s my damn blog post, I’m going to focus on a paper Andrew, Michael, and I wrote in 2017 called The Prior Can Often Only Be Understood in the Context of the Likelihood. This paper was dashed off in a hurry and under deadline pressure, but I quite like it. But it’s also maybe not the best place to stop the story.

**An opportunity to comment**

A few months back, the fabulous Lauren Kennedy was visiting me in Toronto on a different project. Lauren is a postdoc at Columbia working partly on complex survey data, but her background is quantitative methods in psychology. Among other things, we saw a fairly regrettable (but excellent) Claire Denis movie about vampires^{2}.

But that’s not relevant to the story. What is relevant was that Lauren had seen an open invitation to write a comment on a paper in Computational Brain & Behaviour about *Robust ^{3} Modelling in Cognitive Science* written by a team of cognitive scientists and researchers in scientific theory, philosophy, and practice (Michael Lee, Amy Criss, Berna Devezer, Christopher Donkin, Alexander Etz, Fábio Leite, Dora Matzke, Jeffrey Rouder, Jennifer Trueblood, Corey White, and Joachim Vandekerckhove).

Their bold aim to sketch out the boundaries of good practice for cognitive modelling (and particularly for the times where modelling meets data) is laudable, not least because such an endeavor will always be doomed to fail in some way. But the act of stating some ideas for what constitutes best practice gives the community a concrete pole to hang this important discussion on. And Computational Brain & Behaviour recognized this and decided to hang an issue off the paper and its discussions.

The paper itself is really thoughtful and well done. And obviously I do not agree with *everything* in it, but that doesn’t stop me from the feeling that wide-spread adoption of their suggestions would definitely make quantitative research better.

But Lauren noticed one tool that we have found extremely useful that wasn’t mentioned in the paper: *prior predictive checks*. She asked if I’d be interested in joining her on a paper, and I quickly said yes!

**It turns out there is another BART**

The best thing about working with Lauren on this was that she is a legit psychology researcher so she isn’t just playing in someone’s back yard, she owns a patch of sand. It was immediately clear that it would be super-quick to write a comment that just said “you should use prior predictive checks”. But that would miss a real opportunity. Because cognitive modelling isn’t quite the same as standard statistical modelling (although in the case where multilevel models are appropriate Daniel Schad, Michael Betancourt, and Shravan Vasishth just wrote an excellent paper on importing general ideas of good statistical workflows into Cognitive applications).

Rather than using our standard data analysis models, a lot of the time cognitive models are generative models for the cognitive process coupled (sometimes awkwardly) with models for the data that is generated from a certain experiment. So we wanted an example model that is more in line with this practice than our standard multilevel regression examples.

Lauren found the Balloon Analogue Risk Task (BART) in Lee and Wagenmakers’ book Bayesian Cognitive Modeling: A Practical Course, which conveniently has Stan code online^{4}. We decided to focus on this example because it’s fairly easy to understand and has all the features we needed. But hopefully we will eventually write a longer paper that covers more common types of models.

BART is an experiment that makes participants simulate pumping balloons with some fixed probability of popping after every pump. Every pump gets them more money, but they get nothing if the balloon pops. The model contains a parameter for risk-taking behaviour and the experiment is designed to see if the risk-taking behaviour changes as a person gets more drunk. The model is described in the following DAG:

**Exploring the prior predictive distribution**

Those of you who have been paying attention will notice the Uniform(0,10) priors on the *logit* scale and think that these priors are a little bit terrible. And they are! Direct simulation from the model leads to absolutely silly predictive distributions for the number of pumps in a single trial. Worse still, the pumps are *extremely* uniform across trials. Which means that the model thinks, *a priori*, that it is quite likely for a tipsy undergraduate to pump a balloon 90 times in each of the 20 trials. The mean number of pumps is a much more reasonable 10.

Choosing tighter upper bounds on the uniform priors leads to more sensible prior predictive distributions, but then Lauren went to test out what changes this made to inference (in particular looking at how it affects the Bayes factor against the null that the parameters were the same across different levels of drunkenness). It made *very *little difference. This seemed odd so she started looking closer.
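A rough simulation makes the point concrete. This is a simplified stand-in, not the actual Lee–Wagenmakers BART model: the geometric keep-pumping rule, the 500-pump cap, and the Uniform(0,2) comparison bound are all illustrative assumptions; only the Uniform(0,10) logit-scale prior comes from the example above.

```python
import math
import random

random.seed(1)

def prior_predictive_pumps(upper):
    # eta ~ Uniform(0, upper) on the logit scale; keep pumping with
    # probability logit^-1(eta) at each opportunity (capped at 500 pumps)
    eta = random.uniform(0, upper)
    theta = 1 / (1 + math.exp(-eta))
    n = 0
    while random.random() < theta and n < 500:
        n += 1
    return n

wide = [prior_predictive_pumps(10) for _ in range(10_000)]
tight = [prior_predictive_pumps(2) for _ in range(10_000)]

frac_wide = sum(n >= 90 for n in wide) / len(wide)    # substantial a priori mass
frac_tight = sum(n >= 90 for n in tight) / len(tight) # nearly none
print(frac_wide, frac_tight)
```

Under the wide prior a large share of simulated trials reach 90 or more pumps, while the tighter bound makes that essentially impossible: the priors can only be judged by what they predict.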

**Where is the p? Or, the Likelihood Principle gets in the way**

So what is going on here? Well, the model described in Lee and Wagenmakers’ book is not a generative model for the experimental data. Why not? Because *the balloon sometimes pops!* But because in this modelling setup the probability of explosion is *independent* of the number of pumps, this explosive possibility only appears as *a constant in the likelihood*.

The much lauded Likelihood Principle tells us that we do not need to worry about these constants when we are doing inference. But when we are trying to generate data from the prior predictive distribution, we really need to care about these aspects of the model.

Once the *context* of the experiment is taken into account, the prior predictive distributions change *a lot*.
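A sketch of how much the popping mechanism matters, under a simplified stand-in for the actual model (the logit⁻¹(η) continuation probability with η ~ Uniform(0,10) and the 10% per-pump explosion probability are illustrative assumptions, not the book’s numbers):

```python
import math
import random

random.seed(1)

def pumps(eta, p_pop=0.0):
    # keep pumping with probability logit^-1(eta); at each pump the
    # balloon may also pop with probability p_pop, ending the trial
    theta = 1 / (1 + math.exp(-eta))
    n = 0
    while random.random() < theta and n < 500:
        n += 1
        if random.random() < p_pop:
            break
    return n

no_pop = [pumps(random.uniform(0, 10)) for _ in range(10_000)]
with_pop = [pumps(random.uniform(0, 10), p_pop=0.1) for _ in range(10_000)]

frac_no_pop = sum(n >= 90 for n in no_pop) / len(no_pop)
frac_with_pop = sum(n >= 90 for n in with_pop) / len(with_pop)
print(frac_no_pop, frac_with_pop)
```

The popping term drops out of the likelihood as a constant, but it dominates the prior predictive distribution: the absurd right tail of 90-pump trials collapses once balloons are allowed to pop.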

**Context is important when taking statistical methods into new domains**

Prior predictive checks are really powerful tools. They give us a way to set priors, they give us a way to understand what our model does, they give us a way to generate data that we can use to assess the behaviour of different model comparison tools under the experimental design at hand. (Neyman-Pearson acolytes would talk about power here, but the general question lives on beyond that framework).

Modifications of prior predictive checks should also be used to assess how predictions, inference, and model comparison methods behave under different but realistic deviations from the assumed generative model. (One of the points where I disagree with Lee *et al.*’s paper is the suggestion that it’s enough to just pre-register model comparison methods. We also need some sort of simulation study to know how they work for the problem at hand!)

But prior predictive checks require understanding of the substantive field *as well as* understanding of how the experiment was performed. And it is not always as simple as *just predict y*!

Balloons pop. Substantive knowledge may only be about contrasts or combinations of predictions. We need to always be aware that it’s a lot of work to translate a tool to a new scientific context. Even when that tool appears to be as straightforward to use and as easy to explain as prior predictive checks.

And maybe we should’ve called that paper The Prior Can Often Only Be Understood in the Context of the *Experiment.*

**Endnotes:**

^{1} The fourth century Christian heresy that posited that Jesus was created by God and hence was not of the same substance. The council of Nicaea ended up writing a creed to stamp that one out.

^{2} Really never let me choose the movie. Never.

^{3} I **hate** the word “robust” here. Robust against what?! The answer appears to be “robust against un-earned certainty”, but I’m not sure. Maybe they want to Winsorize cognitive science?

^{4} Lauren had to twiddle it a bit, particularly using a non-centered parameterization to eliminate divergences.

But I did notice one thing Le Carre does very well, something that I haven’t seen discussed before in his writing, which is the way he integrates thought and action. A character will be walking down the street, or having a conversation, or searching someone’s apartment, and will be going through a series of thoughts while doing things. The thoughts and actions go together.

Ummm, here’s an example:

It’s not that the above passage by itself is particularly impressive; it’s more that Le Carre does this consistently. So he’s not just writing an action novel with occasional ruminations; rather, the thoughts are part of the action.

Writing this, it strikes me that this is commonplace, almost necessary, in a bande dessinée, but much more rare in a novel.

Also it’s important when we are teaching and when we are writing technical articles and textbooks: we’re doing something and explaining our motivation and what we’re learning, all at once.

Some of the most striking discoveries of experimental philosophers concern the extent of our own personal inconsistencies . . . how we respond to the trolley problem is affected by the details of the version we are presented with. It also depends on what we have been doing just before being presented with the case. After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five. . . .

I’m not up on this literature, but I was suspicious. Watching a TV show for 5 minutes can change your view so strongly?? I was reminded of the claim from a few years ago, that subliminal smiley faces had huge effects on attitudes toward immigration—it turns out the data showed no such thing. And I was bothered, because it seemed that a possibly false fact was being used as part of a larger argument about philosophy. The concept of “experimental philosophy”—that’s interesting, but only if the experiments make sense.

So I thought I’d look into this particular example.

I started by googling *saturday night live trolley problem* which led me to this article in Slate by Daniel Engber, “Does the Trolley Problem Have a Problem?: What if your answer to an absurd hypothetical question had no bearing on how you behaved in real life?”

OK, so Engber’s skeptical too. I searched in the article for Saturday Night Live and found this passage:

Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.

Which contained a link to this two-page article in Psychological Science by Piercarlo Valdesolo and David DeSteno, “Manipulations of Emotional Context Shape Moral Judgment.”

From that article:

The structure of such dilemmas often requires endorsing a personal moral violation in order to uphold a utilitarian principle. The well-known footbridge dilemma is illustrative. In it, the lives of five people can be saved through sacrificing another. However, the sacrifice involves pushing a rather large man off a footbridge to stop a runaway trolley before it kills the other five. . . . the proposed dual-process model of moral judgment suggests another unexamined route by which choice might be influenced: contextual sensitivity of affect. . . .

We examined this hypothesis using a paradigm in which 79 participants received a positive or neutral affect induction and immediately afterward were presented with the footbridge and trolley dilemmas embedded in a small set of nonmoral distractors.[1] The trolley dilemma is logically equivalent to the footbridge dilemma, but does not require consideration of an emotion-evoking personal violation to reach a utilitarian outcome; consequently, the vast majority of individuals select the utilitarian option for this dilemma.[2]

Here are the two footnotes to the above passage:

[1] Given that repeated consideration of dilemmas describing moral violations would rapidly reduce positive mood, we utilized responses to the matched set of the footbridge and trolley dilemmas as the primary dependent variable.

[2] Precise wording of the dilemmas can be found in Thomson (1986) or obtained from the authors.

I don’t understand footnote 1 at all. From my reading of it, I’d think that a matched set of the dilemmas corresponds to each participant in the experiment getting both questions, and then in the analysis having the responses compared. But from the published article it’s not clear what’s going on, as only 77 people seem to have been asked about the trolley dilemma compared to 79 asked about the footbridge—I don’t know what happened to those two missing responses—and, in any case, the dependent or outcome variable in the analyses are the responses to each question, one at a time. I’m not saying this to pick at the paper; I just don’t quite see how their analysis matches their described design. The problem isn’t just two missing people, it’s also that the numbers don’t align. In the data for the footbridge dilemma, 38 people get the control condition (“a 5-min segment taken from a documentary on a small Spanish village”) and 41 get the treatment (“a 5-min comedy clip taken from ‘Saturday Night Live'”). The entire experiment is said to have 79 participants. But for the trolley dilemma, it says that 40 got the control and 37 got the treatment. Maybe data were garbled in some way? The paper was published in 2006 so long before data sharing was any sort of standard, and this little example reminds us why we now think it good practice to share all data and experimental conditions.

Regarding footnote 2: I don’t have a copy of Thomson (1986) at hand, but some googling led me to this description by Michael Waldmann and Alex Wiegmann:

In the philosopher Judith Thomson’s (1986) version of the trolley dilemma, a situation is described in which a trolley whose brakes fail is about to run over five workmen who work on the tracks. However, the trolley could be redirected by a bystander on a side track where only one worker would be killed (bystander problem). Is it morally permissible for the bystander to throw the switch or is it better not to act and let fate run its course?

Now for the data. Valdesolo and DeSteno find the following results:

– Flip-the-switch-on-the-trolley problem (no fat guy, no footbridge): 38/40 flip the switch under the control condition, 33/37 flip the switch under the “Saturday Night Live” condition. That’s an estimated treatment effect of -0.06 with standard error 0.06.

– Footbridge problem (trolley, fat guy, footbridge): 3/38 push the man under the control condition, 10/41 push the man under the “Saturday Night Live” condition. That’s an estimated treatment effect of 0.16 with standard error 0.08.

So from this set of experiments alone, I would *not* say it’s accurate to write that “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five.” For one thing, it’s not clear who the participants are in these experiments, so the description “Americans” seems too general. But, beyond that, we have a treatment with an effect -0.06 +/- 0.06 in one experiment and 0.16 +/- 0.08 in another: the evidence seems equivocal. Or, to put it another way, I wouldn’t expect such a large difference (“three times more likely”) to replicate in a new study or to be valid in the general population. (See for example section 2.1 of this paper for another example. The bias occurs because the study is noisy and there is selection on statistical significance.)
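The arithmetic behind those two estimates is just a difference in proportions with the usual large-sample standard error; a quick check using the counts reported above:

```python
import math

def diff_and_se(x1, n1, x2, n2):
    # treatment effect p2 - p1 and its large-sample standard error
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p2 - p1, se

d_switch, se_switch = diff_and_se(38, 40, 33, 37)  # flip-the-switch version
d_bridge, se_bridge = diff_and_se(3, 38, 10, 41)   # footbridge version

print(round(d_switch, 2), round(se_switch, 2))  # -0.06 0.06
print(round(d_bridge, 2), round(se_bridge, 2))  # 0.16 0.08
```

With intervals this wide, a point estimate of “three times more likely” is exactly the sort of number you’d expect to shrink drastically under replication.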

At this point I thought it best to dig deeper. Setiya’s article is a review of the book, “Philosophy within Its Proper Bounds,” by Edouard Machery. I looked up the book on Amazon, searched for “trolley,” and found this passage:

From this I learned that there were some follow-up experiments. The two papers cited are Divergent effects of different positive emotions on moral judgment, by Nina Strohminger, Richard Lewis, and David Meyer (2011), and To push or not to push? Affective influences on moral judgment depend on decision frame, by Bernhard Pastötter, Sabine Gleixner, Theresa Neuhauser, and Karl-Heinz Bäuml (2013).

I followed the link to both papers. Machery describes these as replications, but none of the studies in question are exact replications, as the experimental conditions differ from the original study. Strohminger et al. use audio clips of comedians, inspirational stories, and academic lectures: no Saturday Night Live, no video clips at all. And Pastötter et al. don’t use video or comedy: they use audio clips of happy or sad-sounding music.

I’m not saying that these follow-up studies have no value or that they should not be considered replications of the original experiment, in some sense. I’m bringing them up partly because details matter—after all, if the difference between a serious video and a comedy video could have a huge effect on a survey response, one could also imagine that it makes a difference whether stimuli involve speech or music, or whether they are audio or video—but also because of the flexibility, the “researcher degrees of freedom,” involved in whether to consider something as a replication at all. Recall that when a study does *not* successfully replicate, a common reaction is to point out differences between the old and new experimental conditions and then declare that the new study was not a real replication. But if the new study’s results are in the same direction as the old’s, then it’s treated as a replication, no questions asked. So the practice of counting replications has a heads-I-win, tails-you-lose character. (For an extreme example, recall Daryl Bem’s paper where he claimed to present dozens of replications of his controversial ESP study. One of those purported replications was entitled “Further testing of the precognitive habituation effect using spider stimuli.” I think we can be pretty confident that if the spider experiment didn’t yield the desired results, Bem could’ve just said it wasn’t a real replication because his own experiment didn’t involve spiders at all.)

Anyway, that’s just terminology. I have no problem with the Strohminger et al. and Pastötter et al. studies, which we can simply call follow-up experiments.

And, just to be clear, I agree that there’s nothing special about an SNL video or for that matter about a video at all. My concern about the replication studies is more of a selection issue: if a new study *doesn’t* replicate the original claim, then a defender can say it’s not a *real* replication. I guess we could call that “the no true replication fallacy”! Kinda like those notorious examples where people claimed that a failed replication didn’t count because it was done in a different country, or the stimulus was done for a different length of time, or the outdoor temperature was different.

The real question is, what did they find and how do these findings relate to the larger claim?

And the answer is, it’s complicated.

First, the two new studies only look at the footbridge scenario (where the decision is whether to push the fat man), not the flip-the-switch-on-the-trolley scenario, which is not so productive to study because most people are already willing to flip the switch. So the new studies do not allow comparison of the two scenarios. (Strohminger et al. used 12 high-conflict moral dilemmas; see here.)

Second, the two new studies looked at interactions rather than main effects.

The Strohminger et al. analysis is complicated and I didn’t follow all the details, but I don’t see a direct comparison estimating the effect of listening to comedy versus something else. In any case, though, I think this experiment (55 people in what seems to be a between-person design) would be too small to reliably estimate the effect of interest, considering how large the standard error was in the original N=79 study.

Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as, “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?”):

I guess the question is, does the constellation of these results represent a replication of the finding that “situational cues or causal factors influencing people’s affective states—emotions or moods—have consistent effects on people’s general judgments about cases”?

And my answer is: I’m not sure. With this sort of grab bag of different findings (sometimes main effects, sometimes interactions) with different experimental conditions, I don’t really know what to think. I guess that’s the advantage of large preregistered replications: for all their flaws, they give us something to focus on.

Just to be clear: I agree that effects don’t have to be large to be interesting or important. But at the same time it’s not enough to just say that effects exist. I have no doubt that affective states affect survey responses, and these effects will be of different magnitudes and directions for different people and in different situations (hence the study of interactions as well as main effects). There have to be some consistent or systematic patterns for this to be considered a scientific effect, no? So, although I agree that effects don’t need to be large, I also don’t think a statement such as “emotions influence judgment” is enough either.

One thing that does seem clear, is that details matter, and lots of the details get garbled in the retelling. For example, Setiya reports that “Americans are three times more likely” to say they’d push someone, but that factor of 3 is based on a small noisy study on an unknown population, and for which I’ve not seen any exact replication, so to make that claim is a big leap of faith, or of statistical inference. Meanwhile, Engber refers to the flip-the-switch version of the dilemma, for which case the data show no such effect of the TV show. More generally, everyone seems to like talking about Saturday Night Live, I guess because it evokes vivid images, even though the larger study had no TV comedy at all but compared clips of happy or sad-sounding music.

What have we learned from this journey?

Reporting science is challenging, even for skeptics. None of the authors discussed above—Setiya, Engber, or Machery—are trying to sell us on this research, and none of them have a vested interest in making overblown claims. Indeed, I think it would be fair to describe Setiya and Engber as skeptics in this discussion. But even skeptics can get lost in the details. We all have a natural desire to smooth over the details and go for the bigger story. But this is tricky when the bigger story, whatever it is, depends on details that we don’t fully understand. Presumably our understanding in 2018 of affective influences on these survey responses should not depend on exactly how an experiment was done in 2006—but the description of the effects are framed in terms of that 2006 study, and with each lab’s experiment measuring something a bit different, I find it very difficult to put everything together.

This relates to the problem we discussed the other day, of psychology textbooks putting a complacent spin on the research in their field. The desire for a smooth and coherent story gets in the way of the real-world complexity that motivates this research in the first place.

There’s also another point that Engber emphasizes, which is the difference between a response to a hypothetical question, and an action in the external world. Paradoxically, one reason why I can accept that various irrelevant interventions (for example, watching a comedy show or a documentary film) could have a large effect on the response to the trolley question is that this response is not something that most people have thought about before. In contrast, I found similar claims involving political attitudes and voting (for example, the idea that 20% of women change their presidential preference depending on time of the month) to be ridiculous, in part because most people already have settled political views. But then, if the only reason we find the trolley claims plausible is that people aren’t answering them thoughtfully, then we’re really only learning about people’s quick reactions, not their deeper views. Quick reactions are important too; we should just be clear if that’s what we’re studying.

**P.S.** Edouard Machery and Nina Strohminger offered useful comments that influenced what I wrote above.

I recently read and enjoyed several articles about alternatives to the rainbow color palette. I particularly like the sections where they show how each color scheme looks under different forms of color-blindness and/or in black and white.

Here’s a couple of them (these are R-centric but relevant beyond that):

The viridis color palettes, by Bob Rudis, Noam Ross and Simon Garnier

Somewhere over the Rainbow, by Ross Ihaka, Paul Murrell, Kurt Hornik, Jason Fisher, Reto Stauffer, Claus Wilke, Claire McWhite, and Achim Zeileis.

I particularly like that second article, which includes lots of examples.

Yair writes:

Immediately following the 2018 election, we published an analysis of demographic voting patterns, showing our best estimates of what happened in the election and putting it into context compared to 2016 and 2014. . . .

Since then, we’ve collected much more data — precinct results from more states and, importantly, individual-level vote history records from Secretaries of State around the country. This analysis updates the earlier work and adds to it in a number of ways. Most of the results we showed remain the same as in the earlier analysis, but there are some changes.

Here’s the focus:

How much of the change from 2016 was due to different people voting vs. the same people changing their vote choice?

I like how he puts this. Not “Different Electorate or Different Vote Choice?” which would imply that it’s one or the other, but “How much,” which is a more quantitative, continuous, statistical way of thinking about the question.

Here’s Yair’s discussion:

As different years bring different election results, many people have debated the extent to which these changes are driven by (a) differential turnout or (b) changing vote choice.

Those who believe turnout is the driver point to various pieces of evidence. Rates of geographic ticket splitting have declined over time as elections have become more nationalized. Self-reported consistency between party identification and vote choice is incredibly high. In the increasingly nasty discourse between people who are involved in day-to-day national politics, it is hard to imagine there are many swing voters left. . . .

Those who think changing vote choice is important point to different sets of evidence. Geographic ticket splitting has declined, but not down to zero, and rates of ticket splitting only reflect levels of geographic consistency anyway. Surveys do show consistency, but again not 100% consistency, and survey respondents are more likely to be heavily interested in politics, more ideologically consistent, and less likely to swing back and forth. . . . there is little evidence that this extends to the general public writ large. . . .

How to sort through it all?

Yair answers his own question:

Our voter registration database keeps track of who voted in different elections, and our statistical models used in this analysis provide estimates of how different people voted in the different elections. . . .

Let’s build intuition about our approach by looking at a fairly simple case: the change between 2012 and 2014 . . . likely mostly due to differential turnout.

What about the change from 2016 to 2018? The same calculations from earlier are shown in the graph above and tell a different story. . . .

Two things happened between 2016 and 2018. First, there was a massive turnout boost that favored Democrats, at least compared to past midterms. . . . But if turnout was the only factor, then Democrats would not have seen nearly the gains that they ended up seeing. Changing vote choice accounted for a +4.5% margin change, out of the +5.0% margin change that was seen overall — a big piece of Democratic victory was due to 2016 Trump voters turning around and voting for Democrats in 2018.
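The logic of that decomposition can be sketched with a toy counterfactual calculation. All the group-level numbers below are made up for illustration; only the idea tracks the analysis described above: hold vote choice at its earlier values to isolate the turnout component, and attribute the remainder to changing vote choice.

```python
def margin(turnout, dem_share):
    # net Democratic margin, weighting each group's margin by its turnout
    total = sum(turnout.values())
    return sum(turnout[g] * (2 * dem_share[g] - 1) for g in turnout) / total

# Hypothetical groups, turnout in millions of votes, Democratic share of each group
turnout_16 = {"base": 80, "swing": 20}
turnout_18 = {"base": 90, "swing": 20}      # made-up turnout surge
choice_16 = {"base": 0.55, "swing": 0.40}
choice_18 = {"base": 0.55, "swing": 0.52}   # made-up vote-choice movement

total_change = margin(turnout_18, choice_18) - margin(turnout_16, choice_16)
turnout_part = margin(turnout_18, choice_16) - margin(turnout_16, choice_16)
choice_part = total_change - turnout_part
print(total_change, turnout_part, choice_part)
```

In the real analysis the vote-choice component (+4.5 points of the +5.0 total margin change) dwarfed the turnout component; the toy numbers here are chosen to show the same qualitative pattern.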

Also lots more graphs, including discussion of some individual state-level races. And this summary:

First, on turnout: there are few signs that the overwhelming enthusiasm of 2018 is slowing down. 2018 turnout reached 51% of the citizen voting-age population, 14 points higher than 2014. 2016 turnout was 61%. If enthusiasm continues, how high can it get? . . . turnout could easily reach 155 to 160 million votes . . .

Second, on vote choice . . . While 2018 was an important victory for Democrats, the gains that were made could very well bounce back to Donald Trump in 2020.

You can compare this to our immediate post-election summary and Yair’s post-election analysis from last November.

(In the old days I would’ve crossposted all of this on the Monkey Cage, but they don’t like crossposting anymore.)

I was thinking about writing a short paper aimed at getting political scientists to not make some common but easily avoidable graphical mistakes. I’ve come up with the following list of such mistakes. I was just wondering if any others immediately came to mind?

– Label lines directly

– Make labels big enough to read

– Small multiples instead of spaghetti plots

– Avoid stacked barplots

– Make graphs completely readable in black-and-white

– Leverage time as clearly as possible by placing it on the x-axis.

That reminds me . . . I was just at a pharmacology conference. And everybody there—I mean *everybody*—used the rainbow color scheme for their graphs. Didn’t anyone send them the memo, that we don’t do rainbow anymore? I prefer either a unidirectional shading of colors, or a bidirectional shading as in figure 4 here, depending on context.

I have a question concerning papers comparing two broad domains of modeling: neural nets and statistical models. Both terms are catch-alls, within each of which there are, quite obviously, multiple subdomains. For instance, NNs could include ML, DL, AI, and so on. While statistical models should include panel data, time series, hierarchical Bayesian models, and more.

I’m aware of two papers that explicitly compare these two broad domains:

(1) Sirignano, et al., Deep Learning for Mortgage Risk,

(2) Makridakis, et al., Statistical and Machine Learning forecasting methods: Concerns and ways forward

But there must be more than just these two examples. Are there others that you are aware of? Do you think a post on your blog would be useful? If so, I’m sure you can think of better ways to phrase or express my “two broad domains.”

My reply:

I don’t actually know.

Back in 1994 or so I remember talking with Radford Neal about the neural net models in his Ph.D. thesis and asking if he could try them out on analysis of data from sample surveys. The idea was that we have two sorts of models: multilevel logistic regression and Gaussian processes. Both models can use the same predictors (characteristics of survey respondents such as sex, ethnicity, age, and state), and both have the structure that similar respondents have similar predicted outcomes—but the two models have different mathematical structures. The regression model works with a linear predictor from all these factors, whereas the Gaussian process model uses an unnormalized probability density—a prior distribution—that encourages people with similar predictors to have similar outcomes.

My guess is that the two models would do about the same, following the general principle that the most important thing about a statistical procedure is not what you do with the data, but what data you use. In either case, though, some thought might need to go into the modeling. For example, you’ll want to include state-level predictors. As we’ve discussed before, when your data are sparse, multilevel regression works much better if you have good group-level predictors, and some of the examples where it appears that MRP performs poorly are examples where people are not using available group-level information.
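The point about group-level predictors can be illustrated with a toy partial-pooling simulation. Everything below is invented for illustration, and the group-level regression coefficient is assumed known rather than estimated jointly as it would be in a real multilevel fit:

```python
import random

# Toy illustration (invented numbers): 50 "states," a state-level predictor
# such as previous vote, and only 5 respondents per state.
random.seed(2)
n_states, n_per_state = 50, 5
state_pred = [random.gauss(0, 1) for _ in range(n_states)]
true_opinion = [0.8 * x + random.gauss(0, 0.3) for x in state_pred]

raw_mean, pooled = [], []
for j in range(n_states):
    ys = [random.gauss(true_opinion[j], 1.0) for _ in range(n_per_state)]
    ybar = sum(ys) / n_per_state
    raw_mean.append(ybar)
    # Precision-weighted compromise between the noisy state mean
    # (variance 1/n) and the group-level regression prediction
    # (variance 0.3**2); the coefficient 0.8 is assumed known here.
    v_data, v_prior = 1.0 / n_per_state, 0.3 ** 2
    w = v_prior / (v_prior + v_data)
    pooled.append(w * ybar + (1 - w) * 0.8 * state_pred[j])

def rmse(est):
    return (sum((e - t) ** 2 for e, t in zip(est, true_opinion)) / n_states) ** 0.5

print(f"raw state means RMSE: {rmse(raw_mean):.2f}")
print(f"partially pooled RMSE: {rmse(pooled):.2f}")
```

With only a few respondents per state, the partially pooled estimates track the truth much better than the raw state means, which is the sense in which a good group-level predictor rescues sparse-data inference.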

Anyway, to continue with the question above, asking about neural nets and statistical models: Actually, neural nets are a special case of statistical models, typically Bayesian hierarchical logistic regression with latent parameters. But neural nets are typically estimated in a different way: the resulting posterior distributions will generally be multimodal, so rather than try the hopeless task of traversing the whole posterior distribution, we’ll use various approximate methods, which then are evaluated using predictive accuracy.

By the way, Radford’s answer to my question back in 1994 was that he was too busy to try fitting his models to my data. And I guess I was too busy too, because I didn’t try it either! More recently, I asked a computer scientist and he said he thought the datasets I was working with were too small for his methods to be very useful. More generally, though, I like the idea of RPP, also the idea of using stacking to combine Bayesian inferences from different fitted models.

It’s perhaps not well known (although it’s consistent with what we found in Red State Blue State) that just about all the polarization on abortion comes from whites, and most of that is from upper-income, well-educated whites. Here’s an incomplete article that Yair and I wrote on this from 2010; we haven’t followed up on it recently.

The quick story is that I don’t think the alternative histories within alternative histories are completely arbitrary. It seems to me that there’s a common theme in the best alternative history stories, a recognition that our world is the true one and that the people in the stories are living in a fake world. This is related to the idea that the real world is overdetermined, so these alternatives can’t ultimately make sense. From that perspective, characters living within an alternative history are always at risk of realizing that their world is not real, and the alternative histories they themselves construct can be ways of channeling that recognition.

I was also thinking about this again the other day when rereading T. J. Shippey’s excellent The Road to Middle Earth. Tolkien put a huge amount of effort into rationalizing his world, not just in its own context (internal consistency) but also making it fit into our world. It seems that he felt that a completely invented world would not ultimately make sense; it was necessary for his world to be reconstructed, or discovered, and for that it had to be real.

Political Science and the Replication Crisis

We’ve heard a lot about the replication crisis in science (silly studies about ESP, evolutionary psychology, miraculous life hacks, etc.), how it happened (p-values, forking paths), and proposed remedies both procedural (preregistration, publishing of replications) and statistical (replacing hypothesis testing with multilevel modeling and decision analysis). But also of interest are the theories, implicit or explicit, associated with unreplicated or unreplicable work in medicine, psychology, economics, policy analysis, and political science: a model of the social and biological world driven by hidden influences, a perspective which we argue is both oversimplified and needlessly complex. When applied to political behavior, these theories seem to be associated with a cynical view of human nature that lends itself to anti-democratic attitudes. Fortunately, the research that is said to support this view has been misunderstood.

Some recommended reading:

[2015] Disagreements about the strength of evidence

**Quantitative Methods Committee and QMSA (10:30am, Fri 24 May 2019, 5757 S. University in Saieh Hall (lower Level) Room 021):**

Multilevel Modeling as a Way of Life

The three challenges of statistical inference are: (1) generalizing from sample to population, (2) generalizing from control to treatment group, and (3) generalizing from observed measurements to the underlying constructs of interest. Multilevel modeling is central to all of these tasks, in ways that you might not realize. We illustrate with several examples in social science and public health.

Some recommended reading:

[2004] Treatment effects in before-after data

[2012] Why we (usually) don’t have to worry about multiple comparisons

[2018] Bayesian aggregation of average data: An application in drug development

A new study suggests that vigorous physical activity may increase the risk for vision loss, a finding that has surprised and puzzled researchers.

Using questionnaires, Korean researchers evaluated physical activity among 211,960 men and women ages 45 to 79 in 2002 and 2003. Then they tracked diagnoses of age-related macular degeneration, from 2009 to 2013. . . .

They found that exercising vigorously five or more days a week was associated with a 54 percent increased risk of macular degeneration in men. They did not find the association in women.

The study, in JAMA Ophthalmology, controlled for more than 40 variables, including age, medical history, body mass index, prescription drug use and others. . . . an accompanying editorial suggests that the evidence from such a large cohort cannot be ignored.

The editorial, by Myra McGuinness, Julie Simpson, and Robert Finger, is unfortunately written entirely from the perspective of statistical significance and hypothesis testing, but they raise some interesting points nonetheless (for example, that the subgroup analysis can be biased if the matching of treatment to control group is done for the entire sample but not for each subgroup).

The news article is not so great, in my opinion. Setting aside various potential problems with the study (including those issues raised by McGuinness et al. in their editorial), the news article makes the mistake of going through all the reported estimates and picking the largest one. That’s selection bias right there. “A 54 percent increased risk,” indeed. If you want to report the study straight up, no criticism, fine. But then you should report the estimated main effect, which was 23% (as reported in the journal article, “(HR, 1.23; 95% CI, 1.02-1.49)”). That 54% number is just ridiculous. I mean, sure, maybe the effect really is 54%, who knows? But such an estimate is not supported by the data: it’s the largest of a set of reported numbers, any of which could’ve been considered newsworthy. If you take a set of numbers and report only the maximum, you’re introducing a bias.
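The bias from reporting only the largest of a set of estimates is easy to simulate. A minimal sketch, assuming (hypothetically) that the true effect equals the 23% main estimate in every one of six subgroups:

```python
import random

# If six subgroups all share the same true effect, the largest of the six
# noisy estimates is still systematically too big. Numbers are hypothetical.
random.seed(1)
true_effect, se = 0.23, 0.10   # true effect and per-subgroup SE (assumed)
n_subgroups, n_sims = 6, 10_000

max_estimates = []
for _ in range(n_sims):
    ests = [random.gauss(true_effect, se) for _ in range(n_subgroups)]
    max_estimates.append(max(ests))

avg_max = sum(max_estimates) / n_sims
print(f"true effect: {true_effect:.2f}, average reported maximum: {avg_max:.2f}")
```

The reported maximum averages well above the true value even though every individual subgroup estimate is unbiased on its own.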

Part of the problem, I suppose, is incentives. If you’re a health/science reporter, you have a few goals. One is to report exciting breakthroughs. Another is to get attention and clicks. Both goals are served, at least in the short term, by exaggeration. Even if it’s not on purpose.

OK, on to the journal article. As noted above, it’s based on a study of 200,000 people: “individuals between ages 45 and 79 years who were included in the South Korean National Health Insurance Service database from 2002 through 2013,” of whom half engaged in vigorous physical activity and half did not. It appears that the entire database contained about 500,000 people, of which 200,000 were selected for analysis in this comparison. The outcome is neovascular age-related macular degeneration, which seems to be measured by a prescription for ranibizumab, which I guess was the drug of choice for this condition in Korea at that time? Based on the description in the paper, I’m assuming they didn’t have direct data on the medical conditions, only on what drugs were prescribed, and when, hence “ranibizumab use from August 1, 2009, indicated a diagnosis of recently developed active (wet) neovascular AMD by an ophthalmologist.” I don’t know if there were people with neovascular AMD whose condition was not captured in this dataset because they never received this diagnosis.

In their matched sample of 200,000 people, 448 were recorded as having neovascular AMD: 250 in the vigorous exercise group and 198 in the control group. The data were put into a regression analysis, yielding an estimated hazard ratio of 1.23 with 95% confidence interval of [1.02, 1.49]. Also lots of subgroup analyses: unsurprisingly, the point estimate is higher for some subgroups than others; also unsurprisingly, some of the subgroup analyses reach statistical significance and some do not.

It is misleading to report that vigorous physical activity was associated with a greater hazard rate for neovascular AMD in men but not in women. Both the journal article and the news article made this mistake. The difference between “significant” and “non-significant” is not itself statistically significant.
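To see what a correct comparison would look like: the men’s hazard ratio of 1.54 is from the study, but the women’s estimate and both standard errors below are made up for illustration. The comparison should be a test of the difference between the subgroups, not a comparison of significance labels:

```python
import math

# Men's HR is "significant," women's is not -- but the difference between
# the two log hazard ratios is itself nowhere near significant.
log_hr_men, se_men = math.log(1.54), 0.18      # z about 2.4, "significant"
log_hr_women, se_women = math.log(1.10), 0.20  # z about 0.5, "not significant"

diff = log_hr_men - log_hr_women
se_diff = math.sqrt(se_men ** 2 + se_women ** 2)
z = diff / se_diff
print(f"z for the difference between subgroups: {z:.2f}")
```

A z of about 1.2 for the difference is entirely consistent with the two subgroups having the same underlying effect.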

So what do I think about all this? First, the estimates are biased due to selection on statistical significance (see, for example, section 2.1 here). Second, given how surprised everyone is, this suggests a prior distribution on any effect that should be concentrated near zero, which would pull all estimates toward 0 (or pull all hazard ratios toward 1), and I expect that the 95% intervals would then all include the null effect. Third, beyond all the selection mentioned above, there’s the selection entailed in studying this particular risk factor and this particular outcome. In this big study, you could study the effect of just about any risk factor X on just about any outcome Y. I’d like to see a big grid of all these things, all fit with a multilevel model. Until then, we’ll need good priors on the effect size for each study, or else some corrections for type M and type S errors.
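Here is the second point as a back-of-the-envelope calculation. The likelihood comes from the reported estimate (HR 1.23, 95% CI [1.02, 1.49]); the skeptical prior standard deviation is my assumption, standing in for “we’d be surprised by a big effect of exercise on macular degeneration”:

```python
import math

# Normal-normal shrinkage on the log hazard scale.
est = math.log(1.23)
se = (math.log(1.49) - math.log(1.02)) / (2 * 1.96)  # back out the SE from the CI
prior_mean, prior_sd = 0.0, 0.10                     # skeptical prior (assumed)

post_var = 1 / (1 / se ** 2 + 1 / prior_sd ** 2)
post_mean = post_var * (est / se ** 2 + prior_mean / prior_sd ** 2)
post_sd = math.sqrt(post_var)

lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
print(f"shrunk HR: {math.exp(post_mean):.2f}, "
      f"95% interval: ({math.exp(lo):.2f}, {math.exp(hi):.2f})")
```

Under this prior the hazard ratio shrinks from 1.23 toward 1.1 and the interval now includes the null.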

Just reporting the raw estimate from one particular study like that: No way. That’s a recipe for future non-replicable results. Sorry, NYT, and sorry, JAMA: you’re getting played.

**P.S.** Gur wrote:

The topic may merit two posts — one for the male subpopulation, another for the female.

To which I replied:

20 posts, of which 1 will be statistically significant.

**P.P.S.** On the plus side, Jonathan Falk pointed me the other day to this post by Scott Alexander, who writes the following about a test of a new psychiatric drug:

The pattern of positive results shows pretty much the random pattern you would expect from spurious findings. They’re divided evenly among a bunch of scales, with occasional positive results on one scale followed by negative results on a very similar scale measuring the same thing. Most of them are only the tiniest iota below p = 0.05. Many of them only work at 40 mg, and disappear in the 80 mg condition; there are occasional complicated reasons why drugs can work better at lower doses, but Occam’s razor says that’s not what’s happening here. One of the results only appeared in Stage 2 of the trial, and disappeared in Stage 1 and the pooled analysis. This doesn’t look exactly like they just multiplied six instruments by two doses by three ways of grouping the stages, got 36 different cells, and rolled a die in each. But it’s not too much better than that. Who knows, maybe the drug does something? But it sure doesn’t seem to be a particularly effective antidepressant, even by our very low standards for such. Right now I am very unimpressed.

It’s good to see this mode of thinking becoming so widespread. It makes me feel that things are changing in a good way.

So, some good news for once!

I just saw this image in a paper discussing the weight of evidence for a “hiatus” in the global warming signal and immediately thought of the garden of forking paths.

From the paper:

Tree representation of choices to represent and test pause-periods. The ‘pause’ is defined as either no-trend or a slow-trend. The trends can be measured as ‘broken’ or ‘continuous’ trends. The data used to assess the trends can come from HadCRUT, GISTEMP, or other datasets. The bottom branch represents the use of ‘historical’ versions of the datasets as they existed, or contemporary versions providing full dataset ‘hindsight’. The colour coded circles at the bottom of the tree indicate our assessment of the level of evidence (fair, weak, little or no) for the tests undertaken for each set of choices in the tree. The ‘year’ rows are for assessments undertaken at each year in time.

Thus, descending the tree in the figure, a typical researcher makes choices (explicitly or implicitly) about how to define the ‘pause’ (no-trend or slow-trend), how to model the pause-interval (as broken or continuous trends), which (and how many) datasets to use (HadCRUT, GISTEMP, Other), and what versions to use for the data with what foresight about corrections to the data (historical, hindsight). For example, a researcher who chose to define the ‘pause’ as no-trend and selected isolated intervals to test trends (broken trends) using HadCRUT3 data would be following the left-most branches of the tree.

Actually, it’s the multiverse.

It’s a problem. First, it’s a problem that people will repeat unjustified claims; it’s also a problem that when data are attached, you can get complete credulity, even for claims that are implausible on the face of it.

So it’s good to be reminded: “Data” are just numbers. You need to know where the data came from before you can learn anything from them.

*My original post is below, followed by a post script regarding the retraction.*

Matthew Heston writes:

First time, long time. I don’t know if anyone has sent over this recent paper [“Did Jon Stewart elect Donald Trump? Evidence from television ratings data,” by Ethan Porter and Thomas Wood] which claims that Jon Stewart leaving The Daily Show “spurred a 1.1% increase in Trump’s county-level vote share.”

I’m not a political scientist, and not well versed in the methods they say they’re using, but I’m skeptical of this claim. One line that stood out to me was: “To put the effect size in context, consider the results from the demographic controls. Unsurprisingly, several had significant results on voting. Yet the effects of The Daily Show’s ratings decline loomed larger than several controls, such as those related to education and ethnicity, that have been more commonly discussed in analyses of the 2016 election.” This seems odd to me, as I wouldn’t expect a TV show host change to have a larger effect than these other variables.

They also mention that they’re using “a standard difference-in-difference approach.” As I mentioned, I’m not too familiar with this approach. But my understanding is that they would be comparing pre- and post- treatment differences in a control and treatment group. Since the treatment in this case is a change in The Daily Show host, I’m unsure of who the control group would be. But maybe I’m missing something here.

Heston points to our earlier posts on the Fox news effect.

Anyway, what do I think of this new claim? The answer is that I don’t really know.

Let’s work through what we can.

In reporting any particular effect there’s some selection bias, so let’s start by assuming an Edlin factor of 1/2, so now the estimated effect of Jon Stewart goes from 1.1% to 0.55% in Trump’s county-level vote share. Call it 0.6%. Vote share is approximately 50%, so a 0.6% change is approximately a 0.3 percentage point in the vote. Would this have swung the election? I’m not sure, maybe not quite.
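Spelling out the arithmetic in the paragraph above (the Edlin factor of 1/2 is the only assumption):

```python
# Working through the back-of-the-envelope adjustment.
reported = 1.1            # claimed % increase in Trump's county-level vote share
adjusted = reported / 2   # Edlin factor of 1/2 -> 0.55, call it 0.6
share = 50.0              # vote share is approximately 50 percent
points = share * adjusted / 100   # change in percentage points of the vote
print(f"adjusted effect: {adjusted:.2f}% of vote share, "
      f"roughly {points:.1f} percentage points")
```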

Let’s assume the effect is real. How to think about it? It’s one of many such effects, along with other media outlets, campaign tactics, news items, etc.

A few years ago, Noah Kaplan, David Park, and I wrote an article attempting to distinguish between what we called the random walk and mean-reversion models of campaigning. The random walk model posits that the voters are where they are, and campaigning (or events more generally) moves them around. In this model, campaign effects are additive: +0.3 here, -0.4 there, and so forth. In contrast, the mean-reversion model starts at the end, positing that the election outcome is largely determined by the fundamentals, with earlier fluctuations in opinion mostly being a matter of the voters coming to where they were going to be. After looking at what evidence we could find, we concluded that the mean-reversion model made more sense and was more consistent with the data. This is not to say that the Jon Stewart show would have no effect, just that it’s one of many interventions during the campaign, and I can’t picture each of them having an independent effect and these effects all adding up.

**P.S. After the retraction**

The article discussed above was retracted because the analysis had a coding error.

What to say given this new information?

First, I guess Heston’s skepticism is validated. When you see a claim that seems too big to be true (as here or here), maybe it’s just mistaken in some way.

Second, I too have had to correct a paper whose empirical claims were invalidated by a coding error. It happens—and not just to Excel users!

Third, maybe the original reaction to that study was a bit too strong. See the above post: Even had the data shown what had originally been claimed, the effect they found was not as consequential as it might’ve seemed at first. Setting aside all questions of data errors and statistical errors, there’s a limit to what can be learned about a dynamic process—an election campaign—from an isolated study.

I am concerned that all our focus on causal identification, important as it is, can lead researchers, journalists, and members of the general public to overconfidence in theories as a result of isolated studies, without always the recognition that real life is more complicated. I had a similar feeling a few years ago regarding the publicity surrounding the college-football-and-voting study. The particular claims regarding football and voting have since been disputed, but even if you accept the original study as is, its implications aren’t as strong as had been claimed in the press. Whatever these causal effects are, they vary by person and scenario, and they’re not occurring in isolation.

]]>I think this story from John Cook is a different perspective on replication and how scientists respond to errors.

In particular the final paragraph:

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In [Latanya] Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.

I think most of your scientific error stories follow this pattern. The error is pointed out privately and then publicized. Of course in most of your posts a private email is met with hostility, the error is publicized, and then the scientist digs in. The good stories are when the authors admit and publicize the error themselves.

Replication, especially in psychology, fits into this because there is no “single responsible party” so “calling attention to the problem [is] the only way to make things better.”

I imagine Latanya Sweeney and you share similar frustrations.

It’s an interesting story. I was thinking about this recently when reading one of Edward Winter’s chess notes collections. These notes are full of stories of sloppy writers copying things without citation, reproducing errors that have appeared elsewhere, introducing new errors (see an example here with follow-up here). Anyway, what’s striking to me is that so many people just don’t seem to care about getting their facts wrong. Or, maybe they do care, but not enough to fix their errors or apologize or even thank the people who point out the mistakes that they’ve made. I mean, why bother writing a chess book if you’re gonna put mistakes in it? It’s not like you can make a lot of money from these things.

Sweeney’s example is of course much more important, but sometimes when thinking about a general topic (in this case, authors getting angry when their errors are revealed to the world) it can be helpful to think about minor cases too.

Someone sent this to me:

and I was like, wtf? I don’t say wtf very often—at least, not on the blog—but this just seemed weird.

For one thing, Nate and I did a project together once using MRP: this was our estimate of attitudes on heath care reform by age, income, and state:

Without MRP, we couldn’t’ve done anything like it.

**So, what gives?**

Here’s a partial list of things that MRP has done:

– Estimating public opinion in slices of the population

– Improved analysis using the voter file

– Polling using the Xbox that outperformed conventional poll aggregates

– Changing our understanding of the role of nonresponse in polling swings

– Post-election analysis that’s a lot more trustworthy than exit polls

OK, sure, MRP has solved lots of problems, it’s revolutionized polling, no matter what Team Buggy Whip says.

That said, it’s possible that MRP is overrated. “Overrated” is a difference between rated quality and actual quality. MRP, wonderful as it is, might well be rated too highly in some quarters. I wouldn’t call MRP a “forecasting method,” but that’s another story.

I guess the thing that bugged me about the Carmelo Anthony comparison is that my impression from reading the sports news is not just that Anthony is overrated but that he’s an actual liability for his teams. Whereas I see MRP, overrated as it may be (I’ve seen no evidence that MRP is overrated but I’ll accept this for the purpose of argument), as still a valuable contributor to polling.

**Ten years ago . . .**

The end of the aughts. It was a simpler time. Nate Silver was willing to publish an analysis that used MRP. We all thought embodied cognition was real. Donald Trump was a reality-TV star. Kevin Spacey was cool. Nobody outside of suburban Maryland had heard of Beach Week.

And . . . Carmelo Anthony got lots of respect from the number crunchers.

So here’s the story according to Nate: MRP is like Carmelo Anthony because they’re both overrated. But Carmelo Anthony isn’t overrated, he’s really underrated. So maybe Nate’s MRP jab was just a backhanded MRP compliment?

Simpler story, I guess, is that back around 2010 Nate liked MRP and he liked Carmelo. Back then, he thought the people who thought Carmelo was overrated were wrong. In 2018, he isn’t so impressed with either of them. Nate’s impressions of MRP and Carmelo Anthony go up and down together. That’s consistent, I guess.

**In all seriousness . . .**

Unlike Nate Silver, I claim no expertise on basketball. For all I know, Tim Tebow will be starting for the Knicks next year!

I do claim some expertise on MRP, though. Nate described MRP as “not quite ‘hard’ data.” I don’t really know what Nate meant by “hard” data—ultimately, these are all just survey responses—but, in any case, I replied:

I guess MRP can mean different things to different people. All the MRP analyses I’ve ever published are entirely based on hard data. If you want to see something that’s a complete mess and is definitely overrated, try looking into the guts of classical survey weighting (see for example this paper). Meanwhile, Yair used MRP to do these great post-election summaries. Exit polls are a disaster; see for example here.

Published poll toplines are not the data, warts and all; they’re processed data, sometimes not adjusted for enough factors as in the notorious state polls in 2016. I agree with you that raw data is the best. Once you have raw data, you can make inferences for the population. That’s what Yair was doing. For understandable commercial reasons, lots of pollsters will release toplines and crosstabs but not raw data. MRP (or, more generally, RRP) is just a way of going from the raw data to make inference about the general population. It’s the general population (or the population of voters) that we care about. The people in the sample are just a means to an end.

Anyway, if you do talk about MRP and how overrated it is, you might consider pointing people to some of those links to MRP successes. Hey, here’s another one:

we used MRP to estimate public opinion on health care. MRP has quite a highlight reel, more like Lebron or Steph or KD than Carmelo, I’d say!

One thing I will say is that data and analysis go together:

– No modern survey is good enough to be able to just interpret the results without any adjustment. Nonresponse is just too big a deal. Every survey gets adjusted, but some don’t get adjusted well.

– No analysis method can do it on its own without good data. All the modeling in the world won’t help you if you have serious selection bias.

Yair added:

Maybe it’s just a particularly touchy week for Melo references.

Both Andy and I would agree that MRP isn’t a silver bullet. But nothing is a silver bullet. I’ve seen people run MRP with bad survey data, bad poststratification data, and/or bad covariates in a model that’s way too sparse, and then over-promise about the results. I certainly wouldn’t endorse that. On the other side, obviously I agree with Andy that careful uses of MRP have had many successes, and it can improve survey inferences, especially compared to traditional weighting.

I think maybe you’re talking specifically about election forecasting? I haven’t seen comparisons of your forecasts to YouGov or PredictWise or whatever else. My vague sense pre-election was that they were roughly similar, i.e., that the meaty part of the curves overlapped. Maybe I’m wrong and your forecasts were much better this time—but non-MRP forecasters have also done much worse than you, so is that an indictment of MRP, or are you just really good at forecasting?

More to my main point—in one of your recent podcasts, I remember you said something about how forecasts aren’t everything, and people should look at precinct results to try to get beyond the toplines. That’s roughly what we’ve been trying to do in our post-election project, which has just gotten started. We see MRP as a way to combine all the data—pre-election voter file data, early voting, precinct results, county results, polling—into a single framework. Our estimates aren’t going to be perfect, for sure, but hopefully an improvement over what’s been out there, especially at sub-national levels. I know we’d do better if we had a lot more polling data, for instance. FWIW I get questions from clients all the time about how demographic groups voted in different states. Without state-specific survey data, which is generally unavailable and often poorly collected/weighted, not sure what else you can do except some modeling like MRP.

Maybe you’d rather see the raw unprocessed data like the precinct results. Fair enough, sometimes I do too! My sense is the people who want that level of detail are in the minority of the minority. Still, we’re going to try to do things like show the post-processed MRP estimates, but also some of the raw data to give intuition. I wonder if you think this is the right approach, or if you think something else would be better.

And Ryan Enos writes:

To follow up on this—I think you’ll all be interested in seeing the back and forth between Nate and Lynn Vavreck, who was interviewing him. It was more of a discussion of tradeoffs between different approaches than a discussion of what is wrong with MRP. Nate’s MRP alternative was to do a poll in every district, which I think we can all agree would be nice – if not entirely realistic. Although, as Nate pointed out, some of the efforts from the NY Times this cycle made that seem more realistic. In my humble opinion, Lynn did a nice job pushing Nate on the point that, even with data like the NY Times polls, you are still moving beyond raw data by weighting and, as Andrew points out, we often don’t consider how complex this can be (I have a common frustration with academic research about how much out of the box survey weights are used and abused).

I don’t actually pay terribly close attention to forecasting – but in my mind, Nate and everybody else in the business are doing a fantastic job, and the YouGov MRP forecasts have been a revelation. From my perspective, as somebody who cares more about what survey data can teach us about human behavior and important political phenomena, MRP has been transformative in that it has allowed us to infer opinion in places, such as metro areas, where it would otherwise be missing. This has been one of the most important advances in public opinion research in my lifetime. Where the “overrated” part becomes true is that, just like with every other scientific advance, people can get too excited about what it can do without thinking about what assumptions are going into the method, and this can lead to believing it can do more than it can—but this is true of everything.

Yair, to your question about presentation—I am a big believer in raw data and I think combining the presentation of MRP with something like precinct results, despite the dangers of ecological error, can be really valuable because it can allow people to check MRP results with priors from raw data.

It’s fine to do a poll in every district but then you’d still want to do MRP in order to adjust for nonresponse, estimate subgroups of the population, study public opinion in between the districtwide polls, etc.
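None of the discussion above includes code, but the poststratification step of MRP is simple enough to sketch. The idea: a multilevel model produces an estimate for each demographic cell, and those cell estimates are then averaged with weights given by known population counts (e.g., from the census), which is what corrects for nonresponse. The cells and numbers below are entirely made up for illustration; in real MRP the cell estimates would come from a fitted multilevel regression, not be typed in by hand.

```python
# Poststratification step of MRP, with hypothetical cell estimates.
# Each cell maps to (model-estimated Pr(support), population count).
cells = {
    ("young", "college"):    (0.62, 1200),
    ("young", "no_college"): (0.55, 1800),
    ("old",   "college"):    (0.48,  900),
    ("old",   "no_college"): (0.41, 2100),
}

total = sum(count for _, count in cells.values())
# Population-weighted average of the cell-level model estimates:
estimate = sum(p * count for p, count in cells.values()) / total
print(estimate)
```

The same weighted average, computed over the cells of a single state or metro area, is what yields the small-area estimates Enos describes above.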

And here’s the kicker:

Not quite as cool as the time I was mentioned in Private Eye, but it’s still pretty satisfying.

My next goal: Getting a mention in Sports Illustrated. (More on this soon.)

In all seriousness, it’s so cool when methods that my collaborators and I have developed are just out there, for anyone to use. I only wish Tom Little were around to see it happening.

**P.S.** Some commenters are skeptical, though:

I agree that polls can be wrong. The issue is not so much the size of the sample but rather that the sample can be unrepresentative. But I do think that polls provide some information; it’s better than just guessing.

**P.P.S.** Unrelatedly, Morris wrote, with Ian White and Michael Crowther, this article on using simulation studies to evaluate statistical methods.

Fake-data simulation. Yeah.
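For readers who haven’t tried it, a fake-data simulation of the sort Morris, White, and Crowther discuss is only a few lines: simulate data from a model with known parameters, fit the model, and check how often the interval estimate covers the truth. Here’s a minimal sketch (my own toy example, not from their article) using a linear regression with true slope b = 3; with normal errors, the ±2 standard error interval should cover the truth in roughly 95% of simulations.

```python
# Fake-data simulation: check coverage of the +/- 2 se interval for the slope.
import numpy as np

rng = np.random.default_rng(0)
a, b, n, n_sims = 2.0, 3.0, 100, 1000
covered = 0
for _ in range(n_sims):
    x = rng.uniform(0, 10, n)
    y = a + b * x + rng.normal(0, 1, n)          # simulate from the known model
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)              # residual variance estimate
    se_b = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    covered += abs(beta[1] - b) <= 2 * se_b       # does the interval cover b?

print(covered / n_sims)  # roughly 0.95
```

Changing the error distribution (say, to something bimodal) and rerunning is an easy way to see which properties of the procedure depend on normality and which don’t.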
