Against Arianism 3: Consider the cognitive models of the field

“You took my sadness out of context at the Mariners Apartment Complex” – Lana Del Rey

It’s sunny, I’m in England, and I’m having a very tasty beer, and Lauren, Andrew, and I just finished a paper called The experiment is just as important as the likelihood in understanding the prior: A cautionary note on robust cognitive modelling.

So I guess it’s time to resurrect a blog series. On the off chance that any of you have forgotten, the Against Arianism series focusses on the idea that, in the same way that Arianism1 was heretical, so too is the idea that priors and likelihoods can be considered separately. Rather, they are consubstantial–built of the same probability substance.

There is no new thing under the sun, so obviously this has been written about a lot. But because it’s my damn blog post, I’m going to focus on a paper Andrew, Michael, and I wrote in 2017 called The Prior Can Often Only Be Understood in the Context of the Likelihood. This paper was dashed off in a hurry and under deadline pressure, but I quite like it. But it’s also maybe not the best place to stop the story.

An opportunity to comment

A few months back, the fabulous Lauren Kennedy was visiting me in Toronto on a different project. Lauren is a postdoc at Columbia working partly on complex survey data, but her background is quantitative methods in psychology. Among other things, we saw a fairly regrettable (but excellent) Claire Denis movie about vampires2.

But that’s not relevant to the story. What is relevant was that Lauren had seen an open invitation to write a comment on a paper in Computational Brain & Behaviour about Robust3 Modelling in Cognitive Science written by a team of cognitive scientists and researchers in scientific theory, philosophy, and practice (Michael Lee, Amy Criss, Berna Devezer, Christopher Donkin, Alexander Etz, Fábio Leite, Dora Matzke, Jeffrey Rouder, Jennifer Trueblood, Corey White, and Joachim Vandekerckhove).

Their bold aim to sketch out the boundaries of good practice for cognitive modelling (and particularly for the times where modelling meets data) is laudable, not least because such an endeavor will always be doomed to fail in some way. But the act of stating some ideas for what constitutes best practice gives the community a concrete pole to hang this important discussion on. And Computational Brain & Behaviour recognized this and decided to hang an issue off the paper and its discussions.

The paper itself is really thoughtful and well done. And obviously I do not agree with everything in it, but that doesn’t stop me from feeling that widespread adoption of their suggestions would definitely make quantitative research better.

But Lauren noticed one tool that we have found extremely useful that wasn’t mentioned in the paper: prior predictive checks. She asked if I’d be interested in joining her on a paper, and I quickly said yes!

It turns out there is another BART

The best thing about working with Lauren on this was that she is a legit psychology researcher so she isn’t just playing in someone’s back yard, she owns a patch of sand. It was immediately clear that it would be super-quick to write a comment that just said “you should use prior predictive checks”. But that would miss a real opportunity. Because cognitive modelling isn’t quite the same as standard statistical modelling (although in the case where multilevel models are appropriate Daniel Schad, Michael Betancourt, and Shravan Vasishth just wrote an excellent paper on importing general ideas of good statistical workflows into cognitive applications).

Rather than using our standard data analysis models, a lot of the time cognitive models are generative models for the cognitive process coupled (sometimes awkwardly) with models for the data that is generated from a certain experiment. So we wanted an example model that is more in line with this practice than our standard multilevel regression examples.

Lauren found the Balloon Analogue Risk Task (BART) in Lee and Wagenmakers’ book Bayesian Cognitive Modeling: A Practical Course, which conveniently has Stan code online4. We decided to focus on this example because it’s fairly easy to understand and has all the features we needed. But hopefully we will eventually write a longer paper that covers more common types of models.

BART is an experiment that makes participants simulate pumping balloons with some fixed probability of popping after every pump. Every pump gets them more money, but they get nothing if the balloon pops. The model contains a parameter (\gamma^+) for risk taking behaviour and the experiment is designed to see if the risk taking behaviour changes as a person gets more drunk.  The model is described in the following DAG:

Exploring the prior predictive distribution 

Those of you who have been paying attention will notice the Uniform(0,10) priors on the logit scale and think that these priors are a little bit terrible. And they are! Direct simulation from the model leads to absolutely silly predictive distributions for the number of pumps in a single trial. Worse still, the pumps are extremely uniform across trials. Which means that the model thinks, a priori, that it is quite likely for a tipsy undergraduate to pump a balloon 90 times in each of the 20 trials. The mean number of pumps is a much more reasonable 10.
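To make "direct simulation from the model" concrete, here is a minimal sketch of a prior predictive simulation. It is not the code from the paper. It assumes a pumping rule in which the probability of pumping at opportunity k is 1 / (1 + exp(beta * (k - omega))) with omega = -gamma_plus / log(1 - burst_prob), and Uniform(0, upper) priors on both gamma_plus and beta. The function names, the burst probability of 0.1, and the cap `max_pumps` are all assumptions made for illustration.

```python
import math
import random


def pump_probability(k, gamma_plus, beta, burst_prob):
    """Assumed probability of pumping at the k-th opportunity (k = 1, 2, ...)."""
    # Assumed "target" number of pumps implied by the risk-taking parameter.
    omega = -gamma_plus / math.log(1.0 - burst_prob)
    z = beta * (k - omega)
    if z > 700.0:  # avoid overflow in exp; the probability is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def intended_pumps(gamma_plus, beta, burst_prob, rng, max_pumps=500):
    """Pumps the participant would make if the balloon never popped."""
    pumps = 0
    while pumps < max_pumps and rng.random() < pump_probability(
        pumps + 1, gamma_plus, beta, burst_prob
    ):
        pumps += 1
    return pumps


def prior_predictive(n_draws, n_trials, burst_prob, rng, upper=10.0):
    """One row of n_trials pump counts per draw of (gamma_plus, beta) from the priors."""
    rows = []
    for _ in range(n_draws):
        gamma_plus = rng.uniform(0.0, upper)
        beta = rng.uniform(0.0, upper)
        rows.append(
            [intended_pumps(gamma_plus, beta, burst_prob, rng) for _ in range(n_trials)]
        )
    return rows


rng = random.Random(1)
rows = prior_predictive(n_draws=200, n_trials=20, burst_prob=0.1, rng=rng)
print("largest simulated pump count in one trial:", max(max(row) for row in rows))
```

Lowering `upper` is one way to see how tighter bounds on the uniform priors change the simulated pump counts. Note that this sketch ignores the balloon popping.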

Choosing tighter upper bounds on the uniform priors leads to more sensible prior predictive distributions, but then Lauren went to test out what changes this made to inference (in particular looking at how it affects the Bayes factor against the null that the \gamma^+ parameters were the same across different levels of drunkenness). It made very little difference.  This seemed odd so she started looking closer.

Where is the p? Or, the Likelihood Principle gets in the way

So what is going on here? Well the model described in Lee and Wagenmakers’ book is not a generative model for the experimental data. Why not? Because the balloon sometimes pops! But because in this modelling setup the probability of explosion is independent of the number of pumps, this explosive possibility only appears as a constant in the likelihood.

The much lauded Likelihood Principle tells us that we do not need to worry about these constants when we are doing inference. But when we are trying to generate data from the prior predictive distribution, we really need to care about these aspects of the model.
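One way to see where the constant comes from is to write out the likelihood for a single trial. This is a sketch under assumptions that are not spelled out in the text: each pump pops the balloon independently with a known probability p fixed by the design, \theta_k is the probability that the participant pumps at opportunity k, n is the number of pumps made, and b is 1 if the balloon popped and 0 if the participant cashed out. The notation is introduced here for illustration.

```latex
p(n, b \mid \gamma^+, \beta)
  = \underbrace{\Bigl[\prod_{k=1}^{n} \theta_k\Bigr] (1 - \theta_{n+1})^{1-b}}_{\text{depends on } \gamma^+,\ \beta}
    \times
    \underbrace{(1 - p)^{\,n-b}\, p^{\,b}}_{\text{free of } \gamma^+,\ \beta}
```

The second factor multiplies the likelihood by something that does not involve the parameters, so it drops out of posterior inference. It still determines which values of n and b can plausibly be generated, so it cannot be dropped when simulating data.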

Once the context on the experiment is taken into account, the prior predictive distributions change a lot.
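The effect of the experiment on the simulated data can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the balloon pops on each pump independently with probability `burst_prob` (which must be greater than zero), the fatal pump is counted as a pump, and the participant who plans 90 pumps on every trial is a hypothetical stand-in for a draw from the prior.

```python
import random


def burst_point(burst_prob, rng):
    """Index of the pump that pops the balloon (geometric with success prob burst_prob)."""
    k = 1
    while rng.random() >= burst_prob:
        k += 1
    return k


def observed_trial(intended, burst_prob, rng):
    """Return (pumps actually made, whether the balloon popped)."""
    pop_at = burst_point(burst_prob, rng)
    if pop_at <= intended:
        return pop_at, True  # the balloon popped before the plan was completed
    return intended, False  # the participant cashed out as planned


rng = random.Random(2)
intended = [90] * 20  # hypothetical participant who plans 90 pumps on every trial
observed = [observed_trial(n, burst_prob=0.1, rng=rng) for n in intended]
print([pumps for pumps, popped in observed])
```

The observed pump counts can never exceed the intended ones, which is why the prior predictive distribution for the data looks so different once popping is included.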

Context is important when taking statistical methods into new domains

Prior predictive checks are really powerful tools. They give us a way to set priors, they give us a way to understand what our model does, they give us a way to generate data that we can use to assess the behaviour of different model comparison tools under the experimental design at hand. (Neyman-Pearson acolytes would talk about power here, but the general question lives on beyond that framework).

Modifications of prior predictive checks should also be used to assess how predictions, inference, and model comparison methods behave under different but realistic deviations from the assumed generative model. (One of the points where I disagree with Lee et al.’s paper is the idea that it’s enough to just pre-register model comparison methods. We also need some sort of simulation study to know how they work for the problem at hand!)

But prior predictive checks require understanding of the substantive field as well as understanding of how the experiment was performed. And it is not always as simple as just predict y!

Balloons pop. Substantive knowledge may only be about contrasts or combinations of predictions. We need to always be aware that it’s a lot of work to translate a tool to a new scientific context. Even when that tool appears to be as straightforward to use and as easy to explain as prior predictive checks.

And maybe we should’ve called that paper The Prior Can Often Only Be Understood in the Context of the Experiment.


1 The fourth century Christian heresy that posited that Jesus was created by God and hence was not of the same substance. The council of Nicaea ended up writing a creed to stamp that one out.

2 Really never let me choose the movie. Never.

3 I hate the word “robust” here. Robust against what?! The answer appears to be “robust against un-earned certainty”, but I’m not sure. Maybe they want to Winsorize cognitive science?

4 Lauren had to twiddle it a bit, particularly using a non-centered parameterization to eliminate divergences.

John Le Carre is good at integrating thought and action

I was reading a couple old Le Carre spy novels. They have their strong points and their weak points; I’m not gonna claim that Le Carre is a great writer. He’s no George Orwell or Graham Greene. (This review by the great Clive James nails Le Carre perfectly.)

But I did notice one thing Le Carre does very well, something that I haven’t seen discussed before in his writing, which is the way he integrates thought and action. A character will be walking down the street, or having a conversation, or searching someone’s apartment, and will be going through a series of thoughts while doing things. The thoughts and actions go together.

Ummm, here’s an example:

It’s not that the above passage by itself is particularly impressive; it’s more that Le Carre does this consistently. So he’s not just writing an action novel with occasional ruminations; rather, the thoughts are part of the action.

Writing this, it strikes me that this is commonplace, almost necessary, in a bande dessinée, but much more rare in a novel.

Also it’s important when we are teaching and when we are writing technical articles and textbooks: we’re doing something and explaining our motivation and what we’re learning, all at once.

Pushing the guy in front of the trolley

So. I was reading the London Review of Books the other day and came across this passage by the philosopher Kieran Setiya:

Some of the most striking discoveries of experimental philosophers concern the extent of our own personal inconsistencies . . . how we respond to the trolley problem is affected by the details of the version we are presented with. It also depends on what we have been doing just before being presented with the case. After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five. . . .

I’m not up on this literature, but I was suspicious. Watching a TV show for 5 minutes can change your view so strongly?? I was reminded of the claim from a few years ago, that subliminal smiley faces had huge effects on attitudes toward immigration—it turns out the data showed no such thing. And I was bothered, because it seemed that a possibly false fact was being used as part of a larger argument about philosophy. The concept of “experimental philosophy”—that’s interesting, but only if the experiments make sense.

So I thought I’d look into this particular example.

I started by googling *saturday night live trolley problem* which led me to this article in Slate by Daniel Engber, “Does the Trolley Problem Have a Problem?: What if your answer to an absurd hypothetical question had no bearing on how you behaved in real life?”

OK, so Engber’s skeptical too. I searched in the article for Saturday Night Live and found this passage:

Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.

Which contained a link to this two-page article in Psychological Science by Piercarlo Valdesolo and David DeSteno, “Manipulations of Emotional Context Shape Moral Judgment.”

From that article:

The structure of such dilemmas often requires endorsing a personal moral violation in order to uphold a utilitarian principle. The well-known footbridge dilemma is illustrative. In it, the lives of five people can be saved through sacrificing another. However, the sacrifice involves pushing a rather large man off a footbridge to stop a runaway trolley before it kills the other five. . . . the proposed dual-process model of moral judgment suggests another unexamined route by which choice might be influenced: contextual sensitivity of affect. . . .

We examined this hypothesis using a paradigm in which 79 participants received a positive or neutral affect induction and immediately afterward were presented with the footbridge and trolley dilemmas embedded in a small set of nonmoral distractors.[1] The trolley dilemma is logically equivalent to the footbridge dilemma, but does not require consideration of an emotion-evoking personal violation to reach a utilitarian outcome; consequently, the vast majority of individuals select the utilitarian option for this dilemma.[2]

Here are the two footnotes to the above passage:

[1] Given that repeated consideration of dilemmas describing moral violations would rapidly reduce positive mood, we utilized responses to the matched set of the footbridge and trolley dilemmas as the primary dependent variable.

[2] Precise wording of the dilemmas can be found in Thomson (1986) or obtained from the authors.

I don’t understand footnote 1 at all. From my reading of it, I’d think that a matched set of the dilemmas corresponds to each participant in the experiment getting both questions, and then in the analysis having the responses compared. But from the published article it’s not clear what’s going on, as only 77 people seem to have been asked about the trolley dilemma compared to 79 asked about the footbridge—I don’t know what happened to those two missing responses—and, in any case, the dependent or outcome variable in the analyses are the responses to each question, one at a time. I’m not saying this to pick at the paper; I just don’t quite see how their analysis matches their described design. The problem isn’t just two missing people, it’s also that the numbers don’t align. In the data for the footbridge dilemma, 38 people get the control condition (“a 5-min segment taken from a documentary on a small Spanish village”) and 41 get the treatment (“a 5-min comedy clip taken from ‘Saturday Night Live'”). The entire experiment is said to have 79 participants. But for the trolley dilemma, it says that 40 got the control and 37 got the treatment. Maybe data were garbled in some way? The paper was published in 2006 so long before data sharing was any sort of standard, and this little example reminds us why we now think it good practice to share all data and experimental conditions.

Regarding footnote 2: I don’t have a copy of Thomson (1986) at hand, but some googling led me to this description by Michael Waldmann and Alex Wiegmann:

In the philosopher’s Judith Thomson’s (1986) version of the trolley dilemma, a situation is described in which a trolley whose brakes fail is about to run over five workmen who work on the tracks. However, the trolley could be redirected by a bystander on a side track where only one worker would be killed (bystander problem). Is it morally permissible for the bystander to throw the switch or is it better not to act and let fate run its course?

Now for the data. Valdesolo and DeSteno find the following results:

– Flip-the-switch-on-the-trolley problem (no fat guy, no footbridge): 38/40 flip the switch under the control condition, 33/37 flip the switch under the “Saturday Night Live” condition. That’s an estimated treatment effect of -0.06 with standard error 0.06.

– Footbridge problem (trolley, fat guy, footbridge): 3/38 push the man under the control condition, 10/41 push the man under the “Saturday Night Live” condition. That’s an estimated treatment effect of 0.16 with standard error 0.08.
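The estimates and standard errors above can be reproduced from the reported counts. The sketch below uses the usual difference in proportions with the unpooled binomial standard error; that this is the formula behind the numbers in the text is an assumption, though it does match them to two decimal places. The function name is made up for illustration.

```python
import math


def diff_in_proportions(y_control, n_control, y_treated, n_treated):
    """Difference in proportions (treated minus control) and its standard error."""
    p_c = y_control / n_control
    p_t = y_treated / n_treated
    estimate = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treated)
    return estimate, se


# Counts as reported in the two bullet points above.
switch_est, switch_se = diff_in_proportions(38, 40, 33, 37)
bridge_est, bridge_se = diff_in_proportions(3, 38, 10, 41)
bridge_ratio = (10 / 41) / (3 / 38)  # the source of "three times more likely"

print(round(switch_est, 2), round(switch_se, 2))
print(round(bridge_est, 2), round(bridge_se, 2))
print(round(bridge_ratio, 1))
```

The ratio of the two footbridge proportions is about 3, which is where the "three times" figure comes from, but it rests on 3 versus 10 people.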

So from this set of experiments alone, I would not say it’s accurate to write that “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five.” For one thing, it’s not clear who the participants are in these experiments, so the description “Americans” seems too general. But, beyond that, we have a treatment with an effect -0.06 +/- 0.06 in one experiment and 0.16 +/- 0.08 in another: the evidence seems equivocal. Or, to put it another way, I wouldn’t expect such a large difference (“three times more likely”) to replicate in a new study or to be valid in the general population. (See for example section 2.1 of this paper for another example. The bias occurs because the study is noisy and there is selection on statistical significance.)
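The selection effect mentioned in the parenthesis can be illustrated with a small simulation. This is a sketch under assumptions: the true effect of 0.05 is a hypothetical value chosen only for illustration, the standard error of 0.08 is taken from the footbridge comparison, and estimates are treated as normally distributed around the true effect.

```python
import random


def exaggeration_given_significance(true_effect, se, n_sims, rng):
    """Share of simulated estimates with |estimate| > 1.96 se, and the average
    ratio |estimate| / |true effect| among those estimates."""
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > 1.96 * se:
            significant.append(abs(estimate))
    if not significant:
        return 0.0, float("nan")
    share = len(significant) / n_sims
    ratio = (sum(significant) / len(significant)) / abs(true_effect)
    return share, ratio


rng = random.Random(3)
share, ratio = exaggeration_given_significance(
    true_effect=0.05,  # hypothetical true effect, not an estimate from any study
    se=0.08,
    n_sims=20000,
    rng=rng,
)
print("share statistically significant:", round(share, 3))
print("average exaggeration among significant results:", round(ratio, 1))
```

Under these assumptions, any estimate that clears the significance threshold must be more than three times the true effect, because the threshold itself (1.96 times 0.08) is already that far out.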

At this point I thought it best to dig deeper. Setiya’s article is a review of the book, “Philosophy within Its Proper Bounds,” by Edouard Machery. I looked up the book on Amazon, searched for “trolley,” and found this passage:

From this I learned that there were some follow-up experiments. The two papers cited are Divergent effects of different positive emotions on moral judgment, by Nina Strohminger, Richard Lewis, and David Meyer (2011), and To push or not to push? Affective influences on moral judgment depend on decision frame, by Bernhard Pastötter, Sabine Gleixner, Theresa Neuhauser, and Karl-Heinz Bäuml (2013).

I followed the links to both papers. Machery describes these as replications, but none of the studies in question are exact replications, as the experimental conditions differ from the original study. Strohminger et al. use audio clips of comedians, inspirational stories, and academic lectures: no Saturday Night Live, no video clips at all. And Pastötter et al. don’t use video or comedy: they use audio clips of happy or sad-sounding music.

I’m not saying that these follow-up studies have no value or that they should not be considered replications of the original experiment, in some sense. I’m bringing them up partly because details matter—after all, if the difference between a serious video and a comedy video could have a huge effect on a survey response, one could also imagine that it makes a difference whether stimuli involve speech or music, or whether they are audio or video—but also because of the flexibility, the “researcher degrees of freedom,” involved in whether to consider something as a replication at all. Recall that when a study does not successfully replicate, a common reaction is to point out differences between the old and new experimental conditions and then declare that the new study was not a real replication. But if the new study’s results are in the same direction as the old’s, then it’s treated as a replication, no questions asked. So the practice of counting replications has a heads-I-win, tails-you-lose character. (For an extreme example, recall Daryl Bem’s paper where he claimed to present dozens of replications of his controversial ESP study. One of those purported replications was entitled “Further testing of the precognitive habituation effect using spider stimuli.” I think we can be pretty confident that if the spider experiment didn’t yield the desired results, Bem could’ve just said it wasn’t a real replication because his own experiment didn’t involve spiders at all.)

Anyway, that’s just terminology. I have no problem with the Strohminger et al. and Pastötter et al. studies, which we can simply call follow-up experiments.

And, just to be clear, I agree that there’s nothing special about an SNL video or for that matter about a video at all. My concern about the replication studies is more of a selection issue: if a new study doesn’t replicate the original claim, then a defender can say it’s not a real replication. I guess we could call that “the no true replication fallacy”! Kinda like those notorious examples where people claimed that a failed replication didn’t count because it was done in a different country, or the stimulus was done for a different length of time, or the outdoor temperature was different.

The real question is, what did they find and how do these findings relate to the larger claim?

And the answer is, it’s complicated.

First, the two new studies only look at the footbridge scenario (where the decision is whether to push the fat man), not the flip-the-switch-on-the-trolley scenario, which is not so productive to study because most people are already willing to flip the switch. So the new studies do not allow comparison of the two scenarios. (Strohminger et al. used 12 high conflict moral dilemmas; see here.)

Second, the two new studies looked at interactions rather than main effects.

The Strohminger et al. analysis is complicated and I didn’t follow all the details, but I don’t see a direct comparison estimating the effect of listening to comedy versus something else. In any case, though, I think this experiment (55 people in what seems to be a between-person design) would be too small to reliably estimate the effect of interest, considering how large the standard error was in the original N=79 study.

Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as, “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?”):

I guess the question is, does the constellation of these results represent a replication of the finding that “situational cues or causal factors influencing people’s affective states—emotions or moods—have consistent effects on people’s general judgments about cases”?

And my answer is: I’m not sure. With this sort of grab bag of different findings (sometimes main effects, sometimes interactions) with different experimental conditions, I don’t really know what to think. I guess that’s the advantage of large preregistered replications: for all their flaws, they give us something to focus on.

Just to be clear: I agree that effects don’t have to be large to be interesting or important. But at the same time it’s not enough to just say that effects exist. I have no doubt that affective states affect survey responses, and these effects will be of different magnitudes and directions for different people and in different situations (hence the study of interactions as well as main effects). There have to be some consistent or systematic patterns for this to be considered a scientific effect, no? So, although I agree that effects don’t need to be large, I also don’t think a statement such as “emotions influence judgment” is enough either.

One thing that does seem clear, is that details matter, and lots of the details get garbled in the retelling. For example, Setiya reports that “Americans are three times more likely” to say they’d push someone, but that factor of 3 is based on a small noisy study on an unknown population, and for which I’ve not seen any exact replication, so to make that claim is a big leap of faith, or of statistical inference. Meanwhile, Engber refers to the flip-the-switch version of the dilemma, for which case the data show no such effect of the TV show. More generally, everyone seems to like talking about Saturday Night Live, I guess because it evokes vivid images, even though the larger study had no TV comedy at all but compared clips of happy or sad-sounding music.

What have we learned from this journey?

Reporting science is challenging, even for skeptics. None of the authors discussed above—Setiya, Engber, or Machery—are trying to sell us on this research, and none of them have a vested interest in making overblown claims. Indeed, I think it would be fair to describe Setiya and Engber as skeptics in this discussion. But even skeptics can get lost in the details. We all have a natural desire to smooth over the details and go for the bigger story. But this is tricky when the bigger story, whatever it is, depends on details that we don’t fully understand. Presumably our understanding in 2018 of affective influences on these survey responses should not depend on exactly how an experiment was done in 2006—but the descriptions of the effects are framed in terms of that 2006 study, and with each lab’s experiment measuring something a bit different, I find it very difficult to put everything together.

This relates to the problem we discussed the other day, of psychology textbooks putting a complacent spin on the research in their field. The desire for a smooth and coherent story gets in the way of the real-world complexity that motivates this research in the first place.

There’s also another point that Engber emphasizes, which is the difference between a response to a hypothetical question, and an action in the external world. Paradoxically, one reason why I can accept that various irrelevant interventions (for example, watching a comedy show or a documentary film) could have a large effect on the response to the trolley question is that this response is not something that most people have thought about before. In contrast, I found similar claims involving political attitudes and voting (for example, the idea that 20% of women change their presidential preference depending on time of the month) to be ridiculous, in part because most people already have settled political views. But then, if the only reason we find the trolley claims plausible is that people aren’t answering them thoughtfully, then we’re really only learning about people’s quick reactions, not their deeper views. Quick reactions are important too; we should just be clear if that’s what we’re studying.

P.S. Edouard Machery and Nina Strohminger offered useful comments that influenced what I wrote above.

Going beyond the rainbow color scheme for statistical graphics

Yesterday in our discussion of easy ways to improve your graphs, a commenter wrote:

I recently read and enjoyed several articles about alternatives to the rainbow color palette. I particularly like the sections where they show how each color scheme looks under different forms of color-blindness and/or in black and white.

Here’s a couple of them (these are R-centric but relevant beyond that):

The viridis color palettes, by Bob Rudis, Noam Ross and Simon Garnier

Somewhere over the Rainbow, by Ross Ihaka, Paul Murrell, Kurt Hornik, Jason Fisher, Reto Stauffer, Claus Wilke, Claire McWhite, and Achim Zeileis.

I particularly like that second article, which includes lots of examples.

(from Yair): What Happened in the 2018 Election

Yair writes:

Immediately following the 2018 election, we published an analysis of demographic voting patterns, showing our best estimates of what happened in the election and putting it into context compared to 2016 and 2014. . . .

Since then, we’ve collected much more data — precinct results from more states and, importantly, individual-level vote history records from Secretaries of State around the country. This analysis updates the earlier work and adds to it in a number of ways. Most of the results we showed remain the same as in the earlier analysis, but there are some changes.

Here’s the focus:

How much of the change from 2016 was due to different people voting vs. the same people changing their vote choice?

I like how he puts this. Not “Different Electorate or Different Vote Choice?” which would imply that it’s one or the other, but “How much,” which is a more quantitative, continuous, statistical way of thinking about the question.

Here’s Yair’s discussion:

As different years bring different election results, many people have debated the extent to which these changes are driven by (a) differential turnout or (b) changing vote choice.

Those who believe turnout is the driver point to various pieces of evidence. Rates of geographic ticket splitting have declined over time as elections have become more nationalized. Self-reported consistency between party identification and vote choice is incredibly high. In the increasingly nasty discourse between people who are involved in day-to-day national politics, it is hard to imagine there are many swing voters left. . . .

Those who think changing vote choice is important point to different sets of evidence. Geographic ticket splitting has declined, but not down to zero, and rates of ticket splitting only reflect levels of geographic consistency anyway. Surveys do show consistency, but again not 100% consistency, and survey respondents are more likely to be heavily interested in politics, more ideologically consistent, and less likely to swing back and forth. . . . there is little evidence that this extends to the general public writ large. . . .

How to sort through it all?

Yair answers his own question:

Our voter registration database keeps track of who voted in different elections, and our statistical models used in this analysis provide estimates of how different people voted in the different elections. . . .

Let’s build intuition about our approach by looking at a fairly simple case: the change between 2012 and 2014 . . . likely mostly due to differential turnout.

What about the change from 2016 to 2018? The same calculations from earlier are shown in the graph above and tell a different story. . . .

Two things happened between 2016 and 2018. First, there was a massive turnout boost that favored Democrats, at least compared to past midterms. . . . But if turnout was the only factor, then Democrats would not have seen nearly the gains that they ended up seeing. Changing vote choice accounted for a +4.5% margin change, out of the +5.0% margin change that was seen overall — a big piece of Democratic victory was due to 2016 Trump voters turning around and voting for Democrats in 2018.
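The "how much" framing can be made concrete with a toy decomposition. This is not Yair's method, which relies on individual-level vote history and statistical models. It is a simplified sketch that splits a change in overall margin into a part due to the changing composition of the electorate and a part due to changing vote choice within groups. The two groups and all of their numbers are invented for illustration.

```python
def overall_margin(groups):
    """Overall margin as the share-weighted average of group margins."""
    return sum(share * margin for share, margin in groups.values())


def decompose_margin_change(before, after):
    """Split the margin change into a turnout part and a vote-choice part.

    before/after map group name -> (share of voters, Democratic margin in group).
    Turnout part: change in shares, evaluated at the earlier margins.
    Vote-choice part: change in margins, evaluated at the later shares.
    """
    turnout = sum((after[g][0] - before[g][0]) * before[g][1] for g in before)
    choice = sum(after[g][0] * (after[g][1] - before[g][1]) for g in before)
    return turnout, choice


# Invented numbers: two equally sized groups shift in both size and margin.
before = {"group_a": (0.50, 0.20), "group_b": (0.50, -0.20)}
after = {"group_a": (0.55, 0.22), "group_b": (0.45, -0.15)}

turnout_part, choice_part = decompose_margin_change(before, after)
total_change = overall_margin(after) - overall_margin(before)
print(round(turnout_part, 4), round(choice_part, 4), round(total_change, 4))
```

The two parts add up to the total change by construction. The split is not unique: evaluating the turnout part at the later margins instead would move the interaction term from one part to the other.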

Also lots more graphs, including discussion of some individual state-level races. And this summary:

First, on turnout: there are few signs that the overwhelming enthusiasm of 2018 is slowing down. 2018 turnout reached 51% of the citizen voting-age population, 14 points higher than 2014. 2016 turnout was 61%. If enthusiasm continues, how high can it get? . . . turnout could easily reach 155 to 160 million votes . . .

Second, on vote choice . . . While 2018 was an important victory for Democrats, the gains that were made could very well bounce back to Donald Trump in 2020.

You can compare this to our immediate post-election summary and Yair’s post-election analysis from last November.

(In the old days I would’ve crossposted all of this on the Monkey Cage, but they don’t like crossposting anymore.)

What are some common but easily avoidable graphical mistakes?

John Kastellec writes:

I was thinking about writing a short paper aimed at getting political scientists to not make some common but easily avoidable graphical mistakes. I’ve come up with the following list of such mistakes. I was just wondering if any others immediately came to mind?

– Label lines directly

– Make labels big enough to read

– Small multiples instead of spaghetti plots

– Avoid stacked barplots

– Make graphs completely readable in black-and-white

– Leverage time as clearly as possible by placing it on the x-axis.

That reminds me . . . I was just at a pharmacology conference. And everybody there—I mean everybody—used the rainbow color scheme for their graphs. Didn’t anyone send them the memo, that we don’t do rainbow anymore? I prefer either a unidirectional shading of colors, or a bidirectional shading as in figure 4 here, depending on context.
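One reason to prefer a unidirectional shading is that lightness then changes in one direction along the scale, whereas a rainbow gets lighter and darker several times. The sketch below checks this with the standard library only. It is a rough illustration under assumptions: the "rainbow" is a sweep through HSV hue at full saturation, the alternative is a single blue hue of increasing lightness, and luminance is approximated by a weighted sum of the raw RGB values with no gamma correction.

```python
import colorsys


def approx_luminance(rgb):
    """Crude luminance: weighted sum of R, G, B (no gamma correction)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def is_monotone(values):
    """True if the sequence never changes direction."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


steps = [i / 10 for i in range(11)]
# Rainbow: hue runs from red toward violet at full saturation and value.
rainbow = [colorsys.hsv_to_rgb(0.8 * t, 1.0, 1.0) for t in steps]
# Unidirectional: one hue, lightness increasing from dark to light.
single_hue = [colorsys.hls_to_rgb(0.6, 0.15 + 0.7 * t, 0.6) for t in steps]

rainbow_lum = [approx_luminance(c) for c in rainbow]
single_lum = [approx_luminance(c) for c in single_hue]

print("rainbow luminance monotone:", is_monotone(rainbow_lum))
print("single-hue luminance monotone:", is_monotone(single_lum))
```

A scale whose luminance is monotone also survives conversion to black and white, which ties in with the readability item in the list above.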

Neural nets vs. regression models

Eliot Johnson writes:

I have a question concerning papers comparing two broad domains of modeling: neural nets and statistical models. Both terms are catch-alls, within each of which there are, quite obviously, multiple subdomains. For instance, NNs could include ML, DL, AI, and so on. While statistical models should include panel data, time series, hierarchical Bayesian models, and more.

I’m aware of two papers that explicitly compare these two broad domains:

(1) Sirignano, et al., Deep Learning for Mortgage Risk,

(2) Makridakis, et al., Statistical and Machine Learning forecasting methods: Concerns and ways forward

But there must be more than just these two examples. Are there others that you are aware of? Do you think a post on your blog would be useful? If so, I’m sure you can think of better ways to phrase or express my “two broad domains.”

My reply:

I don’t actually know.

Back in 1994 or so I remember talking with Radford Neal about the neural net models in his Ph.D. thesis and asking if he could try them out on analysis of data from sample surveys. The idea was that we have two sorts of models: multilevel logistic regression and Gaussian processes. Both models can use the same predictors (characteristics of survey respondents such as sex, ethnicity, age, and state), and both have the structure that similar respondents have similar predicted outcomes—but the two models have different mathematical structures. The regression model works with a linear predictor from all these factors, whereas the Gaussian process model uses an unnormalized probability density—a prior distribution—that encourages people with similar predictors to have similar outcomes.

My guess is that the two models would do about the same, following the general principle that the most important thing about a statistical procedure is not what you do with the data, but what data you use. In either case, though, some thought might need to go into the modeling. For example, you’ll want to include state-level predictors. As we’ve discussed before, when your data are sparse, multilevel regression works much better if you have good group-level predictors, and some of the examples where MRP appears to perform poorly are examples where people are not using available group-level information.

Anyway, to continue with the question above, asking about neural nets and statistical models: Actually, neural nets are a special case of statistical models, typically Bayesian hierarchical logistic regression with latent parameters. But neural nets are typically estimated in a different way: the resulting posterior distributions will generally be multimodal, so rather than try the hopeless task of traversing the whole posterior distribution, we’ll use various approximate methods, which then are evaluated using predictive accuracy.

By the way, Radford’s answer to my question back in 1994 was that he was too busy to try fitting his models to my data. And I guess I was too busy too, because I didn’t try it either! More recently, I asked a computer scientist and he said he thought the datasets I was working with were too small for his methods to be very useful. More generally, though, I like the idea of RPP, also the idea of using stacking to combine Bayesian inferences from different fitted models.

Abortion attitudes: The polarization is among richer, more educated whites

Abortion has been in the news lately. A journalist asked me something about abortion attitudes and I pointed to a post from a few years ago about partisan polarization on abortion. Also this with John Sides on why abortion consensus is unlikely. That was back in 2009, and consensus doesn’t seem any more likely today.

It’s perhaps not well known (although it’s consistent with what we found in Red State Blue State) that just about all the polarization on abortion comes from whites, and most of that is from upper-income, well-educated whites. Here’s an incomplete article that Yair and I wrote on this from 2010; we haven’t followed up on it recently.

Alternatives and reality

I saw this cartoon from Randall Munroe, and it reminded me of something I wrote a while ago.

The quick story is that I don’t think the alternative histories within alternative histories are completely arbitrary. It seems to me that there’s a common theme in the best alternative history stories, a recognition that our world is the true one and that the people in the stories are living in a fake world. This is related to the idea that the real world is overdetermined, so these alternatives can’t ultimately make sense. From that perspective, characters living within an alternative history are always at risk of realizing that their world is not real, and the alternative histories they themselves construct can be ways of channeling that recognition.

I was also thinking about this again the other day when rereading T. J. Shippey’s excellent The Road to Middle Earth. Tolkien put a huge amount of effort into rationalizing his world, not just in its own context (internal consistency) but also making it fit into our world. It seems that he felt that a completely invented world would not ultimately make sense; it was necessary for his world to be reconstructed, or discovered, and for that it had to be real.

My talks at the University of Chicago this Thursday and Friday

Political Economy Workshop (12:30pm, Thurs 23 May 2019, Room 1022 of Harris Public Policy (Keller Center) 1307 E 60th Street):

Political Science and the Replication Crisis

We’ve heard a lot about the replication crisis in science (silly studies about ESP, evolutionary psychology, miraculous life hacks, etc.), how it happened (p-values, forking paths), and proposed remedies both procedural (preregistration, publishing of replications) and statistical (replacing hypothesis testing with multilevel modeling and decision analysis). But also of interest are the theories, implicit or explicit, associated with unreplicated or unreplicable work in medicine, psychology, economics, policy analysis, and political science: a model of the social and biological world driven by hidden influences, a perspective which we argue is both oversimplified and needlessly complex. When applied to political behavior, these theories seem to be associated with a cynical view of human nature that lends itself to anti-democratic attitudes. Fortunately, the research that is said to support this view has been misunderstood.

Some recommended reading:

[2015] Disagreements about the strength of evidence

[2015] The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective

[2016] The mythical swing voter.

[2018] Some experiments are just too noisy to tell us much of anything at all: Political science edition

Quantitative Methods Committee and QMSA (10:30am, Fri 24 May 2019, 5757 S. University in Saieh Hall (lower Level) Room 021):

Multilevel Modeling as a Way of Life

The three challenges of statistical inference are: (1) generalizing from sample to population, (2) generalizing from control to treatment group, and (3) generalizing from observed measurements to the underlying constructs of interest. Multilevel modeling is central to all of these tasks, in ways that you might not realize. We illustrate with several examples in social science and public health.

Some recommended reading:

[2004] Treatment effects in before-after data

[2012] Why we (usually) don’t have to worry about multiple comparisons

[2013] Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups

[2018] Bayesian aggregation of average data: An application in drug development

Vigorous data-handling tied to publication in top journals among public health researchers

Gur Huberman points us to this news article by Nicholas Bakalar, “Vigorous Exercise Tied to Macular Degeneration in Men,” which begins:

A new study suggests that vigorous physical activity may increase the risk for vision loss, a finding that has surprised and puzzled researchers.

Using questionnaires, Korean researchers evaluated physical activity among 211,960 men and women ages 45 to 79 in 2002 and 2003. Then they tracked diagnoses of age-related macular degeneration, from 2009 to 2013. . . .

They found that exercising vigorously five or more days a week was associated with a 54 percent increased risk of macular degeneration in men. They did not find the association in women.

The study, in JAMA Ophthalmology, controlled for more than 40 variables, including age, medical history, body mass index, prescription drug use and others. . . . an accompanying editorial suggests that the evidence from such a large cohort cannot be ignored.

The editorial, by Myra McGuinness, Julie Simpson, and Robert Finger, is unfortunately written entirely from the perspective of statistical significance and hypothesis testing, but they raise some interesting points nonetheless (for example, that the subgroup analysis can be biased if the matching of treatment to control group is done for the entire sample but not for each subgroup).

The news article is not so great, in my opinion. Setting aside various potential problems with the study (including those issues raised by McGuinness et al. in their editorial), the news article makes the mistake of going through all the reported estimates and picking the largest one. That’s selection bias right there. “A 54 percent increased risk,” indeed. If you want to report the study straight up, no criticism, fine. But then you should report the estimated main effect, which was 23% (as reported in the journal article, “(HR, 1.23; 95% CI, 1.02-1.49)”). That 54% number is just ridiculous. I mean, sure, maybe the effect really is 54%, who knows? But such an estimate is not supported by the data: it’s the largest of a set of reported numbers, any of which could’ve been considered newsworthy. If you take a set of numbers and report only the maximum, you’re introducing a bias.
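Here is a minimal simulation sketch of that bias. Everything in it beyond the 1.23 hazard ratio is an assumption made up for illustration, not taken from the study: it supposes 10 hypothetical subgroups that all share the same true log hazard ratio, each estimated with independent normal noise and a made-up standard error of 0.2. Averaging over many simulated studies, the mean of the subgroup estimates stays near the truth while the largest estimate sits well above it.

```python
import math
import random
import statistics


def simulate_max_bias(true_log_hr, se, n_subgroups, n_sims, seed=0):
    """Return (average of subgroup means, average of subgroup maxima)."""
    rng = random.Random(seed)
    means, maxes = [], []
    for _ in range(n_sims):
        # Hypothetical subgroup estimates: same true effect, independent noise.
        estimates = [rng.gauss(true_log_hr, se) for _ in range(n_subgroups)]
        means.append(statistics.fmean(estimates))
        maxes.append(max(estimates))
    return statistics.fmean(means), statistics.fmean(maxes)


true_log_hr = math.log(1.23)  # main-effect estimate quoted in the post
avg_mean, avg_max = simulate_max_bias(
    true_log_hr, se=0.2, n_subgroups=10, n_sims=5000
)

print("true hazard ratio:               ", round(math.exp(true_log_hr), 2))
print("typical average-of-subgroups HR: ", round(math.exp(avg_mean), 2))
print("typical largest-subgroup HR:     ", round(math.exp(avg_max), 2))
```

The size of the gap depends entirely on the assumed noise level and number of subgroups; the direction of the bias does not.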

Part of the problem, I suppose, is incentives. If you’re a health/science reporter, you have a few goals. One is to report exciting breakthroughs. Another is to get attention and clicks. Both goals are served, at least in the short term, by exaggeration. Even if it’s not on purpose.

OK, on to the journal article. As noted above, it’s based on a study of 200,000 people: “individuals between ages 45 and 79 years who were included in the South Korean National Health Insurance Service database from 2002 through 2013,” of whom half engaged in vigorous physical activity and half did not. It appears that the entire database contained about 500,000 people, of whom 200,000 were selected for analysis in this comparison. The outcome is neovascular age-related macular degeneration, which seems to be measured by a prescription for ranibizumab, which I guess was the drug of choice for this condition in Korea at that time? Based on the description in the paper, I’m assuming they didn’t have direct data on the medical conditions, only on what drugs were prescribed, and when, hence “ranibizumab use from August 1, 2009, indicated a diagnosis of recently developed active (wet) neovascular AMD by an ophthalmologist.” I don’t know if there were people with neovascular AMD who were not captured in this dataset because they never received this diagnosis.

In their matched sample of 200,000 people, 448 were recorded as having neovascular AMD: 250 in the vigorous exercise group and 198 in the control group. The data were put into a regression analysis, yielding an estimated hazard ratio of 1.23 with 95% confidence interval of [1.02, 1.49]. Also lots of subgroup analyses: unsurprisingly, the point estimate is higher for some subgroups than others; also unsurprisingly, some of the subgroup analyses reach statistical significance and some do not.
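As a rough sanity check on those numbers, here is a back-of-the-envelope sketch of the crude risk ratio from the raw counts. It is not the paper's regression: it assumes two groups of exactly 100,000 each (the "half and half" description above), ignores the matching and the covariates, and uses the usual large-sample normal approximation on the log scale.

```python
import math

cases_exercise = 250    # neovascular AMD cases, vigorous exercise group
cases_control = 198     # neovascular AMD cases, control group
n_per_group = 100_000   # assumption: 200,000 people split into equal halves

# Crude risk ratio: ratio of the two observed proportions.
risk_ratio = (cases_exercise / n_per_group) / (cases_control / n_per_group)

# Standard error of log(risk ratio) for two independent binomial samples.
se_log_rr = math.sqrt(
    1 / cases_exercise - 1 / n_per_group
    + 1 / cases_control - 1 / n_per_group
)

ci_low = math.exp(math.log(risk_ratio) - 1.96 * se_log_rr)
ci_high = math.exp(math.log(risk_ratio) + 1.96 * se_log_rr)

print(f"crude risk ratio: {risk_ratio:.2f}")
print(f"approx. 95% interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

Under these assumptions the crude ratio is about 1.26 with an interval of roughly 1.05 to 1.52, close to the reported hazard ratio and interval, so the headline estimate is mostly just the raw difference in counts.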

It is misleading to report that vigorous physical activity was associated with a greater hazard rate for neovascular AMD in men but not in women. Both the journal article and the news article made this mistake. The difference between “significant” and “non-significant” is not itself statistically significant.
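To see why, here is a small sketch with made-up numbers (they are not the estimates for men and women from this study). Two hypothetical subgroup estimates have the same standard error; one clears the conventional 1.96 threshold and the other does not, yet the difference between them is nowhere near significant.

```python
import math

# Hypothetical subgroup estimates and standard errors, for illustration only.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

z_a = est_a / se_a
z_b = est_b / se_b

# Difference between two independent estimates: the variances add.
diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)
z_diff = diff / se_diff

print(f"subgroup A: z = {z_a:.2f}")
print(f"subgroup B: z = {z_b:.2f}")
print(f"difference: {diff:.1f} with se {se_diff:.1f}, z = {z_diff:.2f}")
```

The comparison that matters for a claim like "in men but not in women" is the last line, not the first two.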

So what do I think about all this? First, the estimates are biased due to selection on statistical significance (see, for example, section 2.1 here). Second, given how surprised everyone is, this suggests a prior distribution on any effect that should be concentrated near zero, which would pull all estimates toward 0 (or pull all hazard ratios toward 1), and I expect that the 95% intervals would then all include the null effect. Third, beyond all the selection mentioned above, there’s the selection entailed in studying this particular risk factor and this particular outcome. In this big study, you could study the effect of just about any risk factor X on just about any outcome Y. I’d like to see a big grid of all these things, all fit with a multilevel model. Until then, we’ll need good priors on the effect size for each study, or else some corrections for type M and type S errors.
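A minimal sketch of the second point, using a normal approximation on the log hazard ratio scale. The data summary (1.23, with interval 1.02 to 1.49) is from the journal article as quoted above; the prior is a pure assumption chosen for illustration, normal with mean 0 and standard deviation 0.1 on the log scale, and a different prior would give a different answer. The sketch does nothing about the selection issues in the first and third points.

```python
import math

# Reported main effect and 95% interval (hazard ratio scale).
hr, hr_low, hr_high = 1.23, 1.02, 1.49

# Move to the log scale and back out the standard error from the interval.
log_hr = math.log(hr)
se = (math.log(hr_high) - math.log(hr_low)) / (2 * 1.96)

# Hypothetical prior concentrated near zero effect (assumption, not from the post).
prior_mean = 0.0
prior_sd = 0.1

# Normal-normal updating: precisions add, means are precision-weighted.
prec_data = 1 / se**2
prec_prior = 1 / prior_sd**2
post_var = 1 / (prec_data + prec_prior)
post_mean = post_var * (prec_data * log_hr + prec_prior * prior_mean)
post_sd = math.sqrt(post_var)

post_hr = math.exp(post_mean)
post_low = math.exp(post_mean - 1.96 * post_sd)
post_high = math.exp(post_mean + 1.96 * post_sd)

print(f"posterior hazard ratio: {post_hr:.2f}")
print(f"posterior 95% interval: [{post_low:.2f}, {post_high:.2f}]")
```

With this particular made-up prior, the estimate is pulled about halfway toward a hazard ratio of 1 and the interval includes 1.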

Just reporting the raw estimate from one particular study like that: No way. That’s a recipe for future non-replicable results. Sorry, NYT, and sorry, JAMA: you’re gettin played.

P.S. Gur wrote:

The topic may merit two posts — one for the male subpopulation, another for the female.

To which I replied:

20 posts, of which 1 will be statistically significant.

P.P.S. On the plus side, Jonathan Falk pointed me the other day to this post by Scott Alexander, who writes the following about a test of a new psychiatric drug:

The pattern of positive results shows pretty much the random pattern you would expect from spurious findings. They’re divided evenly among a bunch of scales, with occasional positive results on one scale followed by negative results on a very similar scale measuring the same thing. Most of them are only the tiniest iota below p = 0.05. Many of them only work at 40 mg, and disappear in the 80 mg condition; there are occasional complicated reasons why drugs can work better at lower doses, but Occam’s razor says that’s not what’s happening here. One of the results only appeared in Stage 2 of the trial, and disappeared in Stage 1 and the pooled analysis. This doesn’t look exactly like they just multiplied six instruments by two doses by three ways of grouping the stages, got 36 different cells, and rolled a die in each. But it’s not too much better than that. Who knows, maybe the drug does something? But it sure doesn’t seem to be a particularly effective antidepressant, even by our very low standards for such. Right now I am very unimpressed.

It’s good to see this mode of thinking becoming so widespread. It makes me feel that things are changing in a good way.

So, some good news for once!

Hey, people are doing the multiverse!

Elio Campitelli writes:

I just saw this image in a paper discussing the weight of evidence for a “hiatus” in the global warming signal and immediately thought of the garden of forking paths.

From the paper:

Tree representation of choices to represent and test pause-periods. The ‘pause’ is defined as either no-trend or a slow-trend. The trends can be measured as ‘broken’ or ‘continuous’ trends. The data used to assess the trends can come from HadCRUT, GISTEMP, or other datasets. The bottom branch represents the use of ‘historical’ versions of the datasets as they existed, or contemporary versions providing full dataset ‘hindsight’. The colour coded circles at the bottom of the tree indicate our assessment of the level of evidence (fair, weak, little or no) for the tests undertaken for each set of choices in the tree. The ‘year’ rows are for assessments undertaken at each year in time.

Thus, descending the tree in the figure, a typical researcher makes choices (explicitly or implicitly) about how to define the ‘pause’ (no-trend or slow-trend), how to model the pause-interval (as broken or continuous trends), which (and how many) datasets to use (HadCRUT, GISTEMP, Other), and what versions to use for the data with what foresight about corrections to the data (historical, hindsight). For example, a researcher who chose to define the ‘pause’ as no-trend and selected isolated intervals to test trends (broken trends) using HadCRUT3 data would be following the left-most branches of the tree.

Actually, it’s the multiverse.

Data quality is a thing.

I just happened to come across this story, where a journalist took some garbled data and spun a false tale which then got spread without question.

It’s a problem. First, it’s a problem that people will repeat unjustified claims. It’s also a problem that when data are attached, you can get complete credulity, even for claims that are implausible on the face of it.

So it’s good to be reminded: “Data” are just numbers. You need to know where the data came from before you can learn anything from them.

“Did Jon Stewart elect Donald Trump?”

I wrote this post a couple weeks ago and scheduled it for October, but then I learned from a reporter that the research article under discussion was retracted, so it seemed to make sense to post this right away while it was still newsworthy.

My original post is below, followed by a post script regarding the retraction.

Matthew Heston writes:

First time, long time. I don’t know if anyone has sent over this recent paper [“Did Jon Stewart elect Donald Trump? Evidence from television ratings data,” by Ethan Porter and Thomas Wood] which claims that Jon Stewart leaving The Daily Show “spurred a 1.1% increase in Trump’s county-level vote share.”

I’m not a political scientist, and not well versed in the methods they say they’re using, but I’m skeptical of this claim. One line that stood out to me was: “To put the effect size in context, consider the results from the demographic controls. Unsurprisingly, several had significant results on voting. Yet the effects of The Daily Show’s ratings decline loomed larger than several controls, such as those related to education and ethnicity, that have been more commonly discussed in analyses of the 2016 election.” This seems odd to me, as I wouldn’t expect a TV show host change to have a larger effect than these other variables.

They also mention that they’re using “a standard difference-in-difference approach.” As I mentioned, I’m not too familiar with this approach. But my understanding is that they would be comparing pre- and post- treatment differences in a control and treatment group. Since the treatment in this case is a change in The Daily Show host, I’m unsure of who the control group would be. But maybe I’m missing something here.

Heston points to our earlier posts on the Fox news effect.

Anyway, what do I think of this new claim? The answer is that I don’t really know.

Let’s work through what we can.

In reporting any particular effect there’s some selection bias, so let’s start by assuming an Edlin factor of 1/2, so now the estimated effect of Jon Stewart goes from 1.1% to 0.55% in Trump’s county-level vote share. Call it 0.6%. Vote share is approximately 50%, so a 0.6% change is approximately a 0.3 percentage point change in the vote. Would this have swung the election? I’m not sure, maybe not quite.
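The arithmetic in that paragraph, spelled out as a short sketch. The numbers are the ones used above; the variable names are just labels for this illustration, and the 1.1% is treated as a relative change in vote share, as in the paragraph.

```python
reported_effect_pct = 1.1        # claimed relative increase in Trump's vote share, in percent
edlin_factor = 0.5               # assumed shrinkage for selection bias
baseline_vote_share_pct = 50.0   # approximate vote share, in percent

# Step 1: shrink the reported effect by the Edlin factor.
shrunk_effect_pct = reported_effect_pct * edlin_factor

# Step 2: round up, as in the text ("Call it 0.6%").
rounded_effect_pct = 0.6

# Step 3: a relative change of 0.6% applied to a 50% vote share,
# expressed in percentage points of the vote.
change_in_points = baseline_vote_share_pct * rounded_effect_pct / 100

print(f"shrunk effect: {shrunk_effect_pct:.2f}%")
print(f"change in vote share: {change_in_points:.1f} percentage points")
```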

Let’s assume the effect is real. How to think about it? It’s one of many such effects, along with other media outlets, campaign tactics, news items, etc.

A few years ago, Noah Kaplan, David Park, and I wrote an article attempting to distinguish between what we called the random walk and mean-reversion models of campaigning. The random walk model posits that the voters are where they are, and campaigning (or events more generally) moves them around. In this model, campaign effects are additive: +0.3 here, -0.4 there, and so forth. In contrast, the mean-reversion model starts at the end, positing that the election outcome is largely determined by the fundamentals, with earlier fluctuations in opinion mostly being a matter of the voters coming to where they were going to be. After looking at what evidence we could find, we concluded that the mean-reversion model made more sense and was more consistent with the data. This is not to say that the Jon Stewart show would have no effect, just that it’s one of many interventions during the campaign, and I can’t picture each of them having an independent effect and these effects all adding up.

P.S. After the retraction

The article discussed above was retracted because the analysis had a coding error.

What to say given this new information?

First, I guess Heston’s skepticism is validated. When you see a claim that seems too big to be true (as here or here), maybe it’s just mistaken in some way.

Second, I too have had to correct a paper whose empirical claims were invalidated by a coding error. It happens—and not just to Excel users!

Third, maybe the original reaction to that study was a bit too strong. See the above post: Even had the data shown what had originally been claimed, the effect they found was not as consequential as it might’ve seemed at first. Setting aside all questions of data errors and statistical errors, there’s a limit to what can be learned about a dynamic process, an election campaign, from an isolated study.

I am concerned that all our focus on causal identification, important as it is, can lead researchers, journalists, and members of the general public to overconfidence in theories as a result of isolated studies, without always recognizing that real life is more complicated. I had a similar feeling a few years ago regarding the publicity surrounding the college-football-and-voting study. The particular claims regarding football and voting have since been disputed, but even if you accept the original study as is, its implications aren’t as strong as had been claimed in the press. Whatever these causal effects are, they vary by person and scenario, and they’re not occurring in isolation.

“In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous,” but “Over 20 journals turned down her paper . . . and nobody wanted to fund privacy research that might reach uncomfortable conclusions.”

Tom Daula writes:

I think this story from John Cook is a different perspective on replication and how scientists respond to errors.

In particular the final paragraph:

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In [Latanya] Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.

I think most of your scientific error stories follow this pattern. The error is pointed out privately and then publicized. Of course in most of your posts a private email is met with hostility, the error is publicized, and then the scientist digs in. The good stories are when the authors admit and publicize the error themselves.

Replication, especially in psychology, fits into this because there is no “single responsible party” so “calling attention to the problem [is] the only way to make things better.”

I imagine Latanya Sweeney and you share similar frustrations.

It’s an interesting story. I was thinking about this recently when reading one of Edward Winter’s chess notes collections. These notes are full of stories of sloppy writers copying things without citation, reproducing errors that have appeared elsewhere, introducing new errors (see an example here with follow-up here). Anyway, what’s striking to me is that so many people just don’t seem to care about getting their facts wrong. Or, maybe they do care, but not enough to fix their errors or apologize or even thank the people who point out the mistakes that they’ve made. I mean, why bother writing a chess book if you’re gonna put mistakes in it? It’s not like you can make a lot of money from these things.

Sweeney’s example is of course much more important, but sometimes when thinking about a general topic (in this case, authors getting angry when their errors are revealed to the world) it can be helpful to think about minor cases too.

“MRP is the Carmelo Anthony of election forecasting methods”? So we’re doing trash talking now??

What’s the deal with Nate Silver calling MRP “the Carmelo Anthony of forecasting methods”?

Someone sent this to me:

and I was like, wtf? I don’t say wtf very often—at least, not on the blog—but this just seemed weird.

For one thing, Nate and I did a project together once using MRP: this was our estimate of attitudes on health care reform by age, income, and state:

Without MRP, we couldn’t’ve done anything like it.

So, what gives?

Here’s a partial list of things that MRP has done:

– Estimating public opinion in slices of the population

– Improved analysis using the voter file

– Polling using the Xbox that outperformed conventional poll aggregates

– Changing our understanding of the role of nonresponse in polling swings

– Post-election analysis that’s a lot more trustworthy than exit polls

OK, sure, MRP has solved lots of problems, it’s revolutionized polling, no matter what Team Buggy Whip says.

That said, it’s possible that MRP is overrated. “Overrated” is a difference between rated quality and actual quality. MRP, wonderful as it is, might well be rated too highly in some quarters. I wouldn’t call MRP a “forecasting method,” but that’s another story.

I guess the thing that bugged me about the Carmelo Anthony comparison is that my impression from reading the sports news is not just that Anthony is overrated but that he’s an actual liability for his teams. Whereas I see MRP, overrated as it may be (I’ve seen no evidence that MRP is overrated but I’ll accept this for the purpose of argument), as still a valuable contributor to polling.

Ten years ago . . .

The end of the aughts. It was a simpler time. Nate Silver was willing to publish an analysis that used MRP. We all thought embodied cognition was real. Donald Trump was a reality-TV star. Kevin Spacey was cool. Nobody outside of suburban Maryland had heard of Beach Week.

And . . . Carmelo Anthony got lots of respect from the number crunchers.

Check this out:

So here’s the story according to Nate: MRP is like Carmelo Anthony because they’re both overrated. But Carmelo Anthony isn’t overrated, he’s really underrated. So maybe Nate’s MRP jab was just a backhanded MRP compliment?

Simpler story, I guess, is that back around 2010 Nate liked MRP and he liked Carmelo. Back then, he thought the people who thought Carmelo was overrated were wrong. In 2018, he isn’t so impressed with either of them. Nate’s impressions of MRP and Carmelo Anthony go up and down together. That’s consistent, I guess.

In all seriousness . . .

Unlike Nate Silver, I claim no expertise on basketball. For all I know, Tim Tebow will be starting for the Knicks next year!

I do claim some expertise on MRP, though. Nate described MRP as “not quite ‘hard’ data.” I don’t really know what Nate meant by “hard” data—ultimately, these are all just survey responses—but, in any case, I replied:

I guess MRP can mean different things to different people. All the MRP analyses I’ve ever published are entirely based on hard data. If you want to see something that’s a complete mess and is definitely overrated, try looking into the guts of classical survey weighting (see for example this paper). Meanwhile, Yair used MRP to do these great post-election summaries. Exit polls are a disaster; see for example here.

Published poll toplines are not the data, warts and all; they’re processed data, sometimes not adjusted for enough factors as in the notorious state polls in 2016. I agree with you that raw data is the best. Once you have raw data, you can make inferences for the population. That’s what Yair was doing. For understandable commercial reasons, lots of pollsters will release toplines and crosstabs but not raw data. MRP (or, more generally, RRP) is just a way of going from the raw data to make inference about the general population. It’s the general population (or the population of voters) that we care about. The people in the sample are just a means to an end.

Anyway, if you do talk about MRP and how overrated it is, you might consider pointing people to some of those links to MRP successes. Hey, here’s another one: we used MRP to estimate public opinion on health care. MRP has quite a highlight reel, more like Lebron or Steph or KD than Carmelo, I’d say!

One thing I will say is that data and analysis go together:

– No modern survey is good enough to be able to just interpret the results without any adjustment. Nonresponse is just too big a deal. Every survey gets adjusted, but some don’t get adjusted well.

– No analysis method can do it on its own without good data. All the modeling in the world won’t help you if you have serious selection bias.

Yair added:

Maybe it’s just a particularly touchy week for Melo references.

Both Andy and I would agree that MRP isn’t a silver bullet. But nothing is a silver bullet. I’ve seen people run MRP with bad survey data, bad poststratification data, and/or bad covariates in a model that’s way too sparse, and then over-promise about the results. I certainly wouldn’t endorse that. On the other side, obviously I agree with Andy that careful uses of MRP have had many successes, and it can improve survey inferences, especially compared to traditional weighting.

I think maybe you’re talking specifically about election forecasting? I haven’t seen comparisons of your forecasts to YouGov or PredictWise or whatever else. My vague sense pre-election was that they were roughly similar, i.e., that the meaty part of the curves overlapped. Maybe I’m wrong and your forecasts were much better this time—but non-MRP forecasters have also done much worse than you, so is that an indictment of MRP, or are you just really good at forecasting?

More to my main point—in one of your recent podcasts, I remember you said something about how forecasts aren’t everything, and people should look at precinct results to try to get beyond the toplines. That’s roughly what we’ve been trying to do in our post-election project, which has just gotten started. We see MRP as a way to combine all the data—pre-election voter file data, early voting, precinct results, county results, polling—into a single framework. Our estimates aren’t going to be perfect, for sure, but hopefully an improvement over what’s been out there, especially at sub-national levels. I know we’d do better if we had a lot more polling data, for instance. FWIW I get questions from clients all the time about how demographic groups voted in different states. Without state-specific survey data, which is generally unavailable and often poorly collected/weighted, not sure what else you can do except some modeling like MRP.

Maybe you’d rather see the raw unprocessed data like the precinct results. Fair enough, sometimes I do too! My sense is the people who want that level of detail are in the minority of the minority. Still, we’re going to try to do things like show the post-processed MRP estimates, but also some of the raw data to give intuition. I wonder if you think this is the right approach, or if you think something else would be better.

And Ryan Enos writes:

To follow up on this—I think you’ll all be interested in seeing the back and forth between Nate and Lynn Vavreck who was interviewing him. It was more of a discussion of tradeoffs between different approaches, than a discussion of what is wrong with MRP. Nate’s MRP alternative was to do a poll in every district, which I think we can all agree would be nice – if not entirely realistic. Although, as Nate pointed out, some of the efforts from the NY Times this cycle made that seem more realistic. In my humble opinion, Lynn did a nice job pushing Nate on the point that, even with data like the NY Times polls, you are still moving beyond raw data by weighting and, as Andrew points out, we often don’t consider how complex this can be (I have a common frustration with academic research about how much out of the box survey weights are used and abused).

I don’t actually pay terribly close attention to forecasting – but in my mind, Nate and everybody else in the business is doing a fantastic job and the YouGov MRP forecasts have been a revelation. From my perspective, as somebody who cares more about what survey data can teach us about human behavior and important political phenomena, I think MRP has been a revelation in that it has allowed us to infer opinion in places, such as metro areas, where it would otherwise be missing. This has been one of the most important advances in public opinion research in my lifetime. Where the “overrated” part becomes true is that just like every other scientific advance, people can get too excited about what it can do without thinking about what assumptions are going into the method and this can lead to believing it can do more than it can—but this is true of everything.

Yair, to your question about presentation—I am a big believer in raw data and I think combining the presentation of MRP with something like precinct results, despite the dangers of ecological error, can be really valuable because it can allow people to check MRP results with priors from raw data.

It’s fine to do a poll in every district but then you’d still want to do MRP in order to adjust for nonresponse, estimate subgroups of the population, study public opinion in between the districtwide polls, etc.
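To make the adjustment step concrete, here is a minimal sketch of the "P" in MRP for a single district. Everything in it is made up for illustration: the age cells, the poll counts, the population counts, and the `prior_strength` knob. The shrinkage toward the overall mean is only a crude stand-in for a real multilevel regression, which you would fit with something like Stan.

```python
# Hypothetical poll of one district: (respondents, supporters) in each age cell.
poll = {"18-34": (20, 13), "35-64": (90, 45), "65+": (140, 56)}

# Hypothetical population counts for the same cells (say, from a voter file).
population = {"18-34": 30_000, "35-64": 50_000, "65+": 20_000}


def raw_estimate(poll):
    """Unadjusted topline: total supporters over total respondents."""
    n = sum(r for r, _ in poll.values())
    y = sum(s for _, s in poll.values())
    return y / n


def cell_means(poll):
    """Raw support rate within each cell, with no pooling."""
    return {cell: s / r for cell, (r, s) in poll.items()}


def partially_pooled(poll, prior_strength=20.0):
    """Shrink each cell mean toward the overall mean.

    A crude stand-in for the multilevel regression step: small cells
    are pulled harder toward the overall rate than large cells.
    """
    overall = raw_estimate(poll)
    return {
        cell: (s + prior_strength * overall) / (r + prior_strength)
        for cell, (r, s) in poll.items()
    }


def poststratify(cell_estimates, population):
    """Weight cell estimates by population share, not by sample share."""
    total = sum(population.values())
    return sum(cell_estimates[c] * population[c] for c in population) / total


if __name__ == "__main__":
    print("raw topline:        ", round(raw_estimate(poll), 3))
    print("poststratified:     ", round(poststratify(cell_means(poll), population), 3))
    print("pooled + poststrat.:", round(poststratify(partially_pooled(poll), population), 3))
```

In this made-up poll, older respondents are overrepresented relative to the population, so the raw topline and the poststratified estimate disagree. That gap is the nonresponse adjustment, and you would want it even with a poll in every district.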

Scandal! Mister P appears in British tabloid.

Tim Morris points us to this news article:

And here’s the kicker:

Mister P.

Not quite as cool as the time I was mentioned in Private Eye, but it’s still pretty satisfying.

My next goal: Getting a mention in Sports Illustrated. (More on this soon.)

In all seriousness, it’s so cool when methods that my collaborators and I have developed are just out there, for anyone to use. I only wish Tom Little were around to see it happening.

P.S. Some commenters are skeptical, though:

I agree that polls can be wrong. The issue is not so much the size of the sample but rather that the sample can be unrepresentative. But I do think that polls provide some information; it’s better than just guessing.

P.P.S. Unrelatedly, Morris wrote, with Ian White and Michael Crowther, this article on using simulation studies to evaluate statistical methods.

Fake-data simulation. Yeah.
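In that spirit, here is a minimal fake-data sketch of the point in the P.S. above, that the problem is not so much sample size as representativeness. All the numbers are assumptions chosen for illustration: a population split 50/50, supporters who answer an opt-in poll at a rate of 0.3 versus 0.2 for everyone else, and a random sample of 1,000 for comparison.

```python
import random


def simulate(seed=1, pop_size=200_000, true_support=0.5,
             p_respond_yes=0.3, p_respond_no=0.2, small_n=1000):
    """Compare a huge unrepresentative sample with a small random one."""
    rng = random.Random(seed)

    # Fake population: True means the person supports the candidate.
    population = [rng.random() < true_support for _ in range(pop_size)]

    # Large opt-in sample: supporters are more likely to respond.
    big = [v for v in population
           if rng.random() < (p_respond_yes if v else p_respond_no)]

    # Small simple random sample from the same population.
    small = rng.sample(population, small_n)

    truth = sum(population) / pop_size
    return truth, len(big), sum(big) / len(big), sum(small) / small_n


if __name__ == "__main__":
    truth, n_big, big_est, small_est = simulate()
    print("true support:           ", round(truth, 3))
    print("opt-in sample, n =", n_big, "->", round(big_est, 3))
    print("random sample, n = 1000 ->", round(small_est, 3))
```

Under these assumptions the opt-in sample has tens of thousands of respondents and still lands far from the truth, because its error is bias rather than noise. The small random sample is noisier but centered in the right place. Neither is "just guessing," which is the whole reason to adjust the data you have.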

Horse-and-buggy era officially ends for survey research

Peter Enns writes:

Given the various comments on your blog about evolving survey methods (e.g., Of buggy whips and moral hazards; or, Sympathy for the Aapor), I thought you might be interested that the Roper Center has updated its acquisitions policy and is now accepting non-probability samples and other methods. This is an exciting move for the Roper Center.

Jeez. I wonder what the President of the American Association of Buggy-Whip Manufacturers thinks about that!

In all seriousness, let’s never forget that our inferences are only as good as our data. Whether your survey responses come by telephone, or internet, or any other method, you want to put in the effort to get good data from a representative sample, and then to adjust as necessary. There’s no easy solution, it just needs the usual eternal vigilance.

P.S. I’m posting this one now, rather than with the usual six-month delay, because you can now go to the Roper Center and get these polls. I didn’t want you to have to wait!

When we had fascism in the United States

I was reading this horrifying and hilarious story by Colson Whitehead, along with an excellent article by Adam Gopnik in the New Yorker (I posted a nitpick on it a couple days ago) on the Reconstruction and post-Reconstruction era in the United States, and I was suddenly reminded of something.

In one of the political science classes I took in college, we were told that one of the big questions about U.S. politics, compared to Europe, is why we’ve had no socialism and no fascism. Sure, there have been a few pockets of socialism where they’ve won a few elections, and there was Huey Long in 1930s Louisiana, but nothing like Europe, where the Left and the Right have ruled entire countries and where, at least for a time, socialism and fascism were the ideologies of major parties.

That’s what we were taught. But, as Whitehead and Gopnik (and Henry Louis Gates, the author of the book that Gopnik was reviewing) remind us, that’s wrong. We have had fascism here for a long time—in the post-Reconstruction South.

What’s fascism all about? Right-wing, repressive government, political power obtained and maintained through violence and the threat of violence, a racist and nationalist ideology, and a charismatic leader.

The post-Reconstruction South didn’t have a charismatic leader, but the other parts of the description fit, so on the whole I’d call it a fascist regime.

In the 1930s, Sinclair Lewis wrote It Can’t Happen Here about a hypothetical fascist Americanism, and there was that late book by Philip Roth with a similar theme. I guess other people have had this thought so I googled *it has happened here* and came across this post talking about fascism in the United States, pointing to Red scares, the internment of Japanese Americans in WW2, and FBI infiltration of the civil rights movement. All these topics are worth writing about, but none of them seem to me to be even nearly as close to fascism as what happened for close to a century in the post-Reconstruction South.

Louis Hartz wrote The Liberal Tradition in America back in the 1950s. The funny thing is, back in the 1950s there was still a lot of fascism down there.

But nobody made that connection to us when we were students.

Maybe the U.S. South just seemed unique, and the legacy of slavery distracted historians and political scientists so much they didn’t see the connection to fascism, a political movement with a nationalistic racist ideology that used violence to take and maintain power in a democratic system. It’s stunning in retrospect that Huey Long was discussed as a proto-fascist without any recognition that the entire South had a fascist system of government.

P.S. A commenter points to this article by Ezekiel Kweku and Jane Coaston from a couple years ago making the same point:

Fascism has happened before in America.

For generations of black Americans, the United States between the end of Reconstruction, around 1876, and the triumphs of the civil rights movement in the early 1960s was a fascist state.

Good to know that others have seen this connection before. It’s still notable, I think, that we weren’t aware of this all along.

Name this fallacy!

It’s the fallacy of thinking that, just cos you’re good at something, everyone should be good at it, and if they’re not, they’re just being stubborn and doing it badly on purpose.

I thought about this when reading this line from Adam Gopnik in the New Yorker:

[Henry Louis] Gates is one of the few academic historians who do not disdain the methods of the journalist . . .

Gopnik’s article is fascinating, and I have no doubt that Gates’s writing is both scholarly and readable.

My problem is with Gopnik’s use of the word “disdain.” The implication seems to be that other historians could write like journalists if they felt like it, but they just disdain to do so, maybe because they think it would be beneath their dignity, or maybe because of the unwritten rules of the academic profession.

The thing that Gopnik doesn’t get, I think, is that it’s hard to write well. Most historians can’t write like A. J. P. Taylor or Henry Louis Gates. Sure, maybe they could approach that level if they were to work hard at it, but it would take a lot of work, a lot of practice, and it’s not clear this would be the best use of their time and effort.

For a journalist to say that most academics “disdain the methods of the journalist” would be like me saying that most journalists “disdain the methods of the statistician.” OK, maybe some journalists actively disdain quantitative thinking—the names David Brooks and Gregg Easterbrook come to mind—but mostly I think it’s the same old story: math is hard, statistics is hard, these dudes are doing their best but sometimes their best isn’t good enough, etc. “Disdain” has nothing to do with it. To not choose to invest years of effort into a difficult skill that others can do better, to trust in the division of labor and do your best at what you’re best at . . . that can be a perfectly reasonable decision. If an academic historian does careful archival work and writes it up in hard-to-read prose—not on purpose but just cos hard-to-read prose is what he or she knows how to write—that can be fine. The idea would be that a journalist could write it up later for others. No disdaining. Division of labor, that’s all. Not everyone on the court has to be a two-way player.

I had a similar reaction a few years ago to Steven Pinker’s claim that academics often write so badly because “their goal is not so much communication as self-presentation—an overriding defensiveness against any impression that they may be slacker than their peers in hewing to the norms of the guild. Many of the hallmarks of academese are symptoms of this agonizing self-consciousness . . .” I replied that I think writing is just not so easy, and our discussion continued here.

Anyway, here’s the question. This fallacy, of thinking that when people can’t do what you can do, they’re just being stubborn . . . is there a name for it? The Expertise Fallacy??

Give this one a good name, and we can add it to the lexicon.