
John Le Carre is good at integrating thought and action

I was reading a couple old Le Carre spy novels. They have their strong points and their weak points; I’m not gonna claim that Le Carre is a great writer. He’s no George Orwell or Graham Greene. (This review by the great Clive James nails Le Carre perfectly.)

But I did notice one thing Le Carre does very well, something that I haven’t seen discussed before in his writing, which is the way he integrates thought and action. A character will be walking down the street, or having a conversation, or searching someone’s apartment, and will be going through a series of thoughts while doing things. The thoughts and actions go together.

Ummm, here’s an example:

It’s not that the above passage by itself is particularly impressive; it’s more that Le Carre does this consistently. So he’s not just writing an action novel with occasional ruminations; rather, the thoughts are part of the action.

Writing this, it strikes me that this is commonplace, almost necessary, in a bande dessinée, but much more rare in a novel.

Also it’s important when we are teaching and when we are writing technical articles and textbooks: we’re doing something and explaining our motivation and what we’re learning, all at once.

Pushing the guy in front of the trolley

So. I was reading the London Review of Books the other day and came across this passage by the philosopher Kieran Setiya:

Some of the most striking discoveries of experimental philosophers concern the extent of our own personal inconsistencies . . . how we respond to the trolley problem is affected by the details of the version we are presented with. It also depends on what we have been doing just before being presented with the case. After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five. . . .

I’m not up on this literature, but I was suspicious. Watching a TV show for 5 minutes can change your view so strongly?? I was reminded of the claim from a few years ago, that subliminal smiley faces had huge effects on attitudes toward immigration—it turns out the data showed no such thing. And I was bothered, because it seemed that a possibly false fact was being used as part of a larger argument about philosophy. The concept of “experimental philosophy”—that’s interesting, but only if the experiments make sense.

So I thought I’d look into this particular example.

I started by googling *saturday night live trolley problem* which led me to this article in Slate by Daniel Engber, “Does the Trolley Problem Have a Problem?: What if your answer to an absurd hypothetical question had no bearing on how you behaved in real life?”

OK, so Engber’s skeptical too. I searched in the article for Saturday Night Live and found this passage:

Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.

Which contained a link to this two-page article in Psychological Science by Piercarlo Valdesolo and David DeSteno, “Manipulations of Emotional Context Shape Moral Judgment.”

From that article:

The structure of such dilemmas often requires endorsing a personal moral violation in order to uphold a utilitarian principle. The well-known footbridge dilemma is illustrative. In it, the lives of five people can be saved through sacrificing another. However, the sacrifice involves pushing a rather large man off a footbridge to stop a runaway trolley before it kills the other five. . . . the proposed dual-process model of moral judgment suggests another unexamined route by which choice might be influenced: contextual sensitivity of affect. . . .

We examined this hypothesis using a paradigm in which 79 participants received a positive or neutral affect induction and immediately afterward were presented with the footbridge and trolley dilemmas embedded in a small set of nonmoral distractors.[1] The trolley dilemma is logically equivalent to the footbridge dilemma, but does not require consideration of an emotion-evoking personal violation to reach a utilitarian outcome; consequently, the vast majority of individuals select the utilitarian option for this dilemma.[2]

Here are the two footnotes to the above passage:

[1] Given that repeated consideration of dilemmas describing moral violations would rapidly reduce positive mood, we utilized responses to the matched set of the footbridge and trolley dilemmas as the primary dependent variable.

[2] Precise wording of the dilemmas can be found in Thomson (1986) or obtained from the authors.

I don’t understand footnote 1 at all. From my reading of it, I’d think that a matched set of the dilemmas corresponds to each participant in the experiment getting both questions, and then in the analysis having the responses compared. But from the published article it’s not clear what’s going on, as only 77 people seem to have been asked about the trolley dilemma compared to 79 asked about the footbridge—I don’t know what happened to those two missing responses—and, in any case, the dependent or outcome variables in the analyses are the responses to each question, one at a time. I’m not saying this to pick at the paper; I just don’t quite see how their analysis matches their described design. The problem isn’t just two missing people; it’s also that the numbers don’t align. In the data for the footbridge dilemma, 38 people get the control condition (“a 5-min segment taken from a documentary on a small Spanish village”) and 41 get the treatment (“a 5-min comedy clip taken from ‘Saturday Night Live'”). The entire experiment is said to have 79 participants. But for the trolley dilemma, it says that 40 got the control and 37 got the treatment. Maybe the data were garbled in some way? The paper was published in 2006, long before data sharing was any sort of standard, and this little example reminds us why we now think it good practice to share all data and experimental conditions.

Regarding footnote 2: I don’t have a copy of Thomson (1986) at hand, but some googling led me to this description by Michael Waldmann and Alex Wiegmann:

In the philosopher’s Judith Thomson’s (1986) version of the trolley dilemma, a situation is described in which a trolley whose brakes fail is about to run over five workmen who work on the tracks. However, the trolley could be redirected by a bystander on a side track where only one worker would be killed (bystander problem). Is it morally permissible for the bystander to throw the switch or is it better not to act and let fate run its course?

Now for the data. Valdesolo and DeSteno find the following results:

– Flip-the-switch-on-the-trolley problem (no fat guy, no footbridge): 38/40 flip the switch under the control condition, 33/37 flip the switch under the “Saturday Night Live” condition. That’s an estimated treatment effect of -0.06 with standard error 0.06.

– Footbridge problem (trolley, fat guy, footbridge): 3/38 push the man under the control condition, 10/41 push the man under the “Saturday Night Live” condition. That’s an estimated treatment effect of 0.16 with standard error 0.08.
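If you want to check that arithmetic, here is a quick sketch in Python (just the usual estimate and standard error for a difference in two proportions, using the counts quoted above):

```python
import math

def diff_in_proportions(y_treat, n_treat, y_ctrl, n_ctrl):
    # Difference in proportions and its standard error.
    p_t = y_treat / n_treat
    p_c = y_ctrl / n_ctrl
    est = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / n_treat + p_c * (1 - p_c) / n_ctrl)
    return est, se

# Flip-the-switch dilemma: 33/37 under the comedy clip vs. 38/40 under the control clip
print(diff_in_proportions(33, 37, 38, 40))  # roughly (-0.06, 0.06)

# Footbridge dilemma: 10/41 under the comedy clip vs. 3/38 under the control clip
print(diff_in_proportions(10, 41, 3, 38))   # roughly (0.16, 0.08)
```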

So from this set of experiments alone, I would not say it’s accurate to write that “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five.” For one thing, it’s not clear who the participants are in these experiments, so the description “Americans” seems too general. But, beyond that, we have a treatment with an effect -0.06 +/- 0.06 in one experiment and 0.16 +/- 0.08 in another: the evidence seems equivocal. Or, to put it another way, I wouldn’t expect such a large difference (“three times more likely”) to replicate in a new study or to be valid in the general population. (See for example section 2.1 of this paper for another example. The bias occurs because the study is noisy and there is selection on statistical significance.)

At this point I thought it best to dig deeper. Setiya’s article is a review of the book, “Philosophy within Its Proper Bounds,” by Edouard Machery. I looked up the book on Amazon, searched for “trolley,” and found this passage:

From this I learned that there were some follow-up experiments. The two papers cited are Divergent effects of different positive emotions on moral judgment, by Nina Strohminger, Richard Lewis, and David Meyer (2011), and To push or not to push? Affective influences on moral judgment depend on decision frame, by Bernhard Pastötter, Sabine Gleixner, Theresa Neuhauser, and Karl-Heinz Bäuml (2013).

I followed the links to both papers. Machery describes these as replications, but none of the studies in question are exact replications, as the experimental conditions differ from the original study. Strohminger et al. use audio clips of comedians, inspirational stories, and academic lectures: no Saturday Night Live, no video clips at all. And Pastötter et al. don’t use video or comedy: they use audio clips of happy or sad-sounding music.

I’m not saying that these follow-up studies have no value or that they should not be considered replications of the original experiment, in some sense. I’m bringing them up partly because details matter—after all, if the difference between a serious video and a comedy video could have a huge effect on a survey response, one could also imagine that it makes a difference whether stimuli involve speech or music, or whether they are audio or video—but also because of the flexibility, the “researcher degrees of freedom,” involved in whether to consider something as a replication at all. Recall that when a study does not successfully replicate, a common reaction is to point out differences between the old and new experimental conditions and then declare that the new study was not a real replication. But if the new study’s results are in the same direction as the old one’s, then it’s treated as a replication, no questions asked. So the practice of counting replications has a heads-I-win, tails-you-lose character. (For an extreme example, recall Daryl Bem’s paper where he claimed to present dozens of replications of his controversial ESP study. One of those purported replications was entitled “Further testing of the precognitive habituation effect using spider stimuli.” I think we can be pretty confident that if the spider experiment didn’t yield the desired results, Bem could’ve just said it wasn’t a real replication because his own experiment didn’t involve spiders at all.)

Anyway, that’s just terminology. I have no problem with the Strohminger et al. and Pastötter et al. studies, which we can simply call follow-up experiments.

And, just to be clear, I agree that there’s nothing special about an SNL video or for that matter about a video at all. My concern about the replication studies is more of a selection issue: if a new study doesn’t replicate the original claim, then a defender can say it’s not a real replication. I guess we could call that “the no true replication fallacy”! Kinda like those notorious examples where people claimed that a failed replication didn’t count because it was done in a different country, or the stimulus was done for a different length of time, or the outdoor temperature was different.

The real question is, what did they find and how do these findings relate to the larger claim?

And the answer is, it’s complicated.

First, the two new studies only look at the footbridge scenario (where the decision is whether to push the fat man), not the flip-the-switch-on-the-trolley scenario, which is not so productive to study because most people are already willing to flip the switch. So the new studies do not allow comparison of the two scenarios. (Strohminger et al. used 12 high-conflict moral dilemmas; see here.)

Second, the two new studies looked at interactions rather than main effects.

The Strohminger et al. analysis is complicated and I didn’t follow all the details, but I don’t see a direct comparison estimating the effect of listening to comedy versus something else. In any case, though, I think this experiment (55 people in what seems to be a between-person design) would be too small to reliably estimate the effect of interest, considering how large the standard error was in the original N=79 study.

Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?”):

I guess the question is, does the constellation of these results represent a replication of the finding that “situational cues or causal factors influencing people’s affective states—emotions or moods—have consistent effects on people’s general judgments about cases”?

And my answer is: I’m not sure. With this sort of grab bag of different findings (sometimes main effects, sometimes interactions) with different experimental conditions, I don’t really know what to think. I guess that’s the advantage of large preregistered replications: for all their flaws, they give us something to focus on.

Just to be clear: I agree that effects don’t have to be large to be interesting or important. But at the same time it’s not enough to just say that effects exist. I have no doubt that affective states affect survey responses, and these effects will be of different magnitudes and directions for different people and in different situations (hence the study of interactions as well as main effects). There have to be some consistent or systematic patterns for this to be considered a scientific effect, no? So, although I agree that effects don’t need to be large, I also don’t think a statement such as “emotions influence judgment” is enough either.

One thing that does seem clear is that details matter, and lots of the details get garbled in the retelling. For example, Setiya reports that “Americans are three times more likely” to say they’d push someone, but that factor of 3 is based on a small noisy study on an unknown population, for which I’ve not seen any exact replication, so to make that claim is a big leap of faith, or of statistical inference. Meanwhile, Engber refers to the flip-the-switch version of the dilemma, for which the data show no such effect of the TV show. More generally, everyone seems to like talking about Saturday Night Live, I guess because it evokes vivid images, even though the larger study had no TV comedy at all but compared clips of happy or sad-sounding music.

What have we learned from this journey?

Reporting science is challenging, even for skeptics. None of the authors discussed above—Setiya, Engber, or Machery—are trying to sell us on this research, and none of them have a vested interest in making overblown claims. Indeed, I think it would be fair to describe Setiya and Engber as skeptics in this discussion. But even skeptics can get lost in the details. We all have a natural desire to smooth over the details and go for the bigger story. But this is tricky when the bigger story, whatever it is, depends on details that we don’t fully understand. Presumably our understanding in 2018 of affective influences on these survey responses should not depend on exactly how an experiment was done in 2006—but the descriptions of the effects are framed in terms of that 2006 study, and with each lab’s experiment measuring something a bit different, I find it very difficult to put everything together.

This relates to the problem we discussed the other day, of psychology textbooks putting a complacent spin on the research in their field. The desire for a smooth and coherent story gets in the way of the real-world complexity that motivates this research in the first place.

There’s also another point that Engber emphasizes, which is the difference between a response to a hypothetical question and an action in the external world. Paradoxically, one reason why I can accept that various irrelevant interventions (for example, watching a comedy show or a documentary film) could have a large effect on the response to the trolley question is that this response is not something that most people have thought about before. In contrast, I found similar claims involving political attitudes and voting (for example, the idea that 20% of women change their presidential preference depending on time of the month) to be ridiculous, in part because most people already have settled political views. But then, if the only reason we find the trolley claims plausible is that people aren’t answering them thoughtfully, then we’re really only learning about people’s quick reactions, not their deeper views. Quick reactions are important too; we should just be clear if that’s what we’re studying.

P.S. Edouard Machery and Nina Strohminger offered useful comments that influenced what I wrote above.

Going beyond the rainbow color scheme for statistical graphics

Yesterday in our discussion of easy ways to improve your graphs, a commenter wrote:

I recently read and enjoyed several articles about alternatives to the rainbow color palette. I particularly like the sections where they show how each color scheme looks under different forms of color-blindness and/or in black and white.

Here’s a couple of them (these are R-centric but relevant beyond that):

The viridis color palettes, by Bob Rudis, Noam Ross and Simon Garnier

Somewhere over the Rainbow, by Ross Ihaka, Paul Murrell, Kurt Hornik, Jason Fisher, Reto Stauffer, Claus Wilke, Claire McWhite, and Achim Zeileis.

I particularly like that second article, which includes lots of examples.
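If you want to play with this yourself, here’s a minimal sketch in Python (the linked articles are R-centric, but matplotlib ships the same viridis family of palettes). It draws one fake heatmap twice, once with the rainbow-style “jet” palette and once with viridis; convert the output to grayscale and the difference is obvious:

```python
import numpy as np
import matplotlib.pyplot as plt

# Fake data, for illustration only.
z = np.random.default_rng(1).normal(size=(30, 30))

# Same heatmap under the rainbow-style "jet" palette and under viridis.
# Viridis is perceptually uniform and survives grayscale printing and
# common forms of color-blindness; jet does not.
fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, cmap in zip(axes, ["jet", "viridis"]):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap)
    fig.colorbar(im, ax=ax)
plt.show()
```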

(from Yair): What Happened in the 2018 Election

Yair writes:

Immediately following the 2018 election, we published an analysis of demographic voting patterns, showing our best estimates of what happened in the election and putting it into context compared to 2016 and 2014. . . .

Since then, we’ve collected much more data — precinct results from more states and, importantly, individual-level vote history records from Secretaries of State around the country. This analysis updates the earlier work and adds to it in a number of ways. Most of the results we showed remain the same as in the earlier analysis, but there are some changes.

Here’s the focus:

How much of the change from 2016 was due to different people voting vs. the same people changing their vote choice?

I like how he puts this. Not “Different Electorate or Different Vote Choice?” which would imply that it’s one or the other, but “How much,” which is a more quantitative, continuous, statistical way of thinking about the question.

Here’s Yair’s discussion:

As different years bring different election results, many people have debated the extent to which these changes are driven by (a) differential turnout or (b) changing vote choice.

Those who believe turnout is the driver point to various pieces of evidence. Rates of geographic ticket splitting have declined over time as elections have become more nationalized. Self-reported consistency between party identification and vote choice is incredibly high. In the increasingly nasty discourse between people who are involved in day-to-day national politics, it is hard to imagine there are many swing voters left. . . .

Those who think changing vote choice is important point to different sets of evidence. Geographic ticket splitting has declined, but not down to zero, and rates of ticket splitting only reflect levels of geographic consistency anyway. Surveys do show consistency, but again not 100% consistency, and survey respondents are more likely to be heavily interested in politics, more ideologically consistent, and less likely to swing back and forth. . . . there is little evidence that this extends to the general public writ large. . . .

How to sort through it all?

Yair answers his own question:

Our voter registration database keeps track of who voted in different elections, and our statistical models used in this analysis provide estimates of how different people voted in the different elections. . . .

Let’s build intuition about our approach by looking at a fairly simple case: the change between 2012 and 2014 . . . likely mostly due to differential turnout.

What about the change from 2016 to 2018? The same calculations from earlier are shown in the graph above and tell a different story. . . .

Two things happened between 2016 and 2018. First, there was a massive turnout boost that favored Democrats, at least compared to past midterms. . . . But if turnout was the only factor, then Democrats would not have seen nearly the gains that they ended up seeing. Changing vote choice accounted for a +4.5% margin change, out of the +5.0% margin change that was seen overall — a big piece of Democratic victory was due to 2016 Trump voters turning around and voting for Democrats in 2018.

Also lots more graphs, including discussion of some individual state-level races. And this summary:

First, on turnout: there are few signs that the overwhelming enthusiasm of 2018 is slowing down. 2018 turnout reached 51% of the citizen voting-age population, 14 points higher than 2014. 2016 turnout was 61%. If enthusiasm continues, how high can it get? . . . turnout could easily reach 155 to 160 million votes . . .

Second, on vote choice . . . While 2018 was an important victory for Democrats, the gains that were made could very well bounce back to Donald Trump in 2020.

You can compare this to our immediate post-election summary and Yair’s post-election analysis from last November.

(In the old days I would’ve crossposted all of this on the Monkey Cage, but they don’t like crossposting anymore.)

What are some common but easily avoidable graphical mistakes?

John Kastellec writes:

I was thinking about writing a short paper aimed at getting political scientists to not make some common but easily avoidable graphical mistakes. I’ve come up with the following list of such mistakes. I was just wondering if any others immediately came to mind?

– Label lines directly

– Make labels big enough to read

– Small multiples instead of spaghetti plots

– Avoid stacked barplots

– Make graphs completely readable in black-and-white

– Leverage time as clearly as possible by placing it on the x-axis.

That reminds me . . . I was just at a pharmacology conference. And everybody there—I mean everybody—used the rainbow color scheme for their graphs. Didn’t anyone send them the memo, that we don’t do rainbow anymore? I prefer either a unidirectional shading of colors, or a bidirectional shading as in figure 4 here, depending on context.
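To make a few of the items on John’s list concrete, here’s a rough matplotlib sketch on made-up data: small multiples instead of a spaghetti plot, time on the x-axis, each panel labeled directly, and a plot that still reads fine in black and white:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up time series for a handful of groups, for illustration only.
rng = np.random.default_rng(0)
years = np.arange(2000, 2020)
groups = ["Northeast", "Midwest", "South", "West"]
series = {g: np.cumsum(rng.normal(0, 1, len(years))) for g in groups}

# Small multiples instead of one spaghetti plot, with time on the x-axis
# and each panel titled directly rather than relying on a legend.
fig, axes = plt.subplots(1, len(groups), figsize=(12, 3), sharey=True)
for ax, g in zip(axes, groups):
    ax.plot(years, series[g], color="0.2")  # a single dark gray: readable in black and white
    ax.set_title(g)
    ax.set_xlabel("Year")
axes[0].set_ylabel("Outcome")
plt.tight_layout()
plt.show()
```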

Neural nets vs. regression models

Eliot Johnson writes:

I have a question concerning papers comparing two broad domains of modeling: neural nets and statistical models. Both terms are catch-alls, within each of which there are, quite obviously, multiple subdomains. For instance, NNs could include ML, DL, AI, and so on. While statistical models should include panel data, time series, hierarchical Bayesian models, and more.

I’m aware of two papers that explicitly compare these two broad domains:

(1) Sirignano, et al., Deep Learning for Mortgage Risk,

(2) Makridakis, et al., Statistical and Machine Learning forecasting methods: Concerns and ways forward

But there must be more than just these two examples. Are there others that you are aware of? Do you think a post on your blog would be useful? If so, I’m sure you can think of better ways to phrase or express my “two broad domains.”

My reply:

I don’t actually know.

Back in 1994 or so I remember talking with Radford Neal about the neural net models in his Ph.D. thesis and asking if he could try them out on analysis of data from sample surveys. The idea was that we have two sorts of models: multilevel logistic regression and Gaussian processes. Both models can use the same predictors (characteristics of survey respondents such as sex, ethnicity, age, and state), and both have the structure that similar respondents have similar predicted outcomes—but the two models have different mathematical structures. The regression model works with a linear predictor from all these factors, whereas the Gaussian process model uses an unnormalized probability density—a prior distribution—that encourages people with similar predictors to have similar outcomes.
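Just as a toy illustration of the contrast—this is not Radford’s setup, and nothing like a real survey analysis; the data below are invented—here’s what the two kinds of model look like side by side:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Fake survey: predictors are respondent characteristics coded numerically
# (say, age group, income group, region); the outcome is a binary response.
rng = np.random.default_rng(42)
n = 500
X = rng.integers(0, 5, size=(n, 3)).astype(float)
logit = -1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Logistic regression: a linear predictor in the inputs.
lr = LogisticRegression().fit(X, y)

# Gaussian process classifier: a prior that pulls respondents with similar
# predictors toward similar predicted outcomes, with no linearity assumption.
gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)

print(lr.predict_proba(X[:5])[:, 1])
print(gp.predict_proba(X[:5])[:, 1])
```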

My guess is that the two models would do about the same, following the general principle that the most important thing about a statistical procedure is not what you do with the data, but what data you use. In either case, though, some thought might need to go into the modeling. For example, you’ll want to include state-level predictors. As we’ve discussed before, when your data are sparse, multilevel regression works much better if you have good group-level predictors, and some of the examples where it appears that MRP performs poorly are examples where people are not using available group-level information.

Anyway, to continue with the question above, asking about neural nets and statistical models: Actually, neural nets are a special case of statistical models, typically Bayesian hierarchical logistic regression with latent parameters. But neural nets are typically estimated in a different way: the resulting posterior distributions will generally be multimodal, so rather than try the hopeless task of traversing the whole posterior distribution, we’ll use various approximate methods, which then are evaluated using predictive accuracy.

By the way, Radford’s answer to my question back in 1994 was that he was too busy to try fitting his models to my data. And I guess I was too busy too, because I didn’t try it either! More recently, I asked a computer scientist and he said he thought the datasets I was working with were too small for his methods to be very useful. More generally, though, I like the idea of RPP, also the idea of using stacking to combine Bayesian inferences from different fitted models.

Abortion attitudes: The polarization is among richer, more educated whites

Abortion has been in the news lately. A journalist asked me something about abortion attitudes and I pointed to a post from a few years ago about partisan polarization on abortion. Also this with John Sides on why abortion consensus is unlikely. That was back in 2009, and consensus doesn’t seem any more likely today.

It’s perhaps not well known (although it’s consistent with what we found in Red State Blue State) that just about all the polarization on abortion comes from whites, and most of that is from upper-income, well-educated whites. Here’s an incomplete article that Yair and I wrote on this from 2010; we haven’t followed up on it recently.

Alternatives and reality

I saw this cartoon from Randall Munroe, and it reminded me of something I wrote a while ago.

The quick story is that I don’t think the alternative histories within alternative histories are completely arbitrary. It seems to me that there’s a common theme in the best alternative history stories, a recognition that our world is the true one and that the people in the stories are living in a fake world. This is related to the idea that the real world is overdetermined, so these alternatives can’t ultimately make sense. From that perspective, characters living within an alternative history are always at risk of realizing that their world is not real, and the alternative histories they themselves construct can be ways of channeling that recognition.

I was also thinking about this again the other day when rereading Tom Shippey’s excellent The Road to Middle-earth. Tolkien put a huge amount of effort into rationalizing his world, not just in its own context (internal consistency) but also making it fit into our world. It seems that he felt that a completely invented world would not ultimately make sense; it was necessary for his world to be reconstructed, or discovered, and for that it had to be real.

My talks at the University of Chicago this Thursday and Friday

Political Economy Workshop (12:30pm, Thurs 23 May 2019, Room 1022 of Harris Public Policy (Keller Center) 1307 E 60th Street):

Political Science and the Replication Crisis

We’ve heard a lot about the replication crisis in science (silly studies about ESP, evolutionary psychology, miraculous life hacks, etc.), how it happened (p-values, forking paths), and proposed remedies both procedural (preregistration, publishing of replications) and statistical (replacing hypothesis testing with multilevel modeling and decision analysis). But also of interest are the theories, implicit or explicit, associated with unreplicated or unreplicable work in medicine, psychology, economics, policy analysis, and political science: a model of the social and biological world driven by hidden influences, a perspective which we argue is both oversimplified and needlessly complex. When applied to political behavior, these theories seem to be associated with a cynical view of human nature that lends itself to anti-democratic attitudes. Fortunately, the research that is said to support this view has been misunderstood.

Some recommended reading:

[2015] Disagreements about the strength of evidence

[2015] The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective

[2016] The mythical swing voter.

[2018] Some experiments are just too noisy to tell us much of anything at all: Political science edition

Quantitative Methods Committee and QMSA (10:30am, Fri 24 May 2019, 5757 S. University in Saieh Hall (lower Level) Room 021):

Multilevel Modeling as a Way of Life

The three challenges of statistical inference are: (1) generalizing from sample to population, (2) generalizing from control to treatment group, and (3) generalizing from observed measurements to the underlying constructs of interest. Multilevel modeling is central to all of these tasks, in ways that you might not realize. We illustrate with several examples in social science and public health.

Some recommended reading:

[2004] Treatment effects in before-after data

[2012] Why we (usually) don’t have to worry about multiple comparisons

[2013] Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups

[2018] Bayesian aggregation of average data: An application in drug development

Vigorous data-handling tied to publication in top journals among public health researchers

Gur Huberman points us to this news article by Nicholas Bakalar, “Vigorous Exercise Tied to Macular Degeneration in Men,” which begins:

A new study suggests that vigorous physical activity may increase the risk for vision loss, a finding that has surprised and puzzled researchers.

Using questionnaires, Korean researchers evaluated physical activity among 211,960 men and women ages 45 to 79 in 2002 and 2003. Then they tracked diagnoses of age-related macular degeneration, from 2009 to 2013. . . .

They found that exercising vigorously five or more days a week was associated with a 54 percent increased risk of macular degeneration in men. They did not find the association in women.

The study, in JAMA Ophthalmology, controlled for more than 40 variables, including age, medical history, body mass index, prescription drug use and others. . . . an accompanying editorial suggests that the evidence from such a large cohort cannot be ignored.

The editorial, by Myra McGuinness, Julie Simpson, and Robert Finger, is unfortunately written entirely from the perspective of statistical significance and hypothesis testing, but they raise some interesting points nonetheless (for example, that the subgroup analysis can be biased if the matching of treatment to control group is done for the entire sample but not for each subgroup).

The news article is not so great, in my opinion. Setting aside various potential problems with the study (including those issues raised by McGuinness et al. in their editorial), the news article makes the mistake of going through all the reported estimates and picking the largest one. That’s selection bias right there. “A 54 percent increased risk,” indeed. If you want to report the study straight up, no criticism, fine. But then you should report the estimated main effect, which was 23% (as reported in the journal article, “(HR, 1.23; 95% CI, 1.02-1.49)”). That 54% number is just ridiculous. I mean, sure, maybe the effect really is 54%, who knows? But such an estimate is not supported by the data: it’s the largest of a set of reported numbers, any of which could’ve been considered newsworthy. If you take a set of numbers and report only the maximum, you’re introducing a bias.

Part of the problem, I suppose, is incentives. If you’re a health/science reporter, you have a few goals. One is to report exciting breakthroughs. Another is to get attention and clicks. Both goals are served, at least in the short term, by exaggeration. Even if it’s not on purpose.

OK, on to the journal article. As noted above, it’s based on a study of 200,000 people: “individuals between ages 45 and 79 years who were included in the South Korean National Health Insurance Service database from 2002 through 2013,” of whom half engaged in vigorous physical activity and half did not. It appears that the entire database contained about 500,000 people, of whom 200,000 were selected for analysis in this comparison. The outcome is neovascular age-related macular degeneration, which seems to be measured by a prescription for ranibizumab, which I guess was the drug of choice for this condition in Korea at that time? Based on the description in the paper, I’m assuming they didn’t have direct data on the medical conditions, only on what drugs were prescribed, and when, hence “ranibizumab use from August 1, 2009, indicated a diagnosis of recently developed active (wet) neovascular AMD by an ophthalmologist.” I don’t know if there were people with neovascular AMD who were not captured in this dataset because they never received this diagnosis.

In their matched sample of 200,000 people, 448 were recorded as having neovascular AMD: 250 in the vigorous exercise group and 198 in the control group. The data were put into a regression analysis, yielding an estimated hazard ratio of 1.23 with 95% confidence interval of [1.02, 1.49]. Also lots of subgroup analyses: unsurprisingly, the point estimate is higher for some subgroups than others; also unsurprisingly, some of the subgroup analyses reach statistical significance and some do not.

It is misleading to report that vigorous physical activity was associated with a greater hazard rate for neovascular AMD in men but not in women. Both the journal article and the news article made this mistake. The difference between “significant” and “non-significant” is not itself statistically significant.

So what do I think about all this? First, the estimates are biased due to selection on statistical significance (see, for example, section 2.1 here). Second, given how surprised everyone is, this suggests a prior distribution on any effect that should be concentrated near zero, which would pull all estimates toward 0 (or pull all hazard ratios toward 1), and I expect that the 95% intervals would then all include the null effect. Third, beyond all the selection mentioned above, there’s the selection entailed in studying this particular risk factor and this particular outcome. In this big study, you could study the effect of just about any risk factor X on just about any outcome Y. I’d like to see a big grid of all these things, all fit with a multilevel model. Until then, we’ll need good priors on the effect size for each study, or else some corrections for type M and type S errors.
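To see how selection on statistical significance produces exaggerated estimates (the type M error problem), here’s a little simulation sketch; the true hazard ratio and the standard error are numbers I made up for illustration:

```python
import numpy as np

# Suppose the true log hazard ratio is log(1.1) and each subgroup estimate
# comes with standard error 0.15 (made-up numbers, for illustration only).
rng = np.random.default_rng(7)
true_log_hr = np.log(1.1)
se = 0.15
estimates = rng.normal(true_log_hr, se, size=100_000)

# Keep only the estimates that reach statistical significance at the 5% level.
significant = np.abs(estimates / se) > 1.96
print("true hazard ratio:", np.exp(true_log_hr))
print("mean reported hazard ratio among 'significant' estimates:",
      np.exp(estimates[significant]).mean())
```

The estimates that clear the significance filter are, on average, much larger than the true value; that’s the bias I’m talking about.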

Just reporting the raw estimate from one particular study like that: No way. That’s a recipe for future non-replicable results. Sorry, NYT, and sorry, JAMA: you’re getting played.

P.S. Gur wrote:

The topic may merit two posts — one for the male subpopulation, another for the female.

To which I replied:

20 posts, of which 1 will be statistically significant.

P.P.S. On the plus side, Jonathan Falk pointed me the other day to this post by Scott Alexander, who writes the following about a test of a new psychiatric drug:

The pattern of positive results shows pretty much the random pattern you would expect from spurious findings. They’re divided evenly among a bunch of scales, with occasional positive results on one scale followed by negative results on a very similar scale measuring the same thing. Most of them are only the tiniest iota below p = 0.05. Many of them only work at 40 mg, and disappear in the 80 mg condition; there are occasional complicated reasons why drugs can work better at lower doses, but Occam’s razor says that’s not what’s happening here. One of the results only appeared in Stage 2 of the trial, and disappeared in Stage 1 and the pooled analysis. This doesn’t look exactly like they just multiplied six instruments by two doses by three ways of grouping the stages, got 36 different cells, and rolled a die in each. But it’s not too much better than that. Who knows, maybe the drug does something? But it sure doesn’t seem to be a particularly effective antidepressant, even by our very low standards for such. Right now I am very unimpressed.

It’s good to see this mode of thinking becoming so widespread. It makes me feel that things are changing in a good way.

So, some good news for once!

Hey, people are doing the multiverse!

Elio Campitelli writes:

I’ve just seen this image in a paper discussing the weight of evidence for a “hiatus” in the global warming signal and immediately thought of the garden of forking paths.

From the paper:

Tree representation of choices to represent and test pause-periods. The ‘pause’ is defined as either no-trend or a slow-trend. The trends can be measured as ‘broken’ or ‘continuous’ trends. The data used to assess the trends can come from HadCRUT, GISTEMP, or other datasets. The bottom branch represents the use of ‘historical’ versions of the datasets as they existed, or contemporary versions providing full dataset ‘hindsight’. The colour coded circles at the bottom of the tree indicate our assessment of the level of evidence (fair, weak, little or no) for the tests undertaken for each set of choices in the tree. The ‘year’ rows are for assessments undertaken at each year in time.

Thus, descending the tree in the figure, a typical researcher makes choices (explicitly or implicitly) about how to define the ‘pause’ (no-trend or slow-trend), how to model the pause-interval (as broken or continuous trends), which (and how many) datasets to use (HadCRUT, GISTEMP, Other), and what versions to use for the data with what foresight about corrections to the data (historical, hindsight). For example, a researcher who chose to define the ‘pause’ as no-trend and selected isolated intervals to test trends (broken trends) using HadCRUT3 data would be following the left-most branches of the tree.

Actually, it’s the multiverse.
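For what it’s worth, the tree in that figure is small enough to enumerate directly. Here’s a sketch, with the branch labels taken from the quoted description:

```python
from itertools import product

# Each path through the tree is one way a researcher could define and test the 'pause'.
definitions = ["no-trend", "slow-trend"]
trend_models = ["broken", "continuous"]
datasets = ["HadCRUT", "GISTEMP", "Other"]
versions = ["historical", "hindsight"]

multiverse = list(product(definitions, trend_models, datasets, versions))
print(len(multiverse), "analysis paths")  # 2 * 2 * 3 * 2 = 24
for path in multiverse[:3]:
    print(path)
```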

Data quality is a thing.

I just happened to come across this story, where a journalist took some garbled data and spun a false tale which then got spread without question.

It’s a problem. First, it’s a problem that people will repeat unjustified claims; it’s also a problem that, when data are attached, you can get complete credulity, even for claims that are implausible on the face of it.

So it’s good to be reminded: “Data” are just numbers. You need to know where the data came from before you can learn anything from them.

“Did Jon Stewart elect Donald Trump?”

I wrote this post a couple weeks ago and scheduled it for October, but then I learned from a reporter that the research article under discussion was retracted, so it seemed to make sense to post this right away while it was still newsworthy.

My original post is below, followed by a post script regarding the retraction.

Matthew Heston writes:

First time, long time. I don’t know if anyone has sent over this recent paper [“Did Jon Stewart elect Donald Trump? Evidence from television ratings data,” by Ethan Porter and Thomas Wood] which claims that Jon Stewart leaving The Daily Show “spurred a 1.1% increase in Trump’s county-level vote share.”

I’m not a political scientist, and not well versed in the methods they say they’re using, but I’m skeptical of this claim. One line that stood out to me was: “To put the effect size in context, consider the results from the demographic controls. Unsurprisingly, several had significant results on voting. Yet the effects of The Daily Show’s ratings decline loomed larger than several controls, such as those related to education and ethnicity, that have been more commonly discussed in analyses of the 2016 election.” This seems odd to me, as I wouldn’t expect a TV show host change to have a larger effect than these other variables.

They also mention that they’re using “a standard difference-in-difference approach.” As I mentioned, I’m not too familiar with this approach. But my understanding is that they would be comparing pre- and post- treatment differences in a control and treatment group. Since the treatment in this case is a change in The Daily Show host, I’m unsure of who the control group would be. But maybe I’m missing something here.

Heston points to our earlier posts on the Fox news effect.

Anyway, what do I think of this new claim? The answer is that I don’t really know.

Let’s work through what we can.

In reporting any particular effect there’s some selection bias, so let’s start by assuming an Edlin factor of 1/2; now the estimated effect of Jon Stewart goes from 1.1% to 0.55% in Trump’s county-level vote share. Call it 0.6%. Vote share is approximately 50%, so a 0.6% change is approximately a 0.3 percentage point change in the vote. Would this have swung the election? I’m not sure, maybe not quite.
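Spelling out that back-of-the-envelope arithmetic (same numbers as above, nothing more):

```python
reported_effect = 1.1   # claimed percent increase in Trump's county-level vote share
edlin_factor = 0.5      # discount for selection in which estimates get reported
adjusted = reported_effect * edlin_factor  # 0.55, call it 0.6
vote_share = 50.0       # approximate vote share, in percent
swing = vote_share * adjusted / 100        # about 0.3 percentage points of the vote
print(adjusted, swing)
```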

Let’s assume the effect is real. How to think about it? It’s one of many such effects, along with other media outlets, campaign tactics, news items, etc.

A few years ago, Noah Kaplan, David Park, and I wrote an article attempting to distinguish between what we called the random walk and mean-reversion models of campaigning. The random walk model posits that the voters are where they are, and campaigning (or events more generally) moves them around. In this model, campaign effects are additive: +0.3 here, -0.4 there, and so forth. In contrast, the mean-reversion model starts at the end, positing that the election outcome is largely determined by the fundamentals, with earlier fluctuations in opinion mostly being a matter of the voters coming to where they were going to be. After looking at what evidence we could find, we concluded that the mean-reversion model made more sense and was more consistent with the data. This is not to say that the Jon Stewart show would have no effect, just that it’s one of many interventions during the campaign, and I can’t picture each of them having an independent effect and these effects all adding up.

P.S. After the retraction

The article discussed above was retracted because the analysis had a coding error.

What to say given this new information?

First, I guess Heston’s skepticism is validated. When you see a claim that seems too big to be true (as here or here), maybe it’s just mistaken in some way.

Second, I too have had to correct a paper whose empirical claims were invalidated by a coding error. It happens—and not just to Excel users!

Third, maybe the original reaction to that study was a bit too strong. See the above post: Even had the data shown what had originally been claimed, the effect they found was not as consequential as it might’ve seemed at first. Setting aside all questions of data errors and statistical errors, there’s a limit to what can be learned about a dynamic process—an election campaign—from an isolated study.

I am concerned that all our focus on causal identification, important as it is, can lead researchers, journalists, and members of the general public to overconfidence in theories as a result of isolated studies, without always recognizing that real life is more complicated. I had a similar feeling a few years ago regarding the publicity surrounding the college-football-and-voting study. The particular claims regarding football and voting have since been disputed, but even if you accept the original study as is, its implications aren’t as strong as had been claimed in the press. Whatever these causal effects are, they vary by person and scenario, and they’re not occurring in isolation.

“In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous,” but “Over 20 journals turned down her paper . . . and nobody wanted to fund privacy research that might reach uncomfortable conclusions.”

Tom Daula writes:

I think this story from John Cook is a different perspective on replication and how scientists respond to errors.

In particular the final paragraph:

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In [Latanya] Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.

I think most of your scientific error stories follow this pattern. The error is pointed out privately and then publicized. Of course in most of your posts a private email is met with hostility, the error is publicized, and then the scientist digs in. The good stories are when the authors admit and publicize the error themselves.

Replication, especially in psychology, fits into this because there is no “single responsible party” so “calling attention to the problem [is] the only way to make things better.”

I imagine Latanya Sweeney and you share similar frustrations.

It’s an interesting story. I was thinking about this recently when reading one of Edward Winter’s chess notes collections. These notes are full of stories of sloppy writers copying things without citation, reproducing errors that have appeared elsewhere, introducing new errors (see an example here with follow-up here). Anyway, what’s striking to me is that so many people just don’t seem to care about getting their facts wrong. Or, maybe they do care, but not enough to fix their errors or apologize or even thank the people who point out the mistakes that they’ve made. I mean, why bother writing a chess book if you’re gonna put mistakes in it? It’s not like you can make a lot of money from these things.

Sweeney’s example is of course much more important, but sometimes when thinking about a general topic (in this case, authors getting angry when their errors are revealed to the world) it can be helpful to think about minor cases too.

“MRP is the Carmelo Anthony of election forecasting methods”? So we’re doing trash talking now??

What’s the deal with Nate Silver calling MRP “the Carmelo Anthony of forecasting methods”?

Someone sent this to me:

and I was like, wtf? I don’t say wtf very often—at least, not on the blog—but this just seemed weird.

For one thing, Nate and I did a project together once using MRP: this was our estimate of attitudes on health care reform by age, income, and state:

Without MRP, we couldn’t’ve done anything like it.

So, what gives?

Here’s a partial list of things that MRP has done:

– Estimating public opinion in slices of the population

– Improved analysis using the voter file

– Polling using the Xbox that outperformed conventional poll aggregates

– Changing our understanding of the role of nonresponse in polling swings

– Post-election analysis that’s a lot more trustworthy than exit polls

OK, sure, MRP has solved lots of problems, it’s revolutionized polling, no matter what Team Buggy Whip says.

That said, it’s possible that MRP is overrated. “Overrated” is a difference between rated quality and actual quality. MRP, wonderful as it is, might well be rated too highly in some quarters. I wouldn’t call MRP a “forecasting method,” but that’s another story.

I guess the thing that bugged me about the Carmelo Anthony comparison is that my impression from reading the sports news is not just that Anthony is overrated but that he’s an actual liability for his teams. Whereas I see MRP, overrated as it may be (I’ve seen no evidence that MRP is overrated but I’ll accept this for the purpose of argument), as still a valuable contributor to polling.

Ten years ago . . .

The end of the aughts. It was a simpler time. Nate Silver was willing to publish an analysis that used MRP. We all thought embodied cognition was real. Donald Trump was a reality-TV star. Kevin Spacey was cool. Nobody outside of suburban Maryland had heard of Beach Week.

And . . . Carmelo Anthony got lots of respect from the number crunchers.

Check this out:

So here’s the story according to Nate: MRP is like Carmelo Anthony because they’re both overrated. But Carmelo Anthony isn’t overrated, he’s really underrated. So maybe Nate’s MRP jab was just a backhanded MRP compliment?

The simpler story, I guess, is that back around 2010 Nate liked MRP and he liked Carmelo. Back then, he thought the people who thought Carmelo was overrated were wrong. In 2018, he isn’t so impressed with either of them. Nate’s impressions of MRP and Carmelo Anthony go up and down together. That’s consistent, I guess.

In all seriousness . . .

Unlike Nate Silver, I claim no expertise on basketball. For all I know, Tim Tebow will be starting for the Knicks next year!

I do claim some expertise on MRP, though. Nate described MRP as “not quite ‘hard’ data.” I don’t really know what Nate meant by “hard” data—ultimately, these are all just survey responses—but, in any case, I replied:

I guess MRP can mean different things to different people. All the MRP analyses I’ve ever published are entirely based on hard data. If you want to see something that’s a complete mess and is definitely overrated, try looking into the guts of classical survey weighting (see for example this paper). Meanwhile, Yair used MRP to do these great post-election summaries. Exit polls are a disaster; see for example here.

Published poll toplines are not the data, warts and all; they’re processed data, sometimes not adjusted for enough factors as in the notorious state polls in 2016. I agree with you that raw data is the best. Once you have raw data, you can make inferences for the population. That’s what Yair was doing. For understandable commercial reasons, lots of pollsters will release toplines and crosstabs but not raw data. MRP (or, more generally, RRP) is just a way of going from the raw data to make inference about the general population. It’s the general population (or the population of voters) that we care about. The people in the sample are just a means to an end.

Anyway, if you do talk about MRP and how overrated it is, you might consider pointing people to some of those links to MRP successes. Hey, here’s another one: we used MRP to estimate public opinion on health care. MRP has quite a highlight reel, more like Lebron or Steph or KD than Carmelo, I’d say!

One thing I will say is that data and analysis go together:

– No modern survey is good enough to be able to just interpret the results without any adjustment. Nonresponse is just too big a deal. Every survey gets adjusted, but some don’t get adjusted well.

– No analysis method can do it on its own without good data. All the modeling in the world won’t help you if you have serious selection bias.

Yair added:

Maybe it’s just a particularly touchy week for Melo references.

Both Andy and I would agree that MRP isn’t a silver bullet. But nothing is a silver bullet. I’ve seen people run MRP with bad survey data, bad poststratification data, and/or bad covariates in a model that’s way too sparse, and then over-promise about the results. I certainly wouldn’t endorse that. On the other side, obviously I agree with Andy that careful uses of MRP have had many successes, and it can improve survey inferences, especially compared to traditional weighting.

I think maybe you’re talking specifically about election forecasting? I haven’t seen comparisons of your forecasts to YouGov or PredictWise or whatever else. My vague sense pre-election was that they were roughly similar, i.e., that the meaty part of the curves overlapped. Maybe I’m wrong and your forecasts were much better this time—but non-MRP forecasters have also done much worse than you, so is that an indictment of MRP, or are you just really good at forecasting?

More to my main point—in one of your recent podcasts, I remember you said something about how forecasts aren’t everything, and people should look at precinct results to try to get beyond the toplines. That’s roughly what we’ve been trying to do in our post-election project, which has just gotten started. We see MRP as a way to combine all the data—pre-election voter file data, early voting, precinct results, county results, polling—into a single framework. Our estimates aren’t going to be perfect, for sure, but hopefully an improvement over what’s been out there, especially at sub-national levels. I know we’d do better if we had a lot more polling data, for instance. FWIW I get questions from clients all the time about how demographic groups voted in different states. Without state-specific survey data, which is generally unavailable and often poorly collected/weighted, not sure what else you can do except some modeling like MRP.

Maybe you’d rather see the raw unprocessed data like the precinct results. Fair enough, sometimes I do too! My sense is the people who want that level of detail are in the minority of the minority. Still, we’re going to try to do things like show the post-processed MRP estimates, but also some of the raw data to give intuition. I wonder if you think this is the right approach, or if you think something else would be better.

And Ryan Enos writes:

To follow up on this—I think you’ll all be interested in seeing the back and forth between Nate and Lynn Vavreck, who was interviewing him. It was more of a discussion of tradeoffs between different approaches than a discussion of what is wrong with MRP. Nate’s MRP alternative was to do a poll in every district, which I think we can all agree would be nice – if not entirely realistic. Although, as Nate pointed out, some of the efforts from the NY Times this cycle made that seem more realistic. In my humble opinion, Lynn did a nice job pushing Nate on the point that, even with data like the NY Times polls, you are still moving beyond raw data by weighting and, as Andrew points out, we often don’t consider how complex this can be (I have a common frustration with academic research about how much out-of-the-box survey weights are used and abused).

I don’t actually pay terribly close attention to forecasting – but in my mind, Nate and everybody else in the business is doing a fantastic job and the YouGov MRP forecasts have been a revelation. From my perspective, as somebody who cares more about what survey data can teach us about human behavior and important political phenomena, I think MRP has been a revelation in that it has allowed us to infer opinion in places, such as metro areas, where it would otherwise be missing. This has been one of the most important advances in public opinion research in my lifetime. Where the “overrated” part becomes true is that just like every other scientific advance, people can get too excited about what it can do without thinking about what assumptions are going into the method, and this can lead to believing it can do more than it can—but this is true of everything.

Yair, to your question about presentation—I am a big believer in raw data and I think combining the presentation of MRP with something like precinct results, despite the dangers of ecological error, can be really valuable because it can allow people to check MRP results with priors from raw data.

It’s fine to do a poll in every district, but then you’d still want to do MRP in order to adjust for nonresponse, estimate subgroups of the population, study public opinion in between the districtwide polls, etc.
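To make that concrete, here's a minimal MRP-style sketch in Python. Everything in it is made up for illustration (the survey, the cell counts, the variable names), and a plain logistic regression stands in for the multilevel model you'd actually want to fit (in Stan or similar). It just shows the mechanics: model the survey response given demographics and state, predict for every cell of the poststratification table, then average the cell predictions using census weights.

# Minimal MRP-style sketch (hypothetical data; plain logistic regression
# standing in for a multilevel model).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Fake survey: 2,000 respondents with state, age group, and a yes/no response.
survey = pd.DataFrame({
    "state": rng.choice(["NY", "OH", "TX"], size=2000),
    "age": rng.choice(["18-29", "30-44", "45-64", "65+"], size=2000),
})
survey["y"] = rng.binomial(1, 0.5, size=len(survey))

# Fake census counts for every state x age cell (the poststratification table).
cells = pd.MultiIndex.from_product(
    [["NY", "OH", "TX"], ["18-29", "30-44", "45-64", "65+"]],
    names=["state", "age"]).to_frame(index=False)
cells["N"] = rng.integers(50_000, 500_000, size=len(cells))

# Step 1: fit the response model on the survey data.
fit = smf.logit("y ~ C(state) + C(age)", data=survey).fit(disp=0)

# Step 2: predict support in every poststratification cell.
cells["p_hat"] = fit.predict(cells)

# Step 3: poststratify, i.e., take a population-weighted average of cell estimates.
cells["w"] = cells["N"] * cells["p_hat"]
state_est = cells.groupby("state")["w"].sum() / cells.groupby("state")["N"].sum()
print(state_est)

The same machinery is what gives you subgroup estimates: aggregate the cell predictions over whatever slices of the poststratification table you care about.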

Scandal! Mister P appears in British tabloid.

Tim Morris points us to this news article:

And here’s the kicker:

Mister P.

Not quite as cool as the time I was mentioned in Private Eye, but it’s still pretty satisfying.

My next goal: Getting a mention in Sports Illustrated. (More on this soon.)

In all seriousness, it’s so cool when methods that my collaborators and I have developed are just out there, for anyone to use. I only wish Tom Little were around to see it happening.

P.S. Some commenters are skeptical, though:

I agree that polls can be wrong. The issue is not so much the size of the sample but rather that the sample can be unrepresentative. But I do think that polls provide some information; it’s better than just guessing.

P.P.S. Unrelatedly, Morris wrote, with Ian White and Michael Crowther, this article on using simulation studies to evaluate statistical methods.

Fake-data simulation. Yeah.
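The basic recipe is simple: simulate data from a known model, apply the method, and see how it performs over many replications. Here's a minimal sketch in Python (not taken from the Morris, White, and Crowther paper; just an illustration of the general idea), checking the bias and 95% interval coverage of the sample mean.

# Generic simulation-study loop: simulate from known parameters, estimate,
# and check bias and interval coverage across replications.
import numpy as np

rng = np.random.default_rng(2019)
true_mu, sigma, n, n_sims = 1.0, 2.0, 30, 5000

estimates, covered = [], []
for _ in range(n_sims):
    y = rng.normal(true_mu, sigma, size=n)
    est = y.mean()
    se = y.std(ddof=1) / np.sqrt(n)
    estimates.append(est)
    covered.append(abs(est - true_mu) < 1.96 * se)

print("bias:", np.mean(estimates) - true_mu)
print("coverage of nominal 95% interval:", np.mean(covered))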

Horse-and-buggy era officially ends for survey research

Peter Enns writes:

Given the various comments on your blog about evolving survey methods (e.g., Of buggy whips and moral hazards; or, Sympathy for the Aapor), I thought you might be interested that the Roper Center has updated its acquisitions policy and is now accepting non-probability samples and other methods. This is an exciting move for the Roper Center.

Jeez. I wonder what the President of the American Association of Buggy-Whip Manufacturers thinks about that!

In all seriousness, let’s never forget that our inferences are only as good as our data. Whether your survey responses come by telephone, or internet, or any other method, you want to put in the effort to get good data from a representative sample, and then to adjust as necessary. There’s no easy solution; it just takes the usual eternal vigilance.

P.S. I’m posting this one now, rather than with the usual six-month delay, because you can now go to the Roper Center and get these polls. I didn’t want you to have to wait!

When we had fascism in the United States

I was reading this horrifying and hilarious story by Colson Whitehead, along with an excellent article by Adam Gopnik in the New Yorker (I posted a nitpick on it a couple days ago) on the Reconstruction and post-Reconstruction era in the United States, and I was suddenly reminded of something.

In one of the political science classes I took in college, we were told that one of the big questions about U.S. politics, compared to Europe, is why we’ve had no socialism and no fascism. Sure, there have been a few pockets of socialism where they’ve won a few elections, and there was Huey Long in 1930s Louisiana, but nothing like Europe, where the Left and the Right have ruled entire countries, and where, at least for a time, socialism and fascism were the ideologies of major parties.

That’s what we were taught. But, as Whitehead and Gopnik (and Henry Louis Gates, the author of the book that Gopnik was reviewing) remind us, that’s wrong. We have had fascism here for a long time—in the post-Reconstruction South.

What’s fascism all about? Right-wing, repressive government, political power obtained and maintained through violence and the threat of violence, a racist and nationalist ideology, and a charismatic leader.

The post-Reconstruction South didn’t have a charismatic leader, but the other parts of the description fit, so on the whole I’d call it a fascist regime.

In the 1930s, Sinclair Lewis wrote It Can’t Happen Here about a hypothetical fascist America, and Philip Roth’s late novel The Plot Against America took up a similar theme. I guess other people have had this thought, so I googled *it has happened here* and came across this post talking about fascism in the United States, pointing to Red scares, the internment of Japanese Americans in WW2, and FBI infiltration of the civil rights movement. All these topics are worth writing about, but none of them seems to me to be nearly as close to fascism as what happened for close to a century in the post-Reconstruction South.

Louis Hartz wrote The Liberal Tradition in America back in the 1950s. The funny thing is, back in the 1950s there was still a lot of fascism down there.

But nobody made that connection to us when we were students.

Maybe the U.S. South just seemed unique, and the legacy of slavery distracted historians and political scientists so much they didn’t see the connection to fascism, a political movement with a nationalistic racist ideology that used violence to take and maintain power in a democratic system. It’s stunning in retrospect that Huey Long was discussed as a proto-fascist without any recognition that the entire South had a fascist system of government.

P.S. A commenter points to this article by Ezekiel Kweku and Jane Coaston from a couple years ago making the same point:

Fascism has happened before in America.

For generations of black Americans, the United States between the end of Reconstruction, around 1876, and the triumphs of the civil rights movement in the early 1960s was a fascist state.

Good to know that others have seen this connection before. It’s still notable, I think, that we weren’t aware of this all along.

Name this fallacy!

It’s the fallacy of thinking that, just cos you’re good at something, everyone should be good at it, and if they’re not, they’re just being stubborn and doing it badly on purpose.

I thought about this when reading this line from Adam Gopnik in the New Yorker:

[Henry Louis] Gates is one of the few academic historians who do not disdain the methods of the journalist . . .

Gopnik’s article is fascinating, and I have no doubt that Gates’s writing is both scholarly and readable.

My problem is with Gopnik’s use of the word “disdain.” The implication seems to be that other historians could write like journalists if they felt like it, but they just disdain to do so, maybe because they think it would be beneath their dignity, or maybe because of the unwritten rules of the academic profession.

The thing that Gopnik doesn’t get, I think, is that it’s hard to write well. Most historians can’t write like A. J. P. Taylor or Henry Louis Gates. Sure, maybe they could approach that level if they were to work hard at it, but it would take a lot of work, a lot of practice, and it’s not clear this would be the best use of their time and effort.

For a journalist to say that most academics “disdain the methods of the journalist” would be like me saying that most journalists “disdain the methods of the statistician.” OK, maybe some journalists actively disdain quantitative thinking—the names David Brooks and Gregg Easterbrook come to mind—but mostly I think it’s the same old story: math is hard, statistics is hard, these dudes are doing their best but sometimes their best isn’t good enough, etc. “Disdain” has nothing to do with it. To not choose to invest years of effort into a difficult skill that others can do better, to trust in the division of labor and do your best at what you’re best at . . . that can be a perfectly reasonable decision. If an academic historian does careful archival work and writes it up in hard-to-read prose—not on purpose but just cos hard-to-read prose is what he or she knows how to write—that can be fine. The idea would be that a journalist could write it up later for others. No disdaining. Division of labor, that’s all. Not everyone on the court has to be a two-way player.

I had a similar reaction a few years ago to Steven Pinker’s claim that academics often write so badly because “their goal is not so much communication as self-presentation—an overriding defensiveness against any impression that they may be slacker than their peers in hewing to the norms of the guild. Many of the hallmarks of academese are symptoms of this agonizing self-consciousness . . .” I replied that I think writing is just not so easy, and our discussion continued here.

Anyway, here’s the question. This fallacy, of thinking that when people can’t do what you can do, they’re just being stubborn . . . is there a name for it? The Expertise Fallacy??

Give this one a good name, and we can add it to the lexicon.

Did blind orchestra auditions really benefit women?

You’re blind!
And you can’t see
You need to wear some glasses
Like D.M.C.

Someone pointed me to this post, “Orchestrating false beliefs about gender discrimination,” by Jonatan Pallesen criticizing a famous paper from 2000, “Orchestrating Impartiality: The Impact of ‘Blind’ Auditions on Female Musicians,” by Claudia Goldin and Cecilia Rouse.

We’ve all heard the story. Here it is, for example, retold in a news article from 2013 that Pallesen links to and which I also found on the internet by googling *blind orchestra auditions*:

In the 1970s and 1980s, orchestras began using blind auditions. Candidates are situated on a stage behind a screen to play for a jury that cannot see them. In some orchestras, blind auditions are used just for the preliminary selection while others use it all the way to the end, until a hiring decision is made.

Even when the screen is only used for the preliminary round, it has a powerful impact; researchers have determined that this step alone makes it 50% more likely that a woman will advance to the finals. And the screen has also been demonstrated to be the source of a surge in the number of women being offered positions.

That’s what I remembered. But Pallesen tells a completely different story:

I have not once heard anything skeptical said about that study, and it is published in a fine journal. So one would think it is a solid result. But let’s try to look into the paper. . . .

Table 4 presents the first results comparing success in blind auditions vs non-blind auditions. . . . this table unambiguously shows that men are doing comparatively better in blind auditions than in non-blind auditions. The exact opposite of what is claimed.

Now, of course this measure could be confounded. It is possible that the group of people who apply to blind auditions is not identical to the group of people who apply to non-blind auditions. . . .

There is some data in which the same people have applied to both orchestras using blind auditions and orchestras using non-blind auditions, which is presented in table 5 . . . However, it is highly doubtful that we can conclude anything from this table. The sample sizes are small, and the proportions vary wildly . . .

In the next table they instead address the issue by regression analysis. Here they can include covariates such as number of auditions attended, year, etc, hopefully correcting for the sample composition problems mentioned above. . . . This is a somewhat complicated regression table. Again the values fluctuate wildly, with the proportion of women advanced in blind auditions being higher in the finals, and the proportion of men advanced being higher in the semifinals. . . . in conclusion, this study presents no statistically significant evidence that blind auditions increase the chances of female applicants. In my reading, the unadjusted results seem to weakly indicate the opposite, that male applicants have a slightly increased chance in blind auditions; but this advantage disappears with controls.

Hmmm . . . OK, we better go back to the original published article. I notice two things from the conclusion.

First, some equivocal results:

The question is whether hard evidence can support an impact of discrimination on hiring. Our analysis of the audition and roster data indicates that it can, although we mention various caveats before we summarize the reasons. Even though our sample size is large, we identify the coefficients of interest from a much smaller sample. Some of our coefficients of interest, therefore, do not pass standard tests of statistical significance and there is, in addition, one persistent result that goes in the opposite direction. The weight of the evidence, however, is what we find most persuasive and what we have emphasized. The point estimates, moreover, are almost all economically significant.

This is not very impressive at all. Some fine words, but the punchline seems to be that the data are too noisy to form any strong conclusions. And the bit about the point estimates being “economically significant”—that doesn’t mean anything at all. That’s just what you get when you have a small sample and noisy data: noisy estimates, which can easily be big numbers.
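To see how that works, here's a quick simulation sketch (the numbers are invented purely for illustration): when the true effect is small relative to the standard error, plenty of estimates look "big," their average magnitude is several times the true effect, and a good fraction even have the wrong sign.

# Invented numbers: true effect of 1 point, standard error of 5 points.
import numpy as np

rng = np.random.default_rng(1)
true_effect, se, n_sims = 0.01, 0.05, 100_000

est = rng.normal(true_effect, se, size=n_sims)
big = est[np.abs(est) > 0.05]  # estimates that look "economically significant"

print("share of estimates that look big:", len(big) / n_sims)
print("average magnitude among the big ones:", np.abs(big).mean())
print("share of big estimates with the wrong sign:", np.mean(big < 0))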

But then there’s this:

Using the audition data, we find that the screen increases—by 50 percent—the probability that a woman will be advanced from certain preliminary rounds and increases by severalfold the likelihood that a woman will be selected in the final round.

That’s that 50% we’ve been hearing about. I didn’t see it in Pallesen’s post. So let’s look for it in the Goldin and Rouse paper. It’s gotta be in the audition data somewhere . . . Also let’s look for the “increases by severalfold”—that’s even bigger; now we’re talking about effects of hundreds of percent.

The audition data are described on page 734:

We turn now to the effect of the screen on the actual hire and estimate the likelihood an individual is hired out of the initial audition pool. . . . The definition we have chosen is that a blind audition contains all rounds that use the screen. In using this definition, we compare auditions that are completely blind with those that do not use the screen at all or use it for the early rounds only. . . . The impact of completely blind auditions on the likelihood of a woman’s being hired is given in Table 9 . . . The impact of the screen is positive and large in magnitude, but only when there is no semifinal round. Women are about 5 percentage points more likely to be hired than are men in a completely blind audition, although the effect is not statistically significant. The effect is nil, however, when there is a semifinal round, perhaps as a result of the unusual effects of the semifinal round.

That last bit seems like a forking path, but let’s not worry about that. My real question is, Where’s that “50 percent” that everybody’s talkin bout?

Later there’s this:

The coefficient on blind [in Table 10] in column (1) is positive, although not significant at any usual level of confidence. The estimates in column (2) are positive and equally large in magnitude to those in column (1). Further, these estimates show that the existence of any blind round makes a difference and that a completely blind process has a somewhat larger effect (albeit with a large standard error).

Huh? Nothing’s statistically significant but the estimates “show that the existence of any blind round makes a difference”? I might well be missing something here. In any case, you shouldn’t be running around making a big deal about point estimates when the standard errors are so large. I don’t hold it against the authors—this was 2000, after all, the stone age in our understanding of statistical errors. But from a modern perspective we can see the problem.

Here’s another similar statement:

The impact for all rounds [columns (5) and (6)] [of Table 9] is about 1 percentage point, although the standard errors are large and thus the effect is not statistically significant. Given that the probability of winning an audition is less than 3 percent, we would need more data than we currently have to estimate a statistically significant effect, and even a 1-percentage-point increase is large, as we later demonstrate.

I think they’re talking about the estimates of 0.011 +/- 0.013 and 0.006 +/- 0.013. To say that “the impact . . . is about 1 percentage point” . . . that’s not right. The point here is not to pick on the authors for doing what everybody used to do, 20 years ago, but just to emphasize that we can’t really trust these numbers.
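Just to spell out the arithmetic on those reported numbers (taking the estimate and standard error at face value): 0.011 with a standard error of 0.013 gives a 95% interval of roughly (-0.014, 0.036), i.e., anywhere from about a 1.4-percentage-point decrease to a 3.6-point increase.

# Quick check of the reported estimate and standard error.
est, se = 0.011, 0.013
lo, hi = est - 1.96 * se, est + 1.96 * se
print(f"95% CI: ({lo:.3f}, {hi:.3f})")  # roughly (-0.014, 0.036)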

Anyway, where’s the damn “50 percent” and the “increases by severalfold”? I can’t find it. It’s gotta be somewhere in that paper, I just can’t figure out where.

Pallesen’s objections are strongly stated but they’re not new. Indeed, the authors of the original paper were pretty clear about its limitations. The evidence was all in plain sight.

For example, here’s a careful take posted by BS King in 2017:

Okay, so first up, the most often reported findings: blind auditions appear to account for about 25% of the increase in women in major orchestras. . . . [But] One of the more interesting findings of the study that I have not often seen reported: overall, women did worse in the blinded auditions. . . . Even after controlling for all sorts of factors, the study authors did find that bias was not equally present in all moments. . . .

Overall, while the study is potentially outdated (from 2001…using data from 1950s-1990s), I do think it’s an interesting frame of reference for some of our current debates. . . . Regardless, I think blinding is a good thing. All of us have our own pitfalls, and we all might be a little better off if we see our expectations toppled occasionally.

So where am I at this point?

I agree that blind auditions can make sense—even if they did not have the large effects claimed in that 2000 paper, or indeed even if they have no aggregate relative effects on men and women at all. What about that much-publicized “50 percent” claim, or for that matter the not-so-well-publicized but even more dramatic “increases by severalfold”? I have no idea. I’ll reserve judgment until someone can show me where that result appears in the published paper. It’s gotta be there somewhere.

P.S. See comments for some conjectures on the “50 percent” and “severalfold.”