
Vigorous data-handling tied to publication in top journals among public health researchers

Gur Huberman points us to this news article by Nicholas Bakalar, “Vigorous Exercise Tied to Macular Degeneration in Men,” which begins:

A new study suggests that vigorous physical activity may increase the risk for vision loss, a finding that has surprised and puzzled researchers.

Using questionnaires, Korean researchers evaluated physical activity among 211,960 men and women ages 45 to 79 in 2002 and 2003. Then they tracked diagnoses of age-related macular degeneration, from 2009 to 2013. . . .

They found that exercising vigorously five or more days a week was associated with a 54 percent increased risk of macular degeneration in men. They did not find the association in women.

The study, in JAMA Ophthalmology, controlled for more than 40 variables, including age, medical history, body mass index, prescription drug use and others. . . . an accompanying editorial suggests that the evidence from such a large cohort cannot be ignored.

The editorial, by Myra McGuinness, Julie Simpson, and Robert Finger, is unfortunately written entirely from the perspective of statistical significance and hypothesis testing, but they raise some interesting points nonetheless (for example, that the subgroup analysis can be biased if the matching of treatment to control group is done for the entire sample but not for each subgroup).

The news article is not so great, in my opinion. Setting aside various potential problems with the study (including those issues raised by McGuinness et al. in their editorial), the news article makes the mistake of going through all the reported estimates and picking the largest one. That’s selection bias right there. “A 54 percent increased risk,” indeed. If you want to report the study straight up, no criticism, fine. But then you should report the estimated main effect, which was 23% (as reported in the journal article, “(HR, 1.23; 95% CI, 1.02-1.49)”). That 54% number is just ridiculous. I mean, sure, maybe the effect really is 54%, who knows? But such an estimate is not supported by the data: it’s the largest of a set of reported numbers, any of which could’ve been considered newsworthy. If you take a set of numbers and report only the maximum, you’re introducing a bias.

Part of the problem, I suppose, is incentives. If you’re a health/science reporter, you have a few goals. One is to report exciting breakthroughs. Another is to get attention and clicks. Both goals are served, at least in the short term, by exaggeration. Even if it’s not on purpose.

OK, on to the journal article. As noted above, it’s based on a study of 200,000 people: “individuals between ages 45 and 79 years who were included in the South Korean National Health Insurance Service database from 2002 through 2013,” of whom half engaged in vigorous physical activity and half did not. It appears that the entire database contained about 500,000 people, of which 200,000 were selected for analysis in this comparison. The outcome is neovascular age-related macular degeneration, which seems to be measured by a prescription for ranibizumab, which I guess was the drug of choice for this condition in Korea at that time? Based on the description in the paper, I’m assuming they didn’t have direct data on the medical conditions, only on what drugs were prescribed, and when, hence “ranibizumab use from August 1, 2009, indicated a diagnosis of recently developed active (wet) neovascular AMD by an ophthalmologist.” I don’t know if there were people with neovascular AMD whose condition was not captured in this dataset because they never received this diagnosis.

In their matched sample of 200,000 people, 448 were recorded as having neovascular AMD: 250 in the vigorous exercise group and 198 in the control group. The data were put into a regression analysis, yielding an estimated hazard ratio of 1.23 with 95% confidence interval of [1.02, 1.49]. Also lots of subgroup analyses: unsurprisingly, the point estimate is higher for some subgroups than others; also unsurprisingly, some of the subgroup analyses reach statistical significance and some do not.

It is misleading to report that vigorous physical activity was associated with a greater hazard rate for neovascular AMD in men but not in women. Both the journal article and the news article made this mistake. The difference between “significant” and “non-significant” is not itself statistically significant.
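To see why, here’s a minimal sketch with made-up numbers of roughly the right size (I don’t have the women’s estimate and standard error in front of me, so treat these as hypothetical): the comparison that matters is the difference between the two subgroup estimates, and that difference has its own, larger, standard error.

```python
import numpy as np

# Hypothetical log hazard ratio estimates and standard errors for two subgroups.
# These numbers are illustrative, not taken from the JAMA Ophthalmology paper.
est_men, se_men = 0.43, 0.17      # "significant": z is about 2.5
est_women, se_women = 0.10, 0.18  # "not significant": z is about 0.6

# The relevant comparison is the difference between the subgroup estimates.
diff = est_men - est_women
se_diff = np.sqrt(se_men**2 + se_women**2)  # assuming independent subgroups

print(f"men:        z = {est_men / se_men:.1f}")
print(f"women:      z = {est_women / se_women:.1f}")
print(f"difference: {diff:.2f} +/- {se_diff:.2f}, z = {diff / se_diff:.1f}")
# One subgroup clears 1.96 and the other doesn't, but the difference itself
# has z well under 1.96, so these data would not demonstrate that the effect
# differs between men and women.
```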

So what do I think about all this? First, the estimates are biased due to selection on statistical significance (see, for example, section 2.1 here). Second, given how surprised everyone is, this suggests a prior distribution on any effect that should be concentrated near zero, which would pull all estimates toward 0 (or pull all hazard ratios toward 1), and I expect that the 95% intervals would then all include the null effect. Third, beyond all the selection mentioned above, there’s the selection entailed in studying this particular risk factor and this particular outcome. In this big study, you could study the effect of just about any risk factor X on just about any outcome Y. I’d like to see a big grid of all these things, all fit with a multilevel model. Until then, we’ll need good priors on the effect size for each study, or else some corrections for type M and type S errors.
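To make the second point concrete, here’s a minimal sketch of that shrinkage for the reported main effect (HR 1.23, 95% CI 1.02 to 1.49), using a normal approximation on the log hazard ratio and an assumed normal(0, 0.1) prior. The prior scale is my invention, chosen only to stand in for “concentrated near zero.”

```python
import numpy as np

# Reported main effect: HR 1.23, 95% CI [1.02, 1.49] (from the journal article).
log_hr = np.log(1.23)
se = (np.log(1.49) - np.log(1.02)) / (2 * 1.96)  # back out the standard error

# Assumed skeptical prior on the log hazard ratio: normal(0, 0.1).
# This is an illustration, not a published analysis.
prior_mean, prior_sd = 0.0, 0.1

# Normal-normal conjugate updating (precision-weighted average).
post_prec = 1 / se**2 + 1 / prior_sd**2
post_mean = (log_hr / se**2 + prior_mean / prior_sd**2) / post_prec
post_sd = np.sqrt(1 / post_prec)

lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
print(f"raw:      HR {np.exp(log_hr):.2f} [1.02, 1.49]")
print(f"shrunken: HR {np.exp(post_mean):.2f} [{np.exp(lo):.2f}, {np.exp(hi):.2f}]")
# With this assumed prior the shrunken interval includes HR = 1 (the null effect).
```

With that particular prior the point estimate gets pulled roughly halfway toward 1 and the interval includes the null, which is the kind of behavior I had in mind above; a different prior would pull more or less.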

Just reporting the raw estimate from one particular study like that: No way. That’s a recipe for future non-replicable results. Sorry, NYT, and sorry, JAMA: you’re gettin played.

P.S. Gur wrote:

The topic may merit two posts — one for the male subpopulation, another for the female.

To which I replied:

20 posts, of which 1 will be statistically significant.

P.P.S. On the plus side, Jonathan Falk pointed me the other day to this post by Scott Alexander, who writes the following about a test of a new psychiatric drug:

The pattern of positive results shows pretty much the random pattern you would expect from spurious findings. They’re divided evenly among a bunch of scales, with occasional positive results on one scale followed by negative results on a very similar scale measuring the same thing. Most of them are only the tiniest iota below p = 0.05. Many of them only work at 40 mg, and disappear in the 80 mg condition; there are occasional complicated reasons why drugs can work better at lower doses, but Occam’s razor says that’s not what’s happening here. One of the results only appeared in Stage 2 of the trial, and disappeared in Stage 1 and the pooled analysis. This doesn’t look exactly like they just multiplied six instruments by two doses by three ways of grouping the stages, got 36 different cells, and rolled a die in each. But it’s not too much better than that. Who knows, maybe the drug does something? But it sure doesn’t seem to be a particularly effective antidepressant, even by our very low standards for such. Right now I am very unimpressed.

It’s good to see this mode of thinking becoming so widespread. It makes me feel that things are changing in a good way.

So, some good news for once!

Hey, people are doing the multiverse!

Elio Campitelli writes:

I’ve just seen this image in a paper discussing the weight of evidence for a “hiatus” in the global warming signal and immediately thought of the garden of forking paths.

From the paper:

Tree representation of choices to represent and test pause-periods. The ‘pause’ is defined as either no-trend or a slow-trend. The trends can be measured as ‘broken’ or ‘continuous’ trends. The data used to assess the trends can come from HadCRUT, GISTEMP, or other datasets. The bottom branch represents the use of ‘historical’ versions of the datasets as they existed, or contemporary versions providing full dataset ‘hindsight’. The colour coded circles at the bottom of the tree indicate our assessment of the level of evidence (fair, weak, little or no) for the tests undertaken for each set of choices in the tree. The ‘year’ rows are for assessments undertaken at each year in time.

Thus, descending the tree in the figure, a typical researcher makes choices (explicitly or implicitly) about how to define the ‘pause’ (no-trend or slow-trend), how to model the pause-interval (as broken or continuous trends), which (and how many) datasets to use (HadCRUT, GISTEMP, Other), and what versions to use for the data with what foresight about corrections to the data (historical, hindsight). For example, a researcher who chose to define the ‘pause’ as no-trend and selected isolated intervals to test trends (broken trends) using HadCRUT3 data would be following the left-most branches of the tree.
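Just to spell out how quickly these choices multiply, here’s a little sketch that enumerates the combinations described in that caption (the labels are mine, taken from the quoted text):

```python
from itertools import product

pause_definitions = ["no-trend", "slow-trend"]
trend_models = ["broken", "continuous"]
datasets = ["HadCRUT", "GISTEMP", "Other"]
data_versions = ["historical", "hindsight"]

# Every path down the tree is one combination of analysis choices.
multiverse = list(product(pause_definitions, trend_models, datasets, data_versions))
print(len(multiverse), "possible analyses")  # 2 * 2 * 3 * 2 = 24
for analysis in multiverse[:3]:
    print(analysis)
```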

Actually, it’s the multiverse.

Data quality is a thing.

I just happened to come across this story, where a journalist took some garbled data and spun a false tale which then got spread without question.

It’s a problem. First, it’s a problem that people will repeat unjustified claims; second, it’s a problem that when data are attached, you can get complete credulity, even for claims that are implausible on the face of it.

So it’s good to be reminded: “Data” are just numbers. You need to know where the data came from before you can learn anything from them.

“Did Jon Stewart elect Donald Trump?”

I wrote this post a couple weeks ago and scheduled it for October, but then I learned from a reporter that the research article under discussion was retracted, so it seemed to make sense to post this right away while it was still newsworthy.

My original post is below, followed by a post script regarding the retraction.

Matthew Heston writes:

First time, long time. I don’t know if anyone has sent over this recent paper [“Did Jon Stewart elect Donald Trump? Evidence from television ratings data,” by Ethan Porter and Thomas Wood] which claims that Jon Stewart leaving The Daily Show “spurred a 1.1% increase in Trump’s county-level vote share.”

I’m not a political scientist, and not well versed in the methods they say they’re using, but I’m skeptical of this claim. One line that stood out to me was: “To put the effect size in context, consider the results from the demographic controls. Unsurprisingly, several had significant results on voting. Yet the effects of The Daily Show’s ratings decline loomed larger than several controls, such as those related to education and ethnicity, that have been more commonly discussed in analyses of the 2016 election.” This seems odd to me, as I wouldn’t expect a TV show host change to have a larger effect than these other variables.

They also mention that they’re using “a standard difference-in-difference approach.” As I mentioned, I’m not too familiar with this approach. But my understanding is that they would be comparing pre- and post- treatment differences in a control and treatment group. Since the treatment in this case is a change in The Daily Show host, I’m unsure of who the control group would be. But maybe I’m missing something here.

Heston points to our earlier posts on the Fox news effect.

Anyway, what do I think of this new claim? The answer is that I don’t really know.

Let’s work through what we can.

In reporting any particular effect there’s some selection bias, so let’s start by assuming an Edlin factor of 1/2: the estimated effect of Jon Stewart then goes from 1.1% to 0.55% in Trump’s county-level vote share. Call it 0.6%. Vote share is approximately 50%, so a 0.6% change is approximately 0.3 percentage points of the total vote. Would this have swung the election? I’m not sure, maybe not quite.
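Spelling out that back-of-the-envelope calculation (treating the reported 1.1% as a proportional change in vote share, as above):

$$
1.1\% \times \tfrac{1}{2} = 0.55\% \approx 0.6\%, \qquad 0.6\% \times 50\% \approx 0.3 \text{ percentage points of the total vote.}
$$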

Let’s assume the effect is real. How to think about it? It’s one of many such effects, along with other media outlets, campaign tactics, news items, etc.

A few years ago, Noah Kaplan, David Park, and I wrote an article attempting to distinguish between what we called the random walk and mean-reversion models of campaigning. The random walk model posits that the voters are where they are, and campaigning (or events more generally) moves them around. In this model, campaign effects are additive: +0.3 here, -0.4 there, and so forth. In contrast, the mean-reversion model starts at the end, positing that the election outcome is largely determined by the fundamentals, with earlier fluctuations in opinion mostly being a matter of the voters coming to where they were going to be. After looking at what evidence we could find, we concluded that the mean-reversion model made more sense and was more consistent with the data. This is not to say that the Jon Stewart show would have no effect, just that it’s one of many interventions during the campaign, and I can’t picture each of them having an independent effect and these effects all adding up.
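For intuition, here’s a minimal simulation contrasting the two models (purely illustrative, not the model from our paper): in the random-walk version every campaign shock persists and the effects add up, while in the mean-reversion version shocks decay as opinion is pulled back toward the fundamentals.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_days = 200
fundamentals = 0.52                     # hypothetical fundamentals-based vote share
shocks = rng.normal(0, 0.003, n_days)   # daily campaign/news effects

# Random-walk model: shocks accumulate, so each event shifts the final outcome.
random_walk = fundamentals + np.cumsum(shocks)

# Mean-reversion model: opinion decays back toward the fundamentals,
# so early shocks have almost no effect by election day.
rho = 0.95
mean_revert = np.empty(n_days)
x = fundamentals
for t in range(n_days):
    x = fundamentals + rho * (x - fundamentals) + shocks[t]
    mean_revert[t] = x

print("final opinion, random walk:   ", round(random_walk[-1], 3))
print("final opinion, mean reversion:", round(mean_revert[-1], 3))
```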

P.S. After the retraction

The article discussed above was retracted because the analysis had a coding error.

What to say given this new information?

First, I guess Heston’s skepticism is validated. When you see a claim that seems too big to be true (as here or here), maybe it’s just mistaken in some way.

Second, I too have had to correct a paper whose empirical claims were invalidated by a coding error. It happens—and not just to Excel users!

Third, maybe the original reaction to that study was a bit too strong. See the above post: Even had the data shown what had originally been claimed, the effect they found was not as consequential as it might’ve seemed at first. Setting aside all questions of data errors and statistical errors, there’s a limit to what can be learned about a dynamic process—an election campaign—from an isolated study.

I am concerned that all our focus on causal identification, important as it is, can lead researchers, journalists, and members of the general public to overconfidence in theories as a result of isolated studies, without always recognizing that real life is more complicated. I had a similar feeling a few years ago regarding the publicity surrounding the college-football-and-voting study. The particular claims regarding football and voting have since been disputed, but even if you accept the original study as is, its implications aren’t as strong as had been claimed in the press. Whatever these causal effects are, they vary by person and scenario, and they’re not occurring in isolation.

“In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous,” but “Over 20 journals turned down her paper . . . and nobody wanted to fund privacy research that might reach uncomfortable conclusions.”

Tom Daula writes:

I think this story from John Cook is a different perspective on replication and how scientists respond to errors.

In particular the final paragraph:

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In [Latanya] Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.

I think most of your scientific error stories follow this pattern. The error is pointed out privately and then publicized. Of course in most of your posts a private email is met with hostility, the error is publicized, and then the scientist digs in. The good stories are when the authors admit and publicize the error themselves.

Replication, especially in psychology, fits into this because there is no “single responsible party” so “calling attention to the problem [is] the only way to make things better.”

I imagine Latanya Sweeney and you share similar frustrations.

It’s an interesting story. I was thinking about this recently when reading one of Edward Winter’s chess notes collections. These notes are full of stories of sloppy writers copying things without citation, reproducing errors that have appeared elsewhere, introducing new errors (see an example here with follow-up here). Anyway, what’s striking to me is that so many people just don’t seem to care about getting their facts wrong. Or, maybe they do care, but not enough to fix their errors or apologize or even thank the people who point out the mistakes that they’ve made. I mean, why bother writing a chess book if you’re gonna put mistakes in it? It’s not like you can make a lot of money from these things.

Sweeney’s example is of course much more important, but sometimes when thinking about a general topic (in this case, authors getting angry when their errors are revealed to the world) it can be helpful to think about minor cases too.

“MRP is the Carmelo Anthony of election forecasting methods”? So we’re doing trash talking now??

What’s the deal with Nate Silver calling MRP “the Carmelo Anthony of forecasting methods”?

Someone sent this to me:

and I was like, wtf? I don’t say wtf very often—at least, not on the blog—but this just seemed weird.

For one thing, Nate and I did a project together once using MRP: this was our estimate of attitudes on health care reform by age, income, and state:

Without MRP, we couldn’t’ve done anything like it.

So, what gives?

Here’s a partial list of things that MRP has done:

– Estimating public opinion in slices of the population

– Improved analysis using the voter file

– Polling using the Xbox that outperformed conventional poll aggregates

– Changing our understanding of the role of nonresponse in polling swings

– Post-election analysis that’s a lot more trustworthy than exit polls

OK, sure, MRP has solved lots of problems, it’s revolutionized polling, no matter what Team Buggy Whip says.

That said, it’s possible that MRP is overrated. “Overrated” is a difference between rated quality and actual quality. MRP, wonderful as it is, might well be rated too highly in some quarters. I wouldn’t call MRP a “forecasting method,” but that’s another story.

I guess the thing that bugged me about the Carmelo Anthony comparison is that my impression from reading the sports news is not just that Anthony is overrated but that he’s an actual liability for his teams. Whereas I see MRP, overrated as it may be (I’ve seen no evidence that MRP is overrated but I’ll accept this for the purpose of argument), as still a valuable contributor to polling.

Ten years ago . . .

The end of the aughts. It was a simpler time. Nate Silver was willing to publish an analysis that used MRP. We all thought embodied cognition was real. Donald Trump was a reality-TV star. Kevin Spacey was cool. Nobody outside of suburban Maryland had heard of Beach Week.

And . . . Carmelo Anthony got lots of respect from the number crunchers.

Check this out:

So here’s the story according to Nate: MRP is like Carmelo Anthony because they’re both overrated. But Carmelo Anthony isn’t overrated, he’s really underrated. So maybe Nate’s MRP jab was just a backhanded MRP compliment?

Simpler story, I guess, is that back around 2010 Nate liked MRP and he liked Carmelo. Back then, he thought the people who thought Carmelo was overrated were wrong. In 2018, he isn’t so impressed with either of them. Nate’s impressions of MRP and Carmelo Anthony go up and down together. That’s consistent, I guess.

In all seriousness . . .

Unlike Nate Silver, I claim no expertise on basketball. For all I know, Tim Tebow will be starting for the Knicks next year!

I do claim some expertise on MRP, though. Nate described MRP as “not quite ‘hard’ data.” I don’t really know what Nate meant by “hard” data—ultimately, these are all just survey responses—but, in any case, I replied:

I guess MRP can mean different things to different people. All the MRP analyses I’ve ever published are entirely based on hard data. If you want to see something that’s a complete mess and is definitely overrated, try looking into the guts of classical survey weighting (see for example this paper). Meanwhile, Yair used MRP to do these great post-election summaries. Exit polls are a disaster; see for example here.

Published poll toplines are not the data, warts and all; they’re processed data, sometimes not adjusted for enough factors as in the notorious state polls in 2016. I agree with you that raw data is the best. Once you have raw data, you can make inferences for the population. That’s what Yair was doing. For understandable commercial reasons, lots of pollsters will release toplines and crosstabs but not raw data. MRP (or, more generally, RRP) is just a way of going from the raw data to make inference about the general population. It’s the general population (or the population of voters) that we care about. The people in the sample are just a means to an end.

Anyway, if you do talk about MRP and how overrated it is, you might consider pointing people to some of those links to MRP successes. Hey, here’s another one: we used MRP to estimate public opinion on health care. MRP has quite a highlight reel, more like Lebron or Steph or KD than Carmelo, I’d say!
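For readers who haven’t followed the acronym, here is the poststratification step that the “P” refers to, stated minimally: fit a multilevel regression to the survey responses to get an estimate for each demographic-geographic cell, then average the cells using known population counts,

$$
\hat{\theta}^{\text{pop}} \;=\; \frac{\sum_{j} N_j \,\hat{\theta}_j}{\sum_{j} N_j},
$$

where $j$ indexes the cells, $\hat{\theta}_j$ is the model-based estimate for cell $j$, and $N_j$ is that cell’s population count from the census or voter file. Summing over only the cells within a state or district gives the corresponding subnational estimate.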

One thing I will say is that data and analysis go together:

– No modern survey is good enough to be able to just interpret the results without any adjustment. Nonresponse is just too big a deal. Every survey gets adjusted, but some don’t get adjusted well.

– No analysis method can do it on its own without good data. All the modeling in the world won’t help you if you have serious selection bias.

Yair added:

Maybe it’s just a particularly touchy week for Melo references.

Both Andy and I would agree that MRP isn’t a silver bullet. But nothing is a silver bullet. I’ve seen people run MRP with bad survey data, bad poststratification data, and/or bad covariates in a model that’s way too sparse, and then over-promise about the results. I certainly wouldn’t endorse that. On the other side, obviously I agree with Andy that careful uses of MRP have had many successes, and it can improve survey inferences, especially compared to traditional weighting.

I think maybe you’re talking specifically about election forecasting? I haven’t seen comparisons of your forecasts to YouGov or PredictWise or whatever else. My vague sense pre-election was that they were roughly similar, i.e., that the meaty part of the curves overlapped. Maybe I’m wrong and your forecasts were much better this time—but non-MRP forecasters have also done much worse than you, so is that an indictment of MRP, or are you just really good at forecasting?

More to my main point—in one of your recent podcasts, I remember you said something about how forecasts aren’t everything, and people should look at precinct results to try to get beyond the toplines. That’s roughly what we’ve been trying to do in our post-election project, which has just gotten started. We see MRP as a way to combine all the data—pre-election voter file data, early voting, precinct results, county results, polling—into a single framework. Our estimates aren’t going to be perfect, for sure, but hopefully an improvement over what’s been out there, especially at sub-national levels. I know we’d do better if we had a lot more polling data, for instance. FWIW I get questions from clients all the time about how demographic groups voted in different states. Without state-specific survey data, which is generally unavailable and often poorly collected/weighted, not sure what else you can do except some modeling like MRP.

Maybe you’d rather see the raw unprocessed data like the precinct results. Fair enough, sometimes I do too! My sense is the people who want that level of detail are in the minority of the minority. Still, we’re going to try to do things like show the post-processed MRP estimates, but also some of the raw data to give intuition. I wonder if you think this is the right approach, or if you think something else would be better.

And Ryan Enos writes:

To follow up on this—I think you’ll all be interested in seeing the back and forth between Nate and Lynn Vavreck, who was interviewing him. It was more of a discussion of tradeoffs between different approaches than a discussion of what is wrong with MRP. Nate’s MRP alternative was to do a poll in every district, which I think we can all agree would be nice – if not entirely realistic. Although, as Nate pointed out, some of the efforts from the NY Times this cycle made that seem more realistic. In my humble opinion, Lynn did a nice job pushing Nate on the point that, even with data like the NY Times polls, you are still moving beyond raw data by weighting and, as Andrew points out, we often don’t consider how complex this can be (I have a common frustration with academic research about how much out-of-the-box survey weights are used and abused).

I don’t actually pay terribly close attention to forecasting – but in my mind, Nate and everybody else in the business is doing a fantastic job and the YouGov MRP forecasts have been a revelation. From my perspective, as somebody who cares more about what survey data can teach us about human behavior and important political phenomena, I think MRP has been a revelation in that it has allowed us to infer opinion in places, such as metro areas, where it would otherwise be missing. This has been one of the most important advances in public opinion research in my lifetime. Where the “overrated” part becomes true is that just like every other scientific advance, people can get too excited about what it can do without thinking about what assumptions are going into the method and this can lead to believing it can do more than it can—but this is true of everything.

Yair, to your question about presentation—I am a big believer in raw data and I think combining the presentation of MRP with something like precinct results, despite the dangers of ecological error, can be really valuable because it can allow people to check MRP results with priors from raw data.

It’s fine to do a poll in every district but then you’d still want to do MRP in order to adjust for nonresponse, estimate subgroups of the population, study public opinion in between the districtwide polls, etc.

Scandal! Mister P appears in British tabloid.

Tim Morris points us to this news article:

And here’s the kicker:

Mister P.

Not quite as cool as the time I was mentioned in Private Eye, but it’s still pretty satisfying.

My next goal: Getting a mention in Sports Illustrated. (More on this soon.)

In all seriousness, it’s so cool when methods that my collaborators and I have developed are just out there, for anyone to use. I only wish Tom Little were around to see it happening.

P.S. Some commenters are skeptical, though:

I agree that polls can be wrong. The issue is not so much the size of the sample but rather that the sample can be unrepresentative. But I do think that polls provide some information; it’s better than just guessing.

P.P.S. Unrelatedly, Morris wrote, with Ian White and Michael Crowther, this article on using simulation studies to evaluate statistical methods.

Fake-data simulation. Yeah.
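In the spirit of that article, a minimal sketch of the idea: simulate data from a known model, run your estimation procedure, and check that it recovers the truth—here, coverage of a nominal 95% interval for a sample mean, but the same template applies to any method you want to evaluate.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_sigma, n, n_sims = 2.0, 3.0, 50, 2000

covered = 0
for _ in range(n_sims):
    y = rng.normal(true_mu, true_sigma, n)           # fake data from a known model
    est, se = y.mean(), y.std(ddof=1) / np.sqrt(n)   # the "method" being evaluated
    if est - 1.96 * se <= true_mu <= est + 1.96 * se:
        covered += 1

print("empirical coverage of nominal 95% interval:", covered / n_sims)
# Should be close to 0.95 here; the payoff comes when you apply the same
# check to more complicated estimators, where the answer is not obvious.
```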

Horse-and-buggy era officially ends for survey research

Peter Enns writes:

Given the various comments on your blog about evolving survey methods (e.g., Of buggy whips and moral hazards; or, Sympathy for the Aapor), I thought you might be interested that the Roper Center has updated its acquisitions policy and is now accepting non-probability samples and other methods. This is an exciting move for the Roper Center.

Jeez. I wonder what the President of American Association of Buggy-Whip Manufacturers thinks about that!

In all seriousness, let’s never forget that our inferences are only as good as our data. Whether your survey responses come by telephone, or internet, or any other method, you want to put in the effort to get good data from a representative sample, and then to adjust as necessary. There’s no easy solution, it just needs the usual eternal vigilance.

P.S. I’m posting this one now, rather than with the usual six-month delay, because you can now go to the Roper Center and get these polls. I didn’t want you to have to wait!

When we had fascism in the United States

I was reading this horrifying and hilarious story by Colson Whitehead, along with an excellent article by Adam Gopnik in the New Yorker (I posted a nitpick on it a couple days ago) on the Reconstruction and post-Reconstruction era in the United States, and I was suddenly reminded of something.

In one of the political science classes I took in college, we were told that one of the big questions about U.S. politics, compared to Europe, is why we’ve had no socialism and no fascism. Sure, there have been a few pockets of socialism where they’ve won a few elections, and there was Huey Long in 1930s Louisiana, but nothing like Europe, where the Left and the Right have ruled entire countries and where, at least for a time, socialism and fascism were the ideologies of major parties.

That’s what we were taught. But, as Whitehead and Gopnik (and Henry Louis Gates, the author of the book that Gopnik was reviewing) remind us, that’s wrong. We have had fascism here for a long time—in the post-reconstruction South.

What’s fascism all about? Right-wing, repressive government, political power obtained and maintained through violence and the threat of violence, a racist and nationalist ideology, and a charismatic leader.

The post-reconstruction South didn’t have a charismatic leader, but the other parts of the description fit, so on the whole I’d call it a fascist regime.

In the 1930s, Sinclair Lewis wrote It Can’t Happen Here about a hypothetical fascist Americanism, and there was that late book by Philip Roth with a similar theme. I guess other people have had this thought so I googled *it has happened here* and came across this post talking about fascism in the United States, pointing to Red scares, the internment of Japanese Americans in WW2, and FBI infiltration of the civil rights movement. All these topics are worth writing about, but none of them seem to me to be even nearly as close to fascism as what happened for close to a century in the post-reconstruction South.

Louis Hartz wrote The Liberal Tradition in America back in the 1950s. The funny thing is, back in the 1950s there was still a lot of fascism down there.

But nobody made that connection to us when we were students.

Maybe the U.S. South just seemed unique, and the legacy of slavery distracted historians and political scientists so much they didn’t see the connection to fascism, a political movement with a nationalistic racist ideology that used violence to take and maintain power in a democratic system. It’s stunning in retrospect that Huey Long was discussed as a proto-fascist without any recognition that the entire South had a fascist system of government.

P.S. A commenter points to this article by Ezekiel Kweku and Jane Coaston from a couple years ago making the same point:

Fascism has happened before in America.

For generations of black Americans, the United States between the end of Reconstruction, around 1876, and the triumphs of the civil rights movement in the early 1960s was a fascist state.

Good to know that others have seen this connection before. It’s still notable, I think, that we weren’t aware of this all along.

Name this fallacy!

It’s the fallacy of thinking that, just cos you’re good at something, everyone should be good at it, and if they’re not, they’re just being stubborn and doing it badly on purpose.

I thought about this when reading this line from Adam Gopnik in the New Yorker:

[Henry Louis] Gates is one of the few academic historians who do not disdain the methods of the journalist . . .

Gopnik’s article is fascinating, and I have no doubt that Gates’s writing is both scholarly and readable.

My problem is with Gopnik’s use of the word “disdain.” The implication seems to be that other historians could write like journalists if they felt like it, but they just disdain to do so, maybe because they think it would be beneath their dignity, or maybe because of the unwritten rules of the academic profession.

The thing that Gopnik doesn’t get, I think, is that it’s hard to write well. Most historians can’t write like A. J. P. Taylor or Henry Louis Gates. Sure, maybe they could approach that level if they were to work hard at it, but it would take a lot of work, a lot of practice, and it’s not clear this would be the best use of their time and effort.

For a journalist to say that most academics “disdain the methods of the journalist” would be like me saying that most journalists “disdain the methods of the statistician.” OK, maybe some journalists actively disdain quantitative thinking—the names David Brooks and Gregg Easterbrook come to mind—but mostly I think it’s the same old story: math is hard, statistics is hard, these dudes are doing their best but sometimes their best isn’t good enough, etc. “Disdain” has nothing to do with it. To not choose to invest years of effort into a difficult skill that others can do better, to trust in the division of labor and do your best at what you’re best at . . . that can be a perfectly reasonable decision. If an academic historian does careful archival work and writes it up in hard-to-read prose—not on purpose but just cos hard-to-read prose is what he or she knows how to write—that can be fine. The idea would be that a journalist could write it up later for others. No disdaining. Division of labor, that’s all. Not everyone on the court has to be a two-way player.

I had a similar reaction a few years ago to Steven Pinker’s claim that academics often write so badly because “their goal is not so much communication as self-presentation—an overriding defensiveness against any impression that they may be slacker than their peers in hewing to the norms of the guild. Many of the hallmarks of academese are symptoms of this agonizing self­consciousness . . .” I replied that I think writing is just not so easy, and our discussion continued here.

Anyway, here’s the question. This fallacy, of thinking that when people can’t do what you can do, that they’re just being stubborn . . . is there a name for it? The Expertise Fallacy??

Give this one a good name, and we can add it to the lexicon.

Did blind orchestra auditions really benefit women?

You’re blind!
And you can’t see
You need to wear some glasses
Like D.M.C.

Someone pointed me to this post, “Orchestrating false beliefs about gender discrimination,” by Jonatan Pallesen criticizing a famous paper from 2000, “Orchestrating Impartiality: The Impact of ‘Blind’ Auditions on Female Musicians,” by Claudia Goldin and Cecilia Rouse.

We’ve all heard the story. Here it is, for example, retold in a news article from 2013 that Pallesen links to and which I also found on the internet by googling *blind orchestra auditions*:

In the 1970s and 1980s, orchestras began using blind auditions. Candidates are situated on a stage behind a screen to play for a jury that cannot see them. In some orchestras, blind auditions are used just for the preliminary selection while others use it all the way to the end, until a hiring decision is made.

Even when the screen is only used for the preliminary round, it has a powerful impact; researchers have determined that this step alone makes it 50% more likely that a woman will advance to the finals. And the screen has also been demonstrated to be the source of a surge in the number of women being offered positions.

That’s what I remembered. But Pallesen tells a completely different story:

I have not once heard anything skeptical said about that study, and it is published in a fine journal. So one would think it is a solid result. But let’s try to look into the paper. . . .

Table 4 presents the first results comparing success in blind auditions vs non-blind auditions. . . . this table unambiguously shows that men are doing comparatively better in blind auditions than in non-blind auditions. The exact opposite of what is claimed.

Now, of course this measure could be confounded. It is possible that the group of people who apply to blind auditions is not identical to the group of people who apply to non-blind auditions. . . .

There is some data in which the same people have applied to both orchestras using blind auditions and orchestras using non-blind auditions, which is presented in table 5 . . . However, it is highly doubtful that we can conclude anything from this table. The sample sizes are small, and the proportions vary wildly . . .

In the next table they instead address the issue by regression analysis. Here they can include covariates such as number of auditions attended, year, etc, hopefully correcting for the sample composition problems mentioned above. . . . This is a somewhat complicated regression table. Again the values fluctuate wildly, with the proportion of women advanced in blind auditions being higher in the finals, and the proportion of men advanced being higher in the semifinals. . . . in conclusion, this study presents no statistically significant evidence that blind auditions increase the chances of female applicants. In my reading, the unadjusted results seem to weakly indicate the opposite, that male applicants have a slightly increased chance in blind auditions; but this advantage disappears with controls.

Hmmm . . . OK, we better go back to the original published article. I notice two things from the conclusion.

First, some equivocal results:

The question is whether hard evidence can support an impact of discrimination on hiring. Our analysis of the audition and roster data indicates that it can, although we mention various caveats before we summarize the reasons. Even though our sample size is large, we identify the coefficients of interest from a much smaller sample. Some of our coefficients of interest, therefore, do not pass standard tests of statistical significance and there is, in addition, one persistent result that goes in the opposite direction. The weight of the evidence, however, is what we find most persuasive and what we have emphasized. The point estimates, moreover, are almost all economically significant.

This is not very impressive at all. Some fine words but the punchline seems to be that the data are too noisy to form any strong conclusions. And the bit about the point estimates being “economically significant”—that doesn’t mean anything at all. That’s just what you get when you have a small sample and noisy data: noisy estimates, which can give you big numbers.

But then there’s this:

Using the audition data, we find that the screen increases—by 50 percent—the probability that a woman will be advanced from certain preliminary rounds and increases by severalfold the likelihood that a woman will be selected in the final round.

That’s that 50% we’ve been hearing about. I didn’t see it in Pallesen’s post. So let’s look for it in the Goldin and Rouse paper. It’s gotta be in the audition data somewhere . . . Also let’s look for the “increases by severalfold”—that’s even more, now we’re talking effects of hundreds of percent.

The audition data are described on page 734:

We turn now to the effect of the screen on the actual hire and estimate the likelihood an individual is hired out of the initial audition pool. . . . The definition we have chosen is that a blind audition contains all rounds that use the screen. In using this definition, we compare auditions that are completely blind with those that do not use the screen at all or use it for the early rounds only. . . . The impact of completely blind auditions on the likelihood of a woman’s being hired is given in Table 9 . . . The impact of the screen is positive and large in magnitude, but only when there is no semifinal round. Women are about 5 percentage points more likely to be hired than are men in a completely blind audition, although the effect is not statistically significant. The effect is nil, however, when there is a semifinal round, perhaps as a result of the unusual effects of the semifinal round.

That last bit seems like a forking path, but let’s not worry about that. My real question is, Where’s that “50 percent” that everybody’s talkin bout?

Later there’s this:

The coefficient on blind [in Table 10] in column (1) is positive, although not significant at any usual level of confidence. The estimates in column (2) are positive and equally large in magnitude to those in column (1). Further, these estimates show that the existence of any blind round makes a difference and that a completely blind process has a somewhat larger effect (albeit with a large standard error).

Huh? Nothing’s statistically significant but the estimates “show that the existence of any blind round makes a difference”? I might well be missing something here. In any case, you shouldn’t be running around making a big deal about point estimates when the standard errors are so large. I don’t hold it against the authors—this was 2000, after all, the stone age in our understanding of statistical errors. But from a modern perspective we can see the problem.

Here’s another similar statement:

The impact for all rounds [columns (5) and (6)] [of Table 9] is about 1 percentage point, although the standard errors are large and thus the effect is not statistically significant. Given that the probability of winning an audition is less than 3 percent, we would need more data than we currently have to estimate a statistically significant effect, and even a 1-percentage-point increase is large, as we later demonstrate.

I think they’re talking about the estimates of 0.011 +/- 0.013 and 0.006 +/- 0.013. To say that “the impact . . . is about 1 percentage point” . . . that’s not right. The point here is not to pick on the authors for doing what everybody used to do, 20 years ago, but just to emphasize that we can’t really trust these numbers.
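For reference, taking the first of those at face value and treating the ± value as a standard error, the implied 95% interval is

$$
0.011 \pm 1.96 \times 0.013 \approx (-0.014,\ 0.036),
$$

which is consistent with anything from a small negative effect to an effect larger than the baseline hiring probability of 3 percent. That’s why a summary like “about 1 percentage point” can’t carry much weight.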

Anyway, where’s the damn “50 percent” and the “increases by severalfold”? I can’t find it. It’s gotta be somewhere in that paper, I just can’t figure out where.

Pallesen’s objections are strongly stated but they’re not new. Indeed, the authors of the original paper were pretty clear about its limitations. The evidence was all in plain sight.

For example, here’s a careful take posted by BS King in 2017:

Okay, so first up, the most often reported findings: blind auditions appear to account for about 25% of the increase in women in major orchestras. . . . [But] One of the more interesting findings of the study that I have not often seen reported: overall, women did worse in the blinded auditions. . . . Even after controlling for all sorts of factors, the study authors did find that bias was not equally present in all moments. . . .

Overall, while the study is potentially outdated (from 2001…using data from 1950s-1990s), I do think it’s an interesting frame of reference for some of our current debates. . . . Regardless, I think blinding is a good thing. All of us have our own pitfalls, and we all might be a little better off if we see our expectations toppled occasionally.

So where am I at this point?

I agree that blind auditions can make sense—even if they do not have the large effects claimed in that 2000 paper, or indeed even if they have no aggregate relative effects on men and women at all. What about that much-publicized “50 percent” claim, or for that matter the not-so-well-publicized but even more dramatic “increases by severalfold”? I have no idea. I’ll reserve judgment until someone can show me where that result appears in the published paper. It’s gotta be there somewhere.

P.S. See comments for some conjectures on the “50 percent” and “severalfold.”

Maintenance cost is quadratic in the number of features

Bob Carpenter shares this story illustrating the challenges of software maintenance. Here’s Bob:

This started with the maintenance of upgrading to the new Boost version 1.69, which is this pull request:

https://github.com/stan-dev/math/pull/1082

for this issue:

https://github.com/stan-dev/math/issues/1081

The issue happens first, then the pull request, then the fun of debugging starts.

Today’s story starts an issue from today [18 Dec 2018] reported by Daniel Lee, the relevant text of which is:

@bgoodri, it looks like the unit tests for integrate_1d is failing. It looks like the new version of Boost has different behavior then what was there before.

This is a new feature (1D integrator) and it already needs maintenance.

This issue popped up when we updated Boost 1.68 to Boost 1.69. Boost is one of only three C++ libraries we depend on, but we use it everywhere (the other two libraries are limited to matrix operations and solving ODEs). Boost has been through about 20 versions since we started the project—two or three times per year.

Among other reasons, we have to update Boost because we have to keep in synch with CRAN package BH (Boost headers) due to CRAN maximum package size limitations. We can’t distribute our own version of Boost so as to control the terms of when these maintenance events happen, but we’d have to keep updating anyway just to keep up with Boost’s bug fixes and new features, etc.

What does this mean in practical terms? Messages like the one above pop up. I get flagged, as does everyone else following the math lib issues. Someone has to create a GitHub issue, create a GitHub branch, debug the problem on the branch, create a GitHub pull request, get that GitHub pull request to pass tests on all platforms for continuous integration, get the code reviewed, make any updates required by code review and test again, then merge. This is all after the original issue and pull request to update Boost. That was just the maintenance that revealed the bug.

This is not a five minute job.

It’ll take one person-hour minimum with all the GitHub overhead and reviewing. And it’ll take something like a compute-day on our continuous integration servers if it passes the tests (less for failures). Debugging may take anywhere from 10 minutes to a day or maybe two in the extreme.

My point is just that the more things we have like integrate_1d, the more of these things come up. As a result, maintenance cost is quadratic in the number of features.

Bob summarizes:

It works like this:

Let’s suppose a maintenance event comes up every 2 months or so (e.g., a new version of Boost, a reorg of the repo, a new C++ version, etc.). For each maintenance event, the amount of maintenance we have to do is proportional to the number of features we have. So with linear growth in features, the work at successive events looks like 1 + 2 + 3 + …, and since we do maintenance at regular intervals, the total amount of time it takes is quadratic.
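In symbols: if the number of features grows roughly linearly, the k-th maintenance event touches on the order of k features, so the cumulative maintenance effort after n events is

$$
\sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; O(n^2).
$$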

This is why I’m always so reluctant to add features, especially when they have complicated dependencies.

That illusion where you think the other side is united and your side is diverse

Lots of people have written about this illusion of perspective: The people close to you look to be filled with individuality and diversity, while the people way over there in the other corner of the room all look kind of alike.

But widespread knowledge of this illusion does not stop people from succumbing to it. Here’s Michael Tomasky writing in the New York Times about what might happen if America had a proportional-representation voting system:

Let’s just imagine that we had a pure parliamentary system in which we elected our representatives by proportional representation, so that if a minor party’s candidates got 4 percent of the legislative votes, they’d win 4 percent of the seats. What might our party alignment look like?

He identifies six hypothetical parties: the center left, the socialist left, the green left, a party for ethnic and lifestyle minorities, a white nationalist party, and a center-right party. Thus, Tomasky continues:

If I’m right, the Democrats would split into four parties, and the Republicans into two, although the second one would be tiny. In other words: The Trump-era Republican Party already is in essence a parliamentary party. . . .

The Democrats, however, are an unruly bunch. . . . The Democrats will never be a party characterized by parliamentary discipline; unlike the Republicans, their constituencies are too heterogeneous.

When it comes to racial/ethnic diversity, sure, the two parties are much different, with Democrats being much more of a coalition of groups and the Republicans being overwhelmingly white. More generally, though, no, I don’t buy Tomasky’s argument. He’s a liberal Democrat, so from his perspective his side is full of different opinions and argumentation. But I think that a columnist coming from the opposite side of the political spectrum would see it the other way, noticing all the subtleties in the Republican position. Overall, the Democrats and Republicans each receive about 30% of the vote (with typically a slightly higher percentage for the Democrats), with the other 40% voting for other parties or, mostly, not voting at all. I don’t think it makes sense to say that one group of 30% could support four different parties with the other group of 30% only supporting two. Even though I can see how it would look like that from one side.

Gremlin time: “distant future, faraway lands, and remote probabilities”

Chris Wilson writes:

It appears that Richard Tol is still publishing these data, only now fitting a piecewise linear function to the same data-points.
https://academic.oup.com/reep/article/12/1/4/4804315#110883819

Also still looks like counting 0 as positive, “Moreover, the 11 estimates for warming of 2.5°C indicate that researchers disagree on the sign of the net impact: 3 estimates are positive and 8 are negative. Thus it is unclear whether climate change will lead to a net welfare gain or loss.”

This is a statistically mistaken thing for Tol to do, to use a distribution of point estimates to make a statement about what might happen. To put it another way: suppose all 11 estimates were negative. That alone would not mean that it would be clear that climate change would lead to a net welfare loss. Even setting aside that “welfare loss” is not, and can’t be, clearly defined, the 11 estimates can—indeed, should—be correlated.
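Here’s a toy simulation of that last point, with numbers I made up: if the published estimates share a common component (same data sources, same modeling traditions, authors building on each other), then even when the true effect is exactly zero it is not unusual for all of them to land on the same side of zero.

```python
import numpy as np

rng = np.random.default_rng(7)
true_effect = 0.0
n_estimates, n_sims = 11, 10_000

all_same_sign = 0
for _ in range(n_sims):
    shared = rng.normal(0, 1.0)                 # common bias shared by all studies
    noise = rng.normal(0, 0.5, n_estimates)     # study-specific noise
    estimates = true_effect + shared + noise
    if np.all(estimates > 0) or np.all(estimates < 0):
        all_same_sign += 1

print("fraction of simulations where all 11 estimates share a sign:",
      all_same_sign / n_sims)
# With independent estimates this would be about 2 * 0.5**11 (roughly 0.001);
# with a strong shared component it happens a large fraction of the time.
```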

Tol’s statement is also odd if you look at his graph:

As Wilson notes, even if you take that graph at face value (which I don’t think you should, for reasons we’ve discussed before on this blog), what you really have is 1 positive point, several points that are near zero (but one of those points corresponds to a projection of global cooling so it’s not relevant to this discussion), and several more points that are negative. And, as we’ve discussed earlier, all the positivity is being driven by one single point, which is Tol’s own earlier study.

Tol’s paper also says:

This review of estimates in the literature indicates that the impact of climate change on the economy and human welfare is likely to be limited, at least in the twenty-first century. . . . negative impacts will be substantially greater in poorer, hotter, and lower-lying countries . . . climate change would appear to be an important issue primarily for those who are concerned about the distant future, faraway lands, and remote probabilities.

I’m surprised to see this sort of statement in a scientific journal. “Faraway lands”?? Who talks like that? I looked up the journal description and found this:

The Review of Environmental Economics and Policy is the official journal of the Association of Environmental and Resource Economists and the European Association of Environmental and Resource Economists.

So I guess they are offering a specifically European perspective. Europe is mostly kinda cold, so global warming is mostly about faraway lands. Still seems kinda odd to me.

P.S. Check out the x-axis on the above graph. “Centigrade” . . . Wow—I didn’t know that anyone still used that term!

The Arkansas paradox

Palko writes:

I had a recent conversation with a friend back in Arkansas who gives me regular updates of the state and local news. A few days ago he told me about a poll that was getting a fair amount of coverage. (See also here, for example.) The poll showed that a number of progressive social issues like marriage equality were for the first time getting majority support in the state. This agrees with a great deal of anecdotal evidence I’ve observed which suggests a strange paradox in the state (and, I suspect, in much of the Bible Belt). We are seeing a simultaneous spike in tolerance and intolerance around the very same issues.

Don’t get me wrong. I’m not saying that a Russellville Arkansas has become a utopia of inclusiveness, but from a historical standpoint, the acceptance of people who are openly gay, or who are in an interracial relationship has never been higher in the area. At the same time, conservative media has achieved critical mass, racist and inflammatory rhetoric is at a 50 year high, and the reactionaries have gained full control of the government for the first time since at least the election of Sen. Fulbright.

Arkansas is getting redder in partisan terms while looking increasingly purple ideologically.

I’m not sure how to think about this one, so I’m bouncing it over to you, the readers.

Stan examples in Harezlak, Ruppert and Wand (2018) Semiparametric Regression with R

I saw earlier drafts of this when it was in preparation and they were great.

Jarek Harezlak, David Ruppert and Matt P. Wand. 2018. Semiparametric Regression with R. UseR! Series. Springer.

I particularly like the careful evaluation of variational approaches. I also very much like that it’s packed with visualizations and largely based on worked examples with real data and backed by working code. Oh, and there are also Stan examples.

Overview

From the authors:

Semiparametric Regression with R introduces the basic concepts of semiparametric regression and is focused on applications and the use of R software. Case studies are taken from environmental, economic, financial, medical and other areas of applications. The book contains more than 50 exercises. The HRW package that accompanies the book contains all of the scripts used in the book, as well as datasets and functions.

There’s a sample chapter linked from the book’s site. It’s the intro chapter with lots of examples.

R code

There’s a thorough site supporting the book with all the R code. R comes with its own warning label on the home page:

All of the examples and exercises in this book [Semiparametric Regression with R] depend on the R computing environment. However, since R is continually changing readers should regularly check the book’s News, Software Updates and Errata web-site.

You’ve got to respect the authors’ pragmatism and forthrightness. I’m pretty sure most of the backward-compatibility problems users experience in R come from contributed packages, not the language itself.

Background reading

The new book’s based on an earlier book by an overlapping set of authors:

D. Ruppert, M. P. Wand and R. J. Carroll. 2003. Semiparametric Regression. Cambridge University Press.

Cost and Format

First the good news. You can buy a pdf. I wish more authors and publishers had this as an option. I want to read everything in pdf format on my iPad.

Now the bad news. The pdf is US$89.00 or $29.95 per chapter. The softcover book is US$119.99. The printed book’s a bit less on Amazon at US$109.29 as of today. I wonder who works out the pennies in these prices.

Here’s the Springer page for the book in case you want a pdf.

Sometimes the Columbia library has these Springer books available to download a chapter at a time as pdfs. I’ll have to check about this one when I’ve logged back into the network.

Difference-in-difference estimators are a special case of lagged regression

Fan Li and Peng Ding write:

Difference-in-differences is a widely-used evaluation strategy that draws causal inference from observational panel data. Its causal identification relies on the assumption of parallel trend, which is scale dependent and may be questionable in some applications. A common alternative method is a regression model that adjusts for the lagged dependent variable, which rests on the assumption of ignorability conditional on past outcomes. In the context of linear models, Angrist and Pischke (2009) show that difference-in-differences and the lagged-dependent-variable regression estimates have a bracketing relationship. Namely, for a true positive effect, if ignorability is correct, then mistakenly assuming the parallel trend will overestimate the effect; in contrast, if the parallel trend is correct, then mistakenly assuming ignorability will underestimate the effect. We show that the same bracketing relationship holds in general nonparametric (model-free) settings without assuming either ignorability or parallel trend. We also extend the result to semiparametric estimation based on inverse probability weighting.

Li and Ding sent the paper to me because I wrote something on the topic a few years ago, under the title, Difference-in-difference estimators are a special case of lagged regression.
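To see the bracketing relationship in a toy example, here’s a small simulation (my own made-up data-generating process, not Li and Ding’s code): treatment assignment is ignorable given the lagged outcome and there is mean reversion, so the lagged regression recovers the true effect of 0.5 while difference-in-differences, which wrongly assumes parallel trends, overshoots it.

# Toy two-period panel. Treated units start lower at baseline, outcomes
# mean-revert (coefficient 0.5 on the lagged outcome), and the true treatment
# effect is 0.5. Ignorability given y0 holds by construction, so the
# lagged-dependent-variable regression is consistent here, while
# difference-in-differences overestimates the effect.
set.seed(123)
n <- 10000
treated <- rbinom(n, 1, 0.5)
y0 <- rnorm(n, mean = ifelse(treated == 1, -1, 0), sd = 1)   # baseline outcome
y1 <- 0.5 * y0 + 0.5 * treated + rnorm(n)                    # follow-up outcome

did_est <- mean(y1[treated == 1] - y0[treated == 1]) -
           mean(y1[treated == 0] - y0[treated == 0])
lag_est <- coef(lm(y1 ~ treated + y0))["treated"]

round(c(difference_in_differences = did_est,
        lagged_regression = unname(lag_est),
        true_effect = 0.5), 2)

Flip the data-generating process so that parallel trends holds instead and, per the result quoted above, the lagged regression will underestimate a true positive effect; that’s the other side of the bracket.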

P.S. Li and Ding’s paper has been updated, so I updated the link above.

Do regression structures affect research capital? The case of pronoun drop. (also an opportunity to quote Bertrand Russell: This is one of those views which are so absurd that only very learned men could possibly adopt them.)

A linguist pointed me with incredulity to this article by Horst Feldmann, “Do Linguistic Structures Affect Human Capital? The Case of Pronoun Drop,” which begins:

This paper empirically studies the human capital effects of grammatical rules that permit speakers to drop a personal pronoun when used as a subject of a sentence. By de‐emphasizing the significance of the individual, such languages may perpetuate ancient values and norms that give primacy to the collective, inducing governments and families to invest relatively little in education because education usually increases the individual’s independence from both the state and the family and may thus reduce the individual’s commitment to these institutions. Carrying out both an individual‐level and a country‐level analysis, the paper indeed finds negative effects of pronoun‐drop languages. The individual‐level analysis uses data on 114,894 individuals from 75 countries over 1999‐2014. It establishes that speakers of such languages have a lower probability of having completed secondary or tertiary education, compared with speakers of languages that do not allow pronoun drop. The country‐level analysis uses data from 101 countries over 1972‐2012. Consistent with the individual‐level analysis, it finds that countries where the dominant languages permit pronoun drop have lower secondary school enrollment rates. In both cases, the magnitude of the effect is substantial, particularly among females.

Another linguist saw this paper and asked if it was a prank.

I don’t think it’s a prank. I think it’s serious.

It would be easy, and indeed reasonable, to just laugh at this one and move on, to file it alongside other cross-country comparisons such as this—but I thought it could be instructive instead to take the paper seriously and see what went wrong.

I’m hoping these steps can be useful to students when trying to understand published research. Or, for that matter, when trying to understand their own regression.

So how can we figure out what’s really going on in this article?

To start with, the claimed effect is within-person (speaking a certain type of language affects your behavior) and within-country (speaking a certain type of language affects national values and norms), but all the data are observational and all the comparisons are between people and between countries. Thus, any causal interpretations are tenuous at best.

So we can start by rewriting the above abstract in descriptive terms. I’ll just repeat the empirical parts, and for convenience I’ll put my changes in bold:

This paper empirically studies the correlation of human capital with grammatical rules that permit speakers to drop a personal pronoun when used as a subject of a sentence. . . Carrying out both an individual‐level and a country‐level analysis, the paper indeed finds negative correlations of pronoun‐drop languages with outcomes of interest after adjusting for various demographic variables. . . . speakers of such languages have a lower probability of having completed secondary or tertiary education, compared with speakers of languages that do not allow pronoun drop. The country‐level analysis uses data from 101 countries over 1972‐2012. Consistent with the individual‐level analysis, it finds that countries where the dominant languages permit pronoun drop have lower secondary school enrollment rates. In both cases, the magnitude of the correlation is substantial, particularly among females.

OK, that helps a little.

Now we have to dig in a bit more. First, what’s a pronoun-drop language? Or, more to the point, which languages have pronoun drop and which don’t? I looked through the paper for a list of these languages or a map of where they are spoken. I didn’t see such a list or map, so I went to wikipedia and found this:

Among major languages, two of which might be called a pro-drop language are Japanese and Korean (featuring pronoun deletion not only for subjects, but for practically all grammatical contexts). Chinese, Slavic languages, and American Sign Language also exhibit frequent pro-drop features. In contrast, non-pro-drop is an areal feature of many northern European languages (see Standard Average European), including French, (standard) German, and English. . . . Most Romance languages (with the notable exception of French) are often categorised as pro-drop too, most of them only in the case of subject pronouns . . . Among the Indo-European and Dravidian languages of India, pro-drop is the general rule . . . Outside of northern Europe, most Niger–Congo languages, Khoisan languages of Southern Africa and Austronesian languages of the Western Pacific, pro-drop is the usual pattern in almost all linguistic regions of the world. . . . In many non-pro-drop Niger–Congo or Austronesian languages, like Igbo, Samoan and Fijian, however, subject pronouns do not occur in the same position as a nominal subject and are obligatory, even when the latter is present. . . .

Hmmmm, now things don’t seem so clear. Much will depend on how the languages are categorized.

The next thing we need, after we have a handle on the data, is a scatterplot. Actually a bunch of scatterplots. A scatterplot for each within-country analysis and a scatterplot for the between-country analysis. Outcome of interest on y-axis, predictor of interest on x-axis. OK, the within-country data will have to be plotted in a different way because the predictor and outcome are discrete, but something can be done there.

The point is, we need to see what’s going on. In the within-country analysis, where do we see this correlation and where do we not see it? In the between-country analysis, what countries are driving the correlation?
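For concreteness, here’s the kind of between-country plot I mean, sketched in base R with simulated placeholder data (nothing here comes from Feldmann’s dataset; the pro-drop coding and the rates are made up purely to show the layout):

# One point per country, labeled, so we can see which countries drive
# any between-country correlation. All values below are placeholders.
set.seed(1)
d <- data.frame(
  country    = paste("Country", 1:30),
  pro_drop   = rbinom(30, 1, 0.5),         # placeholder pro-drop coding
  enrollment = round(runif(30, 40, 100))   # placeholder enrollment rates
)
plot(jitter(d$pro_drop, amount = 0.05), d$enrollment,
     xaxt = "n", xlab = "Dominant language permits pronoun drop",
     ylab = "Secondary school enrollment rate (%)")
axis(1, at = c(0, 1), labels = c("no", "yes"))
text(d$pro_drop, d$enrollment, labels = d$country, pos = 4, cex = 0.6)

With the real data, the country labels would make it immediately clear whether any correlation is driven by a handful of countries or by regional clustering.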

Again, the analysis is all descriptive, and that’s fine, but the point is we need to understand what we’re describing.

I have no idea if the causal claims in this paper are true—given what I’ve seen so far, I see no particular reason to believe the claims. But, in any case, if these patterns are interesting—and I have no idea on that either—then they’re worth understanding. The regression won’t give us understanding; it just chews up the data and gives meaningless claims such as “we find that the magnitude of the effect is substantial and slightly larger for women. Specifically, women who speak a pronoun drop language are 9‐11 percentage points less likely to have completed secondary or tertiary education than women who speak a non‐pronoun drop language. For men, the probability is 8‐10 percentage points.” That way lies madness. We—Science—can do better.

P.S. I scrolled down to the end of the paper and found this sentence which begins the final footnote:

Pronoun drop rules are not perfect measures of ancient collectivism.

Ya think? In all seriousness, who could think that pronoun drop rules are any sort of measure of “ancient collectivism” at all? As Bertrand Russell said, this is one of those views which are so absurd that only very learned men could possibly adopt them.

Post-Hoc Power PubPeer Dumpster Fire

We’ve discussed this one before (original, polite response here; later response, after months of frustration, here), but it keeps on coming.

Latest version is this disaster of a paper which got shredded by a zillion commenters on PubPeer. There’s lots of incompetent stuff out there in the literature—that’s the way things go; statistics is hard—but, at some point, when enough people point out your error, I think it’s irresponsible to keep on with it. Ultimately our duty is to science, not to our individual careers.

13 Reasons not to trust that claim that 13 Reasons Why increased youth suicide rates

A journalist writes:

My eye was caught by this very popular story that broke yesterday — about a study that purported to find a 30 percent (!) increase in suicides, in kids 10-17, in the MONTH after a controversial show about suicide aired. And that increase apparently persisted for the rest of the year. It’s an observational study, but the hypothesis is that this show caused the bump.

This seems manifestly implausible to me (although huge, if true).

I wondered if you thought there might be something short to be written, quickly, about this study? Should we believe it? Are there obvious flaws, etc.?

I took a look and here’s my reply:

The AP article you cite has some problems in that it jumps around from numbers to rates, with the time scale shifting from one month to several months to five years. All the numbers seem like a moving target; it’s hard for me to track exactly what’s going on.

Looking at the paper, I see this one graph:

Just as a minor thing, it seems to me to be a poor choice to report suicides as 0.4 per 100,000 people. I think it would be easier to interpret as 4 per million people, for two reasons: (a) 4 is easier to understand than 0.4, and (b) million is easier for me to interpret than 100,000. For example, NYC has 8 million people so we’d expect 32 suicides in a month? Actually I think it would be even better for them to multiply by 12 and report annualized rates, thus 4 would become an annualized rate of 48 suicides per million people per year.
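Spelling out the arithmetic (just the conversions in the previous paragraph, nothing from the paper):

# Converting the reported monthly rate into more interpretable units.
rate_per_100k_per_month <- 0.4
rate_per_million_per_month <- rate_per_100k_per_month * 10     # 4 per million per month
rate_per_million_per_year <- rate_per_million_per_month * 12   # 48 per million per year
8e6 / 1e6 * rate_per_million_per_month                         # NYC: about 32 per month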

The statistical analysis is fine for what it is. The quasi-Poisson model is fine but it doesn’t really matter, either. The time series filtering model is probably fine; I guess I’d also like to see something simpler that estimates an offset for each of the 12 months of the year. But, again, I doubt it will make much of a difference in the results. I find the orange lines on the graph to be more distracting than helpful, but I can just ignore them and look at the blue dots. Instead of the jagged orange lines, they should plot the fitted seasonal + trend in orange, as that would give us a better sense of the comparison model with no jump.
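Here’s a minimal sketch of the simpler comparison model I mean, fit to simulated monthly data (not the authors’ dataset; every number below is a placeholder): a quasi-Poisson regression with a linear trend, a fixed offset for each of the 12 calendar months, and a post-release indicator.

# Simulated monthly counts with log population as an offset.
set.seed(42)
dates <- seq(as.Date("2013-01-01"), as.Date("2017-12-01"), by = "month")
d <- data.frame(
  trend        = seq_along(dates),
  month        = factor(format(dates, "%m")),
  post_release = as.numeric(dates >= as.Date("2017-04-01")),
  population   = 33e6                                  # placeholder 10-17 population
)
d$suicides <- rpois(nrow(d), lambda = 0.4 * d$population / 1e5)  # placeholder counts
fit <- glm(suicides ~ trend + month + post_release,
           family = quasipoisson, offset = log(population), data = d)
exp(coef(fit)["post_release"])   # estimated rate ratio for the post-release months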

Looking at the above graph, the key result is a jump from Feb-Mar-Apr 2017. Comparable sized jumps appear elsewhere in the dataset, for example Jul-Aug-Sep 2014, or Nov-Dec 2015, but Feb-Mar-Apr 2017 is more striking because the jump is happening during the spring, when we’d expect suicide rates to be dropping. On the other hand, Feb-Mar-Apr 2013 shows a steady increase. Not quite as high as Feb-Mar-Apr 2017 but a pretty big jump and in the same direction. Of course the authors could argue at this point that something may have happened in March 2013 to influence suicide rates, but that’s kinda the point: every month of every year, there’s _something_ going on.

I’m not quite sure what their conceptual model is. If the TV show causes suicides, does it cause people to commit suicide earlier than they otherwise would have, or are they assuming it will induce new suicides that otherwise never would’ve happened? I assume it would be a bit of both, but I don’t see this point discussed anywhere in the paper. You can see a big drop from Apr to Jul, but that happens in other years too.

Also this: “When the observed and forecasted rates of youth suicide were graphed based on the Holt-Winters analysis, there was a visible and statistically significant effect of the release of 13 Reasons Why on subsequent suicide (Figure 1), with observed rates in April, June, and December being significantly higher than corresponding rates forecasted using Holt-Winters modeling. Interestingly, the observed rate in the month of March (promotional period) is also statistically significantly higher than the model forecast.” They have a story for March and April. But June and December? No story for that; indeed the June and December results make me think that their story “proves too much,” as the saying goes. If 4 of the months of 2017 have elevated rates, this just suggests that suicide rates went up in 2017. Again, at the end of the paper, they write, “Suicide rates in two subsequent months remained elevated over forecasted rates, resulting in 195 additional deaths.” That’s just weird, to count June and December just because they’re higher than the model’s forecasts. Maybe it’s the model that has the problem, huh?
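To be clear about what that observed-vs-forecast exercise involves, here’s a rough sketch on simulated counts (not the authors’ data): fit a seasonal Holt-Winters model to the pre-release series, forecast forward with prediction intervals, and flag the observed months that land above the upper band. With noisy monthly counts, some months can poke above the band even when nothing special is going on, which is the worry.

# Sketch of the observed-vs-forecast comparison on simulated data.
set.seed(7)
pre <- ts(rpois(48, lambda = 130), start = c(2013, 1), frequency = 12)  # placeholder pre-release counts
fit <- HoltWinters(pre)                                    # level + trend + seasonal components
fc  <- predict(fit, n.ahead = 9, prediction.interval = TRUE, level = 0.95)
obs <- rpois(9, lambda = 130)                              # placeholder post-release counts
which(obs > fc[, "upr"])                                   # months flagged as "elevated"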

They also made a statistical error. In the abstract, they write: “these associations were restricted to boys.” But in the text, they say: “When analyses were stratified by sex, a statistically significant increase in suicide rates was observed for boys in keeping with overall results in the 10- to 17- year-old age group (IRR=1.35, 95% CI=1.12-1.64; Table 1 and Figure S1, available online). Although the mean monthly count and rate of suicide for female children and adolescents increased after the series’ release, the difference was not statistically significant (IRR=1.15; 95% CI=0.89-1.50), with no change in post-release trends (IRR=0.97, 95% CI=0.93-1.01). Observed suicide rates for 10- to 17-year-old girls in June, 2017 were significantly greater than corresponding forecasted rates, but observed rates in September were significantly lower than expected rates (Figure S1, available online).”

There are a bunch of weird things about this summary. First, as the saying goes, the difference between “significant” and “not significant” is not itself statistically significant, so it is an error for them to say “these associations were restricted to boys.” It’s bad news that they waste a paragraph on pages 9-10 explaining this bit of random noise. Second, who cares about the observed rates in September? At this point it seems like they’re just fishing through the data. Where did September come from?
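You can check that first point directly from the numbers they report: back out approximate standard errors on the log scale from the two confidence intervals and compare the estimates.

# The boys vs. girls comparison, using only the IRRs and 95% CIs quoted above.
se_from_ci <- function(lo, hi) (log(hi) - log(lo)) / (2 * qnorm(0.975))
log_irr_boys  <- log(1.35); se_boys  <- se_from_ci(1.12, 1.64)
log_irr_girls <- log(1.15); se_girls <- se_from_ci(0.89, 1.50)
z <- (log_irr_boys - log_irr_girls) / sqrt(se_boys^2 + se_girls^2)
2 * pnorm(-abs(z))   # roughly z = 1, p = 0.3: the boy-girl difference is consistent with noise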

Finally, this sentence in the last paragraph seems a bit over the top: “There is no discernible public health benefit associated with viewing the series.” I just had fried chicken for lunch. There is no discernible public health benefit of that either, my dude.

Look, I understand that suicide is a problem, it’s important for public health researchers to study it, and I don’t know enough about the topic to have a sense of whether the claim is plausible, that the airing of a TV show could cause a 29% increase in suicide rates. It seems like a big number to me, but I don’t really know. If it really caused 195 kids to die, that’s really sad. Their statistical analysis seems over-certain, though. First are the identification issues, which are mentioned in the research article: exposure to this show did not happen in isolation. Second is the data analysis: again, given that they found effects in June and December as well, it seems that their method is pretty sensitive to noise.

I can see the logic of the argument that a dramatization of suicide can encourage copycats; more generally, TV and movies have lots of dramatizations of bad behavior, including horrible crimes. I just don’t think there are any easy ways of cleanly estimating their effects with these sorts of aggregate statistical analyses, and this particular paper has some issues.

Reporters are wising up

I searched on the web and found a bunch of news articles. Most of these reports were uncritical, just reporting the mild caveats given by the study’s authors, but some were more searching, and I’d like to give them a shout-out:

Beth Mole in Ars Technica:

A study out this week suggests that the release of the first season of Netflix’s 13 Reasons Why series in 2017 led to a small but notable uptick in teen suicides. The finding seems to confirm widespread apprehensions among mental health experts and advocates that a suicide “contagion” could spread from the teen drama, which centers around a 17-year-old girl’s suicide and includes graphic details. But the study contains significant caveats, and the findings should be interpreted cautiously. . . .

In a press statement, co-author Lisa Horowitz, a clinical scientist at the National Institute of Mental Health, said that the finding “should raise awareness that young people are particularly vulnerable to the media. All disciplines, including the media, need to take good care to be constructive and thoughtful about topics that intersect with public health crises.”

While the point that care should be taken with regard to suicide should be duly noted, it’s still unclear just how vulnerable young people are to the show’s content. The study has significant caveats and limitations. And the overall field of research into the epidemiology of suicide is a bit murky. . . .

As Harvard psychologist Matthew K. Nock noted in an interview with The New York Times, “Suicide rates bounce around a lot more when the cell sizes are low, as they are with kids aged 10 to 17 years. So, this new paper suggests there may be an association between 13 Reasons Why and the suicide rate. However, we must always be cautious when trying to draw causal conclusions from correlational data.” . . . the authors reported finding a significant uptick in suicides in April—the month after the show’s release—but they also found them in June and December. It’s unclear how the show is linked to changes in those specific months. Moreover, the authors found a statistically significant increase in suicides in a fourth month—the month of March, which would be prior to the show’s release on March 31. The authors say this finding “raises questions about effects of pre-release media promotion of the series premiere.” However, it also raises questions about whether factors or events unrelated to the show may explain or contribute to the reported increase in suicide rates. . . .

Another odd wrinkle emerged from the data when the authors looked at the sex breakdown of those deaths. The statistically significant increase in suicides was entirely due to suicides in boys, not girls, as the researchers had hypothesized. . . . The sex finding flies in the face of some ideas of a “suicide contagion,” a term used by the authors of the new study and used generally by researchers to discuss the hypothetical contagiousness of suicide from events or media. . . . Overall, the research into 13 Reasons Why serves to highlight the complexity of suicide and suicide prevention—and also the murkiness of the research field that surrounds it.

Well put.

And Christie D’Zurilla did a good job in the LA Times, with a story entitled, “‘13 Reasons Why’ influenced the suicide rate? It’s not that simple.”

Also Chelsea Whyte’s appropriately skeptical take in New Scientist, “Did Netflix’s 13 Reasons Why really increase suicide rates?”, which presents an argument that this sort of TV show could actually reduce suicide rates.

It’s good to see that lots of reporters are getting the point, that statistical significance + identification strategy + published in a respected journal does not necessarily mean we have to believe it. Credulous journalists have been burned too many times, with studies of beauty and sex ratio, ESP, embodied cognition, ovulation and voting, himmicanes, air rage, etc.: now they’re starting to get the picture that you can’t always take these claims at face value. More recently there was the claim of elevated traffic accidents on 4/20 and lower homicide rates during the NRA convention—that last one was particularly ridiculous. The press is starting to wise up and no longer believe that “Disbelief is not an option . . . You have no choice but to accept that the major conclusions of these studies are true.”

It can be hard to talk about these things—suicide is an important topic, and who are we to question people who are trying to fight it?—but, as the above-linked news articles discuss, suicide is also complicated, and it’s not clear that we’re doing potential victims any favors by pushing simple stories.