
The intellectual explosion that didn’t happen

A few years ago, we discussed the book, “A Troublesome Inheritance: Genes, Race, and Human History,” by New York Times reporter Nicholas Wade.

Wade’s book was challenging to read and review because it makes lots of claims that are politically explosive and could be true but do not seem clearly proved given available data. There’s a temptation in reviewing such a book to either accept the claims as correct and move straight to the implications, or conversely to argue that the claims are false.

The way I put it was:

The paradox of racism is that at any given moment, the racism of the day seems reasonable and very possibly true, but the racism of the past always seems so ridiculous.

I reviewed Wade’s book for Slate, we discussed it on the blog, and then I further discussed on the sister blog the idea that racism is a framework, not a theory, and that its value, or anti-value, comes from it being a general toolkit which can be used to explain anything.

I recently came across a review essay on Wade’s book, by sociologist Philip Cohen from 2015, that made some interesting points, in particular addressing the political appeal of scientific racism.

Cohen quotes from a book review in the Wall Street Journal by conservative author Charles Murray, who wrote that the publication of “A Troublesome Inheritance” would “trigger an intellectual explosion the likes of which we haven’t seen for a few decades.”

This explosion did not happen.

Maybe one reason that Murray anticipated such an intellectual explosion is that this is what happened with his own book, “The Bell Curve,” back in 1994.

So Murray’s expectation was that A Troublesome Inheritance would be the new Bell Curve: Some people would love it, some would hate it, but everyone would have to reckon with it. That’s what happened with The Bell Curve, and also with Murray’s earlier book, Losing Ground. A Troublesome Inheritance was in many ways a follow-up to Murray’s two successful books, and it was written by a celebrated New York Times author, so it would seem like a natural candidate to get talked about.

Another comparison point is Jared Diamond’s “Guns, Germs, and Steel,” which, like Wade’s book, attempted to answer the question of why some countries are rich and some are poor. I’m guessing that a big part of Diamond’s success was his book’s title. His book is not so much about guns or steel, but damn that’s a good title. A Troublesome Inheritance, not so much.

So what happened? Why did Wade’s book not take off? It can’t just be the title, right? Nor can it simply be that Wade was suppressed by the forces of liberal political correctness. After all, those forces detested Murray’s books too.

Part of the difference is that The Bell Curve got a push within the established media, as it was promoted by the “even the liberal” New Republic. A Troublesome Inheritance got no such promotion or endorsement. But it’s hard for me to believe that’s the whole story either: for one thing, the later book was written by a longtime New York Times reporter, so “the call was coming from inside the house,” as it were. But it still didn’t catch on.

Another possibility is that Wade’s book was just ahead of its time, not scientifically speaking but politically speaking. In 2014, racism seemed a bit tired out and it did not seem to represent much of a political constituency. After 2016, with Donald Trump’s victory in the U.S. and the rise of neo-fascist parties in Europe, racism is much more of a live topic. If Wade’s book had come out last year, maybe it would be taken as a key to understanding the modern world, a book to be taken “seriously but not literally” etc. If the book had come out when racism was taken to represent an important political constituency, then there would’ve been a more serious attempt to understand its scientific justifications. At this point, though, the book is five years old so it’s less likely to trigger any intellectual explosions.

Anyway, the above is all just preamble to a pointer to Philip Cohen’s thoughtful article.

The latest Perry Preschool analysis: Noisy data + noisy methods + flexible summarizing = Big claims

Dean Eckles writes:

Since I know you’re interested in Heckman’s continued analysis of early childhood interventions, I thought I’d send this along: The intervention is so early, it is in their parents’ childhoods.

See the “Perry Preschool Project Outcomes in the Next Generation” press release and the associated working paper.

The estimated effects are huge:

In comparison to the children of those in the control group, Perry participants’ children are more than 30 percentage points less likely to have been suspended from school, about 20 percentage points more likely never to have been arrested or suspended, and over 30 percentage points more likely to have a high school diploma and to be employed.

The estimates are significant at the 10% level. Which may seem like quite weak evidence (perhaps it is), but actually the authors employ a quite conservative inferential approach that reflects their uncertainty about how the randomization actually occurred, as discussed in a related working paper.

My quick response is that using a noisy (also called “conservative”) measure and then finding p less than 0.10 does not constitute strong evidence. Indeed, the noisier (more “conservative”) the method, the less informative is any given significance level. This relates to the “What does not kill my statistical significance makes me stronger” fallacy that Eric Loken and I wrote about (and here’s our further discussion)—but only more so here, as the significance is at the 10% rather than the conventional 5% level.

In addition, I see lots and lots and lots of forking paths and researcher degrees of freedom in statements such as, “siblings, especially male siblings, who were already present but ineligible for the program when families began the intervention were more likely to graduate from high school and be employed than the siblings of those in the control group.”

Just like everyone else, I’m rooting for early childhood intervention to work wonders. The trouble is, there are lots and lots of interventions that people hope will work wonders. It’s hard to believe they all have such large effects as claimed. It’s also frustrating when people such as Heckman routinely report biased estimates (see further discussion here). They should know better. Or they should at least know enough to know that they don’t know better. Or someone close to them should explain it to them.

I’ll say this again because it’s such a big deal: If you have a noisy estimate (because of biased or noisy measurements, small sample size, inefficient (possibly for reasons of conservatism or robustness) estimation, or some combination of these reasons), this does not strengthen your evidence. It’s not appropriate to give extra credence to your significance level, or confidence interval, or other statement of uncertainty, based on the fact that your data collection or statistical inference are noisy.
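To make the point concrete, here is a minimal simulation (with hypothetical numbers: a true effect of 0.1 on some arbitrary scale) of the selection effect described above. If an estimate only gets highlighted when it clears a significance threshold, then the noisier the standard error, the more the surviving estimates exaggerate the truth:

```python
import numpy as np

rng = np.random.default_rng(0)

def exaggeration_ratio(se, true_effect=0.1, n_sims=200_000, z_crit=1.645):
    """Average |estimate| among results significant at the 10% level,
    expressed as a multiple of the true effect."""
    est = rng.normal(true_effect, se, n_sims)
    significant = np.abs(est) > z_crit * se  # two-sided test, 10% level
    return np.mean(np.abs(est[significant])) / true_effect

# A precise study exaggerates only modestly; a noisy one exaggerates wildly,
# because only the extreme draws clear the significance bar.
print(exaggeration_ratio(se=0.05))  # modest exaggeration
print(exaggeration_ratio(se=0.5))   # an order of magnitude too big
```

The noisy study and the precise study use the same significance rule; the difference is entirely in what kinds of estimates survive the filter.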

I’d say that I don’t think the claims in the above report would replicate—but given the time frame of any potential replication study, I don’t think replication will be tested one way or another, so a better way to put it is that I don’t think the estimates are at all accurate or reasonable.

But, hey, if you pick four point estimates to display, you get this:

That and favorable publicity will get you far.

P.S. Are we grinches for pointing out the flaws in poor arguments in favor of early childhood intervention? I don’t think so. Ultimately, our goal has to be to help these kids, not just to get stunning quotes to be used in PNAS articles, NPR stories, and TED talks. If the researchers in this area want to flat-out make the argument that exaggeration of effects serves a social good, that these programs are so important that it’s worth making big claims that aren’t supported by the data, then I’d like to hear them make this argument in public, for example in comments to this post. But I think what’s happening is more complicated. I think these eminent researchers really don’t understand the problems with noise, researcher degrees of freedom, and forking paths. I think they’ve fooled themselves into thinking that causal identification plus statistical significance equals truth. And they’re supported by an academic, media, and governmental superstructure that continues to affirm them. These guys have gotten where they are in life by not listening to naysayers, so why change the path now? This holds in economics and policy analysis, just as it does in evolutionary psychology, social psychology, and other murky research areas. And, as always, I’m not saying that all or even most researchers are stuck in this trap; just enough for it to pollute our discourse.

What makes me sad is not so much the prominent researchers who get stuck in this way, but the younger scholars who, through similar good intentions, follow along these mistaken paths. There’s often a default assumption that, as the expression goes, with all this poop, there must be a pony somewhere. In addition to all the wasted resources involved in sending people down blind alleys, and in addition to the statistical misconceptions leading to further noisy studies and further mistaken interpretations of data, this sort of default credulity crowds out stronger, more important work, perhaps work by some junior scholar that never gets published in a top 5 journal or whatever because it doesn’t have that B.S. hook.

Remember Gresham’s Law of bad science? Every minute you spend staring at some bad paper, trying to figure out reasons why what they did is actually correct, is a minute you didn’t spend looking at something more serious.

And, yes, I know that I’m giving attention to bad work here, I’m violating my own principles. But we can’t spend all our time writing code. We have to spend some time unit testing and, yes, debugging. I put a lot of effort into doing (what I consider to be) exemplary work, into developing and demonstrating good practices, and into teaching others how to do better. I think it’s also valuable to explore how things can go wrong.

Are the tabloids better than we give them credit for?

Joshua Vogelstein writes:

I noticed you disparage a number of journals quite frequently on your blog.
I wonder what metric you are using implicitly to make such evaluations?
Is it the number of articles that they publish that end up being bogus?
Or the fraction of articles that they publish that end up being bogus?
Or the fraction of articles that get through their review process that end up being bogus?
Or the number of articles that they publish that end up being bogus AND enough people read them and care about them to identify the problems in those articles.

My guess (without actually having any data), is that Nature, Science, and PNAS are the best journals when scored on the metric of fraction of bogus articles that pass through their review process. In other words, I bet all the other journals publish a larger fraction of the false claims that are sent to them than Nature, Science, or PNAS.

The only data I know on it is described here. According to the article, 62% of social-science articles in Science and Nature published from 2010 to 2015 replicated. An earlier paper from the same group found that 61% of papers from specialty journals published between 2011 and 2014 replicated.

I’d suspect that the fraction of articles on social sciences that pass the review criteria for Science and Nature is much smaller than that of the specialty journals, implying that the fraction of articles that get through peer review in Science and Nature that replicate is much higher than the specialty journals.

My reply: I’ve looked at no statistics on this at all. It’s my impression that social science articles in the tabloids (Science, Nature, PNAS) are, on average, worse than those in top subject-matter journals (American Political Science Review, American Sociological Review, American Journal of Sociology, etc.). But I don’t know.


A computer program can be completely correct, it can be correct except in some edge cases, it can be approximately correct, or it can be flat-out wrong.

A statistical model can be kind of ok but a little wrong, or it can be a lot wrong. Except in some rare cases, it can’t be correct.

An iterative computation such as a Stan fit can have approximately converged, or it can be far from convergence. Except in some rare cases, it will never completely converge.
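A sketch of the split-R-hat diagnostic makes the “approximately converged” idea concrete. This is a simplified version of the Gelman-Rubin statistic (Stan’s current diagnostic adds rank-normalization and other refinements): compare variance between split half-chains to variance within them, and treat values near 1 as “approximately converged.”

```python
import numpy as np

def split_rhat(chains):
    """Basic split R-hat for draws of shape (n_chains, n_draws).
    Values near 1 suggest the chains are mixing; values well above 1 do not."""
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    # Split each chain in two, so non-stationarity within a chain also shows up.
    splits = chains[:, :2 * half].reshape(2 * n_chains, half)
    chain_means = splits.mean(axis=1)
    chain_vars = splits.var(axis=1, ddof=1)
    W = chain_vars.mean()                # within-chain variance
    B = half * chain_means.var(ddof=1)   # between-chain variance
    var_plus = (half - 1) / half * W + B / half
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
mixed = rng.normal(0, 1, size=(4, 1000))            # chains exploring the same target
stuck = mixed + np.array([[0.], [0.], [0.], [3.]])  # one chain stuck elsewhere
print(split_rhat(mixed))  # near 1
print(split_rhat(stuck))  # well above 1
```

Note that even the well-mixed chains give an R-hat slightly above 1: the computation has approximately converged, never completely.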

Where are the famous dogs? Where are the famous animals?

We were having a conversation the other day about famous dogs. There are surprisingly few famous dogs. Then I realized it’s not just that. There are very few famous animals, period.

If you exclude racehorses and the pets of heads of state, these are all the famous animals we could think of:

dogs: Lassie, Rin Tin Tin, Balto
cats: Trim, Grumpy Cat, Morris the Cat
horses: Clever Hans, Traveller
sheep: Dolly
groundhogs: Punxsutawney Phil
octopuses: Paul
gorillas etc.: Harambe, also that chimp that learned sign language
dolphins: Flipper
cows: Mrs. O’Leary’s
lions: Cecil
elephants: Jumbo
dinosaurs: Sue

That’s only 18. 18! Or 19 if you count Dolly as 2. Just 18 or 19 from the entire animal kingdom. I’m sure we’re missing a few, but still. I wouldn’t have thought that there were so few famous animals (again, not counting racehorses and royal or presidential pets, which I’d consider to be special cases).

P.S. Fictional animals don’t count.

P.P.S. Lots of good suggestions in comments. The #1 missing item above is Laika. You don’t have to believe me on this, but we did discuss Laika in our conversation. It was just my bad to forget to include her when typing up the blog post.

From comments, some others in addition to Laika:

horses: Bucephalus, Incitatus, Mr. Ed
lions: Elsa
gorillas: Koko

Top 5 literary descriptions of poker

Yesterday I wrote about Pocket Kings by Ted Heller, which gives one of the most convincing literary descriptions of poker that I’ve ever read. (Much more so than all those books and articles where the author goes on expense account to compete at the World Series of Poker. I hope to never see that again.)

OK, here’s my list of the best literary descriptions of poker, starting at the top:

1. James Jones, From Here to Eternity. The best ever. An entirely convincing poker scene near the beginning drives the whole plot of this classic novel.

2. Dealer’s Choice, by Patrick Marber. Deemonds!

3. David Spanier, Total Poker. Lots of wonderful stories as well as some poker insight. He wrote some other books about poker that were not so interesting or readable.

4. Frank Wallace, Poker: A guaranteed income for life by using the advanced concepts of poker. I tracked this one down and read it after reading about it in Total Poker. Wallace’s book is pretty much devoid of any intentional literary merit, but I agree with Spanier that on its own terms it’s a kind of outsider-art masterpiece.

5. Ted Heller, Pocket Kings. See my review from yesterday.

That’s it. I can’t think of anything else I’ve read about poker that would be worth mentioning here. Lots of poker manuals which in some cases are well written but I would not say they are particularly interesting to read except for the poker content, and lots of books about poker by serious writers with poker scenes that do not seem at all insightful in any general way. So the above five, that’s all I have to offer.

Am I missing anything that’s worth including in the above list?

P.S. In my first version of this post, I forgot Dealer’s Choice. I added it after Phil reminded me.

Pocket Kings by Ted Heller

So. I’m most of the way through Pocket Kings by Ted Heller, author of the classic Slab Rat. And I keep thinking: Ted Heller is the same as Sam Lipsyte. Do these two guys know each other? They’re both sons of famous writers (OK, Heller’s dad is more famous than Lipsyte’s, but still). They write about the same character: a physically unattractive, mildly talented, borderline unethical shlub from New Jersey, a guy in his thirties or forties who goes through life powered by a witty resentment toward those who are more successful than him. A character who thinks a lot about his wife and about his friends his age, but never his parents or siblings. (A sort of opposite character from fellow Jerseyite Philip Roth / Nathan Zuckerman, whose characters tended to be attractive, suave, and eternally focused on the families of their childhoods. Indeed, the Heller/Lipsyte character is the sort of irritating pest who Roth/Zuckerman is always trying to shake off.)

It’s hard for me to see how Ted Heller and Sam Lipsyte can coexist in the same universe, but there you have it. One thing I don’t quite understand is the age difference: Lipsyte was born in 1968, which makes sense given the age of his characters, but Heller was born twelve years earlier, which makes him a decade or two older than the protagonist of Pocket Kings. That’s ok, of course—no requirement that an author write about people his or her own age—still, it’s a bit jarring to me to think about in the context of these particular authors, who seem so strongly identified with this particular character type.

One more thing. With their repeated discussions of failure, fear of failure, living with failure, etc., these books all seem to be about themselves, and their authors’ desire for success and fears of not succeeding.

Some works of art are about themselves. Vermeer making an incredibly detailed painting of a person doing some painstaking task. Titanic being the biggest movie of all time, about the biggest ship of all time. Primer being a low-budget, technically impressive movie about some people who build a low-budget time machine. Shakespeare with his characters talking about acting and plays. And the Heller/Lipsyte oeuvre.

I feel like a lot of these concerns are driven by economics. What with iphones and youtube and all these other entertainment options available, there’s not so much room for books. In Pocket Kings, Heller expresses lots of envy and resentment toward successful novelists such as Gary Shteyngart and everybody’s favorite punching bag, Jonathan Franzen—but, successful as these dudes are, I don’t see them as having the financial success or cultural influence of comparable authors in earlier generations. There’s less room at the top, or even at the middle.

And, as we’ve discussed before, it doesn’t help professional writers that there are people like me around, publishing my writing every day on the internet for free.

Back to Pocket Kings. It’s not a perfect book. The author pushes a bit hard on the jokes at times. But it’s readable, and it connects to some deep ideas—or, at least, ideas that resonate deeply with me.

It’s giving nothing away to say that the book’s main character plays online poker as an escape from his dead-end life, and then he’s living two parallel lives, which intersect in various ways. He’s two different people! But this is true of so many of us, in different ways. We play different roles at home and at work. And, for that matter, when we read a novel, we’re entering a different world. Reading about this character’s distorted life made me question my own preference for reading books and communicating asynchronously (for example, by blogging, which is the ultimate in asynchronous communication, as I’m writing this in August to appear in January). Face-to-face communication can take effort! There must be a reason that so many people seem to live inside their phones. In that sense, Pocket Kings, published in 2012, was ahead of its time.

Some Westlake quotes

Clint Johns writes:

I’m a regular visitor to your blog, so I thought you might be interested in this link. It’s a relatively recent article (from 7/12) about Donald Westlake and his long career. For my money, the best part of it is the generous number of Westlake quotations from all sorts of places, including interviews as well as his novels. There are lots of writers who can turn a phrase, but Westlake was in a class by himself (or maybe with just a few others).

The Westlake quotes are good, but my favorite for these sorts of quotes is still George V. Higgins.

Graphs of school shootings in the U.S.

Bert Gunter writes:

This link is to an online CNN “analysis” of school shootings in the U.S. I think it is a complete mess (you may disagree, of course).

The report in question is by Christina Walker and Sam Petulla.

Gunter lists two problems:

1. Graph labeled “Race Plays A Factor in When School Shootings Occur”:
AFAICT, they are graphing number of casualties vs. time of shooting. But they should be graphing the number of shootings vs time; in fact, as they should be comparing incident *rates* vs time by race, they should be graphing the proportion of each category of schools that have shooting incidents vs time (I of course ignore more formal statistical modeling, which would not be meaningful for a mass market without a good deal of explanatory work).

2. Graph of “Shootings at White Schools Have More Casualties”:
The area of the rectangles in the graph appears to be proportional to the casualties per incident but with both different lengths and widths, it is not possible to glean clear information by eye (for me anyway). And aside from the obvious huge 3 or 4 largest incidents in the White Majority schools, I do not see any notable differences by category. Paraphrasing Bill Cleveland, the graph is a puzzle to be deciphered: it appears to violate most of the principles of good graphics.

Moreover, it is not clear that casualties per incident is all that meaningful anyway. Maybe White schools involved in shootings just have more students so that it’s easier for a shooter to amass more casualties.

The “appropriate” analysis is: “Most school shootings everywhere involve 1 or 2 people, except for a handful of mass shootings at White schools. The graph is a deliberate attempt to mislead, not just merely bad.”

Unfortunately, as you are well aware, due to intense competition for viewer eyeballs, both formerly print-only (NYT, WSJ, etc.) and purely online news media are now full of such colorful, sometimes interactive, and increasingly animated data analyses whose quality is, ummm… rather uneven. So it’s impossible to discuss the statistical deficiencies and the possible political/sociological consequences of such mass-media data-analytical malfeasance in it all.

My reply:

I think the report is pretty good. Sure, some of the graphs don’t present data patterns so clearly, but as Antony Unwin and I wrote a few years ago, infovis and statistical graphics have different goals and different looks. In this case, I think these are the main messages being conveyed by these plots:
– There have been a lot of school shootings in the past decade.
– They’ve been happening all over the place, at all different times and to all different sorts of students.
– This report is based on real data that the researchers collected.
Indeed, at the bottom of the report they provide a link to the data on Github.

Regarding Gunter’s points 1 and 2 above, sure, there are other ways of analyzing and graphing the data. But (a) I don’t see why he says the graph is a deliberate attempt to mislead, and (b) I think the graphs are admirably transparent.

Consider for example the first two graphs in the report, here:

and here:

Both these graphs have issues, and there are places where I would’ve made different design choices. For example, I think the color scheme is confusing in that the same palette is used in two different ways, also I think it’s just wack to make three different graphs for early morning, daytime, and late afternoon and evening (and to compress the time scales for some of these). Also a mistake to compress Sat/Sun into one date: distorting the scale obscures the data. Instead, they could simply have rotated that second graph 90 degrees, running day of week down from Monday to Sunday on the vertical axis and time of day from 00:00 to 24:00 on the horizontal axis. One clean graph would then display all the shootings and their times.

The above graph has a problem that I see a lot in data graphics, and in statistical analysis more generally, which is that it is overdesigned. The breaking up into three graphs, the distortion of the hour and day scales, the extraneous colors (which convey no information, as time is already indicated by position on the plot) all just add confusion and make a simple story look more complicated.

So, sure, the graphs are not perfect. Which is no surprise. We all have deadlines. My own published graphs could be improved too.

The thing I really like about the graphs in Walker and Petulla’s report is that they are so clearly tied to the data. That’s important.

If someone were to do more about this, I think the next step would be to graph shootings and other violent crimes that occur outside of schools.

In Bayesian inference, do people cheat by rigging the prior?

Ulrich Atz writes in with a question:

A newcomer to Bayesian inference may argue that priors seem sooo subjective and can lead to any answer. There are many counter-arguments (e.g., it’s easier to cheat in other ways), but are there any pithy examples where scientists have abused the prior to get to the result they wanted? And if not, can we rely on this absence of evidence as evidence of absence?

I don’t know. It certainly could be possible to rig an analysis using a prior distribution, just as you can rig an analysis using data coding or exclusion rules, or by playing around with what variables are included in a least-squares regression. I don’t recall ever actually seeing this sort of cheatin’ Bayes, but maybe that’s just because Bayesian methods are not so commonly used.

I’d like to believe that in practice it’s harder to cheat using Bayesian methods because Bayesian methods are more transparent. If you cheat (or inadvertently cheat using forking paths) with data exclusion, coding, or subsetting, or setting up coefficients in a least squares regression, or deciding which “marginally significant” results to report, that can slip under the radar. But the prior distribution—that’s something everyone will notice. I could well imagine that the greater scrutiny attached to Bayesian methods makes it harder to cheat, at least in the obvious way by using a loaded prior.
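As a toy illustration of why a loaded prior is hard to hide, consider the conjugate normal-normal case (all numbers here are hypothetical). The posterior mean is just a precision-weighted average of the prior mean and the data, so rigging the answer requires a prior so tight that anyone reading the analysis will see it:

```python
import numpy as np

# Hypothetical study: observed effect estimate 0.2 with standard error 0.1.
y, se = 0.2, 0.1

def posterior(prior_mean, prior_sd):
    """Normal-normal conjugate posterior: precision-weighted average
    of prior mean and data, with combined precision."""
    prec = 1 / prior_sd**2 + 1 / se**2
    mean = (prior_mean / prior_sd**2 + y / se**2) / prec
    return mean, np.sqrt(1 / prec)

weak = posterior(0.0, 10.0)    # weak prior: posterior tracks the data
loaded = posterior(0.5, 0.01)  # "rigged" prior: posterior tracks the prior
print(weak)    # mean near the data, 0.2
print(loaded)  # mean dragged to the prior, near 0.5
```

The loaded prior does move the answer wherever you want, but it sits right there in the model statement for every reader to see, which is the transparency point above.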

American Causal Inference May 2020 Austin Texas

Carlos Carvalho writes:

The ACIC 2020 website is now up and registration is open.

As a reminder, proposals information can be found in the front page of the website.
Deadline for submissions is February 7th.

I think that we organized the very first conference in this series here at Columbia, many years ago!

Is it accurate to say, “Politicians Don’t Actually Care What Voters Want”?

Jonathan Weinstein writes:

This was a New York Times op-ed today, referring to this working paper. I found the pathologies of the paper to be worth an extended commentary, and wrote a possible blog entry, attached. I used to participate years ago in a shared blog at Northwestern, “Leisure of the Theory Class,” but nowadays I don’t have much of a platform for this.

The op-ed in question is by Joshua Kalla and Ethan Porter with title, “Politicians Don’t Actually Care What Voters Want,” and subtitle, “Does that statement sound too cynical? Unfortunately, the evidence supports it.” The working paper, by the same authors, is called, “Correcting Bias in Perceptions of Public Opinion Among American Elected Officials: Results from Two Field Experiments,” and begins:

While concerns about the public’s receptivity to factual information are widespread, much less attention has been paid to the factual receptivity, or lack thereof, of elected officials. Recent survey research has made clear that U.S. legislators and legislative staff systematically misperceive their constituents’ opinions on salient public policies. We report results from two field experiments designed to correct misperceptions of sitting U.S. legislators. The legislators (n=2,346) were invited to access a dashboard of constituent opinion generated using the 2016 Cooperative Congressional Election Study. Here we show that despite extensive outreach efforts, only 11% accessed the information. More troubling for democratic norms, legislators who accessed constituent opinion data were no more accurate at perceiving their constituents’ opinions. Our findings underscore the challenges confronting efforts to improve the accuracy of elected officials’ perceptions and suggest that elected officials may be more resistant to factual information than the mass public.

Weinstein’s criticism of the Kalla and Porter article is here, and this is Weinstein’s main point:

The study provided politicians with data on voters’ beliefs, and attempted to measure changes in the politicians’ perception of these beliefs. No significant effects were found. But there are always many possible explanations for null results! The sensational, headlined explanation defies common sense and contradicts other data in the paper itself, while other explanations are both intuitive and supported by the data.


The authors claim that the study is “well-powered,” suggesting an awareness of the issue, but they do not deal with it adequately, say by displaying confidence intervals and arguing that they prove the effect is small. It is certainly not obvious that a study in which only 55 of 2,346 potential subjects complied with all phases is actually well-powered.

My reaction to all this was, as the social scientists say, overdetermined. That is, the story had a bunch of features that might incline me to take one view or another:

1. Weinstein contacted me directly and said nice things about this blog. +1 for the criticism. A polite email shouldn’t matter, but it does.

2. Weinstein’s an economist, Kalla and Porter are political scientists and the topic of the research is politics. My starting point is to assume that economists know more about economics, political scientists know more about politics, sociologists know more about sociology. So +1 for the original paper.

3. On the substance, there’s some work by Lax and Phillips on congruence of political attitudes and legislative positions. The summary of this work is that public opinion does matter to legislators. So +1 for the criticism. On the other hand, public opinion is really hard to estimate. Surveys are noisy, there’s lots of conflicting information out there, and I could well believe that, in many cases, even if legislators would like to follow public opinion, it wouldn’t make sense for them to do much with it. So +1 for the original paper.

4. The sample size of 55, that seems like an issue, and I think we do have to worry about claims of null effects based on not seeing any clear pattern in noisy data. So +1 for the criticism.

5. The paper uses Mister P to estimate state-level opinion. +1 for the paper.

And . . . all the pluses balance out! I don’t know what I’m supposed to think!

Also, I don’t know any of these people—I don’t think that at the time of this writing [July 2019] I’ve ever even met them. None of this is personal. Actually, I think my reactions would be pretty similar even if I did know some of these people. I’m willing to criticize friends’ work and to praise the work of people I dislike or don’t know personally.

Anyway, my point in this digression is not that it’s appropriate to evaluate research claims based on these sorts of indirect arguments, which are really just one step above attitudes of the form, “Don’t trust that guy’s research, he’s from Cornell!”—but rather to recognize that it’s inevitable that we will have some reactions based on meta-data, and I think it’s better to recognize these quasi-Bayesian inferences that we are doing, even if for no better reason than to avoid over-weighting them when drawing our conclusions.

OK, back to the main story . . . With Weinstein’s permission, I sent his criticisms to Kalla and Porter, who replied to Weinstein’s 3-page criticism with a 3-page defense, which makes the following key point:

His criticisms of the paper, however, do not reflect exposure to relevant literature—literature that makes our results less surprising and our methods more defensible . . .

Since Miller and Stokes (1963), scholars have empirically studied whether elected officials know what policies their constituents want. Recent work in political science has found that there are systematic biases in elite perceptions that suggest many state legislators and congressional staffers do not have an accurate assessment of their constituents’ views on several key issues. . . . Hertel-Fernandez, Mildenberger and Stokes (2019) administer surveys on Congressional staff and come to the same conclusion. . . . elected officials substantially misperceive what their constituents want. The polling that does take place in American politics either is frequently devoid of any issue content (horserace polling) or is devised to develop messages to distract and manipulate the mass public, as documented in Druckman and Jacobs (2015). Contrary to Professor Weinstein’s description, our results are far from “bizarre,” given the state of the literature.

Regarding the small sample size and acceptance of the null, Kalla and Porter write:

Even given our limited sample size, we do believe that our study is sufficiently well-powered to demonstrate that this null is normatively and politically meaningful . . . our study was powered for a minimal detectable effect of a 7 percentage point reduction in misperceptions, where the baseline degree of misperception was 18 percentage points in the control condition.
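To get a rough sense of what that power statement implies, here's a back-of-the-envelope check of a two-proportion comparison using only the two numbers quoted above (an 18% baseline misperception rate and a 7-percentage-point minimal detectable effect). The per-arm sample sizes below are made-up placeholders, not Kalla and Porter's actual design, so treat this as a sketch of the calculation, not a reproduction of theirs:

```python
# Approximate power of a two-sided two-proportion z-test (normal approximation).
# p1, p2: the two proportions; n1, n2: per-arm sample sizes (hypothetical here).
from statistics import NormalDist

def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    nd = NormalDist()
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z_crit = nd.inv_cdf(1 - alpha / 2)
    z_eff = abs(p1 - p2) / se
    # Two-sided power: probability the test statistic clears the critical value
    return nd.cdf(z_eff - z_crit) + nd.cdf(-z_eff - z_crit)

# Baseline misperception of 18%, reduced by 7 points to 11%,
# with an invented 500 observations per arm:
print(round(power_two_proportions(0.18, 0.11, n1=500, n2=500), 2))
```

The point is just that "powered for a minimal detectable effect of 7 percentage points" is a statement about this kind of calculation; whether the study's actual units and clustering support it is a separate question.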

So there you have it. In summary:
Research article
Response to criticism

I appreciate the behavior of all the researchers here. Kalla and Porter put their work up on the web for all to read. Weinstein followed up with a pointed criticism: harsh, but thoughtful and detailed, touching on substance as well as method. Kalla and Porter used the criticism as a way to clarify issues in their paper.

What do I now think about the underlying issues? I’m not sure. Some of my answer would have to depend on the details of Kalla and Porter’s design and data, and I haven’t gone through all that in detail.

(To those of you who say that I should not discuss a paper that I’ve not read in full detail, I can only reply that this is a ridiculous position to take. We need to make judgments based on partial information. All. The. Time. And one of the services we provide on this blog is to model such uncertain reactions, to take seriously the problem of what conclusions should be drawn based on the information available to us, processed in available time using available effort.)

But I can offer some more general remarks on the substantive question given in the title of this post. My best take on this, given all the evidence I’ve seen, is that it makes sense for politicians to know where their voters stand on the issues, but that information typically isn’t readily available. At this point, you might ask why politicians don’t do more local polling on issues, and I don’t know—maybe they do—but one issue might be that, when it comes to national issues, you can use national polling and approximately adjust using known characteristics of the district compared to the country, based on geography, demographics, etc. Also, what’s typically relevant is not raw opinion but some sort of average, weighted by likelihood to vote, campaign contributions, and so forth.
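A minimal sketch of that adjustment, with entirely made-up numbers: take national opinion broken out by demographic cell, then reweight by the district's cell shares. The cells, support rates, and shares below are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical national poll: support for some policy, by demographic cell.
national_support = {
    ("college", "under50"): 0.62,
    ("college", "over50"): 0.48,
    ("no_college", "under50"): 0.51,
    ("no_college", "over50"): 0.39,
}
# Hypothetical district composition from census data (shares sum to 1).
district_shares = {
    ("college", "under50"): 0.15,
    ("college", "over50"): 0.20,
    ("no_college", "under50"): 0.25,
    ("no_college", "over50"): 0.40,
}
# District estimate: national cell-level opinion, reweighted to the district.
district_estimate = sum(national_support[c] * district_shares[c] for c in district_shares)
print(round(district_estimate, 3))
```

This is the crude, no-model version of the idea; a multilevel model over many cells (Mister P) is the more serious way to do it.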

I guess what I’m saying is that I don’t see a coherent story here yet. This is not meant as a criticism of Kalla and Porter, who must have a much better sense of the literature than I do, but rather to indicate a difficulty in how we think about the links between public opinion and legislator behavior. I don’t think it’s quite that “Politicians Don’t Actually Care What Voters Want”; it’s more that politicians don’t always have a good sense of what voters want, politicians aren’t always sure what they would do with that information if they had it, and whatever voters think they want is itself inherently unstable and does not always exist independent of framing. As Jacobs and Shapiro wrote, “politicians don’t pander.” They think of public opinion as a tool to get what they want, not as some fixed entity that they have to work around.

These last comments are somewhat independent of whatever was in Kalla and Porter’s study, which doesn’t make that study irrelevant to our thinking; it just implies that further work is needed to connect these experimental results to our larger story.

Call for proposals for a State Department project on estimating the prevalence of human trafficking

Abby Long points us to this call for proposals for a State Department project on estimating the prevalence of human trafficking:

The African Programming and Research Initiative to End Slavery (APRIES) is pleased to announce a funding opportunity available through a cooperative agreement with the U.S. Department of State, Office to Monitor and Combat Trafficking in Persons (J/TIP) with the following two aims:

1. To document the robustness of various methodological approaches in human trafficking prevalence research.
2. To identify and build the capacity of human trafficking teams in the design, testing, and dissemination of human trafficking prevalence data.

To achieve these aims, we are seeking strong research teams to apply at least two methods of estimating human trafficking prevalence in a selected hot spot and sector outside the United States.*

View the full call for proposals of the Prevalence Reduction Innovation Forum (PRIF).

Application deadline: March 4, 2020, 5:00 PM Eastern Standard Time. Please submit full proposals to (strongly preferred) with the subject line “PRIF proposal” or mail to the address indicated below by this deadline. Late submissions will not be accepted.

Dr. Lydia Aletraris, Project Coordinator
African Programming and Research Initiative to End Slavery
School of Social Work, Room 204
279 Williams Street,
Athens GA, 30602, USA

Award notification: April 2020

Questions prior to the deadline may be submitted via email to Use the subject line “PRIF questions.”

Award Amount: $200,000-$450,000. Only in exceptional circumstances might a higher budget be considered for funding.

Eligibility: Nonprofit organizations in or outside of the United States, including universities, other research organizations, NGOs, and INGOs, are eligible to apply. Government agencies and private entities are not eligible to apply.

Will decentralised collaboration increase the robustness of scientific findings in biomedical research? Some data and some causal questions.

Mark Tuttle points to this press release, “Decentralising science may lead to more reliable results: Analysis of data on tens of thousands of drug-gene interactions suggests that decentralised collaboration will increase the robustness of scientific findings in biomedical research,” and writes:

In my [Tuttle’s] opinion, the explanation is more likely to be sociological – group think and theory-driven observation – rather than methodological. Also, independent groups tend to be rivals, not collaborators, and thus will be more inclined to be critical; I have seen this kind of thing in action . . .

I replied that I suspect it could also be that it is the more generalizable results that outside labs are more interested in replicating in the first place. So the descriptive correlation could be valid without the causal conclusion being warranted.

The research article in question is called, “Meta-Research: Centralized scientific communities are less likely to generate replicable results,” by Valentin Danchev, Andrey Rzhetsky, and James Evans, and it reports:

Here we identify a large sample of published drug-gene interaction claims curated in the Comparative Toxicogenomics Database (for example, benzo(a)pyrene decreases expression of SLC22A3) and evaluate these claims by connecting them with high-throughput experiments from the LINCS L1000 program. Our sample included 60,159 supporting findings and 4253 opposing findings about 51,292 drug-gene interaction claims in 3363 scientific articles. We show that claims reported in a single paper replicate 19.0% (95% confidence interval [CI], 16.9–21.2%) more frequently than expected, while claims reported in multiple papers replicate 45.5% (95% CI, 21.8–74.2%) more frequently than expected. We also analyze the subsample of interactions with two or more published findings (2493 claims; 6272 supporting findings; 339 opposing findings; 1282 research articles), and show that centralized scientific communities, which use similar methods and involve shared authors who contribute to many articles, propagate less replicable claims than decentralized communities, which use more diverse methods and contain more independent teams. Our findings suggest how policies that foster decentralized collaboration will increase the robustness of scientific findings in biomedical research.

Seeing this, I’d like to separate the descriptive from the causal claims.

The first descriptive statement is that claims reported in multiple papers replicate more often than claims reported in single papers. I’ll buy that (at least in theory; I have not gone through the article in enough detail to understand what is meant by the “expected” rate of replication. Also, I can’t see how the numbers add up: if some claims replicate 19% more frequently than expected, and others replicate 45% more frequently than expected (let’s pass over the extra decimal place in “45.5%” etc. in polite silence), then some claims must replicate less frequently than expected, no? But every claim is reported in either a single paper or multiple papers, so I feel like I’m missing something. Again, that must be explained in the Danchev et al. article, and it’s not my main focus here.)

The second descriptive statement is that claims produced by centralized scientific communities replicate less well than claims produced by decentralized communities. Again, I’ll assume the researchers did a good analysis here and that this descriptive statement is valid for their data and will generalize to other research in biomedicine.

Finally, the causal statement is that “policies that foster decentralized collaboration will increase the robustness of scientific findings in biomedical research.” I can see that this causal statement is consistent with the descriptive findings, but I don’t see it as implied by them. It seems to me that if you want to make this sort of causal statement, you need to do an experimental or observational study. I’m assuming experimental data on this question aren’t available, so you’ll want to do an observational study, comparing results under different practices, and adjusting for pre-treatment differences between exposed and unexposed groups. But it seems that they just did a straight comparison, and that seems subject to selection bias.
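To make the selection-bias worry concrete, here's a toy simulation. All quantities are invented, not taken from Danchev et al.'s data: a hidden "maturity" variable drives both whether a claim ends up studied by a decentralized community and whether it replicates. A straight comparison of groups then shows a gap even though community structure has no causal effect, and comparing within a narrow stratum of the confounder makes the gap mostly disappear:

```python
# Toy illustration of confounding in a straight group comparison.
# "maturity" is a hidden confounder; decentralization has NO causal effect here.
import random
random.seed(1)

n = 20000
records = []
for _ in range(n):
    maturity = random.random()                  # hidden confounder
    decentralized = random.random() < maturity  # mature topics attract outside labs
    p_replicate = 0.3 + 0.4 * maturity          # maturity, not structure, drives replication
    replicated = random.random() < p_replicate
    records.append((decentralized, maturity, replicated))

def rate(rows):
    rows = list(rows)
    return sum(rep for _, _, rep in rows) / len(rows)

# Naive comparison: decentralized claims look more replicable.
naive_gap = rate(r for r in records if r[0]) - rate(r for r in records if not r[0])
# Within a narrow stratum of the confounder, the gap shrinks toward zero.
stratum = [r for r in records if 0.45 < r[1] < 0.55]
adj_gap = rate(r for r in stratum if r[0]) - rate(r for r in stratum if not r[0])
print(round(naive_gap, 3), round(adj_gap, 3))
```

None of this says that Danchev et al.'s conclusion is wrong, only that the descriptive comparison alone can't distinguish the causal story from this kind of selection story.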

I contacted the authors of the article regarding my concern that the descriptive correlation could be valid without the causal conclusion being warranted, and they replied as follows:

James Evans:

It is most certainly the case that more generalizable findings are more likely to be repeated and published by someone (within or without author-cluster)—which we detail in our most recent version of the paper, but I do not believe that it is the case that outside labs publish on those that generalize; but that inside labs tend to agree with former findings even if they are wrong (eg., even if the finding is in the opposite direction). It seems much more likely that the same labs refuse to publish contrary findings than that outside labs magically know before they have begun their studies what the right claims are to study.

Valentin Danchev:

How are generalizable results defined seems to be a key here. In the paper, we defined generalizability across experimental settings in L1000 but if the view is that outside labs select results known to be generalizable at the time of study, then generalizability should come from the research literature (Gen1). But from this definition, clustered publications with shared authors appear to provide the most generalizable results, which are virtually always confirmed (0.989). However, these may not be the kind of generalizable results we can rely on as when we add another layer of generalizability (Gen 2) — confirmation of published results, i.e. matching effect direction, in L1000 experiments — results from those clustered publications are less likely to be confirmed. Note that if results remain in clustered publications simply because they were not selected by outside labs due to non-generalizability, without any relation to author centralization or overlap, then we should expect conflicting papers about those non-generalized results, but, again, we found this not to be the case — results in clustered papers are more confirmatory while less likely to replicate (i.e. match the direction) in L1000.

As James mentioned, the more likely a published result is confirmed in L1000 experiments (Gen2), the more likely this result is to be published multiple times, whereas non-confirmed results are more likely to be published once (and probably tested further but put in the ‘file drawer’). This does support a view that generalizable results are learned over time. But without locating centralized or overlapping groups, some results would likely turn out to be false positives. Hence, for outside labs to establish which results are actually generalizable, a few independent publications are needed in the first place; once established, many more outside labs are indeed likely to further consider those results. I have no firm knowledge about the relationship between generalizable results and outside labs, but I would not be surprised if they correlate in the long run when results are globally known to be generalizable, with the caveat that independent labs appear to initially be a condition for establishing generalizability.

Overall, I think, some of our findings – results generalizable in L1000 as well as results supported in multiple publications are both more likely to replicate – do suggest a process of self-correction in which the community learns what works and what does not work, not necessarily orthogonal to the observation that outside labs would select generalizable results (if/when known), but connectivity also plays a role as it can foster or impede self-correction and learning. Of course, as one of the reviewers suggested, the findings are of associational rather than of causal nature.

Interesting. My concerns were generic involving the statistics of causal inference. The responses are focused on the specific questions they are studying, and it would be hard for me to try to evaluate these arguments without putting in some extra effort to understand these details.

Steven Pinker on torture

I’ve recently been thinking about that expression, “A liberal is a conservative who’s been arrested.”

Linguist and public intellectual Steven Pinker got into some trouble recently when it turned out that he’d been offering expert advice to the legal team of now-disgraced financier Jeffrey Epstein.

I would not condemn Pinker for this. After all, everybody deserves a fair trial. Also I agree with the statement here that it’s not fair to tar Pinker with guilt by association just because he met Epstein a few times and had friends who were friends with Epstein. We’re all of us two or three links from some pretty bad stuff. I have friends who’ve done things they shouldn’t have, and I’d be surprised if I didn’t have friends of friends who’ve committed some horrible crimes. Social networks are large. (We’ve estimated that the average American knows 750 people. If each of them knows roughly that many people, then, well, do the math.)
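Doing the math from that parenthetical (ignoring overlap between social circles, which is a big simplification, so these are upper bounds):

```python
# If each person knows roughly 750 others, the reachable network grows fast.
# Overlap among acquaintances is ignored, so these counts are upper bounds.
k = 750
direct = k            # acquaintances
two_steps = k * k     # friends of friends
three_steps = k ** 3  # friends of friends of friends
print(direct, two_steps, three_steps)  # 750 562500 421875000
```

Even heavily discounted for overlap, three steps plausibly reaches a population-scale number of people, which is the point: nearly everyone is a few links from something bad.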

Pinker has come up before on this blog, notably in this 2007 post regarding a newspaper article he wrote, “In defense of dangerous ideas: In every age, taboo questions raise our blood pressure and threaten moral panic. But we cannot be afraid to answer them.” That article contained a long list of ideas labeled as dangerous. At the time, I questioned how dangerous some of those ideas really were, and Pinker replied to my questions. You can go read all that and make your own call on how the discussion holds up, twelve years later.

But the “In defense of dangerous ideas” idea of Pinker that I want to focus on here is this one:

Would damage from terrorism be reduced if the police could torture suspects in special circumstances?

Recall Pinker’s statement:

In every age, taboo questions raise our blood pressure and threaten moral panic. But we cannot be afraid to answer them.

One thing we do know, though, is that the U.S. government supported torture of terrorist suspects back in 2007 when that article came out, and torture is supported by the current U.S. president and also Pinker’s friend at Harvard Law School. So I guess it’s not such a taboo question after all.

I googled *steven pinker torture* to see if he’d written anything on the topic since then, and I found this from 2012 or so:

Question: You say that cruel punishments and slavery have been abolished. But torture was practiced by the United States during the Bush administration, and human trafficking still takes place in many countries.

Response by Pinker: There is an enormous difference between a clandestine, illegal, and universally decried practice in a few parts of the world and an open, institutionalized, and universally approved practice everywhere in the world. Human trafficking, as terrible as it is, cannot be compared to the African slave trade (see pp. 157–188), nor can the recent harsh interrogation of terrorist suspects to extract information, as indefensible as it was, be compared to millennia of sadistic torture all over the world for punishment and entertainment (see pp. 130-132 and 144–149). In understanding the history of violence, one has to make distinctions among levels of horror.

This seems to contradict Pinker’s “In defense of dangerous ideas” position. In 2007 he was defending the idea that torture would save lives—not that he necessarily agreed with the point, but he thought it worth discussing. But a few years later he was arguing that modern torture is “indefensible” and not such a big deal because it is “clandestine, illegal, and universally decried” and only occurring “in a few parts of the world” (unfortunately, the U.S. is one of those few parts of the world; also, I don’t think it’s at all accurate to describe torture as universally decried, but no need to get into that here, as the point is pretty obvious, and I guess Pinker was just indulging in some rhetorical overreach).

So what’s the take on torture? Is it indefensible and universally decried, or is it an idea worth discussing, supported by a large chunk of our national political leadership?

One argument that’s sometimes been made in favor of torturing terrorist suspects is that terrorism is a new danger, unique in modern times. But I don’t think Pinker would make that argument, as he’s on record as saying that we live in “an unusually peaceful time” and that terrorism is a hazard that “most years kills fewer people than bee stings and lightning strikes.”

It kinda makes you wonder if he’d support police torture of beekeepers, or manufacturers of faulty lightning rods. Only in special circumstances, I’m sure.

Just to be clear: I agree with Pinker that less torture is better than more torture, and I’m not equating current U.S. military practices with the Spanish inquisition.

OK, so what’s the connection to Jeffrey Epstein?

Two things.

First, torture may well be “indefensible” in any case, but I think we can all agree that it’s particularly horrible if, when instituted as part of an anti-terror program, it ends up administered to someone who isn’t actually a terrorist. From news reports, it seems that happens sometimes. In Pinker’s 2007 “In defense of dangerous ideas” article, there’s the suggestion that torture of terrorism suspects could be ok—at least, worth discussing—without so much concern that terrorist suspects might be guilty of nothing more than association. The problems with guilt-by-association could well be clearer now to Pinker.

The second connection is that Pinker’s “In defense of dangerous ideas” article also appeared as the preface to a book called “What’s Your Dangerous Idea?”, the 2007 edition of a set of collections of short essays published by the Edge Foundation. This 2007 volume did not include any contributions from Jeffrey Epstein, as he was consumed with legal troubles at the time, but the now-famous financier returned the next year as one of “the intellectual elite, the brains the rest of us rely on to make sense of the universe and answer the big questions” to contribute to the volume, “What have you changed your mind about? Why?”

My point in making this connection is not to tar Pinker with guilt by association (or, for that matter, to imply that Epstein ever supported the idea of torture). My point is that your perspective on legal actions will appear different, depending on which end of the telescope you’re looking into. If you’re talking about an unnamed terrorist suspect, maybe you think that torture is a bold and dangerous taboo idea worth defending; if the person accused is a friend of a friend, they get legal advice.

I’m sure that most of us, myself included, behave this way to some extent. When it comes to intellectual debate, I can be as hard on my friends, and on myself (or here), as I am on acquaintances or people I don’t know at all. But when it comes to people actually getting hurt, then, sure, it’s a lot easier to summon empathy for people who are close to me, and it’s a lot easier to apply some principle of loyalty to friends of friends. Clashes between loyalty and other principles: that can be the stuff of tragedy.

So, again, the point of this post is not to trap Pinker in some sort of gotcha. It’s fine with me that his position on terrorism and torture changed between 2007 and 2012. In 2007, the World Trade Center attacks were still fresh in everyone’s mind, but by 2012, Pinker had been spending years reflecting on the global decline of violence, so he was less willing to consider torture of terrorism suspects as a police tactic worth defending. Fair enough. If you’re an Ivy League pundit and you want to support police torture as an idea worth defending, you have to think that the benefits are greater than the damage—while taking account of the fact that lots of the damage would be happening to people you have no connection to, people who might not be clients of your law-professor friends.

And then there’s that line, “A liberal is a conservative who’s been arrested.” Due process and not tarring people based on guilt by association: that’s a principle that applies to associates of sex traffickers and also associates of terrorists.

Studying incoherence

The above might come off as anti-Pinker, but that’s not where I’m going here. So let me explain.

Coherence of beliefs and attitudes is an ideal or norm that should not be achievable in practice. We shouldn’t expect complete coherence among someone’s beliefs, for the same reason we shouldn’t expect complete coherence in the actions of a complex organization such as a corporation or a government.

To put it another way: Consider a house that’s been lived in for a few decades, long enough that it’s had various weather events, changes in occupancy, and other things that have required repainting of various rooms at various times. The colors of the different rooms won’t be quite in sync. Similarly with an economy: different products come out at different times, prices keep changing, and there never will be complete coherence. The point of noting an incoherence in someone’s views is not to say, “Snap! You’re a hypocrite!”, but rather: That’s interesting; let’s juxtapose these views and see if we can understand what’s going on.

We learn through anomalies. That’s what posterior predictive checking is all about, and that’s what stories are all about. The anomalies of Pinker—a prominent libertarian who’s open to the idea of police torture, a supporter of due process for sex trafficking suspects but not for terrorism suspects, etc.—are interesting in giving a sense of the contradictions in certain contrarian political perspectives.

P.S. Speaking of the Spanish inquisition, my googling turned up this Pinker quote from 2019:

Changing sensibilities: In 1988, I enjoyed A Fish Called Wanda. 30 years later, I found it cringeworthy: We’re supposed to laugh at stuttering, torture, cruelty to animals, the anguish of those who care, & a women using her sexuality strategically. Not un-PC, just un-funny.

I’m surprised that the torture bothered him so much. What ever happened to defending dangerous ideas??? Maybe the problem is that the torture in the movie wasn’t being done by the police, and it wasn’t being done on a suspected terrorist.

I thought A Fish Called Wanda was hilarious in 1988 and I happened to see it again last year, and I found it hilarious the second time as well. Not cringeworthy at all. My favorite scene was when Kevin Kline flipped the gun to himself as he was walking through the security gate at the airport. And that was just one of so many memorable scenes. I’d call it a modern classic, except that I guess 1988 isn’t really “modern” anymore.

I wonder what Pinker thinks about The Bad News Bears? I loved that one too!

P.P.S. I also came across this thoughtful and approximately 60% positive review from sociologist Claude Fischer of Pinker’s book, “The Better Angels of Our Nature.” I highly recommend Fischer’s review. Indeed, I think you’re better off reading Fischer’s review than my post above.

Hey—the New York Times is hiring an election forecaster!

Chris Wiggins points us to this job opening:

Staff Editor – Statistical Modeling

The New York Times is looking to increase its capacity for statistical projects in the newsroom, especially around the 2020 election.

You will help produce statistical forecasts for election nights, as part of The Times’s ambitious election results operation. That operation is responsible for designing, building and delivering live results to a large national audience.

You may also make technical contributions to our original polling or be part of other data projects in the newsroom, like the dialect quiz or rent-buy calculators.

This is a collaborative role. You will work with reporters, developers, and graphics editors, occasionally in high-pressure situations, and at odd hours.

We are R users. As a candidate, you should demonstrate excellence in R and data management, especially in production situations. Candidates should also demonstrate an understanding of statistical models, and an imagination for how these models could fail. Your work should reflect meticulous attention to detail. And you should enjoy exploring data.

As part of your cover letter, please describe or link to an example of a statistical model you’ve created. Please also describe any reporting, development or data visualization skills you may have.


– Familiarity with statistical modeling
– Expertise with R
– Experience with production-level code
– Interest in covering elections and politics
– Familiarity with JavaScript is highly desirable
– Familiarity with election returns, voter files, or polling is desirable

They forgot to mention “familiarity with Mister P” and “familiarity with Stan” here, but I’m sure that’s just an oversight.

How to get out of the credulity rut (regression discontinuity edition): Getting beyond whack-a-mole

This one’s buggin me.

We’re in a situation now with forking paths in applied-statistics-being-done-by-economists where we were, about ten years ago, in applied-statistics-being-done-by-psychologists. (I was going to use the terms “econometrics” and “psychometrics” here, but that’s not quite right, because I think these mistakes are mostly being made by applied researchers in economics and psychology, but not so much by actual econometricians and psychometricians.)

It goes like this. There’s a natural experiment, where some people get the treatment or exposure and some people don’t. At this point, you can do an observational study: start by comparing the average outcomes in the treated and control group, then do statistical adjustment for pre-treatment differences between groups. This is all fine. Resulting inferences will be model-dependent, but there’s no way around it. You report your results, recognize your uncertainty, and go forward.

That’s what should happen. Instead, what often happens is that researchers push that big button on their computer labeled REGRESSION DISCONTINUITY ANALYSIS, which does two bad things: First, it points them toward an analysis that focuses obsessively on adjusting for just one pre-treatment variable, often a relatively unimportant variable, while insufficiently adjusting for other differences between treatment and control groups. Second, it leads to an overconfidence borne from the slogan, “causal identification,” which leads researchers, reviewers, and outsiders to think that the analysis has some special truth value.

What we typically have is a noisy, untrustworthy estimate of a causal effect, presented with little to no sense of the statistical challenges of observational research. And, for the usual “garden of forking paths” reason, the result will typically be “statistically significant,” and, for the usual “statistical significance filter” reason, the resulting estimate will be large and newsworthy.

Then the result appears in the news media, often reported entirely uncritically or with minimal caveats (“while it’s too hasty to draw sweeping conclusions on the basis of one study,” etc.).

And then someone points me with alarm to the news report, and I read the study, and sometimes it’s just fine but often it has the major problems listed above. And then I post something on the study, and sometime between then and six months in the future there is a discussion, where most of the commenters agree with me (selection bias!) and some commenters ask some questions such as, But doesn’t the paper have a robustness study? (Yes, but this doesn’t address the real issues because all the studies in the robustness analysis are flawed in a similar way as the original study) and, But regression discontinuity analysis is OK, right? (Sometimes, but ultimately you have to think of such problems as observational studies, and all the RD in the world won’t solve your problem if there are systematic differences between treatment and control groups that are not explained by the forcing variable) and, But didn’t they do a placebo control analysis that found no effect? (Yes, but this doesn’t address the concern that the statistically-significant main finding arose from forking paths, and there are forking paths in the choice of placebo study too, also the difference between statistically significant and non-significant is not itself . . . ok, I guess you know where I’m heading here), and so on.

These questions are ok. I mean, it’s a little exhausting seeing them every time, but it’s good practice for me to give the answers.

No, the problem I see is outside this blog, where journalists and, unfortunately, many economists, have the inclination to accept these analyses as correct by default.

It’s whack-a-mole. What’s happening is that researchers are using a fundamentally flawed statistical approach, and if you look carefully you’ll find the problems, but the specific problem can look different in each case.

With the air-pollution-in-China example, the warning signs were the fifth-degree polynomial (obviously ridiculous from a numerical analysis perspective—Neumann is spinning in his grave!—but it took us a few years to explain this to the economics profession) and the city with the 91-year life expectancy (which apparently would’ve been 96 years had it been in the control group). With the air-filters-in-schools example, the warning sign was that there was apparently no difference between treatment and control groups in the raw data; the only way that any result could be obtained was through some questionable analysis. With the unions-and-stock-prices example, uh, yeah, just about everything there was bad, but it got some publicity nonetheless because it told a political story that people wanted to hear. Other examples show other problems. But one problem with whack-a-mole is that the mole keeps popping up in different places. For example, if example #1 teaches you to avoid high-degree polynomials, you might think that example #2 is OK because it uses a straight-line adjustment. But it’s not.
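The high-degree-polynomial problem is easy to demonstrate by simulation. Here's a quick sketch of my own (illustrative only, not taken from any of the papers under discussion): generate pure noise with no treatment effect at all, fit separate polynomials on each side of a cutoff, and watch the average estimated "jump" at the boundary grow with the degree of the polynomial.

```python
# Toy simulation (my own, illustrative only): there is NO true discontinuity,
# yet flexible polynomial fits manufacture one at the boundary.
import numpy as np

rng = np.random.default_rng(0)

def rd_jump_estimate(degree, n=200, n_sims=500):
    """Average absolute estimated jump at the cutoff when the true jump is zero."""
    jumps = []
    for _ in range(n_sims):
        x = rng.uniform(-1, 1, n)        # forcing variable, cutoff at 0
        y = rng.normal(0, 1, n)          # pure noise: no treatment effect
        left, right = x < 0, x >= 0
        fit_l = np.polynomial.Polynomial.fit(x[left], y[left], degree)
        fit_r = np.polynomial.Polynomial.fit(x[right], y[right], degree)
        jumps.append(abs(fit_r(0.0) - fit_l(0.0)))
    return float(np.mean(jumps))

for degree in [1, 3, 5]:
    print(degree, round(rd_jump_estimate(degree), 2))
```

The exact numbers depend on the simulation settings, but the pattern is stable: the fifth-degree fit produces a noticeably larger spurious jump than the linear fit, because polynomial fits are at their noisiest at the edges of the data, and the edge is exactly where the discontinuity is evaluated.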

So what’s happening is that, first, we get lost in the details and, second, you get default-credulous economists and economics journalists needing to be convinced, each time, of the problems in each particular robustness study, placebo check, etc.

One thing that all those examples have in common is that if you just look at the RD plot straight, removing all econometric ideology, it’s pretty clear that overfitting is going on:

In every case, the discontinuity jumps out only because it’s been set against an artifactual trend going the other direction. In short: an observed difference close to zero is magnified into something big by means of a spurious adjustment. It can go the other way too—an overfitted adjustment used to knock out a real difference—but I guess we’d be less likely to see that, as researchers are motivated to find large and statistically significant effects. Again, all things are possible, but it is striking that if you just look at the raw data you don’t see anything: this particular statistical analysis is required to make the gap appear.

And, the true sign of ideological blinders: the authors put these graphs in their own articles without seeing the problems.

Good design, bad estimate

Let me be clear here. There’s good and bad.

The good is “regression discontinuity,” in the sense of a natural experiment that allows comparison of exposed and control groups, where there is a sharp rule for who gets exposed and who gets the control: That’s great. It gives you causal identification in the sense of not having to worry about selection bias: you know the treatment assignment rule.

The bad is “regression discontinuity,” in the sense of a statistical analysis that focuses on modeling of the forcing variable with no serious struggle with the underlying observational study problem.

So, yes, it’s reasonable that economists, policy analysts, and journalists like to analyze and write about natural experiments: this really can be a good way of learning about the world. But this learning is not automatic. It requires adjustment for systematic differences between exposed and control groups—which cannot in general be done by monkeying with the forcing variable. Monkeying with the forcing variable can, however, facilitate the task of coming up with a statistically significant coefficient on the discontinuity, so there’s that.
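To make the observational-study point concrete, here's a toy sketch (my own made-up setup with hypothetical numbers, loosely patterned on the pollution example): treatment assignment follows the cutoff exactly, so the "design" looks clean, but something else also differs across the boundary, and the fitted jump attributes all of it to the treatment.

```python
# Toy sketch (hypothetical numbers): a sharp cutoff with an unmodeled
# background difference between the two sides. The RD estimate is "identified"
# by the assignment rule, but it cannot separate treatment from background.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(-1, 1, n)       # forcing variable (e.g., distance from a boundary)
treated = (x >= 0)              # assignment follows the cutoff exactly
background = 2.0 * treated      # unmodeled difference between the two sides
true_effect = 0.0               # the policy itself does nothing here
y = true_effect * treated + background + rng.normal(0, 1, n)

# Naive RD: linear fit on each side, jump evaluated at the cutoff
fit_l = np.polynomial.Polynomial.fit(x[~treated], y[~treated], 1)
fit_r = np.polynomial.Polynomial.fit(x[treated], y[treated], 1)
jump = fit_r(0.0) - fit_l(0.0)
print(round(jump, 2))  # close to 2.0: all background, no treatment
```

No amount of care in modeling the forcing variable fixes this: the two groups differ in a way the forcing variable doesn't capture, which is the usual situation in an observational study.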

But there’s hope

But there’s hope. Why do I say this? Because where we are now in applied economics—well-meaning researchers performing fatally flawed studies, well-meaning economists and journalists amplifying these claims and promoting quick-fix solutions, skeptics needing to do the unpaid work of point-by-point rebuttals and being characterized as “vehement statistics nerds”—this is exactly where psychology was, five or ten years ago.

Remember that ESP study? When it came out, various psychologists popped out to tell us that it was conducted just fine, that it was solid science. It took us years to realize how bad that study was. (And, no, this is not a moral statement, I’m not saying the researcher who did the study was a bad person. I don’t really know anything about him beyond what I’ve read in the press. I’m saying that he is a person who was doing bad science, following the bad-science norms in his field.) Similarly with beauty-and-sex-ratio, power pose, that dude who claimed he could predict divorces with 90% accuracy, etc.: each study had its own problems, which had to be patiently explained, over and over again, to scientists as well as to influential figures in the news media. (Indeed, I don’t think the Freakonomics team ever retracted their endorsement of the beauty-and-sex-ratio claim, which was statistically and scientifically ridiculous but fit in well with a popular gender-essentialist view of the world.)

But things are improving. Sure, the himmicanes claim will always be with us—that combination of media exposure, PNAS endorsement, and researcher chutzpah can go a long way—but, if you step away from some narrow but influential precincts such as the Harvard and Princeton psychology departments, NPR, and Ted World HQ, you’ll see something approaching skepticism. More and more researchers and journalists are realizing that randomized experiment plus statistical significance does not necessarily equal scientific discovery, that, in fact, “randomized experiment” can motivate researchers to turn off their brains, “statistical significance” occurs all by itself with forking paths, and the paradigm of routine “scientific discovery” can mislead.

And it’s an encouraging sign that if you criticize a study that happens to have been performed by a psychologist, psychologists and journalists on the web do not immediately pop up with, But what about the robustness study?, or Don’t you know that they have causal identification?, etc. Sure, there are some diehards who will call you a Stasi terrorist because you’re threatening the status quo of backscratching comfort, but it’s my impression that the mainstream of academic psychology recognizes that randomized experiment plus statistical significance does not necessarily equal scientific discovery. They’re no longer taking a published claim as default truth.

My message to economists

Savvy psychologists have realized that just because a paper has a bunch of experiments, each with a statistically significant result, it doesn’t mean we should trust any of the claims in the paper. It took psychologists (and statisticians such as myself) a long time to grasp this. But now we have.

So, to you economists: Make that transition that savvy psychologists have already made. In your case, my advice is, no longer accept a claim by default just because it contains an identification strategy, statistical significance, and robustness checks. Don’t think that a claim should stand, just cos nobody’s pointed out any obvious flaws. And when non-economists do come along and point out some flaws, don’t immediately jump to the defense.

Psychologists have made the conceptual leap: so can you.

My message to journalists

I’ll repeat this from before:

When you see a report of an interesting study, contact the authors and push them with hard questions: not just “Can you elaborate on the importance of this result?” but also “How might this result be criticized?”, “What’s the shakiest thing you’re claiming?”, “Who are the people who won’t be convinced by this paper?”, etc. Ask these questions in a polite way, not in any attempt to shoot the study down—your job, after all, is to promote this sort of work—but rather in the spirit of fuller understanding of the study.

Science journalists have made the conceptual leap: so can you.

P.S. You (an economist, or a journalist, or a general reader) might read all the above and say, Sure, I get your point, robustness studies aren’t what they’re claimed to be, forking paths are a thing, you can’t believe a lot of these claims, etc., BUT . . . air pollution is important! evolutionary psychology is important! power pose could help people! And, if it doesn’t help, at least it won’t hurt much. Same with air filters: who could be against air filters?? To which I reply: Sure, that’s fine. I got no problem with air filters or power pose or whatever (I guess I do have a problem with those beauty-and-sex-ratio claims as they reinforce sexist attitudes, but that’s another story, to be taken up with Freakonomics, not with Vox): If you want to write a news story promoting air filters in schools, or evolutionary psychology, or whatever, go for it: just don’t overstate the evidence you have. In the case of the regression discontinuity analyses, I see the overstatement of evidence as coming from a culture of credulity within academia and journalism, a combination of methodological credulity within academic social science (the idea that identification strategy + statistical significance = discovery until it’s been proved otherwise) and credulity in science reporting (the scientist-as-hero narrative).

P.P.S. I’m not trying to pick on econ here, or on Vox. Economists are like psychologists, and Vox reporters are like science reporters in general: they all care about the truth, they all want to use the best science, and they all want to help people. I sincerely think that if psychologists and science reporters can realize what’s been going on and do better, so can economists and Vox reporters. I know it’s taken me a while (see here and here) to move away from default credulity. It’s not easy, and I respect that.

P.P.P.S. Yes, I know that more important things are going on in the world right now. I just have to make my contributions where I can.

It’s not about the blame

Just to elaborate in one more direction: I’m not saying that all or even most economists and policy journalists make the mistake of considering this sort of study correct by default. I’m saying that enough make this mistake that they keep the bad-science feedback loop going.

To put it another way:

I don’t mind that researchers study the effects of natural experiments. Indeed, I think it’s a good thing that they do this (see section 4 here for more on this point).

And I don’t mind that researchers perform poor statistical analyses. I mean, sure, I wish they didn’t do it, but statistics is hard, and the price we pay for good analyses is bad analyses. To put it another way, I’ve published poor statistical analyses myself. Every time we do an analysis, we should try our best, and that means that sometimes we’re gonna do a bad job. That’s just the way it goes.

What’s supposed to happen next is that if you do a bad analysis, and it’s on a topic that people care about, someone will notice the problems with the analysis, and we can go from there.

That’s the self-correcting nature of science.

But some things get in the way. In particular, if enough people consider published and publicized results as correct by default, then that slows the self-correcting process.

I wrote the above post in an attempt to push the process in a better direction. It’s not that 100% or 50% or even 25% of economists and policy journalists act as if identification strategy + statistical significance = discovery. It’s that X% act this way, and I’d like to reduce X. I have some reason for optimism because it’s my impression that X went down a lot in psychology in the past ten years, so I’m hoping it could now decline in economics. Not all the way down to zero (I fear that PNAS, NPR, and Ted will always be with us), but low enough to end the sustainability of the hype cycle as it currently exists.

The funny thing is, economists are often very skeptical! They just sometimes turn off that skepticism when an identification strategy is in the room.

P.P.P.P.S. To clarify after some discussion in comments: I’m not saying that all or most or even a quarter of RD publications or preprints in economics are bad. I’ve not done any kind of survey. What I think is that economists and policy journalists should avoid the trap of reflexive credulity.


We’ve been talking a lot about football lately. I just wrote a football-themed post. It will appear in two weeks, that is, the morning of 26 Jan.

Please send an appropriate picture of your cat and I can append it to the post. Thank you.

Four projects in the intellectual history of quantitative social science

1. The rise and fall of game theory.

My impression is that game theory peaked in the late 1950s. Two classics from that era are Philip K. Dick’s “Solar Lottery” and R. Duncan Luce and Howard Raiffa’s “Games and Decisions.” The latter is charming in its retro attitude that all that remained were some minor technical problems that were on the edge of being solved. In retrospect, I think that book from 1958 represents the high-water mark of the idea of game theory as an all-encompassing tool in social science. Game theory has seen lots of important specific advances since then but its limitations have become clearer too. (See, for one small example, my article, “Methodology as ideology: some comments on Robert Axelrod’s ‘The Evolution of Cooperation,’” published in 2008 but originally from 1986, making a point also made by Joanne Gowa in that year.)

The intellectual history project here is to trace game theory and decision theory from their heights in the 1940s (when game theory and operations research were used by the U.S. military to help win World War II), to Peak Game Theory in the 1950s (when it seemed that we were on the cusp of solving all the important problems of international conflict and social choice, as epitomized by someone like Herman Kahn, who we now recognize as a buffoon but who it seems was considered a serious thinker back in his day), to the business decision making era of 1960s-70s, to the modern day, in which game theory is a particular subfield of political science and economics with continuing developments but no longer viewed as a sort of master key to understanding and solving social problems.

2. The disaster that is “risk aversion.”

I’ve written about this several times before (2005, 2008, 2011, 2014, 2018): “It would be as if any discussion of intercontinental navigation required a preliminary discussion of why the evidence shows that the earth is not flat.”

In particular see this post from 2009, “Slipperiness of the term ‘risk aversion'” and this from 2016, “Risk aversion is a two-way street.”

For this intellectual history, the point is not to criticize naive ideas of risk aversion that are held within the economics profession, but rather to ask how it was that this problematic model became the default. (Just for an example, go to Amazon, look up Mankiw’s Principles of Economics, Search Inside on “averse,” and go to Figure 1 on page 581.)
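For readers who haven't seen why the textbook utility-curvature model is slippery, here's the standard calibration point in miniature (my own illustrative numbers, not taken from the posts linked above): under expected utility with a smooth concave utility function such as log utility, risk aversion over small gambles is essentially invisible, so curvature of a utility-of-wealth function can't be what drives everyday caution about small bets.

```python
# Illustrative calculation (hypothetical wealth figure): the risk premium for a
# 50/50 gamble of +/- stake under log utility of wealth, which shrinks like
# stake**2 / (2 * wealth) for small stakes.
import math

def risk_premium(wealth, stake):
    """How much you'd pay to avoid a fair 50/50 gamble of +/- stake."""
    eu = 0.5 * math.log(wealth + stake) + 0.5 * math.log(wealth - stake)
    certainty_equiv = math.exp(eu) - wealth  # negative: gamble is worth less than 0
    return -certainty_equiv                  # positive: the premium paid to avoid risk

for stake in [10, 100, 1000]:
    print(stake, round(risk_premium(50_000.0, stake), 4))
```

On a $100 even-money bet at $50,000 wealth, the implied premium is about a dime. Anyone who would pay real money to avoid such a bet is not being described by this model, which is one way of seeing how the term covers very different things at different scales.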

3. From model-based psychophysics to black-box social psychology experiments.

I think there’s an interesting intellectual history to be done here, tracing from “psychophysics” and “psychometrics” from circa 100 years ago, which used physics-inspired mechanistic or quasi-mechanistic models, to “behaviorism” from circa 80 years ago whose models seem to me to be more Boolean-inspired (do X and observe Y), to “judgment and decision making” from circa 50 years ago whose models were derived from psychophysics, to “behavioral economics” and “evolutionary psychology” which are typically studied in the way that ESP has been studied for so many years, using null hypothesis significance testing with any underlying model being a black box of no inherent interest.

There are two interesting things going on in this particular intellectual history: first, the idea that “social psychology” / “evolutionary psychology” / “behavioral economics” is in many ways a conceptual step backward in psychology, a denial of the cognitive revolution and a return to the black-box modeling associated with the pre-cognitive behaviorism of the 1940s. Second, there is the specific issue discussed in this thread that, although “judgment and decision making” directly derives from “psychophysics,” the mathematical and statistical modeling in modern JDM is typically much cruder than the models used in classical psychophysics and psychometrics.

This came up in this discussion thread.

4. The two models of microeconomics.

Pop economists (or, at least, pop micro-economists) are often making one of two arguments:
(a) People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.
(b) People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument a is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument b is associated with “we can do better” claims such as why we should fire 80% of public-schools teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

The trick is knowing whether you’re gonna get a or b above. They’re complete opposites! I blogged about this in 2011 and it’s come up several times since then. The intellectual history question is how this happened, and how this is perceived within the economics community. This is related to the analogy between economics now and Freudian psychiatry in the 1950s, and also related to discussions of the political implications of various social science theories.

There are other questions of intellectual history I’d like to study, but the above four are a start.

Of Manhattan Projects and Moonshots

Palko writes:

I think we have reversed the symbolic meaning of a Manhattan project and a moonshot.

He explains:

The former has come to mean a large, focused, and dedicated commitment to rapidly addressing a challenging but solvable problem. The second has come to mean trying to do something so fantastic it seems impossible.

But, he writes:

The reality was largely the opposite. Building an atomic bomb was an incredible goal that required significant advances in our understanding of the underlying scientific principles. Getting to the moon was mainly a question of committing ourselves to spending a nontrivial chunk of our GDP on an undertaking that was hugely ambitious in terms of scale but which relied on technology that was already well-established by the beginning of the Sixties.