Rainy Day Women #13 & 36

Well, they’ll spam you when you’re trying to be so good
They’ll spam you just like they said they would
They’ll spam you when you’re trying to go home
And they’ll spam you when you’re there all alone
But I would not feel so all damned
Everybody must get spammed.
— Bob Dylan, almost.

OK, this one’s funny. A few years ago we had a post, “It’s . . . spam-tastic!”, reporting on a scam where they send personal-looking emails to scholars to get them to pay $3000 or so to get their articles published in a journal that no one will read. They are parasites on the goodwill of academia, and each such bit of spam degrades that trust, drawing from the non-bottomless aquifer of the cooperative scientific enterprise.

At the time of this writing, that post has 43 comments, several of which are by other academics who received this spam solicitation.

OK, here’s the funny part. Today I was going through the moderated comments, and . . . I see a spam comment on that post, from the same spam publisher I was exposing! I did not approve the comment. I sent it straight into the spam folder.

It’s funny, but on second thought it’s not so funny, it’s sad. I see their webpage with their fake journal, and I think of the researchers who got conned into giving them $3000 . . . it just makes me want to cry. I’m a researcher, I want people to read my articles, I understand this desire, and it breaks my heart to see scammers exploiting it. I feel the same way as I do about those people who advertise health supplements on late night TV or whatever to scam Grandpa out of his social security checks.

0/0 = . . . 0? That’s Australian math, mate!

Tom Davies writes:

I looked down on stats when I was at university, and now it’s the only area of maths which is of any use to me.

And he points us to this amusing example:

What’s great about this story is that it is happening in a “faraway land” (as the Gremlins researcher might say), and so I have no idea who the good guys and bad guys are supposed to be. No need to be happy that the bad guys blew it one more time, or to be frustrated that the good guys dropped the ball. I’ve never heard of the Burnet Institute (or, for that matter, the @BurnetInstitute) or the article in question. The 0/0 thing does look fishy, though, so good to see people getting called out on this sort of thing.
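The arithmetic here isn't subtle, either: 0/0 is undefined, and standard software refuses to call it 0. A quick check (my illustration, not anything from the story itself):

```python
import math

# Python refuses 0/0 outright, whether int or float:
try:
    0.0 / 0.0
except ZeroDivisionError as err:
    print("0/0 raises:", err)

# Libraries that follow IEEE-754 arithmetic (e.g. NumPy) return NaN for
# 0.0/0.0; NaN compares unequal to everything, 0 included (and even to
# itself):
nan = float("nan")
print(nan == 0.0, nan != nan)
```

Either way, nothing in the machinery is willing to report a zero-over-zero ratio as "0% change."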

P.S. The above title is a joke. American math is just as bad! (And that last link came directly from the U.S. government.)

My slides and paper submissions for Prob Prog 2021

Prob Prog 2021 just ended. Prob Prog is the big (250 registered attendees, and as many as 180 actually online at one point) probabilistic programming conference. It’s a very broadly scoped conference.

The online version this year went very smoothly. It ran a different schedule every day to accommodate different time zones. So I wound up missing the Thursday talks other than the posters because of the early start. There was a nice amount of space between sessions to hang out in the break rooms and chat.

Given that there’s no publication for this conference, I thought I’d share my slides here. The talks should go up on YouTube at some point.

Slides: What do we need from a PPL to support Bayesian workflow?

There was a lot of nice discussion around bits of workflow we don’t really discuss in the paper or book: how to manage file names for multiple models, how to share work among distributed teammates, how to put models into production and keep them updated for new data. In my talk, I brought up issues others have to deal with like privacy or intellectual property concerns.

My main focus was on modularity. After talking to a bunch of people after my talk, I still don’t think we have any reasonable methodology as a field to test out components of a probabilistic program that are between the level of a density we can unit test and a full model we can subject to our whole battery of workflow tests. How would we go about just testing a custom GP prior or spatio-temporal model component? There’s not even a way to represent such a module in Stan, which was the motivation for Maria Gorinova’s work on SlicStan. Ryan Bernstein (a Stan developer and Gelman student) is also working on IDE-like tools that provide a new language for expressing a range of models.

Then Eli Bingham (of Pyro fame) dropped the big question: is there any hope we could use something like these PPLs to develop a scalable, global climate model? Turns out that we don’t even know how they vet the incredibly complicated components of these models. Just the soil carbon models are more complicated than most of the PK/PD models we fit and they’re one of the simplest parts of these models.

I submitted two abstracts this year; then they invited me to do a plenary session, and I decided to focus on the first.

Paper submission 1: What do we need from a probabilistic programming language to support Bayesian workflow?

Paper submission 2: Lambdas, tuples, ragged arrays, and complex numbers in Stan

P.S. Andrew: have you considered just choosing another theme at random? It’s hard to imagine it’d be harder to read than this one.

How did the international public health establishment fail us on covid? By “explicitly privileging the bricks of RCT evidence over the odd-shaped dry stones of mechanistic evidence”

Peter Dorman points us to this brilliant article, “Miasmas, mental models and preventive public health: some philosophical reflections on science in the COVID-19 pandemic,” by health research scholar Trisha Greenhalgh, explaining what went wrong in the response to the coronavirus by British and American public health authorities.

Greenhalgh starts with the familiar (and true) statement that science proceeds through the interplay of theory and empirical evidence. Theory can’t stand alone, and empirical evidence in the human sciences is rarely enough on its own either. Indeed, if you combine experimental data with the standard rules of evidence (that is, acting as if statistically-significant comparisons represent real and persistent effects and as if non-statistically-significant comparisons represent zero effects), you can be worse off than had you never done your damn randomized trial in the first place.
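A toy simulation (my numbers, not Greenhalgh's) shows the mechanism: when a study is noisy, conditioning on statistical significance guarantees that the surviving estimates exaggerate the true effect, and a good chunk of them get its sign wrong.

```python
import random
import statistics

# Toy example: a small true effect measured with lots of noise. Keeping
# only the "statistically significant" estimates (|estimate| > 1.96 * se)
# selects the extreme draws, so the survivors overstate the truth.
random.seed(1)
true_effect = 0.1
se = 1.0  # standard error much larger than the effect

estimates = [random.gauss(true_effect, se) for _ in range(100_000)]
significant = [e for e in estimates if abs(e) > 1.96 * se]

print(f"mean of all estimates:         {statistics.mean(estimates):.2f}")
print(f"mean of 'significant' ones:    {statistics.mean(significant):.2f}")
print(f"share of those with wrong sign: "
      f"{sum(e < 0 for e in significant) / len(significant):.2f}")
```

The full set of estimates averages out near the true effect; the significant subset averages several times larger, with many estimates pointing the wrong way entirely.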

Greenhalgh writes that some of our key covid policy disasters were characterized by “ideological movements in the West [that] drew—eclectically—on statements made by scientists, especially the confident rejection by some members of the EBM movement of the hypothesis that facemasks reduce transmission.”

Her story with regard to covid and masks has four parts. First, the establishment happened to start with “an exclusively contact-and-droplet model” of transmission. That’s unfortunate, but mental models are unavoidable, and you have to start somewhere. The real problem came in the second step, which was to take a lack of relevant randomized studies on mask efficacy as implicit support to continue to downplay the threat of aerosol transmission. This was an avoidable error. (Not that I noticed it at the time! I was trusting the experts, just like so many other people were.) The error was compounded in the third step, which was to take the non-statistically-significant result from a single study, the Danmask trial (which according to Greenhalgh was “too small by an order of magnitude to test its main hypothesis” and also had various measurement problems), as evidence that masks do not work. Fourth, this (purportedly) evidence-based masks-don’t-work conclusion was buttressed by evidence-free speculation of reasons why masks might make things worse.
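To get a feel for the "too small by an order of magnitude" claim, here is a generic back-of-the-envelope sample-size calculation for a two-arm trial with a binary outcome. All the numbers below are illustrative placeholders of mine; I have not checked them against the actual Danmask design.

```python
import math
from statistics import NormalDist

# Illustrative question: if masks cut an infection risk of 2% down to
# 1.7% (a 15% relative reduction), how many participants per arm does a
# trial need for 80% power at alpha = 0.05?
p1, p2 = 0.020, 0.017       # control and treatment event rates (assumed)
alpha, power = 0.05, 0.80

z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided test
z_b = NormalDist().inv_cdf(power)          # quantile for target power
p_bar = (p1 + p2) / 2                      # pooled event rate

# Standard normal-approximation formula for comparing two proportions:
n_per_arm = ((z_a + z_b) ** 2 * 2 * p_bar * (1 - p_bar)) / (p1 - p2) ** 2
print(f"needed per arm: {math.ceil(n_per_arm):,}")  # tens of thousands
```

With rare outcomes and a modest relative effect, the required sample size lands in the tens of thousands per arm, which is how a trial of a few thousand people can be far too small to answer the question it was nominally asking.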

Greenhalgh’s message is not that we need theory without evidence, or evidence without theory. Her message, informed by what seems to me is a very reasonable reading of the history and philosophy of science, is that theories (“mental models”) are in most cases necessary, and we should recognize them as such. We should use evidence where it is available, without acting as if our evidence, positive or negative, is stronger than it is.

All this sounds unobjectionable, but when you look at what happened—and is still happening—in the covid discourse of the past year and a half, you’ll see lots of contravention of these reasonable principles, with the errors coming not just from Hoover Institution hacks but also from the Centers for Disease Control and other respected government agencies. It might sound silly to say that people are making major decisions based on binary summaries of statistical significance from seriously flawed randomized studies, but that seems to be what’s happening. But, as Greenhalgh emphasizes, the problem is not just with the misunderstanding of what to do with statistical evidence; it’s also with the flawed mental model of droplet transmission that these people really really didn’t want to let go of.

And check out her killer conclusion:

While I [Greenhalgh] disagree with the scientists who reject the airborne theory of SARS-CoV-2 transmission and the evidence for the efficacy of facemasks, they should not be dismissed as ideologically motivated cranks. On the contrary, I believe their views are—for the most part—sincerely held and based on adherence to a particular set of principles and quality standards which make sense within a narrow but by no means discredited scientific paradigm. That acknowledged, scientists of all creeds and tribes should beware, in these fast-moving and troubled times, of the intellectual vices that tempt us to elide ideology with scientific hypothesis.

Well put. Remember how we said that honesty and transparency are not enuf? Bad statistical methods are a problem in part because they can empower frauds and cheaters, but they also can degrade the work of researchers who would like to do well. Slaves to some long-defunct etc etc. And it’s not just a concern for this particular example; my colleagues and I have argued that these problems arise with so-called evidence-based practice more generally. As I put it a few years ago, evidence-based medicine eats itself.

P.S. The problems with the public health establishment should not be taken to imply that we should trust anti-establishment sources. For all its flaws, the public health establishment is subject to democratic control and has the motivation to improve public health. They make mistakes and we can try to help them do better. There’s some anti-establishment stuff that’s apparently well funded and just horrible.

I like Steven Pinker’s new book. Here’s why:

I first heard about Rationality, the latest book from linguist Steven Pinker, from his publisher, offering to send me a review copy. Pinker has lots of interesting things to say, so I was happy to take a look. I’ve had disagreements with him in the past, but it’s always been cordial (see here, here, and here), and he’s always shown respect for my research—indeed, if I do what Yair calls a “Washington read” of Pinker’s new book, I find a complimentary citation to me—indeed, a bit too complimentary, in that he credits me with coining the phrase “garden of forking paths,” but all I did was steal it from Borges. Not that I’ve ever actually read that story; in the words of Daniel Craig, “I like the title.” More generally, I appreciate Pinker’s willingness to engage with criticism, and I looked forward to receiving his book and seeing what he had to say about rationality.

As with many books I’m called upon to review, this one wasn’t really written for me. This is the nature of much of nonfiction book reviewing. An author writes a general-audience book on X, you get a reviewer who’s an expert on X, and the reviewer needs to apply a sort of abstraction, considering how a general reader would react. That’s fine; it’s just the way things are.

That said, I like the book, by which I mean that I agree with its general message and I also agree with many of the claims that Pinker presents as supporting evidence.

Pinker’s big picture

I’ll quickly summarize the message here. The passage below is not a quote; rather, it’s my attempt at a summary:

Humans are rational animals. Yes, we are subject to cognitive illusions, but the importance of these illusions is in some way a demonstration of our rationality, in that we seek reasons for our beliefs and decisions. (This is the now standard Tversky-Kahneman-Gigerenzer synthesis; Pinker emphasizes Gigerenzer a bit less than I would, but I agree with the general flow.) But we are not perfectly rational, and our irrationalities can cause problems. In addition to plain old irrationality, there are also people who celebrate irrationality. Many of those celebrators of irrationality are themselves rational people, and it’s important to explain to them why rationality is a good thing. If we all get together and communicate the benefits of rationality, various changes can be made in our society to reduce the spread of irrationality. This should be possible given the decrease in violent irrationality during the past thousand years. Rationality isn’t fragile, exactly, but it could use our help, and the purpose of Pinker’s book is to get his readers to support this project.

There’s a bit of hope there near the end, but some hope is fine too. It’s all about mixing hope and fear in the correct proportions.

Chapter by chapter

A few years ago, I wrote that our vision of what makes us human has changed. In the past, humans were compared to animals, and we were “the rational animal”: our rationality was our most prized attribute. But now (I wrote in 2005) the standard of comparison is the computer: we are “the irrational computer,” and it is our irrationality that is said to make us special. This seemed off to me.

Reading chapter 1 of Pinker’s book made me happy because he’s clearly on my side (or, maybe I should say, I’m on his side): we are animals, not computers, and it’s fair to say that our rationality is what makes us human. I’m not quite sure why he talks so much about cognitive illusions (the availability heuristic, etc.), but I guess that’s out of a sense of intellectual fairness on his part: He wants to make the point that we are largely rational and that’s a good thing, and so he clears the deck by giving some examples of irrationality and then explaining how this does not destroy his thesis. I like that: it’s appealing to see a writer put the evidence against his theory front and center and then discuss why he thinks the theory still holds. I guess I’d only say that some of these cognitive illusions are pretty obscure—for example I’m not convinced that the Linda paradox is so important. Why not bring in some of the big examples of irrationality in life: on the individual level, behaviors such as suicide and drug addiction; at the societal level, decisions such as starting World War 1 and obvious misallocations of resources such as financing beach houses in hurricane zones? I see a sort of parochialism here, a focus on areas of academic psychology that the author is close to and familiar with. Such parochialism is unavoidable—I write books about political science and statistics!—but I’d still kinda like to see Pinker step back and take a bigger perspective. In saying this, I realize that Pinker gets this from both sides, as other critics will tell him to stick to his expertise in linguistics and not try to make generalizations about the social world. So no easy answer here, and I see why he wrote the chapter the way he did, but it still leaves me slightly unsatisfied despite my general agreement with his perspective.

I liked most of chapter 2 as well: here Pinker talks about the benefits of people taking a rational approach, both for themselves as individuals and for society. I don’t need much convincing here, but I appreciated seeing him make the case.

Pinker writes that “ultimately even relativists who deny the possibility of objective truth . . . lack the courage of their convictions.” I get his point, at least sometimes: consider, for example, the people who do double-blind randomized controlled trials of intercessory prayer—they somehow think that God has the ability to cause a statistically significant improvement among the treatment group but that He can’t just screw with the randomization. On the other hand, maybe Pinker is too optimistic. He writes that purported “relativists” would not go so far as to deny the Holocaust, climate change, and the evils of slavery—but of course lots of people we encounter on the internet are indeed relativistic enough in their attitudes to deny these things, and they appear to be happy to set aside logic and evidence and objective scholarship to hold beliefs that they want to believe (as Pinker actually notes later on in chapter 10 of his book). Sometimes it seems that the very absurdity of these beliefs is part of their appeal: defending slavery and the Confederacy, or downplaying the crimes of Hitler, Stalin, Mao, etc., is a kind of commitment device for various political views. I guess climate change denial (and, formerly, smoking-cancer denial) is more of a mixed bag, with some people holding these views as a part of their political identity and others going with the ambiguity of the evidence to take a particular position. Belief in intercessory prayer is a different story because, at least in this country, it’s a majority position, so if you have a generally rational outlook and you also believe in the effectiveness of intercessory prayer, it makes sense that you’d try your best to fit it into your rational view of the world, in the same sense that rational rationalizers might try to construct rationales for fear, love, and other strong emotions that aren’t particularly rational in themselves.

Elsewhere I think Pinker’s too pessimistic. I guess he doesn’t hear that complaint much, but here it is! He writes: “Modern universities—oddly enough, given that their mission is to evaluate ideas—have been at the forefront of finding ways to suppress opinions, including disinviting and drowning out speakers, removing controversial teachers from the classroom, revoking offers of jobs and support, expunging contentious articles from archives, and classifying differences of opinion as punishable harassment and discrimination.” I guess I’m lucky to be at Columbia because I don’t think they do any of that here. I’ll take Pinker at his word that these things have happened at modern universities; still I wouldn’t say that universities are “at the forefront of finding ways to suppress opinions,” just because their administrations sometimes make bad decisions. If universities are at the forefront of finding ways to suppress opinions, where does that put the Soviet Union, the Cultural Revolution, and other such institutions that remain in living memory? I agree that we should fight suppression of free speech, but let’s keep things in perspective and save the doomsaying for the many places where it’s appropriate!

There was another thing in chapter 2 that didn’t ring true to me, but I’ll get to it later, as right here I don’t want these particular disagreements to get in the way of my agreement with the main message of the chapter, which is that rational thinking is generally beneficial in life and society, even beyond narrow areas such as science and business.

I have less to say about chapters 3 through 9, which cover logic, probability, Bayesian reasoning, expected utility, hypothesis testing, game theory, and causal inference. He makes some mistakes (for example, defining statistical significance as “a Bayesian likelihood: the probability of obtaining the data given the hypothesis”), but he does a pretty good job at covering a lot of material in a small amount of space, and I was happy to see him including two of my favorite examples: the hot hand fallacy fallacy explained by Miller and Sanjurjo, and Gigerenzer’s idea of expressing probabilities as natural frequencies. I’m not quite sure how well this works as a book—to me, it sits in the uncanny valley between a college text and a popular science treatment—but I’m not the target audience here, so who am I to say.
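The Miller and Sanjurjo point, by the way, is easy to check by brute force (this is a sketch of mine, not their code): over all short sequences of fair coin flips, the average proportion of heads that immediately follow a head comes out below 1/2, not equal to it, and that selection bias is what the earlier hot-hand studies missed.

```python
import itertools

def prop_h_after_h(seq):
    """Proportion of flips following an H (=1) that are also H;
    None if no H is ever followed by another flip."""
    follows = [b for a, b in zip(seq, seq[1:]) if a == 1]
    return sum(follows) / len(follows) if follows else None

n = 4  # short sequences make the bias easy to see
props = [p for seq in itertools.product([0, 1], repeat=n)
         if (p := prop_h_after_h(seq)) is not None]
print(f"average over all length-{n} sequences: {sum(props) / len(props):.4f}")
```

For length-4 sequences the average works out to about 0.405, not 0.5, so a basketball player whose shots behave like fair coin flips will look, by this measure, like he has an anti-hot hand.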

Just one thing. At one point in these chapters on statistics, Pinker talks about fallacies that have contributed to the replication crisis in science (that’s where he mentions my forking-paths work with Eric Loken). I think this treatment would be stronger if he were to admit that some of his professional colleagues have been taken in by junk science in its different guises. There was that ESP study published by one of the top journals in the field of psychology. There was the absolutely ridiculous and innumerate “critical positivity ratio” theory that, as recently as last year, was the centerpiece of a book that was endorsed by . . . Steven Pinker! There was the work of “Evilicious” disgraced primatologist Marc Hauser, who wrote a fatuous article for the Edge Foundation’s “Reality Club” . . . almost a decade before Harvard “found him guilty of scientific misconduct and he resigned” (according to wikipedia). I think that including these examples would be a freebie. Admitting that he and other prominent figures in his field were fooled would give more of a bite to these chapters. Falling for hoaxes and arguments with gaping logical holes is not just for loser Q followers on the internet; it happens to decorated Harvard professors too.

The final two chapters of the book return to the larger theme of the benefits of rationality. Chapter 10 leads off with a review of covid science denial, fake news, and irrational beliefs. Apparently 32% of Americans say they believe in ghosts and 21% say they believe in witches. The witches thing is just silly, but covid denial has killed people, and climate change denial has potentially huge consequences. How to reconcile this with the attitude that people are generally rational? Pinker’s answer is motivated reasoning—basically, people believe what they want—and that most of these beliefs are in what he calls “the mythology zone,” beliefs such as ghosts and witches that have no impact on most people’s lives. He argues that “the arc of knowledge is a long one, and it bends toward rationality.” I don’t know, though. I feel like the missing piece in his story is politics. The problem with covid denial is not individual irrationality; it’s the support of this denialism by prominent political institutions. In the 1960s and again in recent years, there’s been widespread concern about lawlessness in American politics. When observers said that the world was out of control in the 1960s, or when they say now that today’s mass politics are reminiscent of the 1930s, the issue is not the percentage of people holding irrational beliefs; it’s the inability of traditional institutions to contain these attitudes.

Getting to details: A couple places in his book, Pinker refers to the irrationality of assuming that different groups of people are identical on average in “socially significant variables” such as “test scores, vocational interests, social trust, income, marriage rates, life habits, rates of different types of violence.” As Woody Guthrie sang, “Some will rob you with a six-gun, and some with a fountain pen.” Fine. I get it. Denying group differences is irrational. But it’s funny that Pinker doesn’t mention irrationality in traditional racism and sexism, the belief that women or ethnic minorities just can’t do X, Y, or Z. These sorts of prejudices are among the most famous examples of irrational thinking. Irrationalities bounce off each other, and one irrationality can be a correction for another. Covid denialism and climate change denialism, as unfortunate and irrational as they are, can be seen as reactions to the earlier irrationality of blind trust in our scientific overlords, with these reactions stirred up by various political and media figures.

At one point Pinker writes, “Rationality is disinterested. It is the same for everyone everywhere, with a direction and momentum of its own.” I see the appeal of this Karenina-esque statement, but I don’t buy it. Rationality is a mode of thinking, but the details of rationality change. For example, nowadays we have Bayesian reasoning and the scientific method. Aristotle, rational as he may have been, didn’t have these tools. In his concluding chapter, Pinker seems to get this, as he talks about the ever-expanding bounds of rationality during the past several centuries. I guess the challenge is that people may be more rational than they used to be, but in the meantime our irrationality can cause more damage. Technology is such that we can do more damage than ever before. What’s relevant is not irrationality, but its consequences.

Now that I’ve read the whole book, let me try to summarize the sweep of Pinker’s argument. It goes something like this:

Chapter 1. It is our nature as humans for our beliefs and attitudes to have a mix of rationality and irrationality. We’re all subject to cognitive illusions while at the same time capable of rational reasoning.

Chapter 2. Rationality, to the extent we use it, is a benefit to individuals and society.

Chapters 3-9. Rationality ain’t easy. To be fully rational you should study logic, game theory, probability, and statistics.

Chapter 10. We’re irrational because of motivated reasoning.

Chapter 11. Things are getting better. Rationality is on the rise.

The challenge is to reconcile chapter 1 (irrationality is human nature) with chapters 3-9 (rationality ain’t easy) and 11 (rationality is on the rise). Pinker’s resolution, I think, is that science is progressing (that’s all the stuff in chapters 3-9 that can help the readers of his book become more rational in their lives and understand the irrationality of themselves and others) and society is improving. Regarding that last point, he could be right; at the same time, he never really gives a good reason for his confidence that we don’t have to be concerned about the social and environmental costs of increasing political polarization, beyond a vague assurance that “The new media of every era open up a Wild West of apocrypha and intellectual property theft until truth-serving countermeasures are put into place” and then some general recommendations regarding social media companies, pundits, and deliberative democracy, with the statement (which I agree with) that rationality “is not just a cognitive virtue but a moral one.” As the book concludes, Pinker alternates between saying that we’re in trouble and we need rationality to save us, and that progress is the way of the world. This is a free-will paradox that is common in the writings of social reformers: everything is getting better, but only because we put in the work to make it so. The Kingdom of Heaven has been foretold, but it is we, the Elect, who must create it. Or, to put it in a political context, We will win, but only with your support. This does not mean that Pinker’s story is wrong: it may well be that rationality will prevail (in some sense) due to the effort of Pinker and the rest of us; I’m just saying that his argument has a certain threading-the-needle aspect.

Still and all, I like Pinker’s general theme of the complexity and importance of rationality, even if I think he focuses a bit too much on the psychological aspect of the problem and not enough on the political.

Parochialism and taboo

One unfortunate feature of the book is a sort of parochialism that privileges recent academic work in psychology and related fields. For example this on page 62: “Can certain thoughts be not just strategically compromising but evil to think? This is the phenomenon called taboo, from a Polynesian word for ‘forbidden.’ The psychologist Philip Tetlock has shown that taboos are not just customs of South Sea islanders but active in all of us.” And then there’s a footnote to research articles from 2000 and 2003.

That’s all well and good, but:

1. No way that Tetlock or anybody else has shown that an attitude is “active in all of us.” At best these sorts of studies can only tell us about the people in the studies themselves, but, also, this evidence is almost always statistical, with the result being that average behavior is different under condition A than under condition B. I can’t think of any study of this sort that would claim that something occurs 100% of the time. Beyond this, there do seem to be some people who are not subject to taboos. Jeffrey Epstein, for example.

2. If we weaken the claim from “taboos are active in all of us” to “taboos are a general phenomenon, not limited to some small number of faraway societies,” then it seems odd to attribute this to someone writing in the year 2000. The idea of taboos being universal and worth studying rationally is at least as old as Freud. Or, if you don’t want to cite Freud, lots of anthropology since then. Nothing wrong with bringing in Tetlock’s research, but it seems a bit off, when introducing taboos, to focus on obscure issues such as “forbidden base rates” or attitudes on the sale of kidneys rather than the biggies such as taboos against incest, torture, etc.

I’ve disagreed with Pinker before about taboos, and I think my key point of disagreement is that sometimes he labels something a “taboo” that I would just call a bad or immoral idea. For example, a few years ago Pinker wrote, “In every age, taboo questions raise our blood pressure and threaten moral panic. But we cannot be afraid to answer them.” One of his questions was, “Would damage from terrorism be reduced if the police could torture suspects in special circumstances?” I don’t think it’s “moral panic” to be opposed to torture; indeed, I don’t think it’s “moral panic” for the question of torture to be taken completely off the table. I support free speech, including the right of people to defend Jerry Sandusky, Jeffrey Epstein, John Yoo, etc etc., and, hey, who knows, someone might come up with a good argument in favor of their behavior—but until such an argument appears, I feel no obligation to seriously consider these people’s actions as moral. Pinker might call that a taboo on my part; I’d call this a necessary simplification of life, the same sort of shortcut that allows me to assume, until I’m shown otherwise, that dishes when dropped off the table will fall down to the floor rather than up to the ceiling. Again, Pinker’s free to hold his own view on this (I understand that since making the above-quoted statement he’s changed his position and is now firmly anti-torture); my point is that labeling an attitude as “taboo” can itself be a strong statement.

Another example is that Pinker describes it as “a handicap in mental freedom” to refuse to answer the question, “For how much money would you sell your child?” Here he seems to be missing the contextual nature of psychology. Many people will sell their children—if they’re poor enough. I doubt many readers of Pinker’s book are in that particular socioeconomic bracket; indeed, in his previous paragraph he asks you to “try playing this game at your next dinner party.” I think it’s safe to say that if you’re reading Pinker’s book and attending dinner parties, there’s no amount of money for which you’d sell your child. So the question isn’t so much offensive as silly. My guess is that if someone asks this at such a party, the response would not be offense but some sort of hypothetical conversation, similar to if you were asked whether you’d prefer invisibility or the power of flight. Or maybe Pinker hangs out with a much more easily-offended crowd than I do. On the other hand, what about people who actually sold their children, or the equivalent, to Jeffrey Epstein? Pinker’s on record as saying this is reprehensible. How does this line up with his belief that it’s “a handicap in mental freedom” to not consider for how much money you would sell your child?

This example points to a sort of inner contradiction of Pinker’s reasoning. On one hand, he’s saying we all have taboos. I guess that includes him too! He’s also saying that we live in a society where there are all sorts of things we can’t talk about, not just torture and the selling of children and the operation of a private island for sex with underage women, but also organ donation, decisions of hospital administrators, and budgetary decisions. On the other hand, he’s writing for an audience of readers who, if they don’t already agree with him, are at least possibly receptive to his ideas—so they’re not subject to these taboos, or at least maybe not. This gets back to the question of what Pinker’s dinner parties are like: is it a bunch of people sitting around the table talking about the potential benefits of torture, subsidized jury duty, and an open market in kidneys; or a bunch of people all wanting to talk about these things but being afraid to say so; or a bunch of people whose taboos are so internalized that they refuse to even entertain these forbidden ideas? You can see how this loops back to my first point above about that phrase “active in all of us.” Later on, Pinker says, “It’s wicked to treat an individual according to that person’s race, sex, or ethnicity.” “Wicked,” huh? That seems pretty strong! Torture or selling your child are OK conversation topics, but treating men different than women is wicked? I honestly can’t figure out where he draws the line. That’s ok—there’s no reason to believe we’re rational in what bothers us—but then maybe he could be a bit more understanding about those of us who think that torture is “wicked,” rather than just “taboo.”

Also I don’t quite get when Pinker writes that advertisers of life insurance “describe the policy as a breadwinner protecting a family rather than one spouse betting the other will die.” This just seems perverse on his part. When I bought life insurance, I was indeed doing it to protect my family in the event that I die young. I get it that you could say that mathematically this is equivalent to my wife betting that I would die, but really that makes no sense, given that I was the one paying for the insurance (so she’s not “betting” anything) and, more importantly, the purpose of the insurance was not to gamble but to reduce uncertainty. It would make more sense to say I was “hedging” against the possibility that I would die young. Here it seems that Pinker wants to anti-euphemize, to replace an accurate description (buying life insurance to protect one’s family) by an inaccurate wording whose only virtue is harshness.

Had I written this book, I would’ve emphasized slightly different things. As noted above, it seems strange to me that, when talking about irrationality, Pinker focuses so much on irrational beliefs rather than on irrational actions. At one level, I understand: the belief is father to the action. But it’s the actions that matter, no? I guess one reason I say this is my political science background. For example, the irrational action of funding housing construction in flood zones can be explained in part through various political deals and subsidies. Spreading better understanding of climate change should help, but it’s not clear that individual irrationality is the biggest problem here, and I’m concerned that Pinker is falling into an individualistic trap when studying society. To take a more positive example, cigarette smoking rates are much lower than they were a half-century ago. I would attribute this not to an increase in rationality or Odyssean self-control but rather to the notions of fashion and coolness of which Pinker seems so dismissive. Smoking used to be cool; now it isn’t. I remember 20 years ago when NYC banned smoking in restaurants and bars; various pundits and lobbying organizations declared that this was a horrible infringement on liberty, that the people would rise up, etc. . . . None of those things happened. They banned smoking and people just stopped smoking indoors. I guess that did induce some Odyssean self-control among smokers, so I’m not saying these individualistic behavioral concepts are useless, just that they’re not the whole story, and indeed sometimes they don’t seem to be the most important part of the story.

But that’s not really a criticism of Pinker’s book, that I would’ve written something different. It’s a limitation of his story, but all stories have limitations.

Big hair

One thing I found charming about the book, but others might find annoying, is the datedness of some of its references and perspectives. Chapter 1 reads as if it was written in the 1980s, back when the work of Tversky, Kahneman, and Gigerenzer was new. (I was going to say “new and exciting,” but that would be misleading: yes, their work was new and exciting in the 1980s, but it remains exciting even now, long after it was new.) Chapter 2 begins, “Rationality is uncool. To describe someone with a slang word for the cerebral, like nerd, wonk, geek, or brainiac, is to imply they are terminally challenged in hipness.” I guess there’s always the possibility that he’s kidding, but . . . things have changed in the past 40 years, dude! In recent years, lots of people have been proud to be called nerds, wonks, or geeks; if anything, it’s “hipsters” who are not considered to be so cool. Pinker supports his point with quotes from Talking Heads, Prince, and . . . Zorba the Greek? That’s a movie from 1964! Later he refers to Peter, Paul and Mary (or, as he calls them, “Peter, Paul, and Mary”—prescriptive linguist that he is). When it comes to basketball, his go-to example is Vinnie Johnson from the 1980s Pistons. OK, I get it, he’s a boomer. That’s cool. You be you, Steve. But it might be worth not just updating your cultural references but considering that the culture has changed in some ways in the past half-century. In that same paragraph as the one with Zorba, Pinker describes “postmodernism” and “critical theory” as “fashionable academic movements.” I’m sure that there are professors still teaching these things, but no way that postmodernism and critical theory are “fashionable.” It’s been close to 40 years since they were making the headlines! You might as well label suspenders, big hair, and cocaine as fashionable. I half-expected to hear him talk about “yuppies” and slip in an Alex P. Keaton reference.

Review of reviews

After reading Pinker’s book, I did some googling and read some reviews. Given the title of the book, I guess we shouldn’t be surprised that reason.com liked it! Other reviews were mixed, with The Economist’s “Steven Pinker’s new defence of reason is impassioned but flawed” catching the general attitude that he had some things to say but had bitten off more than he could chew.

The New York Times review argues that Pinker gets things wrong in the details (for example, Pinker pointing to the irrationality of “half of Americans nearing retirement age who have saved nothing for retirement” without recognizing that “the median income for those non-saving households is $26,000, which isn’t enough money to pay for living expenses, let alone save for retirement”), while the Economist reviewer is OK with the details but is concerned about the big picture, reminding us that rationality can be deadly: “Rationality involves people knowing they are right. And from the French revolution on, being right has been used to justify appalling crimes. Mr Pinker would no doubt call the Terror a perversion of reason, just as Catholics brand the Inquisition a denial of God’s love. It didn’t always seem that way at the time.” Good point. This is an argument that Pinker should’ve addressed in his book: violence can come from purported rationality (for example, the Soviets) as well as from open irrationality (for example, the Nazis).

The published review whose perspective is closest to mine comes from Nick Romeo in the Washington Post, who characterizes the book as “a pragmatic dose of measured optimism, presenting rationality as a fragile but achievable ideal in personal and civic life,” offering “the welcome prospect of a return to sanity.” Like me, Romeo suggests that Pinker’s individualist argument could be improved by making more connections to politics (in his case, “the political economy of journalism — its funding structures, ownership concentration and increasing reliance on social media shares”). Ultimately, though, I think we have to judge a book by what it is, not what it is not. Pinker is a psychology professor, so it makes sense that, when writing about rationality, he focuses on its psychological aspects.

New blog formatting

We needed to update the blog because the old theme was no longer being maintained by WordPress, and we were having security problems and issues with the comment screening. So we replaced it with this new theme, which had some issues of its own. We patched it so it’s now functional, but I’ve been told it isn’t so great on phones.

I just wanted to let you know that we’re working on putting together something that looks a little better, less whitespace, things like that. It’s not as easy as you might think, in particular because we’ve got tons of content that you want to read, and we also want to make the commenting experience as easy as possible under the constraints of security and not getting overwhelmed by spam. Thanks for your patience!

How does post-processed differentially private Census data affect redistricting? How concerned should we be about gerrymandering with the new DAS?

This is Jessica. Since my last post on the use of differential privacy at the Census, I’ve been reading some scholarly takes on the impacts of differentially private Census data releases in various applications. At times it reminds me of what economist Chuck Manski might call dueling certitudes, i.e., different conclusions in the face of different approaches to dealing with the lack of the information needed to more decisively evaluate the loss of accuracy and the changes in privacy preservation (because the Census can’t provide the full unnoised microdata for testing until 72 years have passed). There are also many ways in which analyses of the legal implications of the new disclosure avoidance system (DAS) remind me of what Manski calls incredible certitude, which comes up often on this blog: an unwarranted belief in the precision or certainty of data-driven estimates.

A preprint of one of the more comprehensive papers about political implications of the new DAS was posted by Cory McCartan on my last post. His coauthor Shiro Kuriwaki now sends a link to The use of differential privacy for census data and its impact on redistricting: The case of the 2020 U.S. Census, by Kenny, Kuriwaki, McCartan, Rosenman, Simko, and Imai, recently published in Science Advances. It’s dense, but their methods and results are worth discussing given that use of Census data figures prominently in certain legal standards like One Person, One Vote and the Voting Rights Act.

A few things to note right off the bat as background: 

  • The Census TopDown algorithm that makes up the DAS involves both the addition of random noise, calibrated according to differential privacy (DP), and a bunch of post-processing steps to give the data “facial validity” and make it “usable,” in the sense of not containing negative counts, making sure certain aggregated population counts are accurate and consistent with the released data, etc. The demonstration data the Census has released for 2010 provides final results from this pipeline and offers three different epsilon (“privacy-loss”) budgets (the key parameter for DP), two of which are evaluated in the main body of the Kenny et al. paper: one with total epsilon of 4.5 (the initial budget, with more privacy preservation) and one with total epsilon of 12.2 (less privacy protection in favor of accuracy, based on feedback from data users). However, the end of the paper reports their analysis applied to the new DAS data with epsilon 19.61, which is what the Census will use for 2020 – pretty high, but I assume the bureau still thinks it’s an improvement over the old technique if they are using it.
  • The Census has not released DP-noised data without the subsequent post-processing steps, so Kenny et al. compare to the (already noised) 2010 Census data release. Treating deviation from 2010 figures as error, especially when the analyses focus on race, carries some caveats, since the 2010 Census data has already been obscured, through swapping data from carefully chosen households across blocks to limit disclosure of protected attributes like race. However, given the 72-year rule, I guess it’s this or using the recently released 1940 Census, as some other work has.
  • Regarding post-processing, Kenny et al. say “The question is whether these sensible adjustments unintentionally induce systematic (instead of random) discrepancies in reported Census statistics.” However, it seems pretty apparent from prior work and the statements of privacy experts that it’s the postprocessing that’s introducing the bigger issues. See, e.g., letters from various experts calling for release of the noisy counts file to avoid the bias that enters when you try to make the data more realistic. There’s also a prior analysis by Cohen et al. that finds that TopDown, applied to Texas data, can bias nearby Census tracts in the same direction in ways that wouldn’t be expected from the noised but unadjusted counts, impeding the cancellation of error that usually improves accuracy at higher geographical levels.   
  • Finally, there are two legal standards relevant to redistricting that the paper discusses. First, the One Person, One Vote standard requires states to minimize deviation from equal-population districts according to Census data. Second, to bring a case under the Voting Rights Act, one has to provide evidence that race is highly correlated with vote choice and show that it’s possible to create a district where the minority group makes up over 50% of the voting-age population. So the big questions are about how the ability to comply with these standards, and the districts that might result, will be affected under the new DAS.
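To get intuition for why the post-processing (rather than the DP noise itself) can be the bigger problem, here's a toy sketch. This is not the actual TopDown algorithm, just a minimal illustration under made-up numbers: Laplace noise is unbiased, but clipping negative counts and then rescaling to match a fixed total systematically moves population away from large blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_noise_counts(counts, epsilon):
    """Add Laplace noise scaled to a per-query privacy budget (sensitivity 1)."""
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

def post_process(noisy, total):
    """Toy post-processing: clip negatives, then rescale to match the exact total."""
    nonneg = np.clip(noisy, 0, None)
    return nonneg * (total / nonneg.sum())

# Many tiny blocks plus one big one: clipping the tiny blocks' negative draws
# adds mass, so the rescaling step systematically shrinks the big block.
true = np.array([2.0] * 50 + [500.0])
noisy = dp_noise_counts(true, epsilon=0.5)
processed = post_process(noisy, true.sum())

print("mean error, small blocks:", (processed[:50] - true[:50]).mean())
print("error, large block:      ", processed[-1] - true[-1])
```

The noisy counts alone are unbiased in expectation; it's the "make it look like real data" step that introduces the systematic, size-dependent discrepancies the letters about the noisy counts file complain about.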

The datasets that Kenny et al. use combine data spanning various elections from districts and precincts in states that are of interest in redistricting (PA, NC), Deep South states (SC, LA, AL), small states (DE), and heavily Republican (UT) and Democratic (WA) states.

table of states used by Kenny et al.

Undercounting by race and party

The first set of results exposes a bias toward undercounting certain racial and partisan groups with epsilon 12.2. They fit a generalized additive model to the precinct-level population errors for the DAS-12.2 data (defined relative to the 2010 Census) to assess the degree of systematic bias versus residual noise (what DP alone adds). Predictors include the two-party Democratic vote share of elections in the precinct, turnout as a fraction of the voting-age population, log population density, the fraction of the population that is White, and the Herfindahl-Hirschman index (HHI) as a measure of racial heterogeneity (calculated by summing the squared proportions of the population in each of four groups, White, Black, Hispanic, and Other, so that 1 denotes complete lack of diversity). Here i is the precinct index.

P_DAS,i − P_Census,i = t(Democratic_i, Turnout_i, log(Density_i)) + s(White_i) + s(HHI_i) + ε_i
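For concreteness, here's the HHI term computed in a few lines (the precinct group shares are made up):

```python
def hhi(proportions):
    """Herfindahl-Hirschman index: sum of squared group shares.
    1 = everyone in one group (no diversity); lower = more heterogeneous."""
    assert abs(sum(proportions) - 1.0) < 1e-9
    return sum(p * p for p in proportions)

# Shares of White, Black, Hispanic, Other in two hypothetical precincts:
print(hhi([1.0, 0.0, 0.0, 0.0]))      # completely homogeneous precinct -> 1.0
print(hhi([0.25, 0.25, 0.25, 0.25]))  # maximally mixed precinct -> 0.25
```

Note that with four groups the index is bounded below by 0.25, which is why the plots below bottom out well above zero.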

Here are two figures, one showing bias given non-White population, the other showing bias given HHI. The authors attribute the loss of population of mixed White/non-White precincts and consequently their diluted electoral power to the way the new DAS prioritizes accuracy for the largest racial group in an area. 

plots of bias in precinct counts relative to 2010 Census by percent nonwhite and HHI

Several of the HHI graphs flatten closer to 0 around 40% HHI; is that an artifact of higher sample size when you reach that HHI? It’s hard to tell from these plots what proportion of precincts have more extreme error. I also wonder how consistent the previous swapping approach to protecting attributes like race was – my understanding is that households were assigned a risk level related to their uniqueness, with higher-risk households more likely to be swapped; hence it would make sense to expect more swapping in areas that were more racially homogeneous. But then we might expect Census 2010 data to already be biased at low/high % non-White and high HHI. Complicating things, Mark Hansen provides an example suggesting that swapping did not necessarily keep race consistent. It’s rumored that about 5% of households were swapped in Census 2010, so existing bias in the 2010 data may be significant.

The biases are qualitatively similar but less severe when they use the epsilon 19.61 dataset:

bias in precinct counts by percent nonwhite and hhi for das 19.61

Using DAS 12.2, the authors also find in a supplemental analysis, presumably due to the relationship between party and racial heterogeneity, that “moderately Democratic precincts are, on average, assigned less population under the DAS than the actual 2010 Census”, while higher-turnout precincts are on average assigned more population than they should otherwise have (on the order of 5 to 15 voters per precinct). They argue that when aggregated to congressional districts, these errors can become large (on average 308 people, but up to 2151 for a district in Pennsylvania; “orders of magnitude larger than the difference under block population numbers released in 2010”). They don’t discuss the size of the average congressional district, however; I believe it’s in the hundreds of thousands. With epsilon=19.61, the errors go down tenfold, falling between -216 and 319. While I find the consistency of the race and party bias hard to ignore, I can’t help but wonder how these errors compare to the estimated error associated with the imputation process Census uses for missing data. I would expect at least an order-of-magnitude difference there (with the DAS-19.61 errors surely being substantially smaller than the imputation error), in which case it seems we should at least be mindful that some of these deviations under the new DAS are within the bounds of our ignorance.

Achieving population parity for One Person, One Vote, detecting gerrymandering

The next set of analyses rely on simulations to look at potential redistricting plans. Kenny et al. describe:

The DAS-12.2 data yield precinct population counts that are roughly 1.0% different from the original Census, and the DAS-4.5 data are about 1.9% different. For the average precinct, this amounts to a discrepancy of 18 people (for DAS-12.2) or 33 people (for DAS-4.5) moving across precinct boundaries. 

Their simulation looks at how these precinct-level differences propagate to the district level. They use a few different sampling algorithms, including a merge-split MCMC sampler for state legislative district simulations described as building from some prior work available on arXiv, which are implemented in the redist package.

They use two states, Louisiana and Pennsylvania, to look at deviation from population parity, the requirement that “as nearly as is practical, one [person’s] vote in a congressional election is worth as much as another’s.” They simulate maps for Pennsylvania congressional districts and Louisiana State Senate districts which are constrained to achieve population parity with varying tolerated deviation from parity ranging from 0.1% to around 0.65%, defined using either Census 2010 data or one of the new DAS demonstration datasets. Using Pennsylvania data, they find that plans generated from one dataset (e.g., epsilon=12.2) with a certain tolerated deviation from parity on the generating data had much larger deviation when measured against a different dataset (e.g., the released Census 2010). For example, 9,915 out of 10,000 maps simulated from DAS-12.2 exceeded the maximum population deviation threshold according to Census 2010 data, with high variance in the error per plan making it hard for redistricters to predict how far off any given generated plan might be from parity. For Louisiana State Senate, they find that if enacted plans had been created from DAS-12.2, they would exceed the 5% target for population parity according to 2010 data. However, for DAS-19.61 data, things again look a lot better, with a much smaller percentage of generated plans exceeding the 5% threshold for Louisiana (when evaluated against Census 2010).
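To see how small precinct-level shifts matter for parity, here's a minimal sketch using one common definition of a plan's overall population deviation (max minus min district population, as a fraction of the ideal district size). The district populations are hypothetical:

```python
def parity_deviation(district_pops):
    """Overall population deviation of a plan: (max - min) / ideal district size."""
    ideal = sum(district_pops) / len(district_pops)
    return (max(district_pops) - min(district_pops)) / ideal

# The same hypothetical plan scored against two datasets that disagree by a few
# hundred people per district (both sum to the same state total, as the DAS
# pipeline guarantees for certain aggregates):
census_pops = [705_500, 705_200, 705_800, 705_100]
das_pops    = [705_450, 705_900, 705_300, 704_950]

# Under the first dataset the plan meets a 0.1% tolerance (~0.099%);
# under the second it does not (~0.135%).
print(f"{parity_deviation(census_pops):.4%}")
print(f"{parity_deviation(das_pops):.4%}")
```

The point is that when the tolerance is this tight, dataset differences of a few dozen people per precinct are enough to flip a plan from compliant to non-compliant, which is exactly the instability Kenny et al. report.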

Louisiana results

At this point I have to ask, did population parity ever make sense, given that Census population counts are obviously not precise to the final digit, even if the bureau releases that level of precision? Kenny et al. describe how “[e]ven minute differences in population parity across congressional districts must be justified, including those smaller than the expected error [in] decennial census figures.” From this discussion around population parity and later discussion around the Voting Rights Act, it seems clear that there’s been a strong consensus legally that we will treat the Census figures as if they are perfect (again, Chuck Manski has a term for this – conventional certitude).

How much of this willful ignorance can be argued to be reasonable, since it supports some consensus on when district creation might be biased? To what extent might it instead represent collective buy-in to an illusion about the precision of census data? (One which I imagine many users of Census data may hate to see shattered, since it would complicate their analysis pipelines.) Previously it has seemed to me that some arguments over differential privacy are a bit like knee-jerk reactions to the possibility that we might have to do more to model aspects of the data generating process (which DP, when not accompanied by post-processing, allows us to do pretty cleanly). danah boyd has written a little about this in trying to unpack reasons behind the communication breakdown between some stakeholders and the bureau. Kenny et al. describe how “it remains to be seen whether the Supreme Court will see deviations due to Census privacy protection as legitimate;” yep, I’m curious to see how that goes down.

Kenny et al. next use simulation to compare the partisan composition of possible plans, specifically how the biases in precinct-level populations affect one’s ability to detect partisan and racial bias in redistricting plans (such as packing and cracking). They use an approach to detecting gerrymandered plans with each data source informed by common practice, which is to simulate many possible plans using a redistricting algorithm and then look at where the expected election results from the enacted plan falls in the distribution of expected election results from the set of simulated (unbiased) plans. 

So they simulate plans using North and South Carolina, Delaware, and Pennsylvania data, and plot the distribution of differences between the enacted plan’s expected Democratic vote share and each simulated plan’s expected Democratic vote share. When this distribution of differences is completely above zero, the enacted plan had fewer Democratic voters than would be expected, and vice versa if it is below zero. They find that for a handful of enacted plans compared against the distribution of simulated plans generated using Census 2010 data, what looks like evidence of packing or cracking reverses when the simulated plans are instead generated using DAS 12.2 data (see the light blue (Census 2010) vs. green (DAS-12.2) boxplots on the right half of the upper chart below of South Carolina house elections, which orders districts by ascending expected Democratic vote share). They also do a similar analysis on racial composition. At the district level, results are again mixed across the states and types of districts, with congressional-district makeup in some states (North and South Carolina) relatively unaffected, while others show that the DAS datasets (up to epsilon 12.2) reverse evidence of packing or cracking of Black populations that can be seen using Census 2010. Results are less severe but qualitatively similar with epsilon 19.61 (bottom chart), where they note that a few districts can differ across the DAS datasets by up to a few percentage points, potentially shifting the direction of measured partisan bias for a plan. Again we should keep in mind that the Census 2010 data is itself obscured, and we don’t know exactly how much this might affect our ability to detect actual packing and cracking under that dataset.
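The simulation-based test described above can be sketched in a few lines. This is a stylized version with made-up numbers, not the redist machinery: compare the enacted plan's expected Democratic vote share in a district to the distribution of shares from many neutrally simulated plans.

```python
import numpy as np

def gerrymander_signal(enacted_share, simulated_shares):
    """Differences between simulated plans' expected Democratic vote share and
    the enacted plan's. If nearly the whole distribution is above zero, the
    enacted plan has fewer Democratic voters in this district than expected
    under neutral map-drawing (suggestive of cracking)."""
    diffs = np.asarray(simulated_shares) - enacted_share
    return diffs, (diffs > 0).mean()

rng = np.random.default_rng(1)
# Hypothetical: 1000 simulated plans for one district, enacted share 0.45
sims = rng.normal(loc=0.50, scale=0.01, size=1000)
diffs, frac_above = gerrymander_signal(0.45, sims)
print("fraction of simulated plans above enacted:", frac_above)
```

The fragility Kenny et al. document amounts to this reference distribution shifting when the simulated plans are drawn from DAS data instead of Census 2010 data, which can move the enacted plan from the tail of the distribution to its middle, or vice versa.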

South Carolina state house packing vs. cracking, Census 2010 vs. DAS

They also compare the probability of a precinct being included in a majority minority district (MMD) under different datasets, since evidence that the creation of MMDs is being prevented is apparently frequently the focus of voting rights cases. Here they find that as their algorithm for finding MMDs (defined by Black population and applied to the Louisiana State House districts in Baton Rouge) searches more aggressively for MMDs (y-axis on the below figure), the DAS data leads to different membership probabilities for different precincts. Without knowing much about how the algorithm encouraging MMD formation works, I find the implications of this relationship between aggressiveness of MMD attempts and error relative to 2010 hard to understand, but I guess there’s a presumption that since these districts come up in Voting Rights Act cases we should assume that aggressive MMD search can happen. 

Majority minority district results - differences in probability

A critique of this analysis by Sam Wang and Ari Goldbloom-Helzner argues that the paper implies that a precinct jumping between having a 49.9% versus 50.1% chance of being included in an MMD across different data sources is meaningful, when in reality such a difference is so small as to be of no practical consequence. They mention finding in their own analysis (of Alabama state house districts) that for districts that were in the range of 50.0% to 50.5% Black voting age population, there was less than 0.1 percentage point (as fraction of total voting age population) difference in the majority of cases, and always less than 0.2 percentage point difference. They also question whether Kenny et al.’s results may be obscured by artifacts from rerunning the simulation algorithm anew on each different dataset to produce maps, rather than keeping maps consistent across datasets. 

I have my own questions about the precision of the redistricting analyses. In motivating their analyses Kenny et al. describe: 

If changes in reported population in precincts affect the districts in which they are assigned to, then this has implications for which parties win those districts. While a change in population counts of about 1% may seem small, differences in vote counts of that magnitude can reverse some election outcomes. Across the five U.S. House election cycles between 2012 and 2020, 25 races were decided by a margin of less than a percentage point between the Republican and Democratic party’s vote shares, and 228 state legislative races were decided by less than a percentage point between 2012 and 2016. 

Should we not distinguish between the level of precision with which we can describe an election outcome after it has taken place, and our ability to forecast a priori with that level of precision? If we accept that Census figures are presented with much more precision than is warranted by the data collection process to begin with, would we still be as concerned over small shifts under the new DAS, or would we arrive at new higher thresholds on the amount of deviation that’s no longer acceptable? 

All in all, I find the evidence of biases that Kenny et al. provide eye opening but am still mulling over what to make of them in light of entrenched (and seemingly incredible) expectations about precision when using Census data for districting.

There’s a final set of analyses in the paper related to establishing evidence that race and vote choice are correlated through Bayesian Improved Surname Geocoding. Since this post is already long, I’ll cover that in a separate post coming soon. There is lots more to chew on there. 

P.S. Thanks to Abie Flaxman and Priyanka Nanayakkara for comments on a draft of this post. It takes a village to sort out all of the implications of the new DAS.

Webinar: Kernel Thinning and Stein Thinning

This post is by Eric.

Tomorrow, we will be hosting Lester Mackey from Microsoft Research. You can register here.


This talk will introduce two new tools for summarizing a probability distribution more effectively than independent sampling or standard Markov chain Monte Carlo thinning:

  • Given an initial n-point summary (for example, from independent sampling or a Markov chain), kernel thinning finds a subset of only square-root-n points with comparable worst-case integration error across a reproducing kernel Hilbert space.
  • If the initial summary suffers from biases due to off-target sampling, tempering, or burn-in, Stein thinning simultaneously compresses the summary and improves the accuracy by correcting for these biases.
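To give a flavor of the first idea, here's a toy greedy subset-selection sketch in the spirit of kernel herding. To be clear, this is not the kernel thinning algorithm from the talk (which has much stronger guarantees); it's just a minimal illustration of compressing n points to a √n-point summary whose kernel mean tracks the full sample's:

```python
import numpy as np

def gauss_kernel(a, b, bw=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / bw) ** 2)

def greedy_thin(x, m, bw=1.0):
    """Greedily pick m of n points so the subset's kernel mean stays close to
    the full sample's (a kernel-herding-style heuristic)."""
    K = gauss_kernel(x, x, bw)        # n x n Gram matrix
    target = K.mean(axis=1)           # kernel mean embedding of the full sample
    chosen, running = [], np.zeros(len(x))
    for t in range(m):
        # score: match the target embedding, penalize points near ones already chosen
        score = target - running / (t + 1)
        score[chosen] = -np.inf       # never pick the same point twice
        i = int(np.argmax(score))
        chosen.append(i)
        running += K[i]
    return np.array(chosen)

rng = np.random.default_rng(2)
x = rng.normal(size=400)
idx = greedy_thin(x, m=20)            # sqrt(400) = 20 points
print("full mean:", x.mean(), "thinned mean:", x[idx].mean())
```

Compare this to standard MCMC thinning, which just keeps every k-th draw and ignores where the kept points actually sit relative to the target distribution.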

These tools are especially well-suited for tasks that incur substantial downstream computation costs per summary point, like organ and tissue modeling, in which each simulation consumes thousands of CPU hours.

About the speaker

Lester Mackey is a Principal Researcher at Microsoft Research, where he develops machine learning methods, models, and theory for large-scale learning tasks driven by applications from healthcare, climate forecasting, and the social good.  Lester moved to Microsoft from Stanford University, where he was an assistant professor of Statistics and (by courtesy) of Computer Science.  He earned his Ph.D. in Computer Science and MA in Statistics from UC Berkeley and his BSE in Computer Science from Princeton University.  He co-organized the second place team in the Netflix Prize competition for collaborative filtering, won the Prize4Life ALS disease progression prediction challenge, won prizes for temperature and precipitation forecasting in the yearlong real-time Subseasonal Climate Forecast Rodeo, and received best paper and best student paper awards from the ACM Conference on Programming Language Design and Implementation and the International Conference on Machine Learning.

Cross validation and model assessment comparing models with different likelihoods? The answer’s in Aki’s cross validation faq!

Nick Fisch writes:

After reading your paper “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC”, I am curious as to whether the criteria WAIC or PSIS-LOO can be used to compare models that are fit using different likelihoods? I work in fisheries assessment, where we are frequently fitting highly parameterized nonlinear models to multiple data sources using MCMC (generally termed “integrated fisheries assessments”). If I build two models that solely differ in the likelihood specified for a specific data source (one Dirichlet, the other Multinomial), would WAIC or loo be able to distinguish these, or must I use some other method to compare the models (such as goodness of fit, sensitivity, etc.)? I should note that the posterior distribution will be the unnormalized posterior distribution in these cases.

My response: for discrete data I think you’d just want to work with the log probability of the observed outcome (log p), and it would be fine if the families of models are different. I wasn’t sure what was the best solution with continuous variables, so I forwarded the question to Aki, who wrote:

This question is answered in my [Aki’s] cross validation FAQ:

12 Can cross-validation be used to compare different observation models / response distributions / likelihoods?

First to make the terms more clear, p(y∣θ) as a function of y is an observation model and p(y∣θ) as a function of θ is a likelihood. It is better to ask “Can cross-validation be used to compare different observation models?”

– You can compare models given different discrete observation models, and it’s also allowed to have different transformations of y as long as the mapping is bijective (the probabilities will then stay the same).
– You can’t compare densities and probabilities directly. Thus you can’t compare a model given a continuous observation model with one given a discrete observation model, unless you compute probabilities in intervals from the continuous model (also known as discretising the continuous model).
– You can compare models given different continuous observation models, but you must have exactly the same y (loo functions in rstanarm and brms check that the hash of y is the same). If y is transformed, then the Jacobian of that transformation needs to be included. There is an example of this in the mesquite case study.

It is better to use cross-validation than WAIC as the computational approximation in WAIC fails more easily and it’s more difficult to diagnose when it fails.
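To make the Jacobian point above concrete, here's a numpy-only sketch (with simulated data and in-sample log scores rather than actual cross-validation, just to show the adjustment term; this is not the rstanarm/brms machinery). A normal model fit to z = log(y) can only be scored on the scale of y after adding the log Jacobian, log|dz/dy| = −log(y):

```python
import numpy as np

def norm_logpdf(x, mu, sigma):
    """Log density of a normal distribution, elementwise."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)

# Model A: normal observation model on the original scale of y
lpd_a = norm_logpdf(y, y.mean(), y.std())

# Model B: normal observation model on z = log(y); to score it on the
# scale of y we must add the log Jacobian, log|dz/dy| = -log(y)
z = np.log(y)
lpd_b = norm_logpdf(z, z.mean(), z.std()) - np.log(y)

print("total log score, model A:", lpd_a.sum())
print("total log score, model B:", lpd_b.sum())
```

Since the data here really are lognormal, model B (with the Jacobian term) should score better; without the −log(y) term, the two totals would not be comparable at all, since one would be a density over y and the other a density over log(y).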

P.S. Nick Fisch is a Graduate Research Fellow in Fisheries and Aquatic Sciences at the University of Florida. How cool is that? I’m expecting to hear very soon from Nick Beef at the University of Nebraska.

American Causal Inference Conference 2022

Avi Feller and Maya Petersen write:

Join us at the long-awaited American Causal Inference Conference 2022 @ UC Berkeley

When: May 23-25th, 2022

Where: UC Berkeley

What: ACIC is the oldest and largest meeting for causal inference research and practice, bringing together hundreds of participants from a range of disciplines since 2004. After the pandemic postponed both the 2020 and 2021 meetings, ACIC 2022 @ UC Berkeley will be an exciting opportunity to engage with this rapidly expanding field and with the many people advancing the theory, methodology, and application of causal inference.

Details: See acic.berkeley.edu for more. Registration to open November 1, 2021.

I’ve heard that previous conferences in this series have been both fun and productive. It used to be called the Atlantic Causal Inference Conference but then it went continental and they didn’t even have to change their acronym!

Found poetry 2021

The poem is called “COVID-19 and Flu Vaccination | Walgreens Immunization Services.pdf,” and here it is:

What happened was we went online to Walgreens to schedule our booster shots. When we printed out the receipt, lots of extra pages spewed out, and we were using these as scratch paper. And then one day I noticed the above string of code words that didn’t seem to go together. Or maybe they do go together in some way. I guess that’s the point of found poetry, to reveal connections that we hadn’t thought about.

Just to break the spell a bit, I’ll try to analyze what makes this poem seem so striking and absurd. What first jumps out is “Modern Slavery and Human Trafficking,” which is so horrible and doesn’t seem to fit in with anything else here, except I guess “Affiliate Program”? And then something about how “Do Not Sell My / Personal / Information” is broken into three lines, which deconstructs the message, in the same way that if you look at a word very carefully, letter by letter, it starts to fall apart in your mind. Finally, the juxtaposition of all this corporate-speak seems to speak poetically about the contradictions of the modern world, or of any world.

Suburban Dicks

The cover drew my attention so I opened up the book and saw this Author’s Note:

West Windsor and Plainsboro are real towns in New Jersey. Unlike the snarky description in the book, it’s a pretty good area to live in . . . I mean, as far as New Jersey goes . . .

I was amused enough to read this aloud to the person with me in the bookstore, who then said, This sounds like your kind of book. You should get it. So I did.

I liked the book. It delivered what it promised: fun characters, jokes, and enough plot to keep me turning the pages. It also did a good job with the balance necessary in non-hard-boiled mystery stories, keeping it entertaining without dodging the seriousness of the crimes. This balance is not always easy; to my taste, it’s violated by too-cosy detective stories on one hand and obnoxious tough-guy vigilantes on the other.

Suburban Dicks has some similarities to Knives Out, with a slightly different mix of the ingredients of characters, plot, laughs, and social commentary. Briefly: Knives Out was a more professional piece of work (we’ll get back to that in a moment) and had a much better plot, both in its underlying story and how it was delivered. The two were equally funny but in different ways: as one might guess from its title, Suburban Dicks was more slapstick, more of a Farrelly brothers production, in the book’s case leaning into the whole Jersey thing. Suburban Dicks was also more brute force in its social commentary, to the extent that it could put off some readers, but to me it worked, in the sense that this was the story the author wanted to tell.

Suburban Dicks works in large part because of the appealing characters and the shifting relationships between them, all of which are drawn a bit crudely but, again, with enough there to make it all work. I liked the shtick, and I liked that the characters had lives outside the plot of the story.

This all might sound like backhanded compliments, and I apologize for that, because I really enjoyed the book: it was fun to read, and I felt satisfied with it when it was all over. What’s most relevant to the experience, both during the reading and in retrospect, are the strengths, not the weaknesses.

One more thing, though. The book is well written, but every once in a while there’s a passage that’s just off, to the extent that I wonder if the book had an editor. Here’s an example:

“Listen, I know what it sounds like, but, I don’t know, think of it this way,” Andrea said. “You were a child-psych major at Rutgers, right? And you got a job at Robert Wood Johnson as a family caseworker for kids in the pediatric care facility, right?”


Who talks that way? This is a classic blunder, to have character A tell character B something she already knows, just to inform the reader. I understand how this can happen—in an early draft. But it’s the job of an editor to fix this, no?

But then it struck me . . . nobody buys books! More books are published than ever before, but it’s cheap to publish a book. Sell a few thousand and you break even, I guess. (Maybe someone in comments can correct me here.) There’s not so much reading for entertainment any more, not compared to the pre-internet days. I’m guessing the economics of book publishing is that the money’s in the movie rights. So, from the publisher’s point of view, the reason for this book is not so much that it might sell 50,000 copies and make some money, but that they get part of the rights for the eventual filmed version (again, experts on publishing, feel free to correct me on this one). So, from that point of view, who cares if there are a few paragraphs that never got cleaned up? And, to be honest, those occasional slip-ups didn’t do much to diminish my reading experience. Seeing some uncorrected raw prose breaks the fourth wall a bit, but the book as a whole is pretty transparent; indeed, there’s a kind of charm to seeing the author as a regular guy who occasionally drops a turd of a paragraph.

It makes me sad that there was no editor to carefully read the book and point out the occasional lapses in continuity, but I can understand the economics of why the publisher didn’t bother. I’m sure the eventual movie script will be looked over more carefully.

In any case, let me say again that I enjoyed the book and I recommend it to many of you. After reading it, I googled the author’s name and found out that he writes for comic books, most famously creating the character Deadpool. His wikipedia page didn’t mention Suburban Dicks at all so I added something. And then, in preparing this post, I googled again and came across this article, “Legendary comic book writer’s first novel set in West Windsor,” from a Central New Jersey news site, from which I learned:

Suburban Dicks debuted to rave reviews and Nicieza has already been contracted to write a sequel. The book has also been optioned for a television show.

Good to hear.

Actually, this news article, by Bill Sanservino, is excellent. It includes a long and informative interview with Nicieza, interesting in its own right and also in the light it sheds on his book. It’s a long, long interview with lots of good stuff.

There’s only one thing that puzzles me. In the interview, Nicieza talks about all the editors who helped him on the book. That’s cool; there’s no need to do it all yourself. But how could there be all those editors . . . and none of them caught the paragraph quoted above, and a few others throughout the book, that just jumped out at me when I read them? I don’t get it.

“Causal Impact of Masks, Policies, Behavior on Early Covid-19 Pandemic in the U.S”: Chernozhukov et al. respond to Lemoine’s critique

Victor Chernozhukov writes:

Two months ago your blog featured Philip Lemoine’s critique “Lockdowns, econometrics and the art of putting lipstick on a pig” of our paper “Causal Impact of Masks, Policies, Behavior on Early Covid-19 Pandemic in the US” (ArXiv, June 2020, published in Journal of Econometrics). The paper found mitigating effects of masks and personal behavior and could not rule out significant mitigation effects of various “shut down” policies. Over the last two months, we studied Lemoine’s critique, and we prepared a detailed response.

Although Lemoine’s critique appears ideologically driven and overly emotional, some of the key points are excellent and worth addressing. In particular, the sensitivity of our estimation results to (i) including “masks in public spaces” and (ii) updating the data seemed like important critiques, and therefore we decided to analyze the updated data ourselves.

After analyzing the updated data, we find evidence that reinforces the conclusions reached in the original study. In particular, focusing on the first three points to keep this note short:

(1) Lemoine showed that replacing “masks for employees” (business mask mandates) by “masks in public spaces” (public mask mandates) changes the effect estimate from negative to slightly positive. This critique is an obvious mistake because dropping the “masks for employees” variable introduces a confounder bias in estimating the effect of “masks in public spaces.” When we include both “masks for employees only” and “masks in public spaces” in the regression, the estimated coefficients of both variables are substantially negative in the original data. Lemoine’s argument seems to be an obvious but honest mistake.

(2) The second main point of Lemoine’s critique is non-robustness of results with respect to the data update. However, Lemoine has not validated the new data. We find that the timing of the first mask mandate for Hawaii (and another state) is mis-coded in the updated data. After correcting this data mistake, the estimated coefficients of “masks for employees only” and “masks in public spaces” continue to be substantially negative. This critique is also an honest (though not obvious) mistake.

(3) Lemoine analyzed the updated data that kept the original sample period from March 7 to June 3, 2020. The negative effects of masks on case growth continue to hold when we extend the endpoint of the sample period to July 1, August 1, or September 1 (before the start of school season). With the extended data, the estimated coefficients of “masks in public spaces” range from −0.097 to −0.124 with standard errors of 0.02 ∼ 0.03 in Tables 5-7, and are roughly twice as large as those of “masks for employees only.” A preprint version of our paper was available in ArXiv in late May of 2020 and was submitted for publication shortly after, which is why we did not analyze either the updated data or the extended data in our original paper.

Response to other points raised and supporting details on (1)-(3) are given in the ArXiv paper.

It’s great that outsiders such as Lemoine can contribute to the discussion, and I think it’s great that Chernozhukov et al. replied. I’ll leave the details to all of you.
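Point (1) of the response describes textbook omitted-variable bias. Here is a toy simulation in Python (the coefficients, correlation, and variable roles are all hypothetical, not taken from the paper’s data) showing how dropping one of two correlated regressors, both with true negative effects, can flip the sign of the remaining estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two negatively correlated policy indicators, both with true negative
# effects on the outcome (all numbers are made up for illustration).
x1 = rng.normal(size=n)                          # stand-in for one mandate
x2 = -0.8 * x1 + rng.normal(scale=0.6, size=n)   # stand-in for the other
y = -1.0 * x1 - 0.3 * x2 + rng.normal(size=n)

# Full regression recovers both negative coefficients.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Dropping x1 biases the x2 coefficient by beta1 * Cov(x1,x2) / Var(x2),
# here (-1.0) * (-0.8) / 1.0 = +0.8, enough to flip -0.3 to about +0.5.
X_short = np.column_stack([np.ones(n), x2])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

print(b_full[2])   # near the true -0.3
print(b_short[1])  # positive: the confounder bias flipped the sign
```

The bias formula is why including both variables in the same regression, as the reanalysis does, is the safer specification.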

Standard Deviation by Katherine Heiny

It’s a novel, not a statistics book, and actually the novel has nothing to do with statistics, but that’s fine, I didn’t read the book because of its title, I read it because it was mentioned on the radio. Anyway, I liked the book. It reminded me a lot of Anne Tyler, and that’s not a bad thing.

The one thing that struck me about Heiny’s book more than anything else, though, was how Waspy it was. Really Waspy. Really really Waspy. OK, one of the characters had an Irish last name so that doesn’t quite fit—but he’s Protestant and he seems to have no ethnicity at all. Certainly nothing about any Irish ancestors. His first name is Graham, for chrissake! And being an American while having no ethnicity—that’s pretty Waspy.

So one thing I enjoyed about reading this novel was that it was a kind of travelogue into an alien culture that exists in parallel with mine. The characters live in New York—not far from our neighborhood, actually—but in this alternative world in which everybody has a “den” in their apartment, eats potato salad, and drinks a lot of alcohol. When they’re not drinking, they’re thinking about drinking. And in a Waspy way—we’re not talking six-packs of Bud here. The female characters are always putting on make-up. The characters get around in the city by driving their car! Who does that? Lots of people do, I guess. They live among us. The city—any city—is many parallel cities.

I’m worried that I’m giving the impression here that I’m making fun of this novel. I’m not. It’s as legitimate to write about Wasps as it is to write about Irish people, or Jews, or Nigerian immigrants, or whatever. I’ve read lots and lots of fiction about Wasps and Wasp-adjacent cultures: Cheever, Updike, etc. This book was particularly fun because the cultural characteristics were taken up a notch into the level of parody: the characters didn’t just drink, it was more like alcohol was a major presence in their lives. The main male character wasn’t just a self-styled gourmet chef, he cooked retro favorites like beef Stroganoff. And so on. It was a comic novel, in the style of Philip Roth or, as I said above, Anne Tyler.

I’m not saying I liked the book because of the ethnic stereotyping. I liked it because of the writing and the characters and the situations. Lots of funny and perceptive lines, interesting interactions between the characters. The ethnic bit was just something that jumped out at me.

P.S. I wanted to check on the Anne Tyler comparison so I picked one of her books off the shelf, and it featured this ’80s-vintage blurb: “Among the finest woman novelists publishing in America today. — Philadelphia Inquirer.” Oof!

I have some skepticism about this “Future Directions for Applying Behavioral Economics to Policy” project

Kevin Lewis points to this announcement:

The National Academies of Sciences, Engineering, and Medicine is undertaking a study to review the evidence regarding the application of insights from behavioral economics to key public policy objectives (e.g., related to public health, multiple areas of chronic illness . . . economic well-being, responses to global climate change). The committee will examine applications from the past 5 to 10 years . . . and also less successful applications that may offer valuable lessons. The committee will also examine main controversies that have arisen as field has progressed . . .

The study will be carried out by a committee of approximately 12 volunteer experts in the fields of: Economics, Behavioral Economics, Psychology, Medicine, Cognitive Science (e.g. judgment and decision making) and Methodology.

The National Academies are committed to enhancing diversity and inclusion in order to strengthen the quality of our work. Diverse perspectives contribute to finding innovative approaches and solutions to challenging issues. . . .

Lewis asks, “How about just ‘rigorous empiricism’ instead of arbitrarily limiting to ‘behavioral economics’?”

But I’m wondering about the last bit of the announcement. I agree on the value of diverse perspectives, but they list 6 fields:

2 are economics
2 are psychology
1 is medicine
1 is “methodology” (which is not actually a field).

Economics, psychology, and medicine are fine—but do you really need to include economics and psychology twice each? And, if we’re looking at public policy objectives, does it make sense to get someone from medicine rather than public health? And what about public policy and political science? And what’s with “methodology”? Is that a way to say statistics without saying statistics? In any case, I don’t think the above list is a good start if their goal is diverse perspectives.

More generally, I’m concerned about such a project coming out of the National Academy of Sciences. I’m sure this organization does lots of wonderful things, but just by its nature it represents entrenched powerful conservative forces in science, which would seem to make it not the best organization to take a critical perspective on the existing scientific establishment, which in turn will get in the way of the goal of studying “controversies that have arisen as field has progressed.” I fear there will be just too much pressure for happy talk along the lines of, “Sure, science has some problems but a few tweaks will solve them, and we can move ever upward in a glorious spiral of NPR appearances, Ted talks, and corporate and government funding.” Don’t get me wrong, I love media appearances and corporate and government funding as much as the next researcher; I just don’t anticipate much value in a blue-ribbon panel promoting complacency.

On the plus side, they promise to examine “less successful applications that may offer valuable lessons.” I can give them many examples there! I’d also like them to examine the unfortunate tendency of promoters of behavioral economics to drop past failures into the memory hole rather than admit they got conned. It’s hard to learn from your failures when you never talk about them.

At the end of the above National Academies page, it says:

We invite you to submit nominations for committee members and/or reviewers for this study by October 25, 2021.

Here are some nominations for the committee:

Nick Brown
John Bullock
Anna Dreber
Sander Greenland
Andy Hall
Megan Higgs
Sabine Hossenfelder
Blake McShane
Beth Tipton
Simine Vazire

I could think of many more but this is a start. I even included an economist and a psychologist—see how open-minded I am!

And here are some suggested reviewers:

Malcolm Gladwell
Mark Hauser
Anil Potti
Diederik Stapel
Matthew Walker
Brian Wansink

A couple of those guys might be busy, but most of them should have pretty open calendars and so could devote lots of time to a careful read of the report.

P.S. Further thoughts from Peter Dorman in comments.

Why are goods stacking up at U.S. ports?

This post is by Phil Price, not Andrew.

I keep seeing articles that say U.S. ports are all backed up, hundreds of ships can’t even offload because there’s no place to put their cargo, etc. And then the news articles will quote some people saying ‘this is a global problem’, ‘there is no single solution’, and so on. I find this a bit perplexing, although I feel like my perplexification could be cleared up with some simple data. How many containers per day typically arrived at U.S. ports pre-pandemic, and how many are arriving now? How many truck drivers were on the road on a typical day in the U.S. pre-pandemic, and how many are on the road now? How many freight train employees were at work on a typical day pre-pandemic, and how many are at work now?

I understand that there are problems all over the place: various cities and countries go in and out of lockdown, companies have gone out of business, factories have closed, there are shortages of raw materials and machine parts etc. due to previous and current pandemic-related shutdowns…that’s all fine, but it does nothing to explain why goods that are sitting at US ports are not moving. Have all of the U.S. truck drivers died of COVID or something? Inquiring minds want to know!

As Seung-Whan Choi explains: Just about any dataset worth analyzing is worth reanalyzing. (The story of when I was “canceled” by Perspectives on Politics. OK, not really.)

The book is by Seung-Whan Choi and my review is here. I’ll repost it in a moment, but first I’ll share the strange story of how this review came to be and why it was never published.

Back in 2017 I received an email from the journal Perspectives on Politics to review this book. I was kinda surprised because I don’t know anything about international relations. But then when I received the book in the mail and read it, I understood, as its arguments were all about quantitative methods.

So I wrote my review and sent it in. A few months later I received the page proofs, and they’d changed a lot of my writing! I assumed someone screwed up somewhere and I replied with an annoyed message that the article they sent me was not what I wrote, and they should change it back right away. Then the editors replied with their own annoyed message that as the journal editors they had the right to rewrite whatever they wanted. Or something like that. I don’t remember all the details and the relevant emails don’t seem to be saved on my computer, but the upshot was that I was annoyed, they were annoyed, and they decided not to run the review. Which is fine. It’s their journal and they have the absolute right to decide what to publish in it.

I pretty much forgot the whole story, but then I was cleaning my webpage and noticed this article that was never published, and it struck me that I went to the trouble of reviewing this person’s book but then neither he nor pretty much anybody else had seen the review.

That seemed like a waste.

So I’m reposting the review here:

New Explorations into International Relations: Democracy, Foreign Investment, Terrorism, and Conflict. By Seung-Whan Choi. Athens, Ga.: University of Georgia Press, 2016. xxxiii +301pp. $84.95 cloth, $32.95 paper.

Andrew Gelman, Columbia University (review for Perspectives on Politics)

This book offers a critical perspective on empirical work in international relations, arguing that well-known findings on the determinants of civil war, the democratic or capitalist peace, and other topics are fragile, that the conclusions of prominent and much cited published papers are crucially dependent on erroneous statistical analyses. Choi supports this claim by detailed examination of several of these papers, along with reanalyses of his own. After that, he presents several completely new analyses demonstrating his approach to empirical work on international relations topics ranging from civilian control of the military to the prevention of terrorism.

I have no expertise on international relations and would not feel comfortable weighing the arguments in any of the examples under consideration. Suffice it to say that I find Choi’s discussions of the substantive positions and the reasoning given on each side of each issue to be clear, and the topics themselves are both important and topical. The book seems, at least to this outsider, to present a fair overview of several controversial topics in modern international relations scholarship, along with an elucidation of the connections between substantive claims, statistical analysis, and the data being used to support each position; as such, I would think it would serve as an excellent core for a graduate seminar.

As a methodologist, my main problem with Choi’s reanalyses is their reliance on a few tools—regression, instrumental variables, and statistical significance—that I do not think can always bear the burden of what they are being asked to do. I am not saying that these methods are not useful, nor am I criticizing Choi for using the same methods for different problems—it makes sense for any analyst, the present writer included, to rely heavily on the methods with which they are most familiar. Rather, I have specific concerns with the routine attribution of causation to regression coefficients.

For the purpose of this review I do not attempt to carefully read or evaluate the entire book; instead I focus on chapter 1, a reevaluation of James Fearon and David Laitin’s paper, “Ethnicity, Insurgency, and Civil War” (American Political Science Review, 97(1)), and chapter 2, on the democratic or capitalist peace. In both chapters I am convinced by Choi’s arguments about the fragility of published work in this area but am less persuaded by his own data analyses.

The abstract to Fearon and Laitin (2003) begins: “An influential conventional wisdom holds that civil wars proliferated rapidly with the end of the Cold War and that the root cause of many or most of these has been ethnic and religious antagonisms. We show that the current prevalence of internal war is mainly the result of a steady accumulation of protracted conflicts since the 1950s and 1960s rather than a sudden change associated with a new, post-Cold War international system. We also find that after controlling for per capita income, more ethnically or religiously diverse countries have been no more likely to experience significant civil violence in this period.”

Fearon and Laitin’s language moves from descriptive (“proliferated rapidly”) to causal (“root cause . . . the result of . . .”), then back to descriptive (“no more likely”). The paper continues throughout to mix descriptive and causal terms such as “controlling for,” “explained by,” “determinant,” “proxy,” and “impact.” In a June 27, 2012, post on the Monkey Cage political science blog, Fearon was more explicitly predictive: “The claim we were making was not about the motivations of civil war participants, but about what factors distinguish countries that have tended to have civil wars from those that have not.” Fearon also wrote, “associating civil war risk with measures of grievance across countries doesn’t tell us anything about the causal effect of an exogenous amping up [of] grievances on the risk of civil war.”

There is nothing wrong with going back and forth between descriptive analysis and causal theorizing—arguably this interplay is at the center of social science—but the result can be a blurring of critical discussion. Choi also oscillates between descriptive terms such as “greater risk” and “likely to experience” civil war and causal terms such as “endogeneity” and “the main causes” (p.2). Choi criticizes Fearon and Laitin’s estimates as being “biased at best and inaccurate at worst” (p.3), a characterization I do not understand—but in any case the difficulty here is that bias of an estimator can only be defined relative to the estimand—the underlying quantity being estimated—and neither Fearon/Laitin nor Choi seem to have settled on what this quantity is. Yes, they are modeling the probability of outbreak of civil war, but it is not clear how one is supposed to interpret the particular parameters in their models.

Getting to some of the specifics, I am skeptical of the sort of analysis that proceeds by running a multiple regression (whether directly on data or using instrumental variables) and giving causal interpretations to several of its coefficients. The difficulty is that each regression coefficient is interpretable as a comparison of items (in this case, country-years) with all other predictors held constant, and it is rare to be able to understand more than one coefficient in this way in a model fit to observational data.

I have similar feelings about the book’s second chapter, which begins with a review of the literature on the democratic or capitalist peace, a topic which is typically introduced to outsiders in general terms such as “Democracies never fight each other” but then quickly gets into the mud of regression specifications and choices of how exactly to measure “democratic” or “peace.” As in the civil war example discussed above, I am more convinced by Choi’s criticisms of the sensitivity of various published claims to assumptions and modeling choices than I am by his more positive claim that, after correction of errors, “democracy retains its explanatory power in relation with interstate conflict” (p.38). Explanatory power depends on what other predictors are in the model, which reminds us that descriptive summaries, like causal claims, do not occur in a vacuum.

Where, then, does this lead us? The quick answer is that statistical analysis of historical data can help us build and understand theories but can rarely on its own provide insight about the past and direct guidance about the future. We can use data analysis within a model to estimate parameters and evaluate, support, or rule out hypotheses; and we can also analyze data more agnostically or descriptively to summarize historical patterns or reveal patterns or anomalies that can raise new questions and motivate new theoretical work.

As Choi explains in his book, just about any dataset worth analyzing is worth reanalyzing: “The charge that replication studies produce no stand-alone research is ironic in the sense that most empirical research already relies on publicly available data sources . . . Stand-alone researchers claim to be doing original work, but their data often comes from collections previously published by private and government agencies” (p.xxv). I expect that Choi’s explications and reanalyses in several important areas of international relations will be of interest to students and scholars in this field, even if I have qualms about his readiness to assign causal interpretations to regression coefficients.

Perspectives on Politics really dodged a bullet by not publishing that review as written, huh? The only thing that puzzles me is why they asked me to write the review in the first place. What did they think they were going to get from me, exactly?

Can the “Dunning-Kruger effect” be explained as a misunderstanding of regression to the mean?

The above (without the question mark) is the title of a news article, “The Dunning-Kruger Effect Is Probably Not Real,” by Jonathan Jarry, sent to me by Herman Carstens. Jarry’s article is interesting, but I don’t like its title, because I don’t like the framing of this sort of effect as “real” or “not real.” I think that all these sorts of effects are real, but they vary: sometimes the effects are large, sometimes they’re small; sometimes they’re positive and sometimes negative. So the real question is not, “Are these effects real?”, but “What’s really going on?”

Jarry writes:

First described in a seminal 1999 paper by David Dunning and Justin Kruger, this effect has been the darling of journalists who want to explain why dumb people don’t know they’re dumb.

I [Jarry] was planning on writing a very short article about the Dunning-Kruger effect and it felt like shooting fish in a barrel. Here’s the effect, how it was discovered, what it means. End of story.

But as I double-checked the academic literature, doubt started to creep in. . . .

In a nutshell, the Dunning-Kruger effect was originally defined as a bias in our thinking. If I am terrible at English grammar and am told to answer a quiz testing my knowledge of English grammar, this bias in my thinking would lead me, according to the theory, to believe I would get a higher score than I actually would. And if I excel at English grammar, the effect dictates I would be likely to slightly underestimate how well I would do. I might predict I would get a 70% score while my actual score would be 90%. But if my actual score was 15% (because I’m terrible at grammar), I might think more highly of myself and predict a score of 60%. . . .

This is what student participants went through for Dunning and Kruger’s research project in the late 1990s. There were assessments of grammar, of humour, and of logical reasoning. Everyone was asked how well they thought they did and everyone was also graded objectively, and the two were compared. . . .

In the original experiment, students took a test and were asked to guess their score. Therefore, each student had two data points: the score they thought they got (self-assessment) and the score they actually got (performance). In order to visualize these results, Dunning and Kruger separated everybody into quartiles: those who performed in the bottom 25%, those who scored in the top 25%, and the two quartiles in the middle. For each quartile, the average performance score and the average self-assessed score was plotted. This resulted in the famous Dunning-Kruger graph.

Jarry continues:

In 2016 and 2017, two papers were published in a mathematics journal called Numeracy. In them, the authors argued that the Dunning-Kruger effect was a mirage. And I tend to agree.

The two papers, by Dr. Ed Nuhfer and colleagues, argued that the Dunning-Kruger effect could be replicated by using random data. . . .

Hey—that sounds like regression to the mean: when evaluating a prediction, you should plot actual vs. predicted, not predicted vs. actual. We discuss this in Section 11.3 of Regression and Other Stories. If you plot predicted vs. actual, you’ll see a slope of less than 1, just from natural variation.
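A minimal simulation along the lines of the Nuhfer papers (all numbers invented): give everyone a latent skill, make both the test score and the self-assessment equal to that skill plus independent noise, and the familiar Dunning-Kruger pattern emerges from regression to the mean alone, with no overconfidence built into the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Latent skill on a 0-100 scale; score and self-assessment share it
# but have independent measurement noise.  No bias anywhere.
skill = rng.normal(50, 10, size=n)
score = np.clip(skill + rng.normal(0, 15, size=n), 0, 100)
self_assessed = np.clip(skill + rng.normal(0, 15, size=n), 0, 100)

# Dunning-Kruger-style summary: bin people by *score* quartile and
# average both quantities within each bin.
cuts = np.quantile(score, [0.25, 0.5, 0.75])
quartile = np.searchsorted(cuts, score)
for q in range(4):
    m = quartile == q
    print(q, round(score[m].mean(), 1), round(self_assessed[m].mean(), 1))

# Bottom quartile: self-assessment well above the average score; top
# quartile: well below.  Binning on the noisy score selects on the
# noise, so the self-assessment regresses back toward the mean.
```

This is the same slope-less-than-1 phenomenon: selecting on one noisy measurement and averaging another pulls the second toward the grand mean.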

Jarry also writes:

A similar simulation was done by Dr. Phillip Ackerman and colleagues three years after the original Dunning-Kruger paper, and the results were similar.

Wait a second! So this isn’t news at all! The Dunning-Kruger claim was shot down nearly 20 years ago. From the abstract of the 2002 article by Phillip Ackerman, Margaret Beier, and Kristy Bowen, published in the journal Personality and Individual Differences:

Recently, it has become popular to state that “people hold overly favorable views of their abilities in many social and intellectual domains” [Kruger, J., & Dunning, D. (1999)] . . . The current paper shows that research from the other side of the scientific divide, namely the correlational approach (which focuses on individual differences), provides a very different perspective for people’s views of their own intellectual abilities and knowledge. Previous research is reviewed, and an empirical study of 228 adults between 21 and 62 years of age is described where self-report assessments of abilities and knowledge are compared with objective measures. Correlations of self-rating and objective-score pairings show both substantial convergent and discriminant validity, indicating that individuals have both generally accurate and differentiated views of their relative standing on abilities and knowledge.

I did some searching on Google Scholar and found this article in the journal Political Psychology from 2018, “Partisanship, Political Knowledge, and the Dunning‐Kruger Effect,” by Ian Anson, which reports:

A widely cited finding in social psychology holds that individuals with low levels of competence will judge themselves to be higher achieving than they really are. . . . Survey experimental results confirm the Dunning‐Kruger effect in the realm of political knowledge. They also show that individuals with moderately low political expertise rate themselves as increasingly politically knowledgeable when partisan identities are made salient.

It's helpful for me to see this in the context of political attitudes, as this relates more closely to my own area of research.

A quick Google search also led to this article from 2020 by Gilles Gignac and Marcin Zajenkowski, "The Dunning-Kruger effect is (mostly) a statistical artefact: Valid approaches to testing the hypothesis with individual differences data," which cites a couple of other papers, Krueger and Mueller (Journal of Personality and Social Psychology, 2002) and Krajc and Ortmann (Journal of Economic Psychology, 2008), that make a similar point.

Anyway, if all these analyses are correct, it’s interesting that people have been pointing it out for nearly twenty years in published papers in top journals, but the message still isn’t getting through.

One relevant point, I guess, is that even if the observed effect is entirely a product of regression to the mean, it's still a meaningful thing to know that people with low measured abilities in these settings are, on average, overestimating how well they'll do. That is, even if this is an unavoidable consequence of measurement error and population variation, it's still happening.

Comments on a Nobel prize in economics for causal inference

A reporter contacted me to ask my thoughts on the recent Nobel prize in economics. I didn’t know that this had happened so I googled *nobel prize economics* and found the heading, “David Card, Joshua Angrist and Guido Imbens Win Nobel in Economics.” Hey—I know two of these people!

Fortunately for you, our blog readers, I’d written something a few years ago on the topic of a Nobel prize in economics for causal inference, so I can excerpt from it here.

Causal inference is central to social science and especially economics. The effect of an intervention on an individual i (which could be a person, a firm, a school, a country, or whatever particular entity is being affected by the treatment) is defined as the difference in the outcome yi, comparing what would have happened under the intervention to what would have happened under the control. If these potential outcomes are labeled as yiT and yiC, then the causal effect for that individual is yiT − yiC. But for any given individual i, we can never observe both potential outcomes yiT and yiC, thus the causal effect is impossible to directly measure. This is commonly referred to as the fundamental problem of causal inference, and it is at the core of modern economics.

Resolutions to the fundamental problem of causal inference are called “identification strategies”; examples include linear regression, nonparametric regression, propensity score matching, instrumental variables, regression discontinuity, and difference in differences. Each of these has spawned a large literature in statistics, econometrics, and applied fields, and all are framed in response to the problem that it is not possible to observe both potential outcomes on a single individual.
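The fundamental problem can be made concrete with a toy simulation (my own sketch, with made-up numbers): we generate both potential outcomes yC and yT for every unit, something no real study ever observes, and then check that randomized assignment lets a simple difference in means recover the average treatment effect even though each unit reveals only one of its two outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y_C = rng.normal(0, 1, n)      # outcome under control
effect = rng.normal(2, 1, n)   # unit-level causal effect, varies by unit
y_T = y_C + effect             # outcome under treatment
true_ate = effect.mean()       # knowable only because this is a simulation

z = rng.integers(0, 2, n)            # randomized assignment
y_obs = np.where(z == 1, y_T, y_C)   # only one potential outcome is observed
est_ate = y_obs[z == 1].mean() - y_obs[z == 0].mean()
print(true_ate, est_ate)             # close, but not identical
```

The difference in means is a good estimate of the average effect here only because z was assigned at random; with self-selected treatment, the comparison would confound the treatment effect with differences between the groups, which is what the identification strategies above are designed to handle.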

From this perspective, what is amazing is that this entire framework of potential outcomes and counterfactuals for causal inference is all relatively recent.

Here is a summary of the history by Guido Imbens:

My [Imbens's] understanding of the history is as follows. The potential outcome framework became popular in the econometrics literature on causality around 1990. See Heckman (1990, American Economic Review, Papers and Proceedings, "Varieties of Selection Bias," 313–318) and Manski (1990, American Economic Review, Papers and Proceedings, "Nonparametric Bounds on Treatment Effects," 319–323). Both those papers read very differently from the classic paper in the econometric literature on program evaluation and causality, published five years earlier (Heckman and Robb, 1985, "Alternative Methods for Evaluating the Impact of Interventions," in Heckman and Singer (eds.), Longitudinal Analysis of Labor Market Data), which did not use the potential outcome framework. When the potential outcome framework became popular, there was little credit given to Rubin's work, but there were also no references to Neyman (1923), Roy (1951), or Quandt (1958) in the Heckman and Manski papers. It appears that at the time the notational shift was not viewed as sufficiently important to attribute to anyone.

Haavelmo is certainly thinking of potential outcomes in his 1943 paper, and I [Imbens] view Haavelmo's paper (and a related paper by Tinbergen) as the closest to a precursor of the Rubin Causal Model in economics. However, Haavelmo's notation did not catch on, and soon econometricians wrote their models in terms of realized, not potential, outcomes, not returning to the explicit potential outcome notation till 1990.

The causality literature is actually one where there is a lot of cross-discipline referencing, and in fact a lot of cross-discipline collaborations between statisticians, econometricians, political scientists and computer scientists.

The potential-outcome or counterfactual-based model of causal inference has led to conceptual, methodological, and applied breakthroughs in core areas of econometrics and economics.

The key conceptual advances come from the idea of a unit-level treatment effect, yiT − yiC, which, although it is unobservable, can be aggregated in various ways. So, instead of the treatment effect being thought of as a parameter ("β" in a regression model), it is an average of individual effects. From one direction, this leads to the "local average treatment effect" of Angrist and Imbens, the principal stratification idea of Frangakis and Rubin, and various other average treatment effects considered in the causal inference literature. Looked at another way, the fractalization of treatment effects allows one to determine what exactly can be identified from any study. A randomized experiment can estimate the average treatment effects among the individuals under study; if those individuals are themselves a random sample, then the average causal effect in the population is also identifiable. With an observational study, one can robustly estimate a local average treatment effect in the region of overlap between treatment and control groups, but inferences for averages outside this zone will be highly sensitive to model specification. The overarching theme here is that the counterfactual expression of causal estimands is inherently nonparametric and frees causal inference from the traditional regression modeling framework. The counterfactual approach thus fits in very well with modern agent-based foundations of micro- and macro-economics which are based on individual behavior.
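The local average treatment effect idea can be sketched in a small simulation (my own hypothetical setup, not from the Angrist-Imbens paper): the population is a mix of compliers, always-takers, and never-takers, with effects that differ by type. A randomized encouragement z shifts treatment only for compliers, so the instrumental-variables (Wald) estimator recovers the average effect among compliers, not the population average effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
kind = rng.choice(["complier", "always", "never"], n, p=[0.5, 0.3, 0.2])
z = rng.integers(0, 2, n)  # randomized encouragement (the instrument)
d = np.where(kind == "always", 1,
             np.where(kind == "never", 0, z))  # treatment actually received
effect = np.where(kind == "complier", 1.0, 3.0)  # effects differ by type
y = rng.normal(0, 1, n) + d * effect

# Wald estimator: effect of z on y, scaled by effect of z on d
wald = ((y[z == 1].mean() - y[z == 0].mean())
        / (d[z == 1].mean() - d[z == 0].mean()))
print(wald)  # ≈ 1.0, the complier average effect
```

The always-takers (whose effect is 3.0) cancel out of the numerator because they are treated in both arms, so the estimate lands near 1.0 rather than the population average effect of 2.0. This is the sense in which what's identified is "local": it is the average effect for the subpopulation whose treatment status the instrument actually moves.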

On the applied side, I think it’s fair to say that economics has moved in the past forty years to a much greater concern with causality, and much greater rigor in causal measurement, with the key buzzword being “identification.” Traditionally, in statistics, identification comes from the likelihood, that is, from the parametric statistical model. The counterfactual model of causal effects has shifted this: with causality defined nonparametrically in terms of latent data, there is a separation between (a) definition of the estimand, and (b) the properties of the estimator—a separation that has been fruitful both in the definition of causal summaries such as various conditional average treatment effects, and in the range of applications of these ideas. Organizations such as MIT’s Poverty Action Lab and Yale’s Innovations for Poverty Action have revolutionized development economics using randomized field experiments, and similar methods have spread within political science. I’m sure that a Nobel Prize will soon be coming to Duflo, Banerjee, or some others in the subfield of experimental economic development. [This was written in 2017, a couple years before that happened. — AG] Within micro-economics, identification strategies have been used not just for media-friendly “cute-o-nomics” but also in areas such as education research and the evaluation of labor and trade policies where randomized experiments are either impossible or impractical to do at scale.

Joshua Angrist and Guido Imbens are two economists who have worked on methods and applications for causal inference. In an influential 1994 paper in Econometrica, they introduced the concept of the local average treatment effect, which is central to any nonparametric understanding of causal inference. This idea generalizes the work of Rubin in the 1970s in defining what quantities are identifiable from any given design. Imbens has also done important work on instrumental variables, regression discontinuity, matching, propensity scores, and other statistical methods for causal identification. Angrist is an influential labor economist who has developed and applied modern methods for causal inference to estimate treatment effects in areas of education, employment, and human capital. The work of Angrist and Imbens is complementary in that Imbens has developed generally-applicable methods, and Angrist and his collaborators have solved real problems in economics.

Previous Nobel prizes that are closely related to this work include Trygve Haavelmo, Daniel McFadden, and James Heckman.

Haavelmo's work on simultaneous equations can be seen as a bridge between macroeconomic models of interacting variables, and the problems of causal identification. As noted in the Imbens quote above, Haavelmo's work can be seen to have anticipated the later ideas of potential outcomes as a general framework for causal inference. The ideas underlying McFadden's work on choice models have been important in modern microeconomics and political science and are related to causal inference in that all these models depend crucially on unmeasurable latent individual variables. Unobservable continuous utility parameters can be thought of as a form of potential outcome. I do not know of any direct ways in which these research streams influenced each other; rather, my point is that these different ideas, coming from different academic perspectives and motivated by different applied problems, have a commonality which is the explicit modeling of aggregate outcomes and measures such as choices or local average treatment effects, in terms of underlying distributions of latent variables. This approach now seems so natural that it can be hard to realize what a conceptual advance it is, as compared to direct modeling of observed data.

Heckman's work on selection models is important very broadly in microeconomics, as it is the nature of economic decision making that choices do not, in general, look like random assignments. Indeed it can be said that all of microeconomics resides in this gap between statics and dynamics. Heckman's selection model differs from the most popular modern approaches to identification in that it relies on a statistical model rather than any sort of natural experiment, but it falls within the larger category of econometric methods for removing or reducing the biases that arise from taking naive comparisons or regressions and considering these as causal inferences.

You can take all the above not as an authoritative discussion of the history of econometrics but rather as a statistician’s perspective on how these ideas fit together. See this 2013 post and discussion thread for some further discussion of the history of causal inference in statistics and econometrics.

P.S. More here.

Learning by confronting the contradictions in our statements/actions/beliefs (and how come the notorious Stanford covid contrarians can’t do it?)

The fundamental principle of von Neumann/Morgenstern decision theory is coherence. You can start with utilities and probabilities and deduce decision recommendations, or you can start with decisions and use these to deduce utilities and probabilities. More realistically, you can move back and forth: start with utilities and probabilities and deduce decision recommendations, then look at places where these recommendations conflict with your actual decisions or inclinations and explore your incoherence, ultimately leading to a change in assessed utilities/probabilities or a change in decision plans.

That is, decision theory is two things. In the immediate sense, it’s a tool for allowing you to make decisions under uncertainty, in tangled settings where human intuition can fail. At the next level, it’s a tool for identifying incoherence—contradictions between our beliefs and actions that might not be immediately clear but are implicit if we work out their implications.

How does this play out in practice? We typically don't write down or otherwise specify utilities and probabilities—indeed, it would generally be impossible to do so, given the complexity of all that might happen in the world—but in any setting where we make a series of decisions or statements, we can try to derive what we can of our implicit utilities, and from there find incoherence in our attitudes and behaviors.

Finding incoherence is not the ultimate goal here—we’re animals, not machines, and so of course we’ll be incoherent in all sorts of ways. Rather, the point in identifying incoherence is to be able to go back and find problems with our underlying assumptions.

We've discussed this many times over the years: in terms of conceptual errors (the freshman fallacy, the hot hand fallacy fallacy, the "what does not destroy my statistical significance makes me stronger" fallacy, and all the others), in particular statistical models where people find errors in our work and this enables us to do better, and in public forecasting errors such as overconfident election forecasts and sports betting odds (see here and here, among others).

As the saying goes, “a foolish consistency is the hobgoblin of little minds.” Again, consistency is not a goal, it’s a tool that allows us to investigate our assumptions.

And that brings us to the notorious Stanford covid contrarians. Mallory Harris tells the story in the school newspaper:

As a Stanford student, I [Harris] have appreciated the University’s unambiguous commitment to universal, basic COVID-19 precautions. Throughout this pandemic, Stanford has relied on evidence-based advice from faculty experts to prevent large outbreaks on campus and keep us safe from COVID. . . . As a student, it has been discouraging to witness multiple Stanford affiliates repeatedly leveraging the University’s name through national media appearances, policy advising and expert-witness testimony to attack the same measures being implemented on our campus. . . .

Stanford was one of the first universities in the country to require vaccines for students, faculty, and staff. For this practice, Professor Jay Bhattacharya singled out Stanford in an op-ed and characterized policies like ours as “ill-advised” and “unethical.” He also advised against vaccination for anyone who has had COVID, claiming (despite evidence to the contrary) that “it simply adds a risk, however small, without any benefit.” . . . At the same time, Bhattacharya recently cited Stanford’s vaccine requirement as a reason he feels comfortable returning to in-person instruction this fall.

The evening after the FDA granted full approval to the Pfizer vaccine, Bhattacharya appeared on Fox News to assert that the approval was too fast and that we lack sufficient data on safety and efficacy of the vaccine. These statements are debunked by Stanford Healthcare and the Stanford Center for Health Education’s Digital Medic Initiative, which affirm the COVID vaccines are safe, effective and well-researched. Indeed, Stanford Medicine helped lead Phase-3 trials for the Johnson and Johnson vaccine and pediatric trials of the Pfizer vaccine. Disturbingly, the latter trial was attacked on Fox News by [Hoover Institution fellow and former Stanford medical school professor Scott] Atlas, who baselessly accused researchers of violating medical ethics and characterized a clinical trial participant as “brainwashed” and “psychologically damaged.” . . .

Bhattacharya appeared on Fox News to discuss [a study that was later retracted] as evidence of dangers to children, likening masks to “child abuse.” These comments were never revisited after the paper’s retraction weeks later . . .

My point here is not that Bhattacharya, Atlas, etc., got some things wrong. We all get things wrong. It’s not even that they made mistakes that were consequential or potentially so. They’re in the arena, working in an important and controversial area, and in that case there’s always the risk of screwing up. Even if some of their errors were avoidable . . . we all make avoidable errors too, sometimes.

No, what I want to focus on here is that they keep missing the opportunity to learn from their mistakes. If you first advise against vaccination, then you turn around and cite a vaccine requirement as a reason to feel comfortable, then confront that incoherence. If you first say the vaccine approval was too fast, then you turn around and characterize the vaccine as “a miraculous development”; if you cite a study and it is then retracted; . . . these are opportunities to learn!

Pointing out these contradictions is not a game of Gotcha. It’s a chance to figure out what went wrong. It’s the scientific method. And when people don’t use their incoherences to learn what went wrong with their assumptions, then . . . well, I wouldn’t quite say that they’re not doing science—I guess that “doing science” is whatever scientists do, and real-world scientists often spend a lot of time and effort avoiding coming to terms with the contradictions in their worldviews—but I will say that they’re not being the best scientists they could be. This has nothing to do with Fox News or anything like that, it’s about a frustrating (to me) practice of missing a valuable opportunity to learn.

When I come across a reversal or contradiction in my own work, I treasure it. I look at it carefully and treat it as a valuable opportunity for discovery. Anomalies are how we learn; they’re the core of science, as discussed for example here and here. And that’s more special than any number of Ted talks and TV appearances.