How did the international public health establishment fail us on covid? By “explicitly privileging the bricks of RCT evidence over the odd-shaped dry stones of mechanistic evidence”

Peter Dorman points us to this brilliant article, “Miasmas, mental models and preventive public health: some philosophical reflections on science in the COVID-19 pandemic,” by health research scholar Trisha Greenhalgh, explaining what went wrong in the response to the coronavirus by British and American public health authorities.

Greenhalgh starts with the familiar (and true) statement that science proceeds through the interplay of theory and empirical evidence. Theory can’t stand alone, and empirical evidence in the human sciences is rarely enough on its own either. Indeed, if you combine experimental data with the standard rules of evidence (that is, acting as if statistically significant comparisons represent real and persistent effects and as if non-significant comparisons represent zero effects), you can be worse off than had you never done your damn randomized trial in the first place.
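This point can be made concrete with a quick simulation of the significance filter. The numbers below (a true effect of 2 with a standard error of 8) are hypothetical, chosen only to illustrate a badly underpowered study; this is a sketch, not anything from Greenhalgh’s paper:

```python
import random

def exaggeration_ratio(true_effect=2.0, se=8.0, n_sims=100_000, seed=1):
    """Average magnitude of the 'statistically significant' estimates,
    divided by the true effect (the exaggeration, or type M, factor)."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)  # one noisy replication
        if abs(estimate) > 1.96 * se:          # usual two-sided 5% cutoff
            significant.append(abs(estimate))
    return (sum(significant) / len(significant)) / true_effect
```

With these numbers, the rare estimates that clear the significance threshold overstate the true effect by nearly an order of magnitude, which is the sense in which taking “significant” results at face value can leave you worse off than having run no trial at all.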

Greenhalgh writes that some of our key covid policy disasters were characterized by “ideological movements in the West [that] drew—eclectically—on statements made by scientists, especially the confident rejection by some members of the EBM movement of the hypothesis that facemasks reduce transmission.”

Her story with regard to covid and masks has four parts. First, the establishment happened to start with “an exclusively contact-and-droplet model” of transmission. That’s unfortunate, but mental models are unavoidable, and you have to start somewhere. The real problem came in the second step, which was to take a lack of relevant randomized studies on mask efficacy as implicit support to continue to downplay the threat of aerosol transmission. This was an avoidable error. (Not that I noticed it at the time! I was trusting the experts, just like so many other people were.) The error was compounded in the third step, which was to take the non-significant result from a single study, the Danmask trial (which according to Greenhalgh was “too small by an order of magnitude to test its main hypothesis” and also had various measurement problems), as evidence that masks do not work. Fourth, this (purportedly) evidence-based masks-don’t-work conclusion was buttressed by evidence-free speculation of reasons why masks might make things worse.
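To see how an “order of magnitude too small” plays out, here is a back-of-the-envelope sample-size calculation for comparing infection rates between two trial arms. The 2% baseline rate and the two candidate effect sizes are hypothetical illustrations, not the actual Danmask parameters:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_b = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

big_effect = n_per_arm(0.02, 0.01)     # 50% relative reduction: a few thousand per arm
small_effect = n_per_arm(0.02, 0.017)  # 15% relative reduction: tens of thousands per arm
```

Under these illustrative numbers, a trial sized to detect a halving of risk is an order of magnitude too small to detect a modest-but-real reduction, so a non-significant result from it tells you almost nothing either way.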

Greenhalgh’s message is not that we need theory without evidence, or evidence without theory. Her message, informed by what seems to me a very reasonable reading of the history and philosophy of science, is that theories (“mental models”) are in most cases necessary, and we should recognize them as such. We should use evidence where it is available, without acting as if our evidence, positive or negative, is stronger than it is.

All this sounds unobjectionable, but when you look at what happened—and is still happening—in the covid discourse of the past year and a half, you’ll see lots of contravention of these reasonable principles, with the errors coming not just from Hoover Institution hacks but also from the Centers for Disease Control and other respected government agencies. It might sound silly to say that people are making major decisions based on binary summaries of statistical significance from seriously flawed randomized studies, but that seems to be what’s happening. But, as Greenhalgh emphasizes, the problem is not just with the misunderstanding of what to do with statistical evidence; it’s also with the flawed mental model of droplet transmission that these people really really didn’t want to let go of.

And check out her killer conclusion:

While I [Greenhalgh] disagree with the scientists who reject the airborne theory of SARS-CoV-2 transmission and the evidence for the efficacy of facemasks, they should not be dismissed as ideologically motivated cranks. On the contrary, I believe their views are—for the most part—sincerely held and based on adherence to a particular set of principles and quality standards which make sense within a narrow but by no means discredited scientific paradigm. That acknowledged, scientists of all creeds and tribes should beware, in these fast-moving and troubled times, of the intellectual vices that tempt us to elide ideology with scientific hypothesis.

Well put. Remember how we said that honesty and transparency are not enuf? Bad statistical methods are a problem in part because they can empower frauds and cheaters, but they also can degrade the work of researchers who would like to do well. Slaves to some long-defunct etc etc. And it’s not just a concern for this particular example; my colleagues and I have argued that these problems arise with so-called evidence-based practice more generally. As I put it a few years ago, evidence-based medicine eats itself.

P.S. The problems with the public health establishment should not be taken to imply that we should trust anti-establishment sources. For all its flaws, the public health establishment is subject to democratic control and has the motivation to improve public health. They make mistakes and we can try to help them do better. There’s some anti-establishment stuff that’s apparently well funded and just horrible.

I like Steven Pinker’s new book. Here’s why:

I first heard about Rationality, the latest book from linguist Steven Pinker, from his publisher, offering to send me a review copy. Pinker has lots of interesting things to say, so I was happy to take a look. I’ve had disagreements with him in the past, but it’s always been cordial (see here, here, and here), and he’s always shown respect for my research. Indeed, if I do what Yair calls a “Washington read” of Pinker’s new book, I find a complimentary citation to me—a bit too complimentary, in that he credits me with coining the phrase “garden of forking paths,” but all I did was steal it from Borges. Not that I’ve ever actually read that story; in the words of Daniel Craig, “I like the title.” More generally, I appreciate Pinker’s willingness to engage with criticism, and I looked forward to receiving his book and seeing what he had to say about rationality.

As with many books I’m called upon to review, this one wasn’t really written for me. This is the nature of much of nonfiction book reviewing. An author writes a general-audience book on X, you get a reviewer who’s an expert on X, and the reviewer needs to apply a sort of abstraction, considering how a general reader would react. That’s fine; it’s just the way things are.

That said, I like the book, by which I mean that I agree with its general message and I also agree with many of the claims that Pinker presents as supporting evidence.

Pinker’s big picture

I’ll quickly summarize the message here. The passage below is not a quote; rather, it’s my attempt at a summary:

Humans are rational animals. Yes, we are subject to cognitive illusions, but the importance of these illusions is in some way a demonstration of our rationality, in that we seek reasons for our beliefs and decisions. (This is the now standard Tversky-Kahneman-Gigerenzer synthesis; Pinker emphasizes Gigerenzer a bit less than I would, but I agree with the general flow.) But we are not perfectly rational, and our irrationalities can cause problems. In addition to plain old irrationality, there are also people who celebrate irrationality. Many of those celebrators of irrationality are themselves rational people, and it’s important to explain to them why rationality is a good thing. If we all get together and communicate the benefits of rationality, various changes can be made in our society to reduce the spread of irrationality. This should be possible given the decrease in violent irrationality during the past thousand years. Rationality isn’t fragile, exactly, but it could use our help, and the purpose of Pinker’s book is to get his readers to support this project.

There’s a bit of hope there near the end, but some hope is fine too. It’s all about mixing hope and fear in the correct proportions.

Chapter by chapter

A few years ago, I wrote that our vision of what makes us human has changed. In the past, humans were compared to animals, and we were “the rational animal”: our rationality was our most prized attribute. But now (I wrote in 2005) the standard of comparison is the computer: we were “the irrational computer,” and it was our irrationality that was said to make us special. This seemed off to me.

Reading chapter 1 of Pinker’s book made me happy because he’s clearly on my side (or, maybe I should say, I’m on his side): we are animals, not computers, and it’s fair to say that our rationality is what makes us human. I’m not quite sure why he talks so much about cognitive illusions (the availability heuristic, etc.), but I guess that’s out of a sense of intellectual fairness on his part: He wants to make the point that we are largely rational and that’s a good thing, and so he clears the deck by giving some examples of irrationality and then explaining how this does not destroy his thesis. I like that: it’s appealing to see a writer put the evidence against his theory front and center and then discuss why he thinks the theory still holds. I guess I’d only say that some of these cognitive illusions are pretty obscure—for example, I’m not convinced that the Linda paradox is so important. Why not bring in some of the big examples of irrationality in life: on the individual level, behaviors such as suicide and drug addiction; at the societal level, decisions such as starting World War I and obvious misallocations of resources such as financing beach houses in hurricane zones? I see a sort of parochialism here, a focus on areas of academic psychology that the author is close to and familiar with. Such parochialism is unavoidable—I write books about political science and statistics!—but I’d still kinda like to see Pinker step back and take a bigger perspective. In saying this, I realize that Pinker gets this from both sides, as other critics will tell him to stick to his expertise in linguistics and not try to make generalizations about the social world. So no easy answer here, and I see why he wrote the chapter the way he did, but it still leaves me slightly unsatisfied despite my general agreement with his perspective.

I liked most of chapter 2 as well: here Pinker talks about the benefits of people taking a rational approach, both for themselves as individuals and for society. I don’t need much convincing here, but I appreciated seeing him make the case.

Pinker writes that “ultimately even relativists who deny the possibility of objective truth . . . lack the courage of their convictions.” I get his point, at least sometimes. Consider, for example, the people who do double-blind randomized controlled trials of intercessory prayer—they somehow think that God has the ability to cause a statistically significant improvement among the treatment group but that He can’t just screw with the randomization. On the other hand, maybe Pinker is too optimistic. He writes that purported “relativists” would not go so far as to deny the Holocaust, climate change, and the evils of slavery—but of course lots of people we encounter on the internet are indeed relativistic enough in their attitudes to deny these things, and they appear to be happy to set aside logic and evidence and objective scholarship to hold beliefs that they want to believe (as Pinker actually notes later on in chapter 10 of his book). Sometimes it seems that the very absurdity of these beliefs is part of their appeal: defending slavery and the Confederacy, or downplaying the crimes of Hitler, Stalin, Mao, etc., is a kind of commitment device for various political views. I guess climate change denial (and, formerly, smoking-cancer denial) is more of a mixed bag, with some people holding these views as a part of their political identity and others going with the ambiguity of the evidence to take a particular position. Belief in intercessory prayer is a different story because, at least in this country, it’s a majority position, so if you have a generally rational outlook and you also believe in the effectiveness of intercessory prayer, it makes sense that you’d try your best to fit it into your rational view of the world, in the same sense that rational rationalizers might try to construct rationales for fear, love, and other strong emotions that aren’t particularly rational in themselves.

Elsewhere I think Pinker’s too pessimistic. I guess he doesn’t hear that complaint much, but here it is! He writes: “Modern universities—oddly enough, given that their mission is to evaluate ideas—have been at the forefront of finding ways to suppress opinions, including disinviting and drowning out speakers, removing controversial teachers from the classroom, revoking offers of jobs and support, expunging contentious articles from archives, and classifying differences of opinion as punishable harassment and discrimination.” I guess I’m lucky to be at Columbia because I don’t think they do any of that here. I’ll take Pinker at his word that these things have happened at modern universities; still, I wouldn’t say that universities are “at the forefront of finding ways to suppress opinions,” just because their administrations sometimes make bad decisions. If universities are at the forefront of finding ways to suppress opinions, where does that put the Soviet Union, the Cultural Revolution, and other such institutions that remain in living memory? I agree that we should fight suppression of free speech, but let’s keep things in perspective and save the doomsaying for the many places where it’s appropriate!

There was another thing in chapter 2 that didn’t ring true to me, but I’ll get to it later, as right here I don’t want these particular disagreements to get in the way of my agreement with the main message of the chapter, which is that rational thinking is generally beneficial in life and society, even beyond narrow areas such as science and business.

I have less to say about chapters 3 through 9, which cover logic, probability, Bayesian reasoning, expected utility, hypothesis testing, game theory, and causal inference. He makes some mistakes (for example, defining statistical significance as “a Bayesian likelihood: the probability of obtaining the data given the hypothesis”), but he does a pretty good job at covering a lot of material in a small amount of space, and I was happy to see him including two of my favorite examples: the hot hand fallacy fallacy explained by Miller and Sanjurjo, and Gigerenzer’s idea of expressing probabilities as natural frequencies. I’m not quite sure how well this works as a book—to me, it sits in the uncanny valley between a college text and a popular science treatment—but I’m not the target audience here, so who am I to say.
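The hot hand fallacy fallacy is easy to check for yourself. Here is a sketch that enumerates all coin-flip sequences of a given length and averages, per sequence, the proportion of heads among flips that immediately follow a head; the downward bias it reveals is the surprise at the heart of Miller and Sanjurjo’s argument:

```python
from fractions import Fraction
from itertools import product

def avg_prop_heads_after_heads(n):
    """Average over all length-n H/T sequences of the within-sequence
    proportion of heads among flips that immediately follow a head."""
    props = []
    for seq in product("HT", repeat=n):
        after_heads = [seq[i + 1] for i in range(n - 1) if seq[i] == "H"]
        if after_heads:  # skip sequences with no flip following a head
            props.append(Fraction(after_heads.count("H"), len(after_heads)))
    return sum(props) / len(props)
```

For n = 3 the average is 5/12, not 1/2: conditioning on a streak within a finite sequence biases the streak-following proportion below one half, which is why a basketball player whose post-streak shooting merely matches this biased baseline can look like he has no hot hand when in fact he does.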

Just one thing. At one point in these chapters on statistics, Pinker talks about fallacies that have contributed to the replication crisis in science (that’s where he mentions my forking-paths work with Eric Loken). I think this treatment would be stronger if he were to admit that some of his professional colleagues have been taken in by junk science in its different guises. There was that ESP study published by one of the top journals in the field of psychology. There was the absolutely ridiculous and innumerate “critical positivity ratio” theory that, as recently as last year, was the centerpiece of a book that was endorsed by . . . Steven Pinker! There was the work of “Evilicious” disgraced primatologist Marc Hauser, who wrote a fatuous article for the Edge Foundation’s “Reality Club” . . . almost a decade before Harvard “found him guilty of scientific misconduct and he resigned” (according to wikipedia). I think that including these examples would be a freebie. Admitting that he and other prominent figures in his field were fooled would give more of a bite to these chapters. Falling for hoaxes and arguments with gaping logical holes is not just for loser Q followers on the internet; it happens to decorated Harvard professors too.

The final two chapters of the book return to the larger theme of the benefits of rationality. Chapter 10 leads off with a review of covid science denial, fake news, and irrational beliefs. Apparently 32% of Americans say they believe in ghosts and 21% say they believe in witches. The witches thing is just silly, but covid denial has killed people, and climate change denial has potentially huge consequences. How to reconcile this with the attitude that people are generally rational? Pinker’s answer is motivated reasoning—basically, people believe what they want—and that most of these beliefs are in what he calls “the mythology zone,” beliefs such as ghosts and witches that have no impact on most people’s lives. He argues that “the arc of knowledge is a long one, and it bends toward rationality.” I don’t know, though. I feel like the missing piece in his story is politics. The problem with covid denial is not individual irrationality; it’s the support of this denialism by prominent political institutions. In the 1960s and again in recent years, there’s been widespread concern about lawlessness in American politics. When observers said that the world was out of control in the 1960s, or when they say now that today’s mass politics are reminiscent of the 1930s, the issue is not the percentage of people holding irrational beliefs; it’s the inability of traditional institutions to contain these attitudes.

Getting to details: A couple places in his book, Pinker refers to the irrationality of assuming that different groups of people are identical on average in “socially significant variables” such as “test scores, vocational interests, social trust, income, marriage rates, life habits, rates of different types of violence.” As Woody Guthrie sang, “Some will rob you with a six-gun, and some with a fountain pen.” Fine. I get it. Denying group differences is irrational. But it’s funny that Pinker doesn’t mention irrationality in traditional racism and sexism, the belief that women or ethnic minorities just can’t do X, Y, or Z. These sorts of prejudices are among the most famous examples of irrational thinking. Irrationalities bounce off each other, and one irrationality can be a correction for another. Covid denialism and climate change denialism, as unfortunate and irrational as they are, can be seen as reactions to the earlier irrationality of blind trust in our scientific overlords, with these reactions stirred up by various political and media figures.

At one point Pinker writes, “Rationality is disinterested. It is the same for everyone everywhere, with a direction and momentum of its own.” I see the appeal of this Karenina-esque statement, but I don’t buy it. Rationality is a mode of thinking, but the details of rationality change. For example, nowadays we have Bayesian reasoning and the scientific method. Aristotle, rational as he may have been, didn’t have these tools. In his concluding chapter, Pinker seems to get this, as he talks about the ever-expanding bounds of rationality during the past several centuries. I guess the challenge is that people may be more rational than they used to be, but in the meantime our irrationality can cause more damage. Technology is such that we can do more damage than ever before. What’s relevant is not irrationality, but its consequences.

Now that I’ve read the whole book, let me try to summarize the sweep of Pinker’s argument. It goes something like this:

Chapter 1. It is our nature as humans for our beliefs and attitudes to have a mix of rationality and irrationality. We’re all subject to cognitive illusions while at the same time capable of rational reasoning.

Chapter 2. Rationality, to the extent we use it, is a benefit to individuals and society.

Chapters 3-9. Rationality ain’t easy. To be fully rational you should study logic, game theory, probability, and statistics.

Chapter 10. We’re irrational because of motivated reasoning.

Chapter 11. Things are getting better. Rationality is on the rise.

The challenge is to reconcile chapter 1 (irrationality is human nature) with chapters 3-9 (rationality ain’t easy) and 11 (rationality is on the rise). Pinker’s resolution, I think, is that science is progressing (that’s all the stuff in chapters 3-9 that can help the readers of his book become more rational in their lives and understand the irrationality of themselves and others) and society is improving. Regarding that last point, he could be right; at the same time, he never really gives a good reason for his confidence that we don’t have to be concerned about the social and environmental costs of increasing political polarization, beyond a vague assurance that “The new media of every era open up a Wild West of apocrypha and intellectual property theft until truth-serving countermeasures are put into place” and then some general recommendations regarding social media companies, pundits, and deliberative democracy, with the statement (which I agree with) that rationality “is not just a cognitive virtue but a moral one.” As the book concludes, Pinker alternates between saying that we’re in trouble and we need rationality to save us, and that progress is the way of the world. This is a free-will paradox that is common in the writings of social reformers: everything is getting better, but only because we put in the work to make it so. The Kingdom of Heaven has been foretold, but it is we, the Elect, who must create it. Or, to put it in a political context, We will win, but only with your support. This does not mean that Pinker’s story is wrong: it may well be that rationality will prevail (in some sense) due to the effort of Pinker and the rest of us; I’m just saying that his argument has a certain threading-the-needle aspect.

Still and all, I like Pinker’s general theme of the complexity and importance of rationality, even if I think he focuses a bit too much on the psychological aspect of the problem and not enough on the political.

Parochialism and taboo

One unfortunate feature of the book is a sort of parochialism that privileges recent academic work in psychology and related fields. For example this on page 62: “Can certain thoughts be not just strategically compromising but evil to think? This is the phenomenon called taboo, from a Polynesian word for ‘forbidden.’ The psychologist Philip Tetlock has shown that taboos are not just customs of South Sea islanders but active in all of us.” And then there’s a footnote to research articles from 2000 and 2003.

That’s all well and good, but:

1. No way that Tetlock or anybody else has shown that an attitude is “active in all of us.” At best these sorts of studies can only tell us about the people in the studies themselves, but, also, this evidence is almost always statistical, with the result being that average behavior is different under condition A than under condition B. I can’t think of any study of this sort that would claim that something occurs 100% of the time. Beyond this, there do seem to be some people who are not subject to taboos. Jeffrey Epstein, for example.

2. If we weaken the claim from “taboos are active in all of us” to “taboos are a general phenomenon, not limited to some small number of faraway societies,” then it seems odd to attribute this to someone writing in the year 2000. The idea of taboos being universal and worth studying rationally is at least as old as Freud. Or, if you don’t want to cite Freud, lots of anthropology since then. Nothing wrong with bringing in Tetlock’s research, but it seems a bit off, when introducing taboos, to focus on obscure issues such as “forbidden base rates” or attitudes on the sale of kidneys rather than the biggies such as taboos against incest, torture, etc.

I’ve disagreed with Pinker before about taboos, and I think my key point of disagreement is that sometimes he labels something a “taboo” that I would just call a bad or immoral idea. For example, a few years ago Pinker wrote, “In every age, taboo questions raise our blood pressure and threaten moral panic. But we cannot be afraid to answer them.” One of his questions was, “Would damage from terrorism be reduced if the police could torture suspects in special circumstances?” I don’t think it’s “moral panic” to be opposed to torture; indeed, I don’t think it’s “moral panic” for the question of torture to be taken completely off the table. I support free speech, including the right of people to defend Jerry Sandusky, Jeffrey Epstein, John Yoo, etc etc., and, hey, who knows, someone might come up with a good argument in favor of their behavior—but until such an argument appears, I feel no obligation to seriously consider these people’s actions as moral. Pinker might call that a taboo on my part; I’d call this a necessary simplification of life, the same sort of shortcut that allows me to assume, until I’m shown otherwise, that dishes when dropped off the table will fall down to the floor rather than up to the ceiling. Again, Pinker’s free to hold his own view on this—I understand that since making the above-quoted statement he’s changed his position and is now firmly anti-torture; my point is that labeling an attitude as “taboo” can itself be a strong statement.

Another example is that Pinker describes it as “a handicap in mental freedom” to refuse to answer the question, “For how much money would you sell your child?” Here he seems to be missing the contextual nature of psychology. Many people will sell their children—if they’re poor enough. I doubt many readers of Pinker’s book are in that particular socioeconomic bracket; indeed, in his previous paragraph he asks you to “try playing this game at your next dinner party.” I think it’s safe to say that if you’re reading Pinker’s book and attending dinner parties, that there’s no amount of money for which you’d sell your child. So the question isn’t so much offensive as silly. My guess is that if someone asks this at such a party, the response would not be offense but some sort of hypothetical conversation, similar to if you were asked whether you’d prefer invisibility or the power of flight. Or maybe Pinker hangs out with a much more easily-offended crowd than I do. On the other hand, what about people who actually sold their children, or the equivalent, to Jeffrey Epstein? Pinker’s on record as saying this is reprehensible. How does this line up with his belief that it’s “a handicap in mental freedom” to not consider for how much money you would sell your child?

This example points to a sort of inner contradiction of Pinker’s reasoning. On one hand, he’s saying we all have taboos. I guess that includes him too! He’s also saying that we live in a society where there are all sorts of things we can’t talk about, not just torture and the selling of children and the operation of a private island for sex with underage women, but also organ donation, decisions of hospital administrators, and budgetary decisions. On the other hand, he’s writing for an audience of readers who, if they don’t already agree with him, are at least possibly receptive to his ideas—so they’re not subject to these taboos, or at least maybe not. This gets back to the question of what Pinker’s dinner parties are like: is it a bunch of people sitting around the table talking about the potential benefits of torture, subsidized jury duty, and an open market in kidneys; or a bunch of people all wanting to talk about these things but being afraid to say so; or a bunch of people whose taboos are so internalized that they refuse to even entertain these forbidden ideas? You can see how this loops back to my first point above about that phrase “active in all of us.” Later on, Pinker says, “It’s wicked to treat an individual according to that person’s race, sex, or ethnicity.” “Wicked,” huh? That seems pretty strong! Torture or selling your child are OK conversation topics, but treating men differently than women is wicked? I honestly can’t figure out where he draws the line. That’s ok—there’s no reason to believe we’re rational in what bothers us—but then maybe he could be a bit more understanding about those of us who think that torture is “wicked,” rather than just “taboo.”

Also I don’t quite get when Pinker writes that advertisers of life insurance “describe the policy as a breadwinner protecting a family rather than one spouse betting the other will die.” This just seems perverse on his part. When I bought life insurance, I was indeed doing it to protect my family in the event that I die young. I get it that you could say that mathematically this is equivalent to my wife betting that I would die, but really that makes no sense, given that I was the one paying for the insurance (so she’s not “betting” anything) and, more importantly, the purpose of the insurance was not to gamble but to reduce uncertainty. It would make more sense to say I was “hedging” against the possibility that I would die young. Here it seems that Pinker wants to anti-euphemize, to replace an accurate description (buying life insurance to protect one’s family) by an inaccurate wording whose only virtue is harshness.

Had I written this book, I would’ve emphasized slightly different things. As noted above, it seems strange to me that, when talking about irrationality, Pinker focuses so much on irrational beliefs rather than on irrational actions. At one level, I understand: the belief is father to the action. But it’s the actions that matter, no? I guess one reason I say this is my political science background. For example, the irrational action of funding housing construction in flood zones can be explained in part through various political deals and subsidies. Spreading better understanding of climate change should help, but it’s not clear that individual irrationality is the biggest problem here, and I’m concerned that Pinker is falling into an individualistic trap when studying society. To take a more positive example, cigarette smoking rates are much lower than they were a half-century ago. I would attribute this not to an increase in rationality or Odyssean self-control but rather to notions of fashion and coolness of which Pinker seems so dismissive. Smoking used to be cool; it no longer is. I remember 20 years ago when NYC banned smoking in restaurants and bars; various pundits and lobbying organizations declared that this was a horrible infringement on liberty, that the people would rise up, etc. . . . None of those things happened. They banned smoking and people just stopped smoking indoors. I guess that did induce some Odyssean self-control among smokers, so I’m not saying these individualistic behavioral concepts are useless, just that they’re not the whole story, and indeed sometimes they don’t seem to be the most important part of the story.

But that’s not really a criticism of Pinker’s book, that I would’ve written something different. It’s a limitation of his story, but all stories have limitations.

Big hair

One thing I found charming about the book, but others might find annoying, is the datedness of some of its references and perspectives. Chapter 1 reads as if it was written in the 1980s, back when the work of Tversky, Kahneman, and Gigerenzer was new. (I was going to say “new and exciting,” but that would be misleading: yes, their work was new and exciting in the 1980s, but it remains exciting even now, long after it was new.) Chapter 2 begins, “Rationality is uncool. To describe someone with a slang word for the cerebral, like nerd, wonk, geek, or brainiac, is to imply they are terminally challenged in hipness.” I guess there’s always the possibility that he’s kidding, but . . . things have changed in the past 40 years, dude! In recent years, lots of people have been proud to be called nerds, wonks, or geeks; if anything, it’s “hipsters” who are not considered to be so cool. Pinker supports his point with quotes from Talking Heads, Prince, and . . . Zorba the Greek? That’s a movie from 1964! Later he refers to Peter, Paul and Mary (or, as he calls them, “Peter, Paul, and Mary”—prescriptive linguist that he is). When it comes to basketball, his go-to example is Vinnie Johnson from the 1980s Pistons. OK, I get it, he’s a boomer. That’s cool. You be you, Steve. But it might be worth not just updating your cultural references but considering that the culture has changed in some ways in the past half-century. In that same paragraph as the one with Zorba, Pinker describes “postmodernism” and “critical theory” as “fashionable academic movements.” I’m sure that there are professors still teaching these things, but no way that postmodernism and critical theory are “fashionable.” It’s been close to 40 years since they were making the headlines! You might as well label suspenders, big hair, and cocaine as fashionable. I half-expected to hear him talk about “yuppies” and slip in an Alex P. Keaton reference.

Review of reviews

After reading Pinker’s book, I did some googling and read some reviews. Given the title of the book, I guess we shouldn’t be surprised that some reviewers liked it! Other reviews were mixed, with The Economist’s “Steven Pinker’s new defence of reason is impassioned but flawed” capturing the general attitude that he had some things to say but had bitten off more than he could chew.

The New York Times review argues that Pinker gets things wrong in the details (for example, Pinker pointing to the irrationality of “half of Americans nearing retirement age who have saved nothing for retirement” without recognizing that “the median income for those non-saving households is $26,000, which isn’t enough money to pay for living expenses, let alone save for retirement”), while the Economist reviewer is OK with the details but is concerned about the big picture, reminding us that rationality can be deadly: “Rationality involves people knowing they are right. And from the French revolution on, being right has been used to justify appalling crimes. Mr Pinker would no doubt call the Terror a perversion of reason, just as Catholics brand the Inquisition a denial of God’s love. It didn’t always seem that way at the time.” Good point. This is an argument that Pinker should’ve addressed in his book: violence can come from purported rationality (for example, the Soviets) as well as from open irrationality (for example, the Nazis).

The published review whose perspective is closest to mine comes from Nick Romeo in the Washington Post, who characterizes the book as “a pragmatic dose of measured optimism, presenting rationality as a fragile but achievable ideal in personal and civic life,” offering “the welcome prospect of a return to sanity.” Like me, Romeo suggests that Pinker’s individualist argument could be improved by making more connections to politics (in his case, “the political economy of journalism — its funding structures, ownership concentration and increasing reliance on social media shares”). Ultimately, though, I think we have to judge a book by what it is, not what it is not. Pinker is a psychology professor, so it makes sense that, when writing about rationality, he focuses on its psychological aspects.

Cross validation and model assessment comparing models with different likelihoods? The answer’s in Aki’s cross validation FAQ!

Nick Fisch writes:

After reading your paper “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC,” I am curious as to whether the criteria WAIC or PSIS-LOO can be used to compare models that are fit using different likelihoods? I work in fisheries assessment, where we are frequently fitting highly parameterized nonlinear models to multiple data sources using MCMC (generally termed “integrated fisheries assessments”). If I build two models that solely differ in the likelihood specified for a specific data source (one Dirichlet, the other Multinomial), would WAIC or loo be able to distinguish these, or must I use some other method to compare the models (such as goodness of fit, sensitivity, etc.)? I should note that the posterior distribution will be the unnormalized posterior distribution in these cases.

My response: for discrete data I think you’d just want to work with the log probability of the observed outcome (log p), and it would be fine if the families of models are different. I wasn’t sure what was the best solution with continuous variables, so I forwarded the question to Aki, who wrote:

This question is answered in my [Aki’s] cross validation FAQ:

12 Can cross-validation be used to compare different observation models / response distributions / likelihoods?

First to make the terms more clear, p(y∣θ) as a function of y is an observation model and p(y∣θ) as a function of θ is a likelihood. It is better to ask “Can cross-validation be used to compare different observation models?”

– You can compare models given different discrete observation models, and it’s also allowed to have different transformations of y as long as the mapping is bijective (the probabilities will then stay the same).
– You can’t compare densities and probabilities directly. Thus you can’t compare models given continuous and discrete observation models, unless you compute probabilities in intervals from the continuous model (also known as discretising the continuous model).
– You can compare models given different continuous observation models, but you need to have exactly the same y (loo functions in rstanarm and brms check that the hash of y is the same). If y is transformed, then the Jacobian of that transformation needs to be included. There is an example of this in the mesquite case study.

It is better to use cross-validation than WAIC as the computational approximation in WAIC fails more easily and it’s more difficult to diagnose when it fails.
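To make the Jacobian point concrete, here is a small numerical sketch (my own example, not from the FAQ): a lognormal model for positive-valued y and a normal model for z = log(y) are the same model in two parameterizations, but their log densities are only comparable after including the Jacobian term log|dz/dy| = −log(y).

```python
import numpy as np
from scipy import stats

# Hypothetical data, just to illustrate the Jacobian adjustment.
rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.0, sigma=0.5, size=200)  # positive-valued outcome

mu, sigma = np.mean(np.log(y)), np.std(np.log(y))

# Model A: lognormal observation model on the original scale of y.
lpd_A = stats.lognorm.logpdf(y, s=sigma, scale=np.exp(mu)).sum()

# Model B: normal observation model on the transformed scale z = log(y).
z = np.log(y)
lpd_B_raw = stats.norm.logpdf(z, loc=mu, scale=sigma).sum()

# lpd_B_raw is NOT comparable to lpd_A: it is a density on a different scale.
# Adding the Jacobian term log|dz/dy| = -log(y) for each observation makes it
# comparable, and since these are the same model, the values now agree:
lpd_B = lpd_B_raw - np.log(y).sum()
print(np.isclose(lpd_A, lpd_B))  # True
```

The same bookkeeping is what the loo functions in rstanarm and brms cannot do for you automatically: if you transform y, you have to supply the Jacobian yourself.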

P.S. Nick Fisch is a Graduate Research Fellow in Fisheries and Aquatic Sciences at the University of Florida. How cool is that? I’m expecting to hear very soon from Nick Beef at the University of Nebraska.

Suburban Dicks

The cover drew my attention so I opened up the book and saw this Author’s Note:

West Windsor and Plainsboro are real towns in New Jersey. Unlike the snarky description in the book, it’s a pretty good area to live in . . . I mean, as far as New Jersey goes . . .

I was amused enough to read this aloud to the person with me in the bookstore, who then said, This sounds like your kind of book. You should get it. So I did.

I liked the book. It delivered what it promised: fun characters, jokes, and enough plot to keep me turning the pages. It also did a good job with the balance necessary in non-hard-boiled mystery stories, keeping it entertaining without dodging the seriousness of the crimes. This balance is not always easy; to my taste, it’s violated by too-cosy detective stories on one hand and obnoxious tough-guy vigilantes on the other.

Suburban Dicks has some similarities to Knives Out, with a slightly different mix of the ingredients of characters, plot, laughs, and social commentary. Briefly: Knives Out was a more professional piece of work (we’ll get back to that in a moment) and had a much better plot, both in its underlying story and how it was delivered. The two were equally funny but in different ways: as one might guess from its title, Suburban Dicks was more slapstick, more of a Farrelly brothers production, in the book’s case leaning into the whole Jersey thing. Suburban Dicks was also more brute force in its social commentary, to the extent that it could put off some readers, but to me it worked, in the sense that this was the story the author wanted to tell.

Suburban Dicks works in large part because of the appealing characters and the shifting relationships between them, all of which are drawn a bit crudely but, again, with enough there to make it all work. I liked the shtick, and I liked that the characters had lives outside the plot of the story.

This all might sound like backhanded compliments, and I apologize for that, because I really enjoyed the book: it was fun to read, and I felt satisfied with it when it was all over. What’s most relevant to the experience, both during the reading and in retrospect, are the strengths, not the weaknesses.

One more thing, though. The book is well written, but every once in a while there’s a passage that’s just off, to the extent that I wonder if the book had an editor. Here’s an example:

“Listen, I know what it sounds like, but, I don’t know, think of it this way,” Andrea said. “You were a child-psych major at Rutgers, right? And you got a job at Robert Wood Johnson as a family caseworker for kids in the pediatric care facility, right?”


Who talks that way? This is a classic blunder, to have character A tell character B something she already knows, just to inform the reader. I understand how this can happen—in an early draft. But it’s the job of an editor to fix this, no?

But then it struck me . . . nobody buys books! More books are published than ever before, but it’s cheap to publish a book. Sell a few thousand and you break even, I guess. (Maybe someone in comments can correct me here.) There’s not so much reading for entertainment any more, not compared to the pre-internet days. I’m guessing the economics of book publishing is that the money’s in the movie rights. So, from the publisher’s point of view, the reason for this book is not so much that it might sell 50,000 copies and make some money, but that they get part of the rights for the eventual filmed version (again, experts on publishing, feel free to correct me on this one). So, from that point of view, who cares if there are a few paragraphs that never got cleaned up? And, to be honest, those occasional slip-ups didn’t do much to diminish my reading experience. Seeing some uncorrected raw prose breaks the fourth wall a bit, but the book as a whole is pretty transparent; indeed, there’s a kind of charm to seeing the author as a regular guy who occasionally drops a turd of a paragraph.

It makes me sad that there was no editor to carefully read the book and point out the occasional lapses in continuity, but I can understand the economics of why the publisher didn’t bother. I’m sure the eventual movie script will be looked over more carefully.

In any case, let me say again that I enjoyed the book and I recommend it to many of you. After reading it, I googled the author’s name and found out that he writes for comic books, most famously co-creating the character Deadpool. His Wikipedia page didn’t mention Suburban Dicks at all so I added something. And then, in preparing this post, I googled again and came across this article, “Legendary comic book writer’s first novel set in West Windsor,” from a Central New Jersey news site, from which I learned:

Suburban Dicks debuted to rave reviews and Nicieza has already been contracted to write a sequel. The book has also been optioned for a television show.

Good to hear.

Actually, this news article, by Bill Sanservino, is excellent. It includes a long and informative interview with Nicieza, interesting in its own right and also in the light it sheds on his book. It’s a long, long interview with lots of good stuff.

There’s only one thing that puzzles me. In the interview, Nicieza talks about all the editors who helped him on the book. That’s cool; there’s no need to do it all yourself. But how could there be all those editors . . . and none of them caught the paragraph quoted above, and a few others throughout the book, that just jumped out at me when I read them? I don’t get it.

Learning by confronting the contradictions in our statements/actions/beliefs (and how come the notorious Stanford covid contrarians can’t do it?)

The fundamental principle of von Neumann/Morgenstern decision theory is coherence. You can start with utilities and probabilities and deduce decision recommendations, or you can start with decisions and use these to deduce utilities and probabilities. More realistically, you can move back and forth: start with utilities and probabilities and deduce decision recommendations, then look at places where these recommendations conflict with your actual decisions or inclinations and explore your incoherence, ultimately leading to a change in assessed utilities/probabilities or a change in decision plans.

That is, decision theory is two things. In the immediate sense, it’s a tool for allowing you to make decisions under uncertainty, in tangled settings where human intuition can fail. At the next level, it’s a tool for identifying incoherence—contradictions between our beliefs and actions that might not be immediately clear but are implicit if we work out their implications.

How does this play out in practice? We typically don’t write down or otherwise specify utilities and probabilities—indeed, it would generally be impossible to do so, given the complexity of all that might happen in the world—but in any setting where we make a series of decisions or statements, we can try to derive what we can of our implicit utilities, and from there find incoherence in our attitudes and behaviors.

Finding incoherence is not the ultimate goal here—we’re animals, not machines, and so of course we’ll be incoherent in all sorts of ways. Rather, the point in identifying incoherence is to be able to go back and find problems with our underlying assumptions.
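As a toy illustration of deriving implicit probabilities from decisions (the dollar amounts are purely hypothetical, invented for this sketch): accepting a bet reveals a lower bound on your probability for the event, and two accepted bets can jointly reveal incoherence.

```python
# Hypothetical bets, just to illustrate backing out implied probabilities
# from decisions and checking them for coherence.
def implied_prob_lower_bound(stake, payout):
    """Accepting 'risk `stake` to win `payout` if E occurs' has nonnegative
    expected value only if P(E) >= stake / (stake + payout)."""
    return stake / (stake + payout)

# Suppose you risk $150 to win $100 on E happening...
p_for = implied_prob_lower_bound(150, 100)      # implies P(E) >= 0.6
# ...and elsewhere risk $120 to win $100 on E NOT happening:
p_against = implied_prob_lower_bound(120, 100)  # implies P(not E) >= 0.545

# The two lower bounds sum to more than 1, so no single probability for E
# is consistent with both decisions: the pair is incoherent, and a bookie
# taking both bets is guaranteed a profit (a Dutch book).
incoherent = p_for + p_against > 1
print(incoherent)  # True
```

The interesting move, per the discussion above, is not to score the gotcha but to ask which of the two bets reflects your actual beliefs, and why you accepted the other one.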

We’ve discussed this many times over the years, in terms of conceptual errors (the freshman fallacy, the hot hand fallacy fallacy, the “what does not destroy my statistical significance makes me stronger” fallacy, and all the others), particular statistical models where people find errors in our work and this enables us to do better, and public forecasting errors such as overconfident election forecasts and sports betting odds (see here and here, among others).

As the saying goes, “a foolish consistency is the hobgoblin of little minds.” Again, consistency is not a goal, it’s a tool that allows us to investigate our assumptions.

And that brings us to the notorious Stanford covid contrarians. Mallory Harris tells the story in the school newspaper:

As a Stanford student, I [Harris] have appreciated the University’s unambiguous commitment to universal, basic COVID-19 precautions. Throughout this pandemic, Stanford has relied on evidence-based advice from faculty experts to prevent large outbreaks on campus and keep us safe from COVID. . . . As a student, it has been discouraging to witness multiple Stanford affiliates repeatedly leveraging the University’s name through national media appearances, policy advising and expert-witness testimony to attack the same measures being implemented on our campus. . . .

Stanford was one of the first universities in the country to require vaccines for students, faculty, and staff. For this practice, Professor Jay Bhattacharya singled out Stanford in an op-ed and characterized policies like ours as “ill-advised” and “unethical.” He also advised against vaccination for anyone who has had COVID, claiming (despite evidence to the contrary) that “it simply adds a risk, however small, without any benefit.” . . . At the same time, Bhattacharya recently cited Stanford’s vaccine requirement as a reason he feels comfortable returning to in-person instruction this fall.

The evening after the FDA granted full approval to the Pfizer vaccine, Bhattacharya appeared on Fox News to assert that the approval was too fast and that we lack sufficient data on safety and efficacy of the vaccine. These statements are debunked by Stanford Healthcare and the Stanford Center for Health Education’s Digital Medic Initiative, which affirm the COVID vaccines are safe, effective and well-researched. Indeed, Stanford Medicine helped lead Phase-3 trials for the Johnson and Johnson vaccine and pediatric trials of the Pfizer vaccine. Disturbingly, the latter trial was attacked on Fox News by [Hoover Institution fellow and former Stanford medical school professor Scott] Atlas, who baselessly accused researchers of violating medical ethics and characterized a clinical trial participant as “brainwashed” and “psychologically damaged.” . . .

Bhattacharya appeared on Fox News to discuss [a study that was later retracted] as evidence of dangers to children, likening masks to “child abuse.” These comments were never revisited after the paper’s retraction weeks later . . .

My point here is not that Bhattacharya, Atlas, etc., got some things wrong. We all get things wrong. It’s not even that they made mistakes that were consequential or potentially so. They’re in the arena, working in an important and controversial area, and in that case there’s always the risk of screwing up. Even if some of their errors were avoidable . . . we all make avoidable errors too, sometimes.

No, what I want to focus on here is that they keep missing the opportunity to learn from their mistakes. If you first advise against vaccination, then you turn around and cite a vaccine requirement as a reason to feel comfortable, then confront that incoherence. If you first say the vaccine approval was too fast, then you turn around and characterize the vaccine as “a miraculous development”; if you cite a study and it is then retracted; . . . these are opportunities to learn!

Pointing out these contradictions is not a game of Gotcha. It’s a chance to figure out what went wrong. It’s the scientific method. And when people don’t use their incoherences to learn what went wrong with their assumptions, then . . . well, I wouldn’t quite say that they’re not doing science—I guess that “doing science” is whatever scientists do, and real-world scientists often spend a lot of time and effort avoiding coming to terms with the contradictions in their worldviews—but I will say that they’re not being the best scientists they could be. This has nothing to do with Fox News or anything like that, it’s about a frustrating (to me) practice of missing a valuable opportunity to learn.

When I come across a reversal or contradiction in my own work, I treasure it. I look at it carefully and treat it as a valuable opportunity for discovery. Anomalies are how we learn; they’re the core of science, as discussed for example here and here. And that’s more special than any number of Ted talks and TV appearances.

Webinar: Towards responsible patient-level causal inference: taking uncertainty seriously

This post is by Eric.

We are resuming our Webinar series this Thursday with Uri Shalit from Technion. You can register here.


A plethora of new methods for estimating patient-level causal effects have been proposed recently, focusing on what is technically known as (high-dimensional) conditional average effects (CATE). The intended use of many of these methods is to inform human decision-makers about the probable outcomes of possible actions, for example, clinicians choosing among different medications for a patient. For such high-stakes decisions, it is crucial for any algorithm to responsibly convey a measure of uncertainty about its output, in order to enable informed decision making on the side of the human and to avoid catastrophic errors.

We will discuss recent work where we present new methods for conveying uncertainty in CATE estimation stemming from several distinct sources: (i) finite data (ii) covariate shift (iii) violations of the overlap assumption (iv) violation of the no-hidden confounders assumption. We show how these measures of uncertainty can be used to responsibly decide when to defer decisions to experts and avoid unwarranted errors.

This is joint work with Andrew Jesson, Sören Mindermann, and Yarin Gal of Oxford University.

About the speaker

Uri Shalit is an Assistant Professor in the Faculty of Industrial Engineering and Management at the Technion. He received his Ph.D. in Machine Learning and Neural Computation from the Hebrew University in 2015. Prior to joining the Technion, Uri was a postdoctoral researcher at NYU working with Prof. David Sontag.

Uri’s research is currently focused on three subjects. The first is applying machine learning to the field of healthcare, especially in terms of providing physicians with decision support tools based on big health data. The second subject Uri is interested in is the intersection of machine learning and causal inference, especially the problem of learning individual-level effects. Finally, Uri is working on bringing ideas from causal inference into the field of machine learning, focusing on problems in robust learning, transfer learning, and interpretability.

The video is available here.

“For a research assistant, do you think there is an ethical responsibility to inform your supervisor/principal investigator if they change their analysis plan multiple times during the research project in a manner that verges on p-hacking?”

A reader who works as a research assistant and who wishes to remain anonymous writes:

I have a hypothetical question about ethics in statistics. For a research assistant, do you think there is an ethical responsibility to inform your supervisor/principal investigator if they change their analysis plan multiple times during the research project in a manner that verges on p-hacking? Or do you think that the hierarchy within this relationship places the burden on the supervisor/principal investigator and not the research assistant?

My reply:

Let me separate this into two issues:

1. The ethics of the supervisor’s behavior.

2. The ethics of the research assistant’s behavior.

1. Regarding the ethics of the supervisor’s behavior, I guess this depends a bit on the social relevance of the application area. If the supervisor is p-hacking on the way to a JPSP paper on extra-sensory perception, then I guess all that’s at stake is the integrity of science, some research funding, and the reputation of Ivy League universities, so no big deal. But if he’s p-hacking on the way to a claim about the effectiveness of some social intervention, whether it be early-childhood intervention or food labeling in schools, then there are some policy implications to exaggerating your results. Indeed, even if there’s no p-hacking at all, we can expect published estimates to be overestimates, and that seems like an ethical problem to me. Somewhere in between are claims with no direct policy implications that still have ideological implications. For example, a p-hacked claim that beautiful parents are more likely to have daughters does not directly do any damage—except to the extent that it is used as ammunition in a sexist political agenda. The most immediately dangerous cases would be manipulating an analysis to make a drug appear safer or more effective than it really is. I guess this happens all the time, and yes, it’s unethical!

2. Regarding the research assistant: I’d say, yeah, the burden is on the supervisor. I admire whistleblowers, but it’s awkward to say there’s an ethical responsibility to blow the whistle, given the possibility of retaliation by the supervisor.

Rather than saying there’s an ethical responsibility of the research assistant to blow the whistle, I’d rather say that the supervisor has an ethical responsibility to set up a working environment where it’s clear that subordinates can express their concerns without fear of retaliation, and the institution where everyone is working has the ethical responsibility to enforce that subordinates can express their concerns without fear of retaliation, and society has an ethical responsibility to enforce that institutions allow safe complaints.

Two questions then remain:

3. Is the supervisor’s behavior actually unethical here? Is it even bad statistics or bad science?

4. What should the research assistant do?

3. Is it unethical to “change an analysis plan multiple times during the research project in a manner that verges on p-hacking”? It depends. It’s unethical to change the plan and hide those changes. It’s not unethical to make these changes openly. Then comes the analysis. What’s important in the analysis is not whether it accounts for all the different plans that were not done. Rather, what’s important is that the analysis reflects the theoretical perspectives embodied in these analysis plans. For example, if the original plan says that variable A should be important, and the later plan says that variable B should be important, then the final analysis should include both variables A and B. Or if the original plan says that the effect of variable A should be positive, and the final plan says the effect should be negative, then the final analysis should respect these contradicting theoretical perspectives rather than just going with whatever noisy pattern appeared in the data. My point here is that the ethics depends not just on the data and the steps of the analysis; it also depends on the substantive theory that motivates the data collection and analysis choices.

4. I don’t know what I’d recommend the research assistant do. In similar situations I’ve suggested an indirect approach: instead of directly confronting the supervisor, make a positive suggestion that the analysis would be improved by a clearer link to the underlying substantive theory. You can also express concerns by invoking a hypothetical third party: say something like, “A reviewer might be concerned about possible forking paths in the data coding and analysis, and maybe a multiverse analysis would allay such a reviewer’s concerns.”

Wanna bet? A COVID-19 example.

This post is by Phil Price, not Andrew.

Andrew wrote a post back on September 2 that plugged a piece by Jon Zelner, Nina Masters, Ramya Naraharisetti, Sanyu Mojola, and Merlin Chowkwanyun about complexities of pandemic modeling. In the comments, someone who calls himself Anoneuoid said (referring to model projections compiled and summarized by the CDC, shown below; the link is to the most recent projections, not the ones we were talking about) “…it is a near certainty cases will be below 500k per week on Oct 1st, yet that looks like it will be below the lower bound of the ensemble 95% interval. If anyone disagrees, lets bet!”

Historic COVID cases, and model projections

I love this kind of thing. Indeed, if a friend says something is “a lock” or “almost certain” or whatever, I will often propose a bet: “if you’re so sure, you should be willing to offer 10:1 odds, so let’s do it! I’ll wager $100 to your $1000. Deal?” Most of the time, people back off on their “almost certain” claim. It seems that usually when people say they’re “sure” or “almost certain” about something they aren’t speaking literally, but instead feel that the outcome has a probability of, say, 75% or 85% or something. In any case I appreciate someone who is willing to put his money where his mouth is.

We had some back-and-forth about the betting. Perhaps I could have shamed him into offering 12:1 or 10:1 or something, which was indeed my initial proposal, but upon reflection that seemed a bit one-sided. If he thinks the probability of a specific thing happening is, say, 8%, and I think it’s 66%, it seems a bit unfair to make him bet on his odds. Why not bet on my odds instead, in which case I should be offering him odds the other way? (I said at the time: “I know little about this issue or about how much to trust the models. Just looking at the historical peaks we have had thus far, I’d guess this week or next week will be the nationwide peak and that things will fall off about half as quickly as they climbed. My central estimate for the week in question would be something like 650K new cases, but with wide enough error bars that I’d put maybe 30 or 35% of the probability below 500,000. But I’d also have a substantial chunk of probability over 1M, which you must think is pretty much nuts.”)

In the end we decided to sort of split the difference: if the number of new cases is over 500K that week, he’ll pay me $100; otherwise I pay him $34. We’ll settle about a week and a half after October 1, in case there are later updates to the numbers (due to reporting issues or whatever). Andrew asked me to do a short post about this, to “have it all in one place”, so here it is. 
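For concreteness, here is the arithmetic of the bet as agreed (a quick sketch; the probabilities are the ones quoted above):

```python
# Terms of the bet: he pays me $100 if weekly new cases exceed 500K;
# otherwise I pay him $34.
my_stake, his_stake = 34, 100

# My break-even probability for the event is 34/134, about 25%:
my_breakeven = my_stake / (my_stake + his_stake)

# I put roughly 65% on the event, so my expected gain is positive; he
# puts it near zero, so his expected gain is positive too. Differing
# probabilities are exactly what make a mutually agreeable bet possible.
my_expected_gain = 0.65 * his_stake - (1 - 0.65) * my_stake
print(round(my_breakeven, 3), round(my_expected_gain, 2))
```

So the $100-versus-$34 terms sit between our two stated probabilities, which is what “splitting the difference” amounts to here.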

This post is by Phil.


Instead of comparing two posterior distributions, just fit one model including both possible explanations of the data.

Gabriel Weindel writes:

I am a PhD student in psychology and I have a question about Bayesian statistics. I want to compare two posterior distributions of parameters estimated from a (hierarchical) cognitive model fitted on two dependent variables (hence both fits are completely separated). One fit is from a DV allegedly containing psychological processes X and Y, and the other one is from a DV that only contains X. The test is to look at whether the cognitive model ‘notices’ the removal of Y selectively in the parameter that is supposed to contain this process!

My take was to assume that, as I have access to the posterior distribution of the population parameters for both fits, I can simply compute the overlap (or equivalent) between both posterior distributions and if this overlap is high/low-to-null, conclude that there is high/low-to-no evidence that the true parameters of the fit on the two DVs are the same.

But my senior co-authors disagree with me, and reviewers probably will too, as, first, this might be wrong, and second, this obviously goes against most of the statistics used in psychology and elsewhere, where you need a criterion to decide between a null and an alternative hypothesis and where you rarely have access to a posterior distribution of the population parameter. However, to me it appears to be both the most desirable and valid solution.

Does this reasoning seem valid to you?

My quick answer is that I don’t think it makes sense to compare posterior distributions. Instead I think you should fit one larger model that includes both predictors.

Weindel responds:

I don’t see why it doesn’t make sense. We had thought about fitting a larger model but we would then add a dummy variable (DV1 = 0, DV2 = 1) and the two predictors would be highly correlated as they share a process (r = .85), wouldn’t that be a problem also?

My reply: Sure, when two predictors are highly correlated, then it’s hard from the data alone to tell them apart. That’s just the way it is!
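To see the problem quantitatively, here is a quick simulation (my own sketch, not Weindel’s actual model; the coefficients and sample size are invented): with two predictors correlated at r = .85, a joint fit is still perfectly possible, but each coefficient’s standard error is inflated by roughly sqrt(1/(1 − r²)) ≈ 1.9 relative to the uncorrelated case.

```python
import numpy as np

# Simulated data with two predictors correlated at r = 0.85; the true
# coefficients (1.0 and 0.5) are made up for illustration.
rng = np.random.default_rng(0)
n, r = 500, 0.85
x1 = rng.normal(size=n)
x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)
y = 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

# Ordinary least squares as a quick stand-in for the full Bayesian fit:
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# With uncorrelated predictors each se would be about 1/sqrt(n) ~ 0.045;
# the variance inflation factor 1/(1 - r^2) ~ 3.6 multiplies each se by ~1.9.
print(np.round(beta[1:], 2), np.round(se[1:], 3))
```

The fit still separates the two coefficients in expectation; the data are just less informative about each one individually, which is the “that’s just the way it is” point.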

Accounting for uncertainty during a pandemic

Jon Zelner, Julien Riou, Ruth Etzioni, and I write:

Just as war makes every citizen into an amateur geographer and tactician, a pandemic makes epidemiologists of us all. Instead of maps with colored pins, we have charts of exposure and death counts; people on the street argue about infection fatality rates and herd immunity the way they might have debated wartime strategies and alliances in the past. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has brought statistics and uncertainty assessment into public discourse to an extent rarely seen except in election season and the occasional billion-dollar lottery jackpot. In this paper, we reflect on our role as statisticians and epidemiologists and lay out some of the challenges that arise in measuring and communicating our uncertainty about the behavior of a never-before-seen infectious disease. We look at the problem from multiple directions, including the challenges of estimating the case fatality rate (i.e., proportion of individuals who will die from the disease), the rate of transmission from person to person, and even the number of cases circulating in the population at any time. We advocate for an approach that is more transparent about the limitations of statistical and mathematical models as representations of reality and suggest some ways to ensure better representation and communication of uncertainty in future public health emergencies.

We discuss several issues of statistical design, data collection, analysis, communication, and decision-making that have arisen in recent and ongoing coronavirus studies, focusing on tools for assessment and propagation of uncertainty. This paper does not purport to be a comprehensive survey of the research literature; rather, we use examples to illustrate statistical points that we think are important.

Here are the sections of the paper:

Statistics and uncertainty

Data and measurement quality

Design of clinical trials for treatments and vaccines

Disease transmission models

Multilevel statistical modeling

Information aggregation and decision-making

New issues keep coming up and they didn’t all make it into the article; for example we didn’t get into the confusions arising from aggregation bias. There’s always more to be said.

Bayesian hierarchical stacking: Some models are (somewhere) useful

Yuling Yao, Gregor Pirš, Aki Vehtari, and I write:

Stacking is a widely used model averaging technique that asymptotically yields optimal predictions among linear averages. We show that stacking is most effective when model predictive performance is heterogeneous in inputs, and we can further improve the stacked mixture with a hierarchical model. We generalize stacking to Bayesian hierarchical stacking. The model weights are varying as a function of data, partially-pooled, and inferred using Bayesian inference. We further incorporate discrete and continuous inputs, other structured priors, and time series and longitudinal data. To verify the performance gain of the proposed method, we derive theory bounds, and demonstrate on several applied problems.

What I really want you to do is read section 3.1, All models are wrong, but some are somewhere useful, where Yuling describes how and why a mixture of wrong models can give better predictive performance than a correct model, even if such a correct model is included in the set of candidates.

The background of this paper is that for a long time it’s been clear that when doing predictive model averaging, it can make sense to have the weights vary by x (see for example figure 12 of this paper)—as Yuling notes, different models can be good at explaining different regions in the input-response space, which is why model averaging can work better than model selection—but it’s tricky because if you let the weights vary without regularization, you’ll just overfit.
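
Here’s a toy numpy sketch of that core idea (my own construction, not the paper’s actual model): two wrong models, each good in a different region of input space, and a logistic weight function in x that beats any constant stacking weight, with a knob that pools toward the constant weight:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the true curve is |x|. Model A predicts -x (good for x < 0),
# model B predicts x (good for x > 0); neither is right everywhere.
x = rng.uniform(-1, 1, 400)
y = np.abs(x) + rng.normal(0, 0.1, x.size)
pred_a, pred_b = -x, x

def stacked_mse(w_a):
    """MSE of the constant mixture w_a * A + (1 - w_a) * B."""
    return np.mean((y - (w_a * pred_a + (1 - w_a) * pred_b)) ** 2)

# Constant stacking weight, chosen by grid search:
best_const = min(stacked_mse(w) for w in np.linspace(0, 1, 101))

# Input-varying weight: sigmoid(-k * x), which pools toward the constant
# weight 0.5 as k -> 0 (k plays the role of a regularization knob here).
def varying_mse(k):
    w_a = 1 / (1 + np.exp(k * x))  # near 1 for x << 0, near 0 for x >> 0
    return np.mean((y - (w_a * pred_a + (1 - w_a) * pred_b)) ** 2)

best_vary = min(varying_mse(k) for k in np.linspace(0, 20, 101))
```

On this toy problem the varying-weight mixture recovers |x| almost exactly while the best constant mixture cannot, which is the “different models are good in different regions” story in miniature; the Bayesian hierarchical version in the paper does the pooling with priors rather than a grid search.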

Also of interest is section 3.2, on the conditions under which stacking is most helpful. The work here is reminiscent of ideas from psychometrics, where if we want to average several tests to measure some ability, you don’t want tests whose scores are 100% correlated, but it also would not make sense for the tests to have zero correlation. Like many things in statistics, it’s subtle.
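
The psychometric intuition can be made concrete with the standard identity for the variance of an average of equally correlated scores (a textbook formula, not something from the paper):

```python
# Variance of the average of n test scores, each with variance 1 and common
# pairwise correlation rho (a standard identity): (1 + (n - 1) * rho) / n.
def var_of_mean(n, rho):
    return (1 + (n - 1) * rho) / n

n = 5
fully_redundant = var_of_mean(n, 1.0)  # identical tests: averaging gains nothing
independent = var_of_mean(n, 0.0)      # maximal noise reduction, but tests with
                                       # zero correlation can't all be measuring
                                       # the same underlying ability
moderate = var_of_mean(n, 0.3)         # the useful middle ground
```

At rho = 1 averaging buys you nothing; at rho = 0 the variance drops by a factor of n, but then the tests aren’t measuring a common ability; the interesting regime is in between.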

This paper combines lots of interesting ideas, including
– Stacking (Wolpert, 1992, Breiman, 1996), an early ensemble learning method;
– Bayesian leave-one-out cross-validation, an idea that Aki and his collaborators have been thinking about since at least 2002;
– Treating a computational problem as a statistics problem, an idea I associate with Xiao-Li Meng and his collaborators (see for example this 2003 paper by Kong, McCullagh, Meng, Nicolae, and Tan);
– Statistical workflow and model understanding (“interpretable AI”);
– Flexible Bayesian modeling (it’s all about the prior for the vector of weights), which is where lots of the technical difficulties arise (and which we can easily do, thanks to Stan);
– And, of course, multilevel modeling: good enough for Laplace, good enough for us!

It’s a pleasure to announce two such exciting papers in the same week, and I’m very lucky in my collaborators (which in turn comes from being well situated at a top university, having the time to participate in all this work, living in a country that offers generous government support for research, etc etc).

Incentives and test performance

Josh Miller points to some references:

Measuring Success in Education: The Role of Effort on the Test Itself by Gneezy et al.

When and Why Incentives (Don’t) Work to Modify Behavior by Gneezy et al.

Behavioral Economics and Psychology of Incentives by Emir Kamenica.

I have no comments on these particular articles, just wanted to post this placeholder so I can refer to it for a classroom demonstration.

Struggling to estimate the effects of policies on coronavirus outcomes

Philippe Lemoine writes:

I published a blog post in which I reanalyze the results of Chernozhukov et al. (2021) on the effects of NPIs in the US during the first wave of the pandemic and, if you have time to take a look at it, I’d be curious to hear your thoughts.

Here is a summary that recaps the main points:

– The effects of non-pharmaceutical interventions on the COVID-19 pandemic are very difficult to evaluate. In particular, most studies on the issue fail to adequately take into account the fact that people voluntarily change their behavior in response to changes in epidemic conditions, which can reduce transmission independently of non-pharmaceutical interventions and confound the effect of non-pharmaceutical interventions.

– Chernozhukov et al. (2021) is unusually mindful of this problem and the authors tried to control for the effect of voluntary behavioral changes. They found that, even when you take that into account, non-pharmaceutical interventions led to a substantial reduction in cases and deaths during the first wave in the US.

– However, their conclusions rest on dubious assumptions, and are very sensitive to reasonable changes in the specification of the model. When the same analysis is performed on a broad range of plausible specifications of the model, none of the effects are robust. This is true even for their headline result about the effect of mandating face masks for employees of public-facing businesses.

– Another reason to regard even this result as dubious is that, when the same analysis is performed to evaluate the effect of mandating face masks for everyone and not just employees of public-facing businesses, the effect totally disappears and is even positive in many specifications. The authors collected data on this broader policy, so they could have performed this analysis in the paper, but they failed to do so despite speculating in the paper that mandating face masks for everyone could have a much larger effect than just mandating them for employees.

– This suggests that something is wrong with the kind of model Chernozhukov et al. used to evaluate the effects of non-pharmaceutical interventions. In order to investigate this issue, I fit a much simpler version of this model on simulated data and find that, even in very favorable conditions, the model performs extremely poorly. I also show with placebo tests that it can easily find spurious effects. This is a problem not just for this particular study, but for any study that relies on that kind of model to study the effects of non-pharmaceutical interventions.

– To be clear, as I stress in the conclusion, this doesn’t mean that mask-wearing doesn’t reduce transmission, because this paper evaluated the effect of mandating mask wearing, which is not the same thing. It may be that, as another study recently found (though I have no idea how good this paper is), mandates don’t really matter because people who are going to wear masks do so even if they’re not legally required to do so.

Anyway, since you disagreed with my harsh take on Flaxman et al.’s paper about the effects of NPIs in Europe during the first wave, I was curious to know your thoughts about this other study.
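
Lemoine’s placebo-test point is easy to demonstrate with a toy version of the problem (my own construction, not his model): if the outcome has strong trends, here a random walk, and you naively compare before vs. after a fake “policy” date, you will find “significant” effects far more often than the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(42)

def placebo_zstat():
    """Naive before/after z-statistic on a trending series with a fake
    'policy' starting at a random date; there is no real effect at all."""
    t = 100
    y = np.cumsum(rng.normal(0, 1, t))   # random-walk outcome (strong trend)
    start = int(rng.integers(30, 70))    # fake policy date
    before, after = y[:start], y[start:]
    diff = after.mean() - before.mean()
    # Standard error that (wrongly) treats the observations as independent:
    se = np.sqrt(before.var(ddof=1) / before.size
                 + after.var(ddof=1) / after.size)
    return abs(diff / se)

# Share of placebo runs declared "significant" at the nominal 5% level:
false_positive_rate = np.mean([placebo_zstat() > 1.96 for _ in range(500)])
```

The false positive rate comes out far above 0.05, which is the generic reason placebo tests are worth running on any observational policy analysis.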

I replied that I agree with Lemoine’s general point that it’s very hard to untangle the effects of any particular policy, given that so much depends on behavior. Another complication is the desire for definitive results. From the other direction, I see the value of quantitative analyses, as some policy choices need to be made.

Lemoine responded:

On the need to make policy choices and what it means for what should be done with quantitative analyses, I think it’s a very complicated issue. I was a hawk on COVID-19 before it was cool and, back in March, I was in favor of the first lockdown. I changed my mind after that because I became convinced that, whatever their precise effects (I think it’s impossible to estimate them with anything resembling precision), they couldn’t be huge otherwise we’d see it much more easily (as with vaccination) and they generally needed to be huge in order to have a chance of passing a cost-benefit test. One reason I came to deeply regret my initial support for lockdowns is that I have since then realized they have become a sort of institutionalized default response, which is something I think I should have predicted but didn’t, so this has taught me the wisdom of requiring a much higher level of confidence in social scientific results before acting on them. (I’m French and here we have been under a curfew and bars/restaurants have remained completely closed between last October and May of this year!)

In response to my question about what exactly was meant by “lockdown,” Lemoine pointed to his post arguing against lockdowns and added:

I [Lemoine] think it has been a problem in those debates on both sides, but it’s not really a problem in Chernozhukov et al. (2021) since they look at pretty specific policies. My impression is that, when people talk about “lockdowns”, they have in mind a vague set of particularly stringent restrictions such as curfews, closure of “non-essential businesses” and stay-at-home orders. In any case, this is what I’m referring to when I use this term, though in my work I usually talk about “restrictions” and state my position as the claim that, whatever the precise effects of the most stringent restrictions (again, things like curfews, closure of “non-essential businesses” and stay-at-home orders) are, they are not plausibly large enough for those policies to pass a cost-benefit test when you take into account their immediate effects on people’s well-being, because even when I make preposterous assumptions about their effects on transmission and do a back-of-the-envelope cost-benefit analysis the results come out as incredibly lopsided against those policies. This is still vague but I think not too vague. In particular, I don’t think mask mandates of any kind count as “lockdowns”, nor do I think that anyone does, even the fiercest opponents of those mandates.

I did not have the energy to read Chernozhukov et al.’s paper or Lemoine’s criticism in detail, but as noted above I am sympathetic with Lemoine’s general point that it is difficult to untangle causal effects of policies—and this difficulty persists even if, like Chernozhukov et al., you are fully aware of these difficulties and trying your best to address them. We had a similar discussion a few years ago regarding the deterrent effect of the death penalty, a topic that has seen many quantitative studies of varying quality but which, as Donohue and Wolfers explained, is pretty much impossible to figure out from empirical data.

Effects of policies on disease spread should be easier to estimate, as the causal mechanism is much clearer, but we still have the problem of multiple interventions done at the same time, interventions motivated by existing conditions (which can be addressed statistically, but results will necessarily be sensitive to details of how the adjustment is done), effects that vary from one jurisdiction to another, and unclear relationships between behavior and policy. For example, when they closed the schools here in New York City, lots of parents were pulling their kids out of school and lots of teachers were not planning to keep showing up, so the school closing could be thought of as a coordination policy as much as a mandate. And then there are annoying policies such as closing parks and beaches, which nobody really thinks would have much effect on disease spread but which represent some sort of signal of seriousness. And then there is the really big thing, which is people lowering the spread of disease by avoiding social situations, avoiding talking into each other’s faces, etc. From a policy standpoint it’s hard for me to hold all this in my head at once, especially because I’m really looking forward to teaching in person this fall, masked or otherwise.
One of the points of a statistical analysis is to be able to integrate different sources of information—a multivariate probability distribution can “hold all this in its head at once” even when I can’t . . . ummm, at this point I’m just babbling. Speaking as a statistician, let me just say that it’s important to see the trail of breadcrumbs showing how the conclusions came from the data, scientific assumptions, and statistical model, starting from simple comparisons and then doing adjustments from there. I think the sorts of analyses of Chernozhukov et al. and Lemoine should be helpful in taking us in this direction.

P.S. Ethan Bolker shares this letter he sent to the Notices of the American Mathematical Society which he thought would be relevant to our discussion:

‘No regulatory body uses Bayesian statistics to make decisions’

This post is by Lizzie. I also took the kitten photo — there’s a white paw taking up much of the foreground and a little gray tail in the background. As this post is about uncertainty, I thought maybe it worked.

I was back east for work in June, drifting from Boston to Hanover, New Hampshire and seeing a couple colleagues along the way. These meetings were always outside, often in the early evenings, and so they sit in my mind with the lovely luster of nice spring weather in the northeast, with the sun glinting in at just the right angle.

One meeting was sitting on a little sloping patch of grass in a backyard in Arlington, where I was chatting with a former postdoc, who now works for a consulting company tightly intertwined with US government. When he was in my lab he and I learned Bayesian statistics (and Stan), and I asked him how much he was using Bayesian approaches. He smiled slyly at me and told me a story about a recent meeting he was at where one of the senior people said:

“No regulatory body uses Bayesian statistics to make decisions.”

He quickly added that he’s not at all sure this is true, but that it encapsulates a perspective that is not uncommon in his world.

The next meeting was next to the Connecticut river and with a senior ecologist, who works on issues with some real policy implications: how to manage beetle populations as they take off for the north with warming (hello, or should I say goodbye, New Jersey pine barrens), the thawing Arctic, and more. I was asking him if he thought this statement was true, which he didn’t answer, but set off on a different declaratory statement:

“The problem with Bayesian statistics is their emphasis on uncertainty.”

Ah. Uncertainty. Do you think uncertainty is the most commonly used word in the title of blog posts here? (Some recent posts here, here and here.)

In response to my colleague I may have blurted out something like ‘but I love uncertainty!’ or ‘that is a great thing about Bayesian!’ and so the conversation veered deeply into a ditch, from which I am not sure that it ever recovered. I said something along the lines of, isn’t it better to have all that uncertainty out in the middle of the room? Rather than trying to fit it in under the cushions of the sofa, as I feel so many ecologists do when they do their models in sequential steps, dropping off uncertainty along the way (often using p-values or delta-AIC values of 2 or…) to drive ahead to their imaginary land of near-certainty? (I know at some point I also poorly steered it towards my thoughts on whether climate change scientists have done themselves a service or disservice in shying away from communicating uncertainty; I regret that.)

We left mired in the muck of what so many of the ecologists around me feel about Bayesian methods: too much emphasis on uncertainty, too little concrete information that could lead to decision-making.

So I pose this back to you all: what should I have said in response to either of these remarks? I am looking for excellent information, and persuasive viewpoints.

I’ll open the floor with what I thought was a good reply from Michael Betancourt to the first quote: fisheries, where Bayesian methods give better options to steer policy. For example, if you want maximum sustainable yield without crashing a fish stock, you can more easily suggest a quantile of catch that puts you a little more firmly in the ‘non-crashing’ outcome.
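
Here’s a minimal sketch of that fisheries point as I understand it, with all numbers hypothetical: given posterior draws for the maximum sustainable yield, a regulator can set the quota at a low posterior quantile rather than at the point estimate, directly controlling the risk of overshooting:

```python
import random

random.seed(7)

# Hypothetical posterior draws for maximum sustainable yield (tons); in a
# real application these would come from a fitted fish-population model.
msy_draws = sorted(random.gauss(1000, 150) for _ in range(4000))

point_estimate = sum(msy_draws) / len(msy_draws)  # roughly the posterior mean
cautious_quota = msy_draws[len(msy_draws) // 10]  # 10th posterior percentile

def crash_risk(quota):
    """Posterior probability that the quota exceeds the true (uncertain) MSY."""
    return sum(draw < quota for draw in msy_draws) / len(msy_draws)
```

Setting the quota at the point estimate means roughly a coin flip of exceeding the true MSY; the quantile-based quota caps that risk at 10% by construction, which is exactly the kind of decision a full posterior makes easy.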

Not-so-recently in the sister blog

The role of covariation versus mechanism information in causal attribution:

Traditional approaches to causal attribution propose that information about covariation of factors is used to identify causes of events. In contrast, we present a series of studies showing that people seek out and prefer information about causal mechanisms rather than information about covariation. . . . The subjects tended to seek out information that would provide evidence for or against hypotheses about underlying mechanisms. When asked to provide causes, the subjects’ descriptions were also based on causal mechanisms. . . . We conclude that people do not treat the task of causal attribution as one of identifying a novel causal relationship between arbitrary factors by relying solely on covariation information. Rather, people attempt to seek out causal mechanisms in developing a causal explanation for a specific event.


This finding is supportive of Judea Pearl’s attitude that causal relationships, rather than statistical relationships, are how we understand the world and how we think about evidence. And it’s also supportive of my attitude that we should think about causation in terms of mechanisms rather using black-box reasoning based on identification strategies.

Webinar: A Gaussian Process Model for Response Time in Conjoint Surveys

This post is by Eric.

This Wednesday, at 11:30 am ET, Elea Feit is stopping by to talk to us about her recent work on Conjoint models fit using GPs. You can register here.


Choice-based conjoint analysis is a widely-used technique for assessing consumer preferences. By observing how customers choose between alternatives with varying attributes, consumers’ preferences for the attributes can be inferred. When one alternative is chosen over the others, we know that the decision-maker perceived this option to have higher utility compared to the unchosen options. In addition to observing the choice that a customer makes, we can also observe the response time for each task. Building on extant literature, we propose a Gaussian Process model that relates response time to four features of the choice task (question number, alternative difference, alternative attractiveness, and attribute difference). We discuss the nonlinear relationships between these four features and response time and show that incorporating response time into the choice model provides us with a better understanding of individual preferences and improves our ability to predict choices.
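
For readers unfamiliar with Gaussian Processes, here is a bare-bones GP regression sketch on fake response-time data. This is generic GP machinery with invented numbers, not Feit’s actual model, which uses four task features and ties response time into a full choice model:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(a, b, length=3.0, amp=16.0):
    """Squared-exponential kernel matrix between input vectors a and b."""
    return amp * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# Fake data: response time (seconds) tends to drop over successive questions
# as respondents learn the task; the GP needs no parametric form for this.
q = np.arange(1.0, 16.0)                          # question number 1..15
t_obs = 10 * np.exp(-q / 6) + rng.normal(0, 0.3, q.size)

noise_var = 0.3 ** 2
K = rbf(q, q) + noise_var * np.eye(q.size)        # kernel + observation noise
q_new = np.linspace(1, 15, 50)
post_mean = rbf(q_new, q) @ np.linalg.solve(K, t_obs)  # GP posterior mean
```

The appeal for this application is exactly the nonlinearity mentioned in the abstract: the GP recovers the declining response-time curve without having to commit to a functional form in advance.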

About the speaker

Elea Feit is an Associate Professor of Marketing at Drexel University. Prior to joining Drexel, she spent most of her career at the boundary between academia and industry, including positions at General Motors Research, The Modellers, and Wharton Customer Analytics. Her work is inspired by the decision problems that marketers face and she has published research on using randomized experiments to measure advertising incrementality and using conjoint analysis to design new products. Methodologically, she is a Bayesian with expertise in hierarchical models, experimental design, missing data, data fusion, and decision theory. She is also the co-author of R for Marketing Research and Analytics. More at

Video recording is available here.

Several postdoc, research fellow, and doctoral student positions in Aalto/Helsinki, Finland

This job ad is by Aki

Aalto University, University of Helsinki, and Finnish Center for Artificial Intelligence have a great probabilistic modeling community, and we’re looking for several postdocs, research fellows and doctoral students with topics including a lot of Bayesian statistics.

I’m looking for a postdoc and a doctoral student to work on Bayesian workflow, and a postdoc to work on AI-assisted modeling.

Other Bayesian-flavored topics (I’m also involved in some of these) are:

  • AI-assisted design and decisions: from foundations to practice
  • AI-assisted design of experiments and interventions
  • Advanced user models
  • Virtual atmospheric laboratory
  • Variable selection with missing data with applications to genetics
  • Bayesian machine learning for sensing
  • Deep learning with differential equations
  • Deep generative modeling for precision medicine and future clinical trials
  • Probabilistic modelling and Bayesian machine learning
  • Physics-inspired geometric deep representation learning for drug design
  • Probabilistic modelling for collaborative human-in-the-loop design
  • Machine Learning for Health
  • Statistical Genetics and Machine Learning
  • Bayesian machine learning and differential privacy

More information about the topics and how to apply

And here’s a photo of Finnish summer: a sunset at 11pm, +25°C.

“Test & Roll: Profit-Maximizing A/B Tests” by Feit and Berman

Elea McDonnell Feit and Ron Berman write:

Marketers often use A/B testing as a tool to compare marketing treatments in a test stage and then deploy the better-performing treatment to the remainder of the consumer population. While these tests have traditionally been analyzed using hypothesis testing, we re-frame them as an explicit trade-off between the opportunity cost of the test (where some customers receive a sub-optimal treatment) and the potential losses associated with deploying a sub-optimal treatment to the remainder of the population.

We derive a closed-form expression for the profit-maximizing test size and show that it is substantially smaller than typically recommended for a hypothesis test, particularly when the response is noisy or when the total population is small. The common practice of using small holdout groups can be rationalized by asymmetric priors. The proposed test design achieves nearly the same expected regret as the flexible, yet harder-to-implement multi-armed bandit under a wide range of conditions.

We [Feit and Berman] demonstrate the benefits of the method in three different marketing contexts—website design, display advertising and catalog tests—in which we estimate priors from past data. In all three cases, the optimal sample sizes are substantially smaller than for a traditional hypothesis test, resulting in higher profit.

I’ve not read the paper in detail, but the basic idea makes a lot of sense to me.

I’m not an expert on this literature. I heard about this particular article from a blog comment today. You readers will perhaps have more to say about the topic.
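
The trade-off described in the abstract is easy to explore numerically. Here’s a Monte Carlo sketch under symmetric normal priors (my own toy setup, not the paper’s closed-form solution): test n customers per arm, deploy the apparent winner to everyone else, and see how expected profit varies with n:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10_000    # total customer population (hypothetical)
s = 1.0       # sd of an individual response (hypothetical)
sigma = 0.05  # prior sd of each arm's mean response (hypothetical)
mu = 0.5      # prior mean response

def expected_profit(n, sims=4000):
    """Monte Carlo expected total response: test n customers per arm,
    then deploy the arm with the higher test mean to the remaining N - 2n."""
    m = rng.normal(mu, sigma, (sims, 2))                  # true arm means
    xbar = m + rng.normal(0, s / np.sqrt(n), (sims, 2))   # test estimates
    winner = m[np.arange(sims), xbar.argmax(axis=1)]      # deployed arm's mean
    return (n * m.sum(axis=1) + (N - 2 * n) * winner).mean()

sizes = [10, 50, 100, 500, 1000, 2500]
profits = {n: expected_profit(n) for n in sizes}
best_n = max(profits, key=profits.get)  # profit-maximizing size on this grid
```

The curve is hump-shaped: tiny tests pick the wrong arm too often, huge tests waste too many customers on the losing treatment, and the profit-maximizing size sits in between, governed by the signal-to-noise ratio s/sigma and the population size N rather than by any significance threshold.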

Against either/or thinking, part 978

This one’s no big deal but it annoys me nonetheless. From Andrew Ross Sorkin in the New York Times:

There will be academic case studies on the mania around GameStop’s stock. There will be philosophical debates about whether this was a genuine protest against hedge funds and inequality or a pump-and-dump scheme masquerading as a moral crusade. Eventually, we will learn whether this was a transformational moment powered by social media that will shift the investing landscape forever, or a short-term blip that soon fades away.

Two things:

1. “Whether this was a genuine protest against hedge funds and inequality or a pump-and-dump scheme masquerading as a moral crusade”: Why can’t it be both, and other things as well, most notably the always potent mix of greed and fomo?

2. “Whether this was a transformational moment powered by social media that will shift the investing landscape forever, or a short-term blip that soon fades away”: Why can’t it be something in between?

In summary, either/or thinking has big problems. By saying “either A or B,” you’re ignoring several other possibilities:

– Neither A nor B

– Both A and B

– Something in between A and B.

We’ve discussed this many times before; see for example here or, for a more academically-focused example, page 962 of this article.

I’m not trying to slam Sorkin here; I know that journalists are busy, and I’m sure that in using the “whether A or B” formulation, he’s not trying to exclude these other possibilities. But I also think that words have implications and that expressing things dichotomously in this way can lead us into misunderstandings. I don’t have evidence for that claim of mine—maybe something like it has been studied in the judgment and decision making literature—it just makes sense to me.

She’s thinking of buying a house, but it has a high radon measurement. What should she do?

Someone wrote in with a question:

My Mom, who has health issues, is about to close on a new house in **, NJ. We just saw that ** generally is listed as an area with high radon. If the house has a radon measurement over 4 and the seller puts vents to bring it into compliance, how likely is it that the level will return to 4 shortly thereafter?

Also is 4 a safe amount? I understand that is the EPA guideline while the World Health Organization suggests 2.7. Which level do you consider appropriate?

I forwarded this to my colleague Phil Price (known as “Phil” to you blog readers), who worked for many years as a scientist in the Indoor Environment Department at Lawrence Berkeley National Laboratory. Phil replied:

Unfortunately nobody knows exactly how dangerous it is to live in a house with a radon concentration around 4 pCi/L. Different countries have standards as low as 2 and as high as 10. It’s clear that at levels as high as 12 or 20 pCi/L the residents are at substantially increased risk of lung cancer, but at lower concentrations the danger is not high enough to stick out definitively from the noise, so it’s really hard to know how to extrapolate.

Here are a few facts and observations that might help you decide for yourself, and then I’ll follow up with my personal recommendation which is based on what I [Phil] would do and is not necessarily a recommendation that is good for you or your mom.

Actually, let me start with an answer to your question: if the house has acceptable radon concentrations now (averaged over a full year), then that is unlikely to change spontaneously over time; however, changes to the house or its operation could change things, e.g. if your mom starts heating or cooling parts of the house that have previously been unoccupied. On shorter timescales, the radon concentration is always varying, so a measurement made at one moment may not reflect the long-term average.

Now, on to those facts I mentioned:
1. The relevant number is the radon concentration in the living areas of the home. That seems obvious, but a lot of measurements are made in unfinished basements or in other areas where people don’t spend much time, such as a crawlspace under the house.
2. Also, the relevant number is the radon concentration averaged over a long period of time — months or years — not the concentration at a given moment.
3. The radon concentration in a basement is normally substantially higher than on the ground floor or above.
4. The indoor radon concentration can vary a lot with time: it might be twice as high in winter as in summer, on average, and it might be four times as high in the highest winter week as it is in the lowest.
5. For most homes, radon mitigation can be performed for under $2500 that will work for many years; you can google “radon sub slab depressurization” to read about this.

Finally, I’ll tell you what I would do. But my choice would not just depend on facts about radon, but also on my personality, so this might not be what your mom should do.

I would buy the house and move in. I would perform a year-long radon test on the lowest level of the home in which I spend time (for instance, the basement if I spent time in a hobby room down there or something, otherwise the ground floor) and after a year I would check the results and hire a radon mitigation company if the result was higher than I thought was safe. I think the 4 pCi/L level is reasonable for making that decision, but if it came back at 3.6 pCi/L or something, maybe I would mitigate even though that is below the ‘action level’.

The idea of waiting that long, knowing that for the whole year you’re being exposed at above the recommended concentration, would freak some people out. If I felt that way, then in addition to the long-term test I might do a short-term test every few months, and if any of the results were really high, I’d hire a mitigation company. But even if the long-term average is fine, the measurement over a short period might be pretty high, so I wouldn’t be too bothered by a single short-term test coming in at 4 or 6 pCi/L, I’d just wait until I had the long-term result. Again, that’s just me. Here is a company that offers various combinations of short- and long-term tests. I have no relationship with them whatsoever, I just looked them up on the internet like anyone could do.

Finally: your mom could consider calling a radon mitigation company and seeing what they say. Most likely they will say they can mitigate just about any house to below 4 pCi/L in the living area; they might even offer a guarantee. Then she could buy the house and go ahead and have a sub-slab depressurization system installed. Odds are pretty good that that would be a waste of a few thousand dollars, but it would give her peace of mind and if it lets her buy a house she likes, it might well be worth it.
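
Phil’s points 2 and 4 can be illustrated with a quick simulation of a hypothetical house whose annual average is below the 4 pCi/L action level but whose worst winter weeks are well above it. All parameters here are invented for illustration:

```python
import math
import random

random.seed(0)

# Hypothetical house: annual-average radon 3.2 pCi/L, a strong seasonal
# swing (higher in winter), and week-to-week noise on top. All invented.
annual_mean = 3.2
weekly = []
for week in range(52):
    seasonal = 1.6 * math.cos(2 * math.pi * week / 52)  # peaks in winter
    weekly.append(max(0.0, annual_mean + seasonal + random.gauss(0, 0.5)))

long_term = sum(weekly) / len(weekly)  # what a year-long test would report
worst_short = max(weekly)              # what a single bad week could show
```

This is why a single short-term reading of 4 or 6 pCi/L isn’t, by itself, cause for alarm: the long-term average, which is what matters for health, can still be comfortably below the action level.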