The Bayesian cringe

I used this expression the other day in Lauren’s seminar and she told me she’d never heard it before, which surprised me because I feel like I’ve been saying it for a while, so I googled *statmodeling bayesian cringe* but nothing showed up! So I guess I should write it up.

Eventually everything makes its way from conversation to blog to publication. For example, the earliest appearance I can find of “Cantor’s corner” is here, but I’d been using that phrase for a while before then, and it ultimately appeared in print (using the original ASCII art!) in this article in a physics journal.

So . . . the Bayesian cringe is this attitude that many Bayesian statisticians, including me, have had, in which we’re embarrassed to use prior information. We bend over backward to assure people that we’re estimating all our hyperparameters from the data alone, we say that Bayesian statistics is the quantification of uncertainty, and we don’t talk much about priors at all except as a mathematical construct. The priors we use are typically structural—not in the sense of “structural equation models,” but in the sense that the priors encode structure about the model rather than particular numerical values. An example is the 8 schools model—actually, everything in chapter 5 of BDA—where we use improper priors on hyperparameters and never assign any numerical or substantive prior information.

The Bayesian cringe comes from the attitude that non-Bayesian methods are the default and that we should only use Bayesian approaches when we have very good reasons—and even that isn’t considered enough sometimes, as discussed in section 3 of this article. So that’s led us to emphasize innocuous aspects of Bayesian inference. Now, don’t get me wrong, I think there are virtues to flat-prior Bayesian inference too. Not always—sometimes the maximum likelihood estimate is better, as in some multidimensional problems where the flat prior is actually very strong (see section 3 of this article) or just because it’s a mistake to take a posterior distribution too seriously if it comes from an unrealistic prior (see section 3 here)—but for the reasons given in BDA, I typically think that flat-prior Bayes is a step forward.

But I keep coming across problems where a little prior information really helps—see for example Section 5.9 here, it’s one of my favorite examples—and more and more I’ve been thinking that it makes sense to just start with a strong prior. Instead of starting with the default flat or super-weak prior, start with a strong prior (as here) and then retreat if you have prior information saying that this prior is too strong, that effects really could be huge or whatever.
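
To see how much work even a modest prior does, here's a minimal sketch in Python. It assumes a single noisy estimate with known standard error and a normal prior centered at zero; the numbers and the function name are made up for illustration. This is just the textbook normal-normal update, not code from any of the linked examples.

```python
def normal_posterior(y, se, prior_mean, prior_sd):
    """Posterior mean and sd for an estimate y +/- se under a normal prior."""
    # Under the normal-normal model, precisions (inverse variances) add.
    data_prec = 1.0 / se**2
    prior_prec = 1.0 / prior_sd**2
    post_var = 1.0 / (data_prec + prior_prec)
    # The posterior mean is a precision-weighted average of data and prior.
    post_mean = post_var * (data_prec * y + prior_prec * prior_mean)
    return post_mean, post_var**0.5

# Hypothetical noisy study: estimate 20 with standard error 10.
estimate, se = 20.0, 10.0
for prior_sd in (1000.0, 10.0, 5.0):  # nearly flat, moderate, strong
    mean, sd = normal_posterior(estimate, se, prior_mean=0.0, prior_sd=prior_sd)
    print(f"prior sd {prior_sd:7.1f}: posterior mean {mean:5.1f}, sd {sd:4.1f}")
```

With the nearly flat prior the estimate barely moves; with the strong prior it gets pulled most of the way toward zero. If you have reason to think effects really could be huge, you widen the prior and retreat from there.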

As the years have gone on, I’ve become more and more sympathetic with the attitude of Dennis Lindley. As a student I’d read his contributions to discussions in statistics journals and think, jeez what an extremist, he just says the same damn thing over and over. But now I’m like, yeah, informative priors, cut the crap, let’s go baby. As I wrote in 2009, I suspect I’d agree with Lindley on just about any issue of statistical theory and practice. I’ve read some of Lindley’s old articles and contributions to discussions and, even when he seemed like something of an extremist at the time, in retrospect he always seems to be correct.

One way we’ve moved away from the Bayesian cringe is by using the terminology of regularization. Remember how I said that lasso (and, more recently, deep nets) have made the world safe for regularization? And how I said that Bayesian inference is not radical but conservative (sorry, Lindley)? When we talk about regularization, we’re saying that this kind of partial-pooling-toward-the-prior is desirable in itself. Rather than being a regrettable concession to bias that we accept in order to control our mean squared error, we argue that stability is a goal in itself. (Conservative, you see?)

We’re not completely over the Bayesian cringe—look at just about any regression published in a political science journal, and within econometrics there are still some old-school firebreathers of the anti-Bayesian type—but I think we’re gradually moving toward a general idea that it’s best to use all available information, with informative priors being one way to induce stability and thus allow us to fit more complicated, realistic, and better-predicting models.

“As you all know, first prize is a Cadillac El Dorado. Anyone wanna see second prize? Second prize is you’re governor of California. Third prize is you’re fired.”

Ethan Steinberg writes:

I thought you might find this funny given your past blog posts about related subjects (Iran & Benford’s law, preregistration, etc).

Announcing that an election is fraudulent due to Benford’s law on data that doesn’t even exist yet seems like the perfect encapsulation of crazy research.

The link points to this news article, “As Newsom leads California recall polls, Larry Elder pushes baseless fraud claims,” which reports:

Republican Larry Elder appealed on Monday to his supporters to use an online form to report fraud, which claimed it had “detected fraud” in the “results” of the California recall election “resulting in Governor Gavin Newsom being reinstated as governor.”

The only problem: On Monday when the link was live on Elder’s campaign site, the election hadn’t even happened yet. . . .

“Statistical analyses used to detect fraud in elections held in 3rd-world nations (such as Russia, Venezuela, and Iran) have detected fraud in California resulting in Governor Gavin Newsom being reinstated as governor,” the site reads. “The primary analytical tool used was Benford’s Law and can be readily reproduced.”

The site added on Monday afternoon a disclaimer saying it was “Paid For By Larry Elder Ballot Measure Committee Recall Newsom Committee,” with major funding from Elder’s gubernatorial campaign. . . .

Elder’s campaign homepage under the title “stop fraud” links to the website. The site solicits donations for him and asks supporters to sign a petition “demanding a special session of the California legislature to investigate and ameliorate the twisted results of this 2021 Recall Election.”

The site seems to presuppose the outcome of the race, claiming fraud has already been detected in the election and that Newsom won — even though Election Day is not until Tuesday.

The page suggests voters may turn to the “ammo box” if they can’t trust the ballot box. . . .

Kinda funny, also kinda scary.

Here’s the webpage linked from Elder’s campaign site:

Ya gotta love the legalese at the bottom of the page. They’re willing to brazenly lie about hypothetical fraud and make up documentation on votes that haven’t yet been tallied, and they’re threatening to shoot you if you don’t accept their fabricated claims—but they still have time for the legal mumbo-jumbo.

But I didn’t see anything there about Benford’s Law. So I went a few days back in time on the Internet Archive and found it:

I guess they didn’t run the text by Putin. Calling Russia a “3rd-world nation” . . . that’s so insulting!

I eagerly await their Benford’s Law analysis. Funny that it “can be readily reproduced” even though it hasn’t been produced the first time. Kind of a paradox of reproducibility.
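
For reference, here's what a first-digit Benford check actually computes, as a minimal sketch in Python. The vote counts are invented for illustration, and nothing here is the campaign's analysis (which, again, didn't exist yet). Benford's law says that leading digit d shows up with probability log10(1 + 1/d); whether precinct vote totals should follow that pattern is a separate question, and a mismatch on its own doesn't demonstrate fraud.

```python
import math
from collections import Counter

def benford_expected():
    """Benford's law: P(leading digit = d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_shares(counts):
    """Observed share of each leading digit among the positive integers given."""
    digits = [int(str(n)[0]) for n in counts if n > 0]
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}

# Hypothetical precinct totals, made up for illustration only.
made_up_counts = [132, 1874, 205, 96, 1410, 318, 77, 1203, 540, 2611]
expected = benford_expected()
observed = leading_digit_shares(made_up_counts)
for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

Note that the comparison needs actual tallied votes as input, which is the step that was missing.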

P.S. Elder has some endorsements:

David Mamet!

Hey . . . wasn’t there a Benford’s Law scheme in The Spanish Prisoner? The sequel will be awesome. Eastwood directing, Mamet with the script, and Norris can still do his own stunts, no? I don’t know if they’ll be able to get Alec Baldwin on board, though; I’ve heard he’s a Democrat.

Public opinion on vaccine mandates etc.

My general impression from studying U.S. public opinion is that people don’t like being told what to do . . . but they don’t mind so much if other people are being told what to do.

This came up with the president’s recent plan to mandate “that all companies with more than 100 workers require vaccination or weekly testing [and] mandate shots for health care workers, federal contractors and the vast majority of federal workers.”

An argument has been made that, even if vaccines are a good idea (which evidently they are, both for the people vaccinated and for others they might otherwise expose), it’s an infringement on freedom to require people to get vaccinated.

Freedom is a good rallying cry, and, if you oppose the president, this is as good an argument as any to make against him, but speaking more generally I don’t think this argument is so appealing on its own for people who have already been vaccinated.

My colleagues and I thought about this several years ago in the context of gay rights. Here’s what Justin, Jeff, and I wrote back in 2009:

In surveys, 72% of Americans support laws prohibiting employment discrimination on the basis of sexual orientation. An even greater number answer yes when asked, “Do you think homosexuals should have equal rights in terms of job opportunities?” This consensus is remarkably widespread: in all states a majority support antidiscrimination laws protecting gays and lesbians, and in all but 10 states this support is 70% or higher.

But people do not uniformly support gay rights. When asked whether gays should be allowed to work as elementary school teachers, 48% of Americans say no. We could easily understand a consistent pro-gay or anti-gay position. But what explains this seeming contradiction within public opinion so that gays should be legally protected against discrimination but at the same time not be allowed to be teachers?

If anything, we could imagine people holding an opposite constellation of views, saying that gays should not be forbidden to be public school teachers but still allowing private citizens to discriminate against gays. A libertarian, for example, might take that position, but it does not appear to be popular among Americans.

We understand the contradictory attitude on gay rights in terms of framing. Our hypothesis goes as follows: when survey respondents are asked about antidiscrimination laws, they consider the widely-held American view that discrimination is a bad thing, so there should be a law against it. They are unlikely to put themselves in the position of an employer who might want to discriminate, and so are not likely to oppose an anti-discrimination law. But when asked about gay teachers, they identify with parents and students, and might feel that having a gay teacher is a risk they’d rather not take.

Thus, we hypothesize that survey respondents answering this question, in contrast to the antidiscrimination question, think in terms of values and outcomes rather than rights. When viewed in terms of rights alone, public opinion is incoherent: it’s hard to see how it makes sense to allow the government the right to discriminate against gays in hiring teachers while prohibiting private organizations from discriminating. It is coherent if framing matters.

The apparent contradiction in public opinion might suggest why, even though anti-discrimination laws have broad support, only 20 states have adopted such laws. And it might suggest why the national Employment Non-Discrimination Act (ENDA), which would protect gays and lesbians from discrimination at work, has not been more enthusiastically supported in Congress. (It was passed by the House of Representatives in 2007 but did not come to a vote in the Senate.) The problem is that it would be difficult to write legislation that incorporates these contradictory stances.

This also suggests that the presidential candidates can take any view on ENDA while still claiming to have majority support—by stressing the rights frame or the values frame.

John McCain came out in opposition to ENDA and Barack Obama is on record as supporting it. Although gay issues are often thought of as politically risky, the poll results suggest that support of gay rights can be unequivocally popular in almost every state—as long as it is framed in terms of values such as non-discrimination (which is presumably one reason why the Employment Non-Discrimination Act was given this name in the first place).

One way to say this is that anti-discrimination laws are popular because most people don’t do discrimination themselves, nor do they have much sympathy for people who discriminate. Similarly, people mostly want to be tough on crime because most people aren’t criminals and don’t have much sympathy for criminals. Another one: people like the idea of freedom of speech, but I recall seeing poll findings that lots of people support prior restraint on the press, as there’s not much sympathy for irresponsible journalists.

The vaccination thing is more complicated because we’re in such a polarized world now that lots of Republicans who’ve been vaccinated will still be open to reasons to dislike Biden. So I’m not making any prediction about polling on this issue. I just don’t think there’s any reason to think that a majority of people will oppose this policy as a matter of principle based on its restriction of freedoms. For better or worse, people typically seem pretty cool with restricting other people’s freedoms.

“Citizen Keane: The Big Lies Behind the Big Eyes”

I was listening to the radio and they played a song by Keane, which sounded good, so I went over to the website of the public library to see if they had any CDs I could check out. While searching I came across this biography from 2014, by Adam Parfrey and Cletus Nelson, of the artist Margaret Keane, famous for her paintings of big-eyed waifs, and her husband Walter, who promoted the art. I’d read bits and pieces of their story over the years from occasional news articles, and it was interesting to see it all in one place. It’s kind of a weird book—this is not something you can tell from seeing the online listing but is apparent with the physical copy—it just looks a bit unprofessional. And, indeed, it’s published by a small press. But the tools of professional-looking book publishers were available in 2014 for anyone with a computer, so it seems like it was a conscious choice of the author/publisher to give it that Hollywood Babylon look. The other thing is that the book is short, only 160 pages (not counting notes and appendixes), and many of those pages are taken up by photos, so maybe the low-budget-looking layout was just a way to pad the material to fill out a whole book.

Anyway, I liked the book. It has a judicious tone, tells an interesting story, and even features an appearance by Tom Wolfe. You never know what you’ll run across, listening to the radio.

P.S. The Keane album came in from the library and I listened to it but couldn’t get into it at all. So, good thing I read that book so something positive came out of the experience.

Progress! (cross validation for Bayesian multilevel modeling)

I happened to come across this post from 2004 on cross validation for Bayesian multilevel modeling. In it, I list some problems that, in the past 17 years, Aki and others have solved! It’s good to know that we make progress.

Here’s how that earlier post concludes:

Cross-validation is an important technique that should be standard, but there is no standard way of applying it in a Bayesian context. . . . I don’t really know what’s the best next step toward routinizing Bayesian cross-validation.

And now we have a method: Pareto-smoothed importance sampling. Aki assures me that we’ll be solving more problems about temporal, spatial and hierarchical models.
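
To make the target concrete, here's a minimal sketch in Python of plain importance-sampling leave-one-out, the unsmoothed estimator that Pareto smoothing is meant to stabilize. It is not PSIS itself (there's no tail fitting or diagnostic here), and the tiny log-likelihood matrix is made up. In practice you'd use an existing implementation such as the loo package in R or ArviZ in Python.

```python
import math

def log_mean_exp(xs):
    """Numerically stable log of the mean of exp(x) over xs."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))

def is_loo_elpd(log_lik):
    """Raw importance-sampling LOO, with no smoothing of the weights.

    log_lik[s][i] is log p(y_i | theta_s) for posterior draw s.
    Using importance ratios 1 / p(y_i | theta_s), the leave-one-out
    predictive density for point i is the harmonic mean of
    p(y_i | theta_s) over the draws.
    """
    n_points = len(log_lik[0])
    pointwise = []
    for i in range(n_points):
        neg_log_lik = [-draw[i] for draw in log_lik]
        pointwise.append(-log_mean_exp(neg_log_lik))
    return sum(pointwise), pointwise

# Hypothetical log-likelihoods: 4 posterior draws, 3 data points.
fake_log_lik = [
    [-1.0, -0.5, -2.0],
    [-1.2, -0.4, -2.5],
    [-0.9, -0.6, -1.8],
    [-1.1, -0.5, -3.0],
]
total, pointwise = is_loo_elpd(fake_log_lik)
print(f"elpd_loo (raw importance sampling): {total:.3f}")
```

The raw ratios can have very heavy tails when one data point is influential, which is the instability that the smoothing step addresses.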

What’s the best way to interact with a computer? Do we want conversations, or do we want to poke it like a thing?

Venkat Govindarajan writes:

John Siracusa in episode 19 of his pre-eminent podcast Hypercritical:

What were the earliest mass-market PC interfaces like? . . . they were like conversations…you’d tell it what to do, it gives information back to you . . . That was the basic paradigm until the Macintosh . . . What it gave you was not a conversation, but a thing . . . you could poke the thing and see how it reacts . . . it worked like a physical object . . .

I [Govindarajan] really liked the comparison of early command-prompt user interfaces to conversations. It struck me that today’s AI assistants (Alexa, Google Assistant, Siri) are all based around having conversations. If these systems ever approached anything close to human intelligence and common-sense, perhaps having a conversation is the best way to interact with AI. But I wonder if there is a better interface to interact with AI? What’s the next leap from conversational AI? Perhaps Augmented Reality is the answer — artificial intelligence dispersed in our lived reality, giving us glancable information and the illusion of a physical object we can interact with. I wonder if this is why Apple is so bullish on AR as well.

Or maybe the best way to interact with artificial intelligence is the same way we interact with other people — using conversations.

“Hello, World!” with Emoji (Stan edition)

Brian Ward sorted out our byte-level string escaping in Stan in this stanc3 pull request and the corresponding docs pull request.

In the next release of Stan (2.28), we’ll be able to include UTF-8 in our strings and write “Hello, World!” without words.

transformed data {
  print("🙋🏼 🌎❗");
}

I’m afraid we still don’t allow Unicode identifiers. So if you want identifiers like α or ℵ or 平均数 then you’ll have to use Julia.

What’s the difference between xkcd and Freakonomics?

The above, of course, is from xkcd. In contrast, Freakonomics went contrarian cool a bunch of years ago with “A Headline That Will Make Global-Warming Activists Apoplectic,” featuring a claim of “about 30 years of global cooling” and a preemptive slam at any critics for “shrillness.” I guess the Rogue Economists could issue a retraction and apology, but given they never retracted that ridiculous claim that beautiful parents are 36% more likely to have girls (further background here), I guess we can’t expect anything on this one either.

But I can keep bringing it up!

I’m reminded of the joke:

Q: What’s the difference between xkcd and Freakonomics?

A: One of them is a long-running serial featuring a mix of interesting ideas and bad jokes, not to be taken seriously . . . and the other one is a cartoon.

And again I feel the need to say that everybody makes mistakes; what’s important is to learn from them, which requires acknowledging the mistake and coming to terms with the patterns of thought that led to it. Which is especially important if you represent an influential news outlet and you’ve been peddling junk science.

P.S. At this point, you might ask, Why am I picking on Freakonomics so much? They’re not so bad! And, indeed, Freakonomics is not so bad. It has some great stuff! That’s one reason I’m picking on them, because they can do better. There are lots of media outlets out there that are worse than Freakonomics. Alex Jones is worse—a zillion times worse. Gladwell’s worse. That Hidden Brain guy on NPR who falls for everything from PNAS, he’s worse. David Brooks doesn’t even try. Mike Barnicle used to be entertaining but he made stuff up. Gregg Easterbrook . . . well, he’s retired. I’m sad about all those guys too, but I have a special sadness in my heart for Freakonomics because they have the demonstrated potential to be so much better. I’m not trying to persuade them in this blog to change their ways—I’ve kinda given up on that—but maybe this can be a cautionary example for others, a continuing story of a lost opportunity to grapple with errors and learn from one’s mistakes. So sad, it makes me want to cry . . . and sometimes to laugh.

Just think, there was once a time, back in 2006 or so, when Freakonomics was one of the most widely respected science brands in the world and xkcd was an obscure website. How things have changed!

Bayesian workflow for disease transmission modeling in Stan!

Léo Grinsztajn, Elizaveta Semenova, Charles Margossian, and Julien Riou write:

This tutorial shows how to build, fit, and criticize disease transmission models in Stan, and should be useful to researchers interested in modeling the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic and other infectious diseases in a Bayesian framework. Bayesian modeling provides a principled way to quantify uncertainty and incorporate both data and prior knowledge into the model estimates. Stan is an expressive probabilistic programming language that abstracts the inference and allows users to focus on the modeling. As a result, Stan code is readable and easily extensible, which makes the modeler’s work more transparent. Furthermore, Stan’s main inference engine, Hamiltonian Monte Carlo sampling, is amenable to diagnostics, which means the user can verify whether the obtained inference is reliable. In this tutorial, we demonstrate how to formulate, fit, and diagnose a compartmental transmission model in Stan, first with a simple susceptible-infected-recovered model, then with a more elaborate transmission model used during the SARS-CoV-2 pandemic. We also cover advanced topics which can further help practitioners fit sophisticated models; notably, how to use simulations to probe the model and priors, and computational techniques to scale-up models based on ordinary differential equations.

I love this workflow stuff!
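
For anyone who hasn't seen one, a susceptible-infected-recovered model is just three coupled differential equations. Here's a minimal forward simulation in Python using Euler steps, assuming made-up values for the transmission rate beta and the recovery rate gamma. It's only the deterministic skeleton; the tutorial is about fitting such a model in Stan with priors and diagnostics, none of which is shown here.

```python
def simulate_sir(beta, gamma, n_pop, i0, days, dt=0.1):
    """Euler integration of the SIR equations:
    dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I.
    Returns one (S, I, R) tuple per day, starting with day 0.
    """
    s, i, r = float(n_pop - i0), float(i0), 0.0
    steps_per_day = round(1 / dt)
    daily = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n_pop * dt
            new_recoveries = gamma * i * dt
            s = s - new_infections
            i = i + new_infections - new_recoveries
            r = r + new_recoveries
        daily.append((s, i, r))
    return daily

# Hypothetical parameters: beta = 0.5/day, gamma = 0.2/day, so R0 = beta/gamma = 2.5.
traj = simulate_sir(beta=0.5, gamma=0.2, n_pop=1000, i0=1, days=60)
peak_day = max(range(len(traj)), key=lambda d: traj[d][1])
print(f"infections peak on day {peak_day} at about {traj[peak_day][1]:.0f} people")
```

In the Bayesian version, beta and gamma get priors and the simulated curve is compared to observed case counts through a likelihood, which is where the workflow comes in.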

Why did Bill Gates say this one weird thing about Bad Blood, the story of Theranos??

His review of John Carreyrou’s book is titled, “I couldn’t put down this thriller with a tragic ending.”

I get that the book’s a thriller, and I get that Gates couldn’t put it down, but why does he say it has a tragic ending? I’d say it has a funny beginning and a tragic middle, but the ending is happy, no? The bad guys don’t get punished, exactly, but at least they get found out and they stop making things worse.

If the founders of Theranos actually had some great ideas but then failed, then I could see the “tragic ending” thing. But given that it was 100% B.S. from the start, it seems that the only tragedy was that they wasted so much of people’s money for so many years, also I guess the tragedy of our economic/social/legal system that empowers liars to stay afloat for so long.

I’m bothered that Gates saw the story as having a tragic ending, because that makes me think that he has some sympathy for the perps. To me, it’s like reading about some people who stole some cars, drove drunk and got into crash after crash, and finally had their licenses and vehicles taken away. The tragedy is that it took so long to stop them. I feel no sadness that they finally got caught.

It reminds me of when that submarine movie Das Boot came out and I saw it with my father. It was an intense two hours, and when we walked out I asked my dad what he thought. He said, “At least it had a happy ending.” And he meant it! All the time I’d been watching this movie, rooting for the submarine crew, and my dad was sitting there wanting the rivets to pop and wanting the sub to get sunk.

I read Bad Blood the way my dad watched Das Boot: I was waiting to see the bad guys go down, and the main frustration was (a) how long it took, and (b) that some of the villains escaped scot-free. But Gates . . . he was rooting for the guys in the sub. He wanted Theranos to succeed! Scary.

We were wrong about redistricting.

In 1994, Gary King and I published an article in the American Political Science Review, “Enhancing democracy through legislative redistricting,” which began:

We demonstrate the surprising benefits of legislative redistricting (including partisan gerrymandering) for American representative democracy. In so doing, our analysis resolves two long-standing controversies in American politics. First, whereas some scholars believe that redistricting reduces electoral responsiveness by protecting incumbents, others, that the relationship is spurious, we demonstrate that both sides are wrong: redistricting increases responsiveness. Second, while some researchers believe that gerrymandering dramatically increases partisan bias and others deny this effect, we show both sides are in a sense correct. Gerrymandering biases electoral systems in favor of the party that controls the redistricting as compared to what would have happened if the other party controlled it, but any type of redistricting reduces partisan bias as compared to an electoral system without redistricting.

I think we were correct in our analysis, but it was an analysis of past data (state legislatures in the 1970s and 1980s). Since then, it’s my impression that gerrymandering—extreme partisan redistricting—has gotten much worse. I’m only speculating here because I haven’t looked at the numbers, but I’ve heard of five explanations for why gerrymandering has become more of a problem:

1. Better data and computers. The technology is now available to draw some really biased districting plans using local voting information. There are just more people who can draw these sorts of biased maps, and it’s easier to do.

2. Increased partisan polarization, part 1. Voting patterns are more stable. One of the risks of gerrymandering was always that you could be too clever by half: give your party a bunch of seats that it can win 60-40 and you’re at risk of being buried under a landslide. But elections are more predictable now, so it’s easier for gerrymanderers to assess these risks.

3. Increased partisan polarization, part 2. There’s more of an anything goes attitude and it’s easier for a legislative majority to keep the cohesion to steamroller the minority on all sorts of things, districting included. Gerrymandered districts then make a party with a legislative majority less likely to lose power.

4. Decline in incumbency advantage. Incumbency advantage isn’t as high as it was back in the 1980s, so, from the standpoint of partisan control, incumbency protection isn’t such a big deal. The traditional retirement of incumbents after redistricting does not shake up the system so much—and, as we explained in our article, that was a big reason why redistricting increased electoral responsiveness in the past.

5. Reduced judicial scrutiny. I’m not sure about this, but I remember back when we were doing the research that led to our 1994 article, we had the impression that the most extreme gerrymanders didn’t happen because legislators were afraid that maps that were too obviously biased would be thrown out by the courts. That doesn’t seem to happen so much anymore.
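
The too-clever-by-half risk in point 2 is easy to see with a toy calculation. Here's a minimal sketch in Python, assuming ten equal-sized districts, a uniform partisan swing, and made-up vote shares for a party with half the statewide vote. It's an illustration only, not the seats-votes model from the 1994 article.

```python
def seats_won(district_shares, swing):
    """Seats won by the map-drawing party after a uniform swing in vote share."""
    return sum(1 for share in district_shares if share + swing > 0.5)

# Two hypothetical plans, each averaging 50% of the vote across ten districts.
# Aggressive: win 8 districts 60-40 and pack the opposition into the other 2.
aggressive = [0.60] * 8 + [0.10] * 2
# Cautious: win 5 districts 70-30 and concede the other 5.
cautious = [0.70] * 5 + [0.30] * 5

for swing in (0.0, -0.05, -0.12):
    print(f"swing {swing:+.2f}: aggressive {seats_won(aggressive, swing)} seats, "
          f"cautious {seats_won(cautious, swing)} seats")
```

The aggressive map holds 8 seats through a 5-point swing but loses all of them in a 12-point landslide, while the cautious map keeps its 5. When elections are more predictable, a swing that large looks less likely, so the aggressive plan looks less risky.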

It would be a useful research project for someone to look into this. For my part, I feel bad about writing that paper if it made anyone complacent about the ways in which gerrymandering can threaten democracy.

Even back in the 1990s, we always said that nonpartisan or bipartisan redistricting was preferable to partisan redistricting: our argument was in favor of redrawing the district lines, and we just said that the partisan aspect of it wasn’t such a big deal because extreme partisan plans would be constrained by the courts. But in retrospect, yeah, back when gerrymandering wasn’t such a problem, that would’ve been the time to do something about it.

So the title of this post is a slight exaggeration. I don’t think any of the claims in that 1994 paper were wrong. But I think we were wrong to not at least try to anticipate future changes in an era of increasing partisan polarization.

Rodents Performing Visual Tasks

From Pamela Reinagel. I love the deadpan description:

Videos of rats and one squirrel performing visual decision experiments.

Job opening for a statistician at a femtech company in Berlin

Jonathan Thornburg writes:

You’ve sometimes posted statistical-analysis job ads on the blog in the past.

I wonder if you might be willing to post or link to this ad for a “Research Data Analyst” at Clue (a Berlin-based Femtech company).

My [Thornburg’s] connection with this is that my wife consults for Clue, I’ve done some data analysis for them, and I offered some suggestions for just what qualifications they’d find most useful for this position.

“Femtech” . . . I’ve never heard that one before!

Nothing in the job description about Bayesian inference but I suppose they’d be open to Bayesian approaches if they could help solve their problems, as I think would be the case.

It’s not an echo chamber, but it’s an echo chamber . . . How does that work, exactly?

Joseph Delaney points to a silly post on twitter by an economist (nobody famous this time, just someone on the internet) who writes:

If a vaccine were 100% safe and 100% effective, then somebody’s decision to *not* take it would have no effect whatsoever on anybody else.

So if people refusing a vaccine bother you, it’s only because you admit it’s not completely safe and/or effective.

As Delaney points out, this is ridiculous for many reasons, first that nothing in life is 100% safe or 100% effective, second that if the vaccine were 100% safe and 100% effective, then somebody’s decision to not take it would definitely have an effect. By not taking the vaccine, you can spread the disease to other people who don’t have the vaccine. This could kill them!

If you want to argue against vaccines using cost-benefit reasoning, go for it. I doubt I’d be convinced, but you can lay out the costs and benefits and make your case. Maybe the person who wrote the tweet feels the social costs are too high. But the direct benefits are clear—even more so if the vaccine is 100% safe and 100% effective. So I don’t think this twitter guy has thought this through at all.

I guess he was trying to be clever but just didn’t think it through. I wonder if part of this is a problem within the field of economics, where there’s a bit of a tradition of silly-clever arguments being celebrated (see here, for example).

Silly arguments occur in all fields of social science, as you can see if you scan through some psychology or sociology journals (not to mention the depths of decadent postmodernism etc.). Ummm, yeah, we have transparently silly arguments in political science too! I guess the point is that different fields have different forms of silly. In political science, the silly can come from measuring something and sticking a name on it, without thinking whether the name fits the measurement. In psychology, the silly comes from running an experiment on 30 people on Mechanical Turk and making general claims about human nature. In other fields, the silly comes from flat-out gobbledygook, as in the famous Sokal paper. In economics, a popular form of silly is the good-is-bad argument. Sometimes the good-is-bad argument has real oomph, as with Adam Smith’s famous paean to the self-interest of the butcher, the brewer, and the baker; other times, as with the quote above, it’s just silly counterintuitiveness for its own sake.

To put it another way: in life, elasticities are typically between 0 and 1, which means that policies typically don’t work quite as well as you might hope (that’s the elasticity less than 1), but they typically go in the intended direction (that’s the elasticity more than 0).

I know, #Notalleconomists. I’m not saying that most or even many economists don’t understand vaccines; I’m just saying that I can see how the error demonstrated in the above quote could appeal to some economists. Similarly, not many political scientists would say that North Carolina is less democratic than North Korea, but some prominent political scientists did attach their reputations to that claim, and it’s the kind of confusing-the-measurement-with-the-reality sort of mistake that political scientists sometimes make.

But I wasn’t really here to talk about economics.

The question that really interested me is how could someone be so wrong on such a simple thing? OK, we live in a world where people deny the existence of school shootings—but in that case the argument is so elaborate that I’d argue that the complexity of the story is part of its appeal. This vaccine thing seems simpler. So what’s going on?

Delaney picked up on a clueless tweet—but the world is full of clueless tweets. I followed the link and the person who wrote the above quote seems to be a bit of a political extremist, for example linking to an Alex Jones fan (which is what got me thinking about school shooting deniers). So the natural thought is that the person who wrote that twitter post is stuck in an echo chamber.

We hear a lot about media echo chambers—go on Facebook or Twitter or even TV news and you’ll hear just one side of the story. In this case, though, it’s more complicated. Following the link, you’ll see that many of the twitter commenters strongly disagreed with the above-quoted post, and many even went to the trouble of explaining what was wrong. But that didn’t seem to matter. The person who wrote the tweet just sarcastically brushed aside all the arguments in the other direction.

I see this a lot when I look things up on twitter. It’s not an echo chamber—in any thread, you’ll often get two strongly opposed perspectives, but typically:

(a) It’s only two perspectives, not three or more, and

(b) Nobody seems to be listening to the other perspective.

So it’s kinda weird. The outcome seems very much what you’d get from an echo chamber, but in the actual process, people are exposed to both sides of an issue. It’s a kind of non-debate debate.

I’m not sure how to think about this, but let me raise a complicating factor, which is that in many of these debates there really are clear right and wrong sides. For example, in the above debate, the claim “If a vaccine were 100% safe and 100% effective, then somebody’s decision to *not* take it would have no effect whatsoever on anybody else” is some mixture of uninformed, foolish, illogical, and flat-out wrong. This happens a lot. Just as most forecast probabilities are close to 0 or to 1, with only a few events being highly uncertain, it’s my impression that most debates, when looked at from the outside, have a clear right and wrong position. Other times things are more ambiguous, but that can arise from a debate having multiple dimensions, so that the two sides are each right about a different aspect of the disagreement.

I assume that some scholars of communication have looked at this “non-debate debate” phenomenon, where people are exposed to both sides of an issue but don’t even register the opposing arguments. I’ve seen some pretty extreme cases recently where most of the participants in a dispute refuse to even consider the arguments on the other side. But they’re still being exposed to them!

OK, that’s about it for me on this one. I don’t know how to think about this, but I think that my naive earlier view that people were in echo chambers . . . that story ain’t quite right.

“Losing one night’s sleep may increase risk factor for Alzheimer’s, study says”

CNN’s on it:

A preliminary study found the loss of one night’s sleep in healthy young men increased the levels of tau protein in their blood compared to getting a complete night of uninterrupted sleep.

Studies have shown that higher levels of tau protein in the blood is associated with an increased risk of developing Alzheimer’s disease. “Our exploratory study shows that even in young, healthy individuals, missing one night of sleep increases the level of tau in blood suggesting that over time, such sleep deprivation could possibly have detrimental effects,” said study author Dr. Jonathan Cedernaes, a neurologist at Uppsala University in Sweden. The study was published Wednesday in Neurology, the medical journal of the American Academy of Neurology.

From the linked paper:

Methods In a 2-condition crossover study, 15 healthy young men participated in 2 standardized sedentary in-laboratory conditions in randomized order: normal sleep vs overnight sleep loss. Plasma levels of total tau (t-tau), Aβ40, Aβ42, neurofilament light chain (NfL), and glial fibrillary acidic protein (GFAP) were assessed using ultrasensitive single molecule array assays or ELISAs, in the fasted state in the evening prior to, and in the morning after, each intervention.

Results In response to sleep loss (+17.2%), compared with normal sleep (+1.8%), the evening to morning ratio was increased for t-tau (p = 0.035). No changes between the sleep conditions were seen for levels of Aβ40, Aβ42, NfL, or GFAP (all p > 0.10). The AD risk genotype rs4420638 did not significantly interact with sleep loss–related diurnal changes in plasma levels of Aβ40 or Aβ42 (p > 0.10). . . .

Hey, didn’t somebody say something about the difference between significant and non-significant?

Anyway, this all could be a real thing. The headline is just a bit dramatic.

Simulation-based calibration: Some challenges and directions for future research

Simulation-based calibration is the idea of checking Bayesian computation using the following steps, originally from Samantha Cook’s Ph.D. thesis, later appearing in this article by Cook et al., and then elaborated more recently by Talts et al.:

1. Take one draw of the vector of parameters theta_tilde from the prior distribution, p(theta).

2. Take one draw of the data vector y_tilde from the data model, p(y|theta_tilde).

3. Take a bunch of posterior draws of theta from p(theta|y_tilde). This is the part of the computation that typically needs to be checked. Call these draws theta^1,…,theta^L.

4. For each scalar component of theta or quantity of interest h(theta), compute the quantile of the true value h(theta_tilde) within the distribution of values h(theta^l). For a continuous parameter or summary, this quantile will take on one of the values 0/L, 1/L, …, 1.

If all the computations above are correct, then the result of step 4 should be uniformly distributed over the L+1 possible values. To do simulation-based calibration, repeat the above 4 steps N times independently and then check that the distribution of the quantiles in step 4 is approximately uniform.
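Here is a minimal sketch of steps 1-4 in Python, using only the standard library. The model is an assumption chosen for convenience and is not from the papers cited above: a normal(0, 1) prior on a scalar theta and n observations from normal(theta, 1), so the exact posterior is available in closed form and stands in for the sampler you would actually be checking. The function name `sbc_ranks` and all the settings are hypothetical.

```python
import random

N_DRAWS = 99  # L posterior draws per replication, giving L + 1 = 100 possible ranks

def sbc_ranks(n_reps=1000, n_obs=10, n_draws=N_DRAWS, seed=1):
    """Return one rank statistic per replication for a conjugate normal model."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_reps):
        # Step 1: draw theta_tilde from the prior, normal(0, 1).
        theta_tilde = rng.gauss(0.0, 1.0)
        # Step 2: draw the data vector y_tilde from the data model, normal(theta_tilde, 1).
        y_tilde = [rng.gauss(theta_tilde, 1.0) for _ in range(n_obs)]
        # Step 3: posterior draws. In this toy model the posterior is
        # normal(sum(y) / (n + 1), 1 / (n + 1)); in real use, this is where
        # the computation being checked would go.
        post_var = 1.0 / (n_obs + 1)
        post_mean = sum(y_tilde) * post_var
        draws = [rng.gauss(post_mean, post_var ** 0.5) for _ in range(n_draws)]
        # Step 4: rank of the true value among the draws, an integer in 0..n_draws.
        ranks.append(sum(d < theta_tilde for d in draws))
    return ranks

ranks = sbc_ranks()

# Group the 100 possible ranks into 10 bins of 10 ranks each.
n_bins = 10
counts = [0] * n_bins
for r in ranks:
    counts[r * n_bins // (N_DRAWS + 1)] += 1
print(counts)  # the counts should be roughly equal if the computation is correct
```

Dividing the rank by L gives the quantile described in step 4. Binning the ranks is just one simple way to eyeball uniformity; a histogram or an empirical CDF plot would serve the same purpose.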

Connection to simulated-data experimentation

Simulation-based calibration is a special case of the more general activity of simulated-data experimentation. In simulation-based calibration, you’re comparing the assumed true value theta_tilde to the posterior simulations theta. More generally, we can do all sorts of experiments. For example, simulated-data experimentation is useful when designing a study: you make some assumptions, simulate some fake data, and then see how accurate the parameter estimates are. The goal here is not necessarily to check that the posterior inference is working well—indeed, there’s no requirement that the inference be Bayesian at all—but rather just to see what might happen under some scenarios. Here’s a discussion from earlier this year on why this can be so useful, and here’s a simple example.

Another application of simulated-data experimentation is to assess the bias in some existing estimation method. For example, suppose we’re concerned about selection bias in an experiment. We can simulate fake data under some assumed true parameter values and some assumed selection model, then do a simple uncorrected estimate and compare this inference with the assumed parameter values. This is similar to simulation-based calibration but not quite the same thing, because we’re not trying to assess whether the fit is calibrated; rather, we’re using the simulation to assess the bias of the estimate (given some assumptions).

A similar idea arises when assessing the bias of a computational algorithm. For example, variational inference can underestimate uncertainty of local parameters in a hierarchical model. One way to get a sense of this inferential bias is to simulate data from the assumed model, then fit using variational inference, check coverage of the resulting 50% intervals (for example), and loop this to see how often these intervals contain the true value. Or this could be done in other ways; the point is that this is similar to, but not the same as, straight simulation-based calibration as outlined at the top of this post.
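To make the coverage check concrete, here is a rough sketch in standard-library Python. It does not run variational inference. As a stand-in, it takes the exact posterior of a toy conjugate normal model (an assumption, as are all the names and settings) and shrinks its standard deviation by a factor `sd_scale`, mimicking an approximation that understates uncertainty.

```python
import random

Z50 = 0.6745  # half-width of a central 50% normal interval, in sd units

def coverage_50(sd_scale, n_reps=2000, n_obs=10, seed=2):
    """Fraction of 50% intervals containing the true value, when the
    reported posterior sd is sd_scale times the exact posterior sd."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        theta_true = rng.gauss(0.0, 1.0)                          # prior: normal(0, 1)
        y = [rng.gauss(theta_true, 1.0) for _ in range(n_obs)]    # data: normal(theta, 1)
        post_var = 1.0 / (n_obs + 1)                              # exact posterior variance
        post_mean = sum(y) * post_var                             # exact posterior mean
        approx_sd = sd_scale * post_var ** 0.5                    # possibly too narrow
        if abs(theta_true - post_mean) < Z50 * approx_sd:
            hits += 1
    return hits / n_reps

exact_cov = coverage_50(1.0)   # exact posterior: coverage should be near 0.5
narrow_cov = coverage_50(0.5)  # sd halved: coverage should fall well below 0.5
print(exact_cov, narrow_cov)
```

The output is a coverage rate rather than a pass/fail verdict, which is the sense in which this differs from the uniformity check at the top of the post.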

Perhaps most important is the use of simulated-data experimentation in modeling workflow. As Martin Modrák put it: simulation-based calibration was presented originally as validating an algorithm given that you trust your model, but in practice we are typically interested in validating a model implementation, given that you trust your sampling algorithm. Or maybe we want to be checking both the model and the computation.

Problems with simulation-based calibration

OK, now back to the specific idea of using simulated data to check calibration of parameter inferences. There are some problems with the method of Cook et al. (2006) and Talts et al. (2021):

– The method can be computationally expensive. If you loop the above steps 1-4 many times, then you’ll need to do posterior inference—step 3—lots of times. You might not want to occupy your cluster with N=1000 parallel simulations.

– The method is a hypothesis test: it’s checking whether the simulations are exactly correct (to within the accuracy of the test). But our simulations are almost never exact. HMC/NUTS is great, but it’s still only approximate. So there’s an awkwardness to the setup in that we’re testing a hypothesis we don’t expect to be true, and we’re assuming this will work because of some slop in the uncertainty based on a finite number of replications.

– Sometimes we want to test an approximate method that’s not even simulation consistent, something like ADVI or Pathfinder. Simulation-based calibration still seems like a good general idea, but it won’t quite work as stated because we’re not expecting perfect coverage.

– In practice, we don’t necessarily care about the calibration of a procedure everywhere. In a particular application, we typically care about calibration in the neighborhood of the data. If the model being fit has a weak prior distribution, simulation-based calibration drawing from the prior (as in steps 1-4 above) might miss the areas of parameter and data space that we really care about.

– Because of the above concerns, we commonly perform simulation-based calibration using a simpler approach, using a single fixed value theta_tilde set to a reasonable value, where “reasonable” is defined based on our understanding of how the model will be used. This has the virtue of testing the computation where we care about it, but now that we’re no longer averaging over the prior, we can’t make any general statements about calibration, even if the posterior simulations are perfect.

Possible solutions

So, what to do? I’m not sure. But I have a few thoughts.

First, in practice we can often learn a lot from just a single simulated dataset, that is, N=1. Modeling or computational problems are often so severe that they show up from just one simulation: for example, we’ll get parameter estimates blowing up or getting stuck somewhere far away from where they should be. In my workflow, I’m often doing this sort of informal experimentation where I set the parameter vector to some reasonable value, simulate fake data, fit the model, and check that the parameter inferences are in the ballpark. It would be good to semi-automate this process so that it can be easy to do in the CmdStanR and CmdStanPy environments, as well as in rstanarm and brms.
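The N=1 check might look something like the following sketch. Everything here is an assumption for illustration: the model is a simple linear regression, the "reasonable" parameter values are made up, and closed-form least squares stands in for whatever fitting you would really do in Stan.

```python
import random

def simulate_and_fit(a=1.0, b=2.0, sigma=0.5, n=200, seed=3):
    """Simulate y = a + b*x + error once, fit by least squares,
    and return (estimate, standard error) for a and for b."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    y = [a + b * xi + rng.gauss(0.0, sigma) for xi in x]

    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a_hat = ybar - b_hat * xbar

    # Residual sd and the usual least-squares standard errors.
    resid = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]
    s = (sum(r * r for r in resid) / (n - 2)) ** 0.5
    se_b = s / sxx ** 0.5
    se_a = s * (1.0 / n + xbar ** 2 / sxx) ** 0.5
    return (a_hat, se_a), (b_hat, se_b)

(a_hat, se_a), (b_hat, se_b) = simulate_and_fit()
# Ballpark check: how many standard errors is each estimate from the assumed truth?
print("a:", round(a_hat, 2), "z =", round((a_hat - 1.0) / se_a, 2))
print("b:", round(b_hat, 2), "z =", round((b_hat - 2.0) / se_b, 2))
```

A z-score of 10 on a single fake dataset would tell you something is broken, without any need for a thousand replications.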

Second, we’re often fitting hierarchical models, and then we can follow a hybrid approach. For a hierarchical model with hyperparameters phi and local parameters alpha, we can set phi_tilde to a reasonable value and draw alpha_tilde from its distribution, p(alpha|phi_tilde). If the number of groups is large, there can be enough internal replication in alpha that much can be learned about coverage from just a single simulated dataset—although we have to recognize that this can all depend strongly on phi, so it might make sense to repeat this for a few values of phi.
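Here is a sketch of the hybrid approach under strong simplifying assumptions: a normal hierarchical model with known data sd, hyperparameters phi = (mu, tau) fixed at made-up values, and a crude moment-based fit standing in for a full Bayesian fit. The function name and settings are hypothetical. The point is only that with many groups, one simulated dataset already gives a usable coverage estimate.

```python
import random

def hierarchical_coverage(mu=0.0, tau=1.0, sigma=1.0, n_groups=2000, seed=4):
    """Fix phi_tilde = (mu, tau), draw alpha_tilde | phi_tilde, simulate one dataset,
    fit, and return the coverage of 50% intervals across the groups."""
    rng = random.Random(seed)
    alpha = [rng.gauss(mu, tau) for _ in range(n_groups)]   # local parameters
    y = [rng.gauss(a, sigma) for a in alpha]                # one observation per group

    # Stand-in for the model fit: moment estimates of the hyperparameters,
    # then the conditional normal posterior for each alpha_j.
    mu_hat = sum(y) / n_groups
    var_y = sum((yi - mu_hat) ** 2 for yi in y) / (n_groups - 1)
    tau2_hat = max(var_y - sigma ** 2, 1e-6)
    shrink = tau2_hat / (tau2_hat + sigma ** 2)
    post_sd = (shrink * sigma ** 2) ** 0.5

    hits = 0
    for a, yi in zip(alpha, y):
        post_mean = mu_hat + shrink * (yi - mu_hat)
        if abs(a - post_mean) < 0.6745 * post_sd:           # central 50% interval
            hits += 1
    return hits / n_groups

# Repeat for a few values of the hyperparameter tau, since results can depend on phi.
coverages = {tau: hierarchical_coverage(tau=tau) for tau in (0.5, 1.0, 2.0)}
print(coverages)
```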

Third, when checking the computation of a model fit to a particular dataset, I have the following idea. Start by fitting the model, which gives you posterior simulations of theta. Then, for each of several random draws of theta from the posterior, simulate a new dataset y_rep, re-fit the model to this y_rep, and check the coverage of these inferences with respect to the posterior draw of theta. This has the form of simulation-based calibration except that we’re drawing theta from the posterior, not the prior. This makes a lot of sense to me—after all, it’s the posterior distribution that I care about—but we’re no longer simply drawing from the joint distribution, so we can’t expect the same coverage. On the other hand, if we had really crappy coverage, that would be a problem, right? I’m thinking we should try to understand this in the context of some simple examples.
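As one possible simple example, here is the same idea in a toy conjugate normal model (normal(0, 1) prior, n observations with sd 1), where the fit and re-fit are exact so that any departure from 50% coverage comes from drawing theta from the posterior rather than the prior. The model, the function name, and the two values of the observed mean are all assumptions for illustration.

```python
import random

def refit_coverage(ybar_obs, n_obs=10, n_reps=4000, seed=5):
    """Draw theta from the posterior given the observed data, simulate y_rep,
    re-fit, and return how often the re-fit 50% interval contains theta."""
    rng = random.Random(seed)
    post_var = 1.0 / (n_obs + 1)
    post_sd = post_var ** 0.5
    post_mean_obs = n_obs * ybar_obs * post_var      # posterior mean given observed data
    hits = 0
    for _ in range(n_reps):
        theta = rng.gauss(post_mean_obs, post_sd)               # posterior draw
        y_rep = [rng.gauss(theta, 1.0) for _ in range(n_obs)]   # replicated dataset
        refit_mean = sum(y_rep) * post_var                      # re-fit the same model
        if abs(theta - refit_mean) < 0.6745 * post_sd:
            hits += 1
    return hits / n_reps

near = refit_coverage(0.0)  # observed data agree with the prior
far = refit_coverage(3.3)   # observed data sit far from the prior mean
print(near, far)
```

In this toy setup the coverage stays close to nominal when the data agree with the prior and drops when they don’t, because the re-fit keeps shrinking toward a prior mean that the posterior draws have moved away from. I wouldn’t generalize from one conjugate example, but it shows the kind of adjustment that would be needed.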

Fourth, I’d like any version of simulation-based calibration to be set up as a measure of miscalibration rather than as a hypothesis test. My model here is R-hat, which is a potential scale reduction factor and which is never expected to be exactly 1.00000.

Fifth, this last idea connects to the use of simulation-based calibration for explicitly approximate computations such as ADVI or Pathfinder, where the goal should not be to check if the methods are calibrated but rather to measure the extent of the miscalibration. One measure that we could try is ((computed posterior mean) – (true parameter value)) / (computed posterior sd). Or maybe that’s not quite right, I’m not sure. The point is that we want a measure of how bad the fit is, not a yes/no hypothesis test.
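For what it’s worth, here is that measure written out for a scalar parameter, in the notation of steps 1-4 at the top of the post, along with the reference values it would have if the computation were exact and we averaged over the prior and data model. This is a sketch of one candidate, assuming the posterior variance is finite, and not a settled proposal:

```latex
z = \frac{\mathrm{E}(\theta \mid \tilde{y}) - \tilde{\theta}}{\mathrm{sd}(\theta \mid \tilde{y})},
\qquad
\mathrm{E}(z) = 0,
\qquad
\mathrm{E}(z^2) = 1 .
```

The reference values follow because, under the joint distribution, theta_tilde given y_tilde is itself a draw from the posterior, so the numerator has conditional mean 0 and conditional variance equal to the posterior variance. Across replications, then, the average of z measures bias in sd units, and the average of z^2 being above 1 indicates intervals that are too narrow.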

To summarize, I see three clear research directions:

(a) Adjusting for the miscalibration that will arise when we’re no longer averaging over the prior (because the number of replication draws N is not large or because we want to focus on the posterior or some other area of interest of parameter space).

(b) Coming up with an interpretable measure of miscalibration rather than framing the check as a test of the hypothesis of perfect calibration.

(c) Incorporating this into workflow so that it’s convenient and not computationally expensive.

“There are no equal opportunity infectors: Epidemiological modelers must rethink our approach to inequality in infection risk”

Jon Zelner, Nina Masters, Ramya Naraharisetti, Sanyu Mojola, and Merlin Chowkwanyun write:

Mathematical models have come to play a key role in global pandemic preparedness and outbreak response . . . However, these models have systematically failed to account for the social and structural factors which lead to socioeconomic, racial, and geographic health disparities. . . . We evaluate potential historical and political explanations for the exclusion of drivers of disparity in infectious disease models for emerging infections, which have often been characterized as “equal opportunity infectors” despite ample evidence to the contrary. We look to examples from other disease systems (HIV, STIs) as a potential blueprint for how social connections, environmental, and structural factors can be integrated into a coherent, rigorous, and interpretable modeling framework. . . .

Zelner adds:

I think it touches on some of the issues in our Patterns piece, but from the perspective of saying that transmission models that omit the structural drivers of risk—almost analogous to hyperpriors on the model parameters—are inherently misspecified.

This sort of connection between statistics and politics always interests me.

Martin Modrák’s tutorial on simulation-based calibration

This new tutorial from Martin Modrák (above image by Modrák and Phil Clemson) is cool! It’s got slides, code, and exercises that you can do on your own to learn what simulation-based calibration is all about!

For tomorrow I have a post scheduled on some open problems in simulation-based calibration.

Adjusting for stratification and clustering in coronavirus surveys

Someone who wishes to remain anonymous writes:

I enjoyed your blog posts and eventual paper/demo about adjusting for diagnostic test sensitivity and specificity in estimating seroprevalence last year. I’m wondering if you had any additional ideas about adjustments for sensitivity and specificity in the presence of complex survey designs. In particular, the representative seroprevalence surveys out there tend to employ stratification and clustering, and sometimes they will sample multiple persons per household. It seems natural that at the multilevel regression stage of the Bayesian specification, you can include varying intercepts for the strata, cluster, and household—all features of the survey design that your 2007 survey weights paper would recommend including (from your 2007 paper: “In addition, a full hierarchical modeling approach should be able to handle cluster sampling (which we have not considered in this article) simply as another grouping factor”).

I think I have some fuzziness about how this would be done in practice—or at least, what happens at the post-stratification stage following the multilevel regression fit. Suppose that we adjust for respondent age and sex in the regression model, in addition to varying intercepts for household, cluster, and geographical strata. And suppose that we have Census counts on strata X age X sex. Would posterior predictions be made using age, sex, and strata, while setting the household and cluster varying intercepts to 0? Somehow I feel uncertain that this is the right approach.

The study estimating seroprevalence in Geneva by Stringhini et al. was not a cluster survey (it was SRS), but they did adjust for clustering within households. They integrate out the varying intercept for household (I think?) in their model (see page 2 of the supplement here). I admit I have a bit of trouble following the intuition and math there (I don’t think they made a mistake, I’m just slow). Is this the right approach?

I’m also aware that there are alternative ways of making these adjustments—like using a (average) design effect for individual post-stratification cells to get an effective sample size (e.g., the deep MRP paper by Ghitza and Gelman, 2013), but if we are in a position where we have full access to the cluster, and household variables, it seems we should use it.

There are a few issues here:

1. Combining sensitivity/specificity adjustments with survey analysis. This should not be difficult using Stan, as discussed in my above-linked paper with Bob Carpenter, as long as you have the survey analysis part figured out. That is, the hard part of this problem is the survey analysis, not the corrections for sensitivity/specificity.

2. Problems with real-world covid surveys. Here the big issue we’ve seen is selection bias in who gets tested. I’m not really sure how to handle this given existing data. We’ve been struggling with the selection bias problem and have no great solutions, even though it’s clear there’s some information relevant to the question.

3. Accounting for stratification and clustering in the survey design. I agree that multilevel modeling is the way to go here. I haven’t looked at the linked paper carefully, so I can’t comment on the details, but I think the general approach of conditioning on clustering and then averaging over clusters makes sense. It will be important to include cluster-level predictors so that empty clusters are not simply pooled to the global average of the data.
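To illustrate items 1 and 3, here is a sketch in standard-library Python of what averaging over clusters means at the poststratification stage, as opposed to setting the cluster intercepts to 0. All the numbers are made up: the cell-level linear predictors, the cluster-level sd, the Census-style counts, and the test sensitivity and specificity. In a real analysis these would come from posterior draws of a fitted multilevel model, and the averaging would be done within each draw.

```python
import math
import random

def invlogit(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_prevalence(eta, sigma_cluster, rng, n_sims=4000):
    """Average invlogit(eta + u) over cluster intercepts u ~ normal(0, sigma_cluster)."""
    total = sum(invlogit(eta + rng.gauss(0.0, sigma_cluster)) for _ in range(n_sims))
    return total / n_sims

def poststratify(cell_values, cell_counts):
    """Population-weighted average of cell-level estimates."""
    return sum(v * n for v, n in zip(cell_values, cell_counts)) / sum(cell_counts)

def expected_positive_rate(prev, sens, spec):
    """Probability of a positive test given true prevalence: true plus false positives."""
    return prev * sens + (1.0 - prev) * (1.0 - spec)

# Hypothetical inputs for four poststratification cells.
cell_eta = [-3.5, -3.0, -2.8, -2.5]        # linear predictor without the cluster term
cell_counts = [12000, 15000, 9000, 14000]  # population counts per cell
sigma_cluster = 1.0                        # sd of the varying cluster intercepts

rng = random.Random(7)
plug_in = poststratify([invlogit(e) for e in cell_eta], cell_counts)
averaged = poststratify(
    [cell_prevalence(e, sigma_cluster, rng) for e in cell_eta], cell_counts
)
print("intercepts set to 0:   ", round(plug_in, 4))
print("averaged over clusters:", round(averaged, 4))
print("implied test-positive rate:",
      round(expected_positive_rate(averaged, sens=0.9, spec=0.98), 4))
```

Because the inverse logit is nonlinear, plugging in 0 for the cluster intercept gives a different answer from averaging over the cluster distribution, and for low prevalence the plug-in estimate is too low. The same logic would apply to household intercepts.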

He was fooled by randomness—until he replicated his study and put it in a multilevel framework. Then he saw what was (not) going on.

An anonymous correspondent who happens to be an economist writes:

I contribute to an Atlanta Braves blog and I wanted to do something for Opening Day. Here’s a very surprising regression I just ran. I took the 50 Atlanta Braves full seasons (excluding strike years and last year) and ran the OLS regression: Wins = A + B Opening_Day_Win.

I was expecting to get B fairly close to 1, ie, “it’s only one game”. Instead I got 79.8 + 7.9 Opening_Day_Win. The first day is 8 times as important as a random day! The 95% CI is 0.5-15.2 so while you can’t quite reject B=1 at conventional significance levels, it’s really close. F-test p-value of .066

I have an explanation for this (other than chance) which is that opening day is unique in that you’re just about guaranteed to have a meeting of your best pitcher against the other team’s, which might well give more information than a random game, but I find this really surprising. Thoughts?

Note: If I really wanted to pursue this, I would add other teams, try random games rather than opening day, and maybe look at days two and three.

Before I had a chance to post anything, my correspondent sent an update, subject-line “Never mind”:

I tried every other day: 7.9 is kinda high, but there are plenty of other days that are higher and a bunch of days are negative. It’s just flat-out random…. (There’s a lesson there somewhere about robustness.) Here’s the graph of the day-to-day coefficients:

The lesson here is, as always, to take the problem you’re studying and embed it in a larger hierarchical structure. You don’t always have to go to the trouble of fitting a multilevel model; it can be enough sometimes to just place your finding as part of the larger picture. This might not get you tenure at Duke, a Ted talk, or a publication in Psychological Science circa 2015, but those are not the only goals in life. Sometimes we just want to understand things.
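Here is a sketch of that larger picture using simulated seasons, not the actual Braves records. The assumptions are mine: 50 seasons of 162 games, a team strength that varies by season, and games that are independent given that strength. With a binary predictor, the least-squares slope from regressing season wins on the game-k result is just the difference in mean wins between seasons that won game k and seasons that lost it, so no regression library is needed.

```python
import random

def day_coefficients(n_seasons=50, n_games=162, strength_sd=0.06, seed=8):
    """For each game number k, return the slope from regressing season wins
    on the 0/1 result of game k, using simulated seasons."""
    rng = random.Random(seed)
    seasons = []
    for _ in range(n_seasons):
        # Season-level win probability, kept away from 0 and 1.
        p = min(max(rng.gauss(0.5, strength_sd), 0.05), 0.95)
        seasons.append([1 if rng.random() < p else 0 for _ in range(n_games)])
    totals = [sum(s) for s in seasons]

    coefs = []
    for day in range(n_games):
        won = [t for s, t in zip(seasons, totals) if s[day] == 1]
        lost = [t for s, t in zip(seasons, totals) if s[day] == 0]
        if won and lost:  # skip a day if every season had the same result
            coefs.append(sum(won) / len(won) - sum(lost) / len(lost))
    return coefs

coefs = day_coefficients()
print("smallest:", round(min(coefs), 1), "largest:", round(max(coefs), 1))
print("share of days with a negative coefficient:",
      round(sum(c < 0 for c in coefs) / len(coefs), 2))
```

Seeing the full spread of day-by-day coefficients from pure simulation makes it easier to judge whether any single day’s coefficient is worth a story.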
