
Estimating excess mortality in rural Bangladesh from surveys and MRP

(This post is by Yuling, not by/reviewed by Andrew)

Recently I (Yuling) contributed to a public health project with many great collaborators. The goal is to understand excess mortality potentially related to Covid-19. Before the recent case surge in South Asia, we had seen stories claiming that the pandemic might have hit some low-income countries less heavily than the US or Europe, but many low-income countries also lack reliable official statistics, so the confirmed Covid-19 death counts might be questionable. To figure out how mortality changed in rural Bangladesh in 2020, my collaborators conducted repeated phone surveys in 135 villages near Dhaka and collected death counts from Jan 2019 to Oct 2020 (the survey questions include “Was member X in your household deceased during some time range Y? When? What reason?”).

The findings

The statistical analysis appears to be straightforward: compute the mortality rates in 2019 and 2020 in the sample, subtract to obtain their difference, run a t-test, done: you don’t need a statistician. Well, not really. Some deaths were reported in duplicate by multiple families, which required preprocessing and manual inspection; the survey population changed dynamically over time because of migration, newborns, deaths, and survey drop-out, so we need some survival analysis; the survey has a large sample size, but we also want to stratify across demographic features such as age, gender, and education; and the repeated survey is not random, so we need to worry about selection bias from non-response and drop-out.

It is interesting to compare these data to the birthday model and the day-of-month effect. We used to make fun of meaningless heatmap comparisons of day-times-month of birth dates. But this time the day-of-month effect is real: participants were more likely to report death dates of deceased family members in the first half of the month as well as on certain round-number days. Presumably I could model these “rounding errors” with a Gaussian process regression too, but I did not bother and simply aggregated death counts into months.

In the actual statistical modeling, we stratify the observed data into cells defined by the interaction of age, gender, month, and education, and fit a multilevel logistic model. Education is not likely to have a big impact on mortality, but we include it to adjust for non-response and drop-out bias. We poststratify the inferred mortality rates to the population and compare 2019 with 2020. The following graph shows the baseline age and month “effects”—not really causal effects though.
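
The actual model was fit in Stan (the code and data are linked further down in the post), but here is a rough sketch of the kind of multilevel logistic regression described above, using lme4 as a stand-in, with simulated data and made-up variable names:

# Rough sketch (not the actual Stan model): multilevel logistic regression on
# cell-level death counts, with varying intercepts for age group, month, and
# education. The data below are simulated only to make the sketch run.
library(lme4)

set.seed(1)
cells <- expand.grid(age_group = factor(1:8), gender = c("f", "m"),
                     month = factor(1:22), edu = factor(1:3))
cells$n_at_risk <- rpois(nrow(cells), 200)
cells$p_true    <- plogis(-7 + 0.6 * as.numeric(cells$age_group))  # mortality rises with age
cells$deaths    <- rbinom(nrow(cells), cells$n_at_risk, cells$p_true)

fit <- glmer(cbind(deaths, n_at_risk - deaths) ~ gender +
               (1 | age_group) + (1 | month) + (1 | edu),
             family = binomial, data = cells)
summary(fit)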

All coefficients are on the scale of monthly-death log odds ratios. For interpretation, we cannot apply the “divide-by-four” rule because the monthly death rate is low. But because of this low rate, the log odds is close to the log probability, so we can directly exponentiate a small coefficient and treat it as a multiplicative factor, which in turn is approximated by exp(a) − 1 ≈ a. Hence a logistic coefficient of 0.1 is approximately a 10% multiplicative increase in monthly death probability (it is like “divide-by-one”, except that the genuine divide-by-four rule is on the additive probability scale).
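
As a quick numerical check of that approximation (a throwaway sketch, with an arbitrary small baseline rate):

# For a small baseline probability, adding a on the logit scale multiplies the
# probability by roughly exp(a), and exp(a) - 1 is roughly a itself.
a  <- 0.1
p0 <- 0.001                          # some small baseline monthly death probability
p1 <- plogis(qlogis(p0) + a)         # probability after adding a on the logit scale
c(ratio = p1 / p0, exp_a = exp(a), divide_by_one = 1 + a)   # all about 1.10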

One extra modeling piece is that we want to understand the excess mortality at a finer granularity. So I also model the extra risk after Feb 2020 with another multilevel regression across age, gender, and education. The following graph shows the excess mortality rate by age for females from low-education families:

This type of modeling is natural, or even the default, to anyone familiar with this blog. Yet what I want to discuss are two additional challenges in communication.

How do we communicate excess mortality?

The figure above shows the excess mortality in terms of log odds, which in this context is close to log probability. We may want to aggregate this log odds into a monthly death probability. That is done by the poststratification step: we simulate the before and after monthly death probabilities for all cells and aggregate them by the 2019 census weights (according to age, gender, and education) in the same area. Below is the resulting excess mortality, where the unit is death count per month.

We actually see a decline in mortality starting from Feb 2020, especially at high ages (80 and above), likely due to extra caution or family companionship. Because of its high baseline rates, this oldest age group largely dominates the overall mortality comparison. Indeed we estimate a negative overall excess mortality in this area since Feb 2020. On the other hand, if we look at the average log odds change, the estimate is less constrained and largely overlaps with zero.

We report both results in a transparent way. You can play with our Stan code and data too. But my point here is more about how we communicate “excess mortality.” Think about an artificial made-up example: say there are three age groups, children, adults (age < 59.5), and elderly adults (age > 59.5). Suppose they have baseline mortality rates of (0.5, 5, 30) per year per thousand people, and assume the census proportions of these three age groups are (20%, 60%, 20%). If their mortality changes by (−10%, −10%, +20%), the average percentage change is −4%. But the absolute mortality changes from (0.5, 5, 30) · (20%, 60%, 20%) = 9.1 to 9.99, which is 0.89 excess deaths per year per thousand people, or a 0.89/9.1 = 10% increase. Is it a 10% increase or a 4% decrease? That depends on your interpretation. I don’t want to call it a Simpson’s paradox. Both numbers are meaningful on their own: the average excess log mortality measures the average individual risk change; the average excess mortality is related to death tolls. I think the WHO and CDC report the latter number, but it also has the limitation of being driven by only one subgroup—we have discussed similar math problems in the context of simulated tempering and thermodynamics, and a similar story in age adjustment over time.
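
Here is that made-up example in a few lines of R, just to make the two summaries explicit:

# Made-up example from the text: three age groups, census weights, and two ways
# to summarize the same change in mortality.
baseline <- c(children = 0.5, adults = 5, elderly = 30)   # deaths per year per 1000
weights  <- c(0.20, 0.60, 0.20)                           # census proportions
change   <- c(-0.10, -0.10, 0.20)                         # relative change per group

# 1. Average of the group-level percentage changes (average individual risk change)
sum(weights * change)                               # -0.04, a 4% average decrease

# 2. Change in the population-level death rate (related to death tolls)
before <- sum(weights * baseline)                   # 9.1
after  <- sum(weights * baseline * (1 + change))    # 9.99
(after - before) / before                           # ~0.10, a 10% increase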

The excess mortality depends on which comparison window you pick

Lastly, the question we would like to address with the evidence of “excess mortality” is essentially a causal question: what we care about is the difference in death tolls had Covid hypothetically not occurred. Except this causal question is even harder: we do not know when Covid was likely to become a significant factor in this area. In other words, the treatment label is unobserved.

Instead of doing change-point detection, which I think would lead to non-identification if I were to model the individual treatment effect, I view the starting month of the excess mortality comparison window as part of the causal assumption. I fit the model under varying assumptions and compute the excess mortality rate for starting months from Feb to Aug. The results are interesting: the overall excess mortality is nearly negative when we compare 2020 with 2019, which is the result we report in the title. But if we choose to only compare later months, such as starting from July, there is a gradually positive excess mortality, especially for people in the age group 50–79. Well, it is not significant, and we would need to account for multiple testing when we vary these assumptions, but I don’t think hypothesis testing is relevant here. Overall, this increasing trend is concerning, amid a more serious third wave recently in South Asia at large.

We also report the relative changes (there is always a degree of freedom in causal inference to choose between multiplicative and additive effects).

This varying-the-causal-assumptions-as-much-as-you-can approach is what we recommend in our workflow: when we have multiple plausible causal assumptions, report fits from all models (Fig. 22 of the workflow paper).

But this graph leads to further questions. Recall the birthday model: if there are fewer newborns on Halloween, then there have to be more on the days before or after. The same baby-has-to-go-somewhere logic applies here. Although the overall excess mortality was likely negative starting from Feb 2020, we infer an increasing trend of mortality since the second half of 2020, especially for the age group 50–79. What we do not know for sure is whether this positive excess mortality was a real temporal effect (a signal of gradual Covid infections) or a compensation for fewer deaths in the first half of the year. Likewise, in many countries where there was a strong positive excess mortality in 2020, there could just as well be a mortality decline over the following short term. In the latter case, is this base-population change from the previous period (aka the “dry tinder effect,” or its opposite, a wet tinder effect) part of the indirect effect, or mediation, of Covid-19? Or is it a confounder that we want to adjust for by fixing a constant population? The answer depends on how the “treatment” is defined, akin to the ambiguity of a “race effect” in social science studies.

2 reasons why the CDC and WHO were getting things wrong: (1) It takes so much more evidence to correct a mistaken claim than to establish it in the first place; (2) The implicit goal of much of the public health apparatus is to serve the health care delivery system.

Peter Dorman points to an op-ed by Zeynep Tufekci and writes:

This is a high profile piece in the NY Times on why the CDC and WHO have been so resistant to the evidence for aerosol transmission. What makes it relevant is the discussion of two interacting methodological tics, the minimization of Type I error stuff that excludes the accumulation of data that lie below some arbitrary cutoff and the biased application of this standard to work that challenges the received wisdom. It takes so much more evidence to correct a mistaken claim than to establish it in the first place.

I still suspect there is an additional factor: the implicit goal of much of the public health apparatus to serve the health care delivery system. One reason mask use was discouraged at the beginning of the pandemic was to protect the supply of N-95s to health care practitioners. Travel bans were opposed, since international travel and the movement of supplies were regarded as necessary for organizing and administering care, especially in developing countries. Recognition of aerosol transmission would have required the costly air filtration systems used in infectious disease wards to be installed throughout all hospitals and clinics (at least, as I understand it, under current protocols). This also helps explain the prominence given to hospitalization and ICU use as morbidity metrics, and the delayed recognition of long Covid as a health concern in its own right. If the primary constituency, and source of funding and personnel, for the public health apparatus is the medical system, it makes sense that, under conditions of uncertainty, the needs of medical practitioners would take precedence over those of the public at large. You can always justify it by saying we need the hospitals to be well stocked, doctors to travel freely, demand to be well below capacity and operations to not be bogged down by complicated protocols so the public can be better served. This is also relevant to the blog because there has been a sort of slipperiness about what outcomes constitute the costs and benefits that go into decision making under uncertainty.

The pandemic has been a spectacular laboratory for exploring the interconnections between science, the rules of evidence, risk communication, institutional incentives and political pressures. Some probing books about this will probably appear over the coming years.

Interesting points. I hadn’t thought of it that way, but it makes sense. I guess that similar things could be said about education, the criminal justice system, the transportation system, and lots of other parts of society that have dominant stakeholders. Even when these systems have had serious failures, we go through them when trying to implement improvements.

If a value is “less than 10%”, you can bet it’s not 0.1%. Usually.

This post is by Phil Price, not Andrew.

Many years ago I saw an ad for a running shoe (maybe it was Reebok?) that said something like “At the New York Marathon, three of the five fastest runners were wearing our shoes.” I’m sure I’m not the first or last person to have realized that there’s more information there than it seems at first. For one thing, you can be sure that one of those three runners finished fifth: otherwise the ad would have said “three of the four fastest.” Also, it seems almost certain that the two fastest runners were not wearing the shoes, and indeed it probably wasn’t 1-3 or 2-3 either: “The two fastest” and “two of the three fastest” both seem better than “three of the top five.” The principle here is that if you’re trying to make the result sound as impressive as possible, an unintended consequence is that you’re revealing the upper limit. Maybe Andrew can give this principle a clever name and add it to the lexicon. (If it isn’t already in there: I didn’t have the patience to read through them all. I’m a busy man!)

This came to mind recently because this usually-reliable principle has been violated in spectacular fashion by the Centers for Disease Control (CDC), as pointed out in a New York Times article by David Leonhardt. The key quote from the CDC press conference is “DR. WALENSKY: … There’s increasing data that suggests that most of transmission is happening indoors rather than outdoors; less than 10 percent of documented transmission, in many studies, have occurred outdoors.”  Less than 10%…as Leonhardt points out, that is true but extremely misleading. Leonhardt says “That benchmark ‘seems to be a huge exaggeration,’ as Dr. Muge Cevik, a virologist at the University of St. Andrews, said. In truth, the share of transmission that has occurred outdoors seems to be below 1 percent and may be below 0.1 percent, multiple epidemiologists told me. The rare outdoor transmission that has happened almost all seems to have involved crowded places or close conversation.”

This doesn’t necessarily violate the Reebok principle because it’s not clear what the CDC was trying to achieve. With the running shoes, the ad was trying to make Reeboks seem as performance-boosting as possible, but what was the CDC trying to do? Once they decided to give a number that is almost completely divorced from the data, why not go all the way? They could say “less than 30% of the documented transmissions have occurred outdoors”, or “less than 50%”, or anything they want…it’s all true! 

Frank Sinatra (3) vs. Virginia Apgar; Julia Child advances

I happened to come across this one from a couple years ago and the whole thing made me laugh so hard that I thought I’d share again:

My favorite comment from yesterday came from Ethan, who picked up on the public TV/radio connection and rated our two candidate speakers on their fundraising abilities. Very appropriate for the university—I find myself spending more and more time raising money for Stan, myself. A few commenters picked up on Child’s military experience. I like the whole shark repellent thing, as it connects to the whole “shark attacks determine elections” story. Also, Jeff points out that “a Julia win would open at least the possibility of a Wilde-Child semifinal,” and Diana brings up the tantalizing possibility that Julia Grownup would show up. That would be cool. I looked up Julia Grownup and it turns out she was on Second City too!

As for today’s noontime matchup . . . What can I say? New Jersey’s an amazing place. Hoboken’s own Frank Sinatra is only the #3 seed of our entries from that state, and he’s pitted against Virginia Apgar, an unseeded Jerseyite. Who do you want to invite for our seminar: the Chairman of the Board, or a pioneering doctor who’s a familiar name to all parents of newborns?

Here’s an intriguing twist: I looked up Apgar on wikipedia and learned that she came from a musical family! Meanwhile, Frank Sinatra had friends who put a lot of people in the hospital. So lots of overlap here.

Sinatra advanced to the next round. As much as we’d have loved to see Apgar, we can’t have a seminar speaker who specializes in putting people to sleep. So Frank faced Julia in the second round. You’ll have to see here and here to see how that turned out. . . .

The Javert paradox rears its ugly head

The Javert paradox is, you will recall, the following: Suppose you find a problem with published work. If you just point it out once or twice, the authors of the work are likely to do nothing. But if you really pursue the problem, then you look like a Javert. I labeled the paradox a few years ago in an article entitled, “Can You Criticize Science (or Do Science) Without Looking Like an Obsessive? Maybe Not.”

This came up recently in an email from Chuck Jackson, who pointed to this news article that went like this:

Does ocean acidification alter fish behavior? Fraud allegations create a sea of doubt . . .

[Biologist Philip] Munday has co-authored more than 250 papers and drawn scores of aspiring scientists to Townsville, a mecca of marine biology on Australia’s northeastern coast. He is best known for pioneering work on the effects of the oceans’ changing chemistry on fish, part of it carried out with Danielle Dixson, a U.S. biologist who obtained her Ph.D. under Munday’s supervision in 2012 and has since become a successful lab head at the University of Delaware . . .

In 2009, Munday and Dixson began to publish evidence that ocean acidification—a knock-on effect of the rising carbon dioxide (CO2) level in Earth’s atmosphere—has a range of striking effects on fish behavior, such as making them bolder and steering them toward chemicals produced by their predators. As one journalist covering the research put it, “Ocean acidification can mess with a fish’s mind.” The findings, included in a 2014 report from the Intergovernmental Panel on Climate Change (IPCC), could ultimately have “profound consequences for marine diversity” and fisheries, Munday and Dixson warned.

But their work has come under attack. In January 2020, a group of seven young scientists, led by fish physiologist Timothy Clark of Deakin University in Geelong, Australia, published a Nature paper reporting that in a massive, 3-year study, they didn’t see these dramatic effects of acidification on fish behavior at all. . . .

Some scientists hailed it as a stellar example of research replication that cast doubt on extraordinary claims that should have received closer scrutiny from the start. “It is by far the best environmental science paper I have read for a long time,” declared ecotoxicologist John Sumpter of Brunel University London.

Others have criticized the paper as needlessly aggressive. Although Clark and his colleagues didn’t use science’s F-word, fabrication, they did say “methodological or analytical weaknesses” might have led to irreproducible results. And many in the research community knew the seven authors take a strong interest in sloppy science and fraud—they had blown the whistle on a 2016 Science paper by another former Ph.D. student of Munday’s that was subsequently deemed fraudulent and retracted—and felt the Nature paper hinted at malfeasance. . . .

What the hell? It’s now considered “needlessly aggressive” to bring up methodological or analytical weaknesses?

Have these “Others have criticized” people never seen a referee report?

I’m really bothered by this attitude that says that, before publication, a paper can be slammed a million ways which way by anonymous reviewers. But then, once the paper has appeared and the authors are celebrities, all of a sudden it’s considered poor form to talk about its weaknesses.

The news article continues:

The seven [critics] were an “odd little bro-pocket” whose “whole point is to harm other scientists,” marine ecologist John Bruno of the University of North Carolina, Chapel Hill—who hasn’t collaborated with Dixson and Munday—tweeted in October 2020. “The cruelty is the driving force of the work.”

I have no idea what a “bro-pocket” is, and Google was no help here. The seven authors of the critical article appear to be four men and three women. I guess that makes it a “bro pocket”? If the authors had been four women and three men, maybe they would’ve been called a “coven of witches,” or someone would’ve come up with some other insult.

In any case, this seems like a classic Javert bind. Sure, the critics get bothered by research flaws: if they weren’t bothered, they wouldn’t have put in the effort to track down all the problems!

More from the news article:

Clark and three others in the group took another, far bigger step: They asked three funders that together spent millions on Dixson’s and Munday’s work—the Australian Research Council (ARC), the U.S. National Science Foundation (NSF), and the U.S. National Institutes of Health (NIH)—to investigate possible fraud in 22 papers. . . .

Munday calls the allegations of fraud “abhorrent” and “slanderous” . . . Dixson denies making up data as well. . . . But multiple scientists and data experts unconnected to the Clark group who reviewed the case at Science’s request flagged a host of problems in the two data sets, and one of them found what he says are serious irregularities in the data for additional papers co-authored by Munday.

Also this:

Dixson, in the February interview, said she did not know about the allegations. Although she denies making up data, “There hypothetically could be an error in there,” she said, perhaps because of mistakes in transcribing the data; “I don’t know. I’m human.” . . . Clark and colleagues also found problems in the data for the 2014 paper in Nature Climate Change, which showed fish behavior is altered near natural CO2 seeps off the coast of Papua New Guinea. (Munday was the first of five authors on the study, Dixson the third.) That data set also contained several blocks of identical measurements, although far fewer than in the Science paper. . . . Munday says Dixson has recently provided him with one original data sheet for the study, which shows she made a mistake transcribing the measurements into the Excel file, explaining the largest set of duplications. “This is a simple human error, not fraud,” he says. Many other data points are similar because the methodology could yield only a limited combination of numbers, he says. Munday says he has sent Nature Climate Change an author correction but says the mistake does not affect the paper’s conclusions.

Bad data but they do not affect the paper’s conclusions, huh? We’ve heard that one before. It kinda makes you wonder why they bother collecting data at all, given that the conclusions never seem to change.

And here’s someone we’ve heard from before:

[Nicholas] Brown . . . identified problems of a different nature in two more Munday papers that had not been flagged as suspicious by the Clark team and on which Dixson was not an author. At about 20 places in a very large data file for another 2014 paper in Nature Climate Change, the raw data do not add up to total scores that appear a few columns farther to the right. And in a 2016 paper in Conservation Physiology, fractions that together should add up to exactly one often do not; instead the sum varies from 0.15 to 1.8.

Munday concedes that both data sets have problems as well, which he says are due to their first authors hand copying data into the Excel files. He says the files will be corrected and both journals notified. But Brown says the anomalies strongly suggest fabrication. No sensible scientist would calculate results manually and then enter the raw data and the totals—thousands of numbers in one case—into a spreadsheet, he says.

To him, the problems identified in the data sets also cast suspicions on the “ludicrous effect sizes” in many of the 22 papers flagged by the whistleblowers. “Suppose you’re going to the house of somebody you think may have been handling stolen televisions, and you found 22 brand new televisions in his basement, and three had serial numbers that corresponded to ones that have been stolen from shops,” Brown says. “Are you going to say, ‘Yeah, we’ll assume you’ve got the purchase receipts for the other 19?’”

OK, now we’re getting rude. If talking about “methodological or analytical weaknesses” is needlessly aggressive, what is it when you liken someone to a thief of television sets?

Back to Javert

I have not looked into the details of this case. It could be that one or more authors of those papers were committing fraud, it could be that they didn’t know what they were doing, it could be that they were just really sloppy, or it could be some combination of these, as with the Cornell pizza researcher guy who seemed to have just had a big pile of numbers in his lab and would just grab whatever numbers he needed when it was time to write a paper. It could be that none of those findings are replicable, or it could be that the errors are minor and everything replicates. Someone else will have to track all this down.

What bothers me is the way the critics have been attacked. There was that guy on twitter quoted above, and then there’s Munday, one of the original researchers, who in 2016 wrote: “It seems that Clark and Jutfelt are trying to make a career out of criticizing other people’s work. I can only assume they don’t have enough good ideas of their own to fill in their time . . . Recently, I found out they have been ‘secretly’ doing work on the behavioural effects of high CO2 on coral reef fishes, presumably because they want to be critical of some aspects of our work.”

The idea that there’s something shameful about critically assessing published work, or that it’s bad to “make a career” out of it, or that you can “only assume” that if someone is critical, that “they don’t have enough good ideas of their own to fill in their time” . . . That’s just a horrible, horrible attitude. Criticism is a valuable and often thankless part of science.

And to slam the critics for going public . . . jeez! They tried everything and were stonewalled at every turn, so, yeah, they went public. Why not? The original papers were published in public. I don’t see why the reputations of the scientists who wrote those papers should be considered more valuable than the social value of getting the research right.

This is so annoying.

I think the original researchers should’ve said something like this:

We very much appreciate the efforts of these outside critics who found serious errors in our published papers. We are carefully looking into our data processing and analysis pipeline and will share all of it as soon as possible. In the meantime, we consider all our published findings to be tentative; we will only be able to say more after a careful assessment of our data and procedures. Whatever happens, we are pleased that our studies were reexamined so carefully, and again we thank the critics for their careful work.

P.S. We appreciate that some people have been defending us on social media and that our universities have stood by us. We pride ourselves on our research integrity and we very much regret the sloppiness in our work that has led to our errors. But, please, do not defend us by attacking our critics. There was nothing improper or inappropriate in their criticism of our work! They found flaws in our published papers, and it was their scientific duty to share this information with the world. Telling us personally wouldn’t have been enough. Our papers are in the public record. Our papers did have methodological weaknesses—that is clear, as we report values that are not mathematically or physically possible—and so the authors of the critical paper should not be attacked for pointing out these errors.

Doubting the IHME claims about excess deaths by country

The Institute for Health Metrics and Evaluation at the University of Washington (IHME) was recently claiming 900,000 excess deaths, but that doesn’t appear to be consistent with the above data.

These graphs are from Ariel Karlinsky, who writes:

The main point of the IHME report, that total COVID deaths, estimated by excess deaths, are much larger than reported COVID deaths, is most likely true and the fact that they have drawn attention to this issue is welcome. In a study of 94 countries and territories by Dmitry Kobak and myself – we estimate this ratio (based on actual all-cause mortality data) at 1.6. We believe this to be a lower bound since we lack data for much of the world, where more localized reports and studies demonstrate larger excess.

The issue with the IHME report is that it uses extremely partial data when much more encompassing data (such as World Mortality) exist, that the country-level estimates they showed publicly are incredibly different from known ones (mostly higher), and that they purport to accurately estimate excess deaths where data simply do not exist – this undermines a tremendous effort currently underway to improve and collect vital data in many countries.

Karlinsky also quotes Stéphane Helleringer:

I [Helleringer] do worry a lot though about the false impression of knowledge and confidence that is conveyed by their estimates; especially the detailed global maps like the ones they just produced for COVID death toll and MANY other health indicators for which few or no data are available. The risk is that IHME figures, with their apparent precision, will distract some funders & governments from the goal of universal death registration in low- to middle-income countries. From their standpoint, if IHME readily estimates mortality, why invest in complex systems to register each death?

This is an interesting moral-hazard issue that comes up from time to time when considering statistical adjustments. I remember years ago that some people opposed adjustments for census undercount based on the reasoning that, once the census was allowed to adjust, that would remove their motivation for counting everyone. In practice I think we have to push hard in both data collection and modeling: work to gather the cleanest and fullest possible datasets and then work to adjust for problems with the data. If the apparently very seriously flawed IHME estimates are taken as a reason not to gather good data, that’s a problem not so much with statistics as with governments and the news media who have the habit of running with authoritative-sounding numbers from respected institutions and not checking. We saw that a few years ago in a different setting with that silly Democracy Index. The claims lacked face validity and were based on crappy data, but, hey, it was from Harvard! The University of Washington isn’t quite Harvard, but I guess the IHME had a good enough public relations department that they could get that air of authority. Also, they sent a message that (some) people wanted to hear. Also, the coronavirus authorities, for all their flaws, were lucky in their enemies. Say what you want about the IHME, they weren’t as dumb as last year’s White House Council of Economic Advisors or the Stanford-advised Pandata team or the Hoover Institution’s Richard Epstein, who, when he’s not busy jamming his fingers down people’s throats, made a coronavirus death prediction that was off by a factor of 1000.

P.S. See Karlinsky’s page for more details on data and estimates.

P.P.S. Instead of using legends in his graphs, Karlinsky should’ve placed labels on each line directly. For some reason, many people don’t seem to know about this trick, which allows people to read your graph without having to go back and forth and decode the colors.

Blast from the past

Paul Alper points us to this news article, The Secret Tricks Hidden Inside Restaurant Menus, which is full of fun bits:

There is now an entire industry known as “menu engineering”, dedicated to designing menus that convey certain messages to customers, encouraging them to spend more and make them want to come back for a second helping.
“Even the binding around the menu is passing us important messages about the kind of experience we are about to have,” explains Charles Spence [author of the recent book Gastrophysics: The New Science of Eating], a professor in experimental psychology and multisensory perception at the University of Oxford.
“For a large chain that might have a million people a day coming into their restaurants around the world, it can take up to 18 months to put out a menu as we test everything on it three times,” says Gregg Rapp, a menu engineer based in Palm Springs, California
Perhaps the first thing a customer will notice about a menu when the waiter hands it to them is its weight. Heavier menus have been shown to suggest to the customer that they are in a more upscale establishment where they might expect high levels of service.
A study conducted by researchers in Switzerland found that a wine labelled with a difficult-to-read script was liked more by drinkers than the same wine carrying a simpler typeface. Spence’s own research has also found that consumers often associate rounder typefaces with sweeter tastes, while angular fonts tend to convey a salty, sour or bitter experience.
“Naming the farmer who grew the vegetables or the breed of a pig can help to add authenticity to a product,” says Spence.
A study from the University of Cologne in Germany last year showed that by cleverly naming dishes with words that mimic the mouth movements when eating, restaurants could increase the palatability of the food. They found words that move from the front to the back of the mouth were more effective – such as the made up word “bodok”.
Dan Jurafsky, a professor of computational linguistics at Stanford University, performed a study that analysed the words and prices of 650,000 dishes on 6,500 menus. He found that if longer words were used to describe a dish, it tended to cost more. For every letter longer the average word length was, the price of the dish it was describing went up by 18 cents (14p).
“When we [Rapp] do eye tracking on a customer with a menu in their hand, we typically see hotspots in the upper right hand side,” he says. “The first item on the menu is also the best real estate.”

But filling a menu with too many items can actually hamper choice, according to menu design experts. They say offering any more than seven items can overwhelm diners. To overcome this, they tell restaurants to break down their menus into sections of between five and seven dishes.

“More than seven is too many, five is optimal and three is magical,” says Rapp. There is some research to back this up – a study from Bournemouth University found that in fast food restaurants, customers wanted to pick from six items per category. In fine dining establishments, they preferred a little more choice – between seven and 10 items.

“The problem with pictures is that the brain will also taste the food a little bit when it sees a picture, so when the food comes it may not be quite as good as they imagined,” warns Rapp.
In recent years, Pizza Hut began testing eye-tracking technology to predict what diners might want as they scan through 20 different toppings before offering a likely combination to the customer.
But the article is outdated: it was originally published on November 20, 2017, by BBC Future. And among its experts is Brian Wansink of the Food and Brand Lab at Cornell University, who says that putting brand names into dish titles is an effective strategy for many chain restaurants, as are nostalgic labels like “handmade” or “ye olde,” and that a dose of patriotism and family can also boost sales.

I guess we can apply some partial pooling. This news article reports the work of several different research groups, and Wansink’s is one of them. So, given other things we’ve learned about Wansink’s work, we can make some inference about the distribution of studies of this type . . .

One can also consider this from the reporting standpoint. 100% of the quotations come from people with a direct incentive to promote this work.

Really sad to see this coming from the BBC. They’re supposed to be a legitimate news organization, no? I can’t really fault them for citing Wansink—back then, there were still lots of people who hadn’t heard about what was up with his lab—but even in 2017 weren’t they teaching journalists to interview some non-interested parties when preparing their stories?
P.S. The most extreme bit is this quote:

More than seven is too many, five is optimal and three is magical . . .

But that just gives away the game. Now we’re talking about magic, huh?

Formalizing questions about feedback loops from model predictions

This is Jessica. Recently I asked a question about when a model developer should try to estimate the relationship between model predictions and the observed behavior that results when people have access to the model predictions. Kenneth Tay suggested a recent machine learning paper on Performative Prediction by Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dunner, and Moritz Hardt. It comes close to answering the question and raises some additional ones.

My question had been about when it’s worthwhile, in terms of achieving better model performance, for the model to estimate and adjust for the function that maps from the predictions you visualize to the realized behavior. This paper doesn’t attempt to address when it’s worthwhile, but assumes that these situations arise and formalizes the concepts you need to figure out how to deal with it efficiently. 

It’s a theoretical paper, but they give a few motivating examples where reactions to model predictions change the target of the prediction: crime prediction changes police allocation changes crime patterns, stock price prediction changes trading activity changes stock price, etc. In ML terms, you get distribution shift, referring to the difference between the distribution you used to develop the model and the one that results after you deploy the model, whenever reactions to predictions interfere with the natural data generating process. They call this “performativity.” So what can be said/done about it? 

First, assume there’s a map D(.) from model parameters to the joint distributions over features (X) and outcomes (Y) they induce, e.g., for any specific parameter choice theta, D(theta) is the specific joint distribution over X and Y that you get as a result of deploying a model with parameters theta. The problem is that the model is calibrated to the data that were seen prior to deploying it, not the data that result after it’s deployed.

Typically in ML the way to deal with this is to retrain the model. However, maybe you don’t always have to do this. The key is to find the decision rule (here defined by the model parameters theta) that you know will perform well on the distribution D(theta) that you’re going to observe when you deploy the model. The paper uses a risk minimization framework to describe two properties you want in order to find this rule.

First you have to define the objective of finding the model specification (parameters theta) that minimizes loss over the induced distribution rather than the fixed distribution you typically assume in supervised learning. They call this “performative optimality.”

Next, you need “performative stability,” which is defined in the context of repeated risk minimization. Imagine a process defined by some update rule where you repeatedly find the model that minimizes risk (i.e., is optimal for that fixed distribution) on the distribution you observed when you deployed the previous version of the model, D(theta_t-1). You’re looking for a fixed point of this risk minimization process (what I called visualization equilibrium).
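
For reference, this is roughly how the paper states the two notions (my paraphrase of the definitions in Perdomo et al., with Z a feature–outcome pair and ℓ the loss):

% Performative optimality: minimize risk on the distribution your own deployment induces
\theta_{PO} \in \arg\min_{\theta} \; \mathbb{E}_{Z \sim D(\theta)} \, \ell(Z; \theta)

% Performative stability: a fixed point of retraining on the currently induced distribution
\theta_{PS} \in \arg\min_{\theta} \; \mathbb{E}_{Z \sim D(\theta_{PS})} \, \ell(Z; \theta)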

I like this formulation, and the implications it has for thinking about when this kind of thing is achievable. This gets closer to the question I was asking. The authors show that to guarantee that it’s actually feasible to find the performative optima and that performatively stable points exist, you need both your loss function and the map D(.) to have certain properties.

First, the loss needs to be smooth and strongly convex to guarantee a linear convergence rate in retraining to a stable point that approximately minimizes your performative risk. However, you also need the map D(.) to be sufficiently Lipschitz continuous, which constrains the relationship between the distance in parameter space between different thetas and the distance in response-distribution space between the different distributions that get induced by those alternative thetas. Stated roughly, your response distribution can’t be too sensitive to changes to the model parameters. If you can get a big change in the response distribution from a small change in model parameters, you might not be able to find your performatively stable solution.
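
Here’s a toy simulation of that sensitivity condition (my own sketch, not from the paper). We “predict” an outcome by a single number theta, and the population responds so that outcomes are drawn from Normal(mu0 + eps * theta, 1); eps plays the role of the Lipschitz constant of D(.). Repeated retraining settles at a performatively stable point when |eps| < 1 and never settles otherwise:

# Toy repeated risk minimization under performativity (squared loss).
# Deploying prediction theta induces outcomes y ~ Normal(mu0 + eps * theta, 1);
# retraining sets the next theta to the mean of the induced sample.
set.seed(123)
mu0 <- 2
simulate_retraining <- function(eps, n = 5000, iters = 15, theta0 = 0) {
  theta <- theta0
  path <- numeric(iters)
  for (t in 1:iters) {
    y <- rnorm(n, mean = mu0 + eps * theta, sd = 1)  # distribution induced by deployment
    theta <- mean(y)                                 # risk minimizer on the observed data
    path[t] <- theta
  }
  path
}

round(simulate_retraining(eps = 0.5), 2)   # settles near mu0 / (1 - eps) = 4
round(simulate_retraining(eps = 1.5), 2)   # sensitive map: retraining never settles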

This is where things get interesting, because now we can tie things back to real-world situations and ask: when is this guaranteed? I have some hunches, based on my reading of recent work in AI-human collaboration, that maybe this doesn’t always hold. For example, some work has discussed how, in situations where you have a person overseeing how model predictions are applied, you have to be careful about assuming that it’s always good to update your model because it improves accuracy. A more accurate model may lead to worse human/model “team” decision making if the newly updated model’s predictions conflict in some noticeable way with the human’s expectations about it. Instead you may want to aim for updates that won’t change the predictions so much from the previously deployed model’s predictions that the human stops trusting the model altogether and makes all the decisions themselves, because then you’re stuck with human accuracy on a larger proportion of decisions. So this implies that it may be possible for a small change in parameter space to result in a disproportionately large change in response distribution space.

There’s lots more in the paper, including some analysis showing that it can be harder in general to achieve performative optimality than to find a performatively stable model. Again, it’s theoretical, so it’s more about reflecting on what’s possible with different retraining procedures, though they run some simulations involving a specific game (strategic classification) to demonstrate how the concepts can be applied. It seems there’s been some follow-up work that generalizes to a setting where the distribution you get from some set of model parameters (a result of strategic behavior) isn’t deterministic but depends on the previous state. This setting makes it easier to think about response distribution shifts caused by “broken” mental models, for example. At any rate, I’m excited to see that ML researchers are formalizing these questions, so that we have more clues about what to look for in data to better understand and address these issues.

Raymond Smullyan on Ted Cruz, Al Sharpton, and those scary congressmembers

Palko shares this fun logic puzzle from the great Raymond Smullyan which also has obvious implications for modern politics:

Inspector Craig of Scotland Yard was called to Transylvania to solve some cases of vampirism. Arriving there, he found the country inhabited both by vampires and humans. Vampires always lie and humans always tell the truth. However, half the inhabitants, both human and vampire, are insane and totally deluded in their beliefs: all true propositions they believe false, and all false propositions they believe true. The other half of the inhabitants are completely sane: all true statements they know to be true, and all false statements they know to be false. Thus sane humans and insane vampires make only true statements; insane humans and sane vampires make only false statements. Inspector Craig met two sisters, Lucy and Minna. He knew that one was a vampire and one was a human, but knew nothing about the sanity of either. Here is the investigation: Craig (to Lucy): Tell me about yourselves. Lucy: We are both insane. Craig (to Minna): Is that true? Minna: Of course not! From this, Craig was able to prove which of the sisters was the vampire. Which one was it?

With all the conspiracy theories floating around, this distinction between “vampires” and “humans” keeps arising. I assume people such as Al Sharpton and Ted Cruz are “sane vampires” who know when they’re promoting lies, but then there are lots of others like those notorious q-anon congressmembers who are “insane humans” who actually believe what they’re saying.

A complicating factor is that these people help each other. The sane vampires make use of the insane humans in order to increase their political power, and, conversely, the insane humans get support for their false beliefs from the political power of the sane vampires.

So it’s not just Inspector Craig who’s playing the vampires and the humans against each other. The vamps and humans are getting into it directly. And then there are the false statements that get amplified by some mixture of sane vampires and insane humans in the news media.

I don’t think that this post adds anything to our understanding of politics or political science—lots of observers, academics and non-academics, have been talking for a while about the interaction between political manipulators and sincere believers. And this doesn’t even get into issues such as internet trolls who are being paid expressly to spread disinformation and to attack debunkers. So, these concerns are out there, even if we don’t always know what to do about them. I just think it’s interesting to see how Smullyan anticipated all this.

Any graph should contain the seeds of its own destruction

The title of this post is a line that Jeff Lax liked from our post the other day. It’s been something we’ve been talking about a long time; the earliest reference I can find is here, but it had come up before then, I’m sure.

The above histograms illustrate. The upper left plot averages away too much of the detail. The graph with default bin widths, on the upper right, is fine, but I prefer the lower left graph, which has enough detail to reveal the limits of the histogram’s resolution. That’s what I mean by the graph containing the seeds of its own destruction. We don’t need confidence bands or anything else to get a sense of the uncertainty in the bar heights; we see that uncertainty in the visible noise of the graph itself. Finally, the lower right graph goes too far, with so many bins that the underlying pattern is no longer clear.

My preferred graph here is not the smoothest or even the one that most closely approximates the underlying distribution (which in this case is a simple unit normal); rather, I like the graph that shows the data while at the same time giving a visual cue about its uncertainty.

P.S. Here’s the code:

a <- rnorm(1000)  # 1,000 draws from a unit normal
par(mar=c(3,3,1,1), mgp=c(1.5,0.5,0), tck=-.01, mfrow=c(2,2))  # 2 x 2 grid of plots
hist(a, breaks=seq(-4,4,1), bty="l", main="Not enough bins", xlab="")
hist(a, bty="l", main="Default bins", xlab="")
hist(a, breaks=seq(-4,4,0.25), bty="l", main="Extra bins", xlab="")
hist(a, breaks=seq(-4,4,0.1), bty="l", main="Too many bins", xlab="")

P.S. Yeah, yeah, I agree, it would be better to do it in ggplot2. And, yeah, yeah, it's a hack to hardcode the histogram boundaries at +/-4. I'm just trying to convey the graphical point; go to other blogs for clean code!

Postmodernism for zillionaires

“Postmodernism” in academia is the approach of saying nonsense using a bunch of technical-sounding jargon. At least, I think that’s what postmodernism is . . .

Hmm, let’s check wikipedia:

Postmodernism is a broad movement that developed in the mid- to late 20th century across philosophy, the arts, architecture, and criticism, marking a departure from modernism. The term has been more generally applied to describe a historical era said to follow after modernity and the tendencies of this era.

Postmodernism is generally defined by an attitude of skepticism, irony, or rejection toward what it describes as the grand narratives and ideologies associated with modernism . . .

Postmodernism is often associated with schools of thought such as deconstruction, post-structuralism, and institutional critique, as well as philosophers such as Jean-François Lyotard, Jacques Derrida, and Fredric Jameson.

Criticisms of postmodernism are intellectually diverse and include arguments that postmodernism promotes obscurantism, is meaningless, and that it adds nothing to analytical or empirical knowledge. . . .

OK, so, yeah, postmodernism is a kind of aggressive anti-rigor.

I was thinking about this when reading about Elon Musk’s latest plan, which is to build highway tunnels in Miami . . . a city that’s basically underwater. I mean, why not go all-in and build a fleet of submarines? Musk’s an expert on that, right?

It’s hard for me to believe Musk really plans to build tunnels in Miami; I guess it’s part of some plan he has to grab government $ (not that I have any problem with that, I spend government $ all the time). Meanwhile, various local government officials in Miami are saying positive things about the ridiculous tunnel plan—but I’m guessing that they don’t believe in it either; they just want to say yeah great because that’s what politicians do.

Anyway, the whole thing is so postmodern. It’s like some clever-clever philosopher positing a poststructuralist version of physics, or someone arguing that Moby Dick is just a text with no author, or whatever.

As with academic postmodernism, perhaps the very ridiculousness of the tunnels-in-Miami idea is part of its selling point? After all, anyone can come up with a good idea. It takes someone really special to promote a ridiculous idea with a straight face.

Also as with academic postmodernism, it’s almost irrelevant if the idea makes sense. For example, suppose some literature professor somewhere gets a reputation based on the latest version of hyperstructuralism or whatever. You and I can laugh, but this dude has a steady job. He doesn’t care whether this makes sense, any more than the beauty-and-sex-ratio researchers care whether their statistics make any sense. They have success within a closed community. With a zillionaire, the currency is not academic success but . . . currency. What does it matter to a zillionaire that he’s promoting a ridiculous idea? He has a zillion dollars, which in some way retroactively justifies all his decisions. Kinda like those pharaohs and their cathedrals. Or maybe it’s a Keynesian thing—taking literally the economic dictum about hiring people to dig holes and fill them up again. Experimental theater for the ultra-rich.

P.S. It seems that the above is unfair to postmodernism; see comments here, here, and here.

size of bubbles in a bubble chart

(This post is by Yuling, not Andrew.)

We like bubble charts. In particular, the bubble chart is the go-to visualization template for binary outcomes (voting, election turnout, mortality…): stratify observations into groups, draw a scatterplot of proportions versus a group feature, and use the bubble size to communicate the “group size”. To be concrete, below is a graph I drew for a recent paper, where we have survey data on mortality in some rural villages. The x-axis is the month and the y-axis is the surveyed mortality rate in that month. The size of the bubble is the accessible population size at risk during that month. I also put the older population in a separate row, as their mortality rates are orders of magnitude higher.

When we make a graph comparison we always have a statistical model in mind: the scale (probability, log probability, log odds…) implies the default modeling scale; one standard error bar corresponds to a normal assumption, etc. Here, as you can imagine, we have a hierarchical model in mind and would like to partial-pool across bubbles. Visualizing the size of the bubbles implicitly conveys the message that “I have many groups! they have imbalanced group sizes! so I need a Bayesian model to enhance my small area estimation!”

OK, nothing new so far. What I want to blog about is which “group size” we should visualize. To be specific, in this mortality survey, should the size be the size of the population (n), or the number of death cases (y)? I only need one of them because the y-axis indicates their ratio y/n. This distinction is especially clear for across-age comparisons.

It is common to pick the population size, which is what I did in the graph above. I also googled “gelman bubble chart election”; the first result that jumps out is the “Deep Interactions with MRP” paper, in which the example visualizes the subgroup population size (n) of the income × ethnicity × state groups, not their one-party vote count.

But I can provide a counterargument for visualizing the case count (y). Again, a graph is an implicit model: visualizing the proportion corresponds to a Bernoulli trial. The inverse Fisher information of theta in a Bernoulli(theta) likelihood is theta(1-theta). But that could be the wrong scale to look at, because theta is close to zero anyway. If we look at the log odds, the Fisher information of logit(theta) is theta(1-theta). In the mortality context, theta is small. Hence the “information about logit mortality” from a size-n group will be n*theta(1-theta) ≈ y, which also implies a 1/y variance scaling. This y-dependent factor determines how much a bubble is pooled toward the shared prior mean in a multilevel posterior.
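
For completeness, here is the one-line reparametrization behind that claim (standard Fisher-information algebra, nothing specific to this application):

I(\theta) = \frac{1}{\theta(1-\theta)}, \qquad \eta = \mathrm{logit}(\theta), \qquad \frac{d\theta}{d\eta} = \theta(1-\theta)
\;\Longrightarrow\; I(\eta) = I(\theta)\left(\frac{d\theta}{d\eta}\right)^{2} = \theta(1-\theta), \qquad n\,I(\eta) = n\,\theta(1-\theta) \approx n\theta \approx y \ \text{ for small } \theta .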

In this sense, the routine of visualizing the group size comes from a rule-of-thumb 1/n variance scaling, which is a reasonable approximation when the per-observation precision is roughly constant across groups. For a Bernoulli model, the reasoning above suggests a better bubble scale could be n*theta(1-theta) ≈ y(1-y/n), but it also sounds pedantic to compute such quantities for a raw data summary.
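
As a quick illustration with made-up numbers (not the survey data): three groups with identical population sizes carry very different amounts of information about their log odds, and only the y-based scaling reflects that.

# Made-up groups with equal population sizes but very different event rates.
groups <- data.frame(
  group = c("age 0-59", "age 60-79", "age 80+"),
  n     = c(10000, 10000, 10000),   # population at risk
  y     = c(5, 50, 300)             # observed deaths
)
groups$theta_hat   <- groups$y / groups$n
groups$size_by_n   <- groups$n                                              # identical bubbles
groups$size_by_inf <- groups$n * groups$theta_hat * (1 - groups$theta_hat)  # roughly y
groups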

Hmmm, any experimental measure of graphical perception will inevitably not measure what it’s intended to measure.

Indeed, the standard way that statistical hypothesis testing is taught is a 2-way binary grid. Both these dichotomies are inappropriate.

I originally gave this post the title, “New England Journal of Medicine makes the classic error of labeling a non-significant difference as zero,” but as I was writing it I thought of a more general point.

First I’ll give the story, then the general point.

1. Story

Dale Lehman writes:

Here are an article and editorial in this week’s New England Journal of Medicine about hydroxychloroquine. The study has many selection issues, but what I wanted to point out was the major conclusion. It was an RCT (sort of) and the main result was “After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19….” This was the conclusion when the result was “The incidence of new illness compatible with Covid-19 did not differ significantly between participants receiving hydroxychloroquine (49 of 414 [11.8%]) and those receiving placebo (58 of 407 [14.3%]); the absolute difference was -2.4 percentage points (95% confidence interval, -7.0 to 2.2; P=0.35).”

The editorial based on the study said it correctly: “The incidence of a new illness compatible with Covid-19 did not differ significantly between participants receiving hydroxychloroquine ….” The article had 25 authors, academics and medical researchers, doctors and PhDs—I did not check their backgrounds to see whether or how many statisticians were involved. But this is Stat 101 stuff: the absence of a significant difference should not be interpreted as evidence of no difference. I believe the authors, peer reviewers, and editors know this. Yet they published it with the glaring result ready for journalists to use.

To add to this, the study of course does not provide the data. And the editorial makes no mention of their recent publication (and retraction) of the Surgisphere paper. It would seem that that whole episode has had little effect on their processes and policies. I don’t know if you are up for another post on the subject, but I don’t think they should be let off the hook so easily.

Agreed. This reminds me of the stents story. It’s hard to avoid binary thinking: the effect is real or it’s not, the result is statistically significant or it’s not, etc.
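
As a quick check on what that reported interval actually says (a minimal sketch using only the counts quoted above), the data are consistent with anything from about a 7-percentage-point reduction to a 2-point increase:

# Counts quoted from the NEJM article: new illness compatible with Covid-19.
x <- c(49, 58)                       # hydroxychloroquine, placebo
n <- c(414, 407)
p <- x / n                           # 11.8% vs 14.3%
diff <- p[1] - p[2]                  # -2.4 percentage points
se <- sqrt(sum(p * (1 - p) / n))     # standard error of the difference
round(100 * c(estimate = diff, lower = diff - 1.96 * se, upper = diff + 1.96 * se), 1)
# roughly -2.4 [-7.0, 2.2], matching the reported 95% interval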

2. The general point

Indeed, the standard way that statistical hypothesis testing is taught is a 2-way binary grid, where the underlying truth is “No Effect” or “Effect” (equivalently, Null or Alternative hypothesis) and the measured outcome is “Not statistically significant” or “Statistically significant.”

Both these dichotomies are inappropriate. First, the underlying reality is not a simple Yes or No; in real life, effects vary. Second, it’s a crime to take all the information from an experiment and compress it into a single bit of information.

Yes, I understand that sometimes in life you need to make binary decisions: you have to decide whether to get on the bus or not. But. This. Is. Not. One. Of. Those. Times. The results of a medical experiment get published and then can inform many decisions in different ways.

Whassup with the weird state borders on this vaccine hesitancy map?

Luke Vrotsos writes:

I thought you might find this interesting because it relates to questionable statistics getting a lot of media coverage.

HHS has a set of county-level vaccine hesitancy estimates that I saw in the NYT this morning in this front-page article. It’s also been covered in the LA Times and lots of local media outlets.

Immediately, it seems really implausible how big some of the state-border discontinuities are (like Colorado-Wyoming). I guess it’s possible that there’s really such a big difference, but if you check the 2020 election results, which are presumably pretty correlated with vaccine hesitancy, it doesn’t seem like there is. For example, estimated vaccine hesitancy for Moffat County, CO is 17% vs. 31% for neighboring Sweetwater County, WY, but Trump’s vote share was actually higher (81%) in Moffat County than in Sweetwater County (74%).

According to HHS’s methodology, they don’t actually have county-level data from their poll (just state-level data), which isn’t too surprising. This is how they arrived at the estimates:

It’s not 100% clear to me what’s skewing the estimates here, but maybe there’s some confounder that’s making the coefficient on state of residence much too big — it could be incorporating the urban/rural split of the state, which they don’t seem to adjust for directly. I guess the way to check if this analysis is wrong would be to re-run it to try to predict county-level election results and see if you get the same discontinuities (which we know don’t exist there).

Let me know what you think. It’s strange to see results that seem so unlikely, just by looking at a map, reported so widely.

I agree that the map looks weird. I wouldn’t be surprised to see some state-level effects, because policies vary by state and the political overtones of vaccines can vary by state, but the border effects just look too large and too consistent here. I wonder if part of the problem here is that they are using health insurance status as a predictor, and maybe that varies a lot from state to state, even after adjusting for demographics?

How big is the Household Pulse Survey? The documentation linked above doesn’t say. I did some googling and finally found this document that says that HPS had 80,000 respondents in week 26 (the source of the data used to make the above map). 80,000 is pretty big! Not big enough to get good estimates for all the 3000 counties in the U.S., but big enough to get good estimates for subsets of states. For example, if we divide states into chunks of 200,000 people each, then we have, ummmm, 80K * 200K / 330 million = 48 people per chunk. That would give us a raw standard error of 0.5/sqrt(48) = 0.07 per chunk, which is pretty big, but (a) some regression modeling should help with that, and (b) it’s still enough to improve certain things such as the North Dakota / Minnesota border.
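Here’s that back-of-the-envelope calculation as a small Python sketch (just restating the arithmetic above, with the same rough inputs):

```python
import math

respondents = 80_000        # HPS week-26 sample size cited above
us_population = 330e6       # rough U.S. population
chunk_population = 200_000  # size of the hypothetical sub-state chunks

respondents_per_chunk = respondents * chunk_population / us_population
raw_se = 0.5 / math.sqrt(respondents_per_chunk)   # worst-case binomial standard error

print(f"respondents per chunk ≈ {respondents_per_chunk:.0f}")   # ≈ 48
print(f"raw standard error per chunk ≈ {raw_se:.2f}")           # ≈ 0.07
```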

The other thing is, I guess they know the county of each survey respondent, so they can include state-level and county-level predictors in their model. The model seems to have individual-level predictors but nothing at the state or county level. It might be kinda weird to use election results as a county-level predictor, but there are lots of other things they could use.

On the other hand, the map is not a disaster. The reader of the map can realize that the state borders are artifacts, and that tells us something about the quality of the data and model. I like to say that any graph should contain the seeds of its own destruction, and it’s appealing, in a way, that this graph shows the seams.

P.S. I wrote the above post, then I wrote the title, and then it struck me that this title has the same rhythm as What joker put seven dog lice in my Iraqi fez box?

Whatever you’re looking for, it’s somewhere in the Stan documentation and you can just google for it.

Someone writes:

Do you have link to an example of Zero-inflated poisson and Zero-inflated negbin model using pure stan (not brms, nor rstanarm)? If yes, please share it with me!

I had a feeling there was something in the existing documentation already! So I googled *zero inflated Stan*, and . . . yup, it’s the first link:

We don’t generally recommend the Poisson model; as discussed in Regression and Other Stories, we prefer the negative binomial. So I’m not thrilled with this being the example in the user’s guide. But the code is simple enough that it wouldn’t take much to switch in the negative binomial instead. Really, the main challenge with the negative binomial is not the coding so much as the interpretation of the parameters, which is something we were struggling with in chapter 15 of Regression and Other Stories as well.
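The Stan user’s guide code isn’t reproduced here, but to give a sense of what “switching in the negative binomial” amounts to, here is a minimal Python/SciPy sketch of the zero-inflated negative binomial log likelihood, assuming a mean–dispersion parameterization (variance = mu + mu^2/phi); the parameter names are my own, not the user’s guide’s:

```python
import numpy as np
from scipy.stats import nbinom

def zinb_loglik(y, pi, mu, phi):
    """Zero-inflated negative binomial log likelihood.

    pi  : probability of a structural zero
    mu  : negative binomial mean
    phi : dispersion, so that variance = mu + mu**2 / phi
    """
    # Convert (mu, phi) to SciPy's (n, p) parameterization
    n, p = phi, phi / (phi + mu)
    y = np.asarray(y)
    ll = np.where(
        y == 0,
        np.log(pi + (1 - pi) * nbinom.pmf(0, n, p)),   # zero from either component
        np.log1p(-pi) + nbinom.logpmf(y, n, p),        # nonzero: count component only
    )
    return ll.sum()

# Example: evaluate the likelihood on a toy vector of counts
print(zinb_loglik([0, 0, 1, 3, 0, 7], pi=0.3, mu=2.0, phi=1.5))
```

The interpretation issue mentioned above is visible here: mu and phi jointly control the variance, so a coefficient on the log-mean scale does not translate into effects as directly as in the Poisson case.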

Anyway, the real message of this post is that the Stan documentation is amazing. Thanks, Bob (and everybody else who’s contributed to it)!

Responding to Richard Morey on p-values and inference

Jonathan Falk points to this post by Richard Morey, who writes:

I [Morey] am convinced that most experienced scientists and statisticians have internalized statistical insights that frequentist statistics attempts to formalize: how you can be fooled by randomness; how what we see can be the result of biasing mechanisms; the importance of understanding sampling distributions. In typical scientific practice, the “null hypothesis significance test” (NHST) has taken the place of these insights.

NHST takes the form of frequentist significance testing, but not its function, so experienced scientists and statisticians rightly shun it. But they have so internalized its function that they can call for the general abolition of significance testing. . . .

Here is my basic point: it is wrong to consider a p value as yielding an inference. It is better to think of it as affording critique of potential inferences.

I agree . . . kind of. It depends on what you mean by “inference.”

In Bayesian data analysis (and in Bayesian Data Analysis) we speak of three steps:
1. Model building,
2. Inference conditional on a model,
3. Model checking and improvement.
Hypothesis testing is part of step 3.

So, yes, if you follow BDA terminology and consider “inference” to represent statements about unknowns, conditional on data and a model, then a p-value—or, more generally, a hypothesis test or a model check—is not part of inference; it is a critique of potential inferences.

But I think that in the mainstream of theoretical statistics, “inference” refers not just to point estimation, interval estimation, prediction, etc., but also to hypothesis testing. Using that terminology, a p-value is a form of inference. Indeed, in much of statistical theory, null hypothesis significance testing is taken to be fundamental, so that virtually all inference corresponds to some transformations of p-values and families of p-values. I don’t hold that view myself (see here), but it is a view.

The other thing I want to emphasize is that the important idea is model checking, not p-values. You can do everything that Morey wants to do in his post without ever computing a p-value, just by doing posterior predictive checks or the non-Bayesian equivalent, comparing observed data to their predictions under the model. The p-value is one way to do this, but I think it’s rarely a good way to do it. When I was first looking into posterior predictive checks, I was computing lots of p-values, but during the decades since, I’ve moved toward other summaries.
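For concreteness, here is a minimal Python sketch of the kind of check I have in mind, comparing an observed test statistic to its distribution under replicated data drawn from posterior draws. Everything here is illustrative (a conjugate Poisson–gamma toy model), not code from BDA:

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed counts (illustrative) and posterior draws of a Poisson rate
# under a Gamma(1, 1) prior, which has a conjugate gamma posterior
y_obs = rng.poisson(3.0, size=50)
lambda_draws = rng.gamma(shape=1 + y_obs.sum(), scale=1 / (1 + len(y_obs)), size=1000)

# Test statistic: proportion of zeros (a common check for count models)
def stat(y):
    return np.mean(y == 0)

t_obs = stat(y_obs)
t_rep = np.array([stat(rng.poisson(lam, size=len(y_obs))) for lam in lambda_draws])

# In practice you'd compare these graphically; a tail probability is just one summary
print(f"observed stat = {t_obs:.3f}")
print(f"replicated stats: mean = {t_rep.mean():.3f}, 95% interval = "
      f"({np.quantile(t_rep, 0.025):.3f}, {np.quantile(t_rep, 0.975):.3f})")
```

The comparison of t_obs to the spread of t_rep is the check; no p-value is required to see whether the model reproduces the feature of the data you care about.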

Thoughts inspired by “the Gospel of Jesus’s Wife”

1. Harvard’s current position on the matter

This is at Harvard University’s website:

But, no, it’s not a “Coptic Papyrus Fragment.” That’s a lie. Or, I guess, several years ago we could call that statement a mistake, but given that it’s been known to be false for several years, I think it’s fair to call it a lie at this point.

Also, I love that bit about “Report Copyright Infringement.” Promoting a debunked fraud, that’s no big deal, it’s just a day’s work at the uni. But copyright infringement . . . that’s another story!

2. The story

After reading a review of Ariel Sabar’s “Veritas: A Harvard Professor, a Con Man and the Gospel of Jesus’s Wife,” I decided to follow Paul Alper’s advice and read the book, which was conveniently available at the local library.

At first I thought the book would be boring, not because of the topic but because I’d already read the review so I knew how the story would turn out. But, no, the book was interesting and thought provoking. It had good guys and bad guys but was lots more than that, and there were three major strands: (1) The document itself: where it came from, how it was revealed and publicized, and the ways in which people figured out that it was fake; (2) The story of the German dude who did the forgery; (3) The story of Harvard and the academic world of early Christian studies. Each of these strands was interesting, and they interacted in interesting ways.

As with the book about Theranos, there was something weird about the whole thing, in that the warnings come right at the beginning and never stop. Agatha Christie it ain’t. The big difference is that the Theranos story was full of bad guys—I was particularly annoyed at the lawyer who went around intimidating anyone who might be a whistleblower—whereas the Gospel of Jesus’s Wife story seems to have involved one bad guy and a thousand dupes, people who legitimately felt bad when it turned out they’d been scammed. The Harvard professor in the story, Karen King, was somewhere in the middle: she got fooled, and then when the evidence of the scam started to come in, she kept looking away, as if she could just make the unwelcome evidence go away by just dismissing it.

In comparison, when my political science colleague Don Green learned that he’d been conned by Michael Lacour, a graduate student from another university, he (Green) made the wise decision to just hit reset on the story. Lacour’s fraud was hard to detect because it required looking carefully at his data. When the claims first came out, I’d written that the published results seemed too good to be true, but I didn’t suspect fraud; I just thought there might be some methodological issue I was missing. On the other hand, to see the problems in Lacour’s data did not require any specialized knowledge of ancient languages or dating of documents.

As noted in my earlier post, the first thing that Sabar’s story reminded me of was various junk science ideas that got debunked, but often only after many defensive moves by the people who originally promoted the bad ideas, and even after the original experimental claims had been abandoned, the bad ideas remained in Cheshire-cat or zombie form. An example is the so-called critical positivity ratio; see here for the latest in that story. One difference is that the “Jesus’s wife” document was an out-and-out fraud, whereas most of the junk science seems more like delusion or just bad scientific reasoning. For example, I have no reason whatsoever to think that the ages-ending-in-9 or ovulation-and-voting or himmicanes researchers engaged in fraud; I just think they received bad (if conventional) training, they didn’t know what they were doing, and then, once people pointed out the problems in their work, they were too committed to let go.

Another thing that struck me was the role of the news media, both in puffing up fraud or junk science or unsubstantiated claims more generally, then in shooting these claims down, then in promoting salvage operations, etc.

Sabar has this great quote:

King had correctly forecast the need to distance herself from a certain kind of coverage: the tabloids and clickbait sites that would inevitably mischaracterize the scrap as biographical proof that Jesus was married. But she failed to grasp something essential about the more responsible news organizations: they were not there to do her bidding and move on.

I’ve thought about this before. When academics get in the news for their research, they typically get uncritical coverage. So then when negative coverage does happen, it can be a real shock that the media are not “there to do their bidding.”

Later Sabar discusses two researchers who discovered a fatal flaw in the fake Bible document. These scholars had an insider-outsider perspective: they had professional training but were doing this particular research as a side hobby:

Though they groused about doing scholarship in basements alongside loads of laundry, they’d also come to see advantages: life outside academia’s high walls afforded freedoms unavailable inside. Bernhard and Askeland didn’t have to worry about what Harvard might think of them. They didn’t have to weigh the professional cost, as many young scholars do, of challenging powerful gatekeepers who might one day sit on a hiring or tenure committee.

Indeed, I get emails from people all the time who talk about bad things they’re seeing but request anonymity because they fear retaliation. That’s one advantage to me of being in the statistics and political science departments: the Association for Psychological Science can publish lies about me, and I don’t like it, but they live in a different world than I do.

This brings me to something that is notable by its absence in the Jesus’s wife story. There was no nastiness. Yes, the Harvard professor was a bit slippery with her evidence, engaging in wishful thinking long after it was clear that the document was a fraud—but neither she nor anyone else involved attacked their critics, either directly or through proxies. A couple years ago we talked about a ladder of responses to criticism, ranging from the most open (“1. Look into the issue and, if you find there really was an error, fix it publicly and thank the person who told you about it.”) to the most defensive (“7. Attack the messenger: attempt to smear the people who pointed out the error in your work, lie about them, and enlist your friends in the attack.”) In this case, the academics who were fooled by the forged document were somewhere in the middle. Lots of bullshitting but no attacks.

Sabar also discusses the writings of Robert Funk, a scholar of the Bible and collaborator of Karen King:

“The Bible, along with all our histories, is a fiction,” Funk said in his inaugural 1985 speech to the Jesus Seminar. Like all stories, the Bible was a series of “arbitrary” selections by an author who picked characters and events, then forced them into a causal chain with beginning, middle, and end. It was only by exposing the Bible’s fictive underpinnings that scholars could conjure a new, better tale. Unaccountably, however, this new tale wouldn’t necessarily be truer than the one it replaced. “What we need is a new fiction,” Funk told his colleagues . . .

This makes sense to me. When it comes to millennia-old stories, we have to distinguish between truth/fiction of the provenance of the documents and truth/fiction of the stories themselves. Lots of Biblical stories (not the so-called Gospel of Jesus’s wife, but many others, canonical and non-canonical) really were written down between 1500 and 2500 years ago (roughly), so it’s true that the stories existed as stories, even though there’s no independent evidence for the content of the stories. Similarly we can say it’s a fact that Tolkien wrote The Lord of the Rings even though hobbits are no more real than unicorns.

But this brings us to an interesting point, the argument that it wasn’t fair to diss the “Jesus’s wife” document. Sure, it was a modern forgery, but so what? Lots of genuine Biblical documents were written hundreds of years after the events they purport to describe, then there’s the Book of Mormon, etc. So why hold the Jesus’s wife document to a higher standard? I don’t really know the answer to this one. I think we should describe its provenance accurately. If it was really created around the year 2000, then don’t say it’s from the year 400 or 800 or whatever. But, sure, if you want to argue that Jesus was married, you can argue it now as much as you could argue it in the year 400. The reason why the document, if real, would’ve been relevant to Biblical scholarship is that it would inform claims about what was being debated about Christ in the centuries after his death.

Remember that Keynes quote about the stock market as a beauty contest where the goal is to predict the face that other contestants think is most beautiful? Similarly, this sort of biblical scholarship is studying not what happened in Jesus’s time but, rather, what people 200 years later were saying happened around 0 A.D.

Here’s another quote, this time from Roger Bagnall, one of the scholars who was fooled by the forged document:

It’s hard to construct a scenario that is at all plausible in which somebody fakes something like this.

This reminds me of the findings from cognitive psychology that we evaluate hypotheses by their “availability.” Remember Linda the bank teller? The funny thing is, people fake documents all the time. They faked the Hitler diaries! So it’s kind of weird that he said this wasn’t plausible. I guess this was his way of saying he didn’t want to think hard about it.

Oh, and I like this line from Sabar after a lab test provided evidence of forgery:

There were no press releases from Harvard Divinity School this time.

Is this a cheap shot? I don’t think so. Especially given that, even now, years after the fraud was publicly exposed, Harvard Divinity School continues to host this page, “Gospel of Jesus’s Wife,” which lists three early “Scientific Reports” that purport to support the authenticity of the document, but none of the definitive follow-ups. This is flat-out poor scholarship, especially given that James Yardley, the Columbia professor who did one of the studies, explicitly told Sabar that his earlier report was “never intended to be a proper scientific presentation of the results.” That Harvard webpage also deadpans it with a “transcription” of the document without any indication that it is a fake or any crediting of Mike Grondin’s interlinear translation of the Gospel of Thomas, which is where Walter Fritz had stolen this from. No need to credit Fritz, perhaps, but they should definitely credit Grondin.

What’s up with you, Harvard? Presenting a fake document as real and not crediting the source? That’s not cool. Not at all. Give your sources. Always.

Ummm, ok, yeah, here it is:

Members of the Harvard College community commit themselves to producing academic work of integrity – that is, work that adheres to the scholarly and intellectual standards of accurate attribution of sources, appropriate collection and use of data, and transparent acknowledgement of the contribution of others to our ideas, discoveries, interpretations, and conclusions. Cheating on exams or problem sets, plagiarizing or misrepresenting the ideas or language of someone else as one’s own, falsifying data, or any other instance of academic dishonesty violates the standards of our community, as well as the standards of the wider world of learning and affairs.

Maybe Harvard could set up a single convenient one-stop website for all its false claims. There’s the Gospel of Jesus’s Wife, Surgisphere, that thing about the replication rate in psychology being statistically indistinguishable from 100%, the monkey tapes, etc. The idea is they could put all these in one place, so we’d know that everything else coming out of the Ivy institution could be trusted.

Here’s Sabar’s summary:

The story came first; the data managed after. The narrative before the evidence; the news conference before the scientific analysis; the interpretation before the authentication.

And then once a claim gets out there, people work hard to prop it up.

Elsewhere, Sabar talks about some academic politics regarding Harvard’s Divinity School. In 2006 a university committee proposed a new program:

“Reason and Faith is a category unlike any that Harvard has included in its general education curriculum,” the task force wrote. The classes would treat religion academically, covering topics like church versus state, the history of religion, gender and worship, the Vatican as an institution . . .

This proposal was slammed by psychology professor Steven Pinker, who wrote, “universities are about reason pure and simple . . . Faith—believing something without good reasons to do so–has no place in anything but a religious institution, and our society has no shortage of these. Imagine if we had a requirement for ‘Astronomy and Astrology’ or ‘Psychology and Parapsychology.'”

I wasn’t there at Harvard for the conversation, and if I were a student I think I’d be annoyed if they tried to require me to take a religion course, in the same way that I’d be annoyed if they tried to require me to take an astronomy course or a psychology course—but I feel like something’s off in the discussion of that religion program, something off in the way it was discussed by its proponents and its opponents. On the “pro” side, there’s the claim that there’s something wonderful and unique about this program—but how is it different from the teaching of literature? If you take a class on Shakespeare, you learn about the history of the plays and about their content, but you’re not required to believe that the story he was telling about Richard III was real. If you take a class on Tolkien, you’ll learn about Beowulf and all sorts of things, but you don’t have to believe in orcs. So, yeah, have the religion program, but I don’t see how it’s so damn special. On the other side, the opponents seem a bit extreme. Universities are not all about reason. You can take art and music at Harvard. You can take a poetry class. Sure, these classes involve reason, but they’re not “about reason pure and simple.” So I don’t agree with Pinker on that one, indeed I can’t see how he ever could’ve believed such a claim.

P.S. Sabar isn’t perfect, though. I noticed this line:

Peer reviewers are academia’s highway patrol—the officers who pull over speeders before they hurt themselves and others.

I don’t think so!

Adam Marcus of Retraction Watch tells it:

Coptic cop-out? Religion journal won’t pull paper based on bogus ‘gospel’

What the Harvard Theological Review giveth, it evidently will not taketh away.

The venerable publication about religious matters is refusing to retract a 2014 article by a noted scholar of early Christianity despite evidence that the article — about Jesus’s wife — was based on a forgery. . . .

However, the journal issued a statement about the article, a cop-out of — bear with us — Biblical proportions:

Harvard Theological Review has scrupulously and consistently avoided committing itself on the issue of the authenticity of the papyrus fragment. HTR is a peer-reviewed journal. Acceptance of an essay for publication means that it has successfully passed through the review process. It does not mean that the journal agrees with the claims of the paper. . . . Given that HTR has never endorsed a position on the issue, it has no need to issue a response.

Good to know they’ve “never endorsed a position on the issue.” We wouldn’t want them calling a forgery a forgery. That would just be rude.

For the straight story, you’ll want to read this article by Leo Depuydt written in 2012 and published in the Harvard Theological Review in 2014. Depuydt’s article, refreshingly, begins:

The following analysis submits that it is out of the question that the so-called Gospel of Jesus’s Wife, also known as the Wife of Jesus Fragment, is an authentic source. The author of this analysis has not the slightest doubt that the document is a forgery, and not a very good one at that.

Too bad the editor of the journal can’t write so clearly. On the plus side, at least they’re not personally attacking their critics. So let’s appreciate that they are showing some restraint.

P.P.S. Alper points to this review of Sabar’s book by Biblical scholar Tony Burke. I agree with Alper that it’s interesting to see Burke’s take on the story from the inside. But there’s one aspect of Burke’s post that I don’t like, which is its defensiveness. I should think Burke would be furious at the Harvard professors etc. who went all-in on this con, thus making his field into a laughingstock, but instead he seems to be bending over backward criticizing the messenger. Sure, I can see how people in this field would be happiest if the fraud were just quietly laid to rest. But, hey, Harvard Divinity School had no problem with positive press accounts. And the idea of criticizing a popular book for being too vividly written, and criticizing an investigative reporter for having “worked way too long and too hard on this story” . . . hey, that’s what investigative reporters are supposed to do! It’s the Javert paradox all over again.

Also, I followed the link to the article by Leo Depuydt that Burke referred to as an “ad hominem attack.” Depuydt’s article is not an ad hominem argument (or “attack”; I guess that’s what scholars call it when you disagree with them) at all! It’s entirely focused on technical details. I really really don’t like when people call an article an “ad hominem attack” when what they really mean is that it’s (a) a substantive argument that they happen to disagree with and (b) doesn’t show exaggerated deference to a person who got things wrong. King did not behave well in that situation, Depuydt had every right to be annoyed, and, even if he didn’t have such a right, it’s not an ad hominem attack for him to detail exactly why the argument he’s addressing is ridiculous. Some academics seem so used to deference that they perceive any disagreement that is not swaddled in praise to be an ad hominem attack. One could argue that Depuydt in his article is impolite. But impolite is not the same as ad hominem. The use of the term “ad hominem” implies there is a logical fallacy in the argument. It’s a convenient trick to use if you think that people aren’t going to click through and read the original article. In this case, though, I expect Burke didn’t think it through, and that he just thinks that lack of deference is itself an ad hominem attack. So frustrating. Again, it’s his Harvard colleague who got conned and then stayed with it for way too long. Don’t blame the reporter for the embarrassment, and for Christ’s stake don’t blame an outside scholar who was justifiably annoyed at fraud being promoted by a leading academic institution. When scholar A makes an argument and scholar B impolitely pokes holes in it, that’s not an ad hominem attack, that’s scholarship.

This seems like a case of the Stockholm syndrome, or the shoot-the-messenger syndrome. Some people in Burke’s field get conned and then try their darnedest to look away from the abundant evidence of forgery—but Burke is more annoyed by the people who called it right from the start! That makes no sense to me. I mean, sure, it makes some psychological sense, but it doesn’t make scholarly sense.

P.P.P.S. Completely unrelatedly and much more consequentially, there’s this story from Shane Bauer:

Five months before Monterrosa was killed, the Vallejo [California] Police Officers’ Association had replaced its president, Detective Mat Mustard, who had run the union for ten years. Mustard was notorious in Vallejo for the investigation he led into the kidnapping of a woman named Denise Huskins, in 2015. Someone broke into the house where she and her boyfriend were sleeping, blindfolded and drugged them, and put her in the trunk of a car. When the boyfriend reported the crime, Mustard suspected that he had killed Huskins and invented the kidnapping story. At the police station, the boyfriend said, officers dressed him in jail clothes, then Mustard and others interrogated him for eighteen hours, calling him a murderer. Huskins, who was being held a hundred and sixty miles away, was raped repeatedly. After she was released, the Vallejo police publicly accused her and her boyfriend of faking the kidnapping, comparing the situation to the movie “Gone Girl.” The police threatened to press charges against the couple, and after the rapist e-mailed the San Francisco Chronicle, confessing to the kidnapping, the police accused Huskins and her boyfriend of writing the e-mail. Soon, the rapist was arrested in South Lake Tahoe, after trying to repeat the crime. Even then, the Vallejo police insisted that Huskins and her boyfriend were lying. The couple sued Mustard and the city, eventually winning a $2.5-million settlement. In a show of defiance, the police department named Mustard officer of the year.

I guess a book might be coming out about this one too. Authority figures do something wrong, don’t back down, then reward the perpetrators: that’s a tale as old as time.

Statistical Modeling, Causal Inference, and Social Science gets results!

A few months ago, we posted this job ad from Des McGowan:

We are looking to hire multiple full time analysts/senior analysts to join the Baseball Analytics department at the New York Mets. The roles will involve building, testing, and presenting statistical models that inform decision-making in all facets of Baseball Operations. These positions require a strong background in complex statistics and data analytics, as well as the ability to communicate statistical model details and findings to both a technical and non-technical audience. Prior experience in or knowledge of baseball is not required.

Interested applicants should apply at this link and are welcome to reach out to me (dmcgowan@nymets.com) if they have any questions about the role.

More recently, McGowan informed me that one of the people they ultimately hired applied because he saw it here on this blog.

Cool!

When can a predictive model improve by anticipating behavioral reactions to its predictions?

This is Jessica. Most of my research involves data interfaces in some way or another, and recently I’ve felt pulled toward asking more theoretical questions about what effects interfaces can or should have in different settings. For instance, the title of the post is one question I’ve started thinking about: In situations where a statistical model is used to predict human behavior, and people have access to the predictions, under what conditions can we expect the model to perform better when it explicitly tries to estimate how people will react to the displayed predictions? By explicitly, I mean that estimating behavioral reactions to the display is part of the model specification.

The answer depends of course on how one formalizes it (how you define a data generating process, restrict the space of possible models to be explored, define the strategies and payoffs of agents, etc.). But I think it’s an interesting thought exercise.

When might we want to ask such a question? One situation that comes to mind is election forecasting, where you have many people looking at predictions of election outcomes (vote share per candidate for instance) created by news agencies or pollsters etc. Sometimes there are concerns that people will “best respond” to the predictions in ways that change the outcome of the election. For example, presidential election forecasting involves predicting both vote choice and turnout, where turnout might be affected by perceived closeness of the race, i.e., the less close the election, the less mobilized some voters. The effect of the display on behavior here might be thought of as somewhat accidental; people want to know the future, perhaps in part to decide how to act, but not necessarily. Hence there’s a demand for forecasts, but the fact that their availability might change the election outcome in any significant way is perceived as a nuisance or risk. A beneficent forecaster might care about possible behavioral reactions to the display because they would like for their forecast to reduce the amount of regret that eligible voters feel post election over their choice of whether to vote and what candidate to vote for.

There are many other situations where interfaces serve up predictions with the goal of directly informing behavior, e.g., recommender systems. For instance, apps for driving directions like Google Maps predict current travel times along different routes, and the developer might naturally want the predictions to achieve some aggregate goal, like less congested traffic. 

Recently, I did some work with Paula Kayongo, a Ph.D. student I work with, and Jason Hartline, a game theorist, which is partially what got me thinking about this question. We came up with the notion of a visualization equilibrium: the visualization of predicted behavior such that when you show people that visualization, you observe that behavior. We used a congestion-game setup to test whether visualizing a Nash equilibrium for the game leads to that outcome being realized. Not surprisingly, it doesn’t. People have various strategies they use to react to the display, such that you can see a different outcome than what was predicted.

In our work so far, the visualization is simply a reflection of past plays of the game, which can be thought of as a simple form of model prediction. But this is less realistic than a setting where the display presents the predictions of some statistical model which might be informed by past behavior but not identical to it. Often, if we think the format of the interface has an effect on decisions people make from it, we might do some “offline” experiments to try to find the version that leads to the least bias and just use that. But if people are reacting to the content of the prediction, it might be worth trying to learn those dynamics as part of the model. So I started wondering, if you have a model that predicts behavior, and you expect people might try to best respond to the visualized predictions in ways that can change the outcome, under what circumstances should you try to anticipate (model) these dynamics directly? 

At a high level, we can think of the display effect as a mapping from a visualization of some predicted outcome to a realized outcome. We can think of a predictive model that anticipates reactions to its own predictions as one that tries to estimate this function.
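As a toy illustration of that framing (entirely my own sketch, not the model or experiment from our congestion-game work), suppose the displayed prediction is a single number p, the fraction of people expected to take some action, and the realized fraction is a behavioral response f(p). A visualization equilibrium in this simplified world is a fixed point p* = f(p*), which we can look for by iterating the display-effect map, assuming the map is a contraction so the iteration converges:

```python
def realized_fraction(p_displayed):
    # Hypothetical display-effect map: the more congested a route is predicted
    # to be, the fewer people take it, plus a small baseline attraction
    return 0.8 * (1 - p_displayed) + 0.1

p = 0.2                       # initial displayed prediction
for step in range(25):
    p = realized_fraction(p)  # show p, observe realized behavior, redisplay

print(f"approximate fixed point (visualization equilibrium): {p:.3f}")
# Analytic check for this toy map: p* solves p = 0.9 - 0.8*p, so p* = 0.5
```

A model that anticipates reactions is, in effect, trying to learn something like realized_fraction rather than treating displayed and realized behavior as the same thing.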

There are a bunch of parameters to define to pose the question more rigorously. If we assume we have a model that makes predictions about behavior in some payoff-relevant state, where the behavior of others impacts the utility of different possible actions a person faces, then we should consider:

  • What are the parameters of the “game”, including the action set available to agents, payoff functions, assumptions about agents (e.g., economic rationality), etc.?
  • Are all states payoff-relevant states, or just some/one of them? (e.g., a political election is a one-shot scenario, but using Google Maps on a trip is not).
  • When are predictions available to the agents? Continuously or at certain time points?  
  • What’s the relationship between those who see the display and those who will act in the payoff-relevant state? Is it the same group, or is the former a subset? 
  • How is the space of possible models constrained? What’s the functional form used to estimate the display effect? What form do the data inputs available to the model at any given point take?
  • Where does the state(s) for which the question is being asked occur in a process of best response dynamics? In other words, how far is the system from equilibrium? 

To answer the question requires constraining it by deciding these parameters, but I find the idea of formalizing it and then working out the answer compelling as a thought exercise. Maybe we’ll get more insight into real-world interface dynamics. Even without deciding on a specific case to study, we can make conjectures like: modeling display effects is unlikely to be helpful when a system is in equilibrium, or when those who view the predictions make up only a small portion of those who act in the payoff-relevant state. It also seems related to the martingale property, since if it’s a martingale, your forecast at any time point should be an unbiased predictor of the forecast at any later time. If you expect a lot of movement in the predictions in the future, your uncertainty must be high, hence you won’t be “surprised” by some reaction to the display. There’s lots more that could be said, but for now I’ll just pose the high-level question.

Nicky Guerreiro and Ethan Simon write a Veronica Geng-level humor piece

I don’t usually go around recommending amusing things that are completely off topic to the blog, but this piece by Nicky Guerreiro and Ethan Simon was just too funny. It’s Veronica Geng-level quality, and I don’t say that lightly. As with Geng’s articles, you can laugh and be horrified at the same time.

The story combines two of my interests: Warner Brothers cartoons and ridiculous political stunts.
