
Plan for the creation of “a network of new scientific institutes pursuing basic research while not being dependent on universities, the NIH, and the rest of traditional academia and, importantly, not being dominated culturally by academia”

Alexey Guzey is a recent college graduate from Moscow whom we heard about in connection with the Why We Sleep saga. He wrote a post a couple of years ago called How Life Sciences Actually Work, and at some point after that he decided to create a new organization to facilitate research outside academia.

Here’s his pitch:

New Science aims to build new institutions of basic science, starting with the life sciences. Over the next several decades, New Science will create a network of new scientific institutes pursuing basic research while not being dependent on universities, the NIH, and the rest of traditional academia and, importantly, not being dominated culturally by academia. . . .

In the summer of 2022, New Science will run an in-person research fellowship in Boston for young life scientists, during which they will collect preliminary data for an ambitious idea of theirs. This is inspired by Cold Spring Harbor Laboratory, which started as a place where leading molecular biologists came for the summer to hang out and work on random projects together . . . The plan is to gradually increase the scope of projects and the number of people funded by New Science, eventually reaching the point where there are entire labs operating outside of traditional academia and then an entire network of new scientific institutes.

I don’t really know anything about these plans, but I like Guzey’s investigation of the sleep book, and his general goals seem reasonable so I agreed to be on the board of advisers. We’ll schedule a post in a few decades—say, 14 May 2061—to see how it worked out.

Tableau and the Grammar of Graphics

The first edition of Lee Wilkinson’s book The Grammar of Graphics came out in 1999. Whether or not you’ve heard of the book, if you’re an R user you’ve almost certainly heard about the concept indirectly, because . . . you know ggplot2? What do you think the “gg” in ggplot2 stands for? That’s right!

Then in 2002 Chris Stolte, Diane Tang, and Pat Hanrahan of Stanford University published an article, Polaris: A System for Query, Analysis, and Visualization of Multidimensional Relational Databases, where they cite The Grammar of Graphics:

Wilkinson [41] recently developed a comprehensive language for describing traditional statistical graphics and proposed a simple interface for generating a subset of the specifications expressible within his language. We have extended Wilkinson’s ideas to develop a specification that can be directly mapped to an interactive interface and that is tightly integrated with the relational data model. . . .

The primary distinctions between Wilkinson’s system and ours arise because of differences in the data models. We chose to focus on developing a tool for multidimensional relational databases . . . The differences in design are most apparent in the table algebra . . .

Shortly afterward, this work was developed into Tableau:

In 2003 Tableau spun out of Stanford University with VizQL™, a technology that completely changes working with data by allowing simple drag and drop functions to create sophisticated visualizations. The fundamental innovation is a patented query language that translates your actions into a database query and then expresses the response graphically.

Both ggplot2 and Tableau have become very successful, the first as open-source software and the second as a commercial product. The documentation for ggplot2 (as well as its name) very clearly cites The Grammar of Graphics. It would be good if Tableau did this also.
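To make the “grammar” concrete for anyone who hasn’t used it: a plot is specified as data plus aesthetic mappings plus layers of geometries, rather than as a named chart type. Here’s a minimal sketch using plotnine, a Python port of ggplot2 that follows the same grammar (the data frame is made up for illustration):

```python
# A minimal grammar-of-graphics sketch using plotnine, a Python port of ggplot2.
# The data frame is invented for illustration.
import pandas as pd
from plotnine import ggplot, aes, geom_point, geom_smooth, facet_wrap

df = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180, 185, 190],
    "weight": [50, 55, 60, 65, 70, 78, 84, 90],
    "sex":    ["f", "f", "f", "f", "m", "m", "m", "m"],
})

# The specification reads as: data + aesthetic mappings + geometric layers + facets.
p = (
    ggplot(df, aes(x="height", y="weight", color="sex"))  # map columns to aesthetics
    + geom_point()                                         # layer 1: points
    + geom_smooth(method="lm")                             # layer 2: linear fit
    + facet_wrap("~sex")                                   # small multiples by sex
)
p.draw()  # returns a matplotlib figure; print(p) also renders it
```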

What did ML researchers talk about in their broader impacts statements?

This is Jessica. A few months back I became fascinated with the NeurIPS broader impact statement “experiment” where NeurIPS organizers asked all authors to in some way address the broader societal implications of their work. It’s an interesting exercise in requiring researchers to make predictions under uncertainty about societal factors they might not be used to thinking about, and to do so in spite of their incentives as researchers to frame their work in a positive light. You can read some of them here if you’re curious.  

Recently I collaborated on an analysis of a sample of these statements with Priyanka Nanayakkara, who led the work and will present it next week at a conference on AI ethics, and Nick Diakopoulos. The analysis reports on the themes and sources of variation that were apparent in a random sample of broader impacts statements from 2020 NeurIPS papers. This was a qualitative analysis so there’s some room for interpretation, but the goal was to try to learn something about how ML researchers grappled with a vague prompt that they should address broader implications of their work, including what sorts of concerns were given priority.

Some observations I find interesting: 

  • There was a split in the statements between implications that were mostly technically oriented (discussing criteria computer scientists commonly optimize for, like robustness to perturbations in the input data or parameterization of a model) and those that tended to be more society-facing. Non-trivial proportions of the sampled statements mentioned common desiderata for learning algorithms like efficiency (~30%), generalizability (~15%), and robustness and reliability (~20%). Some of these mapped from general properties to impacts on society (e.g. efficiency leads to saved time and money, a more robust model might positively impact safety, etc.). It’s not really clear to me how much information is added to the paper in such cases, since presumably these properties become clear if you understand the contribution. 
  • Authors of theoretical work varied in whether they considered the exercise relevant to them. Roughly equal proportions of the statements we looked at implied either that, due to the theoretical nature of the work, there are no foreseeable consequences, or that, despite the theoretical nature of the work, there may be societal consequences. 
  • Some authors were up front about sources of uncertainty they perceived in trying to identify downstream consequences of their work, noting that any societal impacts depend on the specific application of the model or method, stressing that their results are contingent on assumptions they made, noting that human error or malicious intent could throw a wrench into outcomes, or directly stating that their statements about impacts are speculation. Others however spoke more confidently and didn’t hedge their speculation. 
  • Since the prompt was very open-ended, authors were left to choose what time frame the impacts they discussed were expected to occur in. Only about 10% of the sampled statements mentioned a time frame at all, and those were in broad terms (e.g., either “long term” or “immediate”). So it was sometimes ambiguous whether authors were thinking about impacts immediately after deployment, or impacts after, for instance, users of the predictions have had time to adjust their behavior. 
  • More than half of the statements described who would be impacted, though sometimes in very broad terms, like “the healthcare domain,” “creative industries” (all those generative models!) or other ML researchers. 
  • Recommendations were implicit in many of the statements, and more than half of the statements implied who was responsible for future actions. It was often implied to be other ML researchers doing follow-up work, but regulation and policy were sometimes mentioned. 
  • When it came to themes in the types of socially relevant impacts discussed, common concerns included impacts on bias/fairness (~24%), privacy (~20%), and the environment (~10%). In a lot of ways the statements echoed common points of discussion in conversations about possible ML harms. 

Overall, my impression from reading some of the statements is that there was some “ethical deliberation” occurring, and that most authors took the exercise in good faith. I was pleasantly surprised in reading some of them. For example, some of the statements seemed quite useful from the standpoint of providing the reader of a very technical paper with some context on what types of problems might occur, and more generally, with more context on how the work relates to real world applications. I don’t think this latter part was necessarily the goal, but it makes me think high level statements are preferable to asking authors to be specific: a high level statement can provide more of a “big picture” view than the reader might get from the paper alone, where space is often precious, without the statements reading like conjecture presented as knowledge. In terms of making ML papers more accessible to non-ML researchers, the statements may have some value. 

I had found the instructions for the statements unsatisfying because it wasn’t clear how the organizers viewed the goal, so evaluating the success of the exercise seemed impossible. Though now that I’ve read some, if their goal was simply to encourage reflection on the implications of tech, it seems to have been successful. Similarly, if the organizers wanted to amplify the types of concerns that are being discussed in AI/ML these days, like bias/fairness, environmental impacts of training large models, etc., it also seems they were successful. 

If there were or are loftier goals going forward with exercises like this, like helping readers of the paper recognize what to watch out for if they implement a technique, or helping the researchers avoid putting a lot of effort into some “dangerous” technology, then more guidance for the authors on how to translate technical implications to societal ones (at least in non-obvious ways) is needed, along with more information on how detailed they are intended to be in expressing what they see as implications. For example, the statements tended to be fairly short, a few paragraphs at most, and so when there wasn’t a clear set of issues to talk about related to the domain, authors sometimes mentioned more general types of outcomes like bias without it being completely clear what they meant. 

And of course there’s the question of whether ML researchers are in the right position to offer up useful information about longer term societal implications of technology, both in terms of their training and incentives. Not surprisingly, some of the statements read a bit like the “Limitations” sections that sometimes appear in papers, where often for every limitation mentioned there’s some rationalization that it’s actually not that big a problem. I expect the organizers were aware of these doubts and questions, so it seems like the exercise was either meant to signal that they want the community to take ethical considerations more seriously, or it was a kind of open-ended experiment. Probably some of both.   

In retrospect, maybe a good way for the organizers to introduce this exercise would have been to write a broader impact statement when they provided instructions! Now that I’ve read some, here’s my attempt: 

This call asks authors to write a broader impacts statement about their ML research. The outcomes of asking ML researchers to write broader impacts statements are not well known, but we believe that this process is likely to prompt authors to think more deeply about societal consequences of their contributions than they would otherwise, and that this might lead them to wield deep learning more responsibly so we are all less likely to encounter reputation-compromising deep fakes of ourselves in the future. We also expect this exercise to further popularize popular societal concerns that arise in discussion of machine learning ethics. Perhaps writing broader impacts statements will also help novice readers understand how dense tedious technical papers about small improvements to transformer architectures relate to the real world, helping democratize machine learning research.

However, there are also risks to requesting such statements. For example, there is some risk that authors will rationalize any consequences that they perceive, making them feel like they have seriously considered ethical concerns and there are none worth worrying about, even when they had only spent a few minutes writing something right before the deadline when they were clearly mentally compromised by the preceding all-nighters. There is also a risk that researchers will list potential impacts on society that, if a careful analysis were done by ethicists or other social scientists, would be deemed not very likely, but readers will not be able to ascertain this, and will worry about and go on to write future broader impacts statements about impacts that are actually irrelevant. Finally, there is some risk that authors will feel like they are being evaluated based on an exercise for which there is no rubric and it is hard to imagine what one would look like, which as we know from our teaching experience makes many people uncomfortable and leads to bad reviews. Future organizers of NeurIPS should consider how to minimize these risks.

When are Bayesian model probabilities overconfident?

Oscar Oelrich, Shutong Ding, Måns Magnusson, Aki Vehtari, and Mattias Villani write:

Bayesian model comparison is often based on the posterior distribution over the set of compared models. This distribution is often observed to concentrate on a single model even when other measures of model fit or forecasting ability indicate no strong preference. Furthermore, a moderate change in the data sample can easily shift the posterior model probabilities to concentrate on another model. We document overconfidence in two high-profile applications in economics and neuroscience. To shed more light on the sources of overconfidence we derive the sampling variance of the Bayes factor in univariate and multivariate linear regression. The results show that overconfidence is likely to happen when i) the compared models give very different approximations of the data-generating process, ii) the models are very flexible with large degrees of freedom that are not shared between the models, and iii) the models underestimate the true variability in the data.

Stacking is more stable and I think makes more sense. One of the problems with Bayes factors is that people have an erroneous attitude that they’re the pure or correct thing to do. See our 1995 paper for some discussion of that point.

I think the above Oelrich et al. paper is valuable in making a new point—that the procedure of selecting a model using Bayes factors can be very noisy. This is similar to the problem of selecting effects in a fitted model by looking for the largest estimate or the smallest p-value: with just one dataset it is easy to take your inference too seriously. Bootstrapping or equivalent analytical work can be helpful in understanding this variation.
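To get a feel for this noisiness, here’s a small simulation sketch of my own (not from the Oelrich et al. paper): it compares two non-nested regressions on bootstrap resamples of a single dataset, with the marginal likelihoods crudely approximated via BIC, and looks at how much the implied posterior model probability bounces around.

```python
# Sketch: instability of Bayes-factor-based model probabilities across bootstrap
# resamples of a single dataset. Uses the rough BIC approximation to the marginal
# likelihood; this is an illustration, not the analysis from Oelrich et al.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)   # correlated competing predictor
y = 0.3 * x1 + 0.3 * x2 + rng.normal(size=n)    # truth uses both, weakly

def bic(y, X):
    """BIC of an ordinary least squares fit with Gaussian errors."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1  # regression coefficients plus error variance
    return -2 * loglik + k * np.log(len(y))

def post_prob_model1(y, x1, x2):
    """P(model 1 | data) under equal prior probabilities, via exp(-BIC/2) weights."""
    b1, b2 = bic(y, x1), bic(y, x2)
    w = np.exp(-0.5 * (np.array([b1, b2]) - min(b1, b2)))
    return w[0] / w.sum()

probs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)            # bootstrap resample
    probs.append(post_prob_model1(y[idx], x1[idx], x2[idx]))
probs = np.array(probs)

print("P(M1|y) on the original data:", round(post_prob_model1(y, x1, x2), 3))
print("bootstrap spread (5%, 50%, 95%):", np.round(np.percentile(probs, [5, 50, 95]), 3))
print("share of resamples that 'select' M1:", (probs > 0.5).mean())
```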

Estimating excess mortality in rural Bangladesh from surveys and MRP

(This post is by Yuling, not by/reviewed by Andrew)

Recently I (Yuling) have contributed to a public health project with many great collaborators. The goal is to understand excess mortality potentially related to Covid-19. Before the recent case surge in South Asia, we had seen stories claiming that the pandemic might have hit some low-income countries less heavily than the US or Europe, but many low-income countries also lack reliable official statistics, so the confirmed Covid-19 death counts might be questionable. To figure out how mortality changed in rural Bangladesh in 2020, my collaborators conducted repeated phone surveys in 135 villages near Dhaka and collected death counts from Jan 2019 to Oct 2020 (the survey questions include “Was member X in your household deceased during some time range Y? When? What reason?”).

The findings

The statistical analysis appears to be straightforward: compute the mortality rate in 2019 and 2020 in the sample, subtract to obtain their difference, run a t-test, done: you don’t need a statistician. Well, not really. Some deaths were reported by multiple families, which required preprocessing and manual inspection; the survey population changed dynamically over time because of possible migration, newborns, deaths, and survey drop-out, so we need some survival analysis; the survey has a large sample size, but we also want to stratify across demographic features such as age, gender, and education; and the repeated survey is not random, so we need to worry about selection bias from non-response and drop-out.

It is interesting to compare these data to the birthday model with respect to the day-of-month effect. We used to make fun of meaningless heatmap comparisons of day-by-month birth dates. But this time the day-of-month effect is real: participants were more likely to report death dates of deceased family members in the first half of the month as well as on certain round-number days. Presumably I could model these “rounding errors” by a Gaussian process regression too, but I did not bother and simply aggregated death counts into months.

In the actual statistical modeling, we stratify the observed data into cells defined by the interaction of age, gender, month, and education, and fit a multilevel logistic model: education is not likely to have a big impact on mortality rate, but we would like to include it to adjust for non-response and drop-out bias. We poststratify the inferred mortality rate to the population and make comparisons between 2019 and 2020. The following graph shows the baseline age and month “effects”—not really causal effects, though.

All coefficients are on the scale of monthly-death log odds ratios. For interpretation, we cannot apply the “divide-by-four” rule, as the monthly death rate is low. But because of this low rate, the log odds is close to the log probability, so we can directly exponentiate a small coefficient and treat that as a multiplicative factor, which is further approximated by exp(a) − 1 ≈ a. Hence a logistic coefficient of 0.1 is approximately a 10% multiplicative increase in monthly death probability (it is like a “divide-by-one” rule, except that the genuine divide-by-four rule is on the additive probability scale).
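As a quick numerical check of that approximation (my own sketch; the baseline rate below is made up, not a number from our analysis):

```python
# Quick check: for rare events, a logistic coefficient a acts multiplicatively,
# with exp(a) - 1 ≈ a; the divide-by-four rule on the additive scale is useless here.
import numpy as np

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

baseline_p = 0.0005          # hypothetical monthly death probability
a = 0.1                      # logistic regression coefficient

new_p = inv_logit(np.log(baseline_p / (1 - baseline_p)) + a)
print("multiplicative change:", new_p / baseline_p)   # ~1.105, i.e. about a 10% increase
print("exp(a):", np.exp(a))                           # ~1.105, essentially the same
print("exp(a) - 1 vs a:", np.exp(a) - 1, a)           # ~0.105 vs 0.1
print("divide-by-four bound (additive):", a / 4)      # 0.025 on the probability scale,
                                                      # far too coarse when p ~ 0.0005
```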

One extra modeling piece is that we want to understand the excess mortality in finer granularity. So I also model this extra risk after Feb 2020 by another multilevel regression across age, gender and education. The following graph shows the excess mortality rate by age in females and from low-education families:

This type of modeling is natural, even the default, for anyone familiar with this blog. Yet what I want to discuss are two additional challenges in communication.

How do we communicate excess mortality?

The figure above shows the excess mortality in terms of log odds, which in this context is close to log probability. We may want to aggregate this log odds into a monthly death probability. That is done by the poststratification step: we simulate the before and after monthly death probabilities for all cells, and aggregate them by 2019 census weights (in accordance with age, gender, and education) in the same area. Below is the resulting excess mortality, where the unit is death count per month:

We actually see a decline in mortality starting from Feb 2020, especially at high ages (80 and above), likely due to extra caution or family companionship. Because of their high baseline values, this high age group largely dominates the overall mortality comparison. Indeed we estimate a negative overall excess mortality in this area since Feb 2020. On the other hand, if we look at the average log odds change, the estimate is less constrained and largely overlaps with zero.

We report both results in a transparent way. You can play with our Stan code and data too. But my point here is more about how we communicate “excess mortality”. Think about an artificial made-up example: say there are three age groups, children, adults (age < 59.5), and elderly adults (age > 59.5). Suppose they have baseline mortality rates of (0.5, 5, 30) per year per thousand people, and also assume the census proportions of these three age groups are (20%, 60%, 20%). If their mortality changes by -10%, -10%, and 20%, the average percentage change is -4%. But the absolute mortality changes from (0.5, 5, 30) · (20%, 60%, 20%) = 9.1 to 9.99, which is 0.89 excess deaths per year per thousand people, or a 0.89/9.1 ≈ 10% increase. Is it a 10% increase, or a 4% decrease? That all depends on your interpretation. I don’t want to call it a Simpson’s paradox. Here both numbers are meaningful on their own: the average excess log mortality measures the average individual risk change; the average excess mortality is related to death tolls. I think the WHO and CDC report the latter number, but it also has the limitation of being driven by only one subgroup—we have discussed similar math problems in the context of simulated tempering and thermodynamics, and a similar story in age adjustment over time.
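Here is the arithmetic of that made-up example spelled out in a few lines, using the same numbers as above:

```python
# The made-up three-group example from the text, spelled out.
import numpy as np

baseline = np.array([0.5, 5.0, 30.0])      # deaths per year per 1000, by age group
weights  = np.array([0.2, 0.6, 0.2])       # census proportions
change   = np.array([-0.10, -0.10, 0.20])  # relative change in each group's rate

avg_relative_change = (weights * change).sum()
print(avg_relative_change)                 # -0.04: a 4% average individual decrease

before = (weights * baseline).sum()
after  = (weights * baseline * (1 + change)).sum()
print(before, after)                       # 9.1 -> 9.99 deaths per year per 1000
print((after - before) / before)           # ~0.098: roughly a 10% increase in death toll
```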

The excess mortality depends on what comparison window you pick

Lastly, the question we would like to address from the evidence of “excess mortality” is essentially a causal question: what we care about is the difference in death tolls relative to the hypothetical in which Covid had not occurred. Except that this causal question is even harder: we do not know when Covid was likely to have become a significant factor in this area. In other words, the treatment label is unobserved.

Instead of doing change-point detection, which I think would lead to non-identification if I were to model individual treatment effects, I view the starting month of the excess-mortality comparison window as part of the causal assumption. I fit the model under varying assumptions and compute the excess mortality rate with starting months from Feb to Aug. The results are interesting: the overall excess mortality is negative when we compare 2020 and 2019, which is the result we report in the title. But if we choose to compare only later months, such as starting from July, there is a gradually positive excess mortality, especially for people in the age group 50–79. Well, it is not significant, and we would need to worry about multiple testing when we vary these assumptions, but I don’t think hypothesis testing is relevant here. Overall, this increasing trend is concerning, amid a more serious third wave recently in South Asia at large.

We also report the relative changes (there is always a degree of freedom in choosing between multiplicative and additive effects in causal inference).

This type of vary-the-causal-assumptions-as-much-as-you-can analysis is what we recommend in our workflow: when we have multiple plausible causal assumptions, report fits from all models (Fig. 22 of the workflow paper).

But this graph leads to further questions. Recall the birthday model: if there are fewer newborns on Halloween, then there have to be more on the days before or after. The same baby-has-to-go-somewhere logic applies here. Although the overall excess mortality was likely negative starting from Feb 2020, we infer that there was an increasing trend in mortality in the second half of 2020, especially for the age group 50–79. What we do not know for sure is whether this positive excess mortality was a real temporal effect (a signal of gradual Covid infections) or a compensation for fewer deaths in the first half of the year. Likewise, in many countries where there was a strong positive excess mortality in 2020, there could well be a mortality decline in the following short term. In the latter case, is this base-population change from the previous period (aka the “dry tinder” effect, or its opposite, wet tinder) part of the indirect effect, or mediation, of Covid-19? Or is it a confounder that we want to adjust for by fixing a constant population? The answer depends on how the “treatment” is defined, akin to the ambiguity of a “race effect” in social science studies.

2 reasons why the CDC and WHO were getting things wrong: (1) It takes so much more evidence to correct a mistaken claim than to establish it in the first place; (2) The implicit goal of much of the public health apparatus is to serve the health care delivery system.

Peter Dorman points to an op-ed by Zeynep Tufekci and writes:

This is a high profile piece in the NY Times on why the CDC and WHO have been so resistant to the evidence for aerosol transmission. What makes it relevant is the discussion of two interacting methodological tics, the minimization of Type I error stuff that excludes the accumulation of data that lie below some arbitrary cutoff and the biased application of this standard to work that challenges the received wisdom. It takes so much more evidence to correct a mistaken claim than to establish it in the first place.

I still suspect there is an additional factor: the implicit goal of much of the public health apparatus to serve the health care delivery system. One reason mask use was discouraged at the beginning of the pandemic was to protect the supply of N-95s to health care practitioners. Travel bans were opposed, since international travel and the movement of supplies were regarded as necessary for organizing and administering care, especially in developing countries. Recognition of aerosol transmission would have required the costly air filtration systems used in infectious disease wards to be installed throughout all hospitals and clinics (at least, as I understand it, under current protocols). This also helps explain the prominence given to hospitalization and ICU use as morbidity metrics, and the delayed recognition of long Covid as a health concern in its own right. If the primary constituency, and source of funding and personnel, for the public health apparatus is the medical system, it makes sense that, under conditions of uncertainty, the needs of medical practitioners would take precedence over those of the public at large. You can always justify it by saying we need the hospitals to be well stocked, doctors to travel freely, demand to be well below capacity and operations to not be bogged down by complicated protocols so the public can be better served. This is also relevant to the blog because there has been a sort of slipperiness about what outcomes constitute the costs and benefits that go into decision making under uncertainty.

The pandemic has been a spectacular laboratory for exploring the interconnections between science, the rules of evidence, risk communication, institutional incentives and political pressures. Some probing books about this will probably appear over the coming years.

Interesting points. I hadn’t thought of it that way, but it makes sense. I guess that similar things could be said about education, the criminal justice system, the transportation system, and lots of other parts of society that have dominant stakeholders. Even when these systems have had serious failures, we go through them when trying to implement improvements.

If a value is “less than 10%”, you can bet it’s not 0.1%. Usually.

This post is by Phil Price, not Andrew.

Many years ago I saw an ad for a running shoe (maybe it was Reebok?) that said something like “At the New York Marathon, three of the five fastest runners were wearing our shoes.” I’m sure I’m not the first or last person to have realized that there’s more information there than it seems at first. For one thing, you can be sure that one of those three runners finished fifth: otherwise the ad would have said “three of the four fastest.” Also, it seems almost certain that the two fastest runners were not wearing the shoes, and indeed it probably wasn’t 1-3 or 2-3 either: “The two fastest” and “two of the three fastest” both seem better than “three of the top five.” The principle here is that if you’re trying to make the result sound as impressive as possible, an unintended consequence is that you’re revealing the upper limit. Maybe Andrew can give this principle a clever name and add it to the lexicon. (If it isn’t already in there: I didn’t have the patience to read through them all. I’m a busy man!)

This came to mind recently because this usually-reliable principle has been violated in spectacular manner by the Centers for Disease Control (CDC), as pointed out in a New York Times article by David Leonhardt. The key quote from the CDC press conference is “DR. WALENSKY: … There’s increasing data that suggests that most of transmission is happening indoors rather than outdoors; less than 10 percent of documented transmission, in many studies, have occurred outdoors.”  Less than 10% . . . as Leonhardt points out, that is true but extremely misleading. Leonhardt says “That benchmark ‘seems to be a huge exaggeration,’ as Dr. Muge Cevik, a virologist at the University of St. Andrews, said. In truth, the share of transmission that has occurred outdoors seems to be below 1 percent and may be below 0.1 percent, multiple epidemiologists told me. The rare outdoor transmission that has happened almost all seems to have involved crowded places or close conversation.”

This doesn’t necessarily violate the Reebok principle because it’s not clear what the CDC was trying to achieve. With the running shoes, the ad was trying to make Reeboks seem as performance-boosting as possible, but what was the CDC trying to do? Once they decided to give a number that is almost completely divorced from the data, why not go all the way? They could say “less than 30% of the documented transmissions have occurred outdoors”, or “less than 50%”, or anything they want…it’s all true! 

Frank Sinatra (3) vs. Virginia Apgar; Julia Child advances

I happened to come across this one from a couple years ago and the whole thing made me laugh so hard that I thought I’d share again:

My favorite comment from yesterday came from Ethan, who picked up on the public TV/radio connection and rated our two candidate speakers on their fundraising abilities. Very appropriate for the university—I find myself spending more and more time raising money for Stan, myself. A few commenters picked up on Child’s military experience. I like the whole shark repellent thing, as it connects to the whole “shark attacks determine elections” story. Also, Jeff points out that “a Julia win would open at least the possibility of a Wilde-Child semifinal,” and Diana brings up the tantalizing possibility that Julia Grownup would show up. That would be cool. I looked up Julia Grownup and it turns out she was on Second City too!

As for today’s noontime matchup . . . What can I say? New Jersey’s an amazing place. Hoboken’s own Frank Sinatra is only the #3 seed of our entries from that state, and he’s pitted against Virginia Apgar, an unseeded Jerseyite. Who do you want to invite for our seminar: the Chairman of the Board, or a pioneering doctor who’s a familiar name to all parents of newborns?

Here’s an intriguing twist: I looked up Apgar on wikipedia and learned that she came from a musical family! Meanwhile, Frank Sinatra had friends who put a lot of people in the hospital. So lots of overlap here.

Sinatra advanced to the next round. As much as we’d have loved to see Apgar, we can’t have a seminar speaker who specializes in putting people to sleep. So Frank faced Julia in the second round. You’ll have to see here and here to see how that turned out. . . .

The Javert paradox rears its ugly head

The Javert paradox is, you will recall, the following: Suppose you find a problem with published work. If you just point it out once or twice, the authors of the work are likely to do nothing. But if you really pursue the problem, then you look like a Javert. I labeled the paradox a few years ago in an article entitled, “Can You Criticize Science (or Do Science) Without Looking Like an Obsessive? Maybe Not.”

This came up recently in an email from Chuck Jackson, who pointed to this news article that went like this:

Does ocean acidification alter fish behavior? Fraud allegations create a sea of doubt . . .

[Biologist Philip] Munday has co-authored more than 250 papers and drawn scores of aspiring scientists to Townsville, a mecca of marine biology on Australia’s northeastern coast. He is best known for pioneering work on the effects of the oceans’ changing chemistry on fish, part of it carried out with Danielle Dixson, a U.S. biologist who obtained her Ph.D. under Munday’s supervision in 2012 and has since become a successful lab head at the University of Delaware . . .

In 2009, Munday and Dixson began to publish evidence that ocean acidification—a knock-on effect of the rising carbon dioxide (CO2) level in Earth’s atmosphere—has a range of striking effects on fish behavior, such as making them bolder and steering them toward chemicals produced by their predators. As one journalist covering the research put it, “Ocean acidification can mess with a fish’s mind.” The findings, included in a 2014 report from the Intergovernmental Panel on Climate Change (IPCC), could ultimately have “profound consequences for marine diversity” and fisheries, Munday and Dixson warned.

But their work has come under attack. In January 2020, a group of seven young scientists, led by fish physiologist Timothy Clark of Deakin University in Geelong, Australia, published a Nature paper reporting that in a massive, 3-year study, they didn’t see these dramatic effects of acidification on fish behavior at all. . . .

Some scientists hailed it as a stellar example of research replication that cast doubt on extraordinary claims that should have received closer scrutiny from the start. “It is by far the best environmental science paper I have read for a long time,” declared ecotoxicologist John Sumpter of Brunel University London.

Others have criticized the paper as needlessly aggressive. Although Clark and his colleagues didn’t use science’s F-word, fabrication, they did say “methodological or analytical weaknesses” might have led to irreproducible results. And many in the research community knew the seven authors take a strong interest in sloppy science and fraud—they had blown the whistle on a 2016 Science paper by another former Ph.D. student of Munday’s that was subsequently deemed fraudulent and retracted—and felt the Nature paper hinted at malfeasance. . . .

What the hell? It’s now considered “needlessly aggressive” to bring up methodological or analytical weaknesses?

Have these “Others have criticized” people never seen a referee report?

I’m really bothered by this attitude that says that, before publication, a paper can be slammed every which way by anonymous reviewers. But then, once the paper has appeared and the authors are celebrities, all of a sudden it’s considered poor form to talk about its weaknesses.

The news article continues:

The seven [critics] were an “odd little bro-pocket” whose “whole point is to harm other scientists,” marine ecologist John Bruno of the University of North Carolina, Chapel Hill—who hasn’t collaborated with Dixson and Munday—tweeted in October 2020. “The cruelty is the driving force of the work.”

I have no idea what a “bro-pocket” is, and Google was no help here. The seven authors of the critical article appear to be four men and three women. I guess that makes it a “bro pocket”? If the authors had been four women and three men, maybe they would’ve been called a “coven of witches” or some other insult.

In any case, this seems like a classic Javert bind. Sure, the critics get bothered by research flaws: if they weren’t bothered, they wouldn’t have put in the effort to track down all the problems!

More from the news article:

Clark and three others in the group took another, far bigger step: They asked three funders that together spent millions on Dixson’s and Munday’s work—the Australian Research Council (ARC), the U.S. National Science Foundation (NSF), and the U.S. National Institutes of Health (NIH)—to investigate possible fraud in 22 papers. . . .

Munday calls the allegations of fraud “abhorrent” and “slanderous” . . . Dixson denies making up data as well. . . . But multiple scientists and data experts unconnected to the Clark group who reviewed the case at Science’s request flagged a host of problems in the two data sets, and one of them found what he says are serious irregularities in the data for additional papers co-authored by Munday.

Also this:

Dixson, in the February interview, said she did not know about the allegations. Although she denies making up data, “There hypothetically could be an error in there,” she said, perhaps because of mistakes in transcribing the data; “I don’t know. I’m human.” . . . Clark and colleagues also found problems in the data for the 2014 paper in Nature Climate Change, which showed fish behavior is altered near natural CO2 seeps off the coast of Papua New Guinea. (Munday was the first of five authors on the study, Dixson the third.) That data set also contained several blocks of identical measurements, although far fewer than in the Science paper. . . . Munday says Dixson has recently provided him with one original data sheet for the study, which shows she made a mistake transcribing the measurements into the Excel file, explaining the largest set of duplications. “This is a simple human error, not fraud,” he says. Many other data points are similar because the methodology could yield only a limited combination of numbers, he says. Munday says he has sent Nature Climate Change an author correction but says the mistake does not affect the paper’s conclusions.

Bad data but they do not affect the paper’s conclusions, huh? We’ve heard that one before. It kinda makes you wonder why they bother collecting data at all, given that the conclusions never seem to change.

And here’s someone we’ve heard from before:

[Nicholas] Brown . . . identified problems of a different nature in two more Munday papers that had not been flagged as suspicious by the Clark team and on which Dixson was not an author. At about 20 places in a very large data file for another 2014 paper in Nature Climate Change, the raw data do not add up to total scores that appear a few columns farther to the right. And in a 2016 paper in Conservation Physiology, fractions that together should add up to exactly one often do not; instead the sum varies from 0.15 to 1.8.

Munday concedes that both data sets have problems as well, which he says are due to their first authors hand copying data into the Excel files. He says the files will be corrected and both journals notified. But Brown says the anomalies strongly suggest fabrication. No sensible scientist would calculate results manually and then enter the raw data and the totals—thousands of numbers in one case—into a spreadsheet, he says.

To him, the problems identified in the data sets also cast suspicions on the “ludicrous effect sizes” in many of the 22 papers flagged by the whistleblowers. “Suppose you’re going to the house of somebody you think may have been handling stolen televisions, and you found 22 brand new televisions in his basement, and three had serial numbers that corresponded to ones that have been stolen from shops,” Brown says. “Are you going to say, ‘Yeah, we’ll assume you’ve got the purchase receipts for the other 19?’”

OK, now we’re getting rude. If talking about “methodological or analytical weaknesses” is needlessly aggressive, what is it when you liken someone to a thief of television sets?

Back to Javert

I have not looked into the details of this case. It could be that one or more authors of those papers were committing fraud, it could be that they didn’t know what they were doing, it could be that they were just really sloppy, or it could be some combination of these, as with the Cornell pizza researcher guy who seemed to have just had a big pile of numbers in his lab and would grab whatever numbers he needed when it was time to write a paper. It could be that none of those findings are replicable, or it could be that the errors are minor and everything replicates. Someone else will have to track all this down.

What bothers me is the way the critics have been attacked. There was that guy on twitter quoted above, and then there’s Munday, one of the original researchers, who in 2016 wrote: “It seems that Clark and Jutfelt are trying to make a career out of criticizing other people’s work. I can only assume they don’t have enough good ideas of their own to fill in their time . . . Recently, I found out they have been ‘secretly’ doing work on the behavioural effects of high CO2 on coral reef fishes, presumably because they want to be critical of some aspects of our work.”

The idea that there’s something shameful about critically assessing published work, or that it’s bad to “make a career” out of it, or that you can “only assume” that if someone is critical, that “they don’t have enough good ideas of their own to fill in their time” . . . That’s just a horrible, horrible attitude. Criticism is a valuable and often thankless part of science.

And to slam the critics for going public . . . jeez! They tried everything and were stonewalled at every turn, so, yeah, they went public. Why not? The original papers were published in public. I don’t see why the reputations of the scientists who wrote those papers should be considered more valuable than the social value of getting the research right.

This is so annoying.

I think the original researchers should’ve said something like this:

We very much appreciate the efforts of these outside critics who found serious errors in our published papers. We are carefully looking into our data processing and analysis pipeline and will share all of it as soon as possible. In the meantime, we consider all our published findings to be tentative; we will only be able to say more after a careful assessment of our data and procedures. Whatever happens, we are pleased that our studies were reexamined so carefully, and again we thank the critics for their careful work.

P.S. We appreciate that some people have been defending us on social media and that our universities have stood by us. We pride ourselves on our research integrity and we very much regret the sloppiness in our work that has led to our errors. But, please, do not defend us by attacking our critics. There was nothing improper or inappropriate in their criticism of our work! They found flaws in our published papers, and it was their scientific duty to share this information with the world. Telling us personally wouldn’t have been enough. Our papers are in the public record. Our papers did have methodological weaknesses—that is clear, as we report values that are not mathematically or physically possible—and so the authors of the critical paper should not be attacked for pointing out these errors.

Doubting the IHME claims about excess deaths by country

The Institute for Health Metrics and Evaluation at the University of Washington (IHME) was recently claiming 900,000 excess deaths, but that doesn’t appear to be consistent with the above data.

These graphs are from Ariel Karlinsky, who writes:

The main point of the IHME report, that total COVID deaths, estimated by excess deaths, are much larger than reported COVID deaths, is most likely true and the fact that they have drawn attention to this issue is welcome. In a study of 94 countries and territories by Dmitry Kobak and myself – we estimate this ratio (based on actual all-cause mortality data) at 1.6. We believe this to be a lower bound since we lack data for much of the world, where more localized reports and studies demonstrate larger excess.

The issue with the IHME report is that it uses extremely partial data when much more encompassing data (such as World Mortality) exist, that the country-level estimates they showed publicly are incredibly different from known ones (mostly higher), and that they purport to accurately estimate excess deaths where data simply do not exist – this undermines a tremendous effort currently underway to improve and collect vital data in many countries.

Karlinsky also quotes Stéphane Helleringer:

I [Helleringer] do worry a lot though about the false impression of knowledge and confidence that is conveyed by their estimates; especially the detailed global maps like the ones they just produced for the COVID death toll and MANY other health indicators for which few or no data are available. The risk is that IHME figures, with their apparent precision, will distract some funders & governments from the goal of universal death registration in low- to middle-income countries. From their standpoint, if IHME readily estimates mortality, why invest in complex systems to register each death?

This is an interesting moral-hazard issue that comes up from time to time when considering statistical adjustments. I remember years ago that some people opposed adjustments for census undercount based on the reasoning that, once the census was allowed to adjust, that would remove their motivation for counting everyone. In practice I think we have to push hard in both data collection and modeling: work to gather the cleanest and fullest possible datasets and then work to adjust for problems with the data. If the apparently very seriously flawed IHME estimates are taken as a reason not to gather good data, that’s a problem not so much with statistics as with governments and the news media who have the habit of running with authoritative-sounding numbers from respected institutions and not checking. We saw that a few years ago in a different setting with that silly Democracy Index. The claims lacked face validity and were based on crappy data, but, hey, it was from Harvard! The University of Washington isn’t quite Harvard, but I guess the IHME had a good enough public relations department that they could get that air of authority. Also, they sent a message that (some) people wanted to hear. Also, the coronavirus authorities, for all their flaws, were lucky in their enemies. Say what you want about the IHME, they weren’t as dumb as last year’s White House Council of Economic Advisors or the Stanford-advised Pandata team or the Hoover Institution’s Richard Epstein, who, when he’s not busy jamming his fingers down people’s throats, made a coronavirus death prediction that was off by a factor of 1000.

P.S. See Karlinsky’s page for more details on data and estimates.

P.P.S. Instead of using legends in his graphs, Karlinsky should’ve placed labels on each line directly. For some reason, many people don’t seem to know about this trick, which allows people to read your graph without having to go back and forth and decode the colors.
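For anyone who hasn’t seen the trick, here’s a minimal matplotlib sketch with made-up data: put each label at the end of its line, in the line’s own color, instead of using a legend.

```python
# Direct line labeling instead of a legend: a minimal matplotlib sketch, made-up data.
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2015, 2021)
series = {
    "Country A": np.array([100, 103, 105, 104, 108, 125]),
    "Country B": np.array([80, 82, 81, 83, 84, 96]),
    "Country C": np.array([60, 61, 63, 62, 64, 66]),
}

fig, ax = plt.subplots()
for name, values in series.items():
    line, = ax.plot(years, values)
    # Label each line directly at its last point, in the line's own color,
    # so the reader never has to go back and forth decoding a legend.
    ax.annotate(name, xy=(years[-1], values[-1]),
                xytext=(5, 0), textcoords="offset points",
                va="center", color=line.get_color())

ax.set_xlim(years[0], years[-1] + 2)  # leave room for the labels
ax.set_xlabel("Year")
ax.set_ylabel("Deaths (index, made up)")
plt.show()
```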

Blast from the past

Paul Alper points us to this news article, The Secret Tricks Hidden Inside Restaurant Menus, which is full of fun bits:

There is now an entire industry known as “menu engineering”, dedicated to designing menus that convey certain messages to customers, encouraging them to spend more and make them want to come back for a second helping.
“Even the binding around the menu is passing us important messages about the kind of experience we are about to have,” explains Charles Spence [author of the recent book Gastrophysics: The New Science of Eating], a professor in experimental psychology and multisensory perception at the University of Oxford.
“For a large chain that might have a million people a day coming into their restaurants around the world, it can take up to 18 months to put out a menu as we test everything on it three times,” says Gregg Rapp, a menu engineer based in Palm Springs, California.
Perhaps the first thing a customer will notice about a menu when the waiter hands it to them is its weight. Heavier menus have been shown to suggest to the customer that they are in a more upscale establishment where they might expect high levels of service.
A study conducted by researchers in Switzerland found that a wine labelled with a difficult-to-read script was liked more by drinkers than the same wine carrying a simpler typeface. Spence’s own research has also found that consumers often associate rounder typefaces with sweeter tastes, while angular fonts tend to convey a salty, sour or bitter experience.
“Naming the farmer who grew the vegetables or the breed of a pig can help to add authenticity to a product,” says Spence.
A study from the University of Cologne in Germany last year showed that by cleverly naming dishes with words that mimic the mouth movements when eating, restaurants could increase the palatability of the food. They found words that move from the front to the back of the mouth were more effective – such as the made up word “bodok”.
Dan Jurafsky, a professor of computational linguistics at Stanford University, performed a study that analysed the words and prices of 650,000 dishes on 6,500 menus. He found that if longer words were used to describe a dish, it tended to cost more. For every letter longer the average word length was, the price of the dish it was describing went up by 18 cents (14p).
“When we [Rapp] do eye tracking on a customer with a menu in their hand, we typically see hotspots in the upper right hand side,” he says. “The first item on the menu is also the best real estate.”

But filling a menu with too many items can actually hamper choice, according to menu design experts. They say offering any more than seven items can overwhelm diners. To overcome this, they tell restaurants to break down their menus into sections of between five and seven dishes.

“More than seven is too many, five is optimal and three is magical,” says Rapp. There is some research to back this up – a study from Bournemouth University found that in fast food restaurants, customers wanted to pick from six items per category. In fine dining establishments, they preferred a little more choice – between seven and 10 items.

“The problem with pictures is that the brain will also taste the food a little bit when it sees a picture, so when the food comes it may not be quite as good as they imagined,” warns Rapp.
In recent years, Pizza Hut began testing eye-tracking technology to predict what diners might want as they scan through 20 different toppings before offering a likely combination to the customer.
But the article is outdated
This article was originally published on November 20, 2017, by BBC Future, and is republished here with permission.
Putting brand names into dish titles is also an effective strategy for many chain restaurants, as are nostalgic labels like “handmade” or “ye olde” according to Brian Wansink from the Food and Brand Lab at Cornell University. A dose of patriotism and family can also boost sales.
I guess we can apply some partial pooling. If this news article reports the work of several different research groups, and Wansink’s is one of them, then, given other things we’ve learned about Wansink’s work, we can make some inference about the distribution of studies of this type . . .
One can also consider this from the reporting standpoint.  100% of the quotations come from people with a direct incentive to promote this work.
Really sad to see this coming from the BBC.  They’re supposed to be a legitimate news organization, no?  I can’t really fault them for citing Wansink—back then, there were still lots of people who hadn’t heard about what was up with his lab—but even in 2017 weren’t they teaching journalists to interview some disinterested parties when preparing their stories?
P.S. The most extreme bit is this quote:
More than seven is too many, five is optimal and three is magical . . .
But that just gives away the game.  Now we’re talking about magic, huh?

Formalizing questions about feedback loops from model predictions

This is Jessica. Recently I asked a question about when a model developer should try to estimate the relationship between model predictions and the observed behavior that results when people have access to the model predictions. Kenneth Tay suggested a recent machine learning paper on Performative Prediction by Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dunner, and Moritz Hardt. It comes close to answering the question and raises some additional ones.

My question had been about when it’s worthwhile, in terms of achieving better model performance, for the modeler to estimate and adjust for the function that maps from the predictions you visualize to the realized behavior. This paper doesn’t attempt to address when it’s worthwhile, but assumes that these situations arise and formalizes the concepts you need to figure out how to deal with them efficiently. 

It’s a theoretical paper, but they give a few motivating examples where reactions to model predictions change the target of the prediction: crime prediction changes police allocation changes crime patterns, stock price prediction changes trading activity changes stock price, etc. In ML terms, you get distribution shift, referring to the difference between the distribution you used to develop the model and the one that results after you deploy the model, whenever reactions to predictions interfere with the natural data generating process. They call this “performativity.” So what can be said/done about it? 

First, assume there’s a map D(.) from model parameters to the joint distributions over features (X) and outcomes (Y) they induce; e.g., for any specific parameter choice theta, D(theta) is the specific joint distribution over X and Y that you get as a result of deploying a model with parameters theta. The problem is that the model is calibrated given the data that has been seen prior to deploying it, not the data that results after it’s deployed. 

Typically in ML the way to deal with this is to retrain the model. However, maybe you don’t always have to do this. The key is to find the decision rule (here defined by the model parameters theta) that you know will perform well on the distribution D(theta) that you’re going to observe when you deploy the model. The paper uses a risk minimization framework to lay out two properties you need in order to find this rule. 

First you have to define the objective of finding the model specification (parameters theta) that minimizes loss over the induced distribution rather than the fixed distribution you typically assume in supervised learning. They call this “performative optimality.”

Next, you need “performative stability,” which is defined in the context of repeated risk minimization. Imagine a process defined by some update rule where you repeatedly find the model that minimizes risk on the distribution you observed when you deployed the previous version of the model, D(theta_{t-1}). You’re looking for a fixed point in this risk minimization process (what I called visualization equilibrium).  

I like this formulation, and the implications of it for thinking about when this kind of thing is achievable. This gets closer to the question I was asking. The authors show that to guarantee that it’s actually feasible to find the performative optima and that performatively stable points exist, you need both your loss function and the map D(.) to have certain properties. 

First, the loss needs to be smooth and strongly convex to guarantee a linear convergence rate when retraining to a stable point that approximately minimizes your performative risk. However, you also need the map D(.) to be sufficiently Lipschitz continuous, which constrains the relationship between the distance in parameter space between different thetas and the distance in response-distribution space between the distributions induced by those alternative thetas. Stated roughly, your response distribution can’t be too sensitive to changes in the model parameters. If you can get a big change in the response distribution from a small change in model parameters, you might not be able to find your performatively stable solution.  
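To see what that sensitivity condition means in the simplest possible setting, here’s a toy simulation I put together (my own sketch, not from the paper): a one-parameter mean-estimation problem where the outcome distribution shifts in proportion to the deployed prediction, with the proportionality constant eps playing the role of the Lipschitz sensitivity of D(.). Repeated retraining settles at a performatively stable point when eps < 1 and blows up when eps > 1.

```python
# Toy repeated-retraining simulation (my own sketch, not from the paper).
# The outcome distribution reacts to the deployed prediction theta:
#     D(theta) = Normal(mu + eps * theta, 1),
# and each retraining step sets theta to the squared-error risk minimizer on
# data drawn from the distribution induced by the previously deployed model.
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0

def retrain(eps, steps=50, n=10_000, theta0=0.0):
    """Repeated risk minimization; returns the sequence of deployed thetas."""
    thetas = [theta0]
    for _ in range(steps):
        y = mu + eps * thetas[-1] + rng.normal(size=n)  # data induced by current model
        thetas.append(y.mean())                          # squared-error risk minimizer
    return np.array(thetas)

for eps in [0.5, 0.9, 1.1]:
    path = retrain(eps)
    fixed_point = mu / (1 - eps)  # where the map theta -> mu + eps * theta sits still
    print(f"eps = {eps}: last iterates {np.round(path[-3:], 2)}, "
          f"fixed point = {round(fixed_point, 2)}")
# With eps < 1 the iterates settle near mu / (1 - eps), a performatively stable model;
# with eps > 1 a small parameter change induces a larger distribution shift, the fixed
# point is unstable, and repeated retraining never settles down.
```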

This is where things get interesting, because now we can tie things back to real world situations and ask, when is this guaranteed? I have some hunches based on my reading of recent work in AI-human collaboration that maybe this doesn’t always hold. For example, some work has discussed how in situations where you have a person overseeing how model predictions are applied, you have to be careful about assuming that it’s always good to update your model because it improves accuracy. Instead, a more accurate model may lead to worse human/model “team” decision making if the newly updated model’s predictions conflict in some noticeable way with the human’s expectations about it. You may instead want to aim for updates that don’t change the predictions so much, relative to the previously deployed model, that the human stops trusting the model altogether and makes all the decisions themselves, because then you’re stuck with human accuracy on a larger proportion of decisions. So this implies that it may be possible for a small change in parameter space to result in a disproportionately large change in response-distribution space. 

There’s lots more in the paper, including some analysis showing that it can be harder in general to achieve performative optimality than to find a performatively stable model. Again it’s theoretical, so it’s more about reflecting on what’s possible with different retraining procedures, though they run some simulations involving a specific game (strategic classification) to demonstrate how the concepts can be applied. It seems there’s been some follow-up work that generalizes to a setting where the distribution you get from a given set of model parameters (a result of strategic behavior) isn’t deterministic but depends on the previous state. That setting makes it easier to think about response distribution shifts caused by “broken” mental models, for example. At any rate, I’m excited to see that ML researchers are formalizing these questions, so that we have more clues about what to look for in data to better understand and address these issues.

Raymond Smullyan on Ted Cruz, Al Sharpton, and those scary congressmembers

Palko shares this fun logic puzzle from the great Raymond Smullyan which also has obvious implications for modern politics:

Inspector Craig of Scotland Yard was called to Transylvania to solve some cases of vampirism. Arriving there, he found the country inhabited both by vampires and humans. Vampires always lie and humans always tell the truth. However, half the inhabitants, both human and vampire, are insane and totally deluded in their beliefs: all true propositions they believe false, and all false propositions they believe true. The other half of the inhabitants are completely sane: all true statements they know to be true, and all false statements they know to be false. Thus sane humans and insane vampires make only true statements; insane humans and sane vampires make only false statements. Inspector Craig met two sisters, Lucy and Minna. He knew that one was a vampire and one was a human, but knew nothing about the sanity of either. Here is the investigation:

Craig (to Lucy): Tell me about yourselves.

Lucy: We are both insane.

Craig (to Minna): Is that true?

Minna: Of course not!

From this, Craig was able to prove which of the sisters was the vampire. Which one was it?

With all the conspiracy theories floating around, this distinction between “vampires” and “humans” keeps arising. I assume people such as Al Sharpton and Ted Cruz are “sane vampires” who know when they’re promoting lies, but then there are lots of others like those notorious q-anon congressmembers who are “insane humans” who actually believe what they’re saying.

A complicating factor is that these people help each other. The sane vampires make use of the insane humans in order to increase their political power, and, conversely, the insane humans get support for their false beliefs from the political power of the sane vampires.

So it’s not just Inspector Craig who’s playing the vampires and the humans against each other. The vamps and humans are getting into it directly. And then there are the false statements that get amplified by some mixture of sane vampires and insane humans in the news media.

I don’t think that this post adds anything to our understanding of politics or political science—lots of observers, academics and non-academics, have been talking for a while about the interaction between political manipulators and sincere believers. And this doesn’t even get into issues such as internet trolls who are being paid expressly to spread disinformation and to attack debunkers. So, these concerns are out there, even if we don’t always know what to do about them. I just think it’s interesting to see how Smullyan anticipated all this.

Any graph should contain the seeds of its own destruction

The title of this post is a line that Jeff Lax liked from our post the other day. It’s something we’ve been talking about for a long time; the earliest reference I can find is here, but it had come up before then, I’m sure.

The above histograms illustrate. The upper left plot averages away too much of the detail. The graph with default bin widths, on the upper right, is fine, but I prefer the lower left graph, which has enough detail to reveal the limits of the histogram’s resolution. That’s what I mean by the graph containing the seeds of its own destruction. We don’t need confidence bands or anything else to get a sense of the uncertainty in the bar heights; we see that uncertainty in the visible noise of the graph itself. Finally, the lower right graph goes too far, with so many bins that the underlying pattern is no longer clear.

My preferred graph here is not the smoothest or even the one that most closely approximates the underlying distribution (which in this case is a simple unit normal); rather, I like the graph that shows the data while at the same time giving a visual cue about its uncertainty.

P.S. Here’s the code:

a <- rnorm(1000)
par(mar=c(3,3,1,1), mgp=c(1.5,0.5,0), tck=-.01, mfrow=c(2,2))
hist(a, breaks=seq(-4,4,1), bty="l", main="Not enough bins", xlab="")
hist(a, bty="l", main="Default bins", xlab="")
hist(a, breaks=seq(-4,4,0.25), bty="l", main="Extra bins", xlab="")
hist(a, breaks=seq(-4,4,0.1), bty="l", main="Too many bins", xlab="")

P.S. Yeah, yeah, I agree, it would be better to do it in ggplot2. And, yeah, yeah, it's a hack to hardcode the histogram boundaries at +/-4. I'm just trying to convey the graphical point; go to other blogs for clean code!

Postmodernism for zillionaires

“Postmodernism” in academia is the approach of saying nonsense using a bunch of technical-sounding jargon. At least, I think that’s what postmodernism is . . .

Hmm, let’s check wikipedia:

Postmodernism is a broad movement that developed in the mid- to late 20th century across philosophy, the arts, architecture, and criticism, marking a departure from modernism. The term has been more generally applied to describe a historical era said to follow after modernity and the tendencies of this era.

Postmodernism is generally defined by an attitude of skepticism, irony, or rejection toward what it describes as the grand narratives and ideologies associated with modernism . . .

Postmodernism is often associated with schools of thought such as deconstruction, post-structuralism, and institutional critique, as well as philosophers such as Jean-François Lyotard, Jacques Derrida, and Fredric Jameson.

Criticisms of postmodernism are intellectually diverse and include arguments that postmodernism promotes obscurantism, is meaningless, and that it adds nothing to analytical or empirical knowledge. . . .

OK, so, yeah, postmodernism is a kind of aggressive anti-rigor.

I was thinking about this when reading about Elon Musk’s latest plan, which is to build highway tunnels in Miami . . . a city that’s basically underwater. I mean, why not go all-in and build a fleet of submarines? Musk’s an expert on that, right?

It’s hard for me to believe Musk really plans to build tunnels in Miami; I guess it’s part of some plan he has to grab government $ (not that I have any problem with that, I spend government $ all the time). Meanwhile, various local government officials in Miami are saying positive things about the ridiculous tunnel plan—but I’m guessing that they don’t believe in it either; they just want to say yeah great because that’s what politicians do.

Anyway, the whole thing is so postmodern. It’s like some clever-clever philosopher positing a poststructuralist version of physics, or someone arguing that Moby Dick is just a text with no author, or whatever.

As with academic postmodernism, perhaps the very ridiculousness of the tunnels-in-Miami idea is part of its selling point? After all, anyone can come up with a good idea. It takes someone really special to promote a ridiculous idea with a straight face.

Also as with academic postmodernism, it’s almost irrelevant if the idea makes sense. For example, suppose some literature professor somewhere gets a reputation based on the latest version of hyperstructuralism or whatever. You and I can laugh, but this dude has a steady job. He doesn’t care whether this makes sense, any more than the beauty-and-sex-ratio researchers care whether their statistics make any sense. They have success within a closed community. With a zillionaire, the currency is not academic success but . . . currency. What does it matter to a zillionaire that he’s promoting a ridiculous idea? He has a zillion dollars, which in some way retroactively justifies all his decisions. Kinda like those pharaohs and their cathedrals. Or maybe it’s a Keynesian thing—taking literally the economic dictum about hiring people to dig holes and fill them up again. Experimental theater for the ultra-rich.

P.S. It seems that the above is unfair to postmodernism; see comments here, here, and here.

size of bubbles in a bubble chart

(This post is by Yuling, not Andrew.)

We like bubble charts. In particular, it is the go-to visualization template for binary outcomes (voting, election turnout, mortality…): stratify observations into groups, draw a scatter plot of proportions versus group feature, and use the bubble size to communicate the “group size”. To be concrete, below is a graph I drew in a recent paper, where we have survey data on mortality in some rural villages. The x-axis is the month and the y-axis is the survey mortality rate in that month. The size of the bubble is the accessible population size at risk during that month. I also put the older population in a separate row as their mortality rates are orders of magnitude higher.

When we make a graph we always have a statistical model in mind: the scale (probability, log probability, log odds…) implies the default modeling scale; a one-standard-error bar corresponds to a normal assumption; etc. Here, as you can imagine, we have a hierarchical model in mind and would like to partial-pool across bubbles. Visualizing the size of the bubbles implicitly conveys the message that “I have many groups! they have imbalanced group sizes! so I need a Bayesian model to enhance my small area estimation!”

OK, nothing new so far. What I want to blog about is which “group size” we should visualize. To be specific in this mortality survey, should the size be the size of the population (n), or the number of death cases (y)? I only need one of them because the y-axis indicates their ratio y/n. This distinction is especially clear for across-age comparisons.

It is common to pick the population size, which is what I did in the graph above. I also googled “gelman bubble chart election”; the first result that jumps out is the “Deep Interactions with MRP” paper, in which the example visualized the subgroup population size (n) of each income × ethnicity × state group, not their one-party vote count.

But I can provide a counterargument for visualizing the case size (y). Again, a graph is an implicit model: visualizing the proportion corresponds to a Bernoulli trial. The inverse Fisher information of theta in a Bernoulli(theta) likelihood is theta(1-theta). But that could be the wrong unit to look at, because theta is close to zero anyway. If we look at the log odds instead, the change of variables gives Fisher information of logit(theta) equal to [1/(theta(1-theta))] * [theta(1-theta)]^2 = theta(1-theta). In the mortality context, theta is small. Hence the “information of logit mortality” from a size-n group will be n*theta(1-theta) ≈ n*theta ≈ y, which also implies a 1/y variance scaling. This y-dependent factor will determine how much a bubble is pooled toward the shared prior mean in a multilevel posterior.

In this sense, the routine of visualizing the group size comes from a rule-of-thumb 1/n variance scaling, which is a reasonable approximation when the group-specific precision is roughly constant. For a Bernoulli model, the reasoning above suggests a better bubble scale could be n*theta(1-theta) ≈ y(1-y/n), but it also sounds pedantic to compute such quantities for a raw data summary.
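
As a small illustration (made-up counts, not the survey data), here is what the two sizing rules look like side by side; when the proportions are small, the information-based size is essentially the case count y.

# Compare bubble-size choices for three hypothetical groups:
# sizing by n reflects a 1/n variance rule of thumb; sizing by
# n * phat * (1 - phat) tracks the Fisher information of logit(theta), roughly y.
y <- c(2, 5, 40)        # hypothetical event counts
n <- c(1000, 500, 800)  # hypothetical populations at risk
phat <- y / n
cbind(size_by_n = n, size_by_info = round(n * phat * (1 - phat), 1), size_by_y = y)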

Hmmm, any experimental measure of graphical perception will inevitably not measure what it’s intended to measure.

Indeed, the standard way that statistical hypothesis testing is taught is a 2-way binary grid. Both these dichotomies are inappropriate.

I originally gave this post the title, “New England Journal of Medicine makes the classic error of labeling a non-significant difference as zero,” but as I was writing it I thought of a more general point.

First I’ll give the story, then the general point.

1. Story

Dale Lehman writes:

Here are an article and editorial in this week’s New England Journal of Medicine about hydroxychloroquine. The study has many selection issues, but what I wanted to point out was the major conclusion. It was an RCT (sort of) and the main result was “After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19….” This was the conclusion when the result was “The incidence of new illness compatible with Covid-19 did not differ significantly between participants receiving hydroxychloroquine (49 of 414 [11.8%]) and those receiving placebo (58 of 407 [14.3%]); the absolute difference was -2.4 percentage points (95% confidence interval, -7.0 to 2.2; P=0.35).”

The editorial based on the study said it correctly: “The incidence of a new illness compatible with Covid-19 did not differ significantly between participants receiving hydroxychloroquine….” The article had 25 authors, academics and medical researchers, doctors and PhDs – I did not check their backgrounds to see whether or how many statisticians were involved. But this is Stat 101 stuff: the absence of a significant difference should not be interpreted as evidence of no difference. I believe the authors, peer reviewers, and editors know this. Yet they published it with the glaring result ready for journalists to use.

To add to this, the study of course does not provide the data. And the editorial makes no mention of their recent publication (and retraction) of the Surgisphere paper. It would seem that that whole episode has had little effect on their processes and policies. I don’t know if you are up for another post on the subject, but I don’t think they should be let off the hook so easily.

Agreed. This reminds me of the stents story. It’s hard to avoid binary thinking: the effect is real or it’s not, the result is statistically significant or it’s not, etc.

2. The general point

Indeed, the standard way that statistical hypothesis testing is taught is a 2-way binary grid, where the underlying truth is “No Effect” or “Effect” (equivalently, Null or Alternative hypothesis) and the measured outcome is “Not statistically significant” or “Statistically significant.”

Both these dichotomies are inappropriate. First, the underlying reality is not a simple Yes or No; in real life, effects vary. Second, it’s a crime to take all the information from an experiment and compress it into a single bit of information.
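
To make that concrete, here’s a quick computation using the counts quoted above (49/414 vs. 58/407) and a simple normal approximation: instead of the single bit “not significant,” you report the estimated difference and the range of effects consistent with the data, which roughly reproduces the interval quoted in the article.

# Difference in proportions and approximate 95% interval from the quoted counts
x <- c(49, 58)
n <- c(414, 407)
p <- x / n
diff <- p[1] - p[2]                                            # about -2.4 percentage points
se <- sqrt(p[1] * (1 - p[1]) / n[1] + p[2] * (1 - p[2]) / n[2])
round(100 * c(estimate = diff, lo = diff - 1.96 * se, hi = diff + 1.96 * se), 1)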

Yes, I understand that sometimes in life you need to make binary decisions: you have to decide whether to get on the bus or not. But. This. Is. Not. One. Of. Those. Times. The results of a medical experiment get published and then can inform many decisions in different ways.

Whassup with the weird state borders on this vaccine hesitancy map?

Luke Vrotsos writes:

I thought you might find this interesting because it relates to questionable statistics getting a lot of media coverage.

HHS has a set of county-level vaccine hesitancy estimates that I saw in the NYT this morning in this front-page article. It’s also been covered in the LA Times and lots of local media outlets.

Immediately, it seems really implausible how big some of the state-border discontinuities are (like Colorado-Wyoming). I guess it’s possible that there’s really such a big difference, but if you check the 2020 election results, which are presumably pretty correlated with vaccine hesitancy, it doesn’t seem like there is. For example, estimated vaccine hesitancy for Moffat County, CO is 17% vs. 31% for neighboring Sweetwater County, WY, but Trump’s vote share was actually higher (81%) in Moffat County than in Sweetwater County (74%).

According to HHS’s methodology, they don’t actually have county-level data from their poll (just state-level data), which isn’t too surprising. This is how they arrived at the estimates:

It’s not 100% clear to me what’s skewing the estimates here, but maybe there’s some confounder that’s making the coefficient on state of residence much too big — it could be incorporating the urban/rural split of the state, which they don’t seem to adjust for directly. I guess the way to check if this analysis is wrong would be to re-run it to try to predict county-level election results and see if you get the same discontinuities (which we know don’t exist there).

Let me know what you think. It’s strange to see results that seem so unlikely, just by looking at a map, reported so widely.

I agree that the map looks weird. I wouldn’t be surprised to see some state-level effects, because policies vary by state and the political overtones of vaccines can vary by state, but the border effects just look too large and too consistent here. I wonder if part of the problem here is that they are using health insurance status as a predictor, and maybe that varies a lot from state to state, even after adjusting for demographics?

How big is the Household Pulse Survey? The documentation linked above doesn’t say. I did some googling and finally found this document that says that HPS had 80,000 respondents in week 26 (the source of the data used to make the above map). 80,000 is pretty big! Not big enough to get good estimates for all the 3000 counties in the U.S., but big enough to get good estimates for subsets of states. For example, if we divide states into chunks of 200,000 people each, then we have, ummmm, 80K * 200K / 330 million = 48 people per chunk. That would give us a raw standard error of 0.5/sqrt(48) = 0.07 per chunk, which is pretty big, but (a) some regression modeling should help with that, and (b) it’s still enough to improve certain things such as the North Dakota / Minnesota border.
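
In code, that back-of-the-envelope calculation is just:

# Rough respondents per 200,000-person chunk and the resulting raw standard error
respondents <- 80e3
chunk_pop <- 200e3
us_pop <- 330e6
n_chunk <- respondents * chunk_pop / us_pop  # about 48 respondents per chunk
se_chunk <- 0.5 / sqrt(n_chunk)              # about 0.07 on the proportion scale
round(c(n_chunk, se_chunk), 2)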

The other thing is, I guess they know the county of each survey respondent, so they can include state-level and county-level predictors in their model. The model seems to have individual-level predictors but nothing at the state or county level. It might be kinda weird to use election results as a county-level predictor, but there are lots of other things they could use.

On the other hand, the map is not a disaster. The reader of the map can realize that the state borders are artifacts, and that tells us something about the quality of the data and model. I like to say that any graph should contain the seeds of its own destruction, and it’s appealing, in a way, that this graph shows the seams.

P.S. I wrote the above post, then I wrote the title, and then it struck me that this title has the same rhythm as What joker put seven dog lice in my Iraqi fez box?

Whatever you’re looking for, it’s somewhere in the Stan documentation and you can just google for it.

Someone writes:

Do you have link to an example of Zero-inflated poisson and Zero-inflated negbin model using pure stan (not brms, nor rstanarm)? If yes, please share it with me!

I had a feeling there was something in the existing documentation already! So I googled *zero inflated Stan*, and . . . yup, it’s the first link:

We don’t generally recommend the Poisson model; as discussed in Regression and Other Stories, we prefer the negative binomial. So I’m not thrilled with this being the example in the user’s guide. But the code is simple enough that it wouldn’t take much to switch in the negative binomial instead. Really, the main challenge with the negative binomial is not the coding so much as the interpretation of the parameters, which is something we were struggling with in chapter 15 of Regression and Other Stories as well.
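
As a side note on the interpretation issue, here’s a quick R sketch (mine, not from the user’s guide) of how the negative binomial’s extra parameter behaves: with mean mu and reciprocal-overdispersion parameter phi, the variance is mu + mu^2/phi, so the Poisson is recovered as phi gets large.

# How the negative binomial's overdispersion parameter changes the variance
mu <- 5
for (phi in c(0.5, 5, 500)) {
  y <- rnbinom(1e5, mu = mu, size = phi)
  cat("phi =", phi, " mean =", round(mean(y), 2), " var =", round(var(y), 2), "\n")
}
# the variance approaches mu (the Poisson case) as phi gets large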

Anyway, the real message of this post is that the Stan documentation is amazing. Thanks, Bob (and everybody else who’s contributed to it)!

Responding to Richard Morey on p-values and inference

Jonathan Falk points to this post by Richard Morey, who writes:

I [Morey] am convinced that most experienced scientists and statisticians have internalized statistical insights that frequentist statistics attempts to formalize: how you can be fooled by randomness; how what we see can be the result of biasing mechanisms; the importance of understanding sampling distributions. In typical scientific practice, the “null hypothesis significance test” (NHST) has taken the place of these insights.

NHST takes the form of frequentist significance testing, but not its function, so experienced scientists and statisticians rightly shun it. But they have so internalized its function that they can call for the general abolition of significance testing. . . .

Here is my basic point: it is wrong to consider a p value as yielding an inference. It is better to think of it as affording critique of potential inferences.

I agree . . . kind of. It depends on what you mean by “inference.”

In Bayesian data analysis (and in Bayesian Data Analysis) we speak of three steps:
1. Model building,
2. Inference conditional on a model,
3. Model checking and improvement.
Hypothesis testing is part of step 3.

So, yes, if you follow BDA terminology and consider “inference” to represent statements about unknowns, conditional on data and a model, then a p-value—or, more generally, a hypothesis test or a model check—is not part of inference; it’s a critique of potential inferences.

But I think that in the mainstream of theoretical statistics, “inference” refers not just to point estimation, interval estimation, prediction, etc., but also to hypothesis testing. Using that terminology, a p-value is a form of inference. Indeed, in much of statistical theory, null hypothesis significance testing is taken to be fundamental, so that virtually all inference corresponds to some transformations of p-values and families of p-values. I don’t hold that view myself (see here), but it is a view.

The other thing I want to emphasize is that the important idea is model checking, not p-values. You can do everything that Morey wants to do in his post without ever computing a p-value, just by doing posterior predictive checks or the non-Bayesian equivalent, comparing observed data to their predictions under the model. The p-value is one way to do this, but I think it’s rarely a good way to do it. When I was first looking into posterior predictive checks, I was computing lots of p-values, but during the decades since, I’ve moved toward other summaries.
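
As a minimal sketch of what that can look like without a p-value, here’s a toy example of my own (the plug-in, non-Bayesian version of the check): fit a normal model to skewed data and compare an observed test statistic to its distribution under replicated data, judging the discrepancy by eye rather than by a tail-area calculation.

# Graphical predictive check: does the fitted model reproduce the observed maximum?
set.seed(1)
y <- rexp(100)                       # observed data (skewed)
mu_hat <- mean(y)
sigma_hat <- sd(y)
T_obs <- max(y)
T_rep <- replicate(1000, max(rnorm(length(y), mu_hat, sigma_hat)))
hist(T_rep, main = "max of replicated data under fitted normal", xlab = "")
abline(v = T_obs, lwd = 2)           # see where the observed max sits among the replications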
