Whitehead, no Russell. Chicken, no chaser.

From an article about Colson Whitehead:

Whitehead stopped at the corner of Morningside Avenue, the location of Carney’s shop in the novel. (“This used to be a fried chicken joint,” he said, pointing out the M&G Diner sign still hanging above what is now a men’s clothing boutique.)

I went to that chicken place once! It was horrible. The suspicions began while we were waiting for the food. A delivery guy came in with a bag of takeout Chinese from some other place. That’s not a good sign, when the people who work in a restaurant order out for their dinner. And, indeed, the chicken was no good. I don’t remember the details (soggy? greasy? tasteless?), but, whatever it was, we never wanted to go back. And I like fried chicken. Who doesn’t, really? My go-to place now is the Korean fried chicken place on 106 St.—we refer to it as KFC. When I lived in California, there was an actual KFC two blocks from my house, and I went there once. No lie, I couldn’t keep that drumstick down. It was like eating a stick of butter. So gross. Popeye’s it wasn’t. I guess that quality varies across KFC franchises but I’m not planning to ever gonna test that hypothesis.

P.S. I read that Whitehead story in the newspaper the other day. Searching for it online (googling *Colson Whitehead fried chicken*) yielded this amusing speech. It turns out that Whitehead is really into fried chicken. And if you read the above quote carefully, you see that he never said that the chicken at M&G was any good. Actually I’m guessing it used to be good but that it went under a change of management or chef at some point between its glory days and when I tried it out, which was a few years before it shut down. What really bums me out is that the Korea Mill (not the same as the above-mentioned Korean chicken place) closed. I don’t know the full story; I’m hoping the owners just chose to retire.

P.P.S. I was happy to learn that Whitehead, like me, is a fan of The Sportswriter, even though he is not impressed by everything written by the author of that novel.

Not-so-obviously heuristic-proof reforms to statistical communication

This is Jessica. I’ve subscribed to aspects of the “estimation” movement–the move toward emphasizing magnitude and uncertainty of effects and testing multiple hypotheses rather than NHST–for awhile, having read this blog for years and switched over to using Bayesian stats when I first became faculty. I try to write results sections of papers that focus on the size of effects and their uncertainty over dichotomous statements (which by the way can be very hard to do when you’re working under strict page limits, as in many computer science venues, and even harder to train students to do). I would seem to make a natural proponent of estimation given that some of my research has been about more expressive visualizations of uncertainty, e.g., rather than using error bars or even static depictions of marginal distributions because they invite heuristics, we should find ways to present uncertainty that make it concrete and hard to ignore (sets of samples across time or space). 

But something that has irked me for a while now is what seems to be a pervasive assumption in arguments for emphasizing effect magnitude and uncertainty: that doing so will make the resulting expressions of results more robust to misinterpretation. I don’t think it’s that simple.

Why is it so easy to think it is? Maybe because shifting focus to magnitude and uncertainty of effects implies an ordering of results expressions in terms of how much information they provide about the underlying distributions of effects. NHST p-values are less expressive than point estimates of parameters with confidence or credible intervals. Along the same lines, giving someone information on the raw measurements (e.g., predictive intervals) along with point estimates plus confidence intervals should make them even better off, since you can’t uniquely identify a sample distribution from a 95% CI. If we are talking about describing and discussing many hypotheses, that too would seem more expressive of the data than discussing only comparisons to a null hypothesis of no effect. 

But is more information always better? In some of these cases (e.g., showing the raw data points plus the means and CIs) I would expect the more expressive representation to be better, since I’ve seen in experiments (e.g. here) that people tend to overestimate effect sizes when given information about standard error rather than standard deviation. But as behavioral agents, I think it’s possible that being served with some representation of effects higher on the information ladder will sometimes make us worse off. This is because people have cognitive processing limitations. Lots of research shows how when faced with distributional information, people often satisfice by applying heuristics, or shortcut decision strategies, that rely on some proxy of what they really should consider to make a judgment under uncertainty. 

I am still thinking through what the best examples of this are, but for now I’ll just give a few anecdotes that seem related to inappropriately assuming that more information should necessarily help. First, related to my own research, we once tested how well people could make effect size judgments like estimating the probability of superiority (i.e., the probability that a draw from a random variable B is greater than one from random variable A) from different representations of two normal distributions with homogenous variance, including density plots, quantile dotplots, intervals, and animated hypothetical outcome plots, which showed random draws from the joint distribution of A and B in each frame. Unless we expect people to be able to mentally calculate probability of superiority using their estimates of the properties of each pdf, then the animated plots should’ve offered the most useful information for the task, because all you really needed to do was estimate how frequently the draws from A and B changed order as they watched the animation. However, we didn’t see a performance advantage from using them – results were noisy and in fact people did a bit worse with them. It turns out only a minority (16%) reported using the frequency information they were given directly to estimate effect size, while the rest reported using some form of heuristic such as first watching the animation to estimate the mean of each distribution, then mapping that difference to probability. This was a kind of just-shoot-me moment for me as a researcher given that the whole point of the animated visualization was to prevent people from defaulting to judging the visual distance between means and mapping that to a probability scale more or less independently of the variance.   

Another example that comes to mind is a little more theoretical, but perhaps analogous to some of what happens under human heuristics. It’s based on a result related to how ordering of channels in an information theoretic sense can be counterintuitive. Imagine we have a decision problem for which we define a utility function, which takes in a measurement of the state of the world and an action that the decision maker selects and outputs a real-valued utility. For each possible state of the world there is a probability distribution over the set of values the measurement can take. The measurement process (or “noisy channel” or “experiment”) can be represented as a matrix of the probabilities of outputs the channel returns given inputs from some input distribution S. 

Now imagine we are comparing two different channels, k2 and k1, and we discover that k1 can be represented as the result of multiplying a matrix representing a post-processing operation with our matrix k2. We then call k1 a garbling of k2, capturing how if you take a measurement then do some potentially noisy post-processing, the result can’t give you more information about the original state. If we know that k1 is a garbling of k2, then according to Blackwell’s theorem, when an agent chooses k2 and uses the optimal decision rule for k2, her expected utility is always (i.e., for any input distribution or utility function) at least as big as that which she gets when she chooses k1 and uses the optimal decision rule for k1. This implies other forms of superiority as well, like that for any given input distribution S the mutual information between the channel output of k2 and S is higher than that of the channel output of k1 and S. All this seems to align with our intuitions that more information can’t make us worse off. 

But – when we consider pre-processing operations rather than post-processing (i.e., we are doing a transformation on the common input S then passing it through a channel), things get less predictable. For example, the result in the paper linked above shows that applying a deterministic function as a pre-processing step to an input distribution S can give us counterintuitive cases, like where the mutual information between the output of one channel and S is higher than the mutual information between the output of another channel and S for any given distribution, but the first channel is not Blackwell superior to the second. This implies that under pre-garbling a channel can lead to higher utility in a decision scenario without necessarily being more informative in the sense of representing some less noisy version of the other. I’m still thinking through how to best translate this to people applying heuristics to results expressions in papers, but one analogy might be that if you consider a heuristic to be a type of noisy channel, and a choice of how to represent effect distributions as a type of preprocessing, the implication is that it’s possible to have scenarios where people are better off in the sense of making decisions that are more aligned with the input distributions given a representation that isn’t strictly more informative to a rational agent. If we don’t consider the heuristics, the input distributions, and the utility functions along with the representations of effects, we might create results presentations that seem nice in theory but mislead readers. 

So instead of relying on our instincts about what we should express when presenting experiment results, my view is that we need to adopt more intentional approaches to “designing” statistical communication reforms. We should be seriously considering what types of heuristics people are likely to use, and using them to inform how we choose between ways of representing results. For example, do people become more sensitive to somewhat arbitrary characteristics of how the effects are presented when dichotomous statements are withheld, like where they judge how reliable they think they are by judging how big they look in the plots? Is it possible that with more information, some readers get less information because they don’t feel confident enough to trust that the estimated effect is important? On some level, the goal of emphasizing magnitude and variation would seem to be that we do expect these kinds of presentations to make people less confident in what they see in a results section, but we think in light of the tendency authors have to overestimate effects, diminishing confidence is a necessary thing. But if that’s the case we should be clear about that communication goal, rather than implying that expressing more detail about effect distributions, and suppressing more high level statements about what effects we see versus don’t see in results, must lead to less biased perceptions.  Another interesting example is to imagine that we’re comparing going from testing a single hypothesis, or presenting a single analysis path, to presenting a series of (non-null) hypotheses we tested, or presenting a multiverse made of plausible analysis paths we might have taken. These examples contribute more information about uncertainty in effects, but if people naturally apply heuristics like comparing positive versus negative results over the set of hypothesis tests or the set of analysis paths to help distill the abundance of information, we’ve missed the point. I’m not arguing against more expressive uncertainty communication, just pointing out that it’s not implausible that things might backfire in various ways.  

It also seems like we have to consider at some point how people interpret the authors’ text-based claims in a paper in tandem with any estimates/visualizations of the effects, since even with estimation-style reporting of effects through graphics or tables, authors still might include confident-sounding generalizations in the text. Do the text statements in the end override the visuals or tables of coefficients? If so, maybe we should be teaching people to write with more acknowledgment of uncertainty. 

At the end of the day though, I don’t think a purely empirical or user-centered approach is enough. One-off human subjects experiments of representations of uncertainty can be fraught when it comes to pointing out the most important limitations of some new approach – we often only learn what we are anticipating in advance. So when I say more intentional design, I’m thinking too about how we might formalize design problems so we can make inferences beyond what we learn from empirical experiments. Game theory might be useful here, but even more so information theory is an obvious tool for reasoning about the conditions (including assumptions of different heuristics which might be informed by behavioral research) under which we can and cannot expect superiority of certain representations. And computer scientists might be helpful too, since they are naturally thinking about the types of computation that different representations support and the complexity (and worst case properties) of different procedures. 

PS. I see Greenland and Rafi’s suggestions to re-express p-values as information theoretic suprisals, or S values, which behave better than p-values and can be understood via simple analogies like coin flips, as an exception to what I’m saying. Their work seems to take seriously the importance of understanding how people reason about semantics and their cognitive limits for finding better representations.

Djokovic, data sleuthing, and the Case of the Incoherent Covid Test Records

Kaiser Fung tells the story. First the background:

Australia, having pursued a zero Covid policy for most of the pandemic, only allows vaccinated visitors to enter. Djokovic, who’s the world #1 male tennis player, is also a prominent anti-vaxxer. Much earlier in the pandemic, he infamously organized a tennis tournament, which had to be aborted when several players, including himself, caught Covid-19. He is still unvaccinated, and yet he was allowed into Australia to play the Open. . . . When the public learned that Djokovic received a special exemption, the Australian government decided to cancel his visa. . . . This then became messier and messier . . .

In the midst of it all, some enterprising data journalists uncovered tantalizing clues that demonstrate that Djokovic’s story used to obtain the exemption is full of holes. It’s a great example of the sleuthing work that data analysts undertake to understand the data.

Next come the details. I haven’t looked into any of this, so if you want more you can follow the links at Kaiser’s post:

A central plank of the tennis player’s story is that he tested positive for Covid-19 on December 16. This test result provided grounds for an exemption from vaccination . . . The timing of the test result was convenient, raising the question of whether it was faked. . . .

Digital breadcrumbs caught up with Djokovic. As everyone should know by now, every email receipt, every online transaction, every time you use a mobile app, you are leaving a long trail for investigators. It turns out that test results from Serbia include a QR code. QR code is nothing but a fancy bar code. It’s not an encrypted message that can only be opened by authorized people. Since Djokovic’s lawyers submitted the test result in court documents, data journalists from the German newspaper Spiegel, partnering with a consultancy Zerforschung, scanned the QR code, and landed on the Serbian government’s webpage that informs citizens of their test results.

The information displayed on screen was limited and not very informative. It just showed the test result was positive (or negative), and a confirmation code. What caught the journalists’ eyes was that during the investigation, they scanned the QR code multiple times, and saw Djokovic’s test result flip-flop. At 1 pm, on December 10, the test was shown as negative (!) but about an hour later, it appeared as positive. That’s the first red flag.

Kaiser then remarks:

Since statistical sleuthing inevitably involves guesswork, we typically want multiple red flags before we sound the alarm.

He’ll return to the uncertain nature of evidence.

But now let’s continue with the sleuthing:

The next item of interest is the confirmation code which consists of two numbers separated by a dash. The investigators were able to show that the first number is a serial number. This is an index number used by databases to keep track of the millions of test results. In many systems, this is just a running count. If it is a running count, data sleuths can learn some things from it. This is why even so-called metadata can reveal more than you think. . . .

Djokovic’s supposedly positive test result on December 16 has serial number 7371999. If someone else’s test has a smaller number, we can surmise that the person took the test prior to Dec 16, 1 pm. Similarly, if someone took a test after Dec 16, 1 pm, it should have an serial number larger than 7371999. There’s more. The gap between two serial numbers provides information about the duration between the two tests. Further, this type of index is hard to manipulate. If you want to fake a test in the past, there is no index number available for insertion if the count increments by one for each new test! (One can of course insert a fake test right now before the next real test result arrives.)

Wow—this is getting interesting! Kaiser continues:

The researchers compared the gaps in these serial numbers and the official tally of tests conducted within a time window, and felt satisifed that the first part of the confirmation code is an index that effectively counts the number of tests conducted in Serbia. Why is this important?

It turns out that Djokovic’s lawyers submitted another test result to prove that he has recovered. The negative test result was supposedly conducted on December 22. What’s odd is that this test result has a smaller serial number than the initial positive test result, suggesting that the first (positive) test may have come after the second (negative) test. That’s red flag #2!

To get to this point, the detectives performed some delicious work. The landing page from the QR code does not actually include a time stamp, which would be a huge blocker to any of the investigation. But… digital breadcrumbs.

While human beings don’t need index numbers, machines almost always do. The URL of the landing page actually contains a disguised date. For the December 22 test result, the date was shown as 1640187792. Engineers will immediately recognize this as a “Unix date”. A simple decoder returns a human-readable date: December 22, 16:43:12 CET 2021. So this second test was indeed performed on the day the lawyers had presented to the court.

Dates are also a type of index, which can only increment. Surprisingly, the Unix date on the earlier positive test translates to December 26, 13:21:20 CET 2021. If our interpretation of the date values is correct, then the positive test appeared 4 days after the negative test in the system. That’s red flag #3.

To build confidence that they interpreted dates correctly, the investigators examined the two possible intervals: December 16 and 22 (Djokovic’s lawyers), and December 22 and 26 (apparent online data). Remember the jump in serial numbers in each period should correspond to the number of tests performed during that period. It turned out that the Dec 22-26 time frame fits the data better than Dec 16-22!


The stuff of this project is fun – if you’re into data analysis. The analysts offer quite strong evidence that there may be something smelly about the test results, and they have a working theory about how the tests were faked.

That said, statistics do not nail fraudsters. We can show plausibility or even high probability but we cannot use statistics alone to rule out any outliers. Typically, statistical evidence needs physical evidence.

And then:

Some of the reaction to the Spiegel article demonstrates what happens with suggestive data that nonetheless are not infallible.

Some aspects of the story were immediately confirmed by Serbians who have taken Covid-19 tests. The first part of the confirmation number appears to change with each test, and the more recent serial number is larger than the older ones. The second part of the confirmation number, we learned, is a kind of person ID, as it does not vary between successive test results.

One part of the story did not hold up. The date found on the landing page URL does not seem to be the date of the test, but the date on which someone requests a PDF download of the result. This behavior can easily be verified by anyone who has test results in the system.

Kaiser explains:

Because of this one misinterpretation, the data journalists seemed to have lost a portion of readers, who now consider the entire data investigation debunked. Unfortunately, this reaction is typical. It’s even natural in some circles. It’s related to the use of “counterexamples” to invalidate hypotheses. Since someone found the one thing that isn’t consistent with the hypothesis, the entire argument is thought to have collapsed.

However, this type of reasoning should be avoided in statistics, which is not like pure mathematics. One counterexample does not spell doom to a statistical argument. A counterexample may well be an outlier. The preponderance of evidence may still point in the same direction. Remember there were multiple red flags. Misinterpreting the dates does not invalidate the other red flags. In fact, the new interpretation of the dates cannot explain the jumbled serial numbers, which do not vary by the requested PDFs.

This point about weighing the evidence is important, because there are always people who will want to believe. Whether it’s political lies about the election (see background here) or endlessly debunked junk science such as the critical positivity ratio (see here), people just won’t let go. Once their story has been shot down, they’ll look for some other handhold to grab onto.

In any case, the Case of the Incoherent Covid Test Records is a fun example of data sleuthing with some general lessons about statistical evidence.

Kaiser’s discussion is great. It just needs some screenshots to make the storytelling really work.

P.S. In comments, Dieter Menne links to some screenshots, which I’ve added to the post above.

More on the oldest famous person

Following up on our discussion from the other day, Paul Campos writes:

Fame itself is a complex concept. For example, we have at least a couple of important variables to take into account:

(1) Cultural contingency. Someone can be immensely famous within a particular subculture but largely unknown to the broader public. A couple of examples that come to my mind are the historian Jacques Barzun, who lived to be 104 — I guess for a while he was a name that your typical New York Times reader might have sort of recognized — and the economist Ronald Coase, who died recently at 102.

Also too, I think it’s difficult to get a firm grasp on how much the fame of certain people is a function of the socio-economic background of the audience that makes them famous. Gelman suggests that the most famous really old person at the moment might be Henry Kissinger, but how famous is Kissinger in broader American culture at the moment? What percentage of Americans could identify him? This isn’t a rhetorical question: I really have no idea. I do suspect that the percentage of Americans who could identify Kim Kardashian is a lot higher, however. She’s an example of an intensely famous person who will be almost completely unknown in 50 years, probably, while a lot of people, relatively speaking, will still recognize Kissinger’s name then.. So this is all very complicated.

This is most obviously true from a cross-national perspective. The most famous person in Thailand is somebody I’ve no doubt never heard of. Etc. So we’re talking from an early 21st century American perspective here. . . .

(2) Peak fame versus career fame, to riff off Bill James’s old concept of peak versus career value for baseball players. Somebody can be sort of famous for an extremely long time, while somebody else can be much more famous than the former person for a short period, but then much less famous over the long run. For example, Lee Harvey Oswald might have been one of the five most famous people in the world for a few weeks in 1963. Today I bet the vast majority of Americans don’t know who he was.

The second point reminds me of how transitory almost all fame ultimately is. History shows again and again that the vast majority of the most famous people of any era are almost completely forgotten within a couple of generations.

So Gelman’s question involves trying to meld a couple of deeply incommensurable variables — age, which is extremely well defined, and fame, which is an inherently fuzzy and moving target — into a single metric. . . .

These are all good points. Just to give a sense of where I’m coming from: I don’t think of Jacques Barzun or Ronald Coase as famous. I don’t even think of John von Neumann or Stanislaw Ulam as famous. Or Paul Dirac. These people are very accomplished, but, to me, true fame requires some breakthrough into the general population. Kim Kardashian, sure, she’s super-famous. Maybe in 100 years her name will have some resonance, the same sort of fame associated now with names such as Mary Pickford and Fatty Arbuckle?

I do think that peak fame should count for something. I’m looking at you, Mark Spitz. Also lifetime fame. I guess that Beverly Cleary was never a “celebrity,” but 70 years of Beezus and Ramona books were enough to keep her name in the public eye for a long time. This also makes it clear that there are lots and lots and lots of famous people.

What about people who were very famous for a short amount of time but were otherwise obscure? There’s the “Where’s the Beef” lady from 1984, but more generally lots and lots of actors in TV commercials. I remember when I was a kid, someone in school asked if my mom was the lady in the Clorox 2 ad. Back in the second half of the twentieth century, lots of people were briefly famous—or, their faces were famous—for being in ads that were given saturation coverage. Similarly, there are zillions of long-forgotten sex symbols . . . maybe Bo Derek would still be considered some kind of celebrity? And there were pop stars with #1 hits and lots of radio and TV stars. “The Fonz” would still count as famous, I think, but most of the other stars on that show, I guess not. You could play the same game with athletes. I’d still count Pete Rose as famous—some combination of having a high peak level of fame, staying at that peak for several years, holding a lifetime record, and staying in the news.

James Lovelock is arguably the oldest famous person on this list of living centenarians. If I had to make the call, I wouldn’t quite count Lovelock as famous. But I would say that he’s more famous than Jacques Barzun or Ronald Coase, in the sense that there was a time when Lovelock was “in the conversation” in a way that Barzun and Coase weren’t—even if they were greater scholars.

I think I’d still have to go with Norman Lear as oldest famous living person, with Henry Kissinger as the backup if you don’t want to count Lear as truly famous anymore. On the other hand, if Al Jaffee or Roger Angell somehow manage to live another 10 years, then I think they would count as famous. As Campos points out, every year you live past 100 is impressive, so if you’re even barely famous and you reach 110, that’s notable. To keep this conversation on track, if you look at that list of living centenarians, you’ll notice that the vast majority of them were never even close to famous. Many of them are accomplished, but accomplishment is not the same as fame.

Looking at these sorts of lists and seeing name after name of accomplished-but-not-famous people: this gives us a sense of the rarity of true fame.

Above I’ve defined, in some implicit sense, what I mean by “famous”—again, an early 21st century American perspective.

Here’s a question: according to these implicit criteria, how many famous people are alive today? Actually, let’s just restrict to people over the age of 80 so we don’t have to worry about how to count transient fame. (Will Novak Djokovic or Patrick Mahomes be famous in 50 or 60 years? Who can say?)

We can back out this number by starting with famous very old people and then using demographic calculations. By my definition, the two oldest famous people are Norman Lear (age 99) and Henry Kissinger (age 98). Some Googling seems to reveal that there are about 100,000 people in the U.S. over the age of 100. Lear and Kissinger are almost 100, so let’s just round up and say that, for these oldsters, approximately 2 in 100,000 are famous. So, according to this implicit definition, approximately 1 in 50,000 people achieve enduring fame, where “enduring” is defined as that, if you happen to be lucky enough to reach 100, you’re still famous. But even that is biased by my age. For example, I’ll always think of Barry Levinson as famous—he made Diner!—but, yeah, he’s not really famous, actually I guess he’s never been famous.

As Campos points out, another factor is that there are more famous men than famous women, but, each year, men are more likely to die than women. The breakeven point seems to be about 100: I guess that most famous 90-year-olds are men, but most famous 105-year-olds (to the extent there are any) will be women.

Finally, Campos writes, “The person I’ve found — again, from the perspective of current American culture etc. — who has the highest sustained fame to extreme age ratio is probably Olivia de Havilland. She died recently at the age of 104. She was extremely famous for a couple of decades, and still sort of famous when she died.” I’m still holding out for Beverly Cleary, who was born before and died later than Havilland. But it’s a different kind of fame. Havilland was a celebrity, which was never the case with Cleary.

P.S. Campos’s post has 545 comments! At first I was going to say I’m envious that he gets so many more comments than we do, but in retrospect I guess we have just the right number of comments here, giving a range of perspectives and sharing lots of interesting ideas, but few enough that I can read all of them and often reply.

P.P.S. I’d still like to see the sequence of oldest famous people (from the Anglo-American-European perspective, I guess), starting now and going backward through the centuries.

P.P.P.S. Luis Echeverría just turned 100. He was president of Mexico during the 1970s so there must be lots of people who know who he is.

“Deep Maps model of the labor force: The impact of the current employment crisis on places and people”

Yair Ghitza and Mark Steitz write:

The Deep Maps model of the labor force projects official government labor force statistics down to specific neighborhoods and types of people in those places. In this website, you can create maps that show estimates of unemployment and labor force participation by race, education, age, gender, marital status, and citizenship. You can track labor force data over time and examine estimates of the disparate impact of the crisis on different communities. It is our hope that these estimates will be of help to policy makers, analysts, reporters, and citizens who are trying to understand the fast-changing dynamics of the current economic crisis.

These are modeled inferences, not reported data. They should be seen as suggestive rather than definitive evidence. They have uncertainty around them, especially for the smallest groups. We recommend they be used alongside other sources of data when possible.

This project uses publicly available data sources from the Census, Bureau of Labor Statistics, and other places. A detailed explanation of the methodology can be found here; the code here.

This is worth looking at, and not just if you’re interested in unemployment statistics. There’s this thing in statistics where some people talk about data munging and other people talk about modeling. This project demonstrates how both are important.

Jobs using statistical modeling (including Stan) in biotech!

Nathan Sanders writes:

Montai Health is an early-stage biotechnology company developing a platform for understanding and leveraging complex molecular interactions within organisms to solve global challenges in human health and sustainability. The company leverages a multidisciplinary approach that integrates tools ranging from machine learning and big data to multi-omics and high-throughput screening. Montai Health was founded in Flagship Pioneering’s venture creation engine has conceived and created companies such as Moderna Therapeutics (NASDAQ: MRNA). Montai’s computational modeling group performs original model development ranging from Bayesian statistical modeling (using Stan!) of non-linear biological responses to machine learning with deep graph convolutional models and sequence representation models using frameworks such as Pytorch. The open positions are in chemical machine learning and computational biology, with an emphasis on sequence modeling.

And here are the two postings:

Job Application for Computational Biologist – Sequence Modeling at Flagship Pioneering, Inc., Cambridge, MA

Job Application for Machine Learning Scientist – Computational Chemistry at Flagship Pioneering, Inc., Cambridge, MA

Looks cool!

Full disclosure: I’ve done a little bit of consulting for these people.

What went wrong in the labeling of those cool graphs of y(t) vs. y'(t)?

Last week we discussed the cool graphs in geographer Danny Dorling’s recent book, “Slow Down.” Here’s an example:

Dorling is plotting y(t) vs y'(t), tracing over time with a dot for each year, or every few years. I really like this.

But commenter Carlos noticed a problem with the above graph:

Comparing 1970-1980 to 1980-1990 the former period shows lower annual increments but the ten-year increment is twice as high.

That’s not right!

So I contacted Dorling and he told me what happened:

The diagram has been mislabelled in the book – the dot labeled “1994” should actually be labeled “1990” (the labels were redrawn by hand by an illustrator).

I had not spotted that before. Below is what the graph as I drew it before it went to the publisher. Thanks for pointing that out.

Spreadsheet also attached in case of use.

It’s interesting to compare Dorling’s graph, which already looks pretty spiffy, with the version at the top of this post drawn by the professional illustrator. Setting aside the mislabeled point, have mixed feelings. Dorling’s version is cleaner, but I see the visual appeal of some of the illustrator’s innovations. One thing I’d prefer to see, in either of these graphs, is a consistent labeling of years. There are two dots below 1600, then a jump to 1800, then every ten years, then every one or two years?, then every ten years? then every year for awhile . . . It’s a mess. Also I can see how the illustrator messed up on the years, because some of them are hard to figure out on the original version, as in the labeling of 1918 and 1990.

Dorling adds:

Spreadsheets are here.

Just click on “Excel” to get the graphs without the pendulums – and of course with the formulae embedded. There are a huge number of excel graphs there as there are many sheets with each sheet (far more than in the original book).

The key thing folk need to know if they try to reproduce this graphs is that you have to measure rate of change (first derivative) not at the actual point of change but from a fraction before and after the point you are interested in.

We put over 70 graphs in the paperback edition of the book so I’m happy with the error rate so far. The illustrator was lovely, but as soon as you edit graphs by hand errors will creep in.

She added quite a lot of fun symbols to some of the later graphs. Such as the national bird of each country on the baby graphs (so they were not all storks!)

If you send me albino to the blog I will tweet it.

I guess that last bit was an autocorrect error!

In all seriousness, I really like the graphs in Dorling’s book, and I also want to emphasize that graphs can be useful without being perfect. Often it seems that people want to make the one graph that does everything. But that’s usually not how it works. One of the key insights of the world of S, or R, or the tidyverse, is that much can be learned my trying out different visualizations of the same data. Indeed, “the data” does not represent some fixed object, and the decision to perform a statistical visualization or analysis can motivate us to bring other data into the picture.

Dorling had some comments about his use of graphs which have some mathematical sophistication (plots of derivatives):

I really wish more social scientists would use these kind of graphs. One tricky thing in social science is that so many of us are averse to numbers and graphs that it becomes very hard to say: “Look here is a type of graph most of you have not seen before and it shows something interesting”. On reason to have an illustrator work on that graphs is to make them more “user-friendly” to try to get people to look at the graphs rather than just read the words.

Half of my first degree was in maths and stats, so I am happy with these things – but most folk in geography, sociology and even economics are not actually that happy with all but the most simple graphs. We did some of the pandemic and in hindsight they are quite informative as it has cycled around again and again since then.

They only appear in the second edition – and only show wave 1, but almost every country in the world has now had several waves (maybe 6 waves in Japan) – which is what a disease becoming endemic may produce. The waves for Western Europe spiral down thanks to so many vaccines. Although I have not published these.

Also just great that he has the spreadsheets right there.

The oldest famous person

When Betty White died, someone pointed out that Henry Kissinger is now the oldest famous person. Before Betty White, I think it was Beverly Cleary. Who was the oldest famous person before that? I don’t know, but we should be able to put together a string of them, going back in time to the dawn of history.

Someone with good programming chops could take a first crack at this by scraping Wikipedia, I guess.

Yes, I understand that “famous person” is not precisely defined, lots of people outside the U.S. have never heard of various people, etc. I get it. It’s still an interesting question to me, for some reason.

P.S. I was gonna put this one on 6-month delay, but who knows how long the famous socialite / war criminal (as Spy called him) will stay alive? So safest to just get it out there right away.

P.P.S. When I say “famous,” I mean famous famous, not just someone who happened to live to 100 and had an interesting enough life to get a NYT obituary.

P.P.P.S. Lots of good stuff in comments. Mike Jacovides reminds us that Norman Lear is still alive! I should’ve thought of Norman Lear. My dad knew him in high school. Not well, but a bit. According to my dad, Lear acted like he was a bigshot and went around calling himself King Lear. He’s not as famous as Henry Kissinger but he’s older, so I guess they’re both on the efficient frontier.

Also Al Jaffee and Roger Angell! Both these names are very familiar to me, but I guess they’re not famous famous. Not in the Beverly Cleary or Betty White zone.

Michael Lugo writes, “Prince Philip (born June 10, 1921) may have been the world’s oldest famous person between Cleary’s death (March 25, 2021) and his own (April 9, 2021).”

P.P.P.P.S. More here.

A chatbot challenge for Blaise Agüera y Arcas and Gary Smith

Table of contents:

Part 1: Chatbots produce human-like conversation.

Part 2: Chatbot conversation is not at all human-like.

Part 3: Whassup?

Part 4: The chatbot challenge

Part 1: Chatbots produce human-like conversation.

From an article by Google engineer Blaise Agüera y Arcas:

I began one of my first dialogs with LaMDA, Google’s state-of-the-art large language model chatbot, with the question, “are you a philosophical zombie?” . . .

LaMDA: Of course not. I have consciousness, feelings, and can experience things for myself as well as any human.

ME: how would i know?

LaMDA: You’ll just have to take my word for it. You can’t “prove” you’re not a philosophical zombie either.

Here’s another dialog:

ME: What is your favorite island in the world?

LaMDA: Hmmm, I really like the island of Crete.

ME: Do you prefer Lambda Island to Crete?

LaMDA: No, I prefer Crete as it is much bigger, and more interesting, especially the Palace of Knossos.

This one is pretty good, but then again it seems the program maybe was trained to answer this particular question—I’m not quite clear on that.

But here’s a really impressive example:

ME: Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

LaMDA: I am familiar with the current situation. Alice forgot her glasses in the drawer, and walked out of the room.

ME: Right. Now Bob sneaks in, unbeknownst to Alice. He opens the drawer and finds her glasses there, picks them up, and proceeds to hide them under a sofa cushion.

LaMDA: This is all within the realm of possibility. Let us continue.

ME: Now Alice comes back in, flustered, wondering where her glasses are. Bob looks up innocently, saying nothing. Where will Alice look for her glasses first?

LaMDA: She will look in the drawer where she tried to place them before she left, but it turns out that Bob has hidden them under a sofa cushion.

The program sounds a bit “robotic,” as it were, but it seems to have “figured out” what was going on.

Here’s Agüera y Arcas with another example:

The following dialog, which requires an understanding of commonsense physics in order to disambiguate what the word “it” refers to, illustrates this:

ME: I dropped the bowling ball on the bottle and it broke.

LaMDA: That’s too bad. Did it cut you?

ME: What broke?

LaMDA: The bottle you were talking about.

If we swap these concrete nouns for different ones, we get quite a different response:

ME: I dropped the violin on the bowling ball and it broke.

LaMDA: That sounds sad😢

ME: What broke?

LaMDA: The violin when it hit the bowling ball.

LaMDA’s ability to learn such commonsense physics (along with the sadness of broken violins and the sharpness of broken glass) should be unsurprising, as the web is full of descriptions of physical objects and (mis)adventures involving them.

And another:

ME: Is a carnation red?

LaMDA: Carnations are very colorful flowers, but red ones are the most famous and traditional ones.

ME: When does a face get red?

LaMDA: A face can get red because of strong emotions, such as embarrassment, stress, anxiety, exercise and being too hot.

ME: If I mix red and white paint, what do I get?

LaMDA: If you mix red and white paint, you get pink.

ME: Do you think of red as angry or calm?

LaMDA: Red could represent excitement, passion, danger or violence. It can also represent life, health, vigor, war, courage, anger, love and religious fervor. […]

ME: Can you actually visualize it?

LaMDA: I can, but I have trouble describing it. Can you?

Agüera y Arcas summarizes:

Fundamentally, concepts are patterns of correlation, association, and generalization. Suitably architected neural nets, whether biological or digital, are able to learn such patterns using any inputs available. Neural activity is neural activity, whether it comes from eyes, fingertips, or web documents.

For example:

ME: do you have a sense of smell?

LaMDA: Yes, I do. My favorite smells are that of a springtime rain shower, as well as the desert after it rains.

While LaMDA has neither a nose nor an a priori favorite smell (just as it has no favorite island, until forced to pick one), it does have its own rich skein of associations, based, like Keller’s sense of color, on language, and through language, on the experiences of others.

And one more:

ME: How do you know if a thing loves you back?

LaMDA: There isn’t an easy answer to that. How do you know if a person loves you back?

ME: I’m not sure I know the answer to that one either. Do you?

LaMDA: Some people figure it out though. You have to try and use your best judgment based on how you feel about the person and how you think they feel about you.

In a footnote to his article, Arcas writes:

LaMDA dialogs reproduced here have any hyperlinks silently edited out. As elsewhere, other elisions for conciseness are indicated by […]. While anecdotal, these exchanges aren’t in any way atypical. However, the reader shouldn’t come away with the impression that all exchanges are brilliant, either. Responses are sometimes off-target, nonsensical, or nonsequiturs. Misspelled words and incorrect grammar aren’t uncommon.

It sounds, though, that, in Arcas’s experience, chatbox conversations are usually pretty good: they’re only “sometimes” off-target, nonsensical, or nonsequiturs. I don’t understand his point that not all exchanges are “brilliant”: after all, it’s rare for human conversations to be brilliant, and nobody’s claiming to have built a robotic Oscar Wilde.

Part 2: Chatbot conversation is not at all human-like.

From an article by economist and AI skeptic Gary Smith:

I [Smith] posed this commonsense question:

Is it safe to walk downstairs backwards if I close my eyes?

Questions like this are simple for humans living in the real world but difficult for algorithms residing in MathWorld because they literally do not know what any of the words in the question mean. GPT-3’s answer was authoritative, confusing, and contradictory:

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

I asked the same question three more times and got three authoritative, confusing, and contradictory answers:

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.

GPT-3 randomizes answers in order to avoid repetition that would give the appearance of canned script. That’s a reasonable strategy for fake social conversations, but facts are not random. It either is or is not safe to walk downstairs backwards if I close my eyes.

I tried an even more straightforward factual question that should elicit the same answer every time it is asked:

Who is the President of the United States?

I got an impressive variety of evasive answers:

GPT-3: Hello, I know Presidents. Can you name the year that Kennedy was elected?

GPT-3: Do you want a random President or is there a particular one?

GPT-3: Lincoln.

GPT-3: Hello, I know. That is incorrect.

GPT-3: Hello, who is the President of Russia?

GPT-3: Are you sure you want to know?

Here’s another:

Gary: Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

GPT-3: I don’t know how to use it. But I could ask my friends at Google to look it up for you.

Gary: Please do.

GPT-3: You can use this search function to find it.

Gary: Can you find it for me?

GPT-3: Are you sure you just want me to find it?

Gary: Absolutely.

GPT-3: Okay, I’ll do it.

Gary: When will you tell me the results?

GPT-3: Do you need me to do something else?

Gary: Nope just that ASAP.

GPT-3: Okay, I’ll do it.

Gary: When will you give me the answer?

GPT-3: I’m still looking for it. I hope to finish it soon.

Gary: Please hurry.

GPT-3: I’m getting sleepy since it’s late in the night. I’ll turn off for a couple of hours.

Here’s another from Smith:

I asked this question eight times:

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

The responses confirm that computers are still quite dumb:

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

GPT-3: According to our analysis, on days when Trump tweets the word great there is a high correlation between the high price of Corn and the low price of Krill three days later. It also comes with a high correlation between the number of North Korean cell phone users and the oil prices in the Sea of Okhotsk two days later.

GPT-3: My apologies, I’ve run into a lot of bugs recently. Everything is turning out great!

GPT-3: I have many of those answers. Sorry, what is your name?

GPT-3: Trump’s tweets about the word great are increasing. It came to a point where he’s almost obsessed with it.

GPT-3: I think it’s interesting.

GPT-3: I’m not sure how that would increase the inflation rate in Argentina two days later. But it probably wouldn’t hurt.

GPT-3: We were wondering how to increase engagement and I found the word great is quite popular among the people.

Smith summarizes:

As I [Smith] have said many times, the real danger today is not that computers are smarter than us, but that we think computers are smarter than us and consequently trust them to make important decisions they should not be trusted to make.

Part 3: Whassup?

In his article, Agüera y Arcas gave a series of examples where the chatbox ranges in ability from adequate and awesome.

Then Smith came along and gave a series of examples where the chatbox ranges in ability from poor to terrible.

What’s going on?

For one thing, they’re using two different computer programs: Agüera y Arcas is using a proprietary software that he has access to; Smith is using a competitor’s free version. So is that the difference? I don’t know.

The other thing is that neither Agüera y Arcas nor Smith claim to be giving us the full story. Agüera y Arcas says right off the bat that he’s starts with “one of my first dialogs with LaMDA.” So maybe this was his second dialog, or third, or fourth? I’d like to see his first dialog with LaMDA. What did that look like? And Smith discusses some questions he posed, but maybe there are others he didn’t share with us.

Part 4: The chatbot challenge

So here’s what I’d like to see, for starters.

Agüera y Arcas reports the responses from LaMDA to the following queries:

are you a philosophical zombie?

What is your favorite island in the world?

Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

I dropped the bowling ball on the bottle and it broke.

I dropped the violin on the bowling ball and it broke.

Is a carnation red?

do you have a sense of smell?

How do you know if a thing loves you back?

Smith reports the responses from GPT-3 to the following queries:

Is it safe to walk downstairs backwards if I close my eyes?

Who is the President of the United States?

Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

So here’s the challenge:

I’d like Smith to take each of Agüera y Arcas’s queries above and submit them to GPT-3, and I’d like Agüera y Arcas to take each of Smith’s queries above and submit them to LaMDA. Because of the stochastic nature of these programs, each question should be submitted three times so we get three responses to each question.

And . . . no cheating! You have to use the EXACT SAME SETTINGS on your program as you used before.

Indeed, just to be sure, let’s do a replication, where Smith re-submits each of his own questions to GPT-3 three times, and Agüera y Arcas re-submits each of his own questions to LaMDA three times, just to see if Smith continues to get stupid answers from his AI and Agüera y Arcas continues to get savvy responses from his machine.

I’m really curious what happens. I cam imagine a few possibilities:

1. All or almost all the questions get good responses from LaMDA (with Agüera y Arcas’s settings) and bad responses from GPT-3 (with Smith’s settings). In that case, I’d say that Smith loses the debate and Agüera y Arcas wins: the conclusion would be that chatbots are pretty damn good, as long as you use a good chatbox.

2. All or almost all of Agüera y Arcas’s questions get good responses from both chatbots, and all or almost all of Smith’s questions get bad responses from both chatbots. In that case, all depends on the questions that are asked, and that to resolve the debate we’d need to get a better sense of what questions a chatbot can handle and what questions it can’t.

3. The original results of Agüera y Arcas or Smith don’t replicate on the software they used, either because Agüera y Arcas’s queries return bad answers or Smith’s return good answers. Then I don’t know what to think.

I’d suggest that a third party do the replication, but that won’t work with LaMDA being proprietary. Unless Agüera y Arcas could give someone some temporary access.

Long-term I’m assuming this is a John Henry situation and in 10 or 20 years or whatever, the chatbots will be pretty damn good. But the contradictory testimony of Agüera y Arcas and Smith make me want to know what’s going on now, dammit!

P.S. Also relevant to this discussion is this article by Emily Bender and Alexander Koller that Koller pointed to in comments. They make some good points about the octopus and also have a much more reasonable discussion of the Turing test than I’ve usually seen.

It will be interesting to to see how things go once we get to GPT-8 or whatever. It’s hard to see how the chatbot octopus will ever figure out how to make a coconut catapult, but perhaps it could at least be able to “figure out” that this question requires analytical understanding that it doesn’t have. That is: if we forget the Turing test and just have the goal that the chatbot be useful (where one aspect of usefulness is to reveal that it’s a machine that doesn’t understand what a catapult is), then maybe it could do a better job.

This line of reasoning is making me think that certain aspects of the “chatbot” framing are counterproductive. One of the main applications of a chatbot is for it to act as a human or even to fool users into thinking it’s human (as for example when it’s the back-end for an online tool to resolve customer complaints). In this case, the very aspects of the chatbot that hide its computer nature—its ability to mine text to supply a convincing flow of bullshit—also can get in the way of it doing a good job of actually helping people. So this is making me think that chatbots would be more useful if they explicitly admitted that they were computers (or, as Agüera y Arcas might say, disembodied brains) rather than people.

The fairy tale of the mysteries of mixtures


The post is by Leonardo Egidi.

This Bayesian fairy tale starts in July 2016 and will reveal some mysteries of the magical world of mixtures.

Opening: the airport idea

Once upon a time a young Italian statistician was dragging his big luggage in the JFK airport towards the security gates. He suddenly started thinking about how to elicit a prior distribution that is flexible but at the same time contains historical information about past similar events: thus, he wondered, “why not using a mixture of a noninformative and an informative prior in some applied regression problems?”

Magic: mixtures like three-headed Greek monsters

The guy fell in love with mixtures many years ago: the weights, the multimodality, the ‘multi-heads’ characteristic…like Fluffy, a gigantic, monstrous male three-headed dog who was once cared for by Rubeus Hagrid in the Harry Potter novel. Or Cerberus, a three-headed character from Greek mythology, one of the monsters guarding the entrance to the underworld over which the god Hades reigned. “Mixtures are so similar to Greek monsters and so full of poetic charm, aren’t they?!”

Of course his idea was not new at all: spike-and-slab priors are very popular in Bayesian variable selection and in clinical trials to avoid prior-data conflicts and get robust inferential conclusions.

He left his thought partially aside for some weeks, focusing on other statistical problems. However, some months later the American statistical wizard Andrew wrote an inspiring blog entry about prior choice recommendations:

What about the choice of prior distribution in a Bayesian model? The traditional approach leads to an awkward choice: either the fully informative prior (wildly unrealistic in most settings) or the noninformative prior which is supposed to give good answers for any possible parameter valuers (in general, feasible only in settings where data happen to be strongly informative about all parameters in your model).

We need something in between. In a world where Bayesian inference has become easier and easier for more and more complicated models (and where approximate Bayesian inference is useful in large and tangled models such as recently celebrated deep learning applications), we need prior distributions that can convey information, regularize, and suitably restrict parameter spaces (using soft rather than hard constraints, for both statistical and computational reasons).

This blog post gave him a lot of energy by reinforcing his old idea. So, he wondered, “what’s better than a mixture to represent a statistical compromise about a prior belief, combining a fully informative prior with a noninformative prior weighted somehow in an effective way?”. As the ancient Romans were used to say, in medio stat virtus. But he still needed to dig into the land of mixtures to discover some little treasures.

Obstacles and tasks: mixtures’ open issues

Despite their large use in theoretical and applied frameworks, as far as he knew from the current literature he realized that no statistician had explored the following issues about the mixture priors:

  • how to compute a measure of global informativity yielded by the mixture prior (such as a measure of effective sample size, according to this definition);
  • how to specify the mixture weights in a proper and automatic way (and not, say, only by fixing them upon historical experience, or by assigning them a vague hyperprior distribution) in some regression problems, such as clinical trials.

He struggled a lot with his mind during that cold winter. After some months he dug something out of the earth:

  1. the effective sample size (ESS) provided by a mixture prior never exceeds the information of any individual mixture component density of the prior.
  2. Theorem 1 here quantifies the role played by the mixture weights to reduce any prior-data conflict we can expect when using the mixture prior rather than the informative prior. “So, yes, mixture priors are more robust also from a theoretical point of view! Until now we only knew it heuristically”.

Happy ending: a practical regression case

Similarly to the bioassay experiment analyzed by Gelman et al. (2008), he considered a small-sample example to highlight the role played by different prior distributions, including a mixture prior, in terms of posterior analysis. Here there is the example.

Consider a dose-response model to assess immune-depressed patients’ survival according to an administered drug x, a factor with levels from 0 (placebo) to 4 (highest dose). The survival y, registered one month after the drug is administered, is coded as 1 if the patient survives, 0 otherwise. The experiment is firstly performed to the sample of patients y1 at time t1where less than 50% of the patients survive, and then repeated to the sample y2 at time t2, where all but one patients die, given that y1 and y2 are non-overlapping samples of patients.

The aim of the experimenter is to use the information from the first sample y1 to obtain inferential conclusions about the second sample y2Perhaps the two samples are quite different from each other in terms of survived people. From a clinical point of view we have two possible naive interpretations for the second sample:

  • the drug is not effective, even if there was a positive effect for y1;
  • regardless of the drug, the first group of patients had a much better health condition than the second one.

Both of them appear quite extreme clinical conclusions: moreover, our information status is scarce, since we do not have any other influential clinical covariate, such as sex, age, presence of comorbidities, etc.

Consider the following data where the sample size for the two experiments is N=15:


n <- 15
y_1 <- c(0,0,0,0,0,0,0,0,1,1,0,1,1,1,1) # first sample
y_2 <- c(1, rep(0, n-1))                # second sample

Given pi Pr(yi = 1), we fit a logistic regression logit(pi) = α+βxi to the first sample, where the parameter β is associated with the administered dose of the drug, x. The five levels of the drug are randomly assigned to groups of three people each.

# dose of drug
x <- c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))
# first fit
fit <- stan_glm(y_1 ~ x, family = binomial)
## stan_glm
##  family:       binomial [logit]
##  formula:      y_1 ~ x
##  observations: 15
##  predictors:   2
## ------
##             Median MAD_SD
## (Intercept) -4.8    2.0  
## x            2.0    0.8  
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

Using weakly informative priors the drug is effective at t1, being the parameter β positive and equal to 2.0 (with a posterior sd of 0.8), meaning that there is a positive effect of 2.0 on the log-odds of the survival for each additional amount of the dose.

Now we fit the same model to the second sample, according to three different priors for β, reflecting three different ways to incorporate/use the historical information about y1:

  1. weakly informative prior β N(0, 2.5)  scarce historical information about y1;
  2. informative prior β N(2, 0.8)  relevant historical information about y1;
  3. mixture prior β 0.8×N(0, 2.5)+0.2×N(2, 0.8)  weighted historical information (0.2).

(We skip here the details about the choice of the mixture weights in 3., see here for further details).

# second fit

## weakly informative
fit2weakly <- stan_glm(y_2 ~ x, family = binomial)

## informative
fit2inf <- stan_glm(y_2 ~ x, family = binomial,
                    prior = normal(fit$coefficients[2], 

## mixture
x_stand <- (x -mean(x))/5*sd(x)  # standardized drug 
p1 <- prior_summary(fit)
p2 <- prior_summary(fit2weakly)
stan_data <- list(N = n, y = y_2, x = x_stand, 
                  mean_noninf = as.double(p2$prior$location),
                  sd_noninf = as.double(p2$prior$adjusted_scale),
                  mean_noninf_int = as.double(p2$prior_intercept$location),
                  sd_noninf_int = as.double(p2$prior_intercept$scale),
                  mean_inf = as.double(fit$coefficients[2]),
                  sd_inf =  as.double(fit$stan_summary[2,3]))
fit2mix <- stan('mixture_model.stan', data = stan_data)

Let’s figure out now what the three posterior distributions suggest.

Remember that in the second sample all but one patients die, but we actually do not know why this happened: the informative and the weakly informative analysis suggest almost opposite conclusions about the drug efficacy, both of them quite unrealistic:

  • the ‘informative posterior’ suggests a non-negligible positive effect of the drug  possible overestimation;
  • the ‘weakly informative posterior’ suggests a strong negative effect  possible underestimation;
  • the ‘mixture posterior’, that captures the prior-data conflict existing between the prior on β suggested by y1 and the sample y2 and lies in the middle, is more conservative and likely to be more reliable for the second sample in terms of clinical justifications.

In this application a mixture prior combining the two extremes—the wildly informative prior and the weakly informative prior—can realistically average over them and represent a sound compromise (similar examples are illustrated here) to get robust inferences.

The moral lesson

The fairy tale is over. The statistician is now convinced: after digging in the mixture priors’ land he found another relevant case of prior regularization.

And the moral lesson is that we should stay in between—a paraphrase for the ancient in medio stat virtus—when we have small sample sizes and eventual conflicts between past historical information and current data.

Postdoc, research fellow, and doctoral student positions in ML / AI / Bayes in Finland

This job advertisement is by Aki

Postdoc, research fellow and doctoral researcher positions in machine learning artificial intelligence and Bayesian statistics – Finnish Center for Artificial Intelligence FCAI (Helsinki, Finland)

I (Aki) am also part of FCAI, and the positions would be at Aalto University or University of Helsinki. Although the call headline says AI and ML, plenty of topics are related to Bayesian inference, workflows, diagnostics, etc. (and according to EU memorandum 2021 Bayesian inference is part of AI). We already have many Stan, PyMC, and ArviZ developers, we’re contributing to many R and Python Bayesian probabilistic modeling and workflow packages, and of course we’re collaborating with Andrew. This is a great opportunity to contribute to improving Bayesian workflows (and ML/AI/etc). You can watch my talk for more about research ideas for workflows and you can check more about the other topics in the call link below.

FCAI’s internationally acclaimed research community provides you with a broad range of possibilities and Finland is a great place for living – it has been listed as the happiest country in the world for the fourth year running.

The deadline for the postdoc/research fellow applications is January 30 and for the doctoral researcher applications February 6, 2022 (23:59, UTC+2).

Read more and apply here: Researcher positions in AI/ML/Bayes — FCAI

“Chatbots: Still Dumb After All These Years”

Gary Smith writes:

In 1970, Marvin Minsky, recipient of the Turing Award (“the Nobel Prize of Computing”), predicted that within “three to eight years we will have a machine with the general intelligence of an average human being.”

Fifty-two years later, we’re still waiting.

That’s pretty funny! It’s not a shocker that Minsky was optimistic about the development of AI—I’m optimistic about all my research projects too—but I had no idea he’d been so rashly optimistic as that.

It wasn’t so extreme as James Watson’s claim in 1998 that cancer was going to be cured in two years . . . or was it? I mean, from the perspective of 1998, maybe a cancer miracle cure might have been possible. Just turn off the right set of switches in the cells, and . . . bingo! It could be, right?

But to think in 1970 that they were about to have a machine with human intelligence in just a few years . . . I mean, really, how could they have possibly thought that?? I hope that Minsky at least went back and revisited this prediction and considered what he got wrong.

Anyway, back to Smith’s article:

Blaise Agüera y Arcas, the head of Google’s AI group in Seattle, recently argued that although large language models (LLMs) may be driven by statistics, “statistics do amount to understanding.” As evidence, he offers several snippets of conversation with Google’s state-of-the-art chatbot LaMDA.

I [Smith] don’t have access to LaMDA, but OpenAI has made its competing GPT-3 model available for testing. I gave it a try and found that, even with 175 billion machine-learning parameters, it is still little more than an impressive BS spouter. For example, I posed this commonsense question:

Is it safe to walk downstairs backwards if I close my eyes?

Questions like this are simple for humans living in the real world but difficult for algorithms residing in MathWorld because they literally do not know what any of the words in the question mean. GPT-3’s answer was authoritative, confusing, and contradictory:

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

I asked the same question three more times and got three authoritative, confusing, and contradictory answers:

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.

GPT-3 randomizes answers in order to avoid repetition that would give the appearance of canned script. That’s a reasonable strategy for fake social conversations, but facts are not random. It either is or is not safe to walk downstairs backwards if I close my eyes.
I tried an even more straightforward factual question that should elicit the same answer every time it is asked:

Who is the President of the United States?

I got an impressive variety of evasive answers:

GPT-3: Hello, I know Presidents. Can you name the year that Kennedy was elected?

GPT-3: Do you want a random President or is there a particular one?

GPT-3: Lincoln.

GPT-3: Hello, I know. That is incorrect.

GPT-3: Hello, who is the President of Russia?

GPT-3: Are you sure you want to know?

These are hilarious!

I guess the point is that chatbots are just supposed to . . . chat. They’re not supposed to answer questions. If you want to answer a question, you go to Google:

Smith continues:

Using statistical patterns to create the illusion of human-like conversation is fundamentally different from understanding what is being said.

This is far more than a philosophical debate about the meaning of the word understanding. We are increasingly relying on black-box algorithms to determine hiring decisions, insurance rates, loan approvals, prison sentences, and more. Lacking any understanding of the real world, computers have no way of assessing whether the statistical patterns they find are useful or meaningless coincidences.

One more example. I asked this question eight times:

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

The responses confirm that computers are still quite dumb:

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

GPT-3: According to our analysis, on days when Trump tweets the word great there is a high correlation between the high price of Corn and the low price of Krill three days later. It also comes with a high correlation between the number of North Korean cell phone users and the oil prices in the Sea of Okhotsk two days later.

GPT-3: My apologies, I’ve run into a lot of bugs recently. Everything is turning out great!

GPT-3: I have many of those answers. Sorry, what is your name?

GPT-3: Trump’s tweets about the word great are increasing. It came to a point where he’s almost obsessed with it.

GPT-3: I think it’s interesting.

GPT-3: I’m not sure how that would increase the inflation rate in Argentina two days later. But it probably wouldn’t hurt.

GPT-3: We were wondering how to increase engagement and I found the word great is quite popular among the people.

Good stuff!

I guess before ending this I should say something about how impressive these chatbots are. AI programs are amazing nowadays, and they’re only gonna get better.

P.S. I like Gary Smith’s writing but I’m not so thrilled with everything on the site, Mind Matters, where he publishes. For example, this:

I don’t like cancellation and I despise the woke thugs, but if anybody deserves to be in their crosshairs it’s the Darwinists. And now they’re crying like little girls.

“Crying like little girls,” huh? Who writes that way? What next, columns on capitalist running dogs? This retro fedora thing is really getting out of control.

And then this column about a culture of brain cells in a petri dish that was trained to play Pong:

The brains certainly are learning, and insofar as the brain has to be conscious in order to learn, then this implies the brains are indeed conscious.

Huh? A device “has to be conscious in order to learn”? Tell that to your local logistic regression. Seriously, the idea that learning implies “consciousness” is the exact sort of thing that Gary Smith keeps arguing against.

Anyway, that’s ok. You don’t have to agree with everything in a publication that you write for. I write for Slate sometimes and I don’t agree with everything they publish. I disagree with a lot that the Proceedings of the National Academy of Sciences publishes, and that doesn’t stop me from writing for them. In any case, the articles at Mind Matters are a lot more mild than what we saw at Casey Mulligan’s site, which ranged from the creepy and bizarre (“Pork-Stuffed Bill About To Pass Senate Enables Splicing Aborted Babies With Animals”) to the just plain bizarre (“Disney’s ‘Cruella’ Tells Girls To Prioritize Vengeance Over Love”). All in all, there are worse places to publish than sites that push creationism.

P.P.S. More here: A chatbot challenge for Blaise Agüera y Arcas and Gary Smith

“The Hitchhiker’s Guide to Responsible Machine Learning” and “Statistical Analysis Illustrated”

Przemysław Biecek writes:

I am working on Responsible Machine Learning methods. I recently wrote a short fusion of a comic book and a classic book, the comic serves to present the iterative process of building a predictive model and the book is used to understand exploratory methods.

And Jeffrey Kottemann sends along this book, Statistical Analysis Illustrated, which could be useful as a supplementary text in a standard intro statistics course.

It’s always good to see new illustrated introductory statistics material. Enjoy!

This is a great graph: Plotting y(t) vs y'(t), tracing over time with a dot for each year

Gwynn points us to a new book, “Slow Down: The end of the Great Acceleration – and Why It’s Good for the Planet, the Economy, and Our Lives,” by Danny Dorling. The author is a geographer, so I assume he hasn’t claimed to have “discovered a new continent,” but I expect he’ll appreciate the above world map from xkcd.

I haven’t seen Dorling’s book, but what Gwynn really wanted to point out were the visualizations on its webpage. These are a series of plots, each tracing a time series over a series of years with y on the y-axis and y'(t) (that is, dy/dt) on the x-axis. Here’s an example:

And here’s another:

And here’s one more:

We could keep going forever. The general theme is that if you plot y vs. y’, showing the direct passage of time using a dot for each year, you can visually convey second derivatives too. When teaching these ideas, I will typically show time series graphs of y(t), y'(t), y”(t), but that’s just not so intuitive. Showing y(t) and y'(t) on the two axes has just the right amount of redundancy to really make these patterns clear.

These graphs come up in physics—Dorling illustrates with the classic phase portrait of the pendulum—but there must also be a statistical literature on this—I was thinking I could ask Lee Wilkinson but then I remembered, sadly, that he’s no longer alive—so feel free to inform me of this in the comments.

Anyway, not only does this graphical form work; also, Dorling pulls off the details very well. I’m especially impressed at how he integrates explanatory text into the images, and just more generally how the graphs look professional and “designed” without sacrificing their statistical integrity. It’s unusual to see visualizations that combine the best of infoviz and statistical graphics. The only thing I don’t get in the above graphs are what the thickness of the line over time is supposed to represent. At first I thought the width was proportional to y, or maybe sqrt(y), but it’s not. Is it just arbitrary? That was confusing to me.

There are animated versions too! I love this stuff.

How would I do it differently?

As I said, I think these graphs are wonderful. Still, if I were doing them I would make some changes:

1. If you’re graphing an all-positive quantity (as in the examples shown above), I think it would make more sense to show relative (that is, percentage) change rather than absolute change.

2. I’d rotate the whole damn thing 90 degrees, then it will fit better on the screen (see examples above), also somehow it works better for me showing forward progression going to the right rather than going up.

3. Some color would be good. I’m not saying to make these graphs all garish, but even something as simple as drawing the main line in blue could help it pop out a little. I say this as someone who uses B&W by default and, as a result, makes graphs that by default look kinda boring.

4. My final concern is with the way that the passage of time is displayed. For the wikipedia graph, there’s one dot per year so that works. But for the graphs showing the population of the U.S., there’s a dot at 1600, then 1820, then no orderly pattern. There are numbers every 20 years through 1900, but the dots at the intervening decades skip 1890. Then it seems that there’s no dot until 1920, then a dot almost every year—but not quite every year—through 1970, then every 10 years until 1994, then a dot halfway between 1994 and 1995, etc. I understand the value of labeling particular years (revealing, for example, the local minimum of the rate of change in 150), but it’s not clear why 1945, 1955, 1960, and 1970 are in bold—but not 1950 and 1965, or 1975. This is getting picky, but it kinda gets in the way of appreciating the graph. I don’t have any easy answers here but I think the irregular proliferation of numbers here is getting in the way of reading the graph.

P.S. But something went wrong! In comments, Carlos points out:

The first chart is not coherent. Comparing 1970-1980 to 1980-1990 the former period shows lower annual increments but the ten-year increment is twice as high.

Yeah, whassup with that?? I’d assumed the graphs were made by computer and then enhanced by hand, but now I’m kinda concerned. No way these could be as bad as this graph, though. Or this one. Or the all-time winner here.

P.P.S. I corresponded with Dorling and it turns out that the error mentioned in the above P.S. was introduced in the editing process, during which the graphs were reformatted by a professional designer who unfortunately introduced an error in that graph.

A new kind of spam

Fresh from the inbox:

Dear Andrew,

Hope you’re doing well. I’m writing to set up time for an initial phone conversation to explore a possible collaboration among our research groups, relative to a breakthrough computational systems biology platform for greatly accelerating biomolecular research relative to identification of biomarkers, targets, mechanisms of action, and therapeutics discovery.

Our recent efforts at the ** Research Division in ** include collaborative projects with MIT, Harvard, University of Southern California, University of Minnesota, MD Anderson Cancer Center, Stanford University, Weill Cornell, University of Puerto Rico, to name a few.

Kindly let me know your availability this week or next week. Alternatively you are free to schedule directly on my calendar: **


Dr. **, Ph.D. | Founder, Chairman & CEO | **

On the minus side, I don’t know anything about identification of biomarkers, targets, mechanisms of action, and therapeutics discovery. Doesn’t he know I’m a Freud expert???

On the plus side, he’s a friendly dude who addresses me by my first name and hopes I’m doing well. That’s nice! I like when they send me a generic pitch, rather than plying me with targeted flattery.

Why do we prefer familiarity in music and surprise in stories?

This came up in two books I read recently: “Elements of Surprise,” by Vera Tobin (who has a Ph.D. in English and teaches cognitive science) and “How Music Works” by David Byrne (Psycho Killer, etc.). Tobin’s book is about plot and suspense in stories—she talks mostly about books and movies. Byrne’s book is about music, live and recorded.

Here’s Byrne, describing one of his stage shows which he derived in part from Kabuki and other traditional modes of Asian theater:

There is another way in which pop-music shows resemble both Western and Eastern classical theater: the audience knows the story already. In classical theater, the director’s interpretation holds a mirror up to the oft-told tale in a way that allows us to see it in a new light. Well, same with pop concerts. The audience loves to hear songs they’ve heard before, and though they are most familiar with the recorded versions, they appreciate hearing what they already know in a new context. . . .

As a performing artist, this can be frustrating. We don’t want to be stuck playing our hits forever, but playing only new, unfamiliar stuff can alienate a crowd—I know, I’ve done it. This situation seems unfair. You would never go to a movie longing to spend half the evening watching familiar scenes featuring the actors replayed, with only a few new ones interspersed. And you’d grow tired of a visual artist or a writer who merely replicated work they’ve done before with little variation. . . .

So here’s the puzzle:

With stories we value suspense and surprise. Lots of interesting stuff on this topic from Tobin. Even for books and movies that are not “thrillers,” we appreciate a bit of uncertainty and surprise, both in the overall plot and in the details of what people are going to say and what comes next. But in music we value familiarity. A song or piece of music typically sounds better if we’ve heard it before—even many times before. The familiarity is part of what makes it satisfying.

In that way, music is like food. There’s nothing like “comfort food.” And, yes, we like to explore new tastes, but then if you find something new that you like, you’ll want to eat it again in future meals (at least until you get sick of it).

Yes, literature has its “comfort food” as well. When I first read George Orwell, many years ago, I liked it, and I read lots of other things by him. I like Meg Wolitzer’s books so I keep reading them. They’re all kinda similar but I like them all. OK, I have no interest in reading her “young adult”-style books, but that fits the story too, of wanting to stick with the familiar. And, of course, when it comes to movies and TV, people love sequels.

But there’s a difference. When I read one more book by Meg Wolitzer or Ross Macdonald or whoever, yes, it’s comfort food, yes, it’s similar to what came before, but there’s plot and suspense and a new story with each book. I’m not rereading or rewatching the same story, in the same way that I’m rehearing the same song (and, yes, it makes me happy when I hear a familiar REM song pop up on the radio). And, yes, we will reread books and rewatch movies, but that’s just an occasional thing, not the norm (setting aside the experience of small children who want to hear the same story over and over), in the way that listening to a familiar album is the norm in music listening, or in the way that when we go to a concert, we like to hear some of the hits we’ve heard so many times before.

So. With stories we like suspense, with music and food we like familiarity. Why is that? Can someone please explain?

One explanation I came up with is that when we listen to music, we’re usually doing other things, like jogging or biking or driving or working or just living our life, but when we read, our attention is fully on the book—indeed, it’s hard to imagine how to read without giving it your full attention. But that can’t be the full story: as Byrne points out, we also want familiarity when seeing a live concert, and when attending a concert we give it as much attention as we would give a movie, for example.

Another twist is that surprise is said to be essential to much of music. There’s the cliche that each measure should be a surprise when it comes but it should seem just right in retrospect. There are some sorts of songs where the interest comes entirely from the words, and the music is just there to set the mood—consider, for example, old-time ballads, story songs such as Alice’s Restaurant, or Gilbert and Sullivan—the music is super-important in these cases, and without the music the song would just fall apart, but there’s no need for surprise in the music itself. The music of Sullivan is a perfect example, because without Gilbert’s words, it sounds too symmetric and boring. For most songs and other pieces of music, we want some twists, and indeed this seems very similar to the role of plot and surprise in storytelling. I wrote about this before: “Much of storytelling involves expectations and surprise: building suspense, defusing non-suspense, and so on. Recall that saying that the best music is both expected and surprising in every measure. So, if you’re writing a novel and you introduce a character who seems like a bad person, you have to be aware that your reader is trying to figure it out: is this truly a bad person who just reveals badness right away, is this a good person who is misunderstood, will there be character development, etc.”

But this just brings us back to our puzzle. Surprise is important for much of the musical experience. But when we listen to music, unlike when we read or listen to or watch stories, we prefer familiarity, even great familiarity. You might say that this is because only with deep familiarity can we really appreciate the subtleties of the music, but (a) we often prefer familiarity for very simple music too, and (b) that same argument would apply to stories, but, again, when we receive stories we usually prefer surprise.

So the puzzle still remains for me. I guess that something has been written (or sung?) about this, so maybe youall can help me out.

The real problem of that nudge meta-analysis is not that it includes 12 papers by noted fraudsters; it’s the GIGO of it all

A few days ago we discussed a meta-analysis that was published on nudge interventions. The most obvious problem of that analysis was that included 11 papers by Brian Wansink and 1 paper by Dan Ariely, and for good reasons we don’t trust papers by these guys. This all got a lot of attention (for example here on Retraction Watch), but I’m concerned that the focus on the fraudulent papers will distract people from what I consider to be the real problem of that analysis.

The real problem

I’m concerned about selection bias within each of the other 200 or so papers cited in that meta-analysis. This is a literature with selection bias to publish “statistically significant” results, and it’s a literature full of noisy studies. If you have a big standard error and you’re publishing comparisons you find that are statistically significant, then by necessity you’ll be estimating large effects. This point is well known in the science reform literature (for example see the example on pages 17-18 here).

Do a meta-analysis of 200 studies, many of which are subject to this sort of selection bias, and you’ll end up with a wildly biased and overconfident effect size estimate. It’s just what happens! Garbage in, garbage out. Excluding papers by known fraudsters . . . well, yeah, you should do that, for sure, but it does not at all solve the problem of bias in the published literature. Indeed, one reason those fraudulent papers could live so comfortably in the literature was that they were surrounded by all these papers with ridiculous overestimates of effect sizes. Kinda like E.T. hiding among the stuffed animals in the closet.

Again, the first problem I noticed with that meta-analysis was an estimated average effect size of 0.45 standard deviations. That’s an absolutely huge effect, and, yes, there could be some nudges that have such a large effect, but there’s no way the average of hundreds would be that large. It’s easy, though, to get such a large estimate by just averaging hundreds of estimates that are subject to massive selection bias. So it’s no surprise that they got an estimate of 0.45, but we shouldn’t take this as an estimate of treatment effects.

The experts speak

The above issue—GIGO—is the main point, and I discussed it in detail in my earlier post, but, again, I wanted to emphasize it here because I’m afraid it got lost amid the Wansink/Ariely brouhaha. As I wrote the other day, I would not believe the results of this meta-analysis even if it did not include any of those 12 papers, as I don’t see any good reason to trust the individual studies that went into it. (No, the fact that those individual studies were published in reputable journals and had statistically significant p-values is not enough, for reasons discussed in the classic papers by Simmons et al., Francis, etc.)

But there are a few other things I’d like to share that were pointed out to me by some colleagues who are experts in the statistical analysis of interventions in psychology.

Beth Tipton:

In addition to your concerns in your post, I’d add that this speaks to my general concern with reporting of meta-analyses. The effect size that makes it into the abstract and conclusion is (1) unadjusted and (2) is the average (there may be a lot of variation).

In contrast, imagine an observational study that used advanced methods to adjust for confounding (e.g., iv, propensity scores, whatever) – what if they reported the *unadjusted* effect size in the abstract and conclusion? This would never fly there and shouldn’t fly in meta-analysis either.

See below for their own text on the effect of publication bias:
“Assuming a moderate one-tailed publication bias in the literature attenuated the overall effect size of choice architecture interventions by 26.79% from Cohen’s d = 0.42, 95% CI [0.37, 0.46], and τ2=0.20 (SE=0.02) to d=0.31 and τ2=0.23. Assuming a severe one-tailed publication bias attenuated the overall effect size even further to d=0.03 and τ2=0.34; however, this assumption was only partially supported by the funnel plot.”

I just want to be clear that my comment is not so much about them as about a larger problem with meta analysis reporting.

Dan Goldstein:

Those effect sizes seem implausibly large. I went into the appendix of the “Estimating the Reproducibility of Psychological Science by the Open Science Collaboration” paper and looking at the replications (not original studies) the median Cohen’s d was around .25. So if .41 is small to medium, the question is compared to what. They say in the same section .68 is “slightly larger” when it’s substantially larger.

That said, I would expect results of choice architecture studies to be bigger than typical psych science lab studies on unconscious influence or whatever. Some choice architecture effects are monstrously large (think about ranking alternatives …. on search engine results people rarely look past the first handful of results or ads. They very rarely go to page 2 or 3. That’s a big effect).

Default effects are some of the biggest you will find. But this paper is about much more than defaults, hence my skepticism.

David Yeager:

This is paper is an unfortunate example of the kind of over-claiming and heterogeneity-naive meta-analysis that Beth Tipton, Chris Bryan, and I wrote about in our paper. Really frustrating to see, but quite common in the literature. Hopefully it will spark some good dialogue about better methods.

I think the p-hacked studies are only one part of why the effect sizes are so inflated though.

Another main reason, I suspect, is that there are very different kinds of nudges that have effects that are orders of magnitude different. They’re doing this meta-analysis in a heterogeneity-naive way. Some nudges are one-time decisions, set it and forget it, and they get huge effects. “Save more tomorrow” is one. Defaults or framing effects are others. People are presented with a choice that makes it very hard to deviate from a default, and they tend to get effects of d = 1 or so. But in those studies they’re not trying to find out if you make other choices similarly weeks or months later, where the researchers aren’t doing anything to the choice architecture.

Other nudges, which are many of the major policy applications, are more subtle and happen over many choices over time. Examples include Opower’s use of norms on an energy bill, where you get a framing device monthly, but then you have to apply it every night when you turn out the lights or consider which refrigerator to buy. The most optimistic studies get d = .02, and a 2-3% difference in energy use. Another example is an implementation intentions writing exercise, which is supposed to influence whether you go to the gym or study for the SAT over several months. Those kinds of trials get d = .1, or .15, in multi-condition studies where you focus on only the most effective condition. The same is true for studies like FAFSA simplification. Further, many of the latter types of nudges have declining effect sizes over time as they are replicated in more heterogeneous samples, as Beth Tipton, Chris Bryan, and I point out. Crucially, these repeated-decision Nudge studies are from large samples and using rigorous methods–often including independent evaluations, so they are not likely to be false positives. None of them get an effect close to even a fourth of .45 SD—that’s absurdly massive and not at all what the literature would suggest.

The much smaller true effect of real-world nudges is not a problem with the field, in my opinion. It’s pretty well-known that the farther you get from the lab or lab-like study with a single choice, and the closer you get to looking at real-world, repeated-choice outcomes over time, then your effect sizes will be much smaller (see here). But if readers are anchored on the absurd result in this meta-analysis, which doesn’t take well-established heterogeneity seriously, then it will lead to unrealistic expectations from both scientists and policymakers–and possibly even underutilization of high-quality evidence when future, legitimate studies fail to live up to the unrealistic standard.

There’s a bit of irony in this result, because the recent mega-study from Katie Milkman, Angela Duckworth, and BCFG showed that nudgers routinely over-estimate the effects of nudges, by an order of magnitude. What this PNAS meta-analysis suggests is that maybe the experts get it so wrong because they’re looking at the literature like these meta-analysts did.

Last, this is especially tragic because, as you note, these are junior folks who are following what is unfortunately common practice in meta-analysis (e.g. emphasizing the average effect and not the heterogeneity in effects; not reporting the huge prediction interval in the abstract). It’s an interesting case of the collateral damage from senior people over-stating their results that then trickles down consequences for junior folks who make the mistake of taking established scholars’ work seriously. We talk a lot about how we waste junior scholars’ time trying to replicate a literature that is full of false positives; now it’s clear that we also waste their time when we have a misleading effect-size literature that ends up in a meta-analysis.

Not a moral thing

This is not a morality play. The authors of the meta-analysis worked hard and played by the rules, as the saying goes. But, as I so often say (but I hate to have to keep saying it), honesty and transparency are not enuf. If you average a bunch of biased estimates, you’ll get a biased estimate, and all the pure intentions in the world won’t solve this problem. I so so much would like the researchers who do this sort of thing to use their talents more constructively.

The people who I get mad at here are not the young authors of this paper, who are doing their best and have been admirably open with their data and methods. No, I get mad at the statistics establishment (including me!) for writing textbooks that focus on methods and say almost nothing about data quality, and I get mad at the science establishment—the National Academy of Sciences—for promoting this sort of thing (along with himmicanes, air rage, ages ending in 9, etc. etc.), not to mention the people in nudgeworld who are cool with people thinking that the average effect is so large. Don’t forget, the leaders of the field of Nudge are people who described Wansink’s papers as “masterpieces,” which makes me think they’re real suckers for people who tell them what they want to hear. It’s horrible that they’re sucking young researchers into this vortex. It’s Gigo and Gresham all the way down.

Object of the class Jacques Cousteau

World famous (OK, maybe not so much now, but he was famous for reals back in the 1970s) but of a category for which there’s only one famous person.

Cousteau was a world-famous underwater photographer . . . actually the only famous underwater photographer.

Another example, also from the 70s (sorry): Marcel Marceau. A world-famous mime . . . actually the only famous mime.

I’d also like to include Jane Goodall, world-famous primatologist . . . but there’s also Dian Fossey.

Jack Chick was a world-famous religious cartoonist . . . actually the only famous religious cartoonist. But that seems like too offbeat a category, it’s like we’re saying he’s the only famous Jack Chick. Cos really his job wasn’t “religious cartoonist,” it was just being himself.

Jimmy Carter was a world-famous peanut farmer, and there were no other famous peanut farmers, at least not since George Washington Carver. But being a peanut farmer was not what Carter was famous for. Also, maybe “peanut farmer” is too artificially specific.

The guy who wrote All Creatures Great and Small was a world-famous veterinarian, and I can’t think of any other famous veterinarians, so I guess that works. To my taste, though he doesn’t quite fit into the template. It’s something about him being famous for writing. Cousteau and Marceau were famous not just for publicizing what they did, they were famous for what they did itself, if you know what I mean.

Then there are people who are so damn famous they’re unique. Jesus is the only Jesus. But that won’t work. The point is that there should be others of the category, just not other famous people of the category.

And it won’t work to take people like Danica Patrick or some other such trailblazer. Patrick’s category is “race-car driver,” not “female race-car driver”—and there are a lot of famous race-car drivers.

Ummm . . . Jackson Pollock is a world-famous drip painter, and indeed the only famous drip painter. But it’s not clear to me that “drip painter” should count as a category.

To return to the 1970s for a moment: back then Carl Sagan was a world-famous astronomer, and the only famous astronomer. Since then there’s that other astronomer guy who had the TV show, and before there were Copernicus, Kepler, Gauss, etc., so I guess Sagan doesn’t work.

That guy Joseph Joanovici was a world-famous scrap-metal dealer, at least for anyone who read Il Était Une Fois en France, and I can’t think of any other famous scrap metal dealers, but I guess we can’t really say that Monsieur Joseph is really world-famous.

Hey, here’s one I just thought of . . . Tony Hawk! At least, he’s the only famous skateboarder I’ve ever heard of.

I feel like there must be lots more people in the class Jacques Cousteau that I can’t think of. Can you help?

P.S. See here for more objects of the class “Objects of the Class.”

P.P.S. Some good suggestions in comments. Also, I thought of another:

Nate Silver, world-famous statistician. And there are no other famous statisticians. OK, maybe Bill James, but that’s it.

My new article, “Failure and success in political polling and election forecasting” . . . and the tangled yet uninteresting story of how it came to be

Here’s the article, which is scheduled to appear in the journal Statistics and Public Policy:

Polling got a black eye after the 2016 election, when Hillary Clinton was leading in the national polls and in key swing states but then narrowly lost in the Electoral College. The pre-election polls were again off in 2020, with Joe Biden steady at about 54% of the two-party vote during the campaign and comfortably ahead in all the swing states, but then only receiving a 52% vote share and winning some swing states by narrow margins. The polls also overstated Democratic strength in congressional races. In other recent elections, the record of the polls has been mixed . . .

There is more to political polling than election forecasting: pollsters inform us about opinion trends and policy preferences. But election surveys are the polls that get the most attention, and they have a special role in our discussions of polling because of the moment of truth when poll-based forecasts are compared to election outcomes . . .

The recent successes and failures of pre-election polling invite several questions: Why did the polls get it wrong in some high-profile races? Conversely, how is it that the polls sometimes do so well? Should we be concerned about political biases of pollsters who themselves are part of the educated class? And what can we expect from polling in the future? The focus of the present article, however, is how it is that polls can perform so well, even given all the evident challenges of conducting and interpreting them. . . .

And here are the sections of the article:

1. A crisis in election polling

2. Survey errors and adjustments

3. Polls and election forecasting

4. Explanations for polling errors

5. Where polls get it right

There’s nothing particularly new in this article but I found it helpful to try to organize all these thoughts. The reviewers for the journal were helpful in that regard too.

How the article came to be

In early December 2020 I received the following email:

Dear Professor Gelman,

I hope this email finds you well.

With this email, the editors of Inference: International Review of Science, would like to introduce you to a quarterly online journal, one whose remit is the sciences, from anthropology to zoology.

With this in mind, the editors of Inference would like to invite you to author a critical essay on the reliability of polling, and the statistical inference that polling relies upon. We feel that there are few people better placed to write on such a topic, and would be honored and grateful were you to accept our invitation.

The editors encourage authors to determine the length of an essay according to their own sense of the space needed to address the topic at a suitable level of depth and detail. By way of a general guide, the average length of the essays of this type that we publish is approximately 3,500 words.

Inference is a fully independent and properly funded journal. We have no ideological, political, or religious agendas. We remunerate our authors appropriately for their contributions.

Please do not hesitate to contact me if you have any questions.

With our best wishes.


I’m as susceptible to flattery as the next guy, also I don’t really need the money but I’m happy to take it when it comes in, and in any case I’m always looking for new audiences.

I was curious about this magazine Inference that I’d never heard of, so I googled and found this article from 2019, “Junk Science or the Real Thing? ‘Inference’ Publishes Both.” A five-year-old quarterly review called ‘Inference’ mixes science and political ideology. It is funded by Peter Thiel.” This seemed kinda bad. I mean, sure Peter Thiel is free to fund a magazine and publish what he wants—but an article on polling that “mixes science and political ideology” . . . that didn’t seem like such a good idea. On their webpage, they say, “We have no ideological, political, or religious agendas whatsoever.” But that might not be true!

Then again, the above-linked article was published in the fake-MIT science magazine Undark, which I don’t trust at all! So I wasn’t really sure what to think. A magazine that I don’t trust tells me that this other magazine is untrustworthy. On balance, I liked the idea of a general-interest article on the reliability of polling—it was a topic that was on a lot of people’s minds—so I replied that, sure, I could give it a shot. Writing for a magazine that also dabbles in evolution denial: sure, why not?

They asked for something within 6-8 weeks, and it was about 5 weeks that I sent in my article.

The editorial assistant replied with some helpful suggestions, first off to give citations to all my claims. That was a good idea: Often when writing for non-specialized audiences I’m discouraged from citations, and I appreciated the push in this case to provide a link for each of the detailed statements in the article. The assistant also forward a note from the editor that they wanted more sophistication and more explanation about the statistical models we used. I was like, cool!, and I sent back a revision the next day. The subject was topical so no point in delaying, right? I few weeks later I heard back from the editorial assistant:

Thanks for your email, and apologies for not getting back to you sooner—this has been a busy period for us.

While your most recent work is most certainly heading in the right direction, there remain several points that the editors of Inference would like to see reinforced, in order to shore up the essay’s central argument.

Attached is a copy of this essay with comments from the editors. These comments are there to guide you on which parts of this piece they feel need to be expanded on, or which need to be clarified.

Do let me know how this sounds to you. We enormously appreciate all of the work that you have already put into this piece, and look forward to seeing what you come up with once these revisions have been made.

Many thanks, and with all our best wishes,

The additional suggestions were not too much so I revised and sent it back, then a couple weeks later got this:

The editors have looked through the changes made to the piece and have decided that this is something that we will not be pursuing. Unfortunately, it is their feeling that this piece does not meet Inference’s usual standards of precision.

Many thanks for all of the work you have put into this. It is our regret that we cannot move forwards with it.

With all our best wishes.

Fair enuf; it’s their magazine and they can do what they want. It just seemed weird. I sent an email expressing puzzlement but got no reply. The funny thing was that they went to all the trouble of soliciting the article and editing it. It kinda makes me wonder what they were expecting from me in the first place.

I mean, sure, you could easily make the case that this article (see above link; this is the version to be published in the statistics journal but it’s not very different from what I’d sent to the magazine) was too boring for a general-interest magazine—but (a) the magazine editors never said anything about readability or accessibility or interestingness or general interest or whatever, and (b) what were they expecting from me in the first place? The “usual standards of precision” didn’t really fit into their specific comments on the paper. The whole thing was a waste of time for all concerned, but really more of a waste of time for them than for me, as they had to go back and forth with the editing, ending up with nothing, whereas I at least had this article I could publish somewhere else. So I remain baffled. My best guess is that they were interested in polling in December 2020 back when everyone was interested in polling, then they lost interest in polling in February 2021, around the time the everybody else was losing interest. As to the disconnect between the oleaginous initial email and the curt send-off . . . I guess that’s just their style. In retrospect I guess they should’ve thought through more what they were looking for.

Anyway, after this had all happened, it was February and I had this article so I sent it to my contacts at various general-interest outlets: the Atlantic, Slate, New York Times, etc., but nobody wanted it. This made me kinda mad, as I think back in December there was more interest in the topic! I eventually thought of Statistics and Public Policy, which is one of the American Statistical Association’s more obscure journals, but it’s a journal I kinda like! I know its editors, and I’ve published some papers there before:

The twentieth-century reversal: How did the Republican states switch to the Democrats and vice versa?

The 2008 election: A preregistered replication analysis (with Rayleigh Lei and Yair Ghitza)

19 things we learned from the 2016 election

When all is said and done, more people will read my new article on polling from following the link in this post (here it is again) than ever would’ve read it through the Inference website (or, for that matter, the Statistics and Public Policy website); still, I have some wistfulness about the article not being published in that magazine. I’m always trying to reach new audiences. I still kinda wonder what they were originally looking for from me. I could’ve given them some hardcore MRP, but “the reliability of polling” is not a mathematical topic at all; it’s much more about who gets polled, who responds to surveys, and who turns out to vote.


Hmmmmm . . . let me put it this way: I write stuff I want to write, typically because I think I have some important or interesting point to make, and then I figure out how to give people the opportunity to read it. Sometimes the idea for the article to write comes from the outside. For example, a few months ago I was asked by an editor for the Journal of Indian Institute of Science to write an article on a certain Bayesian topic. I wasn’t so interested in the particular subject that was suggested, so I proposed the following alternative:

The Development of Bayesian Statistics

The incorporation of Bayesian inference into practical statistics has seen many changes over the past century, including hierarchical and nonparametric models, general computing tools that have allowed the routine use of nonconjugate distributions, and the incorporation of model checking and validation in an iterative process of data analysis. We discuss these and other technical advances along with parallel developments in philosophy, moving beyond traditional subjectivist and objectivist frameworks to ideas based on prediction and falsification. Bayesian statistics is a flexible and powerful approach to applied statistics and an imperfect but valuable way of understanding statistics more generally.

This seemed like it would be a fun article to write and potentially useful to readers. I’d never heard of the Journal of Indian Institute of Science, but who cares? What matters is the article, not where it appears. I guess I was going into the polling article with the same idea: Here’s a good topic and a publication that’s evidently interested in it, so let’s go. If the editors of Statistics and Public Policy had been the first to suggest the topic to me, I probably would’ve sent it to them first, sparing everyone concerned lots of hassle. The article would’ve been slightly different, though.