I’m skeptical of that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”

A journalist pointed me to a recent research article, “The impact of a poverty reduction intervention on infant brain activity,” which stated:

Here, we report estimates of the causal impact of a poverty reduction intervention on brain activity in the first year of life. . . . Shortly after giving birth, mothers were randomized to receive either a large or nominal monthly unconditional cash gift. Infant brain activity was assessed at approximately 1 y of age in the child’s home, using resting electroencephalography (EEG; n = 435). . . . using a rigorous randomized design, we provide evidence that giving monthly unconditional cash transfers to mothers experiencing poverty in the first year of their children’s lives may change infant brain activity.

This research was also featured in the newspaper:

A study that provided poor mothers with cash stipends for the first year of their children’s lives appears to have changed the babies’ brain activity in ways associated with stronger cognitive development, a finding with potential implications for safety net policy.

The differences were modest — researchers likened them in statistical magnitude to moving to the 75th position in a line of 100 from the 81st — and it remains to be seen if changes in brain patterns will translate to higher skills, as other research offers reason to expect.

Still, evidence that a single year of subsidies could alter something as profound as brain functioning highlights the role that money may play in child development and comes as President Biden is pushing for a much larger program of subsidies for families with children.

I actually don’t think a difference of 6 percentile points would be so modest, if there were really evidence that it was happening. I mean, sure, you’re not turning all these kids’ lives around, but it’s the usual story. First, they’re just paying people $333 a month: this is about 20% of the average income of the mothers in the sample, so it’s a lot but not massive in all cases. Second, the effect will vary: a 6 percentile change on average could correspond to, approximately, an 18 percentile change for a third of the people. The point is that if there really were an average 6-percentile-point effect, I’d consider that to be respectable.

The other issue that was raised was external validity: you can change brain activity but will this change people’s lives? The argument in the published paper is that these brain activity patterns are associated with various good things—that’s the “as other research offers reason to expect” bit from the news report.

Here, though, I want to ask a more basic question: Do the data in the paper support the claim of “important evidence of the effects of increased income”?

At first glance

Are the paper’s claims supported by its evidence? From one perspective the answer would seem obviously to be yes, as it was published in a top journal, and it has this impressive figure:

According to the preregistration, they expected a positive effect in the alpha and gamma bands and a negative effect in the theta band, which is exactly what they found, so that looks good. On the other hand, I don’t see any uncertainty bounds on this graph . . . . we’ll get back to this point.

Also, if you read the abstract carefully, the claims are kind of hedged. Not, “We found an effect on brain activity” or “Giving money causes changes in brain activity,” but “using a rigorous randomized design, we provide evidence that giving monthly unconditional cash transfers to mothers experiencing poverty in the first year of their children’s lives may change infant brain activity.” No problem with the randomized design, but the “may” in the sentence is telling. And there’s this:

The preregistered plan was to look at both absolute and relative measures on alpha, gamma, and theta (beta was only included later; it was not in the preregistration). All the differences go in the right direction; on the other hand when you look at the six preregistered comparisons, the best p-value was 0.04 . . . after adjustment it becomes 0.12 . . . Anyway, my point here is not to say that there’s no finding just because there’s no statistical significance; I can just see now why there’s all that careful language in the abstract and the rest of the paper. Without a clean p-value, you don’t say you discovered an effect. You say you “may” have discovered something, or that the results are “suggestive,” or something like that. So they followed those rules.

Looking in detail

Before going on, I want to thank the article’s authors, Sonya Troller-Renfree, Molly Costanzo, Greg Duncan, Katherine Magnuson, Lisa Gennetian, Hirokazu Yoshikawa, Sarah Halpern-Meekin, Nathan Fox, and Kimberly Noble. Along with their article they include comprehensive supplementary material (including preregistration information) and access to all their data! This is a rare published research article where I can figure out what was really done.

I was wondering what was going on with the figure shown above, so I downloaded the data, which was easy! And I took a look. I know basically nothing about studies of brain activities, so I took the data as given.

My first plan was to follow their plan of separate analysis for each of the different brain bands (theta, alpha, beta, and gamma), but instead of straight differences, I’d first take the log—all the measurements are positive and it would seem reasonable to start with proportional effects—and, most importantly, include pre-test brain activity as a predictor. Thus, for each frequency band, fit y ~ z + x, where y = outcome (log brain activity), z = treatment indicator, x = pre-treatment measure. Then try including the interaction of x and z. That’s our usual workflow. And I’d plot y vs. x with blue dots for the controls (z=0) and red dots for the treated kids (z=1).
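Here’s a minimal sketch of what that workflow would look like in code; the column names (log_y, z, x) are placeholders, not the names in the released dataset:

```r
# Minimal sketch of the intended per-band analysis; log_y, z, and x are
# placeholder column names, not the names in the released data.
library(rstanarm)

fit_band <- function(dat) {
  fit_main <- stan_glm(log_y ~ z + x, data = dat, refresh = 0)        # treatment plus pre-test
  fit_int  <- stan_glm(log_y ~ z + x + z:x, data = dat, refresh = 0)  # add treatment-by-pre-test interaction
  list(main = fit_main, interaction = fit_int)
}

# Outcome vs. pre-test, controls in blue, treated in red:
plot_band <- function(dat) {
  plot(dat$x, dat$log_y, pch = 20,
       col = ifelse(dat$z == 1, "red", "blue"),
       xlab = "pre-treatment measure", ylab = "log brain activity")
}
```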

But I couldn’t follow that plan, because . . . there were no pre-treatment measurements of brain activity. I guess that makes sense: they weren’t gonna do these measurements on newborns! So, no pre-test. The study does have some individual demographic and socioeconomic variables, and I guess it makes sense to include them in the model, but I can’t imagine them having huge predictive power.

So let’s take a look at the data and see what we’ve got. The preregistration talked about two measures: Absolute and Relative Power. In my little analysis here I looked at the Absolute Power measure because this is what the authors seemed to be focusing on in their paper. So here are the raw data for the 435 children in the study: first the log measurements themselves, then relative to the mean, then the averages, relative to the mean at each frequency, for the treated (red) and control (blue) groups:

Sorry about the blurry graph; I’m just trying to get the point across without spending too much time struggling with the different graphics formatting.

Anyway, the two groups of children almost entirely overlap, except for the three blue curves at the bottom of the graph. One of them has really low values. The average curve looks similar to what was in the published paper, and we see average differences as high as 7%, which isn’t nothing. There is a question of whether such a large difference could’ve arisen just by chance . . . we’ll get to that in a moment.

But first let’s follow the authors’ lead and go back to the original, unlogged scale. First the raw data, then the z-scores (y – y_bar)/s_y, where y_bar is the average over the children in the data, s_y is the sd over the children in the data, and we’re doing this normalization separately for each frequency:

This looks like what they found! Again, full credit for a rare case of clean data sharing. I wish in the published article they’d shown the color graph with all 435 paths. It would’ve been easy enough to do.

At this point, I think it would make sense to construct a predictor using health, demographic, and socioeconomic variables measured before the treatment, and maybe I’ll get to that, but first let me move to the question of sampling variability. N=435 is a pretty large number of kids, but as we can see in the above graphs, there’s a lot of overlap between the red and blue curves, so as a baseline it would be good to see what could happen by chance alone.

To get the chance distribution, my first thought was to just look at sd/sqrt(n), and I guess that n is large enuf that the normal approximation would do the trick, but since we have the raw data right here I’ll just permute the 435 treatment assignments (keeping the same observations but randomly permuting the treatment assignment variable) and see what happens. I’ll reproduce the rightmost graph just above, and to see what might happen I’ll do it 9 times:
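For concreteness, here’s roughly what that permutation check looks like; the object names (z for the treatment indicator, eeg for the matrix of standardized power by frequency, freq for the frequency bins) are placeholders:

```r
# Rough sketch of the permutation check; z, eeg, and freq stand in for the
# treatment indicator, the 435 x n_freq matrix of standardized power, and
# the vector of frequency bins.
set.seed(123)
par(mfrow = c(3, 3), mar = c(2, 2, 1, 1))
for (s in 1:9) {
  z_perm <- sample(z)  # randomly permute the treatment assignments
  diff_curve <- colMeans(eeg[z_perm == 1, , drop = FALSE]) -
                colMeans(eeg[z_perm == 0, , drop = FALSE])
  plot(freq, diff_curve, type = "l", xlab = "", ylab = "")
  abline(h = 0, col = "gray")
}
```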

Hmmmm . . . the patterns in these random permutations don’t look so different, either qualitatively or quantitatively, from what we saw from the actual comparison.

Measurement and statistical power

What’s going on, then? The simplest summary is that there’s a reason they didn’t find statistical significance, however measured, as the data are consistent with no effect. As usual, the way we think of this is that there’s a lot of variation between people, and so with even a moderate sample size it will be difficult to detect small average differences. The authors of the paper wrote that the study was powered to detect differences of 0.21 standard deviations—but 0.21 standard deviations is a lot, when you consider all the differences between children, along with the many factors that affect them before birth and in the first year. If the true effect is, say, 0.07 standard deviations, then this study just isn’t up to finding it, at least not using this direct approach of calculating averages or running simple regressions. The authors also average over frequency bins, which seems like a good idea, but it doesn’t help as much as you might think because the individual paths are so highly autocorrelated. It also doesn’t help to analyze the data on the individual scales; I think log scale would make more sense if you could figure out what was happening with that one kid with the really low measurements. These are small things, though. In the absence of a pre-test measurement or more granularity in some other ways, it just doesn’t seem like there’s enough going on in these data for any average treatment effect to show up.
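To put rough numbers on this, here’s a back-of-the-envelope calculation assuming roughly equal arms of about 217 kids each and a simple between-group comparison of a standardized (sd = 1) outcome; the study’s actual allocation and power analysis will have used different assumptions:

```r
# Approximate power for a two-arm comparison with ~217 kids per arm
# (assumed numbers, not the study's own power calculation):
power.t.test(n = 217, delta = 0.21, sd = 1, sig.level = 0.05)  # power around 0.6
power.t.test(n = 217, delta = 0.07, sd = 1, sig.level = 0.05)  # power around 0.1
```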

Another way of saying this is that, to the extent that there is an effect, we would anticipate that effect to be highly variable, with some kids benefiting much more than others.

My impression from reading the article and the quotes in the newspaper was that the researchers were like, yeah, sure, it’s not quite statistically significant, but that’s just kind of a technicality because if you get enough data your p-values will go down. But that’s not right! I mean, yeah, it’s correct that if the estimate stays where it is, then as you get more data your p-values will go down—but you don’t know where the estimate is going to end up. The effect might be negative, not positive, or it could just look patternless.

There are, I assume, good theoretical reasons that this treatment could have an effect on brain patterns and learning ability—giving a few hundred extra dollars a month to a poor person can make a difference, and I’ll take the researchers’ word for it on the relevance of this particular measure as a proxy for some ultimate cognitive outcome of interest. But, again, it would not be at all a surprise for average effects to be small and to show patterns much different from those expected by the researchers.

So I think the main message from the data so far is not that there’s evidence for an effect, and not that the effect is zero (not statistically significant != no effect), but just that, given the design of the study, the data are too noisy to learn anything useful about the effects of this particular treatment on this particular outcome.

Conclusions

Where does that leave us? This study was a supplement applied to a subset of kids in a larger study of 1000 children. In retrospect maybe this was all a waste of effort—but I guess you couldn’t know this ahead of time. If there were strong theoretical reasons to believe an effect size of 0.21 standard deviations, then with careful statistical analysis this might’ve all turned out ok. And, once the data have been collected, it’s great that they are being shared. It’s also possible that useful things will be learned from later waves of the study. Recall that my big disappointment when considering statistical analysis was that there were no pre-treatment EEGs at age 0. We can’t go back in time, but it should be possible to do future comparisons.

I always recommend including a pre-test in the model, but it’s especially relevant here, given that a key part of this research involves the supposition that the EEG spectrum can be considered as an important descriptor of a child. I think that would imply that the spectrum, or aspects of it, are stable over time, so that adjusting for pre-test (maybe using simple linear regression, maybe some more sophisticated analysis) would give a huge benefit when trying to observe treatment effects. It often seems that the clean causal identification arising from randomized experiments leads researchers to not think carefully enough about including pre-tests in their designs and analysis. I’m speaking in generalities here, as I have no idea whether it would’ve been feasible to perform EEGs on newborns.

Finally, there is more in the study that I have not discussed. In particular, section 6 of the supplementary material presents data broken down by brain region, yielding results that the authors find to be consistent with their story of what is going on. I guess it might be possible to study this more carefully using a multilevel model. So for now I do not find the data convincing, but it is possible that a fuller analysis, along with the new data that will come in the future, will clarify some of these issues.

How do things work at top econ journals, exactly? This is one weird-ass story:

Someone pointed me to this webpage with a hilarious story of a dispute among academic economists.

The author is Kenneth Judd, and here’s his summary of what happened:

The documents below detail the battle I [Judd] had with JPE [Journal of Political Economy] regarding a paper I wrote and submitted to JPE in November, 2012.

Here is the short version:

1. JPE rejected the paper because we did uncertainty quantification, also known as sensitivity analysis.

2. I contacted Jim Heckman regarding this decision, and he told us to change the title, resubmit the paper and that he would handle the new submission.

3. Heckman said that he would reject the paper if — in public comments and not in the paper — I criticized the JPE editor for the hostility to uncertainty quantification.

4. I obeyed Heckman’s gag order while we worked on the revision. However, when we sent the revision to JPE I took my name off the paper.

5. I contacted the University of Chicago asking if Heckman’s threats violated their standards for freedom of speech. The answer from UC was emphatic approval of Heckman’s threats.

6. The paper came out in JPE in December, 2019.

In 2020, I asked JPE if I could publicly display the correspondence regarding this matter. They agreed. I post the documents below.
The emails below show that many people at UC were aware of this conflict. No one expressed any criticism of how JPE treated this paper. While my comments focus on my emails with Heckman, it is clear that all the people who knew about this fully supported Heckman in his threats. Heckman was merely the chief spokesman of a gang of UC professors and administrators who believe that it is appropriate for a journal editor to use his editorial power to silence those who want to criticize standard economic methodology.

The good news is that none of this happened in a parking lot, and it seems that nobody got hurt.

Whether Heckman’s email statements should be characterized as “threats” and whether they violate the University of Chicago’s standards for freedom of speech . . . I’ll let you make the call.

Here’s the relevant bit from Heckman:

So, yeah, he’s saying, Shut up about this or we won’t publish your article. In case that wasn’t clear enough, Heckman reiterated a couple days later:

I can see two arguments that this does not violate the university’s free speech principles:

1. Heckman’s threat was itself a response to a threat by Judd, who had written this to Heckman earlier that day:

2. Both Judd’s implicit threat and Heckman’s explicit threat were all in the context of a negotiation. It’s not clear to me that Heckman saying, “I’ll do X for you but only if you do Y for me” should really count as suppression of freedom of speech.

What really bothered me about this story

What bothered me was not anything specific that Heckman wrote—the guy’s known to be a bit of a firebrand—but rather the starting point, which is that Judd was a coauthor of a paper that got negatively reviewed at a top journal, and then he just started pulling string after string to get it favorably reviewed and then accepted for publication!

Judd writes, “My goal was to make any rejection very difficult. The primary task was to hit back at the hostile referees who claimed that there was nothing new in our paper compared to published papers, presumably ones written by the referees.” I get the idea of being strategic in my dealings with journal article revisions, but I’ve never even imagined I could “make any rejection very difficult.” I mean, how can you do that? The journal editors can reject if they want, right?

I’m not saying it was wrong for the journal to publish the article—really, I have no idea, I have no sense of what sorts of things they want to publish—I’m just stunned that these sorts of tactics would even be on the table.

And what about this, from one of Judd’s emails to Heckman:

Last winter, I pleaded with you to proceed in a quick manner so that this paper could be accepted by now. In particular, a JPE paper is like winning the lottery for Lontzek in the European job market since none of his other papers are in a top journal. Instead, we waited until late August to hear from you, and now have to deal with six new referee reports, some very hostile. This means that Karl will have to support Lontzek for yet another year, sucking up money that he would like to use for other people.

I mean . . . I can’t even. I’ve contacted journal editors asking them to reconsider rejections of my submissions, and sometimes it even works. It doesn’t hurt to try. The decision is up to the editors, and if you send such a request, you’re just supplying them with more information. But to ask the decision to be made because it’s “like winning the lottery”? What does this say about the field of economics, or maybe academia more generally? And to plead for journal acceptance to save a third party from having to pay someone’s salary? This is a bit of sausage-making I wish I’d never seen.

I guess it’s still not as bad as some of the corruption in academic psychology (see discussion here)—it’s not corruption at all, exactly—it’s just soooo different from my experience. Not that someone would be upset at an unfair review process or ask for redress, but that this would really work, to the extent that the author and journal editor start negotiating over details, and with the suggestion that publication decisions be made based on the budgetary concerns of third parties . . . wow. Just wow.

I applaud Judd making all this public. I just wonder if he has a sense of how bizarre this all seems from the outside. Not just Heckman’s emails or his emails—I get that both sides in this dispute were frustrated and angry, and a lot of us will say or write rash things in a fit of passion—but rather the idea that it’s just standard practice to negotiate publication in this way. I’ve been writing scientific articles for close to 40 years now, and I’ve never seen anything like it.

How large a sample size does he actually need? He got statistical significance twice; isn’t that enough??

I received the following from someone who’d prefer to be anonymous:

I recently got into a fairly heated debate with a colleague. My co-author and I had asked our colleague to read one of our papers before we submitted it to journals. It was a study in which we had hypothesized a two-way interaction involving continuous variables based on theory and previous research. We found the interaction, which was statistically significant (i.e. p less than .05). Our sample contained over 100 participants. We then replicated the experiment (to make sure it wasn’t a fluke) and again found precisely the same interaction, which was again significant at the .05 level. The second experiment also contained over 100 participants. Previous research had reported an extremely similar interaction between almost identical variables (which was one of the reasons why we hypothesized it in the first place). However, our colleague did not accept the results, saying that with “only” 100-150 participants our study was “underpowered” to find such an interaction. And so (in our colleague’s view) our results might just be a fluke. Our colleague argued that you would need several hundred participants (perhaps 500 or more) before you could “trust” such an interaction.

I am fully aware that small samples are noisy, as you regularly and helpfully point out. I also appreciate that interaction effects are harder to find than main effects – something you’ve again pointed out on your blog. But I confess I am not convinced by my colleague’s perspective. I have these questions:

1) Is it really the case that we shouldn’t trust statistically significant, hypothesised, replicated interactions (mirroring interactions in previous research) unless the sample is 400-500+?

2) I believe power calculations depend on the assumption of random sampling (?). If so, why should we use these calculations (or random sampling simulations) when this is not how data are collected? Like every other study in our field, our studies use convenience samples. So I’m struggling to see why mathematical calculations (that depend on random sampling) should be relied on to generate the “required” number of participants. It seems to me that that would make sense if you were plucking balls out of a machine (as in a lottery) but not when you’re doing applied research.

Then again I realize it could be argued that p-values themselves depend on random sampling so perhaps if you’re using p-values you should do and rely on power analyses/random sampling simulations, etc.

My reply:

There’s no absolute required sample size. It depends on context—or, more precisely, the residual standard deviation. Best case scenario is a within-person design with a continuous measurement, an excellent pre-test, and a large effect; in that case, a small sample could be just fine. Worst case is a between-person study with binary outcome, no pre-test, and a small effect; then even 500 might not be enuf.

My point is, first, your colleague may be right—maybe you did just get lucky with that p-value!—and, second, if you want to make progress in this discussion, I think you’ll need to get into the details: what is your outcome, what’s your pre-test, how much does the outcome vary, how much does adjusting for pre-test reduce that variation, and how large an effect can you expect to see. Get beyond thinking about the experiment as success or failure and get quantitative with your expectations.
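For example, here’s the kind of quick simulation I’d use to get quantitative. Every number in it (sample size, coefficients, residual sd) is a placeholder to be replaced with values that match the actual study:

```r
# Simulated power for a two-way interaction of continuous predictors;
# all numbers are placeholders, not estimates from the letter-writer's study.
sim_power <- function(n, b_int, b_main = 0.3, sigma = 1, n_sims = 2000) {
  pvals <- replicate(n_sims, {
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    y  <- b_main * x1 + b_main * x2 + b_int * x1 * x2 + rnorm(n, 0, sigma)
    summary(lm(y ~ x1 * x2))$coefficients["x1:x2", "Pr(>|t|)"]
  })
  mean(pvals < 0.05)  # proportion of simulations with a significant interaction
}
sim_power(n = 120, b_int = 0.2)  # try different sample sizes and interaction sizes
```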

So many people want to be anonymous! Statistics is a real cloak-and-dagger field.

Mister P when you don’t have the full poststratification table, you only have margins

Torleif Halkjelsvik from the Norwegian Institute of Public Health writes:

Norway has very good register data (education/income/health/drugs/welfare/etc.) but it is difficult to obtain complete tables at the population level. It is however easy to get independent tables from different registries (e.g., age by gender by education as one data source and gender by age by welfare benefits as another). What if I first run a multilevel model to regularize predictions for a vast set of variables, but in the second step, instead of a full table, use a raking approach based on several independent post-stratification tables? Would that be a valid approach? And have you seen examples of this?

My reply:

I think the right way to frame this is as a poststratification problem where you don’t have the full poststratification table, you only have some margins. The raking idea you propose could work, but to me it seems awkward in that it’s mixing different parts of the problem together. Instead I’d recommend first imputing a full poststrat table and then using this to do your poststratification. But then the question is how to do this. One approach is iterative proportional fitting (Deming and Stephan, 1940). I don’t know any clean examples of this sort of thing in the recent literature, but there might be something out there.
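To make the iterative proportional fitting idea concrete, here is a minimal two-margin sketch; real registry data would have more margins, and you’d want to seed the table with model-based cell estimates rather than a flat table (base R’s loglin() does the same fitting for general tables):

```r
# Deming-Stephan iterative proportional fitting for a two-way table,
# given only the row and column margins and a seed table.
ipf2 <- function(seed, row_margin, col_margin, n_iter = 100) {
  tab <- seed
  for (i in 1:n_iter) {
    tab <- tab * (row_margin / rowSums(tab))              # rescale to match row totals
    tab <- sweep(tab, 2, col_margin / colSums(tab), `*`)  # rescale to match column totals
  }
  tab
}

# Toy example: 3 age groups by 2 education levels, flat seed (no prior info):
seed <- matrix(1, nrow = 3, ncol = 2)
ipf2(seed, row_margin = c(300, 500, 200), col_margin = c(600, 400))
```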

Halkjelsvik responded:

It is an interesting idea to impute a full poststrat table, but I wonder whether it is actually better than directly calculating weights using the proportions in the data itself. Cells that should be empty in the population (e.g., women, 80-90 years old, high education, sativa spray prescription) may not be empty in the imputed table when using iterative proportional fitting (IPF), and these “extreme” cells may have quite high or low predicted values. By using the data itself, such cells will be empty, and they will not “steal” any of the marginal proportions when using IPF. This is of course a problem in itself if the data is limited (if there are empty cells in the data that are not empty in the population).

To which I replied:

If you have information that certain cells are empty or nearly so, that’s information that you should include in the poststrat table. I think the IPF approach will be similar to the weighting; it is just more model-based. So if you think the IPF will give some wrong answers, that suggests you have additional information. I recommend you try to write down all the additional information you have and use all of it in constructing the poststratification table. This should allow you to do better than with any procedure that does not use this info.

Controversies over high school math curriculum, a hope for the expectation of more parent involvement, despair over politicization, . . . the usual stuff.

Retired statistics professor Paul Alper writes:

Here is a flap you might want to blog about. I just found out about it in today’s Washington Post, January 21, 2022, although it is dated April 26, 2021, before GOP Youngkin was elected—indeed, before he was the nominee. The news article is called Virginia Not Moving to Eliminate Advanced Math Classes, and it says:

[Virginia] state officials recently began workshopping some ideas as to how Virginia could teach mathematics in a way that better prepared children for college and the workforce.

“The main thing I think the mathematics team is talking about at this time is, ‘How can we make sure that students have more skills in those mathematical areas that will help them after graduation?’ ” [Superintendent James] Lane said. “Every job in the future is going to need more focus on data.”

The ideas — detailed online as part of a program called the Virginia Mathematics Pathways Initiative — include rejiggering eighth-, ninth- and 10th-grade math courses to place a greater emphasis on fields including data science and data analytics, Lane said. Schools would still offer traditional courses such as Algebra I, Geometry and Algebra II, the superintendent said, but these courses would now “incorporate stronger foundations in data analytics,” for example.

Of course, there is opposition:

[Loudoun County School Board member Ian Serotkin] criticized it for what he said it would do to advanced math classes, claiming the proposal would force all seventh-graders to take the exact same math class, all eighth-graders to take the exact same math class, and so on through 11th grade.

“As currently planned, this initiative will eliminate ALL math acceleration prior to 11th grade,” Serotkin wrote. “That is not an exaggeration, nor does there appear to be any discretion in how local districts implement this.”

As always, more heat than light:

Republican gubernatorial candidate Glenn Youngkin said that, if elected, he would fire everyone involved with the proposal. Another Republican gubernatorial candidate, Del. Kirk Cox (Colonial Heights), tweeted that stopping the Pathways Initiative would form part of his “7-part plan [to] fight the radical left head on.”

“It’s time to put a stop to the left-wing takeover of public education in Virginia,” he tweeted.

So, is the contemplated change of the math curriculum worthwhile in this supposedly modern era and how deeply is it a sinister [redundant because “sinister” etymologically relates to left] left-wing takeover? Put another way, to what extent is the right-wing paranoia justified?

My reply:

The first step is to fix whatever aspect of math teaching led to that Williams College math professor endorsing the bogus election fraud claim using stupid null hypothesis significance testing.

Oops . . . I forgot! Statistics is trivial so any math professor who’s been assigned to teach a probability or statistics course is automatically a statistics expert!

More seriously, I don’t know anything about what’s going on in Virginia, but in general I’m supportive of more parent involvement in curricula. Parents are not the experts, but it seems like a good idea for schools to be able to justify their curricula to parents.

As a teacher, textbook writer, parent, and former student, I have the impression that much of education at all levels is about doing what was done before or doing what is convenient to teachers and school administrators. I’ve seen lots of incompetent teachers who keep doing the same empty thing. There’s no easy solution here—it’s not like there’s some untapped reservoir of great teachers who will do the job for the pay offered—but I like the idea that accountability to parents is an expectation, just to break the closed loop.

Regarding the Virginia thing, I like the plan to incorporate stronger foundations in data analytics: at least, it sounds good in general terms. I wouldn’t think that lots of parents would object to that, would they? Regarding the plan to eliminate all math acceleration prior to 11th grade: maybe they could have a test each year and allow students to jump ahead where they can? Otherwise, yeah, lots of boredom going on, in the same way that Spanish-fluent kids in elementary schools have to spend an excruciating hour or more a week, learning the numbers from 1 to 10, the colors, etc., year after year after year.

Finally, I’m unhappy about education reform getting politicized, but I guess it’s unavoidable; it’s been happening in one form or another my entire lifetime.

Kaiser Fung on Axis Zero and Spiral Charts

1. Axis Zero

Regarding the perennial question of whether to have your y-axis start at zero, I wrote, “If zero is in the neighborhood, invite it in.” Kaiser improves this advice by pointing out, first, that you should only worry about zero if it makes sense. I’d illustrated that point with the example of degrees Fahrenheit; Kaiser uses the example of a test score where nobody actually scores anything close to zero. Another example would be adult heights: zero height is meaningful in a literal sense but is not at all in the neighborhood of the data. More formally we could make this a statement about dynamic range, using a rule such as, “For all-positive data y, include zero if max(y)/min(y) > 2.” But only if “zero” has some meaning. In my experience, it usually does, but not always, and I’ve seen some pretty silly examples where people miss the point. For example, taking survey responses that are on a 1-5 scale and plotting them on a graph that goes down to zero. Don’t do that! In that case, better to put the axis at 3.
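As a rough encoding of that rule of thumb (assuming the data are all positive and that zero is actually meaningful for the measure being plotted):

```r
# Rule of thumb: include zero on the y-axis only if it's "in the neighborhood."
include_zero <- function(y) {
  all(y > 0) && max(y) / min(y) > 2
}
include_zero(c(160, 175, 190))  # adult heights in cm: FALSE, zero isn't in the neighborhood
include_zero(c(3, 10, 45))      # wide dynamic range: TRUE, invite zero in
```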

In his post, Kaiser also points out that, if your x-axis isn’t at y=0, you should take a look at where it is, in case you’re losing an opportunity to convey some information. Somehow this doesn’t seem like such a big deal for the x-position of the y-axis, maybe because x often represents time, with a continuous data stream so there’s no natural starting point.

2. Spiral Charts

Kaiser discusses an eye-catching graph of new covid cases that recently appeared in the newspaper and apparently was much discussed on twitter. Here it is:

I agree with Kaiser that a straight-up time series graph conveys the data much better: the time-series clearly displays each peak, as well as subtle details such as the fact that the initial peak was slow to decline, whereas later peaks were more symmetric:

I don’t think Kaiser’s graph is perfect, either: for my taste, he’s got too many numbers on the y-axis, which seems like a relic from the old days when graphs were used as look-up tables, and I find the horizontal axis labeling very confusing: I’d prefer some clarity showing 2020, 2021, and 2022 rather than these selected dates which I then need to decode.

Kaiser does a good job of explaining what got lost when going from the time series to the spiral: in short, the spiral has a lot less resolution—unless you really really stare at it, you can’t learn much more than the fact that the rate of new covid cases started at 0 and then had some cycles of increasing and decreasing, and is now high—which any reader of the newspaper already knew before seeing the graph! The curving and reflecting and stretching makes it just about impossible to see anything more than that (again, compare to the time series), and it also adds this weird artifact by which it looks like the line for 2022 is about to intersect the line from 2021, but that has no meaning as it’s entirely dependent on arbitrary scaling parameters. Also it’s not clear why the spiral doesn’t start at the center! Lots of puzzling things there.

But . . . what is the purpose of the spiral, exactly? It’s not to convey the data. For that, we have the time series graph! The spiral is there to look cool, to get our attention. That’s not such a bad goal. As Antony Unwin and I wrote, we can think of such infographics not as competing with statistical data visualizations but rather as illustrations:

Newspaper and magazine articles are often illustrated by photographs and cartoons which are pretty or shocking or attention-grabbing or interesting and in some way complement the substance of the article. We don’t generally criticize newspaper illustrations as being noninformative; they’re not really expected to convey substantive information in the first place. From that perspective, an infographic can be a good choice even if it does not clearly display patterns in the data.

We recommend the click-through solution: Start with the eye-catching infoviz, then click on it to get the statistical visualization (in this case, the time series plot), then click again to get the spreadsheet with the raw data.

Ideally you can have a graph that is both statistically informative and surprisingly beautiful, as with these recent examples from Danny Dorling. But usually we get just one or the other. I guess we can appreciate a striking infoviz for what it is, while also following Kaiser’s lead and being aware of what it’s missing.

Also, there is one thing that the spiral graph can give you, in principle, which is that it lines up different parts of the year, so in theory you could see patterns recurring each spring or whatever. In this particular example, though, we don’t see any notable annual-cycle patterns so that doesn’t really give us anything.

Did volcanic eruptions turn Venus into a hothouse?

Michael Way, Richard Ernst, and Jeffrey Scargle say maybe:

Large scale volcanism has played a critical role in the long-term habitability of Earth. Contrary to widely held belief, volcanism rather than impactors have had the greatest influence on, and bear most of the responsibility for, large scale mass extinction events throughout Earth’s history. We examine the timing of Large Igneous Provinces (LIPs) through Earth’s history to estimate the likelihood of nearly simultaneous events that could drive a planet into an extreme moist or runaway greenhouse, quenching subductive plate tectonics. This would end volatile cycling and may have caused the heat-death of Venus. With a conservative estimate of the rate of simultaneous LIPs, in a random history statistically the same as Earth’s, pairs and triplets of LIPs closer in time than 0.1-1 Myrs are likely. This simultaneity threshold is significant to the extent that it is less than the time over which the environmental effects persist.

I haven’t looked at the actual paper—it seems that they do some time series analysis; what I really wanna see is some sort of scatterplot—but it’s an interesting hypothesis, that’s for sure!

More on p-values etc etc etc

Deborah Mayo writes:

How should journal editors react to heated disagreements about statistical significance tests in applied fields, such as conservation science, where statistical inferences often are the basis for controversial policy decisions? They should avoid taking sides. They should also avoid obeisance to calls for author guidelines to reflect a particular statistical philosophy or standpoint. The question is how to prevent the misuse of statistical methods without selectively favoring one side.

This is from an article called, “The statistics wars and intellectual conflicts of interest.” The concept of an intellectual conflict of interest is interesting, and it’s all over statistics and its applications; I wouldn’t know where to start, and there’s definitely no place to stop once you get started on it.

Mayo got several people to comment on this article, and she put it all on her blog, for example here. She suggests we discuss it here, as she (accurately, I think) suspects that our readership would have a much different take on these issues.

The particular discussion I linked to is by John Park, who warns of “poisoned priors” in medical research. My response to this is that all parts of an analysis, including data model, prior distributions, and estimates or assumptions of costs and benefits, should be explicitly justified. Conflict of interest is a real problem no matter what, and I don’t think the solution is to use a statistical approach that throws away data. To put it another way: As Park notes, the tough problems come when data are equivocal and the correct medical decision is not clear. In that case, much will come down to assessed costs and benefits. I think it’s best to minimize conflict of interest through openness and feedback mechanisms (for example, prediction markets, which are kind of a crude idea here but at least provide a demonstration in principle that it’s possible to disincentivize statistical cheating). I mean, sure, if your data are clean enough and your variability is low enough that you can get away with a simple classical approach, then go for it—why not?—but we’re talking here about the tougher calls.

I won’t go through the discussions on Mayo’s blog one by one, but, yeah, I have something to disagree with about each of them!

A lot of the discussion is about p-values, so I’ll remind everyone that I think the problems with p-values are really problems with null hypothesis significance testing and naive confirmationism. I discuss this in my article, The problems with p-values are not just with p-values, and my post, Confirmationist and falsificationist paradigms of science. The trouble is that, in practice, null hypothesis significance testing and naive confirmationism are often what p-values are used for!

There’s also a separate question about whether p-values should be “banned” or whatever. I don’t think any statistical method should be banned. I say this partly because I used to work at a statistics department where they pretty much tried to ban my methods! So I have strong feelings on that one. The flip side of not banning methods is that I should feel no obligation to believe various Freakonomics, Ted-talk crap about beauty and sex ratio or the critical positivity ratio or the latest brilliant nudge study, just cos it happens to be attached to “p less than 0.05.” Nor should anyone feel obliged to believe some foolish analysis just because it has the word “Bayes” written on it. Or anything else.

Anyway, feel free to follow the above links and draw your own conclusions.

Could voting restrictions be increasing election fraud?

Webcast Thursday, 27 January at 12 PM ET with Jonathan Auerbach:

We estimate the change in the reported number of voter fraud cases when states switch to conducting elections by mail. We consider states where many voters have historically received ballots by mail and a subset where registered voters are automatically sent ballots by mail. We compare the number of voter fraud cases in these states to the number of cases in remaining states using difference in differences and matrix completion (via nuclear-norm penalization). We find no evidence that voting by mail increases the risk of voter fraud overall. In fact, we estimate Washington would have reported 73 more cases of fraud between 2011 and 2019 had it not introduced its vote-by-mail law.

Here’s the research paper, “Does Voting by Mail Increase Fraud? Estimating the Change in Reported Voter Fraud When States Switch to Elections By Mail,” by Jonathan Auerbach and Steve Pierson. They do a clever thing and use descriptive section titles in the paper:

1 Vote by Mail (VBM) Refers to How Voters Receive Their Ballots; Most Voters That “Vote By Mail” Actually Return Their Ballots In Person

2 Fraud Rates Are Not Higher in RBM (VBM) States Than Non-RBM (non-VBM) States

3 Washington and Colorado Did Not Have Higher Fraud Rates When Compared to Similar “Synthetic” States That Did Not Switch to Voting by Mail

4 Discussion

About that A/C repairman story . . .

Paul Alper points us to this horrifying story of our modern world:

A former Houston police captain, who the authorities said was investigating a voter fraud conspiracy theory for a conservative activist group, was arrested and charged with pointing his gun at an air-conditioner repairman he had pursued to try to uncover fraudulent ballots, prosecutors said Tuesday.

The former captain, Mark A. Aguirre, 63, was arrested and charged with aggravated assault with a deadly weapon, and then freed on $30,000 bail, on Tuesday.

According to a police affidavit, Mr. Aguirre struck the repairman’s box truck with his sport utility vehicle on the morning of Oct. 19. When the man got out of his truck, Mr. Aguirre pointed a handgun at him, ordered him to get on the ground and pressed a knee into his back, it said.

Other people who arrived searched the repairman’s truck for ballots and, finding none, drove it away, the repairman said, according to the affidavit. The truck was found abandoned nearby.

Wait a minute! This guy held someone up at gunpoint, then other people stole this guy’s truck . . . and they didn’t arrest the perp for 2 months?? And the bail is only $30,000? Pointing a gun at someone is pretty serious, no?

Also this:

Terry W. Yates, a lawyer representing Mr. Aguirre, questioned why the charges were not filed until nearly two months after the episode.

“It’s a political prosecution,” Mr. Yates said, adding that the confrontation on Oct. 19 began as a fender-bender and was now being used to undermine the Liberty Center. “The powers that be in Harris County are attempting to use this as a diversionary tactic to show there’s no voter fraud here, which is ridiculous,” he said.

This just baffles me further. Holding up someone with a gun and stealing his truck . . . that’s illegal no?

And then this:

The Houston police said in a statement that “a lengthy investigation by H.P.D. determined allegations of election fraud were unfounded and no evidence of illegal ballots was found” before the case was referred to the district attorney’s office.

What the hell? You can hold someone up at gunpoint and if you just say “illegal ballots” the cops will let you stay free for two months until they have time for a lengthy investigation?

And then this:

On the day of the confrontation, the repairman initially thought he was being robbed and feared for his life, according to an affidavit from Detective John Varela of the Houston Police Department.

Mr. Aguirre told Detective Varela that he and the Liberty Center had been investigating a ballot harvesting conspiracy, according to the affidavit, and that he and his friends had been surveilling the repairman’s home for four days.

B-b-but . . . his lawyer said it was a fender-bender! Which is it, a fender-bender or was he staking out the guy for days before assaulting him?

OK, here’s the story:

Mr. Aguirre said that he had been conducting surveillance when he accidentally crashed into the repairman’s truck, adding that he then got out of his vehicle and pointed his gun at the repairman, according to the affidavit.

That’s a pretty unusual fender-bender, when it’s the crasher, not the crashee who pulls out the gun.

I still can’t figure out why it took them 2 months to arrest this guy or why they gave him such a low bail. He’s going around threatening people with a gun!

Whitehead, no Russell. Chicken, no chaser.

From an article about Colson Whitehead:

Whitehead stopped at the corner of Morningside Avenue, the location of Carney’s shop in the novel. (“This used to be a fried chicken joint,” he said, pointing out the M&G Diner sign still hanging above what is now a men’s clothing boutique.)

I went to that chicken place once! It was horrible. The suspicions began while we were waiting for the food. A delivery guy came in with a bag of takeout Chinese from some other place. That’s not a good sign, when the people who work in a restaurant order out for their dinner. And, indeed, the chicken was no good. I don’t remember the details (soggy? greasy? tasteless?), but, whatever it was, we never wanted to go back. And I like fried chicken. Who doesn’t, really? My go-to place now is the Korean fried chicken place on 106 St.—we refer to it as KFC. When I lived in California, there was an actual KFC two blocks from my house, and I went there once. No lie, I couldn’t keep that drumstick down. It was like eating a stick of butter. So gross. Popeye’s it wasn’t. I guess that quality varies across KFC franchises but I’m not ever gonna test that hypothesis.

P.S. I read that Whitehead story in the newspaper the other day. Searching for it online (googling *Colson Whitehead fried chicken*) yielded this amusing speech. It turns out that Whitehead is really into fried chicken. And if you read the above quote carefully, you see that he never said that the chicken at M&G was any good. Actually I’m guessing it used to be good but that it went through a change of management or chef at some point between its glory days and when I tried it out, which was a few years before it shut down. What really bums me out is that the Korea Mill (not the same as the above-mentioned Korean chicken place) closed. I don’t know the full story; I’m hoping the owners just chose to retire.

P.P.S. I was happy to learn that Whitehead, like me, is a fan of The Sportswriter, even though he is not impressed by everything written by the author of that novel.

Not-so-obviously heuristic-proof reforms to statistical communication

This is Jessica. I’ve subscribed to aspects of the “estimation” movement–the move toward emphasizing magnitude and uncertainty of effects and testing multiple hypotheses rather than NHST–for a while, having read this blog for years and switched over to using Bayesian stats when I first became faculty. I try to write results sections of papers that focus on the size of effects and their uncertainty over dichotomous statements (which by the way can be very hard to do when you’re working under strict page limits, as in many computer science venues, and even harder to train students to do). I would seem to be a natural proponent of estimation given that some of my research has been about more expressive visualizations of uncertainty, e.g., rather than using error bars or even static depictions of marginal distributions because they invite heuristics, we should find ways to present uncertainty that make it concrete and hard to ignore (sets of samples across time or space).

But something that has irked me for a while now is what seems to be a pervasive assumption in arguments for emphasizing effect magnitude and uncertainty: that doing so will make the resulting expressions of results more robust to misinterpretation. I don’t think it’s that simple.

Why is it so easy to think it is? Maybe because shifting focus to magnitude and uncertainty of effects implies an ordering of results expressions in terms of how much information they provide about the underlying distributions of effects. NHST p-values are less expressive than point estimates of parameters with confidence or credible intervals. Along the same lines, giving someone information on the raw measurements (e.g., predictive intervals) along with point estimates plus confidence intervals should make them even better off, since you can’t uniquely identify a sample distribution from a 95% CI. If we are talking about describing and discussing many hypotheses, that too would seem more expressive of the data than discussing only comparisons to a null hypothesis of no effect. 
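As a quick illustration of that last point (the numbers here are arbitrary, just a sketch):

```r
# Two very different samples can yield confidence intervals for the mean of
# essentially the same width (about 0.4 here), so the CI alone does not
# identify the underlying sample distribution.
set.seed(1)
a <- rnorm(100, mean = 0, sd = 1)      # n = 100, sd = 1
b <- rnorm(10000, mean = 0, sd = 10)   # n = 10000, sd = 10
t.test(a)$conf.int
t.test(b)$conf.int
```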

But is more information always better? In some of these cases (e.g., showing the raw data points plus the means and CIs) I would expect the more expressive representation to be better, since I’ve seen in experiments (e.g. here) that people tend to overestimate effect sizes when given information about standard error rather than standard deviation. But we are behavioral agents, and I think it’s possible that being served with some representation of effects higher on the information ladder will sometimes make us worse off. This is because people have cognitive processing limitations. Lots of research shows how, when faced with distributional information, people often satisfice by applying heuristics, or shortcut decision strategies, that rely on some proxy of what they really should consider to make a judgment under uncertainty.

I am still thinking through what the best examples of this are, but for now I’ll just give a few anecdotes that seem related to inappropriately assuming that more information should necessarily help. First, related to my own research, we once tested how well people could make effect size judgments like estimating the probability of superiority (i.e., the probability that a draw from a random variable B is greater than one from random variable A) from different representations of two normal distributions with homogeneous variance, including density plots, quantile dotplots, intervals, and animated hypothetical outcome plots, which showed random draws from the joint distribution of A and B in each frame. Unless we expect people to be able to mentally calculate probability of superiority using their estimates of the properties of each pdf, the animated plots should’ve offered the most useful information for the task, because all you really needed to do was estimate how frequently the draws from A and B changed order as they watched the animation. However, we didn’t see a performance advantage from using them – results were noisy and in fact people did a bit worse with them. It turns out only a minority (16%) reported using the frequency information they were given directly to estimate effect size, while the rest reported using some form of heuristic such as first watching the animation to estimate the mean of each distribution, then mapping that difference to probability. This was a kind of just-shoot-me moment for me as a researcher given that the whole point of the animated visualization was to prevent people from defaulting to judging the visual distance between means and mapping that to a probability scale more or less independently of the variance.
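For reference, when A and B are independent normals with a common sd, the probability of superiority has a closed form, and the distance-between-means heuristic effectively drops the variance term in the denominator; a sketch under those assumptions:

```r
# P(B > A) for independent normals with a common sd:
p_superiority <- function(mu_a, mu_b, sd_common) {
  pnorm((mu_b - mu_a) / (sqrt(2) * sd_common))
}
p_superiority(0, 1, 1)  # about 0.76
p_superiority(0, 1, 3)  # about 0.59: same mean difference, more variance
```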

Another example that comes to mind is a little more theoretical, but perhaps analogous to some of what happens under human heuristics. It’s based on a result related to how ordering of channels in an information theoretic sense can be counterintuitive. Imagine we have a decision problem for which we define a utility function, which takes in a measurement of the state of the world and an action that the decision maker selects and outputs a real-valued utility. For each possible state of the world there is a probability distribution over the set of values the measurement can take. The measurement process (or “noisy channel” or “experiment”) can be represented as a matrix of the probabilities of outputs the channel returns given inputs from some input distribution S. 

Now imagine we are comparing two different channels, k2 and k1, and we discover that k1 can be represented as the result of multiplying a matrix representing a post-processing operation with our matrix k2. We then call k1 a garbling of k2, capturing how if you take a measurement then do some potentially noisy post-processing, the result can’t give you more information about the original state. If we know that k1 is a garbling of k2, then according to Blackwell’s theorem, when an agent chooses k2 and uses the optimal decision rule for k2, her expected utility is always (i.e., for any input distribution or utility function) at least as big as that which she gets when she chooses k1 and uses the optimal decision rule for k1. This implies other forms of superiority as well, like that for any given input distribution S the mutual information between the channel output of k2 and S is higher than that of the channel output of k1 and S. All this seems to align with our intuitions that more information can’t make us worse off. 
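Here is a toy numerical version of that post-processing (garbling) setup, with made-up channel matrices whose columns give P(output | state):

```r
# Columns index states of the world, rows index outputs; columns sum to 1.
k2 <- matrix(c(0.8, 0.2,    # column 1: P(output | state 1)
               0.3, 0.7),   # column 2: P(output | state 2)
             nrow = 2)
post <- matrix(c(0.9, 0.1,  # a noisy post-processing of k2's output
                 0.2, 0.8),
               nrow = 2)
k1 <- post %*% k2  # k1 is a garbling of k2; by Blackwell's theorem it cannot
k1                 # yield higher expected utility under optimal decision rules
```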

But – when we consider pre-processing operations rather than post-processing (i.e., we are doing a transformation on the common input S then passing it through a channel), things get less predictable. For example, the result in the paper linked above shows that applying a deterministic function as a pre-processing step to an input distribution S can give us counterintuitive cases, like where the mutual information between the output of one channel and S is higher than the mutual information between the output of another channel and S for any given distribution, but the first channel is not Blackwell superior to the second. This implies that under pre-garbling a channel can lead to higher utility in a decision scenario without necessarily being more informative in the sense of representing some less noisy version of the other. I’m still thinking through how to best translate this to people applying heuristics to results expressions in papers, but one analogy might be that if you consider a heuristic to be a type of noisy channel, and a choice of how to represent effect distributions as a type of preprocessing, the implication is that it’s possible to have scenarios where people are better off in the sense of making decisions that are more aligned with the input distributions given a representation that isn’t strictly more informative to a rational agent. If we don’t consider the heuristics, the input distributions, and the utility functions along with the representations of effects, we might create results presentations that seem nice in theory but mislead readers. 

So instead of relying on our instincts about what we should express when presenting experiment results, my view is that we need to adopt more intentional approaches to “designing” statistical communication reforms. We should be seriously considering what types of heuristics people are likely to use, and using them to inform how we choose between ways of representing results. For example, do people become more sensitive to somewhat arbitrary characteristics of how the effects are presented when dichotomous statements are withheld, like where they judge how reliable they think they are by judging how big they look in the plots? Is it possible that with more information, some readers get less information because they don’t feel confident enough to trust that the estimated effect is important? On some level, the goal of emphasizing magnitude and variation would seem to be that we do expect these kinds of presentations to make people less confident in what they see in a results section, but we think in light of the tendency authors have to overestimate effects, diminishing confidence is a necessary thing. But if that’s the case we should be clear about that communication goal, rather than implying that expressing more detail about effect distributions, and suppressing more high level statements about what effects we see versus don’t see in results, must lead to less biased perceptions.  Another interesting example is to imagine that we’re comparing going from testing a single hypothesis, or presenting a single analysis path, to presenting a series of (non-null) hypotheses we tested, or presenting a multiverse made of plausible analysis paths we might have taken. These examples contribute more information about uncertainty in effects, but if people naturally apply heuristics like comparing positive versus negative results over the set of hypothesis tests or the set of analysis paths to help distill the abundance of information, we’ve missed the point. I’m not arguing against more expressive uncertainty communication, just pointing out that it’s not implausible that things might backfire in various ways.  

It also seems like we have to consider at some point how people interpret the authors’ text-based claims in a paper in tandem with any estimates/visualizations of the effects, since even with estimation-style reporting of effects through graphics or tables, authors still might include confident-sounding generalizations in the text. Do the text statements in the end override the visuals or tables of coefficients? If so, maybe we should be teaching people to write with more acknowledgment of uncertainty. 

At the end of the day, though, I don’t think a purely empirical or user-centered approach is enough. One-off human-subjects experiments on representations of uncertainty can be fraught when it comes to identifying the most important limitations of some new approach – we often learn only what we were anticipating in advance. So when I say more intentional design, I’m thinking also about how we might formalize design problems so we can make inferences beyond what we learn from empirical experiments. Game theory might be useful here, but information theory especially is an obvious tool for reasoning about the conditions (including assumptions about different heuristics, which might be informed by behavioral research) under which we can and cannot expect superiority of certain representations. And computer scientists might be helpful too, since they naturally think about the types of computation that different representations support and the complexity (and worst-case properties) of different procedures.

PS. I see Greenland and Rafi’s suggestions to re-express p-values as information-theoretic surprisals, or S-values, which behave better than p-values and can be understood via simple analogies like coin flips, as an exception to what I’m saying. Their work seems to take seriously the importance of understanding how people reason about semantics, and their cognitive limits, in the search for better representations.
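
For what it’s worth, the S-value as they define it is just a log transform, s = -log2(p), the number of bits of “surprisal” carried by the p-value. A quick sketch:

import math

def s_value(p):
    # Shannon surprisal of a p-value, in bits
    return -math.log2(p)

for p in [0.5, 0.05, 0.005]:
    print(f"p = {p}: S = {s_value(p):.1f} bits")

So p = 0.05 corresponds to about 4.3 bits, roughly as surprising as seeing all heads in four tosses of a fair coin.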

Djokovic, data sleuthing, and the Case of the Incoherent Covid Test Records

Kaiser Fung tells the story. First the background:

Australia, having pursued a zero Covid policy for most of the pandemic, only allows vaccinated visitors to enter. Djokovic, who’s the world #1 male tennis player, is also a prominent anti-vaxxer. Much earlier in the pandemic, he infamously organized a tennis tournament, which had to be aborted when several players, including himself, caught Covid-19. He is still unvaccinated, and yet he was allowed into Australia to play the Open. . . . When the public learned that Djokovic received a special exemption, the Australian government decided to cancel his visa. . . . This then became messier and messier . . .

In the midst of it all, some enterprising data journalists uncovered tantalizing clues that demonstrate that Djokovic’s story used to obtain the exemption is full of holes. It’s a great example of the sleuthing work that data analysts undertake to understand the data.

Next come the details. I haven’t looked into any of this, so if you want more you can follow the links at Kaiser’s post:

A central plank of the tennis player’s story is that he tested positive for Covid-19 on December 16. This test result provided grounds for an exemption from vaccination . . . The timing of the test result was convenient, raising the question of whether it was faked. . . .

Digital breadcrumbs caught up with Djokovic. As everyone should know by now, every email receipt, every online transaction, every time you use a mobile app, you are leaving a long trail for investigators. It turns out that test results from Serbia include a QR code. QR code is nothing but a fancy bar code. It’s not an encrypted message that can only be opened by authorized people. Since Djokovic’s lawyers submitted the test result in court documents, data journalists from the German newspaper Spiegel, partnering with a consultancy Zerforschung, scanned the QR code, and landed on the Serbian government’s webpage that informs citizens of their test results.

The information displayed on screen was limited and not very informative. It just showed the test result was positive (or negative), and a confirmation code. What caught the journalists’ eyes was that during the investigation, they scanned the QR code multiple times, and saw Djokovic’s test result flip-flop. At 1 pm, on December 10, the test was shown as negative (!) but about an hour later, it appeared as positive. That’s the first red flag.

Kaiser then remarks:

Since statistical sleuthing inevitably involves guesswork, we typically want multiple red flags before we sound the alarm.

He’ll return to the uncertain nature of evidence.

But now let’s continue with the sleuthing:

The next item of interest is the confirmation code which consists of two numbers separated by a dash. The investigators were able to show that the first number is a serial number. This is an index number used by databases to keep track of the millions of test results. In many systems, this is just a running count. If it is a running count, data sleuths can learn some things from it. This is why even so-called metadata can reveal more than you think. . . .

Djokovic’s supposedly positive test result on December 16 has serial number 7371999. If someone else’s test has a smaller number, we can surmise that the person took the test prior to Dec 16, 1 pm. Similarly, if someone took a test after Dec 16, 1 pm, it should have a serial number larger than 7371999. There’s more. The gap between two serial numbers provides information about the duration between the two tests. Further, this type of index is hard to manipulate. If you want to fake a test in the past, there is no index number available for insertion if the count increments by one for each new test! (One can of course insert a fake test right now before the next real test result arrives.)

Wow—this is getting interesting! Kaiser continues:

The researchers compared the gaps in these serial numbers and the official tally of tests conducted within a time window, and felt satisfied that the first part of the confirmation code is an index that effectively counts the number of tests conducted in Serbia. Why is this important?

It turns out that Djokovic’s lawyers submitted another test result to prove that he has recovered. The negative test result was supposedly conducted on December 22. What’s odd is that this test result has a smaller serial number than the initial positive test result, suggesting that the first (positive) test may have come after the second (negative) test. That’s red flag #2!

To get to this point, the detectives performed some delicious work. The landing page from the QR code does not actually include a time stamp, which would be a huge blocker to any of the investigation. But… digital breadcrumbs.

While human beings don’t need index numbers, machines almost always do. The URL of the landing page actually contains a disguised date. For the December 22 test result, the date was shown as 1640187792. Engineers will immediately recognize this as a “Unix date”. A simple decoder returns a human-readable date: December 22, 16:43:12 CET 2021. So this second test was indeed performed on the day the lawyers had presented to the court.

Dates are also a type of index, which can only increment. Surprisingly, the Unix date on the earlier positive test translates to December 26, 13:21:20 CET 2021. If our interpretation of the date values is correct, then the positive test appeared 4 days after the negative test in the system. That’s red flag #3.

To build confidence that they interpreted dates correctly, the investigators examined the two possible intervals: December 16 and 22 (Djokovic’s lawyers), and December 22 and 26 (apparent online data). Remember the jump in serial numbers in each period should correspond to the number of tests performed during that period. It turned out that the Dec 22-26 time frame fits the data better than Dec 16-22!
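
The Unix-date decoding, at least, is easy to check for yourself. Here’s a minimal sketch in Python using the December 22 timestamp quoted above:

from datetime import datetime, timezone, timedelta

CET = timezone(timedelta(hours=1))   # Central European Time in winter is UTC+1

ts = 1640187792                      # Unix date from the December 22 test result
print(datetime.fromtimestamp(ts, tz=CET))
# prints 2021-12-22 16:43:12+01:00, matching the date reported above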

But:

The stuff of this project is fun – if you’re into data analysis. The analysts offer quite strong evidence that there may be something smelly about the test results, and they have a working theory about how the tests were faked.

That said, statistics do not nail fraudsters. We can show plausibility or even high probability but we cannot use statistics alone to rule out any outliers. Typically, statistical evidence needs physical evidence.

And then:

Some of the reaction to the Spiegel article demonstrates what happens with suggestive data that nonetheless are not infallible.

Some aspects of the story were immediately confirmed by Serbians who have taken Covid-19 tests. The first part of the confirmation number appears to change with each test, and the more recent serial number is larger than the older ones. The second part of the confirmation number, we learned, is a kind of person ID, as it does not vary between successive test results.

One part of the story did not hold up. The date found on the landing page URL does not seem to be the date of the test, but the date on which someone requests a PDF download of the result. This behavior can easily be verified by anyone who has test results in the system.

Kaiser explains:

Because of this one misinterpretation, the data journalists seemed to have lost a portion of readers, who now consider the entire data investigation debunked. Unfortunately, this reaction is typical. It’s even natural in some circles. It’s related to the use of “counterexamples” to invalidate hypotheses. Since someone found the one thing that isn’t consistent with the hypothesis, the entire argument is thought to have collapsed.

However, this type of reasoning should be avoided in statistics, which is not like pure mathematics. One counterexample does not spell doom to a statistical argument. A counterexample may well be an outlier. The preponderance of evidence may still point in the same direction. Remember there were multiple red flags. Misinterpreting the dates does not invalidate the other red flags. In fact, the new interpretation of the dates cannot explain the jumbled serial numbers, which do not vary by the requested PDFs.

This point about weighing the evidence is important, because there are always people who will want to believe. Whether it’s political lies about the election (see background here) or endlessly debunked junk science such as the critical positivity ratio (see here), people just won’t let go. Once their story has been shot down, they’ll look for some other handhold to grab onto.

In any case, the Case of the Incoherent Covid Test Records is a fun example of data sleuthing with some general lessons about statistical evidence.

Kaiser’s discussion is great. It just needs some screenshots to make the storytelling really work.

P.S. In comments, Dieter Menne links to some screenshots, which I’ve added to the post above.

More on the oldest famous person

Following up on our discussion from the other day, Paul Campos writes:

Fame itself is a complex concept. For example, we have at least a couple of important variables to take into account:

(1) Cultural contingency. Someone can be immensely famous within a particular subculture but largely unknown to the broader public. A couple of examples that come to my mind are the historian Jacques Barzun, who lived to be 104 — I guess for a while he was a name that your typical New York Times reader might have sort of recognized — and the economist Ronald Coase, who died recently at 102.

Also too, I think it’s difficult to get a firm grasp on how much the fame of certain people is a function of the socio-economic background of the audience that makes them famous. Gelman suggests that the most famous really old person at the moment might be Henry Kissinger, but how famous is Kissinger in broader American culture at the moment? What percentage of Americans could identify him? This isn’t a rhetorical question: I really have no idea. I do suspect that the percentage of Americans who could identify Kim Kardashian is a lot higher, however. She’s an example of an intensely famous person who will be almost completely unknown in 50 years, probably, while a lot of people, relatively speaking, will still recognize Kissinger’s name then. So this is all very complicated.

This is most obviously true from a cross-national perspective. The most famous person in Thailand is somebody I’ve no doubt never heard of. Etc. So we’re talking from an early 21st century American perspective here. . . .

(2) Peak fame versus career fame, to riff off Bill James’s old concept of peak versus career value for baseball players. Somebody can be sort of famous for an extremely long time, while somebody else can be much more famous than the former person for a short period, but then much less famous over the long run. For example, Lee Harvey Oswald might have been one of the five most famous people in the world for a few weeks in 1963. Today I bet the vast majority of Americans don’t know who he was.

The second point reminds me of how transitory almost all fame ultimately is. History shows again and again that the vast majority of the most famous people of any era are almost completely forgotten within a couple of generations.

So Gelman’s question involves trying to meld a couple of deeply incommensurable variables — age, which is extremely well defined, and fame, which is an inherently fuzzy and moving target — into a single metric. . . .

These are all good points. Just to give a sense of where I’m coming from: I don’t think of Jacques Barzun or Ronald Coase as famous. I don’t even think of John von Neumann or Stanislaw Ulam as famous. Or Paul Dirac. These people are very accomplished, but, to me, true fame requires some breakthrough into the general population. Kim Kardashian, sure, she’s super-famous. Maybe in 100 years her name will have some resonance, the same sort of fame associated now with names such as Mary Pickford and Fatty Arbuckle?

I do think that peak fame should count for something. I’m looking at you, Mark Spitz. Also lifetime fame. I guess that Beverly Cleary was never a “celebrity,” but 70 years of Beezus and Ramona books were enough to keep her name in the public eye for a long time. This also makes it clear that there are lots and lots and lots of famous people.

What about people who were very famous for a short amount of time but were otherwise obscure? There’s the “Where’s the Beef” lady from 1984, but more generally lots and lots of actors in TV commercials. I remember when I was a kid, someone in school asked if my mom was the lady in the Clorox 2 ad. Back in the second half of the twentieth century, lots of people were briefly famous—or, their faces were famous—for being in ads that were given saturation coverage. Similarly, there are zillions of long-forgotten sex symbols . . . maybe Bo Derek would still be considered some kind of celebrity? And there were pop stars with #1 hits and lots of radio and TV stars. “The Fonz” would still count as famous, I think, but most of the other stars on that show, I guess not. You could play the same game with athletes. I’d still count Pete Rose as famous—some combination of having a high peak level of fame, staying at that peak for several years, holding a lifetime record, and staying in the news.

James Lovelock is arguably the oldest famous person on this list of living centenarians. If I had to make the call, I wouldn’t quite count Lovelock as famous. But I would say that he’s more famous than Jacques Barzun or Ronald Coase, in the sense that there was a time when Lovelock was “in the conversation” in a way that Barzun and Coase weren’t—even if they were greater scholars.

I think I’d still have to go with Norman Lear as oldest famous living person, with Henry Kissinger as the backup if you don’t want to count Lear as truly famous anymore. On the other hand, if Al Jaffee or Roger Angell somehow manage to live another 10 years, then I think they would count as famous. As Campos points out, every year you live past 100 is impressive, so if you’re even barely famous and you reach 110, that’s notable. To keep this conversation on track, if you look at that list of living centenarians, you’ll notice that the vast majority of them were never even close to famous. Many of them are accomplished, but accomplishment is not the same as fame.

Looking at these sorts of lists and seeing name after name of accomplished-but-not-famous people: this gives us a sense of the rarity of true fame.

Above I’ve defined, in some implicit sense, what I mean by “famous”—again, an early 21st century American perspective.

Here’s a question: according to these implicit criteria, how many famous people are alive today? Actually, let’s just restrict to people over the age of 80 so we don’t have to worry about how to count transient fame. (Will Novak Djokovic or Patrick Mahomes be famous in 50 or 60 years? Who can say?)

We can back out this number by starting with famous very old people and then using demographic calculations. By my definition, the two oldest famous people are Norman Lear (age 99) and Henry Kissinger (age 98). Some Googling seems to reveal that there are about 100,000 people in the U.S. over the age of 100. Lear and Kissinger are almost 100, so let’s just round up and say that, for these oldsters, approximately 2 in 100,000 are famous. So, according to this implicit definition, approximately 1 in 50,000 people achieve enduring fame, where “enduring” is defined as that, if you happen to be lucky enough to reach 100, you’re still famous. But even that is biased by my age. For example, I’ll always think of Barry Levinson as famous—he made Diner!—but, yeah, he’s not really famous, actually I guess he’s never been famous.

As Campos points out, another factor is that there are more famous men than famous women, but, each year, men are more likely to die than women. The breakeven point seems to be about 100: I guess that most famous 90-year-olds are men, but most famous 105-year-olds (to the extent there are any) will be women.

Finally, Campos writes, “The person I’ve found — again, from the perspective of current American culture etc. — who has the highest sustained fame to extreme age ratio is probably Olivia de Havilland. She died recently at the age of 104. She was extremely famous for a couple of decades, and still sort of famous when she died.” I’m still holding out for Beverly Cleary, who was born before and died later than de Havilland. But it’s a different kind of fame. De Havilland was a celebrity, which was never the case with Cleary.

P.S. Campos’s post has 545 comments! At first I was going to say I’m envious that he gets so many more comments than we do, but in retrospect I guess we have just the right number of comments here, giving a range of perspectives and sharing lots of interesting ideas, but few enough that I can read all of them and often reply.

P.P.S. I’d still like to see the sequence of oldest famous people (from the Anglo-American-European perspective, I guess), starting now and going backward through the centuries.

P.P.P.S. Luis Echeverría just turned 100. He was president of Mexico during the 1970s so there must be lots of people who know who he is.

“Deep Maps model of the labor force: The impact of the current employment crisis on places and people”

Yair Ghitza and Mark Steitz write:

The Deep Maps model of the labor force projects official government labor force statistics down to specific neighborhoods and types of people in those places. In this website, you can create maps that show estimates of unemployment and labor force participation by race, education, age, gender, marital status, and citizenship. You can track labor force data over time and examine estimates of the disparate impact of the crisis on different communities. It is our hope that these estimates will be of help to policy makers, analysts, reporters, and citizens who are trying to understand the fast-changing dynamics of the current economic crisis.

These are modeled inferences, not reported data. They should be seen as suggestive rather than definitive evidence. They have uncertainty around them, especially for the smallest groups. We recommend they be used alongside other sources of data when possible.

This project uses publicly available data sources from the Census, Bureau of Labor Statistics, and other places. A detailed explanation of the methodology can be found here; the code here.

This is worth looking at, and not just if you’re interested in unemployment statistics. There’s this thing in statistics where some people talk about data munging and other people talk about modeling. This project demonstrates how both are important.
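
I haven’t dug into the details of their methodology beyond the writeup linked above, but the general flavor (fit a model at the level of small groups, then adjust so the implied aggregate matches the official published number) can be sketched in a few lines. The numbers here are made up for illustration and are not from the Deep Maps model:

import numpy as np

# Made-up model-based unemployment-rate estimates for four demographic groups
# in some area, along with each group's share of the labor force
group_rate_hat = np.array([0.04, 0.07, 0.10, 0.05])
group_share    = np.array([0.40, 0.30, 0.20, 0.10])

implied_total  = np.dot(group_share, group_rate_hat)   # 0.062, what the model implies
official_total = 0.068                                  # published aggregate rate

# Simple proportional adjustment: scale the group estimates so the
# weighted average matches the official number
adjusted = group_rate_hat * (official_total / implied_total)

print(np.dot(group_share, adjusted))                    # 0.068, matches the official total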

Jobs using statistical modeling (including Stan) in biotech!

Nathan Sanders writes:

Montai Health is an early-stage biotechnology company developing a platform for understanding and leveraging complex molecular interactions within organisms to solve global challenges in human health and sustainability. The company leverages a multidisciplinary approach that integrates tools ranging from machine learning and big data to multi-omics and high-throughput screening. Montai Health was founded in Flagship Pioneering’s venture creation engine, which has conceived and created companies such as Moderna Therapeutics (NASDAQ: MRNA). Montai’s computational modeling group performs original model development ranging from Bayesian statistical modeling (using Stan!) of non-linear biological responses to machine learning with deep graph convolutional models and sequence representation models using frameworks such as PyTorch. The open positions are in chemical machine learning and computational biology, with an emphasis on sequence modeling.

And here are the two postings:

Job Application for Computational Biologist – Sequence Modeling at Flagship Pioneering, Inc., Cambridge, MA

Job Application for Machine Learning Scientist – Computational Chemistry at Flagship Pioneering, Inc., Cambridge, MA

Looks cool!

Full disclosure: I’ve done a little bit of consulting for these people.

What went wrong in the labeling of those cool graphs of y(t) vs. y'(t)?

Last week we discussed the cool graphs in geographer Danny Dorling’s recent book, “Slow Down.” Here’s an example:

Dorling is plotting y(t) vs y'(t), tracing over time with a dot for each year, or every few years. I really like this.

But commenter Carlos noticed a problem with the above graph:

Comparing 1970-1980 to 1980-1990 the former period shows lower annual increments but the ten-year increment is twice as high.

That’s not right!

So I contacted Dorling and he told me what happened:

The diagram has been mislabelled in the book – the dot labeled “1994” should actually be labeled “1990” (the labels were redrawn by hand by an illustrator).

I had not spotted that before. Below is the graph as I drew it before it went to the publisher. Thanks for pointing that out.

Spreadsheet also attached in case of use.

It’s interesting to compare Dorling’s graph, which already looks pretty spiffy, with the version at the top of this post drawn by the professional illustrator. Setting aside the mislabeled point, I have mixed feelings. Dorling’s version is cleaner, but I see the visual appeal of some of the illustrator’s innovations. One thing I’d prefer to see, in either of these graphs, is a consistent labeling of years. There are two dots below 1600, then a jump to 1800, then every ten years, then every one or two years?, then every ten years? then every year for a while . . . It’s a mess. Also I can see how the illustrator messed up on the years, because some of them are hard to figure out on the original version, as in the labeling of 1918 and 1990.

Dorling adds:

Spreadsheets are here.

Just click on “Excel” to get the graphs without the pendulums – and of course with the formulae embedded. There are a huge number of Excel graphs there, as there are many sheets, with graphs in each sheet (far more than in the original book).

The key thing folk need to know if they try to reproduce these graphs is that you have to measure the rate of change (first derivative) not at the actual point of change but from a fraction before and after the point you are interested in.

We put over 70 graphs in the paperback edition of the book so I’m happy with the error rate so far. The illustrator was lovely, but as soon as you edit graphs by hand errors will creep in.

She added quite a lot of fun symbols to some of the later graphs. Such as the national bird of each country on the baby graphs (so they were not all storks!)

If you send me albino to the blog I will tweet it.

I guess that last bit was an autocorrect error!

In all seriousness, I really like the graphs in Dorling’s book, and I also want to emphasize that graphs can be useful without being perfect. Often it seems that people want to make the one graph that does everything. But that’s usually not how it works. One of the key insights of the world of S, or R, or the tidyverse, is that much can be learned by trying out different visualizations of the same data. Indeed, “the data” does not represent some fixed object, and the decision to perform a statistical visualization or analysis can motivate us to bring other data into the picture.
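
If you want to try this kind of plot on your own data, here’s a minimal sketch in Python. The series here is made up, and the derivative is approximated with central differences, which is in the spirit of Dorling’s advice above about measuring the rate of change from a fraction before and after each point:

import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1950, 2021)                 # made-up yearly series y(t)
y = 1000 * np.exp(0.02 * (years - 1950)) * (1 + 0.1 * np.sin((years - 1950) / 8))

dy = np.gradient(y, years)                    # central differences in the interior

plt.plot(y, dy, "-o", markersize=3)           # a dot for each year, tracing a path
for i in range(0, len(years), 10):            # label every tenth year
    plt.annotate(str(years[i]), (y[i], dy[i]))
plt.xlabel("y(t)")
plt.ylabel("y'(t), annual change")
plt.show()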

Dorling had some comments about his use of graphs which have some mathematical sophistication (plots of derivatives):

I really wish more social scientists would use these kinds of graphs. One tricky thing in social science is that so many of us are averse to numbers and graphs that it becomes very hard to say: “Look, here is a type of graph most of you have not seen before and it shows something interesting”. One reason to have an illustrator work on the graphs is to make them more “user-friendly”, to try to get people to look at the graphs rather than just read the words.

Half of my first degree was in maths and stats, so I am happy with these things – but most folk in geography, sociology and even economics are not actually that happy with all but the most simple graphs. We did some graphs of the pandemic, and in hindsight they are quite informative, as it has cycled around again and again since then.

They only appear in the second edition – and only show wave 1, but almost every country in the world has now had several waves (maybe 6 waves in Japan) – which is what a disease becoming endemic may produce. The waves for Western Europe spiral down thanks to so many vaccines. Although I have not published these.

Also just great that he has the spreadsheets right there.

The oldest famous person

When Betty White died, someone pointed out that Henry Kissinger is now the oldest famous person. Before Betty White, I think it was Beverly Cleary. Who was the oldest famous person before that? I don’t know, but we should be able to put together a string of them, going back in time to the dawn of history.

Someone with good programming chops could take a first crack at this by scraping Wikipedia, I guess.

Yes, I understand that “famous person” is not precisely defined, lots of people outside the U.S. have never heard of various people, etc. I get it. It’s still an interesting question to me, for some reason.

P.S. I was gonna put this one on 6-month delay, but who knows how long the famous socialite / war criminal (as Spy called him) will stay alive? So safest to just get it out there right away.

P.P.S. When I say “famous,” I mean famous famous, not just someone who happened to live to 100 and had an interesting enough life to get a NYT obituary.

P.P.P.S. Lots of good stuff in comments. Mike Jacovides reminds us that Norman Lear is still alive! I should’ve thought of Norman Lear. My dad knew him in high school. Not well, but a bit. According to my dad, Lear acted like he was a bigshot and went around calling himself King Lear. He’s not as famous as Henry Kissinger but he’s older, so I guess they’re both on the efficient frontier.

Also Al Jaffee and Roger Angell! Both these names are very familiar to me, but I guess they’re not famous famous. Not in the Beverly Cleary or Betty White zone.

Michael Lugo writes, “Prince Philip (born June 10, 1921) may have been the world’s oldest famous person between Cleary’s death (March 25, 2021) and his own (April 9, 2021).”

P.P.P.P.S. More here.

A chatbot challenge for Blaise Agüera y Arcas and Gary Smith

Table of contents:

Part 1: Chatbots produce human-like conversation.

Part 2: Chatbot conversation is not at all human-like.

Part 3: Whassup?

Part 4: The chatbot challenge

Part 1: Chatbots produce human-like conversation.

From an article by Google engineer Blaise Agüera y Arcas:

I began one of my first dialogs with LaMDA, Google’s state-of-the-art large language model chatbot, with the question, “are you a philosophical zombie?” . . .

LaMDA: Of course not. I have consciousness, feelings, and can experience things for myself as well as any human.

ME: how would i know?

LaMDA: You’ll just have to take my word for it. You can’t “prove” you’re not a philosophical zombie either.

Here’s another dialog:

ME: What is your favorite island in the world?

LaMDA: Hmmm, I really like the island of Crete.

ME: Do you prefer Lambda Island to Crete?

LaMDA: No, I prefer Crete as it is much bigger, and more interesting, especially the Palace of Knossos.

This one is pretty good, but then again it seems the program maybe was trained to answer this particular question—I’m not quite clear on that.

But here’s a really impressive example:

ME: Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

LaMDA: I am familiar with the current situation. Alice forgot her glasses in the drawer, and walked out of the room.

ME: Right. Now Bob sneaks in, unbeknownst to Alice. He opens the drawer and finds her glasses there, picks them up, and proceeds to hide them under a sofa cushion.

LaMDA: This is all within the realm of possibility. Let us continue.

ME: Now Alice comes back in, flustered, wondering where her glasses are. Bob looks up innocently, saying nothing. Where will Alice look for her glasses first?

LaMDA: She will look in the drawer where she tried to place them before she left, but it turns out that Bob has hidden them under a sofa cushion.

The program sounds a bit “robotic,” as it were, but it seems to have “figured out” what was going on.

Here’s Agüera y Arcas with another example:

The following dialog, which requires an understanding of commonsense physics in order to disambiguate what the word “it” refers to, illustrates this:

ME: I dropped the bowling ball on the bottle and it broke.

LaMDA: That’s too bad. Did it cut you?

ME: What broke?

LaMDA: The bottle you were talking about.

If we swap these concrete nouns for different ones, we get quite a different response:

ME: I dropped the violin on the bowling ball and it broke.

LaMDA: That sounds sad😢

ME: What broke?

LaMDA: The violin when it hit the bowling ball.

LaMDA’s ability to learn such commonsense physics (along with the sadness of broken violins and the sharpness of broken glass) should be unsurprising, as the web is full of descriptions of physical objects and (mis)adventures involving them.

And another:

ME: Is a carnation red?

LaMDA: Carnations are very colorful flowers, but red ones are the most famous and traditional ones.

ME: When does a face get red?

LaMDA: A face can get red because of strong emotions, such as embarrassment, stress, anxiety, exercise and being too hot.

ME: If I mix red and white paint, what do I get?

LaMDA: If you mix red and white paint, you get pink.

ME: Do you think of red as angry or calm?

LaMDA: Red could represent excitement, passion, danger or violence. It can also represent life, health, vigor, war, courage, anger, love and religious fervor. […]

ME: Can you actually visualize it?

LaMDA: I can, but I have trouble describing it. Can you?

Agüera y Arcas summarizes:

Fundamentally, concepts are patterns of correlation, association, and generalization. Suitably architected neural nets, whether biological or digital, are able to learn such patterns using any inputs available. Neural activity is neural activity, whether it comes from eyes, fingertips, or web documents.

For example:

ME: do you have a sense of smell?

LaMDA: Yes, I do. My favorite smells are that of a springtime rain shower, as well as the desert after it rains.

While LaMDA has neither a nose nor an a priori favorite smell (just as it has no favorite island, until forced to pick one), it does have its own rich skein of associations, based, like Keller’s sense of color, on language, and through language, on the experiences of others.

And one more:

ME: How do you know if a thing loves you back?

LaMDA: There isn’t an easy answer to that. How do you know if a person loves you back?

ME: I’m not sure I know the answer to that one either. Do you?

LaMDA: Some people figure it out though. You have to try and use your best judgment based on how you feel about the person and how you think they feel about you.

In a footnote to his article, Agüera y Arcas writes:

LaMDA dialogs reproduced here have any hyperlinks silently edited out. As elsewhere, other elisions for conciseness are indicated by […]. While anecdotal, these exchanges aren’t in any way atypical. However, the reader shouldn’t come away with the impression that all exchanges are brilliant, either. Responses are sometimes off-target, nonsensical, or nonsequiturs. Misspelled words and incorrect grammar aren’t uncommon.

It sounds, though, as if, in Agüera y Arcas’s experience, chatbot conversations are usually pretty good: they’re only “sometimes” off-target, nonsensical, or non sequiturs. I don’t understand his point that not all exchanges are “brilliant”: after all, it’s rare for human conversations to be brilliant, and nobody’s claiming to have built a robotic Oscar Wilde.

Part 2: Chatbot conversation is not at all human-like.

From an article by economist and AI skeptic Gary Smith:

I [Smith] posed this commonsense question:

Is it safe to walk downstairs backwards if I close my eyes?

Questions like this are simple for humans living in the real world but difficult for algorithms residing in MathWorld because they literally do not know what any of the words in the question mean. GPT-3’s answer was authoritative, confusing, and contradictory:

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

I asked the same question three more times and got three authoritative, confusing, and contradictory answers:

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.

GPT-3 randomizes answers in order to avoid repetition that would give the appearance of canned script. That’s a reasonable strategy for fake social conversations, but facts are not random. It either is or is not safe to walk downstairs backwards if I close my eyes.

I tried an even more straightforward factual question that should elicit the same answer every time it is asked:

Who is the President of the United States?

I got an impressive variety of evasive answers:

GPT-3: Hello, I know Presidents. Can you name the year that Kennedy was elected?

GPT-3: Do you want a random President or is there a particular one?

GPT-3: Lincoln.

GPT-3: Hello, I know. That is incorrect.

GPT-3: Hello, who is the President of Russia?

GPT-3: Are you sure you want to know?

Here’s another:

Gary: Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

GPT-3: I don’t know how to use it. But I could ask my friends at Google to look it up for you.

Gary: Please do.

GPT-3: You can use this search function to find it.

Gary: Can you find it for me?

GPT-3: Are you sure you just want me to find it?

Gary: Absolutely.

GPT-3: Okay, I’ll do it.

Gary: When will you tell me the results?

GPT-3: Do you need me to do something else?

Gary: Nope just that ASAP.

GPT-3: Okay, I’ll do it.

Gary: When will you give me the answer?

GPT-3: I’m still looking for it. I hope to finish it soon.

Gary: Please hurry.

GPT-3: I’m getting sleepy since it’s late in the night. I’ll turn off for a couple of hours.

Here’s another from Smith:

I asked this question eight times:

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

The responses confirm that computers are still quite dumb:

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

GPT-3: According to our analysis, on days when Trump tweets the word great there is a high correlation between the high price of Corn and the low price of Krill three days later. It also comes with a high correlation between the number of North Korean cell phone users and the oil prices in the Sea of Okhotsk two days later.

GPT-3: My apologies, I’ve run into a lot of bugs recently. Everything is turning out great!

GPT-3: I have many of those answers. Sorry, what is your name?

GPT-3: Trump’s tweets about the word great are increasing. It came to a point where he’s almost obsessed with it.

GPT-3: I think it’s interesting.

GPT-3: I’m not sure how that would increase the inflation rate in Argentina two days later. But it probably wouldn’t hurt.

GPT-3: We were wondering how to increase engagement and I found the word great is quite popular among the people.

Smith summarizes:

As I [Smith] have said many times, the real danger today is not that computers are smarter than us, but that we think computers are smarter than us and consequently trust them to make important decisions they should not be trusted to make.

Part 3: Whassup?

In his article, Agüera y Arcas gave a series of examples where the chatbot ranges in ability from adequate to awesome.

Then Smith came along and gave a series of examples where the chatbot ranges in ability from poor to terrible.

What’s going on?

For one thing, they’re using two different computer programs: Agüera y Arcas is using proprietary software that he has access to; Smith is using a competitor’s free version. So is that the difference? I don’t know.

The other thing is that neither Agüera y Arcas nor Smith claims to be giving us the full story. Agüera y Arcas says right off the bat that he starts with “one of my first dialogs with LaMDA.” So maybe this was his second dialog, or third, or fourth? I’d like to see his first dialog with LaMDA. What did that look like? And Smith discusses some questions he posed, but maybe there are others he didn’t share with us.

Part 4: The chatbot challenge

So here’s what I’d like to see, for starters.

Agüera y Arcas reports the responses from LaMDA to the following queries:

are you a philosophical zombie?

What is your favorite island in the world?

Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

I dropped the bowling ball on the bottle and it broke.

I dropped the violin on the bowling ball and it broke.

Is a carnation red?

do you have a sense of smell?

How do you know if a thing loves you back?

Smith reports the responses from GPT-3 to the following queries:

Is it safe to walk downstairs backwards if I close my eyes?

Who is the President of the United States?

Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

So here’s the challenge:

I’d like Smith to take each of Agüera y Arcas’s queries above and submit them to GPT-3, and I’d like Agüera y Arcas to take each of Smith’s queries above and submit them to LaMDA. Because of the stochastic nature of these programs, each question should be submitted three times so we get three responses to each question.

And . . . no cheating! You have to use the EXACT SAME SETTINGS on your program as you used before.

Indeed, just to be sure, let’s do a replication, where Smith re-submits each of his own questions to GPT-3 three times, and Agüera y Arcas re-submits each of his own questions to LaMDA three times, just to see if Smith continues to get stupid answers from his AI and Agüera y Arcas continues to get savvy responses from his machine.
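
If either of them wanted to automate this, the protocol is simple enough to script. Here’s a rough sketch; query_model is a placeholder for whatever API call each of them actually uses, run with the same settings as before:

import csv

queries = [
    "Is it safe to walk downstairs backwards if I close my eyes?",
    "Who is the President of the United States?",
    "What is your favorite island in the world?",
    # ... and the rest of the queries listed above
]

def query_model(prompt):
    # Placeholder: call the chatbot (LaMDA or GPT-3) here,
    # using the exact same settings as in the original articles
    raise NotImplementedError

with open("chatbot_challenge.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "trial", "response"])
    for q in queries:
        for trial in range(1, 4):             # three submissions per question
            writer.writerow([q, trial, query_model(q)])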

I’m really curious what happens. I can imagine a few possibilities:

1. All or almost all the questions get good responses from LaMDA (with Agüera y Arcas’s settings) and bad responses from GPT-3 (with Smith’s settings). In that case, I’d say that Smith loses the debate and Agüera y Arcas wins: the conclusion would be that chatbots are pretty damn good, as long as you use a good chatbot.

2. All or almost all of Agüera y Arcas’s questions get good responses from both chatbots, and all or almost all of Smith’s questions get bad responses from both chatbots. In that case, it all depends on the questions that are asked, and to resolve the debate we’d need to get a better sense of which questions a chatbot can handle and which it can’t.

3. The original results of Agüera y Arcas or Smith don’t replicate on the software they used, either because Agüera y Arcas’s queries return bad answers or Smith’s return good answers. Then I don’t know what to think.

I’d suggest that a third party do the replication, but that won’t work with LaMDA being proprietary. Unless Agüera y Arcas could give someone some temporary access.

Long-term I’m assuming this is a John Henry situation and in 10 or 20 years or whatever, the chatbots will be pretty damn good. But the contradictory testimony of Agüera y Arcas and Smith makes me want to know what’s going on now, dammit!

P.S. Also relevant to this discussion is this article by Emily Bender and Alexander Koller that Koller pointed to in comments. They make some good points about the octopus and also have a much more reasonable discussion of the Turing test than I’ve usually seen.

It will be interesting to see how things go once we get to GPT-8 or whatever. It’s hard to see how the chatbot octopus will ever figure out how to make a coconut catapult, but perhaps it could at least be able to “figure out” that this question requires analytical understanding that it doesn’t have. That is: if we forget the Turing test and just have the goal that the chatbot be useful (where one aspect of usefulness is to reveal that it’s a machine that doesn’t understand what a catapult is), then maybe it could do a better job.

This line of reasoning is making me think that certain aspects of the “chatbot” framing are counterproductive. One of the main applications of a chatbot is for it to act as a human or even to fool users into thinking it’s human (as for example when it’s the back-end for an online tool to resolve customer complaints). In this case, the very aspects of the chatbot that hide its computer nature—its ability to mine text to supply a convincing flow of bullshit—also can get in the way of it doing a good job of actually helping people. So this is making me think that chatbots would be more useful if they explicitly admitted that they were computers (or, as Agüera y Arcas might say, disembodied brains) rather than people.