OK, I was wrong about Paul Samuelson.

Regarding my post from the other day, someone who knows much more about Samuelson than I do provides some helpful background:

It is your emphasis on “realistic” that is wrong. Paul played a significant role in the random walk model for stock prices and knew a huge amount about both theory and practice, as part of the MIT group, including others such as Bob Solow. He had no shades on his eyes – but he knew that the model fit well enough that betting on indexes was for most people as good as any strategy. But how to convey this in an introductory text? Most people would go to simple random walks – coin tossing. But that was far from realistic. Stock prices jumped irregularly and were at that time limited to something like half shares or quarter shares. It is a clever idea to think of a sort of random draw of actual price changes as a device to teach students what was going on. Much more realistic. I cannot believe he ever said this was exactly how it worked. Textbooks have to have a degree of idealization if they are to make a complex subject understandable. Paul was certainly not a pompous fool. Economics was a no holds barred field in those days and all models were actively criticized from both theoretical and empirical sides. His clever instructional idea was indeed more realistic and effective as well. He did get the Soviet economy wrong, but so did every other economist.

Regarding that last point, I still maintain that Samuelson’s problem was not getting the Soviet economy wrong, but rather that he didn’t wrestle with his error in later editions of his book; see third and fourth paragraph of this comment. But, sure, that’s just one thing. Overall I get my correspondent’s point, and I no longer think the main substance of my earlier post was correct.

“The narrow beach between the continent of clear effects and the sea of confusion”: that sweet spot between signals that are so clear you don’t need statistics, and signals that are so weak that statistical analysis can’t help

Steve Heston writes:

Marginal Revolution has a nice link showing that data provide almost no power to tell us anything about macroeconomic models. Tabarrok calls this “embarrassing.” I’m not so sure.

In my own field of finance, efficient market theory tells us that stock returns will have low predictability. More generally, optimization by rational economic agents makes it hard to predict their behavior. A consumer or a firm with a quadratic objective function will have a linear first-order condition. So their behavior will be a linear function of the expectation of something, and (by iterated expectations) will be hard to predict.

This seems to occur in science generally. We can’t see inside black holes. We can’t know quantum positions. String theory has no interesting predictions.

Are there good examples in statistics? We struggle to estimate rare events, unit roots, and the effect of exponential growth. Early Covid forecasts were just silly. I think the data provide no information about the incentive effect of the death penalty. Similarly, data have limited power to gauge causal magnitudes of global warming.

I think these results are “embarrassing” in the sense that the data cannot tell us anything, so there is little point in doing statistics.
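To spell out the quadratic-objective point in the passage above, here is one standard formulation (my notation, not Heston's): a quadratic objective gives a linear first-order condition, so the chosen action is a conditional expectation, and iterated expectations then makes its changes unpredictable.

```latex
% My notation: action a_t, unknown payoff-relevant quantity \theta, information set \mathcal{F}_t.
% A quadratic objective gives a linear first-order condition:
\[
a_t = \arg\min_a \, E\!\left[ (a - \theta)^2 \mid \mathcal{F}_t \right]
\quad\Longrightarrow\quad
a_t = E[\theta \mid \mathcal{F}_t].
\]
% By the law of iterated expectations,
\[
E[a_{t+1} \mid \mathcal{F}_t]
= E\!\left[ E[\theta \mid \mathcal{F}_{t+1}] \mid \mathcal{F}_t \right]
= E[\theta \mid \mathcal{F}_t]
= a_t,
\]
% so the optimal action is a martingale and its changes cannot be forecast
% from current information.
```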

Heston continues:

Here is an example of poor understanding. Some Harvard biomedical informatics guy ran a regression on unit-root Covid growth without differencing the log-data. Naturally, he got an R-square of 99.8%. And the t-statistics were highly significant! He did not realize that his exponential forecast was just assuming that the in-sample growth would continue forever. That growth rate was estimated over only 2 weeks, so it was almost meaningless. This type of erroneous thinking told us for years that we had only two weeks to flatten the curve.

I think that statistics is fun, and there are classic areas where data are plentiful, e.g., predicting stock returns. But when the power is low, I am skeptical about pursuing research in the area.
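Heston's unit-root example is easy to reproduce. Here's a minimal sketch with simulated data (not the actual analysis he describes): regressing log cumulative cases on time gives a near-perfect R-squared even when the growth rate is already slowing, while the differenced series makes the slowdown obvious.

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(14.0)                    # two weeks of data
true_growth = 0.25 * np.exp(-days / 10)   # daily growth rate that is already slowing
log_cases = np.cumsum(true_growth) + rng.normal(0, 0.02, size=days.size)

# Fit log(cases) = a + b * t by least squares, ignoring the unit root
b, a = np.polyfit(days, log_cases, 1)
resid = log_cases - (a + b * days)
r2 = 1 - resid.var() / log_cases.var()
print(f"R^2 = {r2:.4f}, implied growth rate = {b:.3f} per day")

# The level regression looks wonderful; the differenced series shows the
# growth rate falling, which the extrapolated exponential ignores.
print("day-over-day changes in log cases:", np.round(np.diff(log_cases), 3))
```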

My reply: statistics lives in the sweet spot between data that are so clear that you don’t need statistics (what Dave Krantz used to call the “intra-ocular traumatic test” or “IOTT”: effects that are so large that they hit you between the eyes) and data that are so noisy that statistical analysis doesn’t help at all.

From this perspective, considering this zone of phase transition, the narrow beach between the continent of clear effects and the sea of confusion, statistics may seem like a specialized and unimportant subject. And maybe it is kind of unimportant. I’m a statistician and have spent my working life teaching, researching, and writing about the topic, but I’d never claim that statistics is more important than physics, or chemistry, or literature, or music, or any number of other pursuits.

On the other hand, this beach or borderland, while small compared to the continents and seas, is not nothing—indeed, lots of important things happen here! What’s known is already known, and we spend lots of time and effort on that knowledge frontier (to mix spatial metaphors a bit). Statistics is important because it’s one of our tools for pushing that frontier back, reclaiming the land, as it were. As long as we recognize its limitations, statistics can be a valuable tool and a helpful way of looking at the world.

What exactly is a “preregistration”?

An anonymous correspondent writes:

I’m a PhD student in psychology with a background in computer science. I have struggled with the morality of the statistical approaches for a while, until I discovered Bayesian statistics. It doesn’t solve everything, but at least I don’t have to bend my mind in so many weird ways.

I would like to ask a question. In the last few years, you seem to embrace preregistration, as can be seen for example in this blogpost. However, I haven’t found a way to convince my co-authors of this. The reason is that my PhD project is part of an outside collaboration. We have automated large parts of the data collection from questionnaires and wearables. This way, we gather lots and lots of data. However, given that we want to steer our data collection procedures as early as possible and don’t have much literature to build our hypotheses on, the project managers push for analyzing the data continuously (exploring the data). To me, this is a big red flag. However, I do see their points as well. As another argument, since we have so much data, we could save a lot of time by not being so meticulous before doing anything. So, I came up with a research protocol:

1. Explore data and find result
2. Report preliminary result to client
3. Create preregistration
4. Verify the result on new data
5. If the result doesn’t hold anymore, go back to step 1
6. Report results to client
7. Write paper

Do you think this is still worthy to be called a “preregistration”? If not, how would you do it?

My reply: Sure, this sounds reasonable. Preregistration is a set of steps; it’s not anything precise. See, for example, my preregistered analysis here. It’s good to do this in a way that gives space for data exploration, because Cantor’s corner.
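For what it's worth, here's a minimal sketch of how the correspondent's protocol could look in practice, with the data and column names entirely made up: explore on one random half, write the preregistration, then run the prespecified test on a held-out half that played no role in generating the hypothesis.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "steps": rng.normal(8000, 2000, size=1000),  # wearable measure (made up)
    "mood": rng.normal(0, 1, size=1000),         # questionnaire score (made up)
})

explore = df.sample(frac=0.5, random_state=1)    # steps 1-2: explore freely here
confirm = df.drop(explore.index)                 # untouched until step 4

# Step 3: the preregistration pins down the exact analysis before anyone looks
# at `confirm`, e.g. "two-sided Pearson correlation between steps and mood."
r, p = pearsonr(confirm["steps"], confirm["mood"])
print(f"confirmatory estimate: r = {r:.3f}, p = {p:.3f}")
```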

“The series is inspired by Dan Ariely’s novel ‘Predictably Irrational’ . . .”

Someone points us to this news article:

NBC has placed a series order for “The Irrational,” the network announced on Tuesday.

According to the logline, the drama follows Alec Baker, a world-renowned professor of behavioral science, who lends his expertise to an array of high-stakes cases involving governments, law enforcement and corporations with his unique and unexpected approach to understanding human behavior. The series is inspired by Dan Ariely’s novel “Predictably Irrational,” which was published in February 2008 by HarperCollins. Ariely will serve as a consultant.

I wonder if they’ll have an episode with the mysterious paper shredder?

In all seriousness, I absolutely love the above description of the TV show, especially the part where they described Ariely’s book as a novel. How edgy!

Describing his work as fiction . . . more accurate than they realize.

P.S. In the movie of me, I’d like to be played by a younger Rodney Dangerfield.

Maybe Paul Samuelson and his coauthors should’ve spent less time on dominance games and “boss moves” and more time actually looking out at the world that they were purportedly describing.

Yesterday we pointed to a post by Gary Smith, “Don’t worship math: Numbers don’t equal insight,” subtitled, “The unwarranted assumption that investing in stocks is like rolling dice has led to some erroneous conclusions and extraordinarily conservative advice,” that included a wonderful story that makes the legendary economist Paul Samuelson look like a pompous fool. Here’s Smith:

Mathematical convenience has often trumped common sense in financial models. For example, it is often assumed — because the assumption is useful — that changes in stock prices can be modeled as independent draws from a probability distribution. Paul Samuelson offered this analogy:

Write down those 1,800 percentage changes in monthly stock prices on as many slips of paper. Put them in a big hat. Shake vigorously. Then draw at random a new couple of thousand tickets, each time replacing the last draw and shaking vigorously. That way we can generate new realistically representative possible histories of future equity markets.

I [Smith] did Samuelson’s experiment. I put 100 years of monthly returns for the S&P 500 in a computer “hat” and had the computer randomly select monthly returns (with replacement) until I had a possible 25-year history. I repeated the experiment one million times, giving one million “Samuelson simulations.”

I also looked at every possible starting month in the historical data and determined the very worst and very best actual 25-year investment periods. The worst period began in September 1929, at the start of the Great Crash. An investment over the next 25 years would have had an annual return of 5.1%. The best possible starting month was January 1975, after the 1973-1974 crash. The annual rate of return over the next 25 years was 17.3%.

In the one million Samuelson simulations, 9.6% of the simulations gave 25-year returns that were worse than any 25-year period in the historical data and 4.9% of the simulations gave 25-year returns that were better than any actual 25-year historical period. Overall, 14.5% of the Samuelson simulations gave 25-year returns that were too extreme. Over a 50-year horizon, 24.5% of the Samuelson simulations gave 50-year returns that were more extreme than anything that has ever been experienced.
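For readers who want to try this at home, here's a sketch of the resampling exercise Smith describes. The monthly returns below are simulated placeholders; Smith used 100 years of actual S&P 500 monthly returns, which you would have to supply yourself.

```python
import numpy as np

rng = np.random.default_rng(2023)
monthly_returns = rng.normal(0.007, 0.045, size=1200)   # stand-in for 100 years of monthly data

def samuelson_path(returns, months, rng):
    """Draw `months` monthly returns with replacement and report the annualized return."""
    draw = rng.choice(returns, size=months, replace=True)
    return np.prod(1.0 + draw) ** (12.0 / months) - 1.0

n_sims, horizon = 100_000, 25 * 12
annualized = np.array([samuelson_path(monthly_returns, horizon, rng) for _ in range(n_sims)])

# Compare the simulated 25-year annualized returns to the worst and best
# actual 25-year periods Smith reports (5.1% and 17.3% per year).
print("share of simulations below 5.1%/yr: ", np.mean(annualized < 0.051))
print("share of simulations above 17.3%/yr:", np.mean(annualized > 0.173))
```

Because the draws are independent and identically distributed, the simulated histories wipe out any mean reversion or volatility clustering in the real return series, which is one natural explanation for why the simulated extremes overshoot the historical ones.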

You might say that Smith is being unfair, as Samuelson was only offering a simple mathematical model. But it was Samuelson, not Smith, who characterized his random drawing as “realistically representative possible histories of future equity markets.” Samuelson was the one claiming realism.

My take is that Samuelson wanted it both ways. He wanted to show off his math, but he also wanted relevance, hence his “realistically.”

The prestige of economics comes partly from its mathematical sophistication but mostly because it’s supposed to relate to the real world.

Smith’s example of Samuelson’s error reminded me of this story from David Levy and Sandra Peart about this graph from the legendary textbook. This is from 1961:

[Image: samuelson.png, a graph from the 1961 edition of Samuelson’s textbook projecting Soviet GNP overtaking U.S. GNP]

Alex Tabarrok pointed out that it’s even worse than it looks: “in subsequent editions Samuelson presented the same analysis again and again except the overtaking time was always pushed further into the future so by 1980 the dates were 2002 to 2012. In subsequent editions, Samuelson provided no acknowledgment of his past failure to predict and little commentary beyond remarks about ‘bad weather’ in the Soviet Union.”

The bit about the bad weather is funny. If you’ve had bad weather in the past, maybe the possibility of future bad weather should be incorporated into the forecast, no?

Is there a connection?

Can we connect Samuelson’s two errors?

Again, the error with the Soviet economy forecast is not that he was wrong in the frenzied post-Sputnik year of 1961; the problem is that he kept making this error in his textbook for decades to come. Here’s another bit, from Larry White:

As late as the 1989 edition, [Samuelson] and coauthor William Nordhaus wrote: ‘The Soviet economy is proof that, contrary to what many skeptics had earlier believed, a socialist command economy can function and even thrive.’

I see three similarities between the stock-market error and the command-economy error:

1. Love of simple mathematical models: the random walk in one case and straight trends in the other. The model’s so pretty, it’s too good to check.

2. Disregard of data. Smith did that experiment disproving Samuelson’s claim. Samuelson could’ve done that experiment himself! But he didn’t. That didn’t stop him from making a confident claim about it. As for the Soviet Union, by the time 1980 had come along Samuelson had 20 years of data refuting his original model, but that didn’t stop him from just shifting the damn curve. No sense that, hey, maybe the model has a problem!

3. Technocratic hubris. There’s this whole story about how Samuelson was so brilliant. I have no idea how brilliant he was—maybe standards were lower back then?—but math and reality don’t care how brilliant you are. I see a connection between Samuelson thinking that he could describe the stock market with a simple random walk model, and him thinking that the Soviets could just pull some levers and run a thriving economy. Put the experts in charge, what could go wrong, huh?

More stories

Smith writes:

As a student, Samuelson reportedly terrorized his professors with his withering criticisms.

Samuelson is of course the uncle of Larry Summers, another never-admit-a-mistake guy. There is a story about Summers saying something stupid to Samuelson a week before Arthur Okun’s funeral. Samuelson reportedly said to Summers, “In my eulogy for Okun, I’m going to say that I don’t remember him ever saying anything stupid. Well, now I won’t be able to say that about you.”

There was a famous feud between Samuelson and Harry Markowitz about whether investors should think about arithmetic or geometric means. In one Samuelson paper responding to Markowitz, every word (other than author names) was a single syllable.

I once gave a paper at a festschrift honoring Tobin. Markowitz began his talk by graciously saying to Samuelson, who was sitting with arms crossed in the front row, “In the spirit of this joyous occasion, I would like to say to Paul that ‘Perhaps there is some merit in your argument.’” Samuelson immediately responded, “I wish I could say the same.”

Here’s the words-of-one-syllable paper, and here’s a post that Smith found:

Maybe Samuelson and his coauthors should’ve spent less time on dominance games and “boss moves” and more time actually looking out at the world that they were purportedly describing.

P.S. OK, I was wrong.

New open access journal on visualization and interaction

This is Jessica. I am on the advisory board of an open access visualization research journal called the Journal of Visualization and Interaction (JoVI), recently launched by Lonni Besançon, Florian Echtler, Matt Kay, and Chat Wacharamanotham. From their website:

The Journal of Visualization and Interaction (JoVI) is a venue for publishing scholarly work related to the fields of visualization and human-computer interaction. Contributions to the journal include research in:

  • how people understand and interact with information and technology,
  • innovations in interaction techniques, interactive systems, or tools,
  • systematic literature reviews,
  • replication studies or reinterpretations of existing work,
  • and commentary on existing publications.

One component of their mission is to require materials to be open by default, including exposing all data and reasoning for scrutiny, and making all code reproducible “within a reasonable effort.” Other goals are to emphasize knowledge and discourage rejection based on novelty concerns (a topic that comes up often in computer science research; see, e.g., my thoughts here). They welcome registered reports, and say they will not impose top-down constraints on how many papers can be published, constraints that can lead to arbitrary-seeming decisions on papers that hinge on easily fixable mistakes. This last part makes me think they are trying to avoid the kind of constrained decision processes of conference proceedings publications, which are still the most common publication mode in computer science. There are existing journals like Transactions on Visualization and Computer Graphics that give authors more chances to go back and forth with reviewers, and my experience as associate editor there is that papers don’t really get rejected for easily fixable flaws. Part of JoVI’s mission seems to be about changing the kind of attitude that reviewers might bring, away from one of looking for reasons to reject and toward trying to work with the authors to make the paper as good as possible. If they can do this while also avoiding some of the other CS review system problems like lack of attention or sufficient background knowledge of reviewers, perhaps the papers will end up being better than what we currently see in visualization venues.

This part of JoVI’s mission distinguishes it from other visualization journals:

Open review, comments, and continued conversation

All submitted work, reviews, and discussions will by default be publicly available for other researchers to use. To encourage accountability, editors’ names are listed on the articles they accept, and reviewers may choose to be named or anonymous. All submissions and their accompanying reviews and discussions remain accessible whether or not an article is accepted. To foster discussions that go beyond the initial reviewer/author exchanges, we welcome post-publication commentaries on articles.

Open review is so helpful for adding context to how papers were received at the time of submission, so I hope it catches on here. Plus I really dislike the attitude that it is somehow unfair to bring up problems with published work, at least outside of the accepted max 5 minutes of public Q&A that happens after the work is presented at a conference. People talk amongst themselves about what they perceive the quality or significance of new contributions to be, but many of the criticisms remain in private circles. It will be interesting to see if JoVI gets some commentaries or discussion on published articles, and what they are like.

This part is also interesting: “On an alternate, optional submission track, we will continually experiment with new article formats (including modern, interactive formats), new review processes, and articles as living documents. This experimentation will be motivated by re-conceptualizing peer review as a humane, constructive process aimed at improving work rather than gatekeeping.” 

distill.pub is no longer publishing new stuff, but some of their interactive ML articles were very memorable and probably had more impact than more conventionally published papers on the topic. Even more so, I like the idea of trying to support articles as living documents that can continue to be updated. The current publication practices in visualization seem a long way from encouraging a process where it’s normal to first release working papers. Instead, people spend six months building their interactive system or doing their small study to get a paper-size unit of work, and then they move on. I associate the areas where working papers seem to thrive (e.g., theoretical or behavioral econ) with theorizing or trying to conceptualize something fundamental to behavior, rather than just describing or implementing something. The idea that we should be trying to write visualization papers that really make us think hard over longer periods, and that may not come in easily bite-size chunks, seems kind of foreign to how the research is conceptualized. But any steps toward thinking about papers as incomplete or imperfect, and building more feedback and iteration into the process, are welcome.

The mistake comes when it is elevated from a heuristic to a principle.

Gary Smith pointed me to this post, “Don’t worship math: Numbers don’t equal insight,” subtitled, “The unwarranted assumption that investing in stocks is like rolling dice has led to some erroneous conclusions and extraordinarily conservative advice,” which reminded me of my discussion with Nate Silver a few years ago regarding his mistaken claim that, “the most robust assumption is usually that polling is essentially a random walk, i.e., that the polls are about equally likely to move toward one or another candidate, regardless of which way they have moved in the past.” My post was called, “Politics is not a random walk: Momentum and mean reversion in polling,” and David Park, Noah Kaplan, and I later expanded that into a paper, “Understanding persuasion and activation in presidential campaigns: The random walk and mean-reversion models.”

The random walk model for polls is a bit like the idea that the hot hand is a fallacy: it’s an appealing argument that has a lot of truth to it (as compared to the alternative model that poll movement or sports performance is easily predictable given past data) but is not quite correct, and the mistake comes when it is elevated from a heuristic to a principle.
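As a toy illustration of the difference (all numbers invented), compare a pure random walk for a candidate's poll share with an AR(1) process that reverts toward a long-run mean. Day to day the two series look nearly the same, which is why the random walk is a decent heuristic, but they imply very different behavior over a whole campaign.

```python
import numpy as np

rng = np.random.default_rng(7)
T, mean, phi, sigma = 200, 50.0, 0.97, 0.5   # invented campaign length and parameters

walk = np.empty(T)
ar1 = np.empty(T)
walk[0] = ar1[0] = mean
for t in range(1, T):
    shock = rng.normal(0, sigma)
    walk[t] = walk[t - 1] + shock                        # random walk: shocks accumulate forever
    ar1[t] = mean + phi * (ar1[t - 1] - mean) + shock    # mean reversion: shocks decay

print("random walk:    final =", round(walk[-1], 1), " sd over campaign =", round(walk.std(), 2))
print("mean-reverting: final =", round(ar1[-1], 1), " sd over campaign =", round(ar1.std(), 2))
```

The trouble starts when the convenient random-walk approximation gets promoted to a law of how polls must move.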

This mistake happens a lot, no? It comes up in statistics all the time.

P.S. Some discussion in comments on stock market and investing. I know nothing about that topic; the above post is just about the general problem of people elevating a heuristic to a principle.

Effect size expectations and common method bias

I think researchers in the social sciences often have unrealistic expectations about effect sizes. This has many causes, including publication bias (and selection bias more generally) and forking paths. Old news here.

Will Hobbs pointed me to his (PPNAS!) paper with Anthony Ong that highlights and examines another cause: common method bias.

Common method bias is the well-known (in some corners at least) phenomenon whereby specific common variance in variables measured through the same methods can produce bias. You can come up with many mechanisms for this. Variables measured in the same questionnaire can be correlated because of consistency motivations, the same tendency to give social desirably responses, similar uses of the similar scales, etc.

Many of these biases result in inflated correlations. Hobbs writes:

[U]nreasonable effect size priors is one of my main motivations for this line of work.

A lot of researchers seem to consider effect sizes meaningful only if they’re comparable to the huge observational correlations seen among subjective closed-ended survey items.

But often the quantities we really care about — or at least we are planning more ambitious field studies to estimate — are inherently going to be not measured in the same ways. We might assign a treatment and measure a survey outcome. We might measure a survey outcome, use this to target an intervention, and then look at outcomes in administrative data (e.g., income, health insurance data).

Here at this blog, perhaps there’s the most coverage of tiny studies, forking paths, and selection bias as causes of inflated effect size expectations. So this is a good reminder there are plenty of other causes, even with big samples or pre-registered analysis plans, like common method bias and confounding more generally.
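Here's a minimal simulation of the mechanism (all variables hypothetical): two traits that are truly uncorrelated look substantially correlated when both are measured on the same questionnaire and pick up a shared method factor, while a different-method measure of the second trait shows essentially no correlation with the first.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
trait_a = rng.normal(size=n)
trait_b = rng.normal(size=n)            # independent of trait_a by construction
method = rng.normal(size=n)             # shared response style (e.g., social desirability)

survey_a = trait_a + 0.8 * method + rng.normal(0, 0.5, n)   # same questionnaire
survey_b = trait_b + 0.8 * method + rng.normal(0, 0.5, n)   # same questionnaire
admin_b = trait_b + rng.normal(0, 0.5, n)                    # same construct, different method

print("same-method correlation:     ", round(np.corrcoef(survey_a, survey_b)[0, 1], 2))
print("different-method correlation:", round(np.corrcoef(survey_a, admin_b)[0, 1], 2))
```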

This post is by Dean Eckles.

 

Computational linguist Bob Carpenter says LLMs are intelligent. Here’s why:

More specifically, he says:

ChatGPT can do deep, involved reasoning. It has the context capacity to do that.

I [Bob] think that human language is what is known as “AI complete”. To be good at language, you have to be intelligent, because language is about the world and context. You can’t do what ChatGPT does ignorant of the world or be unable to plan. . . .

Humans also generally produce output one word at a time in spoken language. In writing we can plan and go back and revise. We can do a little planning on the fly, but not nearly as much. To me, this was the biggest open problem in computational linguistics—it’s what my job talk was about in 1989 and now it’s basically a solved problem from the engineering if not scientific perspective.

I [Bob] am not saying there’s no limitations to using the LLM architecture—it doesn’t have any long- or really medium-term memory. I’m just saying it can’t do what it does now without some kind of “intelligence”. If you try to define intelligence more tightly, you either rule out humans or you somehow say that only human meat can be intelligent.

I told Bob that take on this might be controversial, even among computer scientists, and he replied:

Of course. Everything’s controversial among academics . . .

My [Bob’s] position is hardly novel. It’s the take of everyone I know who understands the tech (of course, that’s a no-true-Scotsman argument), including this paper from Microsoft Research. I do think if you have studied cognitive science, philosophy of language, and philosophy of mind, studied language modeling, studied psycholinguistics, have some inkling of natural language compositional semantics and lexical semantics, and you understand crowdsourcing with human feedback, then you’re much more likely to come to the same conclusion as me. If you’re just shooting from the hip without having thought deeply about meaning and how to frame it or how humans process language a subword component at a time, then of course the behavior seems “impossible”. Everyone seems to have confused it with cutting-and-pasting search results, which is not at all what it’s doing.

I’m not saying it’s equivalent to a human, just that whatever it’s doing is a form of general intelligence. What it’s truly lacking is longer term memory. That means there are things humans can do that it really is incapable of doing in its present form. But that’s not because it’s a “dumb machine”. We’re just “dumb meat” viewed from that perspective (unless you want to get all spiritual and say we have a soul of some kind that matters).

Bob also recommends this paper from Google and this one from OpenAI, and he continues:

There’s a ton of work on scaling laws now and what people are seeing is emergent behavior at certain model sizes. As in like 1% performance for 3B parameters, then 95% performance for 6B parameters kind of thing. But nobody knows why this is happening or where.

The capacity of these models is quite high, including the representation of words, representation of positions, etc. It’s generating one word at a time, but the structure is an incredibly rich time series with literally billions of parameters.

The background here is that I’ve been reading what Thomas Basbøll has been writing on chatbots and the teaching of writing (a topic of interest to me, because I teach writing as part of my Communicating Data and Statistics course), and he recommended a long article by Giulio Alessandrini, Brad Klee, and Stephen Wolfram entitled “What Is ChatGPT Doing . . . and Why Does It Work?”

I really liked Alessandrini et al.’s article. It was at the right level for me, stepping through the following topics:

It’s Just Adding One Word at a Time
Where Do the Probabilities Come From?
What Is a Model?
Models for Human-Like Tasks
Neural Nets
Machine Learning, and the Training of Neural Nets
The Practice and Lore of Neural Net Training
“Surely a Network That’s Big Enough Can Do Anything!”
The Concept of Embeddings
Inside ChatGPT
The Training of ChatGPT
Beyond Basic Training
What Really Lets ChatGPT Work?
Semantic Grammar and the Power of Computational Language
So . . . What Is ChatGPT Doing, and Why Does It Work?

Alessandrini et al.’s article has lots of examples, graphs, and code, and I get the impression that they’re actively trying to figure out what’s going on. They get into some interesting general issues; for example,

One might have thought that for every particular kind of task one would need a different architecture of neural net. But what’s been found is that the same architecture often seems to work even for apparently quite different tasks. At some level this reminds one of the idea of universal computation . . . but I think it’s more a reflection of the fact that the tasks we’re typically trying to get neural nets to do are “human-like” ones—and neural nets can capture quite general “human-like processes”.

In earlier days of neural nets, there tended to be the idea that one should “make the neural net do as little as possible”. For example, in converting speech to text it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that—at least for “human-like tasks”—it’s usually better just to try to train the neural net on the “end-to-end problem”, letting it “discover” the necessary intermediate features, encodings, etc. for itself. . . .

That’s not to say that there are no “structuring ideas” that are relevant for neural nets. Thus, for example, having 2D arrays of neurons with local connections seems at least very useful in the early stages of processing images. And having patterns of connectivity that concentrate on “looking back in sequences” seems useful . . . in dealing with things like human language, for example in ChatGPT.

They also talk about the choices involved in tuning the algorithms—always an important topic in statistics and machine learning—so, all in all, I think it’s a good starting point before getting into the technical articles that Bob pointed us to above. I pointed Bob to the Alessandrini et al. tutorial and his reaction was that it “seriously under-emphasizes the attention model in the transformer and the alignment post-training. It’s the latter that took GPT-3 to ChatGPT, and it’s a huge difference.”

That’s the problem with sending a pop science article to an expert: the expert will latch on to some imperfection. The same thing happens to me when people send me popular articles on Bayesian statistics or American politics or whatever: I can’t help focusing on the flaws. Anyway, I still like the Alessandrini et al. article, I guess more so when supplemented with Bob’s comments.

P.S. Also I told Bob I still don’t get how generating one word at a time can tell the program to create a sonnet in the style of whoever. I just don’t get how “Please give me a sonnet . . .” will lead to a completion that has sonnet form. Bob replied:

Turn this around. Do you know how you write sentences without planning them all out word by word ahead of time? Language is hard, but we do all kinds of planning in this same way. Think about how you navigate from home to work. You don’t plan out a route step by step and then execute it; you make a very general plan (‘ride my bike’ or ‘take the subway’), take a step toward that goal, then repeat until you get to work. The parts of the task (unlock the bike, carry the bike outside, put the kickstand up, get on the bike, etc.) are all easily cued by what you did last, so it barely requires any thought at all. ChatGPT does the same thing with language. ChatGPT does a ton of computation on your query before starting to generate answers. It absolutely does a kind of “planning” in advance, and as the MS paper shows, you can coach it to do better planning by asking it to share its plans. It does this all with its attention model. And it maintains several rich, parallel representations of how language gets generated.

Do you know how you understand language one subword component at a time? Human brains have *very slow* clock cycles, but very *high bandwidth* associative reasoning. We are very good at guessing what’s going to come next (though not nearly as good as GPT—its ability at this task is far beyond human ability) and very good at piecing together meaning from hints (too good in many ways as we jump to a lot of false associations and bad conclusions). We are terrible at logic and planning compared to “that looks similar to something I’ve seen before”.

I think everyone who’s thought deeply about language realizes it has evolved to make these tasks tractable. People can rap and write epic poems on the fly because there’s a form that we can follow and one step follows the next when you have a simple bigger goal. So the people who know the underlying architectures, but say “oh language is easy, I’m not impressed by ChatGPT” are focusing on this aspect of language. Where ChatGPT falls down is with long chains of logical reasoning. You have to coax it to do that by telling it to. Then it can do it in a limited way with guidance, but its basic architecture doesn’t support good long-term planning for language. If you want GPT to write a book, you can’t prompt it with “write a book”. Instead, you can say “please outline a book for me”, then you can go over the outline and have it instantiate as you go. At least that’s how people are currently using GPT to generate novels.

I asked Aki Vehtari about this, and Aki pointed out that there are a few zillion sonnets on the internet already.

Regarding the general question, “How does the chatbot do X?”, where X is anything other than “put together a long string of words that looks like something that could’ve been written by a human” (so, the question could be, “How does the chatbot write a sonnet” or “How does ChatGPT go from ‘just guessing next word’ to solving computational problems, like calculating weekly menu constrained by number of calories?”), Bob replied:

This is unknown. We’ve basically created human-level or better language ability (though not human or better ability to connect language to the world) and we know the entire architecture down to the bit level and still don’t know exactly why it works. My take and the take of many others is that it has a huge capacity in its representation of words and its representation of context and the behavior is emergent from that. It’s learned to model the world and how it works because it needs that information to be as good as it is at language.

Technically, it’s a huge mixture model of 16 different “attention heads”, each of which is itself a huge neural network and each of which pays attention to a different form of being coherent. Each of these is a contextual model with access to the previous 5K or so words (8K subword tokens).

Part of the story is that the relevant information is in the training set (lots of sonnets, lots of diet plans, etc.); the mysterious other part is how it knows from your query what piece of relevant information to use. I still don’t understand how it can know what to do here, but I guess that for now I’ll just have to accept that the program works but I don’t understand how. Millions of people drive cars without understanding at any level how cars work, right? I basically understand how cars work, but there’d be no way I could build one from scratch.
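For what it's worth, the "one word at a time" loop itself is simple to sketch; all the mystery lives inside the conditional distribution. Here's a schematic in which next_token_probs is a hypothetical stand-in for the transformer forward pass, not ChatGPT's actual code.

```python
import numpy as np

def next_token_probs(context_tokens, vocab_size=50_000):
    """Hypothetical stand-in for the model: return a distribution over the
    vocabulary given everything generated so far. In a real LLM this is a
    transformer forward pass conditioned on thousands of previous subword tokens."""
    rng = np.random.default_rng(abs(hash(tuple(context_tokens))) % (2**32))
    logits = rng.normal(size=vocab_size)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def generate(prompt_tokens, n_new, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        p = next_token_probs(tokens)
        tokens.append(int(rng.choice(len(p), p=p)))  # sample one token, append, repeat
    return tokens

print(generate([101, 202, 303], n_new=5))
```

The point of the sketch is only the control flow: each new token is drawn conditional on everything before it, so whatever "planning" happens has to live in how that conditional distribution is computed.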

Principal stratification for vaccine efficacy (causal inference)

Rob Trangucci, Yang Chen, and Jon Zelner write:

In order to meet regulatory approval, pharmaceutical companies often must demonstrate that new vaccines reduce the total risk of a post-infection outcome like transmission, symptomatic disease, severe illness, or death in randomized, placebo-controlled trials. Given that infection is a necessary precondition for a post-infection outcome, one can use principal stratification to partition the total causal effect of vaccination into two causal effects: vaccine efficacy against infection, and the principal effect of vaccine efficacy against a post-infection outcome in the patients that would be infected under both placebo and vaccination. Despite the importance of such principal effects to policymakers, these estimands are generally unidentifiable, even under strong assumptions that are rarely satisfied in real-world trials. We develop a novel method to nonparametrically point identify these principal effects while eliminating the monotonicity assumption and allowing for measurement error. Furthermore, our results allow for multiple treatments, and are general enough to be applicable outside of vaccine efficacy. Our method relies on the fact that many vaccine trials are run at geographically disparate health centers, and measure biologically-relevant categorical pretreatment covariates. We show that our method can be applied to a variety of clinical trial settings where vaccine efficacy against infection and a post-infection outcome can be jointly inferred. This can yield new insights from existing vaccine efficacy trial data and will aid researchers in designing new multi-arm clinical trials.

Sounds important. And they use Stan, which always makes me happy.
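For readers who want the estimands spelled out, here is one standard way to write them (my notation, which may differ from the paper's): let I(z) be potential infection status and Y(z) a potential post-infection outcome under assignment z, with z = 1 for vaccine and z = 0 for placebo.

```latex
\[
\mathrm{VE}_{\text{infection}} = 1 - \frac{\Pr\{I(1)=1\}}{\Pr\{I(0)=1\}},
\qquad
\mathrm{VE}_{\text{post}} = 1 -
  \frac{\Pr\{Y(1)=1 \mid I(1)=I(0)=1\}}{\Pr\{Y(0)=1 \mid I(1)=I(0)=1\}}.
\]
```

The second estimand conditions on the principal stratum of people who would have been infected under either assignment, which is what makes it so hard to identify without extra assumptions.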

What exactly is a “representative sample”?

Emilio Laca writes:

Could you refer me to the definition of “representative sample” as you use it in your books. I am interested in developing my understanding of the theoretical and philosophical basis to make statements about unobservables using measurements on subsets of populations. I also want to learn more about how a Bayesian approach changes (or not) how one deals with sampling design. Any readings that you can recommend will be appreciated.

My reply: I don’t think there’s any formal definition. We could say that a sample is representative if we could usefully use it to represent the population. As this definition makes clear, “representativeness” depends on the use to which the sample would be put.

Do readers have other thoughts?

Is omicron natural or not – a probabilistic theory?

Aleks points us to this article, “A probabilistic approach to evaluate the likelihood of artificial genetic modification and its application to SARS-CoV-2 Omicron variant,” which begins:

A method to find a probability that a given bias of mutations occur naturally is proposed to test whether a newly detected virus is a product of natural evolution or artificial genetic modification. The probability is calculated based on the neutral theory of molecular evolution and binominal distribution of non-synonymous (N) and synonymous (S) mutations. Though most of the conventional analyses, including dN/dS analysis, assume that any kinds of point mutations from a nucleotide to another nucleotide occurs with the same probability, the proposed model takes into account the bias in mutations, where the equilibrium of mutations is considered to estimate the probability of each mutation. The proposed method is applied to evaluate whether the Omicron variant strain of SARS-CoV-2, whose spike protein includes 29 N mutations and only one S mutation, can emerge through natural evolution. The result of binomial test based on the proposed model shows that the bias of N/S mutations in the Omicron spike can occur with a probability of 1.6 x 10^(-3) or less. Even with the conventional model where the probabilities of any kinds of mutations are all equal, the strong N/S mutation bias in the Omicron spike can occur with a probability of 3.7 x 10^(-3), which means that the Omicron variant is highly likely a product of artificial genetic modification.

I don’t know anything about the substance. The above bit makes me suspicious, as it looks like what they’re doing is rejecting null hypothesis A and using this to claim that their favored alternative hypothesis B is true.
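To see roughly what kind of calculation is being described, here's a sketch with placeholder numbers (the paper calibrates its own mutation model; the 0.75 below is just an assumed neutral expectation for the non-synonymous fraction, not their value).

```python
from scipy.stats import binom

n_total, n_nonsyn = 30, 29   # 29 non-synonymous and 1 synonymous mutation in the spike
p_nonsyn = 0.75              # placeholder neutral expectation for the non-synonymous fraction

# P(X >= 29) for X ~ Binomial(30, p_nonsyn)
tail_prob = binom.sf(n_nonsyn - 1, n_total, p_nonsyn)
print(f"one-sided tail probability: {tail_prob:.1e}")
```

A small tail probability here speaks against that particular neutral-evolution null, but, as noted above, it doesn't by itself tell you which alternative, lab modification or something else, is responsible.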

Further comments from an actual expert are here.

“If you do not know what you would have done under all possible scenarios, then you cannot know the Type I error rate for your analysis.”

José Iparraguirre writes:

Just finished “Understanding Statistics and Experimental Design. How to Not Lie with Statistics”, by Michael Herzog, Gregory Francis, and Aaron Clarke (Springer, 2019). Near the end (p. 128), I read the following regarding “optional stopping”:

…suppose a scientist notes a marginal (p = 0.07) result in Experiment 1 and decides to run a new Experiment 2 to check on the effect. It may sound like the scientist is doing careful work, however, this is not necessarily true. Suppose Experiment 1 produced a significant effect (p = 0.03), would the scientist still have run Experiment 2 as a second check? If not, then the scientist is essentially performing optional stopping across experiments, and the Type I error rate for any given experiment (or across experiments) is unknown.

Indeed, the problem with optional stopping is not the actual behavior performed by the scientist (e.g., the study with a planned sample size gives p = 0.02) but with what he would have done if the result turned out differently (e.g., if the study with a planned sample size gives p = 0.1, he would have added 20 more subjects). More precisely, if you do not know what you would have done under all possible scenarios, then you cannot know the Type I error rate for your analysis.

It is the last statement, “if you do not know what you would have done under all possible scenarios, then you cannot know the Type I error rate for your analysis”, that kept me wondering and prompted me to write to you and ask you for your comments.

My reply:

1. Yes, they’re correct that if you do not know what you would have done under all possible scenarios, then you cannot know the Type I error rate for your analysis. We make this point in section 1.2 of our Garden of Forking Paths paper and this is part of the definition of these error rates; the statement should not be controversial.

2. I’m pretty much not interested in type 1 or type 2 error rates; mostly the only reason I think it’s worth thinking about them is to respond to confused researchers who think that a low p-value represents strong evidence for their preferred theories.

3. I think that stopping the experiment based on the data is just fine in practice and is not cheating, as some people might think. See my discussion here.
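To see the quoted point in action, here's a minimal simulation (my own setup, not Herzog et al.'s): the null is true throughout, but under the rule "test after 30 subjects, and if the result isn't significant, add 20 more and test again," the realized rate of false claims exceeds the nominal 5%.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_reps = 20_000
claims = 0
for _ in range(n_reps):
    x = rng.normal(size=30)                            # null is true: mean zero
    if ttest_1samp(x, 0).pvalue < 0.05:
        claims += 1                                    # stop and claim an effect
    else:
        x = np.concatenate([x, rng.normal(size=20)])   # optional stopping: add 20 subjects
        claims += ttest_1samp(x, 0).pvalue < 0.05
print("nominal rate: 0.05, realized rate:", claims / n_reps)
```

This isn't an argument that data-dependent stopping is cheating (see point 3 above); it just demonstrates that the frequentist error rate depends on the whole decision rule, including the branches you never happened to take.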

I also sent to Greg Francis, who wrote:

I too don’t think the statement is controversial, even though it is surprising to many practicing scientists. In part I think that is because textbooks on hypothesis testing give examples with very specific settings (always with a fixed sample size). Even though (usually) the textbooks properly describe theorems (e.g., for defining a sampling distribution), they ignore some realities of data collection (a fixed sample size is not the norm) and so do not consider what happens when you deviate from the theorems.

Contrary to Andrew, I think a lot of scientists genuinely do care about Type I and Type II error rates. However, getting control of those error rates is more difficult than many people realize, and I think that difficulty is good motivation to consider Bayesian approaches.

For a bit more discussion about error rates across experiments, you might look at this paper on a “reverse Bonferroni” method. I’m not sure I would actually recommend the method for any practical situation, but it highlights what to consider when you try to control error rates across experiments.

What happened in the 2022 elections

Yair writes:

The 2022 election defied conventional wisdom and historical trends. In a typical midterm election year with one-party control of the presidency, House and Senate, the incumbent party would expect major losses. Instead, Democrats re-elected every incumbent senator and expanded their Senate majority by a seat, won the overwhelming majority of heavily contested gubernatorial elections, gained control of 4 state legislative chambers, and only narrowly lost the U.S. House. . . .

Unlike other recent midterm years, our analysis shows a stark contrast between the electorate in areas with one or more highly contested House, Senate or gubernatorial races versus those with less contested races. . . .

Their key findings:

Gen Z and Millennial voters had exceptional levels of turnout, with young voters in heavily contested states exceeding their 2018 turnout by 6% among those who were eligible in both elections. Further, 65% of voters between the ages of 18 and 29 supported Democrats, cementing their role as a key part of a winning coalition for the party. While young voters were historically evenly split between the parties, they are increasingly voting for Democrats.

Extreme “MAGA” Republicans underperformed. . . . Candidates who were outspoken election deniers did 1 to 4 points worse than other Republicans, contributing to their losses in important close races. Of course, election denial is one of many extreme positions associated with “MAGA” Republicans, so this analysis likely reflects relatively extreme stances on other issues, including abortion rights . . .

Women voters pushed Democrats over the top in heavily contested races, where abortion rights were often their top issue. After Republican-appointed justices on the Supreme Court overturned abortion rights, a disproportionate number of women voters registered to vote in states with highly contested elections. At the same time, polls showed Democratic women and men indicating they were more engaged in the election. While relative turnout by gender remained largely stable, Democratic performance improved over 2020 among women in highly contested races, going from 55% to 57% support.

Democrats largely retained their winning 2020 coalition in heavily contested races, with some exceptions. Turnout and support among voters by race, education, gender, and other demographic factors remained relatively stable in heavily contested races. . . .

Democratic support among young voters is partly due to the diversity of this group, as America becomes more diverse over time. But that is not the whole story. Democratic support was higher among young voters of color, both nationally (78%) and in highly contested races (also 78%). But support among young white voters rose between 2018 (53% national, 52% highly contested races) and 2022 (58% nationally, 57% highly contested races). This 5-6 point support change is notable, indicating a broad base of Democratic support among young voters across the country. . . .

By any historical standard, 2022 turnout was high. Nationally it did not match 2018’s record-breaking turnout of 118 million votes, but it did reach 111 million ballots cast. . . . However, these national turnout numbers mask important differences at the state and congressional level: namely, that turnout matched or even exceeded 2018 turnout in the most highly contested elections in the country. . . . in these heavily contested races with higher turnout, Democratic candidates generally prevailed. . . . While some of these turnout trends are driven by population increases, it mostly reflects the high turnout environment that has been consistent from the 2018 election onward. In highly contested elections — where voters know the race could be decided by a small number of votes and campaigns invest resources into engaging voters — turnout often matched the historic “Blue Wave” election in 2018. . . .

Campaigns and voter registration groups invest significant resources in identifying, registering and mobilizing new voters as they seek to grow their coalitions. The high turnout era has been marked by millions of new voters entering — and staying in — the electorate. . . .

Lots more numbers and graphs at the above link.

If you’re coming to FAccT 2023, don’t miss the world premiere performance of Recursion.

The ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) conference will be held this year in Chicago from 12-15 June. And at 7pm on Wed 13 June there will be a staged reading of Recursion, a play that Jessica and I wrote.

Recursion is an entertaining and thought-provoking (we hope) play with computer science themes that connect to many of the topics that bring people to FAccT. The play will be directed and acted by a troupe of students from the Northwestern University theater program. We even have some funding thanks to Northwestern’s Engineering school.

If you’ll be coming to FAccT, we hope you can make it to the performance! And if you’re not coming, well, maybe you should reconsider.

Bayesian Believers in a House of Pain?

This post is by Lizzie

A colleague from U-Mass Amherst sent me this image yesterday. He said he had found ‘multiple paper copies’ in an office and then ruminated on how they might have been used. I suggested they might have been for ‘a group of super excited folks at a conference jam session!’

This leads back to a rumination I have had for a long time: how come I cannot find an MCMC version of ‘Jump Around’? It seems many of the lyrics could be improved upon with an MCMC spin (though I would keep: I got more rhymes than there’s cops at a Dunkin’).

My colleague suggested that there is perhaps a need to host a Bayesian song adaptation contest ….

The problems with p-values are not just with p-values.

From 2016 but still worth saying:

Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B. Whenever this sort of reasoning is being done, the problems discussed above will arise. Confidence intervals, credible intervals, Bayes factors, cross-validation: you name the method, it can and will be twisted, even if inadvertently, to create the appearance of strong evidence where none exists.

I put much of the blame on statistical education, for two reasons:

First, in our courses and textbooks (my own included), we tend to take the “dataset” and even the statistical model as given, reducing statistics to a mathematical or computational problem of inference and encouraging students and practitioners to think of their data as given. . . .

Second, it seems to me that statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an “uncertainty laundering” that begins with data and concludes with success as measured by statistical significance. Again, I do not exempt my own books from this criticism: we present neatly packaged analyses with clear conclusions. This is what is expected—demanded—of subject-matter journals. . . .

If researchers have been trained with the expectation that they will get statistical significance if they work hard and play by the rules, if granting agencies demand power analyses in which researchers must claim 80% certainty that they will attain statistical significance, and if that threshold is required for publication, it is no surprise that researchers will routinely satisfy this criterion, and publish, and publish, and publish, even in the absence of any real effects, or in the context of effects that are so variable as to be undetectable in the studies that are being conducted.

In summary:

I agree with most of the ASA’s statement on p-values but I feel that the problems are deeper, and that the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

Colonoscopy corner: Misleading reporting of intent-to-treat analysis

Dale Lehman writes:

I’m probably the 100th person that has sent this to you: here is the NEJM editorial and here is the study.

The underlying issue, which has been a concern of mine for some time now, is the usual practice of basing analysis on “intention to treat” rather than on “treatment per protocol.” In the present case, randomized assignment into groups “invited” to have a colonoscopy and those not, resulted in a low percentage actually following the advice. Based on intention to treat, the benefits of colonoscopy appear to be small (or potentially none). Based on those actually receiving the colonoscopy, the effectiveness appears quite large. While the editorial accurately describes the results, it seems far less clear than it could/should be. The other reasons why this latest study may differ from prior ones are valid (effectiveness of the physicians, long-term follow up, etc.) but pale in importance with the obvious conclusion that when adherence is low, effectiveness is thwarted. As the editorial states “screening can be effective only if it is performed.” I think that should be the headline and that is what the media reports should have focused on. Instead, the message is mixed at best – leading some headlines to suggest that the new study raises questions about whether or not colonoscopies are effective (or cost-effective).

The correct story does come through if you read all the stories, but I think the message is far more ambiguous than it should be. Intention to treat is supposed to reflect real-world practice whereas treatment per protocol is more of a best-case analysis. But when the difference (and the adherence rate here, less than 50%) is so low, then the most glaring result of this study should be that increasing adherence is of primary importance (in my opinion). Instead, there is a mixed message. I don’t even think the difference can be ascribed to the difference in audiences. Intention to treat may be appropriate for public health practitioners whereas the treatment per protocol might be viewed as appropriate for individual patients. However, in this case it would seem relatively costless to invite everyone in the target group to have a colonoscopy, even if less than half will do so. Actually, I think the results indicate that much more should be done to improve adherence, but at a minimum I see little justification for not inviting everyone in the target group to get a colonoscopy. I don’t see how this study casts much doubt on those conclusions, yet the NEJM and the media seem intent on mixing the message.

In fact, Dale was not the 100th person who had sent this to me or even the 10th person. He was the only one, and I had not heard about this story. I’m actually not sure how I would’ve heard about it . . .

Anyway, I quickly looked at everything and I agree completely with Dale’s point. For example, the editorial says:

In the intention-to-screen analysis, *colonoscopy* was found to reduce the risk of colorectal cancer over a period of 10 years by 18% (risk ratio, 0.82; 95% confidence interval [CI], 0.70 to 0.93). However, the reduction in the risk of death from colorectal cancer was not significant (risk ratio, 0.90; 95% CI, 0.64 to 1.16).

I added the boldface above. What it should say there is not “colonoscopy” but “encouragement to colonoscopy.” Just two words, but a big deal. There’s nothing wrong with an intent-to-treat analysis, but then let’s be clear: it’s measuring the intent to treat, not the treatment itself.
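Dale's dilution point can be put in standard instrumental-variables terms (a back-of-the-envelope sketch, not a reanalysis of the trial). With randomized invitation Z, actual screening D, and outcome Y, and assuming that the invitation affects the outcome only through screening and that essentially no one in the usual-care arm gets screened:

```latex
% Risk-difference scale; Y(d) is the potential outcome under screening status d.
\[
\underbrace{E[Y \mid Z=1] - E[Y \mid Z=0]}_{\text{intent-to-treat effect}}
\;=\;
\underbrace{\Pr(D=1 \mid Z=1)}_{\approx\,0.42 \text{ in this trial}}
\;\times\;
\underbrace{E\!\left[ Y(1) - Y(0) \mid D(1)=1 \right]}_{\text{effect among those who would get screened}} .
\]
```

So with roughly 42% uptake, the intent-to-treat estimate is less than half the size of the effect among the people who actually get screened. (The 18% figure quoted above is a risk ratio rather than a risk difference, so the scaling isn't exactly this simple, but the dilution logic is the same.)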

P.S. Relatedly, I received this email from Gerald Weinstein:

Abuse of the Intention to Treat Principle in RCTs has led to some serious errors in interpreting such studies. The most absurd, and possibly deadly example is a recent colonoscopy study which was widely reported as “Screening Procedure Fails to Prevent Colon Cancer Deaths in a Gold-standard Study,” despite the fact that only 42% of the colonoscopy group actually underwent the procedure. My concern is far too many people will interpret this study as meaning “colonoscopy doesn’t work.”

It seems some things don’t change, as I had addressed this issue in a paper written with your colleague, Bruce Levin, in 1985 (Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548). I am a retired cardiac surgeon who has had to deal with similar misguided studies during my long career.

The recent NEJM article “Effect of Colonoscopy Screening on Risks of Colorectal Cancer and Related Death” showed only an 18% reduction in death in the colonoscopy group which was not statistically significant and was widely publicized in the popular media with headlines such as “Screening Procedure Fails to Prevent Colon Cancer Deaths in Large Study.”

In fact, the majority of people in the study group did not undergo colonoscopy, but were only *invited* to do so, with only 42% participating. How can colonoscopy possibly prevent cancer in those who don’t undergo it? Publishing such a study is deeply misguided and may discourage colonoscopy, with tragic results.

Consider this: If someone wanted to study attending a wedding as a superspreader event, but included in the denominator all those who were invited, rather than those who attended, the results would be rendered meaningless by so diluting the case incidence as to lead to the wrong conclusion.

My purpose here is not merely to bash this study, but to point out difficulties with the “Intention to Treat” principle, which has long been a problem with randomized controlled studies (RCTs). The usefulness of RCTs lies in the logic of comparing two groups, alike in every way *except* for the treatment under study, so any differences in outcome may be imputed to the treatment. Any violation of this design can invalidate the study, but too often, such studies are assumed to be valid because they have the structure of an RCT.

There are several ways a clinical study can depart from RCT design: patients in the treatment group may not actually undergo the treatment (as in the colonoscopy study) or patients in the control group may cross over into the treatment group, yet still be counted as controls, as happened in the Coronary Artery Surgery Study (CASS) of the 1980s. Some investigators refuse to accept the problematic effects of such crossover and insist they are studying a “policy” of treatment, rather than the treatment itself. This concept, followed to its logical (illogical?) conclusion, leads to highly misleading trials, like the colonoscopy study.

P.P.S. I had a colonoscopy a couple years ago and it was no big deal, not much of an inconvenience at all.

Decades of polling have drained the aquifer of survey participation.

I wrote about this in 2004:

Back in the 1950s, when the Gallup poll was almost the only game in town, it was rational to respond to the survey–you’d be one of about 1000 respondents and could have a reasonable chance of (indirectly) affecting policy. Now you’re just one of millions, and so answering a pollster is probably not worth the time (See here for related arguments).

The recent proliferation of polls—whether for marketing or to just to sell newspapers—exploits people’s civic-mindedness. Polling and polling and polling until all the potential respondents get tired—it’s like draining the aquifer to grow alfalfa in the desert.

Poll aggregation: “Weighting” will never do the job. You need to be able to shift the estimates, not just reweight them.

Palko points us to this post, Did Republican-Leaning Polls “Flood the Zone” in 2022?, by Joe Angert et al., who write:

As polling averages shifted towards Republicans in the closing weeks of the 2022 midterms, one interpretation was that Americans were reverting to the usual pattern of favoring out-party candidates. Other observers argued that voter intentions were not changing and that the shift was driven by the release of a disproportionate number of pro-Republican polls – an argument supported by the unexpectedly favorable results for Democratic candidates on Election Day.

They continue:

We are not alleging a conspiracy among Republican pollsters to influence campaign narratives. . . . Even so, our results raise new concerns about the use of polling averages to assess campaign dynamics. A shift from one week to another may reflect changes in underlying voter preferences but can also reflect differences in the types of polls used to construct polling averages. This concern is particularly true for sites that aggregate polls without controlling for house effects (pollster-specific corrections for systematic partisan lean). . . .

Our results are also salient for aggregators who use pollster house effects to adjust raw polling data. In theory, these corrections remove poll-specific partisan biases, allowing polling averages to be compared week-to-week, even given changes in the types of polls being released. However, in most cases, aggregators use black-box models to estimate and incorporate house effects, making it impossible to assess the viability of this strategy. . . .

There’s a statistical point here, too, which is that additive “house effects” can appropriately shift individual polls so that even biased polls can supply information, but “weighting” can never do this. You need to move the numbers, not just reweight them.
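Here's a toy version of that last point (all polls invented): when one set of pollsters leans three points toward one party and releases many more polls, any weighted average of the raw numbers gets dragged in that direction, whereas subtracting an additive house effect, treated as known here for simplicity, recenters those polls before averaging.

```python
import numpy as np

rng = np.random.default_rng(3)
true_margin = 1.0                                        # true D-R margin, in points

neutral = true_margin + rng.normal(0, 2, size=5)         # unbiased pollster, few polls
leaner = true_margin - 3 + rng.normal(0, 2, size=15)     # house effect of -3, many polls

# Weighting: any convex combination of the raw polls lies between the two
# groups' means, so flooding the zone pulls the average toward the leaners.
print("simple average of raw polls:  ", round(np.concatenate([neutral, leaner]).mean(), 2))

# Additive adjustment: shift the leaning polls by their house effect, then average.
adjusted = np.concatenate([neutral, leaner + 3])
print("house-effect-adjusted average:", round(adjusted.mean(), 2))
```

In practice the house effects have to be estimated jointly with the trend, which is the black-box step the quoted post worries about. But the arithmetic point stands: a weighted average can never move outside the range of the raw numbers, so if the available polls all lean one way, reweighting alone can't recover the truth.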