Evaluation blind spots and eliciting moving targets

This is Jessica. There have been a few interesting articles in the past couple weeks that point to blind spots in LLM evaluation. One is this explainer article from OpenAI on why they withdrew their late April update to GPT-4o. It’s worth reading if you aren’t familiar with the kinds of adjustments these models undergo after pre-training. While many concrete details are lacking, they give an overview of their evaluation approach, which involves combining different types of reward signals (e.g., fine-tuning on good examples, adjusting the model’s reward distribution to match preferences elicited from humans and ChatGPT), various safety checks, offline testing against benchmarks, and interactive “vibe checking” by experts aimed at getting a sense of how it feels to interact with the model in practice.

The recent model update was problematic, they claim, because it introduced inappropriate levels of sycophancy (including “validating doubts, fuelling anger, urging impulsive actions” etc). The article attributes this mistake to their decision to de-prioritize the results of the vibe checking done by experts, some of whom had suggested that something was off about the model. Leading up to this release, signals about general model behavior and personality (which the vibe-check evals are about) were not “launch-blocking” the way safety tests for things that might cause catastrophic risks were. So they went forward on the grounds that the model looked good on these other tests.

They also suggest that several changes to the reward signals in the post-training process contributed to the increased sycophancy: 

In the April 25th model update, we had candidate improvements to better incorporate user feedback, memory, and fresher data, among others. … For example, the update introduced an additional reward signal based on user feedback—thumbs-up and thumbs-down data from ChatGPT. This signal is often useful; a thumbs-down usually means something went wrong.

But we believe in aggregate, these changes weakened the influence of our primary reward signal, which had been holding sycophancy in check.
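To make that dilution point concrete, here is a toy sketch in Python. It assumes, purely for illustration, that the post-training reward is a weighted sum of signals whose weights are normalized to total one. OpenAI's article does not say how the signals are actually combined, and the signal names and weights below are made up.

```python
def effective_weights(raw_weights):
    """Normalize raw signal weights so they sum to one.

    Hypothetical setup: we assume the overall reward is a weighted sum
    of signals, so each signal's influence is its share of the total.
    """
    total = sum(raw_weights.values())
    return {name: w / total for name, w in raw_weights.items()}


# Made-up weights, before and after adding new signals (illustration only).
before = {"primary": 1.0}
after = {
    "primary": 1.0,          # unchanged in absolute terms
    "thumbs_feedback": 0.5,  # hypothetical new signal
    "memory": 0.25,          # hypothetical new signal
    "fresh_data": 0.25,      # hypothetical new signal
}

share_before = effective_weights(before)["primary"]
share_after = effective_weights(after)["primary"]

print(f"primary share before: {share_before:.2f}")
print(f"primary share after:  {share_after:.2f}")
```

In this toy setup nobody touches the primary signal, yet its share of the total drops by half once the other signals are added. Each addition can look harmless on its own while the aggregate shifts what the model is optimized for.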

AI safety concerns are hard to separate from model behavior in general

There are a few things I find interesting in this. First, it strikes me as being kind of naive and behind the times on an epistemological level to assume that behavior and personality can be separated from other safety risks. There has been plenty of public discussion at this point about the potential for large language models to persuade people to believe things that aren’t true, and evidence that this is already happening. More generally, it seems like it should be common knowledge that small shifts in complex system dynamics can throw things out of whack in ways that become significant. It’s weird to think that OpenAI somehow still saw these behavioral risks as less pressing than the possibility of cyberattacks or the creation of bioweapons. It suggests a mismatch between how OpenAI (and perhaps the AI community more broadly) sees (or wants to see) what they are doing and where the models are these days. The article mentions, for example, that they had not originally expected the models to be used as much as they are for emotional support. I wonder if overlooking the change in sycophancy is partly a result of their not wanting to acknowledge these use cases because they don’t fit some preferred narrative of the models as superintelligent agents capable of strategizing or reasoning beyond human abilities.

On the other hand, hindsight is always 20-20, and it is naturally going to be harder to predict the impacts of changes to a model’s tone or personality than it is to predict what could go wrong if it supplies specific harmful information. From this perspective it’s less surprising to hear that their evaluation approach was underprepared to catch subtle but potentially harmful shifts in behavior like sycophancy.

Going forward, they say that signals of general model behavior will have launch-blocking potential. This implies that AI safety really subsumes all model behavior, which seems right. If LLMs provide a new kind of primitive or basic interface to computing, which I would argue is the right way to think about them, then it’s hard to argue that a few narrow use cases should take precedence. 

Post-hoc alignment with human values is a messy game of heuristics

The fact that incorporating new reward signals that they thought would be helpful threw the model out of whack makes clear how delicate the heuristic-layering process of post-hoc adjustment to align model behavior with human values is. It’s impressive that these kinds of approaches have worked as well as they have. But from the standpoint of evaluation, is there any way out of getting stuck in a kind of whack-a-mole game, where every time some new kind of feedback is introduced in the post-hoc tuning process, the entire model surface must be re-surveyed for new types of vulnerabilities or risks? Is there really some final uber state of evaluation that will be reached through this process, where all potentially harmful aspects of model behavior can be checked and therefore controlled? Or will the criteria themselves keep shifting as the use cases change, making these kinds of “woopsies” model updates inevitable?

It makes me wonder as well about the stability of the signals that are being elicited. Human experts using a model may be more robust evaluation instruments than benchmark-style evaluations or crowd-based preference feedback when it comes to picking up on subtle shifts in behavior, but it’s not clear to me that we should expect people’s judgments about the appropriateness of model personality or changes in behaviors like sycophancy to be a) stable and b) informative about the actual riskiness of model updates. I would expect human appraisals of what’s appropriate to shift with our emerging understanding of what these models can and cannot do, and to be idiosyncratic to some degree. It seems hard to assess the value of subtle behavioral shifts outside of some specific downstream task, but there are so many downstream tasks. So I wonder if evaluation noise is something inherent because the eval targets are themselves poorly defined.

Beneath all of this there is also the incentive issue of needing to create a model that feels pleasant enough to use to keep people coming back while also avoiding the dark side of people being vulnerable to flattery and preferring to believe things that align with their beliefs more than reality. I guess I’m wondering how far we can really expect to take a philosophy of alignment based on applying a bunch of patches post hoc before it backfires due to people being poor judges of what is good for them.

P.S. Right after posting I saw this Rolling Stone article, which talks about chatbot-based emotional support on a whole different level. Apparently delusion is no longer available only to those mentally afflicted. Now we can democratize it too.

“On Sociological Exploitation: Why the Guinea Pig Sometimes Bites”

In the above-titled article from 1971, political scientist Brian Vargus wrote:

Resistance to community research is growing, in part, because applied sociologists have not delivered the answers sought by their clients. Current analyses of this phenomenon are too simple because they overlook the setting in which the client-practitioner transaction takes place. The situation is complicated by “role strains” felt on both sides. An overlooked component of this is the exploitation of subjects and applied research by the reformulation of projects in “scientific and respectable” terms. One possible structural change at least to lessen this pattern and hence bolster the contract between applied sociology and its clients is suggested.

It’s interesting to read this sort of reflection from half a century ago, at which time social science played a much different role in society than it does now. There are some bits that would ring true today–a concern that social science is a sort of colonial enterprise, in which the lives of the “colonized” are being exploited for the benefit of the careers of the “colonizers”–but overall it just feels like a much different era, a time when there was the hope and even the expectation that social science would address and help to solve major social problems such as poverty, crime, depression, war, and so forth. This didn’t seem impossible, at least from the perspective of the early 1960s, when it could’ve been reasonable to think that economics had solved the problems of unemployment and was on its way to solving the problem of poverty; to think that game theory had, if not solved, at least decreased the risks of international conflict; to think that some combination of psychotherapy and pharmaceuticals was on its way to eradicating mental illness as well as helping the non-mentally-ill to be happy with their lots in life; etc. By 1971 the wheels were beginning to fall off the bus, but we hadn’t reached the current state of affairs, under which social ills are considered to be either impossible to solve or else addressable only by implausible methods (whether that be retraining people by teaching them all to “learn to code” or reorganizing the economy via those 100% tariffs we keep hearing about).

So, yeah, there was something refreshing about reading Vargus’s take, in which social scientists were being criticized for “rent seeking” (as we would call it today) rather than doing their true job of saving the world. Nowadays the usual criticisms of social science are that it provides justification for perpetuating inequality (the job of much of economics), tools for actively increasing the level of inequality (that’s finance, including all those useful how-to-invest-your-money tips, which increase inequality in that they’re only useful to people who have the spare cash to invest), and ideological arguments for the status quo (that’s the job of “freakonomics”); that it promotes a left-wing political agenda (that would be all of sociology, most of anthropology, and lots of the social science that’s published in medical journals such as Lancet), left-wing racial and gender ideology (lots of academic psychology, sociology, etc.), or right-wing racial and gender ideology (a loud minority of academic psychology, sociology, etc.); or just that it’s a waste of time and resources.

There are lots of complaints about social science nowadays, but not many complaints that it’s exploiting its research participants or that it’s not doing its job of improving society (in Vargus’s words, “a relationship in which both the university and the community it serves are cooperating in achieving some shared goals”). Interesting. Given all the unfulfilled promises of economic planning, game theory, monetarism, psychoanalysis, criminology, anthropology, opinion polling, and so many other branches of social science, perhaps we should see our current era of reduced expectations as a step forward, but it still makes me sad.

As regular readers know, I still think the social sciences are worth studying even though they’re mostly useless. As I wrote a few years ago:

What’s the point of social science? Why do we do it at all?

We study the natural sciences because they help us understand the natural world and they also solve problems, from vaccines to the building of bridges to more efficient food production. We study the social sciences because they help us understand the social world and because, whatever we do, people will engage in social-science reasoning.

The baseball analyst Bill James once said that the alternative to good statistics is not no statistics, it’s bad statistics. Similarly, the alternative to good social science is not no social science, it’s bad social science.

The reason we do social science is because bad social science is being promulgated 24/7, all year long, all over the world. And bad social science can do damage.

In summary: the utilitarian motivation for the natural sciences is that they can make us healthier, happier, and more comfortable. The utilitarian motivation for the social sciences is that they can protect us from bad social-science reasoning. It’s a lesser thing, but that’s what we’ve got, and it’s not nothing.

Again, though, I have a nostalgia for the days when social science was legitimately believed to be directly useful in curing social ills.

AISTATS ’25 Best Paper award—Margossian and Saul on exact recovery of means and correlation in VI

The best paper award for AISTATS ’25 was just awarded to the following paper.

Here’s Charles, nattily dressed in a blue blazer for the award ceremony:

Photo credit: Daniel Lee (the Cornell CS professor, not the Stan developer)

For those who might not know Charles, he was a Ph.D. student of Andrew’s, is currently a postdoc at Flatiron Institute, and as he posted here recently, is on his way to start a faculty position in the statistics department at the University of British Columbia (Vancouver). I can say from a front row seat that Lawrence is really really good at coaching writing and presenting and Charles is a great presenter (though the latter wouldn’t help with the best paper award—they knew about it before the presentation so we got to see the practice talk). If you get a chance to take one of Lawrence’s scientific writing tutorials, take it—it will for sure give you a leg-up in a best paper competition.

It’s a small world in Bayesian-flavored AI and stats. I believe that’s Stephan Mandt on the right of the photo—he’s the general program chair for AISTATS this year. Stephan was a postdoc with Dave Blei at Columbia while I was there and worked with Matt Hoffman at Google. Charles has also published papers on VI with Dave Blei. Daniel Lee (who took the photo above) was at Bell Labs with me in the late 1990s. I sometimes ran into Lawrence on the train back then going to AT&T Labs.

I. J. Good corner: The flying Venus flytrap and other partly-baked ideas

We recently had a discussion about useless insights from social science, and commenter Adam pointed to an article on “10 evolutionary insights” from the pandemic that I thought was kinda speculative. As I put it:

All its claims seem to me to be stretching credulity. The whole thing reminds me of an article I read in Omni magazine many years ago, featuring a bunch of what they called “partly-baked ideas.” The one I remember was to take some plant such as a Venus flytrap that opens and closes its leaves once per day, keep it in an inside environment with lighting control, and gradually decrease the period of night and day from 24 hours to 23 hours to 22 hours to shorter and shorter, until eventually the light is flickering on and off multiple times a second and the leaves are flapping so fast that the plant elevates off the ground. Clever, but I don’t think it’ll happen. That’s how I felt about the ideas in that PNAS article.

This got me curious about that article from Omni, so I did some googling, and . . . I found it! Here it is:

There it is–the grand prize winner, so vivid that I remembered it almost exactly, over 45 years after I came across it one day in the school library.

But here’s something new: The term “partly baked ideas” derives from I. J. (originally Isidore Jacob, later changed to Irving John) Good, the mathematician who worked with Turing applying Bayesian inference to crack the Enigma code in WW2 and then wrote two statistics books and a zillion articles, perhaps now most famous for his tongue-in-cheek article about 46656 varieties of Bayesians, but that’s kind of unfair because it seems that, in addition to helping to build the algorithm to crack that code, Good also invented a whole bunch of Bayesian statistics from scratch along with a precursor to the fast Fourier transform. He was kind of Tukeyesque in his combination of breadth and depth. This article of his from 1964, “Speculations concerning the first ultraintelligent machine,” is pretty impressive! I’d always thought of Good as having had a kind of disappointing career: it seemed that in his articles he kept reliving the glory of the war years. But, checking the dates and the bio, that’s not quite right: he was born in 1916, so, sure, he did his most intense work in his twenties, but then he worked in military intelligence for a while after that, so we’re not going to hear what he did during that period, and he published his second book in 1965. It’s the usual pattern for mathematicians to slow down after 40, so, all in all, he kept things going for quite a while. Also, with all those short articles . . . he was a proto-blogger! I kind of wonder what would’ve happened to me in a world without the internet. Maybe I’d be writing a new book every year. It’s hard to find the right balance between the different strands of my work, and I guess it seems to me that, after the age of fifty, Good did too much of one thing and not enough of everything else. His choice, of course!
I’m evaluating his career in a theoretical or aesthetic sense, in the same way that I wish that John Updike had written more short stories and fewer novels after he turned 50, or that David Sedaris had pivoted toward telling other people’s interesting stories rather than continuing to mine the dead seams of his own life and extracting smaller and smaller nuggets. All these people are (or were) free to do what they thought they could do best for them, in the same way that I might sit and read a novel rather than complete one of my important (I think) but unfinished research projects.

Getting back to that “partly baked ideas” thing, I went to the library and checked out the 1962 book, “The Scientist Speculates: An Anthology of Partly-Baked Ideas,” edited by Good, along with Alan James Mayne and John Maynard Smith. Good himself wrote many of the articles; other authors you might have heard of include George Pólya, J. E. Littlewood (expressing the belief that the Riemann hypothesis is false!), Bruno de Finetti, Arthur Koestler, Michael Polanyi, Colin Cherry, Cyril Burt (oh, sorry, “Sir” Cyril Burt), Wassily Leontief (who recommends “effective, somewhat ruthless, overall co-ordination . . . not only all operations within a single plant, but also the relationship between plants within an industry and of all industries within a national economy as a whole,” which I guess is consistent with his perhaps Sputnik-influenced claim that “The motive drive and the co-ordinating devices used by the Russians seem to be so much more effective than those built into our system that in the post-war economic race the East has been gaining rapidly on the West”), Arthur C. Clarke, Isaac Asimov, Roger Penrose’s dad, Lloyd Shapley’s dad, David Bohm, Eugene Wigner, and Marvin Minsky. An impressive list of authors, even given that they mostly restricted themselves to getting contributions from dudes (with the exception of Gertrude Schmeidler and maybe some of the many authors who are listed as Anon or only with first initials). In a bit of charming British understatement, the book does not feature mini-biographies of the authors at the end of the articles or the end of the volume.

Seeing Minsky’s name on that list reminded me of Jeffrey Epstein’s Edge Foundation, which had some of that same vibe of Great Men sharing their Big Ideas. I have to say that Good et al. carried it off a lot better in 1962 than the Epsteins did in the 2000s (see here and here).

The book was ok, not quite as fun to read as you might hope, but it had some readable bits. I pretty much agreed with Colin Cherry in his article where he attributed road rage to the frustrated desire of drivers in traffic to communicate with each other. I think the only thing he missed was that, when you drive, you can do all the road rage you want and it doesn’t tire you out, so you keep raging more in an escalating spiral of annoyance. In contrast, if you’re biking and you’re feeling frustrated, you can just pedal harder, which is exhausting, creating a useful feedback loop by which you become too tired to be angry.

The most striking thing about the book, though, is that it has a long section–20 different articles!–on ESP. It leads with a disclaimer by Good (“[Claims of ESP] will be generally accepted if they become readily repeatable by other experimenters. Since that has not yet happened I am including this chapter with some trepidation. But if there is anything at all in parascience it is very important to science and to men”), but, still, it seems to me to be a sign of its times that there is a chapter on ESP but nothing on ghosts, gods, angels, space aliens, or other popular supernatural or science-fictional beliefs. I’m reminded in some way of Steven Pinker’s list of “taboo” subjects, which left me wondering what were his criteria for putting a topic on that list.

More on the ESP thing in a future post.

Books by Charles Rosen and Jeremy Denk on piano playing and the nature of music

I just read two interesting books by classical pianists: Piano Notes by Charles Rosen and Every Good Boy Does Fine by Jeremy Denk. Both addressed the connections between the physical act of playing the piano and the intellectual and emotional aspects of listening to music. This was all interesting to me because I enjoy listening to music (including classical piano music), but I can’t play or read music, nor can I sing in tune, nor can I recall any piece of music or even any long passages of music (setting aside a few very short songs such as Row Row Row Your Boat and Happy Birthday and songs such as The Star Spangled Banner that I’ve heard thousands of times). I can recognize any passage from my favorite music, but I can’t hum anything from beginning to end.

I was discussing this with musician/composer/music-theorist Dmitri Tymoczko (one of whose sayings is “Math is like flypaper for smart people”) and he had some interesting but not completely conclusive responses. Here’s our conversation thread:

Me:

It struck me that when I read a book or see a movie–or follow a spoken story or a logical argument–I see the entire narrative as a sort of fixed structure, within which I can scan forward and backward or jump around at will. This is helpful to me when I am giving a talk–I can pretty much ad lib and feel the structure developing as I speak–or I can see the structural flaws of talks by others. But when I listen to music–even familiar music, pieces I’ve heard hundreds of times, and even when I’m focusing–I’m more like a Turing machine, perceiving only the current instant and with only a vague, peripheral-vision sense of everything that comes before or after.

I’m sure that you have a vision of each piece of music as a whole. But you’re an extreme case. What about people who are not professional musicians or composers? I’m not quite sure how to think about this. Part of it must be the ability to read music (which I don’t have)–if you can attach the sounds to the images, then there’s a direct visual mapping. But I’m guessing that many people who can’t read music can still envision an entire piece. I’m not quite sure what that would feel like–but, as noted above, I can analogize it to my ability to store an entire 3-hour movie (yes, we just saw The Brutalist, and I highly recommend it!) in my head. Conversely, I imagine there are people who perceive books and movies the way I experience music, only in the moment.

Any thoughts? This is perhaps something that William James figured out many years ago…

Dmitri:

I see several interlocking problems

(1) you lack the conceptual categories for making a musical map

(2) a lot of music involves really subtle distinctions that take training to hear
– e.g. a Mozart symphony can be pretty similar from part to part unless you know what to listen for
(In other words, you are what we musicians call “a muggle.”)

(3) you lack confidence. there’s something weird going on where you can make maps of a certain crude kind, but you discount that, because you want something much more detailed.

(4) you lack experience playing in a rock band, where people literally use maps to tell people what to do

What if you took some song you like, I dunno, “It’s the end of the world as we know it,” and made a little map? Use the words to guide your sense of the parts.

Part 1a: “That’s great it starts with an earthquake …”
Part 2a: “It’s the end of the world”
Part 1b: “Six o’clock TV hour”
Part 2b: “It’s the end of the world” (Introducing “time I had some time alone”)
Part 3: “I feel fine” extended, wordless vocal
Part 2c: “It’s the end of the world”
Part 1c: “The other night …”
Part 2d: “It’s the end of the world” (Time I had some time alone there the whole time)
Part 2e: “It’s the end of the world” (two voices in harmony)
Part 2f: “It’s the end of the world”

Now if you want you can get more precise: for example, part 1a is twice as long as the other part 1s, and it introduces two new chords at the end.

The more maps you make the more you will start to hear these things.

You gotta crawl before you walk. I worry that you are expecting either automatic or nothing, rather than being willing to take the baby steps. Like “I can’t understand this graduate text in statistics so I am not going to even try …” So much science involves the long, slow accumulation of knowledge. Music too!

Rock bands use these kinds of maps all the time, when teaching songs to each other. Part 1 is the verse, part 2 is the chorus. Usually there is a 3rd part called a “bridge.”

Now imagine some classical piece that is really simple: say a slow part, a fast part and a slow part. There’s a map. Here’s a beautiful piece by Messiaen that uses that map. (This should be listened to really loud.)

When I do improvised pieces I often have a little map just like this, maybe with five parts.

Or try Mozart’s Turkish march. I bet you can hear the parts coming back.

I do think there’s a basic thing going on which is that music is like a language, and you are sensing that you haven’t really been trained in the language. You like music, but feel frustrated because you can’t connect with it intellectually. That’s all very reasonable and fair. This is one reason why we like to train kids in music when they are young.

I mean, I go to Italy and see all these glamorous people happily yapping about risotto and such. I always wish I could join in. They are happy to speak English with me but I feel like the risotto yapping would be richer if I could learn Italian. Unfortunately I wasted all my time learning music and figuring out what the Grothendieck construction is …

Dmitri adds:

You might like a book by Jerrold Levinson called “Music in the Moment.” He’s anti-map and pro Turing machine. I think he takes things too far but it is a good book, a bracing challenge. He’s right about a lot.

OK, I went to the library and checked out the Levinson book, and I absolutely hated it. I found it so annoying that I blogged something about it–or, at least I thought I did! But now I can’t find the post. Anyway, here’s the story. Right near the beginning of his book, Levinson writes that music is inherently “in the moment” in a way that visual art isn’t because music is perceived over time, whereas you can look at a painting all at once. My problem with this argument is that stories are perceived over time: we read books one paragraph at a time and rarely go back, and we watch a movie in sequence just as we listen to music–but for books and movies I can hold the entire structure–plot, themes, scenes, characters–in my mind, while the story is being told and after also. I have no problem watching a movie or reading a book, partaking in its instantaneous nature, while still holding it in my head as part of a coherent whole. And I’m pretty sure that Dmitri can do this with music too. To my mind, Levinson’s argument was entirely ruined by him not addressing why it doesn’t hold for stories.

To get back to the music-understanding thing, I talked with someone I know who is a good singer and I asked her if she visualizes all of a song at once. She replied that sometimes she’ll start with one passage in the middle and then track the song backward and forward until she’s built the whole mosaic. She also said that if she likes a song, she’ll listen to it over and over until it all makes sense to her and she can see how the chorus, the verses, and the bridge fit together.

In my post, Playing music, listening to music, background music, talking about music, I wrote:

Many writers have discussed how the experience of music has shifted over the past century or so from playing music, to listening to music, to background music–and this all affects how we hear music and how we talk about it. The story is clear enough. Until very recently in music history, if you wanted music you had to sing it or play it yourself, or go somewhere where people were performing. Changes in performance spaces corresponded to different developments of musical style and content. Records and radio made it possible to just listen, and to listen to a much wider variety of music than you’d encounter in your daily life or even going to the occasional concert. And then, as the decades progressed, we gradually moved to the modern condition of music always being available in the background.

It’s my impression that, in general, music is more interesting to play than to listen to. Combinations of notes that don’t sound like much can be interesting to explore, and playing music is a form of exploration or experimentation: even if you’re playing a piece note by note, at each step you’re implicitly experimenting in that you always have a choice to play it differently. Beyond this, I can well imagine that some music can be more interesting to compose than to listen to or even to play–indeed, it’s not hard to compose patterns of notes that are unplayable by a human using traditional musical instruments. So some of the development of modern music has to be a move from interesting-to-listen-to to interesting-to-play to interesting-to-compose.

Charles Rosen discusses this in his book. One thing he points out that I hadn’t thought about is that if you’re playing music at home, or if you’re in someone’s living room listening to someone play, you can read the music along with hearing it. In a live concert you might not be reading the music but you can watch the playing, which is particularly helpful if different themes are being played by different instruments. As Rosen puts it:

Playing Bach for oneself or for a friend or pupil looking at the score (the way the Art of Fugue or the Well-Tempered Keyboard or the Goldberg Variations would have been played before 1770) raised few problems; nothing had to be brought out, the harpsichordist . . . experienced the different voices through the movement of the hands, the listener saw the score and followed all the contrapuntal complexity disentangling the sound visually while listening. Bach’s art did not depend on hearing the different voices and separating them in the mind, but on appreciating the way what was separate on paper blended into a wonderful whole.

Rosen continues by explaining why it is a challenge for modern performers to bring out these patterns through sound alone using various tricks of phrasing and emphasis that would not have been needed in Bach’s time. I’d say it’s an interesting paradox, that the most faithful presentation of the music would not be the same as its original form, but I guess this occurs all the time when appreciating art of earlier centuries. We can’t read Chaucer straight-up, and even Shakespeare presents some challenges; when looking at old Italian paintings it helps to have some background on the Biblical scenes being displayed; etc.

Regarding music: its traditional form is haptic and visual as well as aural. If you listen to recorded music, it’s just aural. Listening to music without playing it, reading the score, or watching the players is a little bit like hearing a Carl Stalling score and only imagining the actions of Bugs Bunny and his pals. It can still be great, but it’s only part of the experience.

Rosen’s book had many other insights too. I recommend it. He jumps back and forth between his own experiences playing the piano, more general reflections on the history of classical music, and discussions, too technical for me to follow, of particular pieces of music.

Rosen is different from me! He writes, “When I hear music, I prefer to lose myself in it, not to drift outside in my own personal world with music as a decorative and distant background.” Rosen thinks about music the way I think about books and music. I wonder if there’s anything that Rosen did like to have in the background, the way I like music in the background. Baseball, maybe? I think of the ideal baseball-game experience as being outside on a nice day, eating hot dogs and having pleasant conversation with the sights and sounds of the game in the background, with occasional periods of intensity and focus.

Rosen also wrote, “The life of music is based not so much on those who want to listen, but on those who want to play and sing.” I guess this isn’t quite true of pop music, or of the careers of musical performers–if you play in a wedding band, your job is to play what the guests want to hear–but I can see how it would be the case when considering what of the music of the past will survive.

Jeremy Denk’s book is a lot like Rosen’s–indeed, the younger man thanks the elder in his acknowledgments–with lots and lots about what it feels like to play the piano, the way that it complements other instruments, the way that different styles are appropriate for different pieces of music, and the sense of a musical community that encompasses teachers, students, performers, and composers. As with Rosen, Denk has some technical bits that I couldn’t really follow, but when the book was over I felt I had a better understanding of music than I had before. When it comes to learning about music, I found this inside-out approach to be much more helpful than various books about how to understand music. I did get a lot out of David Byrne’s book, but that was really more about art and culture in general, not much about music in particular (despite the book’s title).

Most of Denk’s book was about his experiences as a child and youth learning how to play the piano, and it includes many inspiring and heartfelt tributes to his teachers. As a teacher myself, I appreciated that! And it was interesting to think of the ways in which teaching music is different than teaching statistics. He also talks a bit about the social environment of being a music student, and some of that sounded horrible. Not that Denk was a horrible person, more that he was inside a horrible system. It’s so competitive! He went through recital after recital, competition after competition. OK, sure, you could say that all of academics is competitive in that you need to get good grades if you want to move forward. The piano competitions seemed different, though, in that the different students in the program were competing with each other in a way that I don’t see in schools and universities. Maybe part of it is that so many kids are good at music, but there are very few touring-pianist careers of the sort that Rosen and Denk have had. The other funny thing about the Denk-in-school thing is how easy it all came to him. I mean, sure, he had struggles and frustrations, but pretty much his school experiences were a series of steps forward. This reminded me of my own path through school: many of the students struggled but it all came easy to me, and I could focus on my interests without having to worry about not being able to keep up. It was kinda fun to read a memoir by someone who’d had the same experiences.

(again) Yeah, yeah, I understand why you’re all talking about accusations of fraud. But for the rest of us, it’s about the non-replication and the bad science, not about possible fraud and blame

There’s been a lot of news coverage lately on the replication crisis in psychology and related fields. Simine Vazire and I wrote something a couple years ago, Why did it take so many decades for the behavioral sciences to develop a sense of crisis around methodology and replication?, exploring more formally the timeline I’d discussed in a much-discussed post from 2016. Why this has been motivating so much discussion just now in the news media is another story, I guess related to there being a big lawsuit in the air. Lawsuit = news, I guess? This article by Gideon Lewis-Kraus seems like a good summary.

Just one thing, though. The subtitle of that article is, “Dan Ariely and Francesca Gino became famous for their research into why we bend the truth. Now they’ve both been accused of fabricating data.” We see two things that should help the reader engage: (a) a focus on individual personalities and life stories, and (b) the accusation of fabricating data. Both these things are worth discussing—people’s choices matter, and fraud is always a concern (as would be mistaken accusations of fraud). Indeed, we had a long discussion just a couple months ago on cheating in science, sports, journalism, business, and art, riffing on a book by financier Dan Davies.

So, sure, but . . . I continue to think that the big problem of non-replicability is not fraud so much as the misguided expectations: the idea that science is supposed to be some endless stream of discoveries and a refusal to admit error. Here are a few relevant posts:
Honesty and transparency are not enough
Psychology needs to get tired of winning
Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud
Here’s why I don’t trust the Nudgelords . . .
The real problem of that nudge meta-analysis is not that it includes 12 papers by noted fraudsters; it’s the GIGO of it all

One refreshing difference of the current headlines compared to what came in the past is that, for whatever reasons, some of the researchers involved in the latest scandals appear open to admitting that much of their past work is just wrong, that its nonreplication is not just some technical problem but rather is a reflection that in the past they were basically doing the experimental equivalent of generating random numbers and using them to tell stories. I still have unpleasant memories of political scientists insisting, in the face of all evidence, that subliminal smiley faces have large effects on attitudes toward immigration; of a sociologist avoiding looking at careful explanations of how his much-publicized claims were nothing more than noise mining; of the himmicanes and air rage people never giving up; the Freakonomists not coming to terms with their promotion of climate change denial; the nudgelords memory-holing their former adoration of a now-discredited food behavior researcher; etc etc etc.

I guess what I’m saying is an important step forward in the current discussion of replication problems, both in science and in the news media, is to recognize that so much of this research is just no good, at best ridiculously overestimating effect sizes and setting up a false sense of certainty in a world that is highly variable. All this is separate from any questions of fraud and blame.

Measurement error model Stan fitting struggle: The funnel again rears its ugly head

1. The model and my initial expectations

I simulated data and then fit the following measurement error model:
– The latent regression is y ~ normal(a + b*x, sigma)
– But x is not observed. What is observed is x_star ~ normal(x, sigma_x_star)
– The prior for x is normal(mu_x, sigma_x)
– When fitting this model, you can’t separately identify sigma_x and sigma_x_star. I did all the fitting treating sigma_x_star as data and setting it to its true value.

Here’s the Stan program, which I saved in the file measurement_error.stan:

data {
  int N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
  vector[N] x;
}
model {
  x ~ normal(mu_x, sigma_x);
  y ~ normal(a + b*x, sigma);
  x_star ~ normal(x, sigma_x_star);
}

I simulated fake data from this model, except that for the fake data I simulated x from uniform(-5, 5) instead of from a normal. The uniform(-5, 5) distribution has mean 0 and standard deviation 2.9.

My motivation for this example was to teach to my class how easy it is to set up and fit a measurement error model. I expected that fitting would cause no problems. The model is multivariate normal conditional on the variance parameters, and it’s a well-known model in econometrics: You solve it by fitting a bivariate normal distribution to the data (x_star, y). Given the mean and covariance matrix of this distribution, and with known sigma_x_star, it’s easy to compute mu_x, a, b, sigma, and sigma_x.
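As a minimal sketch of that moment-based calculation in base R (an illustration only, separate from the Stan fit; the objects m, S, and the names ending in _hat are made up for this example), one can simulate data with the settings described here, take the sample mean and covariance of (x_star, y), and back out the five parameters while treating sigma_x_star as known:

```r
# Moment-based solution of the measurement error model, with sigma_x_star known.
# Illustrative sketch: names ending in _hat are hypothetical, not from the post.
set.seed(123)
N <- 1000
a <- 0.2
b <- 0.3
sigma <- 0.5
sigma_x_star <- 1
x <- runif(N, -5, 5)
y <- rnorm(N, a + b*x, sigma)
x_star <- rnorm(N, x, sigma_x_star)

# Sample mean and covariance of the observed pair (x_star, y)
m <- c(mean(x_star), mean(y))
S <- cov(cbind(x_star, y))

# Moments implied by the model:
#   var(x_star)    = sigma_x^2 + sigma_x_star^2
#   cov(x_star, y) = b * sigma_x^2
#   var(y)         = b^2 * sigma_x^2 + sigma^2
#   E(x_star) = mu_x,  E(y) = a + b * mu_x
sigma_x_hat <- sqrt(S[1,1] - sigma_x_star^2)
b_hat <- S[1,2] / sigma_x_hat^2
sigma_hat <- sqrt(S[2,2] - b_hat^2 * sigma_x_hat^2)
mu_x_hat <- m[1]
a_hat <- m[2] - b_hat * mu_x_hat

print(round(c(a=a_hat, b=b_hat, mu_x=mu_x_hat, sigma=sigma_hat, sigma_x=sigma_x_hat), 2))
```

With a larger sigma_x_star, the subtraction S[1,1] - sigma_x_star^2 becomes noisier and can even go negative in small samples, so this works better as a sanity check on the model than as a replacement for a full fit.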

Here’s the R script:

library("cmdstanr")
library("arm")
model <- cmdstan_model("measurement_error.stan")
set.seed(123)

# Simulate fake data
# We've drawn x from a zero-centered distribution so as to simplify the problem by removing the posterior correlation between a and b
N <- 1000
a <- 0.2
b <- 0.3
sigma <- 0.5
x <- runif(N, -5, 5)
y <- rnorm(N, a + b*x, sigma)
fake <- data.frame(x, y)

# The regression that would be obtained if we observed x and y
lm_x <- lm(y ~ x, data=fake)
display(lm_x)

# Simulate x_star, which is x observed with measurement error
sigma_x_star <- 1
fake$x_star <- rnorm(N, fake$x, sigma_x_star)

# The "attenuated" regression obtained using x_star and y
lm_x_star <- lm(y ~ x_star, data=fake)
display(lm_x_star)

# To fit the model in Stan we need to specify (or provide strong information on) the error scale sigma_x_star.
# Here we just set it to its true value:
stan_data <- list(N=N, x_star=fake$x_star, y=y, sigma_x_star=sigma_x_star)

# Now fit the model. It has no problems:
print(model$sample(data=stan_data, parallel_chains=4, refresh=0))
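For reference, here is the standard attenuation arithmetic behind the lm_x_star regression in the script. This is a worked calculation under the simulation settings given here, not output from the script; the symbol \lambda is introduced only for this note. The population least-squares slope of y on x_star is cov(x_star, y)/var(x_star), which shrinks the true slope b by a factor that depends on the two variances:

```latex
\begin{align*}
\lambda &= \frac{\sigma_x^2}{\sigma_x^2 + \sigma_{x^*}^2},
\qquad
\text{slope of } y \text{ on } x^* = \lambda\, b, \\
\sigma_x^2 &= \frac{(5-(-5))^2}{12} \approx 8.33
\quad \text{for } x \sim \mathrm{uniform}(-5, 5), \\
\sigma_{x^*} = 1 &: \quad \lambda \approx 0.893, \quad \lambda b \approx 0.268, \\
\sigma_{x^*} = 2 &: \quad \lambda \approx 0.676, \quad \lambda b \approx 0.203, \\
\sigma_{x^*} = 4 &: \quad \lambda \approx 0.342, \quad \lambda b \approx 0.103.
\end{align*}
```

So with b = 0.3, the naive regression should recover most of the slope when sigma_x_star = 1 and only about a third of it when sigma_x_star = 4.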

2. What happened

I simulated N=1000 data points x from uniform(-5, 5) and then simulated x_star and y from the model with true parameters a=0.2, b=0.3, sigma=0.5, sigma_x_star=1. I fit the model in Stan and it worked just fine:

variable     mean   median    sd   mad       q5      q95 rhat ess_bulk ess_tail
 lp__    -1804.64 -1804.34 25.43 25.50 -1848.76 -1763.21 1.00     1118     1765
 a           0.21     0.21  0.02  0.02     0.18     0.24 1.00     3722     2985
 b           0.31     0.31  0.01  0.01     0.30     0.32 1.00     3099     3338
 mu_x       -0.03    -0.03  0.10  0.10    -0.19     0.13 1.00     4349     2988
 sigma       0.48     0.48  0.02  0.02     0.46     0.51 1.00     2042     2989
 sigma_x     2.83     2.83  0.07  0.07     2.72     2.95 1.00     4486     3294
 x[1]       -2.74    -2.75  0.80  0.80    -4.07    -1.38 1.00     6253     2545
 x[2]        1.99     1.99  0.81  0.84     0.63     3.31 1.00     5708     2892
 x[3]       -0.99    -1.00  0.80  0.80    -2.30     0.33 1.00     7102     2965
 x[4]        4.24     4.25  0.81  0.81     2.91     5.59 1.00     7561     2979

I then redid it, simulating new data with sigma_x_star=2 (and specifying this new value of sigma_x_star when running Stan), and it was still pretty good:

variable     mean   median    sd   mad       q5      q95 rhat ess_bulk ess_tail
 lp__    -1852.63 -1854.41 50.04 48.97 -1929.33 -1767.86 1.01      254      530
 a           0.21     0.21  0.03  0.03     0.16     0.25 1.00     1228     1865
 b           0.31     0.31  0.01  0.01     0.29     0.33 1.01      338      775
 mu_x       -0.03    -0.03  0.11  0.11    -0.20     0.15 1.00     1589     2452
 sigma       0.52     0.52  0.03  0.03     0.48     0.57 1.01      333      772
 sigma_x     2.72     2.72  0.09  0.09     2.57     2.88 1.00      705     1674
 x[1]       -2.69    -2.67  1.18  1.16    -4.68    -0.75 1.00     3824     2508
 x[2]        1.16     1.17  1.18  1.18    -0.76     3.11 1.00     4200     2528
 x[3]        0.56     0.58  1.18  1.17    -1.37     2.52 1.00     4722     2950
 x[4]        3.32     3.31  1.16  1.15     1.42     5.22 1.00     4626     2875

I then redid it, simulating new data with sigma_x_star=4 (and specifying this new value of sigma_x_star when running Stan), and there were problems:

variable     mean   median     sd    mad       q5      q95 rhat ess_bulk ess_tail
 lp__    -1738.11 -1775.98 217.55 183.32 -2017.40 -1277.99 1.20       15       14
 a           0.23     0.23   0.04   0.04     0.16     0.30 1.00      471      934
 b           0.33     0.33   0.04   0.04     0.27     0.39 1.18       17       26
 mu_x       -0.10    -0.10   0.15   0.15    -0.35     0.14 1.00      662     1497
 sigma       0.49     0.50   0.08   0.08     0.32     0.61 1.17       17       16
 sigma_x     2.63     2.63   0.19   0.19     2.33     2.94 1.11       27       51
 x[1]       -2.07    -2.09   1.22   1.19    -4.06    -0.11 1.01     6205     2553
 x[2]        0.87     0.86   1.27   1.23    -1.27     2.98 1.01     5485     2473
 x[3]        0.77     0.74   1.26   1.22    -1.27     2.88 1.01     7869     2023
 x[4]        3.36     3.35   1.25   1.22     1.31     5.48 1.01     5805     2597

3. Some experiments

I did some experiments with the problematic sigma_x_star=4 case.

I re-fit, pinning sigma and sigma_x to their true values (sigma = 0.5, sigma_x = 2.8). Here's the Stan program, measurement_error_pin_sigma_and_sigma_x.stan:

data {
  int N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
  real<lower=0> sigma, sigma_x;
}
parameters {
  real a, b, mu_x;
  vector[N] x;
}
model {
  x ~ normal(mu_x, sigma_x);
  y ~ normal(a + b*x, sigma);
  x_star ~ normal(x, sigma_x_star);
}

I can't wait till Stan allows this pinning to be done using the function call rather than requiring a rewriting of the model.

In any case, it mixed well:

variable     mean   median    sd   mad       q5      q95 rhat ess_bulk ess_tail
    lp__ -1487.52 -1487.03 29.36 28.84 -1535.53 -1439.92 1.00      861     1509
    a        0.23     0.23  0.04  0.04     0.16     0.29 1.01      550      864
    b        0.30     0.30  0.01  0.01     0.29     0.32 1.00     1379     2191
    mu_x    -0.09    -0.10  0.15  0.15    -0.34     0.15 1.01      678     1417
    x[1]    -2.15    -2.15  1.38  1.38    -4.41     0.13 1.00     4301     2853
    x[2]     1.02     1.02  1.37  1.34    -1.21     3.23 1.00     4534     2550
    x[3]     0.84     0.82  1.31  1.32    -1.30     2.99 1.00     4668     3046
    x[4]     3.59     3.59  1.35  1.37     1.38     5.77 1.00     4737     2830
    x[5]     1.14     1.14  1.35  1.33    -1.10     3.37 1.00     4846     2806
    x[6]    -3.01    -3.02  1.36  1.39    -5.25    -0.75 1.00     5914     2756

Also pretty good mixing when I pinned sigma to its true value but allowed sigma_x to be estimated (that requires another Stan program which I won't show here):

variable     mean   median    sd   mad       q5      q95 rhat ess_bulk ess_tail
 lp__    -2467.47 -2467.40 57.35 58.13 -2562.27 -2374.46 1.01      148      326
 a           0.23     0.23  0.05  0.04     0.16     0.31 1.01      299      689
 b           0.33     0.33  0.02  0.02     0.30     0.36 1.02      162      415
 mu_x       -0.10    -0.10  0.15  0.16    -0.35     0.14 1.01      366      885
 sigma_x     2.63     2.63  0.16  0.16     2.39     2.89 1.01      173      495
 x[1]       -2.04    -2.04  1.26  1.26    -4.14     0.03 1.00     4347     2967
 x[2]        0.91     0.90  1.23  1.21    -1.10     2.98 1.00     3239     2328
 x[3]        0.80     0.79  1.24  1.22    -1.20     2.89 1.00     4107     2691
 x[4]        3.32     3.32  1.25  1.31     1.26     5.36 1.00     3891     2604
 x[5]        1.07     1.05  1.30  1.28    -1.06     3.23 1.00     4996     2879

But when I pinned sigma_x to its true value but allowed sigma to be estimated, there were some issues:

variable    mean  median    sd   mad       q5     q95 rhat ess_bulk ess_tail
   lp__  -883.53 -894.00 98.28 90.32 -1021.77 -720.09 1.08       53       39
   a        0.23    0.23  0.04  0.04     0.16    0.29 1.00      873     1444
   b        0.29    0.29  0.02  0.02     0.26    0.31 1.06       64       57
   mu_x    -0.10   -0.10  0.15  0.15    -0.36    0.15 1.00     1016     1803
   sigma    0.56    0.56  0.06  0.05     0.46    0.64 1.07       62       39
   x[1]    -2.05   -2.07  1.48  1.46    -4.46    0.38 1.00     5525     2691
   x[2]     0.99    0.98  1.56  1.53    -1.56    3.59 1.00     6583     2671
   x[3]     0.91    0.93  1.53  1.54    -1.59    3.45 1.00     7308     2414
   x[4]     3.47    3.48  1.49  1.50     1.01    5.89 1.00     6288     2552
   x[5]     1.03    1.07  1.49  1.50    -1.49    3.43 1.00     6945     2727

No surprise because now it's multivariate normal. The mixing isn't perfect, implying that there must be some high correlations among a, b, and mu_x.

For my next experiment, I replaced the simulation x <- runif(N, -5, 5) with x <- rnorm(N, 0, 2.9). No change in the model, just in the simulated data. The results were pretty much the same as before, except this time it also fails if I pin only sigma:

variable     mean   median      sd   mad       q5     q95 rhat ess_bulk ess_tail
 lp__    -1248.82 -2547.20 2293.19 81.20 -2651.38 2913.61 1.54        7       39
 a          -6.36     0.24   11.81  0.06   -31.16    0.32 1.53        7       36
 b         -20.69     0.31   37.41  0.03   -99.47    0.34 1.54        7       37
 mu_x       -0.14    -0.14    0.17  0.23    -0.32    0.15 1.27       11      739
 sigma_x     2.18     2.83    1.26  0.24     0.01    3.15 1.54        7       37
 x[1]       -2.57    -2.74    1.74  2.09    -5.31   -0.30 1.45        7       40
 x[2]       -0.71    -0.32    1.18  0.93    -2.82    1.12 1.50      343     2096
 x[3]        2.44     2.76    1.99  2.18    -0.33    5.43 1.48        7       32
 x[4]       -0.04    -0.31    1.12  0.84    -1.85    2.02 1.52     2501     1572
 x[5]       -2.44    -2.60    1.69  2.01    -5.12   -0.30 1.40        8       

4. Trying a different random seed

I then repeated everything above, except changing set.seed(123) to set.seed(1234). The results were similar:

First, generating x <- runif(N, -5, 5):
– Again the model mixed perfectly, well, and poorly when sigma_x_star was 1, 2, and 4, respectively.
– Continuing with sigma_x_star = 4, it mixed well when we pinned both sigma and sigma_x, it mixed well when we pinned sigma, and it mixed not so great when we pinned sigma_x.

Next, generating x <- rnorm(N, 0, 2.8):
– Again the model mixed perfectly, well, and poorly when sigma_x_star was 1, 2, and 4, respectively.
– Continuing with sigma_x_star = 4, it mixed terribly when we pinned both sigma and sigma_x, it mixed ok when we pinned sigma, and it mixed not so great when we pinned sigma_x.

It's kind of funny that when the x's are generated from the model (but not when they're generated from a uniform distribution with the same mean and variance), the simulation does poorly when both sigma and sigma_x are pinned. High correlations, I guess. Still, it's rare that simulations get worse when we simplify the model by setting variance components to be known.

5. My current understanding of the problem

Setting aside the anomaly just mentioned above, my take on the problem is that there’s a funnel–a pole in the target function where sigma = 0 and y[n] = a + b*x[n] (that is, x[n] = (y[n] - a)/b) for all n. This is an awkward funnel for us because the funnel is happening in the error term, rather than at the hierarchical level. To put it another way, the vector being “funneled” is not x, it’s (y - (a + b*x)). So it’s not quite clear how to program up the alternative parameterization here. I tried specifying x with offset and multiplier in different ways, but nothing worked.

The funnel here arises because, from the measurement error model, it’s possible for the fit to place all the x[n]’s so that the plot of y vs. x is a perfect line.
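To spell out where that pole comes from, here is a short derivation based on the Stan program as written (flat priors on a, b, mu_x, sigma, and sigma_x); the constant C is introduced only for this note. The joint target is a product of three normal densities per observation:

```latex
\begin{align*}
p(a, b, \mu_x, \sigma, \sigma_x, x \mid y, x^*)
&\propto \prod_{n=1}^{N}
  \mathrm{normal}(x_n \mid \mu_x, \sigma_x)\;
  \mathrm{normal}(y_n \mid a + b x_n, \sigma)\;
  \mathrm{normal}(x^*_n \mid x_n, \sigma_{x^*}).
\intertext{Now fix $b \neq 0$ and set $x_n = (y_n - a)/b$ for all $n$, so every residual $y_n - (a + b x_n)$ is zero. The middle factor becomes $(2\pi\sigma^2)^{-1/2}$ for each $n$, and the other two factors do not involve $\sigma$, so}
p(\cdot \mid y, x^*) &\propto C\,\sigma^{-N},
\qquad C > 0 \text{ finite and free of } \sigma,
\end{align*}
```

which grows without bound as sigma goes to 0. The x[n]'s pay a finite price in the other two terms for sitting exactly on the line, and that price is overwhelmed by the sigma^(-N) factor.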

6. Potential solutions

I think nested Laplace (that is, marginalizing over the latent parameters) could work just fine. Two questions will arise: (a) will it work?, (b) will it give a speed improvement? I think the answer to (a) has to be Yes, because for this particular example the conditional distribution of the latent parameters is multivariate normal which can be integrated exactly. I’m not sure about (b).
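For reference, here is what exact integration over x gives in this example. This is derived from the Stan program above and is not an implementation of nested Laplace itself. Independently for each n, the observed pair is bivariate normal:

```latex
\begin{align*}
\begin{pmatrix} x^*_n \\ y_n \end{pmatrix}
&\sim \mathrm{normal}\!\left(
\begin{pmatrix} \mu_x \\ a + b\,\mu_x \end{pmatrix},\;
\begin{pmatrix}
\sigma_x^2 + \sigma_{x^*}^2 & b\,\sigma_x^2 \\
b\,\sigma_x^2 & b^2 \sigma_x^2 + \sigma^2
\end{pmatrix}
\right), \\
\mathrm{var}(y_n \mid x^*_n)
&= \sigma^2 + \frac{b^2\, \sigma_x^2\, \sigma_{x^*}^2}{\sigma_x^2 + \sigma_{x^*}^2}.
\end{align*}
```

The N latent x[n]'s no longer appear, leaving only a, b, mu_x, sigma, and sigma_x. The conditional variance in the second line stays positive at sigma = 0 whenever b and sigma_x are nonzero, so the marginal density does not blow up there the way the latent-data density does.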

This would also be a good one to throw at the new adaptive stepsize algorithm that Bob Carpenter and others are working on.

It’s also possible that better adaptation could help? I played around with using ADVI as a starting point and it didn’t seem to do much. Here’s my hacky code:

stan_with_advi_init <- function(model, data, ..., quiet=TRUE){
  if (quiet) {
    advi_est <- model$variational(data, refresh=0, show_exceptions=0, show_messages=0)
    model$sample(data, init=advi_est, refresh=0, show_exceptions=0, show_messages=0, diagnostics=NULL, ...)
  }
  else {
    advi_est <- model$variational(data)
    model$sample(data, init=advi_est, ...)
  }
}

A typical function call will look like this:

fit <- stan_with_advi_init(model, data, parallel_chains=4, max_treedepth=5)
print(fit)

In an applied context, we could use zero-avoiding priors. I played around with these too, and I think they have to be pretty strong. When I assigned weak lognormal(0, 1) priors to sigma and sigma_x, it improved things, but there was still some slow mixing.

7. Summary

Even this simple measurement-error model is surprisingly difficult to fit when it is set up as a latent-data problem. One way to solve it is to integrate out the latent parameters--this reduces the computations to a manageable 2 dimensions--but that approach is limited because it won't work so well for non-normal models, and it requires additional work as more complexity is added to the model. Perhaps a newer HMC algorithm will move better through this space. Otherwise we might need to think harder about priors. I'm keeping an open mind on this one.

Anticipated good news if the economy goes downhill

There’s some concern about an upcoming recession caused by economic shocks, including tariffs, reduced tourism to the U.S., reduced hiring because of economic uncertainty, etc.

There’s always a silver lining. Some things we might expect to follow from an economic downturn:

– Less energy consumption. Fewer tourists means a decrease in airline flights, car rentals, etc. And if China is selling us less stuff, they may be burning less coal there. All in all, this is good for the environment. A global recession could lead to people eating less meat, or at least slowing the rate of increase in meat consumption. And if people are pessimistic about the future, they might have fewer kids, which would save lots of energy use.

– Lower gas prices. I don’t really care so much about this because my bike doesn’t use gasoline, but gas prices are famously influential in people’s perceptions of inflation. Less travel would imply lower demand for oil, and the price should go down. Indeed, if there’s enough of a depression, just about all prices could drop.

– Some improvement in quality of life. Maybe if your kids have fewer plastic toys, they’ll play more creatively with the toys they already have. Less travel = more quality time on the couch. Etc. The standard economic theory would suggest that taking things away from people would decrease their utility, but we know that the standard theory isn’t always correct.

OK, that’s just three things. But I’m no economist. I assume the experts in the audience could add a few more.

P.S. It seems that some people didn’t fully get the point of this post. I thought the reference to fewer toys and the reference to standard economic theory would clue you in to the fact that I wasn’t entirely serious in all the above details. That said, it’s not at all controversial that an economic downturn would be expected to slow the rate of resource consumption, at least in the short term. That doesn’t mean that a recession is a good thing, it just is what it is.

Here’s what’s happening with time-shifting of births.

A couple weeks ago we discussed the question, “What day of the year will have the fewest noninduced births?”

It was an interesting difference between the reasoning of a mathematician (who focused on one factor–the loss of an hour in the day when Daylight Saving Time begins–while ignoring lots of other factors) and statistical reasoning (where we start with the data and see what we can find). As some commenters noted, ultimately we want to use both forms of reasoning, with statistical analysis backed up by mathematical modeling. From a “sociological” point of view, I found it interesting that the mathematician focused on such a minor aspect of the problem; it just happened to be the aspect that was most suitable to direct mathematical analysis.

Here’s the birthday analysis that Aki and I did:

which is fine, but it’s all births, not just noninduced births. In my post, I pointed to a paper that separated the data by natural, C-section, and induced births but only showed data for thirty days of the year:

The paper said the numbers came from the National Center for Health Statistics, but I wasn’t able to find data on births by date on their website, so I sent an email to the authors to see if they had the counts for all three types of births for all 366 dates. No reply yet, unfortunately. (Fair enough; it’s a 15-year-old paper, and the authors may well have lost the data file.)

There was some discussion in comments on the effects of scheduled births on the pattern of dates of all three sorts of births. It’s tricky because whether a birth is scheduled or not is itself “endogenous” in that a birth could be scheduled but then the baby could be born before that date.

More data!

I wasn’t sure about what to say here in the absence of more data . . . and then some more data showed up! I did some searching and came across this article by Mireille Jacobson, Maria Kogelnik and Heather Royer on birth timing and post-natal outcomes, who write:

Fewer births occur on major US holidays than would otherwise be expected. We use California data to study the nature and health implications of this birth date manipulation. . . . “missing” holiday births are displaced to a window of time 11 days before the holiday through 16 days after the holiday. Delivery type does not change over this window, consistent with a pure retiming of births rather than an increase in the use of procedures such as cesarean sections. . . . while some of the retiming seems to be driven by patients’ preferences, provider incentives appear to play a crucial role in holiday-related birth retiming. At Kaiser Permanente hospitals, where systemwide financial incentives discourage providers from electively timing births, the dip in births on holidays is less than for hospitals overall.

Here’s what they estimate:

No plots of the raw data, unfortunately, just this estimate, which is based on averaging over a set of holidays that occur throughout the year.

Here’s where the data come from:

Our primary data source is the restricted-access 2000-2016 California Birth Statistical Master Files. These data cover the universe of California births during this period and come from birth certificate information that the parents and medical provider fill out at the time of birth. These data include demographic information (e.g., age, education) for the parents, health conditions/outcomes of the mother and infant (e.g., gestational diabetes, birth weight, gestational length), and the use of medical interventions (e.g., cesarean section, induction, and stimulation). Crucial to our approach, these data include the exact date of birth of the infant.

And here’s a quick summary:

Births average 1442 per day, but are systematically lower on holidays and weekends (with a mean of about 1100 births per day) than on other days. The data on delivery mode make clear that this is a result of scheduling. The number of cesarean section deliveries is nearly 50% lower on holidays and weekends than on other days. Induced/stimulated births are about 28% lower. Spontaneous vaginal births are also lower on holidays (by about 15%), although they account for a much higher share of births on holidays (52%) than on other non-weekend days (44%).

That’s cool. Now I want to see the raw data. It says they’re in the California Birth Statistical Master Files. I don’t know where to find these files, also if they’re “restricted access,” maybe I don’t have permission to see them, and I guess that the authors of the above-linked paper won’t be able to send them to me.

A request

If anyone out there has the data from the California Birth Statistical Master Files and wants to plot the time series of avg #births by date (and multiply the number for 29 Feb by (# years in data)/(# leap years in data)), for each of the three categories of births, could you please graph these (three time series on a single plot would be fine) and send to me? Thanks!
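As a small worked example of that 29 Feb adjustment, here is a sketch in R. It assumes the data cover 2000-2016, the range quoted above for the California files, and that the per-date average is computed as total births on that date divided by the number of years in the data; feb29_factor and adjust_feb29 are hypothetical names used only here.

```r
# Assumption: years covered are 2000-2016, as stated for the California files.
years <- 2000:2016

# Standard Gregorian leap-year rule
is_leap <- (years %% 4 == 0 & years %% 100 != 0) | (years %% 400 == 0)

# (# years in data) / (# leap years in data)
feb29_factor <- length(years) / sum(is_leap)

# Rescale a 29 Feb average that was computed by dividing by all years,
# so that it is comparable to the averages for the other 365 dates.
adjust_feb29 <- function(avg_births, factor = feb29_factor) avg_births * factor

print(feb29_factor)
```

Under that assumption there are 17 years and 5 leap years (2000, 2004, 2008, 2012, 2016), so the multiplier is 17/5.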

What is judgment and decision making (JDM)?

JDM now

Dan Goldstein has two posts that should interest some of you:

What is the field of Judgment and Decision-Making (JDM)?

What Judgment and Decision Making (JDM) is and what it isn’t

In the first of these articles, Dan characterizes judgment and decision making as “a field within Cognitive Psychology” with core topics of “risk, uncertainty, choice, decision, probability, prediction, future, intertemporal choice, heuristics, utility, forecasting, normative models, prescriptive models, and descriptive models.”

He cites a 1996 post from Barbara Mellers, which “speaks of ‘almost five decades’ of JDM research, which would point to somewhere in the late 1940s. Well after Brunswik, a few years after Von Neumann and Morgenstern’s ‘Theory Games and Economic Behavior’ and a few year’s before Ward Edwards’s Psychological Bulletin article ‘The theory of decision making.'” Dan continues by saying that “the majority of JDM research has always been about the difference between formalisms and human behavior.”

In the second article, Dan gives a “concise definition” of judgment and decision making as “The study of intuitive statistics” and a “longer definition” as “The study of human decision making behavior, formal decision models, and the differences between the two.”

Neither of these two definitions quite works for me. My problem with the “intuitive statistics” definition is that, to me, the core topics of statistics are measurement and inference, neither of which directly maps to judgment or decision making. My problem with the longer definition is that it’s missing the “judgment” part.

Regarding that last issue, Dan writes:

Does the difference between judgment and decision making really matter?

Judgments (like estimating the distance of an object or the population of a country), and decisions (like choosing medical treatment A vs B given available information and risks) are different, but they’re so related that I find it convenient to roll everything up into “decision making.” . . . JDM or “judgment and decision making” is now a fixed phrase and there’s not much talk about the distinction between judgments and decisions.

To me they are different! You can read chapter 9 of BDA3 for my take on decision analysis. I think it makes a lot of sense to distinguish between “judgment” and “decision making.” Indeed, I think that theorists and practitioners of statistics have made major errors over the years by trying to frame inferences and judgments as decision problems.

That said, Dan works within the field of JDM and I’m an outsider, so I’d guess that his definition is a good summary of what people in that area are thinking about and working on.

His post is pretty long and even includes some data! I recommend you follow the link and read the whole thing.

Origins of JDM

I associate the field of Judgment and Decision Making with the classic book from 1982, “Judgment Under Uncertainty: Heuristics and Biases,” edited by Daniel Kahneman, Paul Slovic, and Amos Tversky.

Judgment and decision making is a subfield of psychology with connections to psychophysics (according to britannica.com, the “study of quantitative relations between psychological events and physical events or, more specifically, between sensations and the stimuli that produce them”) and cognitive psychology (according to wikipedia, “the scientific study of mental processes”).

I was curious who’d coined the term, “judgment and decision making.” It’s a good pairing. In 1988 Jon Baron published a book, Thinking and Deciding, a title that I like because it makes me reflect upon these two different processes. I’ve taught classes on decision analysis, but that’s not the same as thinking. That was before Dave Krantz explained to me about goal-based decision making.

A Google Scholar search on “judgment and decision making” reveals multiple reviews on the topic, including a book of articles from 1986 edited by Hal R. Arkes and Kenneth R. Hammond, a book chapter by B. Fischhoff from 1988, a textbook by J. F. Yates from 1990, a review article from 1998 by B. A. Mellers, A. Schwartz, and A. D. J. Cooke, a book chapter from T. D. Gilovich and D. W. Griffin from a handbook of social psychology published in 2010, a review article from 2020 by Baruch Fischhoff and Stephen B. Broomell, and yet another review article, this one by Priscila G. Brust-Renck, Rebecca B. Weldon, and Valerie F. Reyna from 2021.

That’s not all of it either! It makes sense that psychology is a reflective field, and psychologists like to write review articles. As a serial textbook author myself, I’m not complaining.

I don’t have it in me to read all the above reviews, but it would be interesting to compare the two articles by Fischhoff that were written 32 years apart.

In the meantime, I still want to know when the term was first used. Going back to Google scholar, I’ll restrict my search to earlier decades.

For the decade 1940-1950, all I find is a reference to an article by J. Don Miller in The Journal of Business of the University of Chicago from 1947, containing the phrase, “Neither college nor university training is conducive to the type of judgment and decision-making required in business.” It’s a readable article! But not relevant to the academic study of judgment and decision making that I was thinking of.

For 1950-1960 we see some references to judgment and decision making in business management and in human-factors research in psychology. So, again, no experiments or new theoretical structure. By 1960, the cognitive revolution was well established in psychology, classical (Neumann-Morgenstern) decision analysis was well established within economics and business, and researchers had started to explore various descriptive and normative problems with the classical approach—but it seems that no one had put these together into a new subfield that combined the mathematics/statistics/economics of decision analysis with cognitive psychology.

As of 1960, “judgment and decision making” was thought of as something done by managers at the workplace, not as its own field of study.

In 1960-1970, things begin to change. In 1961, Michael A. Wallach and Nathan Kogan published an article, “Aspects of judgment and decision making: Interrelationships and changes with age.” This is a serious psychology paper, with theories and data, following in the psychophysics tradition that would become so fruitful when continued by Tversky and Kahneman a decade later in their famous experiments on “the law of small numbers,” “anchoring and adjustment,” and other fallacies and heuristics of judgment under uncertainty. The O.G. researcher in this area was Laplace, back in the early 1800s, but here I’m talking about modern research in the area. Going through the references from the 1960s, the phrase “judgment and decision making” still is mostly used in the business context, but theoretical and empirical articles appear from stalwarts such as Ward Edwards (“Dynamic decision theory and probabilistic information processing,” published in the journal Human Factors in 1962) and Paul Slovic (“Risk-taking in children: Age and sex differences,” published in Child Development, 1966). The subfield is beginning to form, but it has not yet cohered, nor has it been named.

The 1970s feature a flood of research papers on the topic. Just from the first page of the Google Scholar search, there’s “Studies of problem solving, judgment, and decision making: Implications for educational research” (1975), “Judgment and decision-making in a medical specialty” (1974), “The concept of weight in judgment and decision making: A review and some unifying proposals” (1980), “Comparison of Bayesian and regression approaches to the study of information processing in judgment” (1971), “Human judgment and decision making: Theories, methods, and procedures” (1980), and so on.

And then come the 1980s, with the Kahneman/Slovic/Tversky book and all the rest. Hey! Here’s an article in the Annual Review of Psychology from 1984 (“Judgment and decision: Theory and application,” by Gordon F. Pitz and Natalie J. Sachs) that states:

A judgment or decision making (JDM) task is characterized either by uncertainty of information or outcome, or by a concern for a person’s preferences, or both. . . . Numerous authors have demonstrated that judgments depart significantly from the prescriptions of formal decision theory (see Kahneman et al 1982). An earlier review of behavioral decision theory (Slovic et al 1977) was largely devoted to a description of these inconsistencies. . . . Since theorists are also human, and hence liable to the same biases as their subjects, there may exist a “bias heuristic” that leads psychologists to see biases in all forms of judgment (Berkeley & Humphreys 1982). The last chapter in this area in the Annual Review of Psychology included a critical discussion of the adequacy of prescriptive models for evaluating judgment and decision making (Einhorn & Hogarth 1981). . . .

In 1982 the newly-formed journal Medical Decision Making featured an article by Jay J. J. Christensen-Szalanski on “Recent Developments in the Psychology of Judgment and Decision Making,” and, as early as 1980, there was a book, “Human Judgment and Decision Making: Theories, Methods, and Procedures,” by Kenneth R. Hammond, Gary H. McClelland, and Jeryl Mumpower (see here for a review).

“They had it all but they wanted more”: Left-wing radicals in the 1960s and right-wingers now

Seeing all this news about various well-connected right-wing activists and billionaires attempting to take apart the government, blow up the economy, and destroy our alliances . . . OK, at some level I get it. The recent election gave the Republicans a rare opportunity in which one faction of one party controls all three branches of government (legislative, executive, and judicial) so it makes sense to run with it. That’s the “Project 2025” thing: Given the partisan polarization of voters, there’s a logic to pressing all the buttons on the console, making as many policy changes as you can, and trying to lock them all in so that they stay after you lose power–or set up conditions that make it harder for you to lose. So, yeah, sure, full-out attack on any alternative sources of political power, including the civil service, foreign allies, teachers, students, etc.

The “blow up the economy” thing is less clear, but even for that there’s a logic in terms of weakening alternative power bases. When the economy is going strong, employees can do what they want, with the confidence that they can find a new job if they quit or are fired, and companies have flexibility in hiring and business decisions. But if the economy is tanking, employees will be scared of losing their jobs and companies will be more vulnerable to government actions. So, again, less independent power. If your company is in financial peril, you’ll be less inclined to take one more risk by opposing the government.

So, yeah, I see the logic. But . . . looking at this another way, it’s all absolutely nuts. Right-wing activists and billionaires already had so much, even under the Biden administration. They had low taxes, low business regulation, a system of international alliances with other capitalist countries and freedom of action all around the world. They had lots of nice things, a Tesla in every pot, etc. From this perspective, I’d think that a natural goal on their part would be to keep what they have. Sure, push in the direction of lower taxes and less business regulation, push on whatever social issues you want, make whatever small changes to weaken the political opposition, but don’t go crazy–you don’t want to lose everything you’ve got.

Let me be clear here. I’m not trying to make the familiar argument that right-wing governments should compromise because too much political and economic inequality will anger the masses: cut universal benefits and replace them with tax cuts for the rich, and the people will come at you with pitchforks. That’s a tricky argument to make, because right-wing policies have their own attractions to the voters, and ideologues can well believe, sometimes with reason, that with careful messaging and enough control over the news media they can retain the consent of the governed.

Rather, I’m making a simpler argument. Right-wing activists and billionaires already have it all. Why would they want to blow everything up? If you think of the economy and society of the United States as being held up by many pillars, what you’re seeing is the party in power deciding to remove these pillars one at a time. There’s a risk that it could all collapse! And, sure, rich and well-connected people are better situated than the average person to weather an economic collapse, but, still, it seems crazy to me. And their kids can get measles too. It would take a better decision analyst than me to weigh the risks of your kid dying against the warm feeling you get by being part of the anti-establishment team.

They had it all but they wanted more.

I see an analogy to left-wing radicals in the late 1960s. The left side of the Democratic party controlled all three branches of government, they had most of the policies they wanted . . . but that wasn’t enough. They wanted more. As with the far right today, there was a lot of talk about blowing it all up and starting anew. In retrospect, it seems crazy. They had it all but they wanted more.

Again, I understand a little bit of this. Even if you have it all, why not ask for more? And you never have it all; there will always be some struggles on the margin, and in a free society such as the United States there will always be powerful institutions not under your control. In the 1960s, the left had to contend with the Catholic church, the army, and big business: these were three major institutions with a strong conservative bent. Not to mention the political establishment in the south. In 2025, the right has its own opponents, notably in the worlds of education, medicine, and organized labor. The part that puzzles me, in both cases, is that pushing where you can, negotiating from a position of strength, was not enough. The people on top of the world wanted to blow up the world that they were on top of.

I get that, from a political standpoint, things are more complicated. Neither “the left” in the 1960s nor “the right” today is a unitary actor. The part that still baffles me is that many of the leaders of these groups already do have most of what they could possibly want–not just personally (money, fame, adulation, power) but also in their political goals.

From a psychological perspective, maybe the story is that if you already have it all, you feel invulnerable–vindicated by all the past gambles you’ve taken that have paid off–and you’re willing to throw the dice once again.

Perhaps the next step in understanding this is to sit down and read the full texts of the Port Huron Statement, the Weatherman document, Project 2025, and the most recent thousand tweets of Elon Musk.

Of plagues and chickens: How can someone be so skeptical in one place and so credulous somewhere else?

I was reading the London Review of Books and came across this letter:

Tom Shippey doesn’t question the opinion that plague carried off half the West European population during the Black Death (LRB, 7 November). As a microbiologist, I have reservations. And contemporary accounts mislead. John Wyclif claimed that the Black Death had caused the number of students at Oxford to fall from six thousand to three thousand. Hastings Rashdall in his classic history of medieval universities poured cold water on Wyclif, commenting that ‘the medieval mind was prone to exaggeration, especially where figures are concerned. It delighted in good round numbers, and was accustomed to make confident statements entirely without adequate data.’ And when another plague returned, Daniel Defoe wrote of ‘People being more addicted to Prophesies and Astrological Computations, Dreams, and Old Wives’ Tales’. There is nothing new under the sun.

Hugh Pennington
Aberdeen

I don’t know nuttin bout no plague, but, yeah, I could well imagine that contemporary sources could’ve overestimated the death toll. When you’re told a number, you want to see its source.

But . . . there was something about that name, “Hugh Pennington,” that rang a bell.

I did a blog search and found this, from 2011:

A review by Hugh Pennington of some books about supermarkets that contained the arresting (to me) line:

Consumption [of chicken] in the US has increased steadily since Herbert Hoover’s promise of ‘a chicken in every pot’ in 1928; it rose a hundredfold between 1934 and 1994, from a quarter of a chicken a year to half a chicken a week.

A hundredfold–that’s a lot! I thought it best to look this one up so I Googled “chicken consumption usda” and came up with this document by Jean Buzby and Hodan Farah, which contains this delightfully-titled graph:

OK, so it wasn’t a hundredfold increase, actually only sixfold. People were eating way more than a quarter of a chicken a year in 1934. And chicken consumption did not increase steadily since 1928. The curve is flat until the early 1940s.
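The arithmetic behind the two claims is easy to lay out. Here is a minimal sketch that uses only the figures in the quoted sentence plus the sixfold figure above. The variable names are made up for illustration, and the last step assumes, purely for the sake of the calculation, that the 1994 endpoint in the quote is roughly right; none of these numbers are read off the USDA graph.

```python
# Figures as stated in the quoted sentence (not taken from the USDA data).
chickens_per_year_1934 = 0.25   # "a quarter of a chicken a year"
chickens_per_week_1994 = 0.5    # "half a chicken a week"
WEEKS_PER_YEAR = 52

chickens_per_year_1994 = chickens_per_week_1994 * WEEKS_PER_YEAR

# Ratio implied by the quote: this is where "a hundredfold" comes from.
claimed_ratio = chickens_per_year_1994 / chickens_per_year_1934

# Illustrative assumption: keep the quoted 1994 endpoint and apply a
# sixfold increase instead, to see what 1934 starting point that implies.
SIXFOLD = 6
implied_1934 = chickens_per_year_1994 / SIXFOLD

print(f"ratio implied by the quote: {claimed_ratio:.0f}x")
print(f"1934 level implied by a sixfold rise: {implied_1934:.2f} chickens per year")
```

So the quoted sentence is internally consistent (26 chickens a year divided by a quarter of a chicken is about a hundred); the problem is the starting point. Under the stated assumption, a sixfold rise would put 1934 at something like four chickens a year, not a quarter of one.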

This got me curious: who is Hugh Pennington, exactly? In that issue of the LRB, it says he “sits on committees that advise the World Food Programme and the Food Standards Agency.” I guess he was just having a bad day, or maybe his assistant gave him some bad figures. Too bad they didn’t have Google back in 1994 or he could’ve looked up the numbers directly. “A hundredfold” . . . didn’t that strike him as a big number??

Anyone can have a bad day, and I’m sure I’ve promoted some too-good-to-be-true numbers in my time. Still, it’s funny to see how Pennington expressed such strong skepticism about the plague death counts while just swallowing whole that outlandish claim regarding chicken consumption.

Or maybe there’s more to the chicken story that I haven’t heard, or more to the plague story that Pennington hasn’t heard.

Generalizing for Sampling and Causal Inference (my talk 3pm today at the University of Maryland)

Monday, April 28, 2025, 3:00 PM, 1101 A. James Clark Hall, University of Maryland, College Park:

We can combine model and design-based inference to address the following challenges of generalizing from sample to population: sparse data, small-area estimation, adjustment for non-census variables, cluster sampling, and survey weights. The methods are intellectually exciting and also important in the real world, as we demonstrate using examples in public health and public opinion, medical research, and policy analysis.

There will be discussion from Barry Graubard and Partha Lahiri. I’ll be discussing this paper and some related ideas.

It’s always fun to come back to the University of Maryland. I took classes in probability and stochastic processes there, many years ago.

P.S. Here’s a link to the talk.

I don’t buy that claim that “eating whole foods and avoiding toxins” is a “horseshoe” alliance of the far left and far right

Palko pointed me to this news article, “How the Right Claimed ‘Crunchy’”, which states:

Once, eating whole foods and avoiding toxins was associated with a lefty worldview. Now, being a “crunchy mom” is more often about “health freedom.” . . . As Kennedy [RFK, Jr.] evolved, so did “crunchy,” into a “horseshoe” alliance of far left- and far right-leaning home-schoolers and homesteaders, hippies and religious believers suspicious of conventional medicine who like to grow their own food. . . .

I’m highly suspicious of this claim. Here’s my personal anecdote. I do much of my food shopping at Whole Foods and I buy organic vegetables–so feel free to criticize me as a yuppie (“Whole Paycheck”) or a corporate slave (Amazon) or whatever. Anyway, the people in the store seem pretty normal to me–they don’t look like any kind of “horseshoe alliance.”

I’m suspicious of “horseshoe” arguments more generally. In this case, I think a lot of people want to eat healthy. It’s a kind of cultural reaction: Americans eat more and more junk food, so healthy eating becomes more of something you would have to decide to do. Kind of like how Americans have become less and less religious in the aggregate, so that religious attendance is more of an active statement than it used to be. So, sure, there’s something going on. But I see no reason to believe in this horseshoe thing here. It doesn’t make a lot of sense and I haven’t seen any data supporting it either.

I sent the above to Palko, who responded:

For a much more insightful look at the horseshoe, I do recommend this New Yorker piece from 2017, “Aromatic oils have become big business. But are they medicine or marketing?”, commented on here.

In his post, Palko writes that the “deep dive into the world of essential oils illuminates one of the most interesting corners of 21st century pseudo-science, the medical quackery that somehow appeals to the audiences of both Gwyneth Paltrow’s Goop and Alex Jones’ InfoWars.” I’m guessing it appeals to many people in the middle too.

I sent the above discussion to Weakliem, who responded:

I don’t really agree with the horseshoe model–there seem to be a lot of pathways from various points on the left to various points on the right (and vice versa). You couldn’t really get at this with survey data, but it would be interesting to have a biographical analysis of notable figures.

Horseshoe theories are appealing and in some way comforting, as they’re a way of dismissing the extremes. When someone seems to start on the far right and moves to the far left, or vice-versa, yes, this can happen, but often I think it means they weren’t originally so much on the right or the left as you might think. Meanwhile, lots of beliefs that are nutty, wrong, or just not mainstream can be held by people all over the political spectrum. We’ve discussed a few times already on this blog how, until recently, anti-vax attitudes were not associated with the left or the right, but journalists loved talking about anti-vax liberals. It was just a framing that fit the sorts of stories they wanted to tell.

“Either a 2% or a 75% chance of rain”

Palko writes:

Last Saturday, I [Palko] checked Google and saw the forecast for a week from that day was a 75% chance of rain. That would have been very good news–it’s been dry in Southern California this winter–perhaps too good to be true. I checked a couple of competing sites and saw no indications of rain in the next seven days anywhere in the vicinity. A couple of hours later I checked back in and Google was now in line with all the other forecasts with 5% or less predicted.

As of Thursday, Google is down to 0% for Saturday while the Weather Channel has 18%.

Wow. I’ve noticed sometimes that different online sources give much different weather forecasts, even for the next day. But I’ve never looked into this systematically. I’m reminded a bit of Rajiv Sethi’s evaluations of election forecasts and our earlier post, What does it mean when they say there’s a 30% chance of rain?

Palko continues:

We’ve talked a lot about what it means for a continuously updated prediction such as election outcomes, navigation app travel time estimates, and weather forecasts to be accurate. It’s a complicated question without an objectively true answer. There are many valid metrics, none of which gives us the definitive answer.

Obviously, accuracy is the main objective, but there are other indicators of model quality we can and should keep an eye on. Barring big new data (a major shift in the polls, a recently reported accident on your route), we don’t expect to see huge swings between updates, and if there are a number of competing models largely running off the same data, we expect a certain amount of consistency. If we have a prediction that is inaccurate, displays sudden swings, and makes forecasts wildly divergent from its competitors, that raises some questions.
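The three warning signs in that last passage (inaccuracy, sudden swings, divergence from competitors) can each be turned into a simple number. Here is a rough sketch, not anyone’s actual evaluation method: the Brier score is one standard accuracy metric among the many valid ones, the other two functions are ad hoc summaries, and all the numbers are hypothetical, only loosely patterned on the story above.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def max_swing(updates):
    """Largest absolute change between consecutive updates of one forecast."""
    return max(abs(later - earlier) for earlier, later in zip(updates, updates[1:]))


def divergence(forecast, competitors):
    """Absolute gap between one forecast and the average of its competitors."""
    return abs(forecast - sum(competitors) / len(competitors))


# Hypothetical successive updates from one source for the same target day.
source_updates = [0.75, 0.05, 0.00]
# Hypothetical competing forecasts at the time of the first update.
competitor_forecasts = [0.05, 0.02, 0.00]

swing = max_swing(source_updates)
gap = divergence(source_updates[0], competitor_forecasts)

# Hypothetical track record: past rain probabilities and whether it rained (1) or not (0).
past_probs = [0.75, 0.10, 0.90, 0.20]
past_outcomes = [0, 0, 1, 0]
score = brier_score(past_probs, past_outcomes)

print(f"largest swing between updates: {swing:.2f}")
print(f"gap from competitor average:   {gap:.2f}")
print(f"Brier score (lower is better): {score:.3f}")
```

None of these settles the question of which forecast is best. A forecast can swing for good reasons when big new data arrive, and a forecast that disagrees with its competitors is sometimes the one that is right. The point is just that swings and divergence are cheap to track alongside accuracy.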

The well-meaning but useless or counterproductive social science establishment

Part 1: 2020

Back in the dark days of April, 2020, a team of 42 social scientists published an article that began:

The COVID-19 pandemic represents a massive global health crisis. Because the crisis requires large-scale behaviour change and places significant psychological burdens on individuals, insights from the social and behavioural sciences can be used to help align human behaviour with the recommendations of epidemiologists and public health experts. Here we discuss evidence from a selection of research topics relevant to pandemics, including work on navigating threats, social and cultural influences on behaviour, science communication, moral decision-making, leadership, and stress and coping.

As I wrote at the time, the author list includes someone named Nassim, but not Taleb, and someone named Fowler, but not Anthony. It includes someone named Sander but not Greenland. Indeed it contains no authors with names of large islands. It includes someone named Zion but no one who, I’d guess, can dunk. Also no one from Zion. It contains someone named Dean and someone named Smith but . . . ok, you get the idea. It includes someone named Napper but no sleep researchers named Walker. It includes someone named Rand but no one from Rand. It includes someone named Richard Petty but not the Richard Petty. It includes Cass Sunstein but not Richard Epstein.

As befits an article with 42 authors, there were a lot of references: 6.02 references per author, to be precise. But, even with all these citations, I wasn’t quite sure where this research can be used to “support COVID-19 pandemic response,” as promised in the title of the article.

And the article got me angry. I’ll give some details in a moment, but here I just want to say that (a) I assume those 42 authors were sincerely trying to help the world, (b) their help was framed in terms of claims that social-science research by them and their friends would be helpful, and (c) I don’t think it was. Indeed, I’d argue that their work was counterproductive to the extent that it distracted policymakers from the real issues.

As I wrote back in 2020, my trouble with that 42-authored article is that many of its claims are so open-ended that they don’t tell us much about policy. For example, I’m not sure what we can do with a statement such as this:

Negative emotions resulting from threat can be contagious, and fear can make threats appear more imminent. A meta-analysis found that targeting fears can be useful in some situations, but not others: appealing to fear leads people to change their behaviour if they feel capable of dealing with the threat, but leads to defensive reactions when they feel helpless to act. The results suggest that strong fear appeals produce the greatest behaviour change only when people feel a sense of efficacy, whereas strong fear appeals with low-efficacy messages produce the greatest levels of defensive responses.

I’m not saying that people shouldn’t do research on this topic, I’m just asking what can you do with this sort of thing? In what way does this “support COVID-19 pandemic response”?

Beyond the very indirect connection to policy, I’m also concerned because, of the three references cited in the above passage, one is from PNAS in 2014 and one is from Psychological Science in 2013. That’s not a good sign!

Looking at the papers in more detail . . . The PNAS study found that if you manipulate people’s Facebook news feeds by increasing the proportion of happy or sad stories, people will post more happy or sad things themselves. The Psychological Science study is based on two lab experiments: 101 undergraduates who “participated in a study ostensibly measuring their thoughts about ‘island life,’” and 48 undergraduates who were “randomly assigned to watch one of three videos” of a shill. Also a bunch of hypothesis tests with p-values like 0.04. Anyway, the point here is not to relive the year 2013 but rather to note that the relevance of these p-hacked lab experiments to policy is pretty low.

Also, the abstract of the 42-author paper says, “In each section, we note the nature and quality of prior research, including uncertainty and unsettled issues.” But then the paper goes on to unqualified statements that the authors don’t even seem to agree with.

For example, from the article, under the heading, “Disaster and ‘panic’” [scare quotes in original]:

There is a common belief in popular culture that, when in peril, people panic, especially when in crowds. That is, they act blindly and excessively out of self-preservation, potentially endangering the survival of all. . . . However, close inspection of what happens in disasters reveals a different picture. . . . Indeed, in fires and other natural hazards, people are less likely to die from over-reaction than from under-reaction, that is, not responding to signs of danger until it is too late. In fact, the concept of ‘panic’ has largely been abandoned by researchers because it neither describes nor explains what people usually do in disaster. . . . use of the notion of panic can be actively harmful. News stories that employ the language of panic often create the very phenomena that they purport to condemn. . . .

But, just a bit over two months earlier, one of the authors of that article wrote an op-ed titled, “The Cognitive Bias That Makes Us Panic About Coronavirus”—and he cited lots of social-science research in making that argument.

So what is it? Is “panic” a concept that’s “largely been abandoned by researchers because it neither describes nor explains what people usually do in disaster,” or is it a concept that is so useful that it should be the subject of a widely-circulated op-ed?

I don’t think social science research has changed so much between 28 Feb 2020 (when that pundit wrote about panic and backed it up with citations) and 30 Apr 2020 (when that same pundit coauthored a paper saying that researchers shouldn’t be talking about panic). And, yes, I know that the author of an op-ed doesn’t write the headline. But, for a guy who thinks that “the concept of ‘panic’” is not useful in describing behavior, it’s funny how quickly he leaps to use that word. A quick google turned up this from early 2016: “How Pro Golf Explains the Stock Market Panic.” I guess that, since the Democrats were in power at that time, he had to talk about volatility in the market as an illogical “panic” rather than a rational response to economic conditions.

All joking aside, this just gets me angry. These guys have the influence and media access that allows them to go around promoting themselves and their friends with the PANIC headline whenever they want. But then in their scholarly review article, they lay down the law and tell us how foolish we are to believe in “‘panic.’” They get to talk about panic whenever they want, but when we want to talk about it, the scare quotes come out.

Don’t get me wrong. I’m sure these people mean well. With decades of focused effort they’ve climbed to the top of the greasy academic pole; their students and colleagues tell them, week after week and month after month, how brilliant they are. We were facing a major world event, they wanted to help, so they did what they could do. But sometimes the thing that you’re professionally trained to do isn’t much help.

It’s like, I dunno, suppose you’re really fast and strong, maybe you’re an NFL cornerback! And then some emergency comes up, and a loved one is in the hospital. You want to help. But all the speed and strength and coordination in the world won’t do any good here. The only way you can contribute is through other means, like holding someone’s hand or paying the medical bills or calling the family or whatever. You can help, but your professional skills have no relevance here. This sort of thing comes up all the time; there are areas of life that are orthogonal to each other. But people don’t always want to hear it. They want to help out, not just in the same way that others might help, but by using their special talents.

Interlude: The social-science establishment

We all like to use the term “establishment” to refer to powerful or influential people who we disagree with. But that’s not fair. “Rogue economist” Steven Levitt is part of the establishment (sorry, Steve!). I’m part of the establishment too. I can post grumpy blog entries every day between 9 and 10am eastern time until the end of my days and beyond, and I’m still an MIT-and-Harvard-educated Ivy League professor living a comfortable life and with more influence than most.

Indeed, back in 2020, I expressed agreement with another member of the establishment, University of Chicago political science professor Anthony Fowler, who, in a snappily-titled piece called, “Curing Coronavirus Isn’t a Job for Social Scientists,” wrote:

The public appetite for more information about Covid-19 is understandably insatiable. Social scientists have been quick to respond. . . . While I understand the impulse, the rush to publish findings quickly in the midst of the crisis does little for the public and harms the discipline of social science. . . . Even in normal times, social science suffers from a host of pathologies. . . . A global crisis only exacerbates these problems. . . . and the promise of favorable news coverage in a time of crisis further distorts incentives. . . .

Well put. So what does it mean that I’m the establishment, Anthony Fowler’s the establishment, and we’re arguing that those other 42 people in the establishment are wrong?

The key point, I think, is that the Gang of 42 is arguing, not just for specific ideas–indeed, I argued above that their ideas are not so coherent–but in support of the idea of the social-science establishment. That’s what they mean by saying, “insights from the social and behavioural sciences can be used to help align human behaviour with the recommendations of epidemiologists and public health experts.”

Now, don’t get me wrong, I think the establishment has its uses–somebody has to run the journals, standardize the curricula, etc. Indeed, when I’m writing textbooks, my coauthors and I are making our bid to be part of that establishment, and I’m happy when that happens (as with Bayesian Data Analysis) and frustrated when we don’t fully succeed (as with Regression and Other Stories).

But in the case of covid policy, I think Anthony Fowler was right that social scientists should’ve backed off. And there I see a big problem with the social science establishment, which is that one of its major roles is . . . to promote itself. So to ask the establishment to chill out, to ask Cass Sunstein to, for just once in his life, not make a confident pronouncement or write a book telling everyone else how to think and behave, to tell policymakers that the solutions to their social-behavior problems will not be found in the latest issues of Psychological Science and PNAS . . . that’s a counter-establishment move.

Part 2: 2024

In the meantime, the rise of the political right has brought us what might be called a counter-establishment or new social science establishment that has all of the same problems of the old social science establishment–no, that’s not right, let me say it avoids some of those problems but has introduced its own problems–so we still have a lot to struggle against.

The old establishment had power pose and the collected works of Brian “pizzagate” Wansink. The new establishment has what might be called “podcast science”–cold showers and miracle cures–along with a soft spot for conspiracy theories such as the original Pizzagate story. The old establishment had the scientist as hero; the new establishment has the rich guy as hero. I don’t like either of those narratives.

There is some overlap: both the old and the new establishment are suckers for crude gender and racial essentialism as well as various goofy fallacies of measurement (the Implicit Association Test for the old establishment, finger measurements for the new establishment). Conceptually I don’t see much difference between ridiculous claims in Psychological Science about ovulation and voting, and the sorts of extreme gender essentialism that we hear about from podcasts–either way, it’s what we’ve called “schoolyard evolutionary biology.”

The people who say, “Invite Cass Sunstein to more parties and he’ll help with our public health problems” are not the same people who say, “Cryptocurrency will solve our economic problems,” but I see a similarity: in both cases there’s a group with some political power and intellectual influence wielding some crude theories and spending lots of time promoting themselves.

If you look at the present post, I have a lot more to say in the “2020” section than the “2024” section. Part of that is the benefit of hindsight–much of the first section above is copied from my posts from May 2020, and I like what I said there–but most of it is that I’m a lot more familiar with the world of bad science in academia than I am with the world of bad science in the partisan conservative media, even though it is the latter that has more influence now. I’ll have to leave it to others to pick up that particular baton; here I just want to register my objection to well-meaning but useless or counterproductive social science establishments of all kinds, and I want to suggest that the lessons we have learned (or should’ve learned) from 2020 about distrusting the establishment should apply to the new establishment as well.

My message is not “don’t trust anybody,” and it certainly isn’t “don’t ever trust an Ivy League professor” (as that would lead us straight to the paradox of the liar). And social science theories can be valuable for policy, including theories associated with the left, right, and center. But . . . hmmm, I’m not quite sure how to finish this paragraph. I’m not completely sure what my message here is, or should be. So let me just stop here. A luxury I have since I’m blogging, not writing a news article or journal article or book that needs to come to some coherent conclusion.

Let’s take apart this claim by Christopher Lasch from 1977 that hasn’t aged well: “Such changes have made both racist ideology and the ideology of martial conquest, appropriate to an earlier age of empire-building, increasingly anachronistic.”

As a social scientist, I find it instructive to look at the mistaken assessments written by thoughtful people in past eras.

I thought of this after reading this passage from 1977 written by sociologist Christopher Lasch:

The functional significance of racism in Western society is that it once provided ideological support for colonialism and for backward labor systems based on slavery or peonage. These forms of exploitation rested on the direct, unconcealed appropriation of surplus value by the master class, which justified its domination on the grounds that the lower orders, disqualified for self-government by virtue of racial inferiority or lowly birth, needed and benefited from their masters’ protection. Racism and paternalism were two sides of the same coin, the “white man’s burden.”

Capitalism has gradually substituted the free market for direct forms of domination. Within advanced countries, it has converted the serf or slave into a free worker. It has also revolutionized colonial relations. Instead of imposing military rule on their colonies, industrial nations now govern through client states, ostensibly sovereign, which keep order in their stead. Such changes have made both racist ideology and the ideology of martial conquest, appropriate to an earlier age of empire-building, increasingly anachronistic.

Setting aside some of the now-unfashionable socialist jargon (“appropriation of surplus value”), it’s striking how wrong he was. Racist ideology and the ideology of martial conquest aren’t what they were in 1861, sure, but they haven’t gone away, nor do they seem increasingly anachronistic. Racist ideology is no longer being used to justify slavery (except sometimes retroactively, from various neo-Confederates and neo-Nazis); it’s being used to justify economic and social inequality. As for the ideology of martial conquest, we’re seeing that now in Ukraine and in talk about Mexico and Canada, also Taiwan, Gaza, and probably some other places that aren’t coming to mind right now.

Lasch was hardly unique in this error. Indeed, it would be fair to say he deserves some credit for promoting an “end of history” thesis over a decade before the collapse of the Soviet Union. From the perspective of 1977, it seemed to make sense to think of racism and martial conquest as on the way out.

Regarding racism: in the context of U.S. politics, racism was alive and well in 1977 (in the school busing issue in Boston, for example), but arguably the racism was more of a vehicle for political realignment than anything else, a “card” for liberal and conservative politicians to play while they could, but a declining aspect of American culture, something that was only held onto by various left-behind and disappearing groups. Something we can’t say now, given the centrality of race to influential political analyses and movements on the left and the right.

Regarding martial conquest: the U.S. had just lost in Vietnam, China was soon to give up trying to boss Vietnam around, Vietnam had its hands more than full with Cambodia, the Soviets were soon to fail in their invasion of Afghanistan. Martial conquest was indeed looking like a dead end, not just in the large scale of mutual assured destruction but pretty much anywhere in the world. The loser Argentine generals were soon to fail to conquer the tiny Falkland Islands.

Up until the early 2000s, it could well be argued that racism and martial conquest were going away, so instead of criticizing Lasch for not forecasting the post-2010 world, we should perhaps credit him for anticipating a trend that would continue for more than twenty years after he wrote his article.

That said, I have one concern.

I can accept Lasch’s statement about “the ideology of martial conquest, appropriate to an earlier age of empire-building,” being “increasingly anachronistic.” He was wrong, but the geopolitics of the 1960s-1990s seemed to pretty strongly support this “end of history” take.

Regarding the racism, though, I don’t think Lasch was fully thinking through the issue. As long as there is political inequality, and as long as there is economic inequality, there will be a desire for explanations and justifications of these patterns, and racism is an always-available source of such explanations. So, even setting aside current disputes about race in science and politics, I have the feeling that racism is here to stay.

Again, the point of this is not to say, Hey, this dude from 50 years ago got things wrong!, but rather to reflect upon the perspectives that people had back then. Our own takes are time-bound, and one way to understand this is to consider time-bound takes from generally sensible people from earlier times.

“The difference between ‘significant’ and ‘not significant’ is not itself statistically significant” . . . in the wild!

Here’s the statistical principle (from my 2006 paper with Hal Stern).

And here’s the recent example:

Kevin Lewis points us to this article, “The long-term effects of marijuana use on mental health outcomes,” which begins:

We estimate the long-term effect of initiating marijuana use in adolescence on several mental health outcomes later in life. We use the first two waves (1994–1996) and the fifth wave (2016–2018) of the National Longitudinal Study of Adolescent to Adult Health (Add Health) and estimate instrumental variables models with school-level fixed effects, where the instrument is the respondents’ perceptions about their friends’ marijuana use. We find that marijuana use in adolescence is associated with a significant increase in anxiety approximately twenty years later. The increase in anxiety is only present among females and is stronger among females who used marijuana regularly as adolescents.

Did you catch the error?

Read that last sentence carefully: the phrase “only present among females” has the flavor of comparison of significance levels.

And, indeed, here are the results:

You can just look at the rightmost column (“Anxiety”). The estimate is 0.05 +/- 0.04 for boys and 0.11 +/- 0.05 for girls. A quick calculation gives an approximate estimated difference of 0.06 +/- 0.06: not statistically significant!
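For anyone who wants to check that quick calculation, here is a minimal sketch. It assumes the "+/-" values are standard errors and that the estimates for boys and girls are independent (they come from separate subgroups, but that is my assumption, not something stated in the paper). The function name is just for illustration.

```python
import math

def difference_with_se(est_a, se_a, est_b, se_b):
    """Return (est_b - est_a) and its standard error.

    Assumes the two estimates are independent, so their
    variances add.
    """
    diff = est_b - est_a
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return diff, se_diff

# Anxiety estimates quoted above: boys 0.05 +/- 0.04, girls 0.11 +/- 0.05
diff, se_diff = difference_with_se(0.05, 0.04, 0.11, 0.05)
z = diff / se_diff

print(f"difference = {diff:.2f} +/- {se_diff:.2f}, z = {z:.2f}")
# prints: difference = 0.06 +/- 0.06, z = 0.94
```

The estimated difference is about one standard error from zero, which is nowhere near the conventional threshold of two.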

I’m also suspicious of the “stronger among females who used marijuana regularly”—this is based on a coefficient estimate getting larger when they modify their marijuana use indicator, and again this is not the sort of direct comparison that would be recommended.

It’s not that I’m saying that non-statistically-significant results should be ignored. The authors should report all comparisons of interest. It’s just that comparisons should come with uncertainties. Comparing significance levels doesn’t do that. They’re creating a misleading sense of certainty in their analysis. I have no reason to think they’re doing it wrong on purpose; it’s just an unfortunate side effect of dichotomizing results based on significance thresholds.

The point of this post is not to focus on this particular article published in an obscure journal. It’s just yet another example of something we’ve seen many times before and unfortunately will see many times again.

Friday 10am: Online conversation on “Experiments, Causal Inference, and Limits of Evidence” with Nancy Cartwright and Berna Devezer

It’s organized by Martin Paul Fritze and Cait Lamberton from the Center for Empirical Philosophy and Behavioral Insights. It takes place at 10am on Fri 25 Apr 2025, and the zoom link is here.

The three of us will each speak for 15 minutes and then there will be discussion. I have no idea what any of us will say . . . it should be full of surprises.

As background here are some of my earlier interactions with Cartwright and Devezer:

Benefits and limitations of randomized controlled trials: I agree with Deaton and Cartwright

More on possibly rigor-enhancing practices in quantitative psychology research

And, as a bonus, this from Dan Simpson:

Dan’s Paper Corner: Can we model scientific discovery and what can we learn from the process?

and this from Jessica Hullman:

Taking theory more seriously in psychological science

Show up to this event and come prepared with some tough questions!

The Error-Reversal Heuristic: How would you have reacted had the mistake gone in the opposite direction?

Something we’ve seen with depressing regularity is that researchers do something sloppy—perhaps even deceitful or fraudulent, but oftentimes just sloppy—and then when the error is pointed out, they reply that the main conclusions of the study have not changed.

It often looks ridiculous, and when we post on these things we put them in the Zombies category—but, sometimes, sure it must be the case. Someone claims some big result, but it still seems to make sense that they got the direction right, so maybe the magnitude doesn’t matter?

How to think about this?

My suggestion: try the Error-Reversal Heuristic. Imagine how the promoter of the idea would’ve reacted had the mistake gone in the opposite direction.

Here are some examples.

1. Published paper from an organization called Toxic-Free Future claims that a toxin is at 80% of the legal limit. They screwed up their calculation—it’s actually only 8%—and here’s their response: “it is important to note that this does not impact our results . . . and our recommendations remain the same.”

The Error-Reversal Heuristic: Suppose someone else had done a study and found that the level of exposure was “8% of the reference dose, thus, a potential concern,” but they’d done the calculation wrong, and the level was really 80% of the reference dose. Then I assume that the folks at Toxic-Free Future wouldn’t say that the recommendations remain the same, right? They’d say the exposure had been underestimated by a factor of 10 and that’s a big deal!

2. Lancet (uh oh) published a paper claiming that hydroxychloroquine/chloroquine was killing people. It turns out the work was fraudulent, which perhaps should not surprise us, given the strong criticism by James “not the racist dude” Watson, who wrote at the time, “The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people. . . . The most obvious confounder is disease severity . . . The authors say that they adjust for disease severity but actually they use just two binary variables: oxygen saturation and qSOFA score. The second one has actually been reported to be quite bad for stratifying disease severity in COVID. The biggest problem is that they include patients who received HCQ/CQ treatment up to 48 hours post admission. . . . This temporal aspect cannot be picked up a single severity measurement. In short, seeing such huge effects really suggests that some very big confounders have not been properly adjusted for. . . .”

Five days after the problems with this paper came out, a press officer for Lancet wrote that “The results and conclusions reported in the study remain unchanged.”

Ummm . . . time for the Error-Reversal Heuristic: Suppose the results had originally been reported as kinda small, then it turned out a mistake had been made, and the actual effect of the drug was to double the mortality rate. How would the promoters have reacted? I’m pretty sure they’d say that such an effect is a big deal!

3. A published paper, “Attractive Names Sustain Increased Vegetable Intake in Schools” (guess who’s the author? Hint: “Pizzagate”) made big claims. It turned out that the data in the paper were incoherent, and a correction was written that was longer than the original paper. According to Retraction Watch: “Some of the changes include explaining the children studied were preschoolers (3-5 years old), not preteens (8-11), as originally claimed.” The author’s response to all of this? You got it: “These mistakes and omissions do not change the general conclusion of the paper.”

Time for the Error-Reversal Heuristic! What if things had gone the other way? Someone published a null result on the effects of attractive names on vegetable intake in schools, but it turned out that the data had been entirely garbled, and in fact the study was on preschoolers, not preteens. Would Mister Cornell Food Researcher then reply that the general conclusions did not change? Hell no! He would’ve said that any claims of a null finding were invalidated by the sloppiness of the study.

4. A notorious member of the National Academy of Sciences published a paper with t-statistics reported as 5.03 and 11.14. But those were in error! The actual t-statistics were 1.8 and 3.3. How did the author reply? You’ll never guess: this “does not change the conclusion of the paper.” As I wrote at the time:

This is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.

When the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.

But . . . let’s try the Error-Reversal Heuristic. Suppose the published t statistics had been 1.8 and 3.3, but that had been an error, and they really were 5.03 and 11.14. How would the author have responded then? Probably something about how strong the evidence is, right?

5. We’ve also seen papers where the result goes in the opposite direction of the pre-registration. Had it gone in the same direction as the pre-registration, it would be hailed as a success, so when it goes in the opposite direction . . . maybe not so much of a success? There was the notorious case of the paper about ovulation and clothing with a finding that failed to replicate in a new study by the same authors. They refused to let go of the original, fatally-flawed claim and instead argued that they’d discovered an interaction. And then there was the “gremlins” article that approached the Platonic ideal of having more errors than data points. The only thing that remained constant amid all the wreckage was . . . the conclusion.

6. And, most consequentially, there was the notorious “Excel error” paper, where fatal flaws were discovered and the authors dismissed this as an “academic kerfuffle,” which isn’t quite “the conclusions are unchanged,” but close enough. Again, imagine if someone had published a null result and then, once the data had been fixed, a big estimate in their preferred direction had shown up. I think they would’ve said this was a big deal.

I’m happy to retell the above stories as often as might be needed (recall Paul Alper’s horse); my point in this post is to give examples of the Error-Reversal Heuristic.

P.S. Sometimes people do it right. Here’s an example where fatal flaws were found in a published paper, and the authors concluded, “A reanalysis of the data leads to revised findings that do not replicate the results in the original paper.” So it is possible.