Mindlessness in the interpretation of a study on mindlessness (and why you shouldn’t use the word “whom” in your dating profile)

This is a long post, so let me give you the tl;dr right away: Don’t use the word “whom” in your dating profile.

OK, now for the story. Fasten your seat belts, it’s going to be a bumpy night.

It all started with this message from Dmitri with subject line, “Man I hate to do this to you but …”, which continued:

How could I resist?

https://www.cnbc.com/2024/02/15/using-this-word-can-make-you-more-influential-harvard-study.html

I’m sorry, let me try again … I had to send this to you BECAUSE this is the kind of obvious shit you like to write about. I like how they didn’t even do their own crappy study they just resurrected one from the distant past.

OK, ok, you don’t need to shout about it!

Following the link, we see this breathless press-release-style CNBC news story:

Using this 1 word more often can make you 50% more influential, says Harvard study

Sometimes, it takes a single word — like “because” — to change someone’s mind.

That’s according to Jonah Berger, a marketing professor at the Wharton School of the University of Pennsylvania who’s compiled a list of “magic words” that can change the way you communicate. Using the word “because” while trying to convince someone to do something has a compelling result, he tells CNBC Make It: More people will listen to you, and do what you want.

Berger points to a nearly 50-year-old study from Harvard University, wherein researchers sat in a university library and waited for someone to use the copy machine. Then, they walked up and asked to cut in front of the unknowing participant.

They phrased their request in three different ways:

“May I use the Xerox machine?”
“May I use the Xerox machine because I have to make copies?”
“May I use the Xerox machine because I’m in a rush?”
Both requests using “because” made the people already making copies more than 50% more likely to comply, researchers found. Even the second phrasing — which could be reinterpreted as “May I step in front of you to do the same exact thing you’re doing?” — was effective, because it indicated that the stranger asking for a favor was at least being considerate about it, the study suggested.

“Persuasion wasn’t driven by the reason itself,” Berger wrote in a book on the topic, “Magic Words,” which published last year. “It was driven by the power of the word.” . . .

Let’s look into this claim. The first thing I did was click to the study—full credit to CNBC Make It for providing the link—and here’s the data summary from the experiment:

If you look carefully and do some simple calculations, you’ll see that the percentage of participants who complied was 37.5% under treatment 1, 50% under treatment 2, and 62.5% under treatment 3. So, ok, it’s not literally true that both requests using “because” made the people already making copies more than 50% more likely to comply: 0.50/0.375 = 1.33, and an increase of 33% is not “more than 50%.” But, sure, it’s a positive result. There were 40 participants in each treatment, so the standard error is approximately 0.5/sqrt(40) = 0.08 for each of those averages. The key difference here is 0.50 – 0.375 = 0.125; that’s the difference between the compliance rates under the treatments “May I use the Xerox machine?” and “May I use the Xerox machine because I have to make copies?”, and it will have a standard error of approximately sqrt(2)*0.08 = 0.11.

The quick summary from this experiment: an observed difference in compliance rates of 12.5 percentage points, with a standard error of 11 percentage points. I don’t want to say “not statistically significant,” so let me just say that the estimate is highly uncertain, and I have no real reason to believe it would replicate.
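
If you want to check the arithmetic, here’s a quick sketch in Python (just redoing the calculations above):

import math

n = 40                                  # participants per treatment
p1, p2, p3 = 0.375, 0.50, 0.625         # compliance: no reason, "I have to make copies," "I'm in a rush"

se_each = 0.5 / math.sqrt(n)            # rough standard error for each proportion, about 0.08
diff = p2 - p1                          # "because I have to make copies" vs. no reason
se_diff = math.sqrt(2) * se_each        # standard error of that difference, about 0.11

print(f"ratio: {p2 / p1:.2f}")                        # 1.33, i.e., a 33% increase, not 50%
print(f"difference: {diff:.3f} +/- {se_diff:.3f}")    # 0.125 +/- 0.112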

But wait, you say: the paper was published. Presumably it has a statistically significant p-value somewhere, no? The answer is, yes, they have some “p < .05” results, just not for that particular comparison. Indeed, if you just look at the top rows of that table (Favor = small), then the difference is 0.93 – 0.60 = 0.33 with a standard error of sqrt(0.6*0.4/15 + 0.93*0.07/15) = 0.14, so that particular estimate is just more than two standard errors away from zero. Whew! But now we’re getting into forking-paths territory:

- Noisy data
- Small sample
- Lots of possible comparisons
- Any comparison that’s statistically significant will necessarily be huge
- Open-ended theoretical structure that could explain just about any result

I’m not saying the researchers were trying to do anything wrong. But remember, honesty and transparency are not enuf. Such a study is just too noisy to be useful.

But, sure, back in the 1970s many psychology researchers not named Meehl weren’t aware of these issues. They seem to have been under the impression that if you gather some data and find something statistically significant for which you can come up with a good story, then you’ve discovered a general truth.

What’s less excusable is a journalist writing this in the year 2024. But it’s no surprise, conditional on the headline, “Using this 1 word more often can make you 50% more influential, says Harvard study.”

But what about that book by the University of Pennsylvania marketing professor? I searched online, and, fortunately for us, the bit about the Xerox machine is right there in the first chapter, in the excerpt we can read for free. Here it is:

He got it wrong, just like the journalist did! It’s not true that including the meaningless reason increased persuasion just as much as the valid reason did. Look at the data! The outcomes under the three treatments were 37.5%, 50%, and 62.5%. 50% – 37.5% ≠ 62.5% – 37.5%. Ummm, ok, he could’ve said something like, “Among a selected subset of the data with only 15 or 16 people in each treatment, including the meaningless reason increased persuasion just as much as the valid reason did.” But that doesn’t sound so impressive! Even if you add something like, “and it’s possible to come up with a plausible theory to go with this result.”

The book continues:

Given the flaws in the description of the copier study, I’m skeptical about these other claims.

But let me say this. If it is indeed true that using the word “whom” in online dating profiles makes you 31% more likely to get a date, then my advice is . . . don’t use the word “whom”! Think of it from a potential-outcomes perspective. Sure, you want to get a date. But do you really want to go on a date with someone who will only go out with you if you use the word “whom”?? That sounds like a really pretentious person, not a fun date at all!

OK, I haven’t read the rest of the book, and it’s possible that somewhere later on the author says something like, “OK, I was exaggerating a bit on page 4 . . .” I doubt it, but I guess it’s possible.

Replications, anyone?

To return to the topic at hand: In 1978 a study was conducted with 120 participants in a single location. The study was memorable enough to be featured in a business book nearly fifty years later.

Surely the finding has been replicated?

I’d imagine yes; on the other hand, if it had been replicated, this would’ve been mentioned in the book, right? So it’s hard to know.

I did a search, and the article does seem to have been influential:

It’s been cited 1514 times—that’s a lot! Google lists 55 citations in 2023 alone, and in what seem to be legit journals: Human Communication Research, Proceedings of the ACM, Journal of Retailing, Journal of Organizational Behavior, Journal of Applied Psychology, Human Resources Management Review, etc. Not core science journals, exactly, but actual applied fields, with unskeptical mentions such as:

What about replications? I searched on *langer blank chanowitz 1978 replication* and found this paper by Folkes (1985), which reports:

Four studies examined whether verbal behavior is mindful (cognitive) or mindless (automatic). All studies used the experimental paradigm developed by E. J. Langer et al. In Studies 1–3, experimenters approached Ss at copying machines and asked to use it first. Their requests varied in the amount and kind of information given. Study 1 (82 Ss) found less compliance when experimenters gave a controllable reason (“… because I don’t want to wait”) than an uncontrollable reason (“… because I feel really sick”). In Studies 2 and 3 (42 and 96 Ss, respectively) requests for controllable reasons elicited less compliance than requests used in the Langer et al study. Neither study replicated the results of Langer et al. Furthermore, the controllable condition’s lower compliance supports a cognitive approach to social interaction. In Study 4, 69 undergraduates were given instructions intended to increase cognitive processing of the requests, and the pattern of compliance indicated in-depth processing of the request. Results provide evidence for cognitive processing rather than mindlessness in social interaction.

So this study concludes that the result didn’t replicate at all! On the other hand, it’s only a “partial replication,” and indeed they do not use the same conditions and wording as in the original 1978 paper. I don’t know why not, except maybe that exact replications traditionally get no respect.

Langer et al. responded in that journal, writing:

We see nothing in her results [Folkes (1985)] that would lead us to change our position: People are sometimes mindful and sometimes not.

Here they’re referring to the table from the 1978 study, reproduced at the top of this post, which shows a large effect of the “because I have to make copies” treatment under the “Small Favor” condition but no effect under the “Large Favor” condition. Again, given the huge standard errors here, we can’t take any of this seriously, but if you just look at the percentages without considering the uncertainty, then, sure, that’s what they found. Thus, in their response to the partial replication study that did not reproduce their results, Langer et al. emphasized that their original finding was not a main effect but an interaction: “People are sometimes mindful and sometimes not.”

That’s fine. Psychology studies often measure interactions, as they should: the world is a highly variable place.

But, in that case, everyone’s been misinterpreting that 1978 paper! When I say “everybody,” I mean this recent book by the business school professor and also the continuing references to the paper in the recent literature.

Here’s the deal. The message that everyone seems to have learned, or believed they learned, from the 1978 paper is that meaningless explanations are as good as meaningful explanations. But, according to the authors of that paper when they responded to criticism in 1985, the true message is that this trick works sometimes and sometimes not. That’s a much weaker message.

Indeed the study at hand is too small to draw any reliable conclusions about any possible interaction here. The most direct estimate of the interaction effect from the above table is (0.93 – 0.60) – (0.24 – 0.24) = 0.33, with a standard error of sqrt(0.93*0.07/15 + 0.60*0.40/15 + 0.24*0.76/25 + 0.24*0.76/25) = 0.19. So, no, I don’t see much support for the claim in this post from Psychology Today:

So what does this all mean? When the stakes are low people will engage in automatic behavior. If your request is small, follow your request with the word “because” and give a reason—any reason. If the stakes are high, then there could be more resistance, but still not too much.
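
For the record, here’s that interaction calculation spelled out in Python, using the same numbers as above:

import math

# Compliance rates and sample sizes from the table: small-favor and large-favor conditions,
# comparing "because I have to make copies" to no reason given.
p_small_because, p_small_control = 0.93, 0.60    # n = 15 in each small-favor group
p_large_because, p_large_control = 0.24, 0.24    # n = 25 in each large-favor group

interaction = (p_small_because - p_small_control) - (p_large_because - p_large_control)
se = math.sqrt(0.93 * 0.07 / 15 + 0.60 * 0.40 / 15
               + 0.24 * 0.76 / 25 + 0.24 * 0.76 / 25)
print(f"interaction: {interaction:.2f}, standard error: {se:.2f}")    # 0.33, 0.19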

This happens a lot in unreplicable or unreplicated studies: a result is found under some narrow conditions, and then it is taken to have very general implications. This is just an unusual case where the authors themselves pointed out the issue. As they wrote in their 1985 article:

The larger concern is to understand how mindlessness works, determine its consequences, and specify better the conditions under which it is and is not likely to occur.

That’s a long way from the claim in that business book that “because” is a “magic word.”

Like a lot of magic, it only works under some conditions, and you can’t necessarily specify those conditions ahead of time. It works when it works.

There might be other replication studies of this copy machine study. I guess you couldn’t really do it now, because people don’t spend much time waiting at the copier. But the office copier was a thing for several decades. So maybe there are even some exact replications out there.

In searching for a replication, I did come across this post from 2009 by Mark Liberman that criticized yet another hyping of that 1978 study, this time in a paper by psychologist Daniel Kahneman in the American Economic Review. Kahneman wrote:

Ellen J. Langer et al. (1978) provided a well-known example of what she called “mindless behavior.” In her experiment, a confederate tried to cut in line at a copying machine, using various preset “excuses.” The conclusion was that statements that had the form of an unqualified request were rejected (e.g., “Excuse me, may I use the Xerox machine?”), but almost any statement that had the general form of an explanation was accepted, including “Excuse me, may I use the Xerox machine because I want to make copies?” The superficiality is striking.

As Liberman writes, this represented a “misunderstanding of the 1978 paper’s results, involving both a different conclusion and a strikingly overgeneralized picture of the observed effects.” Liberman performs an analysis of the data from that study which is similar to what I have done above.

Liberman summarizes:

The problem with Prof. Kahneman’s interpretation is not that he took the experiment at face value, ignoring possible flaws of design or interpretation. The problem is that he took a difference in the distribution of behaviors between one group of people and another, and turned it into generic statements about the behavior of people in specified circumstances, as if the behavior were uniform and invariant. The resulting generic statements make strikingly incorrect predictions even about the results of the experiment in question, much less about life in general.

Mindfulness

The key claim of all this research is that people are often mindless: they respond to the form of a request without paying attention to its context, with “because” acting as a “magic word.”

I would argue that this is exactly the sort of mindless behavior being exhibited by the people who are promoting that copying-machine experiment! They are taking various surface aspects of the study and using them to draw large, unsupported conclusions, without being mindful of the details.

In this case, the “magic words” are things like “p < .05,” “randomized experiment,” “Harvard,” “peer review,” and “Journal of Personality and Social Psychology” (this notwithstanding). The mindlessness comes from not looking into what exactly was in the paper being cited.

In conclusion . . .

So, yeah, thanks for nothing, Dmitri! Three hours of my life spent going down a rabbit hole. But, hey, if any readers who are single have read far enough down in the post to see my advice not to use “whom” in your dating profile, it will all have been worth it.

Seriously, though, the “mindlessness” aspect of this story is interesting. The point here is not, Hey, a 50-year-old paper has some flaws! Or the no-less-surprising observation: Hey, a pop business book exaggerates! The part that fascinates me is that there’s all this shaky research that’s being taken as strong evidence that consumers are mindless—and the people hyping these claims are themselves demonstrating the point by mindlessly following signals without looking into the evidence.

The ultimate advice that the mindfulness gurus are giving is not necessarily so bad. For example, here’s the conclusion of that online article about the business book:

Listen to the specific words other people use, and craft a response that speaks their language. Doing so can help drive an agreement, solution or connection.

“Everything in language we might use over email at the office … [can] provide insight into who they are and what they’re going to do in the future,” says Berger.

That sounds ok. Just forget all the blather about the “magic words” and the “superpowers,” and forget the unsupported and implausible claim that “Arguments, requests and presentations aren’t any more or less convincing when they’re based on solid ideas.” As is often the case, I think these Ted-talk-style recommendations would be on more solid ground if they were just presented as the product of common sense and accumulated wisdom, rather than leaning on some 50-year-old psychology study that just can’t bear the weight. But maybe you can’t get the airport book and the Ted talk without a claim of scientific backing.

Don’t get me wrong here. I’m not attributing any malign motivations to any of the people involved in this story (except for Dmitri, I guess). I’m guessing they really believe all this. And I’m not using “mindless” as an insult. We’re all mindless sometimes—that’s the point of the Langer et al. (1978) study; it’s what Herbert Simon called “bounded rationality.” The trick is to recognize your areas of mindlessness. If you come to an area where you’re being mindless, don’t write a book about it! Even if you naively think you’ve discovered a new continent. As Mark Twain apparently never said, it ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

The usual disclaimer

I’m not saying the claims made by Langer et al. (1978) are wrong. Maybe it’s true that, under conditions of mindlessness, all that matters is the “because” and any empty explanation will do; maybe the same results would show up in a preregistered replication. All I’m saying is that the noisy data that have been presented don’t provide any strong evidence in support of such claims, and that’s what bothers me about all those confident citations in the business literature.

P.S.

After writing the above post, I sent this response to Dmitri:

OK, I just spent 3 hours on this. I now have to figure out what to do with this after blogging it, because I think there are some important points here. Still, yeah, you did a bad thing by sending this to me. These are 3 hours I could’ve spent doing real work, or relaxing . . .

He replied:

I mean, yeah, that’s too bad for you, obviously. But … try to think about it from my point of view. I am more influential, I got you to work on this while I had a nice relaxing post-Valentine’s day sushi meal with my wife (much easier to get reservations on the 15th and the flowers are a lot cheaper), while you were toiling away on what is essentially my project. I’d say the magic words did their job.

Good point! He exploited my mindlessness. I responded:

Ok, I’ll quote you on that one too! (minus the V-day details).

I’m still chewing on your comment that you appreciate the Beatles for their innovation as much as for their songs. The idea that there are lots of songs of similar quality but not so much innovation, that’s interesting. The only thing is that I don’t know enough about music, even pop music, to have a mental map of where everything fits in. For example, I recently heard that Coldplay song, and it struck me that it was in the style of U2. But I don’t really know if U2 was the originator of that soaring sound. I guess Pink Floyd is kinda soaring too, but not quite in the same way . . . etc etc … the whole thing was frustrating to me because I had no sense of whether I was entirely bullshitting or not.

So if you can spend 3 hours writing a post on the above topic, we’ll be even.

Dmitri replied:

I am proud of the whole “Valentine’s day on the 15th” trick, so you are welcome to include it. That’s one of our great innovations. After the first 15-20 Valentine’s days, you can just move the date a day later and it is much easier.

And, regarding the music, he wrote:

U2 definitely invented a sound, with the help of their producer Brian Eno.

It is a pretty safe bet that every truly successful musician is an innovator—once you know the sound it is easy enough to emulate. Beethoven, Charlie Parker, the Beatles, all the really important guys invented a forceful, effective new way of thinking about music.

U2 is great, but when I listened to an entire U2 song from beginning to end, it seemed so repetitive as to be unlistenable. I don’t feel that way about the Beatles or REM. But just about any music sounds better to me in the background, which I think is a sign of my musical ignorance and tone-deafness (for real, I’m bad at recognizing pitches) more than anything else. I guess the point is that you’re supposed to dance to it, not just sit there and listen.

Anyway, I warned Dmitri about what would happen if I post his Valentine’s Day trick:

I post this, then it will catch on, and it will no longer work . . . just warning ya! You’ll have to start doing Valentine’s Day on the 16th, then the 17th, . . .

To which Dmitri responded:

Yeah but if we stick with it, it will roll around and we will get back to February 14 while everyone else is celebrating Valentines Day on these weird wrong days!

I’ll leave him with the last word.

How large is that treatment effect, really? (My talk at the NYU economics seminar, Thurs 18 Apr, rescheduled from 7 Mar)

Thurs 18 Apr 2024 (rescheduled from 7 Mar), 12:30pm at 19 West 4th St., room 517:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometric analyses typically focus on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.

Our new book, Active Statistics, is now available!

Coauthored with Aki Vehtari, this new book is lots of fun, perhaps the funnest I’ve ever been involved in writing. And it’s stuffed full of statistical insights. The webpage for the book is here, and the link to buy it is here or directly from the publisher here.

With hundreds of stories, activities, and discussion problems on applied statistics and causal inference, this book is a perfect teaching aid, a perfect adjunct to a self-study program, and an enjoyable bedside read if you already know some statistics.

Here’s the quick summary:

This book provides statistics instructors and students with complete classroom material for a one- or two-semester course on applied regression and causal inference. It is built around 52 stories, 52 class-participation activities, 52 hands-on computer demonstrations, and 52 discussion problems that allow instructors and students to explore the real-world complexity of the subject. The book fosters an engaging “flipped classroom” environment with a focus on visualization and understanding. The book provides instructors with frameworks for self-study or for structuring the course, along with tips for maintaining student engagement at all levels, and practice exam questions to help guide learning. Designed to accompany the authors’ previous textbook Regression and Other Stories, its modular nature and wealth of material allow this book to be adapted to different courses and texts or be used by learners as a hands-on workbook.

It’s got 52 of everything because it’s structured around a two-semester class, with 13 weeks per semester and 2 classes per week. It’s really just bursting with material, including some classic stories and lots of completely new material. Right off the bat we present a statistical mystery that arose with a Wikipedia experiment, and we have a retelling of the famous Literary Digest survey story but with a new and unexpected twist (courtesy of Sharon Lohr and J. Michael Brick). And the activities have so much going on! One of my favorites is a version of the two truths and a lie game that demonstrates several different statistical ideas.

People have asked how this differs from my book with Deborah Nolan, Teaching Statistics: A Bag of Tricks. My quick answer is that Active Statistics has a lot more of everything, it’s structured to cover an entire two-semester course in order, and it’s focused on applied statistics. Including a bunch of stories, activities, demonstrations, and problems on causal inference, a topic that is not always so well integrated into the statistics curriculum. You’re gonna love this book.

You can buy it here or here. It’s only 25 bucks, which is an excellent deal considering how stuffed it is with useful content. Enjoy.

Hey! Here’s some R code to make colored maps using circle sizes proportional to county population.

Kieran Healy shares some code and examples of colored maps where each region is given a circle in proportion to its population. He calls these “Dorling cartograms,” which sounds kinda mysterious to me but I get that there’s no easy phrase to describe them. It’s clear in the pictures, though:

I wrote to Kieran asking if it was possible to make the graphs without solid circles around each point, as that could make them more readable.

He replied:

Yeah it’s easy to do that, you just give different parameters to geom_sf(), specifically you set the linewidth to 0 so no border is drawn on the circles. So instead of geom_sf(color="gray30") or whatever you say geom_sf(linewidth=0). But I think this does not in fact make things more readable with a white, off-white, or light gray background:

The circle borders do a fair amount of work to help the eye see where the circles actually are as distinct elements. It’s possible to make the border more subtle and still have it work:

In this version the circle borders are only a *very slightly* darker gray than the background, but it makes a big difference still.

Finally you could also remove the circle borders but make the background very dark, like this:

Not bad, though the issue becomes properly seeing the dark orange—especially smaller counties with very high pct Black. This would work better with one of the other palettes.

Interesting. Another win for ggplot.

How to code and impute income in studies of opinion polls?

Nate Cohn asks:

What’s your preferred way to handle income in a regression when income categories are inconsistent across several combined survey datasets? Am I best off just handling this with multiple categorical variables? Can I safely create a continuous variable?

My reply:

I thought a lot about this issue when writing Red State, Blue State. My preferred strategy is to use a variable that we could treat as continuous. For example, when working with ANES data I was using income categories 1, 2, 3, 4, 5, which corresponded to the 1–16th, 16–33rd, 34–66th, 67–95th, and 96–100th percentiles of income. If you have different surveys with different categories, you could use a roughly consistent scaling: for example, one survey might be coded as 1, 3, 5, 7 and another as 2, 4, 6, 8. I expect that other people would disagree with this advice, but this is the sort of thing that I was doing. I’m not so much worried about the scale being imperfect or nonlinear. But if you have a non-monotonic relation, you’ll have to be more careful.
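
To make that concrete, here’s a rough sketch in Python of the sort of recoding I have in mind. The ANES midpoints come from the percentile categories above; the second survey’s cutpoints are made up just for illustration:

# Map each survey's income categories to rough population-percentile midpoints
# so the codes are comparable across surveys.
anes_midpoints = {1: 8, 2: 25, 3: 50, 4: 81, 5: 98}    # midpoints of the 1-16, 16-33, 34-66, 67-95, 96-100 ranges
other_midpoints = {1: 12, 2: 37, 3: 63, 4: 88}         # hypothetical 4-category survey

def income_score(categories, midpoints):
    """Return a roughly continuous income score on a 0-1 scale."""
    return [midpoints[c] / 100 for c in categories]

print(income_score([1, 3, 5], anes_midpoints))    # [0.08, 0.5, 0.98]
print(income_score([2, 4], other_midpoints))      # [0.37, 0.88]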

Cohn responds:

Two other thoughts for consideration:

— I am concerned about non-monotonicity. At least in this compilation of 2020 data, the Democrats do best among rich and poor, and sag in the middle. It seems even more extreme when we get into the highest/lowest income strata, ala ANES. I’m not sure this survives controls—it seems like there’s basically no income effect after controls—but I’m hesitant to squelch a possible non-monotonic effect that I haven’t ruled out.

—I’m also curious for your thoughts on a related case. Suppose that (a) the dataset includes surveys that sometimes asked about income and sometimes did not, (b) we’re interested in many demographic covariates besides income, and (c) we’d otherwise clearly specify the interaction between income and the other variables. The missing income data creates several challenges. What should we do?

I can imagine some hacky solutions to the NA data problem short of outright removing observations (say, set all NA income to 1 and interact our continuous income variable with whether we have actual income data), but if we interact other variables with the NA income data there are lots of cases (say, MRP where the population strata specify income for the full pop, not in proportion to survey coverage) where we’d risk losing much of the power gleaned from other surveys about the other demographic covariates. What should we do here?

My quick recommendation is to fit a model with two stages, first predicting income given your other covariates, then predicting your outcome of interest (issue attitude, vote preference, whatever) given income and the other covariates. You can fit the two models simultaneously in one Stan program. I guess then you will want some continuous coding for income (could be something like sqrt(income) with income topcoded at $300K) along with a possibly non-monotonic model at the second level.
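
Here’s a bare-bones sketch of the two-stage idea in Python with simulated data. To be clear, this is just a plug-in two-step version, not the simultaneous Bayesian fit in one Stan program recommended above, which would also propagate the uncertainty in the imputed income to the second stage:

import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated data: two covariates plus income; income is missing for the
# respondents from surveys that never asked about it.
age = rng.normal(size=n)
educ = rng.normal(size=n)
income = 0.6 * educ + 0.2 * age + rng.normal(size=n)
outcome = 0.4 * income - 0.3 * age + rng.normal(size=n)
asked_income = rng.random(n) < 0.7        # 30% of respondents were never asked

# Stage 1: predict income from the other covariates, using respondents who have it.
X1 = np.column_stack([np.ones(n), educ, age])
a_hat, *_ = np.linalg.lstsq(X1[asked_income], income[asked_income], rcond=None)

# Stage 2: predict the outcome from income plus the other covariates, plugging in
# the stage-1 prediction where income is missing.
income_filled = np.where(asked_income, income, X1 @ a_hat)
X2 = np.column_stack([np.ones(n), income_filled, age])
b_hat, *_ = np.linalg.lstsq(X2, outcome, rcond=None)
print(b_hat)    # roughly [0, 0.4, -0.3], with some attenuation on the income coefficient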

Minor-league Stats Predict Major-league Performance, Sarah Palin, and Some Differences Between Baseball and Politics

In politics, as in baseball, hot prospects from the minors can have trouble handling big-league pitching.

Right after Sarah Palin was chosen as the Republican nominee for vice president in 2008, my friend Ubs, who grew up in Alaska and follows politics closely, wrote the following:

Palin would probably be a pretty good president. . . . She is fantastically popular. Her percentage approval ratings have reached the 90s. Even now, with a minor nepotism scandal going on, she’s still about 80%. . . . How does one do that? You might get 60% or 70% who are rabidly enthusiastic in their love and support, but you’re also going to get a solid core of opposition who hate you with nearly as much passion. The way you get to 90% is by being boringly competent while remaining inoffensive to people all across the political spectrum.

Ubs gives a long discussion of Alaska’s unique politics and then writes:

Palin’s magic formula for success has been simply to ignore partisan crap and get down to the boring business of fixing up a broken government. . . . It’s not a very exciting answer, but it is, I think, why she gets high approval ratings — because all the Democrats, Libertarians, and centrists appreciate that she’s doing a good job on the boring non-partisan stuff that everyone agrees on and she isn’t pissing them off by doing anything on the partisan stuff where they disagree.

Hey–I bet you never thought you’d see the words “boringly competent,” “inoffensive,” and “Sarah Palin” in the same sentence!

Prediction and extrapolation

OK, so what’s the big deal? Palin got a reputation as a competent nonpartisan governor but when she hit the big stage she shifted to hyper-partisanship. The contrast is interesting to me because it suggests a failure of extrapolation.

Now let’s move to baseball. One of the big findings of baseball statistics guru Bill James is that minor-league statistics, when correctly adjusted, predict major-league performance. James is working through a three-step process: (1) naive trust in minor league stats, (2) a recognition that raw minor league stats are misleading, (3) a statistical adjustment process, by which you realize that there really is a lot of information there, if you know how to use it.

For a political analogy, consider Scott Brown. When he was running for the Senate last year, political scientist Boris Shor analyzed his political ideology. The question was, how would he vote in the Senate if he were elected? Boris wrote:

We have evidence from multiple sources. The Boston Globe, in its editorial endorsing Coakley, called Brown “in the mode of the national GOP.” Liberal bloggers have tried to tie him to the Tea Party movement, making him out to be very conservative. Chuck Schumer called him “far-right.”

In 2002, he filled out a Votesmart survey on his policy positions in the context of running for the State Senate. Looking through the answers doesn’t reveal too much beyond that he is a pro-choice, anti-tax, pro-gun Republican. His interest group ratings are all over the map. . . .

All in all, a very confusing assessment, and quite imprecise. So how do we compare Brown to other state legislators, or more generally to other politicians across the country?

My [Boris’s] research, along with Princeton’s Nolan McCarty, allows us to make precisely these comparisons. Essentially, I use the entirety of state legislative voting records across the country, and I make them comparable by calibrating them through Project Votesmart’s candidate surveys.

By doing so, I can estimate Brown’s ideological score very precisely. It turns out that his score is -0.17, compared with her score of 0.02. Liberals have lower scores; conservatives higher ones. Brown’s score puts him at the 34th percentile of his party in Massachusetts over the 1995-2006 time period. In other words, two thirds of other Massachusetts Republican state legislators were more conservative than he was. This is evidence for my [Boris’s] claim that he’s a liberal even in his own party. What’s remarkable about this is the fact that Massachusetts Republicans are the most, or nearly the most, liberal Republicans in the entire country!

Very Jamesian, wouldn’t you say? And Boris’s analysis was borne out by Scott Brown’s voting record: he was indeed the most liberal of the Senate’s Republicans.

Political extrapolation

OK, now back to Sarah Palin. First, her popularity. Yes, Gov. Palin was popular, but Alaska is a small (in population) state, and surveys find that most of the popular governors in the U.S. are in small states. Here are data from 2006 and 2008:

[Graph: governors.png — data on governors’ popularity by state, 2006 and 2008.]

There are a number of theories about this pattern; what’s relevant here is that a Bill James-style statistical adjustment might be necessary before taking state-level stats to the national level.

The difference between baseball and politics

There’s something else going on, though. It’s not just that Palin isn’t quite so popular as she appeared at first. There’s also a qualitative shift. From “boringly competent nonpartisan” to . . . well, leaving aside any questions of competence, she’s certainly no longer boring or nonpartisan! In baseball terms, this is like Ozzie Smith coming up from the minors and becoming a Dave Kingman-style slugger. (Please excuse my examples which reveal how long it’s been since I’ve followed baseball!)

So how does baseball differ from politics, in ways that are relevant to statistical forecasting?

1. In baseball there is only one goal: winning. Scoring more runs than the other team. Yes, individual players have other goals: staying healthy, getting paid, not getting traded to Montreal, etc., but overall the different goals are aligned, and playing well will get you all of these to some extent.

But there are two central goals in politics: winning and policy. You want to win elections, but the point of winning is to enact policies that you like. (Sure, there are political hacks who will sell out to the highest bidder, but even these political figures represent some interest groups with goals beyond simply being in office.)

Thus, in baseball we want to predict how a player can help his team win, but in politics we want to predict two things: electoral success and also policy positions.

2. Baseball is all about ability–natural athletic ability, intelligence (as Bill James said, that and speed are the only skills that are used in both offense and defense), and plain old hard work, focus, and concentration. The role of ability in politics is not so clear. In his remarks that started this discussion, Ubs suggested that Palin had the ability and inclination to solve real problems. But it’s not clear how to measure such abilities in a way that would allow any generalization to other political settings.

3. Baseball is the same environment at all levels. The base paths are the same length in the major leagues as in AA ball (at least, I assume that’s true!), the only difference is that in the majors they throw harder. OK, maybe the strike zone and the field dimensions vary, but pretty much it’s the same game.

In politics, though, I dunno. Some aspects of politics really do generalize. The Massachusetts Senate has got to be a lot different from the U.S. Senate, but, in their research, Boris Shor and Nolan McCarty have shown that there’s a lot of consistency in how people vote in these different settings. But I suspect things are a lot different for the executive, where your main task is not just to register positions on issues but to negotiate.

4. In baseball, you’re in or out. If you’re not playing (or coaching), you’re not really part of the story. Sportswriters can yell all they want but who cares. In contrast, politics is full of activists, candidates, and potential candidates. In this sense, the appropriate analogy is not that Sarah Palin started as Ozzie Smith and then became Dave Kingman, but rather a move from being Ozzie Smith to being a radio call-in host, in a world in which media personalities can be as powerful, and as well-paid, as players on the field. Perhaps this could’ve been a good move for, say, Bill Lee, in this alternative universe? A player who can’t quite keep the ball over the plate but is a good talker with a knack for controversy?

Commenter Paul made a good point here:

How many at-bats long is a governorship? The most granular I could imagine possibly talking is a quarter. At the term level we’d be doing better making each “at-bat” independent of the previous. 20 or so at-bats don’t have much predictive value either. Even over a full 500 at-bat season, fans try to figure out whether a big jump in BABIP is a sign of better bat control or luck.

The same issues arise at very low at-bat counts too. If you bat in front of a slugger, you can sit on pitches in the zone. If you’ve got a weakness against a certain pitching style, you might not happen to see it. And once the ball is in the air, luck is a huge factor in if it travels to a fielder or between them.

I suspect if we could somehow get a political candidate to hold 300-400 different political jobs in different states, with different party goals and support, we’d be able to do a good job predicting future job performance, even jumping from state to national levels. But the day to day successes of a governor are highly correlative.

Indeed, when it comes to policy positions, a politician has lots of “plate appearances,” that is, opportunities to vote in the legislature. But when it comes to elections, a politician will only have at most a couple dozen in his or her entire career.

All the above is from a post from 2011. I thought about it after this recent exchange with Mark Palko regarding the political candidacy of Ron DeSantis.

In addition to everything above, let me add one more difference between baseball and politics. In baseball, the situation is essentially fixed, and pretty much all that matters is player ability. In contrast, in politics, the most important factor is the situation. In general elections in the U.S., the candidate doesn’t matter that much. (Primaries are a different story.) In summary, to distinguish baseball players in ability we have lots of data to estimate a big signal; to distinguish politicians in vote-getting ability we have very little data to estimate a small signal.

The four principles of Barnard College: Respect, empathy, kindness . . . and censorship?

A few months ago we had Uh oh Barnard . . .

And now there’s more:

Barnard is mandating that students remove any items affixed to room or suite doors by Feb. 28, after which point the college will begin removing any remaining items, Barnard College Dean Leslie Grinage announced in a Friday email to the Barnard community. . . .

“We know that you have been hearing often lately about our community rules and policies. And we know it may feel like a lot,” Grinage wrote. “The goal is to be as clear as possible about the guardrails, and, meeting the current moment, do what we can to support and foster the respect, empathy and kindness that must guide all of our behavior on campus.”

According to the student newspaper, here’s the full email from the Barnard dean:

Dear Residential Students,

The residential experience is an integral part of the Barnard education. Our small campus is a home away from home for most of you, and we rely on each other to help foster an environment where everyone feels welcome and safe. This is especially important in our residential spaces. We encourage debate and discussion and the free exchange of ideas, while upholding our commitment to treating one another with respect, consideration and kindness. In that spirit, I’m writing to remind you of the guardrails that guide our residential community — our Residential Life and Housing Student Guide.

While many decorations and fixtures on doors serve as a means of helpful communication amongst peers, we are also aware that some may have the unintended effect of isolating those who have different views and beliefs. So, we are asking everyone to remove any items affixed to your room and/or suite doors (e.g. dry-erase boards, decorations, messaging) by Wednesday, February 28 at noon; the College will remove any remaining items starting Thursday, February 29. The only permissible items on doors are official items placed by the College (e.g. resident name tags). (Those requesting an exemption for religious or other reasons should contact Residential Life and Housing by emailing [email protected].)

We know that you have been hearing often lately about our community rules and policies. And we know it may feel like a lot. The goal is to be as clear as possible about the guardrails, and, meeting the current moment, do what we can to support and foster the respect, empathy and kindness that must guide all of our behavior on campus.

The Residential Life and Housing team is always here to support you, and you should feel free to reach out to them with any questions you may have.

Please take care of yourselves and of each other. Together we can build an even stronger Barnard community.

Sincerely,

Leslie Grinage

Vice President for Campus Life and Student Experience and Dean of the College

The dean’s letter links to this Residential Life and Housing Student Guide, which I took a look at. It’s pretty reasonable, actually. All I saw regarding doors was this mild restriction:

While students are encouraged to personalize their living space, they may not alter the physical space of the room, drill or nail holes into any surface, or affix tapestries and similar decorations to the ceiling, light fixtures, or doorways. Painting any part of the living space or college-supplied furniture is also prohibited.

The only thing in the entire document that seemed objectionable was the no-sleeping-in-the-lounges policy, but I can’t imagine they would enforce that rule unless someone was really abusing the privilege. They’re not gonna send the campus police to wake up a napper.

So, yeah, they had a perfectly reasonable rulebook and then decided to mess it all up by not letting the students decorate their doors. So much for New York, center of free expression.

I assume what’s going on here is that Barnard wants to avoid the bad publicity that comes from clashes between groups of students with opposing political views. And now they’re getting bad publicity because they’re censoring students’ political expression.

The endgame seems to be to turn the college into some sort of centrally controlled corporate office park. But that wouldn’t be fair. In a corporate office, they let you decorate your own cubicle, right?

ISBA 2024 Satellite Meeting: Lugano, 25–28 June

Antonietta Mira is organizing a satellite workshop before ISBA. It’s free, there is still time to submit a poster, and it’s a great excuse to visit Lugano. Here are the details:

I really like small meetings like this. Mitzi and I are going to be there and then continue on to ISBA.

Leap Day Special!

The above graph is from a few years ago but is particularly relevant today!

It’s funny that roughly 10% fewer babies are born on 29 Feb than you’d expect from the surrounding days. I think it would be cool to have a Leap Day birthday. But I guess most people, not being nerds, would prefer the less-“weird” days before and after.

There’s lots of good stuff at the above link; I encourage you to read the whole thing.

In the years since, we’ve improved Stan so we can fit and improve the birthdays time series decomposition model using full Bayesian inference.

Here’s Aki’s birthday case study which has all the details. This will also be going into our Bayesian Workflow book.

“Exclusive: Embattled dean accused of plagiarism in NSF report” (yup, it’s the torment executioners)

The story is at Retraction Watch:

Erick Jones, the dean of engineering at the University of Nevada in Reno, appears to have engaged in extensive plagiarism in the final report he submitted to the National Science Foundation for a grant, Retraction Watch has learned.

The $28,238 grant partially supported a three-day workshop that Jones and his wife, Felicia Jefferson, held for 21 students in Washington, DC, in April 2022 titled “Broadening Participation in Engineering through Improved Financial Literacy.” Jefferson received a separate award for $21,757.

Jones submitted his final report to the agency in May 2023. Retraction Watch obtained a copy of that report through a public records request to Jones’s previous employer, the University of Texas at Arlington, and identified three published sources of extended passages he used without citation or quotation marks. . . .

Lots more details at the link.

Those torment executioners keep on tormenting us.

In all seriousness, between the University of Nevada salary and the National Science Foundation grants, this guy’s been taking a lot of public funds to produce some really bad work. Seems like a real failure of oversight at UNR and NSF to let this go on like this.

Good work by Retraction Watch to follow up on this story.

P.S. I forgot to include the quotations from UNR luminaries:

“In Erick Jones, our University has a dynamic leader who understands how to seize moments of opportunity in order to further an agenda of excellence,” University President Brian Sandoval said.

“What is exciting about having Jones as our new dean for the College of Engineering is how he clearly understands the current landscape for what it means to be a Carnegie R1 ‘Very High Research’ institution,” Provost Jeff Thompson said. “He very clearly understands how we can amplify every aspect of our College of Engineering, so that we can continue to build transcendent programs for engineering education and research.”

Also this:

Jones was on a three-year rotating detail at National Science Foundation where he was a Program Director in the Engineering Directorate for Engineering Research Centers Program.

Shameful that he would work for NSF and then pay that back by taking their money and submitting a plagiarized report. But, hey, I guess that’s what University President Brian Sandoval would call “understanding how to seize moments of opportunity in order to further an agenda of excellence.”

What could be more excellent than taking government funds and using them to publish plagiarized reports and crayon drawings?

It sounds like it’s fine with UNR if their dean of engineering does this. I wonder what would happen to any UNR students who did this sort of thing? I guess they wouldn’t get paid $372,127 for it, but maybe the university could at least give them a discount on their tuition?

P.P.S. That all said, let’s not forget that there are much worse cases of corruption out there. The UNR case just particularly bothers me, partly because it’s close to what I do—except that when my colleagues get NSF funds, we don’t use them to produce plagiarized reports—and partly because the problems are so obvious: as discussed in our earlier post, you can look at the papers this dean of engineering had published and see that they are incoherent and have no content, even before getting into the plagiarism. It’s hard to believe that his hiring was a mere lack of oversight; you’d have to work really hard to not see the problems in his publications. But, yeah, there’s lots of much worse stuff going on that we read about in the newspaper every day.

Varying slopes and intercepts in Stan: still painful in 2024

Andrew recently blogged the following: Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson. This is the kind of model Andrew et al. used for one example in Red State, Blue State, which is the varying effect of income on Republican preference by state. Each state had its own slope and intercept related with a multivariate hierarchical prior. The version in Gelman and Hill’s regression book is a hack that tried to scale an inverse Wishart; the LKJ is what they would have used if Ben Goodrich had created it at that point.

Andrew points to a tutorial on Bayesian varying effects models from Will Hipson, which is really nice in the way it steps through workflow, building up the model in stages. The model Hipson develops is an improvement on what we have in our User’s Guide. After everything else, I circle back and talk about doc, trying to connect it to my recent post on why doc is so dangerous.

I think we can do a bit better in the current version of Stan, but I have to confess up front that Andrew’s right—this is still painful. It took me around three hours to put together the model, the simulations, and this blog post, and I’m the one who designed the language! It would’ve been much faster if I hadn’t been trying to bring it up to a “publishable” standard as an example of how I like to see Stan code written.

The original Stan model

Here’s Will Hipson’s model:

data {
  int N_obs; // number of observations
  int N_pts; // number of participants
  int K; // number of predictors + intercept
  int pid[N_obs]; // participant id vector
  matrix[N_obs, K] x; // matrix of predictors
  real y[N_obs]; // y vector
}

parameters {
  matrix[K, N_pts] z_p; // matrix of intercepts and slope
  vector[K] sigma_p; // sd for intercept and slope
  vector[K] beta; // intercept and slope hyper-priors
  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix
  real sigma; // population sigma
}

transformed parameters {
  matrix[K, N_pts] z; // non-centered version of beta_p
  z = diag_pre_multiply(sigma_p, L_p) * z_p; 
}

model {
  vector[N_obs] mu;
  
  // priors
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  sigma_p ~ exponential(1);
  L_p ~ lkj_corr_cholesky(2);
  to_vector(z_p) ~ normal(0, 1);
  
  // likelihood
  for(i in 1:N_obs) {
    mu[i] = beta[1] + z[1, pid[i]] + (beta[2] + z[2, pid[i]]) * x[i, 2];
  }
  y ~ normal(mu, sigma);
}

generated quantities {
  matrix[2, 2] Omega;
  Omega = multiply_lower_tri_self_transpose(L_p);
}

Warning: There’s a bug in this code in that it only handles the K = 2 case. You can see this with the 1 and 2 hardcoded in the definition of mu[i].

My Stan model

The documentation for the model is at the top of the Stan code; after that, the code has only a single line of doc other than explanations of the variables (which I wouldn’t include in non-tutorial code—just to link this back to what I was saying a few posts ago about comments).

/**
 * Varying slopes and intercept hierarchical linear regression.
 * N observations organized into J groups, with jj[n] being the group
 * and x[n, 1:K] the covariates for observation n.  The covariate
 * matrix x should include a column of 1s to include an intercept.
 * 
 * The slopes and intercept per group have a multivariate normal prior
 * and the scale has an exponential prior.  The location of the
 * multivariate normal prior has a standard normal hyperprior and its
 * covariance is decomposed into a correlation matrix with an LKJ
 * hyperprior and a scale vector with an exponential hyperprior. In
 * symbols: 
 *
 * Likelihood:
 *   y[n] ~ normal(x[n] * beta[1:K, jj[n]], sigma) for n in 1:N
 *
 * Priors:
 *   sigma ~ exponential(1)
 *   beta[1:K, j] ~ multi_normal(nu, Sigma) for j in 1:J
 * 
 * Hyperpriors:
 *   nu ~ normal(0, 1)
 *   scale(Sigma) ~ exponential(1)
 *   corr(Sigma) ~ lkj(2)
 *
 * where scale(Sigma) is the scale vector and corr(Sigma) is the
 * correlation matrix of Sigma.
 *
 * For efficiency and numerical stability, the covariance and
 * correlation matrices are Cholesky factored.
 */
data {
  int<lower=0> J;                      // number of groups
  int<lower=0> N;                      // number of observations
  array[N] int<lower=1, upper=J> jj;   // group per observation
  int<lower=1> K;                      // number of covariates
  matrix[N, K] x;                      // data matrix
  vector[N] y;                         // observations
}
parameters {
  vector[K] nu;                        // location of beta[ , j]
  vector<lower=0>[K] tau;              // scale of beta[ , j]
  cholesky_factor_corr[K] L_Omega;     // Cholesky of correlation of beta[ , j]
  matrix[K, J] beta_std;               // standard beta (beta - nu) / Sigma
  real<lower=0> sigma;                 // observation error for y
}
transformed parameters {
  matrix[K, J] beta = rep_matrix(nu, J)
                      + diag_pre_multiply(tau, L_Omega) * beta_std;
}
model {
  nu ~ normal(0, 1);
  tau ~ exponential(1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(beta_std) ~ normal(0, 1);  // beta[ , j] ~ multi_normal(nu, Sigma)
  sigma ~ exponential(1);
  y ~ normal(rows_dot_product(x, beta[ , jj]'), sigma);
}
generated quantities {
  matrix[K, K] Sigma                   // covariance of beta[, j]
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega));
}

(WordPress is really annoying in its mishandling of angle brackets in pre environments.)

I started the first version using K = 2 and loops rather than vectorization. Next, I generalized from K = 2 to general K. Then I found the rows dot product function and got rid of the last loop. (Loops are fast in Stan—it’s the redundant autodiff, like multiple calculations of log(sigma), that is the real time waster in non-vectorized code.)

One could make the case for suffixing all the parameters of the prior for beta using _beta. You’d probably want to do that in a context with multiple groupings. It gets very hard to read even in this simple case—with multiple groupings it’s a right royal mess. Long variable names are very hard to read in math.

I put all the comments to the right at the ends of lines, so they don’t waste any vertical space and don’t get in the way of actually reading the code. The exception is the one true inline code comment that I’d keep, which notes the implication of the non-centered parameterization for the distribution of beta.

Let’s take it for a spin

I actually wrote the model first. I would generally recommend building and debugging in stages, but I’ve built this model many times, so I decided to roll the dice.

What I haven’t done before is simulate data in Python. Luckily, I have a very helpful coding assistant named ChatGPT. I have a system prompt that looks like this (I’m not actually a professor any more, but I thought it’d be easier to understand that way).

WHO: I am a professor in statistics and machine learning with a Ph.D. in cognitive science and computer science. I understand (a) graduate level probability and statistics, (b) graduate-level machine learning including neural networks, (c) applied mathematics including calculus and analysis, matrices and linear algebra, combinatorics and graph theory, mathematical logic and set theory, (c) cognitive science including psycholinguistics and cognitive anthropology, (d) linguistics including phonology, morphology, syntax, semantics, and pragmatics, and (e) philosophy including 20th century analytic philosophy and philosophy of mind.

HOW: Respond in a concise and neutral academic style without embellishments.

Please call me “Bob”.

This was a one-and-done prompt. I’ve also had a bit of practice formulating these queries.

BOB: In Python, I want to set K = 5 covariates J = 10 groups. I want to generate a random K-vector location nu from a standard normal, and a (K x K) random covariance matrix Sigma of roughly unit scale (multiplying Cholesky factors or taking entries from a random walk would be OK). Then generate a K x J matrix beta where the columns are drawn from a multivariate normal distribution with location nu and covariance Sigma.

Then I want to set N = 1000 observations. For each observation in in range(N), generate a random group jj[n] in range(J) uniformly. Take a scale sigma = 1.5 for errors, and generate an N x K standard normal matrix x of covariates, with the first column set to all 1 values for an intercept. Then generate y from a hierarchical linear regression where the coefficient vector for observation n is given by beta[:, jj[n]] and the noise is normal with location 0 and scale sigma.

GPT-4 (Feb 28, 2024): [It just led with the code, which is unusual.]

import numpy as np
import scipy.linalg

# Set parameters
K = 5  # number of covariates
J = 10  # number of groups
N = 1000  # number of observations
sigma = 1.5  # scale of errors

# Generate nu and Sigma
nu = np.random.normal(0, 1, K)  # K-vector location from a standard normal
A = np.random.normal(0, 1, (K, K))
Sigma = np.dot(A, A.T)  # create a positive semi-definite matrix for covariance

# Generate beta
beta = np.random.multivariate_normal(nu, Sigma, J).T  # K x J matrix

# Generate observations
jj = np.random.randint(0, J, N)  # random group for each observation
x = np.random.normal(0, 1, (N, K))  # N x K standard normal matrix
x[:, 0] = 1  # set first column to all 1s for intercept

# Generate y
y = np.zeros(N)
for n in range(N):
    y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)

y[:5]  # Show the first 5 values of y as a sample output

The hierarchical linear regression model was successfully generated with the specified parameters. The first five values of the dependent variable y from the model are approximately:

−2.27
8.86
8.20
1.30
7.27

These values are based on the simulation with K=5 covariates, J=10 groups, N=1000 observations, and an error scale σ=1.5.

It looks like it actually ran that Python code? I haven’t seen output in this format before from GPT—the code’s in a pulldown labeled “Analyzing”.

Anyway, GPT’s code worked out of the box. It only took a couple of minutes to pull the data generation into a function and return a dictionary with the simulated data and simulated parameter values to compare to the fitted values. My Python code illustrates my suggestion a few posts back to turn blocks of code into functions rather than putting a bunch of comments around them in the script.

import numpy as np
import scipy.linalg

def sim_param_data(K, J, N, sigma):
    nu = np.random.normal(0, 1, K)
    A = np.random.normal(0, 1, (K, K))
    Sigma = np.dot(A, A.T)
    beta = np.random.multivariate_normal(nu, Sigma, J).T 
    jj = np.random.randint(0, J, N)
    x = np.random.normal(0, 1, (N, K))
    x[:, 0] = 1
    y = np.zeros(N)
    for n in range(N):
        y[n] = np.dot(x[n], beta[:, jj[n]]) + np.random.normal(0, sigma)
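    # jj + 1 below: Stan indexes groups from 1, Python from 0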
    return nu, Sigma, beta, {'J': J, 'N': N, 'jj': jj + 1, 'K': K, 'x': x, 'y': y}
    
import cmdstanpy as csp
nu, Sigma, beta, data = sim_param_data(K = 5, J = 10, N = 1000, sigma = 1.5)
model = csp.CmdStanModel(stan_file = 'rsbs.stan')
fit = model.sample(data)

It takes Stan about 20 seconds to fit this data set. R-hats are all less than 1.01, ESS is in the thousands from a sample of size 4000, and all but a couple of parameters are recovered within their 95% posterior intervals. There is quite a lot of uncertainty here with this little data and this many groups—don’t take those point estimates of the covariance too seriously!
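If you want to run this sort of recovery check yourself, here’s a minimal sketch of the kind of thing I have in mind, continuing from the fit and the simulated beta above. I pull the draws directly with cmdstanpy’s stan_variable, since the exact columns of fit.summary() depend on your CmdStan version.

import numpy as np

beta_draws = fit.stan_variable('beta')            # shape: (num_draws, K, J)
lower = np.percentile(beta_draws, 2.5, axis=0)    # elementwise 95% intervals
upper = np.percentile(beta_draws, 97.5, axis=0)
coverage = np.mean((lower <= beta) & (beta <= upper))
print(f"share of true beta[k, j] inside 95% intervals: {coverage:.2f}")
print(fit.summary().head())                       # R-hat, ESS, etc.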

Appendix on doc

Let’s digress and talk about doc. I wrote a blog post a few days ago on doc, and this example triggers some of the same issues. I want to say up front that doc is hard, and if you go and look at code I’ve written, there will be a lot of places where you can improve the doc. Same for the code. So this is a kind of normative theory of doc, not what one might expect in reality. People only have a finite amount of time for any project. You might want to take a look at the doc in the R parts of his example with the same eye.

First, there’s a scaffolding example which has the classic problem of documentation just for the sake of documentation.

vector[N] mu; // declaring a mu vector

You see the same thing in the final example where “vector[N] y” is documented as “y vector”. For the same reason, I don’t like this from an early example,

  sigma ~ exponential(1); // using an exponential prior on sigma

And this is what I meant by documenting the language.

mu[i] = x[i] * beta; // * is matrix multiplication in this context

and

  cholesky_factor_corr[K] L_p; // Cholesky correlation matrix

Going back to the final example, rather than “population sigma”, I would prefer “error scale” as it does not rely on the conventional notation sigma to pick out the scale.

The comment for z says “non-centered version of beta_p”, but the non-centered variable here is z_p. The terminology of “centering” is around the population mean, not zero.

Continuing with doc for z, I don’t understand what it means to be a version of beta_p. There is no beta_p in the model, so maybe some external doc? In the definition of mu, you can see beta acting as the location of the non-centered parameterization.

Did anyone spot the bug in this model? This is the real reason we don’t trust doc and have to read the code. It only works for K = 2. You’ll see a hard-coded 1 and 2 on the line defining mu[i] despite other parts of the program using K. My advice in this situation is to just bite the bullet and code the K = 2 case first. Then generalize later if you need to. I code the general case in the next section.

I want to emphasize that I’m not trying to pick on Will Hipson here. I’m assuming his intent was to be pedagogical, as the comment density drops as the models get more complicated. And the code is really good—better than in our User’s Guide.

This example also emphasizes why code review is so useful—a second set of eyes is the top thing I can recommend for improving code. Just knowing your code will be reviewed helps you write better code.

A suggestion on how to improve the broader impacts statement requirement for AI/ML papers

This is Jessica. Recall that in 2020, NeurIPS added a requirement that authors include a statement of ethical aspects and future societal consequences extending to both positive and negative outcomes. Since then, requiring broader impact statements in machine learning papers has become a thing.

The 2024 NeurIPS call has not yet been released, but in 2023 authors were required to complete a checklist where they had to respond to the following: “If appropriate for the scope and focus of your paper, did you discuss potential negative societal impacts of your work?”, with either Y, N, or N/A with explanation as appropriate. More recently, ICML introduced a requirement that authors include impact statements in submitted papers: “a statement of the potential broader impact of their work, including its ethical aspects and future societal consequences. This statement should be in a separate section at the end of the paper (co-located with Acknowledgements, before References), and does not count toward the paper page limit.”

ICML provided authors who didn’t feel they had much to say the following boiler-plate text:

“This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.”  

but warned authors “to think about whether there is content which does warrant further discussion, as this statement will be apparent if the paper is later flagged for ethics review.”

I find this slightly amusing in that it sounds like what I would expect authors to be thinking even without an impact statement: This work is like, so impactful, for society at large. It’s just like, really important, on so many levels. We’re out of space unfortunately, so we’ll have to leave it at that.\newline\newline\newline\newline Love, \newline\newline\newline\newline the authors \newline\newline\newline\newline

I have an idea that might increase the value of the exercises, both for authors and those advocating for the requirements: Have authors address potential impacts in the context of their discussion of related work *with references to relevant critical work*, rather than expecting them to write something based on their own knowledge and impressions (which is likely to be hard for many authors for reasons I discuss below).  In other words, treat the impact statement as another dimension of contextualizing one’s work against existing scholarship, rather than a free-form brainstorm.

Why do I think this could be an improvement?  Here’s what I see as the main challenges these measures run into (both my own thoughts and those discussed by others):  

  1. Lack of incentives for researchers to be forthright about possible negative implications of their work, and consequently a lack of depth in the statements they write. Having them instead find and cite existing critical work on ethical or societal impacts doesn’t completely reconcile this, but presumably the critical papers aren’t facing quite the same incentives to say only the minimum amount. I expect it is easier for the authors to refer to the kind of critiques that ethics experts think are helpful than it is for them to write such critical reflections themselves.
  2. Lack of transparency around how impacts statements factor into reviews of papers. Authors perceive reviewing around impacts statements as a black box, and have responded negatively to the idea that their paper could potentially get rejected for not sufficiently addressing broader impacts. But authors have existing expectations about the consequences for not citing some relevant piece of prior work.
  3. Doubts about whether AI/ML researchers are qualified to be reflecting on the broader impacts of their work. Relative to say, the humanities, or even areas of computer science that are closer to social science, like HCI, it seems pretty reasonable to assume that researchers submitting machine learning papers are less likely to gravitate to and be skilled at thinking about social and ethical problems, but skilled at thinking about technical problems. Social impacts of technology require different sensibilities and training to make progress on (though I think there are also technical components to these problems as well, which is why both sides are needed). Why not acknowledge this by encouraging the authors to first consult what has been said by experts in these areas, and add their two cents only if there are aspects of the possible impacts or steps to be taken to address them (e.g., algorithmic solutions) that they perceive to be unaddressed by existing scholarship? This would better acknowledge that just any old attempt to address ethics is not enough (consider, e.g., Gemini’s attempt not to stereotype, which was not an appropriate way to integrate ethical concerns into the tech). It would also potentially encourage more exchange between what currently can appear to be two very divided camps of researchers.
  4. Lack of established processes for reflecting on ethical implications in time to do something about them (e.g., choose a different research direction) in tech research. Related work is often one of the first sections to be written in my experience, so at least those authors who start working on their paper in advance of the deadline might have a better chance of acknowledging potential problems and adjusting their work in response. I’m less convinced that this will make much of a difference in many cases, but thinking about ethical implications early is part of the end goal of requiring broader impacts statements as far as I can tell, and my proposal seems more likely to help than hurt for that goal.

The above challenges are not purely coming from my imagination. I was involved in a couple survey papers led by Priyanka Nanayakkara on what authors said in NeurIPS broader impacts statements, and many contained fairly vacuous statements that might call out buzzwords like privacy or fairness but didn’t really engage with existing research. If we think it’s important to properly understand and address potential negative societal impacts of technology, which is the premise of requiring impacts statements to begin with, why expect a few sentences that authors may well be adding at the last minute to do this justice? (For further evidence that that is what’s happening in some cases, see e.g., this paper reporting on the experiences of authors writing statements). Presumably the target audience of the impact statements would benefit from actual scholarship on the societal implications over rushed and unsourced throwing around of ethical-sounding terms. And the authors would benefit from having to consult what those who are investing the time to think through potential negative consequences carefully have to say.

Some other positive byproducts of this might be that the published record does a better job of directing attention to where critical scholarship needs to be further developed (again, leading to more of a dialogue between the authors and the critics). This seems critical, as some of the societal implications of new ML contributions will require both ethicists and technologists to address. And those investing the time to think carefully about potential implications should see more engagement with their work among those building the tools.

I described this to Priyanka, who also read a draft of this post, and she pointed out that an implicit premise of the broader impact requirements is that the authors are uniquely positioned to comment on the potential harms of their work pre-deployment. I don’t think this is totally off base (since obviously the authors understand the work at a more detailed level than most critics), but to me it misses a big part of the problem: that of misaligned incentives and training (#1, #3 above). It seems contradictory to imply that these potential consequences are not obvious and require careful reflection AND that people who have not considered them before will be capable of doing a good job at articulating them.

At the end of the day, the above proposal is an attempt to turn an activity that I suspect currently feels “religious” for many authors into something they can apply their existing “secular” skills to. 

Blog is adapted to laptops or desktops, not to smartphones or pads.

Sean Manning writes:

People behave differently on the post-2008 Internet than before because most of them are on smartphones or pads, not laptops or desktops. For example, it’s hard to copy and paste blocks of text on a touchscreen, but usually easy to make screenshots, so people move things from one site to another as screenshots. It’s hard to jump precisely around a text and type punctuation marks, so it’s hard to enter bbcode. It’s easy to scroll, so sites designed for smartphones often have an infinite scroll. It’s easy to pull out a smartphone in breaks from other activities, so people visiting the Internet on a smartphone are often in a hurry. People do more of what their tools encourage (affordances) and less of what their tools discourage.

Good point! I hadn’t thought of it that way, partly I guess because I don’t have a mobile phone or pad, so I do very little interaction with touchscreens.

A few years ago someone contacted me with a proposal to fix up the blog and make it more friendly to mobile devices, but it wasn’t clear to me that these changes would actually work. Or, to put it another way, it seemed that any changes would either be too minor to make a difference, or so major that they wouldn’t work with the sort of content we have here. What I hadn’t thought about was Manning’s point, that the way we write and interact on this blog is in some ways a function of how we interact with it on the computer.

There probably are some ways of making the blog more mobile-friendly, but I guess the real point is that the style of communication we’ve developed here works for this format. Kinda like how some stories work better as movies, some as TV shows, and some as plays. You can transfer from one medium to another but they’re different.

Free online book by Bruno Nicenboim, Daniel Schad, and Shravan Vasishth on Bayesian inference and hierarchical modeling using brms and Stan

Shravan points us to these materials:

Hierarchical models are bread and butter stuff for psycholinguists, so we are trying hard to make Stan/brms mainstream through various means. Teaching this stuff feels like the most important work I am doing right now, more important even than the scientific side of things.

We have chapters on hierarchical modeling in our book (to be published soon with CRC Press), we use both brms and Stan:

https://vasishth.github.io/bayescogsci/book/ [edit: made it a live link]

The online version will remain available for free. Comments/corrections are welcome; one can open issues: https://github.com/vasishth/bayescogsci/issues

This summer, I [Shravan] am teaching an intro to Bayes using brms/Stan, with a focus on hierarchical modeling, especially directed at researchers in linguistics who do experimental work:

https://www.mils.ugent.be/courses/module-9-bayesian-data-analysis/

Plus, at Potsdam, for the last seven years I have been running an annual summer school on stats for linguistics and psych, where our focus is on hierarchical modeling using Stan/brms:

https://vasishth.github.io/smlp2024/

Here, we teach both frequentist and Bayesian approaches to hierarchical modeling.

Cool! Good to have these resources out there.

Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson

I was teaching varying-intercept, varying-slope multilevel models, and . . . I can get them to fit in Stan, but the code is kinda ugly, so I was struggling to clean it up, with no success. This will be a real research project, to add appropriate functions and possibly expand the Stan language so that these models can be written at a higher, more intuitive level.

Varying-intercept models aren’t so bad. In lme4 or blme or rstanarm or brms, you write something like:

y ~ (1 | group) + x + z + x:z

and that transfers pretty directly into Stan. Just create the X matrix and go from there. Indeed, you can add as many batches of varying coefficients as you like, and it’s no biggie to code it up.

But once you get to varying intercepts and slopes, it all changes. In lme4 or blme or rstanarm or brms, you can just write things like:

y ~ (1 + z | group) + x + z + x:z

But if you want to program this directly in Stan, once you have varying intercepts and slopes, you have to deal with covariance-matrix decompositions and arrays of coefficient vectors, and it’s all a hairy mess.

What to do? For this semester’s class, I’m just gonna go with lme4/blme/rstanarm when fitting varying-intercept, varying-slope models. All this Stan coding is a rabbit hole that’s getting us away from the goal, which is to be able to fit, use, and evaluate statistical models for measurement and variation.

I would like to be able to more easily fit these in Stan, though. Why, you might ask? If we can fit them in lme4, or blme for more stability, or rstanarm for including more uncertainty in the inferences, then why bother coding directly in Stan?

The answer for why we want to code directly in Stan is that we’re often wanting to expand our models, for example adding mixture components, measurement error terms, time series or spatial dependence, etc.

For that reason, you will want to be able to code varying-intercept, varying-slope models in Stan—even if I won’t be teaching that in class this semester.

The good news is that I did some googling and found this tutorial by Will Hipson on programming hierarchical regressions in Stan. It’s from 2020 and I have not looked at every line of code there, but it all looks reasonable and there’s lots of explanation of the workflow. So maybe this is the best place to start, if you want to go in this direction, as you should!

On the border between credulity and postmodernism: The case of the UFO’s-as-space-aliens media insiders

I came across this post from Tyler Cowen:

From an email I [Cowen] sent to a well-known public intellectual:

I think the chance that the bodies turn out to be real aliens is quite low.

But the footage seems pretty convincing, a way for other people to see what…sources have been telling me for years. [Everyone needs to stop complaining that there are no photos!]

And to think it is a) the Chinese, b) USG secret project, or…whatever…*in Mexico* strains the imagination.

It is interesting of course how the media is not so keen to report on this. They don’t have to talk about the aliens, they could just run a story “The Mexican government has gone insane.” But they won’t do that, and so you should update your mental model of the media a bit in the “they are actually pretty conservative, in the literal sense of that term, and quite readily can act like a deer frozen in the headlights, though at some point they may lurch forward with something ill-conceived.”

Many of you readers are from Christian societies, or you are Christian. But please do not focus on the bodies! I know you are from your early upbringing “trained” to do so, even if you are a non-believer. Wait until that evidence is truly verified (and I suspect it will not be). Focus on the video footage.

In any case, the Mexican revelations [sic] mean this issue is not going away, and perhaps this will force the hand of the USG to say more than they otherwise would have.

The above-linked post seems ridiculous to me, while comments on the post are much more reasonable—I guess it’s not hard to be reasonable when all you have to do is laugh at a silly hoax.

From a straight-up econ point of view I guess it makes sense that there has been a continuing supply of purported evidence for space aliens: there’s a big demand for this sort of thing so people will create some supply. It’s disappointing to me to see someone as usually-savvy as Cowen falling for this sort of thing, but there’s some selection bias, as I’m not writing about all the people out there who have not been snookered by this Bermuda triangle ancient astronauts Noah’s ark fairies haunted radios bigfoot ESP ghosts space aliens stuff.

Given my earlier post on news media insiders getting all excited about UFOs (also this), you won’t be surprised to hear that I’m annoyed by Cowen’s latest. It’s just so ridiculous! Amusingly, his phrasing, “I think the chance that the bodies turn out to be real aliens is quite low,” echoes that of fellow contrarian pundit Nate Silver, who wrote, “I’m not saying it’s aliens, it’s almost definitely not aliens.” Credit them for getting the probability on the right side of 50%, but . . . c’mon.

As I wrote in my earlier posts, what’s noteworthy is not that various prominent people think that UFO’s might be space aliens (as I never tire of saying in this context, 30% of Americans say they believe in ghosts, which have pretty much the same basis in reality); rather, what’s interesting is that they feel so free to admit this belief. I attribute this to a sort of elite-media contagion: Ezra Klein and Tyler Cowen believe the space aliens thing is a possibility, they’re smart guys, so other journalists take it more seriously, etc. Those of us outside the bubble can just laugh, but someone like Nate Silver is too much of an insider and is subject to the gravitational pull of elite media, twitter, etc.

Mark Palko offers a slightly different take, attributing the latest burst of elite credulity to the aftereffects of a true believer who managed to place a few space-aliens-curious stories into the New York Times, which then gave the story an air of legitimacy etc.

The space aliens thing is interesting in part because it does not seem strongly connected to political polarization. You’ve got Cowen on the right, Klein on the left, and Silver on the center-left. OK, just three data points, but still. Meanwhile, Cowen gets a lot of far-right commenters, but most of the commenters to his recent post are with me on this one, just kind of baffled that he’s pushing the story.

Postmodernism

A couple days after seeing Cowen’s post, I happened to be reading a book that discussed postmodernism in the writing of history. I don’t care so much about postmodernism, but the book was interesting; I’ll discuss it in a future post.

In any case, here’s the connection I saw.

Postmodernism means different things to different people, but one of its key tenets is that there is no objective truth . . . uhhhh, let me just “do a wegman” here and quote wikipedia:

Postmodernism is an intellectual stance or mode of discourse which challenges worldviews associated with Enlightenment rationality dating back to the 17th century. Postmodernism is associated with relativism and a focus on the role of ideology in the maintenance of economic and political power. Postmodernists are “skeptical of explanations which claim to be valid for all groups, cultures, traditions, or races, and instead focuses on the relative truths of each person”. It considers “reality” to be a mental construct. Postmodernism rejects the possibility of unmediated reality or objectively-rational knowledge, asserting that all interpretations are contingent on the perspective from which they are made; claims to objective fact are dismissed as naive realism.

One thing that struck me about Cowen’s post was not just that he’s sympathetic to the space-aliens hypothesis; it’s also that it seems to bug him that the elite news media isn’t covering it more widely. Which is funny, because it bugs me that the media (including Bloomberg columnist Cowen) are taking it as seriously as they do!

Cowen writes, “It is interesting of course how the media is not so keen to report on this.” Doesn’t seem so interesting to me! My take is that most people in the media have some common sense and also have some sense of the history of this sort of nexus of hoaxes and credibility, from Arthur Conan Doyle onward.

The postmodernism that I see coming from Cowen is in the statement, “the footage seems pretty convincing, a way for other people to see what . . . sources have been telling me for years,” which seems to me, as a traditional rationalist or non-postmodernist, to be a form of circular reasoning: saying that something is real because people believe in it. Saying “this issue is not going away” . . . I mean, sure, astrology isn’t going away either! Unfortunately, just about nothing ever seems to go away.

Oppositionism

There’s something else going on here, it’s hard for me to put my finger on, exactly . . . something about belief in the occult as being oppositional, something “they” don’t want you to know about, whether “they” is “the media” or “the government” or “organized religion” or “the patriarchy” or “the medical establishment” or whatever. As we discussed in an earlier post on this topic, one interesting thing is how things happen that push certain fringe beliefs into a zone where it’s considered legitimate to take them seriously. As a student of public opinion and politics, I’m interested not just in who has these beliefs and why, but also in the processes by which some such beliefs but not others circulate so that they seem perfectly normal to various people such as Cowen, Silver, etc., in the elite news media bubble.

Hand-drawn Statistical Workflow at Nelson Mandela

In September 2023 I taught a week-long course on statistical workflow at the Nelson Mandela African Institution of Science and Technology (NM-AIST), a public postgraduate research university in Arusha, Tanzania established in 2009.

NM-AIST – CENIT@EA

The course was hosted by Dean Professor Ernest Rashid Mbega and the Africa Centre for Research, Agricultural Advancement, Teaching Excellence and Sustainability (CREATES) through the Leader Professor Hulda Swai and Manager Rose Mosha.

Our case study was an experiment on the NM-AIST campus designed and implemented by Dr Arjun Potter and Charles Luchagula to study the effects of drought, fire, and herbivory on growth of various acacia tree species. The focus was pre-data workflow steps, i.e. experimental design. The goal for the week was to learn some shared statistical language so that scientists can work with statisticians on their research.

Together with Arjun and Charles, with input from Drs Emmanuel Mpolya, Anna Treydte, Andrew Gelman, Michael Betancourt, Avi Feller, Daphna Harel, and Joe Blitzstein, I created course materials full of activities. We asked participants to hand-draw the experimental design and their priors, working together with their teammates. We also did some pencil-and-paper math and some coding in R.

Course participants were students and staff from across NM-AIST. Over the five days, between 15 and 25 participants attended on a given day.

Using the participants’ ecological expertise, we built a model to tell a mathematical story of how acacia tree height could vary by drought, fire, herbivory, species, and plot location. We simulated parameters and data from this model, e.g. beta_fire = rnorm(n = 1, mean = -2, sd = 1) then simulated_data …= rnorm(n, beta_0 + beta_fire*Fire +… beta_block[Block], sd_tree). We then fit the model to the simulated data.

Due to difficulty in manipulating fire, fire was assigned at the block-level, whereas drought and herbivory were assigned at the sub-block level. We saw how this reduced precision in estimating the effect of fire:

We redid the simulation assuming a smaller block effect and saw improved precision. This confirmed the researchers’ intuition that they need to work hard to reduce the block-to-block differences.

To keep the focus on concepts not code, we only simulated once from the model. A full design analysis would include many simulations from the model. In Section 16.6 of ROS they fix one value for the parameters and simulate multiple datasets. In Gelman and Carlin (2014) they consider a range of plausible parameters using prior information. Betancourt’s workflow simulates parameters from the prior.
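To give a flavor of what those repeated simulations could look like, here is a rough Python sketch of the design-analysis idea (this is not the course code, and all the numbers are made up): simulate the experiment many times with fire assigned at the block level, estimate the fire effect each time, and see how much the estimate bounces around for larger versus smaller block effects.

import numpy as np

rng = np.random.default_rng(1)

def sd_of_fire_estimate(sd_block, n_blocks=10, trees_per_block=20,
                        beta_fire=-2.0, sd_tree=1.0, n_sims=1000):
    # Fire is assigned to half the blocks; estimate its effect by the
    # difference in mean block-level height, burned minus unburned.
    estimates = []
    for _ in range(n_sims):
        fire = np.repeat([1, 0], n_blocks // 2)
        block_effect = rng.normal(0, sd_block, n_blocks)
        block_mean = (beta_fire * fire + block_effect
                      + rng.normal(0, sd_tree / np.sqrt(trees_per_block), n_blocks))
        estimates.append(block_mean[fire == 1].mean() - block_mean[fire == 0].mean())
    return np.std(estimates)

print("sd of fire-effect estimate, big block effects:  ", sd_of_fire_estimate(sd_block=2.0))
print("sd of fire-effect estimate, small block effects:", sd_of_fire_estimate(sd_block=0.5))

With big block-to-block differences, block-level assignment leaves the fire effect poorly estimated; shrinking the block effect tightens it up, which is the intuition the course exercise was after.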

Our course evaluation survey was completed by 14 participants. When asked “which parts of the class were most helpful to you to understand the concepts?”, respondents chose instructor explanations, drawings, and activities as more helpful than the R code. However, participants also expressed eagerness to learn R and to analyze the real data in our next course.

The hand-drawn course materials and activities were inspired by Brendan Leonard’s illustrations in Bears Don’t Care About Your Problems and I Hate Running and You Can Too. Brendan wrote me,

I kind of think hand-drawing stuff makes it more fun and also maybe less intimidating?

I agree.

More recently, I have been reading Introduction to Modern Causal Inference by Alejandro Schuler and Mark van der Laan, who say

It’s easy to feel like you don’t belong or aren’t good enough to participate…

yup.

To deal with that problem, the voice we use throughout this book is informal and decidedly nonacademic…Figures are hand-drawn and cartoonish.

I’m excited to return to NM-AIST to continue the workflow steps with the data that Dr Arjun Potter and Charles Luchagula have been collecting. With the real data, we can ask: is our model realistic enough to achieve our scientific goals?

When Steve Bannon meets the Center for Open Science: Bad science and bad reporting combine to yield another ovulation/voting disaster

The Kangaroo with a feather effect

A couple of faithful correspondents pointed me to this recent article, “Fertility Fails to Predict Voter Preference for the 2020 Election: A Pre-Registered Replication of Navarrete et al. (2010).”

It’s similar to other studies of ovulation and voting that we’ve criticized in the past (see for example pages 638-640 of this paper).

A few years ago I ran across the following recommendation for replication:

One way to put a stop to all this uncertainty: preregistration of studies of all kinds. It won’t quell existing worries, but it will help to prevent new ones, and eventually the truth will out.

My reaction was that this was way too optimistic. The ovulation-and-voting study had large measurement error, high levels of variation, and any underlying effects were small. And all this is made even worse because they were studying within-person effects using a between-person design. So any statistically significant difference they find is likely to be in the wrong direction and is essentially certain to be a huge overestimate. That is, the design has a high Type S error rate and a high Type M error rate.
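For readers who want to see what those Type S and Type M calculations look like, here is a small simulation sketch in the spirit of Gelman and Carlin (2014). The numbers are made up for illustration, not taken from the study: a true effect of 2 percentage points measured with a standard error of 8 points.

import numpy as np
from scipy.stats import norm

def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    # Simulate what a study with this true effect and standard error would report.
    z_crit = norm.ppf(1 - alpha / 2)
    estimates = np.random.default_rng(seed).normal(true_effect, se, n_sims)
    significant = np.abs(estimates) > z_crit * se
    power = significant.mean()
    type_s = np.mean(np.sign(estimates[significant]) != np.sign(true_effect))
    exaggeration = np.mean(np.abs(estimates[significant])) / abs(true_effect)
    return power, type_s, exaggeration

power, type_s, exaggeration = retrodesign(true_effect=2.0, se=8.0)
print(f"power: {power:.2f}, Type S rate: {type_s:.2f}, exaggeration ratio: {exaggeration:.1f}")

The three numbers are the power, the chance that a statistically significant estimate has the wrong sign, and the factor by which a significant estimate overstates the true effect on average.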

And, indeed, that’s what happened with the replication. It was a between-person comparison (that is, each person was surveyed at only one time point), there was no direct measurement of fertility, and this new study was powered to only be able to detect effects that were much larger than would be scientifically plausible.

The result: a pile of noise.

To the authors’ credit, their title leads right off with “Fertility Fails to Predict . . .” OK, not quite right, as they didn’t actually measure fertility, but at least they foregrounded their negative finding.

Bad Science

Is it fair for me to call this “bad science”? I think this description is fair. Let me emphasize that I’m not saying the authors of this study are bad people. Remember our principle that honesty and transparency are not enough. You can be of pure heart, but if you are studying a small and highly variable effect using a noisy design and crude measurement tools, you’re not going to learn anything useful. You might as well just be flipping coins or trying to find patterns in a table of random numbers. And that’s what’s going on here.

Indeed, this is one of the things that’s bothered me for years about preregistered replications. I love the idea of preregistration, and I love the idea of replication. These are useful tools for strengthening research that is potentially good research and for providing some perspective on questionable research that’s been done in the past. Even the mere prospect of preregistered replication can be a helpful conceptual tool when considering an existing literature or potential new studies.

But . . . if you take a hopelessly noisy design and preregister it, that doesn’t make it a good study. Put a pile of junk in a fancy suit and it’s still a pile of junk.

In some settings, I fear that “replication” is serving as a shiny object to distract people from the central issues of measurement, and I think that’s what’s going on here. The authors of this study were working with some vague ideas of evolutionary psychology, and they seem to be working under the assumption that, if you’re interested in theory X, the way to science is to gather some data that have some indirect connection to X and then compute some statistical analysis in order to make an up-or-down decision (“statistically significant / not significant” or “replicated / not replicated”).

Again, that’s not enuf! Science isn’t just about theory, data, analysis, and conclusions. It’s also about measurement. It’s quantitative. And some measurements and designs are just too noisy to be useful.

As we wrote a few years ago,

My criticism of the ovulation-and-voting study is ultimately quantitative. Their effect size is tiny and their measurement error is huge. My best analogy is that they are trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

At some point, a set of measurements is so noisy that biases in selection and interpretation overwhelm any signal and, indeed, nothing useful can be learned from them. I assume that the underlying effect size in this case is not zero—if we were to look carefully, we would find some differences in political attitude at different times of the month for women, also different days of the week for men and for women, and different hours of the day, and I expect all these differences would interact with everything—not just marital status but also age, education, political attitudes, number of children, size of tax bill, etc etc. There’s an endless number of small effects, positive and negative, bubbling around.

Bad Reporting

Bad science is compounded by bad reporting. Someone pointed me to a website called “The National Pulse,” which labels itself as “radically independent” but seems to be an organ of the Trump wing of the Republican party, and which featured this story, which they seem to have picked up from the notorious sensationalist site, The Daily Mail:

STUDY: Women More Likely to Vote Trump During Most Fertile Point of Menstrual Cycle.

A new scientific study indicates women are more likely to vote for former President Donald Trump during the most fertile period of their menstrual cycle. According to researchers from the New School for Social Research, led by psychologist Jessica L Engelbrecht, women, when at their most fertile, are drawn to the former President’s intelligence in comparison to his political opponents. The research occurred between July and August 2020, observing 549 women to identify changes in their political opinions over time. . . .

A significant correlation was noticed between women at their most fertile and expressing positive opinions towards former President Donald Trump. . . . the 2020 study indicated that women, while ovulating, were drawn to former President Trump because of his high degree of intelligence, not physical attractiveness. . . .

As I wrote above, I think that research study was bad, but, conditional on the bad design and measurement, its authors seem to have reported it honestly.

The news report adds new levels of distortion.

– The report states that the study observed women “to identify changes in their political opinions over time.” First, the study didn’t “observe” anyone; they conducted an online survey. Second, they didn’t identify any changes over time: the women in the study were surveyed only once!

– The report says something about “a significant correlation” and that “the study indicated that . . .” This surprised me, given that the paper itself was titled, “Fertility Fails to Predict Voter Preference for the 2020 Election.” How do you get from “fails to predict” to “a significant correlation”? I looked at the journal article and found the relevant bit:

Results of this analysis for all 14 matchups appear in Table 2. In contrast to the original study’s findings, only in the Trump-Obama matchup was there a significant relationship between conception risk and voting preference [r_pb (475) = −.106, p = .021] such that the probability of intending to vote for Donald J. Trump rose with conception risk.

Got it? They looked at 14 comparisons. Out of these, only one was “statistically significant” at the 5% level. This is the kind of thing you’d expect to see from pure noise, or the mathematical equivalent, which is a study with noisy measurements of small and variable effects. The authors write, “however, it is possible that this is a Type I error, as it was the only significant result across the matchups we analyzed,” which I think is still too credulous a way to put it; a more accurate summary would be to say that the data are consistent with null effects, which is no surprise given the realistic possible sizes of any effects in this very underpowered study.
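To put a rough number on that intuition (treating the 14 comparisons as independent, which they aren’t exactly, but it gives the flavor), the chance of at least one nominally significant result under the null is already better than a coin flip:

# Probability that at least 1 of 14 independent comparisons comes up
# "significant" at the 5% level when every true effect is zero.
p_at_least_one = 1 - 0.95 ** 14
print(round(p_at_least_one, 2))  # about 0.51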

The authors of the journal article also write, “Several factors may account for the discrepancy between our [lack of replication of] the original results.” They go on for six paragraphs giving possible theories—but never once considering the possibility that the original studies and theirs were just too noisy to learn anything useful.

Look. I don’t mind a bit of storytelling: why not? Storytelling is fun, and it can be a good way to think about scientific hypotheses and their implications. The reason we do social science is because we’re interested in the social world; we’re not just number crunchers. So I don’t mind that the authors had several paragraphs with stories. The problem is not that they’re telling stories, it’s that they’re only telling stories. They don’t ever reflect that this entire literature is chasing patterns in noise.

And this lack of reflection about measurement and effect size is destroying them! They went to all this trouble to replicate this old study, without ever grappling with that study’s fundamental flaw (see kangaroo picture at the top of this post). Again, I’m not saying that the authors are bad people or that they intend to mislead; they’re just doing bad, 2010-2015-era psychological science. They don’t know better, and they haven’t been well served by the academic psychology establishment, which has promoted and continues to promote this sort of junk science.

Don’t blame the authors of the bad study for the terrible distorted reporting

Finally, it’s not the authors’ fault that their study was misreported by the Daily Mail and that Steve Bannon-associated website. “Fails to Predict” is right there in the title of the journal article. If clickbait websites and political propagandists want to pull out that p = 0.02 result from your 14 comparisons and spin a tale around it, you can’t really stop them.

The Center for Open Science!

Science reform buffs will enjoy these final bits from the published paper:

“Not once in the twentieth century . . . has a single politician, actor, athlete, or surgeon emerged as a first-rate novelist, despite the dismayingly huge breadth of experience each profession affords.”

Tom Bissell writes:

Recently, in The Spooky Art, Norman Mailer [wrote that] Not once in the twentieth century . . . has a single politician, actor, athlete, or surgeon emerged as a first-rate novelist, despite the dismayingly huge breadth of experience each profession affords. For better or worse, and I am prepared to admit worse, writers are writers are writers. This explains why so many mediocre fiction writers sound the same, why there exist so many books about writers, and why many talented fiction writers seem to think that their best option to distinguish themselves is to flee the quotidian to explore more fanciful subject matter.

That’s an interesting point. Here in the twenty-first century, novel writing is a niche art and a niche business. In the previous century, though, the novel was a major popular art form, and lots of people were motivated to write them, both for artistic and financial reasons. Great novels were written in the twentieth century by people with all sorts of social backgrounds, high, low, and various steps in between—George Orwell was a police officer!—, but I think Mailer was right, that none of these great novels were written by politicians, actors, athletes, or surgeons. Perhaps the closest candidate is Michael Crichton (not a surgeon but he was trained as a doctor; no great novels but he did write Jurassic Park, which was solid genre fiction). Had his novels not been successful, it seems likely he would’ve just become a doctor, which indicates a bit of selection bias in Mailer’s statement. Jim Bouton authored the literary classic Ball Four, but it’s not a novel and presumably the writing was mostly done by his coauthor, who was a professional writer. OK, I guess my best shots on this are George V. Higgins (author of some arguably-great novels (see also here) and also a practicing lawyer) and Scott Turow (also a practicing lawyer as well as an author of several excellent legal thrillers which, ok, they’re not great novels but they have a lot of strengths, I guess I’d say they’re better than Michael Crichton’s even if they don’t have the originality of someone like Jim Thompson). But “lawyer” is not quite the same category as “politician, actor, athlete, or surgeon”—indeed, a lawyer is already a sort of professional fiction writer.

I dunno, it’s an interesting question. I assume there were a fair number of twentieth-century politicians, actors, athletes, and surgeons who had the capacity to write a great novel, or at least make a fair attempt, but it doesn’t seem to have happened. Maybe it would just have taken too much effort, to the extent that, had they gone all-in to write a great novel or a reasonable attempt at one, they would’ve just become full-time writers, and that’s what we’d remember them as. I’m not sure.

Gore Vidal was a politician (kind of) and wrote some excellent novels, maybe they don’t count as “great” but maybe they do. He’s the closest match I can think of—but maybe not, because he was a writer before going into politics, so he doesn’t really count as a politician emerging as a novelist.

P.S. Bissell’s article also discusses the idea of writers being outsiders, which motivates me to point to these two posts:

There often seems to be an assumption that being in the elite and being an outsider are mutually exclusive qualities, but they’re not.

The insider-outsider perspective

As I wrote in the comments to one of those posts:

Saying someone is an outsider doesn’t convey much information, given that just about anyone can grab that label. However, as an observer of politics (and science), I notice that people sometimes highlight their outsider status, and as a political scientist I find that interesting. For example, what’s interesting about Steven Levitt in Freakonomics is not so much that he thinks of himself as a “rogue” but that he decided to label himself that way. Rather than presenting himself as an informant from the inside, he presented himself as an outsider. He had the choice of taking either tack, and he decided on the outsider label. That’s interesting.

Why would people want economics advice from a “rogue” outsider who thinks that drunk walking is more dangerous than drunk driving, thinks we are assured of 30 years of global cooling, and believes that beautiful parents are 36% more likely to have girls? Wouldn’t you prefer economics advice from an insider, someone with a Harvard and MIT education who’s now the William B. Ogden Distinguished Service Professor of Economics at the University of Chicago? That’s what baffles me.

The outsider-novelist thing is more clear, in that different authors offer different perspectives. We read Jack London for one thing and Jane Austen for another.

“Science as Verified Trust”

Interesting post by Sean Manning:

There seems to be a lot of confusion about the role of trust in science or scholarship. Engineers such as Bill Nye and political propagandists throw around the phrase “trust the science”! On the other hand, the rationalists whom I mentioned last year brandish the Royal Society’s motto nullius in verba “Take nobody’s word for it” like a sword. I [Manning] think both sides are working from some misconceptions about how science or scholarship work. . . .

What makes this scientific or scholarly is not that you do every step yourself. It is that every step of the argument has been checked by multiple independent people, so in most cases you can quickly see if those people disagree and then trust those preliminary steps. Science or scholarship is not about heroes who know every skill, it’s about systems of questioning and verification which let us provisionally assume that some things are true while we focus on something where we are not sure of the answer. . . .