22 Revision Prompts from Matthew Salesses

As promised last month, here they are:

Here are 22 exercises I [Salesses] have given (as options) so far. Some are diagnostic. Some are more traditional prompts that require writing to a directive. They’re not in order of when I assigned them.

1.

Make a list of all the decisions your protagonist makes, every single one. Now change the order of the decisions your protagonist makes so that they follow a causal chain. Rearrange your story to reflect this new list of decisions. Cut at least one decision and, if this applies, the corresponding scene. Now restructure the story again. Many stories are not ultimately in complete chronological order. The *plot* should still be the same causal chain, but the story might utilize flashbacks, memories, flash forwards, imagined scenes, etc.

2.

Start by asking yourself the following questions:

1. What is the story about?
2. What is the central conflict?
3. What are the stakes?
4. What does the protagonist want?
5. How does the character change or fail to change?
6. Where are the gaps where something could be added? What is missing?
7. What is extraneous?

3.

(I handed out paper split into three columns.)

Each column represents a page of your story. Draw lines across each column where the story has a natural break. You have to decide for yourself where these breaks may be. Many might be space breaks in the story, but they could also be breaks between characters, between timelines, between narrative summary and scene, etc. Then, in each individual section you’ve created:

1. Write down any action that happens in that section.
2. Write down what each section is “doing” for the story. What is its purpose(s)? To raise the stakes? To complicate a relationship? To move the plot? Etc.
3. Write down any themes that show up in each section.
4. Shade the sections that happen in the past.
5. Note any characters who show up.
6. Note any settings.
7. Write down any desires that show up and the conflicts that get in the way of those desires.

4.

Rearrange your story so only two “sections” you identified are in the same place. Every other section should be moved. If that means you have to add or delete scenes, please do so. Do whatever you need to do to make your story work again.

5.

Look for any place you can add characterization. Just read through and put an asterisk or make a mark where you could add anything at all to characterize. After you’ve done that, go back and add characterization at every single point. Think about characterizing in all of the ways we discussed in class, but especially through decision and action. Your additional characterization should make your story much longer. You probably have too much characterization now. Cut at least half of what you’ve added.

6.

Wherever a new character enters a story, write an extended introduction for that character, including what they look like, how they are dressed, what objects are associated with them, any identifying marks, any identifying habits or gestures, their way of seeing things, their attitude toward the world, their age, their ethnicity, their occupation, their family relationships and history, their relationship to the protagonist, the narrator’s or protagonist’s or even any other character’s opinion of them, their desires, their problems, their faults, etc. Also include at least one paragraph of backstory for each character in that introduction. What do you know about their pasts, how they’ve come to be who they are now?

Once you’ve done all of this, cut what you don’t need or move it elsewhere in the story and keep what you need. Keep what “gets the character in.” Is it that she’s the kind of person who stays up until everyone has gone to bed and waits for the neighborhood stray to poke its head into her yard, but never goes out to talk to it or pet it, just looks at it longingly? Etc.

7.

Skip forward in time at least ten years and write at least one more scene into your story in which the protagonist sees the dramatic consequences of a decision s/he made earlier.

8.

Add a source of outside conflict to your story. That is, add something big that comes in and forces itself on the plot, something like a toxic spill or an earthquake or a war or a rabid dog or a serial killer or a rapture. Don’t make this a small insertion but something that truly changes the story. You might think about what large outside conflict might connect to your character’s arc thematically, if that helps.

9.

Underline all of the “missed opportunities” in your story. Or, better yet, have a friend or your writing partner do this for you. Add as much as you can for every single missed opportunity. Now cut at least half of what you’ve added. Rearrange the story and do whatever you need to do to make the story work again.

10.

Identify the “symbolic action” in each scene: what does the protagonist do that represents her/his change from where/who s/he was at the beginning of the scene (even if this change is slight), and that changes the situation (even if only slightly)? Add symbolic action where you are missing it. Try to get the most significant symbolic action for your protagonist close to the end of the scene.

11.

Highlight everything in your story that is backstory. Take a look now and see what the balance between past and present looks like. For each piece of backstory, try to identify whether it is absolutely necessary. Then ask yourself where it is absolutely necessary—for one, ask yourself where it would put the most pressure on the present story. Remember that the past slows the story down by its nature—so it should be used to increase urgency in that particular present time. If you have a character encountering her mother on page 15, you don’t need the backstory about the mother that makes that encounter harrowing on page 3. The present and past need to work together in the right place, so that there’s a clear reason we go into the past when we do and a clear reason to go back to the present. What are those reasons?

Now delete as much as you can (that isn’t necessary). Try to put as much of the necessary backstory as you can into the present instead. Think about what information is being conveyed. Can you put the backstory in dialogue? Can you get the same information across through action? Attitude? Implication? If we need to know that the protagonist has a bad relationship with her mother, do we need backstory about that relationship? Or can we see this dramatized in the present scene and understand their dynamic without backstory? Is it about the change between the past and the present, or is it about the change that is going to happen in the present?

12.

Utilize objects in your story. Try to get objects into each scene. Try to associate each character with an object. Make objects that appear earlier in your story reappear later, either in their same form or transformed. How does your character interact with those objects differently the second (or third) time versus the first? Leave some objects; cut the ones that don’t seem to add anything. Make sure at least one object “tracks” through the story, appearing at multiple points and signaling change either through how the characters interact with it, or how they feel about it, or what it looks like, how it has changed, etc.

13.

Kill off one character or more. Or simply delete one character or more. Or fuse two characters or more together to make a single character. How does the story change? If you need another character again, make up a new one instead of returning to a character you cut.

14.

Consolidate your settings. Try to cut out as many settings as you can, to get the story down to 1-3 settings total. Make sure the setting has an effect on the story and is not just a place to set it.

15.

If your story is in first-person, write a scene from the narrator’s point of telling. Narrators are always telling a story from a future point, even in present tense (an implied future point just after the action or a future point far later). What is your narrator doing with her life when she is telling this story? Where is she? Who is she now? What does she know at that future point that is different from the present of the story’s action? Make sure this is a scene, dramatized, not just narrative summary. You may need to bring in another character.

16.

Start by underlining (or however you want to do it) anything that moves the character arc/inside story forward (thoughts, emotions, etc.). Then over-line (draw a line over the top of, or highlight, or however you want to do it) anything that moves the outside story/plot forward. Look for places where the inside and outside story overlap. Look for places where the inside story could be heightened and add things there about the character arc. Look for places where there could be more action/outside story and add some external action.

Most of all: look for places where the inside story affects or should affect and does not yet affect the outside story (emotions that lead to action or so forth) and places where the outside story affects or should affect and does not yet affect the inside story (action leads to emotion or so forth). Now try to create causation between the two—inside and outside story—so that they are intricately linked.

Make sure both inside and outside story create an arc. Make sure both inside story and outside story have a beginning, middle, and end.

You might try to add a foil or mirror character (someone in the story who is the opposite or mirror of your protagonist), or at least identify and utilize those potential characters in your current draft, and see how your protagonist feels about and interacts with that character, how that foil/mirror changes or doesn’t change, etc.

17.

Write past your ending. Write one or two or more scenes after the end of your story. Even if these scenes don’t make it into the final version, they will help inform the final version. What happens after the end of your story? And don’t stop there: now write about what the consequences are of what happens after the end of your story.

18.

Add a scene with a character who arrives for only that one scene and interacts with your protagonist (a character from the world outside the story–an example would be a prank phone call or an old friend coming to town).

19.

Write one anchor scene with all of your main characters in the same room, maybe even talking. What is the setting, how do they interact with each other, how do they relate to each other, etc? Introduce everyone within one scene and then work off of that scene. Maybe you already have that scene and you reorganize your story to center it, or maybe you add that scene.

20.

Get closer in psychic distance (go further into your protagonist’s head) or make a switch in POV—for a scene or two or even for the entire story.

21.

Add a scene to reveal more about a minor character (or characters). What is the minor character doing while the rest of the story is going on? How do those actions affect what the protagonist must do or decide?

22.

Add a scene that changes the context in which we know the characters–maybe they go on a trip, or maybe they are always together but now we see them apart, or maybe someone important to them comes to visit and they have to be on their best behavior, etc.

I love this sort of thing! Next step is to figure out how to adapt these ideas to statistics teaching and learning.

Clybourne Park. And a Jamaican beef patty. (But no Gray Davis, no Grover Norquist, no rabbi.)

I was just at a pizza place on Clybourne Ave (Pequod’s; great crust but I prefer the sauce at Lou’s), and that reminded me that I’d never seen Clybourne Park, despite having gone many times to Steppenwolf (mostly positive experiences except for a performance of Uncle Vanya that was so horrible we left after intermission; on this trip we saw The Thanksgiving Play, which was excellent), so I picked up a copy of the script of Clybourne Park and read it. We read A Raisin in the Sun in English class one year in high school but I don’t remember it well.

Clybourne Park was hilarious. I would’ve preferred a bit more plot and a bit less shtick, but that’s just my taste. Also it could be different in performance than on the page. Years ago I saw a play called Dealer’s Choice which I absolutely loved—it’s the best representation of poker I’ve ever seen in any form—so much that I bought the script, which also is great, but for that I now have the opposite bit of uncertainty, which is to what extent my enjoyment of the script was conditioned on having seen the play live.

P.S. On the way to Pequod’s, I noticed a Jamaican beef patty place on the side of the road. I was gonna keep on pedaling, but I felt like I owed it to the readership to try one out. It cost $4.95 (more than it cost in New York last time I tried one, but still only a very tiny fraction of the cost of meeting Gray Davis, Grover Norquist, and a rabbi). The flavor of the spices was excellent, but I have to say that I otherwise didn’t enjoy the patty—the grease level was too high and I’m just not so into beef anymore, so I didn’t finish it. I have the feeling I’d have had the same reaction to a Golden Krust patty. I think I’ll have to try Bob’s advice and make something for myself using my own baguette dough and some sort of lentil concoction in the middle.

Evidence-based Medicine Eats Itself, and How to do Better (my talk at USC this Friday)

Fri 4 Oct 2024, noon at Hastings Auditorium (HMR 100), USC Health Sciences Campus:

There are three commonly stated principles of evidence-based research: (1) reliance on statistically significant results from randomized trials, (2) balancing of costs, benefits, and uncertainties in decision making, and (3) treatments targeted to individuals or subsets of the population. Unfortunately, principle (1) conflicts with principles (2) and (3). Solving this problem requires improvements along all stages of the statistics pipeline, from design and data collection through modeling, analysis, and computation, to reporting of results and decision analysis. Specific research questions that arise include adjustment for noisy pre-treatment variables, inclusion of sampling weights in model-based inference, and inference for interactions with sparse data.
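
To make the sampling-weights point concrete, here’s a toy sketch (mine, not from the talk; the group shares and outcomes are made up): if a survey over-samples one group, the unweighted mean is biased, whereas weighting each respondent by population share over sample share recovers the population quantity.

```python
# Toy sketch (not from the talk): why sampling weights matter for inference.
# Hypothetical population: 20% "group A" with mean outcome 10, 80% "group B" with mean 2.
# The survey over-samples group A, so the unweighted sample mean is biased upward.
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b = 500, 500                      # equal sample sizes despite unequal population shares
y_a = rng.normal(10, 1, n_a)             # outcomes for over-sampled group A
y_b = rng.normal(2, 1, n_b)              # outcomes for group B
y = np.concatenate([y_a, y_b])

# Inverse-probability-style weights: population share / sample share for each group.
w = np.concatenate([np.full(n_a, 0.2 / 0.5), np.full(n_b, 0.8 / 0.5)])

print("unweighted mean:", y.mean())                   # roughly 6, biased toward group A
print("weighted mean:  ", np.average(y, weights=w))   # roughly 3.6
print("population mean:", 0.2 * 10 + 0.8 * 2)         # 3.6
```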

It’ll be kind of funny speaking at USC medical school. To be prepared, maybe I should memorize “The Ten Craziest Facts You Should Know About A Giraffe.”

In all seriousness, I assume the audience for my talk will be the serious medical researchers on campus, not the publicity hounds. I intend the talk to be generally accessible while discussing the interesting statistical issues and their applied implications.

Wendy Brown: “Just as nothing is more corrosive to serious intellectual work than being governed by a political programme (whether that of states, corporations, or a revolutionary movement), nothing is more inapt to a political campaign than the unending reflexivity, critique and self-correction required of scholarly inquiry.”

William Davies quotes Wendy Brown:

Just as nothing is more corrosive to serious intellectual work than being governed by a political programme (whether that of states, corporations, or a revolutionary movement), nothing is more inapt to a political campaign than the unending reflexivity, critique and self-correction required of scholarly inquiry.

Good point! I do the “unending reflexivity, critique and self-correction” thing, and I don’t think I’m good at politics. Politics requires bargaining and secrecy, neither of which is impossible for me but neither of which I’m particularly good at. I guess that most people don’t enjoy bargaining and secrecy, but it’s the work that you need to put in to do politics. For that matter, reflexivity, critique and self-correction can take some effort too, even though they come naturally.

The tricky thing is that scholarly inquiry requires some politics, and politics requires some scholarly inquiry.

I’ll go through each part.

1. Scholarly inquiry isn’t just about figuring things out. It’s also about communication, explaining to the world. Sometimes it’s pure naked politics, professors making ugly threats to each other to secure publication in a prestigious journal. Other times it’s a pitiful form of cynicism, or journal editors pulling strings, or the use of tame journalists.

But those are just some embarrassing, extreme cases. Small politics comes up all the time. One day in the summer of 1990 I came up with what I knew was the best idea I’d ever had. I wrote it up and sent it to my former Ph.D. adviser, who had some suggestions of his own . . . together we wrote a paper that became a legend in preprint form. We submitted it to a journal, and it took a couple years to come out. Years later, I heard that the journal editor had mistakenly sent it to a reviewer who didn’t know jack about the topic but had a personal grudge against my former adviser. The paper was good enough that the referee couldn’t quite take it down, but he did manage to drag things out. Politics.

And then of course there’s plain old academic politics. Or stuff like this. Or this. Or this. Etc etc.

Anyway, I’m not painting myself as any sort of victim here. Considering everything, academic politics has treated me just fine. I’m just sharing the stories that are familiar to me. Or, to put it another way: even though I’ve had some success, politics has arisen even when I’ve wanted to avoid it. We’re always promoting our ideas, trying to get them out there.

Ok, not always. I guess that dude who proved Fermat’s last theorem just had to prove it, and the rest just followed. But it doesn’t usually work that way.

How does that relate to the Wendy Brown quote? When we want to disseminate scholarly inquiry, we need to do politics—but that’s corrosive to serious intellectual work. And we’ve seen that, from “p less than 0.05” on down. It’s a problem.

2. Politics isn’t just about politics; it also involves serious intellectual work. To the extent that you’re good at pulling those levers, at some point you’ll have to decide where you want to drive that car. I’m a political scientist, but I have very little direct experience with that aspect of the world. What Brown says does seem plausible to me: that “nothing is more inapt to a political campaign than the unending reflexivity, critique and self-correction required of scholarly inquiry.” A lack of intellectual integrity can be a superpower in politics just as in academia. So, again, we have this tension, bouncing back and forth between the political action that is necessary for scholarly inquiry yet also tears it apart, and the scholarly inquiry that is necessary for politics.

No further thoughts from me here on this one. I just was struck by that quote from Brown. I’ll have to read her book.

“A Hudson Valley Reckoning: Discovering the Forgotten History of Slaveholding in My Dutch American Family”

I just read this book by Debra Bruno. It was interesting, both the history and the story of how the author learned the story and talked with people about it. I don’t really have anything to say here; just wanted to share it as it has some of the sort of political science content that we don’t usually discuss in this space.

Faculty positions at the University of Oregon’s new Data Science department

This is Jessica. Peter Ralph of University of Oregon writes:

I’m writing because I’m helping start a Data Science department here.

Briefly, here’s the postings: we’re looking for a department head:

 https://academicjobsonline.org/ajo/jobs/27838

and two assistant professors:

 https://academicjobsonline.org/ajo/jobs/27833

We’re new and so we’re still small (3.5 people) but have a very strong and fast-growing major, and a commitment to hire more people in upcoming years. We’re filling a big gap on campus – UO has no stats department – and who we’re really looking for are people who (like us!) work with/on data, are excited about data science education and interdisciplinary quantitative research, and are enthusiastic about building a new department with a strong focus on equity and justice. There’s more info in the job ads, and please ping me with questions or recommendations!

Thanks,

  Peter Ralph

  Interim Head, Data Science 

  Institute of Ecology and Evolution

  University of Oregon

p.s. The deadline for the “head” position is Oct 15, but it will be a rolling deadline – interested people who won’t make that date should let me know.

“Toward reproducible research: Some technical statistical challenges” and “The political content of unreplicable research” (my talks at Berkeley and Stanford this Wed and Thurs)

Wed 2 Oct 9:30am at Bin Yu’s research group, 1011 Evans Hall, University of California, Berkeley:

Toward reproducible research: Some technical statistical challenges

The replication crisis in social science is not just about statistics; it has also involved the promotion of naive pseudo-scientific ideas and exaggerated “one weird trick” claims in fields ranging from embodied cognition to evolutionary psychology to nudging in economics. In trying to move toward more replicable research, several statistical issues arise; here, we discuss challenges related to design and measurement, modeling of variation, and generalization from available data to new scenarios. Technical challenges include modeling of deep interactions and taxonomic classifications and the incorporation of sampling weights into regression modeling. We use multilevel Bayesian inference, but it should be possible to implement these ideas using other statistical frameworks.
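
For readers who haven’t seen the multilevel idea in action, here’s a minimal sketch of normal-normal partial pooling (my own toy example, not code from the talk; the group means, sample sizes, and variance parameters are invented): each group’s estimate is pulled toward the grand mean, with more shrinkage for groups with less data.

```python
# Minimal sketch of normal-normal partial pooling (toy example, not from the talk).
# Each group estimate is a precision-weighted average of its own mean and the grand mean.
import numpy as np

ybar = np.array([2.0, 5.0, 9.0, 4.0])   # observed group means (hypothetical)
n = np.array([5, 50, 5, 200])           # observations per group
sigma = 3.0                             # assumed within-group sd
tau = 2.0                               # assumed between-group sd
mu = np.average(ybar, weights=n)        # crude estimate of the grand mean

# Weight the data by n/sigma^2 and the grand mean by 1/tau^2.
prec_data = n / sigma**2
prec_prior = 1 / tau**2
theta = (prec_data * ybar + prec_prior * mu) / (prec_data + prec_prior)

for g, (raw, pooled) in enumerate(zip(ybar, theta)):
    print(f"group {g}: raw {raw:.2f} -> partially pooled {pooled:.2f}")
```

Groups with only 5 observations get shrunk strongly toward the grand mean; the group with 200 observations barely moves.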

Thurs 3 Oct 10:30am at the Stanford classical liberalism seminar, room E103, Stanford Graduate School of Business:

The political content of unreplicable research

Discussion of the replication crisis in the social science has focused on the statistical errors that have led researchers and consumers of research to overconfidence in dubious claims, along with the social structures that incentivize bad work to be promoted, publicized, and left uncorrected. But what about the content of this unreplicable work? Consider embodied cognition, evolutionary psychology, nudging in economics, claimed efficacy of policy interventions, and the manipulable-voter model in political science. These models of the world, if true, would have important implications for politics, supporting certain views held on the left, right, and technocratic center of the political spectrum. Conversely, the lack of empirical support for these models has implications for social science, if people are not so arbitrarily swayed as the models suggest.

The two talks should have very little overlap—which is funny, given that I’ll probably be the only person to attend both of them!

In preparation for both talks, I recommend reading the first three sections of our piranha paper.

The Stanford talk is nontechnical, talking about the social science and policy implications of the replication crisis, and I want to convey that the replication crisis isn’t just about silly Ted talks; it also has implications for how we should understand the world.

The Berkeley talk is for statisticians, talking about how the roots of and solutions to the replication crisis are not just procedural (preregistration etc.) or data-analytical (p-values etc.) but also involve measurement, design, and modeling.

“Announcing the 2023 IPUMS Research Award Winners”

PUMS stands for Public Use Microdata Sample—it’s a subset of the U.S. Census that contains individual-level data; I think it was a 1% sample of the Census. I don’t know the full history, but here’s the current Census website with these data.

IPUMS is a compendium of public use microdata from different sources:

In collaboration with 105 national statistical agencies, nine national archives, and three genealogical organizations, IPUMS has created the world’s largest accessible database of census microdata. IPUMS includes almost a billion records from U.S. censuses from 1790 to the present and over a billion records from the international censuses of over 100 countries. We have also harmonized survey data with over 30,000 integrated variables and 150 million records, including the Current Population Survey, the American Community Survey, the National Health Interview Survey, the Demographic and Health Surveys, and an expanding collection of labor force, health, and education surveys. In total, IPUMS currently disseminates integrated microdata describing 1.4 billion individuals drawn from over 750 censuses and surveys. . . .

Our signature activity is harmonizing variable codes and documentation to be fully consistent across datasets. This work rests on an extensive technical infrastructure developed over more than two decades, including the first structured metadata system for integrating disparate datasets. By using a data warehousing approach, we extract, transform, and load data from diverse sources into a single view schema so data from different sources become compatible. The large-scale data integration from IPUMS makes thousands of population datasets interoperable. . . .

I’m on their mailing list—maybe I requested some of their data at some point?—and this announcement came in the email:

We are thrilled to announce the winners of our annual IPUMS Research Awards competition. This competition celebrates innovative research from 2023 that uses IPUMS data to advance or deepen our understanding of social and demographic processes. . . .

IPUMS USA

  • Best published work: Zachary Ward. “Intergenerational Mobility in American History: Accounting for Race and Measurement Error.”
  • Best student work: Jonathan Tollefson. “Environmental Risk and the Reorganization of Urban Inequality in the Late 19th and Early 20th Century.”

IPUMS Spatial: IPUMS NHGIS, IPUMS IHGIS, IPUMS Terra, or IPUMS CDOH

  • Best published work: Clark Gray and Maia Call. “Heat and Drought Reduce Subnational Population Growth in the Global Tropics.”
  • Best student work: Nicolas Longuet-Marx. “Party Lines or Voter Preferences? Explaining Political Realignment.”

IPUMS CPS

  • Best published work: Kaitlyn M. Berry, Julia A. Rivera Drew, Patrick J. Brady, and Rachel Widome. “Impact of Smoking Cessation on Household Food Security.”
  • Best student work: Sungbin Park, Kyung Min Lee, and John Earle. “Death Without Benefits: Unemployment Insurance, Re-Employment, and the Spread of Covid.”

IPUMS International

  • Best published work: Seife Dendir. “Intergenerational Education Mobility in Sub-Saharan Africa.”
  • Best student work: Rita Trias-Prats. “Gender Asymmetries in Household Leadership.”

IPUMS Global Health: IPUMS DHS and/or IPUMS PMA

  • Best published work: Chad Hazlett, Antonio P. Ramos, and Stephen Smith. “Better Individual-Level Risk Models Can Improve the Targeting and Life-Saving Potential of Early-Mortality Interventions.”
  • Best student work: Sara Ronnkvist, Brian Thiede, and Emma Barber. “Child Fostering in a Changing Climate: Evidence from Sub-Saharan Africa.”

IPUMS Health Surveys: IPUMS NHIS or IPUMS MEPS

  • Best published work: Jessica Y Ho. “Lifecourse Patterns of Prescription Drug Use in the United States.”
  • Best student work: Namgyoon Oh. “Nutrition to Nurturance: The Impact of Children’s WIC Eligibility Loss on Parental Well-being.”

IPUMS Time Use: IPUMS ATUS, IPUMS MTUS, or IPUMS AHTUS

  • Best published work: Eunjeong Paek. “Workplace Computerization and Inequality in Schedule Control.”
  • Best student work: Anja Gruber. “The Impact of Job Loss on Parental Time Investment.”

Excellence in Research
This award highlights outstanding research using any of the IPUMS data collections by authors who identify as members of groups that are underrepresented in social science and health research.

  • Best published work: Samuel H. Kye, and Andrew Halpern-Manners. “If Residential Segregation Persists, What Explains Widespread Increases in Residential Diversity?”
  • Best student work: Sophie Li. “The Effect of a Woman-Friendly Occupation on Employment: U.S. Postmasters Before World War II.”

That’s great. I love that they give this award to people who use their data.

Also, their slogan is “Use it for good!” How cool is that?

Good job, IPUMS.

Fake stories in purported nonfiction

I’ve been frustrated by the willingness of people to just make stuff up, or to pass along obvious errors and fabrications if they think it will help them make a point.

We’ve seen this with the Harvard law school professor who promotes the sort of stupid election denial conspiracy theories that would surely rate an F in his classes but which help his political allies. We’ve possibly seen this with the pizzagate guy and the dishonesty researcher, in that there are legitimate questions about whether their famous bottomless soup bowl and paper shredder were ever used in experiments as claimed. We’ve seen this with the computer programmer who shared an implausible and evidence-free claim of a smallish town with a supposed epidemic of IT-preventable deaths, a claim that then appeared in a data science textbook. We’ve seen this with a string of economists who elaborated nearly beyond recognition an already-dubious story about boat-pullers.

It’s some mixture of lies and the sort of aggressive credulity that allows someone to promote untrue stories with a straight face.

I thought about this again when listening to the controversial If Books Could Kill podcast’s episode on Rich Dad Poor Dad, a book that featured a ridiculous story where the author claims that, as a child, he and his friend melted down lead toothpaste tubes and poured them into a plaster mold to make . . . lead nickels! Apparently this was some sort of urban legend that the book’s author must have felt was effective at making whatever stupid point he was making.

So, yeah, the fabricating author of Rich Dad Poor Dad is pretty much the same as those economists and psychologists and computer scientists and all sorts of credentialed people who are willing to make stuff up or to promote fabrications in order to support their favored theories, providing active contrapositives to Dan Davies’s maxim, “Good ideas do not need lots of lies told about them in order to gain public acceptance.”

Anyway, this interests me, the general phenomenon of fake stories in purported nonfiction. I’m not talking about Mark Twain or David Sedaris telling tall tales, or George Orwell or A. J. Liebling moving around details or creating composite characters—not that I’d do that either! Rather, what concerns me is people making up events or details which they then present as evidence for whatever iffy claims they are pushing. The paradigmatic example might be fabricated atrocity stories in a war: the other side is the bad guys, so it’s ok to make stuff up, right? And then we see it in science (sometimes inadvertently, but recall Clarke’s Law) and then there’s that “smallish town” dude—I guess he justified his exaggeration on the grounds that, hey, information technology really is important—and the Rich Dad Poor Dad guy, who perhaps styles himself as the Marc Hauser of personal responsibility, not to be held back by the sort of schoolmarms who would insist on factual accuracy . . .

Really, it’s not about science at all. It’s about all those people out there for whom the concept of truth is dominated by the concept of justice. If a certain story should be true, and you’re in the right, then by golly you have every right to insist that it is true. Who are others to question you? The story is as good as true! Even if that smallish town and that paper shredder never existed, they could’ve existed, and why is everyone unfairly calling you a liar when they have no evidence of that, etc etc etc.

So, again, it was interesting to see pop social science and academic pontificating—two things I’ve thought a lot about in recent decades—mixed in with self-help, which is something I’m not so familiar with. It gives me more of a sense of the underlying unity of the problem.

“If Books Could Kill” could be better

That podcast has the problem that it consistently goes overboard. Freakonomics, while seriously flawed, has lots of good material too. Gladwell’s superpower is his credulity but at least he gets you thinking. The End of History is a period piece but it had some interesting ideas and, I think, deserved its wide circulation. Rich Dad Poor Dad . . . ok, I never read that one, but it reminds me of those 1970s bestsellers, Winning Through Intimidation and Looking out for #1, which, sure, they’re evil, but they make you think, just because they were sooooo different from the usual messages that we were getting in school or in what has since been called “the mainstream media.”

Similarly with Freakonomics and The Rules and Gladwell and all the rest: a key part of the appeal to readers is the feeling that you’re being let in on the secret, and the authors are breaking taboos, telling you things you’re not supposed to be hearing.

One thing that’s unfortunate about If Books Could Kill is that it has such a uniform political perspective. The hosts of the show are on the political left, and that’s their choice: They don’t need to pretend to be something they’re not, and there are plenty of people on the other side making the conservative case. Indeed, one thing I’ve noticed in the science reform movement is that there are good reasons to support (or, I guess, to oppose) science reform from the left or the right.

Still, I wish they’d mock some bad left-wing books too. They came close when they devoted an episode to the book, The Identity Trap, which was written by center-left pundit Yascha Mounk—but they presented him as a “reactionary centrist,” focusing on the aspects of right-wing thought that led him to say silly things, rather than on the left-wing contributions to his silliness. They also did episodes on center-left pundit Cass Sunstein, but again without really probing the questionable leftist aspects of his ideology. I’m all-in on mocking Sunstein, but if you characterize him as a conservative, you’re really missing the point: his work illustrates lots of bad things about contemporary liberal politics. And they criticize Oprah Winfrey, who also is on the left side of the political spectrum, but again without connecting this to the ideological nature of her appeal.

Why would I like the hosts of If Books Could Kill to critique left-wing messages of bad pop-science and self-help books? Not for the purpose of “balance” or “false equivalence” or whatever. No, the reason is that these guys are funny and often insightful, and it would be interesting to see what they can do if they can’t just lean on their political stance. It’s the same reason I’d like to see Joe Rogan interview Alexey Guzey. If Rogan’s gonna be credulous, why not be credulous with someone who has genuinely interesting things to say?

Online seminar for Monte Carlo Methods++

This is Bob.

There’s a new weekly online seminar on all aspects of Monte Carlo (including adjacent topics like generative modeling and uncertainty quantification). Here’s the home page for the seminar.

The website includes the schedule, and links for Zoom, the Google mailing list, and the YouTube channel. Their site states that seminars take place on

  • Tuesdays at 8:30 am PT / 11:30 am ET / 4:30 pm London / 5:30 pm Paris / 11:30 pm Beijing

Presumably they’ll announce new times to cope with asynchronous daylight saving time changeovers.

They’re starting with an A-list of speakers:

  • Persi Diaconis – 10/01/2024
  • Mike Giles – 10/08/2024
  • Art Owen – 10/15/2024
  • Gareth Roberts – 10/29/2024

Mike Giles wrote the fundamental matrix autodiff results we used for Stan—I didn’t know he worked on sampling. Art Owen gave a practice version of his talk at Flatiron Institute yesterday, and it included crystal-clear introductions to several ideas I’d never seen before, including multiple ways to generate quasi Monte Carlo points, Brownian bridges (which I’d never seen introduced in a way I could understand), etc.
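
Since those topics may be unfamiliar, here’s a rough illustration of both (my own sketch, not material from Owen’s talk; it assumes scipy 1.7+ for scipy.stats.qmc): scrambled Sobol points as one way to generate quasi Monte Carlo points, and a Brownian bridge obtained by pinning an approximate Brownian path to zero at both ends.

```python
# Rough illustration (not from the talk): quasi Monte Carlo points and a Brownian bridge.
import numpy as np
from scipy.stats import qmc

# Quasi Monte Carlo: scrambled Sobol points fill the unit square more evenly than iid
# uniform draws, which is what drives the faster convergence of QMC integration.
sobol = qmc.Sobol(d=2, scramble=True, seed=1)
qmc_points = sobol.random_base2(m=8)                 # 2^8 = 256 low-discrepancy points in [0,1)^2
iid_points = np.random.default_rng(1).random((256, 2))
print("discrepancy, Sobol:", qmc.discrepancy(qmc_points))   # lower = more evenly spread
print("discrepancy, iid:  ", qmc.discrepancy(iid_points))

# Brownian bridge on [0, 1]: take a random-walk approximation W of Brownian motion
# and subtract t * W(1), which forces the path back to 0 at t = 1.
t = np.linspace(0, 1, 513)
dW = np.random.default_rng(2).normal(0, np.sqrt(np.diff(t)))
W = np.concatenate([[0.0], np.cumsum(dW)])           # approximate Brownian motion path
bridge = W - t * W[-1]                               # Brownian bridge: pinned at 0 at both ends
print("bridge endpoints:", bridge[0], bridge[-1])    # both 0 by construction
```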

The seminars are being organized by Valentin De Bortoli (DeepMind), Yang Chen (Michigan), Nianqiao Ju (Purdue), Sifan Liu (Flatiron/Duke), Sam Power (Bristol), Qian Qin (Minnesota), and Guanyang Wang (Rutgers).

3 levels of fraud: One-time, Linear, and Exponential

In a one-time fraud, you do it, you get it over with, and you move on. Kind of like that saying, “The secret of a great success for which you are at a loss to account is a crime that has never been found out, because it was properly executed.” One-time frauds occur where for some reason the victim of the fraud is in no position to do anything about it, and the fraudster has no need to keep doing more of it.

In a linear fraud, you do it, and then you need to keep doing it, or you need to keep covering for it. An example is if you do fraudulent research to get an academic job, and then you’re expected to continue producing amazing results if you want promotion. So what do you do? You keep on doing what it takes to get those amazing results. Remember the Armstrong Principle: if you’re pushed to promise more than you can deliver, you’re motivated to cheat. Or, for another example, Lance Armstrong himself: he didn’t need to keep doping—retirement from competitive cycling was an option at any point—but he did need to continue to lie, and to intimidate others into silence, because these doping investigations kept happening. The system of laws and rules in cycling did not allow his cheating to be grandfathered in, so he had to keep on frauding.

In an exponential fraud, as time goes on you have to do larger and larger frauds. As we discussed in the context of Dan Davies’s book, Lying for Money, this exponential property is characteristic of financial fraud, where you have to keep scamming or borrowing more money to cover your past losses. At no point can you just close out, because if you don’t keep covering what you’d already promised, your creditors would close in. This also happens with people who try to resolve their gambling debts by making it all back at the track: they need to bet more and more until eventually they go bust. It’s worse than the classic gambler’s ruin problem, because the bets get bigger and bigger.
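
To see why escalating bets are worse than the classic fixed-bet setup, here’s a small simulation (an illustrative sketch only; the bankroll, odds, and bet sizes are made up): a gambler who doubles the stake after every loss goes bust far more often over the same number of bets than one who always bets the same amount.

```python
# Toy simulation (illustrative only): fixed bets vs. doubling-after-a-loss bets.
# Both gamblers face the same slightly unfavorable odds; the doubler busts far more often.
import random

def ruined(double_after_loss, bankroll=100, base_bet=1, n_bets=200, p_win=0.49):
    bet = base_bet
    for _ in range(n_bets):
        bet = min(bet, bankroll)              # can't bet more than you have
        if random.random() < p_win:
            bankroll += bet
            bet = base_bet                    # reset the stake after a win
        else:
            bankroll -= bet
            if double_after_loss:
                bet *= 2                      # chase the loss with a bigger bet
        if bankroll <= 0:
            return True
    return False

random.seed(0)
trials = 10_000
fixed = sum(ruined(False) for _ in range(trials)) / trials
chasing = sum(ruined(True) for _ in range(trials)) / trials
print(f"ruin rate, fixed bets:    {fixed:.1%}")   # essentially never within 200 bets
print(f"ruin rate, doubling bets: {chasing:.1%}") # a modest losing streak wipes out the bankroll
```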

The above-linked post discussed linear and exponential frauds, but I hadn’t thought to include one-time fraud. As I wrote at the time, Maradona didn’t have to keep punching balls into the net; once was enough, and he still got to keep his World Cup victory. If Brady Anderson doped, he just did it and that was that; no escalating behavior was necessary.

What’s the story behind that paper by the Center for Open Science team that just got retracted?

Nov 2023: A paper was published with the inspiring title, “High replicability of newly discovered social-behavioural findings is achievable.” Its authors included well-known psychologists who were active in the science-reform movement, along with faculty at Berkeley, Stanford, McGill, and other renowned research universities. The final paragraph of the abstract reported:

When one lab attempted to replicate an effect discovered by another lab, the effect size in the replications was 97% that in the original study. This high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries.

This was a stunning result, apparently providing empirical confirmation of the much-mocked Harvard claim that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

Indeed, upon publication of that article, the renowned journal Nature released a news article titled “What reproducibility crisis? New research protocol yields ultra-high replication rate”:

In a bid to restore its reputation, experimental psychology has now brought its A game to the laboratory. A group of heavy-hitters in the field spent five years working on new research projects under the most rigorous and careful experimental conditions possible and getting each other’s labs to try to reproduce the findings. . . . The study, the authors say, shows that research in the field can indeed be top quality if all of the right steps are taken. . . .

Since then, though, the paper has been retracted. (The news article is still up in its original form, though.)

What happened?

Sep 2024: Here is the retraction notice, in its entirety:

The Editors are retracting this article following concerns initially raised by Bak-Coleman and Devezer.

The concerns relate to lack of transparency and misstatement of the hypotheses and predictions the reported meta-study was designed to test; lack of preregistration for measures and analyses supporting the titular claim (against statements asserting preregistration in the published article); selection of outcome measures and analyses with knowledge of the data; and incomplete reporting of data and analyses.

Post-publication peer review and editorial examination of materials made available by the authors upheld these concerns. As a result, the Editors no longer have confidence in the reliability of the findings and conclusions reported in this article. The authors have been invited to submit a new manuscript for peer review.

All authors agree to this retraction due to incorrect statements of preregistration for the meta-study as a whole but disagree with other concerns listed in this note.

As many people have noticed, the irony here is that the “most rigorous and careful experimental conditions possible” and “all the right steps” mentioned in that news article refer to procedural steps such as . . . preregistration!

June 2023: Back when the original paper was first circulating, well before its publication, let alone the subsequent retraction, I expressed my opinion that their recommended “rigor-enhancing practices” of “confirmatory tests, large sample sizes, preregistration, and methodological transparency” were not the best ways of making a study replicable. I argued that those sorts of procedural steps are less important than clarity in scientific methods (“What exactly did you do in the lab or the field, where did you get your participants, where and when did you work with them, etc.?”), implementing treatments that have large effects (this would rule out many studies of ESP, subliminal suggestion, etc.), focusing on scenarios where effects could be large, and improving measurements. Brian Nosek, one of the authors of the original article, responded to me and we had some discussion here. I later published a version of my recommendations in an article, Before Data Analysis: Additional Recommendations for Designing Experiments to Learn about the World, for the Journal of Consumer Psychology.

Sep 2024: As noted above, the retraction of the controversial paper was spurred by concerns of Joe Bak-Coleman and Berna Devezer, which the journal published here. These are their key points:

The authors report a high estimate of replicability, which, in their appraisal, “justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries” . . . However, replicability was not the original outcome of interest in the project, and analyses associated with replicability were not preregistered as claimed.

Again, let me emphasize that preregistration and methodological transparency were two of the “rigour-enhancing measures” endorsed in that original paper. So for them to have claimed preregistration and then not to have done it, that’s not some sort of technicality. It’s huge; it’s at the very core of their claims.

Bak-Coleman and Devezer continue:

Instead of replicability, the originally planned study set out to examine whether the mere act of scientifically investigating a phenomenon (data collection or analysis) could cause effect sizes to decline on subsequent investigation . . . The project did not yield support for this preregistered hypothesis; the preregistered analyses on the decline effect and the resulting null findings were largely relegated to the supplement, and the published article instead focused on replicability, with a set of non-preregistered measures and analyses, despite claims to the contrary.

Interesting. So the result that got all the attention (“When one lab attempted to replicate an effect discovered by another lab, the effect size in the replications was 97% that in the original study”) and which was presented as so positive for science (“This high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries”) was actually a negative result which went against their preregistered hypothesis! Had the estimated effects declined, that would’ve been a win for the original hypothesis (which apparently involved supernatural effects, but that’s another story); a lack of decline becomes a win in the new framing.

The other claim in the originally-published paper, beyond the positive results in the replications, was that the “rigour-enhancing practices” had a positive causal effect. But, as Bak-Coleman and Devezer note, the study was not designed or effectively carried out to estimate the effect of such practices. There were also lots of other issues, for example changes of outcome measures; you can read Bak-Coleman and Devezer’s article for the details.

Nov 2023 – Mar 2024: After the article appeared but before it was retracted, Jessica Hullman expressed concerns here and here about the placement of this research within the world of open science:

On some level, the findings the paper presents – that if you use large studies and attempt to eliminate QRPs, you can get a high rate of statistical significance – are very unsurprising. So why care if the analyses weren’t exactly decided in advance? Can’t we just call it sloppy labeling and move on?

I care because if deception is occurring openly in papers published in a respected journal for behavioral research by authors who are perceived as champions of rigor, then we still have a very long way to go. Interpreting this paper as a win for open science, as if it cleanly estimated the causal effect of rigor-enhancing practices is not, in my view, a win for open science. . . .

It’s frustrating because my own methodological stance has been positively impacted by some of these authors. I value what the authors call rigor-enhancing practices. In our experimental work, my students and I routinely use preregistration, we do design calculations via simulations to choose sample sizes, we attempt to be transparent about how we arrive at conclusions. . . .

When someone says all the analyses are preregistered, don’t just accept them at their word, regardless of their reputation.

The first comment on that latter post was by Anonymous, who wrote, “Data Colada should look into this.”

The joke here is that Data Colada is a blog run by three psychologists who have done, and continue to do, excellent investigatory work in the science reform movement—they’re on our blogroll! and they recently had to withstand a specious multi-million dollar lawsuit, so they’ve been through a lot—so looking into a prominent psychology paper that misrepresented its preregistration would be right up their alley—except that one of the authors of that paper is . . . a Data Colada author! I continue to have huge respect for these people. Everyone makes mistakes, and it gets tricky when you are involved in a collaborative project that goes wrong.

Sep 2024: After the retraction came out, Jessica published a followup post reviewing the story:

From the retraction notice:

The concerns relate to lack of transparency and misstatement of the hypotheses and predictions the reported meta-study was designed to test; lack of preregistration for measures and analyses supporting the titular claim (against statements asserting preregistration in the published article); selection of outcome measures and analyses with knowledge of the data; and incomplete reporting of data and analyses.

This is obviously not a good look for open science. The paper’s authors include the Executive Director of the Center for Open Science, who has consistently advocated for preregistration because otherwise authors pass off exploratory hypotheses as confirmatory.

Jessica continues:

As a full disclosure, late in the investigation I was asked to be a reviewer, probably because I’d shown interest by blogging about it. Initially it was the extreme irony of this situation that made me take notice, but after I started looking through the files myself I’d felt compelled to post about all that was not adding up. . . .

What I still don’t get is how the authors felt okay about the final product. I encourage you to try reading the paper yourself. Figuring out how to pull an open science win out of the evidence they had required someone to put some real effort into massaging the mess of details into a story. . . . The term bait-and-switch came to mind multiple times as I tried to trace the claims back to the data. Reading an academic paper (especially one advocating for the importance of rigor) shouldn’t remind one of witnessing a con, but the more time I spent with the paper, the more I was left with that impression. . . . The authors were made aware of these issues, and made a choice not to be up front about what happened there.

On the other hand:

It is true that everyone makes mistakes, and I would bet that most professors or researchers can relate to having been involved in a paper where the story just doesn’t come together the way it needs to, e.g., because you realized things along the way about the limitations of how you set up the problem for saying much about anything. Sometimes these papers do get published, because some subset of the authors convinces themselves the problems aren’t that big. And sometimes even when one sees the problems, it’s hard to back out for peer pressure reasons.

Sep 2024: Bak-Coleman posted his own summary of the case. It’s worth reading the whole thing. Here I want to point out one issue that didn’t come up in most of the earlier discussions, a concern not just about procedures (preregistration, etc.) but about what was being studied:

Stephanie Lee’s story covers the supernatural hypothesis that motivated the research and earned the funding from a parapsychology-friendly funder. Author Jonathan Schooler had long ago proposed that merely observing a phenomenon could change its effect size. Perhaps the other authors thought this was stupid, but that’s a fantastic reason to either a) not be part of the project or b) write a separate preregistration for what you predict. We can see how the manuscript evolved to obscure this motivation for the study. The authors were somewhat transparent about their unconventional supernatural explanation in the early drafts of the paper from 2020:

According to one theory of the decline effect, the decline is caused by a study being repeatedly run (i.e., an exposure effect). According to this account, the more studies run between the confirmation study and the self-replication, the greater the decline should be.

This is nearly verbatim from the preregistration:

According to one theory of the decline effect, the decline is caused by a study being repeatedly run (i.e., an exposure effect). Thus, we predict that the more studies run between the confirmation study and the self-replication, the greater will be the decline effect.

It is also found in responses to reviewers at Nature, who sensed the authors were testing a supernatural idea even though they had reframed things towards replication by this point:

The short answer to the purpose of many of these features was to design the study a priori to address exotic possibilities for the decline effect that are at the fringes of scientific discourse….

As an aside, it’s wild to call your co-authors and funder the fringes of scientific discourse. Why take money from and work with cranks? Have some dignity. . . .

This utterly batshit supernatural framing erodes en route to the published manuscript. Instead, the authors refer to these primary hypotheses that date back to the origin of the project as phenomena of secondary interest and do not describe the hypotheses and mechanisms explicitly. They refer only to this original motivation in the supplement of “test of unusual possible explanations.” . . .

It’s fine to realize your idea was bad, but something else to try to bury it in the supplement and write up a whole different paper you describe in multiple places as being preregistered and what you set out to study. Peer review is no excuse for misleading readers just to get your study published because the original idea you were funded to study was absurd.

Nevertheless, when you read the paper, you’d have no idea this is what they got funding to study. Their omitted variables and undisclosed deviations in their main-text statistical models make it even harder to discern they were after the decline effect. They were only found in the pre-registered analysis code which was made public during the investigation.

In distancing themselves from two of the three reasons they got funding, they misled the reader about what they set out to study and why. This isn’t a preregistration issue. This is outcome switching, and lying. It’s almost not even by omission because they say it’s the fringes of scientific discourse but it’s the senior author on the paper!

In 2019 I spoke at a conference at Stanford (sorry!) that was funded by those people, and I agree with Bak-Coleman that science reform and supernatural research are strange bedfellows. Indeed, the conference itself was kinda split, with most of the speakers being into science reform but with a prominent subgroup who were pushing traditional science-as-hero crap—I guess they saw themselves as heroic Galileo types. I remember one talk that started going on about how brilliant Albert Einstein or Elon Musk was at the age of 11, another talk all about Nobel prize winners . . . that stuff got me so annoyed I just quietly slipped out of the auditorium and walked outside the building into the warm California sun . . . and there I met some other conference participants who were equally disgusted by that bogus hero-worship thing . . . I’d found my science-reform soulmates! I also remember the talk by Jonathan Schooler (one of the authors of the recently-retracted article), not in detail but I do remember being stunned that he was actually talking about ESP. Really going there, huh? It gave off a 70s vibe, kinda like when I took a psychology class in college and the professor recommended that we try mind-altering drugs. (That college course was in the 80s, but it was the early 80s, and the professor definitely seemed like a refugee from the 60s and 70s; indeed, here he is at some sort of woo-woo-looking website.)

Responses from the authors of the original article

I wanted to supplement the above readings with any recent statements by the prominent authors of the now-retracted article, but I can’t find anything online at the sites of Brian Nosek, Data Colada, or elsewhere. If someone can point me to something, I’ll add the link. Given all the details provided by Bak-Coleman and others, it’s hard for me to imagine that a response from the authors would change my general view of the situation, but I could be missing something, and it’s always good to hear more perspectives.

Matt Clancy in comments pointed to this response by a subset of the authors of the original paper. Protzko, Nosek, Lundmark, Pustejovsky, Buttrick, and Axt write:

This collaboration began at a conference in 2012, organized by Jonathan Schooler and funded by the Fetzer Franklin Fund. The meeting included researchers from many fields who had observed evidence of declining and disappearing effects. Some presenters . . . offered conventional explanations for observing weaker evidence in replications compared with original studies such as selective reporting, p-hacking, underpowered research, and poor methodological transparency. Other presenters offered unconventional explanations, such as the act of observation making effects decline over time, possibilities that Schooler believed worthy of empirical investigation . . . and others have dismissed as inconsistent with the present understanding of physics.

Let me just say, not in the spirit of argumentation but in the spirit of offering another perspective, that what Protzko et al. refer to as “conventional explanations for observing weaker evidence in replications” may be conventional in some quarters but are revolutionary in much of science—and certainly were far from conventional in 2012! Back then, and even now to a large extent, when a study failed to replicate, the response by the original authors was almost always that there was something wrong with the replication attempt, that some important aspect, not actually emphasized in the original study, turned out to be the key all along. The most notorious of these might be when the ovulation-and-clothing researchers failed to replicate their own study but then claimed the new result was a success because it demonstrated that weather was a key moderating factor. Or the slow-walking guy, who wrote, “There are already at least two successful replications of that particular study . . . Both articles found the effect but with moderation by a second factor.” It was actually a different factor in each experiment, and you don’t need to be Karl R. Popper to know that a theory that can explain anything, explains nothing. Another example is discussed here. So, yeah, as of 2012 I’d say the conventional explanation was that any published study was just fine and any apparently failed replication could be explained away as the replicators just not doing it right.

Also, “inconsistent with the present understanding of physics” is a funny way to put it. It’s literally true! For example, if someone shows me a perpetual motion machine or a video of some guru levitating or ESP or whatever, then, sure, it’s inconsistent with the present understanding of physics. Others have referred to such things as “supernatural explanations,” but I guess that’s a less polite way to put it in some quarters.

Regarding preregistration, they state that their meta-analysis was based on 80 individual results (some of which they label as “discoveries,” which I know is standard statistics jargon so I can’t really complain but I still don’t like the term), each of which was preregistered, along with meta-analyses, only some of which were preregistered:

The statements about preregistration of the individual studies and meta-project should have been made separately. Only some of the analyses for the meta-project were preregistered. . . . This is not the only error in the paper (see our submitted correction in the Appendix), but it is an important one.

This letter doesn’t address the procedural concerns expressed by Hullman and Bak-Coleman or the validity of their larger conclusion that their “high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries.” I guess they plan to write more about all of this in the future. For now, the paper has been retracted so I guess they have no need to defend its claims. In particular, in the letter they downgraded their claim from “justifies confidence in rigour-enhancing methods” to “this proof-of-concept project demonstrated that high replicability is achievable in a prospective context.” So, on the substance, we all seem to be in pretty much the same place.

Putting it all together

The interesting thing about this story—its “man bites dog” aspect—is that the people involved in the replication failure are not the usual suspects. This is not a case of TED-talking edgelords getting caught in an exaggeration. This time we’re talking about prominent science reformers. Indeed, Brian Nosek, leader of the Center for Open Science, coauthored a wonderful paper a few years ago detailing how they’d fooled themselves with forking paths and how they were saved from embarrassment by running their own replication study.

One thing that concerned me when I first heard this story—and I think Jessica had a similar reaction—was, are these people being targeted? Whistleblowers attract reaction, and lots of people are there waiting to see you fall. I personally get annoyed when people misrepresent my writings and claim that I’ve said things that I never said—this happens a lot, and when it happens I’m never sure where to draw the line between correcting the misrepresentation and just letting it go, because the sort of people who will misrepresent are also the sort of people who won’t correct themselves or admit they were wrong; basically, you don’t want to get in a mudwrestling match with someone who doesn’t mind getting dirty—so I was sensitive to the possibility that Nosek and the other authors of that paper were being mobbed. But when I read what was written by Hullman, Bak-Coleman, Devezer, and others, I was persuaded that they were treating the authors of that paper fairly.

In that case, what happened? This was not a Wansink or Ariely situation where, in retrospect, they’d been violating principles of good science for years and finally got caught. Rather, the authors of that recently retracted paper included several serious researchers in psychology, along with people who had made solid contributions to science reform, as well as a couple of psychology researchers who were more fringey.

So it’s not a simple case of “Yeah, yeah, we could’ve expected that all along.” It’s more like, “What went wrong?”

I think three things are going on.

1. I think there’s a problem with trying to fix the replication crisis using procedural reforms, by which I mean things like preregistration, p-value or Bayes-factor thresholds, and changes in the processes of scientific publication. There’s room for improvement in all these areas, no doubt, and I’m glad that people are working on them—indeed, I’ve written constructively on many of these topics myself—but they don’t turn bad science into good science, all they do is offer an indirect benefit by changing the incentive structure and, ideally, motivating better work in the future. That’s all fine, but when they’re presented as “rigour-enhancing methods to increase the replicability of new discoveries” . . . No, I don’t think so.

I think that most good science (and engineering) is all about theory and measurement, not procedure.

Indeed, over-focus on procedure is a problem not just in the science-reform movement but also in statistics textbooks. We go on and on about random sampling, random assignment, linear models, normal distributions, etc etc etc. . . . All these tools can be very useful—when applied to measurements that address questions of interest in the context of some theoretical understanding. If you’re studying ESP or the effects of subliminal smiley faces on attitude or the effects of day of the month on vote intention, all the randomization in the world won’t help you. And statistical modeling—Bayesian or otherwise—won’t help you either, except in the indirect sense of making it more clear that whatever you’re trying to study is overwhelmed by noise. This sort of negative benefit is a real thing; it’s just not quite what reformers are talking about when they talk about “increase the replicability of new discoveries.”

2. Related is the personalization of discourse in meta-science. Terms such as “p-hacking” and “questionable research practices” are in a literal sense morally neutral, but I feel that they are typically used in an accusatory way. That’s the case even with “harking” (hypothesizing after results are known), which I think is actually a good thing! One reason I started using the term “forking paths” is that it doesn’t imply intentionality, but there’s only so much you can do with words.

I think the bigger issue here is that scientific practice is taken as a moral thing, which leads to two problems. First, if “questionable research practices” is something done by bad people, then it’s hard to even talk about it, because then it seems that you’re going around accusing people. I know almost none of the people whose work I discuss—that’s how it should be, because publications are public, they’re meant to be read by strangers—and it would be very rare that I’d have enough information to try to judge their actions morally, even if I were in a position to judge, which I’m not. Second, if “questionable research practices” is something done by bad people, then if you know your motives are pure, logically that implies that you can’t have done questionable research practices. I think that’s kinda what happened with the recently retracted paper. The authors are science reformers! They’ve been through a lot together! They know they’re good people, so when they feel “accused” of questionable research practices (I put “accused” in quotes because what Bak-Coleman and Devezer were doing in their article was not to accuse the authors of anything but rather to describe certain things that were in the article and the metadata), it’s natural for them (the authors) to feel that, no, they can’t be doing these bad things, which puts them in a defensive posture.

The point about intentionality is relevant to practice, in that sins such as “p-hacking,” “questionable research practices,” “harking,” etc., are all things that are easy to fix—just stop doing these bad things!—and many of the proposed solutions, such as preregistration and increased sample size, require some effort but no thought. Some of this came up in our discussion of the 2019 paper, “Arrested Theory Development: The Misguided Distinction Between Exploratory and Confirmatory Research,” by Aba Szollosi and Chris Donkin.

3. The other thing is specific to this particular project but perhaps has larger implications in the world of science reform. A recent comment pointed to this discussion from an Open Science Framework group from 2013. Someone suggests a multi-lab replication project:

And here was the encouraging reply:

As I wrote here, the core of the science reform movement (the Open Science Framework, etc.) has had to make all sorts of compromises with conservative forces in the science establishment in order to keep them on board. Within academic psychology, the science reform movement arose from a coalition between radical reformers (who viewed replications as a way to definitively debunk prominent work in social psychology they believed to be fatally flawed) and conservatives (who viewed replications as a way to definitively confirm findings that they considered to have been unfairly questioned on methodological grounds). As often happens in politics, this alliance was unstable and has in turn led to “science reform reform” movements from the “left” (viewing current reform proposals as too focused on method and procedure rather than scientific substance) and from the “right” (arguing that the balance has tipped too far in favor of skepticism).

To say it another way, the science reform movement promises different things to different people. At some level this sort of thing is inevitable in a world where different people have different goals but still want to work together. For some people such as me, the science reform movement is a plus because it opens up a space for criticism in science, not just in theory but actual criticism of published claims, including those made by prominent people and supported by powerful institutions. For others, I think the science reform movement has been viewed as a way to make science more replicable.

Defenders of preregistration have responded to the above points by saying something like, “Sure, preregistration will not alone fix science. It’s not intended to. It’s a specific tool that solves specific problems.” Fair enough. I just think that a lot of confusion remains on this point; indeed, my own reasons for preregistration in my own work are not quite the reasons that science reformers talk about.

Summary

The 2023 paper that claimed, “this high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries,” was a disaster. The 2024 retraction of the paper makes it less of a disaster. As is often the case, what appears to be bad news is actually the revelation of earlier bad news; it’s good news that it got reported.

Confusion remains regarding the different purposes of replication, along with the role of procedural interventions such as preregistration that are designed to improve science.

We should all be thankful to Bak-Coleman and Devezer for the work they put into this project. I can see how this can feel frustrating for them: in an ideal world, none of this effort would have been necessary, because the original paper would never have been published!

The tensions within the science reform movement—as evidenced by the prominent publication of a research article that was originally designed to study a supernatural phenomenon, then was retooled to represent evidence in favor of certain procedural reforms, and finally was shot down by science reformers from the outside—can be seen as symbolic of, or representative of, a more general tension that is inherent in science. I’m speaking here of the tension between hypothesizing and criticism, between modeling and model checking, between normal science and scientific revolutions (here’s a Bayesian take on that). I think scientific theories and scientific measurement need to be added to this mix.

Mark Twain on chatbots

From 1906, as quoted by Albert Bigelow Paine:

On the trip down in the dining-car there was a discussion concerning the copyrighting of ideas, which finally resolved itself into the possibility of originating a new one. Clemens said:

There is no such thing as a new idea. It is impossible. We simply take a lot of old ideas and put them into a sort of mental kaleidoscope. We give them a turn and they make new and curious combinations. We keep on turning and making new combinations indefinitely; but they are the same old pieces of colored glass that have been in use through all the ages.

See here for further discussion of that topic (thinking of much of our own speech and writing as a form of sequential rearrangement of existing material). I do think that this sort of mechanistic analogy can give us insights into our own processes of speaking and writing.

Also relevant is this article, Artificial Intelligence and Aesthetic Judgment, with Jessica Hullman and Ari Holtzman.

Stan’s autodiff is 4x faster than JAX on CPU but 5x slower on GPU (in one eval)

This is Bob.

JAX on my mind

I’ve been thinking a lot about JAX lately. JAX is appealing to a computer scientist like me, due to its beautifully compositional architecture for coding autodiff on GPUs. Had JAX existed when we started coding Stan in 2011, we would’ve used that rather than rolling our own autodiff system. Apparently Theano did exist at that time, but we didn’t hear about it until long after releasing the first version of Stan.

Why JAX?

Originally, my interest was sparked by Matt Hoffman’s work on Cheese and Meads samplers (too lazy to look up their capitalization pattern, so going Andrew style) that use massively parallel HMC steps on GPU (Matt was driven to abandon NUTS because its recursive structure is anathema to GPU acceleration). It continues to be kindled by statements from people like Elizaveta Semenova, who announced during the recent StanCon at Oxford that she had to give up Stan and move to NumPyro because she couldn’t code neural nets easily in Stan and Stan doesn’t scale on the GPU (we have some operations that can be sent to GPU, but we can’t keep the whole eval in kernel). The full-blown fire is due to Justin Domke and Abhinav Agrawal’s work on normalizing flows—they have a repo, vistan, that implements their methods, including a greatly improved version of autodiff variational inference (ADVI) and a version of real non-volume preserving (realNVP) normalizing flows. Gilad Turok, who’s a research analyst here and is applying to grad school for next year, is almost done with a better-engineered version we can submit to Blackjax.

We haven’t gotten far, but so far we haven’t found a density that the realNVP flows, followed by importance resampling, didn’t fit nearly perfectly. For example, it could fit a 1000-parameter hierarchical IRT 2PL without any identifying strategy other than priors, which is a model that Stan cannot fit well at all due to the product-of-funnels structure of the posterior geometry. Justin and Abhinav’s ADVI and realNVP variants rely on JAX’s ability to massively parallelize the log density and gradient calculations in order to stabilize the stochastic gradients used to calculate KL divergence up to a constant (i.e., the ELBO).
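For readers curious about what that “followed by importance resampling” step looks like, here is a minimal numpy sketch of self-normalized importance resampling given draws from a fitted approximation. This is a generic illustration under my own naming conventions, not the code in vistan or the Blackjax submission.

```python
import numpy as np
from scipy.stats import norm, t

def importance_resample(draws, logp, logq, n_resample, rng=None):
    """Resample draws from an approximation q toward a target p using
    self-normalized importance weights.

    draws: (n_draws, n_params) samples from the approximation q
    logp:  (n_draws,) target log density at the draws (up to a constant)
    logq:  (n_draws,) approximation's log density at the draws
    """
    rng = np.random.default_rng() if rng is None else rng
    logw = logp - logq            # log importance ratios
    logw -= logw.max()            # stabilize before exponentiating
    w = np.exp(logw)
    w /= w.sum()                  # self-normalized weights
    idx = rng.choice(len(draws), size=n_resample, replace=True, p=w)
    return draws[idx], w

# Toy usage: a normal(0, 1.5) approximation to a Student-t(5) target.
rng = np.random.default_rng(1)
draws = rng.normal(0.0, 1.5, size=(10_000, 1))
logq = norm(0, 1.5).logpdf(draws[:, 0])
logp = t(df=5).logpdf(draws[:, 0])
resampled, w = importance_resample(draws, logp, logq, n_resample=2_000, rng=rng)
print("importance-sampling effective sample size:", 1.0 / np.sum(w**2))
```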

Comparing Stan and JAX performance

At the recent StanCon in Oxford, Simon Maskell presented a result I’ve been meaning to generate for a year or two, which is a comparison of JAX and Stan on CPU. Here’s Simon (or at least his arm!) presenting the result:

Working out the algebra problem Simon’s slide title presents (JG = 20 * JC, JG = 5 * SC, where JG is JAX on GPU, JC is JAX on CPU, and SC is Stan on CPU), we see that JAX on GPU is 5 times faster than Stan on CPU, which is in turn 4 times faster than JAX on CPU. For the evaluation, Simon was using a high-end consumer-grade (RTX) NVIDIA GPU—we will re-run Simon’s results on our state-of-the-art GPUs over the next couple of months and report back. TensorFlow did a similar evaluation with similar results a few years ago, but I wasn’t able to find it.
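For completeness, here is the same arithmetic written out, with JG, JC, and SC as defined above:

```latex
\begin{align*}
JG &= 20\,JC, \qquad JG = 5\,SC \\
\Rightarrow\ SC &= \tfrac{1}{5}\,JG = \tfrac{20}{5}\,JC = 4\,JC,
\end{align*}
```

so JAX on GPU is 5 times the speed of Stan on CPU, and Stan on CPU is 4 times the speed of JAX on CPU.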

CPU to GPU over next decade?

This is actually kind of a bummer in that it means we couldn’t really move to JAX as a back end for most of our users because it’s not very performant on CPUs. My guess is that the median Stan user in terms of computation has a 5-year-old Windows notebook with no GPU. Part of my interest in things like JAX is my feeling that in 10 years this will no longer be the case, and almost everyone trying to fit a statistical model will be working on a cluster.

Oh no Stanford no no no not again please make it stop

Susie Neilson writes:

Tony Robbins was having a rough year. It was 2019 . . . Buzzfeed News was publishing a multi-part investigation into allegations that he had, during his in-person events, groped women and belittled abuse survivors. Robbins had a lot to lose. . . . he had amassed a net worth estimated at more than $400 million. His sweaty self-help seminars and grab-life-by-the-balls philosophy had spurred devotion from millions of followers, many of whom spent thousands on his products. . . . But at a precarious time, Robbins found an unlikely ally: Stanford.

No, not this guy. Not this person either, nor is it the hypester mentioned here. Not this group either. No, it was some other crew:

Around the same time that Buzzfeed News published its series, the Stanford Healthcare Innovation Lab, helmed by acclaimed genomicist Michael Snyder, launched a very different kind of investigation into Robbins’ seminars as part of an effort to identify “novel approaches to mental health.” In 2021, researchers affiliated with the lab, known as SHIL, published a study of “Unleash the Power Within,” Robbins’ four-day flagship seminar. . . .

Then, in 2022, SHIL-affiliated researchers — some of whom were fans and acolytes of Robbins’ work — published a more provocative paper. This one claimed that Robbins’ six-day, $4,500 “Date with Destiny” program eliminated symptoms of depression in 100% of initially depressed event-goers who were studied. In contrast, across clinical trials of antidepressants, just half of people report feeling better in six to eight weeks.

100% improved, huh? That sounds like big news! And, indeed, the Stanfordians were pretty proud, as well they should have been after such a clear and positive finding:

“This is going to be one of the most effective, if not the most effective, improvements in depression published,” Ariel Ganz, SHIL’s director of mental health innovation and the studies’ co-author, said in a 2021 video conversation with Snyder, other co-authors and Robbins.

In case you’re curious, the two papers in question are here and here.

Sorry, now comes the bad news:

But when the Chronicle asked more than a dozen experts in psychology, statistics and medical research to review Stanford’s Date with Destiny study, many raised serious concerns about its validity. They found basic calculation errors, head-scratching data points and conflicting statements about how study participants were selected. Critically, they noted that too few people participated in the research for the findings to hold meaning for the public at large.

That’s a bummer when your study has data problems. Really too bad. That paper had 26 participants and 9 authors—that’s less than 3 data points per author, better than the student-faculty ratio at Ivy League schools. You’d think the authors could’ve avoided all these errors by divvying up the problem and looking carefully at the data from three participants each. Now they’re in the same category as that gremlins guy who approached the Platonic ideal of publishing a paper with more errors than data points.

The authors responded:

Snyder declined an interview, asking for reporters’ questions in writing. In response to a list of detailed questions, he acknowledged the study must now be corrected in light of the Chronicle’s findings, while noting that it was peer-reviewed. Snyder said that, in any event, the resulting paper suggested that “immersive interventions may be useful for reducing depressive symptoms and enhancing well-being.”

Ahhhh, the one-way street fallacy! Their data are also consistent with no effect, or with a negative effect.

Neilson continues:

Neither of the studies bore marks of research misconduct or fraud, according to the experts contacted by the Chronicle. Yet the work doesn’t appear to be of the caliber expected of a researcher with Snyder’s reputation, they said, or of a university such as Stanford. . . .

“Has he lost his mind?” asked Phyllis Gardner, Snyder’s colleague at Stanford’s School of Medicine, who was among the first to question the work of disgraced Theranos founder Elizabeth Holmes.

“That seems really surprising to me,” said Dr. Max Wintermark, the former deputy director of the Stanford Center for Precision Mental Health and Wellness, who now chairs the University of Texas MD Anderson Cancer Center’s neuroradiology department.

What was going on?? Here’s the story:

The Chronicle spent several months looking into Stanford’s unusual research. The partnership behind the studies formed when Ganz, then a new postdoctoral researcher in Snyder’s lab, met Benjamin Rolnik, a former Hollywood talent agent, at a wellness retreat hosted by one of Robbins’ close friends and collaborators, self-help guru Byron Katie.

Wow, this is a wonderful origin story. Much better than anything we have in statistics. The story continues:

Robbins and Snyder backed the same startup; they have also promoted each others’ products and business partners. . . . Robbins has also elevated controversial ideas and therapies. In the 1990s, he expressed doubt about the link between HIV and AIDS . . . Like Robbins, Katie has been criticized in online forums for belittling the experiences of abuse survivors during her own events. In 2012, she told an interviewer that she “never” felt sorry for people who experienced rape and other abuse, because “they only believe their thoughts” and were “perfectly all right.” In an email to the Chronicle, Katie said her statement had been “taken out of context.”

Wow. Also this:

Stanford did not respond to the Chronicle’s request for a list of donors to the genetics department and more information about SHIL members. . . . Under Snyder, Rolnik and Ganz went on to co-author four papers on Katie’s method, showing The Work could help stutterers, teachers and people whose genes put them at elevated risk for breast cancer.

And now some details:

James Heathers, a physiologist and affiliated researcher at Linnaeus University in Sweden, is best known for his work detecting errors in published research. . . . Heathers commended the Stanford researchers for publishing their data in a Dropbox folder, which he said is a good practice that demonstrates openness, but said he was left with broad concerns after reviewing it.

“Overall: This study is poorly conducted, and contains data handling errors, protocol violations, and other evidence of poor experimental practice,” Heathers wrote in a document summarizing his line-by-line review of the Date With Destiny study.

Critically, the researchers relied on patients’ PHQ-9 scores but calculated them incorrectly, Heathers noted. Instead of adding up each of the nine items, they added item No. 2 twice and skipped over item No. 3. These scores form the bedrock of findings about patients’ changing depression levels. “It is hard to imagine how this escaped inspection,” Heathers wrote.
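Just to make the described slip concrete, here is a toy version of that scoring error with made-up PHQ-9 responses (the actual study data are in the Dropbox folder mentioned above; these numbers are invented):

```python
import numpy as np

# Invented PHQ-9 responses for three respondents (columns are items 1-9, each scored 0-3).
items = np.array([[1, 2, 0, 3, 1, 0, 2, 1, 0],
                  [0, 1, 1, 2, 0, 0, 1, 0, 0],
                  [3, 3, 2, 2, 1, 2, 3, 1, 1]])

correct_total = items.sum(axis=1)
# The slip described above: item 2 counted twice, item 3 skipped (0-based indices 1 and 2).
wrong_total = items[:, [0, 1, 1, 3, 4, 5, 6, 7, 8]].sum(axis=1)
print("correct:", correct_total)   # [10  5 18]
print("wrong:  ", wrong_total)     # [12  5 19]
```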

OK, but this last bit is funny

I’ve tried to present the above story in an entertaining way, but ultimately it’s sad (people with health problems being led on by hype), infuriating (academics who are already rich and successful cutting corners to gain even more riches and fame . . . what is it they want, a jet ski made out of diamonds? Is there anything that would satisfy these people?), and upsetting (as this takes up financial and attentional resources that could be spent on actual science).

But the story does have one funny part, and here it is:

Representatives for Elsevier, the company behind the Journal of Psychiatric Research, which published the study, declined to provide details about the article’s peer-review process, but said it “upholds the highest standards of quality and integrity across all journals.”

I have a horrible feeling they’re being honest there. This published paper with all these data problems and conflict-of-interest problems probably does match a standard of quality and integrity that occurs across Elsevier’s journals.

They’re not even pretending to be shocked that they published such a bad paper. You gotta hand it to them for being so brazen about it. Not.

Well, today we find our heroes flying along smoothly…

This is Jessica. I hadn’t planned to be down on open science research again so soon, but I seem to keep finding myself presented with messes associated with it. After a 7+ month investigation instigated by a Matters Arising critique by Bak-Coleman and Devezer, Nature Human Behaviour retracted the “feel-good open science story” paper “High replicability of newly discovered social-behavioural findings is achievable” by Protzko et al. From the retraction notice:

The concerns relate to lack of transparency and misstatement of the hypotheses and predictions the reported meta-study was designed to test; lack of preregistration for measures and analyses supporting the titular claim (against statements asserting preregistration in the published article); selection of outcome measures and analyses with knowledge of the data; and incomplete reporting of data and analyses.

This is obviously not a good look for open science. The paper’s authors include the Executive Director of the Center for Open Science, who has consistently advocated for preregistration on the grounds that, without it, authors pass off exploratory hypotheses as confirmatory. Another author is a member of the Data Colada team, which has outed others’ questionable research practices and helped popularize the idea that selective reporting and harking threaten the validity of claimed results in psych.

I once thought I did know all about it

If seeing this paper retracted makes you uncomfortable, I don’t blame you. It makes me uncomfortable too. My views on mainstream open science research and advocacy were much more positive a year ago before I encountered all this.

As full disclosure, late in the investigation I was asked to be a reviewer, probably because I’d shown interest by blogging about it. Initially it was the extreme irony of the situation that made me take notice, but after I started looking through the files myself I felt compelled to post about all that was not adding up. When asked to officially participate in the investigation, I agreed, but with some major hesitation. I knew that to be comfortable weighing in on the question of retraction, I’d want to think through many possible defenses of how the paper presents its points. That would mean spending time beyond what I’d already spent going through the OSF to write one of my blog posts, to sort through the paper’s arguments and consider whether they could possibly hold up. None of this is at all connected to my main gig in computer science.

But ultimately I said yes out of a sense of duty, figuring that as an outsider to this community with no real alliances with the open science movement or any of the authors involved, it would be relatively easy for me to be honest. 

The final version of the Matters Arising, now published by the journal, summarizes a number of core issues: the lack of justification, given the study design and missing pre-registration, for implying a causal relationship or even discussing an association between rigor-enhancing practices and the replicability rate the authors observe; the inconsistencies between the replicability definition and those in the literature; the over-interpretation of the statistical power estimate, etc. Hard to get beyond this barrage of points. 

Since the rain falls, the wind it blows, and the sun shines

What’s funny, though, is that I somehow still sort of expected this to be a difficult call. Maybe I was susceptible to the tendency to want to give such esteemed authors, several of whom have done work I really respect, the benefit of the doubt. I was obviously aware going into the investigation of the lack of preregistration for the main analyses that they claimed to have preregistered. But I tried to keep an open enough mind that I wouldn’t miss any possible value the paper could still have for readers despite that flaw.

Unfortunately, as I re-read the Protzko et al. paper to consider what, if anything, one could learn from their results about the role of rigor-enhancing practices, I quickly found myself unable to resolve a fundamental issue related to how they establish that the replication rate they observe is high in the first place. The reference set of effects they mean when they use terms like “original discoveries” is not consistent throughout the paper, including in their calculations of expected power and replicability, which they use to establish their claim of “high replicability.” Sometimes these terms refer to effects from the pilots and sometimes to effects from the confirmatory studies. Because of the way the authors set up their claims, with rigor-enhancing practices characterizing the whole process, they would need those practices to apply to both the confirmatory studies and the pilot studies.

But the paper text and other materials contradict themselves about how the practices apply across these two sets of studies. For example, I spent some time looking for the pilot preregistrations (which the paper also claims exist), but found only a handful, suggesting that the paper also can’t back up its claims about preregistration there. Given this contradiction between what they say about their design (and the lack of info on the pilots) and the logic they set up to make one of their central points, I didn’t see how the paper could redeem itself, even if we decide to be optimistic about the other issues. Retraction was clearly the right decision. You can read some comments related to what I wrote in my review here.

What I still don’t get is how the authors felt okay about the final product. I encourage you to try reading the paper yourself. Figuring out how to pull an open science win out of the evidence they had required someone to put some real effort into massaging the mess of details into a story. It was frustrating as a reader of the paper trying to match the reported values to the set of effects or processes they used. The term bait-and-switch came to mind multiple times as I tried to trace the claims back to the data.  Reading an academic paper (especially one advocating for the importance of rigor) shouldn’t remind one of witnessing a con, but the more time I spent with the paper, the more I was left with that impression. It’s worth noting that the lack of sufficient detail about the pilots was brought up at length in Tal Yarkoni’s review of the original submission, as well as Malte Elson’s review for NHB. The authors were made aware of these issues, and made a choice not to be up front about what happened there.

It is true that everyone makes mistakes, and I would bet that most professors or researchers can relate to having been involved in a paper where the story just doesn’t come together the way it needs to, e.g., because you realized things along the way about the limitations of how you set up the problem for saying much about anything. Sometimes these papers do get published, because some subset of the authors convinces themselves the problems aren’t that big. And sometimes even when one sees the problems, it’s hard to back out for peer pressure reasons. 

But even then, there’s still a difference between finding oneself in such a situation and crowing all over the place about the paper as if it is a piece of work that delivers some valuable truth. What’s puzzled me from the start is that this paper was not only published, it was widely shared by the authors as a kind of victory lap for open science. 

Don’t you know that your creator is running out of ideas

So while I came into this whole experience relatively open-minded about open science, my views have been colored less positively after learning about this paper and seeing certain other open science advocates defend it. I personally stopped seeing the value of most behavioral experiments a few years ago, because I could no longer get beyond the chasm between the inferences we want to draw and the processes we are limited to when we design them. But I guess I interpreted this as more of a personal tic. Preregistration, open data and methods, better power analysis etc. practices might not be enough to make me feel excited about behavioral experiments, but I assumed that the work open science advocates were doing to encourage these practices was doing some good. I hadn’t really considered that open science could be doing harm, beyond maybe encouraging a different set of rigor signalling games.

This experience has changed my view, from “live and let live if people find it helpful” to “this is not helpful,” given that producing evidence to change policy (or logical justifications presented as sufficient for policy without empirical evidence) appears to be a goal of open science research like this. Preregister if you find it helpful. Make your materials open because you should. But don’t expect these practices to transform your results into solid science, and don’t trust people who try to tell you it’s as easy as adopting a few simple rituals. I’m now doubtful that the flurry of research on fixing the so-called replication crisis is truly interested in engaging deeply with concepts like statistical power or replicability. I’m left wondering how many other empirical pro-open science papers are rhetorical feats to “keep up the momentum” regardless of what can actually be concluded from the data.

P.S. On a lighter note related to the title of this post (or not so light if you remember how the quote ends), remember Rocky and Bullwinkle? My dad used to always try to get us to watch re-runs when they came on TV. The other references in the post (also from my dad’s era) are from a Bert Jansch song.  

It’s martingale time, baby! How to evaluate probabilistic forecasts before the event happens? Rajiv Sethi has an idea. (Hint: it involves time series.)

My Columbia econ colleague writes:

The following figure shows how the likelihood of victory for the two major party candidates has evolved since August 6—the day after Kamala Harris officially secured the nomination of her party—according to three statistical models (Trump in red, Harris in blue):

Market-derived probabilities have fluctuated within a narrower band. The following figure shows prices for contracts that pay a dollar if Harris wins the election, and nothing otherwise, based on data from two prediction markets (prices have been adjusted slightly to facilitate interpretation as probabilities, and vertical axes matched to those of the models):

So, as things stand, we have five different answers to the same question—the likelihood that Harris will prevail ranges from 51 to 60 percent across these sources. On some days the range of disagreement has been twice as great.

As we’ve discussed, a difference of 10 percentage points in predicted probability corresponds to roughly a difference of 0.4 percentage points (that is, 0.004) in predicted vote share. So, yeah, it makes complete sense to me that different serious forecasts would differ by this much, and it also makes sense that markets could differ by this much. (As Rajiv discussed in an earlier post, for logistical reasons it’s not easy to arbitrage between the two markets shown above, so it’s possible for them to maintain some daylight between them.)
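For readers wondering where that 0.4% comes from, here is a quick back-of-the-envelope calculation assuming the forecast for the candidate’s two-party vote share is roughly normal; the standard deviation of 1.5 percentage points is my illustrative choice, not a number from any of these models:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.015  # assumed forecast sd for two-party vote share (1.5 percentage points)

def win_prob(mu, sigma=sigma):
    """P(vote share > 0.5) under a normal forecast centered at mu."""
    return norm.cdf((mu - 0.5) / sigma)

# Vote-share shift needed to move the win probability from 50% to 60%:
mu0 = 0.5 + sigma * norm.ppf(0.50)
mu1 = 0.5 + sigma * norm.ppf(0.60)
print(f"shift in predicted vote share: {100 * (mu1 - mu0):.2f} percentage points")
# ~0.38 points, i.e., roughly the 0.4% figure above.
```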

Also, this is a minor thing but if you’re gonna plot two lines that add to a constant, it’s enough to just plot one of them. I say this in part out of general principles and in part because these lines that cross 0.5 create shapes and other visual artifacts such as the “vase” in the PredictIt plot. I think these visual artifacts get in the way of seeing and learning from the data.

OK, that’s all background. Different forecasts differ. The usual way we talk about evaluating forecasts is by comparing them to outcomes. Rajiv writes:

The standard approach would involve waiting until the outcome is revealed and then computing a measure of error such as the average daily Brier score. This can and will be done, not just for the winner of the presidency but also the outcomes in each competitive state, the popular vote winner, and various electoral college scenarios.

I think the evaluation should be done on vote margin, not on the binary win/loss outcome, as the evaluation based on a binary outcome is hopelessly noisy, a point that’s come up on this blog many times and which I explained again last month. Even if you use the vote margin, though, you still have just one election outcome, and that won’t be enough for you to compare different reasonable forecasts.

Rajiv has a new idea:

But there is a method of obtaining a tentative measure of forecasting performance even prior to event realization. The basic idea is this. Imagine a trader who believes a particular model and trades on one of the markets on the basis of this belief. Such a trader will buy and sell contracts when either the model forecast or the market price changes, and will hold a position that will be larger in magnitude when the difference between the forecast and the price is itself larger. This trading activity will result in an evolving portfolio with rebalancing after each model update. One can look at the value of the resulting portfolio on any given day, compute the cumulative profit or loss over time, and use the rate of return as a measure of forecasting accuracy to date.

This can be done for any model-market pair, and even for pairs of models or pairs of markets (by interpreting a forecast as a price or vice versa).

I don’t agree with everything Rajiv does here—he writes, “The trader was endowed with $1,000 and no contracts at the outset, and assigned preferences over terminal wealth given by log utility (to allow for some degree of risk aversion),” which makes no sense to me, as I think anyone putting $1000 in a prediction market would be able to lose it all without feeling much bite—but I’m guessing that if the analysis were switched to a more sensible linear utility model, the basic results wouldn’t change.
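To make the mechanics concrete, here is a minimal sketch of the mark-to-market bookkeeping under a simple linear stake rule. This is my own simplification, not Rajiv’s log-utility setup, and the numbers at the bottom are made up:

```python
import numpy as np

def trading_score(model_prob, market_price, scale=1000.0):
    """Cumulative mark-to-market profit of a trader who bets a model against a market.

    model_prob, market_price: daily model probabilities and market prices for the
    same contract (e.g., "Harris wins"). Each day the trader holds a position
    proportional to the gap between model and market (long if the model is higher,
    short if lower), then marks the position to the next day's price.
    """
    model_prob = np.asarray(model_prob, dtype=float)
    market_price = np.asarray(market_price, dtype=float)
    position = scale * (model_prob[:-1] - market_price[:-1])  # contracts held from day t to t+1
    daily_pnl = position * np.diff(market_price)              # gain from the price move
    return daily_pnl.cumsum()

# Made-up daily series, just to show the shape of the output:
model = [0.55, 0.56, 0.58, 0.57, 0.60]
market = [0.52, 0.53, 0.56, 0.58, 0.59]
print(trading_score(model, market))
```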

Rajiv summarizes his empirical results:

Repeating this exercise for each model-market pair, we obtain the following returns:

Among the models, FiveThirtyEight performs best and Silver Bulletin worst against each of the two markets, though the differences are not large. And among markets, PredictIt is harder to beat than Polymarket.

I don’t take this as being a useful evaluation of the three public forecasts, because . . . these are small numbers, and this is still just N = 1. It’s one campaign we’re talking about. Another way to put it is: What are the standard errors on these numbers? You can’t get a standard error from only one data point.

This doesn’t mean that the idea is empty; we should just avoid overinterpreting the results.

I haven’t fully processed Rajiv’s idea but I think it’s connected to the martingale property of coherent probabilistic forecasts, as we’ve discussed in the context of betting on college basketball and elections.

However you look at it, the thing that will kill your time series of forecasts is too much volatility: if you anticipate having to incorporate a flow of noisy information, you need to anchor your forecasts (using a “prior” or a “model”) to stop your forecast from being bounced around by noise.
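Here is a toy simulation of that anchoring point, with invented numbers: the polls, the prior, and the forecast uncertainty are all made up, and the “anchored” series is just a normal prior combined with the running poll average.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

true_share = 0.52            # invented true two-party vote share
poll_sd = 0.02               # sd of a single day's poll
prior_mean, prior_sd = 0.50, 0.02
forecast_sd = 0.015          # sd used to turn a point forecast into a win probability

polls = true_share + poll_sd * rng.standard_normal(100)
win_prob = lambda mu: norm.cdf((mu - 0.5) / forecast_sd)

# Poll-chasing forecast: just use the latest poll each day.
chasing = win_prob(polls)

# Anchored forecast: precision-weighted combination of the prior and the running poll mean.
n = np.arange(1, len(polls) + 1)
running_mean = polls.cumsum() / n
post_mean = (prior_mean / prior_sd**2 + n * running_mean / poll_sd**2) / \
            (1 / prior_sd**2 + n / poll_sd**2)
anchored = win_prob(post_mean)

print(f"sd of day-to-day swings, poll-chasing: {np.diff(chasing).std():.3f}")
print(f"sd of day-to-day swings, anchored:     {np.diff(anchored).std():.3f}")
```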

Unfortunately, that goal of sensible calibrated stability runs counter to another goal of public forecasters, which is to get attention! For that, you want your forecast to jump around so that you are continuing to supply news.

I sent the above to Rajiv, who wrote:

One thing I didn’t show in the post is how the value of the portfolio (trading on PredictIt) would have evolved over time:

You will see that if I had conducted this analysis a couple of weeks ago, Silver Bulletin would have been doing well. It really suffered when the price of the Harris contract rose on the markets, since it was holding a significant short position.

The lesson here (I think) is that we need lots of events to come to any conclusion. I am working on the state-level and popular vote winner forecasts, but of course these will all be correlated, so that doesn’t really help much with the problem of small numbers of events. This is the Grimmer-Knox-Westwood point as I understand it.

If we were to apply this sort of procedure retroactively to past time series of forecasts or betting markets that were too variable because they were chasing the polls too much, then I’d think/hope it could reveal the problems. After 2016, forecasters have worked hard to keep lots of uncertainty in their forecasts, and I think that one result of that is to keep the market prices more stable.

Nonsampling error and the anthropic principle in statistics

We’ve talked before about the anthropic principle, which, in physics, is roughly the idea that things are what they are because otherwise we wouldn’t be around to see them. Related are various equilibrium principles which state that things are what they are because, if they weren’t, behavior would change until equilibrium is reached.

An example is the idea that price elasticity of demand should be close to -1. If it’s steeper than -1, then the seller has a motivation to lower the price so as to get more total money; if it’s shallower than -1, then the seller has a motivation to raise the price so as to get more total money; equilibrium is only at -1. The price elasticity of demand is not, in general, -1, because there are lots of other costs, benefits, and constraints in any system; the -1 thing is just a baseline. Still, it can be a useful baseline.
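To spell out that baseline argument, here is the standard calculation with a constant-elasticity demand curve, q(p) = A p^ε (a toy functional form chosen only to make the point):

```latex
\begin{align*}
R(p) &= p\,q(p) = A\,p^{1+\varepsilon}, \\
\frac{dR}{dp} &= A\,(1+\varepsilon)\,p^{\varepsilon}
\begin{cases}
<0 & \text{if } \varepsilon < -1 \text{ (lower the price to raise revenue)},\\
=0 & \text{if } \varepsilon = -1 \text{ (revenue flat in price)},\\
>0 & \text{if } \varepsilon > -1 \text{ (raise the price to raise revenue)}.
\end{cases}
\end{align*}
```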

Another example is the median voter theorem, which we’ve discussed many times on this blog (see the many links here): to the extent that parties take positions that are not close to the median of the voters, the parties should be able to gain votes by moving toward the median. Again, this does not generally happen because of many complicating factors; the median voter theorem can still be helpful as a baseline.

Another example is effect sizes in statistics, a topic that can also be studied empirically.

Today I want to talk about polling error, in particular, this finding from an article with Houshmand Shirani-Mehr, David Rothschild, and Sharad Goel:

Reported margins of error typically only capture sampling variability, and in particular, generally ignore nonsampling errors in defining the target population (e.g., errors due to uncertainty in who will vote). Here, we empirically analyze 4221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Comparing to the actual election outcomes, we find that average survey error as measured by root mean square error is approximately 3.5 percentage points, about twice as large as that implied by most reported margins of error.

Roughly speaking, nonsampling error is about the same size as sampling error. I want to argue that this fits an anthropic or equilibrium storyline. It goes like this: if you conduct a survey with a huge sampling error, then there will be a clear benefit from increasing your sample size and bringing that sampling error down. From the other direction, it would not make sense to run a state poll with a sample size in the tens of thousands: that would bring down the sampling error but it would not help with nonsampling error.

With independent error components,
sd of total error = sqrt((sd of sampling error)^2 + (sd of nonsampling error)^2),
and the way the math works is that reducing the smaller of these two terms gives diminishing returns.

Again this reasoning is only approximate. For one thing, if the sd of average survey error is twice that of sampling error, then this implies that sampling error is less than nonsampling error (because of that root-mean-square thing), and I guess that kinda makes sense, given that polls are used for information other than the headline number, also polls are analyzed for trends, not just levels. The idea is that you wouldn’t expect nonsampling error to be much less than sampling error or much more than sampling error.
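Here is a small numerical illustration of the diminishing-returns point and of that root-mean-square arithmetic; the nonsampling error and the sample sizes are invented for the example, not taken from our paper:

```python
import numpy as np

nonsampling_sd = 1.5  # held fixed; scaled so that it equals sampling error at n = 1000

for n in [500, 1000, 2000, 4000, 8000, 16000]:
    sampling_sd = 1.5 * np.sqrt(1000 / n)
    total_sd = np.hypot(sampling_sd, nonsampling_sd)
    print(f"n = {n:6d}: sampling {sampling_sd:.2f}, total {total_sd:.2f}")

# Doubling n from 1000 to 2000 cuts total error from about 2.12 to 1.84, but doubling
# from 8000 to 16000 only moves it from about 1.59 to 1.55: diminishing returns.
# And if total error is twice the sampling error, then the nonsampling part is
# sqrt(2^2 - 1^2) = sqrt(3), about 1.7 times the sampling error.
```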

The point of this anthropic reasoning is not to give an exact answer but rather to give some intuition to where we are. It’s related to the general principle that you’d expect variance and squared bias to be comparable to each other, as discussed in Section 4.3 of Regression and Other Stories.

Present Each Other’s Posters: An update after 15 years

This came up a few years back:

I was at a conference which had an excellent poster session. I realized the session would have been even better if the students with posters had been randomly assigned to stand next to and explain other students’ posters. Some of the benefits:

1. The process of reading a poster and learning about its material would be more fun if it was a collaborative effort with the presenter.

2. If you know that someone else will be presenting your poster, you’ll be motivated to make the poster more clear.

3. When presenting somebody else’s poster, you’ll learn the material. As the saying goes, the best way to learn a subject is to teach it.

4. The random assignment will lead to more interdisciplinary understanding and, ultimately, collaboration.

I think just about all poster sessions should be done this way.

P.S. In reply to comments:

– David writes that my idea “misses the potential benefit to the owner of the poster of getting critical responses to their work.” The solution: instead of complete randomization, randomize the poster presenters into pairs, then put pairs next to each other. Student A can explain poster B, student B can explain poster A, and spectators can give their suggestions to the poster preparers.

– Mike writes that “one strong motivation for presenters is the opportunity to stand in front of you (and other members of the evaluation committee) and explain *their* work to you. Personally.” Sure, but I don’t think it’s bad if instead they’re explaining somebody else’s work. If I were a student, I think I’d enjoy explaining my fellow-students’ work to an outsider. The ensuing conversation might even result in some useful new ideas.

– Lawrence suggests that “the logic of your post apply to conference papers, too.” Maybe so.

I had this idea a while ago but never did anything with it, so I’m happy to see that Cosma Shalizi independently came up with the idea, tried it out, and it worked:

Some years ago, Henry Farrell and I [Shalizi] ran a series of workshops about cooperative problem-solving and collective cognition where we wanted to get people with very different disciplinary backgrounds . . . talking to each other productively. We hit upon an idea which worked much better than we had any right to hope. . . .

1. Every participant in the workshop writes a brief presentation, with enough lead time for the organizers to read them all. In the context of an inter-disciplinary workshop, what often works best is to describe an outstanding problem in the field.

2. The workshop organizers semi-randomly assign each participant’s presentation to someone else, with enough lead time that the assignee can study the presentation. Again, in the interdisciplinary context, the organizers try to make sure that there’s some hope of comprehension. (While I called this the “presentation exchange”, it needn’t be a strict swap, where A gets assigned B’s presentation and vice versa.)

3. Everyone gives the presentation they were assigned, followed by their own comments on what they found interesting / cool / provocative and what they found incomprehensible. No one gives the presentation they wrote. . . .

Doing this at the beginning of the workshop helps make sure that everyone has some comprehension of what everyone else is talking about, or at least that mis-apprehensions or failures to communicate are laid bare. It can help break up the inevitable disciplinary/personal cliques. It can, and has, spark actual collaborations across disciplines. And, finally, many people report that knowing their presentation is going to be given by someone else forces them to write with unusual clarity and awareness of their own expert blind-spots. . . .

I’ve also used it for disciplinary workshops — because every discipline is a fractal (or lattice) of sub-sub-…-sub-disciplinary specialization. I’ve also used it for student project classes, at both the undergrad and graduate level. . . .

I’m so happy to hear about this.