Multilevel modeling to make better decisions using data from schools: How can we do better?

Michael Nelson writes:

I wanted to point out a paper, Stabilizing Subgroup Proficiency Results to Improve the Identification of Low-Performing Schools, by Lauren Forrow, Jennifer Starling, and Brian Gill.

The authors use Mr. P to analyze proficiency scores of students in subgroups (disability, race, FRL, etc.). The paper’s been getting a good amount of attention among my education researcher colleagues. I think this is really cool—it’s the most attention Mr. P’s gotten from ed researchers since your JREE article. This article isn’t peer reviewed, but it’s being seen by far more policymakers than any journal article would.

All the more relevant that the authors’ framing of their results is fishy. They claim that some schools identified as underperforming, based on mean subgroup scores, actually aren’t, because they would’ve gotten higher means if the subgroup n’s weren’t so small. They’re selling the idea that adjustment by poststratification (which they brand as “score stabilization”) may rescue these schools from their “bad luck” with pre-adjustment scores. What they don’t mention is that schools with genuinely underperforming (but small) subgroups could be misclassified as well-performing if they have “good luck” with post-adjustment scores. In fact, they don’t use the word “bias” at all, as in: “Individual means will have less variance but will be biased toward the grand mean.” (I guess that’s implied when they say the adjusted scores are “more stable” rather than “more accurate,” but maybe only to those with technical knowledge.)

And bias matters as much as variance when institutions are making binary decisions based on differences in point estimates around a cutpoint. Obviously, net bias up or down will be 0, in the long run, and over the entire distribution. But bias will always be net positive at the bottom of the distribution, where the cutpoint is likely to be. Besides, relying on net bias and long-run performance to make practical, short-run decisions seems counter to the philosophy I know you share, that we should look at individual differences not averages whenever possible. My fear is that, in practice, Mr. P might be used to ignore or downplay individual differences–not just statistically but literally, given that we’re talking about equity among student subgroups.

To the authors’ credit, they note in their limitations section that they ought to have computed uncertainty intervals. They didn’t, because they didn’t have student-level data, but I think that’s a copout. If, as they note, most of the means that moved from one side of the cutoff to the other are quite near it already, you can easily infer that the change is within a very narrow interval. Also to their credit, they acknowledge that binary choices are bad and nuance is good. But, also to their discredit, the entire premise of their paper is that the education system will, and presumably should, continue using cutpoints for binary decisions on proficiency. (That’s the implication, at least, of the US Dept. of Ed disseminating it.) They could’ve described a nuanced *application* of Mr. P, or illustrated the absurd consequences of using their method within the existing system, but they didn’t.

Anyway, sorry this went so negative, but I think the way Mr. P is marketed to policymakers, and its potential unintended consequences, are important.

Nelson continues:

I’ve been interested in this general method (multilevel regression with poststratification, MRP) for a while, or at least the theory behind it. (I’m not a Bayesian so I’ve never actually used it.)

As I understand it, MRP takes the average over all subgroups (their grand mean) and moves the individual subgroup means toward that grand mean, with smaller subgroups getting moved more. You can see this in the main paper’s graphs, where low means go up and high means go down, especially on the left side (smaller n’s). The grand mean will be more precise and more accurate (due to something called superefficiency), while the individual subgroup means will be much more precise but can also be much more biased toward the grand mean. The rationale for using the biased means is that very small subgroups give you very little information beyond what the grand mean is already telling you, so you should probably just use the grand mean instead.
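To make the mechanism concrete, here is a minimal sketch of that kind of shrinkage under a simple normal-normal model (illustrative numbers only, not the report's actual procedure; `mu`, `sigma2`, and `tau2` are made-up values):

```python
def partial_pool(ybar, n, sigma2, tau2, mu):
    """Posterior mean of a subgroup mean under a normal-normal model:
    true subgroup means ~ N(mu, tau2), individual scores have variance sigma2.
    Smaller n means a smaller weight w on the data, hence more shrinkage toward mu."""
    w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
    return w * ybar + (1 - w) * mu

mu, sigma2, tau2 = 50.0, 100.0, 25.0   # made-up grand mean and variances
# Two subgroups with the same raw mean of 40 but very different sizes:
small = partial_pool(40.0, n=5, sigma2=sigma2, tau2=tau2, mu=mu)
large = partial_pool(40.0, n=200, sigma2=sigma2, tau2=tau2, mu=mu)
# small is pulled much closer to the grand mean of 50 than large is.
```

With these numbers, the n = 5 subgroup's estimate moves from 40 to about 44.4, while the n = 200 subgroup barely moves (to about 40.2).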

In my view, that’s an iffy rationale for using biased subgroup proficiency scores, though, which I think the authors should’ve emphasized more. (Maybe they’ll have to in the peer-reviewed version of the paper.) Normally, bias in individual means isn’t a big deal: we take for granted that, over the long run, upward bias will be balanced out by downward bias. But, for this method and this application, the bias won’t ever go away, at least not where it matters. If what we’re looking at is just the scores around the proficiency cutoff, that’s generally going to be near the bottom of the distribution, and means near the bottom will always go up. As a result, schools with “bad luck” (as the authors say) will be pulled above the cutoff where they belong, but so will schools with subgroups that are genuinely underperforming.
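A quick simulation of this concern, under the same kind of simple normal-normal shrinkage (all numbers hypothetical): a small subgroup whose true mean sits below the cutoff gets pulled above it far more often after shrinkage than before.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau, sigma = 50.0, 5.0, 10.0   # grand mean, between-group sd, within-group sd
cutoff = 45.0
true_mean, n = 42.0, 8             # a genuinely low-performing, small subgroup

sims = 4000
crossed_raw = crossed_post = 0
w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)   # shrinkage weight on the data
for _ in range(sims):
    ybar = true_mean + rng.normal(0, sigma / np.sqrt(n))   # raw subgroup mean
    post = w * ybar + (1 - w) * mu                         # shrunk toward grand mean
    crossed_raw += ybar > cutoff
    crossed_post += post > cutoff
# The shrunk estimate lands above the cutoff far more often than the raw mean,
# even though the subgroup's true mean (42) is below the cutoff (45).
```

With these settings the shrunk estimate crosses the cutoff in roughly 40-45% of simulations, versus about 20% for the raw mean.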

I have a paper under review that derives a method for correcting a similar problem for effect sizes—it moves individual estimates not toward a grand mean but toward the true mean, in a direction and distance determined by a measure of the data’s randomness.

I kinda see what Nelson is saying, but I still like the above-linked report because I think that in general it is better to work with regularized, partially-pooled estimates than with raw estimates, even if those raw estimates are adjusted for noise or multiple comparisons or whatever.

To help convey this, let me share a few thoughts regarding hierarchical modeling in this general context of comparing averages (in this case, from different schools, but similar issues arise in medicine, business, politics, etc.).

1. Many years ago, Rubin made the point that, when you start with a bunch of estimates and uncertainties, classical multiple comparisons adjustments effectively increase the standard errors so that fewer comparisons are statistically significant, whereas Bayesian methods move the estimates around. Rubin’s point was that you can get the right level of uncertainty much more effectively by moving the intervals toward each other rather than by keeping their centers fixed and then making them wider. (I’m thinking now that a dynamic visualization would be helpful to make this clear.)

It’s funny because Bayesian estimates are often thought of as trading bias for variance, but in this case the Bayesian estimate is so direct, and it’s the multiple comparisons approaches that do the tradeoff, getting the desired level of statistical significance by effectively making all the intervals wider and thus weakening the claims that can be made from data. It’s kinda horrible that, under the classical approach, your inferences for particular groups and comparisons will on expectation get vaguer as you get data from more groups.
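One way to see the contrast, as a sketch with made-up estimates: Bonferroni keeps the centers fixed and widens every interval, while a simple hierarchical shrinkage (here using a crude moment estimate of the between-group variance) narrows both the spread of the centers and the intervals themselves.

```python
import numpy as np

# Ten hypothetical group estimates, each with standard error 1.
est = np.array([-2.2, -1.5, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3, 1.8, 2.5])
se = 1.0

# Classical fix: keep the centers, widen the intervals.
z_bonf = 2.807                      # two-sided z for alpha = 0.05/10 (Bonferroni)
width_bonf = 2 * z_bonf * se

# Hierarchical alternative: shrink the centers toward each other and
# report intervals based on the (smaller) posterior sd.
tau2 = max(np.var(est, ddof=1) - se**2, 0.0)   # moment estimate of between-group variance
w = tau2 / (tau2 + se**2)                      # shrinkage weight on each raw estimate
shrunk = w * est + (1 - w) * est.mean()
post_sd = np.sqrt(w * se**2)                   # posterior sd under the normal model
width_bayes = 2 * 1.96 * post_sd
```

Here the Bayesian intervals are both narrower and closer together than the Bonferroni-widened classical ones, matching Rubin's point.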

We explored this idea in our 2000 article, Type S error rates for classical and Bayesian single and multiple comparison procedures (see here for freely-available version) and more thoroughly in our 2011 article, Why we (usually) don’t have to worry about multiple comparisons. In particular, see the discussion on pages 196-197 of that latter paper (see here for freely-available version).

2. MRP, or multilevel modeling more generally, does not “move the individual subgroup means toward that grand mean.” It moves the error terms toward zero, which implies that it moves the local averages toward their predictions from the regression model. For example, if you’re predicting test scores given various school-level predictors, then multilevel modeling partially pools the individual school means toward the fitted model. It would not in general make sense to partially pool toward the grand mean—not in any sort of large study that includes all sorts of different schools. (Yes, in Rubin’s classic 8-schools study, the estimates were pooled toward the average, but these were 8 similar schools in suburban New Jersey, and there were no available school-level predictors to distinguish them.)
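A minimal sketch of this point with simulated data (hypothetical predictor and effect sizes): the pooled estimates move toward the fitted regression predictions, not toward the grand mean, so a school with an extreme predictor value keeps an extreme estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
J = 50
x = rng.normal(size=J)                    # a school-level predictor
alpha = 2.0 * x                           # true school effects depend on x
ybar = alpha + rng.normal(0, 1.0, J)      # observed school means, sampling se = 1

# Fit the school-level regression (plain least squares for illustration).
X = np.column_stack([np.ones(J), x])
beta, *_ = np.linalg.lstsq(X, ybar, rcond=None)
pred = X @ beta                           # each school's predicted mean

# Partial pooling pulls each school mean toward its prediction,
# not toward the grand mean.
tau2, se2 = 0.5, 1.0                      # assumed residual and sampling variances
w = tau2 / (tau2 + se2)
pooled = w * ybar + (1 - w) * pred
```

Pooling toward the grand mean would drag high-x schools far below their data; pooling toward `pred` only removes the noise around the model's prediction.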

3. I agree with Nelson that it’s a mistake to summarize results using statistical significance, and this can lead to artifacts when comparing different models. There’s no good reason to make decisions based on whether a 95% interval includes zero.

4. I like multilevel models, but point estimates from any source—multilevel modeling or otherwise—have unavoidable problems when the goal is to convey uncertainty. See our 1999 article, All maps of parameter estimates are misleading.

In summary, I like the Forrow et al. article. The next step should be to go beyond point estimates and statistical significance and to think more carefully about decision making under uncertainty in this educational context.

It’s the finals! Time to choose the ultimate seminar speaker: Beverly Cleary vs. Laura Ingalls Wilder

We’ve reached the endpoint of our third seminar speaker competition. Top seeds J. R. R. Tolkien, Miles Davis, David Bowie, Dr. Seuss, Hammurabi, Judas, Martha Stewart, and Yo-Yo Ma fell by the wayside (indeed, Davis, Judas, and Ma didn’t even get to round 2!); unseeded heavyweight Isaac Newton lost in round 3; and dark-horse favorites James Naismith, Henry Winkler, Alison Bechdel, and J. Robert Lennon couldn’t make the finish line either.

What we have is two beloved and long-lived children’s book authors. Cleary was more prolific, but maybe only because she got started at a younger age. Impish Ramona or serious Laura . . . who’s it gonna be?

Either way, I assume it will go better than this, from a few years ago:

The Brown Institute for Media Innovation, Alliance (Columbia University, École Polytechnique, Sciences Po, and Panthéon-Sorbonne University), The Center for Science and Society, and The Faculty of Arts and Sciences are proud to present

You are invited to apply for a seminar led by Professor Bruno Latour on Tuesday, September 23, 12-3pm. Twenty-five graduate students from throughout the university will be selected to participate in this single seminar given by Prof. Latour. Students will organize themselves into a reading group to meet once or twice in early September for discussion of Prof. Latour’s work. They will then meet to continue this discussion with a small group of faculty on September 15, 12-2pm. Students and a few faculty will meet with Prof. Latour on September 23. A reading list will be distributed in advance.

If you are interested in this 3-4 session seminar (attendance at all 3-4 sessions is mandatory), please send

Your School:
Your Department:
Year you began your terminal degree at Columbia:
Thesis or Dissertation title or topic:
Name of main advisor:

In one short, concise paragraph tell us what major themes/keywords from Latour’s work are most relevant to your own work, and why you would benefit from this seminar. Please submit this information via the site
The due date for applications is August 11 and successful applicants will be notified in mid-August.

That was the only time I’ve heard of a speaker who’s so important that you have to apply to attend his seminar! And, don’t forget, “attendance at all 3-4 sessions is mandatory.” I wonder what they did to the students who showed up to the first two seminars but then skipped #3 and 4.

Past matchup

Wilder faced Sendak in the last semifinal. Dzhaughn wrote:

This will be a really tight match up.

Sendak has won the Laura Ingalls Wilder Award. Yet no one has won more Maurice Sendak Awards than Wilder. And she was dead when he won it.

Maurice Sendak paid for his college by working at FAO Schwarz. That’s Big, isn’t it?

The Anagram Department notices “Serial Lulling Award,” not a good sign for a seminar speaker. “American Dukes” and “Armenia Sucked” are hardly top notch, but less ominous.

So, I come up with a narrow edge to Sendak but I hope there is a better reason.

“Serial Lulling Award” . . . that is indeed concerning!

Raghu offers some thoughts, which, although useless for determining who to advance to the final round, are so much in the spirit of this competition that I’ll repeat them here:

This morning I finished my few-page-a-day reading of the biography of basketball inventor and first-round loser James Naismith, and I was struck again by how well-suited he is to this tournament:

“It was shortly after seven o’clock, and the meal was over. He added briskly, ‘Let me show you some of the statistics I’ve collected about accidents in sports. I’ve got them in my study.’ He started to rise from the table and fell back into his chair. Ann recognized the symptoms. A cerebral hemorrhage had struck her father.” — “The Basketball Man, James Naismith” by Bernice Larson Webb

Statistics! Sports! Medical inference!

I am not, however, suggesting that the rules be bent; I’ve had enough of Naismith.

I finished Sendak’s “Higglety Pigglety Pop! Or, There Must Be More to Life” — this only took me 15 minutes or so. It is surreal, amoral, and fascinating, and I should read more by Sendak. Wilder is neither surreal nor amoral, though as I think I noted before, when I was a kid I found descriptions of playing ball with pig bladders as bizarre as science fiction. I don’t know who that’s a vote for.

I find it hard to read a book a few pages a day. I can do it for a while, but at some point I either lose interest and stop, or I want to find out what happens next so I just finish the damn book.

Diana offers a linguistic argument:

Afterthought and correction: The “n” should be considered a nasal and not a liquid, so Laura Ingalls Wilder has five liquids, a nasal, a fricative, a glide, and two plosives, whereas Maurice Sendak has two nasals, a liquid, two fricatives, and two plosives (and, if you count his middle name, three nasals, three liquids, two fricatives, and four plosives). So Wilder’s name actually has the greater variety of consonants, given the glide, but in Sendak’s name the various kinds are better balanced and a little more spill-resistant.

OK, sippy cups. Not so relevant for a talk at Columbia, though, given that there will be very few toddlers in the audience.

Anon offers what might appear at first to be a killer argument:

If you look at the chart, you can pretty clearly notice that the bracket is only as wide as it is because of Laura Ingalls Wilder’s prodigious name. I’ve got to throw my hat in the ring for Sendak, simply for storage.

+1 for talking about storage—optimization isn’t just about CPU time!—but this length-of-name argument reeks of sexism. In a less traditional society, Laura wouldn’t have had to add the Wilder to her name, and plain old “Laura Ingalls,” that’s a mere 13 characters wide, and two of them are lower-case l’s, which take up very little space (cue Ramanujan here). Alison Bechdel’s out of the competition now, but she’s still looking over my shoulder, as it were, scanning for this sort of bias.

And Ben offers a positive case for the pioneer girl:

There’s some sort of libertarian angle with Wilder though right?

What if we told Wilder about bitcoin and defi and whatnot? Surely that qualifies as surreal and amoral in the most entertaining kind of way. I know talking about these things in any context is a bit played out at this point but c’mon. This isn’t some tired old celebrity we’re selling here! This is the author of an American classic, from the grave — any way she hits that ball is gonna be funny.

Sounds good to me!

Bad stuff going down in biostat-land: Declaring null effect just cos p-value is more than 0.05, assuming proportional hazards where it makes no sense

Wesley Tansey writes:

This is no doubt something we both can agree is a sad and wrongheaded use of statistics, namely incredible reliance on null hypothesis significance testing. Here’s an example:

Phase III trial. Failed because their primary endpoint had a p-value of 0.053 instead of 0.05. Here’s the important actual outcome data though:

For the primary efficacy endpoint, INV-PFS, there was no significant difference in PFS between arms, with 243 (84%) of events having occurred (stratified HR, 0.77; 95% CI: 0.59, 1.00; P = 0.053; Fig. 2a and Table 2). The median PFS was 4.5 months (95% CI: 3.9, 5.6) for the atezolizumab arm and 4.3 months (95% CI: 4.2, 5.5) for the chemotherapy arm. The PFS rate was 24% (95% CI: 17, 31) in the atezolizumab arm versus 7% (95% CI: 2, 11; descriptive P < 0.0001) in the chemotherapy arm at 12 months and 14% (95% CI: 7, 21) versus 1% (95% CI: 0, 4; descriptive P = 0.0006), respectively, at 18 months (Fig. 2a). As the INV-PFS did not cross the 0.05 significance boundary, secondary endpoints were not formally tested.

The odds of atezolizumab being better than chemo are clearly high. Yet this entire article is being written as the treatment failing simply because the p-value was 0.003 too high.

He adds:

And these confidence intervals are based on proportional hazards assumptions. But this is an immunotherapy trial where we have good evidence that these trials violate the PH assumption. Basically, you get toxicity early on with immunotherapy, but patients that survive that have a much better outcome down the road. Same story here; see figure below. Early on the immunotherapy patients are doing a little worse than the chemo patients but the long-term survival is much better.
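Here is a toy simulation of the pattern Tansey describes, with purely hypothetical piecewise-constant hazard rates (not the trial's data): one arm has a higher hazard early and a much lower hazard late, so the survival curves cross and no single proportional-hazards ratio summarizes them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

def sample_times(h_early, h_late, cut=3.0, size=n):
    """Survival times (in months) with a piecewise-constant hazard:
    rate h_early before `cut` months, h_late after."""
    t = rng.exponential(1 / h_early, size)
    late = t > cut                        # survived the early period
    t[late] = cut + rng.exponential(1 / h_late, late.sum())
    return t

# Purely illustrative monthly hazard rates, not the trial's data:
immuno = sample_times(0.25, 0.05)   # worse early, much better late
chemo = sample_times(0.20, 0.20)    # constant hazard

def surv(t, times):
    return (times > t).mean()

s3_i, s3_c = surv(3, immuno), surv(3, chemo)      # immunotherapy worse early on
s18_i, s18_c = surv(18, immuno), surv(18, chemo)  # but far better at 18 months
```

With these made-up rates, 3-month survival is about 47% vs. 55% (immunotherapy slightly worse), while 18-month survival is about 22% vs. 3% (immunotherapy far better), the same qualitative pattern as the trial's 12- and 18-month PFS rates.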

As usual, our recommended solution for the first problem is to acknowledge uncertainty and our recommended solution for the second problem is to expand the model, at the very least by adding an interaction.

Regarding acknowledging uncertainty: Yes, at some point decisions need to be made about choosing treatments for individual patients and making general clinical recommendations—but it’s a mistake to “prematurely collapse the wave function” here. This is a research paper on the effectiveness of the treatment, not a decision-making effort. Keep the uncertainty there; you’re not doing us any favors by acting as if you have certainty when you don’t.

Laura Ingalls Wilder vs. Maurice Sendak; Cleary advances

OK, two more children’s book authors. Both have been through a lot. Laura defeated cool person Banksy, lawgiver Steve Stigler, person known by initials Malcolm X, and then a come-from-behind victory against lawgiver Alison Bechdel. Meanwhile, Maurice dethroned alleged tax cheat Martha Stewart, namesake Steve McQueen, and fellow children’s book author Margaret Wise Brown.

Who’s it gonna be? I’d say Maurice because he’s an illustrator as well as a writer. On the other hand, Laura’s books have a lot more content than Maurice’s. Also, as a political scientist, I appreciate the story of how Laura rewrote some of her life history to be more consistent with her co-author daughter’s political ideology.

Both authors are wilderness-friendly!

Past matchup

Raghu suggests we should sit here for the present.

Dzhaughn writes:

I have had the Cleary image of Ramona sitting in a basement taking one bite out of every apple for more than 90% of my life.

But Diana counters:

I don’t wanna go down to the basement.

Moving away from Ramona for a moment, Pedro writes:

A little bit of Googling reveals that Shakira once starred in a soap opera (telenovela) in her teen years. Apparently embarrassed, she ended up buying the rights to the soap and now it’s no longer available in any legal way.

Although I’m very sympathetic towards her actions and feelings, this blog is very pro-open science and sharing data and her actions are as against that as possible…

Good point! Cleary is very open, as you can see if you read her two volumes of autobiography. Maybe if she comes to speak, we’ll hear some excerpts from volume 3?

Again, here are the announcement and the rules.

It’s the semifinals! Shakira vs. Beverly Cleary; Sendak advances

As usual, the powers-of-2 thing sneaks up on us. All of a sudden, our third Greatest Seminar Speaker competition is nearing its final rounds.

Today we have two contestants to be reckoned with. Shakira made it pretty far against weak competition but then vanquished the mighty Dahl. Meanwhile Cleary shot down David Bowie, A. J. Foyt, and the inventor of Code Names.

Songwriter or storyteller; which will it be?

Past matchup

Raghu offers arguments in both directions:

On the one hand, we have not resolved the mystery of physiological scaling among weight lifters.

On the other:

I decided to spend some time in the library working — a change of scenery — and I picked up a book by Maurice Sendak, “Higglety Pigglety Pop! or There Must Be More to Life,” because previously all I’ve read by Sendak is “Where the Wild Things Are” and because “There Must Be More to Life” is a wonderful title. So far I am only four chapters in: a narcissistic and possibly psychopathic dog leaves her comfortable life in search of something better. It is excellent, and I look forward to finishing. So far it shows no connections to science or statistics, but I wouldn’t mind a seminar on whether there is or is not more to life.

Dzhaughn makes the case for . . . ummm, I’m not sure which one:

It’s hard for me to relate to someone who can eat as much as they want. Or more than they want in case of Japanese hot dog guy. Maybe I should open my mind and shut my mouth, even if that’s not their approach.

Supposedly Li Wenwen wins when she can eat, with more ease in her mores, more rice than Maurice, then Maurice.

Anonymous breaks the tie:

I think you really need to give a leg up to the unheard voices. I mean Maurice Sendak got to blab and blab in books, and then I’m sure went on the academic circuit to tell pretentious college students all about the importance of children’s books, and how important it is to pay him millions of dollars. I don’t speak Mandarin, so although Li Wenwen has surely spoken at many cadre meetings or whatever, I haven’t heard it.

And I was all ready to give it to Li, but then Ethan came in with this late entry:

“There must be more to life” is far weightier than anything Li can lift. Sendak wins on Wenwen’s turf. Sendak on to the semis.

Again, here are the announcement and the rules.

“Behavioural science is unlikely to change the world without a heterogeneity revolution”

Christopher Bryan, Beth Tipton, and David Yeager write:

In the past decade, behavioural science has gained influence in policymaking but suffered a crisis of confidence in the replicability of its findings. Here, we describe a nascent heterogeneity revolution that we believe these twin historical trends have triggered. This revolution will be defined by the recognition that most treatment effects are heterogeneous, so the variation in effect estimates across studies that defines the replication crisis is to be expected as long as heterogeneous effects are studied without a systematic approach to sampling and moderation. When studied systematically, heterogeneity can be leveraged to build more complete theories of causal mechanism that could inform nuanced and dependable guidance to policymakers. We recommend investment in shared research infrastructure to make it feasible to study behavioural interventions in heterogeneous and generalizable samples, and suggest low-cost steps researchers can take immediately to avoid being misled by heterogeneity and begin to learn from it instead.

We posted on the preprint version of this article earlier. The idea is important enough that it’s good to have an excuse to post on it again.

P.S. This also reminds me of our causal quartets.

Maurice Sendak vs. Li Wenwen; Wilder advances

At first I was gonna say that the edge goes to the author of In the Night Kitchen, because he can draw tempting food that Li Wenwen would then eat, causing her to go out of her weight class and forfeit her title. But Li is already in the upper weight class, so she can eat as much as she wants.

Who should advance? Pierre says, “I don’t care.” But some of you must have opinions!

Past matchup

Dzhaughn writes:

Bechdel’s rule would have meant nothing to Laura. But just about any conceivable movie about Laura passes the Bechdel test.

Jonathan adds:

It’s hard not to look ahead to anticipate potential future matchups and ignore the match in front of you. But it’s one match at a time. Fun Home on the Prairie!

I’m going to go with male protagonist proxies here (violating the Bechdel Rule)

Michael Landon (Poppa Wilder) vs. Michael Cerveris (Poppa Bechdel): Landon played a teenage werewolf while Cerveris played Sweeney Todd. Both scary, but I think the werewolf is scarier. So, Wilder.

That’s 2 arguments for Laura and 0 for Alison, so Laura it is.

Again, here are the announcement and the rules.

Predicting LLM havoc

This is Jessica. Jacob Steinhardt recently posted an interesting blog post on predicting emergent behaviors in modern ML systems like large language models. The premise is that we can get qualitatively different behaviors from a deep learning model with enough scale–e.g., AlphaZero hitting a point in training where suddenly it has acquired a number of chess concepts. Broadly we can think of this happening as a result of how acquiring new capabilities can help a model lower its training loss and how as scale increases, you can get points where some (usually more complex) heuristic comes to overtake another (simpler) one. The potential for emergent behaviors might seem like a counterpoint to the argument that ML researchers should write broader impacts statements to prospectively name the potential harms their work poses to society… non-linear dynamics can result in surprises, right? But Steinhardt’s argument is that some types of emergent behavior are predictable.

The whole post is worth reading so I won’t try to summarize it all. What most captured my attention though is his argument about predictable deception, where a model fools or manipulates the (human) supervisor rather than doing the desired tasks, because doing so gets it better or equal reward. Things like ChatGPT saying that “When I said that tequila has ‘relatively high sugar content,’ I was not suggesting that tequila contains sugar” or an LLM claiming there is “no single right answer to this question” when there is, sort of like a journalist insisting on writing a balanced article about some issue where one side is clearly ignoring evidence. 

The creepy part is that the post argues that there is reason to believe that certain factors we should expect to see in the future–like models being trained on more data, having longer dialogues with humans, and being more embedded in the world (with a potential to act)–are likely to increase deception. One reason is that models can use the extra info they are acquiring to build better theories-of-mind and use them to better convince their human judges of things. And when they can understand what humans respond to and act in the world they can influence human beliefs through generating observables. For example, we might get situations like the following:

suppose that a model gets higher reward when it agrees with the annotator’s beliefs, and also when it provides evidence from an external source. If the annotator’s beliefs are wrong, the highest-reward action might be to e.g. create sockpuppet accounts to answer a question on a web forum or question-answering site, then link to that answer. A pure language model can’t do this, but a more general model could.

This reminds me of a similar example used by Gary Marcus of how we might start with some untrue proposition or fake news (e.g., Mayim Bialik is selling CBD gummies) and suddenly have a whole bunch of websites on this topic. Though he seemed to be talking about humans employing LLMs to generate bullshit web copy. Steinhardt also argues that we might expect deception to emerge very quickly (think phase transition), as suddenly a model achieves high enough performance by deceiving all the time that those heuristics dominate over the more truthful strategies. 

The second part of the post on emergent optimization argues that as systems increase in optimization power—i.e., as they consider a larger and more diverse space of possible policies to achieve some goal—they become more likely to hack their reward functions. E.g., a model might realize your long-term goals (say, lots of money and lots of contentment) are hard to achieve, and so instead it resorts to trying to change how you appraise one of those things over time. The fact that planning capabilities can emerge in deep models even when they are given a short-term objective (like predicting the next token in some string of text) and that we should expect planning to drive down training loss (because humans do a lot of planning and human-like behavior is the goal) means we should be prepared for reward hacking to emerge.

From a personal perspective, the more time I spend trying out these models, and the more I talk to people working on them, the more I think being in NLP right now is sort of a double-edged sword. The world is marveling at how much these models can do, and the momentum is incredible, but it also seems that on a nearly daily basis we have new non-human-like (or perhaps worse, human-like but non-desirable) behaviors getting classified and becoming targets for research. So you can jump into the big whack-a-mole game, and it will probably keep you busy for a while, but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches. Though I guess anyone who is watching curiously what’s going on in NLP is in the same boat. It really is kind of uncomfortable.

This is not to say though that there aren’t plenty of NLP researchers thinking about LLMs with a relatively clear sense of direction and vision – there certainly are. But I’ve also met researchers who seem all in but without being able to talk very convincingly about where they see it all going. Anyway, I’m not informed enough about LLMs to evaluate Steinhardt’s predictions but I like that some people are making thoughtful arguments about what we might expect to see.


P.S. I wrote “but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches” but it occurs to me now that it’s not really clear to me what I’m waiting for to determine “how far we can go.” Do deep models really need to perfectly emulate humans in every way we can conceive of for these approaches to be considered successful? It’s interesting to me that despite all the impressive things LLMs can do right now, there is this tendency (at least for me) to talk about them as if we need to withhold judgment for now. 

Don’t trust people selling their methods: The importance of external validation. (Sepsis edition)

This one’s not about Gladwell; it’s about sepsis.

John Williams points to this article, “The Epic Sepsis Model Falls Short—The Importance of External Validation,” by Anand Habib, Anthony Lin, and Richard Grant, who report that a proprietary model used to predict sepsis in hospital patients doesn’t work very well.

That’s to be expected, I guess. But it’s worth the reminder, given all the prediction tools out there that people are selling.

Alison Bechdel vs. Laura Ingalls Wilder; Cleary advances

Laura was a pioneer who drove a carriage in the snow—it was so cold she had to stop to snap the frozen breath off the horse’s nose so it could breathe! But Alison’s no slouch herself: we’ve heard from her latest book that she’s in excellent shape and is obsessed with workouts. So either of these two ladies could show us a thing or two about fitness. The question is, who’d be the better seminar speaker? Uncork your clever arguments, please!

Past matchup

Raghu brings on the stats:

Unlike Beverly Cleary, Leona Helmsley wrote nothing I would want to read. Quickly looking at snippets of books *about* her, none of them seem like anything I want to read, either.

“Palace Coup: The Inside Story of Harry and Leona Helmsley” gets 3.6 on Goodreads, which is basically 0 given the scale of scores there, and not even the 2 reviews are interesting.

“The Helmsleys: The Rise and Fall of Harry and Leona Helmsley” gets 3.0.

There’s a book by the guy who administered her philanthropic trust, and from the preview on Google Books, it looks excruciatingly dull and poorly written.

Extra credit for adjusting the raw numbers. One of the three central tasks of statistics is generalizing from observed data to underlying constructs of interest.

Anon writes:

According to Wikipedia “Alan Dershowitz, while having breakfast with her [Helmsley] at one of the Helmsley hotels, received a cup of tea with a tiny bit of water spilled on the saucer. Helmsley grabbed the cup from the waiter and smashed it on the floor, then told him to beg for his job.”

I think we can all appreciate someone who would tell Dershowitz to beg for his job. But Raghu counters with:

You’re saying that with Helmsley, the pre-seminar coffee will be ruined by a temper tantrum? I don’t know about Columbia, but our campus catering wouldn’t stand for such abuse, and then I wouldn’t get any coffee, and then I would leave without attending the talk.

We wouldn’t want the seminar to happen without Raghu in the audience, so Bev it is. We’ll see how she fares against Shakira in the semis.

Again, here are the announcement and the rules.

Is there a Bayesian justification for the Murdaugh verdict?

Jonathan Falk writes:

I know you’re much more a computational Bayesian than a philosophical Bayesian, and I assume you were as ignorant of the national phenomenon of the Murdaugh trial as I was, but I just don’t quite get it.

Assume the following facts are true:
(1) Two people were murdered: a mother and son
(2) The husband and father, after having denied he was anywhere near the scene, was forced by incontrovertible evidence into admitting he was there shortly before they were killed.
(3) He is/was a drug addict and embezzler.
(4) There is literally no other evidence connecting him to the crime.

Starting from a presumption of innocence (not sure what that means exactly in probabilistic terms, but where p is the prior probability of guilt, p<<.5) how do you as a Bayesian get from (1)-(4) combined with the prior to a posterior of "beyond reasonable doubt?" (Again, leaving precise calibration aside, surely p>>.9)

People lie all the time, and most drug addicts and embezzlers are not murderers, and while proximity to a murder scene (particularly one in secluded private property) is pretty good evidence, it’s not usually enough to convict anyone without some other pretty good evidence, like a murder weapon, or an obvious motive. I’d be willing to countenance a model with a posterior in the neighborhood of maybe 0.7, but it’s not clear to me how a committed Bayesian proceeds in a case like this and finds the defendant guilty.


He’s got a good question. I have two answers:

1. The availability heuristic. It’s easy to picture Murdaugh being the killer and no good alternative explanations were on offer.

2. Decision analysis. Falk is framing the problem in probabilistic terms: what sort of evidence would it take to shift the probability from much less than 50% to well over 90%, and I see his point that some strong evidence would be necessary, maybe much more than was presented at the trial. But I’m thinking the more relevant framing for the jury is: What should they decide?

Suppose the two options are find Murdaugh guilty or not guilty of first-degree murder. Picture yourself on the jury, making this decision, and consider the 2 x 2 matrix of outcomes under each option:

– Truly guilty, Found not guilty by the court: Evil man gets away with one of the worst crimes you could imagine.

– Truly guilty, Found guilty by the court: Evil man is punished, justice is done. A win.

– Truly not guilty, Found not guilty by the court: Avoided a mistake. Whew!

– Truly not guilty, Found guilty by the court: Lying, creepy-ass, suspicious-acting drug addict and embezzler didn’t actually kill his wife and kid—at least, not quite in the way that was charged. But he goes away for life anyway. No great loss to society here.

The point is that the circumstances of the crime and the jury’s general impression of the defendant are relevant to the decision. The assessed probability that he actually did the crime is relevant, but there’s not any kind of direct relation between the probability and the decision of how to vote in the jury. If you think the defendant is a bad enough guy, then you don’t really need to care so much about false positives.

That said, this will vary by juror, and some of them might be sticklers for the “beyond reasonable doubt” thing. From my perspective, I see the logic of the decision-analysis perspective whereby a juror can be fully Bayesian, estimate the probability of guilt at 0.7 or whatever, and still vote to convict.
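Both framings can be made concrete with a few lines of Python. This is my own illustrative sketch, not anything from the trial: the likelihood-ratio arithmetic behind Falk’s point, plus a toy loss table caricaturing the four outcomes above.

```python
# Two ways to formalize the juror's problem. All numbers here are
# illustrative, not from the trial.

def bayes_factor_needed(prior_p, posterior_p):
    """Likelihood ratio required to move a prior probability of guilt
    to a given posterior probability."""
    prior_odds = prior_p / (1 - prior_p)
    posterior_odds = posterior_p / (1 - posterior_p)
    return posterior_odds / prior_odds

# Falk's point: going from p << .5 to p >> .9 takes strong evidence.
# E.g., prior 0.1 -> posterior 0.9 requires a likelihood ratio of about 81.
print(bayes_factor_needed(0.1, 0.9))

# The decision-analysis point: losses for the four cells of the 2x2 matrix,
# caricaturing the outcomes described above (units are arbitrary).
loss = {
    ("guilty", "acquit"): 10.0,   # evil man gets away with it
    ("guilty", "convict"): 0.0,   # justice done
    ("innocent", "acquit"): 0.0,  # mistake avoided
    ("innocent", "convict"): 1.0, # "no great loss to society," per the argument
}

def expected_loss(p_guilty, verdict):
    return (p_guilty * loss[("guilty", verdict)]
            + (1 - p_guilty) * loss[("innocent", verdict)])

# A juror at p = 0.7 still convicts, because 0.7*0 + 0.3*1 < 0.7*10 + 0.3*0.
p = 0.7
print(expected_loss(p, "acquit"), expected_loss(p, "convict"))
```

The point isn’t these particular numbers but that the convict/acquit comparison is driven as much by the ratio of perceived losses as by the probability itself.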

P.S. For those commenters who haven’t heard of the Murdaugh trial . . . You can just google it! It’s quite a story.

Beverly Cleary (2) vs. Leona Helmsley; Shakira advances

This one satisfies the Bechdel test: two women competing without reference to a man. Sure, there’s Henry Huggins and Harry Helmsley lurking in the background, but let’s face it: Henry was outshined by his friend Beezus and her little sister Ramona, and Harry was a pale shadow of his publicity-grabbing wife. The women are where it’s at here.

As to the particular exemplars of girl-power on offer here: what would you like to hear in your seminar? A good story or some solid tax-cheating tips? One of these could be entertaining and one could be useful, right? Which would you prefer, cosponsorship by the English Department or the Business School?

Past matchup

Dzhaughn offers a positive case for the Chevalier:

I can tell you that over the last few months Peru has been a cauldron of unrest: a flimsy coup attempt by a populist leftish president with a 20% approval rating, who was then impeached; the promoted vice president betrayed her supporters; more than 60 protesters killed by police (and apparently 0 police interviewed by investigators about this); airports invaded and closed; practically all roads throughout the south closed in multiple places for weeks due to political protests, except on weekends and a couple weeks around Christmas and New Years, because one has got to sell the crops. And party.

Nevertheless, Shakira (and Pique) still seemed to be at least the #3 story on TV news. Beat that.

Meanwhile, Anon slams both contestants:

There’s been some controversy recently over some of Dahl’s more antisemitic comments, and I don’t feel that it would be proper to champion an antisemitic person in the competition like this. However, Shakira committed tax fraud, and that is much more evil.

Agreed on the evilness of tax fraud, an insidious crime that tears apart civil society.

Here’s my problem with Dahl. Every time he comes up, the discussion is always some variant of “He’s a nasty guy” or “He’s getting banned.” I’m sick of hearing about mean people and I’m sick of hearing about people getting banned, so we’ve come to the end of the line for this particular storyteller and war hero. Shakira to the semifinals!

Again, here are the announcement and the rules.

Round 4 has arrived! Shakira vs. Roald Dahl; Li advances

To start our fourth round, we have an alleged tax cheat vs. . . . I have no idea if Roald Dahl cheated on his taxes, but it wouldn’t shock me if he had!

To get to this point, Shakira defeated Michael “Douglas” Keaton, A. A. “could be Person Known by Initials or Children’s Book Author” Milne, and Gary “D & D” Gygax. Meanwhile, Dahl prevailed over Jane “Traitor” Fonda, Ethel “Traitor” Rosenberg, and Henry “as far as I know, not a traitor” Winkler.

I guess what I’m trying to say is that neither of today’s contestants has been strongly tested yet. It’s been a pretty easy ride for both of them, as they managed to dodge pre-tournament favorites such as J. R. R. Tolkien, James Naismith, and Miles Davis. (Michael Keaton was my personal dark-horse favorite, but I recognized he was a dark horse.)

Looking at the 8 remaining candidates in the bracket, we have 4 children’s book authors, 2 alleged tax cheats, 1 creator of laws or rules, and 1 duplicate name. All the traitors, duplicate names, and namesakes have disappeared. All the cool people have gone too! Who’d’ve thought? Maybe that’s what happens when the decisions are made by nerds.

Past matchup

Isaac vs. Wenwen turned out to motivate an awesome old-school blog-comment thread on body mass, scaling, and weightlifting. Physics and sports: always a great combination!

Raghu kicked off the discussion:

I vaguely remember that McMahon and Bonner’s excellent “On Size and Life” had a graph of the weightlifting world record for various weight divisions, which scaled as body weight ^ 2/3 or 3/4 or something like that. (I’m at a conference and so can’t look at my copy today.) Anyway, it leads to the interesting idea that what we care about shouldn’t be absolute performance, or the best in crude weight bins, but performance relative to the background physiological scaling law. How good are you *compared to* Mass^2/3 (or whatever)? So, at Li Wenwen’s seminar we can all try this out.
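Raghu’s idea is easy to try. Here’s a minimal sketch, scoring lifters relative to a bodyweight^(2/3) baseline; the lifters and totals are made up for illustration, not McMahon and Bonner’s actual data.

```python
# Score lifters relative to the allometric baseline bodyweight**(2/3),
# rather than by raw totals or crude weight classes. Lifters and totals
# below are invented for illustration.

def allometric_score(total_kg, bodyweight_kg, exponent=2/3):
    """Total lifted divided by the physiological scaling baseline."""
    return total_kg / bodyweight_kg ** exponent

# A 60 kg lifter totaling 200 kg vs. a 110 kg lifter totaling 280 kg:
light = allometric_score(200, 60)
heavy = allometric_score(280, 110)
print(round(light, 2), round(heavy, 2))
# The heavier lifter moves more absolute weight, but the lighter lifter
# scores higher once the mass^(2/3) baseline is divided out.
```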

And it went on from there. All about gravity and weightlifting.

But who should it be—Isaac or Wenwen? Robin writes:

But which speaker has the highest BMI?

To which David replies:

That’s easy to determine. Since dissatisfaction with and/or ranting about BMI is strongly correlated with BMI, just ask them whether or not they think BMI is a useful measurement.

Newton was a ranter, but I think it’s safe to assume that Li, as a lifter, will be more aware of weight. So she advances.

Again, here are the announcement and the rules.

Count the living or the dead?

Martin Modrák writes:

Anders Huitfeldt et al. recently published a cool preprint that follows up on some quite old work and discusses when we should report/focus on ratio of odds for a death/event and when we should focus on ratios of survival/non-event odds.

The preprint is accompanied by a site providing a short description of the main ideas:

The key bit:

When an intervention reduces the risk of an outcome, the effect should be summarized using the standard risk ratio (which “counts the dead”, i.e. considers the relative probability of the outcome event), whereas when the intervention increases risk, the effect should instead be summarized using the survival ratio (which “counts the living”, i.e. considers the relative probability of the complement of the outcome event).

I took a look and was confused. I was not understanding the article, so I went to the example on pages 15-16, and I don’t get that either. They’re saying there was an estimate of relative risk of 3.2, and they’re saying the relative risk for this patient should be 1.00027. Those numbers are so different! Does this really make sense? I get that the 3.2 is from a multiplicative model and the 1.00027 is from an additive model, but they’re still so different.

There’s also the theoretical concern that you won’t always know ahead of time (or even after you see the data) if the treatment increases or decreases risk, and it seems strange to have these three different models floating around.

In response to my questions, Martin elaborated:

A motivating use case is transferring effect estimates from a study to new patients/populations: A study finds that a drug (while overall beneficial) has some adverse effects – let’s say that in the control group 1% of patients had a thrombotic event (blood clot) and in the treatment group it was 2%. Now we are considering giving the drug to a patient we believe already has an elevated baseline risk of thrombosis – say 5%. What is their risk of thrombosis if they take the drug? Here, the choice of effect summary will matter:

1) The risk ratio for thrombosis from the study is 2, so we could conclude that our patient will have 10% risk.

2) The risk ratio for _not_ having a thrombosis is 0.98/0.99 = 0.989899, so we could conclude that our patient will have 95% * 0.989899 ~= 94% risk of _not_ having a thrombosis and thus 6% risk of thrombosis.

3) The odds ratio for thrombosis is ~2.02, the baseline odds of our patient is ~0.053, so the predicted odds is ~0.106 and the predicted risk for thrombosis is 9.6%.

So at least cases 1) and 2) could lead to quite different clinical recommendations.
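Martin’s three calculations can be checked in a few lines (same numbers as in his example):

```python
# Transporting the adverse-effect estimate three ways: study control risk 1%,
# study treatment risk 2%, new patient with 5% baseline risk.

p_control, p_treat = 0.01, 0.02
baseline = 0.05

def odds(p):
    return p / (1 - p)

# 1) Risk ratio ("counting the dead"): 2.0, so predicted risk 10%.
rr = p_treat / p_control
risk_via_rr = baseline * rr

# 2) Survival ratio ("counting the living"): 0.98/0.99, so ~6% risk.
sr = (1 - p_treat) / (1 - p_control)
risk_via_sr = 1 - (1 - baseline) * sr

# 3) Odds ratio: ~2.02, so ~9.6% risk.
or_ = odds(p_treat) / odds(p_control)
pred_odds = odds(baseline) * or_
risk_via_or = pred_odds / (1 + pred_odds)

print(risk_via_rr, risk_via_sr, risk_via_or)  # about 0.100, 0.060, 0.096
```

The three summaries agree on what happened in the study population; they diverge only when transported to the higher-risk patient, which is exactly the point.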

The question is: which effect summaries (or covariate-dependent effect summaries) are most likely to be stable across populations and thus allow us to easily apply the results to a new patient. The preprint “Shall we count the living or the dead?” by Huitfeldt et al. argues that, under assumptions that plausibly hold at least approximately in many cases where we study adverse effects, the risk ratio of _not_ having the outcome (i.e. “counting the living”) requires few covariates to be stable. A similar line of argument then implies that, at least in some scenarios where we study direct beneficial effects of a drug, the risk ratio of the outcome (i.e. “counting the dead”) is likely to be approximately stable with few covariates. The odds ratio is stable only when we in fact condition on all covariates that cause the outcome – in which case all other effect summaries are also stable.

The authors frame the logic in terms of a fully deterministic model where we enumerate the proportion of patients having underlying conditions that either 100% cause the effect regardless of treatment, or 100% cause the effect only in presence/absence of treatment, so the risk is fully determined by the prevalence of various types of conditions in the population.

The assumptions under which the risk ratio of _not_ having the outcome (“counting the living”) is stable are:

1) There are no (or very rare) conditions that cause the outcome _only_ in the absence of the treatment (in our example: the drug has no mechanism which could prevent blood clots in people already susceptible to blood clots).

2) The presence of conditions that cause the outcome irrespective of treatment is independent of the presence of conditions that cause the outcome only in the presence of treatment (in our example: if a specific genetic mutation interacts with the drug to cause blood clots, the presence of the mutation is independent of an unhealthy lifestyle that could cause blood clots on its own). If I understand this correctly, this can only approximately hold if the outcome is rare – if a population has a high prevalence of independent causes, it has to have fewer treatment-dependent causes simply because the chance of the outcome cannot be more than 100%.

3) We have good predictors for all of the conditions that cause the outcome only when combined with the treatment AND that differ between study population and target population, and include those predictors in our model (in our example: variables that reflect blood coagulation likely need to be included, as the drug may push high coagulation “over the edge” and coagulation is likely to differ between populations; OTOH if a specific genetic mutation interacts with the drug, we need to include it only if the genetic background of the target population differs from the study population).

The benefit then is that if we make those assumptions, we can avoid modeling a large chunk of the causal structure of the problem – if we can model the causal structure fully, it doesn’t really matter how we summarise the effects.

The assumptions are quite strong, but the authors IMHO reasonably claim that they may approximately hold for real use cases (and can at least sometimes be empirically tested). One case they give is vaccination:

The Pfizer Covid vaccine has been reported to be associated with a risk ratio of 3.2 for myocarditis (a quite serious problem). So for a patient with a 1% baseline risk of myocarditis (this would be quite high), if the risk ratio were stable, we could conclude that the patient would have a 3.2% risk after vaccination. However, the risk ratio for not having myocarditis is 0.999973 and, assuming this is stable, it results in predicting a 1.0027% risk after vaccination. The argument is that the latter is more plausible, as the assumptions for stability of the risk ratio of not having the event could approximately hold.

Another way to think about this: the reasons a person may be prone to myocarditis (e.g. history of HIV) aren’t really made worse by vaccination – the vaccination only causes myocarditis due to very rare underlying conditions that mostly don’t manifest otherwise, so people already at high baseline risk are not affected more than people at low baseline risk.
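The same arithmetic for the vaccine example; the survival ratio 0.999973 is the figure quoted above, and treating it (rather than the risk ratio) as stable is exactly the assumption under discussion.

```python
# Vaccine example: reported risk ratio 3.2 for myocarditis, and a
# hypothetical patient with a (high) 1% baseline risk.

rr = 3.2
baseline = 0.01

# If the risk ratio were stable: 1% * 3.2 = 3.2% predicted risk.
risk_if_rr_stable = baseline * rr

# If instead the survival ratio ("counting the living") is stable.
# 0.999973 is the figure quoted above; it implicitly reflects the very
# low baseline risk of myocarditis in the study population.
sr = 0.999973
risk_if_sr_stable = 1 - (1 - baseline) * sr

print(risk_if_rr_stable, risk_if_sr_stable)  # about 0.032 vs. about 0.010027
```

This is the gap between the 3.2% and 1.0027% figures in the quote: the two summaries make nearly identical predictions at the study’s tiny baseline risk, but transported to a 1%-baseline patient they diverge by a factor of three.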

Complementarily, risk ratio of the outcome (counting the dead) is stable when:

1) There are no (or very rare) conditions that cause the outcome only in the presence of the treatment (i.e. the treatment does not directly harm anybody w.r.t the outcome).

2) The presence of conditions that _prevent_ the outcome regardless of treatment is independent of the presence of conditions that prevent the outcome only in the presence of treatment.

3) We have good predictors for all of the conditions that prevent the outcome only when combined with the treatment AND that differ between study population and target population and include those predictors in our model.

This could plausibly be the case for drugs where we have a good idea how they prevent the specific outcome (say, an antibiotic that prevents infection unless the pathogen has resistance). Notably, those assumptions are unlikely to hold for outcomes like “all-cause mortality,” so the title of the preprint might be a bit of a misnomer.

The preprint doesn’t really consider uncertainty, but in my reading, the reasoning should apply almost identically under uncertainty.

There’s also an interesting historical outlook, as the idea can be traced back to a 1958 paper by Mindel C. Sheps which was ignored, but similar reasoning was then rediscovered on a bunch of occasions. For rare outcomes the logic also maps to focusing on “relative benefits and absolute harms,” as is often considered good practice in medicine.

One thing I also find interesting here is the connection between data summaries and modeling. In some abstract sense, the way you decide to summarize your data is a separate question from how you will model the data and underlying phenomenon of interest. But in practice they go together: different data summaries suggest different sorts of models.

Isaac Newton vs. Li Wenwen; Sendak advances

Today’s contestants are powerhouses! Isaac “creator of laws or rules” Newton has already defeated the #2 hitter of all time and a co-creator of the more-relevant-than-ever Three Laws of Robotics. Meanwhile Li “duplicate name” Wenwen powerlifted her way past two of the greatest musicians of our time.

What happens when the smartest man who’s ever lived meets the strongest woman ever? It’s up to you to tell us!

It would be great to invite both of them to speak, but that’s not an option. We only have the budget for one speaker in this seminar, and Newton was the director of the Royal Mint or something like that, so he’s not gonna let us get away with paying in IOUs.

Past matchup

Dzhaughn writes:

Doesn’t “Goodnight Moon” need a comma or three? What does it *mean* when she leaves out those commas? With “Brown, Margaret Wise” vs. “Brown Margaret Wise” we have a paternalistic emphasis on the patronymic in the former against a strong suggestion of cannibalism in the latter. And, really, where do we stand on cannibalism? Soylent Green is People or the Authentic Paleo Diet? I’ll pass on both.

There also was the time I bought lots of peanut butter cups for Halloween and there weren’t many trick-or-treaters. Maurices, Mo’ Problems. But overall a sweet memory.

This doesn’t answer who we should pick, but, wow, so much happening in such a small space! Actually, I can make it even smaller:

This kind of free association is what blog commenting is all about. Really this one comment makes the whole seminar competition worth it.

But we still need to decide who advances, and for that we turn to the concise summary from Anonymous Pigeon:

“Where the Wild Things Are” is great but a bit of a boo-hoo ending. The ending of it-never-happened, or it-was-all-a-dream or whatnot, makes everything that happened so worthless. If Maurice Sendak has a poor ending to offer us, I would go with Margaret Wise Brown, because her book “Goodnight Moon” ends in sleep, just as we should after a long seminar.

A seminar ending in sleep . . . all too accurate! But that’s not what we want. We want to stay awake, so the Wild Thing it is.

Again, here are the announcement and the rules.

Maurice Sendak vs. Margaret Wise Brown; Wilder advances

Children’s book authors have been doing well in the tournament, and here are two more. Don’t let the runaway bunny in the night kitchen, or the wild things might eat it up! I have no idea how either of these authors would be as a speaker, so please share your thoughts!

Past matchup

Jonathan writes:

It is not well enough known that Laura Ingalls Wilder and Malcolm X used to date. Yes, she was almost 60 years older than Malcolm, but that didn’t matter a bit to either of them. And why should it? They had managed to surpass the racial divide and the urban divide and they bonded over their early lives of struggle and resilience. Each of them wrote controversial autobiographical works with pronounced racial issues and with controversies over authorship. When Laura’s husband Almanzo died in 1949, she sought out the then-24-year-old Malcolm in prison. An intense relationship started, not consummated until Malcolm’s release from prison in 1952. Each of them sought comfort with the other, and they died only eight years apart, meeting whenever they could for the sort of love that only they understood.

I clearly need a seminar devoted to plausible fan fiction. Since Laura wrote more of her own autobiography than Malcolm did, I’m going with the writer.

Maybe the competition should be Alex Haley vs. Rose Wilder Lane. Hey, that’s a good idea for a category—ghostwriters!

Nick follows up:

Strangely enough, Laura Ingalls Wilder’s daughter, Rose Wilder Lane, was both a noted libertarian and enamoured of what she saw as the radical individualism and liberty in Islam. She wrote a book, “The Discovery of Freedom: Man’s Struggle Against Authority,” in 1943, which has a chapter that has been republished as a book, “Islam and the Discovery of Freedom.”

With relatively scant literature available on Islam in the English language in the mid-1900s, it is likely Malcolm X knew it.

+1 for Laura.

But the decider comes from this exchange:

Fishbone: Did Laura Ingalls Wilder ever have her photo taken while holding an M1 in one hand while peeking out the window at those who might wish to bring harm to him and his family? I think not. Game, set, match to Mr. el-Shabazz.

Dzhaughn: Pa would take him out pronto. Not saying that’s justified, but . . . Michael Landon.

I never watched the TV show so when I think of Pa, I don’t think of Landon, I think of the books, or maybe some of the illustrations in the books. Pa did seem pretty tough.

So Laura it is.

Again, here are the announcement and the rules.

Malcolm X vs. Laura Ingalls Wilder; Bechdel advances

Somber suits or hoop skirts? Nation of Islam or Christmas trees? The city versus the country? It’s your choice! Malcolm gave his last speech near Columbia University. I have no idea if Laura ever came to New York (maybe to visit her daughter?). Make of that what you will.

Past matchup

Manuel writes:

I want Mrs Peel to succeed for a selfish reason. As she will be sipping champagne while giving the talk, I hope the organization will go for a caviar seminar instead of the usual pizza seminar. I confess that I don’t know how stylish Bechdel is with champagne, so maybe I’m rooting for the wrong person. As for the Britishness, I dunno, all Anglo-Saxons look the same to me.

In my experience, alcohol just makes any seminar worse. So I’m much more attracted to Jonathan’s argument:

It’s one thing (actually two things, I suppose) to beat Willie Nelson and Hammurabi. After all, they’re men, and beating men at their own games has been a Bechdel specialty. But now she faces a double-X chromosome, and what a phenotype. So we now have two women and their conflict has no intervening man. By making it to this match, Bechdel has won and can now modestly withdraw from the contest. Who would you rather hear about Hollywood sexism from? Men know the answer. (Uh-oh… maybe that means it should be Bechdel after all.)

The result is ambiguous, not favoring either candidate. But at this point we’re playing on Bechdel’s turf of sexism and gender identity, so she’ll be the one to move on and face Malcolm or Laura in the next round.

Again, here are the announcement and the rules.

Reconciling evaluations of the Millennium Villages Project

Shira Mitchell, Jeff Sachs, Sonia Sachs, and I write:

The Millennium Villages Project was an integrated rural development program carried out for a decade in 10 clusters of villages in sub-Saharan Africa starting in 2005, and in a few other sites for shorter durations. An evaluation of the 10 main sites compared to retrospectively chosen control sites estimated positive effects on a range of economic, social, and health outcomes (Mitchell et al. 2018). More recently, an outside group performed a prospective controlled (but also nonrandomized) evaluation of one of the shorter-duration sites and reported smaller or null results (Masset et al. 2020). Although these two conclusions seem contradictory, the differences can be explained by the fact that Mitchell et al. studied 10 sites where the project was implemented for 10 years, and Masset et al. studied one site with a program lasting less than 5 years, as well as differences in inference and framing. Insights from both evaluations should be valuable in considering future development efforts of this sort. Both studies are consistent with a larger picture of positive average impacts (compared to untreated villages) across a broad range of outcomes, but with effects varying across sites or requiring an adequate duration for impacts to be manifested.

I like this paper because we put a real effort into understanding why two different attacks on the same problem reached such different conclusions. A challenge here was that one of the approaches being compared was our own! It’s hard to be objective about your own work, but we tried our best to step back and compare the approaches without taking sides.

Some background is here:

From 2015: Evaluating the Millennium Villages Project

From 2018: The Millennium Villages Project: a retrospective, observational, endline evaluation

Full credit to Shira for pushing all this through.

Stochastic terrorism

Paul Alper points to this post by Lee Moran:

Alexandria Ocasio-Cortez Rips Tucker Carlson Audience Over “Death Threats” . . . “Every time that dude puts my name in his mouth, the next day, I mean, this is like what stochastic terrorism is,” the New York Democrat said on Tuesday’s episode of “The Breakfast Club” radio show.

Alper clarifies:

Not that I fully understand the term, “stochastic terrorism,” but here is what she means:

“It’s like when . . . you use a very large platform to turn up the temperature and target an individual until something happens,” she continued. “And then when something happens, because it’s indirect, you say, ‘Oh, I had nothing to do with that.’”

Googling turned up this definition from wikipedia:

Since 2018, the term “stochastic terrorism” has become a popular term used when discussing lone wolf attacks. While the exact definition has morphed over time, it has commonly come to refer to a concept whereby consistently demonizing or dehumanizing a targeted group or individual results in violence that is statistically likely, but cannot be easily accurately predicted.

The term was initially used to suggest that a quantifiable relationship may exist between seemingly random acts of terror and their intended goal of “perpetuating a reign of fear” via a manipulation of mass media and its capacity for “instant global news communication”. For example, careful timing and placement of just a few moderately explosive devices could have the same intended effect as numerous random attacks or the use of more powerful explosives if they were shrewdly devised to elicit the maximum response from media organizations. . . .

A variation of this stochastic terrorism model was later adapted by an anonymous blogger posting on Daily Kos in 2011 to describe public speech that can be expected to incite terrorism without a direct organizational link between the inciter and the perpetrator. The term “stochastic” is used in this instance to describe the random, probabilistic nature of its effect; whether or not an attack actually takes place. The stochastic terrorist in this context does not direct the actions of any particular individual or members of a group. Rather, the stochastic terrorist gives voice to a specific ideology via mass media with the aim of optimizing its dissemination.

It is in this manner that the stochastic terrorist is thought to randomly incite individuals predisposed to acts of violence. Because stochastic terrorists do not target and incite individual perpetrators of terror with their message, the perpetrator may be labeled a lone wolf by law enforcement, while the inciters avoid legal culpability and public scrutiny.

The wikipedia article continues:

In their 2017 book Age of Lone Wolf Terrorism, criminologist Mark S. Hamm and sociologist Ramón Spaaij discuss stochastic terrorism as a form of “indirect enabling” of terrorists. They write that “stochastic terrorism is the method of international recruitment used by ISIS”, and they refer to Anwar al-Awlaki and Alex Jones as stochastic terrorists.

I’d never heard about this before, but, yeah, it all makes sense. Back in the time of the World Trade Center bombings, people talked about the idea that a small amount of terrorism can still cause general terror because of a general uncertainty about where it might happen next. There are two ways that this new idea of stochastic terrorism goes further. The first is that, instead of some existing group such as ISIS planning terror but you don’t know where it will happen, it’s a nebulous group of potential terrorists, and the uncertainty is who will be the extremist who does the terrorist act. The other thing is this idea of deniability or indirect action, so that someone like Alex Jones can spread lies and violent rhetoric without knowing exactly who among his listeners will personally harass the people he’s talking about. “Stochastic terrorism” seems like a useful term, in that a key part of what makes it effective is the unpredictability of where it’s coming from.

Alison Bechdel vs. Diana Rigg; Helmsley advances

The fourth-seeded Creator of Laws or Rules is up against the third-seeded Cool Person. This matchup should satisfy the Bechdel Rule. Whether you want Mrs Peel to succeed will depend a bit on what you think about the British Empire/Commonwealth.

What’s it gonna be, the Avengers or the Secret to Superhuman Strength?

I’ll be going on vacation for a few days, so post your arguments and we’ll move on to the next bout at the end of the week.

Past matchup

Siobhan writes:

Cleary and Dylan have to progress so that Dylan can sing ‘To Ramona’ to Beverley Cleary during their round.

And she should win! Klickitat Street to beat Highway 61, hands down!

People who weren’t bookish kids generally have little idea of the scale of the universes of girls’ novels – how much time we spend in them, and how they become part of our minds, even when we grow out of them. Music has visibility and prestige, which children’s literature mostly doesn’t. But in terms of cultural socialization, children’s literature is hugely significant.

It’s my impression that music takes up a smaller part of the cultural space now than it did 40 or 50 or 60 years ago. I don’t really know why this should be, and it could just be that my impressions are distorted by this being a different stage in my own life . . . I dunno. We’ve discussed this before. I guess the question is how best to measure it.

In any case, Siobhan’s argument is compelling, but I don’t like the idea of advancing Dylan just for the purpose of making him lose in the next round. Bob may be friends with Patti but he’s no patsy.

So it’ll be Helmsley who will advance to face the Cleary juggernaut in the next round. Who knows, maybe Leona has a killer karaoke rendition of To Ramona up her sleeve!

Again, here are the announcement and the rules.