Ghostwriting by federal judges (a way to violate disclosure rules) as the opposite of plagiarism.

We’re used to seeing plagiarism in public life—Joe Biden’s the most famous culprit, but there are also various German politicians, as well as Harvard law professor and almost-Supreme-Court-judge Laurence Tribe. Some Yale law professors too, I think. And lots of academics, from Weggy on down to various more obscure characters.

Plagiarism is taking someone else’s writing and passing it off as your own. The opposite—taking your own writing and attributing it to someone else—is called ghostwriting.

The most common use of ghostwriting by political figures is when someone is hired to write something that will be published under the politician’s name. Maybe the most famous example here is the book Profiles in Courage, attributed to John Kennedy. A few years later, Barry Goldwater had a couple of ghostwritten books attributed to him. Perhaps the best way of thinking about these examples is that these guys are politicians, not writers, but they’re taking political responsibility for the words that are placed under their names. It’s not really important whether they wrote the particular words; what’s relevant is that they’re standing by these words.

Recently I learned about a different sort of example of political ghostwriting. In this case it was the politician who was the ghostwriter: he did the writing and then attributed it to others.

Paul Campos has the story. It’s about controversial abortion judge Matthew Kacsmaryk:

As a lawyer for a conservative legal group, Matthew Kacsmaryk in early 2017 submitted an article to a Texas law review criticizing Obama-era protections for transgender people and those seeking abortions.

The Obama administration, the draft article argued, had discounted religious physicians who “cannot use their scalpels to make female what God created male” and “cannot use their pens to prescribe or dispense abortifacient drugs designed to kill unborn children.”

But a few months after the piece arrived, an editor at the law journal who had been working with Kacsmaryk received an unusual email: Citing “reasons I may discuss at a later date,” Kacsmaryk, who had originally been listed as the article’s sole author, said he would be removing his name and replacing it with those of two colleagues at his legal group . . .

What Kacsmaryk did not say in the email was that he had already been interviewed for a judgeship by his state’s two senators and was awaiting an interview at the White House.

As part of that process, he was required to list all of his published work on a questionnaire submitted to the Senate Judiciary Committee, including “books, articles, reports, letters to the editor, editorial pieces, or other published material you have written or edited.”

The article, titled “The Jurisprudence of the Body,” was published in September 2017 by the Texas Review of Law and Politics, a right-leaning journal that Kacsmaryk had led as a law student at the University of Texas. But Kacsmaryk’s role in the article was not disclosed . . .

I guess it was no secret that Kacsmaryk was an anti-abortion activist, and in general there’s no principle by which political activists should not become judges, so it seems like overkill for him to have broken the rules by not listing this article he wrote. Seems like an example of someone trying so hard to be careful that he gets himself in trouble. Too clever by half.

But all this made me think of my own writing. Have I ever been a ghostwriter? I don’t think so. Several years ago my work was plagiarized—that is, some of my ideas were published under somebody else’s name—but he didn’t ask me for permission to do this, so it was plagiarism by him, not ghostwriting by me. Other times I’ve written a draft article or book and then approached others to join me as coauthors, but that’s not ghostwriting either, first because I kept my name on the projects and second because the coauthors helped on these projects too, which indeed is why I invited them to join the projects in the first place. There have also been some projects where I’ve helped out but said I’d rather not be included as an author, not out of embarrassment but rather because I didn’t think my contributions to these projects were so large, and I was concerned that if I were included as a coauthor, people would inappropriately give me credit for the work. Finally, there have been some projects where I’ve helped out but I was dissatisfied with the final version of the paper and I didn’t feel like putting in the effort to make it better, so I just asked to be removed from the author list so I would not be taking responsibility for the product.

No actual ghostwriting, though. If someone paid me enough, I’d do it. I’d even prove a theorem for Stephen Wolfram for enough $ . . . unfortunately, I’m pretty sure I’m not a good enough mathematician to prove any of the theorems he’d like to see proven. And I have no plans to interview for any appointive federal positions, so I doubt I’d have any motivation to do a Kacsmaryk and try to cover my tracks on any work I’m legally required to disclose.

P.S. There’s one thing I do sometimes that’s kind of ghostwriter-like, which is to take some phrase that I’ve used (here are some examples) and attribute it to someone else, or to say something like, “As the saying goes . . .” I do this sometimes for amusement and sometimes because I feel that, even though I have no direct quote for the saying, it’s something I’ve heard before, or similar to something I’ve heard before, so I don’t really feel I deserve credit for it.

How much confidence should be placed in market-based predictions?

David Rothschild writes:

Dave Pennock and I (along with Rupert Freeman, Dan Reeves, and Bo Waggoner) wrote a really short paper for a CS conference called “Toward a Theory of Confidence in Market-Based Predictions”, with the goal of opening up further discussion and research. It poses as many questions as it answers, or more, about what a margin of error around a probability could look like.

From the abstract to the article:

Prediction markets are a way to yield probabilistic predictions about future events, theoretically incorporating all available information. In this paper, we focus on the confidence that we should place in the prediction of a market. When should we believe that the market probability meaningfully reflects underlying uncertainty, and when should we not? We discuss two notions of confidence. The first is based on the expected profit that a trader could make from correcting the market if it were wrong, and the second is based on expected market volatility in the future.
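
To make the first notion concrete, here’s a toy calculation (my own simplified setup, not the paper’s formalism): for a binary contract, the expected profit from correcting a mispriced market is just the gap between the market price and the probability you believe.

```python
# Toy illustration of "confidence via expected profit from correcting the market."
# Assumptions are mine, not the paper's: a binary contract paying $1 if the event
# happens, a market price p, and a trader who believes the true probability is q.

def expected_profit_per_contract(p: float, q: float) -> float:
    """Expected profit from trading one contract at price p when you believe
    the event has probability q: buy if q > p, sell (short) if q < p."""
    return abs(q - p)

# If the market says 0.60 but a well-informed trader believes 0.70, the expected
# profit is 0.10 per contract.  The harder that profit is to actually collect
# (fees, risk aversion, capital limits), the less confident we should be that
# the market price has already been "corrected."
print(round(expected_profit_per_contract(p=0.60, q=0.70), 2))  # 0.1
```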

Their article reminded me of this old paper of mine: The boxer, the wrestler, and the coin flip: a paradox of robust Bayesian inference and belief functions. The article is from 2006 but it’s actually based on a discussion I had with another grad student around 1987 or so. But they’ve thought about this stuff in a more sophisticated way than I did.

Also relevant is this recent post, A probability isn’t just a number; it’s part of a network of conditional statements.

How to digest research claims? (1) vitamin D and covid; (2) fish oil and cancer

I happened to receive two emails on the same day on two different topics, both relating to how much to trust claims published in the medical literature.

1. Someone writes:

This is the follow-up publication for the paper that was retracted from preprint servers a few months ago; the language has changed but the results are the same: patients treated with calcifediol had a much lower mortality rate than patients who were not treated:

This follows three other papers on the same therapy which found the same results:

Small pilot RCT
Large propensity matched study
Cohort trial of 574 patients

I continue to be bewildered that this therapy has been ignored given that it’s so safe with such a high upside.

This led me to an interesting question which I thought you may have an answer for: “What are the most costly Type II errors in history?”

2. Someone else writes:

Do you think these two studies are flawed?

Serum Phospholipid Fatty Acids and Prostate Cancer Risk: Results From the Prostate Cancer Prevention Trial
Plasma Phospholipid Fatty Acids and Prostate Cancer Risk in the SELECT Trial

I said that I don’t know; I’ve never heard of this topic before. Why do you think they might be flawed?

And my correspondent replied:

I don’t understand the nested case cohort design but a very senior presenter at our Grand Rounds mentioned the studies were flawed. He didn’t go into the details as his topic was entirely different. I am trying to understand whether fish oil leads to increased risk for prostate cancer. I take fish oil myself but these studies shake my confidence, although they may be flawed studies.

I have no idea what to think about any of these papers. The medical literature is so huge that it often seems hopeless to interpret any single article or even subliterature.

An alternative approach is to look for trusted sources on the internet, but that’s not always so helpful either. For example, when I google *cleveland clinic vitamin d covid*, the first hit is an article, Can Vitamin D Prevent COVID-19?, which sounds relevant but then I notice that the date is 18 May 2020. Lots has been learned about covid since then, no?? I’m not trying to slam the Cleveland Clinic here, just saying that it’s hard to know where to look. I trust my doctor, which is fine, but (a) not everyone has a primary care doctor, and (b) in any case, doctors need to get their information from somewhere too.

I don’t know what is currently considered the best way to summarize the state of medical knowledge on any given topic.

P.S. Just to clarify one point: In the above post I’m not saying that the answers to these medical questions are unknowable, or even that nobody knows the answers. I can well believe there are some people who have a clear sense of what’s going on here. I’m just saying that I have no idea what to think about these papers. So I appreciate the feedback in the comments section.

GiveWell’s Change Our Mind contest, cost-effectiveness, and water quality interventions

Some time ago I wrote about a new meta-analysis pre-print where we estimated that providing safe drinking water led to a 30% mean reduction in deaths in children under 5, based on data from 15 RCTs. Today I want to write about water, but from the perspective of cost-effectiveness analyses (CEA).

A few months ago GiveWell (GW), a major effective altruism charity, hosted a Change Our Mind contest. Its purpose was to critique and improve on GW’s process/recommendations on how to allocate funding. This type of contest is obviously a fantastic idea (if you’re distributing tens of millions of dollars to charitable causes, even a fraction of a percent improvement in the efficiency of your giving is worth paying good money for) and GW also provided pretty generous rewards for the top entries. There were two winners and I think both of them are worth blogging about:

1. Noah Haber’s “GiveWell’s Uncertainty Problem”
2. An examination of cost-effectiveness of water quality interventions by Matthew Romer and Paul Romer Present (MRPRP henceforth)

I will post separately on the uncertainty analysis by Haber sometime soon, but today I want to write a bit on MRPRP’s analysis.

As I wrote last time, back in April 2022 GW recommended a grant of $65 million for clean water, in a “major update” to their earlier assessment. The decision was based on a pretty comprehensive analysis by GW, which estimated the cost-benefit of specific interventions aimed at improving water quality in specific countries.[1] (Scroll down for footnotes. Also, I’m flattered to say that they cited our meta-analysis as a motivation for updating their assessment.) MRPRP re-do GW’s analysis and find effects that are 10-20% smaller in some cases. This is still highly cost-effective, but (per the logic I already mentioned) even small differences in cost-effectiveness will have large real-world implications for funding, given that the funding gap for provision of safe drinking water runs to hundreds of millions of dollars.

However, my intention is not to argue what the right number should be. I’m just wondering about one question these kinds of cost-effectiveness analyses raise, which is how to combine different sources of evidence.

When trying to estimate how clean water reduces mortality in children, we can either look at direct experimental evidence (e.g., our meta-analysis) or go the indirect route: first look at estimates of reductions in disease (diarrhea episodes), then at evidence on how disease links to mortality. The direct approach is the ideal (mortality is the ultimate outcome we care about; it is objectively measured and clearly defined, unlike diarrhea), but deaths are rare. That is why researchers studying water RCTs historically focused on reductions in diarrhea and often chose not to capture/report deaths. So we have many more studies of diarrhea.

Let’s say you go the indirect evidence route. To obtain an estimate, we need to know or make assumptions about (1) the extent of self-reporting bias (e.g., “courtesy” bias), (2) how many diseases can be affected by clean water, and (3) the potentially larger effect of clean water on severe cases (leading to death) than on “any” diarrhea. Each of these is obviously hard. The direct-evidence model (a meta-analysis of deaths) doesn’t require any of these steps.
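
To see how the indirect chain multiplies up, here’s a back-of-the-envelope sketch with made-up numbers (my own simplification of steps (1) to (3); the actual GW and MRPRP models are more elaborate):

```python
# Back-of-the-envelope version of the indirect-evidence chain.  The functional
# form and all numbers are illustrative assumptions, not the GW or MRPRP model.

reported_diarrhea_reduction = 0.30  # pooled RCT estimate of reduction in "any diarrhea"
reporting_bias_adjustment   = 0.80  # step (1): shrink for courtesy / self-reporting bias
share_deaths_water_related  = 0.50  # step (2): fraction of under-5 deaths from diseases
                                    #           plausibly affected by clean water
severity_ratio              = 1.20  # step (3): effect on severe (fatal) cases relative
                                    #           to the effect on "any" episode

indirect_mortality_reduction = (reported_diarrhea_reduction
                                * reporting_bias_adjustment
                                * share_deaths_water_related
                                * severity_ratio)
print(round(indirect_mortality_reduction, 3))  # 0.144, i.e. roughly 14% fewer deaths

# The direct-evidence route skips all three assumptions: it just meta-analyses
# the (rare) deaths observed in the water RCTs.
```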

And once we have the two estimates (indirect and direct), then what? I describe GW’s process in the footnotes (I personally think it’s not great but want to keep this snappy).[2] Suffice it to say that they use the indirect evidence to derive a “plausibility cap,” the maximum size of the effect they are willing to admit into the CEA. MRPRP do it differently, putting distributions on parameters in the direct and indirect models and then running both in Stan to arrive at a combined, inverse-variance-weighted estimate.[3] For example, for point (2) above (which diseases are affected by clean water), they look at a range of scenarios and put a Gaussian distribution with its mean at the most probable scenario and the most optimistic scenario 2 SDs away. MRPRP acknowledge that this is an arbitrary choice.

A priori, a model-averaging approach seems obviously better than taking a model and imposing an arbitrary truncation (as in GW’s old analysis). However, depending on how you weight the direct vs. indirect evidence models, you can get anywhere from a ~50% reduction to a ~40% increase in the estimated benefits compared to GW’s previous analysis; a more extensive numerical example is below.[4] So you want to be very careful in how you weight! E.g., for one of the programs MRPRP’s estimate of benefits is ~20% lower than GW’s, because in their model 3/4 of the weight is put on the (lower-variance) indirect evidence model and it dominates the result.

In the long term the answer is to collect more data on mortality. In the short term probabilistically combining several models makes sense. However, putting 75% weight on a model of indirect evidence rather than the one with a directly measured outcome strikes me as a very strong assumption and the opposite of my intuition. (Maybe I’m biased?) Similarly, why would you use Gaussians as a default model for encoding beliefs (e.g. in share of deaths averted)? I had a look at using different families of distributions in Stan and got to quite different results. (If you want to follow the details, my notes are here.)

More generally, when averaging over two models that are somewhat hard to compare, how should we think about model uncertainty? I think it would be a good idea in principle to penalise both models, because there are many unknown unknowns in water interventions. So they’re both overconfident! But how to make this penalty “fair” across two different types of models, when they vary in complexity and assumptions?

I’ll stop here for now, because this post is already a bit long. Perhaps this will be of interest to some of you.

Footnotes:

[1] There are many benefits of clean water interventions that a decision maker should consider (and the GW/MRPRP analyses do): in addition to reductions in deaths there are also medical costs, developmental effects, and reductions in disease. For this post I am only concerned with how to model reductions in deaths.

[2] GW’s process is, roughly, as follows: (1) Meta-analyse data from mortality studies, take a point estimate, and adjust it for internal and external validity to make it specific to the relevant contexts where they want to consider their program (e.g. baseline mortality, predicted take-up, etc.). (2) Using indirect evidence, hypothesise the maximum plausible impact on mortality (the “plausibility cap”). (3) If the benefits from direct evidence exceed the cap, set the benefits to the cap’s value; otherwise use the direct-evidence estimate.

[3] By the way, as far as I saw, neither model accounts for the fact that some of our evidence on mortality and diarrhea comes from the same sources. This is obviously a problem, but I ignore it here, because it’s not related to the core argument.

[4] To illustrate with numbers, I will use GW’s analysis of Kenya Dispensers for Safe Water (a particular method of chlorination at water source), one of several programs they consider. (The impact of using the MRPRP approach on other programs analysed by GiveWell is much smaller.) In GW’s analysis, the direct evidence model gave a 6.1% mortality reduction, but the plausibility cap was 5.6%, so they set it to 5.6%. Under the MRPRP model, the direct evidence suggests about an 8% reduction, compared to 3.5% in the indirect evidence model. The unweighted mean of the two would be 5.75%, but because of the higher uncertainty on the direct effect the final (inverse-variance weighted) estimate is a 4.6% reduction. That corresponds to putting 3/4 of the weight on indirect evidence. If we applied the “plausibility cap” logic to the MRPRP estimates, rather than weighting the two models, the estimated reduction in mortality for the Kenya DSW program would be 8% rather than 4.6%, a whopping 40% increase on GW’s original estimate.
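
For readers who want to see the weighting mechanics in this footnote, here’s a minimal sketch that reproduces the arithmetic. The point estimates are from the footnote; the standard errors are invented (chosen so the indirect model gets roughly 3/4 of the weight); the actual MRPRP analysis works with full distributions in Stan.

```python
import numpy as np

# Inverse-variance weighting of the two models for the Kenya DSW example above.
# The point estimates come from the footnote; the standard errors are my own
# hypothetical values, picked so the indirect model gets ~3/4 of the weight.

est = np.array([0.080, 0.035])  # direct, indirect estimates of mortality reduction
se  = np.array([0.035, 0.020])  # hypothetical standard errors

w = (1 / se**2) / np.sum(1 / se**2)  # inverse-variance weights
combined = np.sum(w * est)

print(np.round(w, 2))      # ~[0.25, 0.75]: the lower-variance indirect model dominates
print(round(combined, 3))  # ~0.046, i.e. a 4.6% reduction, vs. 8% direct and 3.5% indirect
```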

The behavioral economists’ researcher degree of freedom

A few years ago we talked about the two modes of pop-microeconomics:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-school teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

The trick is knowing whether you’re gonna get 1 or 2 above. They’re complete opposites!

I thought of this when rereading this post from a few years ago, where we quoted Jason Collins, who wrote this about the decades-long complacency of the academic psychology and economics establishment regarding the hot-hand fallacy fallacy:

We have a body of research that suggests that even slight cues in the environment can change our actions. Words associated with old people can slow us down. Images of money can make us selfish. And so on. Yet why haven’t these same researchers been asking why a basketball player would not be influenced by their earlier shots – surely a more salient part of the environment than the word “Florida”? The desire to show one bias allowed them to overlook another.

When writing the post with the above quote, I had been thinking specifically of issues with the hot hand.

Stepping back, I see this as part of the larger picture of researcher degrees of freedom in the fields of social psychology and behavioral economics.

You can apply the “two modes of thinking” idea to the hot hand:

Argument 1 goes like this: Believing in the hot hand sounds silly. But lots of successful players and coaches believe in it. Real money is at stake—this is not cheap talk! So it’s our duty to go beneath the surface and understand why, counterintuitively, belief in the hot hand makes sense, even though it might naively seem like a fallacy. Let’s prove that the pointy-headed professors outsmarted themselves and the blue-collar ordinary-Joe basketball coaches were right all along, following the anti-intellectual mode that was so successfully employed by the Alvin H. Baum Professor of Economics at the University of Chicago (for example, an unnamed academic says something stupid, only to be shot down by regular-guy “Chuck Esposito, a genial, quick-witted and thoroughly sports-fixated man who runs the race and sports book at Caesars Palace in Las Vegas.”)

Argument 2 goes the other way: Everybody thinks there’s a hot hand, but we, the savvy social psychologists and behavioral economists, know that because of evolution our brains make lots of shortcuts. Red Auerbach might think he’s an expert at basketball, but actually some Cornell professors have collected some data and have proved definitively that everything you thought about basketball was wrong.

Argument 1 is the “Econ 101” idea that when people have money on the line, they tend to make smart decisions, and we should be suspicious of academic theories that claim otherwise. Argument 2 is the “scientist as hero” idea that brilliant academics are making major discoveries every day, as reported to you by Ted, NPR, etc.

In the case of the hot hand, the psychology and economics establishment went with Argument 2. I don’t see any prior reason why they’d pick 1 or 2. In this case I think they just made an honest mistake: a team of researchers did a reasonable-seeming analysis and everyone went from there. Following the evidence—that’s a good idea! Indeed, for decades I believed that the hot hand was a fallacy. I believed in it, I talked about it, I used it as an example in class . . . until Josh Miller came to my office and explained to me how so many people, including me, had gotten it wrong.

So my point here is not to criticize economists and psychologists for getting this wrong. The hot hand is subtle, and it’s easy to get this one wrong. What interests me is how they chose—even if the choice was not made consciously—to follow Argument 2 rather than Argument 1 here. You could say the data led them to Argument 2, and that’s fine, but the same apparent strength of data could’ve led them to Argument 1. These are people who promote flat-out ridiculous models of the Argument 1 form such as the claim that “all deaths are to some extent suicides.” Sometimes they have a hard commitment to Argument 1. This time, though, they went with #2, and this time they were the foolish professors who got lost trying to model the real world.

I’m still working my way through the big picture here of trying to understand how Arguments 1 and 2 coexist, and how the psychologists and economists decide which one to go for in any particular example.

Interestingly enough, in the hot-hand example, after the behavioral economists saw their statistical argument overturned, they didn’t flip over to Argument 1 and extol the savvy of practical basketball coaches. Instead they pretty much minimized their error and tried to keep as much of Argument 2 as they could, for example arguing that, ok, maybe there is a hot hand but it’s much less than people think. They seem strongly committed to the idea that basketball players can’t be meaningfully influenced by previous shots, even while also being committed to the idea that words associated with old people can slow us down, images of money can make us selfish, and so on. I’m still chewing on this one.

Beverly Cleary is winner in third iteration of Greatest Seminar Speaker competition

Our third seminar speaker competition has come to an end, with the final round pitting Beverly “Ramona” Cleary against Laura “Ingalls” Wilder.

Before going on, I’d like to say that Alison Bechdel is the “Veronica Geng” of this particular competition, in that, even after she was defeated, she pretty much defined the parameters of the game. We’re still talking about her, whether or not she’s still in the running.

But the current matchup passes the Bechdel Test, so let’s move along.

Raghu writes:

I used my best slogan for Cleary (Cleary for the present!) in her prior round — a calculated risk, since it would have been worthless if Cleary hadn’t advanced. I am reminded of the dilemma of Karna in the Mahabharata, faced with the fearsome half-demon Ghatotkacha on the 15th day of the epic’s climactic war:

“Karna is unable to prevent Ghatotkacha from wreaking havoc on the Kaurava army, and even many of his celestial weapons are rendered useless. As the army breaks around him … Karna uses Vasavi Śhakti as a last resort. This weapon had been bestowed by Indra and could only be used once; Karṇa had been keeping it in reserve to use against Arjuna [his arch rival]. … The Pandavas were filled with grief at Ghatotkacha’s death. Krishna, however, couldn’t help but smile, knowing that Ghatotkacha has saved Arjuna from Karna.”

Cleary, then, is on her own.

Oncodoc leaps into the gap:

Let’s hear from Ms. Wilder. Her opinions without the editing imposed by her daughter might be interesting.

I don’t know about that! Rose was arguably the catalyst that made Laura’s writing so readable.

Dzhaughn riffs:

I remember these two together at a party out in the midwest summer home of William Carlos and Esther Williams. Beverly and John Cleary and the Absolute Monster Gentlemen were jamming, and I was chatting with Billy and Laura Ingalls Wilder and Melissa Gilbert and Brendan Sullivan besides some potted plants when suddently from the north i saw an olive helicopter with Michael Landon land on the roof. Right behind him were Lorne Greene and Zha Zha Gabor, Denis and Doodles Weaver, Spike Jones, Spike Jonze, Spike Lee and Peggy Lee. Paul Lynde right in the middle of everything. Everybody was shouting, Studs Terkel was roasting beef on the bbq with gin, flames everwhere, William Carlos and Esther Williams had to jump in the pool and started swimming laps. Garrison Keillor next door and said he was going to tell the story of this little house A Prairie Home Companion.

OK fine but I have no idea who this would support.

Diana plays it straight:

Cleary, for a simple reason: her books made me laugh many times.
I read all the Little House books as a kid, read some of them multiple times, but I don’t remember being made to laugh by them.
Laughter during a seminar (here and there) is a good thing. Cleary can hold her own on the serious end too.

“She was not a slowpoke grownup. She was a girl who could not wait. Life was so interesting she had to find out what happened next.”
― Beverly Cleary, Ramona the Pest

Lots of arguments here. We’ll have to go to the very first commenter, Anonymous Pigeon, for a weighing of the evidence:

While seminars should be informative, they should also be fun. Laura can tell a good story but looks at it with less of the fun approach of Beverly. 1 point to the creator of Ramona The Pest. Neither one of them is better than the other on the writing scale, at least in my opinion. No points attributed there. So overall, Beverly Cleary should win because she would be the best seminar speaker.

Agreed.

Multilevel modeling to make better decisions using data from schools: How can we do better?

Michael Nelson writes:

I wanted to point out a paper, Stabilizing Subgroup Proficiency Results to Improve the Identification of Low-Performing Schools, by Lauren Forrow, Jennifer Starling, and Brian Gill.

The authors use Mr. P to analyze proficiency scores of students in subgroups (disability, race, FRL, etc.). The paper’s been getting a good amount of attention among my education researcher colleagues. I think this is really cool—it’s the most attention Mr. P’s gotten from ed researchers since your JREE article. This article isn’t peer reviewed, but it’s being seen by far more policymakers than any journal article would.

All the more relevant that the authors’ framing of their results is fishy. They claim that some schools identified as underperforming, based on mean subgroup scores, actually aren’t, because they would’ve gotten higher means if the subgroup n’s weren’t so small. They’re selling the idea that adjustment by poststratification (which they brand as “score stabilization”) may rescue these schools from their “bad luck” with pre-adjustment scores. What they don’t mention is that schools with genuinely underperforming (but small) subgroups could be misclassified as well-performing if they have “good luck” with post-adjustment scores. In fact, they don’t use the word “bias” at all, as in: “Individual means will have less variance but will be biased toward the grand mean.” (I guess that’s implied when they say the adjusted scores are “more stable” rather than “more accurate,” but maybe only to those with technical knowledge.)

And bias matters as much as variance when institutions are making binary decisions based on differences in point estimates around a cutpoint. Obviously, net bias up or down will be 0, in the long run, and over the entire distribution. But bias will always be net positive at the bottom of the distribution, where the cutpoint is likely to be. Besides, relying on net bias and long-run performance to make practical, short-run decisions seems counter to the philosophy I know you share, that we should look at individual differences not averages whenever possible. My fear is that, in practice, Mr. P might be used to ignore or downplay individual differences–not just statistically but literally, given that we’re talking about equity among student subgroups.

To the authors’ credit, they note in their limitations section that they ought to have computed uncertainty intervals. They didn’t, because they didn’t have student-level data, but I think that’s a copout. If, as they note, most of the means that moved from one side of the cutoff to the other are quite near it already, you can easily infer that the change is within a very narrow interval. Also to their credit, they acknowledge that binary choices are bad and nuance is good. But, also to their discredit, the entire premise of their paper is that the education system will, and presumably should, continue using cutpoints for binary decisions on proficiency. (That’s the implication, at least, of the US Dept. of Ed disseminating it.) They could’ve described a nuanced *application* of Mr. P, or illustrated the absurd consequences of using their method within the existing system, but they didn’t.

Anyway, sorry this went so negative, but I think the way Mr. P is marketed to policymakers, and its potential unintended consequences, are important.

Nelson continues:

I’ve been interested in this general method (multilevel regression with poststratification, MRP) for a while, or at least the theory behind it. (I’m not a Bayesian so I’ve never actually used it.)

As I understand it, MRP takes the average over all subgroups (their grand mean) and moves the individual subgroup means toward that grand mean, with smaller subgroups getting moved more. You can see this in the main paper’s graphs, where low means go up and high means go down, especially on the left side (smaller n’s). The grand mean will be more precise and more accurate (due to something called superefficiency), while the individual subgroup means will be much more precise but can also be much more biased toward the grand mean. The rationale for using the biased means is that very small subgroups give you very little information beyond what the grand mean is already telling you, so you should probably just use the grand mean instead.

In my view, that’s an iffy rationale for using biased subgroup proficiency scores, though, which I think the authors should’ve emphasized more. (Maybe they’ll have to in the peer-reviewed version of the paper.) Normally, bias in individual means isn’t a big deal: we take for granted that, over the long run, upward bias will be balanced out by downward bias. But, for this method and this application, the bias won’t ever go away, at least not where it matters. If what we’re looking at is just the scores around the proficiency cutoff, that’s generally going to be near the bottom of the distribution, and means near the bottom will always go up. As a result, schools with “bad luck” (as the authors say) will be pulled above the cutoff where they belong, but so will schools with subgroups that are genuinely underperforming.

I have a paper under review that derives a method for correcting a similar problem for effect sizes—it moves individual estimates not toward a grand mean but toward the true mean, in a direction and distance determined by a measure of the data’s randomness.

I kinda see what Nelson is saying, but I still like the above-linked report because I think that in general it is better to work with regularized, partially-pooled estimates than with raw estimates, even if those raw estimates are adjusted for noise or multiple comparisons or whatever.

To help convey this, let me share a few thoughts regarding hierarchical modeling in this general context of comparing averages (in this case, from different schools, but similar issues arise in medicine, business, politics, etc.).

1. Many years ago, Rubin made the point that, when you start with a bunch of estimates and uncertainties, classical multiple comparisons adjustments can be seen as effectively increasing the standard errors so that fewer comparisons are statistically significant, whereas Bayesian methods move the estimates around. Rubin’s point was that you can get the right level of uncertainty much more effectively by moving the intervals toward each other rather than by keeping their centers fixed and then making them wider. (I’m thinking now that a dynamic visualization would be helpful to make this clear.)

It’s funny because Bayesian estimates are often thought of as trading bias for variance, but in this case the Bayesian estimate is so direct, and it’s the multiple comparisons approaches that do the tradeoff, getting the desired level of statistical significance by effectively making all the intervals wider and thus weakening the claims that can be made from data. It’s kinda horrible that, under the classical approach, your inferences for particular groups and comparisons will on expectation get vaguer as you get data from more groups.
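
Here’s a toy numerical version of that contrast (a sketch only: simulated data and a crude normal-normal empirical-Bayes shrinkage calculation, not a full Bayesian fit):

```python
import numpy as np
from scipy.stats import norm

# Toy version of Rubin's point: a classical multiple-comparisons correction keeps
# the interval centers fixed and widens the intervals, while a hierarchical
# analysis (here, crude empirical Bayes) pulls the centers toward each other.

rng = np.random.default_rng(1)
J, sigma = 20, 1.0
theta = rng.normal(0, 0.5, J)   # true group effects
y = rng.normal(theta, sigma)    # one noisy estimate per group

# Classical: same centers, Bonferroni-widened 95% intervals.
z = norm.ppf(1 - 0.05 / (2 * J))
classical = np.column_stack([y - z * sigma, y + z * sigma])

# Hierarchical: shrink toward the common mean.  Moment estimates of mu and tau^2,
# ignoring their uncertainty (a full Bayesian fit would propagate it); tau^2 is
# floored at a small value to keep the toy example from degenerating.
mu_hat = y.mean()
tau2_hat = max(y.var(ddof=1) - sigma**2, 0.05)
k = tau2_hat / (tau2_hat + sigma**2)          # shrinkage factor toward the data
post_mean = mu_hat + k * (y - mu_hat)
post_sd = np.sqrt(k) * sigma
bayes = np.column_stack([post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd])

print(classical[:3].round(2))  # wide intervals centered on the raw estimates
print(bayes[:3].round(2))      # narrower intervals, centers pulled together
```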

We explored this idea in our 2000 article, Type S error rates for classical and Bayesian single and multiple comparison procedures (see here for freely-available version) and more thoroughly in our 2011 article, Why we (usually) don’t have to worry about multiple comparisons. In particular, see the discussion on pages 196-197 of that latter paper (see here for freely-available version).


2. MRP, or multilevel modeling more generally, does not “move the individual subgroup means toward that grand mean.” It moves the error terms toward zero, which implies that it moves the local averages toward their predictions from the regression model. For example, if you’re predicting test scores given various school-level predictors, then multilevel modeling partially pools the individual school means toward the fitted model. It would not in general make sense to partially pool toward the grand mean—not in any sort of large study that includes all sorts of different schools. (Yes, in Rubin’s classic 8-schools study, the estimates were pooled toward the average, but these were 8 similar schools in suburban New Jersey, and there were no available school-level predictors to distinguish them.)
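
Here’s a small simulated example of that distinction (shrinkage weights computed with the variances and regression coefficients treated as known, just to keep the sketch short; a real multilevel fit estimates everything jointly):

```python
import numpy as np

# Sketch of point 2: with a school-level predictor, partial pooling pulls each
# school's mean toward its regression prediction, not toward the grand mean.
# Everything here is simulated and simplified (known variances and coefficients).

rng = np.random.default_rng(7)
n_schools = 8
x = rng.normal(0, 1, n_schools)              # school-level predictor (e.g., poverty rate)
beta0, beta1, tau, sigma = 50.0, 5.0, 2.0, 4.0
true_mean = beta0 + beta1 * x + rng.normal(0, tau, n_schools)
n_j = rng.integers(10, 40, n_schools)        # students tested per school
ybar = rng.normal(true_mean, sigma / np.sqrt(n_j))

se2 = sigma**2 / n_j                          # sampling variance of each school mean
w = (1 / se2) / (1 / se2 + 1 / tau**2)        # weight on the school's own data

fit = beta0 + beta1 * x                       # regression prediction for each school
pooled_to_fit = w * ybar + (1 - w) * fit             # multilevel-model-style pooling
pooled_to_grand = w * ybar + (1 - w) * ybar.mean()   # pooling toward the grand mean

print(np.round(pooled_to_fit, 1))
print(np.round(pooled_to_grand, 1))  # can differ a lot when schools differ on x
```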

3. I agree with Nelson that it’s a mistake to summarize results using statistical significance, and this can lead to artifacts when comparing different models. There’s no good reason to make decisions based on whether a 95% interval includes zero.

4. I like multilevel models, but point estimates from any source—multilevel modeling or otherwise—have unavoidable problems when the goal is to convey uncertainty. See our 1999 article, All maps of parameter estimates are misleading.

In summary, I like the Forrow et al. article. The next step should be to go beyond point estimates and statistical significance and to think more carefully about decision making under uncertainty in this educational context.

It’s the finals! Time to choose the ultimate seminar speaker: Beverly Cleary vs. Laura Ingalls Wilder

We’ve reached the endpoint of our third seminar speaker competition. Top seeds J. R. R. Tolkien, Miles Davis, David Bowie, Dr. Seuss, Hammurabi, Judas, Martha Stewart, and Yo-Yo Ma fell by the wayside (indeed, Davis, Judas, and Ma didn’t even get to round 2!); unseeded heavyweight Isaac Newton lost in round 3; and dark-horse favorites James Naismith, Henry Winkler, Alison Bechdel, and J. Robert Lennon couldn’t make the finish line either.

What we have is two beloved and long-lived children’s book authors. Cleary was more prolific, but maybe only because she got started at a younger age. Impish Ramona or serious Laura . . . who’s it gonna be?

Either way, I assume it will go better than this, from a few years ago:

CALL FOR APPLICATIONS: LATOUR SEMINAR — DUE DATE AUGUST 11 (extended)
The Brown Institute for Media Innovation, Alliance (Columbia University, École Polytechnique, Sciences Po, and Panthéon-Sorbonne University), The Center for Science and Society, and The Faculty of Arts and Sciences are proud to present

BRUNO LATOUR AT COLUMBIA UNIVERSITY, SEPTEMBER 22-25
You are invited to apply for a seminar led by Professor Bruno Latour on Tuesday, September 23, 12-3pm. Twenty-five graduate students from throughout the university will be selected to participate in this single seminar given by Prof. Latour. Students will organize themselves into a reading group to meet once or twice in early September for discussion of Prof. Latour’s work. They will then meet to continue this discussion with a small group of faculty on September 15, 12-2pm. Students and a few faculty will meet with Prof. Latour on September 23. A reading list will be distributed in advance.

If you are interested in this 3-4 session seminar (attendance at all 3-4 sessions is mandatory), please send

Name:
Uni:
Your School:
Your Department:
Year you began your terminal degree at Columbia:
Thesis or Dissertation title or topic:
Name of main advisor:

In one short, concise paragraph tell us what major themes/keywords from Latour’s work are most relevant to your own work, and why you would benefit from this seminar. Please submit this information via the site
http://brown.submittable.com/submit
The due date for applications is August 11 and successful applicants will be notified in mid-August.

That was the only time I’ve heard of a speaker who’s so important that you have to apply to attend his seminar! And, don’t forget, “attendance at all 3-4 sessions is mandatory.” I wonder what they did to the students who showed up to the first two seminars but then skipped #3 and 4.

Past matchup

Wilder faced Sendak in the last semifinal. Dzhaughn wrote:

This will be a really tight match up.

Sendak has won the Laura Ingalls Wilder Award. Yet no one has won more Maurice Sendak Awards than Wilder. And she was dead when he won it.

Maurice Sendak paid for his college by working at FAO Schwarz. That’s Big, isn’t it?

The Anagram Department notices “Serial Lulling Award,” not a good sign for a seminar speaker. “American Dukes” and “Armenia Sucked” are hardly top notch, but less ominous.

So, I come up with a narrow edge to Sendak but I hope there is a better reason.

“Serial Lulling Award” . . . that is indeed concerning!

Raghu offers some thoughts, which, although useless for determining who to advance to the final round, are so much in the spirit of this competition that I’ll repeat them here:

This morning I finished my few-page-a-day reading of the biography of basketball inventor and first-round loser James Naismith, and I was struck again by how well-suited he is to this tournament:

“It was shortly after seven o’clock, and the meal was over. He added briskly, ‘Let me show you some of the statistics I’ve collected about accidents in sports. I’ve got them in my study.’ He started to rise from the table and fell back into his chair. Ann recognized the symptoms. A cerebral hemorrhage had struck her father.” — “The Basketball Man, James Naismith” by Bernice Larson Webb

Statistics! Sports! Medical inference!

I am not, however, suggesting that the rules be bent; I’ve had enough of Naismith.

I finished Sendak’s “Higglety Pigglety Pop! Or, There Must Be More to Life” — this only took me 15 minutes or so. It is surreal, amoral, and fascinating, and I should read more by Sendak. Wilder is neither surreal nor amoral, though as I think I noted before, when I was a kid I found descriptions of playing ball with pig bladders as bizarre as science fiction. I don’t know who that’s a vote for.

I find it hard to read a book a few pages a day. I can do it for awhile, but at some point I either lose interest and stop, or I want to find out what happens next so I just finish the damn book.

Diana offers a linguistic argument:

Afterthought and correction: The “n” should be considered a nasal and not a liquid, so Laura Ingalls Wilder has five liquids, a nasal, a fricative, a glide, and two plosives, whereas Maurice Sendak has two nasals, a liquid, two fricatives, and two plosives (and, if you count his middle name, three nasals, three liquids, two fricatives, and four plosives). So Wilder’s name actually has the greater variety of consonants, given the glide, but in Sendak’s name the various kinds are better balanced and a little more spill-resistant.

OK, sippy cups. Not so relevant for a talk at Columbia, though, given that there will be very few toddlers in the audience.

Anon offers what might appear at first to be a killer argument:

If you look at the chart, you can pretty clearly notice that the bracket is only as wide as it is because of Laura Ingalls Wilder’s prodigious name. I’ve got to throw my hat in the ring for Sendak, simply for storage.

+1 for talking about storage—optimization isn’t just about CPU time!—but this length-of-name argument reeks of sexism. In a less traditional society, Laura wouldn’t have had to add the Wilder to her name, and plain old “Laura Ingalls,” that’s a mere 13 characters wide, and two of them are lower-case l’s, which take up very little space (cue Ramanujan here). Alison Bechdel’s out of the competition now, but she’s still looking over my shoulder, as it were, scanning for this sort of bias.

And Ben offers a positive case for the pioneer girl:

There’s some sort of libertarian angle with Wilder though right?

What if we told Wilder about bitcoin and defi and whatnot? Surely that qualifies as surreal and amoral in the most entertaining kind of way. I know talking about these things in any context is a bit played out at this point but c’mon. This isn’t some tired old celebrity we’re selling here! This is author of an American classic, from the grave — any way she hits that ball is gonna be funny.

Sounds good to me!

Bad stuff going down in biostat-land: Declaring null effect just cos p-value is more than 0.05, assuming proportional hazards where it makes no sense

Wesley Tansey writes:

This is no doubt something we both can agree is a sad and wrongheaded use of statistics, namely incredible reliance on null hypothesis significance testing. Here’s an example:

Phase III trial. Failed because their primary endpoint had a p-value of 0.053 instead of 0.05. Here’s the important actual outcome data though:

For the primary efficacy endpoint, INV-PFS, there was no significant difference in PFS between arms, with 243 (84%) of events having occurred (stratified HR, 0.77; 95% CI: 0.59, 1.00; P = 0.053; Fig. 2a and Table 2). The median PFS was 4.5 months (95% CI: 3.9, 5.6) for the atezolizumab arm and 4.3 months (95% CI: 4.2, 5.5) for the chemotherapy arm. The PFS rate was 24% (95% CI: 17, 31) in the atezolizumab arm versus 7% (95% CI: 2, 11; descriptive P < 0.0001) in the chemotherapy arm at 12 months and 14% (95% CI: 7, 21) versus 1% (95% CI: 0, 4; descriptive P = 0.0006), respectively, at 18 months (Fig. 2a). As the INV-PFS did not cross the 0.05 significance boundary, secondary endpoints were not formally tested.

The odds of atezolizumab being better than chemo are clearly high. Yet this entire article is being written as the treatment failing simply because the p-value was 0.003 too high.

He adds:

And these confidence intervals are based on proportional hazards assumptions. But this is an immunotherapy trial where we have good evidence that these trials violate the PH assumption. Basically, you get toxicity early on with immunotherapy, but patients that survive that have a much better outcome down the road. Same story here; see figure below. Early on the immunotherapy patients are doing a little worse than the chemo patients but the long-term survival is much better.

As usual, our recommended solution for the first problem is to acknowledge uncertainty and our recommended solution for the second problem is to expand the model, at the very least by adding an interaction.
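
To illustrate the second problem, here’s a quick simulation sketch with invented piecewise-constant hazards that loosely mimic the pattern described above (slightly worse early, much better late); it is not the trial’s data, just a picture of why one proportional-hazards summary misleads and why letting the treatment effect differ by time period (an interaction) helps:

```python
import numpy as np

# Invented piecewise-constant monthly hazards: the "immunotherapy" arm has a
# slightly higher hazard before month 6 (early toxicity) and a much lower
# hazard afterward, while the "chemo" arm has a roughly constant hazard.

t = np.arange(0, 24.01, 0.1)  # months of follow-up

def survival(h_early, h_late, cut=6.0):
    """Survival curve for a piecewise-constant hazard: h_early before `cut`, h_late after."""
    cum_hazard = np.where(t <= cut, h_early * t, h_early * cut + h_late * (t - cut))
    return np.exp(-cum_hazard)

S_chemo  = survival(h_early=0.20, h_late=0.20)
S_immuno = survival(h_early=0.24, h_late=0.06)

# The true hazard ratio is 1.2 before month 6 and 0.3 after: no single
# "proportional" hazard ratio describes both periods.  A model with a
# treatment-by-period interaction (two hazard ratios) captures the pattern;
# forcing one ratio averages early harm and late benefit into a number
# that understates the long-term difference.
for month in (6, 12, 18):
    i = np.searchsorted(t, month)
    print(month, round(float(S_immuno[i]), 2), round(float(S_chemo[i]), 2))
```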

Regarding acknowledging uncertainty: Yes, at some point decisions need to be made about choosing treatments for individual patients and making general clinical recommendations—but it’s a mistake to “prematurely collapse the wave function” here. This is a research paper on the effectiveness of the treatment, not a decision-making effort. Keep the uncertainty there; you’re not doing us any favors by acting as if you have certainty when you don’t.

Laura Ingalls Wilder vs. Maurice Sendak; Cleary advances

OK, two more children’s book authors. Both have been through a lot. Laura defeated cool person Banksy, lawgiver Steve Stigler, person known by initials Malcolm X, and then won a come-from-behind victory against lawgiver Alison Bechdel. Meanwhile, Maurice dethroned alleged tax cheat Martha Stewart, namesake Steve McQueen, and fellow children’s book author Margaret Wise Brown.

Who’s it gonna be? I’d say Maurice because he’s an illustrator as well as a writer. On the other hand, Laura’s books have a lot more content than Maurice’s; also, as a political scientist, I appreciate the story of how Laura rewrote some of her life history to be more consistent with her co-author daughter’s political ideology.

Both authors are wilderness-friendly!

Past matchup

Raghu suggests we should sit here for the present.

Dzhaughn writes:

I have had the Cleary image of Ramona sitting in a basement taking one bite out of every apple for more than 90% of my life.

But Diana counters:

I don’t wanna go down to the basement.

Moving away from Ramona for a moment, Pedro writes:

A little bit of Googling reveals that Shakira once starred in a soap opera (telenovela) in her teen years. Apparently embarrassed, she ended up buying the rights to the soap and now it’s no longer available in any legal way.

Although I’m very sympathetic towards her actions and feelings, this blog is very pro-open science and sharing data and her actions are as against that as possible…

Good point! Cleary is very open, as you can see if you read her two volumes of autobiography. Maybe if she comes to speak, we’ll hear some excerpts from volume 3?

Again, here are the announcement and the rules.

It’s the semifinals! Shakira vs. Beverly Cleary; Sendak advances

As usual, the powers-of-2 thing sneaks up on us. All of a sudden, our third Greatest Seminar Speaker competition is nearing its final rounds.

Today we have two contestants to be reckoned with. Shakira made it pretty far against weak competition but then vanquished the mighty Dahl. Meanwhile Cleary shot down David Bowie, A. J. Foyt, and the inventor of Code Names.

Songwriter or storyteller; which will it be?

Past matchup

Raghu offers arguments in both directions:

On the one hand, we have not resolved the mystery of physiological scaling among weight lifters.

On the other:

I decided to spend some time in the library working — a change of scenery — and I picked up a book by Maurice Sendak, “Higglety Pigglety Pop! or There Must Be More to Life,” because previously all I’ve read by Sendak is “Where the Wild Things Are” and because “There Must Be More to Life” is a wonderful title. So far I am only four chapters in: a narcissistic and possibly psychopathic dog leaves her comfortable life in search of something better. It is excellent, and I look forward to finishing. So far it shows no connections to science or statistics, but I wouldn’t mind a seminar on whether there is or is not more to life.

Dzhaughn makes the case for . . . ummm, I’m not sure which one:

It’s hard for me to relate to someone who can eat as much as they want. Or more than they want in case of japanese hot dog guy. Maybe i should open my mind and shut my mouth, even if that’s not their approach.

Supposedly Li Wenwen wins when she can eat, with more ease in her mores, more rice than Maurice, then Maurice.

Anonymous breaks the tie:

I think you really need to give a leg up to the unheard voices. I mean Maurice Sendak got to blab and blab in books, and then I’m sure went on the academic circuit to tell pretentious college students all about the importance of children books, and how important it is to pay him millions of dollars. I don’t speak mandarin, so although Li Wenwen has surely spoken at many cadre meetings or whatever, I haven’t heard it.

And I was all ready to give it to Li, but then Ethan came in with this late entry:

“There must be more to life” is far weightier than anything Li can lift. Sendak wins on Wenwen’s turf. Sendak on to the semis.

Again, here are the announcement and the rules.

“Behavioural science is unlikely to change the world without a heterogeneity revolution”

Christopher Bryan, Beth Tipton, and David Yeager write:

In the past decade, behavioural science has gained influence in policymaking but suffered a crisis of confidence in the replicability of its findings. Here, we describe a nascent heterogeneity revolution that we believe these twin historical trends have triggered. This revolution will be defined by the recognition that most treatment effects are heterogeneous, so the variation in effect estimates across studies that defines the replication crisis is to be expected as long as heterogeneous effects are studied without a systematic approach to sampling and moderation. When studied systematically, heterogeneity can be leveraged to build more complete theories of causal mechanism that could inform nuanced and dependable guidance to policymakers. We recommend investment in shared research infrastructure to make it feasible to study behavioural interventions in heterogeneous and generalizable samples, and suggest low-cost steps researchers can take immediately to avoid being misled by heterogeneity and begin to learn from it instead.

We posted on the preprint version of this article earlier. The idea is important enough that it’s good to have an excuse to post on it again.

P.S. This also reminds me of our causal quartets.

Maurice Sendak vs. Li Wenwen; Wilder advances

At first I was gonna say that the edge goes to the author of In the Night Kitchen, because he can draw tempting food that Li Wenwen would then eat, causing her to go out of her weight class and forfeit her title. But Li is already in the upper weight class, so she can eat as much as she wants.

Who should advance? Pierre says, “I don’t care.” But some of you must have opinions!

Past matchup

Dzhaughn writes:

Bechdel’s rule would have meant nothing to Laura. But just about any conceivable movie about Laura passes the Bechdel test.

Jonathan adds:

It’s hard not to look ahead to anticipate potential future matchups and ignore the match in front of you. But it’s one match at a time. Fun Home on the Prairie!

I’m going to go with male protagonist proxies here (violating the Bechdel Rule)

Michael Landon (Poppa Wilder) vs. Michael Cerveris (Poppa Bechdel): Landon played a teenage werewolf while Cerveris played Sweeney Todd. Both scary, but I think the werewolf is scarier. So, Wilder.

That’s 2 arguments for Laura and 0 for Alison, so Laura it is.

Again, here are the announcement and the rules.

Predicting LLM havoc

This is Jessica. Jacob Steinhardt recently posted an interesting blog post on predicting emergent behaviors in modern ML systems like large language models. The premise is that we can get qualitatively different behaviors from a deep learning model with enough scale–e.g., AlphaZero hitting a point in training where suddenly it has acquired a number of chess concepts. Broadly we can think of this happening as a result of how acquiring new capabilities can help a model lower its training loss and how, as scale increases, you can get points where some (usually more complex) heuristic comes to overtake another (simpler) one. The potential for emergent behaviors might seem like a counterpoint to the argument that ML researchers should write broader impacts statements to prospectively name the potential harms their work poses to society… non-linear dynamics can result in surprises, right? But Steinhardt’s argument is that some types of emergent behavior are predictable.

The whole post is worth reading so I won’t try to summarize it all. What most captured my attention though is his argument about predictable deception, where a model fools or manipulates the (human) supervisor rather than doing the desired tasks, because doing so gets it better or equal reward. Things like ChatGPT saying that “When I said that tequila has ‘relatively high sugar content,’ I was not suggesting that tequila contains sugar” or an LLM claiming there is “no single right answer to this question” when there is, sort of like a journalist insisting on writing a balanced article about some issue where one side is clearly ignoring evidence. 

The creepy part is that the post argues that there is reason to believe that certain factors we should expect to see in the future–like models being trained on more data, having longer dialogues with humans, and being more embedded in the world (with a potential to act)–are likely to increase deception. One reason is that models can use the extra info they are acquiring to build better theories-of-mind and use them to better convince their human judges of things. And when they can understand what humans respond to and can act in the world, they can influence human beliefs by generating observables. For example, we might get situations like the following:

suppose that a model gets higher reward when it agrees with the annotator’s beliefs, and also when it provides evidence from an external source. If the annotator’s beliefs are wrong, the highest-reward action might be to e.g. create sockpuppet accounts to answer a question on a web forum or question-answering site, then link to that answer. A pure language model can’t do this, but a more general model could.

This reminds me of a similar example used by Gary Marcus of how we might start with some untrue proposition or fake news (e.g., Mayim Bialik is selling CBD gummies) and suddenly have a whole bunch of websites on this topic. Though he seemed to be talking about humans employing LLMs to generate bullshit web copy. Steinhardt also argues that we might expect deception to emerge very quickly (think phase transition), as suddenly a model achieves high enough performance by deceiving all the time that those heuristics dominate over the more truthful strategies. 

The second part of the post on emergent optimization argues that as systems increase in optimization power—i.e., as they consider a larger and more diverse space of possible policies to achieve some goal—they become more likely to hack their reward functions. E.g., a model might realize your long-term goals (say, lots of money and lots of contentment) are hard to achieve, and so instead it resorts to trying to change how you appraise one of those things over time. The fact that planning capabilities can emerge in deep models even when they are given a short-term objective (like predicting the next token in some string of text) and that we should expect planning to drive down training loss (because humans do a lot of planning and human-like behavior is the goal) means we should be prepared for reward hacking to emerge.

From a personal perspective, the more time I spend trying out these models, and the more I talk to people working on them, the more I think being in NLP right now is sort of a double-edged sword. The world is marveling at how much these models can do, and the momentum is incredible, but it also seems that on a nearly daily basis we have new non-human-like (or perhaps worse, human-like but non-desirable) behaviors getting identified and becoming targets for research. So you can jump into the big whack-a-mole game, and it will probably keep you busy for a while, but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches. Though I guess anyone who is curiously watching what’s going on in NLP is in the same boat. It really is kind of uncomfortable.

This is not to say though that there aren’t plenty of NLP researchers thinking about LLMs with a relatively clear sense of direction and vision – there certainly are. But I’ve also met researchers who seem all in without being able to talk very convincingly about where they see it all going. Anyway, I’m not informed enough about LLMs to evaluate Steinhardt’s predictions, but I like that some people are making thoughtful arguments about what we might expect to see.


P.S. I wrote “but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches” but it occurs to me now that it’s not really clear to me what I’m waiting for to determine “how far we can go.” Do deep models really need to perfectly emulate humans in every way we can conceive of for these approaches to be considered successful? It’s interesting to me that despite all the impressive things LLMs can do right now, there is this tendency (at least for me) to talk about them as if we need to withhold judgment for now. 

Don’t trust people selling their methods: The importance of external validation. (Sepsis edition)

This one’s not about Gladwell; it’s about sepsis.

John Williams points to this article, “The Epic Sepsis Model Falls Short—The Importance of External Validation,” by Anand Habib, Anthony Lin, and Richard Grant, who report that a proprietary model used to predict sepsis in hospital patients doesn’t work very well.

That’s to be expected, I guess. But it’s worth the reminder, given all the prediction tools out there that people are selling.

Alison Bechdel vs. Laura Ingalls Wilder; Cleary advances

Laura was a pioneer who drove a carriage in the snow—it was so cold she had to stop to snap the frozen breath off the horse’s nose so it could breathe! But Alison’s no slouch herself: we’ve heard from her latest book that she’s in excellent shape and is obsessed with workouts. So either of these two ladies could show us a thing or two about fitness. The question is, who’d be the better seminar speaker? Uncork your clever arguments, please!

Past matchup

Raghu brings on the stats:

Unlike Beverly Cleary, Leona Helmsley wrote nothing I would want to read. Quickly looking at snippets of books *about* her, none of them seem like anything I want to read, either.

“Palace Coup: The Inside Story of Harry and Leona Helmsley” gets 3.6 on Goodreads, which is basically 0 given the scale of scores there, and not even the 2 reviews are interesting. https://www.goodreads.com/en/book/show/1553418

“The Helmsleys: The Rise and Fall of Harry and Leona Helmsley” gets 3.0.

There’s a book by the guy who administered her philanthropic trust, and from the preview on Google Books, it looks excruciatingly dull and poorly written.

Extra credit for adjusting the raw numbers. One of the three central tasks of statistics is generalizing from observed data to underlying constructs of interest.

Anon writes:

According to Wikipedia “Alan Dershowitz, while having breakfast with her [Helmsley] at one of the Helmsley hotels, received a cup of tea with a tiny bit of water spilled on the saucer. Helmsley grabbed the cup from the waiter and smashed it on the floor, then told him to beg for his job.”

I think we can all appreciate someone who would tell Dershowitz to beg for his job. But Raghu counters with:

You’re saying that with Helmsley, the pre-seminar coffee will be ruined by a temper tantrum? I don’t know about Columbia, but our campus catering wouldn’t stand for such abuse, and then I wouldn’t get any coffee, and then I would leave without attending the talk.

We wouldn’t want the seminar to happen without Raghu in the audience, so Bev it is. We’ll see how she fares against Shakira in the semis.

Again, here are the announcement and the rules.

Is there a Bayesian justification for the Murdaugh verdict?

Jonathan Falk writes:

I know you’re much more a computational Bayesian than a philosophical Bayesian, and I assume you were as ignorant of the national phenomenon of the Murdaugh trial as I was, but I just don’t quite get it.

Assume the following facts are true:
(1) Two people were murdered: a mother and son
(2) The husband and father, after having denied he was anywhere near the scene, was forced by incontrovertible evidence into admitting he was there shortly before they were killed.
(3) He is/was a drug addict and embezzler.
(4) There is literally no other evidence connecting him to the crime.

Starting from a presumption of innocence (not sure what that means exactly in probabilistic terms, but where p is the prior probability of guilt, p<<.5), how do you as a Bayesian combine (1)-(4) with the prior to get to a posterior of "beyond reasonable doubt"? (Again, leaving precise calibration aside, surely p>>.9)

People lie all the time, and most drug addicts and embezzlers are not murderers, and while proximity to a murder scene (particularly one in secluded private property) is pretty good evidence, it’s not usually enough to convict anyone without some other pretty good evidence, like a murder weapon, or an obvious motive. I’d be willing to countenance a model with a posterior in the neighborhood of maybe 0.7, but it’s not clear to me how a committed Bayesian proceeds in a case like this and finds the defendant guilty.

Thoughts?

He’s got a good question. I have two answers:

1. The availability heuristic. It’s easy to picture Murdaugh as the killer, and no good alternative explanations were on offer.

2. Decision analysis. Falk is framing the problem in probabilistic terms: what sort of evidence would it take to shift the probability from much less than 50% to well over 90%? I see his point that some strong evidence would be necessary, maybe much more than was presented at the trial. But I’m thinking the more relevant framing for the jury is: What should they decide?

Suppose the two options are find Murdaugh guilty or not guilty of first-degree murder. Picture yourself on the jury, making this decision, and consider the 2 x 2 matrix of outcomes under each option:

– Truly guilty, Found not guilty by the court: Evil man gets away with one of the worst crimes you could imagine.

– Truly guilty, Found guilty by the court: Evil man is punished, justice is done. A win.

– Truly not guilty, Found not guilty by the court: Avoided a mistake. Whew!

– Truly not guilty, Found guilty by the court: Lying, creepy-ass, suspicious-acting drug addict and embezzler didn’t actually kill his wife and kid (at least, not quite in the way that was charged). But he goes away for life anyway. No great loss to society here.

The point is that the circumstances of the crime and the jury’s general impression of the defendant are relevant to the decision. The assessed probability that he actually did the crime is relevant, but there’s not any kind of direct relation between the probability and the decision of how to vote in the jury. If you think the defendant is a bad enough guy, then you don’t really need to care so much about false positives.

That said, this will vary by juror, and some of them might be sticklers for the “beyond reasonable doubt” thing. From my perspective, I see the logic of the decision-analysis perspective whereby a juror can be fully Bayesian, estimate the probability of guilt at 0.7 or whatever, and still vote to convict.
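
To make the decision-analysis point concrete, here’s a minimal sketch in Python. The utility numbers are ones I made up for illustration (nothing here comes from the actual trial); the only point is that the convict/acquit threshold depends on the juror’s utilities, not just on the assessed probability of guilt.

```python
# A minimal decision-analysis sketch of the juror's choice. The utility
# numbers are invented for illustration; the point is only that the
# convict/acquit decision depends on them as well as on Pr(guilty).

def expected_utilities(p_guilty, utility):
    """Expected utility of each vote, given Pr(guilty) and a utility table
    keyed by (truth, vote), matching the 2x2 matrix above."""
    return {
        vote: p_guilty * utility[("guilty", vote)]
        + (1 - p_guilty) * utility[("innocent", vote)]
        for vote in ("convict", "acquit")
    }

# One possible utility table for the juror described in the post: letting
# a guilty man walk is terrible, while wrongly convicting this particular
# defendant is treated as only a modest loss.
utility = {
    ("guilty", "convict"): 1.0,     # justice done
    ("guilty", "acquit"): -10.0,    # evil man gets away with it
    ("innocent", "acquit"): 0.0,    # avoided a mistake
    ("innocent", "convict"): -2.0,  # wrong verdict, but "no great loss"
}

for p in (0.5, 0.7, 0.9):
    eu = expected_utilities(p, utility)
    print(p, eu, "->", max(eu, key=eu.get))
```

With this particular (made-up) utility table the break-even probability is about 0.15, so a juror who genuinely estimates the probability of guilt at 0.7 votes to convict without any inconsistency.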

P.S. For those commenters who haven’t heard of the Murdaugh trial . . . You can just google it! It’s quite a story.

Beverly Cleary (2) vs. Leona Helmsley; Shakira advances

This one satisfies the Bechdel test: two women competing without reference to a man. Sure, there’s Henry Huggins and Harry Helmsley lurking in the background, but let’s face it: Henry was outshined by his friend Beezus and her little sister Ramona, and Harry was a pale shadow of his publicity-grabbing wife. The women are where it’s at here.

As to the particular exemplars of girl-power on offer here: what would you like to hear in your seminar? A good story or some solid tax-cheating tips? One of these could be entertaining and one could be useful, right? Which would you prefer, cosponsorship by the English Department or the Business School?

Past matchup

Dzhaughn offers a positive case for the Chevalier:

I can tell you that over the last few months Peru has been a cauldron of unrest: a flimsy coup attempt by a populist leftish president with a 20% approval rating, who was then impeached; the vice president promoted to replace him betrayed her supporters; more than 60 protesters killed by police (and apparently 0 police interviewed by investigators about this); airports invaded and closed; practically all roads throughout the south closed in multiple places for weeks due to political protests, except on weekends and a couple weeks around Christmas and New Years, because one has got to sell the crops. And party.

Nevertheless, Shakira (and Pique) still seemed to be at least the #3 story on TV news. Beat that.

Meanwhile, Anon slams both contestants:

There’s been some controversy recently over some of Dahl’s more antisemitic comments, and I don’t feel that it would be proper to champion an antisemitic person in the competition like this. However, Shakira committed tax fraud, and that is much more evil.

Agreed on the evilness of tax fraud, an insidious crime that tears apart civil society.

Here’s my problem with Dahl. Every time he comes up, the discussion is always some variant of “He’s a nasty guy” or “He’s getting banned.” I’m sick of hearing about mean people and I’m sick of hearing about people getting banned, so we’ve come to the end of the line for this particular storyteller and war hero. Shakira to the semifinals!

Again, here are the announcement and the rules.

Round 4 has arrived! Shakira vs. Roald Dahl; Li advances

To start our fourth round, we have an alleged tax cheat vs. . . . I have no idea if Roald Dahl cheated on his taxes, but it wouldn’t shock me if he had!

To get to this point, Shakira defeated Michael “Douglas” Keaton, A. A. “could be Person Known by Initials or Children’s Book Author” Milne, and Gary “D & D” Gygax. Meanwhile, Dahl prevailed over Jane “Traitor” Fonda, Ethel “Traitor” Rosenberg, and Henry “as far as I know, not a traitor” Winkler.

I guess what I’m trying to say is that neither of today’s contestants has been strongly tested yet. It’s been a pretty easy ride for both of them, as they managed to dodge pre-tournament favorites such as J. R. R. Tolkien, James Naismith, and Miles Davis. (Michael Keaton was my personal dark-horse favorite, but I recognized he was a dark horse.)

Looking at the 8 remaining candidates in the bracket, we have 4 children’s book authors, 2 alleged tax cheats, 1 creator of laws or rules, and 1 duplicate name. All the traitors, duplicate names, and namesakes have disappeared. All the cool people have gone too! Who’d’ve thought? Maybe that’s what happens when the decisions are made by nerds.

Past matchup

Isaac vs. Wenwen turned out to motivate an awesome old-school blog-comment thread on body mass, scaling, and weightlifting. Physics and sports: always a great combination!

Raghu kicked off the discussion:

I vaguely remember that McMahon and Bonner’s excellent “On Size and Life” had a graph of the weightlifting world record for various weight divisions, which scaled as body weight ^ 2/3 or 3/4 or something like that. (I’m at a conference and so can’t look at my copy today.) Anyway, it leads to the interesting idea that what we care about shouldn’t be absolute performance, or the best in crude weight bins, but performance relative to the background physiological scaling law. How good are you *compared to* Mass^2/3 (or whatever). So, at Li Wenwen’s seminar we can all try this out.

And it went on from there. All about gravity and weightlifting.
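
To make Raghu’s suggestion concrete, here’s a toy sketch of scoring lifts relative to an assumed Mass^(2/3) law. The lifters and totals below are made up for illustration; they’re not actual records.

```python
# Toy version of the adjustment suggested in the thread: score a lift
# relative to an assumed bodyweight^(2/3) scaling law rather than in raw
# kilos. The lifters and totals below are made up for illustration.

def relative_score(total_kg, bodyweight_kg, exponent=2 / 3):
    """Lift total divided by the assumed physiological scaling law."""
    return total_kg / bodyweight_kg**exponent

made_up_lifters = [
    ("60 kg lifter, 250 kg total", 250, 60),
    ("90 kg lifter, 320 kg total", 320, 90),
    ("140 kg lifter, 380 kg total", 380, 140),
]

for label, total, bw in made_up_lifters:
    print(f"{label}: relative score {relative_score(total, bw):.1f}")
# The heaviest lifter has the biggest raw total but not the best
# mass-adjusted score.
```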

But who should it be—Isaac or Wenwen? Robin writes:

But which speaker has the highest BMI?

To which David replies:

That’s easy to determine. Since dissatisfaction with and/or ranting about BMI is strongly correlated with BMI, just ask them whether or not they think BMI is a useful measurement.

Newton was a ranter, but I think it’s safe to assume that Li, as a lifter, will be more aware of weight. So she advances.

Again, here are the announcement and the rules.

Count the living or the dead?

Martin Modrák writes:

Anders Huitfeldt et al. recently published a cool preprint that follows up on some quite old work and discusses when we should report/focus on the ratio of the odds of a death/event and when we should focus on the ratio of the survival/non-event odds.

The preprint is accompanied by a site providing a short description of the main ideas:

The key bit:

When an intervention reduces the risk of an outcome, the effect should be summarized using the standard risk ratio (which “counts the dead”, i.e. considers the relative probability of the outcome event), whereas when the intervention increases risk, the effect should instead be summarized using the survival ratio (which “counts the living”, i.e. considers the relative probability of the complement of the outcome event).

I took a look and was confused. I wasn’t understanding the article so I went to the example on pages 15-16, and I don’t get that either. They’re saying there was an estimate of relative risk of 3.2, and they’re saying the relative risk for this patient should be 1.00027. Those numbers are so different! Does this really make sense? I get that the 3.2 is a multiplicative model and the 1.00027 is from an additive model, but they’re still so different.

There’s also the theoretical concern that you won’t always know ahead of time (or even after you see the data) if the treatment increases or decreases risk, and it seems strange to have these three different models floating around.

In response to my questions, Martin elaborated:

A motivating use case is in transferring effect estimates from a study to new patients/populations: A study finds that a drug (while overall beneficial) has some adverse effects – let’s say that in the control group 1% of patients had a thrombotic event (blood clot) and in the treatment group it was 2%. Now we are considering giving the drug to a patient we believe already has an elevated baseline risk of thrombosis – say 5%. What is their risk of thrombosis if they take the drug? Here, the choice of effect summary will matter:

1) The risk ratio for thrombosis from the study is 2, so we could conclude that our patient will have 10% risk.

2) The risk ratio for _not_ having a thrombosis is 0.98/0.99 = 0.989899, so we could conclude that our patient will have 95% * 0.989899 ~= 94% risk of _not_ having a thrombosis and thus 6% risk of thrombosis.

3) The odds ratio for thrombosis is ~2.02, the baseline odds of our patient is ~0.053, so the predicted odds is ~0.106 and the predicted risk for thrombosis is 9.6%.

So at least cases 1) and 2) could lead to quite different clinical recommendations.
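
Here’s a minimal sketch of Martin’s three calculations, using his numbers (1% control risk, 2% treated risk, 5% baseline risk for the new patient). The function names are mine, not anything from the preprint.

```python
# Three ways to transport the study's effect to a new patient, following
# Martin's worked example (1% control risk, 2% treated risk in the study,
# 5% baseline risk for the new patient). Function names are mine.

def via_risk_ratio(p0_study, p1_study, p0_new):
    """'Count the dead': assume the risk ratio is stable."""
    return p0_new * (p1_study / p0_study)

def via_survival_ratio(p0_study, p1_study, p0_new):
    """'Count the living': assume the ratio of non-event probabilities is stable."""
    return 1 - (1 - p0_new) * (1 - p1_study) / (1 - p0_study)

def via_odds_ratio(p0_study, p1_study, p0_new):
    """Assume the odds ratio is stable."""
    odds = lambda p: p / (1 - p)
    new_odds = odds(p0_new) * odds(p1_study) / odds(p0_study)
    return new_odds / (1 + new_odds)

p0, p1, baseline = 0.01, 0.02, 0.05
print(via_risk_ratio(p0, p1, baseline))      # 0.100  -> 10% risk
print(via_survival_ratio(p0, p1, baseline))  # ~0.060 -> 6% risk
print(via_odds_ratio(p0, p1, baseline))      # ~0.096 -> 9.6% risk
```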

The question is: which effect summaries (or covariate-dependent effect summaries) are most likely to be stable across populations and thus allow us to easily apply the results to a new patient? The preprint, “Shall we count the living or the dead?” by Huitfeldt et al., argues that under assumptions that plausibly hold at least approximately in many cases where we study adverse effects, the risk ratio of _not_ having the outcome (i.e. “counting the living”) requires few covariates to be stable. A similar line of argument then implies that, at least in some scenarios where we study direct beneficial effects of a drug, the risk ratio of the outcome (i.e. “counting the dead”) is likely to be approximately stable with few covariates. The odds ratio is then stable only when we in fact condition on all covariates that cause the outcome – in this case all other effect summaries are also stable.

The authors frame the logic in terms of a fully deterministic model where we enumerate the proportion of patients having underlying conditions that either 100% cause the outcome regardless of treatment, or 100% cause the outcome only in the presence/absence of treatment, so the risk is fully determined by the prevalence of the various types of conditions in the population.

The assumptions when risk ratio of _not_ having an outcome (“counting the living”) is stable are:

1) There are no (or very rare) conditions that cause the outcome _only_ in the absence of the treatment (in our example: the drug has no mechanism which could prevent blood clots in people already susceptible to blood clots).

2) The presence of conditions that cause the outcome irrespective of treatment is independent of the presence of conditions that cause the outcome only in the presence of treatment (in our example: if a specific genetic mutation interacts with the drug to cause blood clots, the presence of the mutation is independent of an unhealthy lifestyle that could cause blood clots on its own). If I understand this correctly, this can only approximately hold if the outcome is rare – if a population has a high prevalence of treatment-independent causes, it has to have fewer treatment-dependent causes simply because the chance of the outcome cannot be more than 100%.

3) We have good predictors for all of the conditions that cause the outcome only when combined with the treatment AND that differ between the study population and the target population, and we include those predictors in our model (in our example: variables that reflect blood coagulation are likely to need to be included, as the drug may push high coagulation “over the edge” and coagulation is likely to differ between populations; OTOH if a specific genetic mutation interacts with the drug, we need to include it only if the genetic background of the target population differs from the study population).

The benefit then is that if we make those assumptions, we can avoid modeling a large chunk of the causal structure of the problem – if we can model the causal structure fully, it doesn’t really matter how we summarise the effects.
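
Here’s a small simulation sketch of that deterministic model as I understand it; the prevalences are made up, and this is my reading of the setup rather than code from the preprint. Two populations share the same (rare, independent) treatment-interacting condition but differ in background risk: the survival ratio for the adverse event is identical in both, while the plain risk ratio is not.

```python
# Sketch of the deterministic causal model, as I read it. Each patient
# either has a condition that causes the adverse event regardless of
# treatment, or a condition that causes it only when treated (assumed
# independent of the first), or neither. All prevalences are made up.

def risks(background_prev, interaction_prev):
    """Risk of the event without and with treatment."""
    risk_control = background_prev
    risk_treated = background_prev + (1 - background_prev) * interaction_prev
    return risk_control, risk_treated

interaction_prev = 0.01  # same drug, same rare treatment-interacting condition
for label, background_prev in [("low-risk population", 0.01),
                               ("high-risk population", 0.05)]:
    r0, r1 = risks(background_prev, interaction_prev)
    print(f"{label}: risk ratio {r1 / r0:.2f}, "
          f"survival ratio {(1 - r1) / (1 - r0):.4f}")
# The risk ratio differs between the two populations (1.99 vs. 1.19), but
# the survival ratio is 0.9900 in both: it depends only on the prevalence
# of the treatment-interacting condition.
```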

The assumptions are quite strong, but the authors IMHO reasonably claim that they may approximately hold for real use cases (and can at least sometimes be empirically tested). One case they give is vaccination:

The Pfizer Covid vaccine has been reported to be associated with a risk ratio of 3.2 for myocarditis (a quite serious problem). So for a patient with a 1% baseline risk of myocarditis (this would be quite high), if the risk ratio were stable, we could conclude that the patient would have a 3.2% risk after vaccination. However, the risk ratio for not having myocarditis is 0.999973, and assuming this is stable, it results in predicting a 1.0027% risk after vaccination. The argument is that the latter is more plausible, as the assumptions for stability of the risk ratio of not having the event could approximately hold.

Another way of thinking about this is that the reasons a person may be prone to myocarditis (e.g., a history of HIV) aren’t really made worse by vaccination – the vaccination only causes myocarditis via very rare underlying conditions that mostly don’t manifest otherwise, so people already at risk are not affected more than people at low baseline risk.
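
Applying the same two transport rules to the vaccine numbers reproduces the comparison Martin describes (a quick arithmetic check, using his assumed 1% baseline):

```python
# Quick check of the myocarditis arithmetic: a study risk ratio of 3.2
# versus a survival ratio of 0.999973, transported to a patient with a 1%
# baseline risk, as in Martin's example.

baseline = 0.01
risk_ratio = 3.2           # ratio for having myocarditis
survival_ratio = 0.999973  # ratio for NOT having myocarditis

print(baseline * risk_ratio)                # 0.032     -> 3.2% risk
print(1 - (1 - baseline) * survival_ratio)  # ~0.010027 -> ~1.0027% risk
```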

Complementarily, the risk ratio of the outcome (counting the dead) is stable when:

1) There are no (or very rare) conditions that cause the outcome only in the presence of the treatment (i.e. the treatment does not directly harm anybody w.r.t the outcome).

2) The presence of conditions that _prevent_ the outcome regardless of treatment is independent of the presence of conditions that prevent the outcome only in the presence of treatment.

3) We have good predictors for all of the conditions that prevent the outcome only when combined with the treatment AND that differ between the study population and the target population, and we include those predictors in our model.

This could plausibly be the case for drugs where we have a good idea of how they prevent the specific outcome (say, an antibiotic that prevents infection unless the pathogen is resistant). Notably, those assumptions are unlikely to hold for outcomes like “all-cause mortality,” so the title of the preprint might be a bit of a misnomer.

The preprint doesn’t really consider uncertainty, but in my reading, the reasoning should apply almost identically under uncertainty.

There’s also an interesting historical outlook, as the idea can be traced back to a 1958 paper by Mindel C. Sheps that was ignored, but similar reasoning was then rediscovered on a number of occasions. For rare outcomes the logic also maps to focusing on “relative benefits and absolute harms,” as is often considered good practice in medicine.

One thing I also find interesting here is the connection between data summaries and modeling. In some abstract sense, the way you decide to summarize your data is a separate question from how you will model the data and underlying phenomenon of interest. But in practice they go together: different data summaries suggest different sorts of models.