Dan Ariely: “Why Louisiana’s Ten Commandments law is a broken moral compass”

You think I’m kidding with the above title, but I’m not. Key quote from the noted social psychologist / business-school professor:

The assumption that the Ten Commandments can serve as a universal moral code is increasingly out of touch with contemporary American society.

I didn’t check out all the links in this article, but my guess is that Ariely’s problem with the 10 Commandments is that God signed the tablets at the bottom, not at the top.

Or maybe he didn’t like that commandment, “Thou shalt not make unto thee any graven image,” which, when translated into the language of modern science, might imply, “Thou shalt not fabricate data.” Those damn commandments, always getting in the way!

P.S. You might say, why write about this? Why not just decorously look away? Well, remember what they say about dead horses. The Hill is a reputable publication, and they’re using their space to promote questionable work. I think this kind of thing is bad for science as a whole. It’s just particularly ridiculous when they invite someone whose own research has raised major ethical concerns to lecture us on ethics. I do think mockery is appropriate here. It’s nothing personal. I just think of all the people in social science, working so hard and not pushing the ethical boundaries, working openly and honestly and with a sincere willingness to learn . . . and then this is what gets rewarded with publicity. I laugh because otherwise I’d scream.

The “delay-the-reckoning heuristic” in pro football?

Paul Campos tells this story of an NFL coach making a bad decision:

Denver kicked a field goal with 1:54 to go to make the score 13-6 Pittsburgh. Denver had one time out remaining. At this point [Denver coach] Payton elected to kick the ball off through the end zone rather than try an onside kick. If you do the math, this meant that the reasonable best case scenario for Denver was that they would stop Pittsburgh from getting a first down after three running plays, and get the ball back after a punt deep in their own territory with about ten seconds left and no time outs. This is in fact what happened. Since the odds of scoring a game-tying TD in this situation are almost zero, the question that naturally arose after the game is why Payton didn’t try an onside kick. . . .

Now the dumbest part of all this is that the downside of a failed onside kick in this situation is trivial. If the kick fails, Pittsburgh gets the ball around the 50 rather than on their own 30 after the kick through the end zone. If Pittsburgh makes a first down in either situation the game is over so that’s irrelevant. But what’s the downside of Pittsburgh punting from the Denver 44 instead of, as they did, from the Pittsburgh 36? This is a 20-yard difference, but, because the end zone serves as a constraint on punters since a punt that goes into the end zone comes out to the 20, the real difference in field position as a practical matter is probably more like ten yards, with the likely outcome being Denver getting the ball on its own 20 rather than its own ten. So Payton passed up a chance to get the ball back at midfield with nearly two minutes and a time out left — a very manageable situation if you need a TD — for a realistic best case scenario that would have required something tantamount to a miracle to result in a TD.

Campos’s framing of the decision, a framing that seems reasonable to me, is that Denver, down by 7 points with 1:54 to go, had two options with their kickoff:

1. Deep kick, then try to keep Pittsburgh from getting a first down, then try to score a touchdown with the one or two plays remaining after the punt recovery.

2. Onside kick, then in the unlikely event that Denver recovers (approximately 6% chance, according to this source), they have more than a minute and a half to try to reach the end zone.

Campos argues that option 2 is better because, even conditional on stopping the first down and getting the ball back, the probability of scoring a touchdown in one or two plays, starting from deep in your own territory, is so much lower than the probability of scoring a touchdown from closer to midfield with a minute and a half to go. If Denver has a 60% chance of stopping Pittsburgh from getting the first down, this would imply that Campos thinks Denver would be much less than 1/10th as likely to get the touchdown in one or two plays as with a minute and a half and one time out. I don’t know these probabilities, but I assume a pro football coach would have an assistant who’d be able to access these numbers instantly. As Campos says, “all of this happened immediately after the two minute warning, so he and his staff didn’t have to make a snap decision: they had three minutes of beer and ED commercials to figure out this statistical puzzle.”
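
To make the structure of that calculation concrete, here’s a little sketch in R. Only the 6% onside-recovery figure comes from the source above; the other numbers (the chance of forcing a punt and the two touchdown probabilities) are placeholder guesses of mine, not estimates from Campos or from play-by-play data.

```r
# Comparing Denver's two kickoff options, down 7 with 1:54 left.
# Only the onside-recovery rate is from the post; the rest are
# illustrative assumptions, not estimates from actual game data.
p_onside_recover <- 0.06  # chance of recovering an onside kick
p_td_long_drive  <- 0.25  # assumed: P(TD) with ~1:50, one timeout, ball near midfield
p_stop           <- 0.60  # assumed: P(defense forces a punt after the deep kick)
p_td_few_plays   <- 0.02  # assumed: P(TD) with ~10 seconds, no timeouts, deep in own end

# Option 1: deep kick, hope for a stop, then try to score in one or two plays.
p_tie_deep   <- p_stop * p_td_few_plays             # 0.012
# Option 2: onside kick, then a full drive if recovered.
p_tie_onside <- p_onside_recover * p_td_long_drive  # 0.015

c(deep = p_tie_deep, onside = p_tie_onside)

# Break-even point, matching the 1/10th figure in the text: the onside kick
# wins whenever p_td_few_plays < (0.06 / 0.60) * p_td_long_drive.
```

Under these made-up numbers the onside kick comes out slightly ahead, and the comparison hinges entirely on how much smaller p_td_few_plays is than p_td_long_drive.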

Assuming that (a) Campos is correct about the probabilities, and (b) that the decision isn’t even close, the question then arises, why did the coach make the wrong decision (which, again, we’re assuming was wrong prospectively, not just retrospectively)?

Campos offers three explanations:

(1) When choosing between a course of action that creates a very slim chance of winning — onside kicks are rarely recovered — and one that creates a much slimmer chance, coaches tend to treat the decision in an irrational way, because what’s the difference between a one in 50 chance of winning and a one in 200 . . . when the baseline is low enough it’s like it doesn’t even matter. Now this is very likely true in this one particular instance, but it’s very much NOT true in the long run, when making many similar decisions over time.

(2) Coaches have a strong and strongly irrational preference for delaying the arrival of certain defeat for as long as possible, even at the cost of greatly reducing the odds of actually winning. . . . [This fits] Payton’s mentality throughout this game, including the decision to kick a field goal while down 13-0 with ten minutes to go, and even more so the decision to punt on fourth and eight from the Denver 33 while down by ten with seven and a half minutes to go. Such decisions are both more likely to cause the moment of certain defeat to arrive later than it would arrive otherwise, and seriously suboptimal in terms of increasing the chances of actually winning the game.

(3) A third factor in such decisions is that coaches would prefer defeat while pursuing the conventional course of action to defeat while doing something unconventional, since the latter makes them prone to heavier criticism, even if the criticism is wrong. . . .

Let’s get that third factor out of the way first, as it often comes up in these sorts of discussions of coaching decisions. We talked about it a few years ago in the context of fourth-down decisions in the NFL (with followup here). The short answer is yeah, most coaches have a motivation to be conservative, as it looks worse if you do something bold and it fails, but in this case the decision seems so clear that I don’t think the onside kick would be particularly controversial.

As for the first factor . . . sure, I guess the point is that when a decision seems unimportant (in this case, raising the probability of tying or winning from near-zero to higher but still very low), there is less motivation to make the rational decision. But I don’t really buy this. Think about it the other way. From the coach’s perspective, if the probability of winning is tiny, then the game is already basically lost, so at that point the team is playing with the house’s money. So why not roll the dice? From a psychological standpoint, this seems fundamentally different from the case where you’re almost certain to win and you choose a seemingly more conservative play even though it slightly increases your probability of losing, because if you’re gonna lose, you don’t want it to happen on what would be considered a weird play.

So then it comes down to Campos’s second argument, which interests me because it seems related to other decision-analysis paradoxes. But, like other ideas in the always-confusing heuristics-and-biases literature, it introduces its own challenges.

The delay-the-reckoning heuristic

I’m gonna label this idea identified by Campos, that “Coaches have a strong and strongly irrational preference for delaying the arrival of certain defeat for as long as possible, even at the cost of greatly reducing the odds of actually winning,” as the delay-the-reckoning heuristic.

The general scenario is that you are at a fork in the decision tree, where one branch will give you decisive, or near-decisive, information right away, while the other branch will take you down further steps until the uncertainty is resolved.

Which fork will you take?

Just speaking in general terms, I feel like it could go either way. Let me break the scenario into two sub-scenarios: potential good news (as in the football example when you’re behind but there’s an outside chance you could get lucky and win) or potential bad news (if you were leading in the football game and there’s an outside chance you could get unlucky and lose). Or you could think of medical examples: you have a serious disease but there’s a potential miracle drug with a small chance of working, or you’re healthy and you’re going to take a blood test that might reveal you have an incurable cancer.

In the potential-good-news scenario, I agree with Campos that it somehow seems more natural (whatever that means) to delay the reckoning. The idea is that you’re pretty sure it’s gonna be bad news, so you’d like to prolong the period of hope for as long as possible. From a pure decision-analysis standpoint, it’s always better to get information sooner, as it can inform later steps in the decision problem, but from an emotional standpoint, I understand the appeal of keeping hope alive for as long as possible.
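
To pin down that decision-analysis claim, here’s a toy calculation in R with invented payoffs (my own stylized example, nothing specific to football or medicine): a binary state, two possible actions, and a comparison of expected payoff when you act before versus after learning the state.

```r
# Toy value-of-information calculation, with invented numbers.
p_good <- 0.2   # prior probability that the news turns out good

# Payoff matrix: rows = actions, columns = states (bad news, good news)
payoff <- rbind(aggressive   = c(bad = -2, good = 10),
                conservative = c(bad =  1, good =  2))

# Act before learning the state: pick the action with the best prior
# expected payoff.
expected_now      <- payoff %*% c(1 - p_good, p_good)
best_without_info <- max(expected_now)                                # 1.2

# Learn the state first: pick the best action in each state, then
# average over the prior.
best_with_info <- sum(c(1 - p_good, p_good) * apply(payoff, 2, max))  # 2.8

c(without_info = best_without_info, with_info = best_with_info)
```

The expected payoff with early information is never lower than without it; whatever pull there is toward delaying the reckoning has to come from somewhere other than the decision analysis.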

On the other hand, what about the “Give it to me straight, Doc?” attitude? If you’re in bad shape, maybe you want to just know already and move on. So I’m not sure.

Also, I kind of understand the emotional logic to delaying the reckoning . . . but this is all happening within two minutes of a football game! Is it really worth lowering your win probability just to gain an additional minute and a half of hope? That doesn’t seem quite right. I feel like there’s something more going on here.

What about the potential-bad-news scenario? There I feel like it’s natural to want the information as soon as possible so as to rule out the unlikely bad outcome. Or maybe not. I feel like I’m working in the grand tradition of judgment and decision making research, which is to theorize based on personal impressions of hypothetical scenarios.

I sent the above to Dan Goldstein, an expert in judgment and decision making, and he pointed us to the book, Deliberate Ignorance: Choosing Not to Know, edited by Ralph Hertwig and Christoph Engel. So maybe there’s something there that’s relevant to our discussion.

The delay-the-reckoning heuristic interests me for its own sake and also for its connection to other time-related decision analysis issues such as the base-rate fallacy and its opposite, the slow-to-update problem.

Problems caused by grade inflation

Columbia math lecturer Peter Woit writes:

There has been significant grade inflation over the years, so having a transcript with a string of As isn’t worth what it once was. This is not good for the unusually talented, who now need to find other ways to distinguish themselves.

That’s a good point! I’ve typically thought of grade inflation in isolation (as in my post asking why weren’t the instructors all giving all A’s already?) with the problem being that inflated grades provide less information to future employers.

Woit’s point is related but goes further. Now that A’s are given out like candy corn at the world’s worst Halloween party, they don’t provide much signal: first because, as Woit says, non-unusually-talented students can also get strings of A’s on their transcripts, and second because, if you’re competing on grades, the occasional slip can be so costly. Either way, ambitious students have to distinguish themselves in other ways—for example, by publishing articles in journals and conferences. This propagation of “publish or perish” down to the high school level just exacerbates the explosion of publications—apparently, zillions of medical students are kinda required to publish research too, and if publication is a requirement, then the quality is not gonna matter so much, and these papers just get stirred in with whatever remaining legitimate literature is being produced.

So, yeah, if we were to give out more B’s and C’s, maybe the world would be a better place.

I’m not planning to go first, though. As I wrote a few years ago, the real mystery to me is not, Why is there grade inflation?, but rather, Why is there any room left to inflate: why weren’t the instructors all giving all A’s already?

At that time, I recommended statistician Val Johnson’s plan to “make post-hoc adjustments to assigned grades to account for differences in faculty grading policies”—basically, fit a multilevel item-response model to estimate students’ latent abilities based on their grades (a simplified sketch appears after the quoted list below). As I wrote at the time:

The beauty of Val’s approach is that it does three things:

1. By statistically correcting for grading practices, Val’s method produces adjusted grades that are more informative measures of student ability.

2. Since students know their grades will be adjusted, they can choose and evaluate their classes based on what they expect to learn and how they expect to perform; they don’t have to worry about the extraneous factor of how easy the grading is.

3. Since instructors know the grades will be adjusted, they can assign grades for accuracy and not have to worry about the average grade. (They can still give all A’s but this will no longer be a benefit to the individual students after the course is over.)
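
To make the adjustment idea concrete, here’s the simplified sketch promised above. This is not Val’s actual method, which treats letter grades as ordered categories in an item-response framework; it’s a linear stand-in using lme4 on simulated data, just to show how partial pooling separates student ability from course-level grading leniency.

```r
# Simplified sketch of grade adjustment via partial pooling (a linear
# stand-in, not Val Johnson's actual ordinal item-response model).
library(lme4)

# Simulate fake transcripts: students differ in ability, courses differ
# in grading leniency, and each student takes 8 randomly chosen courses.
set.seed(123)
n_students <- 200
n_courses  <- 30
ability  <- rnorm(n_students, 0, 0.5)  # latent student ability
leniency <- rnorm(n_courses,  0, 0.4)  # course-level grade inflation
grades <- do.call(rbind, lapply(1:n_students, function(s) {
  courses <- sample(n_courses, 8)
  data.frame(student = factor(s, levels = 1:n_students),
             course  = factor(courses, levels = 1:n_courses),
             gpa     = 3.3 + ability[s] + leniency[courses] + rnorm(8, 0, 0.3))
}))

# Partial pooling: student intercepts are the "adjusted" abilities,
# corrected for how leniently each student's particular courses grade.
fit <- lmer(gpa ~ 1 + (1 | student) + (1 | course), data = grades)
adjusted_ability <- ranef(fit)$student[, 1]
course_leniency  <- ranef(fit)$course[, 1]
cor(adjusted_ability, ability)  # recovers the simulated abilities well
```

The real proposal models each grade as an ordered outcome with course-specific cutpoints, but the logic is the same: adjust each student’s record for the grading practices of the particular courses they happened to take.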

I still like Val’s idea, but at this point there may be too much grade inflation at some schools for it to work. At some point there is so little signal left that you can’t recover the information you want.

OK, at this point you might say, sure, grades are B.S., whatever. But that puts us in the worse position of implicitly requiring students to have other qualifications. At best, this sends students to interesting research projects and internships, but many times it just pushes them into trying to hop on projects to get credentials. Rather than writing some crappy NeurIPS paper and then learning the tricks to get it accepted, I think they’d be better off taking interesting courses in college, working hard, doing well on exams, and writing good term papers.

How far can exchangeability get us toward agreeing on individual probability?

This is Jessica. What’s the common assumption behind the following? 

    • Partial pooling of information over groups in hierarchical Bayesian models 
    • In causal inference of treatment effects, saying that the outcome you would get if you were treated (Y^a) shouldn’t change depending on whether you are assigned the treatment (A) or not
    • Acting as if we believe a probability is the “objective chance” of an event even if we prefer to see probability as an assignment of betting odds or degrees of belief to an event

The question is rhetorical, because the answer is in the post title. These are all examples where statistical exchangeability is important. Exchangeability says the joint distribution of a set of random variables is unaffected by the order in which they are observed. 

Exchangeability has broad implications. Lately I’ve been thinking about it as it comes up at the ML/stats intersection, where it’s critical to various methods: achieving coverage in conformal prediction, using counterfactuals in analyzing algorithmic fairness, identifying independent causal mechanisms in observational data, etc. 

This week it came up in the course I’m teaching on prediction for decision-making. A student asked whether exchangeability was of interest because often people aren’t comfortable assuming data is IID. I could see how this might seem like the case given how application-oriented papers (like on conformal prediction) sometimes talk about the exchangeability requirement as an advantage over the usual assumption of IID data. But this misses the deeper significance, which is that it provides a kind of practical consensus between different statistical philosophies. This consensus, and the ways in which it’s ultimately limited, is the topic of this post.

Interpreting the probability of an individual event

One of the papers I’d assigned was Dawid’s “On Individual Risk,” which, as you might expect, talks about what it means to assign probability to a single event. Dawid distinguishes “groupist” interpretations of probability that depend on identifying some set of events, like the frequentist definition of probability as the limiting frequency over hypothetical replications of the event, from individualist interpretations, like a “personal probability” reflecting the beliefs of some expert about some specific event conditioned on some prior experience. For the purposes of this discussion, we can put Bayesians (subjective, objective, and pragmatic, as Bob describes them here) in the latter personalist-individualist category. 

On the surface, the frequentist treatment of probability as an “objective” quantity appears incompatible with the individualist notion of probability as a descriptor of a particular event from the perspective of the particular observer (or expert) ascribing beliefs. If you have a frequentist and a personalist thinking about the next toss of a coin, for example, you would expect the probability the personalist assigns to depend on their joint distribution over possible sequences of outcomes, while the frequentist would be content to know the limiting probability. But de Finetti’s theorem shows that if you believe a sequence of events to be exchangeable, then your beliefs about those random variables are indistinguishable from a conception of independent events with some underlying probability. Given a sequence of exchangeable Bernoulli random variables X1, X2, X3, … , you can think of a draw from their joint distribution as sampling p ~ mu, then drawing X1, X2, X3, … from Bernoulli(p) (where mu is a distribution on [0,1]). So the frequentist and personalist can both agree, under exchangeability, that p is meaningful for decision making. David Spiegelhalter recently published an essay on interpreting probability that he ended by commenting on how remarkable this pragmatic consensus is.
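
Here’s a small R simulation of that construction (my own illustration; the Beta mixing distribution is just one arbitrary choice of mu). The resulting sequence is exchangeable but not independent: observing X1 changes the predictive probability for X2.

```r
# de Finetti-style construction: an exchangeable Bernoulli sequence as a
# mixture of i.i.d. sequences. Beta(2, 2) is an arbitrary choice of mu.
set.seed(1)
n_sims <- 1e5
p  <- rbeta(n_sims, 2, 2)   # p ~ mu
x1 <- rbinom(n_sims, 1, p)  # X1 | p ~ Bernoulli(p)
x2 <- rbinom(n_sims, 1, p)  # X2 | p ~ Bernoulli(p), conditionally independent

# (X1, X2) has the same joint distribution as (X2, X1), so the sequence is
# exchangeable, but X1 and X2 are not independent: X1 = 1 is evidence that
# p is high, which raises the predictive probability that X2 = 1.
mean(x2[x1 == 1])  # about 0.6  ( = E[p^2] / E[p] for Beta(2, 2) )
mean(x2[x1 == 0])  # about 0.4
mean(x2)           # about 0.5, the marginal probability
```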

But Dawid’s goal is to point out ways in which the apparent alignment is not as satisfactory as it may seem in resolving the philosophical chasm. It’s more like we’ve thrown a (somewhat flimsy) plank over it. Exchangeability may sometimes get us across by allowing the frequentist and personalist to coordinate in terms of actions, but we have to be careful how much weight we put on this.  

The reference set depends on the state of information

One complication is that the personalist’s willingness to assume exchangeability depends on the information they have. Dawid uses the example of trying to predict the exam score of some particular student. If they have no additional information to distinguish the target student from the rest, the personalist might be content to be given an overall limiting relative frequency p of failure across a set of students. But as soon as they learn something that makes the individual student unique, p is no longer the appropriate reference for the individual student’s probability of failing the exam.

As an aside, this doesn’t mean that exchangeability is only useful if we think of members of some exchangeable set as identical. There may still be practical benefits of learning from the other students in the context of a statistical model, for example. See, e.g., Andrew’s previous post on exchangeability as an assumption in hierarchical models, where he points out that assuming exchangeability doesn’t necessarily mean that you believe everything is indistinguishable, and if you have additional information distinguishing groups, you can incorporate that in your model as group-level predictors.

But for the purposes of personalists and frequentists agreeing on a reference for the probability of a specific event, the dependence on information is not ideal. Can we avoid this by making the reference set more specific? What if we’re trying to predict a particular student’s score on a particular exam in a world where that particular student is allowed to attempt the same exam as many times as they’d like? Now that the reference group refers to the particular student and particular exam, would the personalist be content to accept the limiting frequency as the probability of passing the next attempt? 

The answer is, not necessarily. This imaginary world still can’t get us to the generality we’d need for exchangeability to truly reconcile a personalist and frequentist assessment of the probability. 

Example where the limiting frequency is constructed over time

Dawid illustrates this by introducing a complicating (but not at all unrealistic) assumption: that the student’s performance on their next try on the exam will be affected by their performance on the previous tries. Now we have a situation where the limiting frequency of passing on repeated attempts is constructed over time. 

As an analogy, consider drawing balls from an urn that starts out containing 1 red ball and 1 green ball. Upon drawing a ball, we immediately return it and add an additional ball of the same color. At each draw, each ball in the urn is equally likely to be drawn, and the sequence of colors is exchangeable.

Given that p is not known, which do you think the personalist would prefer to consider as the probability of a red ball on the first draw: the proportion of red balls currently in the urn, or the limiting frequency of drawing a red ball over the entire sequence? 

Turns out in this example, the distinction doesn’t actually matter: the personalist should just bet 0.5. So why is there still a problem in reconciling the personalist assessment with the limiting frequency?

The answer is that we now have a situation where knowledge of the dynamic aspect of the process makes it seem contradictory for the personalist to trust the limiting frequency. If they know it’s constructed over time, then on what grounds is the personalist supposed to assume the limiting frequency is the right reference for the probability on the first draw? This gets at the awkwardness of using behavior in the limit to think about individual predictions we might make.
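
Here’s a quick R simulation of the urn to make the point vivid (a sketch of the example, not code from Dawid’s paper). The personalist’s 0.5 for the first draw is fixed in advance, but the long-run frequency of red is itself random: across realizations it is spread out roughly uniformly between 0 and 1, so there is no single limiting frequency sitting out there to defer to.

```r
# Polya urn: start with 1 red and 1 green ball; after each draw, return the
# ball and add another of the same color. The sequence of colors is exchangeable.
polya_fraction_red <- function(n_draws) {
  red <- 1; green <- 1
  for (i in 1:n_draws) {
    if (runif(1) < red / (red + green)) red <- red + 1 else green <- green + 1
  }
  (red - 1) / n_draws  # fraction of the draws that came up red
}

set.seed(2)
long_run <- replicate(2000, polya_fraction_red(1000))

mean(long_run)  # about 0.5, matching the bet on the first draw
hist(long_run)  # but each realized long-run frequency is roughly Uniform(0, 1)
```

Each run of the simulation constructs its own limiting frequency as it goes, which is exactly the awkwardness described above.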

Why this matters in the context of algorithmic decision-making

This example is related to some of my prior posts on why calibration does not satisfy everyone as a means of ensuring good decisions. The broader point in the context of the course I’m teaching is that when we’re making risk predictions (and subsequent decisions) about people, such as in deciding who to grant a loan or whether to provide some medical treatment, there is inherent ambiguity in the target quantity. Often there are expectations that the decision-maker will do their best to consider the information about that particular person and make the best decision they can. What becomes important is not so much that we can guarantee our predictions behave well as a group (e.g., calibration) but that we understand how we’re limited by the information we have and what assumptions we’re making about the reference group in an individual case. 

“The terror among academics on the covid origins issue is like nothing we’ve ever seen before”

Michael Weissman sends along this article he wrote with a Bayesian evaluation of Covid origins probabilities. He writes:

It’s a peculiar issue to work on. The terror among academics on the covid origins issue is like nothing we’ve ever seen before.

I was surprised he was talking about “terror” . . . People sometimes send me stuff about covid origins and it all seems civil enough. I guess I’m too far out of the loop to have noticed this! That said, there have been times that I’ve been attacked for opposing some aspect of the scientific establishment, so I can believe it.

I asked Weissman to elaborate, and he shared some stories:

A couple of multidisciplinary researchers from prestigious institutions were trying to write up a submittable paper. They were leaning heavily zoonotic, at least before we talked. They said they didn’t publish because they could not get any experts to talk with them. They said they prepared formal legal papers guaranteeing confidentiality but it wasn’t enough. I guess people thought that their zoo-lean was a ruse.

The extraordinarily distinguished computational biologist Nick Patterson tells me that a prospective collaborator cancelled their collaboration because Patterson had blogged that he thought the evidence pointed to a lab leak. It is not normal for a scientist to drop an opportunity to collaborate with someone like Patterson over a disagreement on an unrelated scientific question. You can imagine the effect of that environment on younger, less established scientists.

Physicist Richard Muller at Berkeley tried asking a biology colleague about an origins-related technical issue. The colleague blew him off. Muller asked if a student or postdoc could help. No way: far too risky, it would ruin their career. (See around minute 43 here.)

Come to think about it, I got attacked (or, at least, misrepresented) for some of my covid-related research too; the story is here. Lots of aggressive people out there in the academic research and policy communities.

Also, to put this in the context of the onset of covid in 2020, whatever terror we have been facing by disagreeing with powerful people in academia and government is nothing compared to the terror faced by people who were exposed to this new lethal disease. Covid is now at the level of a bad flu season, so still pretty bad but much less scary than a few years ago.

What are my goals? What are their goals? (How to prepare for that meeting.)

Corresponding with someone who had a difficult meeting coming up, where she was not sure how much to trust the person she was meeting with, I gave the following advice:

Proceed under the assumption that they want to do things right. I say this because if they’re gonna be defensive, then it doesn’t matter what you say; it’s not like you’re gonna sweet-talk them into opening up. But if they do want to do better, then maybe there is some hope.

My correspondent responded that the person she was meeting hadn’t been helpful up to this point: “I always assume (and hope for) good intentions and a desire to do better. But I’ll admit I’m feeling less positive after a few days of not getting an answer.”

I continued:

Many of these sorts of meetings require negotiation, and good negotiation often involves withholding of information or outright deception, and I’m not good at either of these things, so I don’t even try. Instead I try some of the classic “Getting to Yes” strategies:
(1) Before the meeting, I ask myself what are my goals: my short-term goals for the meeting and my medium and long-term goals that I’m aiming for.
(2) During the meeting, I explicitly ask the other parties what their goals are.

When I think of various counterproductive interactions I’ve had in the past, often it seems this has come in part because I was not clear on my goals or on the goals of the other parties; as a result we butted heads when we could’ve found a mutually-beneficial solution. I’m including here some interactions with bad actors: liars, cheats, etc. Even when working with people you can’t trust, the general principles can apply.

It does not always make sense to tell the other parties what your goals are! But, don’t worry, most people won’t ever ask, as they will typically be focused on trying to stand firm on some micro-issue or another. Kinda like how amateur poker players are notorious for looking over and over again at their own hole cards and not looking enough at you.

The above advice may seem silly because you’re not involved in a negotiation at all! Even so, if you have a sense of what your goals are and what their goals are, this could be helpful. And be careful to distinguish goals from decision options. A goal is “I would like X to happen”; a decision option is “I will do Y.” It’s natural to think in terms of decision options, but I think this is limiting, compared to thinking about goals.

Anyway, that’s just my take from a mixture of personal experience and reading on decision making; I’ve done no direct research on the topic.

The above techniques are not any sort of magic; they’re just an attempt to focus on what is important.

Newly published in 2024

Our big item is the book Active Statistics: Stories, Games, Problems, and Hands-on Demonstrations for Applied Regression and Causal Inference (Andrew Gelman and Aki Vehtari).

Then there are the recently published research articles:

It’s a good mix: some technical work, some research methods, some applications, some teaching material, some reviews. Lots of collaborators!

If you want to see what’s coming next, you can check out the lists of unpublished and unwritten articles.

Here are our audios and videos from the past year:

Last but not least, our blog posts of 2024. We had 540 posts with 11,483 total comments:

It’s bezzle time: The Dean of Engineering at the University of Nevada gets paid $372,127 a year and wrote a paper that’s so bad, you can’t believe it. (204 comments)
In some cases academic misconduct doesn’t deserve a public apology (177 comments)
Getting a pass on evaluating ways to improve science (154 comments)
Niall Ferguson, J. D. Vance, George Washington, and Jesus (129 comments)
Stabbers gonna stab — fraud edition (124 comments)
Suspicious data pattern in recent Venezuelan election (111 comments)
This well-known paradox of R-squared is still buggin me. Can you help me out? (105 comments)
Reflections on the recent election (103 comments)
The mainstream press is failing America (UK edition) (99 comments)
The Behavioural Insights Team decided to scare people. (98 comments)
If you want to play women’s tennis at the top level, there’s a huge benefit to being ____. Not just ____, but exceptionally ___, outlier-outlier ___. (And what we can learn about social science from this stylized fact.) (94 comments)
“Is it really ‘the economy, stupid’?” (91 comments)
Bad stuff going down at the American Sociological Association (84 comments)
How to think about the claim by Justin Wolfers that “the income of the average American will double approximately every 39 years”? (84 comments)
Polling averages and political forecasts and what do you really think is gonna happen in November? (83 comments)
Intelligence is whatever machines cannot (yet) do (82 comments)
On the border between credulity and postmodernism: The case of the UFO’s-as-space-aliens media insiders (78 comments)
Understanding p-values: Different interpretations can be thought of not as different “philosophies” but as different forms of averaging. (78 comments)
“Not once in the twentieth century . . . has a single politician, actor, athlete, or surgeon emerged as a first-rate novelist, despite the dismayingly huge breadth of experience each profession affords.” (77 comments)
Abraham Lincoln and confidence intervals (77 comments)
What is the prevalence of bad social science? (76 comments)
Whassup with those economists who predicted a recession that then didn’t happen? (74 comments)
Feedback on the blog—this is your chance! (74 comments)
If school funding doesn’t really matter, why do people want their kid’s school to be well funded? (69 comments)
“It’s a very short jump from believing kale smoothies are a cure for cancer to denying the Holocaust happened.” (69 comments)
On lying politicians and bullshitting scientists (67 comments)
My comments on Nate Silver’s comments on the Fivethirtyeight election forecast (66 comments)
My suggestion for the 2028 Olympics (66 comments)
The statistical controversy over “White Rural Rage: the Threat to American Democracy” (and a comment about post-publication review) (65 comments)
HMC fails when you initialize at the mode (63 comments)
Selection bias leads to confusion about the relative stability of deterministic and stochastic algorithms (63 comments)
Wendy Brown: “Just as nothing is more corrosive to serious intellectual work than being governed by a political programme (whether that of states, corporations, or a revolutionary movement), nothing is more inapt to a political campaign than the unending reflexivity, critique and self-correction required of scholarly inquiry.” (63 comments)
Bayesians are frequentists. (62 comments)
More red meat for you AI skeptics out there (61 comments)
The River, the Village, and the Fort: Nate Silver’s new book, “On the Edge” (61 comments)
“Zombie Ideas” in psychology, from personality profiling to lucky golf balls (61 comments)
Andrew Gelman is not the science police because there is no such thing as the science police (60 comments)
With journals, it’s all about the wedding, never about the marriage. (59 comments)
Stanford medical school professor misrepresents what I wrote (but I kind of understand where he’s coming from) (59 comments)
Hey—let’s collect all the stupid things that researchers say in order to deflect legitimate criticism (58 comments)
Bayesian statistics: the three cultures (58 comments)
A quick simulation to demonstrate the wild variability of p-values (58 comments)
How to think about the effect of the economy on political attitudes and behavior? (57 comments)
Forking paths in LLMs for data analysis (57 comments)
Two kings, a royal, a knight, and three princesses walk into a bar (Nobel prize edition) (57 comments)
Prediction markets in 2024 and poll aggregation in 2008 (57 comments)
The New York Young Republican Club (56 comments)
Holes in Bayesian statistics (my talk tomorrow at the Bayesian conference, based on work with Yuling) (56 comments)
Abortion crime controversy update (55 comments)
Some references and discussions on the foundations of probability—not the math so much as its connection to the real world, including the claim that “Pr(aliens exist on Neptune that can rap battle) = .137” (54 comments)
Props to the liberal anticommunists of the 1930s-1950s (54 comments)
Interpreting recent Iowa election poll using a rough Bayesian partition of error (54 comments)
How would the election turn out if Biden or Trump were replaced by a different candidate? (53 comments)
Uncertainty in games: How to get that balance so that there’s a motivation to play well, but you can still have a chance to come back from behind? (52 comments)
Decisions of parties to run moderate or extreme candidates (52 comments)
Benefit of Stanford: Are there connections between unethical behavior in science promotion and cheating in private life? (51 comments)
When all else fails, add a code comment (50 comments)
The feel-good open science story versus the preregistration (who do you think wins?) (49 comments)
Deadwood (49 comments)
Why are all these school cheating scandals happening? (48 comments)
Is marriage associated with happiness for men or for women? Or both? Or neither? (48 comments)
He took public funds and falsified his data. Are they gonna make him pay back the $19 million? (48 comments)
Arnold Foundation and Vera Institute argue about a study of the effectiveness of college education programs in prison. (46 comments)
How do you interpret standard errors from a regression fit to the entire population? (46 comments)
It’s Harvard time, baby: “Kerfuffle” is what you call it when you completely botched your data but you don’t want to change your conclusions. (46 comments)
Where have all the count words gone? In defense of “fewer” and “among” (45 comments)
Arguing about bitcoin (44 comments)
Polling by asking people about their neighbors: When does this work? Should people be doing more of it? And the connection to that French dude who bet on Trump (44 comments)
Prediction isn’t everything, but everything is prediction (43 comments)
Who wrote the music for In My Life? Three Bayesian analyses (43 comments)
Honesty and transparency are not enough: politics edition (43 comments)
What’s gonna happen between now and November 5? (42 comments)
What’s the story behind that paper by the Center for Open Science team that just got retracted? (42 comments)
Torment executioners in Reno, Nevada, keep tormenting us with their publications. (41 comments)
“You want to gather data to determine which of two students is a better basketball shooter. You plan to have each student take N shots and then compare their shooting percentages. Roughly how large does N have to be for you to have a good chance of distinguishing a 30% shooter from a 40% shooter?” (41 comments)
Why are we making probabilistic election forecasts? (and why don’t we put so much effort into them?) (41 comments)
The appeal of New York Times columnist David Brooks . . . Yeah, I know this all sounds like a nutty “it’s wheels within wheels, man” sort of argument, but I’m serious here! (40 comments)
Sympathy for the Nudgelords: Vermeule endorsing stupid and dangerous election-fraud claims and Levitt promoting climate change denial are like cool dudes in the 60s wearing Che T-shirts and thinking Chairman Mao was cool—we think they’re playing with fire, they think they’re cute contrarians pointing out contradictions in the system. For a certain kind of person, it’s fun to be a rogue. (40 comments)
Freakonomics does it again (not in a good way). Jeez, these guys are credulous: (40 comments)
Mister P and Stan go to Bangladesh . . . (39 comments)
“Things are Getting So Politically Polarized We Can’t Measure How Politically Polarized Things are Getting” (39 comments)
Opposition (38 comments)
“My basic question is do we really need data to be analysed by both methods?” (38 comments)
The most interesting part of the story is that the publisher went through all these steps of reviewing and revising. If they just want to make money by publishing crap, why bother engaging outside reviewers at all? (38 comments)
A new argument for estimating the probability that your vote will be decisive (37 comments)
Kamala Harris gets coveted xkcd endorsement. (37 comments)
Implicitly denying the controversy associated with the Implicit Association Test. (Whassup with the American Association of Arts & Sciences?) (37 comments)
Simulation to understand two kinds of measurement error in regression (36 comments)
I strongly doubt that any human has ever typed the phrase, “torment executioners,” on any keyboard—except, of course, in discussions such as this. (36 comments)
What is the purpose of a methods section? (36 comments)
The four principles of Barnard College: Respect, empathy, kindness . . . and censorship? (35 comments)
Whooping cough! How to respond to fatally-flawed papers? An example, in a setting where the fatal flaw is subtle, involving a confounding of time and cohort effects (35 comments)
The election is coming: What forecasts should we trust? (35 comments)
Prediction markets and the need for “dumb money” as well as “smart money” (35 comments)
“The Exceptions: Nancy Hopkins, MIT, and the Fight for Women in Science” (35 comments)
Shreddergate! A fascinating investigation into possible dishonesty in a psychology experiment (34 comments)
20 years of blogging . . . What have been your favorite posts? (34 comments)
That’s what happens when you try to run the world while excluding 99.8% of the population (34 comments)
The Theory and Practice of Oligarchical Collectivism (34 comments)
What’s the problem, “math snobs” or rich dudes who take themselves too seriously and are enabled in that by the news media? (33 comments)
(Trying to) clear up a misunderstanding about decision analysis and significance testing (33 comments)
“Take a pass”: New contronym just dropped. (33 comments)
Well, today we find our heroes flying along smoothly… (33 comments)
“How a simple math error sparked a panic about black plastic kitchen utensils”: Does it matter when an estimate is off by a factor of 10? (33 comments)
“Why do medical tests always have error rates?” (32 comments)
Toward a unified theory of bad science and bad scholarship (32 comments)
Michael Clayton in NYC (32 comments)
I’ve been mistaken for a chatbot (31 comments)
Here’s my excuse for using obsolete, sub-optimal, or inadequate statistical methods or using a method irresponsibly. (31 comments)
“Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001].” (30 comments)
“Don’t feed the trolls” and the troll semi-bluff (30 comments)
Bad parenting in the news, also, yeah, lots of kids don’t believe in Santa Claus (30 comments)
What do the data say about Kamala Harris’s electability? (30 comments)
When is calibration enough? (30 comments)
“I wonder just what it takes to get people to conclude that a research seam has been mined to the point of exhaustion.” (30 comments)
Extinct Champagne grapes? I can be even more disappointed in the news media (29 comments)
Relating t-statistics and the relative width of confidence intervals (29 comments)
Report of average change from an Alzheimer’s drug: I don’t get the criticism here. (29 comments)
Obnoxious receipt from Spirit Airlines (29 comments)
The recent Iranian election: Should we be suspicious that the vote totals are all divisible by 3? (29 comments)
Bill James hangs up his hat. Also some general thoughts about book writing vs. blogging. Also I push back against James’s claim about sabermetrics and statistics. (29 comments)
Unsolicited feedback on your research from the LLMs at large (29 comments)
Defining statistical models in JAX? (29 comments)
She wants to know what are best practices on flagging bad responses and cleaning survey data and detecting bad responses. Any suggestions from the tidyverse or crunch.io? (29 comments)
What to do with age? (including a regression predictor linearly and also in discrete steps) (28 comments)
“Our troops with aching hearts were obliged to fire a part of the town as a punishment.” (28 comments)
The Rider (28 comments)
“Trivia question for you. I kept temperature records for 100 days one year in Boston, starting August 15th (day “0”). What would you guess is the correlation between day# and temp? r=???” (28 comments)
Anti-immigration attitudes: they didn’t want a bunch of Hungarian refugees coming in the 1950s (28 comments)
What genre of writing is AI-generated poetry? (28 comments)
“I work in a biology lab . . . My PI proposed a statistical test that I think is nonsense. . .” (28 comments)
Blog is adapted to laptops or desktops, not to smartphones or pads. (27 comments)
“On the uses and abuses of regression models: a call for reform of statistical practice and teaching”: We’d appreciate your comments . . . (27 comments)
Inspiring story from a chemistry classroom (27 comments)
Which books, papers, and blogs are in the Bayesian canon? (27 comments)
3M misconduct regarding knowledge of “forever chemicals”: As is so often the case, the problem was in open sight for a long time before anything was done (27 comments)
God is in every leaf of every tree—comic book movies edition. (26 comments)
Our new Substack newsletter: The Future of Statistical Modeling! (26 comments)
Pinker was right, I was wrong. (26 comments)
Does this study really show that lesbians and bisexual women die sooner than straight women? Disparities in Mortality by Sexual Orientation in a Large, Prospective JAMA Paper (26 comments)
A cook, a housemaid, a gardener, a chauffeur, a nanny, a philosopher, and his wife . . . (26 comments)
MCMC draws cannot fill the posterior in high dimensions (26 comments)
Beyond junk science: How to go forward (26 comments)
Instability of win probability in election forecasts (with a little bit of R) (26 comments)
Mark Twain on chatbots (26 comments)
Fake data on the honeybee waggle dance, followed by the inevitable “It is important to note that the conclusions of our studies remain firm and sound.” (26 comments)
Make a hypothesis about what you expect to see, every step of the way. A manifesto: (26 comments)
“Of course, this could conceivably be a case of near unbelievable luck: A flawed analysis based on wrong assumptions gave an unusually large causal effect estimate – but the misguided result just happened to be correct. We can imagine how the research team huddled nervously around the computer terminal biting their nails and silently praying as they executed their updated Stata code, only to erupt in joy and celebration as the results appeared on screen and revealed they were right all along. . . .” (26 comments)
A very interesting discussion by Roy Sorensen of the interesting-number paradox (26 comments)
I love this paper but it’s barely been noticed. (25 comments)
Conformal prediction and people (25 comments)
More on the disconnect between who voters support and what they support (25 comments)
Ancestor-worship in academia: Where does it happen? (25 comments)
A welcome rant on betting, knowledge, belief, and the foundations of probability (25 comments)
Oh no Stanford no no no not again please make it stop (25 comments)
Evidence-based Medicine Eats Itself, and How to do Better (my talk at USC this Friday) (25 comments)
This one might possibly be interesting. (25 comments)
4 different meanings of p-value (and how my thinking has changed) (25 comments)
Credit where due to NPR regarding science data fraud, and here’s how they can do even better (25 comments)
Why isn’t Barack Obama out there giving political speeches? (24 comments)
“Exclusive: Embattled dean accused of plagiarism in NSF report” (yup, it’s the torment executioners) (24 comments)
Mindlessness in the interpretation of a study on mindlessness (and why you shouldn’t use the word “whom” in your dating profile) (24 comments)
“Here’s the Unsealed Report Showing How Harvard Concluded That a Dishonesty Expert Committed Misconduct” (24 comments)
Again on the role of elite media in spreading UFOs-as-space-aliens and other bad ideas (24 comments)
The piranha problem: Large effects swimming in a small pond (24 comments)
Crap papers with crude political agendas published in scientific journals: A push-pull problem (24 comments)
log(A + x), not log(1 + x) (24 comments)
B-school prof data sleuth lawsuit fails (24 comments)
Nonsampling error and the anthropic principle in statistics (24 comments)
A 10% swing in win probability corresponds (approximately) to a 0.4% swing in predicted vote (24 comments)
How large is that treatment effect, really? (My talk at the NYU economics seminar, Thurs 7 Mar 18 Apr) (23 comments)
“I was left with an overwhelming feeling that the World Values Survey is simply a vehicle for telling stories about values . . .” (23 comments)
Grappling with uncertainty in forecasting the 2024 U.S. presidential election (23 comments)
Google is violating the First Law of Robotics. (23 comments)
A question for Nate Cohn at the New York Times regarding a claim about adjusting polls using recalled past vote (23 comments)
New Course: Prediction for (Individualized) Decision-making (23 comments)
Bias remaining after adjusting for pre-treatment variables. Also the challenges of learning through experimentation. (23 comments)
“Accounting for Nonresponse in Election Polls: Total Margin of Error” (23 comments)
This post is not really about Aristotle. (22 comments)
The free will to repost (22 comments)
Paper cited by Stanford medical school professor retracted—but even without considering the reasons for retraction, this paper was so bad that it should never have been cited. (22 comments)
“AI” as shorthand for turning off our brains. (This is not an anti-AI post; it’s a discussion of how we think about AI.) (22 comments)
Fewer kids in our future: How historical experience has distorted our sense of demographic norms (22 comments)
More on the oldest famous person ever (just considering those who lived to at least 104) (22 comments)
Freakonomics asks, “Why is there so much fraud in academia,” but without addressing one big incentive for fraud, which is that, if you make grabby enough claims, you can get featured in . . . Freakonomics! (22 comments)
Fake stories in purported nonfiction (22 comments)
Applications of (Bayesian) variational inference? (22 comments)
The immediate victims of the con would rather act as if the con never happened. Instead, they’re mad at the outsiders who showed them that they were being fooled. (21 comments)
The contrapositive of “Politics and the English Language.” One reason writing is hard: (21 comments)
“A passionate group of scientists determined to revolutionize the traditional publishing model in academia” (21 comments)
It’s Ariely time! They had a preregistration but they didn’t follow it. (21 comments)
Do research articles have to be so one-sided? (21 comments)
Adverse Adult Research Outcomes Increased After Increased Willingness of Public Health Journals to Publish Absolute Crap (21 comments)
The NYT sinks to a new low in political coverage (21 comments)
Calibration is sometimes sufficient for trusting predictions. What does this tell us when human experts use model predictions? (21 comments)
Keith O’Rourke’s final published paper: “Statistics as a social activity: Attitudes toward amalgamating evidence” (21 comments)
Close Reading Archive (21 comments)
Ben Shneiderman’s Golden Rules of Interface Design (20 comments)
Clinical trials that are designed to fail (20 comments)
Zotero now features retraction notices (20 comments)
“He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it.” (20 comments)
Evidence, desire, support (20 comments)
Now here’s a tour de force for ya (20 comments)
To what extent is psychology different from other fields regarding fraud and replication problems? (20 comments)
What to do with election forecasts after Biden is replaced on the ticket? Also something on prediction markets. (20 comments)
The Rise and Fall of the Rock Stars (20 comments)
Carroll/Langer: Credulous, scientist-as-hero reporting from a podcaster who should know better (20 comments)
Who understands alignment anyway (19 comments)
“‘Pure Craft’ Is a Lie” and other essays by Matthew Salesses (19 comments)
Myths of American history from the left, right, and center; also a discussion of the “Why everything you thought you knew was wrong” genre of book. (19 comments)
Sorry, NYT, but, yes, “Equidistant Letter Sequences in the Book of Genesis” was junk science (19 comments)
Clarke’s Law, and who’s to blame for bad science reporting (18 comments)
Preregistration is a floor, not a ceiling. (18 comments)
Hey! Here’s a study where all the preregistered analyses yielded null results but it was presented in PNAS as being wholly positive. (18 comments)
“Andrew, you are skeptical of pretty much all causal claims. But wait, causality rules the world around us, right? Plenty have to be true.” (18 comments)
Not eating sweet potatoes: Is that gonna kill me? (18 comments)
Three takes on the protests at Columbia University (18 comments)
Marquand. (18 comments)
5 different reasons why it’s important to include pre-treatment variables when designing and analyzing a randomized experiment (or doing any causal study) (18 comments)
How did the press do on that “black spatula” story? Not so great. (18 comments)
Here’s a sad post for you to start the new year. The Onion (ok, an Onion-affiliate site) is plagiarizing. For reals. (17 comments)
Jonathan Bailey vs. Stephen Wolfram (17 comments)
The data are on a 1-5 scale, the mean is 4.61, and the standard deviation is 1.64 . . . What’s so wrong about that?? (17 comments)
What to make of implicit biases in LLM output? (17 comments)
Statistics Blunder at the Supreme Court (17 comments)
Design analysis is not just about statistical significance and power; it’s relevant for Bayesian inference too. (17 comments)
What can aspiring political moderates learn from the example of Nelson Rockefeller? (17 comments)
Chutzpah is their superpower (Dominic Sandbrook edition) (17 comments)
Stan’s autodiff is 4x faster than JAX on CPU but 5x slower on GPU (in one eval) (17 comments)
“Toward reproducible research: Some technical statistical challenges” and “The political content of unreplicable research” (my talks at Berkeley and Stanford this Wed and Thurs) (17 comments)
Clybourne Park. And a Jamaican beef patty. (But no Gray Davis, no Grover Norquist, no rabbi.) (17 comments)
Columbia Surgery Prof Fake Data Update . . . (yes, he’s still being promoted on the university webpage) (17 comments)
Objects of the class “David Owen” (17 comments)
Why does this guy have 2 gmail accounts? (17 comments)
Why am I willing to bet you $100-1000 there will be a Nobel Prize for Adaptive Experimentation in the next 40 years? (17 comments)
Regarding the use of “common sense” when evaluating research claims (16 comments)
“Whistleblowers always get punished” (16 comments)
Every time Tyler Cowen says, “Median voter theorem still underrated! Hail Anthony Downs!”, I’m gonna point him to this paper . . . (16 comments)
People have needed rituals to turn data into truth for many years. Why would we be surprised if many people now need procedural reforms to work? (16 comments)
No, it’s not “statistically implausible” when results differ between studies, or between different groups within a study. (16 comments)
I just got a strange phone call from two people who claimed to be writing a news story. They were asking me very vague questions and I think it was some sort of scam. I guess this sort of thing is why nobody answers the phone anymore. (16 comments)
“A bizarre failure in the review process at PNAS” (16 comments)
From what body part does the fish rot? (16 comments)
Put multiple graphs on a page: that’s what Nathan Yau says, and I agree. (16 comments)
Remember that paper that reported contagion of obesity? How’s it being cited nowadays? (16 comments)
It’s martingale time, baby! How to evaluate probabilistic forecasts before the event happens? Rajiv Sethi has an idea. (Hint: it involves time series.) (16 comments)
The odd non-spamness of some spam comments (16 comments)
“And while I don’t really want a back-and-forth . . .” (15 comments)
“When will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?” (15 comments)
Scientific publishers busily thwarting science (again) (15 comments)
A suggestion on how to improve the broader impacts statement requirement for AI/ML papers (15 comments)
Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment) (15 comments)
What is your superpower? (15 comments)
Who is the Stephen Vincent Benet of today? (15 comments)
“The Active Seating Zone (An Educational Experiment)” (15 comments)
The JAMA effect plus news media echo chamber: More misleading publicity on that problematic claim that lesbians and bisexual women die sooner than straight women (15 comments)
Heroes and Villains: The Effects of Identification Strategies on Strong Causal Claims in France (15 comments)
Present Each Other’s Posters: An update after 15 years (15 comments)
Different perspectives on the claims in the paper, The Colonial Origins of Comparative Development (15 comments)
What makes an MCMC sampler GPU-friendly? (15 comments)
Meta-analysis with a single study (15 comments)
Those correction notices, in full. (Yes, it’s possible to directly admit and learn from error.) (15 comments)
Practical issues with calibration for every group and every decision problem (15 comments)
Presidential campaign effects are small. (15 comments)
Bayesian inference (and mathematical reasoning more generally) isn’t just about getting the answer; it’s also about clarifying the mapping from assumptions to inference to decision. (15 comments)
The Lakatos soccer training (14 comments)
“Science as Verified Trust” (14 comments)
Hand-drawn Statistical Workflow at Nelson Mandela (14 comments)
Combining multiply-imputed datasets, never easy (14 comments)
Simulation from a baseline model as a way to better understand your data: This is what “hypothesis testing” should be. (14 comments)
Age gaps between spouses in U.S., U.K., and India (14 comments)
“Alphabetical order of surnames may affect grading” (14 comments)
Piranhas for “omics”? (14 comments)
Pete Rose and gambling addiction: An insight and a question (14 comments)
Update on that politically-loaded paper published in Demography that I characterized as a “hack job”: Further post-publication review (14 comments)
Bad science as genre fiction: I think there’s a lot to be said for this analogy! (14 comments)
If you wanted to be a top tennis player in the late 1930s, there was a huge benefit to being a member of ____. Or to being named ____. (14 comments)
The Village Voice in the 1960s/70s and blogging in the early 2000s (14 comments)
Social penumbras predict political attitudes (my talk at Harvard on Monday Feb 12 at noon) (13 comments)
A new piranha paper (13 comments)
“There is a war between the ones who say there is a war, and the ones who say there isn’t.” (13 comments)
Another opportunity in MLB for Stan users: the Phillies are hiring (13 comments)
Statistical factuality versus practicality versus poetry (13 comments)
Nicholas Carlini on LLMs and AI for research programmers (13 comments)
Movements in the prediction markets, and going beyond a black-box view of markets and prediction models (13 comments)
ChatGPT o1-preview can code Stan (13 comments)
What if the polls are right? (some scatterplots, and some comparisons to vote swings in past decades) (13 comments)
Help teaching short-course that has a healthy dose of data simulation (13 comments)
Answering two questions, one about Bayesian post-selection inference and one about prior and posterior predictive checks (13 comments)
Evaluating samplers with reference draws (12 comments)
Refuted papers continue to be cited more than their failed replications: Can a new search engine be built that will fix this problem? (12 comments)
Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1. (12 comments)
You probably don’t have a general algorithm for an MLE of Gaussian mixtures (12 comments)
This is a very disturbing map. (12 comments)
This is not an argument against self-citations. It’s an argument about how they should be counted. Also, a fun formula that expresses the estimated linear regression coefficient as a weighted average of local slopes. (12 comments)
Data issues in that paper that claims that TikTok and Instagram have consumption spillovers that lead to negative utility (12 comments)
Sports media > Prestige media (space aliens edition) (12 comments)
In search of a theory associating honest citation with a higher/deeper level of understanding than (dishonest) plagiarism (12 comments)
Sports gambling addiction epidemic fueled by some combination of psychology, economics, and politics (12 comments)
Awesome online graph guessing game. And scatterplot charades. (12 comments)
“Pitfalls of Demographic Forecasts of US Elections” (12 comments)
A feedback loop can destroy correlation: This idea comes up in many places. (11 comments)
The paradox of replication studies: A good analyst has special data analysis and interpretation skills. But it’s considered a bad or surprising thing that if you give the same data to different analysts, they come to different conclusions. (11 comments)
Hey, here’s some free money for you! Just lend your name to this university and they’ll pay you $1000 for every article you publish! (11 comments)
Mary Rosh! (11 comments)
When Steve Bannon meets the Center for Open Science: Bad science and bad reporting combine to yield another ovulation/voting disaster (11 comments)
Boris and Natasha in America: How often is the wife taller than the husband? (11 comments)
What happens when you’ve had deferential media coverage and then, all of a sudden, you’re treated as a news item rather than as a figure of admiration? (11 comments)
A data science course for high school students (11 comments)
Is there a balance to be struck between simple hierarchical models and more complex hierarchical models that augment the simple frameworks with more modeled interactions when analyzing real data? (11 comments)
Blog was down and is now operating again. (11 comments)
The state of statistics in 1990 (11 comments)
“A Hudson Valley Reckoning: Discovering the Forgotten History of Slaveholding in My Dutch American Family” (11 comments)
Probabilistic numerics and the folk theorem of statistical computing (11 comments)
Specification curve analysis and the multiverse (11 comments)
The true meaning of the alzabo (11 comments)
Resources for teaching and learning survey sampling, from Scott Keeter at Pew Research (10 comments)
Hey! Here’s some R code to make colored maps using circle sizes proportional to county population. (10 comments)
Our new book, Active Statistics, is now available! (10 comments)
“Hot hand”: The controversy that shouldn’t be. And thinking more about what makes something into a controversy: (10 comments)
N=43, “a statistically significant 226% improvement,” . . . what could possibly go wrong?? (10 comments)
If I got a nickel every time . . . (10 comments)
“Nonreplicable” publications are cited more than “replicable” ones? (10 comments)
Implicit assumptions in the Tversky/Kahneman example of the blue and green taxicabs (10 comments)
It’s Stanford time, baby: 8-hour time-restricted press releases linked to a 91% higher risk of hype (10 comments)
“Responsibility for Raw Data”: “Failure to retain data for some reasonable length of time following publication would produce notoriety equal to the notoriety attained by publishing inaccurate results. A possibly more effective means of controlling quality of publication would be to institute a system of quality control whereby random samples of raw data from submitted journal articles would be requested by editors and scrutinized for accuracy and the appropriateness of the analysis performed.” (10 comments)
The interactions paradox in statistics (10 comments)
Here is the Data Sharing Statement, in its entirety, for van Dyck CH, Swanson CJ, Aisen P, et al. Trial of Lecanemab in Early Alzheimer’s Disease. N Engl J Med. DOI: 10.1056/NEJMoa2212948. (10 comments)
Which book should you read first, Active Statistics or Regression and Other Stories? (10 comments)
Stan Playground: Run Stan on the web, play with your program and data at will, and no need to download anything on your computer (10 comments)
How to cheat at Codenames; cheating at board games more generally (10 comments)
Progress in 2023 (9 comments)
What’s up with spring blooming? (9 comments)
Fun with Dååta: Reference librarian edition (9 comments)
Hey, I got tagged by RetractoBot! (9 comments)
Minimum criteria for studies evaluating human decision-making (9 comments)
Here’s some academic advice for you: Never put your name on a paper you haven’t read. (9 comments)
Defining optimal reliance on model predictions in AI-assisted decisions (9 comments)
Philip K. Dick’s character names (9 comments)
Evilicious 3: Face the Music (9 comments)
Two kings, a royal, a knight, and three princesses walk into a bar . . . (Dude from Saudi Arabia accuses the lords of AI of not giving him enough credit.) (9 comments)
Dan Luu asks, “Why do people post on [bad platform] instead of [good platform]?” (9 comments)
When the story becomes the story (9 comments)
1. Why so many non-econ papers by economists? 2. What’s on the math GRE and what does this have to do with stat Ph.D. programs? 3. How does modern research on combinatorics relate to statistics? (9 comments)
Pervasive randomization problems, here with headline experiments (9 comments)
Some solid criticisms of Ariely and Nudge—from 2012! (9 comments)
It’s lumbar time: Wrong inference because of conditioning on a reasonable, but in this case false, assumption. (9 comments)
You can guarantee that the term “statistical guarantee” will irritate me. Here’s why, and let’s go into some details. (9 comments)
It’s $ time! How much should we charge for a link? (9 comments)
“How bad are search results?” Dan Luu has some interesting thoughts: (9 comments)
Some books: The Good Word (1978), The Hitler Conspiracies (2020), In Defense of History (1999), The Book of the Month (1986), Slow Horses (2010), Freedom’s Dominion (2022), A Meaningful Life (1971) (9 comments)
Background on “fail fast” (9 comments)
“Unusual Betting Patterns With Several Temple Games”: It’s martingale time, baby! (9 comments)
“The Stadium” by Frank Guridy (9 comments)
“Very interesting failed attempt at manipulation on Polymarket today” (9 comments)
Supercentenarian superfraud update (9 comments)
Inaccuracy in New York magazine report on election forecasting (9 comments)
The comments section: A request to non-commenters, occasional commenters, and frequent commenters (9 comments)
20-year anniversary of this blog (9 comments)
Supporting Bayesian modeling workflows with iterative filtering for multiverse analysis (9 comments)
That day in 1977 when Jerzy Neyman committed the methodological attribution fallacy. (9 comments)
Plagiarism searches and post-publication review (9 comments)
Physics is like Brazil, Statistics is like Chile (9 comments)
Progress in 2023, Aki’s software edition (8 comments)
Learning from mistakes (my online talk for the American Statistical Association, 2:30pm Tues 30 Jan 2024) (8 comments)
Lefty Driesell and Bobby Knight (8 comments)
There is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case. (8 comments)
How large is that treatment effect, really? (my talk at NYU economics department Thurs 18 Apr 2024, 12:30pm) (8 comments)
Delayed retraction sampling (8 comments)
“Close but no cigar” unit tests and bias in MCMC (8 comments)
Infovis, infographics, and data visualization: My thoughts 12 years later (8 comments)
6 ways to follow this blog (8 comments)
For that price he could’ve had 54 Jamaican beef patties or 1/216 of a conference featuring Gray Davis, Grover Norquist, and a rabbi (8 comments)
Break it to grok it: The best way to understand how a method works is to go construct scenarios where it fails (8 comments)
Loving, hating, and sometimes misinterpreting conformal prediction for medical decisions (8 comments)
Here’s a useful response by Christakis to criticisms of the contagion-of-obesity claims (8 comments)
Here is the Data Sharing Statement, in its entirety, for Goodwin GM, Aaronson ST, Alvarez O, et al. Single-Dose Psilocybin for a Treatment-Resistant Episode of Major Depression. N Engl J Med. DOI: 10.1056/NEJMoa2206443. (8 comments)
Progress in 2023, Jessica Edition (7 comments)
Storytelling and Scientific Understanding (my talks with Thomas Basbøll at Johns Hopkins on 26 Apr) (7 comments)
Listen to those residuals (7 comments)
When do we expect conformal prediction sets to be helpful? (7 comments)
Hey! A new (to me) text message scam! Involving a barfing dog! (7 comments)
Michael Lewis. (7 comments)
Free online book by Bruno Nicenboim, Daniel Schad, and Shravan Vasishth on Bayesian inference and hierarchical modeling using brms and Stan (7 comments)
Minor-league Stats Predict Major-league Performance, Sarah Palin, and Some Differences Between Baseball and Politics (7 comments)
How to code and impute income in studies of opinion polls? (7 comments)
How often is there a political candidate such as Vivek Ramaswamy who is so much stronger in online polls than telephone polls? (7 comments)
Banning the use of common sense in data analysis increases cases of research failure: evidence from Sweden (7 comments)
“Bayesian Workflow: Some Progress and Open Questions” and “Causal Inference as Generalization”: my two upcoming talks at CMU (7 comments)
Decorative statistics and historical records (7 comments)
Some fun basketball graphs (7 comments)
“A Columbia Surgeon’s Study Was Pulled. He Kept Publishing Flawed Data.” . . . and it appears that he’s still at Columbia! (7 comments)
He wants to compute “the effect of a predictor” (that is, an average predictive comparison) for a hierarchical mixture model. You can do it in Stan! (7 comments)
Luck vs. skill in poker (7 comments)
Basu’s Bears (Fat Bear Week and survey calibration) (7 comments)
Flatiron Institute hiring: postdocs, joint faculty, and permanent research positions (7 comments)
Violent science teacher makes ridiculously unsupported research claims, gets treated by legislatures/courts/media as expert on the effects of homeschooling (7 comments)
Should pollsters preregister their design, data collection, and analyses? (7 comments)
Calibration for everyone and every decision problem, maybe (7 comments)
Iterative imputation and incoherent Gibbs sampling (7 comments)
Data manipulation in the world of long-distance swimming! (7 comments)
Announcing two new members of our blogging team . . . (7 comments)
Progress in 2023, Aki Edition (6 comments)
Michael Wiebe has several new replications written up on his site. (6 comments)
The importance of measurement, and how you can draw ridiculous conclusions from your statistical analyses if you don’t think carefully about measurement . . . Leamer (1983) got it. (6 comments)
Cherry blossoms—not just another prediction competition (6 comments)
Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson (6 comments)
Mitzi’s and my talks in Trieste 3 and 4 June 2024 (yes, they’ll be broadcast) (6 comments)
One way you can understand people is to look at where they prefer to see complexity. (6 comments)
Edward Kennedy on the Facebook/Instagram 2020 election experiments (6 comments)
Last week’s summer school on probabilistic AI (6 comments)
Toward a Shnerbian theory that establishes connections between the complexity (nonlinearity, chaotic dynamics, number of components) of a system and the capacity to infer causality from datasets (6 comments)
The “fail fast” principle in statistical computing (6 comments)
Two job openings, one in New York on data visualization, one near Paris on Bayesian modeling (6 comments)
An apparent paradox regarding hypothesis tests and rejection regions (6 comments)
What’s a generative model? PyMC and Stan edition (6 comments)
“Announcing the 2023 IPUMS Research Award Winners” (6 comments)
Pete Rose (6 comments)
eLife press release: Deterministic thinking led to a nonsensical statement (6 comments)
“Reduce likelihood of a tick bite by 73.6 times”? Forking paths on the Appalachian Trail. (6 comments)
Delicate language for talking about statistical guarantees (6 comments)
It’s about time (5 comments)
A gathering of the literary critics: Louis Menand and Thomas Mallon, meet Jeet Heer (5 comments)
Why we say that honesty and transparency are not enough: (5 comments)
Statistical practice as scientific exploration (5 comments)
Analogy between (a) model checking in Bayesian statistics, and (b) the self-correcting nature of science. (5 comments)
Population forecasting for small areas: an example of learning through a social network (5 comments)
Data challenges with the Local News Initiative mapping project (5 comments)
Lucy is not a nickname. (5 comments)
“The Secret Life of John Le Carré” (5 comments)
Evil scamming fake publishers (5 comments)
The Mets are looking to hire a data scientist (5 comments)
Why art is more forgiving than game design (5 comments)
Salesses: “some writing exercises meant to help students with various elements of craft” (5 comments)
StanCon 2024 Oxford: recorded talks are now released! (5 comments)
Code it! (patterns in data edition) (5 comments)
Softmax is on the log, not the logit scale (5 comments)
“My view is that if I can show that a result was cooked and that doing it correctly does not yield the answer the authors claimed, then the result is discredited. . . . What I hear, instead, is the following . . .” (4 comments)
Lancet-bashing! (4 comments)
Bayesian Analysis with Python (4 comments)
Bayesian inference with informative priors is not inherently “subjective” (4 comments)
Here’s something you should do when beginning a project, and in the middle of a project, and in the end of the project: Clearly specify your goals, and also specify what’s not in your goal set. (4 comments)
“When are Bayesian model probabilities overconfident?” . . . and we’re still trying to get to meta-Bayes (4 comments)
“Often enough, scientists are left with the unenviable task of conducting an orchestra with out-of-tune instruments” (4 comments)
Studying causal inference in the presence of feedback: (4 comments)
They’re trying to get a hold on the jungle of cluster analysis. (4 comments)
“Beyond the black box: Toward a new paradigm of statistics in science” (talks this Thursday in London by Jessica Hullman, Hadley Wickham, and me) (4 comments)
Interactive and Automated Data Analysis: thoughts from Di Cook, Hadley Wickham, Jessica Hullman, and others (4 comments)
(This one’s important:) Looking Beyond the Obvious: Essentialism and abstraction as central to our reasoning and beliefs (4 comments)
“The Waltz of Reason” and a paradox of book reviewing (4 comments)
StanCon 2024… is a wrap! (4 comments)
22 Revision Prompts from Matthew Salesses (4 comments)
Two spans of the bridge of inference (4 comments)
Average predictive comparisons (4 comments)
Gayface Data Replicability Problems (4 comments)
Addressing legitimate counterarguments in a scientific review: The challenge of being an insider (4 comments)
Most popular posts of 2024 (4 comments)
What is the minimum bloggable contribution? (4 comments)
What to trust in the newspaper? Example of “The Simple Nudge That Raised Median Donations by 80%” (3 comments)
Bayesian BATS to advance Bayesian Thinking in STEM (3 comments)
Intro to BridgeStan: The new in-memory interface for Stan (3 comments)
Storytelling and Scientific Understanding (my talks with Thomas Basbøll at Johns Hopkins this Friday) (3 comments)
Evaluating MCMC samplers (3 comments)
Applied modelling in drug development? brms! (3 comments)
GPT today: Buffon’s Needle in Python with plotting (and some jokes) (3 comments)
Comedy and child abuse in literature (3 comments)
Cross validation and pointwise or joint measures of prediction accuracy (3 comments)
A guide to detecting AI-generated images, informed by experiments on people’s ability to detect them (3 comments)
Free Textbook on Applied Regression and Causal Inference (3 comments)
Free Book of Stories, Activities, Computer Demonstrations, and Problems in Applied Regression and Causal Inference (3 comments)
Close reading in literary criticism and statistical analysis (3 comments)
GIST: Now with local step size adaptation for NUTS (3 comments)
Should you always include a varying slope for the lower-level variable involved in a cross-level interaction? (3 comments)
3 levels of fraud: One-time, Linear, and Exponential (3 comments)
Postdoc position at Northwestern on evaluating AI/ML decision support (3 comments)
The marginalization or Jeffreys-Lindley paradox: it’s already been resolved. (3 comments)
They solved the human-statistical reasoning interface back in the 80s (2 comments)
“Replicability & Generalisability”: Applying a discount factor to cost-effectiveness estimates. (2 comments)
Leap Day Special! (2 comments)
Postdoc Opportunity at the HEDCO Institute for Evidence-Based Educational Practice in the College of Education at the University of Oregon (2 comments)
GIST: Gibbs self-tuning for HMC (2 comments)
“Former dean of Temple University convicted of fraud for using fake data to boost its national ranking” (2 comments)
Update on “the hat”: It’s “the spectre,” a single shape that can tile the plane aperiodically but not periodically, and doesn’t require flipping (2 comments)
New online Stan course: 80 videos + hosted live coding environment (2 comments)
In Stan, “~” should be called a “distribution statement,” not a “sampling statement.” (2 comments)
Forking paths and workflow in statistical practice and communication (2 comments)
Doctoral student positions in Bayesian workflow at Aalto, Finland (2 comments)
Election prediction markets: What happens next? (2 comments)
Oregon State Stats Dept. is Hiring (2 comments)
Hey, journalist readers! Does anyone have a contact at NPR? (2 comments)
“My quick answer is that I don’t care much about permutation tests because they are testing a null null hypothesis that is of typically no interest” (1 comments)
“Theoretical statistics is the theory of applied statistics”: A scheduled conference on the topic (1 comments)
Progress in 2023, Charles edition (1 comments)
A question about Lindley’s supra Bayesian method for expert probability assessment (1 comments)
Those annoying people-are-stupid narratives in journalism (1 comments)
ISBA 2024 Satellite Meeting: Lugano, 25–28 June (1 comments)
My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar) (1 comments)
Is the 2024 New York presidential primary really an “important election”? (1 comments)
Fully funded doctoral student positions in Finland (1 comments)
“Randomization in such studies is arguably a negative, in practice, in that it gives apparently ironclad causal identification (not really, given the ultimate goal of generalization), which just gives researchers and outsiders a greater level of overconfidence in the claims.” (1 comments)
Supporting Bayesian modelling workflows with iterative filtering for multiverse analysis (1 comments)
No, I don’t believe the claim that “Mothers negatively affected by having three daughters and no sons, study shows.” (1 comments)
He has some questions about a career in sports analytics. (1 comments)
Questions and Answers for Applied Statistics and Multilevel Modeling (1 comments)
StanCon 2024: scholarships, sponsors, and other news (1 comments)
19 ways of looking at data science at the singularity, from David Donoho and 17 others (1 comments)
A message to Christian Hesse, mathematician and author of chess books (1 comments)
Online seminar for Monte Carlo Methods++ (1 comments)
Faculty positions at the University of Oregon’s new Data Science department (1 comments)
“Tough choices in election forecasting: All the things that can go wrong” (my webinar this Friday 11am with the Washington Statistical Society) (1 comments)
Call for StanCon 2025+ (1 comments)
What should Yuling include in his course on statistical computing? (1 comments)
Calibration “resolves” epistemic uncertainty by giving predictions that are indistinguishable from the true probabilities. Why is this still unsatisfying? (1 comments)
Since Jeffrey Epstein is in the news again . . . (0 comments)
Postdoc at Washington State University on law-enforcement statistics (0 comments)
Here’s how to subscribe to our new weekly newsletter: (0 comments)
Progress in 2023, Leo edition (0 comments)
Click here to help this researcher gather different takes on making data visualizations for blind people (0 comments)
Using the term “visualization” for non-visual representation of data (0 comments)
Varying slopes and intercepts in Stan: still painful in 2024 (0 comments)
BD corner: I came across this interesting interview with Daniel Clowes on the sources for Monica (0 comments)
Hey, some good news for a change! (Child psychology and Bayes) (0 comments)
A nested helix plot that simultaneously shows events on the scales of centuries, millennia, . . . all the way back to billions of years (0 comments)
Job Ad: Spatial Statistics Group Lead at Oak Ridge National Laboratory (0 comments)
Bayesian Workflow, Causal Generalization, Modeling of Sampling Weights, and Time: My talks at Northwestern University this Friday and the University of Chicago on Monday (0 comments)
Papers on human decision-making under uncertainty in ML venues! We have advice. (0 comments)
New stat podcast just dropped (0 comments)
Faculty and postdoc jobs in computational stats at Newcastle University (UK) (0 comments)
Subscribe to this free newsletter and get a heads-up on our scheduled posts a week early! (0 comments)
“What do we need from a probabilistic programming language to support Bayesian workflow?” (0 comments)
StanCon 2024 is in 32 days! (0 comments)
NeurIPS 2024 workshop on Statistical Frontiers in LLMs and Foundation Models (0 comments)
Faculty positions at the University of California on AI, Inequality, and Society (0 comments)
Modeling Weights to Generalize (my talk this Wed noon at the Columbia University statistics department) (0 comments)
Bayesian social science conference in Amsterdam! Next month! (0 comments)
Postdoc opportunity! to work with me here at Columbia! on Bayesian workflow! for contamination models! With some wonderful collaborators!! (0 comments)
NYT catches up to Statistical Modeling, Causal Inference, and Social Science (0 comments)
Leave-one-out cross validation (LOO) for an astronomy problem (0 comments)
Self-reference and self-reproduction of evidence (0 comments)
The Red Sox are hiring (0 comments)
Faculty positions at Princeton in interdisciplinary data science (0 comments)

Thank you all for your contributions!

Calibration “resolves” epistemic uncertainty by giving predictions that are indistinguishable from the true probabilities. Why is this still unsatisfying?

This is Jessica. The last day of the year seems like a good time for finishing things up, so I figured it’s time for one last post wrapping up some thoughts on calibration.

As my previous posts got into, calibrated prediction uncertainty is the goal of various posthoc calibration algorithms discussed in machine learning research, which use held out data to learn transformations on model predicted probabilities in order to achieve calibration on the held out data. I’ve reflected a bit on what calibration can and can’t give us in terms of assurances for decision-making. Namely, it makes predictions trustworthy for decisions in the restricted sense that a decision-maker who will choose their action purely based on the prediction can’t do better than treating the calibrated predictions as the true probabilities. 

But something I’ve had trouble articulating as clearly as I’d like involves what’s missing (and why) when it comes to what calibration gives us versus a more complete representation of the limits of our knowledge in making some predictions. 

The distinction involves how we express higher order uncertainty. Let’s say we are doing multiclass classification, and fit a model fhat to some labeled data. Our “level 0” prediction fhat(x) contains no uncertainty representation at all; we check it against the ground truth y. Our “level 1” prediction phat(.|x) predicts the conditional distribution over classes; we check it against the empirical distribution that gives a probability p(y|x) for each possible y. Our “level 2” prediction tries to predict the distribution of the conditional distribution over classes, p(p(.|x)), e.g. a Dirichlet distribution that assigns probability to each candidate distribution p(.|x), where the candidates are distinguished by some parameters theta.
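
As a concrete, toy version of these three levels for a three-class problem, here’s a minimal sketch in Python; the Dirichlet concentration parameters are made up for illustration and don’t come from any fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Level 2": a Dirichlet over candidate conditional distributions p(.|x),
# indexed by concentration parameters theta (made-up values).
theta = np.array([2.0, 5.0, 1.0])
level2_draws = rng.dirichlet(theta, size=1000)   # draws of candidate p(.|x)

# "Level 1": collapse level 2 to a single distribution over classes,
# here by taking the Dirichlet mean.
level1 = theta / theta.sum()

# "Level 0": collapse further to a single predicted class.
level0 = int(np.argmax(level1))

print(level2_draws.shape, level1, level0)
```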

From a Bayesian modeling perspective, it’s natural to think about distributions of distributions. A prior distribution over model parameters implies a distribution over possible data-generating distributions. Upon fitting a model, the posterior predictive distribution summarizes both “aleatoric” uncertainty due to inherent randomness in the generating process and “epistemic” uncertainty stemming from our lack of knowledge of the true parameter values. 

In some sense calibration “resolves” epistemic uncertainty by providing point predictions that are indistinguishable from the true probabilities. But if you’re hoping to get a faithful summary of the current state of knowledge, it can seem like something is still missing. In the Bayesian framework, we can collapse our posterior prediction of the outcome y for any particular input x to a point estimate, but we don’t have to. 

Part of the difficulty is that whenever we evaluate performance as loss over some data-generating distribution, having more than a point estimate is not necessary. This is true even without considering second order uncertainty. If we train a level 0 prediction of the outcome y using the standard loss minimization framework with 0/1 loss, then it will learn to predict the mode. And so to the extent that it’s hard to argue one’s way out of loss minimization as a standard for evaluating decisions, it’s hard to motivate faithful expression of epistemic uncertainty.

For second order uncertainty, the added complication is there is no ground truth. We might believe there is some intrinsic value in being able to model uncertainty about the best predictor, but how do we formalize this given that there’s no ground truth against which to check our second order predictions? We can’t learn by drawing samples from the distribution that assigns probability to different first order distributions p(.|x) because technically there is no such distribution beyond our conception of it. 

Daniel Lakeland previously provided an example I found helpful, of putting a Bayesian probability distribution on a predicted frequency, where there’s no sense in which we can check the calibration of the second order prediction.

Related to this, I recently came across a few papers by Viktor Bengs et al. that formalize some of this in an ML context. Essentially, they show that there is no well-defined loss function that can be used in the typical ML learning pipeline to incentivize the learner to make correct predictions that are also faithful expressions of its epistemic uncertainty. This can be expressed in terms of trying to find a proper scoring rule. In the case of first order predictions, as long as we use a proper scoring rule as the loss function, we can expect accurate predictions, because a proper scoring rule is one under which you cannot do better, in expectation, than reporting your true beliefs. But there is no loss function that incentivizes a second-order learner to faithfully represent its epistemic uncertainty the way a proper scoring rule does for a first order learner.
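
As a small illustration of the first order case (binary outcome, log score; the numbers here are arbitrary), the expected loss is minimized exactly at the true probability, which is what makes the scoring rule proper:

```python
import numpy as np

def expected_log_loss(report, p_true):
    # Expected negative log score when the outcome is 1 with probability
    # p_true but we report `report` as our probability.
    return -(p_true * np.log(report) + (1 - p_true) * np.log(1 - report))

p_true = 0.7
reports = np.linspace(0.01, 0.99, 99)
losses = expected_log_loss(reports, p_true)

print(reports[np.argmin(losses)])   # ~0.7: truthful reporting minimizes expected loss
```

The point of the Bengs et al. results is that nothing plays the role of p_true for a second order report, so there is no analogous construction that rewards faithfulness about the distribution over distributions.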

This may seem obvious, especially if you’re coming from a Bayesian tradition, considering that there is no ground truth against which to score second order predictions. And yet, various loss functions have been proposed for estimating level 2 predictors in the ML literature, such as minimizing the empirical loss of the level 1 prediction averaged over possible parameter values. These results make clear that one needs to be careful interpreting the predictors such losses produce, because, e.g., they can actually incentivize predictors that appear to be certain about the first order distribution.

I guess a question that remains is how to talk about incentives for second order uncertainty at all in a context where minimizing loss from predictions is the primary goal. I don’t think the right conclusion is that it doesn’t matter since we can’t integrate it into a loss minimization framework. Having the ability to decompose predictions by different sources of uncertainty and be explicit about what our higher order uncertainty looks like going in (i.e., by defining a prior) has scientific value in less direct ways, like communicating beliefs and debugging when things go wrong. 

What is the minimum bloggable contribution?

The other day someone sent me an email pointing me to an online article, a statistical analysis criticizing an online article that was a reanalysis of data from an article that was a meta-analysis and literature review of a controversial topic that had been written about in some earlier published papers that were themselves literature reviews.

My correspondent was interested in my take, and I replied that the latest article made some good points and also some errors. I didn’t think I had anything useful to say on this one so I didn’t post on the topic.

Given that I’d already gone to the trouble of reading–OK, skimming–all these articles, and I can post here for free, arguably I could contribute in a useful way just with a short post explaining where I agreed with this new article and where I thought it went wrong. I have no real or perceived beefs with any of the people involved in this one, and I expect that whatever feedback I were to provide would be taken constructively, not defensively.

So why not post? The difficulty here would come not directly in my comments on those articles but rather in the necessary scaffolding: all the bits I’d need to add to avoid writing something that could be misinterpreted.

One of the challenges of writing, as compared to speaking, is that your words are just out there, interpreted without the benefit of intonation, context, and dialogue. This isn’t anything specific to blogging; the same issue can arise when communicating with people by email. Not that direct face-to-face conversation always works either; it’s just that something written sits out there on its own in a way that speech doesn’t.

So it’s a cost-benefit calculation: weighing my contributions to this particular discussion against the effort of constructing the scaffolding. In this case I decided the best option was to not bother.

But then I became interested in the meta-topic of when to post, so I wrote this, which I’ll schedule to appear in a few months.

Bayesian inference (and mathematical reasoning more generally) isn’t just about getting the answer; it’s also about clarifying the mapping from assumptions to inference to decision.

Palko writes:

I’m just an occasional Bayesian (and never an exo-biologist) so maybe I’m missing some subtleties here, but I believe educated estimates for the first probability vary widely with some close to 0 and some close to 1 with no good sense of the distribution. Is there any point in applying the theorem at that point? From this Wired article:

If or when scientists detect a putative biosignature gas on a distant planet, they can use a formula called Bayes’ theorem to calculate the chance of life existing there based on three probabilities. Two have to do with biology. The first is the probability of life emerging on that planet given everything else that’s known about it. The second is the probability that, if there is life, it would create the biosignature we observe. Both factors carry significant uncertainties, according to the astrobiologists Cole Mathis of Arizona State University and Harrison Smith of the Earth-Life Science Institute of the Tokyo Institute of Technology, who explored this kind of reasoning in a paper last fall.

My reply: I guess it’s fine to do the calculation, if only to make it clear how dependent it is on assumps. Bayesian inference isn’t just about getting the answer; it’s also about clarifying the mapping from assumptions to inference to decision.

Come to think about it, that last paragraph remains true if you replace “Bayesian inference” with “Mathematics.”
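
To see how much the answer leans on the assumptions, here’s a minimal sketch of the Bayes calculation with made-up numbers (none of these values come from the Wired article or from Mathis and Smith). Holding the likelihoods fixed and varying only the prior probability of life swings the posterior from under 1% to over 98%:

```python
def posterior_life(prior_life, p_sig_given_life, p_sig_given_no_life):
    """Bayes' theorem for P(life | biosignature detected)."""
    numerator = prior_life * p_sig_given_life
    denominator = numerator + (1 - prior_life) * p_sig_given_no_life
    return numerator / denominator

# Hypothetical likelihoods held fixed; the prior is the quantity that
# Palko says experts put anywhere from near 0 to near 1.
for prior in [0.001, 0.1, 0.5, 0.9]:
    print(prior, round(posterior_life(prior, 0.8, 0.1), 3))
```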

How to cheat at Codenames; cheating at board games more generally

This is a good post for Christmas Day, with all of you at home with your families playing board games.

Dan Luu has an amusing post explaining how you can win Codenames by just memorizing the configurations of the 40 setup cards. The basic strategy is to play your best until you can figure out which configuration is in play, at which point you win. The fun part is that if you’re playing against a team that hasn’t learned this memorization trick, then you can win even if you don’t guess any words yourself—you just take advantage of the config information that you get from their correct guesses (along with any wrong guesses that come up)! If both teams have memorized the 40 cards, then you get to a new level of strategy.
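
Here’s a rough sketch of the deduction Luu describes. The key cards below are randomly generated stand-ins rather than the real printed set, but the logic is the same: every resolved guess reveals the color of one grid position, and you just filter the memorized cards down to the ones still consistent with what has been revealed.

```python
import random

random.seed(0)

def random_key():
    # One stand-in "key card": a color assignment for the 25 grid positions.
    cells = ["red"] * 9 + ["blue"] * 8 + ["neutral"] * 7 + ["assassin"]
    random.shuffle(cells)
    return tuple(cells)

key_cards = [random_key() for _ in range(40)]   # the memorized set
true_key = key_cards[17]                        # whichever card is in play

# Positions whose colors have been revealed by guesses so far.
revealed = {3: true_key[3], 11: true_key[11], 20: true_key[20]}

# Keep only the memorized cards consistent with the revealed colors.
candidates = [k for k in key_cards
              if all(k[pos] == color for pos, color in revealed.items())]

# Once this hits 1, you know the color of every remaining word.
print(len(candidates))
```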

As Luu says, no one would want to play Codenames in this way. The whole point of the game is to guess the words; if you’re gonna do it by memorizing patterns, why play the game in the first place? On the other hand, he also points out that once this information is there, you can’t un-see it. So it’s a balance.

This comes up in the rules of Codenames itself: you’re not allowed to give clues that suggest the position of the word on the grid, nor are you allowed to make faces or otherwise give clues as people are guessing. It can be hard to avoid giving this information sometimes!

More generally, most games can be “cracked” through a backdoor approach in some way or another. Here’s how Luu puts it:

Personally, when I run into a side-channel attack in a game or a game that’s just totally busted if played to win . . . I think it makes sense to try to avoid “attacking” the game to the extent possible. I think this is sort of impossible to do perfectly in Codenames because people will form subconscious associations (I’ve noticed people guessing an extra word on the first turn just to mess around, which works more often than not — assuming they’re not cheating, and I believe they’re not cheating, the success rate strongly suggests the use of some kind of side-channel information). That doesn’t necessarily have to be positional information from the cards, it could be something as simple as subconsciously noticing what the spymasters are intently looking at.

Dave Sirlin calls anyone who doesn’t take advantage of any legal possibility to win a sucker (he derogatorily calls such people “scrubs”) (he says that you should use cheats to win, like using maphacks in FPS games, as long as tournament organizers don’t ban the practice, and that tournaments should explicitly list what’s banned, avoiding generic “don’t do bad stuff” rules). I think people should play games however they find it fun and should find a group that likes playing games in the same way. If Dave finds it fun to memorize arbitrary info to win all of these games, he should do that. The reason I, as Dave Sirlin would put it, play like a scrub, for the kinds of games discussed here is because the games are generally badly broken if played seriously and I don’t personally find the ways in which they’re broken to be fun.

It gets tricky sometimes, though. Consider those goofy words that are in the Scrabble dictionary but aren’t really words, for example ef (“the letter F”) or po (“a chamber pot”). These are not English words! On the other hand, when you’re actually playing and you see an opportunity for ef or po or whatever, it’s hard to deny yourself the opportunity. In that case, there’s an easy solution: the rules allow the players to agree on any dictionary ahead of time, so no need to use the Scrabble dictionary. On the other hand, this will annoy serious players.

There’s more of a gray area with collusion, which can “break” almost any multiplayer game. In poker, collusion is a form of cheating. I don’t know how casinos or informal games monitor or enforce the rule against collusion, but you’re not supposed to do it. You’re allowed to lie in poker but not to cheat.

But what about a game such as Monopoly or Risk where bargaining is part of the game? Here’s a simple strategy in a 3-player game of Monopoly that will up your odds of winning from 1/3 to nearly 1/2: Before the game begins, pick one of the other players and agree to flip a coin, after which the winner of the flip will devote all their effort to helping the other player win. That’s easy enough to do: just buy whatever property comes up and sell it to the other player for $1. It won’t guarantee a win, but it’s gotta take the favored player’s win probability to very close to 100%. Similarly with Risk. Now, nobody’s gonna play this way because it’s no fun (except maybe once as a joke). To put it another way, “winning a game of Monopoly or Risk” does not have much positive value in itself; the fun is in winning the game legitimately. Again, though, there is a gray zone, and other players will rightly get annoyed if they see player A deliberately trying to help player B without there being a good reason in the context of the game. In Risk, “I won’t attack you here if you don’t attack me there” is a legitimate strategy, but “I don’t attack you because I want to help you win” is not so cool.
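
To spell out the arithmetic: conditional on the flip, the favored player wins with probability close to 1 and the helper with probability close to 0, so before the flip each colluder’s chance is roughly (1/2)(1) + (1/2)(0) = 1/2, up from the symmetric 1/3.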

A few years ago I was playing a lot of online chess, and one thing I noticed is that some players would set up opening traps: clearly unsound sequences of moves that would get them a win if their opponents played naively and hadn’t seen the trick before. My thought was: Why do that? Winning against a stranger using a trap, what’s the point of that? Upon reflection, though, I decided to not be so bothered by this. If you try to spring a trap, then the fun part is when the trap fails and then you have to get out of a bad position of your own devising. So, all good.

Years ago I read the book Thursday Night Poker by Peter Steiner. One thing Steiner discusses is that in a casual game you can often do just fine by playing really tight, a strategy that won’t work against good players but can make you steady money if some of the people at the table are just playing for fun. As Steiner says, though, most of us are not playing in a friendly poker game with the goal of maximizing our dollars. We’re playing poker for fun, and “action”—getting involved in hands, making betting decisions, going up against the other players—is where the fun is at. No poker player would be a “scrub”—you’ll always take advantage of any legal way to win, it’s not like you’d ignore relevant information that someone reveals—but, even in poker, winning is not the only goal.

All of this is kind of obvious, but as Luu discusses, sometimes it needs to be pointed out, to push against naive models of the world. Also, the bit about the Codenames cards is cool—I’d never thought about that!

The Theory and Practice of Oligarchical Collectivism

Paul Campos shares the story of a tech zillionaire who allegedly pocketed $415,726 from an insider trading scheme. Campos asks:

Why commit a serious crime for a payoff that was, for the criminal, almost literally nothing in practical terms? Keep in mind that for a Silicon Valley legend like this guy, incredibly juicy investment opportunities are never more than a thirty-second phone call away, so the call he made to make this trade probably had an opportunity cost higher than his prospective payoff from making it, even if you ignore the potential criminal liability. . . .

So maybe this really isn’t even about money. Maybe it’s about the apparently irresistible urge to get over on the System, even when the System has rewarded you with so much money that more money should mean nothing.

On the other hand, in our world more money never means nothing, because the acquisitive habit can become as compulsive as any other perverse addiction. . . .

I have a different theory. Campos is asking, Why would someone so rich throw it all away for a trivial amount of money? My take is that the guy didn’t see himself as risking anything. From the zillionaire’s point of view, insider trading isn’t really illegal, in two senses.

First, the dude probably doesn’t think that insider trading is wrong. It’s just capitalism, friends helping friends. Indeed, he could well take the position that laws against insider trading are productivity-destroying interferences with the natural laws of the market (see here, for example).

Second, he probably doesn’t think he would personally suffer serious consequences if caught. He’s a pillar of the community, and judges are reasonable people, right? And, indeed, if you follow the link to the news article, you see this:

Last month, Mr. Bechtolsheim, 68, settled the insider trading charges without admitting wrongdoing. He agreed to pay a fine of more than $900,000 and will not serve as an officer or director of a public company for five years.

For a guy with billions of dollars, a million-dollar fine ain’t much. And, sure, he can’t serve as an officer or director of a public company for five years. . . . but there are other ways he can spend his time! And I doubt he’ll be socially shunned: he’s such a successful investor! And so rich!

I’m not saying he faced zero consequences, I’m just saying that I don’t think we need an elaborate theory of greed or irresistible urges or acquisitive habits. Dude did something that in his view might have been “technically” illegal, kind of like how you might feel about taking some pens home or using the office copier for personal items or parking illegally or whatever. I think the framing of this as a risky decision doesn’t quite match where he was coming from.

“How a simple math error sparked a panic about black plastic kitchen utensils”: Does it matter when an estimate is off by a factor of 10?

tl;dr. Researcher finds out they were off by a factor of 10, responds that “this does not impact our results and our recommendations remain the same.”

Dean Eckles sent me an email, subject line, “order-of-magnitude errors that don’t affect the conclusions,” pointing to this news article from Joseph Brean:

Plastics rarely make news like this . . . the media uptake was enthusiastic on a paper published in October in the peer-reviewed journal Chemosphere.

“Your cool black kitchenware could be slowly poisoning you, study says. Here’s what to do,” said the LA Times. “Yes, throw out your black spatula,” said the San Francisco Chronicle. Salon was most blunt: “Your favorite spatula could kill you,” it said.

The study, by researchers at the advocacy group Toxic-Free Future, sought to determine whether black plastic household products sold in the U.S. contain brominated flame retardants, fire-resistant chemicals that are added to plastics for use in electronics, such as televisions, to prevent accidental fires. . . .

The study estimated that using contaminated kitchenware could cause a median intake of 34,700 nanograms per day of Decabromodiphenyl ether, known as BDE-209. That is far more than the bodily intake previously estimated from other modes, such as ingesting dust.

OK, so far, so good. But then:

The trouble is that, in the study’s section on “Health and Exposure Concerns,” the researchers said this number, 34,700, “would approach” the reference dose given by the United States Environmental Protection Agency. . . .

The paper correctly gives the reference dose for BDE-209 as 7,000 nanograms per kilogram of body weight per day, but calculates this into a limit for a 60-kilogram adult of 42,000 nanograms per day. So, as the paper claims, the estimated actual exposure from kitchen utensils of 34,700 nanograms per day is more than 80 per cent of the EPA limit of 42,000.

Did you catch that? Look carefully:

That sounds bad. But 60 times 7,000 is not 42,000. It is 420,000. This is what [McGill University’s] Joe Schwarcz noticed. The estimated exposure is not even a tenth of the reference dose. That does not sound as bad.

Indeed.
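
The arithmetic check is short enough to do in a couple of lines (units are nanograms per day for a 60-kilogram adult, as in the paper):

```python
reference_dose_per_kg = 7_000   # ng per kg of body weight per day (EPA reference dose)
body_weight_kg = 60
estimated_intake = 34_700       # ng per day, the study's estimate from kitchenware

limit = reference_dose_per_kg * body_weight_kg
print(limit)                    # 420,000 ng/day, not the 42,000 claimed in the paper
print(estimated_intake / limit) # ~0.083, i.e., roughly 8% of the reference dose
```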

We all make mistakes, and some of them make their way into journals. I can see how the reviewers of the article could have missed this particular error, with all those zeros floating around in the numbers. It’s less excusable for the authors to have missed it—I guess that part of the problem is that the incorrect number fit their story so well. When you come up with a result that doesn’t accord with the story you want to tell, you’re inclined to check it. If the result is a perfect fit, you might not even give it a second look.

How to avoid this in the future?

Schwarcz, who directs McGill University’s Office for Science and Society, offers some helpful insights for avoiding this sort of error:

Schwarcz does not generally like measurements of risk expressed in percentages. Absolute numbers tend to be more useful, as in this study. He gives the example of a lottery ticket. If you have one lottery ticket, your chances of winning are, say, one in a million. If you buy another, your chances of winning have increased by 100 per cent, which sounds like a lot until you realize they are still just two in a million.

“Risk analysis is a sketchy business in the first place, very difficult to do, especially if you don’t express units properly,” Schwarcz said. “You can make things sound worse.”

There was also no need to use nanograms as the unit of measurement in this study, Schwarcz said, which gave unit amounts in the tens of thousands. The more common micrograms would have given units in the tens.

“It’s a common thing in scientific literature, especially in ones that try to call attention to some kind of toxin,” Schwarcz said.

Scaling is really important, and often people seem to be going out of their way to use numbers that are hard to interpret. Or, they just use whatever default scaling comes to them, without reflecting on how they could do better. A few years ago we discussed some graphs of annual death rates that were given in units such as “1,000 deaths per 100,000.” It’s hard to get intuition on a number like that. It would’ve been so easy to just do everything per 100, and then that number would be a much more interpretable 1%. (About 1% of Americans die each year, which makes sense given demographics.)

Did the factor-of-10 error matter?

From the news article:

Lead author Megan Liu, science and policy manager at Toxic-Free Future, described the mistake as a “typo” and said her co-authors have submitted a correction to the journal. The error remains in the online version but Liu said she anticipates it will be updated soon.

“However, it is important to note that this does not impact our results,” Liu told National Post. “The levels of flame retardants that we found in black plastic household items are still of high concern, and our recommendations remain the same.”

Hmmm, maybe. The news article also states, “it appears the study’s hypothesis is correct, that black plastic recycled out of electronic devices, mostly in Asia, is getting back into the American supply chain for household kitchen items, including spatulas. So if you’re keen on eliminating these chemicals in any amount, chucking the black plastic kitchenware is a start, even if not as effective as the erroneous calculation suggests.”

It still seems wrong to say, “Our recommendations remain the same,” if their estimate of risk is off by a factor of 10.

Look at it another way. Suppose someone else had done a study and found that the level of exposure was “8% of the reference dose, thus, a potential concern,” but they’d done the calculation wrong, and the level was really 80% of the reference dose. Then I assume that the folks at Toxic-Free Future wouldn’t say that the recommendations remain the same, right? They’d say the exposure had been underestimated by a factor of 10 and that’s a big deal!

To put it another way, comparisons are symmetric. If you say that an exposure of 80% of the recommended dose is 10 times as bad as an exposure of 8% of the recommended dose, then the reverse should be true as well, that an exposure of 8% is 1/10 as bad as an exposure of 80%.

Does this change the recommendation? Yes, I’d say it does, in part. The individual recommendation—throw away your black plastic spatula—might not change, but the policy recommendation would change, because policy recommendations are not just directional, they also include a sense of urgency or priority, which depends on magnitude.

I’ll return to this issue in a future post. It’s an important issue that arises in many examples.

Why am I willing to bet you $100-1000 there will be a Nobel Prize for Adaptive Experimentation in the next 40 years?

This post is by Joseph Jay Williams, a professor at University of Toronto in Human Computer Interaction, Psychology, Statistics, & Economics. You can read more on him at www.josephjaywilliams.com.

Here is a recording of Joseph providing some context for this question, recorded while Andrew was on a call. I’m looking forward to his agreements and disagreements :D.

https://www.loom.com/share/559e83b392dc43b3b141c117d9677033

This is a different format, which can feel unnatural, but might provide novel value in the long run. I think having a video can be an interesting interaction, and I thought people might like to see Andrew and me having a conversation as colleagues.

Please comment with advice and suggestions!

Joseph

Postdoc position at Northwestern on evaluating AI/ML decision support

This is Jessica. I’m looking for a postdoc to work with me and Ken Holstein (CMU) on evaluation tools for AI-based decision support, with emphasis on elicitation challenges associated with specifying decision problems in real-world deployments. The postdoc is through the Northwestern University Department of Computer Science.

I’ll be at NeurIPS this Thursday afternoon through Sunday night if anyone wants to chat about the position there. 

Position description

The Department of Computer Science at Northwestern University is seeking an outstanding postdoctoral scholar for a research opportunity studying AI-assisted human decision-making under Dr. Jessica Hullman. The candidate will also collaborate with Dr. Ken Holstein of Carnegie Mellon University. We apply tools from statistics, decision theory, and human-computer interaction to better understand how to effectively design and deploy AI and ML models to support data-driven decision making. 

Potential projects to be led by the postdoc will address challenges associated with organizations’ use of models developed using standard AI/ML approaches in decision pipelines currently dominated by human experts. To responsibly integrate such models requires careful prospective and retrospective analysis to evaluate the impacts of the deployment on decision-making. Yet existing evaluation approaches risk misspecifying real-world decision problems, threatening the reliability of analyses. There is an opportunity to develop practically motivated yet theoretically rigorous methods for specifying and evaluating the decision problems relevant to AI/ML deployments.

The postdoctoral researcher will work in a collaborative, multidisciplinary-oriented computer science environment to identify and lead human-centered projects related to AI for decision support. 

Minimum Qualifications: Doctoral degree in computer science, statistics, information science, or related field, and an interest in AI/ML for decision support. Additionally, applicants should have:

  • Interest in working in real-world contexts in collaboration with domain experts
  • At least some exposure to human-centered design or human-computer interaction (HCI)
  • At least some exposure to experiment design, particularly experiments with human subjects

Ideally, applicants will also have the following:

  • Familiarity with statistical decision theory, other theories of decision-making, and/or mathematical models of human behavior
  • Some prior experience in HCI and AI/ML 
  • Interest and prior experience in system/software development or method development 

Interested applicants should send their CV, names and contact information for a minimum of three recommenders, two representative publications, and (optionally) a statement of interest to [email protected].

New Course: Prediction for (Individualized) Decision-making

This is Jessica. This winter I’m teaching a new graduate seminar on prediction for decision-making intended primarily for Computer Science Ph.D. students. The goal of the new course is to consider various perspectives on what it means to predict for the purpose of decision-making. We’ll look at this question in the context of predictive modeling for automated decisions or to inform expert decisions and causal estimation to inform policy. I’m trying to include a mix of theoretical and applied papers, with an emphasis on philosophical and ethical challenges to evaluating decision-making and applying formal methods in practice, especially in contexts where human experts currently make decisions and/or the decisions involve people. Technically the course title is Prediction for Decision-making. But one of the motivations is that we have yet to adequately address the gap between conventional machine learning, where we optimize loss over aggregates, and the needs of human decision-makers in practice, where we often care about doing right by individual cases. Hence the reference to “individualized.” 

Suggestions welcome if this is your cup of tea and you think I missed something important. A few of the listed papers are already coming from pointers I’ve gotten from readers here. I’m especially interested in papers that help illustrate the gaps in current methods when it comes to good individual decisions. 

Course Schedule

Week 1 – Introduction and background on statistical decision rules

     Background: Statistical decision theory, randomized controlled trials

  • Berger, J. O. (2013). Statistical decision theory and Bayesian analysis. Springer Science & Business Media. Chapter 1.
  • Hernán, Miguel A., & Robins, James M. (2023). Causal inference: what if. CRC Press. Chapters 1, 2

    Examples

Week 2 – Prediction versus decision-making

     Optional

Week 3 – Human versus statistical judgment

     Optional

Week 4 – Evaluating (individual) predictions and decisions

     Optional

Week 5 – Data shifts and causality

     Optional

Week 6 – Personalization and fairness

      Optional

Week 7 – Calibration for decision-making

     Optional

Week 8 – Communicating prediction uncertainty

    Optional

Week 9 – Designing human-AI workflows 

     Optional

Understanding p-values: Different interpretations can be thought of not as different “philosophies” but as different forms of averaging.

The usual way we think about p-values is that they’re part of null-hypothesis testing inference, and if that’s not your bag, you have no use for p-values.

That summary is pretty much the case. I once did write a paper for the journal Epidemiology called “P-values and statistical practice,” and in that paper I gave an example of a p-value that worked (further background is here), but at this point my main interest in p-values is that other people use them, so it behooves me to understand what they’re doing.

Theoretical statistics is the theory of applied statistics, and part of applied statistics is what other people do.

What this means is that, just as non-Bayesians should understand enough about Bayesian methods to be able to assess the frequency properties of said methods, so should I, as a Bayesian, understand the properties of p-values. Bayesians are frequentists.

The point is, a p-value is a data summary, and it should be interpretable under various assumptions. As we like to say, it’s all about the averaging.

Below are two different ways of understanding p-values. You could think of these as the classical interpretation or the Bayesian interpretation, but I prefer to think of them as conditioning-on-the-null-hypothesis or averaging-over-an-assumed-population-distribution.

So here goes:

1. One interpretation of the p-value is as the probability of seeing a test statistic as extreme as, or more extreme than, the data, conditional on a null hypothesis of zero effects. This is the classical interpretation.

2. Another interpretation of the p-value is conditional on some empirically estimated distribution of effect sizes. This is what we did in our recent article, van Zwet et al., “A new look at p-values for randomized clinical trials,” using the Cochrane database of medical trials.

Both interpretations 1 and 2 are valid! No need to think of interpretation 2 as a threat to interpretation 1, or vice versa. It’s the same p-value, we’re just understanding it by averaging over different predictive distributions.
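To see the two averagings side by side, here is a minimal simulation sketch. The normal effect-size distribution in the second half is a made-up stand-in for the kind of empirically estimated distribution that interpretation 2 calls for; it is not the distribution fitted to the Cochrane database.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 200_000
se = 1.0  # standard error of each estimate, fixed at 1 for simplicity

# Interpretation 1: condition on a null hypothesis of zero effect.
# Then z = estimate/se is standard normal and two-sided p-values are uniform.
z_null = rng.normal(0.0, se, n_sims) / se
p_null = 2 * stats.norm.sf(np.abs(z_null))
print("P(p < 0.05 | null):", (p_null < 0.05).mean())   # roughly 0.05

# Interpretation 2: average over an assumed population of true effects
# (a made-up normal, standing in for an empirically estimated distribution).
theta = rng.normal(0.0, 1.5 * se, n_sims)               # true effects
z_pop = (theta + rng.normal(0.0, se, n_sims)) / se      # observed z-scores
p_pop = 2 * stats.norm.sf(np.abs(z_pop))
sig = p_pop < 0.05

# Same statistic, same cutoff, different averaging: now we can ask things
# like "among p < 0.05 results, how often is the sign of the estimate wrong?"
wrong_sign = np.sign(z_pop) != np.sign(theta)
print("P(p < 0.05) over the assumed population:", sig.mean())
print("P(wrong sign | p < 0.05):", wrong_sign[sig].mean())
```

Same p-value, same 0.05 cutoff; the only thing that changes is the predictive distribution you average over, and with it the questions you can sensibly ask.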

What to do with all this theory and empiricism is another question, and there is a legitimate case to be made that following procedures derived from interpretation 2 could lead to worse scientific outcomes, just as of course there is a strong case to be made that procedures derived from interpretation 1 have already led to bad scientific outcomes.

Following that logic, one could argue that interpretation 1, or interpretation 2, or both, are themselves pernicious in leading, inexorably or with high probability, toward these bad outcomes. One can continue with the statement that interpretation 1, or interpretation 2, or both, have intellectual or institutional support that props them up and allows the related bad procedures to continue; various people benefit from these theories, procedures, and outcomes.

To the extent there are, or should be, disputes about p-values, I think such disputes should focus on the bad outcomes for which there is concern, not on the p-values themselves or on interpretations 1 and 2, both of which are mathematically valid and empirically supported within their zones of interpretation.

The odd non-spamness of some spam comments

I checked the spam filter this morning and came across a new comment on an old post.

It was a reasonable comment. Not an amazing contribution to the discussion, but not completely nothing, either. And I would have approved it—except that the url supplied by the commenter was a spam link. Or, maybe not spam, just some business that had no connection to anything that ever appears on the blog. In any case, I shot the comment into oblivion. I don’t want to be hosting or encouraging spam—at least, not for free!

We get this kind of comment from time to time, and it always makes me wonder: When people do this, are they coming to the blog with an intent to spam (or, one could say, to advertise their wares) and then they write some minimal comment in the hope that it gets approved? Or are they coming to the blog to write a comment, and then they figure they might as well get some benefit out of it so they throw in the spam link? I have no idea.

Here’s my excuse for using obsolete, sub-optimal, or inadequate statistical methods or using a method irresponsibly.

E. J. Wagenmakers writes:

My colleague Klaas Sijtsma wrote a book titled “Never waste a good crisis” and I [E. J.] am reviewing it here.

His main claim is that questionable research practices would be much reduced if researchers would seek advice or collaborate with professional statisticians or methodologists. I thought it was refreshing to see it put so clearly. Maybe this is a claim that can be empirically tested. Anyhow, you might have an opinion on it and I thought it could be something for your blog.

The review is interesting, and the reviewed book looks interesting too.

But I disagree with this quoted statement from the book: “There is little if any excuse for using obsolete, sub-optimal, or inadequate statistical methods or using a method irresponsibly.”

Don’t forget about the concept of bounded rationality! It can be costly for an applied researcher, or even a statistical expert, to tool up and learn how to correctly use up-to-date and optimal methods (even if we accept the dubious idea that there is, in general, an “optimal” statistical method for any given applied problem).

That’s what happens when you try to run the world while excluding 99.8% of the population

Daniel Immerwahr writes:

Of the C.I.A.’s thirty-eight Soviet analysts in 1948, only twelve knew any Russian.

Whaaaaa?

There were a lot of Russian speakers in the U.S. in 1948. From Immerwahr’s article, my impression is that the problem was that the CIA was restricting itself to upper-class Ivy League types—and not many of those people knew Russian.

Assuming the story of the 26 out of 38 non-Russian-speaking analysts is correct (it’s so hard to believe, maybe the reporter got this one wrong?), wow. An amazing example of the narrowness of America’s ruling class.

Can we put a number on it? I did a quick google to see how many students were graduating from the Ivy League at that time. I couldn’t find any convenient table or graph, but I did come across this news article from 1950 saying that Harvard was giving out 1144 undergraduate degrees that spring. I can’t readily find the numbers from Yale and the others, but let’s just say that there were roughly 5000 Ivy League graduates that year. To compare to the general population . . . 2.2 million Americans were born in 1930. So, roughly (only a rough approximation because I’m excluding the foreign-born from the denominator), if you restrict your recruiting to the Ivy League, you’re only targeting 0.2%, or 1/440th of the population. And that other 99.8% is where most of the Russian-speakers are hanging out.

OK, those 99.8% would’ve included lots of incompetent people. The CIA couldn’t just hire at random; they’d need to do some interviewing. But the 0.2% seems to have contained a fair number of incompetents too! Maybe screening based on social class and grades in school wasn’t such a great idea.

I’m reminded of some procedures in statistics, where researchers screen their results using strict statistical significance thresholding based on noisy data. A big selling point of this approach, beyond its apparent guarantee of rigor, is that it has enough researcher degrees of freedom that you can get whatever result you want out of it. The other selling point is that you can take what you find as having great scientific merit, as it has survived this very difficult selection process.
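For anyone who hasn’t seen this play out, here is a minimal simulation of that screening process, with made-up numbers: a small true effect measured with a large standard error, kept only if it clears p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Made-up numbers: a small true effect measured with a large standard error.
true_effect = 0.1
se = 0.5
n_sims = 100_000

estimate = rng.normal(true_effect, se, n_sims)        # noisy study estimates
p = 2 * stats.norm.sf(np.abs(estimate) / se)          # two-sided p-values
survives = p < 0.05                                   # the screening step

print("fraction surviving the screen:", survives.mean())
print("mean |estimate| among survivors:", np.abs(estimate[survives]).mean())
print("true effect:", true_effect)
print("fraction of survivors with the wrong sign:",
      (estimate[survives] * true_effect < 0).mean())
```

With these made-up numbers, only about five percent of the studies survive, the surviving estimates are roughly ten times too large in magnitude, and something like a quarter of them have the wrong sign. The screen doesn’t find the signal; it selects the noise that happens to look like a big signal.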

Kind of like hiring an Ivy League graduate in 1948. He may not speak Russian, but he’s one of nature’s noblemen.