Evidence, desire, support

I keep worrying, as with a loose tooth, about news media elites who are going for the UFOs-as-space-aliens theory. This one falls halfway between election denial (too upsetting for me to want to think about too often) and belief in ghosts (too weird to take seriously).

I was also thinking about the movie JFK, which I saw when it came out in 1991. As a reader of the newspapers, I knew that the narrative pushed in the movie was iffy, to say the least; still, I watched the movie intently—I wanted to believe. In the same way that in the 1970s I wanted to believe those claims that dolphins are smarter than people, or that millions of people wanted to believe in the Bermuda Triangle or ancient astronauts or Noah’s Ark or other fringe ideas that were big in that decade. None of those particular ideas appealed to me.

Anyway, this all got me thinking about what it takes for someone to believe in something. My current thinking is that belief requires some mixture of the following three things:
1. Evidence
2. Desire
3. Support

To go through these briefly:

1. I’m using the term “evidence” in a general sense to include things you directly observe and also convincing arguments of some sort or another. Evidence can be ambiguous and, much to people’s confusion, it doesn’t always point in the same direction. The unusual trajectory of Oswald’s bullet is a form of evidence, even though not as strong as has been claimed by conspiracy theorists. The notorious psychology paper from 2011 is evidence for ESP. It’s weak evidence, really no evidence at all for anything beyond the low standards of academic psychology at the time, but it played the role of evidence for people who were interested in or open to believing.

2. By “desire,” I mean a desire to believe in the proposition at hand. There can be complicated reasons for this desire. Why did I have some desire in 1991 to believe the fake JFK story, even though I knew ahead of time it was suspect? Maybe because it helped make sense of the world? Maybe because, if I could believe the story, I could go with the flow of the movie and feel some righteous anger? I don’t really know. Why do some media insiders seem to have the desire to believe that UFOs are space aliens? Maybe because space aliens are cool, maybe because, if the theory is true, then these writers are in on the ground floor of something big, maybe because the theory is a poke in the eye at official experts, maybe all sorts of things.

3. “Support” refers to whatever social environment you’re in. 30% of Americans believe in ghosts, and belief in ghosts seems to be generally socially acceptable—I’ve heard people from all walks of life express the belief—but there are some places where it’s not taken seriously, such as in the physics department. The position of ghost-belief within the news media is complicated, typically walking a fine line to avoid expressing belief or disbelief. For example, a quick search of *ghosts npr* led to this from the radio reporter:

I’m pretty sure I don’t believe in ghosts. Now, I say pretty sure because I want to leave the possibility open. There have definitely been times when I felt the presence of my parents who’ve both died, like when one of their favorite songs comes on when I’m walking the aisles of the grocery store, or when the wind chime that my mom gave me sings a song even though there’s no breeze. But straight-up ghosts, like seeing spirits, is that real? Can that happen?

This is kind of typical. It’s a news story that’s pro-ghosts, reports a purported ghost sighting with no pushback, but there’s that kinda disclaimer too. It’s similar to reporting on religion. Different religions contradict each other, and so if you want to report in a way that’s respectful of religion, you have to place yourself in a no-belief-yet-no-criticism mode: if you have a story about religion X, you can’t push back (“Did you really see the Lord smite that goat in your backyard that day?”) because that could offend adherents of that religion, but you can’t fully go with it, as that could offend adherents of every other religion.

I won’t say that all three of evidence, desire, and support are required for belief, just that they can all contribute. We can see this with some edge cases. That psychologist who published the terrible paper on ESP: he had a strong desire to believe, a strong enough desire to motivate an entire research program on his part. There was also a little bit of institutional support for the belief. Not a lot—ESP is a fringe take that would be, at best, mocked by most academic psychologists, it’s a belief that has much lower standing now than it did fifty years ago—but some. Anyway, the strong desire was enough, along with the terrible-but-nonzero evidence and the small-but-nonzero support. Another example would be Arthur Conan Doyle believing those ridiculous faked fairy photos: spiritualism was big in society at the time, so he had strong social support as well as strong desire to believe. In other cases, evidence is king, but without the institutional support it can be difficult for people to be convinced. Think of all those “they all laughed, but . . .” stories of scientific successes under adversity: continental drift and all the rest.

As we discussed in an earlier post, the “support” thing seems like a big change regarding the elite media and UFOs-as-space-aliens. The evidence for space aliens, such as it is—blurry photographs, eyewitness testimony, suspiciously missing government records, and all the rest—has been with us for half a century. The desire to believe has been out there too for a long time. What’s new is the support: some true believers managed to insert the space aliens thing into the major news media in a way that gives permission to wanna-believers to lean into the story.

I don’t have anything more to say on this right now, just trying to make sense of it all. This all has obvious relevance to political conspiracy theories, where authority figures can validate an idea, which then gives permission for other wanna-believers to push it.

“He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it.”

Ron Bloom points us to this wonderful article, “The Ethics of Belief,” by the mathematician William Clifford, also known for Clifford algebras. The article is related to some things I’ve written about evidence vs. truth (see here and here) but much more beautifully put. Here’s how it begins:

A shipowner was about to send to sea an emigrant-ship. He knew that she was old, and not overwell built at the first; that she had seen many seas and climes, and often had needed repairs. Doubts had been suggested to him that possibly she was not seaworthy. These doubts preyed upon his mind, and made him unhappy; he thought that perhaps he ought to have her thoroughly overhauled and refitted, even though this should put him to great expense. Before the ship sailed, however, he succeeded in overcoming these melancholy reflections. He said to himself that she had gone safely through so many voyages and weathered so many storms that it was idle to suppose she would not come safely home from this trip also. He would put his trust in Providence, which could hardly fail to protect all these unhappy families that were leaving their fatherland to seek for better times elsewhere. He would dismiss from his mind all ungenerous suspicions about the honesty of builders and contractors. In such ways he acquired a sincere and comfortable conviction that his vessel was thoroughly safe and seaworthy; he watched her departure with a light heart, and benevolent wishes for the success of the exiles in their strange new home that was to be; and he got his insurance-money when she went down in mid-ocean and told no tales.

What shall we say of him? Surely this, that he was verily guilty of the death of those men. It is admitted that he did sincerely believe in the soundness of his ship; but the sincerity of his conviction can in no wise help him, because he had no right to believe on such evidence as was before him. He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it.

Clifford’s article is from 1877!

Bloom writes:

One can go over this in two passes. One pass may be read as “moral philosophy.”

But the second pass helps one think a bit about how one ought to make precise the concept of ‘relevance’ in “relevant evidence.”

Specifically (this is remarkably deficient in the Bayesian corpus I find) I would argue that when we say “all probabilities are relative to evidence” and write the symbolic form straightaway P(A|E) we are cheating. We have not faced the fact — I think — that not every “E” has any bearing (“relevance”) one way or another on A and that it is *inadmissible* to combine the symbols because it is so easy to write ’em down. Perhaps one evades the problem by saying, well what do you *think* is the case. Perhaps you might say, “I think that E is irrelevant if P(A|E) = P(A|~E).” But that begs the question: it says in effect that *both* E and ~E can be regarded as “evidence” for A. I argue that easily leads to nonsense. To regard any utterance or claim as “evidence” for any other utterance or claim leads to absurdities. Here for instance:

A = “Water ice of sufficient quantity to maintain a lunar base will be found in the spectral analysis of the plume of the crashed lunar polar orbiter.”

E = If there are martians living on the Moon of Jupiter, Europa, then they celebrate their Martian Christmas by eating Martian toast with Martian jam.

Is E evidence for A? is ~E evidence for A? Is any far-fetched hypothetical evidence for any other hypothetical whatsoever?

Just to provide some “evidence” that I am not being entirely facetious about the Lunar orbiter; I attach also a link to a now much superannuated item concerning that very intricate “experiment” — I believe in the end there was some spectral evidence turned up consistent with something like a teaspoon’s worth of water-ice per 25 square km.
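As an aside, the irrelevance criterion Bloom mentions, P(A|E) = P(A|~E), is easy to play with numerically. Here is a toy sketch in R; the joint probabilities are made up purely for illustration and are not meant to say anything about lunar ice or Europa:

```r
# Toy check of the irrelevance criterion P(A|E) = P(A|~E), with made-up numbers.
# Joint probabilities over A ("ice is found") and E (some unrelated claim):
p_joint <- c(A_and_E = 0.15, A_and_notE = 0.15, notA_and_E = 0.35, notA_and_notE = 0.35)

p_A_given_E    <- p_joint["A_and_E"]    / (p_joint["A_and_E"]    + p_joint["notA_and_E"])
p_A_given_notE <- p_joint["A_and_notE"] / (p_joint["A_and_notE"] + p_joint["notA_and_notE"])

c(p_A_given_E, p_A_given_notE)  # both 0.3, so by this criterion E is "irrelevant" to A
```

Bloom’s point, as I read it, is that passing this mechanical check doesn’t settle whether it made any sense to treat E as “evidence” about A in the first place.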

P.S. Just to make the connection super-clear, I’d say that Clifford’s characterization, “He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it,” is an excellent description of those Harvard professors who notoriously endorsed the statement, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” Also a good match to those Columbia administrators who signed off on those U.S. News numbers. In neither case did a ship go down; it’s the same philosophical principle but lower stakes. Just millions of dollars involved, no lives lost.

As Isaac Asimov put it, “A robot may not injure a human being or, through inaction, allow a human being to come to harm.” Sometimes that inaction is pretty damn active, when a shipowner or a scientific researcher or a university administrator puts in some extra effort to avoid looking at some pretty clear criticisms.

Here’s something you should do when beginning a project, and in the middle of a project, and at the end of the project: Clearly specify your goals, and also specify what’s not in your goal set.

Here’s something from Witold’s slides on baggr, an R package (built on Stan) that does hierarchical modeling for meta-analysis:

Overall goals:

1. Implement all basic meta-analysis models and tools
2. Focus on accessibility, model criticism and comparison
3. Help people avoid basic mistakes
4. Keep the framework flexible and extend to more models

(Probably) not our goal:

5. Build a package for people who already build their models in Stan

I really like this practice of specifying goals. This is so basic that it seems like we should always be doing it—but so often we don’t! Also I like the bit where he specifies something that’s not in his goals.

Again, this all seems so natural when we see it, but it’s something we don’t usually do. We should.

There is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.

Following our recent post on the latest Dishonestygate scandal, we got into a discussion of the challenges of simulating fake data and performing a pre-analysis before conducting an experiment.

You can see it all in the comments to that post—but not everybody reads the comments, so I wanted to repeat our discussion here. Especially the last line, which I’ve used as the title of this post.

Raphael pointed out that it can take some work to create a realistic simulation of fake data:

Do you mean to create a dummy dataset and then run the preregistered analysis? I like the idea, and I do it myself, but I don’t see how this would help me see if the endeavour is doomed from the start? I remember your post on the beauty-and-sex ratio, which proved that the sample size was far too small to find an effect of such small magnitude (or was it in the Type S/Type M paper?). I can see how this would work in an experimental setting – simulate a bunch of data sets, do your analysis, compare it to the true effect of the data generation process. But how do I apply this to observational data, especially with a large number of variables (number of interactions scales in O(p²))?

I elaborated:

Yes, that’s what I’m suggesting: create a dummy dataset and then run the preregistered analysis. Not the preregistered analysis that was used for this particular study, as that plan is so flawed that the authors themselves don’t seem to have followed it, but a reasonable plan. And that’s kind of the point: if your pre-analysis plan isn’t just a bunch of words but also some actual computation, then you might see the problems.

In answer to your second question, you say, “I can see how this would work in an experimental setting,” and we’re talking about an experiment here, so, yes, it would’ve been better to have simulated data and performed an analysis on the simulated data. This would require the effort of hypothesizing effect sizes, but that’s a bit of effort that should always be done when planning a study.

For an observational study, you can still simulate data; it just takes more work! One approach I’ve used, if I’m planning to fit data predicting some variable y from a bunch of predictors x, is to get the values of x from some pre-existing dataset, for example an old survey, and then just do the simulation part for y given x.
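To make that concrete, here is a minimal sketch of the kind of fake-data simulation I have in mind for a simple two-arm experiment. The effect size, sample size, and variable names below are placeholders, not values from any particular study; the point is that writing the pre-analysis plan as code forces you to make these assumptions explicit:

```r
# Fake-data simulation for a pre-analysis plan: assume an effect size, simulate
# the experiment many times, run the planned analysis, and see what you'd learn.
# All numbers here are hypothetical placeholders.

set.seed(123)
n_per_arm   <- 100     # planned sample size per arm
true_effect <- 0.1     # hypothesized effect, in sd units of the outcome

one_sim <- function() {
  y_control   <- rnorm(n_per_arm, 0, 1)
  y_treatment <- rnorm(n_per_arm, true_effect, 1)
  fit <- lm(y ~ z, data = data.frame(y = c(y_control, y_treatment),
                                     z = rep(0:1, each = n_per_arm)))
  coef(summary(fit))["z", c("Estimate", "Std. Error")]
}

sims <- t(replicate(1000, one_sim()))
colMeans(sims)              # average estimate and standard error across simulations
mean(abs(sims[, "Estimate"]) > 2 * sims[, "Std. Error"])  # rough power at this design
```

If the simulated standard errors dwarf the hypothesized effect, you’ve learned before collecting any data that the planned study can’t answer the question.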

Raphael replied:

Maybe not the silver bullet I had hoped for, but now I believe I understand what you mean.

To which I responded:

There is no silver bullet; there is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.

Again, this is not a diss on preregistration. Preregistration does one thing; it’s not intended to fix bad aspects of the culture of science, such as the idea that you can gather a pile of data, grab some results, declare victory, and go on the Ted talk circuit based only on the very slender bit of evidence that you seem to have been able to reject the hypothesis that the data came from a specific random number generator. That line of reasoning, where rejection of straw-man null hypothesis A is taken as evidence in favor of preferred alternative B, is wrong—but it’s not preregistration’s fault that people think that way!

P-hacking can be bad (but the problem here, in my view, is not in performing multiple analyses but rather in reporting only one of them rather than analyzing them all together); various questionable research practices are, well, questionable; and preregistration can help with that, either directly (by motivating researchers to follow a clear plan) or indirectly (by allowing outsiders to see problems in post-publication review, as here).

I am, however, bothered by the focus on procedural/statistical “rigor-enhancing practices” of “confirmatory tests, large sample sizes, preregistration, and methodological transparency.” Again, the problem is if researchers mistakenly think that following such advice will place them back on that nonexistent golden path to discovery.

So, again, I recommend making assumptions, simulating fake data, and analyzing these data as a way of constructing a pre-analysis plan, before collecting any data. That won’t put you on the golden path to discovery either!

All I can offer you here is blood, toil, tears and sweat, along with the possibility that a careful process of assumptions/simulation/pre-analysis will allow you to avoid disasters such as this ahead of time, thus avoiding the consequences of: (a) fooling yourself into thinking you’ve made a discovery, (b) wasting the time and effort of participants, coauthors, reviewers, and postpublication reviewers (that’s me!), and (c) filling the literature with junk that will later be collected in a GIGO meta-analysis and promoted by the usual array of science celebrities, podcasters, and NPR reporters.

Aaaaand . . . the time you’ve saved from all of that can be repurposed into designing more careful experiments with clearer connections between theory and measurement. Not a glide along the golden path to a discovery; more of a hacking through the jungle of reality to obtain some occasional glimpses of the sky.

Bad parenting in the news, also, yeah, lots of kids don’t believe in Santa Claus

A recent issue of the New Yorker had two striking stories of bad parenting.

Margaret Talbot reported on a child/adolescent-care center in Austria from the 1970s that was run by former Nazis who were basically torturing the kids. This happened for decades. The focus of the story was a girl whose foster parents had abused her before sending her to this place. The creepiest thing about all of this was how normal it all seemed. Not normal to me, but normal to that society: abusive parents, abusive orphanage, abusive doctors, all of which fit into an authoritarian society. Better parenting would’ve helped, but it seems that all of these people were trapped in a horrible system, supported by an entrenched network of religious, social, and political influences.

In that same issue of the magazine, Sheelah Kolhatkar wrote about the parents of crypto-fraudster Sam Bankman-Fried. This one was sad in a different way. I imagine that most parents don’t want their children to grow up to be criminals, but such things happen. The part of the story that seemed particularly sad to me was how the parents involved themselves in their son’s crimes. They didn’t just passively accept it—which would be bad enough, but, sure, sometimes kids just won’t listen and they need to learn their lessons on their own—they very directly got involved, indeed profited from the criminal activity. What kind of message is that to send to your child? In some ways this is similar to the Austrian situation, in that the adults involved were so convinced of their moral righteousness. Anyway, it’s gotta be heartbreaking to realize that, not only did you not stop your child’s slide into crime, you actually participated in it.

Around the same time, the London Review of Books ran an article which motivated me to write them this letter:

Dear editors,

In his article in the 2 Nov 2023 issue, John Lanchester writes that financial fraudster Sam Bankman-Fried “grew up aware that his mind worked differently from most people’s. Even as a child he thought that the whole idea of Santa Claus was ridiculous.” I don’t know what things are like in England, but here in the United States it’s pretty common for kids to know that Santa Claus is a fictional character.

More generally, I see a problem with the idealization of rich people. It’s not enough to say that Bankman-Fried was well-connected, good at math, and had a lack of scruple that can be helpful in many aspects of life. He also has to be described as being special, so much that a completely normal disbelief in the reality of Santa Claus is taken as a sign of how exceptional he is.

Another example is Bankman-Fried’s willingness to gamble his fortune in the hope of even greater riches, which Lanchester attributes to the philosophy of effective altruism, rather than characterizing it as simple greed.

Yours

Andrew Gelman
New York

They’ve published my letters before (here and here), but not this time. I just hope that in the future they don’t take childhood disbelief in Santa Claus as a signal of specialness, or attribute a rich person’s desire for even more money to some sort of unusual philosophy.

Every time Tyler Cowen says, “Median voter theorem still underrated! Hail Anthony Downs!”, I’m gonna point him to this paper . . .

Here’s Cowen’s post, and here’s our paper:

Moderation in the pursuit of moderation is no vice: the clear but limited advantages to being a moderate for Congressional elections

Andrew Gelman and Jonathan N. Katz

September 18, 2007

It is sometimes believed that it is politically risky for a congressmember to go against his or her party. On the other hand, Downs’s familiar theory of electoral competition holds that political moderation is a vote-getter. We analyze recent Congressional elections and find that moderation is typically worth less than about 2% of the vote. This suggests there is a motivation to be moderate, but not to the exclusion of other political concerns, especially in non-marginal districts. . . .

Conformal prediction and people

This is Jessica. A couple weeks ago I wrote a post in response to Ben Recht’s critique of conformal prediction for quantifying uncertainty in a prediction. Compared to Ben, I am more open-minded about conformal prediction and associated generalizations like conformal risk control. Quantified uncertainty is inherently incomplete as an expression of the true limits of our knowledge, but I still often find value in trying to quantify it over stopping at a point estimate.

If expressions of uncertainty are generally wrong in some ways but still sometimes useful, then we should be interested in how people interact with different approaches to quantifying uncertainty. So I’m interested in seeing how people use conformal prediction sets relative to alternatives. This isn’t to say that I think conformal approaches can’t be useful without being human-facing (which is the direction of some recent work on conformal decision theory). I just don’t think I would have spent the last ten years thinking about how people interact and make decisions with data and models if I didn’t believe that they need to be involved in many decision processes. 

So now I want to discuss what we know from the handful of controlled studies that have looked at human use of prediction sets, starting with the one I’m most familiar with since it’s from my lab.

In Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling, we study people making decisions with the assistance of a predictive model. Specifically, they label images with access to predictions from a pre-trained computer vision model. In keeping with the theme that real world conditions may deviate from expectations, we consider two scenarios: one where the model makes highly accurate predictions because the new images are from the same distribution as those that the model is trained on, and one where the new images are out of distribution. 

We compared their accuracy and the distance between their responses and the true label (in the Wordnet hierarchy, which conveniently maps to ImageNet) across four display conditions. One was no assistance at all, so we could benchmark unaided human accuracy against model accuracy for our setting. People were generally worse than the model in this setting, though the human with AI assistance was able to do better than the model alone in a few cases.

The other three displays were variations on model assistance, including the model’s top prediction with the softmax probability, the top 10 model predictions with softmax probabilities, and a prediction set generated using split conformal prediction with 95% coverage.
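For readers who haven’t seen how such sets are built, here is a minimal sketch of split conformal prediction for classification, using the common “one minus the softmax probability of the true class” nonconformity score. This is not our study code; the object names (cal_scores, cal_labels, new_scores) are placeholders:

```r
# Split conformal prediction sets for classification: a minimal sketch.
# Assumes a held-out calibration set with model softmax scores and true labels.
# cal_scores: n x K matrix of softmax probabilities; cal_labels: true classes (1..K);
# new_scores: length-K softmax vector for a new image. These names are hypothetical.

conformal_set <- function(cal_scores, cal_labels, new_scores, alpha = 0.05) {
  n <- nrow(cal_scores)
  # Nonconformity score: 1 minus the probability the model assigned to the true class
  s <- 1 - cal_scores[cbind(seq_len(n), cal_labels)]
  # Finite-sample-corrected quantile of the calibration scores
  level <- min(1, ceiling((n + 1) * (1 - alpha)) / n)
  qhat <- quantile(s, probs = level, type = 1)
  # Prediction set: all classes whose softmax probability is at least 1 - qhat
  which(new_scores >= 1 - qhat)
}
```

The coverage guarantee comes from the exchangeability of the calibration data and the new instance, which is exactly what gets strained when the new images are out of distribution.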

We calibrated the prediction sets we presented offline, not dynamically. Because the human is making decisions conditional on the model predictions, we should expect the distribution to change. But often we aren’t going to be able to calibrate adaptively because we don’t immediately observe the ground truth. And even if we do, at any particular point in time we could still be said to hover on the boundary of having useful prior information and steering things off course. So when we introduce a new uncertainty quantification to any human decision setting, we should be concerned with how it works when the setting is as expected and when it’s not, i.e., the guarantees may be misleading.

Our study partially gets at this. Ideally we would have tested some cases where the stated coverage guarantee for the prediction sets was false. But for the out-of-distribution images we generated, we would have had to do a lot of cherry-picking of stimuli to break the conformal coverage guarantee as much as the top-1 coverage broke. The coverage degraded a little but stayed pretty high over the entire set of out-of-distribution instances for the types of perturbations we focused on (>80%, compared to 70% for top 10 and 43% for top 1). For the set of stimuli we actually tested, the coverage for all three was a bit higher, with top-1 coverage getting the biggest bump (70%, compared to 83% for top 10 and 95% for conformal). Below are some examples of the images people were classifying (where easy and hard is based on the cross-entropy loss given the model’s predicted probabilities, and smaller and larger refers to the size of the prediction sets).

We find that prediction sets don’t offer much value over top-1 or top-10 displays when the test instances are iid, and they can reduce accuracy on average for some types of instances. However, when the test instances are out of distribution, accuracy is slightly higher with access to prediction sets than with either top-k. This was the case even though the prediction sets for the OOD instances get very large (the average set size for “easy” OOD instances, as defined by the distribution of softmax values, was ~17, for “hard” OOD instances it was ~61, with people sometimes seeing sets with over 100 items). For the in-distribution cases, average set size was about 11 for the easy instances, and 30 for the hard ones.  

Based on the differences in coverage across the conditions we studied, our results are more likely to be informative for situations where conformal prediction is used because we think it’s going to degrade more gracefully under unexpected shifts. I’m not sure it’s reasonable to assume we’d have a good hunch about that in practice though.

In designing this experiment in discussion with my co-authors, and thinking more about the value of conformal prediction to model-assisted human decisions, I’ve been thinking about when a “bad” (in the sense of coming with a misleading guarantee) interval might still be better than no uncertainty quantification. I was recently reading Paul Meehl’s Clinical versus Statistical Prediction, where he contrasts the clinical judgments doctors make based on intuitive reasoning with statistical judgments informed by randomized controlled experiments. He references a distinction between the “context of discovery,” where some internal sense of probability leads to a decision like a diagnosis, and the “context of verification,” where we collect the data we need to verify the quality of a prediction.

The clinician may be led, as in the present instance, to a guess which turns out to be correct because his brain is capable of that special “noticing the unusual” and “isolating the pattern” which is at present not characteristic of the traditional statistical techniques. Once he has been so led to a formulable sort of guess, we can check up on him actuarially. 

Thinking about the ways prediction intervals can affect decisions makes me think that whenever we’re dealing with humans, there’s potentially going to be a difference between what an uncertainty expression says and can guarantee and the value of that expression for the decision-maker. Quantifications with bad guarantees can still be useful if they change the context of discovery in ways that promote broader thinking or taking the idea of uncertainty seriously. This is what I meant when in my last post I said “the meaning of an uncertainty quantification depends on its use.” But precisely articulating how they do this is hard. It’s much easier to identify ways calibration can break.

There are a few other studies that look at human use of conformal prediction sets, but to avoid making this post even longer, I’ll summarize them in an upcoming post.

P.S. There have been a few other interesting posts on uncertainty quantification in the CS blogosphere recently, including David Stutz’s response to Ben’s remarks about conformal prediction, and on designing uncertainty quantification for decision making from Aaron Roth.

Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through Youtube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the pre-specified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.
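As a sketch of what I mean by fitting all reasonable specifications and looking at the whole set of estimates, here is a minimal multiverse-style example in R. This is not the authors’ analysis; the data frame county_data and its variables (vaccines, treated, baseline_rate, pop) are hypothetical:

```r
# Multiverse-style check: fit several reasonable specifications and report all of
# the estimates together instead of selecting one. The data frame county_data and
# its variables (vaccines, treated, baseline_rate, pop) are hypothetical.

specs <- list(
  prespecified    = vaccines ~ treated,
  with_covariates = vaccines ~ treated + baseline_rate + pop,
  log_outcome     = log1p(vaccines) ~ treated + baseline_rate + pop
)

fits <- lapply(specs, lm, data = county_data)

# Treatment-effect estimate and standard error from each specification
# (treated is assumed to be a 0/1 indicator, so its coefficient is named "treated")
results <- t(sapply(fits, function(f) coef(summary(f))["treated", 1:2]))
print(results)
```

A hierarchical model that partially pools across specifications would be better still, but even this simple table makes it harder to quietly report only the most favorable estimate.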

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (that would be 0.2 when following the standard approach) because they have “low signal-to-noise ratios” . . . that’s just wack. Low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend this policy be done in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding thing just seems to be missing the point, in that it’s trying to squeeze out an expression of near-certainty from data that don’t admit such an interpretation.

P.S. I do not write this sort of post out of any sort of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or to otherwise reach some sort of success point. It should be to learn what we can from our data and model, and to also get a sense of what we don’t know.

Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.

Defining optimal reliance on model predictions in AI-assisted decisions

This is Jessica. In a previous post I mentioned methodological problems with studies of AI-assisted decision-making, such as are used to evaluate different model explanation strategies. The typical study set-up gives people some decision task (e.g., Given the features of this defendant, decide whether to convict or release), has them make their decision, then gives them access to a model’s prediction, and observes if they change their mind. Studying this kind of AI-assisted decision task is of interest as organizations deploy predictive models to assist human decision-making in domains like medicine and criminal justice. Ideally, the human is able to use the model to improve the performance they’d get on their own or if the model was deployed without a human in the loop (referred to as complementarity). 

The most frequently used definition of appropriate reliance is that if the person goes with the model prediction but it’s wrong, this is overreliance. If they don’t go with the model prediction but it’s right, this is labeled underreliance. Otherwise it is labeled appropriate reliance. 

This definition is problematic for several reasons. One is that the AI might have a higher probability than the human of selecting the right action, but still end up being wrong. It doesn’t make sense to say the human made the wrong choice by following it in such cases. Because it’s based on post-hoc correctness, this approach confounds two sources of non-optimal human behavior: not accurately estimating the probability that the AI is correct versus not making the right choice of whether to go with the AI or not given one’s beliefs. 

By scoring decisions in action space, it also equally penalizes not choosing the right action (which prediction to go with) in a scenario where the human and the AI have very similar probabilities of being correct and one where either the AI or human has a much higher probability of being correct. Nevertheless, there are many papers doing it this way, some with hundreds of citations. 

In A Statistical Framework for Measuring AI Reliance, Ziyang Guo, Yifan Wu, Jason Hartline and I write: 

Humans frequently make decisions with the aid of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human who retains control over the final decision. Researchers have identified ensuring that a human has appropriate reliance on an AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates the concepts of reliance as the probability the decision-maker follows the AI’s prediction from challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision making studies from literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not accurately differentiating the signals. We evaluate these losses by comparing to a baseline and a benchmark for complementary performance defined by the expected payoff achieved by a rational agent facing the same decision task as the behavioral agents.

It’s a similar approach to our rational agent framework for data visualization, but here we assume a setup in which the decision-maker receives a signal consisting of the feature values for some instance, the AI’s prediction, the human’s prediction, and optionally some explanation of the AI decision. The decision-maker chooses which prediction to go with. 

We can compute the upper bound or best attainable performance in such a study (rational benchmark) as the expected score of a rational decision-maker on a randomly drawn decision task. The rational decision-maker has prior knowledge of the data generating model (the joint distribution over the signal and ground truth state). Seeing the instance in a decision trial, they accurately perceive the signal, arrive at Bayesian posterior beliefs about the distribution of the payoff-relevant state, then choose the action that maximizes their expected utility over the posterior. We calculate this in payoff space as defined by the scoring rule, such that the cost of an error can vary in magnitude. 

We can define the “value of rational complementation” for the decision-problem at hand by also defining the rational agent baseline: the expected performance of the rational decision-maker without access to the signal on a randomly chosen decision task from the experiment. Because it represents the score the rational agent would get if they could rely only on their prior beliefs about the data-generating model, the baseline is the expected score of a fixed strategy that always chooses the better of the human alone or the AI alone.

[Figure: human alone, then AI alone, then (with ample room between them) the rational benchmark.]

If designing or interpreting an experiment on AI reliance, the first thing we might want to do is look at how close to the benchmark the baseline is. We want to see a decent amount of room for the human-AI team to improve performance over the baseline, as in the image above. If the baseline is very close to the benchmark, it’s probably not worth adding the human.

Once we have run an experiment and observed how well people make these decisions, we can treat the value of complementation as a comparative unit for interpreting how much value adding the human contributes over making the decision with the baseline. We do this by normalizing the observed score within the range where the rational agent baseline is 0 and the rational agent benchmark is 1 and looking at where the observed human+AI performance lies. This also provides a useful sense of effect size when we are comparing different settings. For example, if we have two model explanation strategies A and B we compared in an experiment, we can calculate expected human performance on a randomly drawn decision trial under A and under B and measure the improvement by calculating (score_A − score_B)/value of complementation. 

[Figure: human alone, then AI alone, then human+AI, then the rational benchmark.]

We can also decompose sources of error in study participants’ performance. To do this, we define a “mis-reliant” rational decision-maker benchmark, which is the expected score of a rational agent constrained to the reliance level that we observe in study participants. Hence this is the best score a decision-maker who relies on the AI the same overall proportion of the time could attain had they perfectly perceived the probability that the AI is correct relative to the probability that the human is correct on every decision task. Since the mis-reliant benchmark and the study participants have the same reliance level (i.e., they both accept the AI’s prediction the same percentage of the time), the difference in their decisions lies entirely in accepting the AI predictions at different instances. The mis-reliant rational decision-maker always accepts the top X% AI predictions ranked by performance advantage over human predictions, but study participants may not.

[Figure: human alone, AI alone, human+AI, the mis-reliant rational benchmark, and the rational benchmark.]

By calculating the mis-reliant rational benchmark for the observed reliance level of study participants, we can distinguish between reliance loss, the loss from over- or under-relying on the AI (defined as the difference between the rational benchmark and mis-reliant benchmark divided by the value of rational complementation), and discrimination loss, the loss from not accurately differentiating the instances where the AI is better than the human from the ones where the human is better than the AI (defined as the difference between the mis-reliant benchmark and the expected score of participants divided by the value of rational complementation). 
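To make the bookkeeping concrete, here is a small sketch of how the normalization and the decomposition work once you have the various expected scores in hand. The numbers below are placeholders, not values from any of the studies we analyze:

```r
# Hypothetical expected scores, all in payoff space as defined by the scoring rule
baseline   <- 0.70   # rational agent baseline: better of human alone or AI alone
benchmark  <- 0.90   # rational agent benchmark: full use of the signal
misreliant <- 0.85   # rational agent constrained to the observed reliance level
observed   <- 0.78   # expected score of study participants

value_of_complementation <- benchmark - baseline

# Observed performance on a 0-1 scale where baseline = 0 and benchmark = 1
normalized_score <- (observed - baseline) / value_of_complementation

# Decomposition of the shortfall from the benchmark
reliance_loss       <- (benchmark - misreliant) / value_of_complementation
discrimination_loss <- (misreliant - observed)  / value_of_complementation

c(normalized = normalized_score,
  reliance_loss = reliance_loss,
  discrimination_loss = discrimination_loss)
# Note: normalized_score + reliance_loss + discrimination_loss = 1
```

The decomposition is exact by construction, which is what lets us say how much of participants’ shortfall comes from relying on the AI the wrong amount versus relying on it at the wrong instances.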

We apply this approach to some well-known studies on AI reliance and are able to extend the original interpretations to varying degrees, ranging from observing a lack of potential to see complementarity in the study, given how much better the AI was than the human going in, to the original interpretation missing that participants’ reliance levels were pretty close to rational and they just couldn’t distinguish which signals they should go with the AI on. We also observe researchers making comparisons across conditions for which the upper and lower bounds on performance differ, without accounting for the difference.

There’s more in the paper – for example, we discuss how the rational benchmark, our upper bound representing the expected score of a rational decision-maker on a randomly chosen decision task, may be overfit to the empirical data. This occurs when the signal space is very large (e.g., the instance is a text document) such that we observe very few human predictions per signal. We describe how the rational agent could determine the best response on the optimal coarsening of the empirical distribution, such that the true rational benchmark is bounded by this and the overfit upper bound. 

While we focused on showing how to improve research on human-AI teams, I’m excited about the potential for this framework to help organizations as they consider whether deploying an AI is likely to improve some human decision process. We are currently thinking about what sorts of practical questions (beyond Could pairing a human and AI be effective here?) we can answer using such a framework.

Mindlessness in the interpretation of a study on mindlessness (and why you shouldn’t use the word “whom” in your dating profile)

This is a long post, so let me give you the tl;dr right away: Don’t use the word “whom” in your dating profile.

OK, now for the story. Fasten your seat belts, it’s going to be a bumpy night.

It all started with this message from Dmitri with subject line, “Man I hate to do this to you but …”, which continued:

How could I resist?

https://www.cnbc.com/2024/02/15/using-this-word-can-make-you-more-influential-harvard-study.html

I’m sorry, let me try again … I had to send this to you BECAUSE this is the kind of obvious shit you like to write about. I like how they didn’t even do their own crappy study they just resurrected one from the distant past.

OK, ok, you don’t need to shout about it!

Following the link we see this breathless, press-release-style CNBC news story:

Using this 1 word more often can make you 50% more influential, says Harvard study

Sometimes, it takes a single word — like “because” — to change someone’s mind.

That’s according to Jonah Berger, a marketing professor at the Wharton School of the University of Pennsylvania who’s compiled a list of “magic words” that can change the way you communicate. Using the word “because” while trying to convince someone to do something has a compelling result, he tells CNBC Make It: More people will listen to you, and do what you want.

Berger points to a nearly 50-year-old study from Harvard University, wherein researchers sat in a university library and waited for someone to use the copy machine. Then, they walked up and asked to cut in front of the unknowing participant.

They phrased their request in three different ways:

“May I use the Xerox machine?”
“May I use the Xerox machine because I have to make copies?”
“May I use the Xerox machine because I’m in a rush?”
Both requests using “because” made the people already making copies more than 50% more likely to comply, researchers found. Even the second phrasing — which could be reinterpreted as “May I step in front of you to do the same exact thing you’re doing?” — was effective, because it indicated that the stranger asking for a favor was at least being considerate about it, the study suggested.

“Persuasion wasn’t driven by the reason itself,” Berger wrote in a book on the topic, “Magic Words,” which published last year. “It was driven by the power of the word.” . . .

Let’s look into this claim. The first thing I did was click to the study—full credit to CNBC Make It for providing the link—and here’s the data summary from the experiment:

If you look carefully and do some simple calculations, you’ll see that the percentage of participants who complied was 37.5% under treatment 1, 50% under treatment 2, and 62.5% under treatment 3. So, ok, it’s not literally true that both requests using “because” made the people already making copies more than 50% more likely to comply: 0.50/0.375 = 1.33, and an increase of 33% is not “more than 50%.” But, sure, it’s a positive result. There were 40 participants in each treatment, so the standard error is approximately 0.5/sqrt(40) = 0.08 for each of those averages. The key difference here is 0.50 – 0.375 = 0.125, that’s the difference between the compliance rates under the treatments “May I use the Xerox machine?” and “May I use the Xerox machine because I have to make copies?”, and this will have a standard error of approximately sqrt(2)*0.08 = 0.11.

The quick summary from this experiment: an observed difference in compliance rates of 12.5 percentage points, with a standard error of 11 percentage points. I don’t want to say “not statistically significant,” so let me just say that the estimate is highly uncertain, so I have no real reason to believe it will replicate.

But wait, you say: the paper was published. Presumably it has a statistically significant p-value somewhere, no? The answer is, yes, they have some "p < .05" results, just not of that particular comparison. Indeed, if you just look at the top rows of that table (Favor = small), then the difference is 0.93 – 0.60 = 0.33 with a standard error of sqrt(0.6*0.4/15 + 0.93*0.07/15) = 0.14, so that particular estimate is just more than two standard errors away from zero. Whew! But now we’re getting into forking paths territory:

- Noisy data
- Small sample
- Lots of possible comparisons
- Any comparison that’s statistically significant will necessarily be huge
- Open-ended theoretical structure that could explain just about any result

I’m not saying the researchers were trying to do anything wrong. But remember, honesty and transparency are not enuf. Such a study is just too noisy to be useful.
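For anyone who wants to check the arithmetic, here it is in R; the treatment labels are mine:

```r
# Compliance rates from the 1978 copier study, as discussed above
p <- c(no_reason = 0.375, placebic_reason = 0.50, real_reason = 0.625)
n <- 40                                   # participants per treatment

se_each <- sqrt(p * (1 - p) / n)          # roughly 0.08 for each proportion
diff    <- unname(p["placebic_reason"] - p["no_reason"])             # 0.125
se_diff <- sqrt(sum(se_each[c("no_reason", "placebic_reason")]^2))   # about 0.11

# The "small favor" rows only (15 people per cell):
diff_small    <- 0.93 - 0.60
se_diff_small <- sqrt(0.60 * 0.40 / 15 + 0.93 * 0.07 / 15)           # about 0.14
c(diff, se_diff, diff_small, se_diff_small)
```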

But, sure, back in the 1970s many psychology researchers not named Meehl weren’t aware of these issues. They seem to have been under the impression that if you gathered some data and found something statistically significant for which you could come up with a good story, then you’d discovered a general truth.

What’s less excusable is a journalist writing this in the year 2024. But it’s no surprise, conditional on the headline, “Using this 1 word more often can make you 50% more influential, says Harvard study.”

But what about that book by the University of Pennsylvania marketing professor? I searched online, and, fortunately for us, the bit about the Xerox machine is right there in the first chapter, in the excerpt we can read for free. Here it is:

He got it wrong, just like the journalist did! It’s not true that including the meaningless reason increased persuasion just as much as the valid reason did. Look at the data! The outcomes under the three treatments were 37.5%, 50%, and 62.5%. 50% – 37.5% ≠ 62.5% – 37.5%. Ummm, ok, he could’ve said something like, “Among a selected subset of the data with only 15 or 16 people in each treatment, including the meaningless reason increased persuasion just as much as the valid reason did.” But that doesn’t sound so impressive! Even if you add something like, “and it’s possible to come up with a plausible theory to go with this result.”

The book continues:

Given the flaws in the description of the copier study, I’m skeptical about these other claims.

But let me say this. If it is indeed true that using the word “whom” in online dating profiles makes you 31% more likely to get a date, then my advice is . . . don’t use the word “whom”! Think of it from a potential-outcomes perspective. Sure, you want to get a date. But do you really want to go on a date with someone who will only go out with you if you use the word “whom”?? That sounds like a really pretentious person, not a fun date at all!

OK, I haven’t read the rest of the book, and it’s possible that somewhere later on the author says something like, “OK, I was exaggerating a bit on page 4 . . .” I doubt it, but I guess it’s possible.

Replications, anyone?

To return to the topic at hand: In 1978 a study was conducted with 120 participants in a single location. The study was memorable enough to be featured in a business book nearly fifty years later.

Surely the finding has been replicated?

I’d imagine yes; on the other hand, if it had been replicated, this would’ve been mentioned in the book, right? So it’s hard to know.

I did a search, and the article does seem to have been influential:

It’s been cited 1514 times—that’s a lot! Google lists 55 citations in 2023 alone, and in what seem to be legit journals: Human Communication Research, Proceedings of the ACM, Journal of Retailing, Journal of Organizational Behavior, Journal of Applied Psychology, Human Resources Management Review, etc. Not core science journals, exactly, but actual applied fields, with unskeptical mentions such as:

What about replications? I searched on *langer blank chanowitz 1978 replication* and found this paper by Folkes (1985), which reports:

Four studies examined whether verbal behavior is mindful (cognitive) or mindless (automatic). All studies used the experimental paradigm developed by E. J. Langer et al. In Studies 1–3, experimenters approached Ss at copying machines and asked to use it first. Their requests varied in the amount and kind of information given. Study 1 (82 Ss) found less compliance when experimenters gave a controllable reason (“… because I don’t want to wait”) than an uncontrollable reason (“… because I feel really sick”). In Studies 2 and 3 (42 and 96 Ss, respectively) requests for controllable reasons elicited less compliance than requests used in the Langer et al study. Neither study replicated the results of Langer et al. Furthermore, the controllable condition’s lower compliance supports a cognitive approach to social interaction. In Study 4, 69 undergraduates were given instructions intended to increase cognitive processing of the requests, and the pattern of compliance indicated in-depth processing of the request. Results provide evidence for cognitive processing rather than mindlessness in social interaction.

So this study concludes that the result didn’t replicate at all! On the other hand, it’s only a “partial replication,” and indeed they do not use the same conditions and wording as in the original 1978 paper. I don’t know why not, except maybe that exact replications traditionally get no respect.

Langer et al. responded in that journal, writing:

We see nothing in her results [Folkes (1985)] that would lead us to change our position: People are sometimes mindful and sometimes not.

Here they’re referring to the table from the 1978 study, reproduced at the top of this post, which shows a large effect of the “because I have to make copies” treatment under the “Small Favor” condition but no effect under the “Large Favor” condition. Again, given the huge standard errors here, we can’t take any of this seriously, but if you just look at the percentages without considering the uncertainty, then, sure, that’s what they found. Thus, in their response to the partial replication study that did not reproduce their results, Langer et al. emphasized that their original finding was not a main effect but an interaction: “People are sometimes mindful and sometimes not.”

That’s fine. Psychology studies often measure interactions, as they should: the world is a highly variable place.

But, in that case, everyone’s been misinterpreting that 1978 paper! When I say “everyone,” I mean this recent book by the business school professor and also the continuing references to the paper in the recent literature.

Here’s the deal. The message that everyone seems to have learned, or believed they learned, from the 1978 paper is that meaningless explanations are as good as meaningful explanations. But, according to the authors of that paper when they responded to criticism in 1985, the true message is that this trick works sometimes and sometimes not. That’s a much weaker message.

Indeed the study at hand is too small to draw any reliable conclusions about any possible interaction here. The most direct estimate of the interaction effect from the above table is (0.93 – 0.60) – (0.24 – 0.24) = 0.33, with a standard error of sqrt(0.93*0.07/15 + 0.60*0.40/15 + 0.24*0.76/25 + 0.24*0.76/25) = 0.19. So, no, I don’t see much support for the claim in this post from Psychology Today:

So what does this all mean? When the stakes are low people will engage in automatic behavior. If your request is small, follow your request with the word “because” and give a reason—any reason. If the stakes are high, then there could be more resistance, but still not too much.

This happens a lot in unreplicable or unreplicated studies: a result is found under some narrow conditions, and then it is taken to have very general implications. This is just an unusual case where the authors themselves pointed out the issue. As they wrote in their 1985 article:

The larger concern is to understand how mindlessness works, determine its consequences, and specify better the conditions under which it is and is not likely to occur.

That’s a long way from the claim in that business book that “because” is a “magic word.”

Like a lot of magic, it only works under some conditions, and you can’t necessarily specify those conditions ahead of time. It works when it works.

There might be other replication studies of this copy machine study. I guess you couldn’t really do it now, because people don’t spend much time waiting at the copier. But the office copier was a thing for several decades. So maybe there are even some exact replications out there.

In searching for a replication, I did come across this post from 2009 by Mark Liberman that criticized yet another hyping of that 1978 study, this time in a paper by psychologist Daniel Kahneman in the American Economic Review. Kahneman wrote:

Ellen J. Langer et al. (1978) provided a well-known example of what she called “mindless behavior.” In her experiment, a confederate tried to cut in line at a copying machine, using various preset “excuses.” The conclusion was that statements that had the form of an unqualified request were rejected (e.g., “Excuse me, may I use the Xerox machine?”), but almost any statement that had the general form of an explanation was accepted, including “Excuse me, may I use the Xerox machine because I want to make copies?” The superficiality is striking.

As Liberman writes, this represented a “misunderstanding of the 1978 paper’s results, involving both a different conclusion and a strikingly overgeneralized picture of the observed effects.” Liberman performs an analysis of the data from that study which is similar to what I have done above.

Liberman summarizes:

The problem with Prof. Kahneman’s interpretation is not that he took the experiment at face value, ignoring possible flaws of design or interpretation. The problem is that he took a difference in the distribution of behaviors between one group of people and another, and turned it into generic statements about the behavior of people in specified circumstances, as if the behavior were uniform and invariant. The resulting generic statements make strikingly incorrect predictions even about the results of the experiment in question, much less about life in general.

Mindfulness

The key claim of all this research is that people are often mindless: they respond to the form of a request without paying attention to its context, with “because” acting as a “magic word.”

I would argue that this is exactly the sort of mindless behavior being exhibited by the people who are promoting that copying-machine experiment! They are taking various surface aspects of the study and using them to draw large, unsupported conclusions, without being mindful of the details.

In this case, the “magic words” are things like “p < .05,” “randomized experiment,” “Harvard,” “peer review,” and “Journal of Personality and Social Psychology” (this notwithstanding). The mindlessness comes from not looking into what exactly was in the paper being cited.

In conclusion . . .

So, yeah, thanks for nothing, Dmitri! Three hours of my life spent going down a rabbit hole. But, hey, if any readers who are single have read far enough down in the post to see my advice not to use “whom” in your dating profile, it will all have been worth it.

Seriously, though, the “mindlessness” aspect of this story is interesting. The point here is not, Hey, a 50-year-old paper has some flaws! Or the no-less-surprising observation: Hey, a pop business book exaggerates! The part that fascinates me is that there’s all this shaky research that’s being taken as strong evidence that consumers are mindless—and the people hyping these claims are themselves demonstrating the point by mindlessly following signals without looking into the evidence.

The ultimate advice that the mindfulness gurus are giving is not necessarily so bad. For example, here’s the conclusion of that online article about the business book:

Listen to the specific words other people use, and craft a response that speaks their language. Doing so can help drive an agreement, solution or connection.

“Everything in language we might use over email at the office … [can] provide insight into who they are and what they’re going to do in the future,” says Berger.

That sounds ok. Just forget all the blather about the “magic words” and the “superpowers,” and forget the unsupported and implausible claim that “Arguments, requests and presentations aren’t any more or less convincing when they’re based on solid ideas.” As often is the case, I think these Ted-talk style recommendations would be on more solid ground if they were just presented as the product of common sense and accumulated wisdom, rather than leaning on some 50-year-old psychology study that just can’t bear the weight. But maybe you can’t get the airport book and the Ted talk without a claim of scientific backing.

Don’t get me wrong here. I’m not attributing any malign motivations to any of the people involved in this story (except for Dmitri, I guess). I’m guessing they really believe all this. And I’m not using “mindless” as an insult. We’re all mindless sometimes—that’s the point of the Langer et al. (1978) study; it’s what Herbert Simon called “bounded rationality.” The trick is to recognize your areas of mindlessness. If you come to an area where you’re being mindless, don’t write a book about it! Even if you naively think you’ve discovered a new continent. As Mark Twain apparently never said, it ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

The usual disclaimer

I’m not saying the claims made by Langer et al. (1978) are wrong. Maybe it’s true that, under conditions of mindlessness, all that matters is the “because” and any empty explanation will do; maybe the same results would show up in a preregistered replication. All I’m saying is that the noisy data that have been presented don’t provide any strong evidence in support of such claims, and that’s what bothers me about all those confident citations in the business literature.

P.S.

After writing the above post, I sent this response to Dmitri:

OK, I just spent 3 hours on this. I now have to figure out what to do with this after blogging it, because I think there are some important points here. Still, yeah, you did a bad thing by sending this to me. These are 3 hours I could’ve spent doing real work, or relaxing . . .

He replied:

I mean, yeah, that’s too bad for you, obviously. But … try to think about it from my point of view. I am more influential, I got you to work on this while I had a nice relaxing post-Valentine’s day sushi meal with my wife (much easier to get reservations on the 15th and the flowers are a lot cheaper), while you were toiling away on what is essentially my project. I’d say the magic words did their job.

Good point! He exploited my mindlessness. I responded:

Ok, I’ll quote you on that one too! (minus the V-day details).

I’m still chewing on your comment that you appreciate the Beatles for their innovation as much as for their songs. The idea that there are lots of songs of similar quality but not so much innovation, that’s interesting. The only thing is that I don’t know enough about music, even pop music, to have a mental map of where everything fits in. For example, I recently heard that Coldplay song, and it struck me that it was in the style of U2. But I don’t really know if U2 was the originator of that soaring sound. I guess Pink Floyd is kinda soaring too, but not quite in the same way . . . etc etc … the whole thing was frustrating to me because I had no sense of whether I was entirely bullshitting or not.

So if you can spend 3 hours writing a post on the above topic, we’ll be even.

Dmitri replied:

I am proud of the whole “Valentine’s day on the 15th” trick, so you are welcome to include it. That’s one of our great innovations. After the first 15-20 Valentine’s days, you can just move the date a day later and it is much easier.

And, regarding the music, he wrote:

U2 definitely invented a sound, with the help of their producer Brian Eno.

It is a pretty safe bet that every truly successful musician is an innovator—once you know the sound it is easy enough to emulate. Beethoven, Charlie Parker, the Beatles, all the really important guys invented a forceful, effective new way of thinking about music.

U2 is great, but when I listened to an entire U2 song from beginning to end, it seemed so repetitive as to be unlistenable. I don’t feel that way about the Beatles or REM. But just about any music sounds better to me in the background, which I think is a sign of my musical ignorance and tone-deafness (for real, I’m bad at recognizing pitches) more than anything else. I guess the point is that you’re supposed to dance to it, not just sit there and listen.

Anyway, I warned Dmitri about what would happen if I post his Valentine’s Day trick:

I post this, then it will catch on, and it will no longer work . . . just warning ya! You’ll have to start doing Valentine’s Day on the 16th, then the 17th, . . .

To which Dmitri responded:

Yeah but if we stick with it, it will roll around and we will get back to February 14 while everyone else is celebrating Valentines Day on these weird wrong days!

I’ll leave him with the last word.

When do we expect conformal prediction sets to be helpful? 

This is Jessica. Over on substack, Ben Recht has been posing some questions about the value of prediction bands with marginal guarantees, such as one gets from conformal prediction. It’s an interesting discussion that caught my attention since I have also been musing about where conformal prediction might be helpful. 

To briefly review, given a training data set (X1, Y1), … , (Xn, Yn), and a test point (Xn+1, Yn+1) drawn from the same distribution, conformal prediction returns a subset of the label space for which we can make coverage guarantees about the probability of containing the test point’s true label Yn+1. A prediction set Cn achieves distribution-free marginal coverage at level 1 − alpha when P(Yn+1 ∈ Cn(Xn+1)) >= 1 − alpha for all joint distributions P on (X, Y). The commonly used split conformal prediction process attains this by adding a couple of steps to the typical modeling workflow: you first split the data into a training and a calibration set, fitting the model on the training set. You choose a heuristic notion of uncertainty from the trained model, such as the softmax values (pseudo-probabilities from the last layer of a neural network), to create a score function s(x,y) that encodes disagreement between x and y (in a regression setting these are just the residuals). You compute q_hat, the ceil((n+1)(1-alpha))/n empirical quantile of the scores on the calibration set. Then, given a new instance x_n+1, you construct a prediction set for y_n+1 by including all y’s for which the score is less than or equal to q_hat. There are various ways to achieve slightly better performance, such as using cumulative summed scores and regularization instead.
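To make that recipe concrete, here is a minimal sketch of split conformal prediction for a classifier. It assumes a scikit-learn-style model exposing a predict_proba method, integer labels 0, …, K−1, and NumPy 1.22+ (for the method argument of np.quantile); it is only an illustration of the steps above, not anyone’s reference implementation.

```python
import numpy as np

def split_conformal_sets(model, X_cal, y_cal, X_new, alpha=0.1):
    """Return a prediction set for each row of X_new with marginal coverage
    >= 1 - alpha, assuming exchangeable calibration and test data."""
    # Score = 1 - probability assigned to the true label (higher = more disagreement).
    cal_probs = model.predict_proba(X_cal)            # shape (n, K)
    n = len(y_cal)
    cal_scores = 1.0 - cal_probs[np.arange(n), np.asarray(y_cal)]
    # Conformal quantile level: ceil((n+1)(1-alpha))/n (assumes n is large enough
    # that this level does not exceed 1).
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(cal_scores, q_level, method="higher")
    # Prediction set: every label whose score is <= q_hat.
    new_probs = model.predict_proba(X_new)
    return [np.where(1.0 - p <= q_hat)[0] for p in new_probs]
```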

Recht makes several good points about limitations of conformal prediction, including:

—The marginal coverage guarantees are often not very useful. Instead we want conditional coverage guarantees that hold conditional on the value of Xn+1 we observe. But you can’t get true conditional coverage guarantees (i.e., P(Yn+1 ∈ Cn(Xn+1)|Xn+1 = x) >= 1 − alpha for all P and almost all x) if you also want the approach to be distribution free (see e.g., here), and in general you need a very large calibration set to be able to say with high confidence that there is a high probability that your specific interval contains the true Yn+1.

—When we talk about needing prediction bands for decisions, we are often talking about scenarios where the decisions we want to make from the uncertainty quantification are going to change the distribution and violate the exchangeability criterion. 

—Additionally, in many of the settings where we might imagine using prediction sets there is potential for recourse. If the prediction is bad, resulting in a bad action being chosen, the action can be corrected, i.e., “If you have multiple stages of recourse, it almost doesn’t matter if your prediction bands were correct. What matters is whether you can do something when your predictions are wrong.”

Recht also criticizes research on conformal prediction as being fixated on the ability to make guarantees, irrespective of how useful the resulting intervals are. E.g., we can produce sets with 95% coverage with only two points, and the guarantees are always about coverage instead of the width of the interval.

These are valid points, worth discussing given how much attention conformal prediction has gotten lately. Some of the concerns remind me of the same complaints we often hear about traditional confidence intervals we put on parameter estimates, where the guarantees we get (about the method) are also generally not what we want (about the interval itself) and only actually summarize our uncertainty when the assumptions we made in inference are all good, which we usually can’t verify. A conformal prediction interval is about uncertainty in a model’s prediction on a specific instance, which perhaps makes it more misleading to some people given that it might not be conditional on anything specific to the instance. Still, even if the guarantees don’t stand as stated, I think it’s difficult to rule out an approach without evaluating how it gets used. Given that no method ever really quantifies all of our uncertainty, or even all of the important sources of uncertainty, the “meaning” of an uncertainty quantification really depends on its use, and what the alternatives considered in a given situation are. So I guess I disagree that one can answer the question “Can conformal prediction achieve the uncertainty quantification we need for decision-making?” without considering the specific decision at hand, how we are constructing the prediction set exactly (since there are ways to condition the guarantees on some instance-specific information), and how it would be made without a prediction set. 

There are various scenarios where prediction sets are used without a human in the loop, like to get better predictions or directly calibrate decisions, where it seems hard to argue that it’s not adding value over not incorporating any uncertainty quantification. Conformal prediction for alignment purposes (e.g., controlling the factuality or toxicity of LLM outputs) seems to be on the rise. However, I want to focus here on a scenario where we are directly presenting a human with the sets. One type of setting where I’m curious whether conformal prediction sets could be useful is one where 1) models are trained offline and used to inform people’s decisions, and 2) it’s hard to rigorously quantify the uncertainty in the predictions using anything the model produces internally, like softmax values, which can be overfit to the training sample.

For example, a doctor needs to diagnose a skin condition and has access to a deep neural net trained on images of skin conditions for which the illness has been confirmed. Even if this model appears to be more accurate than the doctor on evaluation data, the hospital may not be comfortable deploying the model in place of the doctor. Maybe the doctor has access to additional patient information that may in some cases allow them to make a better prediction, e.g., because they can decide when to seek more information through further interaction or monitoring of the patient. This means the distribution does change upon acting on the prediction, and I think Recht would say there is potential for recourse here, since the doctor can revise the treatment plan over time (he lists a similar example here). But still, at any given point in time, there’s a model and there’s a decision to be made by a human.    

Giving the doctor information about the model’s confidence in its prediction seems like it should be useful in helping them appraise the prediction in light of their own knowledge. Similarly, giving them a prediction set rather than just a single top-1 prediction seems potentially preferable, so they don’t anchor too heavily on a single prediction. Deep neural nets for medical diagnoses can do better than many humans in certain domains while still having relatively low top-1 accuracy (e.g., here).

A naive thing to do would be to just choose some number k of predictions from the model we think a doctor can handle seeing at once, and show the top-k with softmax scores. But an adaptive conformal prediction set seems like an improvement in that at least you get some kind of guarantee, even if it’s not specific to your interval like you want. Set size conveys information about the level of uncertainty like the width of a traditional confidence interval does, which seems more likely to be helpful for conveying relative uncertainty than holding set size constant and letting the coverage guarantee change (I’ve heard from at least one colleague who works extensively with doctors that many are pretty comfortable with confidence intervals). We can also take steps toward the conditional coverage that we actually want by using an algorithm that calibrates the guarantees over different classes (labels), or that achieves a relaxed version of conditional coverage, possibilities that Recht seems to overlook. 
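For intuition, here is a minimal sketch of how such an adaptive set could be formed from the softmax output for a single instance, in the spirit of the cumulative-score approach mentioned earlier. It assumes a conformal threshold q_hat has already been computed from the corresponding cumulative-probability scores on a calibration set (not shown); it is a toy illustration, not any particular paper’s method.

```python
import numpy as np

def adaptive_set(probs, q_hat):
    """Include labels in order of decreasing softmax probability until their
    cumulative mass reaches q_hat; more uncertain instances get larger sets."""
    order = np.argsort(probs)[::-1]              # labels sorted by probability
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, q_hat)) + 1     # number of labels to include
    return order[:min(k, len(probs))]

# A confident prediction yields a small set; a diffuse one yields a large set.
print(adaptive_set(np.array([0.85, 0.10, 0.03, 0.02]), q_hat=0.9))  # 2 labels
print(adaptive_set(np.array([0.30, 0.25, 0.25, 0.20]), q_hat=0.9))  # all 4 labels
```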

So while I agree with all the limitations, I don’t necessarily agree with Recht’s concluding sentence here:

“If you have multiple stages of recourse, it almost doesn’t matter if your prediction bands were correct. What matters is whether you can do something when your predictions are wrong. If you can, point predictions coupled with subsequent action are enough to achieve nearly optimal decisions.” 

It seems possible that seeing a prediction set (rather than just a single top prediction) will encourage a doctor to consider other diagnoses that they may not have thought of. Presenting uncertainty often has _some_ effect on a person’s reasoning process, even if they can revise their behavior later. The effect of seeing more alternatives could be bad in some cases (they get distracted by labels that don’t apply), or it could be good (a hurried doctor recognizes a potentially relevant condition they might have otherwise overlooked). If we allow for the possibility that seeing a set of alternatives helps, it makes sense to have a way to generate them that gives us some kind of coverage guarantee we can make sense of, even if it gets violated sometimes.

This doesn’t mean I’m not skeptical of how much prediction sets might change things over more naively constructed sets of possible labels. I’ve spent a bit of time thinking about how, from the human perspective, prediction sets could or could not add value, and I suspect it’s going to be nuanced; the real value probably depends on how the coverage responds under realistic changes in distribution. There are lots of questions that seem worth trying to answer in particular domains where models are being deployed to assist decisions. Does it actually matter in practice, such as in a given medical decision setting, for the quality of decisions that are made if the decision-makers are given a set of predictions with coverage guarantees versus a top-k display without any guarantees? And, what happens when you give someone a prediction set with some guarantee but there are distribution shifts such that the guarantees you give are not quite right? Are they still better off with the prediction set, or is this worse than just providing the model’s top prediction or top-k with no guarantees? Again, many of the questions could also be asked of other uncertainty quantification approaches; conformal prediction is just easier to implement in many cases. I have more to say on some of these questions based on a recent study we did on decisions from prediction sets, where we compared how accurately people labeled images using them versus other displays of model predictions, but I’ll save that for another post since this is already long.

Of course, it’s possible that in many settings we would be better off using some inherently interpretable model for which we no longer need a distribution-free approach. And ultimately we might be better off if we can better understand the decision problem the human decision-maker faces and apply decision theory to try to find better strategies, rather than leaving it up to the human how to combine their knowledge with what they get from a model prediction. I think we still barely understand how this occurs even in high-stakes settings that people often talk about.

Uncertainty in games: How to get that balance so that there’s a motivation to play well, but you can still have a chance to come back from behind?

I just read the short book, “Uncertainty in games,” by Greg Costikyan. It was interesting. His main point, which makes sense to me, is that uncertainty is a key part of the appeal of any game. He gives interesting examples of different sources of uncertainty. For example, if you’re playing a video game such as Pong, the uncertainty is in your own reflexes and reactions. With Diplomacy, there’s uncertainty in what the other players will do. With poker, there’s uncertainty about all the hole cards. With chess, there’s uncertainty in what the other player will do and also uncertainty in the logical implications of any position, in the same way that I am uncertain about what is the 200th digit of the decimal expansion of pi, even though that number exists. I agree with Costikyan that uncertainty is a helpful concept for thinking about games.

There’s one thing he didn’t discuss in his book, though, that I wanted to hear more about, and that’s the way that time and uncertainty interact in games, and how this factors into game design. I’ve been thinking a lot about time lately, and this is another example, especially relevant to me as we’re in the process of finishing up the design of a board game, and we want to improve its playability.

To fix ideas, consider a multi-player tabletop game with a single winner, and suppose the game takes somewhere between a half hour and two hours to play. As a player, I want to have a real chance of winning until close to the end, and when the game reaches the point at which I pretty much know I can’t win, I still want it to be fun: I want some intermediate goal, such as the possibility of being a spoiler or of being able to capitalize on my opponents’ mistakes. At the same time, I don’t want the outcome to be entirely random.

Consider two extremes:
1. One player gets ahead early and then can relentlessly exploit the advantage to get a certain win.
2. Nobody is ever ahead by much; there’s a very equal balance, and the winner is decided only at the very end by some random event.

Option #1 actually isn’t so bad—as long as the player in the lead can compound the advantage and force the win quickly. For example, in chess, if you have a decisive lead you can use your pieces together to increase your advantage. This is to be distinguished from how we played as kids, which was that once you’re in the lead you’d just try to trade pieces until the opposing player had nothing left: that got pretty boring. If you can use your pieces together, the game is more interesting even during the period where the winning player is clinching it.

Option #2 would not be so much fun. Sure, sometimes you will have a close game that’s decided at the very end, and that’s fine, but I’d like for victory to be some reflection of cumulative game play, as otherwise it’s meaningless.

Sometimes this isn’t so important. In Scrabble, for example, the play itself is enjoyable. The competition can also be good—it’s fun to be in a tight game where you’re counting the letters, blocking out the other player, and strategizing to get that final word on the board—but even if you’re way behind, you can still try to get the most out of your rack.

In some other games, though, once you’re behind and you don’t have a chance to win, it’s just a chore to keep playing. Monopoly and Risk handle this by creating a positive incentive for players to wipe out weak opponents, so that once you’re down, you’ll soon be out.

And yet another approach is to have cumulative scoring. In poker it’s all about the money. Whether you’re ahead or behind for the night, you’re still motivated to improve your bankroll.

One thing I don’t have a good grip on regarding game design is how to get that balance between all these possibilities, so that how you play matters throughout the game, while at the same time keeping the possibility of winning open for as long as feasible.

I remember my dad saying that he preferred tennis scoring (each game is played to 4 points, each set is 6 games, you need to win 2 or 3 sets) as compared to old-style ping-pong scoring (whoever reaches 21 points first, wins), because in tennis, even if you’re way behind, you always have a chance to come back. Which makes sense, and is related to Costikyan’s point about uncertainty, but is hard for me to formalize.

A key idea here, I think, is that the relative skill of the players during the course of a match is a nonstationary process. For example, if player A is winning, perhaps up 2 sets to 0 and up 5 games to 2 in the third set, but then player B comes from behind to catch up and then maybe win in the fifth set, yes, this is an instance of uncertainty in action, but it won’t be happening at random. What will happen is that A gets tired, or B figures out a new plan of action, or some other factor that affects the relative balance of skill. And that itself is part of the game.

In summary, we’d like the game to balance three aspects:

1. Some positive feedback mechanism so that when you’re ahead you can use this advantage to increase your lead.

2. Some responsiveness to changes in effort and skill during the game, so that by pushing really hard or coming up with a clever new strategy you can come back from behind.

3. Uncertainty, as emphasized by Costikyan.

I’m sure that game designers have thought systematically about such things; I just don’t know where to look.

Clinical trials that are designed to fail

Mark Palko points us to a recent update by Robert Yeh et al. of the famous randomized parachute-jumping trial:

Palko writes:

I also love the way they dot all the i’s and cross all the t’s. The whole thing is played absolutely straight.

I recently came across another (not meant as satire) study where the raw data was complete crap but the authors had this ridiculously detailed methods section, as if throwing in a graduate level stats course worth of terminology would somehow spin this shitty straw into gold.

Yeh et al. conclude:

This reminded me of my zombies paper. I forwarded the discussion to Kaiser Fung, who wrote:

Another recent example from Covid is this Scottish study. They did so much to the data that it is impossible for any reader to judge whether they did the right things or not. The data are all locked down for “privacy.”

Getting back to the original topic, Joseph Delaney had some thoughts:

I think the parachute study makes a good and widely misunderstood point. Our randomized controlled trial infrastructure is designed for the drug development world, where there is a huge (literally life altering) benefit to proving the efficacy of a new agent. Conservative errors are being cautious and nobody seriously considers a trial designed to fail as a plausible scenario.

But you see new issues with trials designed to find side effects (e.g., RECORD has a lot more LTFU than I saw in a drug study, when I did trials we studied how to improve adherence to improve the results—but a trial looking for side effects that cost the company money would do the reverse). We teach in pharmacy that conservative design is actually a problem in safety trials.

Even worse are trials which are aliased with a political agenda. It’s easy-peasy to design a trial to fail (the parachute trial was jumping from a height of 2 feet). That makes me a lot more critical when you see trials where the failure of the trial would be seen as an upside, because it is just so easy to botch a trial. Designing good trials is very hard (smarter people than I spend entire careers doing a handful of them). It’s a tough issue.

Lots to chew on here.

If school funding doesn’t really matter, why do people want their kid’s school to be well funded?

A question came up about the effects of school funding and student performance, and we were referred to this review article from a few years ago by Larry Hedges, Terri Pigott, Joshua Polanin, Ann Marie Ryan, Charles Tocci, and Ryan Williams:

One question posed continually over the past century of education research is to what extent school resources affect student outcomes. From the turn of the century to the present, a diverse set of actors, including politicians, physicians, and researchers from a number of disciplines, have studied whether and how money that is provided for schools translates into increased student achievement. The authors discuss the historical origins of the question of whether school resources relate to student achievement, and report the results of a meta-analysis of studies examining that relationship. They find that policymakers, researchers, and other stakeholders have addressed this question using diverse strategies. The way the question is asked, and the methods used to answer it, is shaped by history, as well by the scholarly, social, and political concerns of any given time. The diversity of methods has resulted in a body of literature too diverse and too inconsistent to yield reliable inferences through meta-analysis. The authors suggest that a collaborative approach addressing the question from a variety of disciplinary and practice perspectives may lead to more effective interventions to meet the needs of all students.

I haven’t followed this literature carefully. It was my vague impression that studies have found effects of schools on students’ test scores to be small. So, not clear that improving schools will do very much. On the other hand, everyone wants their kid to go to a good school. Just for example, all the people who go around saying that school funding doesn’t matter, they don’t ask to reduce the funding of their own kids’ schools. And I teach at an expensive school myself. So lots of pieces here, hard for me to put together.

I asked education statistics expert Beth Tipton what she thought, and she wrote:

I think the effect of money depends upon the educational context. For example, in higher education at selective universities, the selection process itself is what ensures success of students – the school matters far less. But in K-12, and particularly in under resourced areas, schools and finances can matter a lot – thus the focus on charter schools in urban locales.

I guess the problem here is that I’m acting like the typical uninformed consumer of research. The world is complicated, and any literature will be a mess, full of claims and counter-claims, but here I am expecting there to be a simple coherent story that I can summarize in a short sentence (“Schools matter” or “Schools don’t matter” or, maybe, “Schools matter but only a little”).

Given how frustrated I get when others come into a topic with this attitude, I guess it’s good for me to recognize when I do it.

“Replicability & Generalisability”: Applying a discount factor to cost-effectiveness estimates.

This one’s important.

Matt Lerner points us to this report by Rosie Bettle, Replicability & Generalisability: A Guide to CEA discounts.

“CEA” is cost-effectiveness analysis, and by “discounts” they mean what we’ve called the Edlin factor—“discount” is a better name than “factor” because it’s a number that should be between 0 and 1: it’s what you multiply a point estimate by to adjust for the inevitable upward biases in reported effect-size estimates, issues discussed here and here, for example.
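To make the mechanics concrete, here’s a toy sketch (all numbers invented) of what applying such a discount looks like in a cost-effectiveness calculation:

```python
# Toy example: discounting a published effect estimate before plugging it into
# a cost-effectiveness calculation. All numbers are invented for illustration.
reported_effect = 0.40   # published point estimate of the effect size
edlin_factor = 0.5       # discount in (0, 1]; see the default-of-1/2 discussion linked below
adjusted_effect = edlin_factor * reported_effect

cost = 100.0             # hypothetical cost of delivering the intervention, per person
print(adjusted_effect / cost)   # expected effect per dollar, using the discounted estimate
```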

It’s pleasant to see some of my ideas being used for a practical purpose. I would just add that type M and type S errors should be lower for Bayesian inferences than for raw inferences that have not been partially pooled toward a reasonable prior model.

Also, regarding empirical estimation of adjustment factors, I recommend looking at the work of Erik van Zwet et al; here are some links:
What’s a good default prior for regression coefficients? A default Edlin factor of 1/2?
How large is the underlying coefficient? An application of the Edlin factor to that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”
The Shrinkage Trilogy: How to be Bayesian when analyzing simple experiments
Erik van Zwet explains the Shrinkage Trilogy
The significance filter, the winner’s curse and the need to shrink
Bayesians moving from defense to offense: “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”
Explaining that line, “Bayesians moving from defense to offense”

I’m excited about the application of these ideas to policy analysis.

Minimum criteria for studies evaluating human decision-making

This is Jessica. A while back on the blog I shared some opinions about studies of human decision-making, such as to understand how visualizations or displays of model predictions and explanations impact people’s behavior. My view is essentially that a lot of the experiments being used to do things like rank interfaces or model explanation techniques are not producing very informative results because the decision task is defined too loosely.

I decided to write up some thoughts rather than only blogging them. In Decision Theoretic Foundations for Human Decision Experiments (with Alex Kale and Jason Hartline), we write: 

Decision-making with information displays is a key focus of research in areas like explainable AI, human-AI teaming, and data visualization. However, what constitutes a decision problem, and what is required for an experiment to be capable of concluding that human decisions are flawed in some way, remain open to speculation. We present a widely applicable definition of a decision problem synthesized from statistical decision theory and information economics. We argue that to attribute loss in human performance to forms of bias, an experiment must provide participants with the information that a rational agent would need to identify the normative decision. We evaluate the extent to which recent evaluations of decision-making from the literature on AI-assisted decisions achieve this criteria. We find that only 6 (17%) of 35 studies that claim to identify biased behavior present participants with sufficient information to characterize their behavior as deviating from good decision-making. We motivate the value of studying well-defined decision problems by describing a characterization of performance losses they allow us to conceive. In contrast, the ambiguities of a poorly communicated decision problem preclude normative interpretation. 

We make a couple main points. First, if you want to evaluate human decision-making from some sort of information interface, you should be able to formulate the task you are studying as a decision problem as defined by statistical decision theory and information economics. Specifically, a decision problem consists of a payoff-relevant state, a data-generating model which produces signals that induce a distribution over the state, an action space from which the decision-maker chooses a response, and a scoring rule that defines the quality of the decision as a function of the action that was chosen and the realization of the payoff-relevant state. Using this definition of a decision problem gives you a statistically coherent way to define the normative decision, i.e., the action that a Bayesian agent would choose to maximize their utility under whatever scoring rule you’ve set up. In short, if you want to say anything based on your results that implies people’s decisions are flawed, you need to make clear what is optimal, and you’re not going to do better than statistical decision theory. 
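To make the formalism concrete, here is a toy decision problem written out in code, with all numbers invented: a binary payoff-relevant state, a signal whose likelihood under each state stands in for the data-generating model, a two-action space, and a scoring rule. The normative decision is just the action with the highest expected score under the posterior.

```python
import numpy as np

states = [0, 1]                     # payoff-relevant state, e.g., condition absent/present
prior = np.array([0.9, 0.1])        # prior over states (part of the data-generating model)
# Likelihood of the displayed signal ("model flags the condition") under each state:
p_signal_given_state = np.array([0.2, 0.8])

actions = ["no_treatment", "treat"]
# Scoring rule: score[action, state], higher is better.
score = np.array([[0.0, -10.0],     # no_treatment: costly miss if state = 1
                  [-1.0,   5.0]])   # treat: small cost if unnecessary, gain if needed

# Posterior over states after seeing the signal (Bayes rule):
posterior = prior * p_signal_given_state
posterior /= posterior.sum()

expected_scores = score @ posterior
best_action = actions[int(np.argmax(expected_scores))]
print(posterior, expected_scores, best_action)
```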

The second requirement is that you communicate to the study participants sufficient information for a rational agent to know how to optimize: select the optimal action after forming posterior beliefs about the state of the world given whatever signals (visualizations, displays of model predictions, etc.) you are showing them.

When these criteria are met you gain the ability to conceive of different sources of performance loss implied by the process that the rational Bayesian decision-maker goes through when faced with the decision problem: 

  • Prior loss, the loss in performance due to the difference between the agent’s prior beliefs and those used by the researchers to calculate the normative standard.
  • Receiver loss, the loss due to the agent not properly extracting the information from the signal, for example, because the human visual system constrains what information is actually perceived or because participants can’t figure out how to read the signal.
  • Updating loss, the loss due to the agent not updating their prior beliefs according to Bayes rule with the information they obtained from the signal (in cases where the signal does not provide sufficient information about the posterior probability on its own).
  • Optimization loss, the loss in performance due to not identifying the optimal action under the scoring rule. 

Complicating things is loss due to the possibility that the agent misunderstands the decision task, e.g., because they didn’t really internalize the scoring rule. So any hypothesis you might try to test about one of the sources of loss above is actually testing the joint hypothesis consisting of your hypothesis plus the hypothesis that participants understood the task. We don’t get into how to estimate these losses, but some of our other work does, and there’s lots more to explore there. 

If you communicate to your study participants part of a decision problem, but leave out some important component, you should expect their lack of clarity about the problem to induce heterogeneity in the behaviors they exhibit. And then you can’t distinguish such “heterogeneity by design” from real differences in decision quality across the conditions you are trying to study. You don’t know if participants are making flawed decisions because of real challenges in forming accurate beliefs or selecting the right action under different types of signals, or because they are operating under a different version of the decision problem than you have in mind.

Here’s a picture that comes to mind:

[Diagram: an underspecified decision problem being interpreted differently by different people.]

I.e., each participant might have a unique way of filling in the details about the problem that you’ve failed to communicate, which differs from how you analyze it. Often I think experimenters are overly optimistic about how easy it is to move from the left side (the artificial world of the experiment) to draw conclusions about the right. I think sometimes people believe that if they leave out some information (e.g., they don’t communicate to participants the prior probability of recidivating in a study on recidivism prediction, or they set up a fictional voting scenario but don’t give participants a clear scoring rule when studying effects of different election forecast displays), they are “being more realistic,” because in the real world people rely on their own intuitions and past experience, so there are lots of possible influences on how a person makes their decision. But, as we write in the paper, this is a mistake, because people will generally have different goals and beliefs in an experiment than they do in the real world. Even if everyone is influenced in the experiment by a different factor that does operate in the real world, the idea that the composition of all these interpretations gives us a good approximation of real-world behavior is not supported; as we say in the paper, it “arises from a failure to recognize our fundamental uncertainty about how the experimental context relates to the real world.” We can’t know for sure how good a simulacrum our experimental context is for the real-world task, so we should at least be very clear about what the experimental context is so we can draw internally valid conclusions.

Criterion 1 is often met in visualization and human-centered AI, but Criterion 2 is not

I don’t think these two criteria are met in most of the interface decision experiments I come across. In fact, as the abstract mentions, Alex and I looked at a sample of 46 papers on AI assisted decision-making that a survey previously labeled as evaluating human decisions; of these 11 were interested in studying tasks for which you can’t define ground truth, like emotional responses people had to recommendations, or had a descriptive purpose, like estimating how accurately a group of people can guess the post-release criminal status of a set of defendants in the COMPAS dataset. Of the remaining 35, only a handful gave participants enough information for them to at least in theory know how to best respond to the problem. And even when sufficient information to solve the decision problem in theory is given, often the authors use a different scoring rule to evaluate the results than they gave to participants. The problem here is that you are assigning a different meaning to the same responses when you evaluate versus when you instruct participants. There were also many instances of information asymmetries between conditions the researchers compared, like where some of the prediction displays contained less decision-relevant information or some of the conditions got feedback after each decision while others didn’t. Interpreting the results is easier if the authors account for the difference in expected performance based on giving people a slightly different problem. 

In part the idea of writing this up was that it could provide a kind of explainer of the philosophy behind work we’ve done recently that defines rational agent benchmarks for different types of decision studies. As I’ve said before, I would love to see people studying interfaces adopt statistical decision theory more explicitly. However, we’ve encountered resistance in some cases. One reason, I suspect, is that people don’t understand the assumptions made in decision theory, so this is an attempt to walk through things step by step to build confidence. Though there may be other reasons too, related to people distrusting anything that claims to be “rational.”

“When will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?”

Alexey Guzey asks:

How much have you thought about AI and when will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?

My first reply: I guess that AI can already do better science than Matthew “Sleeplord” Walker, Brian “Pizzagate” Wansink, Marc “Schoolmarm” Hauser, or Satoshi “Freakonomics” Kanazawa. So some humans are already obsolete, when it comes to producing science.

OK, let me think a bit more. I guess it depends on what kind of scientific research we’re talking about. Lots of research can be automated, and I could easily imagine an AI that can do routine analysis of A/B tests better than a human could. Indeed, thinking of how the AI could do this is a good way to improve how humans currently do things.

For bigger-picture research, I don’t see AI doing much. But a big problem now with human research is that human researchers want to take routine research and promote it as big-picture (see Walker, Wansink, Kanazawa, etc.). I guess that an AI could be programmed to do hype and create Ted talk scripts.

Guzey’s response:

What’s “routine research”? Would someone without a college degree be able to do it? Is routine research simply defined as such that can be done by a computer now?

My reply: I guess the computer couldn’t really do the research, as that would require filling test tubes or whatever. I’m thinking that the computer could set up the parameters of an experiment, evaluate measurements, choose sample size, write up the analysis, etc. It would have to be some computer program that someone writes. If you just fed the scientific literature into a chatbot, I guess you’d just get millions more crap papers, basically reproducing much of what is bad about the literature now, which is the creation of articles that give the appearance of originality and relevance while actually being empty in content.

But, now that I’m writing this, I think Guzey is asking something slightly different: he wants to know when a general purpose “scientist” computer could be written, kind of like a Roomba or a self-driving car, but instead of driving around, it would read the literature, perform some sort of sophisticated meta-analyses, and come up with research ideas, like “Run an experiment on 500 people testing manipulations A and B, measure pre-treatment variables U and V, and look at outcomes X and Y.” I guess the first step would be to try to build such a system in a narrow environment such as testing certain compounds that are intended to kill bacteria or whatever.

I don’t know. On one hand, even the narrow version of this problem sounds really hard; on the other hand, our standards for publishable research are so low that it doesn’t seem like it would be so difficult to write a computer program that can fake it.

Maybe the most promising area of computer-designed research would be in designing new algorithms, because there the computer could actually perform the experiment; no laboratory or test tubes required, so the experiments can be run automatically and the computer could try millions of different things.

Learning from mistakes (my online talk for the American Statistical Association, 2:30pm Tues 30 Jan 2024)

Here’s the link:

Learning from mistakes

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We learn so much from mistakes! How can we structure our workflow so that we can learn from mistakes more effectively? I will discuss a bunch of examples where I have learned from mistakes, including data problems, coding mishaps, errors in mathematics, and conceptual errors in theory and applications. I will also discuss situations where researchers have avoided good learning opportunities. We can then try to use all these cases to develop some general understanding of how and when we learn from errors in the context of the fractal nature of scientific revolutions.

The video is here.

It’s sooooo frustrating when people get things wrong, the mistake is explained to them, and they still don’t make the correction or take the opportunity to learn from their mistakes.

To put it another way . . . when you find out you made a mistake, you learn three things:

1. Now: Your original statement was wrong.

2. Implications for the future: Beliefs and actions that flow from that original statement may be wrong. You should investigate your reasoning going forward and adjust to account for your error.

3. Implications for the past: Something in your existing workflow led to your error. You should trace your workflow, see how that happened, and alter your workflow accordingly.

In poker, they say to evaluate the strategy, not the play. In quality control, they say to evaluate the process, not the individual outcome. Similarly with workflow.

As we’ve discussed many many times in this space (for example, here), it makes me want to screeeeeeeeeeam when people forego this opportunity to learn. Why do people, sometimes very accomplished people, give up this opportunity? I’m speaking here of people who are trying their best, not hacks and self-promoters.

The simple answer for why even honest people will avoid admitting clear mistakes is that it’s embarrassing for them to admit error, they don’t want to lose face.

The longer answer, I’m afraid, is that at some level they recognize issues 1, 2, and 3 above, and they go to some effort to avoid confronting item 1 because they really really don’t want to face item 2 (their beliefs and actions might be affected, and they don’t want to hear that!) and item 3 (they might be going about everything all wrong, and they don’t want to hear that either!).

So, paradoxically, the very benefits of learning from error are scary enough to some people that they’ll deny or bury their own mistakes. Again, I’m speaking here of otherwise-sincere people, not of people who are willing to lie to protect their investment or make some political point or whatever.

In my talk, I’ll focus on my own mistakes, not those of others. My goal is for you in the audience to learn how to improve your own workflow so you can catch errors faster and learn more from them, in all three senses listed above.

P.S. Planning a talk can be good for my research workflow. I’ll get invited to speak somewhere, then I’ll write a title and abstract that seems like it should work for that audience, then the existence of this structure gives me a chance to think about what to say. For example, I’d never quite thought of the three ways of learning from error until writing this post, which in turn was motivated by the talk coming up. I like this framework. I’m not claiming it’s new—I guess it’s in Pólya somewhere—just that it will help my workflow. Here’s another recent example of how the act of preparing an abstract helped me think about a topic of continuing interest to me.

Regarding the use of “common sense” when evaluating research claims

I’ve often appealed to “common sense” or “face validity” when considering unusual research claims. For example, the statement that single women during certain times of the month were 20 percentage points more likely to support Barack Obama, or the claim that losing an election for governor increases politicians’ lifespan by 5-10 years on average, or the claim that a subliminal smiley face flashed on a computer screen causes large changes in people’s attitudes on immigration, or the claim that attractive parents are 36% more likely to have girl babies . . . these claims violated common sense. Or, to put it another way, they violated my general understanding of voting, health, political attitudes, and human reproduction.

I often appeal to common sense, but that doesn’t mean that I think common sense is always correct or that we should defer to common sense. Rather, common sense represents some approximation of a prior distribution or existing model of the world. When our inferences contradict our expectations, that is noteworthy (in a chapter 6 of BDA sort of way), and we want to address this. It could be that addressing this will result in a revision of “common sense.” That’s fine, but if we do decide that our common sense was mistaken, I think we should make that statement explicitly. What bothers me is when people report findings that contradict common sense and don’t address the revision in understanding that would be required to accept that.

In each of the above-cited examples (all discussed at various times on this blog), there was a much more convincing alternative explanation for the claimed results, given some mixture of statistical errors and selection bias (p-hacking or forking paths). That’s not to say the claims are wrong (Who knows?? All things are possible!), but it does tell us that we don’t need to abandon our prior understanding of these things. If we want to abandon our earlier common-sense views, that would be a choice to be made, an affirmative statement that those earlier views are held so weakly that they can be toppled by little if any statistical evidence.
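As a toy illustration of the prior-distribution framing (all numbers invented): under normal-normal updating, a skeptical prior centered near zero is barely moved by a dramatic but noisy estimate, which is roughly what is going on when weak statistical evidence fails to topple common sense.

```python
# Toy normal-normal updating: a strong "common sense" prior versus a noisy estimate.
import numpy as np

prior_mean, prior_sd = 0.0, 1.0   # prior: effect near zero, at most a couple of points
est, se = 20.0, 8.0               # a surprising published estimate with a big standard error

post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
post_mean = post_var * (prior_mean / prior_sd**2 + est / se**2)
print(post_mean, np.sqrt(post_var))  # posterior stays close to the prior (about 0.3, not 20)
```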

P.S. Perhaps relevant is this recent article by Mark Whiting and Duncan Watts, “A framework for quantifying individual and collective common sense.”