Inspiring story from a chemistry classroom

From former chemistry teacher HildaRuth Beaumont:

I was reminded of my days as a newly qualified teacher at a Leicestershire comprehensive school in the 1970s, when I was given a group of reluctant pupils with the instruction to ‘keep them occupied’. After a couple of false starts we agreed that they might enjoy making simple glass ornaments. I knew a little about glass blowing so I was able to teach them how to combine coloured and transparent glass to make animal figures and Christmas tree decorations. Then one of them made a small bottle complete with stopper. Her classmate said she should buy some perfume, pour some of it into the bottle and give it to her mum as a Mother’s Day gift. ‘We could actually make the perfume too,’ I said. With some dried lavender, rose petals, and orange and lemon peel, we applied solvent extraction and steam distillation to good effect and everyone was able to produce small bottles of perfume for their mothers.

What a wonderful story. We didn’t do anything like this in our high school chemistry classes! Chemistry 1 was taught by an idiot who couldn’t understand the book he was teaching out of. Chemistry 2 was taught with a single-minded goal of teaching us how to solve the problems on the Advanced Placement exam. We did well on the exam and learned essentially zero chemistry. On the plus side, this allowed me to place out of the chemistry requirement in college. On the minus side . . . maybe it would’ve been good for me to learn some chemistry in college. I don’t remember doing any labs in Chemistry 2 at all!

Preregistration is a floor, not a ceiling.

This comes up from time to time, for example someone sent me an email expressing a concern that preregistration stifles innovation: if Fleming had preregistered his study, he never would’ve noticed the penicillin mold, etc.

My response is that preregistration is a floor, not a ceiling. Preregistration is a list of things you plan to do, that’s all. Preregistration does not stop you from doing more. If Fleming had followed a pre-analysis protocol, that would’ve been fine: there would have been nothing stopping him from continuing to look at his bacterial cultures.

As I wrote in comments to my 2022 post, “What’s the difference between Derek Jeter and preregistration?” (which I just added to the lexicon), you don’t preregister “the” exact model specification; you preregister “an” exact model specification, and you’re always free to fit other models once you’ve seen the data.

It can be really valuable to preregister, to formulate hypotheses and simulate fake data before gathering any real data. To do this requires assumptions—it takes work!—and I think it’s work that’s well spent. And then, when the data arrive, do everything you’d planned to do, along with whatever else you want to do.
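To make the fake-data-simulation step concrete, here’s a minimal sketch in Python. The effect size, noise level, and sample size below are placeholder assumptions standing in for whatever a preregistration would commit to, not recommendations:

```python
import numpy as np

# Placeholder design assumptions; swap in whatever your preregistration specifies.
rng = np.random.default_rng(0)
n_per_group, assumed_effect, assumed_sd = 100, 0.3, 1.0

# Simulate fake data under those assumptions.
control = rng.normal(0.0, assumed_sd, n_per_group)
treated = rng.normal(assumed_effect, assumed_sd, n_per_group)

# Run the planned analysis (here, a simple difference in means) on the fake data.
est = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / n_per_group + control.var(ddof=1) / n_per_group)
print(f"estimate = {est:.2f}, standard error = {se:.2f}")
```

Looping this over many simulated datasets gives a sense of what estimates and uncertainties the planned design can be expected to deliver, before any real data arrive.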

Planning ahead should not get in the way of creativity. It should enhance creativity because you can focus your data-analytic efforts on new ideas rather than having to first figure out what defensible default thing you’re supposed to do.

Aaaand, pixels are free, so here’s that 2022 post in full:

“On the uses and abuses of regression models: a call for reform of statistical practice and teaching”: We’d appreciate your comments . . .

John Carlin writes:

I wanted to draw your attention to a paper that I’ve just published as a preprint: On the uses and abuses of regression models: a call for reform of statistical practice and teaching (pending publication I hope in a biostat journal). You and I have discussed how to teach regression on a few occasions over the years, but I think with the help of my brilliant colleague Margarita Moreno-Betancur I have finally figured out where the main problems lie – and why a radical rethink is needed. Here is the abstract:

When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. More broadly, statistics is widely understood to provide a body of techniques for “modelling data”, underpinned by what we describe as the “true model myth”, according to which the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective leads to a range of problems in the application of regression methods, including misguided “adjustment” for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline an alternative approach to the teaching and application of regression methods, which begins by focussing on clear definition of the substantive research question within one of three distinct types: descriptive, predictive, or causal. The simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of “input” variables, but their conceptualisation and usage should follow from the purpose at hand.

The paper is aimed at the biostat community, but I think the same issues apply very broadly at least across the non-physical sciences.

Interesting. I think this advice is roughly consistent with what Aki, Jennifer, and I say and do in our books Regression and Other Stories and Active Statistics.

More specifically, my take on teaching regression is similar to what Carlin and Moreno say, with the main difference being that I find that students have a lot of difficulty understanding plain old mathematical models. I spend a lot of time teaching the meaning of y = a + bx, how to graph it, etc. I feel that most regression textbooks focus too much on the error term and not enough on the deterministic part of the model. Also, I like what we say on the first page of Regression and Other Stories, about the three tasks of statistics being generalizing from sample to population, generalizing from control to treatment group, and generalizing from observed data to underlying constructs of interest. I think models are necessary for all three of these steps, so I do think that understanding models is important, and I’m not happy with minimalist treatments of regression that describe it as a way of estimating conditional expectations.

The first of these tasks is sampling inference, the second is causal inference, and the third refers to measurement. Statistics books (including my own) spend lots of time on sampling and causal inference, not so much on measurement. But measurement is important! For an example, see here.
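As a minimal illustration of the teaching point above, here’s a sketch (with made-up intercept, slope, and noise values) that draws the deterministic part y = a + bx alongside simulated data that add the error term:

```python
import numpy as np
import matplotlib.pyplot as plt

a, b = 1.5, 0.8                      # made-up intercept and slope
x = np.linspace(0, 10, 50)
y_deterministic = a + b * x          # the deterministic part of the model

rng = np.random.default_rng(0)
y_observed = y_deterministic + rng.normal(0, 1.0, size=x.size)  # add the error term

plt.plot(x, y_deterministic, label="y = a + bx (deterministic part)")
plt.scatter(x, y_observed, s=15, alpha=0.6, label="simulated data (with error term)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```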

If any of you have reactions to Carlin and Moreno’s paper, or if you have reactions to my reactions, please share them in comments, as I’m sure they’d appreciate it.

How often is there a political candidate such as Vivek Ramaswamy who is so much stronger in online polls than telephone polls?

Palko points to this news article, “The mystery of Vivek Ramaswamy’s rapid rise in the polls,” which states:

Ramaswamy’s strength comes almost entirely from polls conducted over the internet, according to a POLITICO analysis. In internet surveys over the past month — the vast majority of which are conducted among panels of people who sign up ahead of time to complete polls, often for financial incentives — Ramaswamy earns an average of 7.8 percent, a clear third behind Trump and DeSantis.

In polls conducted mostly or partially over the telephone, in which people are contacted randomly, not only does Ramaswamy lag his average score — he’s way back in seventh place, at just 2.6 percent.

There’s no singular, obvious explanation for the disparity, but there are some leading theories for it, namely the demographic characteristics and internet literacy of Ramaswamy’s supporters, along with the complications of an overly white audience trying to pronounce the name of a son of immigrants from India over the phone.

And then, in order for a respondent to choose Ramaswamy in a phone poll, he or she will have to repeat the name back to the interviewer. And the national Republican electorate is definitely older and whiter than the country as a whole: In a recent New York Times/Siena College poll, more than 80 percent of likely GOP primary voters were white, and 38 percent were 65 or older.

‘When your candidate is named Vivek Ramaswamy,’ said one Republican pollster, granted anonymity to discuss the polling dynamics candidly, ‘that’s like DEFCON 1 for confusion and mispronunciation.’

Palko writes:

Keeping in mind that the “surge” was never big (it maxed out at 10% and has been flat since), so that we’re talking about fairly small numbers in absolute terms, here are some questions:

1. How much do we normally expect phone and online to agree?

2. Ramaswamy generally scores around 3 times higher online than with phone. Have we seen that magnitude before?

3. How about a difficult-name bias? Have we seen that before? How about Buttigieg, for instance? Did a foreign-sounding name hurt Obama in early polls?

4. Is the difference in demographics great enough to explain the difference? Aren’t things like gender and age normally reweighted?

5. Are there other explanations we should consider?

I don’t have any answers here, just one thought which is that it’s early in the campaign (I guess I should call it the pre-campaign, given that the primary elections haven’t started yet), and so perhaps journalists are reasoning that, even if this candidate is not very popular among voters, his active internet presence makes him a reasonable dark-horse candidate looking forward. An elite taste now but could perhaps spread to the non-political-junkies in the future? Paradoxically, the fact that Ramaswamy has this strong online support despite his extreme political stances could be taken as a potential sign of strength? I don’t know.

Conformal prediction and people

This is Jessica. A couple weeks ago I wrote a post in response to Ben Recht’s critique of conformal prediction for quantifying uncertainty in a prediction. Compared to Ben, I am more open-minded about conformal prediction and associated generalizations like conformal risk control. Quantified uncertainty is inherently incomplete as an expression of the true limits of our knowledge, but I still often find value in trying to quantify it over stopping at a point estimate.

If expressions of uncertainty are generally wrong in some ways but still sometimes useful, then we should be interested in how people interact with different approaches to quantifying uncertainty. So I’m interested in seeing how people use conformal prediction sets relative to alternatives. This isn’t to say that I think conformal approaches can’t be useful without being human-facing (which is the direction of some recent work on conformal decision theory). I just don’t think I would have spent the last ten years thinking about how people interact and make decisions with data and models if I didn’t believe that they need to be involved in many decision processes. 

So now I want to discuss what we know from the handful of controlled studies that have looked at human use of prediction sets, starting with the one I’m most familiar with since it’s from my lab.

In Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling, we study people making decisions with the assistance of a predictive model. Specifically, they label images with access to predictions from a pre-trained computer vision model. In keeping with the theme that real world conditions may deviate from expectations, we consider two scenarios: one where the model makes highly accurate predictions because the new images are from the same distribution as those that the model is trained on, and one where the new images are out of distribution. 

We compared their accuracy and the distance between their responses and the true label (in the Wordnet hierarchy, which conveniently maps to ImageNet) across four display conditions. One was no assistance at all, so we could benchmark unaided human accuracy against model accuracy for our setting. People were generally worse than the model in this setting, though the human with AI assistance was able to do better than the model alone in a few cases.

The other three displays were variations on model assistance, including the model’s top prediction with the softmax probability, the top 10 model predictions with softmax probabilities, and a prediction set generated using split conformal prediction with 95% coverage.
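For readers who want to see the mechanics, here’s a minimal sketch of split conformal prediction sets for a classifier, using the common 1-minus-softmax-probability nonconformity score. This is a generic illustration with hypothetical function and variable names, not the exact procedure from the paper:

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    """Split conformal prediction sets for a K-class classifier (sketch).

    cal_probs:  (n_cal, K) predicted probabilities on a held-out calibration set
    cal_labels: (n_cal,) true labels for the calibration set
    test_probs: (n_test, K) predicted probabilities for new instances
    alpha:      miscoverage level; 0.05 targets 95% marginal coverage
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample correction.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")  # numpy >= 1.22
    # A class enters the prediction set when its score is at or below the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]
```

The 95% coverage statement is marginal over calibration and test data drawn exchangeably from the same distribution, which is exactly the assumption the out-of-distribution condition breaks.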

We calibrated the prediction sets we presented offline, not dynamically. Because the human is making decisions conditional on the model predictions, we should expect the distribution to change. But often we aren’t going to be able to calibrate adaptively because we don’t immediately observe the ground truth. And even if we do, at any particular point in time we could still be said to hover on the boundary of having useful prior information and steering things off course. So when we introduce a new uncertainty quantification to any human decision setting, we should be concerned with how it works when the setting is as expected and when it’s not, i.e., the guarantees may be misleading.

Our study partially gets at this. Ideally we would have tested some cases where the stated coverage guarantee for the prediction sets was false. But for the out-of-distribution images we generated, we would have had to do a lot of cherry-picking of stimuli to break the conformal coverage guarantee as much as the top-1 coverage broke. The coverage degraded a little but stayed pretty high over the entire set of out-of-distribution instances for the types of perturbations we focused on (>80%, compared to 70% for top 10 and 43% for top 1). For the set of stimuli we actually tested, the coverage for all three was a bit higher, with top-1 coverage getting the biggest bump (70% compared to 83% top 10, 95% conformal). Below are some examples of the images people were classifying (where easy and hard are based on the cross-entropy loss given the model’s predicted probabilities, and smaller and larger refer to the size of the prediction sets).

We find that prediction sets don’t offer much value over top-1 or top-10 displays when the test instances are iid, and they can reduce accuracy on average for some types of instances. However, when the test instances are out of distribution, accuracy is slightly higher with access to prediction sets than with either top-k. This was the case even though the prediction sets for the OOD instances get very large (the average set size for “easy” OOD instances, as defined by the distribution of softmax values, was ~17, for “hard” OOD instances it was ~61, with people sometimes seeing sets with over 100 items). For the in-distribution cases, average set size was about 11 for the easy instances, and 30 for the hard ones.  

Based on the differences in coverage across the conditions we studied, our results are more likely to be informative for situations where conformal prediction is used because we think it’s going to degrade more gracefully under unexpected shifts. I’m not sure it’s reasonable to assume we’d have a good hunch about that in practice though.

In designing this experiment in discussion with my co-authors, and thinking more about the value of conformal prediction to model-assisted human decisions, I’ve been thinking about when a “bad” (in the sense of coming with a misleading guarantee) interval might still be better than no uncertainty quantification. I was recently reading Paul Meehl’s Clinical versus Statistical Prediction, where he contrasts the clinical judgments doctors make based on intuitive reasoning with statistical judgments informed by randomized controlled experiments. He references a distinction between the “context of discovery,” for some internal sense of probability that leads to a decision like a diagnosis, and the “context of verification,” where we collect the data we need to verify the quality of a prediction.

The clinician may be led, as in the present instance, to a guess which turns out to be correct because his brain is capable of that special “noticing the unusual” and “isolating the pattern” which is at present not characteristic of the traditional statistical techniques. Once he has been so led to a formulable sort of guess, we can check up on him actuarially. 

Thinking about the ways prediction intervals can affect decisions makes me think that whenever we’re dealing with humans, there’s potentially going to be a difference between what an uncertainty expression says and can guarantee and the value of that expression for the decision-maker. Quantifications with bad guarantees can still be useful if they change the context of discovery in ways that promote broader thinking or taking the idea of uncertainty seriously. This is what I meant when in my last post I said “the meaning of an uncertainty quantification depends on its use.” But precisely articulating how they do this is hard. It’s much easier to identify ways calibration can break.

There are a few other studies that look at human use of conformal prediction sets, but to avoid making this post even longer, I’ll summarize them in an upcoming post.

P.S. There have been a few other interesting posts on uncertainty quantification in the CS blogosphere recently, including David Stutz’s response to Ben’s remarks about conformal prediction, and on designing uncertainty quantification for decision making from Aaron Roth.

“Hot hand”: The controversy that shouldn’t be. And thinking more about what makes something into a controversy:

I was involved in a recent email discussion, leading to this summary:

There is no theoretical or empirical reason for the hot hand to be controversial. The only good reason for there being a controversy is that the mistaken paper by Gilovich et al. appeared first. At this point we should give Gilovich et al. credit for bringing up the hot hand as a subject of study and accept that they were wrong in their theory, empirics, and conclusions, and we can all move on. There is no shame in this for Gilovich et al. We all make mistakes, and what’s important is not the personalities but the research that leads to understanding, often through tortuous routes.

“No theoretical reason”: see discussion here, for example.

“No empirical reason”: see here and lots more in the recent literature.

“The only good reason . . . appeared first”: Beware the research incumbency rule.

More generally, what makes something a controversy? I’m not quite sure, but I think the news media play a big part. We talked about this recently in the context of the always-popular UFOs-as-space-aliens theory, which used to be considered a joke in polite company but now seems to have reached the level of controversy.

I don’t have anything systematic to say about all this right now, but the general topic seems very worthy of study.

“Here’s the Unsealed Report Showing How Harvard Concluded That a Dishonesty Expert Committed Misconduct”

Stephanie Lee has the story:

Harvard Business School’s investigative report into the behavioral scientist Francesca Gino was made public this week, revealing extensive details about how the institution came to conclude that the professor committed research misconduct in a series of papers.

The nearly 1,300-page document was unsealed after a Tuesday ruling from a Massachusetts judge, the latest development in a $25 million lawsuit that Gino filed last year against Harvard University, the dean of the Harvard Business School, and three business-school professors who first notified Harvard of red flags in four of her papers. All four have been retracted. . . .

According to the report, dated March 7, 2023, one of Gino’s main defenses to the committee was that the perpetrator could have been someone else — someone who had access to her computer, online data-storage account, and/or data files.

Gino named a professor as the most likely suspect. The person’s name was redacted in the released report, but she is identified as a female professor who was a co-author of Gino’s on a 2012 now-retracted paper about inducing honest behavior by prompting people to sign a form at the top rather than at the bottom. . . . Allegedly, she was “angry” at Gino for “not sufficiently defending” one of their collaborators “against perceived attacks by another co-author” concerning an experiment in the paper.

But the investigation committee did not see a “plausible motive” for the other professor to have committed misconduct by falsifying Gino’s data. “Gino presented no evidence of any data falsification actions by actors with malicious intentions,” the committee wrote. . . .

Gino’s other main defense, according to the report: Honest errors may have occurred when her research assistants were coding, checking, or cleaning the data. . . .

Again, however, the committee wrote that “she does not provide any evidence of [research assistant] error that we find persuasive in explaining the major anomalies and discrepancies.”

The full report is at the link.

Some background is here, also here, and some reanalyses of the data are linked here.

Now we just have to get to the bottom of the story about the shredder and the 80-pound rock and we’ll pretty much have settled all the open questions in this field.

We’ve already determined that the “burly coolie” story and the “smallish town” story never happened.

It’s good we have dishonesty experts. There’s a lot of dishonesty out there.

Abraham Lincoln and confidence intervals

This one from 2017 is good; I want to share it with all of you again:

Our recent discussion with mathematician Russ Lyons on confidence intervals reminded me of a famous logic paradox, in which equality is not as simple as it seems.

The classic example goes as follows: Abraham Lincoln is the 16th president of the United States, but this does not mean that one can substitute the two expressions “Abraham Lincoln” and “the 16th president of the United States” at will. For example, consider the statement, “If things had gone a bit differently in 1860, Stephen Douglas could have become the 16th president of the United States.” This becomes flat-out false if we do the substitution: “If things had gone a bit differently in 1860, Stephen Douglas could have become Abraham Lincoln.”

Now to confidence intervals. I agree with Rink Hoekstra, Richard Morey, Jeff Rouder, and Eric-Jan Wagenmakers that the following sort of statement, “We can be 95% confident that the true mean lies between 0.1 and 0.4,” is not in general a correct way to describe a classical confidence interval. Classical confidence intervals represent statements that are correct under repeated sampling based on some model; thus the correct statement (as we see it) is something like, “Under repeated sampling, the true mean will be inside the confidence interval 95% of the time” or even “Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” Russ Lyons, however, felt the statement “We can be 95% confident that the true mean lies between 0.1 and 0.4,” was just fine. In his view, “this is the very meaning of ‘confidence.’”

This is where Abraham Lincoln comes in. We can all agree on the following summary:

A. Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.

And we could even perhaps feel that the phrase “confidence interval” implies “averaging over repeated samples,” and thus the following statement is reasonable:

B. “We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.”

Now consider the other statement that caused so much trouble:

C. “We can be 95% confident that the true mean lies between 0.1 and 0.4.”

In a problem where the confidence interval is [0.1, 0.4], “the lower and upper endpoints of the confidence interval” is just “0.1 and 0.4.” So B and C are the same, no? No. Abraham Lincoln, meet the 16th president of the United States.

In statistical terms, once you supply numbers on the interval, you’re conditioning on it. You’re no longer implicitly averaging over repeated samples. Just as, once you supply a name to the president, you’re no longer implicitly averaging over possible elections.

So here’s what happened. We can all agree on statement A. Statement B is a briefer version of A, eliminating the explicit mention of replications because they are implicit in the reference to a confidence interval. Statement C does a seemingly innocuous switch but, as a result, implies conditioning on the interval, thus resulting in a much stronger statement that is not necessarily true (that is, in mathematical terms, is not in general true).

None of this is an argument over statistical practice. One might feel that classical confidence statements are a worthy goal for statistical procedures, or maybe not. But, like it or not, confidence statements are all about repeated sampling and are not in general true about any particular interval that you might see.
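A quick simulation makes the repeated-sampling reading of statement A concrete; the true mean, noise level, and sample size below are arbitrary. Roughly 95% of the intervals produced across repeated samples cover the true mean, which is a statement about the procedure, not about any one realized interval such as [0.1, 0.4]:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n, reps = 0.25, 1.0, 50, 10_000

covered = 0
for _ in range(reps):
    y = rng.normal(true_mean, sigma, size=n)
    se = y.std(ddof=1) / np.sqrt(n)
    lo, hi = y.mean() - 1.96 * se, y.mean() + 1.96 * se
    covered += (lo <= true_mean <= hi)

print(f"coverage over repeated samples: {covered / reps:.3f}")  # close to 0.95
```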

P.S. More here.

You probably don’t have a general algorithm for an MLE of Gaussian mixtures

Those of you who are familiar with Garey and Johnson’s 1979 classic, Computers and Intractability: a guide to the theory of NP-completeness, may notice I’m simply “porting” their introduction, including the dialogue, to the statistics world.

Imagine Andrew had tasked me and Matt Hoffman with fitting simple standard (aka isotropic, aka spherical) Gaussian mixtures rather than hierarchical models. Let’s say that Andrew didn’t like that K-means got a different answer every time he ran it, K-means++ wasn’t much better, and even using soft-clustering (i.e., fitting the stat model with EM) didn’t let him replicate simulated data. Would we have something like Stan for mixtures? Sadly, no. Matt and I may have tried and failed. We wouldn’t want to go back to Andrew and say,

  1. “We can’t find an efficient algorithm. I guess we’re just too dumb.”

We’re computer scientists and we know about proving hardness. We’d like to say,

  2. “We can’t find an efficient algorithm, because no such algorithm is possible.”

But that would’ve been beyond Matt’s and my grasp, because, in this particular case, it would require solving the biggest open problem in theoretical computer science. Instead, it’s almost certain we would have come back and said,

  3. “We can’t find an efficient algorithm, but neither can all these famous people.”

That seems weak. Why would we say that? Because we could’ve proven that the problem is NP-hard. A problem is in the class P if it can be solved in polynomial time with a deterministic algorithm. A problem is in the class NP when there is a non-deterministic (i.e., infinitely parallel) algorithm to solve it in polynomial time. A problem is NP-hard if it’s at least as hard as every problem in NP (formally specified through reductions, a powerful CS proof technique that’s the basis of Gödel’s incompleteness theorem). An NP-hard problem often has a non-deterministic algorithm that solves it by making a complete set of (exponentially many) guesses in parallel and then spending polynomial time on each one verifying whether or not it is a solution. A problem is NP-complete if it is NP-hard and a member of NP. Some well-known NP-complete problems are bin packing, satisfiability in propositional logic, and the traveling salesman problem—there’s a big list of NP-complete problems.

Nobody has found a tractable algorithm to solve an NP-hard problem. When we (computer scientists) say “tractable,” we mean solvable in polynomial time with a deterministic algorithm (i.e., the problem is in P). The only known algorithms for NP-hard problems take exponential time. Researchers have been working for the last 50+ years trying to prove that P and NP are distinct classes, i.e., that NP-complete problems cannot be solved in polynomial time.

In other words, there’s a Turing Award waiting for you if you can actually turn response (3) into response (2).

In the case of mixtures of standard (spherical, isotropic) Gaussians there’s a short JMLR paper with a proof that maximum likelihood estimation is NP-hard.

And yes, that’s the same Tosh who was the first author of the “piranha” paper.

Ising models that are not restricted to be planar are also NP-hard.

What both these problems have in common is that they are combinatorial and require inference over sets. I think (though am really not sure) that one of the appeals of quantum computing is potentially solving NP-hard problems.

P.S. How this story really would’ve gone is that we would’ve told Andrew that some simple distributions over NP-hard problem instances lead to expected polynomial-time algorithms, and we’d be knee-deep in the kinds of heuristics used to pack container ships efficiently.
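To get a concrete feel for the practical symptom in the story, EM landing in different places on different runs, here is a minimal sketch assuming scikit-learn is available. It fits a spherical Gaussian mixture from several random initializations and prints the log-likelihood bound each run reaches; the simulated data and all numbers are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulate data from a 3-component spherical Gaussian mixture in 2 dimensions.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(m, 1.0, size=(200, 2)) for m in means])

# EM from different random starts can converge to different local optima,
# which is the practical face of there being no general algorithm for the global MLE.
for seed in range(5):
    gm = GaussianMixture(n_components=3, covariance_type="spherical",
                         n_init=1, init_params="random", random_state=seed)
    gm.fit(X)
    print(f"seed {seed}: log-likelihood bound {gm.lower_bound_:.3f}")
```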

Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through Youtube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the pre-specified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.
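As a sketch of what a minimal multiverse analysis can look like, the loop below fits every specification built from a set of candidate covariates and reports the whole distribution of treatment-effect estimates rather than a single one. The data are simulated and the variable names are hypothetical stand-ins for the county-level outcome and covariates:

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data; in practice this would be the county-level dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "baseline_vax": rng.normal(0, 1, n),
    "pop_density": rng.normal(0, 1, n),
    "pct_over_65": rng.normal(0, 1, n),
})
df["new_vax"] = 10 * df["treated"] + 5 * df["baseline_vax"] + rng.normal(0, 20, n)

# Multiverse: every specification that adjusts for some subset of the covariates.
covariates = ["baseline_vax", "pop_density", "pct_over_65"]
rows = []
for k in range(len(covariates) + 1):
    for subset in itertools.combinations(covariates, k):
        formula = "new_vax ~ treated" + "".join(f" + {c}" for c in subset)
        fit = smf.ols(formula, data=df).fit()
        rows.append({"spec": formula, "estimate": fit.params["treated"],
                     "std_err": fit.bse["treated"]})

print(pd.DataFrame(rows).round(2))
```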

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (that would be 0.2 when following the standard approach) because they have “low signal-to-noise ratios” . . . that’s just wack. Low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend this policy be done in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding thing just seems to be missing the point, in that it’s trying to squeeze out an expression of near-certainty from data that don’t admit such an interpretation.
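A quick back-of-the-envelope calculation under normal theory shows just how much uncertainty a one-tailed p-value of about 0.1 leaves: the implied z-statistic is around 1.3, so the 95% interval runs from about half the estimate below zero to about two and a half times the estimate above it.

```python
from scipy.stats import norm

p_one_tailed = 0.097                     # reported one-tailed p-value
z = norm.ppf(1 - p_one_tailed)           # implied z-statistic, about 1.30
lo, hi = z - 1.96, z + 1.96              # 95% interval in standard-error units
print(f"z = {z:.2f}; 95% CI = [{lo:.2f}, {hi:.2f}] standard errors")
print(f"equivalently [{lo / z:.2f}, {hi / z:.2f}] times the point estimate")
```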

P.S. I do not write this sort of post out of any sort of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or to otherwise reach some sort of success point. It should be to learn what we can from our data and model, and to also get a sense of what we don’t know.

Fully funded doctoral student positions in Finland

There is a new government-funded Finnish Doctoral Program in AI. Research topics include Bayesian inference, modeling and workflows as part of fundamental AI. There is a big joint call, where you can choose the supervisor you want to work with. I (Aki) am also one of the supervisors. Come work with me or share the news! The first call deadline is April 2, and the second call deadline is in fall 2024. See how to apply at https://fcai.fi/doctoral-program, and more about my research at my web page.

Zotero now features retraction notices

David Singerman writes:

Like a lot of other humanities and social sciences people I use Zotero to keep track of citations, create bibliographies, and even take & store notes. I also am not alone in using it in teaching, making it a required tool for undergraduates in my classes so they learn to think about organizing their information early on. And it has sharing features too, so classes can create group bibliographies that they can keep using after the semester ends.

Anyway my desktop client for Zotero updated itself today and when it relaunched I had a big red banner informing me that an article in my library had been retracted! I didn’t recognize it at first, but eventually realized that was because it was an article one of my students had added to their group library for a project.

The developers did a good job of making the alert unmissable (i.e. not like a corrections notice in a journal), the full item page contains lots of information and helpful links about the retraction, and there’s a big red X next to the listing in my library. See attached screenshots.

The way they implemented it will also help the teaching component, since a student will get this alert too.

Singerman adds this P.S.:

This has reminded me that some time ago you posted something about David Byrne, and whatever you said, it made me think of David Byrne’s wonderful appearance on the Colbert Report.

What was amazing to me when I saw it was that it’s kind of like a battle between Byrne’s inherent weirdness and sincerity, and Colbert’s satirical right-wing bloviator character. Usually Colbert’s character was strong enough to defeat all comers, but . . . decide for yourself.

Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.

Refuted papers continue to be cited more than their failed replications: Can a new search engine be built that will fix this problem?

Paul von Hippel writes:

Stuart Buck noticed your recent post on A WestLaw for Science. This is something that Stuart and I started talking about last year, and Stuart, who trained as an attorney, believes it was first suggested by a law professor about 15 years ago.

Since the 19th century, the legal profession has had citation indices that do far more than count citations and match keywords. Resources like Shepard’s Citations—first printed in 1873 and now published online along with competing tools such as JustCite, KeyCite, BCite, and SmartCite—do not just find relevant cases and statutes; they show lawyers whether a case or statute is still “good law.” Legal citation indexes show lawyers which cases have been affirmed or cited approvingly, and which have been criticized, reversed, or overruled by later courts.

Although Shepard’s Citations inspired the first Science Citation Index in 1960, which in turn inspired tools like Google Scholar, today’s academic search engines still rely primarily on citation counts and keywords. As a result, many scientists are like lawyers who walk into the courtroom unaware that a case central to their argument has been overruled.

Kind of, but not quite. A key difference is that in the courtroom there is some reasonable chance that the opposing lawyer or the judge will notice that the key case has been overruled, so that your argument that hinges on that case will fail. You have a clear incentive to not rely on overruled cases. In science, however, there’s no opposing lawyer and no judge: you can build an entire career on studies that fail to replicate, and no problem at all, as long as you don’t pull any really ridiculous stunts.

Von Hippel continues:

Let me share a couple of relevant articles that we recently published.

One, titled “Is Psychological Science Self-Correcting?”, reports that replication studies, whether successful or unsuccessful, rarely have much effect on citations to the studies being replicated. When a finding fails to replicate, most influential studies sail on, continuing to gather citations at a similar rate for years, as though the replication had never been tried. The issue is not limited to psychology and raises serious questions about how quickly the scientific community corrects itself, and whether replication studies are having the correcting influence that we would like them to have. I considered several possible reasons for the persistent influence of studies that failed to replicate, and concluded that academic search engines like Google Scholar may well be part of the problem, since they prioritize highly cited articles, replicable or not, perpetuating the influence of questionable findings.

The finding that replications don’t affect citations has itself replicated pretty well. A recent blog post by Bob Reed at the University of Canterbury, New Zealand, summarized five recent papers that showed more or less the same thing in psychology, economics, and Nature/Science publications.

In a second article, published just last week in Nature Human Behaviour, Stuart Buck and I suggest ways to Improve academic search engines to reduce scholars’ biases. We suggest that the next generation of academic search engines should do more than count citations, but should help scholars assess studies’ rigor and reliability. We also suggest that future engines should be transparent, responsive and open source.

This seems like a reasonable proposal. The good news is that it’s not necessary for their hypothetical new search engine to dominate or replace existing products. People can use Google Scholar to find the most cited papers and use this new thing to inform about rigor and reliability. A nudge in the right direction, you might say.

A new piranha paper

Kris Hardies points to this new article, Impossible Hypotheses and Effect-Size Limits, by Wijnand and Lennert van Tilburg, which states:

There are mathematical limits to the magnitudes that population effect sizes can take within the common multivariate context in which psychology is situated, and these limits can be far more restrictive than typically assumed. The implication is that some hypothesized or preregistered effect sizes may be impossible. At the same time, these restrictions offer a way of statistically triangulating the plausible range of unknown effect sizes.

This is closely related to our Piranha Principle, which we first formulated here and then followed up with this paper. It’s great to see more work being done in this area.

Statistical practice as scientific exploration

This was originally going to happen today, 8 Mar 2024, but it got postponed to some unspecified future date, I don’t know why. In the meantime, here’s the title and abstract:

Statistical practice as scientific exploration

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Much has been written on the philosophy of statistics: How can noisy data, mediated by probabilistic models, inform our understanding of the world? After a brief review of that topic (in short, I am a Bayesian but not an inductivist), I discuss the ways in which researchers, when using and developing statistical methods, are acting as scientists, forming, evaluating, and elaborating provisional theories about the data and processes they are modeling. This perspective has the conceptual value of pointing toward ways that statistical theory can be expanded to incorporate aspects of workflow that were formerly tacit or informal aspects of good practice, and the practical value of motivating tools for improved statistical workflow, as described in part in this article: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf

The whole thing is kind of mysterious to me. In the email invitation it was called the UPenn Philosophy of Computation and Data Workshop, but then they sent me a flyer where it was called the Philosophy of A.I., Data Science, & Society Workshop in the Quantitative Theory and Methods Department at Emory University. It was going to be on zoom so I guess the particular university affiliation didn’t matter.

In any case, the topic is important, and I’m always interested in speaking with people on the philosophy of statistics. So I hope they get around to rescheduling this one.

Relating t-statistics and the relative width of confidence intervals

How much does a statistically significant estimate tell us quantitatively? If you have an estimate that’s statistically distinguishable from zero with some t-statistic, what does that say about your confidence interval?

Perhaps most simply, with a t-statistic of 2, your 95% confidence intervals will nearly touch 0. That is, they’re just about 100% wide in each direction. So they cover everything from nothing (0%) to around double your estimate (200%).

More generally, for a 95% confidence interval (CI), 1.96/t — or let’s say 2/t — gives the relative half-width of the CI. So for an estimate with t=4, everything from half your estimate to 150% of your estimate is in the 95% CI.

For other commonly-used nominal coverage rates, the confidence intervals have a width that is less conducive to a rule of thumb, since the critical value isn’t something nice like ~2. (For example, with 99% CIs, the Gaussian critical value is 2.58.) Let’s look at 90, 95, and 99% confidence intervals for t = 1.96, 3, 4, 5, and 6:

[Figure: confidence intervals on the scale of the estimate, for t = 1.96, 3, 4, 5, and 6 at 90%, 95%, and 99% coverage]
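These relative intervals are easy to reproduce; here’s a small sketch, assuming scipy is available, that prints each confidence interval as a multiple of the estimate for the t-statistics and coverage levels above.

```python
from scipy.stats import norm

for t in [1.96, 3, 4, 5, 6]:
    for coverage in [0.90, 0.95, 0.99]:
        z = norm.ppf(1 - (1 - coverage) / 2)     # critical value, e.g. 1.96 for 95%
        half_width = z / t                        # half-width relative to the estimate
        lo, hi = 1 - half_width, 1 + half_width   # CI as a multiple of the estimate
        print(f"t = {t:>4}, {coverage:.0%} CI: [{lo:5.2f}, {hi:5.2f}] x estimate")
```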

You can see, for example, that even at t=5, the halved point estimate is still inside the 99% CI. Perhaps this helpfully highlights how much more precision you need to confidently state the size of an effect than just to reject the null.

These “relative” confidence intervals are just this smooth function of t (and thus the p-value), as displayed here:

[Figure: confidence intervals on the scale of the estimate, by p-value and t-statistic]

It is only when the statistical evidence against the null is overwhelming — “six sigma” overwhelming or more — that you’re also getting tight confidence intervals in relative terms. Among other things, this highlights that if you need to use your estimates quantitatively, rather than just to reject the null, default power analysis is going to be overoptimistic.

A caveat: All of this just considers standard confidence intervals based on normal theory labeled by their nominal coverage. Of course, many p < 0.05 estimates may have been arrived at by wandering through a garden of forking paths, or precisely because it passed a statistical significance filter. Then these CIs are not going to conditionally have their advertised coverage.

With journals, it’s all about the wedding, never about the marriage.

John “not Jaws” Williams writes:

Here is another example of how hard it is to get erroneous publications corrected, this time from the climatology literature, and how poorly peer review can work.

From the linked article, by Gavin Schmidt:

Back in March 2022, Nicola Scafetta published a short paper in Geophysical Research Letters (GRL) . . . We (me, Gareth Jones and John Kennedy) wrote a note up within a couple of days pointing out how wrongheaded the reasoning was and how the results did not stand up to scrutiny. . . .

After some back and forth on how exactly this would work (including updating the GRL website to accept comments), we reformatted our note as a comment, and submitted it formally on December 12, 2022. We were assured from the editor-in-chief and publications manager that this would be a ‘streamlined’ and ‘timely’ review process. With respect to our comment, that appeared to be the case: It was reviewed, received minor comments, was resubmitted, and accepted on January 28, 2023. But there it sat for 7 months! . . .

The issue was that the GRL editors wanted to have both the comment and a reply appear together. However, the reply had to pass peer review as well, and that seems to have been a bit of a bottleneck. But while the reply wasn’t being accepted, our comment sat in limbo. Indeed, the situation inadvertently gives the criticized author(s) an effective delaying tactic since, as long as a reply is promised but not delivered, the comment doesn’t see the light of day. . . .

All in all, it took 17 months, two separate processes, and dozens of emails, who knows how much internal deliberation, for an official comment to get into the journal pointing out issues that were obvious immediately the paper came out. . . .

The odd thing about how long this has taken is that the substance of the comment was produced extremely quickly (a few days) because the errors in the original paper were both commonplace and easily demonstrated. The time, instead, has been entirely taken up by the process itself. . . .

Schmidt also asks a good question:

Why bother? . . . Why do we need to correct the scientific record in formal ways when we have abundant blogs, PubPeer, and social media, to get the message out?

His answer:

Since journals remain extremely reluctant to point to third party commentary on their published papers, going through the journals’ own process seems like it’s the only way to get a comment or criticism noticed by the people who are reading the original article.

Good point. I’m glad that there are people like Schmidt and his collaborators who go to the trouble to correct the public record. I do this from time to time, but mostly I don’t like the stress of dealing with the journals so I’ll just post things here.

My reaction

This story did not surprise me. I’ve heard it a million times, and it’s often happened to me, which is why I once wrote an article called It’s too hard to publish criticisms and obtain data for replication.

Journal editors mostly hate to go back and revise anything. They’re doing volunteer work, and they’re usually in it because they want to publish new and exciting work. Replications, corrections, etc., that’s all seen as boooooring.

With journals, it’s all about the wedding, never about the marriage.

My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar)

Hi all. The other day I announced a talk I’ll be giving at the NYU economics seminar. It will be Thurs 18 Apr 12:30pm at 19 West 4th St., room 517.

In my earlier post, I’d given the wrong day for the talk. I’d written that it was this Thurs, 7 Mar. That was wrong! Completely my fault here; I misread my own calendar.

So I hope nobody shows up to that room tomorrow! Thank you for your forbearance.

I hope to see youall on Thurs 18 Apr. Again, here’s the title and abstract:

How large is that treatment effect, really?

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometrics typically focuses on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.