2015-vintage replication-crisis-era junk science floats into the news

So, I came across this news article titled, “Riley Thinks Suits Make the Coach. Research Says He Might Be Right.”:

The suit had a classic name: the Clark Gable. Navy blue and cut just right, it was the creation of Giorgio Armani, the legendary Italian designer.

It was the piece that made Pat Riley, the legendary NBA coach and executive, believe in the power of style. . . .

“I think an audience wants to see somebody on the sidelines who looks like a leader, dresses like a leader, acts like a leader,” Riley said.

It sounded like a bold claim. Sure, a business suit is undoubtedly nicer than the casual “athleisure” look — team-issue polos and pullovers — that NBA coaches adopted during the COVID-19 pandemic. But can a coat and tie really make someone more of a leader?

“It’s a perfectly reasonable thing to think,” said Abe Rutchick, a professor of psychology at California State University, Northridge. “Which is the idea that the clothes we wear have psychological meaning. We put something on, it’s not just clothes. It means something.”

Uh oh, social psychology research . . .

The article continues:

In the early 2010s, during the rise of casual attire, Rutchick and his colleagues examined a similar question and found something intriguing: Wearing formal attire might actually make a person think and act like a leader.

The researchers, using a variety of cognitive tasks, found that wearing formal clothes caused participants to shift from a concrete mode of thinking to a more abstract mindset — they thought of the big picture and looked further into the future. In other words, they thought like someone who was in charge. . . .

The paper, published in 2015, came a few years after another group of researchers found that people who wore a doctor’s white lab coat — and understood its symbolic meaning — had an increased ability to focus and pay attention. . . .

This sounds pretty bad, no joke. The early 2010s were the high-water mark of junk social psychology. This sort of study was one of the main reasons that the replication crisis became a crisis.

I thought journalists had wised up on this sort of thing, but I guess it remains afloat in the business-inspirational world of leadership.

Don’t get me wrong–I have no problem with these “leadership” stories. It’s cool to read about Pat Riley, and I have no reason to doubt that suit-wearing worked well for him. Everyone has to develop their own personal style. My problem is just with the purported scientific claims.

I found the journal article and, yeah, it’s classic replication crisis fodder:

Study 1: N = 60, p = .03
Study 2: “conceptual replication,” N = 60, p = .05 with 18 people excluded because of missing data
Study 3: N = 34, p = .02
Study 4: N = 54, p = .03 after some data were excluded
Study 5: N = 150, a mix of significant and non-significant results, conclusions made based on whether various inferences reached a significance threshold.

This is pretty much textbook bad statistical analysis of the replication-crisis variety:
– Small sample sizes and noisy data so that there’s essentially no power to detect realistic effect sizes (the kangaroo problem);
– Many researcher degrees of freedom in data exclusion, coding, and analysis, the sort of flexibility that makes it possible to achieve statistically significant p-values even in the absence of any signal;
– A bunch of p-values all in the 0.01 to 0.05 range, which is not what you’d expect from a sampling model of independent experiments (or see here);
– Flexible theories that could explain results through many sorts of interactions (the piranha problem);
– No preregistered replications.
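To put a rough number on the p-value point, here is a minimal sketch. It assumes each study is a two-sided z-test whose statistic is normal(mu, 1), where mu is the true effect divided by its standard error. That setup and the function names are illustrative assumptions, not the tests used in the paper. The code computes the share of statistically significant results that fall between 0.01 and 0.05, and the chance that four independent significant results all do.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def prob_significant(mu, alpha):
    """P(two-sided p-value < alpha) when the z-statistic is normal(mu, 1)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z_crit - mu)) + STD_NORMAL.cdf(-z_crit - mu)

def frac_between(mu, lo=0.01, hi=0.05):
    """P(lo < p < hi | p < hi): share of significant results that are only barely so."""
    return (prob_significant(mu, hi) - prob_significant(mu, lo)) / prob_significant(mu, hi)

# mu values are hypothetical: no effect, a low-powered study, roughly 80% power
for mu in (0.0, 1.0, 2.8):
    share = frac_between(mu)
    print(f"mu={mu}: power={prob_significant(mu, 0.05):.2f}, "
          f"share in (0.01, 0.05)={share:.2f}, four in a row={share ** 4:.3f}")
```

Under these assumptions, a study with about 80% power lands in the 0.01 to 0.05 range only about a quarter of the time it reaches significance, so four such results in a row would be unusual. With no effect at all, the share is 80%.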

That’s just how they did things back in 2015 so I’m not trying to single out these particular researchers. We know better now. We know not to trust this sort of claim. We don’t need to find a Wansink- or Ariely-style smoking gun; nobody’s suggesting there’s fraud here; it’s just standard-issue junk science of the sort that, until recently, was regularly published in major psychology journals and was regularly featured uncritically in major news media.

The only notable thing to me is to see this sort of claim being pushed in the New York Times now, because I had the vague impression that journalists were now aware of the replication crisis. But I guess there’s still a reservoir of credulity for such claims for stories related to the fuzzy topic of business leadership. I’d hope that straight-up sports reporting would have higher standards for the reporting of research on human performance.

P.S. This is an appropriate post for July 4th now that junk science is ensconced in the U.S. government.

Survey Statistics: Big Changes in the Times/Siena Poll

Yesterday Nate Cohn wrote about The Big Changes Coming to the Times/Siena Poll, with more details in their poll of Maine.

Say we want to estimate average Platner support in Maine’s likely electorate, E(Y). But we only have survey respondents, R = 1.

The NYT uses survey weights to weight respondents, E(YW | R = 1). In contrast, some pollsters use MRP, fitting a Multilevel Regression model for Platner support, then applying it to the population, E(E_model(Y | X, R = 1)).
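Here is a toy sketch of the two estimators, using made-up data with a single demographic variable. The cell names, shares, and responses are all hypothetical, and this is not how the NYT builds its weights. In this simplest case, with cell-level weights on one side and raw cell means on the other, the two estimates coincide. MRP differs by replacing the raw cell means with predictions from a multilevel model.

```python
# Hypothetical population shares and respondents: (cell, supports Platner: 1 or 0)
pop_share = {"young": 0.4, "old": 0.6}
sample = [("young", 1), ("young", 0),
          ("old", 1), ("old", 1), ("old", 0), ("old", 0), ("old", 0), ("old", 0)]

def weighted_estimate(sample, pop_share):
    """Weighted respondent mean, with w_i = population share / sample share of i's cell."""
    n = len(sample)
    counts = {}
    for cell, _ in sample:
        counts[cell] = counts.get(cell, 0) + 1
    weights = [pop_share[cell] / (counts[cell] / n) for cell, _ in sample]
    return sum(w * y for w, (_, y) in zip(weights, sample)) / sum(weights)

def poststratified_estimate(cell_pred, pop_share):
    """Population-share-weighted average of per-cell predictions E(Y | X)."""
    return sum(pop_share[cell] * cell_pred[cell] for cell in pop_share)

# Raw cell means stand in for model predictions here; MRP would use a fitted multilevel model.
cell_means = {}
for cell in pop_share:
    ys = [y for c, y in sample if c == cell]
    cell_means[cell] = sum(ys) / len(ys)

print(weighted_estimate(sample, pop_share))
print(poststratified_estimate(cell_means, pop_share))
```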

Nate discusses 2 Big Changes to how they construct the weights W.

(The polar bear has not yet hiked in ME, but he is training for it. This above is in TN.)

Big Change 1: Support score

A few weeks ago we saw the NYT started weighting on “synthetic 2024 vote”, which is recalled 2024 vote that is validated with the voter file and imputed if needed.

Now they’re also weighting on support score = E(2024 vote | other X variables). Nate explains the motivation:

While a poll can’t weight on dozens of variables, the support score lets us pile a lot of information into a single measure.

This reminded me of the causal inference context, where D’Amour and Franks (2021) “see especially strong performance for propensity weights computed with respect to the prognostic score”, where the prognostic score is E(Y | X, control). In our survey context, this would be a model for Platner support Y. Instead, the NYT uses 2024 vote, perhaps for applicability across multiple outcomes Y ?

Big Change 2: Energy balancing

Beyond adding new weighting variables, they’re also changing how they calculate the weights. Nate notes the challenge of weighting on many variables and interactions with typical sample sizes. So they are turning to the R package WeightIt, which implements the energy balancing method from Huling & Mak (2024):

This article introduces a new weighting method, called energy balancing, which instead aims to balance weighted covariate distributions. By directly targeting distributional imbalance, the proposed weighting strategy can be flexibly utilized in a wide variety of causal analyses without the need for careful model or moment specification.

The energy balancing weights do not use outcome Y, but the paper notes that estimates can be improved with a model for Y.
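To make "balancing weighted covariate distributions" concrete, here is a minimal sketch of the quantity involved: the energy distance between a weighted sample and a target, for a single numeric covariate. It only evaluates the distance for given weights. The actual method searches for the weights that minimize it (with constraints, and with a multivariate norm), and the function name and data below are illustrative assumptions.

```python
def weighted_energy_distance(x, w, target):
    """Energy distance between the w-weighted sample x and an unweighted target sample.

    Uses 2*E|X - T| - E|X - X'| - E|T - T'| with absolute differences (1-D covariate).
    """
    total = sum(w)
    w = [wi / total for wi in w]  # normalize weights to sum to 1
    n_t = len(target)
    cross = sum(wi * abs(xi - tj) for xi, wi in zip(x, w) for tj in target) / n_t
    within_x = sum(wi * wj * abs(xi - xj)
                   for xi, wi in zip(x, w) for xj, wj in zip(x, w))
    within_t = sum(abs(a - b) for a in target for b in target) / n_t ** 2
    return 2 * cross - within_x - within_t

# Hypothetical example: the target has 75% zeros, the sample has 50% zeros.
x = [0, 1]
target = [0, 0, 0, 1]
print(weighted_energy_distance(x, [0.5, 0.5], target))    # equal weights: imbalance remains
print(weighted_energy_distance(x, [0.75, 0.25], target))  # weights that match the target
```

Because the distance compares whole distributions rather than a chosen list of means or interactions, no moment specification is needed. The cost is that nothing in the distance itself says which covariates matter for the outcome.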

How do energy balancing weights handle the challenge of jointly weighting on many variables with typical sample sizes “without the need for model specification” ?

The Anthropic Principle in Statistics and Science (my talk this Mon 29 June, 4:20pm London time)

The Anthropic Principle in Statistics and Science

The anthropic principle in physics states that our existence implies certain constraints on the natural conditions under which we evolved. In statistics, a corresponding anthropic principle can be used to infer properties of the models we should fit to data. For example, experiments are typically aimed to have a precision sufficient to estimate effects of interest but without overkill; it is rare to have an estimate that is 10 standard errors from zero. We demonstrate through several examples in social and medical sciences how the anthropic principle, combined with Bayesian inference, can be used to improve statistical practice.

Here are a couple of applications of the idea:

• [2000] Should we take measurements at an intermediate design point?

• [2022] A proposal for informative default priors scaled by the standard error of estimates (with Erik van Zwet)

In my talk I’ll discuss these and other examples. I think this anthropic principle is really important, arguably more important in statistics than in physics, which is the field where it originated.

Here’s the zoom information for the talk on Mon 29 June, 4:20pm London time:

https://imperial-ac-uk.zoom.us/j/97341955036?pwd=1kKNbPAwJthKtG55ynXMVF3TLSvIbl.1
Meeting ID: 973 4195 5036
Passcode: J3Ue$f

I’ll be speaking (remotely) at this conference celebrating the 60th birthday of physicist Andrew Jaffe. This seems to be the season for 60th birthday conferences.

I know AJ from when he was visiting the Flatiron Institute last year. We worked together on The Squealer: Sensification of model exploration and model misfit. There’s no connection between the Squealer and the anthropic principle; I decided to speak on the latter topic because I thought it would be of general interest to an audience of physicists.

Treating AI review like the contentious policy design problem it is

This is Jessica. Many researchers are thinking about what we should do about scientific peer review now that AI makes producing papers so much easier. Submission numbers keep getting higher: in the past week, I saw reports that the most recent ACL submission cycle got 17k+ submissions, up from ~10k last cycle. TMLR went from getting 500 submissions every 60 days or so to getting the same number every 19 days. There are simply not enough human reviewers to handle the surge, at least not without a dip in quality. The noisier the review system gets, the greater the incentive to submit sloppy papers, because you might get lucky. This is the so-called “review death spiral.”

It is a hard problem. Quotas on submissions per author are one avenue forward, which TMLR just announced it would adopt. Not surprisingly, many reviewers are also turning to AI to help. The question becomes how to design AI review protocols to help reduce some of the noise, through preliminary filtering or flagging or helping guide human attention to parts of a paper that are most likely to be problematic. 

But what sorts of checks should an AI review assistant run on a paper? It’s useful to separate basic integrity violations AI could flag, like is there evidence of plagiarism, fake citations, missing code/data to reproduce main results (which are comparatively less controversial) from “epistemic filters,” like does the paper pass replicability checks, robustness checks, preregistration checks, statistical significance checks, etc. There’s a temptation to blur these things in proposing how to apply AI to review. It’s easy to assume that the metascientists have already established that practices like replicability or preregistration are truth-indicating and we can just implement them at scale (and indeed, ML researchers are citing open science and other reform arguments to back their proposals).

But if there’s one lesson to be learned from the aftermath of the replication crisis, it’s that there is no small, stable, non-conflicting set of detectable signals of good science that will find the good stuff and reject the bad. There are heuristics that can be useful prompts for deliberation – get in the habit of preregistering, make sure you can replicate your results, test the sensitivity of your results to choices you made along the way – but things get weird when we start treating them like universal requirements. Authors shift attention away from unrewarded signals, like better theory or exploratory work, and become preoccupied with rigor signaling through their methods. The result is not necessarily more thoughtfulness. 

And so even if the AI review tools we create are simply intended to inform human reviewers about what checks a paper passed, what we implement will have important policy implications by incentivizing more work like that in the future. I don’t think we are in a good position to predict what happens if suddenly we require multiverse robustness or statistical significance in a field like machine learning, which has in many ways been all about iterative improvement and “frictionless reproducibility” rather than individual results passing all the robustness checks.

The answer is not to avoid using AI in review until we can find a non-gameable set of credibility qualities to have AI focus on, as some have recently argued (though I agree with the linked paper that we need more rigor in how we go about motivating review tools). Non-gameability sounds nice, but any automated review policy that allocates attention will be gameable, because ensuring good science is not so simple as finding the right checklist. The relevant question is instead what assumptions and downstream incentives we are willing to tolerate. To this end, at the very least we should get in the habit of spelling out the assumptions we’re making, so that the trade-offs of focusing on particular proxies become explicit.

I wrote up this view recently in a paper called “Stop Treating Metascientific Heuristics as Quality Filters in AI Review.” Here’s the abstract: 

AI-implemented checks for reproducibility, robustness, preregistration, claim scope, and other intended proxies for scientific credibility can extend human reviewers’ capabilities. However, treating metascientific heuristics–whose theoretical grounding remains contested or incomplete–as necessary and sufficient signals for filtering out bad science is counterproductive to scientific progress. The emerging literature blurs the line between integrity filtering, based on necessary but insufficient signals of validity like reproducibility of stated results or lack of fake citations, and epistemic filtering, which uses machine-detectable signals to judge scientific quality. Drawing on critical metascience, we show that commonly proposed signals of research quality are insufficiently justified as general indicators of scientific value. The answer is not necessarily to ban AI in review, given the deluge of submissions venues are facing. Instead, in recognition of how any use of automated signals–even when deployed with human oversight–will shape attention and create incentives upstream, developers of AI review tools should explicitly specify their assumptions about how proxy signals inform on scientific quality in the context of specific review decisions. This approach treats AI review contributions as contestable decision policies that will shape future research, acknowledging the value-laden nature of scientific judgment and surfacing relevant tradeoffs. 

Rather than arguing for or against any particular proxies, I’m more interested in the methodological and philosophical mindset we should bring to the new questions raised by AI review. To demonstrate what I mean by more explicit motivation, I analyze an example review decision problem and set of detectable signals in the appendix, drawing on an analysis of how statistical significance and exact replication success relate to signal-to-noise ratios measured under error from a recent paper by Erik van Zwet, Andrew, and Witold Więcek. The takeaway is that the value of a proxy will depend on how you define the latent state you care about (e.g., whether the direction of an effect was correctly estimated, how big the true signal-to-noise ratio is), what you assume about the generating process (i.e., how the proxy noisily reflects the latent state), and what you assume about the decision-maker’s choice of actions and utility function. By suggesting this approach, I am *not* suggesting that one can validate a new review tool’s utility before it’s been deployed. The point is that there will be trade-offs no matter what, and the best we can do is be concrete about the kinds of assumptions that have to hold for proxies to be useful in review, so the community can debate what risks they are willing to accept.

In this sense, my argument is very much along the same lines as Devezer et al.’s argument that those proposing reform procedures should adopt more formal methodology to avoid unwarranted overgeneralization. Once checks become part of review infrastructure, they stop being neutral diagnostics and become policy levers. Let’s start treating them as such in research on AI review.

Survey Statistics: perfect collinearity in the sample but not in the population

In 2019, Andrew blogged about collinearity in Bayesian models. In the comments, he pointed to an example from Bayesian Data Analysis, 2nd edition (BDA2). I think it is a useful example to keep in mind when extrapolating from sample to population. Since folks (like me) may only have BDA3 on their shelf, I thought I’d talk thru it.

Bayesian Data Analysis, Second Edition, by Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin (Chapman & Hall/CRC Texts in Statistical Science, ISBN 9781584883883)

Pretend it is 1980 and we are at the US Census Bureau. We just revamped the occupational coding system, and it’s so much better ! We want 1980-style codes on all our old data that only had 1970-style codes. Let’s trade in our peasant blouses for some shoulder pads.

Say we have double-coded training data (n = 10,000) with:

  • O_1980 = occupation coded in the 1980 coding system
  • O_1970 = occupation coded in the 1970 coding system
  • E = education, either high or low
  • I = income, either high or low

We want to impute O_1980 for the single-coded full dataset (N = 1,000,000) with only O_1970, E, and I.

Consider everyone with a specific occupation according to the 1970 codes, e.g. Accountants. Say there are 200 accountants in the double-coded training data and they have either high income and high education or low income and low education. They have either OCCUP1 or OCCUP2 according to the 1980 codes.

From BDA2 Table 9.1:

Say we use standard regression software to fit p(O_1980 | O_1970 = Accountants, E, I). It will flag the predictors E and I as perfectly collinear, because in the double-coded training sample, education and income are perfectly correlated.

Suppose you drop education and use only income. The single-coded data actually has some low education and high income folks. The model only uses income, so 90% of them get OCCUP1. But suppose I drop income and use only education. My model only uses education, so only 10% of them get OCCUP1. Who is correct ?
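A small sketch of the disagreement, using hypothetical counts chosen only to match the 90% and 10% figures above (the actual Table 9.1 counts are not reproduced here). The training data contain no "E=low, I=high" units at all, so each one-predictor model is extrapolating, and the two extrapolations contradict each other.

```python
# Hypothetical training rows for the 200 accountants: (education, income, 1980 occupation)
training = (
    [("high", "high", "OCCUP1")] * 90 + [("high", "high", "OCCUP2")] * 10 +
    [("low", "low", "OCCUP1")] * 10 + [("low", "low", "OCCUP2")] * 90
)
EDU, INC, OCC = 0, 1, 2  # column positions

def share_occup1(rows, column, value):
    """Empirical Pr(OCCUP1) among training rows where the given column equals value."""
    matched = [row for row in rows if row[column] == value]
    return sum(row[OCC] == "OCCUP1" for row in matched) / len(matched)

# Prediction for a single-coded unit with low education and high income:
income_only = share_occup1(training, INC, "high")    # model that dropped education
education_only = share_occup1(training, EDU, "low")  # model that dropped income

# The (education, income) combinations that the training data actually cover
cells_seen = {(edu, inc) for edu, inc, _ in training}

print(income_only, education_only)
print(("low", "high") in cells_seen)
```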

As the authors say:

the truth is that we have essentially no evidence on the split for these units… the occupational split for the ‘E=low, I=high’ units should vary between, say, 90/10 and 10/90. … If some variable should or could be in the model on substantive grounds, then it should be included even if it is not ‘statistically significant’ and even if there is no information in the data to estimate it using traditional methods.

 

A tool for learning about Fourier transforms

Eric Novik came by my talk the other day and we were chatting about a number of things, including how much we forget as the years go by. I remarked that I used to be very comfortable with Fourier analysis and was able to use it as a research tool—see section 2.2 of my Ph.D. thesis, and it also came up in my research leading to R-hat (although it didn’t make it into the writeup)—but at this point I only understand Fourier analysis on a conceptual level. It’s not one of these things that stuck with me.

In response, Eric pointed to this app that he created (with chatbot assistance) to help him understand some things about Fourier series. Maybe it will be useful to some of you too. The source code is here.

Survey Statistics: using MRP in later analyses (pride edition)

Happy pride !

One way I celebrated was by reading Lax & Phillips 2009, Gay Rights in the States: Public Opinion and Policy Responsiveness. It’s on-theme, an example in the MrPlew paper (which I also still need to digest), and I wanted examples of using MRP in later analyses.

Lax & Phillips 2009 studied the relationship between state-level public opinion and state adoption of policies affecting gays and lesbians. Andrew blogged about this work in Nov 2008, Jan 2009, and June 2009 when he wrote:

Fancy statistical analysis can indeed lead to better understanding. Jeff Lax and Justin Phillips used the method of multilevel regression and poststratification (“Mister P”…

The paper’s appendix includes a NYT article and an almost-rainbow-colored plot:

Lax & Phillips 2009 used MRP to estimate state-level public opinion E(y | s). Let

  • y_i = 1 if person i supports laws to protect against discrimination in job opportunities (for example), = 0 otherwise
  • s[i] = state where person i lives, e.g. NY
  • L_s = 1 if state s has laws to protect against discrimination in job opportunities, = 0 otherwise

Their Multilevel Regression (“MR” of MRP) model had race, gender, age, education, state, and poll effects:

They modeled the state effect with state-level predictors (% religious conservatives, % Democratic voters in 2004):

Then they Poststratified (“P” of MRP) to the population:

Then they used the MRP estimate of public opinion as a predictor of whether the state adopts the policy:
Pr(L_s = 1) = logit^-1(a + b * y_s^pred)
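
The three steps described in words above can be sketched in equations as follows. This is a reconstruction from the prose only, not the paper's exact specification, which may include further terms such as interactions. The index names, the coefficients gamma, and the cell notation (theta_c for the model-predicted support in demographic cell c, N_c for that cell's population count) are assumptions introduced here.

```latex
% Multilevel regression: individual response with demographic, state, and poll effects
\Pr(y_i = 1) = \operatorname{logit}^{-1}\!\left(
  \beta^{0} + \alpha^{\mathrm{race}}_{r[i]} + \alpha^{\mathrm{gender}}_{g[i]}
  + \alpha^{\mathrm{age}}_{a[i]} + \alpha^{\mathrm{edu}}_{e[i]}
  + \alpha^{\mathrm{state}}_{s[i]} + \alpha^{\mathrm{poll}}_{p[i]} \right)

% State effects modeled with state-level predictors
\alpha^{\mathrm{state}}_{s} \sim \operatorname{normal}\!\left(
  \gamma_{0} + \gamma_{1}\,\mathrm{relig}_{s} + \gamma_{2}\,\mathrm{dem}_{s},\;
  \sigma_{\mathrm{state}} \right)

% Poststratification: population-weighted average of cell predictions within state s
y_{s}^{\mathrm{pred}} = \frac{\sum_{c \in s} N_{c}\,\theta_{c}}{\sum_{c \in s} N_{c}}
```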

From their Figure 1:

Questions:

  1. (How) did Lax & Phillips 2009 incorporate uncertainty in the MRP estimate of public opinion y_s^pred in their later analysis of its effect on policy adoption L_s ?
    Footnote 7 says they incorporated uncertainty for non-MRP estimates:

    if we use an opinion index based on disaggregation instead of MRP estimates, correcting for reliability using an error-in-variables approach (eivreg in Stata)…

  2. Are results sensitive to whether policy adoption L_s is a state-level predictor in the MRP model ?

The New York Knicks and the martingale property of calibrated probability forecasts (with some simulation and R code)

This long post covers four topics:

1. The Knicks’ stunning series of come-from-behind victories to win the NBA title in 5 games;

2. The martingale property of probability forecasts;

3. An example of learning from simulation;

4. How we (sometimes) do research in probability and statistics.

I don’t know enough about this blog’s audience to know which of the four topics will appeal to most of you. For the internet as a whole, it’s #1; for most of you, it might be #3.

I’m interested in all four, which is why I’m writing this all up right now. I’m embarrassed to say that it took several hours to do this. I was originally planning to post this Sunday morning after the game but it took time for me to get to the task. Most of the effort came from writing the code, not from writing the text. And there’s actually not much code, as you can see if you scroll to the end of this post. The main effort was not figuring out the syntax or even debugging (although there was some of that) but in working out what I wanted to be coding in the first place.

On the plus side, this is research I’ve been wanting to do for a while, so (a) I don’t think this effort is wasted, even beyond whatever educational and entertainment value it has for you, and (b) I learned a bit from this already. Looking at data is always good; experimenting with simulation is always good.

Ok, here goes.

The NBA finals

Hey, remember this, from game 4 of the recent NBA finals:

Or the trajectory of the game that came after:

Just for completeness, here are the traces for games 3, 2, and 1, also courtesy of ESPN:

In game 4, the Spurs at one point were estimated to have a 99.6% chance of winning. But, as you might have heard, they lost.

Extreme win probabilities

Were those stated win probabilities too extreme?

On one hand, sure, unusual events happen on occasion. If you have a 0.4% chance of losing, that’s something that should happen 1 in 250 times, and there were a lot more than 250 basketball games just in this past season. On the other hand, very unusual events are supposed to happen only very rarely, and there was a point in the third quarter of game 4 where ESPN’s algorithm gave the Spurs a 97.1% chance of winning, a point in game 1 where the Spurs were given a 94.1% chance. There was a moment in game 2 where the Knicks were assigned a 98.2% chance of winning, and, sure, they did win that one, but given that the final score was 105-104, after being tied 97-97 and 104-104, it seems in retrospect that this 98.2% was a bit overconfident.

Should we be suspicious of these probabilities? One way to ask this question is to check calibration: if we collect all game situations where a team has a 99.6% chance of winning, are they winning 99.6% of the time?

On the other hand, I’m picking the most extreme values of these win probabilities. You should get calibration of win probabilities at any time, and it’s ok to condition on them, but only to condition on what came before.

That is, if we look at win probabilities at the end of the first quarter, or at the end of the first half, or at the end of the third quarter, they should be calibrated. And if you look at win probabilities only when they’re greater than 99%, they should be calibrated. And if you look only at win probabilities when they are the maximum for the game so far, they should be calibrated. But it’s not clear to me that you should expect calibration for win probabilities selected to be the maximum for the entire game, because if the win probability at time t is p(t), and you condition on the event p(t) < p(t_0) for t > t_0, that could provide information. It’s tricky.

The martingale property of probability forecasts

We wrote about this in section 1.6 of our 2020 article, Information, incentives, and goals in election forecasts:


And it also came up in some blog posts:

from 2020: Do we really believe the Democrats have an 88% chance of winning the presidential election?

from 2020: More on martingale property of probabilistic forecasts and some other issues with our election model

from 2024: “Unusual Betting Patterns With Several Temple Games”: It’s martingale time, baby!

also from 2024: It’s martingale time, baby! How to evaluate probabilistic forecasts before the event happens? Rajiv Sethi has an idea. (Hint: it involves time series.)

I’d expect ESPN’s win probabilities to be closer to calibrated than prediction-market odds or model-based election forecasts. Prediction markets depend on the bettors and there’s no reason to expect calibration, at least not until the market is fully mature in some way. Model-based election forecasts are based on approximate models that have known pathologies (for example here), so they won’t be universally calibrated. ESPN’s probabilities won’t be calibrated either–they too are based on an imperfect model–but I assume its model has been trained on tons of data so I don’t think it should be far off.

If someone could send me the moment-by-moment estimated win probabilities from some large database of basketball games, we could take a look.

In the meantime we can get some intuition by simulating from a mathematical model where we can compute win probabilities exactly.

Simulating the process

Assume a simple Brownian motion with drift, where the score differential y(t) starts at y(0) = 0 and then takes a continuous random walk so that y(t) ~ normal(delta*t, sigma*sqrt(t)). We’ll scale t to be in minutes, so the game goes from t=0 to t=48, with the winner being determined by y(48). The drift is then delta=point_spread/48, because this is the expected final score differential before the game has started. And we’ll set sigma=2, which seems reasonable: 2*sqrt(48)=13.9, so that the sd of the final score differential is approximately 14 points.

One cool thing about this model is that the win probability can be trivially computed given the score differential at any point in the game.
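Here is a minimal Python sketch of that computation (the post's own code is in R; the function and constant names here are illustrative). Given the current differential y at minute t, the final differential is normal with mean y + delta*(48 - t) and sd sigma*sqrt(48 - t), so the win probability is a single normal cdf evaluation.

```python
from math import sqrt
from statistics import NormalDist

GAME_MINUTES = 48.0

def win_probability(y, t, point_spread=0.0, sigma=2.0):
    """Pr(y(48) > 0 | y(t) = y) under Brownian motion with drift point_spread/48."""
    remaining = GAME_MINUTES - t
    if remaining <= 0:
        # Game over: the outcome is known (a tie is scored as 0.5 here by convention)
        return 1.0 if y > 0 else (0.0 if y < 0 else 0.5)
    delta = point_spread / GAME_MINUTES
    mean_final = y + delta * remaining  # expected final score differential
    sd_final = sigma * sqrt(remaining)  # sd of what is still to come
    return NormalDist().cdf(mean_final / sd_final)

print(win_probability(0, 0))    # evenly matched teams at tipoff
print(win_probability(10, 36))  # up by 10 at the end of the third quarter
```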

How wrong can you be?

To demonstrate, I’ll show the results–the score and the win probability during the game–for 18 independently simulated games. For simplicity I’ll assume the point spread is 0, so the two teams are always assumed to be evenly matched. And I’ll step through the game 10 times per minute, thus approximating the game as a sum of 480 independent increments.

The code is below; here are the results:

I don’t know enough about basketball to have a sense of how plausible these are as game outcomes (setting aside the lack of discreteness in the score; we used a continuous model so that we could more easily compute the relevant probabilities analytically). They don’t look too much like the Knicks-Spurs game except for that one simulation near the lower left of the plot, where the “Spurs” led by 10 points into the third quarter, maxing out with a win probability of 95.6% before eventually losing.

To get a broader picture, I simulated 10,000 games. (Just as a reference point, there are 30 NBA teams, so there are 82*30/2=1230 regular season games each year.)

For each game, I computed “max_p_wrong”: the highest win probability assigned to the game’s eventual loser. In my simulation, every game starts with a 50/50 probability–remember, for simplicity I’m always assuming a point spread of 0–so max_p_wrong must be somewhere between 0.5 and 1. Here’s what comes out:

So, extreme wrong probabilities are not unheard of. How common are they? Out of these 10,000 games, 61 had max_p_wrong greater than 99%. That is, in 0.6% of games, the eventually-losing team exceeds the threshold of 99% win probability during some point in the game.
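A sketch of this simulation in Python, not a translation of the post's R code: names and defaults are illustrative, and the point spread is fixed at 0. With a zero spread the win probability is Phi(z), where z is the current lead divided by sigma*sqrt(time remaining). Because Phi is increasing, it is enough to track the largest and smallest z during the game and apply Phi once at the end.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_max_p_wrong(n_games, steps_per_minute=10, sigma=2.0, minutes=48, seed=0):
    """For each simulated game, the highest win probability ever given to the eventual loser."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    n_steps = minutes * steps_per_minute
    dt = 1.0 / steps_per_minute
    step_sd = sigma * sqrt(dt)
    results = []
    for _ in range(n_games):
        y = 0.0
        z_max = z_min = 0.0  # standardized lead at tipoff, where both teams are at 50%
        for k in range(1, n_steps + 1):
            y += rng.gauss(0.0, step_sd)
            if k < n_steps:  # no forecast at the final buzzer
                z = y / (sigma * sqrt((n_steps - k) * dt))
                z_max = max(z_max, z)
                z_min = min(z_min, z)
        # If the first team lost, its best moment was z_max; otherwise the
        # second team lost, and its best moment was -z_min.
        results.append(phi(z_max) if y < 0 else phi(-z_min))
    return results

games = simulate_max_p_wrong(2000)
print(sum(p > 0.99 for p in games) / len(games))  # share of games with max_p_wrong > 99%
```

With only 2,000 games the estimated share is noisy. Raising n_games and steps_per_minute moves it toward the settings discussed in the text, at the cost of run time.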

This result should go up if we move to continuous updating. But we’re already updating 10 times a minute. Increasing this schedule to 50 times a minute increases Pr(max_p_wrong > 0.99) to 0.0075, and increasing to 100 times a minute takes it to 0.0076, so my guess is that this is roughly the continuous limit.

OK, just to check, I’ll simulate 100,000 games, and now Pr(max_p_wrong > 0.99) is 0.0072 with 10 updates a minute, or 0.0084 with 50 updates per minute. So I’ll go out on a limb and say that if we were to compute the exact probability under continuous updating, we’d get 0.0085.

This was a surprise. Before doing this simulation, I was assuming that the probability of p_win exceeding 99% for the eventual loser at any time in the game would be more than 1% because of selection. I guess my intuition was wrong. Maybe it has to do with the fact that I’m conditioning on which team wins. (Of course, if you go the other way, the probability of p_win exceeding 99% for the eventual winner is 100% in the continuous limit, because with epsilon of a second left in the game the winner will almost certainly be known.)
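
As a back-of-envelope check on the continuous limit, here is an optional-stopping argument. It is a sketch that is not part of the original simulations, and it assumes only that the win probability p(t) is a continuous martingale that starts at 1/2 and ends at 0 or 1. The symbols q and tau_q are introduced here.

```latex
% tau_q = first time team A's win probability reaches q, for 1/2 < q < 1.
% If A never reaches q then, by continuity, A cannot end at 1, so optional stopping gives
\tfrac{1}{2} = \mathrm{E}\!\left[p(\tau_q \wedge 48)\right] = q \,\Pr(\tau_q < 48)
\quad\Longrightarrow\quad \Pr(\tau_q < 48) = \frac{1}{2q}.

% Having reached q, team A still loses with probability 1 - q:
\Pr(\text{A reaches } q \text{ and loses}) = \frac{1-q}{2q}.

% The same holds for team B, and only one team loses, so
\Pr(\mathrm{max\_p\_wrong} \ge q) = \frac{1-q}{q},
\qquad q = 0.99:\ \frac{0.01}{0.99} \approx 0.0101.
```

If this argument is right, the continuous-time value sits just above 1%, and simulated values on a discrete grid fall below it because they miss brief excursions above 99%, especially late in the game when the probability moves fastest.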

So, yeah, the above graph is kind of interesting. Under our model, most games won’t stray too far into retrospectively-embarrassing probability estimates, but it can happen sometimes.

It would be interesting to compare the above graph with what you’d get from a database of game-odds data from ESPN or whatever.

Just to be clear: there’s no reason to think that the above graph represents any sort of universal property of martingales. It’s a very specific model! But you have to start somewhere. Also, the existence of various central limit theorems makes me hold out the hope that this could be a general result under some appropriately restricted class of continuous martingale processes. It’s a research question!

A surprising uniform distribution

To get some further understanding of the process, I gathered the win probabilities after the end of each of the three quarters for the 10,000 simulated games. Below are histograms of these probabilities and calibration plots:

Unsurprisingly, the calibration is fine. After all, the probabilities are computed from the same model that the data are drawn from. Indeed, even the apparent anomaly in the lower-left plot is just a small-sample artifact which disappears when we up the number of simulations to 100,000.
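A minimal Python sketch of this check, again under the zero-spread model and with illustrative names. It jumps straight from quarter to quarter (four normal increments per game), records the win probability after each of the first three quarters, and compares the mean forecast with the observed win rate within bins.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_quarter_probs(n_games, sigma=2.0, seed=1):
    """Per game: (p after Q1, p at halftime, p after Q3, whether the first team won)."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    quarter_sd = sigma * sqrt(12.0)  # sd of the change in differential over 12 minutes
    games = []
    for _ in range(n_games):
        y = 0.0
        probs = []
        for quarter in (1, 2, 3, 4):
            y += rng.gauss(0.0, quarter_sd)
            remaining = 48.0 - 12.0 * quarter
            if remaining > 0:
                probs.append(phi(y / (sigma * sqrt(remaining))))
        games.append((probs[0], probs[1], probs[2], y > 0))
    return games

def calibration_table(probs, outcomes, n_bins=10):
    """For each nonempty probability bin: (mean forecast, observed win rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, won in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, won))
    return [(sum(p for p, _ in b) / len(b), sum(w for _, w in b) / len(b), len(b))
            for b in bins if b]

games = simulate_quarter_probs(10000)
halftime = [g[1] for g in games]
won = [g[3] for g in games]
for forecast, observed, count in calibration_table(halftime, won):
    print(f"{forecast:.2f}  {observed:.2f}  {count}")
```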

More interesting are the histograms. It makes sense that, as the game goes on, the distribution of win probabilities starts at 0.5, then gradually bunches up at 0 and 1. Indeed, at the end of the fourth quarter the win probabilities are exactly 0 and 1.

But it’s funny how the distribution of win probabilities is exactly uniform at halftime. There must be a direct mathematical argument giving intuition for that result; it’s too perfect to just be an accident.
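One candidate for that argument, sketched under the assumption that the lead is a driftless Brownian motion between evenly matched teams (an assumption about the model, since the code is not shown here): at time t the lead x is N(0, sigma^2 t), and the win probability is Phi(x / (sigma sqrt(T - t))), with Phi the standard normal CDF. At halftime t equals T - t, so the argument of Phi is a standard normal, and Phi of a standard normal is uniform. A quick numerical check, with all function names made up for illustration:

```python
import random
from statistics import NormalDist

def win_prob_at(frac_elapsed, n_games=100_000, sigma=1.0, seed=1):
    """Simulate team A's win probability at a given fraction of the game.

    Assumes (hypothetically) that the lead is a driftless Brownian motion,
    so lead ~ N(0, sigma^2 * t) and p_win = Phi(lead / (sigma * sqrt(T - t))).
    frac_elapsed must be strictly between 0 and 1; game length is T = 1.
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf
    t, remaining = frac_elapsed, 1.0 - frac_elapsed
    probs = []
    for _ in range(n_games):
        lead = rng.gauss(0.0, sigma * t ** 0.5)
        probs.append(phi(lead / (sigma * remaining ** 0.5)))
    return probs

def decile_shares(probs):
    """Share of games whose win probability falls in each tenth of [0, 1]."""
    counts = [0] * 10
    for p in probs:
        counts[min(int(p * 10), 9)] += 1
    return [c / len(probs) for c in counts]

halftime = decile_shares(win_prob_at(0.5))
third_quarter = decile_shares(win_prob_at(0.75))
print([round(s, 3) for s in halftime])       # roughly flat
print([round(s, 3) for s in third_quarter])  # piles up near 0 and 1
```

The same function with frac_elapsed below 0.5 gives a distribution bunched around 0.5, so halftime is the crossover point between the two shapes.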

Lots more research to be done here:

– Generalizing beyond the continuous model to allow discrete scoring changes.

– Generalizing beyond the random walk; there’s no reason the model needs to be Markovian.

– Are there general statements that can be made about these distributions of win probabilities under arbitrary martingale processes? I’m guessing there are some results. At least, there should be some inequalities and limit theorems.

– Looking at real data from basketball, other sports, and other realms, including election forecasts and prediction markets.

Our ultimate aim here is to come up with a general measure of departure from the martingale property of probability forecasts. We want something that can be applied to any dataset, obviously with more precision as the series get longer, more finely-spaced in time, and when replications are available (as in those thousands of basketball games).

P.S. Here’s the R code to make the above simulations and graphs:

Adjusting for nonrepresentativeness in continuous norming using multilevel regression and poststratification.

Klazien de Vries, Marieke E. Timmerman, Anja F. Ernst, and Casper J. Albers write:

In psychological test norming, nonrepresentativeness in background variables in the normative sample can lead to bias in the normed score estimates. Because representativeness is difficult to establish in practice, adjustment methods are needed to combat this bias. As a candidate adjustment method, we investigated generalized additive models for location, scale, and shape with multilevel regression and poststratification (GAMLSS + MRP), the combination of MRP and continuous norming with GAMLSS. This adjustment method was then compared to current adjustment methods in continuous norming using weighted regression: GAMLSS + P (with poststratification) and cNORM + R (with raking). The results of our simulation showed that GAMLSS + MRP was generally more efficient than GAMLSS + P and cNORM + R. Furthermore, GAMLSS + MRP was better than the current methods at reducing bias in samples where the nonrepresentativeness was age-dependent. We argue that GAMLSS + MRP is a valid adjustment method in continuous norming and recommend this adjustment method to mitigate bias in nonrepresentative normative samples. To facilitate the use of GAMLSS + MRP in practice, we provide a step-wise approach for the implementation of GAMLSS + MRP. We illustrate this approach by deriving normed scores from the normative data of the third Schlichting language test.

I don’t recall how I came across this paper, and I haven’t actually read it, but I wanted to share it with you, just because it’s cool to see the different ways that multilevel regression and poststratification (MRP) can be used.

Ultimately, MRP is the inevitable consequence of three things:

1. We are interested in generalizing to populations of interest.

2. Available data are typically unrepresentative of the population. This is the case even with simple random sampling–Hello, random variation! Hello, small-area estimation!–and is even more so with selected samples, nonresponse, dropout, etc. In some settings such as medical experimentation there’s not even an attempt to get a representative sample: you’re directly aiming to include in the study the groups of people who might get the greatest benefit from the treatment.

3. When adjusting for differences between sample and population, many variables can be relevant–for example, demographic and geographical variables in a survey of people–and so simple adjustments such as raw poststratification or non-multilevel regression adjustment won’t do the job.

Put this together and you’ll want to do MRP (or, more generally, RPP: regularized prediction and poststratification). It’s not just for survey research. It comes up everywhere in statistics and machine learning, whenever there is a concern with population prediction, or generalization, or transportability, or whatever you want to call it.

It can seem like a hassle that to do this you need to know (or estimate, or postulate) a distribution of predictors in your population, but (a) this is often work that’s well worth the effort, if you really care about the population, (b) dependence of the result on the choice of population is important, and where this dependence is strong you should be aware of it, and (c) if you want to take the easy way out you can always bootstrap to get inference for the hypothetical population of which your data are considered to be a random sample.
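To make the two ingredients concrete, here is a minimal sketch with made-up numbers. The cell names, sample sizes, and population counts are hypothetical, and the "multilevel" step is a crude stand-in: a fixed shrinkage formula that pulls each cell mean toward the overall sample mean, with the variance ratio k assumed known. A real analysis would fit a multilevel regression instead.

```python
# Hypothetical cells: (sample size, sample mean of y, population count).
cells = {
    "young": (10, 0.70, 5000),
    "middle": (40, 0.50, 3000),
    "old": (150, 0.40, 2000),
}

def poststratify(estimates, pop_counts):
    """Population-weighted average of cell-level estimates."""
    total = sum(pop_counts.values())
    return sum(estimates[c] * pop_counts[c] for c in estimates) / total

def partially_pool(cells, k=20.0):
    """Shrink each cell mean toward the overall sample mean.

    Puts weight n / (n + k) on the cell's own data, where k plays the role
    of sigma_y^2 / sigma_cell^2 (assumed known here, not estimated).
    """
    n_total = sum(n for n, _, _ in cells.values())
    grand = sum(n * ybar for n, ybar, _ in cells.values()) / n_total
    return {c: (n * ybar + k * grand) / (n + k)
            for c, (n, ybar, _) in cells.items()}

pop = {c: N for c, (_, _, N) in cells.items()}
raw_means = {c: ybar for c, (_, ybar, _) in cells.items()}
n_total = sum(n for n, _, _ in cells.values())

unadjusted = sum(n * ybar for n, ybar, _ in cells.values()) / n_total
raw_ps = poststratify(raw_means, pop)
mrp_like = poststratify(partially_pool(cells), pop)
print(round(unadjusted, 3), round(raw_ps, 3), round(mrp_like, 3))
```

In this toy setup the sample overrepresents the "old" cell, so the unadjusted mean is far from the raw poststratified estimate, and the partially pooled version lands between the two because the small "young" cell is shrunk the most.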

Survey Statistics: should MRP workflow include LOCO-CV ?

Due tomorrow (June 10): Enter a contest for Alexandre Andorra’s interview of Aki, Richard, and Andrew about their new book Bayesian Workflow.

I hope folks ask about evaluating MRP models. We’ve seen:

At Andrew Gelman’s 60-ish Birthday workshop Aki gave a great talk about loo’s 10ish birthday. The loo R package computes approximate leave-one-out (loo) cross-validation. Aki covered a huge range of work across the Bayesian workflow. He said there will soon be a new version of their paper about evaluating MRP models, Kennedy et al. 2024.

Sketch portrait of Andrew Gelman

Kennedy et al. 2024 pivot from the usual individual-level Loss(y_i, yhat_i) to a population-level Loss(E(Y), E(yhat_i)). We don’t have the true E(Y), so they replace it with a classical poststratification estimate (see the post on poststratification). To avoid overfitting, this classical estimate should be calculated on different data than the MRP model itself.

They use leave-one-cell-out (LOCO) cross-validation, a version of leave-one-group-out (LOGO) that we mentioned in “design-based cross validation (dCV)”. In “dCV for MRP ?” we asked if we should be assessing how well the MRP model predicts new groups (e.g. new cells).
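A generic leave-one-cell-out loop might look like the sketch below. This is not the procedure from Kennedy et al. 2024, only the basic splitting logic: the data, the `fit_grand_mean` placeholder model, and the squared-error loss on the held-out cell mean are all made up for illustration.

```python
from statistics import mean

# Hypothetical data: cell label -> observed binary outcomes in that cell.
data = {
    "cell_a": [1, 0, 1, 1],
    "cell_b": [0, 0, 1],
    "cell_c": [1, 1, 1, 0, 1],
}

def fit_grand_mean(training_cells):
    """Placeholder model: predicts the pooled mean for any new cell."""
    pooled = [y for ys in training_cells.values() for y in ys]
    return mean(pooled)

def loco_cv(data, fit):
    """Leave-one-cell-out CV with squared error on the held-out cell mean."""
    losses = {}
    for held_out, ys in data.items():
        # The model is fit without any observations from the held-out cell.
        training = {c: v for c, v in data.items() if c != held_out}
        prediction = fit(training)
        losses[held_out] = (mean(ys) - prediction) ** 2
    return losses

losses = loco_cv(data, fit_grand_mean)
print({c: round(loss, 4) for c, loss in losses.items()})
```

Swapping `fit_grand_mean` for a function that fits an MRP model and predicts the held-out cell from its covariates would turn this into a check of how well the model predicts new cells.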

Should MRP workflow include LOCO-CV ?

When is detecting AI-generated text worthwhile?

This is Jessica. AI-text detectors are coming to play a bigger role in adjudicating what texts are worthy of our attention. There was the surprising case of an apparently AI-generated short story winning the Commonwealth Foundation Short Story Prize; Pangram returns it as 100% AI generated. Pangram is the leading detector, with a false positive rate reported as roughly 1 in 10,000 in its own audits and near zero on medium-to-long passages in an external audit. Applying Pangram to the other 4 stories that won awards this year suggests two others were heavily AI-assisted. More recently, the NeurIPS Position Paper track announced that it was desk rejecting 18% of submitted papers that were detected by Pangram as fully AI-generated. Another 13% are getting followed up on with the authors to investigate AI use. In this case the Call for Papers made clear that submissions should be “substantially written by human authors,” so this should not have come as a surprise.

We’re having to reconsider what authorship means. Can a person create literature or express their position on a subject without writing a single sentence themselves? When do we really care who strung the words together?   

Some people think detection is a waste of our collective time because we will never reach an equilibrium. AI-generated text will keep shifting toward what passes the detector. Human writers will continually update their beliefs about what features are indicative of AI writing, but will also be influenced to write more like AI by reading so much AI text. There’s no stable target, just an endless cat-and-mouse game that incentivizes being savvy enough at any given time to avoid getting flagged. Meanwhile people are being morally scorned and suffering reputational damage for being caught on the wrong side of things. This may disproportionately affect some writers (like non-native English speakers) who are finally seeing the playing field leveled a bit.

On the other hand, there are situations where it really is important to know who strung the words together. Education is the most obvious one. It’s just very hard to teach someone to think if they’re not writing down their ideas themselves. 

The problem is that outside of select scenarios like teaching, what we really tend to care about is who controlled the ideas, and this is not equivalent to who strung the words together. Some would argue that the latter is becoming increasingly irrelevant given that AI can write more fluently than many people and many people prefer AI-generated text. 

Of course the reason we’re seeing detection used to filter paper submissions is because the ideal process–where the content of each paper is carefully considered on its own merits–is increasingly untenable given the huge surge in submissions in some fields. It’s easy to pump out credible-seeming papers with minimal human oversight using AI, and enough people are doing this to create serious problems. 

Mostly my response is that if we are going to debate the value of detection we should be willing to make our assumptions explicit. So let’s walk through a toy model to think about what we’re really conjecturing about.

One way to think of the latent state that we actually care about in paper review is the author type. Let’s say type A authors come up with their ideas and do a lot of the writing themselves. Type B authors rely on AI to do much of the thinking for them, and also use AI to do much of the writing. Type C authors come up with their own ideas, but engage in extensive prompting to get AI to write everything they want to say for them.*

For each paper, we choose to either pass or reject, conditional on the output of a Pangram check. Let’s say we only care about whether it flags 100% AI generated or not, so the signal s is binary, where s=1 means AI detected.

Based on available Pangram audits, if a text is actually written heavily by AI there is a very high chance it flags as AI-generated: beta=P(s=1|AI written) with beta very close to 1. If a text is not written by AI, there is a very small chance it flags as AI-generated: alpha=P(s=1|human written). Pangram’s internal audits put alpha around 10^−4 but other audits find essentially zero false positives for medium-to-long passages. 

So P(s=1|A) = alpha, and if we assume Types B and C use AI to a similar extent for the writing, then beta = P(s=1|B) = P(s=1|C). The posterior probability that a flagged paper is from a Type B author is then:

P(B|s=1) = (beta × p_B)/(alpha × p_A + beta × p_B + beta × p_C), and since alpha is tiny and beta is close to 1, P(B|s=1) ≈ p_B/(p_B + p_C)

The relevant considerations become what we think the author population looks like, and how costly we think a false positive is relative to a false negative.

As a starting point, let’s say that for our conference submissions this year, Type C is the rarest, at 20%, and Type A and Type B equally split the remaining mass at 40% each. Let’s also say that we consider rejecting an acceptable paper, c_FP, to be twice as bad as passing an unacceptable one, c_FN.

The optimal decision rule is to reject if c_FN * P(B|s=1) > c_FP * P(A or C|s=1), or equivalently P(B|s=1) > c_FP/(c_FN + c_FP)

With c_FP=2 and c_FN=1, this means we reject if P(B|s=1) > 2/3.

Under the prevalence assumptions above, P(B|s=1) is approximately 2/3, so we are right on the boundary. From the standpoint of making the right decisions for this particular conference cycle, it’s not obviously bad. But if Type C is a little more common, e.g., we shift a little mass from p_A to p_C to make p_C 0.25, then P(B|s=1) is about 0.62 and we shouldn’t desk reject based only on the flag. Similarly, if we were to decide that falsely rejecting an acceptable paper is three times as bad as passing an unacceptable one, the threshold rises to 3/4 and we shouldn’t rely on the flag alone.
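The arithmetic is easy to rerun under other assumptions. Here is a minimal sketch of the toy model; the function names are made up, alpha is set to the 10^-4 figure mentioned above, and beta = 0.99 is a placeholder for "very close to 1" rather than an audited number.

```python
def posterior_type_b(p_a, p_b, p_c, alpha=1e-4, beta=0.99):
    """P(B | s=1), assuming Types B and C are flagged at the same rate beta."""
    flagged = alpha * p_a + beta * p_b + beta * p_c
    return beta * p_b / flagged

def should_desk_reject(post_b, c_fp=2.0, c_fn=1.0):
    """Reject only if P(B | s=1) exceeds c_FP / (c_FN + c_FP)."""
    return post_b > c_fp / (c_fn + c_fp)

# Prevalence scenarios (p_A, p_B, p_C); both are guesses, not estimates.
scenarios = {
    "baseline": (0.40, 0.40, 0.20),
    "more Type C": (0.35, 0.40, 0.25),
}
for name, (p_a, p_b, p_c) in scenarios.items():
    post = posterior_type_b(p_a, p_b, p_c)
    print(name, round(post, 3), should_desk_reject(post))
```

With any alpha above zero the baseline posterior sits a hair under 2/3, so the strict rule does not reject even there. The point is less the verdict than how little it takes to move it.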

This model is obviously very simple. But it shows us what kinds of things we have to make assumptions about in the most basic case. Obviously I don’t really know how many people are using AI blindly to write papers, nor how many people are relying heavily on AI to write up their own ideas. You should take my numbers with a grain of salt. Personally I can’t imagine how relying on AI to do all the writing when I came up with the ideas would ever feel efficient, because I tend to have strong opinions on how things are said. But I can accept I am probably more of a control freak than many others. And AI overreliance is easy to slip into. Maybe papers chairs from recent ML conferences (or arXiv moderators) have estimates on bad-actor rates based on what they are seeing. 

What this exercise can’t tell us is how scientific progress is impacted by the warping of incentives that can happen when we use AI-detection as a filter. Classic principal-agent problems suggest that when we care about something hard to observe, like scientific quality or long-term epistemic value, but must rely on observable proxy signals to judge authors’ outputs, we should expect authors to shift more effort toward improving exclusively on those proxies. Avoiding em dashes and ‘not this, but this’ constructions and whatever else currently ups the posterior probability of AI-generation is orthogonal to the actual thinking that research requires. What if relying more heavily on AI to write up our ideas is a good idea for science in the long run, in terms of more clearly communicating the ideas or saving a lot of time, so that we can get more good ideas out in the same amount of time? Then too much emphasis on detection might slow us down. However, I’m doubtful we are currently anywhere near a state of the world where discouraging writing with AI is as costly for scientific progress as spending time reviewing and reading many more questionable AI-generated papers is. The bigger threat at the moment is the slop overwhelming our ability to find the good stuff.

*We could also posit Type D authors that get AI to generate the ideas, but then write the papers themselves to evade detection, or are extremely good at getting AI-written text to evade detection. But this seems much less likely so I’m ignoring it.

What is the relation between interactions in a regression model and correlations among the predictors?

I’ve often seen confusion between interactions in a regression model and correlations among the predictors. To keep it simple, consider the model y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + error, and assume the predictors have been signed so that both b1 and b2 are positive. Then b3 represents the interaction. This has nothing to do with the joint distribution of x1 and x2 in the data, or in the population. (For simplicity, assume the data to which the model is being fit are a random sample from the population of interest.)

The interaction depends on the model of y given x1 and x2, while the correlation depends on the model for x1 and x2. These are two completely different parts of the model. And yet, they often seem connected.
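The separation is easiest to see with two binary predictors, where the model is saturated. In the sketch below (all numbers hypothetical), the interaction is a difference of differences in the cell means E(y | x1, x2), while the correlation is computed from the cell probabilities P(x1, x2). Changing one leaves the other untouched.

```python
def interaction(cell_means):
    """b3 in the saturated model for binary x1, x2: a difference of differences."""
    m = cell_means
    return (m[1, 1] - m[1, 0]) - (m[0, 1] - m[0, 0])

def correlation(cell_probs):
    """Correlation of binary x1 and x2, given joint probabilities P(x1, x2)."""
    p = cell_probs
    p1 = p[1, 0] + p[1, 1]          # P(x1 = 1)
    p2 = p[0, 1] + p[1, 1]          # P(x2 = 1)
    cov = p[1, 1] - p1 * p2
    return cov / (p1 * (1 - p1) * p2 * (1 - p2)) ** 0.5

# Hypothetical E(y | x1, x2), keyed by (x1, x2): b0=1, b1=2, b2=3, b3=1.5.
means = {(0, 0): 1.0, (1, 0): 3.0, (0, 1): 4.0, (1, 1): 7.5}

# Two hypothetical joint distributions for the predictors.
independent = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}
correlated = {(0, 0): 0.40, (1, 0): 0.10, (0, 1): 0.10, (1, 1): 0.40}

print(interaction(means))   # does not involve the joint distribution at all
print(correlation(independent), correlation(correlated))
```

This only shows that the two quantities are logically separate. It says nothing about whether they tend to go together in real data, which is the empirical question.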

I have the general impression that I’d be more likely to expect a positive interaction of x1 and x2 when predicting y, if x1 and x2 are positively correlated in the population.

For example, when predicting income from height and sex, being taller and being male both predict higher income, also they interact–the coefficient for height is higher for men than for women–and of course the two predictors, height and male, are positively correlated in the population.

I’m not sure how to think about this connection or even whether it’s a real pattern! But there might be something there so I wanted to share it with you.

The issue of interactions comes up in the context of the concept of intersectionality, which is a form of interaction that comes up in sociology. It started for me with this email from Elin Waring:

I’ve been working on data on intersectionality and retention of students in STEM majors. My little group is specifically looking at data from Lehman College and trying to model graduation with a STEM degree. There are a lot of details, but basically we have come to the conclusion that the right way to describe this is with a discrete time competing risk model (the competing risks being graduation with a STEM degree and graduation with a non-STEM degree). I won’t go into all the details. We have data for between 1 and 20 semesters enrolled for students starting as freshmen. For us, intersectional identity is defined by 5 variables that yield 32 distinct combinations or strata as used in the next articles.

In trying to think about how to account for intersectional identities we came across the “MAIHDA Method.” I was wondering if you had seen this discussion before or have any thoughts about it.

Evans, Clare R., George Leckie, and Juan Merlo. 2020. “Multilevel versus Single-Level Regression for the Analysis of Multilevel Information: The Case of Quantitative Intersectional Analysis.” Social Science & Medicine (1982) 245:112499. doi:10.1016/j.socscimed.2019.112499.

They essentially argue for treating the strata as random effects in a multilevel model, with the individual components of the combinations introduced as fixed effects describing the combinations.

The next article criticizes that approach and argues for fixed effects all around.

Wilkes, Rima, and Aryan Karimi. 2024. “What Does the MAIHDA Method Explain?” Social Science & Medicine 345:116495. doi:10.1016/j.socscimed.2023.116495.

Responded to here:

Evans, Clare R., Luisa N. Borrell, Andrew Bell, Daniel Holman, S. V. Subramanian, and George Leckie. 2024. “Clarifications on the Intersectional MAIHDA Approach: A Conceptual Guide and Response to Wilkes and Karimi (2024).” Social Science & Medicine 350:116898. doi:10.1016/j.socscimed.2024.116898.

I was wondering if you have any thoughts about this? For me, intersectionality as a theoretical approach does mean that it makes sense to look at the strata rather than thinking of the strata as just the most complex level of creating statistical models of the intersection of the variables. But then it seems as though treating this as a random effect more or less undermines its centrality to the theory. And is treating both the strata and the individual characteristics as variables at the same level basically a way to decompose?

In the end, I feel like the pro-MAIHDA people retreat to “we are just descriptive” in a way that isn’t very helpful. That said, they are right that this seems to have some traction in the world of health disparity research.

I replied that I’d never heard of this method before. I couldn’t actually muster the energy to read the above articles, as all this debate seems to be missing the key issues. I don’t really care if something is called a fixed effect or a random effect (see here); my current preferred way of thinking of these problems is by framing as a generative model.

Regarding intersectionality, the natural way I would see it is that this would show up as an interaction term, the idea that the interaction is more than the sum of its parts? For a simple example, if there are 5 binary variables and each has the same effect on its own (which they wouldn’t, this is just a simple hypothetical example), then you could create a variable which is the total number of identities, thus a number from 0 to 5, and “intersectionality” would show up as a super-linear or convex relation between the outcome and this total predictor?
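As a sketch of that simple hypothetical: enumerate the 32 strata, collapse each to its count of “on” settings, and check convexity through second differences of the mean outcome by count. The quadratic outcome function and its coefficients are invented for illustration; with real data the means by count would be estimated.

```python
from itertools import product

# All 2^5 = 32 combinations of five binary identity variables.
strata = list(product([0, 1], repeat=5))

def outcome(total_on, base=0.0, linear=1.0, curve=0.5):
    """Hypothetical mean outcome as a function of the number of "on" settings."""
    return base + linear * total_on + curve * total_on ** 2

totals = sorted({sum(s) for s in strata})          # counts 0 through 5
means_by_total = [outcome(k) for k in totals]

# Second differences: all positive means a convex (super-linear) relation.
second_diffs = [
    means_by_total[k + 1] - 2 * means_by_total[k] + means_by_total[k - 1]
    for k in range(1, len(means_by_total) - 1)
]
is_convex = all(d > 0 for d in second_diffs)
print(totals, second_diffs, is_convex)
```

Setting `curve=0` gives second differences of zero, which is the purely additive case with no intersectionality in this restricted sense.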

Waring responded:

Sure, but the idea you suggested about intersectionality itself isn’t right. You can’t just sum the number of identities, everyone has identities and the idea is that it is not just about concentrated disadvantage of having all or some specific identities. If we have 5 dichotomous identity/group variables everyone has 5 dimensions of identity. Intersectionality is about the idea that something like “white, native born, woman, high income” shapes what happens because of how those come together to shape (in the case of my analysis) whether, as an undergraduate, you persist in STEM fields.

I replied as follows:

Yes, I was actually thinking this when I wrote that! I was imagining that each of the 5 factors has an “off” and “on” setting, and intersectionality kicks in when there are multiple “on” settings, where “on” represents the group that faces more difficulty (nonwhite, non-native born, female, low income, gender nonconformist, etc.). Once you allow arbitrary possibilities for intersectionality, then my simple superadditive model wouldn’t fit. On the other hand, if you were to allow all 32 possibilities to take on any value, then realistically you would not be able to estimate anything much at all: this is the usual problem in sociology of approximating a complex social structure by a simple model that explains most of the variance. For predicting persistence in STEM (or any academic field), one possible factor that could enter in a complicated way is conservative political ideology, in that for many attitudes and behaviors its predictive effect goes in the opposite direction from that of the “on” categories listed above, but grad students, in STEM and other fields, are predominantly politically on the left. I could well imagine that conservative political ideology, like the other “on” categories, is predictive of not persisting in STEM but that this could interact in unexpected ways with those other categories.

From a statistical perspective, my main message is to choose such a model based on its explanatory power and recognizing that it’s an approximation, rather than using methods such as statistical significance or Bayes factors which in different ways are driven by sample size, as we discussed in this 1995 paper.

Another interesting statistical feature of this and similar discussions is that it’s natural for the discussion to go back and forth between the correlation between two predictors in the data (or the population) and the interaction between their predictive effects, as discussed at the top of this post.

I’m not sure if this interaction thing is a general pattern that has some statistical explanation, or just a faulty intuition of mine based on just a couple of special cases. But I have noticed a general confusion that when people talk about interactions, often they seem to be talking about correlation between the predictors.

Survey Statistics: it is (still) the people

A year and a day ago, the Survey Statistics blog series launched with: “it is the people that make survey statistics (and anything) great”. This past weekend, we got to celebrate wonderful people at Andrew Gelman’s 60-ish Birthday workshop.

Artist Sophie Gelman made the below:

Sketch portrait of Andrew Gelman

Yair Ghitza gave a talk about Andrew’s influence on polling. Yair is Chief Scientist at Catalist and coauthor of excellent papers about MRP we’ve cited in this blog series: Ghitza and Gelman 2013 and Ghitza and Gelman 2020. The discussion after his talk included mention of Nate Cohn’s May 18, 2026 NYT article about weighting with “synthetic past vote”.

Let’s use our notation from “is a mismeasured X better than none at all ?” and “more adventures in mismeasured X” (see also “more on recalled vote”):

  • Y = current support
  • X = true 2024 vote, unknown
  • X* = recalled 2024 vote

And notation from “weights and MRP for voters”:

  • V = current registered voter
  • V2024 = record of voting in 2024 (a coarsened version of X that only tells us whether someone voted)

Suppose we want E(Y | V=1), current support among current registered voters. Using MRP, we might want to estimate this via E(E(Y | X, sample, V = 1) | V = 1). But we’ve got at least 2 challenges:

  1. We can’t directly estimate E(Y | X, sample, V = 1) because we only have recalled vote X*.
  2. We need p(X | V=1) for the outer expectation. But past election results give p(X).

Nate’s article proposes:

  1. Create synthetic past vote, X**, which aims to improve recalled vote X*:
    1. Impute X** if X* is missing and there is a record they voted (i.e. V2024 = 1).
    2. Validate: set X** to “nonvoter” if there is no record they voted (i.e. V2024 = 0).
  2. I interpret this to mean they estimate p(X** | V = 1) ? See “weights and MRP for voters” for ideas.

    “Synthetic past vote is weighted to match our estimate for how today’s registered voters …voted in the last election.”
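One reading of step 1, as a sketch: the function below builds X** from X* and V2024. The category labels and the `impute` callable are placeholders, and keeping the recalled vote when it is present and there is a record of voting is an assumption, since the steps above only cover the other two cases.

```python
import random

def synthetic_past_vote(recalled, voted_2024, impute):
    """Build X** from X* (recalled) and V2024 (voted_2024).

    recalled:   X*, e.g. "dem", "rep", "nonvoter", or None if missing
    voted_2024: V2024, True if there is a record of voting in 2024
    impute:     hypothetical callable returning an imputed vote choice
    """
    if not voted_2024:
        return "nonvoter"      # validate: no record of voting
    if recalled is None:
        return impute()        # impute: recall missing, record of voting
    return recalled            # assumption: otherwise keep the recalled vote

rng = random.Random(0)

def impute():
    """Placeholder imputation: a coin flip, standing in for a real model."""
    return rng.choice(["dem", "rep"])

# Hypothetical respondents as (X*, V2024) pairs.
respondents = [("dem", True), (None, True), ("rep", False), (None, False)]
synthetic = [synthetic_past_vote(x, v, impute) for x, v in respondents]
print(synthetic)
```

This leaves open what to do with someone who recalls not voting but has a record of voting. The sketch keeps their recalled answer, which may not be what the article does.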

     

What do you think ?

15 new articles on statistical workflow!

Aki, Richard, Lizzie, and I put together a special issue on Statistical Workflow for the Philosophical Transactions of the Royal Society. I guess “royal” isn’t as impressive as it used to be, but still.

Statistics and data analytics play an increasingly important role in and across science and policy. But much of what is done by the best practitioners–their “workflow”–is tacit knowledge only glanced over in textbooks and research articles. In this new collection covering a wide range of disciplines, leading statisticians and researchers discuss the motivations and details for their workflows.

The four of us did this project because we were all interested in Bayesian workflow, and we wanted to learn more about statistical workflow in general, not just the Bayesian part.

Here’s what’s in the issue:

  • Statistical workflow, by Andrew Gelman, Aki Vehtari & Richard McElreath
  • Unsupervised machine learning for scientific discovery: workflow and best practices, by Andersen Chang, Tiffany M Tang, Tarek M Zikry & Geneva I Allen
  • PCS workflow for veridical data science in the age of AI, by Zachary T Rewolinski & Bin Yu
  • Simulations in statistical workflows, by Paul-Christian Bürkner, Marvin Schmitt & Stefan T Radev
  • An automatic finite-sample robustness metric: when can dropping a little data change conclusions? Part I: definitions and experiments, by Ryan Giordano, Rachael Meager & Tamara Broderick
  • An automatic finite-sample robustness metric: when can dropping a little data change conclusions? Part II: theory and intuition, by Ryan Giordano, Rachael Meager & Tamara Broderick
  • Building a Backdrop of Meaning in Magnitude (BoMM) as part of research workflow, by Megan Dailey Higgs
  • A preliminary data analysis workflow for meta-analysis of dependent effect sizes, by Elizabeth Tipton, James Pustejovsky & Jingru Zhang
  • A four-step simulation-based workflow for ecological analysis and science, by EM Wolkovich, T Jonathan Davies, William D Pearse & Michael Betancourt
  • Scientific workflow in experimental economics, by Anna Dreber & Séverine Toussaert
  • Hidden processes of workflow in cognitive developmental psychology, by Lauren N. Girouard & Susan A. Gelman
  • Reproducible workflow for online AI in digital health, by Susobhan Ghosh et al.
  • Model checks for Bayesian estimation and forecasting of health coverage indicators in low- and middle-income countries, by Leontine Alkema et al.
  • Closing the gap between statistical and scientific workflows for improved forecasts in ecology, by Victor Van der Meersch, James Regetz, T Jonathan Davies & EM Wolkovich
  • Machine learning workflows in climate modeling: design patterns and insights from case studies, by Tian Zheng et al.

Lots of good stuff here, and lots of different perspectives. Thanks to all the authors. The issue is here, and all the papers should be freely available.

If you have any thoughts on the articles in the volume, or on any other statistical workflow topics, just let us know right here in the comments box.

What if scientists really were dispassionate observers, communicating ideas without irrational commitment? Look here, says AI.

This is Jessica. We often idealize science as proceeding primarily by the scientific method, where scientists approach the objects of their investigation with a healthy dose of detachment and neutrality, become convinced only when the evidence is there, and remain open to changing their minds if new evidence becomes available. But in reality we see examples of authors becoming personally attached to their ideas despite the data, slipping into advocacy and becoming defensive or going into denial mode when presented with clear evidence they were wrong.

The seemingly irrational attachment to the ideas or findings can seem easy to dismiss as a bad thing. Yet there are also times when having some level of personal commitment makes one more effective at certain roles that scientists must play. For example, being too transparent about our own uncertainty is not always effective when presenting research to others, because the audience can become distracted and stop listening entirely, even if you have some useful insight to convey. My question is, What does our ability to now use AI to generate implementations, presentations, and even the ideas we work on themselves add to the mix?

I got a glimpse of this recently. May ended up being workshop month for me, with at least one each week. I saw a lot of presentations. A couple of these showed me something I hadn’t yet seen, at least outside of student presentations: talks composed of obviously AI-generated slides. If you’ve tried to use the non-design-optimized versions of models like GPT or Claude to create slides yourself, you will know what I mean. Almost every slide has content organized in a grid. There’s too much text: full sentences or nearly so in multiple places, headers and footers, and stylized phrasing everywhere, like “principal design levers” and “load-bearing assumptions” and “actionable pathways”.

These were not presentations by overwhelmed junior faculty or researchers I’d never heard of. They were by prominent researchers who are respected in their fields. 

Needless to say, they were not very effective talks. The slides tended to have too much going on to parse in time, with way too much text. The vague phrasing was distracting, making me wonder what exactly the presenter meant by terms like “governing frictions” or “strategic bottlenecks” and whether they write like that in their papers too. Part of the problem is that the presenter tends to use their own language as they present, rather than reinforcing what’s on the slide, so you have two competing streams of information that feel like they’re from two distinct viewpoints, one which is quite confident and willing to summarize and even exaggerate, the other more reserved. 

It makes sense that you’re more likely to hold at a distance what you didn’t come up with yourself, subconsciously at least, even if you think you’re selling it. In one case, the speaker also described how some of the results themselves were discovered by AI, which probably further contributed to the impression that they hadn’t fully committed to what they were presenting.

This has me wondering what the impact on diffusion of ideas will be as it becomes more standard practice to rely on AI for implementation in scientific production and communication. It’s funny how reserving skepticism for your own results often comes up in discussing epistemic virtues, but when speakers present as if holding their work at arm’s length, the result is not so informative. As we rely more heavily on AI in all stages of research, will we face more challenges in getting others to adopt our ideas? 

It’s also another reminder of how few people thinking about AI for science seem to have considered all the personal stuff that goes into the practice of science, with lots of irrational investment and fixation and stubbornness and pride to drive the loop of discovery and validation and communication. Scientific discovery may be an “ocean,” to borrow an analogy associated with Leibniz, but surfing it requires strapping oneself to a board and committing to seeing where it gets you, not just keeping it in sight while you splash around somewhere else. 

This also leads to a practical question of how you instill a sense of ownership, or at least commitment, to ideas that were partly produced by AI. My own experience is that it takes a lot of time to verify AI-produced results before I get to the level of confidence I’d have if I’d done it myself. For complex tasks there will inevitably be decisions made along the way, e.g., about how to parameterize certain things in implementation or to deal with edge cases or other exceptions. Each of these has to be reconstructed before I can really feel that I stand behind the output.

Is there an alternative? It makes me think of the “baking guilt” that housewives supposedly felt after cake mixes came on the market, because they only required adding water. There was a loss of a sense of personal contribution and emotional ownership. The solution, which persists today, was to have them add an egg. Some psychoanalysts went so far as to interpret this as symbolic of their fertility. For AI-aided science, the closest thing to adding an egg seems to be having agents explain at length to you what was done, which can still mean a big improvement over implementing everything yourself, but not as much of a boost as it first seems.

At any rate, interpreting the new challenges of AI-generated presentations of potentially AI-generated ideas as an aesthetic problem, or of “putting style before substance,” does not seem right. Scientific ideas don’t diffuse as bare propositions. They diffuse through people who have developed some passion for them. If we’re talking about AI for science, we shouldn’t be ignoring scientists and their relationships with what they do.  

Statistical analysis recapitulates the development of statistical methods

We ran this a few years ago but it remains interesting so I’m reposting:

There’s an old saying in biology that the development of the organism recapitulates the development of the species: thus in utero each of us starts as a single-celled creature and then develops into an embryo that successively looks like a simple organism, then like a fish, an amphibian, etc., until we reach our human form in preparation for birth.

Modern biologists don’t believe in this recapitulation. But taking this as an intriguing idea, I see an analogy with statistical practice.

Some version of this recapitulation occurs just about whenever we do applied statistics. We start with the simplest methods–univariate data summaries and some basic multivariate analyses–then we perform some comparisons which we check via standard errors and off-the-shelf hypothesis tests, then we move to modeling. We might well start with least squares and maximum likelihood and then move to regularization and multilevel modeling as needed, then throw in measurement error models, selection models, nonparametric this and that, and so forth.

The analogy isn’t perfect–in particular, we don’t always begin an analysis with simple averages and plots; sometimes we begin with a sophisticated nonparametric data-exploration tool such as lowess or deep nets. And, lots of methods for graphical exploratory data analysis have only been developed recently; indeed, even methods as basic as scatterplots are only a few centuries old.

Within the context of modeling, though, it does seem to me that we tend to start simple and then add more complicated features one at a time–and this seems like a sensible way to proceed. In so proceeding, we’re motivated in part by computational stability but also in part by the logic of increasing complexity: we take each step for a reason. Thus it is logical that statistical analysis recapitulates the development of statistical methods.

Survey Statistics: double-plus robustness

Meng (2022) pops up a lot here: “it is the people” (the launch of this blog series a year ago !), “probability samples vs epsem samples vs SRS samples”, “divine probabilities”, and last week’s “GREG”. Like a lot of Meng’s papers, it deserves several rereads.

(The polar bear celebrated the blog series birthday with a rainy hike on the PA AT. Here he is attempting to dry off.)

Let’s zoom in on the part about the Generalized REGression estimator (that doesn’t specifically say “GREG”). Green annotations are mine:

Meng (2022)’s (5.2) is the first way of writing GREG in our post “GREG”, from Särndal, Swensson, Wretman (1992):

That book goes on to say that GREG often takes a super simple form:

Meng (2022) doesn’t mention this as far as I can tell ? Although I think Meng’s example satisfies the conditions the book Särndal, Swensson, Wretman (1992) goes on to describe: the regression model assumes constant variance and has an intercept.
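As a rough illustration of why those two conditions matter, here is a small Python sketch. It assumes the simple form in question is the population sum of fitted values, and that the coefficients come from design-weighted least squares. The simulated population, the Poisson sampling design, and all variable names are made up for the example. With an intercept and constant working variance, the first normal equation forces the Horvitz-Thompson total of the residuals to zero, so the correction term drops out.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of N units; x is known for every unit.
N = 500
x = rng.uniform(0, 10, size=N)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=N)
X_pop = np.column_stack([np.ones(N), x])  # intercept plus one predictor

# Poisson sampling with unequal inclusion probabilities (illustrative choice).
pi = 0.1 + 0.03 * x
s = rng.uniform(size=N) < pi
X_s, y_s, pi_s = X_pop[s], y[s], pi[s]

# Constant working variance: weighted least squares with weights 1 / pi_k.
XtW = X_s.T / pi_s
B_hat = np.linalg.solve(XtW @ X_s, XtW @ y_s)

# Horvitz-Thompson estimate of the residual total.
# The intercept's normal equation makes this zero up to rounding error.
ht_residual_total = ((y_s - X_s @ B_hat) / pi_s).sum()

# Full GREG (predictions plus residual correction) vs. predictions alone.
greg_full = (X_pop @ B_hat).sum() + ht_residual_total
greg_simple = (X_pop @ B_hat).sum()

print(ht_residual_total, greg_full, greg_simple)
```

If the working variance were not constant (or, more generally, not a linear combination of the predictors), the residual term would generally not vanish and the two quantities would differ.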

Anyways, back to the title of this post. Meng emphasizes that GREG is not only “double robust” (consistent if either the outcome model or the response model is correct), but “double-plus robust” (consistent if what is left of the outcome model and response model are uncorrelated). I’m interested in the practical implications of this, such as the suggestion to include the estimated response probabilities in the outcome regression model. Thoughts ?

Survey Statistics: GREG

I just got to chat with Andrew and some of the authors of the MrPlew paper: Ryan Giordano, Erin Hartman, and Avi Feller. Lots more I have to digest here ! The paper came out while the polar bear and I were crossing from TN into VA.

We talked about using a model for response R, a model for outcome Y, or both. So GREG came up, and Andrew asked “what’s GREG ?” Good question.

GREG is Generalized REGression estimator. Särndal, Swensson, Wretman (1992) has a nice section that writes it in a few alternative ways:

1. Adjust an estimate based on the model with a Horvitz-Thompson estimate of the error:

2. Or on the flip side, you can see it as adjusting the Horvitz-Thompson estimate with the model:
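The two ways of writing it are algebraically the same estimator, which a short Python sketch can check numerically. This is only an illustration under assumptions: a linear working model with fitted values x_k' B_hat, known inclusion probabilities pi_k, and a simulated population. The function names, the sampling design, and the heteroskedastic working variance are hypothetical choices, not taken from the book.

```python
import numpy as np

def weighted_ls(X_s, y_s, w):
    """Weighted least squares coefficients with observation weights w."""
    XtW = X_s.T * w
    return np.linalg.solve(XtW @ X_s, XtW @ y_s)

def greg_model_plus_ht_residuals(X_pop, X_s, y_s, pi_s, B_hat):
    """Way 1: population sum of model predictions plus the HT estimate of the error."""
    return (X_pop @ B_hat).sum() + ((y_s - X_s @ B_hat) / pi_s).sum()

def greg_ht_plus_model_adjustment(X_pop, X_s, y_s, pi_s, B_hat):
    """Way 2: HT estimate of the y total, adjusted using the model and the x totals."""
    t_x = X_pop.sum(axis=0)                      # known population totals of x
    t_x_ht = (X_s / pi_s[:, None]).sum(axis=0)   # HT estimates of those totals
    t_y_ht = (y_s / pi_s).sum()                  # HT estimate of the y total
    return t_y_ht + (t_x - t_x_ht) @ B_hat

rng = np.random.default_rng(0)

# Hypothetical population with one auxiliary variable known for all units.
N = 1000
x = rng.uniform(1, 10, size=N)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=N)
X_pop = np.column_stack([np.ones(N), x])

# Poisson sampling with unequal inclusion probabilities (illustrative choice).
pi = 0.05 + 0.02 * x
s = rng.uniform(size=N) < pi
X_s, y_s, pi_s = X_pop[s], y[s], pi[s]

# Working variance proportional to x^2, so the residual term is not forced to zero.
B_hat = weighted_ls(X_s, y_s, 1.0 / (pi_s * X_s[:, 1] ** 2))

way_1 = greg_model_plus_ht_residuals(X_pop, X_s, y_s, pi_s, B_hat)
way_2 = greg_ht_plus_model_adjustment(X_pop, X_s, y_s, pi_s, B_hat)
print(way_1, way_2, y.sum())
```

The agreement between `way_1` and `way_2` does not depend on how `B_hat` was estimated; it is just a rearrangement of the same terms.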

It’s called GREG for Generalized REGression estimator, but what is being generalized ?

Lumley 2010 made me think we were generalizing to continuous X variables:

Sharon Lohr’s book made me think we were generalizing beyond simple random samples:

Sampling Design and Analysis: Third Edition — Sharon Lohr

Särndal, Swensson, Wretman (1992) made me think we were generalizing to multiple X variables:

Model Assisted Survey Sampling (Springer Series in Statistics), by Carl-Erik Särndal, Bengt Swensson, and Jan Wretman

Regardless of the exact origin of the name, GREG has connections to the Doubly Robust literature in causal inference (as Coston et al. (2020) note in a footnote). Any favorite references making these connections ?

MrPlew: Locally Equivalent Weights for Multilevel Regression and Poststratification

Ryan Giordano, Alice Cima, Jared Murray, Erin Hartman, and Avi Feller write:

Multilevel regression and poststratification (MrP) has become a workhorse method for estimating population quantities from non-probability surveys, and is the primary model-based alternative to traditional survey calibration weighting methods, such as raking. For simple linear regression models, MrP methods admit “equivalent weights”, allowing for direct comparisons between MrP and traditional calibration weighting. Such weights, however, have been unavailable for the most widely used MrP models, such as logistic regression. In this paper, we develop a natural generalization, “MrP locally equivalent weights” (MrPlew), which represent MrP as a weighting-style estimator that is locally equivalent to calibration weights near the observed responses.

Cool! This goes beyond my 2007 paper, Struggles with survey weighting and regression modeling (“for logistic regression, the poststratified estimate is no longer a weighted average of the data, even after controlling for the variance parameters in the model. However, we suspect that the model could be linearized, yielding approximate weights”) and our 2004 paper on dilution assays, in particular Section 5.2, “Equivalent weights for nonlinear models.” The funny thing is that I forgot about that 2004 paper when working on equivalent weights for MRP in the 2007 paper. Also, the 2004 method won’t work as is, because it’s designed to estimate sensitivity to individual data points, not to produce good weighted averages.

I say this not to try to claim credit for the method of Giordano et al., but rather the opposite, to emphasize that even though I’ve been thinking about equivalent weights in MRP for a long time, I haven’t yet succeeded in getting them to work in practice, so I’m very happy to see developments in this area.

One thing that came up with equivalent weights when we tried to apply them in practice is that sometimes the weights can be negative.

Negative weights can sometimes make statistical sense. The idea is that, depending on how the data line up in the regression model, sometimes if you pull one data point upward, it will cause the slope of the fitted line to change in such a way as to reduce the predicted mean value. This doesn’t sound right at first, but it can easily occur with poststratification when the population distribution of the predictors differs from the sample. Even if the negative weights can make sense in the estimation context, it still would seem kind of awkward to pass them along to the user.
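To see how this can happen, here is a toy Python sketch. It uses plain least squares rather than a multilevel model, and the five data points and the population mean of the predictor are invented for illustration. For a linear regression, the poststratified estimate of the population mean is the fitted line evaluated at the population mean of x, which can be written as a weighted sum of the sample outcomes.

```python
import numpy as np

# Toy sample: predictor x and outcome y (made-up numbers).
x_sample = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_sample = np.array([1.0, 2.0, 2.5, 4.0, 4.5])

# Hypothetical population mean of x, well above the sample mean of 2.0.
x_pop_mean = 3.5

X = np.column_stack([np.ones_like(x_sample), x_sample])
target = np.array([1.0, x_pop_mean])

# estimate = target' (X'X)^{-1} X' y = w' y, so the equivalent weights are:
w = X @ np.linalg.solve(X.T @ X, target)

# Same estimate computed the usual way, from the fitted coefficients.
beta_hat = np.linalg.lstsq(X, y_sample, rcond=None)[0]
estimate_from_fit = target @ beta_hat
estimate_from_weights = w @ y_sample

# w is approximately [-0.1, 0.05, 0.2, 0.35, 0.5]: the weights sum to 1,
# but the point at x = 0 gets a negative weight.
print(w, estimate_from_fit, estimate_from_weights)
```

Raising the outcome at x = 0 flattens the fitted line, which lowers the prediction at x = 3.5, hence the negative weight for that point.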

The other thing that’s tricky is: What are the weights going to be used for? In the 2007 paper, the equivalent weights are set up to get the right answer for the estimate of the population mean, but presumably they’d be used for large subgroups too (for example, the average among men or women in the population). For more complicated estimates such as arise in small-area estimation or regression, you might want to use MRPW. Which is fine, but whatever it would take to get good weights for one of these purposes might not work best for the others.

Still, I remain interested in MRP locally equivalent weights of some sort, for two reasons:

1. We’re often doing MRP (or, more generally, RPP) anyway, so why not provide weights for other users of the survey that we’re analyzing?

2. Sometimes we’re called upon to provide weights for a public-facing survey, and the way we end up doing this is through an awkward and unsatisfying sequence of adjustment and smoothing steps (the “struggles” in “Struggles with survey weighting and regression modeling”). If we can do this using modeling and MRP, that could be a much more effective workflow, providing weights that are more stable and yield more accurate estimates of population quantities while also being more scientifically defensible and requiring fewer arbitrary choices.

Model-based weights will depend on some set of predictors X, variables that are observed in the sample and in the population (or, as necessary or appropriate, estimated from the population). One funny thing is that the weights will be mathematically a function of X, but the function itself will depend not just on sampling design, and not just on the distributions of X in the sample and population, but also on the outcome y that is being modeled. Different outcome variables will yield different sets of weights. At first this might seem disturbing, but upon reflection I think this dependence is a good thing. When it comes to weighting, the relative importance of the different variables in X will indeed depend on the outcome. Different variables are important for predicting public health risk factors than for predicting how you will vote. That said, if you want some sort of omnibus weights, which you probably will want for a public survey, you can compute equivalent weights for each of a battery of outcomes and then average these weights to get a single set. That seems reasonable enough.

OK, back to Giordano et al., who continue:

This enables a suite of standard weighting diagnostics, including frequentist sampling variability, covariate balance, and subgroup contribution. We formally justify the use of MrPlew in these cases: we prove the MrPlew-based variance estimator is asymptotically equivalent to the infinitesimal jackknife for common exponential family models, and we introduce a novel class of model checks based on invariance to data perturbations that generalize covariate balance and subgroup contribution to nonlinear models. We further show that MrPlew can be computed easily using existing MCMC samples and provide open-source software to compute MrPlew using the output of standard software. We illustrate our approach for several canonical studies that use MrP, including via a logistic regression outcome model, showing that implied covariate balance can sometimes be worse for MrP than for raking. Given the ease of computing, we recommend making MrPlew a standard part of the MrP model interrogation workflow.

It makes sense that implied covariate balance can sometimes be worse for MRP than for raking. MRP is a smoothed version of raking, and unsmoothed raking can overfit. Or, in practice, you might rake on fewer variables so as to avoid overfitting. Multilevel regression gives you the freedom to include more predictors and interactions, secure in the understanding that the model will smooth the estimate and there will be less possibility for overfitting. In short, multilevel modeling–or, more generally, regularization–is a sort of safety net that can give us the security to construct better models, in the same way that a social safety net can give people the security to try new jobs, or for that matter in the same way that an actual safety net can give acrobats the security to perform more elaborate routines.

Where I want to go next is to be able to use these methods to construct weights for public surveys. I’m still not sure about all the steps that will take us there, but I continue to think it’s possible.

The new Giordano et al. paper is thoughtful and readable as well as having lots of math, statistical modeling, and real-data examples. I recommend you read it.

Jonah’s seminar tomorrow: “Bayesian Workflow and the Software That Shapes It”

This is Leo. Jonah Gabry (Stan developer, Andrew’s collaborator, etc.) is spending the whole month of May as a visiting professor here with us at the University of Trieste in Italy. Tomorrow, May 19th, in the De Finetti room at the University of Trieste, at 9 am NYC time (GMT-4), Jonah will give the following talk:

“Bayesian Workflow and the Software That Shapes It”

based on the upcoming book: “Bayesian Workflow”.

For anyone local, you are welcome to come in person. Anyone else can join on Microsoft Teams (available here).