The interactions paradox in statistics

A colleague was telling me that people are running lots of A/B tests at the company she works at, and people are interested in interactions. If they start grabbing things that are large and conventionally “statistically significant,” they see all sorts of random things. But when she fits a multilevel model, everything goes away. Also, the interaction models she fits look worse under cross-validation than simple additive, no-interaction models. What to do?

My quick guess is that what’s going on is that the standard errors on these interactions are large and so any reasonable estimate will do tons of partial pooling. Here’s a simple example.

Regarding what to do . . . I think you need some theory or at least some prior expectations. If the client has some idea ahead of time of what might work, even something as simple as pre-scoring each interaction from -5 (expect a large negative interaction), through 0 (no expectation of any effect), through +5 (expect something large and positive), you could use this as a predictor in the multilevel model, and if something’s there, you might learn something.
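To make the partial-pooling point concrete, here is a minimal R sketch with made-up numbers (the pre-scores, effect sizes, and the "known" between-interaction sd tau are all hypothetical; this is an illustration of the idea, not a real analysis):

# 50 hypothetical interactions, each with a noisy estimate and a pre-registered
# prior score from -5 to +5 (all numbers made up for illustration).
set.seed(1)
K <- 50
prior_score <- sample(-5:5, K, replace = TRUE)
theta <- 0.02 * prior_score          # true interactions are small
se <- rep(0.15, K)                   # large standard errors, as in the story
theta_hat <- rnorm(K, theta, se)

# Partial pooling toward a regression on the pre-scores, treating the
# between-interaction sd tau as known for simplicity.
tau <- 0.05
mu <- fitted(lm(theta_hat ~ prior_score))    # crude group-level line
theta_pooled <- (theta_hat/se^2 + mu/tau^2) / (1/se^2 + 1/tau^2)

round(range(theta_hat), 2)      # raw estimates are all over the place
round(range(theta_pooled), 2)   # pooled estimates hug the group-level line

With standard errors this large, the precision weighting pulls each noisy estimate most of the way toward the fitted line, which is the "everything goes away" behavior described above.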

This sort of thing comes up all the time in applications. On one hand, interactions are important; on the other hand, their estimates are noisy. The solution has to be some mix of (a) prior information (as in my above example) and (b) acceptance of uncertainty (not expecting or demanding near-certainty in order to make a decision).

A quick google search turned up this article from 2016 by Kelvyn Jones, Ron Johnston, and David Manley, who try to do this sort of thing. I haven’t read the article in detail but it could be worth looking at.

A quick simulation to demonstrate the wild variability of p-values

Just did this one in class today. Suppose you run 10 independent experiments, each has the same true effect size and each has the same standard error. Further assume that the experiments are pretty good, and the true effect size is 2 standard errors from zero.

What will the p-values look like?

Here’s a simulation:

set.seed(123)
J <- 10                  # number of experiments
theta <- rep(0.2, J)     # true effect, 2 standard errors away from zero
se <- rep(0.1, J)        # standard error of each estimate
theta_hat <- rnorm(J, theta, se)                  # simulated estimates
p_value <- 2*(1 - pnorm(abs(theta_hat), 0, se))   # two-sided p-values
print(round(p_value, 3))

And here's what we get:

0.150 0.077 0.000 0.038 0.033 0.000 0.014 0.462 0.189 0.120

Now imagine this was real life. A couple of experiments are consistent with pure noise, a few of them seem to show weak evidence against the null hypothesis, and a few are majorly statistically significant. It would be natural to try to categorize them in some way. Yeah, the difference between "significant" and "not significant" is not itself statistically significant, but a p-value of 0.46 in one case and 0.0002 in another . . . surely this must be notable, right? No.

OK, this is an extreme case, in that there is no underlying variation, and indeed if you fit a multilevel model you could see the lack of evidence for underlying variation.
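For what it's worth, here is a quick classical check along those lines, using the simulated estimates from above (not a full multilevel fit, just comparing the spread of the estimates to their standard errors):

# Homogeneity check: is the spread of the theta_hat's larger than what the
# standard errors alone would produce?
Q <- sum((theta_hat - weighted.mean(theta_hat, 1/se^2))^2 / se^2)
Q                          # observed heterogeneity statistic
qchisq(0.95, df = J - 1)   # reference value under no underlying variation

Here Q comes out around 8, well under the reference value of about 17: no sign of variation beyond what the standard errors alone would produce, which is exactly right, since there is none.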

The larger points are:

1. The p-value is a statement relative to the null hypothesis of no effect. It doesn't have much of an interpretation relative to a real, nonzero effect.

2. The p-value is super-noisy. It's a weird nonlinear transformation of the z-score (which does have a pretty clear interpretation) with all kinds of nonintuitive behavior; see the quick check below.

And:

3. You can learn a lot from a simulation experiment. Even something really simple like this one!
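To illustrate point 2, here is the nonintuitive behavior in a couple of lines of R, just the two-sided p-value formula applied to a grid of z-scores:

# Small changes in the z-score correspond to huge changes in the p-value.
z <- c(1, 1.5, 2, 2.5, 3, 3.5)
p <- 2*(1 - pnorm(z))
round(data.frame(z, p), 4)

Going from z = 2 to z = 3 (well within the noise level of a single replication) moves the p-value from about 0.05 to about 0.003.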

P.S. Bayesian inferences are highly variable too. Any data summary will be variable! The problem here is not so much with p-values as with how they are misinterpreted (point 1 above) and how they are taken to be a strong statement about reality (point 2 above), rather than a noisy summary of a particular experiment. If you misinterpret and overinterpret Bayesian inferences---for example, by fitting a model with a flat prior, taking the posterior probability that the parameter is greater than zero, and then making a decision based on whether that probability exceeds some threshold---then you're just as bad off.

Remember, the problems with p-values are not just with p-values

P.P.S. There was some confusion about the simulation in comments, so I'll explain some more.

The above is not a simulation of raw data. I could’ve simulated raw data, but it was simpler to just simulate the parameter estimates and cut out the middleman. Instead of simulating y_1,…,y_n ~ normal(theta_1, sigma_1) and z_1,…,z_n ~ normal(theta_2, sigma_2), I simply note that theta_hat = y_bar – z_bar has mean theta = theta_1 – theta_2 and standard error s = sqrt(sigma_1^2/n + sigma_2^2/n), I assume n is large enough that we can treat s as known, and I just directly simulate theta_hat and s. I did a simulation where the true effect is 2 standard errors away from 0, which is realistic for many studies. In medical research, they’re supposed to be powered so the true effect is 2.8 standard errors from 0 (for explanation of that 2.8, see for example chapter 16 of Regression and Other Stories); realistically, though, I think the assumptions made in such designs are overly optimistic.
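For anyone who wants to see it, here is what the raw-data version could look like for a single experiment, with a hypothetical n and sigmas chosen so the standard error comes out near 0.1; the point is just that it reduces to the same thing as simulating theta_hat directly:

# Raw-data version of one simulated experiment (hypothetical n and sigmas).
set.seed(456)
n <- 100
theta_1 <- 0.5; theta_2 <- 0.3        # so theta = theta_1 - theta_2 = 0.2
sigma_1 <- 0.7; sigma_2 <- 0.7        # gives a standard error close to 0.1
y <- rnorm(n, theta_1, sigma_1)
z <- rnorm(n, theta_2, sigma_2)
theta_hat_raw <- mean(y) - mean(z)
s <- sqrt(sigma_1^2/n + sigma_2^2/n)
p_value_raw <- 2*(1 - pnorm(abs(theta_hat_raw), 0, s))
round(c(theta_hat = theta_hat_raw, se = s, p = p_value_raw), 3)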

My colleagues and I have also done empirical work on this topic: see A new look at p-values for randomized clinical trials (with Erik van Zwet, Sander Greenland, Guido Imbens, Simon Schwab, and Steven N. Goodman) and A proposal for informative default priors scaled by the standard error of estimates (with Erik van Zwet).

“Tough choices in election forecasting: All the things that can go wrong” (my webinar this Friday 11am with the Washington Statistical Society)

Friday, October 11th, 2024, 11:00 A.M. – 12:00 P.M. Eastern Time, Remote Only
The seminar talk is live-streamed. Please register to receive the link.

Tough choices in election forecasting: All the things that can go wrong

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Different election forecasts use different models but yield similar predictions, which is no surprise as they’re using the same information: state polls, national polls, and what we call the “fundamentals”: economic conditions, presidential popularity, incumbency, and the relative positions of the states in previous elections. And when the forecast is close to 50/50, which it is at the time of this writing, it’s hard to be called out. Back in the day, it was considered impressive for a forecast to predict 45 or more states correctly, but now everyone knows which are the key swing states, and it’s not hard to come up with a reasonable state-by-state forecast with historically-calibrated uncertainty. Nonetheless, there are lots of ways a forecast can go wrong, and lots of choices involved in modeling, data inclusion, testing and validation of the forecasting procedure, and communication of results. We discuss this in the context of our efforts in election forecasts from 1992 to the present day.

At the beginning of the session, before the talk, I’ll be interviewed by Edward Wu, who has a Ph.D. in statistics and is currently the Data, Polling, and Election Analytics Senior Producer at CNN.

I was told that CNN doesn’t allow questions to be prepared in advance, which is actually fine with me because I prefer to be surprised!

If you in the audience want to be prepared, I recommend my recent article, Grappling with uncertainty in forecasting the 2024 U.S. presidential election (with Ben Goodrich and Geonhee Han), my article from 2020, Information, incentives, and goals in election forecasts (with Jessica Hullman, Chris Wlezien, and Elliott Morris), and my article from 1993, Why are American Presidential election campaign polls so variable when votes are so predictable? (with Gary King).

Modeling Weights to Generalize (my talk this Wed noon at the Columbia University statistics department)

In the student-organized seminar series, Wed 11 Sep 2024 noon in room 1025 (not room 903) Social Work Bldg:

Modeling Weights to Generalize

A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights came from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest. For background, see here: http://www.stat.columbia.edu/~gelman/research/unpublished/weight_regression.pdf

Research!

Should you always include a varying slope for the lower-level variable involved in a cross-level interaction?

Sociology student Josh Aleksanyan writes:

I’ve also been working with a dataset on addiction treatment admissions and I have a question that I can’t really figure out. I figured I should go multilevel since I am interested in how broader social policy contexts influence the relationship between a criminal justice referral and treatment outcomes (i.e., length of treatment stay). When building a multilevel model that explicitly includes an interaction between two levels (in my case, a criminal justice referral and rescaled state-level probation and incarceration rates), is it recommended we always allow the individual-level predictor to vary? I know your recommendation is to build many models and always let coefficients vary (up to a point) but I can’t figure out if this is in fact a “rule” if we have a cross-level interaction. This article suggests so. In any case, in my case, adding a varying slope for the level 1 predictor reduces residuals outside of the error bounds (so I guess I partly answered my own question) but I’m wondering (going forward in life) if this is something we have to do because of the cross-level interaction.

Also, since I have your attention, if I expand my data beyond a single year, would you explicitly model year as its own varying intercept (as opposed to adding year dummies)?

The linked article, by Jan Paul Heisig and Merlin Schaeffer, is helpfully titled, “Why You Should Always Include a Random Slope for the Lower-Level Variable Involved in a Cross-Level Interaction.”

What do I think about this recommendation?

It’s complicated. As Aleksanyan notes, my general advice is to include everything. But in real life we can’t include everything, even if the data are available. As we say in “parent talk,” you have to pick your battles. A quick answer is that you always are including these interactions: even when you’re not including them, you’re including them with a coefficient of zero. That seems silly, but the larger point is that it’s not just a question of whether you’re including an effect or interaction, or even how you’re modeling it, so much as how you’re fitting it. If you set a coefficient to zero, that’s complete pooling, or we could say complete regularization. Interactions can be noisy to estimate, so least-squares or even hierarchical models with flat priors on the hyperparameters won’t necessarily do the job.

Anyway, my general advice is to include these varying slopes, at least conceptually. So even if you fit the model without that component of variation, you should recognize that it’s there.
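In lme4-style syntax, the two versions of the model look something like this (length_of_stay, cj_referral, state_rate, state, year, and dat are placeholders, not the actual variables or data):

# Cross-level interaction with a varying slope for the individual-level predictor:
library(lme4)
fit_varying <- lmer(length_of_stay ~ cj_referral * state_rate +
                      (1 + cj_referral | state) + (1 | year), data = dat)

# Dropping the varying slope amounts to fixing its between-state sd to zero:
fit_fixed <- lmer(length_of_stay ~ cj_referral * state_rate +
                    (1 | state) + (1 | year), data = dat)

The second formula is the complete-pooling version of the slope: the same model with its between-state variation set to zero. The (1 | year) term is one simple way to handle multiple years of data, per the second question above.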

“My basic question is do we really need data to be analysed by both methods?”

Ram Bajpai writes:

I’m an early career researcher in medical statistics with a keen interest in meta-analysis (including Bayesian meta-analysis) and prognostic modeling. I’m conducting a methodological systematic review of Bayesian meta-analysis in biomedical research. After reading these studies, I see that many authors present both Bayesian and classical results together, compare them, and usually say both methods provide similar results (trying to validate). However, being a statistician, I don’t see any point of analysing data from both techniques as these are two different philosophies and either one is sufficient if well planned and executed. Consider me no Bayesian expert, I seek your guidance on this issue. My basic question is do we really need data to be analysed by both methods?

My quick answer is that I think more in terms of methods than philosophies. Often a classical method is interpretable as a Bayesian method with a certain prior. This sort of insight can be useful. From the other direction, the frequency properties of a Bayesian method can be evaluated as if it were a classical procedure.
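A standard example, in case it helps: ridge regression (penalized least squares) gives exactly the posterior mode under a normal prior on the coefficients, with penalty lambda = sigma^2/tau^2. A quick sketch with simulated data:

# Ridge regression as a Bayesian posterior mode/mean under a normal prior.
set.seed(99)
n <- 200; k <- 5
X <- matrix(rnorm(n*k), n, k)
beta <- c(0.5, -0.3, 0, 0.2, 0)
sigma <- 1
y <- X %*% beta + rnorm(n, 0, sigma)

tau <- 0.5                                # prior sd on each coefficient
lambda <- sigma^2 / tau^2                 # equivalent ridge penalty
beta_ridge <- solve(t(X) %*% X + lambda * diag(k), t(X) %*% y)
beta_bayes <- solve(t(X) %*% X / sigma^2 + diag(k) / tau^2, t(X) %*% y / sigma^2)
round(cbind(beta_ridge, beta_bayes), 4)   # the two columns are identical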

This also reminds me of a discussion I had yesterday with Aaditya Ramdas at CMU. Ramdas has done a lot of theoretical work on null hypothesis significance testing; I’ve done lots of applied and methodological work using Bayesian inference. Ramdas expressed the view that it is a bad thing that there are deep philosophical divisions in statistics regarding how to approach even the simplest problems. I replied that I didn’t see a deep philosophical divide between Bayesian inference and classical significance testing. To me, the differences are in what assumptions we are willing to swallow.

My take on statistical philosophy is that all statistical methods require assumptions that are almost always clearly false. Hypothesis testing is all about assumptions that something is exactly zero, which does not make sense in any problem I’ve studied. If you bring this up with people who work on or use hypothesis testing, they’ll say something along the lines of, Yeah, yeah, sure, I know, but it’s a reasonable approximation and we can alter the assumption when we need to. Bayesian inference relies on assumptions such as normality and logistic curves. If you bring this up with people who work on or use Bayesian inference, they’ll say something along the lines of, Yeah, yeah, sure, I know, but it’s a reasonable approximation and we can alter the assumption when we need to. To me, what appears to be different philosophies are more like different sorts of assumptions that people are comfortable with. It’s not just a “matter of taste”—different methods work better for different problems, and, as Rob Kass says, the methods that you use will, and should, be influenced by the problems you work on—I just think it makes more sense to focus on differences in methods and assumptions rather than framing them as incommensurable philosophies. I do think philosophical understanding, and misunderstanding, can make a difference in applied work—see section 7 of my paper with Shalizi.

Why are we making probabilistic election forecasts? (and why don’t we put so much effort into them?)

Political scientists Justin Grimmer, Dean Knox, and Sean Westwood released a paper that begins:

Probabilistic election forecasts dominate public debate, drive obsessive media discussion, and influence campaign strategy. But in recent presidential elections, apparent predictive failures and growing evidence of harm have led to increasing criticism of forecasts and horse-race campaign coverage. Regardless of their underlying ability to predict the future, we show that society simply lacks sufficient data to evaluate forecasts empirically. Presidential elections are rare events, meaning there is little evidence to support claims of forecasting prowess. Moreover, we show that the seemingly large number of state-level results provide little additional leverage for assessment, because determining winners requires the weighted aggregation of individual state winners and because of substantial within-year correlation.

I agree with all of that. There’s too much horse-race coverage—as we wrote many years ago, “perhaps journalists and then the public will understand that the [general election polls for president] are not worthy of as much attention as they get.” I also agree about the small sample limiting what we can learn about election forecast accuracy from data alone.

Grimmer et al. then continue:

We demonstrate that scientists and voters are decades to millennia away from assessing whether probabilistic forecasting provides reliable insights into election outcomes.

I think that’s an overstatement, and I’ll explain in the context of going through the Grimmer et al. paper. I don’t think this is a bad paper, I just have a disagreement with how they frame part of the question.

Right at the beginning, they write about “recent failures” of election forecasts. I guess that 2016 represents a failure in that some prominent forecasts in the news media were giving Clinton a 90% chance of winning the electoral vote, and she lost. I wouldn’t say that 2020 was such a failure: forecasts correctly predicted the winner, and the error in the predicted vote share (about 2 percentage points of the vote) was within forecast uncertainties. Julia Azari and I wrote about 2016 in our paper, 19 things we learned from the 2016 election, and I wrote about 2020 in the paper, Failure and success in political polling and election forecasting (which involved a nightmare experience with a journal—not the one that ultimately published it). I’m not pointing to those articles because they need to be cited–Grimmer et al. already generously cite me elsewhere!—but just to give a sense that the phrase “recent failures” could be misinterpreted.

Also if you’re talking about forecasting, I strongly recommend Rosenstone’s classic book on forecasting elections, which in many ways was the basis of my 1993 paper with King. See in particular table 1 of that 1993 paper, which is relevant to the issue of fundamentals-based forecasts outperforming poll-based forecasts. I agree completely with the larger point of Grimmer et al. that this table is just N=3; it shows that those elections were consistent with forecasts but it can’t allow us to make any strong claims about the future. That said, the fact that elections can be predicted to within a couple percentage points of the popular vote given information available before the beginning of the campaign . . . that’s an important stylized fact about U.S. general elections for president, not something that’s true in all countries or all elections (see for example here).

Grimmer et al. are correct to point out that forecasts are not just polls! As I wrote the other day, state polls are fine, but this whole obsessing-over-the-state-polls is getting out of control. They’re part of a balanced forecasting approach (see also here) which allows for many sources of uncertainty. Along these lines, fundamentals-based forecasts are also probabilistic, as we point out in our 1993 paper (and is implicitly already there in the error term of any serious fundamentals-based model).

Getting back to Rosenstone, they write that scientists are decades away from knowing if “probabilistic forecasting is more accurate than uninformed pundits guessing at random.” Really??? At first reading I was not quite sure what they meant by “guessing at random.” Here are two possibilities: (1) the pundits literally say that the two candidates are equally likely to win, (2) the pundits read the newspapers, watch TV, and pull guesses out of their ass in some non-algorithmic way.

If Grimmer et al. are talking about option 1, I think we already know that probabilistic forecasting is more accurate than a coin flip: read Rosenstone and then consider the followup elections of 1984, 1988, 1992, and 1996, none of which were coin flips and all of which were correctly predicted by fundamentals (and, for that matter, by late polls). From 2000 on, all the elections have been close (except 2008, and by historical standards that was pretty close too), so, sure, in 2000, 2004, 2012, 2016, and 2020, the coin-flip forecast wasn’t so bad. But then their argument is leaning very strongly on the current condition of elections being nearly tied.

If they’re talking about option 2, then they have to consider that, nowadays, even uninformed pundits are aware of fundamentals-based ideas of the economy and incumbency, and of course they’re aware of the polls. So, in that sense, sure, given that respected forecasts exist and pundits know about them, pundits can do about as well as forecasts.

OK, now I see in section 3 of their paper that by “guessing at random,” Grimmer et al. really are talking about flipping a coin. I disagree with the method they are using in section 3—or, I should say, I’m not disagreeing with their math; rather, I think the problem is that they’re evaluating each election outcome as binary. But some elections have more information than others. Predicting 1984 or 1996 as a coin flip would be ridiculous. The key here is that forecasters predict the popular and electoral vote margins, not just the winner (see here).

I also don’t see why they are demanding the forecast have a 95% chance of having a higher score, but I guess that’s a minor point compared to the larger issue that forecasts should be evaluated using continuous outcomes.

Finally, at the end they ask, why are we making probabilistic forecasts? I have some answers, other than “the lucrative marketing of statistical expertise.” First, political science. Rosenstone made a probabilistic forecasting model back in 1983, and we used an improved version of that model for our 1993 paper. The fact that U.S. general elections for president are predictable, within a few percentage points, helps us understand American politics. Second, recall baseball analyst Bill James’s remark that the alternative to good statistics is not “no statistics,” it’s “bad statistics.” Political professionals, journalists, and gamblers are going to make probabilistic forecasts one way or another; fundamentals-based models exist, polls exist . . . given that this information is going to be combined in some way, I don’t think there’s any shame in trying to do it well.

In summary, I agree with much of what Grimmer et al. have to say. We can use empirical data to shoot down some really bad forecasting models such as those that were giving Hillary Clinton a 99% chance of winning in 2016 (something that can happen from a lack of appreciation for non-sampling error in polls, a topic that has been studied quantitatively for a long time; see for example this review by Ansolabehere and Belin from 1993), and other times we can see mathematical or theoretical problems even before the election data come in (for example this from October 2020), but once we narrow the field to reasonable forecasts, it’s pretty much impossible to choose between them on empirical grounds. This is a point I made here and here; again, my point in giving all these links is to avoid having to restate what I’ve already written; I’m not asking them to cite all these things.

I sent the above to Grimmer et al., who responded:

To your point about elections from 30-40 years ago—sure, the forecasts from then look reasonable in retrospect. But again, as our calculations show, we need more information to distinguish those forecasts from other plausible forecasts. Suppose a forecaster like Rosenstone is accurate in his election predictions 85% of the time. It would take 48 years to distinguish his forecast from the 50% accuracy pundit (on average). This would mean that, on average, if Rosenstone started out of sample forecasting in 1980, then in 2028 we’d finally be able to distinguish from the 50% correct pundit. If we built a baseline pundit with more accuracy (say, accounting for obvious elections like you suggest) it would take even longer to determine whether Rosenstone is more accurate than the pundit.

I agree regarding the comparison to the baseline pundit, as this pundit can pay attention to the polls and prediction markets, read election forecasts, etc. The pundit can be as good as a forecast simply by repeating the forecast itself! But my point about Rosenstone is not that his model predicts the winner 85% of the time; it’s that his model (or more improved versions of it) predict the vote margin to a degree of accuracy that allows us to say that the Republicans were heavily favored in 1984 and 1988, the Democrats were favored in 1992 and 1996, etc. Not to mention the forecasts for individual states. Coin flipping only looks like a win if you collapse election forecasting to binary outcomes.

So I disagree with Grimmer et al.’s claim that you would need many decades of future data to learn that a good probabilistic forecast is better than a coin flip. But since nobody’s flipping coins in elections that are not anticipated to be close, this is a theoretical disagreement without practical import.

So, after all that, my summary is that, by restricting themselves to evaluating forecasts in a binary way, they’ve overstated their case, and I disagree with their negative attitude about forecasting, but, yeah, we’re not gonna ever have the data to compare different reasonable forecasting methods on straight empirical grounds—and that doesn’t even get into the issue that the forecasts keep changing! I’m working with the Economist team and we’re doing our best, given the limited resources we’ve allocated to the problem, but I wouldn’t claim that our models, or anyone else’s, “are the best”; there’s just no way to know.

The other thing that’s relevant here is that not much effort is being put into these forecasts! These are small teams! Ben Goodrich and I have helped out the Economist as a little side gig, and the Economist hasn’t devoted a lot of internal resources to this either. I expect the same is true of Fivethirtyeight and other forecasts. Orders of magnitude more time and money is spent on polling (not to mention campaigns’ private polls and focus groups) than on statistical analysis, poll aggregation, fundamentals models, and the rest. Given that the information is out there, and it’s gonna be combined in some way, it makes sense that a small amount of total effort is put into forecasting.

In that sense, I think we already are at the endgame that Grimmer et al. would like: some version of probabilistic forecasting is inevitable, there’s a demand for it, so a small amount of total resources are spent on it. I get the sense that they think probabilistic forecasts are being taken too seriously, but given that these forecasts currently show a lot of uncertainty (for example, the Economist forecast currently has the race at 60/40), I’d argue that they’re doing their job in informing people about uncertainty.

Prediction markets

I sent the above discussion to Rajiv Sethi, an economist who studies prediction markets. Sethi points to this recent paper, which begins:

Any forecasting model can be represented by a virtual trader in a prediction market, endowed with a budget, risk preferences, and beliefs inherited from the model. We propose and implement a profitability test for the evaluation of forecasting models based on this idea. The virtual trader enters a position and adjusts its portfolio over time in response to changes in the model forecast and market prices, and its profitability can be used as a measure of model accuracy. We implement this test using probabilistic forecasts for competitive states in the 2020 US presidential election and congressional elections in 2020 and 2022, using data from three sources: model-based forecasts published by The Economist and FiveThirtyEight, and prices from the PredictIt exchange. The proposed approach can be applied more generally to any forecasting activity as long as models and markets referencing the same events exist.

Sethi writes:

I suspect that the coin flip forecaster would lose very substantial sums.

Joyce Berg and colleagues have been looking at forecasting accuracy of prediction markets for decades, including the IEM vote share markets. This survey paper is now a bit dated but has vote share performance relative to polls for elections in many countries.

They are looking at markets rather than models but the idea that we don’t have enough data to judge would seem to apply to both.

I think these comparisons need to deal with prediction markets. The implicit suggestion in the paper (I think) is that we don’t know (and will never know) whether they can be beaten by coin flippers, and I think we do know.

Yes, as discussed above, I think the problem is that Grimmer et al. were only using binary win/loss outcomes in the analysis where they compared forecasts to coin flips. Throwing away the information on vote margin is going to make it much much harder to distinguish an informative forecast from noise.

Commercial election forecasting

I sent the above discussion to Fivethirtyeight’s Elliott Morris, who wrote:

1. It’s interesting to see how academics answer the questions we ask ourselves all the time in model development and forecast evaluation. Whether we are better than dart-throwing is a very important question.

2. I’m reasonably confident we are better. As a particularly salient example, a monkey does not know that Wyoming is (practically speaking) always going to be red and CT always blue in 2024. Getting one of those states wrong would certainly erase any gains in accuracy (take the Brier score) from reverting probabilities towards 50-50 in competitive states.

3. Following from that, it seems a better benchmark (what you might call the “smarter pundit” model) would be how the state voted in the last election—or even better, p(win | previous win + some noise). That might still not replicate what pundits are doing but I’d find losing to that hypothetical pundit more troubling for the industry.

4. Could you propose a method that grades the forecasters in terms of distance from the result on vote share grounds? This is closer to how we think of things (we do not think of ourselves as calling elections) and adds some resolution to the problem and I imagine we’d see separation between forecasters and random guessing (centered around previous vote, maybe), much sooner (if not practically immediately).

5. Back on the subject of how we grade different forecasts, we calculate the LOOIC of candidate models on out-of-sample data. Why not create the dumb pundit model in Stan and compare information criterion in a Bayesian way? I think this would augment the simulation exercise nicely.

6. Bigger picture, I’m not sure what you’re doing is really grading the forecasters on their forecasting skill. Our baseline accuracy is set by the pollsters, and it is hard to impossible to overcome bias in measurement. So one question would be whether pollsters beat random guessing. Helpfully you have a lot more empirical data there to test the question. Then, if polls beat the alternative in long-range performance (maybe assign binary wins/losses for surveys outside the MOE?), and pollsters don’t, that is a strong indictment.

7. An alternative benchmark would be the markets. Rajiv’s work finds traders profited off the markets last year if they followed the models and closed contracts before expiration. Taking this metaphor: If you remove assignment risk from your calculations, how would a new grading methodology work? Would we need a hundred years of forecasts or just a couple cycles of beating the CW?

Morris’s “smarter pundit” model in his point #3 is similar to the fundamentals-based models that combine national and state predictors, including past election results. This is what we did in our 1993 paper (we said that elections were predictable given information available ahead of time, so we felt the duty to make such a prediction ourselves) and what is done in a much improved way to create the fundamentals-based forecast for the Economist, Fivethirtyeight, etc.

Political science

Above I wrote about the relevance of effective forecasting to our understanding of elections. Following up on Sethi’s mention of Berg et al.’s research on prediction markets and polls, political scientist Chris Wlezien points us to two papers with Bob Erikson.

Are prediction markets really superior to polls as election predictors?:

We argue that it is inappropriate to naively compare market forecasts of an election outcome with exact poll results on the day prices are recorded, that is, market prices reflect forecasts of what will happen on Election Day whereas trial-heat polls register preferences on the day of the poll. We then show that when poll leads are properly discounted, poll-based forecasts outperform vote-share market prices. Moreover, we show that win projections based on the polls dominate prices from winner-take-all markets.

Markets vs. polls as election predictors: An historical assessment:

When we have both market prices and polls, prices add nothing to election prediction beyond polls. To be sure, early election markets were (surprisingly) good at extracting campaign information without scientific polling to guide them. For more recent markets, candidate prices largely follow the polls.

This relates to my point above that one reason people aren’t always impressed by poll-based forecasts is that, from the polls, they already have a sense of what they expect will happen.

I sent the above discussion to economist David Rothschild, who added that, even beyond whatever predictive value they give,

Polling data (and prediction market data) are valuable for political scientists to understand the trajectory and impact of various events to get to the outcome. Prediction markets (and high frequency polling) in particular allow for event studies.

Good point.

Rothschild adds:

Duncan Watts and I have written extensively on the massive imbalance of horse-race coverage to policy coverage in Mainstream Media election coverage. Depending on how you count it, no more than 5-10% of campaign coverage, even at the New York Times, covers policy in any remotely informative way. There are a myriad of reasons to be concerned about the proliferation of horse-race coverage, and how it is used to distract or misinform news consumers. But, to me, that seems like a separate question (how much horse-race coverage should we have?) from making the best forecasts possible from the available data rather than reverting to earlier norms of focusing on individual polls without context (conditional on horse-race coverage, should we make it as accurate and contextualized as possible?).

Summary

I disagree with Grimmer et al. that we can’t distinguish probabilistic election forecasts from coin flips. Election forecasts, at the state and national level, are much better than coin flips, as long as you include non-close elections such as lots of states nowadays and most national elections before 2000. If all future elections are as close in the electoral college as 2016 and 2020, then, sure, the national forecasts aren’t much better than coin flips, but then their conclusion is very strongly leaning on that condition. In talking about evaluation of forecasting accuracy, I’m not offering a specific alternative here–my main point is that the evaluation should use the vote margin, not just win/loss. When comparing to coin flipping, Grimmer et al. only look at predicting the winner of the national election, but when comparing forecasts, they also look at electoral vote totals.

I agree with Grimmer et al. that it is essentially impossible from forecasting accuracy alone to choose between reasonable probabilistic forecasts (such as those from the Economist and Fivethirtyeight in 2020 and 2024, or from prediction markets, or from fundamentals-based models in the Rosenstone/Hibbs/Campbell/etc. tradition). N is just too small, also the models themselves along with the underlying conditions change from election to election, so it’s not even like there are stable methods to make such a comparison.

Doing better than coin flipping is not hard. Once you get to a serious forecast using national and state-level information and appropriate levels of uncertainty, there are lots of ways to go, and there are reasons to choose one forecast over another based on your take on the election, but you’re not gonna be able to empirically rate them based on forecast accuracy, a point that Grimmer et al. make clearly in their Table 2.

Grimmer et al. conclude:

We think that political science forecasts are interesting and useful. We agree that the relatively persistent relationship between those models and vote share does teach us something about politics. In fact, when one of us (Justin) teaches introduction to political science, his first lecture focuses on these fundamentals-only forecasts. We also agree it can be useful to average polls to avoid the even worse tendency to focus on one or two outlier polls and overinterpret random variation as systematic changes.

It is a leap to go from the usefulness of these models for academic work or poll averaging to justifying the probabilities that come from these models. If we can never evaluate the output of the models, then there is really no way to know if these probabilities correspond to any sort of empirical reality. And what’s worse, there is no way to know that the fluctuations in probability in these models are any more “real” than the kind of random musing from pundits on television.

OK, I basically agree (even if I think “there is really no way to know if these probabilities correspond to any sort of empirical reality” is a slight overstatement).

Grimmer et al. are making a fair point. My continuation of their point is to say that this sort of poll averaging is gonna be done, one way or another, so it makes sense to me that news organizations will try to do it well. Which in turn should allow the pundits on television to be more reasonable. I vividly recall 1988, when Dukakis was ahead in the polls but a political scientist told me that Bush was favored because of the state of the economy (I don’t recall hearing the term “fundamentals” before our 1993 paper came out). The pundits can do better now, but conditions have changed, and national elections are much closer.

All this discussion is minor compared to horrors such as election denial (Grimmer wrote a paper about that too), and I’ll again say that the total resources spent on probabilistic forecasting is low.

One thing I think we can all agree on is that there are better uses of resources than endless swing-state and national horserace polls, and that there are better things for political observers to focus on than election forecasts. Ideally, probabilistic forecasts should help for both these things, first by making it clear how tiny the marginal benefit is from each new poll, and second by providing wide enough uncertainties that people can recognize that the election is up in the air and it’s time to talk about what the candidates might do if they win. Unfortunately, poll averaging does not seem to have reduced the attention being paid to polls, and indeed the existence of competing forecasts just adds drama to the situation. Which perhaps I’m contributing to, even while writing a post saying that there are too many polls and that poll aggregation isn’t all that.

Let me give the last word to Sean Westwood (the third author of the above-discussed paper), who writes:

Americans are confused by polls and even more confused by forecasts. A significant point in our work is that without an objective assessment of performance, it is unclear how Americans should evaluate these forecasts. Is being “right” in a previous election a sufficient reason to trust a forecaster or model? I do not believe this can be the standard. Lichtman claims past accuracy across many elections, and people evaluated FiveThirtyEight in 2016 with deference because of their performance in 2008 and 2012. While there is value in past accuracy, there is no empirical reason to assume it is a reliable indicator of overall quality in future cycles. We might think it is, but at best this is a subjective assessment.

Agreed.

Piranhas for “omics”?

Don Vicendese writes:

I remember one of your long ago comments, perhaps from your blog, along the lines that you were wondering if, in outcomes that involve many contributory factors, whether these factors might cancel themselves out to a certain extent. [That’s the piranha theorem; see original blog post from 2017 and recent research paper. — ed.]

I have often thought about this especially in regard to exposome type studies – add whatever prefix you like in front of the “ome.”

I recently read about the mathematical result obtained by Michel Talagrand regarding outcomes that are a function of many contributing stochastic factors. He won the 2024 Abel prize for it. He was able to derive limits for the resultant variation in the outcome which show that this variation is quite constrained. There does seem to be a lot of canceling going on.

I had previously also thought about how in very high dimensional spaces, the data seem to be clumped together with much of the space empty. These high dimensional spaces could be all sorts of data, not just omics. This may explain why our statistical methods may struggle in these very high dimensional settings – they just haven’t got the resolution to penetrate this type of highly compacted data.

Given all that, I have also wondered whether, when we analyse outcomes and isolate a few contributory factors and find an association, that this association may vanish if we were able to consider a sufficiently higher number of factors that may cancel each other out. In other words, associations we may have found in a lower dimensional setting may have been “lucky” due to the lower dimensional setting.

I am just speculating here but at the same time this is somewhat worrying to me and I would be interested in your thoughts on the matter.

My response: I’m not sure! I had not heard of this result from Talagrand; a quick google turned up this lay summary, which indeed sounds similar to our piranha theorems (but much much more mathematically sophisticated, I’m sure). It would be cool if we could connect our piranha idea to some deep mathematics . . . . I’ve long felt that there are some deep ideas here, even if maybe I’m not the one to fully plumb their depths!

I also like Vicendese’s framing in terms of adding or removing factors. This seems like a good angle on the problem.

Finally, I know nothing about “omics,” but that could well be a good area of application of these ideas, given that there literally are hundreds of genes out there. To the extent that the goal of this piranha research is to help do better science (rather than just to give insight into the problems of certain ideas of bad science), it would make sense to study something real such as “omics” rather than silly things like the purported effect of time of the month on clothing choices, etc.

“Alphabetical order of surnames may affect grading”

A Beatles fan points to this press release:

An analysis by University of Michigan researchers of more than 30 million grading records from U-M finds students with alphabetically lower-ranked names receive lower grades. This is due to sequential grading biases and the default order of students’ submissions in Canvas — the most widely used online learning management system — which is based on alphabetical rank of their surnames. . . .

The researchers collected available historical data of all programs, students and assignments on Canvas from the fall 2014 semester to the summer 2022 semester. They supplemented the Canvas data with university registrar data, which contains detailed information about students’ backgrounds, demographics and learning trajectories at the university. . . .

Their research uncovered a clear pattern of a decline in grading quality as graders evaluate more assignments. Wang said students whose surnames start with A, B, C, D or E received a 0.3-point higher grade out of 100 possible points compared with when they were graded randomly. Likewise, students with later-in-the-alphabet surnames received a 0.3-point lower grade — creating a 0.6-point gap.

Wang noted that for a small group of graders (about 5%) that grade from Z to A, the grade gap flips as expected: A-E students are worse off, while W-Z students receive higher grades relative to what they would receive when graded randomly. . . .

Here’s the research article, by Zhihan (Helen) Wang, Jiaxin Pei, and Jun Li.

The result seems plausible to me.

What I’d really like to see is some graphs. To start with, a plot showing average grade on the y-axis vs. first letter of surname (from A to Z) on the x-axis, with three sets of dots: red dots for the assignments graded in order of surname initial, black dots for the assignments graded in quasi-random order, and blue dots for the one-third of assignments that were not graded in either of those orders. And then separate graphs for social science, humanities, engineering, science, and medicine. With 30 million observations, there should be more than enough data to make all these plots.
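Something like this, sketched here with fake data just to show the format (the real version would of course use the Michigan records):

# Sketch of the plot: average grade vs. first letter of surname, three orders.
set.seed(2024)
idx <- 1:26
grade_alpha  <- 82 - 0.02*idx + rnorm(26, 0, 0.05)   # graded in A-to-Z order
grade_random <- 81.75 + rnorm(26, 0, 0.05)           # graded in quasi-random order
grade_other  <- 81.75 + rnorm(26, 0, 0.05)           # neither order
plot(idx, grade_alpha, col = "red", pch = 16, xaxt = "n",
     ylim = range(grade_alpha, grade_random, grade_other),
     xlab = "First letter of surname", ylab = "Average grade")
axis(1, at = idx, labels = LETTERS)
points(idx, grade_random, col = "black", pch = 16)
points(idx, grade_other, col = "blue", pch = 16)
legend("bottomleft", pch = 16, col = c("red", "black", "blue"), cex = 0.8,
       legend = c("Graded in surname order", "Graded in quasi-random order",
                  "Neither order"))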

The regression analyses are fine, sure, whatever, but I wanna see the data. Also, I want to see all 26 letters. For some reason, in their paper they put the surnames into five bins. I guess the data are probably owned by the University of Michigan and not available for reanalysis.

He wants to compute “the effect of a predictor” (that is, an average predictive comparison) for a hierarchical mixture model. You can do it in Stan!

Alex Quent writes:

Even after reading Bayesian Data Analysis I still have questions about using finite mixture models in the context of regression. I couldn’t find out who was (mostly) responsible for writing the chapter about mixture models so I decided to contact you.

In short, my main question (as described in length here) is how do I calculate the effect of a predictor, let’s say x1 in “y ~ x1 * x2 + (x1 * x2 | ppid)”, when I for instance use two Gaussian distributions? Can I just average across the distributions and calculate the posterior distribution of x1 via “theta1 * mu1_x1 + theta2 * mu2_x1”? With theta being the mixing parameter in brms and mu1_x1 and mu2_x1 being the coefficients for each distribution. Is it even possible to calculate the overall effect of x1?

The book mentioned averaging but I didn’t really understand how I am able to average it if I am honest. I also looked at some packages that allow calculating marginal effects (e.g. https://joshuawiley.com/brmsmargins/) but they don’t seem to support mixture models.

My response: As a way of thinking about this problem, I recommend my 2007 article with Iain Pardoe, Average predictive comparisons for models with nonlinearity, interactions, and variance components. The basic idea is that the comparison—the expected difference in y, comparing two people who differ by 1 unit in x1 but are identical in other predictors—depends on the values of all the predictors. In general, the way to do this is using simulations, although if you’re fitting a linear model (rather than logistic, say), you can do some of the computations analytically. We also discuss these ideas in Section 21.4 of my book with Jennifer and Section 14.4 of Regression and Other Stories. Using a mixture model rather than, say, an ordinary regression model with interactions, introduces no new conceptual challenges.

If you’re working with Stan, you should compute the average predictive comparisons in the generated quantities block, and then your Stan fit will automatically give you the posterior distribution of all the average predictive comparisons that you compute.
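For the two-component mixture with the x1 * x2 interaction and an identity link, the posterior-draw version of the calculation could look something like the function below. The draws data frame and its column names are hypothetical stand-ins for whatever your fitted model actually produces; the comparison per unit of x1 at a given x2 is theta1*(b1_x1 + b1_x1:x2 * x2) + theta2*(b2_x1 + b2_x1:x2 * x2), averaged over the observed x2’s and over the posterior draws:

# Average predictive comparison for x1 in a two-component Gaussian mixture
# regression (hypothetical column names in "draws").
apc_x1 <- function(draws, x2) {
  per_draw <- sapply(seq_len(nrow(draws)), function(s) {
    slope <- draws$theta1[s] * (draws$b_mu1_x1[s] + draws$b_mu1_x1_x2[s] * x2) +
             draws$theta2[s] * (draws$b_mu2_x1[s] + draws$b_mu2_x1_x2[s] * x2)
    mean(slope)   # average over the observed values of x2
  })
  c(estimate = mean(per_draw), sd = sd(per_draw))
}
# Hypothetical usage: apc_x1(as.data.frame(fit), dat$x2)

This sketch ignores the group-varying slopes in (x1 * x2 | ppid); to include them you would also average over the group-level coefficients, which is exactly the bookkeeping that is easiest to do in the generated quantities block as described above.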

P.S. We prefer the term “comparisons” rather than “effects” because the interpretation as a comparison in a predictive model is more general than the interpretation as a causal effect. The comparison reduces to an effect in the special case that the model has a valid causal interpretation.

P.P.S. At first I misread the correspondent’s name—it looked like “Alex Quant.” Which would be such a cool name for a quantitative researcher!

The “fail fast” principle in statistical computing

If you’re not careful, you’ll spend most of your time on computations that don’t work.

Here’s how we put it in the Bayesian workflow article:

An important intermediate goal is to be able to fail fast when fitting bad models. This can be considered as a shortcut that avoids spending a lot of time for (near) perfect inference for a bad model. There is a large literature on approximate algorithms to fit the desired model fast, but little on algorithms designed to waste as little time as possible on the models that we will ultimately abandon. We believe it is important to evaluate methods on this criterion, especially because inappropriate and poorly fitting models can often be more difficult to fit.

For a simple idealized example, suppose you are an astronomer several centuries ago fitting ellipses to a planetary orbit based on 10 data points measured with error. Figure 6a shows the sort of data that might arise, and just about any algorithm will fit reasonably well. For example, you could take various sets of five points and fit the exact ellipse to each, and then take the average of these fits. Or you could fit an ellipse to the first five points, then perturb it slightly to fit the sixth point, then perturb that slightly to fit the seventh, and so forth. Or you could implement some sort of least squares algorithm.

Now suppose some Death Star comes along and alters the orbit—in this case, we are purposely choosing an unrealistic example to create a gross discrepancy between model and data—so that your 10 data points look like Figure 6b. In this case, convergence will be much harder to attain. If you start with the ellipse fit to the first five points, it will be difficult to take any set of small perturbations that will allow the curve to fit the later points in the series. But, more than that, even if you could obtain a least squares solution, any ellipse would be a terrible fit to the data. It’s just an inappropriate model. If you fit an ellipse to these data, you should want the fit to fail fast so you can quickly move on to something more reasonable.

This example has, in extreme form, a common pattern of difficult statistical computations, that fitting to different subsets of the data yields much different parameter estimates.

The “fail fast” principle leads to three recommendations:

1. Recognize when your computation is failing. Don’t just run it blind. (A sketch of the kind of checks I mean appears after this list.)

2. When computation fails, don’t just brute-force it by running it overnight, setting max treedepth to 20, or whatever. Instead, use the fit from a simpler model as a starting point and try to figure out what’s going wrong. In some settings, you can also try fitting different models to different subsets of the data.

3. Thinking about this from a system level, set up your workflow so that failures will show up as soon as possible. This is not as easy as it sounds! It’s something that we’re working on. My dream is to put this all into a generalized expectation propagation framework, unifying the computational idea of federated learning with statistical ideas of multilevel modeling.
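Here is one concrete version of recommendation 1, a minimal sketch using rstan’s built-in diagnostics (my_model.stan and my_data are placeholders for whatever you are actually fitting):

# Don't run it blind: check the fit as soon as it finishes.
library(rstan)
fit <- stan("my_model.stan", data = my_data, chains = 4, iter = 2000)

check_hmc_diagnostics(fit)                # divergences, treedepth, E-BFMI
rhats <- summary(fit)$summary[, "Rhat"]
any(rhats > 1.01, na.rm = TRUE)           # TRUE: don't trust the fit yet
get_num_divergent(fit)                    # if large, rethink the model rather
                                          # than just cranking up adapt_delta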

Toward a Shnerbian theory that establishes connections between the complexity (nonlinearity, chaotic dynamics, number of components) of a system and the capacity to infer causality from datasets

Nadav Shnerb writes:

Regarding your latest post on causal claims, I believe there must be a theory of the “complexity-predictability relationship” that establishes connections between the complexity (nonlinearity, chaotic dynamics, number of components) of a system and the capacity to infer causality from datasets. For example, when a system’s dynamics are chaotic, the occurrence of state A at t=0 preceding state B at t=T provides very little insight into the system’s behavior when initiated at a slightly perturbed state A+dA and so forth. Are you familiar with any existing theories that explore this concept?

Recently, I delved into a variant of ANOVA, by examining the impact of random perturbations in selected elements of a random matrix on its highest eigenvalue. The outcomes imply that as the size of the matrix increases, the viability of such an analysis diminishes. This observation has prompted me to ponder these questions.

This is an interesting topic; it relates to our piranha theorems, I think. Shnerb is a physicist so he’s coming at this problem from a different direction than we are, as social scientists and mathematicians.

This also reminds me of the concept of unfolding-flower models and, more generally, the idea that we can infer more parameters when we get more data, with interesting patterns that arise for nonlinear models. There’s a connection here to phase transitions in statistical physics, where when you add more energy to a system, additional quantum levels get activated. I’ve wanted to pursue this idea further for decades but have never quite taken the time to work it out carefully in a series of examples. Maybe writing this post is a first step!
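Out of curiosity, here is a rough toy version in R of the kind of perturbation experiment Shnerb describes (my own construction, not his analysis): perturb one entry of a symmetric random matrix and see how little the top eigenvalue moves as the matrix grows.

# How much does the largest eigenvalue move when one element is perturbed?
set.seed(7)
top_eig_shift <- function(N, delta = 1) {
  A <- matrix(rnorm(N*N, sd = 1/sqrt(N)), N, N)
  A <- (A + t(A))/2                      # symmetric, so eigenvalues are real
  B <- A
  B[1, 2] <- B[1, 2] + delta/2           # perturb one off-diagonal element
  B[2, 1] <- B[2, 1] + delta/2           # (keeping the matrix symmetric)
  abs(max(eigen(B, symmetric = TRUE, only.values = TRUE)$values) -
      max(eigen(A, symmetric = TRUE, only.values = TRUE)$values))
}
# Average shift in the top eigenvalue shrinks as the matrix gets bigger:
sapply(c(10, 50, 200), function(N) mean(replicate(20, top_eig_shift(N))))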

Questions and Answers for Applied Statistics and Multilevel Modeling

Last semester, every student taking my course was required to contribute before each class to a shared Google doc by putting in a question about the reading or the homework, or by answering another student’s question. The material on this document helped us guide discussion during the class.

At the end of the semester, the students were required to add one more question, which then I responded to in the document itself.

Here it is!

Report of average change from an Alzheimer’s drug: I don’t get the criticism here.

Alexander Trevelyan writes:

I was happy to see you take a moment to point out the issues with the cold water study that was making the rounds recently. I write occasionally about what I consider to be a variety of suspect practices in clinical trial reporting, often dealing with deceptive statistical methods/reporting. I’m a physicist and not a statistician myself—I was in a group that had joint meetings with Raghu Parthasarathy’s lab at Oregon—but I’ve been trying to hone my understanding of clinical trial stats recently.

Last week, the Alzheimer’s drug Leqembi (lecanemab) was approved by the FDA, which overall seems fine, but it rekindled some debate about the characterization of the drug causing a “27% slowing in cognitive decline” over placebo; see here. This 27% figure was touted by, for example, the NIH NIA in a statement about the drug’s promise.

So here’s my issue, which I’d love to hear your thoughts on (since this drug is a fairly big deal in Alzheimer’s and has been quite controversial)—the 27% number is a simple percentage difference that was calculated by first finding the change in baseline for the placebo and treatment groups on the CDR-SB test (see first panel of Figure 2 in the NEJM article), then using the final data point for each group to calculate the relative change between placebo and treatment. Does this seem as crazy to you as it does to me?

First, the absolute difference in the target metric was under 3%. Second, calculating a percentage difference on a quantity that we’ve rescaled to start at zero seems a bit… odd? It came to my attention because a smaller outfit—one currently under investigation by about every three-letter federal agency you can name—just released their most recent clinical trial results, which had very small N and no error bars, but a subgroup that they touted hovered around zero and they claimed a “200% difference!” between the placebo and treatment groups (the raw data points were a +0.6 and -0.6 change).

OK, I’ll click through and take a look . . .

My first reaction is that it’s hard to read a scholarly article from an unfamiliar field! Lots of subject-matter concepts that I’m not familiar with, also the format is different from things I usually read, so it’s hard for me to skim through to get to the key points.

But, OK, this isn’t so hard to read, actually. I’m here in the Methods and Results section of the abstract: They had 1800 Alzheimer’s patients, half got treatment and half got placebo, and their outcome is the change in score in “Clinical Dementia Rating–Sum of Boxes (CDR-SB; range, 0 to 18, with higher scores indicating greater impairment).” I hope they adjust for the pre-test score; otherwise they’re throwing away information, but in this case the sample size is so large that this should be no big deal, we should get approximate balance between the two groups.

In any case, here’s the result: “The mean CDR-SB score at baseline was approximately 3.2 in both groups. The adjusted least-squares mean change from baseline at 18 months was 1.21 with lecanemab and 1.66 with placebo.” So both groups got worse. That’s sad but I guess expected. And I guess this is how they got the 27% slowing thing: Average decline in control group was 1.66, average decline in treatment group is 1.21, you take 1 – 1.21/1.66 = 0.27, so a 27% slowing in cognitive decline.
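Just to spell out the arithmetic (and, as comes up below, the thing that makes this kind of relative-change summary fragile when the baseline change is near zero):

# The 27% figure is the relative difference of the two mean changes from baseline:
1 - 1.21/1.66      # about 0.27, i.e., a 27% slowing
# The same formula applied to tiny changes of opposite sign (the "200%" example
# discussed below) is much less meaningful:
1 - (-0.6)/0.6     # "200% slowing" from a 1.2-point gap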

Now moving to the statistical analysis section of the main paper: Lots of horrible stuff with significance testing and alpha values, but I can ignore all this. The pattern in the data seems clear. Figure 2 shows time trends for averages. I’d also like to see trajectories for individuals. Overall, though, saying “an average 27% slowing in cognitive decline” seems reasonable enough, given the data they show in the paper.

I sent the above to Trevelyan, who responded:

Interesting, but now I’m worried that maybe I spent too much time on the background and not enough making my main concern clear. I don’t have any issues with the calculation of the percent difference, per se, but rather with what it is meant to represent (i.e., the treatment effect). As you noted, and as is unfortunately the state of the field, the curves always go down in Alzheimer’s treatment—but that doesn’t have to be the case! The holy grail is something that makes the treatment curve go up! The main thing that set off alarm bells for me is that the “other company” I referenced claims to have observed an improvement with their drug and an associated 200%(!) slowing in cognitive decline. In their case, the placebo got 0.6 points worse and the treatment 0.6 points better, so 200%! But their treatment could’ve gotten 10 points better and the placebo 10 points worse, and that’s also 200%! Or maybe 0.000001 points better versus 0.000001 points worse—again, 200%.

I think my overall concern is, “why are we using a metric that can break in such an obvious way under perfectly reasonable (if currently aspirational) treatment outcomes?”

See here for data from “other company” if you are curious (scroll down to mild subgroup, ugh).

And here’s a graph made by Matthew Schrag, who is an Alzheimer’s researcher and data sleuth, which rescales the change in the metric and shows the absolute change in the CDR-SB test. The inner plot shows the graph from the original paper; the larger plot is rescaled:

My reply: I’m not sure. I get your general point, but if you have a 0-18 score and it increases from 3.2 to 4.8, that seems like a meaningful change, no? They’re not saying they stopped the cognitive decline, just that they slowed it by 27%.
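That said, his point about the fragility of the percent-slowing metric is easy to see in a couple of lines of R (the non-lecanemab numbers below are just the ones he quotes, used purely for illustration):

# "Percent slowing" as computed from the changes from baseline in the two arms
pct_slowing <- function(change_treatment, change_placebo) {
  1 - change_treatment/change_placebo
}
pct_slowing(1.21, 1.66)    # lecanemab trial: about 27% slowing
pct_slowing(-0.6, 0.6)     # treatment improves 0.6, placebo declines 0.6: "200%"
pct_slowing(-1e-6, 1e-6)   # near-zero changes in opposite directions: also "200%"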

P.S. I talked with someone who works in this area who says that almost everyone in the community is skeptical about the claimed large benefits for lecanemab, and also that there’s general concern that resources spent on this could be better used in direct services. This is not to say the skeptics are necessarily right—I know nothing about all this!—but just to point out that there’s a lot of existing controversy here.

Is there a balance to be struck between simple hierarchical models and more complex hierarchical models that augment the simple frameworks with more modeled interactions when analyzing real data?

Kiran Gauthier writes:

After attending your talk at the University of Minnesota, I wanted to ask a follow up regarding the structure of hierarchical / multilevel models but we ran out of time. Do you have any insight on the thought that probabilistic programming languages are so flexible, and the Bayesian inference algorithms so fast, that there is a balance to be struck between “simple” hierarchical models and more “complex” hierarchical models that augment the simple frameworks with more modeled interactions when analyzing real data?

I think that a real benefit of the Bayesian paradigm is that (in theory) if the data don’t constrain a parameter, then the inference engine should return my prior (or something close to it). Does this happen in reality? I know you’ve written about canary variables before as an indication of model misspecification, which I think is an awesome idea. I’m just wondering how to strike that balance between a simple / approximate model and a more complicated model, given that the true generative process is unknown and noisy data with bad models can lead good inference engines astray.

My reply: I think complex models are better. As Radford Neal put it so memorably, nearly thirty years ago,

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

That said, I don’t recommend fitting the complex model on its own. Rather, I recommend building up to it from something simpler. This building-up occurs on two time scales:

1. When working on your particular problem, start with simple comparisons and then fit more and more complicated models until you have what you want.

2. Taking the long view, as our understanding of statistics progresses, we can understand more complicated models and fit them routinely. This is kind of the converse of the idea that statistical analysis recapitulates the development of statistical methods.
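P.S. Regarding the correspondent’s question about whether the posterior really reverts to the prior when the data are uninformative: in the idealized conjugate-normal case it does, and you can check it in a few lines (real MCMC fits can be messier, which I suppose is where the canary-variable idea comes in):

# N(0, 1) prior on theta, one observation with a huge standard error:
# the posterior is essentially the prior.
prior_mean <- 0
prior_sd   <- 1
y          <- 3
sigma_y    <- 100
post_var  <- 1/(1/prior_sd^2 + 1/sigma_y^2)
post_mean <- post_var*(prior_mean/prior_sd^2 + y/sigma_y^2)
c(post_mean, sqrt(post_var))   # approximately (0, 1), i.e., the prior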

If I got a nickel every time . . .

Justin Savoie came across this webpage on “Abacus Data MRP: A Game-Changer for GR, Advocacy, and Advertising”:

We should get these people to write our grant proposals! Seriously, we should tap them for funding. They’re using MRP, we’re developing improvements to MRP . . . it would be a good investment.

After noticing this bit:

Abacus Data collaborated closely with the Association to design, execute, and analyze a comprehensive national survey of at least 3,000 Canadian adults. The survey results were meticulously broken down by vital demographics, political variables, and geographical locations in an easy to read, story-telling driven report. Something we are known for – ask around!

The real innovation, however, lay in Abacus Data MRP’s unique capability. Beyond the general survey analysis, it generated 338 individual reports, each tailored for a specific Member of Parliament.

Savoie commented:

Cool! But 3000/338 ≈ 9 … so I don’t know about the tailoring.

Good point. Let’s not oversell.

Applied modelling in drug development? brms!

Colleagues of mine here at Novartis – Björn Holzhauer, Lukas Widmer & Andrew Bean – and I have recently published a new website on “Applied modelling in drug development via brms”. The idea is to present real-life case studies from drug development and demonstrate how these can be solved using the amazing R package brms by Paul Bürkner (short for Bayesian regression models using Stan). brms lets you specify your model using the R model formula syntax you may be familiar with, so that you do not need to write your own Stan code – while offering a surprising number of features in a super flexible manner. Each case study showcases a key modelling technique or a specific brms feature.
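To give a flavor of the formula interface, here is a minimal sketch with simulated data (just a toy example of mine, not one of the website’s case studies):

library(brms)
# Toy data: patients nested in trial sites (made-up numbers)
set.seed(1)
d <- data.frame(site = rep(1:10, each = 20), treatment = rep(0:1, 100))
d$y <- 0.5*d$treatment + rnorm(10, 0, 0.3)[d$site] + rnorm(200)
# A varying-intercept regression written in the usual R formula syntax;
# brms writes and compiles the underlying Stan program for you.
fit <- brm(y ~ treatment + (1 | site), data = d, family = gaussian())
summary(fit)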

Every case study starts with a problem statement from drug development and then shows how to solve it in a stepwise manner. Most case studies are self-contained and can be run directly with the code provided on the website. The code to generate the website is open-source and hosted on GitHub.

The material has been created for annual workshops organised at Novartis over the last three years with Paul as a guest instructor. Each case study is contributed by a Novartis colleague, making the website almost like an edited online book of problems in biostatistics. The material has become a valuable resource for jump-starting my own work, as I quite often reuse code from the website (or use it as a reminder of how to do things). I hope that others find the material just as useful, and I would be curious to hear your thoughts.

Finally, colleagues of mine – David Ohlssen, Björn Holzhauer & Andrew Bean – will teach a half-day course based on the website material at the JSM conference in Portland on Monday, August 5th. In case you are interested, you can sign up for the course during conference registration.

Combining multiply-imputed datasets, never easy

Thomas Hühn writes:

I’m thinking about doing a Bayesian analysis of a very small subset of PISA or TIMSS data. Those large-scale education surveys do not report students’ achievement scores as single numbers, but they report five or ten numbers, so-called plausible values. Those plausible values have been sampled from a constructed probability distribution.

The user guides and methodology papers strongly warn against taking those five plausible values as five observations, and also against taking the mean of those five plausible values and doing statistical analysis on that.

Instead you’re supposed to do five analyses on each set of plausible values and then take the arithmetic mean of the descriptive statistics metrics that you care about. Also, you’re supposed to calculate some imputation error, but I haven’t really understood that, yet.

How would a Bayesian proceed here?

I guess they would also do five analyses and then end up with five posterior probability distributions. How can they be combined? It surely cannot be as easy as doing a vector add in R and dividing by five (you’re sensing my small glimmer of hope here)?

My response: I’m not sure. I guess it depends where the plausible values come from? The method they describe sounds similar to the analysis of multiple imputations. There can be reasons that such an analysis will work better than taking the average for each person: if you average the plausible values for each person, your analysis kinda thinks that people are less variable than they really are.

One way to get a handle on this would be to simulate fake data based on a known model.
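For instance, here’s a crude sketch along those lines (made-up numbers and a toy normal-normal measurement model, not the actual PISA/TIMSS machinery): each set of plausible values reproduces the spread of the true scores, but the per-student averages are shrunken.

# Fake data: each student's plausible values are draws from that student's
# posterior given a noisy observed score.
set.seed(42)
n_students <- 1000
n_pv <- 5
true_sd <- 100
meas_sd <- 60
true_score <- rnorm(n_students, 500, true_sd)
observed   <- true_score + rnorm(n_students, 0, meas_sd)
# Normal-normal posterior for each student's score
post_var  <- 1/(1/true_sd^2 + 1/meas_sd^2)
post_mean <- post_var*(500/true_sd^2 + observed/meas_sd^2)
# Plausible values = draws from each student's posterior
pv <- replicate(n_pv, rnorm(n_students, post_mean, sqrt(post_var)))
sd(true_score)      # spread of the true scores: about 100
apply(pv, 2, sd)    # each plausible-value set reproduces that spread
sd(rowMeans(pv))    # averaging the plausible values shrinks it (roughly 90 here)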

Does this study really show that lesbians and bisexual women die sooner than straight women? Disparities in mortality by sexual orientation in a large prospective study published in JAMA

This recently-published graph is misleading but also has the unintended benefit of revealing a data problem:

Jrc brought it up in a recent blog comment. The figure is from an article published in the Journal of the American Medical Association, which states:

This prospective cohort study examined differences in time to mortality across sexual orientation, adjusting for birth cohort. Participants were female nurses born between 1945 and 1964, initially recruited in the US in 1989 for the Nurses’ Health Study II, and followed up through April 2022. . . .

Compared with heterosexual participants, LGB participants had earlier mortality (adjusted acceleration factor, 0.74 [95% CI, 0.64-0.84]). These differences were greatest among bisexual participants (adjusted acceleration factor, 0.63 [95% CI, 0.51-0.78]) followed by lesbian participants (adjusted acceleration factor, 0.80 [95% CI, 0.68-0.95]).

The above graph tells the story. As some commenters noted, there’s something weird going on with the line for heterosexual women 25+ years after exposure assessment. I guess these are the deaths from 2020 until the end of data collection in 2022.

Maybe not all the deaths were recorded? I dunno, it looks kinda weird to me. Here’s what they say in the paper:

The linkages to the NDI [National Death Index] were confirmed through December 31, 2019; however, ongoing death follow-up (eg, via family communication that was not yet confirmed with the NDI) was assessed through April 30, 2022. In the sensitivity analyses, only NDI-confirmed deaths (ie, through December 31, 2019) were examined.

Given the problem in the above graph, I’m guessing we should forget about the main results of the paper and just look at the estimates from the sensitivity analyses. They say that sexual orientation was missing for 22% of the participants.

Getting back to the data . . . Table 1 shows that the lesbian and bisexual women in the study were older, on average, and more likely to be smokers. So just based on that, you’d expect they would be dying sooner. Thus you can’t really do much with the above figures that mix all the age and smoking groups. I mean, sure, it’s a good idea to display them—it’s raw data, and indeed it did show the anomalous post-2020 pattern that got our attention in the first place—but it’s not so useful as a comparison.

Regarding age, the paper says “younger cohorts endorse LGB orientation at higher rates,” but the table in the paper clearly shows that the lesbians and bisexuals in the study are in the older cohorts in their data. So what’s up with that? Are the nurses an unusual group? Or perhaps there are some classification errors?

Here’s what I’d like to see: a repeat of the cumulative death probabilities in the top graph above, but as a 4 x 2 grid, showing results separately in each of the 4 age categories and for smokers and nonsmokers. Or maybe a 4 x 3 grid, breaking up the ever-smokers into heavy and light smokers. Yes, the data will be noisier, but let’s see what we’ve got here, no?
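In ggplot2 terms I’m imagining something like this; the data aren’t public, so the code below just simulates a placeholder dataset with the right structure (all numbers made up) to show the layout:

library(ggplot2)
set.seed(1)
d <- expand.grid(year = 0:30,
                 orientation = c("Heterosexual", "Lesbian", "Bisexual"),
                 cohort = c("1945-49", "1950-54", "1955-59", "1960-64"),
                 smoking = c("Never smoker", "Ever smoker"))
# Made-up cumulative mortality curves, just to show the 4 x 2 grid
d$cum_mortality <- 0.002*d$year*(1 + 0.5*(d$smoking == "Ever smoker")) +
  0.0005*d$year*(d$orientation != "Heterosexual")
ggplot(d, aes(year, cum_mortality, color = orientation)) +
  geom_line() +
  facet_grid(cohort ~ smoking) +
  labs(x = "Years since exposure assessment", y = "Cumulative mortality")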

They also run some regressions adjusting for age cohort. That sounds like a good idea, although I’m a little bit concerned about variation within cohort for the older groups. Also, do they adjust for smoking? I don’t see that anywhere . . .

Ummmm, OK, I found it, right there on page 2, in the methods section:

We chose not to further control for other health-related variables that vary in distribution across sexual orientation (eg, diet, smoking, alcohol use) because these are likely on the mediating pathway between LGB orientation and mortality. Therefore, controlling for these variables would be inappropriate because they are not plausible causes of both sexual orientation and mortality (ie, confounders) and their inclusion in the model would likely attenuate disparities between sexual orientation and mortality. However, to determine whether disparities persisted above and beyond the leading cause of premature mortality (ie, smoking), we conducted sensitivity analyses among the subgroup of participants (n = 59 220) who reported never smoking between 1989 and 1995 (year when sexual orientation information was obtained).

And here’s what they found:

Among those who reported never smoking (n = 59 220), LGB women had earlier mortality than heterosexual women (acceleration factor for all LGB women, 0.77 [95% CI, 0.62-0.96]; acceleration factor for lesbian participants, 0.80 [95% CI, 0.61-1.05]; acceleration factor for bisexual participants, 0.72 [95% CI, 0.50-1.04]).

I don’t see this analysis anywhere in the paper. I guess they adjusted for age cohorts but not for year of age, but I can’t be sure. Also, I think you’d also want to adjust for lifetime smoking level, not just smoking between 1989 and 1995. I’d think the survey would have asked about participants’ earlier smoking status, and I’d think that smoking would have been asked about in follow-ups.

Putting it all together

The article and the supplementary information present a lot of estimates. None of them are quite what we want—that would be an analysis that adjusts for year of age as well as cohort, adjusts for smoking status, and also grapples in some way with the missing data.

But we can triangulate. For all the analyses, I’ll pool the lesbian and bisexual groups because the sample size within each of these two groups is too small to learn much about them separately. For each analysis, I’ll summarize the reported adjusted acceleration factor and its 95% interval as an estimate +/- margin of error:

Crude analysis, not adjusting for cohort: 0.71 +/- 0.10
Main analysis, adjusting for cohort: 0.74 +/- 0.10
Using only deaths before 2020, adjusting for cohort: 0.79 +/- 0.10
Imputing missing responses for sexual orientation, adjusting for cohort: 0.79 +/- 0.10
Using only the nonsmokers, maybe adjusting for cohort: 0.77 +/- 0.17

Roughly speaking, adjusting for cohort adds 0.03 to the acceleration factor, imputing missing responses adds 0.05, and adjusting for smoking adds 0.03 or 0.06. I don’t know what would happen if you add all three adjustments; given what we have here, my best guess would be to add 0.03 + 0.05 + (0.03 or 0.06) = 0.11 or 0.14, which would take the estimated acceleration factor to 0.82 or 0.85. Also, if we’re only going to take the nonsmokers, we’ll have to use the wider margin of error.

Would further adjustment do more? I’m not sure. I’d like to do more adjustment for age and for smoking status; that said, I have no clear sense what direction this would shift the estimate.

So our final estimate is 0.82 +/- 0.17 or 0.85 +/- 0.17, depending on whether or not their only-nonsmokers analysis adjusted for cohort. They’d get more statistical power by including the smokers and adjusting for smoking status, but, again, without the data here I can’t say how this would go, so we have to make use of what information we have.
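Here’s the back-of-the-envelope arithmetic in one place (crude, since it just carries over the reported margins of error):

# Crude triangulation: start from the unadjusted estimate and add the
# approximate shifts from each adjustment, as discussed above.
crude       <- 0.71
adj_cohort  <- 0.03
adj_missing <- 0.05
adj_smoking <- c(0.03, 0.06)   # depends on whether the nonsmoker analysis adjusted for cohort
estimate <- crude + adj_cohort + adj_missing + adj_smoking   # 0.82 or 0.85
moe <- 0.17                    # wider margin of error from the nonsmoker subgroup
cbind(estimate, lower = estimate - moe, upper = estimate + moe)
# The first interval just barely excludes 1; the second does not.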

This result is, or maybe is not, statistically significant at the conventional 95% level. That is not to say that there is no population difference, just that there’s uncertainty here, and I think the claims made in the research article are not fully supported by the evidence given there.

I have similar issues with the news reports of this study, for example this:

One possible explanation for the findings is that bisexual women experience more pressure to conceal their sexual orientation, McKetta said, noting that most bisexual women have male partners. “Concealment used to be thought of as sort of a protective mechanism … but [it] can really rot away at people’s psyches and that can lead to these internalizing problems that are also, of course, associated with adverse mental health,” she said.

Sure, it’s possible, but shouldn’t you also note that the observed differences could also be explained by missing-data issues, smoking status, and sampling variation?

And now . . . the data!

Are the data available for reanalysis? I’m not sure. The paper had this appendix:

I don’t like the requirement of “co-investigator approval” . . . what’s with that? Why can’t they just post the data online? Anyway, I followed the link and filled out the form to request the data, so we’ll see what happens. I needed to give a proposal title so I wrote, “Reanalysis of Disparities in Mortality by Sexual Orientation in a Cohort of Female Nurses.” I also had to say something about the scientific significance of my study, for which I wrote, “The paper, Disparities in Mortality by Sexual Orientation in a Cohort of Female Nurses, received some attention, and it would be interesting to see how the results in that paper vary by age and smoking status.” For my “Statement of Hypothesis and Specific Aims,” I wrote, “I have no hypotheses. My aim is to see how the results in that paper vary by age and smoking status.” . . .

Ummmm . . . I was all ready to click Submit—I’d filled out the 6-page form, and then this:

A minimum of $7800 per year per cohort?? What the hell???

To be clear, I don’t think this is the fault of the authors of the above-discussed paper. It just seems to be the general policy of the Nurses Health Study.

Summary

I don’t think it’s bad that this paper was published. It’s pretty much free to root around in available datasets and see what you can find. (OK, in this case, it’s not free, it’s actually a minimum of $7800 per year per cohort, but this isn’t a real cost; it’s just a transfer payment: the researchers get an NIH grant, some of which is spent on . . . some earlier recipient of an NIH grant.) The authors didn’t do a perfect statistical analysis, but no statistical analysis is perfect; there are always ways to improve. The data have problems (as shown in the graph at the top of this post), but there was enough background in the paper for us to kind of figure out where the problem lay. And they did a few extra analyses—not everything I’d like to see, but some reasonable things.

So what went wrong?

I hate to say it because regular readers have heard it all before, but . . . forking paths. Researcher degrees of freedom. As noted above, my best guess of a reasonable analysis would yield an estimate that’s about two standard errors away from zero. But other things could’ve been done.

And the authors got to decide what result to foreground. For example, their main analysis didn’t adjust for smoking. But what if smoking had gone the other way in the dataset, so that the lesbians and bisexuals in the data had been less likely to smoke? Then a lack of adjustment would’ve diminished the estimated effect, and the authors might well have chosen the adjusted analysis as their primary result. Similarly with exclusion of the problematic data after 2020 and the missing data on sexual orientation. And lots of other analyses could’ve been done here, for example adjusting for ethnicity (they did do separate analyses for each ethnic group, but it was hard to interpret those results because they were so noisy), adjusting for year of age, and adjusting more fully for past smoking. I have no idea how these various analyses would’ve gone; the point is that the authors had lots of analyses to choose from, and the one they chose gave a big estimate in part because of a lack of adjustment for some data problems.

What should be done in this sort of situation? Make the data available, for sure. (Sorry, Nurses Health Study, no $7800 per year per cohort for you.) Do the full analysis or, if you’re gonna try different things, try all of them or put them in some logical order; don’t just pick one analysis and then perturb it one factor at a time. And, if you’re gonna speculate, fine, but then also speculate about less dramatic explanations of your findings, such as data problems and sampling variation.

It should be no embarrassment that it’s hard to find any clear patterns in these data, as everything here is driven by a mere 81 deaths. n = 81 ain’t a lot. I’d say I’m kinda surprised that JAMA published this paper, but I don’t really know what JAMA publishes. It’s not a horrible piece of work, just overstated, to the extent that the result is more an intriguing pattern in a particular dataset than strong evidence of a general pattern. This often happens in the social sciences.

Also, the graphs. Those cumulative mortality graphs above are fine as data summaries—almost. First, it would’ve been good if the authors had noticed the post-2020 problem. Second, the dramatic gap between the lines is misleading in that it does not account for the differences in average age and smoking status of the groups. As noted above, this could be (partly) fixed by replacing the single plot with a grid of graphs broken down by cohort and smoking status.

All of the above is an example of what can be learned from post-publication review. Only a limited amount can be learned without access to the data; it’s interesting to see what we can figure out from the indirect evidence available.

“When are Bayesian model probabilities overconfident?” . . . and we’re still trying to get to meta-Bayes

Oscar Oelrich, Shutong Ding, Måns Magnusson, Aki Vehtari, and Mattias Villani write:

Bayesian model comparison is often based on the posterior distribution over the set of compared models. This distribution is often observed to concentrate on a single model even when other measures of model fit or forecasting ability indicate no strong preference. Furthermore, a moderate change in the data sample can easily shift the posterior model probabilities to concentrate on another model.

We document overconfidence in two high-profile applications in economics and neuroscience.

To shed more light on the sources of overconfidence we derive the sampling variance of the Bayes factor in univariate and multivariate linear regression. The results show that overconfidence is likely to happen when i) the compared models give very different approximations of the data-generating process, ii) the models are very flexible with large degrees of freedom that are not shared between the models, and iii) the models underestimate the true variability in the data.
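To get a feel for the phenomenon, here is a toy simulation of my own (not from their paper): two misspecified regressions that fit the data about equally well, compared via BIC-approximated posterior model probabilities. Across repeated datasets the probabilities tend to swing between extremes, even though neither model is better in any meaningful sense:

# Data generated from y = x1 + x2 + noise; each candidate model omits one predictor.
set.seed(123)
n_sims <- 20
n <- 100
post_prob_m1 <- numeric(n_sims)
for (s in 1:n_sims) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- x1 + x2 + rnorm(n)
  m1 <- lm(y ~ x1)
  m2 <- lm(y ~ x2)
  # Equal prior model probabilities, BIC approximation to the Bayes factor
  post_prob_m1[s] <- 1/(1 + exp((BIC(m1) - BIC(m2))/2))
}
print(round(post_prob_m1, 2))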

This is related to our work on stacking:

[2018] Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis 13, 917–1003. (Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman)

[2022] Bayesian hierarchical stacking: Some models are (somewhere) useful. Bayesian Analysis 17, 1043–1071. (Yuling Yao, Gregor Pirš, Aki Vehtari, and Andrew Gelman)

[2022] Stacking for non-mixing Bayesian computations: The curse and blessing of multimodal posteriors. Journal of Machine Learning Research 23, 79. (Yuling Yao, Aki Vehtari, and Andrew Gelman)

Big open problems remain in this area. For choosing among or working with discrete models, stacking or other predictive model averaging techniques seem to work much better than Bayesian model averaging. On the other hand, for models with continuous parameters we’re usually happy with full Bayes. The difficulty here is that discrete models can be embedded in a continuous space, and continuous models can be discretized. What’s missing is some sort of meta-Bayes (to use Yuling’s term) that puts this all together.