
Pathfinder: A parallel quasi-Newton algorithm for reaching regions of high probability mass

Lu Zhang, Bob Carpenter, Aki Vehtari and I write:

We introduce Pathfinder, a variational method for approximately sampling from differentiable log densities. Starting from a random initialization, Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the inverse Hessian estimates produced by the optimizer. Pathfinder returns draws from the approximation with the lowest estimated Kullback-Leibler (KL) divergence to the true posterior. We evaluate Pathfinder on a wide range of posterior distributions, demonstrating that its approximate draws are better than those from automatic differentiation variational inference (ADVI) and comparable to those produced by short chains of dynamic Hamiltonian Monte Carlo (HMC), as measured by 1-Wasserstein distance. Compared to ADVI and short dynamic HMC runs, Pathfinder requires one to two orders of magnitude fewer log density and gradient evaluations, with greater reductions for more challenging posteriors. Importance resampling over multiple runs of Pathfinder improves the diversity of approximate draws, reducing 1-Wasserstein distance further and providing a measure of robustness to optimization failures on plateaus, saddle points, or in minor modes. The Monte Carlo KL-divergence estimates are embarrassingly parallelizable in the core Pathfinder algorithm, as are multiple runs in the resampling version, further increasing Pathfinder’s speed advantage with multiple cores.

The current title of the paper is actually “Pathfinder: Parallel quasi-Newton variational inference.” We didn’t say it in the above abstract, but our original motivation for Pathfinder was to get better starting points for Hamiltonian Monte Carlo, but then at some point it became clear that the way to do this is to get a variational approximation to the target distribution. I gave the above title to the blog post to draw attention to the idea that Pathfinder finds where to go.

In the world of Stan, we see three roles for Pathfinder: (1) getting a fast approximation (should be faster and more reliable than ADVI), (2) part of a revamped adaptation (warmup) phase, which again we hope will be faster and more reliable than what we have now, and (3) more effectively getting rid of minor modes. This is a cool paper (and I’m the least important of the four authors) and I’m really excited about the idea of having this in Stan and other probabilistic programming languages.

P.S. Bob provides further background:

The original idea was that the intermediate value theorem of calculus would guarantee that if we started with a random init in the tail and followed an optimization path, that path would have to pass through the bulk of the probability mass on the way to the mode or pole (for example, hierarchical models don’t have modes because the density is unbounded). Here’s an ASCII art version of the diagram of what should happen:

*--------[--------]-------->  opt path
   TAIL     BODY      HEAD

The brackets pick out the region in the body of the distribution corresponding to reasonable posterior draws (see below for how we refined this idea). For instance, in a multivariate standard normal, this is a thin shell at a fixed radius from the mode at the origin.

We were at the same time playing around with the notion of the typical set from information theory. The theorem there is an asymptotic result: as a sample grows, its mean log density approaches the entropy. We have been using the term informally to denote the subset of the sample space where a sampler will draw samples, but that’s not quite what it is. Formally, the typical set for a single draw is the set of draws whose log density is within epsilon of the expected log density, i.e.,

Typical_epsilon = {theta : log p(theta | y) in E[log p(Theta | y)] +/- epsilon}.

This +/- epsilon business doesn’t work well when log p(Theta | y) is skewed, as it often is in applied problems. So instead we settled on the set of theta whose log densities fall in the 99% central interval of log densities. That’s where 99% of our samples are drawn, and it should make a good target for initializing an MCMC sampler, our original goal.
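A quick way to see why a central interval of log densities is a sensible target is to simulate (a hedged sketch with made-up numbers, not code from the paper): for a high-dimensional standard normal, the mode has the highest density, yet the log densities of actual draws concentrate in a band that excludes the mode’s value entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                                    # dimension
draws = rng.standard_normal((10_000, d))   # exact draws from the target

# unnormalized log density of each draw under the standard normal
log_p = -0.5 * np.sum(draws**2, axis=1)

# the 99% central interval of log densities: where draws actually live
lo, hi = np.quantile(log_p, [0.005, 0.995])
mode_log_p = 0.0                           # log density at the mode (theta = 0)
print(lo, hi)            # roughly -70 to -34 for d = 100
print(mode_log_p > hi)   # True: the mode lies outside the 99% band
```

This is the “thin shell” mentioned above: a point drawn at the mode would be an extreme outlier in log density relative to typical draws.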

Given an optimization path, we can evaluate the points in parallel to detect whether they’re in the body of the distribution. If we start MCMC chains at those points, the ones in the tail should drift higher in density, the ones in the head should drift lower, whereas those in the body should bounce around. The only problem is that it’s hard to run MCMC on arbitrary densities robustly without a lot of adaptation, which is precisely what we were designing Pathfinder to avoid. Also, running MCMC is intrinsically serial. So next we considered evaluating the volume around a point and using density times volume as a proxy for being in the body of a distribution. We could use the low-rank-plus-diagonal covariance approximation from the L-BFGS (quasi-Newton) optimizer to evaluate the volume. It worked better than MCMC and was very fast, but still failed in many cases. The final idea was to just evaluate the evidence lower bound (ELBO), or equivalently the KL divergence from the approximation to the target density, using the normal approximation derived from L-BFGS at each point along the optimization path. That can also be done easily in parallel. It’s way more robust to run L-BFGS on the objective log p(theta | y) than it is to try to optimize the ELBO[normal(mu, Sigma) || log p(theta | y)] over mu and Sigma directly, as ADVI does. We also don’t have the serialization bottleneck of ADVI, where the evaluations of the ELBO need to be staged one after the other: all of our ELBO evals happen in parallel.
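Bob’s final idea can be caricatured in a few lines (a toy sketch, not the paper’s algorithm: real Pathfinder uses L-BFGS and its low-rank-plus-diagonal inverse-Hessian estimate for the covariance, while this sketch uses plain gradient ascent and a fixed identity covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def log_p(z):                # unnormalized target: standard normal
    return -0.5 * np.sum(z**2, axis=-1)

def grad_log_p(z):
    return -z

def elbo(mu, sigma=1.0, n_mc=200):
    # Monte Carlo ELBO for q = normal(mu, sigma^2 I):
    # E_q[log p(Z)] + entropy(q)
    z = mu + sigma * rng.standard_normal((n_mc, d))
    entropy = 0.5 * d * (1 + np.log(2 * np.pi)) + d * np.log(sigma)
    return log_p(z).mean() + entropy

# an "optimization path" from a random init out in the tail
mu = rng.normal(0, 10, size=d)
path = [mu.copy()]
for _ in range(30):
    mu = mu + 0.1 * grad_log_p(mu)      # gradient ascent on log p
    path.append(mu.copy())

# evaluate the ELBO at every point on the path (embarrassingly parallel),
# then keep the approximation with the best ELBO
elbos = [elbo(m) for m in path]
best = path[int(np.argmax(elbos))]
print(np.linalg.norm(best))             # close to the mode at the origin
```

The key structural point survives even in this caricature: each ELBO evaluation depends only on its own point on the path, so they can all run at once.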

Along the way, we lost the original Pathfinder idea of finding a point in the optimization path in the body of the distribution. The point at which the best ELBO is located is often beyond the body of the distribution. For example, the best normal approximation for a multivariate normal is located at the mode.

After the basic idea worked, we found it getting trapped in local optima. The last idea was to run multiple optimization paths of Pathfinder in parallel, then importance resample the results. That essentially gives us a low-rank multivariate normal mixture approximation of a posterior, and that worked much better. Each Pathfinder instance runs in parallel, too, and the only serialization bottleneck is importance resampling, which is fast.
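Importance resampling itself is simple; here’s a hedged one-dimensional sketch (made-up proposal and target, not the paper’s Pareto-smoothed version) showing how resampling corrects a badly located normal approximation:

```python
import numpy as np

rng = np.random.default_rng(1)

# pooled draws from a badly located approximation q = normal(2, 1)
draws = rng.normal(2.0, 1.0, size=5000)

# unnormalized log target (standard normal) and log proposal density
log_p = -0.5 * draws**2
log_q = -0.5 * (draws - 2.0)**2

# importance weights, normalized stably on the log scale
log_w = log_p - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()

# resample with replacement in proportion to the weights
resampled = rng.choice(draws, size=2000, replace=True, p=w)
print(resampled.mean(), resampled.std())   # roughly 0 and 1, matching the target
```

With several Pathfinder runs pooled, the same weighting step downweights draws from runs that got stuck in minor modes or on plateaus.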

Next up, we’re going to try to tackle Phase II of Stan’s warmup. The question is whether its covariance estimates will be good enough to circumvent or at least shorten Phase II warmup. Ben Bales has already done a lot of work around this in his thesis, like parallelizing adaptation in Phase II. The final goal, of course, is robust, massively parallelizable MCMC. But even without parallelization, Pathfinder is way faster than ADVI or Stan’s Phase I warmup.

Take the American Statistical Association’s “How Well Do You Know Your Federal Data Sources?” quiz!

Jonathan Auerbach writes:

How well do you really know your federal data sources?

Life expectancy, gross domestic product, consumer price index—we reference statistics like these all the time. But how well do you really know the details?—the careful considerations that go into the official statistics we rely on every day?

Now’s your chance to find out. Take our quiz, share with that special data user in your life, and support the American Statistical Association’s Count on Stats initiative. (Ten questions randomly selected every hour.)

I just tried out the quiz. It was fun! I got 8 out of 10 correct.

Controversy over “Facial recognition technology can expose political orientation from naturalistic facial images”

A couple people pointed me to this research article, “Facial recognition technology can expose political orientation from naturalistic facial images,” by Michal Kosinski, which reports:

A facial recognition algorithm was applied to naturalistic images of 1,085,795 individuals to predict their political orientation by comparing their similarity to faces of liberal and conservative others. Political orientation was correctly classified in 72% of liberal–conservative face pairs, remarkably better than chance (50%), human accuracy (55%), or one afforded by a 100-item personality questionnaire (66%).

The result seemed plausible to me. As Kosinski writes, a face contains lots of information; it’s not just bone structure. It seems that people of different social classes have different-looking faces, so I could imagine this to be true of political orientation as well.

I wasn’t quite sure how to think about the 72% result—is that high or low?—so I asked a colleague who prefers to remain anonymous, who told me this:

Regarding the paper, I’m not exactly sure what to make of the controversy. He released the underlying data, so I ran a couple of models [see script at end of this post] and here are the results:

AUC [area under the curve, a measure of classification accuracy] of simple demographic model: 0.63
AUC of model with more observable traits: 0.65
AUC of model with observable traits + personality: 0.72

It looks like ethnicity doesn’t include “Hispanic” for some reason, which is probably hurting the performance of the demographic model. If you use demographics + directly observable traits in the dataset (e.g., smiling, facial hair, glasses, etc.), you get 65% AUC. When you throw in personality (based on a survey), you get to 72% AUC, which is the same as what is reported in the paper based on the face recognition algorithm alone.

Personality is not directly visible, so the face algorithm is picking up on something more than the obviously observable traits. But I think the gap between 65% and 72% is not that big, and likely can be closed if you have more fine-grained ethnicity (e.g., Hispanic) and account for things like wearing make-up. Perhaps throwing in a few interactions and using a non-linear model would also close the gap a bit more.

As another point of comparison, I computed performance of a simple demographic model for 2012 presidential vote choice (based on exit poll data that included Hispanic ethnicity), and got an AUC of 68%.

So (unsurprisingly) there is real information about people’s political leanings that is discernible from how they look and choose to present themselves. But I think the paper itself over-claims (e.g., only hints at the comparisons above), and strongly suggests something more subtle is going on.

That’s a somewhat critical take, but I’d like to thank Kosinski for posting his data.

I got some more feedback on the paper from political scientist Brian Sala, who sent comments to Jeff Lax who passed them on to me. Here’s Sala:

This is really interesting. Limited, because the model is about picking the liberal (conservative) in a liberal/conservative pair. But the model should be useful for classifying individuals. I would have liked to have seen it applied via an ordered probit/logit to model self-assessed ideology (the typical 5-pt scale from very liberal to very conservative) instead of this binary classification.

Good point! A more precise measure should allow us to learn more.

Sala continues:

I [Sala] am a little unclear on the out-of-sample classifications here from my quick skim (the main model, I think, classifies within sample which in a pair is the liberal or conservative, by comparing facial attributes of the individuals drawn from the sample to the average lib/conservative attributes in the sample). Still, cool stuff: “In other words, a single facial image reveals more about a person’s political orientation than their responses to a fairly long personality questionnaire, including many items ostensibly related to political orientation (e.g., ‘I treat all people equally’ or ‘I believe that too much tax money goes to support artists’).”

What I don’t like is the reliance on central tendencies for the collapsed “liberal” and “conservative” groups. Because one would think the tool could be used to compare all pairs from the sample to choose the “more liberal” of the pair to get a rank ordering of the whole sample, a la Groseclose and Milyo’s approach to ordering media outlets or the power rankings of sports teams.

It seems to me (again, based on a quick skim) that this paper’s approach, by comparing individuals in a lib/conservative pair, is asking which in the pair is most like a “centrist” lib or “centrist” conservative. If the recovered geometry of facial features is multidimensional, “very liberal” and “very conservative” individuals could be closer to each other than they are to their respective ideological centroids, even as each is closer to the “right” centroid than the “wrong” one (or rather, one in the pair is closer to the reference centroid, leading to the classifications in the pair). Or worse, the “very liberal” person in the pair could be closer to the conservative centroid than the “very conservative” person AND closer to the liberal centroid (or vice-versa). Implicit here is two pairs of points in the facial characteristics space: the two individuals and the two centroids.

I’m doubtful about this claim from Sala. I’d guess there’d be more of a linear trend going on, with moderate liberals and moderate conservatives closer to the center of the distribution. But I don’t really know; this is just my guess.

Sala continues:

I [Sala] presume that the model is projecting the two individuals’ locations on to the line through the two centroids. If the model is “right”, the two individuals will be ordered the same way as the two centroids (liberal projected to the left of the conservative on the line through the centroids). The two individuals could be “outside” their respective centroids, inside, both to one side of a centroid, or straddling a centroid. If the left/right order is “correct” but both are projected to one side, you get classification errors (the liberal is further from the liberal centroid than is the conservative), whereas you get correct classifications if both project in between. If they straddle one centroid, it depends on the reference category (are you comparing each to the liberal centroid or the conservative centroid? If comparing to the liberal and they straddle the liberal, I think the model will classify the “closer” projection as liberal, leading to some classification errors. If comparing to the conservative centroid, no classification errors in this case.) Again, my quick read.

Again, I’m skeptical of Sala’s conjectures, but who knows? And so I thought I’d share this with you.

Finally, someone forwarded me a copy of this email that was floating around Stanford (where Kosinski works):

Dear colleagues,

Several members of the ACM US TPC on AI and Algorithms are always concerned on the unethical use of AI, even more when it comes from top universities like Stanford, given its reputation and influence. The last example, but not the only one, is more modern phrenology:

Facial recognition technology can expose political orientation from naturalistic facial images, M. Kosinski, Scientific Reports, 2021. (and is not facial recognition, is facial biometrics).

The argument that this is to warn about the bad use of ML, is not a valid one for publishing this type of research as this is pseudo-science (the backslash was worse when the same approach was used for sexual orientation). Surprisingly, Stanford’s IRB approved this research, so they are not taken care of the ethical aspect. Of course Nature also shares the responsibility. For this reason we have already contacted all the stakeholders involved to avoid more of this in the future.

However, I believe that we as computer scientists also should be concerned about unethical science and bad press in AI (and for you, the same for Stanford, especially with the HAI initiative). So this personal action is just to make sure you are aware of this, in case you were not.

I disagree with this “Dear colleagues” letter on many levels. OK, let me be clear. I fully support the freedom of the author of this letter to send it around; I’m not saying such letters should not be allowed; I’m just saying I disagree with the substance.

First, I don’t think it’s helpful to refer to this as “phrenology.” It’s a statistical analysis of photos. Similarly, I don’t see why they call it “pseudo-science.” The work of that ESP guy at Cornell, or the Pizzagate guy, or the beauty-and-sex-ratio research that we’ve discussed on this blog . . . I could see calling all of that pseudoscience, or, at least, really really bad science. But this facial recognition paper seems legit. I don’t like the idea of calling a paper “pseudoscience” just because you find it annoying.

Second is the ethical question. I have mixed feelings on this one. Maybe the analysis shouldn’t be done, as there’s something Big Brotherish about it—but as Kosinski points out, companies and governments are already doing such things, so it’s not clear that academics shouldn’t be doing it too. On the other hand, if it’s really something that shouldn’t be done, then no point in academics leading the way. Overall I don’t think I’d consider this to be unethical research, but that’s a matter of opinion; I can’t really say that the letter writer is right or wrong on this one.

Third is the statement, “Surprisingly, Stanford’s IRB approved this research.” I’ll agree that Stanford researchers can have ethical problems; for example there was this mailer they sent to voters in Montana, and Stanford also employs the business school professor who notoriously told 240 different restaurants that “Our special romantic evening became reduced to my wife watching me curl up in a fetal position on the tiled floor of our bathroom between rounds of throwing up.” So, sure, there are some studies that never should be approved. For this facial recognition study, though, on what ground would Stanford not approve it? Because someone thinks it’s evil? I really don’t like the idea of the IRB being used as some sort of political correctness filter, and I’m bothered that these people think the job of the IRB is to stop research from being done, just because somebody finds it politically objectionable.

P.S. Here’s my colleague’s R script:

library(tidyverse)
library(ROCR)

auc <- function(model) {
  pred_ROCR <- prediction(predict(model), model$y)
  auc_ROCR <- performance(pred_ROCR, measure = "auc")@y.values[[1]]
  auc_ROCR
}

round_any = function(x, accuracy, f=round){f(x/ accuracy) * accuracy}

# load the original face data and restrict to americans
# downloaded from https://drive.google.com/file/d/1I3QMFzb12-i6Mu9lSD1xxymm5nmQyqm9/view?usp=sharing
# note that race/ethnicity is classified as 'asian', 'black', 'india', 'white'
# in particular, hispanic ethnicity is not included
load('faces.RData')  # loads a data frame named 'd'
faces <- tibble(d) %>% 
  filter(country == 'united states') %>%
  select(-userid, -starts_with('pol_'), -database, -age) %>% 
  mutate(age_bin = factor(round_any(age.value, 10))) %>%
  drop_na()

# exit poll data from 2012 presidential election
survey <- read_tsv('https://5harad.com/mse125/assets/hw6/survey.tsv')

# fit a simple logistic regression based on sex, race, and age bin
model_demo <- glm(pol == 'liberal' ~ 
                      gender + ethnicity.value + age_bin, 
                    data = faces, family = 'binomial')

# fit a model with more observables
model_obs <- glm(pol == 'liberal' ~ 
                    gender + ethnicity.value + age_bin +
                    facial_hair +
                    emotion.sadness + emotion.neutral + emotion.disgust + 
                    emotion.anger + emotion.surprise + emotion.fear + emotion.happiness +
                    headpose.yaw_angle + headpose.pitch_angle + headpose.roll_angle + 
                    smile.value +
                    left_eye_status.normal_glass_eye_open + left_eye_status.no_glass_eye_close +
                    left_eye_status.occlusion + left_eye_status.no_glass_eye_open +    
                    left_eye_status.normal_glass_eye_close + left_eye_status.dark_glasses +
                    right_eye_status.normal_glass_eye_open + right_eye_status.no_glass_eye_close +
                    right_eye_status.occlusion + right_eye_status.no_glass_eye_open +    
                    right_eye_status.normal_glass_eye_close + right_eye_status.dark_glasses,
                  data = faces, family = 'binomial')

# fit a model with observables and personality (openness)
model_personality <- glm(pol == 'liberal' ~ 
                           gender + ethnicity.value + age_bin +
                           facial_hair +
                           emotion.sadness + emotion.neutral + emotion.disgust + 
                           emotion.anger + emotion.surprise + emotion.fear + emotion.happiness +
                           headpose.yaw_angle + headpose.pitch_angle + headpose.roll_angle + 
                           smile.value +
                           left_eye_status.normal_glass_eye_open + left_eye_status.no_glass_eye_close +
                           left_eye_status.occlusion + left_eye_status.no_glass_eye_open +    
                           left_eye_status.normal_glass_eye_close + left_eye_status.dark_glasses +
                           right_eye_status.normal_glass_eye_open + right_eye_status.no_glass_eye_close +
                           right_eye_status.occlusion + right_eye_status.no_glass_eye_open +    
                           right_eye_status.normal_glass_eye_close + right_eye_status.dark_glasses +
                           ext + neu + ope + agr + con,
                         data = faces, family = 'binomial')

# fit a simple logistic regression based on sex, race, and age
model_exit_poll <- glm(vote == 'A' ~ sex + race + age, data = survey, family = 'binomial')

# compute AUC of the various models
cat('AUC of simple demographic model: ', auc(model_demo), '\n')
cat('AUC of model with more observable traits: ', auc(model_obs), '\n')
cat('AUC of model with observable traits + personality: ', auc(model_personality), '\n')
cat('AUC of demographic model based on survey data: ', auc(model_exit_poll), '\n')

Incentives and test performance

Josh Miller points to some references:

Measuring Success in Education: The Role of Effort on the Test Itself by Gneezy et al.

When and Why Incentives (Don’t) Work to Modify Behavior by Gneezy et al.

Behavioral Economics and Psychology of Incentives by Emir Kamenica.

I have no comments on these particular articles, just wanted to post this placeholder so I can refer to it for a classroom demonstration.

Teaching and the separation of form and content

In high school math class, there’s pretty much a complete separation of form and content—we learn about algebra, geometry, functions, and calculus with very little connection to any real-world problems, it’s pretty much all form and no content, or we could say all theory and no applications. But in high school English class, there’s no separation at all: we read a bunch of books and learn how to write, but the writing assignments are particular things we have to do. Yes, there are some tricks like the “5 paragraph essay,” but writing is not taught in the abstract way that math is taught. Even when we’re asked to write a 5-paragraph essay, we’re asked to write about a particular topic.

My thinking is that there are problems with both math and English teaching. In math teaching, I like the separation of form and content; my only problem with the standard math sequence (algebra/geometry/functions/calculus) is that it’s all form and no content at all. I’d like there to be some content in addition to the form. In English teaching, I’d like form and content to be more separated, so that students can learn how to write without having to write boring essays about the books they’ve just read. Reading literature in English class is great; I’d just separate that from the writing lessons.

The next question is how this applies to statistics classes. Statistics already has a pretty clean separation of form and content—that is, methods and applications—and we tend to teach them together. So I think that, for all the problems with statistics teaching, we do well on this dimension.

I’m curious what Basbøll thinks about all this.

Bill James on secondary average

I came across this fun recent post by Bill James, who writes:

[Before Moneyball] batting average completely dominated the market, and most baseball executives into the mid-1990s didn’t have the foggiest notion of the difference between an empty batting average and a productive hitter. And you couldn’t explain it to them, because they didn’t understand the supporting concepts. . . .

I [James] thought of a straightforward way to test, if not this theory perfectly, at least a closely related concept. Before I get to that. . . .I think that I may have invented or at least popularized the expression “an empty batting average”. I could be wrong; you might study it and find that the phrase was in common use before me, or, more likely, that it was occasionally used before me. But I think I created that one. Doesn’t matter.

Anyway, here is the approach. Suppose that we take all players in the era 1950 to 1975 who have either 15 Win Shares or 2.5 WAR in a season. 15 Win Shares and 2.5 WAR (Baseball Reference WAR) are about the same thing; there are not a lot of players who have one but not the other, and also, they represent about the bottom of the barrel for players drawing meaningful support in MVP voting, which is what I am going to be studying here. Take all players with 15 Win Shares or 2.5 WAR in the last quarter-century of the pre-sabermetric era.

Then we look up, for each player (a) his batting average, and (b) his secondary average. Then we can sort the players into three groups . . .
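For readers who haven’t seen it: one common form of James’s secondary average (hedged; definitions vary, and some versions also subtract caught stealing) is extra bases plus walks plus stolen bases, per at bat. The two players below are hypothetical, just to show how identical batting averages can hide very different production:

```python
def batting_average(hits, at_bats):
    return hits / at_bats

def secondary_average(total_bases, hits, walks, stolen_bases, at_bats):
    # extra bases (TB - H) + walks + steals, per at bat
    return (total_bases - hits + walks + stolen_bases) / at_bats

# two hypothetical .300 hitters: a singles hitter vs. a power/on-base hitter
print(batting_average(150, 500))                 # 0.3 for both
print(secondary_average(180, 150, 20, 5, 500))   # "empty" .300: 0.11
print(secondary_average(280, 150, 90, 10, 500))  # productive .300: 0.46
```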

I love this partly for the content and partly because Bill James writes . . . just like Veronica Geng’s affectionate parody of Bill James. I can’t get enough of this stuff! It’s like visiting some country that specializes in a particular dessert, and when you’re there you have to have it with every meal.

James’s essay is also fun because he addresses two related issues: (1) how did the different players perform (as measured by wins above replacement) and (2) what was the mental model of baseball executives: how did they perceive player performance? It’s kind of like what we did in Red State Blue State, where we looked at how people voted, and we also tried to understand how the pundits could’ve kept getting it wrong.

When you have a new idea, it’s not enough to show that it works better than the old idea. You also need to explain why, if your idea is so great, people weren’t already doing it.

And this reminds me of my question when I wrote about James nearly ten years ago for Baseball Prospectus: Given all his writings about empty batting averages and how you shouldn’t take RBI so seriously, how come he provides the following four statistics for every player in his historical abstract: games played, home runs, RBI, and batting average? At the very least, why not give on-base percentage and runs scored?

“Tracking excess mortality across countries during the COVID-19 pandemic with the World Mortality Dataset”

A few months ago we posted on Ariel Karlinsky and Dmitry Kobak’s mortality dataset. Karlinsky has an update:

Our paper and dataset were finally published in eLife. Many more countries since the last version, more up-to-date data, some discussion and decomposition of excess mortality into various factors, etc.

“The 60-Year-Old Scientific Screwup That Helped Covid Kill”

Daniel Lakeland writes:

This news article by Megan Molteni seems like an article that’s got it all for the blog:

1) A physics-based estimate from a lifetime ago that was ignored

2) A dogma that, once established, was enormously harder to debunk than it had been to establish

3) Both physicists and physicians who for decades have fought the established dogma

4) A historian who cracks how the mistaken dogma got established

5) The pandemic, and indoors vs outdoors transmission

6) The 5-micron error probably still persists, and the CDC etc. just pretend that COVID spreads in 5-micron droplets

Phil adds:

I saw this article and barely believed it, it seems so ridiculous. “Aerosol science” is a thing, it’s not like it needed to be invented for the pandemic. You can look up aerosol settling times by diameter just by searching online. And, hey, at least for me, one of the top hits is a cdc presentation that has a good qualitative and semi-quantitative description! It’s sort of incredible that it was such a battle to get so many people to pay attention to facts that were so well known and well documented.
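Phil’s point about looking up settling times is easy to check with Stokes’ law (a back-of-envelope sketch; the densities and viscosity are textbook round numbers, and Stokes’ law is itself only a good approximation for small droplets):

```python
def settling_time(d_m, height_m=1.5,
                  rho_p=1000.0,   # droplet (water) density, kg/m^3
                  rho_f=1.2,      # air density, kg/m^3
                  mu=1.8e-5,      # air dynamic viscosity, Pa*s
                  g=9.81):
    # Stokes terminal velocity: v = (rho_p - rho_f) * g * d^2 / (18 * mu)
    v = (rho_p - rho_f) * g * d_m**2 / (18 * mu)
    return height_m / v

print(settling_time(5e-6) / 60)   # a 5-micron droplet: ~33 minutes to fall 1.5 m
print(settling_time(100e-6))      # a 100-micron droplet: ~5 seconds
```

The three-orders-of-magnitude gap between those two times is essentially the whole droplets-vs.-aerosols argument.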

The ML uncertainty revolution is … now?

This is Jessica. Recently I attended part of a workshop on distribution-free uncertainty quantification, which piqued my interest in what machine learning researchers are doing to express uncertainty in prediction accuracy. At one point during a panel Michael Jordan of Berkeley alluded to how uncertainty quantification has long been a niche topic in ML, and still kind of is, but this might be shifting. 

Relatedly, a couple months ago, IBM released a toolkit called Uncertainty Quantification (UQ) 360, with the goal of getting more ML developers to express (and evaluate) uncertainty in model predictions. They imply a high-level pipeline where you either start with a model that already provides some (not necessarily valid) uncertainty estimate and rely on extrinsic (i.e., post-hoc, distribution-free) uncertainty quantification, or use the toolkit before model development and choose a model that provides intrinsic UQ (i.e., uncertainty is intrinsic to the fitting process, like Bayesian neural nets). Either way, you then evaluate the UQ you’re getting and can re-calibrate using the extrinsic approaches.

[Figure: flowchart of the IBM UQ 360 pipeline]

Conformal prediction, an extrinsic technique that’s been around for a while but is attracting new interest, came up a fair amount at the workshop. Its claims are impressively general. Like other currently hot algorithms with strong guarantees, my first impression was that it seemed almost too good to be true. According to a recent tutorial by Angelopoulos and Bates, you can take any old “heuristic notion of uncertainty” from any old model type (or, even more generally, any function of X and Y) and use it to create a score where a higher value designates more uncertainty or worse fit between X and Y. You then use the scores from a hold-out sample to generate a statistically valid 1 – alpha confidence interval (prediction region) for continuous outputs, or a prediction set (a set of plausible labels with probability 1 – alpha of including the true label) for classification. You need only that instances are sampled independently from the same (unknown) distribution the model was trained on and the hold-out set was drawn from, or at least that they are exchangeable.

As Angelopoulos and Bates write, 

“Critically, the intervals/sets are valid without distributional assumptions or model assumptions, with explicit guarantees with finitely many datapoints. Moreover, they adapt to the difficulty of the input; when the input example is difficult, the uncertainty intervals/sets are large, signaling that the model might be wrong. Without much work, one can use distribution-free methods on any underlying algorithm, such as a neural network, to produce confidence sets guaranteed to contain the ground truth with a user-specified probability, such as 90%.”

In slightly more detail, the process is to obtain a hold-out set of (ground truth) labeled instances of some size n, look at the distribution of a score derived from some (“possibly uninformative”) heuristic notion of uncertainty output by the model, and find the 1 – alpha quantile of the scores for the true labels, slightly adjusted to account for the size of the hold-out set (specifically, the ceil[(n+1)(1 – alpha)]/n quantile). Then use that value as a threshold to filter the set of possible outputs (i.e., labels) for some new instance for which ground truth is not known. The remaining set is your prediction set or prediction region, guaranteed to contain the true label with probability between 1 – alpha and 1 – alpha + 1/(n+1).
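Here’s a minimal split-conformal sketch in the classification setting (hedged: the “classifier” is simulated, and I use the 1 − softmax-style score from the Angelopoulos–Bates tutorial; the only thing the guarantee needs is that calibration and test scores are exchangeable):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_test, k, alpha = 500, 2000, 3, 0.1

# stand-in for a trained classifier: per-instance class "probabilities";
# labels are drawn from them, so calibration and test scores are exchangeable
def simulate(n):
    probs = rng.dirichlet(np.ones(k) * 2, size=n)
    labels = np.array([rng.choice(k, p=p) for p in probs])
    return probs, labels

cal_probs, cal_labels = simulate(n_cal)
scores = 1 - cal_probs[np.arange(n_cal), cal_labels]  # heuristic score: 1 - p(true label)

# conformal threshold: the ceil((n+1)(1-alpha))/n empirical quantile of the scores
level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, level, method="higher")

# prediction set for each test instance: every label whose score clears the threshold
test_probs, test_labels = simulate(n_test)
pred_sets = (1 - test_probs) <= qhat                  # boolean, n_test x k
coverage = pred_sets[np.arange(n_test), test_labels].mean()
print(round(coverage, 3))                             # close to the target 0.9
```

Note that nothing here requires the probabilities to be calibrated or even sensible; a worse score just produces larger prediction sets, not invalid coverage.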

So what is the heuristic uncertainty output that allows you to create the distribution of scores? You have to pick one, dependent on the type of model you’re using. However, there appear to be no formal requirements on this score, which is part of what I find unintuitive about conformal prediction. 

In a classification context with a neural net, you might start with some form of confidence score provided by a model with a prediction (i.e., the result of applying softmax to the final outputs to create pseudo-probabilities), and then subtract these from 1 to get your score. If your output is continuous, you have various options for the score function, like a standard deviation given some parameter assumptions, variance in the prediction across an ensemble of models, variance given small input perturbations, 0 minus posterior predictive density in your Bayesian model, etc. The authors summarize: “Remarkably, this algorithm gives prediction sets that are guaranteed to provide coverage, no matter what (possibly incorrect) model is used or what the (unknown) distribution of the data is.” 
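For the classification case just described, here is a minimal sketch of one such score (again my own illustration; the function names are mine): run softmax over the final-layer outputs and subtract the pseudo-probability of the true label from 1.

```python
import math

def softmax(logits):
    """Turn final-layer outputs into pseudo-probabilities."""
    m = max(logits)                     # shift for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def conformal_score(logits, true_label):
    """1 minus the softmax probability assigned to the true label,
    so a higher score means a worse fit, as described above."""
    return 1.0 - softmax(logits)[true_label]

# A confident, correct prediction gets a low score;
# the same prediction scored against the wrong label gets a high one.
print(conformal_score([5.0, 0.0, 0.0], 0) < conformal_score([5.0, 0.0, 0.0], 1))  # True
```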

There’s something very Rumpelstiltskin-miller’s-daughter about the whole idea. I went back to look at some of the earlier formulations, credited to Vovk (e.g., this tutorial with Shafer of Dempster-Shafer theory), which formulates the approach for an online setting: “The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets.” They explain by connecting the idea to Fisher’s rule for valid prediction intervals (which by the way may have been inspired by eugenics, according to the warning label on the paper). However, in contrast to Fisher’s single-shot goal of using a set of prior observations to create a valid interval for an independently drawn observation from some distribution, in an online setting where you are making successive predictions (and you have access to features of the new example and the previous examples), the claim is that you can achieve the more powerful, direct guarantee that 95% of the predictions will be correct.

But clearly there’s some dependency on the score function used for these intervals/sets to be helpful. As Angelopoulos and Bates describe, “although the guarantee always holds, the usefulness of the prediction sets is primarily determined by the score function. This should be no surprise— the score function incorporates almost all the information we know about our problem and data, including the underlying model itself.” Shafer and Vovk describe how “The claim of 95% confidence for a 95% conformal prediction region is valid under exchangeability, no matter what the probability distribution the examples follow and no matter what nonconformity measure is used to construct the conformal prediction region. But the efficiency of conformal prediction will depend on the probability distribution and the nonconformity measure. If we think we know the probability distribution, we may choose a nonconformity measure that will be efficient if we are right.” 

I guess this isn’t that surprising, in that with non-ML statistical models you don’t expect to learn much from your uncertainty intervals if your model is a poor approximation for the true data generating process, your dataset is limited, etc. So there’s a model debugging process that presumably needs to happen first. It’s a little weird to encounter such strong enthusiastic claims about conformal prediction, when the usefulness in the end still depends heavily on certain choices. Perhaps part of my difficulty finding any of this intuitive is also that I’m so used to thinking of uncertainty in the intrinsic sense, where you care about how much the model has learned and from what. 

Going back to IBM’s tool, I like their pipeline perspective as it emphasizes the bigger decision process and implies that evaluating any uncertainty quantification you use is unavoidable. It appears they provide some tools (described here) for thinking about the trade-off between calibration and interval width, though I haven’t dug into it. Also, they have a step for communicating uncertainty, with some guidance! I hadn’t really thought about it before, but I would suspect prediction sets for a classifier are easier for non-technical end-users than intervals on continuous variables, since you don’t have to deal with people’s intuitive expectations of distribution within the interval. For continuous outputs, while the IBM guide is not very detailed, they do at least nod to different possible visualizations, including quantile dotplots and fan charts.

P.S. I’m writing this post as someone who hasn’t studied conformal prediction in much detail, so if I’m mischaracterizing any assumptions in either the online or inductive (hold-out set) setting, someone who knows this work better should speak up!

“Identifying airborne transmission as the dominant route for the spread of COVID-19” followup

Going through old emails, aiming for Inbox Zero, I came across a note from Sander Greenland from Nov 2020 pointing to this online petition from Noah Haber, Mary Kate Grabowski, et al., requesting that PNAS retract a paper, “Identifying airborne transmission as the dominant route for the spread of COVID-19.” The petition is from June 2020, and some googling revealed that I discussed that article around the very same time, in a post entitled, “The point here is not the face masks; it’s the impossibility of assumption-free causal inference when the different treatments are entangled in this way.” I’d been alerted to the article by Adam Pearce.

Anyway, now it’s March 2021 and this post is scheduled to appear in August, more than a year after that petition. I see on Google that the controversial article has nearly 400 citations (which I assume will be over 400 by the time this post appears). It does not seem to have been retracted; there’s a correction note (just minor things, for example, “78,000” has been changed to “75,000”) and a couple of letters.

I hope we can all agree on the impossibility of assumption-free causal inference when the different treatments are entangled. This is related to a point made by Rubin in his classic 1978 paper, that randomization gives robustness.

P.S. In comments, Joseph Delaney writes:

What I [Delaney] find most fascinating about this issue is that now, in August 2021, the CDC guidelines clearly recognize airborne transmission:

“Current evidence strongly suggests transmission from contaminated surfaces does not contribute substantially to new infections.” This suggests a large role for droplet and/or airborne transmission (given they identify only three modes).

This brings up the interesting question of what to do about bad papers that happen to be confirmed despite having methodological issues. Are they prophets, or should we be citing the first paper that was rigorous/robust (and how do you decide on a sliding scale of grey)? These are hard problems.

I agree.

Some researchers retrospect on their mistakes

Roy Mendelssohn points to this article by Julia Rohrer, Warren Tierney, Erik Uhlmann, et al., who write:

Science is often perceived to be a self-correcting enterprise. In principle, the assessment of scientific claims is supposed to proceed in a cumulative fashion, with the reigning theories of the day progressively approximating truth more accurately over time. In practice, however, cumulative self-correction tends to proceed less efficiently than one might naively suppose. Far from evaluating new evidence dispassionately and infallibly, individual scientists often cling stubbornly to prior findings. Here we explore the dynamics of scientific self-correction at an individual rather than collective level. In 13 written statements, researchers from diverse branches of psychology share why and how they have lost confidence in one of their own published findings. We qualitatively characterize these disclosures and explore their implications. A cross-disciplinary survey suggests that such loss-of-confidence sentiments are surprisingly common among members of the broader scientific population yet rarely become part of the public record. We argue that removing barriers to self-correction at the individual level is imperative if the scientific community as a whole is to achieve the ideal of efficient self-correction.

They have an interesting set of stories. I wonder what the people would say who made the ridiculous claim that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” I’m still wondering why they didn’t claim 200%, just to be super-sure about it.

This also reminds me of the “I Can’t Believe It’s Not Better” session.

How much faster is the Tokyo track?

[Figure: Speed (meters per second) in Olympic and World Championship finals in track sprinting, plotted by year and placing.]

This post is by Phil Price, not Andrew.

The guy whose company made the track for the Tokyo Olympic stadium says it’s “1-2% faster” than the track used at the Rio Olympics (which is the same material used at many other tracks), due to a surface that returns more energy to the runners. I’d be interested in an estimate based on empirical data.  Fortunately the Olympics are providing us with plenty of data to work with, but what’s the best approach to doing the analysis?

One obvious possibility is to compare athletes’ performances in Tokyo to their previous performances. For instance, Karsten Warholm just set a world record in the men’s 400m hurdles with a time of 45.94 seconds, which is indeed 1.6% faster than his previous best time. Sydney McLaughlin set a world record in the women’s 400m hurdles at 51.46 seconds, 0.8% faster than her previous time.  So that 1-2% guesstimate looks pretty reasonable.

On the other hand, it’s common for new records to be set at the Olympics: athletes are training to peak at that time, and their effort and adrenaline are never higher.

I can imagine various models that could be fit, such as a model that predicts an athlete’s time based on their previous performances, with ‘athlete effects’ as well as indicator variables for major events such as World Championships and the Olympics, and with indicator variables for track surfaces themselves. But getting all the data would be a huge pain, I think.

Another possibility is to look at the first-place times for each event: instead of comparing Karsten Warholm’s Olympic time to his other most recent competition times, we could compare (the first-place time in the 400m hurdles at the Olympics) to (the first-place time in the 400m hurdles at a previous major competition). We might not be comparing McLaughlin to McLaughlin this way; we’d be comparing McLaughlin to whoever won the last World Championship in the event. But maybe this approach would help remove the influence of the time-dependence of a single person’s training, fitness, and such. There are some problems with this approach too, though, with the most obvious one being that some athletes are simply faster than others, and that is going to add a lot of noise to the system. Usain Bolt sure made that Beijing track look fast, didn’t he?

A technology-based solution would be to use some sort of running robot that can run at a fixed power output. You could run it on different tracks and quantify the speed difference. But as far as I know such a robot does not exist, and even if it did, it would have to use almost the same biomechanics as a human runner if the results are to be applicable.

Everything I’ve listed above seems like a huge pain. But there’s something that would be easier, that I think would be almost as good: compare the third- or fourth-fastest times in Tokyo with the third- or fourth-fastest times at other competitions. The idea is that the third-fastest time should be more stable than the fastest time, since a single freak performance or exceptional athlete won’t matter…basically the same reason for using a trimmed mean in some applications. For instance, in the men’s 400m hurdles at the World Championships in 2019, Kyron McMaster finished third in 48.10 seconds. In the 400m hurdles at the Tokyo Olympics, Alison dos Santos finished third in 46.72 s. That’s 2.9% faster. For women, the 2019 World Championship time was Rushell Clayton’s 53.74 s, compared to Femke Bol’s third-place time of 52.03 s in Tokyo; that’s 3.2% faster.
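The comparison in the paragraph above boils down to a one-liner. Here’s a toy version (my own sketch, using percent reduction in finishing time as the “percent faster” measure; the function name is mine):

```python
def pct_faster(time_ref, time_new):
    """Percent reduction in time from a reference race to a new one
    over the same distance (a crude 'percent faster' measure)."""
    return 100.0 * (time_ref - time_new) / time_ref

# Women's third-place 400m hurdles times: 2019 Worlds vs. Tokyo
print(round(pct_faster(53.74, 52.03), 1))  # 3.2
```

With more meets per event, you could average these per-event differences to get a rough track effect, though that still ignores field strength and year-to-year fitness trends.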

Anyone got any other ideas for the best way to quantify the effect of the track surface?

[Added later: I got data (from Wikipedia) from recent Olympics and World Championships, and generated the plot that I have now included. The columns are distances (100, 200, and 400m), rows are sex.]

This post is by Phil.

Polling using the collective recognition heuristic to get a better sense of popularity?

Carl Gaspar writes:

Given the apparent shortcomings of forecasting for the 2020 US elections, and the 2016 elections, have you considered that it might be fruitful for polling companies to include alternative questions, based on the collective recognition heuristic, for example?

The collective recognition heuristic: Basically asking people how many other people they think might recognize the candidate. This is something based on Gerd Gigerenzer’s work. And it does seem promising for at least local-level politics. See Gaissmaier, W., & Marewski, J. N. (2011). Forecasting elections with mere recognition from small, lousy samples: A comparison of collective recognition, wisdom of crowds, and representative polls. Judgment and Decision Making, 6(1), 73-88.

Yes, it seems like a total shot in the dark. I mean, everyone recognizes Biden, right? Well, maybe not at the beginning of the campaign, especially in those deep dark pockets of America that everyone likes to speculate about. A large and representative enough sample, especially at early and intermediate points in campaigning, might be very informative. Part of campaigning for the challenger seems to be about catching up with simple exposure?

And of course, statistical models can include both conventional polls of intention as well as alternative questions based on projected winner and candidate recognitions. I’m no expert at such things – I’m a vision scientist – but I have never seen a hierarchical model of candidate recognitions (or projected winners) before and I’d love to see how that would play out state-wise. Perhaps it exists but I don’t think it does.

One problem with recognition – whether a respondent’s own recognition or a respondent’s assessment of others’ recognition – is that these days we supposedly live in online bubbles or echo chambers. In theory, recognition is less affected by sampling biases than voting intention but, given our little bubbles, that may not be true. Nonetheless, a liberal college student’s ideas of others’ recognitions could be additionally informed by the disappointing choices and ignorance of their friends and family, who may be less inclined to respond to polls. So there does appear to be some added information there, which would be hidden if we only asked directly about a person’s voting intentions.

It feels like statisticians stick to polls simply because there’s an existing infrastructure in place for collecting loads of data, and an existing statistical framework for aggregation.

Which leads to my next question: Do prominent statisticians such as yourself ever work with polling companies to consult on question types? If so, I’d love to learn about how that works. If not so much, then why not? I feel like that should be a thing.

Gaspar continues:

Just to be sure, I [Gaspar] am not very confident about any specific alternative to direct queries of voter intention. I’m simply wondering aloud about the exciting possibilities of alternatives filling in gaps in our knowledge. I mention the Gaissmaier & Marewski reference because they also ask respondents who they think will win, which does just as well as recognition (estimates of others’ recognitions), and both much better than intention-queries when matched for sample size . . . 34 students. Too good to be true? Even if predicting only local politics, seems too good maybe. Figure 2, scatter plots on far right.

I have no idea, but I thought I’d share this because it’s good to circulate ideas that are off the beaten path but can make some sense.

P.S. Julien Marewski points to this paper from 2013, Name Recognition and Candidate Support, by Cindy Kam and Elizabeth Zechmeister.

In which we learn through the logical reasoning of a 33-year-old book that B. H. Liddell Hart wasn’t all that.

I just read “Liddell Hart and the Weight of History,” a book from 1988 by the political scientist John Mearsheimer. I’d seen this book referred to in a review of another book by the same author, and I was intrigued. I read a bunch of B. H. Liddell Hart’s books many years ago—I suspect I’m unduly influenced by books that sold a lot of copies way back when and ended up cluttering used bookstores and college libraries—so I got Mearsheimer’s book on Liddell Hart out of the library. The book was readable—I’d say much more than I’d’ve expected given the narrowness of the topic—but, then again, it’s a topic that interests me, so maybe that’s why I found it so easy to read.

Mearsheimer’s book has three themes:

– The theory and practice of tank warfare in the western front in the two world wars.

– The ups and downs of academic/literary/policy reputation.

– The specific steps that Liddell Hart took after World War II to distort the record and make himself look good.

It was all interesting:

– In reading Mearsheimer’s discussion of tanks on the western front, and the theory of their use, I realized that I hadn’t really understood this stuff at all when I’d read it in history books.

– I’ve been interested in reputation management ever since reading John Rodden’s classic Orwell study, The Politics of Literary Reputation, and the Liddell Hart story added a new twist because the reputation here was happening at the two levels of academia and policy. I guess this happens a lot with defense studies.

– Liddell Hart put a lot of effort into rewriting his own history! He got a paragraph inserted into the English translation of one of the German generals’ memoirs in order to create the impression that the 1940 blitzkrieg had been motivated by Liddell Hart’s writings. (That would seem like an occasion for embarrassment or worse, not pride, but it fit into the argument that Liddell Hart was a prophet who was unrecognized in his own land.) And lots more along the same lines. It seems that Liddell Hart and the German generals propped up each other’s reputations. It’s a story worth reading. Mearsheimer writes:

The Liddell Hart case is disturbing . . . He was able in the 1930s to make deeply flawed arguments to a vast audience without being seriously challenged. . . . Liddell Hart suffered for the absence of criticism, which permitted him to make questionable arguments and then repeat them over and over. Formidable critics might have forced him to qualify and sharpen his ideas, perhaps even change them.

I also agree with Mearsheimer when he recommends that we “hold people accountable for arguments they put forward in key policy debates.” That’s why I keep screaming at the Nudgelords and the overconfident law professors who go around trying to tell us all how to live our lives. OK, let me clarify this. These dudes are public intellectuals; it’s fine for them to give us their take on how we should live our lives. But when they get things wrong, we should hold them accountable. Liddell Hart’s mistakes were much more consequential than those of these silly nudgelords, but I think the same principles apply.

That said, I have no real interest in reading anything else about Liddell Hart. His story was interesting for what it is, but I feel like now I have the whole story.

Logic

The other thing I liked about Mearsheimer’s book is just how logical it is. There are many examples; here’s one, from chapter 8, where he talks about the techniques Liddell Hart used to rescue his reputation. Mearsheimer writes:

First, he maintained he was unable for sound security reasons to reveal in his public writings the full truth about Allied weaknesses and German strengths in the late 1930s. He claims in his Memoirs, for example, that he wrote in a “guarded way . . . because I knew that what I wrote would receive close attention in Germany, so that exposure of any weaknesses that were not obvious might tend to precipitate the dangers that I foresaw.” There is no evidence to support that claim. Had he known of these weaknesses, he could have communicated his insights privately to government officials, but there is no evidence he did. Furthermore, this claim contradicts his often-repeated assertion that he clearly pointed out Allied deficiencies before the debacle but no responsible policy makers listened.

I just loooove this passage, and others like it throughout the book. It’s a clear, logical argument anchored in archival research.

In a world of big ideas, hot takes, flat-out lies, and people “just asking questions,” it’s so refreshing to see this sort of clear thinking.

Of course, I haven’t done the archival research myself, so I’m pretty much relying on Mearsheimer’s persuasiveness, along with a general impression that if he’d screwed this one up, someone would’ve pointed it out by now.

P.S. Here’s a review I found online that puts Mearsheimer’s book in perspective.

P.P.S. Years later, Daniel Drezner characterized a different book by Mearsheimer as “piss-poor, monocausal social science.” That’s not relevant to the present discussion; I just think Drezner’s phrase is great (however well or poorly it applies to that particular case), and I’ll take any opportunity to repeat it.

The lawsuit that never happened (Niall Ferguson vs. Pankaj Mishra)

In searching for the immortal phrase, “piss-poor monocausal social science,” I came across this amusing story of two public intellectuals discrediting each other.

But then this made me wonder . . . did the lawsuit ever happen? Here’s what the headline said:

Niall Ferguson threatens to sue over accusation of racism

Historian claims writer Pankaj Mishra accused him of racism and must apologise or face court action

I googled and . . . it looks like Mishra never apologised, but the promised court action from Ferguson never happened. Dude must’ve been too busy making fun of Keynes for being gay and marrying a ballerina and talking about poetry.

People are just suing each other all the time. So let’s take a moment to celebrate an instance when someone decided not to.

Struggling to estimate the effects of policies on coronavirus outcomes

Philippe Lemoine writes:

I published a blog post in which I reanalyze the results of Chernozhukov et al. (2021) on the effects of NPIs in the US during the first wave of the pandemic and, if you have time to take a look at it, I’d be curious to hear your thoughts.

Here is a summary that recaps the main points:

– The effects of non-pharmaceutical interventions on the COVID-19 pandemic are very difficult to evaluate. In particular, most studies on the issue fail to adequately take into account the fact that people voluntarily change their behavior in response to changes in epidemic conditions, which can reduce transmission independently of non-pharmaceutical interventions and confound the effect of non-pharmaceutical interventions.

– Chernozhukov et al. (2021) is unusually mindful of this problem and the authors tried to control for the effect of voluntary behavioral changes. They found that, even when you take that into account, non-pharmaceutical interventions led to a substantial reduction in cases and deaths during the first wave in the US.

– However, their conclusions rest on dubious assumptions, and are very sensitive to reasonable changes in the specification of the model. When the same analysis is performed on a broad range of plausible specifications of the model, none of the effects are robust. This is true even for their headline result about the effect of mandating face masks for employees of public-facing businesses.

– Another reason to regard even this result as dubious is that, when the same analysis is performed to evaluate the effect of mandating face masks for everyone and not just employees of public-facing businesses, the effect totally disappears and is even positive in many specifications. The authors collected data on this broader policy, so they could have performed this analysis in the paper, but they failed to do so despite speculating in the paper that mandating face masks for everyone could have a much larger effect than just mandating them for employees.

– This suggests that something is wrong with the kind of model Chernozhukov et al. used to evaluate the effects of non-pharmaceutical interventions. In order to investigate this issue, I fit a much simpler version of this model on simulated data and find that, even in very favorable conditions, the model performs extremely poorly. I also show with placebo tests that it can easily find spurious effects. This is a problem not just for this particular study, but for any study that relies on that kind of model to study the effects of non-pharmaceutical interventions.

– To be clear, as I stress in the conclusion, this doesn’t mean that mask-wearing doesn’t reduce transmission, because this paper evaluated the effect of mandating mask wearing, which is not the same thing. It may be that, as another study recently found (though I have no idea how good this paper is), mandates don’t really matter because people who are going to wear masks do so even if they’re not legally required to do so.

Anyway, since you disagreed with my harsh take on Flaxman et al.’s paper about the effects of NPIs in Europe during the first wave, I was curious to know your thoughts about this other study.

I replied that I agree with Lemoine’s general point that it’s very hard to untangle the effects of any particular policy, given that so much depends on behavior. Another complication is the desire for definitive results. From the other direction, I see the value of quantitative analyses, as some policy choices need to be made.

Lemoine responded:

On the need to make policy choices and what it means for what should be done with quantitative analyses, I think it’s a very complicated issue. I was a hawk on COVID-19 before it was cool and, back in March, I was in favor of the first lockdown. I changed my mind after that because I became convinced that, whatever their precise effects (I think it’s impossible to estimate them with anything resembling precision), they couldn’t be huge otherwise we’d see it much more easily (as with vaccination) and they generally needed to be huge in order to have a chance of passing a cost-benefit test. One reason I came to deeply regret my initial support for lockdowns is that I have since then realized they have become a sort of institutionalized default response, which is something I think I should have predicted but didn’t, so this has taught me the wisdom of requiring a much higher level of confidence in social scientific results before acting on them. (I’m French and here we have been under a curfew and bars/restaurants have remained completely closed between last October and May of this year!)

In response to my question about what exactly was meant by “lockdown,” Lemoine pointed to his post arguing against lockdowns and added:

I [Lemoine] think it has been a problem in those debates on both sides, but it’s not really a problem in Chernozhukov et al. (2021) since they look at pretty specific policies. My impression is that, when people talk about “lockdowns”, they have in mind a vague set of particularly stringent restrictions such as curfews, closure of “non-essential businesses” and stay-at-home orders. In any case, this is what I’m referring to when I use this term, though in my work I usually talk about “restrictions” and state my position as the claim that, whatever the precise effects of the most stringent restrictions (again, things like curfews, closure of “non-essential businesses” and stay-at-home orders) are, they are not plausibly large enough for those policies to pass a cost-benefit test when you take into account their immediate effects on people’s well-being, because even when I make preposterous assumptions about their effects on transmission and do a back-of-the-envelope cost-benefit analysis, the results come out as incredibly lopsided against those policies. This is still vague but I think not too vague. In particular, I don’t think mask mandates of any kind count as “lockdowns”, nor do I think that anyone does, even the fiercest opponents of those mandates.

I did not have the energy to read Chernozhukov et al.’s paper or Lemoine’s criticism in detail, but as noted above I am sympathetic with Lemoine’s general point that it is difficult to untangle causal effects of policies—and this difficulty persists even if, like Chernozhukov et al., you are fully aware of these difficulties and trying your best to address them. We had a similar discussion a few years ago regarding the deterrent effect of the death penalty, a topic that has seen many quantitative studies of varying quality but which, as Donohue and Wolfers explained, is pretty much impossible to figure out from empirical data. Effects of policies on disease spread should be easier to estimate, as the causal mechanism is much clearer, but we still have the problem of multiple interventions done at the same time, interventions motivated by existing conditions (which can be addressed statistically, but results will be necessarily sensitive to details of how the adjustment is done), effects that vary from one jurisdiction to another, and unclear relationships between behavior and policy. For example, when they closed the schools here in New York City, lots of parents were pulling their kids out of school and lots of teachers were not planning to keep showing up, so the school closing could be thought of as a coordination policy as much as a mandate. And then there are annoying policies such as closing parks and beaches, which nobody really thinks would have much effect on disease spread but represent some sort of signal of seriousness. And the really big thing which is people lowering the spread of disease by avoiding social situations, avoiding talking into each others’ faces, etc. From a policy standpoint it’s hard for me to hold all this in my head at once, especially because I’m really looking forward to teaching in person this fall, masked or otherwise. 

One of the points of a statistical analysis is to be able to integrate different sources of information—a multivariate probability distribution can “hold all this in its head at once” even when I can’t . . . ummm, at this point I’m just babbling. Speaking as a statistician, let me just say that it’s important to see the trail of breadcrumbs showing how the conclusions came from the data, scientific assumptions, and statistical model, starting from simple comparisons and then doing adjustments from there. I think the sorts of analyses of Chernozhukov et al. and Lemoine should be helpful in taking us in this direction.

P.S. Ethan Bolker shares this letter he sent to the Notices of the American Mathematical Society which he thought would be relevant to our discussion:

“Infections in vaccinated Americans are rare, compared with those in unvaccinated people . . . But when they occur, vaccinated people may spread the virus just as easily.”

Dean Eckles writes:

Thought you might like this example from the leaked CDC slides. One of the big claims being repeated in the media is that “Infections in vaccinated Americans are rare, compared with those in unvaccinated people, the document said. But when they occur, vaccinated people may spread the virus just as easily.” (NYT) That is, this focuses on possible equivalence (vs. not) within some subpopulation who get infected. And, of course, the vaccine affects who gets infected and whether it gets reported and included in the sample.

This is apparently based on the results on this slide:

The first bullet is a comparison within vaccinated people who have reported breakthrough cases. Based on 19 such cases with Delta, this suggests a ~10 times increase in viral load associated with Delta. (One widely reported comparison of viral load for Delta, cited by Dr. Fauci earlier this week, is ~1000 times, so this would actually be much lower than that.)

The second bullet is a comparison — for one particular outbreak — of vaccinated and unvaccinated cases in a cluster associated with Provincetown’s extensive July 4th parties. It seems like an interesting question here is whether this conditioning on a known infection makes sense.

(Other outlets focus on a different dichotomization of these results, saying for example, “New data suggests vaccinated people could transmit delta variant” as if this is new information at all!)

To me, this all gets at how valuable it is to think about things in degrees (if not fully quantitatively) and comparatively rather than reducing everything to 0 or non-zero.

Obviously, I am not an infectious disease biologist, but this seems like a nice example of dichotomization, conditioning on post-treatment variables (which can sometimes make sense — does it here?), and science communication.

Interesting point about conditioning on a known infection. The implicit causal model in the comparison is that the infection is something that just can happen to you, but I take Eckles’s point to be that, if you know that a vaccinated person was infected, that fact tells us something about that person—some combination of behavioral and biological information that we would expect to be relevant to the rate at which they spread the virus. Thus, it could be true that infected vaccinated people spread the virus as easily as infected non-vaccinated people, but that statement could be rephrased from a latent-variable perspective as “the sorts of vaccinated people who are likely to get infected are the sorts of people who are more likely to spread the virus,” without necessarily implying that the effect of being infected on spreading the virus is the same among vaccinated and unvaccinated people. I agree with Eckles that these questions can get very tangled.
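Eckles’s selection point can be made concrete with a toy simulation (all numbers invented): give everyone a latent “riskiness” that raises both their chance of infection and how much they spread, let the vaccine cut only the chance of infection, and then look at who ends up in the infected group.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Latent "riskiness": drives both exposure/infection and onward spread.
risk = rng.normal(0, 1, n)
vaccinated = rng.random(n) < 0.5

# Vaccine sharply lowers the probability of infection; onward spread (given
# who you are) is driven by riskiness alone in this toy model.
logit = np.where(vaccinated, risk - 3.0, risk - 1.0)
p_infect = 1 / (1 + np.exp(-logit))
infected = rng.random(n) < p_infect

# Compare average riskiness among the infected, by vaccination status.
r_vax = risk[infected & vaccinated].mean()
r_unvax = risk[infected & ~vaccinated].mean()
print(f"mean riskiness | infected, vaccinated:   {r_vax:.2f}")
print(f"mean riskiness | infected, unvaccinated: {r_unvax:.2f}")
# The infected-vaccinated group is selected for higher riskiness, so equal
# observed spread among the infected need not mean the vaccine does nothing.
```

Even though the vaccine has no effect on spread in this toy world, the infected vaccinated are a more selected—riskier—group, which is exactly the latent-variable rephrasing above.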

“I’m not a statistician, but . . .”

Alex Lamb writes:

I’m not a statistician, but one thing I’ve noticed is that most (or all?) of the percentage change plots that I’ve seen don’t use a logarithmic scale. I think the logarithmic scale would be better, since most people are better at mentally performing addition operations, and the cumulative effect of percent-changes is multiplicative.

The log-scale would reflect this, by making the cumulative effects over many time steps additive with respect to the scale of the plot.

For example, it seems like it would make a lot of these plots easier to interpret.

I replied that I agree, but it’s controversial, and I pointed to Jessica’s recent post on the topic.

Lamb responded:

I was specifically thinking about percentage growth rates, for example GDP growth per year like: 5%, 10%, -10%, …, 1%. One thing I noticed is that if you compare countries which have had really good economic growth like Malaysia or China vs. unstable countries with low growth, the unstable country’s growth plot often has a higher net area under the curve in total, since it’s a mix of years with very high positive growth and negative growth. The negative growth years actually count for much more due to the multiplicative interaction. If you plot on the log scale, then net area under the curve actually is a correct measure for total growth.

For absolute measures like how many people have gotten coronavirus, it’s less obvious to me if log scale is the right choice. I think log scale makes it easier to discriminate between different exponential growth rates, but makes it much harder to discriminate between exponential and non-exponential growth rates.
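To make Lamb’s area-under-the-curve point concrete, here is a small sketch with hypothetical growth rates: two paths with identical sums of raw percentage changes but different cumulative growth, where only the log-scale “area” adds up to the right answer.

```python
import math

# Two hypothetical growth paths with the same raw sum of percentage changes
# but very different cumulative growth.
steady = [0.05] * 4                     # +5% each year
unstable = [0.30, -0.20, 0.30, -0.20]   # also sums to +0.20

def cumulative_growth(rates):
    """Multiply out the growth factors: (1+g1)(1+g2)..."""
    total = 1.0
    for g in rates:
        total *= 1 + g
    return total

def log_area(rates):
    """Sum of log growth rates = 'area under the curve' on the log scale."""
    return sum(math.log(1 + g) for g in rates)

print(sum(steady), sum(unstable))                            # equal raw sums
print(cumulative_growth(steady), cumulative_growth(unstable))  # not equal
print(log_area(steady), log_area(unstable))                  # log areas match growth
```

The raw sums are both 0.20, but the steady path grows about 22% in total while the unstable path grows only about 8%; exponentiating the log-scale area recovers each path’s true cumulative growth exactly.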

Uh oh, don’t talk about Malaysia, it will bring the racists out of the woodwork!

‘No regulatory body uses Bayesian statistics to make decisions’

This post is by Lizzie. I also took the kitten photo — there’s a white paw taking up much of the foreground and a little gray tail in the background. As this post is about uncertainty, I thought maybe it worked.

I was back east for work in June, drifting from Boston to Hanover, New Hampshire and seeing a couple colleagues along the way. These meetings were always outside, often in the early evenings, and so they sit in my mind with the lovely luster of nice spring weather in the northeast, with the sun glinting in at just the right angle.

One meeting was sitting on a little sloping patch of grass in a backyard in Arlington, where I was chatting with a former postdoc, who now works for a consulting company tightly intertwined with US government. When he was in my lab he and I learned Bayesian statistics (and Stan), and I asked him how much he was using Bayesian approaches. He smiled slyly at me and told me a story about a recent meeting he was at where one of the senior people said:

“No regulatory body uses Bayesian statistics to make decisions.”

He quickly added that he’s not at all sure this is true, but that it encapsulates a perspective that is not uncommon in his world.

The next meeting was next to the Connecticut river and with a senior ecologist, who works on issues with some real policy implications: how to manage beetle populations as they take off for the north with warming (hello, or should I say goodbye, New Jersey pine barrens), the thawing Arctic, and more. I was asking him if he thought this statement was true, which he didn’t answer, but set off on a different declaratory statement:

“The problem with Bayesian statistics is their emphasis on uncertainty.”

Ah. Uncertainty. Do you think uncertainty is the most commonly used word in the title of blog posts here? (Some recent posts here, here and here.)

In response to my colleague I may have blurted out something like ‘but I love uncertainty!’ or ‘that is a great thing about Bayesian!’ and so the conversation veered deeply into a ditch, from which I am not sure that it ever recovered. I said something along the lines of, isn’t it better to have all that uncertainty out in the middle of the room? Rather than trying to stuff it under the cushions of the sofa, as I feel so many ecologists do when they do their models in sequential steps, dropping off uncertainty along the way (often using p-values or delta-AIC values of 2 or…) to drive ahead to their imaginary land of near-certainty? (I know at some point I also poorly steered it towards my thoughts on whether climate change scientists have done themselves a service or disservice in shying away from communicating uncertainty; I regret that.)

We left mired in the muck that so many of the ecologists around me feel about Bayesian — too much emphasis on uncertainty, too little concrete information that could lead to decision making.

So I pose this back to you all: what should I have said in response to either of these remarks? I am looking for excellent information, and persuasive viewpoints.

I’ll open the floor with what I thought was a good reply from Michael Betancourt to the first quote: fisheries, and the point that Bayesian methods give better options to steer policy. For example, if you want maximum sustainable yield without crashing a fish stock, you can more easily suggest a quantile of catch that puts you a little more firmly in the ‘non-crashing’ outcome.
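Here is a hedged sketch of what that quantile-based policy might look like (all numbers invented, not from any real fisheries model): given posterior draws of maximum sustainable yield, set the quota at a low posterior quantile rather than at a point estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior draws of maximum sustainable yield (tonnes/year),
# standing in for draws from a fitted stock-assessment model.
msy_draws = rng.lognormal(mean=np.log(1000), sigma=0.3, size=4000)

# A point estimate invites overfishing about half the time; a low posterior
# quantile builds the uncertainty directly into the policy.
posterior_median = np.quantile(msy_draws, 0.50)
cautious_quota = np.quantile(msy_draws, 0.10)

print(f"median MSY estimate: {posterior_median:.0f} t")
print(f"10% quantile quota:  {cautious_quota:.0f} t")
print(f"Pr(quota exceeds true MSY) ~ {np.mean(msy_draws < cautious_quota):.2f}")
```

The choice of quantile is then a transparent policy dial—how much crash risk you are willing to accept—rather than something hidden inside the analysis.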

Claim of police shootings causing low birth weights in the neighborhood

Under the subject line, “A potentially dubious study making the rounds, re police shootings,” Gordon Danning links to this article, which begins:

Police use of force is a controversial issue, but the broader consequences and spillover effects are not well understood. This study examines the impact of in utero exposure to police killings of unarmed blacks in the residential environment on black infants’ health. Using a preregistered, quasi-experimental design and data from 3.9 million birth records in California from 2007 to 2016, the findings show that police killings of unarmed blacks substantially decrease the birth weight and gestational age of black infants residing nearby. There is no discernible effect on white and Hispanic infants or for police killings of armed blacks and other race victims, suggesting that the effect reflects stress and anxiety related to perceived injustice and discrimination. Police violence thus has spillover effects on the health of newborn infants that contribute to enduring black-white disparities in infant health and the intergenerational transmission of disadvantage at the earliest stages of life.

My first thought is to be concerned about the use of causal language (“substantially decrease . . . no discernible effect . . . the effect . . . spillover effects . . . contribute to . . .”) from observational data.

On the other hand, I’ve estimated causal effects from observational data, and Jennifer and I have a couple of chapters in our book on estimating causal effects from observational data, so it’s not like I think this can’t be done.

So let’s look more carefully at the research article in question.

Their analysis “compares changes in birth outcomes for black infants in exposed areas born in different time periods before and after police killings of unarmed blacks to changes in birth outcomes for control cases in unaffected areas.” They consider this a natural experiment in the sense that dates of the killings can be considered as random.

Here’s a key result, plotting estimated effect on birth weight of black infants. The x-axis here is distance to the police killing, and the lines represent 95% confidence intervals:

There’s something about this that looks wrong to me. The point estimates seem too smooth and monotonic. How could this be? There’s no way that each point here represents an independent data point.

I read the paper more carefully, and I think what’s happening is that the x-axis actually represents maximum distance to the killing; thus, for example, the points at x=3 represent all births that are up to 3 km from a killing.
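If that reading is right, the smoothness is easy to reproduce even under a pure null, because estimates at successive thresholds share most of their data. A toy simulation (hypothetical numbers throughout):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate a null effect: birth-weight residuals with no relation at all
# to distance from an event.
n = 5000
distance_km = rng.uniform(0, 5, n)
outcome = rng.normal(0, 50, n)   # grams, pure noise

# "Effect at distance x" computed cumulatively: average over ALL births
# within x km, so each estimate reuses the previous estimate's data.
thresholds = np.arange(0.5, 5.01, 0.5)
estimates = [outcome[distance_km <= x].mean() for x in thresholds]
print(np.round(estimates, 1))
# Nested samples make adjacent estimates highly correlated, so the curve
# looks smooth and drifts gently even though the true effect is zero.
```

So a smooth, monotonic-looking curve across thresholds is what nested aggregates produce by construction; it is not independent evidence of a dose-response pattern.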

Also, the difference between “significant” and “not significant” is not itself statistically significant. Thus, the following statement is misleading: “The size of this effect is substantial for exposure during the first and second trimesters. . . . The effect of exposure during the third trimester, however, is small and statistically insignificant, which is in line with previous research showing reduced effects of stressors at later stages of fetal development.” This would be ok if they were to also point out that their results are consistent with a constant effect over all trimesters.
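The arithmetic behind that point is worth spelling out with made-up numbers: a “significant” first-trimester estimate and an “insignificant” third-trimester estimate whose difference is itself nowhere near significant.

```python
# Hypothetical trimester estimates (grams) and standard errors -- not the
# paper's actual numbers, just an illustration of the logic.
est1, se1 = -40.0, 15.0    # z ~ -2.7: "statistically significant"
est3, se3 = -15.0, 16.0    # z ~ -0.9: "statistically insignificant"

# Standard error of the difference between two independent estimates.
diff = est1 - est3
se_diff = (se1**2 + se3**2) ** 0.5
z = diff / se_diff
print(f"difference: {diff:.0f} +/- {se_diff:.1f}  (z = {z:.1f})")
# |z| ~ 1.1 < 1.96: the data are consistent with an identical effect in
# both trimesters, despite one estimate clearing the significance bar.
```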

I have a similar problem with this statement: “The size of the effect is spatially limited and decreases with distance from the event. It is small and statistically insignificant in both model specifications at around 3 km.” Again, if you want to understand how effects vary by distance, you should study that directly, not make conclusions based on statistical significance of various aggregates.

The big question, though, is do we trust the causal attribution: as stated in the article, “the assumption that in the absence of police killings, birth outcomes would have been the same for exposed and unexposed infants.” I don’t really buy this, because it seems that other bad things happen around the same time as police killings. The model includes indicators for census tracts and months, but I’m still concerned.

I recognize that my concerns are kind of open-ended. I don’t see a clear flaw in the main analysis, but I remain skeptical, both of the causal identification and of forking paths. (Yes, the above graphs show statistically significant results for the first two trimesters for some of the distance thresholds, but had the results gone differently, I suspect it would’ve been possible to find an explanation for why it would’ve been ok to average all three trimesters. Similarly, the distance threshold offers lots of opportunities to find statistically significant results.)

So I could see someone reading this post and reacting with frustration: the paper has no glaring flaws and I still am not convinced by its conclusion! All I can say is, I have no duty to be convinced. The paper makes a strong claim and provides some evidence—I respect that. But a statistical analysis with some statistical significance is just not as strong evidence as people have been trained to believe. We’ve just been burned too many times, and not just by the Diederik Stapels, Brian Wansinks, etc., but also by serious researchers, trying their best.

I have no problem with these findings being published. Let’s just recognize that they are speculative. It’s a report of some associations, which we can interpret in light of whatever theoretical understanding we have of causes of low birth weight. It’s not implausible that mothers behave differently in an environment of stress, whether or not we buy this particular story.

P.S. Awhile after writing this post, I received an update from Danning.
