
Korean translation of BDA3!

Hyunji Moon organized a translation of BDA3 into Korean! Here’s the link. And here is the note we wrote to accompany it:

Dear Korean readers of Bayesian Data Analysis:

We are happy that our book will reach a new audience through this Korean translation. Math is math and graphs are graphs, but the book is mostly built from sentences and paragraphs, and it requires careful effort to translate these into another language. So much of statistical advice consists in omission, in not recommending certain bad ideas or in presenting results one way and not another. In particular, when developing and applying Bayesian methods, we need to continually be aware of uncertainty; we must avoid the trap of overconfidence in our inferential summaries. We should be open about model assumptions and think hard about where these assumptions come from. We appreciate the effort that went into this translation and hope that this book will be a useful part of your self-education.

Yours
Andrew Gelman et al.

(In case your Korean is not so good, here’s the English version of the book.)

Coronavirus jailbreak

Emma Pierson writes:

My two sisters and I, with my friend Jacob Steinhardt, spent the last several days looking at the statistical methodology in a paper which has achieved a lot of press – Incarceration and Its Disseminations: COVID-19 Pandemic Lessons From Chicago’s Cook County Jail (results in supplement), published in Health Affairs. (Here’s the New York Times op-ed one of the authors published this weekend.) The central finding in the paper, quoting the abstract, is that community infections from people released from a single jail in Illinois in March are “associated with 15.7 percent of all documented novel coronavirus disease (COVID-19) cases in Illinois and 15.9 percent in Chicago as of April 19, 2020”. From the New York Times op-ed, “Roughly one in six of all cases in the city and state were linked to people who were jailed and released from this single jail”. On the basis of this claim, both the paper and the op-ed make a bunch of policy recommendations in the interest of public health – eg, reducing unnecessary arrests.

To be clear, we largely agree with these policy recommendations – and separate from this paper, there’s been a lot of good work documenting the dire COVID situation in jails and prisons. Mass incarceration was a public health emergency even before the COVID-19 pandemic, and tens of thousands of people have now contracted coronavirus in overcrowded jails and prisons. The Cook County Jail was identified as “the largest-known source of coronavirus infections in the U.S.” in April. Many incarcerated individuals have died after being denied appropriate medical care.

However, we also feel that the statistical methodology in the paper is sufficiently flawed that it does not provide strong evidence to back up its policy recommendations, or much evidence at all about the effect of jail releases on community infections.

We are concerned both about the statistical methods it uses and the effect sizes it estimates. We have shared our concerns with the first author, who helpfully shared his data and thoughts but did not persuade us that any of the concerns below are invalid. Given the high-profile nature of the paper, we thought the statistics community would benefit from open discussion. Depending on what you and your readers think, we may reach out to the journal editors as well.

The analysis relies on multivariate regression. It regresses COVID cases in each zip code on the number of inmates released from the Cook County Jail to that zip code, adjusting for variables which include the number or proportion of Black residents, poverty rate, public transit utilization rate, and population density. A number of these variables are highly correlated: for example, the correlation between the number of Black residents in a zip code and the number of inmates released in March is 0.86 in the full Illinois sample (and 0.84 in Chicago zip codes). The results in the paper testify to the dangers of multivariate regression on highly correlated variables: the signs on several regression coefficients (eg, public transit utilization) flip from positive in the bivariate regressions (Supplemental Exhibit 2) to negative in the multivariate regressions (Supplemental Exhibit 3). If the regression coefficients do in fact have causal interpretations, we should infer that if more people use public transit, COVID cases will decrease, which doesn’t seem plausible.
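
For intuition, a toy simulation in R (made-up data, not the paper's; the names and numbers are just for illustration) shows how this kind of sign flip can arise from collinearity alone: a predictor whose true coefficient is negative can show a positive sign in the bivariate regression because it is highly correlated with the predictor that actually drives the outcome.

# Toy illustration of a sign flip under collinearity (simulated data only)
library(MASS)
set.seed(1)
n  <- 355                                      # roughly the number of zip codes
X  <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))
x1 <- X[, 1]                                   # the predictor that drives the outcome
x2 <- X[, 2]                                   # a highly correlated covariate
y  <- x1 - 0.3 * x2 + rnorm(n)                 # y truly decreases with x2
coef(lm(y ~ x2))["x2"]                         # bivariate: comes out positive
coef(lm(y ~ x1 + x2))["x2"]                    # multivariate: comes out negative (the true sign)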

Given the small samples (50 zip codes in Chicago, and 355 overall) and highly correlated variables, it is unsurprising that the results in the paper are not robust across alternate plausible specifications. Here are two examples of this.

First, we examined how the effect size estimate (ie, the coefficient on how many inmates were released in March) varied depending on which controls were included (assessing all subsets of the original controls used in the paper) and which subset of the dataset was used (Chicago or the full sample). We found that the primary effect estimate varied by a factor of nearly 5 across specifications, and that for some specifications it was no longer statistically significant. Similarly, in Appendix Exhibit 2 (top table, Chicago estimate) the paper shows that including all zip codes rather than just those with at least 5 cases (which adds only 3 additional zip codes) renders the paper’s effect estimate no longer statistically significant.

Second, the results are not robust when fitting other standard regression models which account for overdispersion. As a robustness check, the paper fits a Poisson model to the case count data (Appendix Exhibit 4). Standard checks for overdispersion, like the Pearson χ² statistic from the Poisson regression, or fitting a quasipoisson model, imply the data is overdispersed. So we refit the Poisson model used in Appendix Exhibit 4 on the Chicago sample using three methods which are more robust to overdispersion: (1) we bootstrapped the confidence intervals on the regression estimate from the original Poisson model; (2) we fit a quasipoisson model; and (3) we fit a negative binomial model. All three of these methods yield confidence intervals which are much wider than the original Poisson confidence intervals, and which overlap zero. (For consistency with the original paper, we performed all these regressions using the same covariates used in Appendix Exhibit 4 for the original Poisson regression. We’re pretty sure that setup is non-standard, however, or at least it doesn’t seem to agree with the way you do it in your stop-and-frisk paper—it models cases as exponential in population, rather than using population as an offset term—so we also perform an alternate regression using a more standard covariate setup. Both methods yield similar conclusions.) Overall this analysis implies that, when overdispersion in the data is correctly modeled, the Chicago sample results are no longer statistically significant. Even in the full sample, the results are often not statistically significant depending on what specification is used.
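
In R, checks along these lines might look something like the following sketch (the data frame d and the variable names are hypothetical placeholders, not the actual analysis code):

# Poisson fit and overdispersion diagnostics (hypothetical variable names)
pois <- glm(cases ~ released + black_pop + poverty + transit + density,
            family = poisson, data = d)
sum(residuals(pois, type = "pearson")^2) / pois$df.residual  # well above 1 suggests overdispersion

# (1) Nonparametric bootstrap of the Poisson coefficient, resampling zip codes
boot_coef <- replicate(2000, {
  idx <- sample(nrow(d), replace = TRUE)
  coef(update(pois, data = d[idx, ]))["released"]
})
quantile(boot_coef, c(0.025, 0.975))

# (2) Quasipoisson: same point estimates, standard errors scaled by the estimated dispersion
qp <- update(pois, family = quasipoisson)
summary(qp)$coefficients["released", ]

# (3) Negative binomial: models the overdispersion directly
library(MASS)
nb <- glm.nb(cases ~ released + black_pop + poverty + transit + density, data = d)
confint(nb)["released", ]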

To be clear, I would remain skeptical of the basic empirical strategy in the paper even if the sample were ten times as large, and none of the above applied. But I’m curious about your and your readers’ thoughts on this – as well as alternate strategies for estimating the number of community infections jail releases cause.

In addition to the methodological concerns summarized above, the effect sizes estimated in the paper seem somewhat unlikely. The paper estimates that each person released from the Cook County Jail results in approximately two additional reported cases in the zip code they are released to. Two facts make us question this finding. First, the CDC estimates that cases are underreported by a factor of ten, so to cause two reported infections, the average released person would have to cause twenty infections. Second, not everyone who left Cook County Jail was infected. Positive test rates at Cook County Jail are 11%; the overall fraction of inmates who were infected is likely lower, since it is probable that individuals with COVID-19 were more likely to be tested, but we use 11% as a reasonable upper bound on the fraction of released people who were infected. Combining these two facts, in order for the average person released to cause two reported cases, the average infected person released in March would have to cause nearly two hundred cases by April 19 (when the paper sources its case counts). This isn’t impossible — there’s a lot of uncertainty about the reproductive number, the degree of case underreporting, and the fact that not all detainees are included in the booking dataset. Still, we coded up a quick simulation based on estimates of the reproductive number in Illinois, and it seems somewhat unlikely.

My reply:

There are three things going on here: language, statistics, and policy.

Language. In the article, the authors say, “Although we cannot infer causality, it is possible that, as arrested individuals are exposed to high-risk spaces for infection in jails and then later released to their communities, the criminal justice system is turning them into potential disease vectors for their families, neighbors, and, ultimately, the general public.” And in the op-ed: “for each person cycled through Cook County Jail, our research shows that an additional 2.149 cases of Covid-19 appeared in their ZIP code within three to four weeks after the inmate’s discharge.” I appreciate their careful avoidance of causal language. But “2.149”? That’s ridiculous. Why not just say 2.14923892348901827390823? Even setting aside all identification issues, you could never estimate this parameter to three decimal places. Indeed, the precision of that number is immediately destroyed by the vague “three to four weeks” that follows it. You might as well say that a dollop of whipped cream weighs 2.149 grams.

But I don’t like this bit in the op-ed: “Roughly one in six of all cases in the city and state were linked to people who were jailed and released from this single jail, according to data through April 19.” All the analysis is at the aggregate level. Cases have not been “linked to people” at the jail at all!

I do not mean to single out the authors of this particular article. The real message is that it’s hard to be precise when describing what you’ve learned from data. These authors were so careful in almost everything they wrote, but even so, they slipped up at one point!

Statistics. The supplementary material has a scatterplot of the data from the 50 zip codes in Chicago that had 5 or more coronavirus cases during this period:

First off, I’m not clear why the “Inmates released” variable is sometimes negative. I guess that represents some sort of recoding or standardization, but in that case I’d prefer to see the raw variable in the graph.

As to the rest of the graphs: the key result is that the rate of coronavirus cases is correlated with the rate of inmate releases but not correlated with poverty rate. That seems kinda surprising.

So it seems that the authors of this paper did find something interesting. If I’d been writing the article, I would’ve framed it that way: We found this surprising pattern in the zip code data, and here are some possible explanations. Next logical step is to look at data from other cities. Also it seems that we should try to understand better the selection bias in who gets tested.

Unfortunately, the usual way to write a research article is to frame it in terms of a conclusion and then to defend that conclusion against all comers. Hence robustness checks and all the rest. That’s too bad. I’d rather frame this as an exploratory analysis than as an attempt at a definitive causal inference.

My take-home point is not that this article is bad, but rather that the authors saw something interesting that would be worth tracking down, by comparing to other cities and also thinking about the selection bias in who got tested.

Policy. Unlike the journal Psychological Science, I support criminal justice reform. I’m in agreement with the op-ed that we should reduce the number of people in jail and prison. I’d say that whatever the story is regarding these coronavirus numbers.

The “scientist as hero” narrative

We’ve talked about the problems with the scientist-as-hero paradigm; see “Narrative #1” discussed here.

And, more recently, we’ve considered how this narrative has been clouding people’s thinking regarding the coronavirus; see here and here. That latter example is particularly bad because it involved a reporter with an undisclosed conflict of interest. But the scientist-as-hero narrative is a problem even when no lying or corruption is involved.

Recently Sander Greenland pointed us to a couple of interesting opinion pieces picking at this issue.

Stuart Ritchie, “There should never be heroes in science”:

Science, the cliché goes, is self-correcting. This is true in two senses – one lofty, one more mundane. The high-minded one is about the principle of science: the idea that we’re constantly updating our knowledge, finding out how our previous theories were wrong, and stumbling — sometimes only after a fashion — towards the truth. The second is that, for science to work, individual scientists need to do the grinding, boring work of correcting other scientists’ mistakes. . . . those self-critical, self-correcting principles of science simply don’t allow for hero-worship. Even the strongest critics of science need themselves to be criticised; those who raise the biggest questions about the way we do research need themselves to be questioned. Healthy science needs a whole community of sceptics, all constantly arguing with one another — and it helps if they’re willing to admit their own mistakes.

Hilda Bastian, “Science Heroes and Disillusion”:

“Never have heroes” – I’ve heard some version of that a lot in the last couple of months . . . But I think there can be, and indeed are, [evidence-based medicine] and science heroes. . . . it’s prosocial behavior, that’s altruistic, struggling against odds, and despite serious risks to the individual. It’s not hard to think of scientists, and healthcare providers struggling to ensure care is evidence-based, who fit that bill and are totally worthy of the label, hero. . . .

To me, the moral of these stories isn’t to not have heroes. It’s to learn a few things: to pick your heroes more carefully; to be wary of the champions of causes as well as anyone who is “against” something on the regular; to watch carefully how people respond to their critics; and to be on guard against the effects of charisma. And if you have a hero, don’t give their science a free pass.

Heroism is often in the interaction

Heroism is often in the interaction between the hero and the hero-worshipper. Some acts are flat-out heroic—rescuing someone from a burning building, etc.—and some people live exemplary lives, but, in general, if I say that somebody’s my hero, I’m saying that person is my hero, that his or her actions speak to me or inspire me in some way. Much of the heroism is in the feelings it inspires in others, kind of like induction (in the electrical sense, not the statistical or philosophical sense).

I think it’s fine to celebrate the great contributions of our scientific colleagues and predecessors. But, from my understanding of scientific practice (from my reading and my lived experience), I feel like the scientist-as-hero narrative is a distortion. People do come up with good ideas entirely on their own sometimes, but even then these ideas exist within a fabric of concepts, methods, and examples.

I try to live these principles when promoting my own work. Most of my research is collaborative, and I think it’s important to share credit. When people give me credit for things I did not do, I try to clear up the misunderstanding.

When Psychology Today wanted to write a story about my “roles as a skeptic and voice of scientific reform, as a conveyer of statistical concepts, and as a political scientist,” I replied:

It shouldn’t be about me—it shouldn’t be about personalities more generally, as one of the points of the scientific reform movement is that the data and scientific ideas are more important than individual careers. So I was thinking maybe instead of just profiling me, you could profile a group of people . . .

And they followed my suggestion! Their article, A Revolution Is Happening in Psychology, Here’s How It’s Playing Out, featured four people, not just one.

P.S. Tonks (pictured above, from Megan Higgs) has no use for heroes, but could use some food right now.

Regression and Other Stories is available!

This will be, without a doubt, the most fun you’ll have ever had reading a statistics book. Also I think you’ll learn a few things reading it. I know that we learned a lot writing it.

Regression and Other Stories started out as the first half of Data Analysis Using Regression and Multilevel/Hierarchical Models, but then we added a lot more and we ended up rewriting and rearranging just about all of what we had before. So this is basically an entirely new book. Lots has happened since 2007, so there was much new to be said. Jennifer and Aki are great collaborators. And we put lots of effort into every example.

Here’s the Table of Contents.

The chapter titles in the book are descriptive. Here are more dramatic titles intended to evoke some of the surprise you should feel when working through this material:

• Part 1:
– Chapter 1: Prediction as a unifying theme in statistics and causal inference.
– Chapter 2: Data collection and visualization are important.
– Chapter 3: Here’s the math you actually need to know.
– Chapter 4: Time to unlearn what you thought you knew about statistics.
– Chapter 5: You don’t understand your model until you can simulate from it.

• Part 2:
– Chapter 6: Let’s think deeply about regression.
– Chapter 7: You can’t just do regression, you have to understand regression.
– Chapter 8: Least squares and all that.
– Chapter 9: Let’s be clear about our uncertainty and about our prior knowledge.
– Chapter 10: You don’t just fit models, you build models.
– Chapter 11: Can you convince me to trust your model?
– Chapter 12: Only fools work on the raw scale.

• Part 3:
– Chapter 13: Modeling probabilities.
– Chapter 14: Logistic regression pro tips.
– Chapter 15: Building models from the inside out.

• Part 4:
– Chapter 16: To understand the past, you must first know the future.
– Chapter 17: Enough about your data. Tell me about the population.

• Part 5:
– Chapter 18: How can flipping a coin help you estimate causal effects?
– Chapter 19: Using correlation and assumptions to infer causation.
– Chapter 20: Causal inference is just a kind of prediction.
– Chapter 21: More assumptions, more problems.

• Part 6:
– Chapter 22: Who’s got next?

• Appendixes:
– Appendix A: R quick start.
– Appendix B: These are our favorite workflow tips; what are yours?

Here’s the preface, which among other things gives some suggestions of how to use this book as a text for a course, and here’s the first chapter.

The book concludes with a list of 10 quick tips to improve your regression modeling. Here’s the chapter, and these are the tips:

– 1. Think about variation and replication.

– 2. Forget about statistical significance.

– 3. Graph the relevant and not the irrelevant.

– 4. Interpret regression coefficients as comparisons.

– 5. Understand statistical methods using fake-data simulation.

– 6. Fit many models.

– 7. Set up a computational workflow.

– 8. Use transformations.

– 9. Do causal inference in a targeted way, not as a byproduct of a large regression.

– 10. Learn methods through live examples.

And here’s the index.

You can order the book here. Enjoy.

P.S. I saved the best for last. All the data and code for the book are on this beautiful Github page that Aki put together. You can run and modify all the examples!

Ugly code is buggy code

People have been (correctly) mocking my 1990s-style code. They’re right to mock! My coding style works for me, kinda, but it does have problems. Here’s an example from a little project I’m working on right now. I was motivated to write this partly as a followup to Bob’s post yesterday about coding practices.

I fit two models to some coronavirus testing data, with the goal of estimating the ratio of prevalence between two groups. First I fit the simple model implicitly assuming specificity and sensitivity of 1:

          mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
p[1]     0.023   0.000 0.004    0.015    0.020    0.023    0.026    0.033 33939    1
p[2]     0.020   0.000 0.005    0.012    0.017    0.020    0.023    0.031 31022    1
ratio    1.218   0.002 0.406    0.632    0.933    1.152    1.427    2.191 28751    1
lp__  -202.859   0.008 0.999 -205.567 -203.246 -202.554 -202.146 -201.879 17156    1

The huge effective sample sizes came because I ran Stan for 10,000 iterations in warmup and 10,000 in sampling, getting a ton of draws to get good precision on the 2.5% and 97.5% quantiles. This should not be a big deal for the application—it’s hard to think of a case when we should ever care about a super-precise estimate of the endpoints of an uncertainty interval—but it is convenient for getting a single set of numbers to report in the published article. Anyway, that’s not my main point. I just wanted to let you know why I was straying from my usual practice.
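
As a quick illustration of how noisy extreme quantiles are, here is the Monte Carlo standard deviation of an estimated 2.5% quantile at two simulation sizes (standard normal draws, purely for illustration):

# Monte Carlo sd of the estimated 2.5% quantile at two simulation sizes
set.seed(2)
sd(replicate(1000, quantile(rnorm(4000),  0.025)))   # roughly 0.04 with 4,000 draws
sd(replicate(1000, quantile(rnorm(40000), 0.025)))   # roughly 0.013 with 40,000 draws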

To continue, I then re-fit the model with the data on specificity and sensitivity. Prior tests were perfect on specificity—this had to do with the stringent cutoff they were using to declare a positive result—and also had high sensitivity, so the results shouldn’t change much, especially given that we’re focusing on the ratio. But the uncertainty in the ratio should go up a bit, I’d think.

Here were the results:

          mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
p[1]     0.023   0.000 0.005    0.015    0.020    0.023    0.026    0.033 34764    1
p[2]     0.020   0.000 0.005    0.012    0.017    0.020    0.023    0.031 34133    1
ratio    1.218   0.002 0.405    0.626    0.932    1.151    1.431    2.185 30453    1
lp__  -202.868   0.008 1.011 -205.582 -203.255 -202.560 -202.148 -201.878 17880    1

Not much different! The lower bound of the interval has changed only slightly, from 0.632 to 0.626, and the upper bound has barely moved. Interesting. Let’s check the inferences for the two probabilities individually . . . they’re exactly the same . . . Wait a minute!

Let me check the code.

OK, I have a bug in my R code. Here are the lines of code fitting that second model:

data_2 <-list(y_sample=y, n_sample=n, y_spec=130, n_spec=130, y_sens=137, n_spec=149)

compare_2 <- cmdstan_model("compare_1.stan")
fit_2 <- compare_2$sample(data = data_2, refresh=0, parallel_chains=4, iter_warmup=1e4, iter_sampling=1e4)
print(stanfit(fit_2), digits=3)

Do you see it? In the second line, I took the old file, compare_1.stan, and put it in the object for the new model, compare_2. Here's what I need to do instead:

compare_2 <- cmdstan_model("compare_2.stan")

And now it works! Well, almost. First it turned out I had some little errors, such as not correctly defining a vector variable, which were caught during compilation. Then, when the model finally ran, it gave a "variable does not exist" error, and I went back and caught the typo in the above data definition, where I had mistakenly defined n_spec twice in the copy-and-paste. (The repeat of 130 was not a mistake, though; their specificity testing data really were 130 negative tests out of 130.) So I fixed it to:

data_2 <-list(y_sample=y, n_sample=n, y_spec=130, n_spec=130, y_sens=137, n_sens=149)

And then the code really did run, yielding:

          mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
p[1]     0.019   0.000 0.007    0.004    0.014    0.019    0.024    0.032  5808 1.000
p[2]     0.016   0.000 0.007    0.002    0.011    0.016    0.021    0.030  8207 1.000
spec     0.994   0.000 0.005    0.983    0.991    0.995    0.998    1.000  4998 1.000
sens     0.913   0.000 0.023    0.862    0.898    0.915    0.929    0.952 19939 1.000
ratio    1.693   0.052 3.859    0.453    0.901    1.206    1.656    4.917  5457 1.001
lp__  -254.658   0.020 1.562 -258.611 -255.451 -254.309 -253.501 -252.688  5840 1.001

The 95% interval for the ratio is now super-wide (and the estimated interval endpoint is itself noisy, even after 10,000 iterations; re-fitting the model with a new random seed yields an upper interval of 4.5), which makes sense because of the small possibility from the data that the false positive rate is high enough to mess things up.
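
A back-of-the-envelope check in R (just arithmetic on the test-error adjustment, not the Stan model itself) shows why imperfect specificity matters so much at a prevalence of around 2%: a specificity of 0.985, still consistent with 130 out of 130 negative calibration tests and with the 0.983 lower bound above, can account for a raw positive rate of 1.5% even when the true prevalence is zero.

# Expected positive-test rate as a function of prevalence, sensitivity, specificity
p_pos <- function(p, sens, spec) p * sens + (1 - p) * (1 - spec)
p_pos(p = 0.02, sens = 0.91, spec = 1.000)   # ~0.018: what the naive model implicitly assumes
p_pos(p = 0.02, sens = 0.91, spec = 0.985)   # ~0.033: false positives now matter
p_pos(p = 0.00, sens = 0.91, spec = 0.985)   # ~0.015 even with zero true prevalence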

Anyway, the point of this post is just to demonstrate how, with sloppy code, you really have to keep your hands on the wheel at all times. Even such simple fitting is hard to do if you're not continually checking results against your expectations.

I expect that the particular things that happened to me here would not be so bad for people who use more modern coding practices and workflows. From the other direction, there are lots of researchers whose code and workflow are much sloppier than mine, and they'll have to be especially eagle-eyed to avoid tripping up and even publishing erroneous results.

To return to the title of this post: Ugly code is buggy code not just because it's easier to introduce bugs and harder to find them if your code is poorly structured. It's also that, for code to be truly non-buggy, it doesn't just have to be free of bugs. You also have to know it's free of bugs, so that you can use it confidently.

Also it took me about 20 minutes to write this post. My blogging workflow is efficient (or perhaps inefficient in a larger sense, in that blogging is so easy that I do lots of it, perhaps taking the place of more productive activities).

Drunk-under-the-lamppost testing

Edit: Glancing over this again, it struck me that the title may be interpreted as being mean. Sorry about that. It wasn’t my intent. I was trying to be constructive and I really like that analogy. The original post is mostly reasonable other than on this one point that I thought was important to call out.

I’m writing a response here to Abraham Mathews’s post, Best practices for code review, R edition, because my comment there didn’t show up and I think the topic’s important. Mathews’s post starts out on the right track, then veers away from best practices in the section “What code should be reviewed?” where he says,

…In general, we would never want to review four or five different files at a time. Instead, code review should be more targeted and the focus must be on the code that requires more prudent attention. The goal should be to review files or lines of code that contain complex logic and may benefit from a glance from other team members.

Given that guidance, the single file from the above example that we should never consider for code review is basic_eda.R. It’s just a simple file with procedural code and will only be run once or twice. …

The standard for code review in industry and large open-source projects is to review every piece of code before it’s merged. The key to this strategy is ensuring that every line of code has been viewed by at the very least the author and one trusted team member. Sampling-based code review that’s biased to where the group thinks errors may be has the drunks-under-the-lamppost problem of not covering a large chunk of code. Software developers obsess on test coverage, but it’s very challenging and we haven’t been able to get there with Stan. If we were developing flight control or pacemaker or transactional banking software, the standards would be much higher.

Typically, APIs are designed top-down from a client’s perspective (the client being a human or another piece of code that calls the API), then coded bottom up. Each component is reviewed and unit tested before being merged. The key to this strategy is being able to develop high-level modules with the confidence that the low-level pieces work. It may sound like it’s going to take longer to unit test as you go, but the net is a huge time savings with the upside of having more reliable code.

It’s also critical to keep the three key components of software development in synch: documenting (i.e., design), testing, and coding. In larger projects, features of any size always start with a functional spec outlining how it works from the client point of view—that’s usually written like the eventual documentation will be written because that’s what says what code does. As with the doc, the key here is to make sure the API that is being delivered is both easy to document and easy to test. For example, large functions with intertwined, dependent arguments, as often found in REPL languages like R, Python, and Julia, produce what programmers call a “bad smell”, precisely because such functions are hard to document and test.

Consider the rgamma function in R. It takes three parameter arguments, shape, rate, and scale. Experienced statisticians might know that scale and rate parameters are conventionally inverses, yet this isn’t mentioned in the doc anywhere other than implicitly with the values of the default arguments. What happens if you supply both scale and rate? The doc doesn’t say, so I just tried it. It does not return an error, as one might expect from languages that try to keep their arguments coherent, but rather uses the rate and ignores the scale (order doesn’t matter). At the point someone proposed the rgamma function’s API, someone else should’ve piped up and said, “Whoa, hang on there a second, cowpoke; this function’s going to be a mess to test and document because of the redundant arguments.” With shape not getting a default and rate and scale being inverses, the tests need to cover behavior for all 8 possible input patterns. The doc should really say what happens when both scale and rate are specified. Instead, it just says “Invalid arguments will result in return value ‘NaN’, with a warning.” That implies that inconsistent rate and scale arguments (e.g., rate = 10, scale = 10) aren’t considered invalid arguments.
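
A quick probe of this behavior, for anyone who wants to check their own R version (wrapped in tryCatch in case a given version rejects the redundant arguments outright; what it prints may differ across installations):

# Probe rgamma's handling of redundant rate and scale arguments
set.seed(1); x_rate  <- rgamma(5, shape = 2, rate = 10)     # rate only
set.seed(1); x_scale <- rgamma(5, shape = 2, scale = 10)    # scale only
set.seed(1)
x_both <- tryCatch(rgamma(5, shape = 2, rate = 10, scale = 10),
                   error = function(e) e)                   # both, inconsistently
if (inherits(x_both, "error")) {
  print(x_both)                     # this version rejects the redundant arguments
} else {
  print(identical(x_both, x_rate))  # TRUE would mean rate wins and scale is ignored
  print(identical(x_both, x_scale)) # TRUE would mean scale wins and rate is ignored
}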

I should also say that my comments above are intended for API design, such as an R package one might want to distribute or a piece of infrastructure a lab or company wants to support. I wouldn’t recommend this style of functional design and doc and testing for exploratory research code, because it’s much harder to design up front and isn’t intended to be portable or distributed beyond a tiny group of collaborators. I’m not saying don’t test such code, I’m just saying the best practices there would be different than for designing APIs for public consumption. For example, no need to test Windows and Linux and Mac if you only ever target one platform, no reason to test all the boundary conditions if they’re never going to be used, and so on. It absolutely still helps to design top down and write modular reusable components bottom up. It’s just usually not apparent what these modules will be until after many iterations.

P.S. I highly recommend Hunt and Thomas’s book, The Pragmatic Programmer. It’s a breeze to read and helped me immensely when I was making the move from a speech recognition researcher writing experimental code to an industrial programmer. Alternatives I’ve read suffer from being too long and pedantic, too dogmatic, and/or too impractical.

P.P.S. I’ve been meaning to write a blog post on the differences in best practices in research versus publicly distributed code. I know they’re different, but haven’t been able to characterize what I’d recommend in the way of methodology for research code. Maybe that’s because I spend maybe one day/month on personal or group research code (for example, the code Andrew and I developed for an analysis of SARS-CoV-2 seroprevalence), and nineteen days a month working on Stan API code. I’d be curious as to what other people do to test and organize their research code.

“Time Travel in the Brain”

Natalie Biderman and Daphna Shohamy wrote this science article for kids. Here’s the abstract:

Do you believe in time travel? Every time we remember something from the past or imagine something that will happen in the future, we engage in mental time travel. Scientists discovered that, whether we mentally travel back into the past or forward into the future, some of the same brain regions are activated. One of those regions is the hippocampus, a brain structure famous for its role in building long-term memories. Damage to the hippocampus causes memory problems, but it also impairs the ability to imagine future experiences. This brain connection between remembering the past and thinking about the future suggests that memory, planning, and decision-making may be deeply related. The ability to form memories allows us to reminisce about the past. But maybe the ability to form memories also evolved to allow us to think about and plan for the future.

And here’s how the article begins:

Let us start with a riddle. Try to figure out how to throw a ping-pong ball so that it will go a short distance, come to a dead stop, and then reverse its direction. You cannot throw the ball such that it will curve back to you (like a frisbee), and the ball is not allowed to touch or be attached to any object. What can you do? The solution to this riddle is described in the footnote below. Yes, it is a simple solution. Do not worry if you did not figure it out immediately, many people do not. The interesting question is why. . . .

And they conclude:

When we think about memory, we usually think about the past. Indeed, for more than a century, memory researchers focused on how people and animals store and recall past experiences and which brain structures support those functions. More recent research suggests a different view of memory. Recent findings show that the hippocampus—a brain region responsible for memory—is active when people imagine future events. Additionally, in patients with amnesia, damage to the hippocampus impairs the ability to imagine the future. Moreover, when rats navigate their environments, neurons in the hippocampus “simulate” future paths that will enable them to get to a desired outcome. Together, these findings suggest that the hippocampus and its connections to other brain regions build upon past experiences to make predictions about future events. . . .

This is interesting, not just for kids. I’m posting it here because Natalie worked on it as a student in our Communicating Data and Statistics class. Previous final projects for that course include ShinyStan.

Statistical controversy on estimating racial bias in the criminal justice system

1. Background

A bunch of people have asked me to comment on these two research articles:

Administrative Records Mask Racially Biased Policing, by Dean Knox, Will Lowe, and Jonathan Mummolo:

Researchers often lack the necessary data to credibly estimate racial discrimination in policing. In particular, police administrative records lack information on civilians police observe but do not investigate. In this article, we show that if police racially discriminate when choosing whom to investigate, analyses using administrative records to estimate racial discrimination in police behavior are statistically biased, and many quantities of interest are unidentified—even among investigated individuals—absent strong and untestable assumptions. Using principal stratification in a causal mediation framework, we derive the exact form of the statistical bias that results from traditional estimation. We develop a bias-correction procedure and nonparametric sharp bounds for race effects, replicate published findings, and show the traditional estimator can severely underestimate levels of racially biased policing or mask discrimination entirely. We conclude by outlining a general and feasible design for future studies that is robust to this inferential snare.

Deconstructing Claims of Post-Treatment Bias in Observational Studies of Discrimination, by Johann Gaebler, William Cai, Guillaume Basse, Ravi Shroff, Sharad Goel, and Jennifer Hill:

In studies of discrimination, researchers often seek to estimate a causal effect of race or gender on outcomes. For example, in the criminal justice context, one might ask whether arrested individuals would have been subsequently charged or convicted had they been a different race. It has long been known that such counterfactual questions face measurement challenges related to omitted-variable bias, and conceptual challenges related to the definition of causal estimands for largely immutable characteristics. Another concern, raised most recently in Knox et al. [2020], is post-treatment bias. The authors argue that many studies of discrimination condition on intermediate outcomes, like being arrested, which themselves may be the product of discrimination, corrupting statistical estimates. Here we show that the Knox et al. critique is itself flawed, suffering from a mathematical error. Within their causal framework, we prove that a primary quantity of interest in discrimination studies is nonparametrically identifiable under a standard ignorability condition that is common in causal inference with observational data. More generally, though, we argue that it is often problematic to conceptualize discrimination in terms of a causal effect of protected attributes on decisions. We present an alternative perspective that avoids the common statistical difficulties, and which closely relates to long-standing legal and economic theories of disparate impact. We illustrate these ideas both with synthetic data and by analyzing the charging decisions of a prosecutor’s office in a large city in the United States.

I heard about these papers a couple years ago but didn’t think too hard about them. Then recently a bunch of different people contacted me and asked for my opinion. Apparently there’s been some discussion on twitter. The Knox et al. paper was officially published in the journal so maybe that was what set off the latest round of controversy.

Also, racially biased policing is in the news. Not so many people are defending the police—even the people saying the police aren’t racially biased are making that claim based on evidence that police mistreat white people too—but I suppose that much of the current debate revolves around the idea that the police enforce inequality. So whatever the statistics are regarding police mistreatment of suspects, these are part of a larger issue of the accountability of police use of force. We’ve been hearing a lot about police unions, which raises other political questions, such as what is it like for police officers who disagree with the positions taken by their leadership.

Anyway, that’s all part of the background for how this current academic dispute has become such a topic of discussion.

2. Disclaimer

I have no financial interest in this issue, nor do I have any direct academic stake. Some colleagues and I did some research twenty years ago on racial disparities in police stops, but neither of the new articles at hand has any criticism of what we did, I guess in part because our analysis was more descriptive than causal.

I do have a personal interest, though, as Sharad Goel and Jennifer Hill are friends and collaborators of mine. So my take on these papers is kind of overdetermined: Jennifer is my go-to expert on causal inference (although we have never fought more than over the causal inference chapters in our book), and I have a lot of respect for Sharad too as a careful social science researcher. I’ll give you my own read on these articles in a moment, but I thought I should let you know where I’m coming from.

Just to be clear: I’m not saying that I expect to agree with Gaebler et al. because Sharad and Jennifer are my friends—I disagree with my friends all the time!—rather, I’m saying that, coming into this one, my expectation is that, when it comes to causal inference, Jennifer knows what she’s talking about and Sharad has a clear sense of what’s going on in this particular application area.

3. My read of the two articles

Suppose you’re studying racial discrimination, and you compare outcomes for whites to outcomes for blacks. You’ll want to adjust for other variables: mere differences between the two groups do not demonstrate discrimination. In addition, you won’t be working with the entire population; you’ll be working with the subset who are in some situation. For example, Knox et al. look at the results of police-civilian encounters: these are restricted to the subset of civilians who are in this encounter in the first place.

Knox et al. make the argument that analyses using administrative data are implicitly conditioning on a post-treatment variable because they subset on whether you were stopped or not. To use the words in their title, adjusting for or conditioning on administrative records can mask racially biased policing. Knox et al. point out that this bias can be viewed as an example of conditioning on intermediate outcomes. It’s a well known principle in causal inference that you can’t give regression coefficients a direct causal interpretation if you’re conditioning on intermediate outcomes; see, for example, section 19.6, “Do not adjust for post-treatment variables,” of Regression and Other Stories, but this is standard advice—my point in citing ourselves here is not to claim any priority but rather to say that this is standard stuff.

Knox et al. are right that conditioning on post-treatment variables is a common mistake. R. A. Fisher made that mistake once! Causal inference is subtle. In general, there’s a tension between the principles of adjusting for more things and the principle of not adjusting away the effect that you’re interested in studying. And, as Knox et al. point out, there’s an additional challenge for criminal justice research because of missing data: “police administrative records lack information on civilians police observe but do not investigate.” This is an important point, and we should always be aware of the limitations of our data.
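
A small simulation in R (my own toy setup, not Knox et al.'s model) shows how conditioning on a post-treatment variable, in this case being stopped, can mask discrimination that happens at the stop stage:

# Toy simulation: discrimination in who gets stopped, none in use of force,
# yet the naive regression among those stopped suggests the reverse.
set.seed(123)
n <- 1e5
z <- rbinom(n, 1, 0.5)                             # z = 1: minority civilian
u <- rnorm(n)                                      # unobserved behavior/suspicion index
stopped <- rbinom(n, 1, plogis(-1 + 1.5 * z + u))  # race affects the stop decision
force   <- rbinom(n, 1, plogis(-2 + 2 * u))        # force depends on behavior only
# "Administrative data" analysis: condition on being stopped
fit <- glm(force ~ z, family = binomial, subset = stopped == 1)
coef(fit)["z"]   # typically negative here, even though the only discrimination
                 # in this simulated world is in the decision to stop

The selection does the work: minorities get stopped at lower levels of the unobserved behavior index, so among those stopped they look less likely to receive force.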

From a substantive point of view, the message that I take from Knox et al. is to be careful with what might seem to be kosher regression analyses purporting to estimate the extent of discrimination. I would not want to read Knox et al. as saying that regression methods can’t work here. You have to be aware of what you are conditioning on, but if you interpret the results carefully, you should be able to estimate a causal effect of one stage in the process. The concern is that it can be easy for the most important effects to be hidden in the data setup.

Gaebler et al. discuss similar issues. However, they disagree with Knox et al. on technical grounds. Gaebler et al. argue that there exist situations in which you may be able to estimate causal effects of discrimination or perception of race just fine, even when conditioning on variables that are affected by race. It’s just that these won’t capture all aspects of discrimination in the system; they will capture the discrimination inherent solely at that point in the process (for example, when a prosecutor makes a decision about charging).

For a simple example, suppose you have stone-cold evidence of racial bias in sentencing. That’s still conditional on who gets arrested, charged, and prosecuted. So it doesn’t capture all potential sources of racial bias, not by a long shot. Or, maybe you find no racial bias in sentencing. That doesn’t mean that total racial bias is zero; it just means that you don’t find anything at that stage in the process.

I see Gaebler et al. as being in agreement with Knox et al. on this key substantive point. The distinction is that Knox et al. are saying that in principle you can’t estimate effects of discrimination using the basic causal regression, whereas Gaebler et al. are saying that it can be, and often is, possible to estimate these effects, even though these effects are not the whole story.

Gaebler et al. give an example where the estimand corresponds to what one would measure in a randomized controlled trial where the stated race of arrested individuals was randomly masked on the police narratives that prosecutors use when making their decisions. With this setup, it is possible to estimate causal racial bias in a particular part of the system.

For another example, suppose you did a study of racial discrimination and you included, as a predictor, the neighborhood where people were living. And you found some amount of discrimination X, a difference between what happens to whites and to blacks, conditional on neighborhood. That could be a causal effect, but, even if X = 0, that would not mean that there is no racial discrimination. If there is discrimination by neighborhood, this can have disparate impact.
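
The same sort of toy simulation (again my own construction, just to fix ideas) makes this point: when all of the disparity flows through neighborhood, the coefficient on race conditional on neighborhood is essentially zero even though the overall disparity is large.

# Toy simulation: disparate impact through neighborhood, no direct race effect
set.seed(456)
n <- 1e5
z       <- rbinom(n, 1, 0.5)                        # z = 1: Black resident
nbhd    <- rbinom(n, 1, plogis(-1 + 2 * z))         # residential segregation
stopped <- rbinom(n, 1, plogis(-2 + 1.5 * nbhd))    # policing varies by neighborhood only
coef(glm(stopped ~ z,        family = binomial))["z"]   # large overall disparity
coef(glm(stopped ~ z + nbhd, family = binomial))["z"]   # roughly zero given neighborhood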

4. From the study of discrimination to the study of disparate outcomes

One thing I especially like about the Gaebler et al. article is that they move beyond the question of racial discrimination to the more relevant, I think, issue of disparate outcomes: “much of the literature has framed discrimination in terms of causal effects of race on behavior, but other conceptions of discrimination, such as disparate impact, are equally important for assessing and reforming practices.”

Again, suppose that all discrimination were explainable not by race but by what neighborhood you live in. People are segregated geographically, so discrimination by neighborhood really would be a form of racial discrimination, but this could be set up so that the causal effect of race itself (for example, in police or prosecutor decisions) would be zero. But from the Knox et al. perspective, you could say that this control variable (the ethnic and racial composition of your neighborhood) is an intermediate outcome. I prefer Gaebler et al.’s framing, but ultimately I think both articles are coming to the same point here.

And here’s how Gaebler et al. end it:

The conclusions of discrimination studies are generally limited to specific decisions that happen within a long chain of potentially discriminatory actions. Quantifying discrimination at any one point (e.g., charging decisions) does not reflect specific or cumulative discrimination at other stages—for example, arrests. Looking forward, we hope our work offers a path toward quantifying disparities, and provokes further interest in the subtle conceptual and methodological issues at the heart of discrimination studies.

I agree. Lots more can be said on this topic, and I recommend Gaebler et al. as a starting point.

5. So why the controversy?

Given that, from my perspective, the two papers have such similar messages, why the controversy? Why the twitter war?

I guess I can see where the authors of both papers are coming from. From the perspective of Gaebler et al., the distinction is clear. Knox et al. have made a mathematically false statement leading to a confusion about the identifiability of causal effects within the justice system. Knox et al. don’t seem to agree regarding the mathematical error, but in any case they take the position that their result is correct in practice, labeling Gaebler et al.’s results as “silly logic puzzles about knife-edge scenarios” that “aren’t useful to applied researchers.”

The concern I have about Knox et al. is that they make a claim that’s too strong. They say that they “show that when there is any racial discrimination in the decision to detain civilians . . . then estimates of the effect of civilian race on subsequent police behavior are biased absent additional data and/or strong and untestable assumptions,” that “the observed difference in means fails to recover any known causal quantity without additional, and highly implausible, assumptions,” and that this estimation strategy “[cannot] be rehabilitated by simply redefining the quantity of interest” or by adding additional covariates.

In some sense, sure, you can’t make any causal inference without strong assumptions. Even in a textbook randomized controlled experiment, all you’re doing is estimating the average causal effect for the people in the study (or a larger population for which your participants can be considered a random sample). When your data are observational, you need even more assumptions. And don’t get me started on instrumental variables, regression discontinuity, etc.: all these methods can be useful, but they don’t come without lots of modeling.

But that can’t be what Knox et al. are trying to say. The statement, “estimates of the effect of X on Y are biased absent additional data and/or strong and untestable assumptions,” is true for any X and Y. It’s a good point to make, quite possibly worth a paper in the APSR, but it has nothing in particular to do with criminal justice. The relevant point to make, and hence the point I will extract from Knox et al., is that, in this particular example of arrests, the problem of selection is, in the words of Morrissey, really serious. But you can use standard methods of causal inference to estimate causal effects here, as long as you’re careful in interpreting the results and don’t take the estimate of discrimination in one part of the system as representing the entirety of racial bias in the whole process.

My own resolution to this is to take Knox et al.’s message about the difficulty of causal inference under selection, as applied to the important problem of estimating disparities in the criminal justice system, as a caution to be careful in your causal thinking: Define your causal effects carefully, and recognize their limitations (as in the above example where the causal effect of race in arrest or sentencing decisions, conditional on neighborhood, might be of interest, but at the same time we realize this particular causal effect would only capture one of many sources of racial discrimination). I like Gaebler et al.’s framing of the problem in terms of the specificity of their definition of the treatment variable. I think the authors of both papers would agree that overinterpretation of naive regressions has been a problem, hence the value of this work.

6. tl;dr summary

From both papers I draw the same substantive conclusion, which is that simple, or even not-so-simple, regressions of outcome on race and pre-treatment predictors can give misleading results if you’re trying to understand the possibility of racial discrimination in the criminal justice system without thinking carefully about these issues.

There are also some technical disputes. It’s my impression that Gaebler et al. are correct on these issues, but, now that I have a sense of the statistical issues here, I’m not so interested in the theorem, or the explanation of error in the theorem, or the error in the explanation of the error in the theorem, or the corrected explanation of the error in the theorem. My read of this particular dispute is that Knox et al. were trying to prove something that is not quite correct. The correct statement is that, even when standard regression-based inferences allow you to estimate a causal effect of race at some stage in the criminal justice process, this causal effect is conditional on everything that came before, and so a focus on any particular causal effect will not catch other biases in the system. The incorrect statement is that you can’t estimate causal effects in standard regression-based inferences. These local causal effects don’t tell the full story, but they’re still causal effects, and that’s the technical point that Gaebler et al. are making.

My tl;dr is different from that of political scientist Ethan Bueno de Mesquita, who dismissed the Gaebler et al. paper as “dreck.”

I disagree with the “dreck” comment.

Don’t get me wrong. I have no general problem with labeling papers as dreck; I’ve many times called published papers “crappy,” and this blog is no stranger to pottymouth. It’s just that I like the Gaebler et al. paper. I disagree with the above-quoted assessment of its quality. I can’t really say more unless I hear Bueno de Mesquita’s reasons for giving it the “dreck” label. He and others should feel free to respond in the comments.

New England Journal of Medicine engages in typical academic corporate ass-covering behavior

James Watson (not the racist dude who, in 1998, said that a cancer cure was coming in 2 years) writes:

About a month ago, when the infamous Lancet hydroxychloroquine/chloroquine paper was still “real science” (i.e. in the official scientific record), we decided to put extra pressure on the authors by writing an open letter to the New England Journal of Medicine. Prof Mehra and his Surgisphere colleague Sapan Desai had published a paper derived from the same “database” in the New England Journal of Medicine a few weeks before their Lancet publication. This was an important but rather boring paper (ACE inhibitors not dangerous in COVID). I guess because it did not have immediate ramifications, none (to our knowledge) had critiqued it openly. But on close examination it didn’t take long to see glaring inconsistencies. Numbers from Turkey and the UK (they made the mistake of giving country-level numbers!) were basically impossible, and the relationship between age and mortality didn’t agree with later reports. The open letter can be found here. A few hours after we posted this letter online NEJM posted an expression of concern re Mehra et al, and hours later Lancet did the same (I’m not claiming full causality but it could be argued!). A few days later both papers were retracted.

On the 3rd of July we got the (expected) reply from NEJM:

Dear Dr. Watson,

I am sorry to inform you that your submission, “Expression of concern regarding data integrity and results,” has not been accepted for publication in the Journal. It was evaluated by members of our editorial staff. After considering its focus, content, and interest, we made the editorial decision not to consider your submission further. We are informing you of this promptly so that you can submit it elsewhere.

Thank you for the opportunity to consider your submission.

Sincerely yours,

Eric Rubin, MD, PhD
Editor-in-Chief

I’m not sure that one month counts as prompt given the importance of the concerns, and given that 174 researchers/clinicians from around the world signed the letter. I can understand people not liking these massive group signatory letters, but this aspect is important in my opinion. Every statement was carefully reviewed by each signatory – many changes were proposed and made. Most of the signatories are individuals actively working on COVID-19, and the fact that none of them had heard of or knew about Surgisphere’s activities gave considerable confidence that it was all fraudulent.

It’s a shame that NEJM didn’t publish our letter. I don’t understand why they can’t openly accept that a mistake was made. Publishing the letter formally records the worrying patterns that point to pure data fabrication and fraud. And it highlights that Prof Mehra and co-authors made a serious error appending their names to a publication with no knowledge of data provenance. In The Lancet it is stated that the corresponding author (Mehra) and coauthor ANP (Patel) had full access to all the data in the study and had final responsibility for the decision to submit for publication. Mehra and Patel must have signed this – yet when retracting the paper they claimed they did not have access to the data! (and asked a third party to audit the data).

As discussed many times on your blog, this shows a major problem in academic publishing: there are few (any?) incentives for post-publication review. And journals don’t take post-publication review seriously. Journals like NEJM and Lancet appear to believe that 2-5 anonymous reviewers will be better at spotting problems than hundreds/thousands of individuals who read it after publication. This is obviously silly. So why not encourage post-publication review instead of stamping it out?

I recently came across this somewhat remarkable example of how post-publication review might happen with a responsible and modern journal; not surprisingly, it’s from PLoS.

A very nice example of transparency.

Further background here.

It’s that never-back-down “culture of poverty” attitude. But . . . the editors of the New England Journal of Medicine are not poor people. They have no excuse!

Priors on effect size in A/B testing

I just saw this interesting applied-focused post by Kaiser Fung on non-significance in A/B testing. Kaiser was responding to a post by Ron Kohavi. I can’t find Kohavi’s note anywhere, but you can read Kaiser’s post to get the picture.

Here I want to pick out a few sentences from Kaiser’s post:

Kohavi correctly points out that for a variety of reasons, people push for “shipping flat”, i.e. adopting the Test treatment even though it did not outperform Control in the A/B test. His note carefully lays out these reasons and debunks most of them.

The first section deals with situations in which Kohavi would accept “shipping flat”. He calls these “non-inferiority” scenarios. My response to those scenarios was posted last week. I’d prefer to call several of these quantification scenarios, in which the expected effect is negative, and the purpose of A/B testing is to estimate the magnitude. . . .

The “ship flat or not” decision should be based on a cost-benefit analysis. . . .

This all rang a bell because I’ve been thinking about priors for effect sizes in A/B tests.

– If you think the new treatment will probably work, why test it? Why not just do it? It’s because by gathering data you can be more likely to make the right decision.

– But if you have partial information (characterized in the above discussion by a non-significant p-value) and you have to decide, then you should use the decision analysis.

– Often it makes sense to consider these negative-expectation scenarios. Most proposed innovations are bad ideas, right? (There’s a little numerical sketch of this at the end of the post.)

Also this bit from Kohavi:

The problem is exacerbated when the first iteration shows stat-sig negative, tweaks are made, and the treatment is iterated a few times until we have a non-stat sig iteration. In such cases, a replication run should be made with high power to at least confirm that the org is not p-hacking.

– This is related to the idea that A/B tests typically don’t occur in a vacuum; we have a series of innovations, experiments, and decisions.

I want to think more about all this.
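In the meantime, here’s the little sketch promised above: a toy decision analysis in R with made-up numbers (they’re mine, not Kohavi’s or Kaiser’s), putting a prior on the true lift that says most innovations are slightly harmful, then updating on a noisy, “flat” test estimate.

# Toy decision analysis for a "flat" A/B result (all numbers hypothetical).
# Prior on the true lift, in arbitrary units: most innovations are a bit harmful.
prior_mean <- -0.2
prior_sd   <- 1

# Estimated lift from the A/B test and its standard error (not significant).
est <- 0.3
se  <- 0.5

# Normal-normal conjugate update for the posterior on the true lift.
post_prec <- 1/prior_sd^2 + 1/se^2
post_mean <- (prior_mean/prior_sd^2 + est/se^2) / post_prec
post_sd   <- sqrt(1/post_prec)

# Posterior probability that shipping is actually beneficial.
prob_positive <- 1 - pnorm(0, post_mean, post_sd)
c(post_mean = post_mean, prob_positive = prob_positive)

The ship/no-ship call then comes from weighing the posterior expected lift against the cost of shipping, not from whether the test cleared p < 0.05.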

Who were the business superstars of the 1970s?

Last month, we said:

Who are today’s heroes? Not writers or even musicians? No, our pantheon of culture heroes are: rich men, athletes, some movie and TV stars, a few politicians, some offbeat intellectuals like Nate Silver and Nassim Taleb . . .

I guess I should also add social media stars like whoever is getting a million youtube subscribers or twitter followers or whatever.

In many of these categories, you can map back to similar groups of heroes from fifty years ago. Now we have Bernie Sanders, then we had Ronald Reagan. Now we have Tucker Carlson, then we had William F. Buckley. Now LeBron James, then Billie Jean King. Rod Carew has passed on the mantle to Mike Trout, Clint Eastwood has given way to Tom Hanks. Etc.

Two areas where I see a lack of parallelism are music and business.

Back in the 1970s there must have been over 100 major rock stars, each with his or her own distinctive image: just compare Elton John, David Bowie, Gene Simmons, and so forth. Nowadays there are some very popular musicians, but I don’t think they have the same role in society (and, no, I don’t think it’s just that I’m older now myself). Now, music is just music; it doesn’t represent such a large chunk of the culture.

The other difference is that now we have business heroes. It started with Steve Jobs, maybe. Then there was Bill Gates—I remember the enjoyment that people used to show, back in the 90s, just talking about how rich he was. And now there’s a whole menagerie of rich people to choose from, including the richest (Gates, Bezos, etc.), the offbeat (Musk, Branson, etc.), Oprah, . . . take your pick.

But back in the 1970s we did not have a class of people who were business heroes. There were occasional famous businessmen, but they did not form a category.

Who were the business superstars of that era? There was Ted Turner. There was Lee Iacocca. And . . . is that it? There must be a few more that I’m forgetting. But not a whole category of them. Steve Jobs and the guy who founded Atari were around in the 70s, but I don’t think they’d reached hero status yet.

Inference for coronavirus prevalence by inverting hypothesis tests

Panos Toulis writes:

The debate on the Santa Clara study actually led me to think about the problem from a finite-sample inference perspective. In this case, we can fully write down the density
f(S | θ) in known analytic form, where S = (vector of) test positives, θ = parameters (i.e., sensitivity, specificity and prevalence).
Given observed values s_obs we can invert a test to obtain an exact confidence set for θ.

I wrote down one such procedure and its theoretical properties (See Procedure 1.) I believe that finite sample validity is a benefit over asymptotic/approximate procedures such as bootstrap or Bayes, which may add robustness. I compare results in Section 4.3.

I recently noticed that in your paper with Bob, you discuss this possibility of test inversion in Section 6. What I propose is just one way to do this style of inference.

From my perspective, I don’t see the point of all these theorems: given that the goal is to generalize to the larger population, I think probability modeling is the best way to go. And I have problems with test inversion for general reasons; see here and, in particular, this comment here. Speaking generally, I am concerned that hypothesis-test inversions will not be robust to assumptions about model error.

But I recognize that other people find the classical hypothesis-testing perspective to be useful, so I’m sharing this paper.

Also relevant is this note by Will Fithian that also uses a hypothesis-testing framework.
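For readers who want to see the general idea without the theorems, here’s a stripped-down sketch in R. To be clear, this is not Toulis’s Procedure 1: it assumes sensitivity and specificity are known and inverts an exact binomial test over prevalence only, with hypothetical numbers roughly in the ballpark of the Santa Clara setting.

# Sketch of test inversion for prevalence with known sensitivity/specificity.
sens <- 0.80     # assumed sensitivity
spec <- 0.995    # assumed specificity
n_tests <- 3330  # hypothetical number of tests
s_obs   <- 50    # hypothetical number of positive results

# Probability of a positive test at each candidate prevalence:
prev_grid <- seq(0, 0.10, by = 0.0001)
p_pos <- prev_grid*sens + (1 - prev_grid)*(1 - spec)

# Keep the prevalences not rejected by an exact binomial test at the 5% level
# (treating the accepted set as an interval, which it is here).
pvals <- sapply(p_pos, function(p) binom.test(s_obs, n_tests, p)$p.value)
range(prev_grid[pvals > 0.05])

In the real problem you’d also have to propagate the uncertainty in sensitivity and specificity, which is where the difficulty (and the sensitivity to assumptions) comes in.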

No, I don’t believe that claim based on regression discontinuity analysis that . . .

tl;dr. See point 4 below.

Despite the p-less-than-0.05 statistical significance of the discontinuity in the above graph, no, I do not believe that losing a close election causes U.S. governors to die 5-10 years earlier, as was claimed in this recently published article.

Or, to put it another way: Despite the p-less-than-0.05 statistical significance of the discontinuity of the above graph, and despite the assurance of the authors that this finding is robust, no, I do not believe that winning a close election causes U.S. governors to live 5-10 years longer.

How can I say this? I will answer in five ways:
1. Common sense
2. Our experience as consumers of research
3. Statistical analysis
4. Statistical design
5. Sociology of science.

1. Common sense: 5-10 years is a huge amount of lifetime. Even if you imagine dramatic causal sequences (for example, you lose an election and become depressed and slide into alcoholism and then drive your car into a tree), it’s hard to imagine such a huge effect.

2. Our experience as consumers of research: The recent history of social and behavioral science is littered with published papers that made claims of implausibly large effects, supported by statistically significant comparisons and seemingly solid research methods, which did not hold up under replication or methodological scrutiny. Some familiar examples to readers of this blog include the claim that beautiful parents were 8 percentage points more likely to have girls, the claim that students at Cornell had extra-sensory perception, the claim that women were three times more likely to wear red or pink clothing during certain times of the month, the claim that single women were 20 percentage points more likely to support Barack Obama during certain times of the month, and the claim that political moderates perceived the shades of gray more accurately than extremists on the left and right. We’ve been burned before. We’ve been burned enough times that we realize we don’t have to follow Kahneman’s now-retired dictum that “you have no choice but to accept that the major conclusions of these studies are true.”

At this point, I recommend that you pause your reading and go to this post, “50 shades of gray: A research story,” and read all of it. It won’t take long: it’s just two long paragraphs from an article by Nosek, Spies, and Motyl and then a few sentences of remarks from me.

OK, did you read “50 shades of gray: A research story”? Great. Now let’s continue.

3. Statistical analysis: If there truly is no such large effect, that is, if losing an election does not cause you to lose 5 to 10 years of life, then how could these researchers have found a comparison that was (a) statistically significant, and (b) robust to model perturbations? My quick answer is forking paths. I know some of you are gonna hate to hear it, but there it is. The above graph might look compelling, but here’s what the raw data look like:

For ease of interpretation I’ve plotted the data in years rather than days.

Now let’s throw a loess on there and see what we get:

And here are two loess fits, one for negative x and one for positive x:

Yup. Same old story. Fit a negative slope right near the discontinuity, and then a positive jump is needed to make everything work out. The point is not that loess is the right thing to do here; the point is that this is what’s in these data.

The fit is noisy, and finding the discontinuity all depends on there being this strong negative relation between future lifespan and vote margin in this one election—but just for vote margins in this +/- 5 percentage point range. Without that negative slope, the discontinuity goes away. It’s just like three examples discussed here.

At this point you might say, no, the authors actually fit a local linear regression, so we can’t blame the curve, and that their results were robust . . . OK, we’ll get to that. My first point here is that the data are super-noisy, and fitting different models to these data will give you much different results. Again, remember that it makes sense that the data are noisy—there’s no good reason to expect a strong relationship between vote margin in an election and the number of years that someone will live afterward. Indeed, from any usual way of looking at things, it’s ludicrous to think that a candidate’s life expectancy is:
30 years if he loses an election by 5 percentage points
25 years if he loses narrowly
35 years if he wins narrowly
30 years if he wins by 5 percentage points.
It’s a lot more believable that this variation is just noise, some artifact of the few hundred cases in this dataset, than that it represents some general truth about elections, or even about elections for governor.

And here’s a regression. I show a bunch of regressions (with code) at the end of the post; here’s one including all the elections between 1945 and 2012, excluding all missing data, and counting the first race for each candidate who ran for governor multiple times:

lm(formula = more_years ~ won + age + decades_since_1950 + margin, 
    data = data, subset = subset)
                   coef.est coef.se
(Intercept)        78.60     4.05  
won                 2.39     2.44  
age                -0.98     0.08  
decades_since_1950 -0.21     0.51  
margin             -0.11     0.22  
---
n = 311, k = 5
residual sd = 10.73, R-Squared = 0.35

The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.

But what about the robustness as reported in the published article? My answer to that is, first, the result is not so robust, as is indicated by the above graph and regression and demonstrated further in my P.S. below (and I wasn’t trying to make the result go away, I was just trying to replicate what they were doing in that paper); and, second, let’s see what Uri Simonsohn had to say about this:

To demonstrate the problem I [Simonsohn] conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness,” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined if respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p<.01 (STATA code)

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.

4. Statistical design: Another way to think about this is to consider the design of such a study. A priori we might consider an effect size of one additional year of life to be large, and on the border of plausibility. But this study has essentially no power to detect effects this small! You can see that from the standard errors on the regression. If an estimate of 5-10 years is two or three standard errors away from zero, then an effect of 1 year, or even 2 years, is statistically undetectable. So the study is really set up only to catch artifacts or noise. That’s the point of the “Power failure” paper by Button et al. (2013).
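To put some numbers on that, here’s a back-of-the-envelope power calculation in R, using the roughly 2.4-year standard error from the regression above and a two-sided z-test approximation:

# Approximate power to detect true effects of various sizes, given a
# standard error of about 2.4 years (from the regression above).
se <- 2.4
effect <- c(1, 2, 5)
power <- pnorm(effect/se - qnorm(0.975)) + pnorm(-effect/se - qnorm(0.975))
round(data.frame(effect, power), 2)

A true effect of 1 year would show up as statistically significant only about 7% of the time, and even a 2-year effect only about 13% of the time. The design can reliably flag only effects that are too large to be plausible in the first place.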

5. Sociology of science: If you are a social scientist, statistical methods should be your servant, not your master. It’s very tempting to say that the researchers who made the above graph followed the rules and that it’s not fair to pick on them. But the point of social science is not to follow the rules, it’s to gain understanding and make decisions. When following the rules gives silly results, it’s time to look more carefully at the rules and think hard about what went wrong. That’s what (many) psychologists did after the Bem ESP debacle and the rest of the replication crisis.

Yes, it’s possible that I’m wrong. Perhaps losing a close election really did kill these candidates for governor. It’s possible. But I don’t think so. I think this paper is just another example of statistical rule-following that’s out of control.

No smoking gun

It’s easy to criticize research when the forking paths are all out in the open (as with the ESP study) or when a power analysis shows that your sample size is too small by a factor of 100 to detect anything (as in the beauty-and-sex-ratio example) or when there are obvious forking paths and a failed replication (as in the ovulation-and-clothing example) or when almost all the data have been excluded from the analysis (as in the union elections and stock price example) or when there’s flat-out research misconduct (as with pizzagate).

This example we’re discussing here is a bit different. It’s a clean analysis with clean data. The data are even publicly available (which allowed me to make the above graphs)! But, remember, honesty and transparency are not enough. If you do a study of an effect that is small and highly variable (which this one is: to the extent that winning or losing can have large effects on your lifespan, the effect will surely vary a lot from person to person), you’ve set yourself up for scientific failure: you’re working with noise.

I’m not happy about this, but that’s just how quantitative science works. So let me emphasize again that a study can be fatally flawed just by being underpowered, even if there’s no other obvious flaw in the study.

Or, to put it another way, there’s an attitude that causal identification + statistical significance = discovery, or that causal identification + robust statistical significance = discovery. But that attitude is mistaken. Even if you’re an honest and well-meaning researcher who has followed principles of open science.

The value of thinking about varying treatment effects: coronavirus example

Yesterday we discussed difficulties with the concept of average treatment effect.

Part of designing a study is accounting for uncertainty in effect sizes. Unfortunately there is a tradition in clinical trials of making optimistic assumptions in order to claim high power. Here is an example that came up in March, 2020. A doctor was designing a trial for an existing drug that he thought could be effective for high-risk coronavirus patients. I was asked to check his sample size calculation: under the assumption that the drug increased survival rate by 25 percentage points, a sample size of N = 126 would assure 80% power. With 126 people divided evenly in two groups, the standard error of the difference in proportions is bounded above by √(0.5*0.5/63 + 0.5*0.5/63) = 0.089, so an effect of 0.25 is at least 2.8 standard errors from zero, which is the condition for 80% power for the z-test.
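For what it’s worth, the doctor’s arithmetic checks out under his optimistic assumption. Here’s the same calculation in R, using the conservative 0.5/0.5 bound on the standard error as above:

# Power for the proposed trial: 63 patients per arm, assumed benefit of
# 25 percentage points, conservative bound on the standard error.
n_per_arm <- 63
se_bound <- sqrt(0.5*0.5/n_per_arm + 0.5*0.5/n_per_arm)   # about 0.089
power <- pnorm(0.25/se_bound - qnorm(0.975))              # about 0.80
c(se_bound = se_bound, power = power)

Detecting a 10-percentage-point benefit with the same power would require roughly (0.25/0.10)^2, or about six times as many patients.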

When I asked the doctor how confident he was in his guessed effect size, he replied that he thought the effect on these patients would be higher and that 25 percentage points was a conservative estimate. At the same time, he recognized that the drug might not work. I asked the doctor if he would be interested in increasing his sample size so he could detect a 10 percentage point increase in survival, for example, but he said that this would not be necessary.

It might seem reasonable to suppose that a drug might not be effective but would have a large individual effect in case of success. But this vision of uncertainty has problems. Suppose, for example, that the survival rate was 30% among the patients who do not receive this new drug and 55% among the treatment group. Then in a population of 1000 people, it could be that the drug has no effect on the 300 people who would live either way, no effect on the 450 who would die either way, and it would save the lives of the remaining 250 patients. There are other possibilities consistent with a 25 percentage point benefit—for example the drug could save 350 people while killing 100—but we will stick with the simple scenario for now. In any case, the point is that the posited benefit of the drug is not “a 25 percentage point benefit” for each patient; rather, it’s a benefit for 25% of the patients. And, from that perspective, of course the drug could work but only on 10% of the patients. Once we’ve accepted the idea that the drug works on some people and not others—or in some comorbidity scenarios and not others—we realize that “the treatment effect” in any given study will depend entirely on the patient mix. There is no underlying number representing the effect of the drug. Ideally one would like to know what sorts of patients the treatment would help, but in a clinical trial it is enough to show that there is some clear average effect. My point is that if we consider the treatment effect in the context of variation between patients, this can be the first step in a more grounded understanding of effect size.
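To make the decomposition concrete, here’s the 1000-patient scenario written out as potential outcomes in R (illustrative numbers only):

# Potential outcomes: y0 = survives without the drug, y1 = survives with it.
y0 <- c(rep(1, 300), rep(0, 450), rep(0, 250))
y1 <- c(rep(1, 300), rep(0, 450), rep(1, 250))
mean(y1) - mean(y0)       # average effect: 0.25
table(effect = y1 - y0)   # +1 for 250 patients, 0 for the other 750

The “25 percentage point benefit” is an average over a population in which three-quarters of the patients get no benefit at all.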

This is an interesting example because the outcome is binary—live or die—so the variation in the treatment effect is obvious. By construction, the treatment effect on any given person is +1, -1, or 0, and there’d be no way for it to be 0.25 on everybody. Even in this clear case, however, I think the framing in terms of average treatment effect causes problems, as illustrated in the story above.

Understanding the “average treatment effect” number

In statistics and econometrics there’s lots of talk about the average treatment effect. I’ve often been skeptical of the focus on the average treatment effect, for the simple reason that, if you’re talking about an average effect, then you’re recognizing the possibility of variation; and if there’s important variation (enough so that we’re talking about “the average effect” rather than simply “the effect”), then maybe we care enough about this variation that we should be studying it directly, rather than just trying to reduce-form it away.

But that’s not the whole story. Consider an education intervention such as growth mindset. Sure, the treatment effect will vary. But if the treatment’s gonna be applied to everybody, then, yeah, let’s poststratify and estimate an average effect: this seems like a relevant number to know.

What I want to talk about today is interpreting that number. It’s something that came up in the discussion of growth mindset.

The reported effect size was 0.1 points of grade point average (GPA). GPA is measured on something like a 1-4 scale, so 0.1 is not so much; indeed one commenter wrote, “I hope all this fuss is for more than that. Ouch.”

Actually, though, an effect of 0.1 GPA point is a lot. One way to think about this is that it’s equivalent to a treatment that raises GPA by 1 point for 10% of people and has no effect on the other 90%. That’s a bit of an oversimplification, but the point is that this sort of intervention might well have little or no effect on most people. In education and other fields, we try lots of things to try to help students, with the understanding that any particular thing we try will not make a difference most of the time. If mindset intervention can make a difference for 10% of students, that’s a big deal. It would be naive to think that it would make a difference for everybody: after all, many students have a growth mindset already and won’t need to be told about it.
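Here’s that equivalence as a one-line simulation (purely illustrative): a treatment that adds a full GPA point for a random 10% of students and does nothing for the rest has an average effect of 0.1.

# Average effect when 10% of students gain 1 GPA point and 90% gain nothing.
set.seed(123)
mean(ifelse(runif(1e6) < 0.10, 1, 0))   # approximately 0.1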

That’s all a separate question from the empirical evidence for that 0.1 increase. My point here is that thinking about an average effect can be misleading.

Or, to put it another way, it’s fine to look at the average, but let’s be clear on the interpretation.

I think this comes up in a lot of cases. Various interventions are proposed, and once the hype dies down, average effects will be small. Of course there’s no one-quick-trick or even one-small-trick that will raise GPA by 1 point or that will raise incomes by 44% (to use one of our recurring cautionary tales; see for example section 2.1 of this paper). An intervention that raised average GPA by 0.1 point or that raised average income by 4.4% would still be pretty awesome, if what it’s doing is acting on 10% of the people and having a big benefit on this subset. You try different interventions with the idea that maybe one of them will help any particular person.

Again, this discrete formulation is an oversimplification—it’s not like the treatment either works or doesn’t work on an individual person. It’s just helpful to understand average effects as compositional in that way. Otherwise you’re bouncing between the two extremes of hypothesizing unrealistically huge effect sizes or else looking at really tiny averages. Maybe in some fields of medicine this is cleaner because you can really isolate the group of patients who will be helped by a particular treatment. But in social science this seems much harder.

Embracing Variation and Accepting Uncertainty (my talk this Wed/Tues at a symposium on communicating uncertainty)

I’ll be speaking (virtually) at this conference in Australia on Wed 1 July (actually Tues 30 June in our time zone here):

Embracing Variation and Accepting Uncertainty

It is said that your most important collaborator is yourself in 6 months. Perhaps the best way to improve our communication of data uncertainty to others is to learn how to better communicate to ourselves. What does it mean to say to future-you: “I don’t know”? Or, even more challenging, “I know a little but I’m not completely sure”? We will discuss in the context of applications in drug testing, election forecasting, and the evaluation of scientific research.

P.S. The session was fun. It was a remote conference so there was a chat window with questions and answers. Here are the questions that were addressed to me, and my replies:

Q: US-based data journalist here. In reporting racial justice, what statistical concepts do you suggest we use when exploring disparate proportions (for example percent of Black Americans in the population versus percent killed by police)? And where do we need to highlight uncertainty? Keep in mind our general audience.

A: I like the recent paper by Gaebler et al., Deconstructing Claims of Post-Treatment Bias in Observational Studies of Discrimination: https://5harad.com/papers/post-treatment-bias.pdf

Q: I am currently building Bayesian-based COVID-19 case predictions. I was wondering, in your opinion, what type of uncertain variables (such as asymptomatic cases) could we consider in Bayesian prediction of COVID-19 r nought numbers?

A: It’s hard for me to answer such a general question. But, yes, you’d like to have latent variables such as the number of asymptomatic cases (and how this varies over time) in your model.

Q: How can we best integrate non-statistical expertise with statistical evidence to bridge the gap between data and conclusions?

A: If you can track down the assumptions that are involved in the statistical model, then you can bring in non-statistical expertise to evaluate and perturb these assumptions.

Q: Do you think there’s anything that can be done unless you actually “build a model” of the unknown process? (eg wrestler v boxer). Or is it better to go Decision Theoretic?

A: It’s not either/or. You can build a model–or, at least, consider what information it would take for you to build a model–and in the meantime you can use decision analysis to translate your uncertainties into decision evaluations.

Q: In the resources industry, there are work-culture incentives to overstate certainty, and penalise caution. I think this operates elsewhere too. Any comments on implications of this?

A: Yes, we sometimes talk about the “auto mechanic” incentive. The auto mechanic has an incentive to tell you he knows what’s wrong with your car. If he acts uncertain, you might go to a different mechanic.

Q: You mentioned story telling as a soft skill when explaining stats to a general audience. What are some of the ways to engage the audience without getting too technical?

A: I guess the key thing is to contribute your expertise. If you have technical expertise and people are talking with you, then I think they will want to hear it.

When it comes to communicating uncertainty, I like Gerd Gigerenzer’s idea of using “natural frequencies.” For example, instead of saying 3% are exposed, you say: “Consider an auditorium with 1000 people. 30 would be exposed.”

Q: Asking from a foundation of ignorance…is it more likely that a model that predicts a less dramatic result (e.g. 50/50) more likely to be wrong by chance than a model that predicts a less “likely” distribution of results (e.g. 90/10). I.e. if a model that predicts something unexpected represents the real data well, is that model more likely to be theoretically correct than a model that predicts something less unexpected. (Feel free to ignore if this question doesn’t make sense – or would be better asked in a different session).

A: Any prediction can be wrong. One question is what you are going to do with it. Is a prediction just for fun, or will it affect decisions? Which of your decisions might be different if you are told that Biden’s win probability is 90% rather than 50%?

Q: I am interested in iterative near-term forecasts of biodiversity. In this approach the model captures the process and also quantifies the uncertainty in making the forecast, then new observations are used to update each forecast.

Do you see any issues with iterative near-term forecasts for the statistical concepts of uncertainty?

A: It’s good to consider iterative forecasts. Iterative forecasts must satisfy certain mathematical conditions–the so-called martingale property. Recognizing these constraints can help you find incoherence in your forecasts.

It’s fun having a stream of questions. It becomes like a game, to answer them as fast as they come in.

Shortest posterior intervals

By default we use central posterior intervals. For example, the central 95% interval is the (2.5%, 97.5%) quantiles.

But sometimes the central interval doesn’t seem right. This came up recently with a coronavirus testing example, where the posterior distribution for the parameter of interest was asymmetric so that the central interval is not such a reasonable summary. See Figure 1 of this paper with Bob Carpenter.

Instead, sometimes it can make sense to use a shortest probability interval (similar to the highest posterior density interval), as discussed in this paper with Ying Liu and Tian Zheng.

The brute force approach to computing a shortest probability interval is to compute all the intervals of specified coverage and take the shortest. For example, if you have 1000 simulation draws and you want the 95% interval for a particular quantity of interest, z, you sort the 1000 simulations of z, compute 50 intervals, (z_1, z_951), (z_2, z_952), . . ., (z_50, z_1000), and you take the interval that’s the shortest.

There’s one additional twist: if the quantity of interest is bounded, you want to allow the possibility of the interval going all the way to the boundary. So if there’s a lower bound a and an upper bound b, you also consider the intervals (a, z_950) and (z_51, b) as candidates.

Here’s R code:

spin <- function(x, lower=NULL, upper=NULL, conf=0.95){
  # x: vector of posterior simulation draws
  # lower, upper: optional bounds on the quantity of interest
  x <- sort(as.vector(x))
  if (!is.null(lower)) {
    if (lower > min(x)) stop("lower bound is not lower than all the data")
    else x <- c(lower, x)   # allow the interval to start at the lower bound
  }
  if (!is.null(upper)) {
    if (upper < max(x)) stop("upper bound is not higher than all the data")
    else x <- c(x, upper)   # allow the interval to end at the upper bound
  }
  n <- length(x)
  gap <- round(conf*n)
  # widths of all candidate intervals (x[i], x[i+gap]); return the narrowest
  width <- x[(gap+1):n] - x[1:(n-gap)]
  index <- min(which(width==min(width)))
  x[c(index, index + gap)]
}

This particular code is kinda hacky in that it does not account for fencepost errors, but it conveys the general idea.
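Here’s a quick usage example, comparing the central 95% interval to the shortest 95% interval for a right-skewed quantity bounded below at zero:

# Exponential draws as a stand-in for an asymmetric posterior.
set.seed(123)
z <- rexp(1000)
quantile(z, c(0.025, 0.975))   # central 95% interval
spin(z, lower = 0)             # shortest 95% interval; can start at the bound

For a distribution like this the shortest interval typically runs from the boundary up to roughly the 95th percentile, and is noticeably narrower than the central interval.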

It works! But there is a concern that the procedure has unnecessary Monte Carlo error, in that we’re computing the minimum width of these empirical intervals. It seems like some smoothing should improve things.

In the above-linked Liu et al. paper, we present a pretty complicated procedure to compute a shortest probability interval efficiently. It works well in the paper, but I don’t really trust it in general. I think we should have something more transparent and reliable.

The short version of this post is that it would be good to have the shortest posterior interval available, not as Stan’s automatic output (I think by default the central interval is fine) but as an option for users who want it, because it does come up.

The longer version is that I think there’s a research opportunity to come up with a more efficient shortest posterior interval computation that’s also clean and robust.

This one quick trick will allow you to become a star forecaster

Jonathan Falk points us to this wonderful post by Dario Perkins. It’s all worth a read, but, following Falk, I want to emphasize this beautiful piece of advice, which is #5 on their list of 10 items:

How to get attention: If you want to get famous for making big non-consensus calls, without the danger of looking like a muppet, you should adopt ‘the 40% rule’. Basically you can forecast whatever you want with a probability of 40%. Greece to quit the euro? Maybe! Trump to fire Powell and hire his daughter as the new Fed chair? Never say never! 40% means the odds will be greater than anyone else is saying, which is why your clients need to listen to your warning, but also that they shouldn’t be too surprised if, you know, the extreme event doesn’t actually happen.

I like it.

P.S. Justin Lahert gets priority here, as he wrote about “the 40% rule” back in 2010:

The 40% rule is a time-tried tactic employed by pundits making forecasts. If you say there is a 40% chance of something improbable happening and it does, you look great. And if it doesn’t, you never said the odds favored it.

Validating Bayesian model comparison using fake data

A neuroscience graduate student named James writes in with a question regarding validating Bayesian model comparison using synthetic data:

I [James] perform an experiment and collect real data. I want to determine which of 2 candidate models best accounts for the data. I perform (approximate) Bayesian model comparison (e.g., using BIC – not ideal I know, but hopefully we can suspend our disbelief about this metric) and select the best model accordingly.

I have been told that we can’t entirely trust the results of this model comparison because 1) we make approximations when performing inference (exact inference is intractable) and 2) there may be code bugs. It has been recommended that I should validate this model selection process by applying it to synthetic data generated from the 2 candidate models; the rationale is that if the true model is recovered in each case I can rely on the results of model comparison on the real data.

My question is: is this model recovery process something a Bayesian would do/do you think it is necessary? I am wondering if it is not appropriate because it is conditional on the data having been generated from one of the 2 candidate models, both of which are presumably wrong; the true data that we collected in the experiment was presumably generated from a different model/process. I am wondering if it is sufficient to perform model comparison on the real data (without using recovery for validation) due to the likelihood principle – model comparison will tell us under which model the data is most probable.

I would love to hear your views on the appropriateness of model recovery and whether it is something a Bayesian would do (also taking practical considerations into account such as bugs/approximations).

I replied that my quick recommendation is to compare the models using leave-one-out cross validation, as discussed in this article from a few years ago.

James responded with some further questions:

1. In our experiment, we assume that each participant performs Bayesian inference (in a Bayesian non-parametric switching state-space model); however, we fit the (hyper)parameters of the model using maximum likelihood estimation given the behaviour of each participant. Hence, we obtain point estimates of the model parameters. Therefore, I don’t think the methods in the paper you sent are applicable as they require access to the posterior over parameters? We currently perform model comparison using metrics such as AIC/BIC, which can be computed even though we fit parameters using maximum likelihood estimation.

2. Can I ask why you are suggesting a method that assesses out-of-sample predictive accuracy? My (perhaps naive) understanding is that we want to determine which model among a set of candidate models best explains the data we have from our current experiment, not data that we could obtain in a future experiment. Or would you argue that we always want to use our model in the future so we really care about predictive accuracy?

3. The model recovery process I mentioned has been advocated for purely practical reasons as far as I can tell (e.g., the optimiser used to fit the models could be lousy, approximations are typically made to marginal likelihoods/predictive accuracies, there could be bugs in one’s code). So even if I performed model comparison using PSIS-LOO as you suggest, I could imagine that one could still advocate doing model recovery to check that the result of model comparison based on PSIS-LOO is reliable and can be trusted. The assumption of model recovery is that you really should be able to recover the true model when it is the set of models you are comparing – if you can’t recover the true model with reasonable accuracy, then you can’t trust the results of your model comparison on real data. Do you have any thoughts on this?

My brief replies:

1a. No need to perform maximum likelihood estimation for each participant. Just do full Bayes: that should give you better inferences. And, if it doesn’t, it should reveal problems with your model, and you’ll want to know that anyway.

1b. Don’t do AIC and definitely don’t do BIC. I say this for reasons discussed in the above-linked paper and also this paper, Understanding predictive information criteria for Bayesian models.

2. You always care about out-of-sample predictive accuracy. The reason for fitting the model is that it might be applicable in the future. As described in the above-linked papers, AIC can be understood as an estimate of out-of-sample predictive accuracy. If you really only cared about within-sample prediction, you wouldn’t be using AIC at all; you’d just do least squares or maximum likelihood and never look back. The very fact that you were thinking about using AIC tells me that you care about out-of-sample predictive accuracy. And then you might as well do LOO and cut out the middleman.

3. Sure, yes, I’m a big fan of checking your fitting and computing process using fake-data simulation. So go for it!
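To give a sense of what I mean, here’s a rough sketch of such a fake-data check using rstanarm and loo, with generic regression models standing in for James’s actual candidates (which I haven’t seen): simulate data from one model, fit both candidates with full Bayes, and see whether leave-one-out cross validation picks out the right one.

# Fake-data check for Bayesian model comparison via LOO (generic stand-ins).
library(rstanarm)
library(loo)

set.seed(123)
n <- 200
x <- rnorm(n)
y <- 1 + 2*x + 0.5*x^2 + rnorm(n)   # data simulated from the quadratic model
fake <- data.frame(x, y)

fit1 <- stan_glm(y ~ x, data = fake, refresh = 0)            # candidate 1
fit2 <- stan_glm(y ~ x + I(x^2), data = fake, refresh = 0)   # candidate 2 (true)

loo_compare(loo(fit1), loo(fit2))   # should favor fit2 if all is working

If the comparison can’t recover the true model even in this friendly setting, that’s a warning about the computation or the code; if it can, that’s no guarantee about the real data, but it’s a useful check.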

The two most important formulas in statistics

0.5/sqrt(n) (which in turn is short for sqrt(p*(1-p)/n))

5^2 + 12^2 = 13^2

With an honorable mention to 16.
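To spell these out: the first is the conservative standard error for a proportion, and the second is the Pythagorean triple that makes it easy to see, in your head, how little a smaller error source adds to a larger one. For example:

# Standard error of a proportion from a survey of n = 1000:
0.5/sqrt(1000)     # about 0.016, i.e. a 95% margin of error of about 3 points
# Combining independent standard errors of 5 and 12:
sqrt(5^2 + 12^2)   # 13: the smaller error barely moves the total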