Authors repeat same error in 2019 that they acknowledged and admitted was wrong in 2015

David Allison points to this story:

Kobel et al. (2019) report results of a cluster randomized trial examining the effectiveness of the “Join the Healthy Boat” kindergarten intervention on BMI percentile, physical activity, and several exploratory outcomes. The authors pre-registered their study and described the outcomes and analysis plan in detail previously, which are to be commended. However, we noted four issues that some of us recently outlined in a paper on childhood obesity interventions: 1) ignoring clustering in studies that randomize groups of children, 2) changing the outcomes, 3) emphasizing results that were statistically significant from a host of analyses, and 4) using self-reported outcomes that are part of the intervention.

First and most critically, the statistical analyses reported in the article were inadequate and deviated from the analysis plan in the study’s methods article – an error the authors are aware of and had acknowledged after some of us identified it in one of their prior publications about this same program. . . .

Second, the authors switched their primary and secondary outcomes from their original plan. . . .

Third, while the authors focus on an effect of the intervention of p ≤ 0.04 in the abstract, controlling for migration background in their full model raised this to p = 0.153. Because inclusion or exclusion of migration background does not appear to be a pre-specified analytical decision, this selective reporting in the abstract amounts to spinning of the results to favor the intervention.

Fourth, “physical activity and other health behaviours … were assessed using a parental questionnaire.” Given that these variables were also part of the intervention itself, with the control having “no contact during that year,” subjective evaluation may have resulted in differential, social-desirability bias, which may be of particular concern in family research. Although the authors mention this in the limitations, the body of literature demonstrating the likelihood of these biases invalidating the measurements raises the question of whether they should be used at all.

This is a big deal. The authors of the cited paper knew about these problems—to the extent of previously acknowledging them in print—but then did them again.

The authors did this thing of making a strong claim and then hedging it in their limitations. That’s bad. From the abstract of the linked paper:

Children in the IG [intervention group] spent significantly more days in sufficient PA [physical activity] than children in the CG [control group] (3.1 ± 2.1 days vs. 2.5 ± 1.9 days; p ≤ 0.005).

Then, deep within the paper:

Nonetheless, this study is not without limitations, which need to be considered when interpreting these results. Although this study has an acceptable sample size and body composition and endurance capacity were assessed objectively, the use of subjective measures (parental report) of physical activity and the associated recall biases is a limitation of this study. Furthermore, participating in this study may have led to an increased social desirability and potential over-reporting bias with regards to the measured variables as awareness was raised for the importance of physical activity and other health behaviours.

This is a limitation that the authors judge to be worth mentioning in the paper but not in the abstract or in the conclusion, where the authors write that their intervention “should become an integral part of all kindergartens” and is “ideal for integrating health promotion more intensively into the everyday life of children and into the education of kindergarten teachers.”

The point here is not to slam this particular research paper but rather to talk about a general problem with science communication, involving over-claiming of results and deliberate use of methods that are problematic but offer the short-term advantage of allowing researchers to make stronger claims and get published.

P.S. Allison follows up by pointing to this Pubpeer thread.

Estimating efficacy of the vaccine from 95 true infections

Gaurav writes:

The 94.5% efficacy announcement is based on comparing 5 of 15k to 90 of 15k:

On Sunday, an independent monitoring board broke the code to examine 95 infections that were recorded starting two weeks after volunteers’ second dose — and discovered all but five illnesses occurred in participants who got the placebo.

Similar stuff from Pfizer etc., of course.

Unlikely to happen by chance but low baselines.

My [Gaurav’s] guess is that the final numbers will be a lot lower than 95%.

He expands:

The data = treatment group is 5 out of 15k and the control group is 90 out of 15k. The base rate (control group) is 0.6%. When the base rate is so low, it is generally hard to be confident about the ratio (1 – (5/90)). But noise is not the same as bias. One reason to think that 94.5% is an overestimate is simply that 94.5% is pretty close to the maximum point on the scale.

The other reason to worry about 94.5% is that the efficacy of a Flu vaccine is dramatically lower. (There is a difference in the time horizons over which effectiveness is measured for Flu and for Covid, with Covid being much shorter, but useful to take that as a caveat when trying to project the effectiveness of Covid vaccine.)
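To put a rough number on "hard to be confident about the ratio," here is a minimal sketch of an exact interval for the efficacy. It is not the trial's actual analysis. It assumes equal-sized arms with equal follow-up, so that, conditional on the 95 total cases, the number of cases in the vaccine arm is binomial with probability theta = IRR / (1 + IRR). The function names are made up for illustration.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_root(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of a monotone function f on [lo, hi], found by bisection."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def ve_from_theta(theta):
    """Efficacy 1 - IRR, where theta = IRR / (1 + IRR)."""
    return 1 - theta / (1 - theta)


def efficacy_interval(vaccine_cases, placebo_cases, alpha=0.05):
    """Point estimate and Clopper-Pearson interval for efficacy.

    Assumption (not from the trial protocol): equal-sized arms, so the
    split of cases between arms is Binomial(total, theta).
    """
    n = vaccine_cases + placebo_cases
    k = vaccine_cases
    if not 0 < k < n:
        raise ValueError("this sketch needs at least one case in each arm")
    # Clopper-Pearson bounds on theta, the vaccine arm's share of cases.
    theta_lo = bisect_root(lambda p: 1 - binom_cdf(k - 1, n, p) - alpha / 2)
    theta_hi = bisect_root(lambda p: binom_cdf(k, n, p) - alpha / 2)
    estimate = 1 - vaccine_cases / placebo_cases
    # A large theta means low efficacy, so the bounds swap.
    return estimate, ve_from_theta(theta_hi), ve_from_theta(theta_lo)


estimate, lower, upper = efficacy_interval(vaccine_cases=5, placebo_cases=90)
print(f"efficacy {estimate:.3f}, 95% interval ({lower:.3f}, {upper:.3f})")
```

The interval only describes sampling noise under those assumptions; it says nothing about the bias that Gaurav is worried about.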

What went wrong with the polls in 2020? Another example.

Shortly before the election the New York Times ran this article, “The One Pollster in America Who Is Sure Trump Is Going to Win,” featuring Robert Cahaly, who on election day forecast Biden to win 235 electoral votes. As you may have heard, Biden actually won 306. Our Economist model gave a final prediction of 356.

356 isn’t 306. We were off by 50 electoral votes, and that was kind of embarrassing. We discussed what went wrong, and the NYT ran an article on “why political polling missed the mark.”

Fine. We were off by 50 electoral votes (and approximately 2.5 percentage points on the popular vote, as we predicted Biden with 54.4% of the two-party vote and he received about 52%). We take our lumps, and we try to do better next time. But . . . they were off by 71 electoral votes! So I think they should assess what went wrong with their polls, even more so.

The Times article ends with this quote from Cahaly:

“I think we’ve developed something that’s very different from what other people do, and I really am not interested in telling people how we do it,” he said. “Just judge us by whether we get it right.”

Fair enough: you run a business, and it’s your call whether to make your methods public. Trafalgar Group polling keeps their methods secret, as does Fivethirtyeight with their poll aggregation procedure. As long as things go well, it’s kinda fun to maintain that air of mystery.

But “judge us by whether we get it right” is tricky. Shift 1% of the vote from the Democrats to the Republicans, and Biden still wins the popular vote but he loses the electoral college. Shift 1% of the vote from the Republicans to the Democrats, and Biden wins one more state and the Democrats grab another seat in the Senate.

From the news articles about Cahaly’s polling, it seems that a key aspect of their method is to measure intensity of preferences, and it seems that Republicans won the voter turnout battle this year. So, looking forward, it seems that there could be some benefit to using some of these ideas—but without getting carried away and declaring victory after your forecast was off by 71 electoral votes. Remember item 3 on our list.

Nonparametric Bayes webinar

This post is by Eric.

A few months ago we started running monthly webinars focusing on Bayes and uncertainty. Next week, we will be hosting Arman Oganisian, a 5th-year biostatistics PhD candidate at the University of Pennsylvania and Associate Fellow at the Leonard Davis Institute for Health Economics. His research focuses on developing Bayesian nonparametric methods for solving complicated estimation problems that arise in causal inference. His application areas of interest include health economics and, more recently, cancer therapies.


Bayesian nonparametrics combines the flexibility often associated with machine learning with principled uncertainty quantification required for inference. Popular priors in this class include Gaussian Processes, Bayesian Additive Regression Trees, Chinese Restaurant Processes, and more. But what exactly are “nonparametric” priors? How can we compute posteriors under such priors? And how can we use them for flexible modeling? This talk will explore these questions by introducing nonparametric Bayes at a conceptual level and walking through a few common priors, with a particular focus on the Dirichlet Process prior for regression.

If this sounds interesting to you, please join us this Wednesday, 18 November at 12 noon ET.

P.S. Last month we had Matthew Kay from Northwestern University discussing his research on visualizing and communicating uncertainty. Here is the link to the video.

You don’t need a retina specialist to know which way the wind blows

Jayakrishna Ambati writes:

I am a retina specialist and vision scientist at the University of Virginia. I am writing to you with a question on Bayesian statistics.

I am performing a meta analysis of 5 clinical studies. In addition to a random effects meta analysis model, I am running Bayesian meta analysis models using half normal priors. I’ve seen scales of 0.5 or 1.0 being used. What determines this choice? Why can’t it be 0.1 or 0.2, for example? Can I use the value of the heterogeneity tau (obtained from the random effect meta model) to calculate sigma and make that or a multiple of it to be the value of the scale?

My reply:

With only 5 groups, it can help to use an informative prior on the group-level variance. What’s a good prior to use? It depends on your prior information! How large are the effects that you might see? You can play it safe and use a weak prior, even a uniform prior on the group-level scale parameter: this will, on average, lead to an overestimate of the group-level scale, which in turn will lead to an overstatement of uncertainty.

Regarding the specific question of why you’ll see normal+(0,1) or normal+(0,0.5): This depends on the problem under study, but we can get some insight by thinking about scaling.

Consider two important special cases:

1. Continuous outcome, multilevel linear regression with predictors and outcomes scaled to have sd’s equal to 0.5 (this is our default choice because a binary variable coded to 0 and 1 will have sd of approx 0.5): we’d expect coefficients to be less than 1 in absolute value, hence a normal+(0,1) prior on the sd of a set of coefs should be weakly informative.

2. Binary outcome, multilevel logistic regression, again scaling predictors to have sd’s equal to 0.5: again, we’d expect coefs to be less than 2 in absolute value (a shift of 2 on the logit scale is pretty big), hence a normal+(0,0.5) prior on the sd of a set of coefs should be weakly informative.

In many cases, normal+(0,0.2) or normal+(0,0.1) will be fine too, in examples such as policy analysis and some areas of biomedicine where we would not expect huge effects.
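To get a feel for what these scales imply, here is a small sketch that prints the median and 95th percentile of a half-normal prior on the group-level sd for the scales discussed above. It uses only the Python standard library; `half_normal_quantile` is an illustrative helper, not a function from any modeling package.

```python
from statistics import NormalDist


def half_normal_quantile(p, scale):
    """Quantile q with P(tau <= q) = p when tau = scale * |Z|, Z standard normal."""
    # The half-normal CDF is 2 * Phi(q / scale) - 1; solve that for q.
    return scale * NormalDist().inv_cdf((1 + p) / 2)


for scale in (0.1, 0.2, 0.5, 1.0):
    median = half_normal_quantile(0.5, scale)
    upper = half_normal_quantile(0.95, scale)
    print(f"normal+(0, {scale}): median {median:.3f}, 95th percentile {upper:.3f}")
```

Comparing the 95th percentile with the largest between-study sd you would find plausible is one rough way to choose among the scales.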

A related question came up last month regarding priors for non-hierarchical regression coefficients.

The rise and fall and rise of randomized controlled trials (RCTs) in international development

Gil Eyal sends along this fascinating paper coauthored with Luciana de Souza Leão, “The rise of randomized controlled trials (RCTs) in international development in historical perspective.” Here’s the story:

Although the buzz around RCT evaluations dates from the 2000s, we show that what we are witnessing now is a second wave of RCTs, while a first wave began in the 1960s and ended by the early 1980s. Drawing on content analysis of 123 RCTs, participant observation, and secondary sources, we compare the two waves in terms of the participants in the network of expertise required to carry out field experiments and the characteristics of the projects evaluated. The comparison demonstrates that researchers in the second wave were better positioned to navigate the political difficulties caused by randomization.

What were the key differences between the two waves? Leão and Eyal start with the most available explanation:

What could explain the rise of RCTs in international development? Randomistas tend to present it as due to the intrinsic merits of their method, its ability to produce “hard” evidence as compared with the “softer” evidence provided by case studies or regressions. They compare development RCTs to clinical trials in medicine, implying that their success is due to the same “gold standard” status in the hierarchy of evidence: “It’s not the Middle Ages anymore, it’s the 21st century … RCTs have revolutionized medicine by allowing us to distinguish between drugs that work and drugs that don’t work. And you can do the same randomized controlled trial for social policy” (Duflo 2010).

But they don’t buy it:

This explanation does not pass muster and need not detain us for very long. Econometricians have convincingly challenged the claim that RCTs produce better, “harder” evidence than other methods. Their skepticism is amply supported by evidence that medical RCTs suffer from numerous methodological shortcomings, and that political considerations played a key role in their adoption. These objections accord with the basic insight of science studies, namely, that the success of innovations cannot be explained by their prima facie superiority over others, because in the early phases of adoption such superiority is not yet evident.

I’d like to unpack this argument, because I agree with some but not all of it.

I agree that medical randomized controlled trials have been oversold; and even if I accept the idea of RCT as a gold standard, I have to admit that almost all my own research is observational.

I also respect Leão and Eyal’s point that methodological innovations typically start with some external motivation, and it can take some time before their performance is clearly superior.

On the other hand, we can port useful ideas from other fields of research, and sometimes new ideas really are better. So it’s complicated.

Consider an example that I’m familiar with: Mister P. We published the first MRP article in 1997, and I knew right away that it was a big deal—but it indeed took something like 20 years for it to become standard practice. I remember in fall, 2000, standing up in front of a bunch of people from the exit poll consortium, telling them about MRP and related ideas, and they just didn’t see the point. It made me want to scream—they were so tied into classical sampling theory, they seemed to have no idea that something could be learned by studying the precinct-by-precinct swing between elections. It’s hard for me to see why two decades were necessary to get the point across, but there you have it.

My point here is that my MRP story is consistent with the randomistas’ story and also with the sociologists’. On one hand, yes, this was a game-changing innovation that ultimately was adopted because it could do the job better than what came before. (With MRP, the job was adjusting for survey nonresponse; with RCT, the job was estimating causal effects; in both cases, the big and increasing concern was unmeasured bias.) On the other hand, why did the methods become popular when they did? That’s for the sociologists to answer, and I think they’re right that the answer has to depend on the social structure of science, not just on the inherent merit or drawbacks of the methods.

As Leão and Eyal put it, any explanation of the recent success of RCTs within economics must “recognize that the key problem is to explain the creation of an enduring link between fields” and address “the resistance faced by those who attempt to build this link,” while avoiding “too much of the explanatory burden on the foresight and interested strategizing of the actors.”

Indeed, if I consider the example of MRP, the method itself was developed by putting together two existing ideas in survey research (multilevel modeling for small area estimation, and poststratification to adjust for nonresponse bias), and when we came up with it, yes I thought it was the thing to do, but I also thought the idea was clear enough that it would pretty much catch on right away. It’s not like we had any strategy for global domination.

The first wave of RCT for social interventions

Where Leão and Eyal’s article really gets interesting, though, is when they talk about the earlier push for RCTs, several decades ago:

While the buzz around RCTs certainly dates from the 2000s, the assumption—implicit in both the randomistas’ and their critics’ accounts—that the experimental approach is new to the field of international development—is wrong. In reality, we are witnessing now a second wave of RCTs in international development, while a first wave of experiments in family planning, public health, and education in developing countries began in the 1960s and ended by the early 1980s. In between the two periods, development programs were evaluated by other means.

Just as an aside—I love that above sentence with three dashes. Dashes are great punctuation, way underused in my opinion.

Anyway, they now set up the stylized fact, the puzzle:

Instead of asking, “why are RCTs increasing now?” we ask, “why didn’t RCTs spread to the same extent in the 1970s, and why were they discontinued?” In other words, how we explain the success of the second wave must be consistent with how we explain the failure of the first.

Good question, illustrating an interesting interaction between historical facts and social science theorizing.

Leão and Eyal continue:

The comparison demonstrates that the recent widespread adoption of RCTs is not due to their inherent technical merits nor to rhetorical and organizational strategies. Instead, it reflects the ability of actors in the second wave to overcome the political resistance to randomized assignment, which has bedeviled the first wave, and to forge an enduring link between the fields of development aid and academic economics.

As they put it:

The problem common to both the first and second waves of RCTs was how to turn foreign aid into a “science” of development. Since foreign aid is about the allocation of scarce resources, the decisions of donors and policy-makers need to be legitimized.

They argue that a key aspect of the success of the second wave of RCTs was the connection to academic economics.

Where next?

I think RCTs and causal inference in economics and political science and international development are moving in the right direction, in that there’s an increasing awareness of variation in treatment effects, and an increasing awareness that doing an RCT is not enough in itself. Also, Leão and Eyal talk a lot about “nudges,” but I think the whole nudge thing is dead, and serious economists are way past that whole nudging thing. The nudge people can keep themselves busy with Ted talks, book tours, and TV appearances while the rest of us get on with the real work.

How to describe Pfizer’s beta(0.7, 1) prior on vaccine effect?

Now it’s time for some statistical semantics. Specifically, how do we describe the prior that Pfizer is using for their COVID-19 study? Here’s a link to the report.

Way down on page 101–102, they say (my emphasis),

A minimally informative beta prior, beta (0.700102, 1), is proposed for θ = (1-VE)/(2-VE). The prior is centered at θ = 0.4118 (VE=30%) which can be considered pessimistic. The prior allows considerable uncertainty; the 95% interval for θ is (0.005, 0.964) and the corresponding 95% interval for VE is (-26.2, 0.995).

I think “VE” stands for vaccine effect. Here’s the definition from page 92 of the report.

VE = 100 × (1 – IRR). IRR is calculated as the ratio of first confirmed COVID-19 illness rate in the vaccine group to the corresponding illness rate in the placebo group. In Phase 2/3, the assessment of VE will be based on posterior probabilities of VE1 > 30% and VE2 > 30%.

VE1 represents VE for prophylactic BNT162b2 against confirmed COVID-19 in participants without evidence of infection before vaccination, and VE2 represents VE for prophylactic BNT162b2 against confirmed COVID-19 in all participants after vaccination.

I’m unclear on why they’d want to impose a prior on (1 – VE) / (2 – VE), or even how to interpret that quantity, but that’s not what I’m writing about. But the internet’s great and Sebastian Kranz walks us through it in a blog post, A look at Biontech/Pfizer’s Bayesian analysis of their COVID-19 vaccine trial. It turns out that the prior is on the quantity \theta = \frac{\displaystyle \pi_v}{\displaystyle \pi_v + \pi_c}, where \pi_v, \pi_c \in (0, 1) are, in Kranz’s words, “population probabilities that a vaccinated subject or a subject in the control group, respectively, fall ill to Covid-19.” I’m afraid I still don’t get it. Is the time frame restricted to the trial? What does “fall ill” mean, a positive PCR test or something more definitive? (The answers may be in the report—I didn’t read it.)
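Here is a sketch of how the two parameterizations line up, assuming the incidence rate ratio is IRR = \pi_v / \pi_c (equal exposure in the two arms), so that VE = 1 - IRR and \theta = IRR / (1 + IRR) = (1 - VE) / (2 - VE). It also checks the prior intervals quoted from the report, using the fact that a beta(a, 1) distribution has CDF \theta^a. The function names are made up for illustration, and VE is treated as a proportion (0.3 means 30%).

```python
def theta_from_ve(ve):
    """theta = (1 - VE) / (2 - VE), with VE as a proportion."""
    return (1 - ve) / (2 - ve)


def ve_from_theta(theta):
    """Inverse map: VE = (1 - 2 * theta) / (1 - theta)."""
    return (1 - 2 * theta) / (1 - theta)


def beta_a1_quantile(p, a):
    """Quantile of beta(a, 1), whose CDF on (0, 1) is theta ** a."""
    return p ** (1 / a)


a = 0.700102  # first parameter of the prior quoted from the report
theta_lo = beta_a1_quantile(0.025, a)
theta_hi = beta_a1_quantile(0.975, a)

print("theta at VE = 30%:", theta_from_ve(0.30))
print("prior mean of theta:", a / (a + 1))
print("95% prior interval for theta:", (theta_lo, theta_hi))
# A large theta means low efficacy, so the endpoints swap.
print("95% prior interval for VE:", (ve_from_theta(theta_hi), ve_from_theta(theta_lo)))
```

Under that assumption, \theta is just the share of all cases that land in the vaccine arm, which is why a binomial likelihood with a beta prior is a convenient fit.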

What is a weakly informative prior?

It’s the description “minimally informative” and subsequent write-ups calling it “weakly informative” that got my attention. For instance, in Ian Fellows’s post, The Pfizer-Biontech vaccine may be a lot more effective than you think (which Andrew summarized in his own post here), Fellows calls it “a Bayesian analysis using a beta binomial model with a weakly-informative prior.”

What we mean by weakly informative is that the prior determines the scale of the answer. For example, a standard normal prior, normal(0, 1), imposes a unit scale, whereas a normal(0, 100) would impose a scale of 100 (like Stan and R, I’m using a scale or standard deviation parameterization of the normal so that the two parameters have the same units).

Weakly informative in which parameterization?

Thinking about proportions is tricky, because they’re constrained to fall in the interval (0, 1). The maximum standard deviation achievable with a beta distribution is 0.5 as alpha and beta -> 0, whereas a uniform distribution on (0, 1) has standard deviation 0.29, and a beta(100, 100) has standard deviation 0.035.
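These standard deviations follow from the closed-form variance of a beta(alpha, beta) distribution, alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)). A minimal sketch that evaluates it for the three cases mentioned (`beta_sd` is an illustrative helper):

```python
import math


def beta_sd(a, b):
    """Standard deviation of a beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


print(beta_sd(1e-6, 1e-6))  # approaches the 0.5 limit as both parameters go to 0
print(beta_sd(1, 1))        # uniform on (0, 1), equal to 1 / sqrt(12)
print(beta_sd(100, 100))    # tightly concentrated around 0.5
```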

It helps to transform using logit so we can consider the log odds, mapping a proportion \theta to \textrm{logit}(\theta) = \log \frac{\theta}{1 - \theta}. A uniform distribution on theta in (0, 1) results in a standard logistic(0, 1) distribution on logit(theta) in (-inf, inf). So even a uniform distribution on the proportion leads to a unit scale distribution on the log odds. In that sense, a uniform distribution is weakly informative in the way we mean when we recommend weakly informative priors in Stan. All on its own, it’ll control the scale of the unconstrained parameter. (By the way, I think transforming theta in (0, 1) to logit(theta) in (-inf, inf) is the easiest way to get a handle on Jacobian adjustments—it’s easy to see the transformed variable no longer has a uniform distribution, and it’s the Jacobian of the inverse transform that defines the logistic distribution’s density.)

Fellows is not alone. In the post, Warpspeed confidence — what is credible?, which relates Pfizer’s methodology to more traditional frequentist methods, Chuck Powell says, “For purposes of this post I’m going to use a flat, uninformed prior [beta(1, 1)] in all cases.” Sure, it’s flat on the (0, 1) scale, but not on the log odds scale. Flat is relative to parameterization. If you work with a logistic prior on the log odds scale and then transform with inverse logit, you get exactly the same answer with a prior that is far from flat—it’s centered at 0 and has a standard deviation of pi / sqrt(3), or about 1.8.
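A quick simulation makes the point that flat is relative to parameterization. This sketch draws uniform values on (0, 1), applies logit, and compares the sample standard deviation with that of a standard logistic distribution, pi / sqrt(3). The seed and the number of draws are arbitrary choices.

```python
import math
import random


def logit(u):
    """Log odds of a proportion u in (0, 1)."""
    return math.log(u / (1 - u))


rng = random.Random(1234)  # arbitrary seed, for reproducibility only
draws = []
while len(draws) < 100_000:
    u = rng.random()
    if 0.0 < u < 1.0:  # guard against the (very unlikely) exact 0.0
        draws.append(logit(u))

sample_mean = sum(draws) / len(draws)
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in draws) / len(draws))
logistic_sd = math.pi / math.sqrt(3)  # sd of a standard logistic(0, 1)

print(f"sample mean {sample_mean:.3f}, sample sd {sample_sd:.3f}")
print(f"standard logistic sd {logistic_sd:.3f}")
```

The draws are flat on the probability scale but clearly bell-shaped on the log odds scale, with most of their mass within a few units of 0.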

How much information is in a beta prior?

It helps to reparameterize the beta with a mean \phi \in (0, 1) and “count” \kappa > 0,

\textrm{beta2}(\theta \mid \phi, \kappa) = \textrm{beta}(\theta \mid \phi \cdot \kappa, (1 - \phi) \cdot \kappa).

The beta distribution is conjugate to the Bernoulli (and more generally, the binomial), which is what makes it a popular choice. What this means in practice is that it’s an exponential family distribution that can be treated as pseudodata for a Bernoulli distribution.

Because beta(1, 1) is a uniform distribution, we think of that as having no prior data, or a total of zero pseudo-observations. From this perspective, beta(1, 1) really is uninformative in the sense that it’s equivalent to starting uniform and seeing no prior data.

In the beta2 parameterization, the uniform distribution on (0, 1) is beta2(0.5, 2). This corresponds to pseudodata with count 0, not 2—we need to subtract 2 from \kappa to get the pseudocount!

Where does that leave us with the beta(0.7, 1)? Using our preferred parameterization, that’s beta2(0.4117647, 1.7). That means a prior pseudocount of -0.3 observations! That means we start with negative pseudodata when the prior count parameter kappa is less than 2. Spoiler alert—that negative pseudocount is going to be swamped by the actual data.
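The conversion between the two parameterizations is simple enough to write down directly. A minimal sketch, with illustrative function names and the pseudocount defined as above (kappa minus the 2 that corresponds to the uniform prior):

```python
def beta_to_beta2(a, b):
    """(a, b) -> (phi, kappa): mean phi = a / (a + b), count kappa = a + b."""
    kappa = a + b
    return a / kappa, kappa


def beta2_to_beta(phi, kappa):
    """(phi, kappa) -> (a, b), matching beta2(phi, kappa) = beta(phi * kappa, (1 - phi) * kappa)."""
    return phi * kappa, (1 - phi) * kappa


def prior_pseudocount(a, b):
    """Pseudo-observations relative to the uniform beta(1, 1), that is kappa - 2."""
    return a + b - 2


for a, b in [(1, 1), (0.7, 1), (0.700102, 1)]:
    phi, kappa = beta_to_beta2(a, b)
    print(f"beta({a}, {b}) = beta2({phi:.7f}, {kappa:.6f}), "
          f"pseudocount {prior_pseudocount(a, b):+.6f}")
```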

What about Pfizer’s beta(0.700102, 1) prior? That’s beta2(0.4118, 1.700102). If you plot beta(theta | 0.7, 1) vs. theta, you’ll see that the log density tends to infinity as theta goes to 0. That makes it look like it’s going to be somewhat or maybe even highly informative. There’s a nice density plot in Kranz’s post.

Of course, the difference between beta(0.700102, 1) and beta(0.7, 1) is negligible—under 1/10,000th on prior mean and about 1/10,000th of a patient in prior pseudocount. They must’ve derived the number from a formula somehow and then didn’t want to round. The only harm in using 0.700102 rather than 0.7 or even 1 is that someone may assume a false sense of precision.

Let’s look at the effect of the prior, in terms of how it affects the posterior. That is, differences between beta(n + 0.7, N – n + 1) vs. beta(n + 1, N – n + 1) for a trial with n out of N successes. I’m really surprised they’re only looking at N = 200 and expecting something like n = 30. Binomial data is super noisy and thus N = 200 is a small data size unless the effect is huge.

Is that 0.000102 in prior pseudocount going to matter? Of course not. Is the difference between beta(1, 1) and beta(0.7, 1) going to matter? Nope. If we compare the posteriors beta(30 + 0.7, 170 + 1) and beta(30 + 1, 170 + 1), their posterior 95% central intervals are (0.106, 0.205) and (0.107, 0.206).
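For anyone who wants to reproduce that comparison without a stats library, here is a sketch that gets central intervals by integrating the beta density on a fine grid (midpoint rule) and interpolating the quantiles. The grid size is an arbitrary choice, and `beta_central_interval` is an illustrative helper; in practice you would call a library's beta quantile function.

```python
import math


def beta_log_pdf(x, a, b):
    """Log density of beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x)


def beta_central_interval(a, b, level=0.95, n_grid=20_000):
    """Central interval from a midpoint-rule approximation of the CDF."""
    h = 1.0 / n_grid
    tail = (1 - level) / 2
    targets = (tail, 1 - tail)
    found = []
    cum = 0.0
    for i in range(n_grid):
        mass = math.exp(beta_log_pdf((i + 0.5) * h, a, b)) * h
        # Record each target quantile as the running total crosses it,
        # interpolating linearly inside the current grid cell.
        while len(found) < 2 and cum + mass >= targets[len(found)]:
            frac = (targets[len(found)] - cum) / mass
            found.append((i + frac) * h)
        cum += mass
    return tuple(found)


# Posteriors for n = 30 successes out of N = 200 under the two priors.
for prior_a in (0.7, 1.0):
    lo, hi = beta_central_interval(30 + prior_a, 170 + 1)
    print(f"prior beta({prior_a}, 1): 95% interval ({lo:.3f}, {hi:.3f})")
```

The two intervals differ by about 0.001 at each end, which is the sense in which the choice between these priors gets swamped by the data.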

So I guess it’s like Andrew’s injunction to vote. It might make a difference on the edge if we impose a three-digit threshold somewhere and just manage to cross it in the last digit.

Beta-binomial and Jeffreys priors

I’ll leave it to the Bayes theory wonks to talk about why beta(0.5, 0.5) is the Jeffreys prior for the beta-binomial model. I’ve never dug into the theory enough to understand why anyone cares about these priors other than scale invariance.

No, I don’t believe etc etc., even though they did a bunch of robustness checks.

Dale Lehman writes:

You may have noticed this article mentioned on Marginal Revolution, I [Lehman] don’t have access to the published piece, but here’s a working paper version. It might be worth your taking a look. It has all the usual culprits: forking paths, statistical significance as the filter, etc etc. As usual, it is a complex piece and done “well” by many standards. For example, I had wondered about their breaking time into during stock market times and before the market opens – I thought they might have ignored the time zone differences. However, they did convert all accident data to the Eastern time zone for purposes of determining whether an accident occurred before the market was open or not.

The result – that when the market goes down, fatal driving accidents go up – with a causal interpretation, may be correct. I don’t know. But I find it a bit hard to believe. For one thing, a missing link is whether the driver is really aware of what the market is doing – and then, the link the paper explores is for the stock market performance for the day with the fatal accidents. But the market often is up for part of the day and down for others, so the intra-day variability may undermine what the paper is finding. Perhaps stronger market drops occur on days when the market is down for larger portions of the day (I don’t know, but potentially that could be explored), but I don’t see that they examined anything to do with intraday variability. Finally, the data is not provided, although the source is publicly available (and would probably take me a week to try to match up to what they purported to use). Why the standard can’t be to just release the data (and, yes, the manipulations they did – code, if you’d like) for papers like this. Clearly, they do not expect anyone to actually try to duplicate their results and see if alternative analyses produce different results.

And, the same day, I received this email from James Harvey:

I can’t access this one and the abstract doesn’t give a clue about the methods but the headline claims seem *HIGHLY* unlikely to be true.

A huge majority of Americans don’t know or care what’s going on at opening in the stock market on any given day. For the extremely small percentage who do know and care to cause a detectable rise in traffic fatalities is absolutely preposterous.

The paper claims that these fatalities are caused by “one standard deviation reduction in daily stock market returns”. Exactly what this means isn’t clear to me right off hand, but considering the distribution of ups and downs shown in the diagram here, it looks like one standard deviation is about 1%. Hardly jump-off-the-bridge numbers.

I agree, and I agree.

As we’ve said before, one position we have to move away from is the attitude that a social-science claim in a published paper, or professional-looking working paper, is correct by default. The next step is to recognize that robustness checks don’t mean much. The third step is to not feel that the authors of a published paper, or professional-looking working paper, are owed some sort of deference. If news outlets are allowed to hype these claims without reanalyzing the data and addressing all potential criticisms, then, yes, we should be equally allowed to express our skepticism without reanalyzing the data and figuring out exactly what went wrong.

On the plus side, the commenters on that Marginal Revolution post are pretty much uniformly skeptical of that stock-market-and-crashes claim. The masses are duly skeptical; we now just have to explain the problem to the elites. Eventually even the National Academy of Sciences might understand.

P.S. Yes, it’s possible that good days in the stock market cause safer driving. It’s also possible that the opposite is the case. My point about the above-linked paper is not that it has striking flaws of the pizzagate variety, but rather that we should feel under no obligation to take it seriously. It’s a fancy version of this.

If you are interested in the topic of the psychological effects of stock price swings, then I’d recommend looking at lots of outcomes, particularly those that would affect people who follow the stock market, and move away from the goal of proof (all those robustness tests etc) and move toward an exploratory attitude (what can be learned?). It’s not so easy, given that social science and publication are all about proof—maybe it would be harder to get a paper published on the topic that is frankly exploratory—but I think that’s the way to go, if you want to study this sort of thing.

At this point, you might say how unfair I’m being: the authors of this article did all those robustness checks and I’m still not convinced. But, yeah, robustness checks can fool you. Sorry. I apologize on behalf of the statistics profession for giving generations of researchers the impression that statistical methods can be a sort of alchemy for converting noisy data into scientific discovery. I really do feel bad about this. But I don’t feel so bad about it that I’ll go around believing claims that can be constructed from noise. Cargo-cult science stops here. And, again, this is not at all to single out this particular article, its authors, or the editors of the journal that published the paper. They’re all doing their best. It’s just that they’re enmeshed in a process that produces working papers, published articles, press releases, and news reports, not necessarily scientific discovery.

Can we stop talking about how we’re better off without election forecasting?

This is a public service post of sorts, meant to collect in one place some reasons why getting rid of election forecasts is a non-starter.

First to set context: what are the reasons people argue we should give them up? This is far from an exhaustive list (and some of these reasons overlap) but a few that I’ve heard over the last week are: 

[Image: Fivey Fox Pandora's Box]

  • If the polls are right, we don’t need forecasters. If polls are wrong, we don’t need forecasters.
  • Forecasts are hard to evaluate, and therefore subject to the influence of the forecaster’s goals, e.g. to not appear so certain that they can be blamed. Hence we can’t trust them as unbiased aggregations of evidence.
  • Forecasters may have implicit knowledge from experience, such as a sense of approximately what the odds should be, but it’s hard to transparently and systematically incorporate that knowledge. When a forecaster ‘throws in a little error here, throws in a little error there’ to get the uncertainty they want at a national level, they can end up with model predictions that defy common sense in other ways, calling into question how coherent the predictions are. The ways we want forecasts to behave may sometimes conflict with probability theory.
  • There’s too much at stake to take chances on forecasts that may be wrong but influence behavior. 

I don’t think these questions are unreasonable. But it’s worth considering the implications of a suggestion that forecasts have no clear value, or even do more harm than good, since I suspect some people may jump to this conclusion without recognizing the subtext it entails. Here are some things I think of when I hear people questioning the value of election forecasts: 

#1 – A carefully constructed forecast is (very likely to be) better than the alternative. Or to quote a Bill James line Andrew has used, “The alternative to good statistics is not no statistics, it’s bad statistics.”

What would happen if there were no professional forecasts from groups like the Economist team or professional forecasters like Nate Silver? A deep stillness as we all truly acknowledge the uncertainty of the situation does not strike me as the most likely scenario. Instead, people may look to the sorts of overreactions to polls that we already see in the media to tell them what will happen, without referring back to previous elections. Or maybe they anxiously query friends and neighbors (actually there’s probably some valuable information there, but only if we aggregate across people!), or extrapolate from the attention paid to candidates on public television, or how many signs they see in nearby yards or windows, or examine tea leaves, look at entrails of dead animals, etc. 

One alternative that already exists is prediction markets. But it’s hard to argue that they are more accurate than a carefully constructed forecast. For instance, it’s not clear we can really interpret the prices in a market as aggregating information about win probabilities in any straightforward way, and there are reasons to think they don’t make the best use of new data. They can produce strange predictions too at times, like giving Trump a >10% chance of winning Nevada even after it’s been called by some outlets.

Even in seemingly “extreme” cases like 2016 or 2020, where bigger-than-anticipated poll errors led to forecasts seeming overconfident about a Clinton or Biden win in various ways, forecast models do a better job than reports on the polls themselves by accounting systematically for sampling and non-sampling polling errors and, to some degree, for unanticipated polling error, if imperfectly. Relative to poll aggregators, forecasts make use of fundamentals like regularities in previous elections to interpret seeming shifts.

Some arguments about forecasts no longer being valuable point to 2016 and 2020 as examples of how they’ve lost their utility since the polls are broken. But polls can be broken in different ways, and without other information like fundamentals to fall back on or aggregation methods that smooth out bumps, it can be very hard to know what to pay attention to when incoming information seems to be in disagreement. A forecasting model can be useful for helping us figure out which information to pay more attention to. We can argue about whether for this particular election the average person’s intuitive sense of probability of winning would really be worse if they hadn’t seen a forecast, but that strikes me as somewhat of a straw man comparison. Like any approach we take to reasoning under uncertainty, forecasting needs to be interpreted over the long term. That some aspects of elections appear predictable if we gather the right signals seems hard to dispute. None of the adjustments a forecast does will be perfect, but they’re beyond what the average journalist will do to put the new information in context during an election cycle.

#2 – Saying election forecasts are dangerous because they can be wrong but still influence behavior is a slippery slope.

Let’s consider the implications of the argument that we should stop trying to do them because they can influence behavior in a situation where there’s a lot at stake. Zeynep Tufekci’s recent New York Times article makes many good points that echo challenges we described with evaluation, and has some anecdotes that might seem to suggest getting rid of forecasts entirely due to their potential influence on behavior. For example, Snowden tweeting in 2016 that it was a safe election to vote for third party candidates, and Comey claiming he sent his letter to Congress about reopening the email investigation in part because he thought Clinton would win.

But from the standpoint of science communication, arguing that forecasts are harmful because they could mislead behavior becomes a slippery slope. There’s a chance pretty much any statistics, or really any news, we present to people might inform their behavior, but also might be wrong. Where do we draw the line, and who draws it? Another way to put it is, by arguing that forecasts do more harm than good, we’re implying that it can be ok to censor information for people’s own good, since they won’t be able to make good choices for themselves. Taking responsibility for how information is communicated is great, but I don’t really want my news organizations deciding I can’t handle certain information. And of course, on a completely practical level, censoring information is hard. If there’s a demand, someone will try to meet it.

#3 – We have the potential to learn a lot from forecasts. 

Here I can speak from experience. This election cycle, diving into some of the detailed discussion of the Economist forecast model, in contrast to FiveThirtyEight’s, taught me a lot about election dynamics, like the importance of, and difficulty reasoning about, between-state correlations and the role the economic climate can play. That Nate Silver, Elliott Morris, etc. have thousands of followers on social media suggests that there’s a big group of people who care to hear about the details.

Also while I have some prior familiarity with Bayesian statistics (not to mention I think about uncertainty all the time), seeing these methods applied to elections has probably improved my generic statistical modeling knowledge as well. It’s a great lesson in what options we have at our disposal when trying to incorporate some anticipation of ontological uncertainty in predictions, for instance. Not to mention all the lessons about statistical communication.  

I am admittedly far from the typical layperson consulting these. But if we think that we need to reach some level of assurance that our models are highly accurate before we can put them out in the world, we are eliminating opportunities for the general advancement of data literacy. This isn’t to imply that everyone who consults forecasts is learning a lot about elections; no doubt many go for easy answers that might assuage their anxiety. There are some major kinks to iron out, in how they’re framed and communicated, some of which present major challenges given that people are hard-wired to want certainty and answers. But I think we sometimes don’t give audiences credit for their ability to get more statistically literate over time. There are various types of graphics, scatterplots and animated simulations for instance, that were once uncommon to see at all in the media. It can take gradual exposure over time for certain next steps in data journalism to become part of the average news diet, but it does happen. There’s still a long way to go, but I’m pretty sure we can increase the numbers of people learning through election forecasts by finding ways to prioritize “insight”–into what matters in an election, for example–rather than just answers. I liked the Economist forecast’s visualization of state-wise correlations as a way to invite readers to judge for themselves how reasonable they seem. For the same reason I like a very simple suggestion Andrew made in a talk recently for how one could frame a prediction: “Biden could lose, but there’d be a reason for it.” Getting readers to think a little more deeply about what information a forecast might have missed seems to me like a valuable form of political engagement in itself. 

How science and science communication really work: coronavirus edition

Now that the election’s over, we can return to our regular coronavirus coverage. Nothing new since last night, so I wanted to share a couple of posts from a few months ago that I think remain relevant:

No, there is no “tension between getting it fast and getting it right”:

On first hearing, this statement [“There is always a tension between getting it fast and getting it right”] sounds reasonable. Back when I took typing class in 9th grade, they taught us about the tradeoff between speed and accuracy. The faster you can type, the more errors you make. But I’m thinking this doesn’t apply so much in science. It’s almost the opposite: the quicker you get your ideas out there, the more you can get feedback and find the problems. . . .

This one’s for the Lancet editorial board: A trolley problem for our times (involving a plate of delicious cookies and a steaming pile of poop):

OK, I couldn’t quite frame this one as a trolley problem—maybe those of you who are more philosophically adept than I am can do this—so I set it up as a cookie problem?

Here it is:

Suppose someone was to knock on your office door and use some mix of persuasion, fomo, and Harvard credentials to convince you to let them in so they can deliver some plates of handmade cookies. You tell everyone about this treat and you share it with your friends in the news media. Then a few days later some people from an adjoining office smell something funny . . . and it seems that what those charming visitors left on your desk was not delicious cookies, but actually was a steaming pile of poop!

What would you do?

If you’re the management of Lancet, the celebrated medical journal, then you might well just tell the world that nothing was wrong, sure, there was a minor smell, maybe a nugget or two of poop was in the mix, but really what was on your desk were scrumptious cookies that we should all continue to eat. . . .

I just looove those poop analogies.

The Pfizer-Biontech Vaccine May Be A Lot More Effective Than You Think?

Ian Fellows writes:

I [Fellows] just wrote up a little Bayesian analysis that I thought you might be interested in. Specifically, everyone seems fixated on the 90% effectiveness lower bound reported for the Pfizer vaccine, but the true efficacy is likely closer to 97%.

Please let me know if you see any errors. I’m basing it off of a press release, which is not ideal for scientific precision.

Here’s Fellows’s analysis:

Yesterday an announcement went out that the SARS-CoV-2 vaccine candidate developed by Pfizer and Biontech was determined to be effective during an interim analysis. This is fantastic news. Perhaps the best news of the year. It is however another example of science via press release. There is very limited information contained in the press release and one can only wonder why they couldn’t take the time to write up a two page report for the scientific community.

That said, we can draw some inferences from the release that may help put this in context. From the press release we know that a total of 94 COVID-19 cases were recorded. . . .

We do get two important quotes regarding efficacy.

“Vaccine candidate was found to be more than 90% effective in preventing COVID-19 in participants without evidence of prior SARS-CoV-2 infection in the first interim efficacy analysis

The case split between vaccinated individuals and those who received the placebo indicates a vaccine efficacy rate above 90%, at 7 days after the second dose.”

How should we interpret these? Was the observed rate of infection 90% lower in the treatment group, or are we to infer that the true (population parameter) efficacy is at least 90%? I [Fellows] would argue that the wording supports the latter. . . . the most compatible statistical translation of their press release is that we are sure with 95% probability that the vaccine’s efficacy is greater than 90%. . . .

Assuming my interpretation is correct, let’s back out how many cases were in the treatment group. Conditional on the total number of infections, the number of infections in the treatment group is distributed binomially. We apply the beta prior to this posterior and then transform our inferences from the binomial proportion to vaccine effectiveness. . . .

There is a lot we don’t know, and hopefully we will get more scientific clarity in the coming weeks. As it stands now, it seems like this vaccine has efficacy way above my baseline expectations, perhaps even in the 97% range or higher.

I [Fellows] could be wrong in my interpretation of the press release, and they are in fact talking about the sample effectiveness rather than the true effectiveness. In that case, 8 of the 94 cases would have been in the treatment group, and the interval for the true effectiveness would be between 81.6% and 95.6%. . . .

It is important to have realistic expectations though. Efficacy is not the only metric that is important in determining how useful the vaccine is. Due to the fact that the study population has only been followed for months, we do not know how long the vaccine provides protection for. There is significant evidence of COVID-19 reinfection, so the expectation is that a vaccine will not provide permanent immunity. If the length of immunity is very short (e.g. 3 months), then it won’t be the silver bullet we are looking for. I’d be happy to see a year of immunity and ecstatic if it lasts two. . . .

I’ve not tried to reconstruct this analysis, but I’m a fan of the general idea of trying to reverse-engineer data from published reports. We had a fun example of this a few months ago.
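For anyone who wants to play with the numbers, here is a minimal sketch of the kind of calculation Fellows describes, applied to his fallback scenario of 8 vaccine-arm cases out of 94. It is not his code. It assumes equal-sized arms, so that the share of cases falling in the vaccine arm is theta = (1 - VE) / (2 - VE), and it uses a uniform Beta(1, 1) prior on theta as a stand-in, since the prior he used is not given above. The function names are made up for illustration.

```python
import random


def ve_from_theta(theta):
    """Map the share of cases in the vaccine arm to vaccine efficacy (VE).

    Assumes equal-sized arms, so theta = (1 - VE) / (2 - VE),
    which inverts to VE = (1 - 2 * theta) / (1 - theta).
    """
    return (1 - 2 * theta) / (1 - theta)


def ve_posterior_interval(vaccine_cases, total_cases,
                          prior_a=1.0, prior_b=1.0,
                          draws=200_000, seed=2020):
    """Monte Carlo summary of the posterior for VE.

    Conditional on total_cases, vaccine_cases is binomial with
    proportion theta. A Beta(prior_a, prior_b) prior on theta
    (an assumed, hypothetical choice) gives a Beta posterior.
    Returns the 2.5%, 50%, and 97.5% posterior quantiles of VE.
    """
    rng = random.Random(seed)
    a = prior_a + vaccine_cases
    b = prior_b + total_cases - vaccine_cases
    ve = sorted(ve_from_theta(rng.betavariate(a, b)) for _ in range(draws))

    def quantile(q):
        return ve[int(q * (draws - 1))]

    return quantile(0.025), quantile(0.5), quantile(0.975)


low, median, high = ve_posterior_interval(vaccine_cases=8, total_cases=94)
print(f"VE: median {median:.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

With a different prior the interval will shift a bit, so this should land in the general neighborhood of the 81.6% to 95.6% quoted above without matching it exactly. Changing vaccine_cases shows how strongly the inference depends on how one reads the press release.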

“In the world of educational technology, the future actually is what it used to be”

Following up on this post from Audrey Watters, Mark Palko writes:

I [Palko] have been arguing for a while that the broad outlines of our concept of the future were mostly established in the late 19th/early 20th Centuries and put in its current form in the Postwar Period. Here are a few more data points for the file.

“Books will soon be obsolete in schools” — Thomas Edison (1913)

“If, by a miracle of mechanical ingenuity, a book could be so arranged that only to him who had done what was directed on page one would page two become visible, and so on, much that now requires personal instruction could be managed by print.” — Edward Thorndike (1912)

“The central and dominant aim of education by radio is to bring the world to the classroom, to make universally available the services of the finest teachers, the inspiration of the greatest leaders … and unfolding events which through the radio may come as a vibrant and challenging textbook of the air.” — Benjamin Darrow (1932)

“Will machines replace teachers? On the contrary, they are capital equipment to be used by teachers to save time and labor. In assigning certain mechanizable functions to machines, the teacher emerges in his proper role as an indispensable human being. He may teach more students than heretofore—this is probably inevitable if the world-wide demand for education is to be satisfied—but he will do so in fewer hours and with fewer burdensome chores. In return for his greater productivity he can ask society to improve his economic condition.” — B. F. Skinner (1958)

“I believe that the motion picture is destined to revolutionize our educational system and that in a few years it will supplant largely, if not entirely, the use of textbooks. …I should say that on the average we get about two percent efficiency out of schoolbooks as they are written today. The education of the future, as I see it, will be conducted through the medium of the motion picture… where it should be possible to obtain one hundred percent efficiency.” — Thomas Edison (1922)

“At our universities we will take the people who are the faculty leaders in research or in teaching. We are not going to ask them to give the same lectures over and over each year from their curriculum cards, finding themselves confronted with another roomful of people and asking themselves, ‘What was it I said last year?’ This is a routine which deadens the faculty member. We are going to select instead the people who are authorities on various subjects — the people who are most respected within their respective departments and fields. They will give their basic lecture course just once to a group of human beings, including both the experts of their own subject and bright children and adults without special training in their field. These lectures will be recorded as Southern Illinois University did my last lecture series of fifty-two hours in October 1960. They will make moving-picture footage of the lectures as well as hi-fi tape recording. Then the professors and their faculty associates will listen to the recordings time and again” — R. Buckminster Fuller (1962)

“The machine itself, of course, does not teach. It simply brings the student into contact with the person who composed the material it presents. It is a laborsaving device because it can bring one programmer into contact with an indefinite number of students. This may suggest mass production, but the effect upon each student is surprisingly like that of a private tutor.” — B. F. Skinner (1958)

To pull up these quotes is not to argue that distance learning, video instruction, computer drills, etc., are bad ideas. I expect that all these innovations will become increasingly important in education. They’ll take work—for example, it would be great to have computer drills for intro statistics, but it only makes sense to do that if we can figure out what to drill on: I don’t see the value in students learning how to compute tail-area probabilities or whatever—but at some point this work will get done, and in the meantime we can use whatever crappy tools are available. I think Palko would agree with me on the potential value of these technologies.

No, the point of the quotes is that the conceptual framework was already here, a century ago. And if the ideas have been around for over 100 years but they’re just now getting implemented, what does that mean? I think this tells us that the devil is in the details etc., that the challenge is not just to say the phrases “flipped classroom,” “computer-aided instruction,” etc., but to really get them to work at scale. Again, I do think this will happen, and we should be realistic about the challenges.

P.S. Watters also has this amusing faq.

Lying with statistics

As Deb Nolan and I wrote in our book, Teaching Statistics: A Bag of Tricks, the most basic form of lying with statistics is simply to make up a number. We gave the example of Senator McCarthy’s proclaimed (but nonexistent) list of 205 Communists, but we have a more recent example:

One of the supposed pieces of evidence [of votes being recorded for dead people] was a list that circulated on Twitter Thursday evening allegedly containing names, birth dates, and zip codes for registered voters in Michigan. The origin of the list and the identity of the person who first made it public are not known.

CNN examined 50 of the more than 14,000 names on the list by taking the first 25 names on the list and then 25 more picked at random. We ran the names through Michigan’s Voter Information database to see if they requested or returned a ballot. We then checked the names against publicly available records to see if they were indeed dead.

Of the 50, 37 were indeed dead and had not voted, according to the voter information database. Five people out of the 50 had voted — and they are all still alive, according to public records accessed by CNN. The remaining eight are also alive but didn’t vote.


In an interview with Maria Bartiromo on Fox News on Nov. 8, Republican Sen. Lindsey Graham said the Trump campaign had “evidence of dead people voting in Pennsylvania . . . The Trump team has canvassed all early voters and absentee mail-in ballots in Pennsylvania. And they have found over 100 people they think were dead, but 15 people that we verified that have been dead who voted. But here is the one that gets me. Six people registered after they died and voted. . . . I do know that we have evidence of six people in Pennsylvania registering after they died and voting after they died. And we haven’t looked at the entire system.” . . .

We reached out to the Trump campaign and Graham’s Senate office for details about the Trump campaign research that concluded some number of ballots were cast by people who have died, but we did not get a response.

Graham was perhaps savvy enough not to give the list of 100, or 15, or 6. No list; nothing can be checked.

What’s interesting about this example is that no quantitative analysis is needed; you can just check the individual cases. But people don’t always check.

As the saying goes, when there’s smoke there’s smoke.

Bayesian Workflow

Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Paul-Christian Bürkner, Lauren Kennedy, Jonah Gabry, Martin Modrák, and I write:

The Bayesian approach to data analysis provides a powerful way to handle uncertainty in all observations, model parameters, and model structure using probability theory. Probabilistic programming languages make it easier to specify and fit Bayesian models, but this still leaves us with many options regarding constructing, evaluating, and using these models, along with many remaining challenges in computation. Using Bayesian inference to solve real-world problems requires not only statistical skills, subject matter knowledge, and programming, but also awareness of the decisions made in the process of data analysis. All of these aspects can be understood as part of a tangled workflow of applied Bayesian statistics. Beyond inference, the workflow also includes iterative model building, model checking, validation and troubleshooting of computational problems, model understanding, and model comparison. We review all these aspects of workflow in the context of several examples, keeping in mind that in practice we will be fitting many models for any given problem, even if only a subset of them will ultimately be relevant for our conclusions.

This is a long article (77 pages! So long it has its own table of contents!) because we had a lot that we wanted to say. We were thinking about some of these ideas a few months ago, and a few years earlier, and a few years before that. Our take on workflow follows a long tradition of applied Bayesian model building, going back to Mosteller and Wallace (and probably to Laplace before that), and it also relates to S and R and the tidyverse and other statistical computing environments. We’re trying to take ideas of good statistical practice and bring them into the tent, as it were, of statistical methodology.

We see three benefits to this research program. First, by making explicit various aspects of what we consider to be good practice, we can open the door to further developments, in the same way that the explicit acknowledgment of “exploratory data analysis” led to improved methods for data exploration, and in the same way that formalizing the idea of “hierarchical models” (instead of considering various tricks for estimating the prior distribution from the data) has led to more sophisticated multilevel models. Second, laying out a workflow is a step toward automation of these important steps of statistical analysis. Third, we would like our computational tools to work well with real workflows, handling the multiplicity of models that we fit in any serious applied project. We consider this Bayesian Workflow article to be a step in these directions.

My scheduled talks this week

Department of Biostatistics, Harvard University: Today, Tues 10 Nov 2020, 1pm

Department of Marketing, Arison School of Business, Israel: Thurs 12 Nov 2020, 10am (US eastern time)

St. Louis Chapter of the American Statistical Association: Thurs 12 Nov 2020, 5pm (US eastern time)

The listed topic for the first two events is election forecasting and for the third event it’s the replication crisis and how to avoid it, but I guess we’ll talk about all sorts of other things too.

What happens to the median voter when the electoral median is at 52/48 rather than 50/50?

Here’s a political science research project for you.

Joe Biden got about 52 or 53% of the two-party vote, which was enough for him to get a pretty close win in the electoral college. As we’ve discussed, 52-48 is a close win by historical or international standards but a reasonably big win in the context of recent U.S. politics, where the Democrats have been getting close to 51% in most national elections for president and congress. I’m not sure how the congressional vote ended up, but I’m guessing it’s not far from 51/49 also.

Here’s the background. From a combination of geography and gerrymandering, Republicans currently have a structural edge in races for president and congress: Democrats need something around 52% of the two-party vote to win, while Republicans can get by with 49% or so. For example, in 2010 the Republicans took back the House of Representatives with 53.5% of the two-party vote, but they maintained control in the next three elections with 49.3%, 52.9%, and 50.6%. The Democrats regained the House in 2018 with 54.4% of the two-party vote.

And it looks like this pattern will continue, mostly because Democrats continue to pile up votes in cities and suburbs and also because redistricting is coming up and Republicans control many key state governments.

And here’s the question. Assuming this does continue, so that Republicans can aim for 49% support knowing that this will give them consistent victories at all levels of national government, while Democrats need at least 52% to have a shot . . . how does this affect politics indirectly, at the level of party positioning?

When it comes to political influence, the effect is clear: as long as the two parties’ vote shares fluctuate in the 50% range, Republicans will be in power for more of the time, which directly addresses who’s running the government but also has indirect effects: if the Republicans are confident that in a 50/50 nation they’ll mostly stay in power, this is a motivation for them to avoid compromise and go for deadlock when Democrats are in charge, on the expectation that if they wait a bit, the default is that they’ll come back and win. (A similar argument held in reverse after 2008 among Democrats who believed that they had a structural demographic advantage.)

But my question here is not about political tactics but rather about position taking. If you’re the Democrats and you know you need to regularly get 52% of the vote, you have to continually go for popular positions in order to get those swing voters. There’s a limit to how much red meat you can throw to your base without scaring the center. Conversely, if all you need is 49%, you have more room to maneuver: you can go for some base-pleasing measures and take the hit among moderates.

There’s also the question of voter turnout. It can be rational, even in the pure vote-getting sense, to push for positions that are popular with the base, because you want that base to turn out to vote. But this should affect both parties, so I don’t think it interferes with my argument above. How much should we expect electoral imbalance to affect party positioning on policy issues?

The research project

So what’s the research project? It’s to formalize the above argument, using election and polling data on specific issues to put numbers on these intuitions.

As indicated by the above title, a first guess would be that, instead of converging to the median voter, the parties would be incentivized to converge to the voter who’s at the 52nd percentile of Republican support.

The 52% point doesn’t sound much different than the 50% point, but, in a highly polarized environment, maybe it is! If 40% of voters are certain Democrats, 40% are certain Republicans, and 20% are in between, then we’re talking about a shift from the median to the 60th percentile of unaffiliated voters. And that’s not nothing.
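The arithmetic behind that 60th percentile can be spelled out in a few lines. This is only a sketch of the toy example above: line voters up from most Democratic to most Republican, treat the 40/20/40 split as given, and ask where a given overall percentile falls within the in-between group. The function name is made up for illustration.

```python
def percentile_within_middle(overall_percentile,
                             certain_dem=0.40, certain_rep=0.40):
    """Locate an overall percentile inside the in-between group.

    Voters are ordered from most Democratic to most Republican:
    first the certain Democrats, then the in-between voters, then
    the certain Republicans. The 40/40 defaults are the toy numbers
    from the text, not estimates.
    """
    middle = 1.0 - certain_dem - certain_rep
    if middle <= 0:
        raise ValueError("there must be some in-between voters")
    if not certain_dem <= overall_percentile <= certain_dem + middle:
        raise ValueError("that percentile does not fall in the in-between group")
    return (overall_percentile - certain_dem) / middle


# The median voter sits at the middle of the in-between group...
print(f"{percentile_within_middle(0.50):.0%}")  # 50%
# ...while the 52nd-percentile voter sits at its 60th percentile.
print(f"{percentile_within_middle(0.52):.0%}")  # 60%
```

Shrinking the in-between group makes the shift bigger: with a 45/10/45 split, the same two points overall would move the target from the 50th to the 70th percentile of unaffiliated voters.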

But, again, such a calculation is a clear oversimplification, given that neither party is anything close to that median. Yes, there are particular issues where one party or the other is close to the median position of Americans, but overall the two parties are well separated ideologically, which of course is a topic of endless study in the past two decades (by myself as well as many others). The point of this post is that, even in a polarized environment, there’s some incentive to appeal to the center, and the current asymmetry of the electoral system at all levels would seem to make this motivation much stronger for Democrats than for Republicans. Which might be one reason why Joe Biden’s talking about compromise but you don’t hear so much of that from the other side.

P.S. As we discussed the other day, neither candidate seemed to make much of a play for the center during the campaign. It seemed to me (just as a casual observer, without having made a study of the candidates’ policy positions and statements) that in 2016 both candidates moved to the center on economic issues. But in 2020 it seemed that Trump and Biden were staying firmly to the right and left, respectively. I guess that’s what you do when you think the voters are polarized and it’s all about turnout.

Relatedly, a correspondent writes:

Florida heavily voted for 15 minimum wage yet went to Trump. Lincoln project tried to get repubs and didnt work. florida voted for trump because of trump, not because of bidens tax plan.

To which I reply: Yeah, sure, but positioning can still work on the margin. Maybe more moderate policy positions could’ve moved Biden from 52.5% to 53% of the two-party vote, but then again he didn’t need it.

P.P.S. Back in his Baseball Abstract days, Bill James once wrote something about the different strategies you’d want if you’re competing in an easy or a tough division. In the A.L. East in the 1970s, it generally took 95+ wins to reach the playoffs. As an Orioles fan, I remember this! In the A.L. West, 90 wins were often enough to do the trick. Bill James conjectured that if you’re playing in an easier division, it could be rational to go for certain strategies that wouldn’t work in a tougher environment where you might need 100 regular-season wins. He didn’t come to any firm conclusions on the matter, and I’m not really clear how important the competitiveness of the division is, given that it’s not like you can really target your win total. And none of this matters much now that MLB has wild cards.

P.P.P.S. Senator Lindsey Graham is quoted as saying on TV, “If Republicans don’t challenge and change the U.S. election system, there will never be another Republican president elected again,” but it’s hard for me to believe that he really thinks this. As long as the Republican party doesn’t fall apart, I don’t see why they can’t win 48% or even 50% or more in some future presidential races.

It seems nuts for a Republican to advocate that we “challenge and change the U.S. election system,” given the edge it’s currently giving them. In the current political environment, every vote counts, and the winner-take-all aggregation of votes by states and congressional districts is a big benefit to their party.

UX issues around voting

While Andrew’s worrying about how to measure calibration and sharpness on small N probabilistic predictions, let’s consider some computer and cognitive science issues around voting.

How well do elections measure individual voter intent?

What is the probability that a voter who tries to vote has their intended votes across the ballot registered? Spoiler alert. It’s not 100%.

We also want to know if the probability of having your vote recorded depends on the vote. Or on the voter. To put it in traditional statistical terms, if we think of the actual vote count as an estimate of voter intent, what is the error and bias of the estimator?
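To make the "error and bias of the estimator" framing concrete, here is a minimal sketch in R. All of the numbers are made up for illustration (they are not from any study), and it assumes a two-candidate race in which a voter's intended vote is either recorded correctly or lost, with a loss rate that differs by candidate.

```r
# Hypothetical numbers, for illustration only (not from any study).
intent_a <- 0.52        # true share of voters intending to vote for A
intent_b <- 1 - intent_a
p_record_a <- 0.99      # chance an intended A vote is actually recorded
p_record_b <- 0.97      # chance an intended B vote is actually recorded

# Expected recorded two-party share for A, and its bias relative to intent.
recorded_a <- intent_a * p_record_a
recorded_b <- intent_b * p_record_b
share_a <- recorded_a / (recorded_a + recorded_b)
bias_a <- share_a - intent_a

round(c(share = share_a, bias = bias_a), 4)
```

Under these assumed rates, a two-point gap in recording probability shifts the recorded share by about half a percentage point. That is small, but not nothing in a close race.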

Not very well, it turns out

User interface (UX) errors are non-negligible. For a concrete analysis around the 2000 U.S. presidential election, see the following page summarizing some of the findings of a big study partly coordinated by our collaborator Steve Ansolabehere.*

No surprise here

There’s nothing at all surprising here from a UX point of view. Everyone who’s ever worked on UX knows that UX errors are the norm, not the exception. The point of building a good UX is to minimize errors to the extent possible.

I also want to point out that banks seem to manage handing out cash through their ATMs with low enough error that it’s still profitable. Now I’m curious about the error rate.

Simple Example

In case you don’t want to click through to see the real example, an example of a classic UX blunder in a voting context is the following ballot.

[ ] CANDIDATE 1 [ ] CANDIDATE 2 [ ] CANDIDATE 3

This kind of layout violates the basic UX principle of putting checkboxes distinctively closer to their associated items than to other items. With the layout above, a voter who intends to vote for candidate 2 might accidentally vote for candidate 3 because the two boxes are equally close to the name “CANDIDATE 2”.

It’s better with more whitespace so that the boxes are visually identified with their associated item.

[ ] CANDIDATE 1      [ ] CANDIDATE 2      [ ] CANDIDATE 3

A vertical layout can solve some problems, but as the example in the article I linked above shows, it can introduce other ones if done poorly.

This is just one of the many blunders that are quite common in user interfaces.

Anecdote about my own ballot this year

Personally, I had a question about the fine print of the NYC ballots because there were a bunch of judge candidates and it wasn’t clear to me how many I could vote for. I actually flipped to the instructions, which said the number would be at the top. It wasn’t. I went and asked the poll worker out of curiosity (I’m fascinated by UX issues). Turns out they moved the number to relatively fine print to the left of the column of checkboxes. Now this particular vote didn’t matter as far as I can tell because there were only four candidates and you could choose up to four. Just an example of the kind of confusion you can run into.

* Steve’s the one who introduced me to Andrew 25 years ago after Andrew and I both moved to NYC. The second-to-last grant Andrew and I got before I moved to the Flatiron Institute was with Steve to work on his Cooperative Congressional Election Study data collection and analysis. That project’s still ongoing and the data, models, etc. are all open access.

Stop-and-frisk data

People sometimes ask us for the data from our article on stop-and-frisk policing, but for legal reasons these data cannot be shared.

Other data are available, though. Sharad Goel writes:

You might also check out stop-and-frisk data from Chicago and Seattle. And, if you’re interested in traffic stop data as well, see our Open Policing Project, a repository of about 200 million traffic stop records from across the country.

So, if you’re interested, that’s a start.

What would it mean to really take seriously the idea that our forecast probabilities were too far from 50%?

Here’s something I’ve been chewing on that I’m still working through.

Suppose our forecast in a certain state is that candidate X will win 0.52 of the two-party vote, with a forecast standard deviation of 0.02. Suppose also that the forecast has a normal distribution. (We’ve talked about the possible advantages of long-tailed forecasts, but for the purpose of this example, the precise form of the distribution doesn’t matter, so I’ll use the normal distribution for simplicity.)

Then your 68% predictive interval for the candidate’s vote share is [0.50, 0.54], and your 95% interval is [0.48, 0.56].
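As a quick check in R, assuming the normal forecast above and using the usual plus-or-minus one and two standard deviation rules (the variable names here are just for illustration):

```r
mu <- 0.52     # forecast vote share
sigma <- 0.02  # forecast standard deviation

# Intervals of +/- 1 and +/- 2 standard deviations around the forecast.
interval_1sd <- mu + c(-1, 1) * sigma
interval_2sd <- mu + c(-2, 2) * sigma

# Probability mass the normal forecast puts inside each interval.
coverage_1sd <- diff(pnorm(interval_1sd, mu, sigma))
coverage_2sd <- diff(pnorm(interval_2sd, mu, sigma))

round(c(coverage_1sd, coverage_2sd), 3)
```

The coverages come out to roughly 68% and 95%, which is all the "68% interval" and "95% interval" labels are claiming.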

Now suppose the candidate gets exactly half of the vote. Or you could say 0.499, the point being that he lost the election in that state.

This outcome falls on the boundary of the 68% interval: it’s one standard deviation away from the forecast. In no sense would this be called a prediction error or a forecast failure.

But now let’s say it another way. The forecast gave the candidate an 84% chance of winning! And then he lost. That’s pretty damn humiliating. The forecast failed.

Here we might just stop and say: Ha ha, people can’t understand probability.

But I don’t want to frame it that way. Instead, flip it around. If you don’t want to go around regularly assigning 84% probabilities to this sort of event, then, fine, assign a lower probability that candidate X wins, something closer to 50/50. Suppose you want the candidate to have a 60% chance of winning. Then you need to do some combination of shifting his expected vote toward 0.5 and increasing the predictive standard deviation to get this to work.

So what would it take? If our point prediction for the candidate’s vote share is 0.52, how much would we need to increase the forecast standard deviation to get his win probability down to 60%?

Let’s start with our first distribution, just to check that we’re on track:

> pnorm(0.52, 0.50, 0.02)
[1] 0.84

That’s right. A forecast of 0.52 +/- 0.02 gives you an 84% chance of winning.

We want to increase the sd in the above expression so as to send the win probability down to 60%. How much do we need to increase it? Maybe send it from 0.02 to 0.03?

> pnorm(0.52, 0.50, 0.03)
[1] 0.75

Uh, no, that wasn’t enough! 0.04?

> pnorm(0.52, 0.50, 0.04)
[1] 0.69

0.05 won’t do it either. We actually have to go all the way up to . . . 0.08:

> pnorm(0.52, 0.50, 0.08)
[1] 0.60

That’s right. If your best guess is that candidate X will receive 0.52 of the vote, and you want your forecast to give him a 60% chance of winning the election, you’ll have to ramp up the sd to 0.08, so that your 95% forecast interval is a ridiculously wide 0.52 +/- 2*0.08, or [0.36, 0.68].
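Rather than guessing, you can also solve for the sd directly. Under the normal assumption, the win probability is pnorm((point - 0.50) / sd), so the sd that hits a target probability is (point - 0.50) / qnorm(target). Here is a minimal sketch; the helper name sd_for_win_prob is mine, just for illustration.

```r
# Forecast sd needed so that a normal forecast centered at `point`
# puts probability `target` above the winning threshold.
sd_for_win_prob <- function(point, target, threshold = 0.50) {
  (point - threshold) / qnorm(target)
}

sd_needed <- sd_for_win_prob(point = 0.52, target = 0.60)
round(sd_needed, 3)

# Sanity check: plugging the sd back in should recover the 60% target.
check <- pnorm(0.52, 0.50, sd_needed)
round(check, 2)
```

The exact answer is a hair under 0.08, which matches the trial-and-error result.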

Here’s the point. If you really want your odds to be as close as 60/40, and you don’t want to allow really extreme outcomes in your forecast, then You. Have. To. Move. Your. Point. Prediction. To. 0.50.

For example, here’s what you get if you move your prediction halfway to 0.50 and also increase your uncertainty:

> pnorm(0.51, 0.50, 0.03)
[1] 0.63

Still a bit more than 60%, but we’re getting there.
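You can also turn the question around: for a given forecast sd, how close to 0.50 does the point prediction have to be to get exactly a 60% win probability? A small sketch under the same normal assumption (the particular sd values are just ones I picked to look at):

```r
target <- 0.60
sds <- c(0.02, 0.03, 0.04)

# Point prediction that yields the target win probability for each sd.
points <- 0.50 + sds * qnorm(target)

data.frame(sd = sds, point = round(points, 4))
```

With any of these sd values, the point prediction has to sit within about a percentage point of 0.50.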

And what does this imply for election forecasts?

If our probabilistic forecast of candidate X’s vote share is 0.52 +/- 0.02, that would traditionally be considered a “statistical tie” or “within the margin of error.” And we wouldn’t feel embarrassed if candidate X were to suffer a close loss: that would be within the expected range of uncertainty.

But, considered as a probabilistic forecast, 0.52 +/- 0.02 is a strong declaration, a predictive probability of 84% that candidate X wins. 5-to-1 odds.

How is it that you can offer 5-to-1 odds based on a “statistical tie”???

What to do?

It seems like we’re trapped here between the immovable force and the irresistible object. On one hand, it seems weird to go around offering 5-to-1 odds to something that could be called a statistical tie; on the other hand, if we really feel we need to moderate the odds, then as discussed above we’d have to shift the forecast toward 0.50.

I’m still not sure on this, but right now I guess, yeah, if you really don’t buy the long odds, I think you should be shifting the prediction.

It would go like this: You do your forecast and it ends up as 0.52 +/- 0.02, but you don’t feel comfortable offering 5-to-1 odds. Maybe you only feel comfortable saying the probability is 60%. So you have to shift your point prediction down to 51% or maybe lower and also increase your uncertainty.

This now looks a lot like Bayesian inference—but the hitch is that your original 0.52 +/- 0.02 was already supposed to be Bayesian. The point is that the statement, “you don’t feel comfortable offering 5-to-1 odds,” represents information that was not already in your model.

So your next job is to step back and ask, why don’t you feel comfortable offering 5-to-1 odds? What’s wrong with that 84% probability, exactly? It’s tempting to just say that we should be wary about assigning probabilities far from 50% to any event, but that’s not right. For example, it would’ve been nuts to assign anything less than a 99% probability that the Republican candidate for Senate in Wyoming would cruise to victory. And, even before the election, I thought that Biden’s chance of winning in South Dakota was closer to 1% than to the 6% assigned by Fivethirtyeight. We talked about this in our recent article, that if you’re a forecaster who’s gonna lose reputation points every time a forecast falls outside your predictive interval, that creates an incentive to make those intervals wider. Maybe this is a good incentive, as it counteracts other incentives for overconfidence.

But then a lot has to do with what’s considered the default. Are the polls the default? (If so, which polls?) Is the fundamentals-based model the default? (If so, which model?) Is 50/50 the default? (If so, popular vote or electoral vote?)

So I’m still working this through. The key point I’ve extracted so far is that if we want to adjust our model because a predictive probability seems too high, we should think about shifting the point prediction, not just spreading out the interval.

From a Bayesian standpoint, the question is, when we say that the probability should be close to 50/50 (at least for certain elections), what information does this represent? If it represents generic information available before the beginning of the campaign, it should be incorporated into the fundamentals-based model. If it represents information we learn during the campaign, it should go in as new data, in the same way that polls represent new data. Exactly how to do this is another question, but I think this is the right way of looking at it.

Is there a middle ground in communicating uncertainty in election forecasts?

Beyond razing forecasting to the ground, over the last few days there’s been renewed discussion online about how election forecast communication again failed the public. I’m not convinced there are easy answers here, but it’s worth considering some of the possible avenues forward. Let’s put aside any possibility of not doing forecasts, and assume the forecasts were as good as they possibly could be this year (which is somewhat of a tautology anyway). Communication-wise, how did forecasters do and how much better could they have done? 

Image of guy ignoring his girlfriend (variance) for another girl (expected value)

Image from Kareem Carr

We can start by considering how forecast communication changed relative to 2016. The biggest differences in uncertainty communication that I noticed looking at FiveThirtyEight and Economist forecast displays were: 

1) More use of frequency-based presentations for probability, including reporting the odds as frequencies, and using frequency visualizations (FiveThirtyEight’s grid of maps as header and ball-swarm plot of EC outcomes). 

2) De-emphasis on probability of win by FiveThirtyEight (through little changes like moving it down the page, and making the text smaller).

3) FiveThirtyEight’s introduction of Fivey Fox, who in multiple of his messages reminded the reader of unquantifiable uncertainty and specifically the potential for crazy (very low probability) things to happen. 

Did these things help? Probably a little bit. I for one read Fivey Fox as an expression of Silver’s belief that something 2016-like could repeat, a way to prospectively cover his ass. The frequency displays may have helped some people get a better sense of where probability of win comes from (i.e., simulation). Maybe readers directed a bit more attention to the potential for Trump to win by being shown discrete outcomes in which he did. (Taking things a step further, Matt Kay’s Presidential Plinko board presented the Economist’s and FiveThirtyEight’s predictions of Biden’s probability of winning plinko style with no text probabilities, so that the reader has no choice but to get the probability viscerally by watching it.) While these are certainly steps in the right direction, if probability of winning is the culprit behind people’s overtrust in forecasts (as suggested by some recent research), then we haven’t really changed the transaction very much. I suspect that the average reader visiting forecast sites for a quick read on how the race was progressing probably didn’t treat the numbers too differently based on the display changes alone.
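For readers who haven’t seen where a frequency statement like "wins in 84 of 100 simulations" comes from, here is a minimal sketch in R. It is not any forecaster’s actual model: it assumes a single made-up race with a normal forecast of 0.52 +/- 0.02 for candidate X’s two-party vote share, and simply counts simulated wins.

```r
set.seed(2020)  # arbitrary seed, only so the draws are reproducible

n_sims <- 100
# Hypothetical forecast: vote share ~ normal(0.52, 0.02).
sim_share <- rnorm(n_sims, mean = 0.52, sd = 0.02)

# Count the simulated elections in which candidate X gets more than half.
wins <- sum(sim_share > 0.50)

cat(sprintf("Candidate X wins in %d of %d simulated elections\n", wins, n_sims))
```

The count will bounce around from seed to seed, which is part of the appeal of the frequency framing: the discrete losses are right there on the page.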

So, what could be done instead, assuming news organizations aren’t going to quit providing forecasts anytime soon?  

First, if people are predisposed to zero in on probability of winning (and then treat it as more or less deterministic), we could try removing the probability of winning entirely. Along the same lines, we could also remove other point estimates like predicted vote shares. So instead, show only intervals or animated possible outcomes for popular or EC votes. 

If probability of winning is what readers come for, then a drawback of doing this is that you’re no longer directly addressing the readers’ demand. But, would they find a way to fulfill that need anyway? My sense is that this would make things harder for the reader, but I’m not sure it would be enough. We didn’t focus on an election context, but Alex Kale, Matt Kay, and I recently did an experiment where we asked people to judge probability of superiority and make decisions under uncertainty given displays of two distributions, one representing what would happen to their payoff if they made an investment, the other if they didn’t. We varied how we visualized the two distributions (intervals, densities, discretized densities, and draws from the distributions shown one pair at a time in an animation). We expected that when you make the point estimate much harder to see, like in the animation, where the only way to estimate central tendency is to account for the uncertainty, people would do better, but if we then added a mark showing the mean to the visualization, they’d do worse, because they’d use some simpler heuristics on how big the difference in means looks. But that’s not what we found. Many people appeared to be using heuristics like judging the difference in means and mapping that to a probability scale even when the animated visualization was showing them the probability of superiority pretty directly! Some of this is probably related to the cognitive load of keeping track of payoff functions and looking at uncertainty graphics at once (this was done on Mechanical Turk). Still, I learned something about how people are even more “creative” than I thought they could be when it comes to suppressing uncertainty. If similar things apply in an election context, readers might still leave the page with an answer about the probability of their candidate winning, but it would just be further off from the model-predicted probability.
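To see why the difference-in-means heuristic goes wrong, here is a small sketch of probability of superiority in R. The payoff numbers are made up and are not from our experiment; it assumes the two payoffs are independent and normally distributed, in which case P(A > B) = pnorm((mean_a - mean_b) / sqrt(sd_a^2 + sd_b^2)).

```r
# Probability that a draw from A exceeds a draw from B,
# assuming independent normal distributions.
prob_superiority <- function(mean_a, sd_a, mean_b, sd_b) {
  pnorm((mean_a - mean_b) / sqrt(sd_a^2 + sd_b^2))
}

# Hypothetical payoffs: the same 10-point difference in means,
# shown with wide and with narrow uncertainty.
p_wide   <- prob_superiority(110, 20, 100, 20)
p_narrow <- prob_superiority(110, 5, 100, 5)

round(c(wide = p_wide, narrow = p_narrow), 2)
```

The difference in means is identical in both cases, but the probability of superiority is around 0.64 with the wide distributions and around 0.92 with the narrow ones. A reader who maps the gap between the means straight onto a probability scale will give the same answer for both.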

Another option might be for forecast pages to lead with the upper and lower bounds on the posterior estimates. Anon suggested something like this. I can imagine, for instance, a forecast page where you first get the lower bound on a candidate’s predicted EC votes, then the upper bound. This could be accompanied by some narrative about what configurations of state results could get you there, and what model assumptions are likely to be keeping it from going lower/higher. 

I suspect reframing the communication around the ends of the intervals could help because it implies that the forecaster or news org thinks the uncertainty is very important. Sort of like if Fivey Fox were the header on the FiveThirtyEight forecast page, with a sign saying, Don’t really trust anything here! And then reappeared to question all of the predictions in the graphics below.  You’d probably think twice. Some recent work by van der Bles, van der Linden, Freeman and Spiegelhalter looks at a related question – if you convey uncertainty in a news context with a simple text statement in the article (e.g., “There’s some uncertainty around these estimates, the value could be higher or lower”) versus numerically as a range, which affects trust more? They find that the imprecise text description has a bigger influence. 

In general, leading with discussion of model assumptions, which might seem more natural when you’re focusing on the edges of the distribution, seems like a good thing for readers in the long run. It gives them an intro point to think for themselves about how good model assumptions are. 

But at the same time, it’s hard to imagine whether this kind of treatment would ever happen. First, how much will readers tolerate a shift of emphasis to the assumptions and the uncertainty? Could we somehow make this still seem fun and engaging? One relatively novel aspect of 2020 election forecast discussion was the use of visualizations to think about between-state correlations (e.g., here, and here). Could readers get into the detective work of finding weird bugs enough to forget that they came for probability of winning?

It seems rather doubtful that the average reader would. From things FiveThirtyEight has said about their graphics redesign, the majority are coming to forecasts for quick answers. If the graphics still show the entire posterior distribution at once in some form, maybe people just scroll to this every time. As long as there are at least the bounds of a distribution, I suspect most people can easily figure out how to get the answers they want; we might just be adding noise to the process.

On a more informational level, I’m not sure it’s possible to emphasize bounds and assumptions enough to prevent overtrust and backlash if things go wrong, but not enough to make readers feel like there’s no information value in what they’re getting. E.g., emphasizing the ends of intervals stretching too far above and below 50% vote percentage suggests we don’t know much. 

So the open questions seem to be, how hard do you have to make it? And if you make it hard enough, are you also likely to be killing the demand for forecasting entirely, since many readers aren’t motivated to spend the time or effort to think about the uncertainty?  Given goals that readers have for looking at forecasts (not to mention incentives of news orgs) is a “middle ground” in forecast communication possible?  

Other options involve changing the framing more drastically. The Economist could have labeled the forecast as predicting voter intent, not directly predicting outcomes, as pointed out here. If readers stopped and thought about this, it might have helped. Still, it’s unclear that most readers would take the time to process this thoughtfully. Some readers maybe, but probably not the ones that top level forecasts are currently designed for.

Another option is to consider switching away from absolute vote shares entirely, to focus displays on what the models say about relative changes to expect over the prior election. I like this idea because I think it would make probability of winning seem less coherent. What does it mean to provide relative change in probability of winning over a prior event for which we don’t know the probability? Relative predictions might still answer an information need, in that people can interpret the forecast simply by remembering the last election, and all the context that goes along with it, and have some idea of what direction to expect changes this year to be in. But on the other hand, this approach could also be immobilizing, like when one’s candidate narrowly won the last election, but this one they’re predicted to have less of a narrow lead.  Maybe we need to give relative predictions over many past elections, so that the older one is, the more lenses they have for thinking about what this one might be like.

At least in 2020, if a forecaster really wanted to emphasize the potential for ontological uncertainty, they could also tell the reader how much he or she should expect the vote predictions to be off if they’re off by the same amounts as in the last election. Kind of like leading with acknowledgment of one’s own past errors. But whether news organizations would agree to do this is another question. There’s also some aspect of suppressing information that might be unrealistic. Can you really hide the specific numbers while making all the data and code open? Do you end up just looking foolish?  

At the end of the day, I’m not sure the revenue models used by news organizations would have any patience with trying to make the forecasts harder to overtrust, but it’s interesting to reflect on how much better they could possibly get before losing people’s interest.