## New golf putting data! And a new golf putting model!

Part 1

Here’s the golf putting data we were using, typed in from Don Berry’s 1996 textbook. The columns are distance in feet from the hole, number of tries, and number of successes:

x n y
2 1443 1346
3 694 577
4 455 337
5 353 208
6 272 149
7 256 136
8 240 111
9 217 69
10 200 67
11 237 75
12 202 52
13 192 46
14 174 54
15 167 28
16 201 27
17 195 31
18 191 33
19 147 20
20 152 24

Graphed here:

Here’s the idealized picture of the golf putt, where the only uncertainty is the angle of the shot:

Which we assume is normally distributed:

And here’s the model expressed in Stan:

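data {
  int J;         // number of distance bins
  int n[J];      // attempts in each bin
  vector[J] x;   // distance from the hole, in feet
  int y[J];      // successes in each bin
  real r;        // radius of the ball
  real R;        // radius of the hole
}
parameters {
  real<lower=0> sigma;  // sd of the shot angle, in radians
}
model {
  vector[J] p;
  for (j in 1:J){
    p[j] = 2*Phi(asin((R-r)/x[j]) / sigma) - 1;
  }
  y ~ binomial(n, p);
}
generated quantities {
  real sigma_degrees;
  sigma_degrees = (180/pi())*sigma;
}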

Fit to the above data, the estimate of sigma_degrees is 1.5. And here’s the fit:
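As a rough numerical check on that fit, here's a minimal Python sketch that evaluates the same formula as the Stan model at sigma_degrees = 1.5 and prints it next to a few observed success rates. The ball and hole diameters (1.68 and 4.25 inches) are assumptions on my part, since the post doesn't state the values of r and R, and the function name is just for illustration.

```python
import math

# Assumed physical constants (not stated in the post): standard ball and hole
# diameters in inches, converted to radii in feet.
BALL_RADIUS_FT = (1.68 / 2) / 12   # r
HOLE_RADIUS_FT = (4.25 / 2) / 12   # R


def success_probability(x_feet, sigma_degrees):
    """P(success) = 2*Phi(asin((R - r)/x) / sigma) - 1, with sigma in radians."""
    sigma = math.radians(sigma_degrees)
    threshold_angle = math.asin((HOLE_RADIUS_FT - BALL_RADIUS_FT) / x_feet)
    # For the standard normal CDF Phi, 2*Phi(z) - 1 equals erf(z / sqrt(2)).
    return math.erf(threshold_angle / sigma / math.sqrt(2))


# A few (x, n, y) rows copied from the first data table.
observed = [(2, 1443, 1346), (5, 353, 208), (10, 200, 67), (20, 152, 24)]

for x, n, y in observed:
    p = success_probability(x, sigma_degrees=1.5)
    print(f"{x:2d} ft: observed {y / n:.3f}, model {p:.3f}")
```

This only plugs the rounded point estimate into the formula; it says nothing about posterior uncertainty, which is what the Stan fit is for.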

Part 2

The other day, Mark Broadie came to my office and shared a larger dataset, from 2016-2018. I’m assuming the distances are continuous numbers because the putts have exact distance measurements and have been divided into bins by distance, with the numbers below representing the average distance in each bin.

x n y
0.28 45198 45183
0.97 183020 182899
1.93 169503 168594
2.92 113094 108953
3.93 73855 64740
4.94 53659 41106
5.94 42991 28205
6.95 37050 21334
7.95 33275 16615
8.95 30836 13503
9.95 28637 11060
10.95 26239 9032
11.95 24636 7687
12.95 22876 6432
14.43 41267 9813
16.43 35712 7196
18.44 31573 5290
20.44 28280 4086
21.95 13238 1642
24.39 46570 4767
28.40 38422 2980
32.39 31641 1996
36.39 25604 1327
40.37 20366 834
44.38 15977 559
48.37 11770 311
52.36 8708 231
57.25 8878 204
63.23 5492 103
69.18 3087 35
75.19 1742 24

Comparing the two datasets in the range 0-20 feet, the success rate is similar for longer putts but is much higher than before for the short putts. This could be a measurement issue, if the distances to the hole are only approximate for the old data.
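To make that comparison concrete, here's a small Python sketch that computes the empirical success rates for a handful of rows from each table. Matching each old row to the new bin with the closest average distance is my own simplification, not something from Broadie's analysis.

```python
# (distance in feet, tries, successes), rows copied from the two tables above.
old_data = [(2, 1443, 1346), (5, 353, 208), (10, 200, 67), (20, 152, 24)]
new_data = [(1.93, 169503, 168594), (4.94, 53659, 41106),
            (9.95, 28637, 11060), (20.44, 28280, 4086)]


def success_rate(n, y):
    """Empirical proportion of successful putts."""
    return y / n


# Pair each old row with the new bin whose average distance is closest.
pairs = []
for x_old, n_old, y_old in old_data:
    x_new, n_new, y_new = min(new_data, key=lambda row: abs(row[0] - x_old))
    pairs.append((x_old, success_rate(n_old, y_old),
                  x_new, success_rate(n_new, y_new)))

for x_old, rate_old, x_new, rate_new in pairs:
    print(f"old {x_old:5.2f} ft: {rate_old:.3f}   new {x_new:5.2f} ft: {rate_new:.3f}")
```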

Beyond 20 feet, the empirical success rates are lower than would be predicted by the old model. This makes sense: for longer putts, the angle isn’t the only thing you need to control; you also need to get the distance right.
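Here's a quick way to see that gap, sketched in Python: plug the old estimate (sigma of 1.5 degrees) into the angle-only formula and compare with the observed rates in the new data from about 20 feet out. As before, the ball and hole diameters (1.68 and 4.25 inches) are assumed values that aren't given in the post.

```python
import math

# Assumed: hole radius minus ball radius, in feet (4.25 in and 1.68 in diameters).
R_MINUS_R_FT = (4.25 - 1.68) / 2 / 12
SIGMA = math.radians(1.5)  # estimate from the old data, converted to radians


def angle_only_probability(x_feet):
    """Old model: success depends only on the shot angle being small enough."""
    z = math.asin(R_MINUS_R_FT / x_feet) / SIGMA
    return math.erf(z / math.sqrt(2))  # same as 2*Phi(z) - 1


# (x, n, y) rows from the new dataset, 20 feet and beyond.
long_putts = [
    (20.44, 28280, 4086), (21.95, 13238, 1642), (24.39, 46570, 4767),
    (28.40, 38422, 2980), (32.39, 31641, 1996), (36.39, 25604, 1327),
    (40.37, 20366, 834), (44.38, 15977, 559), (48.37, 11770, 311),
    (52.36, 8708, 231), (57.25, 8878, 204), (63.23, 5492, 103),
    (69.18, 3087, 35), (75.19, 1742, 24),
]

comparison = [(x, y / n, angle_only_probability(x)) for x, n, y in long_putts]
for x, observed, predicted in comparison:
    print(f"{x:5.2f} ft: observed {observed:.3f}, old model {predicted:.3f}")
```

In every one of these bins the angle-only prediction sits above the observed rate, which is the pattern you'd expect if distance control matters too.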

So Broadie fit a new model in Stan. See here and here for further details.

## “Retire Statistical Significance”: The discussion.

So, the paper by Valentin Amrhein, Sander Greenland, and Blake McShane that we discussed a few weeks ago has just appeared online as a comment piece in Nature, along with a letter with hundreds (or is it thousands?) of supporting signatures.

Following the first circulation of that article, the authors of that article and some others of us had some email discussion that I thought might be of general interest.

I won’t copy out all the emails, but I’ll share enough to try to convey the sense of the conversation, and any readers are welcome to continue the discussion in the comments.

1. Is it appropriate to get hundreds of people to sign a letter of support for a scientific editorial?

John Ioannidis wrote:

Brilliant Comment! I am extremely happy that you are publishing it and that it will certainly attract a lot of attention.

He had some specific disagreements (see below for more on this). Also, he was bothered by the group-signed letter and wrote:

I am afraid that what you are doing at this point is not science, but campaigning. Leaving the scientific merits and drawbacks of your Comment aside, I am afraid that a campaign to collect signatures for what is a scientific method and statistical inference question sets a bad precedent. It is one thing to ask for people to work on co-drafting a scientific article or comment. This takes effort, real debate, multiple painful iterations among co-authors, responsibility, undiluted attention to detailed arguments, and full commitment. Lists of signatories have a very different role. They do make sense for issues of politics, ethics, and injustice. However, I think that they have no place on choosing and endorsing scientific methods. Otherwise scientific methodology would be validated, endorsed and prioritized based on who has the most popular Tweeter, Facebook or Instagram account. I dread to imagine who will prevail.

To this, Sander Greenland replied:

YES we are campaigning and it’s long overdue . . . because YES this is an issue of politics, ethics, and injustice! . . .

My own view is that this significance issue has been a massive problem in the sociology of science, hidden and often hijacked by those pundits under the guise of methodology or “statistical science” (a nearly oxymoronic term). Our commentary is an early step toward revealing that sad reality. Not one point in our commentary is new, and our central complaints (like ending the nonsense we document) have been in the literature for generations, to little or no avail – e.g., see Rothman 1986 and Altman & Bland 1995, attached, and then the travesty of recent JAMA articles like the attached Brown et al. 2017 paper (our original example, which Nature nixed over sociopolitical fears). Single commentaries even with 80 authors have had zero impact on curbing such harmful and destructive nonsense. This is why we have felt compelled to turn to a social movement: Soft-pedaled academic debate has simply not worked. If we fail, we will have done no worse than our predecessors (including you) in cutting off the harmful practices that plague about half of scientific publications, and affect the health and safety of entire populations.

And I replied:

I signed the form because I feel that this would do more good than harm, but as I wrote here, I fully respect the position of not signing any petitions. Just to be clear, I don’t think that my signing of the form is an act of campaigning or politics. I just think it’s a shorthand way of saying that I agree with the general points of the published article and that I agree with most of its recommendations.

Whether political or not, signing a piece as a form of endorsement seems far more appropriate than having papers with mass authorships of 50+ authors, where it is unlikely that every single one of those authors contributed enough to actually be an author, and where their placement as an author is also a political message.

I also wonder if such pieces, whether they be mass authorships or endorsements by signing, actually lead to notable change. My guess is that they really don’t, but whether or not such endorsements are “popularity contests” via social media, I think I’d prefer that people who participate in science have some voice in the matter, rather than having the views of a few influential individuals, whether they be methodologists or journal editors, constantly repeated and executed in different outlets.

2. Is “retiring statistical significance” really a good idea?

Now on to problems with the Amrhein et al. article. I mostly liked it, although I did have a couple places where I suggested changes of emphasis, as noted in my post linked above. The authors made some of my suggested changes; in other places I respect their decisions even if I might have written things slightly differently.

Ioannidis had more concerns, as he wrote in an email listing a bunch of specific disagreements with points in the article:

1. Statement: Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exist
Why it is misleading: Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important. It will also facilitate claiming that there are no conflicts between studies when conflicts do exist.

2. Statement: Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P-value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.
Why it is misleading: In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim. In many cases using sufficiently stringent p-value thresholds, e.g. p=0.005 for many disciplines (or properly multiplicity-adjusted p=0.05, e.g. 10^-9 for genetics or FDR or Bayes factor thresholds or any thresholds) makes perfect sense. We need to make some careful choices and move on. Saying that any and all associations cannot be 100% dismissed is correct strictly speaking, but practically it is nonsense. We will get paralyzed because we cannot exclude that everything may be causing everything.

3. Statement: statistically non-significant results were interpreted as indicating ‘no difference’ in XX% of articles
Why it is misleading: this may have been entirely appropriate in many/most/all cases, one has to examine carefully each one of them. It is probably at least or even more inappropriate that some/many of the remaining 100-XX% were not indicated as “no difference”.

4. Statement: The editors introduce the collection (2) with the caution “don’t say ‘statistically significant’.” Another article (3) with dozens of signatories calls upon authors and journal editors to disavow the words. We agree and call for the entire concept of statistical significance to be abandoned. We don’t mean to drop P-values, but rather to stop using them dichotomously to decide whether a result refutes or supports a hypothesis.
Why it is misleading: please see my e-mail about what I think regarding the inappropriateness of having “signatories” when we are discussing about scientific methods. We do need to reach conclusions dichotomously most of the time: is this genetic variant causing depression, yes or no? Should I spend 1 billion dollars to develop a treatment based on this pathway, yes or no? Is this treatment effective enough to warrant taking it, yes or no? Is this pollutant causing cancer, yes or no?

5. Statement: whole paragraph beginning with “Tragically…”
Why it is misleading: we have no evidence that if people did not have to defend their data as statistically significant, publication bias would go away and people would not be reporting whatever results look nicer, stronger, more desirable and more fit to their biases. Statistical significance or any other preset threshold (e.g. Bayesian or FDR) sets an obstacle to making unfounded claims. People may play tricks to pass the obstacle, but setting no obstacle is worse.

6. Statement: For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss (8).
Why it is misleading: this example is factually wrong; it is true only if we are certain that the effect being addressed is truly non-null.

7. Statement: One way to do this is to rename confidence intervals ‘compatibility intervals,’ …
Why it is misleading: Probably the least thing we need in the current confusing situation is to add yet a new, idiosyncratic term. “Compatibility” is even a poor choice, probably worse than “confidence”. Results may be entirely off due to bias and the X% CI (whatever C stands for) may not even include the truth much of the time if bias is present.

8. Statement: We recommend that authors describe the practical implications of all values inside the interval, especially the observed effect or point estimate (that is, the value most compatible with the data) and the limits.
Why it is misleading: I think it is far more important to consider what biases may exist and which may lead to the entire interval, no matter how we call it, to be off and thus incompatible with the truth.

9. Statement: We’re frankly sick of seeing nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews, and instructional materials.
Why it is misleading: I (and many others) are frankly sick with seeing nonsensical “proofs of the non-null”, people making strong statements about associations and even causality with (or even without) formal statistical significance (or other statistical inference tool) plus tons of spin and bias. Removing entirely the statistical significance obstacle, will just give a free lunch, all-is-allowed bonus to make any desirable claim. All science will become like nutritional epidemiology.

10. Statement: That means you can and should say “our results indicate a 20% increase in risk” even if you found a large P-value or a wide interval, as long as you also report and discuss the limits of that interval.
Why it is misleading: yes, indeed. But then, welcome to the world where everything is important, noteworthy, must be licensed, must be sold, must be bought, must lead to public health policy, must change our world.

11. Statement: Paragraph starting with “Third, the default 95% used”
Why it is misleading: indeed, but this means that more appropriate P-value thresholds and, respectively X% CI intervals are preferable and these need to be decided carefully in advance. Otherwise, everything is done post hoc and any pre-conceived bias of the investigator can be “supported”.

12. Statement: Factors such as background evidence, study design, data quality, and mechanistic understanding are often more important than statistical measures like P-values or intervals (10).
Why it is misleading: while it sounds reasonable that all these other factors are important, most of them are often substantially subjective. Conversely, statistical analysis at least has some objectivity and if the rules are carefully set before the data are collected and the analysis is run, then statistical guidance based on some thresholds (p-values, Bayes factors, FDR, or other) can be useful. Otherwise statistical inference is becoming also entirely post hoc and subjective.

13. Statement: The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy, and business environments, decisions based on the costs, benefits, and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to further pursue a research idea, there is no simple connection between a P-value and the probable results of subsequent studies.
Why it is misleading: This argument is equivalent to hand waving. Indeed, most of the time yes/no decisions need to be made and this is why removing statistical significance and making it all too fluid does not help. It leads to an “anything goes” situation. Study designs for questions that require decisions need to take all these other parameters into account ideally in advance (whenever possible) and set some pre-specified rules on what will be considered “success”/actionable result and what not. This could be based on p-values, Bayes factors, FDR, or other thresholds or other functions, e.g. effect distribution. But some rule is needed for the game to be fair. Otherwise we will get into more chaos than we have now, where subjective interpretations already abound. E.g. any company will be able to claim that any results of any trial on its product do support its application for licensing.

14. Statement: People will spend less time with statistical software and more time thinking.
Why it is misleading: I think it is unlikely that people will spend less time with statistical software but it is likely that they will spend more time mumbling, trying to sell their pre-conceived biases with nice-looking narratives. There will be no statistical obstacle on their way.

15. Statement: the approach we advocate will help halt overconfident claims, unwarranted declarations of ‘no difference,’ and absurd statements about ‘replication failure’ when results from original and the replication studies are highly compatible.
Why it is misleading: the proposed approach will probably paralyze efforts to refute the millions of nonsense statements that have been propagated by biased research, mostly observational, but also many subpar randomized trials.

Overall assessment: the Comment is written with an undercurrent belief that there are zillions of true, important effects out there that we erroneously dismiss. The main problem is quite the opposite: there are zillions of nonsense claims of associations and effects that once they are published, they are very difficult to get rid of. The proposed approach will make people who have tried to cheat with massaging statistics very happy, since now they would not have to worry at all about statistics. Any results can be spun to fit their narrative. Getting entirely rid of statistical significance and preset, carefully considered thresholds has the potential of making nonsense irrefutable and invincible.

That said, despite these various specific points of disagreement, Ioannidis emphasized that Amrhein et al. raise important points that “need to be given an opportunity to be heard loud and clear and in their totality.”

In reply to Ioannidis’s points above, I replied:

1. You write, “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important.” I completely disagree. Or, maybe I should say, anyone is already allowed to make any overstated claim about any result being important. That’s what PNAS is, much of the time. To put it another way: I believe that embracing uncertainty and avoiding overstated claims are important. I don’t think statistical significance has much to do with that.

2. You write, “In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim.” Again, this is already the case that people can conclude what they want. One concern is what is done by scientists who are honestly trying to do their best. I think those scientists are often misled by statistical significance, all the time, ALL THE TIME, taking patterns that are “statistically significant” and calling them real, and taking patterns that are “not statistically significant” and treating them as zero. Entire scientific papers are, through this mechanism, data in, random numbers out. And this doesn’t even address the incentives problem, by which statistical significance can create an actual disincentive to gather high-quality data.

I disagree with many other items on your list, but two is enough for now. I think the overview is that you’re pointing out that scientists and consumers of science want to make reliable decisions, and statistical significance, for all its flaws, delivers some version of reliable decisions. And my reaction is that whatever benefit there is in statistical significance sometimes providing reliable decisions is outweighed by (a) all the times that statistical significance adds noise and provides unreliable decisions, and (b) the false sense of security that statistical significance gives so many researchers.

One reason this is all relevant, and interesting, is that we all agree on so much—yet we disagree so strongly here. I’d love to push this discussion toward the real tradeoffs that arise when considering alternative statistical recommendations, and I think what Ioannidis wrote, along with the Amrhein/Greenland/McShane article, would be a great starting point.

Ioannidis then responded to me:

On whether removal of statistical significance will increase or decrease the chances that overstated claims will be made and authors will be more or less likely to conclude according to their whim, the truth is that we have no randomized trial to tell whether you are right or I am right. I fully agree that people are often confused about what statistical significance means, but does this mean we should ban it? Should we also ban FDR thresholds? Should we also ban Bayes factor thresholds? Also probably we have different scientific fields in mind. I am afraid that if we ban thresholds and other (ideally pre-specified) rules, we are just telling people to just describe their data as best as they can and unavoidably make strength-of-evidence statements as they wish, kind of impromptu and post-hoc. I don’t think this will work. The notion that someone can just describe the data without making any inferences seems unrealistic and it also defies the purpose of why we do science: we do want to make inferences eventually and many inferences are unavoidably binary/dichotomous. Also actions based on inferences are binary/dichotomous in their vast majority.

I replied:

I agree that the effects of any interventions are unknown. We’re offering, or trying to offer, suggestions for good statistical practice in the hope that this will lead to better outcomes. This uncertainty is a key reason why this discussion is worth having, I think.

3. Mob rule, or rule of the elites, or gatekeepers, consensus, or what?

One issue that came up is, what’s the point of that letter with all those signatories? Is it mob rule, the idea that scientific positions should be determined by those people who are loudest and most willing to express strong opinions (“the mob” != “the silent majority”)? Or does it represent an attempt by well-connected elites (such as Greenland and myself!) to tell people what to think? Is the letter attempting to serve a gatekeeping function by restricting how researchers can analyze their data? Or can this all be seen as a crude attempt to establish a consensus of the scientific community?

None of these seem so great! Science should be determined by truth, accuracy, reproducibility, strength of theory, real-world applicability, moral values, etc. All sorts of things, but these should not be the property of the mob, or the elites, or gatekeepers, or a consensus.

That said, the mob, the elites, gatekeepers, and the consensus aren’t going anywhere. Like it or not, people do pay attention to online mobs. I hate it, but it’s there. And elites will always be with us, sometimes for good reasons. I don’t think it’s such a bad idea that people listen to what I say, in part on the strength of my carefully-written books—and I say that even though, at the beginning of my career, I had to spend a huge amount of time and effort struggling against the efforts of elites (my colleagues in the statistics department at the University of California, and their friends elsewhere) who did their best to use their elite status to try to put me down. And gatekeepers . . . hmmm, I don’t know if we’d be better off without anyone in charge of scientific publishing and the news media—but, again, the gatekeepers are out there: NPR, PNAS, etc. are real, and the gatekeepers feed off of each other: the news media bow down before papers published in top journals, and the top journals jockey for media exposure. Finally, the scientific consensus is what it is. Of course people mostly do what’s in textbooks, and published articles, and what they see other people do.

So, for my part, I see that letter of support as Amrhein, Greenland, and McShane being in the arena, recognizing that mob, elites, gatekeepers, and consensus are real, and trying their best to influence these influencers and to counter negative influences from all those sources. I agree with the technical message being sent by Amrhein et al., as well as with their open way of expressing it, so I’m fine with them making use of all these channels, including getting lots of signatories, enlisting the support of authority figures, working with the gatekeepers (their comment is being published in Nature, after all; that’s one of the tabloids), and openly attempting to shift the consensus.

Amrhein et al. don’t have to do it that way. It would be also fine with me if they were to just publish a quiet paper in a technical journal and wait for people to get the point. But I’m fine with the big push.

4. And now to all of you . . .

As noted above, I accept the continued existence and influence of mob, elites, gatekeepers, and consensus. But I’m also bothered by these, and I like to go around them when I can.

Hence, I’m posting this on the blog, where we have the habit of reasoned discussion rather than mob-like rhetorical violence, where the comments have no gatekeeping (in 15 years of blogging, I’ve had to delete less than 5 out of 100,000 comments—that’s 0.005%!—because they were too obnoxious), and where any consensus is formed from discussion that might just lead to the pluralistic conclusion that sometimes no consensus is possible. And by opening up our email discussion to all of you, I’m trying to demystify (to some extent) the elite discourse and make this a more general conversation.

P.S. There’s some discussion in comments about what to do in situations like the FDA testing a new drug. I have a response to this point, and it’s what Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I wrote in section 4.4 of our article, Abandon Statistical Significance:

While our focus has been on statistical significance thresholds in scientific publication, similar issues arise in other areas of statistical decision making, including, for example, neuroimaging where researchers use voxelwise NHSTs to decide which results to report or take seriously; medicine where regulatory agencies such as the Food and Drug Administration use NHSTs to decide whether or not to approve new drugs; policy analysis where non-governmental and other organizations use NHSTs to determine whether interventions are beneficial or not; and business where managers use NHSTs to make binary decisions via A/B tests. In addition, thresholds arise not just around scientific publication but also within research projects, for example, when researchers use NHSTs to decide which avenues to pursue further based on preliminary findings.

While considerations around taking a more holistic view of the evidence and consequences of decisions are rather different across each of these settings and different from those in scientific publication, we nonetheless believe our proposal to demote the p-value from its threshold screening role and emphasize the currently subordinate factors applies in these settings. For example, in neuroimaging, the voxelwise NHST approach misses the point in that there are typically no true zeros and changes are generally happening at all brain locations at all times. Plotting images of estimates and uncertainties makes sense to us, but we see no advantage in using a threshold.

For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds. Specifically, and as noted, such thresholds implicitly express a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes.

That said, we acknowledge that thresholds (of a non-statistical variety) may sometimes be useful in these settings. For example, consider a firm contemplating sending a costly offer to customers. Suppose the firm has a customer-level model of the revenue expected in response to the offer. In this setting, it could make sense for the firm to send the offer only to customers that yield an expected profit greater than some threshold, say, zero.

Even in pure research scenarios where there is no obvious cost-benefit calculation (for example, a comparison of the underlying mechanisms, as opposed to the efficacy, of two drugs used to treat some disease), we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.

While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.

We would also like to see—when possible in these and other settings—more precise individual-level measurements, a greater use of within-person or longitudinal designs, and increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature (Gelman, 2015, 2017; McShane and Bockenholt, 2017, 2018).
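The firm example in the quoted passage is easy to write down as a decision rule. Here's a minimal Python sketch; the customers, expected revenues, and offer cost are all made-up numbers for illustration, and in practice the expected revenue would come from the firm's customer-level model.

```python
# Hypothetical customers: (customer id, expected revenue if sent the offer).
customers = [("a", 12.0), ("b", 3.5), ("c", 8.0), ("d", 0.5)]

OFFER_COST = 4.0        # hypothetical cost of sending one offer
PROFIT_THRESHOLD = 0.0  # "say, zero" in the quoted example


def customers_to_target(customers, offer_cost, threshold):
    """Keep customers whose expected profit (revenue minus cost) exceeds the threshold."""
    return [cid for cid, expected_revenue in customers
            if expected_revenue - offer_cost > threshold]


targeted = customers_to_target(customers, OFFER_COST, PROFIT_THRESHOLD)
print(targeted)
```

The threshold here is on expected profit, a quantity with direct costs and benefits attached, not on a p-value.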

## My two talks in Montreal this Friday, 22 Mar

McGill University Biostatistics seminar, Purvis Hall, 102 Pine Ave. West, Room 25, 1-2pm Fri 22 Mar:

Resolving the Replication Crisis Using Multilevel Modeling

In recent years we have come to learn that many prominent studies in social science and medicine, conducted at leading research institutions, published in top journals, and publicized in respected news outlets, do not and cannot be expected to replicate. Proposed solutions to the replication crisis in science fall into three categories: altering procedures and incentives, improving design and data collection, and improving statistical analysis. We argue that progress in all three dimensions is necessary: new procedures and incentives will offer little benefit without better data; more complex data structures require more elaborate analysis; and improved incentives are required for researchers to try new methods. We propose a way forward involving multilevel modeling, and we discuss in the context of applications in social research and public health.

Montréal Mathematical Sciences Colloquium, 1205 Burnside Hall, 3:30-4:30pm:

Challenges in Bayesian Computing

Computing is both the most mathematical and most applied aspect of statistics. We shall talk about various urgent computing-related topics in statistical (in particular, Bayesian) workflow, including exploratory data analysis and model checking, Hamiltonian Monte Carlo, monitoring convergence of iterative simulations, scalable computing, evaluation of approximate algorithms, predictive model evaluation, and simulation-based calibration. This work is inspired by applications including survey research, drug development, and environmental decision making.

Ed Bein writes:

I’m hoping you can clarify a Bayesian “metaphysics” question for me. Let me note I have limited experience with Bayesian statistics.

In frequentist statistics, probability has to do with what happens in the long run. For example, a p value is defined in terms of what happens if, from now till eternity, we repeatedly draw random samples from some population of interest, compute the value of a test statistic, and keep a running tabulation of the proportion of values that exceed a certain given value. Let me refer to probability in a frequentist context as F-probability.

In Bayesian statistics, probability has to do with degree of belief. Prior and posterior distributions refer to our degree of confidence (prior to looking at data and after looking at data, respectively) that a parameter falls within certain ranges of values, where 1 represents total certainty and 0 represents total disbelief. Let me refer to probability in a Bayesian context as B-probability.

Both F-probability and B-probability are valid interpretations of probability, in that they satisfy the axioms of probability. But they are distinct interpretations.

My conceptual confusion is that Bayes Theorem combines a term with an F-probability interpretation (the likelihood, which is essentially the density of the sampling distribution) with a term with a B-probability interpretation (density of the prior distribution) to produce an entity with a B-probability interpretation, namely, the density of the posterior distribution. I’m not questioning the validity of the derivation of Bayes Theorem here. Rather, it seems conceptually messy to me that an F-probability term is combined with a B-probability term; both terms have to do with “probability,” but what is meant by “probability” is very different for each of them.

Can you provide some conceptual clarity?

See here and here, also here and here.

At this point, I’ve written about this so many times I just have to point to the relevant links. Kinda like that joke about the jokes with the numbers.

## Maybe it’s time to let the old ways die; or We broke R-hat so now we have to fix it.

“Otto eye-balled the diva lying comatose amongst the reeds, and he suddenly felt the fire of inspiration flood his soul. He ran back to his workshop where he futzed and futzed and futzed.” –Bette Midler

Andrew was annoyed. Well, annoyed is probably too strong a word. Maybe a better way to start is with The List. When Andrew, Aki, and I work together we have The List of projects that need to be done, and not every item on this list is weighted the same by all of us.

The List has longer term ideas that we’ve been slouching towards, projects that have stalled, small ideas that have room to blossom, and then there’s my least favourite part. My least favourite part of The List is things that are finished but haven’t been written up as papers yet. This is my least favourite category because I never think coercing something into a paper is going to be any fun.

Regular readers of my blogs probably realize that I am frequently and persistently wrong.

But anyway. It was day one of one of Aki’s and my visits to Columbia and we were going through The List. Andrew was pointing out a project that had been sitting on The List for a very very long time. (Possibly since before I was party to The List.) And he wanted it off The List.

(Side note: this is the way all of our projects happen. Someone suddenly wants it done enough that it happens. Otherwise it stays marinating on The List.)

So let’s start again.

Andrew wanted a half-finished paper off The List and he had for a while. Specifically, the half-finished paper documenting the actual way that the Stan project computes $\widehat{R}$ (aka the Potential Scale Reduction Factor or, against our preference of not naming things after people, the Gelman-Rubin(-Brooks) statistic). So we agreed to finish it and then moved on to some more exciting stuff.

But then something bad happened: Aki broke R-hat and we had to work out how to fix it.

The preprint is here. There is an extensive online appendix here. The paper is basically our up to date “best practice” guide to monitoring convergence for general MCMC algorithms. When combined with the work in our recent visualization paper, and two of Michael Betancourt’s case studies (one and two), you get our best practice recommendations for Stan. All the methods in this paper are available in a github repo and will be available in future versions of the various Stan and Stan-adjacent libraries.

What is R-hat?

R-hat, or the potential scale reduction factor, is a diagnostic that attempts to flag situations where the MCMC algorithm¹ has failed to converge. The basic idea is that you want to check a couple of things:

1. Is the distribution of the first half of a chain (after warm up) the same as the distribution of the second half of the chain?
2. If I start the algorithm at two different places and let the chain warm up, do both chains have the same distribution?

If either of these two checks fails, then your MCMC algorithm has probably not converged. If both checks pass, it is still possible that the chain has problems.

Historically, there was a whole lot of chat about whether or not you need to run multiple chains to compute R-hat. To summarize that extremely long conversation: you do. Why? To paraphrase the great statistician Vanessa Williams:

Sometimes the snow comes down in June
Sometimes the sun goes ’round the moon
Sometimes the posterior is multimodal
Sometimes the adaptation you do during warm up is unstable

Also it’s 2019 and all of our processors are multithreaded so just do it.

The procedure, which is summarized in the paper (and was introduced in BDA3), computes a single number summary and we typically say our Markov Chains have not converged if $\widehat{R} > 1.01$.
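To make the two checks concrete, here is a minimal sketch of split R-hat in plain Python. It is not the code Stan runs: the function names (`split_chains`, `rhat`, `split_rhat`) and the simulated chains are made up for illustration, and it assumes every chain has the same length and that warm up has already been discarded.

```python
import random
from statistics import mean, variance


def split_chains(chains):
    """Cut each chain into a first and second half (drops the middle draw if odd)."""
    halves = []
    for chain in chains:
        half = len(chain) // 2
        halves.append(chain[:half])
        halves.append(chain[len(chain) - half:])
    return halves


def rhat(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    within = mean(variance(c) for c in chains)             # W: average within-chain variance
    between_over_n = variance([mean(c) for c in chains])   # B / n: variance of the chain means
    var_plus = (n - 1) / n * within + between_over_n       # pooled variance estimate
    return (var_plus / within) ** 0.5


def split_rhat(chains):
    """Comparing half-chains covers check 1; comparing across chains covers check 2."""
    return rhat(split_chains(chains))


# Illustrative fake chains: four that agree, then one replaced by a chain
# centred in the wrong place.
rng = random.Random(1)
good = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
bad = good[:3] + [[rng.gauss(3, 1) for _ in range(1000)]]

print("agreeing chains:", round(split_rhat(good), 3))
print("one shifted chain:", round(split_rhat(bad), 3))
```

With the agreeing chains the value should sit very close to 1; with the shifted chain it should land far above the 1.01 threshold.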

A few things before we tear the whole thing down.

1. Converging to the stationary distribution is the minimum condition required for an MCMC algorithm to be useful. R-hat being small doesn’t mean the chain is mixing well, so you need to check the effective sample size!
2. R-hat is a diagnostic and not a proof of convergence. You still need to look at all of the other things (like divergences and BFMI in Stan) as well as diagnostic plots (more of which are in the paper).
3. The formula for R-hat in BDA3 assumes that the stationary distribution has finite variance. This is a very hard property to check from a finite sample.

The third point is how Aki broke R-hat.

R-hat is a diagnostic not a hypothesis test

A quick (added after posting) digression here about how you should treat R-hat. Fundamentally, it is not a formal check to see if an MCMC algorithm has reached its stationary distribution. And it is definitely not a hypothesis test!

A better way of thinking about R-hat is like the fuel light in a car: if the fuel light is on, get some petrol. If the fuel light isn’t on, look at the fuel gauge, how far you’ve gone since your last refill, etc.

Similarly, if R-hat is bigger than 1.01, your algorithm probably isn’t working. If R-hat is less than this fairly arbitrary threshold, you should look at the effective sample size (the fuel gauge), divergences, appropriate plots, and everything else to see if there are any other indicators that the chain is misbehaving. If there aren’t, then you’ve done your best to check for convergence problems, so you might as well hold your breath and hope for the best.

But what about the 1.01 threshold? Canny readers will look at the author list and correctly assume that it’s not a high quantile of the distribution of R-hat under some null hypothesis. It’s actually just a number that’s bigger than one but not much bigger than one. Some earlier work suggested bigger thresholds like 1.1, but Vats and Knudson give an excellent argument about why that number is definitely too big. They suggest making a problem-dependent threshold for R-hat to take into account its link with effective sample size, but we prefer just to look at the two numbers separately, treating R-hat as a diagnostic (like the fuel light) and the effective sample size estimate like the fuel gauge.

So to paraphrase Flight of the Conchords: A small R-hat value is not a contract, but it’s very nice.

When does R-hat break?

The thing about turning something into a paper is that you need to have a better results section than you typically need for other purposes. So Aki went off and did some quick simulations and something bad happened. When he simulated from four chains where one had the wrong variance, the R-hat value was still near one. So R-hat was not noticing the variance was wrong. (This is the top row of the above figure. The left column has one bad chain, the right column four correct chains.)

On the other hand, R-hat was totally ok at noticing when one chain had the wrong location parameter. Except for the bottom row of the figure, where the target distribution is Cauchy.

So we noticed two things:

1. The existing R-hat diagnostic was only sensitive to errors in the first moment.
2. The existing R-hat diagnostic failed catastrophically when the variance was infinite.

(Why “catastrophically”? Because it always says the chain is good!)

So clearly we could no longer just add two nice examples to the text that was written and then send the paper off. So we ran back to our workshop where we futzed and futzed and futzed.

He tried some string and paper clips…

Well, we² came up with some truly terrible ideas but eventually circled around to two observations:

1. Folding the draws by computing $\zeta^{(nm)}=\left|\theta^{(nm)}-\mathrm{median}(\theta)\right|$ and computing R-hat on the folded chain will give a statistic that is sensitive to changes in scale.
2. Rank-based methods are robust against fat tails. So perhaps we could rank-normalize the chain (i.e., compute the rank for each draw inside the total pool of samples and replace the true value with the quantile of a standard normal that corresponds to the rank).

Putting these two together, we get our new R-hat value: after rank-normalizing the chains, compute the standard R-hat and the folded R-hat and report the maximum of the two values.  These are the blue histograms in the picture.

There are two advantages of doing this:

1. The new value of R-hat is robust against heavy tails and is sensitive to changes in scale between the chains.
2. The new value of R-hat is parameterization invariant, which is to say that the R-hat value for $\theta$ and $\log(\theta)$ will be the same. This was not a property of the original formulation.
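The rank-normalize-and-fold recipe can be sketched in a few lines of plain Python. Treat this as an illustration under stated assumptions rather than the implementation in Stan: the function names are invented, ties between draws are not handled specially, and the fractional offset `(rank - 3/8) / (S + 1/4)` used to keep ranks strictly inside (0, 1) is my assumption about a reasonable choice (check the paper for the exact form it uses). The example recreates the "one chain has the wrong variance" situation with fake normal draws.

```python
import random
from statistics import NormalDist, mean, median, variance


def rhat(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    within = mean(variance(c) for c in chains)
    between_over_n = variance([mean(c) for c in chains])
    return (((n - 1) / n * within + between_over_n) / within) ** 0.5


def split_chains(chains):
    """Cut each chain into a first and second half."""
    halves = []
    for chain in chains:
        half = len(chain) // 2
        halves += [chain[:half], chain[len(chain) - half:]]
    return halves


def rank_normalize(chains):
    """Replace each draw by the standard-normal quantile of its pooled rank."""
    pooled = sorted((v, m, i) for m, c in enumerate(chains) for i, v in enumerate(c))
    total = len(pooled)
    inv_cdf = NormalDist().inv_cdf
    out = [[0.0] * len(c) for c in chains]
    for rank, (_, m, i) in enumerate(pooled, start=1):
        # Assumed offset so the argument stays strictly between 0 and 1.
        out[m][i] = inv_cdf((rank - 0.375) / (total + 0.25))
    return out


def fold(chains):
    """Absolute deviation of every draw from the pooled median."""
    centre = median(v for c in chains for v in c)
    return [[abs(v - centre) for v in c] for c in chains]


def rank_folded_rhat(chains):
    """Maximum of rank-normalized split R-hat and its folded counterpart."""
    halves = split_chains(chains)
    bulk = rhat(rank_normalize(halves))
    scale = rhat(rank_normalize(fold(halves)))
    return max(bulk, scale)


# Fake example: three chains with sd 1 and one chain with sd 3, all centred at 0.
rng = random.Random(2)
chains = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(3)]
chains.append([rng.gauss(0, 3) for _ in range(1000)])

classic = rhat(split_chains(chains))
new = rank_folded_rhat(chains)
print("classic split R-hat:", round(classic, 3))
print("rank-normalized folded R-hat:", round(new, 3))
```

The classic value should stay close to 1 because the chain means agree, while the folded version should clearly exceed 1.01 because the folded draws from the wide chain are systematically larger.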

What does the rank-normalization actually do?

Great question imaginary bold-face font person! The intuitive answer is that it is computing the R-hat value for the nicest transformation of the parameter. (Where nicest is problematically defined to be “most normal”). So what does this R-hat tell you? It tells you if the MCMC algorithm has failed to converge even after we strip away all of the problems with heavy tails and skewness and all that jazz. Similarly we can compute an Effective Sample Size (ESS) for this nice scenario. This is the best case scenario R-hat and ESS. If it isn’t good, we have no hope.

Assuming the rank-normalized and folded R-hat is good and the rank-normalized ESS is good, it is worth investigating the chain further.

R-hat does not give us all the information we need to assess if the chain is useful

The old version of R-hat basically told us if the mean was ok. The new version tells us if the median and the MAD are ok. But that’s not the only thing we usually report. Typically a posterior is summarized by a measure of centrality (like the median) and a quantile-based uncertainty interval (such as the 0.025 and 0.975 quantiles of the posterior). We need to check that these quantiles are computed correctly!

This is not trivial: MCMC algorithms do not explore the tails as well as the bulk. This means that the Monte Carlo error in the quantiles may potentially be much higher than the Monte Carlo error in the mean.  To deal with this, we introduced a localized measure of quantile efficiency, which is basically an effective sample size for computing a quantile.  Here’s an example from the online appendix, where the problem is sampling from a Cauchy distribution using the “nominal” parameterization. You can see that it’s possible that central quantiles are being resolved well, but extreme quantile estimates will be very noisy.

Maybe it’s time to let traceplots die

As we are on somewhat of a visualization kick around these parts, let’s talk about traceplots of MCMC algorithms. They’re terrible. If the chain is long, all the interesting information is compressed, and if you try to include information from multiple chains it just becomes a mess. So let us propose an alternative: rank plots.

The idea is that if the chains are all mixing well and exploring the same distribution, the ranks should be uniformly distributed. In the following figure, 4 chains of 1000 draws are plotted and you can easily see that the histograms are not uniform. Moreover, the histogram for the first chain clearly never visits the left tail of the distribution, which is indicative of a funnel. This would be harder to see with 4 traceplots plotted over each other.
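The numbers behind a rank plot are easy to compute by hand. The sketch below pools the draws, ranks them, and counts how many of each chain's ranks fall in each of ten equal-width rank bins. The function name `rank_histograms` and the fake chains are hypothetical; the first chain is deliberately built so that it never visits the left half of the distribution.

```python
import random


def rank_histograms(chains, bins=10):
    """Per-chain counts of pooled ranks falling in each of `bins` equal rank bins."""
    pooled = sorted((value, m) for m, chain in enumerate(chains) for value in chain)
    total = len(pooled)
    counts = [[0] * bins for _ in chains]
    for rank, (_, m) in enumerate(pooled):      # rank runs from 0 to total - 1
        counts[m][rank * bins // total] += 1
    return counts


# Fake example: chain 1 only produces non-negative draws, chains 2-4 are standard normal.
rng = random.Random(3)
chains = [[abs(rng.gauss(0, 1)) for _ in range(1000)]]
chains += [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(3)]

counts = rank_histograms(chains)
for m, row in enumerate(counts):
    print(f"chain {m + 1}: {row}")
```

A well-mixed chain should show roughly equal counts in every bin. Here the first chain's leftmost bins are empty, which is the text version of a histogram that never visits the left tail.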

How to use these new tools in practice

To close off, here are our recommendations (taken directly from the paper) for using R-hat. All of the methods in this paper will make their way into future versions of RStan, rstanarm, and bayesplot (as well as all the other places we put things).

In this section [of the paper] we lay out practical recommendations for using the tools developed in this paper. In the interest of specificity, we have provided numerical targets for both R-hat and effective sample size (ESS). However, these values should be adapted as necessary for the given application.

In Section 4, we propose modifications to R-hat based on rank-normalizing and folding the posterior draws. We recommend running at least four chains by default and only using the sample if R-hat < 1.01. This is a much tighter threshold than the one recommended by Gelman and Rubin (1992), reflecting lessons learnt over more than 25 years of use.

Roughly speaking, the ESS of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. Clearly, the higher the ESS the better. When there might be difficulties with mixing, it is important to use between-chain information in computing the ESS. For instance, in the sorts of funnel-shaped distributions that arise with hierarchical models, differences in step size adaptation can lead chains to have different behavior in the neighborhood of the narrow part of the funnel. For multimodal distributions with well-separated modes, the split-R-hat adjustment leads to an ESS estimate that is close to the number of distinct modes that are found. In this situation, ESS can be drastically overestimated if computed from a single chain.

As Vats and Knudson (2018) note, a small value of R-hat is not enough to ensure that an MCMC procedure is useful in practice. It also needs to have a sufficiently large effective sample size. As with R-hat, we recommend computing the ESS on the rank-normalized sample. This does not directly compute the ESS relevant for computing the mean of the parameter, but instead computes a quantity that is well defined even if the chains do not have finite mean or variance. Specifically, it computes the ESS of a sample from a normalized version of the quantity of interest, using the rank transformation followed by the normal inverse-cdf. This is still indicative of the effective sample size for computing an average, and if it is low the computed expectations are unlikely to be good approximations to the actual target expectations. We recommend requiring that the rank-normalized ESS is greater than 400. When running four chains, this corresponds to having a rank-normalized effective sample size of at least 50 per split chain.

Only when the rank-normalized and folded R-hat values are less than the prescribed threshold and the rank-normalized ESS is greater than 400 do we recommend computing the actual (not rank-normalized) effective sample size for the quantity of interest. This can then be used to assess the Monte Carlo standard error (MCSE) for the quantity of interest.

Finally, if you plan to report quantile estimates or posterior intervals, we strongly suggest assessing the behaviour of the chains for these quantiles. In Section 4.3 we show that convergence of Markov chains is not uniform across the parameter space and propose diagnostics and effective sample sizes specifically for extreme quantiles. This is different from the standard ESS estimate (which we refer to as the “bulk-ESS”), which mainly assesses how well the centre of the distribution is resolved. Instead, these “tail-ESS” measures allow the user to estimate the MCSE for interval estimates.

Footnotes:

1 R-hat can be used more generally for any iterative simulation algorithm that targets a stationary distribution, but its main use case is MCMC.
2 I. (Other people’s ideas were good. Mine were not.)

## It’s the end! Riad Sattouf wins.

The Japanese dude who won the hot dog eating contest (well, it sounds better in English), Mr. Kobayashi, was a great underdog, the dark horse of this “March Madness,” but in the end I have to advance the cartoonist, thanks to this poem from Dzhaughn:

this dour crie de couer
That Dude is no good!
Am I just sowing some FUD
No: This is far too awful
Let’s not all fall
For this eater of all
that it literally offal.

Chopped in a tub
stuffed in a tube
Is any food ruder?
Any more rued?

Its farewell, bon chance,
mobs haling from France
will be making us dance
in hats labeled “dunce.”

There will be huffs among puffs
And gnashing of toofs
As we hoof it past toughs
Hailing potatoes, tomahtoes, and oeufs
from the roofs,
demanding the troof:
“Which goofs with what guff
Did slough off Sattouf?
Quelque Tartuffe?
It’s the stuff of a spoof!”

Severed coifs in
Several coffins
disciples of the Dude de Gras
disciplined with coups de grace
No mercy for sinners
if the eater of wieners
should end up the winner.

Good God,
The Dude is a dud,
I cannot abide it,
But Andrew,
we’ve wooed you;
how would you
decide it?

And, from Raghuveer:

What does the Japanese dude who won the hot dog eating contest’s career tell us about the role of serving sizes on food consumption (as in Brian Wansink’s discredited work)? Or about the confusing and statistically convoluted world of nutrition guidelines? Is anonymity imposed by the powers-that-be (Andrew) the solution to the problems posed by celebrity scientists and their associated entourages? These are all questions that the Japanese dude who won the hot dog eating contest will get at his seminar, given an audience of this blog’s readers, and we can’t subject him to this. Overwhelmed and depressed, he might overeat even more, and we’d be responsible. The whole thing would end up a comic but chilling example of human cruelty, the sort of thing Riad Sattouf would write about. So instead, we could just get Sattouf himself; we could listen to such stories without contributing to their proliferation.

Those are the arguments against Mr. Kobayashi, but we also have some reasons to choose Mr. Sattouf:

1. J Storrs:

Speaking of googling Sattouf, google translate tells us that it comes from the Arabic word for catfish. Catfish have a slight edge over hotdogs, as they are accompanied by hush puppies.
And perhaps his hovercraft is full of eels.

2. Ethan:

If we can’t have Geng let’s have Sattouf. I too had to google him often – and each time found out I’d like to hear what he has to say.

3. Maybe Mr. Sattouf will come and tell us more stories from the life of Isaac Asimov.

And the best argument comes from Thomas:

Sattouf gave up an award at Angoulême (the Pulitzer of bande dessinée) in protest over the lack of women among the prize recipients, so maybe he’ll invite Geng to lead the seminar. Or, he can do the seminar in the arab banlieue accent, like he did in the post-shooting issue of Charlie.

But the japanese dude, he only soaks up stuff, a black hole, a non-informative entity. The thrill of not knowing what the seminar is about. I might go.

So, if we invite Mr. Sattouf, maybe we get Ms. Geng! Whether we see one or the other, it’s not Mr. Latour, so we can invite everyone to the talk, with no need to write a paragraph explaining why you want to attend.

Let us remember the 62 other competitors, the Virginia Apgars, the David Blaines, the DJ Jazzy Jeffs, all the excellent people from New Jersey and other states and countries who did not reach the end of this imaginary competition.

Thanks to everyone for the discussions, the poetry, the jokes. I hope you can continue your good participation on this blog on more serious subjects.

And now most readers of this blog can relax, since you think these competitions aren’t funny and you prefer not “hot dogs” but the normal fare of Stan, type M errors, causal identification, American politics, and all the other topics in the lexicon (or, in one last look at Mr. Kobayashi, this, the self-cleaning air conditioner, et cetera).

## When and how do politically extreme candidates get punished at the polls?

In 2016, Tausanovitch and Warshaw performed an analysis “using the largest dataset to date of voting behavior in congressional elections” and found:

Ideological positions of congressional candidates have only a small association with citizens’ voting behavior. Instead, citizens cast their votes “as if” based on proximity to parties rather than individual candidates. The modest degree of candidate-centered spatial voting in recent Congressional elections may help explain the polarization and lack of responsiveness in the contemporary Congress.

Then in 2018, Andrew Hall and Daniel Thompson wrote:

Combining a regression discontinuity design in close primary races with survey and administrative data on individual voter turnout, we find that extremist nominees—as measured by the mix of campaign contributions they receive—suffer electorally, largely because they decrease their party’s share of turnout in the general election, skewing the electorate towards their opponent’s party. The results help show how the behavioral and institutional literatures can be connected. For our sample of elections, turnout appears to be the dominant force in determining election outcomes, but it advantages ideologically moderate candidates because extremists appear to activate the opposing party’s base more than their own.

Sean McElwee brought those two papers to my attention (along with this) and asked how they can be reconciled. McElwee writes:

Voters, who can’t distinguish ideology, decided whether or not to vote in generals based on the extremism of their opponent (measured in a way that may or may not actually reflect an extreme voting record).

Seems like there must be another mechanism for the Hall and Thompson data?

My reply: I’m not actually up on this literature. Can I blog this and we can see what comments show up?

And so I did, and here you are. I haven’t thought much on these issues since writing that paper, Moderation in the pursuit of moderation is no vice: The clear but limited advantages to being a moderate for Congressional elections, with Jonathan Katz, over ten years ago.

## It’s the finals! The Japanese dude who won the hot dog eating contest vs. Riad Sattouf

I chose yesterday’s winner based on this comment from Re’el:

Hey, totally not related to this, but could offer any insight into this study: https://www.nytimes.com/2019/03/15/well/eat/eggs-cholesterol-heart-health.html It seems like something we go back and forth on and this study didn’t offer any insight. Thanks.

Egg = oeuf, so we should choose the man whose name ends in f.

Also, from Dzhaughn:

Our GOAT scored with butt and with hoof
But committed a political goof:
He saw nothing the matter
with electing Sepp Blatter
So lets go for top drawer Sattouf.

And from Thomas:

Sattouf (in Arab of the future): “A man has no roots. He has feet.” He has the footballer figured out.

Whereas Pelé says things like “Success is not accident. It’s hard work…” Sounds like quite a seminar.

And now, this is it: an unseeded creative eater who, along the way, defeated Carol Burnett, Oscar Wilde, Albert Brooks, and Jim Thorpe—how he ever won against Carol Burnett, I have no idea, she’d be a great seminar speaker!—against a middle-aged dessinateur who triumphed over Leonhard Euler, Lance Armstrong, Mel Brooks, Veronica Geng, and Pele. Both these guys have gone far.

Last time we had this contest was 4 years ago, and the winner was Thomas Hobbes. Who’s it gonna be this time? (I’m still bummed that Veronica Geng’s no longer in the running.)

Again, we’re trying to pick the best seminar speaker. Here are the rules and here’s the bracket:

## Are male doctors better for male heart attack patients and female doctors better for female heart attack patients?

Brad Greenwood, Seth Carnahan, and Laura Huang write:

A large body of medical research suggests that women are less likely than men to survive traumatic health episodes like acute myocardial infarctions. In this work, we posit that these difficulties may be partially explained, or exacerbated, by the gender match between the patient and the physician. Findings suggest that gender concordance increases a patient’s probability of survival and that the effect is driven by increased mortality when male physicians treat female patients. . . .

I replied that I didn’t think the paper was so bad but I agreed with Kane’s concerns about the data being observational.

Kane responded:

The problem is their claim that the assignment mechanism of patients to physicians is “quasirandom” when their own data demonstrates so clearly that it is not. More details:
https://www.davidkane.info/post/evidence-against-greenwood-et-al-s-claims-of-randomness/

I don’t have strong feelings on this one. I agree with Kane that the claims are speculative, and I agree with him that it would be better if the researchers would make their data public. It’s kind of frustrating when there’s a document with tons of supplementary analyses but no raw data. There’s a lot going on in this study—you should be able to learn a lot from N = 600,000 cases.

My summary

The big contributions of the researchers here are: (a) getting the dataset together, and (b) asking the question, comparing male and female doctors with male and female patients.

At this point, there are a lot of directions to go in the analysis, so I think the right thing to do is publish some summaries and preliminary estimates (which is what they did) and let the data be open to all. I don’t have any strong reason right now to disbelieve the claims in the published paper, but there’s really no need to stop here.

## Riad Sattouf (1) vs. Pele; the Japanese dude who won the hot dog eating contest advances

Lots of good arguments in favor of Bruce, but then this came from Noah:

Hot-dog-garbled speech from Kobayashi recounting disgusting stories about ingesting absurdly large numbers of unchewed sausages and wet buns vs the gravelly, dulcet tones of New Jersey’s answer to John Mellencamp telling touching, timeless tales of musical world tours? The Boss in a landslide.

New Jersey’s answer to John Mellencamp?? That doesn’t seem so great. I’ll have to go with J Storrs:

Aha! we’ve come down to a Roomba versus a goomba. After Springsteen rides his suicide machine, they’ll have to put him in a tomb-ah, where the dude would simply continue sucking up crumb-ahs. Either way he wins.

The Roomba it is. Sure, he’s no Bruce Springsteen. But, on the plus side, he’s not David Blaine either!

And now, for our other semifinal: the cartoonist or the footballer, who will it be?

Again, we’re trying to pick the best seminar speaker. Here are the rules and here’s the bracket:

## Estimating treatment effects on rates of rare events using precursor data: Going further with hierarchical models.

Someone points to my paper with Gary King from 1998, Estimating the probability of events that have never occurred: When is your vote decisive?, and writes:

In my area of early childhood intervention, there are certain outcomes which are rare. Things like premature birth, confirmed cases of child-maltreatment, SIDS, etc. They are rare enough that they occasionally won’t even show up in a given study sample. Studies looking at the efficacy of interventions to reduce the probability of these events happening have shown mixed results that, taken as a whole, probably suggest home visiting isn’t having a big impact here (though my own opinion is that we don’t really know much of anything). I wonder if part of the problem is the rarity of the events and if our methods for analysis are really inappropriate. The most typical way this is assessed is via a logistic regression (frequentist and basic). These events are so rare that many studies actually look at proxies, which themselves, have little data to support their predictive ability of the actual endpoint (which I worry are noisy).

Would a “rare event” analysis technique be potentially appropriate for more accurately modeling the potential effectiveness of an intervention on reducing something like child maltreatment, premature birth, or SIDS?

My reply: yes, it makes sense to model precursor data. Looking at the problem 20 years later, it occurs to me that it would make sense to do some sort of hierarchical modeling with informative priors. With rare events, priors would be needed to regularize, but maybe then you could work with a whole bunch of different outcomes and precursors together.

I’m too busy to do this right now, but I guess the right way to start here would be to set up a fake-data simulation study where there are rare events of interest and various precursors, and then go from there.
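As a starting point, a fake-data simulation of this kind might look like the sketch below. Everything in it is invented for illustration: the sample size, the precursor rates (20% in the control arm, 15% in the treated arm), and the event probabilities given the precursor are not taken from any study. It is also only the scaffolding, comparing raw differences in proportions across simulated studies; the hierarchical model with informative priors would be fit on top of data like these.

```python
import random


def simulate_arm(rng, n, p_precursor, p_event_given_precursor, p_event_otherwise):
    """Simulate one study arm; return (precursor rate, rare-event rate)."""
    precursors = 0
    events = 0
    for _ in range(n):
        has_precursor = rng.random() < p_precursor
        p_event = p_event_given_precursor if has_precursor else p_event_otherwise
        precursors += has_precursor
        events += rng.random() < p_event
    return precursors / n, events / n


def run_study(rng, n=500):
    """One fake study; the treatment only acts by lowering the precursor rate."""
    control = simulate_arm(rng, n, 0.20, 0.05, 0.002)   # hypothetical rates
    treated = simulate_arm(rng, n, 0.15, 0.05, 0.002)   # hypothetical rates
    return treated[0] - control[0], treated[1] - control[1]


rng = random.Random(4)
results = [run_study(rng) for _ in range(200)]

# How often does a single study get the sign of the (truly beneficial) effect right?
share_precursor = sum(d_pre < 0 for d_pre, _ in results) / len(results)
share_event = sum(d_ev < 0 for _, d_ev in results) / len(results)

print("share of studies with reduced precursor rate:", share_precursor)
print("share of studies with reduced rare-event rate:", share_event)
```

Under these made-up settings the precursor comparison should point in the right direction far more reliably than the rare-event comparison, which is the motivation for modeling the precursors rather than the rare outcome alone.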

## It’s the semifinals! The Japanese dude who won the hot dog eating contest vs. Bruce Springsteen (1)

For our first semifinal match, we have an unseeded creative eater, up against the top-seeded person from New Jersey.

It’s Coney Island vs. Asbury Park: the battle of the low-rent beaches.

Again, we’re trying to pick the best seminar speaker. Here are the rules and here’s the bracket:

## Statistical-significance thinking is not just a bad way to publish, it’s also a bad way to think

Eric Loken writes:

The table below was on your blog a few days ago, with the clear point about p-values (and even worse the significance versus non-significance) being a poor summary of data. The thought I’ve had lately, working with various groups of really smart and thoughtful researchers, is that Table 4 is also a model of their mental space as they think about their research and as they do their initial data analyses.

It’s getting much easier to make the case that Table 4 is not acceptable to publish. But I think it’s also true that Table 4 is actually the internal working model for a lot of otherwise smart scientists and researchers. That’s harder to fix!

I agree. One problem with all this discussion of forking paths, publication bias, etc., is that this focus on the process of publication/criticism/replication/etc can distract us from the value of thinking clearly when doing research: avoiding the habits of wishful thinking and discretization that lead us to draw strong conclusions from noisy data.

Not long ago we discussed a noisy study that produced a result in the opposite direction of the original hypothesis, leading the researcher to completely change the scientific story. Changing your model of the world in response to data is a good thing, but not if the data are essentially indistinguishable from noise. Actually, in that case the decision was based on a p-value that did not reach the traditional level of statistical significance, but the general point still holds.

Whether you’re studying voting, or political attitudes, or sex ratios, or whatever, it’s ultimately not about what it takes, or should take, to get a result published, but rather how we as researchers can navigate through uncertainty and not get faked out by noise in our own data.

## Pele wins. On to the semifinals!

Like others, I’m sad that Veronica Geng is out of the running, so I’ll have to go with Diana:

Jonathan’s post-hoc argument for Geng was so good that I now have to vote for Pele, given that his name can be transformed into Geng’s through a simple row matrix operation (a gesture that just might move Geng to forsake the incendiary wine):

[P E L E] + [-9 0 2 2] = [G E N G]

You can’t do the same with Streep.

Then Jonathan himself popped up in comments:

With Geng out I no longer care. While I admit that a second resurrection would be de trop (as Sattouf might say) it’s my only hope. Though really, the only appropriate thing is to simply have her win the final even though she’s not in it. So I’ll just wait for that.

But if I had to go with one here, I had lunch with Meryl Streep once. She didn’t have a lot to say. I’ll go for Streep anyway, on the condition that she sticks to a script written by Geng.

If the best argument in favor of Streep is that she didn’t have a lot to say . . . that’s not so great. So Pele it is. We’ll see how he handles Sattouf in a couple of days.

## One more reason I hate letters of recommendation

Recently I reviewed a bunch of good reasons to remove letters of recommendation when evaluating candidates for jobs or scholarships.

Today I was at a meeting and thought of one more issue. Letters of recommendation are not merely a noisy communication channel; they’re also a biased channel. The problem is that letter writers are strategic: they’re writing their letters to manipulate you, the reader. Yes, I know we all try to adjust for these biases, but that just introduces one more complicating factor.

Really, what’s the point? Letters of recommendation have so many other problems, and this is one more.

Yesterday Dzhaughn gave a complicated argument but ultimately I couldn’t figure out if it was pro- or anti-Geng, so I had to go with Dalton’s straight shot:

Geng has been accused of being “subtle to the point of unintelligibility.” So apparently ole V puts the “b” in subtle. So here’s to our man, Riad who clearly puts the “f” in fun.

And now, our last quarterfinal match, featuring an unseeded GOAT vs. an unseeded person from New Jersey. Which of these award-winners should advance?

Again, we’re trying to pick the best seminar speaker. Here are the rules and here’s the bracket:

## Raghuram Rajan: “The Third Pillar: How Markets and the State Leave the Community Behind”

A few months ago I received a copy of the book, “The Third Pillar: How Markets and the State Leave the Community Behind,” by economist Raghuram Rajan. The topic is important and the book is full of interesting thoughts. It’s hard for me to evaluate Rajan’s economics and policy advice, so I’ll leave that to others.

To say it again: This post represents neither an endorsement nor a disparagement of Rajan’s book. I just found it difficult to evaluate.

What I will share is the email I sent to the publisher after receiving the manuscript:

I took a look at Rajan’s book and found what seems to be a mistake right on the first page. Maybe you can forward this to him and there will be a chance for him to correct it before the book comes out.

On the first page of the book, Rajan writes: “Half a million more middle-aged non-Hispanic white American males died between 1999 and 2013 than if their death rates had followed the trend of other ethnic groups.” There are some mistakes here. First, the calculation is wrong because it does not account for changes in the age distribution of this group. Second, it was actually women, not men, whose death rates increased. See here for more on both points.

The larger problem here is that there is received wisdom that white men are having problems, hence people attribute a general trend to men, even though in this case the trend is actually much stronger for women.

I noticed another error. On page 216, Rajan writes, “In the United States, the Affordable Care Act, or Obamacare, was the spark that led to the organizing of the Tea Party movement…” This is incorrect. The Tea Party movement started with a speech on TV in February, 2009, in opposition to Obama’s mortgage relief plan. From Wikipedia: “The movement began following Barack Obama’s first presidential inauguration (in January 2009) when his administration announced plans to give financial aid to bankrupt homeowners. A major force behind it was Americans for Prosperity (AFP), a conservative political advocacy group founded by businessmen and political activist David H. Koch.” The Affordable Care Act came later, with discussion in Congress later in 2009 and the bill passing in 2010. The Tea Party opposed the Affordable Care Act, but the Affordable Care Act was not the spark that led to the organizing of the Tea Party movement. This is relevant to Rajan’s book because it calls into question his arguments about populism.

The person to whom I sent this email said she notified the author so I hope he fixed these small factual problems and also that he correspondingly adjusted his arguments about populism. Arguments are ultimately based on facts; shift the facts and the arguments should change to some extent.

Most of the parents are roughly my age!

William McGlashan, 55 . . . Agustin Huneeus Jr., 53 . . . Elizabeth, 56, and Manuel Henriquez, 55 . . . Jane Buckingham, 50 . . . Gordon Caplan, 52 . . . Marcia Abbott, 59 . . . Robert Zangrillo, 52 . . . Stephen Semprevivo, 53 . . . Davina Isackson, 55 . . . Felicity Huffman, 56 . . . Mossimo Giannulli, 55, and Lori Loughlin, 54 . . . Diane, 55, and Todd Blake, 53 . . . Gregory Colburn, 59 . . . Elisabeth Kimmel, 54 . . . Toby MacFarlane, 56 . . . Peter Jan Sartorio, 53 . . . Marjorie Klapper, 50 . . . Devin Sloane, 53 . . . John Wilson, 59 . . . Homayoun Zadeh, 57. That last dude is a USC professor who paid a bribe to get his kid into . . . USC. I didn’t even know they played lacrosse in California. I thought lacrosse was an east coast thing.

Also a few people on the list in their 60s and older but only one person under 50 years old, and she’s 48.

Anyway, this makes me happy because I feel like we don’t see enough people my age in the news. The politicians are all in their 70s and the zillionaires and Hollywood types are in their 30s.

It’s good to have a new story I can personally relate to.

It seemed odd to me that there weren’t more younger parents in that list. But then I saw that more American women are having babies in their 30s than their 20s. Have a baby when you’re 30, in 18 years the kid’s going off to college, the scandal hits the press two years later, and, bam!, you’re already 50. So I guess it does make sense that almost all the parents involved are in their 50s and older. I just hadn’t thought it through.

P.S. As I wrote a while ago, following James Flynn, meritocracy won’t happen: the problem’s with the “ocracy”.

## stanc3: rewriting the Stan compiler

I’d like to introduce the stanc3 project, a complete rewrite of the Stan 2 compiler in OCaml.

With this rewrite and migration to OCaml, there’s a great opportunity to join us on the ground floor of a new era. Your enthusiasm for or expertise in programming language theory and compiler development can help bring Stan into the modern world of language design, implementation, and optimization. If this sounds interesting, we could really use your help! We’re meeting twice a week for a quick standup on Mondays and Wednesdays at 10am EST, and I’m always happy to help people get started via email, hangout, or coffee. If you’re an existing Stan user, get ready for friendly new language features and performance upgrades to existing code! It might be a little bumpy along the way, but we have a really great bunch of people working on it who all care most about making Stan a great platform for practicing scientists with bespoke modeling needs.

The opportunity

Stan is a successful and mature modeling language with core abstractions that have struck a chord, but our current C++ compiler inhibits some next-gen features that we think our community is ready for. Our users and contributors have poured a huge amount of statistical expertise into the Stan project, and we now have the opportunity to put similar amounts of programming language theory and compiler craftsmanship into practice. The rewrite will also aim at a more modular architecture, so that tooling can be built on top of the Stan compiler: features like IDE auto-completion and error highlighting, as well as programming and statistical code linters that can help users with common sources of modeling issues.

OCaml’s powerful and elegant pattern matching and seasoned parsing library make it a natural fit for the kinds of symbolic computation required of compilers. This makes it much more pleasant and productive for the task at hand, and is reflected by its frequent use by programming language researchers and compiler implementers. OCaml’s flagship parsing library Menhir enabled Matthijs Vákár to rewrite the Stan parsing phase in about a week, adding hundreds of new custom error messages in another week. Matthijs is obviously a beast, but I think he would agree that OCaml & Menhir definitely helped. Come join us and see for yourself :)

New language features

After we replicate the current Stan compiler’s functionality, we will be targeting new language features. The to-do list includes, but is not necessarily limited to:

• tuples
• tools for representing and working with ragged arrays
• higher order functions (functions that take in other functions)
• annotations
  • to bring methods like Posterior Predictive Checking and Simulation-Based Calibration into Stan itself
  • to label variables as “silent” (not output), or as living on a GPU or other separate hardware
  • to assist those who would like to use Stan as an algorithms workbench
• representations for missing data and sparse matrices
• discrete parameter marginalization

Next-gen optimization

But back to the next-gen features. Here is just some of the low-hanging fruit:

• peephole optimizations: we might notice when a user types log(1 - x) and replace it with log1m(x) automatically
• finding redundant computations and sharing the results
• moving computation up outside of loops (including the sampling loop!)
• using the data sizes to ahead-of-time compile a specialized version of the Stan program in which we can easily unroll loops, inline functions, and pre-allocate memory
• pulling parts of the Math library into the Stan compiler to e.g. avoid checking input matrices for positive-definiteness on every iteration of HMC
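To make the first item concrete, here is a minimal sketch of a peephole rewrite over a toy expression tree. It is written in Python rather than OCaml, and the `Num`, `Var`, `BinOp`, and `Call` node types and the `peephole` and `show` helpers are invented for illustration; none of them are taken from stanc3.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# A toy expression tree (hypothetical; not the stanc3 representation).
@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Call:
    fn: str
    args: Tuple["Expr", ...]


Expr = Union[Num, Var, BinOp, Call]


def peephole(e: Expr) -> Expr:
    """Rewrite every log(1 - x) into log1m(x), working bottom-up."""
    if isinstance(e, BinOp):
        return BinOp(e.op, peephole(e.left), peephole(e.right))
    if isinstance(e, Call):
        args = tuple(peephole(a) for a in e.args)
        if e.fn == "log" and len(args) == 1:
            a = args[0]
            # Match the pattern: a subtraction whose left operand is the literal 1.
            if isinstance(a, BinOp) and a.op == "-" and a.left == Num(1):
                return Call("log1m", (a.right,))
        return Call(e.fn, args)
    return e  # Num and Var are leaves


def show(e: Expr) -> str:
    """Pretty-print an expression for inspection."""
    if isinstance(e, Num):
        return f"{e.value:g}"
    if isinstance(e, Var):
        return e.name
    if isinstance(e, BinOp):
        return f"({show(e.left)} {e.op} {show(e.right)})"
    return f"{e.fn}({', '.join(show(a) for a in e.args)})"


before = BinOp("+", Call("log", (BinOp("-", Num(1), Var("theta")),)), Var("lp"))
after = peephole(before)
print(show(before))
print(show(after))
```

The chain of `isinstance` checks is the part that OCaml's pattern matching would express in a single `match` arm, which is one reason a language with algebraic data types is attractive for this kind of work.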

There is a wealth of information at the Stan language level that we can take advantage of to produce more efficient code than the more mature C++ compilers we rely on can produce on their own, and we can use the new compiler to pass some of that information along to the C++ code we generate. Maria Gorinova showed us with SlicStan how to move code to its most efficient (Stan) block automatically, and also demonstrated a nice composition-friendly syntax. We can use similar static analysis tools in a probabilistic setting to e.g. allow for discrete parameters via automated Rao-Blackwellization (i.e. integrating them out) or discover conjugacy relationships and use analytic solutions where applicable. We can go a step further and integrate with a symbolic differentiation library to get symbolic derivatives for Stan code as a fast substitution for automatic differentiation.
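For readers who have not integrated out a discrete parameter by hand, here is a small Python sketch of what the target of that transformation looks like for a two-component normal mixture. The compiler would have to derive something like `marginal_lpdf` automatically; here it is written out manually, and all function names and numbers are illustrative assumptions rather than anything from Stan or stanc3.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def marginal_lpdf(y, lam, mu, sigma):
    """log p(y) with the discrete component indicator z in {0, 1} summed out.

    z = 0 with probability lam, z = 1 with probability 1 - lam.
    """
    return log_sum_exp(
        math.log(lam) + normal_lpdf(y, mu[0], sigma),
        math.log1p(-lam) + normal_lpdf(y, mu[1], sigma),
    )


def brute_force_lpdf(y, lam, mu, sigma):
    """Same quantity, computed by explicitly enumerating z on the probability scale."""
    total = 0.0
    for z, pz in ((0, lam), (1, 1.0 - lam)):
        total += pz * math.exp(normal_lpdf(y, mu[z], sigma))
    return math.log(total)


# Illustrative values only.
y, lam, mu, sigma = 0.3, 0.25, (-1.0, 2.0), 1.5
print(marginal_lpdf(y, lam, mu, sigma))
print(brute_force_lpdf(y, lam, mu, sigma))
```

The two functions agree up to floating-point error; the marginalized form stays on the log scale throughout, which is what a sampler that only handles continuous parameters needs.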

Once we’ve created a platform for expressing Stan language concepts and optimizing them, we’ll naturally want to bring as much of our computation onto that platform as possible so we can optimize holistically. This will mean either using techniques like Lightweight Modular Staging to parse our existing C++ library into our Stan compiler representation, or beginning a project to rewrite and simplify the Stan Math library into the Stan language itself. We hope that with some of the extensions above, we’ll be able to express the vast majority of the Math library in the Stan language, and lean heavily on a symbolic differentiation library and the stanc3 code generator to generate optimized C++ code. This should shrink the size of our Math library by something like 12x, and take the code generation techniques used in PyTorch to the next level.

Alternative backend targets (TensorFlow, PyTorch, etc.)

At that point, targeting multiple backends will become fairly trivial. We can put compilation times squarely in the cross-hairs and provide an interpreted Stan that immediately gives feedback and has minimal time-to-first-draw. We can also target other backends like TensorFlow Probability and PyTorch that do not possess the wealth of specialty statistical functions and distributions that we do, but may make better use of the 100,000 spare GPUs you have sitting in your garage.