
stanc3: rewriting the Stan compiler

I’d like to introduce the stanc3 project, a complete rewrite of the Stan 2 compiler in OCaml.

Join us!

With this rewrite and migration to OCaml, there’s a great opportunity to join us on the ground floor of a new era. Your enthusiasm for or expertise in programming language theory and compiler development can help bring Stan into the modern world of language design, implementation, and optimization. If this sounds interesting, we could really use your help! We’re meeting twice a week for a quick standup on Mondays and Wednesdays at 10am EST, and I’m always happy to help people get started via email, hangout, or coffee. If you’re an existing Stan user, get ready for friendly new language features and performance upgrades to existing code! It might be a little bumpy along the way, but we have a really great bunch of people working on it who all care most about making Stan a great platform for practicing scientists with bespoke modeling needs.

The opportunity

Stan is a successful and mature modeling language with core abstractions that have struck a chord, but our current C++ compiler inhibits some next-gen features that we think our community is ready for. Our users and contributors have poured a huge amount of statistical expertise into the Stan project, and we now have the opportunity to put similar amounts of programming language theory and compiler craftsmanship into practice. The rewrite will also aim at a more modular architecture, which will enable tooling to be built on top of the Stan compiler: features like IDE auto-completion and error highlighting, as well as programming and statistical code linters that can help users with common sources of modeling issues. OCaml’s powerful and elegant pattern matching and seasoned parsing libraries make it a natural fit for the kinds of symbolic computation a compiler requires. This makes OCaml much more pleasant and productive for the task at hand, which is reflected in its frequent use by programming language researchers and compiler implementers. OCaml’s flagship parsing library, Menhir, enabled Matthijs Vákár to rewrite the Stan parsing phase in about a week, adding hundreds of new custom error messages in another week. Matthijs is obviously a beast, but I think he would agree that OCaml and Menhir definitely helped. Come join us and see for yourself :)

New language features

After we replicate the current Stan compiler’s functionality, we will be targeting new language features. The to-do list includes, but is not necessarily limited to:

  • tuples
  • tools for representing and working with ragged arrays
  • higher-order functions (functions that take other functions as arguments)
  • variadic functions
  • annotations
    • to bring methods like Posterior Predictive Checking and Simulation-Based Calibration into Stan itself
    • to label variables as “silent” (not output), or as living on a GPU or other separate hardware
    • to assist those who would like to use Stan as an algorithms workbench
  • user-defined gradients
  • representations for missing data and sparse matrices
  • discrete parameter marginalization

Next-gen optimization

Optimization is where the new compiler architecture should really pay off. Here is just some of the low-hanging fruit:

  • peephole optimizations: we might notice when a user types log(1 - x) and automatically replace it with the more numerically stable log1m(x)
  • finding redundant computations and sharing the results
  • moving computation up outside of loops (including the sampling loop!)
  • using the data sizes to ahead-of-time compile a specialized version of the Stan program in which we can easily unroll loops, inline functions, and pre-allocate memory
  • pulling parts of the Math library into the Stan compiler to e.g. avoid checking input matrices for positive-definiteness on every iteration of HMC
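
The peephole idea in the first bullet is easy to sketch. Below is a toy Python version of such a rewrite pass; the real stanc3 pass is written in OCaml, and the tuple-based AST encoding here is invented purely for illustration:

```python
# Minimal sketch of a peephole rewrite pass over an expression tree.
# The ("call", name, arg) / ("binop", op, lhs, rhs) encoding is made up
# for this example; it is not the stanc3 intermediate representation.

def rewrite(expr):
    """Recursively rewrite log(1 - x) into log1m(x)."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a variable name or constant
    op, *args = expr
    args = [rewrite(a) for a in args]  # rewrite bottom-up
    # pattern: ("call", "log", ("binop", "-", 1, x)) -> ("call", "log1m", x)
    if (op == "call" and args[0] == "log"
            and isinstance(args[1], tuple)
            and args[1][:3] == ("binop", "-", 1)):
        return ("call", "log1m", args[1][3])
    return (op, *args)

before = ("call", "log", ("binop", "-", 1, "x"))
print(rewrite(before))  # ('call', 'log1m', 'x')
```

In OCaml each such pattern is a one-line `match` case, which is a big part of why the language fits this job.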

There is a wealth of information at the Stan language level that we can take advantage of to produce more efficient code than the more mature C++ compilers we rely on, and we can use the new compiler to pass some of that information along to the C++ code we generate. Maria Gorinova showed us with SlicStan how to automatically move code to its most efficient (Stan) block, along with a nice composition-friendly syntax. We can use similar static analysis tools in a probabilistic setting to e.g. allow for discrete parameters via automated Rao-Blackwellization (i.e. integrating them out), or to discover conjugacy relationships and use analytic solutions where applicable. We can go a step further and integrate with a symbolic differentiation library to get symbolic derivatives for Stan code as a fast substitute for automatic differentiation.
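
To give a flavor of that last point, symbolic differentiation over a compiler’s expression representation is a small recursive pass. Here is a toy Python sketch, with an invented tuple encoding and only `+` and `*` supported; nothing here is from the actual Stan code base:

```python
# Toy symbolic differentiation over a tuple-encoded expression tree.
# Supports only binary "+" and "*"; the encoding is invented for this sketch.

def diff(expr, var):
    """Return the symbolic derivative of expr with respect to var."""
    if expr == var:
        return 1
    if not isinstance(expr, tuple):
        return 0  # a constant or an unrelated variable
    op, a, b = expr
    if op == "+":
        return ("+", diff(a, var), diff(b, var))
    if op == "*":  # product rule: (ab)' = a'b + ab'
        return ("+", ("*", diff(a, var), b), ("*", a, diff(b, var)))
    raise ValueError(f"unsupported op: {op}")

def evaluate(expr, env):
    """Numerically evaluate a tuple expression under a variable environment."""
    if isinstance(expr, tuple):
        op, a, b = expr
        va, vb = evaluate(a, env), evaluate(b, env)
        return va + vb if op == "+" else va * vb
    return env[expr] if isinstance(expr, str) else expr

# d/dx (x * x) evaluated at x = 3 is 6
dx = diff(("*", "x", "x"), "x")
print(evaluate(dx, {"x": 3.0}))  # 6.0
```

A real pass would also simplify the result (fold `1 * x` to `x`, drop `+ 0` terms) before code generation.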

Advanced techniques

Once we’ve created a platform for expressing Stan language concepts and optimizing them, we’ll naturally want to bring as much of our computation onto that platform as possible so we can optimize holistically. This will mean either using techniques like Lightweight Modular Staging to parse our existing C++ library into our Stan compiler representation, or beginning a project to rewrite and simplify the Stan Math library in the Stan language itself. We hope that with some of the extensions above, we’ll be able to express the vast majority of the Math library in the Stan language, and lean heavily on a symbolic differentiation library and the stanc3 code generator to generate optimized C++ code. This should shrink the size of our Math library by something like 12x, and take the code generation techniques used in PyTorch to the next level.

Alternative backend targets (TensorFlow, Pytorch, etc.)

At that point, targeting multiple backends will become fairly trivial. We can put compilation times squarely in the cross-hairs and provide an interpreted Stan that immediately gives feedback and has minimal time-to-first-draw. We can also target other backends like TensorFlow Probability and PyTorch that do not possess the wealth of specialty statistical functions and distributions that we do, but may make better use of the 100,000 spare GPUs you have sitting in your garage.

Riad Sattouf (1) vs. Veronica Geng; Bruce Springsteen advances

Personally, I’d rather hear Dorothy Parker, but I had to go with Dalton’s pitch:

Ah, but Dorothy Parker is actually from New Jersey. In fact, both Bruce and Dorothy are members of the official New Jersey Hall of Fame (both were born in Long Branch, NJ). But Bruce is backed up (literally) by another member of the New Jersey Hall of Fame: the E Street Band, so advantage Bruce.

Granted, New Jersey is the armpit of America. But some people would rather rule in hell than serve in heaven, and if we have to pick the person to rule the Parkway, it’s gotta be the Boss.

And now today’s quarterfinal: Both Sattouf and Geng had troubled childhoods, and they’re both hilarious, but Sattouf can draw—and his name ends in f. Is that enough to win it for him? It’s up to you to decide!

Again, we’re trying to pick the best seminar speaker. Here are the rules and here’s the bracket:

From the Stan forums: “I’m just very thirsty to learn and this thread has become a fountain of knowledge”

Bob came across the above quote in this thread.

More generally, though, I want to recommend the Stan Forums.

As you can see from the snapshot below, the topics are varied:

The discussions are great, and anyone can jump in. Lots of example code and all sorts of things.

Also of interest: the Stan case studies.

Dorothy Parker (2) vs. Bruce Springsteen (1); the Japanese dude who won the hot dog eating contest advances

Dalton made an impressive argument, too complicated to summarize, in favor of Jim Thorpe, “the destroyer of hot dog vendors,” but this was countered by Thomas’s logic:

Since Jim Thorpe is top dog in whatever he tries his hand at, his demise is now inevitable.

And ultimately I had to go with Albert, who made the straight-up case for Kobayashi:

Why does everyone think Jim Thorpe could eat more hot dogs? The Japanese guy doubled the previous record on his first try! There’s just no way an untrained person is getting up off the couch and out-eating the Japanese dude who won the hot dog eating contest.

Here’s him going up against a soccer player.

Also, consider that competitive eating is a more interesting sport than baseball, and requires more of both physical training and general athleticism.

Today it’s the legendary wit versus the New Bob Dylan. Manhattan vs. Jersey. Your arguments, my call.

Again, we’re trying to pick the best seminar speaker. Here are the rules and here’s the bracket:

R package for Type M and Type S errors

Andy Garland Timm writes:

My package for working with Type S/M errors in hypothesis testing, ‘retrodesign’, is now up on CRAN. It builds on the code provided by Gelman and Carlin (2014) with functions for calculating type S/M errors across a variety of effect sizes as suggested for design analysis in the paper, a function for visualizing the errors, and implements Lu et al.’s (2018) closed form solution for type M error, which is a nice speed boost. You can find the vignette online here, which goes into more detail about its functionality. Next on my to-do list for the package are tools for working with these errors in regression more easily.

If you want a visual for people, this little example of type S/M error with simulated N(.5,1) data could be good (Is including images often still a thing bloggers care about?):

Here, the dotted line is the true effect size, and the full lines are where the statistic becomes statistically significantly different from 0, given our standard error of 1. The grayed out points aren’t statistically significant, the squares are type M errors, and the triangles are type S errors.
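
For readers who want to poke at the same numbers, the core type S / type M calculation from Gelman and Carlin (2014) fits in a few lines. This is a stdlib Python re-derivation for the blog’s N(.5, 1) example, not the retrodesign R package itself:

```python
# Type S / type M errors for a true effect D estimated with standard error s,
# following Gelman and Carlin (2014). Power and type S are in closed form;
# type M (the exaggeration ratio) is done here by simulation.
import random
from statistics import NormalDist

def retrodesign(D, s, z=1.96, n_sims=200_000, seed=1):
    Phi = NormalDist().cdf
    # power: probability the estimate comes out statistically significant
    power = 1 - Phi(z - D / s) + Phi(-z - D / s)
    # type S: probability a significant estimate has the wrong sign
    type_s = Phi(-z - D / s) / power
    # type M: E[|estimate| | significant] / D, estimated by simulation
    rng = random.Random(seed)
    sig = [abs(x) for x in (rng.gauss(D, s) for _ in range(n_sims))
           if abs(x) > z * s]
    type_m = (sum(sig) / len(sig)) / D
    return power, type_s, type_m

power, type_s, type_m = retrodesign(0.5, 1)
print(f"power={power:.3f}  type S={type_s:.3f}  exaggeration={type_m:.1f}")
```

For these settings the power is about 8%, roughly 9% of significant estimates have the wrong sign, and significant estimates overstate the true effect by almost a factor of 5.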

Jim Thorpe (1) vs. the Japanese dude who won the hot dog eating contest

OK, now it starts to get interesting . . .

Again, we’re trying to pick the best seminar speaker. Here are the rules and here’s the bracket so far:

Junk science + Legal system = Disaster

Javier Benitez points us to this horrifying story from Liliana Segura: “Junk Arson Science Sent Claude Garrett to Prison for Murder 25 Years Ago. Will Tennessee Release him?”

Meryl Streep advances; it’s down to the quarterfinals!

As Manuel put it, as Stephen Hawking might put it: It’s duets with Pierce Brosnan all the way down.

Political Polarization and Gender Gap: I Don’t Get Romer’s Beef with Bacon.

Gur Huberman writes:

Current politics + statistical analysis, the Paul Romer v. 538 edition:

Economist Paul Romer is criticizing a news article by Perry Bacon, Jr. entitled, “The Biggest Divides On The Kavanaugh Allegations Are By Party — Not Gender.”

My reaction:

I don’t get Romer’s beef. Bacon’s article seems reasonable to me: He (Bacon) lists a bunch of opinion questions where the difference between Democrats and Republicans is much bigger than the difference between men and women. This is no surprise at all. I doubt that Bacon thought this was a particularly newsworthy finding; rather, he’s doing the sort of “dog bites man” reporting, reminding us of a huge something—in this case, partisan polarization—that’s out there. In response, Romer does a bunch of throat-clearing about the importance of what he’s about to say, and then . . . says that the partisan comparison isn’t appropriate because it excludes independents (fine, but then you could add “leaners” to the partisans and I’d expect you’d get similar results to what Bacon reported for the pure partisans), and then he pulls out one question where the gender gap is larger than the partisan gap. But that’s just one question, so I don’t see how it renders “objectively false” Bacon’s statement that, “Even on gender issues more broadly, the partisan divide outstretches the gender one.” Romer might be right that Bacon’s claim is wrong, but to assess that claim you’d have to look at more than one issue.

I think Romer’s post would be stronger without the rhetoric: It would be fine to say that Bacon is making a commonplace point (not a problem, as sometimes we have to remind people of things that the experts already know) and then for Romer to add some subtleties such as in that one question where the gender gap is particularly strong.

LeBron James (3) vs. Meryl Streep; Pele advances

Yesterday I was gonna go with Turing following Dalton’s eloquent argument:

To his credit, Pele didn’t have to play in the era of VAR (video assistant referees). As last Tuesday’s Champions League game between Paris Saint-Germain and Manchester United (as well as the recent World Cup final between Croatia and France) demonstrated, there is an indistinct but abrupt border between the beautiful and the farcical, and that border is guarded by imperfect humans with flawed methods for making binary decisions. Football is all kinetic energy and continuous movement. That’s why it is so beautiful and beloved. But (like Stan) football runs into trouble when that continuity is confronted with a latent discrete state. The stochastic motions of a 450-gram truncated icosahedron will occasionally cause it to careen into the demilitarized zone between the discrete states delineated by the Laws of the Game. Did it or did it not touch the hand or arm? Was that contact SIGNIFICANT enough to cause a penalty? Many of us hate the use of p-values on this blog, but when was the last time somebody was stabbed because they didn’t like the outcome of a binary decision given by a p-value?

Pele, at his best, transported us so far away from these uncomfortable boundaries that we were able to forget they exist. Turing, on the other hand, dwelt in the uncanny trenches where the messy but still quantized states of the world must be directly confronted. Football’s (and Pele’s) transcendence can only be temporary because of the Laws of the Game. We need the Laws of the Game, otherwise every game would only be ever escalating chaos whose obvious destination is violence. But the Laws of the Game, by attempting to impose order, manage only to concentrate the chaos at the seams. Turing helped prove this fundamental limitation on our ability to reside forever in the domain of certainty. And so elevating Turing over Pele is the only grown-up thing to do. As much fun as Pele can be the night before, only Turing allows us to continue living after the cold brutal truth of the morning after.

So true. But then Manuel won it for the futebolista with this enigmatic retort:

bciw mqbe huek cwcq kwtn wbgo sphe wthr behq jpiz htjz fjnj ntic kkzu eyxr ndan cfoq

Model: Enigma M4 “Shark”
Reflector: UKW B thin
Rotor 1: Beta, position 1, ring 1
Rotor 2: I, position 17, ring 1
Rotor 3: I, position 12, ring 1
Rotor 4: I, position 1, ring 1

Plugboard: bq cr di ej kw mt os px uz gh

Composed with the help of

tl;dr: Pelé advances

As for today’s contest . . . LeBron hasn’t been looking too good lately, but do we really need to see a 22nd Oscar-nominated performance? Both these contestants are a bit overexposed. But what do you think?

Again, here are the rules and here’s the bracket:

Remember that paper we wrote, The mythical swing voter? About shifts in the polls being explainable by differential nonresponse? Mark Palko beat us to this idea, by 4 years.

So. The other day I came across a link by Palko to this post from 2012, where he wrote:

Pollsters had long tracked campaigns by calling random samples of potential voters. As campaigns became more drawn out and journalistic focus shifted to the horse-race aspects of elections, these phone polls proliferated. At the same time, though, the response rates dropped sharply, going from more than one in three to less than one in ten.

A big drop in response rates always raises questions about selection bias since the change may not affect all segments of the population proportionally . . . It also increases the potential magnitude of these effects. . . .

Poll responses are basically just people agreeing to talk to you about politics, and lots of things can affect people’s willingness to talk about their candidate, including things that would almost never affect their actual votes . . .

[In September, 2012] the Romney campaign hit a stretch of embarrassing news coverage while Obama was having, in general, a very good run. With a couple of exceptions, the stories were trivial, certainly not the sort of thing that would cause someone to jump the substantial ideological divide between the two candidates so, none of Romney’s supporters shifted to Obama or to undecided. Many did, however, feel less and less like talking to pollsters. So Romney’s numbers started to go down which only made his supporters more depressed and reluctant to talk about their choice.

This reluctance was already just starting to fade when the first debate came along. . . . after weeks of bad news and declining polls, the effect on the Republican base of getting what looked very much like the debate they’d hoped for was cathartic. Romney supporters who had been avoiding pollsters suddenly couldn’t wait to take the calls. By the same token, Obama supporters who got their news from Ed Schultz and Chris Matthews really didn’t want to talk right now.

The polls shifted in Romney’s favor even though, had the election been held the week after the debate, the result would have been the same as it would have been had the election been held two weeks before . . .

So response bias was amplified by these factors:

1. the effect was positively correlated with the intensity of support

2. it was accompanied by matching but opposite effects on the other side

3. there were feedback loops — supporters of candidates moving up in the polls were happier and more likely to respond while supporters of candidates moving down had the opposite reaction.

The above completely anticipates the main result of our Mythical Swing Voter paper, which is based on the Xbox polling data we collected in 2012, analyzed in 2013, wrote up in 2014, and published in 2016, and which was picked up in the news media in time for the 2016 campaign.

I’m not saying our paper was valueless: we didn’t just speculate, we provided careful data analysis. The thing is, though, that the pattern we found, that big swings in Obama support could mostly be explained by differential nonresponse, surprised us. It wasn’t what we expected, it’s not something we thought about at all in our 1993 paper, and it took us a while to digest this finding. But Palko had already laid out the whole story, up to and including the feedback mechanism by which small swings in vote preference are magnified into big swings in the polls, with all this connecting to the rise in survey nonresponse.

I probably even read Palko’s post when it came out back in 2012, but, if so, I didn’t get the point.

There’s something wrong with the world that his blog (cowritten with Joseph Delaney) doesn’t have a million readers.

P.S. Doug Rivers (one of my coauthors on the Mythical Swing Voter paper) was also talking in 2012 about differential nonresponse; see the last three paragraphs here.

Alan Turing (4) vs. Pele; Veronica Geng advances

I gotta go with Geng, based on this from Jonathan:

I was all in on Geng, as you know, but I have no idea what she sounded like.

But it’s not the voice, is it? It’s the content. And listen to what Geng could do (Remorse, April 7, 1986): “I will also spend one hundred hours working with youthful offenders, who, I believe, could profit tremendously from one hundred hours away from the grind of science or math, listening instead to me explaining why I am talking to them instead of their teachers or parents.” If she can do that for youthful offenders, imagine what she can do for those of us lucky enough to attend. And science and math really is a grind, no?

Dalton almost had me going with this counter-argument:

If we’re going solely by Wikipedia (and let’s be honest, I have been for the entire contest), it’s Nora by a mile. Nora’s got a picture and bio-box. A personal life and a career section. An entire section entitled “Ephron and Deep Throat.” Nicely formatted tables.

I was with him until he got to the tables. I hate tables.

Today two modern secular saints face off. Pele can do anything with a soccer ball. But a Turing machine can do anything computable. We’re in a Venn diagram situation here:
– There are some things that are computable but can’t be done with a soccer ball.
– There are some things that are computable and can be done with a soccer ball.
– There are some things that can be done with a soccer ball but are not computable.

I can see a path to victory for either contestant. On one hand, if Pele could implement the Game of Life using a soccer ball, then Turing would be superfluous. From the other direction, if Turing could implement soccer using Boolean operators, then we wouldn’t need Pele. Either of these tasks seems pretty NP-tough to me. But this is a hypothetical seminar series, so all things are possible, no?

Again, here are the rules and here’s the bracket:

Not Dentists named Dennis, but Physicists named Li studying Li

Charles Jackson writes:

I was spurred to do this search by reading an article in the 30 Mar 2018 issue of Science. The article was:

Self-heating–induced healing of lithium dendrites, by Lu Li et al.

Wikipedia says that more than 93 million people in China have the surname Li.

I found 62 articles on Lithium with authors named Li from one publisher. From the search function on the AAAS website with my annotations:

As the saying goes, further research is needed.

Veronica Geng vs. Nora Ephron; Riad Sattouf advances

Not much going on in yesterday’s Past vs. Future battle. Maybe we should’ve brought in Michael J. Fox as a guest judge . . .

Anyway, the best argument in the comments came from Ethan:

Since we can’t have Mr P let’s have Mr B.

Ahhh, but we can have Mr P. We can always have Mr P. So then we don’t need Mr B. Sorry, Mel!

Today is a battle of two literary wits. Highbrow vs. Middlebrow. Or maybe it would be more precise to say High Middlebrow vs. Middlebrow. The New Yorker vs. Hollywood. Two different branches of the literary comedy tradition. Love Trouble vs. Heartburn. Mr. Reagan vs. Proust. Can’t go wrong with either choice. What’s yours?

Again, here are the rules and here’s the bracket:

The neurostatistical precursors of noise-magnifying statistical procedures in infancy

David Allison points us to this paper, The neurodevelopmental precursors of altruistic behavior in infancy, by Tobias Grossmann, Manuela Missana, and Kathleen Krol, which states:

The tendency to engage in altruistic behavior varies between individuals and has been linked to differences in responding to fearful faces. The current study tests the hypothesis that this link exists from early in human ontogeny. Using eye tracking, we examined whether attentional responses to fear in others at 7 months of age predict altruistic behavior at 14 months of age. Our analysis revealed that altruistic behavior in toddlerhood was predicted by infants’ attention to fearful faces but not happy or angry faces. Specifically, infants who showed heightened initial attention to (i.e., prolonged first look) followed by greater disengagement (i.e., reduced attentional bias over 15 seconds) from fearful faces at 7 months displayed greater prosocial behavior at 14 months of age. Our data further show that infants’ attentional bias to fearful faces and their altruistic behavior was predicted by brain responses in the dorsolateral prefrontal cortex (dlPFC), measured through functional near-infrared spectroscopy (fNIRS). This suggests that, from early in ontogeny, variability in altruistic helping behavior is linked to our responsiveness to seeing others in distress and brain processes implicated in attentional control. These findings critically advance our understanding of the emergence of altruism in humans by identifying responsiveness to fear in others as an early precursor contributing to variability in prosocial behavior.

Allison writes:

From the paper, I discern that stepwise regression was used, but could not determine how many variables were used and whether any adjustment to the reported significance levels to accommodate the overfitting that is known to occur with stepwise selection was used. This raises questions when interpreting the results.

Also, they’re making the classic error of labeling differences as real if they’re statistically significant and zero if they’re not. That’s a standard statistical technique, but it’s a disaster; it’s a way to add noise to your study and get overconfidence.

I assume the authors of this paper were doing their best, but I’m very doubtful that they’ve offered real support for their claim that their findings “critically advance our understanding of the emergence of altruism in humans.” That’s a bit over the top, no?

Riad Sattouf (1) vs. Mel Brooks; Bruce Springsteen advances

Dalton crisply solved yesterday’s problem right away with,

The real test is would you rather have Bruce Springsteen cook for you or have Julia Child sing to you?

To ask this question is to answer it. I didn’t even read any comments after that one.

Today’s matchup is more exciting. It’s The Arab of the Future vs. The 2000 Year Old Man. The future versus the past: you can’t get more primal than that. Both these guys are very funny and are great with facial expressions (in different media). How are we gonna pick this one? And is the winner doomed to be knocked out in the next round by Veronica Geng??

Again, here are the rules and here’s the bracket:

A corpus in a single survey!

This was something we used a few years ago in one of our research projects and in the paper, Difficulty of selecting among multilevel models using predictive accuracy, with Wei Wang, but didn’t follow up on. I think it’s such a great idea I want to share it with all of you.

We were applying a statistical method to survey data, and we had a survey to work with. So far, so usual: it’s a real-data application, but just one case. Our trick was that we evaluated our method separately on 71 different survey responses, taking each in turn as the outcome.

So now we have 71 cases, not just 1. But it takes very little extra work because it’s the same survey and the same poststratification variables each time.

In contrast, applying our method to 71 different surveys would be a lot of work, as it would require wrangling each dataset, dealing with different question wordings and codings, etc.

The corpus formed by these 71 questions is not quite the same as a corpus of 71 different surveys. For one thing, the respondents are the same, so if the particular sample happens to overrepresent Democrats, or Republicans, or whatever, then this will be the case for all 71 analyses. But this problem is somewhat mitigated if the 71 responses are on different topics, so that nonrepresentativeness in any particular dimension won’t be relevant for all the questions.
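
The trick is easy to see in schematic form. The sketch below uses invented toy data and a deliberately trivial scoring rule (the real project fit multilevel models with poststratification); the point is only the loop structure, where the respondents and covariates stay fixed and only the outcome column changes:

```python
# Schematic of the "corpus in a single survey" trick: one survey, but each
# of the 71 response columns is taken in turn as the outcome, giving 71
# evaluation cases from a single dataset. Data and the scoring rule are toys.
import random

rng = random.Random(0)
n_respondents, n_questions = 500, 71

# one simulated survey: a shared binary covariate plus 71 response columns
covariate = [rng.choice([0, 1]) for _ in range(n_respondents)]
responses = {
    f"q{j}": [rng.random() < 0.3 + 0.3 * c for c in covariate]
    for j in range(n_questions)
}

def evaluate(outcome, covariate):
    """Toy 'method': predict each respondent's answer from their
    covariate-group majority, and report in-sample accuracy."""
    preds = {}
    for c in (0, 1):
        group = [y for y, ci in zip(outcome, covariate) if ci == c]
        preds[c] = sum(group) >= len(group) / 2
    return sum(preds[ci] == y for y, ci in zip(outcome, covariate)) / len(outcome)

# same respondents and same covariates every time; only the outcome changes,
# so 71 evaluations cost barely more than one
scores = [evaluate(responses[q], covariate) for q in responses]
print(f"{len(scores)} cases, mean accuracy {sum(scores) / len(scores):.2f}")
```

Swapping in a real model changes only the `evaluate` function; the loop over outcome columns is the whole idea.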

“Abandon / Retire Statistical Significance”: Your chance to sign a petition!

Valentin Amrhein, Sander Greenland, and Blake McShane write:

We have a forthcoming comment in Nature arguing that it is time to abandon statistical significance. The comment serves to introduce a new special issue of The American Statistician on “Statistical inference in the 21st century: A world beyond P < 0.05”. It is titled “Retire Statistical Significance”—a theme of many of the papers in the special issue, including the editorial introduction—and it focuses on the absurdities generated by so-called “proofs of the null”. Nature has asked us to recruit “co-signatories” for the comment (for an example, see here) and we think readers of your blog would be interested. If so, we would be delighted to send a draft to interested parties for signature. Please request a copy at and we will send it (Nature has a very strict embargo policy so please explicitly indicate you will keep it to yourself) or, if you already agree with the message, please just sign here. The timeline is tight so we need endorsements by Mar 8, but the comment is short at ~1500 words.

I signed the form myself! I like their paper and agree with all of it, with just a few minor issues:

– They write, “For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss.” I’d remove this sentence, first because the connection to the coin toss does not seem clear—it’s a cute mathematical argument but I think just confusing in this context—second because I feel that the whole p=0.03 vs. p=0.06 thing (or, worse, p=0.49 vs. p=0.51) is misleading. The fundamental problem with “statistical significance” is not the arbitrariness of the bright-line rule, but rather the fact that even apparently large differences in p-values (for example, p=0.01 and p=0.30 mentioned later in that paragraph) can be easily explained by noise.

– Also in that paragraph they refer to two studies with 80% power. This too is a bit misleading, I think: People always think they have 80% power when they don’t (see here and here).

– I like that they say we must learn to embrace uncertainty!

– I’m somewhat bothered by this recommendation from their paper: “We recommend that authors describe the practical implications of all values inside the interval, especially the observed effect or point estimate (that is, the value most compatible with the data) and the limits. All the values between the interval’s limits are reasonably compatible with the data.” My problem is that in many cases of forking paths and selection, we have no good reason to think of any of the values within the confidence interval as reasonable. Take, for example, that study of beauty and sex ratio which purportedly found an 8 percentage point difference with a 95% confidence interval of something like [2%, 14%]. Even 2%, even 1%, would be highly implausible here. So I don’t think it’s accurate in this case to say that the values in the range [2%, 14%] are “reasonably compatible with the data.”

I understand the point they’re trying to make, and I like the term “compatibility intervals,” but I think you have to be careful not to put too much of a burden on these intervals. There are lots of people out there who say, Let’s dump p-values and instead use confidence intervals. But confidence intervals have these selection problems too. I agree with the things they say in the paragraphs following the above quote.

– They write that in the future, “P-values will be reported precisely (e.g., P = 0.021 or P = 0.13) rather than as binary inequalities.” I don’t like this! I mean, sure, binary is terrible. But “P = 0.021” is, to my mind, ridiculous over-precision. I’d rather see the estimate and the standard error.
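
For what it’s worth, the coin-toss comparison in the first point is presumably the “surprisal” (S-value) reading of a p-value, under which halving p always costs exactly one bit, i.e. one more head in a row from a fair coin. A quick check:

```python
# The "surprisal" or S-value reading of a p-value: s = -log2(p) is the number
# of consecutive heads from a fair coin exactly as surprising as observing
# that p-value. P = 0.03 vs P = 0.06 then differ by exactly one bit.
import math

def s_value(p):
    return -math.log2(p)

print(round(s_value(0.03), 2))  # 5.06 bits
print(round(s_value(0.06), 2))  # 4.06 bits
print(s_value(0.03) - s_value(0.06))
```

Cute, but as the authors say, probably more confusing than helpful in context.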

Anyway, I think their article is great; the above comments are minor.

Key point from Amrhein, Greenland, and McShane:

We don’t mean to drop P-values, but rather to stop using them dichotomously to decide whether a result refutes or supports a hypothesis.

Also this:

The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy, and business environments, decisions based on the costs, benefits, and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to further pursue a research idea, there is no simple connection between a P-value and the probable results of subsequent studies.

Yes yes yes yes yes. See this other paper of ours for further elaboration of these points.

P.S. As noted above, I signed the petition and I recommend you, the readers, consider doing so as well. That said, I fully respect people who don’t like to sign petitions. Feel free to use the comment thread both to discuss the general idea of retiring statistical significance, as well as questions of whether petitions are a good idea . . .

Julia Child (2) vs. Bruce Springsteen (1); Dorothy Parker advances

Yesterday it was Dorothy Parker in a landslide. Commenters just couldn’t resist dissing the Wild and Crazy Guy. Noah came in with a limerick:

There once was a Martin named Steven
whose humor we used to believe in.
His outlook got starker.
He’s no Dorothy Parker.
In this matchup, then, Steven be leavin’.

And Dzhaughn took it home with:

Tut tut! We seldom make pharaohs of men who wear arrows.

As for today’s matchup . . . what can I say, Julia Child is the ultimate dark horse. I have no idea how she got this far. Where’s Virginia Apgar when we need her?

Again, here are the rules and here’s the bracket:

(back to basics:) How is statistics relevant to scientific discovery?

Someone pointed me to this remark by psychology researcher Daniel Gilbert:

Publication is not canonization. Journals are not gospels. They are the vehicles we use to tell each other what we saw (hence “Letters” & “proceedings”). The bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech.

Which led me to this, where Gilbert approvingly quotes a biologist who wrote, “Science is doing what it always has done — failing at a reasonable rate and being corrected. Replication should never be 100%.”

I’m really happy to see this. Gilbert has been a loud defender of psychology claims based on high-noise studies (for example, the ovulation-and-clothing paper), and not long ago he was associated with the claim that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” This was in the context of an attack by Gilbert and others on a project that conducted replication studies of a large set of psychology experiments and found that many of those previously published claims did not hold up under replication.

So I think this is a big step forward: Gilbert and his colleagues are moving beyond denial to a more realistic view, one that accepts failure as a routine, indeed inevitable, part of science, and recognizes that just because a claim is published, even in a prestigious journal, it doesn’t have to be correct.

Gilbert’s revised view—that the replication rate is not 100%, nor should it be—is also helpful in that, once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists who Gilbert earlier referred to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is, ultimately, the discovery of truths that would eventually have been discovered by someone else, thus the speeding along of a process we’d hope would happen anyway, so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery.

If we are discouraged from criticizing published work—or if our criticism elicits pushback and attacks from the powerful, or if it’s too hard to publish criticisms and obtain data for replication—that’s bad for discovery, in three ways. First, criticizing errors allows new science to move forward in useful directions. We want science to be a sensible search, not a random walk. Second, learning what went wrong in the past can help us avoid errors in the future. That is, criticism can be methodological and can help advance research methods. Third, the potential for criticism should allow researchers to be more free in their speculation. If authors and editors felt that everything published in a top journal was gospel, there could well be too much caution in what to publish.

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. Shame comes not when people make mistakes, but rather when they dodge criticism, won’t share their data, refuse to admit problems, and attack their critics.

But, yeah, let’s be clear: Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start.

So let’s talk a bit about failed replications.

First off, as Gilbert and others have noted, an apparently failed replication might not be anything of the sort. It could be that the replication study found no statistical significance because it was too noisy; indeed, I’m not at all happy with the idea of using statistical significance, or any such binary rule, as a measure of success. Or it could be that the effect found in the original study occurs only in some situations and not others. The original and replication studies could differ in some important ways.
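To make that first point concrete, here’s a minimal simulation (my own illustration with assumed numbers, not anything from the studies under discussion): when a true effect is about two standard errors in size, two identically designed studies frequently disagree on statistical significance purely by chance.

```python
import random

random.seed(0)

n_sims = 100_000
mean_z = 2.0   # assumed: true effect equal to 2 standard errors (~50% power)
crit = 1.96    # two-sided 5% threshold

orig_sig = 0
rep_not_sig = 0
for _ in range(n_sims):
    z_orig = random.gauss(mean_z, 1)  # z-score of the original study
    z_rep = random.gauss(mean_z, 1)   # z-score of an identical replication
    if abs(z_orig) > crit:
        orig_sig += 1
        if abs(z_rep) <= crit:
            rep_not_sig += 1

# Among originals that reached significance, the share of "failed"
# replications is close to one half, despite a real underlying effect.
print(rep_not_sig / orig_sig)
```

So by the binary significance criterion, roughly half of exact replications of a 50%-power study “fail,” which by itself tells us nothing about whether the effect is real.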

One thing that the replication does give you, though, is a new perspective. A couple years ago I suggested the “time-reversal heuristic” as a partial solution to the “research incumbency rule” in which people privilege the first study on a topic—even when the first study is small and uncontrolled and the second study is a large and careful replication.

In theory, an apparently failed replication can itself be a distraction, but in practice we often seem to learn a lot from these replications, for three reasons. First, the very act of performing a replication study can make us realize some of the difficulties and choices involved in the original study. This happened with us when we performed a replication of one of our own papers! Second, the failed replication casts some doubt on the original claim, which can motivate a more critical look at the original paper, and which can then reveal all sorts of problems that nobody noticed the first time. Third, lots of papers have such serious methodological problems that their conclusions are essentially little more than shufflings of random numbers—but not everyone understands methodological criticisms, so a replication study can be a convincer. Recall the recent paper with the replication prediction market: lots of these failed replications were no surprise to educated outsiders.

What, then, is—or should be—the role of statistics, and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making

Let’s go through each of these:

Design and data collection. Statistics can help us evaluate measures and can also give us a sense of how much accuracy we will need from our data to make strong conclusions later on. It turns out that many statistical intuitions, developed decades ago in the context of estimating large effects with good data, do not work so well when estimating small effects with noisy data; see this article for discussion of that point.

Data analysis. As has been discussed many times, one source of the recent replication crisis in science is the garden of forking paths: researchers gather rich data but then report only a small subset of what they found. By selecting on statistical significance, they throw away a lot of data and keep a biased subset. The solution is to report all your data, with no selection and no arbitrary dichotomization. At this point, though, analysis becomes more difficult: analyzing a whole grid of comparisons is harder than analyzing just one simple difference. Statistical methods can come to the rescue here, in the form of multilevel models.
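As a sketch of the multilevel idea (a toy version with known variance components, not a full model), analyze all the comparisons together and partially pool each noisy estimate toward the common mean, rather than selecting the extreme ones:

```python
import random

random.seed(0)

# Assumed toy setup: many comparisons whose true effects are drawn
# from a common population, each measured with noise.
tau, sigma, n = 0.5, 1.0, 1000   # spread of true effects; noise per comparison
truths = [random.gauss(0, tau) for _ in range(n)]
raw = [t + random.gauss(0, sigma) for t in truths]

# With tau and sigma known, the posterior mean shrinks each raw
# estimate toward zero by a fixed factor.
shrink = tau**2 / (tau**2 + sigma**2)
pooled = [shrink * y for y in raw]

def rmse(est):
    return (sum((e - t) ** 2 for e, t in zip(est, truths)) / n) ** 0.5

print(rmse(raw), rmse(pooled))  # the pooled error is clearly smaller
```

In real analyses tau and sigma are themselves estimated and the whole grid of comparisons is fit jointly (for example, in Stan), but the shrinkage principle is the same.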

Decision making. One way to think about the crisis of replication is that if you make decisions based on selected statistically significant comparisons, you will overstate effect sizes. Then you have people going around making unrealistic claims, and it can take years of clean-up to dial back expectations.
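A quick simulation of that overstatement (my own numbers, chosen to represent a small effect measured noisily): among estimates that happen to cross the significance threshold, the average magnitude is many times the true effect.

```python
import random

random.seed(0)

true_effect, se = 0.2, 1.0   # assumed: small true effect, large standard error
crit = 1.96 * se             # two-sided 5% significance threshold

# Keep only the estimates that reach statistical significance.
sig = [est for est in (random.gauss(true_effect, se) for _ in range(200_000))
       if abs(est) > crit]

# Ratio of the average significant estimate to the true effect.
exaggeration = sum(abs(e) for e in sig) / len(sig) / true_effect
print(exaggeration)  # much larger than 1: significant estimates overstate the effect
```

This is the "type M" (magnitude) error: decisions conditioned on significance are built on estimates that are systematically too large.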

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work.

To loop back to Daniel Gilbert’s observations quoted at the top of this post: We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

Reduce the costs of failed experimentation by being more clear when research-based claims are speculative.

React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

Avoid the two-tier system. Give respect to a student project or arXiv paper just as you would to a paper published in Science or Nature.

One last question

Finally, what should we think about research that, ultimately, has no value, where the measurements are so noisy that nothing useful can be learned about the topics under study?

For example, there’s that research on beauty and sex ratio which we’ve discussed so many times (see here for background).

What can we get out of that doomed study?

First, it’s been a great example, allowing us to develop statistical methods for assessing what can be learned from noisy studies of small effects. Second, on this particular example, we’ve learned the negative fact that this research was a dead end. Dead ends happen; this is implied by those Gilbert quotes above. One could say that the researcher who worked on those beauty-and-sex-ratio papers did the rest of us a service by exploring this dead end so that other people don’t have to. That’s a scientific contribution too, even if it wasn’t the original aim of the study.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

We’ve gone a long way in this direction, both statistically and sociologically. From “the replication rate . . . is statistically indistinguishable from 100%” to “Replication should never be 100%”: This is real progress that I’m happy to see, and it gives me more confidence that we can all work together. Not agreeing on every item, I’m sure, but with a common understanding of the fallibility of published work.