Data partitioning as an essential element in evaluation of predictive properties of a statistical method

In a discussion of our stacking paper, the point came up that LOO (leave-one-out cross validation) requires a partitioning of data—you can only “leave one out” if you define what “one” is.

It is sometimes said that LOO “relies on the data-exchangeability assumption,” but I don’t think that’s quite the right way to put it. LOO does, however, assume the relevance of a data partition. We discuss this briefly in section 3.5 of this article. For regular Bayes, p(theta|y) proportional to p(y|theta) * p(theta), there is no partition of data. “y” is just a single object. But for LOO, y must be partitioned. At first this bothered me about LOO, but then I decided that this is a fundamental idea, related to the idea of “internal replication” discussed by Ripley in his spatial statistics book. The idea is that with just “y” and no partitions, there is no internal replication and no statistically general way of making reliable statements about new cases.
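To make the role of the partition concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the data, the grouping, and the plug-in normal predictive model (a crude stand-in for a real Bayesian fit, and not anything from the stacking paper). The same y gives different leave-one-out summaries depending on whether “one” means a single observation or a whole group.

```python
import math
from statistics import mean, stdev

# Hypothetical data: three groups (say, three sites) with three observations each.
y = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 5.0, 5.2, 4.8]
group = [0, 0, 0, 1, 1, 1, 2, 2, 2]


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and sd sigma."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def loo_log_score(y, parts):
    """Sum of log predictive densities, holding out one part at a time.

    `parts` is a list of index lists that together cover all of y.
    The "model" is a plug-in normal fitted to the retained data.
    """
    total = 0.0
    for held_out in parts:
        train = [y[i] for i in range(len(y)) if i not in held_out]
        mu, sigma = mean(train), stdev(train)
        total += sum(normal_logpdf(y[i], mu, sigma) for i in held_out)
    return total


# Two different definitions of "one": a single observation, or a whole group.
by_observation = [[i] for i in range(len(y))]
by_group = [[i for i, g in enumerate(group) if g == k] for k in sorted(set(group))]

score_obs = loo_log_score(y, by_observation)
score_grp = loo_log_score(y, by_group)

print(f"leave-one-observation-out: {score_obs:.2f}")
print(f"leave-one-group-out:       {score_grp:.2f}")
```

Leaving out a whole group asks how well the model predicts a new group; leaving out a single observation asks about a new observation within groups already seen. Neither is implied by “y” alone.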

This is similar to (but different from) the distinction in chapter 6 of BDA between the likelihood and the sampling distribution. To do inference for a given model, all we need from the data is the likelihood function. But to do model checking, we need the sampling distribution, p(y|theta), which implies a likelihood function but requires more assumptions (as can be seen, for example, in the distinction between binomial and negative binomial sampling). Similarly, to do inference for a given model, all we need is p(y|theta) with no partitioning of y, but to do predictive evaluation we need a partitioning.
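The binomial versus negative binomial distinction can be checked numerically. The sketch below uses made-up data (3 successes in 12 trials) and hypothetical function names. It shows that the two sampling models give likelihoods proportional to each other as functions of theta, while the probabilities they assign to "data at least this extreme" differ, which is what matters for model checking.

```python
from math import comb

successes, trials = 3, 12  # hypothetical observed data


def binomial_pmf(s, n, theta):
    """P(s successes | n trials fixed in advance)."""
    return comb(n, s) * theta**s * (1 - theta) ** (n - s)


def negbinomial_pmf(n, s, theta):
    """P(n total trials | sampling continues until the s-th success)."""
    return comb(n - 1, s - 1) * theta**s * (1 - theta) ** (n - s)


# Same likelihood up to a constant: the ratio does not depend on theta.
thetas = [0.1, 0.25, 0.5, 0.75]
ratios = [
    binomial_pmf(successes, trials, t) / negbinomial_pmf(trials, successes, t)
    for t in thetas
]
print("likelihood ratios:", [round(r, 6) for r in ratios])

# Different sampling distributions: replications vary s under the binomial
# design and vary n under the negative binomial design.
theta = 0.25
p_binomial_tail = sum(binomial_pmf(s, trials, theta) for s in range(successes + 1))
p_negbinomial_tail = 1 - sum(
    negbinomial_pmf(n, successes, theta) for n in range(successes, trials)
)
print(f"P(s <= 3 | n = 12):  {p_binomial_tail:.3f}")
print(f"P(n >= 12 | s = 3):  {p_negbinomial_tail:.3f}")
```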

Oscar Wilde (1) vs. Joe Pesci; the Japanese dude who won the hot dog eating contest advances

Raghuveer gave a good argument yesterday: “The hot dog guy would eat all the pre-seminar cookies, so that’s a definite no.” But this was defeated by the best recommendation we’ve ever had in the history of the Greatest Seminar Speaker contest, from Jeff:

Garbage In, Garbage Out: Mass Consumption and Its Aftermath
Takeru Kobayashi

Note: Attendance at both sessions is mandatory.

Best. Seminar. Ever.

So hot dog guy is set to go to the next round, against today’s victor.

It’s the wittiest man who ever lived, vs. an unseeded entry in the People from New Jersey category. So whaddya want: some 125-year-old jokes, or a guy who probably sounds like a Joe Pesci imitator? You think I’m funny? I’m funny how, I mean funny like I’m a clown, I amuse you?

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Does Harvard discriminate against Asian Americans in college admissions?

Sharad Goel, Daniel Ho and I looked into the question, in response to a recent lawsuit. We wrote something for the Boston Review:

What Statistics Can’t Tell Us in the Fight over Affirmative Action at Harvard

Asian Americans and Academics

“Distinguishing Excellences”

Adjusting and Over-Adjusting for Differences

The Evolving Meaning of Merit

Character and Bias

A Path Forward

The Future of Affirmative Action

Carol Burnett (4) vs. the Japanese dude who won the hot dog eating contest; Albert Brooks advances

Yesterday was a tough matchup, but ultimately John “von” Neumann was no match for a very witty Albert Einstein.

The deciding argument, from Martha:

I’d like to see Von Neumann given four parameters and making an elephant wiggle his trunk. And if he could do it, there would be the chance that Jim Thorpe could do it if they met in a later round.

No way do I think that Neumann could fit that elephant. As I wrote earlier, that elephant quote just seems like bragging! For one thing, I can have a model with a lot more than five parameters and still struggle to fit my data.

I almost want to invite Neumann to speak, just so we can put him on the spot, ask him to fit the damn elephant, and watch him fail. But that’s not cool, to invite a speaker just for the purpose of seeing him crash and burn. That way lies madness.

Today’s contest features two unique talents. Carol Burnett was the last of the old-time variety-show hosts; she can sing, she can dance, and according to Wikipedia, she was “the first celebrity to appear on the children’s series Sesame Street.” But she’s facing stiff competition from the Japanese dude who won the hot dog eating contest. That’s an accomplishment, to have done something so impressive that this one feat defines you. So I think that whoever advances to the next round will be a strong competitor. Neither Carol Burnett nor the Japanese dude who won the hot dog eating contest is a top seed, but both of them are interesting dark horse candidates.

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Storytelling: What’s it good for?

A story can be an effective way to send a message. Anna Clemens explains:

Why are stories so powerful? To answer this, we have to go back at least 100,000 years. This is when humans started to speak. For the following roughly 94,000 years, we could only use spoken words to communicate. Stories helped us survive, so our brains evolved to love them.

Paul Zak of the Claremont Graduate University in California researches what stories do to our brain. He found that once hooked by a story, our brain releases oxytocin. The hormone affects our mood and social behaviour. You could say stories are a shortcut to our emotions.

There’s more to it; stories also help us remember facts. Gordon Bower and Michal Clark from Stanford University in California let two groups of subjects remember random nouns. One group was instructed to create a narrative with the words, the other to rehearse them one by one. People in the story group recalled the nouns correctly about six to seven times more often than the other group.

But my collaborator Thomas Basboll is skeptical:

It seems to me that a paper that has been written to mimic the most compelling features of Hollywood blockbusters (which Anna explicitly invokes) is also, perhaps unintentionally, written to avoid critical engagement. Indeed, when Anna talks about “characters” she does not mention the reader as a character in the story, even though the essential “drama” of any scientific paper stems from the conversation that reader and writer are implicitly engaged in. The writer is not simply trying to implant an idea in the mind of the reader. In a research paper, we are often challenging ideas already held and, crucially, opening our own thinking to those ideas and the criticism they might engender.

Basboll elaborates:

Anna promises that storytelling can produce papers that are “concise, compelling, and easy to understand”. But I’m not sure that a scientific paper should actually be compelling. . . . A scientific paper should be vulnerable to criticism; it should give its secrets away freely, unabashedly. And the best way to do that is, not to organise it with the aim of releasing oxytocin in the mind of the reader, but by clearly identifying your premises and your conclusions and the logic that connects them. You are not trying to bring your reader to a narrative climax. You are trying to be upfront about where your argument will collapse under the weight of whatever evidence the reader may bring to the conversation. Science, after all, is not so much about what Coleridge called “the suspension of disbelief” as what Merton called “organised skepticism”.

In our article from a few years ago, Basboll and I wrote about how we as scientists learn from stories. In discourse about science communication, stories are typically presented as a way for scientists to frame, explain, and promote their already-formed ideas; in our article, Basboll and I looked from a different direction, considering how it is that scientists can get useful information from stories. We concluded that stories are a form of model checking, that a good story expresses true information that contradicts some existing model of the world.

Basboll’s above exchange with Clemens is interesting in a different way: Clemens is saying that stories are an effective way to communicate because they are compelling and memorable. Basboll replies that science shouldn’t always be compelling: so much of scientific work is mistakes, false starts, blind alleys, etc., so you want the vulnerabilities of any scientific argument to be clear.

The resolution, I suppose, is to use stories—but not in a way that hides the potential weaknesses of a scientific argument. Instead, harness the power of storytelling to make it easier for readers to spot the flaws.

The point is that there are two dimensions to scientific communication:

1. The medium of expression. Storytelling can be more effective than a dry sequence of hypothesis, data, results, conclusion.

2. The goal of communication. Instead of presenting a wrapped package of perfection, our explanation should have lots of accessible points: readers should be able to pull the strings so the arguments can unravel, if that is possible.

P.S. More on this from Basboll here.

Coursera course on causal inference from Michael Sobel at Columbia

Here’s the description:

This course offers a rigorous mathematical survey of causal inference at the Master’s level. Inferences about causation are of great importance in science, medicine, policy, and business. This course provides an introduction to the statistical literature on causal inference that has emerged in the last 35-40 years and that has revolutionized the way in which statisticians and applied researchers in many disciplines use data to make inferences about causal relationships. We will study methods for collecting data to estimate causal relationships. Students will learn how to distinguish between relationships that are causal and non-causal; this is not always obvious. We shall then study and evaluate the various methods students can use — such as matching, sub-classification on the propensity score, inverse probability of treatment weighting, and machine learning — to estimate a variety of effects — such as the average treatment effect and the effect of treatment on the treated. At the end, we discuss methods for evaluating some of the assumptions we have made, and we offer a look forward to the extensions we take up in the sequel to this course.

Last year Bob Carpenter and I started to put together a Coursera course on Bayesian statistics and Stan, but we ended up deciding we weren’t quite ready to do so. In any case, causal inference is a (justly) popular topic, and I expect that this online version of Michael’s course at Columbia will be good.

John von Neumann (3) vs. Albert Brooks; Paul Erdos advances

We had some good arguments on both sides yesterday.

For Erdos, from Diana Senechal:

From an environmental perspective, Erdos is the better choice; his surname is an adjectival form of the Hungarian erdő, “forest,” whereas “Carson” clearly means “son of a car.” Granted, the son of a car, being rebellious and all, might prove especially attentive to the quality of the air, but we have no evidence of this.

On the other side Stephen Oliver had an excellent practical point:

Johnny Carson, because if Erdos gave a talk it would be overrun by mathematicians trying to get a paper with him.

But I had to call it for Erdos after this innovative argument from Ethan Bolker, who said, “I have a good argument for Erdos but will save it for a later round. If he loses this one you’ll never know . . .” I think you can only use that ploy once ever—but he used it!

Our next bout features two people who changed their own names. In one corner, one of the most brilliant mathematicians of all time, but a bit of a snob who enjoyed hobnobbing with government officials and apparently added “von” to his name to make himself sound more upper-class. In the other corner, a very funny man who goes by “Brooks” because he didn’t feel like going through life with the name Albert Einstein.

From what I’ve read about von Neumann, I find him irritating and a bit of a braggart. But, if we want to go negative, we can get on Brooks’s case for not fulfilling his early comedic promise. So maybe we should be looking for positive things to say about these two guys.

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

How post-hoc power calculation is like a shit sandwich

Damn. This story makes me so frustrated I can’t even laugh. I can only cry.

Here’s the background. A few months ago, Aleksi Reito (who sent me the adorable picture above) pointed me to a short article by Yanik Bababekov, Sahael Stapleton, Jessica Mueller, Zhi Fong, and David Chang in Annals of Surgery, “A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science,” which contained some reasonable ideas but also made a common and important statistical mistake.

I was bothered to see this mistake in an influential publication. Instead of blogging it, this time I decided to write a letter to the journal, which they pretty much published as is.

My letter went like this:

An article recently published in the Annals of Surgery states: “as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%---with the given sample size and effect size observed in that study”. This would be a bad idea. The problem is that the (estimated) effect size observed in a study is noisy, especially so in the sorts of studies discussed by the authors. Using estimated effect size can give a terrible estimate of power, and in many cases can lead to drastic overestimates of power . . . The problem is well known in the statistical and medical literatures . . . That said, I agree with much of the content of [Bababekov et al.] . . . I appreciate the concerns of [Bababekov et al.] and I agree with their goals and general recommendations, including their conclusion that “we need to begin to convey the uncertainty associated with our studies so that patients and providers can be empowered to make appropriate decisions.” There is just a problem with their recommendation to calculate power using observed effect sizes.

I was surgically precise, focusing on the specific technical error in their paper and separating this from their other recommendations.

And the letter was published, with no hassle! Not at all like my frustrating experience with the American Sociological Review.

So I thought the story was over.

But then my blissful slumber was interrupted when I received another email from Reito, pointing to a response in that same journal by Bababekov and Chang to my letter and others. Bababekov and Chang write:

We are greatly appreciative of the commentaries regarding our recent editorial . . .

So far, so good! But then:

We respectfully disagree that it is wrong to report post hoc power in the surgical literature. We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. . . . We also respectfully disagree that knowing the power after the fact is not useful in surgical science.

No! My problem is not that their recommended post-hoc power calculations are “mathematically redundant”; my problem is that their recommended calculations will give wrong answers because they are based on extremely noisy estimates of effect size. To put it in statistical terms, their recommended method has bad frequency properties.

I completely agree with the authors that “knowing the power after the fact” can be useful, both in designing future studies and in interpreting existing results. John Carlin and I discuss this in our paper. But the authors’ recommended procedure of taking a noisy estimate and plugging it into a formula does not give us “the power”; it gives us a very noisy estimate of the power. Not the same thing at all.

Here’s an example. Suppose you have 200 patients: 100 treated and 100 control, and post-operative survival is 94 for the treated group and 90 for the controls. Then the raw estimated treatment effect is 0.04 with standard error sqrt(0.94*0.06/100 + 0.90*0.10/100) = 0.04. The estimate is just one s.e. away from zero, hence not statistically significant. And the crudely estimated post-hoc power, using the normal distribution, is approximately 16% (the probability of observing an estimate at least 2 standard errors away from zero, conditional on the true parameter value being 1 standard error away from zero). But that’s a noisy, noisy estimate! Consider that effect sizes consistent with these data could be anywhere from -0.04 to +0.12 (roughly), hence absolute effect sizes could be roughly between 0 and 3 standard errors away from zero, corresponding to power being somewhere between 5% (if the true population effect size happened to be zero) and roughly 85% (if the true effect size were three standard errors from zero). That’s what I call noisy.
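The arithmetic in this example can be reproduced in a few lines of Python. This is a sketch under the same normal approximation, using a 2-standard-error threshold as in the text. The function name `power` and its arguments are my own choices for illustration, not anything from the letter or from Bababekov et al.

```python
from math import sqrt
from statistics import NormalDist

n_treated = n_control = 100
p_treated, p_control = 0.94, 0.90

# Raw estimate of the treatment effect and its standard error.
estimate = p_treated - p_control
se = sqrt(
    p_treated * (1 - p_treated) / n_treated
    + p_control * (1 - p_control) / n_control
)


def power(true_effect_in_se, threshold=2.0):
    """Probability that the estimate lands more than `threshold` standard
    errors from zero, given a true effect `true_effect_in_se` standard
    errors from zero (normal approximation, two-sided)."""
    z = NormalDist()
    upper = 1 - z.cdf(threshold - true_effect_in_se)
    lower = z.cdf(-threshold - true_effect_in_se)
    return upper + lower


print(f"estimate = {estimate:.3f}, se = {se:.3f}")
# The plug-in "post-hoc power" takes the true effect to be the observed one.
print(f"plug-in power: {power(estimate / se):.2f}")
# Power across the range of true effects roughly consistent with the data.
for k in (0, 1, 2, 3):
    print(f"true effect = {k} s.e.: power = {power(k):.2f}")
```

The spread of the last four numbers is the point: the data cannot distinguish among true effects that imply very different power.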

Here’s an analogy that might help. Suppose someone offers me a shit sandwich. I’m not gonna want to eat it. My problem is not that it’s a sandwich, it’s that it’s filled with shit. Give me a sandwich with something edible inside; then we can talk.

I’m not saying that the approach that Carlin and I recommend—performing design analysis using substantively-based effect size estimates—is trivial to implement. As Bababekov and Chang write in their letter, “it would be difficult to adapt previously reported effect sizes to comparative research involving a surgical innovation that has never been tested.”

Fair enough. It’s not easy, and it requires assumptions. But that’s the way it works: if you want to make a statement about power of a study, you need to make some assumption about effect size. Make your assumption clearly, and go from there. Bababekov and Chang write: “As such, if we want to encourage the reporting of power, then we are obliged to use observed effect size in a post hoc fashion.” No, no, and no. You are not obliged to use a super-noisy estimate. You were allowed to use scientific judgment when performing that power analysis you wrote for your grant proposal, before doing the study, and you’re allowed to use scientific judgment when doing your design analysis, after doing the study.

The whole thing is so frustrating.

Look. I can’t get mad at the authors of this article. They’re doing their best, and they have some good points to make. They’re completely right that authors and researchers should not “misinterpret P > 0.05 to mean comparison groups are equivalent or ‘not different.'” This is an important point that’s not well understood; indeed my colleagues and I recently wrote a whole paper on the topic, actually in the context of a surgical example. Statistics is hard. The authors of this paper are surgeons and health policy researchers, not statisticians. I’m a statistician and I don’t know anything about surgery; no reason to expect these two surgeons to know anything about statistics. But, it’s still frustrating.

P.S. After writing the above post a few months ago, I submitted it (without some features such as the “shit sandwich” line) as a letter to the editor of the journal. To its credit, the journal is publishing the letter. So that’s good.

Johnny Carson (2) vs. Paul Erdos; Babe Didrikson Zaharias advances

OK, our last matchup wasn’t close. Adam Schiff (unseeded in the “people whose name ends in f” category) had the misfortune to go against the juggernaut that was Babe Didrikson Zaharias (seeded #2 in the GOATs category). Committee chair or not, the poor guy never had a chance. As Diana Senechal wrote, “From an existential standpoint, If Schiff won this match, life would be absurd. Perhaps it is, but I still look for interludes of logic and meaning: for instance, right here. Let this battle be such an interlude, and let Babe claim the victory she deserves.”

Next up is Johnny Carson, #2 in the TV personalities category and arguably the best talk-show host ever, against Paul Erdos, one of the weirdest and most prolific mathematicians of all time. I’m guessing that the commenters here will side with Erdos, but I dunno. From everything I’ve read about Erdos, he’s always seemed irritating to me. In some ways, I can relate to the guy: like me, he liked to solve research problems with lots of different collaborators, but there’s something about all those indulgent descriptions of the guy that rubs me the wrong way. In contrast, Johnny Carson is just brilliant. But, in any case, it’s up to you, not me, to give the most compelling arguments on both sides.

Remember, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

This is one offer I can refuse

OK, so this came in the email today:

Dear Contributor,


[978 1 78347 485 1]

Regular price: $455.00

Special Contributor price: $113.75 (plus shipping)

We are pleased to announce the publication of the above title. Due to the limited print run of this collection and the high number of contributing authors, we are unable to offer a complimentary copy. In recognition of your contribution, however, we are delighted to offer you one copy of this title at a discount of 75% off the list price (excluding postage and packing). Please note that these purchases should be for personal use and not for resale.

If you would like to take advantage of this offer, please visit our website at the link below. To receive your 75% discount on one copy of this title enter ( FRANZESE75 ) in the discount code field during checkout.


You can also purchase further copies of this title and other titles from the Elgar list at a 50% author discount.

As a thank you to our authors and contributors, Edward Elgar Publishing offers a 50% discount on all titles. Orders must be prepaid and are for personal use only. To take advantage of this offer at any time, please enter the discount code ‘EEAUTHOR’ on the payment page of our website: Please note only one discount code is allowed per order. Any further questions please feel free to contact us.

With best wishes,

Research Collections Department
Edward Elgar Publishing

Independent Publisher of the Year 2017- Independent Publishers Guild
Academic & Professional Publisher of the Year 2017 & 2014 – Independent Publishers Guild
Digital Publisher of the Year 2015 – Independent Publishers Guild
Independent, Academic, Educational and Professional Publisher of the Year 2014 & 2013 – The Bookseller

Wow, a mere $113.75 (plus shipping), huh? I guess that’s what it takes to be named Digital Publisher of the Year.

Also, I just love it that this extremely-low price of $113.75 excludes “postage and packing.” No free lunches here, no siree!

New blog hosting!

Hi all. We’ve been having some problems with the blog caching, so that people were seeing day-old versions of the posts and comments. We moved to a new host and a new address, and all should be better.

Still a couple glitches, though. Right now it doesn’t seem to be possible to comment. We hope to get that fixed soon (unfortunately it’s Friday evening and I don’t know if anyone’s gonna look at it over the weekend), will let you know when comments work again. Regularly scheduled posts will continue to appear.

Comments work too now!

NYC Meetup Thursday: Under the hood: Stan’s library, language, and algorithms

I (Bob, not Andrew!) will be doing a meetup talk this coming Thursday in New York City. Here’s the link with registration and location and time details (summary: pizza unboxing at 6:30 pm in SoHo):

After summarizing what Stan does, this talk will focus on how Stan is engineered. The talk follows the organization of the Stan software.

Stan math library: differentiable math and stats functions, template metaprograms to manage constants and vectorization, matrix derivatives, and differential equation derivatives.

Stan language: block structure and execution, unconstraining variable transforms and automatic Jacobians, transformed data, parameters, and generated quantities execution.

Stan algorithms: Hamiltonian Monte Carlo and the no-U-turn sampler (NUTS), automatic differentiation variational inference (ADVI).

Stan infrastructure and process: Time permitting, I can also discuss Stan’s developer process, how the code repositories are organized, and the code review and continuous integration process for getting new code into the repository
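As a small illustration of the “unconstraining variable transforms and automatic Jacobians” item above, here is a sketch in plain Python. It is not Stan’s implementation. It assumes a hypothetical positive parameter sigma with an Exponential(1) density, maps it to the unconstrained scale with u = log(sigma), and checks numerically that the density on the unconstrained scale integrates to one only when the log Jacobian term is included.

```python
import math


def log_density_constrained(sigma):
    """Log density of a hypothetical Exponential(1) distribution on sigma > 0."""
    return -sigma


def log_density_unconstrained(u):
    """Log density of u = log(sigma), including the Jacobian adjustment.

    Since sigma = exp(u), |d sigma / d u| = exp(u), so the log Jacobian is u.
    """
    sigma = math.exp(u)
    return log_density_constrained(sigma) + u


def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation to the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


with_jacobian = integrate(
    lambda u: math.exp(log_density_unconstrained(u)), -30.0, 10.0
)
without_jacobian = integrate(
    lambda u: math.exp(log_density_constrained(math.exp(u))), -30.0, 10.0
)

print(f"with Jacobian:    {with_jacobian:.4f}")
print(f"without Jacobian: {without_jacobian:.4f}")
```

Dropping the Jacobian term gives a function of u that is not a proper density for the transformed variable, which is why the adjustment has to be applied whenever sampling happens on the unconstrained scale.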

Becker on Bohm on the important role of stories in science

Tyler Matta writes:

During your talk last week, you spoke about the role of stories in scientific theory. On page 104 of What Is Real: The Unfinished Quest for the Meaning of Quantum Physics, Adam Becker talks about stories and scientific theory in relation to alternative conceptions of quantum theory, particularly between Bohm’s pilot-wave interpretation and Bohr’s Copenhagen interpretation:

The picture of the world that comes along with a physical theory is an important component of that theory. Two theories that are identical in their predictions can have wildly different pictures of the world… and those pictures, in turn, determine a lot about the daily practice of science… The story that comes along with a scientific theory influences the experiments that scientists choose to perform, the way new evidence is evaluated, and ultimately, guides the search for new theories as well.

Anyways, I just wanted to share the passage as I think Becker has done a nice job of connecting the two.

A lot of things came up in my talk, but at the beginning I did discuss how in science we learn from stories. For researchers, stories are not just a way for us to vividly convey our findings to others. Stories also frame our understanding of the world. I discussed the idea of stories being anomalous and immutable (see second link above for more on this); the above Becker quote is interesting in that it captures the importance of story-like structures in our understanding as well as in our communication.

Babe Didrikson Zaharias (2) vs. Adam Schiff; Sid Caesar advances

And our noontime competition continues . . .

We had some good arguments on both sides yesterday.

Jonathan writes:

In my experience, comedians are great when they’re on-stage and morose and unappealing off-stage. Sullivan, on the other hand, was morose and unappealing on-stage, and witty and charming off-stage, or so I’ve heard. This comes down, then, to deciding whether the speaker treats the seminar as a stage or not. I don’t think Sullivan would, because it’s not a “rilly big shew.”

That’s some fancy counterintuitive reasoning: Go with Sullivan because he won’t take it seriously so his pleasant off-stage personality will show up.

On the other hand, Zbicyclist goes with the quip:

Your Show of Shows -> Your Seminar of Seminars.

Render unto Caesar.

I like it. Sid advances.

For our next contest, things get more interesting. In one corner, the greatest female athlete of all time, an all-sport trailblazer. In the other, the chairman of the United States House Permanent Select Committee on Intelligence, who’s been in the news lately for his investigation of Russian involvement in the U.S. election. He knows all sorts of secrets.

If the seminar’s in the statistics department, Babe, no question. For the political science department, it would have to be Adam. But this is a university-wide seminar (inspired by this Latour-fest, remember?), so I think they both have a shot.

MRP (multilevel regression and poststratification; Mister P): Clearing up misunderstandings about

Someone pointed me to this thread where I noticed some issues I’d like to clear up:

David Shor: “MRP itself is like, a 2009-era methodology.”

Nope. The first paper on MRP was from 1997. And, even then, the component pieces were not new: we were just basically combining two existing ideas from survey sampling: regression estimation and small-area estimation. It would be more accurate to call MRP a methodology from the 1990s, or even the 1970s.

Will Cubbison: “that MRP isn’t a magic fix for poor sampling seems rather obvious to me?”

Yep. We need to work on both fronts: better data collection and better post-sampling adjustment. In practice, neither alone will be enough.

David Shor: “2012 seems like a perfect example of how focusing on correcting non-response bias and collecting as much data as you can is going to do better than messing around with MRP.”

There’s a misconception here. “Correcting non-response bias” is not an alternative to MRP; rather, MRP is a method for correcting non-response bias. The whole point of the “multilevel” (more generally, “regularization”) in MRP is that it allows us to adjust for more factors that could drive nonresponse bias. And of course we used MRP in our paper where we showed the importance of adjusting for non-response bias in 2012.

And “collecting as much data as you can” is something you’ll want to do no matter what. Yair used MRP with tons of data to understand the 2018 election. MRP (or, more generally, RRP) is a great way to correct for non-response bias using as much data as you can.

Also, I’m not quite clear what was meant by “messing around” with MRP. MRP is a statistical method. We use it, we don’t “mess around” with it, any more than we “mess around” with any other statistical method. Any method for correcting non-response bias is going to require some “messing around.”

In short, MRP is a method for adjusting for nonresponse bias and data sparsity to get better survey estimates. There are other ways of getting to basically the same answer. It’s important to adjust for as many factors as possible and, if you’re going for small-area estimation with sparse data, that you use good group-level predictors.
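Here is a toy sketch of the poststratification half of the idea, in Python. All the numbers are invented, and the shrinkage step is a deliberately crude stand-in for the multilevel regression (a real analysis would fit a multilevel model with group-level predictors). The constant `PRIOR_STRENGTH` is a hypothetical tuning value, not something estimated from data.

```python
# Hypothetical cells: (respondents, "yes" answers, population count).
# The sample badly over-represents the "old" cell relative to the population.
cells = {
    "young": (10, 7, 4000),
    "middle": (40, 20, 3500),
    "old": (150, 60, 2500),
}

total_n = sum(n for n, _, _ in cells.values())
total_yes = sum(yes for _, yes, _ in cells.values())
total_pop = sum(pop for _, _, pop in cells.values())

# Unadjusted sample proportion.
raw_mean = total_yes / total_n


def poststratify(cell_estimates):
    """Population-weighted average of per-cell estimates."""
    return sum(cell_estimates[c] * cells[c][2] for c in cells) / total_pop


# Classical poststratification: use each cell's raw sample proportion.
cell_means = {c: yes / n for c, (n, yes, _) in cells.items()}

# Stand-in for partial pooling: shrink each cell toward the overall mean,
# with more shrinkage in cells that have few respondents.
PRIOR_STRENGTH = 20  # hypothetical; acts like 20 pseudo-respondents
pooled_means = {
    c: (yes + PRIOR_STRENGTH * raw_mean) / (n + PRIOR_STRENGTH)
    for c, (n, yes, _) in cells.items()
}

ps_estimate = poststratify(cell_means)
pooled_ps_estimate = poststratify(pooled_means)

print(f"raw sample mean:               {raw_mean:.3f}")
print(f"poststratified, no pooling:    {ps_estimate:.3f}")
print(f"poststratified, with pooling:  {pooled_ps_estimate:.3f}")
```

With only ten respondents in the largest population cell, the unpooled estimate leans heavily on a noisy number. Pooling trades some of that noise for bias toward the overall mean, which is where good group-level predictors earn their keep.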

MRP is a 1970s-era method that still works. That’s fine. Least squares regression is a 1790s-era method, and it still works too! In both cases, we continue to do research to improve and better understand what we’re doing.

Ed Sullivan (3) vs. Sid Caesar; DJ Jazzy Jeff advances

Yesterday’s battle (Philip Roth vs. DJ Jazzy Jeff) was pretty low-key. It seems that this blog isn’t packed with fans of ethnic literature or hip-hop. Nobody in comments even picked up on my use of the line, “Does anyone know these people? Do they exist or are they spooks?” Isaac gave a good argument in favor of Roth: “Given how often Uncle Phil threw DJ Jazzy Jeff out of the house, it seems like he should win here,” but I’ll have to give it to Jazz, based on Jrc’s comment: “From what I hear, Roth was only like the 14th coolest Jew at Weequahic High School (which, by my math, makes him about the 28th coolest kid there). And we all know DJ Jazzy Jeff was the second coolest kid at Bel-Air Academy.” Good point.

Our next contest features two legendary TV variety show hosts who, at the very least, can tell first-hand stories about Elvis Presley, the Beatles, Mel Brooks, Woody Allen, and many others. Should be fun.

The full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Reproducibility and Stan

Aki prepared these slides which cover a series of topics, starting with notebooks, open code, and reproducibility of code in R and Stan; then simulation-based calibration of algorithms; then model averaging and prediction. Lots to think about here: there are many aspects to reproducible analysis and computation in statistics.

Philip Roth (4) vs. DJ Jazzy Jeff; Jim Thorpe advances

For yesterday’s battle (Jim Thorpe vs. John Oliver), I’ll have to go with Thorpe. We got a couple arguments in Oliver’s favor—we’d get to hear him say “Whot?”, and he’s English—but for Thorpe we heard a lot more, including his uniqueness as greatest athlete of all time, and that we could save money on the helmet if that were required. We also got the following bad reason: “the chance to hear him say, ‘I’ve been asked to advise those of you who are following this talk on social media, whatever that means, to use “octothorpe talktothorpe.”‘” Even that bad reason ain’t so bad, also it’s got 3 levels of quotation nesting, which counts for something right there. What iced it for Thorpe was this comment from Tom: “Seeing as he could do everything better than everyone else, just by giving it a go, he would surely give an incredible seminar.”

And for our next contest, it’s the Bard of Newark vs. a man who’s only in this contest because it was hard for me to think of 8 people whose name ended in f, whose entire fame comes from the decades-old phrase, “Fresh Prince and DJ Jazzy Jeff.” So whaddya want: riffs on Anne Frank and suburban rabbis, or some classic 80s beats? I dunno. I think Roth would be much more entertaining when question time comes along, but he can’t scratch.

Does anyone know these people? Do they exist or are they spooks?

The full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

“The Book of Why” by Pearl and Mackenzie

Judea Pearl and Dana Mackenzie sent me a copy of their new book, “The book of why: The new science of cause and effect.”

There are some things I don’t like about their book, and I’ll get to that, but I want to start with a central point of theirs with which I agree strongly.

Division of labor

A point that Pearl and Mackenzie make several times, even if not quite in this language, is that there’s a division of labor between qualitative and quantitative modeling.

The models in their book are qualitative, all about the directions of causal arrows. Setting aside any problems I have with such models (I don’t actually think the “do operator” makes sense as a general construct, for reasons we’ve discussed in various places on this blog from time to time), the point is that these are qualitative, on/off statements. They’re “if-then” statements, not “how much” statements.

Statistical inference and machine learning focus on the quantitative: we model the relationship between measurements and the underlying constructs being measured; we model the relationships between different quantitative variables; we have time-series and spatial models; we model the causal effects of treatments and we model treatment interactions; and we model variation in all these things.

Both the qualitative and the quantitative are necessary, and I agree with Pearl and Mackenzie that typical presentations of statistics, econometrics, etc., can focus way too strongly on the quantitative without thinking at all seriously about the qualitative aspects of the problem. It’s usually all about how to get the answer given the assumptions, and not enough about where the assumptions come from. And even when statisticians write about assumptions, they tend to focus on the most technical and least important ones, for example in regression focusing on the relatively unimportant distribution of the error term rather than the much more important concerns of validity and additivity.

If all you do is set up probability models, without thinking seriously about their connections to reality, then you’ll be missing a lot, and indeed you can make major errors in causal reasoning, as James Heckman, Donald Rubin, Judea Pearl, and many others have pointed out. And indeed Heckman, Rubin, and Pearl have (each in their own way) advocated for substantive models, going beyond data description to latch on to underlying structures of interest.

Pearl and Mackenzie’s book is pretty much all about qualitative models; statistics textbooks such as my own have a bit on qualitative models but focus on the quantitative nuts and bolts. We need both.

Judea Pearl, like Jennifer Hill and Frank Sinatra, is right that “you can’t have one without the other”: If you think you’re working with a purely qualitative model, it turns out that, no, you’re actually making lots of data-based quantitative decisions about which effects and interactions you decide are real and which ones you decide are not there. And if you think you’re working with a purely quantitative model, no, you’re really making lots of assumptions (causal or otherwise) about how your data connect to reality.

Did she really live 122 years?

Even more famous than “the Japanese dude who won the hot dog eating contest” is “the French lady who lived to be 122 years old.”

But did she really?

Paul Campos points us to this post, where he writes:

Here’s a statistical series, laying out various points along the 100 longest known durations of a particular event, of which there are billions of known examples. The series begins with the 100th longest known case:

100th: 114 years 93 days

90th: 114 years 125 days

80th: 114 years 182 days

70th: 114 years 208 days

60th: 114 years 246 days

50th: 114 years 290 days

40th: 115 years 19 days

30th: 115 years 158 days

20th: 115 years 319 days

10th: 116 years 347 days

9th: 117 years 27 days

8th: 117 years 81 days

7th: 117 years 137 days

6th: 117 years 181 days

5th: 117 years 230 days

4th: 117 years 248 days

3rd: 117 years 260 days

Based on this series, what would you expect the second-longest and the longest known durations of the event to be?

These are the maximum verified — or as we’ll see “verified” — life spans achieved by human beings, at least since it began to be possible to measure this with some loosely acceptable level of scientific accuracy . . .

Given the mortality rates observed between ages 114 and 117 in the series above, it would be somewhat surprising if anybody had actually reached the age of 118. Thus it’s very surprising to learn that #2 on the list, an American woman named Sarah Knauss, lived to be 119 years and 97 days. That seems like an extreme statistical outlier, and it makes me wonder if Knauss’s age at death was recorded correctly (I know nothing about how her age was verified).

But the facts regarding the #1 person on the list — a French woman named Jeanne Calment who was definitely born in February of 1875, and was determined to have died in August of 1997 by what was supposedly all sorts of unimpeachable documentary evidence, after reaching the astounding age of 122 years, 164 days — are more than surprising. . . .

A Russian mathematician named Nikolay Zak has just looked into the matter, and concluded that, despite the purportedly overwhelming evidence that made it certain beyond a reasonable doubt that Calment reached such a remarkable age, it’s actually quite likely, per his argument, that Jeanne Calment died in the 1930s, and the woman who for more than 20 years researchers all around the world considered to be the oldest person whose age had been “conclusively” documented was actually her daughter, Yvonne. . . .

I followed the link and read Zak’s article, and . . . I have no idea.

The big picture is that, after age 110, the probability of dying is about 50% per year. For reasons we’ve discussed earlier, I don’t think we should take this constant hazard rate too seriously. But if we go with that, and we start with 100 people reaching a recorded age of 114, we’d expect about 50 to reach 115, 25 to reach 116, 12 to reach 117, 6 to reach 118, 3 to reach 119, etc. . . . so 122 is not at all out of the question. So I don’t really buy Campos’s statistical argument, which all seems to turn on there being a lot of people who reached 117 but not 118, which in turn is just a series of random chances that can just happen.
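Here is that arithmetic as a few lines of Python. This is only a sketch under the constant-hazard assumption stated above (which, again, I don’t think we should take too seriously), plus the further assumption that the 100 people die independently; the function names are mine.

```python
# Expected survivors under a constant yearly hazard, starting from a cohort
# that has reached a given age. Assumes independent deaths.

def expected_survivors(start_count, start_age, annual_death_prob, max_age):
    """Expected number of the starting cohort still alive at each birthday."""
    return {
        age: start_count * (1 - annual_death_prob) ** (age - start_age)
        for age in range(start_age, max_age + 1)
    }

def prob_at_least_one_reaches(start_count, start_age, annual_death_prob, target_age):
    """Probability that at least one member of the cohort reaches target_age."""
    p_single = (1 - annual_death_prob) ** (target_age - start_age)
    return 1 - (1 - p_single) ** start_count

table = expected_survivors(start_count=100, start_age=114,
                           annual_death_prob=0.5, max_age=122)
for age, count in table.items():
    print(f"expected to reach {age}: {count:.2f}")

p_122 = prob_at_least_one_reaches(100, 114, 0.5, 122)
print(f"P(at least one of the 100 reaches 122) = {p_122:.2f}")
```

The expected count at 122 is well below 1, but the chance that at least one of the 100 gets there is far from negligible, which is the sense in which 122 is not out of the question.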

Although I have nothing to add to the specific question of Jeanne or Yvonne Calment, I do have some general thoughts on this story:

– It’s stunning to me how these paradigm shifts come up, where something that everybody believes is true, is questioned. I’ve been vaguely following discussions about the maximum human lifespan (as in the link just above), and the example of Calment comes up all the time, and I’d never heard anyone suggest her story might be fake. According to Zak, there had been some questioning, but it didn’t go far enough for me to have heard about it.

Every once in a while we hear about these exciting re-thinkings of the world. Sometimes it seems that they turn out to be right (for example, that story about the asteroid collision that indirectly killed the dinosaurs; or, since we’re on the topic, the story that modern birds are dinosaurs’ descendants). Other times these new ideas seem to have been dead ends (for example, the claim that certain discrepancies in sex ratios could be explained by hepatitis). As Joseph Delaney discusses in the context of the latter example, sometimes an explanation can be too convincing, in some way. The challenge is to value paradigm-busting ideas without falling in love with them.

– The Calment example is a great illustration of Bayesian inference. Bayesian reasoning should lead us to be skeptical of Calment’s claimed age. Indeed, as Zak notes, Bayesian reasoning should lead us to be skeptical of any claim on the tail of any distribution. Those 116-year-olds and 117-year-olds on Campos’s list above: we should be skeptical of each of them too. It’s just simple probabilistic reasoning: there’s some baseline probability that anyone’s claimed age will be fake, and if the distribution of fake ages has wider tails than the distribution of real ages, then an extreme claimed age is some evidence of an error. The flip side is that there must be some extreme ages out there that we haven’t heard about.

– The above discussion also leads to a sort of moral hazard of Bayesian inference: If we question the extreme reported ages without correspondingly researching other ages, we’ll be shrinking our distribution. As Phil and I discuss in our paper, All maps of parameter estimates are misleading, there’s no easy solution to this problem, but we at least should recognize it.
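The Bayesian point above, that an extreme claimed age is some evidence of an error when fake ages have wider tails than real ones, can be sketched with a two-component mixture. Every number here is a made-up assumption for illustration: the 1% baseline error rate, the 50% yearly survival for real ages past 110, and the 70% yearly "survival" for erroneous records. None of them comes from Zak or from any data.

```python
# Posterior probability that a claimed age is erroneous, under a toy mixture
# of "real" and "fake" age distributions. All parameter values are hypothetical.

def geometric_pmf(years_past_110, yearly_survival):
    """P(claimed age is exactly 110 + k years) under a constant yearly hazard."""
    return (1 - yearly_survival) * yearly_survival ** years_past_110

def prob_record_is_wrong(claimed_age, prior_error=0.01,
                         real_survival=0.5, fake_survival=0.7):
    """Bayes' rule: P(error | claimed age), given a claimed age of 110 or more."""
    k = claimed_age - 110
    real = (1 - prior_error) * geometric_pmf(k, real_survival)
    fake = prior_error * geometric_pmf(k, fake_survival)
    return fake / (real + fake)

for age in (110, 114, 117, 119, 122):
    print(f"claimed age {age}: P(error) = {prob_record_is_wrong(age):.3f}")
```

Because the fake-age distribution has the heavier tail, the likelihood ratio in favor of an error grows with each additional claimed year, so the posterior probability of error rises steadily with age even from a small baseline. How fast it rises depends entirely on the assumed tails.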

P.S. Campos adds:

I hadn’t considered that the clustering at 117 is probably just random, but of course that makes sense. Calment does seem like a massive outlier, and as you say from a Bayesian perspective the fact that she’s such an outlier makes the potential holes in the validation of her age more probable than otherwise. What I don’t understand about the inheritance fraud theory is that Jeanne’s husband lived until 1942, eight years after Jeanne’s hypothesized death. It would be unusual, I think, for French inheritance law not to give a complete exemption to a surviving spouse for any inheritance tax liability (that’s the case in the legal systems I know something about), but I don’t know anything about French inheritance law.