
A. J. Liebling vs. Dorothy Parker (2); Steve Martin advances

As Dalton wrote:

On one hand, Serena knows how to handle a racket. But Steve Martin knows how to make a racket with some strings stretched taut over a frame. Are you really gonna bet against the dude who went toe-to-toe with Kermit the Frog in a racket-making duel?

Today we have an unseeded eater vs. the second-seeded wit. That said, Liebling was very witty, at least in his writing. And Parker—I don’t know how she was as an eater, but she sure knew how to drink. As did Liebling. So the two are evenly matched.

Again, here’s the bracket and here are the rules.

R fixed its default histogram bin width!

I remember hist() in R as having horrible defaults, with the histogram bars way too wide. See this discussion:

A key benefit of a histogram is that, as a plot of raw data, it contains the seeds of its own error assessment. Or, to put it another way, the jaggedness of a slightly undersmoothed histogram performs a useful service by visually indicating sampling variability. That’s why, if you look at the histograms in my books and published articles, I just about always use lots of bins.

But somewhere along the way someone fixed it. R’s histogram function now has a reasonable default, with lots of bins. (Just go into R and type hist(rnorm(100)) and you’ll see.)
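For anyone who wants to play with the effect of bin width directly outside R, here's a quick Python stand-in (numpy's histogram in place of hist(); the bin counts 5 and 30 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)  # analogue of rnorm(100) in R

# A few wide bins oversmooth; many narrow bins stay "jagged,"
# which usefully displays sampling variability.
counts_wide, _ = np.histogram(x, bins=5)
counts_narrow, _ = np.histogram(x, bins=30)
print(counts_wide)
print(counts_narrow)
```

The narrow-bin version looks jagged, and that jaggedness is the point: it shows the sampling variability in the raw data.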

I’m so happy!

P.S. When searching for my old post on histograms, I found this remark, characterizing the following bar graph:


This graph isn’t horrible—with care, you can pull the numbers off it—but it’s not set up to allow much discovery, either. This kind of graph is a little bit like a car without an engine: you can push it along and it will go where you want, but it won’t take you anywhere on its own.

Update on that study of p-hacking

Ron Berman writes:

I noticed you posted an anonymous email about our working paper on p-hacking and false discovery, but was a bit surprised that it references an early version of the paper.
We addressed the issues mentioned in the post more than two months ago in a version that has been available online since December 2018 (same link as above).

I wanted to send you a few thoughts on the post, in the hope that you will find them interesting and relevant to post on your blog as our reply, to inform your readers a little more about the paper.

These thoughts are presented in the separate section below.

The more recent analysis applies an MSE-optimal bandwidth selection procedure. Hence, we use a single bandwidth for assessing the presence of a discontinuity at a specific level of confidence in an experiment.

Less importantly, the more recent analysis uses a triangular kernel and linear regression (though we also report a traditional logistic regression analysis result for transparency and robustness).
The results have not changed much, and have partially strengthened.

With regard to the RDD charts, the visual fit indeed might not be great. But we think the fit using the MSE-optimal window width is actually good.

The section below provides more details, and I hope you will find it relevant to post on your blog.

We also of course would welcome any feedback you may have about the methods we are using in the paper, including the second part of the paper where we attempt to quantify the consequences of p-hacking on false discovery and foregone learning.

I am learning from all the feedback we receive and am constantly working to improve the paper.

More details about the blog post:

The comments by the anonymous letter writer are about an old version of the paper, and we addressed them a few months ago.

Three main concerns were expressed: The choice of six possible discontinuities, the RDD window widths, and the RDD plot showing weak visual evidence of a discontinuity in stopping behavior based on the confidence level the experiment reaches.

1. Six hypotheses

We test six different hypotheses, each positing optional stopping based on the p-value of the experiment at one of the three commonly used levels of significance/confidence in business and social science (90, 95 and 99%) for both positive and negative effects (3 × 2 = 6).
We view these as six distinct a priori hypotheses, one each for a specific form of stopping behavior, not six tests of the same hypothesis.

2. RDD window width

The December 2018 version of the paper details an RDD analysis using an MSE-optimal bandwidth linear regression with a triangular kernel.
The results (and implications) haven’t changed dramatically using the more sophisticated approach, which relies on a single window for each RDD.

We fully report all the tables in the paper. This is what the results look like (Table 5 of the paper includes the details of the bandwidth sizes, number of observations, etc.):

The linear and the bias-corrected linear models use the “sophisticated” MSE-optimal method. We also report a logistic regression analysis with the same MSE-optimal window width for transparency and to show robustness.

All the effects are reported as marginal effects to allow easy comparison.

Not much has changed in results and the main conclusion about p-hacking remains the same: A sizable fraction of experiments exhibit credible evidence of stopping when the A/B test reaches 90% confidence for a positive effect, but not at the other levels of significance typically used and not for negative effects.

3. RDD plots

With respect to the RDD charts, the fit might indeed not look great visually. But what matters for the purpose of causal identification in such a quasi-experiment, in our opinion, is more the evidence of a discontinuity at the point of interest, rather than the overall data fit.

Here is the chart with the MSE-optimal bandwidth around .895 confidence (presented as 90% to the experimenters) from the paper. Apart from the outlier at .89 confidence, we think the lines track the raw fractions rather well.
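To make the mechanics concrete, here is a toy sketch of the kind of triangular-kernel local linear RDD estimate described above. This is not the paper's code: the data are simulated, and the bandwidth is fixed by hand rather than chosen by an MSE-optimal procedure.

```python
import numpy as np

def local_linear_rdd(x, y, cutoff, h):
    """Discontinuity estimate: triangular-kernel weighted linear fits
    on each side of the cutoff, each evaluated at the cutoff."""
    def side_value(mask):
        xs, ys = x[mask], y[mask]
        w = np.clip(1.0 - np.abs(xs - cutoff) / h, 0.0, None)  # triangular kernel
        keep = w > 0
        X = np.column_stack([np.ones(keep.sum()), xs[keep] - cutoff])
        sw = np.sqrt(w[keep])
        beta, *_ = np.linalg.lstsq(X * sw[:, None], ys[keep] * sw, rcond=None)
        return beta[0]  # fitted stopping rate at the cutoff
    return side_value(x >= cutoff) - side_value(x < cutoff)

# Synthetic check: inject a known jump of 0.3 in stopping probability
# at .895 confidence, then try to recover it.
rng = np.random.default_rng(0)
conf = rng.uniform(0.85, 0.95, size=20_000)
stopped = rng.random(20_000) < (0.2 + 0.3 * (conf >= 0.895))
est = local_linear_rdd(conf, stopped.astype(float), cutoff=0.895, h=0.02)
print(est)  # close to the injected jump of 0.3
```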

I wrote that earlier post in September, hence it was based on an earlier version of the article. It’s good to hear about the update.

Serena Williams vs. Steve Martin (4); The Japanese dude who won the hot dog eating contest advances

We didn’t have much yesterday, so I went with this meta-style comment from Jesse:

I’m pulling for Kobayashi if only because the longer he’s in, the more often Andrew will have to justify describing him vs using his name. The thought of Andrew introducing the speaker as “and now, here’s that Japanese dude who won the hot dog eating contest” sounds awkward enough to prime us all for stress-eating, and who better to give us best practices/techniques?

I agree with Diana that there’s some underdog bias here, as in the real world there’d be no doubt that we’d want to hear Wilde. Indeed, if this were a serious contest we’d just have looked at the 64 names, picked Wilde right away, and ended it. But, for round 3, the Japanese dude who won the hot dog eating contest it is.

And today we have an unseeded GOAT vs. the fourth-seeded magician. Whaddya want, ground strokes or ironic jokes?

Again, here’s the bracket and here are the rules.

“Do you have any recommendations for useful priors when datasets are small?”

Someone who wishes to remain anonymous writes:

I just read your paper with Daniel Simpson and Michael Betancourt, The Prior Can Often Only Be Understood in the Context of the Likelihood, and I find it refreshing to read that “the practical utility of a prior distribution within a given analysis then depends critically on both how it interacts with the assumed probability model for the data in the context of the actual data that are observed.” I also welcome your comment about the importance of “data generating mechanism” because, for me, it is akin to selecting the “appropriate” distribution for a given response. I always make the point to the people I’m working with that we need to consider the clinical, scientific, physical and engineering principles governing the underlying phenomenon that generates the data; e.g., forces are positive quantities, particles are counts, yield is bounded between 0 and 1.

You also talk about the “big data, and small signal revolution.” In industry, however, we face the opposite problem, our datasets are usually quite small. We may have a new product, for which we want to make some claims, and we may have only 4 observations. I do not consider myself a Bayesian, but I do believe that Bayesian methods can be very helpful in industrial situations. I also read your Prior Choice Recommendations but did not find anything specific about small sample sizes. Do you have any recommendations for useful priors when datasets are small?

My quick response is that when sample size is small, or measurements are noisy, or the underlying phenomenon has high variation, then the prior distribution will become more important.

So your question is a good one!

To continue, when priors are important, you’ll have to think harder about what real prior information is available.

One way to do this is . . . and I’m sorry for being so predictable in my answer, but I’ll say it anyway . . . embed your problem in a multilevel model. You have a new product with just four observations. Fine. But this new product is the latest in a stream of products, so create a model of the underlying attributes of interest, given product characteristics and time.

Don’t think of your “prior” for a parameter as some distinct piece of information; think of it as the culmination of a group-level model.

Just like when we do Mister P: We don’t slap down separate priors for the 50 states, we set up a hierarchical model with state-level predictors, and this does the partial pooling more organically. So the choice of priors becomes something more familiar: the choice of predictors in a regression model, along with choices about how to set that predictive model up.

Even with a hierarchical model, you still might want to add priors on hyperparameters, but that’s something we do discuss a bit at that link.
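Here's a toy version of what I mean, in Python rather than Stan, with everything made up: known variances, and a simple empirical-Bayes shrinkage standing in for a full multilevel fit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up setup: 20 past products whose mean outcomes are roughly known,
# plus a new product with only 4 noisy observations.
mu_true, tau, sigma = 50.0, 5.0, 10.0
past_product_means = rng.normal(mu_true, tau, size=20)
y_new = rng.normal(rng.normal(mu_true, tau), sigma, size=4)

# Group-level summaries from the past products (treated as known here;
# a real multilevel model would estimate them jointly).
mu_hat = past_product_means.mean()
tau2_hat = past_product_means.var(ddof=1)

# Partial pooling: precision-weighted compromise between the group mean
# and the new product's own small-sample mean.
n = len(y_new)
w = (n / sigma**2) / (n / sigma**2 + 1.0 / tau2_hat)
pooled_estimate = w * y_new.mean() + (1 - w) * mu_hat
print(pooled_estimate)
```

The "prior" for the new product is just the group-level model's prediction; the four new observations pull the estimate away from it only as far as their precision warrants.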

P-hacking in study of “p-hacking”?

Someone who wishes to remain anonymous writes:

This paper [“p-Hacking and False Discovery in A/B Testing,” by Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte] ostensibly provides evidence of “p-hacking” in online experimentation (A/B testing) by looking at the decision to stop experiments right around thresholds for the platform presenting confidence that A beats B (which is just a transformation of the p-value).

It is a regression discontinuity design:

They even cite your paper [that must be this or this — ed.] against higher-order polynomials.

Indeed, the above regression discontinuity fits look pretty bad, as can be seen by imagining the scatterplots without those superimposed curves.

My correspondent continues:

The whole thing has forking paths and multiple comparisons all over it: they consider many different thresholds, then use both linear and quadratic fits with many different window sizes (not selected via standard methods), and then later parts of the paper focus only on the specifications that are the most significant (p less than 0.05, but p greater than 0.1).

Huh? Maybe he means “greater than 0.05, less than 0.1”? Whatever.

Anyway, he continues:

Example table (this is the one that looks best for them, others relegated to appendix):

So maybe an interesting tour of:
– How much optional stopping is there in industry? (Of course there is some.)
– Self-deception, ignorance, and incentive problems for social scientists
– Reasonable methods for regression discontinuity designs.

I’ve not read the paper in detail, so I’ll just repeat that I prefer to avoid the term “p-hacking,” which, to me, implies a purposeful gaming of the system. I prefer the expression “garden of forking paths” which allows for data-dependence in analysis, even without the researchers realizing it.

Also . . . just cos the analysis has statistical flaws, it doesn’t mean that the central claims of the paper in question are false. These could be true statements, even if they don’t quite have good enough data to prove them.

And one other point: There’s nothing at all wrong with data-dependent stopping rules. The problem is all in the use of p-values for making decisions. Use the data-dependent stopping rules, use Bayesian decision theory, and it all works out.
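A quick simulation (all the numbers are arbitrary) shows the p-value side of the problem: under a true null, stopping at the first look that crosses |z| > 1.96 rejects far more than 5% of the time.

```python
import numpy as np

rng = np.random.default_rng(3)

def rejects_at_some_look(n_looks=10, n_per_look=50):
    """Under a true null (mean 0, sd 1), peek after each batch and
    'stop' the moment |z| > 1.96. Returns True if any look rejects."""
    x = rng.normal(size=n_looks * n_per_look)
    for k in range(1, n_looks + 1):
        n = k * n_per_look
        if abs(x[:n].mean() * np.sqrt(n)) > 1.96:
            return True
    return False

sims = 2000
rate = sum(rejects_at_some_look() for _ in range(sims)) / sims
print(rate)  # well above the nominal 0.05
```

The inflation comes from the repeated use of a fixed p-value threshold for the stopping decision, not from the data-dependent stopping itself.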

P.S. It’s been pointed out to me that the above-linked paper has been updated and improved since when I wrote the above post last September. Not all my comments above apply to the latest version of the paper.

The Japanese dude who won the hot dog eating contest vs. Oscar Wilde (1); Albert Brooks advances

Yesterday I was going to go with this argument from Ethan:

Now I’m morally bound to use the Erdos argument I said no one would see unless he made it to this round.

Andrew will take the speaker out to dinner, prove a theorem, publish it and earn an Erdos number of 1.

But then Jan pulled in with:

If you get Erdos, he will end up staying in your own place for the next n months, and him being dead, well, let’s say it is probably not going to be pleasant.

To be honest, I don’t even think I’d like a live Erdos staying in our apartment: from what I’ve read, the guy sounds a bit irritating, the kind of person who thinks he’s charming—an attribute that I find annoying.

Anyway, who cares about the Erdos number. What I really want is a good Wansink number. Recall what the notorious food researcher wrote:

Facebook, Twitter, Game of Thrones, Starbucks, spinning class . . . time management is tough when there’s so many other shiny alternatives that are more inviting than writing the background section or doing the analyses for a paper.

Yet most of us will never remember what we read or posted on Twitter or Facebook yesterday. In the meantime, this Turkish woman’s resume will always have the five papers below.

Coauthorship is forever. Those of us with a low Wansink number will live forever in the scientific literature.

And today’s match features an unseeded eater vs. the top-seeded wit. Doesn’t seem like much of a contest for a seminar speaker, but . . . let’s see what arguments you come up with!

Again, here’s the bracket and here are the rules.

More on that horrible statistical significance grid

Regarding this horrible Table 4:

Eric Loken writes:

The clear point of your post was that p-values (and even worse the significance versus non-significance) are a poor summary of data.

The thought I’ve had lately, working with various groups of really smart and thoughtful researchers, is that Table 4 is also a model of their mental space as they think about their research and as they do their initial data analyses. It’s getting much easier to make the case that Table 4 is not acceptable to publish. But I think it’s also true that Table 4 is actually the internal working model for a lot of otherwise smart scientists and researchers. That’s harder to fix!

Good point. As John Carlin and I wrote, we think the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

Book reading at Ann Arbor Meetup on Monday night: Probability and Statistics: a simulation-based introduction

The Talk

I’m going to be previewing the book I’m in the process of writing at the Ann Arbor R meetup on Monday. Here are the details, including the working title:

Probability and Statistics: a simulation-based introduction
Bob Carpenter
Monday, February 18, 2019
Ann Arbor SPARK, 330 East Liberty St, Ann Arbor

I’ve been to a few of their meetings and I really like this meetup group—a lot more statistics depth than you often get (for example, nobody asked me how to shovel their web site into Stan to get business intelligence). There will be a gang (or at least me!) going out for food and drinks afterward.

I’m still not 100% sure about which parts I’m going to talk about, as I’ve already written 100+ pages of it. After some warmup on the basics of Monte Carlo, I’ll probably do a simulation-based demonstration of the central limit theorem, the curse of dimensionality, and some illustration of (anti-)correlation effects on MCMC, as those are nice encapsulated little case studies I can probably get through in an hour.
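For instance, the central limit theorem demo can be as simple as this sketch (a Python stand-in, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(4)

def standardized_means(n, sims=10_000):
    """Standardized means of n draws from a skewed distribution
    (exponential, which has mean 1 and sd 1)."""
    draws = rng.exponential(size=(sims, n))
    return (draws.mean(axis=1) - 1.0) * np.sqrt(n)

# As n grows, the simulated means look more and more normal,
# with mean near 0 and sd near 1 after standardizing.
for n in (1, 10, 100):
    m = standardized_means(n)
    print(n, round(float(m.mean()), 3), round(float(m.std()), 3))
```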

The Repository

I’m writing it all in bookdown and licensing it all open source. I’ll probably try to find a publisher, but I’m only going to do so if I can keep the pdf free.

I just opened up the GitHub repo so anyone can download and build it:

I’m happy to take suggestions, but please don’t start filing issues on typos, grammar, etc.—I haven’t even spell checked it yet, much less passed it by a real copy editor. When there’s a more stable draft, I’ll put up a pdf.

Paul Erdos vs. Albert Brooks; Sid Caesar advances

The key question yesterday was, can Babe Didrikson Zaharias do comedy or can Sid Caesar do sports. According to Mark Palko, Sid Caesar was by all accounts extremely physically strong. And I know of no evidence that Babe was funny. So Your Show of Shows will be going into the third round.

And now we have an intriguing contest: a famously immature mathematician who loved to collaborate, vs. an Albert Einstein who didn’t do science. Whaddya think?

Again, here’s the bracket and here are the rules.

Simulation-based statistical testing in journalism

Jonathan Stray writes:

In my recent Algorithms in Journalism course we looked at a post which makes a cute little significance-type argument that five Trump campaign payments were actually the $130,000 Daniels payoff. They summed to within a dollar of $130,000, so the simulation recreates sets of payments using bootstrapping and asks how often there’s a subset that gets that close to $130,000. It concludes “very rarely” and therefore that this set of payments was a coverup.

(This is part of my broader collection of simulation-based significance testing in journalism.)

I recreated this payments simulation in a notebook to explore this. The original simulation checks sets of ten payments, which the authors justify because “we’re trying to estimate the odds of the original discovery, which was found in a series of eight or so payments.” You get about p=0.001 that any set of ten payments gets within $1 of $130,000. But the authors also calculated p=0.1 or so if we choose from 15, and my notebook shows that this goes up rapidly to p=0.8 if you choose 20 payments.

So the inference you make depends crucially on the universe of events you use. I think of this as the denominator in the frequentist calculation. It seems like a free parameter robustness problem, and for me it casts serious doubt on the entire exercise.

My question is: Is there a principled way to set the denominator in a test like this? I don’t really see one.

I’d be much more comfortable with a fully Bayesian attempt, modeling the generation process for the entire observed payment stream with and without a Daniels payoff. Then the result would be expressed as a Bayes factor which I would find a lot easier to interpret — and this would also use all available data and require making a bunch of domain assumptions explicit, which strikes me as a good thing.

But I do still wonder if frequentist logic can answer the denominator question here. It feels like I’m bumping up against a deep issue here, but I just can’t quite frame it right.

Most fundamentally, I worry that there is no domain knowledge in this significance test. How does this data relate to reality? What are the FEC rules and typical campaign practice for what is reported and when? When politicians have pulled shady stuff in the past, how did it look in the data? We desperately need domain knowledge here. For an example of what application of domain knowledge to significance testing looks like, see Carl Bialik’s critique of statistical tests for tennis fixing.

My reply:

As Daniel Lakeland said:

A p-value is the probability of seeing data as extreme or more extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).

So . . . when a hypothesis test rejects, it’s no big deal; you’re just rejecting the hypothesis that the data were produced by a specific random number generator—which we already knew didn’t happen. But when a hypothesis test doesn’t reject, that’s more interesting: it tells us that we know so little about the data that we can’t reject the hypothesis that the data were produced by a specific random number generator.
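Lakeland's definition can be taken completely literally in code. Here's a sketch with made-up numbers: specify the random number generator, run it many times, and count.

```python
import numpy as np

rng = np.random.default_rng(7)

# The "null hypothesis" is literally a random number generator:
# 100 draws from N(0, 1), summarized by their mean.
observed_mean = 0.25  # hypothetical observed statistic
null_means = rng.normal(0, 1, size=(50_000, 100)).mean(axis=1)

# Two-sided p-value: how often does that RNG produce something
# as extreme as what we observed?
p = np.mean(np.abs(null_means) >= abs(observed_mean))
print(p)
```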

It’s funny. People are typically trained to think of rejection (low p-values) as the newsworthy event, but that’s backward.

Regarding your more general point: yes, there’s no substitute for subject-matter knowledge. And the post you linked to above is in error, when it says that a p-value of 0.001 implies that “the probability that the Trump campaign payments were related to the Daniels payoff is very high.” To make this statement is just a mathematical error.

But I do think there are some other ways of going about this, beyond full Bayesian modeling. For example, you could take the entire procedure used in this analysis, and apply it to other accounts, and see what p-values you get.
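To see the denominator sensitivity my correspondent describes, here's a toy version of the subset-sum simulation (the payment amounts are invented stand-ins, not the actual campaign data, and the subset size is capped to keep it fast):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)

# Invented stand-ins for the observed payment amounts (the real
# analysis bootstrapped actual campaign payments).
observed = rng.uniform(1_000, 90_000, size=40).round(2)

def has_close_subset(payments, target=130_000.0, tol=1.0, max_size=5):
    """Does any subset of up to max_size payments sum to within tol of target?"""
    for r in range(1, max_size + 1):
        for combo in combinations(payments, r):
            if abs(sum(combo) - target) <= tol:
                return True
    return False

def p_hat(universe_size, sims=100):
    """Fraction of bootstrap draws containing a close subset."""
    draws = (rng.choice(observed, size=universe_size, replace=True)
             for _ in range(sims))
    return sum(has_close_subset(d) for d in draws) / sims

results = {k: p_hat(k) for k in (10, 15, 20)}
print(results)
```

The estimated "p-value" grows with the universe of payments considered, which is exactly the free-parameter problem: the test's answer depends on a choice the test itself cannot justify.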

Sid Caesar vs. Babe Didrikson Zaharias (2); Jim Thorpe advances

Best comment from yesterday came from Dalton:

Jim Thorpe isn’t from Pennsylvania, and yet a town there renamed itself after him. DJ Jazzy Jeff is from Pennsylvania, and yet Will Smith won’t even return his phone calls. Until I can enjoy a cold Yuengling in Jazzy Jeff, PA it’s DJ Jumpin’ Jim for the win.

And today’s second-round bout features a comedic king versus a trailblazing athlete. I have no idea if Babe was funny or if Sid could do sports.

Again, here’s the bracket and here are the rules.

Michael Crichton on science and storytelling

Javier Benitez points us to this 1999 interview with techno-thriller writer Michael Crichton, who says:

I come before you today as someone who started life with degrees in physical anthropology and medicine; who then published research on endocrinology, and papers in the New England Journal of Medicine, and even in the Proceedings of the Peabody Museum. As someone who, after this promising beginning . . . spent the rest of his life in what is euphemistically called the entertainment business.

Scientists often complain to me that the media misunderstands their work. But I would suggest that in fact, the reality is just the opposite, and that it is science which misunderstands media. I will talk about why popular fiction about science must necessarily be sensationalistic, inaccurate, and negative.

Interesting, given that Crichton near the end of his life became notorious as a sensationalist climate change denier. But that doesn’t really come up in this particular interview, so let’s let him continue:

I’ll explain why it is impossible for the scientific method to be accurately portrayed in film. . . .

Movies are a special kind of storytelling, with their own requirements and rules. Here are four important ones:

– Movie characters must be compelled to act
– Movies need villains
– Movie searches are dull
– Movies must move

Unfortunately, the scientific method runs up against all four rules. In real life, scientists may compete, they may be driven – but they aren’t forced to work. Yet movies work best when characters have no choice. That’s why there is the long narrative tradition of contrived compulsion for scientists. . . .

Second, the villain. Real scientists may be challenged by nature, but they aren’t opposed by a human villain. Yet movies need a human personification of evil. You can’t make one without distorting the truth of science.

Third, searches. Scientific work is often an extended search. But movies can’t sustain a search, which is why they either run a parallel plotline, or more often, just cut the search short. . . .

Fourth, the matter of physical action: movies must move. Movies are visual and external. But much of the action of science is internal and intellectual, with little to show in the way of physical activity. . . .

For all these reasons, the scientific method presents genuine problems in film storytelling. I believe the problems are insoluble. . . .

This all makes sense.

Later on, Crichton says:

As for the media, I’d start using them, instead of feeling victimized by them. They may be in disrepute, but you’re not. The information society will be dominated by the groups and people who are most skilled at manipulating the media for their own ends.

Yup. And now he offers some ideas:

For example, under the auspices of a distinguished organization . . . I’d set up a service bureau for reporters. . . . Reporters are harried, and often don’t know science. A phone call away, establish a source of information to help them, to verify facts, to assist them through thorny issues. Don’t farm it out, make it your service, with your name on it. Over time, build this bureau into a kind of good housekeeping seal, so that your denial has power, and you can start knocking down phony stories, fake statistics and pointless scares immediately, before they build. . . .

Unfortunately, and through no fault of Crichton, we seem to have gotten the first of these suggestions but not the second. Scientists, universities, and journals promote the hell out of just about everything, but they aren’t so interested in knocking down phony stories. Instead we get crap like the Harvard University press office saying “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%,” or the Cornell University press office saying . . . well, if you’re a regular reader of this blog you’ll know where I’m going on this one. Distinguished organizations are promoting the phony stories, not knocking them down.

Crichton concluded:

Under the circumstances, for scientists to fret over their image seems slightly absurd. This is a great field with great talents and great power. It’s time to assume your power, and shoulder your responsibility to get your message to the waiting world. It’s nobody’s job but yours. And nobody can do it as well as you can.

Didn’t work out so well. There have been some high points, such as Freakonomics, which, for all its flaws, presented a picture of social scientists as active problem solvers. But, in many other cases, it seems that science spent much of its credibility on a bunch of short-term quests for money and fame. Too bad, seeing what happened since 1999.

As scientists, I think we should spend less time thinking about how to craft our brilliant ideas as stories for the masses, and think harder about how we ourselves learn from stories. Let’s treat our audience, our fellow citizens of the world, with some respect.

Halftime! And Jim Thorpe (1) vs. DJ Jazzy Jeff

So. Here’s the bracket so far:

Our first second-round match is the top-ranked GOAT—the greatest GOAT of all time, as it were—vs. an unseeded but appealing person whose name ends in f.

Again here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Should he go to grad school in statistics or computer science?

Someone named Nathan writes:

I am an undergraduate student in statistics and a reader of your blog. One thing that you’ve been on about over the past year is the difficulty of executing hypothesis testing correctly, and an apparent desire to see researchers move away from that paradigm. One thing I see you mention several times is to simply “model the problem directly”. I am not a masters student (yet) and am also not trained at all in Bayesian. My coursework was entirely based on classical null hypothesis testing.

From what I can gather, you mean the implementation of some kind of multi-level model. But do you also mean the fitting and usage of standard generalized linear models, such as logistic regression? I have ordered the book you wrote with Jennifer Hill on multi-level models, and I hope it will be illuminating.

On the other hand, I’m looking at going to graduate school and I will be applying this fall. My interests have diverged from classical statistics, with a larger emphasis on model building, prediction, and machine learning. To this end, would further training in statistics be appropriate? Or would it be more useful to try and get into a CS program? I still have interests in “statistics” — describing associations, but I am not so sure I am interested in being a classical theorist. What do you think?

My reply: There are lots of statistics programs that focus on applications rather than theory. Computer science departments, I don’t know how that works. If you want an applied-oriented statistics program, it could help to have a sense of what application areas you’re interested in, and also whether you’re interested in doing computational statistics, as a lot of applied work requires computational as well as methodological innovation in order to include as much relevant information as possible in your analyses.

Yakov Smirnoff advances, and Halftime!

Best argument yesterday came from Yuling:

I want to learn more about missing data analysis from the seminar so I like Harry Houdini. But Yakov Smirnoff is indeed better for this topic — both Vodka and the Soviet are treatments that guarantee everyone to be Missing Completely at Random, and as statisticians we definitely prefer Missing Completely at Random.

And now the contest is halfway done! We’re through with the first round. Second round will start tomorrow.

Global warming? Blame the Democrats.

An anonymous blog commenter sends the above graph and writes:

I was looking at the global temperature record and noticed an odd correlation the other day. Basically, I calculated the temperature trend for each presidency and multiplied by the number of years to get a “total temperature change”. If there was more than one president for a given year it was counted for both. I didn’t play around with different statistics to measure the amount of change, including/excluding the “split” years, etc. Maybe other ways of looking at it yield different results, this is just the first thing I did.

It turned out all 8 administrations who oversaw a cooling trend were Republican. There has never been a Democrat president who oversaw a cooling global temperature. Also, the top 6 warming presidencies were all Democrats.

I have no idea what it means but thought it may be of interest.

My first thought, beyond simply random patterns showing up with small N, is that thing that Larry Bartels noticed a few years ago, that in recent decades the economy has grown faster under Democratic presidents than Republican presidents. But the time scale does not work to map this to global warming. CO2 emissions, maybe, but I wouldn’t think it would show up in the global temperature so directly as that.

So I’d just describe this data pattern as “one of those things.” My correspondent writes:

I expect to hear it dismissed as a “spurious correlation”, but usually I hear that argument used for correlations that people “don’t like” (it sounds strange/ridiculous) and it is never really explained further. It seems to me if you want to make a valid argument that a correlation is “spurious” you still need to identify the unknown third factor though.

In this case I don’t know that you need to specify an unknown third factor, as maybe you can see this sort of pattern just from random numbers, if you look at enough things. Forking paths and all that. Also there were a lot of Republican presidents in the early years of this time series, back before global warming started to take off. Also, I haven’t checked the numbers in the graph myself.
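To put a rough number on "one of those things": if party were unrelated to trend, the chance that every cooling administration lands in one party is a simple hypergeometric calculation. The counts below are illustrative; I haven't checked them against the graph.

```python
from math import comb

# Illustrative counts, not verified: 25 administrations in the series,
# 14 Republican, 8 with cooling trends.
n_admin, n_rep, n_cool = 25, 14, 8

# Under random assignment of party labels,
# P(all cooling administrations are Republican):
p_all_rep = comb(n_rep, n_cool) / comb(n_admin, n_cool)
print(round(p_all_rep, 4))
```

That's small for this one pre-specified pattern, but the forking-paths point is that there are many patterns one could have noticed (top warmers, coolers, split years counted or not), and the calculation ignores the time confounding with the early, Republican-heavy part of the series entirely.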

Harry Houdini (1) vs. Yakov Smirnoff; Meryl Streep advances

Best argument yesterday came from Jonathan:

This one’s close.

Meryl Streep and Alice Waters both have 5 letters in the first name and 6 in the last name. Tie.

Both are adept at authentic accents. Tie.

Meryl has played an international celebrity cook; Alice has never played an actress. Advantage Streep.

Waters has taught many chefs; Meryl has taught no actors. Advantage Waters.

Streep went to Vassar and Yale. Waters went to Berkeley. I’m an East Coast guy, but YMMV.

Waters has the French Legion of Honor. Streep is the French Lieutenant’s Woman.

Both have won more awards than either of them can count.

So I use Sophie’s Axiom of Choice: When comparing a finite set of pairs of New Jersey Celebrities, choose the one who got into the New Jersey Hall of Fame earlier. That’s Streep, by 6 years.

And today we have the final first-round match! Who do you want to see: the top-seeded magician of all time, or an unseeded person whose name ends in f? Can a speaker escape from his own seminar? In Soviet Russia, seminar speaker watches you.

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

“Using 26,000 diary entries to show ovulatory changes in sexual desire and behavior”

Kevin Lewis points us to this research paper by Ruben Arslan, Katharina Schilling, Tanja Gerlach, and Lars Penke, which begins:

Previous research reported ovulatory changes in women’s appearance, mate preferences, extra- and in-pair sexual desire, and behavior, but has been criticized for small sample sizes, inappropriate designs, and undisclosed flexibility in analyses.

Examples of such criticism are here and here.

Arslan et al. continue:

In the present study, we sought to address these criticisms by preregistering our hypotheses and analysis plan and by collecting a large diary sample. We gathered more than 26,000 usable online self-reports in a diary format from 1,043 women, of which 421 were naturally cycling. We inferred the fertile period from menstrual onset reports. We used hormonal contraceptive users as a quasi-control group, as they experience menstruation, but not ovulation.


We found robust evidence supporting previously reported ovulatory increases in extra-pair desire and behavior, in-pair desire, and self-perceived desirability, as well as no unexpected associations. Yet, we did not find predicted effects on partner mate retention behavior, clothing choices, or narcissism. Contrary to some of the earlier literature, partners’ sexual attractiveness did not moderate the cycle shifts. Taken together, the replicability of the existing literature on ovulatory changes was mixed.

I have not looked at this paper in detail, but just speaking generally I like what they’re doing. Instead of gathering one more set of noisy data and going for the quick tabloid win (or, conversely, the so-what failed replication), they designed a study to gather high-quality data with enough granularity to allow estimation of within-person comparisons. That’s what we’ve been talkin bout!
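To make the within-person idea concrete, here is a toy sketch. The data are synthetic, and the effect size, fertile window, and scales are all invented; this is not the authors’ model. The point is just that each woman is compared against her own non-fertile days, so stable between-person differences cancel out:

```python
import random
from statistics import mean

random.seed(0)

# Synthetic diary data: per-woman baseline plus a small
# fertile-window shift (all numbers invented for illustration).
def simulate_woman(shift=0.3):
    baseline = random.gauss(3.0, 1.0)   # between-person variation
    days = []
    for day in range(28):
        fertile = 10 <= day <= 15       # crude fertile window
        desire = baseline + (shift if fertile else 0.0) + random.gauss(0, 0.5)
        days.append((fertile, desire))
    return days

# Within-person comparison: each woman serves as her own control.
def within_person_effect(women):
    diffs = []
    for days in women:
        fertile_mean = mean(d for f, d in days if f)
        other_mean = mean(d for f, d in days if not f)
        diffs.append(fertile_mean - other_mean)
    return mean(diffs)

women = [simulate_woman() for _ in range(400)]
print(round(within_person_effect(women), 2))  # recovers roughly the 0.3 shift
```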

Alice Waters (4) vs. Meryl Streep; LeBron James advances

It’s L’Bron. Only pitch for Mr. Magic was from DanC: guy actually is ultra-tall, plus grand than that non-Cav who had play’d for Miami. But Dalton brings it back for Bron:

LeBron James getting to the NBA Finals with J.R. Smith as his best supporting cast member is a more preposterous escape than anything David Blaine or Houdini did. So he’s already a better magician than Eric Antoine (who is seeded below Blaine and Houdini).

Plus, he’s featured in this (unfortunately paywalled) Teaching Statistics article, which points out the merits of graphical comparison (“Understanding summary statistics and graphical techniques to compare Michael Jordan versus LeBron James”). I love the fact that statistics cannot settle the MJ-versus-LeBron debate, precisely because it all depends on which summary statistic you choose. Just goes to show that you need to put as much thought into which dimensions you choose to check your model (graphically and numerically) as you do into constructing your model in the first place.

All stats, yah!
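Dalton’s point about summary statistics is easy to make concrete. A toy example with invented box scores for two hypothetical players, where the mean and the median crown different winners:

```python
from statistics import mean, median

# Invented per-game scoring lines for two hypothetical players
player_a = [30, 30, 30, 30, 5]    # brilliant most nights, one dud
player_b = [26, 26, 26, 26, 26]   # metronomically steady

print(mean(player_a), mean(player_b))      # 25 26 -> B wins on mean
print(median(player_a), median(player_b))  # 30 26 -> A wins on median
```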

Today it’s a cook vs. a drama star. Whaddya want, a scrumptious lunch or Soph’s option? Or ya want Silkwood? Fantastic Mr. Fox? Can’t go wrong with that lady. But you also luv that cookbook, that food, that flavor, right? You pick.

Again, full list is at this link, and instructions:

Trying to pick #1 visitor. I’m not asking for most popular, or most topical, or optimum, or most profound, or most cool, but a combination of traits.

I’ll pick a day’s victor not from on a popular tally but on amusing quips on both camps. So try to show off!