Serena Williams vs. Steve Martin (4); The Japanese dude who won the hot dog eating contest advances

We didn’t have much yesterday, so I went with this meta-style comment from Jesse:

I’m pulling for Kobayashi if only because the longer he’s in, the more often Andrew will have to justify describing him vs using his name. The thought of Andrew introducing the speaker as “and now, here’s that Japanese dude who won the hot dog eating contest” sounds awkward enough to prime us all for stress-eating, and who better to give us best practices/techniques?

I agree with Diana that there’s some underdog bias here, as in the real world there’d be no doubt that we’d want to hear Wilde. Indeed, if this were a serious contest we’d just have looked at the 64 names, picked Wilde right away, and ended it. But, for round 3, the Japanese dude who won the hot dog eating contest it is.

And today we have an unseeded GOAT vs. the fourth-seeded magician. Whaddya want, ground strokes or ironic jokes?

Again, here’s the bracket and here are the rules.

“Do you have any recommendations for useful priors when datasets are small?”

Someone who wishes to remain anonymous writes:

I just read your paper with Daniel Simpson and Michael Betancourt, The Prior Can Often Only Be Understood in the Context of the Likelihood, and I find it refreshing to read that “the practical utility of a prior distribution within a given analysis then depends critically on both how it interacts with the assumed probability model for the data in the context of the actual data that are observed.” I also welcome your comment about the importance of “data generating mechanism” because, for me, it is akin to selecting the “appropriate” distribution for a given response. I always make the point to the people I’m working with that we need to consider the clinical, scientific, physical and engineering principles governing the underlying phenomenon that generates the data; e.g., forces are positive quantities, particles are counts, yield is bounded between 0 and 1.

You also talk about the “big data, and small signal revolution.” In industry, however, we face the opposite problem: our datasets are usually quite small. We may have a new product, for which we want to make some claims, and we may have only 4 observations. I do not consider myself a Bayesian, but I do believe that Bayesian methods can be very helpful in industrial situations. I also read your Prior Choice Recommendations but did not find anything specific about small sample sizes. Do you have any recommendations for useful priors when datasets are small?

My quick response is that when sample size is small, or measurements are noisy, or the underlying phenomenon has high variation, then the prior distribution will become more important.

So your question is a good one!

To continue, when priors are important, you’ll have to think harder about what real prior information is available.

One way to do this is . . . and I’m sorry for being so predictable in my answer, but I’ll say it anyway . . . embed your problem in a multilevel model. You have a new product with just four observations. Fine. But this new product is the latest in a stream of products, so create a model of the underlying attributes of interest, given product characteristics and time.

Don’t think of your “prior” for a parameter as some distinct piece of information; think of it as the culmination of a group-level model.

Just like when we do Mister P: We don’t slap down separate priors for the 50 states, we set up a hierarchical model with state-level predictors, and this does the partial pooling more organically. So the choice of priors becomes something more familiar: the choice of predictors in a regression model, along with choices about how to set that predictive model up.

Even with a hierarchical model, you still might want to add priors on hyperparameters, but that’s something we do discuss a bit at that link.
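
To make the partial-pooling idea concrete, here is a minimal sketch, not a full multilevel model. It assumes a normal model with known measurement noise, uses made-up numbers for five earlier products and four new observations, and plugs in crude estimates of the group-level mean and spread. The function name partial_pool and all the data are hypothetical; a real analysis would fit the hierarchical model directly (for example in Stan) and include product-level predictors.

```python
from statistics import mean, variance


def partial_pool(y_new, past_means, sigma_y):
    """Shrink a small-sample mean toward the mean of earlier products.

    Assumed model (an illustration, not taken from the post):
    y_i ~ Normal(theta, sigma_y) with sigma_y known, and
    theta ~ Normal(mu, tau), where mu and tau are plug-in estimates
    computed from the earlier products' means.
    """
    n = len(y_new)
    ybar = mean(y_new)
    mu = mean(past_means)            # group-level mean
    tau2 = variance(past_means)      # between-product variance (crude plug-in)
    se2 = sigma_y ** 2 / n           # sampling variance of ybar
    weight = tau2 / (tau2 + se2)     # weight given to the new product's own data
    post_mean = weight * ybar + (1 - weight) * mu
    post_sd = (1 / (1 / tau2 + 1 / se2)) ** 0.5
    return post_mean, post_sd, weight


# Hypothetical numbers, for illustration only.
past_means = [0.62, 0.70, 0.66, 0.74, 0.68]   # five earlier products
y_new = [0.80, 0.85, 0.78, 0.89]              # four observations on the new product
post_mean, post_sd, weight = partial_pool(y_new, past_means, sigma_y=0.05)
print(f"raw mean {mean(y_new):.3f}, pooled estimate {post_mean:.3f} "
      f"(sd {post_sd:.3f}, weight on data {weight:.2f})")
```

The pooled estimate lands between the raw mean of the four observations and the average of the earlier products, and the less data you have on the new product, the more it leans on the group-level model. That is the sense in which the "prior" comes out of the model for the stream of products.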

P-hacking in study of “p-hacking”?

Someone who wishes to remain anonymous writes:

This paper [“p-Hacking and False Discovery in A/B Testing,” by Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte] ostensibly provides evidence of “p-hacking” in online experimentation (A/B testing) by looking at the decision to stop experiments right around thresholds for the platform presenting confidence that A beats B (which is just a transformation of the p-value).

It is a regression discontinuity design:

They even cite your paper [that must be this or this — ed.] against higher-order polynomials.

Indeed, the above regression discontinuity fits look pretty bad, as can be seen by imagining the scatterplots without those superimposed curves.

My correspondent continues:

The whole thing has forking paths and multiple comparisons all over it: they consider many different thresholds, then use both linear and quadratic fits with many different window sizes (not selected via standard methods), and then later parts of the paper focus only on the specifications that are the most significant (p less than 0.05, but p greater than 0.1).

Huh? Maybe he means “greater than 0.05, less than 0.1”? Whatever.

Anyway, he continues:

Example table (this is the one that looks best for them, others relegated to appendix):

So maybe an interesting tour of:
– How much optional stopping is there in industry? (Of course there is some.)
– Self-deception, ignorance, and incentive problems for social scientists
– Reasonable methods for regression discontinuity designs.

I’ve not read the paper in detail, so I’ll just repeat that I prefer to avoid the term “p-hacking,” which, to me, implies a purposeful gaming of the system. I prefer the expression “garden of forking paths” which allows for data-dependence in analysis, even without the researchers realizing it.

Also . . . just cos the analysis has statistical flaws, it doesn’t mean that the central claims of the paper in question are false. These could be true statements, even if they don’t quite have good enough data to prove them.

And one other point: There’s nothing at all wrong with data-dependent stopping rules. The problem is all in the use of p-values for making decisions. Use the data-dependent stopping rules, use Bayesian decision theory, and it all works out.
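
Here is a small simulation sketch of why a fixed p-value threshold and data-dependent stopping don't mix. It assumes an A/B test with no true difference, unit-variance noise, and an experimenter who checks a z-test after every batch and stops at the first |z| > 1.96. The batch size, number of looks, and function name are hypothetical choices for illustration, and the code does not implement the Bayesian decision analysis mentioned above.

```python
import math
import random


def false_positive_rate(n_looks, n_sims=2000, batch=20, z_crit=1.96, seed=1):
    """Share of null experiments declared 'significant' at any of n_looks peeks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Each observation is a difference between arms with true mean 0, sd 1.
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            z = total / math.sqrt(n)
            if abs(z) > z_crit:      # stop as soon as the threshold is crossed
                rejections += 1
                break
    return rejections / n_sims


one_look = false_positive_rate(n_looks=1)
many_looks = false_positive_rate(n_looks=20)
print(f"single look: {one_look:.3f}, stop at first crossing in 20 looks: {many_looks:.3f}")
```

With one look the rejection rate stays near the nominal 5%; with repeated looks it climbs well above that. The stopping rule itself is not the problem. The trouble comes from reading the threshold crossing as if it were a fixed-sample test.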

P.S. It’s been pointed out to me that the above-linked paper has been updated and improved since when I wrote the above post last September. Not all my comments above apply to the latest version of the paper.

The Japanese dude who won the hot dog eating contest vs. Oscar Wilde (1); Albert Brooks advances

Yesterday I was going to go with this argument from Ethan:

Now I’m morally bound to use the Erdos argument I said no one would see unless he made it to this round.

Andrew will take the speaker out to dinner, prove a theorem, publish it and earn an Erdos number of 1.

But then Jan pulled in with:

If you get Erdos, he will end up staying in your own place for the next n months, and him being dead, well, let’s say it is probably not going to be pleasant.

To be honest, I don’t even think I’d like a live Erdos staying in our apartment: from what I’ve read, the guy sounds a bit irritating, the kind of person who thinks he’s charming—an attribute that I find annoying.

Anyway, who cares about the Erdos number. What I really want is a good Wansink number. Recall what the notorious food researcher wrote:

Facebook, Twitter, Game of Thrones, Starbucks, spinning class . . . time management is tough when there’s so many other shiny alternatives that are more inviting than writing the background section or doing the analyses for a paper.

Yet most of us will never remember what we read or posted on Twitter or Facebook yesterday. In the meantime, this Turkish woman’s resume will always have the five papers below.

Coauthorship is forever. Those of us with a low Wansink number will live forever in the scientific literature.

And today’s match features an unseeded eater vs. the top-seeded wit. Doesn’t seem like much of a contest for a seminar speaker, but . . . let’s see what arguments you come up with!

Again, here’s the bracket and here are the rules.

More on that horrible statistical significance grid

Regarding this horrible Table 4:

Eric Loken writes:

The clear point of your post was that p-values (and even worse the significance versus non-significance) are a poor summary of data.

The thought I’ve had lately, working with various groups of really smart and thoughtful researchers, is that Table 4 is also a model of their mental space as they think about their research and as they do their initial data analyses. It’s getting much easier to make the case that Table 4 is not acceptable to publish. But I think it’s also true that Table 4 is actually the internal working model for a lot of otherwise smart scientists and researchers. That’s harder to fix!

Good point. As John Carlin and I wrote, we think the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

Book reading at Ann Arbor Meetup on Monday night: Probability and Statistics: a simulation-based introduction

The Talk

I’m going to be previewing the book I’m in the process of writing at the Ann Arbor R meetup on Monday. Here are the details, including the working title:

Probability and Statistics: a simulation-based introduction
Bob Carpenter
Monday, February 18, 2019
Ann Arbor SPARK, 330 East Liberty St, Ann Arbor

I’ve been to a few of their meetings and I really like this meetup group—a lot more statistics depth than you often get (for example, nobody asked me how to shovel their web site into Stan to get business intelligence). There will be a gang (or at least me!) going out for food and drinks afterward.

I’m still not 100% sure about which parts I’m going to talk about, as I’ve already written 100+ pages of it. After some warmup on the basics of Monte Carlo, I’ll probably do a simulation-based demonstration of the central limit theorem, the curse of dimensionality, and some illustration of (anti-)correlation effects on MCMC, as those are nice encapsulated little case studies I can probably get through in an hour.
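
For readers who haven't seen a simulation-based look at the central limit theorem, here is a minimal sketch of the general idea. It is not taken from the book. It assumes exponential(1) draws, which are strongly right-skewed, and tracks how the skewness of the sample mean shrinks as the sample size grows. The helper names are hypothetical.

```python
import random
from statistics import fmean, pstdev


def skewness(xs):
    """Sample skewness: the average cubed z-score."""
    m = fmean(xs)
    s = pstdev(xs)
    return fmean(((x - m) / s) ** 3 for x in xs)


def simulate_means(n, n_sims=5000, seed=0):
    """Means of n exponential(1) draws, repeated n_sims times."""
    rng = random.Random(seed)
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(n_sims)]


skew_by_n = {n: skewness(simulate_means(n)) for n in (1, 10, 100)}
for n, sk in skew_by_n.items():
    print(f"n = {n:3d}: skewness of the sample mean = {sk:.2f}")
```

The skewness drops toward zero as n grows, which is the central limit theorem showing up in simulation: averages of skewed draws look more and more symmetric and normal.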

The Repository

I’m writing it all in bookdown and licensing it all open source. I’ll probably try to find a publisher, but I’m only going to do so if I can keep the pdf free.

I just opened up the GitHub repo so anyone can download and build it:

I’m happy to take suggestions, but please don’t start filing issues on typos, grammar, etc.—I haven’t even spell checked it yet, much less passed it by a real copy editor. When there’s a more stable draft, I’ll put up a pdf.

Paul Erdos vs. Albert Brooks; Sid Caesar advances

The key question yesterday was, can Babe Didrikson Zaharias do comedy or can Sid Caesar do sports. According to Mark Palko, Sid Caesar was by all accounts extremely physically strong. And I know of no evidence that Babe was funny. So Your Show of Shows will be going into the third round.

And now we have an intriguing contest: a famously immature mathematician who loved to collaborate, vs. an Albert Einstein who didn’t do science. Whaddya think?

Again, here’s the bracket and here are the rules.

Simulation-based statistical testing in journalism

Jonathan Stray writes:

In my recent Algorithms in Journalism course we looked at a post which makes a cute little significance-type argument that five Trump campaign payments were actually the $130,000 Daniels payoff. They summed to within a dollar of $130,000, so the simulation recreates sets of payments using bootstrapping and asks how often there’s a subset that gets that close to $130,000. It concludes “very rarely” and therefore that this set of payments was a coverup.

(This is part of my broader collection of simulation-based significance testing in journalism.)

I recreated this payments simulation in a notebook to explore this. The original simulation checks sets of ten payments, which the authors justify because “we’re trying to estimate the odds of the original discovery, which was found in a series of eight or so payments.” You get about p=0.001 that any set of ten payments gets within $1 of $130,000. But the authors also calculated p=0.1 or so if we choose from 15, and my notebook shows that this goes up rapidly to p=0.8 if you choose 20 payments.

So the inference you make depends crucially on the universe of events you use. I think of this as the denominator in the frequentist calculation. It seems like a free parameter robustness problem, and for me it casts serious doubt on the entire exercise.

My question is: Is there a principled way to set the denominator in a test like this? I don’t really see one.

I’d be much more comfortable with a fully Bayesian attempt, modeling the generation process for the entire observed payment stream with and without a Daniels payoff. Then the result would be expressed as a Bayes factor, which I would find a lot easier to interpret, and this would also use all available data and require making a bunch of domain assumptions explicit, which strikes me as a good thing.

But I do still wonder if frequentist logic can answer the denominator question here. It feels like I’m bumping up against a deep issue here, but I just can’t quite frame it right.

Most fundamentally, I worry that there is no domain knowledge in this significance test. How does this data relate to reality? What are the FEC rules and typical campaign practice for what is reported and when? When politicians have pulled shady stuff in the past, how did it look in the data? We desperately need domain knowledge here. For an example of what application of domain knowledge to significance testing looks like, see Carl Bialik’s critique of statistical tests for tennis fixing.

My reply:

As Daniel Lakeland said:

A p-value is the probability of seeing data as extreme or more extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).

So . . . when a hypothesis test rejects, it’s no big deal; you’re just rejecting the hypothesis that the data were produced by a specific random number generator, which we already knew didn’t happen. But when a hypothesis test doesn’t reject, that’s more interesting: it tells us that we know so little about the data that we can’t reject the hypothesis that the data were produced by a specific random number generator.

It’s funny. People are typically trained to think of rejection (low p-values) as the newsworthy event, but that’s backward.

Regarding your more general point: yes, there’s no substitute for subject-matter knowledge. And the post you linked to above is in error, when it says that a p-value of 0.001 implies that “the probability that the Trump campaign payments were related to the Daniels payoff is very high.” To make this statement is just a mathematical error.

But I do think there are some other ways of going about this, beyond full Bayesian modeling. For example, you could take the entire procedure used in this analysis, and apply it to other accounts, and see what p-values you get.
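
The "denominator" issue raised above can be seen in a toy version of the simulation. The sketch below does not use the actual campaign payments and does not reproduce the original bootstrap. It assumes a made-up random number generator (whole-dollar payments, log-uniform between $500 and $60,000) and asks how often some subset of k such payments sums to within $1 of $130,000. The function names and the payment distribution are hypothetical.

```python
import math
import random


def has_subset_near(payments, target=130_000, tol=1):
    """True if some subset of positive whole-dollar payments sums to within tol of target."""
    limit = target + tol
    mask = (1 << (limit + 1)) - 1
    reachable = 1                    # bit s is set when some subset sums to s dollars
    for p in payments:
        reachable |= (reachable << p) & mask   # add p to every reachable sum
    window = ((1 << (2 * tol + 1)) - 1) << (target - tol)
    return (reachable & window) != 0


def hit_rate(n_payments, n_sims=300, seed=0):
    """Share of simulated payment sets containing a subset near the target."""
    rng = random.Random(seed)
    lo, hi = math.log(500), math.log(60_000)   # hypothetical payment range
    hits = 0
    for _ in range(n_sims):
        payments = [round(math.exp(rng.uniform(lo, hi))) for _ in range(n_payments)]
        hits += has_subset_near(payments)
    return hits / n_sims


rates = {k: hit_rate(k) for k in (10, 15, 20)}
for k, r in rates.items():
    print(f"{k} payments: share of simulations with a near-$130,000 subset = {r:.3f}")
```

The number of subsets doubles with every added payment, so the chance of a near match depends heavily on how many payments you decide are in play. The same sensitivity should show up if the whole procedure is rerun on other accounts.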

Sid Caesar vs. Babe Didrikson Zaharias (2); Jim Thorpe advances

Best comment from yesterday came from Dalton:

Jim Thorpe isn’t from Pennsylvania, and yet a town there renamed itself after him. DJ Jazzy Jeff is from Pennsylvania, and yet Will Smith won’t even return his phone calls. Until I can enjoy a cold Yuengling in Jazzy Jeff, PA it’s DJ Jumpin’ Jim for the win.

And today’s second-round bout features a comedic king versus a trailblazing athlete. I have no idea if Babe was funny or if Sid could do sports.

Again, here’s the bracket and here are the rules.

Michael Crichton on science and storytelling

Javier Benitez points us to this 1999 interview with techno-thriller writer Michael Crichton, who says:

I come before you today as someone who started life with degrees in physical anthropology and medicine; who then published research on endocrinology, and papers in the New England Journal of Medicine, and even in the Proceedings of the Peabody Museum. As someone who, after this promising beginning . . . spent the rest of his life in what is euphemistically called the entertainment business.

Scientists often complain to me that the media misunderstands their work. But I would suggest that in fact, the reality is just the opposite, and that it is science which misunderstands media. I will talk about why popular fiction about science must necessarily be sensationalistic, inaccurate, and negative.

Interesting, given that Crichton near the end of his life became notorious as a sensationalist climate change denier. But that doesn’t really come up in this particular interview, so let’s let him continue:

I’ll explain why it is impossible for the scientific method to be accurately portrayed in film. . . .

Movies are a special kind of storytelling, with their own requirements and rules. Here are four important ones:

– Movie characters must be compelled to act
– Movies need villains
– Movie searches are dull
– Movies must move

Unfortunately, the scientific method runs up against all four rules. In real life, scientists may compete, they may be driven – but they aren’t forced to work. Yet movies work best when characters have no choice. That’s why there is the long narrative tradition of contrived compulsion for scientists. . . .

Second, the villain. Real scientists may be challenged by nature, but they aren’t opposed by a human villain. Yet movies need a human personification of evil. You can’t make one without distorting the truth of science.

Third, searches. Scientific work is often an extended search. But movies can’t sustain a search, which is why they either run a parallel plotline, or more often, just cut the search short. . . .

Fourth, the matter of physical action: movies must move. Movies are visual and external. But much of the action of science is internal and intellectual, with little to show in the way of physical activity. . . .

For all these reasons, the scientific method presents genuine problems in film storytelling. I believe the problems are insoluble. . . .

This all makes sense.

Later on, Crichton says:

As for the media, I’d start using them, instead of feeling victimized by them. They may be in disrepute, but you’re not. The information society will be dominated by the groups and people who are most skilled at manipulating the media for their own ends.

Yup. And now he offers some ideas:

For example, under the auspices of a distinguished organization . . . I’d set up a service bureau for reporters. . . . Reporters are harried, and often don’t know science. A phone call away, establish a source of information to help them, to verify facts, to assist them through thorny issues. Don’t farm it out, make it your service, with your name on it. Over time, build this bureau into a kind of good housekeeping seal, so that your denial has power, and you can start knocking down phony stories, fake statistics and pointless scares immediately, before they build. . . .

Unfortunately, and through no fault of Crichton, we seem to have gotten the first of these suggestions but not the second. Scientists, universities, and journals promote the hell out of just about everything, but they aren’t so interested in knocking down phony stories. Instead we get crap like the Harvard University press office saying “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%,” or the Cornell University press office saying . . . well, if you’re a regular reader of this blog you’ll know where I’m going on this one. Distinguished organizations are promoting the phony stories, not knocking them down.

Crichton concluded:

Under the circumstances, for scientists to fret over their image seems slightly absurd. This is a great field with great talents and great power. It’s time to assume your power, and shoulder your responsibility to get your message to the waiting world. It’s nobody’s job but yours. And nobody can do it as well as you can.

Didn’t work out so well. There have been some high points, such as Freakonomics, which, for all its flaws, presented a picture of social scientists as active problem solvers. But, in many other cases, it seems that science spent much of its credibility on a bunch of short-term quests for money and fame. Too bad, seeing what happened since 1999.

As scientists, I think we should spend less time thinking about how to craft our brilliant ideas as stories for the masses, and think harder about how we ourselves learn from stories. Let’s treat our audience, our fellow citizens of the world, with some respect.

Halftime! And Jim Thorpe (1) vs. DJ Jazzy Jeff

So. Here’s the bracket so far:

Our first second-round match is the top-ranked GOAT—the greatest GOAT of all time, as it were—vs. an unseeded but appealing person whose name ends in f.

Again here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Should he go to grad school in statistics or computer science?

Someone named Nathan writes:

I am an undergraduate student in statistics and a reader of your blog. One thing that you’ve been on about over the past year is the difficulty of executing hypothesis testing correctly, and an apparent desire to see researchers move away from that paradigm. One thing I see you mention several times is to simply “model the problem directly”. I am not a masters student (yet) and am also not trained at all in Bayesian. My coursework was entirely based on classical null hypothesis testing.

From what I can gather, you mean the implementation of some kind of multi-level model. But do you also mean the fitting and usage of standard generalized linear models, such as logistic regression? I have ordered the book you wrote with Jennifer Hill on multi-level models, and I hope it will be illuminating.

On the other hand, I’m looking at going to graduate school and I will be applying this fall. My interests have diverged from classical statistics, with a larger emphasis on model building, prediction, and machine learning. To this end, would further training in statistics be appropriate? Or would it be more useful to try and get into a CS program? I still have interests in “statistics” — describing associations, but I am not so sure I am interested in being a classical theorist. What do you think?

My reply: There are lots of statistics programs that focus on applications rather than theory. Computer science departments, I don’t know how that works. If you want an applied-oriented statistics program, it could help to have a sense of what application areas you’re interested in, and also whether you’re interested in doing computational statistics, as a lot of applied work requires computational as well as methodological innovation in order to include as much relevant information as possible in your analyses.

Yakov Smirnoff advances, and Halftime!

Best argument yesterday came from Yuling:

I want to learn more about missing data analysis from the seminar so I like Harry Houdini. But Yakov Smirnoff is indeed better for this topic: both Vodka and the Soviet are treatments that guarantee everyone to be Missing Completely at Random, and as statisticians we definitely prefer Missing Completely at Random.

And now the contest is halfway done! We’re through with the first round. Second round will start tomorrow.

Global warming? Blame the Democrats.

An anonymous blog commenter sends the above graph and writes:

I was looking at the global temperature record and noticed an odd correlation the other day. Basically, I calculated the temperature trend for each presidency and multiplied by the number of years to get a “total temperature change”. If there was more than one president for a given year it was counted for both. I didn’t play around with different statistics to measure the amount of change, including/excluding the “split” years, etc. Maybe other ways of looking at it yield different results, this is just the first thing I did.

It turned out all 8 administrations who oversaw a cooling trend were Republican. There has never been a Democrat president who oversaw a cooling global temperature. Also, the top 6 warming presidencies were all Democrats.

I have no idea what it means but thought it may be of interest.

My first thought, beyond simply random patterns showing up with small N, is that thing that Larry Bartels noticed a few years ago, that in recent decades the economy has grown faster under Democratic presidents than Republican presidents. But the time scale does not work to map this to global warming. CO2 emissions, maybe, but I wouldn’t think it would show up in the global temperature so directly as that.

So I’d just describe this data pattern as “one of those things.” My correspondent writes:

I expect to hear it dismissed as a “spurious correlation”, but usually I hear that argument used for correlations that people “don’t like” (it sounds strange/ridiculous) and it is never really explained further. It seems to me if you want to make a valid argument that a correlation is “spurious” you still need to identify the unknown third factor though.

In this case I don’t know that you need to specify an unknown third factor, as maybe you can see this sort of pattern just from random numbers, if you look at enough things. Forking paths and all that. Also there were a lot of Republican presidents in the early years of this time series, back before global warming started to take off. Also, I haven’t checked the numbers in the graph myself.

Harry Houdini (1) vs. Yakov Smirnoff; Meryl Streep advances

Best argument yesterday came from Jonathan:

This one’s close.

Meryl Streep and Alice Waters both have 5 letters in the first name and 6 in the last name. Tie.

Both are adept at authentic accents. Tie.

Meryl has played an international celebrity cook; Alice has never played an actress. Advantage Streep.

Waters has taught many chefs; Meryl has taught no actors. Advantage Waters.

Streep went to Vassar and Yale. Waters went to Berkeley. I’m an East Coast guy, but YMMV.

Waters has the French Legion of Honor. Streep is the French Lieutenant’s Woman.

Both have won more awards than either of them can count.

So I use Sophie’s Axiom of Choice: When comparing a finite set of pairs of New Jersey Celebrities, choose the one who got into the New Jersey Hall of Fame earlier. That’s Streep, by 6 years.

And today we have the final first-round match! Who do you want to see: the top-seeded magician of all time, or an unseeded person whose name ends in f? Can a speaker escape from his own seminar? In Soviet Russia, seminar speaker watch you.

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

“Using 26,000 diary entries to show ovulatory changes in sexual desire and behavior”

Kevin Lewis points us to this research paper by Ruben Arslan, Katharina Schilling, Tanja Gerlach, and Lars Penke, which begins:

Previous research reported ovulatory changes in women’s appearance, mate preferences, extra- and in-pair sexual desire, and behavior, but has been criticized for small sample sizes, inappropriate designs, and undisclosed flexibility in analyses.

Examples of such criticism are here and here.

Arslan et al. continue:

In the present study, we sought to address these criticisms by preregistering our hypotheses and analysis plan and by collecting a large diary sample. We gathered more than 26,000 usable online self-reports in a diary format from 1,043 women, of which 421 were naturally cycling. We inferred the fertile period from menstrual onset reports. We used hormonal contraceptive users as a quasi-control group, as they experience menstruation, but not ovulation.


We found robust evidence supporting previously reported ovulatory increases in extra-pair desire and behavior, in-pair desire, and self-perceived desirability, as well as no unexpected associations. Yet, we did not find predicted effects on partner mate retention behavior, clothing choices, or narcissism. Contrary to some of the earlier literature, partners’ sexual attractiveness did not moderate the cycle shifts. Taken together, the replicability of the existing literature on ovulatory changes was mixed.

I have not looked at this paper in detail, but just speaking generally I like what they’re doing. Instead of gathering one more set of noisy data and going for the quick tabloid win (or, conversely, the so-what failed replication), they designed a study to gather high-quality data with enough granularity to allow estimation of within-person comparisons. That’s what we’ve been talkin bout!

Alice Waters (4) vs. Meryl Streep; LeBron James advances

It’s L’Bron. Only pitch for Mr. Magic was from DanC: guy actually is ultra-tall, plus grand than that non-Cav who had play’d for Miami. But Dalton brings it back for Bron:

LeBron James getting to the NBA Final with J.R. Smith as his best supporting cast member is a more preposterous escape than anything David Blaine or Houdini did. So he’s already a better magician than Eric Antoine (who is seeded below Blaine and Houdini).

Plus, he’s featured in this (unfortunately paywalled) Teaching Statistics article which points out the merits of graphical comparison (“Understanding summary statistics and graphical techniques to compare Michael Jordan versus LeBron James”) – I love the fact that statistics cannot determine the MJ and LeBron debate precisely because it all depends on which summary statistic you choose. Just goes to show that you need to put as much thought into which dimensions you choose to check your model (graphically and numerically) as you do in constructing your model in the first place.

All stats, yah!

Today it’s a cook vs. a drama star. Whaddya want, a scrumptious lunch or Soph’s option? Or ya want Silkwood? Fantastic Mr. Fox? Can’t go wrong with that lady. But you also luv that cookbook, that food, that flavor, right? You pick.

Again, full list is at this link, and instructions:

Trying to pick #1 visitor. I’m not asking for most popular, or most topical, or optimum, or most profound, or most cool, but a combination of traits.

I’ll pick a day’s victor not from a popular tally but on amusing quips on both camps. So try to show off!

Our hypotheses are not just falsifiable; they’re actually false.

Everybody’s talkin bout Popper, Lakatos, etc. I think they’re great. Falsificationist Bayes, all the way, man!

But there’s something we need to be careful about. All the statistical hypotheses we ever make are false. That is, if a hypothesis becomes specific enough to make (probabilistic) predictions, we know that with enough data we will be able to falsify it.

So, here’s the paradox. We learn by falsifying hypotheses, but we know ahead of time that our hypotheses are false. Whassup with that?

The answer is that the purpose of falsification is not to falsify. Falsification is useful not in telling us that a hypothesis is false—we already knew that!—but rather in telling us the directions in which it is lacking, which points us ultimately to improvements in our model. Conversely, lack of falsification is also useful in telling us that our available data are not rich enough to go beyond the model we are currently fitting.
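
A quick simulation sketch of the claim that any sharp hypothesis gets rejected given enough data. It assumes the null hypothesis is "the mean is exactly 0" while the data actually come from a normal distribution with mean 0.05 and standard deviation 1; that small discrepancy and the sample sizes are hypothetical choices for illustration.

```python
import math
import random


def two_sided_p(n, true_mean=0.05, seed=0):
    """p-value for H0: mean = 0, from n draws of Normal(true_mean, 1) with known sd."""
    rng = random.Random(seed)
    total = sum(rng.gauss(true_mean, 1.0) for _ in range(n))
    z = total / math.sqrt(n)                     # z-statistic under H0
    return math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail area


p_values = {n: two_sided_p(n) for n in (100, 10_000, 200_000)}
for n, p in p_values.items():
    print(f"n = {n:>7,d}: p = {p:.3g}")
```

The hypothesis is equally false at every sample size; only our ability to detect that changes. A non-rejection at small n says the data are not rich enough to go beyond the simple model, and a rejection at large n is the cue to ask in which direction the model is lacking.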

P.S. I was motivated to write this after seeing this quotation: “. . . this article pits two macrotheories . . . against each other in competing, falsifiable hypothesis tests . . .”, pointed out to me by Kevin Lewis.

And, no, I don’t think it’s in general a good idea to pit theories against each other in competing hypothesis tests. Instead I’d prefer to embed the two theories into a larger model that includes both of them. I think the whole attitude of A-or-B-but-not-both is mistaken; for more on this point, see for example the discussion on page 962 of this review article from a few years ago.

LeBron James (3) vs. Eric Antoine; Ellen DeGeneres advances

Optimum quip Thursday was from Dzhaughn:

Mainly, that woman’s tag has a lot of a most common typographical symbol in it, which would amount to a big difficulty back in days of non-digital signs on halls of drama and crowd-laughing.

Should that fact boost or cut a probability appraisal of said woman writing an amazing book such as “A Void” (aka “La Disparition” in Gallic printings?) I cannot say, A or B. (If you don’t know what’s up, visit to find that book’s author’s autograph and a blurb on said book. You will know why its local omission is mandatory.)

That I should, so soon as now, so miss that most familiar symbol. But I do! Would you not? I should strongly disavow prodigality with it!

Good points, all. I must go with L.A. TV host and funny lady for this win. You go girl. You will soon stand vs. a hoops man or a magical guy in round 2. Good stuff all round.

Today, #3 GOAT is facing off against a magician. L’Bron could talk b-ball or politics and might want to know about schools policy, a common topic on this blog. But that français is funny looking and has strong tricks. Both guys on TV all days. Who do you want to show up to our Columbia talk?

Again, full list is at this link, and instructions:

Trying to pick #1 visitor. I’m not asking for most popular, or most topical, or optimum, or most profound, or most cool, but a combination of traits.

I’ll pick a day’s victor not from a popular tally but from amusing quips from both camps. So try to show off!

Fitting multilevel models when the number of groups is small

Matthew Poes writes:

I have a question that I think you have answered for me before. There is an argument to be made that HLM should not be performed if a sample is too small (too small level 2 and too small level 1 units). Lots of papers have been written with guidelines on what those should be. It’s my understanding that those guidelines may not be worth much and I believe even you have suggested that when faced with small samples, it is probably better to just simulate.

Is it accurate to say that if a data set is clearly nested, there is dependence, and the sample is too small to do HLM, then no analysis is ok? That is, a different analysis that doesn’t address dependence but is not necessarily as biased with small samples (or so they say) is still not ok? I think you mentioned this before.

Let’s say you want to prove that Head Start centers that measure as having higher “capacity” (as measured on a multi-trait multi-method assessment of capacity) have teachers that are more “satisfied” with their jobs. Is it accurate to say that simply looking at the correlation between site capacity and site average job satisfaction is not ok if you only have 15 sites (and 50 total teachers unequally distributed amongst these sites)? This is a real question I’ve been given with the names and faces changed. My instinct is they aren’t analyzing the question they asked and this isn’t right.

Would the use of a Bayesian GLM be an option or am I expecting too much magic here? This isn’t my study, but I hate to go back to someone and say, Hey sorry, you spent 2 years and there is nothing you can do quantitatively here (though I’d much rather say that than allow this correlation to be published).

My quick response is that the model is fine if you’re not data-rich; it’s just that in such a setting the prior distribution is more important. Flat priors will not make sense because they allow the possibility of huge coefficients that are not realistic. My book with Hill is not at all clear on this point, as we pretty much only use flat priors, and we don’t really wrestle with the problems that this can cause. Moving forward, though, I think the right thing to do is to fit multilevel models with informative priors. Setting up these priors isn’t trivial but it’s not impossible either; see for example the bullet points on page 13 of this article for an example in a completely different context. As always, it would be great to have realistic case studies of this sort of thing (in this case, informative priors for multilevel models in analyses of social science interventions) that people can use as templates for their own analyses. We should really include one such example in Advanced Regression and Multilevel Models, the in-preparation second edition of the second half of our book.

Short-term, for your problem now, I recommend the multilevel models with informative priors. I’m guessing it will be a lot easier to do this than you might think.
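Here is a rough sketch of what the prior on the group-level variation does when there are only a few groups. This is not the analysis of the Head Start data and not a substitute for fitting the model in Stan; it swaps in a plain grid approximation for the simplest normal hierarchical model, with the overall mean integrated out under a flat prior. The five site estimates, their standard errors, the half-normal scale of 0.5, and all function names are made up for illustration.

```python
from math import exp, log

# Hypothetical site-level estimates and standard errors (invented numbers).
y = [0.8, 0.1, 0.3, -0.2, 0.5]
se = [0.4, 0.3, 0.35, 0.45, 0.3]

def log_marginal_likelihood(tau, y, se):
    """log p(y | tau), up to a constant, for y_j ~ Normal(theta_j, se_j) and
    theta_j ~ Normal(mu, tau), with mu integrated out under a flat prior."""
    variances = [s**2 + tau**2 for s in se]
    precisions = [1.0 / v for v in variances]
    v_mu = 1.0 / sum(precisions)
    mu_hat = v_mu * sum(p * yj for p, yj in zip(precisions, y))
    total = 0.5 * log(v_mu)
    for v, yj in zip(variances, y):
        total += -0.5 * log(v) - (yj - mu_hat)**2 / (2.0 * v)
    return total

def posterior_mean_tau(y, se, log_prior, tau_max=5.0, n_grid=2000):
    """Posterior mean of tau from a grid approximation on (0, tau_max)."""
    taus = [tau_max * (i + 0.5) / n_grid for i in range(n_grid)]
    log_post = [log_marginal_likelihood(t, y, se) + log_prior(t) for t in taus]
    top = max(log_post)  # subtract the max before exponentiating for stability
    weights = [exp(lp - top) for lp in log_post]
    return sum(t * w for t, w in zip(taus, weights)) / sum(weights)

def flat_log_prior(tau):
    return 0.0  # flat on (0, tau_max); the answer then depends on tau_max

def half_normal_log_prior(tau, scale=0.5):
    return -0.5 * (tau / scale)**2  # assumed scale, chosen only for illustration

tau_flat = posterior_mean_tau(y, se, flat_log_prior)
tau_informative = posterior_mean_tau(y, se, half_normal_log_prior)
print(f"posterior mean of tau, flat prior:        {tau_flat:.2f}")
print(f"posterior mean of tau, half-normal prior: {tau_informative:.2f}")
```

With so few groups the likelihood says little about the upper end of tau, so under the flat prior the answer is driven by the arbitrary upper limit of the grid. The half-normal prior rules out group-level variation that would be unrealistic on this scale, and that is the sense in which the prior is doing real work here.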

Poes then replied:

That example came from a real scenario where a prior study actually had unusually high coefficients. It was an intervention designed for professional development of practitioners. In general, most studies of a similar nature have had no or little effect. An effect size of .2 to .5 is pretty common. This particular intervention was not so unusual as to expect much higher effects, but they ended up with effects closer to .8 or so, and the sample was very small (it was a pilot study). They used that evidence as a means to justify a second small study. I suspect there is a great deal more uncertainty in those findings than it appears to the evaluation team, and I suspect if priors from those earlier studies were to be included, the coefficients would be more reasonable. The second study has not yet been completed, but I will be shocked if they see the same large effects.

This is an exaggeration, but to put this large effect into perspective, it would be as if we are suggesting that spending an extra ten minutes a day with hands on supervision of preschool teachers would lead to their students knowing ten more letters by the end of the year. I think you have addressed this before, but I do think people sometimes forget to take a step back from their statistics to consider what those statistics mean in practical terms.

Poes also added:

While we are talking about these studies as if Bayesian analysis would be used, they are in fact all analyzed using frequentist methods. I’m not sure if that was clear.

And then he had one more question:

When selecting past studies to use as informative priors, does the quality of the research matter? I have to imagine the answer is yes. A common argument I hear against looking to past results as evidence for current or future results is that the past research is of insufficient quality. Sample too small, measures too noisy, theory of change ill-thought-out, etc. My guess is that it does matter and those issues all potentially matter, but . . . It seems like that then raises the question, at what point is the quality sufficiently bad to merit exclusion? Based on what criteria? Study rating systems (e.g. CONSORT) exist, but I’m assuming that is not a common part of the process and I would also guess that many of the criteria are unimportant for their use as a prior. I’ve worked on a few study rating tools (including one that is in the process of being published as we speak) and my experience has been that a lot of concessions are made to ensure at least some studies make it through. To go back to my earlier question, I had pointed out that sample size adequacy shouldn’t be based on a fixed number (e.g. at least 100 participants) and maybe not based on the existence of a power analysis, but rather something more nuanced.

This brings me back to my general recommendation that researchers have a “paper trail” to justify their models, including their choice of prior distributions. I have no easy answers here, but, as usual, the default flat prior can cause all sorts of havoc, so I think it’s worth thinking hard about how large you can expect effect sizes to be, and what substantive models correspond to various assumed distributions of effect sizes.
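For a sense of how much a prior on the effect size can matter in a small pilot study like the one Poes describes, here is a minimal normal-normal calculation. The pilot estimate of 0.8 comes from his description; the standard error of 0.4 and the Normal(0.2, 0.2) prior are assumptions invented for this sketch, not numbers from the study or from any literature review.

```python
from math import sqrt

def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Combine a normal prior with a normal estimate by precision weighting."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / std_error**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, sqrt(post_var)

# Assumed prior: effects in this area are mostly small, Normal(0.2, 0.2).
# Assumed data: pilot estimate 0.8 with a hypothetical standard error of 0.4.
post_mean, post_sd = normal_posterior(0.2, 0.2, 0.8, 0.4)
print(round(post_mean, 2), round(post_sd, 2))  # 0.32 0.18
```

Under these assumptions the noisy 0.8 gets pulled most of the way back toward the range seen in earlier studies. Different defensible priors will give different answers, which is exactly why the choice belongs in the paper trail.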

P.S. Yes, this question comes up a lot! For example, a quick google search reveals:

Multilevel models with only one or two groups (from 2006)

No, you don’t need 20 groups to do a multilevel analysis (from 2007)

Hierarchical modeling when you have only 2 groups: I still think it’s a good idea, you just need an informative prior on the group-level variation (from 2015)