Some critics seem to think that English grammar is just a brief checklist of linguistic table manners that every educated person should already know. Others see grammar as a complex, esoteric, and largely useless discipline replete with technical terms that no ordinary person needs. Which is right? Neither. The handy menu of grammar tips is a myth. Faculty often point to Strunk and White’s The Elements of Style as providing such a list, but its assertions about grammar are often flagrantly false and its rambling remarks on style are largely useless. The truth is that the books on English grammar intended for students or the general public nearly all dispense false claims and bad analyses. Yet grammar can be described in a way that makes sense. I [Pullum] offer some eye-opening facts about Strunk and White, and an antidote, plus brief illustrations of how grammar and style can be tackled in a sensible way drawing on insights from modern linguistics.

The talk is at 707 Hamilton Hall, Fri 22 Feb, 4pm.

The funny thing is, I get what Pullum is saying here, but I still kinda like Strunk and White for what it is.

First New Yorker showdown, just to see who will be taking on Veronica Geng in the finals. All the other contestants are just for show. I’m going with Liebling, because Parker wasn’t even the best New Yorker writer of her generation, being edged out by Benchley. Liebling dominated his era. If it comes down to Liebling vs. Geng, we’ll just exhume Harold Ross and make him pick.

But we’re looking for a talker, not a writer, so I’ll have to go with Dzhaughn:

After the Seance, we were chatting about the inspiration for this tournament. I said I thought Bruno was just a minor intellectual swindler rather than a real threat. Dorothy replied:

I used to think Latour was just something on a Schwinn dealer’s list*, but that was before I saw Julia’s child Oscar wildly strong-arm Lance with an ephronedrine-filled syringe merrily down the Streep, past a sidewalk cafe where the turing Pele and big bejeweled #23, in Brooks’ Brothers suits, were yakking over Smirnoff Martinis, eating a pile of franks, caesar salads, and some weirder dishes. James was on the phone, taking the TV network to hell and back over “letting that degenerate George Karl off the hook” for some remark, when, from behind a bush, sudden as a python, out springs teen-aged Babe D.-Z, among others! That geng didn’t look like they were here to serenade us with arias from Yardbird, that jazz oprah about Parker! No, they were there to revolt–air their own grievances–and when he stood to object, Babe just shoved LeBron and all his LeBling back onto LaPlace where he sat: Oof!

A bit of recursion is usually a good plan.

For today it’s the French Chef vs. the Chairman of the Board. Frank’s got a less screechy voice, but Julia should be able to handle the refreshments. Any thoughts?

Again, here’s the bracket and here are the rules.

The study of American politics as a window into understanding uncertainty in science

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We begin by discussing recent American elections in the context of political polarization, and we consider similarities and differences with European politics. We then discuss statistical challenges in the measurement of public opinion: inference from opinion polls with declining response rates has much in common with challenges in big-data analytics. From here we move to the recent replication crisis in science, and we argue that Bayesian methods are well suited to resolve some of these problems, if researchers can move away from inappropriate demands for certainty. We illustrate with examples in many different fields of research, our own and others’.

Some background reading:

19 things we learned from the 2016 election (with Julia Azari), http://www.stat.columbia.edu/~gelman/research/published/what_learned_in_2016_5.pdf

The mythical swing voter (with Sharad Goel, Doug Rivers, and David Rothschild). http://www.stat.columbia.edu/~gelman/research/published/swingers.pdf

The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. http://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf

Honesty and transparency are not enough. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics14.pdf

The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. http://www.stat.columbia.edu/~gelman/research/published/bayes_management.pdf

The talk will mostly be about statistics, not political science, but it’s good to have a substantive home base when talking about methods.

On one hand, Serena knows how to handle a racket. But Steve Martin knows how to make a racket with some strings stretched taut over a frame. Are you really gonna bet against the dude who went toe-to-toe with Kermit the Frog in a racket-making duel?

Today we have an unseeded eater vs. the second-seeded wit. That said, Liebling was very witty, at least in his writing. And Parker—I don’t know how she was as an eater, but she sure knew how to drink. As did Liebling. So the two are evenly matched.

Again, here’s the bracket and here are the rules.

A key benefit of a histogram is that, as a plot of raw data, it contains the seeds of its own error assessment. Or, to put it another way, the jaggedness of a slightly undersmoothed histogram performs a useful service by visually indicating sampling variability. That’s why, if you look at the histograms in my books and published articles, I just about always use lots of bins.
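Here’s a quick sketch of that idea in Python rather than R (simulated normal draws; the 80-bin choice is arbitrary): two histograms of independent samples from the same distribution, where the bin-to-bin jitter is exactly the sampling variability the jaggedness is advertising.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent samples from the same distribution, histogrammed with
# many narrow bins ("undersmoothed"); comparing the two sets of counts
# shows the sampling variability directly.
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
edges = np.linspace(-4, 4, 81)  # 80 narrow bins
c1, _ = np.histogram(x1, bins=edges)
c2, _ = np.histogram(x2, bins=edges)

# A bin with expected count m fluctuates with sd ~ sqrt(m), so the
# wiggles of a fine-binned histogram come with their own error bars.
print(c1[35:45])
print(c2[35:45])
```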

R’s histogram function used to default to too few bins, but somewhere along the way someone fixed it: it now has a reasonable default, with lots of bins. (Just go into R and type hist(rnorm(100)) and you’ll see.)

I’m so happy!

**P.S.** When searching for my old post on histograms, I found this remark, characterizing the following bar graph:

This graph isn’t horrible—with care, you can pull the numbers off it—but it’s not set up to allow much discovery, either. This kind of graph is a little bit like a car without an engine: you can push it along and it will go where you want, but it won’t take you anywhere on its own.

I noticed you posted an anonymous email about our working paper on p-hacking and false discovery, but was a bit surprised that it references an early version of the paper.

We addressed the issues mentioned in the post more than two months ago in a version that has been available online since December 2018 (same link as above). I wanted to send you a few thoughts on the post, with the hope that you will find them interesting and relevant to post on your blog as our reply, with the goal of informing your readers a little more about the paper.

These thoughts are presented in the separate section below.

The more recent analysis applies an MSE-optimal bandwidth selection procedure. Hence, we use only a single bandwidth for assessing the presence of a discontinuity at a specific level of confidence in an experiment.

Less importantly, the more recent analysis uses a triangular kernel and linear regression (though we also report a traditional logistic regression analysis result for transparency and robustness).
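To make the method concrete, here is an illustrative local-linear RDD sketch in Python with a triangular kernel, run on simulated data with a known jump. This is not the authors’ code, and a hand-picked bandwidth stands in for the MSE-optimal selector.

```python
import numpy as np

def rdd_jump(x, y, cutoff, h):
    """Local-linear RDD estimate of the jump at `cutoff`, using a
    triangular kernel with bandwidth h (chosen by hand here; an
    MSE-optimal selector would pick h from the data)."""
    est = {}
    for side, mask in (("left", x < cutoff), ("right", x >= cutoff)):
        d = x[mask] - cutoff
        w = np.clip(1 - np.abs(d) / h, 0, None)  # triangular kernel weights
        keep = w > 0
        X = np.column_stack([np.ones(keep.sum()), d[keep]])
        W = np.sqrt(w[keep])
        # Weighted least squares: scale rows by sqrt(weight)
        beta, *_ = np.linalg.lstsq(X * W[:, None], y[mask][keep] * W, rcond=None)
        est[side] = beta[0]  # intercept = fitted value at the cutoff
    return est["right"] - est["left"]

# Synthetic check: a known jump of 0.5 at x = 0.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 5000)
y = 0.3 * x + 0.5 * (x >= 0) + rng.normal(scale=0.1, size=5000)
print(round(rdd_jump(x, y, cutoff=0.0, h=0.3), 2))
```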

The results have not changed much, and some have been strengthened.

With regard to the RDD charts, the visual fit indeed might not be great. But we think the fit using the MSE-optimal window width is actually good.

The section below provides more details, and I hope you will find it relevant to post it on your blog.

We also of course would welcome any feedback you may have about the methods we are using in the paper, including the second part of the paper where we attempt to quantify the consequences of p-hacking on false discovery and foregone learning.

I am learning from all the feedback we receive and am working constantly to improve the paper.

More details about blog post:

The comments by the anonymous letter writer are about an old version of the paper, and we addressed them a few months ago.

Three main concerns were expressed: The choice of six possible discontinuities, the RDD window widths, and the RDD plot showing weak visual evidence of a discontinuity in stopping behavior based on the confidence level the experiment reaches.

1. Six hypotheses

We test six different hypotheses, each positing optional-stopping based on the p-value of the experiment at one of the three commonly used levels of significance/confidence in business and social science (90, 95 and 99%) for both positive and negative effects (3 X 2 = 6).

We view these as six distinct a-priori hypotheses, one each for a specific form of stopping behavior, not six tests of the same hypothesis.

2. RDD window width

The December 2018 version of the paper details an RDD analysis using an MSE-optimal bandwidth linear regression with a triangular kernel.

The results (and implications) haven’t changed dramatically using the more sophisticated approach, which relies on a single window for each RDD. We fully report all the tables in the paper. This is what the results look like (Table 5 of the paper includes the details of the bandwidth sizes, number of observations, etc.):

The linear and the bias-corrected linear models use the “sophisticated” MSE-optimal method. We also report a logistic regression analysis with the same MSE-optimal window width for transparency and to show robustness.

All the effects are reported as marginal effects to allow easy comparison.

Not much has changed in results and the main conclusion about p-hacking remains the same: A sizable fraction of experiments exhibit credible evidence of stopping when the A/B test reaches 90% confidence for a positive effect, but not at the other levels of significance typically used and not for negative effects.

3. RDD plots

With respect to the RDD charts, the fit might indeed not look great visually. But what matters for the purpose of causal identification in such a quasi-experiment, in our opinion, is more the evidence of a discontinuity at the point of interest, rather than the overall data fit.

Here is the chart with the MSE-Optimal bandwidth around .895 confidence (presented as 90% to the experimenters) from the paper. Apart from the outlier at .89 confidence, we think the lines track the raw fractions rather well.

I wrote that earlier post in September, hence it was based on an earlier version of the article. It’s good to hear about the update.

I’m pulling for Kobayashi if only because the longer he’s in, the more often Andrew will have to justify describing him vs using his name. The thought of Andrew introducing the speaker as “and now, here’s that Japanese dude who won the hot dog eating contest” sounds awkward enough to prime us all for stress-eating, and who better to give us best practices/techniques?

I agree with Diana that there’s some underdog bias here, as in the real world there’d be no doubt that we’d want to hear Wilde. Indeed, if this were a serious contest we’d just have looked at the 64 names, picked Wilde right away, and ended it. But, for round 3, the Japanese dude who won the hot dog eating contest it is.

And today we have an unseeded GOAT vs. the fourth-seeded magician. Whaddya want, ground strokes or ironic jokes?

Again, here’s the bracket and here are the rules.

I just read your paper with Daniel Simpson and Michael Betancourt, The Prior Can Often Only Be Understood in the Context of the Likelihood, and I find it refreshing to read that “the practical utility of a prior distribution within a given analysis then depends critically on both how it interacts with the assumed probability model for the data in the context of the actual data that are observed.” I also welcome your comment about the importance of “data generating mechanism” because, for me, it is akin to selecting the “appropriate” distribution for a given response. I always make the point to the people I’m working with that we need to consider the clinical, scientific, physical and engineering principles governing the underlying phenomenon that generates the data; e.g., forces are positive quantities, particles are counts, yield is bounded between 0 and 1.

You also talk about the “big data, and small signal revolution.” In industry, however, we face the opposite problem: our datasets are usually quite small. We may have a new product, for which we want to make some claims, and we may have only 4 observations. I do not consider myself a Bayesian, but I do believe that Bayesian methods can be very helpful in industrial situations. I also read your Prior Choice Recommendations but did not find anything specific about small sample sizes. Do you have any recommendations for useful priors when datasets are small?

My quick response is that when sample size is small, or measurements are noisy, or the underlying phenomenon has high variation, then the prior distribution will become more important.
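A toy conjugate normal-normal calculation makes the point (Python sketch, made-up numbers): with the same data mean and the same prior, the prior moves the estimate a lot at n = 4 and hardly at all at n = 400.

```python
def posterior_mean_var(ybar, n, sigma, mu0, tau0):
    """Conjugate normal-normal posterior for a mean: sample mean ybar
    from n observations with known sd sigma, prior N(mu0, tau0^2)."""
    prec = n / sigma**2 + 1 / tau0**2
    mean = (n * ybar / sigma**2 + mu0 / tau0**2) / prec
    return mean, 1 / prec

# Same ybar = 1.0, same prior N(0, 0.5^2), different sample sizes.
for n in (4, 400):
    m, v = posterior_mean_var(ybar=1.0, n=n, sigma=1.0, mu0=0.0, tau0=0.5)
    print(n, round(m, 3))  # n=4 -> 0.5 (halfway to the prior); n=400 -> 0.99
```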

So your question is a good one!

To continue, when priors are important, you’ll have to think harder about what real prior information is available.

One way to do this is . . . and I’m sorry for being so predictable in my answer, but I’ll say it anyway . . . embed your problem in a multilevel model. You have a new product with just four observations. Fine. But this new product is the latest in a stream of products, so create a model of the underlying attributes of interest, given product characteristics and time.

Don’t think of your “prior” for a parameter as some distinct piece of information; think of it as the culmination of a group-level model.

Just like when we do Mister P: We don’t slap down separate priors for the 50 states, we set up a hierarchical model with state-level predictors, and this does the partial pooling more organically. So the choice of priors becomes something more familiar: the choice of predictors in a regression model, along with choices about how to set that predictive model up.
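A stripped-down sketch of that partial pooling, in Python with the variance components treated as known (all numbers made up): each product’s estimate is a precision-weighted compromise between its own data and the group-level prediction, and the 4-observation product gets pulled hardest.

```python
import numpy as np

# Partial pooling across products (a stand-in for states in Mister P).
sigma, tau = 1.0, 0.3          # within-product sd, between-product sd (assumed known)
n = np.array([4, 25, 400])     # sample sizes, including a 4-observation new product
ybar = np.array([1.2, 0.9, 1.0])
mu = 1.0                       # group-level prediction from product characteristics

# Weight on a product's own data grows with its sample size.
weight = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
pooled = weight * ybar + (1 - weight) * mu
print(np.round(pooled, 3))  # the n=4 product is pulled most of the way toward mu
```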

Even with a hierarchical model, you still might want to add priors on hyperparameters, but that’s something we do discuss a bit at that link.

This paper [“p-Hacking and False Discovery in A/B Testing,” by Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte] ostensibly provides evidence of “p-hacking” in online experimentation (A/B testing) by looking at the decision to stop experiments right around thresholds for the platform presenting confidence that A beats B (which is just a transformation of the p-value).

It is a regression discontinuity design:

They even cite your paper [that must be this or this — ed.] against higher-order polynomials.

Indeed, the above regression discontinuity fits look pretty bad, as can be seen by imagining the scatterplots without those superimposed curves.

My correspondent continues:

The whole thing has forking paths and multiple comparisons all over it: they consider many different thresholds, then use both linear and quadratic fits with many different window sizes (not selected via standard methods), and then later parts of the paper focus only on the specifications that are the most significant (p less than 0.05, but p greater than 0.1).

Huh? Maybe he means “greater than 0.05, less than 0.1”? Whatever.

Anyway, he continues:

Example table (this is the one that looks best for them, others relegated to appendix):

So maybe an interesting tour of:

– How much optional stopping is there in industry? (Of course there is some.)

– Self-deception, ignorance, and incentive problems for social scientists

– Reasonable methods for regression discontinuity designs.

I’ve not read the paper in detail, so I’ll just repeat that I prefer to avoid the term “p-hacking,” which, to me, implies a purposeful gaming of the system. I prefer the expression “garden of forking paths” which allows for data-dependence in analysis, even without the researchers realizing it.

Also . . . just cos the analysis has statistical flaws, it doesn’t mean that the central claims of the paper in question are false. These could be true statements, even if they don’t quite have good enough data to prove them.

And one other point: There’s nothing at all wrong with data-dependent stopping rules. The problem is all in the use of p-values for making decisions. Use the data-dependent stopping rules, use Bayesian decision theory, and it all works out.

**P.S.** It’s been pointed out to me that the above-linked paper has been updated and improved since when I wrote the above post last September. Not all my comments above apply to the latest version of the paper.

Now I’m morally bound to use the Erdos argument I said no one would see unless he made it to this round.

Andrew will take the speaker out to dinner, prove a theorem, publish it and earn an Erdos number of 1.

But then Jan pulled in with :

If you get Erdos, he will end up staying in your own place for the next n months, and him being dead, well, let’s say it is probably not going to be pleasant.

To be honest, I don’t even think I’d like a live Erdos staying in our apartment: from what I’ve read, the guy sounds a bit irritating, the kind of person who thinks he’s charming—an attribute that I find annoying.

Anyway, who cares about the Erdos number. What I really want is a good Wansink number. Recall what the notorious food researcher wrote:

Facebook, Twitter, Game of Thrones, Starbucks, spinning class . . . time management is tough when there’s so many other shiny alternatives that are more inviting than writing the background section or doing the analyses for a paper.

Yet most of us will never remember what we read or posted on Twitter or Facebook yesterday. In the meantime, this Turkish woman’s resume will always have the five papers below.

Coauthorship is forever. Those of us with a low Wansink number will live forever in the scientific literature.

And today’s match features an unseeded eater vs. the top-seeded wit. Doesn’t seem like much of a contest for a seminar speaker, but . . . let’s see what arguments you come up with!

Again, here’s the bracket and here are the rules.

Eric Loken writes:

The clear point of your post was that p-values (and, even worse, significance versus non-significance) are a poor summary of data.

The thought I’ve had lately, working with various groups of really smart and thoughtful researchers, is that Table 4 is also a model of their mental space as they think about their research and as they do their initial data analyses. It’s getting much easier to make the case that Table 4 is not acceptable to publish. But I think it’s also true that Table 4 is actually the internal working model for a lot of otherwise smart scientists and researchers. That’s harder to fix!

Good point. As John Carlin and I wrote, we think the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

I’m going to be previewing the book I’m in the process of writing at the Ann Arbor R meetup on Monday. Here are the details, including the working title:

Probability and Statistics: a simulation-based introduction

Bob Carpenter

Monday, February 18, 2019

Ann Arbor SPARK, 330 East Liberty St, Ann Arbor

I’ve been to a few of their meetings and I really like this meetup group—a lot more statistics depth than you often get (for example, nobody asked me how to shovel their web site into Stan to get business intelligence). There will be a gang (or at least me!) going out for food and drinks afterward.

I’m still not 100% sure about which parts I’m going to talk about, as I’ve already written 100+ pages of it. After some warmup on the basics of Monte Carlo, I’ll probably do a simulation-based demonstration of the central limit theorem, the curse of dimensionality, and some illustration of (anti-)correlation effects on MCMC, as those are nice encapsulated little case studies I can probably get through in an hour.
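For a flavor of the simulation-based approach, here’s the kind of minimal central-limit-theorem demo I have in mind (a Python sketch, not an excerpt from the book): means of skewed exponential draws concentrate around the true mean with sd shrinking like 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(123)
n, reps = 50, 10000

# Each row is one simulated dataset of n exponential(1) draws;
# `means` is the Monte Carlo sampling distribution of the sample mean.
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(round(means.mean(), 2))  # ~ 1.0, the true mean
print(round(means.std(), 2))   # ~ 1/sqrt(50) ~ 0.14, despite the skewed base distribution
```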

**The Repository**

I’m writing it all in bookdown and licensing it all open source. I’ll probably try to find a publisher, but I’m only going to do so if I can keep the pdf free.

I just opened up the GitHub repo so anyone can download and build it:

- GitHub repository: bob-carpenter/prob-stats

I’m happy to take suggestions, but please don’t start filing issues on typos, grammar, etc.—I haven’t even spell checked it yet, much less passed it by a real copy editor. When there’s a more stable draft, I’ll put up a pdf.

And now we have an intriguing contest: a famously immature mathematician who loved to collaborate, vs. an Albert Einstein who didn’t do science. Whaddya think?

Again, here’s the bracket and here are the rules.

In my recent Algorithms in Journalism course we looked at a post which makes a cute little significance-type argument that five Trump campaign payments were actually the $130,000 Daniels payoff. They summed to within a dollar of $130,000, so the simulation recreates sets of payments using bootstrapping and asks how often there’s a subset that gets that close to $130,000. It concludes “very rarely” and therefore that this set of payments was a coverup.

(This is part of my broader collection of simulation-based significance testing in journalism.)

I recreated this payments simulation in a notebook to explore this. The original simulation checks sets of ten payments, which the authors justify because “we’re trying to estimate the odds of the original discovery, which was found in a series of eight or so payments.” You get about p=0.001 that any set of ten payments gets within $1 of $130,000. But the authors also calculated p=0.1 or so if we choose from 15, and my notebook shows that this goes up rapidly to p=0.8 if you choose 20 payments.
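A small Python reconstruction of that simulation, with made-up payment amounts (the real FEC payment data isn’t reproduced here), makes the sensitivity to n easy to see: resample n payments and ask how often some subset sums to within $1 of the target.

```python
import numpy as np

rng = np.random.default_rng(7)
payments = rng.integers(1_000, 60_000, size=500)  # hypothetical payment universe
target, tol = 130_000, 1

def subset_hits(draw, target, tol):
    """True if any subset of `draw` sums to within `tol` of `target`
    (dynamic programming over achievable subset sums)."""
    sums = {0}
    for p in draw:
        sums |= {s + p for s in sums if s + p <= target + tol}
    return any(abs(s - target) <= tol for s in sums)

p_by_n = {}
for n in (10, 15):  # the "denominator" choice the inference is sensitive to
    hits = sum(subset_hits(rng.choice(payments, n, replace=True), target, tol)
               for _ in range(100))
    p_by_n[n] = hits / 100
print(p_by_n)
```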

So the inference you make depends crucially on the universe of events you use. I think of this as the denominator in the frequentist calculation. It seems like a free parameter robustness problem, and for me it casts serious doubt on the entire exercise.

My question is: Is there a principled way to set the denominator in a test like this? I don’t really see one.

I’d be much more comfortable with a fully Bayesian attempt, modeling the generation process for the entire observed payment stream with and without a Daniels payoff. Then the result would be expressed as a Bayes factor which I would find a lot easier to interpret — and this would also use all available data and require making a bunch of domain assumptions explicit, which strikes me as a good thing.

But I do still wonder if frequentist logic can answer the denominator question here. It feels like I’m bumping up against a deep issue here, but I just can’t quite frame it right.

Most fundamentally, I worry that that there is no domain knowledge in this significance test. How does this data relate to reality? What are the FEC rules and typical campaign practice for what is reported and when? When politicians have pulled shady stuff in the past, how did it look in the data? We desperately need domain knowledge here. For an example of what application of domain knowledge to significance testing looks like, see Carl Bialik’s critique of statistical tests for tennis fixing.

My reply:

As Daniel Lakeland said:

A p-value is the probability of seeing data as extreme or more extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).

So . . . when a hypothesis test rejects, it’s no big deal; you’re just rejecting the hypothesis that the data were produced by a specific random number generator—which we already knew didn’t happen. But when a hypothesis test *doesn’t* reject, that’s more interesting: it tells us that we know so little about the data that we *can’t* reject the hypothesis that the data were produced by a specific random number generator.

It’s funny. People are typically trained to think of rejection (low p-values) as the newsworthy event, but that’s backward.
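Lakeland’s random-number-generator framing is easy to check by simulation: generate data from the null RNG, compute two-sided p-values, and rejection at 5% happens 5% of the time by construction. A quick Python sketch:

```python
import math
import numpy as np

# Under the null "random number generator" (iid N(0,1) data), the p-value
# is uniform, so the 5% rejection rate is built in, not a discovery.
rng = np.random.default_rng(42)
reps, n = 20000, 30
z = np.sqrt(n) * rng.normal(size=(reps, n)).mean(axis=1)  # z ~ N(0,1) under the null
p = 1 - np.vectorize(math.erf)(np.abs(z) / math.sqrt(2))  # two-sided p-value
print(round((p < 0.05).mean(), 2))  # ~ 0.05
```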

Regarding your more general point: yes, there’s no substitute for subject-matter knowledge. And the post you linked to above is in error, when it says that a p-value of 0.001 implies that “the probability that the Trump campaign payments were related to the Daniels payoff is very high.” To make this statement is just a mathematical error.

But I do think there are some other ways of going about this, beyond full Bayesian modeling. For example, you could take the entire procedure used in this analysis, and apply it to other accounts, and see what p-values you get.

Jim Thorpe isn’t from Pennsylvania, and yet a town there renamed itself after him. DJ Jazzy Jeff is from Pennsylvania, and yet Will Smith won’t even return his phone calls. Until I can enjoy a cold Yuengling in Jazzy Jeff, PA it’s DJ Jumpin’ Jim for the win.

And today’s second-round bout features a comedic king versus a trailblazing athlete. I have no idea if Babe was funny or if Sid could do sports.

Again, here’s the bracket and here are the rules.

I come before you today as someone who started life with degrees in physical anthropology and medicine; who then published research on endocrinology, and papers in the New England Journal of Medicine, and even in the Proceedings of the Peabody Museum. As someone who, after this promising beginning . . . spent the rest of his life in what is euphemistically called the entertainment business.

Scientists often complain to me that the media misunderstands their work. But I would suggest that in fact, the reality is just the opposite, and that it is science which misunderstands media. I will talk about why popular fiction about science must necessarily be sensationalistic, inaccurate, and negative.

Interesting, given that Crichton near the end of his life became notorious as a sensationalist climate change denier. But that doesn’t really come up in this particular interview, so let’s let him continue:

I’ll explain why it is impossible for the scientific method to be accurately portrayed in film. . . .

Movies are a special kind of storytelling, with their own requirements and rules. Here are four important ones:

– Movie characters must be compelled to act

– Movies need villains

– Movie searches are dull

– Movies must move

Unfortunately, the scientific method runs up against all four rules. In real life, scientists may compete, they may be driven – but they aren’t forced to work. Yet movies work best when characters have no choice. That’s why there is the long narrative tradition of contrived compulsion for scientists. . . .

Second, the villain. Real scientists may be challenged by nature, but they aren’t opposed by a human villain. Yet movies need a human personification of evil. You can’t make one without distorting the truth of science.

Third, searches. Scientific work is often an extended search. But movies can’t sustain a search, which is why they either run a parallel plotline, or more often, just cut the search short. . . .

Fourth, the matter of physical action: movies must move. Movies are visual and external. But much of the action of science is internal and intellectual, with little to show in the way of physical activity. . . .

For all these reasons, the scientific method presents genuine problems in film storytelling. I believe the problems are insoluble. . . .

This all makes sense.

Later on, Crichton says:

As for the media, I’d start using them, instead of feeling victimized by them. They may be in disrepute, but you’re not. The information society will be dominated by the groups and people who are most skilled at manipulating the media for their own ends.

Yup. And now he offers some ideas:

For example, under the auspices of a distinguished organization . . . I’d set up a service bureau for reporters. . . . Reporters are harried, and often don’t know science. A phone call away, establish a source of information to help them, to verify facts, to assist them through thorny issues. Don’t farm it out, make it your service, with your name on it. Over time, build this bureau into a kind of good housekeeping seal, so that your denial has power, and you can start knocking down phony stories, fake statistics and pointless scares immediately, before they build. . . .

Unfortunately, and through no fault of Crichton, we seem to have gotten the first of these suggestions but not the second. Scientists, universities, and journals promote the hell out of just about everything, but they aren’t so interested in knocking down phony stories. Instead we get crap like the Harvard University press office saying “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%,” or the Cornell University press office saying . . . well, if you’re a regular reader of this blog you’ll know where I’m going on this one. Distinguished organizations are promoting the phony stories, not knocking them down.

Crichton concluded:

Under the circumstances, for scientists to fret over their image seems slightly absurd. This is a great field with great talents and great power. It’s time to assume your power, and shoulder your responsibility to get your message to the waiting world. It’s nobody’s job but yours. And nobody can do it as well as you can.

Didn’t work out so well. There have been some high points, such as Freakonomics, which, for all its flaws, presented a picture of social scientists as active problem solvers. But, in many other cases, it seems that science spent much of its credibility on a bunch of short-term quests for money and fame. Too bad, seeing what happened since 1999.

As scientists, I think we should spend less time thinking about how to craft our brilliant ideas as stories for the masses, and think harder about how we ourselves learn from stories. Let’s treat our audience, our fellow citizens of the world, with some respect.

Our first second-round match is the top-ranked GOAT—the greatest GOAT of all time, as it were—vs. an unseeded but appealing person whose name ends in f.

Again here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

I am an undergraduate student in statistics and a reader of your blog. One thing that you’ve been on about over the past year is the difficulty of executing hypothesis testing correctly, and an apparent desire to see researchers move away from that paradigm. One thing I see you mention several times is to simply “model the problem directly”. I am not a masters student (yet) and am also not trained at all in Bayesian. My coursework was entirely based on classical null hypothesis testing.

From what I can gather, you mean the implementation of some kind of multi-level model. But do you also mean the fitting and usage of standard generalized linear models, such as logistic regression? I have ordered the book you wrote with Jennifer Hill on multi-level models, and I hope it will be illuminating.

On the other hand, I’m looking at going to graduate school and I will be applying this fall. My interests have diverged from classical statistics, with a larger emphasis on model building, prediction, and machine learning. To this end, would further training in statistics be appropriate? Or would it be more useful to try and get into a CS program? I still have interests in “statistics” — describing associations, but I am not so sure I am interested in being a classical theorist. What do you think?

My reply: There are lots of statistics programs that focus on applications rather than theory. Computer science departments, I don’t know how that works. If you want an applied-oriented statistics program, it could help to have a sense of what application areas you’re interested in, and also if you’re interested in doing computational statistics, as a lot of applied work requires computational as well as methodological innovation in order to include as much relevant information as possible into your analyses.

I want to learn more about missing data analysis from the seminar, so I like Harry Houdini. But Yakov Smirnoff is indeed better for this topic — both vodka and the Soviets are treatments that guarantee everyone is Missing Completely at Random, and as statisticians we definitely prefer Missing Completely at Random.

And now the contest is halfway done! We’re through with the first round. Second round will start tomorrow.

An anonymous blog commenter sends the above graph and writes:

I was looking at the global temperature record and noticed an odd correlation the other day. Basically, I calculated the temperature trend for each presidency and multiplied by the number of years to get a “total temperature change”. If there was more than one president for a given year it was counted for both. I didn’t play around with different statistics to measure the amount of change, including/excluding the “split” years, etc. Maybe other ways of looking at it yield different results, this is just the first thing I did.

It turned out

all 8 administrations who oversaw a cooling trend were Republican. There has never been a Democrat president who oversaw a cooling global temperature. Also, the top 6 warming presidencies were all Democrats. I have no idea what it means but thought it may be of interest.

My first thought, beyond simply random patterns showing up with small N, is that thing that Larry Bartels noticed a few years ago, that in recent decades the economy has grown faster under Democratic presidents than Republican presidents. But the time scale does not work to map this to global warming. CO2 emissions, maybe, but I wouldn’t think it would show up in the global temperature so directly as that.

So I’d just describe this data pattern as “one of those things.” My correspondent writes:

I expect to hear it dismissed as a “spurious correlation”, but usually I hear that argument used for correlations that people “don’t like” (it sounds strange/ridiculous) and it is never really explained further. It seems to me if you want to make a valid argument that a correlation is “spurious” you still need to identify the unknown third factor though.

In this case I don’t know that you need to specify an unknown third factor, as maybe you can see this sort of pattern just from random numbers, if you look at enough things. Forking paths and all that. Also there were a lot of Republican presidents in the early years of this time series, back before global warming started to take off. Also, I haven’t checked the numbers in the graph myself.
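Just to illustrate the random-numbers point, here’s a quick simulation sketch. The setup is entirely invented (a drifting random walk for “temperature,” coin-flip party labels, four-year terms), and it checks how often all the cooling terms happen to land on a single party:

```python
import random

def trend(xs):
    """Least-squares slope of xs against 0, 1, ..., n-1."""
    n = len(xs)
    mx = (n - 1) / 2
    my = sum(xs) / n
    num = sum((i - mx) * (x - my) for i, x in enumerate(xs))
    den = sum((i - mx) ** 2 for i in range(n))
    return num / den

def one_party_cooling(n_terms=20, term_len=4, seed=None):
    """Random-walk 'temperature' with upward drift, random party labels.
    Returns True if every cooling term belongs to a single party."""
    rng = random.Random(seed)
    temps, t = [], 0.0
    for _ in range(n_terms * term_len):
        t += rng.gauss(0.05, 0.1)  # made-up drift and noise, per year
        temps.append(t)
    parties = [rng.choice("DR") for _ in range(n_terms)]
    cooling = {parties[k] for k in range(n_terms)
               if trend(temps[k * term_len:(k + 1) * term_len]) < 0}
    return len(cooling) <= 1

rate = sum(one_party_cooling(seed=s) for s in range(1000)) / 1000
print(f"'all cooling terms from one party' in {rate:.0%} of simulations")
```

In runs like this the pattern tends to show up in a nontrivial fraction of pure-noise simulations, which is the forking-paths worry in miniature.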

This one’s close.

Meryl Streep and Alice Waters both have 5 letters in the first name and 6 in the last name. Tie.

Both are adept at authentic accents. Tie.

Meryl has played an international celebrity cook; Alice has never played an actress. Advantage Streep.

Waters has taught many chefs; Meryl has taught no actors. Advantage Waters.

Streep went to Vassar and Yale. Waters went to Berkeley. I’m an East Coast guy, but YMMV.

Waters has the French Legion of Honor. Streep is the French Lieutenant’s Woman.

Both have won more awards than either of them can count.

So I use Sophie’s Axiom of Choice: When comparing a finite set of pairs of New Jersey Celebrities, choose the one who got into the New Jersey Hall of Fame earlier. That’s Streep, by 6 years.

And today we have the final first-round match! Who do you want to see: the top-seeded magician of all time, or an unseeded person whose name ends in f? Can a speaker escape from his own seminar? In Soviet Russia, seminar speaker watches *you*.

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Previous research reported ovulatory changes in women’s appearance, mate preferences, extra- and in-pair sexual desire, and behavior, but has been criticized for small sample sizes, inappropriate designs, and undisclosed flexibility in analyses.

Examples of such criticism are here and here.

Arslan et al. continue:

In the present study, we sought to address these criticisms by preregistering our hypotheses and analysis plan and by collecting a large diary sample. We gathered more than 26,000 usable online self-reports in a diary format from 1,043 women, of which 421 were naturally cycling. We inferred the fertile period from menstrual onset reports. We used hormonal contraceptive users as a quasi-control group, as they experience menstruation, but not ovulation.

And:

We found robust evidence supporting previously reported ovulatory increases in extra-pair desire and behavior, in-pair desire, and self-perceived desirability, as well as no unexpected associations. Yet, we did not find predicted effects on partner mate retention behavior, clothing choices, or narcissism. Contrary to some of the earlier literature, partners’ sexual attractiveness did not moderate the cycle shifts. Taken together, the replicability of the existing literature on ovulatory changes was mixed.

I have not looked at this paper in detail, but just speaking generally I like what they’re doing. Instead of gathering one more set of noisy data and going for the quick tabloid win (or, conversely, the so-what failed replication), they designed a study to gather high-quality data with enough granularity to allow estimation of within-person comparisons. That’s what we’ve been talkin bout!

LeBron James getting to the NBA Finals with J.R. Smith as his best supporting cast member is a more preposterous escape than anything David Blaine or Houdini did. So he’s already a better magician than Eric Antoine (who is seeded below Blaine and Houdini).

Plus, he’s featured in this (unfortunately paywalled) Teaching Statistics article which points out the merits of graphical comparison (“Understanding summary statistics and graphical techniques to compare Michael Jordan versus LeBron James” – https://onlinelibrary.wiley.com/doi/abs/10.1111/test.12111) I love the fact that statistics cannot determine the MJ and LeBron debate precisely because it all depends on which summary statistic you choose. Just goes to show that you need to put as much thought into which dimensions you choose to check your model (graphically and numerically) as you do in constructing your model in the first place.

All stats, yah!

Today it’s a cook vs. a drama star. Whaddya want, a scrumptious lunch or Soph’s option? Or ya want Silkwood? Fantastic Mr. Fox? Can’t go wrong with that lady. But you also luv that cookbook, that food, that flavor, right? You pick.

Again, full list is at this link, and instructions:

Trying to pick #1 visitor. I’m not asking for most popular, or most topical, or optimum, or most profound, or most cool, but a combination of traits.

I’ll pick a day’s victor not by a popular tally but by amusing quips from both camps. So try to show off!

But there’s something we need to be careful about. All the statistical hypotheses we ever make are false. That is, if a hypothesis becomes specific enough to make (probabilistic) predictions, we know that with enough data we will be able to falsify it.

So, here’s the paradox. We learn by falsifying hypotheses, but we know ahead of time that our hypotheses are false. Whassup with that?

The answer is that *the purpose of falsification is not to falsify*. Falsification is useful not in telling us that a hypothesis is false—we already knew that!—but rather in telling us the directions in which it is lacking, which points us ultimately to improvements in our model. Conversely, *lack* of falsification is also useful in telling us that our available data are not rich enough to go beyond the model we are currently fitting.
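To see the “we already knew that” point numerically, here is a small sketch (illustrative numbers only): the power of a two-sided z-test against a point null, when the truth is only slightly off the null.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_vs_point_null(true_mean, n, z_crit=1.96):
    """Probability a two-sided z-test rejects H0: mean = 0 at the 5% level,
    given true mean `true_mean`, known sd 1, and sample size n."""
    shift = true_mean * sqrt(n)
    return (1 - normal_cdf(z_crit - shift)) + normal_cdf(-z_crit - shift)

# a hypothesis that is only slightly false is still rejected, eventually
for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: rejection probability = {power_vs_point_null(0.05, n):.3f}")
```

A null that is off by a twentieth of a standard deviation is rejected with near certainty once n is large enough, so the rejection by itself tells us about sample size as much as about the model.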

**P.S.** I was motivated to write this after seeing this quotation: “. . . this article pits two macrotheories . . . against each other in competing, falsifiable hypothesis tests . . .”, pointed to me by Kevin Lewis.

And, no, I don’t think it’s in general a good idea to pit theories against each other in competing hypothesis tests. Instead I’d prefer to embed the two theories into a larger model that includes both of them. I think the whole attitude of A-or-B-but-not-both is mistaken; for more on this point, see for example the discussion on page 962 of this review article from a few years ago.

Mainly, that woman’s tag has a lot of a most common typographical symbol in it, which would amount to a big difficulty back in days of non-digital signs on halls of drama and crowd-laughing.

Should that fact boost or cut a probability appraisal of said woman writing an amazing book such as “A Void” (aka “La Disparition” in Gallic printings)? I cannot say, A or B. (If you don’t know what’s up, visit Amazon.com to find that book’s author’s autograph and a blurb on said book. You will know why its local omission is mandatory.)

That I should, so soon as now, so miss that most familiar symbol. But I do! Would you not? I should strongly disavow prodigality with it!

Good points, all. I must go with L.A. TV host and funny lady for this win. You go girl. You will soon stand vs. a hoops man or a magical guy in round 2. Good stuff all round.

Today, #3 GOAT is facing off against a magician. L’Bron could talk b-ball or politics and might want to know about schools policy, a common topic on this blog. But that français is funny looking and has strong tricks. Both guys on TV all days. Who do you want to show up to our Columbia talk?

Again, full list is at this link, and instructions:

Trying to pick #1 visitor. I’m not asking for most popular, or most topical, or optimum, or most profound, or most cool, but a combination of traits.

I’ll pick a day’s victor not by a popular tally but by amusing quips from both camps. So try to show off!

I have a question that I think you have answered for me before. There is an argument to be made that HLM should not be performed if a sample is too small (too few level-2 and too few level-1 units). Lots of papers have been written with guidelines on what those should be. It’s my understanding that those guidelines may not be worth much, and I believe even you have suggested that when faced with small samples, it is probably better to just simulate.

Is it accurate to say that if a data set is clearly nested, there is dependence, and the sample is too small to do HLM, then no analysis is OK? And that a different analysis that doesn’t address dependence but is not necessarily as biased with small samples (or so they say) is still not OK? I think you mentioned this before.

Let’s say you want to prove that head start centers that measure as having higher “capacity” (as measured on a multi-trait multi-method assessment of capacity) have teachers that are more “satisfied” with their jobs, that simply looking at the correlation between site capacity and site average job satisfaction is not ok if you only have 15 sites (and 50 total teachers unequally distributed amongst these sites). This is a real question I’ve been given with the names and faces changed. My instinct is they aren’t analyzing the question they asked and this isn’t right.

Would the use of a Bayesian GLM be an option or am I expecting too much magic here? This isn’t my study, but I hate to go back to someone and say, Hey sorry, you spent 2 years and there is nothing you can do quantitatively here (though I’d much rather say that than allow this correlation to be published).

My quick response is that the model is fine if you’re not data-rich; it’s just that in such a setting the prior distribution is more important. Flat priors will not make sense because they allow the possibility of huge coefficients that are not realistic. My book with Hill is not at all clear on this point, as we pretty much only use flat priors, and we don’t really wrestle with the problems that this can cause. Moving forward, though, I think the right thing to do is to fit multilevel models with informative priors. Setting up these priors isn’t trivial but it’s not impossible either; see for example the bullet points on page 13 of this article for an example in a completely different context. As always, it would be great to have realistic case studies of this sort of thing (in this case, informative priors for multilevel models in analyses of social science interventions) that people can use as templates for their own analyses. We should really include one such example in Advanced Regression and Multilevel Models, the in-preparation second edition of the second half of our book.

Short-term, for your problem now, I recommend the multilevel models with informative priors. I’m guessing it will be a lot easier to do this than you might think.

Poes then replied:

That example came from a real scenario where a prior study actually had unusually high coefficients. It was an intervention designed for professional development of practitioners. In general, most studies of a similar nature have had no or little effect. An effect size of .2 to .5 is pretty common. This particular intervention was not so unusual as to expect much higher effects, but they ended up with effects closer to .8 or so, and the sample was very small (it was a pilot study). They used that evidence as a means to justify a second small study. I suspect there is a great deal more uncertainty in those findings than it appears to the evaluation team, and I suspect if priors from those earlier studies were to be included, the coefficients would be more reasonable. The second study has not yet been completed, but I will be shocked if they see the same large effects.

This is an exaggeration, but to put this large effect into perspective, it would be as if we are suggesting that spending an extra ten minutes a day with hands on supervision of preschool teachers would lead to their students knowing ten more letters by the end of the year. I think you have addressed this before, but I do think people sometimes forget to take a step back from their statistics to consider what those statistics mean in practical terms.

Poes also added:

While we are talking about these studies as if Bayesian analysis would be used, they are in fact all analyzed using frequentist methods. I’m not sure if that was clear.

And then he had one more question:

When selecting past studies to use as informative priors, does the quality of the research matter? I have to imagine the answer is yes. A common argument I hear against looking to past results as evidence for current or future results is that the past research is of insufficient quality. Sample too small, measures too noisy, theory of change ill-thought-out, etc. My guess is that it does matter and those issues all potentially matter, but . . . It seems like that then raises the question, at what point is the quality sufficiently bad to merit exclusion? Based on what criteria? Study rating systems (e.g., CONSORT) exist, but I’m assuming that is not a common part of the process and I would also guess that many of the criteria are unimportant for their use as a prior. I’ve worked on a few study rating tools (including one that is in the process of being published as we speak) and my experience has been that a lot of concessions are made to ensure at least some studies make it through. To go back to my earlier question, I had pointed out that sample size adequacy shouldn’t be based on a fixed number (e.g. at least 100 participants) and maybe not based on the existence of a power analysis, but rather something more nuanced.

This brings me back to my general recommendation that researchers have a “paper trail” to justify their models, including their choice of prior distributions. I have no easy answers here, but, as usual, the default flat prior can cause all sorts of havoc, so I think it’s worth thinking hard about how large you can expect effect sizes to be, and what substantive models correspond to various assumed distributions of effect sizes.
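As a minimal sketch of what an informative prior does here, consider the conjugate normal-normal case. The prior scale of 0.5 and the 0.8 ± 0.4 pilot estimate below are hypothetical numbers, chosen only to echo the discussion above:

```python
def normal_posterior(est, se, prior_mean=0.0, prior_sd=0.5):
    """Conjugate normal-normal update for one coefficient.
    A flat prior corresponds to prior_sd = infinity (posterior = est, se)."""
    prior_prec = 1 / prior_sd ** 2
    data_prec = 1 / se ** 2
    post_mean = (prior_prec * prior_mean + data_prec * est) / (prior_prec + data_prec)
    post_sd = (prior_prec + data_prec) ** -0.5
    return post_mean, post_sd

# a noisy pilot-study effect of 0.8 with standard error 0.4, shrunk by a
# prior that says effects much bigger than half a sd are rare
m, s = normal_posterior(0.8, 0.4)
print(f"flat prior: 0.80 +/- 0.40; informative prior: {m:.2f} +/- {s:.2f}")
```

The informative prior pulls the implausibly large pilot estimate back toward the range of effects seen in comparable studies, which is exactly the behavior that the flat prior fails to deliver.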

**P.S.** Yes, this question comes up a lot! For example, a quick google search reveals:

Multilevel models with only one or two groups (from 2006)

No, you don’t need 20 groups to do a multilevel analysis (from 2007)

Belushi’s demons are a whole lot more interesting than Laplace’s demon. With the latter, you always know what you’re gonna get forever and ever evermore. The former offers heaps of exciting uncertainty, and if you remember the night, you’ll have a hell of a story.

Then I read this comment from J Storrs Hall:

I fear that Laplace would be overly relaxed. Belushi, on the other hand, would be on a mission from God. With a full tank of gas. At midnight. Wearing sunglasses.

And he might even bring a penguin.

Compelling. But I don’t want a penguin in my seminar. A piranha or a kangaroo, sure, those have statistical relevance. But a penguin, no way. So Laplace, the first and greatest applied Bayesian statistician, goes to round 2.

Zbicyclist puts it well:

A man who had no need for God, and a man on a mission from God.

When our pastor was taking a statistics course as part of his MBA, I tried to explain how statistical models of human behavior were less of a violation of the notion of free will than the notion of an omniscient, omnipotent God was. I’d like to hear Laplace’s answer to this one, even if it’s just to sniff at the question.

Today we must choose between two charming show-business figures: Ian McKellen, seeded #2 in the “People whose names end in f” category, versus Ellen DeGeneres, an unseeded TV personality. You can’t go wrong with either one. All I’ve got for you is that Gandalf has a track record of saving people who are about to get eaten by trolls—I’ve been reading The Hobbit and happen to be right in the middle of that scene—and we do sometimes have trolls around here.

Any other thoughts?

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

But sometimes we do have good ideas, quantitative research projects that a high school student could do that would have some interesting statistical content or would shed light on some political or social issue.

If you have any good ideas—projects that would be fun for a high school student, or something quantitative a student could do that could make the world a better place—place them in the comments, and then maybe we could put together a list.

I’m not looking for classroom activities—Deb and I have a whole book about that—I’m looking for ideas for research projects that high school students could do on their own.

Penn & Teller not only create interesting, often politically-relevant, magic. They are also visible skeptics who critique the over-claiming of magicians/mystics/paranormal advocates and they use empirical arguments/demonstrations when they speak to debunk pseudoscience. For those of us who care about such things as the “replication crisis,” creating better science, the acceptance of science, etc., is there a better analogy than to magic?

But then I read this from Daniel:

The question is whether we want a seminar focused on Bullshit! like most seminars, or on a universal truth and beauty: The Beautiful Game. I gotta go with Pele, but given it’s an academic seminar I’m pretty sure we’re going to get the Bullshit!

And the deciding argument from plusplus:

I would really like to hear Pele’s considered thoughts on who really is GOAT — him or Messi. I know, he is on the record about it already, but has been already refuted massively [sic] by video evidence, so what better than confronting a hostile seminar audience and justify his title?

And today we have the second-ranked mathematician of all time (recall that the ranking was done by a statistician; that’s how applied statisticians Laplace and Turing ranked so high) vs. an unseeded, but memorable, eater. Either one would be entertaining. Recall that Laplace anticipated all of behavioral economics, so his talk should attract people from the psychology and econ depts and b-school as well as the usual suspects from math, stat, and physics.

Again, the full bracket is here, and here are the rules:

Part I. Rear-View Mirror

Stan 2.18 Released

Multi-core Processing has Landed!

Multi-Process Parallelism

Map Function

New Built-in Functions

Manuals to HTML

Improved Effective Sample Size

Foreach Loops

Data-qualified Arguments

Bug Fixes and Enhancements

Math Library Enhancements

CmdStan Enhancements

Part II. The Road in Front

GPU Support

GPU Speedup, Cholesky (40+ times)

PDEs, DAEs & Definite Integrals

Tuples (i.e., Product Types)

Ragged Arrays

Lambdas and Function Types

Independent Generated Quants

Adjoint-Jacobian Product Functor

Mass Matrix/Step Size Init

Variadic Functions, not Packing

Part III. The Longer Road

Faster Compile Times

Blockless Stan Language

Blockless Linear Regression

Non-Centered Normal Module

Protocol Buffer I/O

Logging Standards

We have lots more plans, but these are the specific items on the agenda for the Stan language.

By the time this post appears, we can report on what in the above list has already been done, what else has been done, and what’s planned for next steps.

**P.S.** Follow the link and you can watch all the talks from Stancon.

The facial feedback effect refers to the influence of unobtrusive manipulations of facial behavior on emotional outcomes. That manipulations inducing or inhibiting smiling can shape positive affect and evaluations is a staple of undergraduate psychology curricula and supports theories of embodied emotion. Thus, the results of a Registered Replication Report indicating minimal evidence to support the facial feedback effect were widely viewed as cause for concern regarding the reliability of this effect.

But then:

However, it has been suggested that features of the design of the replication studies may have influenced the study results.

So they did their own study:

Relevant to these concerns are experimental facial feedback data collected from over 400 undergraduates over the course of 9 semesters. Circumstances of data collection met several criteria broadly recommended for testing the effect, including limited prior exposure to the facial feedback hypothesis, conditions minimally likely to induce self-focused attention, and the use of moderately funny contemporary cartoons as stimuli.

What did they find?

Results yielded robust evidence in favor of the facial feedback hypothesis. Cartoons that participants evaluated while holding a pen or pencil in their teeth (smiling induction) were rated as funnier than cartoons they evaluated while holding a pen or pencil in their lips (smiling inhibition). The magnitude of the effect overlapped with original reports.

Their conclusion:

Findings demonstrate that the facial feedback effect can be successfully replicated in a classroom setting and are in line with theories of emotional embodiment, according to which internal emotional states and relevant external emotional behaviors exert mutual influence on one another.

Here are the summaries, which, when averaged over all the studies, show a consistent effect:

I’ll leave it to others to look at this paper more carefully and fit it into the literature; I just wanted to share it with you, since we’ve discussed these replication issues of facial feedback experiments before.
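For readers unfamiliar with this kind of averaging: the usual fixed-effect summary is an inverse-variance-weighted mean of the per-study estimates. Here is a sketch with invented numbers (placeholders, not the paper’s data):

```python
def fixed_effect_summary(estimates, std_errors):
    """Inverse-variance-weighted average of per-study effect estimates."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = sum(weights) ** -0.5
    return pooled, pooled_se

# nine invented per-semester effects -- placeholders, not the paper's numbers
estimates = [0.30, 0.10, 0.25, 0.40, 0.15, 0.20, 0.35, 0.10, 0.30]
std_errors = [0.15, 0.20, 0.18, 0.22, 0.15, 0.20, 0.25, 0.18, 0.20]
est, se = fixed_effect_summary(estimates, std_errors)
print(f"pooled effect: {est:.2f} +/- {se:.2f}")
```

Noisier studies get down-weighted, and the pooled standard error is smaller than any single study’s, which is why a set of individually noisy semesters can still add up to a consistent overall effect.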

Best argument in favor of the showman was from Jonathan:

OK. Here’s a Blaine seminar. He delivers the entire lecture locked inside a trunk with 40 minutes of air. He doesn’t get paid (or live) unless he solves the Entscheidungsproblem for a particular program (selected by the seminar participants) running on a Turing machine clacking away in the corner. If he turns this offer down, invite Turing. He can lecture on whatever he wants to.

But I don’t think even Blaine can solve the Entscheidungsproblem, so we’ll have to go with Enigma-man.

Today’s match features the third-ranked magician—sure, Penn and Teller are two people, but since we’re looking for a speaker, they count as one—this does make me think we should’ve had a Mute Person category, including Teller, Helen Keller, Silent Bob, Harpo Marx, Calvin Coolidge, and Robin Williams With Laryngitis, but that’s another story—against an unranked GOAT, in this case the greatest soccer player of all time.

Penn and Teller have been on TV and do live shows a lot—indeed, I saw them perform, ummm, 30 years ago it was?—and this is practically an invitation for Phil to invoke the “I can see them whenever I want” rule—but then again you can find old World Cup highlights on Youtube and watch Pele whenever you want too. So you gotta come up with a better argument than that!

Again, the full bracket is here, and here are the rules:

You blogged about Heckman and the two 1970s preschool studies a year ago here and here.

Apparently there are two papers on a long-term study of Tennessee’s preschool program. In case you had an independent interest in the topic, a summary of the most recent paper is here, and the paywalled paper is here.

The research paper, by Mark Lipsey, Dale Farran, and Kelley Durkin, concludes:

This study of the Tennessee Voluntary Pre-K Program (VPK) is the first randomized control trial of a state pre-k program.

Positive achievement effects at the end of pre-k reversed and began favoring the control children by 2nd and 3rd grade. . .

It’s a long paper and I have not read it in detail so I won’t attempt to evaluate this empirical claim. At first it might sound surprising that preschool did not show positive effects, but the authors in section 6 of their paper do give some reasons why this might be the case.

I cannot believe we are having this conversation. Self-made multi-billionaire philanthropist African American warrior saint v. nerd game writer. Let. me. think. Copies of O per copies of Sci Am? I am looking at your bracket, looks a bit pale. Rhymes with male. Twenty nineteen.

If Gardner is THAT important to you, evidently there are backdoors. That’s your business. But it does not work the other way. Once you see your photo in the newspaper, “Columbia Blog dumps Oprah,” there is no going back.

You cannot geng Oprah Winfrey.

And Phil followed up:

Don’t get me wrong, like all right-thinking people of my generation I loved Gardner’s columns and books. But I picked up a copy of ‘The Night is Large’ about ten years ago, read a few dozen pages, then put it aside and haven’t picked it up since. I’ve read a lot of Gardner, Back In The Day, and I’ve got Gardner right there waiting for me…but I have no idea what Oprah is like.

I’m a little concerned that if Oprah comes to speak at Columbia, maybe Dr. Oz will show up in the audience . . . but I guess that’s a risk we’ll have to take.

And today we have the fourth-seeded mathematician, legendary codebreaker and Bayesian, martyr, the inventor of round-the-house chess, for chrissake!, competing against an unseeded magician, who’s famous for . . . getting really cold one time? Doesn’t seem close to me, but who knows, maybe you can get creative in the comment thread.

Again, the full bracket is here, and here are the rules:

There is a perception that cultural distances are growing, with a particular emphasis on increasing political polarization. . . .

Carter writes:

I am troubled by the inferences in the paper.

The authors state: “We define cultural distance in media consumption between the rich and the poor in a given year by our ability to predict whether an individual is rich or poor based on her media consumption that year. We use an analogous definition for the other three dimensions of culture (consumer behavior, attitudes, and time use) and other group memberships. We use a machine learning approach to determine how predictable group membership is from a set of variables in a given year. In particular, we use an ensemble method that combines predictions from three distinct approaches, namely elastic net, regression tree, and random forest (Mullainathan and Spiess 2017).” And come up with this:

This looks akin to ANOVA or a discriminant analysis. It seems to show (invalidated) predictability, but I can’t get from there to a measurement of trait differences among categories, especially as they include traits that are poorly predictive of category, namely time use. Is there reason to infer that they measure the degree of distinctiveness, or some kind of distance? Or is this analogous to the speaker I once witnessed using p values to rank effectiveness of various therapies?

My reply:

I took a look at the article and I too am unhappy with the indirect form of the analyses there. The questions asked by the authors are interesting, but if they want to study the differences in cultural behaviors of different groups, I’d rather see that comparison done directly, rather than this thing of using the behaviors to predict the group. I can see that the two questions are mathematically connected, but I find it confusing to use these indirect measures. When, in Red State Blue State, we compared the votes of upper- and lower-income people, we just compared their votes directly; we didn’t try to use votes to predict people’s income.
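The “mathematically connected” point is easiest to see in the simplest case: one binary behavior and two equal-sized groups. The best guess-the-group rule then has accuracy 0.5 plus half the direct difference in rates, so predictability and the direct comparison carry the same information. A toy check (the rates 0.7 and 0.4 are made up):

```python
import random

def best_rule_accuracy(p_rich, p_poor, n=100_000, seed=0):
    """Monte-Carlo accuracy of the optimal guess-the-group rule when one
    binary behavior has rate p_rich in one group and p_poor in the other,
    with equal-sized groups."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        rich = rng.random() < 0.5
        behaves = rng.random() < (p_rich if rich else p_poor)
        guess_rich = behaves == (p_rich > p_poor)  # guess the higher-rate group
        correct += guess_rich == rich
    return correct / n

p_rich, p_poor = 0.7, 0.4
acc = best_rule_accuracy(p_rich, p_poor)
print(f"direct difference: {p_rich - p_poor:.2f}; "
      f"accuracy: {acc:.3f} vs predicted {0.5 + abs(p_rich - p_poor) / 2:.3f}")
```

With many behaviors and a machine-learning classifier the mapping is less transparent, which is part of why the indirect measure is harder to interpret than a direct comparison.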

Again, the research conclusions in that paper could be correct, and I assume the authors put their data and code online so anyone can recover these results and then go on and do their own analyses. I just find it hard to say much from this indirect data analysis that’s been done so far.

**P.S.** I also noticed one little thing. The authors write, “We focus on the top and the bottom quartile (as opposed to, say, the top and the bottom half or the top and the bottom decile) to balance a desire to make the rich and the poor as different in their income as possible and the pragmatic need to keep our sample sizes sufficiently large.” That’s right! If you want to make simple comparisons (rather than running a regression), it’s a good move to throw out those middle cases. For more detail, see my paper with David Park, Splitting a predictor at the upper quarter or third and the lower quarter or third, which we wrote because we were doing this sort of comparison in Red State Blue State.
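Here’s a quick simulation of that point (all numbers invented): with a linear relation plus noise, comparing the top and bottom quarters yields a larger mean difference than a median split, at only a modest cost in noise.

```python
import random
import statistics

def split_difference(split_frac, slope=0.3, n=1000, reps=200, seed=1):
    """Mean and sd (across replications) of E[y | top group] - E[y | bottom
    group], where the groups are the top/bottom `split_frac` of x and
    y = slope * x + noise."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        pairs = []
        for _ in range(n):
            x = rng.gauss(0, 1)
            pairs.append((x, slope * x + rng.gauss(0, 1)))
        pairs.sort()  # sort by x
        k = int(n * split_frac)
        low = statistics.mean(y for _, y in pairs[:k])
        high = statistics.mean(y for _, y in pairs[-k:])
        diffs.append(high - low)
    return statistics.mean(diffs), statistics.stdev(diffs)

for frac in (0.50, 0.25):
    d, s = split_difference(frac)
    print(f"top/bottom {frac:.0%}: mean diff {d:.2f}, sd {s:.3f}")
```

Throwing out the middle cases spreads the groups farther apart in x, so the comparison captures more of the underlying slope per unit of sampling noise.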

**P.P.S.** Please round those percentages in the tables. “71.4%,” indeed.

I’m going with Gauss. Ephron would show up in his office, and say, “I’ve got this great idea for a screenplay”; she’d really lay on the charm and work on her sales pitch. After she’d finish, Gauss would go back to his filing cabinet, aimlessly rifle through his least interesting shelf, pull out a sheaf of papers, and casually drop the screenplay for When Harry Met Sally on the desk in front of her. “Not even worth publishing” is how Gauss would think of it.

On the other hand, from Jonathan:

Isn’t Gaussian just a synonym for normal? Who wants a normal speaker? Or even a standard deviant? We need someone significant, and not just one time on 20… we only get one shot.

Manuel writes:

Oh, but Gauss can be a mean speaker, too!

Martha took that as a weak pun, but I looked up Gauss on wikipedia and learned this:

Carl Gauss was an ardent perfectionist and a hard worker . . . Though he did take in a few students, Gauss was known to dislike teaching. It is said that he attended only a single scientific conference . . . Gauss usually declined to present the intuition behind his often very elegant proofs—he preferred them to appear “out of thin air” and erased all traces of how he discovered them. . . .

This does not sound like it would make for a compelling talk. If I wanted hocus-pocus, I’d go with someone in the Magicians category. So instead I’ll go with Bobbie’s reasoning:

Yes, yes, all the other comments are mostly about how brilliant Gauss was. Ephron was brilliant, too. And funny.

More important, Ephron would bring food.

No pressure, Nora. But if you do come, we want some good food.

Today’s matchup is highly competitive, with the top-seeded TV personality lined up against an unseeded magician who is arguably the top science writer of all time. Either one has an essentially unlimited supply of stories and a strong ability to engage the audience.

Who do you want to see?

Again, the full bracket is here, and here are the rules:

I’ve been a long-time reader of your blog, eventually becoming more involved with the “replication crisis” and such (currently, I work with the Brazilian Reproducibility Initiative).

Anyway, as I’m now going deeper into statistics, I feel like I still lack some foundational intuitions (I was trained as a half computer scientist/half experimental neuroscientist). I write to ask a question about something that I think is simple, but I have a hard time wrapping my mind around.

Should we ever correct for multiple comparisons (within a false positive/false negative framework)?

Say, if I collect data Y and make two unplanned estimates of A and B to be significantly different from 0, some people would recommend performing a correction for multiple comparisons (not very worrying in this case, since it’s just two comparisons, but bear with me). But what if this happens in a different time frame? What if I publish A first and only later get to the point in the analysis where I estimate B and then publish it as a separate paper? Should this be corrected? What if I release my dataset and several independent analysts use it to estimate A, B (and C, D, E …)? Wouldn’t that warrant a correction for multiple comparisons?

These scenarios made me think initially that, at the very least, these corrections are inconsistent. But I also find it weird that the number of comparisons performed should change my belief on an assertion – well, it’s not that I find it weird, I think I get the idea of p-hacking, forking paths and researcher degrees of freedom. What I don’t quite get is how this is formalized in terms of probability. I think I reach a similar problem when thinking about preregistration: if I get data Y and estimate a value A from the data, does the uncertainty on my estimate of A decrease if it was preregistered?

Now, after thinking about this, I believe the answer is that I “contaminate” the prior. Whatever expectation I have before, it is assumed to be independent of the data. The data is another source of information. If I use the data to inform my choice during analysis (i.e. the prior, in a broad sense), the prior is no longer independent of the data, I’m using the same information twice, it’s “double dipping”, whereas if I do only the planned analysis, I have two independent sources of information.

If the above is correct, then I still don’t see much sense in correcting for multiple comparisons – the problem is actually the lack of a prior. I believe that’s the message of your “Why you should not correct for multiple comparisons” paper with Hill and Yajima. This would be the case for multiplicity that arises from exploratory analysis: I want the data to show me new possibilities of analysis. I can do that, aware of the “double dipping” and careful with my inferences. This would also be the case for multiplicity that arises simply from having many different possible measures for the same thing. Either all measures are valid, in which case they should agree somewhat (and checking all of them would be what you call a “multiverse analysis”?), or the choice of measure should be justifiable by theory and previous knowledge. Again, the problem would not be using many measures per se (until you find one with p less than 0.05), but the lack of an a priori justification for the choice of measure.

My reply:

That’s right. From the Bayesian point of view, if you estimate a lot of parameters using flat priors, with the goal of comparing those parameters to zero, you’ll get problems.

Why? Consider the frequency properties of this procedure of classical estimates with uncorrected multiple comparisons.

You can do a little simulation. For example, suppose you’re estimating theta_j, for j=1,…,100, and for each theta_j you have an unbiased estimate y_j with standard error 1 (for simplicity). Now suppose the true theta_j’s come from a normal distribution with mean 0 and standard deviation 1. Then your point estimates will have a distribution with mean 0 and standard deviation sqrt(2). So on average you’re overestimating the magnitude of your parameters. But that’s really the least of your problems. If you restrict your attention to the estimates that are statistically significantly different from zero, you’ll be way overestimating these theta_j’s. That’s the type M error problem, and no multiple comparisons correction of statistical significance will fix that.
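A minimal sketch of this simulation (plain Python with NumPy; the specific numbers just mirror the hypothetical setup above, scaled up to many draws so the averages are stable):

```python
import numpy as np

rng = np.random.default_rng(1)
J = 100_000  # many draws of (theta_j, y_j), for stable averages

# True effects theta_j ~ N(0, 1); unbiased estimates y_j with s.e. 1.
theta = rng.normal(0.0, 1.0, size=J)
y = theta + rng.normal(0.0, 1.0, size=J)

# The point estimates have sd sqrt(2), so magnitudes are inflated on average.
sd_estimates = np.std(y)

# Among "statistically significant" estimates (|y_j| > 1.96), the average
# |y_j| is roughly twice the average |theta_j|: the type M error problem.
sig = np.abs(y) > 1.96
exaggeration = np.mean(np.abs(y[sig])) / np.mean(np.abs(theta[sig]))
```

Running this, `sd_estimates` comes out near sqrt(2), and the exaggeration ratio among the significant estimates is around a factor of two, as the argument above predicts.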

But wait, you might say: Flip it around. The above simulation gave these results because we assumed a certain distribution for the true theta_j’s that was concentrated near zero; no surprise that the point estimates end up likely to be much larger in absolute value. But what if that wasn’t the case? What if the underlying theta_j’s were much more spread out, for example coming from a distribution with mean 0 and standard deviation 100? OK, fine—but then it doesn’t seem so plausible that you’re doing a study to compare all these theta_j’s to zero.

More generally, if we’re estimating a lot of parameters (100, in this example), we can *estimate* the population distribution of the theta_j’s from the data. That’s what Francis Tuerlinckx and I discussed in our original paper on Type M and Type S errors, from 2000.

If the number of parameters is small, your inference will depend more strongly on your prior distribution for the hyperparameters that govern the population distribution of theta. That’s just the way it is: if you have less data, you need more prior information, or else you’ll have weaker inferences.

Regarding your question about whether “the number of comparisons performed should change my belief on an assertion”: The key here, I think, is that increasing the number of parameters also supplies new data, and that should be changing your inferences. You write, “What if I publish A first and only later get to the point in the analysis where I estimate B and then publish it as a separate paper? Should this be corrected? What if I release my dataset and several independent analysts use it to estimate A, B (and C, D, E …)? Wouldn’t that warrant a correction for multiple comparisons?” My answer is that new data from others shouldn’t change the *data* that you reported from experiment A, but it should change the *inferences* that we draw. No need to correct your original paper: it is what it is.

Why go with a guy who’s most famous for something he didn’t say? Let’s go with a guy who can give a short, pithy lecture that can blossom into a whole structure of knowledge as we repeat it!

But then I was persuaded by Phil’s list of Voltaire’s admirers. This French philosopher and wit seems to have influenced just about everybody. From wikipedia (as quoted by Phil):

Jorge Luis Borges stated that “not to admire Voltaire is one of the many forms of stupidity” . . . According to Will Durant: “Italy had a Renaissance, and Germany had a Reformation, but France had Voltaire; he was for his country both Renaissance and Reformation, and half the Revolution.”

So let’s go with the Great Tolerator.

Today’s bracket features the top-ranked mathematician, vs. an unseeded, but still very funny, wit. My favorite of Nora Ephron’s works is Heartburn (both book and movie), but her early essays are great too. I think either of today’s choices would be fine, as long as we don’t have to see Nora trying to prove a theorem or Carl complaining about his neck.

Again, the full bracket is here, and here are the rules:

Charles Margossian is the editor of Stan This Month and indeed he wrote all of this:

## Stan this Month

Editor: Charles Margossian

ISSUE 1, JANUARY 2019

Stan this Month is a newsletter aimed at highlighting major developments, discussions, and ideas within the Stan community. It provides an overview and pointers to more detailed references. This month’s issue paints 2018 in broad strokes.

**The Stan governing body**

Our community of users and developers is growing, which means the Stan project needs to scale up. To do so, we formed a Stan governing body (SGB). The SGB’s mission is to oversee all aspects of the Stan project. This includes managing online resources, handling various funding sources, and supervising the organization of Stan conferences. The SGB is also strongly committed to advancing diversity in the Stan community.

More information about the members and the mission of the SGB can be found on https://mc-stan.org/about/.

**Major software developments**

In 2018, we dove deep into the high-performance arena. Some of our biggest new features include:

Within-chain parallelization. The parallel map function allows users to parallelize the computation of distributions and their gradients. The performance scales up (roughly) linearly with the number of cores.

Accelerated compound GLM functions. Allows users to fit generalized linear models more efficiently.

Differential algebraic equation (DAE) solver. Solves systems of equations that couple ordinary differential and algebraic equations. It is one of the many features being developed to support differential-equation-based models.

Other improvements include new features, bug fixes, and drastic improvements to the parser and the language itself. In addition, there is a growing ecosystem of software built around Stan for more specialized applications, such as the popular R packages RStanArm and BRMS, or the Python package Prophet, developed by Facebook for forecasting.

**Applications**

Contributions to the Stan Conferences provide an extensive list of applications and can be found at https://github.com/stan-dev/stancontalks. These contributions span a broad range of topics, including physics, genetics, and the social sciences, with models using hierarchical structures, predictive time series, and partial differential equations.

Articles using Stan can also be found in various venues. Our editorial highlights include a paper on industrial optimization and one on the development of a new method and its application to pharmacometrics:

- Scaling Auctions as Insurance: A Case Study in Infrastructure Procurement (Bolotnyy & Vasserman, 2018) [link]
- Bayesian aggregation of average data: An application in drug development (Weber et al., 2018) [link]

**Theory and method**

Concurrent with software development and applications (and often overlapping with them), a lot of research is done at a more theoretical and methodological level. Papers on these subjects either give users practical guidance on how to tackle difficult problems or lay the foundation for long-term developments. These are our editorial highlights from last year:

- Visualization in Bayesian workflow (Gabry, Simpson, Vehtari, Betancourt, & Gelman, 2018) [link 1]
- Validating Bayesian Inference Algorithms with Simulation-Based Calibration (Talts, Betancourt, Simpson, Vehtari, & Gelman, 2018) [link 1]
- Yes, but Did It Work? Evaluating Variational Inference (Yao, Vehtari, Simpson, & Gelman, 2018b) [link]
- Using stacking to average Bayesian predictive distributions (with discussion) (Yao, Vehtari, Simpson, & Gelman, 2018a) [link]
- Geometric Theory of Higher-Order Automatic Differentiation (Betancourt, 2018) [link 1]

**Events and community**

We have a strong community of users and developers, eager to help one another and dedicated to open science. Discussions on the Stan forum and our GitHub pages are still going strong.

Rather exceptionally, we had two dedicated Stan conferences this year: one in Asilomar, California in January, and one in Helsinki, Finland in August. Each event hosted about 200 participants. Talks were recorded and often accompanied by a notebook with computer code. This material is now freely available online. On top of that, we offered tutorials covering introductory Stan, advanced Bayesian modeling, and how to develop Stan features at the C++ level.

There have been many more events, such as the Stan for Pharmacometrics day in Paris, France; the many events organized by meetup groups in New York, Boston, Berlin, and now in South Korea; and many workshops and courses hosted by various conferences and institutions.

**Pedagogical outreach**

A big part of building Stan and its community goes through the development of pedagogical tools. The development team is actively working on the Stan book, which is based on the Stan User Manual. It is still a work in progress, but the draft looks fantastic.

Michael Betancourt has also been prolific at writing case studies, often available in both R and Python. Our editorial highlight is his case study entitled Towards a Principled Bayesian Workflow [link 1].

**New Year’s resolutions**

It is January, after all. Ultimately, our efforts will continue to be driven by the community. Some suggestions for 2019: get involved in answering the increasing volume of questions on the Stan forum, review more pull requests if you are a developer, diligently report bugs and issues, and, last but not least, continue sharing your work. Tell us what your achievements and challenges are.

Good venues for all of this are the Stan Forum and our GitHub pages. You may also contact the editors of Stan this Month via stan-this-month@mc-stan.org.

**Credit**

The Stan development team

Editor: Charles Margossian

Photo credit (CC-BY licence) for Audience at StanCon Helsinki: Aki Vehtari

Today we have the fourth-seeded wit, the man who will defend to the death your right to say it, competing against the inventor of fractals, in my opinion (see here and here) one of the great mathematicians of the twentieth century, which I say even though various snobby math professors might disagree. I’m sure that proving a longstanding conjecture in number theory is a more impressive technical feat than inventing fractals, but, to me, inventing fractals is more of a big deal and more of a creative contribution.

Anyway, I think either Voltaire or Mandelbrot could give a good speech. What do you think?

Again, the full bracket is here, and here are the rules:

Besides the fact that the paper uses Stan, and it’s about principal stratification, which you just blogged about, I thought you might like it because of its central methodological contribution.

We had been trying to use computer log data to see if the effect of a piece of educational software varied with the way the software was used. We had originally been using student-level sample means of the underlying variables (e.g. proportion of worked sections that the student “mastered,” or the average number of hints a student requested). Eventually (with a slap to the forehead) I realized that all of the apparent effect variation we saw was being driven by students who barely used the software—their sample averages had very small sample sizes (# of sections or problems) and hence large variance. That reminded me of that example from the beginning of BDA about county-level cancer incidence, so to solve it I thought of multilevel modeling. So we ended up nesting a section-level model inside our student model and that’s basically our paper.

And here’s the abstract to the Sales and Pane article:

Mastery learning—the idea that students’ mastery of target skills should govern their advancement through a curriculum—lies at the heart of the Cognitive Tutor, a computer program designed to help teach. This paper uses log data from a large-scale effectiveness trial of the Cognitive Tutor Algebra I curriculum to estimate the role mastery learning plays in the tutor’s effect, using principal stratification. A continuous principal stratification analysis models treatment effect as a function of students’ potential adherence to mastery learning. However, adherence is not observed, but may be measured as a latent variable in an item response model. This paper describes a model for mastery learning in the Cognitive Tutor that includes an item response model in the principal stratification framework, and finds that the treatment effect may in fact decrease with adherence to mastery, or may be nearly unrelated on average.

One of the cool things about statistics, or applied math more generally, is the way in which tools that are developed for one purpose can be useful in so many other settings that have similar mathematical structures.

I sent the above discussion to Avi Feller, who wrote:

I think Adam and John have done some very careful work here, and I’m happy to see it in print.

At the same time, I’ve grown skeptical of using mixture modeling (either explicitly or implicitly) for estimating causal effects in principal stratification models (I’ve written a bit about it, but the damn paper keeps getting rejected!). So while I applaud Adam for his work, I’m less confident that this is a generally applicable strategy. Of course, these are quite challenging questions, and I’m thrilled to see more researchers tackling them!

I have eaten

the money

that was in

the piggybank

which

you were probably

saving

for retirement

Forgive me

it was delicious

so sweet

read my lips

But it’s not clear if this is an endorsement of Bush, for his economic policies, or Williams, for his poetry.

In judging the contest, I’ll go with Jrc:

This is like a mediocre World Cup group stage match between two countries with the combined population of Florida. No one really cares, the quality isn’t real high, and you just sorta root for the team least likely to ever play in a world cup again.

It’s now or never for Bush – it’s not like he’ll crack a Presidents of the United States category. Or Vice-Presidents or Directors of the CIA, for that matter. William Williams can compete again in three years in the double-name category. Seed him somewhere between Boutros Boutros and Fan Bingbing.

So it’s David Cop-a-feel for the win.

Today we have a power matchup. Sedaris, seeded third in the Wits category, is a legendary storyteller—especially if we don’t mind that he changes the details to get the stories to work. Stanislaw Ulam is unseeded in the Mathematicians category—jeez, even the legendary Euler and Erdos are unseeded there—but he wrote a wonderful autobiography and of course would have a lot to say about HMC and Stan.

Again, the full bracket is here, and here are the rules:

**P.S.** Just for laffs, see this post from 2010 which is the earliest mention of autodiff I could find on the blog.

Mel Brooks created Get Smart (along with Buck Henry), which suggests a number of seminar topics of interest to readers of this blog.

“Missed It By That Much: Why Predictive Models Don’t Always Pick the Winner”

“Sorry About That, Chief: Unconscious Researcher Biases”

“I Asked You Not to Tell Me That: How Not to Respond to Replication Failures”

And Jrc has the pithy summary:

Mel Brooks: EGOT

Chris Christie: GTFO

I’d rather see the guy who came up with the line, It’s good to be the king, than the guy who really *was* king—of New Jersey—and all he did with it was hog a beach.

As for today’s matchup . . . G. H. W. Bush is seeded #2 in the Magicians category but not because of any talent at performing magic; he’s just the second-most-famous person in that category. And William Carlos Williams is an unseeded Jerseyite. It’s your choice: you could get stories about the secret service, Iran-Contra, etc., or some modernist poetry. The winner will probably get wiped out in the second round, as he’ll have to face either David Sedaris or Stanislaw Ulam.

Again, the full bracket is here, and here are the rules:

Large-Scale Population-Level Evidence Generation

Objective: Generate evidence on the comparative effectiveness of each pairwise comparison of depression treatments for a set of outcomes of interest.

Rationale: In current practice, most comparative effectiveness questions are answered individually, one study per question. This is problematic because of the slow pace at which evidence is generated; it also invites reporting and publishing only those studies where the result is ‘statistically significant’, leading to an underestimation of the true number of tests performed when correcting for multiple testing. This process is known as publication bias. Moreover, these studies typically do not include the evidence needed to interpret the study results, such as empirical estimates of residual bias inherent to the study design and data used. A solution to these problems is to perform a large set of comparative effectiveness analyses in one study, where each analysis adheres to current best practices. One of these best practices that we’ll follow is to use large-scale propensity models to adjust for confounding. Another is that each analysis will include a large set of negative and positive control outcomes (outcomes that are respectively not known, or known, to be caused by one exposure more than the other). In this study we would like to demonstrate the feasibility of generating population-level estimates at scale by focusing on one disease: depression. We perform every possible pairwise comparison between depression treatments for a large set of outcomes of interest. Most of these outcomes are generic safety outcomes, but some relate more specifically to the effectiveness of antidepressant treatment.

I don’t know anything about depression treatments, so all I have to offer is the general suggestion of analyzing all these comparisons together using a multilevel model, as discussed in this 2011 paper with Hill and Yajima. Then you get all the comparisons automatically.
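As a rough sketch of that suggestion (plain Python rather than a fitted multilevel model; the effect sizes, standard errors, and the empirical-Bayes shrinkage factor are all illustrative assumptions, not anything from this study or the 2011 paper):

```python
import numpy as np

rng = np.random.default_rng(7)
J = 200      # hypothetical number of pairwise treatment comparisons
tau = 0.3    # assumed sd of true effects across comparisons
se = 0.5     # assumed standard error of each raw estimate

theta = rng.normal(0.0, tau, size=J)     # true effects
y = theta + rng.normal(0.0, se, size=J)  # raw per-comparison estimates

# Empirical-Bayes version of the multilevel model: estimate the
# between-comparison variance from the data, then shrink each raw
# estimate toward the grand mean by tau2_hat / (tau2_hat + se^2).
tau2_hat = max(np.var(y) - se**2, 0.0)
shrink = tau2_hat / (tau2_hat + se**2)
theta_pooled = np.mean(y) + shrink * (y - np.mean(y))

# The partially pooled estimates beat the raw ones on average,
# with no per-comparison multiplicity correction needed.
raw_mse = np.mean((y - theta) ** 2)
pooled_mse = np.mean((theta_pooled - theta) ** 2)
```

The point of the sketch is just the structure: estimating all the comparisons together lets the data tell you how much to pool, so `pooled_mse` comes out smaller than `raw_mse`.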

Well, the main problem with Anastasia is … she’s dead. However, we can be relatively certain that 31 or so pretenders would show up in her place. One of them might be Godunov.

Karloff is of course also dead. Yet one has faith that if we were to patch him back together and expose him to a little lightning, he would be good to go. All we’d need would be hooks, and some wire.

Today’s matchup is the #2 seed from New Jersey, a man once jocularly referred to as a possible future Secretary of Transportation, versus an unseeded wit. Who do you want to hear from? Sure, Bridgegate is old news—but the 2000-year-old-man, that’s even older news. Either one would have lots of stories, I’m sure.

Again, the full bracket is here, and here are the rules:

First publicly formulated in January 2013 by Alberto Brandolini, an Italian programmer, the bullshit asymmetry principle (also known as Brandolini’s law) states that:

The amount of energy needed to refute bullshit is an order of magnitude bigger than to produce it.

It became especially popular after a picture of a presentation by Brandolini at XP2014 on May 30, 2014, was posted on Twitter. Brandolini was inspired by reading Daniel Kahneman’s Thinking, Fast and Slow right before watching an Italian political talk show with journalist Marco Travaglio and former Prime Minister Silvio Berlusconi attacking each other. A similar concept, the “mountain of shit theory”, was formulated by the Italian blogger Uriel Fanelli in 2010, stating roughly the same thing.

Brandolini’s law emphasizes the difficulty of debunking bullshit. In contrast, the faster propagation of bullshit is an old proverb: “a lie is halfway round the world before the truth has got its boots on”.

Two questions then arise:

1. Is this principle true? Or, more specifically, when is it true and when is it not?

2. To the extent that the principle is true, where is it coming from? I can think of a couple theories:

a. Asymmetry in standards of evidence: it’s much easier to suggest that something *might* be true than to demonstrate conclusively that it’s *not* the case. For example, consider “cold fusion”: A single experiment with anomalous results got lots of attention, but it took a lot of effort to figure out what went wrong.

b. Ethical asymmetry: The kinds of people who bullshit are more likely to be the kinds of people who misrepresent evidence, avoid correcting their errors, and intimidate dissenters, so at some point the people who *could* shoot down the bullshit might decide it’s not worth the trouble: Why bother fighting bullshit if the bullshitters are going to turn around and personally attack you? From this standpoint, once bullshit becomes “too big to fail,” it can stay around forever.

**P.S.** In comments, Kaiser writes:

I have to speak up for the other side. Brandolini’s Law is false.

Counterexample: it takes mathematicians very little time to shoot down “obviously wrong” claimed proofs of any number of unsolved problems. Many statistical errors are also relatively easy to spot – and surely the researcher spent more time manufacturing the evidence (I’m thinking Wansink).

Secondly, the claim of an order of magnitude difference is absurd. I just proved my first point.

Good point about Wansink. It took him decades to construct a palace of bullshit. Sure, it took effort for the skeptics to reveal the emptiness of this edifice, but the effort of this debunking was still much less than the effort of the original construction.

Anyway, yesterday’s winner is another dark horse. There’s little doubt in my mind that Bobby Fischer, if in a good mood, could give a much more interesting talk than Lance Armstrong, but then there was this argument from Diana:

Zbicyclist gave such a strong argument for him that zchessplayer appeared out of nowhere—a testament to the generative potential of an Armstrong seminar.

You’ll have to read the whole thread to see where she was coming from here.

Also, Lance has his own statistical principle.

Today we have a battle of two people whose names end in f: the original Frankenstein and the original lost princess. Check out Karloff’s wikipedia entry, where you’ll learn a few interesting things: he was related to an English diplomat, he was part-Indian, and he wasn’t actually named Karloff. Anastasia you know all about, I’m sure.

Again, the full bracket is here, and here are the rules: