HMC step size: How does it scale with dimension?

A bunch of us were arguing about how the Hamiltonian Monte Carlo step size should scale with dimension, and so Bob did the Bob thing and just ran an experiment on the computer to figure it out.

Bob writes:

This is for a standard normal, independent in all dimensions. Note the log scale on the x axis. The step size declines, but not nearly as sharply as 1 / sqrt(N), which would put the step size at 1024 dimensions at about 1/10th of what it actually is.

He continues:

HMC isn’t like Metropolis. It’s not just these step sizes, but also the gradient that adjusts the momentum and the fact that momentum behaves like momentum. The gradient used to update momentum keeps the Hamiltonian trajectory running through the typical set even with fairly large step sizes. This is what Betancourt refers to as “flow” in his intro to HMC.
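Bob's "flow" point can be seen in the leapfrog integrator itself: the gradient term repeatedly steers the momentum, so the Hamiltonian is nearly conserved even at step sizes that would sink a random-walk proposal. Here's a minimal one-dimensional sketch (my toy code, not Stan's integrator):

```python
def grad_U(q):
    return q  # standard normal target: U(q) = q**2 / 2, so dU/dq = q

def leapfrog(q, p, eps, n_steps, grad_U):
    """One Hamiltonian trajectory: the gradient keeps redirecting the momentum."""
    p -= 0.5 * eps * grad_U(q)            # initial half step for momentum
    for step in range(n_steps):
        q += eps * p                      # full step for position
        scale = 0.5 * eps if step == n_steps - 1 else eps
        p -= scale * grad_U(q)            # momentum kick from the gradient
    return q, p

q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, eps=0.5, n_steps=20, grad_U=grad_U)
H0 = 0.5 * q0**2 + 0.5 * p0**2            # Hamiltonian before the trajectory
H1 = 0.5 * q1**2 + 0.5 * p1**2            # ... and after: nearly unchanged
```

Even at eps = 0.5, a fairly large step for this target, the energy error stays tiny, which is why the acceptance rate barely suffers.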

Here’s the R code to make the above graph:

library(rstan)
library(ggplot2)

mas <- function(fit) {
  sampler_params <- get_sampler_params(fit, inc_warmup = FALSE)
  sapply(sampler_params, function(chain) chain[1000, 2])  # column 2 is stepsize__
}

model <- stan_model(model_code = "data {  int N; } parameters { vector[N] alpha; } model { alpha ~ normal(0, 1); }")

K <- 11
stepsizes <- rep(0, K)
for (N in 1:K) {
  fit <- sampling(model, data = list(N = 2^(N - 1)), refresh = 0)
  stepsizes[N] <- mean(mas(fit))
}

for (k in 1:K) {
  print(sprintf("N = %5d;  step size = %4.2f",
                2^(k - 1), stepsizes[k]))
}

plot <-
  ggplot(data.frame(dimensionality = 2^(0:(K - 1)),
                    stepsize = stepsizes)) +
  geom_line(aes(x = dimensionality, y = stepsize)) +
  scale_x_log10(name = "standard normal dimensions",
                breaks = c(1, 4, 16, 64, 256, 1024)) +
  scale_y_continuous(name = "adapted step size",
                     limits = c(0, 1.25),
                     breaks=c(0, 0.25, 0.5, 0.75, 1, 1.25))
ggsave("std-norm-stepsizes.png", plot = plot, width = 5, height = 4)
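For what it's worth, the same qualitative question can be probed without Stan. Below is a rough pure-Python sketch (my toy, not Stan's dual-averaging adaptation): run leapfrog trajectories on a d-dimensional standard normal and take the largest step size on a grid whose average Metropolis acceptance stays at or above 0.8, the target Stan's adaptation uses by default (adapt_delta = 0.8). Theory says the tuned step size should shrink like d^(-1/4), much slower than 1/sqrt(d):

```python
import math, random

random.seed(1)

def leapfrog(q, p, eps, L):
    # Gradient of U(q) = sum(q_i^2)/2 for a standard normal is just q.
    p = [pi - 0.5 * eps * qi for pi, qi in zip(p, q)]
    for step in range(L):
        q = [qi + eps * pi for qi, pi in zip(q, p)]
        kick = 0.5 * eps if step == L - 1 else eps
        p = [pi - kick * qi for pi, qi in zip(p, q)]
    return q, p

def avg_accept(d, eps, reps=80, L=10):
    """Average Metropolis acceptance for one trajectory from stationarity."""
    total = 0.0
    for _ in range(reps):
        q = [random.gauss(0, 1) for _ in range(d)]
        p = [random.gauss(0, 1) for _ in range(d)]
        h0 = 0.5 * sum(x * x for x in q + p)
        q1, p1 = leapfrog(q, p, eps, L)
        h1 = 0.5 * sum(x * x for x in q1 + p1)
        total += min(1.0, math.exp(h0 - h1))
    return total / reps

# Largest grid step size keeping average acceptance >= 0.8, per dimension.
tuned = {}
for d in (1, 16, 256):
    grid = [i / 10 for i in range(1, 20)]  # 0.1, 0.2, ..., 1.9
    tuned[d] = max(e for e in grid if avg_accept(d, e) >= 0.8)
```

In runs of this toy, the tuned step size for 256 dimensions comes out a few times smaller than for 1 dimension, not the 16x drop that 1/sqrt(d) scaling would predict, consistent with Bob's plot.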

Boris Karloff (3) vs. Mel Brooks; Riad Sattouf advances

In yesterday’s contest, Dalton asks:

Lance Armstrong isn’t even a GOAT. Did he cheat to get included on the list at the expense of Eddy Merckx?

But then Jrc points out:

Lance isn’t in for Cycling GOAT, he’s in for NGO-bracelet GOAT.

I’m pretty sure he didn’t juice the bracelets. Although now that I think about it… that might explain why they became so popular and after a while people had like 5 or 10 on at a time. So I guess since we already kicked out Erdos, it might be worth keeping him as an insurance policy in case we need some drugs to get through the tournament.

Ultimately, though, I’ll have to go with Daniel’s argument:

If Armstrong moves on – and ultimately wins – we’ll have to nervously wait 15 years to see if he recants his victory seminar.

We can’t have that, so it will have to be Sattouf, even though, as J points out, we can’t expect any sex from him, given that the seminar is in New York.

And next we have a bit of a Young Frankenstein contest. Actor vs. director. A man whose birth name didn’t really end in f vs. a man who’s two thousand years old. It’s your call.

Again, here’s the bracket and here are the rules.

Kevin Lewis has a surefire idea for a project for the high school Science Talent Search

Here’s his idea:

If I were a student, I’d do a study on how Science Talent Search judges are biased. That way, they can’t reject it, otherwise it’s self-confirming.

That’s a great idea! Maybe it’s possible to go meta on this one by adding some sort of game-theoretic model or simulation of talent search submission and judging?

The background was that Lewis wrote this to me:

The Science Talent Search needs more social science. I could only find one such study in this year’s 40 finalists: “Evaluation of Gender Bias in Social Media Using Artificial Intelligence.”

I replied that I’ve actually advised a few local high school students on social science projects for this competition, and one of them made it into the final round, a few years ago! I think it was based on a survey she did; I don’t remember, as it was entirely her idea, and my role was only to supply some feedback. Each year, some high school students come to me asking to be advised on a social science project. I sometimes have an idea that I suggest but usually they do their own thing. It’s surprisingly difficult to come up with good ideas! The best ideas I’ve had involve collecting data, and collecting data is hard. For example, a couple years ago after that article about evictions (later made into a book) appeared in the New Yorker, I suggested to a student that he track down some actual data on evictions: time series on the number of evictions in the U.S., or in some states, or somewhere. But that wasn’t so easy to do. What students want to do is collect a survey or run a regression. I guess that the easiest way to make progress would be to run a computer simulation.

And then Lewis responded with his suggestion given above. We’ll see what happens.

Riad Sattouf (1) vs. Lance Armstrong; Bruce Springsteen advances

Best comment yesterday came from Jan:

Now we have opportunity to see in the next round whether Julia is really that much better than Python!

But that doesn’t resolve anything! So to pick a winner we’ll have to go with Tom:

Python foresaw the replication crisis with their scientific method of proving someone is a witch but I fear that they would have to resort to talking about how they used to be funny. Springsteen could just bring a guitar and start playing – the only difficulty being that if you had something booked afterwards you might be a little late. Hmmm, not a particularly witty comment but still – onwards with Springsteen.

Today it’s the top-seeded person whose name ends in f, versus an unseeded GOAT. Sattouf is hilarious—but Lance does have a statistical principle named after him. So who should advance to the third round?

Again, here’s the bracket and here are the rules.

“News Release from the JAMA Network”

A couple people pointed me to this:

Here’s the Notice of Retraction:

On May 8, 2018, notices of Expression of Concern were published regarding articles published in JAMA and the JAMA Network journals that included Brian Wansink, PhD, as author. At that time, Cornell University was contacted and was requested to conduct an independent evaluation of the articles to determine whether the results are valid.

Cornell University has notified JAMA that based on its investigation they are unable to provide assurances regarding the scientific validity of the 6 studies. Their response states: “We regret that, because we do not have access to the original data, we cannot assure you that the results of these studies are valid.” Therefore, the 6 articles reporting the results of these studies that were published in JAMA, JAMA Internal Medicine, and JAMA Pediatrics are hereby retracted.

Admirable of Cornell University to bite the bullet on this and of JAMA to publicize it.

P.S. More here from Retraction Watch, including a couple of ridiculous quotes by Wansink.

Statmodeling Retro

As many of you know, this blog auto-posts on Twitter. That’s cool. But we also have 15 years of old posts with lots of interesting content and discussion! So I had this idea of setting up another Twitter feed, Statmodeling Retro, that would start with our very first post in 2004 and then go forward, posting one entry every 8 hours until it eventually catches up to the present. So far, this blog has exactly 9000 posts, so it would take a little over 8 years to catch up at this rate. But then if we continue at the current rate we’ll have another 6000 posts or so, which will take another 5 years to appear in the retro feed. Etc. So it will take a while.

Maybe people don’t want to wait that long? We could program Statmodeling Retro to post every 6 hours, but then I’m worried that the frequency would be too high for people to follow.

Whaddya think?

Monty Python vs. Bruce Springsteen (1); Julia Child advances

From Jeff:

If they meet in the semi-final the Japanese dude will eat Frank for lunch: All vs. Nothing at All.

Though it appears she also had a soft spot for hot dogs; if Julia makes it that far, it would be a matchup of gourmet vs. gourmand, which seems a better contest.

Today it’s an unseeded, but very funny, gang of Wits against the top seeded Person from New Jersey. What will it be: Holy Grail or Thunder Road?

Again, here’s the bracket and here are the rules.

Geoff Pullum, the linguist who hates Strunk and White, is speaking at Columbia this Friday afternoon

The title of the talk is Grammar, Writing Style, and Linguistics, and here’s the abstract:

Some critics seem to think that English grammar is just a brief checklist of linguistic table manners that every educated person should already know. Others see grammar as a complex, esoteric, and largely useless discipline replete with technical terms that no ordinary person needs. Which is right? Neither. The handy menu of grammar tips is a myth. Faculty often point to Strunk and White’s The Elements of Style as providing such a list, but its assertions about grammar are often flagrantly false and its rambling remarks on style are largely useless. The truth is that the books on English grammar intended for students or the general public nearly all dispense false claims and bad analyses. Yet grammar can be described in a way that makes sense. I [Pullum] offer some eye-opening facts about Strunk and White, and an antidote, plus brief illustrations of how grammar and style can be tackled in a sensible way drawing on insights from modern linguistics.

The talk is at 707 Hamilton Hall, Fri 22 Feb, 4pm.

The funny thing is, I get what Pullum is saying here, but I still kinda like Strunk and White for what it is.

Julia Child (2) vs. Frank Sinatra (3); Dorothy Parker advances

For yesterday’s contest, Jonathan gave a strong argument:

First New Yorker showdown, just to see who will be taking on Veronica Geng in the finals. All the other contestants are just for show. I’m going with Liebling, because Parker wasn’t even the best New Yorker writer of her generation, being edged out by Benchley. Liebling dominated his era. If it comes down to Liebling vs. Geng, we’ll just exhume Harold Ross and make him pick.

But we’re looking for a talker, not a writer, so I’ll have to go with Dzhaughn:

After the Seance, we were chatting about the inspiration for this tournament. I said I thought Bruno was just a minor intellectual swindler rather than a real threat. Dorothy replied:

I used to think Latour was just something on a Schwinn dealer’s list*, but that was before I saw Julia’s child Oscar wildly strong-arm Lance with an ephronedrine-filled syringe merrily down the Streep, past a sidewalk cafe where the turing Pele and big bejeweled #23, in Brooks’ Brothers suits, were yakking over Smirnoff Martinis, eating a pile of franks, caesar salads, and some weirder dishes. James was on the phone, taking the TV network to hell and back over “letting that degenerate George Karl off the hook” for some remark, when, from behind a bush, sudden as a python, out springs teen-aged Babe D.-Z, among others! That geng didn’t look like they were here to serenade us with arias from Yardbird, that jazz oprah about Parker! No, they were there to revolt–air their own grievances–and when he stood to object, Babe just shoved LeBron and all his LeBling back onto LaPlace where he sat: Oof!

A bit of recursion is usually a good plan.

For today it’s the French Chef vs. the Chairman of the Board. Frank’s got a less screechy voice, but Julia should be able to handle the refreshments. Any thoughts?

Again, here’s the bracket and here are the rules.

My talk today (Tues 19 Feb) 2pm at the University of Southern California

At the Center for Economic and Social Research, Dauterive Hall (VPD), room 110, 635 Downey Way, Los Angeles:

The study of American politics as a window into understanding uncertainty in science

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We begin by discussing recent American elections in the context of political polarization, and we consider similarities and differences with European politics. We then discuss statistical challenges in the measurement of public opinion: inference from opinion polls with declining response rates has much in common with challenges in big-data analytics. From here we move to the recent replication crisis in science, and we argue that Bayesian methods are well suited to resolve some of these problems, if researchers can move away from inappropriate demands for certainty. We illustrate with examples in many different fields of research, our own and others’.

Some background reading:

19 things we learned from the 2016 election (with Julia Azari),
The mythical swing voter (with Sharad Goel, Doug Rivers, and David Rothschild).
The failure of null hypothesis significance testing when studying incremental changes, and what to do about it.
Honesty and transparency are not enough.
The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

The talk will mostly be about statistics, not political science, but it’s good to have a substantive home base when talking about methods.

I believe this study because it is consistent with my existing beliefs.

Kevin Lewis points us to this.

A. J. Liebling vs. Dorothy Parker (2); Steve Martin advances

As Dalton wrote:

On one hand, Serena knows how to handle a racket. But Steve Martin knows how to make a racket with some strings stretched taut over a frame. Are you really gonna bet against the dude who went toe-to-toe with Kermit the Frog in a racket-making duel?

Today we have an unseeded eater vs. the second-seeded wit. That said, Liebling was very witty, at least in his writing. And Parker—I don’t know how she was as an eater, but she sure knew how to drink. As did Liebling. So the two are evenly matched.

Again, here’s the bracket and here are the rules.

R fixed its default histogram bin width!

I remember hist() in R as having horrible defaults, with the histogram bars way too wide. See this discussion:

A key benefit of a histogram is that, as a plot of raw data, it contains the seeds of its own error assessment. Or, to put it another way, the jaggedness of a slightly undersmoothed histogram performs a useful service by visually indicating sampling variability. That’s why, if you look at the histograms in my books and published articles, I just about always use lots of bins.

But somewhere along the way someone fixed it. R’s histogram function now has a reasonable default, with lots of bins. (Just go into R and type hist(rnorm(100)) and you’ll see.)

I’m so happy!
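For the record, the classic culprit behind overly wide bars is Sturges' rule, ceiling(log2(n)) + 1 bins, which grows only logarithmically in the sample size; width-based rules such as Freedman-Diaconis give many more bins on large samples. A quick sketch of the two rules (in Python for self-containedness, with a crude quartile computation, not any package's implementation):

```python
import math, random

random.seed(0)

def sturges_bins(n):
    """Sturges' rule: ceiling(log2(n)) + 1 bins, growing only like log(n)."""
    return math.ceil(math.log2(n)) + 1

def fd_bins(data):
    """Freedman-Diaconis rule: bin width 2 * IQR / n^(1/3), so the bin
    count grows with n (crude order-statistic quartiles for the sketch)."""
    xs = sorted(data)
    n = len(xs)
    iqr = xs[(3 * n) // 4] - xs[n // 4]
    width = 2 * iqr / n ** (1 / 3)
    return math.ceil((xs[-1] - xs[0]) / width)

draws = [random.gauss(0, 1) for _ in range(10000)]
# Sturges gives 15 bins for n = 10,000; Freedman-Diaconis gives dozens,
# which is what you want if jaggedness is to reveal sampling variability.
```

So "lots of bins" is exactly what a width-based rule delivers as n grows, while Sturges barely budges.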

P.S. When searching for my old post on histograms, I found this remark, characterizing the following bar graph:


This graph isn’t horrible—with care, you can pull the numbers off it—but it’s not set up to allow much discovery, either. This kind of graph is a little bit like a car without an engine: you can push it along and it will go where you want, but it won’t take you anywhere on its own.

Update on that study of p-hacking

Ron Berman writes:

I noticed you posted an anonymous email about our working paper on p-hacking and false discovery, but was a bit surprised that it references an early version of the paper.
We addressed the issues mentioned in the post more than two months ago in a version that has been available online since December 2018 (same link as above).

I wanted to send you a few thoughts on the post, with the hope that you will find them interesting and relevant to post on your blog as our reply, with the goal of informing your readers a little more about the paper.

These thoughts are presented in the separate section below.

The more recent analysis applies an MSE-optimal bandwidth selection procedure. Hence, we use a single bandwidth for assessing the presence of a discontinuity at a specific level of confidence in an experiment.

Less importantly, the more recent analysis uses a triangular kernel and linear regression (though we also report a traditional logistic regression analysis result for transparency and robustness).
The results have not changed much, and have partially strengthened.

With regard to the RDD charts, the visual fit indeed might not be great. But we think the fit using the MSE-optimal window width is actually good.

The section below provides more details, and I hope you will find it relevant to post it on your blog.

We also of course would welcome any feedback you may have about the methods we are using in the paper, including the second part of the paper where we attempt to quantify the consequences of p-hacking on false discovery and foregone learning.

I am learning from all the feedback we receive and am constantly working to improve the paper.

More details about blog post:

The comments by the anonymous letter writer are about an old version of the paper, and we have addressed them a few months ago.

Three main concerns were expressed: The choice of six possible discontinuities, the RDD window widths, and the RDD plot showing weak visual evidence of a discontinuity in stopping behavior based on the confidence level the experiment reaches.

1. Six hypotheses

We test six different hypotheses, each positing optional-stopping based on the p-value of the experiment at one of the three commonly used levels of significance/confidence in business and social science (90, 95 and 99%) for both positive and negative effects (3 X 2 = 6).
We view these as six distinct a-priori hypotheses, one each for a specific form of stopping behavior, not six tests of the same hypothesis.

2. RDD window width

The December 2018 version of the paper details an RDD analysis using an MSE-optimal bandwidth linear regression with a triangular kernel.
The results (and implications) haven’t changed dramatically using the more sophisticated approach, which relies on a single window for each RDD.

We fully report all the tables in the paper. This is what the results look like (Table 5 of the paper includes the details of the bandwidth sizes, number of observations etc):

The linear and the bias-corrected linear models use the “sophisticated” MSE-optimal method. We also report a logistic regression analysis with the same MSE-optimal window width for transparency and to show robustness.

All the effects are reported as marginal effects to allow easy comparison.

Not much has changed in results and the main conclusion about p-hacking remains the same: A sizable fraction of experiments exhibit credible evidence of stopping when the A/B test reaches 90% confidence for a positive effect, but not at the other levels of significance typically used and not for negative effects.

3. RDD plots

With respect to the RDD charts, the fit might indeed not look great visually. But what matters for the purpose of causal identification in such a quasi-experiment, in our opinion, is more the evidence of a discontinuity at the point of interest, rather than the overall data fit.

Here is the chart with the MSE-Optimal bandwidth around .895 confidence (presented as 90% to the experimenters) from the paper. Apart from the outlier at .89 confidence, we think the lines track the raw fractions rather well.

I wrote that earlier post in September, hence it was based on an earlier version of the article. It’s good to hear about the update.

Serena Williams vs. Steve Martin (4); The Japanese dude who won the hot dog eating contest advances

We didn’t have much yesterday, so I went with this meta-style comment from Jesse:

I’m pulling for Kobayashi if only because the longer he’s in, the more often Andrew will have to justify describing him vs using his name. The thought of Andrew introducing the speaker as “and now, here’s that Japanese dude who won the hot dog eating contest” sounds awkward enough to prime us all for stress-eating, and who better to give us best practices/techniques?

I agree with Diana that there’s some underdog bias here, as in the real world there’d be no doubt that we’d want to hear Wilde. Indeed, if this were a serious contest we’d just have looked at the 64 names, picked Wilde right away, and ended it. But, for round 3, the Japanese dude who won the hot dog eating contest it is.

And today we have an unseeded GOAT vs. the fourth-seeded magician. Whaddya want, ground strokes or ironic jokes?

Again, here’s the bracket and here are the rules.

“Do you have any recommendations for useful priors when datasets are small?”

Someone who wishes to remain anonymous writes:

I just read your paper with Daniel Simpson and Michael Betancourt, The Prior Can Often Only Be Understood in the Context of the Likelihood, and I find it refreshing to read that “the practical utility of a prior distribution within a given analysis then depends critically on both how it interacts with the assumed probability model for the data in the context of the actual data that are observed.” I also welcome your comment about the importance of “data generating mechanism” because, for me, it is akin to selecting the “appropriate” distribution for a given response. I always make the point to the people I’m working with that we need to consider the clinical, scientific, physical and engineering principles governing the underlying phenomenon that generates the data; e.g., forces are positive quantities, particles are counts, yield is bounded between 0 and 1.

You also talk about the “big data, and small signal revolution.” In industry, however, we face the opposite problem, our datasets are usually quite small. We may have a new product, for which we want to make some claims, and we may have only 4 observations. I do not consider myself a Bayesian, but I do believe that Bayesian methods can be very helpful in industrial situations. I also read your Prior Choice Recommendations but did not find anything specific about small sample sizes. Do you have any recommendations for useful priors when datasets are small?

My quick response is that when sample size is small, or measurements are noisy, or the underlying phenomenon has high variation, then the prior distribution will become more important.

So your question is a good one!

To continue, when priors are important, you’ll have to think harder about what real prior information is available.

One way to do this is . . . and I’m sorry for being so predictable in my answer, but I’ll say it anyway . . . embed your problem in a multilevel model. You have a new product with just four observations. Fine. But this new product is the latest in a stream of products, so create a model of the underlying attributes of interest, given product characteristics and time.

Don’t think of your “prior” for a parameter as some distinct piece of information; think of it as the culmination of a group-level model.

Just like when we do Mister P: We don’t slap down separate priors for the 50 states, we set up a hierarchical model with state-level predictors, and this does the partial pooling more organically. So the choice of priors becomes something more familiar: the choice of predictors in a regression model, along with choices about how to set that predictive model up.
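The partial-pooling arithmetic is easy to see in the simplest normal-normal case: the group-level model plays the role of the prior, and the estimate for the new product is a precision-weighted average of its raw mean and the group-level mean. A sketch with made-up numbers (sigma, mu, and tau here are illustrative assumptions, not recommendations):

```python
# Normal-normal partial pooling: the new product's estimate is pulled
# toward the group-level mean, with weights set by relative precisions.
n = 4               # observations on the new product
sigma = 2.0         # measurement sd, assumed known for the sketch
ybar = 10.0         # raw mean of the 4 observations
mu, tau = 6.0, 1.0  # group-level mean and sd learned from earlier products

prec_data = n / sigma**2    # precision of the raw mean
prec_prior = 1 / tau**2     # precision contributed by the group-level model
post_mean = (prec_data * ybar + prec_prior * mu) / (prec_data + prec_prior)
post_sd = (prec_data + prec_prior) ** -0.5
```

With these numbers the two precisions are equal, so the estimate lands halfway between 10 and 6, at 8: with only four noisy observations, the stream of earlier products does half the work.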

Even with a hierarchical model, you still might want to add priors on hyperparameters, but that’s something we do discuss a bit at that link.

P-hacking in study of “p-hacking”?

Someone who wishes to remain anonymous writes:

This paper [“p-Hacking and False Discovery in A/B Testing,” by Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte] ostensibly provides evidence of “p-hacking” in online experimentation (A/B testing) by looking at the decision to stop experiments right around thresholds for the platform presenting confidence that A beats B (which is just a transformation of the p-value).

It is a regression discontinuity design:

They even cite your paper [that must be this or this — ed.] against higher-order polynomials.

Indeed, the above regression discontinuity fits look pretty bad, as can be seen by imagining the scatterplots without those superimposed curves.

My correspondent continues:

The whole thing has forking paths and multiple comparisons all over it: they consider many different thresholds, then use both linear and quadratic fits with many different window sizes (not selected via standard methods), and then later parts of the paper focus only on the specifications that are the most significant (p less than 0.05, but p greater than 0.1).

Huh? Maybe he means “greater than 0.05, less than 0.1”? Whatever.

Anyway, he continues:

Example table (this is the one that looks best for them, others relegated to appendix):

So maybe an interesting tour of:
– How much optional stopping is there in industry? (Of course there is some.)
– Self-deception, ignorance, and incentive problems for social scientists
– Reasonable methods for regression discontinuity designs.

I’ve not read the paper in detail, so I’ll just repeat that I prefer to avoid the term “p-hacking,” which, to me, implies a purposeful gaming of the system. I prefer the expression “garden of forking paths” which allows for data-dependence in analysis, even without the researchers realizing it.

Also . . . just cos the analysis has statistical flaws, it doesn’t mean that the central claims of the paper in question are false. These could be true statements, even if they don’t quite have good enough data to prove them.

And one other point: There’s nothing at all wrong with data-dependent stopping rules. The problem is all in the use of p-values for making decisions. Use the data-dependent stopping rules, use Bayesian decision theory, and it all works out.
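The danger of mixing data-dependent stopping with p-value thresholds is easy to check by simulation: peeking at a z-test after every batch and stopping at the first "significant" look inflates the false-positive rate well above the nominal 5%, even with zero true effect. A rough sketch (toy numbers, not the paper's setup):

```python
import random

random.seed(7)

def false_positive_rate(n_looks, n_per_look, sims=2000):
    """Share of null experiments (zero true effect, unit-variance data)
    declared significant when a z-test is run after every batch and the
    experiment stops at the first look with |z| > 1.96."""
    hits = 0
    for _ in range(sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            for _ in range(n_per_look):
                total += random.gauss(0, 1)
                n += 1
            if abs(total / n ** 0.5) > 1.96:  # z statistic for mean = 0
                hits += 1
                break
    return hits / sims

peeking = false_positive_rate(n_looks=10, n_per_look=20)   # ~0.2
one_look = false_positive_rate(n_looks=1, n_per_look=200)  # ~0.05
```

Same total sample size in both cases; only the repeated thresholding differs. The fix Andrew points to is not banning early stopping but dropping the threshold-based decision rule.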

P.S. It’s been pointed out to me that the above-linked paper has been updated and improved since when I wrote the above post last September. Not all my comments above apply to the latest version of the paper.

The Japanese dude who won the hot dog eating contest vs. Oscar Wilde (1); Albert Brooks advances

Yesterday I was going to go with this argument from Ethan:

Now I’m morally bound to use the Erdos argument I said no one would see unless he made it to this round.

Andrew will take the speaker out to dinner, prove a theorem, publish it and earn an Erdos number of 1.

But then Jan pulled in with :

If you get Erdos, he will end up staying in your own place for the next n months, and him being dead, well, let’s say it is probably not going to be pleasant.

To be honest, I don’t even think I’d like a live Erdos staying in our apartment: from what I’ve read, the guy sounds a bit irritating, the kind of person who thinks he’s charming—an attribute that I find annoying.

Anyway, who cares about the Erdos number. What I really want is a good Wansink number. Recall what the notorious food researcher wrote:

Facebook, Twitter, Game of Thrones, Starbucks, spinning class . . . time management is tough when there’s so many other shiny alternatives that are more inviting than writing the background section or doing the analyses for a paper.

Yet most of us will never remember what we read or posted on Twitter or Facebook yesterday. In the meantime, this Turkish woman’s resume will always have the five papers below.

Coauthorship is forever. Those of us with a low Wansink number will live forever in the scientific literature.

And today’s match features an unseeded eater vs. the top-seeded wit. Doesn’t seem like much of a contest for a seminar speaker, but . . . let’s see what arguments you come up with!

Again, here’s the bracket and here are the rules.

More on that horrible statistical significance grid

Regarding this horrible Table 4:

Eric Loken writes:

The clear point of your post was that p-values (and, even worse, significance versus non-significance) are a poor summary of data.

The thought I’ve had lately, working with various groups of really smart and thoughtful researchers, is that Table 4 is also a model of their mental space as they think about their research and as they do their initial data analyses. It’s getting much easier to make the case that Table 4 is not acceptable to publish. But I think it’s also true that Table 4 is actually the internal working model for a lot of otherwise smart scientists and researchers. That’s harder to fix!

Good point. As John Carlin and I wrote, we think the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

Book reading at Ann Arbor Meetup on Monday night: Probability and Statistics: a simulation-based introduction

The Talk

I’m going to be previewing the book I’m in the process of writing at the Ann Arbor R meetup on Monday. Here are the details, including the working title:

Probability and Statistics: a simulation-based introduction
Bob Carpenter
Monday, February 18, 2019
Ann Arbor SPARK, 330 East Liberty St, Ann Arbor

I’ve been to a few of their meetings and I really like this meetup group—a lot more statistics depth than you often get (for example, nobody asked me how to shovel their web site into Stan to get business intelligence). There will be a gang (or at least me!) going out for food and drinks afterward.

I’m still not 100% sure about which parts I’m going to talk about, as I’ve already written 100+ pages of it. After some warmup on the basics of Monte Carlo, I’ll probably do a simulation-based demonstration of the central limit theorem, the curse of dimensionality, and some illustration of (anti-)correlation effects on MCMC, as those are nice encapsulated little case studies I can probably get through in an hour.
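The simulation-based CLT demo really is a few-liner; here is the flavor in Python (my illustration, obviously not Bob's actual materials, which are in R): means of n uniform(0, 1) draws concentrate around 1/2 with standard deviation sqrt(1/(12n)).

```python
import random, statistics

random.seed(42)

def sample_means(n, reps=2000):
    """Means of n uniform(0, 1) draws, replicated reps times."""
    return [statistics.fmean(random.random() for _ in range(n))
            for _ in range(reps)]

means = sample_means(n=30)
m = statistics.fmean(means)   # CLT: close to 0.5
s = statistics.stdev(means)   # close to sqrt(1/(12 * 30)) ~ 0.053
```

Plot a histogram of `means` and it already looks bell-shaped at n = 30, which is the whole pitch of a simulation-based introduction: the theorem appears before the algebra does.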

The Repository

I’m writing it all in bookdown and licensing it all open source. I’ll probably try to find a publisher, but I’m only going to do so if I can keep the pdf free.

I just opened up the GitHub repo so anyone can download and build it:

I’m happy to take suggestions, but please don’t start filing issues on typos, grammar, etc.—I haven’t even spell checked it yet, much less passed it by a real copy editor. When there’s a more stable draft, I’ll put up a pdf.