Someone who wishes to remain anonymous writes:

This paper [“p-Hacking and False Discovery in A/B Testing,” by Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte] ostensibly provides evidence of “p-hacking” in online experimentation (A/B testing) by looking at the decision to stop experiments right around thresholds for the platform presenting confidence that A beats B (which is just a transformation of the p-value).

It is a regression discontinuity design:

They even cite your paper [that must be this or this — ed.] against higher-order polynomials.

Indeed, the above regression discontinuity fits look pretty bad, as can be seen by imagining the scatterplots without those superimposed curves.

My correspondent continues:

The whole thing has forking paths and multiple comparisons all over it: they consider many different thresholds, then use both linear and quadratic fits with many different window sizes (not selected via standard methods), and then later parts of the paper focus only on the specifications that are the most significant (p less than 0.05, but p greater than 0.1).

Huh? Maybe he means “greater than 0.05, less than 0.1”? Whatever.

Anyway, he continues:

Example table (this is the one that looks best for them, others relegated to appendix):

So maybe an interesting tour of:

– How much optional stopping is there in industry? (Of course there is some.)

– Self-deception, ignorance, and incentive problems for social scientists

– Reasonable methods for regression discontinuity designs.

I’ve not read the paper in detail, so I’ll just repeat that I prefer to avoid the term “p-hacking,” which, to me, implies a purposeful gaming of the system. I prefer the expression “garden of forking paths” which allows for data-dependence in analysis, even without the researchers realizing it.

Also . . . just cos the analysis has statistical flaws, it doesn’t mean that the central claims of the paper in question are false. These could be true statements, even if they don’t quite have good enough data to prove them.

And one other point: There’s nothing at all wrong with data-dependent stopping rules. The problem is all in the use of p-values for making decisions. Use the data-dependent stopping rules, use Bayesian decision theory, and it all works out.
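To make the p-value half of that point concrete, here is a minimal sketch, not anything taken from the paper. It assumes an A/A setup with a true effect of exactly zero, a known standard deviation of 1, and a two-sided z-test at the nominal 5% level. The function name `false_positive_rates` and the parameters `n_max` and `look_every` are made up for illustration. It compares testing once at the end with testing at every look and stopping at the first "significant" result.

```python
import numpy as np

def false_positive_rates(n_sims=20000, n_max=200, look_every=10, seed=1):
    """Compare false positive rates under the null: one final test vs. repeated peeking."""
    rng = np.random.default_rng(seed)
    # Null world: every observation is pure noise with mean 0 and sd 1.
    y = rng.normal(0.0, 1.0, size=(n_sims, n_max))
    # Sample sizes at which the experimenter looks at the data.
    looks = np.arange(look_every, n_max + 1, look_every)
    # z-statistic at each look: running sum divided by sqrt(n).
    z = np.cumsum(y, axis=1)[:, looks - 1] / np.sqrt(looks)
    reject = np.abs(z) > 1.96
    fixed = reject[:, -1].mean()        # test only once, at n_max
    peeking = reject.any(axis=1).mean() # stop at the first look that crosses the threshold
    return fixed, peeking

if __name__ == "__main__":
    fixed, peeking = false_positive_rates()
    print(f"fixed-n false positive rate: {fixed:.3f}")
    print(f"peeking false positive rate: {peeking:.3f}")
```

Under these assumptions the fixed-n rate should sit near the nominal 5%, while the peeking rate should be several times larger. That is the sense in which the trouble comes from combining p-value thresholds with optional stopping.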

**P.S.** It’s been pointed out to me that the above-linked paper has been updated and improved since I wrote the above post last September. Not all my comments above apply to the latest version of the paper.

I noticed that the paper has garnered a considerable number of downloads. I wonder which paper has been the most downloaded over the past couple of years. I guess there must be a resource that tracks this.

Andrew, “There’s nothing at all wrong with data-dependent stopping rules,” is incorrect.

While it is true that a Bayesian analysis doesn’t suffer from Type I error inflation, it’s not true that “it all woks out”. In the long run, Bayesian analyses that use a data-dependent stopping rule to declare the presence of an effect will have biased effect sizes that are correlated with N. There’s always going to be some bias with a data-dependent stopping rule.

However, it is also the case that some data-dependent stopping rules are much more problematic than others.

Psyoskeptic:

All Bayesian methods with proper priors have “biased effect sizes.” From a Bayesian standpoint, bias is not a problem because it is conditional on the true parameter value, which is never known. A Bayesian method (if the underlying model is true) will be calibrated. Calibration is conditional on the data, not on the unknown true effect size.

You can see this by running a simulation of data collected with a data-dependent stopping rule. If you simulate from the model you’re fitting, the Bayesian inferences will be calibrated. If the model is wrong, you can get miscalibration, but we always have to worry about the model being wrong; that’s another story.
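Here is a minimal sketch of that kind of simulation, with every specific choice being an assumption for illustration: a Normal(0, 1) prior on the effect, unit-variance normal data, and a stopping rule that quits as soon as the 95% posterior interval excludes zero, or at `n_max` observations. The function name `simulate_calibration` is hypothetical. The true effects are drawn from the prior, so the model being fit is the model that generated the data.

```python
import numpy as np

def simulate_calibration(n_sims=20000, n_max=50, seed=0):
    """Coverage of 90% posterior intervals under a data-dependent stopping rule."""
    rng = np.random.default_rng(seed)
    # Draw the true effect from the prior, then data from the model being fit.
    theta = rng.normal(0.0, 1.0, size=n_sims)
    y = rng.normal(theta[:, None], 1.0, size=(n_sims, n_max))
    n = np.arange(1, n_max + 1)
    # Conjugate normal update with prior N(0, 1) and data sd 1:
    # posterior mean = sum(y) / (1 + n), posterior sd = 1 / sqrt(1 + n).
    post_mean = np.cumsum(y, axis=1) / (1.0 + n)
    post_sd = 1.0 / np.sqrt(1.0 + n)
    # Stop the first time the 95% posterior interval excludes zero.
    stop = np.abs(post_mean) > 1.96 * post_sd
    stop[:, -1] = True                  # forced stop at n_max
    idx = stop.argmax(axis=1)           # index of the first stopping opportunity taken
    rows = np.arange(n_sims)
    m, s = post_mean[rows, idx], post_sd[idx]
    # Does the 90% posterior interval at the stopping time contain the true effect?
    covered = np.abs(theta - m) <= 1.645 * s
    return covered.mean(), n[idx]

if __name__ == "__main__":
    coverage, n_stop = simulate_calibration()
    print(f"coverage of 90% intervals: {coverage:.3f}")
    print(f"mean stopping n: {n_stop.mean():.1f}")
```

If the argument above is right, the coverage should come out close to 0.90 even though the sample size depends on the data. Swapping in a data-generating process that differs from the fitted model is the way to see the miscalibration mentioned above.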

Biased effect sizes are a *feature* not a bug. In a Bayesian decision theory you describe the value of different types of errors and then choose the decision that maximizes value. If one kind of error is much better than another, you will choose the decision that tends to bias you towards the “less bad” errors and *that’s a good thing*.
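As a toy illustration of that decision-theoretic step (a sketch under made-up assumptions, not a recipe from the post): suppose the posterior for the effect of shipping variant B is normal, holding back is worth 0, and shipping is worth the effect itself when it is positive but `loss_weight` times the effect when it is negative. The function name `best_decision` and the parameter `loss_weight` are hypothetical.

```python
import numpy as np

def best_decision(post_mean, post_sd, loss_weight=5.0, n_draws=100_000, seed=2):
    """Pick 'ship' or 'hold' by maximizing expected utility over posterior draws."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(post_mean, post_sd, size=n_draws)  # draws from the assumed posterior
    # Utility of shipping: gains count once, losses count loss_weight times.
    u_ship = np.where(theta >= 0, theta, loss_weight * theta).mean()
    u_hold = 0.0                                          # holding changes nothing
    decision = "ship" if u_ship > u_hold else "hold"
    return decision, u_ship

if __name__ == "__main__":
    for w in (1.0, 5.0):
        decision, u_ship = best_decision(post_mean=0.5, post_sd=1.0, loss_weight=w)
        print(f"loss_weight={w}: {decision} (expected utility of shipping {u_ship:.2f})")
```

With symmetric utilities the positive posterior mean favors shipping. Once losses are weighted five times as heavily, the same posterior favors holding. The decision is deliberately tilted toward the cheaper error.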

“it’s not true that “it all woks out””

I haven’t yet figured out a good way to make this into a pun or joke (perhaps involving a stir-fry?), so I’m tossing it out for others to try. Any takers?

PS: A Google search on “It all woks out” got about 357 results

Andrew, it occurred to me that you could feature a weekly or monthly podcast on your blog, maybe as a subscription. I would guess, though, that there are contractual issues, as you use a university email address.

“p less than 0.05, but p greater than *0.01*” matches that table.