Update on that study of p-hacking

Ron Berman writes:

I noticed you posted an anonymous email about our working paper on p-hacking and false discovery, but was a bit surprised that it references an early version of the paper.
We addressed the issues mentioned in the post more than two months ago in a version that has been available online since December 2018 (same link as above).

I wanted to send you a few thoughts on the post, in the hope that you will find them interesting and relevant to post on your blog as our reply, so that your readers learn a little more about the paper.

These thoughts are presented in the separate section below.

The more recent analysis applies an MSE-optimal bandwidth selection procedure. Hence, we use a single bandwidth for assessing the presence of a discontinuity at each specific level of confidence in an experiment.

Less importantly, the more recent analysis uses a triangular kernel and linear regression (though we also report a traditional logistic regression analysis result for transparency and robustness).
The results have not changed much, and in places have strengthened.

With regard to the RDD charts, the visual fit admittedly might not look great, but we think the fit within the MSE-optimal window width is actually good.

The section below provides more details, which I hope you will find worth posting on your blog.

We also of course would welcome any feedback you may have about the methods we are using in the paper, including the second part of the paper where we attempt to quantify the consequences of p-hacking on false discovery and foregone learning.

I am learning from all the feedback we receive and am constantly working to improve the paper.

More details about the blog post:

The comments by the anonymous letter writer are about an old version of the paper, and we addressed them a few months ago.

Three main concerns were expressed: the choice of six possible discontinuities, the RDD window widths, and the RDD plot showing weak visual evidence of a discontinuity in stopping behavior based on the confidence level the experiment reaches.

1. Six hypotheses

We test six different hypotheses, each positing optional stopping based on the p-value of the experiment at one of the three commonly used levels of significance/confidence in business and social science (90%, 95%, and 99%), for both positive and negative effects (3 × 2 = 6).
We view these as six distinct a-priori hypotheses, one each for a specific form of stopping behavior, not six tests of the same hypothesis.

2. RDD window width

The December 2018 version of the paper details an RDD analysis using an MSE-optimal bandwidth linear regression with a triangular kernel.
The results (and implications) haven’t changed dramatically using the more sophisticated approach, which relies on a single window for each RDD.
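For readers who want to see what this looks like in code, here is a minimal sketch in R. It uses the rdrobust package, which implements MSE-optimal bandwidth selection, triangular kernel weights, and bias-corrected local linear estimates; the data frame and variable names are hypothetical, and this illustrates the general approach rather than our exact analysis script.

library(rdrobust)
# Hypothetical data: one row per experiment-day; 'stopped' is a 0/1 indicator of
# stopping that day, 'confidence' is the displayed confidence of a positive effect.
rd <- rdrobust(y = experiments$stopped, x = experiments$confidence,
               c = 0.895,               # cutoff: the 90% confidence display threshold
               p = 1,                   # local linear regression
               kernel = "triangular",   # triangular kernel weights
               bwselect = "mserd")      # a single MSE-optimal bandwidth for the RDD
summary(rd)   # conventional, bias-corrected, and robust estimates of the discontinuity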

We fully report all the tables in the paper. This is what the results look like (Table 5 of the paper includes details of the bandwidth sizes, numbers of observations, etc.):

The linear and the bias-corrected linear models use the “sophisticated” MSE-optimal method. We also report a logistic regression analysis with the same MSE-optimal window width for transparency and to show robustness.

All the effects are reported as marginal effects to allow easy comparison.

Not much has changed in the results, and the main conclusion about p-hacking remains the same: a sizable fraction of experiments exhibit credible evidence of stopping when the A/B test reaches 90% confidence for a positive effect, but not at the other levels of significance typically used and not for negative effects.

3. RDD plots

With respect to the RDD charts, the fit might indeed not look great visually. But what matters for the purpose of causal identification in such a quasi-experiment, in our opinion, is the evidence of a discontinuity at the point of interest rather than the overall fit to the data.

Here is the chart from the paper, with the MSE-optimal bandwidth around .895 confidence (presented as 90% to the experimenters). Apart from the outlier at .89 confidence, we think the lines track the raw fractions rather well.

I wrote that earlier post in September, hence it was based on an earlier version of the article. It’s good to hear about the update.

11 thoughts on “Update on that study of p-hacking”

  1. “Apart from the outlier”?

    What’s the leverage of that outlier? If you remove the outlier, the curve probably flattens and the discontinuity spike at 0.05 gets a lot smaller.

    Isn’t a combination of straight lines, with the possibility of a spike, a better model than two quadratics?

    What’s the theory that generates a curve that is concave upward above 0.05?

  2. I find the fit on Figure 7 disturbing, with the selection of the change point seemingly driven by one or two outliers. It just feels like arbitrary overfitting.

    I decided to see what trend the data supports, so I captured the coordinates of the points from the figure and fitted a generalized additive model (using the mgcv package) with a penalized spline over x. The GAM suggests that there is a nonlinear trend with about 3.5 estimated degrees of freedom. (I wish I could post the figure) The trend rises to a peak at a confidence of about .890, then falls linearly to a trough at about .901, then rises again.

    AIC for the nonlinear fit was -93.6, compared with that for a linear fit of -91.1. I won’t use the “S” word, but the resulting fit overlaid with 95% interval on the scatterplot strikes me as much more persuasive than the two curves in Figure 7.
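    A sketch of that kind of comparison in R with mgcv (with dat standing in for the digitized coordinates, and the smoother settings here being assumptions rather than the exact call):

    library(mgcv)
    # dat holds the (x = confidence, y = fraction stopped) points digitized from Figure 7
    fit_lm  <- lm(y ~ x, data = dat)
    fit_gam <- gam(y ~ s(x), data = dat)           # penalized spline over x
    summary(fit_gam)                               # estimated degrees of freedom of the smooth
    AIC(fit_lm, fit_gam)                           # compare linear vs. penalized-spline fits
    plot(fit_gam, residuals = TRUE, shade = TRUE)  # fitted trend with interval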

    • (1) In RDD, the cutoff point at which the presence of a discontinuity is being tested is set a priori. Hence, the selection of the cutoff cannot be “driven by one or two outliers” or benefit from “arbitrary overfitting”.

      (2) I have never seen a GAM being used for causal effect estimation and testing in a regression discontinuity design (RDD). I expect that, to be applied in an informative manner, one would estimate one GAM to the left and one GAM to the right of the purported discontinuity point, and provide an estimate and standard error (or other form of inference) of the sharp discontinuity at this point. From the description offered here, I am not sure that Clark’s GAM analysis even tested for the presence of a discontinuity at the cutoff point.

      • You are correct that the GAM did not reference the discontinuity point. My goal with the GAM was to see whether the nonlinear trend was consistent with a discontinuity, and it showed no evidence of the presence of a discontinuity in the curve (the discontinuity was at a linear portion of the curve). Basically it supported a local maximum at the left and a local minimum at the right of the discontinuity point, with an approximately linear trend connecting one to the other. To me, this suggests that the real story may lie with the local minimum and maximum in the curve as indicating some extremes of choice, perhaps with the discontinuity point as the fulcrum about which these extremes relate. I’m wondering whether you might fit a linear model, a GAM, and your piecewise discontinuous model and compare the AICs among them — does the piecewise model better explain the data than the alternatives?

        Of course, the existence of the symmetric maximum and minimum about the discontinuity point is arguably evidence supporting the existence of that discontinuity. I’m just not convinced that the separate curves as currently in the figure are justified by the data. I have no particular expertise in RDD, so my focus is on model fitting versus overfitting.

        • (1) In a regression discontinuity design (RDD), one puts forward a hypothesis of a sharp jump/discontinuity at the pre-specified cut-off, and then estimates the size of the jump. The use of a “piecewise discontinuous model” is essential to RDD as a means to identify and measure causal effects in this specific type of quasi-experimental design.

          (2) Fitting the overall data is not the purpose of an RDD analysis. Rather, it is to estimate (and do inference on) the size of the discontinuity. Of course, the model must be reasonable, but overall fit to the data is not the criterion against which the analysis is assessed.

          (3) I believe that it is not surprising that a GAM, which by construction assumes local smoothness, is unable to detect the presence of a sharp discontinuity.

          (4) Your main concern seems to be that the discontinuity we detect does not stem from an upward shift of the same curve (of whatever shape), but from a flip in the sign of the slopes around the cutoff, from negative to positive. This flip is actually quite consistent with p-hacking through optional stopping if experimenters monitor the p-value frequently. Someone prone to p-hacking and seeing a confidence of 89% displayed will think “hey, let me not stop now; a bit more patience and I hit 90%”. Conversely, as the p-hackers terminate their experiments soon after seeing 90% confidence displayed, the pool of remaining experimenters consists increasingly of non-hackers and the hazard of stopping goes down again [the standard “unobserved heterogeneity results in spurious negative duration dependence in hazards” story]. (A rough simulation sketch of this dynamic appears after point (5) below.)

          (5) Thanks for the constructive suggestion to also fit linear and GAM specifications. In an earlier version of the paper, we did estimate and report linear specifications to the left and right of the cut-off point, and ended up with the same substantive conclusions. We did not estimate GAMs to the left and to the right of the cut-off, because using such flexible models in RDD is advised against in the recent literature. See, e.g., this paper by Andrew Gelman and Guido Imbens https://www.tandfonline.com/doi/abs/10.1080/07350015.2017.1366909 .
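          To make the story in point (4) concrete, here is a rough simulation sketch (not from the paper; the 30% share of p-hackers and all other settings are made-up illustration values). Experiments have zero true effect; “hackers” monitor the displayed confidence and stop the first time it reaches 90%, while the others simply run to a fixed horizon:

          set.seed(1)
          n_exp <- 5000; horizon <- 20; batch <- 200
          hacker <- runif(n_exp) < 0.3             # assumed share of monitoring experimenters
          obs <- vector("list", n_exp * horizon)   # preallocate; unused slots stay NULL
          k <- 0
          for (i in seq_len(n_exp)) {
            diff <- 0; n <- 0
            for (t in seq_len(horizon)) {
              diff <- diff + rnorm(1, 0, sqrt(2 * batch))  # zero true lift, unit-variance outcomes
              n <- n + 2 * batch
              conf <- pnorm(diff / sqrt(n))                # displayed confidence of a positive effect
              stop_now <- hacker[i] && conf >= 0.90
              k <- k + 1
              obs[[k]] <- c(conf = conf, stopped = as.numeric(stop_now))
              if (stop_now) break
            }
          }
          d <- as.data.frame(do.call(rbind, obs))      # NULL slots are dropped by rbind
          d$bin <- cut(d$conf, seq(0.80, 1.00, 0.01))
          aggregate(stopped ~ bin, d, mean)   # stopping hazard by bin: zero below .90, a jump
                                              # just above .90, then a decline as hackers exit
                                              # the pool of still-running experiments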

        • Thank you for the patient response. You’ve taught me a bit about regression discontinuity designs. I think that fundamentally the sparsity of data in the scatterplot leaves you with a challenging modeling situation in general, and particularly in the context of demonstrating the discontinuity. Probably fitting discrete lines is too simple, but fitting polynomials feels like overfitting. I wish I had something more constructive to suggest, though if discrete lines make the same point, I’d feel less critical than of the polynomials.

        • (3) I believe that it is not surprising that a GAM, which by construction assumes local smoothness, is unable to detect the presence of a sharp discontinuity.

          The locality of the smoothness is easily altered by increasing the degrees of freedom in the spline:

          library(ggplot2)   # geom_smooth(method = "gam") calls mgcv::gam under the hood
          dat = data.frame(x = 1:20, y = c(rep(1, 10), rep(2, 10)))

          for example:

          ggplot(dat, aes(x, y)) + geom_point() + geom_smooth(method = "gam", formula = y ~ s(x, k = 3))

          ggplot(dat, aes(x, y)) + geom_point() + geom_smooth(method = "gam", formula = y ~ s(x, k = 5))

          ggplot(dat, aes(x, y)) + geom_point() + geom_smooth(method = "gam", formula = y ~ s(x, k = 7))

          (1) In a regression discontinuity design (RDD), one puts forward a hypothesis of a sharp jump/discontinuity at the pre-specified cut-off, and then estimates the size of the jump.

          More generally, one should hypothesize a *change in behavior before vs. after*, and there’s no reason it has to be a sharp jump in value. For example, you could have a linear trend before and a linear trend after with a different slope, a constant before and an oscillation after, a constant before and a decay after, or an oscillation before and an increasing-amplitude oscillation after… whatever.

          Fitting two separate curves… one before, and one after… is destined to find a change, because fits are less constrained in their behavior at the *edges* of the fit interval, and they inherently fail to use information from the other side of the interval (which, if you are using null hypothesis testing, violates the assumption of your null hypothesis, namely that the curve is a single thing).

          If you really believe in a jump before vs. after, you should fit *one curve* with a step function added to the basis:

          y ~ a + b*x + c*x^2 + d * as.numeric(x > 0)

          or some such thing.
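
          In R, with a hypothetical data frame dat whose x is centered at the cutoff, that single-curve-plus-step fit could be sketched as:

          fit <- lm(y ~ x + I(x^2) + I(x > 0), data = dat)   # one curve plus a step at the cutoff
          summary(fit)   # the coefficient on I(x > 0)TRUE estimates the size of the jump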

    • That isn’t really an “outlier” in the usual sense of being a single data point. The points here are a poor man’s nonparametric regression (conditional means within bins), so that point is actually an average of many points…

  3. This seems like an attempt to claim a model describes reality by comparing it to the same data used to develop it? They need to compare the model to new data and see how it performs.

  4. “We view these as six distinct a-priori hypotheses, one each for a specific form of stopping behavior, not six tests of the same hypothesis.”
    This seems like just a way to declare that no control of familywise error rates is needed, but without really any justification. Under the null (and if these tests were orthogonal…) this analysis would declare there is optional stopping over 25% of the time at the 5% level. And then we’d get a paper like this but focusing on a different one of the cutoffs that happened to come in with p slightly less than 0.05.
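
    For reference, a quick check of that figure under the global null, assuming six independent tests at the 5% level:

    1 - 0.95^6   # ≈ 0.265, the chance of at least one nominally “significant” result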
