“The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research”

Posted on April 30, 2017 9:37 AM by Andrew

Valentin Amrhein, Fränzi Korner-Nievergelt, and Tobias Roth write:

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process. We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. 
Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as ‘we need more stringent decision rules’, ‘sample sizes will decrease’ or ‘we need to get rid of p-values’.
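
The two replication figures quoted in the abstract follow from simple arithmetic, if one assumes that the original study and the replication are independent and have the same power. That independence assumption, and the function name below, are illustrative and not taken from the abstract. A minimal R sketch:

```r
# Outcome probabilities for two independent studies with the same power,
# given that a true effect exists (assumption: independence, equal power).
replication_probs <- function(power) {
  c(
    both_significant = power^2,                  # both studies reach p <= 0.05
    conflicting      = 2 * power * (1 - power),  # exactly one study is significant
    both_nonsig      = (1 - power)^2             # neither study is significant
  )
}

replication_probs(0.40)  # both_significant is 0.16, roughly one in six
replication_probs(0.80)  # conflicting is 0.32, roughly one third
```

The three probabilities sum to one for any power, so the "conflicting" share is largest when power is 50%.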

The general ideas should be familiar to regular readers of this blog.

63 thoughts on ““The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research””

Tom Passin on April 30, 2017 10:17 AM at 10:17 am said:

“This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research.”

This seems to mean that perhaps some non-replicated research was actually replicated, or would have been if analyzed properly. But I don’t think that this lessens a “crisis of unreplicable research” very much, because that research is in fact being analyzed in those (“improper”) ways. The best you can say is that, if correct, these sentences shift the focus from “science” possibly being unreliable to the current use of certain statistical techniques and inference methods.

That’s appropriate, but most of us reading this blog probably didn’t have real doubts about “science” itself.

Daniel Lakens on April 30, 2017 10:33 AM at 10:33 am said:

I just reviewed this for PeerJ – curious to hear what readers here thought. I didn’t find the manuscript particularly well structured. And it contained nothing new (while often lacking nuance and a complete description of relevant information). But maybe I’ve just read way too many of these articles (there are literally hundreds of them in the literature now) to become enthusiastic about this one?

- Carol on April 30, 2017 1:52 PM at 1:52 pm said:
  
  Hi Daniel (Lakens),
  
  I just linked to the article, but I don’t see your review. How does one access that? (This may be a dumb question reflecting my lack of familiarity with PeerJ. If so, my apologies!)
  
  Carol
  
  - Jordan Anaya on April 30, 2017 2:26 PM at 2:26 pm said:
    
    Carol: The link is to a preprint on PeerJ, which has been submitted to the journal PeerJ. PeerJ does allow for open reviews, but those are only available once the article has been accepted and published.
    
    I suppose there is nothing stopping Lakens from posting his review publicly on his blog or in the feedback section of the preprint (there has been recent debate about who owns the copyright of reviews, but unless you explicitly sign over your copyright it seems you can do whatever you want with your review, although the journal may get very angry).
    
    c.f. https://fossilsandshit.com/an-update-to-the-elsevier-thing/
    
    - Carol on April 30, 2017 2:48 PM at 2:48 pm said:
      
      Hi Jordan,
      
      Thank you for the explanation.
      
      Carol
Glen M. Sizemore on April 30, 2017 10:48 AM at 10:48 am said:

“Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis.”

GS: Doesn’t this imply that the “strength of evidence” is somehow quantitative? How so? They (p-values) are quantitatively meaningful only if the null is true. But what could it mean to say that the p-value has any quantitative meaning if one declares the null to be false? So…a p-value is quantitatively meaningful when you use it as evidence the null is false, and then it loses its quantitative meaning once the null is declared false?

- Christian Hennig on April 30, 2017 5:10 PM at 5:10 pm said:
  
  “They (p-values) are quantitatively meaningful only if the null is true.”
  p-values measure to what extent the data are consistent with the null hypothesis. This is computed based on the null hypothesis but doesn’t rely at all on the null hypothesis being true (which it isn’t anyway because “all models are wrong”).
  
  - Glen M. Sizemore on April 30, 2017 6:35 PM at 6:35 pm said:
    
    I was under the impression that a p-value was the probability of obtaining data as extreme or more extreme than what you have in hand, given that the null is true. Have I been misled?
    
    - Christian Hennig on April 30, 2017 9:21 PM at 9:21 pm said:
      
      It is correct that the p-value is computed assuming that the null is true, so you’re right. But this is meaningful as a characterisation of the data in relation to the null hypothesis regardless of whether the null is indeed true in reality.
      The null may be wrong but still a p-value of 0.664 tells you that the data are well compatible with the null and the data cannot be used as an argument against it.
    - Glen M. Sizemore on April 30, 2017 11:11 PM at 11:11 pm said:
      
      “It is correct that the p-value is computed assuming that the null is true, so you’re right. But this is meaningful as a characterisation of the data in relation to the null hypothesis regardless of whether the null is indeed true in reality.
      The null may be wrong but still a p-value of 0.664 tells you that the data are well compatible with the null and the data cannot be used as an argument against it.”
      
      GS: So…if it was 0.01 it *could* be used as an argument against the null? This sounds exactly like asserting that a p-value=p(Ho|Data) when it is, in fact, p(Data|Ho).
    - Christian Hennig on May 1, 2017 8:44 AM at 8:44 am said:
      
      “So…if it was 0.01 it *could* be used as an argument against the null? This sounds exactly like asserting that a p-value=p(Ho|Data) when it is, in fact, p(Data|Ho).”
      I don’t believe that models can be true, so to me P(H0)=P(H0|Data) will always be zero. This is very different from stating that the data are compatible with the H0. A model is a tool for thinking and communicating, not for being “true”. However, p=0.01 can be an indication to think in other ways than those embodied by H0. Assuming of course that your test statistic measures something that is relevant to you and that p=0.01 isn’t a result of some hacking operation that will give you p<0.01 with a probability of 35% or so under H0.
    - Glen M. Sizemore on May 2, 2017 12:45 PM at 12:45 pm said:
      
      Oh well…no matter how much I read what you write on this topic, it sounds like double-talk.
Jonathan on April 30, 2017 11:25 AM at 11:25 am said:

The earth is flat if you view it from your general ground level. And you can replicate that as often as you want. But the result doesn’t hold if you change perspective. It’s not frame invariant. It’s a result that literally maps to a space that’s part of a space which contradicts the result.

I wonder if ‘the earth is flat’ is a separate case from noise. I mean: lots of work looks at stuff with transient effects, but those tend not to be repeatable over time in the same population – e.g., the studies that say x pose or x picture stimulates this result may have an effect on first exposure but the effect fades with exposure to that or to similar things (like a cat thinks its reflection is a cat until it realizes it isn’t). Is that the same as knowing the earth isn’t flat but being able to see that it looks flat every time you go outside (unless you work on a plane)? Categorization is hard.

- Martha (Smith) on April 30, 2017 11:44 AM at 11:44 am said:
  
  “Categorization is hard.”
  
  I’d say: Categorization often leads us astray from reality.
  
- Tom Passin on April 30, 2017 12:20 PM at 12:20 pm said:
  
  “The earth is flat” is a fairly good test case for thinking about some of these things. There are many methods for investigating whether the earth is flat, and we can examine one or another of them to see how they would fare under statistical analysis.
  
  For example, one potential method for testing the flatness is to measure the sum of the interior angles of a triangle laid out on the ground. If flat, the sum will be exactly 180 deg, otherwise it will differ. There would be many practical experimental matters to work out, but after they are handled, a measurement would give some number that almost certainly would be different from exactly 180.000 deg. Would that indicate that the earth truly was not flat?
  
  The best way to look at it would probably be that the uncertainty in the test data would reflect an uncertainty in, or a range for, the earth’s radius. Say we found that range to be 0 – 8000 miles. Even that result (0 at the end) would depend on an assumption that the form of the earth is roughly spherical instead of, say, saddle shaped or irregular. Sticking with the quasi-spherical shape, what would it take to move the lower bound well off of 0, if that indeed were true?
  
  In this kind of case, it is relatively easy to figure out what to do in the way of improving the experiment. In other cases, such as ones that often come up on this blog, it’s not so clear cut. Maybe it would be a good thing to test out ideas for better analysis on thought experiments like the earth’s radius.
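
  As a rough sketch of the thought experiment above: on a sphere of radius R, Girard's theorem says the angle sum of a triangle exceeds 180 degrees by (area / R^2) radians. The function name, the 100-mile triangle, and the candidate radii below are illustrative assumptions, and the triangle's area is approximated by its flat-plane value.

  ```r
  # Angle-sum excess (in degrees) of an equilateral triangle with the given
  # side length laid out on a sphere of the given radius.
  # Girard's theorem: excess in radians = area / radius^2.
  # The area uses the flat-plane formula, a reasonable approximation
  # when the side is much smaller than the radius.
  angle_excess_deg <- function(side, radius) {
    area <- sqrt(3) / 4 * side^2
    (area / radius^2) * 180 / pi
  }

  radii <- c(2000, 4000, 8000, Inf)  # candidate radii in miles; Inf means flat
  angle_excess_deg(side = 100, radius = radii)
  ```

  The excess falls with the square of the radius and is exactly zero only for a flat surface, so the measurement error in the angles determines which radii remain compatible with the data.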
  
Anoneuoid on April 30, 2017 1:00 PM at 1:00 pm said:

This seems like old news, it reminded me of:
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129. https://psycnet.apa.org/journals/met/1/2/115.pdf

Now after checking the paper I see they cite Schmidt (1996) multiple times. Anyway, I’ve noticed that a lot of these old papers informing about NHST used to be available as pdfs after a search, but are now being put behind paywalls. So I guess it is good for people to produce new freely available ones on the topic, even if it is largely retreading well-trod ground.

Michael on April 30, 2017 1:28 PM at 1:28 pm said:

“Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval.”

So they must also recommend against fixed confidence interval thresholds?

Matt on April 30, 2017 4:20 PM at 4:20 pm said:

Why is there no “p-hacking” equivalent in Bayesian analyses? I get that you are not using a strict cutoff to declare whether or not “there is an effect”, but can’t it also be the case in Bayesian work that you can just keep fiddling with your likelihood/prior until you get a posterior that you “like”? Coming from economics, my biggest worry is the specification searching that I know must go on under the hood in most empirical work, especially structural work which is very assumption-heavy – when trying to interpret estimates in a causal/structural manner. My issue with Bayesian work is kind of the same issue I have with structural work in economics – namely, that you need to specify the DGP. What makes anyone here think that if Bayesian methods were to become mainstream, researchers wouldn’t just keep fiddling with the model til they get what they want? To this you can say you need to show that your posterior is not so sensitive to seemingly innocuous assumptions (and I’m sure the people on this blog would be able to tell when something is iffy). But right now in structural work in economics also this should be the case (i.e. researchers should show their results are somewhat robust to arbitrary modelling assumptions to make it believable), but it’s certainly not: economists doing structural work hide a ton of what actually happened in the analysis. Why should I expect this to be any different in Bayesian work? Can it not be “gamed” just as easily as current statistical methods?

- Andrew on April 30, 2017 4:36 PM at 4:36 pm said:
  
  Matt:
  
  Any analysis, Bayesian or otherwise, is conditional on the model and the data. “Hacking” is particularly an issue with null hypothesis significance testing because a p-value is literally a statement about the analyses that would have been done, had the data been different. So if someone is reporting a p-value, he or she is, explicitly or implicitly, making a statement about what analysis he or she would’ve done, given any possible data that could’ve arisen. Thus, the garden of forking paths is a concern, even if only one analysis was ever done or considered for the particular dataset that was observed. So that’s even worse than the usual concerns about gaming the system.
  
  There’s also the practical issue that some methods (multilevel Bayesian analyses, “big-data” machine learning methods, etc.) are suited to including lots of data and auto-tuning to resolve potential problems with overfitting; whereas other methods (least squares, classical survey weighting, permutation tests, etc.) can have big difficulties trying to incorporate diverse information and, as a result, require lots of hand tuning and do-I-include-this-variable-or-not choices. So it’s not that Bayesian inference, random forests, etc., are immune to gaming, but I do think they can be less subject to gaming, as compared to traditional methods such as least squares which rely much more on data inspection and trial and error to decide what features to include in the model.
  
- Daniel Lakeland on April 30, 2017 6:58 PM at 6:58 pm said:
  
  The “right” way to do a data analysis is to argue why it is that your specific model is the right family of models to consider, why it is that you think the parameters should be in the ranges of the high probability intervals of your prior, and then fit the Bayesian model and give summaries of the important parameters.
  
  Anything less, in an ideal world, ought to be rejected as too vague and incomplete.
  
  In that context you the reader have two options:
  
  1) Agree with the researchers that the family of models they’re using and the ranges of parameters are appropriate, and then accept their conclusions.
  
  2) Disagree about the appropriateness of the model, request the data, and re-analyze with a different model, preferably one in which you take the researcher’s model and your own model, incorporate them into a single mixture model, and then look at the posterior probability. The resulting analysis will tell you whether one or the other model is definitely more favored by the data. But this is sometimes quite tricky to do right. And also, open data? Yeah, right. In reality you hardly even get an email back from researchers acknowledging your existence.
  
  So, from the perspective of Bayes giving you the math you need to compare models within a super-family, it’s the right way to go. From the perspective of actual science as is done today… well, you have to fix the social problems before you can really make progress…
  
  - Matt on April 30, 2017 10:20 PM at 10:20 pm said:
    
    Daniel, or Andrew, this is perhaps a dumb question…but are the standard errors obtained in frequentist analyses equally problematic, as far as the garden of forking paths goes? I assume they are… given that there is a one-to-one mapping from estimates/standard errors to a CI.
    
    And thank you for the response Daniel. I’d be interested to hear your thoughts on the paper Ed posted below.
    
    - Andrew on April 30, 2017 11:12 PM at 11:12 pm said:
      
      Matt:
      
      Regarding standard errors: it depends what they’re being used for. A standard error is an estimate of the standard deviation of a point estimate. My impression is that published analyses are typically optimized to get statistical significance, not for standard errors. So, as far as the standard errors are concerned, the forking paths are chosen almost at random, hence I would not expect any major systematic problems here. One could imagine a researcher doing standard-error-hacking, trying different analyses in order to get the standard error as low as possible, but I don’t have a sense that this is a thing.
    - Cliff AB on April 30, 2017 11:54 PM at 11:54 pm said:
      
      “My impression is that published analyses are typically optimized to get statistical significance, not for standard errors. So, as far as the standard errors are concerned, the forking paths are chosen almost at random, hence I would not expect any major systematic problems here.”
      
      That last statement seems odd; it’s true things are more likely to be optimized for statistical significance rather than standard errors. But standard errors are definitely not independent of statistical significance; in a null (or, more realistically, near-null) hypothesis world, standard errors are very heavily negatively correlated with statistical significance. Especially given the fact that standard errors are much less stable than the actual point estimates of effects themselves.
      
      In fact, we can see that if we sample a whole bunch of N(0,1) pairs and save the pairs that are statistically significant in a t-test, we will see only a minor upward bias in the absolute means (in my simulation, the absolute means were about 1.2x the true absolute mean of a N(0,1)), but an enormous downward bias in the standard deviations (mean is about 0.05x the true standard deviation).
      
      So, at least in this case, cherry picking p-values leads to a substantially bigger bias in the standard error than in the actual point estimates themselves.
      
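      R code:
      
      # Draw n standard-normal values, run a one-sample t-test,
      # and attach the sample standard deviation to the result.
      sim_t_test <- function(n = 2) {
        vals <- rnorm(n)
        ans <- t.test(vals)
        ans$sd <- sd(vals)
        ans
      }
      
      MC <- 2000
      fp_list <- list()
      count <- 0
      
      set.seed(1)
      
      # Keep only the "significant" results (false positives, since the true mean is 0).
      for (i in seq_len(MC)) {
        t_test_res <- sim_t_test()
        if (t_test_res$p.value < 0.05) {
          count <- count + 1
          fp_list[[count]] <- t_test_res
        }
      }
      
      # Extract the estimated means and sample SDs of the retained results.
      mus <- vapply(fp_list, function(x) unname(x$estimate), numeric(1))
      sds <- vapply(fp_list, function(x) x$sd, numeric(1))
      
      # Compare |mean| among significant results with |x| for x ~ N(0, 1).
      summary(abs(mus))
      summary(abs(rnorm(10000)))
      
      # Sample SDs among significant results (the true SD is 1).
      summary(sds)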
    - Andrew on April 30, 2017 11:58 PM at 11:58 pm said:
      
      Cliff:
      
      I know what you’re saying but I don’t think things usually go that way. I think people optimize for finding large effects, not so much for small standard errors.
    - Cliff AB on May 1, 2017 12:20 AM at 12:20 am said:
      
      Andrew:
      
      I’m sure it depends on the field. If you’re doing “pure” p-hacking (haha, I find that phrase amusing), I say you should get a huge bias in the standard error…but yes, if you’re looking at p-hacking + effect size hacking (i.e. in my simulation, only look at pairs that are statistically significant + mean greater than, say, 0.5), that should somewhat downplay the bias of standard error. I still believe it would be substantial.
      
      My experience working in the world of biology is that I think that’s a field where “pure” p-hacking is probably more common than p-hacking + effect size hacking, as we are often looking at surrogate measures. I would (very sadly) guess that many grants are approved because a certain sample had an abnormally low standard deviation. Of course, with small samples this becomes necessary even if the effect may be real and substantial.
    - Cliff AB on May 1, 2017 12:05 AM at 12:05 am said:
      
      To clarify, by “absolute mean” I actually meant “mean of the absolute value”.
      
      I should have just used a one sided test to avoid that whole issue…
    - Daniel Lakeland on May 1, 2017 2:51 PM at 2:51 pm said:
      
      Matt: See my Comment below on the paper Ed posted.
- Ed Hagen on April 30, 2017 9:41 PM at 9:41 pm said:
  
  Posterior-Hacking: Selective Reporting Invalidates Bayesian Results Also:
  
  https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2374040
  
  - Daniel Lakeland on May 1, 2017 2:50 PM at 2:50 pm said:
    
    There are problems with that paper. Some of them include
    
    1) The assumption of flat priors being the right way to do things. I personally think all analyses should have some information encoded in the prior.
    
    2) The assumption that the Bayesian model will have the same basic “Data generating process”. One of the main reasons to use Bayesian models is that they provide a mechanism whereby you could do a better job of describing mechanistic descriptions of how data arise. The concept in a Bayesian model is NO LONGER “random sampling” but rather “various plausible deviations from predictions”
    
    What I think is absolutely correct though is that if you “hack” the data, that is, filter it and pretend that you didn’t do this, you will invalidate your analysis. This becomes clear when you think of the description of the data generating process p(Data | Parameters). If you generate some super-dataset Data* and then fiddle with it by filtering it and so forth to produce Data, and you don’t include the effects of this filtration process in your p(Data | Parameters), you are in essence lying about what happened.
    
    - Keith O'Rourke on May 1, 2017 3:17 PM at 3:17 pm said:
      
      One way to view the paper is that it shows the best way to game a Bayesian analysis is to make the credible intervals match the confidence intervals from a frequentist analysis or other inference outputs.
      
      Unfortunately – often a default Bayesian analysis does exactly that and, doubly unfortunately, some interpret that (Bayes matching the frequency) as reassuring when it should be alarming.
      
      I think the paper does get that across – but this sentence makes no sense to me “When computing Bayes factors we don’t set priors and we don’t compute posteriors” though I guess they are thinking of just a point prior for the alternative.
    - Daniel Lakeland on May 1, 2017 4:42 PM at 4:42 pm said:
      
      True fact: if you use your newfound flexible mathematical and philosophical tool to keep doing what you used to be doing, which is a tiny special case of what your new tool can do, you don’t “get something for nothing”
      
      This is true whether your new tool is Bayesian mathematical modeling or a CNC mill/lathe/3D printer or a digital storage oscilloscope or….
      
      The value in the new tool comes from its additional abilities.
    - Keith O'Rourke on May 2, 2017 8:23 AM at 8:23 am said:
      
      > The value in the new tool comes from its additional abilities.
      Exactly!
Carol on April 30, 2017 5:17 PM at 5:17 pm said:

I am finding the article a bit difficult to understand, perhaps because the authors seem not to be native speakers of English. What does “P values are hardly replicable” mean? Not easily replicable? Not replicable?

- Alex Gamma on April 30, 2017 6:20 PM at 6:20 pm said:
  
  They’re Swiss. We try to do our best. But I don’t understand the sentence either ;)
  
  - Carol on April 30, 2017 6:37 PM at 6:37 pm said:
    
    Thank you, Alex. I think that they must mean that a p value is unlikely to be duplicated.
    
- A.P. Salverda on April 30, 2017 6:33 PM at 6:33 pm said:
  
  See Halsey et al. (2015), “The fickle p-value generates irreproducible results”:
  
  https://www.nature.com/nmeth/journal/v12/n3/pdf/nmeth.3288.pdf
  
  - Carol on April 30, 2017 6:39 PM at 6:39 pm said:
    
    Thank you, A.P. Salverda.
    
- Cliff AB on April 30, 2017 6:36 PM at 6:36 pm said:
  
  Carol:
  
  I think “P values are hardly replicable” means if you know the p-value from a given study, you have poor ability to predict it in the next study, even if it is carried out in the exact same manner.
  
  This shouldn’t be surprising to statisticians (although it’s actually much more nuanced than that statement suggests), but this can be surprising to other applied researchers. I’ve had non-statisticians ask me “Suppose I read a paper, it tests some effect and the paper reports p = 0.03. I try to replicate the study, but I get p = 0.09. What should I conclude?”. The researcher saw the replication study as evidence against the treatment having a positive effect; I saw it as evidence for the treatment effect.
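
  One way to make the “evidence for” reading concrete is Fisher's method for combining independent p-values. This is only a sketch under assumptions that the comment does not state: the two studies are independent, and both p-values test the same hypothesis with effects in the same direction. The function name is illustrative.

  ```r
  # Fisher's method: under the null, -2 * sum(log(p)) follows a chi-squared
  # distribution with 2k degrees of freedom for k independent p-values.
  fisher_combine <- function(p) {
    stat <- -2 * sum(log(p))
    pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
  }

  fisher_combine(c(0.03, 0.09))  # combined p-value, smaller than either input
  ```

  Under those assumptions the second study adds to the evidence against the null rather than cancelling the first, even though it falls on the other side of 0.05.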
  
  - Tom Passin on May 1, 2017 1:26 AM at 1:26 am said:
    
    Aside from p-value problems like the ones discussed here, p-values have two more problems as a test statistic (one was mentioned in the referenced paper):
    
    1) Their distribution is uniform, so they have rotten properties for centralization and regression to the mean, and
    
    2) We don’t actually know the real p-value because we have to calculate it from the sample population, not the full population. And because of the uniform distribution, the experimentally-calculated p-value can easily not be close to the actual one.
    
    So as a test statistic, a p-value has just about the worst properties you could possibly get. To me, that says that we shouldn’t try to wring very much mileage out of it.
    
    - Carlos Ungil on May 1, 2017 8:58 AM at 8:58 am said:
      
      The distribution of the p-value is uniform only if the null hypothesis is true. And what do you mean by “the real p-value”? I would understand referring to “the actual distribution of the p-value” (which is not uniform in general). But I don’t know how to interpret the remark about the experimentally-calculated p-value not being close to the actual one.
    - Tom Passin on May 2, 2017 8:57 AM at 8:57 am said:
      
      “And what do you mean by “the real p-value”?”
      
      I mean the p-value that you would calculate if you knew the distribution for the entire population. But you don’t know that, you only know the distribution of your sample. So you can’t compute the p-value, but only an estimate of it.
      
      For those who said that the p-value or its distribution only apply for the null hypothesis – you can compute a p-value for any hypothetical mean. There’s nothing magical about a “null” value. The whole idea behind p-values is based on a hypothetical question: “*if* the true mean were M, what would be the probability of seeing the observed results?”. You can choose M to be whatever you like.
    - KKnight on May 2, 2017 9:47 AM at 9:47 am said:
      
      In practice, a null hypothesis represents a simplification of a more general (complicated) model – variable A has no effect versus variable A has some effect. But there’s nothing to stop you from choosing the null hypothesis to be whatever you want it to be, which may or may not make sense depending on the problem.
      
      At the end of the day, a p-value is just a statistic. If you’re carrying out a bunch of different hypothesis tests, the collection of p-values contains a non-trivial amount of information. For example, the Benjamini-Hochberg and related procedures use these p-values to control the false discovery rate.
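
      As a small illustration of that last point, base R's `p.adjust()` implements the Benjamini-Hochberg adjustment. The p-values below are made up purely for illustration and do not come from any study discussed here.

      ```r
      # Benjamini-Hochberg adjustment via base R's p.adjust().
      # These p-values are invented for illustration only.
      p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042,
                    0.060, 0.074, 0.205, 0.212, 0.216)

      p_bh <- p.adjust(p_values, method = "BH")

      # Indices of the tests declared discoveries at a false discovery rate of 5%
      which(p_bh <= 0.05)
      ```

      Note that several raw p-values below 0.05 are no longer flagged after adjustment, because the procedure uses the whole collection of p-values rather than each one in isolation.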
    - Carlos Ungil on May 2, 2017 9:56 AM at 9:56 am said:
      
      I’m not sure if we’re talking about the same thing. Let’s say we have a parametric model, like x ~ Normal(theta,1). We propose a “null” hypothesis (the mean is M, as you suggest). We also have some data (I measure x multiple times) and we define some statistic calculated from the data, for example the absolute value of the t-statistic.
      
      From the model we can calculate the sampling distribution for this statistic if the mean was indeed M. The p-value is the quantile where my measurement sits in this theoretical distribution. By the way, the calculation requires some assumptions about how the data is generated (for example if there is fixed number of measurements or some kind of stopping rule is applied).
      
      If I repeat the experiment many times, I will get a sampling distribution for the sample mean. I calculate one p-value each time I repeat the experiment. There is no “true” p-value, there is only a distribution of p-values. If the null hypothesis is true (i.e. theta has the value we assumed to calculate the p-values), the distribution of p-values that I obtain by repeating the experiment many times will be uniform in [0,1].
      
      Now, you say: “if you knew the distribution for the entire population”. The distribution of what? Which population? How does this distribution relate to the model used to calculate a p-value? Could you give a simple example?
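
      The uniformity claim in the comment above can be checked with a short simulation. This is a sketch only: the sample size, number of repetitions, and the alternative value of theta are illustrative choices, and a one-sample t-test stands in for the test statistic.

      ```r
      # Simulate repeated experiments from x ~ Normal(theta, 1) and test H0: mean = M.
      # n, n_rep, and the theta values are illustrative choices.
      simulate_p_values <- function(theta, M = 0, n = 20, n_rep = 5000) {
        replicate(n_rep, t.test(rnorm(n, mean = theta, sd = 1), mu = M)$p.value)
      }

      set.seed(1)
      p_null <- simulate_p_values(theta = 0)    # null true: roughly uniform on [0, 1]
      p_alt  <- simulate_p_values(theta = 0.5)  # null false: concentrated near 0

      # Share of null p-values falling in each tenth of [0, 1]
      round(table(cut(p_null, seq(0, 1, 0.1), include.lowest = TRUE)) / length(p_null), 2)

      # Share of p-values below 0.05 when the null is false (empirical power)
      mean(p_alt < 0.05)
      ```

      Each run of the experiment yields one draw from these distributions, which is why a single p-value says little about what the next one will be.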
    - Christian Hennig on May 2, 2017 10:27 AM at 10:27 am said:
      
      “I mean the p-value that you would calculate if you knew the distribution for the entire population. But you don’t know that, you only know the distribution of your sample. So you can’t compute the p-value, but only an estimate of it.”
      
      Not true. (Once more:) The p-value is a function of data and null hypothesis, it doesn’t estimate any quantity of the underlying “true” distribution (if there was such a thing). The null hypothesis is not an estimate of the truth but rather a specific model the interpretation of which is of special interest.
      
      If you knew the true distribution it wouldn’t make sense to compute a p-value because you know the truth so you don’t need to know whether there’s evidence in the data against it.
    - Corey on May 2, 2017 10:40 AM at 10:40 am said:
      
      The fact that the sample p-value isn’t intended to estimate any quantity of the underlying “true” distribution doesn’t mean that it can’t be regarded as doing so anyway. It’s certainly a non-standard view of the p-value statistic and the inferential meaning of such a perspective is not clear to me, but it’s not impossible that this could turn out to be a fruitful way of thinking about it.
    - Keith O'Rourke on May 2, 2017 11:12 AM at 11:12 am said:
      
      Christian:
      
      If you think of what happens if you repeated the experiment many times, you would get a sample from the distribution of p_values given the _true_ effect of such an experiment (if taken as stable).
      
      So the first study provides the first estimate of that distribution – simulating this might make it amply clear just how noisy that estimate is, whereas with a dozen or so studies it’s pretty good.
      
      Again, as elsewhere, I think it very important to think beyond a single study to multiple studies.
      
      Also, in a non-randomised study, one of the first important steps is to estimate the distribution of p_values under the no-effect assumption, as it can be almost anything – e.g. “Robust empirical calibration of p-values using observational data” https://onlinelibrary.wiley.com/doi/10.1002/sim.6977/full
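
The simulation suggested in the comment above might look like the sketch below. It assumes a stable true effect and a two-sided z-test, and it draws the z statistic directly from Normal(ncp, 1). The noncentrality value 1.707 is an assumption chosen to give roughly 40% power at the 0.05 level (the "realistic" power quoted in the abstract); the function name and seeds are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
NCP_FOR_40_PERCENT_POWER = 1.707  # assumed: mean of the z statistic under the true effect


def replicate_p_values(ncp, reps, seed=0):
    """Two-sided p-values from `reps` independent studies sharing one true effect."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(reps):
        z = rng.gauss(ncp, 1.0)  # observed z statistic of one study
        p_values.append(2.0 * (1.0 - STD_NORMAL.cdf(abs(z))))
    return p_values


# A dozen studies: a small sample from the distribution of p-values.
dozen = sorted(replicate_p_values(NCP_FOR_40_PERCENT_POWER, reps=12, seed=2017))
# Many studies: a close approximation to the distribution itself.
many = sorted(replicate_p_values(NCP_FOR_40_PERCENT_POWER, reps=20_000, seed=1))

share_significant = sum(p <= 0.05 for p in many) / len(many)
p10 = many[len(many) // 10]      # 10% quantile of the p-value distribution
p90 = many[9 * len(many) // 10]  # 90% quantile of the p-value distribution

print("a dozen p-values:", [round(p, 3) for p in dozen])
print("share with p <= 0.05:", round(share_significant, 3))
print("10% and 90% quantiles:", round(p10, 4), round(p90, 3))
```

With the same true effect in every study, the middle 80% of p-values should still span more than two orders of magnitude, which is one way to see why a single study's p-value is such a noisy summary.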
    - Christian Hennig on May 2, 2017 11:28 AM at 11:28 am said:
      
      Corey and Keith: Fair enough, but I responded to this: “So you can’t compute the p-value, but only an estimate of it.” No, a p-value computed from data is as p-valuey as a p-value gets. (It can with some effort be seen as an estimator of something else or a first step of estimating such a something, alright.)
    - george on May 2, 2017 1:03 PM at 1:03 pm said:
      
      Corey, Keith – I see what you’re saying about p-values possibly being estimates – under some transformation, in regular settings they’d estimate the non-centrality parameter of a test statistic’s distribution, in similarly-designed studies. (Murdoch et al 2012 says much the same thing). It’s a bit weird to estimate something so completely tied to sample size, but not totally useless.
      
      Christian – I agree the p-value is a realized quantity. The comments above on the “real p-value” seem misguided, but they may be alluding to approximations of p-values – say those that have approximately the right properties but are not truly p-values. Most software and textbooks fudge this issue – and while it may not often matter in practice, it does when discussing foundations.
      
      Carlos – “The distribution of the p-value is uniform only if the null hypothesis is true.” No, not under composite null hypotheses. This has been discussed before, with you.
    - Carlos Ungil on May 2, 2017 3:05 PM at 3:05 pm said:
      
      George: you’re right, I was keeping it simple. I also ignored the discrete case. Anyway, I was responding to Tom Passin who claimed that the uniform distribution of p-values was an issue. I would say that by mentioning other cases where this is not true you’re strengthening my case…
    - KKnight on May 1, 2017 9:01 AM at 9:01 am said:
      
      I’m not sure what you mean by point (1) – if you’re concerned about the distribution being uniform, you can just transform the p-value to give it whatever distribution you want, although that wouldn’t achieve anything.
      
      With respect to (2), there is no “true” p-value. Under whatever null hypothesis one is testing, the distribution of the p-value will be uniform (or uniformish) on (0,1) and this property can be exploited in multiple testing situations. On the other hand, if a given null hypothesis isn’t true, then the corresponding p-value will tend to 0, in most cases exponentially fast, as the sample size increases. (There’s actually quite a bit of theory on this.) So if you’ve done a really large study and get a p-value of 0.04, there’s likely not a lot interesting going on!
      
      In no way should this be construed as a defence of p-values – they represent a very blunt instrument in the statistical toolbox and researchers should focus more on estimating effects (as well as the uncertainty of these estimates) be it through Bayesian or some flavour of frequentist methodology.
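
The point about p-values shrinking with sample size when the null is false can be illustrated with a closed-form sketch. It assumes a two-sided z-test with known standard deviation and evaluates the p-value at the expected value of the z statistic (a "typical" p-value, not the whole distribution). The standardized effect of 0.1 and the function name are illustrative assumptions.

```python
import math


def p_at_expected_z(effect, n, sigma=1.0):
    """Two-sided p-value evaluated at the expected z statistic, effect * sqrt(n) / sigma."""
    z = abs(effect) * math.sqrt(n) / sigma
    # 2 * (1 - Phi(z)) written with erfc so that tiny tail probabilities stay accurate
    return math.erfc(z / math.sqrt(2.0))


sample_sizes = (10, 100, 1_000, 10_000)
typical_p = [p_at_expected_z(0.1, n) for n in sample_sizes]

for n, p in zip(sample_sizes, typical_p):
    print(f"n = {n:>6}: typical p-value = {p:.3g}")
```

Because the normal tail decays like exp(-z²/2) and z² grows linearly in n, the typical p-value falls roughly exponentially in the sample size. That is why a p-value near 0.04 from a very large study suggests, at most, a tiny effect.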
    - Christian Hennig on May 1, 2017 9:47 AM at 9:47 am said:
      
      “With respect to (2), there is no “true” p-value.” Certainly not as a function of the “true” population; the p-value is a function of the data and the null hypothesis model, but *not* of the underlying truth.
      
      I think the only valid way to speak of a “true” p-value as opposed to the one that was computed and reported is if a p-value is the result of some p-hacking, and then the term “(unobserved) true p-value” would refer to the probability under H0 of achieving a p-value smaller than or equal to the one actually computed, using a formal model of all the decisions combined that the researchers would have made had the data been different in various ways.
    - Daniel Lakeland on May 1, 2017 10:44 AM at 10:44 am said:
      
      I think the idea is that you don’t know the population parameters exactly (such as standard deviation) so you estimate them from data. Of course, in the normal case, this is what the t distribution is all about, moving probability around so as to account for the fact that you estimated the sd from data.
  - Keith O'Rourke on May 1, 2017 9:21 AM at 9:21 am said:
    
    Much of the confusion comes from almost no teaching about how to deal with multiple studies (i.e. meta-analysis).
    
    It should be basic content in any intro statistics course (I did it once just using a simple combining p_values rule – at least the students get some experience with evidence adding up or down as more studies become available).
    
    Additionally, replicable has not been well defined in the literature so authors can largely decide to make it mean anything they want.
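
The comment above does not say which "simple combining p_values rule" was taught. As one possibility, here is a sketch of Fisher's method, which assumes independent studies with continuous test statistics: under the joint null, -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom. The function name and the example p-values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom
    under the joint null. For even degrees of freedom the survival function has
    the closed form exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 0.0
    for i in range(k):
        total += term                # adds (X/2)^i / i!
        term *= half_stat / (i + 1)  # advances to the next term of the series
    return math.exp(-half_stat) * total


# Three studies, none individually below 0.05: the evidence adds up.
print(fisher_combined_p([0.08, 0.10, 0.06]))
# Three unremarkable studies: the evidence adds down.
print(fisher_combined_p([0.50, 0.60, 0.70]))
```

The rule uses only the p-values, so it ignores effect sizes, the direction of effects, and study quality. That is one reason it serves better as a classroom introduction than as a proper meta-analysis.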
    
    - LauraK on May 1, 2017 10:54 AM at 10:54 am said:
      
      Keith, this is a nice idea. Are there any Shiny apps to do this? If not I might write one. Do you have an example you did with this in intro?
      
      I feel as well that this gets students in intro thinking about effect sizes not just p-values and better understanding what a p-value is and is not. Great idea.
      
      I would love to see ideas like this published in the open source Statistics education literature so typical folks who are at teaching schools and high schools and such can use them. Many such teachers might not even know what meta analysis is, I will be honest that I never explicitly learned it while getting a PhD. Too busy taking theory courses.
      
      Laura
    - Martha (Smith) on May 1, 2017 11:54 AM at 11:54 am said:
      
      If there aren’t any Shiny apps to do this, I hope you indeed write one. I think good Shiny apps can help students understand what is going on.
    - Keith O'Rourke on May 1, 2017 1:26 PM at 1:26 pm said:
      
      > I will be honest that I never explicitly learned it while getting a PhD
      That is pretty common.
      
      The exception would be examples for multi-level modelling where the level happens to be study but even there there is probably little discussion about between study issues (heterogeneity, quality of studies, etc.) Now, if these involve two group randomized studies and the control group rate or average is modeled as random – it is not a proper meta-analysis (as the concurrent randomization comparison is being broken).
      
      Have no idea about teaching material here – last time I taught anything about meta-analysis was at Duke in 2007/8 (a graduate course just on meta-analysis and a bit about combining p_values in the undergrad intro course and that was only done in my section).
      
      To _me_ it’s really strange, as I don’t think you can really understand statistics without thinking about multiple studies, which is how Fisher often thought of statistics in his work (Fisher quotes on pages 56 & 57 here https://phaneron0.files.wordpress.com/2015/08/thesisreprint.pdf)
    - Martha (Smith) on May 1, 2017 3:10 PM at 3:10 pm said:
      
      “I don’t think you can really understand statistics without thinking about multiple studies”
      
      +1
Carol on April 30, 2017 6:40 PM at 6:40 pm said:

Thank you, Cliff AB.

Valentin Amrhein on May 1, 2017 5:26 AM at 5:26 am said:

Hi, I’m the main author of this manuscript. Thanks a lot for your comments so far, and thanks to Andrew for posting this. I put the paper on the preprint server to get as much feedback as possible before the manuscript is printed (should this ever happen). So please feel free to continue commenting, for example on
peerj.com/preprints/2921

I wrote this manuscript for a general readership represented by my colleagues (mostly biologists) and by my students. My experience is that many scientists may be aware of the problems, but most continue using significance tests as if nothing happened. And then there are colleagues who say that what we did for 100 years can’t be wrong. I suspect that the latter are the largest group of scientists, given that the prevalence of p-values is still increasing.

Now my problem is that I can give the ASA statement to such colleagues, and surprisingly they will say they agree with everything. I found that most papers, including the ASA statement, discuss misinterpretations of p-values or give advice that everybody can agree on without changing statistical practice, for example that there should be more open science, and that we should be careful with doing too many tests.

So our approach was to request a slight but concrete change in practice by removing fixed thresholds, and not to ask to stop using p-values. Of course, and luckily, this is not a new request. The language in the review is non-technical to address the expected readership of non-statisticians, and surely there are things that should be explained more thoroughly. Any advice is welcome!

Eli Rabett on May 1, 2017 1:49 PM at 1:49 pm said:

From the standpoint of a practitioner in physical science, the p value that you should look at depends on how strong the theory (model) is. If you can show that your expected result is in concordance with conservation laws and a negative result would be a violation, you can accept a much weaker statistical validation. On the other hand, if your results imply that the theory is violated after it has been affirmed many times, you have to go out to five or six sigma before anybody will take you seriously. If you are working in an area where there is no proven theory (model), such as looking for new fundamental particles, you should go there anyhow.
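
For readers more used to p-values than to sigma levels, the conversion can be sketched as follows. It assumes a normally distributed test statistic and the one-sided tail convention commonly used in particle physics; the function name is hypothetical.

```python
import math


def one_sided_p(sigmas):
    """Upper-tail probability of a standard normal beyond `sigmas` standard deviations."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2.0))


for level in (1.645, 2.0, 3.0, 5.0, 6.0):
    print(f"{level} sigma -> one-sided p = {one_sided_p(level):.3g}")
```

On this scale the familiar 0.05 threshold sits at about 1.645 sigma, while five sigma corresponds to a p-value of roughly 3 in 10 million.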

That’s the problem in social sciences. The theory is weak and the practitioners accept weak statistical validation.

- Glen M. Sizemore on May 2, 2017 5:14 PM at 5:14 pm said:
  
  If you look at most of psychology (often viewed as a “social science” – whatever that is), what you see is that it differs from physics in that most of psychology doesn’t really exert experimental control over its subject matter. Psychology would be better off dropping its aping of physics’ modern hypothetico-deductive base. Drop their obsession with theories about unobservable events and spend some time developing techniques to control behavior. Compare most of psychology to an actual experimental science that advances by exerting experimental control over behavior (i.e., behavior analysis). Take a look at an old, classic paper I posted here a day or so ago:
  
  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1404062/pdf/jeabehav00196-0021.pdf
  
  Check out Fig. 2 on page 205 – that’s experimental control over behavior.
  
  - Glen M. Sizemore on May 2, 2017 5:22 PM at 5:22 pm said:
    
    Oops. I should have said: “Drop their obsession with theories about ALLEGED unobservable events…”
    
Valentin Amrhein on May 2, 2017 4:05 PM at 4:05 pm said:

I just received a “major revisions” decision from PeerJ, with one unfavorable and two favorable reviews. So if everything works well, we might see this paper in print soon. In case someone did read the paper, I’m happy to make changes according to your suggestions. Also, there was an interesting blog post by Brent W. Roberts dealing with the question of whether this paper in particular, and research in general, should be rejected because “it contains nothing new”. Here is an excerpt:

“So why post papers that reiterate these points? Even if those papers are derivative or maybe not as scintillating as we would like? Why write blogs that repeat what others have said for decades before?
Because, change is hard.”

https://pigee.wordpress.com/2017/04/30/because-change-is-hard/
