Statistical significance and the dangerous lure of certainty

In a discussion of some of the recent controversy over promiscuously statistically-significant science, Rafael Irizarry points out that there is a tradeoff between stringency and discovery and suggests that raising the bar of statistical significance (for example, to the .01 or .001 level instead of the conventional .05) will reduce the noise level but will also reduce the rate of identification of actual discoveries.
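
A minimal sketch of that tradeoff, under assumptions that are not in the original discussion: a two-sided z-test, and a hypothetical true effect sitting 2.5 standard errors from zero. The function name `power` and the effect size are illustrative choices only. Tightening the threshold lowers the false positive rate (which equals alpha when there is no effect) but also lowers the chance of detecting the real effect.

```python
from statistics import NormalDist

def power(alpha, effect_in_se):
    """Probability that a two-sided z-test rejects at level alpha when the
    true effect is effect_in_se standard errors away from zero."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - std_normal.cdf(critical - effect_in_se)
    lower_tail = std_normal.cdf(-critical - effect_in_se)
    return upper_tail + lower_tail

# Hypothetical effect size, chosen only for illustration.
EFFECT = 2.5

for alpha in (0.05, 0.01, 0.001):
    false_positive_rate = power(alpha, 0.0)   # equals alpha under the null
    true_positive_rate = power(alpha, EFFECT)
    print(f"alpha={alpha}: FPR={false_positive_rate:.3f}, TPR={true_positive_rate:.3f}")
```

The point of the sketch is only the direction of the change: both rates fall together as alpha shrinks.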

I agree. But I should clarify that when I criticize a claim of statistical significance, arguing that the claimed “p less than .05” could easily occur under the null hypothesis, given that the hypothesis test that is chosen is contingent on the data (see examples here of clothing and menstrual cycle, arm circumference and political attitudes, and ESP), I am not recommending a switch to a more stringent p-value threshold. Rather, I would prefer p-values not to be used as a threshold for publication at all.
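
To illustrate how “p less than .05” can easily occur under the null when the test is chosen after seeing the data, here is a rough simulation sketch. Everything in it is an assumption made for illustration: five independent outcome comparisons per study, known unit variance (so a simple z-test applies), 20 observations per group, and no true effect anywhere. It is not a reconstruction of any of the studies mentioned above.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_stat(xs, ys):
    """z statistic for a difference in means, assuming known unit variance."""
    diff = sum(xs) / len(xs) - sum(ys) / len(ys)
    return diff / math.sqrt(1 / len(xs) + 1 / len(ys))

def simulate(n_sims=2000, n_per_group=20, n_outcomes=5, seed=1):
    """Compare a pre-chosen test with 'report whichever comparison looks best'."""
    rng = random.Random(seed)
    fixed_hits = 0
    picked_hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: the null is true.
            treat = [rng.gauss(0, 1) for _ in range(n_per_group)]
            ctrl = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sided_p(z_stat(treat, ctrl)))
        fixed_hits += p_values[0] < 0.05      # test chosen before seeing data
        picked_hits += min(p_values) < 0.05   # test chosen after seeing data
    return fixed_hits / n_sims, picked_hits / n_sims

fixed_rate, picked_rate = simulate()
print(f"pre-chosen test:    {fixed_rate:.3f} of null studies reach p < .05")
print(f"data-dependent test: {picked_rate:.3f} of null studies reach p < .05")
```

With five independent comparisons the data-dependent rate should land near 1 - 0.95^5, roughly four to five times the nominal level, even though each individual test is valid on its own.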

Here’s my point: The question is not whether something gets published, but rather where it is published, in what form, and how it is received and followed up.

In the era of online repositories, every study can and should be published. (I do think we can and should do better than Arxiv—or, I should say, I hope there will be an equivalent for fields other than mathematics and physics—but Arxiv has established the principle.)

For most if not all of the studies we’ve been discussing lately, I think the raw data should be published too. Sometimes the authors will share their data with people who request it, but that shouldn’t even be an issue if the data and survey forms are just posted.

In a world where everything can be posted, what is the point of publishing in Psychological Science? Publicity and that stamp of approval. Assuming that both of these are limited quantities in some flexible way (in the same way that the government cannot simply print unlimited amounts of money and that banks cannot simply give out unlimited amounts of government-backed loans), some selection is necessary.

I would make that selection based on the quality of the data collection and analysis, the scientific interest of the research, and the importance of the topic—but not on the significance level of the results.

There are exceptions, of course: I could imagine a clean, well-defined, internally-replicated study with a surprising effect, where statistical significance would be part of the argument for why to believe it. But this is usually not what we see. Instead, over and over again we see poorly-measured data with analyses that are iffy or data-dependent. Studies such as those should demand our attention because of their data quality or scientific importance, not because they are attention-grabbing and have a p-value of .04.


I think Rafa’s point about the tradeoff between stringency and discovery is important, and I’d like to move this discussion away from p-values and toward concerns of data quality.

Of course I also think data should be analyzed appropriately. For example, with Bem’s ESP study, a proper analysis would not pull out some wacky interactions and declare them statistically significant; instead, it would display all interactions and show the estimates and uncertainties for all of them. The point would be to see what can be learned from the data, not to attempt to obtain a claim of certainty.
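
As a sketch of what “display all interactions” could look like in practice, the snippet below reports an estimate and a rough interval for every subgroup rather than only the ones that cross a threshold. The data are simulated with no true effect, and the subgroup labels (`condition_0`, ...) and the helper `summarize_all` are hypothetical; this is not Bem’s data or analysis.

```python
import math
import random
import statistics

def summarize_all(groups):
    """Return (name, estimate, std_error) for every subgroup, with no filtering."""
    rows = []
    for name, values in groups.items():
        estimate = statistics.mean(values)
        std_error = statistics.stdev(values) / math.sqrt(len(values))
        rows.append((name, estimate, std_error))
    return rows

rng = random.Random(3)
# Simulated outcomes with no true effect in any subgroup (illustration only).
groups = {f"condition_{i}": [rng.gauss(0, 1) for _ in range(30)] for i in range(8)}

rows = summarize_all(groups)
for name, estimate, std_error in rows:
    low, high = estimate - 2 * std_error, estimate + 2 * std_error
    print(f"{name}: estimate {estimate:+.2f}, rough interval ({low:+.2f}, {high:+.2f})")
```

Seeing all eight rows side by side makes it easier to judge whether one extreme-looking subgroup is anything more than the largest of several noisy estimates.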

Attention and resources are limited so there will always be some sort of selection of what studies get followed up. I’d just like to do a different selection than that based on p-values. Especially considering all the small-N studies of small effects where any statistical significance is essentially noise anyway.
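
A small simulation sketch of why significance in small-N studies of small effects is mostly noise. The numbers are assumptions picked for illustration: a true effect of 0.1 and a standard error of 0.5, so each study’s estimate is drawn from a normal distribution centered at 0.1. The function name `significance_filter` is hypothetical.

```python
import random

def significance_filter(true_effect=0.1, std_error=0.5, n_sims=20000, seed=2):
    """Simulate many noisy studies and look only at the 'significant' ones."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        # Conventional two-sided p < .05 cutoff for a normal estimate.
        if abs(estimate) > 1.96 * std_error:
            significant.append(estimate)
    share_significant = len(significant) / n_sims
    # How much larger the significant estimates are than the true effect.
    exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect
    # Share of significant estimates pointing in the wrong direction.
    wrong_sign = sum(e < 0 for e in significant) / len(significant)
    return share_significant, exaggeration, wrong_sign

share_significant, exaggeration, wrong_sign = significance_filter()
print(f"share reaching p < .05:            {share_significant:.3f}")
print(f"average exaggeration of magnitude: {exaggeration:.1f}x")
print(f"share with the wrong sign:         {wrong_sign:.3f}")
```

Under these assumptions, any estimate that clears the threshold must be at least about ten times the true effect in magnitude, and a sizable fraction of the significant results have the wrong sign.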

The dangerous lure of certainty

Regarding the title of this post: I’m not saying that Rafa is suffering from the dangerous lure of certainty. Rather, I’m saying that p-value thresholds are connected to this dangerous lure, present among producers and consumers of science.

This is not a Bayes vs. non-Bayes debate. If Bayesians were going around using a posterior probability threshold (e.g., only accept a paper if the direction of its main finding is at least 95% certain), I’d be bothered there too.

And the lure of certainty even arises in completely non-quantitative studies. Consider disgraced primatologist Marc Hauser, who refused to let others look at his data tapes. I expect that he has a deterministic view of his own theories, that he views them as true and thus thinks any manipulation in that direction is valid in some deep sense.

45 thoughts on “Statistical significance and the dangerous lure of certainty”

  1. Terrific post. I almost always use the confidence intervals together with the significance filters to give me a sense of how credible the result is. The strict use of 95% for publication is silly. The criteria you laid out for publication are great, especially the part on understanding the quality of the data, and the plausibility of the models, but I’d still want statistical significance to be part of the consideration. I mean, conceptually, I want the publications to filter out findings that are likely to be statistical noise because we do have the opposite problem as well: lots of results are reported in certain fields (nutrition science, neuroscience, evolutionary psychology, etc.) that are just noise, and most likely not replicable.

    One question: if we revert to subjective, qualitative assessment, like having referees judge the statistical significance, how do we keep the personal politics out of it?

    • I believe 95% CI is quite useful if you want to evaluate the precision of your estimates. Did you mean the use of 95% CI as a test of significance? (i.e. whether the interval contains the null value). Here I agree with you.

      • Here’s an interesting example of what using CIs to measure the precision of your estimates can sometimes get you:

        The key quote is:

        “The Frequentist answer is a disaster in every way imaginable. Besides being cryptic and not directly relevant, the confidence interval misses the correct length at the same “alpha” as the Bayesian despite having a width over 3 billion times larger!”

        • Entsophy:
          Thanks for the link, although the focus there is on frequentist intervals as measures of statistical significance, e.g. how many intervals cover the true value, etc. What I meant is that the reader can get a better idea of the magnitude of random errors if CIs are reported. For me both Bayesian and Frequentist are fine as long as there is a justification.

          You may also be interested to read this article…

        • I read your comment and a rejoinder that followed. It was an interesting discussion and I liked your conclusion:

          “I am in agreement with Greenland and Poole’s interpretation of the one-sided P value as a lower bound of a posterior probability, although I am less convinced of the practical utility of this bound, given that the closeness of the bound depends on a combination of sample size and prior distribution”.

          But let’s face it, P-values are here to stay. I felt that Greenland and Poole’s article was an attempt to at least inform researchers of the kind of information P-values depict; bearing in mind that they’ll probably always co-exist with posterior probabilities.

        • I just meant that if you casually use the size of the Confidence Interval as a measure of the precision of the estimates then you can easily be led astray.

          In that example the size of the Bayesian interval is clearly the correct measure of the precision since it’s effectively just the precision of the most highly accurate measurement. The CI on the other hand is 3 billion times wider than that and very clearly isn’t any kind of measure of the precision of the estimate. The fact that the CI also misses the correct value is just adding insult to injury.

        • Entsophy:
          Something is not right with your example. You start by assuming that the calculations for CI are valid under the iid assumption (the random variables are independent and identically distributed), then you construct an example where that assumption does not hold, but you go ahead and compute the CI anyway. Is it surprising that something weird will show up?

          Also in my discipline mean and SD are important, we never deal with the Cauchy distribution. Can you construct a similar example using something like a normal distribution?

        • Huh? I used the actual distributions which were used to simulate the data, namely independent N(0,sigma_1) for all but one, and an independent N(0,sigma_2) for the other. Both the CI and the Bayesian interval were made assuming these distributions, which as noted were the ones actually used to simulate the data.

          There’s nothing weird about it.

  2. “I would make that selection based on the quality of the data collection and analysis, the scientific interest of the research, and the importance of the topic—but not on the significance level of the results.”

    In general I agree with your post but I feel at some point one needs some measure of significance. Not the conventional one (p-value) perhaps, but some alternative. The traits you mention (e.g. scientific interest, quality of data etc.) are already used to some extent by current publishing practice and not by a pure and blind adherence to the cult of p-values (i.e. No journal’s going to publish a great p-value study that my dog runs fastest on the day he eats Purina)

    The tricky part is judging between multiple studies having almost equal quality of data, scientific interest etc. Say a narrow research area of which additives might improve corrosion resistance of car engines. Now which study among otherwise similar competing ones has the most impact needs some measure of significance.

    It’s all fine saying let’s publish all studies but journals perform an important curatorial function: I trust the editors of the journals I follow to try and publish the research most deserving my attention. And ceteris paribus some measure of impact is necessary. Sure, there might be alternative models of recommendation systems etc. but that’s a different story.

    • Rahul:

      Yes, I agree. But right now the rule seems pretty clear: if there’s no “p less than .05,” there’s no publication. And that seems like a problem. I think it should be ok routinely to publish (or curate) papers with no “p less than .05.” The flip side is that then there will be a bunch of “p less than .05” papers (for example, the clothing and menstrual cycle paper) that would not be published (curated) in top journals. In my ideal world, data quality would have a much larger role in determining whether a paper is published (curated).

      • Andrew:

        This (publication of good quality science which generates “negative” results) is starting to happen. See PLOS One, PeerJ, etc., which are indexed journals whose mandate is to publish all scientifically valid studies irrespective of results.

      • “In my ideal world, data quality would have a much larger role in determining whether a paper is published (curated).”

        I’m not sure. Perhaps from an academic or methods viewpoint, yes. But as an applied practitioner I have few reasons to read a paper with excellent data quality that makes weak or no conclusions. Yes, perhaps secondary pedagogical reasons but not helping my goals directly.

        OTOH, I think a paper with poor data and conclusions drawn on that which are hard to trust is equally bad.

        • Well, but a zero finding is important information. Optimally, the authors would report what they may reliably find in the data, not the most fancy thing that they somehow get past the p<.05 filter. No reliable findings in a weak sample is not too interesting, but no reliable findings in a strong sample is surprising.

          Rewarding proper methods over getting lucky (or just creative) with results and replacing the p<.05 filter with something sensible is one of the main goals of some of the recent pre-registration format proposals:

        • Isn’t it helpful to know what can be left out of practical applied models intended for, say, forecasting or other out-of-sample predictions?

      • At least in ecology, I see more and more papers in high impact journals with p-values above 0.05 (i.e. the p-value is not the main criterion for publication). So maybe a change is happening.

  3. I would echo the point about the importance of publishing raw data (with very few exceptions – relating to data which cannot be readily de-identified). Plenty of free Internet sites to post raw data now. However, I would like to recount a brief anecdote highlighting the reticence of some academics to publishing “their” data. I am a medical doctor, and I firmly believe that the patients from whom we generate data are the sole “owners” of their data. The data are used by the researcher to derive some meaning, but the researcher should not see the data as “theirs”.

    I recently contacted a high profile author of several very large randomized controlled trials. This person is also an advocate of “open data” and has published the raw data from a couple of large trials on his website. I downloaded the raw data and quickly discovered that the treatment allocation variable was missing, and was only available to those who submit a protocol (presumably approved by the author himself) for “new” research. Although I totally agree that “new” research should have a protocol, I objected to the notion that (a) the researcher would decide by fiat whether or not a particular protocol was “worthy”, and (b) new research was the only acceptable usage of the full raw dataset (since I think there are many other legitimate uses of the full dataset which do not involve new research, such as teaching, reproducing the original analyses as presented in the paper, fraud reduction, etc.). I mentioned this to him and have not received a reply.

    What do others think? Who should be the owner of raw data, and should the authors of the original paper using the dataset be the sole arbiters of who gets to see the raw data? How is this really “open data”? In my opinion, if open data is subject to conditions, then it is “closed” data, just a little less closed than before.

    Disclosure: I try to practice what I preach and have posted all raw data (including treatment allocations) of my most recent RCTs at FigShare. I plan to continue doing this going forward.

    • The “patients are the true owners of the data” part is a non sequitur though: You are always free to contact individual patients and get that data (however impractical that may be). But it doesn’t follow that since patients own data the data should be open to everyone. Alternatively, all we are saying is, if a subject of a study contacts the PI he ought to have unqualified access to the portion of data that impacts him.

      PS. I’m a fan of open data too but just don’t think that the specific argument you made makes a strong case for it.

      • You are right – I didn’t elaborate my argument fully.

        My point is that patients are the owners of their data, but, more generally, the public is often the owner of “data” in general, and that is why open data should be mandated for most research projects. This is because much of the funding of research of all kinds directly or indirectly comes from the public purse. Even for industry-funded studies this is usually true, since industry profits are coming ultimately from the public in most cases.

        The status quo, whereby a researcher takes (largely) public money to conduct research, and then turns around and holds that data as if he owns it, is unacceptable and needs to change. Researchers should see themselves as keepers of the data, not owners.

        Getting back to my original comment, I am shocked that so-called “open data supporters” are hesitant to release all of their (de-identified) raw data. Putting constraints on it is antithetical to the whole idea!

        • I still don’t like your arguments:

          (1) What about cases where the “public” may be ok with (or prefer) the specific researcher they agreed to having access to their data but not, generically, the wider public? I don’t think this is an unlikely possibility.

          Lots of people may be willing to have a local researcher whom they perhaps met and trust (or at least they trust the institute he works for) have access to their data (and that too for a specific agreed upon purpose) but may not want to allow generic access to any researcher and for a study-goal they do not know or may not approve of.

          I’m not saying this is indeed the case or even that study subjects are explicitly informed about the open / closed nature of subsequent access. But all I’m saying is it does not follow that just because subjects of a study are “public” the participants would axiomatically want open access to be provided.

          (2) The “industry profits are coming ultimately from the public” argument is even more abhorrent to me because, if so, what could we not justify society requiring of an industry? If that line of thought is taken to its logical conclusion, making all proprietary industry data open-access could be justified in the light of the larger social good.

        • I agree with your (1): often patients may not really want their data to be available outside of the original purpose, de-identified or not. I think ethics demand that we inform individuals of how their data will be made available, and have them consent ahead of time to open de-identified publication of data, and that this should be the norm. Also if we allow people to opt out of this process, their data should be included in the dataset as missing entries with an indicator variable for “opted out”, so we know how many of them there are etc.

          As for (2), I agree with you that privately funded researchers should be able to own their datasets without restriction, but I think that “privately funded” is often a pretty difficult thing to define. There are plenty of instances of publicly funded researchers at universities taking that research and turning it into corporate profits. There was a lawsuit about this with the founders of Genentech vs UCSF for example. Furthermore, it would seem to me to be a fair thing to ask that *as part of a patent application* all the relevant datasets generated should become (de-identified) public knowledge. The *point* of a *patent* is to make things “patent” (ie. well known) and I think that should be not just the conclusions but the data that those conclusions are based on.

        • My stand on this is that I’m all for open access and the way we should go about mandating it is:

          (1) Take a stand as editors that we won’t publish a paper without open access to data in it. A reasonable reason is the “inability to detect fraud” like @Philip Jones mentions.

          (2) Take a stand as funding bodies (e.g. NSF, NIH etc.) that open-access be a precondition to funding.

          (3) Take a stand as public safety bodies (e.g. FDA ) requiring open-access for the safety of the public subject to the drugs licensed.

          I feel all three are justified, as providers of money or an imprimatur, to mandate such requirements, and it makes for better arguments than the others I’ve seen.

          What I don’t agree with is any requirement for openness on strictly privately funded work.

        • 1) I take your point about axiomatically wanting open access. However, I could extend your argument to include a study participant stating they wanted no open access to the paper produced from the research too! Since, after all, the paper will include their aggregate (or raw) data in the text of the manuscript, and they don’t want any “wider public” access to it? How big of a difference is there between aggregated data and de-identified raw data? How is one more potentially damaging to someone than the other, if the data is de-identified? In contrast, I think most study participants would be interested in ensuring that the trial’s analyses were done correctly, and that no fraud occurred in the conduct of the trial.

          2) Is it OK for industry to subject trial participants to risks and then never publish the aggregate (or raw) data? (We know this is a very common practice.)

          re: #2, you may already be aware of the sea change happening in drug regulation. For instance, the EMA has declared that all drug trials will have to commit to open data after 1 January 2014. See for more info.

    • There are ways of doing partial de-identification, for instance an approach proposed by Roderick Little where he created intentionally missing data in each record and then used MI to recover the aggregate information.

      More such approaches would be useful.

  4. Andrew, I agree with this post, but not with some of what you’ve written earlier that’s related: e.g. “Also, as a referee, I would not demand statistical significance. As I wrote above, this is an important problem, and if their results have a big standard error, so be it.” [1] (comment from ‘I doubt they cheated’)

    Precision matters and incentives matter; IMO researchers should be encouraged to do research where a precise estimate is interesting, whether the unknown parameter is near zero or far away from zero. There are way too many studies that are a priori only interesting if the estimate turns out to be far from zero, and completely useless if the estimate is close to zero (the shirt-color example from your Slate article probably falls into that category); I would bet that “researcher degrees of freedom” is typically a bigger problem for those studies than for studies where a precise estimate is of obvious interest for any value of the unknown parameter.

    So I’d rather have referees ask, 1) if the estimate is precise enough to be useful or informative and 2) if a similarly precise estimate in different regions of the parameter space would also be useful or informative, and judge the paper accordingly.


    • “So I’d rather have referees ask … if a similarly precise estimate in different regions of the parameter space would also be useful or informative, and judge the paper accordingly.”
      I think that’s a super nice idea.

    • Gray:

      I would include precision as one of my criteria for publishing (curating) an article. Precision is part of data quality. Requiring sufficient precision is not the same as requiring statistical significance.

    • Andrew, just to complete my thought (sorry): there are plenty of settings where it’s obvious beforehand that a well-executed study of an important phenomenon would probably result in a very very imprecise estimate. I’m worried that your proposal might encourage researchers to work on those projects too much.

  5. A wacky (and perhaps unrelated) idea but what if Journals started requiring prospective authors to pre-submit drafts of Introduction & Methods sections of prospective articles?

  6. Andrew: “The question is not _whether_ something gets published, but rather _where_ it is published, in what form, and how it is received and followed up.”

    Implicit in this statement is the idea that top journals publish “quality” studies. Unfortunately they also publish “sensational”, “counter intuitive”, “interesting” as discussed at length in this blog. One option is to improve the top journals. Another is to move on.

    The top journal system is expensive, closed, slow, and for the most part non-interactive. I am also not sure what purpose the journals serve. With regard to quality, presumably their raison d’etre, scientists writing for scientists can rely on other scientists to judge the merits of their work. Not sure what the editor adds in this regard. The issue of quality becomes more salient for non-peer readers. But here we could have a blogger/firm/bot/etc with a reputation for quality curating research articles in public repositories, certifying results, etc… just like coffee roasters vie to select the best coffee beans. There is a business model there waiting to be discovered.

    It would also help if, in addition to sharing data, authors shared their code files, and perhaps a JSON type file on methods, as in Predictive Model Markup Language (PMML). Then machines can do much of the quality scoring, facilitate meta-analysis, etc…

    • Fernando:

      We don’t have to reform the top journals. But if we don’t, then we have to reform the way that these journals are perceived, by many scientists as well as by the news media.

  7. Andy, I agree with what you say here. I also agree with most of the content of your original post. However the “Scientific Mass Production of Spurious Statistical Significance” title led me, perhaps unfairly, to group you with the pessimists. I think your post was inspired by what you see in psychology journals and I was writing specifically about the biomedical sciences. I interpreted your title as grouping all the sciences into one. Also note that my post was not a defense of the 0.05 cutoff or the use of p-values but rather an attempt to move the discussion away from critiquing the current system without considering the importance of the discoveries made in the biomedical sciences. I end the post by saying “it is important to point out that there are two discussions to be had: 1) where should we be on the ROC curve (keeping in mind the relationship between FPR and TPR)? and 2) what can we do to improve the ROC curve?” I still need to be convinced that abandoning the current approach to filtering papers, which includes the use of p-values among many other criteria, will have a substantial impact on improving the ROC. Again, I am referring to the biomedical sciences, psychology is perhaps a different story (you know better than me). Now, your suggestion about making data public can definitely move the ROC up as I see no way it could slow discoveries but many ways in which it could help us catch mistakes and also leverage the data to make further discoveries.
    Thanks for reading and best wishes,
    ps- I am waiting for a Bayesian to start a decision theory discussion about the best way to answer question 1). What loss functions should we use, etc…

  8. The problem in social science is use of invalid study designs. Statistical significance levels can’t fix that. Indeed, the idea that p-values can solve problems involving validity is a symptom of the misunderstanding that generates invalid study designs (i.e., mistaking statistical technique for sound causal inference).

    • dmk38, for some social sciences it is essentially impossible to have study designs that would be considered even adequate in an experimental setting; off the top of my head: empirical finance, macroeconomics, and much of industrial organization (can you tell I’m in econ?) For most research in those areas, every agent/individual interacts with every other through the markets we want to study and acts in anticipation of future events, so the sort of assumptions that usually would need to be justified are clearly violated, and are violated exactly along the dimensions we want to study. I’m sure the issue comes up in other social sciences as well (but it seems especially acute in parts of economics and finance).

      And yet… it’s important to try to estimate whether regulation is working as intended, and to predict the effects of new regulation, or to predict the effect of monetary policy rules, etc. It’s not obvious or unanimous what the correct study design is for these problems (there are lots of different approaches that can be informative; none is overwhelmingly convincing).

      • Alternatively, there are lots of different approaches that are plausible but none is really telling us what we want to know. (can you tell I’m not in the social sciences?) :)
