Estimates of “false positive” rates in various scientific fields

Uli Schimmack writes:

I am curious what you think about our recent attempts to estimate the false discovery risk (the maximum rate under the assumption of 100% power) based on estimates of a bias-corrected discovery rate. We applied this method to medicine (similar to Jager and Leek, 2014) and psychology (hand-coding). Results are very similar, with an FDR estimate between 10 and 20 percent. Based on these results, we recommend an alpha of .01 to maintain a long-run FDR below 5%.

He points to these two posts:

Most published results in medical journals are not false (with Frantisek Bartos)

Estimating the False Positive Risk in Psychological Science

My quick reply is I don’t find the true-positive, false-positive framework very helpful. Here’s what Keith O’Rourke and I wrote in our comment on the Jager and Leek paper:

Jager and Leek may well be correct in their larger point, that the medical literature is broadly correct. To answer such a question would require additional care in defining what it means for a study to be correct. The research paradigm of effects being either zero or non-zero is not, we believe, particularly helpful, for two reasons. First, we almost always care about the direction of an effect, not merely its existence. Second, the magnitudes of any effects or comparisons are also important and, in fact, connect directly to concerns about replicability of scientific phenomena.

Medical researchers are mostly studying real effects (setting aside certain wacky examples and desperate clinical research areas involving high mortalities). But there is a lot of variation. A new treatment will help in some cases and hurt in others. Also, studies are not perfect . . .

Our point about Type 1 errors is not primarily “semantics” or “philosophy”. The framework of the Jager and Leek paper under discussion is admirably clear—our problem is that we do not think it applies well to reality. We have a problem with the identification of scientific hypotheses as statistical “hypotheses” of the “θ = 0” variety. We understand that the authors chose to follow the model used by the much-cited Ioannidis (2005), but that does not excuse them from dealing with the logical difficulties involved with that model.

That said, I recognize that many people do think in these terms, so I’m linking to these posts of Schimmack as they may interest some of you.

30 thoughts on “Estimates of “false positive” rates in various scientific fields”

  1. I’ve always thought Ioannidis (2005) was a neat little piece of reductio ad absurdum. To my mind, it always implied the move ‘from testing to estimation’. Alas, Ioannidis himself, and the majority who encounter it, just seem to want to patch up significance testing rather than move on.

  2. Post-hoc quantities like the FDR discussed here by Schimmack come up from time to time on this blog (e.g., the post-hoc power of single studies and the post-hoc average power of multiple studies).

    Is there any literature on the statistical properties of estimates of the FDR? We know the properties of estimates of post-hoc power and average power are quite poor, and it seems like the FDR would be a quantity that is similarly if not more difficult to estimate. How much confidence can we have in these estimates?

    Also, what exactly is it that the FDR (true or estimated) is telling us, and does it predict anything of interest (e.g., what happens in future replication studies)? It seems that, for the reasons Andrew discusses in his comment with O’Rourke on the Jager and Leek paper, as well as because of the hypothetical nature of the classical repeated-sampling framework in which the FDR is embedded, maybe not?

    • I think it is important to distinguish between discussion of a single study and discussion of a field, discipline, collection of studies in a journal, or meta-analysis of studies on the same topic.

      Most statisticians focus on the single study case where we have an effect size estimate and some information about sampling error and then we have to come up with an inference. This is difficult and there is lots of uncertainty. Statisticians are still looking for the holy grail of the best way to draw inferences from this information and I am sure they will still do so in 1,000 years – if we make it till then.

      The meta-scientific question is different. When you look at 10,000 articles published in medical journals or psychology journals or any other field that uses significance testing (you may hate it, but that is what they do), how many of these results are wrong because they rejected a true null hypothesis? It may not matter to you, but I think most consumers of scientific information are interested in the risk that the information they consume is false. Well, for those people we collected data and analyzed them to give them a scientific answer. If statisticians don’t find this useful, it may only show how removed they are from questions of practical importance.

      • It isn’t that we hate significance testing. It is that it doesn’t work. If people continue to use it, despite our telling them that it doesn’t work, I’m not sure why they would be interested in our telling them that it doesn’t work even more often than they perhaps thought.

        • What do you mean by “it doesn’t work”? If we test 50 true and 50 false hypotheses with alpha = .05 and power = .75, we will get mostly significant results that correctly tell us whether an effect is positive or negative. That is all NHST promises, and it delivers. If you want more, you first of all need to invest more resources to be able to provide some effect size estimates. That may be easy for some cheap sciences, but not if participants are expensive.
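
          A minimal simulation of the scenario sketched in this comment (the numbers are the commenter’s hypothetical ones, not from any real study): 50 true nulls and 50 real effects, tested with two-sided z-tests at alpha = .05, with the effects sized so that power is about 0.75.

```python
# Hypothetical scenario from the comment above: 50 true nulls and 50 real
# effects, two-sided z-tests at alpha = .05, effects sized for ~75% power.
import numpy as np

rng = np.random.default_rng(1)
z_crit = 1.959964              # two-sided .05 cutoff for a z-test
ncp = z_crit + 0.674490        # mean shift giving ~75% power
n_sims = 10_000

false_pos = total_sig = 0
for _ in range(n_sims):
    z_null = rng.normal(0.0, 1.0, size=50)   # 50 true nulls
    z_alt = rng.normal(ncp, 1.0, size=50)    # 50 real effects
    fp = np.sum(np.abs(z_null) > z_crit)
    false_pos += fp
    total_sig += fp + np.sum(np.abs(z_alt) > z_crit)

print("share of tests significant:", total_sig / (100 * n_sims))    # ~0.40
print("false positives among significant:", false_pos / total_sig)  # ~0.06
```

          Under these assumed numbers, about 40 of every 100 tests come out significant and only about 6% of the significant results are false positives; the replies below question whether power is typically anywhere near 0.75.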

        • Ulrich:

          There are different things going on here, but the quick answer is that people don’t always have power = 0.75. Often it’s a lot less! When the signal-to-noise ratio is high, all sorts of methods will work well.

        • If we test 50 true and 50 false hypotheses with alpha = .05 and power = .75, we will get mostly significant results that correctly tell us whether an effect is positive or negative. That is all NHST promises, and it delivers.

          The null hypothesis is always false, so you will be testing 100 false null hypotheses. In response people adjust the significance cutoff and/or sample size to avoid getting 100% significant results.

          The reality is that NHST is no better than flipping a coin to decide positive/negative and probably worse due to various biases/p-hacking:
          https://www.sciencenews.org/article/cancer-biology-studies-research-replication-reproducibility
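
          A rough sketch of the sign-error concern raised here, with made-up effect sizes in standard-error units: as the true effect shrinks toward zero relative to the noise, power falls toward alpha and the direction of a “significant” estimate drifts toward a coin flip.

```python
# Made-up effect sizes (in standard-error units) to show how often a
# "significant" two-sided z-test gets the direction wrong at very low power.
import numpy as np

rng = np.random.default_rng(2)
for true_effect in [0.5, 0.2, 0.05]:
    z = rng.normal(true_effect, 1.0, size=1_000_000)
    sig = np.abs(z) > 1.959964
    wrong_sign = np.sum(sig & (z < 0)) / np.sum(sig)
    print(f"effect = {true_effect:.2f} SE: power ~ {sig.mean():.3f}, "
          f"sign errors among significant ~ {wrong_sign:.2f}")
# effect = 0.50 SE: power ~ 0.079, sign errors ~ 0.09
# effect = 0.20 SE: power ~ 0.055, sign errors ~ 0.28
# effect = 0.05 SE: power ~ 0.050, sign errors ~ 0.44
```

          With higher power the sign errors essentially vanish, so how worrying this is depends on how low power typically is in the literature being replicated.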

        • Well, the null isn’t ALWAYS false. Homeopathy versus placebo trials spring to mind. As do trials of intercessory prayer to cure disease.

        • I looked into homeopathy once. The supposed explanations for how it works were BS, but so are the debunkings.

          If you properly do “succinations” you are taking each sample from the froth at the top or what’s stuck to the side of the container, which makes the concentration plateau to a kind of minimal microdose. You can also concentrate contaminants from the container and air in this way.

          So I see no reason that couldn’t have some effect, however minimal. Also, performing rituals calms people or can lead to them avoiding more fruitful approaches.

          Intercessory prayer, even if you aren’t supposed to know someone is doing it, can also still have an effect. If the study is big enough, eventually you or someone you know will run into someone doing the praying.

        • > If you properly do “succinations” you are taking each sample from the froth at the top or whats stuck to the side of the container, which makes the concentration plateau to a kind of minimal microdose.

          I guess you mean succussions (succinations are something else).

          Reusing the vial (with whatever remains stuck to its walls) is one way to do that, but the standard way is to take a drop and dilute it in a new vial.

          Why do you say that the “proper” way to perform dilution is the one that doesn’t really dilute the thing?

      • I agree with your meta-scientific / multiple study focus over a single study focus. However, I link above to a basic exercise in mathematical statistics that shows that multi-study average power estimates have poor statistical properties even in ideal cases (e.g., accurate estimates require many hundreds or thousands of studies).

        Is there any basic mathematical statistics showing that FDR estimates have good statistical properties? That would be surprising, because FDR and average power (which you term the expected discovery rate, EDR) are related quantities: you define FDR = (1/EDR - 1)*(alpha/(1-alpha)) (see the short numerical sketch below).

        Further, even if we take the z-curve estimates in Figure 2 of your medical journals link at face value, the large width of the intervals, even when there are many hundreds of studies, suggests the estimates are not accurate.

        This suggests FDR is like average power (or EDR, as you call it) in more than one way: not only are both hypothetical in nature, we can’t estimate them well even if we did want to know them.
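
        A short numerical sketch of the transformation quoted above (the formula is the one defined in this exchange; the EDR values below are purely illustrative), showing how strongly the implied maximum FDR depends on the EDR estimate:

```python
# Soric-style bound quoted above: FDR_max = (1/EDR - 1) * alpha / (1 - alpha).
# The EDR values are illustrative, to show how the bound reacts to the estimate.
def fdr_max(edr, alpha=0.05):
    return (1.0 / edr - 1.0) * alpha / (1.0 - alpha)

for edr in [0.10, 0.20, 0.30, 0.50, 0.75]:
    print(f"EDR = {edr:.2f}  ->  maximum FDR = {fdr_max(edr):.3f}")
# 0.10 -> 0.474, 0.20 -> 0.211, 0.30 -> 0.123, 0.50 -> 0.053, 0.75 -> 0.018
```

        Because the bound is a steep function of EDR at the low end, any imprecision in the EDR estimate carries straight over to the FDR estimate.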

        • If you actually care about the answer to your questions, you might want to read up on z-curve, which estimates power before selection and power after selection. The false positive risk (maximum rate) is just a simple mathematical transformation of expected power before selection (Soric, 1989). And as always, estimates come with uncertainty, which is why we provide confidence intervals. I apologize that they are frequentist, but I am sure we can add a vague prior and get the same information as Bayesian estimates of uncertainty.
          And the estimate of 15% False Discovery Risk is based on over 1,000 studies.

          https://replicationindex.com/2020/01/10/z-curve-2-0/

        • And the estimate of 15% False Discovery Risk is based on over 1,000 studies.

          But when people actually try to run the replications, they fail much more often. You need to compare your prediction to data. This was probably the first effort to check this; it was in spinal cord injury research: https://www.sciencedirect.com/science/article/abs/pii/S0014488611002391

          Then there is the cancer research reproducibility project, where they had to drop more than half the studies (and 75% of the experiments) because no one could figure out what the methods even were…

          The initial goal was to repeat 193 experiments from 53 high-impact papers published between 2010 and 2012, but the obstacles we encountered at every phase of the research lifecycle meant that we were only able to repeat 50 experiments from 23 papers.

          https://www.sciencenews.org/article/cancer-biology-studies-research-replication-reproducibility

        • The sister paper has the results for the subset where a replication was performed:
          https://elifesciences.org/articles/71601

          From Table 1 we see that of the “significant” results, ~25% of replications were in the same direction but “not significant”, and another 25% were in the other direction (“significant” or not).

          Thus they saw ~50% “significant” in the same direction. That is pretty much what you would expect if NHST does nothing, since the null hypothesis is always false in practice. I.e., with sufficient sample size you would always get significance in one direction or the other.

          And that is for the (likely better) subset of experiments where it was even possible to perform a replication!

        • I do actually care about the answer to my question :)! Thank you for the link to your paper but the paper does not speak to the question.

          I asked if there is any basic mathematical statistics showing that FDR estimates have good statistical properties. The paper contains no mathematical statistics whether for estimates of FDR or otherwise.

          There is however a simulation study that shows that the z-curve produces inaccurate estimates of average power (or EDR): “RMSE values were large and remained fairly large even with larger number of studies.” Since FDR is a transformation of average power, the paper would therefore appear to provide evidence that the z-curve produces inaccurate estimates of FDR.

          The paper also shows z-curve confidence intervals for average power are wide and have low coverage probability, thus suggesting the same for z-curve confidence intervals for FDR.

          It seems like we should be wary of estimates and confidence intervals for FDR that come from the z-curve (and perhaps other methods too)!

        • I’m wary of labelling it “basic”, but (good) theoretical properties of methods controlling FDR are due to Benjamini & Hochberg (1995, JRSSB). Somewhat similar results are developed by Storey (2001, JRSSB) and in subsequent papers, showing strong control and also conservative point estimates.
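
          For readers who want to see what “controlling FDR” means in the Benjamini–Hochberg sense (a different object from retrospectively estimating an FDR for a literature, which is the question in this subthread), here is a minimal sketch of the BH step-up rule on made-up p-values:

```python
# Minimal sketch of the Benjamini-Hochberg (1995) step-up procedure:
# reject the k smallest p-values, where k is the largest rank i with
# p_(i) <= (i/m) * q. The p-values below are made up for illustration.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking which hypotheses are rejected."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passes = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.max(np.nonzero(passes)[0])   # largest rank meeting its cutoff
        reject[order[: k + 1]] = True       # reject it and every smaller p-value
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.368]
print(benjamini_hochberg(pvals, q=0.05))    # rejects only the two smallest here
```

          The BH guarantee is prospective: under assumptions on the p-values, it bounds the expected share of false rejections within a single analysis.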

        • Thank you, but we are talking about two different things. You are talking about the properties of procedures to control the FDR in some applied setting; there is a long literature on this, as you point out. What I am looking for is literature on the properties of estimates of FDR calculated from prior published results, which is the topic of Schimmack’s paper discussed in the blog post.

        • Jerry, you don’t really care about the properties of z-curve that we carefully analyzed in Bartos and Schimmack (2021). We show that the 95% confidence intervals generated by the z-curve app have good coverage and are often conservative (i.e., coverage is well above 95%).

          This is all we can ask from a statistical method. If you are not happy with it, maybe you should stick to mathematics and not mathematics applied to data.

        • Hi Jerry,
          glad you took a look. I guess the question is what we mean by large. I prefer to talk about quantitative issues in terms of numbers. The key issue for me is that a 95% CI is expected to be correct at least 95% of the time. We found that this was not the case. So, we created a robust/conservative CI that does produce 95% coverage. To do so we had to add 5 percentage points to the bootstrapped confidence intervals. With these robust confidence intervals, we find that the point estimate of the EDR is 31%, with a 95% CI from 16% to 40%. Simple mathematical transformation yields a point estimate of the false discovery risk (the maximum rate compatible with these EDRs) of 12%, with a confidence interval ranging from 8% up to 28%. With more data, we can narrow this down further, but are you saying that this is not an interesting or credible finding, given the widely cited claim that “most published results are false positives” (Ioannidis, 2005)?
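
          As a quick check of the arithmetic reported in this comment, mapping the stated EDR point estimate and interval through the bound FDR_max = (1/EDR - 1) * alpha/(1 - alpha) at alpha = .05:

```python
# Checking the numbers reported above: EDR point estimate 31% with a 95% CI
# of 16% to 40%, mapped through FDR_max = (1/EDR - 1) * alpha / (1 - alpha).
alpha = 0.05
for label, edr in [("point estimate", 0.31), ("CI lower", 0.16), ("CI upper", 0.40)]:
    fdr = (1.0 / edr - 1.0) * alpha / (1.0 - alpha)
    print(f"EDR {label} = {edr:.2f}  ->  maximum FDR = {fdr:.3f}")
# 0.31 -> 0.117 (~12%), 0.16 -> 0.276 (~28%), 0.40 -> 0.079 (~8%)
```

          Note that the mapping is inverted: the lower end of the EDR interval gives the upper end of the FDR interval, which is where the 8% to 28% range comes from.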

        • You critique me for using the word “large” in my prior comment rather than a precise quantification, but my only usage of that word came as a direct quotation of a paper you wrote and linked to.

          You linked to this paper in response to a question I had posed in a still prior comment, but the paper did not address that question.

          This does not have the makings of a fruitful dialogue.

          Anyway, as for what I am saying, I said it above: your paper provides evidence that we should be wary of z-curve estimates and confidence intervals.

  3. There was another major concern about how fully one understands how the p-values were generated and _ended up_ in your sample of studies: “We think what Jager and Leek are trying to do is hopeless, at least if applied outside a very narrow range of randomized clinical trials with prechosen endpoints. It is not a matter of a tweak to the model here or there; we just do not think it is possible to analyze a collection of published p-values and, from that alone, infer anything interesting about the distribution of true effects. One might say that the same objection would hold for any meta-analysis, but this case seems more problematic to us here. The approach is just too driven by assumptions that are not even close to plausible and a catch-all sample of p-values that would not be representative of any conceivable population of interest.”

    • This blanket statement is unscientific. You cannot just dismiss data with hand-waving. You would have to show that the sample is unrepresentative or that there are actual problems with the model. We did exactly that and confirmed Jager and Leek’s results.

      https://replicationindex.com/2021/08/10/fpr-medicine/

      Evidently, this estimate is superior to mere speculations by Ioannidis (2005), but for people who don’t care about science, data don’t really matter.

      • > You would have to show that the sample is unrepresentative or that there are actual problems with the model.
        Disagree – you have to credibly support the claim the sample _is_ representative and accept that there are always actual problems with the model (all models are wrong) but again convincingly argue it is useful for some purpose.

        > but for people who don’t care about science, data don’t really matter.
        Again disagree, for people who _do_ care about science, determining how the data came to be and in your possession is critical.

  4. I guess you mean succussions (succinations are something else).

    Reusing the vial (with whatever remains stuck to its walls) is one way to do that, but the standard way is to take a drop and dilute it in a new vial.

    Why do you say that the “proper” way to perform dilution is the one that doesn’t really dilute the thing?

    Yes, succussions. The theory and whether there is an effect are two entirely different things. Meehl called these the research hypothesis and the statistical hypothesis; the connection between the two is extremely tenuous to non-existent when using a default null hypothesis.

    It’s the same as chemo/radiation therapy damaging the intestinal mucosa, causing nausea and reduced absorption: caloric restriction would then be slowing the growth of the cancer rather than whatever theory they have about how it works. I’ve never seen a single trial account for this.
