Once more about the z-curve method

This is Erik: I recently wrote two posts about my concerns about the z-curve method and some more concerns about the z-curve method. I’m sorry if this is getting repetitive, but I hope that even if you’re not interested in the z-curve per se, my criticism and especially the comments I’ve received are still interesting in light of the recent discussion about the level of rigor of meta-scientific work.

I’ll briefly summarize the z-curve method to make this post self-contained. The z-value (or z-score or z-statistic) is the estimated effect divided by the standard error, and the signal-to-noise ratio (SNR) is the true effect divided by the standard error. It is often reasonable to assume that the z-value has the normal distribution with mean SNR and variance 1. Now suppose that we observe the z-values of a collection of studies. The z-curve method is based on the assumption that the SNRs have a discrete distribution over 0,1,2,…,6, which implies that the distribution of the z-values is a mixture of normal distributions with means 0,1,2,…,6 and unit variances. The goal is to estimate the vector of mixture weights p=(p0,p1,…,p6) and various related quantities.
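
To make the model concrete, here is a minimal R sketch that draws z-values from such a mixture and evaluates its density. The weights in `p`, the sample size and the seed are made up for illustration, and `mixture_density()` is a hypothetical helper of my own, not a function from the zcurve package.

```r
# Sketch: SNRs live on the grid 0,1,...,6 and each z-value is its SNR plus
# standard normal noise. The weights are illustrative, not estimated from data.
set.seed(1)
snr_grid <- 0:6
p <- c(0.5, 0.2, 0.2, 0.1, 0, 0, 0)  # hypothetical mixture weights (sum to 1)

n <- 1000
snr <- sample(snr_grid, size = n, replace = TRUE, prob = p)
z <- rnorm(n, mean = snr, sd = 1)

# Density of the z-values: a weighted sum of N(k, 1) densities, k = 0,...,6
mixture_density <- function(z, p, grid = 0:6) {
  vapply(z, function(zi) sum(p * dnorm(zi, mean = grid, sd = 1)), numeric(1))
}

mixture_density(c(0, 1.96, 3), p)
```

The z-curve method only gets to see the part of `z` with absolute value above 1.96, which is what makes the estimation problem hard.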

The z-curve method aims to circumvent publication bias, so instead of maximizing the likelihood of the observed z-values, it maximizes the conditional likelihood of the z-values given |z|>1.96. In other words, it tries to estimate the full distribution of the z-values by using only those that exceed 1.96. The main quantity of interest is the Expected Discovery Rate (EDR) which is P(|z|>1.96) in my notation. The R function zcurve::zcurve() provides an estimate of the EDR and uses the parametric bootstrap to provide a confidence interval.
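
A rough sketch of the conditional likelihood idea, assuming signed z-values and the cutoff 1.96. The helper names `power_from_snr()`, `edr()` and `cond_loglik()` and the example inputs are my own for this illustration; this is not the implementation used by zcurve::zcurve().

```r
# Power of a two-sided test for a given SNR: P(|z| > crit) when z ~ N(snr, 1)
power_from_snr <- function(snr, crit = 1.96) {
  pnorm(-crit, mean = snr) + 1 - pnorm(crit, mean = snr)
}

# EDR = P(|z| > crit) under the mixture = weighted average of component powers
edr <- function(p, grid = 0:6) sum(p * power_from_snr(grid))

# Conditional log-likelihood of the significant z-values given |z| > crit:
# mixture density at each z, divided by the probability of being significant
cond_loglik <- function(p, z_sig, grid = 0:6) {
  dens <- vapply(z_sig, function(zi) sum(p * dnorm(zi, mean = grid)), numeric(1))
  sum(log(dens)) - length(z_sig) * log(edr(p, grid))
}

p <- c(0.5, 0.2, 0.2, 0.1, 0, 0, 0)  # illustrative weights
z_sig <- c(2.1, -2.5, 3.0, 2.0)      # made-up significant z-values
edr(p)
cond_loglik(p, z_sig)
```

Maximizing `cond_loglik()` over the weights (subject to them being non-negative and summing to 1) gives the kind of estimate discussed here; the EDR estimate is then `edr()` evaluated at the fitted weights.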

In my previous posts, I gave some examples to illustrate two concerns:

  1. The bootstrap fails as the confidence interval around p0=P(SNR=0) often collapses to the zero-length interval [0,0].
  2. z-curve can be very sensitive to slight (practically undetectable) model misspecification.

Ulrich Schimmack, who is the main author of the method, made it clear that he is unimpressed because my examples are “unrealistic” and therefore — in his opinion — irrelevant. He insists that to be realistic there should be sufficient heterogeneity among the studies, and the density of the z-values should be decreasing at 1.96. So, I did one more simulation which meets both requirements. The distribution of the z-values is the following mixture:

0.5 × N(0,1) + 0.2 × N(1,1) + 0.2 × N(2,1) + 0.1 × N(3,1).

The true EDR is 0.25. The coverage of the 95% (“robustified”) confidence interval across 100 simulations is 89% (CI: 81%-94%). Moreover, the estimated EDR is very biased. The average across the simulations is 0.35 (CI: 0.32-0.39). Below, I show the 100 confidence intervals. The horizontal red line is the true EDR and the blue dots are the maximum likelihood estimates (MLEs). The red confidence intervals are the cases where the confidence interval of p0 collapsed to [0,0]. It’s clear that they are an important part of the problem.
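
As a quick check of the numbers, the following sketch computes the true EDR of the mixture and an interval for the coverage. It assumes the cutoff 1.96 and uses an exact binomial interval from `binom.test()` for 89 out of 100; that is one reasonable way to get such an interval and not necessarily the one used for the figures quoted here.

```r
crit <- 1.96
snr <- 0:3
w <- c(0.5, 0.2, 0.2, 0.1)  # mixture weights of the simulation

# Power of each component: P(|z| > 1.96) when z ~ N(snr, 1)
power <- pnorm(-crit, mean = snr) + 1 - pnorm(crit, mean = snr)

# True EDR is the weighted average of the component powers
true_edr <- sum(w * power)
round(true_edr, 2)

# Exact (Clopper-Pearson) 95% interval for coverage observed in 89 of 100 runs
coverage_ci <- binom.test(89, 100)$conf.int
round(coverage_ci, 2)
```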

So, why is this happening? Note that the conditional likelihood of the observed z-values given |z|>1.96 has very little information about p0=P(SNR=0). That means that even if p0 is quite large, the MLE of p0 can hit the boundary, i.e. p0 is estimated at zero. If that happens, then the parametric bootstrap will create datasets without SNRs at zero. In those cases, it will be very likely that p0 will be estimated at zero. If that happens in 95% or more of the bootstrap samples, then the confidence interval collapses to [0,0].
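
One way to see how little the significant z-values say about p0 is to compute which component they come from. This is a small sketch under the mixture of this post, with n=100 z-values per dataset and the cutoff 1.96; the variable names are mine.

```r
crit <- 1.96
snr <- 0:3
w <- c(0.5, 0.2, 0.2, 0.1)
power <- pnorm(-crit, mean = snr) + 1 - pnorm(crit, mean = snr)

# Share of the significant z-values that originate from each component
share_sig <- w * power / sum(w * power)
names(share_sig) <- paste0("SNR=", snr)
round(share_sig, 2)

# Expected number of significant z-values from the null component when n = 100
expected_null_sig <- 100 * w[1] * power[1]
expected_null_sig
```

Although half of all studies have SNR=0, only about a tenth of the significant z-values come from that component, so with roughly 25 significant z-values per dataset only two or three of them are nulls on average.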

It is actually well known in the statistical literature that the MLE can be quite biased and that the bootstrap can fail when the estimate can hit the boundary of the parameter space. See for example this paper which has a simple example.

Finally, I want to emphasize that it is not my intention to “shoot down” the z-curve method. However, I do think more work is needed before it can be used reliably. First, the problem with the collapsing confidence intervals must be fixed. Second, the limits of applicability of the method should be established and stated clearly. Third, I think z-curve is a case where the data are so weak that it’s necessary to do some regularization either with a strong prior on the mixture weights or some restrictions on them (smoothness or some shape constraint). That might be possible as several commenters on the blog seem to be very clear on what “realistic scenarios” are!

PS January 22, 2026: Frantisek Bartos uploaded zcurve version 2.4.6 to CRAN. If there are fewer than 300 z-values exceeding 1.96, the function zcurve() issues a warning:

Warning: The z-curve method is meant for large samples of test statistics. 
It might produce undercoverage and biased estimates of EDR in small sample 
sizes.

 

36 thoughts on “Once more about the z-curve method”

  1. Thanks for sharing these cases, Erik. I agree with you that this simulation is quite realistic. It indeed seems that the bootstrap recovery of the z0 weight is connected to the issue.

    • Agreed. This is a realistic scenario. It still remains true that we are discussing estimation of the EDR and that none of these issues concern estimating the ERR (average true power of the available significant results). Extrapolation to unobserved non-significant results is the challenge.

      • Ulrich: are you now ready to agree that my concerns are valid, at least with respect to the EDR? Would you like to comment on the final paragraph of my post?

        I have made a serious effort to study your z-curve method, and I have shared my concerns in a transparent and respectful way, first with you and your co-authors and then on this blog. You have responded with a series of dismissive and insulting comments (e.g. Fortunately, I am actually enjoying how Erik (and to a lesser extent Andrew) are embarrassing themselves) including a fake AI match report declaring yourself the “winner”. Would you say this has been an appropriate reaction to valid scientific criticism?

        • see my response to Carlos and the new comment by Frantisek.

          z-curve will fail sometimes, nobody disagrees, and nobody thinks that is a problem for z-curve or any other method.

          What we still need to do is to specify when it fails and then see whether we (a) can fix these problems or (b) find diagnostics when a method fails (e.g., like checking assumptions about outliers, normality, etc.), which is also totally normal for all methods.

        • Ulrich: Since (by your own admission) the z-curve method can fail in realistic scenarios, you need to establish the limits of applicability. From your comments and those of Frantisek, there seem to be (at least) 3 requirements for the reliable application of the z-curve method:

          1. There should be “sufficient” heterogeneity. How should this be checked in practice?
          2. There should be “sufficiently” many significant z-values. How many? The zcurve() function has a minimum of 10, but apparently 25 is still not enough.
          3. The density of the z-values should be decreasing at 1.96. How should this be checked in practice? Note that the derivative of a density is notoriously hard to estimate.

          Anything else?

        • Your criticism is not 100% valid.
          Please consult some neutral observer to check your own biases, or feel free to continue with your questionable simulations with k = 25.
          The ball is in your court

    • I guess I have to take some of my words back on this. From my reading of the previous blogposts, I assumed that we are no longer dealing with numbers of statistically significant studies in low double digits (e.g., 25 significant cases in this post). In my previous reply (thanks to Erik who reached out for a comment in advance), I said that I completely agree with him on the point that fitting z-curve to double digit and low triple digit sample sizes is not sensible and I would not expect the method to work. Our simulations documented that the RMSE and coverage deteriorate in those scenarios as well as in very homogeneous cases. This simulated scenario shows exactly that. I think that clearly stating the actual sample size used for this blogpost would have been quite relevant. On the other hand, I do want to recognize that Erik did provide me with helpful tips on diagnosing the issues related to bounded parameter space and the bootstrap.

      • František: Sorry – I should have mentioned that I sample n=100 z-values from that mixture distribution. The EDR is 0.25 so on average there are 25 significant z-values. I would say that correct coverage is especially important when the sample size is low so that the user is duly aware of the uncertainty. Isn’t that better than telling them they shouldn’t use the method when the sample size is below “double or low triple digits”?

        • I just completed a simulation with k = 100 (significant, not k =100, 25 significant) and the true value was included in 95 out of 100 simulations.

          FALSE TRUE
          5 95

          So, can we agree that your concern is about ONE scenario with fewer than 100 observations for estimation?

          Before we draw any firm conclusions about z-curve, we need a proper evaluation of z-curve with say k = 50.

          We can discuss what to do when we see the results, but we need to see all the results (good and bad) to evaluate z-curve. Selecting only simulations that show failure does not provide information to evaluate z-curve.

  2. I confess to have not been following this discussion and to have not read Schimmack’s paper or your (van Zwet’s) critiques… a combination of busy-ness and laziness. Sorry in advance if this misses the point, feel free to just say so and not waste your time or anyone else’s.

    I would think that the key assumption of the z-curve method is that all of the studies provide estimates of the same effect. For instance, if the question is how much a certain drug decreases low-density cholesterol in American adults, on average, then maybe there’s some small initial study and then a bigger follow-up study and then a huge study, and then maybe some health researchers do smaller studies, etc., so you end up with a dozen studies that have different central estimates and different standard errors. I guess there’s some concern that studies with low z-scores maybe won’t get published, so a meta-analysis would lead to too high a mean estimate? Or maybe the effect could go the other way? Anyway the idea of the z-score method is to only include studies that find a ‘statistically significant’ result.

    If I have that right, it seems odd. What if the true effect is tiny — in this case, the drug doesn’t work? In this case, all of the “statistically significant” results will over-estimate the effect. I could imagine that if you knew the number of studies that found “non-significant” results then maybe you could do something with the distribution of values that make the cut, and conclude that the true effect is likely tiny, but if the worry is that some studies that didn’t attain “significance” are not published and thus unknown then I don’t see how this could work. But presumably Schimmack has discussed this and it all works out somehow.

    But if I have that right, van Zwet’s mixture model also seems odd. Very odd. What is the mixture coming from? Maybe one set of studies measures the cholesterol levels of people who have had heart attacks, another set looks at the level in professional athletes, etc. You could certainly get a mixture. In real life one could even say that, given the practical impossibility of getting truly representative random samples of American adults to participate in the study, any set of studies is bound to have a mixture…indeed, in practice each study’s sample will be drawn from a different underlying population. But van Zwet’s mixture seems to take this to an almost absurd extreme: on a scale in which 1 is a fairly big effect, we have a set of studies that are sampling a population whose true mean is 0, another set for which the true mean is 3, and a couple of other sets in between. If I’m understanding the basic idea — and I very well might not be! — then this seems like a straw man, and an unusually absurd one.

  3. I don’t get where the mixture comes from? Isn’t there a key assumption that all of the studies are at least trying to estimate the same quantity? Of course in practice different studies will sample from different populations or will test slightly different things, but on a scale where 1 is a fairly large effect it seems pretty extreme to assume there’s a set of studies for which the true value is 0 and another set for which the true value is 3.

    • Phil: The signal-to-noise ratio (SNR) is not the effect. It’s the (true) effect divided by the standard error of its estimate. If the estimate is unbiased and normally distributed, then the z-value (the estimate divided by the standard error) has the normal distribution with mean SNR and variance 1. That means that the SNR is equivalent to the power. For example, if the SNR is 2.8 then the power is 80% because pnorm(-1.96,2.8,1) + 1 - pnorm(1.96,2.8,1) = 0.8.

      The heterogeneity between the studies comes from variation in the effects (as you mentioned) and *also* from different sample sizes. Small studies have small SNRs and large studies have large ones. The model of the z-curve method assumes studies can have SNRs of 0,1,2,…,6, which correspond to powers of 0.05, 0.17, 0.52, 0.85, 0.98, 0.999, and (essentially) 1. The mixture distribution of my example says that there are many studies with very low power, some with medium power, and a few with high power.

    • You write:

      I confess to have not been following this discussion and to have not read Schimmack’s paper or your (van Zwet’s) critiques… a combination of busy-ness and laziness. Sorry in advance if this misses the point, feel free to just say so and not waste your time or anyone else’s.

      I would think that the key assumption of the z-curve method is that all of the studies provide estimates of the same effect.

      That makes it easy to reply. No, it does not make that assumption. It is actually designed for the purpose of estimating parameters assuming heterogeneity in effect sizes and/or sample sizes.

      • “Heterogeneity in effect sizes”, really? It can handle the case where Study A tests a drug on a population where it has no effect at all, while Study B tests the drug on a population where it has a huge positive effect, and Study C tests the drug on a population where it has a big negative effect, and so on?

        • Yes. It is not designed for Cochrane style metas.

          It is designed for science wide claims
          Jager and Leek where studies test different drugs A B C etc

  4. Feature idea for the next version of the zcurve package: a function that plots this comic but for z-values. Maybe call it plot_xkcd()?
    xkcd #1478

    While searching for existing R packages, I also came across this clever package that plots in the xkcd style: xkcd

  5. > The z-curve method is based on the assumption that the SNRs have a discrete distribution

    That seems misleading. It uses a mixture of distributions to represent the observed distribution of z-scores but saying that it “assumes” that the underlying SNRs “have a discrete distribution” may not be appropriate.

    For what it’s worth the EDR coverage seems fine for n=300, for example. With n=100 there are only 24 significant z-scores on average. The “collapse” discussed is correlated with having a low number of “valid” datapoints and even more correlated with those significant z-scores being concentrated (low standard deviation).

    • Thank you for this constructive comment.

      Indeed, the model does not assume that the true power of studies has a specific distribution or is discrete.
      It just models the observed distribution with a finite number of components to estimate average true power.
      It actually works best when power is not discrete, which is also not a realistic assumption. Why should there be studies with 5% power (true nulls) and 17% power, but not studies with 10% power?

      You also clarify that we have two k = number of studies: that of all studies that were run, and the k of studies with significant results that are used for estimation. The z-curve package requires 10 significant results as a bare minimum, but to evaluate the model, we used k = 100, 500, and 1000. Maybe we should say, “don’t use if k.sig < 100, but sometimes results with less are interpretable and useful.”

      Most important, running a study with k = 100 and 50% are z = 0, means that k.sig is well below 100.

      • > Indeed, the model does not assume that the true power of studies is distributed has a specific distribution or is discrete.

        As far as I can tell, all the simulation results you’ve published to support the performance of your method are based on mixtures of studies with power 0.05, 0.17, 0.50, 0.85, 0.98, 0.999 or 0.99997.

    • Carlos: Why misleading? The normal mixture distribution of the z-values is the convolution (sum) of a discrete distribution on 0,1,2,…,6 and the standard normal distribution. Bartos and Schimmack (2022, p6) write: “the power values implied by the 7 components are 0.05, 0.17, 0.50, 0.85, 0.98, 0.999, 0.99997” which is equivalent to: “the SNRs implied by the 7 components are 0, 1, …, 6.”

      > The “collapse” discussed is correlated with having a low number of “valid” datapoints and even more correlated with those significant z-scores being concentrated (low standard deviation).

      You can use my script to see what happens when you sample n=1000 z-values from N(2,1) or from 0.5xN(1,1) + 0.5xN(2,1). Many of the confidence intervals around p0 will collapse to [0,0]. That’s not correct, but since p0 is zero the consequences for estimating the EDR are minor.

      • > Why misleading?

        I wouldn’t say either that doing a quadratic approximation “assumes” that the data is quadratic. (Of course the approximation may be good, less good or completely useless.)

      • > Many of the confidence intervals around p0 will collapse to [0,0]. That’s not correct, but since p0 is zero

        I don’t understand the reasoning. If the true value is zero (?) what would be incorrect about the interval [0 0]?

        • Carlos: It’s not a correct representation of the uncertainty. The interval [0,0] suggests that you’re 100% sure that p0 is zero *just* from observing the z-values that exceed 1.96.

        • A confidence interval is not a representation of uncertainty. It’s a thing that should cover the true value (which happens to be zero in your example) often enough. The model is perfectly specified here as a mixture of standard Gaussians and I see no reason why you wouldn’t be able to estimate the parameters from *just* part of the distribution if you have enough data.

        • Carlos: Zero-length confidence intervals are just not a thing in statistics. In the context of z-curve, it would mean that you can have *absolute certainty* in distinguishing z ~ N(2,1) from z ~ 0.01xN(0,1)+0.99xN(2,1) based on only the z-values that exceed 1.96. Not with a million observations!

        • > based on only the z-values that exceed 1.96.

          Do you find that this part is relevant at all? If it can be done based on all the z-values it can be done based on an interval. We want to write the observed curve segment as the sum of seven distinct curve segments. That seems doable in principle.

          Apart from that I understand that in this example you get a confidence interval that you don’t like for a parameter that doesn’t matter so we may leave it there.

        • Carlos: True – you also cannot have absolute certainty about p0 from all the z-values.

          > Apart from that I understand that in this example you get a confidence interval that you don’t like for a parameter that doesn’t matter so we may leave it there.

          Apart from the consequences for the estimation of the EDR and related tests, some users may be interested in p0 itself. It’s the proportion of true “nulls”.

  6. Following up on Carlos, who found no problem with the confidence interval in their own simulation with k = 300 significant z-values,

    I ran a simulation with k = 500 significant results and obtained 95% coverage with the default settings of z-curve.

    FALSE TRUE
    5 95

    code:

    library(zcurve)

    pow0 = pnorm(0,1.96) + pnorm(-1.96,0)
    pow1 = pnorm(1,1.96) + pnorm(-1.96,1)
    pow2 = pnorm(2,1.96) + pnorm(-1.96,2)
    pow3 = pnorm(3,1.96) + pnorm(-1.96,3)

    w = c(.5,.2,.2,.1)
    true.edr = pow0*w[1] + pow1*w[2] + pow2*w[3] + pow3*w[4]
    true.edr

    n.sim = 100

    k.sig = 500

    res = c()

    i = 1
    for (i in 1:n.sim) {

    k = k.sig * 20

    z = c(
    rnorm(k*0.5,0,1),
    rnorm(k*0.2,1,1),
    rnorm(k*0.2,2,1),
    rnorm(k*0.1,3,1)
    )

    abs.z = abs(z)
    abs.z = abs.z[order(runif(k))]
    abs.z.sig = abs.z[abs.z > 1.96]
    abs.z.sig.sel = abs.z.sig[1:k.sig]

    # 2. Fit z-curve with component means 0,1,...,6 and unit standard deviations

    ncz = 0:6
    sds = rep(1,length(ncz))

    fit <- zcurve(
    abs.z.sig.sel,
    method = "EM",
    control = list(
    mu = ncz, sigma = sds
    )
    )

    res = rbind(res,c(t(summary(fit)$coefficients)))
    print(res[i,])
    # running coverage: is the true EDR inside the interval [res[,5], res[,6]]?
    print(table(res[,5] < true.edr & res[,6] > true.edr))

    }

    ### check results

    dim(res)
    table(res[,5] < true.edr & res[,6] > true.edr)
    summary(res)

    • > Following up on Carlos, who found no problem with the confidence interval in their own simulation with k = 300 significant z-values,

      Just for the record I just set n=300 (instead of n=100) in Erik’s script.

      That increased the number of significant z-values when the script was executed from an average of 24.35 (median 24, range 16 to 36) to an average of 73.02 (median 73, range 60 to 97).

      The reported bootstrap (robustified) coverage went from 0.87 (0.89) to 0.94 (0.95), pretty close to the nominal level.

      • I see. So less than 100 significant values.
        Fits my results that 50 sig gets you close to 95% coverage.
        Thanks again for flagging the small k issue.

  7. I see some similarity between these posts about z-curve and Andrew’s concerns about selection for statistical significance.

    Andrew has repeatedly criticized selection for statistical significance because it leads to exaggerated conclusions: conditioning on “interesting” results without reporting the full set of attempts makes effects look larger or more problematic than they really are. I agree with that critique.

    What strikes me is that a similar issue can arise in methodological criticism if we selectively present stress scenarios in which a method performs poorly, without embedding them in a pre-specified simulation design or reporting how often those failures occur relative to benign cases. Showing that a method can fail is almost never informative by itself—nearly all methods can be made to fail under low information or adversarial conditions. What matters is when, how often, how badly, and whether users can detect the risk.

    In that sense, presenting several scenarios that all reject good performance of z-curve, without equal visibility of scenarios where it performs adequately (including ones already published), risks creating the same kind of inferential distortion that selection for significance creates in applied research. This is not about intentions or correctness of individual examples; it is about how evidence is aggregated and communicated.

    A proper stress test—whether of banks or statistical methods—requires a planned design and reporting all results across the design space. Otherwise, “it can fail” becomes the conclusion, which is true but not actionable.

    I think this is a deeper issue of consistency in standards. If we object to selective reporting in empirical science, we should be cautious about endorsing selective stress demonstrations in methodological debates as well.

  8. Hi Erik,

    Here is the requested disclaimer:

    Disclaimer: z-curve EDR estimates should not be interpreted when (a) the average true power of significant studies is over 90% (Erik sim 1), (b) the weights show no evidence of heterogeneity (EDR = ERR, just use ERR) (Erik sim 2), or (c) the set of significant results to be fitted is small (k.sig < 50).

  9. František Bartoš just uploaded zcurve version 2.4.6 to CRAN. If there are fewer than 300 z-values exceeding 1.96, the function zcurve() issues a warning: “Warning: The z-curve method is meant for large samples of test statistics. It might produce undercoverage and biased estimates of EDR in small sample sizes.”

    • As a sideline observer of this, this practical result is pretty cool (in both the sense of being nice, and in the sense of being a cooled-down and reasonable outcome of the exchange!).

  10. One argument implied by the z-curve authors’ replies seems to be that certain distributions of z values are more common than others and the method works well for these common distributions. Do we know what the distributions look like in practice, both with publication bias (easy, just grab published replication studies of famous effects) and without (harder, but weren’t there big open replication efforts in psychology a few years back)? One could observe these, simulate something like them and test the method under these conditions. Maybe this was the starting point of this series of posts and I’ve forgotten due to the distracting nature of some of the comments.
