Joshua Vogelstein writes:

I know you’ve discussed this on your blog in the past, but I don’t know exactly how you’d answer the following query:

Suppose you run an analysis and obtain a p-value of 10^-300. What would you actually report? I’m fairly confident that I’m not that confident :) I’m guessing: “p-value \approx 0.”

One possibility is to determine the accuracy with which one *could* in theory know it, by virtue of the sample size, and say that the p-value is less than or equal to that. For example, if I used a Monte Carlo approach to generate the null distribution with 10,000 samples, and I found that the observed value was more extreme than all of the sample values, then I might say that p is less than or equal to 1/10,000.
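That bound falls out of the standard (r + 1)/(n + 1) Monte Carlo estimator, which by construction can never return exactly zero. A toy sketch in Python (the statistic value and the null distribution here are invented for illustration):

```python
import random

def monte_carlo_p(observed_stat, null_draws):
    """Standard Monte Carlo p-value estimator (r + 1) / (n + 1):
    it counts the observation itself, so it never returns exactly zero."""
    r = sum(1 for d in null_draws if d >= observed_stat)
    return (r + 1) / (len(null_draws) + 1)

random.seed(0)
null = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # invented null distribution
p = monte_carlo_p(8.0, null)  # observed statistic beyond every null draw
print(p)  # 1/10001 ≈ 1e-4: the smallest p reportable from 10,000 draws
```

However extreme the observed statistic, 10,000 null draws cannot support a claim smaller than about 10^-4.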

My reply: Mosteller and Wallace talked a bit about this in their book: the idea that there are various other 1-in-a-million possibilities (for example, the data were faked somewhere before they got to you), so p-values such as 10^-6 don’t really mean anything. On the other hand, in some fields such as genetics, with extreme multiple comparisons issues, they demand p-values on the order of 10^-6 before doing anything at all. Here I think the solution is multilevel modeling (which may well be done implicitly as part of a classical multiple comparisons adjustment procedure). In general, I think the way to go is to move away from p-values and instead focus directly on effect sizes.

Personally, I’d be skeptical of a p-value like that just because it’s not clear to me how a computer using 64-bit arithmetic could be trusted to generate a value that small reliably. The possibility of floating-point error is one of the reasons why R typically reports very small p-values as < 2.2e-16.
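That 2.2e-16 is just double-precision machine epsilon (.Machine$double.eps in R). A quick illustration in Python:

```python
import sys

eps = sys.float_info.epsilon
print(eps)  # 2.220446049250313e-16 — R's familiar "< 2.2e-16" cutoff

# Near 1.0 a double cannot resolve differences smaller than eps, so any
# routine that computes p as 1 - cdf(x) bottoms out around here:
print(1.0 - (1.0 - 1e-17))  # 0.0 — the 1e-17 is lost entirely
# Software reports values like 1e-300 only by working with the upper-tail
# (survival) function directly, often on the log scale.
```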

To report an effect size for the parameters, as opposed to merely an observed effect size, one still needs to report an error probability, be it a confidence level or a series of intervals, power or severity curves, p-value distributions, or the like. I have seen people report observed differences as if they were equal to the population effect size or parametric magnitudes. So my point is that shelving p-values does not mean shelving error probability computations.

A few points, based on my experience in high-throughput genetics.

First, beyond 10^-200 or so one tends to run into numerical accuracy issues, depending on the software being used and how expertly it’s used. So it gets difficult to state p-values with accuracy when they are so tiny/”extreme”. Happily, it also doesn’t matter in practice; we can be certain the p-value beats any threshold you are likely to set for it. While I’d prefer not to write p \approx 0 (because p certainly is not zero), writing e.g. “p less than 10^-20” is fine.

Second, in high-throughput studies we really do care about accurate calibration of p-values “way out in the tail”, and the data is of sufficient quality that Mosteller and Wallace’s concerns don’t often apply. However, asymptotic properties can, in many practical situations, lead to computed p-values not being close enough to U(0,1) under the null; one might get a computed p-value of 10^-10 when it should be 10^-7. Normally being inaccurate in the umpteenth decimal place like this is utterly irrelevant, but when one is expecting results truly at 10^-7 by chance alone (because of the massive multiplicity issue) then one has to be much more careful about asymptotic accuracy.
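A toy illustration of how far off an asymptotic tail can be (a plain normal approximation to an exact binomial tail; this is not any particular genetics pipeline, just the simplest case where both answers are computable):

```python
import math

# Exact one-sided tail: P(X >= 90) for X ~ Binomial(100, 1/2)
exact = sum(math.comb(100, k) for k in range(90, 101)) / 2**100

# Plain normal approximation: z = (90 - 50) / 5 = 8
approx = 0.5 * math.erfc(8 / math.sqrt(2))

print(exact)           # ≈ 1.5e-17
print(approx)          # ≈ 6.2e-16
print(approx / exact)  # ≈ 40: off by over an order of magnitude out here
```

In the bulk of the distribution the two would agree to several decimals; it is only far out in the tail, exactly where multiplicity forces you to look, that the relative error blows up.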

Finally, I’m not sure multilevel modeling alone is the way to go here, though it is of course great in many other situations. Most high-throughput genetics work I see is essentially pure testing; we are looking for genetic variants which will be of interest in subsequent studies, versus those with effects too small/noisy to usefully follow up. Finding the replicable variants (and determining their sign) pretty much exhausts the information in the data; being aware of the consequences of having low power is standard when interpreting the results. The field is really strict about positive results being independently replicated before anyone trusts them, and does not pretend that negative results indicate truly zero effects. These distinctions seem to me to line up well with those in significance testing (a la Fisher, who obviously thought about genetics too) and it seems to me to be a good approach. It is also relatively straightforward to implement; multilevel models, in contrast, would be a major challenge to implement with millions (or now billions) of genetic covariates on many thousands of people. Multilevel modeling is also difficult to implement via meta-analysis, which we typically use to combine information from many contributing studies, each with their own complications.

For a partial use of random effects modeling in genetics, you might be interested in Mike Wu’s SKAT method, which tests for disease association within a genetic region by examining heterogeneity of effects. This is now being used in practice, although often in tandem with non-random-effects methods. Also, when implemented, each region still results in a p-value, and these p-values are then checked for “extremity” in the same way as before, so it’s not that far from earlier approaches.
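The core of that kind of method is a variance-component score statistic, Q = Σ_j w_j (g_j′r)², referred to its null distribution. Here is a toy sketch, using a permutation p-value in place of SKAT’s analytic mixture-of-chi-squares calibration; the data, weights, and the intercept-only null model are all invented:

```python
import random

def score_stat(G, resid, w):
    """Toy variance-component score statistic in the spirit of SKAT:
    Q = sum_j w_j * (g_j' resid)^2, which grows when variants in the
    region co-vary with the null-model residuals, in either direction."""
    q = 0.0
    for j in range(len(G[0])):
        s = sum(G[i][j] * resid[i] for i in range(len(G)))
        q += w[j] * s * s
    return q

def perm_p(G, resid, w, n_perm=999, seed=0):
    """Permutation p-value for Q (SKAT itself uses an analytic
    mixture-of-chi-squares null rather than permutation)."""
    rng = random.Random(seed)
    q_obs = score_stat(G, resid, w)
    r, hits = list(resid), 0
    for _ in range(n_perm):
        rng.shuffle(r)
        hits += score_stat(G, r, w) >= q_obs
    return (hits + 1) / (n_perm + 1)

# Invented data: 60 subjects, 4 variants coded 0/1/2; variant 2 is causal.
rng = random.Random(3)
G = [[rng.randint(0, 2) for _ in range(4)] for _ in range(60)]
y = [0.7 * g[2] + rng.gauss(0, 1) for g in G]
ybar = sum(y) / len(y)
resid = [yi - ybar for yi in y]   # residuals from the null (intercept) model
p_region = perm_p(G, resid, [1.0] * 4)
print(p_region)
```

Note the output is, as the comment says, still a single per-region p-value that then enters the usual extremity screening.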

Check out Efron’s book, which essentially suggests hierarchical models fit with “empirical Bayes” to adjust for overdispersion and then calibrate false discovery rates:

http://www.amazon.com/Large-Scale-Inference-Estimation-Prediction-Mathematical/dp/0521192498

This is the same basic approach I’ve seen used in packages like edgeR, which is widely applied to short-read sequencing data for tasks like estimating differential expression.

http://www.bioconductor.org/packages/2.11/bioc/html/edgeR.html

The microarray estimation package in wide use that I know of is dChip, and that also does something like a hierarchical model by using many replicates to “calibrate” a chip’s sensor responses.

https://sites.google.com/site/dchipsoft/

The geneticists I’ve talked to are all highly skeptical of the devices and the variation due to the lab, which, as with microarrays, can often swamp these other calibrations on an experiment-by-experiment basis. The intra-lab variation is typically much less than the inter-lab variation. This seems like a ripe area for exploring hierarchical models.

Masanao, Andrew, and I developed a Bayesian approach like that of edgeR for differential expression data for the revision of Andrew, Masanao and Jennifer’s multiple comparison paper, but I don’t know if it showed up in the published version (which I can’t see right now due to the paywall).

http://www.tandfonline.com/doi/abs/10.1080/19345747.2011.618213

Reporting a p-value implies that you know the distribution of the test statistic you’ve calculated. How often are you confident in the tails of that distribution down to the 1e-6 level, never mind the 1e-16 or 1e-100 level? That’s a question for the general audience. I’m curious.

From my own work, I have a hard time imagining a scenario where I’d ever encounter a homogeneous population with 10^6 members – a few thousand, or perhaps tens of thousands if I got lucky, but that’s about it. Sure, I can collect a large number of samples, but in doing so I’ll be drawing from multiple populations, many with similar but distinguishable and a priori unknown characteristics. Those variations wreak havoc on the tails of test statistic distributions. In practice, I can make inferences down to maybe the p=1e-6 level, but I don’t do so from the presumption that my data (and associated test statistic values) are normally-distributed (or chi-squared-distributed or F-distributed or any-other-flavor-distributed). Real data produce a lot of variation in the tails of my test statistic distributions, i.e., at the four-sigma, five-sigma, six-sigma level. Because of that variation, if I want to make inferences below, say, p=0.0001, I can’t do so by looking at isolated test statistic values. I need to look at the values of neighboring points and make arguments about the relationships between them.

Anyhow, my own work leads me to react to reports of p<epsilon values with skepticism. That said, I'm curious whether there are fields where it's possible to nail down distributions to the 1/(huge number) level. Sure, I can calculate pdf(x) down to ridiculously small values, but how well does the calculated distribution match up against real data at that level? If I compare pdf(x) against histogram(x), where do the values diverge – or, if they don't diverge, where do I run out of information in histogram(x) to make strong assertions about pdf(x)? What does your data tell you?
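One way to make this concrete: simulate data that are nearly normal in the bulk but slightly heavy-tailed, and compare the empirical tail frequency against the assumed normal pdf. A sketch, with Student-t (5 df) standing in for “real data” – the distributions here are my invention, not the commenter’s actual data:

```python
import math, random

random.seed(1)
N = 100_000

# "Real data": close to normal in the bulk, heavier in the tails —
# Student-t with 5 df, built from normals (t = z / sqrt(chi2_5 / 5)).
def t5():
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(5))
    return z / math.sqrt(chi2 / 5)

draws = [t5() for _ in range(N)]

# Empirical fraction beyond "four sigma" vs. the assumed-normal prediction
empirical = sum(abs(x) > 4 for x in draws) / N
normal_pred = math.erfc(4 / math.sqrt(2))   # two-sided P(|Z| > 4) ≈ 6.3e-5

print(empirical, normal_pred, empirical / normal_pred)
```

A histogram of these draws would look indistinguishable from a normal curve, yet the four-sigma exceedance rate is wrong by roughly two orders of magnitude – exactly the kind of tail mismatch that makes isolated extreme p-values untrustworthy.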

To clarify: I do process large samples – say on the order of 10^5 to 10^6 samples per second. Those samples draw from multiple populations whose characteristics are a priori unknown and which I generally don’t get to probe after the fact to establish ‘truth’. I have to rely on statistical characterizations. I also need to make an H1/~H1 decision for every sample. Pretty much all samples are H0, but the test statistic distribution under H0 can vary significantly in the tails – nominally for p<0.0001.

Signal hypotheses:

H0: x = b + e
H1: x = a·s + b + e

where

x = n-dimensional vector representing the measurement
b = n-dimensional vector representing the baseline signal
s = n-dimensional vector representing the signal of interest
a = scalar amplitude
e = measurement error, nominally multivariate-normal distributed – just don't look too closely at the tails of the distribution

In some cases you can reasonably model b with a pdf. In other cases you might model b as a linear combination of basis vectors and apply either weak or no constraints on the weight coefficients, i.e., use ridge regression or unconstrained regression. (Specific to the data I'm analyzing, modeling b as a manifold looks pretty good in terms of being a high-fidelity representation of the data, but doing so makes real-time analysis a mess. Often it's better to get a reasonable answer now using a moderate-fidelity model than a great answer in a week using a high-fidelity model.)
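For the simplest version of this setup – s and b both known, e ~ N(0, σ²I) – the natural detector is a matched filter whose statistic is exactly standard normal under H0. A minimal sketch, with an invented sinusoidal signal shape and flat baseline (real baselines are, as above, much messier):

```python
import math, random

def detect(x, s, b, sigma):
    """Matched-filter statistic for H1: x = a*s + b + e vs H0: x = b + e,
    with s and b known and e ~ N(0, sigma^2 I). Under H0,
    t = s·(x - b) / (sigma * ||s||) is exactly N(0, 1)."""
    r = [xi - bi for xi, bi in zip(x, b)]
    num = sum(si * ri for si, ri in zip(s, r))
    return num / (sigma * math.sqrt(sum(si * si for si in s)))

random.seed(2)
n, sigma = 64, 1.0
s = [math.sin(2 * math.pi * k / 16) for k in range(n)]  # invented signal shape
b = [0.5] * n                                           # invented flat baseline

x0 = [bi + random.gauss(0, sigma) for bi in b]                         # an H0 draw
x1 = [0.8 * si + bi + random.gauss(0, sigma) for si, bi in zip(s, b)]  # an H1 draw

t0_stat = detect(x0, s, b, sigma)  # a standard-normal draw under H0
t1_stat = detect(x1, s, b, sigma)  # centered near a*||s||/sigma ≈ 4.5 under H1
print(t0_stat, t1_stat)
```

The commenter’s point is precisely that the N(0, 1) claim for t holds only as far as the normality of e holds, which in the tails it often doesn’t.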

I agree that we should focus directly on effect sizes. But we should also measure our uncertainty in those effect sizes, right?

In genetics (where we’re now often up to testing what looks like 10^8 parameters at a time), one thing to remember is that many of those violations of the data-generating assumptions will affect a large number of the tests and be fairly obvious on a QQ plot. People will still report the p-value as whatever near-zero number comes out when the data are completely incompatible with the null.

Simulation-based estimates of p-values are ubiquitous (often case-control permutation, occasionally bootstrap), though the rationale is one of robustness rather than the accuracy of the asymptotic reference distribution (the most common tests are just differences in means). The requirement to do about 5/p independent simulations places a lower bound on what people will report in that setting. Usually people stop at genome-wide-significant/100 or so. We’re currently playing with some techniques from statistical physics that may allow even sillier simulation-based p-value estimates.
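The 5/p figure follows from the binomial variance of the Monte Carlo estimate p̂ = hits/n: its relative standard error is √((1−p)/(np)), so n = 5/p draws still leave roughly 45% relative error. A quick check (the thresholds are my choice of example):

```python
import math

def rel_se(p, n_sims):
    """Relative standard error of the Monte Carlo estimate p_hat = hits/n:
    sqrt(Var(p_hat)) / p = sqrt((1 - p) / (n * p))."""
    return math.sqrt((1 - p) / (n_sims * p))

p = 5e-8                   # genome-wide significance threshold
print(rel_se(p, 5 / p))    # ≈ 0.45: 5/p draws still leave ~45% relative error
print(rel_se(p, 100 / p))  # ≈ 0.10: pinning p down to ~10% needs 100/p draws
```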

Thanks, everybody, for your helpful responses!