## Statistical-significance filtering is a noise amplifier.

The above phrase just came up, and I think it’s important enough to deserve its own post.

Well-meaning researchers do statistical-significance filtering all the time—it’s what they’re trained to do, it’s what they see in published papers in top journals, it’s what reviewers for journals want them to do—so I can understand why they do it. But it’s a mistake, it’s a noise amplifier.

To put it another way: Statistical significance filtering has two major problems:

– Type M errors. We’ve talked about this a lot. If you publish what is statistically significant, you bias your estimates upward. And this bias can be huge; see for example here or section 2.1 here.

– Noise amplifier. That’s what I’m focusing on here. P-values are super noisy. You’re trained to think that p-values are not noisy—you’re given the (false) impression that if the true effect is zero, there’s only a 5% chance that you’ll get statistical significance, and you’re also given the (false) impression that if the effect is real, there’s an 80% chance you will get statistical significance. In fact, whether the underlying effect is real or not, your p-value is noisy noisy noisy (see here), and selecting what to report, or deciding how to report, based on statistical significance, is little better than putting all your findings on a sheet of paper, folding it up, cutting it a few times with scissors, and picking out a few random shards to publish. See section 2.2 here for an example (and not a particularly bad example, more like standard practice).
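To make the two problems above concrete, here is a minimal simulation sketch. It is not taken from the linked papers. It assumes a hypothetical setting in which many studies estimate the same true effect (0.5 standard errors, an arbitrary choice) with normal noise, and only estimates with |z| > 1.96 are reported. The function name and all parameter values are illustrative assumptions.

```python
import random


def significance_filter(true_effect=0.5, se=1.0, n_studies=100_000,
                        z_crit=1.96, seed=1):
    """Simulate many studies of one effect and keep only the 'significant' ones.

    Assumes true_effect > 0, so a negative kept estimate has the wrong sign.
    """
    rng = random.Random(seed)
    # Each study's estimate is the true effect plus normal noise.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # The filter: report an estimate only if its z-score clears the cutoff.
    kept = [est for est in estimates if abs(est) / se > z_crit]
    return {
        "share_significant": len(kept) / n_studies,
        "mean_of_all_estimates": sum(estimates) / n_studies,
        # How many times larger than the truth the reported magnitudes are on average.
        "exaggeration_ratio": sum(abs(est) for est in kept) / len(kept) / true_effect,
        "share_wrong_sign": sum(1 for est in kept if est < 0) / len(kept),
    }


if __name__ == "__main__":
    for name, value in significance_filter().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the unfiltered estimates average out close to the true effect, while every estimate that survives the filter is at least 1.96 standard errors in magnitude, so the reported magnitudes must be several times the true value.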

So, again, statistical-significance filtering is a noise amplifier. We should avoid filtering our results by statistical significance, not just because we’re worried about our “alpha level” or “p-hacking” or because it’s a “questionable research practice” or because of “multiple testing” or whatever, but because it adds noise to our already noisy data. And that’s just irresponsible. It’s a bad idea, even if it’s done in complete innocence.

1. Carlos Ungil says:

> you’re given the (false) impression that if the true effect is zero, there’s only a 5% chance that you’ll get statistical significance,

What would be the chance of getting statistical significance conditional on the true effect being zero then?

Unless it’s related to model misspecification, I’m not sure what the “noisy” character of p-values has to do with that. Of course a p-value will be anywhere between 0 and 1 when the true effect is zero, by definition.

> and you’re also given the (false) impression that if the effect is real, there’s an 80% chance you will get statistical significance.

I agree that this would be a false impression to have. The chance of getting statistical significance conditional on the effect being real (without giving a precise effect size) is not a well-defined concept.

• Andrew says:

Carlos:

1. What Anon said. The null hypothesis being tested is the hypothesis of no effect and no systematic error and all distributions specified correctly. Even with no effect, there will still be systematic error and model misspecification.

There are also technical issues regarding discrete data and composite null hypotheses, but it was systematic error that I was thinking of here.

2. Studies are often designed with claimed 80% power, but the hypothesized effect sizes required for that calculation are typically unrealistically huge.

• Carlos Ungil says:

Ok. I guess statistical analysis of any kind is a noise amplifier when the model is wrong.

• Anoneuoid says:

The whole point of statistical analysis is to use it when the model is “wrong” (an approximation). Wrong models can still make useful predictions.

Example:
Say we had a physical model for a series of coin flips (initial positions, forces, etc). Then we wouldn’t use a binomial model that assumes the flips must be iid (there probably is some slight correlation or bias between one flip and the next).

However, since often we don’t have the info required for the physical model, we use the (wrong) statistical approximation that tells us about how likely various outcomes would be. Where do you see any “noise amplification” in this use?

Since we know it is wrong to begin with there is no point in checking whether it is “wrong” though… However at some point a model can be so wrong that it is better to use a different one.

• Andrew says:

Carlos:

I strongly disagree with your statement that “statistical analysis of any kind is a noise amplifier when the model is wrong.” Lots of statistical analyses (for example, lasso, wavelets, Bayes, deep learning) are based on regularization: they smooth noise rather than amplifying it.

• Carlos Ungil says:

I don’t much like my statement, to be frank. I understood that your “noise” includes systematic bias and I was using “noise” in an extended (but confusing) sense to refer to all the wrong things that may be pushing the result of the analysis far from the “true” result (regularization towards the wrong value is not very helpful).

• The simplified picture looks like this (tell me if I’m simplifying too much)

The basic question you ask your statistical analysis is “my data looks weird, what does it mean?”

p-values and similar methods reply: “it looks weird, it means it is True”

regularization methods reply: “it looks weird, it means it is False”

• I think it’s closer to:

NHST/p-values: this dataset looks weird, it must not come from our null model (and in practice we immediately assume this means it comes from our favorite model)

regularization does something more like “this particular dataset looks weird, but it has to compete with what we think is more likely, and it doesn’t have enough oomph to convince us, so we’re sticking to something closer to our default model until we get even more oomph from additional kinds of data”

The function of regularization is to make the transition between some default and some specific data informed model more ductile. The NHST/p-value method is terribly brittle, it’s either 100% intact null or broken in half and we pick the half we like.
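One way to picture the brittle versus ductile contrast is to compare a hard significance cutoff with a simple shrinkage rule. This is only a sketch under stated assumptions: the shrinkage rule is the posterior mean for a z-score with unit standard error under a hypothetical normal prior centered at zero, and `hard_threshold`, `shrink`, and `prior_sd` are illustrative names, not anything specified in the discussion.

```python
def hard_threshold(z, z_crit=1.96):
    """Brittle rule: keep the estimate untouched if 'significant', else report zero."""
    return z if abs(z) > z_crit else 0.0


def shrink(z, prior_sd=1.0):
    """Ductile rule: pull the estimate toward the default value (zero).

    Posterior mean assuming z ~ N(theta, 1) and a hypothetical prior
    theta ~ N(0, prior_sd**2).
    """
    weight = prior_sd ** 2 / (prior_sd ** 2 + 1.0)
    return weight * z


if __name__ == "__main__":
    for z in (1.0, 1.95, 1.97, 3.0):
        print(f"z={z:.2f}  hard={hard_threshold(z):.2f}  shrunk={shrink(z):.2f}")
```

Two nearly identical inputs on either side of the cutoff (1.95 and 1.97) get very different outputs from the hard rule, while the shrinkage rule moves them by almost the same amount.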

• Mikhail Shubin says:

Yes, you gave a more technical description while I was trying to be more poetic =)

2. Anoneuoid says:

> Unless it’s related to model misspecification, I’m not sure what the “noisy” character of p-values has to do with that.

Not sure what Andrew meant, but isn’t this always true in these cases? Setting the “effect” as zero is just one way to misspecify the model.

If any assumptions (eg, normality, i.i.d. samples) used to derive the model are violated then the model has been misspecified. I would guess that more than one assumption is usually false and this will be detected with large enough sample size.

3. I hope I contribute something to this discussion by a frequentist take on these concerns.

There’s no downside to testing many mutually exclusive null hypotheses, so testing a single null hypothesis is incredibly wasteful. This produces confidence intervals. Then False Coverage-statement Rate correction is a very flexible way of giving confidence intervals for selected results. Any decision making would then be based on these intervals, rather than estimates that have been biassed by selection.

• Andrew says:

Paul:

Agreed. Let me just add one thing. You refer to estimates that have been biased by selection, which is my Type M error issue listed above (indeed, Type M and Type S errors are purely frequentist concepts). The other point I wanted to emphasize is the noise amplification. In a frequentist context, I would prefer an analysis that reports all comparisons rather than one that uses any p-value cutoff. An example of a classical approach that does this is the half-normal plot of interactions in a 2^k factorial experiment. If decisions need to be ultimately made, that’s fine, but said decisions can be based on the spectrum of estimates, without any pre-selection based on statistical significance.

4. Terry says:

The concept of p-values being a “noise amplifier” confused me too. Maybe the following helps.

If we apply a simple 5% decision rule, lucky results become immortal while unlucky results are consigned to oblivion. This amplifies the distance between lucky and unlucky results, most egregiously for results that fall close to the 5% line. Two studies that are nearly identical twins get treated very differently if one’s p-value is 4.9% and the other’s is 5.1%.

The problem gets turbo-charged when the accept/reject line itself “jumps around” due to systematic error, model misspecification, etc. For some studies, only a few results improperly slip across the line into <5% territory, while for other studies, hordes of results stampede over the line. Then, our decision-making process itself is buffeted about by poorly understood errors in the model. Who knows how to quantify this?

Type M errors would be susceptible to this turbo-charging as well (I'm guessing). When hordes stampede across the line, Type M errors probably get larger because the barbaric results would be more extreme on average, whereas when only a few results politely slip across, they would stay closer to the 5% cutoff because they are more well-behaved. (Going out on a limb here.)

• Andrew says:

Terry:

Sure, but I wouldn’t focus on the 0.049 vs. 0.051 thing. P-values are so noisy that there’s no reliable difference between, say, p=0.2 and p=0.01 (which correspond to z-scores of 1.3 and 2.6, respectively), even though these seem very different to people. Even the famous p=0.005 has a z-score of only 3.1, and it would be not at all out of the ordinary to see z=1.3 for one sample and z=3.1 for another, even if these were two independent measurements of the exact same underlying effect.

See here and here for further discussions of this point.

Selecting on anything, without follow-up regularization, is a noise amplifier. Selection on p-values is particularly bad because there’s some way in which little differences in z-scores, easily attributable to pure noise, can show up as apparently very important differences in p-values. The same problem could arise with other methods such as Bayes factors if they were used for null hypothesis significance testing.
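A quick way to check how ordinary such gaps between z-scores are: if two independent studies measure the same underlying effect, each with unit standard error, their z-scores differ by a normal variable with standard deviation sqrt(2), whatever the true effect is. The sketch below uses only the Python standard library; the function names are illustrative, and the normal, unit-variance setup is an assumption.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, mean 0 and sd 1


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return 2 * (1 - std_normal.cdf(abs(z)))


def prob_gap_at_least(gap):
    """P(|z1 - z2| >= gap) for two independent z-scores sharing one true effect.

    z1 - z2 is normal with mean 0 and variance 2, regardless of the effect size.
    """
    return 2 * (1 - std_normal.cdf(gap / sqrt(2)))


if __name__ == "__main__":
    print(f"p for z=1.3: {two_sided_p(1.3):.3f}")
    print(f"p for z=2.6: {two_sided_p(2.6):.4f}")
    print(f"P(gap >= 1.3): {prob_gap_at_least(2.6 - 1.3):.2f}")
    print(f"P(gap >= 1.8): {prob_gap_at_least(3.1 - 1.3):.2f}")
```

Under these assumptions a gap of 1.8 between two z-scores turns up in roughly one pair of replications in five, so on its own it says little about whether the underlying effects differ.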

• Terry says:

“P-values are so noisy that there’s no reliable difference between, say, p=0.2 and p=0.01 (which correspond to z-scores of 1.3 and 2.6, respectively, even though these seem very different to people.”

So it’s a transformation issue. P-values stretch out the tail dramatically so it has the psychological effect of making close z-values LOOK much more distant. I agree this exaggerates results, often wildly, making them LOOK more impressive than they are.

But in another sense, this is not introducing additional noise in a statistical sense because it’s just a one-to-one transformation. If you mind your statistical p’s and q’s, all the original information is still there. It is just transformed to a more sensational scale.

I was trying to think of ways p-values actually added additional noise.

• P values add noise to decision making, model acceptance, etc. They turn something that is inherently continuous (say, the uncertainty in a parameter value) into something inherently binary (a yes-or-no decision about whether the thing is equal to 0 or close to the sample mean). That then gets propagated through a series of decisions into people investigating phenomena that aren’t even real in the first place.

• Chris Wilson says:

+1. Once you realize that p-values are used to make dichotomous decisions, in place of any kind of real probabilistic decision theory, the problems become very clear, obvious even. Generations of researchers were trained to sift significant from non significant results and build ‘stories’ around them, with the blessing of Statistics as a kind of oracular power. The whole thing is a house of cards, methodologically speaking. Sound science still got done in many quarters, but in spite of this model of statistics rather than aided by it. I hope Andrew and others gain far more traction and momentum soon!

• Martha (Smith) says:

+ many to both Daniel and Chris.

• Anoneuoid says:

> Once you realize that p-values are used to make dichotomous decisions

The problem isn’t the p-value though, you can do the same thing with Bayes’ factors or even without any math but “eyeballing” a chart and saying it looks different.

• Chris Wilson says:

I agree Bayes factors etc. are open to abuse. I certainly don’t want to replace one dumb dichotomous decision rule with another. BTW, I’m fine with the narrow use of p values as a continuous measure of discrepancy; my understanding is that was Fisher’s approach, fwiw. Really this is a wider structural problem with how science gets done, which is why this subject is so perennial and frustrating.

• Anoneuoid says:

> I agree Bayes factors etc. are open to abuse. I certainly don’t want to replace one dumb dichotomous decision rule with another. BTW, I’m fine with the narrow use of p values as a continuous measure of discrepancy; my understanding is that was Fisher’s approach, fwiw. Really this is a wider structural problem with how science gets done, which is why this subject is so perennial and frustrating.

Yep.

• Terry says:

I agree. Two points are being made. One is your point about the discontinuity at 5%. The other is about how p-values explode so modest differences in z values are exaggerated. The second is the point I was making here. The first screws up all sorts of things. The second is just a marketing trick so to speak.

There is a lot of confusion in the comments to this post. A lot of noise you might say.

• Christian Hennig says:

People here talk as if p-values are persons. There’s nothing inherently binary about p-values. The binary decision thing comes from people using thresholds for reporting. You can compute p-values without using thresholds for reporting, and then they are as noisy or as reliable as the original statistic of which they are a transformation.

• Carlos Ungil says:

I agree. The p-value is just a transformation. Like the z-score, for that matter. I don’t understand in which sense these statistics are more “noisy” than the underlying data.

It’s not literally true that p-values stretch out the tail. Actually they do compress the tail: for the normal example the whole real line (or half-line, for two-tailed tests) is mapped to the [0, 1] interval and unit segments far in the tail go into increasingly small segments close to 0.

Of course p-values are calculated to allow for some interpretation. Like the z-score, for that matter. The reason for calculating that the z-score, relative to some specific origin, is 1.3 or 2.6 or whatever is to be able to give a particular meaning to that number.
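The compression of the tail can be checked directly. The sketch below assumes a two-sided test against a standard normal, with illustrative function names, and computes how much of the [0, 1] p-value scale each unit-length segment of |z| occupies. It also prints the ratio of the two p-values discussed in this thread, since on a ratio scale the same pair of z-scores looks far apart.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, mean 0 and sd 1


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return 2 * (1 - std_normal.cdf(abs(z)))


def interval_width(z_low, z_high):
    """Length of the p-value interval that the segment [z_low, z_high] maps onto."""
    return two_sided_p(z_low) - two_sided_p(z_high)


if __name__ == "__main__":
    for k in range(4):
        print(f"|z| in [{k}, {k + 1}] covers {interval_width(k, k + 1):.4f} of the p scale")
    print(f"p(z=1.3) / p(z=2.6) = {two_sided_p(1.3) / two_sided_p(2.6):.1f}")
```

In absolute terms each successive unit of z takes up a smaller slice of the p scale, while the ratio between p-values keeps growing, which may be why both the "stretch" and the "compress" descriptions come up.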

• Andrew says:

Carlos:

1. Please look at the title of my post. It is “Statistical-significance filtering is a noise amplifier.” It is not “The p-value is a noise amplifier.” As you say, the p-value is just a function of data.

2. I very rarely see a problem where the p-value has a clear interpretation. Or, I should say, where the clear interpretation of the p-value has any relevance to the problem at hand.

• Carlos Ungil says:

Andrew, I agree with Terry that there are two different issues being discussed. The first one, which corresponds to the title of your post and is your main point, is that a deterministic decision rule based on the value of a statistic reaching a threshold will in some sense “introduce noise”: we get drastically different results for arbitrarily close inputs.

I don’t think anyone has a problem with that. But that argument works just the same for statistical-significance filtering based on p-values (below or above 0.05) as for selection on z-scores (say below or above 1.96 in absolute value).

The second point is that in the body of the post and in your comments you seem to say that p-values are particularly noisy. They are noisy, noisy, noisy. Super-noisy. I’m not sure what separates super-noisy from simply noisy. They don’t seem noisier than z-scores to me. Maybe your point is that they are as noisy as z-scores, but people don’t understand that.

“P-values are so noisy that there’s no reliable difference between, say, p=0.2 and p=0.01 (which correspond to z-scores of 1.3 and 2.6, respectively, even though these seem /very/ different to people.”

Wouldn’t the z-scores 1.3 and 2.6 also seem very different to people? Isn’t the point of z-scores to put these numbers in the context of a standard normal distribution?

“Selection on p-values is particularly bad because there’s some way in which little differences in z-scores, easily attributable to pure noise, can show up as apparently very important differences in p-values.”

How is selection on p-values particularly bad compared to selection on z-scores?

P=0.01 and p=0.2 would be respectively significant and non-significant for p-value-based filtering, but the same happens for z=1.3 and z=2.6 for z-score-based filtering.

• Carlos Ungil says:

I’m sorry for the missing words and other editing errors; I should have revised the text more carefully before sending.

To be clear, I don’t say that what you write is wrong but I think it could be misleading.

5. Carlos Ungil says:

Andrew, I know that you understand this. But maybe someone somewhere finds the following comments useful.

> P-values are so noisy that there’s no reliable difference between, say, p=0.2 and p=0.01 (which correspond to z-scores of 1.3 and 2.6, respectively, even though these seem very different to people.

The distribution of the p-values (for multiple replications of the experiment, which can be calculated assuming that the model is correct) depends on the value of the parameter of interest (the “true” value is unknown). Assuming that the null hypothesis is true (mu=0 or whatever) the distribution of p-values is essentially uniform in [0,1] (for simple cases, leaving aside some technicalities). P-values are in that case as noisy as they can be! There is no reliable difference between p=0.2 and p=0.01 in the same sense that for a fair die there is no reliable difference between a 1 and a 6.

The distribution of the p-values will become more concentrated towards zero (or maybe one, for one-tailed tests) as the value of the parameter diverges from the null hypothesis.
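A small simulation can illustrate both statements. It assumes the simplest case: a single normally distributed z-score with unit standard error, a correctly specified model, and a two-sided test. The function names, sample size, and seed are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, mean 0 and sd 1


def simulate_p_values(effect=0.0, n=50_000, seed=2):
    """Draw n z-scores centered at `effect` and convert each to a two-sided p-value."""
    rng = random.Random(seed)
    return [2 * (1 - std_normal.cdf(abs(rng.gauss(effect, 1.0)))) for _ in range(n)]


def share_below(p_values, cutoff):
    """Fraction of simulated p-values falling below the cutoff."""
    return sum(1 for p in p_values if p < cutoff) / len(p_values)


if __name__ == "__main__":
    for effect in (0.0, 1.0, 2.0, 3.0):
        p_values = simulate_p_values(effect)
        shares = ", ".join(f"P(p<{c})={share_below(p_values, c):.3f}" for c in (0.05, 0.2, 0.5))
        print(f"effect={effect}: {shares}")
```

With a zero effect the share of p-values below any cutoff should be close to the cutoff itself, which is what a uniform distribution looks like. As the effect grows, more of the mass piles up near zero.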

> Even the famous p=0.005 has a z-score of only 3.1, and it would be not at all out of the ordinary to see z=1.3 for one sample and z=3.1 for another, even if these were two independent measurements of the exact same underlying effect.

I think p=0.005 corresponds to z=2.8 (if we use a definition consistent with the values in the previous paragraph).

How “out of the ordinary” this result would be depends on what is the “true” underlying effect.

If the underlying (standardized) effect size is 0, the probability of getting p < 0.2 (z > 1.3) for one sample and p < 0.005 (z > 2.8) for another is 0.1%

If the underlying (standardized) effect size is 1, the probability of getting p < 0.2 (z > 1.3) for one sample and p < 0.005 (z > 2.8) for another is 1.4%

If the underlying (standardized) effect size is 2, the probability of getting p < 0.2 (z > 1.3) for one sample and p < 0.005 (z > 2.8) for another is 16%

If the underlying (standardized) effect size is 3, the probability of getting p < 0.2 (z > 1.3) for one sample and p < 0.005 (z > 2.8) for another is 55%

If the underlying (standardized) effect size is 4, the probability of getting p < 0.2 (z > 1.3) for one sample and p < 0.005 (z > 2.8) for another is 88%

If the underlying (standardized) effect size is 5, the probability of getting p < 0.2 (z > 1.3) for one sample and p < 0.005 (z > 2.8) for another is 99%
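For anyone who wants to check the figures, here is a sketch of one way to compute them. It assumes two-sided tests on independent normal z-scores with unit standard error, and cutoffs of p < 0.2 for one sample and p < 0.005 for the other; that reading is consistent with the 0.1% figure at zero effect, since 0.2 × 0.005 = 0.001. The function names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, mean 0 and sd 1


def z_threshold(p_cutoff):
    """|z| needed for a two-sided p-value below p_cutoff."""
    return std_normal.inv_cdf(1 - p_cutoff / 2)


def prob_below_cutoff(effect, p_cutoff):
    """P(two-sided p < p_cutoff) when z ~ N(effect, 1)."""
    z_crit = z_threshold(p_cutoff)
    upper_tail = 1 - std_normal.cdf(z_crit - effect)  # z lands above +z_crit
    lower_tail = std_normal.cdf(-z_crit - effect)     # z lands below -z_crit
    return upper_tail + lower_tail


def prob_both(effect, p_first=0.2, p_second=0.005):
    """Probability that one sample clears p_first and an independent one clears p_second."""
    return prob_below_cutoff(effect, p_first) * prob_below_cutoff(effect, p_second)


if __name__ == "__main__":
    print(f"z for p=0.2: {z_threshold(0.2):.2f}, z for p=0.005: {z_threshold(0.005):.2f}")
    for effect in range(6):
        print(f"effect={effect}: {100 * prob_both(effect):.1f}%")
```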

• Brent Hutto says:

Once you’ve said “…which can be calculated assuming that the model is correct…” you’ve missed the (in my opinion) key point. We use models that are at best approximations to reality and for which there is no way to establish whether they are correct or not. Your precisely calculated numbers “assuming the model is correct” ARE NOISE. That’s the key issue here.

• Carlos Ungil says:

Absolutely. But this applies to some extent to any analysis based on any model.

In the particular case of p-values, the case where the model is perfectly specified and the true parameter corresponds to the null hypothesis is the case where the noise will be larger. If in reality the model is misspecified or the parameter is not the one assumed in the calculation the actual p-value will be LESS NOISY.

6. Dikran Marsupial says:

FWIW as a reviewer, I am happy with papers that propose a new machine learning algorithm that doesn’t show a statistically significant improvement on the state of the art (or do so at a low level of significance, e.g. 0.1 rather than the usual 0.05), provided the algorithm has some “interesting” aspect or useful feature. I do like to see the test performed however, as if NHSTs have a useful purpose it is imposing a degree of self-skepticism and making us more appropriately moderate in our claims.

• Daniel says:

From my experience NHSTs have the opposite effect. Usually when people get a significant result they really believe it and it gets really hard to convince them otherwise, even with additional information that totally contradicts their claims.

• Dikran Marsupial says:

If people are taught how to use tools badly, that is what they will do, but that doesn’t mean they shouldn’t use the tools at all. The reviewing culture at journals can help with this, instead of hindering, even at that late stage.

• Anoneuoid says:

This argument always comes up. But when pressed, it always turns out there is no correct way to use NHST. No wonder everyone has such a problem using it correctly…

There are always one or more logical errors being used to justify it. Most here should be familiar with them by now: p-values are error rates, p-values are the probability a result won’t replicate, rejecting a statistical hypothesis means we can accept the research hypothesis, etc.

• Dikran Marsupial says:

I disagree that there is no correct way to use an NHST – the key issue is to understand the framework, its meaning and its limitations.

The real problem (for me) is that most people want a test that gives the probability a hypothesis is correct; a frequentist NHST fundamentally cannot give you that, but a Bayesian one can (assuming that you only have two hypotheses from which to choose, one of which is the null and the other is your research hypothesis).

• Anoneuoid says:

I can already tell where this is headed, but the best way is to work with a real life example rather than speak in the abstract. If you have one I will gladly point out where the logical error is.

• Dikran Marsupial says:

Working out whether there is evidence to suggest that a coin is biased? Of course this may also involve consideration of statistical power and whether the effect size is of practical significance, but that is just part of the framework.

• Dikran Marsupial says:

BTW it is pretty condescending attitude you are displaying there, which is somewhat ironic given I suggested the main benefit of NHSTs is mostly to do with self-skepticism! ;o)

• Anoneuoid says:

> Working out whether there is evidence to suggest that a coin is biased? Of course this may also involve consideration of statistical power and whether the effect size is of practical significance, but that is just part of the framework.

No, please a real published example with real life consequences. I will note that there is no mint I could find that flips coins a bunch of times and checks the result with NHST as part of quality control.

> BTW it is pretty condescending attitude you are displaying there, which is somewhat ironic given I suggested the main benefit of NHSTs is mostly to do with self-skepticism! ;o)

I have just been over this hundreds of times and am already familiar with the “strain” of error that will be revealed based on your earlier post about a dichotomy between null vs research hypothesis.

• Dikran Marsupial says:

Evasion. Coin tosses are used in real life, e.g. to decide which team bats first in cricket.

• Andrew says:

Dikran:

You can load a die but you can’t bias a coin. Beyond this, even if there was such a thing as a biased coin, I recommend studying such biases directly rather than testing the null hypothesis of exactly zero bias.

In the real example of survey sampling bias, that’s what we do. We don’t test whether bias is zero, we estimate bias and adjust for it.

• Dikran Marsupial says:

I agree, however I suspect that there may be techniques for tossing a coin that give the appearance of the coin being biased. I’m happy to use a loaded die instead as it is a better real life example with obvious applications and costs.

I also agree that an NHST may not be the best way to do this; my point was that it is not necessarily logically flawed, provided you understand the meaning and limitations of the test, nor is it completely useless (just as back-of-the-envelope calculations are useful in science and engineering without being the best answer). I’m not greatly in favour of frequentist NHSTs, but I like to understand the value in statistical tools, rather than just the flaws.

• Andrew says:

Dikran:

I did write this article a few years ago: P-values and statistical practice, where I talked about good, mediocre, and bad p-values. So this might be what you’re looking for here.

• Dikran Marsupial says:

Thanks Andrew, I suspect I have read it before, but if so, it looks worth re-reading.

As I said, I think NHSTs can be useful as a “back-of-the-envelope” calculation, but I would strongly avoid the temptation to draw conclusions about the probability of the hypotheses.

• Dikran Marsupial says:

I should add, my original point was that when I am reviewing papers in machine learning, I *very* frequently see performance evaluations where the difference in performance clearly isn’t statistically significant, but the authors nevertheless claim superiority for their method. While NHSTs are flawed, it would be a much more accurate description of the results to acknowledge that the observed differences would not be particularly surprising if the performance of the new algorithm was the same as the benchmark method, and the claims moderated. So there is a sense in which NHSTs can act as a noise attenuator. In this setting it can be a case of “better a diamond with a flaw than a pebble without”?

As it happens, in my experience as an author, demonstrating statistically superior results over state-of-the-art methods is no guarantee of acceptance!

• Anoneuoid says:

> I’m happy to use a loaded die instead as it is a better real life example with obvious applications and costs.

I couldn’t find anyone who uses NHST for this either. Die manufacturers and users know the dice are imperfect, but have an expected lifetime (number of rolls) and are only expected to be close enough to the ideal die model for that number of rolls.

• Dikran Marsupial says:

This is again evasion. You said you would point out the logical flaw, but you haven’t. If there was a logical flaw in the application of NHSTs in general, you would be able to point it out in textbook examples as well as real life examples.

• Anoneuoid says:

All I asked for was a real life example where NHST is applied “correctly”… is that so hard?

Anyway, I offered to spend a bit of time to help you save possibly tens of thousands of hours on flawed analysis in the future (assuming you do that for work). It is not worth it if you insist on living in a world of stats 101 examples though.

• Anonymous says:

If there is a logical flaw in the method, it is there for textbook examples as well. Substitute it for reasonable evidence a die is biased if you prefer. Being arrogant is not a good way of convincing anybody, other than yourself, that you are right – it is best avoided IMHO.

• Anoneuoid says:

No, it really is that it will be a waste of time. People cannot be made to understand this stuff from the abstract or toy examples they learned in stats 101.

• Dikran Marsupial says:

Sorry, this kind of arrogant rhetoric is not the way to have a productive discussion of science or statistics, so I’ll leave you to argue with someone else.

• Anoneuoid says:

Sigh…

The logical flaw is:

There is not a one to one mapping of rejecting the null hypothesis to a research hypothesis.

There will be more than one explanation for why you rejected the null hypothesis ranging from the mundane to the very interesting. Just picking your favorite one is committing an “affirming the consequent” error.

• Dikran Marsupial says:

Did I make that argument? No.

• Anoneuoid says:

> Did I make that argument? No.

You didn’t make any argument yet. As I said, I can see the “strain” of your error from this post:

> The real problem (for me) is that most people want a test that gives the probability a hypothesis is correct; a frequentist NHST fundamentally cannot give you that, but a Bayesian one can (assuming that you only have two hypotheses from which to choose, one of which is the null and the other is your research hypothesis).

Now, because you haven’t been “nailed down” to a position or argument with a real life example you will continue to post stuff like that. Like I said, waste of time.

• Dikran Marsupial says:

No, I was careful about the meaning of the NHST, and what it can be used to show, which is not the same thing. Perhaps if you were not behaving so arrogantly, you might try and understand what I was saying, rather than assuming that you knew it already and thus substituting your own arguments as straw men. This kind of tiresome rhetorical argument is endemic on blogs and does nobody any good. I’ll leave it there.

• Anoneuoid says:

> No, I was careful about the meaning of the NHST, and what it can be used to show, which is not the same thing. Perhaps if you were not behaving so arrogantly, you might try and understand what I was saying, rather than assuming that you knew it already and thus substituting your own arguments as straw men. This kind of tiresome rhetorical argument is endemic on blogs and does nobody any good. I’ll leave it there.

The main point to me is that you asserted NHST could be used “correctly”, but won’t give an example from outside a stats 101 book. I am very interested in these examples.

If you perform a randomised controlled trial with allocation concealment and blinding, and no patients cross-over or are lost to follow-up, and there is a single pre-defined outcome measure blindly assessed then the distribution of P-values under the null is guaranteed to be uniform. In this case the p-value is a valid measure of the weight of evidence against the null. Violate any of the above conditions and the guarantee of uniformity is lost and you are better off using some other method than NHST. Good studies can be done under NHST but it’s not easy (and hence not common).

7. Hans says:

Any discretisation trades off simplification against increased noise. Trivial?

8. Z says:

The claims in this post are a little over-general in my opinion. They seem to apply to cases where there is differential measurement error or suspect modeling assumptions or forking paths etc. But these things need not be meaningfully present in all analyses. There are genuinely simple examples (e.g. a randomized clinical trial with two treatment arms and an easily measured outcome) where it is possible to perform model free randomization based inference and the probability of a p-value less than .05 if there is (very close to) no treatment effect is indeed (very close to) .05.
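For readers unfamiliar with randomization-based inference, here is a minimal sketch of a two-arm permutation test. It assumes complete randomization, a sharp null of no effect for any unit, and the difference in means as the test statistic. It also includes a rough calibration check on simulated null data. The data, function names, and settings are all hypothetical, and the sketch says nothing about whether real trials meet these conditions.

```python
import random


def mean_difference(values, n_treated):
    """Difference in means between the first n_treated values and the rest."""
    treated, control = values[:n_treated], values[n_treated:]
    return sum(treated) / len(treated) - sum(control) / len(control)


def permutation_p_value(treated, control, n_perm=2000, rng=None):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = rng or random.Random(0)
    pooled = list(treated) + list(control)
    observed = abs(mean_difference(pooled, len(treated)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-randomize the treatment labels
        if abs(mean_difference(pooled, len(treated))) >= observed:
            hits += 1
    # Count the observed assignment itself so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


def null_rejection_rate(n_trials=400, n_per_arm=10, n_perm=200, alpha=0.05, seed=3):
    """Share of simulated no-effect trials with permutation p-value below alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        if permutation_p_value(treated, control, n_perm, rng) < alpha:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    print(f"rejection rate under the null: {null_rejection_rate():.3f}")
```

In this idealized setup the rejection rate should land near the nominal 5%, up to simulation error.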

• Andrew says:

Z:

There are such cases, but they are rarer than you might think. Standard model-free randomization tests are based on a design that is almost never done (see here).

But in any case this technical point (that p-values don’t really have their advertised alpha level) is not the main point of the above post. My point about the alpha levels is just an aside. The main point of the above post is that statistical-significance filtering is a noise amplifier—and this point is valid irrespective of this alpha-level point. Even in cases with clear tests etc., it’s still a terrible idea to filter by statistical significance. And even if a study is clean with only one outcome, it is still part of a larger literature and so these selection issues keep coming up.

• Z says:

Completely agree on your main point. My criticism that “the claims in this post are over-general” was over-general.

• Anoneuoid says:

There are genuinely simple examples (e.g. a randomized clinical trial with two treatment arms and an easily measured outcome) where it is possible to perform model free randomization based inference and the probability of a p-value less than .05 if there is (very close to) no treatment effect is indeed (very close to) .05.

Do you have a real life example of this? I am doubtful that such a clean study exists.

9. Anoneuoid says:

If you perform a randomised controlled trial with allocation concealment and blinding, and no patients cross-over or are lost to follow-up, and there is a single pre-defined outcome measure blindly assessed then the distribution of P-values under the null is guaranteed to be uniform. In this case the p-value is a valid measure of the weight of evidence against the null. Violate any of the above conditions and the guarantee of uniformity is lost and you are better off using some other method than NHST. Good studies can be done under NHST but it’s not easy (and hence not common).

Sorry to be repetitive. Can you give a real life example?

• Martha (Smith) says:

I would also like to see a real-life example: blinding is rarely complete, and “no patients lost to follow-up” is just as rare. In addition, I’m not clear on what you mean by “valid measure”, and you would need to use intent-to-treat analysis to have a clinically meaningful result. (Probably other points as well.)

• Anoneuoid says:

While the study isn’t going to be perfect, the even bigger problem is they measured a difference but are going to want to conclude way more than merely “there was a difference”. And they will go ahead and (incorrectly) do so. The only person I’ve never seen make this error is Ronald Fisher.

It is something that learning to apply stats on toy models like dice and coin flips can’t teach you.

• Anoneuoid says:

Here is a perfect “real-life example” I just came across:
https://www.nytimes.com/2019/02/28/well/eat/trans-fat-bans-may-be-good-for-the-heart.html
https://ajph.aphapublications.org/doi/abs/10.2105/AJPH.2018.304930

They collected blood from people in NYC in 2004 and 2014, analyzed them recently, and compared the “trans-fat” content of the two. They found about 50% lower concentration of trans-fats in the 2014 samples and conclude it was due to a 2006 law that restaurants needed to reduce the amount of “trans-fats” in food.

For the study, “trans-fats” referred to these 4 molecules (MUFA = mono-unsaturated fatty acid, PUFA = poly-unsaturated):

However, a quick search shows it is also reported that fatty acid content can increase/decrease in frozen biological samples (fish fillets) by 10-30% in only a month and a half:

The results indicated that during frozen storage, SFA and MUFA content increased by 32.63 and 9.25%, respectively, while PUFA content decreased by 25.3%, n-6 by 12.4% and n-3 by 32.55%. These changes were more significant (P ≤ 0.05) during the first 45 d of storage.

http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-89132014000100015

Here is another (fish filet) study that shows palmitoleic acid tripling and linolelaidic acid dropping by ~60% after 6 months of storage:
https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1439-0426.2008.01176.x

So why not just as well conclude that the additional trans-fat in the older samples was formed while sitting in storage for 10 additional years? Maybe even someone left the freezer door open a bit too long one time in 2011 and the slight thawing caused this difference?

There is no reason to favor one explanation over the other.

• Anoneuoid says:

We did an international, partial-factorial, open-label, blinded-endpoint trial…patients were randomly assigned… to receive intensive (target systolic blood pressure 130–140 mm Hg within 1 h) or guideline (target systolic blood pressure <180 mm Hg) blood pressure lowering treatment over 72 h. The primary outcome was functional status at 90 days measured by shift in modified Rankin scale scores

So they used a PROBE design. I.e., the doctors and patients knew whether they received treatment for lowering blood pressure to 130-140 mm Hg vs only 180 mm Hg. Then the doctors (who knew the treatment group) subjectively scored each patient on a 0-6 scale of how severe the symptoms were.

PROBE designs are well known to be susceptible to investigator bias:

The possibility of investigator bias is, however, a drawback of the PROBE design. Therefore, measures should be taken in order to minimize such bias as much as possible. This could be done by careful instructions to investigators and by comparing therapeutic modalities of similar appeal.

https://www.tandfonline.com/doi/pdf/10.3109/08037059209077502

Obviously, any difference in Rankin score could be because the doctors liked (or didn’t) the intensive blood pressure lowering. I’m sure some found it dangerous and others probably safer.

Perhaps if they thought it was dangerous they would subject the patient to less strenuous rehab, so they get less practice on the “test” and recover slower. Or the less strenuous rehab meant the patients were more rested for the “test”, so they would perform better. Vice versa if they thought the patients would be more capable of rehab with lower blood pressure.

We can come up with all sorts of reasons like that for the mere existence of a difference between the groups. If I looked closer at how blood pressure was monitored, the types of blood pressure lowering interventions used, etc., I am sure I could come up with some others unrelated to the bias. Maybe lower blood pressure (or use of one of the interventions to make it happen) means the patients tend to be positioned differently in the bed, which affects how doctors do the scoring.

But the point is that is all NHST checks for: the existence of a difference. So if at the end of the study they saw a “significant” p-value, it still wouldn’t tell me whether it was actually better/worse for these patients to have lower blood pressure.

In this case they did not see a significant difference, which means the sample size was too small and/or the modified Rankin score is too messy a tool to detect it.

They achieved 6mmHg lower blood pressure in the treatment group (p<0.001 – strong evidence against the null of equal blood pressure in the 2 groups).
Minimal difference in the outcome (OR 1.01, p=0.8). So blood pressure different, outcome not.

NHST is ok sometimes. Often not.

Anyway it will be washed away sometime soon by the likelihood tsunami: Blume, Bickel, Zhang and Strug.

• Martha (Smith) says:

6mmHg seems well within the normal range of fluctuation of blood pressure in an individual, and with an unblinded study (which could prompt unconscious behaviors on the part of the patient or medical personnel that might affect blood pressure), it sounds as though 6mmHg is not a practically significant difference.

• Anoneuoid says:

So blood pressure different, outcome not.

As I said, the outcome was different, they just didn’t have a large enough sample size to detect it. The blood pressure was also different. Take it as a principle: Everything correlates with everything else, so there is always a difference between two groups in anything you can measure.
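To make the sample-size part of this argument concrete, here is a rough sketch of how the power of a simple two-sided two-sample z-test grows with sample size when the true difference is tiny but nonzero. The unit-variance normal model, the 0.01 SD effect, and the function name are assumptions chosen for illustration; none of them comes from the trial discussed above.

```python
from statistics import NormalDist

def two_sample_power(effect_sd_units, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with unit variance."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Mean of the z statistic when the true difference is effect_sd_units
    shift = effect_sd_units * (n_per_arm / 2) ** 0.5
    # Probability the statistic falls in either rejection tail
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

if __name__ == "__main__":
    # A hypothetical 0.01 SD difference at increasing sample sizes
    for n in (100, 10_000, 1_000_000):
        print(n, round(two_sample_power(0.01, n), 3))
```

With a fixed nonzero difference, power approaches 1 as the sample grows, so under this model "significant or not" mostly reflects sample size and measurement precision.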

NHST is ok sometimes.

It seems like your argument is that sometimes things are statistically significant and sometimes they aren’t, therefore NHST is ok. Please correct me if I am wrong.

Anyway it will be washed away sometime soon by the likelihood tsunami: Blume, Bickel, Zhang and Strug.

I searched these names and this is the first paper I found. It describes doing NHST but calling it “strong evidence favoring” instead of “statistical significance”:

According to the EP, one concludes strong evidence favouring θ_1 over θ_0 when L(θ_1;x)/L(θ_0;x)≥k for any n; k is defined by the investigator

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6284518/
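As a rough illustration of the quoted criterion, the sketch below computes the likelihood ratio for two simple hypotheses about a normal mean and compares it with a threshold k. The normal model with known sigma, the data, the value k = 8, and the function names are all assumptions made for this example; the linked paper may use a different setup.

```python
import math

def log_likelihood_ratio(x, theta_1, theta_0, sigma=1.0):
    """log L(theta_1; x) - log L(theta_0; x) for iid Normal(theta, sigma^2) data."""
    total = sum((xi - theta_0) ** 2 - (xi - theta_1) ** 2 for xi in x)
    return total / (2 * sigma ** 2)

def strong_evidence(x, theta_1, theta_0, k=8.0, sigma=1.0):
    """True when L(theta_1; x) / L(theta_0; x) >= k (k is investigator-chosen)."""
    return log_likelihood_ratio(x, theta_1, theta_0, sigma) >= math.log(k)

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 1.1, 0.9]  # hypothetical observations
    print(strong_evidence(data, theta_1=1.0, theta_0=0.0))
```

Whichever statistic is thresholded, the decision has the same shape: a number computed from the data is compared with a cutoff the investigator picks.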

It doesn’t really matter what math is used (p-values, likelihood ratios, bayes factors, whatever). The important part is that “theta” is something predicted by the research hypothesis and not a default strawman model. Then when “theta” is not consistent with the observations (by whatever method you want), the research hypothesis needs to be modified or discarded.

This is all explained quite clearly here:

Paul E. Meehl, “Theory-Testing in Psychology and Physics: A Methodological Paradox,” Philosophy of Science 34, no. 2 (Jun., 1967): 103-115. https://doi.org/10.1086/288135