It’s not just p=0.048 vs. p=0.052

Posted on September 6, 2019 9:16 AM by Andrew

Peter Dorman points to this post on statistical significance and p-values by Timothy Taylor, editor of the Journal of Economic Perspectives, a highly influential publication of the American Economic Association.

I have some problems with what Taylor writes, but for now I’ll just take it as representing a certain view, the perspective of a thoughtful and well-meaning social scientist who is moderately informed about statistics and wants to be helpful.

Here I’ll just pull out one quote, which points to a common misperception about the problems of p-values, or what might be called “the p-value transformation,” which takes an estimate and standard error and transforms it to a tail-area probability relative to a null hypothesis. Taylor writes:

[G]iven the realities of real-world research, it seems goofy to say that a result with, say, only a 4.8% probability of happening by chance is “significant,” while if the result had a 5.2% probability of happening by chance it is “not significant.” Uncertainty is a continuum, not a black-and-white difference.

First, I don’t know why he conditions on “the realities of real-world research” here. Even in idealized research, the p-value is a random variable, and it would be goofy to draw a sharp line between p = 0.048 and p = 0.052, just as it would be goofy to draw a sharp line between z-scores of 1.94 and 1.98.

To formalize this slightly, “goofy” = “not an optimal decision rule or close to an optimal decision rule under any plausibly reasonable utility function.”

Also, to get technical for a moment, the p-value is not the “probability of happening by chance.” But we can just chalk that up to a casual writing style.

My real problem with the above-quoted statement is not the details of wording but rather that I think it represents a mistake in emphasis.

This business of 0.048 vs. 0.052, or 0.04 vs. 0.06, etc.: I hear it a lot as a criticism of p-values, and I think it misses the point. If you want a bright-line rule, you need some threshold. There’s no big difference between 18 years old, and 17 years and 364 days old, but if you’re in the first situation you get to vote, and if you’re in the second situation you don’t. That doesn’t mean that there should be no age limit on voting.

No, my problem with the 0.048 vs. 0.052 thing is that it way, way, way understates the problem.

Yes, there’s no stable difference between p = 0.048 and p = 0.052.

But there’s also no stable difference between p = 0.2 (which is considered non-statistically significant by just about everyone) and p = 0.005 (which is typically considered very strong evidence.)

Just look at the z-scores:

> qnorm(1 - c(0.2, 0.005)/2)
[1] 1.28 2.81

The (two-sided) p-values of 0.2 and 0.005 correspond to z-scores of 1.3 and 2.8. That is, a super-duper-significant p = 0.005 is only 1.53 standard errors higher than an ignore-it-pal-there’s-nothing-going-on p = 0.2.

But it’s even worse than that. If these two p-values come from two identical experiments, then the standard error of their difference is sqrt(2) times the standard error of each individual estimate, hence that difference in p-values itself is only (2.81 – 1.28)/sqrt(2) = 1.1 standard errors away from zero.

To say it again: it is completely consistent with the null hypothesis to see p-values of 0.2 and 0.005 from two replications of the same damn experiment.

So. Yes, it seems goofy to draw a bright line between p = 0.048 and p = 0.052. But it’s also goofy to draw a bright line between p = 0.2 and p = 0.005. There’s a lot less information in these p-values than people seem to think.

So, when we say that the difference between “significant” and “not significant” is not itself statistically significant, “we are not merely making the commonplace observation that any particular threshold is arbitrary—for example, only a small change is required to move an estimate from a 5.1% significance level to 4.9%, thus moving it into statistical significance. Rather, we are pointing out that even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities.”

143 thoughts on “It’s not just p=0.048 vs. p=0.052”

Anoneuoid on September 6, 2019 9:48 AM at 9:48 am said:

But as they recognize, criticizing is the easy part. What is to be done instead? And here, the argument fragments substantially.
[…]
Other authors agree with turning away from “statistical significance,” but in favor of their own preferred tools for analysis: Bayesian approaches, “second-generation p-values,” “false positive risk,” “statistical decision theory,” “confidence index,” and many more. With many alteratative examples along these lines, the researcher trying to figure out how to proceed can again be forgiven for desiring little more definitive guidance.

Wow, not a single mention of science! People will apparently do anything to get away without doing actual science at this point. From recent discussions here I have seen that economists cant do “science” (come up with testable models) because the data is just garbage to begin with. So:

1) Collect reliable data
2) Come up with an explanation (theory) for why that data looks the way it does
3) Derive (either mathematically or logically) a prediction from that explanation (hypothesis)*
4) Collect new data** to test the prediction

* The prediction needs to be “otherwise surprising”, ie unlikely under most other plausible explanations. A correlation/effect exists or is in one direction does not meet this criteria.

** This should also be done by people who distrust the explanation

Reply ↓
- Andrew on September 6, 2019 10:22 AM at 10:22 am said:
  
  Anon:
  
  Rubin used to say that one important property of a good statistical method is that it allows the scientists to talk more about the science rather than the statistics. For example, one advantage of Bayesian inference is that it leads people to argue about what’s the best prior, rather than what’s the best estimator.
  
  A political science researcher or a psychology researcher or whatever will get all passionate about using the Winsorized mean, or some particular way of computing the maximum likelihood estimate, or a false positive rate, and I’d much rather see such people putting this mental effort into questions of effect size, variation, confounding, etc.
  
  Every field will have methodologists who will want to explore more deeply into questions of statistics and measurement, and that’s fine: but to the extent that researchers are being “methodologists” in their everyday work, I’d much prefer their methodology to have an applied focus, and I think one big problem with the null hypothesis testing framework is that it takes the focus away from the science and toward statistical theory.
  
  Reply ↓
  - Anoneuoid on September 6, 2019 10:45 PM at 10:45 pm said:
    
    Right, the entire focus is on the last half of step number 4. This should be a minor, relatively unimportant step but seemingly all important to these fields.
    
    It just doesn’t matter that much how you decide whether a prediction and observation are in agreement or if one measurement is close enough to another to count as a replication. Common sense “eyeballing” of the data works fine.
    
    Reply ↓
  - Sameera Daniels on September 7, 2019 7:40 AM at 7:40 am said:
    
    Well put Andrew
    
    Reply ↓
- Chris C on September 7, 2019 11:43 AM at 11:43 am said:
  
  Wow! Not one mention of oxygen, but every data analyst uses it every hour of every day! Peoplw will apparently do anything to get away with not mentioning the actual biology necessary to do statistics.
  
  Reply ↓
  - Anoneuoid on September 7, 2019 12:33 PM at 12:33 pm said:
    
    NHST and science are mutually exclusive. Breathing and science are not.
    
    Reply ↓
- M Lyle on September 11, 2019 1:39 PM at 1:39 pm said:
  
  Science is great, but we’re trying to figure out more and more difficult things, with confounds and biases. The data has *always* been garbage. People are easily fooled in myriad ways.
  
  That’s why we use statistical tools to try and separate the wheat from the chaff. But even they are error-prone and potentially misleading. Even so, I wouldn’t throw them out. ;)
  
  There is no substitute for very, very, very careful and skeptical reasoning.
  
  Reply ↓
Paul R Pudaite on September 6, 2019 10:07 AM at 10:07 am said:

“But there’s also stable difference” => “… also no stable …”?

Reply ↓
- Andrew on September 6, 2019 10:24 AM at 10:24 am said:
  
  Typo fixed; thanks.
  
  Reply ↓
Thanatos Savehn on September 6, 2019 10:25 AM at 10:25 am said:

“There’s a lot less information in these p-values than people seem to think.” And I get it! Which is again why I’m grateful for your blog and the many contributors who have helped me along my painful journey to statistical sobriety.

Reply ↓
- Martha (Smith) on September 6, 2019 4:36 PM at 4:36 pm said:
  
  “statistical sobriety” — Nice turn of phrase.
  
  Reply ↓
- Sameera Daniels on September 7, 2019 7:42 AM at 7:42 am said:
  
  Statistical sobriety > love that description
  
  Reply ↓
aram on September 6, 2019 10:29 AM at 10:29 am said:

“But there’s also stable difference between p = 0.2 (which is considered non-statistically significant by just about everyone) and p = 0.005”

There’s also NO stable difference …?

Reply ↓
Erik on September 6, 2019 10:39 AM at 10:39 am said:

“that difference in p-values itself is only” => “z-scores”?

Reply ↓
Simon Gates on September 6, 2019 11:04 AM at 11:04 am said:

48 and 52… those numbers. Couldn’t he have picked some different ones?

Reply ↓
- Martha (Smith) on September 6, 2019 4:38 PM at 4:38 pm said:
  
  Sounds like a middle-aged lament?
  
  Reply ↓
  - Andrew on September 6, 2019 5:30 PM at 5:30 pm said:
    
    Or maybe a reference to the result of the 2016 Brexit referendum.
    
    Reply ↓
    - Simon Gates on September 7, 2019 2:47 AM at 2:47 am said:
      
      Yes, that.
Michael Schwartz on September 6, 2019 11:12 AM at 11:12 am said:

I found this to be one of the better/clearer/most-convincing short-form pieces I have seen on this topic.

It makes me wonder about something seemingly strange. Couldn’t the instability of p-values be an arrow in the quiver of those who rail against the so-called “replication police”.

Wouldn’t a reasonable counter-argument be “So what if the experiment didn’t replicate – you can’t rely on p-values anyhow”. I’m not arguing in favor of this … I am all in on the importance of replication. But it does seem to me like a somewhat paradoxical situation.

Reply ↓
- Anoneuoid on September 6, 2019 11:25 AM at 11:25 am said:
  
  Yes, whether or not two results are both statistically significant in the same direction is not a good definition of replication. That is why I argue that the replication studies need to ensure they have sufficient power to “get significance” based on he effect size of the original study, in which case they will figure out that the a priori probability of a successful replication is 50%.
  
  For a result to be considered replicated it needs to be quantitatively similar enough, ie the uncertainty intervals need to overlap.
  
  Reply ↓
  - M Lyle on September 11, 2019 4:37 PM at 4:37 pm said:
    
    This seems pessimistic. If one does a trial and finds a statistically significant– p<.05, weak effect vs. control… and someone does a larger trial and finds a statistically significant, stronger effect in the same direction vs. control (p<.01)… The second trial certainly makes us more confident in rejecting the null hypothesis, doesn't it? Even if the uncertainty intervals don't quite overlap.
    
    I'm also not sure why you say the a priori probability of replication is 50%. Can you elaborate? If I do trials of what jelly bean colors cure cancer, I'll come up with some number of statistically significant effects… but almost none will be replicated.
    
    Reply ↓
    - Anoneuoid on September 11, 2019 4:54 PM at 4:54 pm said:
      
      The second trial certainly makes us more confident in rejecting the null hypothesis, doesn’t it?
      
      No, I am 100% confident the null hypothesis in this case should be rejected. The data is irrelevant.
      
      If I do trials of what jelly bean colors cure cancer, I’ll come up with some number of statistically significant effects… but almost none will be replicated.
      
      There is going to be some weird correlation between the color of jelly beans chosen and recovery from cancer under whatever specific circumstances. With sufficient sample size you will detect this, and if it is worthless information then 50% of the time the statistically significant effect will be in the same direction.
    - Keith O'Rourke on September 11, 2019 5:11 PM at 5:11 pm said:
      
      > Even if the uncertainty intervals don’t quite overlap.
      That often an indication of a serious problem and neither study should be taken as informative until the lack of overlap is understood.
    - Martha (Smith) on September 11, 2019 6:00 PM at 6:00 pm said:
      
      +1
- Andrew on September 6, 2019 11:33 AM at 11:33 am said:
  
  Michael:
  
  Sure, that would be no problem with me if people said this. I’d be just fine if Psychological Science, PNAS, etc., would just publish papers with raw data and scientific claims, with no statistical analysis at all! If they don’t want to compute any p-values and they want to make their case with the raw data, that’s cool. My problem is when they make claims that are not apparent from the data (here’s an example) and then make strong claims which are, essentially, only supported by statistical analysis.
  
  Reply ↓
- Michael Nelson on September 6, 2019 12:19 PM at 12:19 pm said:
  
  Some estimates of effects are better than others. Often, extraordinary claims are made based on samples of 30 or less, then fail to replicate in samples of 300. If forced to make a binary choice, I would trust the latter over the former, but a better approach would be to treat both effect estimates as samples from a population of effects and then make inferences about the expected value of the population effect size. Either way, you don’t need p-values to assess replicability. In fact, Hedges and Schauer (2018) have shown that many replications are under-powered for NHSTs.
  
  Anyway, relying on p-values to check replication implies that the main problem with these studies is the magnitude of the effects they report, when really it’s the magnitude of the implications they report based on those effects. Either power poses (for example) ought to be used widely to boost confidence OR their benefits are highly contingent on following an extremely precise, researcher-controlled protocol that’s difficult to replicate. It’s the interpretation we’re debunking, not the correlation.
  
  Reply ↓
  - Martha (Smith) on September 6, 2019 4:41 PM at 4:41 pm said:
    
    “relying on p-values to check replication implies that the main problem with these studies is the magnitude of the effects they report, when really it’s the magnitude of the implications they report based on those effects.”
    
    +1
    
    Reply ↓
  - M Lyle on September 11, 2019 4:42 PM at 4:42 pm said:
    
    I think your second paragraph sort of undermines the first, in its discussion of researcher protocol.
    
    If we get different results on an attempt to reproduce a finding, there’s two things that may be contributing to varying degrees– chance and methodology. If we treat the two studies as a combined sampling in our meta-analysis, we miss the former.
    
    In effect, we have a combined sample, but a very confounded one– consider them tuples of methodology and subject; we have 300 subjects combined with methodology B and 30 subjects combined with methodology A.
    
    Reply ↓
Daniel Lakeland on September 6, 2019 11:15 AM at 11:15 am said:

> That doesn’t mean that there should be no age limit on voting.

The truth is there’s no real reason to have an age limit on voting. Choose score voting and an age related weighting function. Make the weighting function go to zero at age zero and 1 at age 18. Something like inverse_logit((Age-14))

Reply ↓
- Wonks Anonymous on September 6, 2019 12:13 PM at 12:13 pm said:
  
  I’ve suggested a civics test where your vote is weighted by your score. We could use handicapping in case some aspects of that bother people, with the important thing being to incentivize politicians to target higher information voters.
  
  Reply ↓
  - Daniel Lakeland on September 6, 2019 1:56 PM at 1:56 pm said:
    
    In theory this might be good, in practice I’d be very suspicious of the content of such a test. I assume it’d have to be a multi-choice and if so, you’d want all the questions to have absolutely unambiguously verifiably correct answers. Like “which of the following is the correct wording of the 1st amendment to the US constitution”
    
    Even that kind of thing can go wrong, my understanding is that there were multiple hand-written copies of the constitution, and although they all agree on the wording, the punctuation wasn’t consistent, particularly in the matter of placement of commas.
    
    Reply ↓
    - Adede on September 6, 2019 2:38 PM at 2:38 pm said:
      
      Even if unambiguously true answers exist, one could contrive to ask questions that certain segments of the population know more about.
    - Daniel Lakeland on September 6, 2019 2:46 PM at 2:46 pm said:
      
      Yes. I think you can make a good case for providing a continuous weighting of a vote based on age, so that say 10 year olds can participate but they have dramatically less effect than say 20 year olds, you could also make a case for say downweighting people based on criminal history, so for example felons automatically recover their vote effectiveness over a period of say 10 years after their most recent conviction… and maybe a few other similar ideas… but the “civics test” idea is in practice a lousy idea I think.
  - Terry on September 7, 2019 7:35 PM at 7:35 pm said:
    
    “I’ve suggested a civics test where your vote is weighted by your score.”
    
    Of course, this would be an open invitation to gross manipulation of the civics test to achieve partisan results.
    
    Therefore, I would only agree if I was put in charge of the implementation.
    
    Reply ↓
    - Andrew on September 7, 2019 7:48 PM at 7:48 pm said:
      
      Terry:
      
      I don’t know about that. But I think either of my sisters would be a better president of the United States than I could ever be.
    - Terry on September 7, 2019 8:04 PM at 8:04 pm said:
      
      I’m not surprised.
      
      Math and empirical skills .NE. political skills. I don’t want Einstein fixing the brakes on my car.
- jim on September 6, 2019 3:35 PM at 3:35 pm said:
  
  We already have a function: 18 or over, you’re intellectually developed enough to vote, less than 18, you’re not.
  
  Arguably, as young people depend on their parents increasingly later in life and consequently don’t gain real life experience, it makes the most sense to adjust the age upward instead of downward. Today it should be about 25 – you have to be three years into the real world before you get to make decisions on how it’s run.
  
  Reply ↓
  - Daniel Lakeland on September 6, 2019 4:25 PM at 4:25 pm said:
    
    We have a discontinuous function, there is zero justification for a discontinuous function. if you’re born one day after election day you don’t vote, if you’re born one day before election day you do… how does that make sense? It doesn’t it’s just pure convenience for people who know how to write rules but don’t know anything about math.
    
    Most of our laws are so innumerate that the people who write them should qualify for disability.
    
    Reply ↓
    - jim on September 6, 2019 9:20 PM at 9:20 pm said:
      
      ” there is zero justification for a discontinuous function. ”
      
      In your personal opinion, which isn’t the foundation of universal law.
    - Daniel Lakeland on September 7, 2019 9:43 AM at 9:43 am said:
      
      If you’d like to give a justification for why a 17 year and 364 day old person can’t vote and a 17 year and 365 day old person can, and why this is a good and moral situation, you are free to try… I think it’s obvious to most people that this is this way out of pure convenience, it’s less obvious that a solution exists. most people will argue “well we have to put the cutoff somewhere” but the truth is, no we don’t.
    - jim on September 7, 2019 11:16 AM at 11:16 am said:
      
      My first argument is that the break should be at 24 years and 364 days.
      
      My second argument is that it’s a simple cutoff. Nothing useful is gained from your procedure. Enfranchising 8, 12, or 15-year olds with an age-proportional share of a vote, as your earlier descriptions suggests, just doesn’t accomplish anything useful. There are far more pressing matters in government.
    - Daniel Lakeland on September 7, 2019 12:17 PM at 12:17 pm said:
      
      People have such a hard time thinking out of the box… oh well.
      
      There is a perfectly good argument that people’s vote should be weighted by the number of years they will be subject to the effects of the government policies. So perhaps we should weight everything by say 100-age and empower parents to vote as the agents of their children under 18…
      
      people are so stuck in TTWWADI (that’s the way we’ve always done it)
    - Daniel Lakeland on September 7, 2019 1:23 PM at 1:23 pm said:
      
      Maybe it’s not clear, I’m not claiming to have well thought out specific suggestions of what we should be doing, I’m just saying the technology to make voting depend continuously on various quantities is known as “multiplication” and it’s been around for several thousand years and yet people in charge of writing laws have apparently never heard of it.
    - jim on September 8, 2019 1:56 PM at 1:56 pm said:
      
      “People have such a hard time thinking out of the box”
      
      I guess different people have different concerns. You’d probably get a big howl out of a lot of the things I think should be done! :)
      
      Generally I see your point. I’m not sure I agree with it. I mean on the one hand we have people advocating for more rights for youth but often the same people are telling us that we can’t hold young men responsible for their actions because their brains aren’t fully formed until age 21 or something – and that’s before you count their individual abilities to acquire knowledge and their accumulated experience.
      
      So I credit you with good intentions! :)
    - M Lyle on September 11, 2019 1:47 PM at 1:47 pm said:
      
      1. Simplicity. Keeping systems simple is desirable. Can vote vs. can’t vote is simple. More complicated systems have “gotchas” and are more difficult to administer and verify.
      2. Preventing tracking of votes. If you were to weight votes, you’d need to attach the weightings to the votes. This means that in a given precinct you’d have a lot more information of whom to attribute votes to.
      3. Audit / preventing cheating. If you have different weightings for different votes, were the votes weighted properly? You want to prevent attribution (#2), but you also want to make sure that votes count the right amount– these two goals are in direct opposition. (As compared to just tracking whether or not someone has voted and whether you have the right number of ballots).
      4. Distortive effects. Youth turnout is already low. If votes are weighted differently, this is likely to discourage low-weighted groups from participation.
  - Robin Morris on September 6, 2019 5:06 PM at 5:06 pm said:
    
    If we have a cut off at the low end, I’d argue for a cut-off at the high end as well.
    
    Reply ↓
    - Jeff on September 6, 2019 8:28 PM at 8:28 pm said:
      
      Or perhaps a weighting based on how long you should expect to be around to enjoy the effect.
  - DC on September 7, 2019 11:55 AM at 11:55 am said:
    
    Jim, why do you get to decide when folks have had ‘three years into the real world’? & what the heck is the ‘real world’ anyways? Why would this be 25? Yikes.
    
    Reply ↓
Nate on September 6, 2019 11:29 AM at 11:29 am said:

I’ve read some of the articles in the special issue of the American Statistician about p-values, yet I’m still not positive as to what should be done in order to reliably test hypotheses. After all due process is done, what tests and steps should I take ? Let’s take the basic example of comparing two means. Should I do a full bayesian analysis as prescribed by Kruschke 2013a (for example), should do “regular” hypotheses testing and report p-values with caution and some informative paragraph on their true nature vs. what’s often told about them? What are the steps that would satisfy the people here and that I can integrate into my workflow?

Reply ↓
- Andrew on September 6, 2019 11:34 AM at 11:34 am said:
  
  Nate:
  
  I don’t think we should be testing hypotheses at all! See here.
  
  Reply ↓
  - Nate on September 6, 2019 11:40 AM at 11:40 am said:
    
    Thank you!
    
    Reply ↓
  - Deborah G. Mayo on September 6, 2019 9:17 PM at 9:17 pm said:
    
    It’s very odd for someone who calls himself a falsificationist to be against testing hypotheses.
    
    Reply ↓
    - Andrew on September 6, 2019 10:07 PM at 10:07 pm said:
      
      Deborah:
      
      All the models I fit are false. I’m interested in the ways that these models fail, but (a) I care about models I want to fit, not about straw-man null hypotheses of zero effect, (b) I care about discrepancies on a scale of interest, not on a tail-area probability scale.
    - Deborah G. Mayo on September 6, 2019 11:39 PM at 11:39 pm said:
      
      Yes, that’s exactly what good significance testers care about as well. The P-values correspond to discrepancies on the scale of interest, if one reads them properly.
      
      A very big problem is that a lot of today’s discussions of P-values take us back to pre-Neyman and Pearson (N-P) days when there was only the null hypothesis (worse, it’s taken as a point rather than composite). With the alternative hypothesis, we have easy ways to distinguish the discrepancies or population effect sizes indicated by a given P-value.
      
      That said, I admit the N-P tests as typically given do not explicitly take the step of indicating discrepancies, that’s why I reformulate them. But the apparatus is all there.
The naked statistician on September 6, 2019 11:39 AM at 11:39 am said:

P-values are a useful continuous measure of discordance with the null hypothesis.

Obviously their interpretation depends on how much you believed in the null hypothesis before the data were observed as well as other factors.

The use of universal P-value thresholds demonstrates that P-values are not being interpreted in this way.

Hence people are doing bad science and giving P-values a bad name as a result.

Reply ↓
- jim on September 6, 2019 3:45 PM at 3:45 pm said:
  
  Bang! That’s what I thought! They’re the appropriate method for a restricted range of problems with a certain kind of variation, but they’ve been applied to a much wider range of problems where the basic assumptions of the method aren’t met so, not surprisingly, using p-values as a test for these problems frequently yields spurious results.
  
  Reply ↓
  - Anoneuoid on September 6, 2019 3:57 PM at 3:57 pm said:
    
    This post is about NHST, a misuse of p-values. It is not about p-values per se. It is really frustrating to discuss this topic because there are just so many points of confusion. These are all due to people making up stuff to justify performing a nonsensical ritual.
    
    Reply ↓
    - jim on September 6, 2019 9:22 PM at 9:22 pm said:
      
      sometimes the points of confusion are your own.
    - matt on September 6, 2019 10:27 PM at 10:27 pm said:
      
      +1.
      
      Anoneuoid is likely the most arrogant commenter on this blog.
    - Anoneuoid on September 7, 2019 7:15 AM at 7:15 am said:
      
      It only looks like arrogance because all im doing is debunking the same fallacies that have already been pointed out half a century ago. Thousands of papers have been written on this. Its just the same boring wrong stuff over and over, completely unresponsive to anything but money. And there is currently more money in producing fake discoveries than doing science.
      
      If no one doing this stuff has figured out a way to justify it yet, it isnt going to happen.
    - pl on September 7, 2019 7:50 AM at 7:50 am said:
      
      Whether he is arrogant or not is irrelevant; the only question is whether he is right or wrong.
    - jim on September 7, 2019 1:25 PM at 1:25 pm said:
      
      great point. I really don’t care if s/he is arrogant or not. I’m interested to know what I misunderstood. Anoneuoid isn’t clear about that.
      
      from Naked: “P-values are a useful continuous measure of discordance with the null hypothesis.”
      
      This seems to me to relate to NHST. Naked characterizes P as a continuous measure, in contrast to the discontinuous nature of the NHST.
      
      from Me: “They’re the appropriate method for a restricted range of problems”
      
      By which I mean p-value thresholds or NHST significance testing. I don’t think that reflects a misunderstanding on my part. I don’t know exactly what Anoneuoid refers to. I suspect that Anoneuoid’s statement is a reaction against NHST made without really understanding what was said, partly because I wasn’t clear that I was referring specifically P-thresholds. But I dunno, without clarification from Anoneuoid.
    - Daniel Lakeland on September 7, 2019 1:35 PM at 1:35 pm said:
      
      The problem with NHST isn’t the p value, it’s the “null hypothesis”, Anoneuoid has repeatedly advocated for people to formulate specific predictive models of what they think is going on and test *those* models, in which case p values might be an OK way to test them.
    - jim on September 8, 2019 1:47 PM at 1:47 pm said:
      
      Daniel,
      
      Thanks, much appreciated, that’s useful. My understanding is that Anoneuoid is bothered not by the P values themselves, but by the use of significance testing. It’s the use of a threshold to determine significance that’s bothers Annoneuoid. Is that correct?
      
      I agree with Anoneuoid on this point.
      
      I was trying to make a distinction between cutting edge research and routine work. Seems to me like there is lots of regular work where the data is sound (say, information on stock trades), n is large and there’s a fairly smooth distribution, where using a cut-off to select among models is a sensible thing to do.
    - Anoneuoid on September 8, 2019 2:03 PM at 2:03 pm said:
      
      It’s the use of a threshold to determine significance that’s bothers Annoneuoid. Is that correct?
      
      No… your question has been answered twice in this thread already:
      
      https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117359
      https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117518
      
      But yea, using an arbitrary threshold that in practice gets adjusted so that the “right” amount of “discoveries” get published is another fun piece of nonsense that acts as a red herring to distract from the real problem.
    - Sameera Daniels on September 7, 2019 9:27 AM at 9:27 am said:
      
      Wow, that never occurred to me.
    - Terry on September 7, 2019 7:40 PM at 7:40 pm said:
      
      How dare you say that!
      
      I am far more arrogant than Anoneuoid could ever hope to be in his wildest dreams! Anoneuoid isn’t fit to sniff my socks when it comes to arrogance!
    - Anoneuoid on September 6, 2019 10:37 PM at 10:37 pm said:
      
      Where do you think I am confused? Andrews post is about NHST (specifically the cutoff), one implementation of which happens to use p-values (others use Bayes’ factors, etc it doesn’t really matter).
      
      The naked statistician’s post and your response are about p-values.
    - jim on September 7, 2019 11:28 AM at 11:28 am said:
      
      Where do you think I’m confused? :)
      
      “Andrews post is about NHST (specifically the cutoff)”
      
      How can you talk about the cut-off without talking about what the P-value means and it’s general utility? Can you show me what I’m supposedly confused about? Perhaps it’s you that’s misunderstanding my statement?
    - Anoneuoid on September 7, 2019 9:17 PM at 9:17 pm said:
      
      I’ll just point these to the same place: https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117244
    - [email protected] on September 6, 2019 10:22 PM at 10:22 pm said:
      
      Anoneuoid: I apologize for being flippant. I shouldn’t have done that.
      
      I don’t think, however, that my statement mischaracterizes or reflects a misunderstanding at all. With two P values separated by the cut-off value clearly displayed in the title of the post, I think this post is about both p values and the classing p-value test. I don’t think I’m confusing the two at all.
      
      I believe Naked’s point – which I expressed agreement with – is that there are certain circumstances under which the classic p-value test is a sensible test to provide a definitive statement about the status of the null hypothesis. These conditions are limited. Most work doesn’t satisfy them and for that work the P-value test – or NHST if you prefer – isn’t an appropriate method.
    - Anoneuoid on September 7, 2019 7:22 AM at 7:22 am said:
      
      I say if you want to use a p-value to assess how well your hypothesis fits the data, then go for it. Unless your theory predicts exactly zero effect/correlation, then testing that is pseudoscience. The p-value is not the problem.
- ojm on September 6, 2019 9:13 PM at 9:13 pm said:
  
  > P-values are a useful continuous measure of discordance with the null hypothesis
  
  An under appreciated fact is that a pvalue is actually a measure of ‘accordance’ – ie compatibility or possibility – for a hypothesis.
  
  P = 0.05 means hypothesis *agrees less with (ie is less compatible with) data* than P=0.2. Higher P, more compatibility, lower P, lower compatibility.
  
  The key problem imo is people mistake ‘compatibility’ or ‘possibility’ measures for probability measures.
  
  Reply ↓
  - Sander Greenland on September 7, 2019 11:20 AM at 11:20 am said:
    
    As a matter of terms I don’t like “possibility” at all:
    Before adopting a term we need to consider its negation and its abuse potential. The negation of “possible” is “impossible,” an absolute claim (needs no referent) which would be a terrible way to think of values outside any statistical interval, since the most important of those outside values are far from impossible. Thus “possibility” is even worse than “confidence” – the worst of all proposals I’ve seen once its negation is considered.
    In contrast, “incompatible” merely denotes a relation (something cannot be “incompatible” without reference to what it is incompatible with): it is a relation between the data and the background model with the value inserted, so “incompatible” is a mild statement in that usage.
    
    Reply ↓
    - Sameera Daniels on September 7, 2019 12:32 PM at 12:32 pm said:
      
      Lol
      Sander! It is ‘possible’that you gonna get a good spanking . Just kidding.
    - ojm on September 7, 2019 7:45 PM at 7:45 pm said:
      
      Sure – I prefer compatibility to possibility too.
      
      Though when using possibility theory, there’s no important role for impossibility since the focus is on direct evaluations of things of interest, not NHST style reasoning. Furthermore, it is a continuous *degree of possibility*, so you’d call it a 5 percent possibility interval rather than a 95 percent confidence interval. Things outside have possibility 5 percent or less, rather than being ‘impossible’.
      
      The main reason I mention ‘possibility’ is that there is a formal theory of it that tracks the same ideas as compatibility:
      
      https://en.m.wikipedia.org/wiki/Possibility_theory
      
      But I’d happily see some formal aspects of the compatibility interpretation developed in a similar manner.
      
      In particular – just because something is very possible doesn’t mean it is ‘very necessary’. You can measure the degree of necessity of H relative to Ha via 1- Poss(Ha) from any alternative.
      
      So, if replacing degree of possibility by compatibility, it would be nice to have an analogue for degree of necessity too. Something about uniqueness or precision? I don’t know, but I feel like this is a useful aspect of possibility theory – in short, it being non-additive, so two theories can have degree of possibility 1 with no contradiction.
- Sameera Daniels on September 7, 2019 9:40 AM at 9:40 am said:
  
  The P-value has accumulated too much historical baggage to serve a helpful function. I do see hope in expanding the audiences to these controversies. I think that Gigerenzer’s call to empower consumers of statistics and medicine is going to reign in overdiagnosis and fallacious argumentation as consumers bear their costs and harms largely. The elephant in the room is ‘conflicts of interests’, one of the main obstacles to improved decisionmaking.
  
  Reply ↓
Keith O'Rourke on September 6, 2019 2:37 PM at 2:37 pm said:

OK, I’ll bite.

In terms of surprisal (-log2(p) ) the number of heads in a row equivalent for .2 is 2.3 and for .005 7.6.

Seven heads in a row does seem more suspicious than two from “fair coin tosses” …

Reply ↓
- The naked statistician on September 6, 2019 3:15 PM at 3:15 pm said:
  
  It’s actually 8 heads in a row with a one-sided test and 9 heads in a row with a two-sided test to get P value < 0.005.
  
  Moreover, even 7 heads in a row would be suspicious if you had a strong belief before starting to toss the coin that it had been tampered with so that it favoured heads.
  
  Reply ↓
  - Keith O'Rourke on September 6, 2019 5:01 PM at 5:01 pm said:
    
    Perhaps I should have defined surprisal
    
    “the problem of scaling can make it difficult to conceptualize the P-value as being an evidential measure against the test hypothesis and test model. However, these issues can be addressed by taking the negative log of the P-value –log2(p), which yields something known as the Shannon information value or surprisal (s) value” – https://lesslikely.com/statistics/s-values/
    
    Reply ↓
    - Sander Greenland on September 6, 2019 7:35 PM at 7:35 pm said:
      
      Two nice aspects of S-values is that they are additive over independent events so that differences between them on independent draws are directly interpretable, and they measure information in the singular draws (not some long-run). Here one can say there is a 5.3 bit difference in the information against the tested model between the two P-values, or about 5 heads in a row worth.
Corey on September 6, 2019 4:18 PM at 4:18 pm said:

If these two p-values come from two identical experiments, then the standard error of their difference is sqrt(2) times the standard error of each individual estimate, hence that difference in p-values itself is only (2.81 – 1.28)/sqrt(2) = 1.1 standard errors away from zero.

To say it again: it is completely consistent with the null hypothesis to see p-values of 0.2 and 0.005 from two replications of the same damn experiment.

I’m puzzled by the use of the standard error of the difference here. This isn’t an instance of the “the difference between “significant” and “not significant” is not
itself statistically significant” problem because we’re not discussing being “tempt[ed] to conclude that there is a large difference between the two studies” — we’re discussing if the p-values are consistent with null hypothesis in “two identical experiments”. In this rather special case it seems to me that we ought to be looking at the standard error of the mean, in which case we’re 2.9 standard errors from the null.

Reply ↓
- Deborah G. Mayo on September 6, 2019 9:26 PM at 9:26 pm said:
  
  I don’t get the supposition that the difference between a SS result and a non-SS result ought to be SS, for P-values to make sense. The difference between an obese weight and a non-obese weight need not itself be an obese weight.
  
  Reply ↓
  - ojm on September 6, 2019 9:30 PM at 9:30 pm said:
    
    A grain of sand is not a heap.
    
    Adding a grain of sand to something that is not a heap does not create a heap.
    
    Therefore no collection of grains of sand is ever a heap.
    
    ;-)
    
    Reply ↓
    - Deborah G. Mayo on September 6, 2019 10:40 PM at 10:40 pm said:
      
      Yes, the fallacy of the heap is often committed in today’s railing against P-value thresholds. A P-value is continuous, but it doesn’t follow, in any given case, that one can’t distinguish rather good from terrible evidence of a genuine discrepancy.
  - Daniel Lakeland on September 6, 2019 9:31 PM at 9:31 pm said:
    
    Most people take SSd -> there is a real difference, NoSSd -> we can safely assume for the moment that there is no difference.
    
    Then, the see A SSd 0 and B NoSSd 0 and they say A different from zero, B not different from zero, therefore A different from B.
    
    This happens just *all over the place* and it’s very wrong.
    
    Reply ↓
  - Corey on September 7, 2019 2:17 PM at 2:17 pm said:
    
    In this case the question isn’t “do p-values make sense”. Gelman wrote a paper about the fallacy of concluding that two conditions are are inequivalent because in one the p-value against a no-effect null was greater than 0.05 and in the other, not — a necessary task because researchers aren’t clear on this point. (I’ve personally been asked by a PI if it’s a statistically reasonable argument that we could put in a paper.)
    
    Reply ↓
David Colquhoun on September 6, 2019 5:33 PM at 5:33 pm said:

I guess that everyone now agrees that p values are unsatisfactory as a way to tell you whether or not your effects are just chance. The problem is now that nothing is likely to change unless users are given a better method. And the problem with anything Bayesian is that there is an infinitude of possible answers. I propose one of those possibilities which is mathematically simpler than most others, namely comparison of a point null with the best-supported alternative. This makes the calculations particularly simple.

I think that the question boils down to this.
Do you prefer an ‘exact’ calculation of something that can’t answer your question (the p value), or a rough estimate of something that can answer your question (the false positive risk). I prefer the latter.

Reply ↓
- Daniel Lakeland on September 6, 2019 5:51 PM at 5:51 pm said:
  
  False positive risk is zero in all experiments. Everything affects everything else, there are no precisely zero anythings (to 2nd order accuracy, to third order there might be a couple particle physics questions like “is this neutrino massless” or something that we could find if we searched hard enough)
  
  Reply ↓
  - David Colquhoun on September 6, 2019 7:08 PM at 7:08 pm said:
    
    “False positive risk is zero in all experiments.”
    
    With ideas like that I’m sure that a major pharmaceutical company will be prepared to pay you a great deal as an expert witness. (On second thoughts, that’s an unjustified slur on Pharma. I should perhaps have said a purveyor of homeopathic pills.)
    
    Reply ↓
    - Daniel Lakeland on September 6, 2019 7:51 PM at 7:51 pm said:
      
      If you spend a quadrillion dollars you’ll find that there is some very tiny effect of taking homeopathic pills, could be negative, could be positive, definitely is bigger in absolute value than 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
      
      on any meaningful scale of measurement.
    - David Colquhoun on September 7, 2019 5:06 AM at 5:06 am said:
      
      I fear that this comment shows a complete disconnect between statistician and scientific reality.
      That is a great pity.
    - Daniel Lakeland on September 7, 2019 9:23 AM at 9:23 am said:
      
      It’s not I who has the disconnect, it’s NHST based “testing of zero effect”. The way to go about evaluating drugs is just like what was advocated by the recent anti-p-value paper that Sander and co wrote. Every time you need to make a decision, you frame this decision in terms of how valuable different outcomes are (a utility) and you estimate the outcomes with a Bayesian posterior, and you choose the course of action with the best expected utility.
      
      In the US we’ve actually outlawed this as I understand… because the FDA can’t consider the price of the drug. IMHO if we are going to give out govt granted monopolies we should have the companies bid on a price ceiling which would be used in the utility. Then for the life of the patent the drug could never be sold for more than the amount they bid to get approval.
      
      And now you see just how offensive your suggestion that I become a pharma shill is.
    - Daniel Lakeland on September 6, 2019 9:31 PM at 9:31 pm said:
      
      hoping my response eventually shows up…
    - Thanatos Savehn on September 7, 2019 12:23 AM at 12:23 am said:
      
      Andrew is busily counting all the quarks in the universe to see if you’ve picked a ratio that’s sensible. It’s going to take a while.
    - Anoneuoid on September 6, 2019 11:01 PM at 11:01 pm said:
      
      With ideas like that I’m sure that a major pharmaceutical company will be prepared to pay you a great deal as an expert witness.
      
      This is exactly opposite of reality. Big Pharma does not want people thinking about how their drug can be linked to every side effect people can come up with, 50% of them in a negative way.
    - David Colquhoun on September 7, 2019 5:11 AM at 5:11 am said:
      
      As you must know, I was referring to claims that a drug works when it doesn’t.
      
      It’s bad enough that statisticians have failed to agree about what should be done about p values.
      
      It’s even worse when they spend time arguing about the number of angels that will fit on the head of a pin.
      
      The fact of the matter is that science has suffered from a surfeit of false positives. It is not helpful to deny that this is the case.
    - Anoneuoid on September 7, 2019 7:02 AM at 7:02 am said:
      
      As you must know, I was referring to claims that a drug works when it doesn’t.
      
      Which is hardly the only concern of pharmaceutical companies. If you increase the sample size enough to inevitably detect a “positive effect” 50% of the time, then you will also be increasing it enough to inevitably be detecting 50% of the “negative side effects” you check.
      
      The fact of the matter is that science has suffered from a surfeit of false positives. It is not helpful to deny that this is the case.
      
      Nope, there is always a correlation/effect for one reason or another. This is why when people collect “big data”, they need to lower their significance threshold to something like 3e-7 or 5e-8 or whatever to avoid looking like idiots because otherwise everything comes out significant.
      
      So there really are no false positives when testing a null hypothesis of zero correlation/effect. You can either have insufficient sample size for your chosen significance cut off (false negative), or a true positive.
- Deborah G. Mayo on September 6, 2019 9:40 PM at 9:40 pm said:
  
  Colquhoun:
  I don’t agree. It’s pretty obvious that, if the question is whether the observed result is easily produced by random variability that one is led to the P-value (or something akin). It would be nonsensical to say: Even though larger differences would frequently be expected by chance variability alone (i.e., even though the P-value is large), I maintain the data provide good evidence they are not due to chance variability.
  
  We’re assuming we’re dealing with a statistical problem of distinguishing genuine and spurious effect.
  
  Your proposal:” comparison of a point null with the best-supported alternative” can’t do the job because you will readily find a best supported alternative. Just find one that perfectly fits the data. You would then find it better supported than a null asserting the results are due to chance variability. You will erroneously favor the highly supported alternative with extremely high probability. I don’t see how this can be considered a good way to avoid erroneously mistaking noise for a real effect.
  
  Reply ↓
  - Anoneuoid on September 6, 2019 11:08 PM at 11:08 pm said:
    
    if the question is whether the observed result is easily produced by random variability
    
    This is an easy question to answer, but rarely the question people actually want answered. And this is a good thing that they are not actualyl concerned about that sinec it doesn’t make sense anyway.
    
    There can be many possible models of “random variability”. Lets take a sequence of binary outcomes. I can model it as Poisson binomial, or binomial. Both are “random variability”.
    
    https://en.wikipedia.org/wiki/Poisson_binomial_distribution
    https://en.wikipedia.org/wiki/Binomial_distribution
    
    And those are only the only options.
    
    Reply ↓
    - Anoneuoid on September 7, 2019 12:13 AM at 12:13 am said:
      
      Mods: Please delete the above typofest.
    - David Colquhoun on September 7, 2019 5:20 AM at 5:20 am said:
      
      Dear Anoneuoid
      Your anonymity puts me at a disadvantage.
      I’m puzzled by your statements:
      “This is an easy question to answer, but rarely the question people actually want answered.”
      Where on earth did you get that idea from?
      Perhaps you haven’t read as much of the biomedical literature as I have?
      If it isn’t the “question people actually want answered”, why is the literature full to overflowing with claims that an effect is “statistically significant”?
    - Anoneuoid on September 7, 2019 6:50 AM at 6:50 am said:
      
      If it isn’t the “question people actually want answered”, why is the literature full to overflowing with claims that an effect is “statistically significant”?
      
      Because there is mass confusion, just as Fisher forewarned.[1] Since the 1940s, researchers have been taught to perform a logical fallacy by stats 101 teachers along with their advisors and peers. Here is the usual “logic” of NHST:
      
      P: If my favorite explanation is true, then
      Q: there will be a correlation between x and y
      
      Observation: There is a correlation between x and y
      Conclusion: Therefore, my favorite explanation is true
      
      P -> Q is modus ponens which is fine, but Q -> P is a formal logical fallacy called affirming the consequent.
      
      The important thing to note is it does not matter how you decide whether a correlation/effect exists or not. Omniscient Jones can tell you, you can know it exists with 100% certainty. Still it will not tell you what you want to know, which is whether your favorite explanation is correct.
      
      There is a fundamental principle at play here that the existence of a correlation between any two variables x and y is a special and exciting thing to discover. It is not, everything is correlated with everything else and everything is caused by everything that happened earlier.
      
      Compare that to:
      
      P: If the theory of relativity is correct, then
      Q: The apparent position of stars should be displaced by d during a solar eclipse
      
      In this case it is ok to go from Q -> P as a heuristic just because very few P have been proposed (and the difference between those have negligible practical consequences) that must result in Q. this is all in Bayes rule:
      
      pr(P[0]|Q) = pr(P[0])*pr(Q|P[0])\sum(pr(P[0:n])*pr(Q|P[0:n]))
      
      The denominator is much larger (and hence LHS smaller) when Q is “there is a correlation between x and y” vs “displacement is d”.
      
      Ref 1:
      
      “We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort.”
      
      Fisher, R N (1958). “The Nature of Probability”. Centennial Review. 2: 261–274.
    - M Lyle on September 11, 2019 1:54 PM at 1:54 pm said:
      
      We simplify our reasoning for the sake of simplicity, and yes, sometimes we get burned by it.
      
      We start with a hypothesis that has some kind of plausible causal relationship.
      
      We do our best to control our research to eliminate common causes, etc.
      
      And then we seek out a p value of statistical significance. If we have well-controlled other confounds, and found p<0.05, Bayes will tell us that our plausible causal relationship is now 20x as likely as before (assuming it was not very likely), and it's a good candidate for other people to try and reproduce and study other ways.
      
      It's not a crazy way to do things. Yes, we need to implicitly consider the prior to some extent. If we find p<0.01 evidence for ESP under some new trial conditions, … it's a significant finding in that it's 100x as likely as before. But it still isn't very likely.
      
      Of course, we also need to consider effect size. A p<0.00001 finding of an effect, with a strong explanation for why we would believe in a causal relationship, tested and explored multiple ways… is not very interesting if the effect magnitude is 2%.
  - Anoneuoid on September 6, 2019 11:09 PM at 11:09 pm said:
    
    if the question is whether the observed result is easily produced by random variability
    
    This is an easy question to answer, but rarely the question people actually want answered. And this is a good thing that they are not actually concerned about that since it doesn’t make sense anyway.
    
    There can be many possible models of “random variability”. Lets take a sequence of binary outcomes. I can model it as Poisson binomial, or binomial. Both are “random variability”.
    
    https://en.wikipedia.org/wiki/Poisson_binomial_distribution
    https://en.wikipedia.org/wiki/Binomial_distribution
    
    And those are hardly the only two options.
    
    Reply ↓
  - David Colquhoun on September 7, 2019 5:42 AM at 5:42 am said:
    
    Mayo
    
    I am simply following Stephen Goodman. Suppose you have observed p = 0.05. If you test a point null against an alternative hypothesis then the likelihood ratio in favour of the alternative is at most 3 (that’s the value for the best supported alternative). This is the strongest evidence against the null, but it’s much weaker evidence for the alternative that might (mistakenly) be inferred from p = 0.05.
    
    That alone is sufficient to show the weakness of the evidence provided by the p value.
    I’d be happy if people just gave the likelihood ratio.
    
    It’s optional whether you take the next step and convert the LR into a probability, and interpret 1/(1 + LR) as a posterior probability of H1 that corresponds to prior odds of 1. If you do you get a false positive risk of 25% when you have observed p = 0.05.
    
    If you don’t like my last step, I’m happy to stop at giving the likelihood ratios. Either way the conclusion is that p = 0.05 is weak evidence.
    
    The problem is that as soon a Bayes is mentioned, internecine warfare breaks out, and users switch off. That’s probably the biggest reason that 70+ years of arguments about p values have been almost totally ignored by users and journals. If statisticians are too have the influence that they should have, they will have to stop squabbling among themselves. If they can’t do that, they will continue to be ignored by users.
    
    Reply ↓
    - Daniel Lakeland on September 7, 2019 9:32 AM at 9:32 am said:
      
      No, statisticians aren’t ignored, their Stats 101 is the most successful sham ever taught. People continue doing NHST because it’s a trivially easy automatable religious ritual proposed by all the textbooks that promises to give people certainty out of a mess of any pile of crappy data they find lying around. it’s the goose that eats table scraps and lays golden eggs.
      
      Statisticians can’t break away from it because who would kill the goose? Even if it’s only laying gold spray painted eggs, the market is paying full solid gold prices…
    - David Colquhoun on September 7, 2019 7:48 PM at 7:48 pm said:
      
      Daniel Lakeland
      Oh dear, you seem to be very angry.
      Surely it was obvious that, when I said statisticians are ignored, I meant that their advice to stop relying on p values is ignored. And part of the reason is surely that statisticians have been unable to agree on what to do about the problem.
    - Daniel Lakeland on September 7, 2019 8:03 PM at 8:03 pm said:
      
      I don’t see why anyone who understands what is going on in science today shouldn’t be angry. I’ve encountered multiple late career PIs in biology who are angry at how badly things are done. There’s a whole post up today about how you basically are *required by policy* to p-hack a research paper in order to graduate from medical school… how is that not worthy of anger?
      
      In any case, I don’t think “statisticians have been unable to agree on what to do about the problem” even rates as a real reason science is so badly done today. There are whole fields that *desperately want to keep doing ritualistic cargo cult science* because it brings them power and makes them money.
      
      the whole phrase “cargo cult science” comes from Feynman in his 1974 commencement address at Caltech. It’s 2019 today…
    - Deborah G. Mayo on September 7, 2019 1:08 PM at 1:08 pm said:
      
      Well don’t simply follow him or anyone. “Appeal to authority” is the main source of problems in this field. You overlook that the alternative can and will likely be a data-dependent selection, so you’re practically guaranteed to find a better-fitting alt, even when the null is true.
    - David Colquhoun on September 7, 2019 7:43 PM at 7:43 pm said:
      
      Mayo
      I would have appreciated a properly argued rejection of the idea that p = 0.05 corresponds to approximately a likelihood ratio of at most 3 in favour of H1. That’s the heart of my argument. Of course the people who object to the whole idea of a point null might find fault with it, but I wouldn’t have expected you to do so.
    - Martha (SmithH) on November 4, 2022 12:04 AM at 12:04 am said:
      
      Daniel Lakeland said,
      ” it’s the goose that eats table scraps and lays golden eggs.”
      
      I suggest this quote as “best in thread”.
Ron Kenett on September 6, 2019 11:24 PM at 11:24 pm said:

Gelman (quoting Rubin) is implicitly referring to Deming’s theory of profound knowledge. The context is a distinction between enumerative and analytics studied: “Tests of variables that affect a process are useful only if they predict what will happen if this or that variable is increased or decreased. Statistical theory, as taught in the books, is valid and leads to operationally verifiable tests and criteria for an enumerative study. Not so with an analytic problem, as the conditions of the experiment will not be duplicated in the next trial. Unfortunately, most problems in industry are analytic.” From preface to The Economic Control of Quality of manufactured product by W. Shewhart, 1931.

So, first clarify your goal. The analytics should provide information related to it. Science is about generalisation of findings. You do research and make claims. Two important issues: 1) How do you present them and 2) how do you support them. Nayo deals with the second question. So is Tim Taylor. Study design, research objectives, research claims in subject expert terms are not part of this discussion. They should be….

For a discussion of presentation of findings and their generalisations see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070

Reply ↓
- Ron Kenett on September 6, 2019 11:36 PM at 11:36 pm said:
  
  1. Sorry for the typos (Nayo—>Mayo)
  2. I should have also quoted Sam Karlin (my thesis advisor) who used models and mathematics as a support for scientific work. It is the same quote as Gelman’s of Rubin. Karlin developed BLAST and he was most proud of it. Actually, R.A. Fisher was his role model and his “meandering through science” article refers to him a lot.
  https://www.informs.org/Explore/History-of-O.R.-Excellence/Biographical-Profiles/Karlin-Samuel
  https://dl.acm.org/citation.cfm?id=2770800
  
  Reply ↓
- Martha (Smith) on November 4, 2022 12:07 AM at 12:07 am said:
  
  Thanks to Ron Kenett for the Shewhart quote.
  
  Reply ↓
Deborah G. Mayo on September 6, 2019 11:46 PM at 11:46 pm said:

What comes to mind in reading many comments is disappointment that tests (and interval estimates) are viewed so confusingly. It’s disappointing when people draw conclusions about statistical method based on well-known fallacies: statistical significance is not substantive importance, correlation is not cause, no evidence against is not evidence for, p-values are altered by data-dredging and a variety of selection effects.

First understand how they’re intended to work, then beat them up if you like.

If your problem is not statistical, don’t use statistics, but it’s absurd to suppose that a single statistical method is inadequate if it doesn’t give you a recipe to use unthinkingly, or doesn’t do full-bodied scientific inference and inquiry. No one would blame a single instrument in entirely non-statistical sciences to do everything at once. Statistical significance tests are part of an explicitly piecemeal approach. As in all good science, the pieces must be put together and empirical and theoretical considerations enter. I don’t think Gelman has any problem using statistical significance tests to find flaws in his models. As Fisher said, we’re not interested in isolated small P-values (to show a genuine experimental phenomenon), but reliable methods of procedure….

The problem cases that have gotten Gelman so aggravated are ones that statistical method can’t fix: there’s questionable connections between what’s experimentally measured and what the investigator purports to measure, there’s a ton of flexibility in interpreting and selecting results, the statistical assumptions aren’t checked, and Duhemian problems (for distinguishing the sources of anomalies) loom large. Don’t blame statistics for your bad science. In good sciences, not only are these problems attended to, there’s theory, even at a low level, enabling any statistical thresholds one might use to have a meaning, given what one wants to know. The ability to impinge on checkable effects is what’s required for showing genuine effects. Knowledge progresses by getting better at being able to impact and change such known effects. Looking purely at formal statistical measures, any of them, to check successful replication if one is operating in a swamp of questionable science makes little sense.

Reply ↓
- Martha (Smith) on November 4, 2022 12:11 AM at 12:11 am said:
  
  Mayo’s post has lots of good point — well worth rereading and thinking about.
  
  Reply ↓
Carlos Ungil on September 7, 2019 4:45 PM at 4:45 pm said:

I’ve never really understood this argument. It seems kind of trivial but anyway one should keep in mind that there is often a very thin line separating insightful from misleading.

Are two datasets with p=0.005 and p=0.2 close? Modifying slightly Keith’s example with coins, getting in 9 flips 8/1 and 6/3 do not seem so close. On the other hand, under the null hypothesis the average distance between p-values is 0.33 so a distance of 0.195 probably can be considered as close. But the subtle sleight of hand is that you are not looking at the closeness between the p=0.2 and p=0.005 results under the null hypothesis. In the coins example, getting 8 heads in 9 flips is 7 times more likely when the binomial probability is 2/3 then when is is 1/2.

The following example where the argument doesn’t make much sense may help to make my point clear. Let’s consider that instead of normal errors we have a bilinear error term: the distribution has a triangular shape and the error is in the interval [-1 1] with a maximum at 0. For the null hypothesis mu=0, the positive results with (two-sided) p-values 0.2 and 0.005 are 0.55 and 0.9. The p=0.05 threshold is at 0.78. Consider the results 0.55 (p=0.2) and 1.30 (p=0): following the reasoning in your post one can say that going from non-significant to impossible results can correspond to small, non-significant changes in the underlying quantities.

Reply ↓
David Colquhoun on September 7, 2019 7:58 PM at 7:58 pm said:

I’d like to make defence of the point null, which is what seems to upset several people here.

Some people don’t like the assumption of a point null that’s made in my proposed approach to calculating the false positive risk, but it seems natural to users who wish to know how likely their observations would be if there were really no effect (i.e. if the point null were true).

It’s often claimed that the null is never exactly true. This isn’t necessarily so (just give identical treatments to both groups). But more importantly, it doesn’t matter. We aren’t saying that the effect size is exactly zero: that would obviously be impossible. We are looking at the likelihood ratio, i.e. at the probability of the data if the true effect size were not zero relative to the probability of the data if the effect size were zero. If the latter probability is bigger than the former then clearly we can’t be sure that there is anything but chance at work. This does not mean that we are saying that the effect size is exactly zero. From the point of view of the experimenter, testing the point null makes total sense.

[From section 4 in https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622 ]

Reply ↓
- Anoneuoid on September 7, 2019 9:20 PM at 9:20 pm said:
  
  it seems natural to users who wish to know how likely their observations would be if there were really no effect
  
  No it doesn’t come naturally. I remember being taught this and feeling something was very wrong. I was kept too busy at the time to put much thought into it though.
  
  Reply ↓
  - Nick Adams on September 7, 2019 10:10 PM at 10:10 pm said:
    
    I agree with Anoneuoid. I want to know how certain we can be of the direction of the effect (and it’s magnitude of course). Is drug A better than drug B? And what is the plausible range of how much better?
    Interestingly, if we use maximum likelihood, having a point null (theta=0) or a composite null (theta less than or equal to zero) makes no difference because (at least in the normal mean model), the point null will always be the supremum of the composite null.
    
    Reply ↓
    - David Colquhoun on September 8, 2019 6:11 AM at 6:11 am said:
      
      @Nick Adams
      You say
      ” I want to know how certain we can be of the direction of the effect (and it’s magnitude of course)”
      
      Yes sure. But before estimating the size and direction of the effect, you want to be as sure as you can that there is an effect there to measure. That is the problem in almost all biomedical research. That is what, despite decades of work by statisticians, most users still think that it’s what the p value tells them. Most users, when asked what a p value means, say that “it’s the probability that by results occurred by chance”. Of course it isn’t, but that response surely tells you that what users want to know is the probability that their results occurred by chance.
      
      The problem with that question is that it has an infinitude of answers. But many of the answers, including mine, suggest that if you have observed p = 0.049 in a well-powered experiment and claim an effect exists, the probability that you are wrong is between 20 and 30% (and much higher for an implausible hypothesis).
      
      What would really help users would be for you to say what your estimate of that false positive risk is.
    - Anoneuoid on September 8, 2019 9:19 AM at 9:19 am said:
      
      that response surely tells you that what users want to know is the probability that their results occurred by chance.
      
      No, they want information about their research hypothesis, not some null hypothesis. They are trained to calculate a p-value for a default null model and compare it to a significance threshold. This process makes zero sense so they come up with myths to explain why they and everyone else is doing it.
      
      Knowing “the probability that the results occurred by chance” at least seems like a possibly useful piece of information. Compare to “the probability of observing a deviation at least that extreme from what we would predict if we assumed a null model everyone knows was false begin with was actually true”.
    - David Colquhoun on September 8, 2019 11:03 AM at 11:03 am said:
      
      @Anoneuoid
      
      You say
      “Knowing “the probability that the results occurred by chance” at least seems like a possibly useful piece of information.”
      
      Good. We agree at least on that point.
      
      So how would you estimate that probability in, for example, the case of comparing the means of two independent samples (assume normal dist, equal variances)?
    - Anoneuoid on September 8, 2019 11:28 AM at 11:28 am said:
      
      So how would you estimate that probability in, for example, the case of comparing the means of two independent samples (assume normal dist, equal variances)?
      
      1) Assuming an observation resulted from sampling from a particular normal distribution is just one possible definition of “chance”. Why can’t I assume it is a sample from a lognormal distribution and refer to that as “chance”? I would get rid of this “chance” terminology altogether.
      
      2) What purpose would it serve to know the probability that two groups were sampled from exactly the same distribution?
      
      3) As per recent discussion on this blog, most likely we can deduce the probability is zero and the data is irrelevant to our conclusion: https://statmodeling.stat.columbia.edu/2019/08/28/beyond-power-calculations-some-questions-some-answers/#comment-1108650
    - Daniel Lakeland on September 8, 2019 11:50 AM at 11:50 am said:
      
      Whatever the result was, it occurred because of an enormous dynamical system of atoms and electrons behaving physical laws. so the probability it “occurred by chance” is zero. The best we can do is answer the question “if we imagine it had occurred by sampling from an abstract sequence of real numbers whose probability distribution is a given function H, can we find evidence that this assumption is unlikely?”
      
      that’s what a typical “hypothesis test” does, and it gives us reasonably useful information when it *fails to reject* because then we know that this abstract mathematical model is for the moment sufficiently good that we can’t distinguish between it and whatever actually happened…
    - Anoneuoid on September 8, 2019 7:20 AM at 7:20 am said:
      
      Sorry, but I do not believe we are in agreement. I dont think looking at an effect is the right thing to do at all. There are two valid approaches:
      
      1) Come up with a ml/statistical model and use it to make predictions so some form of cost benefit analysis can be done and a decision made. Every intervention will have many “effects”, and they will vary depending on the circumstances.
      
      2) Testing predictions derived from a theory, pretty much as described here: https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1116876
      
      An effect is basically the coefficient of a linear model, it will change depending on what you include in the model (variables, interactions, etc), so really is just an arbitrary number.
    - matt on September 8, 2019 12:08 PM at 12:08 pm said:
      
      Lolz. Coefficients from a linear regression model are not arbitrary any more than a covariance is arbitrary. They describe one of the statistical properties of two or more variables. When that model can be convincingly said to represent some real-world phenomenon, then the coefficients have some real-world meaning. But even when there is no physical model in the background, the coefficients still have meaning, of course (at the very least, they reflect the best linear approximation to the conditional expectation of y given x).
    - Anoneuoid on September 8, 2019 6:28 PM at 6:28 pm said:
      
      When that model can be convincingly said to represent some real-world phenomenon, then the coefficients have some real-world meaning
      
      Yes, they have meaning when the model is “correctly specified”. Ie, it includes all the relevant variables and no irrelevant ones, etc. You can easily prove this to yourself by adding/removing variables to the model and seeing the others change. Attempting to interpret the meaning of these coefficients is like looking at the individual weights of a neural network.
      
      Here is an example of people attempting to explore all plausible linear models for one dataset. They come up with over 600 million different values for the same coefficient ranging from positive to negative: https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/
      
      After all that, they conclude:
      
      Because we are examining something inherently complex, the likelihood of unaccounted factors affecting both technology use and well-being is high. It is therefore possible that the associations we document, and those that previous authors have documented, are spurious.
      
      For the sake of simplicity and comparison, simple linear regressions were used in this study, overlooking the fact that the relationship of interest is probably more complex, non-linear or hierarchical 13 .
    - matt on September 9, 2019 10:50 AM at 10:50 am said:
      
      What is your point? Explain to me how things will be any different in the research agenda you’ve outline here and elsewhere (i.e. have a model that has testable predictions, etc). Your testable predictions are arbitrary as well (given your claim that any model that has coefficients which are not robust to changing something about the model (?) provides no useful information.. absurd of course; I hope you agree).
      
      As for that Garden of Forking Paths example, some models are clearly more plausible than others. Of course with many variables there are many permutations possible, hence the hundreds of models. But, at the end of the day, in statistical modelling you are going to have take a stand on some model or other if you want to say something beyond throwing your hands in the air and saying “I don’t know” (which, I’ll admit, could be an improvement in a lot of cases). In most developed economic literatures, there are often only a few specifications that any one would take to be reasonable given the body of theoretical work that has preceded it. Therefore if you have added or removed some variables and it’s not guided by theory, people will be very suspicious.
      
      In general, you keep straw-manning social science empirical work, acting as though all social scientists are imbeciles plugging variables at random into STATA regression models and looking at p-values to decide which model “worked”. This characterization fits with your narrative, so you run with it. But the reality is a little more complex; there is a lot of high-quality empirical work being done in economics, for example. Lots of problems too, but nobody is publishing papers that are of the PNAS variety in Econ, I can assure you that. The “p-hacking” (or equivalent of) done in modern Econ is mostly in structural work, I’d argue.
    - Anoneuoid on September 9, 2019 5:14 PM at 5:14 pm said:
      
      An arbitrary statistical can make useful predictions. This is how machine learning works. Simpler statistical models can be fine for that too. The coefficients dont mean anything though.
      
      Say I started with a population of N0 cells that on average undergo a binary division r times per day so my model for number of cells after t days is N(t) = N0*2^(r*t). This is an equation derived from some basic principles no one has a problem with. If we fit it to observations of the number of cells after 2, 4, and 6 days then our estimates of N0 and r have well defined meaning.
      
      If then I say “the curve doesnt fit that good so lets add in terms for apotosis rates, senescence when the cell density gets too high, etc” the meaning of N0 and r does not change.
      
      For a statistical model not derived from any first principles the meaning of each coefficient depends on what else is included in the model. Almost always what gets included is a matter of convenience. So the meaning of the coefficient changes, as does the value.
      
      The correct use if a statistical model is to make predictions, not interpret the arbitrary values of the coefficients. A rationally derived model can be used for both.
    - Anoneuoid on September 9, 2019 7:17 PM at 7:17 pm said:
      
      I was on mobile and made some typos, then I was going to leave the typos to avoid cluttering this site… but it is bugging me:
      
      > An arbitrary statistical model
      > the curve doesn’t fit that well
      > The correct use of
    - Martha (Smith) on November 4, 2022 12:18 AM at 12:18 am said:
      
      Nick Adams said, “‘I want to know how certain we can be of the direction of the effect (and it’s magnitude of course).”
      
      Hmm — How do we measure (or describe) “how certain we can be of something”? Or is Nick off in La La land?
David Colquhoun on September 8, 2019 11:48 AM at 11:48 am said:

@Anoneuoid
1. Because I said, for the sake of example, that you know the distributions are normal.

2. “What purpose would it serve to know the probability that two groups were sampled from exactly the same distribution?”
It would certainly be useful to have evidence that the two groups did *not” come from the same distribution.

3. I doubt whether users would find at all useful your advice that the false positive risk is zero.

Reply ↓
- Anoneuoid on September 8, 2019 12:13 PM at 12:13 pm said:
  
  1. Because I said, for the sake of example, that you know the distributions are normal.
  
  No you said “what users want to know is the probability that their results occurred by chance”, and then substituted “sampled from a normal distribution” for “chance”. This is a bait and switch.
  
  It would certainly be useful to have evidence that the two groups did *not” come from the same distribution.
  
  The answer is they did not. They never do in real life except perhaps in cases where it is theoretically predicted to be so (eg that subatomic particles have identical properties, etc). Otherwise, I can’t think of anything to do with this information.
  
  3. I doubt whether users would find at all useful your advice that the false positive risk is zero.
  
  Yes, and I would agree with them. I can’t think of any reason doing what you describe could be useful.
  
  Reply ↓
  - David Colquhoun on September 8, 2019 2:41 PM at 2:41 pm said:
    
    OK I give up.
    I suspect that we are looking at the difference between those who do experiments, and those who just theorize.
    
    Reply ↓
    - Anoneuoid on September 8, 2019 3:01 PM at 3:01 pm said:
      
      Then I must the one who “does experiments”, since I have actually collected biomedical data (rodent behavior, histology, etc) and know what all that entails.
    - Daniel Lakeland on September 8, 2019 4:15 PM at 4:15 pm said:
      
      Anoneuoid is an ex biologist who used to “do experiments”. I work with experimental biologists, biomechanics people, soil mechanics, bioinformatics, analyze industrial accidents, lawyers, all people with real world questions… the question “is a different from b” is never the right question. People are taught this is the right question because it’s the question they think they can answer with a t-test in excel. If all you have is a hammer….
    - Daniel Lakeland on September 8, 2019 4:25 PM at 4:25 pm said:
      
      I often have to scrape biologists jaws off the floor when I start talking to them about taking into account in a model which machine they ran their PCR on, what lane the fluorescence measurement came from, which technician collected the data, what batch they collected the data from, the mouse strain it was collected from, what order the surgeries were done in…. how old the mice were in days…
      
      we wind up with models involving 10 or so different causative factors each of which affects the outcome in a well defined way, and each of which has an associated set of parameters and posterior distribution of those parameters…
      
      that’s the reality. NHST *is* “just theory” produced by mathematicians with no connection to real experiments. Ronald Fisher who did real actual experiments on agriculture is quoted above somewhere saying how stupid NHST is and how it will result in people ritually doing stupid things in the name of “science”.
      
      https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117232
    - Anoneuoid on September 8, 2019 9:02 PM at 9:02 pm said:
      
      It is very interesting that even though Fisher played a huge role in popularizing p-values and significance thresholds, etc I never see him make the usual mistake of “rejecting strawman hypothesis A in order to conclude substantive hypothesis B.”
      
      Student makes it here:
      https://errorstatistics.com/2015/03/16/stephen-senn-the-pathetic-p-value-guest-post/#comment-120537
      
      I know I’ve seen Neyman do it, but don’t remember the paper at the moment. Here is a random applied Neyman paper I just found:
      
      (i) did the silver iodide seeding in any of the other completed experiments show significant effects, positive or negative, on precipitation in areas far removed from the intended target?
      […]
      The two Arizona experiments (6, 7) were performed during the summer months of 1957-60 and in 1961, 1962, and 1964. The target area was an isolated body of mountains known as the Santa Catalina Mountains, with dimensions of roughly 15 by 20 miles. Seeding was performed over a period of 2-4 hr, and began at 12:30 p.m. The experimental unit was a “suitable” day. Determination of the suitability of a given day was made in the morning; the essential criterion was a high level of precipitable water. The experimental design was in randomized pairs of suitable days, subject to the restriction that the 2 days of a pair be separated by not more than 1 day diagnosed as not suitable. For the first day of each pair, the decision whether to seed or not was purely random. Whatever this decision was, it required a contrary decision for the second day. The second experiment differed from the first in the following respects: more gages scattered over a somewhat smaller area, level of seeding, and more stringent selection of suitable days.
      
      The original evaluation of possible effects of seeding was based (6, 7) on the average rainfall over the 5-hr period from 1300 to 1800, MST, as measured by a substantial number of recording gages scattered in the target. In both experiments the results of the evaluation were about the same-a not significant 30% apparent loss of rain ascribable to seeding. On days when cloud bases were high, these apparent losses were heavier than on days when the cloud bases were low.
      […]
      The first shows that the seeding over Santa Catalina Mountains was actually accompanied by a significant apparent 40% loss in 24-hr rainfall at a distance of 65 miles from the intended target, P = 0.025. This, then, constitutes an affirmative answer to question (i).
      […]
      The stratification reflected in the last two double lines of Table 1, stimulated by the thoughts of Horace Byers (1, pp. 551-2), was performed because the design is randomized pairs: only the first day of each pair was selected for the experiment, without prior knowledge whether it would be seeded or not. Table 1 shows that the difference between the category of “first days” and the category of “second days” is quite sharp, but its sign is opposite to that visualized by Byers.
      
      https://www.ncbi.nlm.nih.gov/pmc/articles/PMC389009/
      
      So first of all we see some p-hacking going on here. Second we see him going from rejecting the null model (I didn’t follow to ref 5 to learn exactly what it was) of something like “rainfall for the next 24 hrs on seeded vs non-seeded days is sampled from the same distribution”, to conclude “silver iodide seeding reduced 24 hr rainfall by 40%”.
      
      Can anyone find an example of Fisher committing this error?
    - M. Lyle on September 11, 2019 4:46 PM at 4:46 pm said:
      
      > we wind up with models involving 10 or so different causative factors each of which affects the outcome in a well defined way, and each of which has an associated set of parameters and posterior distribution of those parameters…
      
      You end up with a severely overfit model, you mean.
Jeff on September 11, 2019 8:04 AM at 8:04 am said:

My understanding of the 18 year old voting age is this: If politicians can elect to
send you to your death then you are entitled to participate in the election of those politicians.

Reply ↓
Peter Chapman on September 11, 2019 9:20 AM at 9:20 am said:

The difficulty I have with discussions such as this is the lack of context. There are situations in which I would definitely not ignore a p value of 0.52 For example, if was trying to develop a new product for the market and this experiment was as small part of a much bigger project. In this case I might carry out a better experiment or I might tweak the product, but I would definitely not abandon the project. On the other hand, if I was a graduate student on a very tight budget and the effect being studied was of no practical significance then I might abandon the project.

Reply ↓

Statistical Modeling, Causal Inference, and Social Science

It’s not just p=0.048 vs. p=0.052

143 thoughts on “It’s not just p=0.048 vs. p=0.052”

Leave a Reply Cancel reply