What’s the p-value good for: I answer some questions.

Martin King writes:

For a couple of decades (from about 1988 to 2006) I was employed as a support statistician, and became very interested in the p-value issue; hence my interest in your contribution to this debate. (I am not familiar with the p-value ‘reconciliation’ literature, as published after about 2005.) I would hugely appreciate it, if you might find the time to comment further on some of the questions listed in this document.

I would be particularly interested in learning more about your views on strict Neyman-Pearson hypothesis testing, based on critical values (critical regions), given an insistence on power calculations among research funding organisations (i.e., first section headed ‘p-value thresholds’), and the long-standing recommendation that biomedical researchers should focus on confidence intervals instead of p-values (i.e., penultimate section headed ‘estimation and confidence intervals’).

Here are some excerpts from King’s document that I will respond to:

My main question is about ‘dichotomous thinking’ and p-value thresholds. McShane and Gal (2017, page 888) refers to “dichotomous thinking and similar errors”. Is it correct to say that dichotomous thinking is an error? . . .

If funding bodies insist on strict hypothesis testing (otherwise why the insistence on power analysis, as opposed to some other assessment of adequate precision), is it fair to criticise researchers for obeying the rules dictated by the method? In summary, before banning p-value thresholds, do you have to persuade the funding bodies to abandon their insistence on power calculations, and allow applicants more flexibility in showing that a proposed study has sufficient precision? . . .

This brings us to the second question regarding what should be taught in statistics courses aimed at biomedical researchers. A teacher might want the freedom to design courses that assume an ideal world in which statisticians and researchers are free to adopt a rational approach of their choice. Thus, a teacher might decide to drop frequentist methods (if she/he regards frequentist statistics as nonsense) and focus on the alternatives. But this creates a problem for the course recipients if grant awarding bodies and journal editors insist on frequentist statistics. . . .

It is suggested (McShane et al. 2018) that researchers often fail to provide sufficient information on currently subordinate factors. I spent many years working in an experimental biomedical environment, and it is my impression that most experimental biomedical researchers do present this kind of information. (They do not spend time doing experiments that are not expected to work or collecting data that are not expected to yield useful and substantial information. It is my impression that some authors go to the extreme in attempting to present an argument for relevance and plausibility.) Do you have a specific literature in mind where it is common to see results offered with no regard for motivation, relevance, mechanism, plausibility etc. (apart from data dredging/data mining studies in which mechanism and plausibility might be elusive)? . . .

For many years it had not occurred to me that there is a distinction between looking at p-values (or any other measure of evidence) obtained as a participant in a research study, versus looking at third-party results given in some publication, because the latter have been through several unknown filters (researcher selection, significance filter, etc.). Although others had commented on this problem, it was your discussions on the significance filter that prompted me to fully realise the importance of this issue. Is it a fact that there is no mechanism by which readers can evaluate the strength of evidence in many published studies? I realise that pre-registration has been proposed as a partial solution to this problem. But it is my impression that, of necessity, much experimental and basic biomedical science research takes the form of an iterative and adaptive learning process, as outlined by Box and Tiao (pages 4-5), for example. I assume that many would find it difficult to see how pre-registration (with constant revision) would work in this context, without imposing a massive obstacle to making progress.

And now my response:

1. Yes, I think dichotomous frameworks are usually a mistake in science. With rare exceptions, I don’t think it makes sense to say that an effect is there or not there. Instead I’d say that effects vary.

Sometimes we don’t have enough data to distinguish an effect from zero, and that can be a useful thing to say. Reporting that an effect is not statistically significant can be informative, but I don’t think it should be taken as an indication that the true effect is zero; it just tells us that our data and model do not give us enough precision to distinguish the effect from zero.
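A minimal simulation sketch of this point (the effect size and standard error below are hypothetical numbers, chosen only to represent an underpowered study):

```python
import random

random.seed(0)

# Hypothetical numbers: true effect 0.3, standard error 0.25 (an underpowered
# study). "Significant" means the estimate is > 1.96 standard errors from 0.
true_effect, se, sims = 0.3, 0.25, 100_000
nonsig = sum(abs(random.gauss(true_effect, se)) < 1.96 * se for _ in range(sims))
nonsig_share = nonsig / sims
print(f"Share of 'not significant' results despite a real effect: {nonsig_share:.2f}")
```

With these numbers the study has roughly 22% power, so about three-quarters of replications come out “not significant” even though the effect is real: non-significance here says something about precision, not about the effect being zero.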

2. Sometimes decisions have to be made. That’s fine. But then I think the decisions should be made based on estimated costs, benefits, and probabilities—not based on the tail-area probability with respect to a straw-man null hypothesis.

3. If scientists in the real world are required to do X, Y, and Z, then, yes, we should train them on how to do X, Y, and Z, but we should also explain why these actions can be counterproductive to larger goals of scientific discovery, public health, etc.

Perhaps a sports analogy will help. Suppose you’re a youth coach, and your players would like to play in an adult league that uses what you consider to be poor strategies. Short term, you need to teach your players these poor strategies so they can enter the league on the league’s terms. But you should also teach them the strategies that will ultimately be more effective so that, once they’ve established themselves, or if they happen to play with an enlightened coach, they can really shine.

4. Regarding “currently subordinate factors”: In many, many of the examples we’ve discussed over the years on this blog, published papers do not include raw data or anything close to it, and they don’t give details on what data were collected, how the data were processed, or what data were excluded. Yes, there will be lots of discussion of motivation, relevance, mechanism, plausibility, etc. of the theories, but not much thought about data quality. Some quick examples come from the evolutionary psychology literature, where the days of peak fertility were mischaracterized, or where finger-length measurements were treated as a measure of testosterone. There’s often a problem that data and measurements are really noisy, and authors of published papers (a) don’t even address the point and (b) don’t seem to think it matters, under the (fallacious) reasoning that, once you have achieved statistical significance, measurement error doesn’t matter.

5. Preregistration is fine for what it is, but I agree that it does not resolve issues of research quality. At best, preregistration makes it more difficult for people to make strong claims from noise (although they can still do it!), hence it provides an indirect incentive for people to gather better data and run stronger studies. But it’s just an incentive; a noisy study that is preregistered is still a noisy study.

Summary

I think that p-values and statistical significance as used in practice are a noise magnifier, and I think people would be better off reporting what they find without the need to declare statistical significance.

There are times when p-values can be useful: it can help to know that a certain data + model combination is weak enough that we can’t rule out some simple null hypothesis.

I don’t think the p-value is a good measure of the strength of evidence for some claim, and for several reasons I don’t think it makes sense to compare p-values. But the p-value can make sense as one piece of evidence in a larger argument about data quality.

Finally, the above comments apply not just to p-values but to any method used for null hypothesis significance testing.

40 Comments

  1. Justin says:

    Hi,

    “With rare exceptions, I don’t think it makes sense to say that an effect is there or not there. Instead I’d say that effects vary.”

    IMO, p-values are not saying “an effect is there” in an absolute sense. Rather, they are saying an effect is detectable at a given level of alpha. And then of course show CIs or some estimation of the effect.

    “But then I think the decisions should be made based on estimated costs, benefits, and probabilities—not based on the tail-area probability with respect of a straw-man null hypothesis.”

    When I use p-values to make decisions, that is just one piece of the decision. For example, a p-value might lead me to conclude that I should use a certain procedure for mailing surveys. Then, I’d also have to look at cost functions (overhead and per survey mailed) to make a final decision. The null hypothesis is just modus tollens logic and counterfactual reasoning, so I do value it highly. It is, however, as strawman as any other model.
    Something I looked at in grad school was Lin’s concordance correlation coefficient, comparing say a gold standard device to a proposed replacement device (manual vs. digital blood pressure cuffs). Obviously p-values are involved (via bootstrapping), but costs of the devices, training, etc., would come into play in any decision.
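For readers unfamiliar with it, here is a sketch of Lin’s concordance correlation coefficient; the paired readings below are simulated stand-ins (not real cuff data):

```python
import random, statistics

random.seed(1)

# Lin's CCC = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2): it penalizes
# both lack of correlation and systematic bias between the two devices.
def lin_ccc(x, y):
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return 2 * sxy / (statistics.variance(x) + statistics.variance(y)
                      + (mx - my) ** 2)

# Simulated paired readings: the "digital" device adds a small bias plus noise.
manual = [random.gauss(120, 10) for _ in range(500)]
digital = [m + random.gauss(2, 3) for m in manual]
print(f"Lin's CCC: {lin_ccc(manual, digital):.3f}")
```

A bootstrap over the pairs would give the interval (and p-value) Justin mentions; the cost considerations would then enter separately, as he says.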

    “I think that p-values and statistical significance as used in practice are a noise magnifier, and I think people would be better off reporting what they find without the need to declare statistical significance.”

    I think actually that they help cut through noise, especially if reported not just from 1 experiment, but from several.

    I recently looked very briefly at some Nobel prize winners’ research, and they do use p-values and statistical significance language. Good enough for scientists doing science at the highest level, good enough for me. See http://www.statisticool.com/nobelprize.htm

    “Finally the above comments apply not just to p-values but to any method used for null hypothesis significance testing.”

    Agreed, but the current trend/cottage industry is to bash p-values and statistical significance, say they are not good for science, that they are confusing, backwards, arbitrary, dichotomize in a bad way, etc., then propose your own method to correct the claimed defects, and then act like those proposed methods cannot possibly be gamed by bad actors or be affected by arbitrary journal standards. My observation, anyway.

    Justin

    • DC says:

      1. How does shifting from talking about whether an ‘effect is there’ to qualifying it with an alpha level make it any better? And CIs are really just a direct corollary of the p-value. So, not sure what benefit that is, aside from adding seeming complexity and nuance to a flawed metric?
      2. Is there any evidence that p-values across multiple experiments help ‘cut through the noise’, like you say? Or is this just an assumption? My understanding was that the use of them in such settings does exactly the opposite. There’s a literature on flaws of certain meta-analysis approaches due to exactly this kind of thinking. If the p-values in each study are flawed (even slightly), this can have a big impact on meta-analysis.
      3. Who cares if a Nobel prize laureate uses p-values? There are racists with Nobels, too. This is an appeal-to-authority argument that has no business in the discussion of what is accurate/right to do in science.
      4. I agree that outright bashing of methods doesn’t really help, and that pretty much all methods can be gamed in some way. But, to be fair, there seem to be some misconceptions/confusion in your own view of p-values (I’m sure lots of dimensions of them confuse me as well). So, maybe some of the harsh critique of p-values and NHST isn’t so unfounded after all.

      • Justin says:

        Hi DC,

        “… and CI’s are really just a directly corollary to the p-value. So, not sure what benefit that is, aside from adding seeming complexity and nuance to a flawed metric?”

        Well, they put a ‘conclusion’ in terms of original units, the width carries information about variability, and they show upper and lower bounds, saving the reader from having to derive them. I agree there is a 1:1 relationship, just displayed/communicated differently.
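That 1:1 relationship can be made explicit with a normal-approximation sketch (the estimate and standard error here are hypothetical): the 95% CI excludes zero exactly when the two-sided p-value is below 0.05.

```python
import math

# Normal-approximation 95% CI and two-sided p-value for an estimate and SE.
def ci_and_p(est, se):
    lo, hi = est - 1.96 * se, est + 1.96 * se
    z = abs(est / se)
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided tail area
    return (lo, hi), p

(lo, hi), p = ci_and_p(0.50, 0.20)  # hypothetical estimate and standard error
print(f"95% CI: ({lo:.2f}, {hi:.2f}); p = {p:.3f}")
```

Same information either way; the interval version just reports it in the original units, with the width showing the precision.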

        ” 2. is there any evidence that p-values across multiple experiments help ‘cut through the noise’, like you say? Or is this just an assumption?”

        That p-values cut through the noise is a definition, and a theoretical and observed fact.

        I was thinking about things like http://www.statisticool.com/cis.jpg
        showing results from more than 1 study (replication), for example over time, or a meta-analysis, to get a better understanding of the phenomenon

        or something like this http://www.statisticool.com/flips_7000.GIF
        where more trials show limiting values

        or something like http://www.statisticool.com/fdr.JPG
        where several statistically significant trials in a row dramatically lower the false discovery rate when compared to 1 trial alone

        or useful things like:
        - the asymptotic results of the CLT
        - variance shrinking as n → ∞
        - the sampling fraction n/N getting large
        - likelihoods swamping priors
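For what it’s worth, the false-discovery-rate arithmetic alluded to a few lines up can be sketched with Bayes’ rule; alpha, power, and the prior below are all hypothetical numbers, and the trials are assumed independent.

```python
# If each trial rejects at alpha = 0.05 with power = 0.80, and we start with a
# 10% prior probability that the effect is real, each additional "significant"
# result sharply raises P(effect is real), i.e., lowers the false discovery rate.
alpha, power, p_real = 0.05, 0.80, 0.10
for trial in range(1, 4):
    # Bayes' rule: P(real | another significant result)
    p_real = power * p_real / (power * p_real + alpha * (1 - p_real))
    print(f"after {trial} significant trial(s): P(effect is real) = {p_real:.3f}")
```

Under these assumptions P(real) goes from 0.10 to about 0.64, 0.97, and 0.998 after one, two, and three significant trials; the catch, as discussed elsewhere in this thread, is that selection and flawed individual p-values break the independence assumption.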

        “My understanding was that the use of them in such settings does exactly the opposite. there’s a literature on flaws of certain meta-analysis approaches due to exactly this kind of thinking. if the p-values in each study are flawed (even slightly), this can have a big impact on meta-analysis.”

        If they are literally flawed in each study, there is a systemic problem (same with Bayes factors, or anything else) in science or whatever field we are in. What about if they aren’t, or aren’t all flawed? Again, it just emphasizes that we must focus on more than 1 study if we are talking about establishing a phenomenon (Fisher, ~80 years ago). And, again, how would Bayes factors or anything else not be affected by this – why pick on p-values alone?

        ” 3. Who cares if a Nobel prize laureate uses p-values? There’s racists with Nobels, too. This is an appeal to authority argument that has no business in the discussion of what is accurate/right to do in science.”

        I don’t equate what I wrote with a fallacious appeal to authority though. They didn’t just ‘say so’ based on authority, and I’m not saying p-values are correct only because they are famous scientists. I’m saying these scientists clearly found p-values and confidence intervals useful to directly assess evidence in their experiments and make conclusions, and I note that these techniques were useful in their work.

        Of course, such things are used by ‘regular’ scientists and researchers the world over, as Hubbard notes in “Will the ASA’s Efforts to Improve Statistical Practice be Successful? Some Evidence to the Contrary”
        (https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1497540)
        which also shows increasing use in the social and management sciences from 1960 to 2017 or 2018, I forget which.

        In “There is still a place for significance testing in clinical trials” (https://journals.sagepub.com/doi/pdf/10.1177/1740774519846504) by Cook et al, we read
        “There is no competing paradigm [traditional statistical testing] that has to date achieved such broad support.

        Even the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse.”

        Moreover, the books “The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century”, by Salsburg, and “Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective”, by von Plato give more examples too.

        Are those ‘argument by authority’ too, or just me observing some facts?

        Cheers,
        Justin

        • DC says:

          1. I don’t see how your CI position really helps things; my understanding is that there are a lot of misconceptions about the extent to which CIs can do the things you’re claiming, including whether they give adequate displays of uncertainty whatsoever. I’m thinking of Richard Morey’s work, among others. There’s a huge debate about this that you seem to be totally glossing over. But, I’m no expert on it either.
          2. Your demonstrations do not convince me. As an applied scientist, I’m thinking of how the stats are used in practice. The point is that even small changes in design choices (garden of forking paths, etc.) can pose huge problems for p-values and their associated CIs. Sure, they can pose problems for other methods too, but that doesn’t change my position on the extent to which p-values are a problem. So, I don’t see how showing me a few examples where they’re supposedly used better (which I also question) helps. But, I’ll confess that I’m biased: even when p-values are used 1000% exactly accurately, I still only find them minimally useful, including in my own work. I’m not arguing p-values are inherently flawed, nor is Andrew. Just saying that their most common uses run into a lot of problems (which your examples do not address). Also, not sure why you keep bringing up Bayes factors. It seems like you think I’m a proponent of them, and I never mentioned them once (furthermore, I agreed with your point that any stat can be gamed).
          3. Still seems a bit like argument to authority, as others below seem to also think. Your assumption that these scientists used p-values because they ‘clearly found’ that they were useful is pretty amazing to me. If you don’t see the issue with that statement, I’m not sure what else to say.

          Clearly, you are bothered by critiques of p-values, that’s fine. I use them in my own work & still have no problem critiquing them. No metric is perfect. I was taking issue with your specific claims in support.

      • To me, coming from linguistics, I feel the biggest advantage of confidence intervals is how they shift the attention back to the value of the difference, regression coefficient, correlation coefficient, etc. itself. I think linguists are usually too content to just take the qualitative implications of a model (e.g. p < 0.05 means there's an effect of X) and ignore what the p-value is even doing (I doubt most linguists can even tell you what, for example, the beta in a logistic regression model means). The shift towards CIs, I think, will encourage thinking more about the model and its quantitative implications, since the veneer of straightforward qualitative implications is removed.
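A tiny numeric illustration of that interpretation point (the coefficient and baseline probability below are made up, not taken from any linguistics model):

```python
import math

# In logistic regression, a coefficient beta is a change in log-odds, so
# exp(beta) is an odds ratio. Hypothetical beta and baseline probability:
beta = 0.7
odds_ratio = math.exp(beta)          # the odds roughly double
base_p = 0.30
new_odds = (base_p / (1 - base_p)) * odds_ratio
new_p = new_odds / (1 + new_odds)
print(f"exp(beta) = {odds_ratio:.2f}; P(outcome) moves from {base_p:.2f} to {new_p:.2f}")
```

This is the kind of quantitative reading (how much, on what scale) that a bare “p < 0.05, so there is an effect of X” skips over.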

  2. jd says:

    Unless I missed something, I don’t see how these comments address the requirement of a power analysis by funding agencies. Any grant application that I have helped a PI with seems to *require* a power analysis. It’s not an option. I have done my best to try to implement what I think is better practice on the analysis end (no NHST; Bayesian multilevel models; reporting uncertainty intervals; plotting the raw data; etc.), but I am still not sure what to do other than a power analysis for those grants. I typically do these via simulation, because I think simulating data beforehand and running the analysis is a good sort of thinking check.
    I seem to remember a post recently about how it would be best to not do power analysis, but what should I be doing other than a power analysis, when it seems to be required?
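A minimal sketch of the simulation approach jd describes: assume an effect size and noise level, simulate the planned two-arm study many times, and count how often the planned test would reach significance. The effect, sd, and sample size below are hypothetical placeholders.

```python
import random, statistics

random.seed(2)

def simulated_power(effect, sd, n_per_arm, sims=4000):
    hits = 0
    for _ in range(sims):
        treat = [random.gauss(effect, sd) for _ in range(n_per_arm)]
        ctrl = [random.gauss(0.0, sd) for _ in range(n_per_arm)]
        diff = statistics.fmean(treat) - statistics.fmean(ctrl)
        se = (statistics.variance(treat) / n_per_arm
              + statistics.variance(ctrl) / n_per_arm) ** 0.5
        hits += abs(diff / se) > 1.96  # normal approximation to the t-test
    return hits / sims

print(f"Estimated power: {simulated_power(0.5, 1.0, 64):.2f}")
```

The same simulation skeleton can be repurposed for the design-analysis summaries discussed elsewhere in this thread, since you already have the simulated estimates in hand.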

    • Andrew says:

      Jd:

      Indeed, the NIH’s implicit requirements can be even worse than you say; I’ve heard that sometimes as a condition for funding they require statistical significance in the pilot study. This is just an invitation to cheat and a recipe for noise chasing.

      Regarding the problems with power analyses, see my posts here and here on “the 80% power lie.”

      I do recommend “design analyses”—evaluating your design of data collection and measurement using the sampling distribution of your estimate in the context of clearly-stated assumptions about effect sizes and variation—see this paper with John Carlin on “Beyond power calculations.” I don’t like “power analysis” per se because it is all about the goal of statistical significance. Design analysis is a good idea, though, for sure.
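A simulation sketch of such a design analysis, in the spirit of the quantities in “Beyond power calculations” (the assumed effect size and standard error below are hypothetical):

```python
import random

random.seed(3)

# Given an assumed true effect and the standard error implied by the design,
# estimate power, the Type S error rate (wrong sign among significant
# estimates), and the Type M exaggeration ratio (overestimation of magnitude
# among significant estimates).
def retrodesign(true_effect, se, sims=100_000):
    signif = [e for e in (random.gauss(true_effect, se) for _ in range(sims))
              if abs(e) > 1.96 * se]
    power = len(signif) / sims
    type_s = sum(e * true_effect < 0 for e in signif) / len(signif)
    type_m = sum(abs(e) for e in signif) / len(signif) / abs(true_effect)
    return power, type_s, type_m

power, type_s, type_m = retrodesign(true_effect=0.1, se=0.1)
print(f"power={power:.2f}, Type S={type_s:.3f}, exaggeration ratio={type_m:.1f}")
```

With the true effect equal to one standard error, power is only about 17%, and the significant estimates overstate the true effect by roughly a factor of 2.5: one concrete sense in which chasing statistical significance in a weak design magnifies noise.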

      • jd says:

        Ok, I have read those, and I certainly agree with your arguments. I’ll read the paper again. I seem to remember it aiming more toward showing the probability of Type S and M errors in already-published studies. Maybe I missed something about how it can help me determine sample size for grant proposals.

        As far as the job goes though, it seems as long as funding agencies require power analysis, then I’m stuck doing them even if they aren’t a good idea, right? And as far as a PI goes, I think I’ve done a decent job shifting from NHST in analysis, but I doubt I can propose to include anything other than a power analysis when they go asking for money. It seems like there needs to be a shift at the highest levels for this type of thing to change. Unless I am just completely missing some sort of alternative.

        • Andrew says:

          Jd:

          I agree that this paper with Carlin focuses on design analysis for already-conducted studies. We should write another paper on design analysis for future studies!

          And, yeah, if NIH requires a power analysis, you better do it. Just recognize the limitations.

          Regarding what to do instead, Shira Mitchell and I and others wrote two (unpublished) papers on the design for the Millennium Villages evaluation; see here and here. These demonstrate some of the things you can do, other than power calculations, when evaluating a design.

          • jd says:

            Great! Thank you! I will take a look at these.

            A paper on design analysis for future studies would be extremely helpful. If it provided the code and examples like the paper with Carlin about Type S and M errors, that would be fantastic.

  3. Pietro Ghezzi says:

    Dear Justin,
    Although I am a coward and always pass my results through the P<0.05 ritual or else my papers won’t be published, I must say the website on the recent Nobels was a pick-and-choose. Those are two very recent ones (2018 and 2019). I found some of the earlier ones (1991-1992)*, which were really instrumental to the discovery of HIF-1, where null hypothesis significance testing was, in most cases, not done, and the word “significant” was used in the way we would normally use it outside scientific papers (“important, worthy of consideration, meaningful” – Chambers dictionary). And it’s full of examples like those: impactful papers that opened the way to new fields, and medicines, without a P value.

    *Forsythe JA, Jiang BH, Iyer NV, Agani F, Leung SW, Koos RD, Semenza GL. Activation of vascular endothelial growth factor gene transcription by hypoxia-inducible factor 1. Mol Cell Biol. 1996 Sep;16(9):4604-13.
    Semenza GL, Nejfelt MK, Chi SM, Antonarakis SE. Hypoxia-inducible nuclear factors bind to an enhancer element located 3' to the human erythropoietin gene. Proc Natl Acad Sci U S A. 1991 Jul 1;88(13):5680-4.

    Regarding the fact that NHST helps cut noise – it may help, but it may also not help. When we look at gene expression array data to do a gene expression profile, we normally cut them with an NHST threshold, but it has happened that I looked at them without the P value (or limma or whatever we use), just sorting them by fold change, and, at least once, I found something interesting that was not statistically significant and then validated it by a different technique. So, yes, if you have data from 30,000 transcripts the test helps cut the noise, but, unless you also look at them in some other way, it may be a blinder that will not let you see something interesting.

  4. Justin says:

    Hi Pietro,

    “…, I must say the website on the recent Nobels was a pick-and-choose.”

    Yes, it was, much like critics who claim ‘p-values and significance language are bad for science’ pick and choose when they do not talk about the good.

    I did mention that sometimes I found none, sometimes Bayesian and frequentist techniques within the same paper, sometimes only Bayesian, and so on. But I did, in fact, find p-values and significance language were useful to the scientists doing science at the highest level, contrary to claims of these things supposedly being bad for science.

    Justin

    • matt says:

      Justin,

      The fact that Nobel Laureates used p-values in their research is not evidence that p-values are useful. Perhaps their research would have been more compelling without the use of p-values, or perhaps the reporting of p-values made no difference whatsoever to their main findings.

      Also, p-values are ingrained in the scientific establishment; Nobel Laureates are selected by the scientific establishment. Do you see the problem? There are many terrible (from a statistical standpoint) papers published by tenured professors at Harvard; should we also not question the methods used in these papers simply because they are the product of individuals employed by the finest academic institution in the world?

      Scientific paradigms change over time as we realize flaws that were not well understood in the past (I think just 7-8 years ago Andrew was blogging uncritically about certain social psych papers that he would now criticize harshly, on grounds of “Garden of Forking Paths” type problems). If people accepted your appeal to authority then there could never be change in the scientific community.

  5. Justin says:

    Also, I am thinking of exploring other scientific discoveries/findings to see whether p-values and statistical significance language are used.

    For example, there sure seems to be a lot, just perusing around, on the evidence for efficacy and safety of vaccinations, and a lot of p-values and statistical significance language are used in these papers.
    Another argument from authority, I’m sure. ;)

    Justin

    • DC says:

      Dude, how do you not see this being an argument from authority? Are you intentionally not seeing it? You’re literally taking the (flawed) normative state of scientific practice (i.e., in many disciplines it is damn near impossible to publish anything without adding a p-value to the paper) and then claiming that this is evidence for why p-values are useful (…See?! these papers use p-values, they must have been useful!!). How do you not see the issue with that? You could totally be right in claims about p-values and this would STILL be a bad argument in favor. You can’t argue that something is useful just because it is common (and often mandated).

      • Justin says:

        “You’re literally taking the (flawed) normative state of scientific practice (i.e., in many disciplines it is damn near impossible to publish anything without adding a p-value to the paper) and then claiming that this is evidence for why p-values are useful (…See?! these papers use p-values, they must have been useful!!)”

        If journals are mandating them, they may be mandating them because they have proven useful.

        Look at Duflo’s recent Nobel for Economics work on poverty over many years and journals. They basically state that they find RCTs and significance tests essential compared to the alternatives. This was not mandated by any journal.

        Justin

        • Andrew says:

          Justin:

          I think the work of Duflo et al. would be just as good, if not better, if no p-values had been calculated. There is value in randomized experiments—they reduce bias—and there is value in large sample sizes—they reduce variance. I don’t see statistical significance and p-values as having any useful role here, except for the “social” role of enabling the papers to be published in top journals. Conditional on the papers being accepted, I think they’d be just as good if they never computed a p-value or its equivalent.

          • Justin Smith says:

            Hi,

            It is often argued that using a p-value is flawed because it is using results that could have happened but didn’t happen. But several people have responded that the researchers could have used other methods that they didn’t actually use, and that is somehow not a flawed argument. Ironic. Yes, they could have used other approaches, but chose not to, probably for good reasons; I imagine the priors would be incredibly hard to construct. The research was great, published or not, in understanding poverty (reducing child mortality, improving education, etc.).

            Now I read from Daniel that praised Nobel work (I found examples in Economics, Medicine, and Physics, for multiple years, all using p-values and statistical significance) is “like 8th grade science fair”. That’s caricaturing what they did, of course, but it *is* funny in a denial-y sort of way.

            Justin

            • Anoneuoid says:

              “It is often argued that using a p-value is flawed because it is using results that could have happened but didn’t happen. But several people have responded that the researchers could have used other methods that they didn’t actually use, and that is somehow not a flawed argument. Ironic.”

              Someone somewhere argued p-values are flawed because “it is using results that could have happened but didn’t happen.” Some other person said that researchers should use a different method than they did.

              What is ironic about this to you?

              Ok, let’s say for the sake of argument the same person said both things. You find it ironic to tell someone not to use a flawed method if that method uses results that didn’t happen? Basically, since p-values use results that never happened, it is impossible to criticize them, according to your reasoning?

              • Anoneuoid says:

                Or actually, it’s that if someone ever criticized p-values for using results that didn’t happen, then they can never legitimately say someone should have done something differently.

        • Apparently you can get a Nobel in Econ for the brave new idea of running experiments to see what works… sigh.

          The summary article by the Royal Swedish Academy was honestly pretty disheartening.

          https://www.nobelprize.org/uploads/2019/10/advanced-economicsciencesprize2019.pdf

          “The modern approach to development economics relies on two simple but powerful ideas. One idea is that empirical micro-level studies guided by economic theory can provide crucial insights into the design of policies for effective poverty alleviation. The other is that the best way to draw precise conclusions about the true path from causes to effects is often to conduct a randomized controlled field trial.”

          This is like 8th grade science fair ideas.

          • Justin Smith says:

            https://www.nobelprize.org/uploads/2019/10/advanced-economicsciencesprize2019.pdf

            “This is like 8th grade science fair ideas.”

            A+ work on cherry-picking, distorting, and trivializing a 40+ page summary of decades of work and research on poverty.
            The Nobel committee will sure be overwhelmed next year by all those 8th graders who will be getting medals.

            But a serious thing is: do you think work has to be overly complexified to be good science? That the Nobel committee thought their experiments and ideas and research were pretty good is all that matters, apparently.

            Cheers,
            Justin

            • Their work is just fine, great even. What bothers me is that the field of economics as a whole is in a place where people give out prizes and remark on how amazing it is that these people actually decided to test some theories using experiments… Imagine giving out a chemistry prize (Nobel was a chemist) and saying that one of the major contributions was that they actually synthesized a new explosive, put it out on the bomb range, and jolted it with electricity to see if it blew up…

    • matt says:

      Justin,

      Again, the fact that p-values were used in the research on efficacy of vaccinations is not evidence that p-values are useful. Given that vaccinations ARE incredibly useful, I think you would be very hard pressed to find a statistical / decision theoretic framework that would lead you to conclude that vaccinations are not effective.

      I’m sure in most vaccination studies a cursory glance at the raw data would inform you of their efficacy. P-values likely added nothing to the strength of the conclusions.

      • Anoneuoid says:

        Given that vaccinations ARE incredibly useful

        How did you determine this?

        • matt says:

          Just the basic time trends in vaccination rates and disease eradication are enough. Sure, there could be a confounder, but that seems incredibly unlikely.

          • Anoneuoid says:

            Just the basic time trends in vaccination rates and disease eradication are enough. Sure, there could be a confounder, but that seems incredibly unlikely.

            Have you looked into this at all? Because there are many *huge* confounders… In the case of measles (the one I looked into) you have:

            1) People stopped having “measles parties”
            2) They changed the definition to require a blood test
            3) Doctors are more reluctant to diagnose measles in someone who reports being vaccinated (eg, there were still ~10k reported cases that met the original clinical criteria for measles diagnosis in the US in 2004)

            Collectively that can account for over 99% reduction in reported cases by my estimation.

            Then there is the second problem of side effects. The rate of “side effects” from MMR is pretty much the same as the complication rate from measles in 1950s US/UK. So the danger is about the same for the child, but the benefit (immunity) from the vaccine wanes much faster as you age… You can start here if you want to look further: https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/#comment-1005346

            • matt says:

              LOL. So Anoneuoid is an anti-vaxxer, in addition to dominating the stock market on a daily basis.

              Maybe look up polio. Or is there a confounder there, too?

              • Anoneuoid says:

                I only looked into Measles deeply.* And how am I an anti-vaxxer? You clearly didn’t even read it, and are incapable of thinking rationally. Go back to your NHST jobs program.

                * For anyone actually capable of thinking for themselves on the topic, there is another disease called non-polio acute flaccid paralysis that rises in frequency after polio vaccination programs are implemented. Other confounders would be sanitation, antibiotics reducing the rate of complications, etc.

              • Saying that the evidence for vaccine effectiveness is less strong than it’s made out to be is not the same as saying vaccines don’t work.

                The flu vaccine is talked up a lot every year, but it probably isn’t nearly as effective as nurses and doctors make it sound when they act like “if you get the vaccine you won’t get the flu.” If the vaccine reduces incidence by half and severity by half, that’s a huge benefit, even though it’s much less than the 90 or 100% a typical healthcare worker’s answer will imply. But they are afraid that if they say that, far fewer people will bother to get the vaccine, their public health campaign will fail, and eventually the vaccine will stop being made because it loses money… so they tell a sort of white lie.

                Another thing you’ll hear is “you can’t get the flu from the flu vaccine” but of course that’s only true if the manufacturers didn’t have any kind of malfunction. You know what else? You can’t shoot yourself in the head if the safety is on your loaded pistol… but don’t EVER EVER point your pistol at your head or anyone else’s, safety or not.

                Realistic risk and reward assessment is something that has been pounded out of much of science, basically due to politics. Doctors aren’t going to say “this cancer drug increases your lifespan by 3 to 4 months while massively increasing your incidence of vomiting, diarrhea, and hair loss; most doctors who treat cancer would never take it themselves,” but they will say “this drug is the best thing we have, it significantly improves outcomes compared to the older drugs…”

                meh

            • Nick Adams says:

              1. True
              2. Not true. Clinical diagnosis or a nose/throat swab PCR. Serology is unnecessary and rarely done.
              3. Not really true. I’m happy to diagnose measles in someone who has been vaccinated under the right circumstances. However, since the disease is much less common in the vaccinated I would obviously need better clinical evidence to do so.

              • The point about number 3 is that the *old* diagnostic criteria probably over-diagnosed measles, so when you tighten the criteria the reported incidence declines automatically without any change in measles itself.

                Might be true, most likely is, as modern diagnostic methods are way way better than what was available in 1910 or whatever.

              • Anoneuoid says:

                Not true. Clinical diagnosis or a nose/throat swab PCR. Serology is unnecessary and rarely done.

                I cited this in my earlier post. Here it is according to the CDC:

                Measles (Revised 9/96) Clinical case definition

                An illness characterized by all the following:

                a generalized rash lasting greater than or equal to 3 days

                a temperature greater than or equal to 101.0 F (greater than or equal to 38.3 C)

                cough, coryza, or conjunctivitis

                Laboratory criteria for diagnosis

                Positive serologic test for measles immunoglobulin M antibody, or

                Significant rise in measles antibody level by any standard serologic assay, or

                Isolation of measles virus from a clinical specimen

                Case classification Suspected: any febrile illness accompanied by rash

                Probable: a case that meets the clinical case definition, has noncontributory or no serologic or virologic testing, and is not epidemiologically linked to a confirmed case

                Confirmed: a case that is laboratory confirmed or that meets the clinical case definition and is epidemiologically linked to a confirmed case. A laboratory-confirmed case does not need to meet the clinical case definition.

                https://www.cdc.gov/mmwr/preview/mmwrhtml/00047449.htm

                Also, a throat swab + PCR was never done to confirm a case of measles back in the 1950s either. It plays the same role of making the criteria stricter.

                Not really true. I’m happy to diagnose measles in someone who has been vaccinated under the right circumstances.

                I cited this in my earlier post as well:

                “This was not a blind study, since the investigators knew which children had received measles vaccine.
                […]
                It seems probable that the occurrence of so much ‘measles-like’ illness in the vaccinated children was a reflexion of the difficulty in making a firm diagnosis of measles in the African child at one visit.”

                http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2134550/

                “As only approximately 7% of the clinically-diagnosed cases of measles reported locally turned out to be measles by laboratory testing, there is a need for laboratory confirmation of measles to avoid misidentification of cases and improve disease surveillance.(2)”

                http://www.ncbi.nlm.nih.gov/pubmed/17609829

                Thus, requiring “laboratory confirmation” could reduce the number of reported measles cases by ~90%.

              • Anoneuoid says:

                I have a comment (with too many links apparently) waiting for moderation, but also regarding #3:

                “Indeed, an average of only 100 cases of measles are confirmed annually [32], despite the fact that >20,000 tests are conducted [28], directly suggesting the low predictive value of clinical suspicion alone.”

                https://www.ncbi.nlm.nih.gov/pubmed/15106109

      • Pietro Ghezzi says:

        In any case, I doubt very much that anti-vaxxers would look at the statistical tests in scientific papers, so this would be a case where NHST wouldn’t have a great impact on public health. They probably build their beliefs in other ways.

    • Anonymous says:

      Progress:

      1930’s: “p-values come with frequentist guarantees and will rarely lead to errors”

      2010’s: “P-values: they’re not guaranteed to fail!”

      And it didn’t even take a full century to get there.

  6. Anoneuoid says:

    Is it a fact that there is no mechanism by which readers can evaluate the strength of evidence in many published studies?

    You have to synthesize the various lines of evidence into a model and derive some otherwise surprising predictions from that model. Then check those predictions against new data. And yes, this is pretty much absent from modern biomedical research.

  7. Nick Adams says:

    Just on confidence intervals and power:
    It’s instructive to convert %power to precision expressed in terms of the minimum practically significant effect size (MPSES).
    If power is 50% the CI width will be twice the size of the MPSES. One can justify this as being the lowest acceptable power on the grounds that if the power is any lower than this, then it is possible to get an observed effect size equal to the MPSES and yet have a CI that includes zero. This is clearly undesirable – practical significance without statistical significance.
    At the other end of the scale, roughly 97.5% power produces a CI width equal to the MPSES. One can justify this as being the largest necessary power, since beyond it the CI is narrower than the MPSES and so can never include both zero and the MPSES (and hence can never be equivocal).
    And just to tidy everything up, with 80% power (as recommended in the medical literature) the CI is about 1.4 times the MPSES, which is roughly the square root of two. Since the sample size is proportional to the inverse square of the CI width, the sample size at 80% power will be about half that at 97.5% power and about twice that at 50% power. Neat, eh?
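    Under the usual normal approximation (two-sided test at level alpha), the ratio in this comment works out to CI width / MPSES = 2·z_{alpha/2} / (z_{alpha/2} + z_power), since MPSES = (z_{alpha/2} + z_power)·SE while the CI has width 2·z_{alpha/2}·SE. A minimal stdlib-only sketch to check the ratios (the function name is my own, not from the comment):

```python
from statistics import NormalDist

def ci_width_over_mpses(power, alpha=0.05):
    # Normal approximation: detecting the MPSES with the given power at
    # two-sided level alpha requires MPSES = (z_{alpha/2} + z_power) * SE,
    # while the (1 - alpha) CI has width 2 * z_{alpha/2} * SE; the SEs cancel.
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_b = z.inv_cdf(power)          # 0 at 50% power
    return 2 * z_a / (z_a + z_b)

print(round(ci_width_over_mpses(0.50), 2))   # 2.0  (CI width twice the MPSES)
print(round(ci_width_over_mpses(0.80), 2))   # 1.4  (roughly sqrt(2))
print(round(ci_width_over_mpses(0.975), 2))  # 1.0  (CI width equals the MPSES)
```

    Since CI width scales as 1/sqrt(n), the sample-size ratios follow by squaring: (2/1.4)^2 ≈ 2 and (1.4/1.0)^2 ≈ 2.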
