In short, adding more animals to your experiment is fine. The problem is in using statistical significance to make decisions about what to conclude from your data.

Denis Jabaudon writes:

I was thinking that perhaps you could help me with the following “paradox?” that I often find myself in when discussing with students (I am a basic neuroscientist and my unit of counting is usually cells or animals):

When performing a “pilot” study on say 5 animals, and finding an “almost significant” result, or a “trend”, why is it incorrect to add another 5 animals to that sample and to look at the P value then?

Notwithstanding the induced bias towards false positives (we would not add 5 animals if there were no trend), which I understand, why would the correct procedure be to start again from scratch with 10 animals?

Why do these first 5 results (or hundreds of patients depending on context) need to be discarded?

If you have any information on this it would be greatly appreciated; this is such a common practice that I’d like to have good arguments to counter it.

This one comes up a lot, in one form or another. My quick answer is as follows:

1. Statistical significance doesn’t answer any relevant question. Forget statistical significance and p-values. The goal is not to reject a null hypothesis; the goal is to estimate the treatment effect or some other parameter of your model.

2. You can do Bayesian analysis. Adding more data is just fine, you’ll just account for it in your posterior distribution. Further discussion here.

3. If you go long enough, you’ll eventually reach statistical significance at any specified level—but that’s fine. True effects are not zero (or, even if they are, there’s always systematic measurement error of one sort or another).

In short, adding more animals to your experiment is fine. The problem is in using statistical significance to make decisions about what to conclude from your data.

97 thoughts on “In short, adding more animals to your experiment is fine. The problem is in using statistical significance to make decisions about what to conclude from your data.”

  1. I confess, I just didn’t understand the first point at all.

    1. Statistical significance doesn’t answer any relevant question. Forget statistical significance and p-values. The goal is not to reject a null hypothesis; the goal is to estimate the treatment effect or some other parameter of your model.

    When is statistical significance relevant? The distinction blew right past me.

    Jai Jeffryes

    • Statistical significance is relevant when you are asking the question “could my data have happened if I were collecting data from a particular random number generator” and you want a yes/no answer.

      If p < 0.05 you can conclude “my data probably didn’t come from that random number generator,” and if p > 0.05 you can conclude “my data could have come from that generator (not must have, but could have).”

      if either of those things seems like a useful answer to you, by all means… but the situations where they are actually useful are pretty limited…
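
      To make that concrete, here is a minimal simulation sketch in Python (the standard-normal null and the sample mean as the test statistic are just illustrative assumptions on my part):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      # Hypothetical measurements from 5 animals.
      data = np.array([0.4, 1.1, 0.9, 1.6, 0.7])

      # Null model: the data are n iid draws from a standard normal RNG.
      # Test statistic: absolute value of the sample mean.
      n = len(data)
      observed_stat = abs(data.mean())

      # Simulate the null distribution of the test statistic.
      null_stats = np.abs(rng.normal(0.0, 1.0, size=(100_000, n)).mean(axis=1))

      # p-value: how often that RNG produces a statistic at least this extreme.
      p_value = (null_stats >= observed_stat).mean()
      print(p_value)
      # Small p: "the data probably didn't come from that random number generator."
      # Large p: "the data could have come from that generator (not must have, could have)."
      ```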

      • Did you really intend to use a fixed p-value threshold, and 0.05 at that? That procedure would ignore the evidential content of the p-value. Smaller p-values indicate that the data are more discordant with the random number generator than larger ones do.

    • I think this is a good comment.

      The case against statistical significance isn’t being made clear. I just read a good share of Gelman and Loken 2013, the “forking paths” paper. The examples are clear enough but what is meant in general by “forking paths” isn’t clear at all, and I think this is reflected in lots of comments I’ve seen here and elsewhere.

      Looking at the examples in the forking paths paper, though, as I’ve said many times before, the **FIRST** and **BIGGEST** problem in these examples is that the researchers make false assumptions about the relationship between the population and the sample. The studies are no good not because the researchers don’t understand statistical significance, but because their experiments are riddled with errors. G&L recognize these errors, but instead point to statistical significance as the problem. It’s not the problem. The problem is bad study design and false assumptions.

      For example, in the paper on socio-economic status and upper arm strength, the researchers make quite a few blatant errors:

      1) substitute “arm circumference” for “arm strength”

      It’s an assumption that arm strength = arm circumference. Is it valid? Possibly, but this has to be established separately because there are other factors that affect arm circumference.

      2) use college students as representative of the general population

      Why this keeps happening I have no clue but it’s a bad practice and should be denounced. Pollsters select a representative portion of a population to get accurate polling results. Researchers investigating other phenomena should be doing the same, not just using college students because it’s cheap. Every paper should not only explain but defend the method of sample selection and show that it’s a viable method for the analysis.

      3) investigate a very small magnitude interaction that has no reasonable basis for being true

      We know this is at best a small-scale interaction because we already know there’s no visible variation in body strength with any of these parameters. If there were, we wouldn’t need the statistics to detect it! We could go into upper-class political establishments and see burly men ambling around denouncing wealth redistribution. But we don’t see that. We see the two most unburly men North America has to offer: Gates and Bezos.

      The bottom line is that there’s no justifiable reason to believe this data would be an accurate representation of anything related to the questions at hand. The whole study design is just pure bunk.

      And the bigger problem is that much of the research apparatus of social science is built on this kind of sloppy experimental design, created mainly for the purpose of generating data for political ends. So people have to decide whether social science is science or not, and get rid of this shit once and for all.

      • jim, great comment. Yes, indeed, bad experimental design or a weak and unclear relationship between the sample and the population of interest are the worst problems. And they are problems that cannot be dealt with by any statistical flimflammery.

  2. Never mind my comment, Andrew, about not getting your observation about the irrelevance of statistical significance here.

    I clicked through to your paper, Abandon Statistical Significance, and I’m studying that.

    Thank you,
    Jai Jeffryes

  3. DJ,

    Statistical significance testing, p-values, etc., are the most popular methods of analysis in the world, probably in just about every subject. Therefore, “scapegoat p-values for all questionable research practices” is an expected answer, but not quite a satisfying one.

    In the design you mentioned, where you keep adding animals until you find statistical significance, this is a questionable research practice if the stopping rule is not specified beforehand, the study is not designed as a sequential test, or alpha is not adjusted.

    “1. Statistical significance doesn’t answer any relevant question. Forget statistical significance and p-values. The goal is not to reject a null hypothesis; the goal is to estimate the treatment effect or some other parameter of your model.”

    Statistical significance does answer relevant questions and I’d advise to not forget statistical significance and p-values. It answers the question of whether your data/test statistic is far away from what you’d expect under your model, thus allowing falsification and learning. It works for Nobel Prize winners, in experimental design, clinical trials, survey sampling, quality control, and in analyzing quantum supremacy data, for example.

    “2. You can do Bayesian analysis. Adding more data is just fine, you’ll just account for it in your posterior distribution. Further discussion here.”

    It can be “just fine” and it can also have issues. It is a fallacy that optional stopping is never a problem for Bayesians. It seems like it depends on prespecification and also the type of prior.

    In “Stopping rules matter to Bayesians too” (https://www.phil.vt.edu/dmayo/conference_2010/Steele%20Stats%20tests_april15_sent_PS.pdf). Steele writes

    “If a drug company presents some results to us – “a sample of n patients showed that drug X was more effective than drug Y” – and this sample could i) have had size n fixed in advance, or ii) been generated via an optional stopping test that was ‘stacked’ in favour of accepting drug X as more effective – do we care which of these was the case? Do we think it is relevant to ask the drug company what sort of test they performed when making our final assessment of the hypotheses? If the answer to this question is ‘yes’, then the Bayesian approach seems to be wrong-headed or at least deficient in some way.”

    Also see “Why optional stopping is a problem for Bayesians” (https://arxiv.org/pdf/1708.08278) by Heide and Grunwald.

    “3. If you go long enough, you’ll eventually reach statistical significance at any specified level—but that’s fine. True effects are not zero (or, even if they are, there’s always systematic measurement error of one sort or another).”

    If you go long enough, maybe the likelihood would swamp any Bayesian prior too.
    True effects are not 0 (a useful model), but they may be under any practically meaningful threshold.

    Justin
    http://www.statisticool.com

    • To be fair, quite a few Nobel Prize winners also used slide rules, published look-up tables, and assumption-heavy (but computationally tractable) models in their work, given when it was conducted. That doesn’t mean we should emulate their approaches if we have the capacity to compute posteriors of interest directly.

  4. So I’m not a statistician but my perception of the discussion of statistical significance is that it’s not making clear what conditions must be satisfied for statistical significance to be the appropriate method of analysis.

    This is my crack at the assumptions that must be satisfied for appropriate use of statistical significance:

    1) Sample is representative of the population (I think this captures the effect of n and also using students to represent the general population or other similar errors)

    2) Method of measurement accurately captures the feature of interest (this is a huge problem in many studies, e.g., arm circumference substituting for upper body strength)

    3) Proposed process or relationship is the dominant factor affecting the measurement – that means that there are no other sources of variation of equal or subequal magnitude that create apparent noise. So if you’re trying to measure teacher effectiveness, “student achievement” isn’t a good measure because student achievement is affected by many other, likely larger-magnitude factors (raw intelligence, student effort, parental help, etc.).

    4) At a threshold of p = 0.05 there’s a 1/20 chance of a false positive when there is no true effect, even when all other factors are perfectly controlled; in other words, one out of every twenty such analyses will yield a false positive in the best-case scenario (see the sketch below).

    Would love to see others’ response to this
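
    As a quick check of point 4 above, here is a simulation sketch (Python with SciPy; the two-sample t-test, the 0.05 threshold, and the group sizes are arbitrary illustrative choices of mine):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_tests, n_per_group = 10_000, 20

    false_positives = 0
    for _ in range(n_tests):
        # Both groups come from the same distribution: the null is exactly true.
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1

    # Roughly 5% of the tests come out "significant" even though nothing is going on.
    print(false_positives / n_tests)
    ```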

  5. Jim,
    I like your formulation and it makes sense to me, but I’m not a statistician either. You cannot get a straight answer from anyone here who is. I strongly suspect that there are no straight answers that won’t cause a different statistician to fly into a rage, or sink into despair.

    In particular, qualitative statements like “method of measurement accurately captures feature of interest” might as well be in a different language. There just is not a framework for understanding these sorts of notions as near as I can tell, unless you can put it in an equation or Bayesian prior.

    The field of statistics is in a particularly bad patch right now, although the statisticians seem to be the last to get the word. Statistical significance is being criticized in different ways and for different reasons, yet still being defended by some. Meanwhile, replacement methods that have been proposed just lead to bickering and shouting.

    Researchers in other fields just need a list of best practices in statistical inference. It should be that if you are working in the social sciences, biology, etc., you can just pick from a list of statistical best practices that are appropriate for the type of analysis you are doing, and if you pick from the list, you should not be criticized for it during review. Innovation and novel approaches are still allowed, but you should assume that you will be asked to defend their use.

    I just cannot see this happening with the herd of cats currently prominent in the field of statistics.

    • Matt, your point about statisticians failing to respond to the critical issue of whether the sample (measurements) capture features of interest is well made. I agree.

      You are correct to doubt the utility of a prior in correcting for the shortcoming. A Bayesian prior exists within a statistical model and so it can only shape the information captured by the model. That means that the prior cannot rescue that model when it fails to contain a critical real-world feature. The only solution to the problem is to act on a recognition that statistical inference and scientific inference are different things.

      I will again shamelessly promote a chapter that I wrote in which I discuss these very issues: https://arxiv.org/abs/1910.02042

      • Michael, thanks for the link to your paper. I am working my way through it. I guess I do disagree with your view “that good scientists [need to be] capable of dealing with the intricacies of statistical thinking.” To me that is like saying that to determine what waveforms your device is producing, it is not enough to be able to read them on an oscilloscope, you have to be able to design an oscilloscope yourself. I frankly feel that it is an abrogation of responsibility by the statistical community to just toss the statistical aspects of proper inference methods to the unwashed masses (like me). Because we will fight like ignorant dogs. And we will attempt to baffle our reviewers with statistical treatments unique, dense and long, and dare them to wade in deep enough to criticize.

        • I disagree, it’s more like saying that to determine what waveforms your device is producing you must understand the concept of real numbers, of functions, of voltage, of variation of voltage in time, of input impedance and probe capacitance and the way that attaching the probe might change the waveform your device produces, of radio frequency radiation and shielding to know whether signals come from outside your device…

          Do you disagree that those are important to using an oscilloscope, and are separate from design of the instrument itself?

        • Daniel captures my intention perfectly.

          Ideally a scientist would be able to make the best possible use of statistics as a relatively reliable step towards making optimal scientific inferences. In the real world a scientist usually has to be able to do a good enough job with statistical inference that it does not substantially detract from the scientific inferences. (Many, many scientists will have to up their game to meet the standards of that second sentence!)

    • “you can just pick from a list of statistical best practices that are appropriate for the type of analysis you are doing, and if you pick from the list, you should not be criticized for it during review.”

      Perhaps I misunderstand you, but this sounds like a bit of an over-simplification. Creating a flow chart of canned analyses sounds like a bad idea…but maybe this is what has been done for forever, and now that the word is out that the flow chart method is no good, people are unhappy.

      Also, so far, I have found that in review I am criticized for ‘best-practice’ because it isn’t the canned old way of doing it that reviewers are familiar with. So, other way around.

      • “Creating a flow chart of canned analyses sounds like a bad idea…”

        I don’t know why it would be. Physicists use a set of “canned” or standardized procedures to find and verify exoplanets. Geologists use a standard set of procedures to identify and process specimens for isotopic ages. Drug researchers use a standard set of procedures to extract natural chemicals from plants and test them for efficacy in the lab.

        Everything in science is done by standardized procedures. If you’re not using one, you’re making one up to get other people to use. The procedures aren’t the problem. The problem is the people who don’t follow procedures but claim they did. A lot of that is not confirming that the assumptions required for the procedure to work are actually met.

        • To the extent that you are running the same physical process multiple times, then yes, you can also run the same mathematical model of that physical process multiple times…

          the idea however that you can somehow create a canned procedure which need not know anything about the physical process or social science process or literary process or whatever and can still magically tell you what your data means… that’s extremely problematic.

        • I disagree. Good data analysis varies greatly depending on the data generating process and the particular observations on the components of this process that you happen to observe in your experiment or study. General guidelines could be given, but creating a giant list or flow chart to follow would seem quite difficult if not impossible to me. Also, I think flow charts of canned solutions tend to encourage people to turn their brains off.

          A flow chart or standardized procedures for data analysis would be like writing an SOP to drive a car. It’s nice to know that at a stop sign one should brake the car to a halt, but it’s also a good idea to brake the car to a halt when a child steps out into the street, or a dog does, or if smoke starts coming out from under the hood, or when it starts raining too hard to see, etc. At some point you need to think for yourself and apply the skills you know in a common-sense and reasonably intelligent way.

          Martha Smith puts this better than I have in the comments below.

    • Matt said,

      “Researchers in other fields just need a list of best practices in statistical inference. It should be that if you are working in the social sciences, biology, etc., you can just pick from a list of statistical best practices that are appropriate for the type of analysis you are doing, and if you pick from the list, you should not be criticized for it during review.”

      Sorry to be the grinch, but:

      What you say presupposes a world that is simpler and more uniform than the one we live in. If it were as simple as what you propose, then someone (or some group of people) could (theoretically) create a (probably) humongous computer program that could be used to pick the best approach for the given problem — and I say “problem” rather than “type of analysis”, because the choice of type of analysis will depend on the problem, and lots of little nitty-gritty properties of the particular problem, in the particular context in which it is being studied. And such a program would be very difficult to use, because there would be so many little nitty-gritty details to include.

      What is really needed is *thinking* at every stage of working on the problem, including deciding on how best to collect data (including what data to collect), and taking into account all the little idiosyncrasies of the particular problem. These decisions on the part of the researcher *should* be spelled out in reporting the research, and *should* be critiqued (perhaps what you call criticized) in order to decide how credible the results of the research are. The thinking is (at least) as important as carefulness in following procedures.

      • Thinking a little more about the situation: Here is a suggestion that I think would be a good idea to implement (which I realize might be difficult, but worthwhile things are often difficult) in graduate programs preparing students to do research:

        Provide a course (or perhaps a two-term course sequence) focused on helping students to read, critique, plan and carry out statistical analyses in the field, as described below:

        The first part of the course would be discussion of the statistical methodologies most commonly used in the field. It would not be a “how to do this statistical technique” course, but would focus on things like:
        The model assumptions of the technique.
        The implications of the model assumptions for data gathering.
        When to use (and when not to use) the technique.
        What the model assumptions imply for the “settings” to use when using statistical software to analyze the data.

        The second part of the course would have a “journal club” type format: Each meeting would be discussion and critique of an article in the field. Students would take turns leading the class meetings. Class discussion would focus on the quality of the article, including whether or not data gathering and statistical analysis were appropriate for the type of analysis used, as well as quality of explanation/exposition.

        Such a course would help prepare students to read and critique literature in the field, as well as give them an idea of what they need to do in their own research and writing.

        • Martha,
          Thanks for your responses. It is an important topic. I have a close relative who is a late-career, prominent psychology researcher. He is frankly terrified that the statistical world will fully denounce p testing, more or less tossing his life’s work into what the British call “the dustbin of history.” Remember that theory that we have a reptile brain inside a monkey brain inside a human brain? Hundreds of papers were written with that as a basic assumption, and I bet no one reads them anymore.

          I assumed it would be hard to come up with best practices. But the waterfront does not need to be covered. There is a standard template for forward correlation/causation in psychology:

          1. Identify a discrete population of interest.
          2. Define a treatment and apply it to the population. Or in observational studies, identify and measure the magnitude of the treatment in the population.
          3. Show that the treatment had a measurable effect of meaningful magnitude on the outcome of interest.

          This is where folks have used p values for the last few decades. Is it really too hard to come up with best practices for this specific situation?

        • oh, sure, for that kind of thing, best practices is to just stop doing it entirely!

          Imagine if people tried to understand buoyancy that way, we would have a hopeless mishmash of thousands of papers: x sinks in bucket of water… x floats in bucket of water when coated in butter, x sinks in water if coating of butter is too thin, x floats in water when powdered… actually x sinks in water when powdered if experiment performed in southern hemisphere… x definitely floats in water when powdered contrary to eminent southern researchers’ findings, x plus thick coating of butter actually sinks if water has sufficient temperature… blah blah blah

        • “best practices is to just stop doing it entirely”

          I am in violent agreement! I would never approach anything that way. But that is a standard approach in psychology, at least according to the practitioner I know. He would say that you are not trying to understand buoyancy, that is impossibly hard, so all you can do is establish whether a coating of butter has an effect in isolation. But I am with you, I wouldn’t feel comfortable with anything about a butter coating if I did not have an adequate understanding of the concept of buoyancy.

        • Matt,

          FWIW I’m not a statistician or medical researcher, just a geologist :) But it seems obvious to me that, properly applied, this approach should be suitable to test for the efficacy of a treatment at some level of accuracy or precision.

          All that’s going on in a test like this is comparing the distribution of measurements from treatment samples to the distribution of measurements from an ideal set of samples in which there is no effect. The distribution of measurements in which there is no effect is presumably normal, thus the idea that it’s a “random number generator”. I was confused by that for a while but I think that’s what Daniel and Phil are on about. A random number generator would produce a normal distribution.

          That being said, the method is only as good as the data, and P = 0.05 isn’t absolute proof of efficacy; it’s the *odds* that the difference in measurement distributions results from chance. So it’s an indirect comparison with certain odds of failure even if everything else is perfect.

        • Jim: if you have a historical dataset of curated “nothing is going on” data, and you get a new data point, then you can compare this data point to the historical set and say “this (is/is not) unusual compared to history”

          This is an actually fine application of hypothesis testing, the reason a thing is considered unusual is specifically because it was rare (had low frequency) in the past.

          The problem is, most hypothesis testing isn’t like that at all.

          1) It’s rare to compare to a large database of data; instead some stand-in distribution is used without ever checking whether it adequately represents the “real” distribution.

          2) Another strategy is that only a single parameter is compared, say the mean, using an asymptotic limit theorem to define the sampling distribution in a way that’s independent of what the data look like (as long as they have a mean and standard deviation…). But that two distributions have the same / different mean is often not sufficient to establish much.

          3) Even if the difference in means is meaningful, people often throw away that information and make decisions based only on whether something is “significant vs not significant”… i.e., they decide on a single binary digit of information (present/absent). But different-sized effects can’t all have the same theoretical use.

          4) When first filtering on significant/nonsignificant and then looking at effect size, the effect size is dramatically biased upwards in magnitude. Failure to replicate a similar magnitude effect is our expectation from this kind of research.

          So, there are multiple issues at play. The problem with p values isn’t the p value, it’s the failure to properly align the research question with the mathematical question.
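
          A sketch of the one use case above that does hold up, comparing a new measurement against a curated historical “nothing is going on” dataset (Python/NumPy; the historical baseline here is simulated stand-in data, purely for illustration):

          ```python
          import numpy as np

          rng = np.random.default_rng(2)

          # Curated historical measurements where "nothing is going on" (stand-in data).
          historical = rng.normal(100.0, 5.0, size=5_000)

          def empirical_p(new_value, baseline):
              """Two-sided empirical tail frequency of new_value within the baseline."""
              more_extreme = np.abs(baseline - baseline.mean()) >= abs(new_value - baseline.mean())
              return more_extreme.mean()

          print(empirical_p(101.0, historical))  # unremarkable compared to history
          print(empirical_p(123.0, historical))  # rare in the historical record, so flag it
          ```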

  6. Jim,
    You’re missing the forest for the trees. The problem with statistical significance is not with the mechanics of the test or of the sample, it’s with the question that it attempts to answer. The question being answered is, “if the true value of the effect I’m measuring is exactly 0 — that is, 0.000000000…, then how likely is it that I would see the data I saw?” The answer to that question is rarely relevant.

    For one thing, you almost always know that the effect you are measuring is not exactly 0. It probably isn’t even 0 to 30 decimal places, if it’s expressed in reasonable units.

    I’m not sure what would be a good example of your problem, so I’ll make something up: you’re measuring the concentration of some chemical in the blubber of elephant seals and you want to know if there’s an important difference in concentration between males and females.

    What you’re interested in is, how big is the difference in concentration between the two groups? Is that difference big enough to be important?

    But that is not what the p-value tells you. What the p-value tells you is: if there is no difference in the concentration — no difference whatsoever — then what is the probability I’d see data like mine? This is why Daniel Lakeland answered the way he did above: You are essentially comparing your observed data to the output of a random number generator (typically, but not always, one that spits out independent, identically distributed random numbers).

    Almost always there is no way the true effect you are trying to quantify could be 0.00000000000…. The only exceptions I can think of involve physical quantities (the rest mass of the positron might really be exactly equal to the rest mass of the electron; two protons can have exactly the same mass; things like that), and discrete quantities, i.e. those that can only take a finite number of values. In the example I made up, chemical concentrations in the tissues of elephant seals, there’s no way the statistical distribution in males can be identical to that in females, either in the current population of elephant seals or in expectation or anything else. It could conceivably be very close but it can’t be identical.

    Think of it this way. Suppose in some units, for some quantitative question, I think a difference of 3 between two groups might be of practical importance, and a difference of 5 is definitely important. Let’s pretend all I care about is the average for some reason. I measure three males and three females and my estimate of the average difference is 7 +/- 7. I am nowhere near ‘statistical significance’, there’s just a slight suggestion that one group is higher than the other. I sample a bunch more animals and now my estimate is 4 +/- 3. I still don’t have ‘statistical significance’ but at least it’s unlikely the difference is as big as, say, 15, which was entirely possible based on the first six. If I really want an answer I have to get more animals, so I do, and I end up at 2.5 +/- 1. Hey, great, there is a ‘statistically significant’ difference between the two groups…but it’s probably not of practical importance.

    ‘Statistical significance’ is usually not a good way to think about this stuff. Justin is right that it is extremely popular, indeed it is almost ubiquitous, but it’s also safe to say that there has been increasing recognition of the fact that it is usually not helpful, and that recognition isn’t just represented on this blog.

    • Thanks Phil! Great response.

      1) I recognize P is just a comparison of distributions and has nothing fundamentally to do with, for example, the concentration of yttrium in seal blubber, or how the concentration came to be what it is.

      2) I’m not advocating for or against the use of P. What I’m interested in is a clear expression of the conditions under which P can achieve something practically useful. Those conditions may be extremely narrow or not, but such conditions do exist. And by defining conditions under which P use is valid, we’re also defining the conditions under which it’s not valid. That would be an important contribution to the debate.

      3) “The question being answered is, ‘if the true value of the effect I’m measuring is exactly 0 — that is, 0.000000000…’”

      From a practical standpoint this isn’t true as far as I can see. Can you give an example where the true value of the effect was 0.000001 and that produced an erroneous result? Nonetheless there are situations where the true value of the effect really is zero. What if you measured the isotope ratios of Ru in two different minerals, one from Sudbury and one from South Africa? Are there regional or mineralogical differences in the Ru isotope ratios? No, because Ru is too heavy for its isotopes to experience mass fractionation and stable isotopes don’t experience chemical fractionation. One stable Ru isotope behaves *exactly* the same as any other.

      4) “…it’s probably not of practical importance.” Why is it probably not? I don’t see any relevant argument here.

      I’m not arguing for or against statistical significance. But what I see is a much more fundamental problem than number crunching. If you have bad data, neither Fisher nor Bayes can help you, and I haven’t seen an argument as to how eliminating significance testing would make that situation better. As far as I can see, just eliminating significance testing without changing people’s understanding of experimental design would mean even **MORE** crappy studies with bad design, poor measurements, etc etc.

      • 1. To your question of when is significance testing useful, it’s when you think the effect you’re looking for really could be zero, or at least so close to zero that it is practically indistinguishable from zero (see discussion of your third point, below).

        2. Although this is not really relevant to the rest of the discussion: the isotope ratio of Ru in minerals from different places cannot be exactly equal to an infinite number of decimal places. Probably not even to a dozen decimal places. Possibly not even to five decimal places. Some isotopes of Ruthenium are quite stable, and the isotopic ratios will depend on the relative concentrations of those isotopes where and when the minerals formed. Those might happen to be quite close but there’s no way they could be perfectly identical. Additionally, various isotopes of Ruthenium are produced from the decay of other elements, so even if, miraculously, the isotopic ratios of Ru in the minerals that formed at different places and different times happened to be exactly equal, the concentrations and isotopic ratios of those _other_ elements would also have to be exactly equal. It just can’t happen. Whether these differences would be big enough to detect or big enough to identify where a given sample of mineral came from, I don’t know. So that’s a good example of where statistical significance is not what you want to test: the question isn’t “is there a difference in the isotopic ratios” (the answer to that is Yes); maybe the question is, “is there a big enough difference that I could use current technology to determine the origin of a sample of mineral”, and that’s something that significance testing can’t tell you.

        3. I don’t know of any example where the true value of an effect was 0.000001 and someone tested against 0.00000000000000 and got an erroneous result, but it is almost certain to have happened: there are in fact parameters in the physical sciences and in biology that are measured to six figures, and people do null hypothesis significance testing in those fields, so it’s pretty much a sure thing that people have mistaken statistical significance for practical significance or made other errors based on those measurements.

        You’re right, though, that focusing on the exactness of 0 in NHST is a bit of a red herring. The point is not a red herring at all, though. What you usually care about is “is this effect large enough to be of practical significance”, and you simply don’t get an answer to that by comparing your numbers to the output of a random number generator that is centered at zero. What you want is something like “what is the probability the difference is large enough to be important, given the data that I have”; what NHST gives you is “what is the probability that, if the difference is zero, I would have obtained the data that I have.” There is clearly a relationship between these questions, but they are not the same question, and the answer to the second question is not (at all) the right answer to the first question.

        4. The reason NHST is usually not helpful is that you usually don’t care about the probability that you would have gotten the data that you saw, if the true effect were zero. You usually care about the probability that the true effect is substantially non-zero, given the data that you saw. These are not the same question and they don’t have the same answer.
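
        To put numbers on the contrast in points 3 and 4, here is a small sketch using the made-up elephant-seal numbers from the earlier comment (estimate 2.5, standard error 1.0, where a difference of 3 might matter and 5 definitely matters), assuming a normal approximation and a flat prior for the “given the data” question:

        ```python
        from scipy import stats

        # Made-up summary: estimated difference 2.5 with standard error 1.0.
        estimate, se = 2.5, 1.0

        # NHST question: if the true difference were exactly zero, how surprising is
        # an estimate at least this far from zero?
        p_value = 2 * stats.norm.sf(abs(estimate) / se)  # ~0.012, "statistically significant"

        # Question of practical interest (flat-prior normal approximation for the
        # posterior of the difference): is the difference big enough to matter?
        p_maybe_matters = stats.norm.sf(3.0, loc=estimate, scale=se)       # ~0.31
        p_definitely_matters = stats.norm.sf(5.0, loc=estimate, scale=se)  # ~0.006

        print(p_value, p_maybe_matters, p_definitely_matters)
        ```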

        • I am learning a lot from this thread!

          I had this for NHST:

          Null is: the signal cannot be differentiated from the noise no matter how well the data is measured.
          Rejecting the null is: sufficient confidence that the signal has been differentiated from the noise using the described measurement method.

          So I guess I am still struggling with a similar point to Jim’s, is it really fair to say it is always a rejection of zero or near zero or whatever?

        • not just “the noise” but one particular kind of noise that you checked.

          If you do a test that is sensitive to deviations from normal, you might find “this data isn’t from a normal random number generator” but it doesn’t mean it’s “signal”; it could, for example, be a Cauchy distribution, or a gamma, or lognormal, or uniform between two values, or a mixture of 3 different normals… anything…

        • “the isotope ratio of Ru in minerals from different places cannot be exactly equal to an infinite number of decimal places.”

          “cannot” or “probably isn’t”?

          “you usually don’t care about the probability that you would have gotten the data that you saw, if the true effect were zero. ”

          No, I don’t agree with that.

      • Well we’ve had this discussion numerous times… just to reiterate a point that I always make when this comes up. The interest when computing p-values does *not* have to be whether the true effect is exactly 0.0000. This is not the question that is addressed. The question that is addressed is whether the data can *distinguish* what has happened from a model in which the true effect is 0.0000. Surely if the answer is “no”, the data can’t be used as evidence for anything else, and that’s of interest and relevant in a lot of cases. On the other hand I agree with probably the majority here that if the answer is “yes”, this doesn’t necessarily mean that something substantially meaningful is going on, and one needs to look at effect sizes and all kinds of issues that could have caused the result such as systematic sampling error and whatever (against which a Bayesian approach offers no cure) apart from how one would like to interpret the result.

        By the way, somehow some seem to think that it is somehow a problem for significance tests that the true parameter is never exactly 0.000 in reality, whereas they seem to be very cavalier about the fact that any model (including their favourite Bayesian one) will never hold precisely either.

        • Christian:

          I agree with what you say about the use of p-values. Regarding the parameter not being exactly zero: the relevance to the above post is that sometimes people seem to think it’s cheating to keep gathering data until the result is statistically significantly different from zero, or people think that inferences from such analyses need to be corrected to account for peeking at the data. My point is that, no, it’s perfectly fine to keep gathering data until you get statistical significance. This is not cheating: In the real world, effects are not exactly zero, and it’s just fine that once you gather enough data, you can rule out that null hypothesis.

        • The thing is that the theory on which the (standard) test is based doesn’t take into account that you make decisions to gather more data conditionally on what happened earlier, which may bias the p-value. There’s sequential analysis that does take these things into account.

          The analogous problem in Bayesian statistics is this: if you do optional stopping in a Bayesian analysis and don’t model it (which as a Bayesian you don’t have to), and if you assume that there is a true underlying model, one can show that this can increase your chances of making wrong decisions, i.e., of increasing the posterior probability of something wrong, compared with an analysis in which you have a fixed rule for how many observations you gather, just the same as if you’re computing p-values while ignoring optional stopping. Chances are this has been mentioned, and papers that show this have been linked, elsewhere in this thread (I haven’t followed the links).

          Now of course a Bayesian could say, why should we be interested in what happens if we assume a true frequentist model in which we don’t believe anyway, but it’s how we can analyse the performance of methods: by setting up artificial models and seeing whether what happens there makes good sense. (Chances are I don’t have to tell you this Andrew, but other readers of this blog…;-)

          The problem isn’t that the p value is wrong or biased; the problem is that the model used to calculate the p value doesn’t model known facts about the actual data collection process. Another way to say this: standard practice among frequentists is to try to make the p value be based on a model of the data collection experiment, and therefore to be able to say something more than “if you had just grabbed a sample from rnorm() you probably wouldn’t have gotten this data.” What they want to say is “if you actually repeat this exact experiment in the world, and the true parameter is 0, you will rarely get data like this.” In other words, only the value of the parameter is left as a hypothetical; everything else is intended as a statement about the way the world actually operates.

          It’s this attempt to make frequency a physical fact that galls me about frequentism; I think it’s also what irritated Jaynes (his mind projection fallacy, for example). Mayo is on record here saying that statistical claims can’t just be “conditional on some assumed process… the probability of this thing occurring is p” but rather should be unconditional claims about what actually does happen. To me that’s the essence of the conflict. I have no problem with Keith’s favored interpretation of Bayes as testing hypothetical RNGs to see which subset is sufficiently lifelike.

        • > all kinds of issues that could have caused the result such as systematic sampling error and whatever (against which a Bayesian approach offers no cure)

          One way to think about Bayesian analysis is that it gives you free rein to design tests of whatever substantive question you have. Think there’s a systematic sampling error? Design a generative model that incorporates this error, place a prior that describes what sizes are reasonable for you to test, then run your test… you will find a sample of reasonable sizes for this error taking into account the data.

          In this context it becomes clear that the prior is just a part of the assumption that there might be a systematic error. It’s the part where you specify which systematic errors you want to consider. Why should you waste time considering sampling biases in tree size surveys that miss all the trees less than 100 ft tall, when you know people would have easily found many, but maybe not all, trees as small as 18 inches… it’s a waste of time, so instead consider a bias where the probability of finding a tree is a continuous function that increases with increasing size… if you represent that function in certain ways you need to restrict the parameters to a certain region in order to avoid, say, decreasing functions, or functions that oscillate… this is what priors do.

          So I do think Bayes offers a cure, but it’s not a push button take a pill type cure, it’s more of a “here’s a lathe, milling machine, belt grinder, some fasteners, taps and dies, miscellaneous bearings… build yourself a cure”
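
          As a rough sketch of what building the bias into the generative model could look like for the tree-survey example (everything here, the made-up heights, the lognormal size distribution, the logistic detection curve, and the prior on its midpoint, is my own illustrative choice, not anything from the comment beyond its general idea):

          ```python
          import numpy as np
          from scipy import stats, integrate

          # Made-up observed tree heights (ft); small trees are under-represented.
          heights = np.array([25., 40., 55., 60., 75., 90., 110., 130., 150.])

          # Generative model (illustrative assumptions):
          #   true heights ~ lognormal(mu, sigma)
          #   Pr(tree is found | height h) = logistic((h - m) / s), increasing in h
          mu, sigma = np.log(50.0), 0.6   # fixed here; they could get priors too
          s = 10.0                        # fixed detection "width"

          def loglik(m):
              """Log-likelihood of the detected heights given detection midpoint m."""
              detect = lambda h: stats.logistic.cdf((h - m) / s)
              density = lambda h: stats.lognorm.pdf(h, s=sigma, scale=np.exp(mu))
              # Probability that a random tree is detected at all (normalizing constant).
              z, _ = integrate.quad(lambda h: detect(h) * density(h), 0, np.inf)
              return np.sum(np.log(detect(heights) * density(heights) / z))

          # Prior on the midpoint: detection gets hard somewhere well below 100 ft,
          # encoding "people would have found many, but maybe not all, small trees."
          midpoints = np.linspace(1.0, 60.0, 60)
          log_prior = stats.norm.logpdf(midpoints, loc=20.0, scale=15.0)
          log_post = np.array([loglik(m) for m in midpoints]) + log_prior

          post = np.exp(log_post - log_post.max())
          post /= post.sum()
          print(midpoints[np.argmax(post)])  # most plausible detection midpoint given data
          ```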

        • You can model all kinds of things such as systematically biased sampling also in a frequentist way. Unfortunately it’s hardly ever done. (If a Bayesian does it, I’m fine with that, but you won’t tell me that the majority of Bayesian analyses that we see published does this in appropriate ways either.)

        • I honestly don’t know what the majority of Bayesian publications do, I mostly read about fairly good ones… but it’s rare to see them at all in say Biology or engineering or medicine or economics, fields I care about.

          Often people count maximum likelihood random effects models as Frequentist because they don’t specify an explicit prior, but I don’t think they are… they apply probability to parameters without parameters being an observable quantity that can have repetitions, or a verifiable physical distributional shape. They are just a poor man’s Bayes for people afraid that explicit priors will get their papers rejected.

          of course you can bootstrap things rather a lot, and people do. and you can do lots of permutation testing and placebo treatments and try to discover things that way… it’s not totally hopeless, at least this kind of thing respects the observed distribution of data

          I’m beginning to like Keith’s explanation of Bayes as filtering out which random number generator models could/couldn’t have generated the data using likelihood as the filter against the prior as the proposal. it is a unifying concept that seems intuitive to explain to beginners.

          For me “frequentist” is an interpretation of probability. It’s about what we think a model means. It’s not what the model looks like. Bayesian on the other hand is a way of computing things, and there are different interpretations of this by different people (objective, subjective, falsificationist…). One can give a frequentist interpretation to a Bayesian prior. Sometimes that’s very far-fetched (if the idea is that in fact only one value of the parameter generates all the data), sometimes not so much. Many people interpret random effects in a frequentist manner (even more people don’t care about interpretation so can be neither classified as frequentist nor as any variety of Bayesian), which often makes sense insofar as they model something that in reality is repeated (e.g., when a random effect is assigned to each of n test persons). In my use of terminology it doesn’t matter whether this actually is a distribution over a parameter. I’m not aware that papers are rejected because they use priors by the way. What I see more is that people avoid priors because they have no idea where to get them from, and how sensitive what they do would be to the choice of prior.

        • > It’s about what we think a model means. It’s not what the model looks like.

          Specifically, a Frequentist model should be one in which everything that has probability assigned to it should be a verifiable, observable outcome which can be repeated and varies from one experiment/observation to another.

          From wikipedia which I think agrees with most of everything I’ve read: “The relative frequency of occurrence of an event, observed in a number of repetitions of the experiment, is a measure of the probability of that event. This is the core conception of probability in the frequentist interpretation. ”

          also “Probabilities can be found (in principle) by a repeatable objective process (and are thus ideally devoid of opinion).”

          So “observed in a number of repetitions” is essential, and probabilities are verifiable facts about the world, not about our assumptions. This is why in “good” frequentist practice like what’s taught in “Statistical Analysis and Data Display” (Heiberger and Holland) you are told how to run things like Anderson-Darling tests or Kolmogorov-Smirnov tests or do transformations to make your data more normally distributed before running your ANOVA or whatever. The idea is that whatever calculation you’re doing is an approximation of some “real, true” probability distribution and you should make sure the distributional assumptions map at least approximately onto the verifiable facts of the world. This is also why there are so many nonparametric tests and goodness of fit tests and things in such books because they use mathematical tricks to incorporate these transformations into the tests.

          This is why I say in most cases random effects models are Bayesian models in which people just avoid thinking about the prior explicitly. Somewhere deep in these mixed models is the idea that some parameter (like the mean) that describes a distribution over verifiable outcomes itself has a distribution placed over it and each sub-group has a different realization of this parameter value.

          But the interpretation of the distribution over the parameters can not be anything but Bayesian since the parameters themselves are not observable nor can they be found by “a repeatable objective process” nor is there any way to verify it.

          You *might* argue that with enough data on each subgroup, the sample mean is close enough to the real mean, and then the observed distribution of the sample means across the groups forms the sample distribution of the parameter, and then you could adjust the assumption of the hierarchical distribution over parameters to match the shape of the “observed” distribution over people

          I have never ONCE seen that done.
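
          For what it’s worth, a sketch of what that check might look like (Python/SciPy; the simulated data, the skewed distribution for the true group means, and the Anderson-Darling test are all my own illustrative choices):

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(3)

          # Simulated data: 40 groups, 200 observations each, with group means actually
          # drawn from a skewed (gamma) distribution rather than the assumed normal.
          true_group_means = rng.gamma(shape=2.0, scale=1.0, size=40)
          data = [rng.normal(m, 1.0, size=200) for m in true_group_means]

          # With this much data per group, each sample mean is close to its true mean,
          # so the empirical distribution of sample means stands in for the "observed"
          # distribution of the group-level parameter.
          sample_means = np.array([d.mean() for d in data])

          # Check the usual random-effects assumption (group means ~ normal) against it.
          result = stats.anderson(sample_means, dist='norm')
          print(result.statistic, result.critical_values)
          ```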

          Fair enough, you’ll probably find enough references in which frequentism is described like this. I think this is very problematic. Nobody has ever managed to nail down properly what an objectively “existing in the world” frequentist probability actually is. It’s hard to start with how it can be “measured” without having a conception of what it is. Personally I think it makes more sense to define it as a way of thinking: “We think of a process as if…” Even “repetition” is, to some extent, a subjective construct in the sense that no “repetition” ever is really identical (and one would have a very hard time defining what could be meant by “an approximate repetition”). For me frequentism is basically about thought constructs, and one can just say that these thought constructs are more or less strongly and convincingly associated with how we perceive the “real world”, very weakly in situations in which we only have very few or even just one observation (although it may be conceivable to gather more), much stronger in cases in which many observations can be had, without obvious threats to “randomness” such as dependence, shifts of conditions etc. Random effects are somewhere in between; often you have many observations but of course you don’t observe the parameters directly but rather observations generated from them (according to the thought construct).

        • Wow, well this might explain quite a bit about the confusion that occurs between the two of us.

          I see Frequentism as a very definite ontological theory: the frequency is a thing in the world that can in principle be verified by taking a large sample of observable things, and probability calculations are acceptable to the extent that they approximate the real world frequencies in a reasonably accurate way.

          I have absolutely no problem with “if such and such random process were to occur, it would produce data this extreme or more extreme as often as p = 0.0021”; that’s a perfectly valid mathematical calculation. The problem is usually either:

          1) It isn’t anything validated about the world, so it doesn’t correspond to the Frequentism theory mentioned above, so we need a test of goodness of fit or a proof of an appropriate asymptotic result or something before we can call it Frequentist.

          or

          2) It is a purely hypothetical calculation, in which case it is actually a (usually quite poor) Bayesian model, and I ask “why did you check that particular model using that particular likelihood?”. In particular, when used with p = 0.05 threshold, it kind of corresponds to a strange ABC Bayesian model, in which your summary statistic is p

          You can create a Bayesian model as follows, using ABC methodology:

          1) Define a stochastic RNG sampling model, like normal(m,s) or even some complex “random effects” model like linear regression with iid errors and individual slopes and intercepts per person or per school etc.

          2) Let the prior for the parameters be uniform over [-MaxFloat,MaxFloat] for all parameters (or, if a parameter must be positive, then [0,MaxFloat] or similarly for negative, or uniform between two logical bounding values, maybe [0,1], etc.)

          3) Generate a random parameter vector from the prior

          4) using the parameter vector calculate the p value for the test statistic function applied to the data under that parameter + sampling model (basically generate a data set, calculate the test statistic, repeat, then find out what the quantile of the real data test statistic is in this big database)

          5) keep the parameter vector if p > 0.05, reject the parameter vector if p < 0.05

          6) go to step 3 until a sufficient sample is collected (sketched in code below)

          Is this a Bayesian model or not? I suggest that to the extent that you haven't tested the sampling distribution against the data for goodness of fit, and to the extent that you actually hypothesize this uniform prior which can't possibly be a real sampling distribution of anything, it's actually a Bayesian procedure and an obviously non-optimal one at that.
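
          Here is the six-step procedure above written out as a sketch (Python; the normal(m, s) sampling model, the sample mean as the test statistic, and the bounded uniform priors standing in for [-MaxFloat, MaxFloat] are concrete choices of mine for illustration):

          ```python
          import numpy as np

          rng = np.random.default_rng(4)
          data = np.array([1.2, 0.8, 1.9, 1.4, 0.6, 1.1])  # made-up observations
          n, observed_stat = len(data), data.mean()

          def p_value(m, s, n_sims=2_000):
              """Two-sided simulation p-value of the observed mean under normal(m, s)."""
              sim_stats = rng.normal(m, s, size=(n_sims, n)).mean(axis=1)
              q = (sim_stats <= observed_stat).mean()  # quantile of the real statistic
              return 2 * min(q, 1 - q)

          kept = []
          while len(kept) < 500:
              # Step 3: draw a parameter vector from the (bounded) uniform prior.
              m, s = rng.uniform(-10, 10), rng.uniform(0, 10)
              # Steps 4-5: keep it only if the data are not "rejected" under it.
              if p_value(m, s) > 0.05:
                  kept.append((m, s))

          # The kept vectors are the "posterior" sample this procedure produces.
          print(np.array(kept).mean(axis=0))
          ```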

        • “The question that is addressed is whether the data can *distinguish* what has happened from a model in which the true effect is 0.0000.”

          Perfect! That’s what I understood but was having a hard time expressing clearly.

          “Surely if the answer is “no”, the data can’t be used as evidence for anything else, and that’s of interest and relevant in a lot of cases.”

          “if the answer is “yes”, this doesn’t necessarily mean that something substantially meaningful is going on”

          Badabing and Badaboom.

    • “The problem with statistical significance is not with the mechanics of the test or of the sample”

      Really? You mean if you survey five year olds on how they would vote they’ll predict the outcome of the next presidential election? :)

      Yes it matters that the sample reflects the population, that’s about the most fundamental thing I can think of in all of science and I’m sure you know that too, so I’m not sure what’s up with that statement.

      • You seem to miss the point, which is that p values DO NOT answer questions about validity or who will win the election, they answer questions about whether your data might have been spit out by calling a random number generator.

        when the p value is small you have proof of a thing you most likely knew already, that no one switched your data with the output of rnorm(n,0,1)

        it does not matter what you actually did do for the validity of the answer that the p value gives you. the answer it gives you is “no your data didn’t come from that RNG you just checked” which is valid even if you surveyed babies on how the election will go…

        or maybe it answers “we can’t say that your data didn’t come from that RNG (p=0.28)” which also is valid regardless of where your data did come from.

        The only time p values are really of use is when you are a computer scientist or mathematician trying to design random number generators… in which case you should not get small p values very often… or when you are checking your new data against a database of old data to see if the new data is not like the old data… that’s about it.

        • Daniel:

          You write, “The only time p values are really of use is when you are a computer scientist or mathematician trying to design random number generators.” I don’t quite agree. First, p-values can be useful to practitioners if they don’t take them literally but rather as convenient shorthands for z-scores. I don’t recommend that, but this is how people communicate sometimes, so the numbers can be useful for that purpose. Second, although I agree that rejection doesn’t usually tell us much, non-rejection can be a useful signal that data are too noisy to get a good estimate of some effect or comparison of interest. Again, if that’s the goal, the p-value is kind of an indirect way of getting there, more valuable because of convention than anything else—but, still, the statement that a certain dataset is consistent with being the product of some specified random number generator . . . that can be a useful thing to know sometimes, as in this example.

        • well, you left out my second case, checking your data against a database of other data, which is a great use, basically filtering data to find the “weird” points.

          yes I also agree that sometimes we want to check if some particular random process could have generated our data because we are looking to see if that process is sufficient to explain the data. when we don’t reject we can conclude that a bias or other effect was probably not present. This is a good use but actually pretty rare in practice.

        • Also, rereading my comment I realize I meant to place a lot of emphasis on "really", as in it's essential to the question. You cannot design good RNG algorithms without subjecting them to a huge battery of tests in which you calculate p values and ensure that they are only very small with the appropriate frequency. Per Martin-Löf showed that this is essentially the definition of a random sequence. So in this case they are *really* of interest themselves.

          There are very few other situations where calculating the p value directly answers the real question you have. Your example of checking and not rejecting is another one, and filtering data into two bins, one where the data is consistent with a past database and one where it's very rare in the past database, is another. Almost all the other applications are people trying to indirectly make an argument where going directly to that argument would be more logical.

        • Matter for what? Matter for the accuracy of the mathematical statement "such and such random process wouldn't have produced this data, p = 0.002," or matter for the accuracy of the scientific statement "such and such a thing is true about the world"?

          A representative sample doesn't matter for the mathematical truth, but of course it does matter for the statement about the world.

        • Jim, there is no such thing as a "representative sample"; that's a non-statistical concept that many statisticians seem to like throwing around simply to justify their methods. It's ill-defined. Don't believe me? Try to define what a "representative sample" is without getting into a reductio ad absurdum. Well… a representative sample is a sample in which the relevant characteristics of the sample are similar to those in the population. Who defines which are the "relevant characteristics"? What does "similar" mean? Sure, the proportions of African Americans, and females, and those from the northeast are "similar" to those in the census (or whatever)… but what about the proportion of female African Americans from the northeast (a perfectly well-defined sub-population, clearly as relevant as any of the marginal sub-populations)? Thus, it's an absurd concept.

        • For samples from a finite population at a point in time (like surveys) a representative sample is one in which all measurements are of similar magnitude to what would be expected from a pure random sample drawn using a calibrated RNG.

          There's nothing circular or absurd about that. You don't have to actually use a calibrated RNG to collect your sample, but if it isn't "representative" of what you might have gotten from an RNG, it isn't a representative sample. (A toy version of this check is sketched below.)
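
          Here is a toy R version of that check, for the unrealistic case where you actually know the whole frame (the population and the bias mechanism are invented for illustration):

            set.seed(1)
            N   <- 10000
            pop <- rnorm(N, 45, 15)                      # the whole (hypothetical) frame: ages
            # a convenience sample that over-recruits older people (invented bias)
            obs <- pop[sample(N, 200, prob = plogis((pop - 45) / 10))]
            # reference: sample means under a calibrated RNG (simple random sampling)
            ref <- replicate(5000, mean(pop[sample(N, 200)]))
            mean(obs)                                    # noticeably higher than mean(pop)
            quantile(ref, c(0.025, 0.975))               # range a "representative" mean should fall in
            mean(abs(ref - mean(pop)) >= abs(mean(obs) - mean(pop)))   # essentially 0 here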

        • Mark:

          I agree that the term “representative sample” is vague and depends on context, but I disagree that there’s no such thing.

          To put it another way, every real-world sample is nonrepresentative. It is important in data collection to control the nonrepresentativeness, and it is important in analysis to recognize and adjust for the nonrepresentativeness.

        • “I disagree that there’s no such thing” [as a ‘representative sample’], but “every real-world sample is nonrepresentative.” And, thus, Bertrand Russell was the Pope (he may actually have been, I don’t rightly know).

        • “Try to define what a “representative sample” is without getting into a reductio ad absurdum.”

          ha ha, yeah, you’re right, but you still know what one is right? :)

    • I think the point about the null hypothesis being (almost) never exactly true is valid but not that relevant, because researchers use significance testing to make inferences about the *direction* of effects – the null is simply the boundary. And one plausible reason why it’s so popular is that in many (lab) settings, many researchers don’t care much about effect sizes.

      • Tas:

        It’s fine to focus on the direction of the effects. Bayesian inference will give you that, if that’s what you want. Also, if you care about the direction of the effects, it’s no problem that, with large enough N, you can get a statistically significant p-value. That’s mathematically correct: in the absence of bias, with large enough N you will be able to determine the direction of an effect to any desired level of confidence.

    • Phil: That’s true only for the most artificial kind of nil null test. The P-bashers invariably use it, along with an erroneous definition of P-value, to regurgitate their straw arguments. The ASA Statement on P-values irresponsibly follows suit, but the 2019 follow-up is much worse. I replied to the ASA president’s call for recommendations: https://errorstatistics.com/2019/11/30/p-value-statements-and-their-unintended-consequences-the-june-2019-asa-presidents-corner/

      • Whether you do a point null or a composite null or a goodness-of-fit test or whatever, you are always rejecting some consequences of an assumed sampling distribution. When you do reject it, you find out only that your assumption was wrong; when you don't reject it, you find out that your assumption could have been an adequate one to explain your test statistic. Usually the second situation is more helpful: there are an infinity of wrong assumptions, but only relatively few adequate ones. Hence we usually care about the interior of confidence intervals more than the exterior.

        Still, the nil null is programmed into every push-button piece of software and is spit out by all sorts of analyses and plopped into papers willy-nilly. It's not like it's a straw man; it is in fact the most common form of NHST result out there.

        • "the nil null is programmed into every push-button piece of software and is spit out by all sorts of analyses and plopped into papers willy-nilly. It's not like it's a straw man"

          Sure, but isn’t the question whether the baby should go out with the bathwater, that is, whether misuse of p values means we should end significance testing?

          What if you make your null as robust/severe/extreme as you can? It seems that if you exhaustively list every potential source of bias in the data and fully characterize them into estimations of error, and then sum those into a global estimate of error, you could build a null from that, right?

          Suppose I am interested in himmicanes and, after working through all the possible sources of bias that I can dream up or find in the literature, I discover that 97% of the potential error is caused by two factors: the magnitude of the hurricane and the percent who flee. After estimating those biases and establishing error bars, my model shows that any value between [people spend 5% more on female-named hurricanes] and [people spend 18% more on male-named hurricanes] is indistinguishable from "no effect." Now I back-calculate from the worst-case error, the 18%, that to get a p of 0.05 I need a result greater than [people spend 68% more on male-named hurricanes] to show a statistically significant result based purely upon the gender of the hurricane name. I don't really care how big the effect is! At this point, I finally interrogate the raw data and discover that it shows [people spend 74% more on male-named hurricanes]. Can I declare victory? I am curious what problem this approach would fail to prevent, and if so, what would work better in a simple example like this.

        • > What if you make your null as robust/severe/extreme as you can? It seems that if you exhaustively list every potential source of bias in the data and fully characterize them into estimations of error, and then sum those into a global estimate of error, you could build a null from that, right?

          Now you’re doing Bayesian analysis. Specifically, you’re using your scientific theory to build a generative model of the data collection process. That’s all the fanatics like Anoneuoid and I ask for ;-) Just actually think about your science and build a model of what happens, and then check that against data! In doing so, incorporate as many of the known or hypothesized effects and biases as you can…

          As soon as you bother to do that, you'll have questions other than "does this or does this not equal 0." No one really cares about whether parameters equal zero in real models, because as soon as you build the model in your head the parameters take on *meaning*, whereas when you're push-buttoning the "pocket calculator" version of statistics the parameters have no meaning; you just want to know whether they are or aren't greater than 0 or whatever.

          But don't you think it matters whether, say, your measurement of how much effort people put into avoiding hurricanes is dominated by the variation caused by the male/female name of the hurricane, or by the extremely poor data you have on where the reported locations of landfall were and how that was communicated, so that people knew whether they were or were not in danger?

          I mean, that’s actual scientific knowledge there: communication about hurricane risk affects people’s perception of danger and choice of response… and perhaps the hurricane’s name affects the communications, or perhaps it affects the response, maybe both…
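
          Here is a cartoon of what I mean by a generative model, in R; every variable and number is invented, and a Bayesian fit (e.g. rstanarm::stan_glm in place of lm) would give the posterior version:

            set.seed(3)
            n <- 500
            magnitude <- rexp(n, 1)                          # storm severity (invented scale)
            comms     <- rnorm(n, 0, 1)                      # quality of risk communication (invented)
            female    <- rbinom(n, 1, 0.5)                   # 1 = female-named storm
            # generative story: spending driven mostly by severity and communication,
            # plus a small hypothetical name effect
            spend <- 10 + 5 * magnitude + 2 * comms + 0.3 * female + rnorm(n, 0, 4)
            summary(lm(spend ~ magnitude + comms + female))  # estimates with uncertainty, not a yes/no
            summary(lm(spend ~ female))                      # name-only version: same question, drowned in noise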

  7. > When performing a “pilot” study on say 5 animals, and finding an “almost significant” result, or a “trend”, why is it incorrect to add another 5 animals to that sample and to look at the P value then?

    For the same reason that you cannot do the study with 6 animals, discard the one you don't like, and look at the p-value then, or test 42 different hypotheses and calculate 42 p-values. Because you have set the threshold of statistical significance at 0.05 so as to have a 5% chance of rejecting the null hypothesis when it's true, and that calculation is no longer valid if you do funny things.

    You could design the experiment to allow for intermediate significance tests or multiple comparisons. But this has to be predefined and you cannot make it up as you go along.

    Say, for example, that you're going to study 5 animals, calculate a p-value, and reject the null hypothesis if p<A. Otherwise you continue the experiment with 5 additional animals, but only if A<p<0.1. You then calculate a new p-value on the full sample and reject the null hypothesis if p<B.

    There are many possible combinations of A and B that will give you a 5% probability of rejecting the null hypothesis when it's actually true. But in all of them A<0.05; otherwise you're already making all the errors that you're allowed in the first step, and you have no margin left for any rejection, which could be wrong, in the second step. (A quick simulation of this design is sketched below.)
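
    A rough simulation in R of that two-stage design under the null (one-sample setup and thresholds invented for illustration; A and B would be tuned with exactly this kind of simulation):

      set.seed(1)
      two_stage_reject <- function(A, B) {
        x  <- rnorm(5)                              # stage 1: 5 animals, true effect = 0
        p1 <- t.test(x, mu = 0)$p.value
        if (p1 < A) return(TRUE)                    # reject at stage 1
        if (p1 >= 0.1) return(FALSE)                # no "trend": stop without rejecting
        x <- c(x, rnorm(5))                         # "trend": add 5 more animals
        t.test(x, mu = 0)$p.value < B               # reject at stage 2?
      }
      mean(replicate(20000, two_stage_reject(0.05, 0.05)))  # naive A = B = 0.05: above 5%
      mean(replicate(20000, two_stage_reject(0.03, 0.03)))  # tightened thresholds: closer to 5%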

    • Yes! Thank you!

      Collecting data until your black box spits out the "right" answer is poor practice. Why not just pick the one sample that creates statistical significance, use that first, and save yourself the trouble of doing all the other ones? :))))

      Andrew, it's strange that you approve of this practice, since it's likely to lead to your favorite P value, P = 0.049999999999, not to mention tempting people to be selective in which values they add to the model.

  8. You have answered a question about the finer points of hypothesis testing by saying "don't do hypothesis testing." Whilst a valid opinion, it may not help the questioner, because he may be stuck in a culture in which hypothesis testing is the norm.

    Pharmaceutical statisticians argue that you should not carry out interim analyses before all the data are available, because doing so affects the p-value of the final analysis when all of the data are available. I'm not familiar with the theory, but it exists somewhere in the pharma literature.

    • Peter:

      As you say, statisticians argue that you should not carry out interim analyses before all the data are available, because doing so affects the p-value of the final analysis when all of the data are available. I think this is a mistaken attitude. Yes, it's true that making interim decisions affects the p-value: this is a well-known point; indeed, the p-value is also affected by the mere possibility of making interim decisions, even if no interim decision happens to be made for the data at hand, as discussed on the first page of our forking-paths paper.

      My point in the above post is that interim decisions are ok. The problem is not with the interim decisions, it’s in using statistical significance to make decisions about what to conclude from your data.

      To put it another way: Yes, if for some reason you are required to produce a p-value, then that p-value should reflect all possible interim decisions you might have made or might make in the future. (I'm now reminded of the continuing discussion thread here.) But, even if for some reason you have to construct this p-value and report statistical significance, I still don't think it's a good idea to use statistical significance to make decisions about what to conclude from your data. P-values (and, for that matter, Bayes factors) just don't line up with decision problems.

      If you are “stuck in a culture in which hypothesis testing is the norm,” I’d say: calculate the numbers that they tell you to calculate, but then move on to inferential summaries and decision making with a clear head, forgetting about statistical significance and all that.

      • The idea that a p-value is influenced by interim data peeking is based on the implicit, false assumption that a p-value is an unconditional error rate. Where the p-value is used as an index of evidence against the null hypothesis (as a substitute for a z-score, as Andrew mentioned above), interim analyses and potential decisions are totally irrelevant.

        The pervasive confusion on this is probably a consequence of the failure of statistics to deal with the necessary distinction between evidential strength and the risk of erroneous decision. P-values are (or should be) used as indices of the evidence in the data against the null hypothesis (according to the statistical model). That is evidence 'local' to the experiment: evidence in _this_ data against _this_ null hypothesis. Error rates (i.e. Neyman-Pearson alpha) are not local, because they relate to the long-run rate of false positive decisions coming from long-run use of the method: they are 'global'.

        It is true that the global error rates are affected by the presence or possibility of interim analyses, but the local evidence is not. In other words, interim analyses cannot affect the p-value (when correctly understood). After all, the interim analyses do not influence the data or the null hypothesis. I have written about this extensively (see https://arxiv.org/abs/1910.02042). (I apologise for posting this link in so many comments here, but its content is so directly relevant that anyone interested in the topic really should read it.)

        • Yes, I agree with this. If you have a big set of data you can ask a question like “could this have been the output of my_rng(a,b)” and get a p value that tells you whether your particular test was sensitive to and detected differences between your dataset and the kind of thing my_rng(a,b) would output.

          The answer to this question is unaffected by how many times you looked at the data, etc.

          On the other hand, the procedure "use the p value to decide on a substantive question," when repeated over and over while sampling continuously from the process, does not have an "error frequency" equal to the final p value that you calculate when you get tired of doing this procedure. (A quick simulation below makes this concrete.)
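
          A quick R sketch of that last point, simulating from the null with the naive "peek after every new observation" rule (all numbers invented):

            set.seed(2)
            peek_until_significant <- function(n_max = 100, alpha = 0.05) {
              x <- rnorm(5)                                  # start with 5 null observations
              for (n in 6:n_max) {
                x <- c(x, rnorm(1))                          # add one more, then peek
                if (t.test(x, mu = 0)$p.value < alpha) return(TRUE)
              }
              FALSE
            }
            mean(replicate(2000, peek_until_significant()))  # well above 0.05, even though H0 is true
            # Each individual p value is still a correct statement about the rnorm(n,0,1) check;
            # it's the decision rule built on top of it that doesn't have a 5% error frequency.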

  9. > Where the p-value is used as an index of evidence against the null hypothesis, interim analyses and potential decisions are totally irrelevant.

    Don't you need a sampling distribution to calculate a p-value? The design of the experiment, including the "potential decisions," is not irrelevant to calculating a proper p-value.

    Of course if you don’t care about the pertinence of the sampling distribution you may still calculate something, calling it a p-value and saying it’s an index of the evidence in the data. You could also take the estimated effect size and say it’s an index of the evidence in the data.

    • There is no requirement that a p value be calculated with "the real sampling distribution" (which doesn't exist in most cases anyway). You can calculate a p value asking whether your vote-share data came from a Cauchy distribution with median 1 trillion. You will find a tiny p value.

      Why you would ask this question is a different story. But the p value is a valid measure of how weird your data would be under this particular RNG.
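
      For instance, with some invented "vote share" numbers:

        votes <- c(0.48, 0.51, 0.47, 0.53, 0.50, 0.49)     # made-up vote shares
        # "Could these have come from a Cauchy distribution with median 1e12?"
        ks.test(votes, "pcauchy", location = 1e12, scale = 1)$p.value
        # Essentially zero: the data are nothing like draws from that generator.
        # Whether that was a question worth asking is, as noted, a different story.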

      • Yes, you can calculate anything and call it “p-value”. Just like you can calculate anything and call it “electability”. :-)

        As you know, the idea behind p-values is not that you use "the real sampling distribution" for the calculation; it's that you use the "sampling distribution that your model predicts for the statistic of interest conditional on the null hypothesis being true."

        • Right, so the fact that you sampled for a while, calculated a p value, sampled some more, calculated a p value, sampled some more…etc etc… doesn’t mean you can’t test the model “all of it was IID samples from a uniform(-10,338) RNG” and get a p value…

          Of course, you might ask yourself "why am I testing that RNG?", especially when you know that what you really did was something else entirely.

          For some reason people don’t seem to ask that same question when they run some canned stats program that outputs a table of p values though.

    • Carlos, you say that one might use the estimated effect size as an index of evidence (I’m assuming that you meant it, but do see that you might have been suggesting that it would be silly), but it would not be an index of evidence. It might _be_ the evidence within a simple model with no nuisance parameters, but as an index of evidence against the null hypothesis it would have the fatal flaw of not being affected by the choice of null hypothesis!

      P-values are (sometimes) useful indices of the evidence in data because they are affected by the effect size and the variability and the null hypothesis. And the statistical model.

      The idea that a p-value has to be fully conditioned on the experimenter's intentions is a silly idea that might have come into prominence during the Bayes–Frequentist wars. It is not a useful idea, because it mixes p-values and error rates and therefore forces a conflation of the evidence with the potential consequences of decision.

      Likelihood functions are often better indices of evidence than p-values because they show the evidential support for all possible values of the parameter of interest and because they have a far more straightforward interpretation and because they are not misconstrued as being error rates. Nonetheless, p-values (actual, not thresholded) are indices of the strength of evidence in the data against the null hypothesis, according to the statistical model.

      • I mean it. When p-values lose their sampling distribution connection with the experiment they are not that different from using the effect size (mu-mu0) or z-score or anything else as “index”. At least for single-parameter problems where such one-dimensional summaries are readily available.

        I think p-values came into prominence during the Bayes-Frequentist wars because low p-values were an indication that “either an exceptionally rare chance has occurred or the theory is not true”. They are not as interesting when you add “or the experiment has been extended until a low p-value was obtained”.

        N.B. I’ve not read your latest paper yet, I could still be convinced :-)

      • Michael,
        I actually came across your article last week and was very impressed.
        I do believe, however, that for a number of reasons your argument only applies to one-sided p values against a range null, not a two-sided p value against a point null.

        (Also, if we're discussing p values, where is Anoneuoid?)

        • Thanks Nick. I would be very pleased to receive your reasoning about the sidedness of the p-values. My email is Michael with an extra L at the end, then @unimelb.edu.au

      • Michael,

        > interim analyses cannot affect the p-value (when correctly understood). After all, the interim analyses do not influence the data or the null hypothesis.

        The p-value also depends on the model, as you discuss extensively at that link. The interim analyses do influence the model. The model should take them into account.

        “What is a statistical model?”

        “A statistical model is what allows the formation of calibrated statistical inferences and non-trivial probabilistic statements in response to data. The model does that by assigning probabilities to potential arrangements of data.”

        You acknowledge that you won’t get calibrated statistical inferences if you ignore the interim analyses: “That sub-optimality should be accounted for in the inferences that made [sic] from the evidence”.

        “Consider the meaning conveyed by an observed P-value of 0.002. It indicates that the data are strange or unusual compared to the expectations of the statistical model when the parameter of interest is set to the value specified by the null hypothesis. The statistical model expects a P-value of, say, 0.002 to occur only 2 times out of a thousand on average when the null is true. If such a P-value is observed then one of these situations has arisen:

        • a two in a thousand accident of random sampling has occurred;

        • the null hypothesised parameter value is not close to the true value;

        • the statistical model is flawed or inapplicable because one or more of the assumptions underlying its application are erroneous.”

        In the latter case, I don’t think it makes much sense to use a model known to be inapplicable to calculate a p-value that no longer has a clear meaning.

        • Carlos, I half agree and half disagree:

          The actual goal of Frequentist inference *in my opinion* is to find out things about the real world, and to do so by doing calculations on problems that mimic reality as an approximation. The idea behind saying "probability is the frequency in large repetitions" is that then… you can make an (at least approximate) 1-1 correspondence between an RNG computational algorithm and what you think will happen in collecting real-world data.

          Unfortunately, this is where Frequentism usually goes wrong: it's used in huge numbers of contexts where validating the choice of distribution against a dataset is impossible. So it relies on inference on things like averages, etc., where CLT-type results are available, so you can choose a distribution independent of the observed shape of the individual data points… but the problem with that is that you get *one* data point from the distribution of the sample average, for example.

          However, there's nothing *mathematically* wrong with calculating a p value for a model that doesn't mimic reality; it just doesn't tell you what the usual Frequentist statistician wants to know: the probability of making a *real-world* error.

          In some sense this possibility is normally not acknowledged in the typical NHST rubric, rather, it’s assumed that the statistical model is adequately describing reality *first*, and all that is needed is a parameter value. Then people reject the null parameter value, and normally immediately falsely conclude that the parameter value is close to the maximum likelihood value or some other estimator value, while never even considering the idea that the model is totally inadequate in the first place.

          They ignore the 3rd option you mention above “the statistical model is flawed or inapplicable” and this is *almost always* the correct option in typical application of NHST because people who do adequate models have a tendency to be using Bayes to fit them.

          In fact, Bayes is basically a search through model space to see which models are adequate, using the likelihood to filter out those that are inadequate.

          This is why Anoneuoid always goes on about *test your actual research hypothesis*. The “null hypothesis” is usually just a thing in a book. You don’t care about it, it doesn’t match reality in any way… it’s just a stupid straw man. The guy who wrote the book didn’t have the slightest clue what science you were doing.

          But none of that invalidates the *math* of the p value. The p value tells you "hey, this stupid thing you chose to check is not reality," and *that's all*. It doesn't tell you "hey, this other thing near the MLE is reality" or any of the stuff people want it to tell them.

          Anyone who formulates a sufficiently complex model to describe the reality of their experiment, hypothesizes some parameters, and then tests their model against data using NHST and a battery of tests… well they get a silver star… they get a gold star if instead of hypothesizing some point parameters, they acknowledge the possibility for those parameters to have some wiggle room and explore that wiggle room… But then they’re doing Bayes.

          Bayes isn’t much different from NHST using likelihood ratio tests + hypothesized wiggle room for parameters. It’s just that given the generality of the applicability, you also get to spend your time specifying a model that matches your understanding of the experiment/observations relatively closely.

          It's not really about running Stan on dumb models + some priors and everything gets better; it's about getting a tool where you can use "not dumb" models, and suddenly you're in a different realm where golf putting depends on things like the precision with which people can estimate angles, the length over which the putt has to travel, the rate at which energy is taken out of the ball by friction, and so forth.

          “just do bayes” won’t cut it… but “just learn Bayes so you can in theory fit any kind of model, and then learn about how to build good models!” that can cut it.

        • Carlos, it is easy to make such arguments, but what exactly is your different model? The frequentist approaches to ‘dealing with’ interim analyses by taking them into account in the statistical model seem to me to discard the evidential nature of the observed p-value in favour of preservation of long run type I error rates. If you want to do that then use the Neyman-Pearson hypothesis testing framework, but take note of the serious shortcomings of that approach for scientific inference formation.

        • Do we agree that by “observed p-value” you mean “the p-value calculated using a flawed/inapplicable model” according to your own discussion of model adequacy?

          How can the use of a less flawed / more applicable model, if available, be a bad thing? If no better model is available, I understand that you may want to use whatever you have, but saying that p-values calculated according to a flawed model are fine when "correctly understood" is a stretch.

          I don’t think that distinguishing p-values calculated according to an applicable model and p-values calculated according to an inapplicable model is a silly thing to do.

        • No, you are not getting the point. The model that can extract the _evidential meaning_ of the data via a p-value is not the same as the model that you want to use. That’s why it is necessary to have the evidential considerations explicitly separate from the considerations of how to respond to that evidence.

        • At least we agree on something: I don’t get it.

          To extract the _evidential meaning_ of the data we use a model which ignores a number of things about the experiment, so the difference between that model and the experimental details becomes irrelevant for the calculation of p-values. However, I still have to take those details into account for inference, because that _evidential meaning_ may not be so meaningful by itself.

          The model could be applicable, the sampling distribution used to calculate the p-values could be correct, the p-value could have a clear inferential meaning. If I understand your position, a correct understanding of p-values means forgetting about that and assuming that different models are to be used for _evidential considerations_ and for inference from that evidence.

        • Again, what exactly is your model that takes the optional stopping into account? Does it distinguish the subset of possible outcomes that stop early from those that go on to the large sample size? Assuming that it does, does it then treat optional-stopping results at, say, n=20 differently from the equivalent data that notionally came from a fixed n=20 protocol?

          It is easy to say that the models should take all of the sampling rules into account, but not so easy to condition on the sampling rule and the actual (i.e. observed) sample size at the same time. I think that you are suggesting that the situation calls for a model that cannot simultaneously have all of the properties that you would need.

          An equivalent issue arises when the results from multiple tests are ‘corrected for multiplicity’. The ‘corrected’ tests involve a different null hypothesis from the null hypotheses of the individual significance tests. If you agreed with the section in my paper regarding the XKCD jelly beans example then you should be able to understand the optional stopping problem.

        • When I say that "The model could be applicable" I mean that it may be the case that you have the option of considering a more applicable model, not that I'm going to provide you with one. If the model is "good," the p-value means something. From that perspective, the "worse" the model gets, the less meaningful the p-value is.

          You say that the "evidential meaning" of the data doesn't depend on the experimental protocol, only on the data and on a model that doesn't have to change when the design of the experiment changes. That would make more sense, I think, if you were embracing the likelihood principle and saying that the inference has to be based on the likelihood function only. But given that you say that inference has to take those details into account, I don't see what you gain by claiming that p-values give the "evidential meaning" of the data/model when no inference is possible without additional evidence/data/assumptions. You split the evidence into local/global and you still need to report and take into consideration every piece.

          Is that “local” evidence useful by itself? What would you say is “the meaning conveyed by an observed P-value of 0.002”? As far as I can see, you don’t give an answer beyond saying that it could have the “usual” interpretation of unlikely-things-happening if the model is correct but could also be low because the model is not applicable. But once the model is not applicable, does it mean anything concrete? If there is no way to tell if a value is low or high, what’s the utility?

          > If you agreed with the section in my paper regarding the XKCD jelly beans example then you should be able to understand the optional stopping problem.

          “For example, in the case of the cartoon, the evidence in the data favour the idea that green jelly beans are linked with acne (and if we had an exact P-value then we could specify the strength of favouring) but because the data were obtained by a method with a substantial false positive error rate we should be somewhat reluctant to take that evidence at face value.”

          I agree that we shouldn't take "that evidence" at face value, because the "strength of evidence" is weakened by the multiple comparisons even though they are irrelevant for the calculation of the p-value according to the model that ignores them.

          Say the original study had used 20 different colors of jelly beans but looked only at the aggregate (non) effect.

          They published the data, and two researchers, Alice and Bob, looked at the subgroups. The standardized effects are:

          -2.052 -1.331 -1.266 -1.203 -1.062 -0.970 -0.928 -0.599 -0.594 -0.431 0.089 0.302 0.535 0.576 0.628 0.685 0.946 1.081 1.354 1.553

          Alice calculates p-values using independent models, with the null hypothesis being no effect for one color, because it’s obvious that the existence of people taking yellow jelly beans doesn’t change the data for those who took blue jelly beans.

          She finds evidence for a negative effect (say that the subjects have acne and it can get better or worse) of green jelly beans: p=0.0402. The evidence for other effects is weaker (p>0.1).

          Bob considers a model where the null hypothesis is that the effect is zero for every color. This means that if any of the independent null hypotheses considered by his colleague is false, then this one is also false.

          He calculates a p-value using as the statistic the largest observed effect in absolute value: p=0.45. The evidence against the null hypothesis that the effect is zero for every color is extremely weak, if any.

          What is the evidential interpretation of the p=0.0402 obtained by Alice? What's the evidential interpretation of the p=0.45 obtained by Bob?

          Does the data collected in the original study favour the idea that green jelly beans are linked with acne?

          What is the “strength of favouring” given that the exact p-value is 0.0402?

          Can Alice say that she has "strong" evidence against Bob's null hypothesis? (If green jelly beans have an effect, Bob's null hypothesis is also false.)

          (I agree that statistics is difficult but I don’t think that all p-values are equal: I find that Bob’s are better than Alice’s.)

        • > What’s the evidential interpretation of the p=0.45 obtained by Bob?

          Correction: it's actually p=0.57. Not very different, but maybe crossing the 0.5 equator is relevant for the evidential interpretation.

          (In my previous comment I had rounded it from 0.43 to 0.45, but I didn’t realize I was looking at it from the wrong side.)
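
          For anyone who wants to check the arithmetic, here is a small R sketch, treating the standardized effects as z-scores (the last digit need not match, depending on the exact reference distribution used):

            z <- c(-2.052, -1.331, -1.266, -1.203, -1.062, -0.970, -0.928, -0.599, -0.594, -0.431,
                    0.089,  0.302,  0.535,  0.576,  0.628,  0.685,  0.946,  1.081,  1.354,  1.553)
            # Alice: one two-sided p value per color; green is the most extreme
            p_alice <- 2 * pnorm(-abs(z))
            min(p_alice)                        # about 0.040 for green
            # Bob: global null "zero effect for every color", statistic = largest |z| over the 20 colors
            1 - (1 - min(p_alice))^20           # about 0.56, same ballpark as the 0.57 above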

        • Yes, Carlos, I do tend towards the likelihood principle, but I don’t believe that it should be interpreted as saying that _only_ the likelihood function should be taken into account when making inferences. The interpretation of the principle as saying that _only_ likelihoods should be taken into account when making inferences is silly, and it is hard to understand why it has remained the standard understanding of the principle.

          This is how I define the likelihood principle (from https://arxiv.org/abs/1507.08394):
          Two likelihood functions which are proportional to each other over the whole of their parameter spaces have the same evidential content. Thus, within the framework of a statistical model, all of the information provided by observed data concerning the relative merits of the possible values of a parameter of interest is contained in the likelihood function of that parameter based on the observed data. The degree to which the data support one parameter value relative to another on the same likelihood function is given by the ratio of the likelihoods of those parameter values.

          Notice how recognition of the role of the statistical model and the restricted scope of the likelihood function (“within the framework of a statistical model”) precludes the false notion that _only_ the likelihood function should be taken into account when making inferences. Statistical models cannot capture all of the information relevant to an inference, except in trivial model cases, and a likelihood function cannot be any better or more relevant than the statistical model from which it comes.

        • Bob and Alice have tested different null hypotheses, and so the different p-values are of no consequence. Alice focussed on the evidence regarding the effects of each colour of jelly bean, whereas Bob used a method that minimises the possibility of falsely discarding a true null hypothesis. Their results do not conflict.

          Alice cannot say that she has evidence against Bob's hypothesis because she did not test that hypothesis. She might say, however, that Bob's hypothesis is not very interesting. And she might point out that if Bob's hypothesis were of interest then Bob's experiment would have needed far larger samples than hers, because his approach to multiple tests costs a lot of power.

          The relevant inferential question is how Alice should proceed. She should recognise that the probability that she would find evidence favouring an effect of at least one of the colours was quite high, and so she should be cautious about making a firm conclusion on the basis of just those experimental results. However, she did find that there is reason to suppose that the green jelly beans had an effect, and so she should design an experiment to test just that hypothesis. (Or publish the lot as a preliminary study.)

          What should Bob do? He should read my papers and see how badly served he is by Neyman-Pearsonian statistics!

        • Alice finds that:

          A1] the data provides evidence of the effect on acne of green jelly beans

          A2] the data provides evidence of the effect on acne of jelly beans of a particular color, namely green

          A3] the data provides evidence of the effect on acne of jelly beans of a particular color

          Bob finds that the data doesn't provide evidence of the effect on acne of jelly beans of any particular color.

          Mind you, I do understand that they are doing different tests. But if the goal is to make the (never defined) “evidential meaning” of p-values understandable to non-statisticians the whole thing remains a bit confusing. It seems to me that you need some “non-standard” concept of “strength of evidence”, because in another context such a contradiction wouldn’t be dismissed so easily.

          Imagine that looking at the same body of evidence, Detective Anderson claims that there is evidence that Mr. X murdered Epstein while Detective Brown claims that there is no evidence that Epstein was murdered by anyone. We would say that there is a conflict, or at least that they don’t use the concept of “evidence” consistently.

        • > she did find that there is reason to suppose that the green jelly beans had an effect and so she should design an experiment to test just that hypothesis.

          Is that really the answer that you want to give to the “how should she proceed” question in this (literally) cartoonish but unfortunately all-too-real example?

          Say that she does a larger “green jelly beans” study. Sadly she cannot find evidence for the effect of green jelly beans anymore…

          But looking at the data in detail, it favours the idea that green jelly beans are linked with acne in female smokers with no children.

          She is cautious about making a firm conclusion, of course, but there is reason to suppose that there is an effect.

          She should design an experiment to test just that hypothesis.

          Etc.

        • Yes, and there is nothing wrong with that. Decisions about what to study have to be decisions by thoughtful minds, not mindless statistics. If Alice is thoughtless enough to assume that my advice is a recipe to be followed ad infinitum then that is her stupidity, not a flaw in the advice.

        • Michael, I agree with you! The problem with Frequentist statistics isn't with the p values. They mean what they mean; it's a true statement that "you'd rarely get as large an effect of green jelly beans on acne as you saw if you sampled from such and such a random number generator"…

          The problem is in the interpretation. The correct logical statement from that is:

          “So, either you aren’t sampling from a random number generator, or you are and it doesn’t have the properties you used in your test.”

          Since we know you aren’t sampling from an RNG to begin with… well…
