What’s wrong with null hypothesis significance testing

Following up on yesterday’s post, “What’s wrong with Bayes”:

My problem is not just with the method—although I do have problems with the method—but also with the ideology.

My problem with the method

You’ve heard this a few zillion times before, and not just from me. Null hypothesis significance testing collapses the wavefunction too soon, leading to noisy decisions—bad decisions. My problem is not with “false positives” or “false negatives”—in my world, there are no true zeroes—but rather that a layer of noise is being added to whatever we might be able to learn from data and models.

Don’t get me wrong. There are times when null hypothesis significance testing can make sense. And, speaking more generally, if a tool is available, people can use it as well as they can. Null hypothesis significance testing is the standard approach in much of science, and, as such, it’s been very useful. But I also think it’s useful to understand the problems with the approach.

My problem with the ideology

My problem with null hypothesis significance testing is not just that some statisticians recommend it, but that they think of it as necessary or fundamental.

Again, the analogy to Bayes might be helpful.

Bayesian statisticians will not only recommend and use Bayesian inference, but also will try their best, when seeing any non-Bayesian method, to interpret it Bayesianly. This can be helpful in revealing statistical models that can be said to be implicitly underlying certain statistical procedures—but ultimately a non-Bayesian method has to be evaluated on its own terms. The fact that a given estimate can be interpreted as, say, a posterior mode under a given probability model should not be taken to imply that that model needs to be true, or even close to being true, for the method to work.

Similarly, any statistical method, even one that was not developed under a null hypothesis significance testing framework, can be evaluated in terms of type 1 and type 2 errors, coverage of interval estimates, etc. These evaluations can be helpful in understanding the method under certain theoretical, if unrealistic, conditions; see for example here.

The mistake is seeing such theoretical evaluations as fundamental. It can be hard for people to shake off this habit. But, remember: type 1 and type 2 errors are theoretical constructs based on false models. Keep your eye on the ball and remember your larger goals. When it comes to statistical methods, the house is stronger than the foundations.

43 thoughts on “What’s wrong with null hypothesis significance testing”

  1. “Null hypothesis significance testing …[leads] to noisy decisions—bad decisions. ”

    Which is why it should be used with all the attendant caveats in the forefront of the discussion, as you did in the linked article (my asterisks):

    “Thus, our graphical and numerical summaries both conclude that the intermediate vote tallies **are consistent with** random voting.”

    “This **does not rule out the possibility of fraud**, of course – it merely shows that this aspect of the voting is **consistent with** the null hypothesis.”

    I like that, very well done. I also like the discussion of the graphs in the article. It would be nice if that were more common.

    Seems like the value, or lack thereof, in NHST depends on a) how tightly the experiment or problem is set up; and b) how much one makes of the result. In the article you referenced, the problem is pretty tight, but even then you don’t overdo the result, consistently using and emphasizing the phrase “are consistent with”. After all, it wouldn’t be hard to fake an election tally consistent with random voting. Your analysis would not have detected that.

    *MY* problem with NHST isn’t with NHST at all, it’s with the way it’s used: any old data can be plugged into it without the least bit of thought and presto, with a few students, an Excel spreadsheet, and ten minutes’ work, your favorite hypothesis is confirmed as a fundamental law of the universe! Now off we go to Congress to regulate the Big Screwdriver industry for illegally marketing to vulnerable toolophiles, and file lawsuits because the INDUSTRY KNEW TOOLOPHILES WERE A VULNERABLE POPULATION AND SOUGHT TO EXPLOIT THEM!!!!

    My other problem with NHST is that there’s a general idea that its results are final and unquestionable. In other forms of analysis in science, results are double, triple and quadruple checked and retested. If your election study was about a presidential election, it would be repeated by several different investigators for confirmation. But many users of NHST consider it uncouth to try to replicate!

  2. When does NHST work?

    First we should define “work”. I take work to mean that it helps us learn something real about the world.

    So one way it “works” is when the world is like the model we are testing… and the data shows no significant differences from the model even though we have tested *multiple* aspects of the distribution.

    For example, suppose we observe some electrical noise in a 16 bit digital sampling circuit. We have 1 million observations of the voltage while the circuit is not connected up to anything. We calculate a histogram, and it looks rather normally distributed around an average value of 2^15, exactly half the maximum sample, just as it was designed to be, and the standard deviation is about 10, also as designed.

    So we do an Anderson-Darling test of normality p = .44, and a Kolmogorov-Smirnov test of normality p = .71. Now we split the data into 4 chunks and do individual tests of normality on each… they pass. Now we calculate the in-sample lagged correlations and test against simulated IID normal data, again the sample lagged correlations are not unusual for IID noise…. we continue to test the hypothesis that these are samples from an IID normal distribution with mean 2^15 and sd = 10 using various other tests, of the type available in the dieharder suite (Thanks George Marsaglia!)… of 100 tests run, only 6 give p-values less than 0.05, and only 1 gives a p-value less than 0.01, which is basically within the expected number.
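
    As a rough sketch of what that battery of checks could look like in R (assuming the nortest package for the Anderson-Darling test; the voltages below are simulated stand-ins rather than real measurements):

    library(nortest)

    v <- rnorm(1e6, mean = 2^15, sd = 10)                # stand-in for the recorded voltages

    ad.test(v)$p.value                                   # Anderson-Darling test of normality
    ks.test(v, "pnorm", mean = 2^15, sd = 10)$p.value    # Kolmogorov-Smirnov against the design N(2^15, 10)

    # split into 4 chunks and test normality within each chunk
    chunks <- split(v, rep(1:4, each = length(v) / 4))
    sapply(chunks, function(ch) ad.test(ch)$p.value)

    # in-sample lagged correlations, to compare against what IID noise would give
    acf(v, lag.max = 20, plot = FALSE)$acf[-1]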

    We have now learned that our model of the noise circuit where it’s an IID normal(2^15,10) is actually a good model, because we tried to reject it in hundreds of ways and couldn’t.

    When was the last time you saw *that* in a paper using NHST?

    • That approach looks like Bayesian inference to me.

      What you have is competing hypotheses that are mutually exclusive (and aspiring to be collectively exhaustive), and each of them also estimates probabilities for the experiment outcomes. However, your actual experiment data annihilates their posterior probabilities and upholds the null hypothesis.

      If you were unfortunate enough to get data that seemed to confirm a false hypothesis, that data could also be confirming one or more of the competing alternatives as well. This is what we don’t know when we only think of what the data says about the null hypothesis.

    • I disagree with your definition of “work”. Obviously you can define a single test that will once and for all reject with p=0 that the data are normally distributed, which is to test whether the data are discrete. The thing is, there are model assumptions the violation of which is a problem, and there are model assumptions the violation of which is not much of a problem. This doesn’t only depend on the relation of model to reality but also on what kind of interpretation you want to give your final test result.

      I like to think about this in the following way. If for example people use a two-sample t-test assuming normality, they are really not in the first place interested in whether the real underlying process is normal, and neither would they claim that (actually some do, but not those who have some basic understanding about how probability models relate to reality). Actually, usually the interest is in claiming that the location of one group’s distribution is different from the other one, or that generally one distribution tends to produce larger (or smaller) values. Now one can ask, looking at data generated from other distributions with location differences or some stochastic order between them, what is the performance of the two-sample t-test there (type I and type II error probabilities)? Then one finds that very often, particularly with large enough samples, it is just fine. Sometimes not though. The task of model assumption checking is to figure those cases out, not just all kinds of deviations from the model assumptions.
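
      As a small simulation sketch of that point (made-up numbers, purely for illustration): draw two groups from the same skewed distribution and see how often the two-sample t-test rejects at the 5% level.

      set.seed(1)
      n <- 50
      rejections <- replicate(10000, {
        x <- rexp(n); y <- rexp(n)          # same skewed (exponential) distribution in both groups
        t.test(x, y)$p.value < 0.05
      })
      mean(rejections)                      # comes out close to the nominal 0.05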

      With a PhD student of mine, I have some new work on checking model assumptions and subsequent testing by the way:
      https://arxiv.org/abs/1908.02218v1

      • You are right that whenever we ask whether something works, we must immediately ask: work for what purpose? The testing I describe here would determine whether a model works for any purpose whatsoever, because it mathematically defines what it means to be a random IID sequence from a given distribution.

        Rarely do you need ALL of that; it is often enough to have it be close enough for government work, so to speak. In fact for lots of purposes it is absolutely fine to be FAR OFF from the frequency model:

        http://models.street-artists.org/2014/03/21/the-bayesian-approach-to-frequentist-sampling-theory/

        But then as soon as you acknowledge that the frequency property is something you haven’t checked, and really not what you actually care about, you are doing Bayesian analysis, because Bayesian analysis is about quantifying things you care about without demanding that they be the frequency of anything validated in the world. Bayes’ rule is just one consequence of probability theory; p-values calculated from notional distributions that don’t match the frequency of anything are just other consequences of having made a notional assumption. They don’t have a connection to how often stuff actually happens, but they do have a connection to how often your brain expects them to happen. So long as we avoid the mind projection fallacy, the math is just fine.

        The point I’m making here is that people do Bayes all the time and claim they are doing frequentist inference. As long as you are going to do Bayes, might as well do it right. And if you are going to do frequency matching, might as well do it right too: you ought to check at least a couple of goodness-of-fit measures and so forth.

        • Well this is a result of what you call “do Bayes”, much of which doesn’t seem to have any relation to what Bayes actually did. We had this discussion before, so I leave it at that.

        • I don’t recall you two’s last discussion, but it’s to be expected that “doing Bayes” departs from what Bayes did. Despite how historically interesting Bayes’s paper is, whenever you hear someone use “Bayes” as shorthand for “probability theory as logic”, they’re nodding to the lineage of thought that includes Jaynes, Cox, Jeffreys, Keynes, Laplace, and Bernoulli, in which Bayes’s technical contribution, while certainly respectable, is relatively minor.

        • Yes, it’s true. and of course I’m opinionated. But I do think I have a consistent argument that agrees with common definitions in the literature… and just takes those really seriously, like a mathematician would when she has a definition of a set of mathematical objects (this is probably a pathology drilled into me as a math major in college, or possibly a kind of personality disorder ;-) )

          My understanding:

          Frequentist statistics makes inferences about the world based on how often things will happen under repetition (by definition, I think), and uses probability distributions to approximate how often things will happen in the future, by fitting those distributions to observed data using quantities estimated from a sample … such as what the sampling distribution of a particular function of data will be under future repetitions, like how often would your regression coefficient on a fit to a new data set be bigger than 0.

          The difference between Frequentist statistics and pure math probability theory is that you must have failed to reject some hypothesis tests that the distribution you are using could have generated the data you saw, so that there is a connection between the frequency model and the observations.

          My understanding:

          Bayes makes inferences about the world by encoding what we know about the world in a measure of credence / or how “probable” a thing is based on *how much we know about it*, applying it to both what we think we know (the values of quantities that are not known with certainty), and what we can observe (our expectation for what data will occur which need not have been frequency tested against data), and then deduces logical consequences of those assumptions (sometimes/often by doing numerical calculations with random numbers). Like how much credence are we giving to the idea that future functions of a sample of 10 data points are as big as some quantity t (a bayesian one-sided p value)

          Basically the only difference is whether the sampling frequency is asserted to be a verifiable stable property of the world, or an informational property about what we know.

          I’m open to other definitions, I just haven’t seen anyone state them with any kind of convincing argument about why what I said above shouldn’t be taken as the definition, and it’s logical and useful to take something else as the definition instead.

          What *doesn’t* make sense to me is to choose a frequency distribution on some basis other than it’s been validated at least a little against data… and then *assert* that logical consequences of that sampling distribution are predictions for future frequency of observations.

          That’s the mind-projection fallacy in a nutshell.

          I believe that you can do *both* by the way. You can fit a Bayesian model, and then using frequency tests, validate that the predicted frequencies from the Bayesian model are sufficiently like what is actually seen in the world, that it’s reasonable to use the Bayesian model to predict what will happen in the future in terms of frequency.

          So, these are not exclusive categories. I just think you’d better run some goodness of fit tests if you want to be doing Frequency based stats and make assertions about the frequency with which things will happen in the future. Otherwise you’re either misinterpreting your Bayesian model, or just doing pure math.

        • Historically these terms have been used by many people with different meanings. There’s no justification for the assumption that there’s only one true definition. Particularly “Bayes” has been identified with several different interpretations of probability. So whatever you say here is *your* interpretation of things, which is fair enough, but there’s no reason to think that others will share it.

          What I find problematic about your definition of frequentism is that you’re mixing up interpretation, ontology and methodology. As I wrote earlier, I define frequentism as an interpretation of probability, “we think of the statement P(A)=x as meaning that if we can repeat the experiment infinitely often the relative frequency of A will converge to x”. I do not state or assume that the world really is like this, and I don’t have to. One can discuss to what extent frequentist thinking is appropriate for which real world problems. It’s an important discussion, but frequentism as interpretation doesn’t rely on a positive result of that discussion. You may think that in order to apply it successfully the question must have been answered positively, but I doubt that. However I agree that somebody who applies frequentist statistics at least has something to answer there. Goodness-of-fit testing is a way to address this question, but not necessarily the only way, and as such will belong to the domain of the application of frequentism but not its definition. I’m not alone with this view. Frequentist grandpa von Mises says nothing of the kind that goodness-of-fit testing is mandatory, and neither does Kolmogorov. And that’s not because they don’t think models should be checked but because they know the difference between an interpretation of probability and its application.

          I’d also personally not call anything “Bayesian” that doesn’t come with prior distributions and builds posteriors, so if you say “the point I’m making here is that people do Bayes all the time and claim they are doing frequentist inference” this leaves me baffled, because a) there’s no contradiction, you can use Bayesian computation with probabilities that are interpreted in a frequentist way, and b) if they don’t specify priors in my view they’re not “doing Bayes”, and if you tell me 100 times that one can write down something equivalent that has a prior and looks like Bayes, still that isn’t what they do. This would be like saying that every Bayesian actually does Walley-style imprecise probabilities because Bayes is a trivial special case of this where the imprecision is actually zero. You can always embed something simpler in something more complex but that doesn’t make it a more complex thing.

        • I think we are discussing two different things. There is *probability theory* and there is *statistics*. Sure, you can do pure probability theory on sequences of outcomes and it’s totally legit. I never *once* meant to question that.

          But this is only *statistics* if it connects to data and how the world works. We do *statistics* to infer something about the real world. We do probability theory to infer something about a mathematical construct’s consequences. The probability theory calculation is true independent of the properties of particular experiments on mice or astronomical observations or whatever. The probability calculation that says “if I sample from distribution D I would only get the t(data) value I did from my mouse experiment with p=0.002” taken as a *statement about pure math* is true regardless of whether the data did in fact come from your mouse experiment.

          So I have no problem with “frequentism as an interpretation of probability” there are *at least* 2 models (in the model theory sense) of probability, one is the properties of infinite sequences, and one is the Cox style credence. I suspect there’s also more, for example some geometric idea maybe.

          So, my point was to come up with some definition of frequentist *statistics*. If we do inference on a question about the real world, using frequency interpretations of probability, we should have a reason to believe that the frequency interpretation of probability is making statements that correspond approximately to facts about the real world.

          For example, if I say “water behaves according to the Navier-Stokes equations” this should mean that when I calculate a velocity field with those equations, the actual water should be moving in the same way that the mathematics says it will, at least to pretty close approximation… Similarly when I say “we are running experiments that are equivalent to samples from a normal distribution with unknown mean and unknown variance” it had better be the case that to some reasonable level of approximation, the actual data fills up the histogram of a normal distribution without too many gaps or too much kurtosis, skewness, secondary modes, etc… If not, then the calculation “and therefore f(x) will have mean value Q and sampling distribution D” is a false statement about the world, just as if you put 20 times the right viscosity into your Navier-Stokes calculations you won’t have correct predictions for the water flow.

          So, now, what do we call statistics where people do an RCT, take some data, hand it to lmer, get a mixed effects model fit, and then claim something like “the effect of free toothbrushes on cavities in children is -.4 per person per year CI [-1.2,-.1]” ??
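
          The workflow I have in mind looks roughly like this sketch (lme4 syntax; the data frame and variable names here are hypothetical):

          library(lme4)

          # 'trial_data', 'cavities', 'toothbrush', and 'school' are made-up names for illustration
          fit <- lmer(cavities ~ toothbrush + (1 | school), data = trial_data)
          confint(fit, parm = "toothbrush", method = "Wald")   # reported as "the effect", with no model checking anywhere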

          Let’s agree that, as is *quite typical*, they *did not once run any sort of goodness of fit testing*, and it’s also trivially true that if they collected a moderate amount of data, the data fails to look like a normal distribution: it has some skewness, a long tail, a couple of modes… something. They also did not explicitly calculate samples from a posterior distribution, I grant that, and they did not explicitly input a prior.

          but they *did* make a claim about the world that is not a pure math claim.

          I claim that in so far as they applied particular distributions to their data without checking and validating the frequency properties… these are at best Bayesian distributions describing credence… at worst utter garbage.

          I also claim in so far as they calculated an interval and almost always interpret that interval as a high probability density interval… they intend to interpret it in a Bayesian way… I think this happens all the time. People say things like “there is only a 5% chance that the real parameter is outside this interval” *all the time*.

          I also claim that in so far as the claim of the CI is that “if the procedure were re-run on additional data 95% of the time the CI’s so generated will contain the true parameter” it can only be a claim *about the world* if the assumptions inherent in the procedure are approximately true facts about the world (ie. distributional shape assumptions…) also which they have not checked and will be pretty demonstrably false with lots of data.

          I had my wife run a sham experiment measuring blobs of certain sizes in digital images… they weren’t normally distributed, but it’d be totally bog standard for any biologist to run a mixed effects model or a t-test on them, etc.

          So, besides “bad statistics” what is this kind of statistics?

        • What I would call that kind of statistics is some kind of “naive” or “hopeful” “empirical Bayes”. It *does* use priors, namely the normal distributions over the hierarchical parameters. There can truly be no frequency-statistics justification for those, as they almost never involve even a repeated process (the groups are often a small number of fixed things, like races, or schools). The hyperparameters are estimated essentially directly from data… the user interprets the CIs as if they were high probability regions… the data models are not validated in shape and so statements like “the CI will contain the parameter 95% of the time you rerun the experiment” is an unvalidated claim about that world that we have *no* argument for… It corresponds to a tightly peaked prior on the shape of the data distribution which is unjustified. (Andrew says people choke on the pea of the prior but let the camel of the likelihood pass through the eye of a needle… or some kind of less mixed up metaphor)

          A lot of it is just “hope and pray” but what *is* usually considered here is the structure of the hierarchical model, that is, the user is saying “these groups could be substantially different from each other, but they’ll all cluster around some average… and each group will have its own intercept, and its own slope, and there can be groups of groups… etc.” And to the extent that the user thinks like that, it’s basically a kind of “folk” generative model.

          So, this is why I say people are “doing Bayes”: because it smells like Bayes, it’s interpreted like Bayes, it has structure similar to Bayes. But of course it’s not *good* Bayes, because it doesn’t actually use prior information that we do actually have, and it fits with a point estimate, basically a kind of MAP, and so it doesn’t do a good job of exploring correlation in the parameters and so forth.

          What it doesn’t smell like or taste like is a model for the frequency with which things in the world occur, because it doesn’t *once* interrogate *the world* about frequency properties.

        • Surely you don’t need to convince me that tests are often misused and misinterpreted, and that frequentist probability interpretations are often used with very weak justification or none at all. However, I generally think that the connection between models and reality is somewhat looser than you think, or rather that there are different degrees of looseness and although you can find the odd example where the connection is very close, in many cases this is not so. There is no such thing as a true parameter, and there is no such thing as a perfect repetition of anything to begin with. We still use models, sometimes with weaker justification, in a kind of trial-and-error way, and they help us, or where they don’t, we ditch them or change them. I wrote about mathematical models in general, too.
          https://www.researchgate.net/publication/225691477_Mathematical_Models_and_Reality_A_Constructivist_Perspective

          Obviously any such model and inference can and should be challenged and the researchers should be able to defend it, and all the issues that you list play a role.

          However, I think that such issues appear for all mathematical models of reality, including Bayes. Just as an example, much Bayesian statistics relies on the assumption of exchangeability, but hardly anyone would say that if you observe 00000011111111110000, the probability for observing 0 in the next go should be the same as if you observe ten ones and ten zeroes in apparent random order. “Near exchangeability” can only hold as long as you don’t see things that look strikingly non-exchangeable, and this is usually conveniently ignored in modelling. One can defend the lofty ideal that all thoughts, all information and everything one can imagine happening in the future should be appropriately reflected in the prior, but this will never happen, and in many cases it is unclear what that even means. And as in frequentist modelling, the Bayesian needs to argue that their specific model is convincing; not much of this is done in many publications, and something better can often be done than is actually done, but there are limits.

          You write: “I also claim in so far as they calculated an interval and almost always interpret that interval as a high probability density interval… they intend to interpret it in a Bayesian way… I think this happens all the time. People say things like “there is only a 5% chance that the real parameter is outside this interval” *all the time*.”
          What I actually believe is that people want Bayesian probabilities for truly existing frequentist parameters to lie in certain intervals. To which both of us should object!

        • Christian, thanks for continuing to engage on this topic I feel like my own ideas get better the more we discuss, so I appreciate it.

          When you say “We still use models, sometimes with weaker justification, in a kind of trial-and-error way, and they help us, or where they don’t, we ditch them or change them”, I wish this were the way actual scientists behave, but I am afraid that it is only sometimes, probably less than 50% of the time. We see bad examples on this blog all the time of people sticking to models as if they were received truth. Perhaps my strong pushback against Frequentist statistics is because I see it as usually making strongly unjustified assumptions about the world, and then people often refuse to check them, or to back down on them. In fact I think most of the time people are completely unaware of them! (In particular, I don’t think an average scientist has much background in algorithmic/Kolmogorov complexity theory.)

          One of the strong assumptions I object to the most is the idea that the world outputs high complexity sequences “by default” so that short prefixes of the data are all we need to figure out what will happen infinitely into the future. The claim that “X is a random sample of the growth rate of cells in a dish” simply because you grew up 20 plates in your lab is hugely problematic

          I would be happy if everyone were forced to explicitly state “if growing cells in a dish anywhere in the world using these instructions results in the same kind of growth under all conditions because the growth function is well modeled as a reliable high complexity random sequence …. then….”

          If everyone had to do that, then when people don’t actually test that assumption… we could just insert “if FALSE then I’m the queen of Egypt” which is a true statement but useless… and move on to science where people test their assumptions, particularly when they’re super strong!

          Here’s a little code that you can run over and over again to see how reliable high quality random number sequences are at producing the same thing over and over day after day… (not mainly aimed at Christian, but maybe someone else reading this)

          library(ggplot2)

          # three independent sets of 30 draws from gamma(3, 1/2), overlaid as density curves
          x = rgamma(30, 3, 1/2)
          y = rgamma(30, 3, 1/2)
          z = rgamma(30, 3, 1/2)
          df = data.frame(val = c(x, y, z), label = c(rep("A", 30), rep("B", 30), rep("C", 30)))

          ggplot(df, aes(val, col = label)) + geom_density() + coord_cartesian(xlim = c(0, 20))

          If you continue to do this, not once does the region around 12 have a peak above everything else, not once does the data look uniform, not once is the peak location varying dramatically between the 3 lines…

          The Bayesian claim “To the best of my knowledge I have no reason to distinguish between X and future X” is way less problematic. It’s not saying “the world is the same symmetric in time out to infinity” it’s saying “I don’t know how the world changes in time, but taken together all of it should be within some range so that no matter which observation you give me individually, it will be in the high probability region of this distribution…”

          It might be good to force that into papers too.

          Note that the first one makes a mathematically strong assumption about each data point. But it also implies mathematically strong assumptions about each sequence of 2 data points, each sequence of 10 data points, each sequence of 20, each sequence of 100… and it’s an assumption about how the world works, and the “probability” that is calculated is interpreted as a “frequency with which the world will actually produce this result”.

          If it turns out that on a given day each of the colonies on 10 plates is similar in size, but then on another day they’ll all be similar but a different size, and on a third day they’ll all be spread around with high variability, and then on a 4th day they’ll all be almost exactly the same… then whatever you get on the first day, it’s got a totally different histogram from the second day, and the third day… and so on. Any given batch of 10 or 20 won’t fill up the histogram of the frequentist assumed distribution even close! It also won’t fill up the “true” distribution. In fact there is no true IID distribution, because obviously there’s strong correlation in a given day… So the frequency model is strongly violated and all kinds of predictions it will make are violated in very very strong ways.

          On the other hand, the Bayesian model is not strongly violated under those conditions, so long as each of those batches does produce individual points that are somewhere in the high probability region of the assumed gamma distribution, because the Bayesian model doesn’t make strong assumptions about the sequence, it only makes assumptions about what you know about the individual points in the sequence. This is seen by virtue of the fact that if you shuffle the sequence in any order, the likelihood function stays the same (a consequence of the commutative property of multiplication of real numbers)

          (Of course, if you want to make assumptions about subsequences you can make Bayesian models that model sequences… so it’s not a limitation of the method.)

          The math specifying the model *looks* kind of the same, but it’s not!

          y_i is a random iid sample of gamma(a,b)

          y_i is in the high probability region of gamma(a,b) for each i

          But the first is *vastly stronger* than the second because it has as a consequence an infinite variety of statements very much like: y_[1..40] will have a 10-bin histogram where the chi-squared goodness-of-fit test against the gamma(a,b) distribution will only be rejected at the 0.05 level about 5% of the time you take a sample of 40 successive numbers.
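
          A rough numerical check of that kind of consequence (a sketch, with a and b picked arbitrarily): repeatedly draw 40 IID gamma(a,b) values, bin them into 10 equal-probability bins, and see how often the chi-squared goodness-of-fit test rejects at the 0.05 level.

          a <- 3; b <- 1/2
          breaks <- qgamma(seq(0, 1, length.out = 11), a, b)   # 10 equal-probability bins
          rej <- replicate(2000, {
            y <- rgamma(40, a, b)
            counts <- table(cut(y, breaks))
            # suppressWarnings: expected counts are only 4 per bin, so R warns about the approximation
            suppressWarnings(chisq.test(counts, p = rep(0.1, 10)))$p.value < 0.05
          })
          mean(rej)   # near the nominal 0.05 (only roughly, given the small expected counts)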

        • To give a mathematical example of what I’m talking about…

          > s = rep(seq(1,6),10)
          > s
          [1] 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2
          [39] 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6

          s is pretty obviously a low complexity sequence, it’s not IID in any way, it has 100% lag 6 correlation…

          > rs = sample(s,length(s))
          > rs
          [1] 5 6 3 1 3 4 5 2 3 3 1 1 5 2 6 6 5 3 2 6 1 1 1 6 6 6 4 2 1 4 6 4 5 3 2 2 3 5
          [39] 4 4 5 1 5 1 4 2 3 1 2 2 3 4 2 6 4 3 4 5 5 6

          rs is a high complexity sequence, it still isn’t IID because the frequencies are too close to perfect, but it passes a lot more tests.

          Now sequence s under the Bayesian model “these were produced by a 6 sided die that has equal probability for each side” has log probability 60*log(1/6) = -107.5

          and for sequence rs, the log-probability is the exact same number 60*log(1/6)

          So while the first sequence fails huge numbers of tests for IID uniform 1/6 results, both have the same Bayesian evidentiary value for fitting a model of the die.
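
          A quick way to see both points in R (continuing with s and rs from above): the log-likelihood under the fair-die model is identical, while a simple lag-6 check cleanly separates the two sequences.

          sum(log(rep(1/6, length(s))))                 # -107.5; the same for s and rs, since every face has probability 1/6
          acf(s,  lag.max = 6, plot = FALSE)$acf[7]     # lag-6 autocorrelation of s is about 0.9, reflecting the period-6 structure
          acf(rs, lag.max = 6, plot = FALSE)$acf[7]     # for the shuffled rs it is typically near 0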

          The math is the same but the *meaning* is totally different because *IID sampling* is not the same as *IID information about the die’s parameters*.

        • I certainly agree about careful wording of what is assumed and what the output means. However I still believe that you demand more from the frequentist side than from the Bayesian one. I can see that a purely epistemic claim like “To the best of my knowledge I have no reason to distinguish between X and future X” is not so easy to criticise as “the world behaves like the following random number generator…” For which reason I have explained earlier that this is not my way to understand the results of what I’d call frequentist statistics. Sometimes frequencies are highly hypothetical and so is the argument, and sometimes, although certain model assumptions are obviously violated, one can make a case that regarding what one wants to learn it doesn’t matter much (such as when applying t-tests to uniform data).

          In one application I have been involved with we have “translated” a theory that people have in the literature about speciation (species differentiation processes in biology) into a fairly simple model for spatial species distributions, and then generated species distributions from the model, and compared them to species distributions observed in reality, including running a test. The model could actually be interpreted as generating worlds parallel to ours, every repetition creates a new world and in reality we have only one, no repetition. However, in cases where real species distributions deviated significantly from what the model generates (which was sometimes the case and sometimes not), the people defending the theory would need to try to explain the difference, potentially leading to sharpening of the theory, and surely to some this was surprising. On the other hand, in situations where there was no significant difference, it was clear that opponents of that theory couldn’t use these cases to argue their opposition. Voila, something learned.

          On the Bayesian side, one may wonder whether “To the best of my knowledge I have no reason to distinguish between X and future X” is a proper and true description of the best knowledge. One should not assume exchangeability for example if in fact one would be prepared to drop it given that certain patterns occur (the perception of this kind of pattern implies already that a process is *not* seen as exchangeable, because if it was there wouldn’t be any reason to look at orders). Or rather, one can assume exchangeability but should state from the beginning that this is a simplifying model and it makes an assumption that one would be prepared to drop (in turn falsifying the assumption if taken literally). At the end of the day, whether one thinks properly about all kinds of things that can happen and tries to model them appropriately (or ignore them if they are deemed irrelevant) is not a question of Bayes vs. frequentism. It is rather a question of whether people are modest and intelligent enough to think things through and know and point out limitations.

        • Oh good! I like your example problem where we have a simulation model, because I too build simulation models, like agent based models. It was hugely helpful to my biologist wife to build one about mouse rib development a while back. We published it in eLife:

          https://elifesciences.org/articles/29144

          Now, suppose we have an agent based simulation model that can produce a spatial pattern of species through some migration and mating type behaviors, and we generate random numbers to make decisions, and we get a final time state.

          I think of such a thing as “to the best of my knowledge, this sample is no more or less likely than any other sample my program might output”… Basically we can’t describe the Bayesian distribution directly, but we can sample from it.

          Now we suppose there are a couple of parameters, and for ease of exposition, we assume they are on bounded sets, like for example “the probability at each time step to mate with an animal the same color vs alternative color to me” and “the probability to migrate vs search for a mate”… obviously both of them are confined to [0,1].

          I claim the set of all possible values for these parameters is [0,1]x[0,1] and sample uniformly from all these. I then run a simulation for each sample, and look at the simulated data. I have some function that compares two datasets to get a real number C(D1,D2), which is 0 if D1 and D2 are the same, and an increasing number the more they deviate from one another….

          I keep a parameter in my set with a probability that is proportional to a declining function of C.
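
          In code, the procedure I have in mind is roughly this rejection-style sketch (run_simulation, compare, and observed are placeholders for the agent-based model, the comparison function C, and the real-world data):

          n_draws <- 10000
          kept <- list()
          for (i in seq_len(n_draws)) {
            theta <- runif(2)                                        # uniform draw from [0,1] x [0,1]
            sim   <- run_simulation(p_mate = theta[1], p_migrate = theta[2])
            C     <- compare(sim, observed)                          # 0 if identical, larger the more they deviate
            if (runif(1) < exp(-C)) {                                # keep with a probability that declines in C
              kept[[length(kept) + 1]] <- theta
            }
          }
          posterior_draws <- do.call(rbind, kept)                    # the retained parameters trace out the "not too weird" region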

          Am I doing Bayesian Statistics? Frequentist Statistics? Some other thing? Not Statistics At All?

          To cut short an objection: obviously I am doing calculations using frequentist probability theory within the computer… I don’t mean that. I mean, what kinds of implications am I deriving about the world by incorporating the real-world dataset and the comparison function into this procedure?

          In my mind, this is very clearly Bayesian statistics. In fact, I am working hard in my brain over the last few months to come up with ways to fit this kind of model conveniently because I see it coming up in my future to need to do this.

          I would be interested to hear your thoughts.

        • I like this because this illustrates to me how close my frequentist interpretation of probability can be to your supposedly epistemic one. If you ask me whether I think this is frequentist (which in my view is of course not generally incompatible with being Bayes), I’d say that if you interpret this as simulating a hypothetical process that we could imagine to go on in the world, then I’d call it frequentist. Is it a thought process that uses a hypothetical model of the world, or is it a model of a thought process about the world? The first one would be frequentist, the second one epistemic. Writing this I do realise that they are close to the point of becoming almost indistinguishable.

          Actually our model had two free parameters. We could have simulated them from priors, but what we did was rather “parametric bootstrap”, i.e., we estimated these parameters from the data in order to give the simulation optimal chances to reproduce certain features of the original distribution that were irrelevant to the theory that was to be tested. Disadvantage: this ignores that the parameter could have been different; simulating from a distribution of parameters could have produced more “realistic” variability (although in that case it was fairly clear that these parameters were quite “orthogonal” to the test statistic, so I don’t expect that this would have made a big difference). Advantage: No necessity to specify a prior; a uniform prior would pretty surely have produced too much variability by putting too much probability on unrealistic values, however it would have been pretty hard to argue specific more informative priors. But I’m not against using priors in such a situation as a matter of principle.

          You see, in practice we may agree. Our disagreement may be largely really about definition of terms, not so much about what a good use of probability models is. I wouldn’t mind much, for example as a reviewer of your work, about whether this is “really” frequentist, “really” Bayes, none, or both of them.

        • Christian, this is what I’m thinking: at least some of our issue is a conflict over words. So let’s ignore frequentist and bayesian entirely… let’s talk about some of the things that are going on in this example as just phenomena.

          First off, there are some parameters. We both agree that if we try them uniformly at random it will take a long time to do anything, because many of them will produce too strange of a simulation, so we’d rather try them in some region of space where they have a good chance to do something useful, more or less like what is observed in reality. So there is a search process of some sort.

          One way to search is to find a single parameter value that produces hypothetical results from the simulation which are not too weird, or maybe if we have a way to find such a thing, minimally weird (maximum un-weird?). Let’s call this “Minimum Weirdness Estimation”. Next we could just find a set where the Weirdness is no worse than some threshold. Let’s call this Interior Weirdness Set Estimation, it’s just an estimate of the function Indicator(x such that Weirdness(x) less than 95%). Finally we could describe a Weirdness Surface over the whole set and say that W(x) describes how weird each parameter is. Call this Weirdness Surface Estimation.

          All of these are ways to find the parameters that result in hypothetical predictions that aren’t too weird.

          Second, as the agents move around and do things, there are decisions they have to make. We don’t have details on what their decisions are, but we can say that there is a distribution of possible choices. In general, in the simulation, the final outcome depends on a lot of different choices among many different agents. In this case it surely is a frequency based computational method, and it computes a sample of a single possible outcome. However, both of us will agree that it’s a hypothetical outcome I think. We are not making claims that the frequency of how often the world does things is like what our model predicts, at least not yet (at a minimum, before doing that we need to find at least one UnWeird parameter).

          Now, we may have let’s say several different examples of real-world situations. Let’s say for example the Galapagos islands, Australia, and Madagascar, all of which involve some kind of speciation process that are physically independent of each other because very different types of animals are involved for example.

          When this occurs, what is weird for one environment might not be weird for the other. In other words, there might be different parameters needed in each condition.

          On the other hand, we might think that these parameters are fundamental, like the speed of light, and so there are not different parameters in different environments, however, the specific details of what happens in different environments can vary, so perhaps each environment is just a different outcome within the variation possible from all the strange things that might have occurred.

          We now have several different kinds of uncertainty.

          U1) We don’t know what the un-weird parameter is.
          U2) We may or may not think that the un-weird value is different in different environments.
          U3) There are different things that can happen even if the parameter is the same in two different environments, because: stuff happens / initial conditions / stuff we don’t know about like the huge vector of unobserved decisions on the part of the agents etc.
          U4) We don’t know how much variation there might be in the future, or perhaps in observations we might have been able to make about the past, like if there was a hidden ornithology map from ancient egypt that showed up in a tomb somewhere…

          We also have two kinds of frequency:

          F1) Hypothetical / computational frequency that occurs inside our computer program.

          F2) Observed frequency that occurs from having multiple datasets about the world.

          ….

          So here’s my claim. When I discuss Bayesian Statistics, I mean that we can use F1 freely to do hypothetical calculations that describe our model. We can describe U1 in terms of a continuous measure of Weirdness. We can describe U2 in terms of a continuous measure of weirdness of the collection of parameters. We can model U3 using a hypothetical model of what we know about what might happen. And we can describe U4 in terms of how weird we *think* different future outcomes will be.

          The only thing we can’t do is make a claim like “in the future, if we observe speciation multiple times, and we repeat our procedure of Minimum Weirdness Estimation followed by Weirdness Expansion, the true parameter will be within the Weirdness Expansion interval 95% of the time”. Nor can we say “having found the maximally unweird parameter, when we put it into the model 95% of the time we get 13 or more species… therefore when animals arrive in new environments we can expect in the future that at least 13 species will be created 95% of the time…”

          Why? Because that’s a prediction about how often something will occur in the future… and we have no real basis for discussing how *often* it will occur since we have no measurements of how often things happen, we have maybe 3 individual instances.

          On the other hand, we can say “95% of the time that we run our simulation with the parameter vector x we get more than 13 species, therefore we find it would not be very weird if we found 13 or more species in a new environment”… Because this is a statement about how weird we think it would be to see something happen that wasn’t at all weird in our simulation.

          So, as far as I’m concerned, we’re doing 100% Bayesian statistics here, because *we don’t predict the actual frequency in the world that anything will happen, like how often our procedure will get the right answer, or how often in new circumstances we will get 13 species*.

          As soon as we have a large set of examples, like way more than 3, maybe 30, then we can start measuring how weird things are in terms of how often we predict something vs how often it actually happened… and we can start to have confidence that we have learned something about frequency in the world (particularly if we have a theoretical reason to believe there is a stable frequency)… At that point, our Bayesian model which was compared to frequency in the world, has converged onto a not-weird frequency distribution, and we can say something about the frequency distribution too..

          There’s no reason a Bayesian can’t estimate frequencies after all…

          Now, if we use a variety of parameters and form that Weirdness Surface, and then we allow that there are level sets of Weirdness, and that we can calculate a total weirdness… we are doing 100% Bayesian statistics with full Bayesian estimation.

        • Let me just confirm that I read it and I am largely with you except on the terminology question (as before). I’m not ending up with any epistemic probability of the model to be true, rather I investigate its compatibility with given data. The whole thing is not about prediction but rather about theories regarding the past (although one could of course think about hypothetical predictions – to be checked in a few 10000 years or so). As long as I choose the parameters by “minimum weirdness estimation” rather than specifying a prior (without going further into discussing pros and cons of these), there’s no “Bayesian computation”, and therefore I’d object to Bayesians claiming that this “really” is Bayesian. (It could be done in a “more” Bayesian way, which is fair enough and fine by me, except for me Bayesianism only starts then, not before.)

        • Thanks Christian for checking in on it.

          I think terminology is standing in the way, as well as some history, like the Cox/Jaynes notion that Bayes *is* continuous expansion of true/false logic. I think I reject that, continuous expansion of true/false is only a subset of Bayes. Bayes can certainly do something useful when there *is no* true parameter, in fact most of the time this is the case.

          But ignoring terminology issues, I think there are a few important classificational aspects of styles of statistics, which includes important ones like:

          1) Whether or not the model’s merit is measured relative to its ability to predict the frequency of occurrence.

          2) Whether or not a continuous rating for different unobserved parameters is computed.

          3) Whether or not a claim is made about the estimation procedure’s frequency of “being right”.

          If the model is only about “what would be more or less weird to occur” I call it Bayesian. I guess what this means is that for me Bayesian *isn’t* about a model of epistemic truth… or at least not exclusively that.

          If the model is “what would be more or less weird for the Frequencies of occurrence” I call it “Bayesian model of Frequency”.

          If the model is about “the frequencies in the world should be considered to be F(x,q) to good approximation” and q has no relative rating of goodness, then I call it Frequentist.

          So long as we rate the weirdness of the parameter q based on the weirdness of the data under the model, I think it’s a Bayesian model. If we distort the weirdness of the q based on information other than the weirdness of the data under the model, then it’s Bayesian with an explicit/informative prior.

          In particular, I consider sampling from the posterior in order to explore the weirdness landscape to be a good practice but not actually essential to the classification. If you have a weirdness landscape and you find the one point that minimizes the weirdness, it’s still the case that the weirdness landscape exists in your calculation.

          In particular, I think this makes models fit with maximum likelihood a kind of Bayes…

          But I don’t want to get bogged down in the terminology so much as the distinction in the ideas: rating weirdness of outcomes (based on frequency or based on just informational considerations) and weirdness of parameters, vs calculating frequency of occurrence of data, falsifying models through their lack of frequency correspondence, and calculating frequency of occurrence of “success” in estimating the truth.

        • I’d also say that the weirdness rating had better obey the sum and product rules to be Bayes. If your weirdness rating is not isomorphic to probability I guess it couldn’t be Bayesian.

          I don’t actually know what I think of the idea that a likelihood function might be non-normalizable. p(data | parameters) taken for fixed data and variable parameters that can’t be normalized on its own without a proper prior on parameters seems like it still is Bayesian to me.

          p(data | parameters) taken as fixed parameters and a non-normalizable distribution over data does NOT seem like it can be Bayesian.

  3. I like the description of “null hypothesis significance testing” as prematurely collapsing the wave function, but you have not captured what I think are the key problems. NHST procedures are a blend of two distinct statistical procedures with wildly different objectives, neither of which is typically achieved with NHST.

    The hypothesis testing part leads to a decision (accept or reject the null [or reject or fail to reject]) and is woefully inadequate for scientific purposes when the decision is not informed by thoughtfully constructed loss functions. An acculturated aversion to type I errors is no substitute for a considered loss function. A dichotomisation of continuous (or finely granular) information into a yes or no decision is the premature collapse of the wave function.

    The significance testing part should provide a component of the characterisation of the evidence. A p-value is an index of the strength of evidence against the null hypothesis according to the data and the statistical model. The main problem with that is that most people fail to treat the p-value as evidence and, instead, treat it as a variable that feeds into an incomplete hypothesis test procedure.

    For a scientific inference the evidence should be considered in light of what is known about the system (including the applicability of the model, the representativeness of the sample, the relationship between the population actually sampled and the population of interest, any relevant prior information about the population of interest and the parameter of interest within the statistical model).

    Once again I will shamelessly link to my recent work on this topic: https://arxiv.org/abs/1910.02042

    • Michael: Schervish 1996 (which you’ve read and thought about) makes it clear that p-values, in general, do not behave like anything one would want to call a measure of evidence. In your comment here, are you implicitly assuming that only 1-tailed tests should be performed?

      Also, have you had a chance yet to go through Examples 1.1 and 1.2 of Patriota 2013? They further illustrate how p-values can fail to measure evidence.

      • I have indeed read and thought about Michael Schervish’s falsely titled paper. I reasoned, and he confirmed by email, that his arguments that p-values fail in a singular manner to behave as evidence apply _only_ to two-sided p-values. One-sided p-values are totally unaffected by Schervish’s arguments. It is strange to me that he failed to make that limitation clear in the paper itself.

        Yes, I do think that one-tailed p-values are preferable to two-tailed, but given that the arithmetical conversion from one to the other is usually trivial I am content to say that the number of tails should always be specified. I will note that arguments in favour of 2-tailed p-values are typically based on error rate properties and so should not be persuasive to anyone wanting to use p-values as indices of evidence.

        Have I had a chance to go through examples? Yes, I suppose I have had a chance to, but I haven’t read the paper. Should I? I have just now read the abstract and I am none the wiser. The first sentence does not make it sound very attractive. (If I need an excuse for not reading that paper I will offer the fact that papers pointing to problems with p-values are very common and not always very useful.)

        • Thanks Michael. I didn’t actually ask if you prefer one- or two-sided tests; I would like clarity that statements of yours like “A p-value is an index of the strength of evidence” actually only apply to one-sided tests.

          As you are undoubtedly aware two-sided tests are vastly more common in practice, so your complaint that “most people fail to treat the p-value as evidence” doesn’t seem well-founded. If the p-values being used don’t measure evidence, why is it bad to not treat them as evidence?

        • I do prefer one-sided p-values over two sided for reasons set out at length in section 3 of this paper https://arxiv.org/abs/1311.0081. However, if one-sided p-values are an index of the evidence against the null then two times the one-sided p-value is also an index.

          Why is it bad to not treat p-values as evidential indices? Because the thoughtless dichotomous decision making on the basis of a comparison of a p-value to a fixed threshold leads to bad decisions, and bad science.

        • ??? The two-sided p-value is not just twice the one-sided p-value. If it was they’d take values up to 2.

          The two tests also have different nulls so the relationship between them does not mean we can extrapolate 1-sided test p-values working as measures of evidence (against one null) to 2-sided test p-values (against another).

        • Generally, a simple one-sided P-value can only be in the range 0-0.5. The corresponding two-sided P-value is twice the one-sided if the probability distribution is symmetric (e.g. the normal). This is not true if the distribution is asymmetric (e.g. the negative binomial). As an aside, it is worth mentioning that the 2-sided P-value is not clearly defined for an asymmetric distribution – another reason to use the one-sided.

        • Nick Adams: are you using the term “simple” in some special way? Here’s a line of R code for a one-sided test, that’ll give you a p-value well above 0.5

          t.test(x = -3:7, alternative = "less")

        • Typically the two-sided p-value is two times the one sided p-value where the latter is less than 0.5 and it is two times one minus the one sided p-value when the one-sided p-value is greater than 0.5. (For asymmetrical statistic distributions that relationship will not hold.)

          It is worth noting that some software (at least the dreaded MS Excel) will only give one-sided p-values less than 0.5 because the direction defaults to being the direction that gives the smaller p-value.
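
          A quick illustration of the symmetric-case relationship discussed above (a sketch using a one-sample t-test, where the t distribution is symmetric):

          x <- rnorm(20, mean = 0.3)
          p_less <- t.test(x, alternative = "less")$p.value
          p_two  <- t.test(x, alternative = "two.sided")$p.value
          c(p_two, 2 * min(p_less, 1 - p_less))   # the two numbers agree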

  4. Bayesian statisticians will not only recommend and use Bayesian inference, but also will try their best, when seeing any non-Bayesian method, to interpret it Bayesianly. This can be helpful in revealing statistical models that can be said to be implicitly underlying certain statistical procedures—but ultimately a non-Bayesian method has to be evaluated on its own terms.

    Here’s how I [currently] think about these things:

    – Frequentism is about proposing/evaluating estimators.

    – Bayes is about proposing/evaluating estimates.

    Sensible estimates imply sensible estimators but not conversely.

    I can understand evaluating an estimator “on its own terms” before having seen any data. But given some data, it seems to me that “sensible estimate” really does mean “approximately Bayesian estimate for a sensible prior”; frequentist criteria are only indirectly relevant.

    Is this wrong?

    Also, from the previous post:

    It’s the whole religion thing, the people who say that Bayesian reasoning is just rational thinking, or that rational thinking is necessarily Bayesian, the people who refuse to check their models because subjectivity, the people who try to talk you into using a “reference prior” because objectivity.

    There’s a substantive point here that I think Richard McElreath makes well with his “models-as-golems” analogy. On the other hand, it’s a bit lazy to call this “the whole religion thing”.
