A comment about p-values from Art Owen, upon reading Deborah Mayo’s new book

The Stanford statistician writes:

One of the fun parts of this was reading some of what Meehl wrote. I’d seen him quoted but had not read him before. What he says reminds me a lot of how p values were presented when I was an undergraduate at Waterloo. They emphasized large p values as a way of saying ‘not necessarily’ instead of small ones as Eureka.

Well put.

I’ll be posting soon with more reviews of Mayo, but I just wanted to post the above quote on its own.

153 thoughts on “A comment about p-values from Art Owen, upon reading Deborah Mayo’s new book”

  1. I’m not sure how relevant it is. Meehl’s mother died as a consequence of ‘poor medical care’ that made him question medical practitioners and diagnostic accuracy, according to Wikipedia. I think the best diagnosticians have experiences in their lives which make them question expertise.

    We forget that some scientists of his generation had begun to acknowledge the limits of science and medicine. He came to the University of Wisconsin, as I recall.

    • Thanks for sharing, Sameera. Didn’t know that, and hadn’t looked too hard into Meehl, but his perspectives on philosophy of science always seemed interesting to me. Will have to give him a read.

  2. Really interested in your takes on Mayo. She makes a pretty great argument for frequentist methods—and I think she agrees with you quite a bit on how they are used incorrectly. Is it really true that Bayesians don’t care about stopping rules, etc.? It seems incredibly wrong to say “stopping rules are irrelevant,” but I don’t know enough about the various Bayesian camps and what they believe to know if it is a fair characterization on Mayo’s part. It seems like a framework that says stopping rules don’t matter is going to yield poor evidence and encourage QRPs.

      • Thanks! Will read into it. Do you remember where in BDA? It doesn’t appear to be in the index, and searching the ebook didn’t yield anything. Looking forward to more Mayo reactions. I’m nearly done with the book now; it’s great.

      • Stopping rules are relevant precisely when the fact that someone stopped is information of relevance to the model (this seems circular, and it sort of is). For example, suppose I know that you have calibrated your instrument very well, but I don’t know exactly what its precision is. You then tell me you sampled until you had a posterior standard deviation of q for parameter z, and you give me your data x. Now I can use this info and the number of x values you collected to infer something about your instrument’s precision.
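
        A minimal simulation sketch of this point (my own made-up numbers, using the estimated standard error s/sqrt(n) as a stand-in for the posterior sd): for a well-calibrated instrument with unknown noise sd sigma, the stopping sample size n under a “stop when the standard error drops below q” rule carries information about sigma, roughly sigma ≈ q*sqrt(n).

        import numpy as np

        rng = np.random.default_rng(1)
        z_true, q = 10.0, 0.1   # true quantity and target posterior sd (assumed values)

        def run_until_precise(sigma):
            # keep sampling until the estimated standard error falls below q
            x = list(rng.normal(z_true, sigma, size=3))
            while np.std(x, ddof=1) / np.sqrt(len(x)) > q:
                x.append(rng.normal(z_true, sigma))
            return len(x)

        for sigma in [0.5, 1.0, 2.0]:
            ns = [run_until_precise(sigma) for _ in range(200)]
            print(f"sigma={sigma}: median stopping n = {np.median(ns):.0f}, "
                  f"implied sigma ~ q*sqrt(n) = {q * np.sqrt(np.median(ns)):.2f}")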

    • It seems like a framework that says stopping rules don’t matter is going to yield poor evidence and encourage QRPs.

      The primary reason for QRPs is doing something (testing a strawman hypothesis) that cannot possibly answer the researcher’s real question. I.e., there is no right way to do NHST. If it were not for QRPs, most NHST-users would be collecting completely different types of data to begin with.

      That is all orthogonal to Bayesian vs frequentist. In my experience Mayo doesn’t really appreciate this aspect of the problem and is still blaming QRPs on the “Bayesian vs frequentist” red herring, but hopefully she has started considering it in her new book.

      • Anoneuoid

        I may not understand you completely, but the ‘Bayesian vs. frequentist’ framing seems like a red herring to me. It seems the thought leaders of Deborah’s generation weigh in very heavily with their perspectives. I am more aligned with the thought leaders of the Evidence Based Medicine movement: Chalmers, Sackett, etc.

        I’m reading through the 43 TAS articles in different sequences to test out the various biases. What an experience.

          • I think Bayesian vs Frequentist confuses people about what is possible to do with statistical methods, and about what various analyses mean. The traditional frequentist-based statistics curriculum is like teaching people a language where there are no words for color, taste, smell, texture, spices, meats, vegetables, grains, etc., and then wondering why all their dishes turn out to be inedible mush. The effect of classic NHST-based frequentist statistics on altering the *kinds of research questions people pose* and the *kinds of data they collect* shouldn’t be underestimated.

          If you give people a language for asking more sophisticated causal mechanistic questions, it will help them focus their research on science I think.

      • > The primary reason for QRPs is doing something (testing a strawman hypothesis) that cannot possibly answer the researcher’s real question.

        I’m not sold on this, especially after reading Mayo’s book. To me, QRPs exist because researchers have a motivation to get to a certain answer, and some researchers will keep analyzing data until they get that answer—and do so by using QRPs. This can be accomplished in either the frequentist or Bayesian paradigm.

        I agree that a strawman hypothesis where the null states that “the test statistic equals exactly zero” can be problematic, but this isn’t a QRP itself. And testing a strawman hypothesis can be done in the Bayesian paradigm, too: one can look at the posterior distribution and see the probability of a statistic being above zero by computing the proportion of posterior samples above zero.
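
        For concreteness, the Bayesian analogue mentioned above is a one-liner on (hypothetical) posterior draws:

        import numpy as np

        rng = np.random.default_rng(0)
        theta_draws = rng.normal(0.3, 0.5, size=4000)  # stand-in for MCMC draws of the effect
        print(f"Pr(effect > 0 | data) ~ {np.mean(theta_draws > 0):.2f}")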

          I agree that a strawman hypothesis where the null states that “the test statistic equals exactly zero” can be problematic, but this isn’t a QRP itself.

          Unless this corresponds to the research hypothesis, yes it is. Which is easier as the data keeps coming in?
          1) reject a strawman null model and take it as evidence for your research hypothesis
          2) derive a real prediction from the research hypothesis and keep failing to reject it.

          From personal experience I’d guess #1 is at least 10x easier in terms of time and resources. That is why people use the strawman to begin with: it lets them pump out fake “discoveries” at superhuman speed. It is the godzilla of QRPs.

          And testing a strawman hypothesis can be done in the Bayesian paradigm, too: one can look at the posterior distribution and see the probability of a statistic being above zero by computing the proportion of posterior samples above zero.

          Yep, that is why I say Bayesian vs Frequentist is a red herring.

        • reject a strawman null model and take it as evidence for your research hypothesis

          I should clarify that of course there is no requirement that the user perform the second (illegitimate) half of this. If they do only the first half it can be perfectly fine, but the result is of no interest to anyone.

          The structure to get from one to the other looks like this:
          Reject strawman null model -> one or more QRPs -> take the result as evidence for your research hypothesis.

          It has been a big mistake, in stats philosophy and training materials, to focus on “reject strawman null model” and then let the researcher fill in the blanks regarding how to conclude something about their research hypothesis.

        • I think Mayo is great but … I wish she’d do the following. Get on pubmed.gov and count how many times just that day cancer, Alzheimer, CVD, etc. “researchers” report “We gathered some data, pushed some buttons, and Behold! See what is upon the tablets we hereby publish! The probability that whatever procedure, molecule or protocol we’ve hitched our wagons to reduces the risk of whatever we’ve made it our careers to reduce is between 20% and 90% because p is less than 0.05 and the CI covers our interval of interest. QED, or so we assume.”

          Bury enough loved ones whose doctors misinterpreted p-values and you’ll be radicalized too.

    • Mayo argues that tests can be interpreted in terms of well and poorly indicated discrepancies from the null hypothesis. Well-warranted hypotheses about discrepancies from the null are (we are told) exactly those that have passed a severe test. (We’re taking the model as given here because we’re presuming that the model assumptions have passed their own severe tests.) When she translates this severity concept into math we end up with the SEV function, which seems to refute criticisms of statistical tests as inherently dichotomous and to unify testing and estimation.

      To my knowledge Mayo has only ever demonstrated the SEV function for the normal model with known variance, and it just so happens that the SEV function in this model is numerically equal to a Bayesian posterior under a flat prior (a quick numerical check of this equivalence appears after this comment). Her examples got the concept across, but they left me (a firm Bayesian) at a loss because this *particular* result is already intuitive from a Bayesian point of view. So to me it seemed possible but not thereby demonstrated that the severity concept really works as a philosophical foundation for frequentist statistics.

      So I sought to drive a wedge between the SEV function and the Bayesian posterior so that I could figure out if SEV really made more sense than Bayes (or made sense at all). It turns out that a group-sequential design — that is, with interim looks at the data and stopping rules that could terminate data collection before the maximum sample size was observed — provided the ideal test bed. This is a bit sticky because it is not obvious how to construct the SEV function in such cases — there’s more than one way to order outcomes as more or less “extreme”. There is one ordering — stagewise ordering — that is preferred in the literature; insofar as we can take Mayo as justifying current practice from a philosophical standpoint rather than doing original research in statistics, it seems reasonable to take stagewise ordering as canonical.

      I discovered that I could create a design in which there was a non-negligible probability that the SEV function would fail to converge to a step function located at the true parameter value even as the amount of information available to estimate the parameter grew without bound. This also occurs with the other orderings I found in the literature. You can read all about it in my 7,000 word blog post on the subject entitled The SEV function just plain doesn’t work. (And you thought this comment was long.)

      Mayo insists that selection effects due to early stopping must be taken into account but doesn’t work out the consequences of early stopping for her notion that tests can be interpreted in terms of well or poorly warranted discrepancies from the null. I went and did it and I found that the consequence is that the SEV function can fail to be consistent — the irony is that it is precisely the complications brought on by the relevance of stopping rules to error probabilities that reveal the deficiencies of the argument.
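
      Here is the quick numerical check referred to above (my own sketch, with made-up numbers): in the one-sample normal model with known sigma, SEV(mu > mu1), computed as Pr(Xbar <= xbar_obs; mu = mu1), coincides with the flat-prior posterior Pr(mu > mu1 | xbar_obs).

      import numpy as np
      from scipy.stats import norm

      sigma, n = 2.0, 25            # known sd and sample size (assumed)
      xbar_obs = 1.1                # observed sample mean (assumed)
      se = sigma / np.sqrt(n)

      for mu1 in [0.0, 0.5, 1.0, 1.5]:
          sev = norm.cdf((xbar_obs - mu1) / se)              # severity of the claim "mu > mu1"
          post = 1 - norm.cdf(mu1, loc=xbar_obs, scale=se)   # flat-prior posterior Pr(mu > mu1)
          print(f"mu1={mu1}:  SEV={sev:.4f}  flat-prior posterior={post:.4f}")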

        • I hate Twitter… I see that Mayo just dismissed your ~ 10k word detailed analysis of SEV in a relatively simple early stopping trial as you “struggling with your feelings” about SEV. Did I mention I hate Twitter.

        • It’s a good one, eh? Belittling but somehow hard to respond to directly. Still, one has to have some sympathy and compassion for her — I am, after all, asserting that a fundamental flaw undermines a goodly portion of her philosophical views and academic output.

        • Like that one time where I showed that 50 years of tabletop experiments in soil mechanics were actually measuring the interaction between the soil sample and the artificial rubber membrane it was encased in… been there. it’s not an enjoyable place to be on either side of the result… no fun to be the one to prove serious flaws in what your colleagues are doing, and no fun to have your work shown to have major flaws…

          still I hate Twitter.

        • I’m reading Deborah’s book. Still got a ways to go. I’m surprised that Deborah doesn’t elaborate here on the blog.

          For most of us, this discussion is too technical. But there are dimensions on which we can ask questions and comment further. I don’t mind asking questions that may seem below your competence. After all, some of this should make sense to non-scientists/public.

          The overall question that arises from reading for two years is whether statistics and epidemiology have a future. So much controversy and dismay are discernible.

        • > Surprised that Deborah doesn’t elaborate
          To me the not-so-technical issue is whether a meta-statistics or resolution of the statistics wars can be based on (so far?) only working through a toy example of a single study assuming it is known to be Normally distributed with known variance.

          Many, like me, want to see it worked through at least for examples with other distributions with unknown nuisance parameters. With the Normal distribution the nuisance parameter is usually taken to be the variance, and the magical convenience of the Normal assumptions is that it is easy to deal with.

          I recall asking about it when Mayo’s blog first started, and many others have also asked about this; Stephen Senn once even commented that he would not be convinced without the problem of nuisance parameters being explicitly addressed.

          Corey has identified a situation where it seems severity cannot be salvaged, but as of yet there are a number of technicalities that may make it difficult for many to grasp.

        • Thank you, Keith,

          That I comprehend. Rather than posting with a seeming debunking attitude, just stating the information in a neutral tone and in plain English would be nice. Only a suggestion.

          So many debates are argumentative, which inherently lends itself to false dichotomies.

        • On reflection I’ve decided that nuisance parameters aren’t really a conceptual problem for severity. It’s important to keep in mind that Mayo isn’t trying to do object-level methodological research in statistics; she’s doing meta-level philosophical work aiming at figuring out how frequentist techniques (which are evaluated and justified in terms of long-run performance) can be viewed as providing information in a particular application — in the case at hand, whatever that may be. This being the case, all we need to do is look at how current frequentist methods handle nuisance parameters and see if they’re compatible with the severity account. Here I’m thinking of things like the bootstrap (which works around nuisance parameters) and approximate pivotal statistics via higher order asymptotics of the likelihood function (which work well even in models with nuisance parameters), and the way they work is well-aligned with the severity rationale.
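
          As a small illustration of the “works around nuisance parameters” point (made-up data, percentile bootstrap for a mean with the variance as an unknown nuisance parameter):

          import numpy as np

          rng = np.random.default_rng(2)
          x = rng.normal(1.0, 3.0, size=40)   # data whose spread is an unknown nuisance parameter

          boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                                 for _ in range(5000)])
          lo, hi = np.percentile(boot_means, [2.5, 97.5])
          print(f"percentile bootstrap 95% interval for the mean: ({lo:.2f}, {hi:.2f})")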

          “To me the not-so-technical issue is whether a meta-statistics or resolution of the statistics wars can be based on (so far?) only working through a toy example of a single study assuming it is known to be Normally distributed with known variance.”

          I supervised a student project that did computations and definitions in some standard, slightly less trivial situations, including testing means with unknown variance, chi-squared in a contingency table, proportions, and a few others. It’s not totally straightforward but certainly can be done (the actual computations were fairly easy and no problem for the student, but figuring out what to compute and what exactly that means required some supervisor help). Let’s say at least the examples we tried in that project didn’t exceed the limits of the concept in the way Corey’s does.

        • Christian:

          First, this is not, at least on my part, a personal criticism of anyone. I do recall suggesting to Mayo that she could defer the nuisance parameter issue (or, more generally, more than one “thing” in the data-generating model affecting the assessment of the quantity of interest) to others.

          > It’s not totally straightforward but certainly can be done
          Yes, and some would argue that’s the usual in most applications (e.g. if asymptotics kick in).

          But given that Stephen Stigler once wrote that Neyman (and Wald?) thought nuisance parameters in the Neyman-Scott examples would permanently sink Fisher’s likelihood boat (Stigler later went with super-efficiency), and that Jim Berger’s move to “O”-Bayes was based in (large?) part on his disapproval of how Bayesians were dealing with nuisance parameters*, I think someone should work it through, and Corey has done a lot to start that.

          In addition, Stephen Senn has also argued it needs to be worked through.

          * I wonder if the newer Bayesian workflows that identify problematic handling of nuisance parameters might temper his view.

        • Corey takes the time to do the math, which is great for someone like me. I totally understand why it would be not so great for someone without that math background.

          To help convey what Corey found, perhaps I can describe his general findings in words.

          Corey investigates a simple normal model. Rather than collecting 100 data points and then testing whether the underlying concentration of something exceeds a threshold of 150 (imagine testing for a pollutant in drinking water or something), he first collects 4 data points and will stop if the sample mean is dramatically far away from the threshold; if not, he will continue to collect the rest of the 100 samples and evaluate whether the threshold is exceeded after all 100 (a bare-bones simulation of this kind of design follows this comment).

          Now, to me, on a practical level, this makes sense. If I grab even a single sample of drinking water and test it for lead and it has 100 times the lead as I’d allow, I can pretty much stop (provided I know my testing machine is working properly).

          But the logic of P values and SEVerity testing suggests that looking at the first 4 data points affects the severity of the test. Corey constructs 4 possible ways of evaluating severity, using different notions of “greater than” and “less than” found in the literature, because such an ordering is required for the math of severity.

          What he finds is that SEV behaves bizarrely. For example, in some situations, if your first 4 data points don’t have a mean big enough to stop your sampling, then no matter how many samples you take later, you can NEVER conclude that there’s a lot of lead in your water ???

          Under another way of ordering the results, he determines that if you converge onto the measurement 170 +- 8.5 after 100 samples, you are warranted in concluding that there’s good evidence that you have 250 ppm of lead in your water… again WTH?

          Under another ordering… if you stop after 4 samples, you get to use the fact that you planned to take 100 samples (had the first 4 not been enough) to conclude that you can reduce your error bars… WTH?

          Basically there is *no way* to use SEV in this experiment to get a result that makes sense. None.

          But what’s worse: for anything you’d want to measure, there’s always someone who measured something like it in the past. Why should *you* be able to conclude something from your fixed 100 samples when Joe last week measured 4 samples already?

          And this experiment is just easy peasy stuff compared to some of the complicated models you might imagine you’d need to use to analyze say an economic development experiment where you did a pilot study, the results had some probability of convincing various countries to let you do the full study, various countries did or didn’t let you move forward, then you did a medium term study, using the results you got funding for a long term study… political pressures in some countries forced you to cancel the study early in those countries… etc etc.
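
          Here is the bare-bones simulation promised above (stand-in numbers, not Corey’s exact example): threshold 150, an interim look after 4 samples with a generous early-stopping boundary, otherwise continue to 100 samples.

          import numpy as np

          rng = np.random.default_rng(7)
          threshold, sigma = 150.0, 20.0      # assumed pollutant threshold and measurement sd
          n_interim, n_max = 4, 100
          early_boundary = threshold + 3 * sigma / np.sqrt(n_interim)  # "dramatically far away"

          def run_trial(true_level):
              x = rng.normal(true_level, sigma, size=n_interim)
              if x.mean() > early_boundary:   # interim look: stop early if clearly over
                  return "stopped early", len(x), x.mean()
              x = np.concatenate([x, rng.normal(true_level, sigma, size=n_max - n_interim)])
              return "ran to n=100", len(x), x.mean()

          for level in [140, 160, 250]:
              outcome, n, xbar = run_trial(level)
              print(f"true level {level}: {outcome}, n={n}, sample mean={xbar:.1f}")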

        • Nice – so the nuisance parameter for the number of samples in the first look creates problems.

          And then there are multiple studies, and we (Bayesians and frequentists) also have to worry about using the difference versus the ratio … https://statmodeling.stat.columbia.edu/2019/03/31/a-comment-about-p-values-from-art-owen-upon-reading-deborah-mayos-new-book/#comment-1009196

          Single studies with single unknown parameters are not severe enough.

        • There’s no nuisance parameter; the number of samples in the first look is a design parameter. What creates the problems is that the sufficient statistic is bivariate so there’s no immediately obvious total order.

        • Corey:

          Why is it I *can’t* label the design parameter as a nuisance parameter – others have.

          There being more than one (interim and final) is what makes the sufficient statistic bivariate – right?

        • Have they? For me the phrase “nuisance parameter” refers to an unknown quantity that helps pick out a probability distribution from a space of them; design parameters are in another category altogether, being under our control.

          It is the fact that the sample size is a random variable whose distribution depends on the value of the parameter that makes the sufficient statistic bivariate.

        • Mayo’s severity function implicitly assumes that results can be ordered regarding how strongly they agree or disagree with a hypothesis. Examples can be constructed in which no single such ordering looks “natural” and issues can be found with all kinds of orderings. So what?

          The basic idea of severity is that hypotheses should be put to tests that would give them a good chance to fail in case they are wrong. Nothing wrong with that as far as I’m concerned. Mayo isn’t a statistician or a mathematician, so chances are her formal ideas based on some very simple situations may not wash in cases in which one or more things are substantially different. Not a big problem in my view. Formal severity as originally proposed doesn’t do a convincing job in situation X? Redefining it so that it works better while the original idea is preserved would be a worthy research project, but of course the Bayesians prefer to spend their time celebrating how clever they are and how stupid the frequentists are.

        • I don’t think any of that last bit about Bayesians celebrating is fair. Corey did a lot of hard work steelmanning SEV to show that it had fundamental flaws in relatively simple contexts. He’s not celebrating Bayesianism, he’s trying hard to find a way to see if SEV makes sense or breaks down when you attempt to use it for a nontrivial problem.

          > Examples can be constructed in which no single such ordering looks “natural” and issues can be found with all kinds of orderings. So what?

          So SEV is not usable in such cases, and SEV gives results that are equivalent to some version of a Bayesian result in all demonstrated cases where it’s usable that I’m aware of… hence SEV seems to be a strict subset of Bayesian reasoning?

          I’m not going to defend all that she writes – not being her, it’d be easy for me to say I think she went wrong here and there (which I actually do)… but the linked Twitter comment seems to refer to an “informal notion” rather than a formal function that you used in a situation she obviously didn’t have on her radar when she proposed it.

        • Daniel: “SEV gives results that are equivalent to some version of a Bayesian result in all demonstrated cases where it’s usable that I’m aware of… hence SEV seems to be a strict subset of Bayesian reasoning?”

          As I said, Mayo is a philosopher and not a mathematician or statistician. I don’t think her main message is that her formal SEV function cannot be mimicked by some Bayesian computation (where it works); her points are about the logic of inference and how things are interpreted. Whether the Bayesians can come up with some setup and prior that gives the same numbers isn’t very relevant to that issue (Mayo has some comments on “agreement on numbers” in her book with which I agree). Sure, you may disagree with her on the logic of inference and interpretation as well, but Corey’s work unpicking SEV in that example, as adorable as it might be, doesn’t contribute much to this topic.

          Well, there is also the question of in which fields/subfields the ‘severity’ function has more or less explanatory power. Wouldn’t you say that some use something akin to the severity function in some stages of reasoning? I can’t see how it would be otherwise.

          Raymond Hubbard’s elaboration of Hypothetico-Deductivism (HD) strikes me as more applicable:

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5136553/

          I particularly resonate with this observation, independently of reading the review above. I haven’t read Hubbard’s book yet.

          ‘In turn, NHST’s dominance has given the illusion that all scientific practice fits the HD model.’

          This is so evident to me.

        • Feynman’s procedure for science:

          1) Guess the model
          2) Compute the consequences
          3) Compare to experiment
          4) If it disagrees with experiment, it’s wrong

          https://www.youtube.com/watch?v=b240PGCMwV0

          If this is the entire content of Mayo’s philosophy, then I agree with all of it.

          But I think Mayo has strong opinions on at least step (4), deciding whether a thing disagrees with experiment, and she seems to have opinions on what kinds of step (2) we should do (compute predicted frequencies).

          She seems to require that we test a model “severely”; let’s take that informally as saying that it had the opportunity to fail spectacularly, and it didn’t.

          Fine. I still think I agree at this level of informality. For example, a Bayesian model fit to data predicts that the next data point will, with 99.7% probability, fall between 0 and 1. If the next data point comes and it’s 37, the model fails spectacularly. When the next data point comes and it’s 0.93, how severely was it tested?

          Suppose the next 10 data points come, and they’re always between say .80 and .95. We can easily see that the frequency is wrong compared to the Bayesian probability, which predicts at least a few results in the range 0 to 0.8. Is this a severe test of the predictions of the Bayesian model which it has failed? or not?
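
          To put rough numbers on that (my own assumption for concreteness: take the predictive to be Normal(0.5, 1/6), which puts 99.7% of its mass in (0, 1), and read it as a frequency distribution):

          from scipy.stats import norm

          pred = norm(loc=0.5, scale=1/6)
          p_band = pred.cdf(0.95) - pred.cdf(0.80)   # chance one point lands in [0.80, 0.95]
          print(f"P(one point in [0.80, 0.95]) = {p_band:.4f}")
          print(f"P(ten points in a row in that band) = {p_band**10:.1e}")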

        • Christian, I think we agree more than disagree. For instance, you and I would agree that the severity concept has limits, that it is possible to exceed them, and that group-sequential trials lie outside of them. (Do you think Mayo would agree?)

          This being so, I find myself at a loss to understand your tone, which seems less than collegial to me. If the problem is that you think I’m spending my time celebrating how clever Bayesians are and how stupid the frequentists, I would gently point out that there’s someone in this situation spending their time and energy promoting their recently published book in which they engage in a fairly hostile and uncharitable review of a statistical paradigm they view as wrongheaded — and it ain’t me.

        • >We can easily see that the frequency is wrong compared to the Bayesian probability, which predicts at least a few results in the range 0 to 0.8.

          To be stated better, this should say “which, if you force it to be taken as a frequency distribution then predicts at least a few results in the range 0 to 0.8”

          obviously frequency and probability are potentially decoupled for Bayesian calcs.

        • Andrew, I wouldn’t disagree! Guessing the model is the hard, and fun part. IMHO we also get to decide what it means for the model to “disagree with experiment” since models have a purpose, and some disagreements have more or less utility consequences than others.

          To me the absolute requirement that models have frequency properties seems to be implied in Mayo’s opinions. I don’t buy that in the slightest. There are plenty of situations where just *getting the order of magnitude of the outcome right* is enough. My model of the price of sandwiches is like that… They should be sort of $3 to $20. It’s good enough to know that to keep me from stumbling into a local lunch place and getting taken advantage of by being charged $250 when I could have crossed the street and gotten a sandwich in that range.

          Is my model of sandwich prices shown to be “wrong” when you do a survey of Pasadena sandwich shops and find out that all of them charge between $4.50 and $11.75 ?

        • Corey: 1) You have shown that the SEV function doesn’t work for group-sequential trials, but not that the general concept of severity doesn’t work.
          2) Regarding the “non-collegial” tone, I have great personal respect for you and your elaborations of these issues, some of which come with lots of insight (and surely I have learnt from them). However, your blog post is titled “The SEV function just plain doesn’t work” without any qualification, and ends with “Annoying Bayesian Triumphalism” as if anything in your posting could grant any kind of triumph to the Bayesians (which subtribe of them?). I am indeed regularly annoyed by what I call Bayesian propaganda (of which this tone is an example), and I allow myself some polemic in response from time to time. (Reading that last section again after quite some time I have to admit though that there is some Bayesian moderation to be found in it. I wasn’t aware of that so much when I wrote my earlier posting.)

        • Daniel:
          “She seems to require that we test a model “severely”; let’s take that informally as saying that it had the opportunity to fail spectacularly, and it didn’t.”

          You also need to take the probability of failing into account; it shouldn’t just have the opportunity, it should be likely to fail (which is dependent on the alternative of interest, as in power).

          “Fine. I still think I agree at this level of informality. For example, a Bayesian model fit to data predicts that the next data point will, with 99.7% probability, fall between 0 and 1. If the next data point comes and it’s 37, the model fails spectacularly. When the next data point comes and it’s 0.93, how severely was it tested?”

          I think we need to distinguish between the severity of the test procedure (which is probably just its power) and the question of to what extent an observed result can severely rule out a hypothesis.
          For your 37 you can compute in terms of probabilities that this really should not have happened under the model. For 0.93 there’s no evidence against the model, but in your text there is no indication of to what extent this was a severe test (i.e. what chance it would have had to reject the model under some alternatives of interest). What one could compute is only whether some other hypothesis can be ruled out severely by that result.

          “Suppose the next 10 data points come, and they’re always between say .80 and .95. We can easily see that the frequency is wrong compared to the Bayesian probability, which predicts at least a few results in the range 0 to 0.8.”

          You haven’t specified your setup precisely enough that this would be clear. But OK, let’s accept that.

          “Is this a severe test of the predictions of the Bayesian model which it has failed? or not?”

          The frequentist (my kind of frequentist at least) would specify the test procedure in advance, and then, after having observed the data, you could compute the probability of observing data like this or further from what is expected under your model. Without such a specification it is not really clear how unlikely it would’ve been to find a result that points as strongly or more strongly against your model.

        • Christian: I appreciate your kind words.

          1) I mean, if a concept doesn’t work in some specific situation then it’s by definition not a general concept, right? (The point is really that Mayo claims wide — very wide — generality, but for whatever reason won’t actually demonstrate it.)

          As for propaganda, well, again I point you to the fact that somebody is promoting a whole damn book that is as much (or as little) propaganda as my blog post (“How to Get Beyond the Statistics Wars”??? I ask you) and yet seems to have escaped your reproach. I admit that I’m not being cautious in my choice of title, but succinctness and clarity are virtues in a title. I would defend the title as it stands (but if you want me to, we should carry out that conversation in the comments section of the blog post itself).

          (“Annoying Bayesian Triumphalism” is me ironically hanging a lampshade on what my readers might expect of me; you aren’t supposed to take the section title that seriously…)

          >You also need to take the probability of failing into account; it shouldn’t just have the opportunity, it should be likely to fail (which is dependent on the alternative of interest, as in power).

          Obviously, it should be likely to fail if it’s wrong… but if it’s more or less correct, there *doesn’t exist* a test in which it’s likely to fail. I assume this means it’s impossible to test a more or less true hypothesis, like “Navier Stokes equations correctly predict the flight of golf balls”?

          I don’t really get it. To me, if you run a CFD simulation of a golf ball in a wind tunnel, and make a simulated video, and then take a video of a real golf ball in a wind tunnel, and then you overlay the two, and they basically match down to eddies and vortices at the scale of golf ball dimples… this is a severe test.

          But given our knowledge of the history of this kind of experiment, there was every opportunity to fail, but zero probability ;-)

        • I guess the idea is that you are testing two hypotheses together, either A or B… and so you do something where A predicts something different from B and then you collect the data and one or the other or both of them will fail the test. The severity is something about how distinct the predictions are, that is A is likely to fail if B is true because B predicts something very different from what’s compatible with A.

          I don’t have a problem with that informally, but Bayesian model averaging handles this pretty well as it is… we plop A and B into a mixture in which the mixture probabilities are a parameter, collect data, and get posteriors over the mixtures…

          If we collect data in the region of where the two predict different things (this would be a severe test of A vs B), and if one of the models predicts consistently better, we’ll converge to one of them having all the posterior probability.

          This is how I’d compare two models… But comparing a model to “nature” requires us to test its predictions across the full range of its applicability and see if it fails to predict nature’s result to within something like the tolerance it provides through its posterior predictive distribution. It’s this situation that Feynman was talking about, just testing your preferred model against nature/experiment to see if reality is well outside its predictions or not.

          Corey: I don’t disagree with your criticism of some of Mayo’s wordings and generality claims. I’m however in the first place interested in appreciating what is valuable in her thoughts and her approach. Your exploration of the limits of what she has proposed is also valuable; very valuable indeed, particularly for people who’d be interested in elaborating severity with a proper mathematical or statistical background. They’d need to think about your setup and do something about it. Maybe I’ll contribute to this at some point, but the amount of things I’d like to do is always in a totally ridiculous relationship to the available time, so who knows. I actually realise that you followed up with another posting pinpointing the issues in a somewhat more digestible manner, which is also helpful. The key thing for me is to not have this used to dismiss the whole concept of severity, or to somehow imply that this could be used in any way to make an anti-frequentist, pro-Bayesian general case. Which is what people who only study your postings superficially could very easily think – the reader has to dig quite a bit into things to discern polemic from subtle self-irony etc.

          Mayo’s position in statistics is certainly a minority position these days, even more so on this blog. So surely there are enough people here to call her out where she goes wrong, and I can use my time for teasing the Bayesians. Surely I wouldn’t trust generality claims for a definition based on a single very simple case, and that the definition actually doesn’t work in full generality doesn’t surprise me one bit. I’m not really that interested in calling her out for one false claim here and another one there. I’m interested in the potential of these thoughts, what we can learn from them and what they can inspire. So I wouldn’t like to see justified criticism lead to dismissing the whole thing.

        • Daniel: “Obviously, it should be likely to fail if it’s wrong… but if it’s more or less correct, there *doesn’t exist* a test in which it’s likely to fail. I assume this means it’s impossible to test a more or less true hypothesis, like “Navier Stokes equations correctly predict the flight of golf balls”?”

          I don’t know about golf balls. In any case, you’re right that whether a hypothesis is correct or not can only ever be assessed severely if the hypothesis and alternative can be distinguished properly by the data. We can never say that a point hypothesis is severely tested to be true; we can only ever say that we can severely rule out alternatives that are far enough away from it (with large n such alternatives may be pretty close, but there are always possible alternatives that cannot be ruled out). It’s not necessarily only just one alternative; it could be a parametric class with parameters at least so-and-so far away from our research hypothesis. Also, Mayo states that testing hypotheses is “piecemeal,” i.e., you may use one test to rule out certain kinds of alternatives (like ones with a different location) and other tests to rule out others (like ones with a certain dependence structure).

          You can certainly do something like this in Bayes. Particularly, Bayes is rich, so you can certainly set up something that emulates the formalism; however, frequentist logic does not depend on a prior and wouldn’t frame outcomes in terms of probabilities that certain hypotheses are true. If I run a test to compare hypothesis A with alternative B, in most cases (I can be Bayesian occasionally) I wouldn’t want an approach that tells me that to the extent that I believe A is false I have to believe B is true. I don’t believe either A or B (precisely). There are infinitely many distributions that are compatible with the data, many of which cannot be distinguished based on the data I have, so it seems pointless to me to have probability distributions over them. The best I can do is to rule out B but not A, in which case I have learnt something. Not that A is true, but maybe that A is pretty good (having ruled out many Bs close to and far from it).

        • > I wouldn’t want an approach that tells me that to the extent that I believe A is false I have to believe B is true.

          I’ve often thought it’d be useful to include an N+1th model which is built by taking the data itself x, and adding some moderate quantity of computer generated random noise so x’ = x+e, and then expressing p(data | ThisModel) as something like normal_pdf(x,x’,sd) where sd has a prior distribution defining how well you think a “good” model should behave.

          Now, in a mixture A, B, Q where Q is this other model, you have the opportunity to converge to either a mixture of a good model and Q, or just Q. Q is kind of a proxy for “other models that would perform well.”

          Just some thoughts. Also I’ve moved away from Cox-Bayes for many purposes, as I just don’t think framing the question as the truth of a proposition makes sense when we know the models aren’t true. A different set of requirements leads to Bayes again, but based on the idea of measuring a degree of accordance between theoretical expectation and data.
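
          A toy sketch of the Q idea above, just to show the mechanics (everything here is made up: A and B are fixed predictive densities, Q scores the data against a jittered copy of itself with a small grid prior on sd, and all three get equal prior weight):

          import numpy as np
          from scipy.stats import norm

          rng = np.random.default_rng(3)
          x = rng.normal(0.3, 1.0, size=30)         # the observed data (simulated here)

          logp_A = norm.logpdf(x, 0.0, 1.0).sum()   # model A: Normal(0, 1)
          logp_B = norm.logpdf(x, 2.0, 1.0).sum()   # model B: Normal(2, 1)

          # Model Q: x' = x + computer-generated noise, averaged over a grid prior on sd
          # ("how well should a good model predict?").
          x_prime = x + rng.normal(0.0, 0.5, size=x.shape)
          sd_grid = np.array([0.25, 0.5, 1.0])      # grid prior on sd, equal weights (assumed)
          logp_Q = np.logaddexp.reduce([norm.logpdf(x, x_prime, sd).sum() for sd in sd_grid]) \
                   - np.log(len(sd_grid))

          logps = np.array([logp_A, logp_B, logp_Q])
          weights = np.exp(logps - np.logaddexp.reduce(logps))
          for name, w in zip("ABQ", weights):
              print(f"posterior weight of {name}: {w:.3f}")   # Q dominating = nobody met the bar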

        • Daniel: That’s an interesting idea, but it doesn’t seem fair to have the original hypotheses compete with a model that is actually fitted on the data. OK, you can juggle around with the prior probability of Q and the prior of sd to calibrate this, but that’s more prior decisions for which in most cases there won’t really be a substantial basis. Just first thoughts.

        • Christian: the point is to specify to the machinery a model which is guaranteed to do as well as some particular expectation you have for how well a “good” model should work. Now, if you fit your mixture model and Q dominates, then none of your other models reached near to the bar you set for them. If on the other hand Q and A wind up similarly weighted, then A is doing more or less as well as Q did, so A is meeting the bar you set.

          If A dominates over Q, then A is doing better than your bar. If A and B and C all compete well with Q then you should ask yourself whether the bar was set too low, etc

          Fair enough, although as a (mostly) non-Bayesian it seems to me that a bar for such a procedure could be set up in a more straightforward and more easily interpretable way without making use of prior distributions. Priors don’t seem to add much to that idea.

        • Priors don’t seem to add much to that idea.

          This is the same as saying p(H_0) ~ p(H_1) ~ … ~ p(H_n); they all drop out of Bayes’ rule in that case. Also, I made a response to you at the bottom to get out of this nesting; not sure if you would notice it.

        • + 1. As Andrew has said elsewhere, there is value in philosophy and philosophers just prompting us to think better and more clearly about what we are doing. The *concept* of severe testing, as I understand it, makes good sense and is helpful in thinking about the scientific process. But, I have yet to see anything that resembles a useful paradigm for doing real data analysis – and the counter-examples Corey provides seem pretty damning. I think one is right to be skeptical of big ideas that don’t operationalize into anything useful (except insofar as they correspond to an already established framework like Bayesian inference).

      • This is the sort of stance that might appeal to someone like Daniël Lakens, who is fully committed to Neyman-Pearson testing and has little use for effect size estimates. From what I understand, in his investigations effect sizes are experiment-bound and aren’t expected to generalize, so he is much more interested in whether any effect at all can be generated and discerned than with the magnitude of the effect.

        Can you give an example of a process that produces this type of data?

        I can’t see how this makes sense, since the effect size is always going to be an intermediate value used to get a p-value… If effect size doesn’t matter, why would the p-value?

        • This is most likely someone who is doing an experiment where the quantity of interest is actually a ratio but they don’t know enough about dimensional analysis to figure that out.

          For example, suppose two labs have different instruments for measuring brightness of fluorescence. They’re both interested in something about how fluorescence increases when you give a drug. They measure with different machines, one gets:

          Control = 1330 and Experiment is 1500 (in some arbitrary units determined by the manufacturer)

          the other gets

          Control = 26.6 and experiment is 30.0 (in some arbitrary units determined by the manufacturer).

          So they do their confidence intervals using braindead t-test based intervals and find:

          “The difference between experiment and control is 170 +- 15”

          “The difference between experiment and control is 3.4 +- 0.3”

          And they therefore conclude: “effect sizes are meaningless, the only thing that matters is that there IS a difference!”

        • In case it’s not clear, if you rescale both of these by the expected value in the control condition, you get:

          (1500-1330)/1330 = 12.8% +- 1.1% increase

          and

          (30-26.6)/26.6 = 12.8% +- 1.1% increase

          the two experiments give exactly the same results interpreted in a unitless dimensionless ratio context.

        • While this is much better than just a p-value, what if the treatment affects the reference measurement in the denominator instead of the numerator? How can you tell? And if the reference is from somewhere other than a test sample, why aren’t they using one containing a known number of fluorescent molecules?

          I think there should always be a way to calibrate the result to something like number of molecules (correct me if you know of an example where this is impossible). However, it would be more expensive so perhaps they wouldn’t do it if they can get away with it.

        • Unless we’re talking small numbers of molecules… 10 or 100, we can treat the number of molecules as continuous and the effect of one extra molecule is not “felt”, and therefore whatever the constant that multiplies number of molecules is, it cancels out when you create the ratio.

          k*N1 / k*N2 = N1/N2 regardless of the k. Therefore you don’t need to establish the k through expensive precisely calibrated reference samples etc. It’s an economically valueless step. Only if you want to be able to detect where N1 = N2+1, etc., where 1 is the smallest possible change (i.e. single-molecule differences), do you need to establish the k precisely.

          “what if the treatment affects the reference measurement in the denominator instead of the numerator?”

          You need to design your experiment to have a control in which *there is no treatment* and make this the reference denominator, or otherwise design your experiment and analysis so that the denominator is a meaningful thing.

        • Nice demo of the meta-statistical need for commonness somewhere, and then information about it, for any statistical method to have any traction. More generally, the need to treat what (e.g. parameters) is different as different, what is common as common, and what is worth considering as being drawn from a common distribution as exchangeable/random.

          But discerning it, challenging it and defending it (as in the following comments) will (should) be required.

          These ideas were written up with Andrew here http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

          But a simple excerpt is given below –

          We believe this simple contrived example from the wiki (below) nicely shows a lack of concern regarding the need to represent reality (correctly specifying common and non-common parameters) as well as one can, or at least well enough for statistical procedures to provide reasonable answers.

          A concrete but simple example that demonstrates practical controversies nicely would be the situation depicted in the wiki entry on Simpson’s paradox (Schutz, 2017). The illustration of the quantitative version: a positive trend appears for two separate groups (blue and red), whereas a negative trend (black, dashed) appears when the groups are combined. The illustration clearly depicts an underlying reality of exactly the same positive trend for two groups (both slopes equal to one) that happen to have different intercepts, one at about 5 and the other at -7. A default application of regression modelling using the 8 data points displayed in the illustration would likely specify a single intercept, slope and standard deviation parameter. The incorrect single intercept here is a mistaken acceptance of commonness which destroys the evidence for common positive slopes, providing a single negative slope estimate of roughly -.6 and, in addition, a single incorrect intercept estimate of about 9. Specifying the correct commonness here – separate intercepts but a single common slope and a single standard deviation parameter – captures (all the evidence for) the correct intercepts and slope with no actual error, with resulting correct estimates of the intercepts of 5 and -7, slope of 1 and single standard deviation of 0. With realistic data, there would be errors of observation, and specifying separate intercepts along with (incorrectly) separate slopes and again a common standard deviation parameter would waste evidence, providing two different slope estimates randomly varying about 1 and a downward-biased estimate of the standard deviation. One might further question why the assumption of a common standard deviation was being made. Simply convenience?
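
          A quick way to see this in code (hypothetical points in the spirit of the wiki figure, not the exact ones, so the pooled estimates differ somewhat from the -.6 and 9 quoted above):

          import numpy as np

          # Two groups with true slope 1 and intercepts 5 and -7 (no noise, to keep it clean).
          x1, x2 = np.array([1., 2., 3., 4.]), np.array([7., 8., 9., 10.])
          y1, y2 = x1 + 5, x2 - 7
          x, y = np.concatenate([x1, x2]), np.concatenate([y1, y2])
          g = np.array([0, 0, 0, 0, 1, 1, 1, 1])            # group indicator

          # Mistaken commonness: a single intercept and slope for the pooled data.
          b_pooled, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)
          print("pooled fit:  intercept = %.2f, slope = %.2f" % tuple(b_pooled))

          # Correct commonness: separate intercepts, one common slope.
          b_sep, *_ = np.linalg.lstsq(np.column_stack([1 - g, g, x]), y, rcond=None)
          print("grouped fit: intercepts = %.2f and %.2f, common slope = %.2f" % tuple(b_sep))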

        • Unless we’re talking small numbers of molecules… 10 or 100, we can treat the number of molecules as continuous and the effect of one extra molecule is not “felt”, and therefore whatever the constant that multiplies number of molecules is, it cancels out when you create the ratio.

          k*N1 / k*N2 = N1/N2 regardless of the k. Therefore you don’t need to establish the k through expensive precisely calibrated reference samples etc. It’s an economically valueless step.

          I think we discussed something like this before, but I prefer some absolute value for sanity checking and thinking about what is going on. E.g., “in this part of the cell there are n1 receptors R on a surface with area A exposed to n2 ligands L with affinity a1 to R, and each bound receptor activates n3 second messengers M per second. Each activated M diffuses at rate d into the cytosol where it encounters enzyme E present at concentration C, etc, etc”.

          Then you can sanity check it by (for example) getting the molecular weight of all those molecules and seeing if they will even fit on/in the region, lower bounds on how much energy the whole process must take, etc.

        • Sure, it’s fine, even great, if you can set up your dimensionless ratios to be the ratios of meaningful quantities in your problem. You STILL need to formulate a dimensionless ratio in order for it to be meaningful, or the numerical results will be determined by whether you measure binding energy in joules or kWh or Therms, etc.

          So:

          F / (rho_receptor * rho_ligand * Acell^2 * Ebinding * BindingFreq)

          where F is in dimensions of Fluorescent Energy/time and the denominator has the same dimensions.

          Sometimes individual quantities in a dimensionless ratio are well characterized, in which case you can directly work with this ratio. For example, the Reynolds number:

          Rho * v * L / mu

          Rho is fluid density, v is object velocity, L is the size of the object (say a diameter or a length) and mu is the viscosity. You can sit down and measure Rho, v, L, and mu using separate measurements, and then you can calculate the Reynolds number, and then you can find out properties of the flow from experiments done where the fluid was something else entirely and at a totally different scale, because all flows at the same Reynolds number behave essentially the same.

          On the other hand, sometimes you can’t hope to measure all the individual components of the dimensionless ratio, but you can do inference on some of them by measuring several quantities that vary in different ways according to these values, and then find the posterior distribution after taking these things into account. That’d be a great way to do model sanity checking.
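
          A quick check of that, with made-up but plausible numbers for water flowing past a 5 cm object at 2 m/s: the same flow expressed in SI units and in cgs units gives the same Reynolds number.

          def reynolds(rho, v, L, mu):
              return rho * v * L / mu

          re_si  = reynolds(rho=1000.0, v=2.0,   L=0.05, mu=1.0e-3)  # kg/m^3, m/s, m, Pa*s
          re_cgs = reynolds(rho=1.0,    v=200.0, L=5.0,  mu=1.0e-2)  # g/cm^3, cm/s, cm, poise

          print(re_si, re_cgs)   # both come out the same (about 1e5): the ratio is unit-free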

          You STILL need to formulate a dimensionless ratio in order for it to be meaningful, or the numerical results will be determined by whether you measure binding energy in joules or kWh or Therms, etc.

          I can’t say I ever think in terms of Joules or Therms, but Watts and kWh are meaningful to me. So I think you are overloading the term “meaningful”. I do not prefer these dimensionless ratios.

          For example, an increase of 10% can mean either 10 W to 11 W (an increase of 1 W) or 10,000 W to 11,000 W (an increase of 1,000 W). These differences have totally different “meaning” to me in terms of waste heat generated, cost, battery life, etc.

          Well, I’d argue that your example already makes my point at a meta level. You are comparing 10 to 11 watts with 10,000 to 11,000 watts; they have totally different meanings to you because you compare them to some known quantity of power that you are familiar with and determine whether they are more or less than that quantity of power… It just happens that you know how to measure these things in watts, so you can compare watts to watts, but it would be hard for you to compare, say, GigaTherms per Fortnight.

          But the essential thing for you is that something like 10,000 watts is a reasonable amount of power for a motor in an electric car, and is a lot of power for a common household appliance, etc… So you *are* forming your opinion based on comparison; you’re just implicitly using a denominator in particular units that you intuitively understand.

          I think you’d also understand the magnitudes pretty well if I said “about the same amount of power used by an electric car when driving flat at 25mph” vs “about 1/1000 of the power used by an electric car when driving flat at 25mph”

          Those are dimensionless ratios.

          Consider the following scientific predictive equation:

          y = exp(-3 * t)

          now y is measured in say dollars and t is measured in years.

          in this equation 3 is measured in units of per-year, and the coefficient 1, which doesn’t even appear since it’s left off, is measured in dollars

          consider instead the following equation

          y/mean_household_income_in_one_day = exp(- 3 t/time_for_earth_to_orbit_sun)

          Now if you measure time in any consistent units, and measure monetary value in any currency, the equation holds, and bare constants don’t have implicit units… this is what I meant by “makes sense”

        • Anon:

          I agree. To put it another way, I don’t see how you can be so sure the direction of an effect will persist if its size is unknown. An effect of +0.002 in one population today could be -0.001 in another population tomorrow, or +0.003 in yet another population some other day. Here I’m talking about variation in the true underlying effect, not sampling variation or estimation noise.

          Maybe this is worth its own section of the (hypothetical) article on this topic.

        • I have a comment waiting in moderation, but I can at least report an example of a process that Lakens says gives this kind of data: the Stroop effect.

        • The original result was reported in seconds:
          https://en.wikipedia.org/wiki/Stroop_effect#/media/File:Stroop-fig1-exp2.jpg

          I’m guessing you’re saying they didn’t figure out what needs to be done to get similar numerical results across labs (for whatever reason), so they just started ignoring it?

          It would be stuff like children’s reaction times being different than adults’, noisy traffic outside being different than a silent room, etc.

          It sounds like they are just throwing away information to me, I wouldn’t consider that a valid analysis.

        • Also, I immediately thought of this:

          All experiments in psychology are not of this type, however. For example, there have been many experiments running rats through all kinds of mazes, and so on—with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.

          The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and, still the rats could tell.

          He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.

          Now, from a scientific standpoint, that is an A‑Number‑1 experiment. That is the experiment that makes rat‑running experiments sensible, because it uncovers the clues that the rat is really using—not what you think it’s using. And that is the experiment that tells exactly what conditions you have to use in order to be careful and control everything in an experiment with rat‑running.

          I looked into the subsequent history of this research. The subsequent experiment, and the one after that, never referred to Mr. Young. They never used any of his criteria of putting the corridor on sand, or being very careful. They just went right on running rats in the same old way, and paid no attention to the great discoveries of Mr. Young, and his papers are not referred to, because he didn’t discover anything about the rats. In fact, he discovered all the things you have to do to discover something about rats. But not paying attention to experiments like that is a characteristic of Cargo Cult Science.

          http://calteches.library.caltech.edu/51/2/CargoCult.htm

        • If this isn’t in your collection, here is one more to add to fixing a radio, and cargo cult science (for many years, I’ve used all three in a methods class to introduce how biologists are trained in pathological science)

          http://www.autodidactproject.org/other/sn-nabi2.html

          (the history and sociology of “Isidore Nabi” is a fascinating story in biology/science).

        • Corey:

          I’d say that effect size matters in the Stroop effect. If, for example, the manipulation increases average response times in an experiment by 50%, that’s one thing. If it increases average response times by 1%, then I’m much less convinced the sign of the effect would be stable if the conditions of the experiments change.

          One big flaw with the arguments in favor of embodied cognition, power pose, etc., is that (a) the proponents of these ideas were claiming very large effect sizes, but (b) if you were to take these studies at face value, the size and direction of these effects were highly variable. So they’re claiming an effect that is large and persistent across people, but highly fickle from experiment to experiment. It’s all possible, but a much more plausible explanation is that whatever effects are there are highly context dependent, and the reported large effects are the result of noise mining.

        • Don’t tell me, tell Lakens. I know next to nothing about research methods in experimental psychology and I neither endorse nor disparage his views — I’m not competent to hold an opinion.

        • Andrew, you’ve already taken a step that I suspect is missing in psych experiments of this type: expressing the result as a percentage of some meaningful and standardized base measure.

          I personally suspect that the reason Lakens says the Stroop effect is somehow effect size free is that this step of turning the result into a dimensionless ratio was not done, and therefore the results are meaningless since they depend not only on the units in which you measure, but also on things like the choice of words used in the test etc.

        • It’s a basic symmetry principle of the universe that if something happens in the universe, it must be possible for anyone to confirm it regardless of the units they use for measurement. Only dimensionless ratios have this property, and therefore only dimensionless ratios are meaningful real things in the world.

  3. I am reading Mayo now and it’s a bit of a slog without a deep background on the historical arguments. Is there another text to help with the N-P, Fisher, Popper philosophies before diving into this text?

    • I’m not sure if this would be helpful or muddying the waters, but Measuring and Reasoning: Numerical Inference in the Sciences, by Fred L. Bookstein (Cambridge University Press, 2014), includes some discussion of Pearson, Fisher and Popper, along with a lot of other discussion about reasoning from data.

    • Gerd Gigerenzer’s “The Empire of Chance” is a good read. As I recall, he finds a greater difference between Fisher and Neyman-Pearson than Mayo will allow.

      In the parts of the book that I have explored, I have found the details of the calculations often hard to follow. It would be helpful to have, as an accompaniment to the book, an R Markdown or suchlike document that has all the relevant code.

      • On pp. 178-180 (Section 3.5), Mayo directly addresses Gigerenzer’s claim that “the neat and tidy accounts of statistical testing in social science texts are really an inconsistent hybrid of elements from N-P’s behavioristic philosophy and Fisher’s evidential approach.” I find the second rather polemical paragraph on that page obscure — perhaps she is making the simple point that it is OK to report both p < 0.05 and the calculated p-value provided one does not confuse them in the same calculation.

      • You might also look at the article in the American Statistician Special Issue that gives an account of the history:
        Lee Kennedy-Shaffer (2019) Before p < 0.05 to Beyond p < 0.05: Using History to Contextualize p-Values and Significance Testing, The American Statistician, 73:sup1, 82-90, DOI: 10.1080/00031305.2018.1537891
        https://www.tandfonline.com/doi/ref/10.1080/00031305.2018.1537891?scroll=top

        I like the comment that:
        "Presenting the debate between Fisher, Gosset, Neyman-Pearson, and the Bayesians, and how that debate has evolved into the current discussion, highlights the human aspect of statisticians and the constantly changing, challenging nature of the field. As discussed above, many of the specific points made in that debate are ongoing points of contention today."

  4. Hi, could you explain one simple thing about severity testing to me?

    As far as I understand, Mayo defines the concept of severity as not just claiming that your hypothesis explains the data, but going forward and explaining that alternative hypotheses do not. But isn’t that exactly what P(data|alternative hypothesis)*P(alternative hypothesis) from the Bayesian formula represents? Thus doing Bayesian analysis in

    • One distinction between severity and Bayes is that you can apply the severity rationale in the context of whatever test you happen to be conducting, whereas in Bayes, you say what your likelihood is and that fully determines how the data enter into the inference. So for example, let’s say you want to test the equality of two probabilities using empirical frequencies from some experiment. In Bayes, the likelihood is proportional to the product of two binomial distributions, the end. A frequentist, on the other hand, has the freedom to choose a test statistic; severity does not prescribe some particular choice but rather takes whatever test statistic you give it. So you can look at the ratio of the empirical frequencies to test the null hypothesis that the ratio of the probabilities is one, or you can look at the difference of the empirical frequencies to test the null hypothesis that the difference of the probabilities is zero. These two ways of treating the problem will give rise to two different sampling distributions for the test statistics and two different SEV functions. That’s fine according to the severity rationale, which holds that there’s no particular reason why two different tests of mathematically equivalent null hypotheses applied to the same data should give the same result.
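
      A rough R sketch of that point (not Corey’s own computation; the counts and the null probability are invented): the same data and the same null hypothesis, but two test statistics with different sampling distributions and, in general, slightly different tail areas.

      set.seed(1)
      n1 <- n2 <- 50
      x1 <- 25; x2 <- 15                      # invented observed counts
      p_pool <- (x1 + x2) / (n1 + n2)         # plug-in null: simulate at the pooled estimate of p1 = p2

      sim <- replicate(1e5, {
        s1 <- rbinom(1, n1, p_pool) / n1      # simulated empirical frequencies under the null
        s2 <- rbinom(1, n2, p_pool) / n2
        c(diff = s1 - s2, ratio = s1 / s2)
      })

      obs_diff  <- x1 / n1 - x2 / n2
      obs_ratio <- (x1 / n1) / (x2 / n2)

      mean(sim["diff", ]  >= obs_diff)                  # tail area based on the difference
      mean(sim["ratio", ] >= obs_ratio, na.rm = TRUE)   # tail area based on the ratio (na.rm guards the rare 0/0 case)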

    • “As far as I understand, Mayo defines the concept of severity as not just claiming that your hypothesis explains the data, but going forward and explaining that alternative hypotheses do not. But isn’t that exactly what P(data|alternative hypothesis)*P(alternative hypothesis) from the Bayesian formula represents? Thus doing Bayesian analysis in..”

      Mayo’s approach still preserves the sample distribution/space and counterfactual ideas, so IMO it is more relevant, and doesn’t wander into probabilism, more subjectivity, and comparative likelihood approaches.

      Justin

      • Corey, Justin, thanks for the reply!

        I’m not really familiar with frequentist language, so when I read frequentist literature I have to translate it into Bayesian terms to understand it. And when I was reading Mayo, I thought: hm, she’s just talking about P(data|alternative hypothesis); it should be small for our test to be severe. But this is already included in the Bayesian formula! So it is very hard for me to get the motivation behind this book!

        Now (please tell me if I’m wrong) I see that Mayo is proposing a frequentist (and more generalized) analogue of P(data|alternative hypothesis)

        • When the claims of two frequency models are different and one of them is correct, you can detect this easily; when the claims are more or less similar, it’s harder. You are on the right track thinking about the Bayesian expression. With the frequentist analysis, though, there is the choice of test statistic, so while a Bayesian model gives probability over data, the frequentist model looks at the frequency of some function of an entire sample (like the sample average, or the average of the samples that exceed 1, or the median of the sqrt of the samples, or whatever).

    • mikhail: In frequentism there’s no such thing as probabilities for hypotheses being true. No priors, no posteriors. Probability distributions model outcomes assuming certain hypotheses/parameters. All probability statements are about statistics of the outcomes given hypotheses/parameters. There’s no “P(alternative hypothesis)”. You do have P(data|alternative hypothesis); however, a frequentist would rather write P_{alternative hypothesis}(data), because in frequentism technically this is not a conditional probability, as whether the alternative is true or not is not treated as a random experiment.

      • By the way much of this (actually all I wrote in the posting above) is not just “Mayo’s approach”, but how frequentists have done things for ages.

        • “. . . in frequentism technically this is not a conditional probability.” That seems to me dangerous. If one knows that the alternative is unlikely to be true, on grounds that a frequentist finds convincing, then surely this must influence the conclusions reached. I’d take the difference to be that the frequentist will consider how it might affect the relevant sampling distribution, while the Bayesian would be examining the posterior. Mayo acknowledges, in her “Fine for Baggage” comment (on pp.185-186,361) that there are cases (low prior risk means that when your baggage is taken aside by the inspector, it will usually not indicate that you are carrying anything dangerous) where there is hard evidence on prior probabilities that should be taken into account. But, for some reason, Mayo seems to think that it should not colour one’s judgement that in journal X, replication studies have found only 20% of the results to be replicable. Of course, one wants to ask whether paper Y stands out from the crowd.

        • John: Frequentists don’t deny Bayes’ rule. If the experiment is indeed two step with a prior that can be given a frequentist interpretation, they can do Bayesian computations. Not sure what Mayo says but in that case where one could even have data on the first step “prior” surely I’d be happy to use Bayes – even in frequentist mode (which isn’t my only mode). That’s very different from the prior probability of whether a scientific hypothesis is (approximately) true or not.

        • It is your last sentence that I have problems with. Mayo would, as I understand her, agree with you. I do not agree that this is so “very different.”

          In the ASA 2016 Guide, the second principle is:
          “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”
          Mayo (pp.215-216) objects to the second clause (“[do not measure] the probability that the data were produced by random chance alone”). Many or most researchers do, I suspect, want to equate the two clauses; however, it may be argued that the only reasonable way to construe it makes it conditional on the null, and therefore different from the first clause. For just that reason, I think it important to talk about Pr(x|H0). If that means that I am not a bona fide frequentist, so be it.

        • John: I think the second clause in the ASA statement is somewhat ambiguous. The p-value gives the probability of producing the data (or something worse) given the H0, and if the H0 is some model for “random chance”, then that’s what the p-value is. However, the p-value does not give the probability that, given the data, what produced them was the H0 (random chance, if that is it).

          I’d think that the ASA guide means the second thing, and I agree that the p-value doesn’t measure that. It does measure the first thing (I mean the first thing in my first paragraph). If you want to know what Mayo thinks on this one you have to ask her yourself, but I’d be surprised if she’d disagree with the ASA or me here (although she may have interpreted the ambiguous ASA statement differently). Or actually with you – the only difference here seemingly being between denoting the thing P(x|H0) or P_{H0}(x), which in practice doesn’t need to make a difference as long as no prior probability for H0 is involved.

  5. Christian Hennig wrote:

    There are infinitely many distributions that are compatible with the data, many of which cannot be distinguished based on the data I have, so it seems pointless to me to have probability distributions over them.

    This is my favorite way of writing Bayes’ rule:

    p(H_0|D) = p(H_0)*p(D|H_0)/[ p(H_0)*p(D|H_0) + p(H_1)*p(D|H_1) + p(H_2)*p(D|H_2) + … + p(H_n)*p(D|H_n) ]

    You can see that to assess one theory/model/hypothesis you do not need to consider every other one since you can drop any small terms in the denominator and get a good approximation. In practice you just need to include the top 5 or so that people come up with.
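
    A tiny R illustration of that approximation (with completely made-up priors and likelihoods):

    prior <- c(H0 = 0.3, H1 = 0.3, H2 = 0.2, H3 = 0.1, H4 = 0.1)
    lik   <- c(H0 = 0.20, H1 = 0.05, H2 = 1e-6, H3 = 1e-7, H4 = 1e-7)   # p(D|H_i), invented

    exact  <- prior["H0"] * lik["H0"] / sum(prior * lik)                # full denominator
    approx <- prior["H0"] * lik["H0"] /
              sum(prior[c("H0", "H1")] * lik[c("H0", "H1")])            # small terms dropped
    c(exact = unname(exact), approx = unname(approx))                   # both roughly 0.80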

    Another thing to note is that the likelihoods p(D|H_i) can be small for two reasons:
    1) The data is inconsistent with the hypothesis. Eg, imagine a normal distribution with the observation in one of the low probability density tail regions.
    2) The hypothesis is very vague, so it would be consistent with almost anything you could have observed (Aliens did it). Eg, imagine a uniform distribution where the density is low everywhere.

    Perhaps an image is better. If x = 5 it is consistent with both the precise and vague hypotheses, but the upper is more likely since it made a riskier guess:
    https://i.ibb.co/2YVbNGy/Hprecision.png

    So afaict, everything that good scientific/detective work involves is perfectly and beautifully explained by Bayes’ rule.

    However, there is one problem with using the approximation. You need to keep in mind there could be an unknown explanation you dropped from the denominator that would totally change your assessment of H_0.

    I don’t see any way around that since as you say we obviously can’t assess infinitely many possibilities. We need to drop terms from the denominator in practice. Is this a problem that severity testing claims to solve?

    • Thinking about this a bit more I notice that if:

      1) p(H_0) ~ p(H_1) ~ … ~ p(H_n)
      2) p(D|H_0) + p(D|H_1) >> p(D|H_2) + p(D|H_3) + … + p(D|H_n)

      Then:
      p(H_0|D) ~ p(D|H_0)/[ p(D|H_0) + p(D|H_1) ]

      So the posterior depends on the data only through the likelihood ratio; in that sense, likelihood ratios are apparently a special case of Bayes’ rule. Specifically, this happens if all explanations are about equally likely a priori and the data is much more consistent with just two of them than with all the others combined.

      Alternatively,

      1) p(H_0) ~ p(H_1)
      2) p(H_0) >> p(H_2) + p(H_3) + … + p(H_n)

      Then:
      p(H_0|D) ~ p(D|H_0)/[ p(D|H_0) + p(D|H_1) ]

      Ie, if there are only two plausible explanations, both of them equally probable a priori, you get the same result. Other combinations of small prior probabilities and/or likelihoods for hypotheses 2:n will also do it.

      Sorry if this isn’t new, but I found it interesting.

        • However, if

          3) p(D|H_1) >> p(D|H_0)

          Then you get a likelihood ratio…

          Also, this line of thought got me looking at the likelihood principle on Wikipedia:

          Some classical significance tests are not based on the likelihood. A commonly cited example is the optional stopping problem. Suppose I tell you that I tossed a coin 12 times and in the process observed 3 heads. You might make some inference about the probability of heads and whether the coin was fair. Suppose now I tell that I tossed the coin until I observed 3 heads, and I tossed it 12 times. Will you now make some different inference?

          The likelihood function is the same in both cases: it is proportional to

          p^3(1 − p)^9.

          According to the likelihood principle, the inference should be the same in either case.

          https://en.wikipedia.org/wiki/Likelihood_principle#Experimental_design_arguments_on_the_likelihood_principle

          This criticism is total nonsense. The equation p^3(1 − p)^9 is derived by assuming the flips are iid. If you flip again depending on the previous results the iid assumption is invalid and you need to derive a different likelihood function.

          Is this a real critique found in the stats literature?

        • The article is accurate. The experimental design in which you flip until a certain number of heads is observed induces a negative binomial distribution for the total number of flips, and the resulting likelihood is indeed proportional to the binomial one that results from the fixed-number-of-flips design.
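
          A quick check of that in R (not part of the original comment): the binomial likelihood for 3 heads in 12 flips and the negative binomial likelihood for 9 tails before the 3rd head differ only by a constant factor, so they are proportional as functions of p.

          p <- seq(0.05, 0.95, by = 0.05)
          lik_binom <- dbinom(3, size = 12, prob = p)     # fixed design: 12 flips, 3 heads
          lik_nbin  <- dnbinom(9, size = 3, prob = p)     # stop at the 3rd head: 9 failures first
          range(lik_binom / lik_nbin)                     # constant ratio, choose(12,3)/choose(11,9) = 4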

        • is indeed proportional to

          What am I missing? Here is my simulation:
          https://pastebin.com/Hq93bcX6

          Results:
          https://i.ibb.co/RYjynXb/coinflips.png

          Mean proportion of heads when flipping 12 times
          > mean(sapply(res1, sum))/12
          [1] 0.5003767

          Mean number of flips to get 3 heads with prob = 0.5
          > mean(sapply(res2, length))
          [1] 6.00652

          Mean number of flips to get 3 heads with prob = 0.25
          > mean(sapply(res3, length))
          [1] 11.98964

          I would infer p ~ 0.25 in the second case

        • “I would infer p ~ 0.25 in the second case”

          I don’t think you would — you’re having a brain fart. Look at those numbers again: 3 heads, 6 flips.

        • For p from 0.1 to 0.9 by 0.1 flip coins until you get 3 heads, 1000 times, what fraction of the time did it take 12 flips? Graph this fraction as a function of p.

          https://i.ibb.co/cNkQk1T/flips.png
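
          (Roughly the simulation being described, in R; the details are guessed from the description above.)

          frac_12 <- sapply(seq(0.1, 0.9, by = 0.1), function(p) {
            tails <- rnbinom(1000, size = 3, prob = p)    # tails seen before the 3rd head
            mean(tails + 3 == 12)                         # fraction of runs taking exactly 12 flips
          })
          # frac_12 should peak near p = 0.25, matching the linked plot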

          Yes, the data indicates p~0.25.

          I think what I missed is that the data also indicates p~0.25 in the flip-12-times case, since it is simply 3/12 (though of course p = 0.5 isn’t ruled out by any usual definition). The only reason we would tend to support p~0.5 is prior experience with coins, but there isn’t supposed to be any prior information used here.

          Anyway, if it turns out that the same likelihood gets derived for both of these situations that is fine*, though it’s a strange thing to gloss over in my book. However, what does that have to do with the likelihood principle?

          According to the likelihood principle, the inference should be the same in either case.

          What is the problem with this? I agree, our inference about p should be the same in this case.

          *I do still have a problem with calling this series of flips i.i.d, but the point was you needed to derive some other model for the different situation.

        • >Point was you needed to derive some other model for the different situation.

          Right, in this case it’s almost just a weird accident of the math that the likelihoods are proportional to each other.

        • “I would infer p ~ 0.25 in the second case”

          I don’t think you would — you’re having a brain fart. Look at those numbers again: 3 heads, 6 flips.

          Can you expand on this, because I stepped away and when I came back still thought the same thing.

        • just to be absolutely sure we’re talking about the same thing:

          Mean number of flips to get 3 heads with prob = 0.5
          > mean(sapply(res2, length))
          [1] 6.00652
          [emphasis added]

          3 in 6 is 0.5.

        • just to be absolutely sure we’re talking about the same thing:

          Mean number of flips to get 3 heads with prob = 0.5
          > mean(sapply(res2, length))
          [1] 6.00652
          [emphasis added]

          3 in 6 is 0.5.

          I meant to show that if p~0.5, we would expect ~6 flips to get 3 heads. However, we “observed” 12 flips which is more consistent with p~0.25.

        • We observed on average 6.00652 flips, not 12…?

          I’m baffled that we’re unable to get on the same wavelength here. When you wrote “I would infer p ~ 0.25 in the second case”, by “second case” you did mean the mean of the vector you named “res2”, right?

        • In the original example, you flip until you get 3 heads, the example says we see 12 flips. What should Anoneuoid infer about p from that *one* sample by considering simulations?

          In this case (3 in 12), the max likelihood estimate for p is 0.25, and that’s the case whether we use the likelihood for a binomial with a fixed number of flips or for a negative binomial where we saw 12 flips, because in fact they have proportional likelihoods, as shown by his plot

        • Sorry, I can see how that is confusing. By “second case” I referred to flipping until 3 heads. Both res2 and res3 corresponded to the “second case”.

    • Anoneuoid: This whole business of assigning priors to hypotheses looks dodgy to me. Let’s say you are interested in whether N(0,1) is a good model for your data. The data will be rational numbers, let’s say rounded to one digit after the decimal point. “Any” distribution that has all its probability mass on numbers that have at most one digit after the decimal point will beat the Gaussian likelihood by a factor of infinity. So you’d better not compare these with a Gaussian at all. This cannot be expressed in terms of priors, because the reason why you rule these out has nothing to do with their prior probability, be it based on belief or information, but rather because involving them would create an unfair competition, at least if we accept that we’d be happy with a Gaussian distribution even though data are in fact discrete. After ruling this out we may ask, why not a t-distribution with 1000 degrees of freedom and standardised to variance 1. You’d need a very very big dataset to rule that one out against a Gaussian. There’s not normally any subject-matter information that will tell you that a Gaussian is more realistic for the data than a t1000. So P(D|t1000) will be pretty much the same as P(D|N(0,1)). No particular reason to have different prior probabilities for those two either. No basis whatsoever therefore for arguing that the t1000 can be ignored as a comparatively small term in the denominator; in fact the term for that one in the denominator will be pretty much the same as for the model you’re actually interested in.

      But the Bayesian in practice will have N(0,1) in their model but not the t1000. And rightly so, because if you have N(0,1), the t1000 is not of interest, because it basically models the same situation. Substantially, there is no difference between choosing the t1000 or the N(0,1) to model these data. The reason to rule out the t1000, however, has nothing to do with Bayesian logic and isn’t captured by it.
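
      (A quick numerical check of how close those two candidates are, in R; the rescaling just standardises the t to variance 1, as described above.)

      x <- seq(-4, 4, by = 0.01)
      s <- sqrt(998 / 1000)                            # scale factor so the t_1000 has variance 1
      max(abs(dnorm(x) - dt(x / s, df = 1000) / s))    # tiny everywhere; telling these apart needs an enormous n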

      Things get worse. You may say (as most Bayesians do, once more with good reason) that you wouldn’t want to claim that any model holds precisely, you’re just fine with a model that gives a reasonably good fit and makes sense substantially. However, this endangers the whole foundation of Bayesian logic, because if you for example have a Gaussian model with continuous priors over its parameters, this includes N(10^{-10},1) as well as N(0,1). If N(0,1) is a reasonable model for the data, N(10^{-10},1) is a reasonable model for the same data as well (assuming that your n is substantially smaller than 10^20 or so). But the whole probability calculus is based on assuming that probability is distributed among alternatives of which only one is true. If you have N(0,1) in your model and also N(10^{-10},1) and you’d be prepared to accept both of them as reasonable for the same data, this means that prior probabilities shouldn’t integrate out to 1, and probability theory breaks. The only way out of this is to treat models as if one and only one of the distributions in your model is true, despite the fact that this runs counter to what you believe. I’m not saying you shouldn’t do that (I like Bayes occasionally), I’m just saying that Bayes is far from accounting for all inferential reasoning we need. (I owe the argument in this paragraph to Laurie Davies.)

      Can frequentism solve all these problems? Well, a) no. There are all kinds of issues with it, many of which correspond to some issues with Bayes. Statistics is generally weird. b) Some of these problems are generated by the desire to work with a prior over hypotheses, which the frequentist wouldn’t want in the first place. So some problems are generated by being Bayesian and the frequentist doesn’t need to be bothered (she has admittedly a few other issues in exchange). c) In my view frequentism is just, in a certain respect, more modest, and correctly so. Setting up a frequentist model, there is no pretence that anything that isn’t part of that model is ruled out, unless it can be ruled out explicitly, e.g., by misspecification testing. The Bayesian needs to do that, being based on probabilities that integrate out to one, no more and no less. You may claim that anything else is small (which it isn’t), or you may say that the prior is relative to the initial model and has no implications outside of it, which is fine, but means that the model itself needs to be justified by non-Bayesian means.

      So let’s say the frequentist is more modest than the Bayesian who claims that Bayes is fine for *all* inferential reasoning. And sadly not even all frequentists are that modest, as you find many who explicitly or implicitly claim that their model is true, which in my view is nonsense.

      • In my view continuous models are approximations to discrete models. Also most models near to a given model can be ruled out of the running by utility arguments. The t1000 Terminator model is ruled out by virtue of the fact that distinguishing it from the N(0,1) model offers zero utility improvement.

        These are maybe meta considerations, but utility is an inherent part of Bayesian theory even if it’s outside the probability portion.

        The modeling situation is not invariant to the order of application of the utility arguments. If you choose 4-5 models on a utility basis and put them into your Bayesian calc, you get a different result than if you keep an infinity of models in, do the Bayesian calcs, and then choose the model based on utility… But if you include computational and model cost in your utility, the first result has infinitely more utility, since the second result is incomputable before the end of your life.

        Bounded rationality is more rational in the real world than theoretically pure rationality.

        • Obviously I’m not arguing against “bounded rationality” – I just say that pure Bayes alone won’t do the job. Pure Bayes has issues, and in order to solve them you need something else. Just to say that these are utility considerations and Bayes has utility, too, isn’t enough to integrate these considerations into a Bayesian framework. As long as utilities are considered outside a probability framework it’s hard to still call them Bayesian.

        • >I just say that pure Bayes alone won’t do the job.

          I agree, deciding on the models you will put in your Bayes calculations is always done outside the Bayesian formalism. I still think it’s useful to think about analogies between the way this is done and the formalism. For example if I want to choose between 7 formal models in Bayes, I write them down and do the Bayesian calculation, and then I write utilities over the way the models behave (the errors they induce and the computing costs they entail), and I choose the one with the best expected utility.

          In the pre-Bayes world, I think informally, using bounded rationality, about what makes a good model, and I choose between the models I’ve been willing to explore to decide which ones to put into the formalism… It’s not formal, but it follows a similar type of pattern.

      • Let’s say you are interested in whether N(0,1) is a good model for your data. The data will be rational numbers, let’s say rounded to one digit after the decimal point. “Any” distribution that has all its probability mass on numbers that have at most one digit after the decimal point will beat the Gaussian likelihood by a factor of infinity.

        I do not see any problem here. Part of the data generating process involves rounding, so the model that assumes rounding occurred should fit the data much better.

        After ruling this out we may ask, why not a t-distribution with 1000 degrees of freedom and standardised to variance 1.
        You’d need a very very big dataset to rule that one out against a Gaussian. There’s not normally any subject-matter information that will tell you that a Gaussian is more realistic for the data than a t1000. So P(D|t1000) will be pretty much the same as P(D|N(0,1)).

        The t-distribution and normal distribution are derived from different assumptions. How “realistic” either explanation would be is determined by your prior belief that those assumptions accurately describe the data generating process.

        No particular reason to have different prior probabilities for those two either. No basis whatsoever therefore for arguing that the t1000 can be ignored as a comparatively small term in the denominator; in fact the term for that one in the denominator will be pretty much the same as for the model you’re actually interested in.

        I would say if we think the data was generated by sampling from a normal distribution, then the t-distribution with appropriate degrees of freedom should be used. If the data consists of the entire (believed to be normally distributed) population, use the normal. By “used” I mean have a non-negligible prior.

        For practical purposes, though, one could often be used to approximate the other since they are very similar. Likewise, the normal distribution can often be used as an approximation for many other distributions.

        I think the difference in thought is you want to throw arbitrary distributions at the data, while I am only interested in distributions derived from plausible explanations for how the data was generated (and approximations thereof).

        Usually the model of uncertainty is not even central to the question of interest, and is only hoped to be approximate. Eg, the other day I sent Andrew an email regarding use of confidence intervals to test galactic rotation curves across many papers against the predictions of MOND (still waiting for feedback!), which is a deterministic “theory” (they call it an “effective theory”).

        The theory predicts an exact value. However, there is uncertainty in the observations (the measurements of the distance, brightness, etc.) required to calculate this value. Here we do not really care about whether the observations were sampled from a perfect normal distribution (or whatever it is they assumed). It just has to be a close enough approximation to whatever the true distribution was. Unless there are substantial differences in the distribution, MOND + normal uncertainty, MOND + t-distributed uncertainty, etc. are all the same to us.

        [I had some other stuff here but I think it started distracting from what is written above (covering too much ground) so I’ll leave it at that for now.]

        • Anoneuoid: I think Christian’s point is that the curve of the t distribution converges to the normal distribution, so that by the time you are at t100 there is almost no computable difference and by the time you are at 1000 it’s truly ridiculous; in essence he is forcing you to include the normal distribution twice in the sum that you use in your asymptotic argument.

        • Yea, I got that. That is why they can be used as approximations of each other. If we know our data is a sample from a population, we know normal is incorrect a priori because one of the assumptions is false. My point is that we do have reason to choose one over the other despite making practically the same prediction.

          We a priori know the assumptions of the t distribution probably do not hold exactly either… So what people do is set priors for all those to zero, and put back a normal/etc distribution as an approximation of whatever the true distribution may be.

        • Other comment still didn’t show up…

          The gist of it was (will be?) that I realize what he is saying, but that isn’t how Bayes’ rule is used in practice.

          What we do is set the prior probability for normal, t, etc distributions all to zero since it is unlikely all the assumptions are met perfectly, and then plug back in one of them (eg, normal distribution) as an approximation of whatever the true distribution is.

          So there is no multiplicity of very similar terms in the denominator. It is true that if we had perfect information and did not need to make this approximation then there would be, but in that case it wouldn’t be an issue since the data could discriminate between the tiny differences.

          I think what is going on is Christian is mixing aspects of the theoretical ideal use of Bayes Theorem with how it must (necessarily) be applied in practice. This mixture of precision and approximation makes no sense, but that isn’t what people are doing.

  6. Anoneuoid: “I think what is going on is Christian is mixing aspects of the theoretical ideal use of Bayes Theorem with how it must (necessarily) be applied in practice. This mixture of precision and approximation makes no sense, but that isn’t what people are doing.”
    I do realise that this isn’t what most people are doing. My point is that what people are doing is quite objectionable if you think about what Bayes ideally “should” be (you haven’t responded to my point above that if you accept “reasonable approximations” and you have more than one of them in your model, probability theory will strictly speaking break down), so the ideal is tweaked in all kinds of (not so Bayesian) manners to get something that works more or less in practice.

    My personal rule is, by the way, that if it isn’t clear to me what I win from using a prior, I won’t have one (and won’t do Bayes). All that talk about “belief” is pretty much meaningless to me. I don’t believe in any probability model to hold. I am not even sure if I believe in “approximations” because people usually don’t define what they mean by a model being a “reasonable approximation” of the truth (wouldn’t that imply that there is in fact a true probability model to which the used one is in some sense close?). Models are not there to be believed in my view.

    Another rule is that the posterior inherits meaning from the prior, i.e., if the prior doesn’t have a proper meaning, neither does the posterior. Regarding Bayes in practice, the vast majority of Bayesian papers that I got as a reviewer didn’t bother to discuss the meaning of the prior at all. They just took one that made their MCMC run nicely, and presented the resulting posterior as if that would give us true probabilities for all kinds of things. Better be frequentist then.

    • Christian said,
      “All that talk about “belief” is pretty much meaningless to me. I don’t believe in any probability model to hold. I am not even sure if I believe in “approximations” because people usually don’t define what they mean by a model being a “reasonable approximation” of the truth (wouldn’t that imply that there is in fact a true probability model to which the used one is in some sense close?). Models are not there to be believed in my view.”

      I see a prior as allowing the analyst to incorporate prior information into the analysis, by choosing a prior that models prior information as well as possible.

      • This is fair enough where it works. My experience is that in most applications I have worked on, such information, if existing, was very hard to translate into a prior. Much prior information doesn’t play out in terms of prior probabilities, but rather is about aims and use of the analysis, importance assessments of variables and how they might suitably be transformed, etc.
        Also my experience as a reviewer of Bayesian work (as written already) is that many Bayesians in many situations don’t really incorporate such information into the prior but rather use a default prior or something that makes computations work nicely, because they’re Bayesians, so they need some kind of prior. I occasionally see convincing justifications of priors that make clear to me how analyses are improved by incorporating this information (Andrew pays a lot of attention to how to encode such information and what is implied by using priors in certain ways, which I like), and I’m fine with that. It’s a minority thing though.

    • you haven’t responded to my point above that if you accept “reasonable approximations” and you have more than one of them in your model, probability theory will strictly speaking break down

      I thought I did. I don’t see how probability theory “breaks down” using the scheme I described where we don’t know the true model so we plug in an approximation (eg, normal distribution) as a proxy for whatever it is.

      I don’t find the practice I described at all objectionable, but I mean this as descriptive more so than prescriptive. I’m just describing the thought process people follow when applying Bayes’ rule to real world situations.

    • I took the time to write out a full example of how I believe people are reasoning according to Bayes’ rule. Where does probability theory break down?

      The testable model (eg, H_00) consists of a theoretical prediction (denoted by the first digit 0:n) plus an approximation of the measurement uncertainty using a statistical distribution (denoted by the second digit 0:m). Eg, for three competing explanations for a galactic rotation curve:

      MOND model

      H_00 = MOND + unknowable “true” distribution of measurement error
      H_01 = MOND + normal measurement error
      H_02 = MOND + t-distributed measurement error
      […]
      H_0m = MOND + etc distributed measurement error

      GR model

      H_10 = GR + unknowable “true” distribution of measurement error
      H_11 = GR + normal measurement error
      H_12 = GR + t-distributed measurement error
      […]
      H_1m = GR + etc distributed measurement error

      Vague model
      -eg, Alien technology is moving the stars

      H_20 = Aliens + unknowable “true” distribution of measurement error
      H_21 = Aliens + normal measurement error
      H_22 = Aliens + t-distributed measurement error
      […]
      H_2m = Aliens + etc distributed measurement error

      We are interested in learning how well the data supports H_00. In terms of Bayes’ rule we want to determine:

      eq 1
      p(H_00|D) = p(H_00)*p(D|H_00)/[ p(H_00)*p(D|H_00) + p(H_01)*p(D|H_01) + … + p(H_10)*p(D|H_10) + … + p(H_nm)*p(D|H_nm) ]

      General a priori knowledge:
      1) The exact distribution the measurement error comes from is unknown, so we cannot write down H_00.
      2) There are infinitely many possible distributions that could be consistent with the data, but the assumptions behind these are unlikely to hold exactly.

      So a priori we can drop all of these possibilities from the denominator

      p(H_01) ~ 0
      p(H_02) ~ 0
      […]
      p(H_2m) ~ 0

      We are left with three unique terms, but cannot calculate them since the true error distribution is unknown:

      eq 2
      p(H_00|D) ~ p(H_00)*p(D|H_00)/[ p(H_00)*p(D|H_00) + p(H_10)*p(D|H_10) + p(H_20)*p(D|H_20)]

      So we approximate the true distribution with a normal distribution. Equation 2 becomes:

      eq 3
      p(H_00|D) ~ p(H_00)*p(D|H_01)/[ p(H_00)*p(D|H_01) + p(H_10)*p(D|H_11) + p(H_20)*p(D|H_22) ]

      The alien hypothesis (H_22) is very vague and would be consistent with pretty much every possible observation. Therefore p(D|H_22) is necessarily small, ie max(p(D|H_01)) >> max(p(D|H_22)). So, unless the data is really inconsistent with H_01 and H_02, we can also drop H_22:

      eq 4
      p(H_00|D) ~ p(H_00)*p(D|H_01)/[ p(H_00)*p(D|H_01) + p(H_10)*p(D|H_11) ]

      If we do not have an a priori preference for H_00 vs H_10, then p(H_00) ~ p(H_10). In that case all the priors drop out:

      eq 5
      p(H_00|D) ~ p(D|H_01)/[ p(D|H_01) + p(D|H_11) ]

      Also possibly of interest… If H_11 is a much better fit, ie p(D|H_01) << p(D|H_11), then p(H_00|D) is approximately the likelihood ratio:

      eq 6
      p(H_00|D) ~ p(D|H_01)/p(D|H_11)
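
      A quick numerical sanity check of eq 5 against eq 6 in R, with invented likelihood values (nothing here comes from real rotation-curve data):

      L01 <- 0.0008   # p(D|H_01), made up: here GR fits the data much better
      L11 <- 0.04     # p(D|H_11), made up

      c(eq5 = L01 / (L01 + L11), eq6 = L01 / L11)   # about 0.0196 vs 0.0200: the ratio approximation is close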

      • A problem is that when we determined p(H_00|D) as done above we didn’t consider other possible theories.

        Eg, a relatively new one is quantised inertia. Once someone comes up with the new explanation we need to go back to equation 3 and add another term:

        eq 3b
        p(H_00|D) ~ p(H_00)*p(D|H_01)/[ p(H_00)*p(D|H_01) + p(H_10)*p(D|H_11) + p(H_20)*p(D|H_22) + p(H_30)*p(D|H_32) ]

        Then we can further simplify if the priors are similar, etc. But obviously if the new (previously unknown) theory has non-negligible plausibility and is consistent with the data this can totally change how we think about H_00.

        So, Bayes’ rule explains a few things I have seen people struggle with:

        1) Why we can safely ignore vague “god/aliens of the gaps” type theories unless no other explanation works.
        2) Why accurate “surprising” predictions lend more support to a theory than predictions that are consistent with many other explanations.
        3) Why novel theories can reduce our confidence in old theories even though the data has not changed.

        • Not a big deal, but some typos I just noticed. Equation 3 and 3b should be:

          eq 3
          p(H_00|D) ~ p(H_00)*p(D|H_01)/[ p(H_00)*p(D|H_01) + p(H_10)*p(D|H_11) + p(H_20)*p(D|H_21) ]

          eq 3b
          p(H_00|D) ~ p(H_00)*p(D|H_01)/[ p(H_00)*p(D|H_01) + p(H_10)*p(D|H_11) + p(H_20)*p(D|H_21) + p(H_30)*p(D|H_31) ]

      • More typos:

        The alien hypothesis (H_2) is very vague and would be consistent with pretty much every possible observation. Therefore p(D|H_21) is necessarily small, ie max(p(D|H_01)) >> max(p(D|H_21)). So, unless the data is really inconsistent with H_01 and H_11, we can also drop H_2:

      • Anoneuoid: In this situation my earlier objection doesn’t apply, because none of your three finally competing options is an approximation of another. However, if you’d also use Bayes to find out a probability that the normal is a good approximation (which here you just take for granted), I think it would come in (unless you don’t allow other alternatives that are near to the normal). Actually it was more directly targeted at situations where you have a prior over a parameter in a continuous space.

        • use Bayes to find out a probability that the normal is a good approximation

          Hmm, the probability [a model] is a good approximation. Can Bayes’ rule tell you that?

          ([normal distribution], which here you just take for granted)

          As mentioned earlier I don’t think the choice is really that important, it just needs to be a rough approximation. The same model of uncertainty is used for all four theories (MOND, GR, ALIENS, QI) so any over/under-estimate would apply to all of them. I see these problems though:

          1) Overestimate uncertainty – more difficulty distinguishing between theories than need be (basically it costs more to accomplish the same thing)
          2) Underestimate uncertainty – gives too much weight to vague theories (since results for precise theories are more likely to be found in the tails than they should be)

          If you instead separately check whether each theory meets some threshold, the problems due to a mis-specified uncertainty model can be much more severe:

          1) Overestimate uncertainty – more difficult to “rule out” any theories even bad ones (basically it costs more to accomplish the same thing)
          2) Underestimate uncertainty – you will inevitably (with enough data) “rule out” all theories and you are left with confusion.

          In both cases it seems to be far more dangerous to underestimate uncertainty, but due to almost opposite problems.

        • Anoneuoid is also assuming that the distribution of the error is a frequency distribution here. Instead, assume that you know only roughly how big the error distribution is; if you choose a maximum entropy distribution for a given bigness, you will wind up with a normal distribution when bigness is measured as mean squared error…

          So what Bayes is telling you is nothing about frequency, but rather: if you assume the error is of some mean squared size on average, what should you think about the other model parameters?

          Bayesian results, as pretty much all results, are of the form: given the assumptions, what are the conclusions I can derive? Sometimes you can make these conclusions about frequencies, but most often you can’t, you can only make them about plausibilities.

        • But the whole probability calculus is based on assuming that probability is distributed among alternatives of which only one is true. If you have N(0,1) in your model and also N(10^{-10},1) and you’d be prepared to accept both of them as reasonable for the same data, this means that prior probabilities shouldn’t integrate out to 1, and probability theory breaks.

          I guess I also do not understand this. Why shouldn’t they integrate to 1?

          For example if there are n “hypotheses” (parameter values) that are equally likely a priori, then the prior probabilities should all be equal at 1/n.

        • Christian’s complaint comes down to his wanting some kind of unconditional probability, but Bayes only gives you probability conditional on the explicit models being tested. I just don’t see this as a problem provided you keep the conditionality in mind.

        • Hmmm. I find it hard explaining this in a substantially different way from what I’ve said before. If you have a model over a parametric distribution, let’s say the Gaussian, and you have a continuous prior over the parameter, and you say you don’t really believe the Gaussian model, rather it’s a reasonable approximation, this would mean that Gaussians with various different (though very close) parameters qualify as reasonable approximations of the *same* true model, in other words, not just a single Gaussian is “true” in the sense that it is a good approximation, but a whole set of them. But probability logic implies that to the extent that one parameter is true, all the others cannot be true.

          Daniel may be right that this issue can be avoided if you don’t claim that your model approximates any “true distribution” but rather models epistemic uncertainty.

        • Afaict, this is exactly the same as what I did in my example:

          General a priori knowledge:
          1) The exact distribution the measurement error comes from is unknown, so we cannot write down H_00.
          2) There are infinitely many possible distributions that could be consistent with the data, but the assumptions behind these are unlikely to hold exactly.

          So a priori we can drop all of these possibilities from the denominator

          p(H_01) ~ 0
          p(H_02) ~ 0
          […]
          p(H_2m) ~ 0

          Typically all the models are wrong except the one correct one that we do not know how to write down. So we approximate it… We don’t do a multiplicity of approximations because why?

        • When I want to model the average or typical values or something like that, I use normal or t or gamma distributed errors. When I actually want to learn a frequency distribution I set up something that is a kind of basis for distros, like a Gaussian mixture model or a Dirichlet distribution for histogram heights, or a mixture of Gaussian and exponential or whatever seems appropriate. The point is it should be highly flexible, to realistically model the shape of the frequency distribution explicitly.

        • Anoneuoid: Hmmm, I thought I commented before on why the issue doesn’t arise in your example. It would only arise if after simplification/replacement by approximation you still have different distributions in the mix that approximate the same true distribution. Which you could have in a continuous situation but not with the finally only two or three options you are left with in your example.

        • Anoneuoid: Hmmm, I thought I commented before on why the issue doesn’t arise in your example. It would only arise if after simplification/replacement by approximation you still have different distributions in the mix that approximate the same true distribution. Which you could have in a continuous situation but not with the finally only two or three options you are left with in your example.

          I get what you are saying but just don’t see that probabilities are summing to greater than one.

          You think this is happening often when people do MCMC?

        • Anoneuoid: You seem to have misunderstood me. I’m not saying that probabilities don’t add to one when people apply Bayes. I’m saying that many people interpret the meaning of prior and posterior in terms of approximation, and that this is not consistent with the fact that probabilities sum to one (at least not as long as there are two or more distributions in the competition that are approximately equal and therefore qualify as approximations of the same truth).

        • First off, all finite measures are isomorphic; you can always renormalize them to sum to 1.

          Second, if two distributions both approximate the same thing in a mixture, we aren’t led astray: they will both have posterior mass, and they will induce essentially the same predictive distribution, so in the end we have split a hair, but the two halves taken together are still the answer.

        • many people interpret the meaning of prior and posterior in terms of approximation, and that this is not consistent with the fact that probabilities sum to one (at least not as long as there are two or more distributions in the competition that are approximately equal and therefore qualify as approximations of the same truth).

          I think if this were a correct description of what was going on, people would be trying both the normal model and large-df t-distribution models when running MCMC.

          That isn’t what I see people doing.

        • Christian’s objection is a bit technical. It’s closely related to non-identifiability.

          Suppose someone does use say a normal model and a t model with a variable number of degrees of freedom together in a model averaging scenario. Suppose they aren’t aware of the fact that t for high degrees n is the same as the normal. Now you do the fit, and you find 50% of the posterior probability falls with the Normal model and 50% of the posterior probability falls on t models with degrees of freedom between say 100 and 500…

          the only thing keeping your t model from going to n = infinity is the prior, but all the models n=100 to 500 are basically normal anyway, you’ve converged to the normal model, but you “aren’t aware of it” because the posterior probability explicitly assigned to the normal is 0.5

          This kind of thing can easily happen in more complicated models, like models where you’re fitting a function and you have more than one way to represent the same function in your model. Like for example exp(x) vs a power series that approximates exp(x) essentially exactly over the range of the data…

          if you care about the posterior probability of a given parameter informing you, you’ll be confused, but if you care about the posterior predictive distribution, you’ll get the same distribution no matter how you compute it… whether it’s exp(x) or the power series… so you’ve converged to the correct solution, just not a unique *representation* of the correct solution…
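
          A quick R check of the exp(x)-versus-power-series point, over an assumed data range of [0, 1]:

          x <- seq(0, 1, by = 0.01)
          taylor <- rowSums(sapply(0:10, function(k) x^k / factorial(k)))   # degree-10 Taylor polynomial for exp(x)
          max(abs(exp(x) - taylor))   # on the order of 1e-8, far below any realistic measurement noise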

          I think this is an issue to look out for, but I don’t think it’s a foundational problem of any sort.

        • + 1 to “Suppose someone does use say a normal model and a t model with a variable number of degrees of freedom together in a model averaging scenario. Suppose they aren’t aware of the fact that t for high degrees n is the same as the normal. Now you do the fit, and you find 50% of the posterior probability falls with the Normal model and 50% of the posterior probability falls on t models with degrees of freedom between say 100 and 500…”
          If we think of this whole exercise as representing our epistemic uncertainty, there is basically no difference between assigning a N(0,1) prior or a N(0+e,1) prior with e very tiny, or a t(30,0,1) prior, or a weighted mixture of the 3. Sure, our inference is conditional and will change slightly depending on choice, but that is as it should be. We have changed the model in that case. The key is to think of the prior as a distro over the model configuration space. We do not know the “true” data generating distribution, but Bayesian inference uses probability to quantify consistency of configurations and data (ht: Michael Betancourt https://betanalpha.github.io/assets/case_studies/probability_theory.html#6_interpretations_of_probability)

      • Hi, great post!

        I wish they would teach this in the first statistics class, not tests or measure theory, because this is how you link normal human reasoning to math; everything else is just technicality.
