Responding to Richard Morey on p-values and inference

Posted on May 2, 2021 9:30 AM by Andrew

Jonathan Falk points to this post by Richard Morey, who writes:

I [Morey] am convinced that most experienced scientists and statisticians have internalized statistical insights that frequentist statistics attempts to formalize: how you can be fooled by randomness; how what we see can be the result of biasing mechanisms; the importance of understanding sampling distributions. In typical scientific practice, the “null hypothesis significance test” (NHST) has taken the place of these insights.

NHST takes the form of frequentist significance testing, but not its function, so experienced scientists and statisticians rightly shun it. But they have so internalized its function that they can call for the general abolition of significance testing. . . .

Here is my basic point: it is wrong to consider a p value as yielding an inference. It is better to think of it as affording critique of potential inferences.

I agree . . . kind of. It depends on what you mean by “inference.”

In Bayesian data analysis (and in Bayesian Data Analysis) we speak of three steps:
1. Model building,
2. Inference conditional on a model,
3. Model checking and improvement.
Hypothesis testing is part of step 3.

So, yes, if you follow BDA terminology and consider “inference” to represent statements about unknowns, conditional on data and a model, then a p-value (or, more generally, a hypothesis test or a model check) is not part of inference; it is a critique of potential inferences.

But I think that in the mainstream of theoretical statistics, “inference” refers not just to point estimation, interval estimation, prediction, etc., but also to hypothesis testing. Using that terminology, a p-value is a form of inference. Indeed, in much of statistical theory, null hypothesis significance testing is taken to be fundamental, so that virtually all inference corresponds to some transformations of p-values and families of p-values. I don’t hold that view myself (see here), but it is a view.

The other thing I want to emphasize is that the important idea is model checking, not p-values. You can do everything that Morey wants to do in his post without ever computing a p-value, just by doing posterior predictive checks or the non-Bayesian equivalent, comparing observed data to their predictions under the model. The p-value is one way to do this, but I think it’s rarely a good way to do it. When I was first looking into posterior predictive checks, I was computing lots of p-values, but during the decades since, I’ve moved toward other summaries.

102 thoughts on “Responding to Richard Morey on p-values and inference”

Sander Greenland on May 3, 2021 11:58 AM at 11:58 am said:

Andrew you wrote “in the mainstream of theoretical statistics, ‘inference’ refers not just to point estimation, interval estimation, prediction, etc., but also to hypothesis testing. Using that terminology, a p-value is a form of inference.” Please explain how in this setting “a p-value is a form of inference” bearing in mind that
1) In Neyman-Pearson theory of hypothesis testing, an observed P-value p is not an inference but rather the smallest alpha-level at which a “reject” decision would be made;
2) In ordinary English p isn’t an inference either:
according to Oxford online, “inference” means
“a conclusion reached on the basis of evidence and reasoning, as in ‘researchers are entrusted with drawing inferences from the data’ ” (note the use of “researchers” here),
and Merriam-Webster says “inference” means
“1 : the act or process of reaching a conclusion about something from known facts. 2 : a conclusion or opinion reached based on known facts.”

As far as I can tell your statement is simply repeating what perhaps most textbooks, tutorials, and editorials get wrong: It is confusing a statistic (here, p) with a decision from a rule based on the statistic (alpha-level, fixed cutoff testing). A P-value is not an inference or a hypothesis test, it is a number computed from the data that just sits there vacantly on the computer output. Then someone comes along and claims it means something, usually the traditional “the association was significant” because p was under 0.05 or “there was no association” because p was over 0.05 for some regression model in which the coefficient representing the association was set to zero. [And then there are the continuity advocates like me who say it should not be broken up because it measures something, like where the data fell in percentile terms along a reference distribution for a divergence measure computed from a model (e.g., the model underlying the test of a coefficient that purportedly represents a tested effect).]

In none of these cases is the P-value p itself an inference. An inference is something our brains overlay on statistics like P-values. Now decision theorists formally model our inference processes into decision rules, which in turn normatively warp publication conventions, and so authors mimic the dichotomous decisions they were trained to make. But all that is again a social-psychological overlay on a number which means nothing apart from bare mathematical descriptions that in no way demand any inference. A P-value is never an inference any more than your weight is. At best it is an evidence measure, albeit one poorly scaled for aiding good judgment (although there are many who are in abject denial about one or another of these facts).

- Daniel on May 3, 2021 12:04 PM at 12:04 pm said:
  
  +1
  
- Daniel Lakeland on May 3, 2021 12:13 PM at 12:13 pm said:
  
  I like this. There’s nothing wrong with p values themselves, they mean what they mean. The problem is the illogic of assuming they mean something VASTLY different because it’s
  
  1) what you were taught
  2) what you were told to do by your boss
  3) It gets me publications
  4) Look how fun this is I just tweak this specification and then the p value drops below 0.05 and I have discovered the truth
  5) I know the reality but I’m a cynical bastard who loves the money and fame
  
  etc
  
  - Sander Greenland on May 3, 2021 1:27 PM at 1:27 pm said:
    
    +1
    
    Re 4 though, I’d make it 4a. There seems to be a prevailing cognitive bias against mentioning the act that dares not speak its name (yet I see plenty of it in the med/health lit):
    4b) Look how fun this is I just tweak this specification and then the p value pops above 0.05 and I have refuted previous claims. [which will also get me publications, especially when the mainstream hated on the previous claims]
    
    While I’m here I may as well plug this one again about what a P-value means when you’re in a topic where no one has a stat model close to reality that generated (caused) the data: https://arxiv.org/abs/1909.08583
    
  - Rahul on May 4, 2021 4:48 AM at 4:48 am said:
    
    But that’s the thing: I think p values answer a question that’s rarely the question on anyone’s mind!
    
    When you stand behind a metric like that it’s bound to get misused because it’s just set up for misinterpretation.
    
    - Sander Greenland on May 4, 2021 8:05 AM at 8:05 am said:
      
      Rahul, I agree insofar as that is a psychosocial problem ignored by the theoreticians and philosophers, or dismissed with exhortations to just “do better” without addressing the source of the problem. But as others have also said, it’s not the P-value’s fault that we have so much labeling, describing, and promotion of decontextualized statistics as if they were measuring things they don’t come close to in real research (apart perhaps from some rare ideal conditions that aren’t even remotely approximated in some of the most important research areas for society, like occupational, environmental, and medical regulatory settings).
      
      Then, as if to aggravate the wounding of research and knowledge this misrepresentation causes, we get vociferous defenses of it with rationales based on hidden utilities that are far from those of others. Those include the utility of arguing that everything you’ve taught, practiced and published is defensible or even desirable based on some universally accepted social utility, sometimes to the point of hysteria based on toylike and often completely fictional representations for the very real social problem of how to foster reliable research conduct and reporting.
      
      My theme is that to blame these problems on P-values is scapegoating to avoid facing the much more uncomfortable and unquantifiable reality of the human failings that inevitably enter into research, theory, philosophy and methodology. Those failings lead to the same sort of problems with Bayes factors or any other simplistic algorithmic substitute for critical thinking about contextual causal narratives [I think that’s another convergence back with Andrew, as when he wrote “The problem with P-values is not just with P-values”].
    - Andrew on May 4, 2021 8:34 AM at 8:34 am said:
      
      Sander:
      
      As you know, I wasn’t completely happy with that ASA statement on p-values. In response I wrote a short article, The problems with p-values are not just with p-values.
- gec on May 3, 2021 5:23 PM at 5:23 pm said:
  
  I agree that the p-value itself is no more “inferential” than any other quantity we might compute.
  
  One thing that both you and Andrew bring up which I’d like to emphasize is the importance of assessing whole models and their divergence from data. Andrew discusses this in terms of model checking, and as you say, the p value is a measure of discrepancy between data and a model. Many pitfalls of p values can be traced back to not appreciating what that model entails.
  
  I want to emphasize this because model checking still doesn’t seem very appreciated in many Bayesian circles, either because it is thought to be irrelevant or because of a focus on model comparison (e.g., via Bayes factors*). But what good is a comparison between models that aren’t very good at matching the data?
  
  * Bayes factors are sensitive to priors on model parameters. It is amusing to me how many Bayesian analyses forget that priors are part of the models being compared, just as much as the assumptions of the null hypothesis are part of the “model” from which most p values are derived.
  
  - Sander Greenland on May 4, 2021 12:04 AM at 12:04 am said:
    
    Well put gec.
    
    In return, here’s more of my take: P-values do have a certain versatility in application even if their interpretation should be much narrower than commonly portrayed. For example, the usual coefficient null P-value is the percentile location in a distribution of a divergence from the nested (embedded) test model without the coefficient to the nesting (embedding) model with the coefficient unconstrained. That divergence is measured along the vector from a data projection onto the nested model to a data projection onto the embedding model (with the reference distribution computed under the nested model if that directional divergence is the only one of interest).
    
    Regardless of the test model and parameters, penalties or priors simply add Lagrangian constraints to score functions used to find projections and reference distributions, which means that we can (and I’d say should) provide calibrated (frequentist) tests of fit of Bayesian models (as Box advocated over 40 years ago) even if we plan to commit “Bayesian inference”.
    
    For me, the abstractness and unfamiliarity of basic geometrical descriptions of so-called “inferential” statistics explains why those fundamental descriptions are all but absent from books I’ve seen below the Bickel et al. level. And it explains why instead all sorts of inferential overlays (most garbled, misleading, or flat wrong) end up getting used to describe, explain or even define P-values and CI. The field of stats and its “philosophy” has seemed in denial about this problem, and even perpetuates it not only with its defenses but with its attacks on P-values and CI – especially from pure likelihoodists, Bayesians and others who make absurd claims [one you can find emanating even from some highly mathematical statisticians is that “P-values don’t measure evidence”, argued by using “P-values” that may not satisfy basic geometric ideas behind useful ones, or definitions of evidence that aren’t accepted by others (there is no universally accepted formal definition)].
    
    Sure there are dumb P-values, dumb compatibility (“confidence”) intervals, dumb data-probability (sampling) models, dumb priors and of course dumb posteriors; I see them all the time in research articles and even sometimes in stat methods articles. That points up the need to lay out precisely what each function used in a “data analysis” is actually capturing about the data and the underlying analysis models, and NOT use a function when it is NOT capturing what we are after. That means for example that someone treating a P-value as if it were a posterior probability or treating a compatibility interval as if it were a posterior interval (a con game implied every time it’s called a “confidence interval”) should be required to explain in full contextual detail why these statistics would be close enough to the posterior quantities we’d get from a defensibly informed Bayesian analysis. That means among other things that we’d better have hierarchical models on hand to enable better numerical correspondence with credible Bayesian quantities, as well as contextually more relevant calibration for honest frequentists [here I’ve converged back to a position I think matching Andrew’s, despite the blip that started out my comments above].
    
- Sameera Daniels on May 3, 2021 11:00 PM at 11:00 pm said:
  
  Excellent
  
- Richard D. Morey on May 4, 2021 7:37 AM at 7:37 am said:
  
  I generally agree with what Greenland has written here: what we call “inference” is a variety of sophisticated games/thought experiments that we’ve invented to try to get a handle on variability. The word “inference” is typically used in two senses: the statistical inference and the scientific inference. One of the features of poor NHST behaviour is confusing the two (“p is low” -> “hypothesis is supported”).
  
  The advantage that BDA has, I think, is that it is a pragmatic, loose collection of techniques that people have found useful in applied contexts, so many ideas (including the intuitions underlying significance testing) can be said to be compatible with it. We see this pragmatism in Gelman & Shalizi; in our reply (Morey, Romeijn & Rouder) I don’t think I appreciated this as well as I could have. (The M-open/M-closed discussion has this flavor: we are worried if we have a high probability of settling on a model that is “false” in a way that is misleading. Cromwell’s rule is another case, as is the use of recovery simulations.)
  
  Generally, I think we’re better off thinking about statistics less about what we *should* infer, and more about introducing skepticism about what we *might* infer. There’s almost always a lot more work to be done to show that something is plausible (e.g., model checks, understanding the method that generated the data, etc) than just looking at a test statistic. We should not let people transfer the responsibility for their inferences onto any test stat.
  
  - Sander Greenland on May 4, 2021 3:51 PM at 3:51 pm said:
    
    +1
    
    I would modify the opening slightly to say
    ‘what we call “inference” ought to be a variety of sophisticated games/thought experiments that we’ve invented to try to get a handle on uncertainty and reality.’
    – Consider that, even if there is no important unexplained variability in experimentally observed patterns, there can be a lot of uncertainty about what will happen when one moves from experimental studies to field application (witnessed by fatal accidents from unanticipated mechanical or medical failures), and a lot of questions left about the sources of those patterns (witnessed by many biologic phenomena).
    
    On BDA, my pragmatic objection is to its label: There’s no reason I can see in its actual goals to hobble data analysis with the label “Bayesian”. The P-value is a good example: As Box wrote, those can be used to good effect as part of the diagnostic toolkit, and in that role they can be calibrated to meet “frequentist” criteria which facilitates use of their geometric interpretation. Why force statistics into one narrow channel when it’s finally escaping from the tyranny of another?
    
    That said, IMO your closing paragraph is superb and deserves to be quoted often.
    
    - Andrew on May 4, 2021 4:02 PM at 4:02 pm said:
      
      Sander:
      
      The p-value, as we define it, is a part of Bayesian data analysis. It’s Pr(T(y^rep) >= T(y) | data), that is, the posterior probability that a replicated dataset will be as or more extreme than the observed data, where “observed” is defined by the test statistic, T. We have the p-value in chapter 6 of BDA. I don’t find the p-value very useful, and I think we might all be better off had it never been born, but given that it exists, I think it’s useful for it to have a Bayesian interpretation.
      
      In any case, I agree with you that non-Bayesian data analysis is useful too! See here for further discussion of the benefits of methodological pluralism. I wrote a book on Bayesian data analysis because there’s a lot to be said on the topic; the existence of this book or this school of thought should not be taken to imply a claim that there are no other valuable approaches.
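
The quantity Pr(T(y^rep) >= T(y) | data) described in the comment above can be approximated by simulation. What follows is only a minimal sketch under assumptions that are not from the thread: a toy model y_i ~ normal(theta, 1) with a flat prior on theta, made-up data, and hypothetical names (`posterior_predictive_p`, `y_obs`, `mean`).

```python
import random


def mean(values):
    return sum(values) / len(values)


def posterior_predictive_p(y, test_stat, n_sims=4000, seed=0):
    """Monte Carlo estimate of Pr(T(y_rep) >= T(y) | y).

    Assumed toy model (illustrative only): y_i ~ normal(theta, 1) with a
    flat prior on theta, so that theta | y ~ normal(mean(y), sd=1/sqrt(n)).
    """
    rng = random.Random(seed)
    n = len(y)
    y_bar = mean(y)
    t_obs = test_stat(y)
    exceed = 0
    for _ in range(n_sims):
        theta = rng.gauss(y_bar, 1 / n ** 0.5)           # draw theta from the posterior
        y_rep = [rng.gauss(theta, 1) for _ in range(n)]  # replicated dataset
        if test_stat(y_rep) >= t_obs:
            exceed += 1
    return exceed / n_sims


# Made-up data with one large value, for illustration only.
y_obs = [-0.3, 0.1, 0.4, 0.9, 1.2, 1.6, 2.0, 7.5]

p_max = posterior_predictive_p(y_obs, max)    # T = largest observation
p_mean = posterior_predictive_p(y_obs, mean)  # T = sample mean
print(p_max, p_mean)
```

With these made-up data, the check based on the maximum should come out very small, because the toy model rarely produces a value as large as 7.5, while the check based on the sample mean should sit near 0.5, because the posterior is already centered on the observed mean. The choice of T determines which aspect of the data the check can detect.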
    - Deborah G. Mayo on May 4, 2021 5:52 PM at 5:52 pm said:
      
      There’s no posterior probability, it’s a probability computed based on a sampling distribution. There are no priors. It’s a counterfactual claim, not about a future replicated dataset.
    - Daniel Lakeland on May 4, 2021 6:37 PM at 6:37 pm said:
      
      The way Andrew said it, it looks to me like he’s choosing a sampling distribution to be the sampling distribution of the model p(data | parameters) marginalized across the posterior distribution of the parameters p(parameters | data). This makes sense for a Bayesian analyst. Such a test answers the question “given what we know, what is the probability that future datasets would have more extreme test statistics averaged across our remaining uncertainty?”
      
      It’s a different purpose than is typical for frequentist p-value testing, because in the frequentist case there is no distribution across parameters. Either the test is carried out for a *single* pre-specified parameter value, or it’s carried out for something like a ‘worst case’ parameter value, or the like.
    - Andrew on May 4, 2021 7:04 PM at 7:04 pm said:
      
      Deborah:
      
      We discuss this in our 1996 paper. If you want to call it a counterfactual claim rather than a replicated dataset, that’s fine; there’s no mathematical distinction between the two. And, yes, there are priors because you average over the prior to get the posterior probability. In the special case of a pivotal test statistic, the p-value doesn’t depend on the prior, but in general it will, and there’s no way around it. There are also non-Bayesian alternatives such as plug-in estimators, taking the maximum of the p-value, averaging over a confidence region, taking the maximum over a confidence region, etc.
    - Sander Greenland on May 4, 2021 8:55 PM at 8:55 pm said:
      
      Andrew: As you know and as I just replied to Lakeland, I reject the posterior predictive P-value (PPP) Pr(T(y^rep) >= T(y) | data) as a diagnostic tool because of its poor frequency calibration – in essence it double-counts the data and so converges to a point mass at 0.5 instead of uniformity. So if I get a PPP of 0.25 I don’t know if that is from a terrible fit or an acceptable one. That’s not what I want if my goal is to check the model; no amount of interpreting PPP as the posterior prediction it is will make it better.
      
      In contrast, with a frequency-valid P-value (a U-value) based on a divergence statistic T from the test model M to the data [as in Pearson’s chi-squared, the earliest cite I know (1900) for the “value of P” in tests of fit], I know right away the percentile where t_obs fell in the T distribution induced by M, because by definition 100p is that percentile, and uniformity is assured by that definition.
      
      More generally, in most real applications I see (as in GLMs), M is a proper model subspace and asymptotic pivotals are available so the calibration argument carries over easily, no prior needed. Otherwise (e.g., as in equivalence testing) as you note there are choices to be made, each with frequentist arguments for and against them, including max-p as well as Box’s prior predictive p (when M includes proper priors). But despite years of dogging about this I have yet to see any frequentist argument for PPPs, only damning criticism. Since my idea of Bayes-frequentist fusion is to meet both types of criteria and criticism to the extent feasible (strive to be “B and F”, not “B or F”), this leaves PPPs in the dustbin (where some would place all P-values, but that would just shift the controversy to their replacements).
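
The uniformity property invoked in the comment above can be checked numerically in the simplest case. This is a minimal sketch under assumptions of my own, not from the thread: a fully specified test model y ~ normal(0, 1), the statistic T(y) = y, and hypothetical helper names (`upper_tail_p`, `share_at_or_below`).

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def upper_tail_p(y):
    """P-value for T(y) = y under the assumed test model y ~ normal(0, 1)."""
    return 1.0 - normal_cdf(y)


# Simulate data from the test model itself and record the resulting P-values.
rng = random.Random(1)
p_values = [upper_tail_p(rng.gauss(0.0, 1.0)) for _ in range(20000)]


def share_at_or_below(u):
    """Fraction of simulated P-values that are <= u."""
    return sum(p <= u for p in p_values) / len(p_values)


# Under the test model, Pr(p <= u) should be close to u for every u.
for u in (0.05, 0.25, 0.50, 0.75):
    print(f"Pr(p <= {u:.2f}) is about {share_at_or_below(u):.3f}")
```

Because the share of P-values at or below u is close to u, 100p can be read directly as the percentile at which the observed statistic fell under the test model, which is the reading the comment relies on.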
    - Andrew on May 4, 2021 9:39 PM at 9:39 pm said:
      
      Sander:
      
      Posterior predictive checks count the data exactly once. They are statements about future data (or, if Deborah prefers, counterfactual data) conditional on the observed data and the assumed parameters. I don’t find p-values of any sort to be very useful, but they are directly interpretable. The frequentist argument for posterior predictive p-values is not the Neyman argument—there, I agree with you, U-values would be more appropriate, except that I don’t find the Neyman hypothesis testing argument to make sense in any applied problem I’ve ever seen—but rather the Fisher argument of the p-value as a measure of how surprising the data are, under some measure as defined by the test statistic T. For that purpose I don’t see why there would be any interest in any uniform sampling distribution. Finally, it’s not true that the distribution of the posterior predictive p-value converges to a point mass at 0.5. It depends on the model and the test statistic. If it’s an ancillary test statistic, the distribution converges to uniformity. In other cases, it can converge to a point mass at 0.5. In other settings, something in between. From a Bayesian (or Fisherian) standpoint, I think that makes sense. I discuss the issue further in my article, “Two simple examples for understanding posterior p-values whose distributions are far from uniform.” Anyway, I’m not really trying to convince anyone to use these, or any other, p-values. I think we agree on the applied questions and also on the larger point of methodological pluralism.
    - Sander Greenland on May 4, 2021 10:18 PM at 10:18 pm said:
      
      Andrew: I maintain that you are simply mistaken about the interpretation of P-values in terms of surprise without uniformity, and it is egregiously wrong to identify uniformity with NP tests alone: Neo-Fisherians want it too (I’m an example) because it allows direct interpretation of P-values in terms of information divergence of the data from the model – no “hypothesis testing” or cutpoints needed or wanted. In fact if there was anything I’d point to as a foundation of sound frequentism of any sort, P-uniformity over the sampling model is it, as it not only translates to valid information measures and hypothesis tests, but also to valid interval coverage under the model.
      
      Statistical interpretation of a P-value requires a reference distribution for it, and in particular a uniform one if no explicit reference is given. Consider the usual case, in which the PPP converges to a point mass at 0.5 instead of uniformity, as shown here:
      Bayarri MJ, Berger JO. P values for composite null models. J Am Stat Assoc. 2000;95:1127-42.
      Robins JM, van der Vaart A, Ventura V. Asymptotic distribution of P values in composite null models. J Am Stat Assoc. 2000;95:1143-56.
      In those cases, how surprising is PPP = 0.25? You can’t tell me without reference to whatever distribution it has under the model. If instead I use a uniformly calibrated P-value, I know immediately: it’s not at all surprising [in particular, it represents only 2 bits of Shannon information against the model contained in the event that T fell at its 25th percentile under the model].
      
      The tragedy of resistance to uniform calibration is that it blocks most of the logically sound interpretations that can be built around P-values. This is what your 2013 EJS paper and present comments missed. So if someone rejects uniform calibration, I’d say they ought to cease and desist using P-values.
    - Andrew on May 5, 2021 12:00 AM at 12:00 am said:
      
      Sander:
      
      Thanks. Let me clarify. By surprise etc., I’m talking about comparing current to future data. I’m not doing a null hypothesis significance test. Consider the following simple example: y ~ normal(theta, 1), with prior theta ~ normal(0, 10), and test statistic T(y) = y. The p-value is Pr(y^rep > y | y), and the distribution of this p-value, conditional on theta, is very concentrated around 0.5 for any plausible value of theta. And that’s fine, because under the model, the data will not be a surprise under that measure. And that’s ok! I wouldn’t want a u-value which 5% of the time would tell me there’s a surprise. It’s the nature of this particular model and this test statistic that a surprise would happen much less than 5% of the time, if the model is true.
      
      I agree that the posterior predictive p-value would not be a good tool if your goal is to come up with a procedure that rejects 5% of the time if the model were true. But I’m not particularly interested in using such procedures. But you are. It’s good that statistics is an open field and neither of us is a dictator!
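
The simple example in the comment above has a closed form, so the concentration near 0.5 can be checked directly. This is a sketch under two assumptions that are mine rather than the commenter's: normal(0, 10) is read as mean 0 and standard deviation 10, and the data are simulated at a hypothetical true value theta = 3.

```python
import math
import random

PRIOR_SD = 10.0    # assumption: normal(0, 10) means mean 0, sd 10
TRUE_THETA = 3.0   # hypothetical value, chosen only for illustration


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ppp(y):
    """Pr(y_rep > y | y) for y ~ normal(theta, 1), theta ~ normal(0, PRIOR_SD)."""
    post_var = 1.0 / (1.0 + 1.0 / PRIOR_SD ** 2)  # posterior variance of theta
    post_mean = post_var * y                      # posterior mean of theta
    pred_sd = math.sqrt(1.0 + post_var)           # sd of y_rep given y
    return 1.0 - normal_cdf((y - post_mean) / pred_sd)


# Distribution of the posterior predictive p-value conditional on theta.
rng = random.Random(2)
p_values = [ppp(rng.gauss(TRUE_THETA, 1.0)) for _ in range(10000)]
print(min(p_values), max(p_values))
```

In this setup every simulated p-value lies within a few hundredths of 0.5. That is the behavior both commenters describe; the disagreement in the thread is over whether it is a sensible property of the check or a calibration failure.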
    - Sander Greenland on May 5, 2021 12:52 AM at 12:52 am said:
      
      Andrew: I think I’ve stated many times including here that I am not interested in uniform calibration because it preserves alpha levels (false-rejection rates). I’m interested in uniform calibration because it justifies the continuous Shannon-information interpretation of s = log(1/p) = -log(p) as surprisal or information against the tested model (where the base of the logs is just a scaling factor, with base 2 making the units bits as familiar in computer science).
      
      You keep going back to claim my message is instead about NP testing (in which the overarching goal is indeed to get “a procedure that rejects 5% of the time if the model were true”) and that’s wrong, a complete misrepresentation of my goal of information summarization (a goal for which NP testing is a travesty). But the fact remains that in their respective frequentist theories, both goals demand uniformity of the P-value, and hence PPPs abjectly fail frequentist criteria for information summary and transmission (as well as for decision rules) because of their miscalibration.
      
      You of course are entitled to make up other arguments for PPP as a diagnostic, and as we know I’m entitled to reject them as warranting no interpretation in terms of ordinary English meanings of “surprise” or technical meanings of “surprisal”. In just the same way, quantities named “confidence level” and “confidence interval” warrant no realistic confidence or uncertainty or posterior probability interpretation when (as usual in our work) there is serious doubt about their embedding model. So as with all things we can only hope any reader has been given enough details to form an accurate picture about what is going on with these quantities and disputes about them.
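
The transformation s = log(1/p) = -log(p) mentioned in the comment above is easy to compute. The sketch below does only the arithmetic; the function name `s_value` is hypothetical, and whether the result can be read as information against the model depends on the calibration argument made in the thread.

```python
import math


def s_value(p, base=2.0):
    """Surprisal s = log(1/p) = -log(p); base 2 expresses it in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)


for p in (0.5, 0.25, 0.05, 0.005):
    print(f"p = {p:<5} -> {s_value(p):.2f} bits")
```

For p = 0.25 this gives 2 bits, matching the figure quoted earlier in the thread, and p = 0.05 corresponds to a little over 4 bits.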
    - Daniel Lakeland on May 5, 2021 7:09 PM at 7:09 pm said:
      
      Sander, -log(p) where p is a Posterior Predictive P-value has a perfectly fine surprisal interpretation. If you were to transmit data generated by the Posterior Predictive distribution using an optimal code it would take a number of bits per symbol which can be calculated by averaging -log(p) over the posterior predictive distribution. (Technically, you’d better discretize the measurements and generate a discrete distribution from the discretization of the continuous one.)
      
      If you have real data which has a large surprisal under this measure (requires a lot of bits to transmit) then it is data which would be rare to come out of the Posterior Predictive RNG. If your real data would be rare to come out of your model, then it’s not a particularly good model.
    - Sander Greenland on May 5, 2021 8:58 PM at 8:58 pm said:
      
      Daniel: My purpose (goal) for surprisal log(1/p) is to gauge surprise at the current observed data (the analysis data) given only the test model M, as it’s the latter I’m concerned to diagnose or test against the data; for that I need P uniform (maxent) under M. With p=PPP you are instead using log(1/p) for surprisal at seeing the same tail event in some other data after being given the test model AND the analysis data.
      
      Again, the problem with PPP in the model-only diagnostic role is toward the opposite (unsurprising) side of the surprisal spectrum – it’s that it can seriously understate the surprise I should have at the current data given only the test model. See my longer reply to you below and of course the JASA 2000 articles, discussion, and rejoinders.
    - Daniel Lakeland on May 5, 2021 11:18 PM at 11:18 pm said:
      
      Sander, am I right in thinking that what you’re talking about is a Prior Predictive P? That seems also useful but answers a different question.
Deborah G. Mayo on May 4, 2021 12:01 AM at 12:01 am said:

Morey seems to say that “experienced scientists and statisticians rightly shun” statistical significance tests because “they have so internalized its function” that they can carry it out by other means. Although I think there’s a grain of truth in claiming that those scientists who want to abolish statistical significance testing know not what they’re really saying–because anyone caring to distinguish effects of noise/non-noise in non-obvious cases will ultimately rely on p-value reasoning–I think it’s a dangerous mistake to claim they are “rightly” shunning it. There are plenty who are happy to believe they are carrying out its function by other means when in fact they’re not. Too many are content not to be constrained by a method that picks up on how biasing selection effects, stopping rules, multiple testing etc. alter the error probing capacities of methods and thus wreck their ability to carry out the important, though limited, function of stat significance testing. Of course, I’ve written a book on it: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).

- Sander Greenland on May 4, 2021 12:49 AM at 12:49 am said:
  
  Mayo: Your response incurs my objection to the usual trope of confounding P-values and reasoning from them with “significance tests” and “severity”. This confounding is widespread and looks to me like some blend of conceptual confusion, oversimplification, and wishful thinking. Yet it’s a totally unnecessary and destructive confusion: Once we lay bare the geometry of these statistics we can see the kind of information P-values (and their inversions into CIs) carry about the relation of models to data, which is all these statistics can offer for the vital task of error probing. That information is far from all that is needed to reach a sound scientific judgment about whether some observed deviation is purely a “random artefact”, especially when there was no randomization involved in the study design and conduct (or it was heavily compromised) or no reason to expect no effect (the null spike used by Bayesian testing being a utility bias transformed into sheer delusion).
  
  I think a major reason P-values attract so many sophisticated opponents is because once someone who is not committed to them sees what they really DON’T capture, they realize that P-values aren’t capturing the significance of anything, except in some incredibly narrow sense that is nonsense outside of narrow tightly controlled experiments. Today we suffer under the burden that the founding figures of current statistics took this narrow setting as universal, yet it is rarely or never seen in many crucial fields, including research on free-living humans and their societies. [Note that I wrote “current statistics”, not “modern statistics”, as I regard the current state of statistical theory and philosophy as seen in textbooks as medieval, filled with magical thinking and superstition (such as an implicit treatment of randomness as if it were the dominant concern of all scientific research). Applied statisticians have had to cope endlessly with adapting these inadequate treatments to produce reliable analyses and uncertainty statements from real research.]
  
  - Daniel Lakeland on May 4, 2021 6:43 PM at 6:43 pm said:
    
    Thank god now that vaccination is widespread in the US we can return to the normality of arguing over p values!
    
    > reason P-values attract so many sophisticated opponents is because once someone who is not committed to them sees what they really DON’T capture, they realize that P-values aren’t capturing the significance of anything,
    
    p values capture the sufficiency of a model to make the observed data “not very unusual” in the “direction” captured by the test statistic. Nothing more.
    
    This is why p values much larger than 0.05 are usually more interesting than small p values. Small p values tell you “this model sucks”. Larger p values tell you “this model could be adequate”. There are an infinity of sucky models. That we have found one is rarely truly of interest, unless someone actually believes it could really and truly be true. That’s why a Bayesian using a p value on a posterior predictive distribution cares about a p value. It tells them “hey your best guess at model fit to data might still suck”.
    
    - Sander Greenland on May 4, 2021 8:13 PM at 8:13 pm said:
      
      Unfortunately, unlike with a frequency-valid P-value (one that approaches uniformity under the sampling model it is computed from – a “U-value”), getting a posterior predictive P-value (PPP) above 0.05 is no assurance that the observed data are not very unusual in the tested direction. Instead, PPPs converge to a point mass at 0.5 and their spread depends on the sample size among other things. One of the many great oddities of the whole P-value/testing controversy is that PPPs are still taken seriously as useful diagnostics more than 2 decades after devastating critiques appeared (in JASA no less) and despite relatively simple recalibrations being long available…see
      Bayarri MJ, Berger JO. P values for composite null models. J Am Stat Assoc. 2000;95:1127-42.
      Robins JM, van der Vaart A, Ventura V. Asymptotic distribution of P values in composite null models. J Am Stat Assoc. 2000;95:1143-56.
    - Daniel Lakeland on May 5, 2021 12:27 AM at 12:27 am said:
      
      I’ll have to read your reference but I’m not sure we are talking about the same thing.
      
      Suppose I have a model and collect some data. I use it all to fit a posterior distribution. Now I can sample from the posterior of the parameters q_i
      
      For each sample from the posterior of the parameters I sample a fake data set y_i and calculate a test statistic T(y_i) =T_i
      
      Now I calculate T(y_act) the actual test statistic from the actual experimental data. I calculate the empirical cdf of the T_i posterior predictive data and calculate U_act=cdf(T_act). I claim that for a sufficiently large sample of posterior predictive datasets the U value is uniform when calculated from future posterior predictive datasets and if U_act is a very small number or 1- such a small number it is evidence that the model does not predict that the data is of the ordinary type expected by the model.
    - Carlos Ungil on May 5, 2021 3:48 AM at 3:48 am said:
      
      > Suppose I have a model and collect some data.
      
      Imagine that you have a (correct) model where measurements are normal(mu,1) with unknown underlying value and a known distribution for the measurement error.
      
      You also have a uniform prior for mu (in an interval covering all the potential measurements) and a single data point: 42.
      
      > I use it all to fit a posterior distribution.
      
      Your posterior for mu is normal(42, 1).
      
      > Now I can sample from the posterior of the parameters q_i
      
      > For each sample from the posterior of the parameters I sample a fake data set y_i and calculate a test statistic T(y_i) =T_i
      
      Let’s consider the statistic T(y)=y. The collection of fake data points (and the statistic) will be distributed as a normal(42, 1)+normal(0,1)=normal(42, sqrt(2))
      
      > Now I calculate T(y_act) the actual test statistic from the actual experimental data.
      
      T(y=42)=42
      
      > I calculate the empirical cdf of the T_i posterior predictive data and calculate U_act=cdf(T_act).
      
      U_act=0.5
      
      > I claim that for a sufficiently large sample of posterior predictive datasets the U value is uniform when calculated from future posterior predictive datasets
      
      What do you mean? The distribution of the U value over fake data sets (conditional on the observed data point) is uniform by construction. Independently of the observed data point U_act is 0.5.
      
      > and if U_act is a very small number or 1- such a small number it is evidence that the model does not predict that the data is of the ordinary type expected by the model.
      
      Maybe, but it’s not because we expected a uniform distribution for U_act. In this example U_act=0.5 for any observed data.
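The single-observation example in this exchange can be checked numerically. The following is a minimal sketch, assuming only what the comment states (measurements normal(mu, 1), a flat prior on mu, one data point, statistic T(y) = y); the function names and the extra observed values -3 and 1000 are illustrative assumptions, not part of the thread.

```python
import random
from statistics import NormalDist


def u_act_monte_carlo(y_obs, n_draws=20000, seed=0):
    """Monte Carlo posterior predictive U value for T(y) = y, one observation.

    Model: y ~ normal(mu, 1) with a flat prior on mu, so the posterior
    for mu is normal(y_obs, 1).
    """
    rng = random.Random(seed)
    below = 0
    for _ in range(n_draws):
        mu = rng.gauss(y_obs, 1.0)   # draw mu from its posterior
        y_rep = rng.gauss(mu, 1.0)   # fake data point given that mu
        below += y_rep <= y_obs      # count T(y_rep) <= T(y_act)
    return below / n_draws


def u_act_exact(y_obs):
    # The posterior predictive is normal with mean y_obs and sd sqrt(2),
    # so its cdf evaluated at y_obs is the cdf at its own center.
    return NormalDist(mu=y_obs, sigma=2 ** 0.5).cdf(y_obs)


# 42 is the value from the comment; the other two are arbitrary.
for y_obs in (42.0, -3.0, 1000.0):
    print(y_obs, u_act_exact(y_obs), round(u_act_monte_carlo(y_obs), 3))
```

Under these assumptions the exact value does not move with the observed data point, which is the behavior Carlos describes.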
    - Daniel Lakeland on May 5, 2021 3:53 PM at 3:53 pm said:
      
      Carlos: sure this just tells us that the posterior predictive distribution puts the actual data in the high probability region of the posterior. This is because the bayesian process concentrates the probability mass near the data. (ie. it “learns” the data)
      
      But there are models where that isn’t true. For example suppose we have a model where we’re fitting a curve and we require it to be a strongly curving quadratic. The real data is a straight line. The posterior is incapable of representing a straight line. Let’s generate posterior predictive data and calculate the sum of squared errors (SSE) as our test statistic. We’ll transform the sum of squared errors through the CDF of the posterior predictive SSE. Then when we calculate U_rep from a new posterior sample we’re going to get something whose u_rep = cdf(SSE_rep) is uniform…
      
      When we shove the actual SSE for the straight line data into the cdf, we’ll get something that’s nowhere near 0.5. It’ll be massively outside the expected region, because the posterior predictive data will follow a strongly curving parabola and the line will diverge from that parabola strongly; having U_actual be 0.0002 or .9993 or whatever indicates that the model doesn’t fit.
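A rough sketch of this parabola-versus-line check follows. Everything specific in it is an assumption made for illustration: the curvature is fixed at `C = 3.0`, the noise sd is known and equal to 1, the prior on the intercept and slope is flat, the x grid is symmetric about zero, and T(y) is taken to be the residual sum of squares after the best fit within the constrained family. The comment itself does not pin down any of these choices.

```python
import random

C = 3.0                                   # hypothetical fixed curvature
XS = [(i - 10) / 5 for i in range(21)]    # symmetric grid on [-2, 2]


def fit_line_part(ys):
    """Least-squares (a, b) for y = a + b*x + C*x**2 with C held fixed.

    Because the x values are centered, the intercept and slope estimates
    separate into a mean and a simple ratio.
    """
    zs = [y - C * x * x for x, y in zip(XS, ys)]
    a = sum(zs) / len(zs)
    b = sum(x * z for x, z in zip(XS, zs)) / sum(x * x for x in XS)
    return a, b


def sse(ys):
    """Test statistic T(y): residual sum of squares after the best fit."""
    a, b = fit_line_part(ys)
    return sum((y - (a + b * x + C * x * x)) ** 2 for x, y in zip(XS, ys))


def u_act(ys_obs, n_draws=2000, seed=1):
    """Empirical posterior predictive cdf of T, evaluated at T(y_obs)."""
    rng = random.Random(seed)
    n = len(XS)
    sxx = sum(x * x for x in XS)
    a_hat, b_hat = fit_line_part(ys_obs)
    t_act = sse(ys_obs)
    below = 0
    for _ in range(n_draws):
        # Posterior draw of (a, b): with a flat prior, noise sd 1 and
        # centered x, a and b are independent normals.
        a = rng.gauss(a_hat, (1.0 / n) ** 0.5)
        b = rng.gauss(b_hat, (1.0 / sxx) ** 0.5)
        y_rep = [a + b * x + C * x * x + rng.gauss(0.0, 1.0) for x in XS]
        below += sse(y_rep) <= t_act
    return below / n_draws


# Straight-line data that the forced parabola cannot represent.
data_rng = random.Random(0)
line_data = [1.0 + 2.0 * x + data_rng.gauss(0.0, 1.0) for x in XS]
print(u_act(line_data))
```

With these settings the observed SSE sits far in the upper tail of the replicated SSE values, which is the kind of gross misfit the comment has in mind. It says nothing about the subtler cases raised elsewhere in the thread.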
    - Carlos Ungil on May 5, 2021 4:35 PM at 4:35 pm said:
      
      > having U_actual be 0.0002 or .9993 or whatever indicates that the model doesn’t fit.
      
      Fine. But it’s not because we expected a uniform distribution for U_act. (To be fair, I can imagine one case where it would be uniform: if the data is ignored or irrelevant and the posterior is equal to the prior. The actual value of the statistic would be distributed according to its own sampling distribution!)
      
      By the way, I understand “actual data” to be the original data used to calculate the posterior that is in turn used to calculate the sampling distribution for the statistic. (From your comments it’s not completely clear to me what are you talking about.)
    - Carlos Ungil on May 5, 2021 4:43 PM at 4:43 pm said:
      
      > I can imagine one case where it would be uniform: if the data is ignored or irrelevant and the posterior is equal to the prior
      
      I should have added: and the data has been really generated according to that prior/posterior.
    - Daniel Lakeland on May 5, 2021 5:09 PM at 5:09 pm said:
      
      When we use a t test to determine if the results of collecting data from frog croaking of species B is “significantly different” from the established average amplitude for species A and we get p= .0002 and conclude that the amplitudes are probably different, it’s not because we expected frog B amplitude test stats to be uniform on (0,1) it’s because “if there were no difference” it’d be uniform on (0,1)
      
      By the same token, if the data were generated by the posterior predictive distribution then the U_act is going to be uniform. We suspect typically that it’s not, just like we suspect that frog B is different. When the PPP is 0.25 we can conclude that the Posterior Predictive Distribution doesn’t find the data set at hand highly abnormal.
      
      When we get PPP of 0.0003 we can say “the real data looks nothing like what the model expects”
      
      I suspect Sander’s point is something like this: the PPP doesn’t meet some frequency behavior, which is irrelevant to the point. If you have a tiny PPP then your model doesn’t predict reality should look like it does. That’s a bad model.
    - Daniel Lakeland on May 5, 2021 5:50 PM at 5:50 pm said:
      
      dang it I used a less than sign and lost some of the text… hopefully you can make out the intended content.
    - Sander Greenland on May 6, 2021 1:10 AM at 1:10 am said:
      
      Daniel: half-right since a prior predictive P is one answer of several to the question of how far (in ordinal/percentile terms) the analysis data diverge from what would be expected under the test model M (as opposed to what would be expected for other data based on test model + observed data, which is what PPP is about).
      
      Please note that in my usage the “test model” M I want to diagnose is not the single model M-hat that results from the model-fitting process, but rather is the entire model family determined by the model constraints, where in general those constraints may include penalty functions or prior distributions. As Box emphasized, this family encodes all the design and other information I want to merge with the data – provided the model and data don’t seem to conflict. T is some convenient function of the divergence of the data from this model family. If the data are also used to set constraints that define the tested family M, that double use has to be accounted for in the reference distribution for T if it is supposed to be a within-sample M-diagnostic, e.g., as in subtraction of degrees of freedom for chi-squared tests.
      
      The single model M-hat produced by the fitting process plays a role in the construction of T only in being the model in the family that is the projection of the data onto the family (the projection operator being determined by the fitting method). The divergence from M to the data is then measured from M-hat to the data (as in ordinary Euclidean geometry, where in measuring distance from an observed point to a region M, we measure to the nearest point in M). The paradigmatic example is the Pearson chi-squared where T is the squared standardized Euclidean distance from the family to the data; the analogous likelihood-ratio (deviance) statistic is another example where T is instead the Kullback-Leibler divergence.
      
      Again, please read my long responses to you below (stamped May 5, 2021 at 3:42pm and 8:42pm) and the JASA 2000 series.
    - Carlos Ungil on May 6, 2021 2:30 AM at 2:30 am said:
      
      > By the same token, if the data were generated by the posterior predictive distribution then the U_act is going to be uniform.
      
      Why do you keep repeating this? Unless when you say “posterior” you mean “prior for whatever observation comes later” it doesn’t make sense.
      
      In my example above you observe 42 and your posterior predictive distribution is normal(42, sqrt(2)). The distribution of T is also normal(42, sqrt(2)). The distribution is symmetrical and the cdf is 0.5 at the center which is 42.
      
      If the data point 42 was generated from that distribution U_act=0.5. (If the data point 42 was not generated from that distribution it’s also the case that U_act=0.5.)
      
      Don’t you agree that in that example U_act is not going to be uniform? It is always going to be 0.5! It doesn’t depend on what is the value that you got or where did it come from.
    - Daniel Lakeland on May 6, 2021 12:21 PM at 12:21 pm said:
      
      Carlos, sorry you’re right. Obviously U_act is just a number. It’d be like saying 32 has a distribution.
      
      What is true is that if you generate datasets using the posterior predictive distribution the distribution of test statistics under this random number generation will be uniform(0,1)
      
      U_act could be anything between 0 and 1, and if the model is capable of fitting the data reasonably well it’ll be a number in the “core” of that distribution, very often near 0.5 for example. If the model can’t fit the data because it’s constrained to some other manifold so that the posterior predictive data is “far” from the real data… it’ll be something in the tails of the uniform(0,1)
    - Sander Greenland on May 5, 2021 3:42 PM at 3:42 pm said:
      
      Daniel: Please read the Berger-Bayarri and Robins-Ventura-van der Vaart papers carefully, along with their discussants and replies. Those are devoted to discussing and deriving when P-values serve as valid measures of compatibility (their term) of a model with the data. Those took up a large part of that JASA 2000 issue, and as seen in the present discussion there seems to have been no progress toward resolution since. [Of course this may be no surprise given that conceptual progress in stats (if it exists) seems to operate on geologic time (if you’ve read the quarter-millennium old original essay of Bayes you may sympathize with that lament).]
      
      It looks to me like your mistake relative to the papers (that is, the general PPP mistake) is that you use the wrong cdf for the compatibility/information goals (as well as testing tasks) of calibrated frequentists, namely the posterior cdf. As B&B 2000 note (starting on p. 1128 below item c), that choice is the “double use” (their term throughout for the use of the posterior cdf as the reference distribution for a test statistic already computed from the data) that “can induce unnatural behavior” in the PPP (to use their politic description).
      
      B&B and RVV also note that, because of this double use, PPP is wrong for Bayesians too if they are interested in checking the model instead of predicting where the divergence statistic T will land in future studies. Future (or external) prediction is what PPP does, which is fine for replication prediction, but should not be confused with the pre-data prediction that model checking needs. Specifically, for checking I want to know how closely I could have predicted where T fell along a targeted direction using the test model M alone, BEFORE I got to see the data; that means (at least to some Bayesians) using the compound prior-sampling distribution as the reference cdf to get the prior predictive P-value of Box.
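The contrast between the two reference distributions can be illustrated with a small simulation. This is a sketch under assumptions that are not in the comment: a conjugate normal model with a proper prior mu ~ normal(0, `TAU`), a single observation y ~ normal(mu, 1), the statistic T(y) = y, and an upper-tail area. It is not a reconstruction of the B&B or RVV analyses.

```python
import random
from statistics import NormalDist

TAU = 3.0  # hypothetical prior sd for mu


def prior_predictive_p(y):
    # Before seeing data, the model implies y ~ normal(0, sqrt(TAU^2 + 1)).
    return 1.0 - NormalDist(0.0, (TAU ** 2 + 1.0) ** 0.5).cdf(y)


def posterior_predictive_p(y):
    # Posterior: mu | y ~ normal(w*y, sqrt(w)) with w = TAU^2 / (TAU^2 + 1),
    # so a replicate has y_rep | y ~ normal(w*y, sqrt(w + 1)).
    w = TAU ** 2 / (TAU ** 2 + 1.0)
    return 1.0 - NormalDist(w * y, (w + 1.0) ** 0.5).cdf(y)


def simulate(n_sims=5000, seed=0):
    """Generate data from the test model itself and record both tail areas."""
    rng = random.Random(seed)
    prior_ps, post_ps = [], []
    for _ in range(n_sims):
        mu = rng.gauss(0.0, TAU)   # the test model is true by construction
        y = rng.gauss(mu, 1.0)
        prior_ps.append(prior_predictive_p(y))
        post_ps.append(posterior_predictive_p(y))
    return prior_ps, post_ps


def share_below(ps, cut):
    return sum(p < cut for p in ps) / len(ps)


prior_ps, post_ps = simulate()
print("share below 0.05:", share_below(prior_ps, 0.05), share_below(post_ps, 0.05))
print("PPP range:", min(post_ps), max(post_ps))
```

In this toy setup the prior predictive tail area behaves like a uniform variable when the model is true, while the posterior predictive tail area stays bunched around 0.5 because the reference distribution has already been pulled toward the observed y.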
      
      Neither of these Bayesian tail areas is strictly Fisherian, however, in that they both require externally specified prior distributions, which Fisher as well as Neyman explicitly took pains to avoid (recall that, according to Fienberg, it was Fisher who relabeled such detested “inverse probability” approaches as “Bayesian”). In particular, a PPP is nothing like Fisher’s conception of a P-value, even though it is a tail area for T (it’s the reference distribution that’s wrong). Beyond that problem, the PPP uses the data to make predictions about where the test statistic will fall in unused replication, which is OK in (say) cross-validation. But it’s just common sense that it is “cheating” when computed from all the data and then used as a test of M against that data. So, no surprise when that leads to overoptimistic (too large or “excessively conservative”) P-values when the posterior cdf is used; the same problem afflicts naive frequentist estimates of prediction error based on applying the fitted model to data it was built from. [A mirror issue can be seen in empirical-Bayes methodology, where variances (and hence P-values) calculated without accounting for estimation of the prior will be too small, and CI too narrow, but adjustments for the data double counting can both restore their frequency validity (uniform P, nominal CI coverage) and improve their matching to hierarchical Bayes posterior distributions.]
      
      It thus turns out that both frequentist and prior predictive Bayesian formalizations align with and validate common sense and experience. Beyond that, both the B&B and RVV papers show how to fix the PPP to get approximately valid (uniform) M-diagnostic P-values that avoid or account for the double use, and which even have optimality properties. See also sec. 4.3 of Bayarri & Berger, “The Interplay of Bayesian and Frequentist Analysis”, Statistical Science 2004.
    - Daniel Lakeland on May 5, 2021 5:17 PM at 5:17 pm said:
      
      Thanks, I’ll try to take a look at the papers. I think one objection I’ll make right away is simply that I don’t give a fig about the Frequentist goal… The only use for a PPP that I can see is to determine whether the fitted Bayesian model fails strongly to predict its own data. See the example above where the model is forced to be a strongly curving parabola and the data is a straight line. The PPP for sum of squared errors under the parabolic model will expect sum of squared errors to be one distribution, and the sum of squared errors of the real data will be totally different resulting in the realization that the model predicts that the data will not look like it actually does and hence that the model is bad
    - Sander Greenland on May 5, 2021 8:42 PM at 8:42 pm said:
      
      Daniel: I think the problem with PPP isn’t shown by the example you raise because in it the misfit is obvious even with PPP alone. The objections to PPP come from the opposite case, where there is misfit that is not clear with PPP but obvious with valid corrections. B&B 2000 give reasons why even Bayesians would want uniformity and point out how the problem of “conservatism” (null-biased invalidity) is parallel for plug-in statistics and PPP.
      
      A paradigmatic plug-in example goes back to goodness of fit P-values in Pearson (1900), which took the number of deviations (residuals) N as the chi-squared degrees of freedom df, even when K parameters were estimated to compute the deviations. The result was using an N df reference distribution for what should have been an N-K df distribution, producing “values of P” much bigger than they should have been.
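The size of the degrees-of-freedom effect is easy to see numerically. The sketch below uses the closed-form upper tail of the chi-squared distribution that holds for even df, so it needs only the standard library; the values `N = 10`, `K = 2` and `T = 15.507` are hypothetical numbers chosen for illustration, not taken from Pearson or from the comment.

```python
import math


def chi2_sf_even_df(t, df):
    """Upper-tail area P(X > t) for a chi-squared X with even df.

    Uses the closed form exp(-t/2) * sum_{j < df/2} (t/2)**j / j!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles positive even df only")
    half = t / 2.0
    terms = (half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * sum(terms)


N, K = 10, 2   # hypothetical: 10 residuals, 2 estimated parameters
T = 15.507     # hypothetical statistic, near the 5% cutoff on N - K = 8 df

p_naive = chi2_sf_even_df(T, N)       # reference distribution with N df
p_fixed = chi2_sf_even_df(T, N - K)   # reference distribution with N - K df
print(p_naive, p_fixed)
```

For any positive T the N df tail area exceeds the N - K df tail area, so the uncorrected reference distribution always makes the fit look better than it should.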
      
      As the JASA 2000 papers note, the same magnitude of problem is seen with PPP. So, would you really not give a fig about missing a bad fit when just a technical tweak would reveal it? (I’ll hope you will read the whole JASA 2000 series before you answer that.)
      
      It’s an interesting footnote that it was RA Fisher who solved the df problem. The elder Pearson never forgave Fisher’s corrections, and apparently went to his grave maintaining that his N df reference distribution was correct (for one study see Stigler, “Karl Pearson’s Theoretical Errors and the Advances They Inspired”, Stat Sci 2008). This personal conflict may have had a more far-reaching effect on 20th-century statistics than Fisher vs. Neyman by contributing to Fisher’s hatred and banishment of Bayesian methods (inverse probability) from his early books (Stephen Senn would know more about that topic though).
    - Daniel Lakeland on May 6, 2021 12:27 PM at 12:27 pm said:
      
      Sander, unfortunately these articles are not publicly available and cost $45 each to get access, so I won’t be able to read them :-(
    - Sander Greenland on May 6, 2021 12:56 PM at 12:56 pm said:
      
      Did you try JSTOR? They often have articles from some journals without charge after some embargo period (usually 5-10 years) even when those remain paywalled at the original journal site.
    - Valentin Amrhein on May 5, 2021 8:31 AM at 8:31 am said:
      
      I don’t want to crash the party, and I enjoy reading all of the comments. But I’m afraid this discussion is almost absurdly academic. My rough guess is that 95% of scientists are not interested in what a P-value really means nor what a posterior predictive P-value is. All they want is a simple number that, if it crosses a universally accepted threshold, shows that their observed result is real. And if the threshold is not crossed, journals like JAMA still force authors to declare the result is zero. Intervals usually serve exactly the same purpose.
      
      I wonder what else we could do to try and change that. So far, it seems we failed quite miserably.
    - Andrew on May 5, 2021 9:46 AM at 9:46 am said:
      
      Valentin:
      
      It’s appropriate to be absurdly academic here at Columbia University, right? But, yeah, I agree on your larger point. I think of our absurdly academic discussions as ultimately being in support of these larger goals. One way to see this is to think about statistics teaching. I used to teach intro stat every year, but I pretty much stopped doing it a couple of decades ago because I was unsatisfied with what I was teaching. I didn’t want to just give a very vivid presentation of p-values or why to divide by n-1 rather than n, or the sampling distribution of the sample mean, or coverage of confidence intervals, or all sorts of other topics that, although individually interesting, were not (in my opinion) central to how statistics should be done. From my perspective it’s hard to figure out how to teach statistics better until we’re more clear on what we should be teaching. Hence the absurdly academic discussions. That said, I’m fully supportive of people who’d like to jump in and try to teach things better in the meantime.
    - Valentin Amrhein on May 5, 2021 11:04 AM at 11:04 am said:
      
      Andrew:
      
      In reply, just a short personal story: I also used to teach very basic applied intro stat for biologists for some years but stopped for the reasons you mention. Unlike me, my dear successors are real statisticians, and they are sympathetic to most of the points you and I and others often write about, so they basically stopped explaining P-values or confidence intervals. However, because our poor students, once they left the stat course, nonetheless almost invariably are asked by their supervisors to use P-values and confidence intervals to claim significant effects, that means I’m again forced to give short lectures to try and bridge the gap between our dear statisticians and the applied researchers.
      
      I agree that you and others do a huge amount of work to reach those applied researchers (and those statisticians). I just wonder if and when it might happen that it will be unfashionable in wider scientific circles to decide about real versus zero on the basis of that number computed from the data that just sits there vacantly on the computer output (in the words of Sander). Back in 2019, I heard somebody saying that as scientists, we should not be campaigning (https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion). But maybe we should even do more campaigning. For example, I hardly ever heard or read a statistician defending the status quo, i.e., defending null hypothesis significance testing. If that is true, it would be largely unknown among applied researchers. Most researchers I met think the usefulness of NHST is “hotly debated”. Or maybe they confound NHST with P-values. Some researchers are unsettled by extremists wanting them to ban P-values, or at least to try and ignore those nice stars given in the R output (btw, why is R still using stars for significance?). Others fear somebody may even want them to switch from SPSS to Stan.
      
      To be fair, maybe something IS changing. I recently heard from an Australian colleague that his students are a bit unsure whether good old NHST is still correct, what else they could do, and how sure they are meant to be about their results anyway. I think being unsure is a first step towards introducing skepticism about what we *might* infer (in the words of Richard)!
    - Andrew on May 5, 2021 11:14 AM at 11:14 am said:
      
      Valentin:
      
      I agree that an important part of education is explaining the fallacies of old mistaken ideas. I guess that when they teach history in schools now they explain the flaws of outmoded attitudes of national character, etc., and that when they teach political science they explain the flaws of naive views of constitutions that automatically preserve liberty, etc. Similarly with statistics it’s not enough to explain a good way to do things; it’s also important to explain the appeal of null hypothesis significance testing and what’s wrong with it.
      
      Lots of statistics teachers have not internalized the idea that null hypothesis significance testing has serious problems. For example, I remember a few years ago a statistics professor telling me how he liked to teach type 1 and type 2 errors by talking about the probability of convicting an innocent person or letting a guilty person go free. This bothered me for three reasons: first, because I don’t think scientific hypotheses are true or false in that way, second because statistical models being tested are just about always false, and third because the analogy to the courtroom is so vivid that it takes the student away from thinking about science.
      
      So it’s a tangle. I want to teach good statistical methods (whatever they are), but at the same time we have to introduce all sorts of irrelevant concepts relating to null hypothesis significance testing, just to explain why they’re generally misguided. And then we also have to deal with annoying technical things like least squares, dividing by n-1, the t distribution, and all the rest.
      
      I’m ok with teaching this material at a more advanced level, such as for social science grad students. I think we do a good job in Regression and Other Stories at handling all the issues discussed above. But I’m still not quite sure how to put this together in an intro class.
    - Andrew on May 5, 2021 11:18 AM at 11:18 am said:
      
      P.S. Finally, yes, I agree with you that we should be campaigning along with doing writing, teaching, and research. Maybe we shouldn’t be going on TV news shows to be making our argument, but I think campaigning by writing articles and trying to influence how research is presented and statistics is taught . . . that I think is important. After all, if we don’t campaign, it’s not like campaigning will stop. Other people with different views will continue to campaign about science, they will continue to appear on TV, and so forth. They have every right to campaign, and so do we.
PE on May 4, 2021 8:32 AM at 8:32 am said:

One of the issues here is that there’s no clear distinction between inference and decision in frequentist statistics. If you’re a Bayesian, the distinction is very clear: the inference is the whole posterior, and the decision, if needed, comes from minimizing a loss function. Inference is epistemic, it’s what you *know* about something (given some simplifying and idealized assumptions which might need to be checked, perhaps even with p-values, etc). Decision is doing something with that knowledge (say, prescribing a drug to a patient).

But in N-P statistics, there’s no such thing as drawing justifiable conclusions from data. You just “accept” or “reject” hypotheses in a prespecified way so that your hypothetical error rates are such and such (yes, in N-P statistics you can “accept” the null in this non-epistemic sense). The “inference” is just the decision. Or, more precisely, there is no inference, just a decision. To his credit, Neyman rarely uses the word “inference” to describe what he’s doing, especially in his later papers. That’s because he recognizes that what he does is not inference at all: his theory is all about inductive behavior, not inductive inference. So of course a p-value doesn’t automatically yield inference, even in the context of a perfectly specified model with a perfect procedure etc.

Fisher, on the other hand, thought that this whole decision thing was nonsense: science is about understanding, not about calculating costs and benefits. He considered the p-value a measure of evidence, and a low-enough p-value gives, in his own words, “rational grounds for the disbelief it engenders”. But he wasn’t entirely consistent on this point. For instance, sometimes he equates “disbelief” with rejection, which *is* a decision, so it *should* depend on costs. He also tried to prove that certain estimators were “optimal”, but using estimators on the basis of optimality is also a decision. Estimators are optimal for different purposes and different contexts.

(Some say that the differences between Neyman and Fisher are exaggerated, or that they only diverged because of silly quarrels. I don’t think that makes sense.)

- Sander Greenland on May 4, 2021 2:03 PM at 2:03 pm said:
  
  PE: What you wrote is a view I have often read. While it captures some aspects of the debate and I think it OK on NP (though really more Neyman alone) and Fisher, I don’t buy the descriptions of “frequentist” and “Bayesian” statistics. Those descriptions are a received story that is in large measure misleading, not because of any math error but because they are gross oversimplifications that don’t stack up against the realities of applied statistics or many current theories.
  
  I see the standard stereotypical accounts of “inference” vs. “decision” and “frequentism” vs. “Bayesianism” as attending to few of the many important issues in applications, as if these dichotomies must be saddled with every misconception held by prophetic founders and promoted in “philosophy of statistics”. So, while the descriptions do apply to some writers at some times (especially when those writers are being philosophical instead of practical), the reality is thousands of times more complex, as Good noted for “Bayesians” in this 1983 reprint of a 1971 letter: http://fitelson.org/probability/good_bayes.pdf
  For Bayesians, that complexity has been discussed before many times, e.g.,
  https://stats.stackexchange.com/questions/167051/who-are-the-bayesians
  I’ve seen no parallel account for “frequentists” but it could be done and would reach into thousands of possible categories.
  
  Many of us regard these distinctions as being silly when taken as rigid “philosophies”, even when they are practical for description of the tools and perspectives available for thinking about observations. Even for practical Bayesians the boundary between inference and decision is unclear; mere data description will involve subtle decisions that ought to involve deep understanding of the application, but which often default to decision rules that have some limited statistical rationale (like in the binning rule “have at least 5 observations per cell”), and calling the posterior “the inference” is foolish when, as is usually the case, it is sensitive to a myriad of choices made during its construction.
  
  The distinctions aren’t even well-founded in the math theory. Most discussions seem to forget the mapping that Wald drew between “Bayesian” and “frequentist” decision rules back in the 1940s and that Good later drew between “Bayesian” and “frequentist” hierarchies (see his cites in Good IJ, Hierarchical Bayesian and empirical Bayesian methods, Am Stat 1987;41:92).
  
  In light of these kinds of connections, the only descriptive use I can see for the standard dichotomies is in the purely logical (not philosophical) distinction of calling an output “Bayesian” if (very roughly) it is built from hypotheticals like Pr(nested model|data+embedding model) and “frequentist” if it is built only from hypotheticals like p = Pr(statistic|nested model). We can impose validation criteria on both of these types of outputs, including logical ones such as coherence (“Bayesian” betting requirements), mathematical ones such as calibration against reference distributions (“frequentist” repeated-sampling operating characteristics), and nonmathematical criteria (which are often treated as poor stepchildren by theoreticians) like contextual relevance and clarity. And we can use either type of hypothetical in decision rules. For example there’s often no great practical difference in using cutpoints on P-values for decisions versus using cutpoints on marginal posterior distributions when the P-values are clear limits of posterior probabilities (which turns out to be most of the time if one knows how to generalize them to hierarchical models).
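
As a minimal sketch of the point about P-values being limits of posterior probabilities, consider the simplest textbook case: a normal mean θ with known standard deviation, a one-sided test of θ ≤ 0, and a N(0, τ²) prior. The numbers and function names below are hypothetical and chosen only for illustration; the hierarchical generalization mentioned above is not attempted here.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_sided_p(xbar, sigma, n):
    """P-value for H0: theta <= 0 against theta > 0, known sigma."""
    z = xbar / (sigma / math.sqrt(n))
    return 1.0 - norm_cdf(z)

def posterior_prob_nonpositive(xbar, sigma, n, tau):
    """Pr(theta <= 0 | data) under an assumed prior theta ~ N(0, tau^2)."""
    precision = n / sigma**2 + 1.0 / tau**2      # posterior precision
    post_mean = (n * xbar / sigma**2) / precision
    post_sd = math.sqrt(1.0 / precision)
    return norm_cdf(-post_mean / post_sd)

# Hypothetical data summary: mean 0.4 from n = 25 observations, sigma = 1.
p = one_sided_p(0.4, 1.0, 25)
for tau in (1.0, 10.0, 1e6):
    print(tau, posterior_prob_nonpositive(0.4, 1.0, 25, tau), p)
```

As τ grows the posterior probability of θ ≤ 0 approaches the one-sided P-value, so in this special case a cutpoint on one is, in the limit, a cutpoint on the other. With a tighter prior centered at zero the two diverge, which is where the practical differences would show up.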
  
  That said, I wholly agree that Fisher and Neyman diverged for much deeper reasons than just silly quarrels. Neyman evolved into his behavioristic theory in which “inference” became barely more than a synonym for “decision”, and in which calibration criteria are paramount. In sharp contrast, Fisher evolved into what should now be recognized as the fuzzy border region of “reference Bayes”/“confidence distributions” in which frequentist P-values and hypothetical posterior probabilities merge into a single information or “inference” function (but Fisher tried to promote this fusion with his fiducial theory, which due to its logical gaps tarnished the effort and seems to have set the merger back by a half-century); in both frequentist and Bayes theories this information-summary function is clearly delineated from decision as the latter requires further specifications such as loss functions.
  
  - Russ Wolfinger on May 6, 2021 8:27 AM at 8:27 am said:
    
    Sander: Thanks very much for your insightful comments in this thread, with which I largely agree. Regarding the classic “frequentist” versus “Bayesian” dichotomy, I’ve always considered it to be a fundamental distinction in the underlying definition of the probability measure being invoked in a statistical data analysis. In general, Bayesians take this measure to be an epistemic degree of belief, whereas frequentists take it to be aleatoric, grounded in a real-world data generating mechanism. Of course there are thousands of nuances, cross-overs, and points of debate; that’s what makes the topic interesting and long lasting. But does not this basic dichotomy provide a reasonable starting point for a hierarchical classification of the debatable points, with the goal of understanding and best applying them? This would mean the two “Pr” measures in your fifth paragraph above are actually different sigma algebras, and one needs to be careful when mixing them. If this starting point is too messy, what do you think a better one would be?
    
    - Sander Greenland on May 6, 2021 11:33 AM at 11:33 am said:
      
      Hi Russ. Good questions you raise. I agree with your observations to a point…
      
      Nonetheless, I don’t think it’s true anymore in practice that “Bayesians take this measure to be an epistemic degree of belief, whereas frequentists take it to be aleatoric, grounded in a real-world data generating mechanism”, even if it was once true in a narrow academic context of “philosophy of statistics”. Furthermore I think that change reflects some wisdom in practice which we should adopt more explicitly, and so it seems harmful to continue with confounding the “frequentist/Bayesian” dichotomy with the “aleatoric/epistemic” dichotomy.
      
      Today I think many Bayesians (like me when I’m being Bayesian) can attest that they want all their probabilities to be as aleatoric as possible in the sense of being grounded in physical mechanisms (as in betting odds based on mechanical generators); conversely, many frequentists (like me when I’m being frequentist) treat their data probabilities as if those were tentative degrees of belief about properties of an unknown data generator. So that means we have “objective” aleatoric Bayesians and “subjective” epistemic frequentists all around (even within the narrow confines of my head), and they often seem to be doing well in their applications, even though there is as yet no single received formal theory sanctioning their approaches.
      
      Along these lines, Senn quipped about Gelman & Shalizi’s approach: “It works in practice but does it work in theory?” While I wouldn’t want to see creative applications repressed by theoretical dogma, theories can provide excellent checks for desirable features like internal logical consistency and consistency with practical goals and empirical facts. So if we think we see an approach working in practice then we ought to come up with theoretical explanations for why it works (in fact that’s what some math statisticians did when confronted by ad hoc algorithms from engineers that produced good predictions). Those explanations may tell us how to improve the approaches and can also warn us of where the approaches are being misused or can mislead.
      
      So, viewing the above dichotomies from that pragmatic perspective, consider that the equation of bets to physical relative frequencies when the latter are known seems a commonly used rationality principle [called the Principal Principle by Lewis (“A Subjectivist’s Guide to Objective Chance” in Jeffrey, Studies in Inductive Logic and Probability, vol. II, 1980)] and is what we advise people to use if they must gamble at casinos. Conversely, in my field I can’t take the fitted data models seriously (whether they are propensity scores or outcome functions) as anything more than tentative betting rules about what data will show (e.g., someone’s treatment or disease status given the covariates), knowing as I do all the unmodeled vagaries of selection, measurement, recording etc.
      
      In sum then, I think we need to deconfound the “frequentist/Bayesian” and “aleatoric/epistemic” dichotomies from one another, and also from goal classifications such as “data-summarization/smoothing/inference/decision”. These are all useful distinctions on their own. A further step would recognize them as regions on continuous scales, but how to do that is a long story (and will likely face fierce resistance if the desperate defenses of “significant/nonsignificant” dichotomania are any indication).
    - Keith O'Rourke on May 6, 2021 12:19 PM at 12:19 pm said:
      
      + 1
    - Daniel Lakeland on May 6, 2021 12:29 PM at 12:29 pm said:
      
      I personally see no problem with the statement that Bayes is about epistemic probability. Epistemic probability can be induced by aleatoric considerations, so it completely subsumes Frequentism. It is a strict superset.
    - Sander Greenland on May 6, 2021 1:49 PM at 1:49 pm said:
      
      Daniel: I believe that superset claim is the standard radical subjective Bayes position often attributed to DeFinetti (I think in Popper’s scheme it might be cast as a claim that probability is solely an object in world 3). While I’m sympathetic up to a point, I fear it can blur an important pragmatic distinction between the empirically-based probabilities that our audience and the public want for their own purposes and the various personal bets that may be pure lunacy.
      
      On the theoretical/philosophical side the superset assertion also has some parallel principled dissent. For example some in quantum mechanics point to the Born rule as providing pure aleatoric probability not subsumed by Bayesian philosophy since it appears to be a genuine law about relative frequencies that exists out there whether anyone or anything knows it, in the world beyond our heads; even among QBayes adherents (who, if I understand correctly, see it as a law about what observers experience) it’s still a law that exists outside of any personal mind or bet (and thus in Popper’s world 1 and the ordinary usage of “probability” to represent a physical system’s propensity to produce a certain relative frequency).
      
      Regardless, I hope no one will object to this quote from DeFinetti about formal theories which should apply to meta-theories as well (whether radical Bayes or radical logicism or radical frequentism or whatever), and which I think can’t be repeated too often:
      “…everything is based on distinctions which are themselves uncertain and vague, and which we conventionally translate into terms of certainty only because of the logical formulation…In the mathematical formulation of any problem it is necessary to base oneself on some appropriate idealizations and simplification. This is, however, a disadvantage; it is a distorting factor which one should always try to keep in check, and to approach circumspectly. It is unfortunate that the reverse often happens. One loses sight of the original nature of the problem, falls in love with the idealization, and then blames reality for not conforming to it.” [Theory of Probability vol. 2, 1975, p. 279]
    - Daniel Lakeland on May 6, 2021 5:17 PM at 5:17 pm said:
      
      Even if physical probabilities exist out in the real world, a statistician or physicist or other human is always reduced to questions about what they know about the world. What they know about the physical probability is just one such question. If I set up a certain QM experiment in a lab, before doing the experiment I have a question about the frequency distribution of the outcome, and after doing the experiment I have a relatively strong knowledge of what the frequency distribution will be… My probability calculations converge to a delta function around the physical frequency distribution. It all makes perfect sense.
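
One way to picture the convergence described above is a conjugate Beta-binomial sketch. Everything specific here is an assumption made for illustration: a uniform Beta(1, 1) prior, a hypothetical physical frequency of 0.3, and idealized counts that land exactly on their expected value.

```python
import math

def beta_posterior(successes, trials, a=1.0, b=1.0):
    """Posterior mean and sd of a frequency under an assumed Beta(a, b) prior."""
    a_post = a + successes
    b_post = b + trials - successes
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total**2 * (total + 1.0))
    return mean, math.sqrt(var)

true_freq = 0.3  # hypothetical physical frequency
summaries = []
for n in (10, 1_000, 100_000):
    k = round(true_freq * n)  # idealized count, not a simulated draw
    mean, sd = beta_posterior(k, n)
    summaries.append((n, mean, sd))
    print(n, mean, sd)
```

The posterior standard deviation shrinks roughly like 1/sqrt(n), so the belief distribution piles up around the observed relative frequency. The sketch says nothing about whether that frequency is a law "out there"; it only shows the updating arithmetic.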
    - Paul Hayes on May 7, 2021 4:04 AM at 4:04 am said:
      
      “For example some in quantum mechanics point to the Born rule as providing pure aleatoric probability not subsumed by Bayesian philosophy since it appears to be a genuine law about relative frequencies that exists out there whether anyone or anything knows it, in the world beyond our heads; even among QBayes adherents (whom if I understand correctly see it as a law about what observers experience) it’s still a law that exists outside of any personal mind or bet (and thus in Popper’s world 1 and the ordinary usage of “probability” to represent a physical system’s propensity to produce a certain relative frequency).”
      
      No, the QBists’ view of QM is (FAQBism FAQ #4):
      
      “A quantum state encodes a user’s beliefs about the experience they will have as a result of taking an action on an external part of the world. Among several reasons that such a position is defensible is the fact that any quantum state, pure or mixed, is equivalent to a probability distribution over the outcomes of an informationally complete measurement [8]. Accordingly, QBists say that a quantum state is conceptually no more than a probability distribution.”
    - Sander Greenland on May 7, 2021 12:22 PM at 12:22 pm said:
      
      Paul Hayes: Thanks for the link! But I don’t get your “No, the QBists’ view of QM is (FAQBism FAQ #4)…” – Why the “No”? I don’t see the conflict between what I wrote and what’s in your linked paper or quotation from it, so please clarify where it is in light of the following:
      
      Long before QBism was developed, DeFinetti gave a radical subjective Bayesian view of QM phenomena and their probabilities, which is the earliest version of what I’ve seen called a QBayes interpretation (note that your linked paper distinguishes QBism from the more vague, general QBayes category). But DeFinetti lacked the explanatory and math details (especially about measurement, a major sticking point among QM interpretations) that a particle physicist would rightly demand. I can’t claim any expertise, only that I’ve read it in the work of Fuchs and colleagues and thought of QBism as one refinement which adds such details, allowing it to match aleatoric conceptions (in which Born’s rule is an external probability law “out there” beyond agents or minds and thus beyond radical subjective Bayes). The key difference is that QBism locates the probability rule on the agent side of the physical path from the data generator to the agent’s coherent bet. This move seems central to QBism’s resolution of QM strangeness, so please correct me if it is a misconception.
      
      Consider the start of the quotation you sent, “A quantum state encodes a user’s beliefs about the experience they will have as a result of taking an action on an external part of the world”. That leaves an ambiguity due to being out of context of the paper: The state is the “user’s” belief only if the user equates their belief to the state. An objectivist could say that quote assumes implicitly that a user adheres to Lewis’s “Principal Principle” in transferring the distribution from an aleatoric QM law to the user’s belief, and my reading is that QBism indeed assumes an equivalent principle but earlier in the event path.
      
      Defenders of objectivist interpretations (like many-worlds, which I don’t care for) can point to the hiddenness of this assumption in your quotation as masking and inviting confusion of the world “out there” with our beliefs about it. That is a variant of my initial response to the DeFinetti subset claim espoused by Lakeland (that frequentist/aleatoric probability is a proper subset of subjective/epistemic probability). I don’t hold to that subset view as a dogma, but I have found it’s a most useful way to approach probability in soft sciences like health-med, social, etc. where aleatoric laws are only assumptions (and more often than not, very hypothetical oversimplifications of reality).
      
      As a complete amateur regarding QM, I find QBism the most compelling interpretation system for it I’ve read. So what I find fascinating about QM controversies is not some direct relevance for soft-science applications (I’d guess there’s none), but rather that in QM objectivists can seriously challenge the subjective Bayes view (and in particular the subset claim), and thus stand firm against radical subjective Bayes (RSB) as a religion. And that’s great: As with all such metascientific/epistemic positions, I see RSB as one extremely useful perspective for inference and decision problems; but so are its opponent positions in certain contexts (e.g., apparently in many if not most engineering and physics settings), and they can both be used in the same application to good effect. I think the name for my position in philosophy is perspectivalism, or more specifically epistemological pluralism.
    - Paul Hayes on May 7, 2021 3:29 PM at 3:29 pm said:
      
      Yes, sorry. The “No” was because the QBists don’t view the Born rule as an “out there” law but I forgot that their interpretations of state and rule aren’t tied together as strongly as they are in the “QM as applied generalised probability” context and so you’re right: on its own that quote wasn’t enough.
    - i.e. rabinovitz on May 7, 2021 5:21 PM at 5:21 pm said:
      
      “Even if physical probabilities exist out in the real world, a statistician or physicist or other human is always reduced to questions about what they know about the world.”
      
      The physical world exists ‘out in the real world’ (and we are a part of it too of course). And what we ‘know’ about it — if that so-called knowledge is to be any pragmatic guide at all to action — must be conditioned on what we believe has happened in that real world in circumstances similar to those in which we stand, when we propose to act.
      
      If our beliefs are to be any guide at all they must be informed by what we or others before us have seen; or at the very least what they presume that they have seen. Probabilities, whether you call them “aleatoric” or “epistemic”, if they are of any relevance as useful guides to action, have to be grounded in summaries: summaries of worldly experience.
    - Carlos Ungil on May 8, 2021 4:07 PM at 4:07 pm said:
      
      I don’t think that QBism (née Quantum Bayesianism) solves QM strangeness. But I cannot say that I really understand what they are trying to say, to be fair. The fact that there is a stream of publications in the last two decades that say different things doesn’t help.
      
      The starting point is reasonable: in general our knowledge of quantum states is incomplete (not pure states) and our measurements imperfect (not projectors). The density matrices representing quantum states are (proper) mixtures and our probabilistic predictions reflect both our uncertainty about the state and the quantum indeterminacy that would remain even if our knowledge were complete. The changes in our description of the quantum state when we do a measurement represent in part a refinement of our knowledge and in part a physical change. But if the system was in a pure state and our knowledge was already maximal there is no “Bayesian updating”, only strangeness.
      
      Another discussion of the subject can be found in “The quantum Bayes rule and generalizations from the quantum maximum entropy method” by Kevin Vanslette [ https://iopscience.iop.org/article/10.1088/2399-6528/aaaa08 ], which is more in the line of Jaynes (by the way, one can find online a draft of a chapter “Maximum Entropy: matrix formulation” that didn’t make it into PTLOS).
      
      From there, they get into metaphysical discussion where I’m lost. To make clear that I’m completely missing the point I’ll just say that if you prepare some quantum systems (and have some predictions about measurements) and while you look elsewhere I manipulate them (and have my personalistic predictions about measurements) what we may find is that when you do your measurements the outcomes match my expectations and not yours. As if some subjective descriptions were more objective than others…
      
      I wouldn’t say either that the existence of “objective probabilities” makes the “subset claim” invalid. At least in the sense that some of us may use it to defend ourselves against arguments like “You can have epistemic probability, or you can have aleatory/frequentist probability, but you should decide which one you want.”
    - i e rabinovitz on May 6, 2021 5:34 PM at 5:34 pm said:
      
      To the extent epistemic probability is influenced by sequences of relevant prior events, that epistemic probability is a degree of rational belief. To the extent it is not so influenced, it is not so rational. If we create a calculus that is grounded in so-called “rational belief” we must be clear: some evidence is relevant and some is, to put it as charitably as possible, less so. Rational beliefs are supported by relevant reference-classes of events. If I invent a scheme which assigns a numerical probability to the hypothesis that rabbits live on Mars and I wish to persuade anyone at all that it is grounded in “rational belief”, I suppose I should be prepared to line up my references (in which rabbits live hither and thither, some concrete set of observations of rabbits in approximately Martian climates, first or second hand at any rate). There is a rational reference set behind every rational belief; and, conversely, an “irrational” reference set probably lurks behind many irrational beliefs. The numerical calculus just does what it is told to do with these!
    - i.e. rabinovitz on May 6, 2021 11:34 PM at 11:34 pm said:
      
      Rosenthal, and others above and below:
      
      “so the larger point is just that it kinda doesn’t matter whether it’s aleatoric or epistemic “down there”. Of course, in practice none of this comes for free – we build models and make inferences conditional on assumptions and the various formalizations we’ve decided to be comfortable with…”
      
      If “epistemic” means anything at all, it means of or pertaining to what we know (or think we know). And if what we know (or think we know) means anything at all, it must be anchored in what we have seen, or what we think we have seen. There is no rational, reasonable, pragmatic sense of “epistemic probability” which is not fundamentally rooted in induction from *experience*. “Epistemic” probabilities, if they indeed stand somehow or other in experience, are necessarily rooted in “reference classes”: No more or less so than the “reference classes” favored by the so-called frequentists.
      
      The “epistemic” and “aleatoric” do not and cannot derive from radically different sources of experience.
      
      “Aleatory” emphasizes that aspect of the reference class not easily subsumed within a deterministic description; the classical system best characterized by this adjective was formerly that of statistical mechanics; wherein the regularities observed were emergent properties of the “aleatory” elements in the molecular stratum. The modern quantum theory is the logical continuation of this mode of description; where the underlying “aleatory” properties are not even properties at all; the theory makes claims only for properties emerging out of nothing, as it were.
      
      The adjective “epistemic” (in connection with the probability model) emphasizes that aspect of our experience which is what we *record* as having seen or learned from our interaction with …. a stratum of events relevant to the question at hand, whatever it might be. A reference class.
    - i.e. rabinovitz on May 7, 2021 12:06 AM at 12:06 am said:
      
      And if the epistemic category is a strict superset, then what are the rational grounds for beliefs which are not rooted in experience of regularities –in reference classes– of the world, seen, remembered, summarized …. as probabilities? I see no such grounds. Yes, there are grounds for beliefs which are other than these; but why should any arbitrary belief, simpliciter, be admitted as support for any proposition at all? Calling probability “epistemic” cannot invest “probability” with some character that makes it transcend the domain of our experience; for what we wish (rationally) to do with “probabilities” is organize our past experience; so our past experience better lends itself to anticipating subsequent experience. We organize what “we know” therefore; but if it is to be useful, it ought to be rooted in regularities of experience. If what we claim “we know” is not so rooted; well so be it, we are then fooling ourselves and others. But that is to be expected!
    - Carlos Ungil on May 7, 2021 3:01 AM at 3:01 am said:
      
      And calling probability “aleatoric” cannot invest it with some character that makes it transcend the domain of our knowledge. Does that mean that any distinction is completely arbitrary and useless?
      
      If you ask most people what are the “probability that the mother of Lincoln was blonde” and the “probability that the red ball will be 7 in this Sunday’s lotto draw” they will agree that they lie on different sides of the divide. They may also say that the former doesn’t make sense, if they reject the idea of epistemic probability.
    - i.e. rabinovitz on May 7, 2021 3:48 PM at 3:48 pm said:
      
      “If you ask most people what are the “probability that the mother of Lincoln was blonde”
      
      If the “epistemic” probability of such an assertion is merely “what I think I know about it” — simpliciter — that is a cop-out. What matters is the *reason* for which I think I know such and such. Logic does not describe *how* we think, but how we *ought* to think … if we seek to be persuasive.
      
      My “epistemic” probability (for such an assertion) will be grounded (if it is not mere verbal nonsense) in my recollections of: hairstyles of wives of (American) politicians, my recollections of assertions by first or second-hand chroniclers of the same. Less persuasively, it may be grounded in my recollections of persons I believed were wives of politicians — i.e. proxies for them — but may, in truth, not have been.
      
      All in all, whoever they may have been, I seem to recall (or think I recall) that some were blonds and some were not. And thus, by this expedient, I have conjured up a retrospective study of the matter; divided my observations into two groups (those with the character in question and those without) and produced a ratio.
      
      The fact that the retrospection is subject to gross error of memory, absurdly limited sample, and all other manner of cognitive confusion and bias, does not change the fact that if I claim I know about the propensity of politicians’ wives’ hair I am reporting what I have learned: by recollections, by chronicles. I have created an experimental record. A reliable one? Of course not. But it is the best that I can muster. Just because a retrospective experiment is a poor or weak one does not mean it sought access to a genus of knowledge different from what a well-performed or strong one does.
      
      What knowledge can I appeal to; other than my recollections of what I see; and my recollections of what others say they see; and my recollections of the record of the veracity of those who say they see?
    - PE on May 6, 2021 1:05 PM at 1:05 pm said:
      
      Just to complement what you said, here’s a quote by Frank Ramsey, a Bayesian, in “Truth and Probability” (1926):
      “Probability is of fundamental importance not only in logic but also in statistical and physical science, and we cannot be sure beforehand that the most useful interpretation of it in logic will be appropriate in physics also. Indeed the general difference of opinion between statisticians who for the most part adopt the frequency theory of probability and logicians who mostly reject it renders it likely that the two schools are really discussing different things, and that the word ‘probability’ is used by logicians in one sense and by statisticians in another. The conclusions we shall come to as to the meaning of probability in logic must not, therefore, be taken as prejudging its meaning in physics”
      (Obviously, statisticians and physicists are much more comfortable with Bayesianism today than they were when he wrote this.)
      
      Ramsey’s point is that you can consistently adopt many interpretations of probability (as long as you are cautious, perhaps by using different notations to refer to the different types of “probabilities” when not doing so might cause confusion). When people speak of “probability of a probability”, for example, they’re usually using two different meanings of probability in the same sentence. That’s fine, the many different types of probability can coexist. As you note, physical chances should inform your degrees of belief (or when you don’t know the chance, you might want to estimate the “plausible” values of the chance parameter). But it would be nice if the concepts were kept distinct (though not completely separable from each other) to avoid confusion.
      
      Not everybody is happy with this pluralism. de Finetti, writing at the same time as Ramsey, tried to completely reduce “physical chances” to subjective probability. But we don’t have to be as radical as him.
      
      (It’s worth noting that frequentist interpretation of probability != frequentist statistics. To give an example, A.W.F. Edwards thought that probability can only ever refer to frequencies, but he criticized most of what we understand by “frequentist statistics” today.)
    - Chris Wilson on May 6, 2021 3:07 PM at 3:07 pm said:
      
      Not an expert on de Finetti at all, but I’ve always interpreted his signature contribution here, the Representation Theorem, not as reducing one to the other but as showing a mathematical convergence, an ‘as if’ condition. Thus, we can do Bayesian inference assuming exchangeability of some sort and remain agnostic about the source of uncertainty!
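
A small finite-sequence sketch may help make the ‘as if’ reading concrete. It assumes a uniform mixing distribution over the Bernoulli chance, which yields the familiar rule-of-succession predictive; the function names are hypothetical, and this illustrates only the mixture side of the Representation Theorem, not a proof of it.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def sequence_probability(seq):
    """Probability of a 0/1 sequence built from one-step-ahead predictives.

    Assumes a uniform mixing distribution, so after s ones in i trials
    the predictive probability of a one is (s + 1) / (i + 2).
    """
    prob = Fraction(1)
    successes = 0
    for i, x in enumerate(seq):
        p_one = Fraction(successes + 1, i + 2)
        prob *= p_one if x == 1 else 1 - p_one
        successes += x
    return prob

def mixture_formula(seq):
    """Same probability from integrating theta^k (1 - theta)^(n - k) over [0, 1]."""
    n, k = len(seq), sum(seq)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

seq = (1, 1, 0, 1)
orderings = set(permutations(seq))
probs = {sequence_probability(s) for s in orderings}
print(probs, mixture_formula(seq))
print(sequence_probability((1, 1)), sequence_probability((1,)) ** 2)
```

Every reordering of the sequence gets the same probability (exchangeability), and that probability matches the mixture integral, yet the trials are not independent: two ones in a row have probability 1/3 rather than 1/4. The arithmetic is the same whether the mixing distribution is read as belief about a chance or as a frequency of chances, which is the agnosticism described above.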
    - PE on May 6, 2021 3:41 PM at 3:41 pm said:
      
      I’m not an expert on anything! Perhaps he changed his views at some point, but de Finetti does repeatedly say that any talk of “physical chance” is metaphysical nonsense, it’s all just a fancy way of describing convergence of beliefs, and “parameters” are just useful fictions for making predictions about observables (you don’t bet on a parameter, you only bet on what can be verified). For a critical discussion, see chapter 4 of Gillies’ “Philosophical Theories of Probability”. But, of course, we can accept de Finetti’s Representation Theorem without subscribing to his interpretation of it.
      
      Physical chance is a weird concept (Jaynes exposes some problems in his book “Probability Theory”). But if you don’t take it too seriously and realize that it’s just a simplification of something much more complex, it can be useful.
    - Chris Wilson on May 6, 2021 6:12 PM at 6:12 pm said:
      
      He is indeed notorious for saying “probability does not exist!” :) I think the key here is similar to where Lakeland is going above: you can subsume the question of frequencies and physical chances into exchangeable degrees of belief via the Representation Theorem, so the larger point is just that it kinda doesn’t matter whether it’s aleatoric or epistemic “down there”. Of course, in practice none of this comes for free – we build models and make inferences conditional on assumptions and the various formalizations we’ve decided to be comfortable with…including the idea that canonical probability theory is a suitable way to model uncertainty in the problems we’re working on…
    - Sander Greenland on May 6, 2021 6:48 PM at 6:48 pm said:
      
      +1
      and thanks PE for bringing in Ramsey, I was thinking he needs to be cited in this discussion.
      
      To meet the internal consistency criterion (and this responds to one of Russ Wolfinger’s questions) I think of applied probability and stats in terms of information models. I see the latter as a practical refinement of radical Bayes in which radical frequentism is a very special case (consistent with DeFinetti’s strict subset view that Lakeland mentioned, but more detailed). In it, additional constraints are imposed on probability models by causal information (“Bayesian”) networks, as per Pearl, Robins and others. We try to imagine how our probability models follow logically from background information, with basic gambling examples showing how we use information about causal stability and independence to arrive at models like the binomial, and with more complex stories showing how we might discount that information.
      
      More generally, suppose we are not simply running data through stock models in software in the hopes that an algorithm will pick up some pattern. Then our information leads to specifications that involve data (to-be-observeds) Y and unknown parameters B, along with other unknown parameters C which we will only condition on because we have too little time or information to include in a credible prior specification. We want a linear function E(y,b;c) that summarizes the kind of patterns we expect in (Y,B) based on our input information and the additional hypothetical information of C=c [DeFinetti calls this function “Prevision”, often translated as “Expectation” but conceptually more than just the first moment of a random variable]. These functions can be translated quickly to probability functions P(y,b;c) by using indicators.
      
      In strict “classical” frequentism B is empty and the only information allowed in P(y;c) is from verified physical properties of the Y-generator, but it can readmit B if B represents a physical “random effect” with a distribution that may have indexes in C, as in hierarchical/mixed models.
      In strict radical Bayes C is empty and one pretends it is possible to make practical progress putting all unknowns in B and treating P(y,b) as a coherent betting schedule.
      For ecumenic pragmatists (sometimes labeled penalization or shrinkage frequentists or semi or partial Bayesians) the whole range between these extremes is accessible, and any source of information can enter anywhere.
      
      But direct joint specification of P(y,b;c) can be impractically difficult, so it is approached piecewise or modularly, and tentatively, with various forms which can and arguably should be indexed explicitly in C. Also, there can be many different useful allocations across B and C. There can also be many factorization possibilities; mostly I see and use the “Bayesian” type P(y,b;c) = P(y|b;c)P(b;c). This seems natural, I think, because the information source for P(y|b;c) is supposedly our knowledge about the actual physical data (Y) generator (the study design and conduct), whereas the source for P(b;c) is supposedly outside the study. This makes the two seem qualitatively different, but that is a category error; in reality anything could be informing either factor, and the source composition can vary quite a bit depending on the allocation between B and C.
      
      The math for doing “analysis” once a specification for P(y,b;c) is given is pretty much worked out and programmed, even if not presented or even recognized in the above form – it just becomes algorithmic data processing. That leaves the specification problem and the infamous Garden of Forking Paths, which typically injects much pseudo-information (what justified using age instead of log age as a covariate?). But such arbitrary elements are unavoidable if we want the programs to process the information into contextually interpretable summaries (which is one parsimonious idealization for “inference”). That is one reason I advocate unconditional interpretations of model diagnostics: Any discrepancy or lack thereof may have more possible causes than we accounted for in our model or care to imagine.
      
      By the time I got out of grad school I had been convinced that causal network models were vital for this specification task, in both easing it and making sure the probability conversion and final summaries made sense contextually (my first paper using causal diagrams was published in 1980). Many others have reached and taught the same conclusion, and over recent decades causal modeling tools have exploded in rigor, depth, and breadth. Yet they are still not a central part of applied statistics texts and courses I see, a failure to integrate a topic that I would consider as essential as software use. See more on that lament at https://arxiv.org/abs/2011.02677
    - PE on May 7, 2021 6:27 PM at 6:27 pm said:
      
      Thanks a lot for the insightful comments! My reading list has grown after reading them.
    - Sameera Daniels on May 7, 2021 7:07 PM at 7:07 pm said:
      
      Thanks Sander for mentioning Frank Ramsey. My aunt, a mathematician, kept his name and views alive in our family.
    - i.e. rabinovitz on May 7, 2021 10:21 PM at 10:21 pm said:
      
      How can beliefs — if rational — about what is likely or not likely to occur not be grounded in concrete facts of experience in the “real-world” ?
- rm bloom on May 4, 2021 8:38 PM at 8:38 pm said:
  
  If I do what I really ought to do if I am an experimentalist and repeat my experiment until I consider the result reproducible then I have indeed engaged in “inductive behavior” and I will then have a series — perhaps not an excessively long series, but a series nonetheless — of model fits (if I bother myself to fit a model at all). And if I seek reassurance from a statistical procedure, which I would apply habitually (even if I were so hasty as to carry out only a singular experiment and not a series), then I should very much like the sequence of statistics generated by that series of model fits to confirm my inductive conclusion: that the experiment is reproducible (if it is) and that the sequence of results is confirmative of some supposition I find worth the trouble to attend to asking about.
  
Daniel Lakeland on May 8, 2021 6:53 PM at 6:53 pm said:

Sander: I hope you see this, I’m re-starting discussion at the bottom because it’s a bit too confused at the top. Beginning to read through Bayarri and Berger, after recovering for a few days from my 2nd COVID shot. I see they mention both the Posterior Predictive P and the Prior Predictive P. They say:

“The main strengths of p_prior are that it is also based on a proper probability computation… and that it suggests a natural and simple T… the main weakness of p_prior is its dependence on the prior pi(theta)…”

…

“The main strengths of P_posterior are as follows:
a) Improper noninformative priors can readily be used
b) m_post(x|xobs) typically will be much more heavily influenced by the model than by the prior….
c) It is typically … easy to compute … from … MCMC”

I personally find all of these things misguided. For a Bayesian, the prior **is part of the model**: it’s a statement about what we think is more or less likely to go on in the world. So I don’t see “its dependence on the prior” as a weakness of the prior predictive; in fact, a major thing I’d want to use the prior predictive distribution for is to confirm that the prior correctly spells out what I really do know about the world. It can be difficult to directly specify a prior, but it can be much easier to check the prior predictive against the kind of data I expect (for example, by specifying a small fake dataset and then computing the prior predictive p for that dataset).
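
The fake-dataset check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the normal data model, the particular priors on `mu` and `sigma`, the choice of the sample mean as the statistic T, and the fake body-weight numbers are all hypothetical and do not come from Bayarri and Berger or from this thread.

```python
import random
import statistics


def prior_predictive_p(fake_data, n_sims=5000, seed=0):
    """Prior predictive p-value using the sample mean as test statistic T.

    Hypothetical model (an assumption for illustration only):
        mu    ~ Normal(200, 40)     prior on the population mean
        sigma ~ Uniform(10, 60)     prior on the population sd
        y_i   ~ Normal(mu, sigma)   data model
    """
    rng = random.Random(seed)
    n = len(fake_data)
    t_obs = statistics.fmean(fake_data)

    exceed = 0
    for _ in range(n_sims):
        # Draw parameters from the prior, then a dataset given those parameters.
        mu = rng.gauss(200.0, 40.0)
        sigma = rng.uniform(10.0, 60.0)
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]
        if statistics.fmean(y_rep) >= t_obs:
            exceed += 1
    # Fraction of simulated datasets whose T is at least as large as the fake data's T.
    return exceed / n_sims


# A small fake dataset of the kind I would expect to see (made-up numbers).
fake_weights = [150.0, 172.0, 188.0, 205.0, 231.0]
p = prior_predictive_p(fake_weights)
print(f"prior predictive p = {p:.3f}")
```

A p near 0 or 1 for data I consider perfectly plausible would suggest that the prior, as part of the model, does not say what I meant it to say.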

The “strengths” of the P_posterior mentioned are also not consistent with my Bayesian practice. I **never under any circumstances** use improper priors. They are **not** uninformative; in fact, what they say is that the parameter value is almost surely infinite, whether positive or negative. Furthermore, m_post … “more heavily influenced by the model than by the prior” makes a distinction that doesn’t exist in my mind (the prior **is** part of the model).

- Andrew on May 8, 2021 9:05 PM at 9:05 pm said:
  
  Daniel:
  
  I agree. I think the problem is that all these people are working within a classical tradition in which the goal of a hypothesis test is to reject false hypotheses. But I only work with false hypotheses. All my models are false. That’s why I speak of “model checking” rather than “hypothesis testing.” Rather than there being some null model I want to reject, I’m in the position of having a model I like, but which I know is flawed, and the purpose of these tests is to reveal flaws in the model, in this case flaws of the form that the model is predicting things that are much different from what has been observed in the past.
  
- Chris Wilson on May 9, 2021 1:18 PM at 1:18 pm said:
  
  It’s funny, when people talk about “having to specify a prior” as a weakness, my immediate thought is always the weakness of “not specifying a prior”! I wonder what set of asymptotic assumptions is being invoked instead and how it stacks up against encoding external information/considerations/constraints in a prior model….
  
  - Daniel Lakeland on May 9, 2021 2:09 PM at 2:09 pm said:
    
    Exactly. If your confidence interval is equivalent to a Bayesian interval with a uniform prior (confidence intervals constructed from likelihood-based tests, for example), then by using it you are saying essentially that you’re a Bayesian who is confident before seeing data that your parameters are all greater than 10^300 in magnitude but has no idea about the sign. That’s enormously silly in essentially 100% of applied problems.
    
    - Nick Adams on May 10, 2021 7:24 AM at 7:24 am said:
      
      No, for a dividing hypothesis you can specify prior odds (“a likelihood prior”) and make no claims whatsoever about the shape of the prior probability distribution, only the relative distribution of its mass on either side of the divide.
    - Chris Wilson on May 10, 2021 9:57 AM at 9:57 am said:
      
      Daniel is talking about priors for continuous parameters, not priors over discrete models a la Bayes Factors…
    - Daniel Lakeland on May 10, 2021 12:21 PM at 12:21 pm said:
      
      Imagine a model for something like the calibration voltage for a circuit designed to measure say oxygen concentration in automotive exhaust (like for a fuel injection computer). You know for example that if you dial in this voltage correctly your circuit will read out percentage oxygen as a digital readout on a gauge, and then the fuel injection computer will accurately supply fuel.
      
      A frequentist can collect say 10 measurements from an exhaust manifold which is rigged with both a pre-calibrated sensor and an uncalibrated sensor, then using a normal measurement error, can try to infer the required calibration voltage.
      
      If as a frequentist they insist on “not using a prior” and rely on a likelihood-based confidence interval, they may get a confidence interval that is, say, [2.7, 3.2] volts with a maximum likelihood estimate of 3.03 volts. However, suppose the CI construction method mathematically gives the same interval you would get if you did a Bayesian posterior using a “noninformative” improper prior where the prior is “uniform on the real numbers”. The prior which is “uniform on the real numbers” can be thought of as a nonstandard distribution in nonstandard analysis. This density is 1/(2N) on the range [-N,N], where N is a nonstandard integer. These nonstandard distributions have the property that for any limited number x, all but an infinitesimal amount of probability is located in the region outside [-x,x]. In other words, “frequentists are mathematically like Bayesians who believe that their parameter is almost surely infinite in magnitude but could be either positive or negative”.
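      
      To make the limiting claim concrete, here is a short worked calculation using an ordinary uniform prior on [-N, N] with N finite and then growing without bound. It is a sketch of the intuition only, not the nonstandard-analysis construction itself, and the symbol θ for the parameter is an assumption introduced for illustration. For any fixed x with 0 < x < N:
      
      ```latex
      P(|\theta| \le x) = \int_{-x}^{x} \frac{1}{2N}\, d\theta = \frac{x}{N},
      \qquad
      P(|\theta| > x) = 1 - \frac{x}{N} \;\longrightarrow\; 1 \quad \text{as } N \to \infty .
      ```
      
      So with x = 10 volts and N = 10^6, the prior already puts 99.999% of its mass outside the range the circuit can supply.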
      
      Now, any person who shows up at a calibration lab and says “I’m positive I’m going to need every volt you can give me; is this rig capable of handling 100 billion trillion volts?” should probably be physically restrained until appropriate care can be provided. It makes NO sense.
      
      On the other hand, consider the Bayesian who provides a prior of uniform(0,10) volts, knowing that the circuit is only designed to supply 10 V and that any sensor which can’t be calibrated within that range of voltage will just be chucked in the bin as “out of spec” anyway. The hard core Frequentist believes that this Bayesian is actually insane and believes in fairies, because “probability only applies to measurements, not parameters” and the math they’re doing is just improper, like adding volts to meters/second.
      
      If I’m the race car driver and I have to make a choice between these two zealots, I know which of those two zealots I want calibrating the exhaust manifold of my race car.
    - rm bloom on May 10, 2021 2:33 PM at 2:33 pm said:
      
      And Bayesians are Frequentists who don’t think they need to keep books on from whence they learned what they say they believe is true about the world. It may be difficult to do it; it may be impossible. But to the extent they make a big hullaballoo about “belief” being the *primitive* concept, not further reducible, not grounded in experience, simpliciter, they are peddling intellectual error.
    - Andrew on May 10, 2021 2:49 PM at 2:49 pm said:
      
      Rm:
      
      +1. It can be useful to model belief probabilistically, but belief is not the foundation of probability, any more than betting is the foundation, or coin flips and die rolls are the foundation. These are all different settings where probability can be useful.
      
      Just one thing. You say, “And Bayesians are Frequentists who don’t think they need to keep books on from whence they learned what they say they believe is true about the world.” I’d just say “Bayesians are frequentists,” full stop.
    - rm bloom on May 10, 2021 2:52 PM at 2:52 pm said:
      
      Andrew: “Bayesians are Frequentists. Full Stop.”
      
      +1
    - Daniel Lakeland on May 10, 2021 3:23 PM at 3:23 pm said:
      
      “And Bayesians are Frequentists who don’t think they need to keep books on from whence they learned what they say they believe is true about the world.”
      
      That doesn’t seem right at all. In the Bayesian calculus with the proper notation, there is always the knowledgebase on which you’re conditioning. p(A | B) where B is some background set of facts. But the **reason** for a belief need not be that in the past the frequency with which x occurred was f. For example “I think your weight is somewhere in the range of 80 to 450 lbs, with a peak around 200” modeled by some gamma distribution has **nothing** to do with me having a database of 200 randomly selected people and their weights. It does have to do with a general knowledge that people under 80 lbs are usually very sick, and that I have never met anyone over 300 pounds, but I do know a few football players or sumo wrestlers are occasionally that large.
      
      I’m fairly confident, though, that if we did randomly select 200 people and then do a chi-squared goodness of fit test, the frequency of weights would not match whatever gamma distribution I chose. However, if we did do that, I would recommend using some best-fit gamma distribution as the prior going forward.
      
      I also don’t agree with “Bayesians are Frequentists. Full Stop.” The definition of Frequentist is that probability means *only* the frequency with which a thing occurs in repeated trials. Bayesians may be interested in being able to predict frequencies, but they model things with probability **other** than the frequency of reoccurrence.
    - Chris Wilson on May 10, 2021 4:09 PM at 4:09 pm said:
      
      Andrew, I have never fully understood your argument why “Bayesians are frequentists” :) If I don’t, in practice, construct prior models by thinking of the reference set of problems to which I would use said prior, am I still a frequentist using Bayesian clothing? What does this reference set mean mathematically, and how does it map to ‘probabilizing’ the parameter space, establishing the prior measure? If I push some simulated data from a prior predictive and then adjust the specification based on what I see, doesn’t that tuning/modeling process sort of whittle me down to a reference set of one – i.e. the particular problem I am working on?
    - rm bloom on May 10, 2021 5:04 PM at 5:04 pm said:
      
      I cannot respond below Lakeland’s comment so I will respond here above.
      
      It is not the case that there must be controlled trials of replicates before a judgement of probability can be made. What I am insisting on is that a judgement of “probability” if it is a *reasonable* judgement must be based on evidence; and evidence consists (if it has any *reasonable* bearing on the question at hand) of histories of events or constellations of events; prior events; which I or others may have witnessed and which we or they have recorded. If it is so vague that I or they cannot make it over into a strict tabulation of presence and absence of effect; well so be it: some probability judgements are going to be sloppy and some will be tight; and others will be in the middle. But the evidence you bring to bear better have some grounding in experience if it is intended to be persuasive! This is not to deny that the mere expression of belief on the part — say — of a person of proven veracity carries with it a certain value as evidence, simpliciter. But to erect the whole tower of the theory of probability on belief — simpliciter — is remarkably jejune in its evasion of the simple question: belief on what grounds; on what evidence?
    - Daniel Lakeland on May 10, 2021 5:26 PM at 5:26 pm said:
      
      Ron, I don’t disagree with your clarification. Yes, we should have **reasons**, based on a database of knowledge, for the use of our priors. When the final results depend sensitively on the prior, you should have a very convincing argument for the use of the particular one you did use. When they don’t depend so terribly strongly, it can be sufficient to use a few words, like “the density of smoke in the air is definitely dramatically less than the density of the air itself under almost all real-world circumstances. Air is listed as about 1.2 kg/m^3 on Wikipedia, so we place a normal prior at about 0.1 kg/m^3 with standard deviation 0.5, truncated at 0, to cover a broad region of plausible densities, including densities as low as 0.”
      
      None of that is really frequency based though.
    - rm bloom on May 10, 2021 6:07 PM at 6:07 pm said:
      
      I cannot respond to Lakeland below, so I will respond above (here).
      If I have a hunch about something being more or less probable, and my hunch is not grounded in a web of experience which I can elucidate, then I am just huffing and puffing, and you should have no reason to listen to me at all. If my huffing and puffing turns out, more often than not, to be proven true; then you *do* have reason to listen; but only *because* my track record is there for you to see!
      
      The reference class is always there, somewhere. Implicit or explicit.
    - Daniel Lakeland on May 10, 2021 6:48 PM at 6:48 pm said:
      
      Ron, there is a difference between saying that I’ve seen a lot of people and many of them were around 150 to 250 lbs, and saying that if you select a random sample of 10,000 American males, such and such percent will be between 100-110, such and such percent between 110-120, etc.
      
      Frequentists demand that every probability distribution quantify what fraction of the reference class falls into each subset. Bayesian reasoning just says that we’re willing to place substantially more credence on one subset vs another, independent of whether we think some replications will bear out the frequencies. It also entertains situations where no replication is possible, such as the probability that the mass of a certain meteorite is such and such. There is only one meteorite. Its mass is a number. My reasons for believing its mass is one thing vs another can be logical without having a repetition class associated and a frequency enumerated.
      
      If you are saying that the history of our observations forms our rational beliefs, I don’t disagree. If you are saying that Bayes has secret quantitative frequency properties that are swept under the rug, I DO disagree.
    - rm bloom on May 10, 2021 7:03 PM at 7:03 pm said:
      
      Lakeland:
      “If you are saying that the history of our observations form our rational beliefs I don’t disagree. If you are saying that Bayes has secret quantitative frequency properties that are swept under the rug I DO disagree”.
      
      Bloom: he does not sweep anything under the rug; he relies upon his experience. His experience tells him, such and such is rare under such and such circumstances; and it is common under other circumstances. He reports this experience by various figures of speech, “I believe this is the case”, “I strongly believe this is the case”, “I doubt it”, “I would not bet on it” …. If he is an ornithologist, or a beekeeper he may even perhaps have book-entries in support of this or that assertion about what is the case. If he is a particle physicist, all the more so.
      
      What he believes, if the belief itself be taken into evidence, rests in what was observed, somewhere, by someone, by him or by his forebears or colleagues; and what was observed are the concrete facts of experience in the world. The fact that such facts may or may not be an adequate basis from which to build a set of relative frequencies is just a reflection of the truism that ‘the investigation can be no more precise than the subject matter admits’.
    - Chris Wilson on May 10, 2021 4:03 PM at 4:03 pm said:
      
      I think this is mostly right, although Bayes + Flat Priors does not necessarily equal max likelihood, the most common instance of a non-Bayesian method for a similar class of problems as what I work on. In the max likelihood framework you are working with the MLE and the Fisher Information at the MLE, whereas in Bayesian inference you would be integrating over the prior. Only in the latter does the unrealistic weight out to non-physical values implied by the improper prior really matter.
      The max likelihood approach relies on asymptotic assumptions to justify using only the local information at the MLE.
      
      In the simple, linear, low-dimensional toy examples we generally use to train intuition, these differences are very slight in practice. But in higher dimensions – and especially with non-linearity – these approaches can really start to diverge!
    - Daniel Lakeland on May 10, 2021 5:21 PM at 5:21 pm said:
      
      > But in higher dimensions – and especially with non-linearity – these approaches can really start to diverge!
      
      Exactly, and especially it can be absolutely critical to have a reasonable prior in high dimensions so that you don’t wind up doing something stupid, where you over-fit your model. The prior plays the role of saying “we only think this model makes sense if it has parameters in this certain region because outside the region the model would be “nonphysical” or “meaningless””. In other words, if you have to use those values, it’s because we built a bad model and need to start over not because the real world works just like our model but we were really wrong about the parameter values (like say the density of smoke in the air REALLY IS greater than neutron star material).
      
      But the prior in high dimensions has ABSOLUTELY NOTHING to do with the frequency with which we saw whatever in some reference set of previous experiments or similar experiments we’re going to do in the future or whatever. It’s just “hey the world either works like this, with parameter values in this region of space… or we need to start over from the drawing board”
    - Nick Adams on May 10, 2021 7:49 PM at 7:49 pm said:
      
      A maximum likelihood estimate and its associated confidence interval assume no prior, of course, but they will produce an answer compatible with an infinite number of Bayesian priors, including some crazy ones like yours.
    - Daniel Lakeland on May 11, 2021 12:33 AM at 12:33 am said:
      
      The MLE estimate is just a number of course, so it’s compatible with lots of different Bayesian output. But the CI construction procedure for a given alpha level will coincide with the posterior probability interval for a given alpha using a prior of the form I’m discussing **exactly** in many CI construction methods (those based on likelihood functions). This is a stronger sense in which the Frequentist result is mathematically the same as a Bayesian result with a particular nonsensical prior.
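      
      For the simplest case this coincidence can be written out directly. A sketch, assuming independent normal measurements y_1, …, y_n with unknown mean θ and known standard deviation σ (these symbols and the known-σ setup are assumptions introduced for illustration, not taken from the thread):
      
      ```latex
      p(\theta \mid y) \;\propto\; \underbrace{1}_{\text{flat prior}} \times \prod_{i=1}^{n} \exp\!\left(-\frac{(y_i-\theta)^2}{2\sigma^2}\right)
      \;\propto\; \exp\!\left(-\frac{n(\theta-\bar y)^2}{2\sigma^2}\right)
      \quad\Longrightarrow\quad
      \theta \mid y \sim \mathrm{N}\!\left(\bar y,\ \frac{\sigma^2}{n}\right)
      ```
      
      The central 1-α posterior interval is therefore ȳ ± z_{1-α/2} σ/√n, which is numerically identical to the usual 1-α confidence interval for θ in this setting.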
PE on May 10, 2021 12:20 PM at 12:20 pm said:

Poisson in 1837:
“In this work, the word chance will refer to events in themselves, independent of our knowledge of them, and we will retain the word probability […] for the reason we have to believe”

Even before Ramsey!

- rm bloom on May 10, 2021 2:35 PM at 2:35 pm said:
  
  What *reason* can there be for belief in X other than our experiences in the world of phenomena analogous to or identical to X? Rather: what *rational* reason can there be?
  
  - PE on May 10, 2021 3:54 PM at 3:54 pm said:
    
    I have no idea what this has to do with the quote, but I’ll answer it anyway: sometimes the rational basis of the belief comes from theoretical knowledge. To give a somewhat simplified story, scientists at some point thought that slamming two high-speed neutrons together would cause an explosion. They weren’t *certain* of that, hence the need for testing, but it was a reasonable prediction based on the best theories at the time. The prediction turned out to be true. But it wasn’t at all based on “phenomena analogous to” slamming two small objects together. Nothing usually happens when you slam two small objects together. If they were to reason by enumerative induction (“nothing usually happens when we slam two small objects together, therefore…”), the prediction would utterly fail. You might say that the comparison is not fair because slamming subatomic particles together is very different from slamming rocks together. But scientists only knew they were different because their theoretical knowledge said so, that’s the point.
    
    - rm bloom on May 10, 2021 7:58 PM at 7:58 pm said:
      
      Isn’t prior theoretical knowledge a summary of some prior genera of experience?
    - PE on May 10, 2021 8:26 PM at 8:26 pm said:
      
      Sure. Reasonable priors are based on evidence/experience, they’re not justified a priori (i.e. independently from experience). The example is just meant to illustrate that “evidence for X” doesn’t have to be something like “most stuff we’ve seen in this reference class is X, therefore probably X”. That’s the simplest case of evidence, but not the only one
    - rm bloom on May 10, 2021 11:27 PM at 11:27 pm said:
      
      If a well-tested theory makes a prediction of X, do we take that prediction as “evidence” for X? We certainly take it as “support” for positing X. If the window is broken and the papers are scattered all about the room we may take this as evidence of a burglary. The theory is, burglars are known to break windows and riffle papers. But it might have been the mummy from the neighboring college museum; come to get the names of the archeologists who’d violated his grave. It could have been the dog next-door, charged through the window in pursuit of his own shadow…. why, it even could have been the wind. What makes the difference between the story of the burglarous entry and the other shaggy-dog tails? Experience.

Statistical Modeling, Causal Inference, and Social Science
