Chow and Greenland: “Unconditional Interpretations of Statistics”

Posted on September 24, 2019 8:43 PM by Andrew

Zad Chow writes:

I think your readers might find this paper [“To Aid Statistical Inference, Emphasize Unconditional Descriptions of Statistics,” by Greenland and Chow] interesting. It’s a relatively short paper that focuses on how conventional statistical modeling is based on assumptions that are often in the background and dubious, such as the presence of some random mechanism and the absence of systematic errors, protocol violations, and data corruption.

We try to emphasize that the lack of discussion of these assumptions and their possible violations may fool people into thinking that statistics can offer more than what it actually can, and so, we should lower our expectations of these assumptions and try to interpret results in an unconditional way.

So far, the response to the discussion has been positive, with most individuals exclaiming that there are few discussions about these hidden assumptions that we often assume to be true when utilizing certain statistical methods.

I’m a big fan of assumps (see, for example, here and here). It’s also good to recognize which of these assumps are important, as often attention is drawn to the more visible or quantitative but less important assumptions underlying our methods.

Greenland and Chow’s paper on unconditional inference is interesting. Unconditional inference is hard to do, in part because you have to define the relevant population or distribution or reference set to average over. Here’s an example where Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, and I attempted to perform unconditional inference from pre-election polls.

53 thoughts on “Chow and Greenland: “Unconditional Interpretations of Statistics””

Sander Greenland on September 25, 2019 1:14 AM at 1:14 am said:

Andrew: I regard the unconditional interpretation as essential for decent stat interpretation in anything noticeably less than perfectly executed experiments…

Fortunately, at least in my concept of common sense, I found what I call unconditional inference easier than conditional inference, once I got used to it. Now that did involve untraining myself from the usual conditional-inference formalism pounded into us by mathematical theories of statistics. By pure logic, all math stat can do is provide inferences of the form
“if you assume all this [whence follows a massive string of assumptions, which the math manages to boil down to an innocent-looking formula or system] then here’s a test statistic T and a reference distribution F(t) for it derived from that big black bag of assumptions, with T sensitive to (optimized for) violations of a particular subset of assumptions, given the rest.”
– Using those kind of conditionals to create a real-world decision (like deciding what to say the study “showed”) throws on a layer of contextual complexity in research reporting that isn’t formalized in any adequate way. No surprise then that mechanical statistical decisions often look unsound upon checking them against the full scope of available information.

The challenge we have to face is that there will nonetheless be those who attempt to curtail reasoning for inference due to having limited resources to deploy on a topic, and yet who want to appear to deliver confident claims in very uncertain situations. In my areas I usually see these reasoning curtailments (like equating conventional NHST to rational decision-making) in service of a loss function that assigns much higher costs to what the promoters would call false positives than what they would call false negatives, for a myriad of reasons ranging from ideologic investments in nullistic philosophies to direct financial motivations (e.g., avoiding admissions of liability) and more. There are of course those with opposite stakes, but the focus of statistics on NHST puts them at a baseline disadvantage. I believe that upsetting this particular distribution of power is what has struck fear in the hearts of committed NHST practitioners and defenders like Ioannidis. Of course this fear is presented as fear of being flooded with false positives. But, as Rothman noted long ago, there is a methodology that prevents all false positives and all false negatives: Stop dichotomizing your inferences – and in particular, stop dichotomizing P-values. To which many have added: Stop computing P-values only for nulls; also give them for alternatives.

More generally, instead of decisions, supply measures of evidence in the data about possible data explanations, taking into account any and all information one has about the study context (which hopefully is quite a lot). It turns out the lowly P-value is one of the simplest starting measures, showing the point at which the data fell at along a particular direction in observation space (the direction in which the test statistic T varies) within the reference distribution F(t) used to compute the P-value. That P-value is deduced from all the assumptions, including the tested hypothesis and the background model in which it is embedded.

That embedding (background) model comprises every other assumption from linearity (obvious when assumed) to no P-hacking to no data tampering (always assumed, rarely mentioned). In causal terms, that model assumes the test statistic arose only from “controlled” extraneous effects which merged in a known way with random error. But the model actually says nothing about why the statistic fell where it did, e.g., it does not and cannot say P was small (or large) because the effect you were examining was present (or absent), or because of biases (systematic errors) assumed absent by the model, or because of random noise as per the model, or because of data tampering, or all these and more. To claim otherwise is just an inversion fallacy (confusing premises with conclusions).

Any explanation for the distance between the data and full model has to come from mechanical, physical, causal models for the actual data generation, and will be conditioned on those models. That’s what I mean when I say the (perhaps too succinctly) the reference distribution for a null P-value assumes “no treatment effect or bias” – in epidemiology, “bias” is any deviation from the assumed reference distributions produced by failure of the background (embedding) model assumptions. Those assumption failures are among the causes of the observations. That includes everything outside the model, which can be viewed as systematic errors from model misspecification – with the latter including erroneous sampling models and erroneous (misinformed) prior distributions.

To restate this view in terms I think resembling yours: The reference distribution comes from a partially specified data model (i.e., a model family) that combines in a formula a random-number generator with some data information (e.g., observed covariate values). For a P-value, the specification just has to determine a reference distribution F(t) for a test statistic T, not for the entire data; then the upper and lower P-values F(T) and 1-F(T) will be approximately uniform under that distribution (within limits imposed by discreteness of T). [Yes there are sometimes technical issues in doing that, but fequentist math stats has addressed those issues well beyond the data and models I encountered in any of my collaborations.] We can then say a valid P-value (one uniformly distributed under the test model – you call those U-values) gives the percentile at which the test statistic fell in the reference distribution deduced from the specified data-model family. That interpretation was discussed by Perezgonzalez (2015; see our citations), and I think it creates a fair analogy between P-values and where students fall on “standardized” college admission tests like the SAT. The S-value or surprisal s = log(1/p) is then just a “standardized” measure of information distance between the data and the model that can be equated to data from a simple thought experiment.

The information deconstruction of the P-value and the more geometrical S-value I think makes clearer that both are relational measure between data and models, meaning: They don’t know or care if the model is just some naive off-the-shelf nonsense default in your software, or some Bayesian betting scheme (whether well or ill informed), or has instead been carefully deduced from the physical mechanisms that produced the data. Nor do they care if the data were tampered with to get P over or under 0.05, or were actually generated from a neutral mechanism you know in every relevant detail down to its exact propensities (e.g., a simple randomizer coupled with a classic ANOVA design with no loss or censoring). They are just measures of discrepancy between the data and model; only context can provide plausible causal explanations of their size, and apportion those between “data problems” and “model problems”.

The unconditional interpretation eliminates that last distinction by apportioning everything to “model problems”, thus logically conditioning on the data – as in: Here’s the data and a long list of possible explanations for them; if you want to choose among them, supply more evidence. With that interpretation, the refutational status of P-values is made clear by the fact that they cannot in any way imply the test model is correct if we don’t assume that the embedding model is correct (as one should never assume in my applications for more than a brief moment; I do believe it’s the same story in your applications).

Reply ↓
- Thomas on September 25, 2019 9:04 AM at 9:04 am said:
  
  Thank you Sander Greeenland for the clear presentation here and in the linked paper. It makes a lot of sense. I’m just wondering if the notion that researchers cling to statistical significance as a reflection of truth isn’t a bit of a straw man. In epidemiology/health at least I don’t see that happening, and discussions of possible bias, confounding, direction of causality, data quality etc take up most of the discussions. A fair example is the JAMA paper on antidepressants taken by pregnant mothers and autism is children that you comment on in the linked paper. Even though there is too much emphasis on a “non-significant” hazard ratio of 1.6 (with a confidence interval that starts at 0.997), the authors do not really present this as “evidence of absence” and discuss a variety of mechanisms that may explain the observed results. They conclude by “Although a causal relationship cannot be ruled out, the previously observed association may be explained by other factors”. Another example is the GRADE system for rating evidence; it’s not just about significance. The limitation of the current state of things is that many people treat the numerical result (point estimate, CI, p) and the validity of the model as though they were separate issues, whereas you provide an integrated view.
  
  Reply ↓
  - Martha (Smith) on September 25, 2019 3:45 PM at 3:45 pm said:
    
    Thomas said,
    “A fair example is the JAMA paper on antidepressants taken by pregnant mothers and autism is children that you comment on in the linked paper.”
    
    Do you mean, “autism in children”?
    
    Reply ↓
    - Thomas on September 25, 2019 3:48 PM at 3:48 pm said:
      
      yes, typo, apologies
  - Sander Greenland on September 25, 2019 4:10 PM at 4:10 pm said:
    
    Thanks Thomas… “In epidemiology/health at least” is in my experience more like “at most”. I do think epidemiology (or at least Epidemiology) is well ahead of most fields because of all the disrespect it has had to address due to its nonexperimental base. I thus agree epidemiology is probably least in need of our article; in fact, our article can be seen as a formal expression of what has long been good practice in the field.
    
    But our article was written for all fields and audiences, not for epidemiology. In the wider world, the take-home message promoted by stat primers and practice is still that 0.05 (or the 95% CI) is magic; that the P-value is “the probability of any observed differences having happened by chance” (e.g., in “Medical Statistics Made Easy” 2008 ed., p. 24-25, and repeated in a JAMA editorial last year; see our 2016 TAS paper explaining why that is completely wrong); and worst of all that P-values and CIs have “objective” contextual meaning apart from any we imbue them with.
    
    Take the Brown et al. article: The only statement in their abstract’s Conclusion was that they observed no association, made in the face of an elevation statistically indistinguishable from the 3 previous studies they cited. “No effect” is then what was trumpeted by research news outlets like Medscape, whose April 21, 2017 headline about the article was “Antidepressants in Pregnancy: No Link to Autism, ADHD”; its first sentence was “Use of antidepressants before and during pregnancy does not cause autism or ADHD, new research shows.” They then quote Vigod (the interviewed author of Brown et al.) as claiming the study “no longer found an increased risk” after adjustments. All these headlines and leads are sheer fabrications relative to the actual data showed, enabled by getting the null barely inside their final CI (using an inefficient adjustment method, BTW). So no, I don’t buy that “the authors do not really present this as “evidence of absence””. Yes they did, after burying due cautions within the text (all the more evidence of their null bias).
    
    Then too, examples abound in the health and med lit without clear bias or buried due cautions, driven solely by 0.05 or 95% CI. They are found even in epidemiology journals (even if not in Epidemiology; for an example see Greenland S. A serious misinterpretation of a consistent inverse association of statin use with glioma. Eur J Epidemiol 2017;32: 87-88. https://link.springer.com/article/10.1007%2Fs10654-016-0205-z).
    
    Those kind experiences are why, when I see “straw man” comments, it reminds me that some readers (chiefly those with strong methodology interests) have become seriously detached from the statistical and political realities in the broader world of everyday research, focusing instead on a highly sophisticated but narrow subsegment of the literature. Some even become blind to the problems in papers they read, filling in a choirboy defense when all the evidence on the page points to manipulation of reporting to foregone conclusions, aided by P-hacking (exploring the “Garden of Forking Paths”) to get above or below the magic 0.05 cutoff (depending on the conclusion the authors want to be true). I believe this kind of denial is why statistics remains a leading contributor to bad science instead of the preventive it was supposed to be.
    
    Reply ↓
    - Adan Becerra on September 25, 2019 11:34 PM at 11:34 pm said:
      
      Sander,
      
      Thank you for the summaries. As an epidemiologist I agree with you that we are ahead of most fields with regards to these topics, but I still found the piece useful since it is one good source that puts it all together for us.
      
      I am particularly intrigued by your comment above re: “Any explanation for the distance between the data and full model has to come from mechanical, physical, causal models for the actual data generation”
      
      Agree- If you’re interested in explanation, statistical models are often meaningless without a causal model (i.e. directed acyclic graphs). I think this is why epi is ahead. We spend a lot of time thinking about statistical assumptions/interpretations because we spend even more time developing our causal models and explicitly stating our assumptions and the relationships between all of the variables with each other. Identification. In fact the statistical assumptions are often driven by the causal assumptions. I choose a statistical model by looking at the DAG I drew and seeing what model best answers that question. And I also think epi is ahead because we are very much aware of dangers with common p value fallacies and misinterpretations. Also things like Table 1, Table 2 fallacy which I am sure you know of :)
      
      Next steps I think would be to develop DAGs or other tools to be able to identify the best functional form of variables. Even epi often suffers from unreal assumptions such as that our models have been not been mispecified paramtetricaly. This might be one reason why folks are moving to Targeted maximum likelihood estimation and longitudinal targeted maximum likelihood estimation which combines power of DAGs with nonparametric models that can overcome many assumptions. But cant speak of the performance of these myself. Haven’t tried them. Happy with gformula and mgformula.
      
      Next steps in DAG theory would be
    - Martha (Smith) on September 26, 2019 12:55 AM at 12:55 am said:
      
      Did this get cut off?
    - Adan Becerra on September 26, 2019 1:31 AM at 1:31 am said:
      
      I did an oops and didnt delete it.
    - Sander Greenland on September 26, 2019 1:50 AM at 1:50 am said:
      
      Adan: I agree with that nice general overview you gave of why epi is ahead on the causal aspects of inference, especially causation of bias (which applies even if the target is noncausal, as in causes of nonresponse and misresponse to a simple voter-preference survey).
      
      Maybe though not so far ahead on the foundations of statistical inference. Here’s an example of a study with two top pharmacoepidemiologic statisticians (one a leading researcher in causal modeling) on the author list:
      https://www.ncbi.nlm.nih.gov/pubmed/27178449
      The abstract states “PDE5-I use was not associated with an overall increased risk of melanoma (rates: 66.7 vs 54.1 per 100 000 person-years; HR: 1.18; 95% CI, 0.95–1.47)”
      then concludes with a one-sentence Patient Summary:
      “In this study, the use of phosphodiesterase type 5 inhibitors was not associated with an increased risk of melanoma skin cancer.”
      – That’s just false: There was after adjustment an estimated 18% higher rate among users, with the CI ranging from a 5% decrease to a 47% increase. How does that result “indicate that use is not associated with melanoma”? (as stated at the start of the paper’s Conclusion). Well it doesn’t.
      
      Like the other examples mentioned here, the association was examined because it had been seen in earlier studies (the estimate was not from fishing), and the verbal claims are just another example of the fallacy decried in item 6 of TAS supplement 1 to the 1996 ASA Statement on P-values. [Notably, their claims were often followed by an admission that elevated risks were seen at higher dose categories.]
      
      As for DAGs, I don’t see how they could be used as you have in mind and haven’t seen what I’d call high-impact developments in DAGs for decades, but maybe some will yet be found.
      
      TML is great in concept but the versions I’ve looked at seemed like big-data procedures with all the usual black-bag of strong assumptions needed to create target identification, which in epi jargon is “no uncontrolled bias” (from selection, confounders, measurement errors, data tampering, etc.). With enough data TML may (under a conventional statistical metric) more accurately adapt to the observational association structure predicted by a causal model, making it a good tool to have. And I think it’s one of many good perspectives on effect estimation to have available. But it’s not addressing the problems at the top of my list, which are really only addressable through design features for bias control (of which baseline cohort matching is a primal example).
  - Zad Chow on September 25, 2019 4:37 PM at 4:37 pm said:
    
    Other thing I would add is that the quality of epidemiological studies/papers published in journals like JAMA and journals like Epidemiology or IJE is vastly different, with the former often being very low in quality. My issue with the latter, which tends to focus more on things like measurement and bias, compared to the former, is that they’ve become dogmatic about the usage of P-values and put too much faith in interval estimates. Of course, I think a more nuanced take is necessary with the acknowledgement that P-values and interval estimates, when interpreted unconditionally, can be very useful
    
    https://journals.lww.com/epidem/fulltext/2013/01000/Living_with_P_Values__Resurrecting_a_Bayesian.9.aspx
    https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529625
    
    Reply ↓
Anoneuoid on September 25, 2019 9:45 AM at 9:45 am said:

The “unconditional interpretation” seems to just be the correct interpretation. Basically, you can only falsify the conjunction of all the assumptions that went into deriving your model. People want to put all the emphasis on one assumption about the value of a parameter such as mu = 0, which is wrong. This is covered by Paul Meehl here:

Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 393-425). Mahwah, NJ: Erlbaum. http://meehl.umn.edu/sites/meehl.dl.umn.edu/files/169problemisepistemology.pdf

I think you are getting close, but people will ask what to do with data they collected like “the difference between treatment and control was x (+/- d)”. You won’t have a satisfactory answer.

The next step is to realize you shouldn’t be collecting that type of data to begin with. You need to collect data that lets you distinguish between different possible explanations.

Reply ↓
Nick Adams on September 25, 2019 6:44 PM at 6:44 pm said:

Sander, I was interested in this :

“To which many have added: Stop computing P-values only for nulls; also give them for alternatives.”

I have heard you mention P-values for alternatives several times before but have not come across any other references to this (quite obvious) idea.

Here’s my take on it: http://arXiv:1806.02419

Reply ↓
- Zad Chow on September 25, 2019 6:59 PM at 6:59 pm said:
  
  Your link is broken. You forgot the .com/abs/, but for others, here’s the correct link: https://arxiv.org/abs/1806.02419
  
  Reply ↓
- Anoneuoid on September 25, 2019 7:00 PM at 7:00 pm said:
  
  Your link doesn’t work, but the eventual next step is p-values only for “alternatives”.
  
  I guess this is it: https://arxiv.org/abs/1806.02419
  
  Reply ↓
- Sander Greenland on September 25, 2019 8:58 PM at 8:58 pm said:
  
  Nick: The idea of computing P-values for or tests of alternatives can be found in many places including ones we cite in our companion paper at
  https://arxiv.org/abs/1909.08579
  among them Birnbaum 1961, Neyman 1977, Poole 1987, and Modern Epidemiology (Rothman Greenland Lash 2008) Ch. 10.
  
  Unfortunately in your use of “S-value” in your arXiv preprint for likelihood support is different from our use for the Shannon (surprisal) transform of the P-value, s = -log(p). We’ll just have to mind that.
  
  The relation between likelihood and P-value based measures is discussed a bit and illustrated in figures in the appendix of our companion paper. One reason some prefer P-value based measures is that they can be defined with much less model specification than needed for a classical likelihood function (just enough is needed to identify the reference distribution for the test statistic, not the whole data), and have an absolute unconditional interpretation as well as the usual conditional ones. In contrast, likelihoods must condition on an embedding model, within which they have only relative interpretations. To the extent one would trust such a model, that is not a fatal objection: In that model, P-values may bound posterior probabilities while likelihood ratios can map prior to posterior odds under the model.
  
  As many before me have advocated, I see the two approaches as somewhat complementary modeling tools, with uniform (valid) P-values providing elementary checks on models (as opposed to more sophisticated checks like residual and influence plots), and likelihood ratios or Bayes factors (which are built from likelihood functions) providing elementary model comparisons (as opposed to more sophisticated comparisons, like of predictive accuracy).
  
  Reply ↓
  - Zad Chow on September 26, 2019 5:37 PM at 5:37 pm said:
    
    This world isn’t big enough for two S-values. There can only be one…
    
    Reply ↓
- Carlos Ungil on September 26, 2019 5:13 PM at 5:13 pm said:
  
  Nick, your clinical significance support level (S-value), equivalent to the value of (1-p) when the null hypothesis is set equal to the
  minimum clinically significant effect size, is what Deborah Mayo would call the severity with which the claim “the effect size is larger than the MCSES” passes the test of the original null hypothesis of zero effect.
  
  Reply ↓
  - Nick Adams on September 26, 2019 5:53 PM at 5:53 pm said:
    
    Thanks. You’re right, and clever to pick this up. I (foolishly) didn’t realise support is the same 1-p when I wrote it 2 years ago. I left the paper on arXiv despite it’s many flaws but maybe I should edit it (severely). Basically, with a lot of useless effort, I worked out the inverse of Wilk’s theorem from first principles without even knowing it.
    Despite reading a lot of Mayo’s work I am still not sure what severity is but I’ll take your word for it.
    
    Reply ↓
  - Sander Greenland on September 26, 2019 8:25 PM at 8:25 pm said:
    
    Carlos and Nick: Isn’t severity taking a 1-sided p at the alternative, whereas most users are taking the 2-sided p at the null? The mismatch there never made any sense to me; it seemed to come from some analogy with power calculations, which are on our 2016 blacklist as analysis tools (as opposed to design tools). Instead, for any sense of comparability I would want the same type of P-value at all the tested hypotheses (all 2-sided, or else all 1-sided in the same direction).
    
    Reply ↓
    - Carlos Ungil on September 27, 2019 1:48 PM at 1:48 pm said:
      
      I would say that the one-tailed test is the “correct” one in many practical cases, when we are interested in finding an effect only to the extent that it goes in the right direction. You cannot “unprescribe” pills to patients (but I guess that in other cases the intervention can go in either direction and any effect could be useful).
      
      Regarding the consistency issue, it’s true that if the p-value for the two-sided alternative “mu greater than or less than 0” is 0.04 the severity or support for the “mu greater than 0” is 0.98. But is anyone going to compare different metrics?
      
      However, when comparing two p-values consistency is important. When the significance threshold is, for better or worse, fixed at p=0.05 it’s difficult to use one-sided tests when two-sided tests are the standard. The use of one-sided tests is seen with suspicion (and rightly so in many cases).
Christian Hennig on September 25, 2019 8:26 PM at 8:26 pm said:

Interesting. I think that to some extent I have always interpreted p-values unconditionally (in Greenland & Chow’s terms), because this is what makes sense to me, and it seems that some people wouldn’t get why I was defending them because they would be conditioned so strongly on thinking “conditionally” that they wouldn’t get what I meant by “computation of the p-value doesn’t assume that all these things in fact hold in reality, rather it compares what we observe in reality with an artificial model that makes all these assumptions” (including the better hidden ones).

Now I’d like to add one thing, which is that I wouldn’t quite agree to the statement that a small p-value does not indicate which assumption is wrong. True, it can’t be pined down precisely, there will normally be alternative plausible explanations. But at least we shouldn’t forget that the test statistic is constructed to highlight deviations from the null hypothesis of a particular kind, for example mean differences, deviations from expected cell frequencies and the like. I agree that it’d be wrong to say this-or-that alternative is confirmed as true because the null is rejected, but at least one can say that the test indicates not only that the null is implausible, but also that the mean difference is larger than expected than under the null, or that the cell probabilities deviate Maybe pinpointing which one in particular etc.). That’s modest but better than nothing, I’d say. It may at least rule out *some* potential issues with assumptions (i.e., deviations from the null) from causing the small p.

Another thing is that the compatibility interpretation of confidence regions looks very much like Laurie Davies’s adequate approximation, see here
https://www.crcpress.com/Data-Analysis-and-Approximate-Models-Model-Choice-Location-Scale-Analysis/Davies/p/book/9781482215861
and here
https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9574.1995.tb01464.x
One nice thing of Davies’s approach is that the data analyst can specify multiple “data features” (or test statistics) to see what models/parameters are adequate in various ways (such as mean, variance, skewness, (non)-existence of outliers etc.), including the possibility that for a chosen model (say normal) no parameter value at all is adequate.

Reply ↓
Sander Greenland on September 25, 2019 11:03 PM at 11:03 pm said:

Christian: Yes, I find the unconditional interpretation is indeed what the (apparently elite) corp of P-defenders like me, you, Senn and Stark come down to at the end of a long road of experience and thinking. But we do have to recognize our own tendency to “recondition”, as I think you are doing when you say “I wouldn’t quite agree to the statement that a small p-value does not indicate which assumption is wrong. True, it can’t be pinned down precisely, there will normally be alternative plausible explanations. But…”
– Sorry, there are no “buts” in unconditional logic, none at all:
A “very small” p only conveys that there is “a lot” of information against the tested model in the direction being measured, at least s = -log_2(p) bits. It says nothing at all about which assumptions in that model are wrong. And I mean nothing, zero: Remember that it does not even exclude that someone tampered with the data to make sure we got that small a P-value (as Fisher concluded happened with Mendel’s data, using the lower tail of the fit statistic to measure “too good a fit”).

Any statement with information content beyond “the data are at the top 100p’th percentile of distance from the model” or “the data are s bits away from the model” (in the tested direction) is from conditioning on assumptions beyond the P-value’s scope, like that there was no tampering. And any pinpointing means that you are now examining deviations within a more narrow manifold (maybe a along a line) than used in the original fit test, and thus going beyond the original P-value’s scope or target.

As Senn has eloquently complained, the problem with P-values and tests based on them is that they convey very limited information, and yet are so ubiquitous and expected (even demanded) that everyone tries to wring far more meaning out of them than they can logically bear. These treatments are blind to the fact that a P-value is just a syntactic (deductive) information summary that cannot be given a real-world interpretation without semantic information bearing on the model from which it was derived (like a causal model for bias sources).

But then, no statistic can be given a real-world interpretation without semantic information bearing on the model from which it was derived. That fact reveals how a lot of P-bashing is railing against the projections people put on P-values. Where those bashings most go off the rails for me is when they say researchers want posterior probabilities, so everyone should replace frequentist with Bayesian stats – not recognizing that Bayes supplies posterior probabilities only at a big cost in what may be dubious assumptions (the first principle of scientific reasoning we should teach is the NFLP = no-free-lunch principle).

Regarding Davies, the book link was broken but I’ve seen his articles. To the extent I understand at the moment (not much I admit):
P-values and their transforms don’t contain enough information to allow them to serve as measures of approximation quality (AQ) in any practical sense. To think otherwise is just the long lamented confusion of “statistical significance” with practical significance (already being complained about by 1919, before Fisher hit big). Thus, for AQ one has to turn to features that include real-world quality components like failure rates, costs, etc. So, it’s great to have a theory that provides AQ measures based on contextually relevant features. I’d just ask those who favor the Davies theory to describe how it provides such relevant AQ measures.

Reply ↓
- Christian Hennig on September 26, 2019 5:50 AM at 5:50 am said:
  
  My wording wasn’t meant to imply anything more than what you call “in the direction being measured”. I just say that we learn some more keeping the direction in mind rather than just “p=0.0002”. I agree however that what we learn isn’t anything precise about what assumption caused this.
  
  Reply ↓
  - Christian Hennig on September 26, 2019 7:26 AM at 7:26 am said:
    
    …like for example, p=0.0002 in a one-sided t-test can indicate any issue that may cause the mean of group 1 (say) too large related to the pooled variance (i.e., for example it may mean that the variance is too low because of ignored autocorrelation, apart from group 1 having indeed a larger expectated value), but it surely does not indicate the opposite (such as outliers causing the variance to be too large).
    
    Reply ↓
- Daniel Lakeland on September 26, 2019 9:24 AM at 9:24 am said:
  
  For me, the problem with p values and frequentists stats is that they distract people from doing science and instead focus people on constructing surrogates for science.
  
  We know ahead of time that numbers collected in observational or experimental studies don’t come from random number generators. RNGs are very specially constructed algorithms that must pass stringent requirements. Sure in randomized experiments they are used, but also randomized experiments are never so simple, non-compliance and soforth being real factors, but more than that… scientific measurements come from physical processes in the world that follow regularities. The job of science is to discover these regularities. p values and classical stats lull people into the idea that you can ignore the hard work of building descriptive models of regularity, and focus instead on pretending the world is random. And the worst part is people are lulled into pretending the world is random with a known stable shape to the histogram…
  
  I have no problem with Christians interpretation of p values as telling us that a model was inadequate, but I just think we shouldn’t be constructing models we know are inadequate to begin with and then acting surprised if they give small p values… and this seems to be the primary actual use for p values.
  
  an example of a good use of p values is to collect a bunch of information in a context of stable measurements, construct a model of the measurements using this sample, and then use the model to detect when unusual events happen and Mark them for further study. Here we expect the model to be adequate because it’s built on substantial data, the shape of the distribution can be estimated, stability can be assessed… etc. This is typical of say quality control, or earthquake detection, or satellite reconnaissance or whatever.
  
  The reason I advocate Bayesian methods is because I’m really advocating *thoughtful model construction* which I see as the job of scientists. When Sander talks about “dubious model assumptions” I think this is where people take their old way of doing things and bolt on a sample from a posterior distribution… without the thoughtful model construction Bayes is just same shit different day.
  
  With the thoughtful models, Bayes offers the right interpretation, but p values give a way to assess model misfit, so as with any model we might have reason to trust, a p value becomes a useful tool, just like in the frequency violation detection examples such as quality control.
  
  Reply ↓
  - Anoneuoid on September 26, 2019 10:15 AM at 10:15 am said:
    
    I just think we shouldn’t be constructing models we know are inadequate to begin with and then acting surprised if they give small p values… and this seems to be the primary actual use for p values.
    
    Exactly. And as usual, this point was already made long ago:
    
    The usual application of statistics in psychology consists of testing a “null hypothesis” that the investigator hopes is false. For example, he tests the hypothesis that the experimental group is the same as the control group even though he has done his best to make them perform differently.Then a “significant” difference is obtained which shows that the data do not agree with the hypothesis tested. The experimenter is then pleased because he has shown that a hypothesis he didn’t believe, isn’t true. Having found a “significant difference,” the more important next step should not be neglected. Namely, formulate a hypothesis that the scientist does believe and show that the data do not differ significantly from it. This is an indication that the newer hypothesis may be regarded as true. A definite scientific advance has been achieved.
    
    Mathematical Solutions for Psychological Problems. Harold Gulliksen. American Scientist, Vol. 47, No. 2 (June 1959), pp. 178-201. https://www.jstor.org/stable/27827302
    
    Reply ↓
  - Christian Hennig on September 26, 2019 11:22 AM at 11:22 am said:
    
    I have no problems whatsoever with thoughtful model building where the required information is available. Nobody in their right mind would say that this shouldn’t be done and stupid straw man null hypotheses should be tested instead.
    
    However in many areas such information is very weak or non-existing and even if it exist, its translation into a proper model is highly nontrivial. And then somebody may come and run a test and show that the data can actually not be distinguished from independent data generated by a Normal(0,1). So where does that leave the sophisticated modeller? Obviously one then shouldn’t claim that the data are indeed N(0,1), but neither can they serve as evidence for anything more sophisticated…
    
    Furthermore, misinterpreting data because of ignoring existing autocorrelation and violations of other “hidden assumptions” can happen in a Bayesian framework just as easily as in a frequentist one. I wouldn’t think that your major argument for a Bayesian approach is that it is more intimidating, so that people with limited enough understanding to do all kinds of nonsense won’t touch it and rather do p-values (with all the implications that we see)? My expectation still is that if ever Bayesian methods gain the same popularity that p-values currently have among non-statisticians, we will see the same levels of Bayesian nonsense as we see p-nonsense these days.
    
    Reply ↓
    - Anoneuoid on September 26, 2019 11:35 AM at 11:35 am said:
      
      And then somebody may come and run a test and show that the data can actually not be distinguished from independent data generated by a Normal(0,1).
      
      People only collect data like this because they want to do NHST, then they try to do something useful with it and discover they can’t. The mistake is already made by collected poorly informative data. By poorly informative I mean it does not help distinguish between various models of how it was generated. And the normal distribution is just another theory derived from certain assumptions, ie a series of independent additive events like seen in a Galton board.
      
      At the very least you should be measuring how the process changes over time. Almost everything in real life is cyclical.
    - Christian Hennig on September 26, 2019 11:48 AM at 11:48 am said:
      
      How data is collected is one thing, what their distribution looks like is quite another. The pure fact that data are indistinguishable from N(0,1) doesn’t tell you anything about the quality of data collection. You never know, maybe I have tested for process changes over time and haven’t found any (and neither would you, looking at the same data)? You’re making a lot of assumptions about a hypothetical example that doesn’t even come with a real story…
    - Anoneuoid on September 26, 2019 2:12 PM at 2:12 pm said:
      
      However in many areas such information is very weak or non-existing and even if it exist, its translation into a proper model is highly nontrivial. And then somebody may come and run a test and show that the data can actually not be distinguished from independent data generated by a Normal(0,1). So where does that leave the sophisticated modeller?
      […]
      The pure fact that data are indistinguishable from N(0,1) doesn’t tell you anything about the quality of data collection.
      
      The fact that you would do such a “test” tells us your data is univariate and no one predicted any particular value for the mean though. There is not much insight to be gained from such info, but it does have utility in predicting what values you are likely to see in the future (even if only giving you some idea of the order of magnitude).
    - Daniel Lakeland on September 26, 2019 2:06 PM at 2:06 pm said:
      
      >..show that the data can actually not be distinguished from independent data generated by a Normal(0,1). So where does that leave the sophisticated modeller?
      
      Suppose I am listening to a HAM radio, there is a lot of static, and then very faintly in the background is the sound of someone saying “hello hello can anyone hear me?”
      
      If I sample this data at 44100 Hz (like typical CD quality audio) and I take 1 second of sample, and test to see if the samples come from a uniform normal random number generator, I will not be able to reject this hypothesis because the vast majority of the power is in static… Does this mean the person isn’t talking?
      
      If I have a model of english speech and run a filter on this sound I can pull out very clearly and distinctly the words “hello hello can anyone hear me” but if I have a model for Pashto speech I don’t pull out words at all… if I run a model of many different languages I get mostly no signal, but a few languages related to English pull out some slightly garbled words… Is this or is it not evidence that someone is speaking English with a massive amount of noise over it?
      
      Anyone can make bad models and fit them with Bayes. But to make the kind of models I’m discussing and fit them well, I have not seen anything nearly as effective as Bayes for fitting. Since the models I’m talking about *don’t assume randomness* of the type described by say Per Martin-Lof or Kolmogorov (ie. sequences that pass tests of randomness) it’s *not possible* to apply Frequency statistics to the models.
      
      It is, however, possible to apply plausibility calculus, which turns out to have the same math but describe a different “information” phenomenon instead of “repeatable random” phenomenon.
Sander Greenland on September 26, 2019 12:02 PM at 12:02 pm said:

Daniel, re: “the problem with p values and frequentists stats is that they distract people from doing science and instead focus people on constructing surrogates for science.” My response is: No they don’t. Blaming the tools is an example of a social mind-projection fallacy. What distracts people is the social expectation and demand (e.g., read JAMA’s instructions to authors) that p-values be misused for purposes that they are not sufficient for, like deciding what results to report, and whether to report “no association was observed” or something else.

You yourself close by noting that P-values can be used for model checking; I presume you would add that they are but primitive first steps in that direction (although a few P-values or a CI will often give you all the bad news about lack of data information). So surely we agree an integrated toolkit approach is warranted. Toward that end it might even be healthy to stop thinking in terms of “frequentist” and “Bayesian” as anything more than labels for an antiquated dichotomy that would be best replaced by talking and teaching about tools for data description, information merging, prediction, calibration, decision, and more (including tools developed outside that deceptive dichotomy, like EDA, pure likelihood, etc.).

For decades I’ve advocated incorporation of Bayesian viewpoints (at least a few of the 46,656 kinds) and tools into basic teaching. But I don’t think they are sufficient or can displace frequentist viewpoints. To my alarm, there are those who do so, failing to accept that most every misuse of frequentist stats maps right over to a misuse of Bayesian stats, often in a more pernicious form. That’s because the problem is not one of philosophy or math but of human and social propensities – like to fabricate certainty where none can be had on scientific (logical-empirical) grounds, or to hack analyses to ensure that desired conclusions look like they were reached on scientific grounds, or just to get published (happy with any important-sounding conclusion that can be presented as if a consequence of the study, even when there is none to be had).

Bayesian “inference” is every bit as (if not more) easy to game to these goals as frequentist “inference” because it provides an extra garden of forking paths in the form of a joint prior distribution. And when defaults are imposed to prevent prior gaming, the consequences are often gross distortions paralleling those from the defaults (like 0.05) used to prevent gaming of frequentist “inference.” That’s especially so when the defaults are contextually nonsensical (like most apps I see with flat, reference, spiked, or independence priors).

It’s no surprise that Bayesian apps look no better or worse to me than frequentist apps conducted at the same level of contextual depth, honesty, and technical competence (or at the same level of shallowness, deceit, and incompetence). That’s because its those human psychosocial factors that are pivotal to statistics and its endless crises and abuse in research. Those factors arise in every step of research from planning to publishing, involving incentives and demands of funding agencies, reviewers, and editors with their own goals and (often perverse) incentives. That means we should be seeking mitigation by studying the social psychology of inference (not “philosophy of statistics”) and effective ways of getting valid information out of studies. The only aspect of this human-factor problem I am addressing (or may be qualified to address) is the notion that some mathematical convolution of data with a model (whether a P-value or a CI or a posterior probability) is responsible for or can do much in such a vast, complex, psychosocial minefield.

Reply ↓
- Martha (Smith) on September 26, 2019 12:40 PM at 12:40 pm said:
  
  Sander said,
  “Daniel, re: “the problem with p values and frequentists stats is that they distract people from doing science and instead focus people on constructing surrogates for science.” My response is: No they don’t. Blaming the tools is an example of a social mind-projection fallacy. What distracts people is the social expectation and demand (e.g., read JAMA’s instructions to authors) that p-values be misused for purposes that they are not sufficient for, like deciding what results to report, and whether to report “no association was observed” or something else. ”
  
  This seems to be looking at something from an “either/or” perspective when what fits is a “both/and” perspective. For example, what drives JAMA’s instructions to authors? Is it not in large part too much focus on constructing surrogates for science, and not enough focus on doing good science?
  
  Reply ↓
  - Zad Chow on September 26, 2019 12:49 PM at 12:49 pm said:
    
    “For example, what drives JAMA’s instructions to authors?”
    
    That’s a very good question. Here are their latest reporting guidelines for randomized controlled trials.
    
    https://jamanetwork.com/journals/jama/fullarticle/2748772
    
    One notable excerpt:
    
    “For instance, recently there has been rekindling of the debate about the use of the term “significance” and reporting of P values based on statistical testing when reporting findings from clinical trials (and other types of studies).6 Some have advocated removing the term “significance” from descriptions of the results of clinical trials, simply providing effect sizes, generally with 95% confidence intervals, and allowing the authors (and readers and others) to use some other approach to interpret whether the observed findings are likely to represent a true effect vs a sampling error, as well as whether the effect size is important. These arguments acknowledge that the most commonly used significance threshold of .05 represents a historical tradition rather than a rationally established cut point. Others have continued to advocate for describing results of clinical trials in terms of statistical significance, in part because of the need for a starting point in discussion; decisions that are made by regulatory bodies, such as the US Food and Drug Administration, which are generally dichotomous; and the need to assist clinicians and patients in their interpretation and operationalization of clinical trial results.”
    
    It’s funny that they mention the FDA to backup the usage of decision making based on statistical significance, especially when there is no actual evidence that the FDA put that much thought into adopting the paradigm when they first did,
    
    “However, there is nothing that any of the authors of this report can find, after diligent review of Congressional hearings and documents or the FDA archives, that discusses or debates the level of evidence needed for drug approval or the standard by which that evidence is judged. Thus, the acceptance of fundamental concepts for what constitutes adequate, well-controlled investigations—use of placebo, blinding and randomization—that we take for granted today, were heavily debated 50 years ago. While these trial design characteristics are accepted today as the gold standard for generating credible evidence, the acceptable level of statistical evidence (p < 0.05) was, AND STILL IS, just a matter of tradition and convenience."
    
    https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1566091
    
    Reply ↓
  - Sander Greenland on September 26, 2019 1:37 PM at 1:37 pm said:
    
    Martha: You asked “what drives JAMA’s instructions to authors? Is it not in large part too much focus on constructing surrogates for science, and not enough focus on doing good science?”
    – Yes, but why is such an (unjustifiably) prestigious journal dug into anti-science policies? As has been often said, it’s not the fault of P-values, which are just numbers that sit there. It’s complicated, but this is my caricature of the individual roles in it (you may see more than hint of Gigerenzer here): In their student days, researchers, editors, and their consultants and reviewers were taught superstitious science surrogates and ritualistic caricatures of statistical analysis. These superstitions imbue P-values with a totemic spirit force called “significance.” The future power players were then forced to use these superstitions and rituals to rise through the ranks. Consequently, the entire totemic system came to permeate their own work, and they used it to determine the fate of other works. This permeation made these player irreversibly invested in the system, to the point of spewing defensive double-talk like claiming (with arguments worthy of a president) that eliminating significance thresholds “will give bias a free pass,” when we have a century of literature about how those thresholds and the dichotomous thinking they feed are among the largest sources of distorted research reporting and statistics abuse. (There is also the dovetailing of these defenses with hidden sociopolitical agendas, e.g., fending off liability claims, as in the Brown et al. example.)
    
    Reply ↓
    - Martha (Smith) on September 26, 2019 4:29 PM at 4:29 pm said:
      
      What you are describing falls under the phrase, “That’s The Way We’ve Always Done It,” (TTWWADI, for short) that I have often used on this blog to describe many common practices in the use of statistics.
    - Keith O'Rourke on September 26, 2019 4:54 PM at 4:54 pm said:
      
      Sander:
      
      I remember an academic that managed to immigrate to Canada from Soviet Russia back in frigid days of the cold war – he said something like – speaking freely and openly even in a closed academic meeting was something he would be struggling with for years. He compared it to walking with crutches due to a long injury and then not needing them – you still limp for a long time afterwards.
    - Anoneuoid on September 26, 2019 5:17 PM at 5:17 pm said:
      
      Much of medicine seems to be based on politics, fads, and clickbait to me. Eg, the vaping scare the CDC is currently causing. They basically took:
      
      1) An unexplained medical issue (idiopathic pneumonia)
      2) A new yet very common behavior (vaping)
      
      Then linked the two to cause a scare. Look at the definition, basically any unexplained case of pneumonia in a person who vaped in the last 90 days gets called “confirmed vaping illness”[1].
      
      Lets say the population at risk is primarily 15-40 year olds,[2] which is about 100 million people in the US.[3] Lets also say that about 30% of them vaped in the last 90 days.[4] Then we have about 1000 cases[2] per 30 million vapers or about 3/100k. Before vaping was a thing, the prevalence of just one type of idiopathic pneumonia was estimated at ~0.4 to 8 cases per 100k.[5]
      
      So there does not seem to be a meaningful increase in the symptoms, only in vaping. And anyway 3 per 100k is pretty low risk activity. For comparison, 50 per 100k die from some kind of accident.[6] So the average person is ~10x more likely to die from just going about their life than to even get a case of this pneumonia (~1000x for dying from it).
      
      Fine… disagree with the numbers which are all rough guesses. But why doesn’t the CDC present us with a common sense analysis like this?
      
      [1] https://www.cdc.gov/tobacco/basic_information/e-cigarettes/severe-lung-disease/health-departments/index.html#case-definitions
      [2] https://www.cdc.gov/tobacco/basic_information/e-cigarettes/severe-lung-disease.html
      [3] https://en.wikipedia.org/wiki/Demography_of_the_United_States#/media/File:Uspop.svg
      [4] https://www.drugabuse.gov/publications/drugfacts/monitoring-future-survey-high-school-youth-trends
      [5] https://www.ncbi.nlm.nih.gov/pubmed/16088690
      [6] https://www.cdc.gov/nchs/fastats/accidental-injury.htm
- Daniel Lakeland on September 26, 2019 2:18 PM at 2:18 pm said:
  
  >It’s no surprise that Bayesian apps look no better or worse to me than frequentist apps conducted at the same level of contextual depth, honesty, and technical competence (or at the same level of shallowness, deceit, and incompetence). That’s because its those human psychosocial factors that are pivotal to statistics and its endless crises and abuse in research.
  
  I agree with this perspective, and I’m sure that Bayes can be abused by people who want to continue conducting surrogate-science…
  
  The thing is, I don’t see how we can stop conducting surrogate science and start doing a good job of modeling regularity *until* we adopt the idea that once you have a model, the way to fit it and get probability out of it is to use Bayesian methods. non-surrogate models of the kind I’m advocating *do not have frequency content* because frequency properties imply that the universe has a mysterious kind of regularity that the universe doesn’t have.
  
  These models of regularity I recommend aren’t models of “how often x occurs in infinite trials” they are models of physics, chemistry, biology, human interaction, resource consumption, communications, etc therefore the kinds of questions that can be answered by frequency type tests fail to be useful as fitting tools etc.
  
  Can you calculate a Bayesian p value? Yes. If you fit a posterior to your process, you can then ask questions like “what is the probability that observed data would be at least as far away from predicted as we saw in this case?” This helps you notice things like “even in the region of good fitting parameters, we under-estimate how far away outliers can be” or the like.
  
  But the p value here is *not* answering a frequency question, it’s answering a “how implausible is this data if our model were true?” question.
  
  That we can calculate with frequencies using an RNG in our computer in order to calculate numbers is a happy coincidence, it does not describe what our Bayesian model thinks about the frequency in the long run of actual collected data.
  
  Reply ↓
  - Sander Greenland on September 26, 2019 8:10 PM at 8:10 pm said:
    
    Daniel, re: “non-surrogate models of the kind I’m advocating do not have frequency content because frequency properties imply that the universe has a mysterious kind of regularity that the universe doesn’t have.”
    From my perspective that’s fine because it’s addressing the radical, repeated-sampling frequentism that Neyman espoused and most of American basic-stats adopted in his wake.
    
    You also come close to my point when you say the P-value is “answering a “how implausible is this data if our model were true?”.”
    – However in the view I prefer, the P-value alone is answering less than that. In what I’d call information-based statistics (some might label “neo-Fisherian” frequentism), a reference (test) distribution only provides a measure (by no means the only possible one) for relating the data to the math model from which that reference distribution was derived. This measure is purely syntactic, with contextual semantics (like “plausibility”) added via the derivation of the math model from context, e.g., a hypothesized plausible physical (causal) mechanism for data generation.
    
    Now, the information description suggests “frequentist” is not such an accurate characterization of the approach. I find it has more in common with reference Bayes (and even gives numerically identical estimates in the most commonly used class of models) than with Neymanian repeated sampling; in my reading of Fisher I think he realized that in the end. This view doesn’t need to think in terms of repeated sampling or an RNG (although RNGs are useful to simulate answers when analytic derivations get too involved, just as with posterior simulation for Bayes; this is not a “happy coincidence” but rather just a means of integration). See also Vos and Holbert at https://arxiv.org/abs/1906.08360.
    
    In this view, there’s a sense in which it doesn’t matter whether you say your data model is some kind of RNG, or instead (or in addition) a bookmaker (betting function) for data, with improper priors in some dimensions (those of the so-called, horribly misnamed “fixed effects”). Nor does it matter what labels you apply to any derived distribution, like a P-value or compatibility function for a parameter. In all cases, a valid P-value p from a test statistic T translates into the 100p percentile at which the statistic fell in the reference distribution F(t) derived from the distribution – whether you think of that as a repeated-sampling (frequency) distribution, a betting distribution, or whatever. P is just a math measure based on a “standardized” distance from the model manifold to the data, one I happen to think more easily seen for what it is (and isn’t) by taking its negative log, s = -log(p), as explained in my TAS article at
    http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
    
    A point you did not address is the algorithmic sense in which the frequentist/Bayes distinction is artificial, in that they are both just types of data processors that rely on the user for responsible, valid, or accurate use. See my coverage of that here (p. 802-806):
    https://link.springer.com/content/pdf/10.1007%2Fs10654-019-00552-z.pdf
    – Algorithms, whether “frequentist”, “Bayesian”, or something else, and whether outputting P-values, intervals, or something else, are just blind programs that process the data the way they were designed to do, and no more. Again, I maintain it’s not the algorithms, it’s the user and the system miseducating and perversely pressuring the user that’s the core problem. To address that, I’ve come to think we need to not only retire statistical significance, but also retire “frequentism”, “Bayesianism”, and other weird techno-religions in favor of professionalism (including integrity and competency).
    
    Reply ↓
    - Daniel Lakeland on September 27, 2019 10:12 AM at 10:12 am said:
      
      > Again, I maintain it’s not the algorithms, it’s the user and the system miseducating and perversely pressuring the user that’s the core problem.
      
      That’s why people like Wansink who seem to either enjoy doing bad science or just be clueless come from.
      
      The question I have is, for those good scientists like some of the ones I work with who I teach Bayesian methods to, what are we going to do to enable them to do the good science they want to do? Right now statistics is a barrier they have to leap that stands in the way, not a tool for insight and investigation.
    - Sander Greenland on September 27, 2019 1:19 PM at 1:19 pm said:
      
      Daniel re: “…Right now statistics is a barrier they have to leap that stands in the way, not a tool for insight and investigation.”
      Love that “barrier they have to leap.” Making it otherwise is like trying to clean up politics, as if we could…but we ought to keep trying.
      
      To answer your teaching question: I don’t think that statistics is a tool; it is a bunch of specialized tools that have been mischaracterized by being classified and explained only via by their probability mathematics, leading to endless abuse. In particular, the causality theory that underpins design-based tools (e.g., permutation tests) was long neglected in favor of probability models. Yet it is that causality theory which forms the basis for deducing a probability model from the application context, and for recognizing possible contextual explanations for observations.
      
      I did the teaching challenge for 4 decades, mostly with grad students, and probably learned more than I taught. To judge from evals and awards though, the students seemed to appreciate it in the end, with many sending comments years later on how what I taught helped them minimize bad practices in their teams. Of course that’s a selection-biased sample, to say the least, and upon reading articles by some who took my classes I saw I had not exactly eliminated bad practices. When I queried those former students about this, a common response was that it was the best they could do given the political realities (e.g., having senior authors and journals demand reification of artificial dichotomies via NHST).
      Anyway…
      
      Among core points I emphasized by the end was that statistics is a huge shop of tools, each of which take training to master and skills to use properly. Hence, no one (including their stat profs) could come close to understanding and mastering them all or even most of them.
      
      Also, that sound application required a thorough understanding of exactly how the tools mapped into the application area (context), which required reading review and research articles in the area. Hence, no one (including their stat profs) could come close to understanding and mastering all or even most contexts, especially over a domain as vast as health and medicine. This mapping would often falter if no one on the team understood both the tools and the context in depth; and bridging that gap would someday fall to the student (like when they did their dissertation).
      
      These human limits run right up against a reality: Beyond the simplest jobs, taking multiple perspectives will be needed to make the most valid inferences. Doing so will often point to the need for use of multiple toolkits, or at least tools that have justifications from multiple perspectives (like regression models). Thus it was only to be expected that strict adherence to one perspective or class of tools (whether in the name of “philosophy” or for lack of competence beyond it) could undermine practice: It led to abuse of tools in that class to do tasks that were outside their scope, like treating Fisherian (design-based) P-values as if they were posterior probabilities (as in “P is the probability that chance alone produced the data or more extreme”), or treating posterior predictive P-values (PPP) as if they were calibrated model checks.
      
      Finally, like many of my colleagues, I found causal directed acyclic graphs (cDAGs) indispensable for explaining intuitively difficult concepts like collider (“Berksonian”) bias, and why a regression model for data does not “explain” the data no matter how good it fits or predicts – because there are always multiple causal-models/cDAGs (and hence multiple physical explanations for the data) that imply the regression model.
    - Daniel Lakeland on September 27, 2019 2:11 PM at 2:11 pm said:
      
      Well, I think we have a few different pieces of terminology, but ultimately are on the same path. If I describe “math” as a tool to describe say physics experiments, I don’t mean it’s something like a hammer… I mean its not something we’re studying because we’re interested in the mathematical relationships themselves, it’s not the subject of study.
      
      Similarly when I say “statistics is a tool for analysis and insight” I mean we select bits from the bag called statistics and then we don’t study the bits we selected, we use them to study the medicine or the biology or the economics or whatever.
      
      In this analogy The Statistician is like the guy in the basement who runs the shop. He or she knows some stuff about what kinds of tools can accomplish what kinds of tasks, and he knows something about how to design new tools that accomplish specific tasks. The Scientist is like the … well the scientist who shows up down in the shop and says something like “I need to hold a certain thing steady, and then slice it up and dice it like this and that, and then see what’s inside”.
      
      Of course the shop person isn’t going to solve that problem with a hammer, but they are going to solve that problem by designing a special tool that carries out the task, and then making that tool using a bunch of parts that get combined in a combinatorially exploding set of possibilities…
      
      The NHST paradigm is basically like deciding to replace your machine shop with the Grainger catalog. Anything you can buy off the shelf you can have, anything else forget it.
      
      We are in a situation where people are changing what they want to do as scientists because all they have available to them is any variety of threaded carriage bolt, a couple of different hammers, and a drill press. So they keep drilling holes and putting bolts in them by pounding them in with hammers… and the ones that fit they call “significant” and the ones that don’t they throw out…
    - Sameera Daniels on September 27, 2019 2:35 PM at 2:35 pm said:
      
      I think that the current crises in knowledge, including statistics, have compelled non-experts to formulate strategies to deal with the conflicts of interests and the obfuscation that can issue from engaging in too technical explanations. The gatekeeper role is already under scrutiny. Gerd Gigerenzer has conducted a very good effort to education non-expert consumers of science, given the research & clinical trial outcomes more generally. I believe the Open Science Movement can be engaged. Certainly, there are non-experts that experts flag for not being well trained. Well that seems like a specious and egoistical claim to retain the gatekeeper role.
    - Daniel Lakeland on September 27, 2019 2:55 PM at 2:55 pm said:
      
      Sameera, I certainly didn’t mean to imply that statisticians are the only ones who should go down in the shop and get dirty and build the tools they need to do the work. I just meant that if you want to do it, you need to
      
      1) Know that it can be done
      2) Seek out some help
      3) Don’t be satisfied with the posers who tell you to grab the Grainger catalog and just order up whatever you need because that’s all there is anyway.
    - Sander Greenland on September 27, 2019 5:33 PM at 5:33 pm said:
      
      Sameera: You said “I think that the current crises in knowledge, including statistics, have compelled non-experts to formulate strategies to deal with the conflicts of interests and the obfuscation that can issue from engaging in too technical explanations.” Yes, but conversely many nonexperts (and some experts) have largely created the current crisis by glossing over technicalities in order to manufacture the interpretations they want to be true but are just false. Statistics has long aided this problem by selling methods that were supposed to be and can only be cautionary, as if they were instead providing certainties, passing off NHST as a yes/no truth indicator or something like that; P-values as nonsignificance and confidence measures; and posterior probabilities from uncertain data models and misspecified priors as if they captured all or even most uncertainty in the problem. Andrew named these phenomena well: Uncertainty laundering.
      
      As for gatekeepers: The main gatekeeping has been and continues to be by nonexperts and suborned experts (not always a clear distinction between the two) in forcing statistics usage and interpretation that demonstrably and vastly distorts the information actually provided by studies (as in the examples just listed), and has warped the information base of many fields in ways not always recognized (and often denied by those gatekeepers).
    - Martha (Smith) on September 27, 2019 4:03 PM at 4:03 pm said:
      
      Many good points from both Sander and Daniel.
    - Daniel Lakeland on September 27, 2019 11:46 AM at 11:46 am said:
      
      > A point you did not address is the algorithmic sense in which the frequentist/Bayes distinction is artificial, in that they are both just types of data processors that rely on the user for responsible, valid, or accurate use
      
      This again gets down to how to enable good scientists to do good work.
      
      Frequentist methods have much much more narrow validity as a method than has been widely publicized and taught. For example it’s a very valid way to think about the US Census because *lots of effort* goes into ensuring quality random sampling. This is like computational RNGs we can’t just use whatever stuff some random programmer decided to stick into the C library… we use things like Mersenne Twister which were developed by mathematicians and tested using deep testing suites and have proofs of certain properties. it’s this effort to impose the properties of randomness onto our problem that makes frequency statistics valid, but even then the validity is limited.
      
      If I take a sample of 10 numbers from the UTC unix clock with randomly generated intervals between, I can calculate a mean and a standard deviation and investigate the histogram and find it appears to be a uniform random number generator, and then assert that in the future when I do this again 95% of the numbers will be within x standard deviations of the mean…
      
      > data = replicate(10,{Sys.sleep(runif(1,0,3)); as.numeric(Sys.time())})
      > hist(data)
      > mean(data)
      [1] 1569598107
      > sd(data)
      [1] 4.182665
      
      Nothing about the fact that I used a proper RNG or have “frequency guarantees” from my methods will make this work. In the future 100% of the samples will fall outside my interval.
      
      The world is like the clock only far far more complicated: in the future it will do the stuff that the regularity properties of the universe cause to happen. Completely ignoring the regularity and treating it as a random number generator can work accidentally in some cases, but normally will fail. And it shirks the duty scientists have to *try to discover the regularity properties in the first place*.
      
      Because of the limited applicability of standard Frequentist methods, and the lack of understanding of what else can be done among many scientists, scientists are literally left with nothing that they are aware of that accomplishes the actual goals they have. They spend their time trying to shoehorn their real problem into something that sounds like “testing a hypothesis”, and this is *people who really want to do a good job*.
      
      While Bayes can be done ritualistically and accomplish the same kind of meaningless stuff, of the two probabilistic methodologies, Bayes is the only one that is even applicable to problems like “the skeletal system has redundant muscles, there are many possible muscle force curves that could result in a given observed movement, which ones are likely to actually occur among healthy patients walking on a treadmill, and which ones are likely to occur among stroke victims with known areas of brain damage performing the same task?” (a problem I am actually working with researchers on at the moment. It’s computationally challenging with potentially thousands of dimensions and requires running big mechanics simulations at each step).
      
      Nothing about such problems can be solved by resort to random number generation, repeated measurement across different participants, and appeals to theorems about central limits or whatever. What is needed is a way to quantify which muscle activities are likely and which are not based on both the kinematics and dynamics of the action and our knowledge about physiology and human behavior.
      
      Far FAR more of science falls into the category “we need to distinguish between several types of descriptions of what might have happened and what probably didn’t happen and decide which kinds of regularity our system exhibits” than otherwise.
      
      Yes, there are lots of applications of censuses, surveys, and “simple measurement”, but in general this is what you do *before* you do the science…. you collect a bunch of data and try to figure out how to even formulate your question.
      
      Statistics as taught has cut the scientists off before they start their real task, which is formulating the question and a large number of possible answers to it, and then winnowing out the ones that seem to hold up under detailed scrutiny.
      
      Mechanistically applied Bayes won’t solve the problem, but *even PROPERLY applied Frequentist methods are simply inapplicable to the problem* whereas Bayesian methods are at least applicable.
    - Sander Greenland on September 27, 2019 5:04 PM at 5:04 pm said:
      
      Daniel: I think we agree on most broad principles here, but in my application areas at least, standard Bayesian methods have their own profound limits, such as not dealing well with settings in which it is simply infeasible to express our total information – and thus accurately gauge our total uncertainty – with probability measures. So pure Bayesian methods are “not applicable”, either (do not provide accurate information or uncertainty assessments).
      
      From your description it sounds like you work on projects with far more input information (more accurate observations, more knowledge of the underlying biomechanics) than I encountered in my primary applications, in epidemiology (which seem more akin to Andrew’s main area, poli sci, than to yours). In my areas, the lack solid empirical information on details means plausibility and designs to enable basic signed-ordinal distinctions are far more central than posterior probabilities (PostP). Fully-specified PostP are based on speculations about ill-determined parameters that get overspecified when encoded as prior distributions (PriorP). When you couple that problem with the highly charged, biasing environment in which this research takes place, prior specification (and hence Bayesian analysis) becomes an avenue for staggering misinterpretation and abuse. I see the misinterpretation every time a Bayesian analysis describes “the posterior probability” when it should be “the posterior probability from this joint prior” – which reveals PostP to be just as hypothetical as those repeated samples that frequentist primers talk of, albeit along a different axis of uncertainty. The user problem that afflicts both frequentist and Bayesian outputs is that both provide only measures of uncertainty conditional on whatever models went into them, and so their outputs will generate overconfidence by anchoring viewers to incomplete uncertainty measures.
      
      These problems lead many savvy analysts to target feasible goals like getting credible estimates for marginal relations. They limit data models to design information plus some regularity (smoothness) conditions that no stakeholder would question, and then limit parameter information to parsimony (penalization or shrinkage) conditions that all stakeholders would find plausible (as in empirical and semi-Bayes methods). A companion idea is to find a compression of the data that retains the information about the target provided by the data, given the model (which includes all the penalty functions), with minimal removal or distortion of the information that would be seen if we knew actual data-generating process in full mechanical detail (so that our predictions would indeed be accurate down to a real-noise floor). Among other ways of putting this in engineering apps I’ve seen include that our model is a filter that we want to remove noise but pass signal with minimal distortion (such as “coloration” in music reproduction).
      
      A large branch of the current causal-modeling literature stems from Robins’ work on longitudinal causality. There, full frequency models and priors are avoided using semiparametric estimation of average treatment effects using marginal structural models. There’s not much about penalization or Bayes in that literature as yet, even though those ideas can be easily applied to the parametric portion of the models; instead most models are limited to hard parsimony constraints (like generalized additivity of effects) as Box (JRSS A 1980) described for traditional “frequentist” models. Hence the outputs are limited to various P-value based statistics like CIs. But, as usual, those statistics do tell us a lot about what semi-Bayes PostP would have to look like under a broad class of priors (as previously seen in Casella & Berger Stat Sci 1987), and so provide an origin or reference point for semi-Bayes sensitivity analyses. And (not coincidentally) they numerically correspond well with outputs using very low-information “reference” priors for the parameters (as previously seen in generalized linear models, e.g., Firth Biometrika 1993). But they don’t depend on full prior specification (in fact they leave most model dimensions unconstrained), and the literature emphasizes how doing so could cause irremediable bias (systematic error) in PostP relative to the actual target (as seen in Ritov, Bickel et al,, JASA 2014).
      
      Bottom line is that in these applications (so-called frequentist) causal data-design considerations dominate the often vague genuine prior information actually present in background data, and that forcing the latter into the strict Bayesian (probability) framework can introduce serious information distortion (which reminds me of Rubin’s slogan, “Design trumps Analysis”).
      
      Finally, I still think your comments about RNGs miss my point that so-called frequentist use can be viewed as simply a way of finding the tail integrals for distributions of statistics under a model, as in bootstrapping. This is just a math application to find P-values and CIs, same as finding posterior tail areas via simulation. You seem to be decrying reification of RNGs as part of the physical mechanism generating the real data; your complaint is fine by me (unless the RNGs are for accepted physical laws as in quantum physics or thermodynamics, settings vastly beyond our fields in explanatory success). But it doesn’t apply to what I’ve been saying; again I have a different view of data probabilities, in which those are just reference or control charts for operations and model monitoring, not to be reified without explicit deduction from the underlying mechanics (design) of data generation – a view I find common among pragmatic “toolkit” statisticians.
    - Daniel Lakeland on September 27, 2019 5:36 PM at 5:36 pm said:
      
      One thing I think we should keep in mind in this discussion is the difference between engineering and science.
      
      In Engineering, a tool that gets a job accomplished pretty well is sufficient, it doesn’t need to explain why the job is accomplished, it doesn’t need to be accurate, it just needs to kind of work pretty well. It needs to accomplish an economic task (ie help us use our limited resources in an ok way).
      
      For example if you ask me what’s the wind load on a house that I should design for, I could look at the plans, calculate the cross sectional area of the house, multiply it by the stagnation pressure produced by the peak wind velocity measured in the last century in that county, and double it…. This will work fine, even though it’s very wrong.
      
      I could even put in a big dataset of wind loads and success/failure data from a bunch of damaged houses, and create an enormous spaghetti code Excel spreadsheet and then later reify it as the Oracle of Wind Loading. It could have all kinds of weird calculations that eventually are equivalent to my method above, and then we could forget how fluid mechanics works, and just use our “machine learning”. As far as the economic goal goes, it’s the same final outcome.
      
      Medicine *is* Engineering. It’s biologists who study why stuff works the way it does in bodies. Doctors and Pharmacists and things are people who try to come up with ways to do something useful to hopefully reduce suffering. Doctors and HVAC repair people are kind of similar. You talk to an HVAC repair guy and he knows a bunch of facts about how HVAC works, but in the end he’s unlikely to be able to figure out that you need to replace a particular IC on a particular control board, he’s more likely to say “let’s throw out this control board and order a new one”. This is kind of like “let’s take out your Gallbladder”. Even outside of surgery, the HVAC guy may know “using this particular lubricant oil semi-annually on the blower motors will extend the life of the motor” and doctors will say “take this statin it seems like you’ll have a reduced risk of strokes” Neither one of them knows much about why it works exactly. They have a basic just-so story: lubricant reduces wear, or statins reduce cholesterol, or whatever (I’m sure I’ll offend some doctors here, particularly some surgeons, oh well).
      
      So, when you talk about applications where we know nothing much about the mechanisms but need to take some action anyway… these are almost all Engineering problems of some sort, and in this case we don’t really have as a goal to study the underlying causes. A goal of “process control” is sufficient. And when you’re talking process control, you’re talking about averages, repetition, and frequency of “out of control” outcomes. To some extent frequency models are kind of acceptable in this context.
      
      The problem is, in the actual sciences where people are supposed to study the underlying causes, the same “process control” methodologies are being used, and the questions like “how does this work?” aren’t being asked quantitatively because there’s no tools to answer them with anyway (that the biologists know of).
      
      I work with people who study topics that have really limited mechanistic background too sometimes. Like people who want to figure out how cancer comes about, what genetics is involved. The “standard method” is to throw a crapload of tumor data into a blender and select some features and then claim that because gene X has some small p value that it is involved in “prostate cancer metastasis” or something.
      
      The people I work with have through decades of study come to some conclusions about what kinds of questions they should ask that are very different from the kinds of questions actually being asked by their peers. It’s not “what tissue type is the cancer from” for example but rather “which pathways are actively involved in the everyday function of the origin tissue”…
      
      And while we may want to build houses before we have detailed computational fluid mechanics capabilities… if we never developed the computational fluid mechanics because we solved the engineering problem “how much wind force does this house need to be designed for?” using some pure machine learning algorithm thingy that you input a crapload of “features” of the house, and it outputs a number… well we’d never really learn how the wind works. and that’d be a shame. And that’s what too much of the emphasis is in science today… You won’t understand cancer by making lists of “drug targets” based on GWAS studies, and then screening 10,000 small molecules to see which ones kill cells in vitro, and then doing in-vivo studies in mice, and then selecting the best candidates and moving to human trials on end stage patients and etc etc…
      
      But you might solve the engineering problem of “what molecule can we produce in a pill that will make us money by increasing cancer patients lives a few months?”
      
      Different tasks, different methods.
    - Martha (Smith) on September 27, 2019 10:03 PM at 10:03 pm said:
      
      Makes sense.
    - AllanC on September 28, 2019 8:14 AM at 8:14 am said:
      
      +Many
- Sameera Daniels on September 26, 2019 3:31 PM at 3:31 pm said:
  
  RE: ‘Toward that end it might even be healthy to stop thinking in terms of “frequentist” and “Bayesian” as anything more than labels for an antiquated dichotomy that would be best replaced by talking and teaching about tools for data description, information merging, prediction, calibration, decision, and more (including tools developed outside that deceptive dichotomy, like EDA, pure likelihood, etc.).’
  
  —–
  Excellent. I wonder though whether the replacements can then be adapted to self-serving ends, as some dimensions of the evidence-based medicine movement have been
  
  Yes, study the social psychology of inference and the sociology of expertise, both of which, in part, shaped by institutional & career incentives.
  
  Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Chow and Greenland: “Unconditional Interpretations of Statistics”

53 thoughts on “Chow and Greenland: “Unconditional Interpretations of Statistics””

Leave a Reply to Sander Greenland Cancel reply