“We should only let data speak for themselves when they have learned to clean themselves”

Valentin Amrhein points to this article in the Journal of the American Medical Association making use of Sander Greenland’s idea of “compatibility intervals” (see earlier discussion here).

Sander adds:

I [Greenland] was impressed with and even heartened by Hawkins & Samuels until their closing statement, which, while not by any means nullifying the value of the article, still repeated what I’d call the support fallacy, a subtle variation on the inversion-fallacy theme: “This approach could provide a conclusion centered around an understanding of the parameter values that are best supported by the data.”

I would have replaced that offending closer by a paragraph explaining how that sort of description may not be too harmful in a perfect RCT analyzed with a specification-robust method such as a randomization test, but is just wrong conceptually and potentially quite misleading for most any RCT with important amounts of selection/censoring and for any nonrandomized causal comparison as in their example. Such data don’t support anything, any more than they speak for themselves (Erik van Zwet responded “we should only let data speak for themselves when they have learned to clean themselves”). At best we only get support from a model informed by the data, as in a pure-likelihood or Bayesian analysis. That model has to impose strong restrictions on possibilities in order to produce support, even support as weak as that from a CI of 0.43 to 2.98 from the 50 vs. 52 patient comparison in the example in Hawkins & Samuels. So while it’s great to see a detailed response to the null fallacy at last break through to JAMA, it’s still a bit short of what’s needed for much if not most of what med journals publish, which involves slapping perfect-RCT stats onto data for which randomness assumptions at best constitute optimistic speculation (“perfect” RCTs include those with random nonadherence/loss/censoring as assumed in an ordinary Cox PH analysis).

The temptation to mistranslate compatibility into support seems as severe as the temptation to mistranslate null P-values into no-effect probabilities, reinforcing the need to underscore how frequentist stats can only refute, not support, hypotheses or models (despite all the contortions to make it appear otherwise, such as claiming that rejection of the null constitutes support for a specific alternative). It seems ironic that this limitation of frequentist methods is used by most critics of those methods to condemn those methods for providing poor measures of support, which for me is like condemning hammers because they won’t drive screws.

28 thoughts on ““We should only let data speak for themselves when they have learned to clean themselves””

  1. “At best we only get support from a model informed by the data, as in a pure-likelihood or Bayesian analysis” – if support here means that the interval actually has the probability it’s assigned (e.g., 95%) of containing the true value, it seems like it’s still useful to take a compatibility perspective, from the standpoint that the values in the Bayesian interval are compatible with the data under all the assumptions the model makes about how the data were generated, right? I guess I’m wondering how much the emphasis in the compatibility interpretation is a response to people treating intervals as containing the true value with some probability, versus how much it’s about emphasizing that the interpretation has to be conditional on all the assumptions made in modeling. The latter seems especially important… anecdotally I’ve found that while many people are somewhat receptive to the idea that intervals are hard to interpret in the conventional way, it’s harder to convince people that it’s very difficult to take any estimates from experiments at face value due to potential model misspecification and bad assumptions.

      • The statements are called compatibility statements to _force_ recognition of the model dependence, whatever the methodological basis of the statement.

        Bayesian analysis, from that perspective, is just compatibility weighted by the prior; for instance, a credibility interval can be downgraded into a (prior-weighted) compatibility interval.

        • Keith:

          Sure, but, as discussed before, a problem with the “compatibility interval” concept is: What is the user supposed to do when the compatibility interval is very narrow? This can either be taken as good news (under the model, the parameter has been estimated very accurately) or as bad news (there is only a very narrow range of parameter values that are compatible with the data). In practice, narrow intervals are generally taken as good news, but then this interpretation blows up when the interval is empty. Recall my discussion from a few years ago, Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests.

        • Keith:

          It’s also a real question: If you’re gonna treat interval estimates as compatibility intervals rather than uncertainty intervals, what are you supposed to do with them?

        • > are called compatibility statements to _force_ recognition of the model dependence

          I like this, though I am scared I’m missing some technical definition of compatibility. Which I guess is my fear with all uncertainty/confidence/compatibility phrases, and maybe that’s appropriate lol.

      • Andrew: I and others have repeatedly started out by defining “model” logically, as the set of all assumptions (constraints) used in deriving a statistic (whether a P-value, CI, posterior probability, whatever), including assumptions about the sample space, selection, parametric form, random distributions, etc. This means that all nontrivial statements are model dependent, even so-called nonparametric (including “distribution-free”) methods that produce only P-values (“significance levels”), as well as all Neyman-Pearson (NP) decision methods (“hypothesis tests” and “confidence intervals”), pure-likelihood methods, Bayesian methods, etc. They differ greatly however in how explicit they are in their model dependence.

        Keith is absolutely correct to note that compatibility statements are intended to be more explicit about the model involved than are conventional statements. In parallel with everyday usage, compatibility (consistency, consonance) and its negation (incompatibility, discrepancy) refer only to some measure of distance or divergence between that underlying model and the numeric data. Unlike NP decisions and posterior probabilities (two sides of the same coin in my view), it does not require conditioning on the model. It is instead mute about whether the model makes any sense in the broader application context. That means that if we observe a very small P-value or posterior probability p for a statistical hypothesis H about a parameter, the apparent incompatibility could be due to failings of the model from which that p was deduced, even if the scientific hypothesis that H is supposed to represent (e.g., no benefit of treatment on the outcome scale) is correct. Conversely, if we observe a very large P-value or posterior probability p for a statistical hypothesis H about a parameter, the apparent compatibility could be due to failings of the model from which that p was deduced, even if the scientific hypothesis that H is supposed to represent is grossly violated. Compatibility only measures how much the data and the model seem to agree or conflict along some axis in observation-expectation space, nothing about why the data seem to conflict or agree with H. In contrast, NP and Bayesian interpretations are dead in the water without assuming the model (or, in “robust” statistics, some more general but still highly constrained model).

        To appreciate how much logically weaker compatibility is compared to claims based on “error rates” and posterior probabilities, consider that everything we have observed is perfectly compatible with the hypothesis (believed by some) that we are actually part of a simulation program in a hypercomputer, with all experiences being delusions generated from feedback between inputs and outputs of our personal subprogram (a modern version of the “brain in a jar” scenario). It is also perfectly compatible with the older theory (believed by many) that the universe was created in six days around seven thousand years ago, and that all its fossils, artefacts, and geologic and cosmologic properties were created then in order to test our faith in the literal interpretation of the creation account in the Old Testament, by deceptively pointing to the very different explanation for the empirical world now accepted as “scientific”. There is also a denumerable infinitude of other theories perfectly compatible with all our experiences.

        Now, just as NP confidence intervals and NP tests are inversions of one another (a trivial corollary of their definitions), compatibility P-intervals are inversions of compatibility P-values. But unlike NP, compatibility is not forced into what are often useless or misleading conditional statements. Compatibility intervals or regions are only shorthand summaries of compatibility measurements on models in a family relative to data (or data projections). The models in the family share all their background assumptions (such as linearity on some scale); the region is constructed by measuring compatibility along one or a few other dimensions of the model subspace defined by the family (e.g., the one defined by a model coefficient). They show which models in the family meet some conventional descriptive minimum (usually p>0.05) along that dimension. Thus, in compatibility terms, a narrow region does NOT mean the results are more precise in the usual conditional sense of pinning down a parameter in an assumed model family; instead it says that the data narrowly box in the members of the model family according to the minimal-compatibility convention. If we are unwilling to take the model family as known with certainty (as is done in multi-billion dollar experiments in high-energy physics), there is NO deduction from that narrowness to a scientific claim that the real effect represented by the parameter has been precisely determined.

        In the compatibility view, any NP, likelihoodist, Bayesian etc. claims about an analysis having power or precision in a real application involve acts of faith in the model families (assumption sets) used to derive the test or region. Those families will include the model family used for the prior or random-effect distribution, as well as the model family used for treatment assignment or for outcome probability or expectation. If the resulting region were empty, that would tell us that no model in the family met the minimum compatibility criterion. The only practical inference I would draw from such a result is that the model family has shown itself unsuitable for the scientific problem. This inference extends to nonempty but narrow regions: When a region becomes so narrow that we suspect from the application context that errors and uncertainties from model misspecification are no longer negligible compared to the random errors allowed by the model, we should move on to more general model families that better account for real contextual uncertainties. Note that model diagnostics can fail us in this task, because they too are compatibility measures and like all statistics are incapable of discriminating among the infinitude of model families that appear compatible with the data (as may be seen from the variety of ways we can “overfit” data).
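        To make the inversion concrete, here is a minimal numeric sketch (the data and the one-parameter normal-mean family are invented, with a plain normal approximation; this is not the JAMA example): the 95% compatibility interval is just the set of parameter values whose two-sided P-value stays above the p > 0.05 convention.

        ```python
        # Hypothetical illustration: read a 95% compatibility interval off a
        # P-value function by inverting tests (keep all values with p > 0.05).
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        y = rng.normal(loc=0.3, scale=1.0, size=50)      # made-up data
        se = y.std(ddof=1) / np.sqrt(len(y))

        grid = np.linspace(-1.0, 1.0, 2001)              # candidate parameter values
        pvals = 2 * stats.norm.sf(np.abs((y.mean() - grid) / se))

        kept = grid[pvals > 0.05]                        # models meeting the convention
        print("95% compatibility interval: %.3f to %.3f" % (kept.min(), kept.max()))
        ```

        If `kept` came back empty, that would be the empty-region situation described above: no member of this one-parameter family clears the minimal-compatibility bar.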

        • > That means that if we observe a very small P-value or posterior probability p for a statistical hypothesis H about a parameter, the apparent incompatibility could be due to failings of the model from which that p was deduced, even if the scientific hypothesis that H is supposed to represent (e.g., no benefit of treatment on the outcome scale) is correct.

          Yes, you can only conclude that at least one of the assumptions making up the model, H itself included, is incorrect. Negating the conjunction (H and A) gives (!H or !A), where the “or” is inclusive.

          This is the case for any method we use to compare a theory to data; that is why it is impossible to disprove anything using science. NB: It is also impossible to prove anything, because doing so would require affirming the consequent.

          > To appreciate how much logically weaker compatibility is compared to claims based on “error rates” and posterior probabilities, consider that everything we have observed is perfectly compatible with the hypothesis (believed by some) that we are actually part of a simulation program in a hypercomputer, with all experiences being delusions generated from feedback between inputs and outputs of our personal subprogram (a modern version of the “brain in a jar” scenario).

          You might want to look into Meehl’s corroboration index, which punishes vague theories like that one:
          https://www.barelysignificant.com/post/corroboration2/

          The likelihood/posterior does as well: if any observation is consistent with the theory, its predictive density has to be spread thin, so the value is going to be small everywhere relative to a theory that makes a precise prediction, whose density/mass is concentrated into a smaller region.
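          A toy numeric version of that point (all numbers invented): a theory vague enough to be compatible with almost any observation earns a small likelihood even from a confirming observation, next to a theory that predicts sharply.

          ```python
          # Made-up example: diffuse vs. sharp predictions and their likelihoods.
          from scipy import stats

          observed = 0.8
          vague = stats.norm(loc=0.0, scale=10.0)    # "anything goes" prediction
          sharp = stats.norm(loc=1.0, scale=0.5)     # precise prediction near 1

          print("likelihood under vague theory: %.3f" % vague.pdf(observed))   # ~0.040
          print("likelihood under sharp theory: %.3f" % sharp.pdf(observed))   # ~0.737
          ```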

        • What do you mean by these two statements (they’re back to back; I’m just pulling them apart to show the separation):

          > Compatibility only measures how much the data and the model seem to agree or conflict along some axis in observation-expectation space, nothing about why the data seem to conflict or agree with H.

          > In contrast, NP and Bayesian interpretations are dead in the water without assuming the model

          So compatibility requires model + data, and doesn’t tell you much about H.

          The next sentence is saying, in NP [I’m assuming non-parametrics] and Bayesian we’re dead in the water without a model. But my understanding is compatibility requires model + data, and it apparently doesn’t tell us much about H already.

          So is NP and Bayesian more dead in the water? Or just as dead in the water as compatibility?

          Is the without-a-model part only referring to NP and not Bayesian?

  2. From the article:

    When such a study yields nonstatistically significant results (referred to as nonsignificant results in this article), an important question is whether the lack of statistical significance was likely due to a true absence of difference between the approaches or due to insufficient power.

    This question is of no importance at all. The answer is insufficient power 100% of the time.

    The scales will fall from your eyes once you accept “everything is correlated with everything else” as a basic principle.
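    A toy simulation of that claim (the 0.05 SD effect and the rest of the setup are made up): fix any nonzero true difference, however small, and the rejection rate climbs toward 1 as the sample grows, so “nonsignificance” only tells you the sample was too small to see it.

    ```python
    # Made-up example: with a tiny but nonzero true effect, "nonsignificance"
    # is purely a power (sample size) phenomenon.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_diff = 0.05                       # 0.05 SD difference, never exactly zero
    for n in (100, 1000, 10000):
        hits = 0
        for _ in range(500):
            a = rng.normal(0.0, 1.0, n)
            b = rng.normal(true_diff, 1.0, n)
            if stats.ttest_ind(a, b).pvalue < 0.05:
                hits += 1
        print(n, hits / 500)               # rejection rate rises toward 1
    ```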

    • With a few important exceptions, like tests of whether the mass of the electron in Paris is the same as the mass of the electron in New York, or similar.

      The exceptions are so far outside the usual use of statistics that they are ignorable by almost everyone.

      • Sure. I’d note that in that case exactly no difference is predicted by the theory.

        You could have a different theory that predicts some small difference (e.g., the universal “constants” are not actually quite constant), and check that value too.

        This is the opposite of how significance tests are being used in JAMA.

    • Let me push back on that slightly. Daniel Lakeland’s example, in which identical values are part of the theory, is, as you acknowledge, one exception, but even in social science the null hypothesis is only meant metaphorically to mean “no difference,” I think. What it *really* means is “differences for which sampling variance is much, much larger than between-unit variances.” This is why so many people have proposed tightening alpha as sample size increases, something that clearly makes no sense under a sharp, physics-like null hypothesis. As sample size increases, the variance of the mean due solely to sampling shrinks, and the between-unit variances in the *real* maintained hypothesis, which do not shrink as sample size increases, will inevitably come to dominate, leading to statements such as yours that 100 percent of nonsignificance issues are power issues. The solution (if you’re going to keep NHST at all) is to acknowledge that the null itself is a rhetorical approximation to a much more complex null in which we understand that unmeasured differences are always going to lead to mean differences between treated and untreated groups.

      • The model being tested is literally zero difference though.

        If you test a prediction of general relativity, it isn’t like you really incorporate the gravitational influence of every piece of matter in the universe. So we know that even the point hypothesis that gets compared to the data is going to be off. You deal with it by estimating the amount of systematic error.

        How do you estimate the amount of systematic error for a null hypothesis with no theoretical basis that no one believes though? In fact the intervention amounts to purposefully introducing a source of systematic error.

          • I realize the model is literally zero difference. And the systematic error can be at least proxied for by variance within the treated and the controls, or more grossly by looking at variance in other variables across the treated and controls. All I’m saying is that the zero-difference null is like ignoring friction… nobody thinks it’s correct, and in some problems it gives an obviously wrong answer, but it’s good enough in a lot of contexts. (Like I said, I’m only pushing back slightly. You’re clearly technically correct on the math.)

    • We really should be defining power with respect to confidence interval length, not statistical significance.

      Every study proposal should have a target confidence interval and be powered appropriately to achieve a confidence interval of that size.
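      A back-of-the-envelope sketch of what that could look like (the SD guess and the target width are invented): pick the per-arm n so the expected 95% interval half-width for a difference in means hits the target.

      ```python
      # Made-up example: "power" for precision, i.e., choose n per arm to hit a
      # target 95% CI half-width for a difference in means, given a guessed SD.
      import math
      from scipy import stats

      sd = 1.0        # guessed outcome SD in each arm
      target = 0.25   # desired 95% half-width for the mean difference
      z = stats.norm.ppf(0.975)

      # half-width ~= z * sd * sqrt(2 / n)  =>  n ~= 2 * (z * sd / target)**2
      n_per_arm = math.ceil(2 * (z * sd / target) ** 2)
      print("n per arm:", n_per_arm)   # 123 with these guesses
      ```

      Note that the calculation never references a null hypothesis at all; the design target is the width of the interval.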

  3. > In summary, both “confidence interval” and “uncertainty interval” are deceptive terms, for they insinuate that we have achieved valid quantification of confidence or uncertainty despite omitting important uncertainty sources. Such labels misrepresent deep knowledge gaps as if they were mere random errors, fully accounted for by the intervals. Replacing “significance” and “confidence” labels with “compatibility” is a simple step to encourage honest reporting of how little we can confidently conclude from our data.

    Unfortunately the “compatibility” label is also deceptive in that it insinuates that we have precisely identified the values which are incompatible – which can therefore be ruled out completely.

      • Sure, values outside of the compatibility interval are not really incompatible.

        Just like we cannot be absolutely certain that true values are inside uncertainty intervals, and we cannot be completely confident that they are within confidence intervals.

        It’s like an epistemic euphemism treadmill; changing labels is not enough.

    • (It’s true that “compatibility” has an implicit “with the model” ring to it, but the fact is that models – unlike priors when frequentist results are misinterpreted – are usually quite explicit anyway. Why wouldn’t people still behave as if the model was true?)

      • > Why wouldn’t people still behave as if the model was true?

        Because there can be other models/explanations that are also compatible. The key is to figure out an “experimentum crucis” where the observations are compatible with one explanation but not the others.

        In reality, compatibility is not binary though. That is why you use Bayes’ rule for hypotheses H_0 through H_n:

        p(H_0 | data) = p(H_0) * p(data | H_0) / [ p(H_0) * p(data | H_0) + … + p(H_n) * p(data | H_n) ]

        This posterior probability is the continuous measure of relative compatibility we want.

        • Also note that if we treat every explanation as equally likely a priori, then p(H_0) = p(H_1) = … = p(H_n). So the priors all cancel out and we are just looking at the likelihood of H_0 normalized to the sum of all the likelihoods.
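          As a toy calculation (the likelihood values are invented), with equal priors the posterior is just each likelihood divided by the sum of the likelihoods:

          ```python
          # Made-up likelihoods p(data | H_i); equal priors cancel in Bayes' rule.
          likelihoods = {"H0": 0.02, "H1": 0.10, "H2": 0.05}

          total = sum(likelihoods.values())
          posterior = {h: lik / total for h, lik in likelihoods.items()}
          print(posterior)   # H0 ~0.12, H1 ~0.59, H2 ~0.29
          ```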

        • There were also other models before the change in terminology, weren’t there? (I could agree that posterior probability is what we want, but calling an interval “compatibility” rather than “uncertainty” doesn’t give you that. I’m also skeptical that it would lead to much more honest reporting of how little we can conclude from our data. I would expect people to start publishing results “incompatible with zero” rather than “significantly different from zero”.)

        • Well yea, you can’t fix testing a strawman hypothesis by manipulating the numbers and words used to describe the procedure.

          This can only be addressed by actually testing the research hypothesis (or hypotheses).
