Statistical-significance thinking is not just a bad way to publish, it’s also a bad way to think

Eric Loken writes:

The table below was on your blog a few days ago, with the clear point that p-values (and, even worse, significance versus non-significance) are a poor summary of data. The thought I’ve had lately, working with various groups of really smart and thoughtful researchers, is that Table 4 is also a model of their mental space as they think about their research and as they do their initial data analyses.

It’s getting much easier to make the case that Table 4 is not acceptable to publish. But I think it’s also true that Table 4 is actually the internal working model for a lot of otherwise smart scientists and researchers. That’s harder to fix!

I agree. One problem with all this discussion of forking paths, publication bias, etc., is that this focus on the process of publication/criticism/replication/etc can distract us from the value of thinking clearly when doing research: avoiding the habits of wishful thinking and discretization that lead us to draw strong conclusions from noisy data.

Not long ago we discussed a noisy study that produced a result in the opposite direction of the original hypothesis, leading the researcher to completely change the scientific story. Changing your model of the world in response to data is a good thing—but not if the data are essentially indistinguishable from noise. Actually, in that case the decision was based on a p-value that did not reach the traditional level of statistical significance, but the general point still holds.

Whether you’re studying voting, or political attitudes, or sex ratios, or whatever, it’s ultimately not about what it takes, or should take, to get a result published, but rather how we as researchers can navigate through uncertainty and not get faked out by noise in our own data.

36 thoughts on “Statistical-significance thinking is not just a bad way to publish, it’s also a bad way to think”

    • Matt:

      Here’s what Guido and I wrote about forward and reverse causation:

      The statistical and econometrics literature on causality is more focused on “effects of causes” than on “causes of effects.” That is, in the standard approach it is natural to study the effect of a treatment, but it is not in general possible to define the causes of any particular outcome. This has led some researchers to dismiss the search for causes as “cocktail party chatter” that is outside the realm of science. We argue here that the search for causes can be understood within traditional statistical frameworks as a part of model checking and hypothesis generation. We argue that it can make sense to ask questions about the causes of effects, but the answers to these questions will be in terms of effects of causes.

      I don’t think null hypothesis significance testing has anything to do with this, one way or another.

  1. “this focus on the process of publication/criticism/replication/etc can distract us from the value of thinking clearly when doing research: avoiding the habits of wishful thinking and discretization that lead us to draw strong conclusions from noisy data.”

    “it’s ultimately not about what it takes, or should take, to get a result published, but rather how we as researchers can navigate through uncertainty and not get faked out by noise in our own data.”

    Yes and yes.

  2. Re: One problem with all this discussion of forking paths, publication bias, etc., is that this focus on the process of publication/criticism/replication/etc can distract us from the value of thinking clearly when doing research: avoiding the habits of wishful thinking and discretization that lead us to draw strong conclusions from noisy data.

    It can distract if you let it. My guess, though, is that some keep very good focus. I mentioned on my Twitter that those exposed to the 90s’ Evidence-Based thought leaders and their scholarship have a substantial analytical edge in decision-making.

    I was wondering, though, whether there were articles comparing conflict-free research with conflict-ridden research practices: that is, comparing how biases play out in both. I don’t think Kahneman and Tversky have done a comparative analysis as such.

  3. “…it’s ultimately not about what it takes, or should take, to get a result published, but rather how we as researchers can navigate through uncertainty”

    Realistically, for me as an advisor, for the sake of my students and postdocs’ careers, it *is* about what it will take to get the result published. But it is also about navigating uncertainty.

    The big problem I am facing currently is making reviewers and journal editors understand that they need to get past demanding definitive answers and instead focus on the probability distribution over possible answers.

  4. Assuming the underlying p-values are real (meaning they appropriately account for any multiple comparisons/forking paths problems), had they replaced the stars in the table with +/- signs, I don’t know that this would be so regrettable. The vast majority of (two-sided) p-values can be interpreted as twice the posterior probability of a sign error when we have no prior information about the parameter nor any conditional type I error information. In that case, “significance” is really just a conventional cutoff (an arbitrary one, to be sure) for deciding when we have and when we haven’t accumulated adequate evidence to pin down the sign of the parameter. And if all we care about at the moment is the sign of the parameter, or if we think our study is insufficient to precisely estimate magnitudes, then this doesn’t seem too bad.

    We can always criticize such a table for failing to incorporate information we have external to the data, or for not appropriately accounting for multiple comparisons/forking paths problems, or for 0.05 being too forgiving or too demanding a standard in context. But since pinning down agreement on some of these things is challenging, in some cases I can believe this is not a terrible way to present what you found. I haven’t read this particular study so I have no opinion in this specific case.
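
    A minimal simulation sketch of that sign-error correspondence, with a very diffuse normal prior standing in for “no prior information” and made-up numbers (R):

    set.seed(1)
    n_sims <- 1e6
    theta  <- rnorm(n_sims, 0, 10)     # very diffuse ("flat-ish") prior on the true effect
    y      <- rnorm(n_sims, theta, 1)  # estimate with known standard error 1

    p_two_sided <- 2 * pnorm(-abs(y))      # usual two-sided p-value
    sign_error  <- sign(y) != sign(theta)  # declared direction is wrong

    # among results with p-values near 0.05, the sign-error rate should be
    # close to 0.05 / 2 = 0.025
    band <- p_two_sided > 0.04 & p_two_sided < 0.06
    mean(sign_error[band])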

    • Assuming the underlying p-values are real (meaning they appropriately account for any multiple comparisons/forking paths problems)

      If multiple comparisons/etc isn’t part of your model of how the data were generated and you get a small p-value when testing that model, the p-value correctly did its job. It is perfectly “real”. The problem is testing a strawman model to begin with.

      • Anoneuoid,

        My point is precisely that we need not interpret a real (read: uniformly distributed on [0, 1] under the point null hypothesis) p-value in relation to testing a straw man model, but instead as twice the posterior probability that we have made a sign error (under some conditions). In that case, the p-value itself or a conventional dichotomization of it can be a reasonable way to summarize what we’ve found.

        • My point is precisely that we need not interpret a real (read: uniformly distributed on [0, 1] under the point null hypothesis) p-value in relation to testing a straw man model, but instead as twice the posterior probability that we have made a sign error (under some conditions).

          The p-value is calculated based on the entire model, the value of the average difference (or whatever) is just another assumption along with iid, normality, etc. These assumptions are combined into a model that makes a prediction about what the data should look like if the model was correct.

          If any assumption is wrong it will affect the p-value. Just because you care more about one assumption than the others doesn’t mean you don’t get to attribute a small p-value to it being incorrect.
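
          As a small illustration of this point (a toy example, not from any study under discussion), here the true mean difference is exactly zero but the independence assumption of the t-test is violated, and small p-values still show up far more than 5% of the time:

          set.seed(1)
          one_pvalue <- function(n = 30, rho = 0.5) {
            # a shared group-level shock makes observations within each group correlated
            g1 <- rnorm(1, 0, sqrt(rho)) + rnorm(n, 0, sqrt(1 - rho))
            g2 <- rnorm(1, 0, sqrt(rho)) + rnorm(n, 0, sqrt(1 - rho))
            t.test(g1, g2)$p.value  # true mean difference is exactly zero
          }
          pvals <- replicate(5000, one_pvalue())

          # with delta = 0 and iid data this would be about 0.05; here it is far larger,
          # so a small p-value need not implicate the delta = 0 assumption at all
          mean(pvals < 0.05)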

        • I’m not sure which statement of mine you’re responding to. I agree that a small p-value is not just evidence against the null parameter value, but against all of the assumptions made in deriving the p-value. I’m not sure what that has to do with my point, however.

        • I’m responding to this:

          twice the posterior probability that we have made a sign error

          I’m explaining why that isn’t what a p-value tells you.

        • Sure, if the model is wrong it doesn’t have that interpretation, or any other interpretation. This wouldn’t apply to “robust” p-values, like p-values generated by bootstrap CI inversion for example (asymptotically, anyway). Note that by real p-value, I meant one that is U(0,1) under the point null. If the model is wrong, it isn’t, so that’s not a real p-value. If the response is “the model is always wrong”, I agree—use robust p-values in that case.
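
          For concreteness, here is a rough sketch of one such bootstrap p-value, using simple percentile-interval inversion for a difference in means (made-up data; a studentized or BCa version would be more careful):

          set.seed(1)
          x <- rnorm(40, mean = 0.3)  # made-up "treatment" sample
          y <- rnorm(40, mean = 0.0)  # made-up "control" sample

          boot_diff <- replicate(10000,
            mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE)))

          # smallest alpha at which the two-sided percentile interval excludes zero
          p_boot <- 2 * min(mean(boot_diff <= 0), mean(boot_diff >= 0))
          p_boot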

        • Ram:

          In most applications I’ve seen, even a point null hypothesis is really a composite hypothesis in that it involves lots of nuisance parameters (this can be seen even in the “Table 4” example in the above post; if you’re looking at lots of p-values, each one is conditional on, or averaging over, some model for all the other comparisons in the table), so it will not generally be uniformly distributed under the null hypothesis. This topic comes up from time to time on the blog, as there’s lots of confusion on this point. I don’t think it’s helpful to describe the vast majority of p-values as not being “real”! Basically it’s assumptions all the way down, except in some very rare simple situations.

          The real point, though, all distributional questions aside, is that I think the rejection of a null hypothesis very rarely answers scientific questions of interest. Rejecting a null hypothesis can give people an illusion of certainty, so I can see the appeal of such procedures for working scientists, but I think it’s a bad illusion to have, and it has real impacts when people then start classifying results based on significance level. At that point, they’re pretty much just taking their data and adding noise.

        • Andrew,

          Fair point. By real I don’t mean typical, I mean a p-value that actually provides textbook frequentist guarantees, and not one that merely appears to (which I agree is the far more often encountered case). Regarding nuisance parameters, this is less of an issue with e.g. bootstrap based p-values, but I agree that things are a bit messier with parametric model-based p-values. And yes, I don’t want to suggest that scientists should think of everything as effect v. no effect. I’m just pointing out that strategically and carefully using p-values in a way similar to how they appear to be used in the table in your post may not always be so dumb.

    • This seems to miss the point that I read from Loken’s statement: “The thought I’ve had lately, working with various groups of really smart and thoughtful researchers, is that Table 4 is also a model of their mental space as they think about their research and as they do their initial data analyses,” which I interpret as: asterisks train our brains into thinking that “significance” is the goal of science (when in fact we learn almost nothing from these tests). If I’m interpreting Loken correctly, I agree and (as I wrote below) tried to address this here: https://rapidecology.com/2018/05/02/abandon-anova-type-experiments/

      • Jeff,

        My point is that sometimes this is not a poor way of thinking. If what matters for advancing the relevant piece of science is understanding the sign of several parameters, and the data we have are too noisy to precisely estimate the magnitudes of these parameters, then thinking about the goal in terms of which signs are positive, negative or inconclusive is not necessarily a bad way to think about things.

        • I get this. But it doesn’t address Loken’s point that this way of thinking (which may be okay for the very local problem at hand) gets hardened into the idea that the presence, or sign, of an effect is the *ultimate* goal of good science; that brains are trained to not even think about the consequences of effect magnitudes, or of non-linear responses, or mechanistic models, etc.

      • Jeff,

        Thanks for the link. I think your statement,

        “The absurdity of the t-test or ANOVA way of doing science is apparent if something like temperature and CO2 are the experimental factors – for example in the many global climate change studies. What in ecology or physiology or cell biology is not related to temperature and CO2?”

        makes a good point.

        But I think the point needs to be stretched further, to say that the design of a study (including the type of analysis) needs to fit the circumstances of what is being studied. As one example, ANOVA experiments are appropriate in some circumstances, but not in others. Students need to see a variety of studies in a variety of situations, and need to understand why a method of analysis is or is not appropriate for a specific situation.

        • I agree, Martha. The provocative title was meant to shake us in the biology community, to get us to think about, and start a conversation about, the consequences of the way we train our students starting at the very beginning of their careers.

    • Ram:

      In addition to the problems that Anon and Jeff point out (and I agree with both of them), there are two large and inappropriate assumptions hidden in your above statements:

      1. “Assuming . . . they appropriately account for any multiple comparisons/forking paths problems”

      and

      2. “we have no prior information about the parameter”

      The problem with your statement #1 is that I think the appropriate response to multiple comparisons and forking paths is not to adjust p-values but rather to report all comparisons of interest and embed them in a multilevel model, as discussed in my paper with Hill and Yajima. Again, I don’t think it makes sense to be trying to reject a null hypothesis that we already know is false, nor do I think it makes sense to pull out a few comparisons at random from the many different things we could be looking at.

      The problem with your statement #2 is that, mathematically, saying “we have no prior information about the parameter” is equivalent to saying that a treatment effect is 10 times as likely to be between 100 and 110, say, than it is to be between -0.5 and 0.5. And this leads to claims like, Early childhood intervention increases earnings by 42%, or, Beautiful parents are 26% more likely to have girls.
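
      As a toy sketch of the partial-pooling alternative (this is just a simple normal-normal shrinkage calculation with made-up numbers, not the actual procedure from that paper):

      set.seed(1)
      J        <- 20
      true_eff <- rnorm(J, 0, 0.5)        # hypothetical true effects
      se       <- rep(1, J)               # hypothetical standard errors
      est      <- rnorm(J, true_eff, se)  # raw, noisy estimates

      tau2   <- max(var(est) - mean(se^2), 0)  # crude between-effect variance
      shrink <- tau2 / (tau2 + se^2)           # weight on each raw estimate
      pooled <- shrink * est + (1 - shrink) * mean(est)

      # the raw estimates get pulled toward the common mean rather than being
      # sorted into "significant" and "not significant" bins
      round(cbind(raw = est, partially_pooled = pooled), 2)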

      • Andrew,

        To quote my initial post:

        “We can always criticize such a table for failing to incorporate information we have external to the data, or for not appropriately accounting for multiple comparisons/forking paths problems, or for 0.05 being too forgiving or too demanding of a standard in context. But since pinning down agreement on some of these things is challenging, in some cases I can believe this is not a terrible way to present what you found.”

        It seems you’re unhappy with particular ways of accounting for multiplicity, and with my ignoring prior information. My point is that how best to incorporate these things is debatable, and so this gives a good summary of the data on the relevant point, which can be mentally corrected by the reader using their own preferred ideas about these things.

    • Sure, if the model is wrong it doesn’t have that interpretation, or any other interpretation. This wouldn’t apply to “robust” p-values, like p-values generated by bootstrap CI inversion for example (asymptotically, anyway). Note that by real p-value, I meant one that is U(0,1) under the point null. If the model is wrong, it isn’t, so that’s not a real p-value. If the response is “the model is always wrong”, I agree—use robust p-values in that case.

      Any method of calculating a p-value (“robust” or not) is going to require some set of assumptions beyond delta = 0.

      You gave the example of bootstrapping:

      In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement, of the observed dataset (and of equal size to the observed dataset).

      https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

      Btw, there are other issues with what you are claiming but I am just focusing on this one to avoid confusing the matter.

      • Yes, basic bootstrapping assumes iid and that the relevant statistical functional is Hadamard differentiable, clustered bootstrapping assumes that observations are independent across clusters, etc. I’m not suggesting you can escape any assumptions at all. I’m just saying the force of the point you’re making is considerably weaker when using robust p-values.

        • the force of the point you’re making is considerably weaker when using robust p-values

          I don’t see it affected at all. The point is this can’t possibly be true:

          twice the posterior probability that we have made a sign error

          The p-value is determined by the entire model, not just the delta = 0 assumption. The logic goes:

          ! = NOT
          & = AND
          | = OR
          P = “delta=0 & iid & etc”
          Q = “null distribution”

          If P is true then we must observe Q.

          When we observe !Q, that means (via modus tollens) we can validly conclude !P, where:

          !P = !(delta=0 & iid & etc)
          !P = !delta=0 | !iid | !etc

          You aren’t testing delta=0 in isolation. We only know at least one of the assumptions used to derive the prediction (here the “null distribution”) is incorrect.

          Given this, I know the p-value cannot possibly be “twice the posterior probability that” delta is positive (although it appears negative).

          P.S.
          It is likely you are also going to be “transposing the conditional” as part of whatever line of reasoning has led you to make this claim, i.e. falsely equating P(model|data) with P(data|model).

        • That is fine. Do you have a link to someone else coming to this “p-value is twice the posterior probability that we have made a sign error” conclusion?

        • See e.g. here for a critical perspective on this relationship:

          https://statmodeling.stat.columbia.edu/2015/09/04/p-values-and-statistical-practice-2/

          I’m also (very slowly) working on a paper which substantially generalizes this result, will share if I ever finish.

          So it comes from this paper: https://www.ncbi.nlm.nih.gov/pubmed/23232611

          It is just like I said. Using the notation above, where

          P = “delta=0 & iid & etc”

          They conclude:
          !P = !delta=0

          The correct answer is:
          !P = !delta=0 | !iid | !etc

          You can make the same error without p-values; it doesn’t matter how the conclusion of !P is arrived at. The error comes after the whole “statistical” aspect of the process.

        • They basically even admit to committing this error:

          This model asserts that only three parameters (α, β, γ) are needed to perfectly specify (or encode) the disease frequencies at every combination of X = 1,0 and Z = 1,0. There is rarely any justification for this assumption; however, it is routine and usually unmentioned, or else unquestioned if the P value for the test of model fit is “big enough” (usually meaning at least 0.05 or 0.10).
          […]
          As an example, suppose θ’ = 1.40 and σ’ = 0.60 . Then, the following Bayesian posterior probability statements follow from model 2, the data, and an equal-odds prior:
          |θ’ − 0|/σ’ = 1.40/0.60 = 2.33, giving P_0 = 0.02 as the probability that 1.40 is closer to 0 than to θ_t, and P_0/2 = 0.01 as the probability that θ_t is negative

          This is just handwaving away (whatever the equivalent is for their model of) the !etc possibilities because “everyone does it”.

          Also, how do they calculate Z = 2.33 -> P_0 = 0.02?

          – R:
          > 2*pnorm(2.33, lower.tail = F)
          [1] 0.019806

          So look into how the calculation for pnorm was derived to find all the other stuff that influences this p-value besides θ.
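
          For reference, the arithmetic behind those quoted numbers, all of which flows from the same normal model (my own check, not from the paper):

          z <- 1.40 / 0.60                  # the quoted theta' / sigma', about 2.33
          2 * pnorm(z, lower.tail = FALSE)  # two-sided p, about 0.02
          pnorm(z, lower.tail = FALSE)      # the flat-prior Pr(theta_t < 0), about 0.01 = p/2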

  5. The point is good, but the example seems ill suited: It might be an example of bad thinking, I haven’t read the paper. But from the looks of it, it seems more likely Table 4 is an example of pattern inspection, which IMHO is just as valid as its statistical equivalent; it just trades rigor for comprehensibility. Basically, yes, signs and magnitudes would be preferable, as would be tests of the strength of the difference. But if a researcher argues they have a noisy experiment, with an effect size that doesn’t transfer well to the real world, and so they’re happy to take the difference between the pattern of P0 and PI in Table 4 across regions as ‘the result’, I think that’s perfectly fine. If they get 4 stars for PI when they test the target region and 0 stars at the post-target region, I’m willing to accept provisionally that there’s something going on there. I’ll also take the authors’ word that quantifying that effect is not useful/possible in this instance. At that point, other considerations dominate: Is there a theoretical prediction, is the experiment sound, is the sample representative, etc.? All these are more important than whether this pattern was tested with the appropriate multilevel model.

    • Markus:

      The problem with the example is that summarizing by statistical significance throws away data. It’s as if we could have a photograph of the data but instead the researcher decides to pixellate it. The result is: (a) to throw away information by discretizing continuous data, while (b) creating arbitrary patterns out of noise. Displaying all the comparisons is a great idea, but then display the continuous results, not an arbitrary discretization.

      I understand that there are situations—many situations!—where we don’t particularly care about effect sizes; what we care about is the pattern of effects. But in that case the way to learn about such patterns is to display as much as you can, not to throw away information by thresholding. Once you have all the data, I think multilevel modeling will help, but the key point with the statistics is not multilevel modeling but rather to first do no harm.
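
      Here’s a toy illustration of how thresholding manufactures a pattern (hypothetical estimates and standard errors, not from the study under discussion): two comparisons estimate nearly the same thing, but the stars make them look like an effect and a non-effect.

      est <- c(0.30, 0.20)  # hypothetical point estimates from two comparisons
      se  <- c(0.10, 0.12)  # hypothetical standard errors
      round(2 * pnorm(-abs(est / se)), 3)  # about 0.003 (stars) and 0.096 (no stars)

      # yet the two estimates are entirely consistent with each other
      diff_z <- (est[1] - est[2]) / sqrt(se[1]^2 + se[2]^2)
      2 * pnorm(-abs(diff_z))              # about 0.5: no evidence the effects differ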

    • if a researcher argues they have a noisy experiment, with an effect size that doesn’t transfer well to the real world
      […]
      quantifying that effect is not useful/possible in this instance.

      If the measurement is too noisy to yield a reliable effect size, why would you be able to trust the direction? Determining magnitude is an intermediate step to getting the direction.
