if a researcher argues they have a noisy experiment, with an effect size that doesn’t transfer well to the real world

[…]

quantifying that effect is not useful/possible in this instance.

If the measurement is too noisy to yield a reliable effect size, why would you be able to trust the direction? Determining the magnitude is an intermediate step to getting the direction.

]]>Markus:

The problem with the example is that summarizing by statistical significance throws away data. It’s as if we could have a photograph of the data but instead the researcher decides to pixellate it. The result is: (a) to throw away information by discretizing continuous data, while (b) creating arbitrary patterns out of noise. Displaying all the comparisons is a great idea, but then display the continuous results, not an arbitrary discretization.

I understand that there are situations—many situations!—where we don’t particularly care about effect sizes; what we care about is the patterns of effects. But in that case the way to learn about such patterns is to display as much as you can. Not to throw away information by thresholding. Once you have all the data, I think multilevel modeling will help, but the key point with the statistics is not multilevel modeling but rather to first do no harm.

]]>They basically even admit to committing this error:

This model asserts that only three parameters (α, β, γ) are needed to perfectly specify (or encode) the disease frequencies at every combination of X = 1,0 and Z = 1,0. There is rarely any justification for this assumption; however, it is routine and usually unmentioned, or else unquestioned if the P value for the test of model fit is “big enough” (usually meaning at least 0.05 or 0.10).

[…]

As an example, suppose θ’ = 1.40 and σ’ = 0.60. Then, the following Bayesian posterior probability statements follow from model 2, the data, and an equal-odds prior:

|θ’ − 0|/σ’ = 1.40/0.60 = 2.33, giving P_0 = 0.02 as the probability that 1.40 is closer to 0 than to θ_t, and P_0/2 = 0.01 as the probability that θ_t is negative

This is just handwaving away (whatever the equivalent is for their model of) the !etc possibilities because “everyone does it”.

Also, how do they calculate Z = 2.33 -> P_0 = 0.02?

– R:

> 2*pnorm(2.33, lower.tail = F)

[1] 0.019806

So look into how the calculation for pnorm was derived to find all the other stuff that influences this p-value besides θ.
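The same arithmetic can be mirrored in Python (a sketch; the `Phi` helper is just the standard normal CDF). Under the flat-prior normal model in the quoted example, the one-sided tail area Φ(−Z) is the posterior probability that θ_t is negative, and the two-sided p-value is exactly twice it:

```python
import math

def Phi(z):
    # Standard normal CDF via the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2))

z = 2.33                 # |theta' - 0| / sigma' from the example
p = 2 * Phi(-z)          # two-sided p-value; matches the R call above
sign_error = Phi(-z)     # posterior P(theta_t < 0) under a flat prior
print(round(p, 6), round(sign_error, 6))
```

The identity p = 2·Φ(−Z) holds by construction here, which is the whole content of the “twice the posterior probability of a sign error” claim under this particular model.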

]]>This must be Greenland, Poole, Gelman day. LOL CANNOT escape them.

]]>See e.g. here for a critical perspective on this relationship:

https://statmodeling.stat.columbia.edu/2015/09/04/p-values-and-statistical-practice-2/

I’m also (very slowly) working on a paper which substantially generalizes this result, will share if I ever finish.

So it comes from this paper: https://www.ncbi.nlm.nih.gov/pubmed/23232611

It is just like I said. Using the notation above, where

P = “delta=0 & iid & etc”

They conclude:

!P = !delta=0

The correct answer is:

!P = !delta=0 | !iid | !etc

You can make the same error without p-values; it doesn’t matter how the conclusion !P is arrived at. The error comes after the whole “statistical” aspect of the process.

]]>I agree Martha. The provocative title was meant to shake us in the biology community to get us to think about, and start a conversation about, the consequences of the way we train our students starting at the very beginning of their careers.

]]>See e.g. here for a critical perspective on this relationship:

https://statmodeling.stat.columbia.edu/2015/09/04/p-values-and-statistical-practice-2/

I’m also (very slowly) working on a paper which substantially generalizes this result, will share if I ever finish.

]]>That is fine. Do you have a link to someone else coming to this “p-value is twice the posterior probability that we have made a sign error” conclusion?

]]>“Negating a conjunction” was on the tip of my tongue:

https://en.wikipedia.org/wiki/De_Morgan%27s_laws

OK, it’s clear resolving our disagreement would take more time than I have, so let’s agree to disagree.

]]>the force of the point you’re making is considerably weaker when using robust p-values

I don’t see it affected at all. The point is this can’t possibly be true:

twice the posterior probability that we have made a sign error

The p-value is determined by the entire model, not just the delta = 0 assumption. The logic goes:

! = NOT

& = AND

| = OR

P = “delta=0 & iid & etc”

Q = “null distribution”

If P is true then we must observe Q.

When we observe !Q, that means (via modus tollens) we can validly conclude !P, where:

!P = !(delta=0 & iid & etc)

!P = !delta=0 | !iid | !etc

You aren’t testing delta=0 in isolation. We only know at least one of the assumptions used to derive the prediction (here the “null distribution”) is incorrect.

Given this, I know the p-value cannot possibly be “twice the posterior probability that” delta is negative (although it appears positive).
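To make the “any conjunct can be the broken one” point concrete, here is a small simulation sketch in Python (the helpers are illustrative, not from the thread): delta = 0 is true throughout, but the observations are autocorrelated, so the iid conjunct is false and small p-values show up far more often than the nominal 5%.

```python
import math
import random
import statistics

random.seed(1)

def t_pvalue(xs):
    # Two-sided one-sample test of mean = 0, using a normal
    # approximation to the t distribution (fine for n = 100).
    n = len(xs)
    t = statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))
    return math.erfc(abs(t) / math.sqrt(2))  # = 2 * (1 - Phi(|t|))

def ar1_sample(n, rho=0.9):
    # delta = 0 is TRUE here, but neighboring observations are
    # correlated, violating the iid assumption baked into the test.
    xs, x = [], 0.0
    for _ in range(n):
        x = rho * x + random.gauss(0.0, 1.0)
        xs.append(x)
    return xs

rejections = sum(t_pvalue(ar1_sample(100)) < 0.05 for _ in range(1000))
print(rejections / 1000)  # far above the nominal 0.05
```

Every small p-value here is the test correctly flagging that the model is wrong; attributing it to delta ≠ 0 would be the sign-error-style mistake discussed above.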

P.S.

It is likely you are also going to be “transposing the conditional” as part of whatever line of reasoning has led you to make this claim, i.e. falsely equating P(model|data) with P(data|model).

Andrew,

Fair point. By real I don’t mean typical, I mean a p-value that actually provides textbook frequentist guarantees, and not one that merely appears to (which I agree is the far more often encountered case). Regarding nuisance parameters, this is less of an issue with e.g. bootstrap based p-values, but I agree that things are a bit messier with parametric model-based p-values. And yes, I don’t want to suggest that scientists should think of everything as effect v. no effect. I’m just pointing out that strategically and carefully using p-values in a way similar to how they appear to be used in the table in your post may not always be so dumb.

]]>Yes, basic bootstrapping assumes iid, and that the relevant statistical functional is Hadamard differentiable, and clustered bootstrapping that observations are independent across clusters, etc. I’m not suggesting you can escape any assumptions at all. I’m just saying the force of the point you’re making is considerably weaker when using robust p-values.

]]>Ram:

In most applications I’ve seen, even a point null hypothesis is really a composite hypothesis in that it involves lots of nuisance parameters (this can be seen even in the “Table 4” example in the above post; if you’re looking at lots of p-values, each one is conditional on, or averaging over, some model for all the other comparisons in the table), so it will not generally be uniformly distributed under the null hypothesis. This topic comes up from time to time on the blog, as there’s lots of confusion on this point. I don’t think it’s helpful to describe the vast majority of p-values as not being “real”! Basically it’s assumptions all the way down, except in some very rare simple situations. The real point, though, all distributional questions aside, is that I think the rejection of a null hypothesis very rarely answers scientific questions of interest. Rejecting a null hypothesis can give people an illusion of certainty, so I can see the appeal of such procedures for working scientists, but I think it’s a bad illusion to have, and it has real impacts when people then start classifying results based on significance level. At that point, they’re pretty much just taking their data and adding noise.

]]>Sure, if the model is wrong it doesn’t have that interpretation, or any other interpretation. This wouldn’t apply to “robust” p-values, like p-values generated by bootstrap CI inversion for example (asymptotically, anyway). Note that by real p-value, I meant one that is U(0,1) under the point null. If the model is wrong, it isn’t, so that’s not a real p-value. If the response is “the model is always wrong”, I agree—use robust p-values in that case.

Any method of calculating a p-value (“robust” or not) is going to require some set of assumptions beyond delta = 0.

You gave the example of bootstrapping:

In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement of the observed dataset (and of equal size to the observed dataset).

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
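A minimal sketch of that resampling step (Python, with made-up data): even a “robust” percentile-bootstrap interval is constructed by resampling with replacement from the observed data, which is precisely where iid enters.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0.5, 1.0) for _ in range(50)]  # toy sample (made up)

# Percentile bootstrap for the mean: each resample draws WITH
# REPLACEMENT from the observed data -- this resampling step is
# exactly where the iid assumption enters.
boot_means = sorted(
    statistics.mean([random.choice(data) for _ in data])
    for _ in range(2000)
)
lo, hi = boot_means[50], boot_means[1949]  # ~95% percentile interval
print(lo, hi)
```

If the observations were dependent or non-identically distributed, nothing in this procedure would detect it; the interval would simply be wrong.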

Btw, there are other issues with what you are claiming but I am just focusing on this one to avoid confusing the matter.

]]>Andrew,

To quote my initial post:

“We can always criticize such a table for failing to incorporate information we have external to the data, or for not appropriately accounting for multiple comparisons/forking paths problems, or for 0.05 being too forgiving or too demanding of a standard in context. But since pinning down agreement on some of these things is challenging, in some cases I can believe this is not a terrible way to present what you found.”

It seems you’re unhappy with particular ways of accounting for multiplicity, and with my ignoring prior information. My point is that how best to incorporate these things is debatable, and so this gives a good summary of the data on the relevant point, which the reader can mentally correct using their own preferred ideas about these things.

]]>Fair enough. I don’t have anything interesting to say about that.

]]>Sure, if the model is wrong it doesn’t have that interpretation, or any other interpretation. This wouldn’t apply to “robust” p-values, like p-values generated by bootstrap CI inversion for example (asymptotically, anyway). Note that by real p-value, I meant one that is U(0,1) under the point null. If the model is wrong, it isn’t, so that’s not a real p-value. If the response is “the model is always wrong”, I agree—use robust p-values in that case.

]]>Jeff,

Thanks for the link. I think your statement,

“The absurdity of the t-test or ANOVA way of doing science is apparent if something like temperature and CO2 are the experimental factors – for example in the many global climate change studies. What in ecology or physiology or cell biology is not related to temperature and CO2?”

makes a good point.

But I think the point needs to be stretched further, to say that the design of a study (including the type of analysis) needs to fit the circumstances of what is being studied. As one example, ANOVA experiments are appropriate in some circumstances, but not in others. Students need to see a variety of studies in a variety of situations, and need to understand why a method of analysis is or is not appropriate for a specific situation.

]]>I get this. But it doesn’t address Loken’s point that this way of thinking (which may be okay for the very local problem at hand) gets hardened into the idea that the presence, or sign, of an effect is the *ultimate* goal of good science; that brains are trained to not even think about the consequences of effect magnitudes, or of non-linear responses, or mechanistic models, etc.

]]>Ram:

In addition to the problems that Anon and Jeff point out (and I agree with both of them), there are two large and inappropriate assumptions hidden in your above statements:

1. “Assuming . . . they appropriately account for any multiple comparisons/forking paths problems”

and

2. “we have no prior information about the parameter”

The problem with your statement #1 is that I think the appropriate response to multiple comparisons and forking paths is *not* to adjust p-values but rather to report all comparisons of interest and embed them in a multilevel model, as discussed in my paper with Hill and Yajima. Again, I don’t think it makes sense to be trying to reject a null hypothesis that we already know is false, nor do I think it makes sense to pull out a few comparisons at random from the many different things we could be looking at.

The problem with your statement #2 is that, mathematically, saying “we have no prior information about the parameter” is equivalent to saying that a treatment effect is 10 times as likely to be between 100 and 110, say, than it is to be between -0.5 and 0.5. And this leads to claims like, Early childhood intervention increases earnings by 42%, or, Beautiful parents are 26% more likely to have girls.
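The partial-pooling alternative mentioned under point 1 can be caricatured in a few lines (a crude normal-normal empirical-Bayes sketch in Python with made-up numbers, not the actual method of the Gelman, Hill, and Yajima paper):

```python
import statistics

# Crude empirical-Bayes partial pooling for a set of noisy comparison
# estimates (all numbers made up for illustration): shrink each raw
# estimate toward the grand mean by a factor determined by how much
# of the raw spread looks like noise.
y = [2.1, -0.3, 0.8, 1.5, -1.2, 0.4]  # raw estimates, one per comparison
se = 1.0                               # assumed common standard error
grand = statistics.mean(y)
tau2 = max(statistics.variance(y) - se**2, 0.0)  # between-comparison variance
shrink = tau2 / (tau2 + se**2)         # 1 = no pooling, 0 = complete pooling
pooled = [grand + shrink * (yj - grand) for yj in y]
print(pooled)  # every estimate pulled toward the grand mean
```

All comparisons stay in the report; extreme estimates are pulled in rather than thresholded away, which is the opposite of the significance-asterisk summary.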

]]>I’m responding to this:

twice the posterior probability that we have made a sign error

I’m explaining why that isn’t what a p-value tells you.

]]>Jeff,

My point is that sometimes this is not a poor way of thinking. If what matters for advancing the relevant piece of science is understanding the sign of several parameters, and the data we have are too noisy to precisely estimate the magnitudes of these parameters, then thinking about the goal in terms of which signs are positive, negative or inconclusive is not necessarily a bad way to think about things.

]]>I’m not sure which statement of mine you’re responding to. I agree that a small p-value is not just evidence against the null parameter value, but against all of the assumptions made in deriving the p-value. I’m not sure what that has to do with my point, however.

]]>This seems to miss the point that I read from Loken’s statement: “The thought I’ve had lately, working with various groups of really smart and thoughtful researchers, is that Table 4 is also a model of their mental space as they think about their research and as they do their initial data analyses”, which I interpret as asterisks train our brain into thinking that “significance” is the goal of science (when in fact we learn almost nothing from these tests). If I’m interpreting Loken correctly, I agree and (as I wrote below) tried to address this here: https://rapidecology.com/2018/05/02/abandon-anova-type-experiments/

]]>My point is precisely that we need not interpret a real (read: uniformly distributed on [0, 1] under the point null hypothesis) p-value in relation to testing a straw man model, but instead as twice the posterior probability that we have made a sign error (under some conditions).

The p-value is calculated based on the entire model, the value of the average difference (or whatever) is just another assumption along with iid, normality, etc. These assumptions are combined into a model that makes a prediction about what the data should look like if the model was correct.

If **any** assumption is wrong it will affect the p-value. Just because you care more about one assumption than the others doesn’t mean you get to attribute a small p-value to that particular assumption being incorrect.

Anoneuoid,

My point is precisely that we need not interpret a real (read: uniformly distributed on [0, 1] under the point null hypothesis) p-value in relation to testing a straw man model, but instead as twice the posterior probability that we have made a sign error (under some conditions). In that case, the p-value itself or a conventional dichotomization of it can be a reasonable way to summarize what we’ve found.

]]>Assuming the underlying p-values are real (meaning they appropriately account for any multiple comparisons/forking paths problems

If multiple comparisons/etc isn’t part of your model of how the data was generated and you get a small p-value when testing that model, the p-value correctly did its job. It is perfectly “real”. The problem is testing a strawman model to begin with.

]]>We can always criticize such a table for failing to incorporate information we have external to the data, or for not appropriately accounting for multiple comparisons/forking paths problems, or for 0.05 being too forgiving or too demanding of a standard in context. But since pinning down agreement on some of these things is challenging, in some cases I can believe this is not a terrible way to present what you found. I haven’t read this particularly study so I have no opinion in this specific case.

]]>Realistically, for me as an advisor, for the sake of my students and postdocs’ careers, it *is* about what it will take to get the result published. But it is also about navigating uncertainty.

The big problem I am facing currently is making reviewers and editors of journals understand that they need to get past finding answers and instead just focus on the probability distribution over possible answers.

]]>Matt:

Here’s what Guido and I wrote about forward and reverse causation:

The statistical and econometrics literature on causality is more focused on “effects of causes” than on “causes of effects.” That is, in the standard approach it is natural to study the effect of a treatment, but it is not in general possible to define the causes of any particular outcome. This has led some researchers to dismiss the search for causes as “cocktail party chatter” that is outside the realm of science. We argue here that the search for causes can be understood within traditional statistical frameworks as a part of model checking and hypothesis generation. We argue that it can make sense to ask questions about the causes of effects, but the answers to these questions will be in terms of effects of causes.

I don’t think null hypothesis significance testing has anything to do with this, one way or another.

]]>It can distract if you let it. My guess though is that some keep very good focus. I mentioned on my Twitter that those exposed to the 90s’ Evidence-Based thought leaders and their scholarship have substantial decisional edge analytically.

I was wondering though whether there were articles comparing conflict-free research with conflict-ridden research practices: that is, comparing how biases cycle in both. I don’t think Kahneman and Tversky have done a comparative analysis as such.

]]>“it’s ultimately not about what it takes, or should take, to get a result published, but rather how we as researchers can navigate through uncertainty and not get faked out by noise in our own data.”

Yes and yes.
