Jonathan Falk points me to an amusing post by Matthew Hankins giving synonyms for “not statistically significant.” Hankins writes:

The following list is culled from peer-reviewed journal articles in which (a) the authors set themselves the threshold of 0.05 for significance, (b) failed to achieve that threshold value for p and (c) described it in such a way as to make it seem more interesting.

And here are some examples:

slightly significant (p=0.09)

sufficiently close to significance (p=0.07)

trending towards significance (p>0.15)

trending towards significant (p=0.099)

vaguely significant (p>0.2)

verged on being significant (p=0.11)

verging on significance (p=0.056)

weakly statistically significant (p=0.0557)

well-nigh significant (p=0.11)

Lots more at the link.

This is great, but I do disagree with one thing in the post, which is where Hankins writes: “if you do [play the significance testing game], the rules are simple: the result is either significant or it isn’t.”

I don’t like this; I think the idea that it’s a “game” with wins and losses is a big part of the problem! More on this point in our “power = .06” post.

Andrew, you may have made the Hankins quote into a misstatement. Neyman-Pearsonian testing, which is hypothesis testing, has the rules that the Hankins quote suggests. Significance testing, for which Fisher and Student are as close to originators as you can name, yields a P-value that scales with the evidence in the data regarding the null hypothesis within the statistical model. The rules for what to do with an evidentially interpreted P-value are much more nebulous than the rules for a P-value that is to be treated dichotomously in a hypothesis test. Thus the editorially inserted words should be something like “dichotomous hypothesis testing game” rather than “significance testing game”.

This is not just word-play and historical nit-picking because the failure of statisticians to keep separate the significance test P-values and the dichotomous results of hypothesis tests is one of the most important reasons that we have ended up in a world where the P-value is so misunderstood, misused and mistreated. If anyone is uncertain of what I am writing about then he or she should read this paper for a full account that includes more historical background: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3419900/

OK, I’ve looked at the original and the inserted text is accurate. My point stands, nonetheless.

So I gather very low p-values (say .01) imply significance and some evidence against H0, and high p-values (say .11) imply insignificance and some evidence for H0.

Well imagine this situation. There are four mutually exclusive, exhaustive hypotheses H0, H1, H2, H3. One of them has to be physically true, but we don’t know which one. We collect data and get the following p-values:

Assuming H0 the p-value is .01

Assuming H1 the p-value is .00000000000001

Assuming H2 the p-value is .00000000000001

Assuming H3 the p-value is .00000000000001

Now based on this evidence, if we had to guess which hypothesis was the true one which would we pick? Wouldn’t we pick H0 because it’s the most consistent with the evidence?

Or how about this situation:

Assuming H0 the p-value is .11

Assuming H1 the p-value is .5

Assuming H2 the p-value is .11

Assuming H3 the p-value is .11

In this case wouldn’t we pick H1 because it’s the most consistent with evidence?

Bayes effortlessly gets this right of course, and also does the intuitively right thing as we add more/better data to the mix. At some point, Frequentists are really going to have to admit the reason their methods suck isn’t because they were teaching it wrong for 80 years.
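Anonymous’s setup can be made concrete with a toy numerical sketch. Everything here is a hypothetical choice of mine, not from the comment: four point hypotheses for a normal mean with known sigma = 1, a single observation, a two-sided p-value under each hypothesis, and the Bayesian posterior under a uniform prior. (Comparing p-values across hypotheses this way is exactly the move the next reply objects to; the code just illustrates the intuition.)

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_p(x, mu, sigma=1.0):
    """Two-sided p-value for one observation x under H: mean = mu."""
    z = abs(x - mu) / sigma
    return 2.0 * (1.0 - norm_cdf(z))

mus = [0.0, 5.0, 10.0, 15.0]   # four mutually exclusive point hypotheses (hypothetical)
x = 0.2                        # a single hypothetical observation

pvals = {mu: two_sided_p(x, mu) for mu in mus}

# Bayesian answer: uniform prior over the four hypotheses, so the
# posterior is just the normalized normal likelihood of x under each mean.
liks = {mu: math.exp(-0.5 * (x - mu) ** 2) for mu in mus}
total = sum(liks.values())
posterior = {mu: lik / total for mu, lik in liks.items()}

best_by_p = max(pvals, key=pvals.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_p, best_by_posterior)  # both single out mu = 0 here
```

With these numbers the hypothesis that is “most consistent with the evidence” by p-value and the maximum-posterior hypothesis coincide, which is the commenter’s point; the disagreement in the thread is over whether the p-value comparison is a legitimate way to get there.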

Hey I just thought of a new slogan for Frequentists: “Frequentist Statistics: The only subject ever invented Frequentists can’t teach correctly!”

Anon:

Noooooo, you can’t compare p-values like that. People do it all the time but it’s wrong!

Anonymous,

While I certainly agree with the position that statistical testing and p-values are widely misused and often poorly taught, and I myself would encourage an unapologetically subjective Bayesian, decision-theoretic approach, I do not think that you fairly represent the methodology available within classical statistics for reaching a decision regarding four mutually exclusive and exhaustive hypotheses.

Classical statisticians can use decision theory too.

Just out of curiosity, what did you have in mind that these four hypotheses represented: 4 points in a discrete parameter space, or a partition of a continuous parameter space? I suspect that, either way, there are probably classical statistics tools that don’t have anything to do with p-values and that would be just as “effortless” to apply as their Bayesian counterparts.

JD

The choice of four is irrelevant. It could be infinitely many hypotheses indexed by some lambda, as is often considered in statistics. The point remains.

I don’t think “p-values are widely misused and poorly taught”. There is something ridiculously absurd about people claiming that frequentist statistics is the only unteachable subject in the history of the human race, which is effectively what they are saying when they place the blame for frequentist failures on poor teaching. Frequentists invented it, they wrote the textbooks, they taught it as the only true statistics, and they did all this for 3 or 4 generations. They failed because there is a mass of theoretical problems with it that don’t go away no matter how well it is taught.

The fact is they tried to substitute their intuitive judgments in place of Bayesian statistics, thinking they’d get something better. They got something far worse, and the slimy, self-serving, worthless academic shits that they are can’t admit it.

Hi Anonymous,

Again, I think you are painting all of frequentist statistics with too broad a brush, based on theoretical problems with a subset of concepts associated with frequentist statistics.

I suspect that during those three or four generations, in an effort to make the topic of statistics accessible to a broad audience, more and more very important technical details were gradually swept under the rug. Eventually, as the details were watered down, many non-statisticians began using these tools with the false impressions that they are conceptually easy to understand and that there is no controversy surrounding their use; thus the widespread use of things like p-values, which I doubt frequentist statisticians consider to be particularly noteworthy among the discipline’s contributions to science. Both of these false impressions would be mitigated by better textbooks and better teaching.

To me it seems that the question of “Bayesian” or “Frequentist” isn’t about choosing one over the other to avoid the problems with the theory of some of the methods that can be used in those paradigms; it is foundational, a matter of philosophy of science. What arguments does one believe in favor of the claim that probabilities are unavoidably personal? Once someone has identified their philosophical view of “probability”, that should help them determine a paradigm to adopt for their statistical inference. And once they do, they can simply not use some of the tools/methodology proposed in their adopted paradigm when they find its properties problematic. For example, frequentists don’t have to use NHST… and if it were taught better, I suspect that they would understand that its applicability and the strength of the conclusions that can be drawn from it are really quite limited, and they would voluntarily choose other approaches or invent new ones, consistent with their philosophical view of the state of nature.

JD

The people using default null hypotheses do not have alternatives that allow comparison. One solution is to not test any hypotheses at all in that case, just collect data for others to use in comparing theories. That is a perfectly valid contribution to science. Also, it really is preferable to “hypothesis-driven” data collection because there is less likely to be bias. Of course at some point you have to choose what data to collect, which will be determined by some kind of preliminary speculation.

I just don’t see the point of strawman null hypothesis vs. vague preliminary speculation. Just collect the data and report it along with the methodology.

How does the “to use in comparing theories” part work?

Sometimes I think the attacking-p-values bit would get more constructive if people started posting concretely about the decision theory alternative in each case they criticize.

Because, in most cases, there is a real yes-no decision at some point, however much people may hate sharp thresholds. i.e. The real question is “Should we prescribe this drug?” not “Was the drug’s effect on pain significant”

Shifting from Frequentist to Bayesian doesn’t help unless you tell me how to use that framework to take the actual decision.

I wish Decision Theory expositions attracted 5% of the number of bloggers that p-value ridicule does.

actually decision theory is one of the motivations for bayesian approaches – they map directly to the goal of optimizing a decision theoretic outcome.

even advocates of p-values would say that you should not be using p-value when you want to do a decision theoretic inference.

That’s great. I’d love to see more posts that actually follow through till the decision theory part.

Even applied papers seem to tread very gingerly when it comes to using actual decision theory in a specific, applied context.

Rahul:

Check out the decision analysis chapter in BDA. We have three real examples.

I believe Rahul is not talking about those kinds of decision analyses. In the BDA3 chapter, the examples involve decisions of a different sort. E.g., should the 95 year old patient get radiotherapy? Here, we can define the loss function.

In the kinds of psychology experiments where the decision being made is “reject the null hypothesis of no effect”, we would have to define a loss function, which is not quite as simple as in the radiotherapy example (not to say that it’s “simple”, but it’s possible to do this in a defensible way, given enough information). Andrew, you’ve pointed out before that you didn’t say it’s going to be easy to deal with psychology experiments in a decision-theoretic manner. So, really, the decision-theoretic approach to doing hypothesis testing is going to be an uphill struggle, and Rahul’s point remains unaddressed. People attack p-values as a decision criterion, but no alternative is given that involves decision theory.

Incidentally, my own solution has been to just plot the 95% credible interval, and if zero is outside that interval, I decide to assume that the effect is present given the data. If it includes 0, I decide to assume there is no evidence for an effect; although in those cases I do present the posterior probability that the parameter is greater or less than 0, to leave the reader to make his/her own call. I did this here most recently:

http://www.ling.uni-potsdam.de/~vasishth/pdfs/HusainEtAlETHindi.pdf

Is this a crazy approach? Perhaps the Bayesian experts reading this blog have comments on this. I am avoiding Bayes Factors for now because I don’t really understand the arguments against them (from Andrew et al), but if I were to use them, I would use the Savage-Dickey method discussed in Lee and Wagenmakers’ book. The decision is usually the same regardless of whether I use the credible intervals or BFs.

If someone gave me a cost function everyone could agree with, I could use my posterior distributions to work out my decision using a purely decision-theoretic approach.
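The rule Shravan describes, plus the last step he mentions, can be sketched in a few lines. Everything numeric here is hypothetical: the posterior draws are faked as normal samples standing in for MCMC output, and the loss values are invented purely for illustration.

```python
import random

random.seed(1)

# Hypothetical posterior draws for an effect; in practice these would
# come from an MCMC fit, here they are faked as normal samples.
draws = sorted(random.gauss(0.3, 0.1) for _ in range(10000))

# Decision rule from the comment: is 0 outside the central 95% credible interval?
lo = draws[int(0.025 * len(draws))]
hi = draws[int(0.975 * len(draws))]
effect_supported = not (lo <= 0.0 <= hi)

# Also report P(effect > 0), as the comment suggests, either way.
p_positive = sum(d > 0.0 for d in draws) / len(draws)

# With an agreed loss function the decision becomes mechanical.
# Hypothetical losses: acting when the effect is absent costs 1,
# failing to act when it is present costs 4.
expected_loss_act = 1.0 * (1.0 - p_positive)
expected_loss_wait = 4.0 * p_positive
decision = "act" if expected_loss_act < expected_loss_wait else "wait"
print(lo, hi, effect_supported, decision)
```

The interval check and the expected-loss step use the same posterior draws; the only extra ingredient the decision-theoretic version needs is the agreed-upon loss function, which is exactly the part Shravan says is hard to obtain.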

@Shravan

All good points.

Even in cases where a loss function *can* be defined, few studies actually define it & then use it in formulating their conclusions.

“even advocates of p-values would say that you should not be using p-value when you want to do a decision theoretic inference.”

Unfortunately, I have heard even quite eminent/sensible medical statisticians say exactly that!

Never say never – there are sensible ways/contexts to interpret most things.

http://statmodeling.stat.columbia.edu/2014/04/29/ken-rice-presents-unifying-approach-statistical-inference-hypothesis-testing/

I’m relatively certain people like Stephen Senn, Deb Mayo, Larry Wasserman, all well-known frequentists and sometimes-pvalue-defenders would agree.

Senn and Mayo would probably follow-up saying that they do not feel it is appropriate to use loss functions to evaluate scientific theories.

In only one of those examples was the significance marked as “statistical”. I am sure that all the other examples were written about (modified) statistical significance as well, but it is worth remembering that a result may be very significant even if it is statistically insignificant.

I just returned from a presentation where the researcher said this:

“But it’s almost significant so we see some movement towards X.” Later adding: “But if it’s 0.052, just add a few more informants and it’s significant because significance depends on sample size.”

How do you communicate the problem? And how do you evaluate the research when this is all that is reported?

Dominik Lukes wrote: “How do you communicate the problem?”

Reminds me of this:

‘On two occasions, I have been asked [by members of Parliament], “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?”…I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.’

https://en.wikiquote.org/wiki/Computers

I don’t think it’s even worth trying to communicate the problem when it is due to that level of confusion. The only thing to do is your own job correctly (which most likely does not involve significance levels at all), which will set a good example. Others who are interested will then have the opportunity to follow it, but you can’t make them.

All this has been noted more than 50 years ago and enough has been written on the topic that anyone who puts in the effort to understand the issues can find out:

‘David Bakan said back in 1966 that his claim that “a great deal of mischief has been associated” with the test of significance “is hardly original”, that it is “what ‘everybody knows,'” and that “to say it ‘out loud’ is… to assume the role of the child who pointed out that the emperor was really outfitted in his underwear”… this naked emperor has been shamelessly running around for a long time.’

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

http://ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf

Dominik Lukes wrote: "And how do you evaluate the research when this is all that is reported?"

The research outcomes are being reported by someone who is severely confused… That answer should be obvious. You need to get the raw data and come to your own conclusions, or treat the report as an opinion piece.

With p-values, not even the complete raw data are enough; you need info about how the data were collected too. For example, the sequential process hinted at by the presenter (i.e., if you’re “on the edge of significance”, collect more data and combine with the previous data) has a different sample space than if the sample size were determined at the start. You can’t calculate a p-value without a sampling distribution, and you can’t define a sampling distribution without a sample space.
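The optional-stopping point is easy to demonstrate with a simulation. This is a toy sketch of my own, not from the comment: a normal-approximation test of a true null, with arbitrary choices of initial sample size, batch size, and number of extra looks.

```python
import math
import random
import statistics

random.seed(0)

def p_value(data):
    """Two-sided test of 'true mean = 0', normal approximation."""
    n = len(data)
    z = statistics.mean(data) / (statistics.stdev(data) / math.sqrt(n))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def significant(sequential, n0=20, batch=10, max_looks=10):
    """One experiment under the null. If sequential, keep collecting
    batches and re-testing until p < 0.05 or we run out of looks."""
    data = [random.gauss(0.0, 1.0) for _ in range(n0)]
    p = p_value(data)
    looks = 0
    while sequential and p >= 0.05 and looks < max_looks:
        data += [random.gauss(0.0, 1.0) for _ in range(batch)]
        p = p_value(data)
        looks += 1
    return p < 0.05

sims = 2000
fixed_rate = sum(significant(False) for _ in range(sims)) / sims
seq_rate = sum(significant(True) for _ in range(sims)) / sims
print(fixed_rate, seq_rate)  # the sequential scheme rejects far more often
```

Both schemes compute the same test statistic on the same kind of data; only the stopping rule differs, and that alone changes the sampling distribution enough to inflate the rejection rate well above the nominal 0.05.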

That is very true, and yet another reason that most every p-value ever reported is uninterpretable. How often do the authors report their sampling schemes, stopping rules, etc? In my experience it is based upon money, time to graduation of the student doing the work, and how promising the results are looking so far.

But really it is impossible to interpret any data without an understanding of how it was collected and processed. Well, maybe it could be ok if you are measuring literally the same thing that you care about (e.g., the widths of these widgets are too often too large to fit in the box; go recalibrate the machine).

i.e. any observational study that ever published a p-value should be retracted.

I communicate it by pointing out that you are assuming the effect size will stay fixed as you increase the number of informants. Under the null, that is unlikely. If you were certain the effect size would stay the same as the number of respondents increased, there’d be no need to sample more than one person.

Jonathan,

I don’t think this explanation will advance Dominik’s speaker’s understanding very much.

It has me confused, but perhaps that is just due to terminology.

Are you just directly challenging any possibility that the effect can be a fixed constant across all experimental units? Doesn’t it seem like that would be an implicit assumption built into the speaker’s null model? Why are you suggesting that in the presence of measurement error, which is potentially an assumption of this speaker’s model, that there would be no need to sample more than one person?

JD

What I’m saying (however inartfully) is that additional observations only reduce the p value if you hold the effect size constant. But under the null, the effect size is zero. If the null is true (or Andrew, close to true) then additional observations will tend to be closer to zero than the effects you happen to have seen so far. So Dominik’s speaker’s statement only makes sense if you *knew* the initial effect was the true effect, or at least a reliable estimate of it. But it can’t be a very reliable estimate of it… that’s what the p value is already telling you. That doesn’t mean that more observations couldn’t possibly turn a given result “statistically significant,” but there’s no particular reason to think it would.
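Jonathan’s argument can be checked with a toy simulation of my own (the “almost significant” band, the sample sizes, and the normal-approximation test are all arbitrary choices): condition on an initial null sample landing just above 0.05, add a fixed batch of new null observations, and see how often the combined data cross the threshold.

```python
import math
import random
import statistics

random.seed(2)

def p_value(data):
    """Two-sided test of 'true mean = 0', normal approximation."""
    n = len(data)
    z = statistics.mean(data) / (statistics.stdev(data) / math.sqrt(n))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

wins = 0
tries = 0
while tries < 300:
    data = [random.gauss(0.0, 1.0) for _ in range(30)]  # the null is true
    if 0.05 < p_value(data) < 0.10:  # 'almost significant' starting point
        tries += 1
        data += [random.gauss(0.0, 1.0) for _ in range(10)]  # a few more informants
        wins += p_value(data) < 0.05
rate = wins / tries
print(rate)  # well under a coin flip: the extra nulls usually pull p back up
```

In other words, when the null is true, “just add a few more informants” fails more often than it succeeds, because the new observations regress the estimated effect toward zero rather than preserving it.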

“How do you communicate the problem?”

Short answer: With difficulty.

Longer attempted answer: See pp. 48 – 53 of http://www.ma.utexas.edu/users/mks/CommonMistakes2015/CommonMistakesDay2Part3_2015.pdf.

Still longer attempted answer: See also preceding parts of Day 2 Slides at http://www.ma.utexas.edu/users/mks/CommonMistakes2015/commonmistakeshome2015.html

See https://xkcd.com/1478/

In the alt text: If all else fails, use “Significant at a p>0.05 level” and hope no one notices

My favorite comes from an agronomist: “This is numerically different, but not statistically different.”

Once I took a writing seminar in which we got feedback on our writing samples. Mine was a statistical analysis involving NHST (which my employer had requested). The feedback was:

“Try to avoid negative statements like ‘not significant.’ Can you find a more positive way to phrase this?”

“In the game of significance you either win or you weakly trend towards an almost well-nigh win”

How about the other direction?

In a recent discussion regarding a comparison, suspected to be low-powered, with p > .05:

“Yes, I recognize that lack of significance does not mean ‘no difference,’ but it does provide an indication.”

I have my students write assessments of results sections and so get to see several every year outside of my own interests. On many occasions problematic phrases like those found by Hankins don’t come up but rather are replaced with, “…significant, p < 0.05." To be clear, the actual p-value is over 0.05. They just state that it's under. The need to pass that 0.05 is so strong they just flat out lie. I'm a bit torn about which is better… published examples of confused thinking about what p-values mean or deception about what the p-values actually are.

Further, for values very close to 0.05, I see both so often I'm not sure which is more frequent.