As D.M.C. would say, bad meaning bad not bad meaning good.

Deborah Mayo points to this terrible, terrible definition of statistical significance from the Agency for Healthcare Research and Quality:

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

The definition is wrong, as is the example. I mean, really wrong. So wrong that it’s perversely impressive how many errors they managed to pack into two brief paragraphs:

1. I don’t even know what it means to say “whether the results of a study are likely to be true.” The results are the results, right? You could try to give them some slack and assume they meant, “whether the results of a study represent a true pattern in the general population” or something like that—but, even so, it’s not clear what is meant by “true.”

2. Even if you could somehow get some definition of “likely to be true,” that is not what statistical significance is about. It’s just not.

3. “Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance.” Ummm, this is close, if you replace “an effect” with “a difference at least as large as what was observed” and if you append “conditional on there being a zero underlying effect.” Of course in real life there are very few zero underlying effects (I hope the Agency for Healthcare Research and Quality mostly studies treatments with positive effects!), hence the irrelevance of statistical significance to relevant questions in this field.

4. “The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).” No no no no no. As has been often said, the p-value is a measure of sample size. And, even conditional on sample size, and conditional on measurement error and variation between people, the probability that the results are true (whatever exactly that means) depends strongly on what is being studied, what Tversky and Kahneman called the base rate.

5. As Mayo points out, it’s sloppy to use “likely” to talk about probability.

6. “Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).” Ummmm, yes, I guess that’s correct. Lots of ignorant researchers believe this. I suppose that, without this belief, Psychological Science would have difficulty filling its pages, and Science, Nature, and PPNAS would have no social science papers to publish and they’d have to go back to their traditional plan of publishing papers in the biological and physical sciences.

7. “The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.” Hahahahahaha. Funny. What’s really amusing is that they hyperlink “probability” so we can learn more technical stuff from them. OK, I’ll bite, I’ll follow the link:

Probability

Definition: The likelihood (or chance) that an event will occur. In a clinical research study, it is the number of times a condition or event occurs in a study group divided by the number of people being studied.

Example: For example, a group of adult men who had chest pain when they walked had diagnostic tests to find the cause of the pain. Eighty-five percent were found to have a type of heart disease known as coronary artery disease. The probability of coronary artery disease in men who have chest pain with walking is 85 percent.

Fuuuuuuuuuuuuuuuck. No no no no no. First, of course “likelihood” has a technical use which is not the same as what they say. Second, “the number of times a condition or event occurs in a study group divided by the number of people being studied” is a frequency, not a probability.
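The frequency/probability distinction is easy to demonstrate with a minimal simulation (a Python sketch with made-up numbers, not from the original glossary): the observed frequency in a study group bounces around from one sample to the next, while the underlying probability stays fixed.

```python
import random

random.seed(1)
p_true = 0.85   # underlying probability (made-up)
n = 40          # people in each hypothetical study group

# "Number of times the event occurs divided by the number of people
# studied" is a frequency; it varies from one sample to the next.
freqs = []
for _ in range(5):
    events = sum(random.random() < p_true for _ in range(n))
    freqs.append(events / n)

print(freqs)   # five different frequencies, all estimating the one p_true
```

Each entry is a ratio of whole numbers near 0.85, but none of them need equal the probability itself; conflating the two is exactly the error in the quoted definition.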

It’s refreshing to see these sorts of errors out in the open, though. If someone writing a tutorial makes these huge, huge errors, you can see how everyday researchers make these mistakes too.

For example:

A pair of researchers find that, for a certain group of women they are studying, three times as many are wearing red or pink shirts during days 6-14 of their monthly cycle (which the researchers, in their youthful ignorance, were led to believe were the most fertile days of the month). Therefore, the *probability* (see above definition) of wearing red or pink is three times more likely during these days. And the result is *statistically significant* (see above definition), so the results are probably true. That pretty much covers it.

All snark aside, I’d never really had a sense of the reasoning by which people get to these sorts of ridiculous claims based on such shaky data. But now I see it. It’s the two steps: (a) the observed frequency is the probability, (b) if p less than .05 then the result is probably real. Plus, the intellectual incentive of having your pet theory confirmed, and the professional incentive of getting published in the tabloids. But underlying all this are the wrong definitions of “probability” and “statistical significance.”

Who wrote these definitions in this U.S. government document, I wonder? I went all over the webpage and couldn’t find any list of authors. This relates to a recurring point made by Basbøll and myself: it’s hard to know what to do with a piece of writing if you don’t know where it came from. Basbøll and I wrote about this in the context of plagiarism (a statistical analogy would be the statement that it can be hard to effectively use a statistical method if the person who wrote it up doesn’t understand it himself), but really the point is more general. If this article on statistical significance had an author of record, we could examine the author’s qualifications, possibly contact him or her, see other things written by the same author, etc. Without this, we’re stuck.

Wikipedia articles typically don’t have named authors, but the authors do have online handles and they thus take responsibility for their words. Also Wikipedia requires sources. There are no sources given for these two paragraphs on statistical significance which are so full of errors.

**What, then?**

The question then arises: how *should* statistical significance be defined in one paragraph for the layperson? I think the solution is, if you’re not gonna be rigorous, don’t fake it.

Here’s my try.

Statistical Significance

Definition: A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the null hypothesis (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.

I think that’s better than their definition. Of course, I’m an experienced author of statistics textbooks so I should be able to correctly and concisely define p-values and statistical significance. But . . . the government could’ve asked me to do this for them! I’d’ve done it. It only took me 10 minutes! Would I write the whole glossary for them? Maybe not. But at least they’d have a correct definition of statistical significance.
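To make the definition concrete, here’s a minimal simulation (a Python sketch with made-up numbers, not part of any glossary): the p-value is estimated as the fraction of datasets generated under the null hypothesis whose result is at least as extreme as the one observed.

```python
import random

random.seed(0)
observed_heads = 60          # made-up observed result: 60 heads in 100 flips
n_flips, n_sims = 100, 20_000

# Simulate the null hypothesis (a fair coin) many times and count how
# often the simulated result is at least as extreme as the observed one.
extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - 50) >= abs(observed_heads - 50):   # two-sided
        extreme += 1

p_value = extreme / n_sims
print(p_value)   # close to the exact two-sided binomial value, about 0.057
```

Note that nothing in this calculation says anything about the probability that “the results are true”; it is a statement about what a fair coin would do.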

I guess they can go back now and change it.

Just to be clear, I’m not trying to slag on whoever prepared this document. I’m sure they did the best they could; they just didn’t know any better. It would be as if someone asked me to write a glossary about medicine. The flaw lies with whoever commissioned the glossary and didn’t run it by an expert to check. Or maybe they could’ve just omitted the glossary entirely, as these topics are covered in standard textbooks.

P.S. And whassup with that ugly, ugly logo? It’s the U.S. government. We’re the greatest country on earth. Sure, our health-care system is famously crappy, but can’t we come up with a better logo than this? Christ.

P.P.S. Following Paul Alper’s suggestion, I made my definition more general by removing the phrase, “that the true underlying effect is zero.”

P.P.P.S. The bigger picture, though, is that I don’t think people should be making decisions based on statistical significance in any case. In my ideal world, we’d be defining statistical significance just as a legacy project, so that students can understand outdated reports that might be of historical interest. If you’re gonna define statistical significance, you should do it right, but really I think all this stuff is generally misguided.

I am glad Andrew exists.

+1

Missing text after “The question then arises: how should”

The book, “How Not to Be Wrong” has a pretty good discussion of statistical significance that can be understood by general audiences.

“All snark aside, I’d never really had a sense of the reasoning by which people get to these sorts of ridiculous claims based on such shaky data”

Easy.

Step 1: Collect data, create a histogram. Then pick a distribution that looks similar. Then run a test to “show” the data is “drawn” from this distribution.

Step 2: Use this “test” to claim the histogram of future data will look roughly like the data you already have.

Step 3: Make an assertion which will be true 95% of the time if future data really does look like the data you currently have.

Step 4: When such assertions turn out to be true 10% of the time, indignantly claim steps 1-3 haven’t been taught correctly and refuse to admit your understanding of probabilities had anything to do with the failure.

Your reaction to their “definition” of probability made my morning.

+1

My favorite part was: “Fuuuuuuuuuuuuuuuck. No no no no no.”

Gonna put that on Andrew’s wikiquote page for sure.

Going public, unfortunately, is like sticking your head above the parapet. Andrew’s definition of p-value ignores any situation where “zero” is not the focus. Moreover, what about one-sided nulls as in <= “zero”? In a sense there should be some sympathy for the attempt by the authors at the Agency for Healthcare Research and Quality to define the p-value; it is hard to be concise and still cover all the nuances. Simultaneously pleasing the experts and informing the lay population is a tough job.

Paul:

Good point. I’ll remove the phrase “that the true underlying effect is zero.”

“If the null hypothesis is true” is a reasonable phrasing that doesn’t specify what the null is. I agree it is very hard to explain in an understandable way as is illustrated by the many instances of people who should know better who do it wrong even writing for more numerate audiences.

I don’t think it particularly strengthens your argument to accuse women with Ph.D.s in an area they are writing on, who are using the standard epidemiological definitions, of “youthful ignorance” on something not even related to your post. It just seems unnecessary when the point is about concerns about p-hacking, not about your or their interpretation of the Wilcox data on fertility windows https://www.musc.edu/mbes-ljg/Courses/Biology%20of%20Reproduction/Paper%20pdfs/Wilcox%20fertile%20window.pdf. It’s just a distraction.

I agree the “youthful ignorance” thing seemed degrading; it’s better to just come out and say that they got it wrong.

Elin:

The authors of the papers on fertility included both men and women.

And, no, they weren’t using the standard values for peak fertility. Including day 6 in the peak fertility range indicates an unfamiliarity with the standard recommendations in this area.

And I do think their youth is an issue. When my wife and I were trying to have our last child, we were in our 40s and we learned the days of the month that are the best for having babies. A young person might happen to know this information, but he or she might also not ever happen to have encountered that information. As you get older, this is the sort of biological fact you’re more likely to come across.

Yes they were, if you read the literature in the field: day 6 has over a 10% risk of conception and that’s a reasonable cutoff. You can say you want a higher cutoff and that’s fine, but their choice was absolutely standard and based on the epidemiology.

I can’t wait to see what reaction Andrew has to this replication (?) of the red-shirt / fertility study: http://pss.sagepub.com/content/early/2015/07/02/0956797615586403.abstract.

“…. is a frequency, not a probability.”

There are two definitions of probability which are so easily confused most do, but are very different. One is the old Bernoulli/Laplace/Bayesian one:

“The probability of x is the ratio of the number of cases favorable to it to the number of cases possible”

and the frequentist one:

“The probability of x is the ratio of the number of times x occurs to the total number of repetitions”

Despite a superficial similarity, the first one is a quantification of the uncertainty remaining when our knowledge/assumptions aren’t strong enough to remove all cases unfavorable to x (i.e. to deduce x) and applies even if x only occurs once or is a hypothesis. The second is a rather strong claim about the dynamical evolution of the universe and only applies if repetitions can be performed.

My understanding of probability does not agree with either of the definitions offered above.

A frequentist professor of mine, one with a bent towards physics, used a definition of probability along the lines of

The probability of x is the ratio of the number of cases favorable to it to the total number of cases in an infinite sequence of trials.

That definition will yield the right answer if you have a real random number generator uniform on [0,1] and a process that compares the random number to 1/Pi and outputs a 1 if the number is greater than 1/Pi and a zero otherwise.

Any finite sequence of trials will yield the wrong number.

Bob
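Bob’s point can be checked numerically (a Python sketch): the event “a uniform draw exceeds 1/pi” has probability 1 - 1/pi, an irrational number, so the frequency from any finite run, being necessarily a ratio of two integers, can only ever approximate it.

```python
import math
import random

random.seed(2)
p_true = 1 - 1 / math.pi    # the target probability: an irrational number

for n in [100, 10_000, 100_000]:
    hits = sum(random.random() > 1 / math.pi for _ in range(n))
    freq = hits / n          # always rational, so never exactly p_true
    print(n, freq, abs(freq - p_true))
```

The gap shrinks as n grows, but no finite run ever closes it exactly.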

It’s also worth noting the replication projects in psychology and biology: a lot of studies did not replicate. There are complicated questions as to procedures and power. But, given the failures to replicate, the “true” talk is so so so off-base.

When I was doing college statistics I remember an exam question where you had to pick the right definition of statistical significance. One of the wrong examples was worded nearly identically to the HHS wording.

They do have a contact link

http://effectivehealthcare.ahrq.gov/contact-the-effective-health-care-program/

It’s my impression that people think of our very own “Anonymous” caricatures as over-the-top, until they read this kind of thing.

I can’t get worked up over the quoted text. There isn’t a shred of evidence people indoctrinated in the nuances of the definitions of p-values, hypothesis testing, confidence intervals, and “significance” produce statistical analysis any better than those who can’t parrot back official verbiage.

The whole thing has the character of physicists arguing over the definition of “impetus” they read in an 11th-century scholastic work written in Latin and acting like it was important for getting the Mars rover to its destination.

Anon:

Yes, see my P.P.P.S.

That doesn’t get at the fact that the whole piece is castigating a website for daring to have a lay-friendly definition that doesn’t get so lost in the weeds of p-value nuance so loved by statisticians that it fails to actually give the reader a usable definition.

Alex:

Yes, I will castigate a U.S. government agency for daring to give a misleading definition. It’s not about “nuance,” it’s about what the purported definition is conveying.

In what sense is it “usable” to say that the probability equals the observed proportion? In what sense is it “usable” to tell people that statistical significance answers the question of “whether the results of a study are likely to be true”? How will these definitions serve the goal of helping people understand research studies? I don’t see it. Rather, I see these definitions as teaching and reinforcing confusion, the sort of confusion that’s led to the research stylings of Daryl Bem being published in a top psychology journal, the sort of confusion that led to that himmicanes and hurricanes story, and so forth.

Simplified and reader-friendly is fine. False, not so much.

This: “Simplified and reader-friendly is fine. False, not so much.”

Still, you’re often giving examples of “Stats 101” analysis which are intentionally wrongheaded, and then someone here is inevitably coming back with “but you’ve obviously made mistake X which frequentist statisticians would never do and you’re characterizing researchers as basically ignorant and stupid” to which your reply is usually “but if you go out into the world, this is the kind of analysis you find in almost every textbook/guidelines/glossary etc” and then… Andrew went out and found one and was SHOCKED, SHOCKED I TELL YOU… to see such a poor concept of stats in some kind of official govt document…

so, I’m pointing to this and saying “see, Anonymous isn’t so over the top after all” because this kind of stuff *IS* out there everywhere

I think a lot of this has to do with poor choice of words, you can see this elsewhere.

“The p-value is not really a probability, and the toy examples we use in introductory texts aren’t really how you should use it.”

Compare to the greenhouse effect: “The atmosphere is not really like a greenhouse, and the toy S-B law examples we use to first demonstrate it aren’t really calculating its magnitude.”

For example:

“I will say that I do not particularly like this model as a suitable introduction to the greenhouse effect. It is useful in many regards, but it fails to capture the physics of the greenhouse effect on account of making a good algebra lesson, and opens itself up to criticism on a number of grounds; that said, if you are going to criticize it, you need to do it right, but also be able to distinguish the difference between understood physics and simple educational tools.”

https://www.skepticalscience.com/postma-disproved-the-greenhouse-effect.htm

Please do not respond with arguments about the greenhouse effect existing/etc. I just mean to convey that misguided pedagogy can lead to endless confusion. It is very important to not simplify things too much.

Good analogy.

I also get a little bored over the pedantic nitpicking that statisticians do over the definition of the p-value.

Miss:

Pedantic is in the eye of the beholder. But I think these misconceptions have real consequences. There really appear to be researchers who think that statistical significance is “a mathematical technique to measure whether the results of a study are likely to be true.” And, sure, Psychological Science is a bit of a punch line, but it’s also a leading journal in psychology. And psychology is important.

So, although this bores you, I think it’s important.

And it’s not just psychology research. Remember that paper awhile ago on air pollution in China? Or that paper on early childhood intervention? Real decisions are on the line here, so I think it’s a bad idea to spread wrong ideas about commonly used statistical concepts.

Wrong conclusions have consequences, but getting these definitions just right doesn’t seem to have any effect.

Doing analysis using classical statistics/frequentism is like trying to heal someone using the four humors theory of disease. No one would dare apply the theory literally and consistently or it would be obvious it’s junk. In reality, they use some combination of trial and error, fudge factors, rules of thumb, intuition, guesswork, and rank speculation, sprinkled over with a thin layer of four-humors (frequentist) vocabulary to make it seem respectable.

The definitions just don’t play a big role.

Here’s a good example Andrew:

http://errorstatistics.com/2015/05/04/spurious-correlations-death-by-getting-tangled-in-bedsheets-and-the-consumption-of-cheese-aris-spanos/

It’s by an econometrician named Aris Spanos who bills himself on his CV as one of the 20 best or something. Two time series appear to move in unison. One is “death by getting tangled in bedsheets” and the other is “consumption of cheese”. After calculating the sample correlation coefficient which is close to 1, Spanos declares “the key issue for this inference result is whether it is reliable, or statistically spurious”.

He then spends a few slides massaging the data in whatever way he feels like doing until he gets a new correlation coefficient which isn’t “statistically significant” and declares the correlation is proven “spurious”.

This is as close to a public admission as I’ve ever seen that classical statistics is the direct modern equivalent of reading chicken entrails to predict the future. It was obvious the correlation is spurious without any analysis and he simply worked toward that goal. You could repeat this exact same analysis on a hundred other time series, half of which were spurious and half weren’t, and you’d get a basically random combination of outcomes.

If the numbers had been exactly the same, yet the two series were connected, Spanos would have simply found another analysis which produced a “statistically significant” correlation.

If legerdemain like this actually worked, science would be much easier. Just collect millions of time series, pair each of them against each other, run them through a computer program without even knowing what the numbers mean, and whenever you get a correlation coefficient which is “statistically significant” you’ve made a major scientific breakthrough.

The definitions make no difference. It makes absolutely no difference whether Spanos can give the “right” definitions for “statistically significant” or not.

I’m surprised to see you dismiss with such ease the clear causal connection between consumption of cheese, which clearly to an econometrician would include Cheeze-Whiz, and death by Tangled Bedsheets …

Of course there’s no connection between the calculations and reality. Spanos just contorted the analysis to agree with his prior that it was spurious. Frequentists do this all the (every?) time, but it’s nice to see a clear example of it, so it’s worth savoring the irony.

There is a real possibility a sizable chunk of that “spurious” correlation (rho=.94) is due to obesity or something and isn’t quite so spurious. If that’s the case then the frequentist failure here is 10 times as embarrassing.

But what’s the purpose of identifying a correlation as spurious? The usual reason is to warn that the correlation may not continue. If that’s the takeaway anyone got from Spanos’s analysis then they’re dead wrong. A correct analysis would likely show both are related to population growth or something, so the observed correlation would be expected to continue.

I can see that I should have employed a smiley face with my lewd innuendo.

@anon

Spanos does kind of say that in the comments:

“A trend provides a generic way one can use to account for heterogeneity in the data in order to establish statistical adequacy…Once that is secured, one can then proceed to consider the substantive question of interest, such as a common cause. The latter can be shown by pinpointing to a substantively meaningful variable z(t) that can replace the trend in the associated regression and also satisfies the correlation connections with y(t) and x(t) required for a common cause”

but I can agree that the original post left it pretty unclear what was achieved by the manipulations and seemed to contain a number of fairly arbitrary assumptions.

Is this the same Aris Spanos who often writes articles with Deborah Mayo?

Yes, the link is actually to Mayo’s blog where she posted Spanos’ article.

Ok, since we’re going all “straight man” on this anyway, here’s what I think of that analysis:

1) Per capita consumption of cheese has dimensions of {mass}/{person}{time}

2) Rate of death by tangled bedsheets has dimensions of {person}/{time}

In the range 2000-2009 the population increases by a few percent, but this is due to babies being born, and so we can assume the adult population increases less, let’s let it be near constant.

In the range 2000-2009 presumably based on all the hype, obesity is increasing, meaning that per capita consumption of calories has increased. Per capita consumption of calories, considering an average caloric content of the general diet, can be converted to per capita consumption of food mass, which also has dimensions of {mass}/{person}{time}. It is therefore reasonable to believe that 1) is a proxy for obesity.

Now, the adult population was assumed to be about constant, certainly not doubling over the time period, as the death in bedsheets data does. So if we convert the death in bedsheets data to per capita, it will be like dividing by a constant, and the new per-capita death in bedsheets data will have dimensions of 1/{time} and have the same shape, but I’m going to create a new dimension called “risk” which is really a dimensionless ratio of {person death}/{person alive}, so the second data series is now {risk}/{time}

Now, clearly based on the high correlation coefficient, if we divide the death data by the cheese consumption data we get {risk}{person}/{mass}, which when you consider the definition of {risk} works out to {death}/{mass}, and considering the trend, the result will probably be nearly constant.

Considering a Taylor series about the 2009 death vs. person’s-mass set-point, and considering that obesity is linked to death, is there any question that the first-order term in the Taylor series has a positive coefficient? Probably not. So death_t ~ death_2009 + C_mass * (mass_t-mass_2009).

you’ll notice that C_mass has dimensions of {death}/{mass} which is exactly the dimensions of the ratio I constructed above, which is more-or-less a constant.

We now hypothesize the following causal structure CALORIES -> BODY MASS -> RISK OF DEATH BY TANGLED BEDSHEETS

So I actually did the math, here’s the R code:

cheese <- c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8);

death <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717);

uspop <- 300e6+20e6*(seq(0,9)); ## approximated from graph on google

plot(death*1e8/uspop/cheese,ylim=c(0,10));

Sure enough, you have more or less a constant around 4.

Sorry, I pasted the wrong uspop code, should be:

uspop <- 275e6 + (310 - 275) * 1e6 * seq(0, 9) / 9;

Daniel: “In the range 2000-2009 presumably based on all the hype, obesity is increasing, meaning that per capita consumption of calories has increased.”

Wrong. You can get fat and be close to starvation. E.g. you do not stop a cancer tumor growing by fasting.

Calories in and out is only a part, if at all, of why people are getting fat. You can get fat (or lose weight) holding calorie intake constant.

Fernando: you CAN do those things, but on average in the population, people who are more massive also consume more calories. For the overall statistics, the special cases are not so important.

Fernando wrote: “E.g. you do not stop a cancer tumor growing by fasting.”

Look up “caloric restriction cancer”…

http://www.tylervigen.com/spurious-correlations I spent some time in my class this semester having the students look at these and I think it was pretty memorable. And then we looked at the chocolate/Nobel prize graph and they were totally on it. And then we looked at gap minder and it was an interesting discussion because they were really primed (uh oh) to be skeptical.

But in this case, see above, I think it’s reasonable that cheese and death by bedsheets really could be causally related through overall calorie consumption and its risk in general of death, especially cardiovascular-related disease (which might result in death during sex, or death while waking; the most common time for a heart attack is the morning), and the data, when transformed in a reasonable manner, produce a constant which can be interpreted as a coefficient in a risk model which is independent of time.

In other words. More Models, less Frequentist statistical claptrap ;-)

There was a person who regularly published stuff that drew wacky connections. My favorite was the finding – which got reported fairly widely – that consuming hot dogs was associated with children’s leukemia … except the effect was only significant between something like 16-20 hot dogs a week, not below and not above. Pretty much a clear message that statistical significance is fairly meaningless on its own.

Well, it really isn’t big news that the technical details of statistical significance are not widely understood. I don’t know exactly what should be done about it. I suppose I am glad that people are calling attention to it.

But at least their definition has a sort of internal coherence. Yes it is totally weird to talk about small p-values as indication that ‘the results are true.’ I guess the alternative using this weird language would be that ‘the results are due to chance.’ The conceptualization is a little off, but as Anonymous notes it is not off so much that it has much effect on their substantive conclusions.

The insistence that authors in the popular press adhere to the same standards of strict rigor as used in statistical literature can become a little silly and that is what I often find overly pedantic. Also when people say stuff like it is sloppy to use ‘likely’ when talking about probability — that’s a little hard to take seriously. But this isn’t popular press, this is research institutions and research journals, so, I agree it is important.

As an applied statistician, I can tell you that misunderstanding of p-values leads to LOTS of lost tax payer money. Well worth nitpicking over.

I work with biologists, who spend lots of money on experiments. I have people saying things like “Well, X’s results are non-reproducible because when they ran the experiment, they got p = 0.03, but when I tried to run the experiment, I got p = 0.07” or “we know that treatment 1 is significantly different than placebo and we know that treatment 2 is not significantly different than placebo, so we know that treatment 1 is better than treatment 2” (despite the estimated effects being nearly identical). These are really expensive mistakes (made by very high level researchers) that come from not understanding the definition (or, moreover, the basic concept) of a p-value.

Perhaps the problem is in making the definition precise, we have lost the ability to convey the basic concept to those who need to know?
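The second mistake above is the classic “the difference between significant and not significant is not itself statistically significant” error. A quick check with made-up numbers (a Python sketch) shows how two nearly identical estimates can land on opposite sides of 0.05 while their direct comparison shows essentially nothing:

```python
import math
from statistics import NormalDist

def two_sided_p(estimate, se):
    """Two-sided p-value for a normal-theory z-test."""
    z = abs(estimate) / se
    return 2 * (1 - NormalDist().cdf(z))

# Two treatments with nearly identical estimated effects (made-up numbers)
# but slightly different standard errors:
p1 = two_sided_p(0.50, 0.24)   # treatment 1 vs placebo: below 0.05
p2 = two_sided_p(0.48, 0.26)   # treatment 2 vs placebo: above 0.05
print(round(p1, 3), round(p2, 3))

# The direct comparison of the two treatments is nowhere near significant:
se_diff = math.sqrt(0.24**2 + 0.26**2)
print(round(two_sided_p(0.50 - 0.48, se_diff), 2))
```

One estimate “works” and the other “fails” under the 0.05 convention, yet the head-to-head comparison of the two treatments is almost perfectly consistent with no difference at all.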

I think there are two separate but related issues here:

(1) The correct definition of the p-value

(2) The role that p-values should play in drawing conclusions from data.

Understanding (1) might help in establishing a reasonable view about (2), but it doesn’t seem necessary. A person could misunderstand (1) but it may be inconsequential, practically speaking, because they still have some sensible views about (2). Alternatively, a person might be solid on (1), but have badly considered views on (2).

It sounds like your collaborators are confused about (2), but it’s not clear that this is because their views about (1) are mistaken. P-values are only one form of information, and they should be considered in the context of effect size, sample size, background information, and so on. The abuse of p-values in (2) I think often comes as much (or perhaps more) from discomfort in dealing with uncertainty, ambiguity, and the difficulty of constructing a formulaic method of incorporating all available information to arrive at conclusions, as it comes from a misunderstanding of (1).

I would agree with what you say.

But a further note on the matter: the issues you mentioned in (2) are very widespread and, in my opinion and experience, in many ways slow down the pace of science. I think these issues stem from non-statisticians trying to come up with simple rules on how to use p-values based on an overly summarized idea of what a p-value is. This occasionally leads to totally counterintuitive, counterproductive conclusions.

It’s hard to think of a way in which such researchers could use p-values productively without thoroughly understanding what they are. I will certainly admit that giving them a very precise definition is not the same as giving them a thorough understanding.

“I think these issues stem from non-statisticians trying to come up with simple rules on how to use p-values based on an overly summarized idea of what a p-value is.”

This is an instance of what I call The Game of Telephone (TGOT)* phenomenon: Some well-meaning person comes up with a simple way of explaining something, but that isn’t quite correct. This becomes adopted by others, who make further oversimplifications, and so on, until a lot of people are using a version of the concept that is far from the original.

*The name refers to the kids’ game where they sit in a circle, one person whispers something into a neighbor’s ear, the neighbor whispers what they hear into the next person’s ear, and so on, until the “message” has gone all the way around the circle, and usually comes out hilariously different from the original.

An explanation of the p-value that will satisfy both the experts and the uninitiated is akin to a statement of the Heisenberg Uncertainty Principle that passes muster with quantum physicists and with us plain folk. From https://simple.wikipedia.org/wiki/Heisenberg%27s_uncertainty_principle:

“Historically, the uncertainty principle has been confused with a somewhat similar effect in physics, called the observer effect. This says that measurements of some systems cannot be made without affecting the systems. Heisenberg offered such an observer effect at the quantum level as a physical ‘explanation’ of quantum uncertainty…Measurement does not mean just a process in which a physicist-observer takes part, but rather any interaction between classical and quantum objects regardless of any observer.”

Of course, to make matters more opaque, Heisenberg was not writing in English. Just as probability and likelihood are sort of similar in plain English but technically very different to statisticians, “uncertainty” in English is only somewhat the same as “indeterminacy.” From https://en.wikipedia.org/wiki/Uncertainty_principle:

“Throughout the main body of his original 1927 paper, written in German, Heisenberg used the word, ‘Ungenauigkeit’ (‘indeterminacy’), to describe the basic theoretical principle. Only in the endnote did he switch to the word, ‘Unsicherheit’ (‘uncertainty’). When the English-language version of Heisenberg’s textbook, The Physical Principles of the Quantum Theory, was published in 1930, however, the translation “uncertainty” was used, and it became the more commonly used term in the English language thereafter.”

“Second, “the number of times a condition or event occurs in a study group divided by the number of people being studied” is a frequency, not a probability.”

I think you mean proportion, not frequency.

And I think what Anonymous is saying up above is that a finite “proportion” has been posited as a definition of probability by some of the classical folks. It satisfies Kolmogorov’s Three Axioms, no?

Anyhow, I do agree that this is not at all a modern definition of probability that would be applicable to the practice of data analysis.

But it seems forgivable to botch the definition of probability when most of us are unsure of what it means.

JD

I think Andrew’s point is that the observed frequency (or proportion) is not the probability, in the same way as the sample mean is not the population mean. My two cents.

Andrea:

Yes, indeed. That’s the point: people get focused on the statistical significance of their data and forget that the goal is to learn about the population.

But my point is simply that this type of proportion (what we’d call a sample proportion in statistics) has been used as a definition of probability, albeit this is perhaps not the concept of probability we’d reference in this context.

And to reiterate, it is perhaps forgivable for them to link to such a definition since there are *many* definitions of probability and it seems that when statisticians are taught the “definition” of probability we just give them the Three Axioms. Note that the Three Axioms cannot serve alone as a definition of probability, because there are concepts that satisfy the axioms that to many of us are clearly not probabilities.

It seems that probability and statistics texts tend to just let students retain whatever their “intuitive” understanding of probability is, perhaps only briefly making the point that there are many “concepts” of probability.

So the question is, how would we prefer them to have defined “probability” here? Will there not be a problem with just about any definition they provide? Whether it is one of the so-called frequentist definitions (there isn’t just one), a propensity definition, a “subjective” definition? So, ok, I accept that they have linked to a definition that is one of the less relevant ones for this context. But I don’t think “defining” probability is as straightforward as it might be presented to us when we’re taking our probability and statistics courses.

I include myself in the “unsure of what it means” category.

Jd:

In that case, I’d rather have them not give a definition at all.

I think trying to “define” probability is as problematical as trying to “define” time. See http://www.ma.utexas.edu/users/mks/statmistakes/probability.html for an attempt to address the question “What is probability?” rather than giving a definition.

I may not know what probability is — is there even a single right answer to the question “what is the probability a National League team will win the 2030 World Series?” — but I know some things it isn’t, and one thing it isn’t is an observed frequency. If I flip a coin 8 times and get 6 heads, it’s wrong to say “the probability of getting heads with this coin is 75%.”

Are you saying that if 100 events per year occur in a population of 1000 for 5 years straight, then the probability of randomly selecting a member of the population who has experienced the event is not .10?

James:

Why would you think Phil is saying that?? He explicitly said he was talking about flipping a coin 8 times and getting 6 heads. Nowhere did he talk about 100 events or a population of 1000. If you have large n and random sampling, all definitions of probability converge to the same answer.

Perhaps I misunderstood. But this is how I read his comment:

1. He made a general statement, ‘one thing it isn’t is an observed frequency’.

2. He gave a single example to support this claim, ‘If I flip a coin 8 times and get 6 heads, it’s wrong to say “the probability of getting heads with this coin is 75%’.

It seemed to me that the broad statement was not accurate and I gave an example to support that.

James:

A probability is not the same thing as an observed proportion. When sample size is large and bias is low, an observed proportion is a good approximation to a probability—that’s the law of large numbers. But probability and observed proportion are different. Phil gave an example to illustrate that probability and observed proportion are different concepts. They coincide in a certain limiting case but not in general.
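Phil’s coin example and the law-of-large-numbers point can be checked with a quick simulation; this is a sketch with a fair coin, so the true probability is 0.5 by construction.

```python
import random

random.seed(0)
true_p = 0.5  # a fair coin

def observed_proportion(n):
    """Proportion of heads in n simulated flips."""
    return sum(random.random() < true_p for _ in range(n)) / n

small = observed_proportion(8)        # can easily come out 0.75 or worse
large = observed_proportion(100_000)  # law of large numbers: close to 0.5
print(small, large)
```

The small-sample proportion routinely lands far from 0.5, which is exactly why reading an observed frequency as “the probability” is wrong in general, even though the two converge as the sample grows.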

Andrew:

Let me see if I can articulate the distinction you are making:

1. The probability of the event occurring in the future is not equal to the proportion of events in the past across some time frame.

2. The probability of selecting without replacement a member of the population who experienced the event is equal to the proportion, but is conceptually different.

Is this close to what you mean? Or am I still missing it?

In pure mathematics / probability theory we need some kind of a precise definition. And since mathematics doesn’t deal with making inferences from data about physical processes etc, the definition needs to be in terms of something mathematics deals with. The main thing left for probabilists to work with is large or even countably infinite sequences of numbers. If, like me, you are a fan of nonstandard analysis, you could do something like:

“if (x_i) is a sequence of numbers, each of which is either 0 or 1, and whose length is a finite nonstandard integer N, then the probability of a given arbitrarily chosen number being a 1 is st(sum(x_i)/N)”

But this doesn’t help us, because it’s a mathematical definition about pure mathematical objects. It seems to me that the flaw in Frequentist statistics is assuming that the same definition needs to be brought over directly into the realm of using probability to make inferences about scientific questions.

Ah! This was worrying me. So can we call the observed frequency (or proportion) an estimate of the probability??

JD, no I wasn’t taking issue with finite proportions, but since you brought it up it’s worth mentioning the Infinite Set Irrelevance Supposition (ISIS, catchy name I know) which says

“any issue in probability theory that only occurs for infinite sets is irrelevant to both the foundations and practice of statistics”

Hi Anonymous,

“finite” wasn’t really the operative word in my mistaken rephrasing of your post. I was just, perhaps mistakenly, assuming you meant what I say up above: that what we know of in statistics as the “sample proportion” has been used as a definition of probability before, with the caveat that I acknowledge it is not the most relevant concept of probability for the context here.

JD

I also have suggested improvements for the key terms on my current blog. Gelman’s is nearly the same as mine. Feel free to add to it, maybe we can send it to them.

http://errorstatistics.com/2015/07/17/statistical-significance-according-to-the-u-s-dept-of-health-and-human-services-i/

Anonymous’ claimed definition for frequentist probability is wrong. It’s actually scarcely different from the other one he gives, but I have no desire to have an exchange with him.

“under the null hypothesis (which is commonly the hypothesis that there is no effect)”

I think the use of ‘effect’ here leads to confusion for laymen who interpret it causally, which is understandable since that is the colloquial meaning of the term and there’s nothing to signal that it has a different technical meaning in this context.

+1

The US government appears to spend approximately zero on graphic design and UX/UI. I guess you are American, so you will never have had the misery of trying to get https://esta.cbp.dhs.gov/esta/ clearance. Hideous design and worse UI.

There is a slightly different version of this. It is somehow conveyed that a p-value is not directly related to the truth of the null/alternative hypotheses (perhaps because p means probability which means you can never know for sure), so the mind substitutes “real” for “true”:

“Definition: A mathematical technique to measure whether the results of a study are likely to be *real*. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are *real*). Researchers generally believe the results are probably *real* if the statistical significance is a P-value less than 0.05 (p<.05)."

The confusion really is quite a fascinating subject when in a position to view it from afar.

1. Andrew wrote: “I don’t even know what it means to say “whether the results of a study are likely to be true.” The results are the results, right?” Mike LaCour and John Lott send their regards;-) Actually, Lott doesn’t. He’s off sulking in a corner. But Mary Rosh sends her best.

2. Are likelihood ratios a thing? Sure, I care about p(H0|data) but I care at least as much about L=p(H1|data)/p(H0|data). Am I really supposed to disregard a favorable L for a p-value of 0.06? That makes no sense to me. (For example, in the context of a Hotelling’s T2 test, computing a p-value seems like a non sequitur. Am I missing something?)

3. Setting aside the arbitrary cutoff of 0.05, using a p-value to reject seems akin to reducing your decision hypotheses to H0 and ~H0. Why would anyone do that if they had an H1 hypothesis?

> Sure, I care about p(H0|data) but…

Arghh! Typo! That should be “Sure, I care about p(data|H0) …”

There is basically just one reason to use p values, and that is to determine whether a random number generator of a certain type would be likely or unlikely to produce a certain dataset. If you have some process which might reasonably be considered to be “like a random number generator” (such as a roulette wheel, a repeated polling procedure with low non-response rate, an algorithm for picking genes to study out of the genome, a “well oiled” manufacturing process which very regularly produces similar output, or a continuous-recording seismometer) then you can use p values to see if something that comes out of that process is “weird” compared to some calibrated historical performance.

Pretty much any other use of p values is in my opinion wrongheaded. Sometimes, we can discover something interesting by using p values, but it will always be of the form “this previously well calibrated random number generator is no longer doing what it used to”, which is hugely informative to manufacturing engineers (stop the production line and look for a broken machine!), seismologists (There’s a tiny earthquake!), casino owners (Take that slot machine offline and get it repaired!) or the like.

Selecting small samples from processes not proven to be stationary and trying to argue that there’s a difference of interest between them due to a p value is a large part of what’s wrong with the “mythology” of Stats 101 as taught to undergrads everywhere.
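The “calibrated process” use of p-values described above can be sketched as follows. The process mean, spread, and batch sizes are invented for illustration, and a simple two-sided z-test against the historical calibration stands in for whatever monitoring rule a real production line would use.

```python
import math
import random

random.seed(42)

# Historical calibration of a "well oiled" process (hypothetical numbers).
mu, sigma, batch_size = 10.0, 0.5, 20

def batch_p_value(batch):
    """Two-sided p-value: how surprising is this batch mean if nothing changed?"""
    se = sigma / math.sqrt(len(batch))
    z = (sum(batch) / len(batch) - mu) / se
    return math.erfc(abs(z) / math.sqrt(2))

in_control = [random.gauss(mu, sigma) for _ in range(batch_size)]
drifted = [random.gauss(mu + 0.6, sigma) for _ in range(batch_size)]  # e.g. a worn part

print(batch_p_value(in_control))  # usually unremarkable if nothing has changed
print(batch_p_value(drifted))     # small: stop the line and look for the cause
```

Note the conclusion licensed here is only “this previously well-calibrated process is no longer doing what it used to,” not any scientific claim about why.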

If p-values are so useless, the change, if any, needs to start from within the profession. What fraction of statisticians are still using p-values?

The one thing that puzzles me is that if indeed p-values are so bad & useless why are not the bulk of statisticians themselves convinced to dump them? It is one thing for an untrained applied researcher to thoughtlessly abuse p-values, but shouldn’t trained statisticians be more able to see all these flaws?

Fisher provides a plausible explanation. I would quote but do not know an OCRed source for this paper:

Fisher, R. A. (1958). “The Nature of Probability”. Centennial Review 2: 261–274. http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf

What fraction of research publications are actually analyzed by statisticians?

Communicating statistical concepts in an accurate way can be a daunting task. Kudos to Andrew for providing concise, meaningful criticisms of the definition contained in the glossary written by the Agency for Healthcare Research and Quality.

I would also like to comment on Andrew’s Definition of Statistical Significance. I am accustomed to thinking that a small p-value could arise due to several conditions:

1. The ‘true underlying effect (in the population)’ is not consistent with the null hypothesis.

2. The sample is not representative of the ‘true underlying effect (in the population)’ (Type I Error).

3. The assumptions of the method that produced the p-value are not met.

I’m not convinced that Andrew’s ‘try’ gives sufficient due to all of the potential causes of a small p-value. In particular, I feel like we should always check into the possibility that assumptions are not met.

Furthermore, ‘strength’ seems uncomfortably close to ‘power’ when I read Andrew’s Definition; could this also be considered ‘sloppy’?

In any event, I have come away from reading this post with a richer understanding of what p-values are not, so Thanks, Andrew!

The original post on the government website wasn’t intended for pedantic statistics textbook writers; it was intended for confused people who have no idea of, or desire to know, what null hypotheses, underlying effects, etc. are. Yes, there are issues with the definitions as read by somebody who already knows the material. But for the ignorant layman for whom the definition was intended? Those definitions allow that layperson to meaningfully grasp the concepts found on other parts of the website. That is something that even the definition of p-value put forth by the author in this smug takedown piece fails to accomplish. If you ignore the writing and reading standards for lay-accessible websites and presume that the readers all hold bachelor’s degrees, then sure, perfect definition!

But as someone who has to explain these concepts to students who have not taken a statistics course and never will, a discussion of underlying effects, sample sizes, null hypotheses, and what have you is not only an irrelevant and self-indulgent exercise in pedantry, it’s counterproductive, as I am adding more confusion rather than reducing it. If your goal is to inform the public to the level of a PhD student that’s admirable, but not really feasible. And I think the linked definitions strike a good balance between the two.

Alex:

I disagree with your claim that these definitions “allow that layperson to meaningfully grasp the concepts.” If they read about a p-value on another page of the website and think that it is “measuring whether the results of a study are likely to be true,” I think this sends them in the wrong direction, to an unfortunate state of credulity, perhaps followed by a nihilistic skepticism once they realize that they’ve been misled.

I fully support your proposal to make definitions that are comprehensible to the layperson. It’s just that I think that the definition on that website is (a) not actually comprehensible to the layperson, (b) misleading, and (c) false.

I am wondering if animation might help folks see the underlying process at work better than a technically correct definition.

Here is what I am thinking of doing:

Simulate a histogram of p_values from Norm(0,1) samples of say 30 – the no underlying effect world.

Simulate another histogram of p_values from Norm(3,1) samples of say 30 – the _important_ underlying effect world.

Animate showing one histogram above the other with p_values of different colours raining down from them.

This shows them the Uniform(0,1) distribution from no underlying effect world and non-uniform one from _important_ underlying effect world and experience the raining down of p_values from these.

Now most of the screen gets blacked out and one grayed p-value falls down – in Ken Rice’s conceptualization* the question appears: “Is it worth calling attention to the possibility of this p_value coming from an _important_ underlying effect world – i.e. should others do more studies to learn if we regularly get small p_values doing this study over and over again?”

If they were interested and watched closely a few times they might get the sense of the process.

Also, further vary the _important_ underlying effect size Normal(E,1) and other assumptions.

* http://statmodeling.stat.columbia.edu/2014/04/29/ken-rice-presents-unifying-approach-statistical-inference-hypothesis-testing/
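The two worlds in this proposed animation can be simulated directly. The sketch below uses a two-sided z-test with known standard deviation, and a smaller underlying effect (0.5 rather than 3) so that both significant and non-significant p-values show up in the effect world; the sample size of 30 matches the description above.

```python
import math
import random

random.seed(7)

def two_sided_p(sample_mean, n):
    """Two-sided p-value for H0: mean = 0, with known sd = 1."""
    z = sample_mean * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_p_values(true_mean, n=30, reps=2000):
    """Draw the sample mean directly: it is Normal(true_mean, 1/sqrt(n))."""
    return [two_sided_p(random.gauss(true_mean, 1 / math.sqrt(n)), n)
            for _ in range(reps)]

null_world = simulate_p_values(0.0)    # no underlying effect: p roughly Uniform(0, 1)
effect_world = simulate_p_values(0.5)  # real effect: p-values pile up near zero

frac_small_null = sum(p < 0.05 for p in null_world) / len(null_world)
frac_small_effect = sum(p < 0.05 for p in effect_world) / len(effect_world)
print(frac_small_null, frac_small_effect)
```

Histogramming `null_world` and `effect_world` gives exactly the two “rain” distributions described: flat in the no-effect world, piled up near zero in the effect world.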

Do laypeople want to know the gory details?

A lot of these problems arise from answering a question that no one had in the first place. I’m not sure people ever have the question that a p-value answers.

I think that a lot of people wouldn’t understand the idea of a simulation. If you do something physical that might work. When my daughter was in 1st grade they played “Rolling Dice 6” and later “Rolling Dice 12” constantly (6 and 12 dice respectively), making histograms all the time. It was really memorable, not to mention that they really did love that game, which was great because everyone won sometimes. I can’t even remember what constituted winning, I just know whatever it was it appealed to 6 year olds.

> people wouldn’t understand the idea of a simulation. If you do something physical that might work

I agree, but you can’t explain anything without relying on other things being understood.

With a physical model they still need to see it as a representation of something in the world using a concept of randomness (not easy).

True. I’ve been thinking about making a little lego people population and doing something with that. Or maybe a list of all the public schools in NYC. Actually I do use SAMP in my class and that’s a simulation, but there is a story behind it. I have my students each taking dozens of samples of different sizes and graphing the distribution of the results. One thing that always happens is that even though I have talked about sampling as a concept and we’ve done various things with it, when we first go to SAMP and I say okay, now everyone take a 1% sample and I do it on the screen, it’s always kind of a light bulb moment for some people that we don’t all get the same results.

One physical illustration that I (and a lot of other people) have used is to start by asking, “What proportion of M and M’s do you think are tan?” (or whatever), then give each student a small package of the candies, and have them use that as a sample to do a hypothesis test or form a confidence interval. Then compare notes. That seems to help many understand the idea of sampling variability. (Often big packages of little packages of M and M’s or other suitable candies are on sale cheap after Halloween, so one can stock up then and be prepared.)
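For instructors who want to pair the candy exercise with a computation, here is a minimal sketch of the confidence-interval step, using a simple Wald interval and hypothetical classroom counts; real class data would replace the numbers below.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wald form)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Hypothetical classroom count: 11 tan candies in a 55-candy package.
lo, hi = wald_interval(11, 55)
print(f"sample proportion 0.20, 95% CI ({lo:.2f}, {hi:.2f})")
```

Comparing intervals across students makes the sampling-variability point vividly: every package gives a different proportion, but most intervals overlap.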

Geoff Cumming has done something along these lines in his “Dance of the p-values” demonstration:

https://www.youtube.com/watch?v=ez4DgdurRPg

Btw, I shared this with some medical researchers I was working with a couple of years ago, and at least one of them said it was “eye opening”. It’s certainly entertaining.

Thanks, close to what I had in mind but I wish there was not so much judgement being overlaid “highly unreliable p_values”, “hug a confidence interval”. Not necessarily good or bad concepts, just badly interpreted/used.

(In meta-analysis, the combination of p_values can be the least wrong way to go.)

Better than the youtube video is Cumming’s Excel demo you can get at http://www.latrobe.edu.au/psy/research/cognitive-and-developmental-psychology/esci (Scroll down to “The software” and click on + to expand to get downloading instructions.)

This is pretty good. Pitched at the right sort of level for the semi-layperson or basic user of stats IMO.

I am surprised to see any post that leads with Run DMC accused of pedantry.

Basically agree with this. This glossary entry is essentially a descriptivist definition of p-value. Like it or not, within the context of the US Health & Human Services Agency for Healthcare Research and Quality’s Effective Health Care Program, this is functionally what the p-value means & how it is used. The glossary isn’t trying to precisely define the concept, it’s trying to explain how it is actually used. Prescriptivist quibbling doesn’t change that.

Side note, “unfortunate credulity or nihilistic skepticism” seems like a pretty decent characterization of the options available to a layperson confronted with the use of p-values in policy-relevant fields at the moment.

S:

Indeed, this definition is what a lot of people use. It’s given us himmicanes and hurricanes, ovulation and voting, everything causes cancer, ESP, and various other areas of junk science. I and many others think we as a scientific community can do better. I and many others think the U.S. government can do better.

Millions of people believe in astrology too, but I don’t want the U.S. Department of Health and Human Services to recommend astrology either.

Alex: “But as someone who has to explain these concepts to students who have not taken a statistics course and never will – – a discussion of underlying effects, sample sizes, null hypotheses, and what have you is not only a irrelevant and self-indulgent exercise in pedantry, its counterproductive as I am adding more confusion rather than reducing it.”

And how’s this “dumbing down” of absolutely critical statistical concepts working out for your students? How many of them are being given a false sense of competence regarding these concepts and then end up botching a statistical analysis and wasting resources as we’re seeing on a massive scale?

Yes, the theory and practice of statistics is difficult. That should be OK.

I think actually when you teach p value or types of error in a basic stats course for non stats students, what you are or should be teaching is don’t believe your eyes (or what other people say) when it comes to data. You could get data that show large differences between groups … but it could be the result of chance. You could get data that show no difference between groups … as a result of chance. Then on the basis of that be skeptical of your own and others’ conclusions. Ask questions about the sample size and how the sample was drawn. And keep in mind that statistical significance does not mean practical significance (just like random and normal don’t mean the same things in statistics that they do in normal conversation).

I think it is admirable to try to help the public improve statistical literacy, but it’s hard, and if it is done in a sloppy way (as in this example), counterproductive. Physics, biology and philosophy are also hard. Actually so is writing essays. That doesn’t mean we don’t teach intro courses, courses for non-majors, or start teaching kids basic concepts in elementary school.

If they have no desire to understand it, then this will satisfy them.

I have scanned the comments and I don’t think anyone else has pointed this out… but I may be wrong.

The definition under discussion is taken, word for word, from this 2011 publication, which does have some authors, called investigators.

http://www.ncbi.nlm.nih.gov/books/NBK100924/

http://www.ncbi.nlm.nih.gov/books/NBK100921/

I discovered this by googling the first sentence of the definition.

The definition also appears in Appendix E of this 2013 publication.

http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0071877/

I feel compelled to post this as a general response because I have some contact with this side of healthcare. First of all, the government does publish good statistics guides; see http://www.itl.nist.gov/div898/handbook/. But NIST has a 200-year legacy, and much of the government/non-government healthcare metrics movement is a rapid response to the lesser-known parts of Obamacare. And so far, despite the lack of rigor, it seems to be working. Hospitals are penalized for re-admissions and infections, so follow-up is improved. Early reports suggest some success.

The people in this field come from a range of professional and educational experiences. They know their subject matter, like operating room materials or treatment plans, but never expected to be put on a committee for metrics. They struggle with basic math like fractions, ratios and proportions. So just explaining that successes are the numerator and the total sample the denominator is a big deal. The distinction between rate of event, counts and time to event is overwhelming. So talking about effect size, a difference in a metric, is impossible until they ‘get’ the metric. So, as with the quality techniques in the NIST handbook, significance testing is about acting on a signal, not a research claim. I suspect the writer of this definition was copying something or dumbed down the definition.

Re: rates, probabilities, frequencies. It is common practice to INTERPRET a rate as a probability in biostatistics, but not to say a rate IS a probability. I saw INSIDE OUT this weekend and will use a metaphor from the movie. As statisticians we have all opened the door to Room of Abstract Thinking but we can’t let ourselves be reduced to shapes and colors.

I think they are estimates of the population proportion, right? Of course people interpret them as probabilities, that’s how they can use data to try to predict what will happen next month. And that’s a good thing to try to do rather than going by gut instinct.

I agree that for huge numbers of smart people even figuring out the numerator and denominator (especially which denominator is the right one to use) is really hard.

@Andrew

Does wordpress tell you how many people play the vid? I’m curious.

I’m curious if someone can suggest a rewrite of their first example that’d make sense to a lay audience (i.e. don’t use the word “null”):

“The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.”

Seth:

Try this: “The results were consistent with the two drugs having identical effects, with the observed differences merely due to chance variation.”

I don’t think “chance variation” is very helpful by itself. It is precisely the nature of what “chance” has to do with the results that has people confused. Perhaps what should be tried is “The results were consistent with the two drugs having identical effects on the whole population, with the observed differences merely due to chance variations in what part of the population turned up in the samples.”

Srp:

Sure, but once you try to be accurate there’s more and more details that can go in. For example, it’s not just sampling variation, there’s also measurement error.

Given your writing about this issue, I expected you to say a proper discussion would be about what a p-value isn’t rather than trying to fix a bad definition that almost no one without statistical learning would understand. As in, you see a lot of p-values and you should be aware of these problems and then go into sample size and power, nature of null hypothesis, nature of model, etc., and then you would highlight that given these limitations how these p-values and measures of statistical significance can be useful.

Jonathan:

Yes, see my P.P.P.S.

The most important comment on this topic comes from Mayo herself:

http://errorstatistics.com/2015/07/17/statistical-significance-according-to-the-u-s-dept-of-health-and-human-services-i/#comment-127764

Mayo writes: “but it would be OK so long as they reported the actual type I error, which is the P-value”

Anonymous has a point after all about Mayo!

> “but it would be OK so long as they reported the actual type I error, which is the P-value”

Nope. P-value is not Type I error. Suppose, for example, that I was making a fire control decision based on some data. I’m going to decide to pull the trigger based on the value of p(H1|data)/p(H0|data), my hypotheses being H1=target and H0=non-target. The Type I error (false positive) rate is the fraction of the time that I fire on non-targets*. p(data|H0) tells me whether the data collected is typical or unusual under the H0 hypothesis. It says nothing about whether the data is typical or unusual with respect to H1. The data may be “not unusual” with respect to H0 but even more typical of H1, thus resulting in a likelihood ratio which favors H1.

You can’t make an informed decision in a vacuum. In order to make an informed decision you also need to know p(data|H1). Yes, p(data|H0) factors into the H0 vs H1 decision, but the error rate (Type I and Type II) follows from p(H1|data)/p(H0|data).

*At the risk of stating the obvious, Type II error also has its downside.
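The point above, that data can be unremarkable under H0 yet still favor H1, can be illustrated with a single observation and two normal hypotheses; the means and the observed value below are hypothetical, chosen only to make the contrast visible.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of a Normal(mu, sigma) distribution at x."""
    return math.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

def two_sided_p(x, mu0=0.0, sigma=1.0):
    """Two-sided p-value for a single observation under H0: mean = mu0."""
    return math.erfc(abs(x - mu0) / (sigma * math.sqrt(2)))

# One observation; hypothetical hypotheses H0: mean 0 and H1: mean 2.
x = 1.6
p_value = two_sided_p(x)                      # about 0.11: "not significant" under H0
lr = normal_pdf(x, 2.0) / normal_pdf(x, 0.0)  # likelihood ratio p(x|H1)/p(x|H0), ~3.3
print(p_value, lr)
```

So the observation clears the conventional 0.05 bar under H0 ("nothing unusual"), while the likelihood ratio says the same data are over three times more probable under H1, which is exactly the information the p-value alone never conveys.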

Ah, this explains why I find statistics textbooks so impenetrable. I learn a lot from your blog but if this definition of statistical significance is all I had to go by, I’d never be any the wiser. The AHRQ definition is wrong and dangerously misleading but wrong in a way that can be understood by the lay reader – which is what makes it so dangerously misleading. It achieves that by using clear language and giving examples. Your definition does neither. It tries to be maximally generative and minimally expressive – which is a virtue in mathematics but a sin in pedagogy. You must keep in mind a reader who will have only the definition to go by after reading it – not the expert who will nitpick every aspect of the answer. Your definition will not mislead the lay reader but it will also not lead them. Therefore, they will likely fall back on their common sense understanding of the word ‘significance’. A pedagogic definition should make it clear what it does not mean as well as what it means.

Something like:

Statistical significance is often misunderstood

* Statistical significance does not mean real world significance

* Statistical significance does not mean true results

* Statistical significance is often used by researchers but more and more statisticians believe it is not useful

In fact, all that statistical significance measures is [insert your definition – slightly reworded].

Often, it is expressed as p<0.05, but the difference between p=0.04 and p=0.06 is itself not significant. The 0.05 cutoff was chosen arbitrarily and has no real foundation.

So for example: [insert some examples from your blog].
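The commenter’s point that the difference between p=0.04 and p=0.06 is itself not significant can be sketched numerically. The two p-values are illustrative, and the bisection routine is just a stdlib-only stand-in for a proper normal quantile function:

```python
import math

def upper_tail(z):
    """P(Z >= z) for Z ~ N(0,1)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def z_from_p(p, lo=0.0, hi=10.0):
    """Invert the one-sided tail by bisection (stdlib-only quantile stand-in)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z1 = z_from_p(0.04)   # study A: 'significant'
z2 = z_from_p(0.06)   # study B: 'not significant'

# z-score for the difference between two independent estimates with equal SEs
z_diff = (z1 - z2) / math.sqrt(2)
p_diff = upper_tail(z_diff)

print(f"z1={z1:.2f}, z2={z2:.2f}, z for their difference={z_diff:.2f}")
print(f"p-value for the difference: {p_diff:.2f}")   # ~0.44, nowhere near significant
```

The two studies sit on opposite sides of the .05 line, yet the z-score for their difference is about 0.14 – the comparison between “significant” and “not significant” carries almost no evidence by itself.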

Your problems with the definition are all pedantic – their text is sufficient for the masses, who are not well versed in statistical inference.

That being said, you are wrong on at least two fronts.

1.) P values, while influenced by the size of your sample, are not a measure of sample size. While a P value of .051 with 1000 respondents may tick down to .049 with 1500, it’s just as likely to tick up, especially if the relationship is indeed due to chance.

2.) A P value of .05 is sufficient in the social sciences given the complexity of the things we measure. That being said, very few regression results in social science journals will have results at this level; we ideally shoot for .01. I know the hard sciences generally like six sigma, but that is simply not tenable when studying people.

And also this: “ ‘likelihood’ has a technical use which is not the same as what they say. Second, ‘the number of times a condition or event occurs in a study group divided by the number of people being studied’ is a frequency, not a probability.”

Yes, likelihood has a technical meaning and a whole suite of modeling techniques associated with it (MLE), but the number of times an event is observed over the number of observations made (the possible number of times that event can occur) is a probability, while the number of times an event is observed without standardizing it to the number of observations is a frequency.

100 people have read this article and believe what you’ve written – that’s a frequency. If that is out of 10,000 people who’ve just read it, then that’s 10%, and we can say anyone reading this article has a 10% chance of believing it.

Josh:

I think you may be trolling, but . . . your statement, “the number of times an event is observed over the number of observations made (possible number of times that event can occur) is a probability,” is not in general true, as is illustrated by Phil’s example above of the coin that was flipped 8 times with 6 heads but does not have a probability of 75% of coming up heads. Again, if you increase N, the empirical proportion gets closer to the probability—that’s the law of large numbers—but that’s the whole point: The empirical proportion and the probability are not the same thing; it is only under certain conditions that they approximately coincide.

Of course, given the existence of people who (a) wrote the definition quoted in the above post, and (b) didn’t realize their errors, it’s no surprise that there are blog commenters who share these mistaken attitudes about probability and statistical significance. Lots of people don’t understand Newton’s laws of motion either. Science is hard. If it weren’t hard, it all would’ve been discovered earlier.
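The law-of-large-numbers point – the empirical proportion approaches, but is not, the underlying probability – can be illustrated with a quick simulation. The sample sizes and the seed below are arbitrary choices:

```python
import random

random.seed(1)
p_true = 0.5   # true probability of heads for a fair coin
props = {}

# At n=8, proportions like 6/8 = 0.75 are entirely unremarkable;
# only as n grows does the empirical proportion settle near p_true.
for n in (8, 100, 10_000, 1_000_000):
    heads = sum(random.random() < p_true for _ in range(n))
    props[n] = heads / n
    print(f"n={n:>9,}: empirical proportion of heads = {props[n]:.4f}")
```

Small samples routinely stray far from 0.5; the million-flip run lands within a fraction of a percent of it. The proportion converges to the probability, it doesn’t equal it.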

Maybe one way to put it is probabilities generate frequencies not the other way round.

+1

I’m not trolling – I’m very well versed in statistics and ABD on my PhD.

You’re right about the law of large numbers bringing the empirical results closer to the underlying probability. However, A.) in that example we can say of our observations that any one has a 75% probability of being heads. If, as is the case in all of our studies, we do not know the underlying probability, this is where B.) the P value comes in. In this example, with an N of 8, your P value would not be significant for your estimated probability of heads. Increasing the sample size, as you indicated, will bring your estimation more in line with the underlying reality, and thus your P value will improve.

It is important to note that the improvement in your P value occurs because your estimation gets closer to reality, not because you have a larger sample. Though it is often the larger sample that brings you closer to reality, this is not always the case.

I really adore the fact that you’re slagging on HHS while doing so using similarly inaccurate and misleading prose. Your contention that P values are a measure of sample size is just wrong. In your words, “FUUUUUK no no no no no”. You also come off as pompous ridiculing a P value of .05 – which literally means there is a 95% chance that the observed relationship is not due to chance, and that by itself is very convincing and goes a long way toward demonstrating the veracity of your hypothesis (note I didn’t say prove; that word belongs in math and distilleries). I’m not saying you’re wrong in contending that P values are sensitive to sample sizes, but you are very, very wrong in concluding that P values are a measure of sample size.

Josh:

I am sorry but no, a p-value of .05 does not “literally means there is a 95% chance that observed relationship is not due to chance.” This is simply wrong. If you are really studying for your Ph.D. in a quantitative field, I strongly recommend you speak with your supervisor, the other faculty in your program, and some of your fellow students and ask them to explain your confusion.

In all seriousness, it’s never too late to learn, and I suggest you swallow your pride and get this cleared up.

Andrew,

Is it even safe to conclude that Josh’s supervisor will be able (or want) to clear up his confusion?

Have you looked at the “research methods” teaching materials for some of our colleagues in the social sciences?

I have. And I’ve seen one esteemed “colleague” who presents himself as having expertise in statistics refer to the kind of accuracy in definitions of statistical concepts that you provide here as “statistical bluster” – and then, in the same teaching materials, make very glaring errors in simple formulas. My conclusion is that if that “colleague” had spent more time understanding the “statistical bluster,” they would be better equipped to recognize when a formula they write is in error. But if you’re gonna reduce statistics to just the application of a bunch of formulas, it seems to me you should at least get the formulas right.

Reading between the lines of those teaching materials, it seems to me that the message being sent to that “colleague’s” graduate students is that getting statistics right will only serve in slowing them down in their efforts to publish as much as possible.

JD

You are pompous. I am not only studying for my PhD in a quantitative field but doing so in a program that is highly quantitative itself; having taken six stats classes (in addition to game theory) and studied at ICPSR, I’m very confident in my position. I’ve studied a suite of methods you’ve likely not heard of, and I can tell you that yes, a P value tells you the probability that the observed relationship is due to chance or measurement error, etc. (unless you’re Bayesian, which I’m thinking you must be).

I would suggest that if you cannot tell the difference between an observed probability, a frequency, and the true probability, you should pack it in; no need to keep a blog on stats which you obviously are not well versed in. Add to that your argument that a P value is a measure of sample size, and you’ve got no leg to stand on.

Let’s look at the calculation of a P value, shall we. In your example above, the standard error is sqrt[(.75*.25)/7]=.163. If your null hypothesis is that you’d observe heads 50% of the time, your test statistic would be z=(phat-p)/SE(p), or T=.25/.163=1.533; the associated P value for a sample with 7 degrees of freedom is thus .1705 (or a 17.05% chance your observed results are due to random chance – or measurement error). You can tell the results here are not significant because the 95% confidence interval (which uses a P value of .05) puts the true population mean between .36 and 1.138, and as you cannot have a proportion greater than 1, it’s not significant.

Rather than hurling insults at me, why don’t you explain any errors you see in my thinking or calculations?
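Josh’s normal-approximation arithmetic above can be checked against the exact binomial tail, which needs no standard-error approximation at all. A minimal sketch, assuming the 6-heads-in-8-flips example from the thread and a fair-coin null:

```python
from math import comb

n, k, p0 = 8, 6, 0.5   # 8 flips, 6 heads, null hypothesis: fair coin

def binom_pmf(n, k, p):
    """P(exactly k successes in n trials), success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Upper tail: probability of 6 or more heads under the null
p_upper = sum(binom_pmf(n, j, p0) for j in range(k, n + 1))

# The null is symmetric, so the two-sided p-value is twice the upper tail
p_two_sided = 2 * p_upper

print(f"P(X >= 6 | fair coin) = {p_upper:.4f}")      # 37/256 ~ 0.1445
print(f"two-sided p-value     = {p_two_sided:.4f}")  # ~ 0.2891
```

Either way – exact binomial or normal approximation – nothing about this p-value is “the probability the results are due to chance”; it is a tail probability computed under an assumed null.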

“(unless your [sic] Bayesian) which i’m thinking you must be.”

Gee Josh, Ya think?

Andrew, sorry, I shouldn’t feed the trolls.

Dan:

This guy has more staying power than most of our trolls, I’ll grant him that. But I agree, at this point enough is enough.

Hey Daniel Lakeland, it’s a good thing you knew “Andrew” here was Andrew Gelman, given there are millions of Andrews in this world.

Thanks for your pomposity

“which literally means there is a 95% chance that observed relationship is not due to chance”

Sheesh. This is precisely why we need to emphasize that a p-value *is not* the probability that the results were due to chance. That’s the typical layman’s interpretation of a p-value, but we need to be very clear that that is not what it is. Otherwise, this is quite logically how one would interpret 1-p, and it’s just plain wrong.

What would be your interpretation of a P value, then?

BTW I think we’ve lost sight of the author’s original claim that a P value is a measure of sample size, which it is most certainly not.

Josh,

For what it’s worth — I define the p-value as follows:

“p-value = the probability of obtaining a test statistic at least as extreme as the one from the data at hand, assuming

i. the model assumptions are all true, and

ii. the null hypothesis is true, and

iii. the random variable is the same (including the same population), and

iv. the sample size is the same.”

For more details, you may download the slides for Day 2 at http://www.ma.utexas.edu/users/mks/CommonMistakes2015/commonmistakeshome2015.html

Josh is a troll; he makes statements like “no need to keep a blog on stats which you obviously are not well versed in” to Andrew Gelman, professor of statistics and the first author of one of the most comprehensive textbooks on modern Bayesian statistics (BDA3), as well as the manager of a group that produces the most cutting-edge statistical inference software in the world (Stan).

A p value is the probability of seeing data as extreme or MORE extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).

If the null hypothesis is that the data come from a unit normal N(0,1) but the data actually come from a N(1,.05) and you have 1 data point, you are almost guaranteed that your p value is not “significant”. Perhaps p ~ 0.16; does this mean that there’s a 16% chance that your data “occurred by chance”, or that there’s an 84% chance that it didn’t occur by chance?

The phrase “occurred by chance” is meaningless by itself.
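Daniel’s example can be reproduced directly, reading the .05 in N(1,.05) as a standard deviation (an assumption on my part; he may have meant a variance):

```python
import math, random

def upper_tail_std_normal(x):
    """P(Z >= x) for Z ~ N(0,1)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

random.seed(0)

# One data point actually generated from N(1, 0.05^2), as in the comment
x = random.gauss(1.0, 0.05)

# One-sided p-value computed under the null 'random number generator' N(0,1)
p = upper_tail_std_normal(x)

print(f"x = {x:.3f}, p under the N(0,1) null = {p:.3f}")  # p near 0.16
```

The data were generated by a non-null process every single time, yet the p-value sits around 0.16 – which is why “occurred by chance” is an empty phrase without a stated null.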

@Martha: You wrote, “I define the p-value as follows: “p-value = the probability of obtaining a test statistic at least as extreme as the one from the data at hand…”

Yes, exactly. It’s the test statistic that matters. A couple things:

1) Unless your test is just an anomaly detector you need both H0 and H1 hypotheses to compute a test statistic. If all you’ve got is H0 then you’re not in a position to test anything. Just this afternoon one of my interns ran F-tests for models she developed and calculated associated p-values. For reasons not worth getting into, we took the F- and p-values with a grain of salt but the point was to see whether the p-value was really low or really high. It was a quick way to compare two hypotheses. (It was a bit of an academic exercise. You could see by eye that the more complex model didn’t fit the data appreciably better than the simple one. If it had looked like it did, we would have computed AIC or BIC values or the like.)

2) Am I out of my mind in thinking that I pretty routinely see people write “data at least as extreme” where they should be writing “test statistic at least as extreme”? It’s not a subtle distinction, is it?

Chris G: it can appear subtle when the test statistic is sufficient or pivotal…

Chris G: Now that I consider the question for longer than two minutes, it occurs to me that according to Egon Pearson the test statistic is really (or ought really to be) defined by a system of level curves in the sample space across which we become “more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts”. (I call them level curves because by their construction, we should only care about which level curve the data lie on and should be indifferent to the position of the data within that level curve.) From that point of view the distinction between extreme data and an extreme test statistic is rather blurry, don’t you think?

@Corey: Mayo’s post is not a lite read. I’m going to have to chew on it and the level curve suggestion for a bit.

To clarify, I absolutely believe p(data|H0) is instructive but it’s only part of the story. Yes, you want to know whether your data is consistent with H0 but in order to make any definitive conclusions about what’s going on you need to establish whether the data is or isn’t consistent with other hypotheses. For example, suppose my H0 hypothesis is x~N(0,1^2) and H1 is x~N(3,1^2). I measure x=10. The value is extreme under both H0 and H1. In practice, I’d probably decide ~H0 and ~H1 because I wouldn’t believe that the H0 and H1 pdfs were actually normal out to seven and ten sigma, i.e., I’d figure that 1) the pdfs were good-faith estimates but that nowhere near enough data had been collected to characterize the tails out that far and 2) there’s probably some other Hn hypothesis that I hadn’t thought of which is more consistent with the data. If forced to make a decision between H0 and H1, I’d want to know the penalty for incorrect decisions in each case. Since I’d suspect both hypotheses are wrong I’d choose the one with the lowest penalty for error. I’m now way off topic. I’ll look more at Mayo’s post.

That a.) I did not know who Andrew Gelman is and b.) I am a frequentist rather than a Bayesian does not make me a troll. BTW I think Bayesian statistics belongs in game theory, not in results. Had I known the man was indeed a Bayesian, then our cause for difference of opinion would have been apparent.

This article is hitting HHS for the sloppiness of their language, but rather than educating me by pointing out errors in my thinking, you make fun of me. Could that be why people in general are not statistically literate? Instead of accusing me of being a troll and making fun of me, my advisor, and my program, your time would be better spent pointing out the inaccuracy of my statements. Martha did a great job on this front.

Martha, your explanation is very informative, makes sense, and I see the inaccuracy of my statement – I would say that the claim that a P value is an estimate that the observed relationship is due to chance is not wrong per se but an oversimplified statement.

Further, not one of you has addressed my underlying points, which are as follows: A.) a P value is not a measure of sample size; B.) a P value of .05 is fine in some cases, especially because we’re not taking anything as proof – it provides credible supporting evidence; and C.) the author confuses observed probability with frequencies.

Josh:

This is my last try . . . Nowhere did I make fun of you or your advisor or your program. I am 100% serious that you talk over p-values with some people you work with, and maybe they can clear up your misunderstanding. This stuff really can be confusing. There’s no shame in being mistaken. Just take it as an opportunity to learn.