Daniel asked,

“which textbook did you use for your class?”

Introduction to the Practice of Statistics, 5th edition, 2005/6, by Moore and McCabe

My problem is that you should be studying the relationship between variables, not looking for “an effect”. E.g., in medicine this would often be a dose-response curve, or response over time under various conditions in various patients (and the average often does not look like the individual curves!). Then you come up with a model to explain the shape of the curves and use it to make predictions. No one can come up with a useful model to explain “an effect”.

The entire practice of looking for “an effect” to begin with is what needs to go away. Adding textbooks’ worth of generic and pedantic math on top of that just serves to hide what is going on.

Martha, your notes suggest you had a great class. I always liked creating projects that tied multiple ideas together. I wrote up a series of projects for teaching an engineering computing course. They started with the equations of motion of a ball in 2D, then used dimensional analysis to derive a drag expression; then some data led to a regression to find an expression for a drag coefficient; then came ideas about how to solve ODEs by iterative methods, which taught looping; then how to interpolate to find the horizontal distance at projectile impact; and then ideas of optimization, finding angles for the maximum-distance trajectory… doing inference on fluid viscosity by shooting a ball at a known speed… All building on a simple idea and adding complexity naturally as you asked more questions. Of course, I wasn’t in charge of the class, so only a few of the lessons got used.

Most students at USC didn’t like it; they seemed to want quick, simple, certain answers to textbook-type problems: low risk. But you could always tell who the best students were, because they would eat that stuff up.

Which textbook did you use for your class?

Daniel said,

“seems like basic stats 101 is a major problem for science. it induces people to turn off their brain when it comes to understanding the meaning of scientific data.”

and Anonymous added, “Even a single stat 101 course seems to nuke their ability to think about data and evidence well.”

Stats 101 needs to be taught in a way that at least makes a serious effort to prevent these undesirable effects. Below are some quotes from a first-day handout that show how I have tried to forestall these common misunderstandings.

[Please note:

1) The course was not a usual Stats 101 — it had a calculus-based Probability course as prerequisite, but much of what I give here would also be applicable to a “standard” Stats 101.

2) I don’t claim that these points worked miracles, but I think they helped set the tone for the course, and also gave me the right to say, “I told you so” if students complained that my grading was harsh.]

“In many problems you will need to combine common sense and everyday knowledge with mathematical and/or statistical techniques.

Some questions on homework and exams will not have one correct answer; your grade on such questions will depend largely on the case you make for your answer, rather than just on the answer itself.

Reading assignments from the text will be given. These need to be read with attention to detail as well as to getting the general idea.

Learning new technical vocabulary is important. Some words will have technical meanings that are different from their everyday meanings.

Writing carefully and precisely is important.”

“I believe that it is not possible to evaluate accurately what you have learned and done in this class solely on the basis of problems that can be done within the time limits of an exam. Therefore homework problems and a project, which you can spend more time on, will be important parts of your grade.

As mentioned above and below, grading will be based not just on the final answer or on calculations, but also on the reasoning shown in arriving at your final answer.”

“Homework: You will be assigned three types of homework:

1. Reading assignments. The textbook is unusually well written, so we can make best use of it and class time by your doing reading assignments before coming to class. Then we can spend class time going over the more difficult parts of the reading, reinforcing and applying what you have read, and supplementing the text with some of the mathematical reasons behind the techniques. Be sure to read for understanding and not just superficially. Thinking about what you read, and about what we do in class, is important for learning statistics. Pay special attention to the points marked with the “caution” symbol in the margin of the book.

2. Practice exercises. These will usually have answers summarized in the back of the book. You will not hand these in, but you will need to do them to help learn the skills and concepts that you will need to put together to do the problems on written homework assignments. Be sure to do them before the date they are assigned for, so you can ask relevant questions based on preparation and understand class discussion. (We won’t be able to discuss all practice exercises in class.) Usually practice exercises will be assigned together with the reading that they cover, to help you understand and assimilate the reading.

3. Written homework. These problems will usually be longer and/or more involved than practice exercises and exam questions. Consider each written homework assignment as a mini-take-home exam. See Guidelines for Written Homework and Policy on Late and Make-up Work below. Also bear in mind that the answers in the back of the book are just summaries; your solutions to written homework need to be more detailed and show your reasoning more than the answers in the back of the book.”

“Guidelines for written homework:

1. Remember that one important purpose of written homework is to practice thinking statistically and to show me how well you have progressed in your thinking. Be sure to show your reasoning — I can’t evaluate it if you don’t show it. And keep in mind the following quote from the instructor’s manual for our textbook:

‘If we could offer just one piece of advice to teachers using IPS, it would be this: A number or a graph, or a formula such as “Reject H0,” is not an adequate answer to a statistical problem. Insist that students state a brief conclusion in the context of the specific problem setting. We are dealing with data, not just with numbers.’

2. Do not hand in a rough draft! Be sure to spend time organizing and writing your solution. Ask yourself if you would like to read your write-up. If not, rewrite it! Part of your grade will be based on clarity of organization and explanation. After all, communicating well is part of thinking well — and making the effort to communicate clearly is an important way to develop your thinking.

Do not hand in extra computer output. Cut and paste (either by hand or on a word processor) so that figures and computer output come as close as possible to the point in your discussion where you refer to them. In some cases, writing on computer output (especially printouts of graphs) will work.

Reminder: Answers in the back of the book are summaries, condensed to fit in as little space as possible. Do not use them as models for written homework.

3. Write in complete sentences.

4. Pay attention to correct use of vocabulary. You will be learning technical vocabulary in this course. Part of what you need to learn is to use it appropriately. Be especially careful of what in language learning are called “false friends”: words that are familiar, but have a technical meaning that is different from their common meaning. “Significant” is one example of such a word.

Also be careful not to use mathematical vocabulary inappropriately in a statistical context. In mathematics, we can often prove an assertion. In statistics, we can usually only conclude that our result supports, suggests, or gives evidence in favor of a conclusion.

5. Use symbols correctly. One symbol often misused is the equal sign. Do not use it except to mean that the two things it is between are equal!!

Exams:

Do not expect exams to be just like homework. Exam questions will on average be less involved computationally than homework problems. They will often focus in more depth than homework on conceptual understanding. For example, some exam questions will test to see if you can distinguish between similar concepts. Others will be “summing up” questions to test how well you have been thinking as you learn. Others will provide you with computer output and ask you to answer questions based on that output and a description of the study from which it came.

Class Attendance and Participation: This is important for two reasons:

1. We will be covering material in class that is not in the textbook.

2. Discussion is very helpful in learning statistical concepts and statistical thinking.

Since the class is fairly large for class discussion, I will divide the class into two groups, which will alternate taking primary responsibility for responding to questions in class. When it is your group’s turn to be responsible, be prepared to put solutions on the board or the doc cam as well as answer questions on the reading and exercises. But remember that answers to questions that have answers in the back of the book usually need to be more detailed than the answers in the back of the book, need explanations, and need to be rephrased in your own words.

Of course, you will need to do assignments for all days, since one day’s assignment typically builds on the previous days’.

Please note: I expect students to make mistakes in class participation. Sometimes we learn best from our own or others’ mistakes. What I look for in class participation is that you are trying, and thinking.

Ethical matters:

Statistical ethics: Statistics consists of a collection of tools which, like any tools, can be used either for good or ill. It is your responsibility as a citizen of the world to be sure not to misuse these tools. I encourage you to read the Ethical Guidelines for Statistical Practice developed by the American Statistical Association, available on the web at http://www.amstat.org/profession/index.cfm?fuseaction=ethicalstatistics

Authorized and unauthorized collaboration: Since the University defines collaboration that is not specifically authorized as academic dishonesty, I need to tell you what collaboration is and is not authorized in this class.

The following types of collaboration are authorized:

1. Working on homework with someone who is at roughly the same stage of progress as you, provided both parties contribute in roughly equal quantity and quality (in particular, thinking) to whatever problem or problem parts they collaborate on. In fact, I encourage this type of collaboration!

2. A moderate amount of asking, “How do I do this on (the statistical program used)?” However, as you gain familiarity, you should get in the habit of using on-line help and trying logical possibilities, then asking for help only if these don’t succeed after a reasonable try.

The following types of collaboration are not authorized:

1. Working together with one person as the do-er and the other as the follower.

2. Any type of copying. In particular, splitting up a problem so that different people do different parts is not authorized collaboration on homework. (A certain amount of this may be appropriate on your project.)

3. Possession or consultation of the Instructor’s Solution Manual.

Academic dishonesty aside, asking anyone, “How do I do this problem?” (as opposed to questions like, “How do I carry out this detail of this technique?” or, “I’m not sure whether to proceed this way or that way; here is my thinking about each possibility; am I missing something?”) is just cheating yourself, since it avoids the most important part of learning statistics: developing your statistical thinking skills.”

Just saw another example of how Stat 101 nukes people’s ability to think about data:

A scientist observes an entire population which has some kind of variation in it and, in effect, interprets the variation as due to sampling error. They immediately want to do a significance test to determine if the variation is “real”.

There’s negligible measurement error, and the entire population isn’t being thought of as “one population among many” or anything like that. They simply can’t wrap their brain around the fact that they already know there’s a real variation.

I can’t tell you how many times I’ve seen smart scientists make that error. I’ve never once seen someone unschooled in stats do the same, or even be tempted to make it, when left to their own devices when analyzing data.

Really Michael, really? “You should read a bit more before declaring your own errors.”

We all have to learn from others how we are wrong. It’s like Rome is burning and we are all bashing each other’s violin skills.

Now, I learned from anonymous here – https://statmodeling.stat.columbia.edu/2019/12/18/attempts-at-providing-helpful-explanations-of-statistics-must-avoid-instilling-misleading-or-harmful-notions-statistical-significance-just-tells-us-whether-or-not-something-definitely-does-or-defin/#comment-1208360

OK, you wrote plausible, still sure that is the right word to use in a general audience talk?

Also, Michael, you will find Mayo above mentioning how using something like a likelihood interval in a “diagnostic screening model” of inference is not a legitimate frequentist inference method but rather some illegitimate hybrid thing.

Daniel said, “it seems like basic stats 101 is a major problem for science. it induces people to turn off their brain when it comes to understanding the meaning of scientific data.”

+1

Because I strongly suspect I know who Anonymous is, I know that he knows all of the stuff you wrote. I don’t think anything he wrote implies that they all made the same mistakes, only that they all made mistakes; nor do I think he implied that they would all be happy with stats today (they undoubtedly wouldn’t). I think his conclusion stands: “it does mean the whole edifice is fundamentally flawed.”

He gives three examples of fundamental flaws, and each one is in fact a flaw. Whether they are flaws endorsed by Neyman, Fisher, Pearson, etc. is actually irrelevant to whether they are flaws.

Also, the likelihood interval is the interval of the most plausible values only if your prior knowledge is a uniform density over the parameter space.
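A quick numerical sketch of that point, with invented numbers: under a flat prior the posterior is proportional to the likelihood, so the highest-likelihood values are the most plausible ones, but an informative prior moves the most plausible region away from the maximum-likelihood value.

```python
import numpy as np

# Hypothetical data: 7 successes in 10 binomial trials.
p = np.linspace(0.001, 0.999, 9999)
lik = p**7 * (1 - p)**3

# Flat prior: posterior proportional to likelihood, so the
# highest-posterior value coincides with the MLE (0.7).
post_flat = lik / lik.sum()

# Informative Beta(1, 10) prior, density proportional to (1-p)^9:
# the posterior mode shifts well below the MLE.
prior = (1 - p)**9
post_inform = lik * prior / (lik * prior).sum()

mle = p[np.argmax(post_flat)]            # about 0.7
mode_inform = p[np.argmax(post_inform)]  # about 7/19, roughly 0.37
print(mle, mode_inform)
```

With the informative prior, the interval of most plausible values is no longer the likelihood interval, which is exactly the caveat above.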

Sorry, anonymous, you are wrong. Even if they all made mistakes, Fisher, Neyman, and Pearson did not all make the same mistakes. Fisher’s p-value is not Neyman & Pearson’s critical region, and Neyman’s confidence interval is not Fisher’s fiducial interval. Fisher’s likelihood interval _is_ the interval with the most plausible values. Not only that, but none of your alleged miscreants would be happy with how statistics plays out nowadays.

You should read a bit more before declaring your own errors.

Daniel,

In my experience, quantitative people (STEM majors with an MS, for example) who had zero experience with stats are dramatically better at analyzing data when left to their own devices than those with formal exposure to stats. Even a single Stats 101 course seems to nuke their ability to think about data and evidence well.

Keith, it seems like basic Stats 101 is a major problem for science. It induces people to turn off their brains when it comes to understanding the meaning of scientific data. In the case where my wife was showing me the email, what if the researcher had just plotted the raw measurements on the bivariate outcome and colored the points based on the condition? Instead of insisting on a proof of a difference or no difference, which is what they hoped to get from the stats analysis, they could have just acknowledged that there is variation; a plot of the centroid of each group would probably have shown what they cared about, which is that certain types of cartilage are more similar to each other and other types are different. A quantitative representation of their qualitative by-eye impression is their main purpose. Later they may want to look at treated samples and see which portion of the phase space they are in… but all they know to ask is: yes/no, is there a “significant” difference… sigh.
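A rough sketch of that descriptive approach, with made-up data standing in for the cartilage measurements (group names and numbers here are purely hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bivariate measurements for three conditions;
# A and B are designed to be similar, C to sit elsewhere.
groups = {
    "type_A": rng.normal([1.0, 1.0], 0.3, size=(20, 2)),
    "type_B": rng.normal([1.1, 0.9], 0.3, size=(20, 2)),
    "type_C": rng.normal([3.0, 2.5], 0.3, size=(20, 2)),
}

# Instead of a yes/no significance test, summarize each condition
# by its centroid and compare which conditions sit near each other.
centroids = {k: v.mean(axis=0) for k, v in groups.items()}
d_AB = np.linalg.norm(centroids["type_A"] - centroids["type_B"])
d_AC = np.linalg.norm(centroids["type_A"] - centroids["type_C"])
print(d_AB, d_AC)  # A and B close together, C clearly elsewhere
```

A colored scatter of the raw points plus these centroids would give the quantitative version of the by-eye impression directly, with the variation visible rather than hidden behind a binary verdict.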

Lucky her.

Unfortunately (or fortunately) I am sometimes tasked with taking down recently graduated statisticians making arguments such as: given p > .05, there is nothing to be concerned about; move on, this is not the exposure level you need to be worried about.

Sometimes I lose, mostly because the senior non-statisticians involved have taken intro stats courses (OK, I think I remember that being correct).

+1

Andrew: The quote is from Wasserstein et al. 2019:

“McShane et al. in Section 7 of this editorial: ‘[W]e can learn much (indeed, more) about the world by forsaking the false promise of certainty offered by dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’—based on some p-value or other statistical threshold being attained.’”

From this they extract the “seductive certainty falsely promised by statistical significance”.

You publish so much; I see now there is another co-authored McShane et al. 2019, without an “a” or “b” suffix, so I can see why the person I asked to look up the references for my blog comment assumed it was one and the same.

Andrew: It does not say that these are fallacious uses of statistical significance tests. A statistical significance test, we are told, “begins with data and concludes with dichotomous declarations of truth or falsity— binary statements about there being ‘an effect’ or ‘no effect’— based on some p-value or other statistical threshold being attained.”

There is no mention anywhere of what a statistical significance test does except for these construals, and no qualification that these assertions allude to abuses of the methods. The allegations of what statistical significance tests purport to do aren’t even limited to a so-called NHST, which I admit is so often associated with a fallacious animal that we ought to drop the acronym (see Final Keepsake in my book):

NHST is mentioned in asking “can they conclude sodium is associated with—or even causes—high blood pressure as they would under the NHST paradigm?”

Repeatedly, statistical significance tests are claimed to output “deterministic” claims.

I noticed Jennifer Rogers makes a similar statement at 3:54, with a clearer context:

https://www.youtube.com/watch?v=FxQC2YMw8b8

It seems to me that what’s said in McShane is stronger and more problematic because she is referring to the kind of lists of known carcinogens–resulting from numerous tests. She’s saying the form of the assertion wrt known causal practices and substances is often: X causes cancer, and her point is an ultra obvious one: the fact that two substances can cause cancer doesn’t mean they both are equally risky.

Andrew:

What is said in McShane et al. is that this is what statistical significance tests purport to do. It does not say that these are fallacious uses of statistical significance tests. A statistical significance test, we are told, “begins with data and concludes with dichotomous declarations of truth or falsity— binary statements about there being ‘an effect’ or ‘no effect’— based on some p-value or other statistical threshold being attained.”

Repeatedly, statistical significance tests are claimed to output “deterministic” claims. The error statistical qualification is absent.

There is no mention anywhere of what a statistical significance test does except for these erroneous construals, and no qualification that these assertions allude to abuses of the methods.

I noticed Jennifer Rogers makes a similar statement as in the talk being criticized at 3:54 (it sounds like “does definitely cause”), but the next line is more apt:

https://www.youtube.com/watch?v=FxQC2YMw8b8

It seems to me that what’s said in McShane is actually stronger and more problematic, because, in this talk, she is referring to the kind of lists of known carcinogens–resulting from numerous tests–not isolated small P-values. She’s saying the form of the assertion, wrt known causal practices and substances that you read about, is often: X causes cancer, and her point is an ultra obvious one: the fact that two substances can cause cancer doesn’t mean they are both equally risky.

In McShane et al., the allegations of what statistical significance tests purport to do aren’t even limited to a so-called NHST, which I admit is so often associated with a fallacious animal that we ought to drop the acronym (see Final Keepsake in my book: https://errorstatistics.files.wordpress.com/2019/04/souvenir-z-farewell-keepsake-2.pdf).

NHST is mentioned in McShane et al. asking “can they conclude sodium is associated with—or even causes—high blood pressure as they would under the NHST paradigm?”

Deborah:

Also we should clarify: in your comments you attribute the following phrase to McShane et al.: ”the false promise of certainty offered by dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’.”

McShane et al. never write this in their article. The closest is this: “In brief, each is a form of statistical alchemy that falsely promises to transmute randomness into certainty, an ‘uncertainty laundering’ (Gelman 2016) that begins with data and concludes with dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’—based on some p-value or other statistical threshold being attained. A critical first step forward is to begin accepting uncertainty and embracing variation in effects (Carlin 2016; Gelman 2016) and recognizing that we can learn much (indeed, more) about the world by forsaking the false promise of certainty offered by such dichotomization.”

I think it’s best to avoid confusion by not putting non-quotes in quotation marks.

In their paper, McShane et al. are clearly talking about what people wrongly do with statistical significance, not what they should or must do.

Yes, Daniel! I believe I understand the problem you have described. Live streaming some case studies would quite likely be educative for all.

The question of the extent to which statistical practice today results in errors of the type “misinterpretation of the meaning of statistical tests” is an empirical one.

Suppose for example that I surveyed 1000 PhD level scientists in biology, medicine, psychology, economics, and other fields where statistical tests are routinely used.

I give them some scenario involving collecting data and calculating some p values… and then ask them questions regarding what the conclusion should be, which they answer in a short paragraph form.

What fraction of them would assess the evidence in an appropriate way according to Mayo’s suggestion that “every inference is qualified by an error statistical assessment” and would not “deny uncertainty” in any way?

Just this week my wife showed me an email where someone outside her lab was analyzing some data collected in collaboration with her lab.

It said something along the lines of “an ANOVA analysis shows a statistically significant overall effect, but the analysis of A vs B and A vs C shows no statistically significant effects, but B vs C and B vs D shows an effect even … etc etc”

The researcher wrote in essence “I don’t know how this can be, because it doesn’t make any sense, the only thing I can think is that there must be a continuum of effects and statistics just can’t show it to us”

(none of these are actual quotes, but rather paraphrases, I don’t have the email myself).

The point is, the only thing this person, a postdoc in biology, can get out of their attempt at analyzing the data is that the results show conflicting information about whether there is an effect or not. They are incapable of giving a coherent answer, because the truth is what they intuited: the issue isn’t binary. But as far as this person knows, all that stats can do is give them binary answers to yes/no questions.

Fortunately my wife has me to fall back on to get this data analyzed in a more sophisticated manner, but I assure you this is a *routine* situation: postdocs in biology or medicine who have taken biostats classes go off into the world committing statistics at the drop of a hat.

Yes, but Deborah, who is qualifying what? That is one essential question, among many others, that a consumer of statistics has asked and should ask.

Maybe the term ‘dichotomous’ is being conflated with ‘binary’ in the abstract; that is, without describing the precise context, confusing statements get made.

Deborah:

I agree with McShane et al. that it’s bad to make “dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’.” This is done all the time by users of statistical significance. I agree that it is not *necessary* that users of statistical significance make incorrect dichotomous statements—but they do.

Correction added to the post in square brackets.

David, suppose our probability measure is p(theta), where theta is the angle counterclockwise from the horizontal position and goes from -pi to pi, along the lines of the normal mathematical measuring of angles in trigonometry.

Now as Anon says p(pi/2) (straight up, 12 o’clock etc) is the highest density region for observations.

Something jostles your apparatus and you want to know if it’s still well calibrated. So you take a data measurement, and it gives angle -pi/4.

Please, using the notation

p = integrate(function,variablename,lower,upper)

and plugging in whatever functions variables and ranges are appropriate, describe to me how you can calculate a p value for the hypothesis “this data point was generated by the same process as we saw before the apparatus was jostled” *using the angle itself as the test statistic*, and why your choice is the unique obviously correct answer.

Yes, you can design other test statistics, that isn’t in question. In fact using the density as a test statistic is a method I’ve advocated here for checking the adequacy of Bayesian models.
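For what it’s worth, the density-as-test-statistic approach mentioned above can be sketched by Monte Carlo. The von Mises model and numbers here are illustrative assumptions, not an answer to the angle-as-test-statistic challenge, which is exactly that on a circle there is no privileged tail to integrate over:

```python
import numpy as np
from scipy.stats import vonmises

# "Null" model: angles concentrated around straight up (pi/2).
# kappa is an assumed concentration parameter.
kappa, mu = 4.0, np.pi / 2
null = vonmises(kappa, loc=mu)

theta_obs = -np.pi / 4  # the post-jostle measurement

# Density-based p-value:  p = Pr( f(Theta) <= f(theta_obs) )
# under the null, estimated by Monte Carlo.  This is well-defined
# on the circle, unlike a plain tail area for the angle itself.
rng = np.random.default_rng(1)
draws = null.rvs(size=100_000, random_state=rng)
p = np.mean(null.pdf(draws) <= null.pdf(theta_obs))
print(p)  # small: an angle this far from mu is unusual under the null
```

Note this sidesteps rather than answers the question posed: it uses the density, not the angle, as the test statistic, because for the angle itself there is no unique choice of “tail” on a periodic domain.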

Haste makes waste. Let me restate the whole thing.

I put it to you that a more complex object shouldn’t be a problem if you have a probability measure and a distance metric for the object.

In this more general case, you would only have something analogous to a two-sided test. A p-value could be defined as the integral over the region of the object where the probability density is equal to or less than the value of the probability density at your test statistic.

But I must be missing something?

Hang on, my definition wasn’t correct; I had the multi-modal case in my head but it didn’t make it through my fingers. A p-value could be defined as the integral over the region of the object where the probability measure was equal to or less than the value of your test statistic.

Anonymous, I don’t see how this is a hard problem. As Jeff Walker’s elevator pitch mentioned, you need a null distribution for the test statistic, in which case (as David P points out) it just boils down to 1-sided or 2-sided.

Even a more complex object shouldn’t be a problem if you have a probability measure and a distance metric for the object.

In this more general case, you would only have something analogous to a two-sided test. The p-value would be defined as the integral of the probability measure over the region of the object at least as far from the null hypothesis as your test statistic.

What am I missing?

Andrew: And if Wasserstein et al. had said tests are mistakenly thought to promise certainty, a big part of the argument for banning the concept of statistical significance would go by the board. For example, in Wasserstein et al. 2019, we hear of “the seductive certainty falsely promised by statistical significance.” This is to assert that the tests purport to give certainty, rather than embrace uncertainty. Or the McShane et al. article: “the false promise of certainty offered by dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’.” The allegation (of denying uncertainty) is baffling to any statistical significance test user, because every inference is qualified by an error statistical assessment. I and some others argue that these are gross misinterpretations, but they’re stated as what tests purport to give us.

It already becomes problematic with bimodal distributions. Suppose you have two typical situations in your experiment: values are near 0 ± 1 or values are near 10 ± 1, two little normal-distribution bumps. Now you get a data point of 5 and want to know if this indicates a violation of your usual observations. The right tail area is about 0.5, the left tail area is about 0.5, and the confidence interval for results is from about -2 to 12, but the probability of being anywhere between 3 and 7 is basically 0… so the p-value completely fails as a measure of anything.
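The numbers in this example are easy to check with scipy, taking the two bumps as an equal mixture of N(0, 1) and N(10, 1):

```python
from scipy.stats import norm

# Equal mixture of N(0, 1) and N(10, 1); observed value 5.
def mix_cdf(x):
    return 0.5 * norm.cdf(x, 0, 1) + 0.5 * norm.cdf(x, 10, 1)

right_tail = 1 - mix_cdf(5)        # about 0.5
left_tail = mix_cdf(5)             # about 0.5
middle = mix_cdf(7) - mix_cdf(3)   # about 0.0013: 5 is wildly atypical
print(right_tail, left_tail, middle)
```

Both one-sided tail areas come out near 0.5, i.e. “nothing unusual,” even though the observed value sits in a region of essentially zero probability.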

Indeed you are, and I absolutely agree with your point. Let me explain: I engaged here in a mode thinking I was discussing with frequentists and I had an idea to pursue the discussion until it possibly brought out some of the weak aspects of the concept of statistical significance, which I find interesting. But when both parties are NHST skeptics, as I realised is probably the case, that won’t work. So I’ll jump back in when there’s a thread dealing more specifically with that I guess. :-)

Also, as pointed out, I am familiar with the von Mises-Fisher distribution, since it was created for work in paleomagnetism (the study of ancient magnetic fields from magnetism frozen in rocks at formation), and I’ve worked in paleomagnetism for a bit.

As to your point about cutoffs: making a decision using a cutoff is in effect making an approximation beyond anything warranted by the evidence/assumptions. As such it can lead to inherent problems no matter who does it.

But two points: (1) A Bayesian understanding shows when the approximation is going to be a good one and hence acceptable.

(2) Bayesians can just avoid the approximation by looking at the entire posterior for the parameter and using it in whatever analysis follows.

What do p-value-nistas have?

Not always “entirely.” In many problems some regions are more natural than others. In this case, if the wrap-around cases can be distinguished from the non-wrapped cases then the distances are different. A person who sails around the world and back to the same dock is different from a person who sails around the harbor and back to the dock.

Justin, I was going to copy and paste your “what about Bayes Factors” response in for you, but you beat me to it.

They tried to generalize an entire system of statistics from simple cases where any reasonable statistical philosophy gives the same answers. That would be fine, but they based it on the wrong statistical philosophy (probabilities = frequencies), and got the generalization wrong.

Specifically,

(1) the tail area p-value only really works when it’s a monotonic function of the probability of the data. The probability of the data is fundamental. From a bayesian perspective, if the observed data sits low in the probability density of a model, then the model can easily be beat by even low initial probability challenger models. So it serves as a warning to find those better models. You don’t need p-values to do this. Gelman for example has a bunch of better and more general ways to check models.

(2) Confidence Intervals aren’t a range of plausible values like everyone thinks they are. The entire confidence interval in real problems can in fact be impossible values.

(3) When you do a frequentist significance test and accept the null, you’re in effect doing the following: the evidence/assumptions say there’s a range of plausible values for a parameter, but you’re going to reduce that range to a single point (the null). This can be fine if the original range of plausible values was narrowly concentrated around the null, but in the vast majority of cases of significance testing in the wild, the range of plausible values is large. So replacing it with a single point leads to massive errors.

Note all of this is true even if all the other problems (definitions not being taught right, frequency histograms approximating a probability distribution upon infinite repeated trials, assumptions being wrong, and so on) aren’t there. Even accepting the best case frequentist scenario, it’s a disaster.
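Point (2) can be seen in a toy sketch (all numbers below are made up for illustration): a quantity that is physically constrained to be nonnegative, measured with a noisy and poorly calibrated instrument, can yield a textbook mean ± 1.96·SE interval that lies entirely on impossible values.

```python
import math

# Hypothetical readings of a quantity whose true value must be >= 0
# (e.g., a concentration), from a noisy, badly calibrated instrument.
readings = [-3.1, -2.4, -2.8, -3.5, -2.9]

n = len(readings)
mean = sum(readings) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in readings) / (n - 1))
se = sd / math.sqrt(n)

# The standard 95% interval: here every value in it is impossible (negative).
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```

Nothing in the confidence-interval procedure forces the interval to intersect the physically possible set, whereas a posterior interval under a prior restricted to nonnegative values cannot make this mistake.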

Fisher, Neyman, and Pearson just got it wrong! It’s just wrong. It doesn’t mean you can never use one of their methods. It doesn’t mean every paper that uses their methods reached a wrong conclusion. But it does mean the whole edifice is fundamentally flawed.

But if you’re not willing to accept that, can we at least agree to stop making the absurd claim that the reason frequentist statistics led a big chunk of modern science into disaster is that p-values are just sooooooooooo much harder to teach right than any other concept?

]]>HP:

Of course I’m the right person to ask. Indeed, I published a paper a few years ago, P-values and statistical practice, which directly addresses the question.

The statement, “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer,” is wrong. That doesn’t mean that p-values or statistical significance give no information. But it’s a mistake to use them to decide whether an effect exists or not.

]]>Thank you! I appreciate your points.

]]>Thanks! Interesting stuff.

]]>Holy cow, I just realised whose blog this is (I came straight here from a Twitter link)!

So I guess you’re not the right person to ask for reflections on the usefulness of p values… :-)

]]>“What if your life depended on the outcome and one choice said “significant” and the other said “not significant”?”

This can be applied to Bayes factors, posterior probabilities, any statistic you have a strict cutoff on… so it is not a good criticism against p-values IMO.

Would I “bet my life” on any single outcome, no matter how small the p-value? Probably not. But what if 500 studies showed “significant”? Well, that could be a different story (assuming sound experiments and no QRPs)

“P-values are fun to think about. Suppose you have a test statistic T defined on the unit circle. The null hypothesis is that the parameter mu is at the top of the circle (the point at angle pi/2). Under this null hypothesis, the distribution of the test statistic is maximal at the top, decreases as you go further down, and is smallest at the bottom (angle -pi/2).

Now the value of the test statistic T(data) is the point at -pi/4. Given this, how do you compute the p-value?

Do you include just the region from -pi/4 to -pi/2? or do you include -pi/4 to -3pi/4?”

What possible values can T take on?

I know there are fields of wrapped and circular (and spherical, etc.) distributions, for example by Fisher and von Mises. I don’t know much about them, however.

Justin
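The quoted puzzle can be made numerically concrete. A rough sketch (the von Mises form and κ = 2 are arbitrary illustrative choices, not from the thread): with a density peaked at the top of the circle and the statistic observed at -π/4, the two candidate tail regions give different p-values.

```python
import math

kappa = 2.0            # concentration parameter, an arbitrary choice
mu = math.pi / 2       # null hypothesis: mode at the top of the circle

def f(theta):
    # un-normalized von Mises density peaked at mu
    return math.exp(kappa * math.cos(theta - mu))

def integrate(a, b, n=20000):
    # midpoint Riemann sum of f over [a, b]
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

Z = integrate(-math.pi, math.pi)   # normalizing constant
t = -math.pi / 4                   # observed value of the test statistic

# Region 1: from the observed point down one side to the bottom (-pi/2).
p_one_side = integrate(-math.pi / 2, t) / Z
# Region 2: from -3pi/4 through the bottom up to the observed point.
p_through_bottom = integrate(-3 * math.pi / 4, t) / Z

print(p_one_side, p_through_bottom)  # two "natural" regions, two answers
```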

]]>Yes, but that can be sorta avoided because they’re so far apart (like Pakistan and Bangladesh becoming separate countries). Once you get more topologically complicated spaces, like a figure 8 or a torus, it’s harder to sweep the issue under the rug.

The issue is that p-values are only serviceable in some simpler problems because there’s a monotonic relationship between the probability of the data and the tail area p-value. So using the p-value is in effect equivalent to using the probability of the data. If you continue trying to use the p-value beyond those simple cases, you start running into absurdities.
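For the familiar case of a standard normal test statistic, that monotonic relationship is easy to verify numerically (a generic sketch, not tied to any particular study): as the density at the observed value falls, the two-sided tail area falls with it.

```python
import math

def density(t):
    # standard normal density at t
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def two_sided_p(t):
    # two-sided tail-area p-value for an observed statistic t
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

ts = [0.0, 0.5, 1.0, 2.0, 3.0]
densities = [density(t) for t in ts]
pvals = [two_sided_p(t) for t in ts]
print(densities)
print(pvals)  # both sequences decrease together
```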

]]>As soon as you wrap around a circle it becomes no-sided. Integrate far enough in any direction and you’ll get back where you started.

You can define a region, but it’s entirely arbitrary.

]]>Because 1) and 2) mostly talk about single studies, I’d add to that list:

3) meta-analysis, or looking at p-values from similar repeated studies, the ‘whole’ of the evidence

Justin
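One standard way to do 3) is Fisher’s combined-probability method: under the null, -2·Σ log(pᵢ) follows a chi-squared distribution with 2k degrees of freedom. A minimal sketch with made-up p-values (the closed-form survival function below is valid because the degrees of freedom are even):

```python
import math

# Hypothetical p-values from k similar, independent studies.
pvals = [0.08, 0.12, 0.05, 0.20]
k = len(pvals)

# Fisher's statistic: -2 * sum of log p-values ~ chi-squared with 2k df.
stat = -2 * sum(math.log(p) for p in pvals)

# Chi-squared survival function for even df = 2k has a closed form:
# P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
x = stat / 2
combined_p = math.exp(-x) * sum(x ** j / math.factorial(j) for j in range(k))
print(combined_p)  # smaller here than any individual p-value
```

Note how four individually unremarkable p-values combine into a much smaller one, which is the point about the ‘whole’ of the evidence.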

]]>There are two good uses for p-values that I’ve been able to figure out in applied studies.

1) To decide whether a particular data point should be studied as unusual compared to a large database of past “usual” events. For example, if a seismometer records low-level vibrations day in and day out, and suddenly something occurs whose vibratory magnitude is very high relative to the past database (say p=0.0028).

2) When you have a theoretically derived model and want to show that things that happened in the past are compatible with your model, so that p=0.37, for example, is taken to show that your model can’t be clearly falsified by this data.

that’s it.
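Use 1) is essentially an empirical tail-area calculation against the historical database (everything below is simulated purely for illustration):

```python
import random

random.seed(0)
# Simulated database of 5000 past daily vibration magnitudes ("usual" days).
past = [random.gauss(1.0, 0.2) for _ in range(5000)]

new_reading = 2.5  # today's unusually large magnitude

# Empirical tail-area p-value: fraction of past days at least this extreme.
# The +1 in numerator and denominator keeps the estimate away from exactly 0.
extreme = sum(1 for x in past if x >= new_reading)
p = (extreme + 1) / (len(past) + 1)
print(p)  # very small: today is worth investigating
```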

]]>Isn’t this just the question of a one-sided versus a two-sided test?

]]>I do not make a general claim about that statement. I am talking about it in the context of her talk, and I even draw up a scenario where it would be correct to isolate the p-value as *the* defining factor for decision-making. And a key factor in that scenario: a “biologically vetted, causal factor”. That is, a very sound alternative to the null hypothesis, which all good studies have. So ESP studies aren’t really relevant to the point I am making.

Of course, unsound studies with a very tentative H1 sometimes end up with small p-value calculations. But they should first and foremost be criticised for being bad studies. We do not know what studies Rogers is talking about. What if they are really good ones?

So, again, if one chooses to understand Rogers as saying that “we should ALWAYS start our decision-making by ONLY looking at the p-value” or if one would claim that this is the message that the audience hears, I guess I just don’t hear or see it.

By the way, could I ask you, or anyone who feels Rogers’ statement is very problematic: could you briefly frame how you want the p-value of a good study to be *used*? I do not mean what it *is*, or one of the many examples of what it should *not* be used for.

]]>Sure. But I still think it’s absolutely fine to say it that way in a presentation to the general public.

]]>HP:

I disagree with your claim that the statement, “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer,” is right. This statement is wrong. By wrong, I don’t just mean it’s “not technically correct,” nor do I mean that it would be correct if you remove the word “definitely.” I actually mean that it’s a wrong statement and that it is misleading. Indeed, it’s a misconception that causes big problems in applied statistics.

You ask, “What is the p-value supposed to mean in the context of a *study*? Well, it means what she said.” No, it doesn’t. Yes, researchers and reporters often *act* as if statistical significance tells you whether an effect is real or not, but the problem is that in real life it doesn’t. In real life it often happens that statistically significant differences are found that do not replicate. I agree with you that statistics is an applied science. That’s why we get concerned when numbers are used to make strong and incorrect conclusions. Bem’s ESP study is one of zillions. That’s why we’ve been talking about the replication crisis. See for example this paper by Kate Button et al. from 2013 for one of many discussions of the topic.

P.S. Again, as Keith wrote, these sorts of errors are easy to make. Right at the beginning of one of my published papers, the p-value is defined as “the probability that a perceived result is actually the result of random variation.” That’s wrong. It’s not just “not technically correct,” it’s wrong. Statistics is hard, it’s easy to make wrong and misleading statements, and it’s good for us to correct these errors.

]]>HP

Please see the P.S.2 I added above.

]]>Deborah:

Wasserstein et al. do not promote the view that “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer.”

Wasserstein et al. promote the view that statistical significance is wrongly *believed* by researchers to tell them whether or not etc.

Deborah:

You write, “You would think the National Academies of Sciences, writing a guidebook on replication and staffed with leading statisticians, would define P-values correctly. They don’t.”

It is not a surprise that the National Academies of Sciences gets things wrong. Don’t forget, they publish the journal PNAS, which is notorious for publishing junk science such as ages-ending-in-9, himmicanes, etc. I’m sure the National Academies of Sciences does lots of good things, but a lot of what they do is to reify the eminence of their members, so if their members make mistakes, that’s gonna be a problem. They have an intellectual conflict of interest. I’m not talking about $ here, I’m talking reputation.

]]>